The doc page https://huggingface.co/docs/diffusers/conceptual/evaluation currently suggests using HEIM for diffusion model evaluation.
Unfortunately, HEIM is no longer officially supported or actively maintained (see stanford-crfm/helm#2161, stanford-crfm/helm#3741) and currently does not work out of the box.
In my experience, getting any version of HEIM to run at all is painful (it took major manual rewrites and resolving many dependency conflicts), and even then I had trouble reproducing the standard leaderboards.
I would therefore recommend removing HEIM from the list of suggested evaluation frameworks.