Instead of using the pre-built containers, you can run evaluations with your own Python virtual environment by passing `--venv-path`.
- Create a venv with Python 3.12:

  ```shell
  uv venv --python 3.12 /path/to/.venv
  ```

- Install the lm-eval and lmms-eval dependencies:

  ```shell
  uv pip install --python /path/to/.venv/bin/python -r requirements-venv.txt
  ```

  This installs `lm-eval`, `torch`, `transformers`, `accelerate`, `datasets<4.0.0`, and `lmms-eval`.

- Install lighteval as an isolated tool (avoids the `datasets` version conflict):

  ```shell
  UV_TOOL_DIR=/path/to/.uv-tools UV_TOOL_BIN_DIR=/path/to/.venv/bin \
    uv tool install --python 3.12 \
    --with "langcodes[data]" --with "pillow" \
    "lighteval[multilingual] @ git+https://github.com/huggingface/lighteval.git"
  ```
```shell
# Text evaluation
oellm schedule-eval \
  --models HuggingFaceTB/SmolLM2-135M-Instruct \
  --task-groups open-sci-0.01 \
  --venv-path /path/to/.venv

# Image evaluation (lmms-eval)
oellm schedule-eval \
  --models path/to/vlm \
  --task-groups image-vqa \
  --venv-path /path/to/.venv
```

`lm-eval` requires `datasets<4.0.0` while `lighteval` requires `datasets>=4.0.0`. Installing lighteval as an isolated uv tool (as the containers do) avoids this conflict. `lmms-eval` is compatible with `datasets<4.0.0` and can be installed alongside `lm-eval` in the same venv.
| Package | Install method | Reason |
|---|---|---|
| `lm-eval`, `torch`, `transformers`, `accelerate`, `datasets<4.0.0`, `lmms-eval` | `uv pip install -r requirements-venv.txt` | lm-eval plus image eval; compatible `datasets` pin |
| `lighteval[multilingual]` | `uv tool install` (isolated) | Requires `datasets>=4.0.0`; must be isolated |
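To see concretely why the two pins cannot coexist, here is a small stdlib-only Python sketch; the candidate version list is purely illustrative:

```python
# Illustrative sketch: the `datasets` pins of lm-eval and lighteval are
# disjoint, so no single installed version can satisfy both.

def parse(v: str) -> tuple[int, ...]:
    """Turn a version string like '4.0.0' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

BOUND = parse("4.0.0")

def lm_eval_ok(v: str) -> bool:
    """lm-eval / lmms-eval pin: datasets<4.0.0"""
    return parse(v) < BOUND

def lighteval_ok(v: str) -> bool:
    """lighteval pin: datasets>=4.0.0"""
    return parse(v) >= BOUND

candidates = ["2.19.0", "3.6.0", "4.0.0", "4.1.1"]  # example versions
satisfies_both = [v for v in candidates if lm_eval_ok(v) and lighteval_ok(v)]
print(satisfies_both)  # [] -- empty, hence the isolated tool install
```

The intersection is always empty, which is exactly why lighteval must live in its own isolated environment rather than in the shared venv.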
`dclm-core-22` needs `lm-eval==0.4.9.2` (v0.4.10+ breaks `agieval_lsat_ar` in few-shot settings). Use `requirements-venv-dclm.txt` instead of the default requirements file:
```shell
uv venv --python 3.12 dclm-core-venv
uv pip install --python dclm-core-venv/bin/python -r requirements-venv-dclm.txt

oellm schedule-eval \
  --models Qwen/Qwen3-0.6B-Base \
  --task-groups dclm-core-22 \
  --venv-path dclm-core-venv \
  --skip-checks true
```

The reasoning task group includes six benchmarks: GSM8k, IFEval, and MBPP run via lm-eval-harness, while GPQADiamond, MATH500, and LiveCodeBench run via evalchemy.
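The backend split can be sketched as a simple mapping; the task identifiers below are illustrative labels, not necessarily the exact names oellm uses internally:

```python
# Hypothetical sketch of the reasoning task group's backend split;
# task identifiers are illustrative, not oellm's actual names.
REASONING_TASKS = {
    "gsm8k": "lm-eval-harness",
    "ifeval": "lm-eval-harness",
    "mbpp": "lm-eval-harness",
    "gpqa_diamond": "evalchemy",
    "math500": "evalchemy",
    "livecodebench": "evalchemy",
}

# Group tasks by the harness that runs them.
by_backend: dict[str, list[str]] = {}
for task, backend in REASONING_TASKS.items():
    by_backend.setdefault(backend, []).append(task)

print(sorted(by_backend))  # ['evalchemy', 'lm-eval-harness']
```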
Note: the evalchemy versions of GPQA and MATH500 differ from the lm-eval-harness versions: evalchemy uses free-form generation with CoT reasoning instead of log-likelihood scoring.
We use Ali's fork, which adds a fix that randomizes GPQA answer ordering to eliminate positional bias, along with context-window safety fixes; the PR has not yet been merged upstream.
- Clone the repo at the pinned commit:

  ```shell
  git clone https://github.com/Ali-Elganzory/evalchemy.git evalchemy
  cd evalchemy && git checkout 54ac97648230c4c3a22c3a2b93068b5a4e573f8d && cd ..
  ```
- Create a venv and install dependencies:

  ```shell
  uv venv --python 3.12 evalchemy-venv
  uv pip install --python evalchemy-venv/bin/python -r requirements-venv-evalchemy.txt
  ```
- Run with `EVALCHEMY_DIR` pointing to the cloned repo:

  ```shell
  export HF_ALLOW_CODE_EVAL=1  # required by MBPP
  EVALCHEMY_DIR=$(pwd)/evalchemy oellm schedule-eval \
    --models HuggingFaceTB/SmolLM2-135M \
    --task-groups reasoning \
    --venv-path evalchemy-venv \
    --skip-checks true
  ```
Note: `HF_ALLOW_CODE_EVAL=1` is required because MBPP (run via lm-eval-harness) uses HuggingFace's `code_eval` metric, which executes model-generated code. The evalchemy benchmarks (GPQADiamond, MATH500, LiveCodeBench) do not require this variable, as they handle code execution safely through internal guards.