Instead of using the pre-built containers, you can run evaluations with your own Python virtual environment by passing `--venv-path`.
- Create a venv with Python 3.12:

  ```shell
  uv venv --python 3.12 /path/to/.venv
  ```

- Install the lm-eval and lmms-eval dependencies:

  ```shell
  uv pip install --python /path/to/.venv/bin/python -r requirements-venv.txt
  ```

  This installs `lm-eval`, `torch`, `transformers`, `accelerate`, `datasets<4.0.0`, and `lmms-eval`.

- Install lighteval as an isolated tool (avoids the `datasets` version conflict):

  ```shell
  UV_TOOL_DIR=/path/to/.uv-tools UV_TOOL_BIN_DIR=/path/to/.venv/bin \
    uv tool install --python 3.12 \
    --with "langcodes[data]" --with "pillow" \
    "lighteval[multilingual] @ git+https://github.com/huggingface/lighteval.git"
  ```
```shell
# Text evaluation
oellm schedule-eval \
  --models HuggingFaceTB/SmolLM2-135M-Instruct \
  --task-groups open-sci-0.01 \
  --venv-path /path/to/.venv

# Image evaluation (lmms-eval)
oellm schedule-eval \
  --models path/to/vlm \
  --task-groups image-vqa \
  --venv-path /path/to/.venv
```

`lm-eval` requires `datasets<4.0.0` while `lighteval` requires `datasets>=4.0.0`. Installing lighteval as an isolated uv tool (as the containers do) avoids this conflict. `lmms-eval` is compatible with `datasets<4.0.0` and can be installed alongside `lm-eval` in the same venv.
| Package | Install method | Reason |
|---|---|---|
| `lm-eval`, `torch`, `transformers`, `accelerate`, `datasets<4.0.0`, `lmms-eval` | `uv pip install -r requirements-venv.txt` | lm-eval plus image eval; compatible `datasets` pin |
| `lighteval[multilingual]` | `uv tool install` (isolated) | Requires `datasets>=4.0.0`; must be isolated |
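To see concretely why the two pins cannot coexist, here is a small stdlib-only Python sketch; the candidate version list is purely illustrative:

```python
# Illustrative sketch: the `datasets` pins of lm-eval and lighteval are
# disjoint, so no single installed version can satisfy both.

def parse(v: str) -> tuple[int, ...]:
    """Turn a version string like '4.0.0' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

BOUND = parse("4.0.0")

def lm_eval_ok(v: str) -> bool:
    """lm-eval / lmms-eval pin: datasets<4.0.0"""
    return parse(v) < BOUND

def lighteval_ok(v: str) -> bool:
    """lighteval pin: datasets>=4.0.0"""
    return parse(v) >= BOUND

candidates = ["2.19.0", "3.6.0", "4.0.0", "4.1.1"]  # example versions
satisfies_both = [v for v in candidates if lm_eval_ok(v) and lighteval_ok(v)]
print(satisfies_both)  # [] -- empty, hence the isolated tool install
```

The intersection is always empty, which is exactly why lighteval must live in its own isolated environment rather than in the shared venv.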
`dclm-core-22` needs `lm-eval==0.4.9.2` (v0.4.10+ breaks `agieval_lsat_ar` in few-shot settings). Use `requirements-venv-dclm.txt` instead of the default requirements file:
```shell
uv venv --python 3.12 dclm-core-venv
uv pip install --python dclm-core-venv/bin/python -r requirements-venv-dclm.txt

oellm schedule-eval \
  --models Qwen/Qwen3-0.6B-Base \
  --task-groups dclm-core-22 \
  --venv-path dclm-core-venv \
  --skip-checks true
```

The reasoning task group includes six benchmarks: GSM8k, IFEval, and MBPP run via lm-eval-harness, while GPQADiamond, MATH500, and LiveCodeBench run via evalchemy.
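The backend split can be sketched as a simple mapping; the task identifiers below are illustrative labels, not necessarily the exact names oellm uses internally:

```python
# Hypothetical sketch of the reasoning task group's backend split;
# task identifiers are illustrative, not oellm's actual names.
REASONING_TASKS = {
    "gsm8k": "lm-eval-harness",
    "ifeval": "lm-eval-harness",
    "mbpp": "lm-eval-harness",
    "gpqa_diamond": "evalchemy",
    "math500": "evalchemy",
    "livecodebench": "evalchemy",
}

# Group tasks by the harness that runs them.
by_backend: dict[str, list[str]] = {}
for task, backend in REASONING_TASKS.items():
    by_backend.setdefault(backend, []).append(task)

print(sorted(by_backend))  # ['evalchemy', 'lm-eval-harness']
```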
Note: the evalchemy versions of GPQA and MATH500 differ from the lm-eval-harness versions: evalchemy uses free-form generation with CoT reasoning instead of log-likelihood scoring.
We use Ali's fork, which adds a fix that randomizes GPQA answer ordering to eliminate positional bias, along with context-window safety fixes; the PR has not yet been merged upstream.
- Clone the repo at the pinned commit:

  ```shell
  git clone https://github.com/Ali-Elganzory/evalchemy.git evalchemy
  cd evalchemy && git checkout 54ac97648230c4c3a22c3a2b93068b5a4e573f8d && cd ..
  ```
- Create a venv and install dependencies:

  ```shell
  uv venv --python 3.12 evalchemy-venv
  uv pip install --python evalchemy-venv/bin/python -r requirements-venv-evalchemy.txt
  ```
- Run with `EVALCHEMY_DIR` pointing to the cloned repo:

  ```shell
  export HF_ALLOW_CODE_EVAL=1  # required by MBPP
  EVALCHEMY_DIR=$(pwd)/evalchemy oellm schedule-eval \
    --models HuggingFaceTB/SmolLM2-135M \
    --task-groups reasoning \
    --venv-path evalchemy-venv \
    --skip-checks true
  ```
Note: `HF_ALLOW_CODE_EVAL=1` is required because MBPP (run via lm-eval-harness) uses HuggingFace's `code_eval` metric, which executes model-generated code. The evalchemy benchmarks (GPQADiamond, MATH500, LiveCodeBench) do not require this variable, as they handle code execution safely through internal guards.