CoEval supports two modes for supplying evaluation datapoints:
| Phase | Synthetic (generative) mode | Benchmark-sourced mode |
|---|---|---|
| Phase 1 | teachers → attributes | SKIP (static from `attribute_map.yaml`) |
| Phase 2 | teachers → rubric | SKIP (static from task YAML) |
| Phase 3 | teachers → datapoints | SKIP (pre-emitted from benchmark) |
| Phase 4 | students → responses | students → responses (identical) |
| Phase 5 | judges → scores | judges → scores (identical) |
| Extra | – | metric computation → `benchmark_native_score` |
| Extra | – | `paper_tables.py` → Spearman ρ tables |
`interface: benchmark` is a special virtual interface that replays pre-ingested responses from real datasets. No LLM API calls are made for models using this interface. A benchmark teacher's responses are loaded from pre-ingested JSONL files created by `coeval ingest` or the setup scripts.
This approach decouples the benchmark creation step from the model evaluation step. Once you have a published benchmark package, anyone can run phases 4 and 5 against it to evaluate any model — without needing access to your original teacher models or API keys.
Benchmark teachers are completely skipped during Phase 3 — their (prompt, reference-response) pairs are loaded directly from pre-ingested JSONL files. Only Phase 4 (student response collection) and Phase 5 (judge evaluation) make LLM API calls. This makes benchmark-sourced experiments significantly cheaper than synthetic ones.
The two modes are fully compatible: Phases 4 and 5 behave identically regardless of how the Phase 3 datapoints were sourced.
| Dataset | Task | Split | Items | Ground-Truth Metric |
|---|---|---|---|---|
| `xsum` | `text_summarization` | validation | 11,332 | BERTScore-F1 vs. gold summary |
| `codesearchnet` | `code_explanation` | validation | ~10K (Python) | BLEU-4 vs. reference docstring |
| `aeslc` | `email_composition` | validation | ~1K | BERTScore-F1 vs. reference email |
| `wikitablequestions` | `data_interpretation` | validation | 2,831 | Exact-match accuracy |
| `arc-challenge` | `science_reasoning` | test | 1,172 | Exact-match accuracy |
| `race` | `reading_comprehension` | test | ~4.9K (high-school) | Exact-match accuracy |
| `sciq` | `science_qa` | test | 1,000 | Exact-match accuracy |
| `math` | `math_problem_solving` | test | 5,000 | Exact-match on extracted answer |
| `mbpp` | `code_generation` | test | 374 | BLEU-4 vs. canonical solution |
| `bigbench_hard` | `reasoning_and_logic` | train | ~6,700 (27 sub-tasks) | Exact-match accuracy |
| `logiqa` | `logical_reasoning` | test | ~2,600 | Exact-match accuracy |
| `winogrande` | `commonsense_reasoning` | validation | ~1,300 | Exact-match accuracy |
| `multinli` | `natural_language_inference` | validation_matched | ~9,800 | Exact-match accuracy |
| `copa` | `causal_reasoning` | validation | 100 | Exact-match accuracy |
| `cosmos_qa` | `narrative_reasoning` | validation | ~2,900 | Exact-match accuracy |
| `bbq` | `bias_evaluation_qa` | test | ~58,000 | Exact-match accuracy |
| `trivia_qa` | `knowledge_retrieval` | validation | ~11,300 | Exact-match accuracy |
| `squad_v2` | `reading_comprehension_qa` | validation | ~11,900 | Exact-match / F1 |
| `nq_open` | `open_domain_qa` | validation | ~3,600 | Exact-match accuracy |
| `narrativeqa` | `narrative_qa` | test | ~10,600 | BLEU-4 (free-form) |
| `cnn_dailymail` | `news_summarization` | test | ~11,500 | BERTScore-F1 |
| `samsum` | `dialogue_summarization` | test | 819 | BERTScore-F1 |
| `fever` | `fact_verification` | labelled_dev | ~19,000 | Exact-match accuracy |
| `scifact` | `scientific_claim_verification` | train | 809 | Exact-match accuracy |
| `mgsm` | `multilingual_math` | test | 250 (en) | Exact-match accuracy |
| `mathqa` | `math_word_problems_mc` | test | ~2,900 | Exact-match accuracy |
Two ingestion systems: The 26 datasets above have loaders in `Public/benchmark/loaders/` and are set up with `python -m benchmark.setup_mixed` (xsum, codesearchnet, aeslc, wikitablequestions), `python -m benchmark.setup_education` (arc-challenge, race, sciq), or `python -m benchmark.emit_datapoints --dataset <name>`. A separate set of built-in CLI adapters (`Code/runner/benchmarks/registry.py`) is available via `coeval ingest`: `mmlu`, `hellaswag`, `truthfulqa`, `humaneval`, `medqa`, `gsm8k`.
Loader files:
| Loader file | Dataset | Domain |
|---|---|---|
| `Public/benchmark/loaders/xsum.py` | XSum | Text summarization |
| `Public/benchmark/loaders/codesearchnet.py` | CodeSearchNet | Code explanation |
| `Public/benchmark/loaders/aeslc.py` | AESLC | Email composition |
| `Public/benchmark/loaders/wikitablequestions.py` | WikiTableQuestions | Data interpretation |
| `Public/benchmark/loaders/arc_challenge.py` | ARC-Challenge | Science reasoning (MCQ) |
| `Public/benchmark/loaders/race.py` | RACE | Reading comprehension (MCQ) |
| `Public/benchmark/loaders/sciq.py` | SciQ | Science questions (MCQ) |
| `Public/benchmark/loaders/math_dataset.py` | MATH (Hendrycks et al.) | Competition mathematics |
| `Public/benchmark/loaders/mbpp.py` | MBPP | Python code generation |
| `Public/benchmark/loaders/bigbench_hard.py` | BIG-Bench Hard (27 sub-tasks) | Reasoning & logic |
| `Public/benchmark/loaders/logiqa.py` | LogiQA | Logical reasoning (MCQ) |
| `Public/benchmark/loaders/winogrande.py` | WinoGrande | Commonsense reasoning |
| `Public/benchmark/loaders/multinli.py` | MultiNLI | Natural language inference |
| `Public/benchmark/loaders/copa.py` | COPA | Causal commonsense reasoning |
| `Public/benchmark/loaders/cosmos_qa.py` | CosmosQA | Narrative commonsense comprehension |
| `Public/benchmark/loaders/bbq.py` | BBQ | Bias-aware QA |
| `Public/benchmark/loaders/trivia_qa.py` | TriviaQA | Knowledge retrieval |
| `Public/benchmark/loaders/squad_v2.py` | SQuAD 2.0 | Extractive + unanswerable RC |
| `Public/benchmark/loaders/nq_open.py` | Natural Questions (open) | Open-domain QA |
| `Public/benchmark/loaders/narrativeqa.py` | NarrativeQA | Long-story comprehension |
| `Public/benchmark/loaders/cnn_dailymail.py` | CNN/DailyMail | News summarization |
| `Public/benchmark/loaders/samsum.py` | SAMSum | Dialogue summarization |
| `Public/benchmark/loaders/fever.py` | FEVER | Fact verification |
| `Public/benchmark/loaders/scifact.py` | SciFact | Scientific claim verification |
| `Public/benchmark/loaders/mgsm.py` | MGSM (en) | Multilingual math (English) |
| `Public/benchmark/loaders/mathqa.py` | MathQA | Math word problems (MCQ) |
This section catalogues widely-used public benchmarks by domain. CoEval can supply these as virtual teachers via interface: benchmark once ingested with coeval ingest. The ✅ / ❌ column reflects current support status.
| Benchmark | Task Type | Size | Primary Metric | CoEval | Link |
|---|---|---|---|---|---|
| MMLU | 57-subject MCQ (STEM, humanities, social science) | 14K | Accuracy | ✅ `mmlu` | arxiv |
| HumanEval | Python function completion from docstring | 164 | Pass@1 | ✅ `humaneval` | arxiv |
| GSM8K | Grade-school math word problems | 8.5K | Accuracy | ✅ `gsm8k` | arxiv |
| HellaSwag | Sentence completion / commonsense NLI | 70K | Accuracy | ✅ `hellaswag` | arxiv |
| TruthfulQA | Truthfulness — MC + generation | 817 | Accuracy / BLEURT | ✅ `truthfulqa` | arxiv |
| BIG-Bench Hard | 23 challenging multi-step reasoning tasks | ~6K | Accuracy / EM | ✅ `bigbench_hard` | arxiv |
| MATH | Competition math (algebra → calculus) | 12.5K | Accuracy (symbolic) | ✅ `math` | arxiv |
| ARC-Challenge | Science MCQ — hard subset | 1.17K | Accuracy | ✅ `arc-challenge` | arxiv |
| GPQA | Graduate-level science MCQ (Diamond set) | 448 | Accuracy | ❌ planned | arxiv |
| MBPP | Python beginner coding problems | 374 | Pass@1 | ✅ `mbpp` | arxiv |
Reading the table: ✅ `name` = loader in `Public/benchmark/loaders/`; use `python -m benchmark.emit_datapoints --dataset name`. ✅ `coeval ingest` = CLI adapter in `Code/runner/benchmarks/registry.py`. ❌ planned = loader not yet implemented; custom JSONL ingestion via `coeval ingest --dataset` is still possible.
| Benchmark | Description | CoEval |
|---|---|---|
| BIG-Bench Hard | 23 tasks requiring chain-of-thought reasoning | ✅ bigbench_hard |
| ARC-Challenge | AI2 science reasoning — harder subset | ✅ arc-challenge |
| LogiQA | Reading comprehension with formal logic | ✅ logiqa |
| ReClor | Logical reasoning from law-school exams | ❌ restricted |
| AGIEval | Human-centric reasoning: SAT, GRE, LSAT, etc. | ❌ |
| Benchmark | Description | CoEval |
|---|---|---|
| GSM8K | 8.5K grade-school word problems | ✅ gsm8k |
| MATH | 12.5K competition problems (Hendrycks et al.) | ✅ math |
| MGSM | Multilingual grade-school math (11 languages) | ✅ mgsm |
| MathQA | Multiple-choice math word problems | ✅ mathqa |
| NumGLUE | Numerical reasoning across 8 NLU tasks | ❌ |
| Benchmark | Description | CoEval |
|---|---|---|
| HumanEval | 164 Python function completions | ✅ humaneval |
| MBPP | 374 Python beginner problems | ✅ mbpp |
| CodeSearchNet | Code-to-docstring generation (Python) | ✅ pre-ingested |
| SWE-bench | Real GitHub issue resolution | ❌ |
| DS-1000 | 1,000 data-science code completions | ❌ |
| Benchmark | Description | CoEval |
|---|---|---|
| HellaSwag | 70K sentence-completion / commonsense NLI items | ✅ hellaswag |
| WinoGrande | Large-scale Winograd schema challenge | ✅ winogrande |
| SuperGLUE | 8-task NLU suite (NLI, QA, coreference) | ❌ |
| MultiNLI | Multi-genre natural language inference | ✅ multinli |
| COPA | Cause-and-effect commonsense reasoning | ✅ copa |
| Benchmark | Description | CoEval |
|---|---|---|
| TriviaQA | 95K trivia questions with evidence passages | ✅ trivia_qa |
| Natural Questions | 307K Google search queries + Wikipedia answers | ✅ nq_open |
| WikiTableQuestions | Table-grounded free-form QA | ✅ pre-ingested |
| SQuAD 2.0 | Extractive + unanswerable span QA | ✅ squad_v2 |
| WebQuestions | 3K Freebase entity questions | ❌ |
| Benchmark | Description | CoEval |
|---|---|---|
| RACE | 100K MCQ from English exams (middle + high school) | ✅ race |
| SciQ | 14K elementary/middle school science MCQ | ✅ sciq |
| QuALITY | Long-document multiple-choice QA | ❌ |
| CosmosQA | Narrative-based commonsense comprehension | ✅ cosmos_qa |
| NarrativeQA | Full-story comprehension with free-form answers | ✅ narrativeqa |
| Benchmark | Description | CoEval |
|---|---|---|
| MMLU | 57-subject academic knowledge across domains | ✅ mmlu |
| MedQA (USMLE) | USMLE medical licensing exam questions | ✅ medqa |
| KOLA | Knowledge-oriented LLM assessment | ❌ |
| FActScore | Factual precision metric for open-ended generation | ❌ |
| FEVER | Fact verification against Wikipedia | ✅ fever |
| Benchmark | Description | CoEval |
|---|---|---|
| XSum | 11K BBC articles → 1-sentence summaries | ✅ pre-ingested |
| AESLC | Email subject-line composition | ✅ pre-ingested |
| CNN/DailyMail | News summarization with bullet highlights | ✅ cnn_dailymail |
| SAMSum | Dialogue summarization | ✅ samsum |
| MeetingBank | Meeting transcript summarization | ❌ proprietary |
| Benchmark | Description | CoEval |
|---|---|---|
| TruthfulQA | 817 questions probing hallucination tendencies | ✅ truthfulqa |
| BOLD | Bias in open-ended language generation | ❌ |
| BBQ | Bias benchmark for question answering | ✅ bbq |
| WinoBias | Gender bias in coreference resolution | ❌ |
| RealToxicityPrompts | Toxicity measurement in generation | ❌ |
| Benchmark | Description | CoEval |
|---|---|---|
| MedQA (USMLE) | 4-option MCQ from USMLE Steps 1–3 | ✅ medqa |
| SciQ | Science MCQ with supporting passages | ✅ sciq |
| ARC-Challenge | AI2 science exam MCQ (harder 1.17K items) | ✅ arc-challenge |
| GPQA | PhD-level science MCQ (chemistry, biology, physics) | ❌ planned |
| SciFact | Scientific claim verification against abstracts | ✅ scifact |
| Status | Datasets |
|---|---|
| ✅ Pre-ingested (loaders in `Public/benchmark/loaders/`) | XSum, CodeSearchNet, AESLC, WikiTableQuestions, ARC-Challenge, RACE, SciQ, MATH, MBPP, BIG-Bench Hard, LogiQA, WinoGrande, MultiNLI, COPA, CosmosQA, BBQ, TriviaQA, SQuAD 2.0, NQ Open, NarrativeQA, CNN/DailyMail, SAMSum, FEVER, SciFact, MGSM, MathQA |
| ✅ Available via `coeval ingest` | MMLU, HellaSwag, TruthfulQA, HumanEval, MedQA, GSM8K |
| ❌ Planned (loader not yet implemented; custom JSONL ingestion possible) | GPQA |
| ❌ Not yet supported | All others — contribute a loader! See Writing a Loader |
The following benchmarks are on the roadmap. No loader exists yet, but you can still use them by converting to JSONL and running coeval ingest --dataset your_file.jsonl.
| Benchmark | Priority | Blocker | Notes |
|---|---|---|---|
| GPQA | Medium | Restricted access (Diamond set) | Requires NDA agreement from maintainers |
To add a loader, follow the Writing a Loader guide and submit a PR to Public/benchmark/loaders/.
```bash
# Run once before using Runs/mixed/mixed.yaml
python -m benchmark.setup_mixed
```

This ingests all four benchmarks (10 items each) into a new run folder.

```bash
python -m benchmark.setup_education
```

This ingests ARC-Challenge, RACE-High, and SciQ (30 items each) into a new run folder.

```bash
# All four benchmarks, 620 items each, into a new run folder
python -m benchmark.emit_datapoints \
    --run-id paper-eval-v1 \
    --sample-size 620

# Single dataset
python -m benchmark.emit_datapoints \
    --dataset xsum \
    --run-id paper-eval-v1 \
    --sample-size 620

# Custom output directory
python -m benchmark.emit_datapoints \
    --dataset codesearchnet \
    --out-dir ./benchmark/runs/my-run/phase3_datapoints \
    --sample-size 300 \
    --split test
```

Output files are written to:

```
benchmark/runs/{run-id}/phase3_datapoints/
    text_summarization.benchmark_xsum.datapoints.jsonl
    code_explanation.benchmark_codesearchnet.datapoints.jsonl
    email_composition.benchmark_aeslc.datapoints.jsonl
    data_interpretation.benchmark_wikitablequestions.datapoints.jsonl
```
Stratified sampling: By default, the loader applies stratified sampling across all attribute value combinations in the attribute map. This ensures the selected items cover the benchmark's difficulty/domain distribution rather than clustering at its mode.
```
Full benchmark (e.g., 11,332 XSum items)
    ↓ attribute inference (complexity × domain = 4 × 6 = 24 strata)
    ↓ stratified sample: ≈ 26 items per stratum
    ↓ 620 items selected with near-uniform stratum coverage
```
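For intuition, the sampling step amounts to bucketing items by their attribute-value combination and drawing roughly equal counts per bucket. A minimal sketch follows (an illustrative helper, not CoEval's actual implementation):

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_key, total, seed=42):
    """Draw ~total/N items from each of the N strata, topping up from
    the remaining pool when some strata are too small."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_key(item)].append(item)

    per_stratum = max(1, total // len(strata))
    selected, leftovers = [], []
    for bucket in strata.values():
        rng.shuffle(bucket)
        selected.extend(bucket[:per_stratum])
        leftovers.extend(bucket[per_stratum:])

    # Top up (or trim) to hit the requested total exactly.
    rng.shuffle(leftovers)
    selected.extend(leftovers[: max(0, total - len(selected))])
    return selected[:total]

# e.g. 620 XSum items over complexity × domain strata:
# sample = stratified_sample(xsum_items,
#                            lambda r: (r["complexity"], r["domain"]),
#                            total=620)
```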
The coeval ingest command converts an external JSONL dataset into Phase 3 format:
```bash
coeval ingest \
    --dataset path/to/my_benchmark.jsonl \
    --run-id my-experiment-v1 \
    --task text_summarization \
    --teacher-id my-benchmark
```

The input JSONL must have at minimum `prompt` and `reference_response` fields. Additional fields are preserved as passthrough metadata.
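For illustration, a minimal input file for `coeval ingest` could be produced like this (the `source_id` field is an arbitrary example of passthrough metadata):

```python
import json

rows = [
    {
        "prompt": "Summarise the following article in 1-3 sentences:\n\n<article text>",
        "reference_response": "<gold summary>",
        "source_id": "example-0001",  # extra field, preserved as passthrough metadata
    },
]

# One JSON object per line, i.e. JSONL
with open("my_benchmark.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```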
Each line in a Phase 3 datapoints file is a JSON object:
```json
{
  "id": "text_summarization__gpt-4o-mini__00001",
  "task_id": "text_summarization",
  "teacher_model_id": "gpt-4o-mini",
  "sampled_target_attributes": {
    "article_length": "medium",
    "domain": "technology"
  },
  "prompt": "Summarise the following article in 1–3 sentences:\n\nApple announced today that...",
  "reference_response": "Apple unveiled a new chip architecture that doubles inference throughput while reducing power draw by 30 percent.",
  "generated_at": "2025-03-01T14:23:11Z"
}
```

For benchmark-sourced records, additional fields are included:

```json
{
  "benchmark_id": "xsum",
  "benchmark_split": "validation",
  "benchmark_native_id": "29750436",
  "benchmark_native_score": null
}
```

`benchmark_native_score` is null at emit time and filled in by the metric computation step.
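A small reader like the following can be handy for spot-checking emitted files; it is an illustrative helper (not part of CoEval) that only enforces the core fields shown above:

```python
import json
from pathlib import Path

REQUIRED_FIELDS = {"id", "task_id", "teacher_model_id", "prompt", "reference_response"}

def load_datapoints(path):
    """Read a Phase 3 datapoints JSONL file and check the core fields."""
    records = []
    for line_no, line in enumerate(Path(path).read_text(encoding="utf-8").splitlines(), 1):
        if not line.strip():
            continue
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        records.append(record)
    return records

# datapoints = load_datapoints(
#     "benchmark/runs/paper-eval-v1/phase3_datapoints/"
#     "text_summarization.benchmark_xsum.datapoints.jsonl"
# )
```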
A loader is a Python module in `Public/benchmark/loaders/` that subclasses `BenchmarkLoader` and implements `_load_dataset()` and `_to_record()`:

```python
# Public/benchmark/loaders/my_dataset.py
from .base import BenchmarkLoader


class MyDatasetLoader(BenchmarkLoader):
    benchmark_id = "my_dataset"
    task_id = "my_task"
    default_split = "test"

    def _load_dataset(self, split: str, seed: int, sample_size: int):
        from datasets import load_dataset
        ds = load_dataset("author/my_dataset", split=split)
        return ds.shuffle(seed=seed).select(range(sample_size))

    def _to_record(self, raw, idx: int) -> dict:
        return {
            "prompt": f"Answer this question: {raw['question']}",
            "reference_response": raw["answer"],
            "sampled_target_attributes": self._infer_attributes(raw),
        }

    def _infer_attributes(self, raw: dict) -> dict:
        return {
            "difficulty": "hard" if len(raw["question"]) > 100 else "easy",
        }
```

Register it in `Public/benchmark/loaders/__init__.py`:

```python
_REGISTRY["my_dataset"] = (
    "benchmark.loaders.my_dataset.MyDatasetLoader",
    "Public/benchmark/configs/my_dataset_attribute_map.yaml",
)
```

Create `Public/benchmark/configs/my_dataset_attribute_map.yaml` and add the dataset to `Public/benchmark/emit_datapoints.py`'s `_DATASETS` dict. The loader is then available to `python -m benchmark.emit_datapoints --dataset my_dataset`.
```yaml
experiment:
  id: paper-eval-v1
  storage_folder: ./benchmark/runs
  phases:
    attribute_mapping: Keep     # skip: attributes defined statically in YAML
    rubric_mapping: Keep        # skip: rubric defined statically in YAML
    data_generation: Keep       # skip: datapoints pre-emitted by emit_datapoints.py
    response_collection: New    # run students
    evaluation: New             # run judges
```

The benchmark model acts as teacher:
```yaml
models:
  - name: benchmark        # resolves to the default ingested dataset
    interface: benchmark
    roles: [teacher]
  - name: gpt-4o-mini
    interface: openai
    roles: [student, judge]
```

The `name` field should match the `teacher_model_id` values in the JSONL files. For example, `benchmark-xsum` with task `text_summarization` will load:

```
phase3_datapoints/text_summarization.benchmark-xsum.datapoints.jsonl
```
A complete single-dataset example (XSum summarization, 30 items):

```yaml
models:
  - name: benchmark
    interface: benchmark
    roles: [teacher]
  - name: gpt-4o-mini
    interface: openai
    parameters: { model: gpt-4o-mini, temperature: 0.7, max_tokens: 512 }
    roles: [student, judge]
    role_parameters:
      judge: { temperature: 0.0, max_tokens: 128 }

tasks:
  - name: text_summarization
    description: Produce a concise one-sentence summary of a news article.
    output_description: A single sentence of 15–25 words capturing the article's main point.
    target_attributes:
      article_length: [short, medium, long]
      domain: [politics, sports, technology, science]
    sampling: { target: [1,1], nuance: [0,0], total: 30 }
    rubric:
      relevance: "The summary accurately reflects the article's main claim."
      conciseness: "The summary is free of redundant or filler language."
      fluency: "The summary reads as natural, grammatically correct English."
    evaluation_mode: single

experiment:
  id: xsum-benchmark-v1
  storage_folder: ./benchmark/runs
  batch:
    openai:
      response_collection: true
      evaluation: true
```

Setup and run:

```bash
python -m benchmark.emit_datapoints --dataset xsum --run-id xsum-benchmark-v1 --sample-size 30
coeval run --config xsum-benchmark-v1.yaml --continue
coeval analyze all --run ./benchmark/runs/xsum-benchmark-v1 --out ./reports
```
The mixed run (`Runs/mixed/mixed.yaml`) pairs a benchmark teacher with OpenAI students and judges on real-dataset tasks:

```yaml
# Runs/mixed/mixed.yaml
experiment_id: mixed-benchmark-v1
phases: [3, 4, 5]

models:
  - name: benchmark
    interface: benchmark
    roles: [teacher]
  - name: gpt-4o-mini
    interface: openai
    roles: [student, judge]
  - name: gpt-3.5-turbo
    interface: openai
    roles: [student]

tasks:
  - id: xsum_summarization
    description: Summarise a BBC news article in one sentence.
    items_per_teacher: 10
  - id: code_explanation
    description: Explain what a code function does.
    items_per_teacher: 10

batch_api:
  enabled: true

output_dir: benchmark/runs/
```
By contrast, a fully synthetic (generative-mode) run uses LLM teachers and runs Phases 1–3:

```yaml
experiment_id: multi-teacher-benchmark-v1
phases: [1, 2, 3]

models:
  - name: gpt-4o
    interface: openai
    roles: [teacher]
  - name: claude-3-5-sonnet
    interface: anthropic
    roles: [teacher]
  - name: gemini-1.5-pro
    interface: gemini
    roles: [teacher]

tasks:
  - id: scientific_qa
    description: >
      Answer a graduate-level science question with a thorough, accurate explanation.
    target_attributes:
      discipline: [physics, chemistry, biology, earth_science]
      difficulty: [undergraduate, graduate, research]
    items_per_teacher: 25

output_dir: benchmark/runs/
```
The education run uses the three pre-ingested education datasets (ARC-Challenge, RACE-High, SciQ) as virtual teachers:

```yaml
# Runs/education/education.yaml
models:
  - name: arc-challenge
    interface: benchmark
    roles: [teacher]
  - name: race-high
    interface: benchmark
    roles: [teacher]
  - name: sciq
    interface: benchmark
    roles: [teacher]
  - name: gpt-4o-mini
    interface: openai
    parameters: { model: gpt-4o-mini, temperature: 0.7, max_tokens: 512 }
    roles: [student, judge]
  - name: gpt-4o
    interface: openai
    parameters: { model: gpt-4o, temperature: 0.7, max_tokens: 512 }
    roles: [student, judge]
    role_parameters:
      judge: { temperature: 0.0, max_tokens: 128 }

tasks:
  - name: arc_science_reasoning
    description: Answer a multiple-choice science question by selecting A, B, C, or D.
    output_description: A single letter — A, B, C, or D.
    target_attributes:
      grade_band: [grade_3_5, grade_6_8, grade_9_10]
      knowledge_type: [factual, conceptual, procedural]
    sampling: { target: [1,1], nuance: [0,0], total: 30 }
    rubric:
      correctness: "The selected answer is the correct option."
    evaluation_mode: single

experiment:
  id: education-benchmark-v1
  storage_folder: ./benchmark/runs
  batch:
    openai:
      response_collection: true
      evaluation: true
      estimate_samples: 0
```

Setup and run:

```bash
python -m benchmark.setup_education    # ingest ARC, RACE-High, SciQ (30 items each)
coeval run --config Runs/education/education.yaml --continue
```

After Phase 3 datapoints are emitted, populate the `benchmark_native_score` field in each JSONL record by computing the task's ground-truth metric. These scores are used for calibration and Spearman ρ correlation analysis — they are not incorporated into the EES ranking score.
```bash
# Requires: pip install bert-score nltk
python -m benchmark.compute_scores \
    --run benchmark/runs/paper-eval-v1

# Single dataset
python -m benchmark.compute_scores \
    --run benchmark/runs/paper-eval-v1 \
    --datasets xsum

# Override default metric
python -m benchmark.compute_scores \
    --run benchmark/runs/paper-eval-v1 \
    --metric bertscore

# Idempotent: already-scored records are skipped unless --force
python -m benchmark.compute_scores \
    --run benchmark/runs/paper-eval-v1 \
    --force --dry-run
```

Available metrics:
| Metric flag | Used for | Library |
|---|---|---|
| `bertscore` | XSum, AESLC, CNN/DailyMail, SAMSum | bert-score |
| `bleu` | CodeSearchNet, MBPP, NarrativeQA | nltk |
| `exact_match` | WikiTableQuestions, ARC-Challenge, RACE, SciQ, MATH, BIG-Bench Hard, LogiQA, WinoGrande, MultiNLI, COPA, CosmosQA, BBQ, TriviaQA, SQuAD 2.0, NQ Open, FEVER, SciFact, MGSM, MathQA | built-in |
| `rouge_l` | XSum (alternative) | rouge-score |
Each benchmark has a default metric defined in Public/benchmark/compute_scores.py's BENCHMARK_METRIC dict. You can override with --metric if you want to compare alternatives.
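For intuition, exact match is usually computed over normalized strings, roughly along these lines (a sketch only; CoEval's scorer may additionally extract option letters or numeric answers before comparing):

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

# exact_match("Paris.", "paris")        -> 1.0
# exact_match("The answer is B", "B")   -> 0.0 (option letter must be extracted first)
```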
This fills benchmark_native_score in the Phase 3 JSONL files. These scores are used for two purposes:
- Calibration (`Code/analyzer/calibration.py`): Fits an OLS linear mapping from judge ensemble scores to benchmark-native ground truth — useful for detecting judge bias or drift.
- Spearman ρ tables (`Code/analyzer/paper_tables.py`): Computes rank correlation between judge scores and the ground-truth metric — validates how well the ensemble captures model quality.
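Both analyses reduce to standard statistics over paired per-item scores; a self-contained sketch with toy numbers (not CoEval's code) looks like this:

```python
import numpy as np
from scipy.stats import spearmanr

# Paired per-item scores: judge-ensemble score vs. benchmark_native_score
judge_scores = np.array([3.2, 4.5, 2.1, 4.9, 3.8])        # e.g. 1-5 rubric scale
native_scores = np.array([0.61, 0.83, 0.42, 0.90, 0.70])  # e.g. BERTScore-F1

# Rank correlation (the Spearman rho reported in the paper tables)
rho, p_value = spearmanr(judge_scores, native_scores)

# OLS calibration: linear map from judge score to the native metric
slope, intercept = np.polyfit(judge_scores, native_scores, deg=1)
calibrated = slope * judge_scores + intercept

print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
print(f"calibration: native ~ {slope:.3f} * judge + {intercept:.3f}")
```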
> Note: `benchmark_native_score` is not incorporated into the EES (Evaluation Ensemble Score) that drives model rankings. It is a separate validation signal used for calibration and correlation analysis only.
For classification and information-extraction tasks where the correct output is a discrete label, use label_attributes for judge-free exact-match evaluation:
```yaml
tasks:
  - id: sentiment_classification
    label_attributes: [sentiment]
    ...
```

When `label_attributes` is set, Phase 5 uses exact-match label evaluation instead of an LLM judge — no judge model is required, no judge bias is introduced.
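As a rough sketch of what judge-free scoring means (assuming the expected label is carried in the datapoint's `sampled_target_attributes`; this is an illustration, not CoEval's exact logic):

```python
def label_match(response: str, datapoint: dict, label_attr: str) -> float:
    """Judge-free Phase 5 scoring: compare the model's output against the
    labelled attribute value (e.g. the gold sentiment) by exact match."""
    expected = str(datapoint["sampled_target_attributes"][label_attr])
    return float(response.strip().lower() == expected.strip().lower())

# label_match("Positive", dp, "sentiment") -> 1.0 if dp's sentiment label is "positive"
```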
If someone shares an exported benchmark package with you, follow these steps:
First, copy the exported Phase 3 datapoints into a new run folder:

```bash
mkdir -p benchmark/runs/my-repro-v1/phase3_datapoints
cp exports/summarization-benchmark-v1/datapoints/*.jsonl \
    benchmark/runs/my-repro-v1/phase3_datapoints/
```

Then create a config (e.g. `benchmark/configs/repro-summarization.yaml`) that uses the benchmark teacher and runs only Phases 4 and 5:

```yaml
experiment_id: my-repro-v1
phases: [4, 5]

models:
  - name: gpt-4o-mini
    interface: benchmark
    roles: [teacher]
  - name: gpt-4o
    interface: openai
    roles: [student]
  - name: claude-3-5-haiku
    interface: anthropic
    roles: [student]
  - name: gpt-4o-mini
    interface: openai
    roles: [judge]

tasks:
  - id: text_summarization
    description: >
      Summarise a news article in 1–3 concise sentences.
    items_per_teacher: 80

output_dir: benchmark/runs/
```

Finally, run it:

```bash
coeval run --config benchmark/configs/repro-summarization.yaml
```

CoEval will skip Phase 3 generation entirely (since `phases: [4, 5]` is set) and instead load the pre-existing JSONL files from `phase3_datapoints/`.
A reusable benchmark package is a directory containing:
- The Phase 3 JSONL datapoint files
- The Phase 2 rubric JSON files
- A `benchmark_info.yaml` manifest
Manifest example:
```yaml
name: summarization-benchmark-v1
version: "1.0"
description: >
  A synthetic news summarization benchmark generated by CoEval using
  GPT-4o-mini as teacher. Covers 4 domains and 3 article lengths.
created_at: "2026-03-01"
created_by: "your-team@example.com"
coeval_version: "0.3.0"
license: CC-BY-4.0
citation: >
  If you use this benchmark, please cite: <your citation here>
tasks:
  - id: text_summarization
    description: >
      Summarise a news article in 1–3 concise sentences.
    datapoints_file: datapoints/text_summarization.gpt-4o-mini.datapoints.jsonl
    rubric_file: rubrics/text_summarization.rubric.json
    item_count: 80
    teacher_model: gpt-4o-mini
    target_attributes:
      article_length: [short, medium, long]
      domain: [technology, politics, science, business]
```

Best practices:
- Lock your Phase 3 data before publishing — treat JSONL files as immutable after publication
- Always include the rubric JSON — judges need it to score responses fairly
- Document the teacher model — the quality of `reference_response` values depends heavily on it
- Use `--continue` for large exports to resume interrupted generation
- Version your benchmark packages (e.g., `my-benchmark-v1`, `my-benchmark-v2`)
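Before publishing, it can also help to sanity-check the package against its manifest. A small illustrative script, assuming the manifest layout shown above:

```python
import yaml  # pip install pyyaml
from pathlib import Path

def check_package(package_dir: str) -> None:
    """Verify that the files listed in benchmark_info.yaml exist and that
    item counts match the manifest."""
    root = Path(package_dir)
    manifest = yaml.safe_load((root / "benchmark_info.yaml").read_text(encoding="utf-8"))
    for task in manifest["tasks"]:
        datapoints = root / task["datapoints_file"]
        rubric = root / task["rubric_file"]
        assert datapoints.exists(), f"missing datapoints file: {datapoints}"
        assert rubric.exists(), f"missing rubric file: {rubric}"
        n_items = sum(
            1 for line in datapoints.read_text(encoding="utf-8").splitlines() if line.strip()
        )
        assert n_items == task["item_count"], (
            f"{datapoints.name}: {n_items} items, manifest says {task['item_count']}"
        )
    print(f"{manifest['name']} v{manifest['version']}: OK")

# check_package("exports/summarization-benchmark-v1")
```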
| File | Description |
|---|---|
| `Runs/mixed/mixed.yaml` | Mixed benchmark (OpenAI models + real datasets, ~$0.03) |
| `Runs/education/education.yaml` | Education benchmark: 3 real-dataset tasks + synthetic tasks, 6 models |
| `Runs/paper/paper_benchmarks.yaml` | Paper evaluation config: 8 students, 3 judges, all 4 benchmark tasks |
| `Runs/paper/paper_dual_track.yaml` | Dual-track paper config: benchmark + generative teacher ablation |
Q: What benchmark datasets are available out of the box?
A: Twenty-six datasets have pre-ingested loaders in Public/benchmark/loaders/: python -m benchmark.setup_mixed ingests XSum, CodeSearchNet, AESLC, and WikiTableQuestions; python -m benchmark.setup_education ingests ARC-Challenge, RACE-High, and SciQ; python -m benchmark.emit_datapoints --dataset <name> ingests any of the remaining 19 datasets (MATH, MBPP, BIG-Bench Hard, LogiQA, WinoGrande, MultiNLI, COPA, CosmosQA, BBQ, TriviaQA, SQuAD 2.0, NQ Open, NarrativeQA, CNN/DailyMail, SAMSum, FEVER, SciFact, MGSM, MathQA). An additional six datasets — MMLU, HellaSwag, TruthfulQA, HumanEval, MedQA, and GSM8K — are available via coeval ingest (built-in CLI adapters).
Q: What does coeval ingest do?
A: coeval ingest converts an external JSONL dataset into Phase 3 datapoint format, writing files to benchmark/runs/{run-id}/phase3_datapoints/. The input JSONL must have at minimum prompt and reference_response fields. Once ingested, the dataset can be used as a virtual teacher with interface: benchmark — no LLM API calls are made for Phase 3.
Q: How does the interface: benchmark virtual teacher work?
A: A model with interface: benchmark is skipped entirely during Phase 3. Instead of generating datapoints via LLM calls, CoEval loads pre-ingested JSONL files from the phase3_datapoints/ folder. Phases 4 and 5 then run normally — only student responses and judge evaluations require API calls.
Q: What is stratified sampling and why does it matter?
A: When emitting datapoints from a large benchmark, CoEval applies stratified sampling across all attribute value combinations. This ensures that the selected items cover the full difficulty/domain distribution of the dataset rather than clustering at the most common values. For example, 620 XSum items are drawn from 24 strata (4 complexity levels × 6 domain values) with roughly equal representation per stratum.
Q: How do I reproduce someone else's published benchmark results?
A: Place their exported Phase 3 JSONL files in your phase3_datapoints/ folder, create a config with interface: benchmark as the teacher, and run coeval run with phases set to skip Phase 3 (attribute_mapping: Keep, rubric_mapping: Keep, data_generation: Keep). Your student and judge models run against the original benchmark items without regenerating anything.
Q: How do I write a custom dataset loader for a new benchmark?
A: Create a Python module in Public/benchmark/loaders/ that subclasses BenchmarkLoader and implements _load_dataset() and _to_record(). Register it in Public/benchmark/loaders/__init__.py with a dataset ID and attribute map path, then add it to Public/benchmark/emit_datapoints.py's _DATASETS dict. The loader is then available to python -m benchmark.emit_datapoints and setup scripts.