
Benchmark Datasets

← Running · Reports →


What Are Benchmark Teachers?

CoEval supports two modes for supplying evaluation datapoints:

Synthetic (generative) mode                Benchmark-sourced mode
──────────────────────────                 ──────────────────────
Phase 1: teachers → attributes             Phase 1: SKIP (static from attribute_map.yaml)
Phase 2: teachers → rubric                 Phase 2: SKIP (static from task YAML)
Phase 3: teachers → datapoints  ──vs──    Phase 3: SKIP (pre-emitted from benchmark)
Phase 4: students → responses              Phase 4: students → responses  (identical)
Phase 5: judges   → scores                 Phase 5: judges   → scores     (identical)
                                           [Extra] metric computation → benchmark_native_score
                                           [Extra] paper_tables.py    → Spearman ρ tables

The interface: benchmark setting marks a virtual teacher that replays pre-ingested responses from real datasets. No LLM API calls are made for models using this interface; their responses are loaded from JSONL files created by coeval ingest or the setup scripts.

This approach decouples the benchmark creation step from the model evaluation step. Once you have a published benchmark package, anyone can run phases 4 and 5 against it to evaluate any model — without needing access to your original teacher models or API keys.

Benchmark teachers are completely skipped during Phase 3 — their (prompt, reference-response) pairs are loaded directly from pre-ingested JSONL files. Only Phase 4 (student response collection) and Phase 5 (judge evaluation) make LLM API calls. This makes benchmark-sourced experiments significantly cheaper than synthetic ones.

The two modes are fully compatible: Phases 4 and 5 behave identically regardless of how the Phase 3 datapoints were sourced.


Pre-Ingested Datasets

Dataset Task Split Items Ground-Truth Metric
xsum text_summarization validation 11,332 BERTScore-F1 vs. gold summary
codesearchnet code_explanation validation ~10K (Python) BLEU-4 vs. reference docstring
aeslc email_composition validation ~1K BERTScore-F1 vs. reference email
wikitablequestions data_interpretation validation 2,831 Exact-match accuracy
arc-challenge science_reasoning test 1,172 Exact-match accuracy
race reading_comprehension test ~4.9K (high-school) Exact-match accuracy
sciq science_qa test 1,000 Exact-match accuracy
math math_problem_solving test 5,000 Exact-match on extracted answer
mbpp code_generation test 374 BLEU-4 vs. canonical solution
bigbench_hard reasoning_and_logic train ~6,700 (27 sub-tasks) Exact-match accuracy
logiqa logical_reasoning test ~2,600 Exact-match accuracy
winogrande commonsense_reasoning validation ~1,300 Exact-match accuracy
multinli natural_language_inference validation_matched ~9,800 Exact-match accuracy
copa causal_reasoning validation 100 Exact-match accuracy
cosmos_qa narrative_reasoning validation ~2,900 Exact-match accuracy
bbq bias_evaluation_qa test ~58,000 Exact-match accuracy
trivia_qa knowledge_retrieval validation ~11,300 Exact-match accuracy
squad_v2 reading_comprehension_qa validation ~11,900 Exact-match / F1
nq_open open_domain_qa validation ~3,600 Exact-match accuracy
narrativeqa narrative_qa test ~10,600 BLEU-4 (free-form)
cnn_dailymail news_summarization test ~11,500 BERTScore-F1
samsum dialogue_summarization test 819 BERTScore-F1
fever fact_verification labelled_dev ~19,000 Exact-match accuracy
scifact scientific_claim_verification train 809 Exact-match accuracy
mgsm multilingual_math test 250 (en) Exact-match accuracy
mathqa math_word_problems_mc test ~2,900 Exact-match accuracy

Two ingestion systems: The 26 datasets above use Public/benchmark/loaders/ and are set up with python -m benchmark.setup_mixed (xsum, codesearchnet, aeslc, wikitablequestions), python -m benchmark.setup_education (arc-challenge, race, sciq), or python -m benchmark.emit_datapoints --dataset <name>. A separate set of built-in CLI adapters (Code/runner/benchmarks/registry.py) is available via coeval ingest: mmlu, hellaswag, truthfulqa, humaneval, medqa, gsm8k.

Loader files:

Loader file Dataset Domain
Public/benchmark/loaders/xsum.py XSum Text summarization
Public/benchmark/loaders/codesearchnet.py CodeSearchNet Code explanation
Public/benchmark/loaders/aeslc.py AESLC Email composition
Public/benchmark/loaders/wikitablequestions.py WikiTableQuestions Data interpretation
Public/benchmark/loaders/arc_challenge.py ARC-Challenge Science reasoning (MCQ)
Public/benchmark/loaders/race.py RACE Reading comprehension (MCQ)
Public/benchmark/loaders/sciq.py SciQ Science questions (MCQ)
Public/benchmark/loaders/math_dataset.py MATH (Hendrycks et al.) Competition mathematics
Public/benchmark/loaders/mbpp.py MBPP Python code generation
Public/benchmark/loaders/bigbench_hard.py BIG-Bench Hard (27 sub-tasks) Reasoning & logic
Public/benchmark/loaders/logiqa.py LogiQA Logical reasoning (MCQ)
Public/benchmark/loaders/winogrande.py WinoGrande Commonsense reasoning
Public/benchmark/loaders/multinli.py MultiNLI Natural language inference
Public/benchmark/loaders/copa.py COPA Causal commonsense reasoning
Public/benchmark/loaders/cosmos_qa.py CosmosQA Narrative commonsense comprehension
Public/benchmark/loaders/bbq.py BBQ Bias-aware QA
Public/benchmark/loaders/trivia_qa.py TriviaQA Knowledge retrieval
Public/benchmark/loaders/squad_v2.py SQuAD 2.0 Extractive + unanswerable RC
Public/benchmark/loaders/nq_open.py Natural Questions (open) Open-domain QA
Public/benchmark/loaders/narrativeqa.py NarrativeQA Long-story comprehension
Public/benchmark/loaders/cnn_dailymail.py CNN/DailyMail News summarization
Public/benchmark/loaders/samsum.py SAMSum Dialogue summarization
Public/benchmark/loaders/fever.py FEVER Fact verification
Public/benchmark/loaders/scifact.py SciFact Scientific claim verification
Public/benchmark/loaders/mgsm.py MGSM (en) Multilingual math (English)
Public/benchmark/loaders/mathqa.py MathQA Math word problems (MCQ)

Public LLM Benchmark Reference

This section catalogues widely-used public benchmarks by domain. CoEval can supply these as virtual teachers via interface: benchmark once ingested with coeval ingest. The ✅ / ❌ column reflects current support status.

Top 10 Generic / Cross-Domain Benchmarks

Benchmark Task Type Size Primary Metric CoEval Link
MMLU 57-subject MCQ (STEM, humanities, social science) 14K Accuracy mmlu arxiv
HumanEval Python function completion from docstring 164 Pass@1 humaneval arxiv
GSM8K Grade-school math word problems 8.5K Accuracy gsm8k arxiv
HellaSwag Sentence completion / commonsense NLI 70K Accuracy hellaswag arxiv
TruthfulQA Truthfulness — MC + generation 817 Accuracy / BLEURT truthfulqa arxiv
BIG-Bench Hard 23 challenging multi-step reasoning tasks ~6K Accuracy / EM bigbench_hard arxiv
MATH Competition math (algebra → calculus) 12.5K Accuracy (symbolic) math arxiv
ARC-Challenge Science MCQ — hard subset 1.17K Accuracy arc-challenge arxiv
GPQA Graduate-level science MCQ 448 Accuracy ❌ planned arxiv
MBPP Python beginner coding problems 374 Pass@1 mbpp arxiv

Reading the table: name = loader in Public/benchmark/loaders/; use python -m benchmark.emit_datapoints --dataset name. ✅ coeval ingest = CLI adapter in Code/runner/benchmarks/registry.py. ❌ planned = loader not yet implemented; custom JSONL ingestion via coeval ingest --dataset is still possible.


Top Benchmarks by Domain

1. Reasoning & Logic

Benchmark Description CoEval
BIG-Bench Hard 23 tasks requiring chain-of-thought reasoning bigbench_hard
ARC-Challenge AI2 science reasoning — harder subset arc-challenge
LogiQA Reading comprehension with formal logic logiqa
ReClor Logical reasoning from law-school exams ❌ restricted
AGIEval Human-centric reasoning: SAT, GRE, LSAT, etc.

2. Mathematics

Benchmark Description CoEval
GSM8K 8.5K grade-school word problems gsm8k
MATH 12.5K competition problems (Hendrycks et al.) math
MGSM Multilingual grade-school math (11 languages) mgsm
MathQA Multiple-choice math word problems mathqa
NumGLUE Numerical reasoning across 8 NLU tasks

3. Code Generation

Benchmark Description CoEval
HumanEval 164 Python function completions humaneval
MBPP 374 Python beginner problems mbpp
CodeSearchNet Code-to-docstring generation (Python) ✅ pre-ingested
SWE-bench Real GitHub issue resolution
DS-1000 1,000 data-science code completions

4. Language Understanding

Benchmark Description CoEval
HellaSwag 70K sentence-completion / commonsense NLI items hellaswag
WinoGrande Large-scale Winograd schema challenge winogrande
SuperGLUE 8-task NLU suite (NLI, QA, coreference)
MultiNLI Multi-genre natural language inference multinli
COPA Cause-and-effect commonsense reasoning copa

5. Open-Domain Question Answering

Benchmark Description CoEval
TriviaQA 95K trivia questions with evidence passages trivia_qa
Natural Questions 307K Google search queries + Wikipedia answers nq_open
WikiTableQuestions Table-grounded free-form QA ✅ pre-ingested
SQuAD 2.0 Extractive + unanswerable span QA squad_v2
WebQuestions 3K Freebase entity questions

6. Reading Comprehension

Benchmark Description CoEval
RACE 100K MCQ from English exams (middle + high school) race
SciQ 14K elementary/middle school science MCQ sciq
QuALITY Long-document multiple-choice QA
CosmosQA Narrative-based commonsense comprehension cosmos_qa
NarrativeQA Full-story comprehension with free-form answers narrativeqa

7. Knowledge & Factual Recall

Benchmark Description CoEval
MMLU 57-subject academic knowledge across domains mmlu
MedQA (USMLE) USMLE medical licensing exam questions medqa
KoLA Knowledge-oriented LLM assessment
FActScore Factual precision metric for open-ended generation
FEVER Fact verification against Wikipedia fever

8. Summarization & Generation

Benchmark Description CoEval
XSum 11K BBC articles → 1-sentence summaries ✅ pre-ingested
AESLC Email subject-line composition ✅ pre-ingested
CNN/DailyMail News summarization with bullet highlights cnn_dailymail
SAMSum Dialogue summarization samsum
MeetingBank Meeting transcript summarization ❌ proprietary

9. Truthfulness & Safety

Benchmark Description CoEval
TruthfulQA 817 questions probing hallucination tendencies truthfulqa
BOLD Bias in open-ended language generation
BBQ Bias benchmark for question answering bbq
WinoBias Gender bias in coreference resolution
RealToxicityPrompts Toxicity measurement in generation

10. Medical & Scientific Reasoning

Benchmark Description CoEval
MedQA (USMLE) 4-option MCQ from USMLE Steps 1–3 medqa
SciQ Science MCQ with supporting passages sciq
ARC-Challenge AI2 science exam MCQ (harder 1.17K items) arc-challenge
GPQA PhD-level science MCQ (chemistry, biology, physics) ❌ planned
SciFact Scientific claim verification against abstracts scifact

CoEval Support Summary

Status Datasets
Pre-ingested (loaders in Public/benchmark/loaders/) XSum, CodeSearchNet, AESLC, WikiTableQuestions, ARC-Challenge, RACE, SciQ, MATH, MBPP, BIG-Bench Hard, LogiQA, WinoGrande, MultiNLI, COPA, CosmosQA, BBQ, TriviaQA, SQuAD 2.0, NQ Open, NarrativeQA, CNN/DailyMail, SAMSum, FEVER, SciFact, MGSM, MathQA
Available via coeval ingest MMLU, HellaSwag, TruthfulQA, HumanEval, MedQA, GSM8K
Planned (loader not yet implemented; custom JSONL ingestion possible) GPQA
Not yet supported All others — contribute a loader! See Writing a Loader

Planned Benchmarks

The following benchmarks are on the roadmap. No loader exists yet, but you can still use them by converting to JSONL and running coeval ingest --dataset your_file.jsonl.

Benchmark Priority Blocker Notes
GPQA Medium Restricted access (Diamond set) Requires NDA agreement from maintainers

To add a loader, follow the Writing a Loader guide and submit a PR to Public/benchmark/loaders/.


Setup

Mixed benchmark (XSum, CodeSearchNet, AESLC, WikiTableQuestions)

# Run once before using Runs/mixed/mixed.yaml
python -m benchmark.setup_mixed

This ingests all four benchmarks (10 items each) into a new run folder.

Education benchmark (ARC-Challenge, RACE-High, SciQ)

python -m benchmark.setup_education

Emitting datapoints directly

# All four benchmarks, 620 items each, into a new run folder
python -m benchmark.emit_datapoints \
    --run-id paper-eval-v1 \
    --sample-size 620

# Single dataset
python -m benchmark.emit_datapoints \
    --dataset xsum \
    --run-id paper-eval-v1 \
    --sample-size 620

# Custom output directory
python -m benchmark.emit_datapoints \
    --dataset codesearchnet \
    --out-dir ./benchmark/runs/my-run/phase3_datapoints \
    --sample-size 300 \
    --split test

Output files are written to:

benchmark/runs/{run-id}/phase3_datapoints/
  text_summarization.benchmark_xsum.datapoints.jsonl
  code_explanation.benchmark_codesearchnet.datapoints.jsonl
  email_composition.benchmark_aeslc.datapoints.jsonl
  data_interpretation.benchmark_wikitablequestions.datapoints.jsonl

Stratified sampling: By default, the loader applies stratified sampling across all attribute value combinations in the attribute map. This ensures the selected items cover the benchmark's difficulty/domain distribution rather than clustering at its mode.

Full benchmark (e.g., 11,332 XSum items)
  ↓ attribute inference (complexity × domain = 4 × 6 = 24 strata)
  ↓ stratified sample: ≈ 26 items per stratum
  ↓ 620 items selected with near-uniform stratum coverage
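
As a rough illustration of the idea, the sketch below does stratified sampling in plain Python. It is not the actual loader code, and the items' attributes dict layout is an assumption made for the example.

# Illustrative sketch of stratified sampling, not the actual loader implementation.
# Assumes each item carries inferred attributes, e.g. {"complexity": "medium", "domain": "sports"}.
import random
from collections import defaultdict

def stratified_sample(items, attribute_keys, total, seed=42):
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        key = tuple(item["attributes"][k] for k in attribute_keys)   # e.g. ("medium", "sports")
        strata[key].append(item)
    per_stratum = max(1, total // len(strata))   # 620 // 24 strata ≈ 26 items each
    picked = []
    for bucket in strata.values():
        rng.shuffle(bucket)
        picked.extend(bucket[:per_stratum])
    return picked[:total]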

coeval ingest

The coeval ingest command converts an external JSONL dataset into Phase 3 format:

coeval ingest \
  --dataset path/to/my_benchmark.jsonl \
  --run-id my-experiment-v1 \
  --task text_summarization \
  --teacher-id my-benchmark

The input JSONL must have at minimum prompt and reference_response fields. Additional fields are preserved as passthrough metadata.
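
A minimal sketch of a valid input file, assuming only the two required fields above; the source_id field is purely illustrative passthrough metadata.

import json

rows = [
    {
        "prompt": "Summarise the following article in 1-3 sentences:\n\n<article text>",
        "reference_response": "<gold summary>",
        "source_id": "item-0001",   # illustrative passthrough metadata
    },
]

# Write one JSON object per line (JSONL), then run `coeval ingest --dataset my_benchmark.jsonl`
with open("my_benchmark.jsonl", "w", encoding="utf-8") as fh:
    for row in rows:
        fh.write(json.dumps(row, ensure_ascii=False) + "\n")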

Phase 3 JSONL Format Reference

Each line in a Phase 3 datapoints file is a JSON object:

{
  "id": "text_summarization__gpt-4o-mini__00001",
  "task_id": "text_summarization",
  "teacher_model_id": "gpt-4o-mini",
  "sampled_target_attributes": {
    "article_length": "medium",
    "domain": "technology"
  },
  "prompt": "Summarise the following article in 1–3 sentences:\n\nApple announced today that...",
  "reference_response": "Apple unveiled a new chip architecture that doubles inference throughput while reducing power draw by 30 percent.",
  "generated_at": "2025-03-01T14:23:11Z"
}

For benchmark-sourced records, additional fields are included:

{
  "benchmark_id":          "xsum",
  "benchmark_split":       "validation",
  "benchmark_native_id":   "29750436",
  "benchmark_native_score": null
}

benchmark_native_score is null at emit time and filled in by the metric computation step.
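
The sketch below reads such a file and checks the core fields. It is a reader you would write yourself for inspection, not a helper shipped with CoEval.

import json

REQUIRED = {"id", "task_id", "teacher_model_id", "prompt", "reference_response"}

def iter_datapoints(path):
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            record = json.loads(line)
            missing = REQUIRED - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
            yield record

for dp in iter_datapoints("phase3_datapoints/text_summarization.benchmark_xsum.datapoints.jsonl"):
    # benchmark_native_score remains null until the metric computation step runs
    print(dp["id"], dp.get("benchmark_native_score"))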

Writing a Loader for a New Dataset

A loader is a Python module that defines a subclass of BenchmarkLoader implementing _load_dataset() and _to_record():

# Public/benchmark/loaders/my_dataset.py
from .base import BenchmarkLoader

class MyDatasetLoader(BenchmarkLoader):
    benchmark_id = "my_dataset"
    task_id = "my_task"
    default_split = "test"

    def _load_dataset(self, split: str, seed: int, sample_size: int):
        from datasets import load_dataset
        ds = load_dataset("author/my_dataset", split=split)
        return ds.shuffle(seed=seed).select(range(sample_size))

    def _to_record(self, raw, idx: int) -> dict:
        return {
            "prompt": f"Answer this question: {raw['question']}",
            "reference_response": raw["answer"],
            "sampled_target_attributes": self._infer_attributes(raw),
        }

    def _infer_attributes(self, raw: dict) -> dict:
        return {
            "difficulty": "hard" if len(raw["question"]) > 100 else "easy",
        }

Register it in Public/benchmark/loaders/__init__.py:

_REGISTRY["my_dataset"] = (
    "benchmark.loaders.my_dataset.MyDatasetLoader",
    "Public/benchmark/configs/my_dataset_attribute_map.yaml",
)

Create Public/benchmark/configs/my_dataset_attribute_map.yaml and add the dataset to Public/benchmark/emit_datapoints.py's _DATASETS dict.
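
As a hypothetical example, the _DATASETS entry could look like the following; the field names here are assumptions that mirror the registry entry above, and the real dict in emit_datapoints.py may be laid out differently.

# Hypothetical _DATASETS entry in Public/benchmark/emit_datapoints.py (illustrative only).
_DATASETS = {
    # ... existing datasets ...
    "my_dataset": {
        "task_id": "my_task",
        "default_split": "test",
        "attribute_map": "Public/benchmark/configs/my_dataset_attribute_map.yaml",
    },
}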


Running Benchmark Experiments

Config with interface: benchmark

experiment:
  id: paper-eval-v1
  storage_folder: ./benchmark/runs
  phases:
    attribute_mapping:   Keep   # skip: attributes defined statically in YAML
    rubric_mapping:      Keep   # skip: rubric defined statically in YAML
    data_generation:     Keep   # skip: datapoints pre-emitted by emit_datapoints.py
    response_collection: New    # run students
    evaluation:          New    # run judges

The benchmark model acts as teacher:

models:
  - name: benchmark          # resolves to the default ingested dataset
    interface: benchmark
    roles: [teacher]

  - name: gpt-4o-mini
    interface: openai
    roles: [student, judge]

The name field should match the teacher_model_id values in the JSONL files. For example, benchmark-xsum with task text_summarization will load:

phase3_datapoints/text_summarization.benchmark-xsum.datapoints.jsonl
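
In other words, the expected path is composed from the task id and the teacher id (a short sketch):

task_id = "text_summarization"
teacher_model_id = "benchmark-xsum"   # must match the model's name: in the config
path = f"phase3_datapoints/{task_id}.{teacher_model_id}.datapoints.jsonl"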

Example A: XSum text summarization benchmark

models:
  - name: benchmark
    interface: benchmark
    roles: [teacher]

  - name: gpt-4o-mini
    interface: openai
    parameters: { model: gpt-4o-mini, temperature: 0.7, max_tokens: 512 }
    roles: [student, judge]
    role_parameters:
      judge: { temperature: 0.0, max_tokens: 128 }

tasks:
  - name: text_summarization
    description: Produce a concise one-sentence summary of a news article.
    output_description: A single sentence of 15–25 words capturing the article's main point.
    target_attributes:
      article_length: [short, medium, long]
      domain: [politics, sports, technology, science]
    sampling: { target: [1,1], nuance: [0,0], total: 30 }
    rubric:
      relevance:    "The summary accurately reflects the article's main claim."
      conciseness:  "The summary is free of redundant or filler language."
      fluency:      "The summary reads as natural, grammatically correct English."
    evaluation_mode: single

experiment:
  id: xsum-benchmark-v1
  storage_folder: ./benchmark/runs
  batch:
    openai:
      response_collection: true
      evaluation: true

Setup and run:

python -m benchmark.emit_datapoints --dataset xsum --run-id xsum-benchmark-v1 --sample-size 30
coeval run --config xsum-benchmark-v1.yaml --continue
coeval analyze all --run ./benchmark/runs/xsum-benchmark-v1 --out ./reports

Example B: Mixed benchmark (public dataset teacher + OpenAI students)

# Runs/mixed/mixed.yaml
experiment_id: mixed-benchmark-v1
phases: [3, 4, 5]

models:
  - name: benchmark
    interface: benchmark
    roles: [teacher]

  - name: gpt-4o-mini
    interface: openai
    roles: [student, judge]

  - name: gpt-3.5-turbo
    interface: openai
    roles: [student]

tasks:
  - id: xsum_summarization
    description: Summarise a BBC news article in one sentence.
    items_per_teacher: 10

  - id: code_explanation
    description: Explain what a code function does.
    items_per_teacher: 10

batch_api:
  enabled: true

output_dir: benchmark/runs/

Example C: Multi-teacher benchmark generation

experiment_id: multi-teacher-benchmark-v1
phases: [1, 2, 3]

models:
  - name: gpt-4o
    interface: openai
    roles: [teacher]

  - name: claude-3-5-sonnet
    interface: anthropic
    roles: [teacher]

  - name: gemini-1.5-pro
    interface: gemini
    roles: [teacher]

tasks:
  - id: scientific_qa
    description: >
      Answer a graduate-level science question with a thorough, accurate explanation.
    target_attributes:
      discipline: [physics, chemistry, biology, earth_science]
      difficulty: [undergraduate, graduate, research]
    items_per_teacher: 25

output_dir: benchmark/runs/

Example D: Multi-dataset education benchmark (ARC-Challenge, RACE-High, SciQ)

models:
  - name: arc-challenge
    interface: benchmark
    roles: [teacher]

  - name: race-high
    interface: benchmark
    roles: [teacher]

  - name: sciq
    interface: benchmark
    roles: [teacher]

  - name: gpt-4o-mini
    interface: openai
    parameters: { model: gpt-4o-mini, temperature: 0.7, max_tokens: 512 }
    roles: [student, judge]

  - name: gpt-4o
    interface: openai
    parameters: { model: gpt-4o, temperature: 0.7, max_tokens: 512 }
    roles: [student, judge]
    role_parameters:
      judge: { temperature: 0.0, max_tokens: 128 }

tasks:
  - name: arc_science_reasoning
    description: Answer a multiple-choice science question by selecting A, B, C, or D.
    output_description: A single letter — A, B, C, or D.
    target_attributes:
      grade_band: [grade_3_5, grade_6_8, grade_9_10]
      knowledge_type: [factual, conceptual, procedural]
    sampling: { target: [1,1], nuance: [0,0], total: 30 }
    rubric:
      correctness: "The selected answer is the correct option."
    evaluation_mode: single

experiment:
  id: education-benchmark-v1
  storage_folder: ./benchmark/runs
  batch:
    openai:
      response_collection: true
      evaluation: true
  estimate_samples: 0

Setup and run:

python -m benchmark.setup_education    # ingest ARC, RACE-High, SciQ (30 items each)
coeval run --config Runs/education/education.yaml --continue

Benchmark-Native Scores

After Phase 3 datapoints are emitted, populate the benchmark_native_score field in each JSONL record by computing the task's ground-truth metric. These scores are used for calibration and Spearman ρ correlation analysis — they are not incorporated into the EES ranking score.

# Requires: pip install bert-score nltk
python -m benchmark.compute_scores \
    --run benchmark/runs/paper-eval-v1

# Single dataset
python -m benchmark.compute_scores \
    --run benchmark/runs/paper-eval-v1 \
    --datasets xsum

# Override default metric
python -m benchmark.compute_scores \
    --run benchmark/runs/paper-eval-v1 \
    --metric bertscore

# Already-scored records are skipped unless --force; add --dry-run to preview without writing
python -m benchmark.compute_scores \
    --run benchmark/runs/paper-eval-v1 \
    --force --dry-run

Available metrics:

Metric flag Used for Library
bertscore XSum, AESLC, CNN/DailyMail, SAMSum bert-score
bleu CodeSearchNet, MBPP, NarrativeQA nltk
exact_match WikiTableQuestions, ARC-Challenge, RACE, SciQ, MATH, BIG-Bench Hard, LogiQA, WinoGrande, MultiNLI, COPA, CosmosQA, BBQ, TriviaQA, SQuAD 2.0, NQ Open, FEVER, SciFact, MGSM, MathQA built-in
rouge_l XSum (alternative) rouge-score

Each benchmark has a default metric defined in Public/benchmark/compute_scores.py's BENCHMARK_METRIC dict. You can override with --metric if you want to compare alternatives.
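
For illustration only, a default-metric mapping of this kind might look like the following; the metric names are taken from the table above, but the actual dict in compute_scores.py may differ in layout and coverage.

# Hypothetical shape of BENCHMARK_METRIC -- illustrative, not the repo's exact contents.
BENCHMARK_METRIC = {
    "xsum": "bertscore",
    "cnn_dailymail": "bertscore",
    "codesearchnet": "bleu",
    "mbpp": "bleu",
    "arc-challenge": "exact_match",
    "math": "exact_match",
    # ... one entry per ingested dataset ...
}

metric = BENCHMARK_METRIC.get("xsum", "exact_match")   # --metric overrides this default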

This fills benchmark_native_score in the Phase 3 JSONL files. These scores are used for two purposes:

  1. Calibration (Code/analyzer/calibration.py): Fits an OLS linear mapping from judge ensemble scores to benchmark-native ground truth — useful for detecting judge bias or drift.
  2. Spearman ρ tables (Code/analyzer/paper_tables.py): Computes rank correlation between judge scores and ground-truth metric — validates how well the ensemble captures model quality.

Note: benchmark_native_score is not incorporated into the EES (Evaluation Ensemble Score) that drives model rankings. It is a separate validation signal used for calibration and correlation analysis only.
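
A compact sketch of both signals using numpy and scipy, with made-up scores; this is not the analyzer code in Code/analyzer/.

import numpy as np
from scipy.stats import spearmanr

judge_scores  = np.array([0.72, 0.55, 0.91, 0.40, 0.83])   # ensemble judge scores (made up)
native_scores = np.array([0.68, 0.50, 0.88, 0.45, 0.79])   # benchmark_native_score (made up)

# 1. Spearman rank correlation: does the judge ordering track the ground-truth metric?
rho, p_value = spearmanr(judge_scores, native_scores)

# 2. OLS calibration: linear map from judge scores to the native metric,
#    useful for spotting systematic judge bias or drift.
slope, intercept = np.polyfit(judge_scores, native_scores, deg=1)
print(f"Spearman rho={rho:.3f} (p={p_value:.3f}); native ≈ {slope:.2f}*judge + {intercept:.2f}")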

Label Evaluation (Classification Tasks)

For classification and information-extraction tasks where the correct output is a discrete label, use label_attributes for judge-free exact-match evaluation:

tasks:
  - id: sentiment_classification
    label_attributes: [sentiment]
    ...

When label_attributes is set, Phase 5 uses exact-match label evaluation instead of an LLM judge — no judge model is required, no judge bias is introduced.
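
Conceptually, the label check reduces to a normalised string comparison, roughly like this sketch; the built-in evaluator may normalise differently.

def exact_match(predicted: str, gold: str) -> float:
    norm = lambda s: s.strip().lower()
    return 1.0 if norm(predicted) == norm(gold) else 0.0

assert exact_match(" Positive ", "positive") == 1.0
assert exact_match("negative", "positive") == 0.0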


Reproducing Published Results

If someone shares an exported benchmark package with you, follow these steps:

Step 1: Place the Datapoint Files

mkdir -p benchmark/runs/my-repro-v1/phase3_datapoints

cp exports/summarization-benchmark-v1/datapoints/*.jsonl \
   benchmark/runs/my-repro-v1/phase3_datapoints/

Step 2: Create a Config Using interface: benchmark

experiment_id: my-repro-v1
phases: [4, 5]

models:
  - name: gpt-4o-mini
    interface: benchmark
    roles: [teacher]

  - name: gpt-4o
    interface: openai
    roles: [student]

  - name: claude-3-5-haiku
    interface: anthropic
    roles: [student]

  - name: gpt-4o-mini
    interface: openai
    roles: [judge]

tasks:
  - id: text_summarization
    description: >
      Summarise a news article in 1–3 concise sentences.
    items_per_teacher: 80

output_dir: benchmark/runs/

Step 3: Run Phases 4 and 5

coeval run --config benchmark/configs/repro-summarization.yaml

CoEval will skip Phase 3 generation entirely (since phases: [4, 5] is set) and instead load the pre-existing JSONL files from phase3_datapoints/.

Exporting Your Own Benchmark Package

A reusable benchmark package is a directory containing:

  1. The Phase 3 JSONL datapoint files
  2. The Phase 2 rubric JSON files
  3. A benchmark_info.yaml manifest

Manifest example:

name: summarization-benchmark-v1
version: "1.0"
description: >
  A synthetic news summarization benchmark generated by CoEval using
  GPT-4o-mini as teacher. Covers 4 domains and 3 article lengths.
created_at: "2026-03-01"
created_by: "your-team@example.com"
coeval_version: "0.3.0"
license: CC-BY-4.0
citation: >
  If you use this benchmark, please cite: <your citation here>

tasks:
  - id: text_summarization
    description: >
      Summarise a news article in 1–3 concise sentences.
    datapoints_file: datapoints/text_summarization.gpt-4o-mini.datapoints.jsonl
    rubric_file: rubrics/text_summarization.rubric.json
    item_count: 80
    teacher_model: gpt-4o-mini
    target_attributes:
      article_length: [short, medium, long]
      domain: [technology, politics, science, business]
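
A quick sketch to sanity-check a package against its manifest before sharing it; this helper is not part of CoEval, and the file keys simply follow the manifest example above.

from pathlib import Path
import yaml   # pip install pyyaml

pkg = Path("exports/summarization-benchmark-v1")
manifest = yaml.safe_load((pkg / "benchmark_info.yaml").read_text())

for task in manifest["tasks"]:
    for key in ("datapoints_file", "rubric_file"):
        assert (pkg / task[key]).exists(), f"missing {task[key]} for task {task['id']}"
print(f"{manifest['name']} v{manifest['version']}: {len(manifest['tasks'])} task(s) look complete")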

Best practices:

  • Lock your Phase 3 data before publishing — treat JSONL files as immutable after publication
  • Always include the rubric JSON — judges need it to score responses fairly
  • Document the teacher model — the quality of reference_response values depends heavily on it
  • Use --continue for large exports to resume interrupted generation
  • Version your benchmark packages (e.g., my-benchmark-v1, my-benchmark-v2)

Example Configurations

File Description
Runs/mixed/mixed.yaml Mixed benchmark (OpenAI models + real datasets, ~$0.03)
Runs/education/education.yaml Education benchmark: 3 real-dataset tasks + synthetic tasks, 6 models
Runs/paper/paper_benchmarks.yaml Paper evaluation config: 8 students, 3 judges, all 4 benchmark tasks
Runs/paper/paper_dual_track.yaml Dual-track paper config: benchmark + generative teacher ablation

Frequently Asked Questions

Q: What benchmark datasets are available out of the box? A: Twenty-six datasets have pre-ingested loaders in Public/benchmark/loaders/: python -m benchmark.setup_mixed ingests XSum, CodeSearchNet, AESLC, and WikiTableQuestions; python -m benchmark.setup_education ingests ARC-Challenge, RACE-High, and SciQ; python -m benchmark.emit_datapoints --dataset <name> ingests any of the remaining 19 datasets (MATH, MBPP, BIG-Bench Hard, LogiQA, WinoGrande, MultiNLI, COPA, CosmosQA, BBQ, TriviaQA, SQuAD 2.0, NQ Open, NarrativeQA, CNN/DailyMail, SAMSum, FEVER, SciFact, MGSM, MathQA). An additional six datasets — MMLU, HellaSwag, TruthfulQA, HumanEval, MedQA, and GSM8K — are available via coeval ingest (built-in CLI adapters).

Q: What does coeval ingest do? A: coeval ingest converts an external JSONL dataset into Phase 3 datapoint format, writing files to benchmark/runs/{run-id}/phase3_datapoints/. The input JSONL must have at minimum prompt and reference_response fields. Once ingested, the dataset can be used as a virtual teacher with interface: benchmark — no LLM API calls are made for Phase 3.

Q: How does the interface: benchmark virtual teacher work? A: A model with interface: benchmark is skipped entirely during Phase 3. Instead of generating datapoints via LLM calls, CoEval loads pre-ingested JSONL files from the phase3_datapoints/ folder. Phases 4 and 5 then run normally — only student responses and judge evaluations require API calls.

Q: What is stratified sampling and why does it matter? A: When emitting datapoints from a large benchmark, CoEval applies stratified sampling across all attribute value combinations. This ensures that the selected items cover the full difficulty/domain distribution of the dataset rather than clustering at the most common values. For example, 620 XSum items are drawn from 24 strata (4 complexity levels × 6 domain values) with roughly equal representation per stratum.

Q: How do I reproduce someone else's published benchmark results? A: Place their exported Phase 3 JSONL files in your phase3_datapoints/ folder, create a config with interface: benchmark as the teacher, and run coeval run with phases set to skip Phase 3 (attribute_mapping: Keep, rubric_mapping: Keep, data_generation: Keep). Your student and judge models run against the original benchmark items without regenerating anything.

Q: How do I write a custom dataset loader for a new benchmark? A: Create a Python module in Public/benchmark/loaders/ that subclasses BenchmarkLoader and implements _load_dataset() and _to_record(). Register it in Public/benchmark/loaders/__init__.py with a dataset ID and attribute map path, then add it to Public/benchmark/emit_datapoints.py's _DATASETS dict. The loader is then available to python -m benchmark.emit_datapoints and setup scripts.


← Running · Reports →