Summary
Treat benchmark dataset generation as a first-class, reproducible contract by mandating splits.yaml for all finished benchmarks and unifying generator entrypoints around it. Dataset cases are organized under dataset/cases/<split>/<case_id>/, even for benchmarks that only define a single split (e.g. test).
Why
Parameters that control dataset construction are currently hardcoded or scattered across ad-hoc CLI flags; readers must reverse-engineer generator code to understand how cases are built.
A single, mandatory splits.yaml surfaces those parameters explicitly and makes extending or modifying a dataset an editing task rather than a code task.
A consistent split layout prevents ad-hoc per-benchmark conventions from diverging.
Generators that are driven by YAML can reproduce the full dataset, a single split, or a custom split without requiring manual reorganization.
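Scope
Define the split layout convention: dataset/cases/<split>/<case_id>/ (replaces the flat dataset/cases/<case_id>/ for finished benchmarks).
Define the mandatory splits.yaml location: benchmarks/<name>/splits.yaml (committed, canonical).
Unify the generator entrypoint: python benchmarks/<name>/generator.py path/to/splits.yaml
Dataset-construction parameters come from splits.yaml; operational flags like --help may remain as optional CLI behavior.
Define the per-benchmark splits.yaml schemas.
Update dataset/index.json: the example_smoke_case field contains a relative path, e.g. "test/case_0001".
Update docs/benchmark_contract.md to cover the new contract.
Update scripts/validate_benchmark_contract.py to enforce the new rules, including the presence of splits.yaml.
Out Of Scope
Regenerating existing cases: they are moved with git mv to the new directories to ensure cases don't drift.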
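As a minimal sketch of the layout convention and the YAML-first entrypoint described under Scope, the Python below builds a case path under dataset/cases/<split>/<case_id>/ and defines a command-line parser whose only required argument is the splits.yaml path. The names case_dir, build_parser, and main are hypothetical and not taken from any existing generator; the sketch stops short of parsing the YAML or generating cases.

```python
import argparse
from pathlib import Path


def case_dir(benchmark_root: Path, split: str, case_id: str) -> Path:
    """Return the directory for one case under the split layout."""
    # Layout from the proposal: <benchmark_root>/dataset/cases/<split>/<case_id>/
    return benchmark_root / "dataset" / "cases" / split / case_id


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose only required argument is the splits.yaml path."""
    parser = argparse.ArgumentParser(
        description="Generate benchmark cases from a splits.yaml file."
    )
    # A positional argument is required by default, so invoking the generator
    # without a YAML path makes argparse exit with a usage error.
    parser.add_argument("splits_yaml", type=Path, help="path to splits.yaml")
    return parser


def main(argv: list[str]) -> int:
    """Hypothetical entrypoint: validate the argument, then hand off."""
    args = build_parser().parse_args(argv)
    if not args.splits_yaml.is_file():
        print(f"splits.yaml not found: {args.splits_yaml}")
        return 1
    # Real generation logic would start here; it is out of scope for the sketch.
    print(f"would generate cases from {args.splits_yaml}")
    return 0
```

A generator script could call main(sys.argv[1:]) at the bottom of the file. Because the YAML path is positional, --help keeps working as ordinary argparse behavior without any extra code.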
Acceptance Criteria
splits.yaml is mandatory for every finished benchmark; generators fail when invoked without a YAML path.
The generator entrypoint consistently takes a splits.yaml path as its required argument.
The split layout (dataset/cases/<split>/<case_id>/) is documented as the canonical dataset layout.
Finished benchmarks have migrated to the new contract.
scripts/validate_benchmark_contract.py enforces the new rules.
The verifier contract is unchanged.
The flat layout is removed.
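The kind of check a contract validator might run for these criteria can be sketched as follows. This is not the code in scripts/validate_benchmark_contract.py; the function name find_contract_violations is hypothetical. It also rests on one stated assumption: a case directory contains files directly, while a split directory contains only case directories.

```python
from pathlib import Path


def find_contract_violations(benchmark_root: Path) -> list[str]:
    """Return human-readable violations of the split-layout contract."""
    violations: list[str] = []

    # Rule 1: a committed splits.yaml must sit at the benchmark root.
    if not (benchmark_root / "splits.yaml").is_file():
        violations.append("missing splits.yaml")

    cases_root = benchmark_root / "dataset" / "cases"
    if not cases_root.is_dir():
        violations.append("missing dataset/cases directory")
        return violations

    # Rule 2: every entry under dataset/cases/ must be a split directory that
    # holds case directories. Assumption: case directories contain files, so
    # files found directly inside a first-level directory mean that directory
    # is a flat-layout case rather than a split.
    for entry in sorted(cases_root.iterdir()):
        if not entry.is_dir():
            violations.append(f"unexpected file in dataset/cases: {entry.name}")
        elif any(child.is_file() for child in entry.iterdir()):
            violations.append(f"flat-layout case found: {entry.name}")

    return violations
```

Returning a list of messages instead of raising on the first problem lets a validator report every violation for a benchmark in one run.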
Notes
Splits are a dataset organization and generator-contract convention. The verifier should remain fully agnostic to whether cases come from a split subdirectory or the flat root.
The reference implementation should be chosen from a benchmark whose generator is already stable.
A shared top-level shape like splits: makes sense across benchmarks, but the exact per-split parameters should remain benchmark-owned where needed, rather than forcing one universal schema across all generators.
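One way to read the last note is that shared tooling checks only the common top-level shape and leaves per-split parameters to each benchmark. The sketch below works on an already-parsed splits.yaml (a plain Python dict) so that it needs no YAML library. The function name check_splits_shape is hypothetical, and the shape it checks, a splits key mapping split names to parameter mappings, is an assumption; the proposal itself only names the top-level splits: key.

```python
from typing import Any


def check_splits_shape(config: Any) -> list[str]:
    """Check only the shared top-level shape of a parsed splits.yaml."""
    if not isinstance(config, dict) or "splits" not in config:
        return ["top-level 'splits' key is missing"]

    splits = config["splits"]
    if not isinstance(splits, dict) or not splits:
        return ["'splits' must be a non-empty mapping of split name to parameters"]

    errors: list[str] = []
    for name, params in splits.items():
        # Per-split parameters are benchmark-owned, so only their container
        # type is checked here; their keys and values are never inspected.
        if not isinstance(params, dict):
            errors.append(f"split {name!r} must map to a mapping of parameters")
    return errors
```

For example, check_splits_shape({"splits": {"test": {"num_cases": 10, "seed": 0}}}) returns an empty list. Here num_cases and seed are made-up parameters that a benchmark would be free to define or replace.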