This repository currently lacks continuous integration. As the benchmark suite grows and more benchmarks undergo refactoring, we need automated validation that PRs do not break verifier correctness or violate repository conventions.
Motivation
The repository follows an end-to-end benchmark pattern: generators, datasets, tests, verifiers, and optional visualizers. CI should enforce the public contract for finished benchmarks while leaving unfinished benchmarks exempt until they are explicitly promoted in benchmarks/finished_benchmarks.json.
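The promotion mechanism above can be sketched in a few lines. This is a minimal illustration, assuming benchmarks/finished_benchmarks.json is a flat JSON array of benchmark directory names; the real schema in the repository may differ:

```python
import json
from pathlib import Path

def finished_benchmarks(manifest: Path) -> set[str]:
    # Assumption: the manifest is a flat JSON array of benchmark names,
    # e.g. ["maze", "graph_color"]. Adjust if the real schema differs.
    return set(json.loads(manifest.read_text()))

def is_enforced(name: str, finished: set[str]) -> bool:
    # Unfinished benchmarks stay exempt until explicitly promoted
    # into finished_benchmarks.json.
    return name in finished
```

CI would call is_enforced per benchmark directory and skip contract checks for anything not yet promoted.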
Acceptance Criteria
- CI runs uv run pytest tests/ on every pull request to main
- The public contract for finished benchmarks is documented in docs/benchmark_contract.md
- Finished benchmarks are those listed in benchmarks/finished_benchmarks.json
- Every finished benchmark provides:
  - a README.md
  - a generator (generator.py or generator/run.py)
  - a verifier (verifier.py or verifier/run.py)
  - dataset cases under dataset/cases/<case_id>/...
- dataset/index.json is optional
- dataset/source_data/ is not tracked
- No sys.path hacks; modules use from benchmarks. imports
Out of Scope (future issues)
- Visualizer validation
- Enforcing unfinished benchmarks
- Full benchmark README/docs sync
- Requiring dataset example solutions for every finished benchmark
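A CI contract check for the per-benchmark file layout could look like the sketch below. The file names mirror the acceptance criteria in this issue (generator.py or generator/run.py, and so on); treat the exact directory layout as an assumption, not a description of existing repo code:

```python
from pathlib import Path

def contract_violations(bench_dir: Path) -> list[str]:
    """Return a list of contract violations for one finished benchmark.

    An empty list means the benchmark satisfies the layout contract.
    """
    problems = []
    if not (bench_dir / "README.md").is_file():
        problems.append("missing README.md")
    # Generator and verifier each allow a single-file or package form.
    if not any((bench_dir / p).is_file() for p in ("generator.py", "generator/run.py")):
        problems.append("missing generator.py or generator/run.py")
    if not any((bench_dir / p).is_file() for p in ("verifier.py", "verifier/run.py")):
        problems.append("missing verifier.py or verifier/run.py")
    if not (bench_dir / "dataset" / "cases").is_dir():
        problems.append("missing dataset/cases/")
    # dataset/index.json is optional, so it is deliberately not checked.
    if (bench_dir / "dataset" / "source_data").exists():
        problems.append("dataset/source_data/ should not be tracked")
    return problems
```

A pytest in tests/ could then iterate over the names in benchmarks/finished_benchmarks.json and assert contract_violations(dir) == [] for each, giving one readable failure message per missing file.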