Harden benchmark generator contract with mandatory splits.yaml and split-aware dataset layout

## Summary

Treat benchmark dataset generation as a first-class, reproducible contract by mandating `splits.yaml` for all finished benchmarks and unifying generator entrypoints around it. Dataset cases are organized under `dataset/cases/<split>/<case_id>/`, even for benchmarks that only define a single split (e.g. `test`).

## Why

- Parameters that control dataset construction are currently hardcoded or scattered across ad-hoc CLI flags; readers must reverse-engineer generator code to understand how cases are built.
- A single, mandatory `splits.yaml` surfaces those parameters explicitly and makes extending or modifying a dataset an editing task rather than a code task.
- A consistent split layout prevents ad-hoc per-benchmark conventions from diverging.
- Generators that are driven by YAML can reproduce the full dataset, a single split, or a custom split without requiring manual reorganization.

## Scope

- Define the **split layout convention**: `dataset/cases/<split>/<case_id>/` (replaces the flat `dataset/cases/<case_id>/` for finished benchmarks).
- Define the **mandatory `splits.yaml`** location: `benchmarks/<name>/splits.yaml` (committed, canonical).
- Specify the **generator CLI contract**:
  - Entrypoint: `python benchmarks/<name>/generator.py path/to/splits.yaml`
  - Running without arguments is an error (prints usage).
  - All dataset-construction arguments move into `splits.yaml`; operational flags like `--help` may remain as optional CLI behavior.
- Specify two supported `splits.yaml` schemas:
  1. **Split parameters** — for algorithmic generators that build cases per split:
     ```yaml
     splits:
       easy:
         seed: 42
         case_count: 5
         max_satellites: 3
       hard:
         seed: 142
         case_count: 5
         max_satellites: 12
     ```
  2. **Split assignments** — for fixed-case benchmarks that bucket existing cases:
     ```yaml
     splits:
       test:
         - case_001
         - case_002
       train:
         - case_003
         - case_004
     ```
- Update `dataset/index.json`:
  - Add an optional `example_smoke_case` field containing a relative path, e.g. `"test/case_0001"`.
- Update the benchmark template and `docs/benchmark_contract.md` to cover the new contract.
- Update `scripts/validate_benchmark_contract.py`:
  - Validate split directories match `splits.yaml`.
  - Update smoke-test case resolution.
  - Enforce the new generator CLI shape.
- Apply the new contract to all benchmark generators.
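
A minimal sketch of the generator CLI shape described above, assuming the generator uses `argparse`. The names `build_parser`, `main`, and `splits_yaml` are illustrative and not part of the contract. Declaring the path as a required positional argument gives the "no arguments is an error" behavior directly: `argparse` prints usage and exits with a non-zero status.

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser: one required positional path, plus the built-in --help."""
    parser = argparse.ArgumentParser(
        prog="generator.py",
        description="Generate benchmark cases from a splits.yaml file.",
    )
    # Hypothetical argument name; the contract only requires a path to splits.yaml.
    parser.add_argument("splits_yaml", type=Path, help="path to splits.yaml")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> Path:
    """Parse arguments and return the validated splits.yaml path."""
    args = build_parser().parse_args(argv)
    if not args.splits_yaml.is_file():
        raise SystemExit(f"splits file not found: {args.splits_yaml}")
    # Dataset construction would start here, driven only by the YAML contents.
    return args.splits_yaml
```

A real generator would call `main()` under an `if __name__ == "__main__":` guard. No dataset-construction flags appear on the parser, so every parameter that shapes the dataset has to live in `splits.yaml`.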

## Out Of Scope

- Changing the verifier interface — verifiers operate on individual case directories and do not need to be split-aware.
- Solution code or baselines.
- Regenerating cases: existing cases should be moved to their new directories with `git mv` so that their contents do not drift.
- Updating i18n docs (will be handled in issue #27).

## Acceptance Criteria

- [ ] `splits.yaml` is mandatory for every finished benchmark; generators fail when invoked without a YAML path.
- [ ] Generator entrypoint consistently takes a `splits.yaml` path as its required argument.
- [ ] Split layout (`dataset/cases/<split>/<case_id>/`) is documented as the canonical dataset layout.
- [ ] Finished benchmarks have migrated to the new contract.
- [ ] `scripts/validate_benchmark_contract.py` enforces the new rules.
- [ ] The verifier contract is unchanged.
- [ ] Flat layout is removed.
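
The validator checks listed in Scope (split directories match `splits.yaml`, smoke-test case resolution) could look roughly like the sketch below. It is not the actual `scripts/validate_benchmark_contract.py` code. The function names are hypothetical, `splits` is assumed to be the already-parsed mapping under the top-level `splits:` key, and `example_smoke_case` is assumed to be relative to `dataset/cases/`.

```python
from pathlib import Path
from typing import List, Optional


def check_split_layout(cases_root: Path, splits: dict) -> List[str]:
    """Return human-readable problems; an empty list means the layout matches."""
    problems = []
    on_disk = {p.name for p in cases_root.iterdir() if p.is_dir()}
    declared = set(splits)
    for name in sorted(declared - on_disk):
        problems.append(f"split '{name}' is declared in splits.yaml but has no directory")
    for name in sorted(on_disk - declared):
        problems.append(f"directory '{name}' is not declared in splits.yaml")
    # For the split-assignments schema, each listed case must exist under its split.
    for name in sorted(declared & on_disk):
        if isinstance(splits[name], list):
            for case_id in splits[name]:
                if not (cases_root / name / case_id).is_dir():
                    problems.append(f"case '{name}/{case_id}' is missing")
    return problems


def resolve_smoke_case(cases_root: Path, index: dict) -> Optional[Path]:
    """Resolve the optional example_smoke_case field from a parsed index.json."""
    rel = index.get("example_smoke_case")
    if rel is None:
        return None  # the field is optional
    path = cases_root / rel  # assumption: relative to dataset/cases/
    if not path.is_dir():
        raise FileNotFoundError(f"smoke case directory not found: {path}")
    return path
```

Under this sketch, a case directory left directly under `dataset/cases/` from the old flat layout is reported as an undeclared split, which is one way to enforce the last criterion.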

## Notes

- Splits are a dataset organization and generator-contract convention. The verifier should remain fully agnostic to whether cases come from a split subdirectory or the flat root.
- The reference implementation should be chosen from a benchmark whose generator is already stable.
- A shared top-level shape like `splits:` makes sense across benchmarks, but the exact per-split parameters should remain benchmark-owned where needed, rather than forcing one universal schema across all generators.
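
One way to keep the top-level shape shared while leaving per-split parameters benchmark-owned is to validate only the outer structure and classify which of the two schemas a file uses. The sketch below is illustrative: the function names are hypothetical, `doc` is assumed to be the result of parsing `splits.yaml` (for example with PyYAML's `yaml.safe_load`), and rejecting files that mix both schemas is an assumption this issue does not state.

```python
def classify_split(value) -> str:
    """Classify one split entry: a mapping is 'parameters', a list of IDs is 'assignments'."""
    if isinstance(value, dict):
        return "parameters"
    if isinstance(value, list) and all(isinstance(c, str) for c in value):
        return "assignments"
    raise ValueError(f"unsupported split entry: {value!r}")


def classify_splits(doc: dict) -> str:
    """Check the shared top-level shape and return the schema kind used by the file."""
    splits = doc.get("splits") if isinstance(doc, dict) else None
    if not isinstance(splits, dict) or not splits:
        raise ValueError("splits.yaml must contain a non-empty top-level 'splits' mapping")
    kinds = {classify_split(value) for value in splits.values()}
    if len(kinds) != 1:
        # Assumption: a single file uses one schema for all of its splits.
        raise ValueError("splits.yaml mixes split parameters and split assignments")
    return kinds.pop()


# Parsed forms of the two examples from the Scope section.
parameter_doc = {
    "splits": {
        "easy": {"seed": 42, "case_count": 5, "max_satellites": 3},
        "hard": {"seed": 142, "case_count": 5, "max_satellites": 12},
    }
}
assignment_doc = {
    "splits": {
        "test": ["case_001", "case_002"],
        "train": ["case_003", "case_004"],
    }
}
```

The keys inside each parameter mapping (`seed`, `max_satellites`, and so on) are deliberately not inspected, so each benchmark remains free to define its own.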

