Harden benchmark generator contract with mandatory splits.yaml and split-aware dataset layout #19

@Mtrya

Summary

Treat benchmark dataset generation as a first-class, reproducible contract by mandating splits.yaml for all finished benchmarks and unifying generator entrypoints around it. Dataset cases are organized under dataset/cases/<split>/<case_id>/, even for benchmarks that only define a single split (e.g. test).

Why

  • Parameters that control dataset construction are currently hardcoded or scattered across ad-hoc CLI flags; readers must reverse-engineer generator code to understand how cases are built.
  • A single, mandatory splits.yaml surfaces those parameters explicitly and makes extending or modifying a dataset an editing task rather than a code task.
  • A consistent split layout prevents ad-hoc per-benchmark conventions from diverging.
  • Generators that are driven by YAML can reproduce the full dataset, a single split, or a custom split without requiring manual reorganization.

Scope

  • Define the split layout convention: dataset/cases/<split>/<case_id>/ (replaces the flat dataset/cases/<case_id>/ for finished benchmarks).
  • Define the mandatory splits.yaml location: benchmarks/<name>/splits.yaml (committed, canonical).
  • Specify the generator CLI contract:
    • Entrypoint: python benchmarks/<name>/generator.py path/to/splits.yaml
    • Running without arguments is an error (prints usage).
    • All dataset-construction arguments move into splits.yaml; operational flags like --help may remain as optional CLI behavior.
  • Specify two supported splits.yaml schemas:
    1. Split parameters — for algorithmic generators that build cases per split:
      splits:
        easy:
          seed: 42
          case_count: 5
          max_satellites: 3
        hard:
          seed: 142
          case_count: 5
          max_satellites: 12
    2. Split assignments — for fixed-case benchmarks that bucket existing cases:
      splits:
        test:
          - case_001
          - case_002
        train:
          - case_003
          - case_004
  • Update dataset/index.json:
    • Add an optional example_smoke_case field containing a relative path of the form <split>/<case_id>, e.g. "test/case_0001".
  • Update the benchmark template and docs/benchmark_contract.md to cover the new contract.
  • Update scripts/validate_benchmark_contract.py:
    • Validate split directories match splits.yaml.
    • Update smoke-test case resolution.
    • Enforce the new generator CLI shape.
  • Apply the new contract to all benchmark generators.
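
The CLI contract and the two splits.yaml schemas above could be wired together roughly as follows. This is a minimal sketch, not the project's implementation: it assumes PyYAML is installed, and it assumes the two schemas can be told apart by whether each split maps to a parameter mapping or to a list of case IDs. The function names (parse_args, load_splits, classify_splits, main) are hypothetical.

```python
import argparse
import sys
from pathlib import Path


def parse_args(argv):
    """Require exactly one positional argument: the path to splits.yaml.

    argparse prints usage and exits with status 2 when it is missing.
    """
    parser = argparse.ArgumentParser(
        description="Generate benchmark cases from a splits.yaml file."
    )
    parser.add_argument("splits_path", type=Path, help="path/to/splits.yaml")
    return parser.parse_args(argv)


def load_splits(path):
    """Parse splits.yaml into plain Python objects."""
    import yaml  # PyYAML; assumed to be installed (not part of the stdlib)

    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh)


def classify_splits(config):
    """Return "parameters" or "assignments" for a parsed splits.yaml."""
    splits = config.get("splits") if isinstance(config, dict) else None
    if not isinstance(splits, dict) or not splits:
        raise ValueError("splits.yaml needs a non-empty top-level 'splits' mapping")
    if all(isinstance(value, dict) for value in splits.values()):
        return "parameters"  # schema 1: per-split generator parameters
    if all(isinstance(value, list) for value in splits.values()):
        return "assignments"  # schema 2: per-split lists of existing case IDs
    raise ValueError("each split must be a parameter mapping or a list of case IDs")


def main(argv=None):
    args = parse_args(sys.argv[1:] if argv is None else argv)
    config = load_splits(args.splits_path)
    kind = classify_splits(config)
    for split_name in config["splits"]:
        # Placeholder for the benchmark-owned generation logic.
        print(f"{kind}: would write cases under dataset/cases/{split_name}/")
```

In a real generator.py, main() would be called under the usual `if __name__ == "__main__":` guard. The per-split parameters themselves are left uninterpreted here, since they are benchmark-owned.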

Out Of Scope

  • Changing the verifier interface — verifiers operate on individual case directories and do not need to be split-aware.
  • Solution code or baselines.
  • Regenerating cases: existing cases should be moved to the new directories manually with git mv so that their contents do not drift.
  • Updating i18n docs (to be handled in issue #27, "CI: Add i18n Documentation Sync Check").
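
Although regeneration is out of scope, the manual git mv step for a fixed-case benchmark can be planned mechanically from its split assignments. The sketch below only builds the shell commands as strings and does not run them. The helper name plan_git_moves and the default dataset/cases root are assumptions for illustration. A mkdir -p is emitted per split because git mv expects the destination's parent directory to exist.

```python
from pathlib import PurePosixPath


def plan_git_moves(assignments, cases_root="dataset/cases"):
    """Return shell commands that move flat case dirs into split subdirectories.

    `assignments` maps a split name to a list of case IDs, mirroring the
    "split assignments" schema of splits.yaml.
    """
    root = PurePosixPath(cases_root)
    commands = []
    for split, case_ids in assignments.items():
        # Create the split directory first so git mv has a valid destination.
        commands.append(f"mkdir -p {root / split}")
        for case_id in case_ids:
            commands.append(f"git mv {root / case_id} {root / split / case_id}")
    return commands


for command in plan_git_moves({"test": ["case_001", "case_002"]}):
    print(command)
```

Printing the plan rather than executing it keeps the move a reviewed, manual step, in line with the intent above.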

Acceptance Criteria

  • splits.yaml is mandatory for every finished benchmark; generators fail when invoked without a YAML path.
  • Generator entrypoint consistently takes a splits.yaml path as its required argument.
  • Split layout (dataset/cases/<split>/<case_id>/) is documented as the canonical dataset layout.
  • Finished benchmarks have migrated to the new contract.
  • scripts/validate_benchmark_contract.py enforces the new rules.
  • The verifier contract is unchanged.
  • The flat layout (dataset/cases/<case_id>/) is removed from finished benchmarks.
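
Two of the validator checks implied by these criteria can be sketched as below. This is an illustration under assumptions, not the actual contents of scripts/validate_benchmark_contract.py: the function names are hypothetical, and example_smoke_case is assumed to be resolved relative to dataset/cases/.

```python
from pathlib import Path


def check_split_dirs(cases_root, split_names):
    """Compare directories under cases_root with the splits declared in splits.yaml.

    Returns a list of human-readable problems; an empty list means they match.
    Leftover flat-layout case directories show up as undeclared directories.
    """
    cases_root = Path(cases_root)
    expected = set(split_names)
    actual = set()
    if cases_root.is_dir():
        actual = {p.name for p in cases_root.iterdir() if p.is_dir()}
    problems = [f"missing split directory: {name}" for name in sorted(expected - actual)]
    problems += [
        f"directory not declared in splits.yaml: {name}"
        for name in sorted(actual - expected)
    ]
    return problems


def resolve_smoke_case(cases_root, example_smoke_case):
    """Resolve a '<split>/<case_id>' value from index.json to a case directory."""
    parts = Path(example_smoke_case).parts
    if len(parts) != 2:
        raise ValueError("example_smoke_case must look like '<split>/<case_id>'")
    case_dir = Path(cases_root).joinpath(*parts)
    if not case_dir.is_dir():
        raise FileNotFoundError(case_dir)
    return case_dir
```

Neither check touches the verifier, which continues to receive a single case directory.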

Notes

  • Splits are a dataset organization and generator-contract convention. The verifier should remain fully agnostic to whether cases come from a split subdirectory or the flat root.
  • The reference implementation should be chosen from a benchmark whose generator is already stable.
  • A shared top-level shape like splits: makes sense across benchmarks, but the exact per-split parameters should remain benchmark-owned where needed, rather than forcing one universal schema across all generators.

Labels: enhancement