Add CI workflow with benchmark convention enforcement #24

@Mtrya

This repository currently lacks continuous integration. As the benchmark suite grows and more benchmarks undergo refactoring, we need automated validation that PRs do not break verifier correctness or violate repository conventions.

Motivation

The repository follows an end-to-end benchmark pattern: generators, datasets, tests, verifiers, and optional visualizers. CI should enforce the public contract for finished benchmarks while leaving unfinished benchmarks exempt until they are explicitly promoted in benchmarks/finished_benchmarks.json.
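A minimal sketch of how the allowlist could gate enforcement. The issue does not specify the schema of benchmarks/finished_benchmarks.json, so this assumes a plain JSON array of benchmark directory names; the function names load_finished_benchmarks and is_enforced are hypothetical.

```python
import json
from pathlib import Path


def load_finished_benchmarks(repo_root: Path) -> list[str]:
    """Read the allowlist of finished benchmarks.

    Assumption: the file is a JSON array of benchmark directory names,
    e.g. ["alpha", "beta"]. The real schema may differ.
    """
    allowlist = repo_root / "benchmarks" / "finished_benchmarks.json"
    # A missing file raises FileNotFoundError, which fails CI loudly.
    names = json.loads(allowlist.read_text(encoding="utf-8"))
    if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
        raise ValueError(f"{allowlist} must contain a JSON array of strings")
    # De-duplicate and sort so CI output is stable between runs.
    return sorted(set(names))


def is_enforced(benchmark_name: str, finished: list[str]) -> bool:
    """Unfinished benchmarks stay exempt until they appear in the allowlist."""
    return benchmark_name in finished
```

With this shape, promoting a benchmark is a one-line change to the JSON file, and every CI check can iterate over the same list.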

Acceptance Criteria

  • GitHub Actions workflow runs on PRs and pushes to main
  • All tests pass: uv run pytest tests/
  • Public benchmark contract doc exists at docs/benchmark_contract.md
  • Finished benchmark allowlist exists at benchmarks/finished_benchmarks.json
  • Benchmark structure validation script passes for finished benchmarks only:
    • each finished benchmark has README.md
    • generator exists (generator.py or generator/run.py)
    • verifier exists (verifier.py or verifier/run.py)
    • dataset follows dataset/cases/<case_id>/...
    • dataset/index.json is optional
    • dataset/source_data/ is not tracked
  • Finished benchmark verifiers can load canonical cases without error
  • Finished benchmark public generator/verifier code contains no sys.path hacks
  • Finished benchmark public generator/verifier code contains no imports beginning with "from benchmarks."
  • Dedicated reproducibility workflow exists for finished benchmarks whose metadata enables generator-owned output checks
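The structure and import criteria above could be checked roughly as follows. This is a sketch under stated assumptions, not the repository's actual script: check_structure and find_forbidden_code are hypothetical names, the tracked-file list is assumed to be supplied by the caller (for example from git ls-files), and requiring at least one case directory is an interpretation of the dataset/cases/<case_id>/... rule.

```python
import ast
from pathlib import Path
from typing import Iterable


def check_structure(bench_dir: Path, tracked_files: Iterable[str] = ()) -> list[str]:
    """Return violations for one finished benchmark directory.

    tracked_files: git-tracked paths relative to bench_dir (POSIX style).
    Assumption: the caller collects these, e.g. via `git ls-files`.
    """
    errors: list[str] = []
    if not (bench_dir / "README.md").is_file():
        errors.append("missing README.md")
    for role in ("generator", "verifier"):
        single = bench_dir / f"{role}.py"
        package = bench_dir / role / "run.py"
        if not (single.is_file() or package.is_file()):
            errors.append(f"missing {role}.py or {role}/run.py")
    cases = bench_dir / "dataset" / "cases"
    # Assumption: a finished benchmark has at least one case directory.
    if not (cases.is_dir() and any(p.is_dir() for p in cases.iterdir())):
        errors.append("no case directories under dataset/cases/")
    # dataset/index.json is optional, so it is deliberately not checked.
    if any(f.startswith("dataset/source_data/") for f in tracked_files):
        errors.append("dataset/source_data/ is tracked")
    return errors


def find_forbidden_code(source: str) -> list[str]:
    """Flag `from benchmarks.` imports and any reference to sys.path.

    Conservative sketch: every `sys.path` attribute access is flagged,
    while `from sys import path` and bare `from benchmarks import ...`
    are not detected.
    """
    found: list[tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            module = node.module or ""
            # level == 0 means an absolute import; relative imports are fine.
            if node.level == 0 and module.startswith("benchmarks."):
                found.append((node.lineno, f"from {module} import"))
        elif isinstance(node, ast.Attribute):
            base = node.value
            if node.attr == "path" and isinstance(base, ast.Name) and base.id == "sys":
                found.append((node.lineno, "sys.path reference"))
    return [f"line {lineno}: {what}" for lineno, what in sorted(found)]
```

Parsing with ast rather than matching text avoids false positives from comments and string literals that merely mention sys.path or benchmarks.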

Out of Scope (future issues)

  • Visualizer validation
  • Enforcing unfinished benchmarks
  • Full benchmark README/docs sync
  • Requiring dataset example solutions for every finished benchmark

Labels: enhancement