Add CI workflow with benchmark convention enforcement

This repository currently lacks continuous integration. As the benchmark suite grows and more benchmarks undergo refactoring, we need automated validation that PRs do not break verifier correctness or violate repository conventions.

## Motivation

The repository follows an end-to-end benchmark pattern: generators, datasets, tests, verifiers, and optional visualizers. CI should enforce the public contract for finished benchmarks while leaving unfinished benchmarks exempt until they are explicitly promoted in `benchmarks/finished_benchmarks.json`.

## Acceptance Criteria

- [ ] GitHub Actions workflow runs on PRs and pushes to `main`
- [ ] All tests pass: `uv run pytest tests/`
- [ ] Public benchmark contract doc exists at `docs/benchmark_contract.md`
- [ ] Finished benchmark allowlist exists at `benchmarks/finished_benchmarks.json`
- [ ] Benchmark structure validation script passes for finished benchmarks only:
  - each finished benchmark has `README.md`
  - generator exists (`generator.py` or `generator/run.py`)
  - verifier exists (`verifier.py` or `verifier/run.py`)
  - dataset follows `dataset/cases/<case_id>/...`
  - `dataset/index.json` is optional
  - `dataset/source_data/` is not tracked
- [ ] Finished benchmark verifiers can load canonical cases without error
- [ ] Finished benchmark public generator/verifier code contains no `sys.path` hacks
- [ ] Finished benchmark public generator/verifier code contains no `from benchmarks.` imports
- [ ] Dedicated reproducibility workflow exists for finished benchmarks whose metadata enables generator-owned output checks
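A minimal sketch of how the structure and import checks above might be scripted, offered as a starting point rather than the final implementation. It assumes that `benchmarks/finished_benchmarks.json` is a JSON array of benchmark directory names, and that any reference to `sys.path` counts as a hack; neither is specified in this issue. The function names (`load_finished`, `check_structure`, `find_forbidden_imports`) are hypothetical. The `dataset/source_data/` tracking check and the verifier case-loading check are omitted because they depend on git state and on verifier APIs not described here.

```python
import ast
import json
from pathlib import Path


def load_finished(repo_root: Path) -> list[str]:
    """Read the allowlist. Assumed format: a JSON array of directory names."""
    allowlist = repo_root / "benchmarks" / "finished_benchmarks.json"
    return json.loads(allowlist.read_text())


def _has_any(bench_dir: Path, candidates: tuple[str, ...]) -> bool:
    """True if at least one of the candidate relative paths is a file."""
    return any((bench_dir / rel).is_file() for rel in candidates)


def check_structure(bench_dir: Path) -> list[str]:
    """Return human-readable structure violations for one benchmark."""
    errors = []
    if not (bench_dir / "README.md").is_file():
        errors.append("missing README.md")
    if not _has_any(bench_dir, ("generator.py", "generator/run.py")):
        errors.append("missing generator (generator.py or generator/run.py)")
    if not _has_any(bench_dir, ("verifier.py", "verifier/run.py")):
        errors.append("missing verifier (verifier.py or verifier/run.py)")
    cases = bench_dir / "dataset" / "cases"
    # dataset/index.json is optional, so it is deliberately not checked.
    if not cases.is_dir() or not any(p.is_dir() for p in cases.iterdir()):
        errors.append("no dataset/cases/<case_id>/ directories")
    return errors


def find_forbidden_imports(source: str) -> list[str]:
    """Flag absolute `from benchmarks.` imports and any `sys.path` reference."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.ImportFrom)
            and node.level == 0  # ignore relative imports
            and node.module is not None
            and node.module.startswith("benchmarks.")
        ):
            found.append((node.lineno, f"from {node.module} import ..."))
        elif (
            isinstance(node, ast.Attribute)
            and node.attr == "path"
            and isinstance(node.value, ast.Name)
            and node.value.id == "sys"
        ):
            # Assumption: every sys.path reference is treated as a hack.
            found.append((node.lineno, "sys.path reference"))
    return [f"line {lineno}: {message}" for lineno, message in sorted(found)]
```

A CI entry point could loop over `load_finished(repo_root)`, call `check_structure` on each benchmark directory and `find_forbidden_imports` on each public generator and verifier file, and exit non-zero if any list is non-empty. Parsing with `ast` instead of grepping avoids false positives from comments and string literals.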

## Out of Scope (future issues)

- Visualizer validation
- Enforcing conventions on unfinished benchmarks
- Full benchmark README/docs sync
- Requiring dataset example solutions for every finished benchmark
