Reproducing Results

Paper Results

Main Results (Tables 1 & 2)

.venv/bin/python scripts/aggregate_results.py \
    --filter v3 \
    --agents claude-code,codex-cli,opencode-claude,opencode-openai

Outputs to experiments/results/:

tables.md - Markdown tables
table1_main_results.json - Aggregated by agent
table2_benchmark_breakdown.json - Per-benchmark breakdown

Iteration Results (Table 3)

.venv/bin/python scripts/aggregate_iteration_results.py \
    --experiment-dir experiments/20260111_133202_iteration-n5-all-per-iter \
    --output-dir experiments/results

Outputs:

table3_iteration.md - N=1 vs N=5 comparison
table3_iteration.tex - LaTeX version
iteration_curves.csv - Per-round pass rates

Experiment Runs

Each experiment is run three times for statistical reliability:

Experiment	Agent	Description
`test-v3-claude`	Claude Code	v3 anti-cheat template testing
`test-v3-codex`	Codex CLI	v3 anti-cheat template testing
`v3-opencode-claude-full`	OpenCode (Claude)	Full benchmark with OpenCode + Claude Opus 4.5
`v3-opencode-openai-full`	OpenCode (OpenAI)	Full benchmark with OpenCode + GPT-5.2
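
The source does not say how the three runs of an experiment are combined. One common choice, sketched here as an assumption rather than the method used by the aggregation scripts, is to report the mean and sample standard deviation of the per-run pass rate. The function name `summarize_runs` and the example rates are hypothetical.

```python
import statistics


def summarize_runs(pass_rates: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) of per-run pass rates.

    Assumption: each entry is the pass rate of one repeated run of the
    same experiment, expressed as a fraction between 0 and 1.
    """
    if len(pass_rates) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(pass_rates), statistics.stdev(pass_rates)


if __name__ == "__main__":
    # Hypothetical pass rates for three runs; not taken from the paper.
    mean, spread = summarize_runs([0.5, 0.6, 0.7])
    print(f"pass rate: {mean:.3f} +/- {spread:.3f}")
```

With only three runs the standard deviation is a rough indicator of run-to-run variance, not a tight estimate.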

Benchmark Statistics

LOC Definition

Lines of Code (LOC) is counted across all code files in the source repository:

Calculated from fresh clones of source repos at specific versions (documented in metadata.yml)
Includes all programming languages and test files in the source repository
Excludes only build artifacts: node_modules, target, build, dist, vendor, .git, __pycache__
Uses cloc tool's "code" count (excludes comments and blank lines)
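
The definition above can be sketched as a small Python wrapper around cloc. This is a minimal sketch, not the project's own metadata script: the helper names are hypothetical, and it assumes cloc is on PATH and that its `--json` report contains a `SUM` entry with a `code` field.

```python
import json
import subprocess

# Directory names excluded from the count, copied from the LOC definition above.
EXCLUDED_DIRS = ["node_modules", "target", "build", "dist", "vendor", ".git", "__pycache__"]


def build_cloc_command(repo_path: str) -> list[str]:
    """Build a cloc invocation that skips build artifacts and emits JSON."""
    return ["cloc", "--json", "--exclude-dir=" + ",".join(EXCLUDED_DIRS), repo_path]


def total_code_lines(cloc_json: str) -> int:
    """Return the "code" total from a cloc JSON report (no comments or blank lines)."""
    report = json.loads(cloc_json)
    return report["SUM"]["code"]


def count_loc(repo_path: str) -> int:
    """Run cloc on a local checkout and return its total code-line count."""
    result = subprocess.run(
        build_cloc_command(repo_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return total_code_lines(result.stdout)
```

Calling `count_loc("/tmp/jq-verify")` on a fresh clone would return the number to compare against the LOC value recorded for that benchmark.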

Source Repository Versions

Each benchmark's metadata.yml contains the exact commit used:

provenance:
  source_repo: "https://github.com/jqlang/jq"
  source_commit: 71c2ab509a8628dbbad4bc7b3f98a64aa90d3297  # Required
  source_version: jq-1.7.1  # Optional - tag/version if available

The source_commit is the authoritative identifier for reproducibility. The source_version is optional and included when cloning from a release tag.
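
Because source_commit is the authoritative identifier, a checkout can be pinned to it directly instead of to a tag. The following is a minimal sketch under assumptions: the helper names are hypothetical, git is on PATH, and a full clone is acceptable (a shallow clone may not contain the commit).

```python
import subprocess


def checkout_commands(source_repo: str, source_commit: str, dest: str) -> list[list[str]]:
    """Return the git commands that clone a repo and pin it to one commit."""
    return [
        ["git", "clone", "--recurse-submodules", source_repo, dest],
        ["git", "-C", dest, "checkout", "--detach", source_commit],
        # Re-sync submodules so they match the pinned commit, not the default branch.
        ["git", "-C", dest, "submodule", "update", "--init", "--recursive"],
    ]


def checkout_pinned(source_repo: str, source_commit: str, dest: str) -> str:
    """Clone, pin to source_commit, and confirm that HEAD matches it."""
    for cmd in checkout_commands(source_repo, source_commit, dest):
        subprocess.run(cmd, check=True)
    head = subprocess.run(
        ["git", "-C", dest, "rev-parse", "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout.strip()
    if head != source_commit:
        raise RuntimeError(f"expected {source_commit}, got {head}")
    return head
```

For the jq entry above, this would be called with the source_repo and source_commit values from metadata.yml and a scratch directory such as /tmp/jq-verify.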

Verify LOC from Source

To reproduce LOC statistics, clone the source repo at the documented version:

# Example: jq-gojq (147,358 LOC)
git clone --branch jq-1.7.1 --recurse-submodules --depth 1 \
    https://github.com/jqlang/jq /tmp/jq-verify

cloc --exclude-dir=node_modules,target,build,dist,vendor,.git,__pycache__ \
     /tmp/jq-verify

# Example: gitleaks (22,358 LOC)
git clone --branch v8.30.0 --depth 1 \
    https://github.com/gitleaks/gitleaks /tmp/gitleaks-verify

cloc --exclude-dir=node_modules,target,build,dist,vendor,.git,__pycache__ \
     /tmp/gitleaks-verify

Note: Use --recurse-submodules for repos with submodules (e.g., jq includes oniguruma).

Verify Test Counts

Test counts come from pytest --collect-only, which counts the actual test cases, including each parameterized case:

source .venv/bin/activate

# Example: jmespath (888 parameterized tests)
pytest --collect-only -q dataset/jmespath/tests/ \
    --work-dir=/tmp --run-cmd=true
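
To turn the collected output into a single number, the node IDs can be counted. This is a minimal sketch with a hypothetical helper name. It assumes the `-q` output lists one node ID per line (each containing `::`) and that no other line, such as a warning, contains `::`.

```python
def count_collected_tests(collect_output: str) -> int:
    """Count test node IDs in the output of `pytest --collect-only -q`.

    Each parameterized case has its own node ID (for example
    `test_file.py::test_name[case-1]`), so it is counted separately.
    """
    return sum(1 for line in collect_output.splitlines() if "::" in line)


if __name__ == "__main__":
    import sys

    # Usage: pytest --collect-only -q <path> [options] | python count_tests.py
    print(count_collected_tests(sys.stdin.read()))
```

The result should agree with the "N tests collected" summary line that pytest prints at the end of the same output.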

Regenerate Metadata

To recalculate LOC and test counts for all benchmarks:

# Show calculated values without saving
python3 scripts/generate_metadata.py --all --calc-only

# Regenerate metadata.yml files
python3 scripts/generate_metadata.py --all
