Reproducing Results

Paper Results

Main Results (Table 1 & 2)

.venv/bin/python scripts/aggregate_results.py \
    --filter v3 \
    --agents claude-code,codex-cli,opencode-claude,opencode-openai

Outputs are written to experiments/results/:

  • tables.md - Markdown tables
  • table1_main_results.json - Aggregated by agent
  • table2_benchmark_breakdown.json - Per-benchmark breakdown

Iteration Results (Table 3)

.venv/bin/python scripts/aggregate_iteration_results.py \
    --experiment-dir experiments/20260111_133202_iteration-n5-all-per-iter \
    --output-dir experiments/results

Outputs:

  • table3_iteration.md - N=1 vs N=5 comparison
  • table3_iteration.tex - LaTeX version
  • iteration_curves.csv - Per-round pass rates

Experiment Runs

Each experiment is run 3 times for statistical reliability:

| Experiment | Agent | Description |
|---|---|---|
| test-v3-claude | Claude Code | v3 anti-cheat template testing |
| test-v3-codex | Codex CLI | v3 anti-cheat template testing |
| v3-opencode-claude-full | OpenCode (Claude) | Full benchmark with OpenCode + Claude Opus 4.5 |
| v3-opencode-openai-full | OpenCode (OpenAI) | Full benchmark with OpenCode + GPT-5.2 |
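
Because each experiment is repeated three times, a per-experiment summary usually reports a mean together with a measure of spread. The sketch below shows one way to do that with the standard library. The function name `summarize_runs` and the example values are hypothetical illustrations, not taken from the repository's scripts or results.

```python
from statistics import mean, stdev


def summarize_runs(pass_rates):
    """Return (mean, sample standard deviation) of per-run pass rates."""
    if len(pass_rates) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(pass_rates), stdev(pass_rates)


# Made-up pass rates for three runs of one experiment (illustration only).
example_runs = [0.60, 0.70, 0.80]
avg, spread = summarize_runs(example_runs)
print(f"mean={avg:.3f} stdev={spread:.3f}")
```

With only three runs the sample standard deviation is a rough estimate, so it is best read as an indication of run-to-run variation rather than a tight confidence bound.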

Benchmark Statistics

LOC Definition

Lines of Code (LOC) is the number of code lines across all code files in the source repository:

  • Calculated from fresh clones of source repos at specific versions (documented in metadata.yml)
  • Includes all programming languages and test files in the source repository
  • Excludes only build-artifact, dependency, and version-control directories: node_modules, target, build, dist, vendor, .git, __pycache__
  • Uses cloc tool's "code" count (excludes comments and blank lines)
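
The definition above can be sketched as a small Python wrapper around cloc. It relies on cloc's `--json` output, whose `SUM` entry carries the total `code` count. The helper names `code_lines_from_cloc_json` and `count_loc` are hypothetical, and this is not necessarily how scripts/generate_metadata.py computes the value.

```python
import json
import subprocess

# Directory names excluded by the LOC definition.
EXCLUDED_DIRS = "node_modules,target,build,dist,vendor,.git,__pycache__"


def code_lines_from_cloc_json(report_text):
    """Extract the total 'code' count from a cloc --json report."""
    report = json.loads(report_text)
    # cloc reports per-language entries plus a "SUM" entry with the totals.
    return report["SUM"]["code"]


def count_loc(repo_path):
    """Run cloc on repo_path and return its total code-line count."""
    result = subprocess.run(
        ["cloc", "--json", f"--exclude-dir={EXCLUDED_DIRS}", repo_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return code_lines_from_cloc_json(result.stdout)
```

Calling `count_loc("/tmp/jq-verify")` on a fresh clone should then yield a number comparable to the documented LOC, assuming cloc is installed and on PATH.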

Source Repository Versions

Each benchmark's metadata.yml contains the exact commit used:

provenance:
  source_repo: "https://github.com/jqlang/jq"
  source_commit: 71c2ab509a8628dbbad4bc7b3f98a64aa90d3297  # Required
  source_version: jq-1.7.1  # Optional - tag/version if available

The source_commit is the authoritative identifier for reproducibility. The source_version is optional and included when cloning from a release tag.
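
Since source_commit is the authoritative identifier, it can be useful to check out that exact commit rather than a tag. The following is a minimal sketch that builds the git commands for a shallow fetch of a single commit. The function name `pinned_checkout_commands` is hypothetical, and fetching by SHA assumes the remote permits it (GitHub does).

```python
import re
import subprocess


def pinned_checkout_commands(source_repo, source_commit, dest, submodules=False):
    """Build git commands that check out exactly source_commit into dest."""
    if not re.fullmatch(r"[0-9a-f]{40}", source_commit):
        raise ValueError("source_commit must be a full 40-character SHA-1")
    commands = [
        ["git", "init", dest],
        ["git", "-C", dest, "remote", "add", "origin", source_repo],
        # Shallow-fetch only the pinned commit.
        ["git", "-C", dest, "fetch", "--depth", "1", "origin", source_commit],
        ["git", "-C", dest, "checkout", "--detach", "FETCH_HEAD"],
    ]
    if submodules:
        commands.append(
            ["git", "-C", dest, "submodule", "update", "--init", "--recursive"]
        )
    return commands


def run_all(commands):
    """Execute each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

For example, `run_all(pinned_checkout_commands("https://github.com/jqlang/jq", "71c2ab509a8628dbbad4bc7b3f98a64aa90d3297", "/tmp/jq-verify", submodules=True))` would produce a working tree at the commit recorded in metadata.yml.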

Verify LOC from Source

To reproduce LOC statistics, clone the source repo at the documented version:

# Example: jq-gojq (147,358 LOC)
git clone --branch jq-1.7.1 --recurse-submodules --depth 1 \
    https://github.com/jqlang/jq /tmp/jq-verify

cloc --exclude-dir=node_modules,target,build,dist,vendor,.git,__pycache__ \
     /tmp/jq-verify

# Example: gitleaks (22,358 LOC)
git clone --branch v8.30.0 --depth 1 \
    https://github.com/gitleaks/gitleaks /tmp/gitleaks-verify

cloc --exclude-dir=node_modules,target,build,dist,vendor,.git,__pycache__ \
     /tmp/gitleaks-verify

Note: Use --recurse-submodules for repos with submodules (e.g., jq includes oniguruma).

Verify Test Counts

Test counts come from pytest --collect-only, which lists the collected test cases; each parameterized case is counted individually:

source .venv/bin/activate

# Example: jmespath (888 parameterized tests)
pytest --collect-only -q dataset/jmespath/tests/ \
    --work-dir=/tmp --run-cmd=true
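
To turn the collection output into a number, one option is to count the test node IDs that pytest prints in quiet mode, one per line, before the blank line that precedes the summary. The sketch below does that. The helper names are hypothetical, and the extra options are simply passed through as they appear in the command above.

```python
import subprocess


def count_collected(output):
    """Count test node IDs in `pytest --collect-only -q` output."""
    count = 0
    for line in output.splitlines():
        if not line.strip():
            if count:
                break  # blank line after the ID list: summary follows
            continue
        if "::" in line:
            count += 1
    return count


def collect_test_count(tests_dir, extra_args=()):
    """Run pytest in collect-only mode and return the number of test cases."""
    result = subprocess.run(
        ["pytest", "--collect-only", "-q", tests_dir, *extra_args],
        capture_output=True,
        text=True,
    )
    return count_collected(result.stdout)
```

A call such as `collect_test_count("dataset/jmespath/tests/", ["--work-dir=/tmp", "--run-cmd=true"])` mirrors the shell example and should be run inside the activated virtual environment.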

Regenerate Metadata

To recalculate LOC and test counts for all benchmarks:

# Show calculated values without saving
python3 scripts/generate_metadata.py --all --calc-only

# Regenerate metadata.yml files
python3 scripts/generate_metadata.py --all