```bash
.venv/bin/python scripts/aggregate_results.py \
    --filter v3 \
    --agents claude-code,codex-cli,opencode-claude,opencode-openai
```

Outputs to `experiments/results/`:

- `tables.md` - Markdown tables
- `table1_main_results.json` - Aggregated by agent
- `table2_benchmark_breakdown.json` - Per-benchmark breakdown
```bash
.venv/bin/python scripts/aggregate_iteration_results.py \
    --experiment-dir experiments/20260111_133202_iteration-n5-all-per-iter \
    --output-dir experiments/results
```

Outputs:

- `table3_iteration.md` - N=1 vs N=5 comparison
- `table3_iteration.tex` - LaTeX version
- `iteration_curves.csv` - Per-round pass rates
Each experiment is run 3 times for statistical reliability:
| Experiment | Agent | Description |
|---|---|---|
| `test-v3-claude` | Claude Code | v3 anti-cheat template testing |
| `test-v3-codex` | Codex CLI | v3 anti-cheat template testing |
| `v3-opencode-claude-full` | OpenCode (Claude) | Full benchmark with OpenCode + Claude Opus 4.5 |
| `v3-opencode-openai-full` | OpenCode (OpenAI) | Full benchmark with OpenCode + GPT-5.2 |
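The source does not say how the three runs are combined. One common choice, shown here only as a sketch, is to report the mean pass rate together with the sample standard deviation. The function name `summarize_runs` and the pass rates below are hypothetical and are not taken from the aggregation scripts.

```python
from statistics import mean, stdev


def summarize_runs(pass_rates):
    """Return (mean, sample standard deviation) for repeated-run pass rates."""
    if len(pass_rates) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(pass_rates), stdev(pass_rates)


# Hypothetical pass rates from three runs of a single experiment.
avg, spread = summarize_runs([0.50, 0.60, 0.70])
print(f"{avg:.3f} +/- {spread:.3f}")
```

With only three runs the standard deviation is a rough indicator of run-to-run variation, not a tight confidence bound.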
Lines of Code (LOC) counts all code files in the source repository:
- Calculated from fresh clones of source repos at specific versions (documented in `metadata.yml`)
- Includes all programming languages and test files in the source repository
- Excludes only build artifacts: `node_modules`, `target`, `build`, `dist`, `vendor`, `.git`, `__pycache__`
- Uses the `cloc` tool's "code" count (excludes comments and blank lines)
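The counting rules above can be sketched in Python by asking `cloc` for JSON output and reading the `code` field of its `SUM` entry. This is a minimal illustration that assumes `cloc` is installed and on `PATH`. The helper names (`cloc_command`, `code_lines`, `count_loc`) are hypothetical and are not part of the repository's scripts.

```python
import json
import subprocess

# Directories excluded from the count, as listed above.
EXCLUDED_DIRS = ["node_modules", "target", "build", "dist", "vendor", ".git", "__pycache__"]


def cloc_command(repo_path):
    """Build the cloc invocation as an argument list (no shell quoting needed)."""
    return ["cloc", "--json", "--exclude-dir=" + ",".join(EXCLUDED_DIRS), repo_path]


def code_lines(cloc_json_text):
    """Extract the total "code" count from cloc's JSON report."""
    report = json.loads(cloc_json_text)
    return report["SUM"]["code"]


def count_loc(repo_path):
    """Run cloc on a checkout and return its total code-line count."""
    result = subprocess.run(
        cloc_command(repo_path), capture_output=True, text=True, check=True
    )
    return code_lines(result.stdout)
```

For example, `count_loc("/tmp/jq-verify")` would return the total for a checkout at that path.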
Each benchmark's metadata.yml contains the exact commit used:
```yaml
provenance:
  source_repo: "https://github.com/jqlang/jq"
  source_commit: 71c2ab509a8628dbbad4bc7b3f98a64aa90d3297  # Required
  source_version: jq-1.7.1  # Optional - tag/version if available
```

The `source_commit` is the authoritative identifier for reproducibility. The `source_version` is optional and included when cloning from a release tag.
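Because `source_commit` is the authoritative identifier, it can be worth confirming that a clone made from a tag actually resolves to that commit. The sketch below compares the output of `git rev-parse HEAD` with the recorded hash. It assumes `git` is on `PATH`, and the helper names are hypothetical and do not come from the repository.

```python
import subprocess


def same_commit(actual, expected):
    """Compare two full commit hashes, ignoring case and surrounding whitespace."""
    return actual.strip().lower() == expected.strip().lower()


def head_commit(repo_path):
    """Return the commit hash that HEAD points to in the given checkout."""
    result = subprocess.run(
        ["git", "-C", repo_path, "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def verify_clone(repo_path, source_commit):
    """Raise if the checkout is not at the commit recorded in metadata.yml."""
    actual = head_commit(repo_path)
    if not same_commit(actual, source_commit):
        raise RuntimeError(f"expected {source_commit}, found {actual}")
```

Calling `verify_clone("/tmp/jq-verify", "71c2ab509a8628dbbad4bc7b3f98a64aa90d3297")` before running `cloc` would catch a tag that has moved since the metadata was written.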
To reproduce LOC statistics, clone the source repo at the documented version:
```bash
# Example: jq-gojq (147,358 LOC)
git clone --branch jq-1.7.1 --recurse-submodules --depth 1 \
    https://github.com/jqlang/jq /tmp/jq-verify
cloc --exclude-dir=node_modules,target,build,dist,vendor,.git,__pycache__ \
    /tmp/jq-verify

# Example: gitleaks (22,358 LOC)
git clone --branch v8.30.0 --depth 1 \
    https://github.com/gitleaks/gitleaks /tmp/gitleaks-verify
cloc --exclude-dir=node_modules,target,build,dist,vendor,.git,__pycache__ \
    /tmp/gitleaks-verify
```

Note: Use `--recurse-submodules` for repos with submodules (e.g., jq includes oniguruma).
Test counts use pytest --collect-only to count actual test cases (including parameterized tests):
```bash
source .venv/bin/activate

# Example: jmespath (888 parameterized tests)
pytest --collect-only -q dataset/jmespath/tests/ \
    --work-dir=/tmp --run-cmd=true
```

To recalculate LOC and test counts for all benchmarks:
```bash
# Show calculated values without saving
python3 scripts/generate_metadata.py --all --calc-only

# Regenerate metadata.yml files
python3 scripts/generate_metadata.py --all
```
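For a quick manual check of a single benchmark's test count, the `pytest --collect-only -q` output described earlier can be tallied directly. In quiet mode pytest prints one node ID per collected test (each contains `::`), followed by a summary line. The sketch below counts the node ID lines. `count_collected` is a hypothetical helper, and this is not necessarily how `generate_metadata.py` computes its numbers.

```python
def count_collected(collect_output):
    """Count test node IDs in the text printed by `pytest --collect-only -q`.

    Assumes every collected test appears on its own line as a node ID
    containing "::", and that no other output line contains "::".
    """
    return sum(1 for line in collect_output.splitlines() if "::" in line)


# Hypothetical output for a small parameterized test file.
sample_output = """\
tests/test_example.py::test_search[case0]
tests/test_example.py::test_search[case1]
tests/test_example.py::test_parse

3 tests collected in 0.01s
"""
print(count_collected(sample_output))
```

Each parameterized case has its own node ID, so it is counted as a separate test, which matches the counting rule stated above.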