RepoBench benchmarks AI coding agents on your own repo. Locally. Reproducibly. In CI.
Public leaderboards are useful, but they do not tell you whether Codex, Claude Code, Cursor, Copilot, or an internal agent can safely fix your failing tests, respect your conventions, and stay inside your review budget. RepoBench gives each agent the same task, same base commit, same verification commands, and a clean scorecard.
```shell
npx repobench init
npx repobench run --task .repobench/tasks/sample.yml --agents codex,claude,cursor,copilot --open-report
npx repobench report <run-id>
```

- Runs each agent in an isolated git worktree.
- Exports each agent's result as a patch.
- Applies every patch to a clean verifier worktree.
- Runs deterministic checks: install, lint, test, build, Playwright, or any custom command.
- Writes Markdown, HTML, JSON, logs, events, patches, and per-agent scores.
- Uses no LLM judge by default.
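The worktree-and-patch flow above can be sketched with plain git commands. This is an illustration only, not RepoBench's actual code; the temp paths, file names, and commit messages are made up:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q repo && cd repo
echo "broken" > app.txt
git add app.txt
git -c user.email=ci@example.com -c user.name=ci commit -q -m "base"

git worktree add -q ../agent HEAD       # isolated worktree for the agent
echo "fixed" > ../agent/app.txt         # the "agent" edits a file
git -C ../agent diff > ../patch.diff    # export the result as a patch

git worktree add -q ../verify HEAD      # clean verifier worktree
git -C ../verify apply ../patch.diff    # apply the patch there
cat ../verify/app.txt                   # prints: fixed
```

Because the verifier worktree starts from the same base commit, only changes captured in the patch can influence the checks.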
```shell
npm install -g repobench
```

For local development from this repo:

```shell
npm ci
npm run build
node dist/src/cli.js --help
```

`repobench init` creates `.repobench.yml`:

```yaml
version: 1
repo: .
base: HEAD
runsDir: .repobench/runs
sandbox: local
agents:
  codex:
    command: codex exec --full-auto --sandbox workspace-write
  claude:
    command: claude -p --verbose --output-format stream-json --permission-mode bypassPermissions
  cursor:
    command: cursor-agent -p --output-format stream-json
  copilot:
    command: copilot -p
checks:
  install: npm ci
  lint: npm run lint --if-present
  test: npm test
  build: npm run build --if-present
limits:
  timeoutMinutes: 45
  maxBudgetUsd: 5
```

Agent commands receive the task prompt as the final shell argument. If a command contains `{prompt}`, RepoBench replaces that token instead. RepoBench also sets the `REPOBENCH`, `REPOBENCH_RUN_ID`, and `REPOBENCH_AGENT` environment variables.
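The prompt-injection rule above can be sketched as follows. This is a hypothetical illustration, not RepoBench's source; the function name and the naive `JSON.stringify` quoting are assumptions:

```javascript
// Sketch: substitute {prompt} if the command contains it,
// otherwise append the prompt as the final shell argument.
function buildAgentCommand(command, prompt) {
  const quoted = JSON.stringify(prompt); // naive shell quoting, for illustration only
  if (command.includes("{prompt}")) {
    return command.replace("{prompt}", quoted);
  }
  return `${command} ${quoted}`;
}

console.log(buildAgentCommand("copilot -p", "Fix the failing test"));
// → copilot -p "Fix the failing test"
console.log(buildAgentCommand("my-agent --task {prompt}", "Fix the failing test"));
// → my-agent --task "Fix the failing test"
```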
The sample task at `.repobench/tasks/sample.yml` looks like:

```yaml
id: sample-bugfix
title: Fix the sample failing test
prompt: |
  Fix the failing test without changing the public API.
base: HEAD
checks:
  test: npm test
```

Task-level checks override config-level checks with the same name.
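The override behavior amounts to a shallow merge in which task entries win. A sketch under that assumption; `mergeChecks` is a hypothetical name, not RepoBench's API:

```javascript
// Task-level checks shadow config-level checks with the same key.
function mergeChecks(configChecks, taskChecks) {
  return { ...configChecks, ...taskChecks };
}

const merged = mergeChecks(
  { install: "npm ci", test: "npm test" },   // from .repobench.yml
  { test: "npm test -- --grep sample" }      // from the task file
);
console.log(merged.test);    // → npm test -- --grep sample
console.log(merged.install); // → npm ci
```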
Each run writes:

```
.repobench/runs/<run-id>/
  manifest.json
  events.jsonl
  agents/<agent>/patch.diff
  agents/<agent>/stdout.log
  agents/<agent>/stderr.log
  agents/<agent>/score.json
  report.md
  report.html
  report.json
```
The default score is deterministic:
- patch applies: 40 points
- verification checks: 45 points
- agent command exits cleanly: 15 points
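The three buckets above might combine as follows. This is a hypothetical sketch: in particular, prorating the 45 check points across individual checks is an assumption, not documented behavior:

```javascript
// Sketch of the 100-point deterministic score (40 + 45 + 15).
// ASSUMPTION: check points are prorated by checks passed.
function scoreAgent({ patchApplied, checksPassed, checksTotal, cleanExit }) {
  const patchPoints = patchApplied ? 40 : 0;
  const checkPoints =
    checksTotal > 0 ? Math.round(45 * (checksPassed / checksTotal)) : 0;
  const exitPoints = cleanExit ? 15 : 0;
  return patchPoints + checkPoints + exitPoints;
}

console.log(scoreAgent({ patchApplied: true, checksPassed: 4, checksTotal: 4, cleanExit: true }));  // → 100
console.log(scoreAgent({ patchApplied: true, checksPassed: 2, checksTotal: 4, cleanExit: false })); // → 63
```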
The `examples/toy-repo` directory is a tiny repo with a failing test. To try RepoBench without real agent credentials, configure a mock agent command that edits the file:

```yaml
agents:
  mock-pass:
    command: node /absolute/path/to/RepoBench/tests/fixtures/mock-agents/pass.mjs
  mock-noop:
    command: node /absolute/path/to/RepoBench/tests/fixtures/mock-agents/noop.mjs
```

Then run:

```shell
repobench run --task .repobench/tasks/sample.yml --agents mock-pass,mock-noop --open-report
```

This repo also ships checked-in reports from real RepoBench runs:
- Self-dogfood run against a seeded RepoBench regression: docs/dogfood-report.md
- Tiny Codex vs Claude Code toy run: docs/real-agent-report.md
- Deterministic mock-agent run: docs/demo-report.md
Regenerate the mock report with:
```shell
npm run demo
```

To run RepoBench as a GitHub Actions step:

```yaml
- uses: Atomics-hub/RepoBench@v1
  with:
    task: .repobench/tasks/sample.yml
    agents: codex,claude
```

For public forks, do not expose model/provider keys to untrusted pull requests. Prefer trusted events, manual dispatch, or self-hosted runners.
The default sandbox is `local`. RepoBench also accepts `sandbox: docker` for users who provide an image with the needed agent CLIs and project toolchain:

```yaml
sandbox: docker
docker:
  image: ghcr.io/your-org/repobench-agent:latest
  network: bridge
  cpus: 2
  memory: 4g
```

Docker runs with dropped capabilities, `no-new-privileges`, a process limit, CPU/memory limits, and only the active worktree mounted at `/workspace`.
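The hardening described above corresponds roughly to `docker run` flags like these. This is an approximation for illustration; the exact flags (and the process-limit value) RepoBench passes are assumptions, not taken from its source:

```
docker run --rm \
  --cap-drop=ALL \
  --security-opt no-new-privileges \
  --pids-limit 256 \
  --cpus 2 --memory 4g \
  --network bridge \
  -v "$PWD/<worktree>:/workspace" \
  ghcr.io/your-org/repobench-agent:latest
```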
SWE-bench tells you what models can do on public benchmark tasks. RepoBench tells you which agent deserves access to your repo.
The paid path later is private scheduled evals, hosted dashboards, policy packs, enterprise audit logs, and “Agent Readiness Bakeoff” reports. The OSS core stays local-first.
