Stop trusting public benchmarks. Test agents on your own codebase.

RepoBench benchmarks AI coding agents on your own repo. Locally. Reproducibly. In CI.

Public leaderboards are useful, but they do not tell you whether Codex, Claude Code, Cursor, Copilot, or an internal agent can safely fix your failing tests, respect your conventions, and stay inside your review budget. RepoBench gives each agent the same task, same base commit, same verification commands, and a clean scorecard.

npx repobench init
npx repobench run --task .repobench/tasks/sample.yml --agents codex,claude,cursor,copilot --open-report
npx repobench report <run-id>

What It Does

Runs each agent in an isolated git worktree.
Exports each agent result as a patch.
Applies every patch to a clean verifier worktree.
Runs deterministic checks: install, lint, test, build, Playwright, or any custom command.
Writes Markdown, HTML, JSON, logs, events, patches, and per-agent scores.
Uses no LLM judge by default.
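
As a rough illustration of the isolate, export, and verify flow above, the sketch below lists git commands that could implement it. This is not RepoBench's actual implementation: the helper name planAgentRun, the GitStep type, and all paths are hypothetical, and the real tool may export or apply patches differently.

```typescript
// Hypothetical sketch: plan the git steps for one agent run.
// Nothing is executed here; the function only returns the commands.
type GitStep = { cwd: string; args: string[] };

function planAgentRun(
  repo: string,
  base: string,
  agentDir: string,
  verifierDir: string,
  patchFile: string,
): GitStep[] {
  return [
    // 1. Give the agent its own detached worktree at the base commit.
    { cwd: repo, args: ["worktree", "add", "--detach", agentDir, base] },
    // (the agent command would run inside agentDir at this point)
    // 2. Stage everything so new files are captured, then export a patch.
    { cwd: agentDir, args: ["add", "--all"] },
    { cwd: agentDir, args: ["diff", "--cached", "--binary", `--output=${patchFile}`, base] },
    // 3. Apply the patch to a second, clean worktree used only for checks.
    { cwd: repo, args: ["worktree", "add", "--detach", verifierDir, base] },
    { cwd: verifierDir, args: ["apply", "--index", patchFile] },
  ];
}

// Example paths are placeholders.
const steps = planAgentRun(".", "HEAD", "/tmp/rb/agent", "/tmp/rb/verify", "/tmp/rb/patch.diff");
for (const step of steps) {
  console.log(`(${step.cwd}) git ${step.args.join(" ")}`);
}
```

Verifying in a separate worktree means the checks only see what is in the patch, not stray build output or caches left behind in the agent's working directory.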

Install

npm install -g repobench

For local development from this repo:

npm ci
npm run build
node dist/src/cli.js --help

Configure

repobench init creates .repobench.yml:

version: 1
repo: .
base: HEAD
runsDir: .repobench/runs
sandbox: local

agents:
  codex:
    command: codex exec --full-auto --sandbox workspace-write
  claude:
    command: claude -p --verbose --output-format stream-json --permission-mode bypassPermissions
  cursor:
    command: cursor-agent -p --output-format stream-json
  copilot:
    command: copilot -p

checks:
  install: npm ci
  lint: npm run lint --if-present
  test: npm test
  build: npm run build --if-present

limits:
  timeoutMinutes: 45
  maxBudgetUsd: 5

Agent commands receive the task prompt as the final shell argument. If a command contains {prompt}, RepoBench substitutes the prompt for that token instead of appending it. RepoBench also sets the REPOBENCH, REPOBENCH_RUN_ID, and REPOBENCH_AGENT environment variables.
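
The two prompt-passing modes can be sketched roughly as follows. This is a minimal illustration, not RepoBench's code: buildAgentCommand and shellQuote are hypothetical names, and the POSIX single-quote escaping and the choice to replace every {prompt} occurrence are assumptions.

```typescript
// Hypothetical sketch of prompt injection into an agent command.

// Assumption: the prompt is quoted for a POSIX shell using single quotes.
function shellQuote(value: string): string {
  return "'" + value.replace(/'/g, "'\\''") + "'";
}

function buildAgentCommand(template: string, prompt: string): string {
  const quoted = shellQuote(prompt);
  // split/join avoids the special "$" patterns of String.prototype.replace.
  return template.includes("{prompt}")
    ? template.split("{prompt}").join(quoted)
    : `${template} ${quoted}`;
}

// No token: the prompt is appended as the final argument.
const appended = buildAgentCommand("copilot -p", "Fix the failing test");
// With a token: the prompt lands where {prompt} was ("my-agent" is a made-up CLI).
const substituted = buildAgentCommand("my-agent --task {prompt} --quiet", "Fix the failing test");

console.log(appended);
console.log(substituted);
```

The {prompt} form is useful for CLIs that expect the prompt before other flags or as the value of a named option.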

Tasks

id: sample-bugfix
title: Fix the sample failing test
prompt: |
  Fix the failing test without changing the public API.
base: HEAD
checks:
  test: npm test

Task-level checks override config-level checks with the same name.
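
One way to picture the override rule is a shallow merge keyed by check name. The sketch below assumes that config-level checks without a task-level counterpart still run, which the sentence above suggests but does not state outright. resolveChecks and the overriding test command are hypothetical.

```typescript
// Hypothetical sketch: merge config-level and task-level checks by name.
type Checks = Record<string, string>;

function resolveChecks(configChecks: Checks, taskChecks: Checks = {}): Checks {
  // Later spreads win, so a task check replaces the config check of the same name.
  return { ...configChecks, ...taskChecks };
}

const resolved = resolveChecks(
  {
    install: "npm ci",
    lint: "npm run lint --if-present",
    test: "npm test",
    build: "npm run build --if-present",
  },
  // Made-up task-level override, for illustration only.
  { test: "npm test -- sample" },
);

console.log(resolved);
```

Under this assumption only the test command changes, while install, lint, and build keep their config-level commands and their original order.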

Reports

Each run writes:

.repobench/runs/<run-id>/
  manifest.json
  events.jsonl
  agents/<agent>/patch.diff
  agents/<agent>/stdout.log
  agents/<agent>/stderr.log
  agents/<agent>/score.json
  report.md
  report.html
  report.json

The default score is deterministic:

patch applies: 40 points
verification checks: 45 points
agent command exits cleanly: 15 points
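
The weights above add up to 100. A minimal sketch of such a score is shown below. The README does not say how the 45 verification points are divided among checks, so the proportional split here is an assumption, and AgentOutcome and scoreAgent are hypothetical names.

```typescript
// Hypothetical sketch of the 40 / 45 / 15 scoring weights.
interface AgentOutcome {
  patchApplied: boolean;
  checksPassed: number;
  checksTotal: number;
  exitedCleanly: boolean;
}

function scoreAgent(o: AgentOutcome): number {
  const patchPoints = o.patchApplied ? 40 : 0;
  // Assumption: verification points are awarded in proportion to passing checks.
  const checkPoints = o.checksTotal > 0 ? 45 * (o.checksPassed / o.checksTotal) : 0;
  const exitPoints = o.exitedCleanly ? 15 : 0;
  return Math.round(patchPoints + checkPoints + exitPoints);
}

const perfect = scoreAgent({ patchApplied: true, checksPassed: 4, checksTotal: 4, exitedCleanly: true });
const partial = scoreAgent({ patchApplied: true, checksPassed: 2, checksTotal: 3, exitedCleanly: true });
const noPatch = scoreAgent({ patchApplied: false, checksPassed: 0, checksTotal: 4, exitedCleanly: true });

console.log({ perfect, partial, noPatch });
```

Because every input is an observable fact about the run, the same run always yields the same score, with no model-based judgment involved.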

60-Second Demo

The examples/toy-repo directory is a tiny repo with a failing test. To try RepoBench without real agent credentials, configure mock agents that run the fixture scripts from this repo instead of a real agent CLI:

agents:
  mock-pass:
    command: node /absolute/path/to/RepoBench/tests/fixtures/mock-agents/pass.mjs
  mock-noop:
    command: node /absolute/path/to/RepoBench/tests/fixtures/mock-agents/noop.mjs

Then run:

repobench run --task .repobench/tasks/sample.yml --agents mock-pass,mock-noop --open-report

This repo also ships checked-in reports from real RepoBench runs:

Self-dogfood run against a seeded RepoBench regression: docs/dogfood-report.md
Tiny Codex vs Claude Code toy run: docs/real-agent-report.md
Deterministic mock-agent run: docs/demo-report.md

Regenerate the mock report with:

npm run demo

GitHub Action

- uses: Atomics-hub/RepoBench@v1
  with:
    task: .repobench/tasks/sample.yml
    agents: codex,claude

On public repositories, do not expose model or provider keys to untrusted pull requests, such as those opened from forks. Prefer trusted events, manual dispatch, or self-hosted runners.

Docker Sandbox

The default sandbox is local. RepoBench also accepts sandbox: docker for users who provide an image with the needed agent CLIs and project toolchain:

sandbox: docker
docker:
  image: ghcr.io/your-org/repobench-agent:latest
  network: bridge
  cpus: 2
  memory: 4g

Docker runs with dropped capabilities, no-new-privileges, a process limit, CPU/memory limits, and only the active worktree mounted at /workspace.
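
The hardening described above corresponds to standard docker run flags. The sketch below shows one plausible argument list and is not the one RepoBench necessarily builds: dockerRunArgs, the DockerSandbox type, the default process limit of 256, and the --workdir setting are assumptions.

```typescript
// Hypothetical sketch: translate the docker config into `docker run` arguments.
interface DockerSandbox {
  image: string;
  network: string;
  cpus: number;
  memory: string;
}

function dockerRunArgs(cfg: DockerSandbox, worktree: string, pidsLimit = 256): string[] {
  return [
    "run", "--rm",
    "--cap-drop=ALL",                        // dropped capabilities
    "--security-opt", "no-new-privileges",   // block privilege escalation
    "--pids-limit", String(pidsLimit),       // process limit (value is an assumption)
    "--cpus", String(cfg.cpus),
    "--memory", cfg.memory,
    "--network", cfg.network,
    "--volume", `${worktree}:/workspace`,    // only the active worktree is mounted
    "--workdir", "/workspace",
    cfg.image,
  ];
}

const dockerArgs = dockerRunArgs(
  { image: "ghcr.io/your-org/repobench-agent:latest", network: "bridge", cpus: 2, memory: "4g" },
  "/abs/path/to/worktree", // placeholder; bind mounts need an absolute host path
);

console.log(["docker", ...dockerArgs].join(" "));
```

Mounting only the worktree keeps the rest of the host filesystem, including other agents' worktrees, out of the container's reach.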

Why RepoBench

SWE-bench tells you what models can do on public benchmark tasks. RepoBench tells you which agent deserves access to your repo.

A paid tier is planned for later: private scheduled evals, hosted dashboards, policy packs, enterprise audit logs, and “Agent Readiness Bakeoff” reports. The OSS core stays local-first.
