Atomics-hub/RepoBench

Stop trusting public benchmarks. Test agents on your own codebase.

RepoBench benchmarks AI coding agents on your own repo. Locally. Reproducibly. In CI.

Public leaderboards are useful, but they do not tell you whether Codex, Claude Code, Cursor, Copilot, or an internal agent can safely fix your failing tests, respect your conventions, and stay inside your review budget. RepoBench gives each agent the same task, same base commit, same verification commands, and a clean scorecard.

[Image: RepoBench report comparing Claude Code and Codex on RepoBench itself]

npx repobench init
npx repobench run --task .repobench/tasks/sample.yml --agents codex,claude,cursor,copilot --open-report
npx repobench report <run-id>

What It Does

  • Runs each agent in an isolated git worktree.
  • Exports each agent result as a patch.
  • Applies every patch to a clean verifier worktree.
  • Runs deterministic checks: install, lint, test, build, Playwright, or any custom command.
  • Writes Markdown, HTML, JSON, logs, events, patches, and per-agent scores.
  • Uses no LLM judge by default.
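
The isolate, export, and verify steps above can be sketched as a plan of git commands. This is a minimal illustration, not RepoBench's actual implementation: the function name `plan_agent_run`, the `worktrees/` directory, and the exact git flags are assumptions chosen for the example.

```python
from pathlib import Path


def plan_agent_run(repo: Path, run_dir: Path, agent: str, base: str = "HEAD") -> list[list[str]]:
    """Return the git commands for one agent: isolate, export a patch, verify.

    Hypothetical sketch: the directory names under run_dir are assumptions.
    """
    agent_tree = run_dir / "worktrees" / agent
    verifier_tree = run_dir / "worktrees" / f"{agent}-verify"
    patch = run_dir / "agents" / agent / "patch.diff"
    return [
        # 1. Give the agent its own checkout of the base commit.
        ["git", "-C", str(repo), "worktree", "add", "--detach", str(agent_tree), base],
        # (The agent command would run inside agent_tree at this point.)
        # 2. Stage everything so new files are included, then export the diff.
        ["git", "-C", str(agent_tree), "add", "-A"],
        ["git", "-C", str(agent_tree), "diff", "--cached", "--binary", base, f"--output={patch}"],
        # 3. Apply the patch to a second, clean worktree used only for checks.
        ["git", "-C", str(repo), "worktree", "add", "--detach", str(verifier_tree), base],
        ["git", "-C", str(verifier_tree), "apply", str(patch)],
    ]
```

Verifying in a second worktree means leftover build artifacts or untracked state from the agent's session cannot influence the checks; only the patch carries over.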

Install

npm install -g repobench

For local development from this repo:

npm ci
npm run build
node dist/src/cli.js --help

Configure

repobench init creates .repobench.yml:

version: 1
repo: .
base: HEAD
runsDir: .repobench/runs
sandbox: local

agents:
  codex:
    command: codex exec --full-auto --sandbox workspace-write
  claude:
    command: claude -p --verbose --output-format stream-json --permission-mode bypassPermissions
  cursor:
    command: cursor-agent -p --output-format stream-json
  copilot:
    command: copilot -p

checks:
  install: npm ci
  lint: npm run lint --if-present
  test: npm test
  build: npm run build --if-present

limits:
  timeoutMinutes: 45
  maxBudgetUsd: 5

Agent commands receive the task prompt as the final shell argument. If a command contains {prompt}, RepoBench substitutes the prompt for that token instead of appending it. RepoBench also sets the environment variables REPOBENCH, REPOBENCH_RUN_ID, and REPOBENCH_AGENT for each agent process.
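
The prompt-passing rule can be sketched as follows. This is an illustration under stated assumptions: the helper names are hypothetical, quoting the prompt with `shlex.quote` is one reasonable choice rather than confirmed RepoBench behavior, and the value `"1"` for `REPOBENCH` is a guess (the README names the variable but not its value).

```python
import shlex


def build_agent_command(command: str, prompt: str) -> str:
    """Insert the prompt at {prompt} if present, else append it as the last argument."""
    quoted = shlex.quote(prompt)  # assumption: the prompt is shell-quoted
    if "{prompt}" in command:
        return command.replace("{prompt}", quoted)
    return f"{command} {quoted}"


def agent_env(run_id: str, agent: str) -> dict[str, str]:
    """Environment variables set for the agent process."""
    return {
        "REPOBENCH": "1",  # assumed value
        "REPOBENCH_RUN_ID": run_id,
        "REPOBENCH_AGENT": agent,
    }
```

For example, `build_agent_command("copilot -p", "Fix the test")` yields `copilot -p 'Fix the test'`, while a command such as `my-agent --task {prompt} --json` (a hypothetical CLI) gets the prompt placed in the middle.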

Tasks

id: sample-bugfix
title: Fix the sample failing test
prompt: |
  Fix the failing test without changing the public API.
base: HEAD
checks:
  test: npm test

Task-level checks override config-level checks with the same name.
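
The override rule behaves like a dictionary merge keyed on check name. A minimal sketch, assuming checks not mentioned in the task are inherited unchanged from the config (the function name `merge_checks` and the overriding command are hypothetical):

```python
def merge_checks(config_checks: dict[str, str], task_checks: dict[str, str]) -> dict[str, str]:
    """Task-level checks replace config-level checks that share a name."""
    merged = dict(config_checks)  # copy so the config is not mutated
    merged.update(task_checks)    # same-name entries are overridden
    return merged


config_checks = {
    "install": "npm ci",
    "lint": "npm run lint --if-present",
    "test": "npm test",
    "build": "npm run build --if-present",
}
task_checks = {"test": "npm test -- sample"}  # hypothetical narrower test command

effective = merge_checks(config_checks, task_checks)
```

Here `effective` keeps `install`, `lint`, and `build` from the config and takes `test` from the task.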

Reports

Each run writes:

.repobench/runs/<run-id>/
  manifest.json
  events.jsonl
  agents/<agent>/patch.diff
  agents/<agent>/stdout.log
  agents/<agent>/stderr.log
  agents/<agent>/score.json
  report.md
  report.html
  report.json
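
A CI step might want to confirm that a run produced everything in this layout before archiving it. The sketch below only checks for file presence; the helper name `missing_artifacts` is hypothetical and nothing is assumed about the contents of the files.

```python
from pathlib import Path

RUN_FILES = ["manifest.json", "events.jsonl", "report.md", "report.html", "report.json"]
AGENT_FILES = ["patch.diff", "stdout.log", "stderr.log", "score.json"]


def missing_artifacts(run_dir: Path) -> list[str]:
    """Return run-relative paths from the documented layout that are absent."""
    missing = [name for name in RUN_FILES if not (run_dir / name).is_file()]
    agents_dir = run_dir / "agents"
    if agents_dir.is_dir():
        for agent_dir in sorted(p for p in agents_dir.iterdir() if p.is_dir()):
            for name in AGENT_FILES:
                if not (agent_dir / name).is_file():
                    missing.append(f"agents/{agent_dir.name}/{name}")
    return missing
```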

The default score is deterministic and sums to 100 points:

  • patch applies: 40 points
  • verification checks: 45 points
  • agent command exits cleanly: 15 points
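
The weights above can be combined as in the following sketch. The README does not say how the 45 verification points are split across checks, so awarding them in proportion to the number of passing checks is an assumption, as is the function name `score`.

```python
def score(patch_applies: bool, check_results: dict[str, bool], agent_exit_code: int) -> float:
    """Combine the three documented components into a 0-100 score."""
    points = 0.0
    if patch_applies:
        points += 40
    if check_results:
        # Assumption: verification points are proportional to checks passed.
        passed = sum(check_results.values())
        points += 45 * passed / len(check_results)
    if agent_exit_code == 0:
        points += 15
    return points
```

Under this assumption, an agent whose patch applies and which exits cleanly but passes one of two checks scores 40 + 22.5 + 15 = 77.5.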

60-Second Demo

The examples/toy-repo directory is a tiny repo with a failing test. To try RepoBench without real agent credentials, configure a mock agent command that edits the file:

agents:
  mock-pass:
    command: node /absolute/path/to/RepoBench/tests/fixtures/mock-agents/pass.mjs
  mock-noop:
    command: node /absolute/path/to/RepoBench/tests/fixtures/mock-agents/noop.mjs

Then run:

repobench run --task .repobench/tasks/sample.yml --agents mock-pass,mock-noop --open-report

This repo also ships checked-in reports from real RepoBench runs.

Regenerate the mock report with:

npm run demo

GitHub Action

- uses: Atomics-hub/RepoBench@v1
  with:
    task: .repobench/tasks/sample.yml
    agents: codex,claude

For public forks, do not expose model/provider keys to untrusted pull requests. Prefer trusted events, manual dispatch, or self-hosted runners.

Docker Sandbox

The default sandbox is local. RepoBench also accepts sandbox: docker for users who provide an image with the needed agent CLIs and project toolchain:

sandbox: docker
docker:
  image: ghcr.io/your-org/repobench-agent:latest
  network: bridge
  cpus: 2
  memory: 4g

In Docker mode, each container runs with dropped capabilities, no-new-privileges, a process limit, CPU/memory limits, and only the active worktree mounted at /workspace.
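
Those restrictions map onto standard `docker run` flags. The sketch below shows one plausible argument list; the exact flags RepoBench passes are not documented here, and the helper name `docker_run_args` and the process limit of 256 are assumptions.

```python
def docker_run_args(
    image: str,
    worktree: str,
    network: str = "bridge",
    cpus: str = "2",
    memory: str = "4g",
    pids_limit: int = 256,  # assumed value; the README only says "a process limit"
) -> list[str]:
    """Build a docker run argv that applies the documented restrictions."""
    return [
        "docker", "run", "--rm",
        "--cap-drop", "ALL",                    # dropped capabilities
        "--security-opt", "no-new-privileges",  # block privilege escalation
        "--pids-limit", str(pids_limit),        # process limit
        "--cpus", str(cpus),                    # CPU limit
        "--memory", memory,                     # memory limit
        "--network", network,
        "--volume", f"{worktree}:/workspace",   # only the active worktree is mounted
        "--workdir", "/workspace",
        image,
    ]
```

The agent command would follow the image name in the argv. Setting `network` to `none` would cut off network access entirely, though most hosted agent CLIs need to reach their provider.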

Why RepoBench

SWE-bench tells you what models can do on public benchmark tasks. RepoBench tells you which agent deserves access to your repo.

A paid tier is planned for later: private scheduled evals, hosted dashboards, policy packs, enterprise audit logs, and “Agent Readiness Bakeoff” reports. The OSS core stays local-first.