RepoMod-Bench

1. Overview

RepoMod-Bench evaluates how well AI agents translate software projects between languages and frameworks while preserving functional equivalence. It includes:

21 benchmarks across CLI tools and REST APIs
8 programming languages: C, C++, Go, Java, JavaScript, Python, Rust, TypeScript
1.6M lines of code total, with repositories ranging from 14 to 211K LOC
11,616 test cases for implementation-agnostic evaluation
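
To illustrate what "implementation-agnostic" means here, the sketch below checks only externally observable behavior (exit code and stdout), so the same check can run against a source or a target implementation. It is a minimal sketch, not the benchmark's test harness: `run_cli`, `check_uppercase`, and the two stand-in commands are hypothetical names invented for this example.

```python
import subprocess
import sys


def run_cli(command, stdin_text=""):
    """Run a command and return only what an outside observer can see."""
    proc = subprocess.run(
        command, input=stdin_text, capture_output=True, text=True, timeout=30
    )
    return proc.returncode, proc.stdout


# Two stand-in "implementations" of the same tool (hypothetical, for illustration).
# They are written differently but should behave identically from the outside.
IMPL_A = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper(), end='')"]
IMPL_B = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(''.join(c.upper() for c in sys.stdin.read()))",
]


def check_uppercase(command):
    """One behavioral check that makes no assumption about the language used."""
    code, out = run_cli(command, "hello\n")
    return code == 0 and out == "HELLO\n"
```

Because the check never imports or inspects the implementation, swapping `IMPL_A` for `IMPL_B` requires no change to the test itself.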

Each benchmark has:

workspace/src/ - Source implementation (given to agent)
workspace/dst/ - Target implementation (agent generates this)
workspace/prompt.md - Translation instructions
tests/ - Hidden pytest tests (only used during evaluation)
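
The layout above can be checked with a short script before running an agent. This is a sketch under assumptions: `missing_entries` is a hypothetical helper, and whether workspace/dst/ already exists before the agent runs is not stated here, so a missing entry may be informational rather than an error.

```python
from pathlib import Path

# Entries taken from the layout listed above, relative to one benchmark directory.
EXPECTED = ("workspace/src", "workspace/dst", "workspace/prompt.md", "tests")


def missing_entries(benchmark_dir):
    """Return the expected paths that do not exist under benchmark_dir."""
    root = Path(benchmark_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

An empty list means every listed path is present.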

2. Setup

# Clone the repository
git clone <repo-url>
cd mcode-benchmark

# Create virtual environment (requires Python 3.12+)
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -r requirements.txt

# Copy environment template and add your API key
cp .env.example .env
# Edit .env and add ANTHROPIC_API_KEY

# Copy config template and select benchmarks to run
cp config.toml.example config.toml
# Edit config.toml and uncomment desired benchmark IDs

Requirements:

Docker and Docker Compose
Python 3.12+
uv (recommended) or pip
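
The requirements above can be checked programmatically. This is a minimal sketch, assuming that having `docker` on PATH is a good enough proxy for a working Docker installation; `missing_prerequisites` is a hypothetical helper and does not verify Docker Compose or that the Docker daemon is running.

```python
import shutil
import sys


def missing_prerequisites(min_python=(3, 12), tools=("docker",)):
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if sys.version_info[:2] < min_python:
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, found {found}"
        )
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems
```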

3. Running Agent to Generate Destination

Option A: Automated via run_agent.py

# Set API key in .env or environment
echo 'ANTHROPIC_API_KEY=sk-...' >> .env

# Configure benchmarks in config.toml
# Run agent (default timeout: 3600s)
python3 run_agent.py --timeout 3600

This automatically:

Starts the Docker container
Pipes prompt.md to Claude in headless mode
Lets the agent write code to workspace/dst/
Saves logs to logs/agent-/
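
The core of the automated flow, feeding a prompt to a headless process on stdin under a timeout, can be sketched as follows. This is not how run_agent.py is implemented (that script is not shown here, and it runs the agent inside Docker); `run_headless` and the stand-in command are hypothetical and only demonstrate the pattern.

```python
import subprocess
import sys


def run_headless(command, prompt_text, timeout_s=3600):
    """Send prompt_text on stdin; return (exit_code, stdout), or (None, "") on timeout."""
    try:
        proc = subprocess.run(
            command,
            input=prompt_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        return None, ""
    return proc.returncode, proc.stdout


# Stand-in for an agent CLI: it just reports how many characters it received.
STAND_IN = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
```

Returning `None` as the exit code keeps a timeout distinguishable from an agent that ran to completion and failed.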

Option B: Interactive via dev.sh

# Start container for a specific benchmark
./dev.sh charcoal-cli

# Inside container, run any agent manually:
cd /workspace
cat prompt.md | IS_SANDBOX=true claude -p --dangerously-skip-permissions

# Or use other tools (Aider, Cursor, etc.)
# Exit shell when done - container stops automatically

This gives full interactive control inside the dev environment at /workspace.

4. Running Tests Against Implementations

# Configure which benchmarks to run in config.toml
# Edit the ids array to select benchmarks (see table below for IDs)

# Test source only (default)
python3 run_benchmarks.py --test src

# Test destination only
python3 run_benchmarks.py --test dst

# Test both
python3 run_benchmarks.py --test both

Results are saved to results/results.jsonl and logs to logs/.
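
A JSON Lines results file can be summarized with a few lines of Python. The record schema of results/results.jsonl is not documented in this README, so the field names `benchmark`, `passed`, and `total` below are hypothetical and would need to be adapted to the actual records.

```python
import json
from collections import defaultdict


def read_jsonl(text):
    """Parse JSON Lines: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def pass_rates(records):
    """Aggregate passed/total per benchmark (field names are assumptions)."""
    totals = defaultdict(lambda: [0, 0])
    for rec in records:
        totals[rec["benchmark"]][0] += rec["passed"]
        totals[rec["benchmark"]][1] += rec["total"]
    return {name: (p / t if t else 0.0) for name, (p, t) in totals.items()}
```

Usage: `pass_rates(read_jsonl(open("results/results.jsonl").read()))`, once the field names match the real file.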

5. Running Experiments (Reproducible)

For reproducible experiments that combine agent runs with testing, use run_experiment.py.

Experiment Configuration

Create a YAML config file (e.g., my_experiment.yml):

name: my-experiment
agent: claude-code          # Agent from agents.yml
benchmarks:                 # Benchmarks to run
  - hello-world-api
  - task-management-api
  - charcoal-cli
timeout: 3600               # Timeout per iteration (seconds)
parallel: 3                 # Number of parallel workers
template_version: v3        # Prompt template version (optional)
description: "My experiment description"

Available agents (defined in agents.yml):

claude-code - Claude Code CLI with Claude Opus 4.5
codex-cli - OpenAI Codex CLI with GPT-5.2
opencode-claude - OpenCode with Claude Opus 4.5
opencode-openai - OpenCode with GPT-5.2
gemini-cli - Gemini CLI with Gemini 3 Flash
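
Before launching a long run, an experiment config can be sanity-checked against the keys and agent names listed above. This is a sketch, not the validation run_experiment.py performs: `config_problems` is hypothetical, the choice of required keys is an assumption, and the config is passed as an already-parsed dict so the example needs no YAML library.

```python
# Agent names copied from the list above.
KNOWN_AGENTS = {
    "claude-code",
    "codex-cli",
    "opencode-claude",
    "opencode-openai",
    "gemini-cli",
}

# Assumption: these three keys are mandatory; the README only marks
# template_version as optional.
REQUIRED_KEYS = ("name", "agent", "benchmarks")


def config_problems(cfg):
    """Return a list of problems found in a parsed experiment config."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in cfg]
    agent = cfg.get("agent")
    if agent is not None and agent not in KNOWN_AGENTS:
        problems.append(f"unknown agent: {agent}")
    if "benchmarks" in cfg and not cfg["benchmarks"]:
        problems.append("benchmarks list is empty")
    timeout = cfg.get("timeout")
    if timeout is not None and (not isinstance(timeout, int) or timeout <= 0):
        problems.append("timeout must be a positive integer (seconds)")
    return problems
```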

Run Experiment

python3 run_experiment.py my_experiment.yml

This:

Creates experiments/<timestamp>_<name>/ directory
Copies workspaces for isolation
Runs agent on each benchmark
Runs tests automatically
Saves results to summary.jsonl
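
The first two steps (a timestamped directory and an isolated workspace copy) can be sketched as below. The timestamp format is inferred from the example directory name `20260115_120000_my-experiment` in the Output Structure section; whether the real script uses local time or UTC is not stated. `experiment_dir_name` and `isolate_workspace` are hypothetical names.

```python
import shutil
from datetime import datetime
from pathlib import Path


def experiment_dir_name(name, now):
    """Build '<timestamp>_<name>' in the format seen in the example listing."""
    return f"{now.strftime('%Y%m%d_%H%M%S')}_{name}"


def isolate_workspace(benchmark_workspace, experiment_root, benchmark_id):
    """Copy one benchmark's workspace so a run cannot modify the original."""
    target = Path(experiment_root) / benchmark_id / "workspace"
    # copytree creates missing parent directories and fails if target exists.
    shutil.copytree(benchmark_workspace, target)
    return target
```

Copying rather than linking means an agent that rewrites files under its workspace leaves the original benchmark untouched.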

Multi-Iteration Experiments

For experiments that measure improvement over multiple iterations:

name: iteration-experiment
agent: claude-code
benchmarks:
  - jq-gojq
  - charcoal-cli
iterations: 5              # Run 5 iterations per benchmark
test_each_iteration: true  # Test after each iteration (for accuracy curves)
timeout: 10800             # Higher timeout for multi-iteration
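
With `test_each_iteration: true`, each iteration yields a test result, which can be turned into an accuracy curve. The sketch below assumes the per-iteration results are available as `(passed, total)` pairs; that representation and the helper names are hypothetical, and the numbers in the usage line are made up.

```python
def accuracy_curve(results):
    """Map per-iteration (passed, total) pairs to pass rates, in iteration order."""
    return [passed / total if total else 0.0 for passed, total in results]


def best_iteration(results):
    """Return the 1-based index of the first iteration with the highest pass rate."""
    curve = accuracy_curve(results)
    return curve.index(max(curve)) + 1
```

Usage: `accuracy_curve([(1, 4), (2, 4), (4, 4)])` gives one pass rate per iteration, and `best_iteration` picks the earliest peak.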

Output Structure

experiments/
└── 20260115_120000_my-experiment/
    ├── experiment.yml       # Config snapshot
    ├── agents.yml           # Agent config snapshot
    ├── summary.jsonl        # Results for all benchmarks
    ├── hello-world-api/
    │   ├── workspace/       # Isolated workspace copy
    │   └── agent_log.jsonl  # Agent execution log
    └── task-management-api/
        └── ...
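
Given the directory layout above, the benchmarks in an experiment can be enumerated by looking for subdirectories that contain a workspace/ folder. A minimal sketch; `benchmark_runs` is a hypothetical helper, and it only checks that agent_log.jsonl exists, not what it contains.

```python
from pathlib import Path


def benchmark_runs(experiment_dir):
    """Map each benchmark subdirectory to whether its agent log is present."""
    root = Path(experiment_dir)
    return {
        child.name: (child / "agent_log.jsonl").is_file()
        for child in sorted(root.iterdir())
        if child.is_dir() and (child / "workspace").is_dir()
    }
```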

Experiment Configs Used in Paper

The following experiment configs in experiments/ were used to generate paper results:

Config File	Agent	Description
`opencode-claude-full.yml`	OpenCode	All benchmarks with Claude Opus 4.5
`opencode-openai-full.yml`	OpenCode	All benchmarks with GPT-5.2
`test-v3-claude.yml`	Claude Code	Test runs with v3 template
`test-v3-codex.yml`	Codex CLI	Test runs with v3 template
`iteration-n5-all-per-iter.yml`	Claude Code	Multi-iteration experiment (5 iterations)

Available Benchmarks

Benchmark	Source	Target	LOC	Tests	Description
hello-world-api	Python	Java	14	6	Minimal REST API
task-management-api	Go	Python	762	10	CRUD task manager
bcal	C	Go	2.5K	73	Byte calculator
tokei	Rust	Go	10.1K	196	Code statistics tool
toml	Go	Python	13.9K	647	TOML parser
charcoal-cli	TypeScript	Python	15.8K	195	Git workflow tool
httpie-xh	Python	Rust	19.2K	101	HTTP client
jmespath	Go	Rust	19.3K	888	JSON query language
gitleaks	Go	Rust	22.4K	35	Secret scanner
ledger	C++	Go	50.0K	483	Accounting
wabt	C++	Rust	54.8K	433	WebAssembly toolkit
taskwarrior	C++	Rust	55.3K	912	Task management
lightningcss	Rust	Go	61.7K	1,779	CSS parser/minifier
bc	C	Rust	117K	1,938	Precision calculator
hugo	Go	Rust	122K	74	Static site generator
jq-gojq	C	Go	147K	430	JSON processor
pdfcpu	Go	Rust	160K	141	PDF processor
uncrustify	C++	Rust	162K	2,024	Code beautifier
prettier	JavaScript	Rust	175K	539	Code formatter
verible	C++	Rust	191K	148	SystemVerilog tools
qalculate	C++	Go	211K	564	Math calculator

Total: 1.6M LOC, 11,616 tests
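
The totals can be cross-checked against the table. The sketch below copies the LOC and Tests columns from the rows above (in table order) and sums them; because most LOC values are rounded to the nearest hundred or thousand, the LOC sum is approximate. `parse_loc` and `totals` are hypothetical helpers.

```python
# "LOC tests" pairs copied from the table above, one row per benchmark.
TABLE = """
14 6
762 10
2.5K 73
10.1K 196
13.9K 647
15.8K 195
19.2K 101
19.3K 888
22.4K 35
50.0K 483
54.8K 433
55.3K 912
61.7K 1,779
117K 1,938
122K 74
147K 430
160K 141
162K 2,024
175K 539
191K 148
211K 564
"""


def parse_loc(text):
    """Convert '2.5K' to 2500 and plain numbers such as '762' to ints."""
    if text.endswith("K"):
        return round(float(text[:-1]) * 1000)
    return int(text)


def totals(table_text):
    """Return (row_count, total_loc, total_tests) for the pasted columns."""
    rows = [line.split() for line in table_text.strip().splitlines()]
    total_loc = sum(parse_loc(loc) for loc, _ in rows)
    total_tests = sum(int(tests.replace(",", "")) for _, tests in rows)
    return len(rows), total_loc, total_tests
```

Calling `totals(TABLE)` yields 21 rows, about 1.61M LOC, and 11,616 tests, consistent with the summary line.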

Selecting and Adding New Benchmarks

Use Claude Code commands to streamline the process:

# Step 1: Evaluate if a repo meets selection criteria
/evaluate-benchmark https://github.com/user/repo

# Step 2: If approved, add it as a benchmark
/add-benchmark https://github.com/user/repo <target-language>

For detailed criteria and manual process, see:

SELECTING_BENCHMARKS.md - Selection criteria
ADDING_BENCHMARKS.md - Manual setup instructions

Reproducing Results

See REPRODUCE.md for instructions on:

Regenerating paper results (Tables 1-3)
Verifying benchmark statistics (LOC, test counts)

Citation

If you use RepoMod-Bench in your research, please cite: Xuefeng Li, Nir Ben-Israel, Yotam Raz, Belal Ahmed, Doron Serebro, and Antoine Raux. 2026. RepoMod-Bench: A Benchmark for Code Repository Modernization via Implementation-Agnostic Testing. In Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '26).

Contributing to RepoMod-Bench

We welcome contributions that expand the diversity and scale of RepoMod-Bench. By submitting a pull request, you agree that your contributions will be licensed under the project's Apache License 2.0. When adding repositories, make sure to respect their upstream licenses.
