RepoMod-Bench

1. Overview

RepoMod-Bench evaluates how well AI agents translate software projects between languages and frameworks while preserving functional equivalence. The benchmark comprises:

  • 21 benchmarks across CLI tools and REST APIs
  • 8 programming languages: C, C++, Go, Java, JavaScript, Python, Rust, TypeScript
  • 1.6M lines of code total, with repositories ranging from 14 to 211K LOC
  • 11,616 test cases for implementation-agnostic evaluation

Each benchmark has:

  • workspace/src/ - Source implementation (given to agent)
  • workspace/dst/ - Target implementation (agent generates this)
  • workspace/prompt.md - Translation instructions
  • tests/ - Hidden pytest tests (only used during evaluation)
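
Implementation-agnostic means the tests exercise only externally observable behavior, so the same test can run against either src or dst. A minimal sketch of that idea for a CLI benchmark follows. The helper names and the two stand-in commands are hypothetical and are not taken from the repository's hidden tests.

```python
import subprocess
import sys


def run_cli(cmd: list[str], stdin_text: str = "") -> tuple[int, str]:
    """Run a command and return (exit code, stdout); internals are never inspected."""
    proc = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=30
    )
    return proc.returncode, proc.stdout


def behaves_the_same(src_cmd: list[str], dst_cmd: list[str], stdin_text: str = "") -> bool:
    """True when both implementations produce the same exit code and stdout."""
    return run_cli(src_cmd, stdin_text) == run_cli(dst_cmd, stdin_text)


# Hypothetical stand-ins for a source implementation and its translation.
SRC_CMD = [sys.executable, "-c", "import sys; print(sys.stdin.read().strip().upper())"]
DST_CMD = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().strip().upper() + '\\n')",
]

if __name__ == "__main__":
    print(behaves_the_same(SRC_CMD, DST_CMD, "hello"))
```

Because the check looks only at exit codes and output, it does not matter that the two stand-ins are written differently.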

2. Setup

# Clone the repository
git clone <repo-url>
cd mcode-benchmark

# Create virtual environment (requires Python 3.12+)
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -r requirements.txt

# Copy environment template and add your API key
cp .env.example .env
# Edit .env and add ANTHROPIC_API_KEY

# Copy config template and select benchmarks to run
cp config.toml.example config.toml
# Edit config.toml and uncomment desired benchmark IDs

Requirements:

  • Docker and Docker Compose
  • Python 3.12+
  • uv (recommended) or pip

3. Running Agent to Generate Destination

Option A: Automated via run_agent.py

# Set API key in .env or environment
echo 'ANTHROPIC_API_KEY=sk-...' > .env

# Configure benchmarks in config.toml
# Run agent (default timeout: 3600s)
python3 run_agent.py --timeout 3600

The script automatically:

  1. Starts the Docker container
  2. Pipes prompt.md to Claude Code in headless mode
  3. Lets the agent write code to workspace/dst/
  4. Saves logs to logs/agent-/
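
The piping and timeout steps can be sketched with `subprocess.run`. This is an illustration under stated assumptions, not the code in run_agent.py: `pipe_prompt` and the stand-in `FAKE_AGENT` command are hypothetical, and the Docker step is left out.

```python
import subprocess
import sys


def pipe_prompt(prompt_text: str, agent_cmd: list[str], timeout_s: float = 3600) -> tuple[bool, str]:
    """Send the prompt on stdin; return (finished_in_time, captured stdout)."""
    try:
        proc = subprocess.run(
            agent_cmd, input=prompt_text, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired as exc:
        # Output captured before the timeout may be None, bytes, or str.
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return False, partial
    return True, proc.stdout


# Hypothetical stand-in for a headless agent: reports how many characters it read.
FAKE_AGENT = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]

if __name__ == "__main__":
    print(pipe_prompt("translate src/ to dst/", FAKE_AGENT, timeout_s=10))
```

Returning a flag instead of raising lets a caller record a timed-out run as a result rather than aborting the whole batch.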

Option B: Interactive via dev.sh

# Start container for a specific benchmark
./dev.sh charcoal-cli

# Inside container, run any agent manually:
cd /workspace
cat prompt.md | IS_SANDBOX=true claude -p --dangerously-skip-permissions

# Or use other tools (aider, cursor, etc.)
# Exit shell when done - container stops automatically

This gives full interactive control inside the dev environment at /workspace.

4. Running Tests Against Implementations

# Configure which benchmarks to run in config.toml
# Edit the ids array to select benchmarks (see table below for IDs)

# Test source only (default)
python3 run_benchmarks.py --test src

# Test destination only
python3 run_benchmarks.py --test dst

# Test both
python3 run_benchmarks.py --test both

Results are saved to results/results.jsonl and logs to logs/.
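
A JSONL file holds one JSON object per line, so the results can be loaded with the standard library alone. The sketch below assumes nothing about the record schema beyond that; `read_jsonl` is a hypothetical helper, and it only reports which keys the records contain.

```python
import json
from pathlib import Path


def read_jsonl(path: Path) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def record_keys(records: list[dict]) -> list[str]:
    """Return the sorted union of keys, useful for discovering the schema."""
    keys: set[str] = set()
    for rec in records:
        keys.update(rec)
    return sorted(keys)


if __name__ == "__main__":
    results = Path("results/results.jsonl")
    if results.exists():
        records = read_jsonl(results)
        print(len(records), "records with keys:", record_keys(records))
```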

5. Running Experiments (Reproducible)

For reproducible experiments that combine agent runs with testing, use run_experiment.py.

Experiment Configuration

Create a YAML config file (e.g., my_experiment.yml):

name: my-experiment
agent: claude-code          # Agent from agents.yml
benchmarks:                 # Benchmarks to run
  - hello-world-api
  - task-management-api
  - charcoal-cli
timeout: 3600               # Timeout per iteration (seconds)
parallel: 3                 # Number of parallel workers
template_version: v3        # Prompt template version (optional)
description: "My experiment description"

Available agents (defined in agents.yml):

  • claude-code - Claude Code CLI with Claude Opus 4.5
  • codex-cli - OpenAI Codex CLI with GPT-5.2
  • opencode-claude - OpenCode with Claude Opus 4.5
  • opencode-openai - OpenCode with GPT-5.2
  • gemini-cli - Gemini CLI with Gemini 3 Flash
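
Before launching a long run it can help to sanity-check a config once it has been parsed into a dict. The sketch below is hypothetical and not part of run_experiment.py: the agent names come from the list above, while the choice of required fields and the defaults used when `timeout` or `parallel` is missing are assumptions.

```python
KNOWN_AGENTS = {
    "claude-code",
    "codex-cli",
    "opencode-claude",
    "opencode-openai",
    "gemini-cli",
}


def check_experiment(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks plausible."""
    problems = []
    if not config.get("name"):
        problems.append("missing name")
    if config.get("agent") not in KNOWN_AGENTS:
        problems.append(f"unknown agent: {config.get('agent')!r}")
    benchmarks = config.get("benchmarks")
    if not isinstance(benchmarks, list) or not benchmarks:
        problems.append("benchmarks must be a non-empty list")
    # Assumed defaults: 3600 s timeout and a single worker.
    for key, default in (("timeout", 3600), ("parallel", 1)):
        value = config.get(key, default)
        if not isinstance(value, int) or value <= 0:
            problems.append(f"{key} must be a positive integer")
    return problems


EXAMPLE = {
    "name": "my-experiment",
    "agent": "claude-code",
    "benchmarks": ["hello-world-api", "charcoal-cli"],
    "timeout": 3600,
    "parallel": 3,
}

if __name__ == "__main__":
    print(check_experiment(EXAMPLE) or "ok")
```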

Run Experiment

python3 run_experiment.py my_experiment.yml

This command:

  1. Creates experiments/<timestamp>_<name>/ directory
  2. Copies workspaces for isolation
  3. Runs agent on each benchmark
  4. Runs tests automatically
  5. Saves results to summary.jsonl
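
The first two steps can be sketched as follows. The timestamp format is inferred from the example directory name `20260115_120000_my-experiment` shown in this README; the function names and the assumption that each benchmark directory contains a `workspace/` folder to copy are illustrative, not the actual implementation.

```python
import shutil
from datetime import datetime
from pathlib import Path


def experiment_dir_name(name: str, now: datetime) -> str:
    """Build '<timestamp>_<name>' with a YYYYMMDD_HHMMSS timestamp."""
    return f"{now:%Y%m%d_%H%M%S}_{name}"


def copy_workspace(benchmark_dir: Path, experiment_dir: Path) -> Path:
    """Copy <benchmark>/workspace to <experiment>/<benchmark>/workspace."""
    target = experiment_dir / benchmark_dir.name / "workspace"
    # copytree creates missing parent directories and fails if target exists,
    # which guards against silently reusing a stale copy.
    shutil.copytree(benchmark_dir / "workspace", target)
    return target


if __name__ == "__main__":
    print(experiment_dir_name("my-experiment", datetime(2026, 1, 15, 12, 0, 0)))
    # → 20260115_120000_my-experiment
```

Working on a copy means an agent's edits to workspace/dst/ in one experiment cannot leak into another.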

Multi-Iteration Experiments

For experiments that measure improvement over multiple iterations:

name: iteration-experiment
agent: claude-code
benchmarks:
  - jq-gojq
  - charcoal-cli
iterations: 5              # Run 5 iterations per benchmark
test_each_iteration: true  # Test after each iteration (for accuracy curves)
timeout: 10800             # Higher timeout for multi-iteration
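
Testing after each iteration yields one pass rate per iteration, which is what an accuracy curve plots. The sketch below shows the aggregation only. The record fields `iteration`, `passed`, and `total` are hypothetical (the real summary.jsonl schema may differ), and the numbers are made up for illustration, not benchmark results.

```python
from collections import defaultdict


def accuracy_curve(records: list[dict]) -> list[tuple[int, float]]:
    """Return (iteration, pass rate) pairs, pooling tests across benchmarks."""
    passed: dict[int, int] = defaultdict(int)
    total: dict[int, int] = defaultdict(int)
    for rec in records:
        passed[rec["iteration"]] += rec["passed"]
        total[rec["iteration"]] += rec["total"]
    return [(i, passed[i] / total[i]) for i in sorted(total) if total[i]]


# Made-up records for two anonymous benchmarks over two iterations.
EXAMPLE = [
    {"benchmark": "a", "iteration": 1, "passed": 5, "total": 10},
    {"benchmark": "b", "iteration": 1, "passed": 3, "total": 10},
    {"benchmark": "a", "iteration": 2, "passed": 8, "total": 10},
    {"benchmark": "b", "iteration": 2, "passed": 6, "total": 10},
]

if __name__ == "__main__":
    print(accuracy_curve(EXAMPLE))  # → [(1, 0.4), (2, 0.7)]
```

Pooling by test count weights large test suites more heavily; averaging per-benchmark rates instead would be an equally valid choice.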

Output Structure

experiments/
└── 20260115_120000_my-experiment/
    ├── experiment.yml       # Config snapshot
    ├── agents.yml           # Agent config snapshot
    ├── summary.jsonl        # Results for all benchmarks
    ├── hello-world-api/
    │   ├── workspace/       # Isolated workspace copy
    │   └── agent_log.jsonl  # Agent execution log
    └── task-management-api/
        └── ...

Experiment Configs Used in Paper

The following experiment configs in experiments/ were used to generate paper results:

| Config File | Agent | Description |
|---|---|---|
| opencode-claude-full.yml | OpenCode | All benchmarks with Claude Opus 4.5 |
| opencode-openai-full.yml | OpenCode | All benchmarks with GPT-5.2 |
| test-v3-claude.yml | Claude Code | Test runs with v3 template |
| test-v3-codex.yml | Codex CLI | Test runs with v3 template |
| iteration-n5-all-per-iter.yml | Claude Code | Multi-iteration experiment (5 iterations) |

Available Benchmarks

| Benchmark | Source | Target | LOC | Tests | Description |
|---|---|---|---|---|---|
| hello-world-api | Python | Java | 14 | 6 | Minimal REST API |
| task-mgmt-api | Go | Python | 762 | 10 | CRUD task manager |
| bcal | C | Go | 2.5K | 73 | Byte calculator |
| tokei | Rust | Go | 10.1K | 196 | Code statistics tool |
| toml | Go | Python | 13.9K | 647 | TOML parser |
| charcoal-cli | TypeScript | Python | 15.8K | 195 | Git workflow tool |
| httpie-xh | Python | Rust | 19.2K | 101 | HTTP client |
| jmespath | Go | Rust | 19.3K | 888 | JSON query language |
| gitleaks | Go | Rust | 22.4K | 35 | Secret scanner |
| ledger | C++ | Go | 50.0K | 483 | Accounting |
| wabt | C++ | Rust | 54.8K | 433 | WebAssembly toolkit |
| taskwarrior | C++ | Rust | 55.3K | 912 | Task management |
| lightningcss | Rust | Go | 61.7K | 1,779 | CSS parser/minifier |
| bc | C | Rust | 117K | 1,938 | Precision calculator |
| hugo | Go | Rust | 122K | 74 | Static site generator |
| jq-gojq | C | Go | 147K | 430 | JSON processor |
| pdfcpu | Go | Rust | 160K | 141 | PDF processor |
| uncrustify | C++ | Rust | 162K | 2,024 | Code beautifier |
| prettier | JavaScript | Rust | 175K | 539 | Code formatter |
| verible | C++ | Rust | 191K | 148 | SystemVerilog tools |
| qalculate | C++ | Go | 211K | 564 | Math calculator |

Total: 1.6M LOC, 11,616 tests
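
The totals can be checked by summing the table's columns. The sketch below copies the LOC and test counts from the table above; LOC values are the rounded figures shown there (for example, 2.5K is entered as 2,500), so the LOC sum is approximate.

```python
# name: (approximate LOC as listed in the table, number of tests)
BENCHMARKS = {
    "hello-world-api": (14, 6),
    "task-mgmt-api": (762, 10),
    "bcal": (2_500, 73),
    "tokei": (10_100, 196),
    "toml": (13_900, 647),
    "charcoal-cli": (15_800, 195),
    "httpie-xh": (19_200, 101),
    "jmespath": (19_300, 888),
    "gitleaks": (22_400, 35),
    "ledger": (50_000, 483),
    "wabt": (54_800, 433),
    "taskwarrior": (55_300, 912),
    "lightningcss": (61_700, 1_779),
    "bc": (117_000, 1_938),
    "hugo": (122_000, 74),
    "jq-gojq": (147_000, 430),
    "pdfcpu": (160_000, 141),
    "uncrustify": (162_000, 2_024),
    "prettier": (175_000, 539),
    "verible": (191_000, 148),
    "qalculate": (211_000, 564),
}

total_loc = sum(loc for loc, _ in BENCHMARKS.values())
total_tests = sum(tests for _, tests in BENCHMARKS.values())

if __name__ == "__main__":
    print(len(BENCHMARKS), "benchmarks")          # → 21 benchmarks
    print(f"{total_loc / 1e6:.1f}M LOC")          # → 1.6M LOC
    print(f"{total_tests:,} tests")               # → 11,616 tests
```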


Selecting and Adding New Benchmarks

Use Claude Code commands to streamline the process:

# Step 1: Evaluate if a repo meets selection criteria
/evaluate-benchmark https://github.com/user/repo

# Step 2: If approved, add it as a benchmark
/add-benchmark https://github.com/user/repo <target-language>

For detailed criteria and manual process, see:


Reproducing Results

See REPRODUCE.md for instructions on:

  • Regenerating paper results (Tables 1-3)
  • Verifying benchmark statistics (LOC, test counts)

Citation

If you use RepoMod-Bench in your research, please cite: Xuefeng Li, Nir Ben-Israel, Yotam Raz, Belal Ahmed, Doron Serebro, and Antoine Raux. 2026. RepoMod-Bench: A Benchmark for Code Repository Modernization via Implementation-Agnostic Testing. In Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '26).

Contributing to RepoMod-Bench

We welcome contributions that expand the diversity and scale of RepoMod-Bench. By submitting a pull request, you agree that your contributions will be licensed under the project's Apache License 2.0. When adding repositories, make sure to respect their upstream licenses.