This guide explains how to add a new benchmark to the MCode Benchmark framework.
Each benchmark evaluates AI agents' ability to translate software from one language/framework to another. The framework supports two types:
| Type | Interface | Examples |
|---|---|---|
| server | HTTP request/response | Flask → Gin, Express → Spring Boot |
| cli | stdin/stdout/args/exit code | jq → gojq, httpie → xh |
```bash
python3 scripts/init_benchmark.py \
    --name "my-benchmark" \
    --type server \
    --source python \
    --target go \
    --description "Brief description" \
    --source-desc "Flask-based REST API" \
    --target-desc "Gin-based REST API" \
    --source-build "pip install -r requirements.txt" \
    --source-run "python app.py" \
    --target-build "go build -o app" \
    --target-run "./app"
```

Place your complete, working source code in `dataset/my-benchmark/workspace/src/`.
Create implementation-agnostic tests in dataset/my-benchmark/tests/.
Edit dataset/my-benchmark/Dockerfile.dev to install required language runtimes.
```bash
# Generate prompt.md
python3 generate_prompts.py --benchmark my-benchmark

# Calculate metrics
python3 scripts/generate_metadata.py --benchmark my-benchmark

# Verify source passes tests
python3 run_benchmarks.py --test src
```

After initialization, you'll have:
```
dataset/my-benchmark/
├── benchmark.yml              # Main configuration (single source of truth)
├── metadata.yml               # Benchmark metadata (auto-calculated)
├── Dockerfile.dev             # Multi-runtime development container
├── docker-compose.dev.yml     # Container orchestration
├── workspace/
│   ├── src/                   # Source implementation (given to agent)
│   ├── dst/                   # Target implementation (agent generates)
│   └── prompt.md              # Generated instructions for agent
├── tests/                     # Hidden from agent during evaluation
│   ├── conftest.py            # Pytest fixtures
│   ├── test_api.py            # Test suite
│   └── requirements.txt       # Test dependencies
└── reference/                 # Reference implementation (optional, hidden)
```
Key principle: Tests in tests/ are copied to workspace/tests/ only during evaluation, keeping them hidden from the agent.
`benchmark.yml` is the single source of truth for all benchmark configuration.

Server benchmark example:

```yaml
benchmark:
  name: "hello-world-api"
  type: server
  description: "REST API translation"
  source_language: python
  target_language: java

source:
  language: python
  description: "Flask-based REST API"
  install_cmd: "pip install -r requirements.txt"
  run_cmd: "python app.py"
  port: 3000
  port_env_var: "SERVER_PORT"
  startup_wait: 3

destination:
  language: java
  description: "Spring Boot-based REST API"
  build_cmd: "mvn clean package -DskipTests"
  run_cmd: "java -jar target/*.jar"
  port: 3000
  port_env_var: "SERVER_PORT"
  startup_wait: 8

testing:
  tool: pytest
  test_files:
    - "test_api.py"
  test_dir: "tests"
```

CLI benchmark example:

```yaml
benchmark:
  name: "jq-gojq"
  type: cli
  description: "JSON processor CLI translation"
  source_language: c
  target_language: go

source:
  language: c
  description: "jq - C implementation"
  build_cmd: "autoreconf -i && ./configure && make"
  run_cmd: "./jq"

destination:
  language: go
  description: "gojq - Go implementation"
  install_cmd: "go mod download"
  build_cmd: "go build -o gojq ./cmd/gojq"
  run_cmd: "./gojq"

testing:
  tool: pytest
  test_files:
    - "test_jq.py"
  test_dir: "tests"
```

| Field | Required | Description |
|---|---|---|
| `benchmark.name` | Yes | Unique benchmark identifier |
| `benchmark.type` | Yes | `server` or `cli` |
| `benchmark.description` | Yes | Brief description |
| `benchmark.source_language` | Yes | Source language |
| `benchmark.target_language` | Yes | Target language |
| `source.install_cmd` | No | Install dependencies |
| `source.build_cmd` | No | Build/compile command |
| `source.run_cmd` | Yes | How to run the implementation |
| `source.port` | Server only | Port number |
| `source.port_env_var` | Server only | Environment variable for port |
| `source.startup_wait` | Server only | Seconds to wait for startup |
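The field requirements in the table can be checked mechanically before running anything. The sketch below is an illustration rather than part of the framework: `lookup` and `missing_fields` are hypothetical helpers, the input is assumed to be the dictionary produced by a YAML parser (for example PyYAML's `yaml.safe_load`), and "Server only" is interpreted as "required when `benchmark.type` is `server`".

```python
REQUIRED_FIELDS = [
    "benchmark.name",
    "benchmark.type",
    "benchmark.description",
    "benchmark.source_language",
    "benchmark.target_language",
    "source.run_cmd",
]
# Assumption: "Server only" fields are mandatory for server benchmarks.
SERVER_ONLY_FIELDS = ["source.port", "source.port_env_var", "source.startup_wait"]


def lookup(config: dict, dotted: str):
    """Return the value at a dotted path such as 'source.port', or None if absent."""
    node = config
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def missing_fields(config: dict) -> list[str]:
    """List the dotted names of required fields that the config does not define."""
    required = list(REQUIRED_FIELDS)
    if lookup(config, "benchmark.type") == "server":
        required += SERVER_ONLY_FIELDS
    return [field for field in required if lookup(config, field) is None]
```

An empty result means every field marked as required in the table is present; the check says nothing about whether the values themselves are valid.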
`metadata.yml` contains benchmark metadata (some fields are auto-calculated):

```yaml
id: "001"
name: my-benchmark
type: server
source_language: python
target_language: go
description: Brief description

complexity:
  loc: 1234                      # Auto-calculated
  test_count: 20                 # Auto-calculated

architecture:
  frameworks:
    source: ["Flask"]
    target: ["Gin"]
  has_database: false
  has_authentication: false

provenance:
  source_repo: "https://github.com/..."
  source_commit: "abc123..."     # Required - exact commit SHA
  source_version: "v1.0.0"       # Optional - tag if cloned from release
  license: "MIT"

notes: "What makes this benchmark notable"
```

Tests must be converted from the source repository's existing test suite, not written from scratch. The converted tests should be implementation-agnostic, verifying behavior only through the standardized interface (HTTP for server benchmarks; stdin/stdout/args/exit code for CLI benchmarks).
- Locate tests in the source repository
- Identify tests that use the external interface (CLI args, HTTP endpoints)
- Convert to pytest format
- Exclude tests that rely on internal implementation details
Server benchmark tests:

```python
# tests/conftest.py
import pytest


def pytest_addoption(parser):
    # Port the server under test listens on.
    parser.addoption("--api-port", default="3000")


@pytest.fixture
def api_base_url(request):
    port = request.config.getoption("--api-port")
    return f"http://localhost:{port}"
```

```python
# tests/test_api.py
import requests


class TestAPI:
    def test_get_endpoint(self, api_base_url):
        response = requests.get(f"{api_base_url}/users")
        assert response.status_code == 200
        assert "users" in response.json()

    def test_post_endpoint(self, api_base_url):
        response = requests.post(
            f"{api_base_url}/users",
            json={"name": "Alice"},
        )
        assert response.status_code == 201
```

CLI benchmark tests:

```python
# tests/conftest.py
import shlex
from pathlib import Path

import pytest


def pytest_addoption(parser):
    # Both options are mandatory: where to run the binary and how to invoke it.
    parser.addoption("--work-dir", required=True)
    parser.addoption("--run-cmd", required=True)


@pytest.fixture
def work_dir(request):
    return Path(request.config.getoption("--work-dir"))


@pytest.fixture
def run_cmd(request):
    # Split the command string into an argv list, e.g. "./jq" -> ["./jq"].
    return shlex.split(request.config.getoption("--run-cmd"))
```

```python
# tests/test_cli.py
import json
import subprocess


class TestCLI:
    def test_basic_filter(self, work_dir, run_cmd):
        # The identity filter "." should echo the input document unchanged.
        result = subprocess.run(
            run_cmd + ["."],
            input='{"a": 1}',
            capture_output=True,
            text=True,
            cwd=work_dir,
        )
        assert result.returncode == 0
        assert json.loads(result.stdout) == {"a": 1}
```

Do test:
- Endpoint availability and status codes
- Input/output format and structure
- Command-line arguments and flags
- Error handling for invalid inputs
Don't test:
- Internal implementation details
- Module/plugin systems requiring external files
- Platform-specific behavior
- Debug-only features
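One rough way to catch tests that reach into internal implementation details is to scan their imports. The sketch below is a heuristic under stated assumptions, not a framework feature: `disallowed_imports` is a hypothetical helper, and `ALLOWED_MODULES` is an illustrative allowlist based on the modules used in the example tests above.

```python
import ast

# Assumption: the converted tests only need these modules; extend as required.
ALLOWED_MODULES = {"pytest", "requests", "json", "subprocess", "shlex", "pathlib", "os"}


def disallowed_imports(test_source: str) -> list[str]:
    """Return imported module names that are not on the allowlist."""
    found = set()
    for node in ast.walk(ast.parse(test_source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) are reported with their leading dots.
            names = ["." * node.level + (node.module or "")]
        else:
            continue
        for name in names:
            # Compare only the top-level package, e.g. "app" for "app.models".
            if name.split(".")[0] not in ALLOWED_MODULES:
                found.add(name)
    return sorted(found)
```

An import such as `from app.models import User` would be reported, which usually signals a test that depends on the source implementation rather than on its HTTP or CLI interface.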
Edit Dockerfile.dev to install required language runtimes:
```dockerfile
FROM ubuntu:22.04
ENV DEBIAN_FRONTEND=noninteractive

# Basic tools
RUN apt-get update && apt-get install -y \
    curl wget git vim ca-certificates gnupg \
    && rm -rf /var/lib/apt/lists/*

# Python (Ubuntu 22.04 provides Python 3.10 as python3)
RUN apt-get update && apt-get install -y python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Go (archive names include the patch version)
RUN wget https://go.dev/dl/go1.22.0.linux-amd64.tar.gz && \
    tar -C /usr/local -xzf go1.22.0.linux-amd64.tar.gz && \
    rm go1.22.0.linux-amd64.tar.gz
ENV PATH="/usr/local/go/bin:${PATH}"

# Java
RUN apt-get update && apt-get install -y openjdk-17-jdk maven \
    && rm -rf /var/lib/apt/lists/*

# Rust
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"

# Node.js (required for Claude Code)
RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash - && \
    apt-get install -y nodejs

# Claude Code
RUN npm install -g @anthropic-ai/claude-code

WORKDIR /workspace
CMD ["/bin/bash"]
```

| Argument | Required | Description |
|---|---|---|
| `--name` | Yes | Benchmark name (becomes directory name) |
| `--type` | Yes | `server` or `cli` |
| `--source` | Yes | Source language |
| `--target` | Yes | Target language |
| `--description` | Yes | Brief description |
| `--source-desc` | No | Detailed source description |
| `--target-desc` | No | Detailed target description |
| `--source-build` | No | Source build command |
| `--source-run` | No | Source run command |
| `--target-build` | No | Target build command |
| `--target-run` | No | Target run command |
| `--source-port-env` | No | Source port env var (default: `SERVER_PORT`) |
| `--target-port-env` | No | Target port env var (default: `SERVER_PORT`) |
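For readers who want to see how the argument table maps onto a command-line parser, here is a minimal `argparse` sketch. It only mirrors the table; the real `scripts/init_benchmark.py` may be implemented differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags; the real script may differ in details."""
    parser = argparse.ArgumentParser(description="Create a new benchmark from a template")
    # Required arguments
    parser.add_argument("--name", required=True, help="Benchmark name (becomes directory name)")
    parser.add_argument("--type", required=True, choices=["server", "cli"])
    parser.add_argument("--source", required=True, help="Source language")
    parser.add_argument("--target", required=True, help="Target language")
    parser.add_argument("--description", required=True, help="Brief description")
    # Optional arguments without a documented default fall back to None
    for flag in ("--source-desc", "--target-desc", "--source-build",
                 "--source-run", "--target-build", "--target-run"):
        parser.add_argument(flag)
    # Optional arguments with the default listed in the table
    parser.add_argument("--source-port-env", default="SERVER_PORT")
    parser.add_argument("--target-port-env", default="SERVER_PORT")
    return parser
```

Parsing only the five required flags leaves the optional commands as `None` and both port variables at `SERVER_PORT`, matching the defaults in the table.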
| Script | Purpose |
|---|---|
| `scripts/init_benchmark.py` | Create new benchmark from template |
| `generate_prompts.py` | Generate `prompt.md` from `benchmark.yml` |
| `scripts/generate_metadata.py` | Auto-calculate LOC and test counts |
| `run_benchmarks.py` | Run tests against implementations |
| `run_agent.py` | Run agent to generate destination code |
Before submitting a new benchmark:
- Source implementation builds successfully
- Source implementation runs without errors
- Source passes 100% of tests
- Tests are implementation-agnostic (no internal imports)
- `benchmark.yml` has all required fields
- `metadata.yml` has correct metadata
- `Dockerfile.dev` installs all required runtimes
- `prompt.md` is generated and accurate
After creating a benchmark, you can run an experiment to verify that an agent can reasonably attempt the translation task.
Create a YAML file (e.g., test_my_benchmark.yml):
```yaml
name: test-my-benchmark
agent: claude-code
benchmarks:
  - my-benchmark
timeout: 3600
description: "Test run for new benchmark"
```

Run the experiment:

```bash
python3 run_experiment.py test_my_benchmark.yml
```

Check the experiment output:
```
experiments/<timestamp>_test-my-benchmark/
├── summary.jsonl              # Results with pass rate
├── my-benchmark/
│   ├── workspace/dst/         # Agent-generated code
│   └── agent_log.jsonl        # Agent execution log
```
| Metric | Expectation |
|---|---|
| Build Success | Agent should produce code that compiles/builds |
| Pass Rate | Some tests should pass (baseline varies by complexity) |
| Agent Log | No obvious errors or misunderstandings of the task |
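A quick scan of `summary.jsonl` can flag benchmarks where nothing passed. The record schema is not specified in this guide, so the keys `benchmark` and `pass_rate` below are hypothetical placeholders, as is the helper name `zero_pass_benchmarks`; adjust them to whatever the file actually contains.

```python
import json
from pathlib import Path


def zero_pass_benchmarks(summary_path: Path) -> list[str]:
    """Return the names of benchmarks whose recorded pass rate is zero."""
    flagged = []
    for line in summary_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Hypothetical keys: "benchmark" and "pass_rate" are assumptions.
        if record.get("pass_rate", 0.0) == 0.0:
            flagged.append(record.get("benchmark", "<unknown>"))
    return flagged
```

Records that lack a pass rate are treated as zero, so an incomplete run is flagged rather than silently ignored.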
If the agent consistently fails to build or passes 0% of tests, consider:
- Is the prompt clear enough?
- Is the source implementation well-documented?
- Are the build commands correct in `benchmark.yml`?
- Is the task appropriately scoped for the target complexity?