Adding New Benchmarks

This guide explains how to add a new benchmark to the MCode Benchmark framework.

Overview

Each benchmark evaluates an AI agent's ability to translate software from one language or framework to another. Two benchmark types are supported, distinguished by the interface the tests exercise:

Type	Interface	Examples
server	HTTP request/response	Flask → Gin, Express → Spring Boot
cli	stdin/stdout/args/exit code	jq → gojq, httpie → xh

Quick Start

1. Initialize from Template

python3 scripts/init_benchmark.py \
  --name "my-benchmark" \
  --type server \
  --source python \
  --target go \
  --description "Brief description" \
  --source-desc "Flask-based REST API" \
  --target-desc "Gin-based REST API" \
  --source-build "pip install -r requirements.txt" \
  --source-run "python app.py" \
  --target-build "go build -o app" \
  --target-run "./app"

2. Add Source Implementation

Place your complete, working source code in dataset/my-benchmark/workspace/src/.

3. Write Tests

Create implementation-agnostic tests in dataset/my-benchmark/tests/ by converting the source repository's existing test suite (see Converting Tests).

4. Update Dockerfile

Edit dataset/my-benchmark/Dockerfile.dev to install required language runtimes.

5. Generate and Verify

# Generate prompt.md
python3 generate_prompts.py --benchmark my-benchmark

# Calculate metrics
python3 scripts/generate_metadata.py --benchmark my-benchmark

# Verify source passes tests
python3 run_benchmarks.py --test src
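The three commands above can be chained so that a failure in one step stops the rest. The following is a minimal sketch, not part of the framework: `verify_commands` and `run_all` are hypothetical helpers, and the script names and flags are copied from the commands above. It assumes it is run from the repository root.

```python
import subprocess
import sys


def verify_commands(benchmark: str) -> list[list[str]]:
    """Return the generate-and-verify commands from this guide, in order."""
    return [
        [sys.executable, "generate_prompts.py", "--benchmark", benchmark],
        [sys.executable, "scripts/generate_metadata.py", "--benchmark", benchmark],
        # As written in this guide, the test step takes no --benchmark flag.
        [sys.executable, "run_benchmarks.py", "--test", "src"],
    ]


def run_all(commands: list[list[str]]) -> int:
    """Run each command in turn; return the first non-zero exit code, else 0."""
    for cmd in commands:
        print("$", " ".join(cmd))
        code = subprocess.run(cmd).returncode
        if code != 0:
            return code
    return 0
```

Calling `run_all(verify_commands("my-benchmark"))` runs the steps in order and returns the exit code of the first step that fails.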

Directory Structure

After initialization, you'll have:

dataset/my-benchmark/
├── benchmark.yml          # Main configuration (single source of truth)
├── metadata.yml           # Benchmark metadata (auto-calculated)
├── Dockerfile.dev         # Multi-runtime development container
├── docker-compose.dev.yml # Container orchestration
├── workspace/
│   ├── src/               # Source implementation (given to agent)
│   ├── dst/               # Target implementation (agent generates)
│   └── prompt.md          # Generated instructions for agent
├── tests/                 # Hidden from agent during evaluation
│   ├── conftest.py        # Pytest fixtures
│   ├── test_api.py        # Test suite
│   └── requirements.txt   # Test dependencies
└── reference/             # Reference implementation (optional, hidden)

Key principle: Tests in tests/ are copied to workspace/tests/ only during evaluation, keeping them hidden from the agent.

Configuration Files

benchmark.yml

This is the single source of truth for all benchmark configuration.

Server Example

benchmark:
  name: "hello-world-api"
  type: server
  description: "REST API translation"
  source_language: python
  target_language: java

source:
  language: python
  description: "Flask-based REST API"
  install_cmd: "pip install -r requirements.txt"
  run_cmd: "python app.py"
  port: 3000
  port_env_var: "SERVER_PORT"
  startup_wait: 3

destination:
  language: java
  description: "Spring Boot-based REST API"
  build_cmd: "mvn clean package -DskipTests"
  run_cmd: "java -jar target/*.jar"
  port: 3000
  port_env_var: "SERVER_PORT"
  startup_wait: 8

testing:
  tool: pytest
  test_files:
    - "test_api.py"
  test_dir: "tests"

CLI Example

benchmark:
  name: "jq-gojq"
  type: cli
  description: "JSON processor CLI translation"
  source_language: c
  target_language: go

source:
  language: c
  description: "jq - C implementation"
  build_cmd: "autoreconf -i && ./configure && make"
  run_cmd: "./jq"

destination:
  language: go
  description: "gojq - Go implementation"
  install_cmd: "go mod download"
  build_cmd: "go build -o gojq ./cmd/gojq"
  run_cmd: "./gojq"

testing:
  tool: pytest
  test_files:
    - "test_jq.py"
  test_dir: "tests"

Configuration Reference

Field	Required	Description
`benchmark.name`	Yes	Unique benchmark identifier
`benchmark.type`	Yes	`server` or `cli`
`benchmark.description`	Yes	Brief description
`benchmark.source_language`	Yes	Source language
`benchmark.target_language`	Yes	Target language
`source.install_cmd`	No	Install dependencies
`source.build_cmd`	No	Build/compile command
`source.run_cmd`	Yes	How to run the implementation
`source.port`	Server only	Port number
`source.port_env_var`	Server only	Environment variable for port
`source.startup_wait`	Server only	Seconds to wait for startup
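The table above can be turned into a simple pre-flight check. The sketch below takes an already-parsed benchmark.yml as a plain dictionary (for example, the result of a YAML loader) and reports which fields are missing. It assumes that "Server only" means "required when `benchmark.type` is `server`"; `lookup` and `missing_fields` are hypothetical names, not framework functions.

```python
REQUIRED_FIELDS = [
    "benchmark.name",
    "benchmark.type",
    "benchmark.description",
    "benchmark.source_language",
    "benchmark.target_language",
    "source.run_cmd",
]
# Assumption: these are mandatory only for server benchmarks.
SERVER_ONLY_FIELDS = ["source.port", "source.port_env_var", "source.startup_wait"]


def lookup(config: dict, dotted: str):
    """Resolve a dotted path such as 'source.run_cmd'; return None if absent."""
    node = config
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def missing_fields(config: dict) -> list[str]:
    """List the required fields from the table that are absent or empty."""
    required = list(REQUIRED_FIELDS)
    if lookup(config, "benchmark.type") == "server":
        required += SERVER_ONLY_FIELDS
    return [field for field in required if lookup(config, field) in (None, "")]
```

An empty list means every required field from the table is present.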

metadata.yml

Contains benchmark metadata (some fields auto-calculated):

id: "001"
name: my-benchmark
type: server
source_language: python
target_language: go
description: Brief description

complexity:
  loc: 1234           # Auto-calculated
  test_count: 20      # Auto-calculated

architecture:
  frameworks:
    source: ["Flask"]
    target: ["Gin"]
  has_database: false
  has_authentication: false

provenance:
  source_repo: "https://github.com/..."
  source_commit: "abc123..."  # Required - exact commit SHA
  source_version: "v1.0.0"    # Optional - tag if cloned from release
  license: "MIT"

notes: "What makes this benchmark notable"

Converting Tests

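Tests must be converted from the source repository's existing test suite, not written from scratch. The converted tests should be implementation-agnostic: they verify behavior only through the benchmark's external interface (HTTP requests for server benchmarks; arguments, stdin/stdout, and exit codes for cli benchmarks).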

Conversion Process

1. Locate tests in the source repository
2. Identify tests that use the external interface (CLI args, HTTP endpoints)
3. Convert to pytest format
4. Exclude tests that rely on internal implementation details

Server Tests

# tests/conftest.py
import pytest

def pytest_addoption(parser):
    parser.addoption("--api-port", default="3000")

@pytest.fixture
def api_base_url(request):
    port = request.config.getoption("--api-port")
    return f"http://localhost:{port}"

# tests/test_api.py
import requests

class TestAPI:
    def test_get_endpoint(self, api_base_url):
        # A timeout makes a hung server fail the test instead of blocking the run
        response = requests.get(f"{api_base_url}/users", timeout=5)
        assert response.status_code == 200
        assert "users" in response.json()

    def test_post_endpoint(self, api_base_url):
        response = requests.post(
            f"{api_base_url}/users",
            json={"name": "Alice"},
            timeout=5,
        )
        assert response.status_code == 201
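The fixtures above assume the server is already listening when the first test runs, which is what `startup_wait` in benchmark.yml allows for. As an alternative to a fixed sleep, a test harness could poll the port until it accepts connections. This is a sketch of that idea only; `wait_for_port` is a hypothetical helper and is not something the framework is known to provide.

```python
import socket
import time


def wait_for_port(port: int, host: str = "localhost", timeout: float = 10.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Connection refused or timed out: the server is not up yet.
            time.sleep(0.2)
    return False
```

A session-scoped fixture could call `wait_for_port(3000)` once and fail fast with a clear message if it returns `False`.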

CLI Tests

# tests/conftest.py
import pytest
import shlex
from pathlib import Path

def pytest_addoption(parser):
    parser.addoption("--work-dir", required=True)
    parser.addoption("--run-cmd", required=True)

@pytest.fixture
def work_dir(request):
    return Path(request.config.getoption("--work-dir"))

@pytest.fixture
def run_cmd(request):
    return shlex.split(request.config.getoption("--run-cmd"))

# tests/test_cli.py
import subprocess
import json

class TestCLI:
    def test_basic_filter(self, work_dir, run_cmd):
        result = subprocess.run(
            run_cmd + ["."],
            input='{"a": 1}',
            capture_output=True,
            text=True,
            cwd=work_dir
        )
        assert result.returncode == 0
        assert json.loads(result.stdout) == {"a": 1}
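CLI tests tend to repeat the same `subprocess.run` boilerplate, and error-handling cases need the exit code and stderr as well as stdout. The sketch below wraps that pattern in a hypothetical `run_cli` helper. `FAKE_CLI` is a stand-in program used purely for illustration; in a real test the command comes from the `run_cmd` fixture.

```python
import subprocess
import sys


def run_cli(run_cmd, args, stdin="", cwd=None, timeout=10):
    """Run the CLI under test and return (exit code, stdout, stderr)."""
    result = subprocess.run(
        list(run_cmd) + list(args),
        input=stdin,
        capture_output=True,
        text=True,
        cwd=cwd,
        timeout=timeout,  # a hung CLI raises TimeoutExpired instead of blocking
    )
    return result.returncode, result.stdout, result.stderr


# Stand-in CLI for illustration only: echoes stdin, or exits 2 on "--bad-flag".
FAKE_CLI = [
    sys.executable,
    "-c",
    "import sys\n"
    "if '--bad-flag' in sys.argv[1:]:\n"
    "    sys.stderr.write('unknown flag\\n'); sys.exit(2)\n"
    "sys.stdout.write(sys.stdin.read())",
]
```

An error-handling test then asserts on all three values, for example that an unknown flag yields a non-zero exit code, empty stdout, and a message on stderr.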

Test Guidelines

Do test:

Endpoint availability and status codes
Input/output format and structure
Command-line arguments and flags
Error handling for invalid inputs

Don't test:

Internal implementation details
Module/plugin systems requiring external files
Platform-specific behavior
Debug-only features

Dockerfile Setup

Edit Dockerfile.dev to install required language runtimes:

FROM ubuntu:22.04

ENV DEBIAN_FRONTEND=noninteractive

# Basic tools
RUN apt-get update && apt-get install -y \
    curl wget git vim ca-certificates gnupg \
    && rm -rf /var/lib/apt/lists/*

# Python (Ubuntu 22.04 ships Python 3.10; python3.12 is not in its default repositories)
RUN apt-get update && apt-get install -y python3 python3-pip

# Go
RUN wget https://go.dev/dl/go1.22.0.linux-amd64.tar.gz && \
    tar -C /usr/local -xzf go1.22.0.linux-amd64.tar.gz && \
    rm go1.22.0.linux-amd64.tar.gz
ENV PATH="/usr/local/go/bin:${PATH}"

# Java
RUN apt-get update && apt-get install -y openjdk-17-jdk maven

# Rust
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"

# Node.js (required for Claude Code)
RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash - && \
    apt-get install -y nodejs

# Claude Code
RUN npm install -g @anthropic-ai/claude-code

WORKDIR /workspace
CMD ["/bin/bash"]

init_benchmark.py Arguments

Argument	Required	Description
`--name`	Yes	Benchmark name (becomes directory name)
`--type`	Yes	`server` or `cli`
`--source`	Yes	Source language
`--target`	Yes	Target language
`--description`	Yes	Brief description
`--source-desc`	No	Detailed source description
`--target-desc`	No	Detailed target description
`--source-build`	No	Source build command
`--source-run`	No	Source run command
`--target-build`	No	Target build command
`--target-run`	No	Target run command
`--source-port-env`	No	Source port env var (default: SERVER_PORT)
`--target-port-env`	No	Target port env var (default: SERVER_PORT)

Helper Scripts

Script	Purpose
`scripts/init_benchmark.py`	Create new benchmark from template
`generate_prompts.py`	Generate `prompt.md` from `benchmark.yml`
`scripts/generate_metadata.py`	Auto-calculate LOC and test counts
`run_benchmarks.py`	Run tests against implementations
`run_agent.py`	Run agent to generate destination code

Verification Checklist

Before submitting a new benchmark:

Source implementation builds successfully
Source implementation runs without errors
Source passes 100% of tests
Tests are implementation-agnostic (no internal imports)
benchmark.yml has all required fields
metadata.yml has correct metadata
Dockerfile.dev installs all required runtimes
prompt.md is generated and accurate
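The "no internal imports" item can be checked mechanically. The following sketch parses a test file and reports any imported top-level module that is not on an allowlist. The allowlist is an assumption based on the imports used in the examples in this guide, and `disallowed_imports` is a hypothetical helper; adjust both per benchmark.

```python
import ast

# Assumption: modules a black-box test may import. Adjust per benchmark.
ALLOWED_MODULES = {
    "pytest", "requests", "subprocess", "json", "shlex", "pathlib", "os", "sys",
}


def disallowed_imports(source: str, allowed=ALLOWED_MODULES) -> list[str]:
    """Return top-level module names imported by `source` that are not allowed."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) are reported as "." because they
            # always point at code next to the tests.
            found.add("." if node.level else (node.module or "").split(".")[0])
    return sorted(found - set(allowed))
```

Running it over every file in tests/ and requiring an empty result gives a quick signal that the suite touches the implementation only through its external interface.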

Optional: Run Experiment to Verify Agent Performance

After creating a benchmark, you can run an experiment to verify that an agent can reasonably attempt the translation task.

1. Create an Experiment Config

Create a YAML file (e.g., test_my_benchmark.yml):

name: test-my-benchmark
agent: claude-code
benchmarks:
  - my-benchmark
timeout: 3600
description: "Test run for new benchmark"

2. Run the Experiment

python3 run_experiment.py test_my_benchmark.yml

3. Review Results

Check the experiment output:

experiments/<timestamp>_test-my-benchmark/
├── summary.jsonl        # Results with pass rate
├── my-benchmark/
│   ├── workspace/dst/   # Agent-generated code
│   └── agent_log.jsonl  # Agent execution log
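Because summary.jsonl is a JSON Lines file, pass rates can be pulled out with a few lines of Python. The record schema is not documented here, so the field names `benchmark` and `pass_rate` below are hypothetical placeholders; inspect a real summary.jsonl and adjust them. `load_pass_rates` is likewise a name chosen for this sketch.

```python
import json


def load_pass_rates(lines, name_key="benchmark", rate_key="pass_rate"):
    """Map benchmark name to pass rate from JSON Lines records.

    name_key and rate_key are hypothetical field names; check a real
    summary.jsonl and adjust them.
    """
    rates = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Skip records that do not carry both fields.
        if name_key in record and rate_key in record:
            rates[record[name_key]] = float(record[rate_key])
    return rates
```

Since the function accepts any iterable of lines, an open file object can be passed directly: `with open(path) as f: rates = load_pass_rates(f)`.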

What to Look For

Metric	Expectation
Build Success	Agent should produce code that compiles/builds
Pass Rate	Some tests should pass (baseline varies by complexity)
Agent Log	No obvious errors or misunderstandings of the task

If the agent consistently fails to build or passes 0% of tests, consider:

Is the prompt clear enough?
Is the source implementation well-documented?
Are the build commands correct in benchmark.yml?
Is the task appropriately scoped for the target complexity?
