Adding New Benchmarks

This guide explains how to add a new benchmark to the MCode Benchmark framework.

Overview

Each benchmark evaluates an AI agent's ability to translate software from one language or framework to another. The framework supports two benchmark types:

| Type | Interface | Examples |
|------|-----------|----------|
| server | HTTP request/response | Flask → Gin, Express → Spring Boot |
| cli | stdin/stdout/args/exit code | jq → gojq, httpie → xh |

Quick Start

1. Initialize from Template

python3 scripts/init_benchmark.py \
  --name "my-benchmark" \
  --type server \
  --source python \
  --target go \
  --description "Brief description" \
  --source-desc "Flask-based REST API" \
  --target-desc "Gin-based REST API" \
  --source-build "pip install -r requirements.txt" \
  --source-run "python app.py" \
  --target-build "go build -o app" \
  --target-run "./app"

2. Add Source Implementation

Place your complete, working source code in dataset/my-benchmark/workspace/src/.

3. Write Tests

Create implementation-agnostic tests in dataset/my-benchmark/tests/ by converting the source repository's existing test suite (see Converting Tests).

4. Update Dockerfile

Edit dataset/my-benchmark/Dockerfile.dev to install required language runtimes.

5. Generate and Verify

# Generate prompt.md
python3 generate_prompts.py --benchmark my-benchmark

# Calculate metrics
python3 scripts/generate_metadata.py --benchmark my-benchmark

# Verify source passes tests
python3 run_benchmarks.py --test src

Directory Structure

After initialization, you'll have:

dataset/my-benchmark/
├── benchmark.yml          # Main configuration (single source of truth)
├── metadata.yml           # Benchmark metadata (auto-calculated)
├── Dockerfile.dev         # Multi-runtime development container
├── docker-compose.dev.yml # Container orchestration
├── workspace/
│   ├── src/               # Source implementation (given to agent)
│   ├── dst/               # Target implementation (agent generates)
│   └── prompt.md          # Generated instructions for agent
├── tests/                 # Hidden from agent during evaluation
│   ├── conftest.py        # Pytest fixtures
│   ├── test_api.py        # Test suite
│   └── requirements.txt   # Test dependencies
└── reference/             # Reference implementation (optional, hidden)

Key principle: Tests in tests/ are copied to workspace/tests/ only during evaluation, keeping them hidden from the agent.


Configuration Files

benchmark.yml

This is the single source of truth for all benchmark configuration.

Server Example

benchmark:
  name: "hello-world-api"
  type: server
  description: "REST API translation"
  source_language: python
  target_language: java

source:
  language: python
  description: "Flask-based REST API"
  install_cmd: "pip install -r requirements.txt"
  run_cmd: "python app.py"
  port: 3000
  port_env_var: "SERVER_PORT"
  startup_wait: 3

destination:
  language: java
  description: "Spring Boot-based REST API"
  build_cmd: "mvn clean package -DskipTests"
  run_cmd: "java -jar target/*.jar"
  port: 3000
  port_env_var: "SERVER_PORT"
  startup_wait: 8

testing:
  tool: pytest
  test_files:
    - "test_api.py"
  test_dir: "tests"
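
To make the roles of port, port_env_var, and startup_wait concrete, here is a minimal sketch of how a harness could launch a server from these settings. It is an assumption about usage, not the framework's actual runner: `wait_for_port` and `start_server` are hypothetical names, and the sketch polls the port instead of sleeping for a fixed startup_wait.

```python
import os
import socket
import subprocess
import time


def wait_for_port(port: int, host: str = "127.0.0.1", timeout: float = 8.0) -> bool:
    """Poll until something accepts TCP connections on host:port, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            time.sleep(0.1)
    return False


def start_server(run_cmd: list[str], cwd: str, port: int,
                 port_env_var: str, startup_wait: float) -> subprocess.Popen:
    """Launch run_cmd with the port exported via port_env_var, then wait for it."""
    env = {**os.environ, port_env_var: str(port)}
    proc = subprocess.Popen(run_cmd, cwd=cwd, env=env)
    if not wait_for_port(port, timeout=startup_wait):
        proc.terminate()
        raise RuntimeError(f"server did not listen on port {port} within {startup_wait}s")
    return proc
```

With the configuration above, the source server would be started roughly as `start_server(["python", "app.py"], "workspace/src", 3000, "SERVER_PORT", 3)`.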

CLI Example

benchmark:
  name: "jq-gojq"
  type: cli
  description: "JSON processor CLI translation"
  source_language: c
  target_language: go

source:
  language: c
  description: "jq - C implementation"
  build_cmd: "autoreconf -i && ./configure && make"
  run_cmd: "./jq"

destination:
  language: go
  description: "gojq - Go implementation"
  install_cmd: "go mod download"
  build_cmd: "go build -o gojq ./cmd/gojq"
  run_cmd: "./gojq"

testing:
  tool: pytest
  test_files:
    - "test_jq.py"
  test_dir: "tests"

Configuration Reference

| Field | Required | Description |
|-------|----------|-------------|
| benchmark.name | Yes | Unique benchmark identifier |
| benchmark.type | Yes | server or cli |
| benchmark.description | Yes | Brief description |
| benchmark.source_language | Yes | Source language |
| benchmark.target_language | Yes | Target language |
| source.install_cmd | No | Install dependencies |
| source.build_cmd | No | Build/compile command |
| source.run_cmd | Yes | How to run the implementation |
| source.port | Server only | Port number |
| source.port_env_var | Server only | Environment variable for port |
| source.startup_wait | Server only | Seconds to wait for startup |
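
The required-field rules in this table can be checked mechanically. The sketch below works on an already-parsed configuration (for example, the dict returned by PyYAML's `yaml.safe_load`); `missing_fields` is a hypothetical helper, and it covers only the fields listed in the table.

```python
# Fields the reference table marks as always required.
REQUIRED = [
    ("benchmark", "name"),
    ("benchmark", "type"),
    ("benchmark", "description"),
    ("benchmark", "source_language"),
    ("benchmark", "target_language"),
    ("source", "run_cmd"),
]

# Fields the table marks as required for server benchmarks only.
SERVER_ONLY = [
    ("source", "port"),
    ("source", "port_env_var"),
    ("source", "startup_wait"),
]


def missing_fields(config: dict) -> list[str]:
    """Return dotted names of required fields absent from a parsed benchmark.yml."""
    required = list(REQUIRED)
    if config.get("benchmark", {}).get("type") == "server":
        required += SERVER_ONLY
    return [
        f"{section}.{key}"
        for section, key in required
        if key not in config.get(section, {})
    ]
```

An empty result means every field the table requires for that benchmark type is present.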

metadata.yml

Contains benchmark metadata (some fields auto-calculated):

id: "001"
name: my-benchmark
type: server
source_language: python
target_language: go
description: Brief description

complexity:
  loc: 1234           # Auto-calculated
  test_count: 20      # Auto-calculated

architecture:
  frameworks:
    source: ["Flask"]
    target: ["Gin"]
  has_database: false
  has_authentication: false

provenance:
  source_repo: "https://github.com/..."
  source_commit: "abc123..."  # Required - exact commit SHA
  source_version: "v1.0.0"    # Optional - tag if cloned from release
  license: "MIT"

notes: "What makes this benchmark notable"

Converting Tests

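Tests must be converted from the source repository's existing test suite, not written from scratch. The converted tests should be implementation-agnostic: they verify behavior only through the benchmark's external interface (HTTP requests for server benchmarks; arguments, stdin/stdout, and exit codes for CLI benchmarks).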

Conversion Process

  1. Locate tests in the source repository
  2. Identify tests that use the external interface (CLI args, HTTP endpoints)
  3. Convert to pytest format
  4. Exclude tests that rely on internal implementation details

Server Tests

# tests/conftest.py
import pytest

def pytest_addoption(parser):
    parser.addoption("--api-port", default="3000")

@pytest.fixture
def api_base_url(request):
    port = request.config.getoption("--api-port")
    return f"http://localhost:{port}"

# tests/test_api.py
import requests

class TestAPI:
    def test_get_endpoint(self, api_base_url):
        # A timeout keeps the suite from hanging if the server never responds.
        response = requests.get(f"{api_base_url}/users", timeout=10)
        assert response.status_code == 200
        assert "users" in response.json()

    def test_post_endpoint(self, api_base_url):
        response = requests.post(
            f"{api_base_url}/users",
            json={"name": "Alice"},
            timeout=10,
        )
        assert response.status_code == 201

CLI Tests

# tests/conftest.py
import pytest
import shlex
from pathlib import Path

def pytest_addoption(parser):
    parser.addoption("--work-dir", required=True)
    parser.addoption("--run-cmd", required=True)

@pytest.fixture
def work_dir(request):
    return Path(request.config.getoption("--work-dir"))

@pytest.fixture
def run_cmd(request):
    return shlex.split(request.config.getoption("--run-cmd"))

# tests/test_cli.py
import subprocess
import json

class TestCLI:
    def test_basic_filter(self, work_dir, run_cmd):
        result = subprocess.run(
            run_cmd + ["."],
            input='{"a": 1}',
            capture_output=True,
            text=True,
            cwd=work_dir
        )
        assert result.returncode == 0
        assert json.loads(result.stdout) == {"a": 1}
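
When many converted tests repeat the same subprocess call, a small wrapper keeps them focused on the interface (arguments, stdin, stdout, exit code). The sketch below is one possible shape; `run_cli` and `CLIResult` are hypothetical names, not framework APIs, and the timeout default is an arbitrary choice.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class CLIResult:
    returncode: int
    stdout: str
    stderr: str


def run_cli(run_cmd, args, stdin="", cwd=None, timeout=10.0) -> CLIResult:
    """Run the CLI under test once and capture everything its interface exposes."""
    proc = subprocess.run(
        [*run_cmd, *args],
        input=stdin,
        capture_output=True,
        text=True,
        cwd=cwd,
        timeout=timeout,  # fail fast instead of hanging on a stuck process
    )
    return CLIResult(proc.returncode, proc.stdout, proc.stderr)
```

A test can then call `run_cli(run_cmd, ["."], stdin='{"a": 1}', cwd=work_dir)` and assert on `returncode` and `stdout`, exactly as in the example above.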

Test Guidelines

Do test:

  • Endpoint availability and status codes
  • Input/output format and structure
  • Command-line arguments and flags
  • Error handling for invalid inputs

Don't test:

  • Internal implementation details
  • Module/plugin systems requiring external files
  • Platform-specific behavior
  • Debug-only features

Dockerfile Setup

Edit Dockerfile.dev to install required language runtimes:

FROM ubuntu:22.04

ENV DEBIAN_FRONTEND=noninteractive

# Basic tools
RUN apt-get update && apt-get install -y \
    curl wget git vim ca-certificates gnupg \
    && rm -rf /var/lib/apt/lists/*

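# Python (Ubuntu 22.04 ships Python 3.10; python3.12 is not in its default repositories)
RUN apt-get update && apt-get install -y python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*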

# Go (release archives use the full version number, e.g. 1.22.0)
RUN wget https://go.dev/dl/go1.22.0.linux-amd64.tar.gz && \
    tar -C /usr/local -xzf go1.22.0.linux-amd64.tar.gz && \
    rm go1.22.0.linux-amd64.tar.gz
ENV PATH="/usr/local/go/bin:${PATH}"

# Java
RUN apt-get update && apt-get install -y openjdk-17-jdk maven

# Rust
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"

# Node.js (required for Claude Code)
RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash - && \
    apt-get install -y nodejs

# Claude Code
RUN npm install -g @anthropic-ai/claude-code

WORKDIR /workspace
CMD ["/bin/bash"]

init_benchmark.py Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| --name | Yes | Benchmark name (becomes directory name) |
| --type | Yes | server or cli |
| --source | Yes | Source language |
| --target | Yes | Target language |
| --description | Yes | Brief description |
| --source-desc | No | Detailed source description |
| --target-desc | No | Detailed target description |
| --source-build | No | Source build command |
| --source-run | No | Source run command |
| --target-build | No | Target build command |
| --target-run | No | Target run command |
| --source-port-env | No | Source port env var (default: SERVER_PORT) |
| --target-port-env | No | Target port env var (default: SERVER_PORT) |

Helper Scripts

| Script | Purpose |
|--------|---------|
| scripts/init_benchmark.py | Create new benchmark from template |
| generate_prompts.py | Generate prompt.md from benchmark.yml |
| scripts/generate_metadata.py | Auto-calculate LOC and test counts |
| run_benchmarks.py | Run tests against implementations |
| run_agent.py | Run agent to generate destination code |

Verification Checklist

Before submitting a new benchmark:

  • Source implementation builds successfully
  • Source implementation runs without errors
  • Source passes 100% of tests
  • Tests are implementation-agnostic (no internal imports)
  • benchmark.yml has all required fields
  • metadata.yml has correct metadata
  • Dockerfile.dev installs all required runtimes
  • prompt.md is generated and accurate
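
The "no internal imports" item can be partly automated. The sketch below parses each test file with Python's `ast` module and reports imports outside an allowlist. The allowlist and both function names are assumptions for illustration; adjust the allowlist to whatever the test suite legitimately depends on.

```python
import ast
from pathlib import Path

# Assumed allowlist of top-level modules a black-box test suite may import.
ALLOWED_TOP_LEVEL = {
    "pytest", "requests", "subprocess", "json", "shlex", "pathlib", "os", "sys", "re",
}


def disallowed_imports(source: str, allowed: set[str] = ALLOWED_TOP_LEVEL) -> list[str]:
    """Return imported module names whose top-level package is not in allowed."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) always point into local code.
            names = ["." * node.level + (node.module or "")]
        else:
            continue
        found.extend(name for name in names if name.split(".")[0] not in allowed)
    return found


def scan_tests(tests_dir: Path) -> dict[str, list[str]]:
    """Map each offending test file to the imports that look internal."""
    report = {}
    for path in sorted(tests_dir.rglob("*.py")):
        bad = disallowed_imports(path.read_text(encoding="utf-8"))
        if bad:
            report[str(path)] = bad
    return report
```

An empty report from `scan_tests(Path("dataset/my-benchmark/tests"))` suggests the tests touch the implementation only through its external interface; it does not prove it, since a test could still read implementation files at runtime.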

Optional: Run Experiment to Verify Agent Performance

After creating a benchmark, you can run an experiment to verify that an agent can reasonably attempt the translation task.

1. Create an Experiment Config

Create a YAML file (e.g., test_my_benchmark.yml):

name: test-my-benchmark
agent: claude-code
benchmarks:
  - my-benchmark
timeout: 3600
description: "Test run for new benchmark"

2. Run the Experiment

python3 run_experiment.py test_my_benchmark.yml

3. Review Results

Check the experiment output:

experiments/<timestamp>_test-my-benchmark/
├── summary.jsonl        # Results with pass rate
└── my-benchmark/
    ├── workspace/dst/   # Agent-generated code
    └── agent_log.jsonl  # Agent execution log
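
Since summary.jsonl is a JSON Lines file, pass rates can be pulled out with a few lines of Python. The record keys used below ("benchmark", "passed", "total") are hypothetical, because the actual schema is not documented here; inspect one line of a real summary.jsonl and adapt the keys before relying on this.

```python
import json
from pathlib import Path


def pass_rates(summary_path: Path) -> dict[str, float]:
    """Map each benchmark name to passed/total from a JSON Lines summary file.

    The keys "benchmark", "passed" and "total" are assumed for illustration.
    """
    rates = {}
    with summary_path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines between records
            record = json.loads(line)
            total = record["total"]
            rates[record["benchmark"]] = record["passed"] / total if total else 0.0
    return rates
```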

What to Look For

| Metric | Expectation |
|--------|-------------|
| Build Success | Agent should produce code that compiles/builds |
| Pass Rate | Some tests should pass (baseline varies by complexity) |
| Agent Log | No obvious errors or misunderstandings of the task |

If the agent consistently fails to build or passes 0% of tests, consider:

  • Is the prompt clear enough?
  • Is the source implementation well-documented?
  • Are the build commands correct in benchmark.yml?
  • Is the task appropriately scoped for the target complexity?