CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Repository Overview

MCode Benchmark is an evaluation framework for software translation across programming languages and frameworks. It evaluates AI agents' ability to translate projects from one language/framework to another while maintaining functional equivalence.

Supported Project Types

Projects must expose a naturally standardized interface; no adapters or wrappers should be needed.

Type	Interface	Example
CLI Tools	stdin/stdout/args/exit code	jq → gojq
REST APIs	HTTP request/response	Flask → Gin
GraphQL APIs	HTTP + GraphQL query	Apollo → gqlgen
gRPC Services	Protobuf over HTTP/2	Python gRPC → Go gRPC
File Processors	File(s) in → File(s) out	Compilers, converters

Principle: The interface itself must be the contract. If you need to build an adapter layer, the project doesn't fit.

Core Architecture

Benchmark Structure

Each benchmark follows a standardized structure:

workspace/: The agent's working environment (mounted to container)
- src/: Complete source implementation
- dst/: Target implementation (empty, for agent to generate)
- prompt.md: Translation task instructions
reference/: Pre-existing reference implementation (hidden from agent, not mounted)
tests/: Integration tests using pytest (hidden from agent during development)
benchmark.yml: Single source of truth for configuration
metadata.yml: Benchmark metadata (LOC, test count, frameworks, provenance)
Dockerfile.dev: Multi-runtime development container
docker-compose.dev.yml: Container orchestration
generate_prompt.py: Generates prompt.md from benchmark.yml
generate_all_tests.py: (Optional) Generates tests from source test suite

Key Design Principles

Tests are hidden: The tests/ directory sits outside workspace/ and is copied in only during evaluation
Single source of truth: All configuration lives in benchmark.yml
Implementation-agnostic tests: Tests verify behavior through standardized interfaces
Standardized metrics: Build Success (0/1) and Pass Rate (0.0-1.0)
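
As an illustration of how the two metrics relate, here is a minimal sketch. The names `Metrics` and `score_run` are hypothetical and not taken from the repository; the sketch assumes a failed build forces the Pass Rate to 0.0.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    build_success: int  # 0 or 1
    pass_rate: float    # 0.0 to 1.0


def score_run(build_ok: bool, passed: int, total: int) -> Metrics:
    """Hypothetical helper: derive both metrics from one evaluation run."""
    if not build_ok or total == 0:
        # A failed build (or an empty test suite) yields a Pass Rate of 0.0.
        return Metrics(build_success=int(build_ok), pass_rate=0.0)
    return Metrics(build_success=1, pass_rate=passed / total)
```

For example, `score_run(True, 15, 20)` gives a Build Success of 1 and a Pass Rate of 0.75.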

Template System

The benchmark-skeleton/ directory contains:

Reusable scripts that work for all benchmarks
Template files with placeholders ({{BENCHMARK_NAME}}, {{SOURCE_LANG}}, etc.)
Example test files to be replaced with actual tests
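
Placeholder substitution of this kind can be sketched as follows. This is not the code used by init_benchmark.py; `render_template` is a hypothetical name, and raising on an unknown placeholder is a design choice made for the example.

```python
import re

# Matches placeholders such as {{BENCHMARK_NAME}} or {{SOURCE_LANG}}.
PLACEHOLDER = re.compile(r"\{\{([A-Z_]+)\}\}")


def render_template(text: str, values: dict) -> str:
    """Replace {{NAME}} placeholders; fail loudly on any unknown name."""
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError("no value for placeholder " + key)
        return values[key]

    return PLACEHOLDER.sub(substitute, text)
```

Failing on a missing value, rather than leaving the placeholder in place, keeps half-rendered templates from ending up in a new benchmark directory.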

Common Commands

Create a New Benchmark

python3 scripts/init_benchmark.py \
  --name "your-benchmark-name" \
  --type server \
  --source python \
  --target go \
  --description "Brief description" \
  --source-desc "Flask-based" \
  --target-desc "Gin-based" \
  --source-build "pip install -r requirements.txt" \
  --target-build "go build -o app"

Arguments:

--type: Required. Either server (REST APIs) or cli (command-line tools)
--source-port-env: Environment variable for source port (default: SERVER_PORT)
--target-port-env: Environment variable for target port (default: SERVER_PORT)

This creates dataset/your-benchmark-name/ with all necessary files and placeholders configured.

Generate Prompt File

From within a benchmark directory:

python3 generate_prompt.py

This creates workspace/prompt.md from benchmark.yml configuration.

Run Benchmark Tests

From within a benchmark directory:

python3 test.py --test both    # Test source and destination
python3 test.py --test src     # Test source only
python3 test.py --test dst     # Test destination only

This:

Creates temporary workspace with tests
Starts the dev container
Builds and tests implementation(s)
Reports metrics (Build Success, Pass Rate)
Cleans up

Workflow for Creating New Benchmarks

REST API Benchmarks

Initialize: Use init_benchmark.py to create from template
Edit Dockerfile.dev: Install both language runtimes
Implement source: Create complete working API in workspace/src/
Create destination stub: Either leave workspace/dst/ empty or add minimal scaffolding
Write tests: Create tests/test_api.py with pytest tests
Generate prompt: Run generate_prompt.py
Verify: Run python3 test.py --test both and confirm that the source passes and the destination fails

CLI Tool Benchmarks

Initialize: Use init_benchmark.py with --type cli to create the benchmark structure
Clone source: Clone source repo to workspace/src/
Clone destination: Clone reference implementation to workspace/dst/
Generate tests: Create generate_all_tests.py to parse source test suite
Filter tests: Remove non-implementation-agnostic tests:
- Module/plugin systems (require external files)
- Internal/debug features (not public API)
- Platform-specific behavior
Verify: Source should pass 100%, destination shows baseline compatibility

benchmark.yml Configuration

REST API Example

benchmark:
  name: "flask-gin-api"
  type: server
  description: "REST API translation"
  source_language: python
  target_language: go

source:
  language: python
  description: "Flask-based REST API"
  install_cmd: "pip install -r requirements.txt"
  run_cmd: "python app.py"
  port: 3000
  port_env_var: "SERVER_PORT"
  startup_wait: 3

destination:
  language: go
  description: "Gin-based REST API"
  build_cmd: "go build -o app"
  run_cmd: "./app"
  port: 3000
  port_env_var: "SERVER_PORT"
  startup_wait: 3

testing:
  tool: pytest
  test_dir: "tests"
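
The port, port_env_var, and startup_wait fields suggest that the harness passes the port to the server through an environment variable and allows it time to start. The sketch below shows one way to do that. `server_env` and `wait_for_port` are hypothetical names, and polling the port is an alternative to a fixed sleep; the repository's harness may simply wait for startup_wait seconds.

```python
import os
import socket
import time


def server_env(section: dict) -> dict:
    """Copy the current environment and set the configured port variable."""
    env = dict(os.environ)
    env[section["port_env_var"]] = str(section["port"])
    return env


def wait_for_port(port: int, timeout: float, host: str = "127.0.0.1") -> bool:
    """Poll until something accepts TCP connections on host:port."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            # Not listening yet; retry shortly.
            time.sleep(0.1)
    return False
```

The result of `server_env(config["source"])` can be passed as the `env` argument of `subprocess.Popen` when launching run_cmd, with startup_wait used as the timeout.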

CLI Tool Example

benchmark:
  name: "jq-gojq"
  type: cli
  description: "JSON processor CLI translation"
  source_language: c
  target_language: go

source:
  language: c
  description: "jq - C implementation"
  build_cmd: "autoreconf -i && ./configure --with-oniguruma=builtin && make"
  binary_path: "src/jq"

destination:
  language: go
  description: "gojq - Go implementation"
  install_cmd: "go mod download"
  build_cmd: "go build -o gojq ./cmd/gojq"
  binary_path: "dst/gojq"

testing:
  tool: pytest
  test_dir: "tests"
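
The two examples differ mainly in which keys the source and destination sections carry. A small check along these lines can catch a section that does not match its benchmark type. The required-key sets are inferred from the examples above and are an assumption, not a documented schema; `missing_keys` is a hypothetical helper that works on the already-parsed YAML.

```python
REQUIRED_KEYS = {
    # Assumed from the two examples; the real schema may differ.
    "server": {"run_cmd", "port", "port_env_var"},
    "cli": {"build_cmd", "binary_path"},
}


def missing_keys(config: dict) -> dict:
    """Return the keys each implementation section lacks for its type."""
    required = REQUIRED_KEYS[config["benchmark"]["type"]]
    problems = {}
    for section in ("source", "destination"):
        absent = sorted(required - set(config.get(section, {})))
        if absent:
            problems[section] = absent
    return problems
```

An empty result means both sections carry every key the sketch expects for that type.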

metadata.yml

Each benchmark has a metadata.yml file with additional information:

name: benchmark-name
type: server
source_language: python
target_language: go
description: Brief description

complexity:
  loc: 1234        # Lines of code (auto-calculated)
  test_count: 20   # Number of test cases (auto-calculated)

architecture:
  frameworks:
    source: ["Flask"]
    target: ["Gin"]
  has_database: false
  has_authentication: false

provenance:
  source_repo: "https://github.com/..."
  source_commit: "abc123..."  # Required - exact commit SHA
  source_version: "v1.0.0"    # Optional - tag if cloned from release
  license: "MIT"

notes: "Brief description of what makes this benchmark notable"
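
Because source_commit is marked as required, a check like the one below could run before a benchmark is accepted. It is a sketch only: `provenance_errors` is a hypothetical function, and treating "exact commit SHA" as a full 40-character hexadecimal Git SHA-1 is an assumption.

```python
import re

# Assumption: a full 40-character hexadecimal Git SHA-1 is expected.
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def provenance_errors(metadata: dict) -> list:
    """Return human-readable problems with the provenance section."""
    prov = metadata.get("provenance", {})
    errors = []
    commit = prov.get("source_commit")
    if not commit:
        errors.append("provenance.source_commit is required")
    elif not FULL_SHA.fullmatch(commit):
        errors.append("provenance.source_commit is not a full commit SHA")
    return errors
```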

Auto-calculate Metrics

python3 scripts/generate_metadata.py --benchmark <name>  # Single benchmark
python3 scripts/generate_metadata.py --all               # All benchmarks

Requirements

Host machine must have:

Docker and Docker Compose
Python 3 with dependencies: pip3 install pyyaml pytest

Python Virtual Environment

Always use the .venv virtual environment for Python operations:

# Activate the virtual environment before running Python commands
source .venv/bin/activate

# Then run commands
python3 scripts/init_benchmark.py ...
python3 test.py --test src

All Python commands in this repository should be run with the virtual environment activated.

Testing with pytest

All benchmarks use pytest for testing. Tests should be implementation-agnostic.

REST API Tests

import requests

def test_get_users(api_url):
    # api_url is a fixture pointing at the server under test
    response = requests.get(f"{api_url}/users")
    assert response.status_code == 200
    assert "users" in response.json()

CLI Tool Tests

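import json
import subprocess

def test_identity_filter(jq_binary):
    # Feed JSON on stdin and compare the parsed output, not the raw string
    result = subprocess.run(
        [str(jq_binary), "."],
        input='{"a": 1}',
        capture_output=True,
        text=True
    )
    assert json.loads(result.stdout) == {"a": 1}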

Test Generation from Source Test Suite

For projects with existing test suites (like jq), create generate_all_tests.py to:

Parse the source test format
Convert to pytest tests
Filter non-implementation-agnostic tests

Example filtering:

Module imports (require external files)
Internal features ($__loc__ in jq)
Platform-specific behavior
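
The filtering step can be sketched as a predicate over parsed test cases. The `Case` structure and the patterns are illustrative assumptions, not the contents of any real generate_all_tests.py. Platform-specific cases usually cannot be detected by pattern and would need a hand-maintained exclusion list.

```python
import re
from dataclasses import dataclass


@dataclass
class Case:
    program: str   # the jq filter under test
    input: str     # JSON fed on stdin
    expected: list # expected output values, one string per value


# Illustrative patterns only; a real filter would be tuned per project.
EXCLUDE = [
    re.compile(r"^\s*(import|include)\s"),  # module system needs external files
    re.compile(r"\$__loc__"),               # internal/debug feature
]


def is_agnostic(case: Case) -> bool:
    """True if no exclusion pattern matches the case's program text."""
    return not any(p.search(case.program) for p in EXCLUDE)
```

A generator would then keep only the cases for which `is_agnostic` returns True before emitting pytest functions.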

Important Notes

Tests are temporarily copied to workspace/tests/ during evaluation, then removed
The destination build must succeed before tests run; if build fails, Pass Rate = 0.0
For REST APIs: Both source and destination use the same port (tested sequentially)
For CLI tools: Tests compare output semantically (by parsing the JSON), not by string equality
Source implementation should pass 100% of tests; destination shows compatibility baseline
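
Semantic comparison of CLI output can be sketched as follows. The helpers `parse_json_stream` and `same_output` are hypothetical. They assume the tool prints zero or more JSON values separated by whitespace, so differences in key order and indentation do not matter.

```python
import json


def parse_json_stream(text: str) -> list:
    """Parse whitespace-separated JSON values, as a jq-like tool emits them."""
    decoder = json.JSONDecoder()
    values, pos = [], 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            return values
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)


def same_output(actual: str, expected: str) -> bool:
    """Compare two outputs by parsed JSON values rather than raw text."""
    return parse_json_stream(actual) == parse_json_stream(expected)
```

One caveat of comparing parsed values in Python is that `1 == 1.0` and `True == 1` both hold, so a test that must distinguish those cases needs a stricter comparison.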
