This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
MCode Benchmark is an evaluation framework for software translation across programming languages and frameworks. It evaluates AI agents' ability to translate projects from one language/framework to another while maintaining functional equivalence.
Projects must have naturally standardized interfaces - no adapters or wrappers needed.
| Type | Interface | Example |
|---|---|---|
| CLI Tools | stdin/stdout/args/exit code | jq → gojq |
| REST APIs | HTTP request/response | Flask → Gin |
| GraphQL APIs | HTTP + GraphQL query | Apollo → gqlgen |
| gRPC Services | Protobuf over HTTP/2 | Python gRPC → Go gRPC |
| File Processors | File(s) in → File(s) out | Compilers, converters |
Principle: The interface itself must be the contract. If you need to build an adapter layer, the project doesn't fit.
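As a rough illustration of this principle for CLI tools, the sketch below treats a program purely as stdin, arguments, stdout, and exit code, so the same check can be pointed at either implementation. The helper names `observe` and `same_contract` are hypothetical and not part of the benchmark's scripts. The two commands are stand-ins built from the Python interpreter, not real benchmark binaries.

```python
import subprocess
import sys


def observe(cmd, stdin_text=""):
    """Run a command and return only what the CLI contract exposes."""
    result = subprocess.run(cmd, input=stdin_text, capture_output=True, text=True)
    return result.returncode, result.stdout


def same_contract(cmd_a, cmd_b, stdin_text=""):
    """True when both commands produce the same exit code and stdout."""
    return observe(cmd_a, stdin_text) == observe(cmd_b, stdin_text)


# Stand-in "implementations": two different programs that both upper-case stdin.
impl_a = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
impl_b = [sys.executable, "-c", "import sys\nfor ch in sys.stdin.read(): sys.stdout.write(ch.upper())"]
```

Because nothing in `observe` depends on how either program is written, no adapter is needed. A project that cannot be checked this way probably does not fit.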
Each benchmark follows a standardized structure:
- workspace/: The agent's working environment (mounted to container)
  - src/: Complete source implementation
  - dst/: Target implementation (empty, for agent to generate)
  - prompt.md: Translation task instructions
- reference/: Pre-existing reference implementation (hidden from agent, not mounted)
- tests/: Integration tests using pytest (hidden from agent during development)
- benchmark.yml: Single source of truth for configuration
- metadata.yml: Benchmark metadata (LOC, test count, frameworks, provenance)
- Dockerfile.dev: Multi-runtime development container
- docker-compose.dev.yml: Container orchestration
- generate_prompt.py: Generates prompt.md from benchmark.yml
- generate_all_tests.py: (Optional) Generates tests from source test suite
- Tests are hidden: The tests/ directory is outside workspace/ and only copied in during evaluation
- Single source of truth: All configuration lives in benchmark.yml
- Implementation-agnostic tests: Tests verify behavior through standardized interfaces
- Standardized metrics: Build Success (0/1) and Pass Rate (0.0-1.0)
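The two metrics can be sketched as a small function. This is only an illustration, not the repository's scoring code: the `Metrics` class and `compute_metrics` function are hypothetical names, and treating an empty test run as a pass rate of 0.0 is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    build_success: int  # 0 or 1
    pass_rate: float    # 0.0 to 1.0


def compute_metrics(build_ok, passed, total):
    """Combine a build result and test counts into the two benchmark metrics."""
    if not build_ok or total == 0:
        # A failed build means no tests run, so the pass rate is 0.0.
        # Mapping an empty test run to 0.0 is an assumption of this sketch.
        return Metrics(build_success=int(bool(build_ok)), pass_rate=0.0)
    return Metrics(build_success=1, pass_rate=passed / total)
```

For example, `compute_metrics(True, 15, 20)` gives a build success of 1 and a pass rate of 0.75.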
The benchmark-skeleton/ directory contains:
- Reusable scripts that work for all benchmarks
- Template files with placeholders ({{BENCHMARK_NAME}}, {{SOURCE_LANG}}, etc.)
- Example test files to be replaced with actual tests
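Placeholder substitution of this kind might look like the sketch below. The `render_template` function and its fail-on-unknown-name behavior are assumptions for illustration. The actual init script may fill templates differently.

```python
import re

# Matches placeholders such as {{BENCHMARK_NAME}} or {{SOURCE_LANG}}.
PLACEHOLDER = re.compile(r"\{\{([A-Z_]+)\}\}")


def render_template(text, values):
    """Replace each {{NAME}} with values["NAME"]; fail loudly on unknown names."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value for placeholder {name}")
        return values[name]
    return PLACEHOLDER.sub(substitute, text)


template = "name: {{BENCHMARK_NAME}}\nsource_language: {{SOURCE_LANG}}\n"
rendered = render_template(
    template, {"BENCHMARK_NAME": "flask-gin-api", "SOURCE_LANG": "python"}
)
```

Raising on an unknown placeholder, rather than leaving it in place, keeps a half-filled template from slipping into a generated benchmark.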
python3 scripts/init_benchmark.py \
  --name "your-benchmark-name" \
  --type server \
  --source python \
  --target go \
  --description "Brief description" \
  --source-desc "Flask-based" \
  --target-desc "Gin-based" \
  --source-build "pip install -r requirements.txt" \
  --target-build "go build -o app"

Arguments:
- --type: Required. Either server (REST APIs) or cli (command-line tools)
- --source-port-env: Environment variable for source port (default: SERVER_PORT)
- --target-port-env: Environment variable for target port (default: SERVER_PORT)
This creates dataset/your-benchmark-name/ with all necessary files and placeholders configured.
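For readers unfamiliar with how such flags are usually declared, here is a minimal argparse sketch covering a subset of them. It is not the code of init_benchmark.py. Only --type is documented as required, so marking --name, --source, and --target as required is an assumption, as is the empty default for --description.

```python
import argparse


def build_parser():
    """Declare a subset of the documented flags (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(description="Create a benchmark from the skeleton")
    parser.add_argument("--name", required=True)       # assumed required
    parser.add_argument("--type", required=True, choices=["server", "cli"])
    parser.add_argument("--source", required=True)     # assumed required
    parser.add_argument("--target", required=True)     # assumed required
    parser.add_argument("--description", default="")   # assumed default
    parser.add_argument("--source-port-env", default="SERVER_PORT")
    parser.add_argument("--target-port-env", default="SERVER_PORT")
    return parser


args = build_parser().parse_args(
    ["--name", "demo", "--type", "cli", "--source", "c", "--target", "go"]
)
```

Restricting --type with `choices` makes argparse reject any value other than server or cli before files are created.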
From within a benchmark directory:
python3 generate_prompt.py

This creates workspace/prompt.md from the benchmark.yml configuration.
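To make the idea concrete, the sketch below renders a prompt from a configuration mapping. The `build_prompt` function and the prompt wording are hypothetical. The dictionary mirrors the benchmark.yml fields and assumes that benchmark, source, and destination are top-level sections. The real script presumably loads the YAML file rather than using a literal.

```python
def build_prompt(config):
    """Render translation instructions from a benchmark.yml-style mapping."""
    src, dst = config["source"], config["destination"]
    lines = [
        f"# Translation task: {config['benchmark']['name']}",
        "",
        f"Translate the {src['description']} in src/ ({src['language']}) "
        f"into a {dst['description']} in dst/ ({dst['language']}).",
        "",
        f"The result must build with: {dst['build_cmd']}",
    ]
    return "\n".join(lines) + "\n"


# Literal stand-in for the parsed benchmark.yml (layout is an assumption).
config = {
    "benchmark": {"name": "flask-gin-api"},
    "source": {"language": "python", "description": "Flask-based REST API"},
    "destination": {
        "language": "go",
        "description": "Gin-based REST API",
        "build_cmd": "go build -o app",
    },
}
prompt = build_prompt(config)
```

Deriving the prompt from the same mapping that drives the build keeps the instructions and the evaluation from drifting apart.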
From within a benchmark directory:
python3 test.py --test both # Test source and destination
python3 test.py --test src # Test source only
python3 test.py --test dst # Test destination only

This:
- Creates temporary workspace with tests
- Starts the dev container
- Builds and tests implementation(s)
- Reports metrics (Build Success, Pass Rate)
- Cleans up
To create a benchmark from scratch:
- Initialize: Use init_benchmark.py to create from template
- Edit Dockerfile.dev: Install both language runtimes
- Implement source: Create complete working API in workspace/src/
- Create destination stub: Either leave workspace/dst/ empty or add minimal scaffolding
- Write tests: Create tests/test_api.py with pytest tests
- Generate prompt: Run generate_prompt.py
- Verify: Run python3 test.py to ensure source passes and destination fails
To create a benchmark from existing projects:
- Initialize: Create benchmark structure
- Clone source: Clone source repo to workspace/src/
- Clone destination: Clone reference implementation to workspace/dst/
- Generate tests: Create generate_all_tests.py to parse source test suite
- Filter tests: Remove non-implementation-agnostic tests:
  - Module/plugin systems (require external files)
  - Internal/debug features (not public API)
  - Platform-specific behavior
- Verify: Source should pass 100%, destination shows baseline compatibility
# Server benchmark (REST API)
benchmark:
  name: "flask-gin-api"
  type: server
  description: "REST API translation"
  source_language: python
  target_language: go
source:
  language: python
  description: "Flask-based REST API"
  install_cmd: "pip install -r requirements.txt"
  run_cmd: "python app.py"
  port: 3000
  port_env_var: "SERVER_PORT"
  startup_wait: 3
destination:
  language: go
  description: "Gin-based REST API"
  build_cmd: "go build -o app"
  run_cmd: "./app"
  port: 3000
  port_env_var: "SERVER_PORT"
  startup_wait: 3
testing:
  tool: pytest
  test_dir: "tests"

# CLI benchmark
benchmark:
  name: "jq-gojq"
  type: cli
  description: "JSON processor CLI translation"
  source_language: c
  target_language: go
source:
  language: c
  description: "jq - C implementation"
  build_cmd: "autoreconf -i && ./configure --with-oniguruma=builtin && make"
  binary_path: "src/jq"
destination:
  language: go
  description: "gojq - Go implementation"
  install_cmd: "go mod download"
  build_cmd: "go build -o gojq ./cmd/gojq"
  binary_path: "dst/gojq"
testing:
  tool: pytest
  test_dir: "tests"

Each benchmark has a metadata.yml file with additional information:
name: benchmark-name
type: server
source_language: python
target_language: go
description: Brief description
complexity:
  loc: 1234          # Lines of code (auto-calculated)
  test_count: 20     # Number of test cases (auto-calculated)
architecture:
  frameworks:
    source: ["Flask"]
    target: ["Gin"]
  has_database: false
  has_authentication: false
provenance:
  source_repo: "https://github.com/..."
  source_commit: "abc123..."   # Required - exact commit SHA
  source_version: "v1.0.0"     # Optional - tag if cloned from release
  license: "MIT"
notes: "Brief description of what makes this benchmark notable"

To generate metadata:
python3 scripts/generate_metadata.py --benchmark <name> # Single benchmark
python3 scripts/generate_metadata.py --all # All benchmarks

Host machine must have:
- Docker and Docker Compose
- Python 3 with dependencies:
pip3 install pyyaml pytest
Always use the .venv virtual environment for Python operations:
# Activate the virtual environment before running Python commands
source .venv/bin/activate
# Then run commands
python3 scripts/init_benchmark.py ...
python3 test.py --test src

All Python commands in this repository should be run with the virtual environment activated.
All benchmarks use pytest for testing. Tests should be implementation-agnostic.
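import json
import subprocess

import requests


# REST API test; api_url is assumed to be a pytest fixture (e.g. from conftest.py)
def test_get_users(api_url):
    response = requests.get(f"{api_url}/users")
    assert response.status_code == 200
    assert "users" in response.json()


# CLI test; jq_binary is assumed to be a pytest fixture pointing at the binary under test
def test_identity_filter(jq_binary):
    result = subprocess.run(
        [str(jq_binary), "."],
        input='{"a": 1}',
        capture_output=True,
        text=True
    )
    assert json.loads(result.stdout) == {"a": 1}

For projects with existing test suites (like jq), create generate_all_tests.py to: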
- Parse the source test format
- Convert to pytest tests
- Filter non-implementation-agnostic tests
Example filtering:
- Module imports (require external files)
- Internal features ($__loc__ in jq)
- Platform-specific behavior
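The parse and filter steps could be sketched as follows. The sketch assumes a simplified jq-style test format: blank-line-separated groups, `#` comment lines, then a program line, an input line, and one or more expected-output lines. The real jq test file has more cases than this handles. The function names and the `SKIP_MARKERS` list are hypothetical.

```python
# Substrings that mark a test as not implementation-agnostic (assumed list).
SKIP_MARKERS = ("$__loc__", "import ", "include ")


def parse_cases(text):
    """Split a simplified jq-style test file into (program, input, expected) cases."""
    cases = []
    for block in text.split("\n\n"):
        lines = [ln for ln in block.splitlines() if ln.strip() and not ln.startswith("#")]
        if len(lines) >= 3:
            cases.append((lines[0], lines[1], lines[2:]))
    return cases


def is_agnostic(case):
    """Keep a case only if its program uses none of the skipped features."""
    program = case[0]
    return not any(marker in program for marker in SKIP_MARKERS)


sample = """# identity
.
{"a": 1}
{"a": 1}

# internal feature, not part of the public contract
$__loc__
null
{"file":"<top-level>","line":1}

.a + 1
{"a": 1}
2
"""
kept = [case for case in parse_cases(sample) if is_agnostic(case)]
```

The surviving cases could then be fed to a parametrized pytest test that runs each program against the binary under test.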

- Tests are temporarily copied to workspace/tests/ during evaluation, then removed
- The destination build must succeed before tests run; if build fails, Pass Rate = 0.0
- For REST APIs: Both source and destination use the same port (tested sequentially)
- For CLI tools: Tests compare output semantically (JSON parsing) not string equality
- Source implementation should pass 100% of tests; destination shows compatibility baseline
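For the CLI note above, semantic comparison can be sketched as parsing each program's stdout into JSON values and comparing the parsed results rather than the raw strings. `parse_json_stream` and `same_output` are hypothetical helpers. They assume the tool prints zero or more whitespace-separated JSON values, as jq does.

```python
import json


def parse_json_stream(text):
    """Decode zero or more whitespace-separated JSON values from a string."""
    decoder = json.JSONDecoder()
    values, pos = [], 0
    while True:
        # Skip whitespace between values; raw_decode does not skip it itself.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            return values
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)


def same_output(stdout_a, stdout_b):
    """Compare two outputs by parsed JSON value, ignoring formatting and key order."""
    return parse_json_stream(stdout_a) == parse_json_stream(stdout_b)
```

With this approach, differences in indentation, key order, or trailing newlines between two implementations do not count as failures, while a different value still does.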