CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Repository Overview

MCode Benchmark is an evaluation framework for software translation across programming languages and frameworks. It evaluates AI agents' ability to translate projects from one language/framework to another while maintaining functional equivalence.

Supported Project Types

Projects must have naturally standardized interfaces - no adapters or wrappers needed.

| Type | Interface | Example |
|------|-----------|---------|
| CLI Tools | stdin/stdout/args/exit code | jq → gojq |
| REST APIs | HTTP request/response | Flask → Gin |
| GraphQL APIs | HTTP + GraphQL query | Apollo → gqlgen |
| gRPC Services | Protobuf over HTTP/2 | Python gRPC → Go gRPC |
| File Processors | File(s) in → File(s) out | Compilers, converters |

Principle: The interface itself must be the contract. If you need to build an adapter layer, the project doesn't fit.

Core Architecture

Benchmark Structure

Each benchmark follows a standardized structure:

  • workspace/: The agent's working environment (mounted to container)
    • src/: Complete source implementation
    • dst/: Target implementation (empty, for agent to generate)
    • prompt.md: Translation task instructions
  • reference/: Pre-existing reference implementation (hidden from agent, not mounted)
  • tests/: Integration tests using pytest (hidden from agent during development)
  • benchmark.yml: Single source of truth for configuration
  • metadata.yml: Benchmark metadata (LOC, test count, frameworks, provenance)
  • Dockerfile.dev: Multi-runtime development container
  • docker-compose.dev.yml: Container orchestration
  • generate_prompt.py: Generates prompt.md from benchmark.yml
  • generate_all_tests.py: (Optional) Generates tests from source test suite

Key Design Principles

  1. Tests are hidden: The tests/ directory sits outside workspace/ and is copied in only during evaluation
  2. Single source of truth: All configuration lives in benchmark.yml
  3. Implementation-agnostic tests: Tests verify behavior through standardized interfaces
  4. Standardized metrics: Build Success (0/1) and Pass Rate (0.0-1.0)
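
The two metrics can be illustrated with a minimal sketch. The names `Metrics` and `compute_metrics` are hypothetical and are not taken from the repository's scripts; treating an empty test suite as a pass rate of 0.0 is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    build_success: int  # 1 if the build succeeded, else 0
    pass_rate: float    # fraction of tests passed, 0.0 to 1.0


def compute_metrics(build_ok: bool, passed: int, total: int) -> Metrics:
    """Combine a build result and test counts into the two metrics."""
    if not build_ok or total == 0:
        # A failed build means no tests run, so the pass rate is 0.0.
        # (An empty suite is also scored 0.0 here; that is an assumption.)
        return Metrics(build_success=int(build_ok), pass_rate=0.0)
    return Metrics(build_success=1, pass_rate=passed / total)
```

For example, `compute_metrics(True, 15, 20)` gives a Build Success of 1 and a Pass Rate of 0.75.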

Template System

The benchmark-skeleton/ directory contains:

  • Reusable scripts that work for all benchmarks
  • Template files with placeholders ({{BENCHMARK_NAME}}, {{SOURCE_LANG}}, etc.)
  • Example test files to be replaced with actual tests
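
As a rough illustration of how `{{NAME}}` placeholders could be filled in, here is a minimal sketch. `render_template` is a hypothetical helper, not the function used by init_benchmark.py, and failing on unknown placeholders is a design assumption.

```python
import re

# Matches placeholders such as {{BENCHMARK_NAME}} or {{SOURCE_LANG}}.
PLACEHOLDER = re.compile(r"\{\{([A-Z0-9_]+)\}\}")


def render_template(text: str, values: dict[str, str]) -> str:
    """Replace each {{NAME}} with values["NAME"]; fail on unknown names."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError("no value for placeholder {{" + name + "}}")
        return values[name]

    return PLACEHOLDER.sub(substitute, text)
```

Raising on a missing value, rather than leaving the placeholder in place, makes an incompletely configured benchmark fail at creation time instead of at evaluation time.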

Common Commands

Create a New Benchmark

python3 scripts/init_benchmark.py \
  --name "your-benchmark-name" \
  --type server \
  --source python \
  --target go \
  --description "Brief description" \
  --source-desc "Flask-based" \
  --target-desc "Gin-based" \
  --source-build "pip install -r requirements.txt" \
  --target-build "go build -o app"

Arguments:

  • --type: Required. Either server (REST APIs) or cli (command-line tools)
  • --source-port-env: Environment variable for source port (default: SERVER_PORT)
  • --target-port-env: Environment variable for target port (default: SERVER_PORT)

This creates dataset/your-benchmark-name/ with all necessary files and placeholders configured.

Generate Prompt File

From within a benchmark directory:

python3 generate_prompt.py

This creates workspace/prompt.md from benchmark.yml configuration.

Run Benchmark Tests

From within a benchmark directory:

python3 test.py --test both    # Test source and destination
python3 test.py --test src     # Test source only
python3 test.py --test dst     # Test destination only

This:

  1. Creates temporary workspace with tests
  2. Starts the dev container
  3. Builds and tests implementation(s)
  4. Reports metrics (Build Success, Pass Rate)
  5. Cleans up

Workflow for Creating New Benchmarks

REST API Benchmarks

  1. Initialize: Use init_benchmark.py to create from template
  2. Edit Dockerfile.dev: Install both language runtimes
  3. Implement source: Create complete working API in workspace/src/
  4. Create destination stub: Either leave workspace/dst/ empty or add minimal scaffolding
  5. Write tests: Create tests/test_api.py with pytest tests
  6. Generate prompt: Run generate_prompt.py
  7. Verify: Run python3 test.py --test both to confirm that the source passes and the destination fails

CLI Tool Benchmarks

  1. Initialize: Create benchmark structure
  2. Clone source: Clone source repo to workspace/src/
  3. Clone destination: Clone reference implementation to workspace/dst/
  4. Generate tests: Create generate_all_tests.py to parse source test suite
  5. Filter tests: Remove non-implementation-agnostic tests:
    • Module/plugin systems (require external files)
    • Internal/debug features (not public API)
    • Platform-specific behavior
  6. Verify: Source should pass 100%, destination shows baseline compatibility

benchmark.yml Configuration

REST API Example

benchmark:
  name: "flask-gin-api"
  type: server
  description: "REST API translation"
  source_language: python
  target_language: go

source:
  language: python
  description: "Flask-based REST API"
  install_cmd: "pip install -r requirements.txt"
  run_cmd: "python app.py"
  port: 3000
  port_env_var: "SERVER_PORT"
  startup_wait: 3

destination:
  language: go
  description: "Gin-based REST API"
  build_cmd: "go build -o app"
  run_cmd: "./app"
  port: 3000
  port_env_var: "SERVER_PORT"
  startup_wait: 3

testing:
  tool: pytest
  test_dir: "tests"

CLI Tool Example

benchmark:
  name: "jq-gojq"
  type: cli
  description: "JSON processor CLI translation"
  source_language: c
  target_language: go

source:
  language: c
  description: "jq - C implementation"
  build_cmd: "autoreconf -i && ./configure --with-oniguruma=builtin && make"
  binary_path: "src/jq"

destination:
  language: go
  description: "gojq - Go implementation"
  install_cmd: "go mod download"
  build_cmd: "go build -o gojq ./cmd/gojq"
  binary_path: "dst/gojq"

testing:
  tool: pytest
  test_dir: "tests"
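
The two examples differ mainly in which keys each side needs: servers carry run_cmd and port, CLI tools carry binary_path. A small check along those lines might look like the sketch below. The required-key lists are inferred from the two examples above, not from a documented schema, and `missing_keys` is a hypothetical helper that takes the already-parsed YAML as a dict.

```python
# Keys each benchmark type appears to need, inferred from the examples above.
REQUIRED_KEYS = {
    "server": {"source": ["run_cmd", "port"], "destination": ["run_cmd", "port"]},
    "cli": {"source": ["binary_path"], "destination": ["binary_path"]},
}


def missing_keys(config: dict) -> list[str]:
    """Return dotted names of keys that look required but are absent."""
    btype = config.get("benchmark", {}).get("type")
    if btype not in REQUIRED_KEYS:
        return [f"benchmark.type must be one of {sorted(REQUIRED_KEYS)}"]
    problems = []
    for section, keys in REQUIRED_KEYS[btype].items():
        present = config.get(section, {})
        problems += [f"{section}.{key}" for key in keys if key not in present]
    return problems
```

An empty list means the config has every key this sketch looks for; it does not validate values.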

metadata.yml

Each benchmark has a metadata.yml file with additional information:

name: benchmark-name
type: server
source_language: python
target_language: go
description: Brief description

complexity:
  loc: 1234        # Lines of code (auto-calculated)
  test_count: 20   # Number of test cases (auto-calculated)

architecture:
  frameworks:
    source: ["Flask"]
    target: ["Gin"]
  has_database: false
  has_authentication: false

provenance:
  source_repo: "https://github.com/..."
  source_commit: "abc123..."  # Required - exact commit SHA
  source_version: "v1.0.0"    # Optional - tag if cloned from release
  license: "MIT"

notes: "Brief description of what makes this benchmark notable"

Auto-calculate Metrics

python3 scripts/generate_metadata.py --benchmark <name>  # Single benchmark
python3 scripts/generate_metadata.py --all               # All benchmarks

Requirements

Host machine must have:

  • Docker and Docker Compose
  • Python 3 with dependencies: pip3 install pyyaml pytest

Python Virtual Environment

Always use the .venv virtual environment for Python operations:

# Activate the virtual environment before running Python commands
source .venv/bin/activate

# Then run commands
python3 scripts/init_benchmark.py ...
python3 test.py --test src

All Python commands in this repository should be run with the virtual environment activated.

Testing with pytest

All benchmarks use pytest for testing. Tests should be implementation-agnostic.

REST API Tests

import requests


def test_get_users(api_url):
    # api_url is a fixture holding the base URL of the server under test
    response = requests.get(f"{api_url}/users")
    assert response.status_code == 200
    assert "users" in response.json()

CLI Tool Tests

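import json
import subprocess


def test_identity_filter(jq_binary):
    # jq_binary is a fixture holding the path to the binary under test
    result = subprocess.run(
        [str(jq_binary), "."],
        input='{"a": 1}',
        capture_output=True,
        text=True
    )
    # Compare parsed JSON so that formatting differences do not matter
    assert json.loads(result.stdout) == {"a": 1}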

Test Generation from Source Test Suite

For projects with existing test suites (like jq), create generate_all_tests.py to:

  1. Parse the source test format
  2. Convert to pytest tests
  3. Filter non-implementation-agnostic tests

Example filtering:

  • Module imports (require external files)
  • Internal features ($__loc__ in jq)
  • Platform-specific behavior
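
The parse and filter steps can be sketched as follows. The input format is an assumption: blank-line-separated groups of a program line, an input line, and one or more expected-output lines, with `#` starting a comment. Check it against the real source suite before relying on it. `Case`, `parse_cases`, `is_portable`, and the marker list are hypothetical.

```python
from dataclasses import dataclass

# Substrings that mark a test as implementation-specific (illustrative list).
EXCLUDE_MARKERS = ("$__loc__", "import ", "include ")


@dataclass
class Case:
    program: str
    input: str
    expected: list[str]


def parse_cases(text: str) -> list[Case]:
    """Split blank-line-separated groups into program, input, expected lines."""
    cases, group = [], []
    for line in text.splitlines() + [""]:  # trailing "" flushes the last group
        if line.startswith("#"):
            continue  # comment line
        if line.strip():
            group.append(line)
        elif group:
            if len(group) >= 2:
                cases.append(Case(group[0], group[1], group[2:]))
            group = []
    return cases


def is_portable(case: Case) -> bool:
    """True if the program uses none of the excluded features."""
    return not any(marker in case.program for marker in EXCLUDE_MARKERS)
```

Substring matching is deliberately crude: it can also reject a program that merely contains one of the markers inside a string literal, so filtered-out cases are worth a manual review.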

Important Notes

  • Tests are temporarily copied to workspace/tests/ during evaluation, then removed
  • The destination build must succeed before tests run; if the build fails, Pass Rate is 0.0
  • For REST APIs: Both source and destination use the same port (tested sequentially)
  • For CLI tools: Tests compare output semantically (by parsing JSON), not by string equality
  • Source implementation should pass 100% of tests; destination shows compatibility baseline
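
The semantic comparison for CLI output can be sketched with the standard library alone. The helper names `parse_json_stream` and `same_output` are hypothetical; the sketch assumes the tool prints zero or more JSON values separated by whitespace.

```python
import json


def parse_json_stream(text: str) -> list:
    """Parse zero or more whitespace-separated JSON values from text."""
    decoder = json.JSONDecoder()
    values, pos = [], 0
    while True:
        # Skip whitespace between values; raw_decode expects a value at pos.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            return values
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)


def same_output(a: str, b: str) -> bool:
    """Compare two outputs as sequences of parsed JSON values."""
    return parse_json_stream(a) == parse_json_stream(b)
```

With this, compact and pretty-printed output compare equal, as do objects whose keys appear in a different order, while arrays with reordered elements do not. Note that Python treats `1` and `1.0` as equal, so this comparison does not distinguish integer from float formatting.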