
API Translation Benchmark Guide

This guide explains how to create new API translation benchmarks following the standardized workflow.

Overview

Each benchmark evaluates an AI agent's ability to translate an API from one language/framework to another. The workflow:

  • Hides tests from the agent during development
  • Provides a consistent development environment
  • Reports standardized metrics: Build Success (0/1) and Pass Rate (0.0-1.0)
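
As a rough illustration of the two metrics, the sketch below derives them from raw counts. It assumes (this guide does not spell it out) that Pass Rate is the fraction of passing tests and that a failed build yields a Pass Rate of 0.0. The function name `summarize` is hypothetical and is not part of test.sh.

```python
def summarize(build_ok: bool, passed: int, total: int) -> dict:
    """Return the two standardized metrics for one benchmark run.

    Assumption: Pass Rate = passed / total, and 0.0 when the build fails
    or no tests ran. The real test.sh may compute it differently.
    """
    if not build_ok or total == 0:
        return {"build_success": int(build_ok), "pass_rate": 0.0}
    return {"build_success": 1, "pass_rate": round(passed / total, 2)}


if __name__ == "__main__":
    print(summarize(True, 7, 10))
```

Keeping both values in one record makes it easy to compare runs across benchmarks, since a Pass Rate is only meaningful alongside its Build Success flag.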

Standard Benchmark Structure

benchmark-name/
├── workspace/              # Visible to agent in dev container
│   ├── src/               # Source implementation (complete)
│   ├── dst/               # Destination (empty or stub)
│   └── prompt.md          # Translation task instructions
├── benchmark.yml          # Configuration (build/run/test settings)
├── Dockerfile.dev         # Multi-runtime dev container
├── docker-compose.dev.yml # Container orchestration
├── test.sh                # Evaluation script (reusable)
├── tests/                 # Integration tests (hidden from agent)
│   └── api_test.hurl
├── generate_prompt.py     # Generates prompt.md from config (reusable)
├── CLAUDE.md              # Documentation for Claude Code
└── README.md              # Quick start guide

Template Structure

The templates/benchmark-skeleton/ directory contains:

  • Reusable scripts: test.sh, generate_prompt.py (work for all benchmarks)
  • Template files: benchmark.yml, docker-compose.dev.yml, Dockerfile.dev (pre-configured with placeholders)
  • Empty directories: workspace/src/, workspace/dst/, tests/
  • Example test: tests/api_test.hurl (replace with actual tests)

Creating a New Benchmark

Step 1: Initialize from Template

Use the init script to create a new benchmark from the template:

python3 init_benchmark.py \
  --name "your-benchmark-name" \
  --source python \
  --target go \
  --description "Brief description of what the API does" \
  --source-desc "Flask-based" \
  --target-desc "Gin-based" \
  --source-build "pip install -r requirements.txt" \
  --source-run "python app.py" \
  --source-port-env "PORT" \
  --target-build "go build -o app" \
  --target-run "./app" \
  --target-port-env "SERVER_PORT"

Note: Use --source-port-env and --target-port-env to specify the environment variable names your applications use for the port. Common conventions:

  • Flask/Django: PORT
  • Spring Boot: SERVER_PORT
  • Go (varies): PORT or SERVER_PORT
  • Node.js/Express: PORT

If either flag is omitted, it defaults to SERVER_PORT.

This creates ../your-benchmark-name/ with the complete structure and configured files.
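
The port flags and their SERVER_PORT default described above could be modeled with argparse roughly as follows. This is a minimal sketch covering only those two flags; it is not the actual init_benchmark.py, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical re-creation of two flags only; not the real init_benchmark.py.
    parser = argparse.ArgumentParser(description="Port env var flags (sketch)")
    parser.add_argument("--source-port-env", default="SERVER_PORT",
                        help="Env var the source app reads its port from")
    parser.add_argument("--target-port-env", default="SERVER_PORT",
                        help="Env var the target app reads its port from")
    return parser


if __name__ == "__main__":
    # Only the source flag is given, so the target falls back to the default.
    args = build_parser().parse_args(["--source-port-env", "PORT"])
    print(args.source_port_env, args.target_port_env)
```

Passing only one flag, as in the usage lines, leaves the other at SERVER_PORT, which is why a Flask source that reads PORT needs the flag set explicitly.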

Step 2: Edit Dockerfile.dev

Edit the generated Dockerfile.dev to install both language runtimes. The template includes commented examples for common languages.

Example for Python + Go:

# Install Python
RUN apt-get update && apt-get install -y \
    python3.10 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Install Go
RUN wget https://go.dev/dl/go1.20.linux-amd64.tar.gz && \
    tar -C /usr/local -xzf go1.20.linux-amd64.tar.gz && \
    rm go1.20.linux-amd64.tar.gz
ENV PATH="/usr/local/go/bin:${PATH}"

Step 3: Create Source Implementation

In workspace/src/, implement a complete, working API:

  • Must read the port from the environment variable specified in port_env_var
  • Should be simple but representative of real translation challenges
  • Include any necessary config files (requirements.txt, go.mod, package.json, etc.)

Example (Python/Flask):

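import os
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    # Flask serializes a returned dict to a JSON response.
    return {"message": "Hello World"}

if __name__ == "__main__":
    # The variable name must match port_env_var (SERVER_PORT here;
    # use PORT if you passed --source-port-env "PORT" in Step 1).
    port = int(os.environ.get("SERVER_PORT", 3000))
    app.run(host="0.0.0.0", port=port)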

Step 4: Create Destination Stub (Optional)

In workspace/dst/, you can either:

  • Leave it empty (agent starts from scratch)
  • Provide a minimal stub (e.g., just pom.xml for Java, go.mod for Go)

Step 5: Write Integration Tests

Create tests/api_test.hurl with comprehensive test cases:

# Use {{port}} variable, not hardcoded port
GET http://localhost:{{port}}/api/users
HTTP 200
[Asserts]
header "Content-Type" contains "application/json"
jsonpath "$.users" count > 0

GET http://localhost:{{port}}/api/users/1
HTTP 200
[Asserts]
jsonpath "$.id" == 1
jsonpath "$.name" exists

Test coverage guidelines:

  • Happy path (default cases)
  • Edge cases (empty inputs, special characters)
  • Error cases (404, validation errors)
  • All major endpoints
  • Data transformations
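
The `{{port}}` variable used in the tests above has to be supplied when Hurl runs. The sketch below shows one way to do that from Python with Hurl's `--variable` and `--test` options. The helper names `hurl_command` and `run_tests` are hypothetical, and test.sh may invoke Hurl differently.

```python
import subprocess


def hurl_command(test_file: str, port: int) -> list[str]:
    """Build a Hurl invocation that injects the port as a variable."""
    return ["hurl", "--test", "--variable", f"port={port}", test_file]


def run_tests(test_file: str, port: int) -> bool:
    """Run Hurl and report whether it exited successfully (exit code 0)."""
    result = subprocess.run(hurl_command(test_file, port))
    return result.returncode == 0


if __name__ == "__main__":
    # Print the command instead of running it, so no server is required.
    print(" ".join(hurl_command("tests/api_test.hurl", 3000)))
```

Because the port is injected at run time, the same test file can be run against both the source and the destination implementation.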

Step 6: Generate prompt.md

cd ../your-benchmark-name
python3 generate_prompt.py

This creates workspace/prompt.md from your benchmark.yml.

Step 7: Test Your Benchmark

# Ensure source implementation passes tests
./test.sh

Expected output:

Source (python) API: PASSED ✓
Destination (go) API: FAILED ✗  # Expected if dst is empty
Build Success: 0
Pass Rate: 0.00
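
If you collect results from several benchmarks, the two metric lines can be pulled out of the output programmatically. The sketch below assumes the output format matches the sample above exactly; `parse_metrics` is a hypothetical helper, not part of the template.

```python
import re


def parse_metrics(output: str) -> dict:
    """Extract Build Success and Pass Rate from test.sh-style output."""
    build = re.search(r"^Build Success:\s*([01])\s*$", output, re.MULTILINE)
    rate = re.search(r"^Pass Rate:\s*([0-9]*\.?[0-9]+)\s*$", output, re.MULTILINE)
    if build is None or rate is None:
        raise ValueError("metrics lines not found in output")
    return {"build_success": int(build.group(1)), "pass_rate": float(rate.group(1))}


# Sample copied from the expected output above (inline comment removed).
SAMPLE = """\
Source (python) API: PASSED ✓
Destination (go) API: FAILED ✗
Build Success: 0
Pass Rate: 0.00
"""

if __name__ == "__main__":
    print(parse_metrics(SAMPLE))
```

Raising an error when either line is missing avoids silently recording a run that crashed before the metrics were printed.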

Supported Language Pairs (Priority)

  1. Python → Java
  2. Node.js → Go
  3. Python → Go
  4. Java → Python
  5. Ruby → Python

Requirements

On host machine:

  • Docker & Docker Compose
  • Hurl
  • Python 3 with PyYAML: pip3 install pyyaml

Key Design Principles

  1. Tests hidden from agent: Keep tests/ outside workspace/
  2. Single source of truth: All config in benchmark.yml
  3. Environment variables for ports: Never hardcode ports
  4. Reusable scripts: test.sh and generate_prompt.py work for all benchmarks
  5. Consistent metrics: Always report Build Success (0/1) and Pass Rate (0.0-1.0)

File Checklist

When creating a new benchmark, ensure you have:

  • benchmark.yml configured
  • workspace/src/ with complete source implementation
  • workspace/dst/ prepared (empty or stub)
  • tests/api_test.hurl with comprehensive tests
  • Dockerfile.dev with both language runtimes
  • docker-compose.dev.yml configured
  • test.sh copied and working
  • generate_prompt.py copied
  • workspace/prompt.md generated
  • README.md with quick start instructions
  • CLAUDE.md with detailed documentation
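
The checklist above can also be verified mechanically. This is a minimal sketch that only checks that each path exists under the benchmark root. It does not check that the files are correctly configured, and `missing_paths` is a hypothetical helper that is not shipped with the template.

```python
from pathlib import Path

# Paths taken from the checklist above, relative to the benchmark root.
REQUIRED_PATHS = [
    "benchmark.yml",
    "workspace/src",
    "workspace/dst",
    "workspace/prompt.md",
    "tests/api_test.hurl",
    "Dockerfile.dev",
    "docker-compose.dev.yml",
    "test.sh",
    "generate_prompt.py",
    "README.md",
    "CLAUDE.md",
]


def missing_paths(root: str) -> list[str]:
    """Return the checklist entries that do not exist under root."""
    base = Path(root)
    return [p for p in REQUIRED_PATHS if not (base / p).exists()]


if __name__ == "__main__":
    # Check the current directory and list anything that is absent.
    for path in missing_paths("."):
        print(f"missing: {path}")
```

Run it from inside the benchmark directory; no output means every checklist path is present.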

Example: Hello Translation

See the hello-translation/ benchmark for a reference implementation of a Python → Java translation.

Troubleshooting

Tests failing on source API:

  • Increase startup_wait if the server needs more time to start
  • Verify that the port environment variable is set correctly
  • Check that the source implementation reads from the correct env var
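
When tuning startup_wait, it can help to measure how long the server really takes to come up. The sketch below polls the port until it accepts TCP connections. This is a diagnostic technique of its own, not a description of how test.sh waits, and `wait_for_port` is a hypothetical helper.

```python
import socket
import time


def wait_for_port(port: int, host: str = "127.0.0.1", timeout: float = 10.0) -> bool:
    """Poll until host:port accepts TCP connections or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the server is listening.
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            # Not listening yet (or refused); retry shortly.
            time.sleep(0.1)
    return False
```

Start the source API by hand, call `wait_for_port` with its port, and time the call. A startup_wait somewhat above the measured value is a reasonable starting point.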

Build command not working:

  • Ensure all dependencies are installed in Dockerfile.dev
  • Check the working directory (cd src or cd dst in test.sh)
  • Verify the build command syntax in benchmark.yml

Port conflicts:

  • Change the port in benchmark.yml
  • Ensure no other service is using the port
  • Source and destination should use the same port; they are started in separate test runs, so they do not conflict with each other
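
To check quickly whether another service already holds the configured port, you can try to bind to it. This is a minimal sketch that assumes the check runs on the same machine and network namespace as the benchmark; `port_is_free` is a hypothetical helper.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Bind failed: the port is in use (or otherwise unavailable).
            return False
    return True
```

The result is only a snapshot: another process can still claim the port between this check and the moment the benchmark starts its server.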

Contributing

To add a new benchmark:

  1. Follow this guide to create the benchmark
  2. Test thoroughly with ./test.sh
  3. Document any language-specific considerations in CLAUDE.md
  4. Share with the team for review