This guide explains how to create new API translation benchmarks following the standardized workflow.
Each benchmark evaluates an AI agent's ability to translate an API from one language/framework to another. The workflow:
- Hides tests from the agent during development
- Provides a consistent development environment
- Reports standardized metrics: Build Success (0/1) and Pass Rate (0.0-1.0)
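As a rough illustration of the two metrics, the sketch below derives Build Success and Pass Rate from raw counts. The helper name `compute_metrics` is hypothetical, and scoring a failed build or an empty test suite as a pass rate of 0.0 is an assumption rather than documented test.sh behavior.

```python
def compute_metrics(build_ok: bool, passed: int, total: int) -> dict:
    """Return Build Success (0/1) and Pass Rate (0.0-1.0) as a dict."""
    if passed < 0 or total < 0 or passed > total:
        raise ValueError("expected 0 <= passed <= total")
    build_success = 1 if build_ok else 0
    # Assumption: a failed build or an empty suite scores 0.0.
    if not build_ok or total == 0:
        pass_rate = 0.0
    else:
        pass_rate = passed / total
    return {"build_success": build_success, "pass_rate": round(pass_rate, 2)}
```

For example, `compute_metrics(True, 3, 4)` yields a build success of 1 and a pass rate of 0.75.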
benchmark-name/
├── workspace/ # Visible to agent in dev container
│ ├── src/ # Source implementation (complete)
│ ├── dst/ # Destination (empty or stub)
│ └── prompt.md # Translation task instructions
├── benchmark.yml # Configuration (build/run/test settings)
├── Dockerfile.dev # Multi-runtime dev container
├── docker-compose.dev.yml # Container orchestration
├── test.sh # Evaluation script (reusable)
├── tests/ # Integration tests (hidden from agent)
│ └── api_test.hurl
├── generate_prompt.py # Generates prompt.md from config (reusable)
├── CLAUDE.md # Documentation for Claude Code
└── README.md # Quick start guide
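A quick way to confirm that a benchmark directory matches this layout is to check each expected path. The sketch below is not part of the template; `missing_paths` is a hypothetical helper, and the path list is simply copied from the tree above.

```python
from pathlib import Path

# Paths copied from the directory tree, relative to the benchmark root.
EXPECTED_PATHS = [
    "workspace/src",
    "workspace/dst",
    "workspace/prompt.md",
    "benchmark.yml",
    "Dockerfile.dev",
    "docker-compose.dev.yml",
    "test.sh",
    "tests/api_test.hurl",
    "generate_prompt.py",
    "CLAUDE.md",
    "README.md",
]


def missing_paths(root: str) -> list:
    """Return the expected paths that do not exist under root."""
    base = Path(root)
    return [p for p in EXPECTED_PATHS if not (base / p).exists()]
```

An empty return value means every file and directory from the tree is present.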
The templates/benchmark-skeleton/ directory contains:
- Reusable scripts: test.sh, generate_prompt.py (work for all benchmarks)
- Template files: benchmark.yml, docker-compose.dev.yml, Dockerfile.dev (pre-configured with placeholders)
- Empty directories: workspace/src/, workspace/dst/, tests/
- Example test: tests/api_test.hurl (replace with actual tests)
Use the init script to create a new benchmark from the template:
python3 init_benchmark.py \
--name "your-benchmark-name" \
--source python \
--target go \
--description "Brief description of what the API does" \
--source-desc "Flask-based" \
--target-desc "Gin-based" \
--source-build "pip install -r requirements.txt" \
--source-run "python app.py" \
--source-port-env "PORT" \
--target-build "go build -o app" \
--target-run "./app" \
--target-port-env "SERVER_PORT"
Note: Use --source-port-env and --target-port-env to specify the environment variable names your applications use for the port. Common conventions:
- Flask/Django: PORT
- Spring Boot: SERVER_PORT
- Go (varies): PORT or SERVER_PORT
- Node.js/Express: PORT
The default is SERVER_PORT if not specified.
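To make the role of the port variable concrete, the sketch below builds the environment a harness might pass to the application process. `build_env` is a hypothetical helper, not code from test.sh; only the SERVER_PORT default comes from this guide.

```python
import os
from typing import Optional

DEFAULT_PORT_ENV = "SERVER_PORT"  # default named in this guide


def build_env(port: int, port_env_var: Optional[str] = None) -> dict:
    """Copy the current environment and set the port variable for the app."""
    env = dict(os.environ)
    env[port_env_var or DEFAULT_PORT_ENV] = str(port)
    return env
```

The result can be passed as the `env` argument of `subprocess.Popen`, so a Flask source could receive `PORT` while a Go target receives `SERVER_PORT`.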
This creates ../your-benchmark-name/ with the complete structure and configured files.
Edit the generated Dockerfile.dev to install both language runtimes. The template includes commented examples for common languages.
Example for Python + Go:
# Install Python
RUN apt-get update && apt-get install -y \
python3.10 python3-pip \
&& rm -rf /var/lib/apt/lists/*
# Install Go
RUN wget https://go.dev/dl/go1.20.linux-amd64.tar.gz && \
tar -C /usr/local -xzf go1.20.linux-amd64.tar.gz && \
rm go1.20.linux-amd64.tar.gz
ENV PATH="/usr/local/go/bin:${PATH}"
In workspace/src/, implement a complete, working API:
- Must read port from environment variable specified in port_env_var
- Should be simple but representative of real translation challenges
- Include any necessary config files (requirements.txt, go.mod, package.json, etc.)
Example (Python/Flask):
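import os
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    # Flask serializes a returned dict to a JSON response
    return {"message": "Hello World"}

if __name__ == "__main__":
    # Read the port from the variable configured as port_env_var (SERVER_PORT here)
    port = int(os.environ.get('SERVER_PORT', 3000))
    app.run(host='0.0.0.0', port=port)

In workspace/dst/, you can either: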
- Leave it empty (agent starts from scratch)
- Provide a minimal stub (e.g., just pom.xml for Java, go.mod for Go)
Create tests/api_test.hurl with comprehensive test cases:
# Use {{port}} variable, not hardcoded port
GET http://localhost:{{port}}/api/users
HTTP 200
[Asserts]
header "Content-Type" contains "application/json"
jsonpath "$.users" count > 0
GET http://localhost:{{port}}/api/users/1
HTTP 200
[Asserts]
jsonpath "$.id" == 1
jsonpath "$.name" exists
Test coverage guidelines:
- Happy path (default cases)
- Edge cases (empty inputs, special characters)
- Error cases (404, validation errors)
- All major endpoints
- Data transformations
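Because the tests refer to `{{port}}`, the runner has to supply that variable when it invokes Hurl. The sketch below assembles such a command using Hurl's `--test` and `--variable` options. How test.sh actually calls Hurl is not shown in this guide, so treat `hurl_command` as a hypothetical helper.

```python
def hurl_command(test_file: str, port: int) -> list:
    """Build a hurl invocation that injects the port template variable."""
    return ["hurl", "--test", "--variable", f"port={port}", test_file]
```

The list can be handed to `subprocess.run`, for example `subprocess.run(hurl_command("tests/api_test.hurl", 3000))`.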
cd ../your-benchmark-name
python3 generate_prompt.py
This creates workspace/prompt.md from your benchmark.yml.
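The idea of generating the prompt from configuration can be sketched as simple template filling. This is not the real generate_prompt.py: the template text and the config keys are assumptions, loosely named after the init script flags, and the actual benchmark.yml schema may differ.

```python
# Hypothetical template; the real prompt wording is not shown in this guide.
PROMPT_TEMPLATE = """\
Translate the {source_desc} {source} API in src/ into a {target_desc} {target} API in dst/.

{description}

The translated API must read its port from the {target_port_env} environment variable.
"""


def render_prompt(config: dict) -> str:
    """Fill the template from a config dict (keys are assumed, not the real schema)."""
    return PROMPT_TEMPLATE.format(**config)
```

Keeping every value in one config dict mirrors the single-source-of-truth approach: changing the target language in the config changes the prompt without editing it by hand.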
# Ensure source implementation passes tests
./test.sh
Should output:
Source (python) API: PASSED ✓
Destination (go) API: FAILED ✗ # Expected if dst is empty
Build Success: 0
Pass Rate: 0.00
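When results are collected across many runs, the two metric lines can be pulled out of the script output with a small parser. The sketch below assumes the output format is exactly as shown above; `parse_metrics` is a hypothetical helper.

```python
import re


def parse_metrics(output: str) -> dict:
    """Extract Build Success and Pass Rate from test.sh-style output."""
    build = re.search(r"^Build Success:\s*([01])\s*$", output, re.MULTILINE)
    rate = re.search(r"^Pass Rate:\s*([0-9]*\.?[0-9]+)\s*$", output, re.MULTILINE)
    if build is None or rate is None:
        raise ValueError("metric lines not found in output")
    return {
        "build_success": int(build.group(1)),
        "pass_rate": float(rate.group(1)),
    }
```

The parser raises `ValueError` instead of returning defaults, so a crashed run is not mistaken for a score of zero.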
- Python → Java
- Node.js → Go
- Python → Go
- Java → Python
- Ruby → Python
On host machine:
- Docker & Docker Compose
- Hurl
- Python 3 with PyYAML: pip3 install pyyaml
- Tests hidden from agent: Keep tests/ outside workspace/
- Single source of truth: All config in benchmark.yml
- Environment variables for ports: Never hardcode ports
- Reusable scripts: test.sh and generate_prompt.py work for all benchmarks
- Consistent metrics: Always report Build Success (0/1) and Pass Rate (0.0-1.0)
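The first principle is easy to break by accident, for instance by copying a test file into workspace/ while debugging. The sketch below lists any .hurl files the agent could see. `leaked_tests` is a hypothetical helper, and treating every .hurl file under workspace/ as a leak is an assumption.

```python
from pathlib import Path


def leaked_tests(root: str) -> list:
    """Return .hurl files under workspace/, relative to the benchmark root."""
    base = Path(root)
    workspace = base / "workspace"
    if not workspace.is_dir():
        return []
    return sorted(p.relative_to(base).as_posix() for p in workspace.rglob("*.hurl"))
```

An empty list means no test files are visible inside the dev container's workspace.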
When creating a new benchmark, ensure you have:
- benchmark.yml configured
- workspace/src/ with complete source implementation
- workspace/dst/ prepared (empty or stub)
- tests/api_test.hurl with comprehensive tests
- Dockerfile.dev with both language runtimes
- docker-compose.dev.yml configured
- test.sh copied and working
- generate_prompt.py copied
- workspace/prompt.md generated
- README.md with quick start instructions
- CLAUDE.md with detailed documentation
See the hello-translation/ benchmark as a reference implementation for Python → Java translation.
Tests failing on source API:
- Check startup_wait (the API may need more time to start)
- Verify port environment variable is set correctly
- Check source implementation reads from correct env var
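One way to tell a slow startup from a wrong port is to poll the port directly instead of relying on a fixed wait. The sketch below is a diagnostic aid, not how test.sh handles startup_wait; `wait_for_port` and its defaults are assumptions.

```python
import socket
import time


def wait_for_port(port: int, host: str = "127.0.0.1", timeout: float = 10.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            # Not accepting connections yet; retry shortly.
            time.sleep(0.1)
    return False
```

If this returns False even with a generous timeout, the API is probably listening on a different port than the one the tests use, which points at the environment variable rather than at startup time.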
Build command not working:
- Ensure all dependencies are in Dockerfile.dev
- Check working directory (cd src or cd dst in test.sh)
- Verify build command syntax in benchmark.yml
Port conflicts:
- Change port in benchmark.yml
- Ensure no other services are using the port
- Source and destination should use the same port (they are started in separate test runs)
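Before changing the configured port, it can help to confirm that the port is actually taken. The sketch below tries to bind the port on the loopback interface. `port_is_free` is a hypothetical helper, and checking only 127.0.0.1 is an assumption that may miss services bound to other interfaces.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if the port can be bound on host, False if it is taken."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
    return True
```

Running this between the source and destination runs also shows whether the first process released the port before the second one starts.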
To add a new benchmark:
- Follow this guide to create the benchmark
- Test thoroughly with ./test.sh
- Document any language-specific considerations in CLAUDE.md
- Share with the team for review