API Translation Benchmark Guide

This guide explains how to create new API translation benchmarks following the standardized workflow.

Overview

Each benchmark evaluates an AI agent's ability to translate an API from one language/framework to another. The workflow:

Hides tests from the agent during development
Provides a consistent development environment
Reports standardized metrics: Build Success (0/1) and Pass Rate (0.0-1.0)
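
As a rough illustration of the two metrics, the sketch below computes them from a build result and test counts. The function name `compute_metrics` is hypothetical. The guide does not define Pass Rate, so treating it as passed tests divided by total tests, and reporting 0.0 when the build fails, are assumptions; test.sh may compute these differently.

```python
def compute_metrics(build_ok: bool, passed: int, total: int) -> dict:
    """Return the two reported metrics for one benchmark run."""
    if passed < 0 or total < 0 or passed > total:
        raise ValueError("expected 0 <= passed <= total")
    # Assumption: a failed build cannot serve requests, so its pass rate is 0.0.
    if not build_ok or total == 0:
        pass_rate = 0.0
    else:
        pass_rate = passed / total
    return {"build_success": 1 if build_ok else 0, "pass_rate": pass_rate}
```

For example, a destination that builds and passes 3 of 4 tests would report Build Success 1 and Pass Rate 0.75 under these assumptions.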

Standard Benchmark Structure

benchmark-name/
├── workspace/              # Visible to agent in dev container
│   ├── src/               # Source implementation (complete)
│   ├── dst/               # Destination (empty or stub)
│   └── prompt.md          # Translation task instructions
├── benchmark.yml          # Configuration (build/run/test settings)
├── Dockerfile.dev         # Multi-runtime dev container
├── docker-compose.dev.yml # Container orchestration
├── test.sh                # Evaluation script (reusable)
├── tests/                 # Integration tests (hidden from agent)
│   └── api_test.hurl
├── generate_prompt.py     # Generates prompt.md from config (reusable)
├── CLAUDE.md              # Documentation for Claude Code
└── README.md              # Quick start guide
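
The layout above can be checked mechanically. The following is a minimal sketch, not part of the template: `EXPECTED_PATHS` is copied from the tree, and the helper name `missing_paths` is hypothetical. Note that workspace/prompt.md only exists after it has been generated.

```python
from pathlib import Path

# Paths taken from the directory tree above, relative to the benchmark root.
EXPECTED_PATHS = [
    "workspace/src",
    "workspace/dst",
    "workspace/prompt.md",
    "benchmark.yml",
    "Dockerfile.dev",
    "docker-compose.dev.yml",
    "test.sh",
    "tests/api_test.hurl",
    "generate_prompt.py",
    "CLAUDE.md",
    "README.md",
]


def missing_paths(root: str) -> list:
    """Return the expected paths that do not exist under root."""
    base = Path(root)
    return [p for p in EXPECTED_PATHS if not (base / p).exists()]
```

Calling `missing_paths("../your-benchmark-name")` returns an empty list when every file and directory from the tree is present.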

Template Structure

The templates/benchmark-skeleton/ directory contains:

Reusable scripts: test.sh, generate_prompt.py (work for all benchmarks)
Template files: benchmark.yml, docker-compose.dev.yml, Dockerfile.dev (pre-configured with placeholders)
Empty directories: workspace/src/, workspace/dst/, tests/
Example test: tests/api_test.hurl (replace with actual tests)

Creating a New Benchmark

Step 1: Initialize from Template

Use the init script to create a new benchmark from the template:

python3 init_benchmark.py \
  --name "your-benchmark-name" \
  --source python \
  --target go \
  --description "Brief description of what the API does" \
  --source-desc "Flask-based" \
  --target-desc "Gin-based" \
  --source-build "pip install -r requirements.txt" \
  --source-run "python app.py" \
  --source-port-env "PORT" \
  --target-build "go build -o app" \
  --target-run "./app" \
  --target-port-env "SERVER_PORT"

Note: Use --source-port-env and --target-port-env to specify the environment variable names your applications use for the port. Common conventions:

Flask/Django: PORT
Spring Boot: SERVER_PORT
Go (varies): PORT or SERVER_PORT
Node.js/Express: PORT

The default is SERVER_PORT if not specified.

This creates ../your-benchmark-name/ with the complete structure and configured files.

Step 2: Edit Dockerfile.dev

Edit the generated Dockerfile.dev to install both language runtimes. The template includes commented examples for common languages.

Example for Python + Go:

# Install Python
RUN apt-get update && apt-get install -y \
    python3.10 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Install Go
RUN wget https://go.dev/dl/go1.20.linux-amd64.tar.gz && \
    tar -C /usr/local -xzf go1.20.linux-amd64.tar.gz && \
    rm go1.20.linux-amd64.tar.gz
ENV PATH="/usr/local/go/bin:${PATH}"

Step 3: Create Source Implementation

In workspace/src/, implement a complete, working API:

Must read its port from the environment variable named in port_env_var
Should be simple but representative of real translation challenges
Include any necessary config files (requirements.txt, go.mod, package.json, etc.)

Example (Python/Flask):

import os
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return {"message": "Hello World"}

if __name__ == "__main__":
    # The variable name must match --source-port-env (SERVER_PORT is the default).
    port = int(os.environ.get("SERVER_PORT", 3000))
    app.run(host="0.0.0.0", port=port)

Step 4: Create Destination Stub (Optional)

In workspace/dst/, you can either:

Leave it empty (agent starts from scratch)
Provide a minimal stub (e.g., just pom.xml for Java, go.mod for Go)

Step 5: Write Integration Tests

Create tests/api_test.hurl with comprehensive test cases:

# Use {{port}} variable, not hardcoded port
GET http://localhost:{{port}}/api/users
HTTP 200
[Asserts]
header "Content-Type" contains "application/json"
jsonpath "$.users" count > 0

GET http://localhost:{{port}}/api/users/1
HTTP 200
[Asserts]
jsonpath "$.id" == 1
jsonpath "$.name" exists
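
The `{{port}}` placeholder is a Hurl template variable, so a value must be supplied when the file is run. The sketch below shows one way to do that from Python using Hurl's `--variable` and `--test` options. The helper names are hypothetical, and how test.sh actually invokes Hurl is not shown in this guide.

```python
import subprocess


def build_hurl_command(port: int, test_file: str = "tests/api_test.hurl") -> list:
    """Build a hurl invocation that fills in the {{port}} template variable."""
    return ["hurl", "--test", "--variable", f"port={port}", test_file]


def run_hurl(port: int, test_file: str = "tests/api_test.hurl") -> bool:
    """Run hurl and report whether it exited successfully (exit code 0)."""
    result = subprocess.run(build_hurl_command(port, test_file))
    return result.returncode == 0
```

Because the port is passed in at run time, the same test file can be used against both the source and the destination server.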

Test coverage guidelines:

Happy path (default cases)
Edge cases (empty inputs, special characters)
Error cases (404, validation errors)
All major endpoints
Data transformations

Step 6: Generate prompt.md

cd ../your-benchmark-name
python3 generate_prompt.py

This creates workspace/prompt.md from your benchmark.yml.
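
To give a sense of what generating a prompt from configuration involves, here is a purely illustrative sketch. It does not reproduce generate_prompt.py: the function name, the dictionary keys (which mirror the init script flags), and the prompt wording are all assumptions, and a plain dict stands in for the parsed benchmark.yml.

```python
def render_prompt(config: dict) -> str:
    """Render a short translation task description from benchmark settings."""
    lines = [
        f"# Translation task: {config['name']}",
        "",
        config["description"],
        "",
        f"Translate the {config['source_desc']} {config['source']} API in src/ "
        f"into a {config['target_desc']} {config['target']} API in dst/.",
        f"The translated server must read its port from the "
        f"{config['target_port_env']} environment variable.",
    ]
    return "\n".join(lines) + "\n"
```

Keeping the prompt derived from configuration means a change to the port variable or framework description only has to be made in one place.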

Step 7: Test Your Benchmark

# Ensure source implementation passes tests
./test.sh

With an empty dst/, the output should look like this:

Source (python) API: PASSED ✓
Destination (go) API: FAILED ✗  # Expected if dst is empty
Build Success: 0
Pass Rate: 0.00
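
When collecting results across several benchmarks, the two metric lines can be pulled out of the captured output. The sketch below assumes the output format is exactly as shown above (`Build Success: <0|1>` and `Pass Rate: <float>` on their own lines); the function name `parse_metrics` is hypothetical.

```python
import re


def parse_metrics(output: str) -> dict:
    """Extract Build Success and Pass Rate from test.sh output text."""
    build = re.search(r"^Build Success:\s*([01])\s*$", output, re.MULTILINE)
    rate = re.search(r"^Pass Rate:\s*([0-9]*\.?[0-9]+)\s*$", output, re.MULTILINE)
    if build is None or rate is None:
        raise ValueError("metrics lines not found in output")
    return {"build_success": int(build.group(1)), "pass_rate": float(rate.group(1))}
```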

Supported Language Pairs (Priority)

Python → Java
Node.js → Go
Python → Go
Java → Python
Ruby → Python

Requirements

On host machine:

Docker & Docker Compose
Hurl
Python 3 with PyYAML: pip3 install pyyaml

Key Design Principles

Tests hidden from agent: Keep tests/ outside workspace/
Single source of truth: All config in benchmark.yml
Environment variables for ports: Never hardcode ports
Reusable scripts: test.sh and generate_prompt.py work for all benchmarks
Consistent metrics: Always report Build Success (0/1) and Pass Rate (0.0-1.0)

File Checklist

When creating a new benchmark, ensure you have every file and directory listed under Standard Benchmark Structure.

Example: Hello Translation

See the hello-translation/ benchmark as a reference implementation for Python → Java translation.

Troubleshooting

Tests failing on source API:

Check startup_wait; the server may need more time to start
Verify the port environment variable is set correctly
Check that the source implementation reads from the correct environment variable
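
To tell a slow startup apart from a server that never listens on the expected port, it can help to poll the port directly. This is a debugging sketch using a different technique from the fixed startup_wait delay, not a description of what test.sh does; the function name and default values are assumptions.

```python
import socket
import time


def wait_for_port(port: int, host: str = "127.0.0.1", timeout: float = 10.0) -> bool:
    """Poll until host:port accepts TCP connections or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            time.sleep(0.1)
    return False
```

If this returns False even with a generous timeout, the server is probably reading a different environment variable than the one being set.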

Build command not working:

Ensure all dependencies are in Dockerfile.dev
Check working directory (cd src or cd dst in test.sh)
Verify build command syntax in benchmark.yml

Port conflicts:

Change the port in benchmark.yml
Ensure no other service is using the port
The source and destination should use the same port; they are started in separate test runs, so they do not conflict with each other
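
A quick way to check for a conflict before running the tests is to try binding the port. This is a minimal sketch with a hypothetical helper name; it only checks TCP on the given host address at the moment it is called.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Binding fails when another socket already holds the port.
            return False
    return True
```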

Contributing

To add a new benchmark:

Follow this guide to create the benchmark
Test thoroughly with ./test.sh
Document any language-specific considerations in CLAUDE.md
Share with the team for review
