Run coding agent evaluations on SWE-bench style tasks using Modal sandboxes.
Anvil makes it easy to run agents against SWE-bench Pro tasks. It handles the infrastructure—spinning up Modal sandboxes, applying patches, running test harnesses, aggregating results—so you can benchmark different models and configurations in just 2 commands.
1. Install dependencies
```bash
uv venv
source .venv/bin/activate
uv sync
```

2. Configure environment
Copy `.env.example` to `.env` and fill in:

- `OPENROUTER_API_KEY` (or whichever provider you're using)
- `REGISTRY_USERNAME`: your Docker Hub username
- `REGISTRY_PASSWORD`: a Docker Hub access token
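For example, a filled-in `.env` might look like the sketch below (placeholder values only; check `.env.example` for the exact set of keys your setup needs):

```bash
# .env (placeholder values, not real credentials)
OPENROUTER_API_KEY=sk-or-v1-xxxxxxxxxxxxxxxx
REGISTRY_USERNAME=your-dockerhub-username
REGISTRY_PASSWORD=dckr_pat_xxxxxxxxxxxxxxxx
```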
3. Authenticate services
Make sure Docker is running locally, then:
```bash
modal setup    # Modal account for sandboxed execution
docker login   # Docker Hub for image pulls
```

4. Create a private Docker Hub repository
Go to hub.docker.com and create a new private repository (e.g., anvil-images).
⚠️ Public repos will not work—Anvil refuses to push task images to public repositories to prevent data leakage.
Build and push Docker images for a dataset to your private repo:
```bash
anvil publish-images --dataset datasets/file-utilization -u <dockerhub-username> --repo anvil-images
```

Modal sandboxes pull images from Docker Hub, so task images need to be pushed there first.
To remove local Anvil images: `docker rmi $(docker images <dockerhub-username>/anvil-images -q) --force`
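To see which task images exist locally before removing them, a plain Docker listing against the same namespace works:

```bash
# List the locally built Anvil task images and their tags
docker images <dockerhub-username>/anvil-images
```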
Run an agent on all tasks and evaluate the patches:
```bash
anvil run-evals \
  --model openrouter/google/gemini-2.5-flash \
  --dataset datasets/file-utilization \
  --agent mini-swe-agent \
  --dockerhub-username <dockerhub-username> \
  --dockerhub-repo anvil-images \
  --n-attempts 3
```

Use `--n-attempts` to control how many runs per task (useful for pass@k metrics). Results are saved to `<dataset>/runs/<agent>_<model>/`.
💡 Progress is saved automatically to minimize costs. If you re-run the same command, completed tasks are skipped and nothing runs on Modal for those tasks. Use `--no-continue` to start fresh.
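For example, to repeat the run above from scratch, ignoring any previously saved results:

```bash
# Same run as above, but discarding earlier progress
anvil run-evals \
  --model openrouter/google/gemini-2.5-flash \
  --dataset datasets/file-utilization \
  --agent mini-swe-agent \
  --dockerhub-username <dockerhub-username> \
  --dockerhub-repo anvil-images \
  --n-attempts 3 \
  --no-continue
```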
💡 Use `--agent oracle` to run golden patches from `gold_patches.json` instead of an LLM; useful for validating your test harness.
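For example, a harness-validation run could look like the sketch below (per the flag table, `--model` can be omitted for the oracle agent):

```bash
# Apply the golden patches from gold_patches.json and run the test harness
anvil run-evals \
  --agent oracle \
  --dataset datasets/file-utilization \
  --dockerhub-username <dockerhub-username> \
  --dockerhub-repo anvil-images
```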
| Flag | Default | Description |
|---|---|---|
| `--model` | — | Model ID (required for agents, optional for oracle) |
| `--dataset` | — | Dataset ID or path |
| `--dockerhub-username` | — | Docker Hub username |
| `--dockerhub-repo` | — | Docker Hub repo name |
| `--agent` | `mini-swe-agent` | Agent to use (`mini-swe-agent` or `oracle`) |
| `--n-attempts` | 1 | Attempts per task (for pass@k) |
| `--max-parallel` | 30 | Concurrent agent runs |
| `--no-continue` | false | Start fresh, ignore previous results |
| `--max-wait` | auto | Minutes to wait for Modal rate limits |
- **Agent phase**: Each task runs in a Modal sandbox using the pre-built Docker image. The agent (`mini-swe-agent`) receives the problem statement and generates a patch.
- **Eval phase**: Patches are applied and test harnesses run inside containers. Results are aggregated into pass/fail per task.
- **Output**: Trajectories, patches, stdout/stderr, and eval results are saved per task. A summary with pass@k metrics is printed at the end.
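For reference, pass@k is conventionally reported with the unbiased estimator of Chen et al. (2021); assuming Anvil follows that convention, with $n$ attempts per task (`--n-attempts`) of which $c$ pass:

$$\text{pass@}k \;=\; \mathbb{E}_{\text{tasks}}\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right]$$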