Run coding agent evaluations on SWE-bench style tasks using Modal sandboxes.
Anvil makes it easy to run agents against SWE-bench Pro tasks. It handles the infrastructure—spinning up Modal sandboxes, applying patches, running test harnesses, aggregating results—so you can benchmark different models and configurations in just 2 commands.
1. Install dependencies
```bash
uv venv
source .venv/bin/activate
uv sync
```

2. Configure environment
Copy `.env.example` to `.env` and fill in:

- `OPENROUTER_API_KEY` (or whichever provider you're using)
- `REGISTRY_USERNAME`: your Docker Hub username
- `REGISTRY_PASSWORD`: a Docker Hub access token
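For example, a filled-in `.env` might look like the sketch below (placeholder values only; check `.env.example` for the exact set of keys your setup needs):

```bash
# .env (placeholder values, not real credentials)
OPENROUTER_API_KEY=sk-or-v1-xxxxxxxxxxxxxxxx
REGISTRY_USERNAME=your-dockerhub-username
REGISTRY_PASSWORD=dckr_pat_xxxxxxxxxxxxxxxx
```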
3. Authenticate services
Make sure Docker is running locally, then:
```bash
modal setup    # Modal account for sandboxed execution
docker login   # Docker Hub for image pulls
```

4. Create a private Docker Hub repository
Go to hub.docker.com and create a new private repository (e.g., anvil-images).
⚠️ Public repos will not work—Anvil refuses to push task images to public repositories to prevent data leakage.
Build and push Docker images for a dataset to your private repo:
```bash
anvil publish-images --dataset datasets/file-utilization -u <dockerhub-username> --repo anvil-images
```

Modal sandboxes pull images from Docker Hub, so task images need to be pushed there first.
To remove local Anvil images: `docker rmi $(docker images <dockerhub-username>/anvil-images -q) --force`
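To see which task images exist locally before removing them, a plain Docker listing against the same namespace works:

```bash
# List the locally built Anvil task images and their tags
docker images <dockerhub-username>/anvil-images
```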
Run an agent on all tasks and evaluate the patches:
```bash
anvil run-evals \
  --model openrouter/google/gemini-2.5-flash \
  --dataset datasets/file-utilization \
  --agent mini-swe-agent \
  --dockerhub-username <dockerhub-username> \
  --dockerhub-repo anvil-images \
  --n-attempts 3
```

Use `--n-attempts` to control how many runs per task (useful for pass@k metrics). Results are saved to `<dataset>/runs/<agent>_<model>/`.
💡 Progress is saved automatically to minimize costs. If you re-run the same command, completed tasks are skipped and nothing runs on Modal for those tasks. Use `--no-continue` to start fresh.
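For example, to repeat the run above from scratch, ignoring any previously saved results:

```bash
# Same run as above, but discarding earlier progress
anvil run-evals \
  --model openrouter/google/gemini-2.5-flash \
  --dataset datasets/file-utilization \
  --agent mini-swe-agent \
  --dockerhub-username <dockerhub-username> \
  --dockerhub-repo anvil-images \
  --n-attempts 3 \
  --no-continue
```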
💡 Use `--agent oracle` to run golden patches from `gold_patches.json` instead of an LLM; useful for validating your test harness.
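For example, a harness-validation run could look like the sketch below (per the flag table, `--model` can be omitted for the oracle agent):

```bash
# Apply the golden patches from gold_patches.json and run the test harness
anvil run-evals \
  --agent oracle \
  --dataset datasets/file-utilization \
  --dockerhub-username <dockerhub-username> \
  --dockerhub-repo anvil-images
```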
| Flag | Default | Description |
|---|---|---|
| `--model` | — | Model ID (required for agents, optional for oracle) |
| `--dataset` | — | Dataset ID or path |
| `--dockerhub-username` | — | Docker Hub username |
| `--dockerhub-repo` | — | Docker Hub repo name |
| `--agent` | `mini-swe-agent` | Agent to use (`mini-swe-agent` or `oracle`) |
| `--n-attempts` | 1 | Attempts per task (for pass@k) |
| `--max-parallel` | 30 | Concurrent agent runs |
| `--no-continue` | false | Start fresh, ignore previous results |
| `--max-wait` | auto | Minutes to wait for Modal rate limits |
- **Agent phase**: Each task runs in a Modal sandbox using the pre-built Docker image. The agent (`mini-swe-agent`) receives the problem statement and generates a patch.
- **Eval phase**: Patches are applied and test harnesses run inside containers. Results are aggregated into pass/fail per task.
- **Output**: Trajectories, patches, stdout/stderr, and eval results are saved per task. A summary with pass@k metrics is printed at the end.
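For reference, pass@k is conventionally reported with the unbiased estimator of Chen et al. (2021); assuming Anvil follows that convention, with $n$ attempts per task (`--n-attempts`) of which $c$ pass:

$$\text{pass@}k \;=\; \mathbb{E}_{\text{tasks}}\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right]$$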