WebArena-Infinity is a scalable approach for automatically generating realistic web environments paired with verifiable tasks, enabling robust training and evaluation of general-purpose browser agents. Instead of relying on labor-intensive manual construction, our approach uses a multi-agent system that coordinates coding agents with privileged access to environment internals and browser agents that interact through the UI—to iteratively generate, test, audit, and refine environments. Starting from real-world artifacts such as product manuals and workflows, the approach produces high-fidelity applications, difficulty-graded tasks, and reliable programmatic verifiers, all optimized for future reinforcement learning. The result is a cost-efficient, reproducible, and extensible benchmark suite that captures meaningful real-world complexity while remaining fully automated.
The repo ships with pre-built environments you can browse and evaluate agents against immediately.
Requires Python 3.12+ and uv.
git clone https://github.com/web-arena-x/webarena-infinity.git
cd webarena-infinity
bash setup.shPick any app and launch its server:
cd apps/gmail && python3 server.py --port 8000Open http://localhost:8000 in your browser to explore the app. Each environment is a self-contained web app with realistic seed data and UI interactions.
Set up at least one API key in a .env file (gitignored by default):
# .env — at least one key required
export GOOGLE_API_KEY='...' # For gemini-flash, gemini-pro, gemini-cu
export OPENAI_API_KEY='...' # For gpt
export ANTHROPIC_API_KEY='...' # For claude, claude-cuThen run an evaluation:
source .env
python evaluation/run_eval_parallel.py \
--model gpt \
--difficulty easy \
--workers 4 \
--web-app apps/gmailResults go to apps/<app>/results/ and include a browsable report.html.
--model flag |
Agent | API key required |
|---|---|---|
gpt |
GPT-4o via browser-use | OPENAI_API_KEY |
claude |
Claude Sonnet 4.6 via browser-use | ANTHROPIC_API_KEY |
gemini-flash |
Gemini Flash 3 via browser-use | GOOGLE_API_KEY |
gemini-pro |
Gemini Pro 3 via browser-use | GOOGLE_API_KEY |
gemini-cu |
Gemini computer-use (coordinate-based) | GOOGLE_API_KEY |
claude-cu |
Claude computer-use (coordinate-based) | ANTHROPIC_API_KEY |
kimi |
Moonshot Kimi K2.5 | KIMI_API_KEY |
qwen |
Alibaba Qwen 3.5 VL | DASHSCOPE_API_KEY |
Warning: The pipeline invokes Claude Code with
--dangerously-skip-permissionsso it can generate and modify code autonomously, hence there is inherent risk of unwanted behaviors. Run this in a sandboxed environment (VM, container, or disposable instance) if possible.
The full pipeline uses Claude Code to generate apps from documentation, then evaluates and iterates. This requires the Claude Code CLI.
Create a markdown file in apps/app-description/ describing the web app you want to create for your browser-use agent. See the existing descriptions for reference:
python infra/pipeline.py \
--app-name my-app \
--docs-path apps/app-description/my-app.md \
--model gemini-flash \
--workers 4This will create apps/my-app/ with the generated web app, tasks, and the verifiers.
For generating multiple environments in parallel on EC2, each environment gets its own instance. This section walks through the full setup.
- AWS CLI v2 configured with credentials (
aws configure) - An EC2 key pair in your target region
- A GitHub personal access token with repo access
- Claude Code CLI authentication (for the base AMI)
All AWS resource names and defaults are centralized in infra/config.sh. Every infrastructure script sources this file. Edit it once to customize for your AWS account:
# infra/config.sh — edit these values for your setup
MM_REGION="${AWS_REGION:-us-east-1}" # AWS region
MM_S3_BUCKET="${MM_S3_BUCKET:-mirror-mirror-results}" # S3 bucket for results
MM_IAM_ROLE="${MM_IAM_ROLE:-mirror-mirror-ec2}" # IAM role for EC2 instances
MM_SECURITY_GROUP="${MM_SECURITY_GROUP:-mm-pipeline}" # Security group name
MM_INSTANCE_TYPE="${MM_INSTANCE_TYPE:-m5.4xlarge}" # EC2 instance type
MM_PROJECT_TAG="${MM_PROJECT_TAG:-mirror-mirror}" # AWS tag for resource discoveryEach value can also be overridden via environment variable without editing the file.
Create a .env file in the repo root with your credentials:
# Required
export GITHUB_TOKEN='ghp_...' # GitHub PAT for git clone/push on instances
export GITHUB_REPO='your-org/webarena-infinity' # GitHub owner/repo
export KEY_PAIR_NAME='my-key-pair' # Name of your EC2 key pair
# Agent API keys (at least one)
export GOOGLE_API_KEY='...' # For Gemini agents
export OPENAI_API_KEY='...' # For GPT agent
export ANTHROPIC_API_KEY='...' # For Claude agentsA base AMI pre-installs all dependencies (Python, Node.js, Claude CLI, Playwright, Chromium) so instances start fast. This step requires manual SSH intervention to authenticate Claude Code.
bash infra/setup/create_base_ami.sh --key-pair my-key-pairThe script will:
- Launch a temporary EC2 instance
- Install all system and Python dependencies
- Pause and ask you to SSH in to run:
claude login claude plugins install frontend-design
- Create an AMI from the instance
- Clean up the temporary instance
Note the AMI ID from the output — you'll use it in the next step.
Create a JSONL file where each line maps an environment to its source documentation:
{"env_id": "linear-account-settings", "docs_path": "apps/user-manuals/linear/02-account"}
{"env_id": "gmail-organize-and-manage", "docs_path": "apps/user-manuals/gmail/organize-and-manage"}Pre-built manifests are available:
infra/env_manifest.jsonl— full suite (19 environments)infra/env_manifest_quick.jsonl— quick test (2 environments)
source .env
bash infra/setup/run_batches.sh \
--manifest infra/env_manifest.jsonl \
--batch-size 5 \
--ami ami-0abc123def456 \
--model gemini-flash \
--workers 8run_batches.sh handles the full lifecycle automatically:
- Launches a batch of EC2 instances (one per environment)
- Starts the pipeline on each instance
- Polls for completion
- Tears down finished instances
- Launches the next batch until all environments are done
Crashed instances are kept alive for debugging. Once investigated, clean them up with:
bash infra/setup/teardown.sh --allYou can monitor progress while batches run:
bash infra/setup/monitor.sh --watch 120 # refresh every 2 minutesTo aggregate results across environments into S3 with a browsable HTML dashboard, first create the bucket:
bash infra/setup/setup_s3.shThen collect:
python infra/collect_results.pyThis generates a cross-environment summary and uploads an HTML dashboard to http://{MM_S3_BUCKET}.s3-website-{MM_REGION}.amazonaws.com/.
@article{zhou2026wainf,
title = "WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale",
author = "Shuyan Zhou",
journal = "shuyanzhou.com",
year = "2026",
month = "March",
url = "https://webarena.dev/webarena-infinity/"
}