WebArena-Infinity

Generating Browser Environments with Verifiable Tasks at Scale

WebArena-Infinity is a scalable approach for automatically generating realistic web environments paired with verifiable tasks, enabling robust training and evaluation of general-purpose browser agents. Instead of relying on labor-intensive manual construction, our approach uses a multi-agent system that coordinates coding agents with privileged access to environment internals and browser agents that interact through the UI—to iteratively generate, test, audit, and refine environments. Starting from real-world artifacts such as product manuals and workflows, the approach produces high-fidelity applications, difficulty-graded tasks, and reliable programmatic verifiers, all optimized for future reinforcement learning. The result is a cost-efficient, reproducible, and extensible benchmark suite that captures meaningful real-world complexity while remaining fully automated.

Quick Start: Explore Existing Environments

The repo ships with pre-built environments you can browse and evaluate agents against immediately.

1. Install

Requires Python 3.12+ and uv.

git clone https://github.com/web-arena-x/webarena-infinity.git
cd webarena-infinity
bash setup.sh

2. Browse an Environment

Pick any app and launch its server:

cd apps/gmail && python3 server.py --port 8000

Open http://localhost:8000 in your browser to explore the app. Each environment is a self-contained web app with realistic seed data and UI interactions.

3. Run an Agent Against It

Set up at least one API key in a .env file (gitignored by default):

# .env — at least one key required
export GOOGLE_API_KEY='...'       # For gemini-flash, gemini-pro, gemini-cu
export OPENAI_API_KEY='...'       # For gpt
export ANTHROPIC_API_KEY='...'    # For claude, claude-cu

Then run an evaluation:

source .env

python evaluation/run_eval_parallel.py \
    --model gpt \
    --difficulty easy \
    --workers 4 \
    --web-app apps/gmail

Results go to apps/<app>/results/ and include a browsable report.html.

Supported Agents

`--model` flag	Agent	API key required
`gpt`	GPT-4o via browser-use	`OPENAI_API_KEY`
`claude`	Claude Sonnet 4.6 via browser-use	`ANTHROPIC_API_KEY`
`gemini-flash`	Gemini Flash 3 via browser-use	`GOOGLE_API_KEY`
`gemini-pro`	Gemini Pro 3 via browser-use	`GOOGLE_API_KEY`
`gemini-cu`	Gemini computer-use (coordinate-based)	`GOOGLE_API_KEY`
`claude-cu`	Claude computer-use (coordinate-based)	`ANTHROPIC_API_KEY`
`kimi`	Moonshot Kimi K2.5	`KIMI_API_KEY`
`qwen`	Alibaba Qwen 3.5 VL	`DASHSCOPE_API_KEY`

Quick Try: Run the Full Approach Locally

Warning: The pipeline invokes Claude Code with --dangerously-skip-permissions so it can generate and modify code autonomously, hence there is inherent risk of unwanted behaviors. Run this in a sandboxed environment (VM, container, or disposable instance) if possible.

The full pipeline uses Claude Code to generate apps from documentation, then evaluates and iterates. This requires the Claude Code CLI.

1. Write an app description

Create a markdown file in apps/app-description/ describing the web app you want to create for your browser-use agent. See the existing descriptions for reference:

2. Run the pipeline

python infra/pipeline.py \
    --app-name my-app \
    --docs-path apps/app-description/my-app.md \
    --model gemini-flash \
    --workers 4

This will create apps/my-app/ with the generated web app, tasks, and the verifiers.

AWS Setup (Parallel Multi-Environment)

For generating multiple environments in parallel on EC2, each environment gets its own instance. This section walks through the full setup.

Prerequisites

AWS CLI v2 configured with credentials (aws configure)
An EC2 key pair in your target region
A GitHub personal access token with repo access
Claude Code CLI authentication (for the base AMI)

1. Configure Infrastructure

All AWS resource names and defaults are centralized in infra/config.sh. Every infrastructure script sources this file. Edit it once to customize for your AWS account:

# infra/config.sh — edit these values for your setup
MM_REGION="${AWS_REGION:-us-east-1}"              # AWS region
MM_S3_BUCKET="${MM_S3_BUCKET:-mirror-mirror-results}"  # S3 bucket for results
MM_IAM_ROLE="${MM_IAM_ROLE:-mirror-mirror-ec2}"        # IAM role for EC2 instances
MM_SECURITY_GROUP="${MM_SECURITY_GROUP:-mm-pipeline}"  # Security group name
MM_INSTANCE_TYPE="${MM_INSTANCE_TYPE:-m5.4xlarge}"      # EC2 instance type
MM_PROJECT_TAG="${MM_PROJECT_TAG:-mirror-mirror}"       # AWS tag for resource discovery

Each value can also be overridden via environment variable without editing the file.

2. Environment Variables

Create a .env file in the repo root with your credentials:

#  Required 
export GITHUB_TOKEN='ghp_...'          # GitHub PAT for git clone/push on instances
export GITHUB_REPO='your-org/webarena-infinity'  # GitHub owner/repo
export KEY_PAIR_NAME='my-key-pair'     # Name of your EC2 key pair

#  Agent API keys (at least one) 
export GOOGLE_API_KEY='...'            # For Gemini agents
export OPENAI_API_KEY='...'            # For GPT agent
export ANTHROPIC_API_KEY='...'         # For Claude agents

3. Create a Base AMI

A base AMI pre-installs all dependencies (Python, Node.js, Claude CLI, Playwright, Chromium) so instances start fast. This step requires manual SSH intervention to authenticate Claude Code.

bash infra/setup/create_base_ami.sh --key-pair my-key-pair

The script will:

Launch a temporary EC2 instance
Install all system and Python dependencies

Pause and ask you to SSH in to run:

claude login
claude plugins install frontend-design

Create an AMI from the instance
Clean up the temporary instance

Note the AMI ID from the output — you'll use it in the next step.

4. Define Your Environment Manifest

Create a JSONL file where each line maps an environment to its source documentation:

{"env_id": "linear-account-settings", "docs_path": "apps/user-manuals/linear/02-account"}
{"env_id": "gmail-organize-and-manage", "docs_path": "apps/user-manuals/gmail/organize-and-manage"}

Pre-built manifests are available:

infra/env_manifest.jsonl — full suite (19 environments)
infra/env_manifest_quick.jsonl — quick test (2 environments)

5. Run

source .env

bash infra/setup/run_batches.sh \
    --manifest infra/env_manifest.jsonl \
    --batch-size 5 \
    --ami ami-0abc123def456 \
    --model gemini-flash \
    --workers 8

run_batches.sh handles the full lifecycle automatically:

Launches a batch of EC2 instances (one per environment)
Starts the pipeline on each instance
Polls for completion
Tears down finished instances
Launches the next batch until all environments are done

Crashed instances are kept alive for debugging. Once investigated, clean them up with:

bash infra/setup/teardown.sh --all

You can monitor progress while batches run:

bash infra/setup/monitor.sh --watch 120   # refresh every 2 minutes

6. Collect Results (Optional)

To aggregate results across environments into S3 with a browsable HTML dashboard, first create the bucket:

bash infra/setup/setup_s3.sh

Then collect:

python infra/collect_results.py

This generates a cross-environment summary and uploads an HTML dashboard to http://{MM_S3_BUCKET}.s3-website-{MM_REGION}.amazonaws.com/.

Citation

@article{zhou2026wainf,
  title   = "WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale",
  author  = "Shuyan Zhou",
  journal = "shuyanzhou.com",
  year    = "2026",
  month   = "March",
  url     = "https://webarena.dev/webarena-infinity/"
}

Name		Name	Last commit message	Last commit date
Latest commit History 275 Commits
.github/workflows		.github/workflows
apps		apps
bench		bench
docs		docs
evaluation		evaluation
infra		infra
.gitignore		.gitignore
.python-version		.python-version
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
demo_tasks.json		demo_tasks.json
pyproject.toml		pyproject.toml
serve_all.py		serve_all.py
setup.sh		setup.sh
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

WebArena-Infinity

Generating Browser Environments with Verifiable Tasks at Scale

Quick Start: Explore Existing Environments

1. Install

2. Browse an Environment

3. Run an Agent Against It

Supported Agents

Quick Try: Run the Full Approach Locally

1. Write an app description

2. Run the pipeline

AWS Setup (Parallel Multi-Environment)

Prerequisites

1. Configure Infrastructure

2. Environment Variables

3. Create a Base AMI

4. Define Your Environment Manifest

5. Run

6. Collect Results (Optional)

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

WebArena-Infinity

Generating Browser Environments with Verifiable Tasks at Scale

Quick Start: Explore Existing Environments

1. Install

2. Browse an Environment

3. Run an Agent Against It

Supported Agents

Quick Try: Run the Full Approach Locally

1. Write an app description

2. Run the pipeline

AWS Setup (Parallel Multi-Environment)

Prerequisites

1. Configure Infrastructure

2. Environment Variables

3. Create a Base AMI

4. Define Your Environment Manifest

5. Run

6. Collect Results (Optional)

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages