A product-agnostic, AI-driven exploratory test framework that intelligently
explores, tests, and validates any web application. Configure it for your stack via a
small config.yaml, point it at your app, and let specialized agents drive a real browser
to find bugs, render anomalies, and unscripted edge cases.
Powered by a LangGraph Swarm architecture, Playwright, and your choice of Claude (default) or Google Gemini, this framework dynamically routes tasks to behavioral QA personas and advanced stress/exploration agents, self-heals from UI errors, optionally consults user-provided MCP servers and Agent Skills for domain knowledge, generates reproducible Playwright test scripts from every bug found, and writes Markdown executive test reports.
It can also analyze GitHub Pull Requests: pass a PR URL and the framework extracts the code diff, feeds it to an LLM, and auto-generates targeted test missions covering the UI areas most likely impacted by the changes.
The framework is built on a Supervisor-Worker Swarm pattern. Based on the mission type
(determined by the thread_id keyword), the system spins up either a Standard or
Advanced routing graph.
graph TD
classDef user fill:#6366f1,stroke:#4f46e5,stroke-width:2px,color:#fff;
classDef core fill:#3b82f6,stroke:#2563eb,stroke-width:2px,color:#fff;
classDef supervisor fill:#f59e0b,stroke:#d97706,stroke-width:2px,color:#fff;
classDef agent fill:#10b981,stroke:#059669,stroke-width:2px,color:#fff;
classDef db fill:#8b5cf6,stroke:#7c3aed,stroke-width:2px,color:#fff;
classDef tool fill:#ec4899,stroke:#db2777,stroke-width:2px,color:#fff;
classDef external fill:#475569,stroke:#334155,stroke-width:2px,color:#fff;
User([User / CI]):::user -->|YAML Missions| Main(main.py):::core
User -->|GitHub PR URL| PR(pr_analyzer.py):::core
PR -->|MCP or gh CLI| GH[GitHub API]:::external
PR -->|Generated Missions| Main
Main -->|Standard Missions| S_Supervisor{QA Supervisor}:::supervisor
Main -->|Advanced Missions| A_Supervisor{Adv. Supervisor}:::supervisor
Main -->|Checkpoints + Store| DB[(SQLite Memory)]:::db
subgraph SQA [Standard QA Swarm]
S_Supervisor <-->|Routes & Returns| S_New([New User Agent]):::agent
S_Supervisor <--> S_Power([Power User Agent]):::agent
S_Supervisor <--> S_Adv([Adversarial User Agent]):::agent
end
subgraph ATS [Advanced Testing Swarm]
A_Supervisor <-->|Routes & Returns| A_Acc([Accessibility User Agent]):::agent
A_Supervisor <--> A_Data([Data Heavy User Agent]):::agent
A_Supervisor <--> A_Imp([Impatient User Agent]):::agent
A_Supervisor <--> A_Ret([Returning User Agent]):::agent
A_Supervisor <--> A_Explorer([Explorer Agent]):::agent
end
SQA --> Tools[[Tools & APIs]]:::tool
ATS --> Tools
subgraph Integrations [External Integrations]
Tools -->|JSON Intents / Action Tape| Engine[Browser Engine]:::external
Engine -->|Playwright| PW[Chromium]:::external
Tools -->|Optional Docs/Knowledge| MCP[User-configured MCP Servers]:::external
Tools -->|Optional Skills| Skills[User-installed Agent Skills]:::external
Tools -->|UI under test| WebApp[Your Web Application]:::external
end
style SQA fill:#f0fdf4,stroke:#22c55e,stroke-width:2px,stroke-dasharray: 5 5,color:#166534
style ATS fill:#fffbeb,stroke:#f59e0b,stroke-width:2px,stroke-dasharray: 5 5,color:#b45309
style Integrations fill:#f8fafc,stroke:#64748b,stroke-width:2px,stroke-dasharray: 5 5,color:#0f172a
- Mission Dispatcher (`main.py`): Loads `missions/*.yaml` files and provisions the correct graph network based on `thread_id` naming conventions (`accessibility`, `data_heavy`, `impatient`, `returning`, `explorer`, `chaos`, or `autonomous` route to the advanced graph; everything else goes to the standard 3-persona swarm). Can also accept a `--pr-url` to auto-generate missions from a GitHub Pull Request via `pr_analyzer.py`.
- Supervisor-Worker Flow: A Supervisor node dynamically evaluates the workspace state and dispatches control to specialized worker nodes.
- Record-and-Translate Browser Engine (`src/agentic_explorer/tools/browser/engine.py`): Agents are the brain; they never touch the browser directly. Instead they emit strict JSON intents to `execute_browser_command`. The engine:
  - Validates selectors against a resilience policy (rejects XPath / positional CSS at runtime).
  - Executes the command with Playwright and captures an Accessibility Tree / DOM snapshot.
  - Appends every command to an immutable Action Tape (`report_<thread_id>/action_tape.jsonl`).
  - On bug detection, `generate_reproduction_spec` translates the tape into a runnable `reproduction_*.spec.ts` Playwright test.
- Tool Modality: Agents receive (1) the deterministic browser engine, (2) screenshot capture and reproduction-generation tools, (3) any MCP servers you configure in `mcp_servers.json`, and (4) any Agent Skills installed under `AGENT_SKILLS_ROOT`. The framework ships zero hardcoded MCP servers or skills; bring your own.
- State & Memory (`agent_memory.sqlite`): An asynchronous SQLite checkpointer remembers agent states (including the `action_tape` field), allowing a reused `thread_id` to resume precisely where it left off. A companion LangGraph Store (optionally configured with an embedding index for semantic search) provides four levels of cross-session memory, with LLM-driven operations powered by Langmem:
  - Semantic: page knowledge, selector reliability, application quirks, plus Langmem-managed agent observations (via the `record_observation` tool)
  - Episodic: session summaries, deduplicated bug catalog
  - Procedural: self-improving agent prompts and routing rules optimized via Langmem's `create_prompt_optimizer`
  - Prioritization: risk-scored page ranking injected into supervisor routing

  Agents can query past findings at runtime via the `recall_past_findings` tool, which uses semantic search when an embedding index is configured and falls back to keyword matching otherwise. Agents can also proactively record observations via the `record_observation` tool (powered by Langmem's `create_manage_memory_tool`). The supervisor receives a `MEMORY_CONTEXT` section with known pages, bugs, quirks, agent observations, and high-risk areas on every routing cycle.
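The `thread_id` routing rule described for the Mission Dispatcher can be illustrated with a short sketch. The keyword list comes from the description above; the function name `select_graph` and the case-insensitive substring match are assumptions made for illustration, not the actual logic in `main.py`.

```python
# Keywords that route a mission to the advanced graph (from the description above).
ADVANCED_KEYWORDS = (
    "accessibility",
    "data_heavy",
    "impatient",
    "returning",
    "explorer",
    "chaos",
    "autonomous",
)


def select_graph(thread_id: str) -> str:
    """Return "advanced" when the thread_id contains an advanced keyword.

    Hypothetical helper: substring matching is an assumption of this sketch.
    """
    lowered = thread_id.lower()
    if any(keyword in lowered for keyword in ADVANCED_KEYWORDS):
        return "advanced"
    return "standard"


if __name__ == "__main__":
    for tid in ("new_user_smoke", "accessibility_audit", "checkout_chaos_run"):
        print(f"{tid} -> {select_graph(tid)}")
```

Under this assumption, a mission whose `thread_id` carries none of the keywords always lands on the standard 3-persona swarm.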
- `src/agentic_explorer/main.py`: CLI entry, swarm graph compiler, transient-error retry
- `src/agentic_explorer/pr_analyzer.py`: PR-driven test scenario generation (GitHub MCP server preferred, `gh` CLI fallback)
- `src/agentic_explorer/auth_setup.py`: generic login flow that saves `auth.json`
- `src/agentic_explorer/config.py`: `config.yaml` loader (with `${ENV}` interpolation)
- `src/agentic_explorer/utils/llm.py`: `make_llm()` multi-provider factory; supports Claude (API key / Vertex AI) and Gemini (API key / OAuth) with auto-detection
- `src/agentic_explorer/utils/llm_json.py`: YAML/JSON extraction helpers for LLM responses
- `src/agentic_explorer/orchestration/graph_base.py`: shared graph infrastructure (AgentState, node factories, tool filtering)
- `src/agentic_explorer/orchestration/standard_graph.py`: 3 standard QA personas
- `src/agentic_explorer/orchestration/advanced_graph.py`: 4 advanced personas plus autonomous explorer
- `src/agentic_explorer/memory.py`: cross-session memory: semantic (pages, selectors, quirks, Langmem-managed agent observations), episodic (session summaries, bug catalog), procedural (self-improving prompts via Langmem prompt optimizer), semantic-search recall tool, proactive observation tool, regression mission generation, app model export, test prioritization
- `src/agentic_explorer/tools/browser/engine.py`: Record-and-Translate browser engine
- `src/agentic_explorer/tools/common/custom_tools.py`: screenshot, MCP loader, Skills tools
- `src/agentic_explorer/ui/state_emitter.py`: non-blocking state bridge for Visual Mode
- `src/agentic_explorer/ui/swarm_diagram.py`: Mermaid diagram generator
- `src/agentic_explorer/ui/dashboard.py`: Streamlit dashboard app
- Product-Agnostic: One small `config.yaml` adapts the framework to any web app.
- Persona-Driven QA Agents: Three standard QA personas plus five advanced agents, each prompted around a specific testing strategy.
- Record-and-Translate Engine: Agents emit JSON intents; the deterministic engine executes and records every step to an immutable Action Tape. Every bug automatically generates a reproducible `reproduction_*.spec.ts` Playwright script.
- Resilient Selector Policy (Engine-Enforced): `execute_browser_command` rejects brittle XPath / positional selectors at runtime, enforcing `data-test-subj` → `aria-label` → visible text priority.
- Self-Healing Browser Execution: Playwright actions are wrapped to catch exceptions. Errors are returned as natural language so agents can adapt strategies.
- Screenshot Evidence: Agents capture full-page screenshots when bugs or anomalies are detected, then generate reproducible Playwright specs from the Action Tape.
- Bring-Your-Own MCP: Plug in any MCP servers via a standard `mcp_servers.json`; agents query them for domain knowledge instead of guessing.
- Bring-Your-Own Skills: Install Agent Skills (per the agentskills.io spec) under `AGENT_SKILLS_ROOT` and the framework exposes them automatically.
- Cross-Session Learning: A four-level memory system (semantic, episodic, procedural, prioritization) powered by Langmem lets agents learn across sessions. The framework remembers page structures, selector reliability, application quirks, agent observations, past bugs, and which testing strategies worked. Agent prompts and supervisor routing rules self-improve via Langmem's prompt optimizer after each batch. Agents can proactively record observations and recall past findings using semantic search.
- Regression Testing: Run `--regression` to auto-generate missions from the bug catalog; no YAML needed. The framework targets pages with known open bugs and historically flaky areas.
- Application Model Export: Run `--export-model` to export the discovered application structure (pages, selectors with reliability scores, bugs, quirks, session stats) as `app_model.json`.
- PR-Driven Test Generation: Pass a GitHub PR URL (`--pr-url`) and the framework extracts the diff (preferring the GitHub MCP server, falling back to the `gh` CLI), sends it to an LLM, and auto-generates targeted mission YAML covering the UI areas impacted by the code changes. When historical bug data exists, it is injected into the LLM prompt for better-targeted missions. Optionally execute the generated missions immediately with `--execute`.
- Automated Artifact Generation: Every test produces an isolated folder containing raw execution traces, the Action Tape, bug screenshots, reproducible `.spec.ts` files, and an executive Markdown report.
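To make the selector policy concrete, here is a minimal sketch of a runtime check that rejects XPath and positional CSS selectors. The regular expressions, the function name `check_selector`, and the error messages are illustrative assumptions; the rules enforced by `execute_browser_command` may differ.

```python
import re

# Patterns treated as brittle in this sketch (assumed, not the engine's real rules).
_XPATH = re.compile(r"^\s*(xpath=|//|\.\./|\(//)")
_POSITIONAL = re.compile(r":nth-(child|of-type)\(|:(first|last)-child|>>\s*nth=")

_HINT = "prefer data-test-subj, then aria-label, then visible text."


def check_selector(selector: str) -> tuple[bool, str]:
    """Return (accepted, message) for a selector string. Hypothetical helper."""
    if _XPATH.search(selector):
        return False, f"XPath selectors are rejected; {_HINT}"
    if _POSITIONAL.search(selector):
        return False, f"Positional CSS selectors are rejected; {_HINT}"
    return True, "ok"


if __name__ == "__main__":
    for candidate in (
        '[data-test-subj="saveButton"]',
        '[aria-label="Close"]',
        "//div[@id='main']/button[2]",
        "ul > li:nth-child(3)",
    ):
        print(candidate, "->", check_selector(candidate))
```

Returning the rejection as a plain-language message, rather than raising, matches the self-healing idea above: the agent can read the message and retry with a more resilient selector.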
Python 3.11+ is required. A virtual environment is highly recommended.
# Create and activate a virtual environment (plain venv or uv)
python -m venv .venv
source .venv/bin/activate
# Install the package and all dependencies (editable mode)
pip install -e .
# Or, if you use uv (recommended, much faster):
uv venv
uv pip install -e .
# Optional: Install Visual Mode (Streamlit dashboard)
pip install -e ".[visual]"
# Or with uv:
uv pip install -e ".[visual]"
# Install the Playwright Chromium browser
playwright install chromium
Keeping dependencies up to date: After pulling new changes, always re-sync your virtual environment to pick up any added or updated packages:
# pip
pip install -e .
# uv
uv pip install -e .
Copy .env.example to .env and fill in your values. The framework supports two LLM
providers, Claude (default) and Gemini, and auto-detects which to use from
available credentials.
# --- LLM Provider (optional; auto-detected from credentials if not set) ---
# LLM_PROVIDER="claude" # or: gemini
# --- Claude authentication (default provider; choose one) ---
# Option A: Direct API key
ANTHROPIC_API_KEY="your_anthropic_api_key_here"
# Option B: Vertex AI (zero config if you already use Claude Code)
# The framework reads ~/.claude/settings.json automatically. If it contains
# CLAUDE_CODE_USE_VERTEX=1 and ANTHROPIC_VERTEX_PROJECT_ID, Claude on Vertex
# AI is used with no additional setup.
# --- Gemini authentication (alternative provider; choose one) ---
# Option A: API key
# GOOGLE_API_KEY="your_gemini_api_key_here"
# Option B: OAuth credentials (no env var needed)
# If GOOGLE_API_KEY is not set, the framework loads ~/.gemini/oauth_creds.json
# produced by: gemini auth login
# --- Application under test ---
APP_URL="https://your-app.example.com"
APP_USERNAME="your_user"
APP_PASSWORD="your_password"
APP_CONFIG="./config.yaml"
MCP_SERVERS_CONFIG="./mcp_servers.json"
AGENT_SKILLS_ROOT="./agent-skills"
AGENT_SKILL_SCRIPT_TIMEOUT="60"
Provider auto-detection order (when LLM_PROVIDER is not set):
| Priority | Credential Source | Provider |
|---|---|---|
| 1 | `ANTHROPIC_API_KEY` env var | Claude (direct API) |
| 2 | `~/.claude/settings.json` with `CLAUDE_CODE_USE_VERTEX=1` | Claude (Vertex AI) |
| 3 | `GOOGLE_API_KEY` env var | Gemini (API key) |
| 4 | `~/.gemini/oauth_creds.json` | Gemini (OAuth) |
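The detection order in the table can be sketched as a small function. This is a sketch under stated assumptions, not the code behind `make_llm()`: the name `detect_provider`, the returned tuples, and the assumption that `~/.claude/settings.json` keeps its variables under an `env` object (falling back to top-level keys) are all illustrative.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def _claude_vertex_enabled(settings_path: Path) -> bool:
    """True when the settings file enables Vertex AI and names a project."""
    try:
        settings = json.loads(settings_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return False  # missing, unreadable, or malformed file
    if not isinstance(settings, dict):
        return False
    env_block = settings.get("env")
    if not isinstance(env_block, dict):
        env_block = settings  # assumed fallback: keys stored at top level
    return (
        str(env_block.get("CLAUDE_CODE_USE_VERTEX")) == "1"
        and bool(env_block.get("ANTHROPIC_VERTEX_PROJECT_ID"))
    )


def detect_provider(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Optional[tuple[str, str]]:
    """Return (provider, auth_method), or None when no credentials are found."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    explicit = env.get("LLM_PROVIDER")
    if explicit:
        return explicit.strip().lower(), "explicit"
    if env.get("ANTHROPIC_API_KEY"):
        return "claude", "api_key"
    if _claude_vertex_enabled(home / ".claude" / "settings.json"):
        return "claude", "vertex"
    if env.get("GOOGLE_API_KEY"):
        return "gemini", "api_key"
    if (home / ".gemini" / "oauth_creds.json").is_file():
        return "gemini", "oauth"
    return None


if __name__ == "__main__":
    print(detect_provider())
```

Passing `env` and `home` explicitly keeps the function easy to exercise without touching real credentials.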
Smart model defaults (the framework picks the best model for your auth method):
| Auth Method | Default Model | Rationale |
|---|---|---|
| Claude API key | `claude-haiku-4-5` | Fast, economical |
| Claude Vertex AI | `claude-haiku-4-5` | Fast, economical |
| Gemini API key | `gemini-3.1-flash-lite` | Fast, economical |
| Gemini OAuth | `gemini-3.1-flash-lite` | Fast, economical |
Override models via env vars (CLAUDE_MODEL, GEMINI_MODEL) or in config.yaml (see below).
Copy config.yaml.example to config.yaml and customize for your application:
app:
name: "My Web Application"
url: ${APP_URL}
description: "Brief description used to give agent prompts domain context."
auth:
method: form
selectors:
username: 'input[name="username"]'
password: 'input[name="password"]'
submit: 'button[type="submit"]'
post_login_check: 'a[href="/home"]' # selector that confirms login worked
paths:
mcp_servers: ./mcp_servers.json
skills_root: ./agent-skills
# LLM provider (optional; auto-detected from credentials by default)
llm:
# provider: claude # or: gemini
# claude_model: claude-sonnet-4-6
# claude_vision_model: claude-haiku-4-5
# gemini_model: gemini-3.1-flash-lite
# gemini_vision_model: gemini-3.1-flash-lite
# Embedding model for semantic search in long-term memory (optional).
# When configured, recall_past_findings uses vector similarity instead of
# keyword matching. Gemini users can use their existing API key; Claude
# users can run a local model via Ollama.
# embedding_model: google-genai:models/embedding-001 # Gemini (768d)
# embedding_dims: 768
# embedding_model: ollama:nomic-embed-text # Ollama local (768d)
# embedding_dims: 768
Copy mcp_servers.json.example to mcp_servers.json and list any MCP servers you want
the agents to consult. Format follows the standard Claude Desktop / Code shape:
{
"mcpServers": {
"github": {
"transport": "http",
"url": "https://api.githubcopilot.com/mcp/"
},
"my-docs": {
"transport": "http",
"url": "https://my-docs.example.com/_mcp/"
}
}
}
The github entry is used by the PR analyzer (--pr-url) to fetch PR data via MCP
tools (get_pull_request, get_pull_request_diff, get_pull_request_files). If not
configured, the analyzer falls back to the gh CLI.
If the file is missing or empty, agents simply run without MCP tools.
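The "missing or empty file means no MCP tools" behavior could look roughly like the loader below. It is a sketch only: the function name `load_mcp_servers` is hypothetical, and the real loader in `custom_tools.py` may validate entries or handle malformed JSON differently.

```python
import json
from pathlib import Path


def load_mcp_servers(config_path: str = "./mcp_servers.json") -> dict:
    """Return the "mcpServers" mapping, or {} when the file is missing or empty.

    Hypothetical helper; malformed JSON is allowed to raise so that a typo in
    the config is not silently ignored (an assumption of this sketch).
    """
    path = Path(config_path)
    if not path.is_file():
        return {}
    text = path.read_text(encoding="utf-8").strip()
    if not text:
        return {}
    data = json.loads(text)
    servers = data.get("mcpServers", {}) if isinstance(data, dict) else {}
    return servers if isinstance(servers, dict) else {}


if __name__ == "__main__":
    servers = load_mcp_servers()
    if "github" in servers:
        print("PR analyzer can use the GitHub MCP server")
    else:
        print("No github entry; PR analyzer would fall back to the gh CLI")
```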
Install any Skills (per agentskills.io) under
the directory pointed at by AGENT_SKILLS_ROOT (default ./agent-skills/). The framework
discovers them automatically and exposes fetch_agent_skill and run_agent_skill_script
to agents. If the directory is missing the framework just logs an info message.
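Skill discovery can be pictured with the sketch below. It assumes each skill is a subdirectory containing a `SKILL.md` file, which is how the Agent Skills format is commonly laid out; the function name `discover_skills` and the logger name are hypothetical, and the framework's own discovery may read more metadata.

```python
import logging
from pathlib import Path

logger = logging.getLogger("skills_discovery_sketch")


def discover_skills(root: str = "./agent-skills") -> list[str]:
    """List skill directory names under root that contain a SKILL.md file.

    Hypothetical helper: a missing root is logged at info level and treated
    as "no skills installed", mirroring the behavior described above.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        logger.info("Skills root %s not found; continuing without skills", root_path)
        return []
    return sorted(
        child.name
        for child in root_path.iterdir()
        if child.is_dir() and (child / "SKILL.md").is_file()
    )


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(discover_skills())
```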
Generate a reusable auth.json cookie file so subsequent runs can skip the login screen:
agent-auth
The auth flow uses the selectors defined in config.yaml > auth. Adjust them to match
your app's login form.
The framework includes an optional Visual Mode β a Streamlit-based real-time dashboard that displays live browser screenshots, swarm state diagrams, thought streams, and action tapes while missions execute.
Visual Mode uses a one-way "spectator" architecture that never blocks the main process:
Main Process (LangGraph)             Streamlit Dashboard
┌───────────────────────┐            ┌─────────────────────┐
│ Supervisor → Agent    │ ──JSON──▶  │ Polls every ~1s:    │
│ Playwright engine     │ ──JPEG──▶  │ .agent_state.json   │
│ (fire-and-forget)     │            │ .latest_vision.jpg  │
└───────────────────────┘            └─────────────────────┘
The main process writes state atomically to .agent_state.json and async screenshots to .latest_vision.jpg. The dashboard polls these files. No IPC, no callbacks, no blocking.
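The atomic-write pattern mentioned here is a standard one: write to a temporary file in the same directory, then swap it into place with `os.replace`, so a polling reader never sees a half-written file. The sketch below illustrates the pattern; the function names and the `steps` / `bugs` keys are hypothetical and do not reflect the actual schema used by `state_emitter.py`.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def write_state_atomically(state: dict, target: str = ".agent_state.json") -> None:
    """Write JSON to a temp file in the target's directory, then swap it in."""
    target_path = Path(target)
    fd, tmp_name = tempfile.mkstemp(
        dir=str(target_path.parent), prefix=target_path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
        # os.replace is atomic when source and destination share a filesystem.
        os.replace(tmp_name, target_path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def read_state(target: str = ".agent_state.json") -> Optional[dict]:
    """Reader side: return the latest state, or None if nothing usable exists yet."""
    try:
        return json.loads(Path(target).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    write_state_atomically({"steps": 3, "bugs": 1})  # hypothetical keys
    print(read_state())
```

Because the swap is a single rename, the dashboard needs no locking: each poll reads either the previous complete state or the new one.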
Visual Mode requires Streamlit as an optional dependency:
# Install with visual mode support
pip install -e ".[visual]"
# Or with uv:
uv pip install -e ".[visual]"
Add the --visual flag to any mission:
# Standard mission with visual mode
agent-explorer --missions missions/new_user_agent.yaml --visual
# Visual mode with headed browser (recommended for debugging)
agent-explorer --missions missions/explorer_agent.yaml --headed --visual
# PR-driven testing with visual mode
agent-explorer --pr-url https://github.com/org/repo/pull/123 --execute --visual
# Regression testing with visual mode
agent-explorer --regression --headed --visual
The Streamlit dashboard will automatically open in your default browser at http://localhost:8501. If it doesn't open automatically, navigate to that URL manually.
The dashboard provides four key views:
- Sidebar: Mission ID, graph type (standard/advanced), LLM provider, and live metrics (steps, bugs, explored paths)
- Live Browser Vision: Real-time JPEG screenshots of the Playwright viewport, updated after each browser command
- Swarm State Diagram: Interactive Mermaid diagram showing the Supervisor-Worker topology with the currently active node highlighted in green
- Tabbed Activity Views:
  - Thought Stream: Latest LLM reasoning from the active agent
  - Action Tape: Recent browser commands with execution time and status
  - Bugs: Discovered bugs with detailed descriptions and bug count
  - Paths: URLs visited during the mission
Overhead is zero when --visual is not used: all emission code short-circuits on a single boolean check. When enabled:
- State writes: ~1ms per update (2KB JSON + atomic `os.replace`)
- Screenshots: Fire-and-forget async tasks (JPEG quality 50, ~30-80KB)
- Main process never waits for the dashboard
Missions live in missions/*.yaml. See missions/README.md for the
schema and writing guide. Eight templates ship in the repo, one for each supported agent:
- `missions/new_user_agent.yaml`
- `missions/power_user_agent.yaml`
- `missions/adversarial_user_agent.yaml`
- `missions/accessibility_user_agent.yaml`
- `missions/data_heavy_user_agent.yaml`
- `missions/impatient_user_agent.yaml`
- `missions/returning_user_agent.yaml`
- `missions/explorer_agent.yaml`

All of them contain placeholders (<YOUR_APP>, <APP_URL>, <example_search_term>, ...); fill
them in for your application before running.
# Standard 3-persona QA swarm (uses auto-detected provider; Claude by default)
agent-explorer --missions missions/new_user_agent.yaml
# Explicitly choose a provider
agent-explorer --missions missions/power_user_agent.yaml --provider claude
agent-explorer --missions missions/power_user_agent.yaml --provider gemini
# Advanced persona mission
agent-explorer --missions missions/accessibility_user_agent.yaml --headed
# Autonomous exploration (visible browser recommended)
agent-explorer --missions missions/explorer_agent.yaml --headed
# Clear all memory (checkpoints + learned knowledge) to restart fresh
agent-explorer --missions missions/new_user_agent.yaml --clear-all
# Clear only checkpoints (preserves learned memory: pages, bugs, procedures)
agent-explorer --missions missions/new_user_agent.yaml --clear-checkpoints
# Clear only learned memory (preserves checkpoints for resume)
agent-explorer --missions missions/new_user_agent.yaml --clear-learned
# Override the supervisor step limit (default: 30)
agent-explorer --missions missions/new_user_agent.yaml --max-steps 50
# Suppress verbose ReAct console output (traces.log still captures everything)
agent-explorer --missions missions/new_user_agent.yaml --quiet
# Auto-generate and run missions targeting known bugs (no --missions needed)
agent-explorer --regression --headed
# Combine regression with manual missions
agent-explorer --missions missions/new_user_agent.yaml --regression
# Export discovered app structure as JSON
agent-explorer --export-model
Generate targeted test scenarios from a GitHub Pull Request.
The analyzer prefers the GitHub MCP server when a "github" entry exists in
mcp_servers.json (see setup above). If the MCP server is not configured or unreachable,
it falls back to the gh CLI (must be installed and
authenticated via gh auth login).
# Generate missions only (writes missions/pr_123.yaml)
agent-explorer --pr-url https://github.com/org/repo/pull/123
# Generate and execute immediately
agent-explorer --pr-url https://github.com/org/repo/pull/123 --execute --headed
# Write generated missions to a custom directory
agent-explorer --pr-url https://github.com/org/repo/pull/123 --output-dir ./pr-missions
# Combine with existing missions
agent-explorer --missions missions/new_user_agent.yaml --pr-url https://github.com/org/repo/pull/123 --execute
The analyzer extracts the PR title, description, file list, and full code diff, then sends
them along with the app context from config.yaml to an LLM. The LLM maps the changes to
the relevant standard and advanced personas and generates 3-8 targeted missions with
specific, actionable prompts. Generated mission files follow the same YAML format as
hand-written ones and can be re-run later with --missions.
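As an illustration of the first step, the sketch below parses a PR URL and derives the mission file name shown in the examples above (`missions/pr_123.yaml`). The names `PullRequestRef`, `parse_pr_url`, and `default_mission_path` are hypothetical, and the accepted URL shapes are an assumption; `pr_analyzer.py` may be stricter or more lenient.

```python
import re
from pathlib import Path
from typing import NamedTuple


class PullRequestRef(NamedTuple):
    owner: str
    repo: str
    number: int


_PR_URL = re.compile(
    r"^https?://github\.com/(?P<owner>[^/\s]+)/(?P<repo>[^/\s]+)"
    r"/pull/(?P<number>\d+)(?:[/?#].*)?$"
)


def parse_pr_url(url: str) -> PullRequestRef:
    """Split a GitHub pull request URL into owner, repo, and PR number."""
    match = _PR_URL.match(url.strip())
    if match is None:
        raise ValueError(f"Not a GitHub pull request URL: {url!r}")
    return PullRequestRef(match["owner"], match["repo"], int(match["number"]))


def default_mission_path(ref: PullRequestRef, output_dir: str = "missions") -> Path:
    """Build the output path for generated missions, e.g. missions/pr_123.yaml."""
    return Path(output_dir) / f"pr_{ref.number}.yaml"


if __name__ == "__main__":
    ref = parse_pr_url("https://github.com/org/repo/pull/123")
    print(ref, default_mission_path(ref))
```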
For every mission, the framework generates a report_<thread_id>/ directory containing:
- `traces.log`: Full audit trail of every thought, plan, and tool invocation.
- `test_report.md`: Concise executive summary generated by the LLM (objective, actions, bugs, Action Tape stats, PASS/FAIL).
- `action_tape.jsonl`: Line-delimited JSON log of every deterministic browser command. The source for reproduction scripts.
- `reproduction_*.spec.ts`: Auto-generated Playwright TypeScript tests, one per bug detected. Run with: `npx playwright test report_<thread_id>/reproduction_*.spec.ts --headed`
- `screenshots/`: Image evidence captured on every detected bug.
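Because `action_tape.jsonl` is line-delimited JSON, it can be inspected with a few lines of Python. The sketch below is generic: it assumes each line is a JSON object, and the `action` key used for the summary is a hypothetical field name, since the tape's schema is not documented here.

```python
import json
from collections import Counter
from pathlib import Path


def read_action_tape(path: str) -> list:
    """Parse a line-delimited JSON file, skipping blank lines."""
    entries = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                entries.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}: invalid JSON on line {line_number}") from exc
    return entries


def summarize(entries: list, key: str = "action") -> Counter:
    """Count entries by a field; "action" is a hypothetical key name."""
    return Counter(
        str(entry.get(key, "<missing>"))
        for entry in entries
        if isinstance(entry, dict)
    )


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        tape = read_action_tape(sys.argv[1])
        print(len(tape), "commands:", dict(summarize(tape)))
```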
If you are an AI coding assistant contributing to this repository, see AGENTS.md
for the conventions covering agent registration, selector policy, and tool behavior.
This project is licensed under the MIT License. See LICENSE for details.
