srbarrios/agentic-test-explorer

A product-agnostic, AI-driven exploratory test framework that intelligently explores, tests, and validates any web application. Configure it for your stack via a small config.yaml, point it at your app, and let specialized agents drive a real browser to find bugs, render anomalies, and unscripted edge cases.

Powered by a LangGraph Swarm architecture, Playwright, and your choice of Claude (default) or Google Gemini, this framework dynamically routes tasks to behavioral QA personas and advanced stress/exploration agents, self-heals from UI errors, optionally consults user-provided MCP servers and Agent Skills for domain knowledge, generates reproducible Playwright test scripts from every bug found, and writes Markdown executive test reports.

It can also analyze GitHub Pull Requests: pass a PR URL and the framework extracts the code diff, feeds it to an LLM, and auto-generates targeted test missions covering the UI areas most likely impacted by the changes.


🎬 Demo

demo.mov

🏗️ Architecture

The framework is built on a Supervisor-Worker Swarm pattern. Based on the mission type (determined by the thread_id keyword), the system spins up either a Standard or Advanced routing graph.

graph TD
    classDef user fill:#6366f1,stroke:#4f46e5,stroke-width:2px,color:#fff;
    classDef core fill:#3b82f6,stroke:#2563eb,stroke-width:2px,color:#fff;
    classDef supervisor fill:#f59e0b,stroke:#d97706,stroke-width:2px,color:#fff;
    classDef agent fill:#10b981,stroke:#059669,stroke-width:2px,color:#fff;
    classDef db fill:#8b5cf6,stroke:#7c3aed,stroke-width:2px,color:#fff;
    classDef tool fill:#ec4899,stroke:#db2777,stroke-width:2px,color:#fff;
    classDef external fill:#475569,stroke:#334155,stroke-width:2px,color:#fff;

    User([User / CI]):::user -->|YAML Missions| Main(main.py):::core
    User -->|GitHub PR URL| PR(pr_analyzer.py):::core
    PR -->|MCP or gh CLI| GH[GitHub API]:::external
    PR -->|Generated Missions| Main

    Main -->|Standard Missions| S_Supervisor{QA Supervisor}:::supervisor
    Main -->|Advanced Missions| A_Supervisor{Adv. Supervisor}:::supervisor
    Main -->|Checkpoints + Store| DB[(SQLite Memory)]:::db

    subgraph SQA [Standard QA Swarm]
        S_Supervisor <-->|Routes & Returns| S_New([New User Agent]):::agent
        S_Supervisor <--> S_Power([Power User Agent]):::agent
        S_Supervisor <--> S_Adv([Adversarial User Agent]):::agent
    end

    subgraph ATS [Advanced Testing Swarm]
        A_Supervisor <-->|Routes & Returns| A_Acc([Accessibility User Agent]):::agent
        A_Supervisor <--> A_Data([Data Heavy User Agent]):::agent
        A_Supervisor <--> A_Imp([Impatient User Agent]):::agent
        A_Supervisor <--> A_Ret([Returning User Agent]):::agent
        A_Supervisor <--> A_Explorer([Explorer Agent]):::agent
    end

    SQA --> Tools[[Tools & APIs]]:::tool
    ATS --> Tools

    subgraph Integrations [External Integrations]
        Tools -->|JSON Intents / Action Tape| Engine[Browser Engine]:::external
        Engine -->|Playwright| PW[Chromium]:::external
        Tools -->|Optional Docs/Knowledge| MCP[User-configured MCP Servers]:::external
        Tools -->|Optional Skills| Skills[User-installed Agent Skills]:::external
        Tools -->|UI under test| WebApp[Your Web Application]:::external
    end

    style SQA fill:#f0fdf4,stroke:#22c55e,stroke-width:2px,stroke-dasharray: 5 5,color:#166534
    style ATS fill:#fffbeb,stroke:#f59e0b,stroke-width:2px,stroke-dasharray: 5 5,color:#b45309
    style Integrations fill:#f8fafc,stroke:#64748b,stroke-width:2px,stroke-dasharray: 5 5,color:#0f172a

Architecture Details

  1. Mission Dispatcher (main.py): Loads missions/*.yaml files and provisions the correct graph based on thread_id naming conventions: a thread_id containing accessibility, data_heavy, impatient, returning, explorer, chaos, or autonomous routes to the advanced graph; everything else goes to the standard 3-persona swarm. It can also accept a --pr-url to auto-generate missions from a GitHub Pull Request via pr_analyzer.py.

  2. Supervisor-Worker Flow: A Supervisor node dynamically evaluates the workspace state and dispatches control to specialized worker nodes.

  3. Record-and-Translate Browser Engine (src/agentic_explorer/tools/browser/engine.py): Agents are the brain: they never touch the browser directly. Instead they emit strict JSON intents to execute_browser_command. The engine:

    • Validates selectors against a resilience policy (rejects XPath / positional CSS at runtime).
    • Executes the command with Playwright and captures an Accessibility Tree / DOM snapshot.
    • Appends every command to an immutable Action Tape (report_<thread_id>/action_tape.jsonl).
    • On bug detection, generate_reproduction_spec translates the tape into a runnable reproduction_*.spec.ts Playwright test.
  4. Tool Modality: Agents receive (1) the deterministic browser engine, (2) screenshot capture and reproduction-generation tools, (3) any MCP servers you configure in mcp_servers.json, and (4) any Agent Skills installed under AGENT_SKILLS_ROOT. The framework ships zero hardcoded MCP servers or skills; bring your own.

  5. State & Memory (agent_memory.sqlite): An asynchronous SQLite checkpointer remembers agent states (including the action_tape field), allowing a reused thread_id to resume precisely where it left off. A companion LangGraph Store (optionally configured with an embedding index for semantic search) provides four levels of cross-session memory, with LLM-driven operations powered by Langmem:

    • Semantic: page knowledge, selector reliability, application quirks, plus Langmem-managed agent observations (via record_observation tool)
    • Episodic: session summaries, deduplicated bug catalog
    • Procedural: self-improving agent prompts and routing rules optimized via Langmem's create_prompt_optimizer
    • Prioritization: risk-scored page ranking injected into supervisor routing

    Agents can query past findings at runtime via the recall_past_findings tool, which uses semantic search when an embedding index is configured and falls back to keyword matching otherwise. Agents can also proactively record observations via the record_observation tool (powered by Langmem's create_manage_memory_tool). The supervisor receives a MEMORY_CONTEXT section with known pages, bugs, quirks, agent observations, and high-risk areas on every routing cycle.
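The routing rule described for the Mission Dispatcher can be sketched as below. This is a minimal illustration, not the code in main.py: the function name `select_graph` and the case-insensitive substring match are assumptions, and only the keyword list comes from the description above.

```python
# Keywords that route a mission to the advanced graph (from the list above).
ADVANCED_KEYWORDS = (
    "accessibility",
    "data_heavy",
    "impatient",
    "returning",
    "explorer",
    "chaos",
    "autonomous",
)


def select_graph(thread_id: str) -> str:
    """Return "advanced" or "standard" for a mission thread_id.

    Hypothetical helper: assumes a case-insensitive substring match.
    """
    lowered = thread_id.lower()
    if any(keyword in lowered for keyword in ADVANCED_KEYWORDS):
        return "advanced"
    return "standard"


for example in ("accessibility_checkout", "new_user_signup"):
    print(example, "->", select_graph(example))
```

Under these assumptions, a thread_id that matches none of the keywords falls through to the standard 3-persona swarm.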

Source Layout

  • src/agentic_explorer/main.py: CLI entry, swarm graph compiler, transient-error retry
  • src/agentic_explorer/pr_analyzer.py: PR-driven test scenario generation (GitHub MCP server preferred, gh CLI fallback)
  • src/agentic_explorer/auth_setup.py: generic login flow that saves auth.json
  • src/agentic_explorer/config.py: config.yaml loader (with ${ENV} interpolation)
  • src/agentic_explorer/utils/llm.py: make_llm() multi-provider factory; supports Claude (API key / Vertex AI) and Gemini (API key / OAuth) with auto-detection
  • src/agentic_explorer/utils/llm_json.py: YAML/JSON extraction helpers for LLM responses
  • src/agentic_explorer/orchestration/graph_base.py: shared graph infrastructure (AgentState, node factories, tool filtering)
  • src/agentic_explorer/orchestration/standard_graph.py: 3 standard QA personas
  • src/agentic_explorer/orchestration/advanced_graph.py: 4 advanced personas plus autonomous explorer
  • src/agentic_explorer/memory.py: cross-session memory: semantic (pages, selectors, quirks, Langmem-managed agent observations), episodic (session summaries, bug catalog), procedural (self-improving prompts via Langmem prompt optimizer), semantic-search recall tool, proactive observation tool, regression mission generation, app model export, test prioritization
  • src/agentic_explorer/tools/browser/engine.py: Record-and-Translate browser engine
  • src/agentic_explorer/tools/common/custom_tools.py: screenshot, MCP loader, Skills tools
  • src/agentic_explorer/ui/state_emitter.py: non-blocking state bridge for Visual Mode
  • src/agentic_explorer/ui/swarm_diagram.py: Mermaid diagram generator
  • src/agentic_explorer/ui/dashboard.py: Streamlit dashboard app

✨ Key Features

  • Product-Agnostic: One small config.yaml adapts the framework to any web app.
  • Persona-Driven QA Agents: Three standard QA personas plus five advanced agents, each prompted around a specific testing strategy.
  • Record-and-Translate Engine: Agents emit JSON intents; the deterministic engine executes them and records every step to an immutable Action Tape. Every bug automatically generates a reproducible reproduction_*.spec.ts Playwright script.
  • Resilient Selector Policy (Engine-Enforced): execute_browser_command rejects brittle XPath / positional selectors at runtime, enforcing data-test-subj → aria-label → visible text priority.
  • Self-Healing Browser Execution: Playwright actions are wrapped to catch uncaught exceptions. Errors are returned as natural language so agents can adapt strategies.
  • Screenshot Evidence: Agents capture full-page screenshots when bugs or anomalies are detected, then generate reproducible Playwright specs from the Action Tape.
  • Bring-Your-Own MCP: Plug in any MCP servers via a standard mcp_servers.json; agents query them for domain knowledge instead of guessing.
  • Bring-Your-Own Skills: Install Agent Skills (per the agentskills.io spec) under AGENT_SKILLS_ROOT and the framework exposes them automatically.
  • Cross-Session Learning: A four-level memory system (semantic, episodic, procedural, prioritization) powered by Langmem lets agents learn across sessions. The framework remembers page structures, selector reliability, application quirks, agent observations, past bugs, and which testing strategies worked. Agent prompts and supervisor routing rules self-improve via Langmem's prompt optimizer after each batch. Agents can proactively record observations and recall past findings using semantic search.
  • Regression Testing: Run --regression to auto-generate missions from the bug catalog, with no YAML needed. The framework targets pages with known open bugs and historically flaky areas.
  • Application Model Export: Run --export-model to export the discovered application structure (pages, selectors with reliability scores, bugs, quirks, session stats) as app_model.json.
  • PR-Driven Test Generation: Pass a GitHub PR URL (--pr-url) and the framework extracts the diff (preferring the GitHub MCP server, falling back to gh CLI), sends it to an LLM, and auto-generates targeted mission YAML covering the UI areas impacted by the code changes. When historical bug data exists, it's injected into the LLM prompt for better-targeted missions. Optionally execute the generated missions immediately with --execute.
  • Automated Artifact Generation: Every test produces an isolated folder containing raw execution traces, the Action Tape, bug screenshots, reproducible .spec.ts files, and an executive Markdown report.
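To make the Record-and-Translate and selector-policy features more concrete, here is a minimal sketch of how an engine could reject brittle selectors and append accepted commands to a JSONL tape. Everything in it is an assumption for illustration: the regular expressions, the `check_selector` and `record_command` helpers, and the `action` / `selector` / `ts` fields are hypothetical and are not taken from engine.py.

```python
import json
import re
import time
from pathlib import Path

# Hypothetical patterns for "brittle" selectors; the real policy may differ.
BRITTLE_PATTERNS = (
    re.compile(r"^\s*(xpath=|/)"),   # XPath expressions such as //div[2]
    re.compile(r":nth-[a-z-]+\("),   # positional CSS such as :nth-child(3)
)


def check_selector(selector: str) -> None:
    """Raise ValueError if the selector matches a brittle pattern."""
    for pattern in BRITTLE_PATTERNS:
        if pattern.search(selector):
            raise ValueError(f"Brittle selector rejected: {selector!r}")


def record_command(tape_path: Path, intent: dict) -> dict:
    """Validate an intent, then append it to the tape as one JSON line.

    The tape is only ever opened in append mode, so earlier entries are
    never rewritten.
    """
    check_selector(intent["selector"])
    entry = {"ts": time.time(), **intent}
    with tape_path.open("a", encoding="utf-8") as tape:
        tape.write(json.dumps(entry) + "\n")
    return entry


for candidate in ('[data-test-subj="save"]', "//button[1]", "li:nth-child(3)"):
    try:
        check_selector(candidate)
        verdict = "accepted"
    except ValueError:
        verdict = "rejected"
    print(f"{verdict}: {candidate}")
```

Returning the rejection as an exception message mirrors the self-healing idea above: the caller can pass the text back to the agent so it can pick a more resilient selector.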

🛠️ Setup

1. Dependencies

Python 3.11+ is required. A virtual environment is highly recommended.

# Create and activate a virtual environment (plain venv or uv)
python -m venv .venv
source .venv/bin/activate

# Install the package and all dependencies (editable mode)
pip install -e .

# Or, if you use uv (recommended, much faster):
uv venv
uv pip install -e .

# Optional: Install Visual Mode (Streamlit dashboard)
pip install -e ".[visual]"
# Or with uv:
uv pip install -e ".[visual]"

# Install the Playwright Chromium browser
playwright install chromium

Keeping dependencies up to date: After pulling new changes, always re-sync your virtual environment to pick up any added or updated packages:

# pip
pip install -e .

# uv
uv pip install -e .

2. Environment Variables

Copy .env.example to .env and fill in your values. The framework supports two LLM providers, Claude (default) and Gemini, and auto-detects which to use from available credentials.

# --- LLM Provider (optional; auto-detected from credentials if not set) ---
# LLM_PROVIDER="claude"         # or: gemini

# --- Claude authentication (default provider; choose one) ---

# Option A: Direct API key
ANTHROPIC_API_KEY="your_anthropic_api_key_here"

# Option B: Vertex AI (zero config if you already use Claude Code)
# The framework reads ~/.claude/settings.json automatically. If it contains
# CLAUDE_CODE_USE_VERTEX=1 and ANTHROPIC_VERTEX_PROJECT_ID, Claude on Vertex
# AI is used with no additional setup.

# --- Gemini authentication (alternative provider; choose one) ---

# Option A: API key
# GOOGLE_API_KEY="your_gemini_api_key_here"

# Option B: OAuth credentials (no env var needed)
# If GOOGLE_API_KEY is not set, the framework loads ~/.gemini/oauth_creds.json
# produced by: gemini auth login

# --- Application under test ---
APP_URL="https://your-app.example.com"
APP_USERNAME="your_user"
APP_PASSWORD="your_password"

APP_CONFIG="./config.yaml"
MCP_SERVERS_CONFIG="./mcp_servers.json"

AGENT_SKILLS_ROOT="./agent-skills"
AGENT_SKILL_SCRIPT_TIMEOUT="60"

Provider auto-detection order (when LLM_PROVIDER is not set):

| Priority | Credential source | Provider |
| --- | --- | --- |
| 1 | ANTHROPIC_API_KEY env var | Claude (direct API) |
| 2 | ~/.claude/settings.json with CLAUDE_CODE_USE_VERTEX=1 | Claude (Vertex AI) |
| 3 | GOOGLE_API_KEY env var | Gemini (API key) |
| 4 | ~/.gemini/oauth_creds.json | Gemini (OAuth) |

Smart model defaults. The framework picks a default model based on your auth method:

| Auth method | Default model | Rationale |
| --- | --- | --- |
| Claude API key | claude-haiku-4-5 | Fast, economical |
| Claude Vertex AI | claude-haiku-4-5 | Fast, economical |
| Gemini API key | gemini-3.1-flash-lite | Fast, economical |
| Gemini OAuth | gemini-3.1-flash-lite | Fast, economical |

Override models via env vars (CLAUDE_MODEL, GEMINI_MODEL) or in config.yaml (see below).

3. App Configuration

Copy config.yaml.example to config.yaml and customize for your application:

app:
  name: "My Web Application"
  url: ${APP_URL}
  description: "Brief description used to give agent prompts domain context."

auth:
  method: form
  selectors:
    username: 'input[name="username"]'
    password: 'input[name="password"]'
    submit:   'button[type="submit"]'
  post_login_check: 'a[href="/home"]'   # selector that confirms login worked

paths:
  mcp_servers: ./mcp_servers.json
  skills_root: ./agent-skills

# LLM provider (optional; auto-detected from credentials by default)
llm:
  # provider: claude              # or: gemini
  # claude_model: claude-sonnet-4-6
  # claude_vision_model: claude-haiku-4-5
  # gemini_model: gemini-3.1-flash-lite
  # gemini_vision_model: gemini-3.1-flash-lite

  # Embedding model for semantic search in long-term memory (optional).
  # When configured, recall_past_findings uses vector similarity instead of
  # keyword matching.  Gemini users can use their existing API key; Claude
  # users can run a local model via Ollama.
  # embedding_model: google-genai:models/embedding-001   # Gemini (768d)
  # embedding_dims: 768
  # embedding_model: ollama:nomic-embed-text             # Ollama local (768d)
  # embedding_dims: 768
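The `${APP_URL}` reference in the config above relies on environment interpolation. As a rough sketch of how such interpolation can work on an already-parsed config, consider the helper below. The `interpolate` function is hypothetical, and leaving unresolved references untouched is an assumption; the loader in config.py may handle missing variables differently.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate(value, env=os.environ):
    """Recursively replace ${NAME} references in strings, dicts, and lists.

    Assumption: references to unset variables are left as-is.
    """
    if isinstance(value, str):
        return _ENV_REF.sub(lambda m: env.get(m.group(1), m.group(0)), value)
    if isinstance(value, dict):
        return {key: interpolate(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [interpolate(item, env) for item in value]
    return value


parsed = {"app": {"name": "My Web Application", "url": "${APP_URL}"}}
print(interpolate(parsed, {"APP_URL": "https://your-app.example.com"}))
```

The sketch works on plain Python data, so it applies after the YAML has been parsed by whichever YAML library the project uses.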

4. (Optional) MCP Servers

Copy mcp_servers.json.example to mcp_servers.json and list any MCP servers you want the agents to consult. Format follows the standard Claude Desktop / Code shape:

{
  "mcpServers": {
    "github": {
      "transport": "http",
      "url": "https://api.githubcopilot.com/mcp/"
    },
    "my-docs": {
      "transport": "http",
      "url": "https://my-docs.example.com/_mcp/"
    }
  }
}

The github entry is used by the PR analyzer (--pr-url) to fetch PR data via MCP tools (get_pull_request, get_pull_request_diff, get_pull_request_files). If not configured, the analyzer falls back to the gh CLI.

If the file is missing or empty, agents simply run without MCP tools.
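The "missing or empty" behavior can be illustrated with a small loader. This is a sketch under assumptions, not the loader in custom_tools.py: `load_mcp_servers` is a hypothetical name, and only the `mcpServers` key comes from the example above.

```python
import json
from pathlib import Path


def load_mcp_servers(path: str) -> dict:
    """Return the configured servers, or {} when the file is missing or empty."""
    config_file = Path(path)
    if not config_file.is_file():
        return {}
    text = config_file.read_text(encoding="utf-8").strip()
    if not text:
        return {}
    data = json.loads(text)
    return data.get("mcpServers") or {}


servers = load_mcp_servers("./mcp_servers.json")
print(f"{len(servers)} MCP server(s) configured")
```

An empty result simply means agents run with the built-in browser and screenshot tools only.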

5. (Optional) Agent Skills

Install any Skills (per agentskills.io) under the directory pointed at by AGENT_SKILLS_ROOT (default ./agent-skills/). The framework discovers them automatically and exposes fetch_agent_skill and run_agent_skill_script to agents. If the directory is missing, the framework simply logs an info message.

6. Authenticate

Generate a reusable auth.json cookie file so subsequent runs can skip the login screen:

agent-auth

The auth flow uses the selectors defined in config.yaml > auth. Adjust them to match your app's login form.


📊 Visual Mode (Real-Time Dashboard)

The framework includes an optional Visual Mode: a Streamlit-based real-time dashboard that displays live browser screenshots, swarm state diagrams, thought streams, and action tapes while missions execute.

Architecture: The Spectator Pattern

Visual Mode uses a one-way "spectator" architecture that never blocks the main process:

Main Process (LangGraph)         Streamlit Dashboard
┌─────────────────────┐          ┌─────────────────────┐
│ Supervisor → Agent  │──JSON──▶ │ Polls every ~1s:    │
│ Playwright engine   │──JPEG──▶ │  .agent_state.json  │
│ (fire-and-forget)   │          │  .latest_vision.jpg │
└─────────────────────┘          └─────────────────────┘

The main process writes state atomically to .agent_state.json and async screenshots to .latest_vision.jpg. The dashboard polls these files. No IPC, no callbacks, no blocking.

Installation

Visual Mode requires Streamlit as an optional dependency:

# Install with visual mode support
pip install -e ".[visual]"

# Or with uv:
uv pip install -e ".[visual]"

Usage

Add the --visual flag to any mission:

# Standard mission with visual mode
agent-explorer --missions missions/new_user_agent.yaml --visual

# Visual mode with headed browser (recommended for debugging)
agent-explorer --missions missions/explorer_agent.yaml --headed --visual

# PR-driven testing with visual mode
agent-explorer --pr-url https://github.com/org/repo/pull/123 --execute --visual

# Regression testing with visual mode
agent-explorer --regression --headed --visual

The Streamlit dashboard will automatically open in your default browser at http://localhost:8501. If it doesn't open automatically, navigate to that URL manually.

Dashboard Features

The dashboard provides four key views:

  • Sidebar: Mission ID, graph type (standard/advanced), LLM provider, and live metrics (steps, bugs, explored paths)
  • Live Browser Vision: Real-time JPEG screenshots of the Playwright viewport, updated after each browser command
  • Swarm State Diagram: Interactive Mermaid diagram showing the Supervisor-Worker topology with the currently active node highlighted in green
  • Tabbed Activity Views:
    • Thought Stream: Latest LLM reasoning from the active agent
    • Action Tape: Recent browser commands with execution time and status
    • Bugs: Discovered bugs with detailed descriptions and bug count
    • Paths: URLs visited during the mission

Performance Impact

Zero when --visual is not used: all emission code short-circuits on a single boolean check. When enabled:

  • State writes: ~1ms per update (2KB JSON + atomic os.replace)
  • Screenshots: Fire-and-forget async tasks (JPEG quality 50, ~30-80KB)
  • Main process never waits for the dashboard

🚀 Usage

Defining Missions

Missions live in missions/*.yaml. See missions/README.md for the schema and writing guide. Eight templates ship in the repo, one for each supported agent.

All of them contain placeholders (<YOUR_APP>, <APP_URL>, <example_search_term>, …); fill them in for your application before running.

Running Missions from YAML

# Standard 3-persona QA swarm (uses auto-detected provider; Claude by default)
agent-explorer --missions missions/new_user_agent.yaml

# Explicitly choose a provider
agent-explorer --missions missions/power_user_agent.yaml --provider claude
agent-explorer --missions missions/power_user_agent.yaml --provider gemini

# Advanced persona mission
agent-explorer --missions missions/accessibility_user_agent.yaml --headed

# Autonomous exploration (visible browser recommended)
agent-explorer --missions missions/explorer_agent.yaml --headed

# Clear all memory (checkpoints + learned knowledge) to restart fresh
agent-explorer --missions missions/new_user_agent.yaml --clear-all

# Clear only checkpoints (preserves learned memory: pages, bugs, procedures)
agent-explorer --missions missions/new_user_agent.yaml --clear-checkpoints

# Clear only learned memory (preserves checkpoints for resume)
agent-explorer --missions missions/new_user_agent.yaml --clear-learned

# Override the supervisor step limit (default: 30)
agent-explorer --missions missions/new_user_agent.yaml --max-steps 50

# Suppress verbose ReAct console output (traces.log still captures everything)
agent-explorer --missions missions/new_user_agent.yaml --quiet

Regression Testing & Model Export

# Auto-generate and run missions targeting known bugs (no --missions needed)
agent-explorer --regression --headed

# Combine regression with manual missions
agent-explorer --missions missions/new_user_agent.yaml --regression

# Export discovered app structure as JSON
agent-explorer --export-model

PR-Driven Test Generation

Generate targeted test scenarios from a GitHub Pull Request.

The analyzer prefers the GitHub MCP server when a "github" entry exists in mcp_servers.json (see setup above). If the MCP server is not configured or unreachable, it falls back to the gh CLI (must be installed and authenticated via gh auth login).

# Generate missions only (writes missions/pr_123.yaml)
agent-explorer --pr-url https://github.com/org/repo/pull/123

# Generate and execute immediately
agent-explorer --pr-url https://github.com/org/repo/pull/123 --execute --headed

# Write generated missions to a custom directory
agent-explorer --pr-url https://github.com/org/repo/pull/123 --output-dir ./pr-missions

# Combine with existing missions
agent-explorer --missions missions/new_user_agent.yaml --pr-url https://github.com/org/repo/pull/123 --execute

The analyzer extracts the PR title, description, file list, and full code diff, then sends them along with the app context from config.yaml to an LLM. The LLM maps the changes to the relevant standard and advanced personas and generates 3-8 targeted missions with specific, actionable prompts. Generated mission files follow the same YAML format as hand-written ones and can be re-run later with --missions.
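Before any PR data can be fetched, the --pr-url value has to be split into owner, repository, and PR number. The sketch below shows one way to do that and to derive the missions/pr_<number>.yaml name used in the examples above. `PullRequestRef`, `parse_pr_url`, and `default_mission_path` are hypothetical names, not functions from pr_analyzer.py.

```python
import re
from typing import NamedTuple


class PullRequestRef(NamedTuple):
    owner: str
    repo: str
    number: int


# Accepts https://github.com/<owner>/<repo>/pull/<number>, optionally
# followed by a sub-path, query string, or fragment.
_PR_URL = re.compile(
    r"^https://github\.com/([^/]+)/([^/]+)/pull/(\d+)(?:[/?#].*)?$"
)


def parse_pr_url(url: str) -> PullRequestRef:
    match = _PR_URL.match(url.strip())
    if match is None:
        raise ValueError(f"Not a GitHub pull request URL: {url!r}")
    owner, repo, number = match.groups()
    return PullRequestRef(owner, repo, int(number))


def default_mission_path(ref: PullRequestRef, output_dir: str = "missions") -> str:
    """Mirror the missions/pr_<number>.yaml naming shown in the examples."""
    return f"{output_dir}/pr_{ref.number}.yaml"


ref = parse_pr_url("https://github.com/org/repo/pull/123")
print(ref, default_mission_path(ref))
```

Validating the URL up front gives a clear error before either the GitHub MCP server or the gh CLI is contacted.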


📊 Test Artifacts

For every mission, the framework generates a report_<thread_id>/ directory containing:

  1. traces.log: Full audit trail of every thought, plan, and tool invocation.
  2. test_report.md: Concise executive summary generated by the LLM (objective, actions, bugs, Action Tape stats, PASS/FAIL).
  3. action_tape.jsonl: Line-delimited JSON log of every deterministic browser command. The source for reproduction scripts.
  4. reproduction_*.spec.ts: Auto-generated Playwright TypeScript tests, one per bug detected. Run with:
    npx playwright test report_<thread_id>/reproduction_*.spec.ts --headed
  5. screenshots/: Image evidence captured on every detected bug.
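Because action_tape.jsonl is line-delimited JSON, it is easy to inspect with a few lines of Python. The sketch below counts tape entries by a field. The field name `action` and the sample entries are assumptions made for the example; check a real tape for the actual schema before relying on them.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def summarize_tape(tape_path: Path, key: str = "action") -> Counter:
    """Count tape entries by the given field (field name is an assumption)."""
    counts: Counter = Counter()
    with tape_path.open(encoding="utf-8") as tape:
        for line in tape:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            counts[entry.get(key, "<unknown>")] += 1
    return counts


# Demo on a throwaway tape with made-up entries.
with tempfile.TemporaryDirectory() as workdir:
    demo_tape = Path(workdir) / "action_tape.jsonl"
    demo_tape.write_text(
        '{"action": "goto"}\n{"action": "click"}\n{"action": "click"}\n',
        encoding="utf-8",
    )
    print(summarize_tape(demo_tape))
```

To inspect a real run, point `summarize_tape` at the action_tape.jsonl inside the mission's report directory.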

🤖 Guide for Autonomous Agents

If you are an AI coding assistant contributing to this repository, see AGENTS.md for the conventions covering agent registration, selector policy, and tool behavior.


📄 License

This project is licensed under the MIT License. See LICENSE for details.
