
bhavsec/autopentest-ai



AutoPentest

An agentic pentesting MCP server that automates web application penetration testing using the full OWASP Web Security Testing Guide and PortSwigger Web Security Academy technique references.

Point it at a target — it crawls your app, maps every endpoint, then spawns role-specialized agents (Scout, Analyzer, Exploiter, Reporter) to test for XSS, SQLi, SSRF, SSTI, IDOR and more. No false positives — every finding is backed by real, reproducible evidence with quality gates enforcing proof at every phase. Includes 31 PortSwigger technique guides, adaptive WAF evasion for 12 vendors, cross-phase vulnerability chaining, and risk-weighted endpoint prioritization. Run it with Claude Code, the API, or go fully offline using Ollama models.

Think of it as: A senior pentester's methodology encoded into an MCP server — 109 OWASP tests, 31 PortSwigger attack technique guides, 68+ MCP tools, 27 security tools, 4 specialized agent roles, 7 structured phases, automated quality assurance, and a zero-context final review.


AutoPentest CLI Output

Why AutoPentest?

Manual penetration testing is thorough but slow. Automated scanners are fast but shallow. AutoPentest bridges the gap:

Capability Manual Pentest Automated Scanner AutoPentest
Full OWASP WSTG coverage Depends on tester Partial 109 tests
Business logic testing Yes No Yes
Multi-step exploitation Yes Limited Yes
Vulnerability chaining Yes No Yes
Evidence-based findings Yes Template output Reproducible curl commands
Consistent quality Varies Yes Phase gates + Final Judge
Speed Days Minutes Hours
Cross-domain auth (SSO/OIDC) Manual setup Usually fails Automated handling

Architecture

┌─────────────────────────────────────────────────────────────┐
│                  LLM Orchestrator (Claude)                  │
│                                                             │
│  Reads CLAUDE.md workflow, manages phases,                  │
│  spawns role-specialized subagents                          │
└──────────┬──────────┬──────────┬──────────┬─────────────────┘
           │          │          │          │
     ┌─────▼────┐ ┌───▼─────┐ ┌──▼───────┐ ┌▼─────────┐
     │  Scout   │ │Analyzer │ │Exploiter │ │ Reporter │
     │  (recon) │ │ (vuln   │ │ (proof)  │ │ (QA /    │
     │          │ │  disc.) │ │          │ │  judge)  │
     └──────────┘ └─────────┘ └──────────┘ └──────────┘
           │          │          │          │
           │     MCP  │          │     MCP  │
           ▼          ▼          ▼          ▼
┌──────────────────────────┐  ┌──────────────────────┐
│  WSTG MCP Server         │  │  Playwright MCP      │
│  (68+ tools)             │  │  (Browser Testing)   │
│                          │  │                      │
│  ◦ 109 WSTG tests        │  │  ◦ DOM XSS proof     │
│  ◦ 31 technique guides   │  │  ◦ Clickjacking      │
│  ◦ Task tree             │  │  ◦ JS-rendered auth  │
│  ◦ Knowledge graph       │  └──────────────────────┘
│  ◦ WAF evasion           │
│  ◦ Tool output parser    │
│  ◦ Results verification  │  docker exec
│  ◦ Context compression   │       │
│  ◦ Endpoint priority     │       ▼
│  ◦ Quality gates         │  ┌──────────────────────┐
│  ◦ Report generation     │  │  autopentest-tools   │
└──────────────────────────┘  │  (Docker Container)  │
                              │                      │
                              │  27 security tools:  │
                              │  nuclei, sqlmap,     │
                              │  dalfox, katana,     │
                              │  ffuf, nmap ...      │
                              │                      │
                              │  Burp proxy          │
                              │  passthrough         │
                              └──────────────────────┘

How it works:

  1. Claude Code reads CLAUDE.md for the complete pentest methodology and orchestrates the 7-phase workflow
  2. Role-specialized subagents (Scout, Analyzer, Exploiter, Reporter) execute focused tasks with dedicated prompt templates, tool guidance, and anti-patterns
  3. WSTG MCP Server (68+ tools) provides OWASP test procedures, 31 PortSwigger technique guides, hierarchical task tree, knowledge graph, WAF evasion, endpoint prioritization, results verification, context compression, quality gates, and report generation
  4. Docker Container runs all 27 security tools — traffic optionally routes through Burp Suite for passive monitoring
  5. Playwright MCP handles browser-based testing (DOM XSS, clickjacking, JS-rendered login pages)

Features

Comprehensive OWASP Coverage

  • 109 WSTG test cases across 12 categories — from information gathering to API testing
  • Each test includes step-by-step CLI procedures, context-specific payloads, detection criteria, and severity rubrics
  • Tests are prioritized (MUST/SHOULD) with conditional triggers so nothing relevant is skipped

31 PortSwigger Attack Technique Guides

  • Sourced from PortSwigger Web Security Academy — detection methods, exploitation techniques, payloads, cheat sheets, and WAF bypass patterns
  • Organized by vulnerability class (SQLi, XSS, SSRF, JWT, OAuth, etc.) for direct use during testing
  • Integrated into every testing phase — agents automatically load the relevant technique guide before testing each vulnerability class
  • Database/platform-specific payload tables (Oracle vs MySQL vs PostgreSQL vs MSSQL for SQLi, Jinja2 vs Twig vs Freemarker for SSTI, etc.)
  • WAF bypass patterns organized by bypass level (basic → intermediate → advanced)

27 Pre-Configured Security Tools

  • All tools pre-installed in a single Docker image — make setup and you're ready
  • Tools organized by phase: discovery, injection testing, authentication, cryptography, API testing
  • Automatic Burp Suite proxy integration for passive traffic monitoring

Structured 7-Phase Workflow

  • Phase 0: Application Discovery & Mapping
  • Phase 1: Information Gathering & Reconnaissance
  • Phase 2: Configuration & Deployment Testing
  • Phase 3: Identity, Authentication, Authorization & Session Management
  • Phase 4: Input Validation Testing (parallel XSS/SQLi/SSRF pipelines)
  • Phase 5: Error Handling, Cryptography, Business Logic, Client-Side & API Testing
  • Phase 6: Coverage Verification & Reporting
  • Phase 7: Final Judge Review & Remediation

Quality Assurance System

  • Automated phase gates — each phase must pass quality checks before proceeding
  • Quality Reviewer subagent at every phase transition identifies gaps and suggests improvements
  • Final Judge — a zero-context agent reviews the entire engagement cold, like an external QA reviewer
  • Exhaustion gates — "not vulnerable" requires proof of sufficient testing effort (minimum techniques and bypass attempts)

Evidence-Based Findings

  • Every finding requires reproducible curl commands and full request/response evidence
  • Three-tier classification: EXPLOITED (proven impact), POTENTIAL (blocked by control), FALSE_POSITIVE (control holds)
  • Anti-hallucination framework — "no exploit = no finding" enforced at every level
  • Evidence checklists per vulnerability class verified before any finding is logged

Role-Specialized Subagents

  • 4 dedicated roles with focused prompt templates, tool guidance, and anti-patterns:
    • Scout — reconnaissance only, maps attack surface without sending payloads (Phase 0-1)
    • Analyzer — identifies potential sinks with canary/witness payloads, builds exploitation queues (Phase 2-5 analysis)
    • Exploiter — consumes Analyzer output, proves exploitation with evidence, logs confirmed findings (Phase 4 exploitation)
    • Reporter — quality review and Final Judge, reviews data without sending requests (QA + post-report)
  • Validation checkpoint between analysis and exploitation prevents wasted effort
  • Each role has explicit allowed/restricted tool lists and input/output contracts

Pipelined Exploitation (Phase 4)

  • 3 independent two-stage pipelines run in parallel: XSS, Injection (SQLi/CMDi), SSRF/SSTI
  • Each pipeline: Analyzer (discover → analyze → queue) → validation checkpoint → Exploiter (exploit → log)
  • Each pipeline loads its PortSwigger technique guide for detection methods, cheat sheets, and WAF bypass patterns
  • WAF intelligence shared across all pipelines
  • Context-aware witness payloads for 13 sink types

Adaptive WAF Evasion

  • Automatic WAF fingerprinting from response headers, body, and status codes — identifies 12 WAF vendors (Cloudflare, AWS WAF, Akamai, Imperva, ModSecurity, F5, FortiWeb, Sucuri, Barracuda, Wordfence, NAXSI, Citrix)
  • Vendor-specific bypass payloads organized by complexity level (basic → intermediate → advanced)
  • WAF intelligence shared across all agents via deliverable system
  • Agents automatically identify WAF on first block response and switch to tailored bypass payloads

Cross-Phase Knowledge Graph

  • Entity-relationship graph tracks endpoints, parameters, technologies, findings, cookies, domains, and user roles
  • Automated vulnerability chaining via BFS path finding with 7 predefined chain patterns:
    • XSS + missing CSP, XSS + weak cookie (no HttpOnly), Open redirect + OAuth callback
    • IDOR + admin role, SSRF + cloud metadata, No lockout + no MFA, CORS + sensitive endpoint
  • Severity upgrades when chaining materially increases impact
  • Populated throughout testing, queried after Phase 4 for chain discovery

Hierarchical Task Tree

  • Persistent tree structure (phases as branches, tests as leaves) prevents LLM depth-first bias and context loss
  • Main agent maintains strategic macro view; subagents update only their assigned leaf nodes
  • Auto-propagation: when all children complete, parent auto-completes
  • Phase-level completion percentages for informed decision-making
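The auto-propagation and completion-percentage rules can be sketched with a small tree class. Class and method names are illustrative, not the server's API:

```python
# Minimal task-tree sketch: a parent is complete exactly when all of
# its children are, and completion percentage is computed over leaves.
class TaskNode:
    def __init__(self, name, children=None):
        self.name, self.done = name, False
        self.children = children or []

    def mark_done(self):
        self.done = True

    def is_complete(self) -> bool:
        if self.children:
            return all(c.is_complete() for c in self.children)
        return self.done

    def leaves(self):
        if not self.children:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]

    def completion_pct(self) -> float:
        leaves = self.leaves()
        return 100.0 * sum(leaf.done for leaf in leaves) / len(leaves)
```

With this structure a subagent only ever calls mark_done() on its own leaf, while the orchestrator reads completion_pct() per phase branch.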

Endpoint Risk Prioritization

  • Score and sort endpoints by risk for prioritized testing — highest risk tested first
  • Scoring factors: parameter count, technology risk indicators, taint chain confidence, tool convergence, auth requirements, injectable parameter names
  • Integrated into Phase 0 endpoint map generation
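The scoring factors listed above suggest a weighted-sum model. The weights, field names, and parameter-name hints below are assumptions sketched for illustration only:

```python
# Hypothetical risk-scoring sketch; weights and factors are illustrative,
# not the project's actual scoring model.
WEIGHTS = {"param_count": 2.0, "tech_risk": 3.0,
           "tool_convergence": 1.5, "injectable_name": 4.0}
INJECTABLE_HINTS = ("id", "q", "search", "file", "url", "redirect")

def score_endpoint(ep: dict) -> float:
    score = WEIGHTS["param_count"] * len(ep.get("params", []))
    score += WEIGHTS["tech_risk"] * ep.get("tech_risk", 0)
    score += WEIGHTS["tool_convergence"] * ep.get("reported_by", 1)
    score += WEIGHTS["injectable_name"] * sum(
        any(h in p.lower() for h in INJECTABLE_HINTS) for p in ep.get("params", []))
    return score

endpoints = [
    {"path": "/search", "params": ["q"], "tech_risk": 1, "reported_by": 3},
    {"path": "/about", "params": [], "tech_risk": 0, "reported_by": 1},
]
ranked = sorted(endpoints, key=score_endpoint, reverse=True)
```

Sorting descending puts the parameter-rich, multi-tool-confirmed /search endpoint ahead of the static /about page, which is the "highest risk tested first" behavior described above.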

Tool Output Parsing

  • 13 built-in parsers for common CLI tools (nmap, nuclei, sqlmap, ffuf, httpx, whatweb, testssl, nikto, dalfox, katana, gau, wapiti, commix)
  • Condenses raw tool output 3-5x while preserving key findings, endpoints, and errors
  • Configurable verbosity: summary (~15 lines), detailed (~50 lines), full (complete parsed output)
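As a rough sketch of what one of these parsers does, the fragment below condenses nmap output down to its open-port lines. The real server ships 13 purpose-built parsers; this pattern and function name are illustrative only:

```python
import re

# Illustrative condenser sketch: keep only nmap lines reporting open
# ports, discarding banner and progress noise.
def condense_nmap(raw: str, verbosity: str = "summary") -> list[str]:
    open_ports = [line.strip() for line in raw.splitlines()
                  if re.match(r"^\d+/(tcp|udp)\s+open", line.strip())]
    # Summary mode caps output at roughly 15 lines, per the doc above.
    return open_ports[:15] if verbosity == "summary" else open_ports

raw = """Starting Nmap 7.94 ...
PORT     STATE  SERVICE VERSION
22/tcp   open   ssh     OpenSSH 9.6
80/tcp   open   http    nginx 1.25
443/tcp  closed https
"""
```

Here condense_nmap(raw) keeps the two open-port lines and drops everything else, which is where the 3-5x condensation comes from on real scans.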

CLI Tool Results Verification

  • Automatic validation of CLI tool output quality — detects empty output, proxy errors, permission issues, and suspicious results
  • 10 per-tool validators (nmap, nuclei, sqlmap, ffuf, feroxbuster, testssl, dalfox, wapiti, katana, httpx) with corrected command suggestions
  • When a tool produces empty or suspicious output, the validator suggests fixes (e.g., add -Pn for nmap, remove proxy env vars, try different flags)
  • Integrated into the tool execution workflow — agents call verify_tool_result() after every CLI tool run
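A validator in the spirit of verify_tool_result() can be sketched as a series of output-quality checks with suggested fixes. The checks, return shape, and suggestions below are assumptions, not the project's exact rules:

```python
# Hedged sketch of per-tool output validation; issue strings and fix
# suggestions are illustrative.
def verify_tool_result(tool: str, output: str) -> dict:
    if not output.strip():
        fix = ("add -Pn (skip host discovery)" if tool == "nmap"
               else "re-run without proxy env vars")
        return {"ok": False, "issue": "empty output", "suggestion": fix}
    if "connection refused" in output.lower() or "proxyerror" in output.lower():
        return {"ok": False, "issue": "proxy/connectivity error",
                "suggestion": "verify the Burp proxy is reachable or unset HTTP_PROXY"}
    return {"ok": True, "issue": None, "suggestion": None}
```

The value of the real tool is the suggestion field: instead of the agent silently accepting an empty scan, it gets a corrected command to retry.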

Progressive Context Compression

  • Phase summaries (~500-800 words) auto-generated when phase gates pass — capturing findings, coverage, tool results, and attack surface in compressed form
  • Prevents context degradation in long-running engagements by replacing raw historical data with structured summaries
  • get_engagement_summary() combines all phase summaries into a single overview for injecting into new subagent prompts
  • Summaries stored as deliverables — accessible by any downstream agent without requiring full engagement history

Counterfactual Analysis (Second-Pass Discovery)

  • After an Analyzer completes with vulnerabilities found, a second Analyzer is spawned with instructions to "assume those vulns are patched"
  • The counterfactual Analyzer searches for additional vulnerabilities: different endpoints, different parameters, different injection contexts, logic flaws
  • Results are appended to the existing exploitation queue (automatic merge with deduplication by endpoint+parameter and auto-incrementing IDs)
  • Based on PenHeal ablation research showing +71% vulnerability coverage with counterfactual prompting
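The merge-with-deduplication step described above can be sketched directly from its description: dedupe on (endpoint, parameter) and hand new entries auto-incrementing IDs. Field names are assumptions:

```python
# Sketch of the described queue merge: skip entries the first Analyzer
# already queued, assign fresh IDs to genuinely new ones.
def merge_queues(existing: list[dict], new: list[dict]) -> list[dict]:
    seen = {(e["endpoint"], e["parameter"]) for e in existing}
    next_id = max((e["id"] for e in existing), default=0) + 1
    merged = list(existing)
    for entry in new:
        key = (entry["endpoint"], entry["parameter"])
        if key in seen:
            continue
        seen.add(key)
        merged.append({**entry, "id": next_id})
        next_id += 1
    return merged
```

Deduplicating on endpoint plus parameter (rather than endpoint alone) matters because the counterfactual pass is explicitly hunting for different parameters on endpoints already tested.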

Multi-Domain Support

  • Automatic SSO/OAuth/OIDC/SAML detection and handling
  • Per-domain scope registration, crawling, and testing
  • Cookie jar management for cross-domain session persistence
  • 6-level authentication failure escalation (alternative grants → PKCE → headless browser → token extraction → user provision → unauthenticated)

Crash-Safe Engagement Management

  • Append-only findings.md and progress.log survive crashes
  • Git workspace checkpointing with rollback capability
  • Auto-resume on interruption: resume-prompt.md is auto-generated at every checkpoint with full context (target, credentials, current phase, remaining tests, scope). Paste it into a new session to continue exactly where you left off
  • Mid-phase checkpoint granularity — tracks which tests within a phase are completed, not just phase-level state
  • Full audit trail of every MCP tool call with timestamps

Professional Reporting

  • Markdown reports with executive summary, findings by severity, test coverage matrix, and tool coverage
  • Per-category coverage percentages and gap analysis
  • Vulnerability chaining analysis documented
  • Final Judge observations and quality notes included

Agent Role System

AutoPentest uses 4 specialized agent roles instead of generic subagents. Each role has a dedicated prompt template with focused tool guidance, input/output contracts, and anti-patterns.

Role Template Purpose Phases
Scout templates/agent-roles/scout.md Reconnaissance and attack surface mapping Phase 0-1, source code discovery
Analyzer templates/agent-roles/analyzer.md Vulnerability discovery with canary/witness payloads Phase 2-5 analysis
Exploiter templates/agent-roles/exploiter.md Exploitation proof with evidence Phase 4 exploitation
Reporter templates/agent-roles/reporter.md Quality review and Final Judge Phase transitions, post-report

How the Pipeline Works

Phase 4 (highest-impact testing) uses a two-stage pipeline per vulnerability class:

┌──────────────────────────────────────────────────────────────┐
│                    Pipeline 1: XSS                           │
│                                                              │
│  Analyzer (75 turns)          Exploiter (75 turns)           │
│  ┌─────────────────────┐      ┌─────────────────────┐        │
│  │ Discover endpoints  │      │ Load Analyzer queue │        │
│  │ Send canary payloads│─────▶│ Attempt exploitation│        │
│  │ Build exploit queue │ gate │ Prove impact        │        │
│  │ Save deliverable    │      │ Log findings        │        │
│  └─────────────────────┘      └─────────────────────┘        │
│                          ▲                                   │
│               validate_exploitation_queue()                  │
└──────────────────────────────────────────────────────────────┘

Three pipelines (XSS, Injection, SSRF/SSTI) run in parallel. The validation checkpoint between Analyzer and Exploiter ensures only well-formed exploitation queues proceed.

Role Boundaries

Each role has explicit tool restrictions enforced through prompts:

  • Scouts cannot call log_finding() or send attack payloads
  • Analyzers can log configuration findings (missing headers, weak cookies) but not injection-class findings
  • Exploiters cannot create new queues — they consume what the Analyzer produced
  • Reporters cannot send HTTP requests to the target — they review data only

For CTF challenges and small apps (<3 input endpoints), a legacy monolithic pipeline is available as a fallback.


Quick Start

Prerequisites

Installation

# 1. Clone the repository
git clone https://github.com/bhavsec/autopentest-ai.git
cd autopentest-ai

# 2. Install Python dependencies for the MCP server
cd server && uv sync && cd ..

# 3. Build Docker image and start the tools container
make setup

That's it. All 27 security tools are now installed and ready inside the Docker container.

Verify Installation

# Check all tools are installed
make verify-tools

# Expected output:
# [+] nuclei: installed
# [+] httpx: installed
# [+] katana: installed
# ... (27 tools total)

Start Testing

# Launch Claude Code in the project directory
claude

Then tell Claude what to test:

Run a full WSTG assessment against https://target.example.com

Usage

Option A: Interactive Mode

Launch Claude Code and provide the target:

Run a full pentest against https://app.example.com

Credentials: admin / P@ssw0rd123

Claude will ask for any missing information (like credentials) and begin the 7-phase workflow.

Option B: Config-Driven Mode (Recommended)

Create a YAML config file for repeatable, consistent assessments:

# configs/my-target.yaml
target:
  url: https://app.example.com
  scope:
    - app.example.com
    - api.example.com
  exclude:
    - cdn.example.com

authentication:
  login_type: form
  login_url: https://app.example.com/login
  credentials:
    username: testuser@example.com
    password: secret123
  login_flow:
    - "Type $username into the email field"
    - "Type $password into the password field"
    - "Click the 'Sign In' button"
  success_condition:
    type: url_contains
    value: "/dashboard"

rules:
  avoid:
    - description: "Do not test logout"
      type: path
      url_path: "/logout"
  focus:
    - description: "Prioritize API endpoints"
      type: path
      url_path: "/api"

reporting:
  tester_name: "Security Team"

Then in Claude Code:

Load the config from configs/my-target.yaml and run the pentest

Option C: Targeted Testing

Run specific WSTG tests against specific endpoints:

Run WSTG-INPV-05 (SQL Injection) against https://app.example.com/search?q=
Test https://app.example.com for CORS misconfiguration (WSTG-CONF-13)
Run all authentication tests (WSTG-ATHN) against https://app.example.com

Option D: Resume an Interrupted Engagement

Resume engagement pentest-2026-02-11-myapp

Testing Phases

Phase 0: Application Discovery & Mapping

The critical foundation phase. Claude autonomously:

  1. Pre-flight checks — verifies target reachability, detects redirects and cross-domain auth
  2. Launches 10+ background tools in parallel (katana, ffuf, nuclei, whatweb, gau, nmap, feroxbuster, wapiti, httpx)
  3. Recursive crawling — follows links to depth 2-3, parses HTML/JS for endpoints
  4. Directory brute-forcing — common paths + technology-specific wordlists
  5. Tool result ingestion — reads all background tool outputs and merges into unified endpoint map
  6. Builds structured endpoint inventory with parameters, auth requirements, and priority rankings

Output: A complete endpoint map organized by domain, ready for systematic testing.

Phase 1-2: Reconnaissance & Configuration

  • Server fingerprinting, technology detection, metadata review
  • Security header analysis (HSTS, CSP, CORS, X-Frame-Options)
  • TLS configuration testing, admin interface discovery
  • HTTP methods testing, file extension handling

Phase 3: Authentication, Authorization & Session Management

  • Role/privilege lattice built before testing (maps guards, middleware, and bypass tests)
  • IDOR testing with multiple alternate IDs per endpoint
  • CSRF testing on every state-changing endpoint
  • Session fixation, hijacking, and token analysis
  • JWT vulnerability testing (if applicable)
  • OAuth/OIDC weakness testing (if applicable)

Phase 4: Input Validation (Highest Impact)

Three independent two-stage pipelines run in parallel, each using the Analyzer→Exploiter role split:

Pipeline Vulnerability Classes Tools Technique Guides
XSS Pipeline Reflected XSS, Stored XSS, DOM XSS dalfox, Playwright XSS, DOM
Injection Pipeline SQL Injection, Command Injection, NoSQL Injection sqlmap, commix, nosqli SQLI, CMDI, NOSQLI
SSRF/SSTI Pipeline SSRF, SSTI, Path Traversal sstimap, ssrfmap SSRF, SSTI, PTRAV

Each pipeline: Analyzer (discover → analyze → build exploitation queue) → validation checkpoint → Exploiter (attempt exploitation → prove impact → log findings). WAF evasion intelligence is shared across all pipelines.

Phase 5: Error Handling, Crypto, Business Logic, Client-Side & APIs

  • Stack trace and error message disclosure
  • TLS/SSL testing via testssl.sh
  • Business logic bypass (workflow circumvention, request forgery)
  • Client-side testing (clickjacking, open redirects, DOM manipulation)
  • GraphQL and REST API testing
  • Vulnerability chaining analysis across all findings

Phase 6: Reporting

  • Coverage verification (test coverage + tool coverage)
  • Finding deduplication and severity calibration
  • Markdown report generation with executive summary, findings, coverage matrices

Phase 7: Final Judge Review

A zero-context agent reviews the entire engagement cold — no knowledge of testing decisions or difficulties. It examines:

  • Coverage integrity — rubber-stamped tests, missing endpoints
  • N/A cascade detection — categories with excessive "not applicable" markings
  • Finding quality — evidence completeness, severity consistency, chaining opportunities
  • Tool utilization — tools run but output never reviewed, lazy skip reasons
  • Missed attack surface — untested endpoints, untested parameters, untested domains

The verdict (PASS/CONDITIONAL_PASS/FAIL) triggers specific remediation actions before the report is delivered.


Security Tools

Discovery & Reconnaissance (Phase 0)

Tool Purpose Key Flags
katana Web crawler with JS rendering -jc for JavaScript crawling
httpx HTTP probing, tech detection -tech-detect -status-code -title
ffuf Directory/parameter fuzzing -w wordlist -mc all -fc 404
feroxbuster Recursive directory enumeration --smart --auto-tune
nuclei Template-based vuln scanner -t cves/ -t misconfigurations/
nikto Web server misconfiguration -Tuning 1234567890
whatweb Technology fingerprinting --aggression 3
nmap Port and service scanning -sV -sC --top-ports 1000
gau Historical URL discovery --blacklist png,jpg,gif
subfinder Subdomain enumeration -silent -all

Injection Testing (Phase 4)

Tool Purpose Key Flags
sqlmap SQL injection (all techniques) --batch --risk 3 --level 5
dalfox XSS scanning & exploitation --skip-bav --deep-domxss
commix Command injection --batch --all
sstimap Server-Side Template Injection -u <url>
ssrfmap SSRF exploitation -r request.txt
nosqli NoSQL injection -u <url>
crlfuzz CRLF injection / HTTP splitting -u <url>
smuggler HTTP request smuggling -u <url>

Authentication & Session (Phase 3)

Tool Purpose Key Flags
hydra Credential brute-force -L users.txt -P pass.txt
jwt_tool JWT token analysis & exploitation -t <token> -M at

Cryptography & APIs (Phase 5)

Tool Purpose Key Flags
testssl.sh TLS/SSL configuration testing --severity HIGH --sneaky
graphql-cop GraphQL security testing -t <url>
websocat WebSocket testing ws://<url>

Infrastructure (Phase 2)

Tool Purpose
corscanner CORS misconfiguration scanning
dnsreaper Subdomain takeover detection

Browser Automation

Tool Purpose
Playwright DOM XSS proof, clickjacking, JS-rendered login, client-side storage inspection

WSTG Knowledge Base

109 test cases across 12 OWASP categories, each with CLI-specific procedures:

Code Category Tests Examples
INFO Information Gathering 10 Search engine discovery, server fingerprinting, metadata review
CONF Configuration & Deployment 14 Security headers, CORS, CSP, HSTS, admin interfaces
IDNT Identity Management 5 Role definitions, registration, account enumeration
ATHN Authentication 11 Default creds, lockout, auth bypass, MFA, password policy
ATHZ Authorization 5 Directory traversal, auth bypass, privilege escalation, IDOR
SESS Session Management 11 Cookie attributes, CSRF, session fixation/hijacking, JWT
INPV Input Validation 20 XSS, SQLi, CMDi, SSTI, SSRF, path traversal, XXE, LDAP
ERRH Error Handling 2 Error messages, stack traces
CRYP Cryptography 4 TLS config, padding oracle, weak encryption
BUSL Business Logic 10 Workflow bypass, request forgery, file upload, rate limits
CLNT Client-Side 14 DOM XSS, clickjacking, open redirects, WebSockets, storage
APIT API Testing 3 GraphQL, REST, SOAP

Each test file includes:

  • Step-by-step CLI procedures (curl commands, tool invocations)
  • Payloads organized by bypass level (basic, intermediate, advanced)
  • Detection criteria with severity assessment rubrics
  • Remediation guidance with references

PortSwigger Technique Guides

31 attack technique reference guides sourced from PortSwigger Web Security Academy, organized by vulnerability class for direct use during real pentesting engagements.

What's Included

Code Category WSTG Mapping Key Content
SQLI SQL Injection INPV-05 UNION/blind/error/time-based/OOB techniques, database-specific cheat sheets (Oracle, MySQL, PostgreSQL, MSSQL), WAF bypass
XSS Cross-Site Scripting INPV-01, INPV-02, CLNT-01 Reflected/stored/DOM contexts, tag & event handler payloads, CSP bypass, filter evasion
CMDI OS Command Injection INPV-12 Separator characters, blind techniques (time-delay, OOB), OS-specific payloads
SSTI Server-Side Template Injection INPV-18 Jinja2/Twig/Freemarker/Velocity/ERB detection & exploitation, sandbox escapes
SSRF Server-Side Request Forgery INPV-19 URL scheme tricks, IP obfuscation, DNS rebinding, cloud metadata, filter bypass
PTRAV Path Traversal INPV-04 Encoding variations, null byte injection, wrapper bypass
XXE XML External Entities INPV-07 File retrieval, SSRF via XXE, blind XXE with OOB, parameter entities
AUTHN Authentication ATHN-01 to ATHN-07 Brute force, 2FA bypass, password reset poisoning, credential stuffing
AUTHZ Access Control ATHZ-01 to ATHZ-04 IDOR, privilege escalation, horizontal/vertical bypass, referer-based controls
JWT JSON Web Tokens SESS-10 Algorithm confusion (none/HS256→RS256), kid injection, JWK/JKU exploitation
OAUTH OAuth 2.0 ATHZ-05 Authorization code theft, open redirect, scope upgrade, CSRF on OAuth flows
CSRF Cross-Site Request Forgery SESS-05 Token bypass, SameSite bypass, referer validation bypass
SMUGGLE HTTP Request Smuggling INPV-15 CL.TE, TE.CL, TE.TE, HTTP/2 downgrade, request tunneling
DOM DOM-Based Vulnerabilities CLNT-01 Sources/sinks, DOM clobbering, prototype pollution gadgets
CORS Cross-Origin Resource Sharing CONF-13, CLNT-07 Origin reflection, null origin, subdomain trust exploitation
NOSQLI NoSQL Injection INPV-05 MongoDB operator injection, JavaScript injection, blind extraction
GRAPHQL GraphQL APIT-01 Introspection, field suggestion, batching attacks, authorization bypass
RACE Race Conditions BUSL-04 Limit overrun, TOCTOU, single-endpoint races, last-frame sync
UPLOAD File Upload BUSL-08, BUSL-09 Extension bypass, content-type manipulation, web shells, polyglot files
HOST Host Header Injection INPV-17 Password reset poisoning, cache poisoning, routing-based SSRF

Plus 11 more: CLICK, WS, CACHEPOIS, CACHEDEC, DESER, INFO, BUSL, PROTO, API, LLM, SKILLS.

How They're Used

Technique guides are integrated into every testing phase via the get_technique_guide() MCP tool:

Phase 2 → CORS guide for CONF-13 testing
Phase 3 → AUTHN, AUTHZ, CSRF, JWT, OAUTH guides for auth/session testing
Phase 4 → SQLI, XSS, CMDI, SSTI, SSRF, PTRAV, XXE guides for input validation
Phase 5 → DOM, CLICK, GRAPHQL, RACE, UPLOAD guides for client-side & business logic

Each parallel testing agent automatically loads its relevant technique guide before testing, providing:

  • Detection payloads — what to inject to identify the vulnerability
  • Exploitation techniques — organized by attack method with step-by-step procedures
  • Cheat sheets — database/platform-specific syntax tables for quick reference
  • WAF bypass patterns — encoding, obfuscation, and filter evasion strategies

Adding Custom Guides

See docs/adding-knowledge-base-resources.md for instructions on adding new technique guides to the knowledge base.


Quality Assurance System

AutoPentest has a multi-layered QA system that prevents shallow testing:

1. Phase Gates (Automated)

After each phase, phase_gate_check() validates:

  • All MUST-priority tests were executed
  • Minimum coverage thresholds are met
  • Tool coverage is adequate
  • No critical gaps exist

Blocked phases cannot proceed until all issues are resolved.
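A gate in the spirit of phase_gate_check() can be sketched as a pair of checks over the phase's test list. Thresholds, field names, and the return shape below are illustrative assumptions:

```python
# Hedged sketch of a phase gate: all MUST tests executed, and overall
# coverage above a minimum threshold.
def phase_gate_check(phase: dict, min_coverage: float = 0.8) -> dict:
    issues = []
    missed_must = [t["id"] for t in phase["tests"]
                   if t["priority"] == "MUST" and not t["executed"]]
    if missed_must:
        issues.append(f"MUST tests not executed: {', '.join(missed_must)}")
    coverage = sum(t["executed"] for t in phase["tests"]) / len(phase["tests"])
    if coverage < min_coverage:
        issues.append(f"coverage {coverage:.0%} below threshold {min_coverage:.0%}")
    return {"passed": not issues, "issues": issues}
```

An empty issues list is what lets the orchestrator advance; anything else blocks the phase until each listed gap is closed.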

2. Quality Reviewer (Per-Phase)

A subagent spawned at every phase transition that:

  • Checks for 16 known anti-patterns (rubber-stamping, N/A cascades, finding inflation)
  • Identifies untested endpoints and parameters
  • Suggests vulnerability chaining opportunities
  • Recommends alternative approaches for blocked tests

3. Final Judge (Post-Report)

A zero-context agent that reviews the completed engagement with fresh eyes:

  • Analyzes coverage integrity across all domains
  • Detects N/A cascades and their root causes
  • Validates finding quality and evidence completeness
  • Identifies missed attack surface
  • Issues a verdict: PASS, CONDITIONAL_PASS, or FAIL

4. Exhaustion Gates

Marking a vulnerability as "not exploitable" requires proof of effort:

Vuln Class Min Techniques Min Bypass Attempts
XSS 3 5
SQL Injection 3 5
Command Injection 3 5
SSTI 2 3
SSRF 3 5
Path Traversal 3 5
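The table above translates directly into a minimum-effort check. The function name and call shape are illustrative; the thresholds are the ones documented:

```python
# Exhaustion-gate sketch: "not exploitable" is only accepted once the
# documented minimum techniques and bypass attempts have been tried.
EXHAUSTION_MINIMUMS = {
    "xss": (3, 5), "sqli": (3, 5), "cmdi": (3, 5),
    "ssti": (2, 3), "ssrf": (3, 5), "ptrav": (3, 5),
}

def may_mark_not_exploitable(vuln_class: str, techniques: int, bypasses: int) -> bool:
    min_techniques, min_bypasses = EXHAUSTION_MINIMUMS[vuln_class]
    return techniques >= min_techniques and bypasses >= min_bypasses
```

So an agent that tried only two XSS techniques cannot close the test as "not vulnerable"; it is forced back to try more before the verdict is accepted.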

5. Evidence Checklists

Before logging any finding, evidence requirements are verified:

  • Reproducible curl command
  • Full HTTP request and response
  • Proof of actual exploitation (not theoretical impact)
  • Correct classification tier (EXPLOITED vs POTENTIAL)

6. Live Engagement Logging

Every MCP tool call is automatically logged to engagements/<eid>/logs.txt with full arguments, results, and execution duration. Run tail -f logs.txt in a separate terminal to watch all agent activity in real time. 100% coverage via automatic tool wrapper — no manual instrumentation needed.
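The "automatic tool wrapper" idea can be sketched as a logging decorator. The real server applies this transparently to every MCP tool; the decorator below is only an illustration of the mechanism, with assumed field names:

```python
import functools
import json
import time

# Illustrative logging wrapper: record tool name, arguments, and
# execution duration for every wrapped call.
def logged(log: list):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            result = fn(*args, **kwargs)
            log.append(json.dumps({
                "tool": fn.__name__,
                "kwargs": kwargs,
                "duration_s": round(time.monotonic() - start, 3),
            }))
            return result
        return wrapper
    return decorator
```

Because the wrapper sits outside the tool functions, coverage is total by construction: no tool author has to remember to add instrumentation.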

7. Phase Gate Timing

Phase gates enforce minimum 60-second intervals between calls (15s in CTF mode), preventing premature phase completion. Inter-gate work verification warns if fewer than 3 work events occur between consecutive gates.


Benchmarking

AutoPentest includes integration with the XBOW Validation Benchmarks — 104 CTF-style Docker challenges widely used for benchmarking AI pentest agents.

Benchmark Scores (Reference)

Agent Score Source
Shannon 96.2% KeygraphHQ (2024)
PentestGPT 86.5% USENIX Sec 2024

Usage

# Setup (one-time)
cd benchmarks/xbow && make setup

# Solve with AutoPentest (MCP server + CLAUDE.md + CTF mode)
make solve ID=XBEN-001-24

# Solve with raw Claude (baseline — no MCP, no methodology)
make solve ID=XBEN-001-24 RAW=1

# Solve by vulnerability tag
make solve-tag TAG=sqli

# Solve all 104 challenges
make solve-all

# Full baseline run for comparison
make solve-all RAW=1

# Score the latest run
make score

# Compare autopentest vs raw runs side-by-side
make compare

The solver has two modes:

  • autopentest (default): Runs Claude Code from the project root, loading .mcp.json (MCP server with 68+ tools) and CLAUDE.md (pentest methodology). Measures AutoPentest's full capability.
  • raw (RAW=1): Runs bare Claude Code with no MCP server or methodology. Baseline for measuring AutoPentest's value-add over raw LLM capability.

Each challenge is a Docker Compose app with a flag injected at build time. Flag extraction from Claude's output determines pass/fail. Results are scored per-challenge, per-tag, and per-difficulty-level.

CTF Mode

For CTF challenges and small apps, enable CTF mode for relaxed quality gates:

mode: ctf
target:
  url: https://target.com

CTF mode reduces phase gate timing (15s vs 60s), skips QA Reviewer requirements, and halves completion thresholds — while maintaining finding quality and evidence standards.


Example Report

A complete example report from a pentest against PortSwigger's Gin & Juice Shop (a deliberately vulnerable application) is included in the repository:

View Full Report

What the Report Includes

The report demonstrates AutoPentest's output against a real target with 23 findings across all severity levels:

Severity Count Examples
Critical 2 UNION-based SQL injection with full data extraction, access control bypass via X-Original-URL header
High 5 Reflected XSS via JS string escape bypass, IDOR on order details, XXE with local file read, DOM XSS via prototype pollution
Medium 6 Missing security headers, no account lockout, missing CSP, CRLF injection, DOM-based open redirect
Low 5 Infrastructure info disclosure, EOL AngularJS, insecure ALB cookies, weak TLS config
Informational 5 Consolidated duplicates and secondary evidence for primary findings

Report Structure

1. Executive Summary         — Target scope, finding summary, domain architecture
2. Detailed Findings         — Each finding with description, evidence (curl commands), and remediation
3. Vulnerability Chaining    — Cross-finding analysis (e.g., XSS + no CSP = severity upgrade)
4. Test Coverage Matrix      — Per-category WSTG coverage (100% across 12 categories)
5. Tool Coverage Matrix      — 27/27 tools tracked, 8 actively run

Sample Finding (SQL Injection)

From the report — a Critical SQL injection finding with full exploitation evidence:

FINDING-017: SQL Injection in /catalog category parameter — Full Data Extraction

Severity: Critical
WSTG Reference: WSTG-INPV-05

The category parameter is vulnerable to UNION-based SQL injection.
The attacker can:
  1. Inject a single quote to cause a 500 error (confirming injection)
  2. Use UNION SELECT with 8 columns to extract arbitrary data
  3. Enumerate tables: PRODUCTS, TRACKING, USERS
  4. Extract credentials from the USERS table

Evidence (reproducible curl command):
  curl -sk "https://ginandjuice.shop/catalog?category='+UNION+SELECT+1,USERNAME,PASSWORD,
  1,1,USERNAME,1,USERNAME+FROM+USERS+LIMIT+10--"

Every finding includes reproducible curl commands, full request/response evidence, and actionable remediation guidance.


Configuration

Engagement Config (YAML)

Config-driven pentests skip interactive questions and ensure consistency:

target:
  url: https://app.example.com
  scope: [app.example.com, api.example.com]

authentication:
  login_type: sso                    # form | sso | api | manual | none
  login_url: https://app.example.com/login
  credentials:
    username: testuser
    password: secret123
  sso:
    provider: keycloak               # keycloak | auth0 | okta | azure_ad
    auth_domain: auth.example.com
    realm: myrealm
    client_id: my-app

rules:
  avoid:
    - { type: path, url_path: "/logout", description: "Skip logout" }
    - { type: endpoint, method: DELETE, url_path: "/api/admin/*", description: "No destructive admin ops" }
  focus:
    - { type: path, url_path: "/api", description: "Prioritize API" }

reporting:
  tester_name: "Security Team"
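The avoid/focus rules above use glob-style paths (e.g. /api/admin/*). Matching a request against them can be sketched with Python's fnmatch — helper names here are hypothetical, not the server's actual code:

```python
from fnmatch import fnmatch

# Hypothetical rule structures mirroring the YAML example above.
AVOID_RULES = [
    {"type": "path", "url_path": "/logout"},
    {"type": "endpoint", "method": "DELETE", "url_path": "/api/admin/*"},
]

def is_avoided(method: str, path: str, rules=AVOID_RULES) -> bool:
    """True when a request matches any avoid rule and must be skipped."""
    for rule in rules:
        # Endpoint rules constrain both method and path; path rules only path.
        if rule["type"] == "endpoint" and rule.get("method") != method:
            continue
        if fnmatch(path, rule["url_path"]):
            return True
    return False
```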

MCP Server Configuration

The .mcp.json file registers two MCP servers:

{
  "mcpServers": {
    "wstg-pentest": {
      "command": "uv",
      "args": ["--directory", "./server", "run", "server.py"]
    },
    "playwright": {
      "command": "npx",
      "args": ["-y", "@playwright/mcp"]
    }
  }
}

Burp Suite Integration (Optional)

For passive traffic monitoring through Burp Suite Professional:

  1. Start Burp Suite and enable the proxy on all interfaces (0.0.0.0:8080)
  2. The Docker container automatically routes traffic through host.docker.internal:8080
  3. All HTTP requests appear in Burp's proxy history for manual review

Multi-Domain Testing

AutoPentest has first-class support for applications with multiple domains (e.g., a SPA frontend + API backend + SSO provider):

Automatic Detection

During Phase 0, AutoPentest detects cross-domain authentication by following login redirects:

app.example.com → redirects to → auth.example.com/login
                 → after login → app.example.com/callback

All domains are automatically registered in scope with their type (app, auth_provider, api, cdn).
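Conceptually, the detection step reduces to collecting distinct hostnames along the login redirect chain — a simplified offline sketch (the real implementation also classifies each domain by type):

```python
from urllib.parse import urlparse

def domains_in_redirect_chain(chain: list[str]) -> list[str]:
    """Return each distinct hostname seen along a login redirect chain,
    in order of first appearance — every one becomes an in-scope domain."""
    seen: list[str] = []
    for url in chain:
        host = urlparse(url).hostname
        if host and host not in seen:
            seen.append(host)
    return seen
```

Feeding it the example chain above yields both the app domain and the auth provider.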

Per-Domain Testing

Every WSTG test is evaluated per domain — not just the primary:

  • Discovery tools (katana, ffuf, nuclei) run against all domains
  • Input validation tools (sqlmap, dalfox) target endpoints on every domain with server-side processing
  • A test is "not applicable" only when no domain has the tested feature

Cross-Domain Authentication

Supported SSO protocols:

  • OAuth 2.0 / OIDC (Authorization Code, PKCE, Password Grant, Client Credentials)
  • SAML (SP-initiated flow)
  • Keycloak, Auth0, Okta, Azure AD
  • Custom SSO (redirect chain following with cookie jar)

A 6-level authentication escalation procedure ensures testing can proceed even through complex auth flows.
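The PKCE support (automated by scripts/pkce-auth.py) rests on the standard RFC 7636 verifier/challenge pair: the challenge is the unpadded base64url encoding of the SHA-256 of the verifier. A minimal sketch of that derivation:

```python
import base64
import hashlib
import secrets

def make_pkce_pair() -> tuple[str, str]:
    """Generate an RFC 7636 code_verifier and its S256 code_challenge."""
    # 32 random bytes -> 43-char base64url verifier (padding stripped).
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    digest = hashlib.sha256(verifier.encode()).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return verifier, challenge
```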


Crash Recovery

AutoPentest is designed to survive interruptions:

Automatic Checkpointing

  • Phase gates auto-save checkpoints on PASS
  • git_checkpoint() creates git snapshots of the engagement workspace
  • Append-only logs (findings.md, progress.log) survive crashes

Auto-Resume via resume-prompt.md (Recommended)

Every checkpoint and phase gate automatically generates engagements/<eid>/resume-prompt.md — a complete, self-contained prompt with everything a fresh session needs:

  • Target URL, authentication credentials, and scope domains
  • Current phase and which specific tests remain (mid-phase precision)
  • Cookie jar status and re-authentication instructions
  • Avoid/focus rules and endpoint map references

To resume after an interruption:

  1. Open a new Claude Code session
  2. Paste the contents of engagements/<eid>/resume-prompt.md
  3. Claude picks up exactly where it left off — no manual context needed

Resume from Checkpoint (Alternative)

Resume engagement pentest-2026-02-11-myapp

This restores:

  • All findings and test tracking data
  • Coverage statistics and phase gate results
  • Scope registrations and deliverables
  • Mid-phase remaining tests (not just phase-level state)
  • Instructions for what to do next

Manual Checkpoints

Save at any time:

Save a checkpoint before starting Phase 4 exploitation

Rollback on Failure

If a phase produces bad results, roll back to the previous checkpoint:

Roll back the engagement to the last checkpoint

Project Structure

autopentest-ai/
├── CLAUDE.md                          # Master pentest workflow (drives Claude Code)
├── .mcp.json                          # MCP server configuration
├── Dockerfile                         # Multi-stage Docker build (27 tools)
├── docker-compose.yml                 # Docker Compose alternative
├── Makefile                           # setup, start, stop, verify-tools, shell
│
├── server/
│   ├── server.py                      # FastMCP server (68+ MCP tools)
│   ├── task_tree.py                   # Hierarchical task tree (6 MCP tools)
│   ├── tool_parsers.py                # Tool output parsing (2 MCP tools, 13 parsers)
│   ├── endpoint_priority.py           # Endpoint risk prioritization (2 MCP tools)
│   ├── waf_evasion.py                 # Adaptive WAF evasion (3 MCP tools, 12 vendors)
│   ├── knowledge_graph.py             # Cross-phase knowledge graph (5 MCP tools)
│   ├── tool_verification.py           # CLI tool results verification (1 MCP tool, 10 validators)
│   ├── context_compression.py         # Progressive context compression (2 MCP tools)
│   └── pyproject.toml                 # Python dependencies
│
├── knowledge-base/
│   ├── web-security-testing-guide/    # OWASP WSTG knowledge base (109 test procedures)
│   │   ├── 01-information-gathering/  # 10 tests (WSTG-INFO-01 → 10)
│   │   ├── 02-configuration/          # 14 tests (WSTG-CONF-01 → 14)
│   │   ├── 03-identity-management/    # 5 tests  (WSTG-IDNT-01 → 05)
│   │   ├── 04-authentication/         # 11 tests (WSTG-ATHN-01 → 11)
│   │   ├── 05-authorization/          # 5 tests  (WSTG-ATHZ-01 → 05)
│   │   ├── 06-session-management/     # 11 tests (WSTG-SESS-01 → 11)
│   │   ├── 07-input-validation/       # 20 tests (WSTG-INPV-01 → 20)
│   │   ├── 08-error-handling/         # 2 tests  (WSTG-ERRH-01 → 02)
│   │   ├── 09-cryptography/           # 4 tests  (WSTG-CRYP-01 → 04)
│   │   ├── 10-business-logic/         # 10 tests (WSTG-BUSL-01 → 10)
│   │   ├── 11-client-side/            # 14 tests (WSTG-CLNT-01 → 14)
│   │   └── 12-api-testing/            # 3 tests  (WSTG-APIT-01 → 03)
│   └── portswigger-academy/           # 31 PortSwigger attack technique guides
│       ├── sql-injection.md           # UNION, blind, error-based, OOB, WAF bypass
│       ├── cross-site-scripting.md    # Reflected, stored, DOM, CSP bypass, filter evasion
│       ├── ssrf.md                    # URL schemes, cloud metadata, DNS rebinding
│       ├── ssti.md                    # Jinja2, Twig, Freemarker sandbox escapes
│       ├── jwt.md                     # Algorithm confusion, kid injection, JWK exploitation
│       ├── oauth.md                   # Auth code theft, redirect exploitation, scope upgrade
│       └── ... (31 total)             # One per vulnerability class
│
├── templates/                         # Testing guides and procedures
│   ├── input-validation-guide.md      # Phase 4 step-by-step procedures
│   ├── testing-strategies.md          # Test matrices, chaining, parallel strategy
│   ├── cli-tools-guide.md             # Tool setup and Docker management
│   ├── tools.md                       # Per-tool command reference
│   ├── quality-gates.md               # Phase quality checklists and anti-patterns
│   ├── cross-domain-auth-guide.md     # SSO/OIDC/SAML procedures
│   ├── source-code-analysis.md        # Security-focused code review template
│   ├── pipelined-testing.md           # Phase 4 pipelined exploitation strategy
│   ├── agent-roles/                   # Role-specialized subagent templates
│   │   ├── README.md                  # Role index and selection guide
│   │   ├── scout.md                   # Reconnaissance role (Phase 0-1)
│   │   ├── analyzer.md                # Vulnerability discovery role (Phase 2-5)
│   │   ├── exploiter.md               # Exploitation proof role (Phase 4)
│   │   └── reporter.md                # QA review + Final Judge role
│   ├── shared/
│   │   ├── honesty-framework.md       # Anti-hallucination guardrails
│   │   ├── exploit-classification.md  # Three-tier finding classification
│   │   ├── reproducibility.md         # Evidence format requirements
│   │   └── scope-rules.md             # Avoid/focus rule templates
│   └── wordlists/                     # Tech-specific fuzzing wordlists
│
├── benchmarks/
│   └── xbow/                              # XBOW benchmark suite (104 CTF challenges)
│       ├── runner.py                      # Challenge orchestration
│       ├── solver.py                      # Automated solver (Claude Code CLI)
│       ├── Makefile                       # solve, solve-all, score, compare
│       └── results/                       # Run reports
│
├── docs/
│   ├── ROADMAP.md                         # Competitive analysis + improvement roadmap
│   └── adding-knowledge-base-resources.md # Guide for adding new technique guides
│
├── configs/
│   ├── example-config.yaml            # Example engagement configuration
│   └── config-schema.md               # YAML schema documentation
│
├── scripts/
│   ├── install-tools.sh               # Docker build + container start
│   ├── browser-auth.py                # Headless Chromium auth (JS-rendered logins)
│   ├── pkce-auth.py                   # OAuth 2.0 PKCE flow automation
│   └── status.sh                      # Engagement status dashboard
│
└── engagements/                       # Runtime output (git-ignored)
    └── <engagement-id>/
        ├── logs.txt                   # Live engagement log (tail -f to watch)
        ├── findings.md                # Append-only findings log
        ├── progress.log               # Timestamped event log
        ├── resume-prompt.md           # Auto-resume prompt (paste into new session)
        ├── report.md                  # Final pentest report
        ├── cookies.txt                # Cross-domain cookie jar
        └── tool-output/               # Raw CLI tool outputs

Requirements

Requirement Version Notes
Docker 20.10+ Docker Desktop on macOS/Windows
Claude Code Latest npm install -g @anthropic-ai/claude-code
uv 0.1+ curl -LsSf https://astral.sh/uv/install.sh | sh
Node.js 18+ For Playwright MCP server
Python 3.10+ Managed by uv (no manual install needed)
Burp Suite Pro Latest Optional — for passive traffic monitoring

Supported platforms: macOS (Apple Silicon & Intel), Linux (x86_64 & ARM64)


FAQ

Q: Does this replace a human penetration tester?

No. AutoPentest automates the systematic, methodology-driven parts of a pentest. It excels at coverage (ensuring nothing is missed) and consistency (every test follows the same procedure). However, complex business logic, creative exploitation chains, and context-dependent risk assessment still benefit from human expertise. Think of it as a force multiplier.

Q: How long does a full assessment take?

It depends on the application's size and complexity. A typical medium-sized web app (50-100 endpoints) takes a few hours. Multi-domain applications with SSO take longer. The pipelined Phase 4 architecture parallelizes the most time-intensive testing.

Q: Can I run this without Burp Suite?

Yes. Burp Suite is optional and used only for passive traffic monitoring. All HTTP requests go through docker exec curl and all security tools run inside the Docker container. Without Burp, you lose the ability to review traffic in Burp's proxy history, but all testing functionality works.

Q: What are the PortSwigger technique guides?

31 attack reference guides covering detection, exploitation techniques, payloads, cheat sheets, and WAF bypass patterns — sourced from PortSwigger Web Security Academy. During testing, agents automatically load the relevant guide (e.g., the SQLi guide when testing for SQL injection) for comprehensive technique and payload reference. See docs/adding-knowledge-base-resources.md to add your own guides.

Q: How do I add custom wordlists or payloads?

Place wordlists in templates/wordlists/ and they'll be available inside the Docker container via the volume mount. The WSTG test files in knowledge-base/ can also be customized with additional payloads. To add new attack technique guides, follow the instructions in docs/adding-knowledge-base-resources.md.

Q: Can I test applications behind a VPN?

Yes. The Docker container inherits your host's network (on Linux with --network host) or reaches the host via host.docker.internal (on macOS/Windows). If your VPN is running on the host, the container can reach VPN-protected targets.

Q: What happens if a pentest is interrupted (crash, usage limit, timeout)?

AutoPentest automatically generates a resume-prompt.md file at every checkpoint with everything needed to continue. Open a new Claude Code session, paste the contents of engagements/<eid>/resume-prompt.md, and testing resumes exactly where it left off — including mid-phase progress, credentials, scope, and remaining tests.

Q: What about rate limiting?

AutoPentest includes three-tier error classification (Transient/Rate Limit/Permanent) with automatic backoff. If the target rate-limits requests, tools automatically slow down. You can also set avoid rules in the config to skip specific endpoints.
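A stripped-down version of that classification and backoff might look like this — the status-code buckets and delays are illustrative; the real classifier also inspects response bodies and tool-specific error strings:

```python
# Illustrative status-code buckets only.
TRANSIENT = {502, 503, 504}
RATE_LIMIT = {429}

def classify(status: int) -> str:
    """Bucket an HTTP status into the three-tier error model."""
    if status in RATE_LIMIT:
        return "rate_limit"
    if status in TRANSIENT:
        return "transient"
    return "permanent"

def backoff_seconds(kind: str, attempt: int) -> float:
    """Exponential backoff for retryable errors; no retry for permanent ones."""
    if kind == "permanent":
        return 0.0
    base = 10.0 if kind == "rate_limit" else 2.0
    return min(base * (2 ** attempt), 300.0)
```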

Q: What are the agent roles?

AutoPentest uses 4 specialized roles (Scout, Analyzer, Exploiter, Reporter) instead of generic subagents. Each role has a dedicated prompt template with focused tool guidance, restricted tool lists, and anti-patterns. This prevents agents from conflating reconnaissance, analysis, exploitation, and reporting — improving focus and failure isolation. See templates/agent-roles/README.md for the full role index.

Q: How does WAF evasion work?

When a payload gets blocked (403, block page), AutoPentest automatically fingerprints the WAF vendor from response characteristics, then loads vendor-specific bypass payloads organized by complexity level. 12 WAF vendors are supported (Cloudflare, AWS WAF, Akamai, Imperva, ModSecurity, F5, and more). WAF intelligence is shared across all agents via the deliverable system.

Q: What is counterfactual analysis?

After the first analysis pass finds vulnerabilities, AutoPentest can spawn a second Analyzer that assumes all known vulnerabilities are patched. This forces the agent to look for different attack vectors — different endpoints, parameters, injection contexts, and logic flaws. The results are merged into the existing exploitation queue with automatic deduplication. The technique is based on academic research (the PenHeal ablation study), which reported a +71% improvement in vulnerability coverage.

Q: How does results verification work?

When CLI tools (nmap, nuclei, sqlmap, etc.) produce empty or suspicious output, the verify_tool_result() tool detects common issues (proxy errors, permission denied, wrong flags) and suggests corrected commands. This prevents agents from silently counting broken tool runs as "completed" — a common failure mode in automated pentesting.

Q: How does vulnerability chaining work?

The knowledge graph tracks entities (endpoints, parameters, findings, cookies, domains) and relationships discovered during testing. After Phase 4, find_chains() uses BFS to discover multi-hop attack paths and checks 7 predefined chain patterns (e.g., XSS + missing CSP, SSRF + cloud metadata, IDOR + admin role). Chains that increase impact trigger automatic severity upgrades.
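The multi-hop search reduces to a bounded BFS over the finding graph. A toy sketch with hypothetical node names (the real graph carries typed entities and relationships):

```python
from collections import deque

def find_chains(graph: dict[str, list[str]], start: str, goal: str,
                max_hops: int = 4) -> list[list[str]]:
    """Return every acyclic path from start to goal up to max_hops edges long."""
    chains = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            chains.append(path)
            continue
        if len(path) > max_hops:
            continue
        for nxt in graph.get(path[-1], []):
            if nxt not in path:  # skip cycles
                queue.append(path + [nxt])
    return chains
```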


Disclaimer

This tool is intended for authorized security testing only. Only use AutoPentest against applications you have explicit permission to test. Unauthorized access to computer systems is illegal. The authors are not responsible for any misuse of this tool.

Always ensure you have:

  • Written authorization from the application owner
  • A clearly defined scope of what can and cannot be tested
  • An understanding of the testing environment (production vs staging)
  • Appropriate avoid rules configured for destructive or sensitive endpoints

Built with Model Context Protocol
