Supacrawl

Zero-infrastructure web scraping for the terminal and AI assistants.

Why Supacrawl?

There are excellent web scraping tools available. Supacrawl takes a different approach: a CLI tool designed for individual developers who want to scrape from the terminal.

Zero infrastructure: pip install and go, no Docker/databases/Redis
Terminal-first: Designed for shell workflows and pipelines
MCP server: Give AI assistants direct access to web scraping
Clean markdown: Playwright renders JS, outputs readable markdown
LLM-ready: Built-in extraction with Ollama, OpenAI, or Anthropic
Anti-bot protection: Three-tier engine system (Playwright, Patchright, Camoufox) with automatic HTTP/2 fallback
PDF parsing: Auto-detect PDF URLs, extract text with optional OCR
Mobile emulation: Scrape as any mobile device using Playwright device descriptors

pip install supacrawl
playwright install chromium

Quick Start

# Scrape a page to markdown
supacrawl scrape https://example.com

# Crawl a website
supacrawl crawl https://docs.python.org/3/ -o ./python-docs --limit 50

# Discover URLs without fetching
supacrawl map https://example.com

# Web search
supacrawl search "python web scraping 2024"

# LLM extraction (requires LLM config)
supacrawl llm-extract https://example.com/pricing -p "Extract pricing tiers"

# Autonomous agent for complex tasks
supacrawl agent "Find the pricing for all plans on example.com"

MCP Server

Supacrawl includes an embedded Model Context Protocol (MCP) server, giving AI assistants like Claude, Cursor, and VS Code Copilot direct access to web scraping.

Install

pip install "supacrawl[mcp]"
playwright install chromium

Add to your MCP client

Claude Desktop: on macOS, edit ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "supacrawl": {
      "command": "supacrawl-mcp",
      "args": ["--transport", "stdio"]
    }
  }
}

Claude Code: add to .mcp.json in your project root:

{
  "mcpServers": {
    "supacrawl": {
      "command": "supacrawl-mcp",
      "args": ["--transport", "stdio"]
    }
  }
}

Cursor / VS Code: add to your editor's MCP settings with the same config.
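
If your MCP client config already lists other servers, you may prefer to merge the Supacrawl entry in rather than paste over the file. The following is a minimal Python sketch of that idea, not part of Supacrawl itself; the helper names `merge_supacrawl_entry` and `update_config_file` are hypothetical, and only the `command` and `args` values come from the config shown above.

```python
import json
from pathlib import Path

# Server entry copied from the config shown above.
SUPACRAWL_ENTRY = {
    "command": "supacrawl-mcp",
    "args": ["--transport", "stdio"],
}


def merge_supacrawl_entry(config: dict) -> dict:
    """Return a copy of config with the supacrawl server entry added.

    Other servers and unrelated top-level keys are preserved, and the
    input dict is left unmodified.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["supacrawl"] = dict(SUPACRAWL_ENTRY)
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path) -> None:
    """Read an MCP config file if it exists, merge the entry, write it back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merge_supacrawl_entry(existing), indent=2) + "\n")
```

For Claude Code this might be called as `update_config_file(Path(".mcp.json"))` from the project root. Back up the file first if it contains settings you care about.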

Available Tools

Tool	Description
`supacrawl_scrape`	Scrape a URL to markdown, HTML, screenshot, or PDF
`supacrawl_crawl`	Crawl multiple pages from a site
`supacrawl_map`	Discover URLs on a website without fetching content
`supacrawl_search`	Web search with multi-provider fallback
`supacrawl_extract`	Scrape pages for LLM-powered structured extraction
`supacrawl_summary`	Scrape a page for LLM-powered summarisation
`supacrawl_diagnose`	Diagnose scraping issues (CDN, bot detection, etc.)
`supacrawl_health`	Server health check and capability report

The CLI's agent command is intentionally omitted. When used via MCP, your LLM orchestrates the primitives directly; it is the agent. For standalone agentic workflows, use supacrawl agent from the CLI.

The server also exposes MCP resources (format references, search providers, capabilities) and prompts (workflow guides for scraping, extraction, research, and error handling).

Environment Variables

Pass environment variables via your MCP client config to customise behaviour:

{
  "mcpServers": {
    "supacrawl": {
      "command": "supacrawl-mcp",
      "args": ["--transport", "stdio"],
      "env": {
        "SUPACRAWL_ENGINE": "camoufox",
        "SUPACRAWL_STEALTH": "true",
        "SUPACRAWL_LOCALE": "en-AU",
        "SUPACRAWL_TIMEZONE": "Australia/Sydney",
        "BRAVE_API_KEY": "your-key-here",
        "TAVILY_API_KEY": "your-key-here",
        "SUPACRAWL_SEARCH_PROVIDERS": "brave,tavily"
      }
    }
  }
}

Every environment variable listed under Configuration applies here too. The MCP server also supports SUPACRAWL_LOG_LEVEL (default: INFO). Search providers fall back automatically when one hits a rate limit or quota.

Troubleshooting

If scrapes return empty or minimal content, use supacrawl_diagnose to identify the cause (CDN protection, JS framework, bot detection). Common fixes:

Set wait_for=3000 for JS-heavy sites (enables SPA stability polling)
Use wait_until="load" or "networkidle" if resources must fully load
Enable SUPACRAWL_STEALTH=true for bot-protected sites
Try only_main_content=false if the wrong content is extracted

Optional Extras

pip install "supacrawl[mcp,stealth]"    # Patchright anti-bot evasion (Tier 2)
pip install "supacrawl[mcp,camoufox]"   # Camoufox for Akamai/Cloudflare (Tier 3)
pip install "supacrawl[mcp,captcha]"    # 2Captcha CAPTCHA solving

REST API

Supacrawl includes an optional REST API server compatible with the Firecrawl v2 protocol. Any tool that already integrates with Firecrawl (n8n, LangChain, LlamaIndex) can use Supacrawl as a self-hosted drop-in backend.

pip install "supacrawl[api]"
supacrawl serve

The server starts on port 8308 by default. Test it:

curl -X POST http://localhost:8308/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Endpoints

Endpoint	Method	Description
`/scrape`	POST	Scrape a single URL (synchronous)
`/crawl`	POST	Start a crawl job (async, returns job ID)
`/crawl/{id}`	GET	Poll crawl job status and results
`/map`	POST	Discover URLs on a site (synchronous)
`/search`	POST	Web search (synchronous)
`/extract`	POST	LLM extraction (async, returns job ID)
`/batch/scrape`	POST	Batch scrape multiple URLs (async)
`/supacrawl/health`	GET	Server health and version
`/supacrawl/diagnose`	POST	Pre-scrape diagnostics
`/supacrawl/summary`	POST	Summarise a page
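
The async endpoints in the table return a job ID that the client then polls. A rough Python sketch of such a polling loop follows. It is an illustration only: the `status` field and the `"completed"` and `"failed"` values are assumptions modelled on Firecrawl-style responses and are not confirmed here, and `poll_job` and `fetch_status` are hypothetical names. Check docs/api-reference.md for the actual response shape.

```python
import time


def poll_job(fetch_status, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Call fetch_status() until the job reports a terminal state.

    fetch_status is any zero-argument callable returning the parsed JSON
    of a status request, for example a GET on /crawl/{id}.
    """
    for _ in range(max_attempts):
        job = fetch_status()
        # Assumed field name and terminal values; adjust to the real API.
        if job.get("status") in ("completed", "failed"):
            return job
        sleep(interval)
    raise TimeoutError("job did not finish within the polling budget")
```

Passing the status fetcher in as a callable keeps the loop independent of any HTTP library and makes it easy to test with canned responses.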

Authentication is optional. Set SUPACRAWL_API_KEY to require a Bearer token; leave it unset for open access.
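
As a hedged illustration of calling the API from Python, the sketch below builds the same POST /scrape request as the curl example and attaches a Bearer token when one is supplied. It uses only the standard library. The function names are hypothetical, and the structure of the JSON response is not assumed.

```python
import json
import urllib.request


def build_scrape_request(url, base="http://localhost:8308", api_key=None):
    """Build (but do not send) a POST /scrape request."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when the server was started with SUPACRAWL_API_KEY set.
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/scrape", data=body, headers=headers, method="POST"
    )


def send(request, timeout=60):
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With the server running, `send(build_scrape_request("https://example.com"))` would perform the same call as the curl command above.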

See docs/api-reference.md for full endpoint documentation, request/response examples, and an n8n integration guide.

Commands

Command	Description
`scrape <url>`	Scrape single page to markdown
`crawl <url>`	Crawl website, save to directory
`map <url>`	Discover URLs from sitemap/links
`search <query>`	Web search with multi-provider fallback
`llm-extract <url>`	Extract structured data with LLM
`agent <prompt>`	Autonomous agent for complex tasks
`serve`	Start the REST API server
`cache`	Cache management (clear, stats, prune)

Run supacrawl <command> --help for options.

Output

Crawl produces a flat directory of markdown files:

output/
├── manifest.json          # URLs crawled (for resume)
├── index.md
├── about.md
└── docs_getting-started.md

Each markdown file includes YAML frontmatter with source URL and metadata.
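
To post-process crawl output, you might separate the frontmatter from the markdown body. The sketch below is a minimal standard-library approach that handles only flat `key: value` lines; the helper name `split_frontmatter` is hypothetical, and the exact frontmatter keys Supacrawl writes are not specified here. For nested YAML, a real YAML parser would be the safer choice.

```python
def split_frontmatter(text):
    """Split a markdown document into (metadata dict, body).

    Only flat "key: value" lines between the opening and closing "---"
    delimiters are parsed; anything else in the block is ignored.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    for end in range(1, len(lines)):
        if lines[end].strip() == "---":
            break
    else:
        return {}, text  # no closing delimiter; treat as plain markdown
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and key.strip():
            meta[key.strip()] = value.strip().strip("\"'")
    body = "\n".join(lines[end + 1:])
    return meta, body
```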

Configuration

Core Settings

Variable	Default	Description
`SUPACRAWL_HEADLESS`	`true`	Set `false` to see browser
`SUPACRAWL_TIMEOUT`	`30000`	Page load timeout (ms)
`SUPACRAWL_ENGINE`	`playwright`	Browser engine: `playwright`, `patchright`, `camoufox`
`SUPACRAWL_PROXY`	-	Proxy URL (http/socks5)
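
To make the defaults in the table concrete, here is a small Python sketch that reads the four core variables from a mapping and applies the documented defaults. It is not Supacrawl's own loader: `load_core_settings` is a hypothetical name, and treating any value other than `false` as headless mode is an assumption.

```python
import os

# Engine names from the table above.
ENGINES = ("playwright", "patchright", "camoufox")


def load_core_settings(env=None):
    """Resolve core settings from an environment mapping with table defaults."""
    env = os.environ if env is None else env
    engine = env.get("SUPACRAWL_ENGINE", "playwright")
    if engine not in ENGINES:
        raise ValueError(f"unknown engine {engine!r}; expected one of {ENGINES}")
    return {
        # Assumption: only the literal "false" (any case) disables headless mode.
        "headless": env.get("SUPACRAWL_HEADLESS", "true").strip().lower() != "false",
        "timeout_ms": int(env.get("SUPACRAWL_TIMEOUT", "30000")),
        "engine": engine,
        "proxy": env.get("SUPACRAWL_PROXY") or None,
    }
```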

LLM Features

Required for llm-extract, agent, and --summarize:

Variable	Description
`SUPACRAWL_LLM_PROVIDER`	`ollama`, `openai`, or `anthropic`
`SUPACRAWL_LLM_MODEL`	Model name (e.g., `qwen3:8b`)
`OPENAI_API_KEY`	For OpenAI provider
`ANTHROPIC_API_KEY`	For Anthropic provider
`OLLAMA_HOST`	Ollama URL (default: `localhost:11434`)

Search

Supacrawl supports multiple search providers with automatic fallback. If the primary provider hits a rate limit or quota, the next provider in the chain is tried automatically.

Variable	Default	Description
`BRAVE_API_KEY`	-	Brave Search API key (recommended). Free tier: ~1,000 searches/month. Get one at brave.com/search/api
`TAVILY_API_KEY`	-	Tavily API key. Supports web and news search
`SERPER_API_KEY`	-	Serper.dev API key. Google Search results
`SERPAPI_API_KEY`	-	SerpAPI API key. Google Search results
`EXA_API_KEY`	-	Exa.ai API key. Neural search for web and news
`SUPACRAWL_SEARCH_PROVIDERS`	`brave`	Comma-separated provider chain with fallback order (e.g., `brave,tavily,serper`)
`SUPACRAWL_SEARCH_RATE_LIMIT`	-	Override default rate limit (requests/second). Provider defaults: Brave 1/s, DuckDuckGo 0.5/s

Providers are tried in order. Set API keys for each provider you want to use; providers without keys are skipped. If no keys are configured, DuckDuckGo is used as a last-resort fallback.

Note: DuckDuckGo is a deprecated fallback. It has no official API and actively blocks automated access with CAPTCHA challenges. Configure at least one API-keyed provider for reliable search.
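
The fallback rules above can be sketched in Python as follows. This is one reading of the behaviour, not Supacrawl's implementation. The mapping of provider names to key variables is inferred from the table, the identifiers `serpapi` and `exa` are assumed (the source only shows `brave`, `tavily`, and `serper` as chain values), and `RateLimited`, `resolve_chain`, and `search_with_fallback` are hypothetical names.

```python
# Provider name -> API key variable, inferred from the table above.
# "serpapi" and "exa" are assumed identifiers.
PROVIDER_KEYS = {
    "brave": "BRAVE_API_KEY",
    "tavily": "TAVILY_API_KEY",
    "serper": "SERPER_API_KEY",
    "serpapi": "SERPAPI_API_KEY",
    "exa": "EXA_API_KEY",
}


class RateLimited(Exception):
    """Hypothetical signal that a provider hit its rate limit or quota."""


def resolve_chain(env):
    """Return the providers that would be tried, in order.

    Providers without an API key are skipped. If nothing usable remains,
    DuckDuckGo is the last-resort fallback.
    """
    requested = env.get("SUPACRAWL_SEARCH_PROVIDERS", "brave").split(",")
    chain = []
    for name in (p.strip().lower() for p in requested):
        key_var = PROVIDER_KEYS.get(name)
        if key_var and env.get(key_var):
            chain.append(name)
    return chain or ["duckduckgo"]


def search_with_fallback(query, chain, search_fn):
    """Try each provider in turn, moving on when one is rate limited."""
    for name in chain:
        try:
            return name, search_fn(name, query)
        except RateLimited:
            continue
    raise RuntimeError("all providers in the chain were rate limited")
```

Calling `resolve_chain(os.environ)` (after `import os`) shows which providers your current environment would actually use.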

Caching

Supacrawl caches scraped content locally for faster repeated requests. Enable with --max-age:

# Cache for 1 hour
supacrawl scrape https://example.com --max-age 3600

Variable	Default	Description
`SUPACRAWL_CACHE_DIR`	`~/.supacrawl/cache`	Cache directory

Cache Management:

supacrawl cache stats   # View cache size and entry count
supacrawl cache prune   # Remove expired entries
supacrawl cache clear   # Clear all cache (with confirmation)

Cache Behaviour:

No automatic eviction; run cache prune periodically to clean expired entries
No size limits; cache grows unbounded, use cache clear if disk space is a concern
Files are stored as <hash>.json, where the hash is the SHA-256 of the normalised URL
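
The naming scheme can be illustrated with a short Python sketch. The hashing step follows the description above, but the normalisation rules are an assumption (lowercase scheme and host, empty path treated as `/`, fragment dropped), so the filenames produced here may not match the ones Supacrawl writes. Both function names are hypothetical.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlsplit, urlunsplit


def normalise_url(url):
    """Illustrative normalisation only; Supacrawl's real rules may differ."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def cache_path(url, cache_dir="~/.supacrawl/cache"):
    """Return <cache_dir>/<sha256 of normalised URL>.json."""
    digest = hashlib.sha256(normalise_url(url).encode("utf-8")).hexdigest()
    return Path(cache_dir).expanduser() / f"{digest}.json"
```

Under these assumed rules, `https://Example.com/page#top` and `https://example.com/page` map to the same cache file.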

Optional Extras

pip install "supacrawl[stealth]"    # Patchright for anti-bot evasion (Tier 2)
pip install "supacrawl[camoufox]"   # Camoufox for Akamai/Cloudflare bypass (Tier 3)
pip install "supacrawl[captcha]"    # 2Captcha for CAPTCHA solving
pip install "supacrawl[pdf-ocr]"    # OCR support for scanned PDFs

Select the browser engine with --engine (playwright, patchright, camoufox) or set SUPACRAWL_ENGINE as a default. Use --stealth for Tier 2, --engine camoufox for Tier 3, and --solve-captcha for CAPTCHA-protected sites. CAPTCHA solving requires the CAPTCHA_API_KEY environment variable.

Copy .env.example to .env to configure.

System-Managed Playwright Browsers

Distributions like NixOS and Guix provide pre-built Playwright browser binaries. To use them, pin the Python package to match your system's browser version and set PLAYWRIGHT_BROWSERS_PATH:

pip install 'supacrawl' 'playwright==1.52.0'  # match your distro's version
export PLAYWRIGHT_BROWSERS_PATH=/nix/store/...-playwright-driver-browsers

Skip playwright install; your system already provides the binaries.

Development

# From source
conda env create -f environment.yaml && conda activate supacrawl
pip install -e ".[dev]"
playwright install chromium

# Quality checks
ruff check src/ && mypy src/
pytest -q -m "not e2e"

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                         Where Supacrawl Fits                                │
├─────────────────┬─────────────────┬─────────────────┬───────────────────────┤
│   Collection    │   Processing    │    Storage      │        Query          │
├─────────────────┼─────────────────┼─────────────────┼───────────────────────┤
│                 │                 │                 │                       │
│   supacrawl ────┼──► ragify ──────┼──► Qdrant ──────┼──► Claude Code        │
│                 │    LangChain    │    Chroma       │    Custom Agents      │
│   • scrape      │    LlamaIndex   │    Pinecone     │    RAG Apps           │
│   • crawl       │                 │    Weaviate     │                       │
│   • search      │   • chunk       │                 │                       │
│   • extract     │   • embed       │   • store       │   • retrieve          │
│                 │                 │   • index       │   • generate          │
│                 │                 │                 │                       │
└─────────────────┴─────────────────┴─────────────────┴───────────────────────┘

Supacrawl does one thing well: get clean markdown from the web.

Comparison

Feature	Supacrawl	crawl4ai	Firecrawl (self-hosted)	Firecrawl (cloud)
Infrastructure	`pip install`	`pip install`	Docker + PostgreSQL + Redis	Hosted API
MCP Server	Built-in (`[mcp]` extra)	Not included	Not included	Yes
Web Search	Built-in (6 providers with fallback)	Not included	Via SearXNG	Yes
LLM Providers	Ollama, OpenAI, Anthropic	Any via LiteLLM	OpenAI (Ollama experimental)	OpenAI
Intelligent Crawling	Yes (agent command)	Yes (adaptive crawling)	No	Yes (/agent)
Stealth/Anti-bot	Yes (3-tier: Patchright + Camoufox)	Yes (undetected browser)	No (Fire-engine is cloud-only)	Yes (Fire-engine)
PDF Parsing	Yes (text + OCR)	No	No	No
CAPTCHA Solving	Yes (2Captcha)	Optional (CapSolver)	No	No
Caching	Local files	Built-in	PostgreSQL	Managed
Licence	MIT	Apache-2.0	AGPL-3.0	AGPL-3.0
Cost	Free	Free	Free	Pay-per-use

Supacrawl is minimal and focused. crawl4ai is a feature-rich framework with adaptive crawling and chunking. Firecrawl is an API server for applications needing a scraping backend.

Licence

MIT
