Supacrawl serves a REST API via supacrawl serve. The API is compatible with the Firecrawl v2 protocol, so tools that speak Firecrawl (n8n, LangChain, LlamaIndex) work as drop-in backends. Supacrawl-native endpoints extend the surface with capabilities the v2 protocol does not cover.
Install Supacrawl with the API extra and start the server:
pip install supacrawl[api]
supacrawl serveThe server binds to 0.0.0.0:8308 by default. Scrape a page:
curl -s http://localhost:8308/scrape \
-H 'Content-Type: application/json' \
-d '{"url": "https://example.com"}' | python -m json.toolsupacrawl serve [--host 0.0.0.0] [--port 8308] [--reload]--hostsets the bind address (default0.0.0.0).--portsets the bind port (default8308).--reloadenables auto-reload for development.
Authentication uses a Bearer token configured via the SUPACRAWL_API_KEY environment variable.
- If
SUPACRAWL_API_KEYis set, every request (except/supacrawl/health) must include a matchingAuthorizationheader. - If
SUPACRAWL_API_KEYis not set, authentication is disabled and all requests pass through.
export SUPACRAWL_API_KEY=YOUR_KEY
curl http://localhost:8308/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_KEY' \
-d '{"url": "https://example.com"}'Scrape a single URL synchronously. Returns scraped content in the requested formats.
Request body:
{
"url": "https://example.com",
"formats": ["markdown", "html"],
"onlyMainContent": true,
"waitFor": 0,
"timeout": 30000,
"includeTags": [".article-body"],
"excludeTags": [".sidebar"],
"mobile": false,
"actions": [],
"location": {"country": "AU", "languages": ["en-AU"]},
"headers": {"Cookie": "session=abc"},
"maxAge": 60000,
"proxy": "basic",
"storeInCache": true
}| Field | Type | Default | Description |
|---|---|---|---|
url |
string | required | URL to scrape |
formats |
string[] | null |
Output formats. Supacrawl extras: "images", "branding", "pdf" |
onlyMainContent |
boolean | true |
Extract main content area only |
waitFor |
integer | 0 |
Additional wait time in milliseconds after page load |
timeout |
integer | 30000 |
Page load timeout in milliseconds |
includeTags |
string[] | null |
CSS selectors for elements to include |
excludeTags |
string[] | null |
CSS selectors for elements to exclude |
mobile |
boolean | null |
Emulate a mobile viewport |
actions |
array | null |
Page actions (click, scroll, wait) |
location |
object | null |
Locale settings: { "country": "AU", "languages": ["en-AU"] } |
headers |
object | null |
Custom HTTP request headers |
maxAge |
integer | null |
Cache freshness in milliseconds (divided by 1000 for service layer) |
proxy |
string/boolean | null |
"basic", "enhanced", "auto" map to true; a URL string is passed through |
storeInCache |
boolean | null |
false bypasses cache entirely (sets maxAge to 0) |
Fields not listed here (e.g. skipTlsVerification, blockAds, removeBase64Images) are accepted silently and ignored.
Response body:
{
"success": true,
"data": {
"markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
"html": "<h1>Example Domain</h1>...",
"rawHtml": "<!doctype html>...",
"links": ["https://www.iana.org/domains/example"],
"screenshot": null,
"metadata": {
"title": "Example Domain",
"description": null,
"sourceURL": "https://example.com",
"url": "https://example.com",
"statusCode": 200,
"language": "en"
},
"actions": null,
"branding": null,
"changeTracking": null
}
}Example:
curl -s http://localhost:8308/scrape \
-H 'Content-Type: application/json' \
-d '{
"url": "https://example.com",
"formats": ["markdown", "links"],
"onlyMainContent": true
}'Start an asynchronous crawl job. Returns a job ID immediately; use GET /crawl/{id} to poll for results.
Request body:
{
"url": "https://docs.example.com",
"limit": 100,
"maxDiscoveryDepth": 3,
"includePaths": ["/docs/.*"],
"excludePaths": ["/changelog/.*"],
"allowExternalLinks": false,
"allowSubdomains": false,
"maxConcurrency": 10,
"ignoreQueryParameters": false,
"scrapeOptions": {
"formats": ["markdown"],
"onlyMainContent": true
}
}| Field | Type | Default | Description |
|---|---|---|---|
url |
string | required | Starting URL for the crawl |
limit |
integer | 10000 |
Maximum pages to crawl |
maxDiscoveryDepth |
integer | 3 |
Maximum crawl depth |
includePaths |
string[] | null |
Regex patterns on URL path to include |
excludePaths |
string[] | null |
Regex patterns on URL path to exclude |
allowExternalLinks |
boolean | false |
Follow links to external domains |
allowSubdomains |
boolean | false |
Follow links to subdomains |
maxConcurrency |
integer | 10 |
Maximum concurrent requests |
ignoreQueryParameters |
boolean | false |
Deduplicate URLs differing only by query params |
scrapeOptions |
object | null |
Nested scrape options (same fields as POST /scrape minus url) |
Response body:
{
"success": true,
"id": "550e8400-e29b-41d4-a716-446655440000"
}Example:
curl -s http://localhost:8308/crawl \
-H 'Content-Type: application/json' \
-d '{
"url": "https://docs.example.com",
"limit": 50,
"maxDiscoveryDepth": 2
}'Poll the status of a crawl job. Results are paginated; follow the next URL for additional pages.
Query parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
offset |
integer | 0 |
Pagination offset into the results list |
Response body:
{
"status": "scraping",
"total": 50,
"completed": 12,
"data": [
{
"markdown": "# Page Title\n...",
"metadata": {
"title": "Page Title",
"sourceURL": "https://docs.example.com/page",
"statusCode": 200
}
}
],
"next": "http://localhost:8308/crawl/550e8400-e29b-41d4-a716-446655440000?offset=10"
}Example:
curl -s http://localhost:8308/crawl/550e8400-e29b-41d4-a716-446655440000Cancel a running crawl job.
Response body:
{
"status": "cancelled",
"total": 50,
"completed": 12,
"data": []
}Example:
curl -s -X DELETE http://localhost:8308/crawl/550e8400-e29b-41d4-a716-446655440000Discover URLs on a website synchronously. Returns a list of discovered links with metadata.
Request body:
{
"url": "https://example.com",
"limit": 100,
"search": "pricing",
"sitemap": "include",
"includeSubdomains": false,
"ignoreQueryParameters": false,
"ignoreCache": false,
"timeout": 30000
}| Field | Type | Default | Description |
|---|---|---|---|
url |
string | required | Starting URL to map |
limit |
integer | 5000 |
Maximum URLs to discover |
search |
string | null |
Filter URLs by relevance to this term |
sitemap |
string | "include" |
Sitemap handling: "skip", "include", or "only" |
includeSubdomains |
boolean | false |
Include subdomain URLs |
ignoreQueryParameters |
boolean | false |
Remove query parameters from URLs |
ignoreCache |
boolean | false |
Bypass cached map results |
timeout |
integer | 30000 |
Timeout in milliseconds |
Response body:
{
"success": true,
"links": [
{"url": "https://example.com/about", "title": "About Us", "description": "Learn more..."},
{"url": "https://example.com/pricing", "title": "Pricing", "description": null}
]
}Example:
curl -s http://localhost:8308/map \
-H 'Content-Type: application/json' \
-d '{"url": "https://example.com", "limit": 100}'Search the web synchronously. Results are bucketed by source type (web, images, news).
Request body:
{
"query": "python web scraping tutorial",
"limit": 5,
"sources": [{"type": "web"}],
"timeout": 30000,
"scrapeOptions": null
}| Field | Type | Default | Description |
|---|---|---|---|
query |
string | required | Search query |
limit |
integer | 5 |
Maximum results per source type |
sources |
array | [{"type": "web"}] |
Source types. Accepts v2 objects [{"type": "web"}] or plain strings ["web"]. Valid types: "web", "images", "news" |
timeout |
integer | 30000 |
Timeout in milliseconds |
scrapeOptions |
object | null |
Nested scrape options for fetching result page content |
Response body:
{
"success": true,
"data": {
"web": [
{"title": "Web Scraping with Python", "url": "https://...", "description": "A guide...", "markdown": null}
],
"images": [
{"title": "Scraping diagram", "imageUrl": "https://...", "url": "https://..."}
],
"news": [
{"title": "Python 3.13 Released", "url": "https://...", "snippet": "The latest..."}
]
}
}Example:
curl -s http://localhost:8308/search \
-H 'Content-Type: application/json' \
-d '{"query": "python web scraping", "limit": 3}'Start an asynchronous LLM extraction job. Returns a job ID immediately; use GET /extract/{id} to poll for results.
Request body:
{
"urls": ["https://example.com/products"],
"prompt": "Extract product names and prices",
"schema": {
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"}
}
},
"scrapeOptions": null
}| Field | Type | Default | Description |
|---|---|---|---|
urls |
string[] | required | URLs to extract data from |
prompt |
string | null |
Extraction prompt describing what to extract |
schema |
object | null |
JSON Schema for structured output |
scrapeOptions |
object | null |
Nested scrape options |
Response body:
{
"success": true,
"id": "660e8400-e29b-41d4-a716-446655440001"
}Example:
curl -s http://localhost:8308/extract \
-H 'Content-Type: application/json' \
-d '{
"urls": ["https://example.com/about"],
"prompt": "Extract the company name and founding year"
}'Poll the status of an extract job.
Response body:
{
"success": true,
"status": "completed",
"data": [
{"name": "Example Corp", "founding_year": 2010}
],
"error": null
}Example:
curl -s http://localhost:8308/extract/660e8400-e29b-41d4-a716-446655440001Start an asynchronous batch scrape job. Scrapes multiple URLs with the same options. Returns a job ID; use GET /batch/scrape/{id} to poll.
Request body:
{
"urls": ["https://example.com/page1", "https://example.com/page2"],
"formats": ["markdown"],
"onlyMainContent": true,
"waitFor": 0,
"timeout": 30000
}| Field | Type | Default | Description |
|---|---|---|---|
urls |
string[] | required | URLs to scrape |
All other fields are the same as POST /scrape (formats, onlyMainContent, waitFor, timeout, includeTags, excludeTags, mobile, actions, location, headers, maxAge, proxy, storeInCache), applied to every URL in the batch.
Response body:
{
"success": true,
"id": "770e8400-e29b-41d4-a716-446655440002"
}Example:
curl -s http://localhost:8308/batch/scrape \
-H 'Content-Type: application/json' \
-d '{
"urls": ["https://example.com", "https://example.org"],
"formats": ["markdown"]
}'Poll the status of a batch scrape job. Results are paginated; follow the next URL for additional pages.
Query parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
offset |
integer | 0 |
Pagination offset into the results list |
Response body:
{
"status": "completed",
"total": 2,
"completed": 2,
"data": [
{
"markdown": "# Example Domain\n...",
"metadata": {"title": "Example Domain", "sourceURL": "https://example.com", "statusCode": 200}
},
{
"markdown": "# Example Org\n...",
"metadata": {"title": "Example Org", "sourceURL": "https://example.org", "statusCode": 200}
}
],
"next": null
}Example:
curl -s http://localhost:8308/batch/scrape/770e8400-e29b-41d4-a716-446655440002Credential verification stub. n8n's Firecrawl node tests credentials by hitting this endpoint. Supacrawl is self-hosted, so credits are always zero.
Response body:
{
"success": true,
"data": {"credits": 0}
}Example:
curl -s http://localhost:8308/team/credit-usage \
-H 'Authorization: Bearer YOUR_KEY'Health check returning version, uptime, and status. This endpoint never requires authentication.
Response body:
{
"success": true,
"version": "2026.3.1",
"status": "healthy",
"uptime_seconds": 3742
}Example:
curl -s http://localhost:8308/supacrawl/healthRun pre-scrape diagnostics on a URL. Reports CDN, bot protection, JavaScript requirements, and other characteristics useful for choosing scrape settings.
Request body:
{
"url": "https://example.com"
}Example:
curl -s http://localhost:8308/supacrawl/diagnose \
-H 'Content-Type: application/json' \
-d '{"url": "https://protected-site.com"}'Scrape a URL and return a summary of its content.
Request body:
{
"url": "https://example.com/article",
"maxLength": 500,
"focus": "key findings"
}| Field | Type | Default | Description |
|---|---|---|---|
url |
string | required | URL to summarise |
maxLength |
integer | null |
Maximum summary length |
focus |
string | null |
Focus area for the summary |
Example:
curl -s http://localhost:8308/supacrawl/summary \
-H 'Content-Type: application/json' \
-d '{"url": "https://example.com/article", "focus": "conclusions"}'The crawl, extract, and batch scrape endpoints are asynchronous. They follow a consistent lifecycle:
- Submit a POST request. The server returns
{"success": true, "id": "<job-id>"}immediately. - Poll with GET using the job ID. The response includes a
statusfield. - Status transitions:
scraping(in progress),completed(finished successfully),failed(error occurred),cancelled(cancelled via DELETE).
# 1. Start a crawl
JOB_ID=$(curl -s http://localhost:8308/crawl \
-H 'Content-Type: application/json' \
-d '{"url": "https://example.com", "limit": 10}' | python -c "import sys,json; print(json.load(sys.stdin)['id'])")
# 2. Poll until complete
curl -s "http://localhost:8308/crawl/$JOB_ID"
# 3. Cancel if needed
curl -s -X DELETE "http://localhost:8308/crawl/$JOB_ID"A maximum of 3 concurrent async jobs (crawl, extract, and batch scrape combined) are allowed. Additional submissions return HTTP 429. This is configurable via SUPACRAWL_API_MAX_JOBS.
GET endpoints for crawl and batch scrape paginate results (up to 10 MB per response). When more results are available, the response includes a next URL to fetch the next page.
Jobs are stored in memory and expire after a configurable TTL (default 24 hours). Jobs are lost on server restart.
| Variable | Default | Description |
|---|---|---|
SUPACRAWL_API_KEY |
unset | Bearer token for authentication. When unset, auth is disabled |
SUPACRAWL_API_HOST |
0.0.0.0 |
Server bind address |
SUPACRAWL_API_PORT |
8308 |
Server bind port |
SUPACRAWL_API_JOB_TTL |
86400 |
Async job expiry in seconds (default 24 hours) |
SUPACRAWL_API_MAX_JOBS |
3 |
Maximum concurrent async jobs |
n8n has a built-in Firecrawl node. Point it at your local Supacrawl server:
-
Install and start Supacrawl:
pip install supacrawl[api] export SUPACRAWL_API_KEY=YOUR_KEY supacrawl serve -
In n8n, add a Firecrawl credential:
- API Key:
YOUR_KEY - API URL:
http://localhost:8308
- API Key:
-
Add a Firecrawl node to your workflow. It will use Supacrawl as its backend.
n8n verifies credentials by calling GET /team/credit-usage. Supacrawl returns a valid response, so the credential test passes.
All errors use a consistent envelope:
{
"success": false,
"error": "Human-readable error message"
}| Code | Meaning |
|---|---|
| 200 | Success |
| 400 | Bad request (invalid input). FastAPI's 422 validation errors are remapped to 400 with the standard error envelope |
| 401 | Missing or invalid API key |
| 404 | Job not found |
| 429 | Too many concurrent async jobs |
| 500 | Internal server error |
- The API accepts camelCase field names (matching the Firecrawl v2 protocol) and translates to snake_case internally.
- Unsupported v2 fields are accepted silently and ignored. Clients do not need modification.
- CORS is permissive (all origins) by default, suitable for local and homelab deployments.