A modular, YAML-driven web scraper framework. Describe any scraping target in a config file — Scrapit handles fetching, parsing, transforming, validating, and storing the data.
No code required for new targets. Just write a YAML.
| Feature | Description |
|---|---|
| YAML directives | Declarative scrape configs — selectors, transforms, validation, cache |
| Five backends | BeautifulSoup, Playwright (JS), httpx (async), GraphQL, Bright Data |
| Fallback selectors | Per-field list of CSS selectors tried in order |
| XPath support | Use xpath: prefix in any selector (requires lxml) |
all: true |
Extract all matches for a selector, not just the first |
| Pagination | Follow "next page" links automatically |
| Spider mode | Discover and scrape all linked pages from an index |
| Parallel spider | Set follow.parallel: 10 for concurrent async fetching with httpx |
| Incremental spider | follow.incremental: true — skip previously visited URLs across runs |
| Multi-site | Scrape multiple URLs with the same spec in one directive |
| Transform pipeline | 28+ declarative field transforms: strip, regex, date, hash, boolean… |
| Validation | Per-field rules: required, type, min/max, pattern, enum |
| Eight output backends | JSON, CSV, SQLite, MongoDB, PostgreSQL, Excel, Google Sheets, Parquet |
| HTTP cache | File-based or Redis-backed cache with TTL |
| Proxy rotation | Round-robin/random pool with per-proxy failure tracking |
| Stealth mode | Playwright fingerprint randomisation — UA, viewport, locale, timezone |
| Change detection | Diff result against previous run, fire webhook on change |
| Webhook notifications | POST JSON payload to Slack/Discord when changes detected |
| Built-in scheduler | schedule: "*/30 * * * *" + scrapit run daemon |
| Streaming output | --stream emits NDJSON lines as each spider page completes |
| Backend export | scrapit export --from sqlite --to csv — migrate between backends |
| Web dashboard | scrapit serve — browse results, run directives, download output |
| Stats reporter | Field coverage %, timing, error count per run |
| Hook system | Register callbacks for scrape lifecycle events |
| Plugin system | Publish custom transforms/backends as pip packages via entry_points |
| Async queue | RabbitMQ producer/consumer for background processing |
| Structured logging | Console + output/scraper.log |
git clone <repo-url>
cd scrapit
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# if you use the playwright backend:
playwright install chromiumCopy and fill in your credentials:
cp scraper/.env.example .env# MongoDB (optional)
MONGO_URI=mongodb+srv://user:pass@cluster/
MONGO_DATABASE=mydb
MONGO_COLLECTION=scraped
# RabbitMQ (optional)
RABBITMQ_HOST=localhost
RABBITMQ_PORT=5672
RABBITMQ_USER=guest
RABBITMQ_PASS=guest
# Webhook notifications (optional)
SCRAPIT_WEBHOOK_URL=https://hooks.example.com/...# create a new directive interactively
scrapit init
# scrape Wikipedia, save to JSON
scrapit scrape wikipedia --json
# scrape Hacker News (paginated), save to SQLite
scrapit scrape hn --sqlite
# spider Books to Scrape, preview only
scrapit scrape books --preview
# stream results as they arrive (spider mode)
scrapit scrape blog --json --stream
# scrape all directives in the default folder
scrapit batch --json
# list available directives
scrapit list
# open the web dashboard
scrapit serve
# run a scheduled directive as a daemon
scrapit run hn --json
# export SQLite → CSV
scrapit export --from sqlite --to csv --directive hn
# check environment
scrapit doctorpython -m scraper.main initGuides you through a series of prompts and generates a ready-to-edit YAML in scraper/directives/:
? Site URL: https://news.ycombinator.com
? Scraping backend (beautifulsoup/playwright) [beautifulsoup]:
? Output file name (without .yaml): hackernews
? Fields to scrape (comma-separated, e.g. titles, links, scores): titles, links, scores
→ Created scraper/directives/hackernews.yaml
Next steps:
1. Open scraper/directives/hackernews.yaml and replace each 'FIXME' with a real CSS selector.
2. Run: python -m scraper.main scrape hackernews --preview
3. Save results: python -m scraper.main scrape hackernews --json
Each field is stubbed with a FIXME placeholder — open the file, fill in your CSS selectors, and you're ready to scrape.
scrapit scrape <directive> [--json|--csv|--sqlite|--mongo|--postgres|--excel|--sheets|--parquet] [--preview] [--diff] [--stream]<directive> can be a name (wikipedia), filename (wikipedia.yaml), or path.
| Flag | Description |
|---|---|
--json |
Save to output/<name>.json (default) |
--csv |
Append to output/<name>.csv |
--sqlite |
Save to output/scrapit.db |
--mongo |
Save to MongoDB |
--postgres |
Save to PostgreSQL |
--excel |
Append to output/<name>.xlsx |
--sheets |
Append to Google Sheets (requires --sheets-id) |
--parquet |
Save to output/<name>.parquet |
--format |
JSON format: pretty (indented, default) or compact (minified) |
--preview |
Print result, do not save |
--diff |
Compare with previous JSON output and show changes |
--stream |
Emit NDJSON lines to stdout as each spider page completes |
--resume |
Resume interrupted spider/paginated scrape from checkpoint |
--reset-state |
Clear incremental spider state for this directive |
--timeout N |
Per-request timeout in seconds (overrides directive setting) |
scrapit batch [folder] [--json|--csv|--sqlite|--mongo|--excel] [--preview] [--diff]Default folder: scraper/directives/
scrapit list [--dir path/to/folder]Shows site, backend, fields, transforms, validation rules, cache, and schedule config.
scrapit run <directive> [--json|--sqlite|...]Reads the schedule: key from the directive YAML and runs it repeatedly on that schedule. Supports cron expressions (requires croniter) or simple intervals like 5m, 1h.
site: https://news.ycombinator.com
use: beautifulsoup
schedule: "*/30 * * * *" # every 30 minutes
scrape:
titles: ['.titleline > a', {attr: text, all: true}]scrapit export --from sqlite --to csv --directive hn
scrapit export --from sqlite --to mongo --directive product --since 2026-01-01
scrapit export --from json --to parquet --directive wikipedia
scrapit export --from sqlite --to csv --all # all directivesscrapit suggest-selectors https://books.toscrape.com --fields "title,price,rating"Fetches the page and asks Claude to suggest the best CSS selectors for each field. Outputs a ready-to-paste scrape: block. Requires pip install anthropic.
scrapit share wikipediaCreates a GitHub issue in the Scrapit repo with your directive YAML, making it available to everyone. Requires the gh CLI authenticated.
scrapit ai-init https://news.ycombinator.com --name hackernews
scrapit ai-init https://books.toscrape.com --fields "title,price,rating"Fetches the page, sends the content to Claude, and generates a ready-to-use YAML directive. Requires pip install anthropic and ANTHROPIC_API_KEY in your environment.
scrapit query --backend sqlite --limit 10
scrapit query --directive wikipedia
scrapit query --url wikipedia.orgscrapit cache stats # show cache size and entry count
scrapit cache clear # delete all cached responses
scrapit cache invalidate --url https://example.comscrapit diff old.json new.json
scrapit diff old.json new.json --key url # use URL as record key
scrapit diff old.json new.json --summary # counts only, no detailscrapit validate wikipedia # check required keys, transforms, selectorsscrapit serve # opens http://127.0.0.1:7331
scrapit serve --host 0.0.0.0 --port 8080 --no-browserscrapit doctor # checks all optional/required dependenciesVS Code autocomplete: add this line to the top of any directive YAML for inline docs and validation:
# yaml-language-server: $schema=https://raw.githubusercontent.com/joaobenedetmachado/scrapit/main/scrapit.schema.json
site: https://example.com
use: beautifulsoup # or playwright
scrape:
field_name:
- 'css-selector'
- attr: text # 'text' = inner text, or any HTML attribute (href, src, …)site: https://example.com
use: beautifulsoup # beautifulsoup | playwright | httpx | graphql | brightdata
# ── Mode ─────────────────────────────────────────────────────────────────────
mode: single # single (default) | spider
# ── Multiple sites (same scrape spec applied to each) ────────────────────────
sites:
- https://example.com/page-1
- https://example.com/page-2
# ── Request options ───────────────────────────────────────────────────────────
retries: 3 # HTTP retries with exponential backoff (bs4)
timeout: 15 # seconds (bs4) or milliseconds (playwright)
delay: 1.0 # seconds between requests (rate limiting)
headers: # extra HTTP headers
Authorization: Bearer ${TOKEN} # ${VAR} is interpolated from environment
cookies: # bs4: dict | playwright: list of {name,value,domain}
session_id: abc123
proxy: http://proxy:8080 # or: brightdata (uses BRIGHTDATA_* env vars)
respect_robots: true # check robots.txt before fetching (bs4 only)
# ── Proxy pool (rotation) ─────────────────────────────────────────────────────
proxies:
- http://proxy1:8080
- http://proxy2:8080
proxy_strategy: round_robin # round_robin (default) | random
# ── Throttle ─────────────────────────────────────────────────────────────────
throttle:
requests_per_second: 2
jitter: 0.5 # adds random 0–0.5s extra delay
# ── Cache ─────────────────────────────────────────────────────────────────────
cache:
ttl: 3600 # seconds (0 = disabled)
backend: file # file (default) | redis
key_prefix: scrapit: # Redis only
# ── Schedule (used by `scrapit run` daemon) ───────────────────────────────────
schedule: "*/30 * * * *" # cron expression, or simple: 5m, 1h
# ── Playwright-only ───────────────────────────────────────────────────────────
wait_for: '#content' # wait for selector before parsing
screenshot: true # save full-page screenshot to output/
stealth: true # randomise UA, viewport, locale, navigator fingerprint
# ── Scrape spec ───────────────────────────────────────────────────────────────
scrape:
title:
- 'h1' # single selector
- attr: text
image:
- ['img.hero', 'img.main', 'img'] # fallback selectors
- attr: src
all_links:
- 'a.result'
- attr: href
all: true # return list of all matches
# ── Pagination (bs4 only) ─────────────────────────────────────────────────────
paginate:
selector: 'a.next-page'
attr: href
max_pages: 5
# ── Spider mode ───────────────────────────────────────────────────────────────
follow:
selector: 'a.article-link'
attr: href
max: 50 # max pages to scrape
same_domain: true # stay on same domain
depth: 1 # link-following depth
incremental: true # skip URLs visited in previous runs (persistent state)
parallel: 5 # async concurrent fetching (requires httpx)
# ── Transform pipeline ────────────────────────────────────────────────────────
transform:
price:
- strip
- {replace: {"€": "", ",": "."}}
- float
title:
- strip
- upper
description:
- strip
- {slice: {end: 200}}
tags:
- {split: ","}
- first
# ── Validation ────────────────────────────────────────────────────────────────
validate:
title:
required: true
min_length: 2
max_length: 500
price:
type: float
min: 0
status:
in: [active, inactive, pending]
# ── Notifications ─────────────────────────────────────────────────────────────
notify:
webhook: https://hooks.slack.com/... # called when --diff detects changes| Transform | Argument | Description |
|---|---|---|
strip |
— | Strip leading/trailing whitespace |
lower / upper / title |
— | Change case |
capitalize |
— | First character upper, rest unchanged |
sentence_case |
— | First character upper, rest lower |
int / float |
— | Parse number (removes non-numeric chars, handles European notation) |
boolean |
— | "true"/"yes"/"1" → True, "false"/"no" → False |
count |
— | Length of a string or list |
regex |
pattern |
Extract first regex match |
regex_group |
{pattern, group} |
Extract specific capture group |
replace |
{old: new} |
String substitution (multiple pairs) |
split |
"," |
Split string into list |
join |
", " |
Join list into string |
first / last |
— | Pick first/last item from list |
default |
value |
Fallback if value is None |
slice |
{start, end} or N |
Substring / sublist |
prepend / append |
"str" |
Add text before/after |
remove_tags |
— | Strip HTML tags |
template |
"prefix {value}" |
String template with {value} or {other_field} |
slugify |
— | Convert text to a URL-friendly slug (Hello World → hello-world) |
truncate |
N |
Truncate to N characters without breaking words, appends ... |
normalize_whitespace |
— | Collapse multiple spaces/tabs into a single space and strip |
date |
— | Parse date string to ISO YYYY-MM-DD (auto-detects common formats) |
parse_date |
{input_format, output_format} |
Parse date with custom strptime format |
pad |
{width, char, side} |
Pad string to fixed width (pad: {width: 5, char: "0", side: left}) |
hash |
algorithm |
Hash value: md5, sha1, sha256, sha512 |
| Rule | Example | Description |
|---|---|---|
required |
true |
Must not be None |
type |
float |
Type check: str, int, float, list, bool |
not_empty |
true |
Must not be empty string/list |
min / max |
0 / 1000 |
Numeric range |
min_length / max_length |
2 / 500 |
String/list length |
pattern |
^\d{4}$ |
Regex must match |
in |
[a, b, c] |
Value must be in enum |
not_in |
[a, b, c] |
Value must NOT be in enum |
All outputs go to output/ at the project root.
| File | Description |
|---|---|
output/<name>.json |
Last scrape as JSON |
output/<name>.csv |
All scrapes in append-mode CSV |
output/scrapit.db |
SQLite database with all scrapes |
output/scraper.log |
Full log (also printed to console) |
output/<name>_<ts>.png |
Screenshot (Playwright + screenshot: true) |
PostgreSQL scrapes table |
All scrapes saved to PostgreSQL database |
scrapit/
scraper/
main.py CLI (scrape/batch/list/query/cache/export/run/serve/diff/validate/doctor…)
config.py Environment variables and paths
logger.py Logging → console + output/scraper.log
hooks.py Lifecycle hook registry
reporter.py Timing and field coverage stats
plugins.py Plugin loader — discovers transforms/backends via entry_points
proxy.py ProxyPool — round-robin / random rotation
colors.py ANSI color helpers for CLI output
dashboard.py FastAPI web dashboard (scrapit serve)
directives/ Built-in example directives
wikipedia.yaml
hn.yaml Hacker News (paginated)
books.yaml Books to Scrape (spider mode)
github_trending.yaml GitHub trending (all: true)
scrapers/
__init__.py Pipeline dispatcher
bs4_scraper.py BeautifulSoup + retry + proxy + cache
playwright_scraper.py Playwright + stealth mode
httpx_scraper.py httpx async backend
graphql_scraper.py GraphQL API backend
paginator.py Pagination support
spider.py Spider (incremental, parallel asyncio)
transforms/
__init__.py 28+ transform functions + plugin registry
validators/
__init__.py Validation engine
storage/
mongo.py MongoDB (lazy connect)
json_file.py JSON output
csv_file.py CSV output (append)
sqlite.py SQLite (zero-config, with read() for export)
excel.py Excel XLSX (append mode)
google_sheets.py Google Sheets live sync
postgres.py PostgreSQL
parquet_file.py Apache Parquet (pyarrow)
diff.py Change detection
cache/
__init__.py HTTP cache with TTL (file or Redis)
redis_cache.py Redis cache backend
integrations/
anthropic.py Anthropic SDK tools + agentic loop
openai.py OpenAI function calling + agent
langchain.py LangChain / CrewAI / LangGraph toolkit
llamaindex.py LlamaIndex reader
mcp.py MCP server (Claude Desktop / Cursor / Claude Code)
brightdata.py Bright Data Scraping Browser integration
notifications/
__init__.py Webhook notifications (Slack, Discord, custom)
queue/
producer.py RabbitMQ producer
consumer.py RabbitMQ consumer
output/ Generated data (gitignored)
.cache/ HTTP cache (gitignored)
pyproject.toml Extras: ui, anthropic, mcp, httpx, parquet, redis…
requirements.txt
.env
Register Python callbacks for scrape lifecycle events:
from scraper import hooks
@hooks.on("after_scrape")
def log_result(result, dados):
print(f"scraped {result['url']} — {len(result)} fields")
@hooks.on("on_change")
def alert(changes, result):
print(f"change in {result['url']}: {list(changes.keys())}")
@hooks.on("on_error")
def handle_error(exc, dados):
print(f"failed on {dados['site']}: {exc}")Available events: before_scrape, after_scrape, on_error, on_save, on_change
Scrapit integrates natively with every major AI agent framework. Give any agent the ability to scrape the web on demand — no boilerplate required.
The fastest way to add Scrapit to Claude:
# Claude Code
claude mcp add scrapit -- python -m scraper.integrations.mcpFor Claude Desktop, add to ~/Library/Application Support/Claude/claude_desktop_config.json:
{
"mcpServers": {
"scrapit": {
"command": "python",
"args": ["-m", "scraper.integrations.mcp"],
"cwd": "/path/to/scrapit"
}
}
}After adding, Claude will have 4 web scraping tools available automatically.
import anthropic
from scraper.integrations.anthropic import as_anthropic_tools, handle_tool_call
client = anthropic.Anthropic()
tools = as_anthropic_tools()
response = client.messages.create(
model="claude-opus-4-6",
max_tokens=1024,
tools=tools,
messages=[{"role": "user", "content": "What are the top posts on Hacker News?"}],
)
for block in response.content:
if block.type == "tool_use":
result = handle_tool_call(block.name, block.input)
# Or use the built-in agent loop:
from scraper.integrations.anthropic import ScrapitAnthropicAgent
agent = ScrapitAnthropicAgent(model="claude-opus-4-6")
answer = agent.run("Summarize the top 3 Hacker News posts today.")from scraper.integrations.langchain import ScrapitToolkit
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI
tools = ScrapitToolkit().get_tools()
# → [ScrapitTool, ScrapitPageTool, ScrapitSelectorTool]
agent = initialize_agent(
tools=tools,
llm=ChatOpenAI(model="gpt-4o"),
agent=AgentType.OPENAI_FUNCTIONS,
)
agent.run("What does the Wikipedia article on Python say?")Works with CrewAI — pass ScrapitToolkit().get_tools() to any Agent(tools=[...]).
from openai import OpenAI
from scraper.integrations.openai import as_openai_functions, handle_function_call
client = OpenAI()
tools = as_openai_functions()
response = client.chat.completions.create(
model="gpt-4o", tools=tools,
messages=[{"role": "user", "content": "Scrape the top GitHub trending repos."}],
)
# Or use the built-in agent loop:
from scraper.integrations.openai import ScrapitOpenAIAgent
agent = ScrapitOpenAIAgent(model="gpt-4o")
answer = agent.run("What are the trending Python repos on GitHub today?")from scraper.integrations.llamaindex import ScrapitReader
from llama_index.core import VectorStoreIndex
reader = ScrapitReader()
docs = reader.load_data(urls=["https://site1.com", "https://site2.com"]) # parallel
index = VectorStoreIndex.from_documents(docs)
engine = index.as_query_engine()
response = engine.query("Summarize the main points.")from scraper.integrations import scrape_url, scrape_page, scrape_with_selectors, scrape_many
# Clean text — ready to feed to an LLM
text = scrape_url("https://news.ycombinator.com")
# Structured metadata: title, description, links, word_count
page = scrape_page("https://example.com")
# Agent-defined extraction with CSS selectors — no YAML needed
data = scrape_with_selectors(
"https://books.toscrape.com/catalogue/a-light-in-the-attic_1000",
selectors={"title": "h1", "price": "p.price_color"},
)
# Parallel scraping
pages = scrape_many(["https://a.com", "https://b.com"], mode="page")
# Run a directive and get structured data
data = scrape_directive("wikipedia")All integration dependencies are lazy — Scrapit works without any of them installed. Install only what you need:
pip install anthropic # Anthropic SDK integration
pip install openai # OpenAI integration
pip install langchain-core # LangChain / CrewAI / LangGraph
pip install llama-index-core # LlamaIndex
pip install mcp # MCP server (Claude Desktop / Cursor / Claude Code)Send a directive to the background queue:
from scraper.queue.producer import call_producer
call_producer("directives/wikipedia.yaml")Start a consumer worker:
python -m scraper.queue.consumerWorkers scrape each received directive and save to MongoDB automatically.
import asyncio
from scraper.scrapers import grab_elements_by_directive
from scraper.storage import json_file
result = asyncio.run(grab_elements_by_directive("scraper/directives/wikipedia.yaml"))
json_file.save(result, "wikipedia")Contributions are welcome! Whether it's a bug fix, a new transform, a new storage backend, or just sharing a directive YAML that works for a site you scraped.
See CONTRIBUTING.md here, for a full guide on how to get started.
Quick ways to contribute:
- Share a directive — open an issue with the "Share a Directive" template
- New transform — add a function to
scraper/transforms/__init__.pyand open a PR - Bug report — use the bug report issue template
- Python 3.10+
requests,bs4,pyyaml— always requiredplaywright— only for playwright backendpymongo,python-dotenv— only for MongoDBpika— only for RabbitMQ queue- SQLite is included in Python's stdlib (no install needed)
MIT © João Benedet Machado
site: https://example.com
use: beautifulsoup
proxy: http://proxy.example.com:8080 # or ${PROXY_URL} from env
scrape:
title: ['h1']proxies:
- http://proxy1.example.com:8080
- http://proxy2.example.com:8080
- http://proxy3.example.com:8080
proxy_strategy: round_robin # or: random
# Scrapit automatically retries with the next proxy on failureuse: brightdata # full CDP via Scraping Browser
# requires BRIGHTDATA_CUSTOMER, BRIGHTDATA_ZONE, BRIGHTDATA_PASSWORD in .envpip install scrapit-scraper[brightdata]