Async website scraper that crawls an entire domain: it downloads every page (HTML), extracts clean Markdown (for LLM/RAG knowledge bases), and saves linked documents (PDF, DOCX, XLSX, etc.). It stays within the target domain and never follows links to external sites.
- Fast async crawling — up to 100 concurrent requests (configurable)
- Async DNS — non-blocking DNS resolution with caching (via aiodns)
- Async file I/O — non-blocking writes with aiofiles
- Clean Markdown extraction — extracts main content as Markdown using trafilatura (strips nav, headers, footers, boilerplate), with YAML front-matter metadata (title, url, hostname, sitename) at the top of each file
- Per-page deduplication — repeated boilerplate is dropped only within a page; content that legitimately repeats across pages (e.g. an FAQ answer on both the FAQ page and its own page) is kept in full, so every page is a self-contained knowledge-base document
- Parallel HTML parsing — lxml link extraction + text extraction offloaded to process pool (uses all CPU cores)
- SQLite-backed dedup — exact URL deduplication with minimal RAM usage (scales to millions of URLs)
- Crash recovery — auto-resumes from checkpoint on restart; use --fresh to start over
- Multi-domain concurrency — all domains run in parallel via asyncio.TaskGroup
- Domain-scoped — only follows links within the starting domain
- Document downloads — PDF, DOC(X), PPT(X), XLS(X), CSV, ZIP, RTF, ODT, ODS, ODP
- Multiple input modes — single URL, file with URL list, or retry from failed URLs
- Access-denied detection — identifies HTTP 401/403 and CDN/WAF denial pages
- Automatic retry — failed URLs are saved for easy re-run
- Structured logging — per-URL events logged to file, progress summaries every 5 seconds to console
- Python 3.13+
- uv package manager
git clone https://github.com/ventz/scrape-website.git
cd scrape-website
uv sync

uv run python app.py https://example.com/

This crawls every page on example.com, saving HTML pages, extracted text, and any linked documents.
Create a file with one URL per line:
# urls.txt
# Lines starting with # are ignored, blank lines are skipped
https://example.com/
https://docs.example.com/
https://blog.example.com/
Then run:
uv run python app.py --file urls.txt

All domains run concurrently. Each domain gets its own output directory under data/.
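As a rough illustration of the URL-list format and the per-domain output directories described above, the sketch below parses a urls.txt-style string and maps each seed URL to a directory under data/. The function names `parse_url_list` and `output_dir_for` are hypothetical and are not taken from app.py; the real implementation may differ.

```python
from pathlib import Path
from urllib.parse import urlsplit


def parse_url_list(text: str) -> list[str]:
    """Return the URLs in a urls.txt-style string, skipping comments and blank lines."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        urls.append(line)
    return urls


def output_dir_for(url: str, root: str = "data") -> Path:
    """Hypothetical helper: map a seed URL to data/<hostname>/."""
    host = urlsplit(url).hostname or "unknown"
    return Path(root) / host


sample = """# urls.txt
https://example.com/

https://docs.example.com/
"""

for url in parse_url_list(sample):
    print(url, "->", output_dir_for(url).as_posix())
```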
You can also combine a URL argument with a file:
uv run python app.py https://example.com/ --file more-urls.txt

Failed URLs are automatically saved to data/<domain>/logs/failed_urls.txt after each run. Retry them with:
uv run python app.py --retry data/example.com/logs/failed_urls.txt

The scraper automatically checkpoints its queue and stats to SQLite every 30 seconds. If interrupted, just re-run the same command; it resumes from where it left off.
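The checkpoint-and-resume idea can be sketched with the standard-library sqlite3 module. This is a minimal sketch under stated assumptions: the table names (`visited`, `queue`) and function names are hypothetical, the 30-second timer is left out, and the actual schema of state.db is not documented here.

```python
import sqlite3


def open_state(path: str) -> sqlite3.Connection:
    """Open (or create) the state database with hypothetical table names."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE IF NOT EXISTS queue (url TEXT PRIMARY KEY)")
    return conn


def mark_visited(conn: sqlite3.Connection, url: str) -> bool:
    """Record a URL; return True only the first time it is seen."""
    cur = conn.execute("INSERT OR IGNORE INTO visited (url) VALUES (?)", (url,))
    return cur.rowcount == 1


def checkpoint(conn: sqlite3.Connection, pending: list[str]) -> None:
    """Replace the saved queue with the in-memory queue in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM queue")
        conn.executemany(
            "INSERT OR IGNORE INTO queue (url) VALUES (?)",
            [(u,) for u in pending],
        )


def resume(conn: sqlite3.Connection) -> list[str]:
    """Load the queue saved by the last checkpoint, in insertion order."""
    return [row[0] for row in conn.execute("SELECT url FROM queue ORDER BY rowid")]


conn = open_state(":memory:")  # a real run would use a file on disk
print(mark_visited(conn, "https://example.com/"))
print(mark_visited(conn, "https://example.com/"))
checkpoint(conn, ["https://example.com/a", "https://example.com/b"])
print(resume(conn))
```

Keeping the visited set in SQLite rather than in a Python set is what lets exact URL deduplication run with little RAM: the primary-key index does the membership test on disk.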
To force a clean start (ignoring any saved checkpoint):
uv run python app.py https://example.com/ --fresh

# Throttle to 20 concurrent requests with a 0.5s delay (be polite)
uv run python app.py https://example.com/ --concurrency 20 --delay 0.5
# Increase timeout for slow servers
uv run python app.py https://example.com/ --timeout 60
# All options together
uv run python app.py https://example.com/ --concurrency 50 --timeout 60 --delay 0.25

| Flag | Default | Description |
|---|---|---|
| --concurrency | 100 | Max concurrent requests |
| --timeout | 30 | Request timeout in seconds |
| --delay | 0.1 | Delay between requests in seconds |
| --file, -f | — | File with URLs to scrape (one per line) |
| --retry, -r | — | File with failed URLs to retry |
| --fresh | — | Ignore saved checkpoint and start fresh |
| --exclude-pattern | see below | Regex to exclude URLs (repeatable; appends to defaults) |
| --no-default-excludes | — | Clear built-in exclude patterns (only use --exclude-pattern values) |
| --no-strip-tracking-params | — | Keep tracking query params (utm_*, fbclid, etc.) |
| --no-use-sitemap | — | Skip sitemap.xml discovery for seed URLs |
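To make the interaction of --concurrency, --timeout, and --delay more concrete, here is a small asyncio sketch. It is not the scraper's code: `fetch` is a stand-in that sleeps instead of making an HTTP request, and sleeping for the delay while still holding the slot is an assumption about how the delay could be applied.

```python
import asyncio


async def fetch(url: str, timeout: float) -> str:
    """Stand-in for a real HTTP request; sleeps instead of touching the network."""

    async def fake_request() -> str:
        await asyncio.sleep(0.01)
        return f"<html>{url}</html>"

    # Abort the request if it takes longer than the timeout.
    return await asyncio.wait_for(fake_request(), timeout=timeout)


async def crawl(urls, concurrency=100, timeout=30.0, delay=0.1):
    """Fetch all URLs with at most `concurrency` requests in flight."""
    sem = asyncio.Semaphore(concurrency)
    results: dict[str, str] = {}
    in_flight = 0
    peak = 0  # highest number of simultaneous requests observed

    async def worker(url: str) -> None:
        nonlocal in_flight, peak
        async with sem:  # waits here once all slots are taken
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                results[url] = await fetch(url, timeout)
                await asyncio.sleep(delay)  # politeness pause (assumed placement)
            finally:
                in_flight -= 1

    await asyncio.gather(*(worker(u) for u in urls))
    return results, peak


urls = [f"https://example.com/page/{i}" for i in range(10)]
results, peak = asyncio.run(crawl(urls, concurrency=3, timeout=1.0, delay=0.0))
print(len(results), "pages fetched, peak concurrency", peak)
```

Lowering the concurrency and raising the delay both reduce load on the target server; the semaphore caps simultaneous requests, while the delay spaces out how quickly each slot is reused.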
Three features are on by default and improve crawl quality on most sites:
URL exclude patterns — skip URLs matching common noise patterns (tag pages, author archives, pagination, print views, etc.):
# Add a custom exclude pattern (appended to defaults)
uv run python app.py https://blog.example.com/ --exclude-pattern '/category/'
# Use only your own patterns (no defaults)
uv run python app.py https://blog.example.com/ --no-default-excludes --exclude-pattern '/archive/'

Default patterns: /tag/, /author/, /feed/, /print/, ?print=, /comments/, /page/\d+, /cdn-cgi/.
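The way custom patterns combine with the defaults can be sketched as follows. The helper names are hypothetical, and two details are assumptions rather than documented behavior: patterns are matched anywhere in the URL with `re.search`, and the `?` in `?print=` is escaped here so that it compiles as a regular expression.

```python
import re

# Default patterns as listed above; "?" is escaped so the regex is valid.
DEFAULT_EXCLUDES = [
    r"/tag/",
    r"/author/",
    r"/feed/",
    r"/print/",
    r"\?print=",
    r"/comments/",
    r"/page/\d+",
    r"/cdn-cgi/",
]


def build_excludes(extra=(), use_defaults=True):
    """Compile the active patterns: defaults (unless disabled) plus user extras."""
    patterns = (DEFAULT_EXCLUDES if use_defaults else []) + list(extra)
    return [re.compile(p) for p in patterns]


def is_excluded(url, excludes):
    """True if any pattern matches anywhere in the URL."""
    return any(rx.search(url) for rx in excludes)


# Equivalent of: --exclude-pattern '/category/'
appended = build_excludes(extra=[r"/category/"])
# Equivalent of: --no-default-excludes --exclude-pattern '/archive/'
only_mine = build_excludes(extra=[r"/archive/"], use_defaults=False)

for url in (
    "https://blog.example.com/tag/python/",
    "https://blog.example.com/category/news/",
    "https://blog.example.com/posts/hello/",
):
    print(url, is_excluded(url, appended), is_excluded(url, only_mine))
```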
Tracking-param stripping — removes utm_source, fbclid, gclid, and similar query params so the same page isn't scraped twice with different tracking links:
# Opt out (keep all query params as-is)
uv run python app.py https://example.com/ --no-strip-tracking-params

Sitemap seeding — fetches sitemap.xml (and sitemap index files) to discover pages that might not be linked from the homepage:
# Opt out
uv run python app.py https://example.com/ --no-use-sitemap

    data/
      example.com/
        pages/                # Raw HTML files
        text/                 # Clean extracted Markdown (.md) with metadata, LLM-ready
        files/                # Downloaded documents (PDF, DOCX, etc.)
        logs/
          scrape.log          # Full debug log
          state.db            # SQLite DB (visited URLs, queue, stats)
          access_denied.txt   # URLs that returned 401/403 (if any)
          failed_urls.txt     # URLs that failed after retries (if any)
      docs.example.com/
        pages/
        text/
        files/
        logs/
Each domain is stored separately, so scraping multiple sites keeps everything organized.
The text/ directory contains clean, extracted main content as Markdown (.md) — ideal for feeding into LLMs, RAG pipelines, or text analysis. Navigation, headers, footers, and boilerplate are stripped by trafilatura. Each file opens with a YAML front-matter block (title, url, hostname, sitename) for provenance and better retrieval, followed by the page content with headings and links preserved.
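For readers who want to consume these files programmatically, the sketch below shows one way a front-matter block with the four fields could be written and read back. It is a simplified illustration, not the tool's writer: values are quoted with `json.dumps` (which produces valid YAML double-quoted strings) to avoid a YAML dependency, and the exact formatting of the real files may differ.

```python
import json

FIELDS = ("title", "url", "hostname", "sitename")


def with_front_matter(meta: dict, body: str) -> str:
    """Prefix Markdown content with a small YAML front-matter block."""
    lines = ["---"]
    for key in FIELDS:
        if meta.get(key):
            # JSON string quoting keeps colons and quotes in titles safe.
            lines.append(f"{key}: {json.dumps(meta[key])}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body.strip() + "\n"


def split_front_matter(doc: str) -> tuple[dict, str]:
    """Split a document into (metadata, content); only handles the format above."""
    if not doc.startswith("---\n"):
        return {}, doc
    header, sep, body = doc[4:].partition("\n---\n")
    if not sep:  # no closing delimiter: treat everything as content
        return {}, doc
    meta = {}
    for line in header.splitlines():
        key, _, value = line.partition(": ")
        meta[key] = json.loads(value)
    return meta, body.lstrip("\n")


doc = with_front_matter(
    {
        "title": "FAQ: Billing",
        "url": "https://example.com/faq",
        "hostname": "example.com",
        "sitename": "Example",
    },
    "# FAQ\n\nAnswer text.",
)
print(doc)
meta, content = split_front_matter(doc)
print(meta["url"], len(content))
```

In a RAG pipeline, the parsed `url` and `title` can be stored alongside each chunk so that retrieved passages can cite their source page.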
Deduplication is per-page only: trafilatura's repetition cache is reset before every page, so a passage is removed only if it repeats within that same page. Text that legitimately appears on multiple pages (a shared FAQ answer, a reused policy blurb) is retained in full on every page — there is no cross-page/cross-domain content loss, which would otherwise leave some pages with a truncated file or no file at all.
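The scoping rule can be illustrated with a toy example. This is not trafilatura's mechanism; it only shows why a "seen" set that starts empty for each page removes repeats inside a page while leaving text shared between pages intact. The function name `dedupe_page` is hypothetical.

```python
def dedupe_page(paragraphs: list[str]) -> list[str]:
    """Drop paragraphs repeated within one page; state is never shared across pages."""
    seen: set[str] = set()  # fresh for every call, i.e. for every page
    kept = []
    for paragraph in paragraphs:
        key = " ".join(paragraph.split()).lower()  # normalize whitespace and case
        if key in seen:
            continue
        seen.add(key)
        kept.append(paragraph)
    return kept


faq_page = [
    "Q: How do I reset my password?",
    "Use the reset link on the login page.",
    "Contact us",
    "Contact us",  # repeated boilerplate within the same page
]
answer_page = [
    "Use the reset link on the login page.",  # same answer, on its own page
    "Contact us",
]

print(dedupe_page(faq_page))
print(dedupe_page(answer_page))
```

If the `seen` set were shared across pages instead, the second page would lose both of its paragraphs, which is the truncated-or-empty-file problem that per-page scoping avoids.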
% python app.py 'https://privsec.harvard.edu'
Output directory: data/privsec.harvard.edu
Starting domain: privsec.harvard.edu
Max concurrent requests: 100
Starting scraper at 2026-03-12 14:01:58
================================================================================
SCRAPING COMPLETED
================================================================================
Duration: 4.00 seconds
URLs visited: 104
Pages downloaded: 98
Text extracted: 91
Files downloaded: 3
Access denied: 3
Total data: 4.63 MB
Errors: 0
Output location: data/privsec.harvard.edu
Denied URLs logged to: data/privsec.harvard.edu/logs/access_denied.txt
================================================================================
% ls data/privsec.harvard.edu/
files/ logs/ pages/ text/
License: MIT