ventz/scrape-website

scrape-website

Async website scraper that crawls an entire domain and downloads all pages (HTML), extracts clean Markdown (for LLMs/RAG knowledge bases), and saves documents (PDF, DOCX, XLSX, etc.). Stays within the target domain — it will never follow links to external sites.

Features

  • Fast async crawling — up to 100 concurrent requests (configurable)
  • Async DNS — non-blocking DNS resolution with caching (via aiodns)
  • Async file I/O — non-blocking writes with aiofiles
  • Clean Markdown extraction — extracts main content as Markdown using trafilatura (strips nav, headers, footers, boilerplate), with YAML front-matter metadata (title, url, hostname, sitename) at the top of each file
  • Per-page deduplication — repeated boilerplate is dropped only within a page; content that legitimately repeats across pages (e.g. an FAQ answer on both the FAQ page and its own page) is kept in full, so every page is a self-contained knowledge-base document
  • Parallel HTML parsing — lxml link extraction and text extraction are offloaded to a process pool (uses all CPU cores)
  • SQLite-backed dedup — exact URL deduplication with minimal RAM usage (scales to millions of URLs)
  • Crash recovery — auto-resumes from checkpoint on restart; use --fresh to start over
  • Multi-domain concurrency — all domains run in parallel via asyncio.TaskGroup
  • Domain-scoped — only follows links within the starting domain
  • Document downloads — PDF, DOC(X), PPT(X), XLS(X), CSV, ZIP, RTF, ODT, ODS, ODP
  • Multiple input modes — single URL, file with URL list, or retry from failed URLs
  • Access-denied detection — identifies HTTP 401/403 and CDN/WAF denial pages
  • Automatic retry — failed URLs are saved for easy re-run
  • Structured logging — per-URL events logged to file, progress summaries every 5 seconds to console
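To make the SQLite-backed dedup bullet concrete, here is a minimal sketch of how a visited-URL set can live on disk rather than in RAM. The `SeenUrls` class, the `visited` table, and its schema are hypothetical names chosen for illustration; the project's actual schema in state.db may differ.

```python
import sqlite3


class SeenUrls:
    """Hypothetical visited-URL set stored in SQLite instead of a Python set."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        # The PRIMARY KEY gives exact-match deduplication via its index.
        self.db.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")

    def add(self, url: str) -> bool:
        """Record url and return True only the first time it is seen."""
        cur = self.db.execute(
            "INSERT OR IGNORE INTO visited (url) VALUES (?)", (url,)
        )
        self.db.commit()
        # rowcount is 1 when the row was inserted, 0 when it was ignored.
        return cur.rowcount == 1


seen = SeenUrls()
first = seen.add("https://example.com/a")
second = seen.add("https://example.com/a")
```

Because the lookup is an index probe on disk, memory use stays roughly flat as the number of URLs grows, which is the property the feature list describes.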

Requirements

  • Python 3.13+
  • uv package manager

Setup

git clone https://github.com/ventz/scrape-website.git
cd scrape-website
uv sync

Usage

Scrape a single website

uv run python app.py https://example.com/

This crawls every page on example.com, saving HTML pages, extracted text, and any linked documents.
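As a rough illustration of what staying within the target domain can mean, the sketch below resolves each link against the page URL and keeps only http(s) links whose hostname equals the page's hostname. The function name `same_domain_links` is hypothetical, and exact-hostname matching is an assumption; how the tool actually treats subdomains is not stated here.

```python
from urllib.parse import urljoin, urlparse


def same_domain_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Return absolute links that stay on the same host as page_url."""
    host = urlparse(page_url).hostname
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        # Assumption: only http(s) links on the exact same hostname are followed.
        if parsed.scheme in ("http", "https") and parsed.hostname == host:
            kept.append(absolute)
    return kept
```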

Scrape multiple websites

Create a file with one URL per line:

# urls.txt
# Lines starting with # are ignored, blank lines are skipped
https://example.com/
https://docs.example.com/
https://blog.example.com/

Then run:

uv run python app.py --file urls.txt

All domains run concurrently. Each domain gets its own output directory under data/.

You can also combine a URL argument with a file:

uv run python app.py https://example.com/ --file more-urls.txt
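The URL-file rules above (comment lines ignored, blank lines skipped) can be sketched in a few lines. `read_url_list` is a hypothetical helper, not a function from app.py.

```python
def read_url_list(text: str) -> list[str]:
    """Parse a URL list: one URL per line, '#' comments and blank lines skipped."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        urls.append(line)
    return urls
```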

Retry failed URLs

Failed URLs are automatically saved to data/<domain>/logs/failed_urls.txt after each run. Retry them with:

uv run python app.py --retry data/example.com/logs/failed_urls.txt

Resume after crash

The scraper automatically checkpoints its queue and stats to SQLite every 30 seconds. If interrupted, just re-run the same command — it will resume from where it left off.

To force a clean start (ignoring any saved checkpoint):

uv run python app.py https://example.com/ --fresh
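The checkpoint-and-resume behavior can be pictured with the sketch below: the pending queue is written to SQLite in one transaction and read back on restart, while a fresh start discards it. The function names, the `queue` table, and its schema are assumptions for illustration; the real state.db layout and the 30-second timer are not shown.

```python
import sqlite3

SCHEMA = "CREATE TABLE IF NOT EXISTS queue (url TEXT PRIMARY KEY)"


def save_checkpoint(db: sqlite3.Connection, pending: list[str]) -> None:
    """Replace the stored queue with the current pending URLs."""
    db.execute(SCHEMA)
    with db:  # commits on success, rolls back if an exception is raised
        db.execute("DELETE FROM queue")
        db.executemany(
            "INSERT OR IGNORE INTO queue (url) VALUES (?)",
            [(url,) for url in pending],
        )


def load_checkpoint(db: sqlite3.Connection, fresh: bool = False) -> list[str]:
    """Return the saved queue, or an empty list when fresh is requested."""
    db.execute(SCHEMA)
    if fresh:
        with db:
            db.execute("DELETE FROM queue")  # mimics the effect of --fresh
        return []
    return [row[0] for row in db.execute("SELECT url FROM queue ORDER BY rowid")]
```

Writing the whole queue inside a single transaction means an interruption mid-write leaves the previous checkpoint intact.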

Tuning options

# Throttle to 20 concurrent requests with a 0.5s delay (be polite)
uv run python app.py https://example.com/ --concurrency 20 --delay 0.5

# Increase timeout for slow servers
uv run python app.py https://example.com/ --timeout 60

# All options together
uv run python app.py https://example.com/ --concurrency 50 --timeout 60 --delay 0.25

Flag                         Default     Description
--concurrency                100         Max concurrent requests
--timeout                    30          Request timeout in seconds
--delay                      0.1         Delay between requests in seconds
--file, -f                               File with URLs to scrape (one per line)
--retry, -r                              File with failed URLs to retry
--fresh                                  Ignore saved checkpoint and start fresh
--exclude-pattern            see below   Regex to exclude URLs (repeatable; appends to defaults)
--no-default-excludes                    Clear built-in exclude patterns (only use --exclude-pattern values)
--no-strip-tracking-params               Keep tracking query params (utm_*, fbclid, etc.)
--no-use-sitemap                         Skip sitemap.xml discovery for seed URLs
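To show how a concurrency cap and a per-request delay typically interact, here is a small asyncio sketch. `run_bounded` and the `fetch` callable are hypothetical, and applying the delay while a worker still holds its slot is one possible reading of --delay, not necessarily how app.py implements it.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def run_bounded(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    concurrency: int = 100,
    delay: float = 0.1,
) -> list[str]:
    """Run fetch over urls with at most `concurrency` requests in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def worker(url: str) -> str:
        async with sem:  # wait here once the cap is reached
            result = await fetch(url)
            await asyncio.sleep(delay)  # hold the slot briefly to be polite
            return result

    # gather preserves the input order of urls in its results
    return await asyncio.gather(*(worker(u) for u in urls))
```

Lowering `concurrency` and raising `delay` both reduce load on the target server; the first limits parallelism, the second spaces out the requests each slot makes.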

Crawl-quality knobs

Three features are on by default and improve crawl quality on most sites:

URL exclude patterns — skip URLs matching common noise patterns (tag pages, author archives, pagination, print views, etc.):

# Add a custom exclude pattern (appended to defaults)
uv run python app.py https://blog.example.com/ --exclude-pattern '/category/'

# Use only your own patterns (no defaults)
uv run python app.py https://blog.example.com/ --no-default-excludes --exclude-pattern '/archive/'

Default patterns: /tag/, /author/, /feed/, /print/, ?print=, /comments/, /page/\d+, /cdn-cgi/.
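The sketch below shows one way such exclude patterns could be applied, using the defaults listed above. `build_excluder` is a hypothetical helper; matching with `re.search` anywhere in the URL, and escaping the leading `?` in `?print=` so it compiles as a regex, are assumptions rather than confirmed details of the tool.

```python
import re
from collections.abc import Callable, Iterable

# Defaults as listed in this README; "?print=" is escaped here (assumption)
# because a bare leading "?" is not a valid regular expression.
DEFAULT_EXCLUDES = [
    r"/tag/", r"/author/", r"/feed/", r"/print/",
    r"\?print=", r"/comments/", r"/page/\d+", r"/cdn-cgi/",
]


def build_excluder(
    extra: Iterable[str] = (), use_defaults: bool = True
) -> Callable[[str], bool]:
    """Return a predicate that is True for URLs matching any exclude pattern."""
    patterns = (DEFAULT_EXCLUDES if use_defaults else []) + list(extra)
    compiled = [re.compile(p) for p in patterns]

    def is_excluded(url: str) -> bool:
        return any(rx.search(url) for rx in compiled)

    return is_excluded
```

Passing `extra` mirrors --exclude-pattern, and `use_defaults=False` mirrors --no-default-excludes.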

Tracking-param stripping — removes utm_source, fbclid, gclid, and similar query params so the same page isn't scraped twice with different tracking links:

# Opt out (keep all query params as-is)
uv run python app.py https://example.com/ --no-strip-tracking-params
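For intuition, tracking-parameter stripping can be sketched with the standard library as follows. The `strip_tracking` function and the exact parameter sets are illustrative assumptions; the tool's built-in list ("and similar query params") is likely longer.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed subset of tracking parameters, taken from the examples in this README.
TRACKING_EXACT = {"fbclid", "gclid"}
TRACKING_PREFIXES = ("utm_",)


def strip_tracking(url: str) -> str:
    """Drop tracking query parameters and keep everything else in order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_EXACT and not key.startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Two links that differ only in tracking parameters normalize to the same string, so a URL-level dedup step sees them as one page.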

Sitemap seeding — fetches sitemap.xml (and sitemap index files) to discover pages that might not be linked from the homepage:

# Opt out
uv run python app.py https://example.com/ --no-use-sitemap
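A sitemap is an XML document whose `<loc>` elements hold URLs; a sitemap index uses the same element to point at further sitemaps. The sketch below parses either form with the standard library. `parse_sitemap` is a hypothetical helper, and fetching, gzip handling, and recursion into index entries are left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> tuple[bool, list[str]]:
    """Return (is_index, locations) for a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    locations = [
        el.text.strip() for el in root.findall(".//sm:loc", NS) if el.text
    ]
    # A <sitemapindex> root means the locations are further sitemaps, not pages.
    is_index = root.tag.endswith("sitemapindex")
    return is_index, locations
```

When `is_index` is true, a crawler would fetch each location and parse it the same way before seeding its queue with page URLs.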

Output structure

data/
  example.com/
    pages/              # Raw HTML files
    text/               # Clean extracted Markdown (.md) w/ metadata — LLM-ready
    files/              # Downloaded documents (PDF, DOCX, etc.)
    logs/
      scrape.log        # Full debug log
      state.db          # SQLite DB (visited URLs, queue, stats)
      access_denied.txt # URLs that returned 401/403 (if any)
      failed_urls.txt   # URLs that failed after retries (if any)
  docs.example.com/
    pages/
    text/
    files/
    logs/

Each domain is stored separately, so scraping multiple sites keeps everything organized.

The text/ directory contains clean, extracted main content as Markdown (.md) — ideal for feeding into LLMs, RAG pipelines, or text analysis. Navigation, headers, footers, and boilerplate are stripped by trafilatura. Each file opens with a YAML front-matter block (title, url, hostname, sitename) for provenance and better retrieval, followed by the page content with headings and links preserved.

Deduplication is per-page only: trafilatura's repetition cache is reset before every page, so a passage is removed only if it repeats within that same page. Text that legitimately appears on multiple pages (a shared FAQ answer, a reused policy blurb) is retained in full on every page. Nothing is dropped because it already appeared on another page or another domain, so no page ends up with a truncated file or no file at all.
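When loading files from text/ into a pipeline, the front-matter block usually needs to be separated from the body. The sketch below assumes the common layout of a `---` line, simple `key: value` pairs, and a closing `---` line; the exact layout written by the scraper is not shown in this README, and a real YAML parser would be safer for quoted or multi-line values. `split_front_matter` is a hypothetical helper.

```python
def split_front_matter(markdown: str) -> tuple[dict[str, str], str]:
    """Split a Markdown document into (metadata, body).

    Assumes front matter delimited by '---' lines with flat key: value pairs.
    Returns ({}, markdown) when no complete front-matter block is found.
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":  # closing delimiter
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return meta, body
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip("'\"")
    return {}, markdown  # opening delimiter without a closing one
```

Keeping `url` and `title` alongside each chunk lets a retrieval system cite the source page for every passage it returns.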

Example

% python app.py 'https://privsec.harvard.edu'
Output directory: data/privsec.harvard.edu
Starting domain: privsec.harvard.edu
Max concurrent requests: 100
Starting scraper at 2026-03-12 14:01:58

================================================================================
SCRAPING COMPLETED
================================================================================
Duration: 4.00 seconds
URLs visited: 104
Pages downloaded: 98
Text extracted: 91
Files downloaded: 3
Access denied: 3
Total data: 4.63 MB
Errors: 0
Output location: data/privsec.harvard.edu
Denied URLs logged to: data/privsec.harvard.edu/logs/access_denied.txt
================================================================================

% ls data/privsec.harvard.edu/
files/  logs/  pages/  text/

License

MIT

About

High-performance web scraper (seconds for what takes other tools many minutes), battle-tested on millions of websites
