
tiered-scraper

6-Tier auto-escalation web scraper with AI Vision CAPTCHA solver.

Automatically escalates through increasingly powerful scraping strategies until it succeeds. Bypasses Cloudflare, DataDome, Turnstile, and other anti-bot systems.

How It Works

Request
  │
  ├─ Tier 1: httpx            → Plain HTTP (fastest, ~0.5s)
  ├─ Tier 2: StealthyFetcher  → TLS fingerprint spoofing
  ├─ Tier 3: patchright       → Playwright with CDP leak patches
  ├─ Tier 4: nodriver         → Direct Chrome communication (no CDP traces)
  ├─ Tier 5: camoufox         → Firefox modified at C++ level
  └─ Tier 6: Vision Solver    → Screenshot → AI Vision → coordinate click

Each tier is tried in order. If a challenge page is detected, it automatically escalates to the next tier. Unavailable tiers (missing dependencies) are skipped.

Test Results

| Site           | Protection            | Bypass Tier | Result |
|----------------|-----------------------|-------------|--------|
| httpbin.org    | None                  | Tier 1      | PASS   |
| nowsecure.nl   | Cloudflare            | Tier 5      | PASS   |
| G2.com         | Cloudflare + DataDome | Tier 5      | PASS   |
| Indeed.com     | CF Enterprise         | Tier 5      | PASS   |
| Crunchbase.com | Cloudflare            | Tier 5      | PASS   |
| Discord.com    | Cloudflare            | Tier 5      | PASS   |

Installation

# Minimal (Tier 1 only)
pip install tiered-scraper

# With all tiers
pip install tiered-scraper[all]

# Individual tiers
pip install tiered-scraper[stealth]    # + Tier 2
pip install tiered-scraper[browser]    # + Tier 3
pip install tiered-scraper[nodriver]   # + Tier 4
pip install tiered-scraper[camoufox]   # + Tier 5
pip install tiered-scraper[vision]     # + Tier 6

Quick Start

import asyncio
from tiered_scraper import TieredScraper

async def main():
    scraper = TieredScraper()

    # Auto-escalation: tries Tier 1→2→3→4→5→6 until success
    html = await scraper.fetch("https://example.com")
    print(f"Got {len(html)} bytes")

    # Force a specific tier
    html = await scraper.fetch("https://cf-protected.com", tier=5)

    # Check stats
    print(scraper.stats)

asyncio.run(main())

Configuration

scraper = TieredScraper(
    timeout=30,                          # Per-tier timeout (seconds)
    proxy="socks5://user:pass@host:port", # Proxy for all tiers
    anthropic_api_key="sk-ant-...",       # For Tier 6 Vision Solver
)

Tier Details

Tier 1: httpx

  • Speed: ~0.5s | Cost: Free
  • Plain HTTP requests. No JS rendering.
  • Handles: RSS feeds, simple HTML, APIs.

Tier 2: StealthyFetcher

  • Speed: ~2s | Cost: Free
  • TLS fingerprint spoofing via scrapling.
  • Handles: Sites checking TLS handshake patterns.

Tier 3: patchright

  • Speed: ~3s | Cost: Free
  • Patchright — Playwright with CDP leak patches.
  • Handles: JS-rendered SPAs, basic bot detection.

Tier 4: nodriver

  • Speed: ~5s | Cost: Free
  • Nodriver — Direct Chrome communication without CDP traces.
  • Handles: Sites detecting Runtime.Enable CDP calls.
  • Cloudflare bypass rate: ~83% (benchmark).

Tier 5: camoufox

  • Speed: ~8s | Cost: Free
  • Camoufox — Firefox modified at C++ binary level.
  • Handles: Cloudflare, DataDome, Akamai, PerimeterX.
  • Detection score: 0% on major test suites.

Tier 6: Vision Solver

  • Speed: ~15s | Cost: ~$0.001/solve (API call)
  • Screenshots the page → Claude Vision API identifies CAPTCHA location → clicks with human-like mouse movement.
  • Why it works: Uses actual screen coordinates (screenX/Y in hundreds), not CDP iframe coordinates (< 100). Cloudflare Turnstile can't distinguish from human clicks.
  • Handles: Turnstile, reCAPTCHA, hCaptcha, any visual challenge.
  • Requires: ANTHROPIC_API_KEY environment variable.
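The screen-coordinate idea behind Tier 6 can be sketched as below: the click is issued at absolute screen coordinates (window position plus viewport offset), not at CDP/iframe-local coordinates. The function and its parameters are hypothetical, shown only to make the translation concrete.

```python
# Translate a viewport point (e.g. where the vision model found the
# Turnstile checkbox) into absolute screen space. A real OS-level click
# at (screen_x, screen_y) then lands with screenX/Y in the hundreds,
# unlike a CDP-dispatched click whose iframe-local coords stay < 100.
def to_screen_coords(viewport_x: int, viewport_y: int,
                     window_x: int, window_y: int,
                     chrome_height: int) -> tuple[int, int]:
    screen_x = window_x + viewport_x
    screen_y = window_y + chrome_height + viewport_y  # skip browser chrome
    return screen_x, screen_y
```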

Challenge Detection

The scraper automatically detects challenge pages by looking for patterns like:

  • "Checking if the site connection is secure"
  • "Verify you are human"
  • Cloudflare ray IDs
  • Turnstile iframe markers

If a challenge is detected after fetching, the scraper escalates to the next tier instead of returning blocked content.
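A minimal detector matching the patterns listed above might look like this. The regexes are assumptions for illustration; the package's real heuristics may differ.

```python
import re

# Substring patterns corresponding to the challenge markers above.
CHALLENGE_PATTERNS = [
    re.compile(r"Checking if the site connection is secure", re.I),
    re.compile(r"Verify you are human", re.I),
    re.compile(r"cf-ray", re.I),                 # Cloudflare ray ID marker
    re.compile(r"challenges\.cloudflare\.com"),  # Turnstile iframe source
]

def is_challenge_page(html: str) -> bool:
    return any(p.search(html) for p in CHALLENGE_PATTERNS)
```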

Why Each Tier Exists

| Defense               | T1 | T2 | T3 | T4 | T5 | T6 |
|-----------------------|----|----|----|----|----|----|
| JS rendering required | -  | -  | ✓  | ✓  | ✓  | ✓  |
| TLS fingerprinting    | -  | ✓  | ✓  | ✓  | ✓  | ✓  |
| CDP detection         | -  | -  | ✓  | ✓  | ✓  | ✓  |
| navigator.webdriver   | -  | -  | ✓  | ✓  | ✓  | ✓  |
| Cloudflare challenge  | -  | -  | -  | -  | ✓  | ✓  |
| DataDome              | -  | -  | -  | -  | ✓  | ✓  |
| Turnstile mouse coords| -  | -  | -  | -  | ?  | ✓  |
| Per-customer ML       | -  | -  | -  | -  | ?  | ✓  |
| Visual CAPTCHA        | -  | -  | -  | -  | -  | ✓  |

State Management

Built-in utilities for tracking seen URLs and persisting state:

from tiered_scraper import load_state, save_state, is_seen, mark_seen

state = load_state("./scraper-state.json")

# Inside an async function:
if not is_seen(state, url):
    html = await scraper.fetch(url)
    mark_seen(state, url)
    save_state("./scraper-state.json", state)  # Atomic write
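One common way to get the atomic-write behavior noted above is to write to a temporary file in the same directory and then rename it into place. This is a sketch of that pattern, not the package's actual implementation.

```python
import json
import os
import tempfile

def save_state_atomic(path: str, state: dict) -> None:
    """Write state as JSON so readers never see a partially written file."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp)         # clean up the temp file on failure
        raise
```

Writing the temp file in the destination directory (rather than `/tmp`) matters: `os.replace` is only atomic within a single filesystem.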

License

MIT
