
tiered-scraper

6-Tier auto-escalation web scraper with AI Vision CAPTCHA solver.

Automatically escalates through increasingly powerful scraping strategies until it succeeds. Bypasses Cloudflare, DataDome, Turnstile, and other anti-bot systems.

How It Works

Request
  │
  ├─ Tier 1: httpx            → Plain HTTP (fastest, ~0.5s)
  ├─ Tier 2: StealthyFetcher  → TLS fingerprint spoofing
  ├─ Tier 3: patchright       → Playwright with CDP leak patches
  ├─ Tier 4: nodriver         → Direct Chrome communication (no CDP traces)
  ├─ Tier 5: camoufox         → Firefox modified at C++ level
  └─ Tier 6: Vision Solver    → Screenshot → AI Vision → coordinate click

Each tier is tried in order. If a challenge page is detected, it automatically escalates to the next tier. Unavailable tiers (missing dependencies) are skipped.

Test Results

| Site           | Protection            | Bypass Tier | Result |
|----------------|-----------------------|-------------|--------|
| httpbin.org    | None                  | Tier 1      | PASS   |
| nowsecure.nl   | Cloudflare            | Tier 5      | PASS   |
| G2.com         | Cloudflare + DataDome | Tier 5      | PASS   |
| Indeed.com     | CF Enterprise         | Tier 5      | PASS   |
| Crunchbase.com | Cloudflare            | Tier 5      | PASS   |
| Discord.com    | Cloudflare            | Tier 5      | PASS   |

Installation

# Minimal (Tier 1 only)
pip install tiered-scraper

# With all tiers
pip install tiered-scraper[all]

# Individual tiers
pip install tiered-scraper[stealth]    # + Tier 2
pip install tiered-scraper[browser]    # + Tier 3
pip install tiered-scraper[nodriver]   # + Tier 4
pip install tiered-scraper[camoufox]   # + Tier 5
pip install tiered-scraper[vision]     # + Tier 6

Quick Start

import asyncio
from tiered_scraper import TieredScraper

async def main():
    scraper = TieredScraper()

    # Auto-escalation: tries Tier 1→2→3→4→5→6 until success
    html = await scraper.fetch("https://example.com")
    print(f"Got {len(html)} bytes")

    # Force a specific tier
    html = await scraper.fetch("https://cf-protected.com", tier=5)

    # Check stats
    print(scraper.stats)

asyncio.run(main())

Configuration

scraper = TieredScraper(
    timeout=30,                          # Per-tier timeout (seconds)
    proxy="socks5://user:pass@host:port", # Proxy for all tiers
    anthropic_api_key="sk-ant-...",       # For Tier 6 Vision Solver
)

Tier Details

Tier 1: httpx

  • Speed: ~0.5s | Cost: Free
  • Plain HTTP requests. No JS rendering.
  • Handles: RSS feeds, simple HTML, APIs.

Tier 2: StealthyFetcher

  • Speed: ~2s | Cost: Free
  • TLS fingerprint spoofing via scrapling.
  • Handles: Sites checking TLS handshake patterns.

Tier 3: patchright

  • Speed: ~3s | Cost: Free
  • Patchright — Playwright with CDP leak patches.
  • Handles: JS-rendered SPAs, basic bot detection.

Tier 4: nodriver

  • Speed: ~5s | Cost: Free
  • Nodriver — Direct Chrome communication without CDP traces.
  • Handles: Sites detecting Runtime.Enable CDP calls.
  • Cloudflare bypass rate: ~83% (benchmark).

Tier 5: camoufox

  • Speed: ~8s | Cost: Free
  • Camoufox — Firefox modified at C++ binary level.
  • Handles: Cloudflare, DataDome, Akamai, PerimeterX.
  • Detection score: 0% on major test suites.

Tier 6: Vision Solver

  • Speed: ~15s | Cost: ~$0.001/solve (API call)
  • Screenshots the page → Claude Vision API identifies CAPTCHA location → clicks with human-like mouse movement.
  • Why it works: Uses actual screen coordinates (screenX/Y in hundreds), not CDP iframe coordinates (< 100). Cloudflare Turnstile can't distinguish from human clicks.
  • Handles: Turnstile, reCAPTCHA, hCaptcha, any visual challenge.
  • Requires: ANTHROPIC_API_KEY environment variable.
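The screen-coordinate idea behind Tier 6 can be sketched as below: the click is issued at absolute screen coordinates (window position plus viewport offset), not at CDP/iframe-local coordinates. The function and its parameters are hypothetical, shown only to make the translation concrete.

```python
# Translate a viewport point (e.g. where the vision model found the
# Turnstile checkbox) into absolute screen space. A real OS-level click
# at (screen_x, screen_y) then lands with screenX/Y in the hundreds,
# unlike a CDP-dispatched click whose iframe-local coords stay < 100.
def to_screen_coords(viewport_x: int, viewport_y: int,
                     window_x: int, window_y: int,
                     chrome_height: int) -> tuple[int, int]:
    screen_x = window_x + viewport_x
    screen_y = window_y + chrome_height + viewport_y  # skip browser chrome
    return screen_x, screen_y
```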

Challenge Detection

The scraper automatically detects challenge pages by looking for patterns like:

  • "Checking if the site connection is secure"
  • "Verify you are human"
  • Cloudflare ray IDs
  • Turnstile iframe markers

If a challenge is detected after fetching, the scraper escalates to the next tier instead of returning blocked content.
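A minimal detector matching the patterns listed above might look like this. The regexes are assumptions for illustration; the package's real heuristics may differ.

```python
import re

# Substring patterns corresponding to the challenge markers above.
CHALLENGE_PATTERNS = [
    re.compile(r"Checking if the site connection is secure", re.I),
    re.compile(r"Verify you are human", re.I),
    re.compile(r"cf-ray", re.I),                 # Cloudflare ray ID marker
    re.compile(r"challenges\.cloudflare\.com"),  # Turnstile iframe source
]

def is_challenge_page(html: str) -> bool:
    return any(p.search(html) for p in CHALLENGE_PATTERNS)
```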

Why Each Tier Exists

| Defense               | T1 | T2 | T3 | T4 | T5 | T6 |
|-----------------------|----|----|----|----|----|----|
| JS rendering required | -  | -  | ✓  | ✓  | ✓  | ✓  |
| TLS fingerprinting    | -  | ✓  | ✓  | ✓  | ✓  | ✓  |
| CDP detection         | -  | -  | ✓  | ✓  | ✓  | ✓  |
| navigator.webdriver   | -  | -  | ✓  | ✓  | ✓  | ✓  |
| Cloudflare challenge  | -  | -  | -  | -  | ✓  | ✓  |
| DataDome              | -  | -  | -  | -  | ✓  | ✓  |
| Turnstile mouse coords| -  | -  | -  | -  | ?  | ✓  |
| Per-customer ML       | -  | -  | -  | -  | ?  | ✓  |
| Visual CAPTCHA        | -  | -  | -  | -  | -  | ✓  |

State Management

Built-in utilities for tracking seen URLs and persisting state:

from tiered_scraper import load_state, save_state, is_seen, mark_seen

state = load_state("./scraper-state.json")

# Inside an async function:
if not is_seen(state, url):
    html = await scraper.fetch(url)
    mark_seen(state, url)
    save_state("./scraper-state.json", state)  # Atomic write
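One common way to get the atomic-write behavior noted above is to write to a temporary file in the same directory and then rename it into place. This is a sketch of that pattern, not the package's actual implementation.

```python
import json
import os
import tempfile

def save_state_atomic(path: str, state: dict) -> None:
    """Write state as JSON so readers never see a partially written file."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp)         # clean up the temp file on failure
        raise
```

Writing the temp file in the destination directory (rather than `/tmp`) matters: `os.replace` is only atomic within a single filesystem.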

License

MIT
