viveknarayanan1408/Smart-Web-Scraper-AI

🕷️ ScrapeIntel — AI-Powered Smart Web Scraper

Python 3.10+ · FastAPI · License: MIT · Code style: black

Extract structured data from any website using natural language schemas.
Define what you want in plain English — ScrapeIntel figures out how to get it.

(Screenshot: ScrapeIntel Swagger UI)


✨ What It Does

ScrapeIntel combines traditional web scraping with LLM-powered extraction to turn unstructured web pages into clean, structured data. Instead of writing brittle CSS selectors or XPath queries, you describe the data you want in a JSON schema — and the AI handles the rest.

Key Features

  • Schema-Driven Extraction — Define output structure in JSON Schema or plain English
  • Multi-Page Crawling — Automatic pagination detection and following
  • Concurrent Scraping — Async architecture with configurable rate limiting
  • Multiple LLM Backends — OpenAI, Anthropic, or local models via Ollama (free)
  • REST API + CLI — Use via FastAPI endpoints or command line
  • Export Formats — JSON, CSV, Excel, or direct SQLite insert
  • Anti-Detection — Rotating user agents, request delays, proxy support
  • Retry & Error Handling — Exponential backoff with configurable retries
  • Caching — Disk-based page caching with TTL to avoid redundant requests
  • Structured Logging — Full observability with JSON or human-readable logs

🏗️ Architecture

┌─────────────────────────────────────────────────────┐
│                   FastAPI / CLI                      │
├─────────────────────────────────────────────────────┤
│                 Job Orchestrator                     │
│          (async task queue + scheduling)             │
├──────────┬──────────────┬───────────────────────────┤
│ Scraper  │  Extractor   │     Post-Processor        │
│ Engine   │  (LLM-based) │  (validation + export)    │
├──────────┼──────────────┼───────────────────────────┤
│ HTTP     │ OpenAI /     │  JSON / CSV / Excel /     │
│ Client   │ Anthropic /  │  SQLite / Webhook         │
│ (httpx)  │ Ollama       │                           │
└──────────┴──────────────┴───────────────────────────┘

🚀 Quick Start

Prerequisites

  • Python 3.10 or higher
  • An LLM API key (OpenAI or Anthropic) OR Ollama installed locally for free inference

Step 1 — Clone & Install

git clone https://github.com/YOUR_USERNAME/scrapeintel.git
cd scrapeintel

Create a virtual environment and install dependencies:

# Linux / macOS
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Windows (PowerShell)
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt

Step 2 — Configure your LLM provider

cp .env.example .env

Open .env in your editor and configure one of the following:

Option A — OpenAI (paid, fastest):

OPENAI_API_KEY=sk-your-openai-key-here
DEFAULT_LLM_PROVIDER=openai

Get a key at platform.openai.com/api-keys. Requires billing enabled (minimum $5 credit).

Option B — Anthropic (paid):

ANTHROPIC_API_KEY=sk-ant-your-anthropic-key-here
DEFAULT_LLM_PROVIDER=anthropic

Get a key at console.anthropic.com.

Option C — Ollama (free, runs locally):

# Install Ollama from https://ollama.com, then pull a model:
ollama pull llama3.1

Then set in .env:

DEFAULT_LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434

No API key needed. Requires ~4.7GB disk space for the model.

Step 3 — Run the tests

pytest tests/ -v

All tests should pass. This validates that your environment is set up correctly.

Step 4 — Start the API server

uvicorn src.api.main:app --reload --port 8000

You should see:

INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO:     Application startup complete.

Step 5 — Open the Swagger UI

Navigate to http://127.0.0.1:8000/docs in your browser. You'll see the interactive API documentation with all endpoints.

Step 6 — Run your first scrape

In Swagger UI, click POST /api/v1/scrape, then "Try it out", paste this body, and hit "Execute":

{
  "url": "https://books.toscrape.com",
  "schema": {
    "title": "Book title",
    "price": "Price as a float number",
    "rating": "Star rating as a word (One, Two, Three, Four, Five)",
    "in_stock": "Whether available, as a boolean"
  }
}

Expected response:

{
  "job_id": "a1b2c3d4",
  "status": "completed",
  "data": [
    {
      "title": "A Light in the Attic",
      "price": 51.77,
      "rating": "Three",
      "in_stock": true
    },
    {
      "title": "Tipping the Velvet",
      "price": 53.74,
      "rating": "One",
      "in_stock": true
    }
  ],
  "metadata": {
    "url": "https://books.toscrape.com",
    "scraped_at": "2025-03-19T14:30:00Z",
    "tokens_used": 1850,
    "latency_ms": 3200
  }
}

All 20 books on the page, cleanly extracted — no CSS selectors, no XPath, just plain English.
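The same request can be made programmatically. A stdlib-only Python sketch that builds the request body shown above (the actual network call is commented out so the snippet runs without a server; the response shape is assumed from the example above):

```python
import json
from urllib import request

def build_scrape_request(url, schema):
    """Build the JSON body expected by POST /api/v1/scrape."""
    return {"url": url, "schema": schema}

payload = build_scrape_request(
    "https://books.toscrape.com",
    {"title": "Book title", "price": "Price as a float number"},
)

# Against a running server you would send it like this:
# req = request.Request(
#     "http://127.0.0.1:8000/api/v1/scrape",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# with request.urlopen(req) as resp:
#     print(json.load(resp)["data"])

print(json.dumps(payload))
```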


📖 Usage Guide

API Endpoints

Method Endpoint Description
GET /api/v1/health Health check — shows configured providers and uptime
POST /api/v1/scrape Scrape a single page and extract data
POST /api/v1/crawl Crawl multiple pages with automatic pagination
POST /api/v1/batch Scrape multiple URLs concurrently
GET /api/v1/cache/stats View cache statistics
DELETE /api/v1/cache Clear all cached responses
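The cache endpoints above sit on top of a URL-keyed disk cache with TTL expiry. Conceptually, such a cache can be as simple as this sketch (a simplified illustration, not src/scrapers/cache.py itself):

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path

class DiskCache:
    """Tiny URL-keyed disk cache with TTL expiry (illustrative only)."""

    def __init__(self, directory, ttl_seconds=3600):
        self.dir = Path(directory)
        self.ttl = ttl_seconds

    def _path(self, url):
        # One file per URL, named by the hash of the URL.
        return self.dir / (hashlib.sha256(url.encode()).hexdigest() + ".json")

    def get(self, url):
        p = self._path(url)
        if not p.exists():
            return None
        entry = json.loads(p.read_text())
        if time.time() - entry["saved_at"] > self.ttl:
            p.unlink()  # expired: drop the file and report a miss
            return None
        return entry["body"]

    def set(self, url, body):
        self._path(url).write_text(
            json.dumps({"saved_at": time.time(), "body": body})
        )

with tempfile.TemporaryDirectory() as d:
    cache = DiskCache(d, ttl_seconds=60)
    cache.set("https://example.com", "<html>...</html>")
    print(cache.get("https://example.com"))  # cache hit within the TTL
```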

Scrape a single page (cURL)

curl -X POST http://localhost:8000/api/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://books.toscrape.com",
    "schema": {
      "title": "Book title",
      "price": "Price as a float"
    }
  }'

Crawl multiple pages

curl -X POST http://localhost:8000/api/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://books.toscrape.com",
    "schema": {
      "title": "Book title",
      "price": "Price as a float"
    },
    "max_pages": 3
  }'

This automatically detects the "next" pagination link and scrapes up to 3 pages (~60 books).
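Pagination detection of this kind commonly works by scanning the page for a "next" link and resolving it against the current URL. A simplified illustration using Python's stdlib html.parser (not the project's crawler code, which may use different heuristics):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class NextLinkFinder(HTMLParser):
    """Find the href of a link inside an element marked 'next' (simplified)."""

    def __init__(self):
        super().__init__()
        self.in_next = False
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Match <li class="next"> / <a rel="next"> style markers.
        if "next" in attrs.get("class", "").split() or attrs.get("rel") == "next":
            self.in_next = True
            if tag == "a":
                self.next_href = attrs.get("href")
        elif tag == "a" and self.in_next and self.next_href is None:
            self.next_href = attrs.get("href")

html = '<ul class="pager"><li class="next"><a href="catalogue/page-2.html">next</a></li></ul>'
finder = NextLinkFinder()
finder.feed(html)
next_url = urljoin("https://books.toscrape.com/", finder.next_href)
print(next_url)  # → https://books.toscrape.com/catalogue/page-2.html
```

The crawler would repeat this resolve-and-fetch loop until max_pages is reached or no "next" link is found.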

Batch scrape multiple URLs

curl -X POST http://localhost:8000/api/v1/batch \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
      "https://books.toscrape.com/catalogue/category/books/fiction_10/index.html"
    ],
    "schema": {
      "title": "Book title",
      "price": "Price as a float"
    }
  }'

Override the LLM provider per request

{
  "url": "https://books.toscrape.com",
  "schema": { "title": "Book title" },
  "options": {
    "llm_provider": "anthropic",
    "cache": false
  }
}

CLI Mode

The same features are available from the command line:

# Scrape a single page → save as JSON
python -m src.cli scrape \
  --url "https://books.toscrape.com" \
  --schema examples/product_schema.json \
  --output results.json

# Scrape a single page → save as CSV
python -m src.cli scrape \
  --url "https://books.toscrape.com" \
  --schema examples/product_schema.json \
  --output results.csv

# Crawl multiple pages
python -m src.cli crawl \
  --url "https://books.toscrape.com" \
  --schema examples/product_schema.json \
  --max-pages 5 \
  --output results.json

# Manage cache
python -m src.cli cache-stats
python -m src.cli cache-clear

📋 Example Schemas

Ready-made schemas are in the examples/ folder:

Product extraction (examples/product_schema.json):

{
  "name": "Product name or title",
  "price": "Price as a float number",
  "rating": "Average star rating as a float",
  "in_stock": "Whether the product is available (boolean)",
  "features": "List of key product features as an array"
}

News articles (examples/article_schema.json):

{
  "title": "Article headline",
  "author": "Author name(s)",
  "published_date": "Publication date in ISO format",
  "summary": "First 2-3 sentences summarizing the article"
}

Job listings (examples/job_listing_schema.json):

{
  "title": "Job title",
  "company": "Company name",
  "location": "City/state or Remote",
  "salary_min": "Minimum salary as integer",
  "skills": "List of required skills as an array"
}

You can create your own schema for any website — just describe each field in plain English.
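Descriptions like "Price as a float number" imply a target type, so extracted values still need coercion and validation before export. A hypothetical helper in that spirit (illustrative only, not the project's post-processor):

```python
import re

def coerce_price(raw):
    """Parse a price like '£51.77' or '$1,299.00' into a float (illustrative)."""
    if isinstance(raw, (int, float)):
        return float(raw)
    match = re.search(r"\d[\d,]*\.?\d*", str(raw))
    if not match:
        raise ValueError(f"no numeric price in {raw!r}")
    # Strip thousands separators before converting.
    return float(match.group().replace(",", ""))

print(coerce_price("£51.77"))     # → 51.77
print(coerce_price("$1,299.00"))  # → 1299.0
```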


📁 Project Structure

scrapeintel/
├── src/
│   ├── api/                  # FastAPI application
│   │   ├── main.py           # App entry point, middleware, error handlers
│   │   ├── routes.py         # API route definitions (scrape, crawl, batch)
│   │   └── deps.py           # Dependency injection (client, cache, extractor)
│   ├── scrapers/             # Web scraping engine
│   │   ├── http_client.py    # Async HTTP client with retries & rate limiting
│   │   ├── crawler.py        # Multi-page crawler with pagination detection
│   │   └── cache.py          # Disk-based response caching with TTL
│   ├── extractors/           # AI extraction layer
│   │   ├── base.py           # Abstract extractor + HTML preprocessing + prompt builder
│   │   ├── openai_extractor.py
│   │   ├── anthropic_extractor.py
│   │   └── ollama_extractor.py
│   ├── schemas/              # Pydantic models
│   │   ├── requests.py       # API request validation models
│   │   └── responses.py      # API response models
│   ├── utils/                # Shared utilities
│   │   ├── config.py         # Settings management (pydantic-settings)
│   │   ├── logging.py        # Structured logging (JSON + colored output)
│   │   └── exporters.py      # Output format handlers (JSON, CSV, Excel, SQLite)
│   └── cli.py                # Click-based CLI entry point
├── tests/                    # Test suite (pytest)
│   ├── test_scraper.py       # HTTP client & rate limiter tests
│   ├── test_extractors.py    # Extraction pipeline & HTML preprocessing tests
│   ├── test_exporters.py     # Export format tests
│   ├── test_cache.py         # Cache TTL & invalidation tests
│   └── test_api.py           # API endpoint & middleware tests
├── examples/                 # Example extraction schemas
├── docs/                     # Documentation & contributing guide
├── .github/workflows/ci.yml  # GitHub Actions CI/CD pipeline
├── .env.example              # Environment variable template
├── pyproject.toml            # Python packaging & tool configuration
├── requirements.txt          # Pinned dependencies
├── Dockerfile                # Container build
├── docker-compose.yml        # Docker Compose (API + optional Ollama)
└── LICENSE                   # MIT License

🔧 Configuration

All settings can be configured via environment variables or a .env file:

Variable Description Default
OPENAI_API_KEY OpenAI API key
ANTHROPIC_API_KEY Anthropic API key
OLLAMA_BASE_URL Ollama server URL http://localhost:11434
DEFAULT_LLM_PROVIDER Default LLM backend (openai, anthropic, ollama) openai
OPENAI_MODEL OpenAI model name gpt-4o-mini
ANTHROPIC_MODEL Anthropic model name claude-sonnet-4-20250514
OLLAMA_MODEL Ollama model name llama3.1
MAX_CONCURRENT_REQUESTS Concurrent scrape limit 5
REQUEST_DELAY_MS Delay between requests (ms) 1000
REQUEST_TIMEOUT_S HTTP request timeout 30
MAX_RETRIES Max retry attempts on failure 3
CACHE_ENABLED Enable/disable response caching true
CACHE_TTL_SECONDS Cache time-to-live 3600
LOG_LEVEL Logging level INFO
LOG_JSON Output logs as JSON (for production) false
PROXY_URL HTTP/SOCKS proxy URL
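Put together, a typical .env for local development with Ollama might look like this (values taken from the defaults above):

```
DEFAULT_LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.1
MAX_CONCURRENT_REQUESTS=5
REQUEST_DELAY_MS=1000
CACHE_ENABLED=true
CACHE_TTL_SECONDS=3600
LOG_LEVEL=INFO
```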

🧪 Testing

# Run all tests
pytest tests/ -v

# Run with coverage report
pytest tests/ --cov=src --cov-report=html

# Run a specific test file
pytest tests/test_extractors.py -v

🐳 Docker

Build and run

docker build -t scrapeintel .
docker run -p 8000:8000 --env-file .env scrapeintel

Docker Compose (with optional local Ollama)

# API only
docker compose up

# API + local Ollama (free LLM inference)
docker compose --profile local-llm up

🚢 Deploy to GitHub

First-time setup

# Initialize git (if not already done)
git init
git add .
git commit -m "Initial commit: ScrapeIntel AI web scraper"

# Create a repo on GitHub, then:
git remote add origin https://github.com/YOUR_USERNAME/scrapeintel.git
git branch -M main
git push -u origin main

Important: never commit your .env file

The .gitignore already excludes .env. Double-check before pushing:

git status
# Make sure .env is NOT listed in the files to be committed

🛣️ Roadmap

  • Browser-based scraping (Playwright integration for JS-rendered pages)
  • Scheduling & cron jobs for recurring scrapes
  • Web dashboard UI for non-technical users
  • Webhook notifications on job completion
  • Plugin system for custom extractors
  • Vector storage for extracted data (RAG-ready)

📄 License

MIT License — see LICENSE for details.


🤝 Contributing

Contributions welcome! Please read CONTRIBUTING.md first.

  1. Fork the repo
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request
