viveknarayanan1408/Smart-Web-Scraper-AI

🕷️ ScrapeIntel — AI-Powered Smart Web Scraper

Python 3.10+ · FastAPI · License: MIT · Code style: black

Extract structured data from any website using natural language schemas.
Define what you want in plain English — ScrapeIntel figures out how to get it.

(Screenshot: ScrapeIntel Swagger UI)


✨ What It Does

ScrapeIntel combines traditional web scraping with LLM-powered extraction to turn unstructured web pages into clean, structured data. Instead of writing brittle CSS selectors or XPath queries, you describe the data you want in a JSON schema — and the AI handles the rest.

Key Features

  • Schema-Driven Extraction — Define output structure in JSON Schema or plain English
  • Multi-Page Crawling — Automatic pagination detection and following
  • Concurrent Scraping — Async architecture with configurable rate limiting
  • Multiple LLM Backends — OpenAI, Anthropic, or local models via Ollama (free)
  • REST API + CLI — Use via FastAPI endpoints or command line
  • Export Formats — JSON, CSV, Excel, or direct SQLite insert
  • Anti-Detection — Rotating user agents, request delays, proxy support
  • Retry & Error Handling — Exponential backoff with configurable retries
  • Caching — Disk-based page caching with TTL to avoid redundant requests
  • Structured Logging — Full observability with JSON or human-readable logs

🏗️ Architecture

┌─────────────────────────────────────────────────────┐
│                   FastAPI / CLI                      │
├─────────────────────────────────────────────────────┤
│                 Job Orchestrator                     │
│          (async task queue + scheduling)             │
├──────────┬──────────────┬───────────────────────────┤
│ Scraper  │  Extractor   │     Post-Processor        │
│ Engine   │  (LLM-based) │  (validation + export)    │
├──────────┼──────────────┼───────────────────────────┤
│ HTTP     │ OpenAI /     │  JSON / CSV / Excel /     │
│ Client   │ Anthropic /  │  SQLite / Webhook         │
│ (httpx)  │ Ollama       │                           │
└──────────┴──────────────┴───────────────────────────┘

🚀 Quick Start

Prerequisites

  • Python 3.10 or higher
  • An LLM API key (OpenAI or Anthropic) OR Ollama installed locally for free inference

Step 1 — Clone & Install

git clone https://github.com/YOUR_USERNAME/scrapeintel.git
cd scrapeintel

Create a virtual environment and install dependencies:

# Linux / macOS
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Windows (PowerShell)
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt

Step 2 — Configure your LLM provider

cp .env.example .env

Open .env in your editor and configure one of the following:

Option A — OpenAI (paid, fastest):

OPENAI_API_KEY=sk-your-openai-key-here
DEFAULT_LLM_PROVIDER=openai

Get a key at platform.openai.com/api-keys. Requires billing enabled (minimum $5 credit).

Option B — Anthropic (paid):

ANTHROPIC_API_KEY=sk-ant-your-anthropic-key-here
DEFAULT_LLM_PROVIDER=anthropic

Get a key at console.anthropic.com.

Option C — Ollama (free, runs locally):

# Install Ollama from https://ollama.com, then pull a model:
ollama pull llama3.1

Then set in .env:

DEFAULT_LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434

No API key needed. Requires ~4.7GB disk space for the model.

Step 3 — Run the tests

pytest tests/ -v

All tests should pass. This validates that your environment is set up correctly.

Step 4 — Start the API server

uvicorn src.api.main:app --reload --port 8000

You should see:

INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO:     Application startup complete.

Step 5 — Open the Swagger UI

Navigate to http://127.0.0.1:8000/docs in your browser. You'll see the interactive API documentation with all endpoints.

Step 6 — Run your first scrape

In Swagger UI, click POST /api/v1/scrape, then "Try it out", paste this body, and hit "Execute":

{
  "url": "https://books.toscrape.com",
  "schema": {
    "title": "Book title",
    "price": "Price as a float number",
    "rating": "Star rating as a word (One, Two, Three, Four, Five)",
    "in_stock": "Whether available, as a boolean"
  }
}

Expected response:

{
  "job_id": "a1b2c3d4",
  "status": "completed",
  "data": [
    {
      "title": "A Light in the Attic",
      "price": 51.77,
      "rating": "Three",
      "in_stock": true
    },
    {
      "title": "Tipping the Velvet",
      "price": 53.74,
      "rating": "One",
      "in_stock": true
    }
  ],
  "metadata": {
    "url": "https://books.toscrape.com",
    "scraped_at": "2025-03-19T14:30:00Z",
    "tokens_used": 1850,
    "latency_ms": 3200
  }
}

All 20 books on the page, cleanly extracted — no CSS selectors, no XPath, just plain English.
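The same request can be made programmatically. A stdlib-only Python sketch that builds the request body shown above (the actual network call is commented out so the snippet runs without a server; the response shape is assumed from the example above):

```python
import json
from urllib import request

def build_scrape_request(url, schema):
    """Build the JSON body expected by POST /api/v1/scrape."""
    return {"url": url, "schema": schema}

payload = build_scrape_request(
    "https://books.toscrape.com",
    {"title": "Book title", "price": "Price as a float number"},
)

# Against a running server you would send it like this:
# req = request.Request(
#     "http://127.0.0.1:8000/api/v1/scrape",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# with request.urlopen(req) as resp:
#     print(json.load(resp)["data"])

print(json.dumps(payload))
```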


📖 Usage Guide

API Endpoints

Method Endpoint Description
GET /api/v1/health Health check — shows configured providers and uptime
POST /api/v1/scrape Scrape a single page and extract data
POST /api/v1/crawl Crawl multiple pages with automatic pagination
POST /api/v1/batch Scrape multiple URLs concurrently
GET /api/v1/cache/stats View cache statistics
DELETE /api/v1/cache Clear all cached responses
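The cache endpoints above sit on top of a URL-keyed disk cache with TTL expiry. Conceptually, such a cache can be as simple as this sketch (a simplified illustration, not src/scrapers/cache.py itself):

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path

class DiskCache:
    """Tiny URL-keyed disk cache with TTL expiry (illustrative only)."""

    def __init__(self, directory, ttl_seconds=3600):
        self.dir = Path(directory)
        self.ttl = ttl_seconds

    def _path(self, url):
        # One file per URL, named by the hash of the URL.
        return self.dir / (hashlib.sha256(url.encode()).hexdigest() + ".json")

    def get(self, url):
        p = self._path(url)
        if not p.exists():
            return None
        entry = json.loads(p.read_text())
        if time.time() - entry["saved_at"] > self.ttl:
            p.unlink()  # expired: drop the file and report a miss
            return None
        return entry["body"]

    def set(self, url, body):
        self._path(url).write_text(
            json.dumps({"saved_at": time.time(), "body": body})
        )

with tempfile.TemporaryDirectory() as d:
    cache = DiskCache(d, ttl_seconds=60)
    cache.set("https://example.com", "<html>...</html>")
    print(cache.get("https://example.com"))  # cache hit within the TTL
```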

Scrape a single page (cURL)

curl -X POST http://localhost:8000/api/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://books.toscrape.com",
    "schema": {
      "title": "Book title",
      "price": "Price as a float"
    }
  }'

Crawl multiple pages

curl -X POST http://localhost:8000/api/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://books.toscrape.com",
    "schema": {
      "title": "Book title",
      "price": "Price as a float"
    },
    "max_pages": 3
  }'

This automatically detects the "next" pagination link and scrapes up to 3 pages (~60 books).
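Pagination detection of this kind commonly works by scanning the page for a "next" link and resolving it against the current URL. A simplified illustration using Python's stdlib html.parser (not the project's crawler code, which may use different heuristics):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class NextLinkFinder(HTMLParser):
    """Find the href of a link inside an element marked 'next' (simplified)."""

    def __init__(self):
        super().__init__()
        self.in_next = False
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Match <li class="next"> / <a rel="next"> style markers.
        if "next" in attrs.get("class", "").split() or attrs.get("rel") == "next":
            self.in_next = True
            if tag == "a":
                self.next_href = attrs.get("href")
        elif tag == "a" and self.in_next and self.next_href is None:
            self.next_href = attrs.get("href")

html = '<ul class="pager"><li class="next"><a href="catalogue/page-2.html">next</a></li></ul>'
finder = NextLinkFinder()
finder.feed(html)
next_url = urljoin("https://books.toscrape.com/", finder.next_href)
print(next_url)  # → https://books.toscrape.com/catalogue/page-2.html
```

The crawler would repeat this resolve-and-fetch loop until max_pages is reached or no "next" link is found.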

Batch scrape multiple URLs

curl -X POST http://localhost:8000/api/v1/batch \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
      "https://books.toscrape.com/catalogue/category/books/fiction_10/index.html"
    ],
    "schema": {
      "title": "Book title",
      "price": "Price as a float"
    }
  }'

Override the LLM provider per request

{
  "url": "https://books.toscrape.com",
  "schema": { "title": "Book title" },
  "options": {
    "llm_provider": "anthropic",
    "cache": false
  }
}

CLI Mode

The same features are available from the command line:

# Scrape a single page → save as JSON
python -m src.cli scrape \
  --url "https://books.toscrape.com" \
  --schema examples/product_schema.json \
  --output results.json

# Scrape a single page → save as CSV
python -m src.cli scrape \
  --url "https://books.toscrape.com" \
  --schema examples/product_schema.json \
  --output results.csv

# Crawl multiple pages
python -m src.cli crawl \
  --url "https://books.toscrape.com" \
  --schema examples/product_schema.json \
  --max-pages 5 \
  --output results.json

# Manage cache
python -m src.cli cache-stats
python -m src.cli cache-clear

📋 Example Schemas

Ready-made schemas are in the examples/ folder:

Product extraction (examples/product_schema.json):

{
  "name": "Product name or title",
  "price": "Price as a float number",
  "rating": "Average star rating as a float",
  "in_stock": "Whether the product is available (boolean)",
  "features": "List of key product features as an array"
}

News articles (examples/article_schema.json):

{
  "title": "Article headline",
  "author": "Author name(s)",
  "published_date": "Publication date in ISO format",
  "summary": "First 2-3 sentences summarizing the article"
}

Job listings (examples/job_listing_schema.json):

{
  "title": "Job title",
  "company": "Company name",
  "location": "City/state or Remote",
  "salary_min": "Minimum salary as integer",
  "skills": "List of required skills as an array"
}

You can create your own schema for any website — just describe each field in plain English.
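Descriptions like "Price as a float number" imply a target type, so extracted values still need coercion and validation before export. A hypothetical helper in that spirit (illustrative only, not the project's post-processor):

```python
import re

def coerce_price(raw):
    """Parse a price like '£51.77' or '$1,299.00' into a float (illustrative)."""
    if isinstance(raw, (int, float)):
        return float(raw)
    match = re.search(r"\d[\d,]*\.?\d*", str(raw))
    if not match:
        raise ValueError(f"no numeric price in {raw!r}")
    # Strip thousands separators before converting.
    return float(match.group().replace(",", ""))

print(coerce_price("£51.77"))     # → 51.77
print(coerce_price("$1,299.00"))  # → 1299.0
```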


📁 Project Structure

scrapeintel/
├── src/
│   ├── api/                  # FastAPI application
│   │   ├── main.py           # App entry point, middleware, error handlers
│   │   ├── routes.py         # API route definitions (scrape, crawl, batch)
│   │   └── deps.py           # Dependency injection (client, cache, extractor)
│   ├── scrapers/             # Web scraping engine
│   │   ├── http_client.py    # Async HTTP client with retries & rate limiting
│   │   ├── crawler.py        # Multi-page crawler with pagination detection
│   │   └── cache.py          # Disk-based response caching with TTL
│   ├── extractors/           # AI extraction layer
│   │   ├── base.py           # Abstract extractor + HTML preprocessing + prompt builder
│   │   ├── openai_extractor.py
│   │   ├── anthropic_extractor.py
│   │   └── ollama_extractor.py
│   ├── schemas/              # Pydantic models
│   │   ├── requests.py       # API request validation models
│   │   └── responses.py      # API response models
│   ├── utils/                # Shared utilities
│   │   ├── config.py         # Settings management (pydantic-settings)
│   │   ├── logging.py        # Structured logging (JSON + colored output)
│   │   └── exporters.py      # Output format handlers (JSON, CSV, Excel, SQLite)
│   └── cli.py                # Click-based CLI entry point
├── tests/                    # Test suite (pytest)
│   ├── test_scraper.py       # HTTP client & rate limiter tests
│   ├── test_extractors.py    # Extraction pipeline & HTML preprocessing tests
│   ├── test_exporters.py     # Export format tests
│   ├── test_cache.py         # Cache TTL & invalidation tests
│   └── test_api.py           # API endpoint & middleware tests
├── examples/                 # Example extraction schemas
├── docs/                     # Documentation & contributing guide
├── .github/workflows/ci.yml  # GitHub Actions CI/CD pipeline
├── .env.example              # Environment variable template
├── pyproject.toml            # Python packaging & tool configuration
├── requirements.txt          # Pinned dependencies
├── Dockerfile                # Container build
├── docker-compose.yml        # Docker Compose (API + optional Ollama)
└── LICENSE                   # MIT License

🔧 Configuration

All settings can be configured via environment variables or a .env file:

Variable Description Default
OPENAI_API_KEY OpenAI API key
ANTHROPIC_API_KEY Anthropic API key
OLLAMA_BASE_URL Ollama server URL http://localhost:11434
DEFAULT_LLM_PROVIDER Default LLM backend (openai, anthropic, ollama) openai
OPENAI_MODEL OpenAI model name gpt-4o-mini
ANTHROPIC_MODEL Anthropic model name claude-sonnet-4-20250514
OLLAMA_MODEL Ollama model name llama3.1
MAX_CONCURRENT_REQUESTS Concurrent scrape limit 5
REQUEST_DELAY_MS Delay between requests (ms) 1000
REQUEST_TIMEOUT_S HTTP request timeout 30
MAX_RETRIES Max retry attempts on failure 3
CACHE_ENABLED Enable/disable response caching true
CACHE_TTL_SECONDS Cache time-to-live 3600
LOG_LEVEL Logging level INFO
LOG_JSON Output logs as JSON (for production) false
PROXY_URL HTTP/SOCKS proxy URL
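Put together, a typical .env for local development with Ollama might look like this (values taken from the defaults above):

```
DEFAULT_LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.1
MAX_CONCURRENT_REQUESTS=5
REQUEST_DELAY_MS=1000
CACHE_ENABLED=true
CACHE_TTL_SECONDS=3600
LOG_LEVEL=INFO
```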

🧪 Testing

# Run all tests
pytest tests/ -v

# Run with coverage report
pytest tests/ --cov=src --cov-report=html

# Run a specific test file
pytest tests/test_extractors.py -v

🐳 Docker

Build and run

docker build -t scrapeintel .
docker run -p 8000:8000 --env-file .env scrapeintel

Docker Compose (with optional local Ollama)

# API only
docker compose up

# API + local Ollama (free LLM inference)
docker compose --profile local-llm up

🚢 Deploy to GitHub

First-time setup

# Initialize git (if not already done)
git init
git add .
git commit -m "Initial commit: ScrapeIntel AI web scraper"

# Create a repo on GitHub, then:
git remote add origin https://github.com/YOUR_USERNAME/scrapeintel.git
git branch -M main
git push -u origin main

Important: never commit your .env file

The .gitignore already excludes .env. Double-check before pushing:

git status
# Make sure .env is NOT listed in the files to be committed

🛣️ Roadmap

  • Browser-based scraping (Playwright integration for JS-rendered pages)
  • Scheduling & cron jobs for recurring scrapes
  • Web dashboard UI for non-technical users
  • Webhook notifications on job completion
  • Plugin system for custom extractors
  • Vector storage for extracted data (RAG-ready)

📄 License

MIT License — see LICENSE for details.


🤝 Contributing

Contributions welcome! Please read CONTRIBUTING.md first.

  1. Fork the repo
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request
