Extract structured data from any website using natural language schemas.
Define what you want in plain English — ScrapeIntel figures out how to get it.
ScrapeIntel combines traditional web scraping with LLM-powered extraction to turn unstructured web pages into clean, structured data. Instead of writing brittle CSS selectors or XPath queries, you describe the data you want in a JSON schema — and the AI handles the rest.
- Schema-Driven Extraction — Define output structure in JSON Schema or plain English
- Multi-Page Crawling — Automatic pagination detection and following
- Concurrent Scraping — Async architecture with configurable rate limiting
- Multiple LLM Backends — OpenAI, Anthropic, or local models via Ollama (free)
- REST API + CLI — Use via FastAPI endpoints or command line
- Export Formats — JSON, CSV, Excel, or direct SQLite insert
- Anti-Detection — Rotating user agents, request delays, proxy support
- Retry & Error Handling — Exponential backoff with configurable retries
- Caching — Disk-based page caching with TTL to avoid redundant requests
- Structured Logging — Full observability with JSON or human-readable logs
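The retry behavior listed above can be sketched as exponential backoff with configurable retries. The helper below is illustrative only — the function name and defaults are assumptions, not the project's actual internals:

```python
def backoff_delays(max_retries: int = 3, base_s: float = 1.0, cap_s: float = 30.0) -> list[float]:
    """Delay (in seconds) before each retry attempt: base * 2**attempt, capped."""
    return [min(base_s * (2 ** attempt), cap_s) for attempt in range(max_retries)]

print(backoff_delays())   # [1.0, 2.0, 4.0]
print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

In practice a little random jitter is usually added on top so concurrent workers don't retry in lockstep.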
```
┌─────────────────────────────────────────────────────┐
│                    FastAPI / CLI                    │
├─────────────────────────────────────────────────────┤
│                  Job Orchestrator                   │
│           (async task queue + scheduling)           │
├──────────┬──────────────┬───────────────────────────┤
│ Scraper  │ Extractor    │ Post-Processor            │
│ Engine   │ (LLM-based)  │ (validation + export)     │
├──────────┼──────────────┼───────────────────────────┤
│ HTTP     │ OpenAI /     │ JSON / CSV / Excel /      │
│ Client   │ Anthropic /  │ SQLite / Webhook          │
│ (httpx)  │ Ollama       │                           │
└──────────┴──────────────┴───────────────────────────┘
```
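The extractor layer in the middle of the diagram is pluggable: each LLM backend fulfills one interface. A sketch of what that contract might look like — class and method names here are illustrative, not the actual contents of src/extractors/base.py:

```python
from abc import ABC, abstractmethod

class BaseExtractor(ABC):
    """Contract every LLM backend fulfills: HTML + schema in, records out."""

    @abstractmethod
    def extract(self, html: str, schema: dict) -> list[dict]:
        """Return one dict per extracted item, keyed by the schema's fields."""

class EchoExtractor(BaseExtractor):
    """Toy backend for demonstration: returns the schema keys with empty values."""
    def extract(self, html: str, schema: dict) -> list[dict]:
        return [{field: None for field in schema}]

records = EchoExtractor().extract("<html></html>", {"title": "Book title"})
print(records)  # [{'title': None}]
```

Swapping providers is then just a matter of instantiating a different subclass; the orchestrator and post-processor never need to know which backend produced the records.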
- Python 3.10 or higher
- An LLM API key (OpenAI or Anthropic) OR Ollama installed locally for free inference
```bash
git clone https://github.com/YOUR_USERNAME/scrapeintel.git
cd scrapeintel
```

Create a virtual environment and install dependencies:
```bash
# Linux / macOS
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

```powershell
# Windows (PowerShell)
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
```

Then copy the environment template:

```bash
cp .env.example .env
```

Open `.env` in your editor and configure one of the following:
Option A — OpenAI (paid, fastest):

```ini
OPENAI_API_KEY=sk-your-openai-key-here
DEFAULT_LLM_PROVIDER=openai
```

Get a key at platform.openai.com/api-keys. Requires billing enabled (minimum $5 credit).
Option B — Anthropic (paid):

```ini
ANTHROPIC_API_KEY=sk-ant-your-anthropic-key-here
DEFAULT_LLM_PROVIDER=anthropic
```

Get a key at console.anthropic.com.
Option C — Ollama (free, runs locally):

```bash
# Install Ollama from https://ollama.com, then:
ollama pull llama3.1
```

```ini
DEFAULT_LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
```

No API key needed. Requires ~4.7 GB of disk space for the model.
Verify the installation by running the test suite:

```bash
pytest tests/ -v
```

All tests should pass. This validates that your environment is set up correctly.
Start the API server:

```bash
uvicorn src.api.main:app --reload --port 8000
```

You should see:

```
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO:     Application startup complete.
```
Navigate to http://127.0.0.1:8000/docs in your browser. You'll see the interactive API documentation with all endpoints.
In Swagger UI, click POST /api/v1/scrape → Try it out, paste this body, and hit Execute:
```json
{
  "url": "https://books.toscrape.com",
  "schema": {
    "title": "Book title",
    "price": "Price as a float number",
    "rating": "Star rating as a word (One, Two, Three, Four, Five)",
    "in_stock": "Whether available, as a boolean"
  }
}
```

Expected response:
```json
{
  "job_id": "a1b2c3d4",
  "status": "completed",
  "data": [
    {
      "title": "A Light in the Attic",
      "price": 51.77,
      "rating": "Three",
      "in_stock": true
    },
    {
      "title": "Tipping the Velvet",
      "price": 53.74,
      "rating": "One",
      "in_stock": true
    }
  ],
  "metadata": {
    "url": "https://books.toscrape.com",
    "scraped_at": "2025-03-19T14:30:00Z",
    "tokens_used": 1850,
    "latency_ms": 3200
  }
}
```

All 20 books on the page, cleanly extracted — no CSS selectors, no XPath, just plain English.
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/v1/health` | Health check — shows configured providers and uptime |
| `POST` | `/api/v1/scrape` | Scrape a single page and extract data |
| `POST` | `/api/v1/crawl` | Crawl multiple pages with automatic pagination |
| `POST` | `/api/v1/batch` | Scrape multiple URLs concurrently |
| `GET` | `/api/v1/cache/stats` | View cache statistics |
| `DELETE` | `/api/v1/cache` | Clear all cached responses |
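Besides curl and Swagger, any HTTP client can call these endpoints. A minimal Python sketch using only the standard library — the base URL and payload shape follow the quick-start example above, and the request is shown commented out since it needs the server running:

```python
import json
from urllib import request

API_BASE = "http://localhost:8000/api/v1"  # adjust if the server runs elsewhere

def build_scrape_payload(url: str, schema: dict) -> bytes:
    """Serialize a request body in the shape POST /api/v1/scrape expects."""
    return json.dumps({"url": url, "schema": schema}).encode("utf-8")

payload = build_scrape_payload(
    "https://books.toscrape.com",
    {"title": "Book title", "price": "Price as a float"},
)

# With the server running, send it like this:
# req = request.Request(f"{API_BASE}/scrape", data=payload,
#                       headers={"Content-Type": "application/json"})
# with request.urlopen(req) as resp:
#     result = json.load(resp)
#     print(result["status"], len(result["data"]))
```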
Single-page scrape:

```bash
curl -X POST http://localhost:8000/api/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://books.toscrape.com",
    "schema": {
      "title": "Book title",
      "price": "Price as a float"
    }
  }'
```

Multi-page crawl:

```bash
curl -X POST http://localhost:8000/api/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://books.toscrape.com",
    "schema": {
      "title": "Book title",
      "price": "Price as a float"
    },
    "max_pages": 3
  }'
```

This automatically detects the "next" pagination link and scrapes up to 3 pages (~60 books).
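One plausible way such pagination detection can work — a sketch of the general technique, not necessarily what the project's crawler actually does — is to scan the page for an anchor marked as "next" (by `rel` attribute or link text) and resolve it against the page URL:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class NextLinkFinder(HTMLParser):
    """Collect the href of the first <a> tag that looks like a 'next page' link."""
    def __init__(self):
        super().__init__()
        self.next_href = None
        self._in_anchor = False
        self._current_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        self._current_href = attrs.get("href")
        # rel="next" is the strongest signal
        if attrs.get("rel") == "next" and self._current_href:
            self.next_href = self.next_href or self._current_href
        self._in_anchor = self._current_href is not None

    def handle_data(self, data):
        # Fall back on link text like "next" or a right-arrow glyph
        if self._in_anchor and self.next_href is None \
                and data.strip().lower() in {"next", "next »", "»"}:
            self.next_href = self._current_href

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

def find_next_page(html: str, base_url: str):
    parser = NextLinkFinder()
    parser.feed(html)
    return urljoin(base_url, parser.next_href) if parser.next_href else None

html = '<ul class="pager"><li class="next"><a href="catalogue/page-2.html">next</a></li></ul>'
print(find_next_page(html, "https://books.toscrape.com/"))
# https://books.toscrape.com/catalogue/page-2.html
```

The crawler would repeat this — fetch, extract, find next — until the link disappears or `max_pages` is reached.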
Batch scrape of multiple URLs:

```bash
curl -X POST http://localhost:8000/api/v1/batch \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
      "https://books.toscrape.com/catalogue/category/books/fiction_10/index.html"
    ],
    "schema": {
      "title": "Book title",
      "price": "Price as a float"
    }
  }'
```

Any request can also override defaults with an `options` object, e.g. to switch providers or bypass the cache:

```json
{
  "url": "https://books.toscrape.com",
  "schema": { "title": "Book title" },
  "options": {
    "llm_provider": "anthropic",
    "cache": false
  }
}
```

The same features are available from the command line:
```bash
# Scrape a single page → save as JSON
python -m src.cli scrape \
  --url "https://books.toscrape.com" \
  --schema examples/product_schema.json \
  --output results.json

# Scrape a single page → save as CSV
python -m src.cli scrape \
  --url "https://books.toscrape.com" \
  --schema examples/product_schema.json \
  --output results.csv

# Crawl multiple pages
python -m src.cli crawl \
  --url "https://books.toscrape.com" \
  --schema examples/product_schema.json \
  --max-pages 5 \
  --output results.json

# Manage cache
python -m src.cli cache-stats
python -m src.cli cache-clear
```

Ready-made schemas are in the examples/ folder:
Product extraction (examples/product_schema.json):

```json
{
  "name": "Product name or title",
  "price": "Price as a float number",
  "rating": "Average star rating as a float",
  "in_stock": "Whether the product is available (boolean)",
  "features": "List of key product features as an array"
}
```

News articles (examples/article_schema.json):

```json
{
  "title": "Article headline",
  "author": "Author name(s)",
  "published_date": "Publication date in ISO format",
  "summary": "First 2-3 sentences summarizing the article"
}
```

Job listings (examples/job_listing_schema.json):

```json
{
  "title": "Job title",
  "company": "Company name",
  "location": "City/state or Remote",
  "salary_min": "Minimum salary as integer",
  "skills": "List of required skills as an array"
}
```

You can create your own schema for any website — just describe each field in plain English.
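For instance, a custom schema for real-estate listings follows the same pattern: field name on the left, a plain-English description on the right. The fields below are purely illustrative — this schema does not ship with the project. Sketched as a Python script that writes the file the CLI's --schema flag can consume:

```python
import json

# Hypothetical schema for property listings (illustrative field names)
listing_schema = {
    "address": "Street address of the property",
    "price": "Asking price as an integer, without currency symbols",
    "bedrooms": "Number of bedrooms as an integer",
    "square_feet": "Interior area in square feet as an integer",
    "listing_agent": "Name of the listing agent or agency",
}

# Save it next to the bundled examples for use with --schema
with open("listing_schema.json", "w") as f:
    json.dump(listing_schema, f, indent=2)
```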
```
scrapeintel/
├── src/
│   ├── api/                    # FastAPI application
│   │   ├── main.py             # App entry point, middleware, error handlers
│   │   ├── routes.py           # API route definitions (scrape, crawl, batch)
│   │   └── deps.py             # Dependency injection (client, cache, extractor)
│   ├── scrapers/               # Web scraping engine
│   │   ├── http_client.py      # Async HTTP client with retries & rate limiting
│   │   ├── crawler.py          # Multi-page crawler with pagination detection
│   │   └── cache.py            # Disk-based response caching with TTL
│   ├── extractors/             # AI extraction layer
│   │   ├── base.py             # Abstract extractor + HTML preprocessing + prompt builder
│   │   ├── openai_extractor.py
│   │   ├── anthropic_extractor.py
│   │   └── ollama_extractor.py
│   ├── schemas/                # Pydantic models
│   │   ├── requests.py         # API request validation models
│   │   └── responses.py        # API response models
│   ├── utils/                  # Shared utilities
│   │   ├── config.py           # Settings management (pydantic-settings)
│   │   ├── logging.py          # Structured logging (JSON + colored output)
│   │   └── exporters.py        # Output format handlers (JSON, CSV, Excel, SQLite)
│   └── cli.py                  # Click-based CLI entry point
├── tests/                      # Test suite (pytest)
│   ├── test_scraper.py         # HTTP client & rate limiter tests
│   ├── test_extractors.py      # Extraction pipeline & HTML preprocessing tests
│   ├── test_exporters.py       # Export format tests
│   ├── test_cache.py           # Cache TTL & invalidation tests
│   └── test_api.py             # API endpoint & middleware tests
├── examples/                   # Example extraction schemas
├── docs/                       # Documentation & contributing guide
├── .github/workflows/ci.yml    # GitHub Actions CI/CD pipeline
├── .env.example                # Environment variable template
├── pyproject.toml              # Python packaging & tool configuration
├── requirements.txt            # Pinned dependencies
├── Dockerfile                  # Container build
├── docker-compose.yml          # Docker Compose (API + optional Ollama)
└── LICENSE                     # MIT License
```
All settings are configured via environment variables or a `.env` file:
| Variable | Description | Default |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key | — |
| `ANTHROPIC_API_KEY` | Anthropic API key | — |
| `OLLAMA_BASE_URL` | Ollama server URL | `http://localhost:11434` |
| `DEFAULT_LLM_PROVIDER` | Default LLM backend (`openai`, `anthropic`, `ollama`) | `openai` |
| `OPENAI_MODEL` | OpenAI model name | `gpt-4o-mini` |
| `ANTHROPIC_MODEL` | Anthropic model name | `claude-sonnet-4-20250514` |
| `OLLAMA_MODEL` | Ollama model name | `llama3.1` |
| `MAX_CONCURRENT_REQUESTS` | Concurrent scrape limit | `5` |
| `REQUEST_DELAY_MS` | Delay between requests (ms) | `1000` |
| `REQUEST_TIMEOUT_S` | HTTP request timeout (s) | `30` |
| `MAX_RETRIES` | Max retry attempts on failure | `3` |
| `CACHE_ENABLED` | Enable/disable response caching | `true` |
| `CACHE_TTL_SECONDS` | Cache time-to-live (s) | `3600` |
| `LOG_LEVEL` | Logging level | `INFO` |
| `LOG_JSON` | Output logs as JSON (for production) | `false` |
| `PROXY_URL` | HTTP/SOCKS proxy URL | — |
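The caching settings above (CACHE_ENABLED, CACHE_TTL_SECONDS) describe a time-to-live store: entries expire a fixed number of seconds after insertion. A minimal in-memory sketch of that idea — the real implementation is disk-based, and these class and method names are illustrative:

```python
import time

class TTLCache:
    """Entries expire ttl_seconds after insertion; the clock is injectable for testing."""
    def __init__(self, ttl_seconds: float = 3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (value, stored_at)

    def set(self, key, value):
        self._store[key] = (value, self.clock())

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, stored_at = entry
        if self.clock() - stored_at > self.ttl:
            del self._store[key]  # lazily evict expired entries
            return default
        return value

# Simulated clock so the example is deterministic
now = [0.0]
cache = TTLCache(ttl_seconds=10, clock=lambda: now[0])
cache.set("https://example.com", "<html>...</html>")
print(cache.get("https://example.com") is not None)  # True
now[0] = 11.0
print(cache.get("https://example.com"))  # None (expired)
```

With CACHE_ENABLED=true, a second scrape of the same URL within the TTL window is served from disk instead of hitting the site (and the LLM) again.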
```bash
# Run all tests
pytest tests/ -v

# Run with coverage report
pytest tests/ --cov=src --cov-report=html

# Run a specific test file
pytest tests/test_extractors.py -v
```

```bash
docker build -t scrapeintel .
docker run -p 8000:8000 --env-file .env scrapeintel
```

```bash
# API only
docker compose up

# API + local Ollama (free LLM inference)
docker compose --profile local-llm up
```

```bash
# Initialize git (if not already done)
git init
git add .
git commit -m "Initial commit: ScrapeIntel AI web scraper"

# Create a repo on GitHub, then:
git remote add origin https://github.com/YOUR_USERNAME/scrapeintel.git
git branch -M main
git push -u origin main
```

The .gitignore already excludes `.env`. Double-check before pushing:

```bash
git status
# Make sure .env is NOT listed in the files to be committed
```

- Browser-based scraping (Playwright integration for JS-rendered pages)
- Scheduling & cron jobs for recurring scrapes
- Web dashboard UI for non-technical users
- Webhook notifications on job completion
- Plugin system for custom extractors
- Vector storage for extracted data (RAG-ready)
MIT License — see LICENSE for details.
Contributions welcome! Please read CONTRIBUTING.md first.
- Fork the repo
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
