AI Web Crawler

A production-ready, modular intelligent web crawler powered by local LLMs via Ollama.

Built for AMD GPU (ROCm) workstations running Ollama at http://localhost:11434. Supports three workflow modes: simple, LangGraph (stateful), and CrewAI (multi-agent).

Prerequisites

Python 3.10+ (3.12 recommended; crewai requires Python ≤3.13)
Ollama installed and running: https://ollama.ai/download
ROCm (for AMD GPU acceleration) — Ollama handles this automatically on Linux

1. Clone & Install

git clone https://github.com/hectorsvill/ai-crawler.git
cd ai-crawler

# Create a virtual environment
# Use python3.12 if available; python3 also works on 3.13, and on 3.14 everything except crewai installs
python3.12 -m venv .venv   # preferred
# python3 -m venv .venv    # fallback if python3.12 is not in PATH

source .venv/bin/activate

# Install Python dependencies
pip install -r requirements.txt

# Note: on Python 3.14, crewai will fail to install (it requires ≤3.13).
# All other features work. crewai mode degrades silently to simple mode.

# Install Playwright browser (for JS-heavy sites)
playwright install chromium

Post-clone verification

After installing, confirm everything is wired up correctly:

# Syntax-check all source files (no Ollama needed)
python -c "import ast, pathlib; [ast.parse(f.read_text()) for f in pathlib.Path('.').rglob('*.py') if '.venv' not in str(f)]"

# Run the full test suite (no Ollama or network needed)
pip install pytest pytest-asyncio   # one-time
pytest tests/ -v
# Expected: 143 passed
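
The one-liner above stops at the first file that fails to parse and does not say which file it was. A slightly longer sketch that reports every failing file could look like the following. The function name `find_syntax_errors` and its `skip_dirs` parameter are illustrative and are not part of the repository.

```python
import ast
import pathlib


def find_syntax_errors(root=".", skip_dirs=(".venv",)):
    """Return (path, message) pairs for every .py file under root that fails to parse."""
    failures = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        # Skip the virtual environment and any other excluded directory.
        if any(part in skip_dirs for part in path.parts):
            continue
        try:
            # Passing filename makes the error point at the real file.
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            failures.append((str(path), f"line {exc.lineno}: {exc.msg}"))
    return failures
```

Calling `find_syntax_errors()` from the repository root returns an empty list when every source file parses.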

2. Pull Ollama Models

# Start Ollama (if not already running)
ollama serve &

# Required: extraction model (~4GB)
ollama pull qwen2.5:7b

# Required: fast navigation model (~1GB)
ollama pull qwen2.5:1.5b

# Verify models are available
ollama list
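
The same check can be scripted against Ollama's `/api/tags` endpoint, which lists locally installed models. The sketch below is not part of the repository; `missing_models` and `fetch_tags` are hypothetical helper names, and the required model list simply mirrors the two pulls above.

```python
import json
import urllib.request

REQUIRED_MODELS = ("qwen2.5:7b", "qwen2.5:1.5b")


def missing_models(tags_payload, required=REQUIRED_MODELS):
    """Return the required model names that are absent from an /api/tags response."""
    installed = {model.get("name", "") for model in tags_payload.get("models", [])}
    return [name for name in required if name not in installed]


def fetch_tags(base_url="http://localhost:11434", timeout=5.0):
    """Fetch the installed-model list from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)
```

With Ollama running, `missing_models(fetch_tags())` returns an empty list once both models have been pulled.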

Alternative models (update default_config.yaml to use them):

ollama pull phi4-mini      # fast, efficient (~2.5GB)
ollama pull llama3.2:3b    # good balance (~2GB)
ollama pull mistral:7b     # high quality extraction (~4.4GB)

3. Quick Test

# Verify everything works — crawls Hacker News and prints top stories
python main.py crawl \
  --goal "Get the top 5 stories with titles and scores" \
  --start-url "https://news.ycombinator.com" \
  --workflow simple \
  --max-pages 3

Expected output: a summary table with extracted HN stories and token usage stats.

4. Usage Examples

Crypto — Latest Binance Coin Listings

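The sites involved (Binance, CoinGecko) are JS-heavy, so Playwright is required. Use a capable model (gemma3 or better), since the default small models struggle with structured output. The --model flag overrides only the extractor model, so the navigator model is set separately through the CRAWLER_OLLAMA__NAVIGATOR_MODEL environment variable.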

CRAWLER_OLLAMA__NAVIGATOR_MODEL=gemma3:latest \
python main.py crawl \
  --goal "Find the latest Binance coin listings and new token launches — extract coin name, symbol, listing date, and description" \
  --start-url "https://www.coingecko.com/en/exchanges/binance" \
  --start-url "https://cryptorank.io/exchanges/binance/new-listings" \
  --workflow simple \
  --max-pages 8 \
  --max-depth 2 \
  --model gemma3:latest

Tip: For JS-heavy sites (Binance, CoinGecko), Playwright must be installed:
playwright install chromium

Research / Knowledge Base Building

python main.py crawl \
  --goal "Build a comprehensive knowledge base on solar energy startups in Europe" \
  --start-url "https://solarenergyeurope.org" \
  --start-url "https://www.pv-magazine.com" \
  --workflow langgraph \
  --max-pages 100 \
  --max-depth 3

E-Commerce Price Monitoring

python main.py crawl \
  --goal "Extract product names, prices, and availability for all laptops" \
  --start-url "https://example-shop.com/laptops" \
  --workflow simple \
  --max-pages 20

Competitive Intelligence

python main.py crawl \
  --goal "Analyze the features, pricing, and positioning of cloud database providers" \
  --start-url "https://aws.amazon.com/rds/" \
  --start-url "https://cloud.google.com/sql" \
  --start-url "https://azure.microsoft.com/products/azure-sql-database" \
  --workflow crewai \
  --max-pages 50

Lead Generation

python main.py crawl \
  --goal "Generate a list of B2B SaaS companies in fintech with contact information" \
  --start-url "https://www.crunchbase.com/hub/fintech-startups" \
  --workflow crewai \
  --max-pages 200

Content Aggregation

python main.py crawl \
  --goal "Collect all blog posts about machine learning published in 2024" \
  --start-url "https://example-blog.com/ml" \
  --workflow langgraph \
  --max-pages 500 \
  --max-depth 5

RAG Dataset Building

python main.py crawl \
  --goal "Crawl the Python documentation and extract all function signatures and descriptions" \
  --start-url "https://docs.python.org/3/library/" \
  --workflow langgraph \
  --max-pages 300 \
  --max-depth 4

5. Resuming a Crawl

If a crawl is interrupted, resume it using:

# List recent sessions to find the session ID
python main.py list-sessions

# Resume by session ID
python main.py resume <session-id>

# Or use --resume flag to auto-resume the latest session
python main.py crawl --goal "..." --start-url "..." --resume

6. Exporting Results

# Export structured JSON extractions (default)
python main.py export <session-id> --output results.json

# Export every crawled page as clean Markdown — ready for RAG, vector DBs, and LLM pipelines
python main.py export <session-id> --format markdown --output pages.md

JSON output — one record per page, LLM-extracted structured data:

[{"data": {...}, "confidence": 0.85, "schema": "product_listing"}, ...]
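
Because every record carries a `confidence` and a `schema` field, the export is easy to post-process. The following is a minimal sketch, assuming the record shape shown above; the helper names and the 0.8 threshold are illustrative choices and not project defaults.

```python
import json
from collections import defaultdict


def load_export(path="results.json"):
    """Load the list of records written by the export command."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def group_confident_records(records, min_confidence=0.8):
    """Group `data` payloads by schema, keeping records at or above the threshold."""
    grouped = defaultdict(list)
    for record in records:
        if record.get("confidence", 0.0) >= min_confidence:
            grouped[record.get("schema", "unknown")].append(record.get("data", {}))
    return dict(grouped)
```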

Markdown output — one section per crawled page, clean and LLM-ready:

# Crawled Pages — Session f5dbfea0
*8 pages · exported by ai-crawler*

## pokemon · GitHub Topics · GitHub
**URL:** https://github.com/topics/pokemon
**Fetched:** 2026-03-31T04:41:36

# Search code, repositories, users, issues, pull requests...
Here are 5,866 public repositories matching this topic...
...

---
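
For RAG ingestion the Markdown export usually has to be split back into one chunk per page. The sketch below assumes the layout shown in the sample above: a `## title` line followed by a `**URL:**` line, an optional `**Fetched:**` line, and a `---` line closing each page. The function name `split_export` is hypothetical, and a page whose own content contains a bare `---` line would be split incorrectly by this approach.

```python
import re


def split_export(markdown_text):
    """Split an exported Markdown file into a list of page dicts."""
    pages = []
    for block in re.split(r"^---[ \t]*$", markdown_text, flags=re.MULTILINE):
        lines = block.strip().splitlines()
        # Find the "## title" line that is immediately followed by the URL line.
        for i in range(len(lines) - 1):
            if lines[i].startswith("## ") and lines[i + 1].startswith("**URL:**"):
                break
        else:
            continue  # No page header in this block (for example trailing whitespace).
        page = {
            "title": lines[i][3:].strip(),
            "url": lines[i + 1][len("**URL:**"):].strip(),
            "fetched": None,
        }
        body_start = i + 2
        if body_start < len(lines) and lines[body_start].startswith("**Fetched:**"):
            page["fetched"] = lines[body_start][len("**Fetched:**"):].strip()
            body_start += 1
        page["text"] = "\n".join(lines[body_start:]).strip()
        pages.append(page)
    return pages
```

Each returned dict can then be embedded or indexed on its own, with `url` kept as metadata.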

7. Changing Models

Override models per-run via CLI:

python main.py crawl \
  --goal "..." \
  --start-url "..." \
  --model mistral:7b  # override extractor model

Or edit default_config.yaml:

ollama:
  navigator_model: "qwen2.5:1.5b"   # fast decisions
  extractor_model: "qwen2.5:7b"     # quality extraction
  router_model: "qwen2.5:7b"        # workflow selection

8. Custom Configuration

Copy and modify the default config:

cp default_config.yaml my_config.yaml

Then pass it to any command:

python main.py crawl --goal "..." --start-url "..." --config my_config.yaml

Key settings:

crawl:
  max_depth: 5              # how deep to follow links
  max_pages: 500            # session page limit
  rate_limit_per_domain: 2  # requests/sec per domain
  delay_range: [1.0, 3.0]   # random delay between requests (seconds)
  domain_allowlist:         # leave empty for no restriction
    - "*.example.com"
  domain_denylist:          # always block these
    - "ads.example.com"
  use_playwright_for:       # force JS rendering for these domains
    - "*.react-app.com"
    - "*.nextjs-site.io"
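
To make the allowlist and denylist semantics concrete, here is a minimal sketch of how such patterns could be evaluated. It is not the code in `crawler/respectful.py`; the function names are hypothetical. It encodes three rules stated in this README: the denylist always wins, an empty allowlist means no restriction, and a `*.example.com` pattern also matches `example.com` itself.

```python
from urllib.parse import urlsplit


def host_matches(host, pattern):
    """Match a hostname against an exact name or a `*.domain` wildcard."""
    host, pattern = host.lower(), pattern.lower()
    if pattern.startswith("*."):
        parent = pattern[2:]
        # The wildcard covers the parent domain itself and any subdomain of it.
        return host == parent or host.endswith("." + parent)
    return host == pattern


def is_allowed(url, allowlist=(), denylist=()):
    """Return True when the URL's host passes both lists."""
    host = urlsplit(url).hostname or ""
    if any(host_matches(host, pattern) for pattern in denylist):
        return False  # The denylist always wins.
    # An empty allowlist means "no restriction".
    return not allowlist or any(host_matches(host, pattern) for pattern in allowlist)
```

Matching on `"." + parent` rather than a plain suffix keeps `notexample.com` from slipping through a `*.example.com` rule.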

9. Architecture Overview

Goal → Router → Workflow
                  ├── Simple:     fetch → navigate → extract → store (loop)
                  ├── LangGraph:  StateGraph with conditional edges + SQLite checkpointing
                  └── CrewAI:     Navigator + Extractor + Researcher + Summarizer agents
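
The Simple workflow's fetch → navigate → extract → store loop can be pictured with a small sketch. This is a simplified breadth-first version with injected callables, not the code in `workflows/`; the function name `simple_crawl` and its signature are assumptions, and the real navigator also issues deepen/backtrack/complete decisions that are omitted here.

```python
from collections import deque


def simple_crawl(start_urls, fetch, navigate, extract, store, max_pages=3, max_depth=2):
    """Run a fetch -> navigate -> extract -> store loop over a URL frontier."""
    frontier = deque((url, 0) for url in start_urls)
    seen = set(start_urls)
    pages_done = 0
    while frontier and pages_done < max_pages:
        url, depth = frontier.popleft()
        page = fetch(url)
        if page is None:
            continue  # Fetch failed or the page was skipped.
        pages_done += 1
        # Navigate: choose which outgoing links are worth following.
        if depth < max_depth:
            for link in navigate(url, page):
                if link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))
        # Extract structured data from the page, then persist it.
        store(url, extract(url, page))
    return pages_done
```

Passing the four stages in as callables keeps the loop testable without Ollama or network access, in the same spirit as the project's test suite.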

Each workflow uses:

Navigator agent (qwen2.5:1.5b) — scores relevance, selects links
Extractor agent (qwen2.5:7b) — extracts structured JSON from pages
Respectful crawling — robots.txt, rate limits, domain filters, depth limits
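
Two of the respectful-crawling pieces listed above, robots.txt checks and per-domain rate limits, can be sketched with the standard library alone. The class and function names below are hypothetical, the `ai-crawler` user agent in the usage note is an assumed value, and the real implementation in `crawler/respectful.py` may differ (for example it is likely async).

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, requests_per_second=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self.clock = clock
        self.sleep = sleep
        self._last_request = {}

    def wait(self, url):
        """Block until the host may be hit again and return the seconds slept."""
        host = urlsplit(url).hostname or ""
        now = self.clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self.sleep(delay)
        # Record when this request is actually allowed to go out.
        self._last_request[host] = now + delay
        return delay
```

A fetcher would call `build_robots(text).can_fetch("ai-crawler", url)` before `limiter.wait(url)`, and skip the URL when the first call returns False.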

10. Environment Variable Overrides

Override any config setting without modifying YAML:

export CRAWLER_OLLAMA__BASE_URL="http://remote-server:11434"
export CRAWLER_CRAWL__MAX_PAGES=1000
export CRAWLER_STORAGE__DB_PATH="/data/crawl.db"
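
The variable names follow a `CRAWLER_<SECTION>__<KEY>` pattern, with a double underscore separating the config section from the key. The sketch below shows one way such names could be folded into a nested config dict. It is an illustration of the naming scheme only; the project's `config.py` may resolve overrides differently, and `apply_env_overrides` and `_coerce` are hypothetical names.

```python
import os

PREFIX = "CRAWLER_"


def _coerce(raw, current):
    """Cast the string to the type of the existing value, when there is one."""
    if isinstance(current, bool):
        return raw.strip().lower() in ("1", "true", "yes", "on")
    if isinstance(current, int):
        return int(raw)
    if isinstance(current, float):
        return float(raw)
    return raw


def apply_env_overrides(config, environ=None):
    """Return a copy of `config` with CRAWLER_SECTION__KEY variables applied."""
    environ = os.environ if environ is None else environ
    merged = {section: dict(values) for section, values in config.items()}
    for name, raw in environ.items():
        if not name.startswith(PREFIX):
            continue
        path = name[len(PREFIX):].lower().split("__")
        if len(path) != 2:
            continue  # This sketch only handles section__key names.
        section, key = path
        current = merged.setdefault(section, {}).get(key)
        merged[section][key] = _coerce(raw, current)
    return merged
```

Environment values are always strings, so the sketch casts each one to the type of the value it replaces; `CRAWLER_CRAWL__MAX_PAGES=1000` ends up as the integer 1000.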

11. Running the Test Suite

Tests require no Ollama instance or network connection — they cover all pure-Python logic.

# Install test dependencies (one-time)
pip install pytest pytest-asyncio

# Run all tests
pytest tests/ -v

# Run a specific module
pytest tests/test_url_utils.py -v
pytest tests/test_models.py -v

Test coverage includes:

| Module | What's tested |
| --- | --- |
| `utils/url.py` | URL normalization, fragment stripping, port removal, query sorting |
| `storage/models.py` | Score clamping, action validation, Pydantic validators |
| `llm/client.py` | Token counting, chunking, LLM cache, JSON fence stripping |
| `crawler/engine.py` | Link extraction, title parsing, HTML→Markdown conversion |
| `crawler/respectful.py` | Domain filter (allowlist/denylist/wildcard), rate limiter, Playwright detector |
| `agents/` | Link filtering, history summarization, extraction merging |
| `storage/db.py` | Content hash determinism and correctness |
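
As an illustration of what the `utils/url.py` tests exercise, here is a minimal URL normalization sketch built on `urllib.parse`. It is not the repository's implementation. It assumes that only the scheme and host are lowercased (paths can be case-sensitive), that a bare site root keeps its single `/`, and it drops any userinfo component.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a canonical form of the URL suitable for duplicate detection."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it differs from the scheme default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    # Strip trailing slashes but keep a bare "/" for the site root.
    path = parts.path.rstrip("/") or "/"
    # Sort query parameters so that parameter order does not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The empty last element drops the fragment.
    return urlunsplit((scheme, host, path, query, ""))
```

With this sketch, `HTTPS://Example.com:443/a/b/?z=1&a=2#frag` and `https://example.com/a/b?a=2&z=1` normalize to the same string, so only one of them would be queued.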

12. Recent Improvements

| Area | Change |
| --- | --- |
| URL normalization | All URLs are canonicalized before queuing (lowercase, strip fragments/default ports/trailing slashes, sort query params), preventing duplicate crawls of semantically identical URLs |
| Score clamping | LLM-returned scores outside [0, 1] are clamped rather than rejected, preventing crawl interruption from out-of-range model output |
| Action validation | `NavigatorDecision.action` values not in `{deepen, backtrack, complete}` safely default to `deepen` |
| Multi-chunk extraction | Long pages now use `extract_chunks` so data spread across multiple context windows is captured and merged |
| Domain filter | `*.example.com` patterns now correctly match `example.com` itself (parent domain) |
| Playwright timeout | Added outer `asyncio.wait_for` around the full browser lifecycle to prevent a hang if the browser process stalls |
| MIME logging | Skipped non-HTML responses are now logged at INFO level with the actual content-type |
| robots.txt logging | Distinguishes a 404 (no robots.txt) from network errors and from unexpected HTTP status codes |
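
The score clamping and action validation rows describe a tolerant approach to model output: repair the value instead of raising. A dependency-free sketch of that idea follows. The repository uses Pydantic validators in `storage/models.py`; the `Decision` dataclass here is a hypothetical stand-in for `NavigatorDecision`, and mapping non-numeric scores to 0.0 is an assumption of this sketch.

```python
from dataclasses import dataclass

VALID_ACTIONS = ("deepen", "backtrack", "complete")


def clamp_score(value, low=0.0, high=1.0):
    """Coerce a model-returned score into [low, high] instead of rejecting it."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return low  # Assumption: unparseable scores fall back to the minimum.
    if number != number:  # NaN compares unequal to itself.
        return low
    return max(low, min(high, number))


@dataclass
class Decision:
    score: float = 0.0
    action: str = "deepen"

    def __post_init__(self):
        self.score = clamp_score(self.score)
        # Unknown actions default to "deepen" so the crawl keeps going.
        if self.action not in VALID_ACTIONS:
            self.action = "deepen"
```

A response such as `{"score": 1.7, "action": "explore"}` then becomes a usable decision with score 1.0 and action `deepen`, rather than an exception that interrupts the crawl.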

13. Troubleshooting

Ollama not responding:

ollama serve  # start the Ollama server
curl http://localhost:11434/api/tags  # verify it's running

Playwright browser not found:

playwright install chromium
playwright install-deps  # install system dependencies

LangGraph / CrewAI not installed: The crawler degrades gracefully to simple mode. Install with:

pip install langgraph "crewai[tools]"   # quotes stop zsh from treating the brackets as a glob

crewai fails to install (Python 3.14+): crewai>=0.80 requires Python ≤3.13. Either install Python 3.12 (brew install python@3.12) and recreate the venv, or use the crawler without crewai — it falls back to simple mode automatically.

# Recreate venv with Python 3.12
brew install python@3.12
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

AMD GPU not utilized:

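# Load a model so Ollama has something to place on the GPU
ollama run qwen2.5:1.5b "hello"
# Check where the loaded model is running (the PROCESSOR column should report GPU)
ollama ps
# Watch GPU usage
watch -n1 rocm-smi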
