ScrapLLM

Intelligent e-commerce scraping platform powered by LLMs to extract structured product data from dynamic websites without hardcoded selectors.

ScrapLLM combines scraping infrastructure, HTML preprocessing, LLM-based structured extraction, validation, and monitoring into a production-ready pipeline.

Overview

ScrapLLM is designed to automate large-scale product data extraction across heterogeneous e-commerce websites.

The system:

Scrapes product pages from 50+ websites
Cleans and preprocesses raw HTML to reduce noise
Uses LLMs to extract structured product fields
Validates and normalizes outputs
Monitors performance, latency, and cost across multiple models

Results

Metric	Result
Structured extraction accuracy	95%
Token consumption reduction	60%
API cost reduction	$400/month
Response time improvement	40%

Problem

Traditional scraping systems rely on manual CSS selector mapping per website. This approach:

Does not scale across many brands
Breaks when websites update layouts
Requires continuous maintenance
Increases operational cost

ScrapLLM eliminates the need for static selectors by leveraging LLMs to dynamically interpret and extract structured product data directly from cleaned HTML.

Architecture

High-level flow:

URL Ingestion
     ↓
Scraping Layer (HTML Retrieval)
     ↓
HTML Cleaning & Preprocessing
     ↓
LLM-Based Structured Extraction
     ↓
Output Validation & Normalization
     ↓
Monitoring & Evaluation

The system is modular, extensible, and designed for multi-model experimentation.

Key Engineering Decisions

HTML Preprocessing Pipeline

To reduce token usage and cost:

Removed non-semantic DOM elements
Extracted product-relevant sections
Applied intelligent truncation strategies
Reduced prompt size by 60% without degrading extraction quality

Structured Output Enforcement

JSON schema-based prompting
Output validation layer
Automatic retry on malformed responses
Normalization of price, currency, and availability formats

Multi-Model Evaluation

Integrated monitoring to:

Compare multiple LLM providers
Track latency and token usage
Measure cost efficiency
Evaluate structured field accuracy

This enabled data-driven model selection and performance optimization.

Tech Stack

Layer	Technology
Backend	FastAPI
AI Orchestration	LangChain
Monitoring	Langfuse
Frontend	React
Containerization	Docker
Cloud	GCP

Impact

Eliminated manual product data entry
Scaled extraction across 50+ heterogeneous websites
Reduced API costs through preprocessing optimization
Improved response latency through structured benchmarking
Built a reusable architecture adaptable to new brands without selector engineering

Demos

Scrapping

Check out the Link

Benchmark

Check out the Link

Future Improvements

Distributed scraping workers
Queue-based processing architecture
Extraction confidence scoring
Hybrid selector + LLM fallback strategy
Domain-specific fine-tuned extraction model

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.github		.github
backend		backend
cloud-functions/hakrawler		cloud-functions/hakrawler
frontend		frontend
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
Makefile		Makefile
README.md		README.md
docker-compose.yml		docker-compose.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ScrapLLM

Overview

Results

Problem

Architecture

Key Engineering Decisions

HTML Preprocessing Pipeline

Structured Output Enforcement

Multi-Model Evaluation

Tech Stack

Impact

Demos

Scrapping

Benchmark

Future Improvements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ScrapLLM

Overview

Results

Problem

Architecture

Key Engineering Decisions

HTML Preprocessing Pipeline

Structured Output Enforcement

Multi-Model Evaluation

Tech Stack

Impact

Demos

Scrapping

Benchmark

Future Improvements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages