Schema-first, self-healing HTML data extraction powered by LLMs. Define what you want with a Zod schema, and Pluckr figures out how to extract it — no selectors to write, no scraper to maintain.
Traditional scrapers break when websites change. Pluckr uses an LLM to generate CSS selectors, validates them against your Zod schema, caches what works, and automatically heals when pages change. You write the schema once and never touch selectors again.

```bash
npm install @pluckr/core @ai-sdk/google zod
```

```ts
import { Pluckr } from '@pluckr/core'
import { google } from '@ai-sdk/google'
import { z } from 'zod'

const pluckr = new Pluckr({
  model: google('gemini-2.5-pro'),
})

const result = await pluckr.extract({
  html: '<html>...</html>', // BYO HTML from any source
  schema: z.object({
    title: z.string(),
    price: z.coerce.number().positive(),
    inStock: z.coerce.boolean(),
  }),
  cacheKey: 'product-page',
})

if (result.success) {
  console.log(result.data)
  // { title: "A Light in the Attic", price: 51.77, inStock: true }
}

await pluckr.close()
```

- You provide raw HTML and a Zod schema describing the data you want
- An LLM generates CSS selectors for each field via an agentic tool loop
- Selectors are tested and validated before being committed
- Working selectors are cached — subsequent extractions are instant (zero LLM calls)
- If selectors break due to page changes, Pluckr self-heals automatically
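
The steps above can be pictured as a cache-then-heal loop. The following is only a minimal sketch of that idea, not Pluckr's actual implementation: `resolveSelectors`, `Selectors`, `Validate`, and `Generate` are hypothetical names, the LLM is replaced by a plain async function, and schema validation is replaced by a boolean callback.

```typescript
// Hypothetical sketch of a cache-then-heal flow. None of these names come
// from Pluckr's source; they only illustrate the steps listed above.
type Selectors = Record<string, string>;
type Validate = (selectors: Selectors, html: string) => boolean;
type Generate = (html: string) => Promise<Selectors>;

interface FlowResult {
  selectors: Selectors | null; // null when no working selectors were found
  llmCalls: number; // how many times the generator (LLM stand-in) ran
}

async function resolveSelectors(
  cache: Map<string, Selectors>,
  cacheKey: string,
  html: string,
  validate: Validate,
  generate: Generate,
): Promise<FlowResult> {
  const cached = cache.get(cacheKey);

  // Fast path: cached selectors still pass validation, so no LLM call.
  if (cached !== undefined && validate(cached, html)) {
    return { selectors: cached, llmCalls: 0 };
  }

  // Slow path (first run, or healing after a page change):
  // regenerate, validate, and only then commit to the cache.
  const fresh = await generate(html);
  if (!validate(fresh, html)) {
    return { selectors: null, llmCalls: 1 };
  }
  cache.set(cacheKey, fresh);
  return { selectors: fresh, llmCalls: 1 };
}
```

In this sketch the same code path serves both the first extraction and a later repair: a cache miss and a failed validation are treated identically, which is one plausible way to get the self-healing behavior described above.
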

| Package | Description |
|---|---|
| `@pluckr/core` | Core extraction library |
| `@pluckr/sqlite` | SQLite storage: persistent caching to disk |
| `@pluckr/redis` | Redis storage: shared caching for distributed deployments |

```bash
npm install @pluckr/core zod
```

Pick an AI provider (any Vercel AI SDK provider works):

```bash
npm install @ai-sdk/google # or @ai-sdk/anthropic, @ai-sdk/openai
```

Add a storage backend for persistent caching (optional):

```bash
npm install @pluckr/sqlite # file-based, single-process
npm install @pluckr/redis # distributed, multi-process
```

- Schema-first: define what you want with Zod, get fully typed results
- Self-healing: selectors automatically repair when pages change
- BYO HTML: bring your own HTML from any source (fetch, Puppeteer, Playwright, curl)
- BYO model: works with any Vercel AI SDK model (Google, Anthropic, OpenAI, etc.)
- Pluggable storage: in-memory by default, SQLite and Redis included, or implement `Storage` for any backend
- No exceptions: `extract()` returns a discriminated union (`success: true | false`)

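
To illustrate what a discriminated-union result buys the caller, here is a small sketch of narrowing on `success`. It is written under assumptions: `ExtractResult` and `describeResult` are hypothetical names, and the `error` field on the failure branch is invented for the example, since only `success` and `data` are shown in this README.

```typescript
// Hypothetical result shape. Only `success` and `data` appear in the README;
// the `error` field name is an assumption made for this sketch.
type ExtractResult<T> =
  | { success: true; data: T }
  | { success: false; error: string };

interface Product {
  title: string;
  price: number;
  inStock: boolean;
}

function describeResult(result: ExtractResult<Product>): string {
  if (result.success === true) {
    // Narrowed to the success branch: `data` is fully typed here.
    const { title, price } = result.data;
    return `${title} costs ${price.toFixed(2)}`;
  } else {
    // Narrowed to the failure branch: `data` does not exist on this type.
    return `extraction failed: ${result.error}`;
  }
}
```

Because the failure case is part of the return type, the compiler forces the caller to check `success` before touching `data`, so no `try`/`catch` is needed around the call.
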
In-memory (default). No setup; the cache lives for the lifetime of the process:

```ts
const pluckr = new Pluckr({ model })
```

SQLite: persists selectors to disk across restarts:

```ts
import { SqliteStorage } from '@pluckr/sqlite'

const pluckr = new Pluckr({
  model,
  storage: new SqliteStorage(), // .pluckr/cache.db
})
```

Redis: shared cache for distributed deployments:

```ts
import { RedisStorage } from '@pluckr/redis'

const pluckr = new Pluckr({
  model,
  storage: new RedisStorage({ url: 'redis://localhost:6379', ttl: 86400 }),
})
```

Standalone, runnable examples for different use cases:

| Example | Description |
|---|---|
| `with-fetch` | Simplest setup, using native `fetch()` |
| `with-cheerio` | Pre-process HTML before extraction |
| `with-playwright` | Extract from JS-rendered SPAs |
| `with-puppeteer` | Stealth browser with bot-detection avoidance |
| `with-sqlite` | Persistent caching across runs |
| `with-redis` | Shared caching with TTL |

To run an example:

```bash
cd examples/with-fetch
npm install
echo "GOOGLE_GENERATIVE_AI_API_KEY=your-key" > .env
npm start
```

To build and test all packages:

```bash
npm install # install all workspace dependencies
npm run build # build all packages
npm test # run all tests (65 tests, fully mocked)
```

License: MIT