A CLI-first web extraction runtime for browser-driven crawling and durable ingestion
Quarry is a web extraction runtime for imperative, browser-backed scraping workflows. It is designed for adversarial sites, bespoke extraction logic, and long-lived ETL pipelines where correctness, observability, and durability matter more than convenience abstractions.
Quarry executes user-authored Puppeteer scripts under a strict runtime contract, streams observations incrementally, and hands off persistence to an external substrate (typically Lode). It is intentionally not a crawler framework, workflow engine, or low-code platform.
```sh
mise install github:justapithecus/quarry@0.5.0
npx jsr add @justapithecus/quarry-sdk
```

See PUBLIC_API.md for the full setup and usage guide.
- A runtime, not a framework
- CLI-first, not embedded
- Designed for imperative Puppeteer scripts
- Explicit about ordering, backpressure, and failure
- Agnostic to storage, retries, scheduling, and downstream processing
- Not a crawling DSL or workflow orchestrator
- Not a SaaS scraper or low-code pipeline
Quarry's responsibility ends at observing and emitting what happened.
Quarry enforces a clean boundary between extraction logic and ingestion mechanics:
```
User Script (Puppeteer, imperative)
        ↓
emit.* (stable event contract)
        ↓
Quarry Runtime
        ↓
Ingestion Policy (strict, buffered, etc.)
        ↓
Persistence Substrate (e.g. Lode)
```
Scripts emit observations.
Policies decide how those observations are handled.
Persistence decides what survives.
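This boundary can be pictured as a small event vocabulary flowing into a policy. A minimal TypeScript sketch, using hypothetical event shapes and an in-memory sink standing in for persistence (the real contract is versioned under docs/contracts/):

```typescript
// Hypothetical event shapes; the real contract lives in docs/contracts/.
type QuarryEvent =
  | { kind: 'item'; item_type: string; data: unknown }
  | { kind: 'run_complete' }

// A policy decides how observations are handled, never what they mean.
interface IngestionPolicy {
  handle(event: QuarryEvent): Promise<void>
}

// An in-memory sink standing in for a persistence substrate such as Lode.
class MemorySink implements IngestionPolicy {
  readonly events: QuarryEvent[] = []
  async handle(event: QuarryEvent): Promise<void> {
    this.events.push(event)
  }
}

const sink = new MemorySink()
await sink.handle({ kind: 'item', item_type: 'listing', data: { price: 3200 } })
await sink.handle({ kind: 'run_complete' })
console.log(sink.events.length) // 2
```

The point of the shape is the one-way flow: scripts never see the policy, and policies never interpret the data.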
Quarry is designed to be composed around, not extended from:
```sh
# Extract
quarry run \
  --script streeteasy.ts \
  --run-id "streeteasy-$(date +%s)" \
  --source nyc-rent \
  --category streeteasy \
  --job '{"url": "https://streeteasy.com/rentals"}' \
  --storage-backend fs \
  --storage-path /var/quarry/data \
  --policy buffered

# Transform (outside Quarry)
nyc-rent-transform \
  --input /var/quarry/data/source=nyc-rent \
  --output /var/quarry/normalized

# Index / analyze (outside Quarry)
nyc-rent-index \
  --input /var/quarry/normalized
```

Quarry owns only the extraction step.
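The run identifier and input path in the pipeline above follow simple conventions. A sketch of helpers a wrapper script might use to derive them; both are hypothetical and not part of the Quarry CLI, and the on-disk layout is owned by the persistence substrate (Lode):

```typescript
// Hypothetical helper mirroring the Hive-style `source=` partition path
// visible in the pipeline above. The actual layout is owned by Lode.
function partitionPath(storagePath: string, source: string): string {
  return `${storagePath}/source=${source}`
}

// Mirrors the `streeteasy-$(date +%s)` convention from the extract step.
function runId(prefix: string): string {
  return `${prefix}-${Math.floor(Date.now() / 1000)}`
}

console.log(partitionPath('/var/quarry/data', 'nyc-rent'))
// /var/quarry/data/source=nyc-rent
console.log(runId('streeteasy'))
```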
Quarry scripts are freestanding programs, not libraries.
They should:
- Accept all inputs via the job payload
- Use real Puppeteer objects (`page`, `browser`)
- Emit all outputs via `emit.*`
- Avoid shared global state
- Remain agnostic to durability and retries
```ts
import type { QuarryContext } from '@justapithecus/quarry-sdk'

export default async function run(ctx: QuarryContext): Promise<void> {
  await ctx.page.goto(ctx.job.url)

  const listings = await ctx.page.evaluate(() => {
    // scrape DOM
    return []
  })

  for (const listing of listings) {
    await ctx.emit.item({
      item_type: 'listing',
      data: listing,
    })
  }

  await ctx.emit.runComplete()
}
```

Scripts are imperative, explicit, and boring by design.
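Because the contract is this small, the emit logic of a script can be exercised without a browser. A sketch using a stub context that mirrors only the fields the example uses (`job`, `emit`); the authoritative shape is `QuarryContext` from the SDK:

```typescript
// Stub mirroring only the fields the example script uses. The authoritative
// shape is `QuarryContext` from @justapithecus/quarry-sdk.
interface StubContext {
  job: { url: string }
  emit: {
    item(event: { item_type: string; data: unknown }): Promise<void>
    runComplete(): Promise<void>
  }
}

// The emit loop from the example, factored so it runs against the stub.
async function emitListings(ctx: StubContext, listings: unknown[]): Promise<void> {
  for (const listing of listings) {
    await ctx.emit.item({ item_type: 'listing', data: listing })
  }
  await ctx.emit.runComplete()
}

const emitted: string[] = []
await emitListings(
  {
    job: { url: 'https://streeteasy.com/rentals' },
    emit: {
      item: async (e) => void emitted.push(e.item_type),
      runComplete: async () => void emitted.push('run_complete'),
    },
  },
  [{ price: 3200 }, { price: 2800 }],
)
console.log(emitted) // [ 'listing', 'listing', 'run_complete' ]
```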
- Emit API — all script output flows through `emit.*` → docs/guides/emit.md
- Policies — strict or buffered ingestion control → docs/guides/policy.md
- Storage — FS and S3 backends via Lode → docs/guides/lode.md
- Proxies — pool-based rotation with multiple strategies → docs/guides/proxy.md
- Streaming — chunked artifacts with backpressure → docs/guides/streaming.md
- Configuration — YAML project defaults via `--config` → docs/guides/cli.md
- Integration — webhook adapter for downstream triggers → docs/guides/integration.md
- Run Lifecycle — terminal states and exit codes → docs/guides/run.md
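One plausible reading of the strict-versus-buffered distinction, with illustrative semantics only (the real behavior is specified in docs/guides/policy.md):

```typescript
// Illustrative only; the real semantics live in docs/guides/policy.md.
type Persist = (batch: string[]) => Promise<void>

// strict: hand each observation to persistence immediately.
async function strict(events: string[], persist: Persist): Promise<void> {
  for (const e of events) await persist([e])
}

// buffered: accumulate and flush in batches, trading latency for throughput.
async function buffered(events: string[], persist: Persist, size: number): Promise<void> {
  const buf: string[] = []
  for (const e of events) {
    buf.push(e)
    if (buf.length >= size) await persist(buf.splice(0))
  }
  if (buf.length > 0) await persist(buf)
}

const calls: number[] = []
const persist: Persist = async (b) => void calls.push(b.length)
await strict(['a', 'b', 'c'], persist)      // three writes of 1
await buffered(['a', 'b', 'c'], persist, 2) // a write of 2, then 1
console.log(calls) // [ 1, 1, 1, 2, 1 ]
```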
| Resource | Path |
|---|---|
| Guides | docs/guides/ |
| Contracts | docs/contracts/ |
| Public API | PUBLIC_API.md |
| SDK | sdk/README.md |
| Examples | examples/ |
| Support | SUPPORT.md |
| Changelog | CHANGELOG.md |
Quarry is under active development.
- Contracts frozen, SDK stable
- FS storage supported, S3 experimental
- Platforms: linux/darwin, x64/arm64
Breaking changes are gated by contract versioning.
Apache 2.0