Quarry

A CLI-first web extraction runtime for browser-driven crawling and durable ingestion

Quarry is a web extraction runtime for imperative, browser-backed scraping workflows. It is designed for adversarial sites, bespoke extraction logic, and long-lived ETL pipelines where correctness, observability, and durability matter more than convenience abstractions.

Quarry executes user-authored Puppeteer scripts under a strict runtime contract, streams observations incrementally, and hands off persistence to an external substrate (typically Lode). It is intentionally not a crawler framework, workflow engine, or low-code platform.


Installation

CLI

mise install github:justapithecus/quarry@0.5.0

SDK

npx jsr add @justapithecus/quarry-sdk

See PUBLIC_API.md for the full setup and usage guide.


What Quarry Is

  • A runtime, not a framework
  • CLI-first, not embedded
  • Designed for imperative Puppeteer scripts
  • Explicit about ordering, backpressure, and failure
  • Agnostic to storage, retries, scheduling, and downstream processing
  • Not a crawling DSL or workflow orchestrator
  • Not a SaaS scraper or low-code pipeline

Quarry's responsibility ends at observing and emitting what happened.


Conceptual Model

Quarry enforces a clean boundary between extraction logic and ingestion mechanics:

User Script (Puppeteer, imperative)
        ↓
emit.*  (stable event contract)
        ↓
Quarry Runtime
        ↓
Ingestion Policy (strict, buffered, etc.)
        ↓
Persistence Substrate (e.g. Lode)

Scripts emit observations.
Policies decide how those observations are handled.
Persistence decides what survives.
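
To make the boundary concrete, here is a minimal TypeScript sketch of the script-facing side of that contract, inferred from the Quick Example below. The interface names and any members not shown in that example are illustrative assumptions, not the SDK's actual definitions; see PUBLIC_API.md for the real contract.

import type { Browser, Page } from 'puppeteer'

// What a script observes: a typed item plus its free-form payload.
interface ItemObservation {
  item_type: string
  data: unknown
}

// The emit.* boundary: scripts describe what happened, nothing more.
interface EmitApi {
  item(observation: ItemObservation): Promise<void>
  runComplete(): Promise<void>
}

// The context handed to a user script: real Puppeteer objects plus the job payload.
interface ScriptContext {
  page: Page
  browser: Browser
  job: Record<string, unknown>
  emit: EmitApi
}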


Pipeline Composition

Quarry is designed to be composed around, not extended from:

# Extract
quarry run \
  --script streeteasy.ts \
  --run-id "streeteasy-$(date +%s)" \
  --source nyc-rent \
  --category streeteasy \
  --job '{"url": "https://streeteasy.com/rentals"}' \
  --storage-backend fs \
  --storage-path /var/quarry/data \
  --policy buffered

# Transform (outside Quarry)
nyc-rent-transform \
  --input /var/quarry/data/source=nyc-rent \
  --output /var/quarry/normalized

# Index / analyze (outside Quarry)
nyc-rent-index \
  --input /var/quarry/normalized

Quarry owns only the extraction step.
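
The transform and index commands above are placeholders for ordinary external programs. As a purely illustrative sketch, assuming the fs backend writes newline-delimited JSON record files under the source=nyc-rent partition shown above (the on-disk layout, file extension, and record format here are assumptions, not documented Quarry behavior), a transform step could be as plain as:

import { mkdir, readdir, readFile, writeFile } from 'node:fs/promises'
import { dirname, join } from 'node:path'

// Hypothetical downstream transform, entirely outside Quarry.
// ASSUMPTION: emitted items land as .ndjson files under the input partition.
async function transform(inputDir: string, outputDir: string): Promise<void> {
  for (const entry of await readdir(inputDir, { recursive: true })) {
    if (!entry.endsWith('.ndjson')) continue
    const records = (await readFile(join(inputDir, entry), 'utf8'))
      .split('\n')
      .filter(Boolean)
      .map((line) => JSON.parse(line))
    const outPath = join(outputDir, entry)
    await mkdir(dirname(outPath), { recursive: true })
    await writeFile(outPath, records.map((r) => JSON.stringify(normalize(r))).join('\n'))
  }
}

// Placeholder normalization; the real logic is domain-specific.
function normalize(record: unknown): unknown {
  return record
}

await transform('/var/quarry/data/source=nyc-rent', '/var/quarry/normalized')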


Quick Example

Quarry scripts are freestanding programs, not libraries.

They should:

  • Accept all inputs via the job payload
  • Use real Puppeteer objects (page, browser)
  • Emit all outputs via emit.*
  • Avoid shared global state
  • Remain agnostic to durability and retries

Example

import type { QuarryContext } from '@justapithecus/quarry-sdk'

export default async function run(ctx: QuarryContext): Promise<void> {
  await ctx.page.goto(ctx.job.url)

  const listings = await ctx.page.evaluate(() => {
    // Runs in the browser context: scrape the DOM and return
    // an array of plain, serializable listing objects
    return []
  })

  for (const listing of listings) {
    await ctx.emit.item({
      item_type: 'listing',
      data: listing
    })
  }

  await ctx.emit.runComplete()
}

Scripts are imperative, explicit, and boring by design.
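
Because all inputs arrive via the job payload, scripts typically validate it before doing any work. A minimal sketch, assuming ctx.job is simply the parsed JSON object passed to --job (the ListingJob shape and parseJob helper below are illustrative, not part of the SDK):

import type { QuarryContext } from '@justapithecus/quarry-sdk'

// Hypothetical job shape for the listing script above; not defined by the SDK.
interface ListingJob {
  url: string
}

// Narrow the untyped job payload to what this script expects, failing fast on bad input.
function parseJob(job: unknown): ListingJob {
  if (typeof job !== 'object' || job === null || typeof (job as { url?: unknown }).url !== 'string') {
    throw new Error('job payload must include a string "url"')
  }
  return job as ListingJob
}

export default async function run(ctx: QuarryContext): Promise<void> {
  const job = parseJob(ctx.job)
  await ctx.page.goto(job.url)
  // ...extract and emit items as in the example above
  await ctx.emit.runComplete()
}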


Key Concepts


Documentation

Resource      Path
Guides        docs/guides/
Contracts     docs/contracts/
Public API    PUBLIC_API.md
SDK           sdk/README.md
Examples      examples/
Support       SUPPORT.md
Changelog     CHANGELOG.md

Status

Quarry is under active development.

  • Contracts frozen, SDK stable
  • FS storage supported, S3 experimental
  • Platforms: linux/darwin, x64/arm64

Breaking changes are gated by contract versioning.


License

Apache 2.0