Quarry

A CLI-first web extraction runtime for browser-driven crawling and durable ingestion

Quarry is a web extraction runtime for imperative, browser-backed scraping workflows. It is designed for adversarial sites, bespoke extraction logic, and long-lived ETL pipelines where correctness, observability, and durability matter more than convenience abstractions.

Quarry executes user-authored Puppeteer scripts under a strict runtime contract, streams observations incrementally, and hands off persistence to an external substrate (typically Lode). It is intentionally not a crawler framework, workflow engine, or low-code platform.


Installation

CLI

mise install github:justapithecus/quarry@0.5.0

SDK

npx jsr add @justapithecus/quarry-sdk

See PUBLIC_API.md for the full setup and usage guide.


What Quarry Is

  • A runtime, not a framework
  • CLI-first, not embedded
  • Designed for imperative Puppeteer scripts
  • Explicit about ordering, backpressure, and failure
  • Agnostic to storage, retries, scheduling, and downstream processing
  • Not a crawling DSL or workflow orchestrator
  • Not a SaaS scraper or low-code pipeline

Quarry's responsibility ends at observing and emitting what happened.


Conceptual Model

Quarry enforces a clean boundary between extraction logic and ingestion mechanics:

User Script (Puppeteer, imperative)
        ↓
emit.*  (stable event contract)
        ↓
Quarry Runtime
        ↓
Ingestion Policy (strict, buffered, etc.)
        ↓
Persistence Substrate (e.g. Lode)

Scripts emit observations.
Policies decide how those observations are handled.
Persistence decides what survives.
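
To make the boundary concrete, here is a minimal TypeScript sketch of the script-facing side of that contract, inferred from the Quick Example below. The interface names and any members not shown in that example are illustrative assumptions, not the SDK's actual definitions; see PUBLIC_API.md for the real contract.

import type { Browser, Page } from 'puppeteer'

// What a script observes: a typed item plus its free-form payload.
interface ItemObservation {
  item_type: string
  data: unknown
}

// The emit.* boundary: scripts describe what happened, nothing more.
interface EmitApi {
  item(observation: ItemObservation): Promise<void>
  runComplete(): Promise<void>
}

// The context handed to a user script: real Puppeteer objects plus the job payload.
interface ScriptContext {
  page: Page
  browser: Browser
  job: Record<string, unknown>
  emit: EmitApi
}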


Pipeline Composition

Quarry is designed to be composed around, not extended from:

# Extract
quarry run \
  --script streeteasy.ts \
  --run-id "streeteasy-$(date +%s)" \
  --source nyc-rent \
  --category streeteasy \
  --job '{"url": "https://streeteasy.com/rentals"}' \
  --storage-backend fs \
  --storage-path /var/quarry/data \
  --policy buffered

# Transform (outside Quarry)
nyc-rent-transform \
  --input /var/quarry/data/source=nyc-rent \
  --output /var/quarry/normalized

# Index / analyze (outside Quarry)
nyc-rent-index \
  --input /var/quarry/normalized

Quarry owns only the extraction step.
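
The transform and index commands above are placeholders for ordinary external programs. As a purely illustrative sketch, assuming the fs backend writes newline-delimited JSON record files under the source=nyc-rent partition shown above (the on-disk layout, file extension, and record format here are assumptions, not documented Quarry behavior), a transform step could be as plain as:

import { mkdir, readdir, readFile, writeFile } from 'node:fs/promises'
import { dirname, join } from 'node:path'

// Hypothetical downstream transform, entirely outside Quarry.
// ASSUMPTION: emitted items land as .ndjson files under the input partition.
async function transform(inputDir: string, outputDir: string): Promise<void> {
  for (const entry of await readdir(inputDir, { recursive: true })) {
    if (!entry.endsWith('.ndjson')) continue
    const records = (await readFile(join(inputDir, entry), 'utf8'))
      .split('\n')
      .filter(Boolean)
      .map((line) => JSON.parse(line))
    const outPath = join(outputDir, entry)
    await mkdir(dirname(outPath), { recursive: true })
    await writeFile(outPath, records.map((r) => JSON.stringify(normalize(r))).join('\n'))
  }
}

// Placeholder normalization; the real logic is domain-specific.
function normalize(record: unknown): unknown {
  return record
}

await transform('/var/quarry/data/source=nyc-rent', '/var/quarry/normalized')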


Quick Example

Quarry scripts are freestanding programs, not libraries.

They should:

  • Accept all inputs via the job payload
  • Use real Puppeteer objects (page, browser)
  • Emit all outputs via emit.*
  • Avoid shared global state
  • Remain agnostic to durability and retries

Example

import type { QuarryContext } from '@justapithecus/quarry-sdk'

export default async function run(ctx: QuarryContext): Promise<void> {
  await ctx.page.goto(ctx.job.url)

  const listings = await ctx.page.evaluate(() => {
    // Runs in the browser context: scrape the DOM and return
    // an array of plain, serializable listing objects
    return []
  })

  for (const listing of listings) {
    await ctx.emit.item({
      item_type: 'listing',
      data: listing
    })
  }

  await ctx.emit.runComplete()
}

Scripts are imperative, explicit, and boring by design.
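
Because all inputs arrive via the job payload, scripts typically validate it before doing any work. A minimal sketch, assuming ctx.job is simply the parsed JSON object passed to --job (the ListingJob shape and parseJob helper below are illustrative, not part of the SDK):

import type { QuarryContext } from '@justapithecus/quarry-sdk'

// Hypothetical job shape for the listing script above; not defined by the SDK.
interface ListingJob {
  url: string
}

// Narrow the untyped job payload to what this script expects, failing fast on bad input.
function parseJob(job: unknown): ListingJob {
  if (typeof job !== 'object' || job === null || typeof (job as { url?: unknown }).url !== 'string') {
    throw new Error('job payload must include a string "url"')
  }
  return job as ListingJob
}

export default async function run(ctx: QuarryContext): Promise<void> {
  const job = parseJob(ctx.job)
  await ctx.page.goto(job.url)
  // ...extract and emit items as in the example above
  await ctx.emit.runComplete()
}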


Key Concepts


Documentation

Resource      Path
Guides        docs/guides/
Contracts     docs/contracts/
Public API    PUBLIC_API.md
SDK           sdk/README.md
Examples      examples/
Support       SUPPORT.md
Changelog     CHANGELOG.md

Status

Quarry is under active development.

  • Contracts frozen, SDK stable
  • FS storage supported, S3 experimental
  • Platforms: linux/darwin, x64/arm64

Breaking changes are gated by contract versioning.


License

Apache 2.0