Pipeline that extracts open roles from a curated set of public job boards, filters to data/engineering/GIS/geospatial titles, writes Parquet to Cloudflare R2, and exposes stable MotherDuck views for full inventory and personally relevant roles.
Source kinds and ATS slugs covered today:
Prerequisites:
- W0 acceptance state: R2 bucket reachable, MotherDuck-side R2 secret in place (proven via scripts/smoke_r2.py and scripts/smoke_motherduck.py).
- .env populated with R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY, R2_ACCOUNT_ID, R2_BUCKET, MOTHERDUCK_TOKEN.
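A quick pre-flight check for the five variables listed above could look like the following. This is a minimal sketch, not part of the repository: the helper name missing_env_vars is hypothetical, and it assumes the .env values have already been loaded into the process environment by whatever tooling you use.

```python
import os

# The five variable names come from the prerequisites list above.
REQUIRED_ENV_VARS = (
    "R2_ACCESS_KEY_ID",
    "R2_SECRET_ACCESS_KEY",
    "R2_ACCOUNT_ID",
    "R2_BUCKET",
    "MOTHERDUCK_TOKEN",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper: `env` defaults to os.environ, but any mapping
    can be passed in, which keeps the check easy to test.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]
```

Calling missing_env_vars() before a run and stopping when the returned list is non-empty gives a clearer failure than an authentication error partway through a pipeline.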
Run a single source kind:
# Greenhouse slice (onx + planetlabs) → s3://$R2_BUCKET/raw/greenhouse/<slug>/
uv run python -m dream_job_radar.pipelines.greenhouse
# Ashby slice → s3://$R2_BUCKET/raw/ashby/<slug>/
uv run python -m dream_job_radar.pipelines.ashby
# Sitemap-monitor slice → s3://$R2_BUCKET/raw/sitemap/<slug>/
# (0-row outcomes are honest; the resource yields only roles whose
# titles match the v1 keyword filter.)
uv run python -m dream_job_radar.pipelines.sitemap
# Page-monitor slice → s3://$R2_BUCKET/raw/page/<slug>/
# Per-site HTML parsing for sources without an API.
uv run python -m dream_job_radar.pipelines.page
# Rippling slice → s3://$R2_BUCKET/raw/rippling/<slug>/
# Public Rippling job-board API (flat JSON list).
uv run python -m dream_job_radar.pipelines.rippling
# Polymer slice → s3://$R2_BUCKET/raw/polymer/<slug>/
# Parent careers page enumerates role IDs; per-role JSON-LD on
# jobs.<company>.<tld> subdomain.
uv run python -m dream_job_radar.pipelines.polymer
# RemoteOK discovery slice → s3://$R2_BUCKET/raw/remoteok/remoteok/
# Public RemoteOK API; strict technical title/seniority filter before raw write.
# User-facing visibility is gated by company-domain review/rules in relevant_open_roles.
uv run python -m dream_job_radar.pipelines.remoteok
# 80,000 Hours discovery slice → s3://$R2_BUCKET/raw/eightythousandhours/jobs/
# Public Algolia browser search; strict technical title filter before raw write.
# Stores minimal metadata only; user-facing visibility is broad-source gated.
uv run python -m dream_job_radar.pipelines.eightythousandhours
# Tech Jobs for Good discovery slice → s3://$R2_BUCKET/raw/techjobsforgood/jobs/
# Public visible Software Engineering and Data + Analytics listings in approved
# civic/climate/infrastructure impact areas. Premium/login-only results are out of scope.
uv run python -m dream_job_radar.pipelines.techjobsforgood
# GIS Jobs Clearinghouse slice → s3://$R2_BUCKET/raw/gjc/rss/
# Public RSS feed; company/location are parsed conservatively from RSS descriptions.
uv run python -m dream_job_radar.pipelines.gjc
# Green Jobs Board slice → s3://$R2_BUCKET/raw/greenjobsboard/jobs/
# Public visible listing/detail pages with very strict technical title filtering.
uv run python -m dream_job_radar.pipelines.greenjobsboard
Run all sources sequentially (Greenhouse → Ashby → sitemap → page → rippling → polymer → RemoteOK → 80,000 Hours → Tech Jobs for Good → GJC → Green Jobs Board locally):
uv run python -m dream_job_radar.pipelines.radar
Materialize the MotherDuck view (idempotent: CREATE OR REPLACE):
uv run python scripts/apply_company_domain_review.py
uv run python scripts/apply_views.py
Broad-discovery company review decisions live in
motherduck/company_domain_review.sql. Update that SQL seed for manual
approved, rejected, or pending company decisions, then re-run the seed and
view scripts above.
Health check (per-(source_kind, ats_slug) freshness; non-zero exit when a previously-observed slug went stale):
uv run python scripts/health_check.py
The public Dive is a relevance-first radar over "acorn-granary"."main"."relevant_open_roles".
- current_open_roles remains the full matching open-role inventory for recovery, debugging, and snapshot history.
- relevant_open_roles is the normal user-facing surface. It includes remote roles unless the location explicitly names a non-US/non-worldwide remote region, plus explicit Arizona-local roles; non-Arizona onsite/hybrid roles stay out.
- Broad-discovery sources such as RemoteOK, 80,000 Hours, Tech Jobs for Good, GJC, and Green Jobs Board must also pass company/domain review or deterministic mission-fit rules before they appear in relevant_open_roles. Unknown, pending, or rejected broad-discovery companies remain available in current_open_roles for review/debugging but are hidden from the Dive.
- Primary KPIs count current relevant open roles, companies, and locations.
- Daily snapshot history comes from "acorn-granary"."mart"."job_postings_daily_snapshot" and tracks the full current matching open-role inventory once per UTC day.
- The recent lens is separate: it counts and lists roles whose posted_at, or fallback first_seen_at, is within the last 7 days and that pass the relevance view.
Do not describe the whole Dive as "last 7 days" unless the query is scoped to the
recent lens. Also do not describe current_open_roles as personalized; it is full
inventory, while relevant_open_roles is the location-relevant surface.
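The date rule behind the recent lens can be illustrated in a few lines of Python. This is only a sketch of the "posted_at, or fallback first_seen_at, within the last 7 days" condition described above; the function name is_recent is hypothetical, timestamps are assumed to be timezone-aware, and the relevance-view check is deliberately left out.

```python
from datetime import datetime, timedelta, timezone

# Window length taken from the "last 7 days" wording above.
RECENT_WINDOW = timedelta(days=7)


def is_recent(posted_at, first_seen_at, now=None):
    """True when posted_at, or first_seen_at as a fallback, is within the window.

    Hypothetical helper: first_seen_at is consulted only when posted_at
    is missing, so an old posted_at is never overridden by a newer
    first_seen_at.
    """
    now = now or datetime.now(timezone.utc)
    reference = posted_at if posted_at is not None else first_seen_at
    if reference is None:
        return False
    return reference >= now - RECENT_WINDOW
```

The fallback order matters: a role posted a month ago but first seen yesterday is not recent under this reading, because posted_at takes precedence whenever it is present.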
Then in MotherDuck:
SELECT source_kind, ats_slug, count(*) FROM current_open_roles
GROUP BY 1, 2 ORDER BY 1, 2;
SELECT source_kind, ats_slug, count(*) FROM relevant_open_roles
GROUP BY 1, 2 ORDER BY 1, 2;
SELECT title, location FROM current_open_roles
WHERE source_kind = 'ashby' AND ats_slug = 'Mapbox'
LIMIT 10;
.github/workflows/refresh.yml runs the pipeline on cron:
- schedule: daily at 0 12 * * * (12:00 UTC = 5am Pacific / 8am Eastern during daylight saving time; one hour earlier in standard time)
- manual: gh workflow run refresh.yml or the GitHub UI's "Run workflow" button
Workflow shape:
- Checkout, install uv, uv sync.
- Run each source kind in its own step with continue-on-error: true (one source failing does not block others).
- scripts/apply_views.py runs with if: always() so the view DDL is reapplied even if a source step failed.
- scripts/health_check.py runs as a required step. Failure here fails the workflow and triggers GitHub's email-on-failure.
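The control flow of the steps above can be mimicked in plain Python to make the failure semantics concrete. This is an illustrative sketch only, not the workflow itself: run_refresh and its callable arguments are hypothetical stand-ins for the per-source steps, scripts/apply_views.py, and scripts/health_check.py.

```python
def run_refresh(source_steps, apply_views, health_check):
    """Mimic the workflow: isolate source failures, always apply views, gate on health.

    source_steps: list of (name, callable) pairs, one per source kind.
    apply_views:  callable standing in for the view DDL step.
    health_check: callable returning True when every slug is fresh.
    Returns (names of failed sources, process-style exit code).
    """
    failed = []
    for name, step in source_steps:
        try:
            step()
        except Exception:
            # Analogue of continue-on-error: true. Record and move on.
            failed.append(name)

    # Analogue of if: always(). Runs regardless of source failures.
    apply_views()

    # Analogue of the required step. Only this result decides the exit code.
    healthy = health_check()
    return failed, (0 if healthy else 1)
```

Under this reading, a single failing source does not fail the run by itself; the run fails only when the health check reports a stale slug.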
The five required secrets (R2_ACCESS_KEY_ID,
R2_SECRET_ACCESS_KEY, R2_ACCOUNT_ID, R2_BUCKET,
MOTHERDUCK_TOKEN) are configured at the repo level per Wave 0
acceptance.
concurrency: { group: refresh, cancel-in-progress: false } keeps
two runs from racing on _dlt_pipeline_state writes.