GitHub - kyndig/job-scraper

KOIS Phases 1-3

This repository implements KOIS as a provenance-first opportunity intelligence pipeline:

Phase 1 foundation (ingestion, extraction, clustering, review states, digest persistence)
Phase 2 market intelligence (agreement signals, gap discovery, analytics endpoints)
Phase 3 sales filtering (role classification, availability-aware relevance scoring, configurable digest thresholds/cadence, relevant-opportunity API view)

Per run pipeline:

scrape broker portals in parallel (Mercell, Verama, Folq, Emagine, Witted)
ingest IMAP mailbox items from oppdrag@kynd.no
persist immutable raw source evidence (raw_source_items)
extract structured records (extracted_records)
cluster likely duplicates with source comparisons (opportunity_clusters, source_comparisons)
classify cluster role fit + relevance against lightweight availability profile
create review states (review_states) and conservative digest items (digest_items)

Requirements

Python 3.12+
PostgreSQL reachable via DATABASE_URL
Playwright Chromium for scraping

Environment Variables

Core:

DATABASE_URL (example: postgresql+psycopg://postgres:postgres@localhost:5432/kois)
RUN_LIVE_SLACK (false by default; set true to post live)
SLACK_CHANNEL (job-posting default)
DIGEST_MODE (balanced, high_precision, high_recall)
DIGEST_MIN_RELEVANCE_SCORE (default 0.35)
DIGEST_MIN_SOURCE_CONFIDENCE (default 0.75)
DIGEST_CADENCE_MINUTES (default 0, meaning no cadence gate)
AVAILABILITY_PROFILE_JSON (optional role capacity map, e.g. {"data_engineering":2,"backend":1})
ROLE_TAXONOMY_JSON (optional role-to-keywords map to override defaults)

Integrations:

GEMINI_API_KEY (optional; summarization/extraction enhancement)
SLACK_TOKEN (required only when RUN_LIVE_SLACK=true)
{PLATFORM}_USERNAME / {PLATFORM}_PASSWORD for each scraper

IMAP (oppdrag@kynd.no):

IMAP_HOST
IMAP_PORT (default 993)
IMAP_USERNAME
IMAP_PASSWORD
IMAP_MAILBOX (default INBOX)
IMAP_SINCE_UID (default 1)
IMAP_SOURCE_NAME (default oppdrag@kynd.no)

Setup

python3.12 -m venv .venv
source .venv/bin/activate
pip install ".[dev]"
python -m playwright install chromium

Run KOIS Pipeline

python -m job_scraper.main

Review and Intelligence API

uvicorn job_scraper.kois.review_api:app --reload

Useful endpoints:

GET /health
GET /clusters
GET /clusters?q=<search>
GET /review-queue
GET /opportunities/relevant
PATCH /clusters/{cluster_id}/status with auto_accepted, needs_review, manually_merged, manually_split, ignored, watch_only
GET /analytics/summary
GET /agreement-signals
GET /agreement-gaps
PATCH /agreement-gaps/{gap_id}/status

Checks

ruff check .
python -m pytest

Notes

Raw evidence is preserved even when extraction fails.
Deduplication is cluster-based; source records are retained.
Slack digest output is driven by persisted cluster state, not transient scraper output.
Phase 3 filtering only affects presentation (digest/API relevance), not archive retention.

Name		Name	Last commit message	Last commit date
Latest commit History 73 Commits
.cursor/rules		.cursor/rules
.github/workflows		.github/workflows
job_scraper		job_scraper
tests		tests
.gitignore		.gitignore
.python-version		.python-version
PID.md		PID.md
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

KOIS Phases 1-3

Requirements

Environment Variables

Setup

Run KOIS Pipeline

Review and Intelligence API

Checks

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

KOIS Phases 1-3

Requirements

Environment Variables

Setup

Run KOIS Pipeline

Review and Intelligence API

Checks

Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages