This repository implements KOIS as a provenance-first opportunity intelligence pipeline:
- Phase 1 foundation (ingestion, extraction, clustering, review states, digest persistence)
- Phase 2 market intelligence (agreement signals, gap discovery, analytics endpoints)
- Phase 3 sales filtering (role classification, availability-aware relevance scoring, configurable digest thresholds/cadence, relevant-opportunity API view)

Per-run pipeline:
- scrape broker portals in parallel (Mercell, Verama, Folq, Emagine, Witted)
- ingest IMAP mailbox items from oppdrag@kynd.no
- persist immutable raw source evidence (raw_source_items)
- extract structured records (extracted_records)
- cluster likely duplicates with source comparisons (opportunity_clusters, source_comparisons)
- classify cluster role fit + relevance against lightweight availability profile
- create review states (review_states) and conservative digest items (digest_items)

Requirements:
- Python 3.12+
- PostgreSQL reachable via DATABASE_URL
- Playwright Chromium for scraping

Core:
- DATABASE_URL (example: postgresql+psycopg://postgres:postgres@localhost:5432/kois)
- RUN_LIVE_SLACK (false by default; set true to post live)
- SLACK_CHANNEL (job-posting default)
- DIGEST_MODE (balanced, high_precision, high_recall)
- DIGEST_MIN_RELEVANCE_SCORE (default 0.35)
- DIGEST_MIN_SOURCE_CONFIDENCE (default 0.75)
- DIGEST_CADENCE_MINUTES (default 0, meaning no cadence gate)
- AVAILABILITY_PROFILE_JSON (optional role capacity map, e.g. {"data_engineering":2,"backend":1})
- ROLE_TAXONOMY_JSON (optional role-to-keywords map to override defaults)
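
The core variables above could be read into a typed settings object along these lines. This is a minimal sketch, not the repository's actual loader: the names DigestSettings and load_settings are hypothetical, and the DIGEST_MODE fallback of balanced is an assumption, since only the allowed values are listed.

```python
import json
import os
from dataclasses import dataclass, field

VALID_MODES = {"balanced", "high_precision", "high_recall"}


@dataclass(frozen=True)
class DigestSettings:
    """Hypothetical container for the core environment variables."""
    database_url: str
    run_live_slack: bool = False
    slack_channel: str = "job-posting"
    digest_mode: str = "balanced"  # assumption: the default mode is not documented
    min_relevance_score: float = 0.35
    min_source_confidence: float = 0.75
    cadence_minutes: int = 0  # 0 means no cadence gate
    availability_profile: dict = field(default_factory=dict)


def load_settings(env=None) -> DigestSettings:
    """Build settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    mode = env.get("DIGEST_MODE", "balanced")
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported DIGEST_MODE: {mode!r}")
    return DigestSettings(
        database_url=env["DATABASE_URL"],  # required, so a KeyError is intentional
        run_live_slack=env.get("RUN_LIVE_SLACK", "false").strip().lower() == "true",
        slack_channel=env.get("SLACK_CHANNEL", "job-posting"),
        digest_mode=mode,
        min_relevance_score=float(env.get("DIGEST_MIN_RELEVANCE_SCORE", "0.35")),
        min_source_confidence=float(env.get("DIGEST_MIN_SOURCE_CONFIDENCE", "0.75")),
        cadence_minutes=int(env.get("DIGEST_CADENCE_MINUTES", "0")),
        availability_profile=json.loads(env.get("AVAILABILITY_PROFILE_JSON", "{}")),
    )
```

Passing a plain dict instead of os.environ keeps the loader easy to exercise in tests without touching the process environment.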

Integrations:
- GEMINI_API_KEY (optional; summarization/extraction enhancement)
- SLACK_TOKEN (required only when RUN_LIVE_SLACK=true)
- {PLATFORM}_USERNAME / {PLATFORM}_PASSWORD for each scraper
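
A pre-flight check for the integration variables might look like the sketch below. It assumes {PLATFORM} is the upper-cased portal name (for example MERCELL_USERNAME), which is not stated explicitly; the function names are hypothetical.

```python
import os

# Portal names taken from the per-run pipeline description.
PLATFORMS = ("Mercell", "Verama", "Folq", "Emagine", "Witted")


def scraper_credentials(platform, env=None):
    """Return (username, password) for a portal, or None if either is unset."""
    env = os.environ if env is None else env
    prefix = platform.upper()  # assumption: {PLATFORM} is the upper-cased name
    username = env.get(f"{prefix}_USERNAME")
    password = env.get(f"{prefix}_PASSWORD")
    if not username or not password:
        return None
    return username, password


def missing_credentials(env=None):
    """List the portals that cannot be scraped with the current environment."""
    return [p for p in PLATFORMS if scraper_credentials(p, env) is None]


def require_slack_token(env=None):
    """Raise if live posting is enabled but SLACK_TOKEN is absent."""
    env = os.environ if env is None else env
    live = env.get("RUN_LIVE_SLACK", "false").strip().lower() == "true"
    if live and not env.get("SLACK_TOKEN"):
        raise RuntimeError("SLACK_TOKEN is required when RUN_LIVE_SLACK=true")
```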

IMAP (oppdrag@kynd.no):
- IMAP_HOST
- IMAP_PORT (default 993)
- IMAP_USERNAME
- IMAP_PASSWORD
- IMAP_MAILBOX (default INBOX)
- IMAP_SINCE_UID (default 1)
- IMAP_SOURCE_NAME (default oppdrag@kynd.no)

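
To illustrate how these IMAP settings fit together, here is a rough sketch using the standard library's imaplib. It is not the repository's ingestion code: the function names are hypothetical, and treating IMAP_SINCE_UID as an inclusive lower bound is an assumption. The explicit filter matters because an IMAP search for "UID n:*" always includes the highest UID in the mailbox, even when that UID is below n.

```python
import imaplib
import os


def imap_settings(env=None) -> dict:
    """Collect the IMAP_* variables, applying the documented defaults."""
    env = os.environ if env is None else env
    return {
        "host": env["IMAP_HOST"],
        "port": int(env.get("IMAP_PORT", "993")),
        "username": env["IMAP_USERNAME"],
        "password": env["IMAP_PASSWORD"],
        "mailbox": env.get("IMAP_MAILBOX", "INBOX"),
        "since_uid": int(env.get("IMAP_SINCE_UID", "1")),
        "source_name": env.get("IMAP_SOURCE_NAME", "oppdrag@kynd.no"),
    }


def parse_new_uids(search_data, since_uid):
    """Turn a UID SEARCH response such as [b'3 4 7'] into sorted ints >= since_uid."""
    uids = [int(tok) for chunk in search_data if chunk for tok in chunk.split()]
    # Drop the stray highest UID the server may return below the lower bound.
    return sorted(u for u in uids if u >= since_uid)


def fetch_new_uids(settings):
    """Connect over TLS and list message UIDs at or above since_uid."""
    with imaplib.IMAP4_SSL(settings["host"], settings["port"]) as conn:
        conn.login(settings["username"], settings["password"])
        conn.select(settings["mailbox"], readonly=True)  # read-only: do not mark as seen
        status, data = conn.uid("SEARCH", None, f"UID {settings['since_uid']}:*")
        if status != "OK":
            raise RuntimeError(f"UID SEARCH failed: {status}")
        return parse_new_uids(data, settings["since_uid"])
```

Keeping the response parsing separate from the network call means the UID handling can be tested without a live mailbox.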
python3.12 -m venv .venv
source .venv/bin/activate
pip install ".[dev]"
python -m playwright install chromium

python -m job_scraper.main

uvicorn job_scraper.kois.review_api:app --reload

Useful endpoints:
- GET /health
- GET /clusters
- GET /clusters?q=<search>
- GET /review-queue
- GET /opportunities/relevant
- PATCH /clusters/{cluster_id}/status with auto_accepted, needs_review, manually_merged, manually_split, ignored, watch_only
- GET /analytics/summary
- GET /agreement-signals
- GET /agreement-gaps
- PATCH /agreement-gaps/{gap_id}/status
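
As a rough illustration of calling these endpoints from Python, the sketch below builds requests with the standard library. The base URL is uvicorn's default bind address; the JSON body shape {"status": ...} for the PATCH call is an assumption, since the request schema is not documented here, and the helper names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode

BASE_URL = "http://127.0.0.1:8000"  # uvicorn's default host and port

REVIEW_STATUSES = {
    "auto_accepted", "needs_review", "manually_merged",
    "manually_split", "ignored", "watch_only",
}


def search_clusters_request(query, base_url=BASE_URL):
    """Build GET /clusters?q=<search> with the query safely URL-encoded."""
    url = f"{base_url}/clusters?{urlencode({'q': query})}"
    return urllib.request.Request(url, method="GET")


def set_cluster_status_request(cluster_id, status, base_url=BASE_URL):
    """Build PATCH /clusters/{cluster_id}/status for one of the listed statuses."""
    if status not in REVIEW_STATUSES:
        raise ValueError(f"unknown review status: {status!r}")
    # Assumption: the API accepts a JSON object with a "status" key.
    body = json.dumps({"status": status}).encode("utf-8")
    url = f"{base_url}/clusters/{quote(str(cluster_id), safe='')}/status"
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )
```

With the API running, a request can be sent with `urllib.request.urlopen(set_cluster_status_request(42, "needs_review"))`, where 42 is a placeholder cluster id.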

ruff check .
python -m pytest

- Raw evidence is preserved even when extraction fails.
- Deduplication is cluster-based; source records are retained.
- Slack digest output is driven by persisted cluster state, not transient scraper output.
- Phase 3 filtering only affects presentation (digest/API relevance), not archive retention.
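
The presentation-only nature of Phase 3 filtering can be sketched as a pure function over persisted cluster state. This is an illustration under stated assumptions, not the repository's logic: ClusterState and both functions are hypothetical, and combining the two thresholds with a logical AND is a guess at how DIGEST_MIN_RELEVANCE_SCORE and DIGEST_MIN_SOURCE_CONFIDENCE interact.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ClusterState:
    """Hypothetical view of a persisted cluster row."""
    cluster_id: int
    relevance_score: float
    source_confidence: float


def select_digest_items(clusters, min_relevance=0.35, min_source_confidence=0.75):
    """Return the clusters that pass both thresholds; the input is never mutated."""
    return [
        c for c in clusters
        if c.relevance_score >= min_relevance
        and c.source_confidence >= min_source_confidence
    ]


def cadence_allows(last_sent, now, cadence_minutes=0):
    """True if a digest may be sent; 0 minutes means no cadence gate."""
    if cadence_minutes <= 0 or last_sent is None:
        return True
    return now - last_sent >= timedelta(minutes=cadence_minutes)


# Example: three stored clusters, only one is presented in the digest.
stored = [
    ClusterState(1, relevance_score=0.90, source_confidence=0.80),
    ClusterState(2, relevance_score=0.20, source_confidence=0.90),
    ClusterState(3, relevance_score=0.50, source_confidence=0.50),
]
digest = select_digest_items(stored)
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
```

Because the selection returns a new list, every stored cluster stays in the archive regardless of whether it reaches the digest.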