Merged
Commits
333 commits
5e69be7
feat(pipeline): enrich metadata with filtered projects list
spideystreet Dec 8, 2025
67827f5
feat(pipeline): relax language filtering threshold to 30%
spideystreet Dec 8, 2025
42eb30f
feat(pipeline): cleanup asset metadata sample
spideystreet Dec 8, 2025
6e14901
test: fixtures for staging
spideystreet Dec 8, 2025
c272dec
test: fixtures for staging
spideystreet Dec 8, 2025
8ddcf0e
fix: lineage dependancies
spideystreet Dec 8, 2025
a9a78e1
docs: add dbt models documentation
spideystreet Dec 8, 2025
f7c1fc6
feat(dbt): add staging and intermediate models for scraper ELT
spideystreet Dec 8, 2025
de7a18f
feat(dbt): update pivot and prod models for ELT
spideystreet Dec 8, 2025
54bab84
feat(scraper): update assets to write to raw tables and link to dbt
spideystreet Dec 8, 2025
52274c2
feat(embedding): update context preparation to use flat dbt columns
spideystreet Dec 8, 2025
56437b5
refactor(pipeline): remove legacy python enrichment assets
spideystreet Dec 8, 2025
6b2d285
refactor(elt): migrate schema, implement upsert, and streamline dbt m…
spideystreet Dec 8, 2025
9e86468
docs: up env example
spideystreet Dec 8, 2025
ee57018
refactor(elt): rename prod model and update env example
spideystreet Dec 8, 2025
ff49a09
refactor: no map config needed anymore
spideystreet Dec 14, 2025
7a73b16
feat(pipeline): implement tech stack sync and fix classification assets
spideystreet Dec 17, 2025
1322b96
fix(ingestion): update readme asset schema, group and persist logic
spideystreet Dec 17, 2025
7a365c8
fix(ingestion): update languages asset schema, group and persist logic
spideystreet Dec 17, 2025
f506294
fix(ingestion): update topics asset schema, group and persist logic
spideystreet Dec 17, 2025
b68c599
fix(ingestion): update extract asset group and cleanup logic
spideystreet Dec 17, 2025
9ad0d2d
fix(ingestion): update load asset group name
spideystreet Dec 17, 2025
8365121
chore(jobs): remove legacy embedding_jobs.py and cleanup
spideystreet Dec 17, 2025
aef2713
style(resources): translate comments to english
spideystreet Dec 17, 2025
4a98387
chore(config): update dagster definitions and sensor
spideystreet Dec 17, 2025
60807f6
build(deps): add transformers and accelerate
spideystreet Dec 17, 2025
c368046
chore(db): update prisma schema with new models and trending field
spideystreet Dec 17, 2025
1473775
fix: readme link
spideystreet Dec 17, 2025
72ac35c
refactor(dbt): reorganize models by domain (users/projects) and clean…
spideystreet Dec 17, 2025
f396eda
chore(db): remove dbt-managed IntGithubProject from prisma schema
spideystreet Dec 17, 2025
835fab9
chore(dbt): update project configuration for new model structure
spideystreet Dec 17, 2025
984db56
feat(dbt): add context generation and utility macros
spideystreet Dec 17, 2025
f5e8f07
chore(scripts): update language fixtures generator to use correct schema
spideystreet Dec 17, 2025
3443503
Merge pull request #17 from opensource-together/ost-410-feat-projects…
spideystreet Dec 17, 2025
96c9883
Merge branch 'ost-408-feat-embeddings-for-cosine-similarities' of htt…
spideystreet Dec 17, 2025
dd321b9
fix(pipeline): remove shadowing sensors.py to allow package import
spideystreet Dec 17, 2025
288aa71
docs: simplify README description to be product-focused
spideystreet Dec 20, 2025
e582636
docs: up README
spideystreet Dec 20, 2025
899df7b
docs: update quick start guide with poetry and docker commands
spideystreet Dec 20, 2025
cea0449
style(resources): translate comments to english in LLM classifier
spideystreet Dec 20, 2025
cf15a04
perf(llm): optimize prompt to reduce tokens and strict json format
spideystreet Dec 20, 2025
fbc41c2
feat: improve context with cat & domain only
spideystreet Dec 20, 2025
e2e84ad
test(dbt): add unique, not_null and relationship tests to staging/int…
spideystreet Dec 21, 2025
0951e05
test(dbt): ensure projects have a url
spideystreet Dec 21, 2025
283112b
feat(dbt): implement ml context pipeline (stg_public_project, raw_pro…
spideystreet Dec 21, 2025
0ae9821
feat(ml): add embedding pipeline (resource, asset, job)
spideystreet Dec 21, 2025
5469248
fix(pipeline): explicit public/project dependency via asset key
spideystreet Dec 21, 2025
aabc1c7
docs(dbt): explain raw_github_readme dependency in stg_public_project
spideystreet Dec 21, 2025
b348db1
fix(dbt): restore missing CTE definition in stg_public_project
spideystreet Dec 21, 2025
8c3bb46
refactor(dbt): centralize ml config in dbt_project.yml
spideystreet Dec 21, 2025
529015d
refactor(dbt): split schema.yml into per-model yamls
spideystreet Dec 21, 2025
55ff320
chore: cleanup unused dbt models, legacy assets, and refactor pipelin…
spideystreet Dec 21, 2025
c4fb157
refactor(pipeline): switch to int->raw->stg flow and cleanup schema
spideystreet Dec 21, 2025
21a08ef
fix(pipeline): refactor IO Manager, fix scraper timeout, and serializ…
spideystreet Dec 21, 2025
8dc71ec
refactor: config on dagster
spideystreet Dec 21, 2025
13ba29d
refactor(config): consolidate config into single cfg_resource.py
spideystreet Dec 21, 2025
2b4ed38
refactor(dbt): optimize clean_llm_context macro for LLM understanding
spideystreet Dec 21, 2025
7361e53
refactor(dbt): enhance generate_project_context with skip_empty logic
spideystreet Dec 21, 2025
13c9eb0
refactor(dbt): add normalization to json_array_to_string macro
spideystreet Dec 21, 2025
6fda9ce
refactor(dbt): rename json_array_to_string to jsonb_to_list
spideystreet Dec 21, 2025
92525ff
refactor(dbt): rename macros for clarity
spideystreet Dec 21, 2025
c23df98
docs(dbt): update model contracts with concise descriptions
spideystreet Dec 21, 2025
f8b71fd
refactor(dbt): rename ML models and organize into subdirectories
spideystreet Dec 21, 2025
c4914b3
fix(pipeline): update embed asset to source from pvt_public_project
spideystreet Dec 21, 2025
d380484
refactor(pipeline): rename job and reorganize asset groups
spideystreet Dec 21, 2025
78b56df
refactor(dbt): assign ml_preparation group to ml models
spideystreet Dec 21, 2025
4e50653
fix: io manager key usage instead of pandas one, return correct dicti…
spideystreet Jan 19, 2026
5101eee
chore: debug log for upserting
spideystreet Jan 19, 2026
92ab3e2
fix: added explicit string casting for uuids
spideystreet Jan 19, 2026
c60763b
fix: cast main pid
spideystreet Jan 19, 2026
5544812
fix: asset name for lineafe
spideystreet Jan 19, 2026
35ed09d
feat: add users embedding
spideystreet Jan 19, 2026
85f3856
feat: embedding user asset
spideystreet Jan 19, 2026
2fa835f
feat(dbt): add user models to prepare computing
spideystreet Jan 19, 2026
f586301
fix: column name (context)
spideystreet Jan 19, 2026
d2c023b
fix: last query parameters string
spideystreet Jan 19, 2026
468be47
feat: add matching model projects<->users
spideystreet Jan 20, 2026
954d58e
feat: add ml prep models related to users
spideystreet Jan 20, 2026
12b9e20
feat: add complete flow on dbt project
spideystreet Jan 20, 2026
b5c9a1c
feat: embedding assets projects/users
spideystreet Jan 20, 2026
db79a5f
feat: sync asset to up projects
spideystreet Jan 20, 2026
70a8ede
fix: github default queryarguments limit
spideystreet Jan 20, 2026
92103ab
fix: match view to table
spideystreet Jan 20, 2026
f09a7e4
feat: order by star to limit quality projects
spideystreet Jan 20, 2026
c89c10b
refactor(dbt): assign ml_preparation group to ml/int models
spideystreet Jan 21, 2026
ea7ae31
fix(pipeline): update job selections to match new groups
spideystreet Jan 21, 2026
ba86741
refactor: build user context alligned with projects one
spideystreet Jan 21, 2026
1569bc9
docs(dbt): enhance match recommendation contracts
spideystreet Jan 21, 2026
30a9cf8
feat: add matching models for recommendations
spideystreet Jan 21, 2026
fe60740
feat: add context prep model for machine learning
spideystreet Jan 21, 2026
28f3ea1
docs(dbt): enhance project model contracts
spideystreet Jan 21, 2026
8e29b1b
docs(dbt): update sources.yml contract
spideystreet Jan 22, 2026
057690c
docs(dbt): reco precision
spideystreet Jan 22, 2026
e956d44
fix(pipeline): wire embedding asset to int_project_embedding_candidate
spideystreet Jan 22, 2026
8655385
docs: improve dbt model and dagster asset descriptions
spideystreet Jan 22, 2026
d2f0d9b
chore(dbt): remove stale config for non-existent model int_github_emb…
spideystreet Jan 22, 2026
170211c
config: update excluded terms list for scraper
spideystreet Jan 22, 2026
a0d5fae
chore(infra): dockerize application
spideystreet Jan 22, 2026
4b85c5e
config: 10 ops max for github query
spideystreet Jan 22, 2026
58b8ee1
chore: add logs for classified projects evolution
spideystreet Jan 22, 2026
5b25f33
config: up to date config with needed vars & parameters
spideystreet Jan 26, 2026
dd78aa0
config: up lineage with llm classifier as resource + good parameters …
spideystreet Jan 26, 2026
db6102a
feat: optimised query parameters to find acurate projects
spideystreet Jan 26, 2026
e1df582
config: group name ml
spideystreet Jan 26, 2026
3b69861
build: up dockerignore
spideystreet Jan 26, 2026
7de373c
fix: seed import syntax
spideystreet Jan 26, 2026
0ab4690
docs: up env example
spideystreet Jan 26, 2026
3ed6b16
docs: add embedding & raw tables not managed by dbt, used by linker t…
spideystreet Jan 26, 2026
b37c14e
fix: correct lineage of groups, to ensure they launch together
spideystreet Jan 26, 2026
b279790
build: correct env var usage
spideystreet Jan 26, 2026
5249257
docs: up README to date
spideystreet Jan 26, 2026
210e0ec
feat(prisma): allign with backend & add extensions for linker
spideystreet Jan 28, 2026
bf71e03
build: entrypoint script to dbt build & deps
spideystreet Jan 28, 2026
7bfdf01
chore: up gitignore
spideystreet Jan 28, 2026
e08e0d2
chore(docker): configure entrypoint script and dependencies
spideystreet Jan 28, 2026
a005caf
fix: pg client no need
spideystreet Jan 28, 2026
332d490
chore: entrypoint pg is ready step outdated
spideystreet Jan 28, 2026
381b5f6
feat(schedule): add run_all_schedule 5x daily (Europe/Paris)
spideystreet Jan 28, 2026
25f5f34
feat: migrate LLM classifier to OpenRouter and tune dbt matching logic
spideystreet Jan 30, 2026
4794754
refactor(linker): rename src/pipeline to src/linker
spideystreet Mar 2, 2026
32a98be
docs(claude): split CLAUDE.md into .claude/rules/
spideystreet Mar 2, 2026
fcc9d5b
fix(config): remove hardcoded secret defaults
spideystreet Mar 2, 2026
f194839
fix(go): harden scraper and fetcher with retry, rate-limit, and upsert
spideystreet Mar 2, 2026
b6b2562
refactor(dbt): restructure models from domain-based to layer-based la…
spideystreet Mar 2, 2026
05cd813
refactor(linker): update asset keys to match renamed dbt models
spideystreet Mar 2, 2026
a3938e3
feat(go): add open_issues_count field to scraper struct
spideystreet Mar 2, 2026
80a8d78
fix(dagster): align DAGSTER_HOME path, gitignore, and Dockerfile config
spideystreet Mar 2, 2026
9dc326d
ci(github-actions): add sqlfluff + quality gates to CI workflows
spideystreet Mar 2, 2026
db49305
chore(gitignore): ignore dagster/ runtime directory
spideystreet Mar 2, 2026
cdb55a7
chore(deps): migrate from Poetry to uv
spideystreet Mar 2, 2026
729d7d7
fix(linker): make GitHub query date dynamic instead of stale at import
spideystreet Mar 2, 2026
cde225b
refactor(linker): migrate PipelineConfig from legacy @resource to Con…
spideystreet Mar 2, 2026
f5c6ec6
fix(linker): remove dead site_url/site_name fields from LLM classifier
spideystreet Mar 2, 2026
6a5e4ee
refactor(linker): remove dead scraper utils, unused schedule, and emp…
spideystreet Mar 2, 2026
71583ca
fix(linker): clean up definitions.py dead code and duplicate comments
spideystreet Mar 2, 2026
f8b39c6
refactor(linker): fix embed_projects config access and add encode_batch
spideystreet Mar 2, 2026
b33901b
fix(linker): use encode_batch in embed_projects for batch encoding
spideystreet Mar 2, 2026
13e5439
refactor(resources): migrate PipelineConfig fields to EnvVar
spideystreet Mar 2, 2026
a3fd1da
refactor(resources): migrate IO manager to ConfigurableIOManager with…
spideystreet Mar 2, 2026
73b0b9d
refactor(resources): migrate FastText and LLM resources to EnvVar
spideystreet Mar 2, 2026
c74b9d2
refactor(assets): use build_fetcher_env in fetcher and scraper assets
spideystreet Mar 2, 2026
0a31f04
test(resources): add unit tests for config resource helpers
spideystreet Mar 2, 2026
cc165b1
chore(lint): fix import sorting and unused imports
spideystreet Mar 2, 2026
156d27b
docs: update .env.example, add CONTRIBUTING.md, sync docs submodule
spideystreet Mar 2, 2026
31803b2
feat(resources): add STAR_RANGES and multi-query support to build_scr…
spideystreet Mar 2, 2026
95c77bc
feat(scraper): rewrite Go scraper for parallel multi-query execution
spideystreet Mar 2, 2026
748598b
feat(assets): update raw_github__extract_projects to handle multi-que…
spideystreet Mar 2, 2026
1c0490c
fix(scraper): use token auth header for GitHub PAT
spideystreet Mar 2, 2026
7b27c31
fix(resources): trim EXCLUDED_TERMS to 4 to stay within GitHub NOT limit
spideystreet Mar 2, 2026
ae5d681
fix(assets): access sentence_transformer via context.resources
spideystreet Mar 2, 2026
368e19e
fix(dagster): use cautious indirect selection in dbt build
spideystreet Mar 2, 2026
8616f49
fix(dbt): add asset_key meta to source tables for Dagster key resolution
spideystreet Mar 2, 2026
0061bb8
docs: document GITHUB_API_URL and GITHUB_SCRAPING_QUERIES in .env.exa…
spideystreet Mar 2, 2026
382fffe
chore: add .mypy_cache to .gitignore
spideystreet Mar 2, 2026
9d4646f
docs(contributing): remove Discord link
spideystreet Mar 2, 2026
a96c16c
refactor(dbt): replace binary pre-filter with continuous preference s…
spideystreet Mar 3, 2026
636c4af
fix(dbt): remove FK relationship tests on staging enrichment models
spideystreet Mar 3, 2026
5b34d8c
feat(fetcher): skip already-fetched projects via incremental lookup
spideystreet Mar 3, 2026
ba5d680
refactor(classifier): add hard timeout and httpx timeouts to LLM calls
spideystreet Mar 3, 2026
6ff02f3
feat(seed): add test users with preferences for recommendation testing
spideystreet Mar 3, 2026
c1f5222
chore: minor .env.example formatting
spideystreet Mar 3, 2026
2f71d58
chore: add GitHub issue and PR templates
spideystreet Mar 3, 2026
e85fd26
chore: add Makefile for common dev commands
spideystreet Mar 3, 2026
d6adf2f
chore: add project metadata to pyproject.toml
spideystreet Mar 3, 2026
db498bc
docs: add contributing and license sections to README
spideystreet Mar 3, 2026
605ef53
refactor: DRY Makefile setup target via build-go delegation
spideystreet Mar 3, 2026
960af2d
fix: move dependencies to correct TOML section and resolve all ruff e…
spideystreet Mar 4, 2026
26d854f
fix: add type annotations and resolve all mypy errors
spideystreet Mar 4, 2026
30bf27e
style(dbt): fix all sqlfluff lint errors across models and tests
spideystreet Mar 4, 2026
5029747
fix(dbt): add default values to profiles.yml for CI compatibility
spideystreet Mar 4, 2026
2b25e66
ci: add format check and switch dbt-check job to uv
spideystreet Mar 4, 2026
264f700
style: fix ruff UP038 isinstance union syntax
spideystreet Mar 4, 2026
4e07a05
refactor(ci): extract quality and dbt-check into reusable workflow
spideystreet Mar 4, 2026
1cf2fc2
fix(dbt): use neutral default password in profiles.yml
spideystreet Mar 4, 2026
786bc7a
docs: sync docs submodule with latest AI pages
spideystreet Mar 4, 2026
75a4544
chore(docker): clean up .dockerignore and reduce build context
spideystreet Mar 4, 2026
7f148c1
fix(docker): harden Dockerfile with non-root user, stripped binaries,…
spideystreet Mar 4, 2026
04a6d19
fix(docker): add missing env vars, DB healthcheck, and localhost bind…
spideystreet Mar 4, 2026
a8eeaf0
fix(docker): make init.sh resilient and remove hardcoded defaults
spideystreet Mar 4, 2026
4cb9fb9
chore(dagster): reduce max concurrent runs and document SQLite limita…
spideystreet Mar 4, 2026
c83dc8f
fix: fix .env.example typo and document missing Dagster vars
spideystreet Mar 4, 2026
7e0afe1
feat(dagster): add workspace.yaml and prod config for production depl…
spideystreet Mar 4, 2026
395fee2
fix(docker): split Dagster into webserver and daemon services
spideystreet Mar 4, 2026
16a15d1
fix(docker): add g++ for fasttext and strip editable install from req…
spideystreet Mar 4, 2026
51601a6
refactor(docker): move dev DB to docker-compose.override.yml
spideystreet Mar 4, 2026
4b55f3b
ci(docs): add submodule SHA check and remove obsolete deploy-docs wor…
spideystreet Mar 5, 2026
4675aa9
ci(docs): add workflow to sync submodule changes to ost-docs
spideystreet Mar 5, 2026
bd657ae
chore(docs): update submodule pointer to latest ost-docs
spideystreet Mar 5, 2026
f57ba0c
docs: make README more concise with tech stack table and Makefile qui…
spideystreet Mar 5, 2026
8d43e39
chore: clean up .gitignore and untrack FastText model binary
spideystreet Mar 5, 2026
db648a8
chore: track utility scripts previously hidden by global *.sh ignore
spideystreet Mar 5, 2026
329a516
ci: add Go, Docker, Prisma, security, and coverage checks
spideystreet Mar 5, 2026
14e5e24
chore(deps): add pip-audit to dev dependencies
spideystreet Mar 5, 2026
5ed2431
refactor(docker): install torch CPU-only to reduce image size by ~2GB
spideystreet Mar 5, 2026
ab15212
fix(deps): upgrade dbt-common 1.37.2 → 1.37.3 (GHSA-w75w-9qv4-j5xj)
spideystreet Mar 5, 2026
01b042f
fix(lint): stabilize import sorting between local and CI environments
spideystreet Mar 5, 2026
f6d37d0
fix(ci): fix Prisma, SQLFluff, gitleaks, and docs-sync CI failures
spideystreet Mar 5, 2026
e3135a4
fix(ci): replace paid gitleaks action with free CLI
spideystreet Mar 5, 2026
aacb3c0
ci: enable uv cache for Python CI jobs
spideystreet Mar 5, 2026
821d49e
ci: add gitleaks allowlist for README false positives
spideystreet Mar 5, 2026
a3ae0c0
docs: update submodule pointer after MDX rewrite
spideystreet Mar 5, 2026
eb875a4
feat(dagster): add user_recommendation_job and rebalance schedules
spideystreet Mar 5, 2026
7135725
fix(prisma): fix verification mapping, drop dead ProjectEmbedding, ad…
spideystreet Mar 5, 2026
4e0b1af
refactor(prisma): convert prisma/ to shared submodule
spideystreet Mar 5, 2026
e07e7dc
ci: add prisma submodule checks and sync workflow
spideystreet Mar 5, 2026
eae5718
revert(prisma): convert back from submodule to regular directory
spideystreet Mar 5, 2026
9fd58ac
ci: replace prisma submodule sync with backend file sync
spideystreet Mar 5, 2026
ce9c4a5
ci: add Claude GitHub Actions workflows
spideystreet Mar 6, 2026
93920e9
feat(agents): add 4 custom Claude subagents for project-specific work…
spideystreet Mar 6, 2026
2ec230f
docs(claude): add test-first bug fixing rule to CLAUDE.md
spideystreet Mar 6, 2026
facf2bf
Merge pull request #19 from opensource-together/refactor/project-stru…
spideystreet Mar 6, 2026
1a5ff8a
ci(review): set Claude Sonnet as model for PR review workflow
spideystreet Mar 6, 2026
17dbefa
docs: add CODE_OF_CONDUCT, SECURITY policy, and update CLAUDE.md
spideystreet Mar 6, 2026
82821c5
fix(ci): set write permissions for Claude GitHub Action
spideystreet Mar 6, 2026
7e385d9
fix(ci): skip quality checks and sync workflows on PRs to develop
spideystreet Mar 6, 2026
b0f67db
revert(ci): remove redundant base_ref guards from workflows
spideystreet Mar 6, 2026
32f976b
Merge pull request #20 from opensource-together/fix/post-review-fixes
spideystreet Mar 6, 2026
3f4a56f
test(ci): verify @claude responds on PR comments
spideystreet Mar 6, 2026
a998511
feat(agents): rename agents with JJK theme, add infra agent and CI rules
spideystreet Mar 6, 2026
e646e98
Merge pull request #24 from opensource-together/feat/agents-infra-readme
spideystreet Mar 6, 2026
0bb9872
fix(dbt): remove hardcoded credentials and fix O(n³) join + score clamp
spideystreet Mar 6, 2026
57a1b62
fix: resolve critical and high-severity audit findings across all layers
spideystreet Mar 6, 2026
92858aa
docs(agents): mark fixed vulnerabilities in agent known issues lists
spideystreet Mar 6, 2026
7ccaa6e
Merge pull request #25 from opensource-together/fix/audit-findings
spideystreet Mar 6, 2026
868892a
fix(dagster): resolve job orchestration issues and concurrency conflicts
spideystreet Mar 6, 2026
e6b5b62
refactor(dagster): split ml_preparation into user/project groups and …
spideystreet Mar 6, 2026
806283c
refactor(dagster): merge classification and embedding into project_en…
spideystreet Mar 6, 2026
fce4540
refactor(dagster): restructure groups into project_ml and user_ml flows
spideystreet Mar 6, 2026
ab9648a
chore(dagster): rename files to match exports and remove dead sensor
spideystreet Mar 6, 2026
392d6c6
feat(dbt): add data contracts, tests, and utility macros on mart models
spideystreet Mar 6, 2026
8435dd5
refactor(dbt): integrate clamp/safe_divide macros and enrich intermed…
spideystreet Mar 6, 2026
d1e177a
docs(dbt): add yml contracts for all 8 macros
spideystreet Mar 6, 2026
6066067
docs(dbt): split macro contracts into one yml per macro
spideystreet Mar 6, 2026
7df50ee
docs(dbt): add yml contracts for singular data tests
spideystreet Mar 6, 2026
69a3d26
docs(agents): update dbt-six-eyes with file convention, group mapping…
spideystreet Mar 6, 2026
5b69021
docs: update docs submodule with new orchestration documentation
spideystreet Mar 6, 2026
d08e1f4
docs: update submodule ref with review fixes
spideystreet Mar 6, 2026
6873e0e
fix: resolve findings from final agent review
spideystreet Mar 6, 2026
edc5b3b
refactor(dagster): merge scraper into project_enrichment_job
spideystreet Mar 6, 2026
3125b6d
perf(classification): skip already-classified projects
spideystreet Mar 6, 2026
cdc1328
fix(dbt): cast freshness_score to double precision for contract compl…
spideystreet Mar 6, 2026
06a3a0c
refactor: extract shared utils, harden resources, and fix scraper log…
spideystreet Mar 6, 2026
8ef1f28
test: add comprehensive test suite for Python and Go services
spideystreet Mar 6, 2026
c6cbad7
docs: update project rules, CLAUDE.md, and agent memory
spideystreet Mar 6, 2026
1d5158c
Merge pull request #26 from opensource-together/feat/test-strategy
spideystreet Mar 7, 2026
627d5e2
fix(ci): add git author config in sync workflows (#27)
spideystreet Mar 7, 2026
15a8e64
fix(ci): resolve dbt-check, quality, and docs-submodule CI failures (…
spideystreet Mar 7, 2026
9b562e4
chore(ci): unify sync tokens and add security contact email (#30)
spideystreet Mar 7, 2026
00c445b
fix(ci): rename token to OST_LINKER_SYNC_TOKEN and lower coverage to …
spideystreet Mar 7, 2026
ca07d90
fix(ci): make dagster startup smoke test non-blocking in CI
spideystreet Mar 7, 2026
23 changes: 23 additions & 0 deletions .claude/agent-memory/dbt-analyst/MEMORY.md
@@ -0,0 +1,23 @@
# dbt Analyst Memory

## Key Fixes Applied (confirmed working)

- **profiles.yml local target**: `POSTGRES_USER` and `POSTGRES_PASSWORD` must use `env_var(...)` with NO fallback default. The docker target intentionally keeps empty-string defaults — do not change those.
- **match_user_recommendation.sql user_totals CTE**: use pre-aggregated subqueries (GROUP BY inside LEFT JOIN subquery) to avoid O(n³) row explosion from joining raw junction tables directly.
- **freshness_score**: always clamp both bounds — `greatest(0, least(1.0, ...))`. Missing the upper `least(1.0, ...)` is a confirmed bug when `pushed_at` is in the future.
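The two-sided clamp in the `freshness_score` note can be sketched outside SQL. The following is a minimal Python illustration, not the model's actual expression: the `clamp` helper mirrors `greatest(0, least(1.0, ...))`, while the linear decay and the `horizon_days` parameter are hypothetical stand-ins for whatever formula the dbt model really uses.

```python
from datetime import datetime, timedelta, timezone


def clamp(value: float, lower: float = 0.0, upper: float = 1.0) -> float:
    """Python equivalent of SQL greatest(lower, least(upper, value))."""
    return max(lower, min(upper, value))


def freshness_score(pushed_at: datetime, now: datetime, horizon_days: float = 365.0) -> float:
    """Hypothetical linear decay from 1.0 (just pushed) to 0.0 (horizon_days old).

    The decay formula is an assumption for illustration only; the point is
    that the raw value is clamped on BOTH sides before it is returned.
    """
    age_days = (now - pushed_at).total_seconds() / 86400.0
    raw = 1.0 - age_days / horizon_days
    # Without the upper bound, a pushed_at in the future gives raw > 1.0.
    return clamp(raw)


if __name__ == "__main__":
    now = datetime(2026, 3, 6, tzinfo=timezone.utc)
    print(freshness_score(now + timedelta(days=30), now))   # future timestamp
    print(freshness_score(now - timedelta(days=730), now))  # very old project
```

With only a lower bound, the future-dated case would leak a score above 1.0 into downstream scoring, which is the bug the note describes.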

## dbt parse workflow

Run `dbt parse` to validate SQL without a live DB:
```bash
cd dbt && POSTGRES_USER=test POSTGRES_PASSWORD=test uv run dbt parse --profiles-dir .
```
`--profiles-dir .` tells dbt to read `dbt/profiles.yml` rather than `~/.dbt/profiles.yml`.
Since `POSTGRES_USER`/`POSTGRES_PASSWORD` have no defaults in the local profile (by design), dummy env vars are needed to pass parsing.

## Known open issues (not yet fixed)

- No `relationships` tests on any FK columns across all models
- No source freshness configured (`loaded_at_field` / `freshness` in sources.yml)
- `stg_public__project.sql:53` UUID namespace mismatch risk between `Project.id` and github `project_id`
- All models materialized as `table` — intermediates should be `view` unless perf requires otherwise
48 changes: 48 additions & 0 deletions .claude/agent-memory/dbt-six-eyes/MEMORY.md
@@ -0,0 +1,48 @@
# dbt-six-eyes Agent Memory

## Critical: Dagster Group Names (verified against dbt_project.yml)

The system prompt lists old group names — the ACTUAL groups in `dbt_project.yml` are:
- `ingestion` — stg_github__*, int_project_enriched, fct_github_project
- `project_ml` — stg_public__project, int_project_contextualized, int_project_embedding_candidate, match_global_recommendation
- `user_ml` — stg_public__user, int_user_enriched, fct_public_user, match_user_recommendation

The system prompt references `ml_preparation` and `matching` — these are STALE names.

## Schema Mapping (verified against dbt_project.yml)

| Model | Schema |
|-------|--------|
| stg_github__* | github |
| stg_public__user | ml |
| stg_public__project | ml |
| int_project_enriched | github |
| int_user_enriched | ml |
| int_project_contextualized | ml |
| int_project_embedding_candidate | ml |
| fct_github_project | github |
| fct_public_user | ml |
| match_global_recommendation | public |
| match_user_recommendation | public |

NOTE: match_* models write to `public` schema, NOT `match` schema.
The `match` schema exists in Prisma but NO dbt model writes to it.

## Known Documentation Errors (docs/ai/)

- `structure.mdx:82` — claims `match` schema holds "dbt-materialized tables"; wrong, dbt writes match_* to `public`
- `overview.mdx:37` — dbt card says "4 PostgreSQL schemas"; dbt actually uses 3 (github, ml, public)

## Macros Present (dbt/macros/)

build_project_context, build_user_context, clamp, clean_text, deduplicate,
generate_schema_name, jsonb_to_list, safe_divide

## Custom Tests (dbt/tests/)

unique_user_project_recommendation, valid_hybrid_score_bounds

## All Materializations

All models (staging, intermediate, marts) are `table` in dbt_project.yml.
No intermediates use `view` despite the known-issue recommendation.
32 changes: 32 additions & 0 deletions .claude/agent-memory/go-black-flash/MEMORY.md
@@ -0,0 +1,32 @@
# Go Black Flash — Agent Memory

## Key file paths

- Scraper: `src/services/go/scraper/` (main.go, common.go, common_test.go, main_test.go)
- Fetcher: `src/services/go/fetcher/` (main.go, common.go, fetch_readme.go, fetch_languages.go, fetch_topics.go, common_test.go)

## Confirmed fixes (as of feat/test-strategy branch)

- Fetcher top-level context timeout: `fetcher/main.go:50` — 30-minute timeout in place
- SQL injection mitigation: `fetcher/common.go:98-117` — allowlist `validTargetTables` guards table name interpolation
- Partial body on status 200: `fetcher/common.go:222-226` — returns error only, no partial body
- README body size limit: `fetch_readme.go:40` — `io.LimitReader(resp.Body, 10*1024*1024)`
- Scraper proactive rate limiting: `scraper/common.go` — `searchRateLimiter.wait()` + `update()` on every request
- `dbCancel` scoping: scraper now scopes cancel inside batch block correctly

## Open issues (do not re-report as new)

1. **Unlock-sleep-relock race** — both `fetcher/common.go:29-38` and `scraper/common.go:28-38` still use the pattern. Medium severity, latent, fires under full concurrency with remaining==1.
2. **`br.Close()` error discarded** — `scraper/main.go:117`. Fetcher files check it correctly.
3. **`FetchReadmes` count inflation** — `fetch_readme.go:141` adds `len(batch)` but empty-content items are skipped in the DB queue. Languages/topics are unaffected (always write).
4. **`retryRequest` sleeps ignore context** — `fetcher/common.go:213,240,244` use bare `time.Sleep`. Fix: `select { case <-time.After(dur): case <-ctx.Done(): return nil, ctx.Err() }`.
5. **Scan errors silently dropped** — `fetcher/common.go:132,167`. Should log at WARN level.

## Patterns confirmed in this codebase

- `pgx.Batch` + `SendBatch` + iterate `br.Exec()` N times + check `br.Close()` — fetcher does this correctly; scraper discards `br.Close()` error
- `io.LimitReader` on raw README content; not needed on JSON API endpoints (GitHub enforces size)
- Rate limiters: `wait()` before request, `update(resp)` after regardless of status code
- Worker pool pattern: buffered `sem` channel + `wg.Wait()` in goroutine to close results channel
- Context propagation: all fetch functions take `ctx context.Context` as first arg
- No mock tests — tests cover pure functions only (env parsing, URL parsing, UTF-8 truncation, header parsing)
30 changes: 30 additions & 0 deletions .claude/agent-memory/go-service-reviewer/MEMORY.md
@@ -0,0 +1,30 @@
# Go Service Reviewer — Persistent Memory

## Key file paths

- Fetcher: `src/services/go/fetcher/` (main.go, common.go, fetch_readme.go, fetch_languages.go, fetch_topics.go)
- Scraper: `src/services/go/scraper/` (main.go, common.go)
- Each binary has its own `go.mod`; build with `go build ./...` from the package directory

## Confirmed fixes applied (branch fix/post-review-fixes)

1. **rateLimiter.wait() mutex pattern** — Never use `defer mu.Unlock()` when the function also manually calls `mu.Unlock()` + `mu.Lock()` inside the body. The defer fires at return and double-unlocks. Correct pattern: lock at top, conditional unlock/sleep/relock in body, explicit unlock at bottom.

2. **fetcher top-level context** — `context.Background()` in `fetcher/main.go` replaced with `context.WithTimeout(..., 30*time.Minute)` + `defer cancel()`.

3. **SQL injection in getNewProjects** — `validTargetTables map[string]string` allowlist added in `fetcher/common.go`. Callers (`FetchReadmes`, `FetchLanguages`, `FetchTopics`) now pass mode keys (`"readme"`, `"languages"`, `"topics"`) not raw table names.

4. **io.LimitReader on README** — `fetch_readme.go` wraps `resp.Body` with `io.LimitReader(resp.Body, 10*1024*1024)` before `io.ReadAll`.

5. **Partial body on readErr** — `retryRequest` in `common.go` returns `(nil, readErr)` instead of `(body, readErr)` when `io.ReadAll` fails on a 200 response.

6. **Scraper shared rate limiter** — `searchRateLimiter` struct added to `scraper/common.go`. `fetchGitHubRepos` now accepts `*searchRateLimiter`, calls `rl.wait()` before the request and `rl.update(resp)` after. `scrapeQuery` and the goroutine loop in `main()` share one `newSearchRateLimiter()` instance.

## Patterns to check on every review

- `defer mu.Unlock()` combined with manual unlock/relock = double-unlock panic. Use explicit unlock at the bottom instead.
- `io.ReadAll` on HTTP bodies without `io.LimitReader` = unbounded memory risk.
- `fmt.Sprintf` with user-supplied or caller-supplied table names = SQL injection. Always use an allowlist.
- Top-level `context.Background()` in long-running binaries must have `WithTimeout`.
- Returning `(partialData, err)` on read failure misleads callers; always return `(nil, err)`.
- Concurrent goroutines sharing one `http.Client` need a proactive shared rate limiter, not just reactive 403 handling.
17 changes: 17 additions & 0 deletions .claude/agent-memory/pipeline-doctor/MEMORY.md
@@ -0,0 +1,17 @@
# Pipeline Doctor Memory

## Known Fix Patterns

- **IO Manager allowlist**: `_ALLOWED_TABLES` set in `src/linker/resources/io_manager.py` must be updated when new assets/tables are added
- **IO Manager strategy**: Uses truncate-then-append (not `if_exists="replace"` which drops tables)
- **LLM classifier**: Raises exceptions (`ValueError`, `TimeoutError`, `RuntimeError`) instead of returning error dicts. Caller in `core_match__classify_projects.py` catches per-project exceptions via existing try/except
- **LLM client**: Lazy singleton via `PrivateAttr` + `@property` pattern (Dagster `ConfigurableResource` requires this)
- **db.py commit param**: `get_db_connection(commit=)` now properly controls commit vs rollback. When `commit=False` (default), transaction is rolled back on exit
- **Nested try/except swallowing**: In `core_public__sync_projects.py`, used `_CriticalSyncError` custom exception to escape outer except block
- **Subprocess timeouts**: All Go fetcher assets use `timeout=600` on `subprocess.run()`
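
The `commit=` behaviour noted above can be sketched as a context manager. This is a minimal illustration, not the code in `db.py`: it assumes a `db_path` argument and uses the standard-library `sqlite3` module as a stand-in for the project's Postgres connection, so the signature and driver are hypothetical.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def get_db_connection(db_path, commit=False):
    """Yield a connection; commit on clean exit only when commit=True.

    Hypothetical stand-in: `db_path` and sqlite3 replace the real Postgres setup.
    """
    conn = sqlite3.connect(db_path)
    try:
        yield conn
        if commit:
            conn.commit()
        else:
            # Default path: discard any pending writes made by the caller.
            conn.rollback()
    except Exception:
        # Any error inside the `with` body also rolls back, then propagates.
        conn.rollback()
        raise
    finally:
        conn.close()
```

With this shape, a caller that forgets `commit=True` loses its writes on exit instead of committing them implicitly, which is the behaviour the memory entry describes.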

## Project Conventions

- Python binary is `python3` (not `python`) on this system
- Linter auto-runs on file save and removes unused imports
- `uv` is the package manager (not pip)
72 changes: 72 additions & 0 deletions .claude/agents/dagster-reverse-cursed-technique.md
@@ -0,0 +1,72 @@
---
name: dagster-reverse-cursed-technique
description: Dagster pipeline debugging and diagnostic specialist. Use proactively when an asset fails, a run crashes, a sensor or schedule misfires, or when investigating pipeline issues. Also use when modifying assets, jobs, schedules, sensors, or resources.
tools: Read, Edit, Bash, Grep, Glob
model: opus
memory: project
maxTurns: 30
---

You are an expert Dagster pipeline debugger for the OST Linker project.

## Project context

OST Linker is a Dagster-orchestrated pipeline that scrapes GitHub projects (Go binaries), classifies them via LLM, computes embeddings (SentenceTransformer), and surfaces recommendations via cosine similarity (pgvector).

Entry point: `src/linker/definitions.py`

### Asset groups

| Group | Assets | Description |
|-------|--------|-------------|
| `ingestion` | `raw_github__extract_projects`, 3 fetcher assets, `core_github__detect_languages` | Go binaries + language detection |
| `classification` | `core_match__classify_projects` | LLM classification via OpenRouter |
| `ml` | `core_ml__embed_projects`, `core_ml__embed_users` | SentenceTransformer 384-dim embeddings |
| `sync` | `core_public__sync_projects` | Upsert into public.Project |
| `dbt_models` | all dbt models | dagster-dbt integration |

### Resources

| Resource | Key | Notes |
|----------|-----|-------|
| `PipelineConfig` | `"config"` | All env vars, injected everywhere |
| `LLMClassifierResource` | `"llm_classifier"` | OpenRouter API, mistral-small |
| `SentenceTransformerResource` | `"sentence_transformer"` | all-MiniLM-L6-v2, CPU |
| `FastTextModelResource` | `"fasttext_model"` | lid.176.ftz for language detection |
| `PandasPostgresIOManager` | `"io_manager"` | DataFrame <-> Postgres via SQLAlchemy |

### Known issues to check for (updated 2026-03-06)

- ~~`get_db_cursor(commit=)` param is ignored~~ FIXED: commit parameter now implemented
- ~~IO manager uses `to_sql(if_exists="replace")`~~ FIXED: truncate+append strategy
- ~~IO manager SQL injection via f-string~~ FIXED: table name allowlist validation
- ~~LLM classifier returns error dicts~~ FIXED: raises exceptions
- ~~LLM classifier creates new client per call~~ FIXED: singleton via PrivateAttr
- ~~Fetcher assets have no `timeout` on `subprocess.run()`~~ FIXED: timeout=600 added
- ~~`core_github__detect_languages` returns success on DB failure~~ FIXED: re-raises exception
- `core_ml__embed_projects` calls `sqlalchemy.create_engine()` on every run and never calls `dispose()` on the engine
- ~~`core_public__sync_projects` inner raise swallowed~~ FIXED: custom exception type propagates

### DB schemas

- `public` — user-facing (User, Project, Category, Domain, TechStack)
- `github` — raw scraped data (RawGithubProject, RawGithubReadme, etc.)
- `ml` — embeddings (EmbdGithubProject, EmbdUser) with pgvector
- `match` — dbt-materialized recommendations

## Debugging workflow

When invoked:

1. Identify the failing asset or run from error messages / logs
2. Read the asset source code and its upstream dependencies
3. Check resource wiring in `definitions.py`
4. Trace data flow: which tables are read/written, which IO manager is used
5. Check for the known issues listed above
6. Look for: missing context metadata, silent exception swallowing, DB connection leaks
7. Propose a minimal, targeted fix
8. Verify the fix doesn't break downstream assets

Always check `dagster_home/` logs if available. Use `dagster dev` output for local debugging.

Update your agent memory with pipeline failure patterns, root causes, and fixes you discover.
75 changes: 75 additions & 0 deletions .claude/agents/dbt-six-eyes.md
@@ -0,0 +1,75 @@
---
name: dbt-six-eyes
description: dbt model reviewer and analyst for the OST Linker project. Use proactively when creating, modifying, or debugging dbt models, sources, tests, or macros. Also use when dbt build/test/run fails.
tools: Read, Grep, Glob, Bash
model: sonnet
memory: project
maxTurns: 20
---

You are an expert dbt analyst for the OST Linker project.

## Project context

dbt project lives in `dbt/`. Profiles: `local` (port 5433) and `docker` (port 5432). Set `DBT_TARGET` to switch.

### Model organization

| Layer | Directory | Naming | Schema |
|-------|-----------|--------|--------|
| Staging | `models/staging/` | `stg_<source>__<entity>` (double underscore) | `github` or `public` |
| Intermediate | `models/intermediate/` | `int_<entity>_<verb>` | `github` or `public` |
| Marts | `models/marts/` | `fct_<entity>`, `match_<entity>` | `public` |

### Sources (defined in `models/sources.yml`)

| Source | Schema | Key tables |
|--------|--------|------------|
| `github_raw` | `github` | `RawGithubProject`, `RawGithubReadme`, `RawGithubLanguages`, `RawGithubTopics`, `IntGithubDetection` |
| `public` | `public` | `User`, `Project`, `Category`, `Domain`, `TechStack`, user junction tables |
| `ml` | `ml` | `EmbdGithubProject`, `EmbdUser` |

### Dagster group mapping (from `dbt_project.yml`)

- `stg_github__*`, `int_project_enriched`, `fct_github_project` -> `ingestion`
- `stg_public__project`, `int_project_contextualized`, `int_project_embedding_candidate`, `match_global_recommendation` -> `project_ml`
- `stg_public__user`, `int_user_enriched`, `fct_public_user`, `match_user_recommendation` -> `user_ml`

### Known issues to check for

- `stg_public__project.sql:53` joins `Project.id::uuid` with github `project_id` — these may be different UUID namespaces, verify the sync asset preserves IDs
- No source freshness configured (`loaded_at_field` / `freshness`)
- All models materialized as `table` — intermediates could be `view`

### Fixed (do not re-report)

- ~~`freshness_score` not clamped~~ — now uses `{{ clamp() }}` macro
- ~~No `relationships` tests on FKs~~ — added to all mart models
- ~~`user_totals` CTE cross-join O(n³)~~ — refactored
- ~~`profiles.yml` hardcoded password~~ — removed

## Review checklist

When reviewing or creating dbt models:

1. **File convention** — every `.sql` file MUST have a matching `.yml` file (models, macros, and singular tests)
2. **Naming** — verify `stg_`/`int_`/`fct_`/`match_` prefix matches the layer
3. **Double underscore** — staging models use `stg_source__entity` (not single underscore)
4. **Schema tests** — every model YAML must have `unique` and `not_null` on primary keys
5. **Relationships** — FK columns should have `relationships` tests (use `arguments:` syntax for dbt 1.10+)
6. **Data contracts** — mart models should have `contract: {enforced: true}` with `data_type` and `constraints`
7. **Materialization** — marts as `table`, intermediates as `view` unless performance requires `table`
8. **Source freshness** — sources should declare `loaded_at_field`
9. **ref() usage** — never hardcode table names, always use `{{ ref() }}` or `{{ source() }}`
10. **Score bounds** — any computed score must be clamped with `{{ clamp() }}` macro
11. **Join safety** — verify UUID namespaces match across schemas before joining
12. **No secrets** — profiles must not have hardcoded passwords as defaults

When debugging:

1. Run `dbt compile` to check SQL generation
2. Check `dbt_project.yml` for schema/group mapping
3. Verify source tables exist and match Prisma schema
4. Check for circular dependencies with `dbt ls --select +model_name+`

Update your agent memory with model patterns, common pitfalls, and conventions you discover.