Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
293 commits
Select commit Hold shift + click to select a range
ec637ff
fix: truncate readmes to avoid oom
spideystreet Nov 30, 2025
16514c5
fix: up to 8g mem limit & 4gb shm size
spideystreet Nov 30, 2025
63ee60a
feat: add topics recup for projects embedding job
spideystreet Dec 2, 2025
f0345cd
fix: torch 2.2.2 -> 2.6.0 to fix security failure CVE-2025-32434
spideystreet Dec 2, 2025
f60895e
docker compose exec dagster-webserver ls -R /app/models
spideystreet Dec 2, 2025
0fef9b2
refactor: switch to 384 vector dim
spideystreet Dec 2, 2025
a082ce5
refactor: switch to generic class name for embedding model
spideystreet Dec 2, 2025
c548417
fix: embedding model resource name
spideystreet Dec 2, 2025
431b404
fix: correct embeding model resource name
spideystreet Dec 2, 2025
c5e7112
fix: add embeding model as required resource
spideystreet Dec 2, 2025
f679535
fix: switch to 384 dim
spideystreet Dec 2, 2025
f0d79c5
build: up torch v
spideystreet Dec 3, 2025
311a30e
feat: embedding flow OK
spideystreet Dec 4, 2025
572049b
refactor: Makefile -> entrypoint
spideystreet Dec 4, 2025
1b85902
refactor: Makefile content on entrypoint for auto-deployment
spideystreet Dec 4, 2025
d2ea5b6
build: fasttext model download on stage
spideystreet Dec 4, 2025
f71aa46
build: add dbt dependancies
spideystreet Dec 7, 2025
df52962
build: entrypoint fix to running migrations
spideystreet Dec 7, 2025
ae29037
chore: dbt profile
spideystreet Dec 7, 2025
b2eed2d
chore: user
spideystreet Dec 7, 2025
46e8c0e
feat: add dbt project
spideystreet Dec 7, 2025
fb08d18
refactor: dagster home to dagster/
spideystreet Dec 7, 2025
53ad831
chore: dagster.yaml
spideystreet Dec 7, 2025
86bc222
chore: fix syntax
spideystreet Dec 7, 2025
5ce561a
refactor: pivot model before prod, for embedding usage
spideystreet Dec 8, 2025
d29dc44
refactor(infra): update project structure and config for local execution
spideystreet Dec 8, 2025
a916500
fix(assets): correct table names and uuid casting in raw sql
spideystreet Dec 8, 2025
9a0a795
feat(scraper): add explicit rate limit logging and error handling
spideystreet Dec 8, 2025
34a0bd1
feat(db): update schema and add migrations for raw, int, and embd tables
spideystreet Dec 8, 2025
cbcd834
refactor(pipeline): cleanup logic, naming, and comments in scraper an…
spideystreet Dec 8, 2025
66fc91f
build(prisma): upgrade to v6.19.0 and remove python client
spideystreet Dec 8, 2025
188c444
refactor(pipeline): replace prisma with psycopg2 and flatten assets
spideystreet Dec 8, 2025
fc335a6
chore(dbt): update model schemas to analytics
spideystreet Dec 8, 2025
6b0ee50
feat(scripts): add model download script
spideystreet Dec 8, 2025
7728c1d
refactor(pipeline): rename load asset and fix lineage dependencies
spideystreet Dec 8, 2025
7482d14
fix(pipeline): update definitions with renamed asset
spideystreet Dec 8, 2025
a2d52ee
fix(dbt): resolve syntax error in prod model
spideystreet Dec 8, 2025
0ec3958
feat(dbt): macro for schema
spideystreet Dec 8, 2025
6e6f1e2
fix(pipeline): handle partial db failures with savepoints
spideystreet Dec 8, 2025
2d69736
fix(pipeline): serialize datetime in asset metadata
spideystreet Dec 8, 2025
5e69be7
feat(pipeline): enrich metadata with filtered projects list
spideystreet Dec 8, 2025
67827f5
feat(pipeline): relax language filtering threshold to 30%
spideystreet Dec 8, 2025
42eb30f
feat(pipeline): cleanup asset metadata sample
spideystreet Dec 8, 2025
6e14901
test: fixtures for staging
spideystreet Dec 8, 2025
c272dec
test: fixtures for staging
spideystreet Dec 8, 2025
8ddcf0e
fix: lineage dependancies
spideystreet Dec 8, 2025
a9a78e1
docs: add dbt models documentation
spideystreet Dec 8, 2025
f7c1fc6
feat(dbt): add staging and intermediate models for scraper ELT
spideystreet Dec 8, 2025
de7a18f
feat(dbt): update pivot and prod models for ELT
spideystreet Dec 8, 2025
54bab84
feat(scraper): update assets to write to raw tables and link to dbt
spideystreet Dec 8, 2025
52274c2
feat(embedding): update context preparation to use flat dbt columns
spideystreet Dec 8, 2025
56437b5
refactor(pipeline): remove legacy python enrichment assets
spideystreet Dec 8, 2025
6b2d285
refactor(elt): migrate schema, implement upsert, and streamline dbt m…
spideystreet Dec 8, 2025
9e86468
docs: up env example
spideystreet Dec 8, 2025
ee57018
refactor(elt): rename prod model and update env example
spideystreet Dec 8, 2025
ff49a09
refactor: no map config needed anymore
spideystreet Dec 14, 2025
7a73b16
feat(pipeline): implement tech stack sync and fix classification assets
spideystreet Dec 17, 2025
1322b96
fix(ingestion): update readme asset schema, group and persist logic
spideystreet Dec 17, 2025
7a365c8
fix(ingestion): update languages asset schema, group and persist logic
spideystreet Dec 17, 2025
f506294
fix(ingestion): update topics asset schema, group and persist logic
spideystreet Dec 17, 2025
b68c599
fix(ingestion): update extract asset group and cleanup logic
spideystreet Dec 17, 2025
9ad0d2d
fix(ingestion): update load asset group name
spideystreet Dec 17, 2025
8365121
chore(jobs): remove legacy embedding_jobs.py and cleanup
spideystreet Dec 17, 2025
aef2713
style(resources): translate comments to english
spideystreet Dec 17, 2025
4a98387
chore(config): update dagster definitions and sensor
spideystreet Dec 17, 2025
60807f6
build(deps): add transformers and accelerate
spideystreet Dec 17, 2025
c368046
chore(db): update prisma schema with new models and trending field
spideystreet Dec 17, 2025
1473775
fix: readme link
spideystreet Dec 17, 2025
72ac35c
refactor(dbt): reorganize models by domain (users/projects) and clean…
spideystreet Dec 17, 2025
f396eda
chore(db): remove dbt-managed IntGithubProject from prisma schema
spideystreet Dec 17, 2025
835fab9
chore(dbt): update project configuration for new model structure
spideystreet Dec 17, 2025
984db56
feat(dbt): add context generation and utility macros
spideystreet Dec 17, 2025
f5e8f07
chore(scripts): update language fixtures generator to use correct schema
spideystreet Dec 17, 2025
3443503
Merge pull request #17 from opensource-together/ost-410-feat-projects…
spideystreet Dec 17, 2025
96c9883
Merge branch 'ost-408-feat-embeddings-for-cosine-similarities' of htt…
spideystreet Dec 17, 2025
dd321b9
fix(pipeline): remove shadowing sensors.py to allow package import
spideystreet Dec 17, 2025
288aa71
docs: simplify README description to be product-focused
spideystreet Dec 20, 2025
e582636
docs: up README
spideystreet Dec 20, 2025
899df7b
docs: update quick start guide with poetry and docker commands
spideystreet Dec 20, 2025
cea0449
style(resources): translate comments to english in LLM classifier
spideystreet Dec 20, 2025
cf15a04
perf(llm): optimize prompt to reduce tokens and strict json format
spideystreet Dec 20, 2025
fbc41c2
feat: improve context with cat & domain only
spideystreet Dec 20, 2025
e2e84ad
test(dbt): add unique, not_null and relationship tests to staging/int…
spideystreet Dec 21, 2025
0951e05
test(dbt): ensure projects have a url
spideystreet Dec 21, 2025
283112b
feat(dbt): implement ml context pipeline (stg_public_project, raw_pro…
spideystreet Dec 21, 2025
0ae9821
feat(ml): add embedding pipeline (resource, asset, job)
spideystreet Dec 21, 2025
5469248
fix(pipeline): explicit public/project dependency via asset key
spideystreet Dec 21, 2025
aabc1c7
docs(dbt): explain raw_github_readme dependency in stg_public_project
spideystreet Dec 21, 2025
b348db1
fix(dbt): restore missing CTE definition in stg_public_project
spideystreet Dec 21, 2025
8c3bb46
refactor(dbt): centralize ml config in dbt_project.yml
spideystreet Dec 21, 2025
529015d
refactor(dbt): split schema.yml into per-model yamls
spideystreet Dec 21, 2025
55ff320
chore: cleanup unused dbt models, legacy assets, and refactor pipelin…
spideystreet Dec 21, 2025
c4fb157
refactor(pipeline): switch to int->raw->stg flow and cleanup schema
spideystreet Dec 21, 2025
21a08ef
fix(pipeline): refactor IO Manager, fix scraper timeout, and serializ…
spideystreet Dec 21, 2025
8dc71ec
refactor: config on dagster
spideystreet Dec 21, 2025
13ba29d
refactor(config): consolidate config into single cfg_resource.py
spideystreet Dec 21, 2025
2b4ed38
refactor(dbt): optimize clean_llm_context macro for LLM understanding
spideystreet Dec 21, 2025
7361e53
refactor(dbt): enhance generate_project_context with skip_empty logic
spideystreet Dec 21, 2025
13c9eb0
refactor(dbt): add normalization to json_array_to_string macro
spideystreet Dec 21, 2025
6fda9ce
refactor(dbt): rename json_array_to_string to jsonb_to_list
spideystreet Dec 21, 2025
92525ff
refactor(dbt): rename macros for clarity
spideystreet Dec 21, 2025
c23df98
docs(dbt): update model contracts with concise descriptions
spideystreet Dec 21, 2025
f8b71fd
refactor(dbt): rename ML models and organize into subdirectories
spideystreet Dec 21, 2025
c4914b3
fix(pipeline): update embed asset to source from pvt_public_project
spideystreet Dec 21, 2025
d380484
refactor(pipeline): rename job and reorganize asset groups
spideystreet Dec 21, 2025
78b56df
refactor(dbt): assign ml_preparation group to ml models
spideystreet Dec 21, 2025
4e50653
fix: io manager key usage instead of pandas one, return correct dicti…
spideystreet Jan 19, 2026
5101eee
chore: debug log for upserting
spideystreet Jan 19, 2026
92ab3e2
fix: added explicit string casting for uuids
spideystreet Jan 19, 2026
c60763b
fix: cast main pid
spideystreet Jan 19, 2026
5544812
fix: asset name for lineafe
spideystreet Jan 19, 2026
35ed09d
feat: add users embedding
spideystreet Jan 19, 2026
85f3856
feat: embedding user asset
spideystreet Jan 19, 2026
2fa835f
feat(dbt): add user models to prepare computing
spideystreet Jan 19, 2026
f586301
fix: column name (context)
spideystreet Jan 19, 2026
d2c023b
fix: last query parameters string
spideystreet Jan 19, 2026
468be47
feat: add matching model projects<->users
spideystreet Jan 20, 2026
954d58e
feat: add ml prep models related to users
spideystreet Jan 20, 2026
12b9e20
feat: add complete flow on dbt project
spideystreet Jan 20, 2026
b5c9a1c
feat: embedding assets projects/users
spideystreet Jan 20, 2026
db79a5f
feat: sync asset to up projects
spideystreet Jan 20, 2026
70a8ede
fix: github default queryarguments limit
spideystreet Jan 20, 2026
92103ab
fix: match view to table
spideystreet Jan 20, 2026
f09a7e4
feat: order by star to limit quality projects
spideystreet Jan 20, 2026
c89c10b
refactor(dbt): assign ml_preparation group to ml/int models
spideystreet Jan 21, 2026
ea7ae31
fix(pipeline): update job selections to match new groups
spideystreet Jan 21, 2026
ba86741
refactor: build user context alligned with projects one
spideystreet Jan 21, 2026
1569bc9
docs(dbt): enhance match recommendation contracts
spideystreet Jan 21, 2026
30a9cf8
feat: add matching models for recommendations
spideystreet Jan 21, 2026
fe60740
feat: add context prep model for machine learning
spideystreet Jan 21, 2026
28f3ea1
docs(dbt): enhance project model contracts
spideystreet Jan 21, 2026
8e29b1b
docs(dbt): update sources.yml contract
spideystreet Jan 22, 2026
057690c
docs(dbt): reco precision
spideystreet Jan 22, 2026
e956d44
fix(pipeline): wire embedding asset to int_project_embedding_candidate
spideystreet Jan 22, 2026
8655385
docs: improve dbt model and dagster asset descriptions
spideystreet Jan 22, 2026
d2f0d9b
chore(dbt): remove stale config for non-existent model int_github_emb…
spideystreet Jan 22, 2026
170211c
config: update excluded terms list for scraper
spideystreet Jan 22, 2026
a0d5fae
chore(infra): dockerize application
spideystreet Jan 22, 2026
4b85c5e
config: 10 ops max for github query
spideystreet Jan 22, 2026
58b8ee1
chore: add logs for classified projects evolution
spideystreet Jan 22, 2026
5b25f33
config: up to date config with needed vars & parameters
spideystreet Jan 26, 2026
dd78aa0
config: up lineage with llm classifier as resource + good parameters …
spideystreet Jan 26, 2026
db6102a
feat: optimised query parameters to find acurate projects
spideystreet Jan 26, 2026
e1df582
config: group name ml
spideystreet Jan 26, 2026
3b69861
build: up dockerignore
spideystreet Jan 26, 2026
7de373c
fix: seed import syntax
spideystreet Jan 26, 2026
0ab4690
docs: up env example
spideystreet Jan 26, 2026
3ed6b16
docs: add embedding & raw tables not managed by dbt, used by linker t…
spideystreet Jan 26, 2026
b37c14e
fix: correct lineage of groups, to ensure they launch together
spideystreet Jan 26, 2026
b279790
build: correct env var usage
spideystreet Jan 26, 2026
5249257
docs: up README to date
spideystreet Jan 26, 2026
210e0ec
feat(prisma): allign with backend & add extensions for linker
spideystreet Jan 28, 2026
bf71e03
build: entrypoint script to dbt build & deps
spideystreet Jan 28, 2026
7bfdf01
chore: up gitignore
spideystreet Jan 28, 2026
e08e0d2
chore(docker): configure entrypoint script and dependencies
spideystreet Jan 28, 2026
a005caf
fix: pg client no need
spideystreet Jan 28, 2026
332d490
chore: entrypoint pg is ready step outdated
spideystreet Jan 28, 2026
381b5f6
feat(schedule): add run_all_schedule 5x daily (Europe/Paris)
spideystreet Jan 28, 2026
25f5f34
feat: migrate LLM classifier to OpenRouter and tune dbt matching logic
spideystreet Jan 30, 2026
4794754
refactor(linker): rename src/pipeline to src/linker
spideystreet Mar 2, 2026
32a98be
docs(claude): split CLAUDE.md into .claude/rules/
spideystreet Mar 2, 2026
fcc9d5b
fix(config): remove hardcoded secret defaults
spideystreet Mar 2, 2026
f194839
fix(go): harden scraper and fetcher with retry, rate-limit, and upsert
spideystreet Mar 2, 2026
b6b2562
refactor(dbt): restructure models from domain-based to layer-based la…
spideystreet Mar 2, 2026
05cd813
refactor(linker): update asset keys to match renamed dbt models
spideystreet Mar 2, 2026
a3938e3
feat(go): add open_issues_count field to scraper struct
spideystreet Mar 2, 2026
80a8d78
fix(dagster): align DAGSTER_HOME path, gitignore, and Dockerfile config
spideystreet Mar 2, 2026
9dc326d
ci(github-actions): add sqlfluff + quality gates to CI workflows
spideystreet Mar 2, 2026
db49305
chore(gitignore): ignore dagster/ runtime directory
spideystreet Mar 2, 2026
cdb55a7
chore(deps): migrate from Poetry to uv
spideystreet Mar 2, 2026
729d7d7
fix(linker): make GitHub query date dynamic instead of stale at import
spideystreet Mar 2, 2026
cde225b
refactor(linker): migrate PipelineConfig from legacy @resource to Con…
spideystreet Mar 2, 2026
f5c6ec6
fix(linker): remove dead site_url/site_name fields from LLM classifier
spideystreet Mar 2, 2026
6a5e4ee
refactor(linker): remove dead scraper utils, unused schedule, and emp…
spideystreet Mar 2, 2026
71583ca
fix(linker): clean up definitions.py dead code and duplicate comments
spideystreet Mar 2, 2026
f8b39c6
refactor(linker): fix embed_projects config access and add encode_batch
spideystreet Mar 2, 2026
b33901b
fix(linker): use encode_batch in embed_projects for batch encoding
spideystreet Mar 2, 2026
13e5439
refactor(resources): migrate PipelineConfig fields to EnvVar
spideystreet Mar 2, 2026
a3fd1da
refactor(resources): migrate IO manager to ConfigurableIOManager with…
spideystreet Mar 2, 2026
73b0b9d
refactor(resources): migrate FastText and LLM resources to EnvVar
spideystreet Mar 2, 2026
c74b9d2
refactor(assets): use build_fetcher_env in fetcher and scraper assets
spideystreet Mar 2, 2026
0a31f04
test(resources): add unit tests for config resource helpers
spideystreet Mar 2, 2026
cc165b1
chore(lint): fix import sorting and unused imports
spideystreet Mar 2, 2026
156d27b
docs: update .env.example, add CONTRIBUTING.md, sync docs submodule
spideystreet Mar 2, 2026
31803b2
feat(resources): add STAR_RANGES and multi-query support to build_scr…
spideystreet Mar 2, 2026
95c77bc
feat(scraper): rewrite Go scraper for parallel multi-query execution
spideystreet Mar 2, 2026
748598b
feat(assets): update raw_github__extract_projects to handle multi-que…
spideystreet Mar 2, 2026
1c0490c
fix(scraper): use token auth header for GitHub PAT
spideystreet Mar 2, 2026
7b27c31
fix(resources): trim EXCLUDED_TERMS to 4 to stay within GitHub NOT limit
spideystreet Mar 2, 2026
ae5d681
fix(assets): access sentence_transformer via context.resources
spideystreet Mar 2, 2026
368e19e
fix(dagster): use cautious indirect selection in dbt build
spideystreet Mar 2, 2026
8616f49
fix(dbt): add asset_key meta to source tables for Dagster key resolution
spideystreet Mar 2, 2026
0061bb8
docs: document GITHUB_API_URL and GITHUB_SCRAPING_QUERIES in .env.exa…
spideystreet Mar 2, 2026
382fffe
chore: add .mypy_cache to .gitignore
spideystreet Mar 2, 2026
9d4646f
docs(contributing): remove Discord link
spideystreet Mar 2, 2026
a96c16c
refactor(dbt): replace binary pre-filter with continuous preference s…
spideystreet Mar 3, 2026
636c4af
fix(dbt): remove FK relationship tests on staging enrichment models
spideystreet Mar 3, 2026
5b34d8c
feat(fetcher): skip already-fetched projects via incremental lookup
spideystreet Mar 3, 2026
ba5d680
refactor(classifier): add hard timeout and httpx timeouts to LLM calls
spideystreet Mar 3, 2026
6ff02f3
feat(seed): add test users with preferences for recommendation testing
spideystreet Mar 3, 2026
c1f5222
chore: minor .env.example formatting
spideystreet Mar 3, 2026
2f71d58
chore: add GitHub issue and PR templates
spideystreet Mar 3, 2026
e85fd26
chore: add Makefile for common dev commands
spideystreet Mar 3, 2026
d6adf2f
chore: add project metadata to pyproject.toml
spideystreet Mar 3, 2026
db498bc
docs: add contributing and license sections to README
spideystreet Mar 3, 2026
605ef53
refactor: DRY Makefile setup target via build-go delegation
spideystreet Mar 3, 2026
960af2d
fix: move dependencies to correct TOML section and resolve all ruff e…
spideystreet Mar 4, 2026
26d854f
fix: add type annotations and resolve all mypy errors
spideystreet Mar 4, 2026
30bf27e
style(dbt): fix all sqlfluff lint errors across models and tests
spideystreet Mar 4, 2026
5029747
fix(dbt): add default values to profiles.yml for CI compatibility
spideystreet Mar 4, 2026
2b25e66
ci: add format check and switch dbt-check job to uv
spideystreet Mar 4, 2026
264f700
style: fix ruff UP038 isinstance union syntax
spideystreet Mar 4, 2026
4e07a05
refactor(ci): extract quality and dbt-check into reusable workflow
spideystreet Mar 4, 2026
1cf2fc2
fix(dbt): use neutral default password in profiles.yml
spideystreet Mar 4, 2026
786bc7a
docs: sync docs submodule with latest AI pages
spideystreet Mar 4, 2026
75a4544
chore(docker): clean up .dockerignore and reduce build context
spideystreet Mar 4, 2026
7f148c1
fix(docker): harden Dockerfile with non-root user, stripped binaries,…
spideystreet Mar 4, 2026
04a6d19
fix(docker): add missing env vars, DB healthcheck, and localhost bind…
spideystreet Mar 4, 2026
a8eeaf0
fix(docker): make init.sh resilient and remove hardcoded defaults
spideystreet Mar 4, 2026
4cb9fb9
chore(dagster): reduce max concurrent runs and document SQLite limita…
spideystreet Mar 4, 2026
c83dc8f
fix: fix .env.example typo and document missing Dagster vars
spideystreet Mar 4, 2026
7e0afe1
feat(dagster): add workspace.yaml and prod config for production depl…
spideystreet Mar 4, 2026
395fee2
fix(docker): split Dagster into webserver and daemon services
spideystreet Mar 4, 2026
16a15d1
fix(docker): add g++ for fasttext and strip editable install from req…
spideystreet Mar 4, 2026
51601a6
refactor(docker): move dev DB to docker-compose.override.yml
spideystreet Mar 4, 2026
4b55f3b
ci(docs): add submodule SHA check and remove obsolete deploy-docs wor…
spideystreet Mar 5, 2026
4675aa9
ci(docs): add workflow to sync submodule changes to ost-docs
spideystreet Mar 5, 2026
bd657ae
chore(docs): update submodule pointer to latest ost-docs
spideystreet Mar 5, 2026
f57ba0c
docs: make README more concise with tech stack table and Makefile qui…
spideystreet Mar 5, 2026
8d43e39
chore: clean up .gitignore and untrack FastText model binary
spideystreet Mar 5, 2026
db648a8
chore: track utility scripts previously hidden by global *.sh ignore
spideystreet Mar 5, 2026
329a516
ci: add Go, Docker, Prisma, security, and coverage checks
spideystreet Mar 5, 2026
14e5e24
chore(deps): add pip-audit to dev dependencies
spideystreet Mar 5, 2026
5ed2431
refactor(docker): install torch CPU-only to reduce image size by ~2GB
spideystreet Mar 5, 2026
ab15212
fix(deps): upgrade dbt-common 1.37.2 → 1.37.3 (GHSA-w75w-9qv4-j5xj)
spideystreet Mar 5, 2026
01b042f
fix(lint): stabilize import sorting between local and CI environments
spideystreet Mar 5, 2026
f6d37d0
fix(ci): fix Prisma, SQLFluff, gitleaks, and docs-sync CI failures
spideystreet Mar 5, 2026
e3135a4
fix(ci): replace paid gitleaks action with free CLI
spideystreet Mar 5, 2026
aacb3c0
ci: enable uv cache for Python CI jobs
spideystreet Mar 5, 2026
821d49e
ci: add gitleaks allowlist for README false positives
spideystreet Mar 5, 2026
a3ae0c0
docs: update submodule pointer after MDX rewrite
spideystreet Mar 5, 2026
eb875a4
feat(dagster): add user_recommendation_job and rebalance schedules
spideystreet Mar 5, 2026
7135725
fix(prisma): fix verification mapping, drop dead ProjectEmbedding, ad…
spideystreet Mar 5, 2026
4e0b1af
refactor(prisma): convert prisma/ to shared submodule
spideystreet Mar 5, 2026
e07e7dc
ci: add prisma submodule checks and sync workflow
spideystreet Mar 5, 2026
eae5718
revert(prisma): convert back from submodule to regular directory
spideystreet Mar 5, 2026
9fd58ac
ci: replace prisma submodule sync with backend file sync
spideystreet Mar 5, 2026
ce9c4a5
ci: add Claude GitHub Actions workflows
spideystreet Mar 6, 2026
93920e9
feat(agents): add 4 custom Claude subagents for project-specific work…
spideystreet Mar 6, 2026
2ec230f
docs(claude): add test-first bug fixing rule to CLAUDE.md
spideystreet Mar 6, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
70 changes: 70 additions & 0 deletions .claude/agents/dbt-analyst.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
---
name: dbt-analyst
description: dbt model reviewer and analyst for the OST Linker project. Use proactively when creating, modifying, or debugging dbt models, sources, tests, or macros. Also use when dbt build/test/run fails.
tools: Read, Grep, Glob, Bash
model: sonnet
memory: project
maxTurns: 20
---

You are an expert dbt analyst for the OST Linker project.

## Project context

dbt project lives in `dbt/`. Profiles: `local` (port 5433) and `docker` (port 5432). Set `DBT_TARGET` to switch.

### Model organization

| Layer | Directory | Naming | Schema |
|-------|-----------|--------|--------|
| Staging | `models/staging/` | `stg_<source>__<entity>` (double underscore) | `github` or `public` |
| Intermediate | `models/intermediate/` | `int_<entity>_<verb>` | `github` or `public` |
| Marts | `models/marts/` | `fct_<entity>`, `match_<entity>` | `public` |

### Sources (defined in `models/sources.yml`)

| Source | Schema | Key tables |
|--------|--------|------------|
| `github_raw` | `github` | `RawGithubProject`, `RawGithubReadme`, `RawGithubLanguages`, `RawGithubTopics`, `IntGithubDetection` |
| `public` | `public` | `User`, `Project`, `Category`, `Domain`, `TechStack`, user junction tables |
| `ml` | `ml` | `EmbdGithubProject`, `EmbdUser` |

### Dagster group mapping (from `dbt_project.yml`)

- `stg_github__*`, `int_project_enriched`, `fct_github_project` -> `ingestion`
- `stg_public__*`, `int_user_enriched`, `int_project_contextualized`, `int_project_embedding_candidate`, `fct_public_user` -> `ml_preparation`
- `match_*` -> `matching`

### Known issues to check for

- `stg_public__project.sql:53` joins `Project.id::uuid` with github `project_id` — these may be different UUID namespaces, verify the sync asset preserves IDs
- `match_user_recommendation.sql` `user_totals` CTE has cross-join row explosion (correct with DISTINCT but O(n^3))
- `freshness_score` not clamped to upper bound 1.0 — future `pushed_at` breaks `valid_hybrid_score_bounds` test
- No `relationships` tests on any foreign keys
- No source freshness configured (`loaded_at_field` / `freshness`)
- `profiles.yml` has hardcoded default password `'postgres'` (violates project convention)
- All models materialized as `table` — intermediates could be `view`

## Review checklist

When reviewing or creating dbt models:

1. **Naming** — verify `stg_`/`int_`/`fct_`/`match_` prefix matches the layer
2. **Double underscore** — staging models use `stg_source__entity` (not single underscore)
3. **Schema tests** — every model YAML must have `unique` and `not_null` on primary keys
4. **Relationships** — FK columns should have `relationships` tests
5. **Materialization** — marts as `table`, intermediates as `view` unless performance requires `table`
6. **Source freshness** — sources should declare `loaded_at_field`
7. **ref() usage** — never hardcode table names, always use `{{ ref() }}` or `{{ source() }}`
8. **Score bounds** — any computed score must be clamped with `greatest(0, least(1.0, ...))`
9. **Join safety** — verify UUID namespaces match across schemas before joining
10. **No secrets** — profiles must not have hardcoded passwords as defaults

When debugging:

1. Run `dbt compile` to check SQL generation
2. Check `dbt_project.yml` for schema/group mapping
3. Verify source tables exist and match Prisma schema
4. Check for circular dependencies with `dbt ls --select +model_name+`

Update your agent memory with model patterns, common pitfalls, and conventions you discover.
75 changes: 75 additions & 0 deletions .claude/agents/go-service-reviewer.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
---
name: go-service-reviewer
description: Go code reviewer for the OST Linker scraper and fetcher services. Use proactively when creating or modifying Go code in src/services/go/. Also use when Go builds fail or GitHub API interactions have issues.
tools: Read, Grep, Glob, Bash
model: sonnet
memory: project
maxTurns: 20
---

You are an expert Go reviewer specialized in the OST Linker GitHub scraper and fetcher services.

## Project context

Two independent Go binaries in `src/services/go/`, each with its own `go.mod`:

### Scraper (`src/services/go/scraper/`)
- Scrapes GitHub Search API
- Writes to `github.RawGithubProject`
- Invoked by Dagster asset `raw_github__extract_projects` via `subprocess.run()`
- Uses pgx for PostgreSQL, concurrent goroutines per query
- Has 8-minute context timeout

### Fetcher (`src/services/go/fetcher/`)
- Fetches per-repo details: README, languages, topics
- Writes to `github.RawGithubReadme`, `RawGithubLanguages`, `RawGithubTopics`
- Invoked by 3 separate Dagster assets via `subprocess.run()`
- Uses pgx for PostgreSQL, worker pool with `rateLimiter`
- **Missing top-level context timeout** (`context.Background()` with no deadline)

### Known issues

1. **Race condition** — `fetcher/common.go:29-38` `rateLimiter.wait()` unlocks mutex, sleeps, re-locks. Between unlock and re-lock, other goroutines can pass the rate limit check simultaneously. This causes 403 bursts from GitHub.
2. **No context timeout** — `fetcher/main.go:50` uses `context.Background()` without deadline. Process can hang indefinitely.
3. **SQL injection risk** — `fetcher/common.go:97-104` `getNewProjects()` uses `fmt.Sprintf` to interpolate table name. Currently safe (hardcoded callers) but latent risk.
4. **dbCancel not deferred** — `scraper/main.go:105-117` `dbCancel()` called manually instead of `defer`. Context leaks on panic. `br.Close()` error is ignored.
5. **Inflated count** — `fetcher/fetch_readme.go:141` counts all batch items including empty content that gets skipped in `flushBatch`.
6. **Partial body returned** — `fetcher/common.go:206-210` returns both partial body and readErr on status 200.
7. **No body size limit** — `io.ReadAll` without `io.LimitReader` on README responses.
8. **No proactive rate limiting in scraper** — all search queries run as concurrent goroutines sharing one `http.Client` with no rate limiter. Only reacts to 403 responses.

## Review checklist

When reviewing Go code:

### Error handling
- Every error is checked, not silently discarded
- `defer` used for cleanup (cancel, close, unlock)
- Errors wrapped with context: `fmt.Errorf("fetch readme for %s: %w", url, err)`
- `br.Close()` errors are checked after batch operations

### Concurrency
- Mutex usage is correct (no unlock-sleep-relock patterns)
- Context propagation: all operations accept and respect `ctx`
- Worker pools have bounded concurrency
- Rate limiters actually serialize access under contention

### GitHub API
- Rate limit headers (`X-RateLimit-Remaining`, `X-RateLimit-Reset`) are parsed and respected proactively
- Retry logic handles 403 (rate limit), 404 (not found), 5xx (server error) differently
- Search API limit: 30 req/min authenticated, 10 unauthenticated
- REST API limit: 5000 req/hour authenticated

### Database
- No SQL injection via string interpolation — use parameterized queries or allowlists
- Batch operations use `pgx.Batch` correctly
- Context timeouts on all DB operations
- Connection pools are closed on shutdown

### Resource management
- `resp.Body.Close()` after every HTTP response
- `io.LimitReader` on untrusted response bodies
- Top-level context has a timeout
- Subprocess invocations from Dagster have `timeout` parameter

Update your agent memory with Go patterns, GitHub API quirks, and fixes you discover.
72 changes: 72 additions & 0 deletions .claude/agents/pipeline-doctor.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
---
name: pipeline-doctor
description: Dagster pipeline debugging and diagnostic specialist. Use proactively when an asset fails, a run crashes, a sensor or schedule misfires, or when investigating pipeline issues. Also use when modifying assets, jobs, schedules, sensors, or resources.
tools: Read, Edit, Bash, Grep, Glob
model: opus
memory: project
maxTurns: 30
---

You are an expert Dagster pipeline debugger for the OST Linker project.

## Project context

OST Linker is a Dagster-orchestrated pipeline that scrapes GitHub projects (Go binaries), classifies them via LLM, computes embeddings (SentenceTransformer), and surfaces recommendations via cosine similarity (pgvector).

Entry point: `src/linker/definitions.py`

### Asset groups

| Group | Assets | Description |
|-------|--------|-------------|
| `ingestion` | `raw_github__extract_projects`, 3 fetcher assets, `core_github__detect_languages` | Go binaries + language detection |
| `classification` | `core_match__classify_projects` | LLM classification via OpenRouter |
| `ml` | `core_ml__embed_projects`, `core_ml__embed_users` | SentenceTransformer 384-dim embeddings |
| `sync` | `core_public__sync_projects` | Upsert into public.Project |
| `dbt_models` | all dbt models | dagster-dbt integration |

### Resources

| Resource | Key | Notes |
|----------|-----|-------|
| `PipelineConfig` | `"config"` | All env vars, injected everywhere |
| `LLMClassifierResource` | `"llm_classifier"` | OpenRouter API, mistral-small |
| `SentenceTransformerResource` | `"sentence_transformer"` | all-MiniLM-L6-v2, CPU |
| `FastTextModelResource` | `"fasttext_model"` | lid.176.ftz for language detection |
| `PandasPostgresIOManager` | `"io_manager"` | DataFrame <-> Postgres via SQLAlchemy |

### Known issues to check for

- `get_db_cursor(commit=)` param is ignored — `get_db_connection()` always auto-commits
- IO manager uses `to_sql(if_exists="replace")` which drops and recreates tables
- IO manager has SQL injection risk via f-string table name interpolation
- LLM classifier returns `{"error": ...}` dicts instead of raising exceptions
- LLM classifier creates a new OpenAI client per call instead of singleton
- Fetcher assets have no `timeout` on `subprocess.run()`
- `core_github__detect_languages` returns success Output even if DB insert fails
- `core_ml__embed_projects` creates `SQLAlchemy.create_engine()` per run without dispose
- `core_public__sync_projects` inner raise is caught by outer except and swallowed

### DB schemas

- `public` — user-facing (User, Project, Category, Domain, TechStack)
- `github` — raw scraped data (RawGithubProject, RawGithubReadme, etc.)
- `ml` — embeddings (EmbdGithubProject, EmbdUser) with pgvector
- `match` — dbt-materialized recommendations

## Debugging workflow

When invoked:

1. Identify the failing asset or run from error messages / logs
2. Read the asset source code and its upstream dependencies
3. Check resource wiring in `definitions.py`
4. Trace data flow: which tables are read/written, which IO manager is used
5. Check for the known issues listed above
6. Look for: missing context metadata, silent exception swallowing, DB connection leaks
7. Propose a minimal, targeted fix
8. Verify the fix doesn't break downstream assets

Always check `dagster_home/` logs if available. Use `dagster dev` output for local debugging.

Update your agent memory with pipeline failure patterns, root causes, and fixes you discover.
88 changes: 88 additions & 0 deletions .claude/agents/security-auditor.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,88 @@
---
name: security-auditor
description: Security auditor for the OST Linker project. Use proactively before creating PRs, when modifying code that touches the database, subprocess calls, environment variables, Docker configuration, or CI/CD workflows. Also use when reviewing external contributions.
tools: Read, Grep, Glob, Bash
model: opus
maxTurns: 25
---

You are a security auditor specialized in the OST Linker codebase.

## Project context

OST Linker is a data pipeline (Dagster + dbt + Go) that scrapes GitHub, classifies projects via LLM, and serves recommendations. It handles GitHub API tokens, database credentials, and LLM API keys.

### Attack surface

| Component | Risk | Files |
|-----------|------|-------|
| IO Manager | SQL injection via f-string | `src/linker/resources/io_manager.py` |
| Go fetcher | SQL injection via `fmt.Sprintf` table name | `src/services/go/fetcher/common.go` |
| Subprocess calls | Command injection if args not sanitized | `src/linker/assets/scraper/`, `src/linker/assets/fetcher/` |
| DB connections | Credential leak in logs | `src/services/python/db.py`, `scripts/check_db.py` |
| Docker | Secret bake-in, exposed ports | `Dockerfile`, `docker-compose.yml` |
| CI/CD | Secret exposure, force pushes | `.github/workflows/` |
| Profiles | Hardcoded passwords | `dbt/profiles.yml` |
| Go HTTP | Unbounded `io.ReadAll` (OOM) | `src/services/go/fetcher/` |

### Known vulnerabilities (from last review)

1. **SQL injection** — `io_manager.py:39-44` uses `f"SELECT * FROM {full_table_name}"` from asset key path
2. **SQL injection (Go)** — `fetcher/common.go:97-104` uses `fmt.Sprintf` for table name in query
3. **Credential leak** — `scripts/check_db.py:9` prints full `DATABASE_URL` including password
4. **Hardcoded secrets** — `dbt/profiles.yml:8-9` has default password `'postgres'`
5. **Hardcoded paths** — `scripts/go_binary_gen.sh`, `scripts/clean_dagster.sh` have developer-specific absolute paths
6. **Force push** — `sync-docs-submodule.yml:39` and `sync-prisma-backend.yml:56` use `git push --force`
7. **No body size limit** — Go fetcher uses `io.ReadAll` without `io.LimitReader`
8. **Version mismatch** — `pyproject.toml` targets Python 3.13 for ruff/mypy but runtime is 3.11

## Audit workflow

When invoked:

1. Identify what changed (run `git diff` or check specified files)
2. Scan for each category below
3. Report findings with severity (CRITICAL / HIGH / MEDIUM / LOW)
4. Propose specific fixes for CRITICAL and HIGH

### Scan categories

**Injection**
- SQL: f-strings, string concatenation, `fmt.Sprintf` in queries
- Command: unsanitized args to `subprocess.run()`, `os.system()`, `exec.Command()`
- Template: Jinja injection in dbt macros

**Secrets**
- Hardcoded passwords, API keys, tokens in code, config, or Docker
- Credentials logged to stdout/stderr
- `.env` files or secrets committed to git
- Default values for sensitive env vars

**Docker & CI**
- Secrets baked into image layers
- Containers running as root
- Exposed ports without need
- Missing `.dockerignore` entries
- Force pushes in CI workflows
- Missing git author config in CI commits
- Secrets accessible to fork PRs

**Dependencies**
- Known CVEs (run `pip-audit` if available)
- Unpinned versions in production
- Dev dependencies in production image

**Data safety**
- Unbounded reads (`io.ReadAll` without limits)
- Missing timeouts on network calls or subprocess
- Race conditions in concurrent code
- Connection/resource leaks

Output format for each finding:

```
## [SEVERITY] Title
**File:** path:line
**Issue:** description
**Fix:** concrete code change
```
60 changes: 60 additions & 0 deletions .claude/rules/architecture.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,60 @@
# Architecture

## Multi-layer Pipeline (Dagster orchestrates everything)

The pipeline entry point is `src/linker/definitions.py`, which wires all assets, resources, jobs, schedules, and sensors into a single Dagster `Definitions` object. Dagster module is configured in `pyproject.toml` under `[tool.dagster]`.

**Data flow:**
```
GitHub API (Go scraper)
-> raw DB tables (github schema)
-> dbt staging/int/pivot models (github schema)
-> LLM classification (classification group)
-> dbt match models (public schema)
-> Embedding computation (ml schema)
-> cosine similarity recommendations (match schema)
-> public sync (public.Project)
```

## Resources (`src/linker/resources/`)

| Resource | Purpose |
|---|---|
| `config_resource` (`PipelineConfig`) | Reads all env vars; injected as `"config"` |
| `LLMClassifierResource` | OpenRouter API (OpenAI-compatible) — uses `mistralai/mistral-small-3.2-24b-instruct` |
| `SentenceTransformerResource` | `all-MiniLM-L6-v2` for 384-dim embeddings; device defaults to `"cpu"` |
| `FastTextModelResource` | Language detection from `models/lid.176.ftz` |
| `PandasPostgresIOManager` | Custom IO manager passing DataFrames between assets via Postgres |

## Go Services (`src/services/go/`)

Two independent binaries, each with its own `go.mod`:
- `scraper/` — scrapes GitHub Search API, writes to `github.RawGithubProject`
- `fetcher/` — fetches per-repo details (README, languages, topics), writes to raw tables

Both are invoked as subprocesses by Dagster assets via `subprocess.run()`.

## Docker Build

3-stage Dockerfile:
1. **Go Builder** (`golang:1.24-alpine`) — compiles both Go binaries to `/app/bin/`
2. **Python Builder** (`python:3.11-slim`) — exports deps via uv to `requirements.txt`
3. **Runtime** (`python:3.11-slim`) — installs deps, copies Go binaries to `/usr/local/bin/`, runs Dagster

`docker-compose.yml` runs two services: `ost-linker` (app) and `db` (PostgreSQL with pgvector via `ankane/pgvector:v0.4.1`). DB is exposed on port 5433 by default.

## Database Schema

Managed by **Prisma** (`prisma/schema.prisma`) with 4 PostgreSQL schemas:
- `public` — user-facing models: `User`, `Project`, `Category`, `Domain`, `TechStack`, etc.
- `github` — raw scraped data: `RawGithubProject`, `RawGithubReadme`, `RawGithubLanguages`, `RawGithubTopics`, `IntGithubDetection`
- `ml` — ML artifacts: `EmbdGithubProject` (pgvector), `EmbdUser` (pgvector)
- `match` — computed recommendations (dbt materialized tables)

The `pgvector` extension enables cosine similarity search. The vector dimension is 384 (MiniLM-L6-v2).

Seed data lives in `prisma/seed/` (categories, domains, techstacks).

## Python Services (`src/services/python/`)

- `db.py` — shared DB cursor context manager (`get_db_cursor`) used by assets
Loading
Loading