perf(embeddings): cross-node batching + worker pool #33
Merged
The embeddings phase was pegged to one embedding per node per await, behind a single-threaded ONNX session: an AWSQuickWork run sat at 95% CPU for 7+ minutes on 1,922 files. Refactor into two stages: walk tiers once to collect (text, emitRow) jobs in canonical order, then dispatch in fixed-size batches across a configurable Piscina pool of OnnxEmbedder workers. Each wave fires workers × batchSize embeds concurrently and scatters vectors back into the row buffer. Row ordering and the embeddingsHash contract are preserved, confirmed by a new test that asserts byte-identical hashes across batchSize=1 vs 32.

- New flags: --embeddings-workers <n|auto>, --embeddings-batch-size <n>.
- A main-thread canary OnnxEmbedder opens before the pool so EmbedderNotSetupError keeps its class identity across the structured-clone boundary.
- HTTP backend unaffected (pool flag ignored when endpoint is set).
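The class-identity concern mentioned above can be illustrated with a minimal sketch. It assumes that errors crossing a worker boundary go through structured-clone style serialization, which rebuilds them as plain `Error` objects; the class below is a hypothetical stand-in for the real `EmbedderNotSetupError`, and `structuredClone` is used here only to imitate the boundary on a single thread.

```typescript
// Hypothetical stand-in for the error class named in this PR.
class EmbedderNotSetupError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "EmbedderNotSetupError";
    // Keeps the prototype chain intact even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

const original = new EmbedderNotSetupError("model weights not found");

// structuredClone imitates what happens to a value sent between threads.
const cloned = structuredClone(original);

// The original is an instance of the subclass; the clone is only a plain Error.
const originalIsSubclass = original instanceof EmbedderNotSetupError;
const clonedIsSubclass = cloned instanceof EmbedderNotSetupError;
const clonedIsError = cloned instanceof Error;

console.log({ originalIsSubclass, clonedIsSubclass, clonedIsError });
```

Opening one embedder on the main thread first presumably lets a setup failure be thrown there, so callers that check `instanceof EmbedderNotSetupError` never see the cloned copy.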
theagenticguy added a commit that referenced this pull request on May 1, 2026
## Summary
- Refactors the embeddings phase from one-embedding-per-node-per-await
into two stages: a **job-collection** pass that walks
symbol/file/community tiers in canonical order producing `{text,
emitRow}` records, and a **dispatch** loop that fires `workers ×
batchSize` embeds concurrently per wave and scatters vectors back into
the row buffer.
- Adds a Piscina pool of independent `OnnxEmbedder` workers
(`packages/ingestion/src/pipeline/phases/embedder-{worker,pool}.ts`).
Each worker holds its own ONNX session; the pool is exposed behind an
`Embedder`-shaped facade so the phase doesn't branch. A main-thread
canary `OnnxEmbedder` opens first so `EmbedderNotSetupError` keeps its
class identity across the structured-clone boundary.
- New flags: `--embeddings-workers <n|auto>` and
`--embeddings-batch-size <n>` (defaults: 1 and 32 — unchanged
single-threaded behaviour out of the box).
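The two stages described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the `Job`, `Row`, and `EmbedBatch` shapes, the `collectJobs` and `dispatch` names, and the stand-in embedder are all hypothetical, and the real phase dispatches through the Piscina-backed facade rather than a plain function.

```typescript
// Hypothetical shapes for illustration; the real types live in the ingestion package.
type Row = { nodeId: string; vector: number[] };
type Job = { text: string; emitRow: (vector: number[]) => Row };
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

// Stage 1: walk the tiers once, in canonical order, and record one job per text.
function collectJobs(tiers: { nodeId: string; text: string }[][]): Job[] {
  const jobs: Job[] = [];
  for (const tier of tiers) {
    for (const node of tier) {
      jobs.push({ text: node.text, emitRow: (vector) => ({ nodeId: node.nodeId, vector }) });
    }
  }
  return jobs;
}

// Stage 2: dispatch waves of up to workers × batchSize jobs, then scatter by index.
async function dispatch(
  jobs: Job[],
  embedBatch: EmbedBatch,
  workers: number,
  batchSize: number,
): Promise<Row[]> {
  const rows: Row[] = new Array(jobs.length);
  const waveSize = workers * batchSize;
  for (let waveStart = 0; waveStart < jobs.length; waveStart += waveSize) {
    const waveEnd = Math.min(waveStart + waveSize, jobs.length);
    const pending: Promise<void>[] = [];
    for (let start = waveStart; start < waveEnd; start += batchSize) {
      const batch = jobs.slice(start, Math.min(start + batchSize, waveEnd));
      pending.push(
        embedBatch(batch.map((job) => job.text)).then((vectors) => {
          // Each vector goes to the slot of the job that produced it, so the
          // row order never depends on which batch finishes first.
          vectors.forEach((vector, i) => {
            rows[start + i] = batch[i].emitRow(vector);
          });
        }),
      );
    }
    await Promise.all(pending); // one wave completes before the next begins
  }
  return rows;
}

// Stand-in embedder: deterministic per text, so batching cannot change the result.
const fakeEmbedBatch: EmbedBatch = async (texts) =>
  texts.map((text) => [text.length, text.charCodeAt(0) || 0]);
```

With, say, `workers = 4` and `batchSize = 32`, each wave covers up to 128 jobs. Writing results by job index rather than appending on completion is what keeps the row buffer in canonical order in this sketch.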
### Motivation
A real-world `codehub analyze --embeddings --force --granularity
symbol,file,community` run on a ~1,922-file AWS codebase sat at 95% CPU
for 7+ minutes before the refactor. The phase awaited `embedBatch()`
once per node inside a single-threaded ONNX session
(`intraOpNumThreads: 1`, `graphOptimizationLevel: "disabled"`, settings
required for the graphHash determinism contract), so there was no
concurrency anywhere in the stack.
### Determinism
The graphHash / `embeddingsHash` contract is preserved:
- Canonical tier ordering (symbol → file → community) is unchanged.
- Rows are still sorted by `(granularity, nodeId, chunkIndex)` before
hashing.
- `openOnnxEmbedder()`'s deterministic knobs are intact per worker, so
the vector produced for a given input does not depend on which worker
ran it.
- New regression test asserts `embeddingsHash` at `batchSize=1` equals
`embeddingsHash` at `batchSize=32`.
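To show why sorting before hashing makes the result independent of arrival order, here is a small sketch. It is not the project's hashing code: the row shape, the plain string comparison on `granularity`, the choice of SHA-256, and the Float32 byte encoding are all assumptions made for illustration.

```typescript
import { createHash } from "node:crypto";

// Hypothetical row shape; field names follow the sort key named above.
type EmbeddingRow = {
  granularity: string;
  nodeId: string;
  chunkIndex: number;
  vector: number[];
};

// Compare by (granularity, nodeId, chunkIndex) using locale-independent ordering.
function compareRows(a: EmbeddingRow, b: EmbeddingRow): number {
  if (a.granularity !== b.granularity) return a.granularity < b.granularity ? -1 : 1;
  if (a.nodeId !== b.nodeId) return a.nodeId < b.nodeId ? -1 : 1;
  return a.chunkIndex - b.chunkIndex;
}

// Sort a copy into canonical order, then hash each row's key and vector bytes.
function hashRows(rows: EmbeddingRow[]): string {
  const hash = createHash("sha256");
  for (const row of [...rows].sort(compareRows)) {
    hash.update(`${row.granularity}\u0000${row.nodeId}\u0000${row.chunkIndex}\u0000`);
    hash.update(new Float32Array(row.vector));
  }
  return hash.digest("hex");
}
```

Because the sort runs on a copy immediately before hashing, the order in which workers return their batches has no effect on the digest in this sketch; only the row contents do.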
### Expected speedup
On an M-series laptop with `--embeddings-workers auto
--embeddings-batch-size 32`, the 7-minute AWSQuickWork run should drop
to roughly 1–2 minutes. `--embeddings-int8` cuts that further.
## Test plan
- [x] `pnpm build` — clean
- [x] `pnpm --filter @opencodehub/ingestion test` — 576/576 pass
- [x] New test: `embeddings.test.ts` — `batchSize=1` vs `batchSize=32`
produce byte-identical `embeddingsHash`
- [x] `codehub analyze --help` surfaces `--embeddings-workers` and
`--embeddings-batch-size`
- [ ] End-to-end: run `codehub analyze AWSQuickWork --embeddings --force
--granularity symbol,file,community --embeddings-workers auto` and
confirm wall time drop + identical `embeddingsHash` vs a single-threaded
control run
theagenticguy added a commit that referenced this pull request on May 12, 2026