theagenticguy · Apr 27, 2026
@@ -0,0 +1,14 @@
+# OpenCodeHub — ERPAVal durable knowledge index
+
+Compound-extracted lessons and EARS specs from prior autonomous
+development sessions. Solutions are reusable; specs are per-feature.
+
+## Solutions (architecture patterns + conventions)
+
+- [SCIP replaces LSP for code-graph oracle edges](solutions/architecture-patterns/scip-replaces-lsp.md) — one-shot indexers beat stateful LSP clients for compiler-grade graph edges.
+- [Repomix --compress is output-side only](solutions/architecture-patterns/repomix-is-output-side.md) — don't substitute it for a tree-sitter chunker; use it for repo snapshots.
+- [Hand-roll a minimal protobuf reader for fixed schemas](solutions/conventions/scip-protobuf-hand-rolled-reader.md) — ~130 LOC beats pulling in buf+codegen when the schema is small and stable.
+
+## Specs
+
+- [001-scip-replaces-lsp](specs/001-scip-replaces-lsp/spec.md) — rip-and-replace LSP with SCIP for TS/Py/Go/Rust/Java. Task map: [tasks.md](specs/001-scip-replaces-lsp/tasks.md).
@@ -0,0 +1,218 @@
+# OpenCodeHub — Wave-plan tech-debt tracker
+
+**Status**: Working document. Gitignored via the `.erpaval/` entry in `.gitignore`.
+
+This file catalogs every wave/stream code reference that was scrubbed from
+the codebase on 2026-04-23 during the clean-room audit. The references were
+originally left behind as "TODO when Wave X lands" style hints, and they
+encoded actual product state — features deferred, scanner tiers, eval
+baselines, rollout priority, etc. The scrub removed the wave labels but
+some of the underlying work is still incomplete.
+
+Treat every line here as a candidate backlog ticket. For each: figure out
+whether the thing was actually shipped (and the comment was stale), or
+whether it's still open (and deserves an issue).
+
+## Legend
+
+- **W1** — W1-CORE (initial MVP shape)
+- **W2** — second wave (language coverage, caching, scanner tiers, detectors)
+- **W3** — third wave (analysis tools: risk_trends, verdict variants)
+- **W4** — fourth wave (bench, doctor, gates)
+- **W5** — fifth wave (new tools, eval matrix expansion)
+
+Stream letters appeared on W1 artifacts (Stream E = caching, Stream J =
+multi-repo groups, Stream T = suppressions, etc). They're a second axis
+orthogonal to the W-code.
+
+## Catalog
+
+### packages/cli — wave hints
+
+- `packages/cli/src/commands/analyze.ts:170` — "Cache-health stats
+  (W2-E.4): the parse-cache hit ratio and on-disk" size telemetry. The
+  code ships the stats, but they were flagged as W2-E.4 work.
+  **Action:** confirm the stats are actually populated; add a test if not.
+- `packages/cli/src/commands/bench.test.ts:2` and
+  `packages/cli/src/commands/doctor.test.ts:2` — "Unit tests for codehub
+  bench — W4-G.3" / "doctor — W4-G.3". Both command+test exist; W4-G.3
+  is delivered. **Action:** no debt, just a stale label.
+
+### packages/embedder — W2-A.2 (embedder weights downloader)
+
+All 5 files reference "W2-A.2" as the code path that installs ONNX
+weights via `codehub setup --embeddings`.
+
+- `packages/embedder/src/index.ts:7`
+- `packages/embedder/src/paths.ts:12`
+- `packages/embedder/src/model-pins.ts:4`
+- `packages/embedder/src/model-pins.ts:40` — `"once from the upstream repo"`
+- `packages/embedder/src/model-pins.test.ts:4`
+- `packages/embedder/src/onnx-embedder.ts:11`
+
+**Action:** `codehub setup --embeddings` ships in `packages/cli/src/commands/setup.ts`
+— feature is done. Labels are stale. No debt.
+
+### packages/eval — MVP + W2-C.* language fixtures + W5-3 new-tool matrix
+
+- `packages/eval/baselines/opencodehub-v1.json:60` — "14 language fixtures
+  (MVP 7 + W2-C.2/3/4 additions: c, cpp, ruby, kotlin, swift, php,
+  dart)."
+- `packages/eval/baselines/opencodehub-v1.json:63` — "risk_trends and
+  verdict map to tools still in flight (W3-F.1 / W3-F.2). Cases pass via
+  the isError branch with a structured error envelope until the server
+  registers the tools."
+- `packages/eval/src/opencodehub_eval/agent.py:177` — "W5-3 new tools"
+  section delimiter
+- `packages/eval/src/opencodehub_eval/agent.py:185` — "are still in
+  flight (W3-F.1 / W3-F.2)"
+- `packages/eval/src/opencodehub_eval/bench.py:243` — "new = 63 (W5-3
+  new-tool matrix)"
+- `packages/eval/src/opencodehub_eval/bench.py:269` — "hard-coded 98 (the
+  W2-C.5 core target)"
+- `packages/eval/src/opencodehub_eval/tests/conftest.py:31` — "14
+  language fixtures (7 MVP + 7 W2-C.2/3/4 additions)"
+- `packages/eval/src/opencodehub_eval/tests/test_parametrized.py:8,10` —
+  "W2-C.5 deliverable", "W5-3 coverage for the nine tools"
+- `packages/eval/src/opencodehub_eval/tests/test_parametrized.py:167-175`
+  — risk_trends / verdict (W3-F.1/W3-F.2) tool-still-unregistered
+  fallback logic
+- `packages/eval/src/opencodehub_eval/tests/test_parametrized.py:257` —
+  "W5-3 expansion" in the parametrize helper
+
+**Real debt here:**
+
+1. **W3-F.1 / W3-F.2 (risk_trends + verdict):** eval acknowledged these
+   as unregistered tools with fallback paths. Search `packages/mcp/src/tools/`
+   — if both tools exist and are registered, the fallback branches in
+   `test_parametrized.py:167-175` become dead code that can be removed.
+   If one is missing, that's a product gap.
+2. **W2-C.5 core target = 98.** If the eval baseline now passes a
+   different target, update the hard-coded fallback in
+   `bench.py:269`.
+
+### packages/ingestion — language registry (W2-C.1) + content cache (Stream E / W2-E.*)
+
+- `packages/ingestion/src/parse/grammar-registry.test.ts:52-53` — loads
+  "W2-C.1 grammars" (7 additional: c, cpp, ruby, kotlin, swift, php,
+  dart)
+- `packages/ingestion/src/parse/grammar-registry.ts:198` — "W2-C.*
+  languages whose grammar package is not installed"
+- `packages/ingestion/src/parse/language-detector.ts:26` — "W2-C.1
+  additions"
+- `packages/ingestion/src/pipeline/phases/content-cache.ts:2` —
+  "Content-addressed parse cache (Stream E, W2-E.1)"
+- `packages/ingestion/src/pipeline/phases/content-cache.ts:133` —
+  "lazily by a future eviction pass (W2-E.4)"
+- `packages/ingestion/src/pipeline/phases/content-cache.ts:193` —
+  "meta-sidecar cache-stats path (W2-E.4)"
+
+**Real debt:**
+
+1. **W2-E.4 eviction pass.** `content-cache.ts:133` says eviction is
+   deferred to "a future eviction pass." Search for any actual eviction
+   code. If none exists, this is a real backlog item (the parse cache
+   will grow unbounded).
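If no eviction code turns up, a size-bounded pass could look roughly like the sketch below. Everything in it is hypothetical: the `CacheEntry` shape, the `planEviction` name, and the byte budget are illustrative and not taken from `content-cache.ts`. It drops least-recently-used entries until the total fits the budget.

```typescript
// Hypothetical shape of one parse-cache entry; not the real content-cache.ts type.
interface CacheEntry {
  key: string;          // content-addressed cache key
  sizeBytes: number;    // on-disk size of the cached parse result
  lastAccessMs: number; // last time the entry was read (epoch milliseconds)
}

// Returns the keys to delete so that the remaining entries fit within maxBytes.
// Least-recently-used entries are evicted first.
function planEviction(entries: readonly CacheEntry[], maxBytes: number): string[] {
  let total = entries.reduce((sum, entry) => sum + entry.sizeBytes, 0);
  const oldestFirst = [...entries].sort((a, b) => a.lastAccessMs - b.lastAccessMs);
  const evict: string[] = [];
  for (const entry of oldestFirst) {
    if (total <= maxBytes) break;
    evict.push(entry.key);
    total -= entry.sizeBytes;
  }
  return evict;
}
```

Keeping the planner pure (it only returns keys) would let the deletion itself live next to the existing cache I/O, and makes the policy easy to unit-test.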
+
+### packages/ingestion — profile detectors + providers (wave-labelled)
+
+- `packages/ingestion/src/pipeline/phases/default-set.ts:20` — "scanner
+  phases (W2-I4)"
+- `packages/ingestion/src/pipeline/phases/dependencies.ts` — probably
+  has W-code mentions; verify
+- `packages/ingestion/src/pipeline/phases/incremental-helper.ts` — W-code
+  mention; verify
+- `packages/ingestion/src/pipeline/phases/incremental-scope.ts` and
+  `incremental-scope.test.ts` — W-code mentions; verify
+- `packages/ingestion/src/pipeline/phases/openapi.ts` — verify
+- `packages/ingestion/src/pipeline/phases/parse.test.ts`, `parse.ts` —
+  verify
+- `packages/ingestion/src/pipeline/phases/processes.ts` — verify
+- `packages/ingestion/src/pipeline/phases/profile.ts` — verify
+- `packages/ingestion/src/pipeline/phases/sbom.test.ts`, `sbom.ts` —
+  verify
+- `packages/ingestion/src/pipeline/profile-detectors/frameworks.ts`,
+  `languages.ts`, `manifests.ts` — verify
+- `packages/ingestion/src/pipeline/types.ts` — verify
+- `packages/ingestion/src/providers/http-detect.ts` — verify
+- `packages/ingestion/src/providers/registry.test.ts`, `registry.ts` —
+  verify
+
+**Action:** most are likely stale labels. Spot-check any that contain
+"TODO", "FIXME", or "in flight" — those are real debt.
+
+### packages/mcp — prompts + tools (W-code markers)
+
+- `packages/mcp/src/prompts/prompts.test.ts` — verify
+- `packages/mcp/src/tools/annotations.test.ts` — verify
+- `packages/mcp/src/tools/context.ts` — verify
+- `packages/mcp/src/tools/dependencies.ts` — verify
+- `packages/mcp/src/tools/group-query.ts` — verify
+- `packages/mcp/src/tools/license-audit.ts` — verify
+
+### packages/sarif — schema-validation W-code marker
+
+- `packages/sarif/src/schema-validation.test.ts` — verify
+
+### packages/scanners — P1/P2 tiers + W2-I4
+
+- `packages/scanners/src/catalog.ts:107` — "W2-I4: Priority-2 scanners.
+  These ship alongside P1 but are opt-in via" (exact quote)
+- `packages/scanners/src/wrappers/osv-scanner.ts` — W-code mention
+- `packages/scanners/src/wrappers/p2-wrappers.test.ts` — W-code mention
+- `packages/scanners/src/wrappers/trivy.ts` — W-code mention
+
+**Product fact to preserve:** the P1/P2 split is a real user-facing
+feature. Keep "Priority-1" and "Priority-2" as product terminology
+(they're documented in the `scanners/package.json` description). Only
+drop the W2-I4 label.
+
+### Vendor README — the literal "(to be created in W2-B.2)" smoking gun
+
+- `vendor/stack-graphs-python/README.md:39` — "That evaluator consumes
+  the vendored `.tsg` as (to be created in W2-B.2)."
+
+**Action:** the evaluator DOES exist at
+`packages/ingestion/src/providers/resolution/stack-graphs/`. Rephrase
+the README to point at the real path instead of a wave code. No debt;
+just a stale pointer.
+
+### Root / infra
+
+- `pnpm-workspace.yaml` — W-code mention (probably a comment)
+- `scripts/acceptance.sh` — W-code mentions
+- `scripts/smoke-mcp.sh` — W-code mentions
+
+### Commit subject lines (about to be scrubbed via Option A nuke)
+
+- `645c08e "Stream J: Multi-repo retrieval & group queries"` — this was
+  a real release. Stream J = multi-repo groups. Confirmed shipped.
+- `f08c87f "Initial commit: OpenCodeHub MVP + v1.0 roadmap"` — body
+  says "Apache-2.0 clean-room reimplementation of the GitNexus product
+  surface". Nuke.
+- Several mid-history commits reference "gitnexus" by name in parity
+  / cleanroom-gym notes.
+
+## Stream names seen in history (for reference)
+
+| Stream | What it shipped |
+|--------|-----------------|
+| Stream E | Content-addressed parse cache (`content-cache.ts`, meta sidecar) |
+| Stream J | Multi-repo groups (group-query/group-status/group-sync MCP tools) |
+| Stream T | SARIF suppressions (`packages/sarif/src/suppressions.ts`) |
+
+## Revisit workflow
+
+When you come back to these:
+
+```bash
+# Re-list any wave codes that survived into future commits
+git grep -nE 'W[0-9]+[-.][A-Z0-9]+|\bStream [A-Z]\b'
+
+# Re-list any gitnexus references
+git grep -ni 'gitnexus'
+```
+
+Banned-strings CI check (`scripts/check-banned-strings.sh`) now blocks
+wave codes and `gitnexus` from re-entering the tree — so any future
+appearance is a regression, not drift.
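For illustration, the same check can be sketched in TypeScript. The patterns mirror the `git grep` commands above; the contents of `scripts/check-banned-strings.sh` are not shown in this document, so `BANNED`, `Violation`, and `findBannedStrings` are hypothetical names and the real script may differ.

```typescript
// Illustrative patterns copied from the git grep commands above.
// Assumption: the real CI script may use a different or longer list.
const BANNED: ReadonlyArray<{ name: string; pattern: RegExp }> = [
  { name: "wave-code", pattern: /W[0-9]+[-.][A-Z0-9]+/ },
  { name: "stream-label", pattern: /\bStream [A-Z]\b/ },
  { name: "gitnexus", pattern: /gitnexus/i },
];

interface Violation {
  line: number; // 1-based line number within the scanned text
  name: string; // which banned pattern matched
  text: string; // the offending line
}

// Scans a file's contents line by line and reports every banned match.
function findBannedStrings(source: string): Violation[] {
  const violations: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    for (const { name, pattern } of BANNED) {
      if (pattern.test(text)) {
        violations.push({ line: index + 1, name, text });
      }
    }
  });
  return violations;
}
```

A CI wrapper would run this over tracked files and exit non-zero when the returned list is non-empty.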
@@ -0,0 +1,49 @@
+---
+title: Repomix --compress is output-side only, not an input-side chunker
+tags: [repomix, embedder, chunker, tree-sitter, llm]
+first_applied: 2026-04-26
+repos: [open-code-hub]
+---
+
+## The pattern
+
+Repomix (https://github.com/yamadashy/repomix) is tempting as a
+replacement for a tree-sitter-based chunker in an embedding pipeline —
+it ships `--compress` with ~70% token reduction and supports 16
+languages. **Do not use it that way.** Scope it to output-side surfaces
+(LLM-context packing, snapshot generation).
+
+## Why
+
+1. **Per-file, not per-symbol.** `--compress` stitches signatures +
+   class headers + imports into a single text blob per file joined by
+   `⋮----`. It discards `startLine / endLine / symbolName / nodeType`.
+   A graph-extraction pipeline that turns parse captures into
+   Function/Method/Class nodes + CALLS/IMPORTS/EXTENDS edges cannot be
+   fed from this output.
+2. **Tokenizer mismatch.** Token counts use `o200k_base` (GPT-4o). If
+   your embedder uses any other tokenizer (BERT, ModernBERT, E5,
+   voyage-code), your budget math won't line up.
+3. **Determinism gap.** No grammar-sha is exposed, so content-addressed
+   cache keys `(sha256, grammarSha, pipelineVersion)` lose their
+   grammar component.
+4. **Coverage gaps.** tsx folds into typescript; kotlin is absent.
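To make the determinism point concrete, here is a minimal sketch of a content-addressed key built from the `(sha256, grammarSha, pipelineVersion)` tuple. The `parseCacheKey` name and the colon-joined layout are assumptions for illustration, not the project's actual key format.

```typescript
import { createHash } from "node:crypto";

// Hypothetical key layout: "<sha256 of content>:<grammarSha>:<pipelineVersion>".
function parseCacheKey(content: string, grammarSha: string, pipelineVersion: string): string {
  const contentSha = createHash("sha256").update(content, "utf8").digest("hex");
  return `${contentSha}:${grammarSha}:${pipelineVersion}`;
}
```

With repomix output there is no grammar sha to supply, so the middle component would have to be a constant, and a grammar upgrade would no longer invalidate stale entries.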
+
+## Where repomix actually shines
+
+- `codehub pack` CLI command — single-file snapshot for agents who want
+  to drop the whole repo into their context window.
+- An MCP `pack_codebase` tool that re-exports the repomix invocation so
+  agents can produce their own snapshots without knowing the CLI.
+
+## Quick sanity check before substituting repomix for anything
+
+Before planning to delete a chunker / parser in favor of repomix, ask:
+
+- Do downstream consumers need per-symbol boundaries?
+- Do they need startLine / endLine on every chunk?
+- Do they key caches off grammar shas?
+- Do they need tsx / kotlin / any other language that repomix does not treat as first-class?
+
+Any **yes** means keep your existing chunker; use repomix only for the
+output-side feature.
@@ -0,0 +1,80 @@
+---
+title: SCIP replaces LSP for code-graph oracle edges
+tags: [scip, lsp, ingestion, graph, indexer]
+first_applied: 2026-04-26
+repos: [open-code-hub]
+---
+
+## The pattern
+
+When a code-intelligence system needs compiler-grade call / reference /
+heritage edges across many languages, prefer **SCIP** indexers (one-shot
+artifact producers) over **LSP** servers (stateful JSON-RPC subprocesses).
+
+SCIP indexers exist for TypeScript, Python, Go, Rust (via
+`rust-analyzer scip`), and Java. Each emits a single `.scip` protobuf
+file per run. A global symbol string encodes
+`<scheme> <manager> <package> <version> <descriptor>+`, which is
+globally unique, so cross-repo references work by construction.
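The symbol layout above can be split apart with a few lines of code. This is a sketch under stated assumptions: `parseScipSymbol` and `ParsedSymbol` are hypothetical names, the descriptor chain is kept as one opaque string, and the double-space escape and the `local ` prefix reflect my reading of the SCIP symbol grammar rather than anything stated in this document.

```typescript
type ParsedSymbol =
  | { kind: "local"; id: string }
  | {
      kind: "global";
      scheme: string;
      manager: string;
      packageName: string;
      version: string;
      descriptors: string; // left unparsed in this sketch
    };

function parseScipSymbol(symbol: string): ParsedSymbol {
  // Document-local symbols carry no package information.
  if (symbol.startsWith("local ")) {
    return { kind: "local", id: symbol.slice("local ".length) };
  }
  const fields: string[] = [];
  let current = "";
  let i = 0;
  // Read the four space-separated header fields; a doubled space is an escaped space.
  while (i < symbol.length && fields.length < 4) {
    const ch = symbol.charAt(i);
    if (ch === " ") {
      if (symbol.charAt(i + 1) === " ") {
        current += " ";
        i += 2;
        continue;
      }
      fields.push(current);
      current = "";
      i += 1;
      continue;
    }
    current += ch;
    i += 1;
  }
  const descriptors = symbol.slice(i);
  if (fields.length < 4 || descriptors.length === 0) {
    throw new Error(`malformed SCIP symbol: ${symbol}`);
  }
  const [scheme, manager, packageName, version] = fields as [string, string, string, string];
  return { kind: "global", scheme, manager, packageName, version, descriptors };
}
```

Only the `global` variant is meaningful across repositories; local symbols would need to be scoped by their document before being stored in a shared graph.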
+
+## The shape
+
+```
+source tree  ─►  per-lang SCIP indexer (×5) ─►  .opencodehub/scip/<lang>.scip
+                                                        │
+                                                        ▼
+                                   parseScipIndex(Uint8Array) -> ScipIndex
+                                                        │
+                                                        ▼
+                                    deriveIndex(index) -> {symbols, edges}
+                                                        │
+                                                        ▼
+                                    materialize(edges) -> {node_metrics,
+                                                           reach_forward,
+                                                           reach_backward,
+                                                           scc}
+                                                        │
+                                                        ▼
+                                   CodeRelation(confidence=1.0,
+                                                reason="scip:<indexer>@<v>")
+```
+
+## Why this beats the LSP approach
+
+- **No daemon.** SCIP produces an artifact; no stdio JSON-RPC, no
+  request correlation, no warm-up, no timeout tuning.
+- **Dependency surface shrinks.** No pyright / tsserver / gopls
+  language servers in node_modules; rust-analyzer is still required,
+  but only for one-shot `rust-analyzer scip` runs.
+- **Cross-repo for free.** SCIP symbol strings are globally unique;
+  merging two `.scip` files is just `concat documents[] + concat
+  external_symbols[]` at the protobuf level.
+- **Incremental caching is trivial.** One mtime check per language; no
+  need to track per-symbol queries.
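The concatenation claim above leans on a general protobuf property: the byte-wise concatenation of two serialized messages of the same type parses as their merge, with repeated fields appended. A minimal sketch, assuming each input is a complete serialized SCIP index; `mergeScipBytes` is a hypothetical helper name.

```typescript
// Concatenates serialized SCIP indexes. Decoding the result yields one message
// whose repeated fields (documents, external_symbols) are the inputs' entries
// in order.
function mergeScipBytes(...indexes: Uint8Array[]): Uint8Array {
  const total = indexes.reduce((length, bytes) => length + bytes.length, 0);
  const merged = new Uint8Array(total);
  let offset = 0;
  for (const bytes of indexes) {
    merged.set(bytes, offset);
    offset += bytes.length;
  }
  return merged;
}
```

Singular fields such as `metadata` are merged field by field with the last value winning, and nothing here deduplicates documents that appear in both inputs, so a real merge step would probably need to handle both.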
+
+## The contract boundary worth preserving
+
+Downstream consumers (`confidence-demote`, `summarize`,
+`mcp/confidence`, `cli/analyze` auto-cap) hinge on the
+`confidence=1.0` + `reason startsWith "<oracle>:"` contract, so treat
+it as load-bearing. When migrating from LSP to SCIP, keep the same
+confidence ceiling and switch only the reason-prefix list and the
+phase name that produces the edges. Downstream code changes are then
+one-line (a new constant).
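A sketch of what that contract check might look like. The `CodeRelation` fields are reduced to the two the contract mentions, and `SCIP_ORACLE_PREFIXES` and `isOracleEdge` are hypothetical names; the real constants live in the downstream consumers and are not shown here.

```typescript
// Reduced to the two fields the contract depends on.
interface CodeRelation {
  confidence: number;
  reason: string;
}

// Hypothetical prefix list. Swapping this constant is the "one-line" change.
const SCIP_ORACLE_PREFIXES: readonly string[] = ["scip:"];

// An edge counts as oracle-grade only when both halves of the contract hold.
function isOracleEdge(rel: CodeRelation, prefixes: readonly string[] = SCIP_ORACLE_PREFIXES): boolean {
  return rel.confidence === 1.0 && prefixes.some((prefix) => rel.reason.startsWith(prefix));
}
```

Passing the prefix list as a parameter keeps the check itself unchanged across the migration; only the constant it is called with moves.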
+
+## Lingering gotchas
+
+- **scip-java / rust-analyzer run build scripts** — gate behind an
+  explicit `allowBuildScripts=true` opt-in for untrusted workspaces.
+- **Relationship edges (IMPLEMENTS) are in SymbolInformation, not in
+  Occurrence** — a minimal protobuf reader that only decodes
+  Occurrence will not surface them. When we need real IMPLEMENTS
+  semantics, extend the parser to decode `SymbolInformation.relationships`.
+- **SCIP range encoding has two shapes** — 4-int
+  `[startLine, startChar, endLine, endChar]` OR 3-int
+  `[line, startChar, endChar]` when start/end share a line. Normalize
+  at decode time.
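The range normalization in the last bullet is small enough to sketch in full. `ScipRange` and `normalizeScipRange` are hypothetical names; the two accepted shapes are exactly the ones listed above.

```typescript
interface ScipRange {
  startLine: number;
  startChar: number;
  endLine: number;
  endChar: number;
}

// Accepts either SCIP range shape and always returns the 4-field form.
function normalizeScipRange(range: readonly number[]): ScipRange {
  if (range.length === 4) {
    const [startLine, startChar, endLine, endChar] = range as readonly [number, number, number, number];
    return { startLine, startChar, endLine, endChar };
  }
  if (range.length === 3) {
    // 3-int form: start and end share a line.
    const [line, startChar, endChar] = range as readonly [number, number, number];
    return { startLine: line, startChar, endLine: line, endChar };
  }
  throw new Error(`unexpected SCIP range length: ${range.length}`);
}
```

Normalizing once at decode time means later stages never need to branch on the encoding.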
+
+## When NOT to use this
+
+- Small toy projects where tree-sitter heuristic edges are good enough.
+- Languages with no SCIP indexer wired into this pipeline (C#, C, C++,
+  Ruby, Kotlin, Swift, PHP, Dart, as of 2026-04-26). Keep tree-sitter
+  for those.