14 changes: 14 additions & 0 deletions .erpaval/INDEX.md
@@ -0,0 +1,14 @@
# OpenCodeHub — ERPAVal durable knowledge index

Compound-extracted lessons and EARS specs from prior autonomous
development sessions. Solutions are reusable; specs are per-feature.

## Solutions (architecture patterns + conventions)

- [SCIP replaces LSP for code-graph oracle edges](solutions/architecture-patterns/scip-replaces-lsp.md) — one-shot indexers beat stateful LSP clients for compiler-grade graph edges.
- [Repomix --compress is output-side only](solutions/architecture-patterns/repomix-is-output-side.md) — don't substitute it for a tree-sitter chunker; use it for repo snapshots.
- [Hand-roll a minimal protobuf reader for fixed schemas](solutions/conventions/scip-protobuf-hand-rolled-reader.md) — ~130 LOC beats pulling in buf+codegen when the schema is small and stable.

## Specs

- [001-scip-replaces-lsp](specs/001-scip-replaces-lsp/spec.md) — rip-and-replace LSP with SCIP for TS/Py/Go/Rust/Java. Task map: [tasks.md](specs/001-scip-replaces-lsp/tasks.md).
218 changes: 218 additions & 0 deletions .erpaval/debt.md
@@ -0,0 +1,218 @@
# OpenCodeHub — Wave-plan tech-debt tracker

**Status**: Working document. Gitignored via `.gitignore: .erpaval/`.

This file catalogs every wave/stream code reference that was scrubbed from
the codebase on 2026-04-23 during the clean-room audit. The references were
originally left behind as "TODO when Wave X lands" style hints, and they
encoded actual product state — features deferred, scanner tiers, eval
baselines, rollout priority, etc. The scrub removed the wave labels but
some of the underlying work is still incomplete.

Treat every line here as a candidate backlog ticket. For each: figure out
whether the thing was actually shipped (and the comment was stale), or
whether it's still open (and deserves an issue).

## Legend

- **W1** — W1-CORE (initial MVP shape)
- **W2** — second wave (language coverage, caching, scanner tiers, detectors)
- **W3** — third wave (analysis tools: risk_trends, verdict variants)
- **W4** — fourth wave (bench, doctor, gates)
- **W5** — fifth wave (new tools, eval matrix expansion)

Stream letters appeared on W1 artifacts (Stream E = caching, Stream J =
multi-repo groups, Stream T = suppressions, etc). They're a second axis
orthogonal to the W-code.

## Catalog

### packages/cli — wave hints

- `packages/cli/src/commands/analyze.ts:170` — "Cache-health stats
(W2-E.4): the parse-cache hit ratio and on-disk" (size telemetry). The
code ships the stats; the comment had flagged them as W2-E.4 work.
**Action:** confirm the stats are actually populated; add a test if not.
- `packages/cli/src/commands/bench.test.ts:2` and
`packages/cli/src/commands/doctor.test.ts:2` — "Unit tests for codehub
bench — W4-G.3" / "doctor — W4-G.3". Both command+test exist; W4-G.3
is delivered. **Action:** no debt, just a stale label.

### packages/embedder — W2-A.2 (embedder weights downloader)

All 5 files below (6 references in total) cite "W2-A.2" as the code path
that installs ONNX weights via `codehub setup --embeddings`.

- `packages/embedder/src/index.ts:7`
- `packages/embedder/src/paths.ts:12`
- `packages/embedder/src/model-pins.ts:4`
- `packages/embedder/src/model-pins.ts:40` — `"once from the upstream repo"`
- `packages/embedder/src/model-pins.test.ts:4`
- `packages/embedder/src/onnx-embedder.ts:11`

**Action:** `codehub setup --embeddings` ships in `packages/cli/src/commands/setup.ts`,
so the feature is done. The labels are stale. No debt.

### packages/eval — MVP + W2-C.* language fixtures + W5-3 new-tool matrix

- `packages/eval/baselines/opencodehub-v1.json:60` — "14 language fixtures
(MVP 7 + W2-C.2/3/4 additions: c, cpp, ruby, kotlin, swift, php,
dart)."
- `packages/eval/baselines/opencodehub-v1.json:63` — "risk_trends and
verdict map to tools still in flight (W3-F.1 / W3-F.2). Cases pass via
the isError branch with a structured error envelope until the server
registers the tools."
- `packages/eval/src/opencodehub_eval/agent.py:177` — "W5-3 new tools"
section delimiter
- `packages/eval/src/opencodehub_eval/agent.py:185` — "are still in
flight (W3-F.1 / W3-F.2)"
- `packages/eval/src/opencodehub_eval/bench.py:243` — "new = 63 (W5-3
new-tool matrix)"
- `packages/eval/src/opencodehub_eval/bench.py:269` — "hard-coded 98 (the
W2-C.5 core target)"
- `packages/eval/src/opencodehub_eval/tests/conftest.py:31` — "14
language fixtures (7 MVP + 7 W2-C.2/3/4 additions)"
- `packages/eval/src/opencodehub_eval/tests/test_parametrized.py:8,10` —
"W2-C.5 deliverable", "W5-3 coverage for the nine tools"
- `packages/eval/src/opencodehub_eval/tests/test_parametrized.py:167-175`
— risk_trends / verdict (W3-F.1/W3-F.2) tool-still-unregistered
fallback logic
- `packages/eval/src/opencodehub_eval/tests/test_parametrized.py:257` —
"W5-3 expansion" in the parametrize helper

**Real debt here:**

1. **W3-F.1 / W3-F.2 (risk_trends + verdict):** eval acknowledged these
as unregistered tools with fallback paths. Search `packages/mcp/src/tools/`
— if both tools exist and are registered, the fallback branches in
`test_parametrized.py:167-175` become dead code that can be removed.
If one is missing, that's a product gap.
2. **W2-C.5 core target = 98.** If the eval baseline now passes a
different target, update the hard-coded fallback in
`bench.py:269`.

### packages/ingestion — language registry (W2-C.1) + content cache (Stream E / W2-E.*)

- `packages/ingestion/src/parse/grammar-registry.test.ts:52-53` — loads
"W2-C.1 grammars" (7 additional: c, cpp, ruby, kotlin, swift, php,
dart)
- `packages/ingestion/src/parse/grammar-registry.ts:198` — "W2-C.*
languages whose grammar package is not installed"
- `packages/ingestion/src/parse/language-detector.ts:26` — "W2-C.1
additions"
- `packages/ingestion/src/pipeline/phases/content-cache.ts:2` —
"Content-addressed parse cache (Stream E, W2-E.1)"
- `packages/ingestion/src/pipeline/phases/content-cache.ts:133` —
"lazily by a future eviction pass (W2-E.4)"
- `packages/ingestion/src/pipeline/phases/content-cache.ts:193` —
"meta-sidecar cache-stats path (W2-E.4)"

**Real debt:**

1. **W2-E.4 eviction pass.** content-cache.ts:133 says eviction is
deferred to "a future eviction pass." Search for any actual eviction
code — if none exists, this is a real backlog item (parse cache will
grow unbounded).

### packages/ingestion — profile detectors + providers (wave-labelled)

- `packages/ingestion/src/pipeline/phases/default-set.ts:20` — "scanner
phases (W2-I4)"
- `packages/ingestion/src/pipeline/phases/dependencies.ts` — probably
has W-code mentions; verify
- `packages/ingestion/src/pipeline/phases/incremental-helper.ts` — W-code
mention; verify
- `packages/ingestion/src/pipeline/phases/incremental-scope.ts` and
`incremental-scope.test.ts` — W-code mentions; verify
- `packages/ingestion/src/pipeline/phases/openapi.ts` — verify
- `packages/ingestion/src/pipeline/phases/parse.test.ts`, `parse.ts` —
verify
- `packages/ingestion/src/pipeline/phases/processes.ts` — verify
- `packages/ingestion/src/pipeline/phases/profile.ts` — verify
- `packages/ingestion/src/pipeline/phases/sbom.test.ts`, `sbom.ts` —
verify
- `packages/ingestion/src/pipeline/profile-detectors/frameworks.ts`,
`languages.ts`, `manifests.ts` — verify
- `packages/ingestion/src/pipeline/types.ts` — verify
- `packages/ingestion/src/providers/http-detect.ts` — verify
- `packages/ingestion/src/providers/registry.test.ts`, `registry.ts` —
verify

**Action:** most are likely stale labels. Spot-check any that contain
"TODO", "FIXME", or "in flight" — those are real debt.

### packages/mcp — prompts + tools (W-code markers)

- `packages/mcp/src/prompts/prompts.test.ts` — verify
- `packages/mcp/src/tools/annotations.test.ts` — verify
- `packages/mcp/src/tools/context.ts` — verify
- `packages/mcp/src/tools/dependencies.ts` — verify
- `packages/mcp/src/tools/group-query.ts` — verify
- `packages/mcp/src/tools/license-audit.ts` — verify

### packages/sarif — schema-validation W-code marker

- `packages/sarif/src/schema-validation.test.ts` — verify

### packages/scanners — P1/P2 tiers + W2-I4

- `packages/scanners/src/catalog.ts:107` — "W2-I4: Priority-2 scanners.
These ship alongside P1 but are opt-in via" (exact quote)
- `packages/scanners/src/wrappers/osv-scanner.ts` — W-code mention
- `packages/scanners/src/wrappers/p2-wrappers.test.ts` — W-code mention
- `packages/scanners/src/wrappers/trivy.ts` — W-code mention

**Product fact to preserve:** the P1/P2 split is a real user-facing
feature. Keep "Priority-1" and "Priority-2" as product terminology
(they're documented in scanners/package.json description). Only drop
the W2-I4 label.

### Vendor README — the literal "(to be created in W2-B.2)" smoking gun

- `vendor/stack-graphs-python/README.md:39` — "That evaluator consumes
the vendored `.tsg` as (to be created in W2-B.2)."

**Action:** the evaluator DOES exist at
`packages/ingestion/src/providers/resolution/stack-graphs/`. Rephrase
the README to point at the real path instead of a wave code. No debt;
just a stale pointer.

### Root / infra

- `pnpm-workspace.yaml` — W-code mention (probably a comment)
- `scripts/acceptance.sh` — W-code mentions
- `scripts/smoke-mcp.sh` — W-code mentions

### Commit subject lines (about to be scrubbed via Option A nuke)

- `645c08e "Stream J: Multi-repo retrieval & group queries"` — this was
a real release. Stream J = multi-repo groups. Confirmed shipped.
- `f08c87f "Initial commit: OpenCodeHub MVP + v1.0 roadmap"` — body
says "Apache-2.0 clean-room reimplementation of the GitNexus product
surface". Nuke.
- Several mid-history commits reference "gitnexus" by name in parity
/ cleanroom-gym notes.

## Stream names seen in history (for reference)

| Stream | What it shipped |
|--------|-----------------|
| Stream E | Content-addressed parse cache (`content-cache.ts`, meta sidecar) |
| Stream J | Multi-repo groups (group-query/group-status/group-sync MCP tools) |
| Stream T | SARIF suppressions (`packages/sarif/src/suppressions.ts`) |

## Revisit workflow

When you come back to these:

```bash
# Re-list any wave codes that survived into future commits
git grep -nE 'W[0-9]+[-.][A-Z0-9]+|\bStream [A-Z]\b'

# Re-list any gitnexus references
git grep -ni 'gitnexus'
```

Banned-strings CI check (`scripts/check-banned-strings.sh`) now blocks
wave codes and `gitnexus` from re-entering the tree — so any future
appearance is a regression, not drift.
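
For a quick in-editor or unit-test check, the two `git grep` patterns above
can be mirrored in TypeScript. This is only a sketch: `findBannedStrings` and
`BannedHit` are hypothetical names, and the real gate remains
`scripts/check-banned-strings.sh`, whose exact contents are not shown here.

```typescript
// Hypothetical mirror of the two `git grep` patterns in the revisit workflow.
// Not the actual CI script; names and shapes are assumptions for illustration.
const WAVE_CODE = /W[0-9]+[-.][A-Z0-9]+|\bStream [A-Z]\b/;
const GITNEXUS = /gitnexus/i;

interface BannedHit {
  line: number; // 1-based line number within the scanned text
  text: string; // the offending line, unmodified
}

function findBannedStrings(source: string): BannedHit[] {
  const hits: BannedHit[] = [];
  source.split("\n").forEach((text, i) => {
    // Neither regex uses the `g` flag, so `test` carries no state between lines.
    if (WAVE_CODE.test(text) || GITNEXUS.test(text)) {
      hits.push({ line: i + 1, text });
    }
  });
  return hits;
}
```

Like the shell patterns, this matches on shape alone, so an unrelated token
that happens to look like a wave code would also be flagged.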
49 changes: 49 additions & 0 deletions .erpaval/solutions/architecture-patterns/repomix-is-output-side.md
@@ -0,0 +1,49 @@
---
title: Repomix --compress is output-side only, not an input-side chunker
tags: [repomix, embedder, chunker, tree-sitter, llm]
first_applied: 2026-04-26
repos: [open-code-hub]
---

## The pattern

Repomix (https://github.com/yamadashy/repomix) is tempting as a
replacement for a tree-sitter-based chunker in an embedding pipeline —
it ships `--compress` with ~70% token reduction and supports 16
languages. **Do not use it that way.** Scope it to output-side surfaces
(LLM-context packing, snapshot generation).

## Why

1. **Per-file, not per-symbol.** `--compress` stitches signatures +
class headers + imports into a single text blob per file joined by
`⋮----`. It discards `startLine / endLine / symbolName / nodeType`.
A graph-extraction pipeline that turns parse captures into
Function/Method/Class nodes + CALLS/IMPORTS/EXTENDS edges cannot be
fed from this output.
2. **Tokenizer mismatch.** Token counts use `o200k_base` (GPT-4o). If
your embedder is anything else (BERT, modernbert, e5, voyage-code),
your budget math won't line up.
3. **Determinism gap.** No grammar-sha is exposed, so content-addressed
cache keys `(sha256, grammarSha, pipelineVersion)` lose their
grammar component.
4. **Coverage gaps.** tsx folds into typescript; kotlin is absent.
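
The determinism gap in point 3 can be made concrete with a small sketch. The
field names follow the `(sha256, grammarSha, pipelineVersion)` tuple above;
`CacheKeyParts`, `cacheKey`, and the `"no-grammar"` placeholder are
hypothetical and only illustrate what happens when the grammar component is
unavailable.

```typescript
// Hypothetical cache-key shape; mirrors the tuple named in point 3.
interface CacheKeyParts {
  sha256: string;          // content hash of the source file
  grammarSha?: string;     // hash of the grammar used to parse; absent for repomix output
  pipelineVersion: string; // version of the ingestion pipeline
}

function cacheKey(p: CacheKeyParts): string {
  // Without a grammar sha the middle component degrades to a constant,
  // so a grammar upgrade can no longer invalidate the entry.
  return [p.sha256, p.grammarSha ?? "no-grammar", p.pipelineVersion].join(":");
}
```

With a tree-sitter chunker, the same file parsed under two grammar versions
yields two distinct keys. With repomix in that slot, both parses would map to
the same key, and a stale entry could be served after a grammar change.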

## Where repomix actually shines

- `codehub pack` CLI command — single-file snapshot for agents who want
to drop the whole repo into their context window.
- An MCP `pack_codebase` tool that re-exports the repomix invocation so
agents can produce their own snapshots without knowing the CLI.

## Quick sanity check before substituting repomix for anything

Before planning to delete a chunker / parser in favor of repomix, ask:

- Do downstream consumers need per-symbol boundaries?
- Do they need startLine / endLine on every chunk?
- Do they key caches off grammar shas?
- Do they need tsx, kotlin, or any other first-class language that
  repomix does not cover?

Any **yes** means keep your existing chunker; use repomix only for the
output-side feature.
80 changes: 80 additions & 0 deletions .erpaval/solutions/architecture-patterns/scip-replaces-lsp.md
@@ -0,0 +1,80 @@
---
title: SCIP replaces LSP for code-graph oracle edges
tags: [scip, lsp, ingestion, graph, indexer]
first_applied: 2026-04-26
repos: [open-code-hub]
---

## The pattern

When a code-intelligence system needs compiler-grade call / reference /
heritage edges across many languages, prefer **SCIP** indexers (one-shot
artifact producers) over **LSP** servers (stateful JSON-RPC subprocesses).

SCIP indexers exist for TypeScript, Python, Go, Rust (via
`rust-analyzer scip`), and Java. Each emits a single `.scip` protobuf
file per run. A symbol string encodes
`<scheme> <manager> <package> <version> <descriptor>+` which is
globally unique — cross-repo references work by construction.
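
A minimal sketch of splitting such a symbol string into its leading fields is
shown below. `splitScipSymbol` and `ScipSymbolParts` are hypothetical names.
The sketch assumes no field contains an escaped space (the SCIP grammar allows
them), leaves the descriptor chain unparsed, and returns `null` for
document-local symbols, which start with `local ` and are not globally unique.

```typescript
// Hypothetical helper types; not part of any SCIP library.
interface ScipSymbolParts {
  scheme: string;
  manager: string;
  packageName: string;
  version: string;
  descriptors: string; // raw descriptor chain, left unparsed in this sketch
}

// Split off the text before the first space; null when there is no space.
function takeField(s: string): [string, string] | null {
  const sp = s.indexOf(" ");
  return sp < 0 ? null : [s.slice(0, sp), s.slice(sp + 1)];
}

function splitScipSymbol(symbol: string): ScipSymbolParts | null {
  if (symbol.startsWith("local ")) return null; // document-local symbol
  const a = takeField(symbol);
  if (!a) return null;
  const b = takeField(a[1]);
  if (!b) return null;
  const c = takeField(b[1]);
  if (!c) return null;
  const d = takeField(c[1]);
  if (!d || d[1].length === 0) return null; // at least one descriptor is required
  return {
    scheme: a[0],
    manager: b[0],
    packageName: c[0],
    version: d[0],
    descriptors: d[1],
  };
}
```

Keeping the descriptor chain as one opaque string is enough for cross-repo
joins, since equality on the full symbol string is what links a reference in
one repo to a definition in another.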

## The shape

```
source tree ─► per-lang SCIP indexer (×5) ─► .opencodehub/scip/<lang>.scip
parseScipIndex(Uint8Array) -> ScipIndex
deriveIndex(index) -> {symbols, edges}
materialize(edges) -> {node_metrics,
reach_forward,
reach_backward,
scc}
CodeRelation(confidence=1.0,
reason="scip:<indexer>@<v>")
```

## Why this beats the LSP approach

- **No daemon.** SCIP produces an artifact; no stdio JSON-RPC, no
request correlation, no warm-up, no timeout tuning.
- **Dependency surface shrinks.** No pyright / tsserver / gopls
  language servers in node_modules. Rust still needs `rust-analyzer`,
  but only as a one-shot `rust-analyzer scip` run, not a resident server.
- **Cross-repo for free.** SCIP symbol strings are globally unique;
merging two `.scip` files is just `concat documents[] + concat
external_symbols[]` at the protobuf level.
- **Incremental caching is trivial.** One mtime check per language; no
need to track per-symbol queries.
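
The cross-repo merge described above can be sketched on decoded indexes. The
types here are hypothetical and cut down to the two repeated fields named in
the bullet; a real `ScipIndex` carries more (metadata, occurrences, and so on).

```typescript
// Hypothetical, minimal decoded shapes; only the fields the merge touches.
interface ScipDocument {
  relativePath: string;
}
interface ScipSymbolInfo {
  symbol: string;
}
interface ScipIndexLike {
  documents: ScipDocument[];
  externalSymbols: ScipSymbolInfo[];
}

// Merge two indexes by concatenating their repeated fields.
// No de-duplication is attempted; that is left to later phases.
function mergeIndexes(a: ScipIndexLike, b: ScipIndexLike): ScipIndexLike {
  return {
    documents: [...a.documents, ...b.documents],
    externalSymbols: [...a.externalSymbols, ...b.externalSymbols],
  };
}
```

The same result falls out at the byte level, because protobuf defines the
concatenation of two serialized messages of one type as a merge in which
repeated fields are appended. Singular fields such as index metadata take the
last value in that case, so a merged file should not be relied on for
per-repo metadata.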

## The contract boundary worth preserving

The `confidence=1.0` + `reason startsWith "<oracle>:"` contract that
downstream consumers (`confidence-demote`, `summarize`,
`mcp/confidence`, `cli/analyze` auto-cap) hinge on is load-bearing.
When migrating from LSP to SCIP, keep the same confidence ceiling and
switch only the reason-prefix list and the phase-name that produces
the edges. Downstream code changes are then one-line (new constant).
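
The contract can be sketched as follows. `CodeRelation` is reduced to the two
fields the contract mentions, and `ORACLE_REASON_PREFIXES`, `scipRelation`,
and `isOracleEdge` are hypothetical names; the actual downstream consumers may
structure this differently.

```typescript
// Reduced, hypothetical view of an edge: only the fields the contract covers.
interface CodeRelation {
  confidence: number;
  reason: string;
}

// Assumed to be the single constant that changes in the LSP-to-SCIP migration.
const ORACLE_REASON_PREFIXES: readonly string[] = ["scip:"];

// Build an edge in the shape the pipeline diagram shows:
// confidence=1.0, reason="scip:<indexer>@<v>".
function scipRelation(indexer: string, version: string): CodeRelation {
  return { confidence: 1.0, reason: `scip:${indexer}@${version}` };
}

// Downstream check: full confidence AND a recognized oracle prefix.
function isOracleEdge(rel: CodeRelation): boolean {
  return (
    rel.confidence === 1.0 &&
    ORACLE_REASON_PREFIXES.some((prefix) => rel.reason.startsWith(prefix))
  );
}
```

Because consumers test only the ceiling and the prefix, swapping the prefix
list is enough to retarget them, which is what keeps the downstream change to
one line.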

## Lingering gotchas

- **scip-java / rust-analyzer run build scripts** — gate behind an
explicit `allowBuildScripts=true` opt-in for untrusted workspaces.
- **Relationship edges (IMPLEMENTS) are in SymbolInformation, not in
Occurrence** — a minimal protobuf reader that only decodes
Occurrence will not surface them. When we need real IMPLEMENTS
semantics, extend the parser to decode `SymbolInformation.relationships`.
- **SCIP range encoding has two shapes** — 4-int
`[startLine, startChar, endLine, endChar]` OR 3-int
`[line, startChar, endChar]` when start/end share a line. Normalize
at decode time.
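
The range normalization in the last gotcha can be sketched as below.
`NormalizedRange` and `normalizeRange` are hypothetical names; the two input
shapes are exactly the 4-int and 3-int forms described above.

```typescript
// Hypothetical normalized shape: always four explicit coordinates.
interface NormalizedRange {
  startLine: number;
  startChar: number;
  endLine: number;
  endChar: number;
}

function normalizeRange(raw: number[]): NormalizedRange {
  if (raw.length === 4) {
    // 4-int form: [startLine, startChar, endLine, endChar]
    const [startLine, startChar, endLine, endChar] = raw as [number, number, number, number];
    return { startLine, startChar, endLine, endChar };
  }
  if (raw.length === 3) {
    // 3-int form: [line, startChar, endChar]; start and end share the line
    const [line, startChar, endChar] = raw as [number, number, number];
    return { startLine: line, startChar, endLine: line, endChar };
  }
  throw new Error(`unexpected SCIP range length: ${raw.length}`);
}
```

Normalizing once at decode time means every later phase can compare and sort
ranges without branching on the encoding.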

## When NOT to use this

- Small toy projects where tree-sitter heuristic edges are good enough.
- Languages without a SCIP indexer (C#, C, C++, Ruby, Kotlin, Swift,
PHP, Dart — as of 2026-04-26). Keep tree-sitter for those.