Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .erpaval/INDEX.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@ development sessions. Solutions are reusable; specs are per-feature.
- [Replace raw-SQL escape hatches with typed finders on the storage interface](solutions/architecture-patterns/typed-finders-replace-raw-sql-in-consumers.md) — 108 raw-SQL sites collapse into 15 named finders. Adapters internalize dialect; consumers stay backend-agnostic. Liskov-clean parity harness via public-method rebuilder.
- [Parallel Act subagents on a shared git tree — interleaving + cherry-pick discipline](solutions/best-practices/parallel-act-subagents-with-shared-git-tree.md) — verify branch state, spawn on non-overlapping packages, watch for stale dist + phantom test counts, watch the test-fixup tail.
- [Squash-merge masks pre-existing repo-wide debt](solutions/best-practices/squash-merge-masks-pre-existing-debt.md) — first action on a fresh branch from main is `mise run check` BEFORE starting work; lint rules / transitive deps / cross-package test assertions drift across squash boundaries even when per-commit gating was green inside the prior PR.
- [No spec-coordinate leakage into source](solutions/best-practices/no-spec-coordinate-leakage-into-source.md) — ERPAVal `AC-*`, `M-*`, `W-*`, `CL-*` prefixes belong in commits, PR bodies, ADR refs sections — NOT in JSDoc, inline comments, CLI flag help, MCP tool descriptions, or test names. Sweep `rg -n "AC-[A-Z]-[0-9]" packages/` before every PR-open; LLM clients pick up the leakage and start citing it back.

## Specs

Expand Down
12 changes: 7 additions & 5 deletions .erpaval/debt.md
Original file line number Diff line number Diff line change
Expand Up @@ -288,11 +288,13 @@ architecture pages + mermaid rendering.
`sarif`, `scip-ingest`, `search` — all have code-level docs but no
README.

2. **`.gitmodules` thiserror pin comment.** The sweep reconciled
`packages/gym/corpus/rust/README.md` and `corpus/repos/README.md` on
`thiserror@2.0.17`. `.gitmodules` line 19 still says
`pin: v2.0.0 tag`. One-line fix — the subagent's write was denied,
deferred to the user.
2. **`.gitmodules` thiserror pin comment.** **Status: CLOSED-STALE** —
`git show HEAD:.gitmodules` returns "fatal: path .gitmodules does
not exist in HEAD"; the file was removed when `packages/gym` moved
to `opencodehub-testbed` (commit 378f79f). The submodule set lives
in the testbed repo now; any thiserror-pin reconciliation belongs
over there, not here. Closed as stale by AC-C-7 (Track C, v1
finalize, 2026-05-09).

3. **Dead eval-harness fallback.** `packages/eval/src/opencodehub_eval/
test_parametrized.py:167-175` has tool-still-unregistered fallback
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
---
name: ERPAVal spec coordinates (CL-*, AC-*, M-*, W-*) MUST NOT leak into production source or comments
description: Specifier prefixes from EARS specs and the ERPAVal classifier vocabulary are session-local bookkeeping; production code, comments, JSDoc, and CLI/MCP option descriptions must not reference them
type: feedback
---

ERPAVal specs use a structured vocabulary — `AC-A-1`, `AC-C-3`, `M3-1`,
`W-A-2`, `E-C-3`, `CL-VALIDATE`, `S-A-2`, `architecture-revised.md
§AC-A-7` — to coordinate work across the orchestrator and Act
subagents. These prefixes are useful inside ERPAVal artifacts:
`.erpaval/specs/`, `.erpaval/sessions/<id>/`, ADR validation tables,
commit messages, PR bodies. They are NOT useful in production source.

Observed leakage on Track C cleanup (2026-05-09): the orchestrator and
multiple Act subagents seeded `AC-C-3:`, `AC-C-2:`, `AC-A-1:`, `AC-A-6c:`,
`AC-A-9:` into JSDoc, inline comments, MCP tool option descriptions
(visible to every MCP client), and CLI flag help (visible to every
`codehub query --help` user). Counts after Wave C.1 + my Wave C.2 first
pass: ~45 source references to AC-A-* (legacy from Track A — already
on main via PR #71), 14 source references to AC-C-* introduced this
session before sweep.

**Why:** session-local coordinates rot. Six months after the AC graduates
into a release, the spec packet is in `.erpaval/sessions/session-<hex>/`
which is gitignored — readers of the source can't follow the citation.
The MCP option description "Bypass the embedder fingerprint check
(AC-C-3)." leaks ERPAVal vocabulary into the MCP tool surface, which
LLM clients then pick up and start citing back; the leakage compounds.

**How to apply:**

- **Source comments / JSDoc:** name the underlying invariant, behavior,
or contract. "Refuse when the persisted embedder modelId differs from
the current one" is forever; "AC-C-3 refusal" is until the AC merges
and then forgets itself.
- **Variable names, function names, type fields:** never carry the prefix.
`forceBackendMismatch` (good) not `acC3ForceBackendMismatch` (never).
- **CLI help / MCP descriptions / tool descriptions:** describe the
user-visible contract. The user does not know what an AC is. Strip.
- **ADR text:** ADRs MAY cite AC-* coordinates because the ADR is the
permanent home of the decision rationale and links to the spec packet.
But cite once, in a "References" section, not inline throughout the
decision body.
- **Commit messages and PR descriptions:** AC citations are great here.
Reviewers grep for them; release-please may include them in the
changelog.
- **Test names and fixture names:** prefer the behavior under test
("graphHash parity: medium-with-empty-keywords ([] vs absent)") over
the AC ("AC-C-2: graphHash..."). The behavior survives renames; AC
numbers don't.
- **Sweep before commit.** Run `rg -n "AC-[A-Z]-[0-9]" packages/`
against your branch before PR-open. Anything that hits is a
candidate for rephrase. If the comment NEEDS to cite the AC, use a
short reference at the end like "(AC-C-5)" rather than leading with
it.
- **The test fakes are the trap.** When a Wave subagent edits a test
fake, it tends to add `// AC-XXX: stubs ...` because it's writing
the comment WITH the AC packet open in front of it. Sweep test files
the same way as source files.

**Why it's worth a hook:** the leakage is mechanical and silent. A
PostToolUse hook on Edit/Write that scans the diff for `^[\\s*/]*AC-[A-Z]-[0-9]+`
in `packages/**` (excluding `.erpaval/`, `.md` ADRs, and commit-message
files) and either blocks the write or appends a stderr advisory would
catch every recurrence at the source. Until that hook exists, the
discipline is on the orchestrator + reviewer.

**Carry-forward debt:** Track A merged with extensive `AC-A-*`
references throughout `packages/storage/`, `packages/mcp/`, and
`packages/cli/`. They are on main and any Track-after-A branch picks
them up. A standalone `chore(repo): scrub spec coordinates from
source` cleanup PR is the right venue — not Track C, not Track D.
That PR can ship in its own session because the cleanup is mechanical
and reviewable in one window.
106 changes: 106 additions & 0 deletions docs/adr/0014-scip-references-and-embedder-fingerprint.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,106 @@
# ADR 0014 — SCIP REFERENCES + TYPE_OF emission and embedder-fingerprint refusal

**Status**: Accepted
**Date**: 2026-05-09
**Supersedes**: none
**Superseded by**: none

## Context

Two unrelated holes in v1.0 finalize, both routing through a shared one-time graphHash content delta. They land in a single ADR per spec.md§Q7 because the fixture-regeneration cost is paid once.

### Hole A — Embedder rebuild-on-switch silent corruption (AC-C-3)

The `embeddings` table on disk is populated by ONE specific embedder at index time. The currently-shipped store_meta schema (`packages/storage/src/schema-ddl.ts:172-183`) records `schema_version, last_commit, indexed_at, node_count, edge_count, stats_json, cache_hit_ratio, cache_size_bytes, last_compaction` — but NOT which embedder produced the vectors.

Failure mode: an operator runs `codehub analyze` with the local ONNX `gte-modernbert-base/fp32` embedder, then later runs `codehub query` with `CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT` set to a SageMaker `gte-modernbert-base` deployment. Both embedders publish 768-dim vectors, so the dim guard at `bindParam(stmt, idx, vectorBuffer)` does not fire. The vector subspaces are NOT identical — different fp32/quantization, different post-processing pipelines, different L2-normalisation cutoffs — so cosine-similarity retrieval silently misranks.

There is no test suite that catches this; there is no error envelope at the query path.

### Hole B — SCIP REFERENCES + TYPE_OF unwired (AC-C-5)

`packages/scip-ingest/src/derive.ts` already correctly:
- Emits CALLS edges via `deriveEdges` for function-like SCIP occurrences (`derive.ts:128-152`).
- Collects `is_implementation → IMPLEMENTS` and `is_type_definition → TYPE_OF` rows into `derived.relations` via `collectRels` (`derive.ts:184-199`).

But:
- `packages/ingestion/src/pipeline/phases/scip-index.ts:245-252` consumes `derived.edges` (CALLS) and ignores `derived.relations` entirely. IMPLEMENTS and TYPE_OF reach the graph zero times even though the data is parsed.
- `derive.ts:136` filters non-call SCIP occurrences out of `deriveEdges` with `if (!isFunctionLike(occ.symbol)) continue;` — so non-call references (an identifier reading a class field, importing a type, accessing a constant) never produce REFERENCES edges either.
- `RelationType` at `packages/core-types/src/edges.ts:3-27` lists `REFERENCES` (position 21) but does NOT list `TYPE_OF`. The append-only rule at `edges.ts:29-32` requires new relation types to be appended at the end — `TYPE_OF` lands at position 25.

The combined effect: every existing OCH index understates the call/reference graph by ~3 edge classes, and the `RelationType` union is missing one of the two heritage relations SCIP exposes.

## Decision

### A — Persist embedder modelId; refuse mismatched queries

1. Add `embedder_model_id TEXT` column to `store_meta` (DuckDB) and the matching `STRING` field to the `StoreMeta` graph-db NODE TABLE.
2. `Store.setMeta(meta)` writes the currently-active embedder's `Embedder.modelId`.
3. `Store.getMeta()` returns the persisted value via the new `StoreMeta.embedderModelId?: string` field.
4. At query time (cli `runQuery`, MCP `runQuery`), read `meta.embedderModelId`, compare to `embedder.modelId`:
- Equal → proceed.
- Persisted is `undefined` (pre-AC-C-3 store) → proceed; the operator is trusted to know what they indexed.
- Mismatch + force flag set → proceed.
- Mismatch + no force flag → refuse. CLI prints to stderr and `process.exit(2)` per E-C-3. MCP returns a `EMBEDDER_MISMATCH` envelope via `toolError` per the same hint string.
5. Frozen remediation hint string lives in `packages/embedder/src/fingerprint.ts` as `EMBEDDER_MISMATCH_HINT`. Both surfaces import it so the message can never drift.
6. CLI `--force-backend-mismatch` flag and MCP `force_backend_mismatch` tool input give the operator an override path. Default `false`.

The `assertEmbedderCompatible(persistedModelId, currentModelId, force)` helper lives in `@opencodehub/embedder` so cli + mcp share one comparator.

### B — Emit IMPLEMENTS, TYPE_OF, REFERENCES from SCIP

1. Append `TYPE_OF` at position 25 of `RelationType` and `RELATION_TYPES` per the append-only rule. The schema-shape stays append-stable; `graphHash` for content that does NOT include TYPE_OF stays byte-identical.
2. Widen `derive.ts:136` to also emit a `REFERENCES` `DerivedEdge` for non-call SCIP occurrences whose symbol has a DEFINITION elsewhere AND is not an `SCIP_ROLE_IMPORT`-only occurrence.
3. Add a sibling `emitRelations` call in `scip-index.ts` that consumes `derived.relations` and writes IMPLEMENTS + TYPE_OF graph edges using the same caller→callee join shape as `emitEdges`. Both joins use `buildSymbolDefIndex` for callee resolution, per the `scip-callee-definition-site` lesson; both add `+1` at the SCIP→OCH boundary, per the `scip-0-indexed-vs-graph-1-indexed` lesson.
4. Regenerate `packages/ingestion/src/pipeline/incremental-determinism.test.ts` fixtures one time. Document this in the commit message as the expected one-time content delta.
5. Extend `packages/storage/src/graph-hash-parity.test.ts` with a fixture variant exercising IMPLEMENTS + TYPE_OF + REFERENCES; assert cross-adapter parity holds.

## Consequences

### graphHash impact

- **Hole A (embedder fingerprint)** is graphHash-NEUTRAL. `graphHash` (`packages/core-types/src/graph-hash.ts:44-69`) hashes ONLY `(nodes, edges)` — `store_meta` is not part of the input. Adding a `store_meta` column does not change any per-commit hash.
- **Hole B (SCIP edges)** is graphHash-CONTENT-DELTA. The first index run after merge produces additional edges (REFERENCES + IMPLEMENTS + TYPE_OF) that did not previously exist. Every existing OCH index will yield a different graphHash on next `codehub analyze`. This is documented as a v1.0 minor bump (schema-shape preserved via append-only; only content changes).

### Cross-track sequencing

This ADR is shared with AC-C-3 (Hole A) and AC-C-5 (Hole B). They land in the same Track C PR; the fixture regen runs once for both.

### Migration cost

For Hole A, existing stores have `embedder_model_id IS NULL`. On next `Store.open` an `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` runs cheaply; the `setMeta` call after the next `codehub analyze` populates the value. Until then, the query-time refusal sees `meta.embedderModelId === undefined` and passes — no false-positive refusals on existing stores.

For Hole B, every existing store needs a `codehub analyze --force` to pick up the new edges. That is the same posture as every prior schema-content delta in OCH (e.g. the M3+M6 IGraphStore split landed under the same minor-bump rule).

### Forward compatibility

`EMBEDDER_MISMATCH` is added to `ErrorCode` at `packages/mcp/src/error-envelope.ts`. Existing clients that do not enumerate the union ignore it; existing clients that DO enumerate the union pick up the new code on rebuild. No on-wire breaking change.

## Alternatives considered

### Hole A
- **Embed dim alone, not modelId.** Dim collisions (768 = 768) are exactly the failure mode this ADR addresses. Rejected.
- **Hash the first vector against a known canary string.** More complex, more storage, indistinguishable from "different post-processing pipelines" cases that produce identical canaries by accident. Rejected.
- **Force re-index on EVERY embedder env-var change.** Too aggressive for SageMaker→ONNX fallbacks during dev. The override flag exists for that case.

### Hole B
- **Insert TYPE_OF mid-union next to IMPLEMENTS.** Violates W-A-2 + the `edges.ts:29-32` append-only comment. Would break every existing graphHash on every existing OCH index, even for content with no IMPLEMENTS / TYPE_OF / REFERENCES. Rejected.
- **Split AC-C-5 into a sibling PR after Track C.** Considered in `pr-split-analysis.md` Option (b). Rejected because the fixture-regeneration cost would be paid twice (once for the v1.0 finalize hash bump that ships SCIP REFERENCES, once for the next ADR adding TYPE_OF). Bundling them is cheaper.

## Validation

- `mise run check` exits 0 on the Track C branch.
- `pnpm --filter @opencodehub/storage test` parity green (DuckDb leg + skip-clean GraphDb leg on dev box without `@ladybugdb/core` binding).
- `pnpm --filter @opencodehub/embedder test` covers `assertEmbedderCompatible` 5 cases.
- `pnpm --filter @opencodehub/scip-ingest test` covers REFERENCES emission + IMPLEMENTS/TYPE_OF collectRels.
- `pnpm --filter @opencodehub/ingestion test` regenerates incremental-determinism fixtures.
- ROADMAP constraint U1 (graphHash byte-identity per commit) holds for all content that does not exercise the new edge kinds; for content that does, the new fixture variant proves cross-adapter parity.

## References

- `.erpaval/specs/006-v1-finalize/spec.md§AC-C-3, §AC-C-5, §E-C-3, §E-C-4, §W-A-2`
- `.erpaval/sessions/session-33f24f/research-detectsecrets-scip.yaml` (SCIP role enum + Relationship message)
- `.erpaval/solutions/architecture-patterns/scip-callee-definition-site.md`
- `.erpaval/solutions/conventions/scip-0-indexed-vs-graph-1-indexed.md`
- `docs/adr/0011-graph-db-backend.md` (M3+M6 IGraphStore precedent)
- `docs/adr/0013-m7-default-flip-and-abstraction.md` (M7 LadybugDB default flip)
Loading
Loading