OCH v1.0 — M5 deterministic code-packs + M6 cross-repo federation#68

Merged: theagenticguy merged 21 commits into main from feat/v1-m5-m6 on May 8, 2026

@theagenticguy
Owner

Summary

  • M5 (Deterministic code-packs) — ships @opencodehub/pack, the codehub code-pack CLI subcommand, the pack_codebase MCP tool routed through pack by default, and the codehub-code-pack skill. Output is a 9-item BOM (manifest + skeleton + file-tree + deps + ast-chunks + xrefs + optional embeddings.parquet + findings + licenses+readme) byte-identical given (commit, tokenizer, budget, chonkie_version, duckdb_version). Locked into CI by packages/pack/src/pack-determinism.test.ts (5 variants) + scripts/pack-determinism-audit.sh (acceptance gate 16).
  • M6 (Cross-repo federation) — first-class RepoNode (9 attrs) in the graph; structured AMBIGUOUS_REPO with choices[]/total_matches + repo_uri alias; group_cross_repo_links MCP tool + cross-repo links in codehub-document --group; AGENTS.md/CLAUDE.md cross-refs to ADR 0012 + worked retry example; ADR 0012 (393 lines) captures the rationale + graphHash invariant W-M6-1.
  • 18 commits ahead of main, 1950/1951 tests passing (1 pre-existing skip), mise run check green, banned-strings green, AGENTS↔CLAUDE byte-identical sync verified.

Spec: .erpaval/specs/005-m5-m6/spec.md (12 ACs delivered, 4 spec drifts resolved inline).

What landed

M5 — Wave 1+2+3

| AC | Commit | What |
|---|---|---|
| AC-M5-0 | c0890fa (pre) | pack added to commitlint scope-enum |
| AC-M5-1 | 1775500 (pre) | @opencodehub/pack workspace scaffold |
| AC-M5-2 | 4e5d6f8 (pre) | Lift PageRank from scip-ingest → analysis |
| AC-M5-3 | bc5fd99 (pre) | BOM manifest + packHash helper (RFC 8785 canonical JSON) |
| Drift 1 | 77f37c3 | Switch chonkie dep → @chonkiejs/core@^0.0.9 (npm chonkie-ts is a squatter) |
| AC-M5-3a | 018c253 | IGraphStore.listNodes(opts?: {kinds, limit, offset}) on DuckStore + GraphDbStore |
| Drift 4 | 9d8d570 | Lift classifyDependencies mcp → analysis (cycle-break) |
| AC-M5-4 | 072a062 | BOM 2-4: skeleton.ts (PageRank-ranked symbols), file-tree.ts (framework-labelled), deps.ts |
| AC-M5-5 | 0c17be1 | BOM 5-9 + generatePack assembly: ast-chunker.ts (chonkie + line-split fallback), xrefs.ts, findings.ts (SARIF level enum + suppressions), licenses.ts, readme.ts |
| AC-M5-6 | 5c118ac | Parquet embeddings sidecar via DuckDB COPY+ZSTD (S-M5-3 absent-when-empty) |
| AC-M5-7 | d1aa08d | codehub code-pack CLI + pack_codebase MCP routes through @opencodehub/pack (engine=pack default; engine=repomix stays opt-in, with its removal deferred to M7) |
| AC-M5-8 | 1f51300 | Byte-identity determinism test suite + audit script + acceptance.sh gate 16 |
| AC-M5-9 | e043016 | codehub-code-pack skill + references/determinism-contract.md + opencodehub-guide cross-link |

M6 — Wave 1+2+3

| AC | Commit | What |
|---|---|---|
| AC-M6-1 | 9ee6a96 (pre) | RepoNode first-class in graph (9 attrs; appended to NodeKind union to preserve graphHash) |
| AC-M6-2 | 26e507b (pre) | Structured AMBIGUOUS_REPO with choices[] + total_matches + repo_uri alias |
| AC-M6-3 (reframed) | 86e295b (pre) | group_cross_repo_links MCP tool + v2 docmeta cross-reference spec |
| AC-M6-4 | f9fdde2 (pre) | group_* tools emit repo_uri additively |
| AC-M6-5 | 4d8c5a9 | ADR 0012 (393 lines, mirrors 0011) + AGENTS.md/CLAUDE.md cross-refs + worked AMBIGUOUS_REPO retry example + synthetic 2-repo fixture for codehub-contract-map quickcheck |

Spec drifts resolved inline

  1. chonkie package mismatch — wave-1 wired chonkie@^0.3.0 (chonkie-inc-owned but undocumented). Canonical TS port is @chonkiejs/core@^0.0.9 per the chonkie-inc/chonkiejs README. Spec 005 amended in the swap commit.
  2. IGraphStore.listNodes() did not exist — spec called for it; implemented as a sub-AC on DuckStore + GraphDbStore. Cleaner long-term API than scattering raw store.query SQL across packages/pack/.
  3. AGENTS.md choices[] already shipped — reframed AC-M6-5 to add cross-references to ADR 0012, RepoNode, group_cross_repo_links + worked retry example.
  4. classifyDependencies cycle: pack cannot import from mcp (mcp consumes pack via pack_codebase). Lifted the pure helper into @opencodehub/analysis as a 30-LOC prep commit.

Roadmap status post-merge

M1 ✅ → M2 ✅ → (M3 ✅ ∥ M4 ✅) → (M5 ✅ ∥ M6 ✅) → M7

M7 (LadybugDB default + drop sql for cypher-only) is the only remaining v1.0 milestone.

Test plan

  • [x] pnpm install --frozen-lockfile clean
  • [x] pnpm -r build clean
  • [x] mise run check exits 0 (lint + typecheck + test + banned-strings)
  • [x] 1950/1951 tests pass (1 pre-existing embedder skip)
  • [x] bash scripts/check-banned-strings.sh PASS
  • [x] bash scripts/pack-determinism-audit.sh runs (PASS or SKIP both acceptable)
  • [x] AGENTS.md ↔ CLAUDE.md AMBIGUOUS_REPO byte-identical
  • [ ] codehub code-pack <repo> produces a 9-item BOM directory at <repo>/.codehub/packs/<packHash>/ (requires DuckStore on a real repo; verify post-merge)
  • [ ] Two consecutive codehub code-pack runs with same args produce byte-identical output (E-M5-3)
  • [ ] pack_codebase MCP tool engine=pack (default) route exercised end-to-end via Claude Code

🤖 Generated with Claude Code

Comment thread packages/pack/src/embeddings-sidecar.ts Fixed
theagenticguy added a commit that referenced this pull request May 8, 2026
Five durable lessons extracted from feat/v1-m5-m6 (PR #68, M5 + M6 complete):

- conventions/npm-package-canonicality-via-upstream-readme — chonkie-ts
  was a 2.6 kB squatter; @chonkiejs/core was canonical per upstream README.
- architecture-patterns/storage-list-nodes-over-scattered-sql — typed
  IGraphStore.listNodes() collapses N raw-SQL call sites; cross-adapter
  parity test catches schema drift.
- architecture-patterns/lift-pure-functions-to-shared-dep-to-break-cycles
  — classifyDependencies lifted into @opencodehub/analysis (LCA dep)
  averted mcp → pack → mcp cycle.
- best-practices/worktree-isolation-pwd-pin-and-biome-exclusion — pin
  pwd at task start; biome v2 traverses gitignored worktrees, scope to
  packages/ or add experimentalScannerIgnores.
- best-practices/spec-drift-amend-inline-with-implementing-commit —
  amend spec wording in the same commit that implements the resolution.

INDEX.md updated with five new entries under Solutions.
Prepares commitlint.config.mjs for the M5 `@opencodehub/pack`
workspace. No source code yet — this lands first so subsequent
`feat(pack): ...` commits pass the commit-msg hook.

Also adds the M5 + M6 EARS spec at `.erpaval/specs/005-m5-m6/spec.md`
describing the 14 acceptance criteria, wave structure, and 10-point
roadmap-constraint cross-check. See the spec for the full M5/M6 plan.

Refs: .erpaval/ROADMAP.md §M5 + §M6
Greenfield package for the M5 9-item code-pack BOM. This commit
wires the package (package.json, tsconfig, public entry with stubbed
generatePack, type surface) and updates the root tsconfig references.

The generatePack body lands in AC-M5-3 (manifest + pack_hash) and
AC-M5-4+ (BOM body implementations). AC-M5-1's job is to make the
empty-but-wired package compile, test, and lint clean so subsequent
ACs can parallel-implement.

Refs: .erpaval/specs/005-m5-m6/spec.md AC-M5-1
Move pageRank, buildAdjacency, and the Adjacency interface from
packages/scip-ingest/src/materialize.ts (where it was dead code
stored into BlastMetrics.pagerank with zero downstream consumers)
to packages/analysis/src/page-rank.ts, where it becomes a
request-time kernel consumed by AC-M5-4's skeleton BOM item.

- Preserve fixed-iteration + fixed-damping semantics byte-for-byte
- Rename pagerank -> pageRank (camelCase, analysis convention)
- Make buildAdjacency generic over EdgeLike instead of DerivedEdge
- Add determinism snapshot test (Float64Array hex) for a 10-node fixture
- Remove BlastMetrics.pagerank field and the L231 call site
- scip-ingest's SCC/reach code stays in materialize.ts

Refs: .erpaval/specs/005-m5-m6/spec.md AC-M5-2, E-M5-5, W-M5-3
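
The fixed-iteration, fixed-damping behaviour described in this commit can be illustrated with a minimal sketch. The `EdgeLike` and `Adjacency` shapes below are assumptions made for the example, as is the uniform handling of dangling nodes; the real kernel in `packages/analysis/src/page-rank.ts` may differ in all of these details.

```typescript
// Hypothetical shapes: the real EdgeLike/Adjacency types may differ.
interface EdgeLike {
  from: string;
  to: string;
}

interface Adjacency {
  nodes: string[]; // index i in the rank vector corresponds to nodes[i]
  out: number[][]; // out[i] lists the target indices of node i
}

function buildAdjacency(edges: readonly EdgeLike[]): Adjacency {
  // Sort ids so index assignment does not depend on edge order.
  const nodes = [...new Set(edges.flatMap((e) => [e.from, e.to]))].sort();
  const index = new Map(nodes.map((id, i) => [id, i] as const));
  const out: number[][] = nodes.map(() => []);
  for (const e of edges) out[index.get(e.from)!].push(index.get(e.to)!);
  return { nodes, out };
}

// Runs exactly `iterations` rounds; there is no tolerance-based early exit,
// so the same input always performs the same floating-point operations.
function pageRank(adj: Adjacency, damping: number, iterations: number): Float64Array {
  const n = adj.nodes.length;
  let rank = new Float64Array(n).fill(n === 0 ? 0 : 1 / n);
  for (let it = 0; it < iterations; it++) {
    const next = new Float64Array(n).fill((1 - damping) / n);
    let dangling = 0;
    for (let i = 0; i < n; i++) {
      const targets = adj.out[i];
      if (targets.length === 0) {
        dangling += rank[i];
        continue;
      }
      const share = (damping * rank[i]) / targets.length;
      for (const t of targets) next[t] += share;
    }
    // Assumption: mass from nodes without out-edges is spread uniformly.
    const spread = (damping * dangling) / n;
    for (let i = 0; i < n; i++) next[i] += spread;
    rank = next;
  }
  return rank;
}

const adjacency = buildAdjacency([
  { from: "a", to: "b" },
  { from: "b", to: "c" },
  { from: "c", to: "a" },
]);
const scores = pageRank(adjacency, 0.85, 50);
```

Because the loop count is fixed, two runs on the same edge list perform identical arithmetic, which is what makes a byte-level snapshot of the `Float64Array` a plausible determinism test.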
Add Repo as a NodeKind — append-only to preserve graphHash byte
identity for existing graphs. RepoNode carries 9 attributes
(originUrl, repoUri, defaultBranch, commitSha, indexTime, group,
visibility, indexer, languageStats) synthesizing Sourcegraph URI
+ SCIP Metadata.toolInfo.

- Append to NodeKind + GraphNode at end of union
- Add Repo DDL to both schema-ddl.ts (DuckDB) and graphdb-schema.ts
  (graph-db). The "JSON-through" claim in the packet was checked and
  found false: the polymorphic nodes table uses per-field columns,
  so we added 9 new TEXT columns (append-only)
- New ingestion phase packages/ingestion/src/pipeline/phases/repo-node.ts
  probes git origin, defaults to local:<hash> on no-remote.
  indexTime pinned to %cI HEAD commit timestamp (not wall clock) so
  W-M6-1 determinism holds without excluding the field from graphHash
- graph-hash-parity tests: existing small/medium/large fixtures
  unchanged; new repo-node + repo-null fixtures round-trip parity
  across both stores
- duckdb-adapter + graphdb-roundtrip tests extended with repo write
  + round-trip coverage
- Does NOT introduce Repo edge kinds (deferred)
- Does NOT backfill existing graphs

Refs: .erpaval/specs/005-m5-m6/spec.md AC-M6-1, E-M6-1, S-M6-1, W-M6-1
…(AC-M6-2)

Extend the existing AMBIGUOUS_REPO sentinel with a structured payload on
`structuredContent.error`: error_code, jsonrpc_code, choices[] (capped at
10), total_matches, hint. Choices carry { repo_uri, default_branch, group }
so a calling agent can retry deterministically with one of them; when
total_matches > choices.length, the caller knows the list was truncated.
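
As a rough illustration of how a calling agent might consume this payload, the sketch below picks a retry target from `choices[]` and reports truncation. The field names follow the commit message; the TypeScript types, the `pickRetryTarget` helper, and the idea of preferring a group are assumptions for the example, not part of the shipped API.

```typescript
// Field names come from the commit message; exact types are assumed.
interface RepoChoice {
  repo_uri: string;
  default_branch: string;
  group: string | null;
}

interface AmbiguousRepoError {
  error_code: "AMBIGUOUS_REPO";
  jsonrpc_code: number;
  choices: RepoChoice[]; // capped at 10 by the server
  total_matches: number;
  hint: string;
}

// Hypothetical helper: choose which repo_uri to retry with.
function pickRetryTarget(
  err: AmbiguousRepoError,
  preferredGroup?: string,
): { repo_uri: string; truncated: boolean } | null {
  if (err.choices.length === 0) return null;
  const preferred =
    preferredGroup === undefined
      ? undefined
      : err.choices.find((c) => c.group === preferredGroup);
  const chosen = preferred ?? err.choices[0];
  return {
    repo_uri: chosen.repo_uri,
    // More matches than listed choices means the list was cut off.
    truncated: err.total_matches > err.choices.length,
  };
}
```

A caller would then re-issue the original tool call with the returned `repo_uri`, and could ask the user to narrow the query when `truncated` is true.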

Also adds `repo_uri` as an accepted alias for the `repo` arg on every
per-repo MCP tool (~20 tools spread a shared `repoArgShape` helper from
tools/shared.ts). `repo_uri` normalizes https/http/git@ protocol, trailing
`.git`, and host case, and falls back to `local:<sha256(path)[:12]>` when
the registry name is not URI-shaped. When both `repo` and `repo_uri` are
provided, `repo_uri` wins at the resolver.
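
The normalization rules above can be sketched as follows. This is an illustration only: the `host/path` output shape and the `normalizeRepoUri` name are assumptions, and the real `deriveRepoUri` in repo-resolver.ts may produce a different canonical form.

```typescript
import { createHash } from "node:crypto";

// Assumption: the canonical form is "host/owner/repo" with a lowercased host.
function normalizeRepoUri(raw: string, localPath: string): string {
  const trimmed = raw.trim();
  // https://host/path or http://host/path
  let match = /^https?:\/\/([^/]+)\/(.+)$/i.exec(trimmed);
  // git@host:path
  if (match === null) match = /^git@([^:]+):(.+)$/i.exec(trimmed);
  if (match === null) {
    // Not URI-shaped: fall back to local:<first 12 hex chars of sha256(path)>.
    const digest = createHash("sha256").update(localPath).digest("hex");
    return `local:${digest.slice(0, 12)}`;
  }
  const host = match[1].toLowerCase();
  const path = match[2].replace(/\/+$/, "").replace(/\.git$/, "");
  return `${host}/${path}`;
}

const fromHttps = normalizeRepoUri("https://Example.Test/org/repo.git", "/work/repo");
const fromSsh = normalizeRepoUri("git@example.test:org/repo.git", "/work/repo");
const fromName = normalizeRepoUri("my-repo", "/work/repo");
```

Under these assumptions the HTTPS and SSH spellings of the same remote collapse to one identifier, while a bare registry name maps to a stable `local:` hash of its path.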

- Backward compat: error-envelope.test.ts:39-47 stays green — the legacy
  { code, message, hint } shape is preserved alongside the new fields.
- No change to REPO_NOT_FOUND, NO_INDEX, or any other error code.
- No coupling to AC-M6-1's RepoNode type — repo_uri derived from
  RegistryEntry at call time; TODO marker flags the M7 upgrade path.
- No group-level ambiguity logic (AC-M6-4 scope untouched).

Refs: .erpaval/specs/005-m5-m6/spec.md AC-M6-2, E-M6-2, W-M6-2
Implement the deterministic BOM manifest generator. buildManifest
computes pack_hash = sha256(canonicalJson(manifest - pack_hash)) from
the already-built BomItem list. serializeManifest emits snake_case,
canonical-key-order JSON to disk.

Also audited core-types/hash.ts#writeCanonicalJson against RFC 8785:
already compliant. Number formatting delegates to JSON.stringify,
which implements ES6 7.1.12.1 ToString (the exact algorithm RFC 8785
3.2.2.3 references). Key sort uses Object.keys().sort(), which is
UTF-16 code-unit ascending per V8's default string comparator.
Added 7 compliance tests to hash.test.ts so the behavior is locked
and any future refactor failing RFC 8785 fails CI.

- packHash is computed with the field itself omitted from the preimage
  (placeholder empty string, stripped during canonicalization).
- Byte-identity test: two runs on same opts produce === manifest JSON.
- camelCase TS / snake_case wire boundary handled by a single
  toSnakeCaseManifest helper; all consumers (disk write, hashing) see
  the same bytes.

Refs: .erpaval/specs/005-m5-m6/spec.md AC-M5-3, E-M5-4, W-M5-2
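
The hash-over-manifest-minus-itself idea can be sketched as below. The `canonicalJson` here is a simplified stand-in (sorted keys, no whitespace, `JSON.stringify` for primitives), not the project's `writeCanonicalJson`, and `computePackHash` is a hypothetical name.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Simplified canonical form: keys sorted by UTF-16 code unit, no whitespace.
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  const keys = Object.keys(value).sort();
  const members = keys.map((k) => `${JSON.stringify(k)}:${canonicalJson(value[k])}`);
  return `{${members.join(",")}}`;
}

// Hypothetical helper: hash the manifest with pack_hash left out of the preimage.
function computePackHash(manifest: { [key: string]: Json }): string {
  const { pack_hash: _omitted, ...preimage } = manifest;
  return createHash("sha256").update(canonicalJson(preimage)).digest("hex");
}

const packHash = computePackHash({
  pack_hash: "",
  files: ["skeleton.json", "deps.json"],
  pins: { tokenizer: "example" },
});
```

Leaving the field out of the preimage avoids the circularity of a hash that would otherwise have to include itself, and sorting keys makes the result independent of property insertion order.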
Extend the 5 group MCP tools (group_list, group_query, group_contracts,
group_status, group_sync) with additive repo_uri fields. Legacy
name/_repo/consumerRepo/producerRepo fields preserved through M7 -
no breaking rename. repo_uri is derived via deriveRepoUri (shipped
by AC-M6-2 in repo-resolver.ts); when AC-M6-1's RepoNode is in the
graph, prefer its repoUri.

- Additive changes only
- codehub-contract-map skill continues to work via backward-compat
- Legacy test assertions preserved byte-for-byte

Refs: .erpaval/specs/005-m5-m6/spec.md AC-M6-4, E-M6-4
…-M6-3 reframed)

After discovery revealed .docmeta.json lives in plugin Markdown
(not TS), reframe AC-M6-3: engine side owns the sourced link graph
(new computeCrossRepoLinks helper + group_cross_repo_links MCP tool),
skill orchestrator owns the .docmeta.json file and writes v2 during
Phase E.

- packages/analysis/src/group/cross-repo-links.ts: deterministic,
  alpha-sorted CrossRepoLink[] from group_contracts data
- packages/mcp/src/tools/group-cross-repo-links.ts: MCP wrapper
- cross-reference-spec.md: v2 schema with cross_repo_links[]
- SKILL.md Phase E prose: orchestrator calls the tool + writes v2
- Determinism snapshot-tested

This preserves E-M6-3 (sourced, not heuristic), U6 (no LLM calls in
engine), and OCH's architecture (skill owns doc assembly).

Refs: .erpaval/specs/005-m5-m6/spec.md AC-M6-3, E-M6-3, S-M6-2
Wave 1 (1775500) wired `chonkie@^0.3.0` into @opencodehub/pack as the
AST chunker. That package is owned by chonkie-inc but is NOT the
documented surface — it's an undocumented stub-history publish whose
repository URL points at the now-renamed chonkie-ts repo. The npm
`chonkie-ts` package is a PolyerAI squatter (2.6 kB, no deps,
abandoned ~1 year). The canonical chonkie-inc TypeScript port,
explicitly named in the chonkie-inc/chonkiejs README install command,
is `@chonkiejs/core` — same author (Bhavnick Minhas) as the Python
upstream, MIT-licensed, latest 0.0.9 on 2026-03-27.

This commit:
- Swaps `packages/pack/package.json` from `chonkie@^0.3.0` to
  `@chonkiejs/core@^0.0.9` (alpha-sorted in the dependencies block).
- Regenerates pnpm-lock.yaml (`pnpm install`); the lockfile also
  picks up incidental dedup of stale aws-sdk hoist entries already
  present on the base branch.
- Amends `.erpaval/specs/005-m5-m6/spec.md` AC-M5-1 deps list,
  AC-M5-5 ast-chunker.ts entry, S-M5-1 fallback condition, and the
  Context "AST chunker" bullet to read `@chonkiejs/core@^0.0.9`.
  Manifest field name `chonkie_version` is retained — the field is
  still pinned, only the package providing it changed.

No source code is touched: `generatePack` remains a typed stub. The
import wiring lands in T-W2-5 alongside ast-chunker.ts.
…quickcheck (AC-M6-5)

ADR 0012 captures the rationale for first-class RepoNode mirroring ADR
0011's structure (393 lines): Context, Decision (9-attribute shape),
Schema choice (append-only NodeKind union), graphHash invariant W-M6-1
(append-only ordering, %cI HEAD indexTime not wall-clock, no backfill),
Migration (lazy population + engine tolerance), Edge kinds deferred to
M7, Risks, References citing commits 9ee6a96 (M6-1 RepoNode), 26e507b
(M6-2 structured AMBIGUOUS_REPO), f9fdde2 (M6-4 group_* additive
repo_uri), 86e295b (M6-3 reframed cross-repo links).

AGENTS.md and CLAUDE.md AMBIGUOUS_REPO paragraphs cross-linked to ADR
0012, RepoNode (packages/core-types/src/nodes.ts:524-552), and the
AC-M6-3-reframed group_cross_repo_links MCP tool, plus a worked JSON
example showing the error envelope and a retry call. Both files stay
byte-identical for the synced range.

Synthetic 2-repo fixture under packages/analysis/src/group/__fixtures__/
exercises the populated-case path of computeCrossRepoLinks (HTTP route
+ gRPC service producer/consumer pair). cross-repo-links-quickcheck.test.ts
asserts shape (5-tuple), consumer/producer orientation, deterministic
ordering (two runs deep-equal), and evidence sourcing.
Move the pure license classifier (`classifyDependencies`), its supporting
types (`DependencyRef`, `LicenseTier`, `LicenseAuditFlagged`,
`LicenseAuditResult`), and the private `COPYLEFT_PATTERN` regex from
`@opencodehub/mcp/src/tools/license-audit.ts` into a new
`@opencodehub/analysis/src/license-classify.ts`. Re-export from the
analysis barrel.

Why: T-W2-5 (`packages/pack/src/licenses.ts`) needs the same classifier.
`pack` cannot import from `mcp` because that introduces a mcp → pack →
mcp dependency cycle. `analysis` is already a transitive dependency of
both `mcp` and `pack`, so lifting the helper there breaks the cycle
cleanly without adding new package edges.

Mechanical lift only — function body, regex, tier semantics, and
`LicenseAuditResult` shape are byte-identical. The MCP tool now imports
the classifier from `@opencodehub/analysis`; no shim re-export retained.
The mcp-side test (`license-audit.test.ts`) updates only its import
path. A package-local `license-classify.test.ts` mirrors the legacy
9 cases (OK / WARN-on-UNKNOWN / WARN-on-empty / BLOCK-on-GPL /
BLOCK-on-PROPRIETARY / AGPL+SSPL+EUPL+CPAL+OSL+RPL spread / LGPL
non-match / lowercase copyleft / BLOCK-wins-over-WARN).

Refs: T-W2-3 (drift_4 prep, extends spec 005 AC-M5-5).
…Store

The M5 BOM bodies (T-W2-4 / T-W2-5: skeleton, file-tree, deps, xrefs) need
typed kind-filtered enumeration of GraphNodes from the polymorphic `nodes`
table. Without a first-class API, every BOM body would have to scatter raw
`store.query("SELECT id, kind, version, license, ... FROM nodes WHERE
kind = ?")` SQL across `packages/pack/`, replicate the column→field
rehydration logic per-call, and lose type-safety on the kind-specific
wider columns (Dependency `version`/`license`/`lockfile_source`/`ecosystem`,
Repo `repo_uri`/`default_branch`/`languageStats`, etc.).

`listNodes(opts?: { kinds?, limit?, offset? })` is the cleaner long-term
API: deterministic ordering at the storage layer (ORDER BY id ASC + a
JS-side lex-stable tiebreak), `kinds: undefined` returns every kind,
`kinds: []` short-circuits to `[]`, paging via limit/offset.
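
The option semantics can be illustrated against a plain in-memory array. This is a sketch only: the real adapters run SQL or graph queries, and `GraphNodeLike` is a stand-in for the much wider `GraphNode` type.

```typescript
// Stand-in for GraphNode; the real type has many kind-specific fields.
interface GraphNodeLike {
  id: string;
  kind: string;
}

interface ListNodesOpts {
  kinds?: readonly string[];
  limit?: number;
  offset?: number;
}

function listNodes(all: readonly GraphNodeLike[], opts: ListNodesOpts = {}): GraphNodeLike[] {
  const { kinds, limit, offset = 0 } = opts;
  // An explicit empty filter means "no kinds", not "all kinds".
  if (kinds !== undefined && kinds.length === 0) return [];
  const wanted = kinds === undefined ? null : new Set(kinds);
  const sorted = all
    .filter((n) => wanted === null || wanted.has(n.kind))
    // Code-unit comparison keeps ordering independent of locale.
    .sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
  return sorted.slice(offset, limit === undefined ? undefined : offset + limit);
}
```

Distinguishing `undefined` from `[]` matters for callers that build the kind list dynamically: an empty list then yields no rows instead of silently returning the whole graph.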

Both adapters share a fully-typed `rowToGraphNode` / `recordToGraphNode`
rehydration helper that reverses every encoding `nodeToRow` /
`nodeToParams` writes, including the Operation
`http_method`/`http_path` → `method`/`path` aliasing, the polymorphic
`frameworks_json` legacy-vs-v2 envelope, the `unreachable_export` →
`unreachable-export` deadness denormalisation, and the Repo nullable-
field preservation. Tests verify cross-adapter parity: the same fixture
fed to DuckStore and GraphDbStore yields byte-identical
`canonicalJson(GraphNode)` for every node.

The interface change is purely additive — no production consumer was
touched. Test fakes implementing `IGraphStore` (`FakeStore`,
`WikiFakeStore`, two `StubStore` instances) gained a small noop
`listNodes` so the type check stays green across the monorepo.

Tests: 9 new in duckdb-adapter.test.ts (real DuckDB), 7 in
graphdb-adapter.test.ts (1 pure-JS short-circuit + 6 native-binding-
gated, including the cross-adapter parity test). All 159 storage tests
pass; 1764 tests pass across the monorepo with 0 failures.
Land the first three BOM body modules under `packages/pack/src/`. Each
emits a flat row stream that `generatePack` (a typed stub at
`packages/pack/src/index.ts:23` until T-W2-5) will eventually assemble
into a deterministic 9-item code-pack BOM.

skeleton.ts (item 2/9)
  PageRank-ranked Function/Class/Method symbols. Pulls callable nodes
  via `IGraphStore.listNodes({ kinds: [...] })` (T-W2-2) and CALLS
  edges via raw SQL against the `relations` table (column is `type`,
  not `kind`; columns `from_id`/`to_id`). Feeds `EdgeLike[]` into
  `buildAdjacency` + `pageRank(adj, 0.85, 50)` from
  `@opencodehub/analysis` — fixed iterations + damping per W-M5-3, no
  tolerance-based convergence. Map id → score is keyed off
  `adj.nodes[i]` (the Float64Array is index-aligned to that array;
  never rebuild the index from edges). Output sorted score DESC, id
  ASC. Method.owner round-trips; non-Method rows omit it.

file-tree.ts (item 3/9)
  File/Folder rows alpha-sorted by `path ASC` and decorated with the
  repo's framework set. Precedence:
  `frameworksDetected: FrameworkDetection[]` (preferred — structured)
  → legacy `frameworks: string[]` flat list → `[]`. Names are
  alpha-sorted + deduped before being stamped onto every row (the
  ProjectProfile is a per-repo singleton at v1, so all rows carry the
  same labels). Files surface optional language + contentHash;
  folders omit them. We deliberately do not walk CONTAINS edges —
  paths come from the FileNode/FolderNode `filePath` field.

deps.ts (item 4/9)
  Dependency rows mapped to a flat DepRow shape mirroring the MCP
  `dependencies` tool, but WITHOUT importing `@opencodehub/mcp`
  (mcp depends on pack via `pack_codebase` — that would create a
  workspace cycle). Sort key:
  `(ecosystem ASC, name ASC, version ASC, id ASC)`. The id-tiebreak
  catches polyrepos where the same package is pinned at the same
  version across multiple lockfiles. Missing license / version are
  preserved as `undefined` — the BOM stores raw graph state and
  leaves the "UNKNOWN" coercion to render-time consumers.

Determinism contract — non-negotiable for all three modules
  - `Array.prototype.sort` over a plain JS comparator; never trust
    Map iteration order for output sequencing.
  - score / version / etc. ties resolve via `id ASC` (lex-stable
    last resort).
  - PageRank itself is deterministic by construction.
  - Two consecutive calls return byte-identical canonicalJson.
  Each module ships a determinism test that asserts both
  `deepEqual` and `canonicalJson(a) === canonicalJson(b)` over two
  consecutive invocations on the same in-memory mock store.
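
A minimal sketch of the ordering rule for the skeleton rows (score DESC, id ASC) is shown below. `SkeletonRow` and `sortSkeleton` are illustrative names, and scores are assumed to be finite numbers.

```typescript
// Illustrative row shape; the real skeleton rows carry more fields.
interface SkeletonRow {
  id: string;
  score: number;
}

// Plain code-unit comparison; localeCompare is avoided because its result
// can vary with the runtime's locale data.
function compareIds(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

function sortSkeleton(rows: readonly SkeletonRow[]): SkeletonRow[] {
  // Copy first so the caller's array is left untouched.
  return [...rows].sort((a, b) => b.score - a.score || compareIds(a.id, b.id));
}

const ranked = sortSkeleton([
  { id: "pkg.b", score: 0.5 },
  { id: "pkg.a", score: 0.5 },
  { id: "pkg.c", score: 0.9 },
]);
```

Because every tie falls through to the unique id, the comparator defines a total order and the output no longer depends on the order in which rows were collected.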

Why three sibling modules instead of one bundled builder
  Each BOM item has a distinct shape, distinct sort keys, and a
  distinct origin kind on the graph. Bundling them behind a generic
  `buildBom(opts: { kind })` interface would force the variants
  through a sum-type seam that the manifest writer (T-W2-5) and the
  future `code_skeleton` MCP surface don't want — they consume each
  output as a strictly-typed table, not an `unknown[]`. Three
  small modules with parallel structure is simpler than one
  abstraction that needs to fit nine future shapes (xrefs,
  ast-chunks, embeddings-sidecar, findings, licenses).

Tests (21 new, baseline 18, total 39)
  Each module ships node:test cases against a thin
  `as unknown as IGraphStore` mock that implements only the methods
  the module reaches (listNodes + query for skeleton; listNodes for
  file-tree and deps). The mock pattern matches
  `packages/cli/src/commands/context.test.ts:118` and avoids the
  duckdb native-binding fragility in the worktree shell.

Verification
  - pnpm -C packages/pack exec tsc --noEmit → exit 0
  - pnpm exec biome check packages/pack/      → exit 0
  - pnpm -C packages/pack test                → 39/39 pass
  - bash scripts/check-banned-strings.sh      → PASS
ast-chunker.ts wraps @chonkiejs/core CodeChunker via dynamic import; degrades
to a line-split fallback when the loader rejects, when CodeChunker.create
throws (per-file path), or when a file lacks a `language` (per-file → strict
result preserved). CRLF→LF normalize before chunking (W-M5-4). pinsHint surfaces
chonkie's package.json `version` for the manifest pins object. Worktree
native-binding lesson — onnxruntime-node may not rebuild cleanly — drove the
mock-first test seam (`_loadChonkie`).
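
The line-split fallback with CRLF normalization might look roughly like this. The `Chunk` shape, the `lineSplitChunks` name, and the default of 40 lines per chunk are assumptions for the sketch; the commit does not state the actual chunk size or row format.

```typescript
// Assumed chunk shape with 1-based inclusive line numbers.
interface Chunk {
  startLine: number;
  endLine: number;
  text: string;
}

function lineSplitChunks(source: string, maxLines = 40): Chunk[] {
  // Normalize CRLF to LF first so Windows and Unix checkouts chunk identically.
  const lines = source.replace(/\r\n/g, "\n").split("\n");
  const chunks: Chunk[] = [];
  for (let i = 0; i < lines.length; i += maxLines) {
    const slice = lines.slice(i, i + maxLines);
    chunks.push({
      startLine: i + 1,
      endLine: i + slice.length,
      text: slice.join("\n"),
    });
  }
  return chunks;
}

const fallbackChunks = lineSplitChunks("a\r\nb\r\nc", 2);
```

Normalizing before splitting means the chunk text, and therefore any hash taken over it, does not vary with the line-ending setting of the checkout.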

xrefs.ts emits Community rows (alpha by id) followed by CALLS rows
(`from, to, id` ASC) from a single `WHERE type = 'CALLS' ORDER BY id ASC`
scan of the relations table. Confidence is surfaced raw but never used as a
sort key — float comparison would inject non-determinism on near-equal values.

findings.ts groups by SARIF `level` enum + ruleId. NULL/unknown severity
coerces to "none". Suppressed rows are skipped via rehydration of
`suppressed_json` → `{suppressions: [...]}` → `sarif.isSuppressed()`,
mirroring the helper at `packages/analysis/src/verdict.ts:614-626`. Groups
sort by SEVERITY_RANK then ruleId ASC; examples sort by nodeId ASC and cap
at `examplesPerGroup` (default 3).
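
The grouping and ordering rules can be sketched as follows. The rank values in `SEVERITY_RANK`, the `Finding` shape, the `count` field, and the boolean `suppressed` flag are assumptions; the real module rehydrates `suppressed_json` and delegates to `sarif.isSuppressed()`.

```typescript
type SarifLevel = "error" | "warning" | "note" | "none";

// Assumed ranking: most severe first.
const SEVERITY_RANK: Record<SarifLevel, number> = { error: 0, warning: 1, note: 2, none: 3 };

interface Finding {
  ruleId: string;
  nodeId: string;
  level: string | null;
  suppressed: boolean; // stand-in for the suppressions check
}

interface FindingGroup {
  level: SarifLevel;
  ruleId: string;
  count: number;
  examples: string[];
}

// NULL or unrecognized severities coerce to "none".
function toLevel(raw: string | null): SarifLevel {
  return raw === "error" || raw === "warning" || raw === "note" ? raw : "none";
}

function groupFindings(findings: readonly Finding[], examplesPerGroup = 3): FindingGroup[] {
  const cmp = (a: string, b: string): number => (a < b ? -1 : a > b ? 1 : 0);
  const groups = new Map<string, FindingGroup>();
  for (const f of findings) {
    if (f.suppressed) continue;
    const level = toLevel(f.level);
    const key = `${level}\u0000${f.ruleId}`;
    let group = groups.get(key);
    if (group === undefined) {
      group = { level, ruleId: f.ruleId, count: 0, examples: [] };
      groups.set(key, group);
    }
    group.count += 1;
    group.examples.push(f.nodeId);
  }
  // Sort explicitly; Map iteration order is never used for output sequencing.
  return [...groups.values()]
    .map((g) => ({ ...g, examples: [...g.examples].sort(cmp).slice(0, examplesPerGroup) }))
    .sort((a, b) => SEVERITY_RANK[a.level] - SEVERITY_RANK[b.level] || cmp(a.ruleId, b.ruleId));
}
```

Sorting the example ids before applying the cap matters: capping first would make the retained examples depend on scan order.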

licenses.ts uses `classifyDependencies` from `@opencodehub/analysis` (lifted
in T-W2-3, the Drift 4 prep commit). Aggregates LICENSES.md (tier counts header + per-package
sections in `(ecosystem, name, version, id)` ASC) and concatenates any
`NOTICE` / `NOTICE.md` / `NOTICES` files found at the repo root.

readme.ts renders a pure-function README with the determinism contract
(strict | best_effort | degraded) and BOM file index. Snapshot-stable.

generatePack assembles all 8 BOM files (skeleton, file-tree, deps, ast-chunks,
xrefs, findings, licenses, readme) plus manifest.json. Manifest is written
LAST so a partial run leaves an obviously-incomplete pack. NO Parquet sidecar —
T-W3-1 owns that. determinism_class: degraded > best_effort (anthropic:
tokenizer) > strict. pins.duckdbVersion read from `@duckdb/node-api`'s
package.json at runtime.

Tests: pack package goes from 39 → 90 (+51). End-to-end test asserts
byte-identical files across two runs on the same fixture using sha256
per-file. Workspace total: 1848 tests, 0 failures.
DuckDB COPY (SELECT node_id, granularity, chunk_index, vector FROM embeddings
ORDER BY node_id, granularity, chunk_index) TO 'embeddings.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD). Pins duckdbVersion in manifest.pins
from the runtime SELECT version() reported by the binding that wrote the
file — that string is what the parquet created_by metadata embeds, so the
manifest pin stays bound to the engine that produced the sidecar.

Sidecar absent when embeddings table empty (S-M5-3) — no file on disk and
manifest.files[] does not list a path. The sidecar is structurally
duck-typed (IGraphStore is not widened): stores without
exportEmbeddingsParquet (mocks, GraphDbStore, future LanceDB) cleanly
resolve to absent. Path is interpolated into COPY because DuckDB does not
bind COPY destinations; isSafeAbsolutePath() rejects anything outside a
strict POSIX-absolute character class so injection is structurally
impossible.
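
A sketch of the path guard and statement assembly is given below. The character class, the extra `..` rejection, and the `buildCopyStatement` helper are assumptions for illustration; only the `isSafeAbsolutePath` name and the COPY text come from the commit message.

```typescript
// Assumed allow-list: POSIX-absolute, limited to letters, digits, '.', '_', '-', '/'.
// Rejecting ".." segments is an extra precaution added for this sketch.
function isSafeAbsolutePath(p: string): boolean {
  return /^\/[A-Za-z0-9._\/-]+$/.test(p) && !p.split("/").includes("..");
}

// Hypothetical helper: the destination is interpolated because COPY targets
// cannot be bound as parameters, so it is validated before use.
function buildCopyStatement(outPath: string): string {
  if (!isSafeAbsolutePath(outPath)) {
    throw new Error(`unsafe sidecar path: ${outPath}`);
  }
  return (
    "COPY (SELECT node_id, granularity, chunk_index, vector FROM embeddings " +
    "ORDER BY node_id, granularity, chunk_index) " +
    `TO '${outPath}' (FORMAT PARQUET, COMPRESSION ZSTD)`
  );
}

const copySql = buildCopyStatement("/tmp/pack-staging/embeddings.parquet");
```

Since quotes, spaces, and semicolons are outside the allowed class, a path that passes the guard cannot terminate the string literal it is placed in.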

Two-run byte-identity test on a 100-row × 384-dim Float32 fixture confirms
determinism via Buffer.compare === 0 against a real DuckDbStore. Pack
tests 90 → 96; full repo tests 1848 → 1854; all gates green.
…hub/pack

AC-M5-7. CLI: new `codehub code-pack [path] [--budget N] [--tokenizer ID]
[--out-dir DIR] [--engine pack|repomix]`. Default engine is `pack` and writes
the 9-item BOM (manifest + skeleton + file-tree + deps + ast-chunks + xrefs +
findings + licenses + readme + optional embeddings.parquet) to
`<repo>/.codehub/packs/<packHash>/`. Output is staged in `os.tmpdir()` first
and renamed into the canonical hash-suffixed path once `generatePack` returns
its manifest, so the directory name encodes pack identity. The repomix path
delegates to the existing `runPack` shell-out for npx repomix and returns a
`bomItemCount: 1` envelope.

MCP: pack_codebase routes through @opencodehub/pack on engine=pack (default);
legacy repomix path retained under engine=repomix opt-in (drop deferred to M7
per spec 005 Q-DELTA-6). The repomix response carries `_meta.engine: "repomix"`
so callers can detect the legacy path and `next_steps[]` flags the pending
deprecation.

Test seams: both runCodePack and runPackCodebase accept injected stubs
(`_generatePack`, `_store`, `_runRepomix`, `PackCodebaseDeps`) so unit tests
exercise engine routing without loading native DuckDB bindings or shelling out.
16 new tests cover defaults, dispatch, the .codehub/packs/<hash>/ path layout,
embeddings sidecar inclusion, custom out-dir, and the no-index error envelope.

repomix is bandwidth output, not a tree-sitter chunker
(.erpaval/solutions/architecture-patterns/repomix-is-output-side.md): the
@opencodehub/pack engine fully supersedes it for code intelligence; repomix
stays available for raw repo packing through M6 and is removed in M7.
New skill at plugins/opencodehub/skills/codehub-code-pack/ surfaces
`codehub code-pack` to Claude Code agents. Single-repo + group mode,
allowed-tools list, 9-item BOM contract documented inline,
determinism class triage (strict/best_effort/degraded), pack_hash
verification recipe. references/determinism-contract.md captures
spec 005 §M5 invariants for future auditors. Cross-linked from
opencodehub-guide skills table.
Adds end-to-end packages/pack/src/pack-determinism.test.ts that runs
generatePack twice and asserts every output file is byte-identical
(packHash equality + Buffer.compare per file). Adds
scripts/pack-determinism-audit.sh that exercises the same invariant
through the codehub CLI; integrated into scripts/acceptance.sh. SKIP
guards keep both gates honest when DuckStore native bindings are
absent.
Rebase onto main brought in PR #70's transitive-CVE overrides
(fast-xml-builder@1.1.7, fast-uri@3.1.2, hono@4.12.16, ip-address@10.1.1).
Regenerating the lockfile pulls those in alongside the M5/M6 pack deps.

No source changes — build + typecheck + tests + banned-strings all green
locally before push.
@theagenticguy theagenticguy enabled auto-merge (squash) May 8, 2026 21:44
CodeQL flagged a potential filesystem race on
packages/pack/src/embeddings-sidecar.ts:134 — stat(outPath) and
readFile(outPath) ran concurrently in Promise.all, so size and
content could come from different versions of the file.

Derive bytesWritten from the same buffer used for hashing: a single
readFile, then bytes.byteLength. No stat needed.
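
The fix can be sketched as a single read whose buffer supplies both the size and the digest. The `describeSidecar` name and return shape are assumptions, and the synchronous `readFileSync` is used here only to keep the example short; the commit describes an async `readFile`.

```typescript
import { createHash } from "node:crypto";
import { mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: size and hash both derive from the same bytes,
// so they can never describe two different versions of the file.
function describeSidecar(outPath: string): { bytesWritten: number; sha256: string } {
  const bytes = readFileSync(outPath);
  return {
    bytesWritten: bytes.byteLength,
    sha256: createHash("sha256").update(bytes).digest("hex"),
  };
}

// Demo against a throwaway file in the system temp directory.
const demoDir = mkdtempSync(join(tmpdir(), "sidecar-demo-"));
const demoPath = join(demoDir, "embeddings.parquet");
writeFileSync(demoPath, Buffer.from([1, 2, 3, 4]));
const demo = describeSidecar(demoPath);
```

With a separate `stat` call, a writer that touches the file between the two operations could leave a recorded size that does not match the hashed content; one read removes that window.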
@theagenticguy theagenticguy merged commit d5f48e2 into main May 8, 2026
17 checks passed
@theagenticguy theagenticguy deleted the feat/v1-m5-m6 branch May 8, 2026 21:52
theagenticguy added a commit that referenced this pull request May 10, 2026