OCH v1.0 — M5 deterministic code-packs + M6 cross-repo federation #68
Merged
theagenticguy
added a commit
that referenced
this pull request
May 8, 2026
Five durable lessons extracted from feat/v1-m5-m6 (PR #68, M5 + M6 complete):
- conventions/npm-package-canonicality-via-upstream-readme — chonkie-ts was a 2.6 kB squatter; @chonkiejs/core was canonical per upstream README.
- architecture-patterns/storage-list-nodes-over-scattered-sql — typed IGraphStore.listNodes() collapses N raw-SQL call sites; cross-adapter parity test catches schema drift.
- architecture-patterns/lift-pure-functions-to-shared-dep-to-break-cycles — classifyDependencies lifted into @opencodehub/analysis (LCA dep) averted mcp → pack → mcp cycle.
- best-practices/worktree-isolation-pwd-pin-and-biome-exclusion — pin pwd at task start; biome v2 traverses gitignored worktrees, scope to packages/ or add experimentalScannerIgnores.
- best-practices/spec-drift-amend-inline-with-implementing-commit — amend spec wording in the same commit that implements the resolution.
INDEX.md updated with five new entries under Solutions.
Prepares commitlint.config.mjs for the M5 `@opencodehub/pack` workspace. No source code yet — this lands first so subsequent `feat(pack): ...` commits pass the commit-msg hook. Also adds the M5 + M6 EARS spec at `.erpaval/specs/005-m5-m6/spec.md` describing the 14 acceptance criteria, wave structure, and 10-point roadmap-constraint cross-check. See the spec for the full M5/M6 plan. Refs: .erpaval/ROADMAP.md §M5 + §M6
Greenfield package for the M5 9-item code-pack BOM. This commit wires the package (package.json, tsconfig, public entry with stubbed generatePack, type surface) and updates the root tsconfig references. The generatePack body lands in AC-M5-3 (manifest + pack_hash) and AC-M5-4+ (BOM body implementations). AC-M5-1's job is to make the empty-but-wired package compile, test, and lint clean so subsequent ACs can parallel-implement. Refs: .erpaval/specs/005-m5-m6/spec.md AC-M5-1
Move pageRank, buildAdjacency, and the Adjacency interface from packages/scip-ingest/src/materialize.ts (where it was dead code stored into BlastMetrics.pagerank with zero downstream consumers) to packages/analysis/src/page-rank.ts, where it becomes a request-time kernel consumed by AC-M5-4's skeleton BOM item.
- Preserve fixed-iteration + fixed-damping semantics byte-for-byte
- Rename pagerank -> pageRank (camelCase, analysis convention)
- Make buildAdjacency generic over EdgeLike instead of DerivedEdge
- Add determinism snapshot test (Float64Array hex) for a 10-node fixture
- Remove BlastMetrics.pagerank field and the L231 call site
- scip-ingest's SCC/reach code stays in materialize.ts
Refs: .erpaval/specs/005-m5-m6/spec.md AC-M5-2, E-M5-5, W-M5-3
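The fixed-iteration, fixed-damping behaviour described above can be sketched as follows. This is an illustration only, not the code in packages/analysis/src/page-rank.ts: the `EdgeLike` and `Adjacency` field names and the dangling-node handling are assumptions.

```typescript
// Hypothetical shapes; the real EdgeLike/Adjacency types may differ.
interface EdgeLike { from: string; to: string }
interface Adjacency { nodes: string[]; out: number[][] }

function buildAdjacency(edges: readonly EdgeLike[]): Adjacency {
  // Sort ids so node indices never depend on edge arrival order.
  const nodes = [...new Set(edges.flatMap((e) => [e.from, e.to]))].sort();
  const index = new Map<string, number>();
  nodes.forEach((id, i) => index.set(id, i));
  const out: number[][] = nodes.map(() => []);
  for (const e of edges) {
    out[index.get(e.from)!].push(index.get(e.to)!);
  }
  // Sort targets so floating-point additions happen in a fixed order.
  for (const targets of out) targets.sort((a, b) => a - b);
  return { nodes, out };
}

function pageRank(adj: Adjacency, damping: number, iterations: number): Float64Array {
  const n = adj.nodes.length;
  let rank: Float64Array = new Float64Array(n).fill(n === 0 ? 0 : 1 / n);
  // Fixed iteration count: no tolerance-based early exit.
  for (let it = 0; it < iterations; it++) {
    const next = new Float64Array(n);
    let dangling = 0;
    for (let i = 0; i < n; i++) {
      const targets = adj.out[i];
      if (targets.length === 0) {
        dangling += rank[i]; // assumption: spread dangling mass uniformly
        continue;
      }
      const share = rank[i] / targets.length;
      for (const t of targets) next[t] += share;
    }
    const base = (1 - damping) / n + (damping * dangling) / n;
    for (let i = 0; i < n; i++) next[i] = base + damping * next[i];
    rank = next;
  }
  return rank; // index-aligned with adj.nodes
}
```

Because both node order and per-node target order are sorted before any arithmetic, two runs over the same edge set add the same floats in the same order, which is what makes a hex snapshot of the result meaningful.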
Add Repo as a NodeKind — append-only to preserve graphHash byte identity for existing graphs. RepoNode carries 9 attributes (originUrl, repoUri, defaultBranch, commitSha, indexTime, group, visibility, indexer, languageStats) synthesizing Sourcegraph URI + SCIP Metadata.toolInfo.
- Append to NodeKind + GraphNode at end of union
- Add Repo DDL to both schema-ddl.ts (DuckDB) and graphdb-schema.ts (graph-db). The "JSON-through" claim in the packet was checked and found false: the polymorphic nodes table uses per-field columns, so we added 9 new TEXT columns (append-only)
- New ingestion phase packages/ingestion/src/pipeline/phases/repo-node.ts probes git origin, defaults to local:<hash> on no-remote. indexTime pinned to %cI HEAD commit timestamp (not wall clock) so W-M6-1 determinism holds without excluding the field from graphHash
- graph-hash-parity tests: existing small/medium/large fixtures unchanged; new repo-node + repo-null fixtures round-trip parity across both stores
- duckdb-adapter + graphdb-roundtrip tests extended with repo write + round-trip coverage
- Does NOT introduce Repo edge kinds (deferred)
- Does NOT backfill existing graphs
Refs: .erpaval/specs/005-m5-m6/spec.md AC-M6-1, E-M6-1, S-M6-1, W-M6-1
…(AC-M6-2)
Extend the existing AMBIGUOUS_REPO sentinel with a structured payload on
`structuredContent.error`: error_code, jsonrpc_code, choices[] (capped at
10), total_matches, hint. Choices carry { repo_uri, default_branch, group }
so a calling agent can retry deterministically with one of them; when
total_matches > choices.length, the caller knows the list was truncated.
Also adds `repo_uri` as an accepted alias for the `repo` arg on every
per-repo MCP tool (~20 tools spread a shared `repoArgShape` helper from
tools/shared.ts). `repo_uri` normalizes https/http/git@ protocol, trailing
`.git`, and host case, and falls back to `local:<sha256(path)[:12]>` when
the registry name is not URI-shaped. When both `repo` and `repo_uri` are
provided, `repo_uri` wins at the resolver.
- Backward compat: error-envelope.test.ts:39-47 stays green — the legacy
{ code, message, hint } shape is preserved alongside the new fields.
- No change to REPO_NOT_FOUND, NO_INDEX, or any other error code.
- No coupling to AC-M6-1's RepoNode type — repo_uri derived from
RegistryEntry at call time; TODO marker flags the M7 upgrade path.
- No group-level ambiguity logic (AC-M6-4 scope untouched).
Refs: .erpaval/specs/005-m5-m6/spec.md AC-M6-2, E-M6-2, W-M6-2
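A rough sketch of the two behaviours in this commit: `repo_uri` normalization and the truncated `choices[]` payload. Everything here is hypothetical. The `host/path` output shape, the function names, and the choice ordering are assumptions, and `jsonrpc_code` is left out because its value is not stated above.

```typescript
import { createHash } from "node:crypto";

interface RepoChoice { repo_uri: string; default_branch: string; group: string | null }

// Assumed output shape: "<lowercased host>/<path without .git>".
function normalizeRepoUri(input: string, localPath: string): string {
  const trimmed = input.trim();
  const scp = /^git@([^:/]+):(.+)$/.exec(trimmed); // git@host:owner/repo.git
  const url = /^https?:\/\/(?:[^@/]+@)?([^/]+)\/(.+)$/i.exec(trimmed);
  const m = scp ?? url;
  if (m === null) {
    // Not URI-shaped: fall back to local:<sha256(path)[:12]>.
    const digest = createHash("sha256").update(localPath).digest("hex");
    return `local:${digest.slice(0, 12)}`;
  }
  const host = m[1].toLowerCase();
  const path = m[2].replace(/\/+$/, "").replace(/\.git$/, "");
  return `${host}/${path}`;
}

// Caps choices at maxChoices; total_matches reveals truncation to the caller.
function ambiguousRepoError(matches: readonly RepoChoice[], maxChoices = 10) {
  const sorted = [...matches].sort((a, b) =>
    a.repo_uri < b.repo_uri ? -1 : a.repo_uri > b.repo_uri ? 1 : 0,
  );
  return {
    error_code: "AMBIGUOUS_REPO",
    choices: sorted.slice(0, maxChoices),
    total_matches: matches.length,
    hint: "Retry with one of choices[].repo_uri",
  };
}
```

A caller that sees `total_matches > choices.length` knows the list was cut and can narrow its query before retrying.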
Implement the deterministic BOM manifest generator. buildManifest computes pack_hash = sha256(canonicalJson(manifest - pack_hash)) from the already-built BomItem list. serializeManifest emits snake_case, canonical-key-order JSON to disk.
Also audited core-types/hash.ts#writeCanonicalJson against RFC 8785: already compliant. Number formatting delegates to JSON.stringify, which implements ES6 7.1.12.1 ToString (the exact algorithm RFC 8785 3.2.2.3 references). Key sort uses Object.keys().sort(), which orders keys by ascending UTF-16 code unit per the default comparator that ECMAScript specifies for Array.prototype.sort. Added 7 compliance tests to hash.test.ts so the behavior is locked and any future refactor failing RFC 8785 fails CI.
- packHash is computed with the field itself omitted from the preimage (placeholder empty string, stripped during canonicalization).
- Byte-identity test: two runs on same opts produce === manifest JSON.
- camelCase TS / snake_case wire boundary handled by a single toSnakeCaseManifest helper; all consumers (disk write, hashing) see the same bytes.
Refs: .erpaval/specs/005-m5-m6/spec.md AC-M5-3, E-M5-4, W-M5-2
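The pack_hash rule can be illustrated with a small sketch. It assumes a simplified `canonicalJson` (sorted keys, `JSON.stringify` for primitives) rather than the audited writeCanonicalJson helper, and `computePackHash` is a hypothetical name.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Simplified canonical form: keys sorted by UTF-16 code unit, no whitespace.
// Non-finite numbers, which RFC 8785 rejects, are not handled in this sketch.
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map((v) => canonicalJson(v)).join(",")}]`;
  const keys = Object.keys(value).sort();
  return `{${keys.map((k) => `${JSON.stringify(k)}:${canonicalJson(value[k])}`).join(",")}}`;
}

// Hash the manifest with its own pack_hash field removed from the preimage.
function computePackHash(manifest: { [key: string]: Json }): string {
  const preimage: { [key: string]: Json } = { ...manifest };
  delete preimage.pack_hash;
  return createHash("sha256").update(canonicalJson(preimage), "utf8").digest("hex");
}
```

Dropping the field before hashing means a verifier can recompute the hash from a manifest read off disk without knowing what placeholder the writer used.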
Extend the 5 group MCP tools (group_list, group_query, group_contracts, group_status, group_sync) with additive repo_uri fields. Legacy name/_repo/consumerRepo/producerRepo fields preserved through M7 - no breaking rename. repo_uri is derived via deriveRepoUri (shipped by AC-M6-2 in repo-resolver.ts); when AC-M6-1's RepoNode is in the graph, prefer its repoUri.
- Additive changes only
- codehub-contract-map skill continues to work via backward-compat
- Legacy test assertions preserved byte-for-byte
Refs: .erpaval/specs/005-m5-m6/spec.md AC-M6-4, E-M6-4
…-M6-3 reframed)
After discovery revealed .docmeta.json lives in plugin Markdown (not TS), reframe AC-M6-3: engine side owns the sourced link graph (new computeCrossRepoLinks helper + group_cross_repo_links MCP tool), skill orchestrator owns the .docmeta.json file and writes v2 during Phase E.
- packages/analysis/src/group/cross-repo-links.ts: deterministic, alpha-sorted CrossRepoLink[] from group_contracts data
- packages/mcp/src/tools/group-cross-repo-links.ts: MCP wrapper
- cross-reference-spec.md: v2 schema with cross_repo_links[]
- SKILL.md Phase E prose: orchestrator calls the tool + writes v2
- Determinism snapshot-tested
This preserves E-M6-3 (sourced, not heuristic), U6 (no LLM calls in engine), and OCH's architecture (skill owns doc assembly).
Refs: .erpaval/specs/005-m5-m6/spec.md AC-M6-3, E-M6-3, S-M6-2
Wave 1 (1775500) wired `chonkie@^0.3.0` into @opencodehub/pack as the AST chunker. That package is owned by chonkie-inc but is NOT the documented surface — it's an undocumented stub-history publish whose repository URL points at the now-renamed chonkie-ts repo. The npm `chonkie-ts` package is a PolyerAI squatter (2.6 kB, no deps, abandoned ~1 year). The canonical chonkie-inc TypeScript port, explicitly named in the chonkie-inc/chonkiejs README install command, is `@chonkiejs/core` — same author (Bhavnick Minhas) as the Python upstream, MIT-licensed, latest 0.0.9 on 2026-03-27.
This commit:
- Swaps `packages/pack/package.json` from `chonkie@^0.3.0` to `@chonkiejs/core@^0.0.9` (alpha-sorted in the dependencies block).
- Regenerates pnpm-lock.yaml (`pnpm install`); the lockfile also picks up incidental dedup of stale aws-sdk hoist entries already present on the base branch.
- Amends `.erpaval/specs/005-m5-m6/spec.md` AC-M5-1 deps list, AC-M5-5 ast-chunker.ts entry, S-M5-1 fallback condition, and the Context "AST chunker" bullet to read `@chonkiejs/core@^0.0.9`. Manifest field name `chonkie_version` is retained — the field is still pinned, only the package providing it changed.
No source code is touched: `generatePack` remains a typed stub. The import wiring lands in T-W2-5 alongside ast-chunker.ts.
…quickcheck (AC-M6-5)
ADR 0012 captures the rationale for first-class RepoNode mirroring ADR 0011's structure (393 lines): Context, Decision (9-attribute shape), Schema choice (append-only NodeKind union), graphHash invariant W-M6-1 (append-only ordering, %cI HEAD indexTime not wall-clock, no backfill), Migration (lazy population + engine tolerance), Edge kinds deferred to M7, Risks, References citing commits 9ee6a96 (M6-1 RepoNode), 26e507b (M6-2 structured AMBIGUOUS_REPO), f9fdde2 (M6-4 group_* additive repo_uri), 86e295b (M6-3 reframed cross-repo links).
AGENTS.md and CLAUDE.md AMBIGUOUS_REPO paragraphs cross-linked to ADR 0012, RepoNode (packages/core-types/src/nodes.ts:524-552), and the AC-M6-3-reframed group_cross_repo_links MCP tool, plus a worked JSON example showing the error envelope and a retry call. Both files stay byte-identical for the synced range.
Synthetic 2-repo fixture under packages/analysis/src/group/__fixtures__/ exercises the populated-case path of computeCrossRepoLinks (HTTP route + gRPC service producer/consumer pair). cross-repo-links-quickcheck.test.ts asserts shape (5-tuple), consumer/producer orientation, deterministic ordering (two runs deep-equal), and evidence sourcing.
Move the pure license classifier (`classifyDependencies`), its supporting types (`DependencyRef`, `LicenseTier`, `LicenseAuditFlagged`, `LicenseAuditResult`), and the private `COPYLEFT_PATTERN` regex from `@opencodehub/mcp/src/tools/license-audit.ts` into a new `@opencodehub/analysis/src/license-classify.ts`. Re-export from the analysis barrel.
Why: T-W2-5 (`packages/pack/src/licenses.ts`) needs the same classifier. `pack` cannot import from `mcp` because that introduces a mcp → pack → mcp dependency cycle. `analysis` is already a transitive dependency of both `mcp` and `pack`, so lifting the helper there breaks the cycle cleanly without adding new package edges.
Mechanical lift only — function body, regex, tier semantics, and `LicenseAuditResult` shape are byte-identical. The MCP tool now imports the classifier from `@opencodehub/analysis`; no shim re-export retained. The mcp-side test (`license-audit.test.ts`) updates only its import path.
A package-local `license-classify.test.ts` mirrors the legacy 9 cases (OK / WARN-on-UNKNOWN / WARN-on-empty / BLOCK-on-GPL / BLOCK-on-PROPRIETARY / AGPL+SSPL+EUPL+CPAL+OSL+RPL spread / LGPL non-match / lowercase copyleft / BLOCK-wins-over-WARN).
Refs: T-W2-3 (drift_4 prep, extends spec 005 AC-M5-5).
…Store
The M5 BOM bodies (T-W2-4 / T-W2-5: skeleton, file-tree, deps, xrefs) need
typed kind-filtered enumeration of GraphNodes from the polymorphic `nodes`
table. Without a first-class API, every BOM body would have to scatter raw
`store.query("SELECT id, kind, version, license, ... FROM nodes WHERE
kind = ?")` SQL across `packages/pack/`, replicate the column→field
rehydration logic per-call, and lose type-safety on the kind-specific
wider columns (Dependency `version`/`license`/`lockfile_source`/`ecosystem`,
Repo `repo_uri`/`default_branch`/`languageStats`, etc.).
`listNodes(opts?: { kinds?, limit?, offset? })` is the cleaner long-term
API: deterministic ordering at the storage layer (ORDER BY id ASC + a
JS-side lex-stable tiebreak), `kinds: undefined` returns every kind,
`kinds: []` short-circuits to `[]`, paging via limit/offset.
Both adapters share a fully-typed `rowToGraphNode` / `recordToGraphNode`
rehydration helper that reverses every encoding `nodeToRow` /
`nodeToParams` writes, including the Operation
`http_method`/`http_path` → `method`/`path` aliasing, the polymorphic
`frameworks_json` legacy-vs-v2 envelope, the `unreachable_export` →
`unreachable-export` deadness denormalisation, and the Repo nullable-
field preservation. Tests verify cross-adapter parity: the same fixture
fed to DuckStore and GraphDbStore yields byte-identical
`canonicalJson(GraphNode)` for every node.
The interface change is purely additive — no production consumer was
touched. Test fakes implementing `IGraphStore` (`FakeStore`,
`WikiFakeStore`, two `StubStore` instances) gained a small noop
`listNodes` so the type check stays green across the monorepo.
Tests: 9 new in duckdb-adapter.test.ts (real DuckDB), 7 in
graphdb-adapter.test.ts (1 pure-JS short-circuit + 6 native-binding-
gated, including the cross-adapter parity test). All 159 storage tests
pass; 1764 tests pass across the monorepo with 0 failures.
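The `listNodes` contract described in this commit (id-ordered output, `kinds: undefined` meaning every kind, `kinds: []` short-circuiting, limit/offset paging) can be sketched with an in-memory stand-in. The class, the reduced `NodeKind` union, and the synchronous signature are illustrative assumptions; the real adapters query DuckDB and the graph database.

```typescript
// Reduced, hypothetical node model for illustration only.
type NodeKind = "Function" | "Class" | "File" | "Dependency";
interface GraphNode { id: string; kind: NodeKind; name: string }
interface ListNodesOptions { kinds?: readonly NodeKind[]; limit?: number; offset?: number }

class InMemoryStore {
  private readonly nodes: readonly GraphNode[];

  constructor(nodes: readonly GraphNode[]) {
    this.nodes = nodes;
  }

  listNodes(opts: ListNodesOptions = {}): GraphNode[] {
    const { kinds, limit, offset = 0 } = opts;
    // An explicit empty filter matches nothing; skip the scan entirely.
    if (kinds !== undefined && kinds.length === 0) return [];
    const wanted = kinds === undefined ? null : new Set<NodeKind>(kinds);
    const rows = this.nodes
      .filter((n) => wanted === null || wanted.has(n.kind))
      // Equivalent of ORDER BY id ASC, applied before paging.
      .sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
    return rows.slice(offset, limit === undefined ? undefined : offset + limit);
  }
}
```

Sorting before slicing is the important part: paging over an unordered scan would let two adapters return different pages for the same arguments.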
Land the first three BOM body modules under `packages/pack/src/`. Each
emits a flat row stream that `generatePack` (a typed stub at
`packages/pack/src/index.ts:23` until T-W2-5) will eventually assemble
into a deterministic 9-item code-pack BOM.
skeleton.ts (item 2/9)
PageRank-ranked Function/Class/Method symbols. Pulls callable nodes
via `IGraphStore.listNodes({ kinds: [...] })` (T-W2-2) and CALLS
edges via raw SQL against the `relations` table (column is `type`,
not `kind`; columns `from_id`/`to_id`). Feeds `EdgeLike[]` into
`buildAdjacency` + `pageRank(adj, 0.85, 50)` from
`@opencodehub/analysis` — fixed iterations + damping per W-M5-3, no
tolerance-based convergence. Map id → score is keyed off
`adj.nodes[i]` (the Float64Array is index-aligned to that array;
never rebuild the index from edges). Output sorted score DESC, id
ASC. Method.owner round-trips; non-Method rows omit it.
file-tree.ts (item 3/9)
File/Folder rows alpha-sorted by `path ASC` and decorated with the
repo's framework set. Precedence:
`frameworksDetected: FrameworkDetection[]` (preferred — structured)
→ legacy `frameworks: string[]` flat list → `[]`. Names are
alpha-sorted + deduped before being stamped onto every row (the
ProjectProfile is a per-repo singleton at v1, so all rows carry the
same labels). Files surface optional language + contentHash;
folders omit them. We deliberately do not walk CONTAINS edges —
paths come from the FileNode/FolderNode `filePath` field.
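The framework-label precedence above might look like the sketch below. The `ProjectProfile` shape, the `name` field on `FrameworkDetection`, and the choice to treat a present-but-empty `frameworksDetected` as authoritative are all assumptions.

```typescript
// Hypothetical shapes; only the two field names come from the commit message.
interface FrameworkDetection { name: string }
interface ProjectProfile {
  frameworksDetected?: readonly FrameworkDetection[];
  frameworks?: readonly string[];
}

// Precedence: structured detections, then the legacy flat list, then [].
function frameworkLabels(profile: ProjectProfile | undefined): string[] {
  const detected = profile?.frameworksDetected;
  const names: readonly string[] =
    detected !== undefined ? detected.map((d) => d.name) : (profile?.frameworks ?? []);
  // Dedupe, then alpha-sort so every row gets the same label sequence.
  return [...new Set(names)].sort();
}
```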
deps.ts (item 4/9)
Dependency rows mapped to a flat DepRow shape mirroring the MCP
`dependencies` tool, but WITHOUT importing `@opencodehub/mcp`
(mcp depends on pack via `pack_codebase` — that would create a
workspace cycle). Sort key:
`(ecosystem ASC, name ASC, version ASC, id ASC)`. The id-tiebreak
catches polyrepos where the same package is pinned at the same
version across multiple lockfiles. Missing license / version are
preserved as `undefined` — the BOM stores raw graph state and
leaves the "UNKNOWN" coercion to render-time consumers.
Determinism contract — non-negotiable for all three modules
- `Array.prototype.sort` over a plain JS comparator; never trust
Map iteration order for output sequencing.
- score / version / etc. ties resolve via `id ASC` (lex-stable
last resort).
- PageRank itself is deterministic by construction.
- Two consecutive calls return byte-identical canonicalJson.
Each module ships a determinism test that asserts both
`deepEqual` and `canonicalJson(a) === canonicalJson(b)` over two
consecutive invocations on the same in-memory mock store.
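As a sketch of that contract applied to the skeleton item: rows are built from a score map, then ordered by an explicit comparator (score DESC, id ASC) so Map insertion order never leaks into the output. The row shape is an assumption, and `JSON.stringify` stands in for `canonicalJson` in the usage note.

```typescript
interface SkeletonRow { id: string; score: number }

// score DESC, then id ASC as the lex-stable last resort.
function compareSkeletonRows(a: SkeletonRow, b: SkeletonRow): number {
  if (a.score !== b.score) return a.score > b.score ? -1 : 1;
  return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
}

function buildSkeleton(scores: ReadonlyMap<string, number>): SkeletonRow[] {
  // Map iteration order is insertion order, so it must not decide the output.
  const rows = Array.from(scores, ([id, score]): SkeletonRow => ({ id, score }));
  return rows.sort(compareSkeletonRows);
}
```

A determinism test in the style described would call `buildSkeleton` twice on the same store and compare the serialized results with `===`; feeding the same scores in a different insertion order should also serialize identically.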
Why three sibling modules instead of one bundled builder
Each BOM item has a distinct shape, distinct sort keys, and a
distinct origin kind on the graph. Bundling them behind a generic
`buildBom(opts: { kind })` interface would force the variants
through a sum-type seam that the manifest writer (T-W2-5) and the
future `code_skeleton` MCP surface don't want — they consume each
output as a strictly-typed table, not an `unknown[]`. Three
small modules with parallel structure are simpler than one
abstraction that needs to fit all nine BOM shapes (still to come: xrefs,
ast-chunks, embeddings-sidecar, findings, licenses).
Tests (21 new, baseline 18, total 39)
Each module ships node:test cases against a thin
`as unknown as IGraphStore` mock that implements only the methods
the module reaches (listNodes + query for skeleton; listNodes for
file-tree and deps). The mock pattern matches
`packages/cli/src/commands/context.test.ts:118` and avoids the
duckdb native-binding fragility in the worktree shell.
Verification
- pnpm -C packages/pack exec tsc --noEmit → exit 0
- pnpm exec biome check packages/pack/ → exit 0
- pnpm -C packages/pack test → 39/39 pass
- bash scripts/check-banned-strings.sh → PASS
ast-chunker.ts wraps @chonkiejs/core CodeChunker via dynamic import; degrades
to a line-split fallback when the loader rejects, when CodeChunker.create
throws (per-file path), or when a file lacks a `language` (per-file → strict
result preserved). CRLF→LF normalize before chunking (W-M5-4). pinsHint surfaces
chonkie's package.json `version` for the manifest pins object. Worktree
native-binding lesson — onnxruntime-node may not rebuild cleanly — drove the
mock-first test seam (`_loadChonkie`).
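The degrade-to-line-split behaviour can be sketched without touching the real chunker. Here the loader result is injected as `ChunkFn | null` (null standing for a rejected dynamic import); the `Chunk` shape, `chunkFile`, and the 40-line default are assumptions, and no `@chonkiejs/core` API is called.

```typescript
interface Chunk { text: string; startLine: number; endLine: number }
// Hypothetical seam: whatever the loader produced, or null if it rejected.
type ChunkFn = (source: string, language: string) => Chunk[];

function normalizeNewlines(source: string): string {
  return source.replace(/\r\n/g, "\n"); // CRLF -> LF before any chunking
}

function lineSplit(source: string, linesPerChunk: number): Chunk[] {
  const lines = source.split("\n");
  const chunks: Chunk[] = [];
  for (let i = 0; i < lines.length; i += linesPerChunk) {
    const slice = lines.slice(i, i + linesPerChunk);
    chunks.push({ text: slice.join("\n"), startLine: i + 1, endLine: i + slice.length });
  }
  return chunks;
}

function chunkFile(
  source: string,
  language: string | undefined,
  chunker: ChunkFn | null,
  linesPerChunk = 40,
): { chunks: Chunk[]; usedFallback: boolean } {
  const text = normalizeNewlines(source);
  if (chunker !== null && language !== undefined) {
    try {
      return { chunks: chunker(text, language), usedFallback: false };
    } catch {
      // Per-file failure: fall through to the line-split path.
    }
  }
  return { chunks: lineSplit(text, linesPerChunk), usedFallback: true };
}
```

Injecting the chunker keeps tests free of native bindings, which is the same motivation given for the `_loadChonkie` seam.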
xrefs.ts emits Community rows (alpha by id) followed by CALLS rows
(`from, to, id` ASC) from a single `WHERE type = 'CALLS' ORDER BY id ASC`
scan of the relations table. Confidence is surfaced raw but never used as a
sort key — float comparison would inject non-determinism on near-equal values.
findings.ts groups by SARIF `level` enum + ruleId. NULL/unknown severity
coerces to "none". Suppressed rows are skipped via rehydration of
`suppressed_json` → `{suppressions: [...]}` → `sarif.isSuppressed()`,
mirroring the helper at `packages/analysis/src/verdict.ts:614-626`. Groups
sort by SEVERITY_RANK then ruleId ASC; examples sort by nodeId ASC and cap
at `examplesPerGroup` (default 3).
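A minimal sketch of the grouping and ordering rules for findings follows. The numeric values in `SEVERITY_RANK`, the `Finding` shape, the pre-computed `suppressed` flag, and the `count` field are assumptions made for illustration.

```typescript
type Level = "error" | "warning" | "note" | "none";
// Assumed ranking: lower sorts first.
const SEVERITY_RANK: Record<Level, number> = { error: 0, warning: 1, note: 2, none: 3 };

interface Finding { ruleId: string; level: string | null; nodeId: string; suppressed: boolean }
interface FindingGroup { level: Level; ruleId: string; count: number; examples: string[] }

const cmp = (a: string, b: string): number => (a < b ? -1 : a > b ? 1 : 0);

// NULL or unrecognised severities coerce to "none".
function toLevel(raw: string | null): Level {
  return raw === "error" || raw === "warning" || raw === "note" ? raw : "none";
}

function groupFindings(findings: readonly Finding[], examplesPerGroup = 3): FindingGroup[] {
  const groups = new Map<string, FindingGroup>();
  for (const f of findings) {
    if (f.suppressed) continue; // suppressed rows never reach the pack
    const level = toLevel(f.level);
    const key = `${level}\u0000${f.ruleId}`;
    let group = groups.get(key);
    if (group === undefined) {
      group = { level, ruleId: f.ruleId, count: 0, examples: [] };
      groups.set(key, group);
    }
    group.count += 1;
    group.examples.push(f.nodeId);
  }
  const out = Array.from(groups.values());
  for (const g of out) {
    // Sort all examples first, then cap, so the kept subset is stable.
    g.examples = g.examples.sort(cmp).slice(0, examplesPerGroup);
  }
  return out.sort(
    (a, b) => SEVERITY_RANK[a.level] - SEVERITY_RANK[b.level] || cmp(a.ruleId, b.ruleId),
  );
}
```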
licenses.ts uses `classifyDependencies` from `@opencodehub/analysis` (lifted
in T-W2-3). Aggregates LICENSES.md (tier counts header + per-package
sections in `(ecosystem, name, version, id)` ASC) and concatenates any
`NOTICE` / `NOTICE.md` / `NOTICES` files found at the repo root.
readme.ts renders a pure-function README with the determinism contract
(strict | best_effort | degraded) and BOM file index. Snapshot-stable.
generatePack assembles all 8 BOM files (skeleton, file-tree, deps, ast-chunks,
xrefs, findings, licenses, readme) plus manifest.json. Manifest is written
LAST so a partial run leaves an obviously-incomplete pack. NO Parquet sidecar —
T-W3-1 owns that. determinism_class: degraded > best_effort (anthropic:
tokenizer) > strict. pins.duckdbVersion read from `@duckdb/node-api`'s
package.json at runtime.
Tests: pack package goes from 39 → 90 (+51). End-to-end test asserts
byte-identical files across two runs on the same fixture using sha256
per-file. Workspace total: 1848 tests, 0 failures.
DuckDB COPY (SELECT node_id, granularity, chunk_index, vector FROM embeddings ORDER BY node_id, granularity, chunk_index) TO 'embeddings.parquet' (FORMAT PARQUET, COMPRESSION ZSTD). Pins duckdbVersion in manifest.pins from the runtime SELECT version() reported by the binding that wrote the file — that string is what the parquet created_by metadata embeds, so the manifest pin stays bound to the engine that produced the sidecar. Sidecar absent when embeddings table empty (S-M5-3) — no file on disk and manifest.files[] does not list a path. The sidecar is structurally duck-typed (IGraphStore is not widened): stores without exportEmbeddingsParquet (mocks, GraphDbStore, future LanceDB) cleanly resolve to absent. Path is interpolated into COPY because DuckDB does not bind COPY destinations; isSafeAbsolutePath() rejects anything outside a strict POSIX-absolute character class so injection is structurally impossible. Two-run byte-identity test on a 100-row × 384-dim Float32 fixture confirms determinism via Buffer.compare === 0 against a real DuckDbStore. Pack tests 90 → 96; full repo tests 1848 → 1854; all gates green.
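Because the COPY destination has to be interpolated, the path guard carries the whole safety argument. The sketch below shows the idea; the exact character class and the extra `..` rejection are assumptions, not the shipped `isSafeAbsolutePath`, and `buildEmbeddingsCopySql` is a hypothetical helper name.

```typescript
// Assumed allow-list: POSIX-absolute, only letters, digits, '.', '_', '-', '/'.
function isSafeAbsolutePath(p: string): boolean {
  if (!/^\/[A-Za-z0-9._\/-]+$/.test(p)) return false;
  // Extra assumption: refuse parent-directory segments as well.
  return !p.split("/").includes("..");
}

function buildEmbeddingsCopySql(outPath: string): string {
  if (!isSafeAbsolutePath(outPath)) {
    throw new Error(`unsafe sidecar path: ${outPath}`);
  }
  // The ORDER BY pins row order so the Parquet bytes are reproducible.
  return (
    "COPY (SELECT node_id, granularity, chunk_index, vector FROM embeddings " +
    "ORDER BY node_id, granularity, chunk_index) " +
    `TO '${outPath}' (FORMAT PARQUET, COMPRESSION ZSTD)`
  );
}
```

Since the allow-list excludes quotes, whitespace, and semicolons, no accepted path can close the string literal in the statement.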
…hub/pack AC-M5-7.
CLI: new `codehub code-pack [path] [--budget N] [--tokenizer ID] [--out-dir DIR] [--engine pack|repomix]`. Default engine is `pack` and writes the 9-item BOM (manifest + skeleton + file-tree + deps + ast-chunks + xrefs + findings + licenses + readme + optional embeddings.parquet) to `<repo>/.codehub/packs/<packHash>/`. Output is staged in `os.tmpdir()` first and renamed into the canonical hash-suffixed path once `generatePack` returns its manifest, so the directory name encodes pack identity. The repomix path delegates to the existing `runPack` shell-out for npx repomix and returns a `bomItemCount: 1` envelope.
MCP: pack_codebase routes through @opencodehub/pack on engine=pack (default); legacy repomix path retained under engine=repomix opt-in (drop deferred to M7 per spec 005 Q-DELTA-6). The repomix response carries `_meta.engine: "repomix"` so callers can detect the legacy path and `next_steps[]` flags the pending deprecation.
Test seams: both runCodePack and runPackCodebase accept injected stubs (`_generatePack`, `_store`, `_runRepomix`, `PackCodebaseDeps`) so unit tests exercise engine routing without loading native DuckDB bindings or shelling out. 16 new tests cover defaults, dispatch, the .codehub/packs/<hash>/ path layout, embeddings sidecar inclusion, custom out-dir, and the no-index error envelope.
repomix is bandwidth output, not a tree-sitter chunker (.erpaval/solutions/architecture-patterns/repomix-is-output-side.md): the @opencodehub/pack engine fully supersedes it for code intelligence; repomix stays available for raw repo packing through M6 and is removed in M7.
New skill at plugins/opencodehub/skills/codehub-code-pack/ surfaces `codehub code-pack` to Claude Code agents. Single-repo + group mode, allowed-tools list, 9-item BOM contract documented inline, determinism class triage (strict/best_effort/degraded), pack_hash verification recipe. references/determinism-contract.md captures spec 005 §M5 invariants for future auditors. Cross-linked from opencodehub-guide skills table.
Adds end-to-end packages/pack/src/pack-determinism.test.ts that runs generatePack twice and asserts every output file is byte-identical (packHash equality + Buffer.compare per file). Adds scripts/pack-determinism-audit.sh that exercises the same invariant through the codehub CLI; integrated into scripts/acceptance.sh. SKIP guards keep both gates honest when DuckStore native bindings are absent.
Rebase onto main brought in PR #70's transitive-CVE overrides (fast-xml-builder@1.1.7, fast-uri@3.1.2, hono@4.12.16, ip-address@10.1.1). Regenerating the lockfile pulls those in alongside the M5/M6 pack deps. No source changes — build + typecheck + tests + banned-strings all green locally before push.
1a733f2 to
82d4d42
CodeQL flagged a potential filesystem race on packages/pack/src/embeddings-sidecar.ts:134 — stat(outPath) and readFile(outPath) ran concurrently in Promise.all, so size and content could come from different versions of the file. Derive bytesWritten from the same buffer used for hashing: a single readFile, then bytes.byteLength. No stat needed.
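The shape of that fix, sketched with hypothetical helper names: both the byte count and the digest come from one buffer, so they cannot describe different versions of the file.

```typescript
import { createHash } from "node:crypto";
import { readFile } from "node:fs/promises";

// Size and hash are derived from the same in-memory bytes.
function describeBytes(bytes: Uint8Array): { bytesWritten: number; sha256: string } {
  return {
    bytesWritten: bytes.byteLength,
    sha256: createHash("sha256").update(bytes).digest("hex"),
  };
}

// One read, no separate stat() call that could race with a writer.
async function describeSidecar(outPath: string): Promise<{ bytesWritten: number; sha256: string }> {
  return describeBytes(await readFile(outPath));
}
```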
theagenticguy
added a commit
that referenced
this pull request
May 10, 2026
## Summary
- **M5 (Deterministic code-packs)** — ships `@opencodehub/pack`, the
`codehub code-pack` CLI subcommand, the `pack_codebase` MCP tool routed
through pack by default, and the `codehub-code-pack` skill. Output is a
9-item BOM (manifest + skeleton + file-tree + deps + ast-chunks + xrefs
+ optional embeddings.parquet + findings + licenses+readme)
byte-identical given `(commit, tokenizer, budget, chonkie_version,
duckdb_version)`. Locked into CI by
`packages/pack/src/pack-determinism.test.ts` (5 variants) +
`scripts/pack-determinism-audit.sh` (acceptance gate 16).
- **M6 (Cross-repo federation)** — first-class `RepoNode` (9 attrs) in
the graph; structured `AMBIGUOUS_REPO` with `choices[]`/`total_matches`
+ `repo_uri` alias; `group_cross_repo_links` MCP tool + cross-repo links
in `codehub-document --group`; AGENTS.md/CLAUDE.md cross-refs to ADR
0012 + worked retry example; ADR 0012 (393 lines) captures the rationale
+ graphHash invariant W-M6-1.
- 18 commits ahead of `main`, 1950/1951 tests passing (1 pre-existing
skip), `mise run check` green, banned-strings green, AGENTS↔CLAUDE
byte-identical sync verified.
Spec: `.erpaval/specs/005-m5-m6/spec.md` (12 ACs delivered, 4 spec
drifts resolved inline).
## What landed
### M5 — Wave 1+2+3
| AC | Commit | What |
|---|---|---|
| AC-M5-0 | `c0890fa` (pre) | `pack` added to commitlint scope-enum |
| AC-M5-1 | `1775500` (pre) | `@opencodehub/pack` workspace scaffold |
| AC-M5-2 | `4e5d6f8` (pre) | Lift PageRank from scip-ingest → analysis |
| AC-M5-3 | `bc5fd99` (pre) | BOM manifest + packHash helper (RFC 8785 canonical JSON) |
| Drift 1 | `77f37c3` | Switch chonkie dep → `@chonkiejs/core@^0.0.9` (npm `chonkie-ts` is a squatter) |
| AC-M5-3a | `018c253` | `IGraphStore.listNodes(opts?: {kinds, limit, offset})` on DuckStore + GraphDbStore |
| Drift 4 | `9d8d570` | Lift `classifyDependencies` mcp → analysis (cycle-break) |
| AC-M5-4 | `072a062` | BOM 2-4: `skeleton.ts` (PageRank-ranked symbols), `file-tree.ts` (framework-labelled), `deps.ts` |
| AC-M5-5 | `0c17be1` | BOM 5-9 + `generatePack` assembly: `ast-chunker.ts` (chonkie + line-split fallback), `xrefs.ts`, `findings.ts` (SARIF level enum + suppressions), `licenses.ts`, `readme.ts` |
| AC-M5-6 | `5c118ac` | Parquet embeddings sidecar via DuckDB COPY+ZSTD (S-M5-3 absent-when-empty) |
| AC-M5-7 | `d1aa08d` | `codehub code-pack` CLI + `pack_codebase` MCP routes through `@opencodehub/pack` (engine=pack default; engine=repomix opt-in, removal deferred to M7) |
| AC-M5-8 | `1f51300` | Byte-identity determinism test suite + audit script + `acceptance.sh` gate 16 |
| AC-M5-9 | `e043016` | `codehub-code-pack` skill + `references/determinism-contract.md` + `opencodehub-guide` cross-link |
### M6 — Wave 1+2+3
| AC | Commit | What |
|---|---|---|
| AC-M6-1 | `9ee6a96` (pre) | `RepoNode` first-class in graph (9 attrs; appended to NodeKind union to preserve graphHash) |
| AC-M6-2 | `26e507b` (pre) | Structured `AMBIGUOUS_REPO` with `choices[]` + `total_matches` + `repo_uri` alias |
| AC-M6-3 (reframed) | `86e295b` (pre) | `group_cross_repo_links` MCP tool + v2 docmeta cross-reference spec |
| AC-M6-4 | `f9fdde2` (pre) | `group_*` tools emit `repo_uri` additively |
| AC-M6-5 | `4d8c5a9` | ADR 0012 (393 lines, mirrors 0011) + AGENTS.md/CLAUDE.md cross-refs + worked AMBIGUOUS_REPO retry example + synthetic 2-repo fixture for `codehub-contract-map` quickcheck |
## Spec drifts resolved inline
1. **chonkie package mismatch** — wave-1 wired `chonkie@^0.3.0`
(chonkie-inc-owned but undocumented). Canonical TS port is
`@chonkiejs/core@^0.0.9` per the chonkie-inc/chonkiejs README. Spec 005
amended in the swap commit.
2. **`IGraphStore.listNodes()` did not exist** — spec called for it;
implemented as a sub-AC on DuckStore + GraphDbStore. Cleaner long-term
API than scattering raw `store.query` SQL across `packages/pack/`.
3. **AGENTS.md `choices[]` already shipped** — reframed AC-M6-5 to add
cross-references to ADR 0012, RepoNode, `group_cross_repo_links` +
worked retry example.
4. **`classifyDependencies` cycle** — `pack` cannot import from `mcp`
(mcp consumes pack via `pack_codebase`). Lifted the pure helper into
`@opencodehub/analysis` as a 30-LOC prep commit.
## Roadmap status post-merge
```
M1 ✅ → M2 ✅ → (M3 ✅ ∥ M4 ✅) → (M5 ✅ ∥ M6 ✅) → M7
```
M7 (LadybugDB default + drop `sql` for `cypher`-only) is the only
remaining v1.0 milestone.
## Test plan
- [x] `pnpm install --frozen-lockfile` clean
- [x] `pnpm -r build` clean
- [x] `mise run check` exits 0 (lint + typecheck + test +
banned-strings)
- [x] 1950/1951 tests pass (1 pre-existing embedder skip)
- [x] `bash scripts/check-banned-strings.sh` PASS
- [x] `bash scripts/pack-determinism-audit.sh` runs (PASS or SKIP both
acceptable)
- [x] AGENTS.md ↔ CLAUDE.md AMBIGUOUS_REPO byte-identical
- [ ] `codehub code-pack <repo>` produces a 9-item BOM directory at
`<repo>/.codehub/packs/<packHash>/` (requires DuckStore on a real repo —
verify post-merge)
- [ ] Two consecutive `codehub code-pack` runs with same args produce
byte-identical output (E-M5-3)
- [ ] `pack_codebase` MCP tool `engine=pack` (default) route exercised
end-to-end via Claude Code
🤖 Generated with [Claude Code](https://claude.com/claude-code)