24 changes: 6 additions & 18 deletions .github/workflows/ci.yml
@@ -36,7 +36,11 @@ jobs:
- run: pnpm --filter '!@opencodehub/docs' -r exec tsc --noEmit

test:
# Node 22 = native-opt-in path (OCH_NATIVE_PARSER=1); Node 24 = WASM default
# Parsing is WASM-only on every supported Node version (ADR 0015), so the
# test suite needs no native grammar build — `--ignore-scripts` is the
# single install path across the matrix. The remaining native deps
# (@duckdb/node-api, @ladybugdb/core, onnxruntime-node) ship prebuilds, so
# storage/embedder tests pass without running postinstall.
strategy:
fail-fast: false
matrix:
@@ -48,24 +52,8 @@ jobs:
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
- uses: jdx/mise-action@1648a7812b9aeae629881980618f079932869151 # v4
- name: Ensure node-gyp is available for native tree-sitter build
if: matrix.node-version == 22
# Pin node-gyp version (Scorecard Pinned-Dependencies / npmCommand)
run: npm i -g node-gyp@12.3.0
# Node 22: let native tree-sitter grammars postinstall (scripts enabled)
# so the OCH_NATIVE_PARSER=1 test path has working N-API bindings.
# Node 24: skip postinstall — native grammars can't build against the
# Node 24 V8 ABI yet (tree-sitter/node-tree-sitter#276). WASM default
# doesn't need the N-API addons on disk.
- name: Install deps (Node 22, with postinstall)
if: matrix.node-version == 22
run: pnpm install --frozen-lockfile
- name: Install deps (Node 24, ignore-scripts)
if: matrix.node-version == 24
run: pnpm install --frozen-lockfile --ignore-scripts
- run: pnpm install --frozen-lockfile --ignore-scripts
- run: pnpm --filter '!@opencodehub/docs' -r test
env:
OCH_NATIVE_PARSER: ${{ matrix.node-version == 22 && '1' || '' }}

sarif-validate:
runs-on: ubuntu-latest
32 changes: 19 additions & 13 deletions AGENTS.md
@@ -82,16 +82,22 @@ This repo ships a Claude Code plugin at `plugins/opencodehub/` — it
provides a `code-analyst` subagent and 10 skills. Install via
`codehub init` (writes `.mcp.json` + links the plugin).

## Storage backend — graph-default

`CODEHUB_STORE` is unset by default. OpenCodeHub probes
`@ladybugdb/core` and uses the graph-database backend when the binding
is available; otherwise it falls back to DuckDB with a one-shot stderr
advisory (gated on TTY or `OCH_VERBOSE=1`). Set `CODEHUB_STORE=duck` to
force the legacy layout (single DuckDB file backs both graph + temporal
views) or `CODEHUB_STORE=lbug` to require the graph-database backend.

When both `graph.duckdb` and `graph.lbug` exist as siblings in the same
`<repo>/.codehub/`, the newer-mtime file wins. See ADR 0013
(`docs/adr/0013-m7-default-flip-and-abstraction.md`) for the rationale
and the AGE/Memgraph/Neo4j/Neptune community-adapter escape hatch.
## Storage backend — lbug graph + DuckDB temporal

The graph tier is always `@ladybugdb/core` (`graph.lbug`); the temporal
tier — cochanges, structured symbol summaries, and the
`codehub query --sql` escape hatch — is always DuckDB
(`temporal.duckdb`). Both files live under `<repo>/.codehub/`. There is
no env-var, no probe, no fallback; if the lbug binding fails to load,
`open()` throws `GraphDbBindingError` and the operation aborts. See
ADR 0016 (`docs/adr/0016-duckdb-graph-rip.md`) for the rationale and the
AGE/Memgraph/Neo4j/Neptune community-adapter contract that survives the
rip-out (the segregated `IGraphStore` / `ITemporalStore` interfaces stay
exactly because community-fork adapters are a deliberate escape hatch).

`IGraphStore` is implemented only by `GraphDbStore`; `DuckDbStore`
implements `ITemporalStore` only. Embeddings live in `graph.lbug` and
stream into a per-call DuckDB temp table at pack time so the
byte-identical Parquet sidecar still works (see
`packages/pack/src/embeddings-sidecar.ts`). A future temporal swap (e.g.
SQLite-WASM) needs only a new `ITemporalStore` implementor, with no
graph-tier change.
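
The swap claim above can be illustrated with a minimal sketch. The method names (`addEdge`, `neighbors`, `recordCochange`, `cochanges`), the in-memory classes, and `composeStore` are hypothetical stand-ins, not the real signatures in `packages/storage`; only the two interface names come from this document.

```typescript
// Hypothetical shapes: the real IGraphStore / ITemporalStore members are not shown in this diff.
interface Cochange {
  fileA: string;
  fileB: string;
  count: number;
}

interface IGraphStore {
  addEdge(from: string, to: string): void;
  neighbors(id: string): string[];
}

interface ITemporalStore {
  recordCochange(fileA: string, fileB: string): void;
  cochanges(file: string): Cochange[];
}

// Stand-in for the graph tier; it knows nothing about temporal data.
class InMemoryGraphStore implements IGraphStore {
  private readonly adjacency = new Map<string, Set<string>>();

  addEdge(from: string, to: string): void {
    const targets = this.adjacency.get(from) ?? new Set<string>();
    targets.add(to);
    this.adjacency.set(from, targets);
  }

  neighbors(id: string): string[] {
    return Array.from(this.adjacency.get(id) ?? []).sort();
  }
}

// Stand-in for a replacement temporal tier; it knows nothing about the graph.
class InMemoryTemporalStore implements ITemporalStore {
  private readonly counts = new Map<string, Cochange>();

  recordCochange(fileA: string, fileB: string): void {
    // Normalise the pair so (a, b) and (b, a) share one counter.
    const a = fileA < fileB ? fileA : fileB;
    const b = fileA < fileB ? fileB : fileA;
    const key = `${a}\u0000${b}`;
    const entry = this.counts.get(key) ?? { fileA: a, fileB: b, count: 0 };
    entry.count += 1;
    this.counts.set(key, entry);
  }

  cochanges(file: string): Cochange[] {
    return Array.from(this.counts.values()).filter(
      (c) => c.fileA === file || c.fileB === file,
    );
  }
}

interface ComposedStore {
  graph: IGraphStore;
  temporal: ITemporalStore;
}

// Hypothetical composition helper: either half can be replaced independently.
function composeStore(graph: IGraphStore, temporal: ITemporalStore): ComposedStore {
  return { graph, temporal };
}
```

Because callers depend only on the two interfaces, replacing the temporal implementor in this sketch leaves the graph half untouched.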
48 changes: 23 additions & 25 deletions README.md
@@ -78,7 +78,7 @@ flowchart LR
| **Local-first, offline-capable** | `codehub analyze --offline` opens zero sockets. Your code never leaves your machine. No telemetry. |
| **Deterministic indexing** | Identical inputs produce a byte-identical graph hash. Reproducible. Auditable. Cacheable in CI. |
| **MCP-native** | Works out-of-the-box with Claude Code, Cursor, Codex, Windsurf, OpenCode. The MCP server is the primary interface; CLI exists for scripts and CI. |
| **Embedded storage, graph-default** | `@ladybugdb/core` graph engine for the structural store (default at v1) with DuckDB + `hnsw_acorn` (filter-aware HNSW via ACORN-1 + RaBitQ) + `fts` (BM25) for the temporal + retrieval views. Embedded files. No daemon. No database to operate. `CODEHUB_STORE=duck` reverts to the legacy single-file layout. |
| **Embedded storage, two-tier** | `@ladybugdb/core` holds the structural store: symbols, edges, embeddings, BM25 + HNSW. A dedicated DuckDB sibling holds the temporal views: cochanges and summaries. Embedded files. No daemon. No database to operate. Both tiers are always present, with no backend knob (ADR 0016). |
| **15 languages at GA** | TypeScript, JavaScript, Python, Go, Rust, Java, C#, C, C++, Ruby, Kotlin, Swift, PHP, Dart, COBOL — tree-sitter for the first 14 plus a regex provider for fixed-format COBOL. |
| **WASM-only parse runtime** | `web-tree-sitter` WASM is the only parse runtime, on Node 20, 22, and 24. The 15 grammar `.wasm` blobs are vendored at `packages/ingestion/vendor/wasms/`. There is no native opt-in — `npm install -g @opencodehub/cli@latest` does zero native builds and zero GitHub fetches. |

@@ -165,7 +165,7 @@ The monorepo is organised as 17 workspace packages under `packages/`:
| `scanners` | Subprocess wrappers for 20 scanners — OSV, Semgrep, hadolint, tflint, detect-secrets, and the rest |
| `scip-ingest` | SCIP indexer runners (TS, Python, Go, Rust, Java) — emits CALLS, REFERENCES, IMPLEMENTS, TYPE_OF |
| `search` | Hybrid BM25 + HNSW (ACORN-1 + RaBitQ) query layer |
| `storage` | `IGraphStore` / `ITemporalStore` adapters — `@ladybugdb/core` (default) and DuckDB; deterministic `graphHash` |
| `storage` | `IGraphStore` (`@ladybugdb/core`) + `ITemporalStore` (DuckDB) adapters; deterministic `graphHash` |
| `summarizer` | Process + cluster summaries for MCP responses |
| `wiki` | LLM-narrated module pages emitted by `codehub wiki --llm` |

@@ -204,29 +204,27 @@ switching mid-project requires `codehub analyze --rebuild-embeddings`.
`--offline` refuses SageMaker and HTTP backends, so offline mode is
compatible only with the local ONNX path.

## Storage backend — graph-default

Starting with v1.0, OpenCodeHub picks the graph-database backend
(`@ladybugdb/core`) as the default whenever the binding is importable on
the current platform. DuckDB is retained as the temporal store
(cochanges + symbol summaries) and as the legacy graph fallback. The
`CODEHUB_STORE` environment variable controls selection:

| `CODEHUB_STORE` | Behaviour |
|---|---|
| *unset* (default) | Probe `@ladybugdb/core`. Available → graph artifact at `<repo>/.codehub/graph.lbug` + temporal sibling `temporal.duckdb`. Missing → fall back to `<repo>/.codehub/graph.duckdb` (one-shot stderr advisory under TTY / `OCH_VERBOSE=1`). |
| `duck` | Force the legacy DuckDB-only layout. One file backs both the graph and temporal views. |
| `lbug` | Force the graph-database layout. Surface a `GraphDbBindingError` at open time if the binding is unavailable. |

Two-artifact transition: when both `graph.duckdb` AND `graph.lbug` are
present in the same `<repo>/.codehub/`, the newer-mtime file wins and a
one-shot advisory fires. Remove the stale artifact to silence the
advisory.

See [`docs/adr/0011-graph-db-backend.md`](./docs/adr/0011-graph-db-backend.md)
for the M3 phase-1 rationale and
[`docs/adr/0013-m7-default-flip-and-abstraction.md`](./docs/adr/0013-m7-default-flip-and-abstraction.md)
for the M7 default-flip + interface segregation.
## Storage backend — lbug graph + DuckDB temporal

The graph tier is always `@ladybugdb/core` (`<repo>/.codehub/graph.lbug`);
the temporal tier — cochanges, structured symbol summaries, and the
`codehub query --sql` escape hatch — is always DuckDB
(`<repo>/.codehub/temporal.duckdb`). Both files are written on every
`analyze`. There is no `CODEHUB_STORE` env var, no backend probe, no
single-file `graph.duckdb` layout, and no mtime arbitration; if the lbug
binding fails to load, `open()` throws `GraphDbBindingError` and the
operation aborts.

`IGraphStore` lives only on `GraphDbStore`; `DuckDbStore` implements
`ITemporalStore` only. The segregated interfaces stay because they are
the v1.0 contract for community-fork adapters (AGE / Memgraph / Neo4j /
Neptune target `IGraphStore`; DuckDB owns `ITemporalStore`). Embeddings
live in `graph.lbug` and stream into a per-call DuckDB temp table at
pack time so the byte-identical Parquet sidecar still works.

See [`docs/adr/0016-duckdb-graph-rip.md`](./docs/adr/0016-duckdb-graph-rip.md)
for the rationale behind ripping out the DuckDB graph backend; it
supersedes ADR 0013 and the DuckDB-as-graph passages of ADR 0011.
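
The fail-fast behaviour described in this section can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the `GraphDbBindingError` class here is a local stand-in for the real one, `openGraph` and the `BindingLoader` type are hypothetical, and the sketch is synchronous for brevity even though the real `open()` may well be asynchronous.

```typescript
// Local stand-in for the error type named in the docs; the real class may differ.
class GraphDbBindingError extends Error {
  readonly reason: unknown;

  constructor(message: string, reason: unknown) {
    super(message);
    this.name = "GraphDbBindingError";
    this.reason = reason;
    // Keeps `instanceof` working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Hypothetical loader type: returns the loaded binding or throws.
type BindingLoader = () => unknown;

// Hypothetical open helper: one attempt, no probe, no alternative backend.
function openGraph(loadBinding: BindingLoader): unknown {
  try {
    return loadBinding();
  } catch (reason) {
    // There is deliberately no fallback path here; the caller must abort.
    throw new GraphDbBindingError("graph binding failed to load", reason);
  }
}
```

The design point is that the error surfaces at open time with the original failure attached, instead of silently switching storage layouts.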

## Parse runtime — WASM-only, vendored grammars

2 changes: 1 addition & 1 deletion mise.toml
@@ -3,7 +3,7 @@ node = "22"
pnpm = "11.1.0"
python = "3.12"
uv = "latest"
"npm:node-gyp" = "latest" # required to build tree-sitter native bindings during `pnpm install`
"npm:node-gyp" = "latest" # fallback native build for @duckdb/node-api / onnxruntime-node when a platform prebuild is missing (parsing is WASM-only — ADR 0015)
"aqua:betterleaks/betterleaks" = "1.2.0" # secret scanner — used by analyze + pre-release gate

[env]
7 changes: 4 additions & 3 deletions packages/cli/README.md
@@ -76,10 +76,11 @@ top-level subcommands by phase of the workflow.
- **Registry on disk** — `~/.codehub/registry.json` enumerates indexed
repos; per-repo state lives under `<repo>/.codehub/`
(`packages/cli/src/registry.ts`).
- **Env-toggle defaults** — `CODEHUB_STORE`, `CODEHUB_BEDROCK_DISABLED`
- **Env-toggle defaults** — env vars such as `CODEHUB_BEDROCK_DISABLED`
flip behaviour without touching flags.
- **`mcp` is launched, never embedded** — agents that need the MCP
surface spawn `codehub mcp` over stdio (`packages/cli/src/commands/mcp.ts`).

See ADR 0013 for the storage-backend toggle and the root README's
"MCP tool surface" section for the agent-facing tool inventory.
See ADR 0016 for the lbug-graph + DuckDB-temporal storage layout and the
root README's "MCP tool surface" section for the agent-facing tool
inventory.
27 changes: 15 additions & 12 deletions packages/docs/src/content/docs/agents/install.mdx
@@ -14,7 +14,8 @@ right config and links the plugin.

## 1. Prerequisites

- **Node** 22 (with native tree-sitter) or 24 (WASM default).
- **Node** 20, 22, or 24. Parsing is `web-tree-sitter` (WASM) on every
supported version — no native parser, no build step.
- **pnpm** 10 or newer.
- **git**.
- Optional: [`mise`](https://mise.jdx.dev) — recommended for the
@@ -71,24 +72,26 @@ medium monorepo, 5–10 minutes the first time on a large repo (after
which incremental analyzes are sub-second per file). Output is written
to `.codehub/` next to the project root.

The default storage backend is the graph database backend
(`graph.lbug` + `temporal.duckdb`). DuckDB is the legacy fallback.
Set `CODEHUB_STORE=duck` to force the legacy single-file layout, or
`CODEHUB_STORE=lbug` to require the graph backend. See
[ADR 0013](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0013-m7-default-flip-and-abstraction.md).
Storage is split across two always-present files under `.codehub/`: the
graph tier is LadybugDB (`graph.lbug`) and the temporal tier is DuckDB
(`temporal.duckdb`). There is no selection knob and no fallback — if the
LadybugDB binding fails to load, the operation aborts with a
`GraphDbBindingError`. See
[ADR 0016](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0016-duckdb-graph-rip.md).

## 6. Verify the install

```bash title="health check"
codehub doctor
```

`codehub doctor` prints the active toolchain: tree-sitter native
binding, DuckDB native binding, optional graph-database binding, and
which embedding backend is in effect (SageMaker → HTTP → local ONNX,
in that precedence). All three native bindings should report OK on
Node 22; on Node 24 the WASM parser is the default and the native
tree-sitter binding may be missing — that is expected.
`codehub doctor` prints the active toolchain: the `web-tree-sitter`
(WASM) parse runtime, the DuckDB native binding (`@duckdb/node-api`),
the LadybugDB graph binding (`@ladybugdb/core`), and which embedding
backend is in effect (SageMaker → HTTP → local ONNX, in that
precedence). Parsing is WASM-only on every supported Node version, so
there is no native parser binding to probe; the DuckDB and LadybugDB
bindings ship prebuilds and should report OK on Node 20, 22, and 24.

## 7. Wire your editor

10 changes: 5 additions & 5 deletions packages/docs/src/content/docs/architecture/determinism.md
@@ -45,11 +45,11 @@ Anything outside that list — wall-clock time, process ID, file-system
inode ordering — must not influence the hash. The ingestion phases
are pure: inputs in, relations out, no ambient state.

The `graphHash` invariant is **backend-independent**. A repo indexed
into LadybugDB (`graph.lbug`) and the same repo indexed into the
single-file DuckDB layout (`graph.duckdb`) at the same commit produce
the same hash. A parity gate in CI compares the two hashes on every
PR that touches the storage layer.
The `graphHash` invariant covers everything the graph store
(`graph.lbug`) owns; the temporal signals in the DuckDB sibling
(`temporal.duckdb`) are statistical and never enter the hash. A parity
gate in CI asserts the invariant on every PR that touches the storage
layer.
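
A minimal sketch of how such an invariant can hold, assuming a simplified node and edge shape: sort the graph contents into a canonical order, then hash the serialised records. The `GraphNode` / `GraphEdge` types and `sketchGraphHash` are hypothetical; the real `graphHash` algorithm is not shown in this diff and may differ. Temporal data is simply never passed in, which mirrors the claim that it never enters the hash.

```typescript
import { createHash } from "node:crypto";

// Hypothetical, simplified graph records.
interface GraphNode {
  id: string;
  kind: string;
}

interface GraphEdge {
  from: string;
  to: string;
  type: string;
}

// Locale-independent comparison so ordering is the same on every machine.
function compareStrings(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

function sketchGraphHash(
  nodes: readonly GraphNode[],
  edges: readonly GraphEdge[],
): string {
  // Copy before sorting so the caller's arrays are not mutated.
  const sortedNodes = [...nodes].sort((a, b) => compareStrings(a.id, b.id));
  const sortedEdges = [...edges].sort(
    (a, b) =>
      compareStrings(a.from, b.from) ||
      compareStrings(a.type, b.type) ||
      compareStrings(a.to, b.to),
  );

  const hash = createHash("sha256");
  for (const n of sortedNodes) {
    // Fixed field order inside each record keeps the byte stream stable.
    hash.update(JSON.stringify(["node", n.id, n.kind]) + "\n");
  }
  for (const e of sortedEdges) {
    hash.update(JSON.stringify(["edge", e.from, e.type, e.to]) + "\n");
  }
  return hash.digest("hex");
}
```

In this sketch, insertion order, wall-clock time, and anything held in the temporal sibling cannot affect the result, because none of them reach the hashed byte stream.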

## How we test it

20 changes: 11 additions & 9 deletions packages/docs/src/content/docs/architecture/monorepo-map.md
@@ -28,7 +28,7 @@ binary; every other package is a library imported by `cli`, `mcp`,
| `@opencodehub/scanners` | `packages/scanners` | Nineteen scanner wrappers (semgrep, osv-scanner, bandit, ruff, grype, vulture, pip-audit, npm-audit, biome, betterleaks, trivy, checkov, hadolint, tflint, spectral, radon, ty, clamav, och self-scan). |
| `@opencodehub/scip-ingest` | `packages/scip-ingest` | `.scip` protobuf reader + per-language indexer runners (TypeScript, Python, Go, Rust, Java, .NET, clang, Kotlin, Ruby). |
| `@opencodehub/search` | `packages/search` | Hybrid BM25 + RRF search. |
| `@opencodehub/storage` | `packages/storage` | The `IGraphStore` / `ITemporalStore` interface segregation, the LadybugDB and DuckDB adapters, the resolver that picks between them. |
| `@opencodehub/storage` | `packages/storage` | The `IGraphStore` / `ITemporalStore` interface segregation, the LadybugDB graph adapter and DuckDB temporal adapter, and `openStore()` that composes them. |
| `@opencodehub/summarizer` | `packages/summarizer` | Structured per-symbol summarizer (Haiku 4.5 via Bedrock Converse + Zod 4). |
| `@opencodehub/wiki` | `packages/wiki` | Markdown wiki renderer (architecture, api-surface, dependency-map, ownership-map, risk-atlas) over the graph. |
| `@opencodehub/docs` | `packages/docs` | This Starlight documentation site. |
@@ -56,19 +56,21 @@ TypeScript project-references graph enforces this via `tsc --noEmit`.

`@opencodehub/storage` exposes two narrow interfaces — `IGraphStore`
(graph workload: nodes, edges, embeddings, multi-hop traversal) and
`ITemporalStore` (temporal workload: cochanges, summary cache). Two
adapters implement them:
`ITemporalStore` (temporal workload: cochanges, summary cache). The
single shipping pair implements them:

- **LadybugDB graph store + DuckDB temporal store** — the default. Two
- **LadybugDB graph store + DuckDB temporal store** — always. Two
artifacts on disk (`graph.lbug` + `temporal.duckdb`), backed by a
Cypher-emitting dialect for the graph half and DuckDB SQL for the
temporal half.
- **Single DuckDB file** — the opt-in fallback. One artifact
(`graph.duckdb`) backs both interfaces.
temporal half. `IGraphStore` lives only on `GraphDbStore`;
`DuckDbStore` implements `ITemporalStore` only; `openStore()`
composes them. There is no backend selector and no fallback (ADR
0016) — a missing LadybugDB binding throws `GraphDbBindingError`.

See [Storage backend](/opencodehub/architecture/storage-backend/) for
the resolver, the dual-artifact precedence rule, and the
community-adapter escape hatch (AGE / Memgraph / Neo4j / Neptune).
how `openStore()` composes the pair and the community-adapter escape
hatch (AGE / Memgraph / Neo4j / Neptune via the segregated
interfaces).

## Related files
