Merged
22 changes: 13 additions & 9 deletions .github/workflows/release.yml
@@ -324,15 +324,19 @@ jobs:
# ---------------------------------------------------------------------------
# 5. npm publish — OIDC trusted publishing with provenance.
#
# Gated by the `OCH_NPM_PUBLISH_ENABLED` repo variable. Each
# @opencodehub/* package on npmjs.com has the trusted publisher
# relationship configured against this repo + workflow filename,
# so no NPM_TOKEN is required — the `id-token: write` permission
# drives both OIDC auth to npm AND the Sigstore provenance
# attestation that ties each published tarball back to this
# workflow run + commit SHA. pnpm 10.21+ / 11.x supports OIDC on
# direct `pnpm publish` (the changeset-publish regression in
# pnpm/pnpm#11566 does not apply here).
# Only `@opencodehub/cli` is published; every other workspace
# package is `private: true` and its source is bundled into the CLI
# at build time (PR #189), so `pnpm -r publish` skips them. The CLI
# has an npm trusted-publisher relationship configured against this
# repo + workflow filename, so no NPM_TOKEN is required — the
# `id-token: write` permission drives both OIDC auth to npm AND the
# Sigstore provenance attestation that ties the published tarball
# back to this workflow run + commit SHA. This path is live: run
# #176 published 0.6.0 with provenance. pnpm 10.21+ / 11.x supports
# OIDC on direct `pnpm publish` (the changeset-publish regression in
# pnpm/pnpm#11566 does not apply here). The publish job is gated
# `if: vars.OCH_NPM_PUBLISH_ENABLED == 'true'` — that repo variable
# is the on/off switch (set to `true` today).
# ---------------------------------------------------------------------------
npm-publish:
name: npm publish (OIDC + provenance)
19 changes: 0 additions & 19 deletions CHANGELOG.md
@@ -289,25 +289,6 @@

* **release:** keep 0.x semver — breaking changes bump minor, feats bump patch ([a6ee4bf](https://github.com/theagenticguy/opencodehub/commit/a6ee4bf1081dd9a0623694aadae1e6f72cf60254))

## [Unreleased]

### Fixed

- **cli:** `scan` ingests SARIF into the scanned repo, not CWD.
- **cli:** `doctor` resolves native bindings from owner workspaces.
- **smoke-mcp:** asserts 29 tools, matching the v1.0 server surface.

### Docs

- **repo:** README v1.0 status, 29-tool surface, parse-runtime section,
and accurate 17-package list (drops `eval` / `gym`, adds
`cobol-proleap`, `frameworks`, `pack`, `policy`, `wiki`).
- **adr:** cross-link the two concurrently-numbered ADR 0013 files,
flip 0011 + 0013-m7 status to Accepted, and scrub session-local
spec coordinates from ADR text.
- **repo:** sync `CHANGELOG`, `USECASE`, `AGENTS`, and `OBJECTIVES`
with v1 reality (tool count, language count, package set).

## [0.1.1](https://github.com/theagenticguy/opencodehub/compare/root-v0.1.0...root-v0.1.1) (2026-04-22)


69 changes: 60 additions & 9 deletions README.md
@@ -80,7 +80,29 @@ flowchart LR
| **MCP-native** | Works out-of-the-box with Claude Code, Cursor, Codex, Windsurf, OpenCode. The MCP server is the primary interface; CLI exists for scripts and CI. |
| **Embedded storage, two-tier** | `@ladybugdb/core` holds the structural store: symbols, edges, embeddings, BM25 + HNSW. A dedicated DuckDB sibling holds the temporal views: cochanges and summaries. Embedded files. No daemon. No database to operate. Both tiers are always present, with no backend knob (ADR 0016). |
| **15 languages at GA** | TypeScript, JavaScript, Python, Go, Rust, Java, C#, C, C++, Ruby, Kotlin, Swift, PHP, Dart, COBOL — tree-sitter for the first 14 plus a regex provider for fixed-format COBOL. |
| **WASM-only parse runtime** | `web-tree-sitter` WASM is the only parse runtime, on Node 20, 22, and 24. The 15 grammar `.wasm` blobs are vendored at `packages/ingestion/vendor/wasms/`. There is no native opt-in — `npm install -g @opencodehub/cli@latest` does zero native builds and zero GitHub fetches. |
| **WASM-only parse runtime** | `web-tree-sitter` WASM is the only parse runtime, and there is no native parser opt-in. The 15 grammar `.wasm` blobs are vendored at `packages/ingestion/vendor/wasms/`, so installing the parser triggers **zero grammar/native builds and zero GitHub fetches**. Storage and embeddings still load prebuilt native bindings (see Platform support). |

## Platform support

Parsing is WASM and runs anywhere Node does. The storage and embedding
tiers, however, depend on **prebuilt native bindings** — `@ladybugdb/core`
(graph store), `@duckdb/node-api` (temporal store), and `onnxruntime-node`
(local embeddings) — so OpenCodeHub runs on the platforms those bindings
ship a prebuild for:

| Platform | Supported |
|---|---|
| `darwin-arm64`, `darwin-x64` | ✅ prebuilt |
| `linux-x64`, `linux-arm64` (glibc) | ✅ prebuilt |
| `win32-x64` | ✅ prebuilt |
| `win32-arm64` | ❌ no prebuild — `codehub analyze` fails at store open |
| Alpine / musl, 32-bit Linux ARM | ❌ no prebuild — needs a source build of `@ladybugdb/core` |

On an unsupported platform the `@ladybugdb/core` (lbug) binding fails to
load and `open()` throws `GraphDbBindingError`. There is no DuckDB-graph
fallback (see [ADR 0016](./docs/adr/0016-duckdb-graph-rip.md)). The five-target prebuilt
matrix mirrors `@ladybugdb/core`'s release artifacts; track its upstream
for musl / `win32-arm64` coverage.
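
The support table can be turned into a quick preflight check. The sketch below is illustrative only: `och_platform_status` and its two arguments are hypothetical names, not an OpenCodeHub command, and the target list is copied from the table above. Detecting glibc versus musl is left to the caller.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify a "<platform>-<arch>" target against the
# prebuilt matrix in the table above. Not an OpenCodeHub command.
och_platform_status() {
  local target="$1"        # e.g. "linux-x64" (Node's platform + "-" + arch)
  local libc="${2:-glibc}" # "glibc" or "musl"; only consulted on Linux
  case "$target" in
    darwin-arm64|darwin-x64|win32-x64)
      echo "supported" ;;
    linux-x64|linux-arm64)
      # Linux prebuilds are glibc-only per the table.
      if [ "$libc" = "musl" ]; then
        echo "unsupported"
      else
        echo "supported"
      fi ;;
    *)
      # win32-arm64, 32-bit Linux ARM, and anything else not listed.
      echo "unsupported" ;;
  esac
}

# Usage: ask Node for the running target, then classify it.
# och_platform_status "$(node -p 'process.platform + "-" + process.arch')"
```

Running this before `codehub analyze` would surface an unsupported target earlier than the `GraphDbBindingError` at store open.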

## Quick start

@@ -229,10 +251,11 @@ supersedes ADR 0013 and the DuckDB-as-graph passages of ADR 0011.
## Parse runtime — WASM-only, vendored grammars

`@opencodehub/ingestion` runs `web-tree-sitter` (WASM) as the only parse
runtime on Node 20, 22, and 24. There is no native opt-in: the native
`tree-sitter` N-API addon and all 14 `tree-sitter-<lang>` npm packages
are gone from the install graph. `npm install -g @opencodehub/cli@latest`
does zero native builds and zero GitHub fetches.
runtime on the supported Node range (22 and 24). There is no native opt-in:
the native `tree-sitter` N-API addon and all 14 `tree-sitter-<lang>` npm
packages are gone from the install graph, so parsing pulls **zero native
builds and zero GitHub fetches** at install time. (Storage and embeddings
load prebuilt native bindings — see Platform support.)

All 15 grammar `.wasm` blobs are vendored at
`packages/ingestion/vendor/wasms/`, built from the grammar sources
@@ -253,14 +276,40 @@ superseded.
`IGraphStore` / `ITemporalStore` interface segregation), B (19-scanner
fleet incl. betterleaks), C (debt sweep — embedder fingerprint, SCIP
REFERENCES + TYPE_OF), and D (dogfood polish) have all merged. The
current shipped tag remains `0.1.1`; `1.0.0` is cut once schema +
tool-surface stability is signed off.
published package is `@opencodehub/cli` (currently `0.7.0`; the monorepo
root tracks `0.8.0`); `1.0.0` is cut once schema + tool-surface stability
is signed off.

While on `0.x`, **any release may contain breaking changes** to the
graph schema, MCP tool shapes, CLI flags, or storage layout. Breaking
changes are called out with `!` or a `BREAKING CHANGE:` footer in the
commit log and summarised in each release's generated CHANGELOG.
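
To spot those markers before upgrading, a commit message can be checked mechanically. This is a minimal sketch, assuming the Conventional Commits shape (`type(scope)!: subject`, or a footer line starting with `BREAKING CHANGE:`); `is_breaking_commit` is a hypothetical helper, not part of the repo's tooling.

```bash
#!/usr/bin/env bash
# Hypothetical helper: exit 0 when a commit message is marked breaking,
# either by "!" right before the subject's colon or by a
# "BREAKING CHANGE:" footer line. Exit 1 otherwise.
is_breaking_commit() {
  local msg="$1"
  local subject="${msg%%$'\n'*}"   # first line only
  # type, optional "(scope)", then "!" immediately before ":"
  local re='^[a-zA-Z]+(\([^)]*\))?!:'
  if [[ "$subject" =~ $re ]]; then
    return 0
  fi
  # footer form: any line that begins with "BREAKING CHANGE:"
  if grep -q '^BREAKING CHANGE:' <<<"$msg"; then
    return 0
  fi
  return 1
}

# Usage: check the full message of one commit.
# is_breaking_commit "$(git log -1 --format=%B <sha>)" && echo "breaking"
```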

## Troubleshooting

### `codehub analyze` runs out of memory on a large repo

The in-memory graph (`KnowledgeGraph`) holds the full node and edge set in
two JavaScript `Map`s for the duration of `analyze`, and `bulkLoad`
materializes transient copies before persistence — there is no spill to
disk during the build. A real index is already in the 96k-node /
291k-edge range; a monorepo roughly 10x that size can exhaust Node's
default heap and exit with an out-of-memory error (`FATAL ERROR:
Reached heap limit` / `JavaScript heap out of memory`), sometimes without
a clear message.

Raise Node's old-space ceiling for the run via `NODE_OPTIONS` (nothing
is set by default):

```bash
# 8 GB heap — bump higher for very large monorepos
NODE_OPTIONS=--max-old-space-size=8192 codehub analyze
```

Pick a value comfortably below your machine's free RAM. If you still hit
the ceiling, analyze a subtree at a time rather than the whole monorepo
in one pass.
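
One way to pick that value is to derive it from the free memory you observe. The helper below is a sketch under stated assumptions: `suggest_heap_mb` is a hypothetical name, and the 75% fraction is an illustrative choice, not a project recommendation.

```bash
#!/usr/bin/env bash
# Hypothetical helper: suggest a --max-old-space-size value (in MB) that
# leaves headroom below the free RAM passed in (also in MB).
# The 3/4 fraction is an assumption chosen for illustration.
suggest_heap_mb() {
  local free_mb="$1"
  echo $(( free_mb * 3 / 4 ))
}

# Example: with roughly 16 GB free, allow a 12 GB heap for this run only.
# NODE_OPTIONS="--max-old-space-size=$(suggest_heap_mb 16384)" codehub analyze

# To see the heap limit Node currently applies (in MB):
# node -p "require('v8').getHeapStatistics().heap_size_limit / 1048576"
```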

## Supply-chain posture

- **CycloneDX SBOM** at [`SBOM.cdx.json`](./SBOM.cdx.json) (regenerated on every release)
@@ -274,8 +323,10 @@ Architecture decision records live in [`docs/adr/`](./docs/adr/) — the
durable record of design tradeoffs (storage backend, SCIP adoption,
hierarchical embeddings, CI toolchain pins, etc.).

A standalone user-guide + MCP reference site is being bootstrapped in a
dedicated repo; this README will link it once published.
The user guide + MCP reference is published at
**<https://theagenticguy.github.io/opencodehub>** — an Astro Starlight
site whose source lives in-repo at [`packages/docs/`](./packages/docs/)
and deploys to GitHub Pages on every push to `main`.

## Contributing

25 changes: 15 additions & 10 deletions docs/RELEASE.md
@@ -32,7 +32,7 @@ Three workflows split the work:
| Workflow | Trigger | Purpose |
| ------------------------------------- | ------------------------------- | --------------------------------------------------------------------- |
| `.github/workflows/release-please.yml`| `push: main` | Open / update the release PR; on merge, cut the tag and call release.yml. |
| `.github/workflows/pre-release-gate.yml` | `pull_request: main` | Add release-time-only checks (npm audit, lockfile integrity, detect-secrets, license re-assert). Aggregator job is the required check on release branches. |
| `.github/workflows/pre-release-gate.yml` | `pull_request: main` | Add release-time-only checks (npm audit, lockfile integrity, betterleaks secret sweep, license re-assert). Aggregator job is the required check on release branches. |
| `.github/workflows/release.yml` | `release: published` + `workflow_call` + `workflow_dispatch` | Build, SBOM, code-pack, cosign sign, SLSA L3 provenance, attach to release. |

The existing CI surface (`ci.yml`, `codeql.yml`, `semgrep.yml`, `osv.yml`,
@@ -175,14 +175,19 @@ If the gate is broken and you must cut a release out-of-band:
The pipeline runs without any long-lived secrets except `GITHUB_TOKEN`
(which GitHub injects automatically). Specifically:

- **No npm token** — `npm-publish` is gated by the
`OCH_NPM_PUBLISH_ENABLED` repo variable (default unset = disabled)
until the packages flip to public. When that change lands, set
`OCH_NPM_PUBLISH_ENABLED=true` in
`Settings -> Secrets and variables -> Actions -> Variables`, then
configure the npmjs.org OIDC trust relationship at
`https://www.npmjs.com/settings/<scope>/access` so `npm publish
--provenance` works without a static `NPM_TOKEN`.
- **No npm token** — npm publishing is **live** via OIDC trusted
publishing (run #176 published `0.6.0` with provenance). Only
`@opencodehub/cli` is published; every other workspace package is
`private: true` and bundled into the CLI at build time (PR #189), so
`pnpm -r publish` skips them. The CLI's trusted-publisher relationship
is configured at `https://www.npmjs.com/settings/opencodehub/access`
against this repo + `release.yml`, so the `id-token: write` permission
drives both OIDC auth to npm and the Sigstore provenance attestation —
no static `NPM_TOKEN`. The publish job is gated by `if:
vars.OCH_NPM_PUBLISH_ENABLED == 'true'`, so that repo variable is the
on/off switch under `Settings -> Secrets and variables -> Actions ->
Variables`. It is set to `true` today; unset it (or set it to any
non-`true` value) to skip the publish job.
- **No cosign keys** — keyless signing uses the workflow's OIDC token
against Fulcio. The certificate's SAN binds the signature to the
workflow file path + ref, which is what `cosign verify-blob` checks.
@@ -235,7 +240,7 @@ branch, it adds:
| ---------------------- | ------------------------------------------------------------------------------------------------ |
| `npm-audit` | `pnpm audit --audit-level=high --prod` finds no high-or-critical vulns in production deps. |
| `lockfile-integrity` | `pnpm install --frozen-lockfile --ignore-scripts` succeeds — no lockfile drift, no postinstalls. |
| `detect-secrets` | Full sweep against `.secrets.baseline`; any new finding fails the gate. |
| `betterleaks` | `betterleaks dir` full sweep with the vendored `packages/scanners/config/betterleaks.default.toml`; any finding fails the gate (ADR 0017 replaced detect-secrets). |
| `licenses-reassert` | `license-checker-rseidelsohn` allowlist (Apache-2.0, MIT, BSD-2/3-Clause, ISC, CC0-1.0, BlueOak-1.0.0, 0BSD). |
| `pre-release-gate` | Aggregator. Fails if any of the above failed; passes (no-op) on non-release PRs. |
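
The aggregator's behaviour (run every check, fail if any failed) can be approximated locally. This is a sketch, not the workflow's actual implementation: `run_gate` is a hypothetical helper that takes one shell command line per argument. Only the two `pnpm` commands quoted in the table are shown in the usage comment.

```bash
#!/usr/bin/env bash
# Hypothetical local approximation of the aggregator: run every check,
# print one result line per check, and return non-zero if any failed.
run_gate() {
  local failed=0 check
  for check in "$@"; do
    # Each argument is a full command line; its output is discarded.
    if bash -c "$check" >/dev/null 2>&1; then
      echo "pass: $check"
    else
      echo "FAIL: $check"
      failed=1
    fi
  done
  return "$failed"
}

# Usage with the two pnpm checks from the table:
# run_gate \
#   "pnpm audit --audit-level=high --prod" \
#   "pnpm install --frozen-lockfile --ignore-scripts"
```

Unlike a fail-fast script, this keeps going after the first failure, so one run reports every broken check.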

68 changes: 39 additions & 29 deletions docs/adr/0012-repo-as-first-class-node.md
@@ -107,11 +107,12 @@ The phased plan, sequenced by milestone:
AMBIGUOUS_REPO `_meta.choices[]` payload, the `group_*` tools'
additive `repo_uri` fields, and the cross-repo link records all
source `repo_uri` from the new node.
- **M7**: drop the legacy `repo` registry-name argument across all
per-repo and group MCP tools (T-M7-6); the `repo_uri` form becomes
the only accepted input. New edge kinds (`Repo HAS_FILE File`,
`Repo HAS_DEPENDENCY Dependency`) get added then — see §Edge kinds
deferred below.
- **M7** (planned at authoring time; **not pursued** — see §Edge kinds
deferred below): drop the legacy `repo` registry-name argument across
all per-repo and group MCP tools (T-M7-6) and add `Repo`-rooted edge
kinds (T-M7-7). Neither task shipped. The clean-slate v1 release keeps
the legacy `repo` argument as an accepted alias alongside `repo_uri`,
and `Repo` remains an edge-less singleton node.

## Schema choice — append-only `NodeKind` union

@@ -226,27 +227,35 @@ without a `RepoNode`. Three rules govern the migration:
`CLAUDE.md`) works regardless of whether the graph has the node
yet.

## Edge kinds deferred

`Repo` ships in M6 **without new edge kinds**. The full graph schema
would have `Repo HAS_FILE File`, `Repo HAS_DEPENDENCY Dependency`,
`Repo OWNED_BY Contributor`, `Repo IN_GROUP Community` (or similar),
but those edges add complexity that does not pay off until M7's
default-flip work for the LadybugDB backend. The M6 scope is the node
itself plus the wire-format updates to AMBIGUOUS_REPO, the
`group_*` tools, and the cross-repo link records. M7 (T-M7-6 and
T-M7-7) extends the schema with the four edge kinds above, gated by
its own parity gate and ADR.

The reason for the deferral is the v1.0 invariant at the heart of ADR
0011: every new edge kind is a new physical rel table on the
LadybugDB backend (rel-table-per-kind shape, ADR 0011 §Schema
choice), so each new kind costs one DDL update plus one parity-test
fixture. Bundling those four kinds into M7 — alongside the
default-backend flip — keeps the parity surface small and the merge
risk low. Adding them in M6 would split the rel-table-per-kind
churn across two milestones and risk a graphHash drift if the
W-M6-1 fixture coverage missed an interaction.
## Edge kinds deferred → not pursued (won't-do for v1)

`Repo` ships **without new edge kinds**, and that stayed true for v1.
At authoring time this section sketched four `Repo`-rooted edges —
`Repo HAS_FILE File`, `Repo HAS_DEPENDENCY Dependency`,
`Repo OWNED_BY Contributor`, `Repo IN_GROUP Community` (or similar) —
to land in M7 under tasks T-M7-6 / T-M7-7. **None of them shipped.**

> **Resolution (v1 clean-slate, 2026-06): won't-do.** The four
> `Repo`-rooted edge kinds were never added. The v1 release does not
> carry the M7 edge-schema extension; `RelationType` /
> `RELATION_TYPES` in `packages/core-types/src/edges.ts` has **25**
> members (`CONTAINS` … `TYPE_OF`), none of them `Repo`-rooted, and
> `Repo` remains an edge-less singleton. `OWNED_BY` does exist in that
> enum, but it is a **blame-level** edge from a symbol/file to a
> `Contributor` (its `confidence` carries the normalized blame-line
> share, per `CodeRelation`'s doc comment) — it is **not** the
> `Repo OWNED_BY Contributor` repo-level edge sketched above. The
> federation surface (AMBIGUOUS_REPO, the `group_*` tools, cross-repo
> links) reads `repo_uri` straight off the `RepoNode` and from the
> persisted ContractRegistry, so no `Repo`-rooted edge was needed to
> ship it.

The original deferral rationale (left for the record): every new edge
kind is a new physical rel table on the LadybugDB backend
(rel-table-per-kind shape, ADR 0011 §Schema choice), so each new kind
costs one DDL update plus one parity-test fixture. The cost never paid
off — the v1 surface ships without these edges, and any future
`Repo`-rooted edge work would land under its own ADR.

## Risks

@@ -300,9 +309,10 @@ W-M6-1 fixture coverage missed an interaction.
flips to **Accepted** in the same commit that ships AC-M6-5 (this
ADR plus the AGENTS.md / CLAUDE.md cross-references plus the
synthetic 2-repo quickcheck) — see §References below.
- **Superseded**: not before M7. M7 adds a follow-up ADR (scope: drop
legacy `repo` argument, add `Repo`-rooted edge kinds, final
parity audit across the testbed corpus).
- **Superseded**: no. The planned M7 follow-up (drop the legacy `repo`
argument, add `Repo`-rooted edge kinds) was **not pursued** — see
§Edge kinds deferred → not pursued. The `RepoNode` shape this ADR
introduced stands as-is in v1.

## References

16 changes: 10 additions & 6 deletions docs/adr/0013-m7-default-flip-and-abstraction.md
@@ -32,12 +32,16 @@ later, four facts forced the M7 architectural shift.
limit identified in ADR 0011 §Context (one polymorphic `relations`
table, `WHERE type = ?` evaluated after the join, no per-kind
columnar pushdown) holds across every workload we measured in M4 –
M6. The 24-edge-kind cardinality is now 28 with M5/M6 additions
(`HAS_FILE`, `HAS_DEPENDENCY`, `IN_GROUP`, `OWNED_BY` repo-level
edges). DuckDB is the right engine for time-series / cochange
queries — its column-store strengths land squarely in the temporal
domain — but the graph workload is a different shape and benefits
from a graph-native engine.
M6. The edge-kind cardinality is **25** (`RelationType` /
`RELATION_TYPES` in `packages/core-types/src/edges.ts`, `CONTAINS`
… `TYPE_OF`) — the M5/M6 addition over the earlier 24 was `OWNED_BY`
(a blame-level symbol→`Contributor` edge), not the four `Repo`-rooted
edges (`HAS_FILE`, `HAS_DEPENDENCY`, `IN_GROUP`,
`Repo OWNED_BY Contributor`) that ADR 0012 §Edge kinds deferred
sketched: those never shipped. DuckDB is the right engine for
time-series / cochange queries — its column-store strengths land
squarely in the temporal domain — but the graph workload is a
different shape and benefits from a graph-native engine.
2. **The `IGraphStore` interface had grown two non-graph
responsibilities.** By the end of M6 it carried `cochanges` and
`symbol-summaries` queries — both temporal, neither graph. Every