2chain v2 (Self-Hosted) — Product Requirements Document

One-Line Description

A neutral, self-hostable tool registry for AI agents that picks the right specialist tool from many candidates, validates every response on the wire, and quarantines tools that break — runnable on a laptop or behind a corporate firewall, with no managed-cloud dependencies.

Problem Statement

Today's AI agents are paired with thousands of MCP tools but have no way to tell good from bad, on-topic from generic, or trustworthy from broken at runtime. Existing registries (the official MCP directory, Smithery, LangChain Hub) are dumb lists with keyword search and no quality signal. Heavy AI users spend hours hunting for the right tool; non-technical users give up and blame the model. The 2chain v1 prototype solved this on MongoDB Atlas + Voyage AI, but those dependencies make it impossible to run inside companies with strict data-residency / no-cloud rules and impractical for individual developers who don't want to pay for managed services.

Target Users

Personal tier (single-user, laptop-grade):

Developers using Claude Code, Cursor, Continue, or any MCP client who want a curated, reliability-graded tool registry running locally
AI champions inside companies who want to test the architecture before pitching IT
Hackathon and demo builders who need a self-contained registry that runs on a free instance
Technical level: comfortable with npm/Docker, not necessarily with cloud or DBA work

Enterprise tier (multi-user, on-prem or VPC):

Internal AI platform teams at regulated companies (banks, healthcare, government) where employee prompts and tool-call data cannot leave the corporate network
Tool authors inside these companies who publish internal tools alongside vetted public ones (e.g. an internal HR data fetcher next to the public arxiv search)
Technical level: ops/platform engineers who can run Postgres, K8s, or Docker Compose

Both tiers share one product. The difference is deployment topology, not feature set.

Core Features (MVP)

Tool registry storage — name, version, capability text, JSON Schema contracts, reliability score, status, metadata. Backed by Postgres.
Hybrid retrieval — vector + lexical full-text search fused with Reciprocal Rank Fusion, server-side, in one query. Vector via pgvector. Lexical via Postgres tsvector + ts_rank_cd. Fusion via a single SQL CTE.
Local embedding model — runs entirely on-device or in-cluster, no API calls. Default: nomic-embed-text (768-dim) via Ollama; pluggable to bge-large-en-v1.5, gte-large, or any HuggingFace model loaded with transformers.js.
Reliability gating — pre-search filter excludes tools below 0.80 reliability. Score is computed from inline eval suite at publish time.
JSON Schema contract enforcement — every tool call's input and output validated by ajv on the wire. Output violations flip the tool to circuit_broken and 503 future calls.
Live registry updates — Postgres LISTEN/NOTIFY drives Server-Sent Events to the dashboard so registry changes propagate without polling.
MCP-native interface — stdio MCP server exposing discover_tools and call_tool for Claude Code, Cursor, Continue, etc.
Two deployment modes — npm install -g 2chain for personal (single binary, embedded SQLite + sqlite-vec fallback if no Postgres available); docker compose up for enterprise (Postgres 16 + pgvector + the 2chain server).
Existing tool fixtures preserved — the 14 hand-crafted tools including sec-edgar-financials and arxiv-paper-search migrate as-is. The 185 generated fixtures port over without changes.
Live dashboard — single HTML page showing registry, eval runs, violations panel, live call feed. SSE-driven, zero polling.
Continuous re-verification → reliability lifecycle — the publish-time eval suite re-runs against already-registered tools on demand (POST /v1/reverify, 2chain reverify) or on an opt-in interval (REVERIFY_INTERVAL_MIN, default off), so a tool that rots after publish is re-scored and gate-dropped by the registry instead of being discovered live by a user's agent. Catalog entries without an eval suite are skipped, never zeroed. Re-scoring is evidence-blended (E2): the materialized reliability score combines a recency-decayed eval history (7-day half-life, weight 0.8) with windowed usage evidence (weight 0.2) in which caller-fault never counts against the tool — only ok calls, output-stage violations, and stub timeouts are evidence. Recovery closes the circuit_broken dead end: a circuit-broken tool whose 3 most recent re-verification runs are clean and span ≥ 60 minutes is restored to active by the unfiltered sweep (the D34 amendment — re-verification may flip circuit_broken → active ONLY, never the reverse and never pending; call.ts remains the only flip TO circuit_broken).
Contract drift detection on push — pushing a new version of an existing tool diffs both JSON Schema contracts against its version-line predecessor, direction-aware (callers send inputs, consumers receive outputs). A breaking change without a major version bump is rejected (breaking_contract_requires_major_bump, full diff in error.details); accepted drift is recorded per direction in a drift_events audit table. The differ is conservative: unmodeled schema constructs that change classify as breaking.
Tool health surface — one read-only aggregate per tool name answering "can I trust this tool right now": per-version status, reliability score, verification streak (consecutive clean re-verification runs), bounded score history, and 7-day usage outcome counts, plus the name's recent contract drift events. Served authenticated at GET /v1/tools/:name/health (every role — callers are exactly who needs the answer), in the CLI as 2chain health <name>, and in the dashboard detail panel via a dashboard-scoped view that ships projected drift fields only (changes_json never leaves the server). Strictly read-only: no writes, no status flips, no score recompute (this surface reads 2chain's own local logs — it is not an evals framework or observability platform, per IS-NOT 6 and 8).
Freshness in agent-facing discovery — /discover re-sorts the RRF top-K by final_score = rrf_score + 0.0005 × freshness, where freshness = 0.5^(age_days(metadata.last_eval_run)/7). The weight is calibrated on the RRF scale to perturb only NEAR-TIED neighbours: a freshly verified tool climbs past a stale near-tie, but cannot overtake a tool 5 or more RRF ranks ahead (rank distance, not raw similarity margin, is the guarantee: RRF compresses any cosine gap between adjacent ranks to ~1.3e-4, so a near-tied rank-1 can be passed however large its raw-similarity lead). Freshness weights, never gates — the 0.80 reliability gate stays in SQL. Every result also carries last_verified_at, verification_streak (consecutive clean re-verification runs, window 20), and freshness, on both the HTTP route and the MCP shim, so agents can weigh staleness exactly where they choose tools. Owned consequence: real catalog imports carry freshness 0 by design until a reverify sweep scores them — unverified means stale, which is the correct product semantics. (IS-NOT check: this ranks on 2chain's own local eval recency — not an evals framework (IS-NOT 6), not an observability platform (IS-NOT 8).)
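The freshness re-sort in the last feature above can be sketched in a few lines of TypeScript. This is a minimal illustration under stated assumptions, not the shipped /discover code: the RankedTool shape and the function names freshness, finalScore, and resortByFreshness are hypothetical. Only the 0.0005 weight, the 7-day half-life, and the rule that an unverified tool carries freshness 0 come from this document.

```typescript
// Hypothetical shape of one RRF top-K result; field names are assumptions.
interface RankedTool {
  name: string;
  rrfScore: number;         // fused score from hybrid retrieval
  lastEvalRun: Date | null; // metadata.last_eval_run; null when never verified
}

const FRESHNESS_WEIGHT = 0.0005;    // from the PRD
const FRESHNESS_HALF_LIFE_DAYS = 7; // from the PRD
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// freshness = 0.5^(age_days / 7); an unverified tool counts as fully stale.
function freshness(lastEvalRun: Date | null, now: Date): number {
  if (lastEvalRun === null) return 0;
  const ageDays = Math.max(0, (now.getTime() - lastEvalRun.getTime()) / MS_PER_DAY);
  return Math.pow(0.5, ageDays / FRESHNESS_HALF_LIFE_DAYS);
}

// final_score = rrf_score + 0.0005 * freshness
function finalScore(tool: RankedTool, now: Date): number {
  return tool.rrfScore + FRESHNESS_WEIGHT * freshness(tool.lastEvalRun, now);
}

// Re-sorts a copy of the top-K; the reliability gate is assumed to have run already in SQL.
function resortByFreshness(topK: RankedTool[], now: Date): RankedTool[] {
  return [...topK].sort((a, b) => finalScore(b, now) - finalScore(a, now));
}
```

Because the weight is small relative to typical RRF scores, the sort only reorders near-tied neighbours; a tool with a clear RRF lead keeps its position even at freshness 0.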
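For readers unfamiliar with Reciprocal Rank Fusion, the hybrid retrieval feature above can be illustrated in memory. The PRD specifies that fusion happens server-side in a single SQL CTE, so the following is only a sketch of the technique, not the query itself. The function names are hypothetical, and the constant k = 60 is a commonly used default assumed here; this document does not state the value 2chain uses.

```typescript
// Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank) per item,
// where rank is 1-based. k = 60 is an assumed, conventional default.
function reciprocalRankFusion(rankings: string[][], k: number = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return scores;
}

// Fuses a vector-search ranking and a lexical ranking into one ordering,
// highest fused score first.
function fuseRankings(vectorRanked: string[], lexicalRanked: string[], k: number = 60): string[] {
  const fused = reciprocalRankFusion([vectorRanked, lexicalRanked], k);
  return Array.from(fused.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

A tool that appears near the top of both lists outranks one that tops only a single list, which is the property that makes the fusion robust to either retriever misfiring on its own.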
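The evidence-blended reliability score in the re-verification feature above leaves its exact arithmetic unstated, so the sketch below fills it in with loudly labeled assumptions. It assumes the eval history is reduced to a decay-weighted mean of per-run pass rates, that usage evidence is the share of ok calls among ok calls, output-stage violations, and stub timeouts, and that a tool with no usage evidence falls back to its eval score alone. All type and function names are hypothetical. The 0.8 and 0.2 weights, the 7-day half-life, the 0.80 gate, the exclusion of caller-fault, and the rule that tools without eval runs are skipped rather than zeroed come from this document.

```typescript
interface EvalRun { passRate: number; ranAt: Date; } // hypothetical shape
interface UsageWindow {                              // hypothetical shape
  ok: number;
  outputViolations: number;
  stubTimeouts: number;
  callerFaults: number; // recorded, but never counted against the tool
}

const EVAL_WEIGHT = 0.8;        // from the PRD
const USAGE_WEIGHT = 0.2;       // from the PRD
const EVAL_HALF_LIFE_DAYS = 7;  // from the PRD
const RELIABILITY_GATE = 0.8;   // from the PRD
const DAY_MS = 24 * 60 * 60 * 1000;

// Assumption: decay-weighted mean of pass rates; newer runs weigh more.
function decayedEvalScore(runs: EvalRun[], now: Date): number | null {
  let weighted = 0;
  let total = 0;
  for (const run of runs) {
    const ageDays = Math.max(0, (now.getTime() - run.ranAt.getTime()) / DAY_MS);
    const weight = Math.pow(0.5, ageDays / EVAL_HALF_LIFE_DAYS);
    weighted += weight * run.passRate;
    total += weight;
  }
  return total === 0 ? null : weighted / total;
}

// Assumption: usage score is ok / (ok + output violations + stub timeouts).
function usageScore(u: UsageWindow): number | null {
  const evidence = u.ok + u.outputViolations + u.stubTimeouts;
  return evidence === 0 ? null : u.ok / evidence;
}

// Returns null when there is no eval evidence: skipped, never zeroed.
function blendedReliability(runs: EvalRun[], usage: UsageWindow, now: Date): number | null {
  const evalPart = decayedEvalScore(runs, now);
  if (evalPart === null) return null;
  const usagePart = usageScore(usage);
  if (usagePart === null) return evalPart; // assumption: no usage evidence yet
  return EVAL_WEIGHT * evalPart + USAGE_WEIGHT * usagePart;
}

function passesGate(score: number | null): boolean {
  return score !== null && score >= RELIABILITY_GATE;
}
```

Under these assumptions a tool with a clean, recent eval run and an 80% clean usage window scores 0.8 × 1 + 0.2 × 0.8 = 0.96 and stays above the gate, no matter how many caller-fault calls it received.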

What This Product IS NOT

Not a model host. 2chain does not run language models. It does not generate embeddings as a service to other apps. It does not chat with users. It only picks tools and enforces contracts.
Not a tool execution sandbox. When a tool calls SEC EDGAR or runs a Python linter, 2chain forwards the request and validates the response. It does not sandbox network egress, it does not isolate filesystems, it does not VM-jail tool authors. Sandboxing is a v0.3 concern.
Not a managed SaaS. We do not host a multi-tenant cloud version. Hosted-2chain might come later as a separate product; v2 ships only as something users self-deploy.
Not a marketplace with payments. Tools register and run for free. Revenue capture, billing, subscription tiering, paid-tool discovery — all explicitly out of scope.
Not tied to MongoDB Atlas, Voyage AI, OpenAI, or any cloud-only API. Everything must run offline, including embeddings. If a deployment can't reach the internet, the registry still works against tools that don't need internet.
Not a replacement for evals frameworks. 2chain runs its own minimal pass/fail evals at publish time, and re-runs the same suites on re-verification sweeps, to populate the reliability score. Sophisticated eval orchestration (LangSmith, Braintrust, OpenAI Evals) is a separate concern; we offer hooks to plug those in but don't reinvent them.
Not a model router. 2chain picks tools, not models. Choosing between Claude vs GPT vs Llama for the agent's planning brain is the agent runtime's job, not ours.
Not a logging/observability platform. We log calls, violations, and reliability changes locally. We don't aim to replace OpenTelemetry, Datadog, or Honeycomb for full agent traces.

Success Metrics

Personal tier:

Time from npm install -g 2chain to first successful discover_tools MCP call: under 2 minutes on a fresh laptop.
Embedding latency for a 199-tool seed on M-series Mac: under 30 seconds.
Discovery latency (warm): under 50ms p95.
100 GitHub stars within 60 days of v2 launch.

Enterprise tier:

Time from docker compose up to first successful tool call: under 10 minutes including pgvector index build.
Discovery latency (warm) with 10,000 tools: under 200ms p95.
One paying / committed enterprise pilot within 90 days of v2 launch.

Migration:

Existing 199-tool fixture set runs identically on v2 with no capability_text changes.
All five demo prompts (DCF, arxiv, PR review, security audit, malformed-bot violation) pass end-to-end on the new stack.

Constraints

No managed cloud dependencies. Every component must run on a single laptop or inside a private VPC. No Atlas, Voyage, OpenAI, Pinecone, Cohere as required dependencies. Optional cloud integrations (e.g. plug-in OpenAI embeddings) are allowed but cannot be the default.
TypeScript-first, Node 24+. Must keep the existing Fastify + ajv + MCP SDK stack to maximize code reuse from v1.
Postgres 16 with pgvector is the enterprise reference DB. SQLite + sqlite-vec is the personal-tier fallback. No third option.
Single-binary CLI distribution for personal tier. npx 2chain or npm i -g 2chain and you're up. No Docker required for individual use.
License: MIT or Apache 2.0. The whole point is enterprise adoption — no AGPL, no SSPL, no source-available licenses that scare procurement.
Solo founder timeline: working v2 in 4 weeks, public release in 6.
Preserve the demo narrative. Same five demo prompts, same on-stage script (with one slide swap to mention "now self-hostable"). Anything that breaks the demo is out of scope.
Match v1 quality bar. No regression on retrieval relevance, no regression on contract enforcement, no regression on dashboard responsiveness. Eval suite from v1 must pass on v2.
