v0ropaev · v0ropaev · Jun 21, 2026 · Jun 21, 2026
diff --git a/.github/workflows/nightly-llm-ab.yml b/.github/workflows/nightly-llm-ab.yml
@@ -0,0 +1,71 @@
+name: Nightly LLM A/B
+
+# Optional, NON-gating answer-quality A/B (knowbase context vs RAG context), judged by an LLM.
+# Runs nightly and on demand. Self-skips when no API key secret is set; never blocks merges (it is a
+# separate workflow from CI, and the judge step is continue-on-error). The deterministic Tier-3 recall
+# gate in CI remains the hard floor.
+
+on:
+  schedule:
+    - cron: "0 6 * * *"  # 06:00 UTC nightly
+  workflow_dispatch:
+
+concurrency:
+  group: nightly-llm-ab
+  cancel-in-progress: true
+
+jobs:
+  ab:
+    name: LLM-judged knowledge-vs-RAG A/B
+    runs-on: ubuntu-latest
+    services:
+      postgres:
+        image: pgvector/pgvector:pg17
+        env:
+          POSTGRES_USER: postgres
+          POSTGRES_PASSWORD: postgres
+          POSTGRES_DB: postgres
+        ports:
+          - 5432:5432
+        options: >-
+          --health-cmd pg_isready
+          --health-interval 10s
+          --health-timeout 5s
+          --health-retries 5
+    env:
+      KB_TEST_DB_URL: postgresql+psycopg://postgres:postgres@127.0.0.1:5432/postgres
+      KB_LLM_PROVIDER: anthropic
+      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+      KB_LLM_AB_METRICS: tier3_llm_ab_metrics.json
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          python-version: "3.12"
+          enable-cache: true
+
+      - name: Install dependencies (incl. llm extra)
+        run: uv sync --extra dev --extra embed --extra llm
+
+      - name: Cache embedding model
+        uses: actions/cache@v4
+        with:
+          path: ~/.cache/huggingface
+          key: hf-all-MiniLM-L6-v2-v1
+
+      - name: Warm up embedding model
+        run: uv run python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
+
+      - name: LLM-judged A/B (tracked, non-gating; self-skips without ANTHROPIC_API_KEY)
+        continue-on-error: true
+        run: uv run pytest src/kb/eval/tier3_llm_judge_test.py -q -s
+
+      - name: Upload A/B metrics
+        uses: actions/upload-artifact@v4
+        with:
+          name: tier3-llm-ab-metrics
+          path: tier3_llm_ab_metrics.json
+          if-no-files-found: ignore
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -26,6 +26,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - **Tier-3 entity questions** (`kb.eval.questions`): the knowledge-vs-RAG A/B now also covers domain
   entities (a two-file `Order`/`LineItem` fixture), asserting knowbase cross-file recall@k == 1.0 for
   entity questions as well as API-contract questions.
+- **Nightly LLM-judged A/B** (`kb.llm`, `kb.eval.tier3_llm_judge_test`, `.github/workflows/nightly-llm-ab.yml`):
+  an optional, key-gated, **non-gating** answer-quality comparison. An answerer LLM answers each question
+  from knowbase-grounded context vs RAG-over-source context; a judge LLM scores accuracy against
+  hand-written `GOLD` references and flags hallucination (claims unsupported by that arm's context).
+  `kb.llm.providers` mirrors the embed-provider pattern (Anthropic default, OpenAI optional, lazy imports
+  via the new `llm` extra); the test self-skips without an API key and asserts only that the A/B ran
+  (never the win); the nightly workflow uploads a metrics artifact. `RagHit` gained a `raw_text` field so
+  the RAG arm can feed chunk text to the answerer.
 
 ## [0.2.0] - 2026-06-02
 

diff --git a/DESIGN.md b/DESIGN.md
@@ -264,10 +264,14 @@ Eval is **co-equal with extraction**, weighted to cheap/exact tiers that gate CI
 - **Tier 2 — golden curated repos (TRACKED, non-gating).** 3–5 SHA-pinned permissive Python
   repos; one **held out** and never used for tuning (the real trust signal). Report per-repo,
   never just the mean.
-- **Tier 3 — downstream vs RAG (TRACKED, after MCP).** Fixed ~30–50 question set; coding agent
+- **Tier 3 — downstream vs RAG (TRACKED, after MCP).** Fixed question set; coding agent
   answers with knowbase-MCP vs a **frozen, peer-reviewed** pgvector-RAG baseline (same
   Postgres, same model). **Pre-register the win threshold.** Metrics: grounded-answer accuracy,
   hallucination rate (claims with no provenance), tokens-to-answer, tool round-trips.
+  *Implemented:* the deterministic cross-file recall gate (`tier3_rag_test`, HARD) plus an optional,
+  key-gated, NON-gating **LLM-judged A/B** (`kb.llm` + `tier3_llm_judge_test`, run nightly): an answerer
+  answers each question from knowbase context vs RAG context, and a judge scores accuracy against
+  hand-written gold + hallucination, comparing to a pre-registered threshold (printed, never asserted).
 
 **Invariants asserted as exact ground truth every run:** every artifact has ≥1 `derived_from`
 row (zero orphans); re-running an extractor on the same span identity+version yields an

diff --git a/README.md b/README.md
@@ -90,7 +90,9 @@ flowchart LR
 - **A frozen RAG-over-source baseline** and the **Tier-3 knowledge-vs-RAG recall gate** — the honest A/B that backs the "knowledge > RAG" thesis.
 - **Eight HARD CI eval gates** (see [Development](#development)).
 
-**Not done yet** (and deliberately not faked): the semantic / **LLM-grounded** extraction layer, the nightly LLM-judged A/B, ADR mining from git history, grounded business-process extraction, incremental re-index on git push, and languages beyond Python. See the [Roadmap](#roadmap).
+- **A nightly LLM-judged A/B** (optional, key-gated, **non-gating**) — an answerer LLM answers each question from knowbase's grounded context vs a RAG-over-source context, and a judge LLM scores **answer accuracy** (against hand-written gold) + **hallucination**. Tracked metrics on top of recall; it never blocks CI.
+
+**Not done yet** (and deliberately not faked): the semantic / **LLM-grounded** extraction layer, ADR mining from git history, grounded business-process extraction, incremental re-index on git push, and languages beyond Python. See the [Roadmap](#roadmap).
 
 ## Quickstart
 
@@ -226,7 +228,7 @@ flowchart LR
 
 Next milestones:
 
-- [ ] **Nightly LLM-judged A/B** (key-gated, non-gating) — grounded-answer accuracy + hallucination rate on top of recall.
+- [x] **Nightly LLM-judged A/B** (key-gated, non-gating) — grounded-answer accuracy + hallucination rate on top of recall. *(shipped)*
 - [ ] **LLM-grounded semantic layer** — model-backed artifacts that still carry ≥ 1 span (`extraction_method = "llm_grounded"`).
 - [ ] **Incremental re-index on git push** — turn the diff-based invalidation seed into live updates.
 - [ ] **ADR mining** from git / PR history.

diff --git a/pyproject.toml b/pyproject.toml
@@ -37,6 +37,13 @@ dev = [
 embed = [
     "sentence-transformers>=3,<6",
 ]
+# LLM backend for the optional, nightly, NON-gating LLM-judged A/B (kb.llm + tier3_llm_judge_test).
+# Not needed for index/serve/embed; the nightly workflow uses `uv sync --extra dev --extra embed
+# --extra llm`. Also makes the existing OpenAI embedding adapter installable.
+llm = [
+    "anthropic>=0.40",
+    "openai>=1.40",
+]
 
 [project.scripts]
 kb = "kb.daemon.cli:app"
@@ -86,7 +93,7 @@ ignore_errors = true
 [[tool.mypy.overrides]]
 module = [
     "pygit2.*", "grimp.*", "sqlparse.*", "fastapi.*", "starlette.*", "fastmcp.*", "mcp.*",
-    "sentence_transformers.*", "pgvector.*", "openai.*",
+    "sentence_transformers.*", "pgvector.*", "openai.*", "anthropic.*",
 ]
 ignore_missing_imports = true
 

diff --git a/src/kb/eval/questions.py b/src/kb/eval/questions.py
@@ -75,3 +75,24 @@ class Question:
     Question("e3", "Which model does the Order entity's items field reference, and where is it?",
              ENTITY_CROSS_FILE, frozenset({"entity:app.domain.order.Order"})),
 ]
+
+# Hand-written reference answers (the gold oracle for the nightly LLM-judged A/B, PR-3b). Derived
+# from the fixtures above (we own them), so accuracy is judged against ground truth — not against
+# either arm's own retrieval (no knowbase-favoring bias). Keep in sync with FILES / ENTITY_FILES.
+GOLD: dict[str, str] = {
+    "q1": "A JSON array of OrderOut objects; each OrderOut has id (int) and total (float).",
+    "q2": "The fields of OrderOut: id (int) and total (float).",
+    "q3": "Accepts an OrderIn request body with a single field item (str); returns an OrderOut "
+          "(id: int, total: float).",
+    "q4": "OrderOut, which has id (int) and total (float).",
+    "q5": "OrderOut (id: int, total: float).",
+    "q6": "201.",
+    "q7": "float.",
+    "q8": "GET /api/orders (also POST and GET /api/orders/{order_id}) returns OrderOut; "
+          "OrderOut is defined in src/app/schemas.py.",
+    "e1": "Order (a dataclass) has id (int) and items (a list of LineItem); each LineItem has "
+          "sku (str) and qty (int, default 1).",
+    "e2": "Order has id (int) and items, where items is list[LineItem]; LineItem has sku (str) and "
+          "qty (int).",
+    "e3": "It references LineItem, defined in src/app/domain/line_item.py.",
+}
diff --git a/src/kb/eval/tier3_llm_judge_test.py b/src/kb/eval/tier3_llm_judge_test.py
@@ -0,0 +1,175 @@
+"""TRACKED (NON-gating) — Tier 3: LLM-judged answer quality, knowbase context vs RAG context (§9).
+
+Beyond Tier-3 *recall* (tier3_rag_test), this measures *answer quality*: an answerer LLM answers
+each question from the knowbase-grounded context and, separately, from the RAG-over-source context.
+A judge LLM scores each answer against a hand-written GOLD reference for **accuracy**, and flags
+**hallucination** (a claim unsupported by that arm's own provided context).
+
+This is **nightly, key-gated, and NON-gating**: it self-skips without an API key, asserts only that
+the A/B actually ran (never the win), and writes a metrics JSON for the CI artifact. The
+deterministic Tier-3 recall gate remains the hard floor. Self-judging bias is accepted here (see
+DESIGN §9); a distinct judge model can be set via KB_LLM_JUDGE_MODEL.
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import re
+
+import pytest
+from sqlalchemy import Engine
+
+from kb.daemon.pipeline import index_commit
+from kb.embed.population import embed_snapshot
+from kb.eval._fixtures import make_git_repo
+from kb.eval.questions import ENTITY_FILES, GOLD, QUESTIONS
+from kb.eval.tier1_api_test import FILES
+from kb.extract.deterministic.entities import EntityExtractor
+from kb.extract.deterministic.fastapi_contract import FastAPIExtractor
+from kb.llm.providers import default_llm_provider, has_llm_key
+from kb.mcp.records import summarize
+from kb.rag.baseline import index_rag_baseline, rag_retrieve
+from kb.store import queries as q
+from kb.store.queries import provenance_for_artifact
+
+pytestmark = pytest.mark.skipif(
+    not has_llm_key(), reason="no LLM API key (set ANTHROPIC_API_KEY or OPENAI_API_KEY)"
+)
+
+K = 5
+WIN_ACCURACY_MARGIN = 0.15  # pre-registered: knowbase wins iff acc margin >= this AND hall <= RAG's
+
+ANSWER_SYSTEM = (
+    "Answer the question using ONLY the provided context about a codebase. Be concise and "
+    "specific. If the context does not contain the answer, say you cannot tell from the context."
+)
+JUDGE_SYSTEM = "You are a strict grader. Respond with ONE JSON object and nothing else."
+
+
+@pytest.fixture(scope="module")
+def prepared(engine: Engine, tmp_path_factory, st_provider) -> tuple[Engine, str]:
+    repo = tmp_path_factory.mktemp("tier3_llm")
+    sha = make_git_repo(repo, [{**FILES, **ENTITY_FILES}])[0]
+    index_commit(
+        engine,
+        str(repo),
+        sha,
+        extractors=[FastAPIExtractor(), EntityExtractor()],
+        first_party_root="src",
+    )
+    embed_snapshot(engine, sha, st_provider)
+    index_rag_baseline(engine, str(repo), sha, st_provider)
+    return engine, sha
+
+
+def _knowbase_context(conn, sha, question, st_provider) -> str:
+    qvec = st_provider.embed([question])[0]
+    blocks = []
+    for row in q.similar_artifacts_by_embedding(conn, sha, qvec, K):
+        prov = provenance_for_artifact(conn, sha, row.logical_key)
+        prov_str = ", ".join(f"{p.file_path}:{p.start_line}" for p in prov)
+        blocks.append(
+            f"[{row.logical_key}] kind={row.kind}\n"
+            f"summary: {summarize(row.kind, row.payload)}\n"
+            f"details: {json.dumps(row.payload, default=str)[:600]}\n"
+            f"provenance: {prov_str}"
+        )
+    return "\n\n".join(blocks) if blocks else "(no knowledge units found)"
+
+
+def _rag_context(conn, sha, question, st_provider) -> str:
+    hits = rag_retrieve(conn, question, st_provider, sha, K)
+    blocks = [f"# {h.file_path}:{h.start_line}-{h.end_line}\n{h.raw_text}" for h in hits]
+    return "\n\n".join(blocks) if blocks else "(no source chunks found)"
+
+
+def _answer(provider, question: str, context: str) -> str:
+    return provider.complete(ANSWER_SYSTEM, f"Context:\n{context}\n\nQuestion: {question}")
+
+
+def _judge(provider, question: str, gold: str, answer: str, context: str) -> dict:
+    prompt = (
+        f"Question: {question}\n"
+        f"Gold answer: {gold}\n"
+        f"Candidate answer: {answer}\n\n"
+        f"Candidate's source context:\n{context}\n\n"
+        'Return JSON {"accuracy": 0|1, "hallucinated": 0|1, "note": "..."} where accuracy=1 iff '
+        "the candidate conveys the gold answer's key facts (paraphrase is fine), and "
+        "hallucinated=1 iff the candidate states a fact not supported by its source context."
+    )
+    return _parse_verdict(provider.complete(JUDGE_SYSTEM, prompt, max_tokens=300))
+
+
+def _parse_verdict(raw: str) -> dict:
+    match = re.search(r"\{.*\}", raw, re.S)
+    if match:
+        try:
+            data = json.loads(match.group(0))
+            return {
+                "accuracy": int(bool(data.get("accuracy"))),
+                "hallucinated": int(bool(data.get("hallucinated"))),
+                "note": str(data.get("note", ""))[:200],
+                "parse_error": False,
+            }
+        except json.JSONDecodeError:
+            pass
+    return {"accuracy": 0, "hallucinated": 0, "note": "unparseable", "parse_error": True}
+
+
+def test_llm_judged_ab(prepared: tuple[Engine, str], st_provider) -> None:
+    engine, sha = prepared
+    answerer = default_llm_provider()
+    judge = default_llm_provider(os.environ.get("KB_LLM_JUDGE_MODEL"))
+
+    records = []
+    with engine.connect() as conn:
+        for question in QUESTIONS:
+            gold = GOLD[question.id]
+            kb_ctx = _knowbase_context(conn, sha, question.question, st_provider)
+            rag_ctx = _rag_context(conn, sha, question.question, st_provider)
+            kb_ans = _answer(answerer, question.question, kb_ctx)
+            rag_ans = _answer(answerer, question.question, rag_ctx)
+            kb_v = _judge(judge, question.question, gold, kb_ans, kb_ctx)
+            rag_v = _judge(judge, question.question, gold, rag_ans, rag_ctx)
+            records.append({"id": question.id, "knowbase": kb_v, "rag": rag_v})
+
+    n = len(records)
+
+    def mean(arm: str, key: str) -> float:
+        return sum(r[arm][key] for r in records) / n
+
+    kb_acc, rag_acc = mean("knowbase", "accuracy"), mean("rag", "accuracy")
+    kb_hall, rag_hall = mean("knowbase", "hallucinated"), mean("rag", "hallucinated")
+    win = (kb_acc - rag_acc >= WIN_ACCURACY_MARGIN) and (kb_hall <= rag_hall)
+
+    summary = {
+        "answerer": answerer.model_id,
+        "judge": judge.model_id,
+        "n": n,
+        "k": K,
+        "knowbase": {"accuracy": kb_acc, "hallucination": kb_hall},
+        "rag": {"accuracy": rag_acc, "hallucination": rag_hall},
+        "pre_registered_threshold": {
+            "accuracy_margin": WIN_ACCURACY_MARGIN,
+            "hallucination": "knowbase <= rag",
+        },
+        "win": win,
+        "records": records,
+    }
+    out_path = os.environ.get("KB_LLM_AB_METRICS", "tier3_llm_ab_metrics.json")
+    with open(out_path, "w", encoding="utf-8") as fh:
+        json.dump(summary, fh, indent=2)
+
+    print(
+        f"\n[tier3-llm] answerer={answerer.model_id} judge={judge.model_id} "
+        f"n={n} (TRACKED, non-gating)\n"
+        f"  accuracy:      knowbase={kb_acc:.3f}  RAG={rag_acc:.3f}\n"
+        f"  hallucination: knowbase={kb_hall:.3f}  RAG={rag_hall:.3f}\n"
+        f"  pre-registered win (acc margin >= {WIN_ACCURACY_MARGIN} and hall <= RAG): "
+        f"{'PASS' if win else 'not met'}  -> {out_path}"
+    )
+
+    # NON-gating: assert only that the A/B actually ran for every question — never the win.
+    assert n == len(QUESTIONS)
+    assert all(r["knowbase"]["note"] is not None for r in records)
diff --git a/src/kb/llm/__init__.py b/src/kb/llm/__init__.py
@@ -0,0 +1,6 @@
+"""Replaceable LLM adapters for the optional, nightly, NON-gating LLM-judged A/B (DESIGN.md §1, §9).
+
+Used only by ``kb.eval.tier3_llm_judge_test``; never on the index or serve path. Heavy SDK imports
+(anthropic / openai) are lazy so importing this package is cheap and collection-safe even when those
+packages are not installed.
+"""