Merged
71 changes: 71 additions & 0 deletions .github/workflows/nightly-llm-ab.yml
@@ -0,0 +1,71 @@
name: Nightly LLM A/B

# Optional, NON-gating answer-quality A/B (knowbase context vs RAG context), judged by an LLM.
# Runs nightly and on demand. Self-skips when no API key secret is set; never blocks merges (it is a
# separate workflow from CI, and the judge step is continue-on-error). The deterministic Tier-3 recall
# gate in CI remains the hard floor.

on:
  schedule:
    - cron: "0 6 * * *" # 06:00 UTC nightly
  workflow_dispatch:

concurrency:
  group: nightly-llm-ab
  cancel-in-progress: true

jobs:
  ab:
    name: LLM-judged knowledge-vs-RAG A/B
    runs-on: ubuntu-latest
    services:
      postgres:
        image: pgvector/pgvector:pg17
        env:
          POSTGRES_USER: postgres
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: postgres
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    env:
      KB_TEST_DB_URL: postgresql+psycopg://postgres:postgres@127.0.0.1:5432/postgres
      KB_LLM_PROVIDER: anthropic
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      KB_LLM_AB_METRICS: tier3_llm_ab_metrics.json

    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v5
        with:
          python-version: "3.12"
          enable-cache: true

      - name: Install dependencies (incl. llm extra)
        run: uv sync --extra dev --extra embed --extra llm

      - name: Cache embedding model
        uses: actions/cache@v4
        with:
          path: ~/.cache/huggingface
          key: hf-all-MiniLM-L6-v2-v1

      - name: Warm up embedding model
        run: uv run python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"

      - name: LLM-judged A/B (tracked, non-gating; self-skips without ANTHROPIC_API_KEY)
        continue-on-error: true
        run: uv run pytest src/kb/eval/tier3_llm_judge_test.py -q -s

      - name: Upload A/B metrics
        uses: actions/upload-artifact@v4
        with:
          name: tier3-llm-ab-metrics
          path: tier3_llm_ab_metrics.json
          if-no-files-found: ignore
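
Reviewer-side sketch (not part of this PR): once the `tier3-llm-ab-metrics` artifact has been downloaded, the headline numbers can be re-derived from the JSON instead of being read off the job log. The field names follow the `summary` dict that `tier3_llm_judge_test.py` writes in this diff. `load_metrics` and `summarize_ab` are hypothetical helper names, and the sample numbers are made up for illustration.

```python
from __future__ import annotations

import json
from pathlib import Path


def load_metrics(path: str = "tier3_llm_ab_metrics.json") -> dict:
    """Read the metrics JSON produced by the nightly run."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_ab(data: dict) -> dict:
    """Recompute the deltas and the pre-registered win from a metrics dict."""
    kb, rag = data["knowbase"], data["rag"]
    margin = data["pre_registered_threshold"]["accuracy_margin"]
    acc_delta = kb["accuracy"] - rag["accuracy"]
    hall_delta = kb["hallucination"] - rag["hallucination"]
    # Unparseable judge replies are scored 0/0 by the test, so count them
    # separately: many of them would make both means unreliable.
    parse_errors = sum(
        bool(rec[arm].get("parse_error"))
        for rec in data["records"]
        for arm in ("knowbase", "rag")
    )
    return {
        "accuracy_delta": acc_delta,
        "hallucination_delta": hall_delta,
        "parse_errors": parse_errors,
        "win": acc_delta >= margin and hall_delta <= 0,
    }


# Made-up sample, shaped like the artifact (illustration only).
sample = {
    "knowbase": {"accuracy": 1.0, "hallucination": 0.0},
    "rag": {"accuracy": 0.5, "hallucination": 0.5},
    "pre_registered_threshold": {"accuracy_margin": 0.15},
    "records": [
        {
            "id": "q1",
            "knowbase": {"accuracy": 1, "hallucinated": 0, "note": "", "parse_error": False},
            "rag": {"accuracy": 0, "hallucinated": 0, "note": "unparseable", "parse_error": True},
        }
    ],
}
result = summarize_ab(sample)
print(result)
```

With a real artifact, `summarize_ab(load_metrics())` would replace the sample. Surfacing `parse_errors` next to the win flag seems useful because the test scores an unparseable verdict as accuracy 0 and hallucination 0.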
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -26,6 +26,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **Tier-3 entity questions** (`kb.eval.questions`): the knowledge-vs-RAG A/B now also covers domain
entities (a two-file `Order`/`LineItem` fixture), asserting knowbase cross-file recall@k == 1.0 for
entity questions as well as API-contract questions.
- **Nightly LLM-judged A/B** (`kb.llm`, `kb.eval.tier3_llm_judge_test`, `.github/workflows/nightly-llm-ab.yml`):
an optional, key-gated, **non-gating** answer-quality comparison. An answerer LLM answers each question
from knowbase-grounded context vs RAG-over-source context; a judge LLM scores accuracy against
hand-written `GOLD` references and flags hallucination (claims unsupported by that arm's context).
`kb.llm.providers` mirrors the embed-provider pattern (Anthropic default, OpenAI optional, lazy imports
via the new `llm` extra); the test self-skips without an API key and asserts only that the A/B ran
(never the win); the nightly workflow uploads a metrics artifact. `RagHit` gained a `raw_text` field so
the RAG arm can feed chunk text to the answerer.

## [0.2.0] - 2026-06-02

6 changes: 5 additions & 1 deletion DESIGN.md
@@ -264,10 +264,14 @@ Eval is **co-equal with extraction**, weighted to cheap/exact tiers that gate CI
 - **Tier 2 — golden curated repos (TRACKED, non-gating).** 3–5 SHA-pinned permissive Python
   repos; one **held out** and never used for tuning (the real trust signal). Report per-repo,
   never just the mean.
-- **Tier 3 — downstream vs RAG (TRACKED, after MCP).** Fixed ~30–50 question set; coding agent
+- **Tier 3 — downstream vs RAG (TRACKED, after MCP).** Fixed question set; coding agent
   answers with knowbase-MCP vs a **frozen, peer-reviewed** pgvector-RAG baseline (same
   Postgres, same model). **Pre-register the win threshold.** Metrics: grounded-answer accuracy,
   hallucination rate (claims with no provenance), tokens-to-answer, tool round-trips.
+  *Implemented:* the deterministic cross-file recall gate (`tier3_rag_test`, HARD) plus an optional,
+  key-gated, NON-gating **LLM-judged A/B** (`kb.llm` + `tier3_llm_judge_test`, run nightly): an answerer
+  answers each question from knowbase context vs RAG context, and a judge scores accuracy against
+  hand-written gold + hallucination, comparing to a pre-registered threshold (printed, never asserted).

 **Invariants asserted as exact ground truth every run:** every artifact has ≥1 `derived_from`
 row (zero orphans); re-running an extractor on the same span identity+version yields an
6 changes: 4 additions & 2 deletions README.md
@@ -90,7 +90,9 @@ flowchart LR
 - **A frozen RAG-over-source baseline** and the **Tier-3 knowledge-vs-RAG recall gate** — the honest A/B that backs the "knowledge > RAG" thesis.
 - **Eight HARD CI eval gates** (see [Development](#development)).

-**Not done yet** (and deliberately not faked): the semantic / **LLM-grounded** extraction layer, the nightly LLM-judged A/B, ADR mining from git history, grounded business-process extraction, incremental re-index on git push, and languages beyond Python. See the [Roadmap](#roadmap).
+- **A nightly LLM-judged A/B** (optional, key-gated, **non-gating**) — an answerer LLM answers each question from knowbase's grounded context vs a RAG-over-source context, and a judge LLM scores **answer accuracy** (against hand-written gold) + **hallucination**. Tracked metrics on top of recall; it never blocks CI.
+
+**Not done yet** (and deliberately not faked): the semantic / **LLM-grounded** extraction layer, ADR mining from git history, grounded business-process extraction, incremental re-index on git push, and languages beyond Python. See the [Roadmap](#roadmap).

 ## Quickstart

@@ -226,7 +228,7 @@ flowchart LR

 Next milestones:

-- [ ] **Nightly LLM-judged A/B** (key-gated, non-gating) — grounded-answer accuracy + hallucination rate on top of recall.
+- [x] **Nightly LLM-judged A/B** (key-gated, non-gating) — grounded-answer accuracy + hallucination rate on top of recall. *(shipped)*
 - [ ] **LLM-grounded semantic layer** — model-backed artifacts that still carry ≥ 1 span (`extraction_method = "llm_grounded"`).
 - [ ] **Incremental re-index on git push** — turn the diff-based invalidation seed into live updates.
 - [ ] **ADR mining** from git / PR history.
9 changes: 8 additions & 1 deletion pyproject.toml
@@ -37,6 +37,13 @@ dev = [
 embed = [
     "sentence-transformers>=3,<6",
 ]
+# LLM backend for the optional, nightly, NON-gating LLM-judged A/B (kb.llm + tier3_llm_judge_test).
+# Not needed for index/serve/embed; the nightly workflow uses `uv sync --extra dev --extra embed
+# --extra llm`. Also makes the existing OpenAI embedding adapter installable.
+llm = [
+    "anthropic>=0.40",
+    "openai>=1.40",
+]

 [project.scripts]
 kb = "kb.daemon.cli:app"
@@ -86,7 +93,7 @@ ignore_errors = true
 [[tool.mypy.overrides]]
 module = [
     "pygit2.*", "grimp.*", "sqlparse.*", "fastapi.*", "starlette.*", "fastmcp.*", "mcp.*",
-    "sentence_transformers.*", "pgvector.*", "openai.*",
+    "sentence_transformers.*", "pgvector.*", "openai.*", "anthropic.*",
 ]
 ignore_missing_imports = true

21 changes: 21 additions & 0 deletions src/kb/eval/questions.py
@@ -75,3 +75,24 @@ class Question:
    Question("e3", "Which model does the Order entity's items field reference, and where is it?",
             ENTITY_CROSS_FILE, frozenset({"entity:app.domain.order.Order"})),
]

# Hand-written reference answers (the gold oracle for the nightly LLM-judged A/B, PR-3b). Derived
# from the fixtures above (we own them), so accuracy is judged against ground truth — not against
# either arm's own retrieval (no knowbase-favoring bias). Keep in sync with FILES / ENTITY_FILES.
GOLD: dict[str, str] = {
    "q1": "A JSON array of OrderOut objects; each OrderOut has id (int) and total (float).",
    "q2": "The fields of OrderOut: id (int) and total (float).",
    "q3": "Accepts an OrderIn request body with a single field item (str); returns an OrderOut "
          "(id: int, total: float).",
    "q4": "OrderOut, which has id (int) and total (float).",
    "q5": "OrderOut (id: int, total: float).",
    "q6": "201.",
    "q7": "float.",
    "q8": "GET /api/orders (also POST and GET /api/orders/{order_id}) returns OrderOut; "
          "OrderOut is defined in src/app/schemas.py.",
    "e1": "Order (a dataclass) has id (int) and items (a list of LineItem); each LineItem has "
          "sku (str) and qty (int, default 1).",
    "e2": "Order has id (int) and items, where items is list[LineItem]; LineItem has sku (str) and "
          "qty (int).",
    "e3": "It references LineItem, defined in src/app/domain/line_item.py.",
}
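
Reviewer-side sketch (not part of this PR): the comment above asks that `GOLD` be kept in sync with the question fixtures, and the judge test indexes `GOLD[question.id]` directly, so a missing key would surface only as a `KeyError` in the nightly run. A small coverage check along these lines could catch drift earlier. `gold_coverage` is a hypothetical helper, and the `Question` class here is a simplified stand-in for the real one in `kb.eval.questions`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """Simplified stand-in: the real class carries more fields."""

    id: str
    question: str


def gold_coverage(
    questions: list[Question], gold: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Return (question ids without a gold answer, gold keys without a question)."""
    ids = {q.id for q in questions}
    missing = sorted(ids - set(gold))
    unused = sorted(set(gold) - ids)
    return missing, unused


# Illustrative data only: "e9" has no gold answer, "q7" has no question.
questions = [Question("q1", "What does GET /api/orders return?"), Question("e9", "...")]
gold = {"q1": "A JSON array of OrderOut objects.", "q7": "float."}
missing, unused = gold_coverage(questions, gold)
print(missing, unused)
```

In a deterministic test this would reduce to asserting that both lists are empty for `QUESTIONS` and `GOLD`, which needs no API key.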
175 changes: 175 additions & 0 deletions src/kb/eval/tier3_llm_judge_test.py
@@ -0,0 +1,175 @@
"""TRACKED (NON-gating) — Tier 3: LLM-judged answer quality, knowbase context vs RAG context (§9).

Beyond Tier-3 *recall* (tier3_rag_test), this measures *answer quality*: an answerer LLM answers
each question from the knowbase-grounded context and, separately, from the RAG-over-source context.
A judge LLM scores each answer against a hand-written GOLD reference for **accuracy**, and flags
**hallucination** (a claim unsupported by that arm's own provided context).

This is **nightly, key-gated, and NON-gating**: it self-skips without an API key, asserts only that
the A/B actually ran (never the win), and writes a metrics JSON for the CI artifact. The
deterministic Tier-3 recall gate remains the hard floor. Self-judging bias is accepted here (see
DESIGN §9); a distinct judge model can be set via KB_LLM_JUDGE_MODEL.
"""

from __future__ import annotations

import json
import os
import re

import pytest
from sqlalchemy import Engine

from kb.daemon.pipeline import index_commit
from kb.embed.population import embed_snapshot
from kb.eval._fixtures import make_git_repo
from kb.eval.questions import ENTITY_FILES, GOLD, QUESTIONS
from kb.eval.tier1_api_test import FILES
from kb.extract.deterministic.entities import EntityExtractor
from kb.extract.deterministic.fastapi_contract import FastAPIExtractor
from kb.llm.providers import default_llm_provider, has_llm_key
from kb.mcp.records import summarize
from kb.rag.baseline import index_rag_baseline, rag_retrieve
from kb.store import queries as q
from kb.store.queries import provenance_for_artifact

pytestmark = pytest.mark.skipif(
    not has_llm_key(), reason="no LLM API key (set ANTHROPIC_API_KEY or OPENAI_API_KEY)"
)

K = 5
WIN_ACCURACY_MARGIN = 0.15  # pre-registered: knowbase wins iff acc margin >= this AND hall <= RAG's

ANSWER_SYSTEM = (
    "Answer the question using ONLY the provided context about a codebase. Be concise and "
    "specific. If the context does not contain the answer, say you cannot tell from the context."
)
JUDGE_SYSTEM = "You are a strict grader. Respond with ONE JSON object and nothing else."


@pytest.fixture(scope="module")
def prepared(engine: Engine, tmp_path_factory, st_provider) -> tuple[Engine, str]:
    repo = tmp_path_factory.mktemp("tier3_llm")
    sha = make_git_repo(repo, [{**FILES, **ENTITY_FILES}])[0]
    index_commit(
        engine,
        str(repo),
        sha,
        extractors=[FastAPIExtractor(), EntityExtractor()],
        first_party_root="src",
    )
    embed_snapshot(engine, sha, st_provider)
    index_rag_baseline(engine, str(repo), sha, st_provider)
    return engine, sha


def _knowbase_context(conn, sha, question, st_provider) -> str:
    qvec = st_provider.embed([question])[0]
    blocks = []
    for row in q.similar_artifacts_by_embedding(conn, sha, qvec, K):
        prov = provenance_for_artifact(conn, sha, row.logical_key)
        prov_str = ", ".join(f"{p.file_path}:{p.start_line}" for p in prov)
        blocks.append(
            f"[{row.logical_key}] kind={row.kind}\n"
            f"summary: {summarize(row.kind, row.payload)}\n"
            f"details: {json.dumps(row.payload, default=str)[:600]}\n"
            f"provenance: {prov_str}"
        )
    return "\n\n".join(blocks) if blocks else "(no knowledge units found)"


def _rag_context(conn, sha, question, st_provider) -> str:
    hits = rag_retrieve(conn, question, st_provider, sha, K)
    blocks = [f"# {h.file_path}:{h.start_line}-{h.end_line}\n{h.raw_text}" for h in hits]
    return "\n\n".join(blocks) if blocks else "(no source chunks found)"


def _answer(provider, question: str, context: str) -> str:
    return provider.complete(ANSWER_SYSTEM, f"Context:\n{context}\n\nQuestion: {question}")


def _judge(provider, question: str, gold: str, answer: str, context: str) -> dict:
    prompt = (
        f"Question: {question}\n"
        f"Gold answer: {gold}\n"
        f"Candidate answer: {answer}\n\n"
        f"Candidate's source context:\n{context}\n\n"
        'Return JSON {"accuracy": 0|1, "hallucinated": 0|1, "note": "..."} where accuracy=1 iff '
        "the candidate conveys the gold answer's key facts (paraphrase is fine), and "
        "hallucinated=1 iff the candidate states a fact not supported by its source context."
    )
    return _parse_verdict(provider.complete(JUDGE_SYSTEM, prompt, max_tokens=300))


def _parse_verdict(raw: str) -> dict:
    match = re.search(r"\{.*\}", raw, re.S)
    if match:
        try:
            data = json.loads(match.group(0))
            return {
                "accuracy": int(bool(data.get("accuracy"))),
                "hallucinated": int(bool(data.get("hallucinated"))),
                "note": str(data.get("note", ""))[:200],
                "parse_error": False,
            }
        except json.JSONDecodeError:
            pass
    return {"accuracy": 0, "hallucinated": 0, "note": "unparseable", "parse_error": True}


def test_llm_judged_ab(prepared: tuple[Engine, str], st_provider) -> None:
    engine, sha = prepared
    answerer = default_llm_provider()
    judge = default_llm_provider(os.environ.get("KB_LLM_JUDGE_MODEL"))

    records = []
    with engine.connect() as conn:
        for question in QUESTIONS:
            gold = GOLD[question.id]
            kb_ctx = _knowbase_context(conn, sha, question.question, st_provider)
            rag_ctx = _rag_context(conn, sha, question.question, st_provider)
            kb_ans = _answer(answerer, question.question, kb_ctx)
            rag_ans = _answer(answerer, question.question, rag_ctx)
            kb_v = _judge(judge, question.question, gold, kb_ans, kb_ctx)
            rag_v = _judge(judge, question.question, gold, rag_ans, rag_ctx)
            records.append({"id": question.id, "knowbase": kb_v, "rag": rag_v})

    n = len(records)

    def mean(arm: str, key: str) -> float:
        return sum(r[arm][key] for r in records) / n

    kb_acc, rag_acc = mean("knowbase", "accuracy"), mean("rag", "accuracy")
    kb_hall, rag_hall = mean("knowbase", "hallucinated"), mean("rag", "hallucinated")
    win = (kb_acc - rag_acc >= WIN_ACCURACY_MARGIN) and (kb_hall <= rag_hall)

    summary = {
        "answerer": answerer.model_id,
        "judge": judge.model_id,
        "n": n,
        "k": K,
        "knowbase": {"accuracy": kb_acc, "hallucination": kb_hall},
        "rag": {"accuracy": rag_acc, "hallucination": rag_hall},
        "pre_registered_threshold": {
            "accuracy_margin": WIN_ACCURACY_MARGIN,
            "hallucination": "knowbase <= rag",
        },
        "win": win,
        "records": records,
    }
    out_path = os.environ.get("KB_LLM_AB_METRICS", "tier3_llm_ab_metrics.json")
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(summary, fh, indent=2)

    print(
        f"\n[tier3-llm] answerer={answerer.model_id} judge={judge.model_id} "
        f"n={n} (TRACKED, non-gating)\n"
        f"  accuracy:      knowbase={kb_acc:.3f}  RAG={rag_acc:.3f}\n"
        f"  hallucination: knowbase={kb_hall:.3f}  RAG={rag_hall:.3f}\n"
        f"  pre-registered win (acc margin >= {WIN_ACCURACY_MARGIN} and hall <= RAG): "
        f"{'PASS' if win else 'not met'}  -> {out_path}"
    )

    # NON-gating: assert only that the A/B actually ran for every question — never the win.
    assert n == len(QUESTIONS)
    assert all(r["knowbase"]["note"] is not None for r in records)
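
Reviewer-side sketch (not part of this PR): `_parse_verdict` above coerces the judge's flags with `int(bool(...))`, so a judge that replies with the string `"0"` would be counted as 1, because any non-empty string is truthy. A stricter variant could accept only unambiguous 0/1 values and treat everything else as a parse error. `coerce_flag` and `parse_verdict_strict` are hypothetical names; the return shape mirrors the dict used in the test file.

```python
from __future__ import annotations

import json
import re


def coerce_flag(value: object) -> int | None:
    """Map an unambiguous 0/1 flag to an int; return None for anything else."""
    if isinstance(value, bool):
        return int(value)
    if isinstance(value, int) and value in (0, 1):
        return value
    if isinstance(value, str) and value.strip() in ("0", "1"):
        return int(value.strip())
    return None


def parse_verdict_strict(raw: str) -> dict:
    """Parse a judge reply, rejecting flags that are not clearly 0 or 1."""
    match = re.search(r"\{.*\}", raw, re.S)
    if match:
        try:
            data = json.loads(match.group(0))
        except json.JSONDecodeError:
            data = None
        if isinstance(data, dict):
            accuracy = coerce_flag(data.get("accuracy"))
            hallucinated = coerce_flag(data.get("hallucinated"))
            if accuracy is not None and hallucinated is not None:
                return {
                    "accuracy": accuracy,
                    "hallucinated": hallucinated,
                    "note": str(data.get("note", ""))[:200],
                    "parse_error": False,
                }
    return {"accuracy": 0, "hallucinated": 0, "note": "unparseable", "parse_error": True}


# A string "0" stays 0 here, whereas int(bool("0")) evaluates to 1.
verdict = parse_verdict_strict('{"accuracy": "0", "hallucinated": 0, "note": "wrong type"}')
print(verdict)
```

Whether the stricter behaviour is wanted is a judgment call: it trades a silent miscount for more `parse_error` records, which the metrics artifact already tracks per arm.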
6 changes: 6 additions & 0 deletions src/kb/llm/__init__.py
@@ -0,0 +1,6 @@
"""Replaceable LLM adapters for the optional, nightly, NON-gating LLM-judged A/B (DESIGN.md §1, §9).

Used only by ``kb.eval.tier3_llm_judge_test``; never on the index or serve path. Heavy SDK imports
(anthropic / openai) are lazy so importing this package is cheap and collection-safe even when those
packages are not installed.
"""
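
Reviewer-side sketch (not part of this PR): the `kb.llm.providers` module is not shown on this page, so the following only illustrates the two properties the docstring and workflow rely on, under stated assumptions. First, the key check should treat an empty value as missing, since GitHub Actions expands an unset secret to an empty string. Second, the SDK import should be deferred until first use. The `has_llm_key` signature with an `env` parameter and the `LazySDK` class are hypothetical, not the project's actual API.

```python
from __future__ import annotations

import importlib
import os
from collections.abc import Mapping
from types import ModuleType

KEY_NAMES = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def has_llm_key(env: Mapping[str, str] | None = None) -> bool:
    """True if any known key is set to a non-blank value (assumed semantics)."""
    env = os.environ if env is None else env
    return any(env.get(name, "").strip() for name in KEY_NAMES)


class LazySDK:
    """Defers importing a heavy module until load() is first called."""

    def __init__(self, module_name: str) -> None:
        self.module_name = module_name
        self._module: ModuleType | None = None

    def load(self) -> ModuleType:
        if self._module is None:
            # An ImportError for a missing extra surfaces here, at call
            # time, rather than when the package is imported or collected.
            self._module = importlib.import_module(self.module_name)
        return self._module


# "json" stands in for a heavy SDK such as anthropic, so the sketch runs anywhere.
sdk = LazySDK("json")
print(has_llm_key({"ANTHROPIC_API_KEY": ""}), sdk.load().dumps({"ok": True}))
```

Keeping the import inside `load()` is what lets pytest collect the test module and evaluate the `skipif` marker even when the `llm` extra is not installed.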