feat: nightly LLM-judged A/B (answer accuracy + hallucination) by v0ropaev · Pull Request #11 · v0ropaev/knowbase

v0ropaev · 2026-06-21T10:40:12Z

Adds the answer-quality evidence on top of Tier-3 recall (DESIGN §9): an answerer LLM answers each question from knowbase's grounded context vs a RAG-over-source context, and a judge LLM scores accuracy (against hand-written gold) + hallucination (claims unsupported by that arm's context). Optional, key-gated, NON-gating — it never blocks CI.

What

kb.llm — LLMProvider protocol + AnthropicProvider (default, claude-opus-4-8) / OpenAIChatProvider, default_llm_provider() via KB_LLM_PROVIDER, has_llm_key(). Mirrors kb.embed.providers; lazy SDK imports behind a new llm extra.
tier3_llm_judge_test.py — skipif no key. Builds both context arms (knowbase grounded units + provenance vs RAG chunk text), answers + judges all 11 questions, prints knowbase-vs-RAG accuracy + hallucination and PASS/FAIL vs a pre-registered threshold, writes a metrics JSON. Asserts only that the A/B ran — never the win. Self-judging bias accepted (overridable via KB_LLM_JUDGE_MODEL).
questions.GOLD — hand-written reference answers for the 11 questions (we own the fixtures → no arm bias).
RagHit.raw_text — so the RAG arm can feed chunk text to the answerer.
nightly-llm-ab.yml — scheduled + dispatch, pgvector service, --extra llm, secret-gated ANTHROPIC_API_KEY, continue-on-error, uploads the metrics artifact.
pyproject — llm = [anthropic, openai] extra + anthropic.* mypy ignore.

Safety / CI

Regular ci.yml stays green: the judge test module imports cleanly without anthropic/openai installed (lazy imports) and skips without a key; no new required deps; mypy covers the SDKs via ignore_missing_imports. Verified locally: 52 passed + 1 skipped, ruff + mypy --strict clean, provider/JSON-parsing smoke OK.

The deterministic Tier-3 recall gate remains the hard floor. After merge I'll trigger the nightly once via workflow_dispatch (with the secret set) to confirm it runs and uploads metrics.

Adds the answer-quality evidence on top of Tier-3 recall (DESIGN §9): an answerer LLM answers each question from knowbase-grounded context vs a RAG-over-source context, and a judge LLM scores accuracy against hand-written gold + flags hallucination. Optional, key-gated, NON-gating — never blocks CI. - src/kb/llm: LLMProvider protocol + Anthropic (default) / OpenAI adapters + default_llm_provider() / has_llm_key(); mirrors kb.embed.providers, lazy SDK imports (new `llm` extra). Default model claude-opus-4-8. - src/kb/eval/tier3_llm_judge_test.py: skipif no key; builds both context arms (knowbase grounded units + provenance vs RAG chunk text), answers + judges all 11 questions, prints knowbase-vs-RAG accuracy + hallucination and PASS/FAIL vs a pre-registered threshold, writes a metrics JSON. Asserts only that the A/B ran (never the win). - src/kb/eval/questions.py: GOLD reference answers for the 11 questions. - src/kb/rag/baseline.py: RagHit.raw_text so the RAG arm can feed chunk text. - .github/workflows/nightly-llm-ab.yml: scheduled + dispatch, pgvector service, --extra llm, secret-gated, continue-on-error, uploads the metrics artifact. - pyproject: `llm` extra (anthropic, openai) + anthropic mypy ignore. Regular CI stays green: the judge test collects (lazy imports) and skips without a key; no new required deps. 52 eval tests pass + 1 skipped; ruff + mypy clean.

v0ropaev merged commit 835acbe into master Jun 21, 2026
1 check passed

v0ropaev deleted the feat/nightly-llm-ab branch June 21, 2026 10:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: nightly LLM-judged A/B (answer accuracy + hallucination)#11

feat: nightly LLM-judged A/B (answer accuracy + hallucination)#11
v0ropaev merged 1 commit into
masterfrom
feat/nightly-llm-ab

v0ropaev commented Jun 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

v0ropaev commented Jun 21, 2026

What

Safety / CI

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant