feat: nightly LLM-judged A/B (answer accuracy + hallucination)#11
Merged
Conversation
Adds the answer-quality evidence on top of Tier-3 recall (DESIGN §9): an answerer LLM answers each question from knowbase-grounded context vs a RAG-over-source context, and a judge LLM scores accuracy against hand-written gold + flags hallucination. Optional, key-gated, NON-gating — never blocks CI. - src/kb/llm: LLMProvider protocol + Anthropic (default) / OpenAI adapters + default_llm_provider() / has_llm_key(); mirrors kb.embed.providers, lazy SDK imports (new `llm` extra). Default model claude-opus-4-8. - src/kb/eval/tier3_llm_judge_test.py: skipif no key; builds both context arms (knowbase grounded units + provenance vs RAG chunk text), answers + judges all 11 questions, prints knowbase-vs-RAG accuracy + hallucination and PASS/FAIL vs a pre-registered threshold, writes a metrics JSON. Asserts only that the A/B ran (never the win). - src/kb/eval/questions.py: GOLD reference answers for the 11 questions. - src/kb/rag/baseline.py: RagHit.raw_text so the RAG arm can feed chunk text. - .github/workflows/nightly-llm-ab.yml: scheduled + dispatch, pgvector service, --extra llm, secret-gated, continue-on-error, uploads the metrics artifact. - pyproject: `llm` extra (anthropic, openai) + anthropic mypy ignore. Regular CI stays green: the judge test collects (lazy imports) and skips without a key; no new required deps. 52 eval tests pass + 1 skipped; ruff + mypy clean.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds the answer-quality evidence on top of Tier-3 recall (DESIGN §9): an answerer LLM answers each question from knowbase's grounded context vs a RAG-over-source context, and a judge LLM scores accuracy (against hand-written gold) + hallucination (claims unsupported by that arm's context). Optional, key-gated, NON-gating — it never blocks CI.
What
kb.llm—LLMProviderprotocol +AnthropicProvider(default,claude-opus-4-8) /OpenAIChatProvider,default_llm_provider()viaKB_LLM_PROVIDER,has_llm_key(). Mirrorskb.embed.providers; lazy SDK imports behind a newllmextra.tier3_llm_judge_test.py—skipifno key. Builds both context arms (knowbase grounded units + provenance vs RAG chunk text), answers + judges all 11 questions, prints knowbase-vs-RAG accuracy + hallucination and PASS/FAIL vs a pre-registered threshold, writes a metrics JSON. Asserts only that the A/B ran — never the win. Self-judging bias accepted (overridable viaKB_LLM_JUDGE_MODEL).questions.GOLD— hand-written reference answers for the 11 questions (we own the fixtures → no arm bias).RagHit.raw_text— so the RAG arm can feed chunk text to the answerer.nightly-llm-ab.yml— scheduled + dispatch,pgvectorservice,--extra llm, secret-gatedANTHROPIC_API_KEY,continue-on-error, uploads the metrics artifact.llm = [anthropic, openai]extra +anthropic.*mypy ignore.Safety / CI
Regular
ci.ymlstays green: the judge test module imports cleanly without anthropic/openai installed (lazy imports) and skips without a key; no new required deps; mypy covers the SDKs viaignore_missing_imports. Verified locally: 52 passed + 1 skipped, ruff +mypy --strictclean, provider/JSON-parsing smoke OK.The deterministic Tier-3 recall gate remains the hard floor. After merge I'll trigger the nightly once via
workflow_dispatch(with the secret set) to confirm it runs and uploads metrics.