Skip to content

feat: nightly LLM-judged A/B (answer accuracy + hallucination)#11

Merged
v0ropaev merged 1 commit into
masterfrom
feat/nightly-llm-ab
Jun 21, 2026
Merged

feat: nightly LLM-judged A/B (answer accuracy + hallucination)#11
v0ropaev merged 1 commit into
masterfrom
feat/nightly-llm-ab

Conversation

@v0ropaev

Copy link
Copy Markdown
Owner

Adds the answer-quality evidence on top of Tier-3 recall (DESIGN §9): an answerer LLM answers each question from knowbase's grounded context vs a RAG-over-source context, and a judge LLM scores accuracy (against hand-written gold) + hallucination (claims unsupported by that arm's context). Optional, key-gated, NON-gating — it never blocks CI.

What

  • kb.llmLLMProvider protocol + AnthropicProvider (default, claude-opus-4-8) / OpenAIChatProvider, default_llm_provider() via KB_LLM_PROVIDER, has_llm_key(). Mirrors kb.embed.providers; lazy SDK imports behind a new llm extra.
  • tier3_llm_judge_test.pyskipif no key. Builds both context arms (knowbase grounded units + provenance vs RAG chunk text), answers + judges all 11 questions, prints knowbase-vs-RAG accuracy + hallucination and PASS/FAIL vs a pre-registered threshold, writes a metrics JSON. Asserts only that the A/B ran — never the win. Self-judging bias accepted (overridable via KB_LLM_JUDGE_MODEL).
  • questions.GOLD — hand-written reference answers for the 11 questions (we own the fixtures → no arm bias).
  • RagHit.raw_text — so the RAG arm can feed chunk text to the answerer.
  • nightly-llm-ab.yml — scheduled + dispatch, pgvector service, --extra llm, secret-gated ANTHROPIC_API_KEY, continue-on-error, uploads the metrics artifact.
  • pyprojectllm = [anthropic, openai] extra + anthropic.* mypy ignore.

Safety / CI

Regular ci.yml stays green: the judge test module imports cleanly without anthropic/openai installed (lazy imports) and skips without a key; no new required deps; mypy covers the SDKs via ignore_missing_imports. Verified locally: 52 passed + 1 skipped, ruff + mypy --strict clean, provider/JSON-parsing smoke OK.

The deterministic Tier-3 recall gate remains the hard floor. After merge I'll trigger the nightly once via workflow_dispatch (with the secret set) to confirm it runs and uploads metrics.

Adds the answer-quality evidence on top of Tier-3 recall (DESIGN §9): an
answerer LLM answers each question from knowbase-grounded context vs a
RAG-over-source context, and a judge LLM scores accuracy against hand-written
gold + flags hallucination. Optional, key-gated, NON-gating — never blocks CI.

- src/kb/llm: LLMProvider protocol + Anthropic (default) / OpenAI adapters +
  default_llm_provider() / has_llm_key(); mirrors kb.embed.providers, lazy SDK
  imports (new `llm` extra). Default model claude-opus-4-8.
- src/kb/eval/tier3_llm_judge_test.py: skipif no key; builds both context arms
  (knowbase grounded units + provenance vs RAG chunk text), answers + judges all
  11 questions, prints knowbase-vs-RAG accuracy + hallucination and PASS/FAIL vs
  a pre-registered threshold, writes a metrics JSON. Asserts only that the A/B
  ran (never the win).
- src/kb/eval/questions.py: GOLD reference answers for the 11 questions.
- src/kb/rag/baseline.py: RagHit.raw_text so the RAG arm can feed chunk text.
- .github/workflows/nightly-llm-ab.yml: scheduled + dispatch, pgvector service,
  --extra llm, secret-gated, continue-on-error, uploads the metrics artifact.
- pyproject: `llm` extra (anthropic, openai) + anthropic mypy ignore.

Regular CI stays green: the judge test collects (lazy imports) and skips without
a key; no new required deps. 52 eval tests pass + 1 skipped; ruff + mypy clean.
@v0ropaev v0ropaev merged commit 835acbe into master Jun 21, 2026
1 check passed
@v0ropaev v0ropaev deleted the feat/nightly-llm-ab branch June 21, 2026 10:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant