Situation
The QA benchmark currently evaluates answer quality using two custom LLM judges:
- C-claim judge (tools/benchmark/prompts/c-claim-judge.md): Checks whether each must.fact is PRESENT/ABSENT/UNCERTAIN in the generated answer. One LLM call per fact per scenario.
- Hallucination judge (tools/benchmark/prompts/hallucination-judge.md): Checks whether Nablarch-specific claims are grounded in the retrieved knowledge. One LLM call per scenario.
There are no standard RAG evaluation metrics anywhere in the benchmark pipeline.
Pain
Custom LLM judges have known reliability problems:
- Inconsistency: The same input can produce different verdicts across runs. The current pipeline handles UNCERTAIN by exclusion, which masks variance rather than reducing it.
- Opaque failure mode: When a score drops, it is unclear whether the skill degraded or the judge fluctuated. There is no ground truth to compare against.
- No external calibration: Because the metrics are custom, there is no way to compare results against other projects or validate judge quality against established benchmarks.
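To make the "masks variance" point concrete, here is a minimal sketch of how excluding UNCERTAIN can hide judge disagreement. The scoring formula (PRESENT divided by PRESENT plus ABSENT), the function names, and the two sample runs are all assumptions for illustration; they are not taken from the benchmark code.

```python
from collections import Counter

# Verdict labels as named in the C-claim judge description.
PRESENT, ABSENT, UNCERTAIN = "PRESENT", "ABSENT", "UNCERTAIN"


def coverage_excluding_uncertain(verdicts):
    """Share of PRESENT among decided verdicts (assumed formula).

    UNCERTAIN verdicts are dropped from both numerator and denominator.
    Returns None when no verdict was decided.
    """
    counts = Counter(verdicts)
    decided = counts[PRESENT] + counts[ABSENT]
    return counts[PRESENT] / decided if decided else None


def disagreements(run_a, run_b):
    """Number of facts whose verdict differs between two judge runs."""
    return sum(a != b for a, b in zip(run_a, run_b))


# Two hypothetical judge runs over the same answer and the same four facts.
run_a = [PRESENT, PRESENT, UNCERTAIN, ABSENT]
run_b = [PRESENT, UNCERTAIN, PRESENT, ABSENT]

score_a = coverage_excluding_uncertain(run_a)
score_b = coverage_excluding_uncertain(run_b)
flipped = disagreements(run_a, run_b)

print(score_a, score_b, flipped)
```

In this sketch both runs produce the same score even though the judge changed its verdict on two of the four facts, so the aggregate number gives no hint that the judge was unstable.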
Benefit
- Developers can detect skill regressions with higher confidence because metrics follow established standards.
- Score changes are attributable to skill changes, not judge variance.
- Results align with established RAG evaluation standards, enabling external calibration.
Success Criteria
docs/benchmark-design.md