As a developer, I want QA benchmark evaluation based on standard RAG metrics so that skill regressions are detected reliably and results are externally comparable #361

@kiyotis

Situation

The QA benchmark currently evaluates answer quality using two custom LLM judges:

  • C-claim judge (tools/benchmark/prompts/c-claim-judge.md): Checks whether each must.fact is PRESENT/ABSENT/UNCERTAIN in the generated answer. One LLM call per fact per scenario.
  • Hallucination judge (tools/benchmark/prompts/hallucination-judge.md): Checks whether Nablarch-specific claims are grounded in the retrieved knowledge. One LLM call per scenario.

There are no standard RAG evaluation metrics anywhere in the benchmark pipeline.

Pain

Custom LLM judges have known reliability problems:

  • Inconsistency: The same input can produce different verdicts across runs. The current pipeline handles UNCERTAIN by exclusion, which masks variance rather than reducing it.
  • Opaque failure mode: When a score drops, it is unclear whether the skill degraded or the judge fluctuated. There is no ground truth to compare against.
  • No external calibration: Because the metrics are custom, there is no way to compare results against other projects or validate judge quality against established benchmarks.
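The masking effect of excluding UNCERTAIN can be illustrated with a minimal sketch. It assumes the per-scenario score is the fraction of PRESENT verdicts among the non-UNCERTAIN ones; the actual aggregation in the benchmark may differ. The function names `fact_score` and `verdict_flips` and the verdict lists are hypothetical, chosen only for illustration.

```python
from collections import Counter

def fact_score(verdicts):
    """Fraction of PRESENT verdicts after dropping UNCERTAIN ones (assumed aggregation)."""
    counts = Counter(verdicts)
    decided = counts["PRESENT"] + counts["ABSENT"]
    return counts["PRESENT"] / decided if decided else None

def verdict_flips(run_a, run_b):
    """Number of facts whose verdict differs between two runs on the same input."""
    return sum(a != b for a, b in zip(run_a, run_b))

# Made-up verdicts for four must.fact entries, judged twice on the same answer.
run_a = ["PRESENT", "PRESENT", "ABSENT", "ABSENT"]
run_b = ["PRESENT", "UNCERTAIN", "ABSENT", "UNCERTAIN"]

score_a = fact_score(run_a)
score_b = fact_score(run_b)
flips = verdict_flips(run_a, run_b)
print(score_a, score_b, flips)
```

In this toy case two of the four verdicts change between runs, yet both runs report the same score, so the judge's instability never shows up in the report.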

Benefit

  • Developers can detect skill regressions with higher confidence because metrics follow established standards
  • Score changes are attributable to skill changes, not judge variance
  • Results align with established RAG evaluation standards, enabling external calibration

Success Criteria

  • Answer Correctness, Answer Relevancy, and Faithfulness are computed for each QA scenario and included in the benchmark report
  • The three metrics are validated against the current LLM-judge verdicts on the existing 30 QA scenarios, with the correlation and the disagreement cases documented
  • The benchmark report shows the standard metric scores in place of the LLM-judge scores
  • Metric selection rationale and PASS/FAIL thresholds are documented in docs/benchmark-design.md
  • All existing benchmark tests pass with no regressions
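The validation step in the criteria above could be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the scenario IDs, scores, verdicts, and the 0.7 threshold are all made up, and the helper names `pearson` and `disagreements` are hypothetical. It computes a Pearson correlation between a metric score and the judge's pass/fail verdict, then lists the scenarios where the two disagree.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def disagreements(rows, threshold):
    """IDs of scenarios where (score >= threshold) differs from the judge verdict."""
    return [r["id"] for r in rows if (r["score"] >= threshold) != r["judge_pass"]]

# Made-up rows: one metric score and one judge verdict per scenario.
rows = [
    {"id": "qa-01", "score": 0.9, "judge_pass": True},
    {"id": "qa-02", "score": 0.8, "judge_pass": True},
    {"id": "qa-03", "score": 0.4, "judge_pass": False},
    {"id": "qa-04", "score": 0.3, "judge_pass": True},
]

THRESHOLD = 0.7  # hypothetical PASS/FAIL cut-off, not a documented value

r = pearson([row["score"] for row in rows],
            [1.0 if row["judge_pass"] else 0.0 for row in rows])
mismatched = disagreements(rows, THRESHOLD)
print(round(r, 3), mismatched)
```

Listing the mismatched scenario IDs alongside the coefficient keeps each disagreement case inspectable rather than folding it into a single number.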

Labels: enhancement (New feature or request)
