As a developer, I want QA benchmark evaluation based on standard RAG metrics so that skill regressions are detected reliably and results are externally comparable #361

### Situation

The QA benchmark currently evaluates answer quality using two custom LLM judges:

- **C-claim judge** (`tools/benchmark/prompts/c-claim-judge.md`): Checks whether each `must.fact` is PRESENT/ABSENT/UNCERTAIN in the generated answer. One LLM call per fact per scenario.
- **Hallucination judge** (`tools/benchmark/prompts/hallucination-judge.md`): Checks whether Nablarch-specific claims are grounded in the retrieved knowledge. One LLM call per scenario.

There are no standard RAG evaluation metrics anywhere in the benchmark pipeline.

### Pain

Custom LLM judges have known reliability problems:

- **Inconsistency**: The same input can produce different verdicts across runs. The current pipeline handles `UNCERTAIN` by exclusion, which masks variance rather than reducing it.
- **Opaque failure mode**: When a score drops, it is unclear whether the skill degraded or the judge fluctuated. There is no ground truth to compare against.
- **No external calibration**: Because the metrics are custom, there is no way to compare results against other projects or validate judge quality against established benchmarks.
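
To make the exclusion problem concrete, here is a minimal sketch. It assumes a scenario's C-claim score is the fraction of `PRESENT` verdicts among the verdicts that were not `UNCERTAIN`. Both the function name `c_claim_score` and that scoring rule are assumptions for illustration, not the pipeline's actual code.

```python
from typing import Optional, Sequence


def c_claim_score(verdicts: Sequence[str]) -> Optional[float]:
    """Fraction of PRESENT verdicts after dropping UNCERTAIN ones (assumed rule)."""
    decided = [v for v in verdicts if v != "UNCERTAIN"]
    if not decided:
        # Every fact was UNCERTAIN, so there is nothing left to score.
        return None
    return sum(v == "PRESENT" for v in decided) / len(decided)


# Two hypothetical runs over the same answer: the judge flips one fact
# from ABSENT to UNCERTAIN between runs.
run_a = ["PRESENT", "PRESENT", "ABSENT", "PRESENT"]
run_b = ["PRESENT", "PRESENT", "UNCERTAIN", "PRESENT"]

print(c_claim_score(run_a))  # 0.75
print(c_claim_score(run_b))  # 1.0
```

Under this assumed rule, a single verdict flip moves the score from 0.75 to 1.0 without any change in the skill, and the excluded fact leaves no trace in the reported number.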

### Benefit

- Developers can detect skill regressions with higher confidence because metrics follow established standards
- Score changes are attributable to skill changes, not judge variance
- Results align with established RAG evaluation standards, enabling external calibration

### Success Criteria

- [ ] Answer Correctness, Answer Relevancy, and Faithfulness are computed for each QA scenario and included in the benchmark report
- [ ] The three metrics are validated against the current LLM-judge verdicts on the existing 30 QA scenarios, with correlation and disagreement cases documented
- [ ] The benchmark report shows standard metric scores in place of LLM-judge scores
- [ ] Metric selection rationale and PASS/FAIL thresholds are documented in `docs/benchmark-design.md`
- [ ] All existing benchmark tests pass with no regressions
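
The validation criterion above could be approached roughly as follows. This is a sketch under stated assumptions: each scenario has one metric score in [0, 1] and one boolean LLM-judge outcome, and the names `pearson`, `disagreements`, and the 0.5 threshold are hypothetical placeholders rather than values chosen by this issue. The sample data is made up purely to exercise the functions.

```python
from math import sqrt
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; with a 0/1 variable this is the point-biserial r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def disagreements(
    scores: Sequence[float], judge_pass: Sequence[bool], threshold: float
) -> List[int]:
    """Indices where the thresholded metric and the judge verdict differ."""
    return [
        i
        for i, (score, passed) in enumerate(zip(scores, judge_pass))
        if (score >= threshold) != passed
    ]


# Made-up example data for five scenarios (not real benchmark results).
scores = [0.9, 0.8, 0.3, 0.6, 0.2]
judge_pass = [True, True, False, False, False]

r = pearson(scores, [1.0 if p else 0.0 for p in judge_pass])
print(round(r, 2))
print(disagreements(scores, judge_pass, threshold=0.5))  # [3]
```

The disagreement indices are the cases worth documenting by hand, since they show where a standard metric and the current judge would reach different PASS/FAIL conclusions.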
