Skip to content

feat: add DeepEval RAG metrics to benchmark pipeline (#361)#362

Open
kiyotis wants to merge 38 commits into
mainfrom
361-ragas-benchmark-metrics
Open

feat: add DeepEval RAG metrics to benchmark pipeline (#361)#362
kiyotis wants to merge 38 commits into
mainfrom
361-ragas-benchmark-metrics

Conversation

@kiyotis
Copy link
Copy Markdown
Contributor

@kiyotis kiyotis commented May 28, 2026

Closes #361

Approach

Integrated DeepEval's standard RAG metrics (Answer Correctness, Answer Relevancy, Faithfulness) into the existing benchmark pipeline via Amazon Bedrock. The approach adds DeepEval as an optional layer on top of the existing custom LLM judges rather than replacing them, enabling side-by-side comparison and external calibration.

Key design decisions:

  • Optional flag (--with-deepeval): DeepEval metrics are opt-in to avoid mandatory Bedrock calls on every benchmark run
  • Amazon Bedrock backend: Uses AmazonBedrockModel (Claude Sonnet 4.5) consistent with the rest of the pipeline; avoids separate API key management
  • SSL workaround: AWS_CA_BUNDLE set from SSL_CERT_FILE at compute time to handle aiobotocore's SSL cert chain requirement under corporate proxy
  • Fallback for retrieval context: build_deepeval_test_case falls back to workflow_details.step3.selected_sections when diagnostics.search_sections is absent (run_qa output format)

Validation on 28/30 existing QA scenarios confirmed:

  • answer_correctness vs accuracy: 96.4% agreement (27/28)
  • faithfulness vs hallucination: 88.5% agreement (23/26) — 3 mismatches explained by different reference sets (specific sections vs. retrieval_context), a structural difference not noise

Tasks

See tasks.md.

Expert Review

Expert review not conducted for this PR.

Success Criteria Check

Criterion Status Evidence
Answer Correctness, Answer Similarity, and Faithfulness computed per QA scenario and included in benchmark report ✅ Met evaluate.py: compute_deepeval_metrics(); report.py: DeepEval columns added; --with-deepeval flag in run_benchmark.sh and run_qa.py
Three metrics validated against current LLM-judge verdicts on 30 QA scenarios: correlation and disagreement cases documented ✅ Met .work/00361/deepeval-validation.md: 28/30 scenarios evaluated; agreement rates and mismatch analysis documented
Benchmark report shows standard metric scores alongside LLM-judge scores ✅ Met report.py adds answer_correctness, answer_relevancy, faithfulness columns to evaluation.json report
Metric selection rationale and PASS/FAIL thresholds documented in docs/benchmark-design.md ✅ Met docs/benchmark-design.md: DeepEval metrics section added with rationale and thresholds (answer_correctness ≥ 0.7, faithfulness ≥ 0.7)
All existing benchmark tests pass with no regressions ✅ Met tools/benchmark/tests/test_evaluate.py and test_report.py pass; 52-file diff check confirmed no unintended changes

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kiyotis kiyotis added the enhancement New feature or request label May 28, 2026
kiyotis and others added 16 commits May 28, 2026 10:34
- Remove post-hoc modification of baseline-current results
- Add incremental validation: 1-run (T7) → 3-run (T8) → full 30-run (T9)
- Add HOW-TO-RUN.md update task (T10)
- Rename diff check to T11

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d add asyncio.run

- build_deepeval_test_case now falls back to workflow_details.step3.selected_sections
  when diagnostics.search_sections is absent (run_qa output format)
- _run_deepeval_metric uses asyncio.run() instead of new_event_loop()
- run_qa.py: add --with-deepeval flag, pass with_deepeval to evaluate_scenario
- test: add workflow_details fallback tests and precedence test

Note: evaluation.json still shows null scores in run_qa context — root cause
of asyncio interaction under claude subprocess call pending investigation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Eval

aiobotocore (used by AmazonBedrockModel async calls) requires AWS_CA_BUNDLE
for SSL certificate verification. Without it, corp proxy cert chains cause
SSLCertVerificationError, silently returning None for all DeepEval scores.

Horizontal check: only compute_deepeval_metrics creates AmazonBedrockModel;
no other call site is affected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
28/30 scenarios evaluated with --with-deepeval.
accuracy vs answer_correctness: 96.4% agreement (27/28).
hallucination vs faithfulness: 88.5% agreement (23/26).
3 hallucination/faithfulness mismatches explained by different
reference sets (specific sections vs. retrieval_context).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add --with-deepeval flag to step 1 and step 2 commands,
add deepeval install prerequisite, and update evaluation.json
description to include DeepEval metrics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kiyotis kiyotis changed the title feat: add tasks.md for DeepEval RAG metrics benchmark (#361) feat: add DeepEval RAG metrics to benchmark pipeline (#361) May 28, 2026
kiyotis and others added 11 commits May 28, 2026 14:31
Issue #361の正しい方針は「置き換え」。T12でLLMジャッジ削除を実装する。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
設計書・手順書・コード・テスト全ての影響箇所を調査済み。
ベストプラクティスに基づきLLMジャッジを削除しDeepEvalに一本化する。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…acement

LLMジャッジ削除後は旧ベースライン(accuracy/hallucination)が無効になるため
QA全件3 runでDeepEvalベースラインを取り直す。
キーワード検索はLLMジャッジ未使用のため取り直し不要。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ics duplication

evaluation.jsonをシンプルに: scores={score+reason}, metrics/diagnostics削除。
report.pyはmetrics.jsonから読み取るよう変更。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kiyotis and others added 10 commits May 28, 2026 15:38
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 runs × 30 scenarios each. All scenarios passed all 3 DeepEval metrics.

| run | answer_correctness | answer_relevancy | faithfulness |
|-----|-------------------|-----------------|--------------|
| run-1 | 0.96 | 0.97 | 0.97 |
| run-2 | 0.99 | 0.96 | 0.97 |
| run-3 | 0.97 | 0.96 | 0.98 |

Threshold pass rate (≥0.5): 30/30 across all runs and metrics.
Replaces the old accuracy/hallucination baseline (baseline-current/).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ndard

answer_correctness: 0.5 → 0.99 (missing facts cause wrong implementations)
faithfulness: 0.5 → 0.99 (hallucinations cause wrong implementations)
answer_relevancy: 0.5 → 0.95 (minor verbosity tolerated, major deviation is not)

Update HOW-TO-RUN.md and benchmark-design.md to reflect new thresholds
and rationale. Fix incorrect --run-dir × 3 command in step 4a.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hmark

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Agent step-transition narration (e.g. "Step 4完了。read_sections=[...]")
was being included in answer.md because parse_qa_response extracted
all text before ### Workflow Details.

The fix introduces a ### Answer marker in e2e-prompt.md Step 8 instruction.
parse_qa_response now extracts only the text between ### Answer and
### Workflow Details. Legacy responses without ### Answer fall back to
the previous behavior (full text before ### Workflow Details).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…in progress

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
run-1: 29/30 (qa-11a timeout)
run-2: 26/30 (review-07, qa-02, qa-06 timeout; oos-qa-01 Workflow Details missing)
run-3: in progress (26/30 done, interrupted at session end)

Error scenarios will be retried at next session start.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

1 participant