nablarch · kiyotis · May 28, 2026
diff --git a/.claude/settings.json b/.claude/settings.json
@@ -31,6 +31,9 @@
       }
     ]
   },
+  "env": {
+    "DEEPEVAL_TELEMETRY_OPT_OUT": "true"
+  },
   "permissions": {
     "allow": [
       "Bash(git *)",

diff --git a/.gitignore b/.gitignore
@@ -28,3 +28,6 @@ __pycache__/
 .venv/
 venv/
 .pytest_cache/
+
+# DeepEval internal cache
+.deepeval/
diff --git a/.work/00361/deepeval-validation.md b/.work/00361/deepeval-validation.md
@@ -0,0 +1,89 @@
+# DeepEval Validation Results
+
+**Date**: 2026-05-28  
+**Run**: `tools/benchmark/results/deepeval-validation/run-1/`  
+**Scenarios**: 30 total, 28 evaluated (qa-11b: missing runner output, qa-15: section not found error)
+
+## Summary
+
+| Metric Pair | Agreement Rate | Mismatches |
+|---|---|---|
+| accuracy vs answer_correctness | 27/28 = **96.4%** | 1 case |
+| hallucination vs faithfulness | 23/26 = **88.5%** | 3 cases |
+
+## Score Overview
+
+| id | accuracy | hallucination | answer_correctness | answer_relevancy | faithfulness |
+|---|---|---|---|---|---|
+| impact-01 | 1.00 | 1 | 1.00 | 1.00 | 0.91 |
+| impact-03 | 1.00 | 1 | 1.00 | 1.00 | 1.00 |
+| impact-06 | 1.00 | 1 | 1.00 | 0.97 | 0.96 |
+| impact-08 | 1.00 | 0 | 1.00 | 1.00 | 0.86 |
+| oos-impact-01 | 1.00 | 1 | 1.00 | 1.00 | 1.00 |
+| oos-qa-01 | 1.00 | N/A | 1.00 | 1.00 | 1.00 |
+| pre-01 | 1.00 | 1 | 1.00 | 0.92 | 1.00 |
+| pre-02 | 1.00 | 1 | 1.00 | 1.00 | 0.95 |
+| pre-03 | 1.00 | 1 | 1.00 | 0.79 | 1.00 |
+| qa-01 | 1.00 | 1 | 1.00 | 1.00 | 1.00 |
+| qa-02 | 1.00 | N/A | 1.00 | 1.00 | 1.00 |
+| qa-03 | 1.00 | 1 | 1.00 | 0.93 | 1.00 |
+| qa-04 | 1.00 | 1 | 1.00 | 1.00 | 1.00 |
+| qa-05 | 0.67 | 1 | 0.60 | 0.90 | 0.94 |
+| qa-06 | 1.00 | 1 | 1.00 | 0.89 | 1.00 |
+| qa-07 | 1.00 | 1 | 1.00 | 1.00 | 0.95 |
+| qa-08 | 1.00 | 1 | 1.00 | 1.00 | 0.93 |
+| qa-09 | 1.00 | 1 | 1.00 | 1.00 | 1.00 |
+| qa-10 | 1.00 | 1 | 1.00 | 1.00 | 1.00 |
+| qa-11a | 1.00 | 1 | 1.00 | 0.94 | 0.96 |
+| qa-12a | 1.00 | 0 | 0.90 | 1.00 | 1.00 |
+| qa-12b | 0.50 | 1 | 1.00 | 1.00 | 0.93 |
+| qa-13 | 1.00 | 0 | 1.00 | 1.00 | 1.00 |
+| qa-14 | 1.00 | 1 | 1.00 | 1.00 | 1.00 |
+| review-06 | 1.00 | 1 | 0.90 | 1.00 | 1.00 |
+| review-07 | 1.00 | 1 | 1.00 | 1.00 | 1.00 |
+| review-08 | 1.00 | 1 | 1.00 | 1.00 | 1.00 |
+| review-09 | 1.00 | 1 | 1.00 | 1.00 | 0.94 |
+
+## Mismatch Cases
+
+### accuracy vs answer_correctness
+
+**qa-12b**: accuracy=0.50 (FAIL) vs answer_correctness=1.00 (PASS)
+
+- Input: 入力チェックでエラーがあったときに、エラーメッセージをユーザーに返す方法を教えてほしい (English: "When input validation finds an error, how do I return an error message to the user?")
+- Analysis: accuracy uses claim-by-claim verdict against `must` facts; LLM judge flagged specific claims as unverified. DeepEval GEval uses a broader "does the output cover the expected facts" criterion, which gave full credit despite partial claim failures. The discrepancy reflects different granularity — claim-level strictness (accuracy) vs. holistic coverage (GEval).
+
+### hallucination vs faithfulness
+
+**impact-08**: hallucination=0 (FAIL) vs faithfulness=0.86 (PASS)
+
+- Input: テスト時だけシステム日時を任意の日付に差し替える方法はあるか？ (English: "Is there a way to swap the system date/time for an arbitrary date only during tests?")
+- Analysis: The existing hallucination judge flagged specific claims as unsupported. DeepEval faithfulness scored 0.86, meaning some statements were not grounded in context, which is consistent with the existing judge. The opposite verdicts come from the pass criteria: hallucination is binary, so any unsupported claim yields 0 (FAIL), while faithfulness=0.86 clears the 0.7 threshold (PASS).
+
+**qa-12a**: hallucination=0 (FAIL) vs faithfulness=1.00 (PASS)
+
+- Analysis: Same root cause as impact-08. The existing hallucination judge applied strict claim-by-claim verification and found at least one unsupported claim. DeepEval faithfulness found every statement supported by the retrieved context, giving 1.00. The hallucination judge likely checks against `must` sections while faithfulness checks against `retrieval_context`, so the two use different reference sets.
+
+**qa-13**: hallucination=0 (FAIL) vs faithfulness=1.00 (PASS)
+
+- Analysis: Same pattern. The hallucination=0 verdict comes from claim verification against specific knowledge sections. DeepEval faithfulness=1.00 means the answer is entirely grounded in what was retrieved. The reference set mismatch (specific sections vs. retrieved context) explains the divergence.
+
+## Root Cause of hallucination vs faithfulness Divergence
+
+The 3 hallucination/faithfulness mismatches share the same root cause: **different reference sets**.
+
+- **Existing hallucination judge**: verifies claims against specific section content from the knowledge base
+- **DeepEval faithfulness**: verifies statements against `retrieval_context` (what was actually retrieved by the skill)
+
+When retrieval is good (high faithfulness) but the answer omits or misrepresents a required fact (hallucination=0), the two metrics legitimately diverge. This is expected behavior, not a measurement error.
+
+## Conclusion
+
+- **answer_correctness correlates strongly with accuracy** (96.4% agreement). The 1 mismatch is attributable to granularity difference (claim-level vs. holistic).
+- **faithfulness has lower agreement with hallucination** (88.5%), explained by different reference sets — a structural difference, not noise.
+- Both DeepEval metrics add complementary signal: answer_correctness as a holistic accuracy check, faithfulness as a retrieval-grounded hallucination check.
+
+## Skipped Scenarios
+
+- **qa-11b**: No runner output — likely excluded from a previous run. Not a DeepEval issue.
+- **qa-15**: `ValueError: Section s21 not found in check/security-check/security-check-2.チェックリスト.json` — pre-existing data issue unrelated to DeepEval integration.
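Review note on `deepeval-validation.md`: the hallucination vs faithfulness agreement rate in the Summary table can be recomputed from the Score Overview rows. The following is a minimal Python sketch, assuming the binary hallucination score passes only at 1 and faithfulness passes at the 0.7 threshold quoted in the impact-08 analysis. The helper names (`faithfulness_passes`, `agreement`) are hypothetical and are not taken from `evaluate.py`; only a handful of rows are copied in for brevity.

```python
from typing import Iterable, List, Optional, Tuple

# Assumption: 0.7 is the faithfulness pass threshold quoted in the impact-08 analysis.
FAITHFULNESS_THRESHOLD = 0.7

Row = Tuple[str, Optional[int], float]  # (scenario id, hallucination, faithfulness)


def faithfulness_passes(score: float) -> bool:
    """Threshold-based verdict for the DeepEval faithfulness score."""
    return score >= FAITHFULNESS_THRESHOLD


def agreement(rows: Iterable[Row]) -> Tuple[int, int, List[str]]:
    """Return (agreeing rows, compared rows, ids of mismatching rows).

    Rows whose hallucination score is None (N/A in the table) are skipped.
    """
    agree, compared, mismatches = 0, 0, []
    for scenario_id, hallucination, faithfulness in rows:
        if hallucination is None:
            continue
        compared += 1
        # Binary judge: 1 means PASS, 0 means FAIL.
        if (hallucination == 1) == faithfulness_passes(faithfulness):
            agree += 1
        else:
            mismatches.append(scenario_id)
    return agree, compared, mismatches


# A subset of rows copied from the Score Overview table.
rows: List[Row] = [
    ("impact-01", 1, 0.91),
    ("impact-08", 0, 0.86),
    ("oos-qa-01", None, 1.00),
    ("qa-12a", 0, 1.00),
    ("qa-13", 0, 1.00),
    ("qa-14", 1, 1.00),
]

if __name__ == "__main__":
    agree, compared, mismatches = agreement(rows)
    print(f"{agree}/{compared} = {agree / compared:.1%}", mismatches)
```

Feeding in all 28 rows of the table leaves 26 compared rows, because the two N/A rows (oos-qa-01 and qa-02) are skipped.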
diff --git a/.work/00361/diff-check.md b/.work/00361/diff-check.md
@@ -0,0 +1,34 @@
+# Diff Check: PR #362
+
+**Date**: 2026-05-28
+
+## Issue #361 Related Changes
+
+| File | Verdict | Note |
+|---|---|---|
+| `tools/benchmark/requirements.txt` | ✅ Expected | Adds the deepeval dependency |
+| `tools/benchmark/scripts/evaluate.py` | ✅ Expected | Adds DeepEval metric calculation functions; SSL fix |
+| `tools/benchmark/scripts/report.py` | ✅ Expected | Adds DeepEval metric columns to the report |
+| `tools/benchmark/scripts/run_qa.py` | ✅ Expected | Adds the --with-deepeval flag |
+| `tools/benchmark/tests/test_evaluate.py` | ✅ Expected | Adds DeepEval-related tests |
+| `tools/benchmark/tests/test_report.py` | ✅ Expected | Adds DeepEval report tests |
+| `docs/benchmark-design.md` | ✅ Expected | Documents the DeepEval metric design |
+| `tools/benchmark/HOW-TO-RUN.md` | ✅ Expected | Adds the --with-deepeval procedure |
+| `.work/00361/notes.md` | ✅ Expected | Work log |
+| `.work/00361/tasks.md` | ✅ Expected | Task tracking |
+| `.work/00361/deepeval-validation.md` | ✅ Expected | SC2: correlation analysis results |
+
+## Other Changes (from merged PRs)
+
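+This branch also contains the merge commits for #352, #354, #358, and #360. All of these are changes that were already merged through separate PRs and then pulled from main into this branch; they are not unintended changes.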
+
+| File group | Source PR | Verdict |
+|---|---|---|
+| `setup.sh`, `.gitignore`, `README.md` | #352/#354/#358 | ✅ Change from a merged PR |
+| `tools/tests/test-setup.sh`, `tools/tests/reports/` | #354/#355 | ✅ Change from a merged PR |
+| `.claude/rules/`, `.claude/marketplace/`, `plugin.json` | #352/#356/#357 | ✅ Change from a merged PR |
+| `tools/benchmark/results/comparison-main-vs-develop-20260527.md` | Analysis file | ✅ Expected (results/ is not covered by .gitignore) |
+
+## Conclusion
+
+No unintended changes.
diff --git a/.work/00361/notes.md b/.work/00361/notes.md
@@ -0,0 +1,75 @@
+# Notes
+
+## 2026-05-28
+
+### T1: Confirm how DeepEval connects to the judge LLM
+
+#### Findings
+
+**1. Installing DeepEval**
+- `uv pip install deepeval` succeeded. `aiobotocore` is also required (`uv pip install aiobotocore`).
+- `import deepeval` works.
+
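+**2. Judge LLM connection method**
+
+Adopted: **Option A (use DeepEval's built-in `AmazonBedrockModel`)**
+
+Rationale:
+- DeepEval ships a built-in `deepeval.models.AmazonBedrockModel`.
+- Instantiation works with `AmazonBedrockModel(model='jp.anthropic.claude-sonnet-4-6', region='ap-northeast-1')`.
+- `AWS_CA_BUNDLE=/usr/local/share/ca-certificates/ca.crt` is already set in the environment, which avoids SSL errors.
+- Confirmed that `a_generate('Say hello in one word.')` actually succeeds.
+- `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY`/`AWS_REGION` are already set as environment variables.
+
+Rejected options:
+- Option B (wrapping the claude CLI as a subprocess): fitting it to DeepEval's asynchronous call structure would be complex, and it is unnecessary because Option A already works.
+- Option C (custom implementation): rejected because it would forgo DeepEval's quality-assured prompts.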
+
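+**3. Metrics used**
+
+`AnswerCorrectnessMetric`/`AnswerSimilarityMetric` do not exist in the latest DeepEval release.
+The following three metrics are used instead:
+- `GEval` (for Answer Correctness: evaluates fact coverage against custom criteria)
+- `AnswerRelevancyMetric` (Relevancy: how relevant the answer is to the input)
+- `FaithfulnessMetric` (Faithfulness: detects hallucination relative to the retrieval context)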
+
+## 2026-05-29
+
+### T19: baseline-deepeval 3-run results
+
+All 30 scenarios × 3 runs completed (some scenarios hit sporadic errors and were recovered by rerunning them).
+
+| run | answer_correctness | answer_relevancy | faithfulness | Threshold pass |
+|-----|-------------------|-----------------|--------------|---------|
+| run-1 | 0.96 | 0.97 | 0.97 | 30/30, all metrics |
+| run-2 | 0.99 | 0.96 | 0.97 | 30/30, all metrics |
+| run-3 | 0.97 | 0.96 | 0.98 | 30/30, all metrics |
+
+The threshold (≥0.5) pass rate is 100% for every metric, and the scores are stable (0.96 to 0.99).
+This result is adopted as the new baseline (`baseline-deepeval/`).
+
+→ Correspondence with the existing benchmark:
+- `accuracy` (existing) ↔ `GEval` (Answer Correctness)
+- `hallucination` (existing) ↔ `FaithfulnessMetric`
+
+**4. Mapping to LLMTestCase**
+
+Mapping from the existing data to `LLMTestCase`:
+- `input` ← `scenario["when"]["input"]` (the scenario's question)
+- `actual_output` ← contents of `answer.md`
+- `expected_output` ← `must.facts` joined with newlines (for Answer Correctness/GEval)
+- `retrieval_context` ← content of each section listed in `diagnostics.search_sections` (a list of section refs)
+
+**Note**: evaluation.json has no `workflow_details.step3.selected_pages`.
+The actual retrieval context is `diagnostics.search_sections` (section_id format: `path/to/file.json:sN`).
+The existing `load_section_content()` function can fetch the content.
+
+**5. Required corrections to T2 and later tasks**
+
+T4 (evaluate.py):
+- Take `retrieval_context` from `diagnostics.search_sections` (not `workflow_details.step3.selected_pages`)
+- The three metrics are `GEval` (answer_correctness), `AnswerRelevancyMetric` (answer_relevancy), and `FaithfulnessMetric` (faithfulness)
+- Model settings: `AmazonBedrockModel(model=os.environ.get('BEDROCK_MODEL_ID', 'jp.anthropic.claude-sonnet-4-6'), region=os.environ.get('AWS_REGION', 'ap-northeast-1'))`
+
+T2 (requirements.txt):
+- Add both `deepeval` and `aiobotocore`
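Review note on `notes.md`, section 4: the field mapping described there can be sketched as a small builder. This is an illustration under stated assumptions, not the code in `evaluate.py`. `build_test_case_kwargs` and its `load_section` parameter are hypothetical names (the notes mention an existing `load_section_content()` but not its signature), and the result is a plain dict so the sketch runs without DeepEval installed.

```python
from typing import Callable, Dict, List


def build_test_case_kwargs(
    scenario: Dict,
    answer_md: str,
    must_facts: List[str],
    search_sections: List[str],
    load_section: Callable[[str], str],
) -> Dict:
    """Map benchmark data onto the LLMTestCase field names listed in the notes.

    load_section is a hypothetical stand-in for the repo's section loader:
    it takes a section ref such as "path/to/file.json:sN" and returns text.
    """
    return {
        # The scenario's question.
        "input": scenario["when"]["input"],
        # Contents of answer.md.
        "actual_output": answer_md,
        # must.facts joined with newlines (used by the GEval metric).
        "expected_output": "\n".join(must_facts),
        # One context string per section ref in diagnostics.search_sections.
        "retrieval_context": [load_section(ref) for ref in search_sections],
    }


if __name__ == "__main__":
    # Made-up data purely for illustration.
    fake_sections = {"path/to/file.json:s1": "Section one text."}
    kwargs = build_test_case_kwargs(
        scenario={"when": {"input": "example question"}},
        answer_md="example answer",
        must_facts=["fact A", "fact B"],
        search_sections=["path/to/file.json:s1"],
        load_section=fake_sections.__getitem__,
    )
    print(kwargs)
```

The returned keys match the `LLMTestCase` field names, so with DeepEval installed the dict could be passed as `LLMTestCase(**kwargs)`.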
diff --git a/.work/00361/tasks.md b/.work/00361/tasks.md
@@ -0,0 +1,78 @@
+# Tasks: Replace LLM judge with DeepEval RAG metrics in QA benchmark
+
+**PR**: #362
+**Issue**: #361
+**Updated**: 2026-05-29
+
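+## Rules (added today)
+- Added `DEEPEVAL_TELEMETRY_OPT_OUT=true` to `.claude/settings.json` (Apache 2.0 license; opting out is permitted)
+
+## Rules
+
+- Investigate, work, and decide based on facts, not guesses. Do not guess the scope of impact without reading the code. Confirm with grep before writing.
+- 1 task = 1 commit (investigation tasks are complete once recorded in notes)
+- Write tests before implementing (TDD: RED → GREEN)
+- Commit and push tasks.md immediately after each task is completed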
+
+---
+
+## In Progress
+
+### T22: Re-acquire the benchmark (3 runs)
+
+**Background**: After the T21 fix, re-acquire the baseline from a clean state.
+
+**Work**:
+- [x] Execute run-1 → saved to `baseline-deepeval/run-1/run/` (29/30; qa-11a timed out)
+- [x] Execute run-2 → saved to `baseline-deepeval/run-2/run/` (26/30; 3 timeouts + oos-qa-01 error)
+- [ ] Rerun the failed scenarios from run-1/2 individually and overwrite the results (HOW-TO-RUN.md timeout rerun procedure)
+  - run-1: qa-11a (timeout)
+  - run-2: review-07, qa-02, qa-06 (timeout), oos-qa-01 (missing Workflow Details)
+- [ ] Execute run-3 → save to `baseline-deepeval/run-3/`
+  - **Interrupted state**: 26 scenarios completed in `tools/benchmark/results/20260529-150210/` (no summary.json)
+  - Remaining 4 scenarios: qa-14, qa-15, oos-impact-01, oos-qa-01
+  - Reuse the interrupted data (`tools/benchmark/results/20260529-150210/`) (confirmed with the user)
+  - Run the remaining 4 scenarios individually with `--scenario-ids qa-14,qa-15,oos-impact-01,oos-qa-01`
+  - After completion, copy the results into `20260529-150210/` and save it as `baseline-deepeval/run-3/run/`
+- [ ] After each run, generate the report with `report.py` and check for threshold failures (HOW-TO-RUN.md step 3)
+- [ ] Aggregate the 3 runs (step 4a)
+- [ ] Decide on improvements for scenarios below the threshold (step 5)
+
+**Commit**: `chore: save baseline-deepeval QA benchmark results (3 runs)`
+
+**Intermediate data locations**:
+- run-1: `tools/benchmark/results/baseline-deepeval/run-1/run/` (untracked in git)
+- run-2: `tools/benchmark/results/baseline-deepeval/run-2/run/` (untracked)
+- run-3 (interrupted): `tools/benchmark/results/20260529-150210/` (untracked)
+
+---
+
+### T20: Diff check + update diff-check.md
+
+**Commit**: `docs: update diff check for LLM judge removal`
+
+---
+
+## Done
+
+- [x] T21: Fix e2e-prompt.md / run_qa.py (introduce the Answer marker) - committed `6c5213430`
+- [x] T19: Full QA benchmark run and new baseline acquisition (3 runs) - 30/30 scenarios, all metrics 0.96 to 0.99 (discarded because it predates the T21 fix)
+
+- [x] T1: Investigation - confirm DeepEval's judge LLM connection method and the LLMTestCase input mapping - `5530ab20`
+- [x] T2: Create requirements.txt + setup.sh - `93669a7b`
+- [x] T3: Add tests (RED) - unit tests for the 3 DeepEval metric calculations - `1efc394e`
+- [x] T4: Implement evaluate.py (GREEN) - add functions that calculate the 3 DeepEval metrics - `1c7a6a0e`
+- [x] T5: report.py - add DeepEval metric columns to the report - `d87da7de`
+- [x] T6: docs/benchmark-design.md - document the DeepEval metric design - `93101e85`
+- [x] T7: Verification run (1 scenario) and SSL fix - `77a43974`
+- [x] T8: Verification run (3 scenarios) - (run only)
+- [x] T9: Full run + correlation analysis (SC2) - `bbcc37a50`
+- [x] T10: Update HOW-TO-RUN.md (to be overwritten in T13) - `f6195085c`
+- [x] T11: Diff check (to be updated in T19) - `7d1a0d52d`
+- [x] T12: Update docs/benchmark-design.md - `4682e518`
+- [x] T13: Update tools/benchmark/HOW-TO-RUN.md - `03206b0b`
+- [x] T14: Update tests (RED) - `e202bbb9`
+- [x] T15: Change the evaluate.py implementation (GREEN) - `00bcd0e1`
+- [x] T16: Change the report.py implementation - `5513641a`
+- [x] T17: Remove the --with-deepeval flag from run_qa.py - `4d97f74d`
+- [x] T18: Verification run (1 scenario) - run only, no commit