38 commits
e8702db
docs: add tasks.md for issue #361 DeepEval RAG metrics
kiyotis May 28, 2026
d114a9c
docs: update tasks.md — revise T7-T11 for correct benchmark flow
kiyotis May 28, 2026
7f1fedf
docs: clarify T1 — rename to judge LLM connection method investigation
kiyotis May 28, 2026
5530ab2
docs: update tasks.md — T1 done, add notes.md with investigation results
kiyotis May 28, 2026
93669a7
chore: add benchmark requirements.txt and setup.sh install step
kiyotis May 28, 2026
1efc394
test: add DeepEval metric computation tests (RED)
kiyotis May 28, 2026
1c7a6a0
feat: add DeepEval metric computation to evaluate.py
kiyotis May 28, 2026
d87da7d
feat: add DeepEval metric columns to benchmark report
kiyotis May 28, 2026
93101e8
docs: add DeepEval metrics design to benchmark-design.md
kiyotis May 28, 2026
695889b
fix: support workflow_details fallback in build_deepeval_test_case an…
kiyotis May 28, 2026
de1aff7
chore: add .deepeval/ to .gitignore
kiyotis May 28, 2026
77a4397
fix: set AWS_CA_BUNDLE from SSL_CERT_FILE for aiobotocore SSL in Deep…
kiyotis May 28, 2026
94f9e69
docs: update tasks.md - T7 done
kiyotis May 28, 2026
bbcc37a
docs: add DeepEval validation results (SC2)
kiyotis May 28, 2026
f619508
docs: update HOW-TO-RUN.md for DeepEval integration
kiyotis May 28, 2026
7d1a0d5
docs: add diff check and update tasks.md - T9/T10/T11 done
kiyotis May 28, 2026
fdd2dd4
docs: update tasks.md - all tasks done
kiyotis May 28, 2026
cbe11a1
docs: update tasks.md — add T12-T14 for LLM judge removal
kiyotis May 28, 2026
d87f948
docs: update tasks.md — T12-T19 with full impact scope
kiyotis May 28, 2026
d41574d
docs: update tasks.md — add T19 QA baseline rerun after DeepEval repl…
kiyotis May 28, 2026
3b64cff
docs: update tasks.md — add reason to scores, remove metrics/diagnost…
kiyotis May 28, 2026
4682e51
docs: rewrite benchmark-design.md for DeepEval replacement
kiyotis May 28, 2026
03206b0
docs: rewrite HOW-TO-RUN.md for DeepEval replacement
kiyotis May 28, 2026
e202bbb
test: update tests for DeepEval-only evaluation
kiyotis May 28, 2026
00bcd0e
feat: remove LLM judges from evaluate.py, use DeepEval only
kiyotis May 28, 2026
5513641
feat: remove LLM judge columns from report.py
kiyotis May 28, 2026
4d97f74
feat: remove --with-deepeval flag, DeepEval always runs
kiyotis May 28, 2026
91492a7
docs: update tasks.md — T12-T17 complete
kiyotis May 28, 2026
536bf36
docs: update tasks.md — T12-T18 done, T19-T20 remaining
kiyotis May 28, 2026
69d7967
chore: opt out of DeepEval telemetry + update tasks.md (T19 run-1 done)
kiyotis May 28, 2026
be8ccc8
chore: save baseline-deepeval QA benchmark results (3 runs)
kiyotis May 29, 2026
68c6e42
feat: raise DeepEval thresholds to match mission-critical quality sta…
kiyotis May 29, 2026
df15a9b
docs: update tasks.md — add T21/T22 for answer marker fix and re-benc…
kiyotis May 29, 2026
6c52134
fix: use ### Answer marker to isolate answer from workflow narration
kiyotis May 29, 2026
c53aa64
docs: update tasks.md — T21 done, T22 in progress
kiyotis May 29, 2026
22273ac
docs: update tasks.md + HOW-TO-RUN.md — timeout retry procedure, T22 …
kiyotis May 29, 2026
6665c42
chore: save baseline-deepeval run-1 and run-2 intermediate results
kiyotis May 29, 2026
54fc093
docs: update tasks.md — run-3 resume strategy confirmed
kiyotis May 29, 2026
3 changes: 3 additions & 0 deletions .claude/settings.json
@@ -31,6 +31,9 @@
}
]
},
"env": {
"DEEPEVAL_TELEMETRY_OPT_OUT": "true"
},
"permissions": {
"allow": [
"Bash(git *)",
3 changes: 3 additions & 0 deletions .gitignore
@@ -28,3 +28,6 @@ __pycache__/
.venv/
venv/
.pytest_cache/

# DeepEval internal cache
.deepeval/
89 changes: 89 additions & 0 deletions .work/00361/deepeval-validation.md
@@ -0,0 +1,89 @@
# DeepEval Validation Results

**Date**: 2026-05-28
**Run**: `tools/benchmark/results/deepeval-validation/run-1/`
**Scenarios**: 30 total, 28 evaluated (qa-11b: missing runner output, qa-15: section not found error)

## Summary

| Metric Pair | Agreement Rate | Mismatches |
|---|---|---|
| accuracy vs answer_correctness | 27/28 = **96.4%** | 1 case |
| hallucination vs faithfulness | 23/26 = **88.5%** | 3 cases |
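
The hallucination vs faithfulness agreement rate can be reproduced with a short script. The sketch below is illustrative only: `agreement` and `sample` are hypothetical names, the rows are a handful copied from the Score Overview table rather than all 28, and it assumes that `hallucination=1` means PASS, that N/A rows are excluded, and that faithfulness passes at the 0.7 threshold cited in the mismatch analysis.

```python
FAITHFULNESS_THRESHOLD = 0.7  # pass threshold cited in the mismatch analysis


def agreement(rows, threshold=FAITHFULNESS_THRESHOLD):
    """Return (agreeing, compared, mismatched_ids) for hallucination vs faithfulness."""
    agreeing, compared, mismatched = 0, 0, []
    for scenario_id, hallucination, faithfulness in rows:
        if hallucination is None:  # N/A rows are left out of the comparison
            continue
        compared += 1
        judge_pass = hallucination == 1  # assumption: 1 = PASS, 0 = FAIL
        deepeval_pass = faithfulness >= threshold
        if judge_pass == deepeval_pass:
            agreeing += 1
        else:
            mismatched.append(scenario_id)
    return agreeing, compared, mismatched


# A few rows copied from the Score Overview table (not the full set).
sample = [
    ("impact-01", 1, 0.91),
    ("impact-08", 0, 0.86),
    ("oos-qa-01", None, 1.00),
    ("qa-01", 1, 1.00),
    ("qa-12a", 0, 1.00),
    ("qa-13", 0, 1.00),
]

agreeing, compared, mismatched = agreement(sample)
print(f"{agreeing}/{compared} agree; mismatches: {mismatched}")
```

Run over every row with a non-N/A hallucination verdict, the same logic should yield the 23/26 figure in the table, since only impact-08, qa-12a and qa-13 have `hallucination=0` while every faithfulness score is above 0.7.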

## Score Overview

| id | accuracy | hallucination | answer_correctness | answer_relevancy | faithfulness |
|---|---|---|---|---|---|
| impact-01 | 1.00 | 1 | 1.00 | 1.00 | 0.91 |
| impact-03 | 1.00 | 1 | 1.00 | 1.00 | 1.00 |
| impact-06 | 1.00 | 1 | 1.00 | 0.97 | 0.96 |
| impact-08 | 1.00 | 0 | 1.00 | 1.00 | 0.86 |
| oos-impact-01 | 1.00 | 1 | 1.00 | 1.00 | 1.00 |
| oos-qa-01 | 1.00 | N/A | 1.00 | 1.00 | 1.00 |
| pre-01 | 1.00 | 1 | 1.00 | 0.92 | 1.00 |
| pre-02 | 1.00 | 1 | 1.00 | 1.00 | 0.95 |
| pre-03 | 1.00 | 1 | 1.00 | 0.79 | 1.00 |
| qa-01 | 1.00 | 1 | 1.00 | 1.00 | 1.00 |
| qa-02 | 1.00 | N/A | 1.00 | 1.00 | 1.00 |
| qa-03 | 1.00 | 1 | 1.00 | 0.93 | 1.00 |
| qa-04 | 1.00 | 1 | 1.00 | 1.00 | 1.00 |
| qa-05 | 0.67 | 1 | 0.60 | 0.90 | 0.94 |
| qa-06 | 1.00 | 1 | 1.00 | 0.89 | 1.00 |
| qa-07 | 1.00 | 1 | 1.00 | 1.00 | 0.95 |
| qa-08 | 1.00 | 1 | 1.00 | 1.00 | 0.93 |
| qa-09 | 1.00 | 1 | 1.00 | 1.00 | 1.00 |
| qa-10 | 1.00 | 1 | 1.00 | 1.00 | 1.00 |
| qa-11a | 1.00 | 1 | 1.00 | 0.94 | 0.96 |
| qa-12a | 1.00 | 0 | 0.90 | 1.00 | 1.00 |
| qa-12b | 0.50 | 1 | 1.00 | 1.00 | 0.93 |
| qa-13 | 1.00 | 0 | 1.00 | 1.00 | 1.00 |
| qa-14 | 1.00 | 1 | 1.00 | 1.00 | 1.00 |
| review-06 | 1.00 | 1 | 0.90 | 1.00 | 1.00 |
| review-07 | 1.00 | 1 | 1.00 | 1.00 | 1.00 |
| review-08 | 1.00 | 1 | 1.00 | 1.00 | 1.00 |
| review-09 | 1.00 | 1 | 1.00 | 1.00 | 0.94 |

## Mismatch Cases

### accuracy vs answer_correctness

**qa-12b**: accuracy=0.50 (FAIL) vs answer_correctness=1.00 (PASS)

- Input: How do I return an error message to the user when input validation finds an error?
- Analysis: accuracy uses claim-by-claim verdict against `must` facts; LLM judge flagged specific claims as unverified. DeepEval GEval uses a broader "does the output cover the expected facts" criterion, which gave full credit despite partial claim failures. The discrepancy reflects different granularity — claim-level strictness (accuracy) vs. holistic coverage (GEval).

### hallucination vs faithfulness

**impact-08**: hallucination=0 (FAIL) vs faithfulness=0.86 (PASS)

- Input: Is there a way to swap the system date and time for an arbitrary date only during tests?
- Analysis: The existing hallucination judge flagged specific claims as unsupported. DeepEval faithfulness scored 0.86, meaning some statements were not grounded in context — consistent with the existing judge — but the threshold difference (0 vs 0.7) caused opposite verdicts. hallucination=0 is a binary FAIL; faithfulness=0.86 passes the 0.7 threshold.

**qa-12a**: hallucination=0 (FAIL) vs faithfulness=1.00 (PASS)

- Analysis: Same verdict pattern as impact-08. The existing hallucination judge applied strict claim-by-claim verification and found at least one unsupported claim, while DeepEval faithfulness found every statement supported by the retrieved context and scored 1.00. The hallucination judge likely checks against `must` sections while faithfulness checks against `retrieval_context`, so the two use different reference sets.

**qa-13**: hallucination=0 (FAIL) vs faithfulness=1.00 (PASS)

- Analysis: Same pattern. The hallucination=0 verdict comes from claim verification against specific knowledge sections. DeepEval faithfulness=1.00 means the answer is entirely grounded in what was retrieved. The reference set mismatch (specific sections vs. retrieved context) explains the divergence.

## Root Cause of hallucination vs faithfulness Divergence

The 3 hallucination/faithfulness mismatches share the same root cause: **different reference sets**.

- **Existing hallucination judge**: verifies claims against specific section content from the knowledge base
- **DeepEval faithfulness**: verifies statements against `retrieval_context` (what was actually retrieved by the skill)

When retrieval is good (high faithfulness) but the answer omits or misrepresents a required fact (hallucination=0), the two metrics legitimately diverge. This is expected behavior, not a measurement error.

## Conclusion

- **answer_correctness correlates strongly with accuracy** (96.4% agreement). The 1 mismatch is attributable to granularity difference (claim-level vs. holistic).
- **faithfulness has lower agreement with hallucination** (88.5%), explained by different reference sets — a structural difference, not noise.
- Both DeepEval metrics add complementary signal: answer_correctness as a holistic accuracy check, faithfulness as a retrieval-grounded hallucination check.

## Skipped Scenarios

- **qa-11b**: No runner output — likely excluded from a previous run. Not a DeepEval issue.
- **qa-15**: `ValueError: Section s21 not found in check/security-check/security-check-2.チェックリスト.json` — pre-existing data issue unrelated to DeepEval integration.
34 changes: 34 additions & 0 deletions .work/00361/diff-check.md
@@ -0,0 +1,34 @@
# Diff Check: PR #362

**Date**: 2026-05-28

## Issue #361 Related Changes

| File | Verdict | Note |
|---|---|---|
| `tools/benchmark/requirements.txt` | ✅ Expected | Adds the deepeval dependency |
| `tools/benchmark/scripts/evaluate.py` | ✅ Expected | Adds DeepEval metric computation functions; SSL fix |
| `tools/benchmark/scripts/report.py` | ✅ Expected | Adds DeepEval metric columns to the report |
| `tools/benchmark/scripts/run_qa.py` | ✅ Expected | Adds the --with-deepeval flag |
| `tools/benchmark/tests/test_evaluate.py` | ✅ Expected | Adds DeepEval-related tests |
| `tools/benchmark/tests/test_report.py` | ✅ Expected | Adds DeepEval report tests |
| `docs/benchmark-design.md` | ✅ Expected | Appends the DeepEval metrics design |
| `tools/benchmark/HOW-TO-RUN.md` | ✅ Expected | Adds the --with-deepeval procedure |
| `.work/00361/notes.md` | ✅ Expected | Work log |
| `.work/00361/tasks.md` | ✅ Expected | Task tracking |
| `.work/00361/deepeval-validation.md` | ✅ Expected | SC2: correlation analysis results |

## Other Changes (from merged PRs)

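This branch also includes the merge commits for #352, #354, #358, and #360. These are all changes that were already merged through separate PRs and then pulled from main into this branch, so they are not unintended changes.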

| File group | Source PR | Verdict |
|---|---|---|
| `setup.sh`, `.gitignore`, `README.md` | #352/#354/#358 | ✅ Change from a merged PR |
| `tools/tests/test-setup.sh`, `tools/tests/reports/` | #354/#355 | ✅ Change from a merged PR |
| `.claude/rules/`, `.claude/marketplace/`, `plugin.json` | #352/#356/#357 | ✅ Change from a merged PR |
| `tools/benchmark/results/comparison-main-vs-develop-20260527.md` | Analysis file | ✅ Expected (results/ is not covered by .gitignore) |

## Conclusion

No unintended changes.
75 changes: 75 additions & 0 deletions .work/00361/notes.md
@@ -0,0 +1,75 @@
# Notes

## 2026-05-28

### T1: Confirm the DeepEval judge LLM connection method

#### Findings

**1. Installing DeepEval**
- `uv pip install deepeval` succeeded. `aiobotocore` is also required (`uv pip install aiobotocore`).
- `import deepeval` works.

**2. Judge LLM connection method**

Adopted: **Option A (use DeepEval's built-in `AmazonBedrockModel`)**

Rationale:
- DeepEval ships a built-in `deepeval.models.AmazonBedrockModel`.
- `AmazonBedrockModel(model='jp.anthropic.claude-sonnet-4-6', region='ap-northeast-1')` instantiates successfully.
- The environment already sets `AWS_CA_BUNDLE=/usr/local/share/ca-certificates/ca.crt`, which avoids SSL errors.
- Confirmed that `a_generate('Say hello in one word.')` actually succeeds.
- `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY`/`AWS_REGION` are already set as environment variables.
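
The SSL point above, together with the later commit that sets `AWS_CA_BUNDLE` from `SSL_CERT_FILE`, can be sketched roughly as follows. This is not the code from `evaluate.py`: `ensure_aws_ca_bundle` is a hypothetical helper, and the choice never to overwrite an existing `AWS_CA_BUNDLE` is an assumption.

```python
import os


def ensure_aws_ca_bundle(env=None):
    """Copy SSL_CERT_FILE into AWS_CA_BUNDLE when the latter is unset.

    Hypothetical helper; an existing AWS_CA_BUNDLE is left untouched.
    """
    env = os.environ if env is None else env
    cert_file = env.get("SSL_CERT_FILE")
    if cert_file and not env.get("AWS_CA_BUNDLE"):
        env["AWS_CA_BUNDLE"] = cert_file
    return env.get("AWS_CA_BUNDLE")


# Example with a plain dict standing in for os.environ.
fake_env = {"SSL_CERT_FILE": "/usr/local/share/ca-certificates/ca.crt"}
print(ensure_aws_ca_bundle(fake_env))
```

Calling such a helper before the Bedrock client is created keeps the certificate path in one place while still letting an explicit `AWS_CA_BUNDLE` take precedence.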

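Rejected options:
- Option B (wrapping the claude CLI as a subprocess): adapting it to DeepEval's asynchronous call structure would be complex, and it is unnecessary because Option A already works.
- Option C (custom implementation): unnecessary because it could not use DeepEval's quality-assured prompts.

**3. Metrics used**

`AnswerCorrectnessMetric`/`AnswerSimilarityMetric` do not exist in the latest DeepEval release.
The following three metrics are used instead:
- `GEval` (for Answer Correctness: evaluates fact coverage against custom criteria)
- `AnswerRelevancyMetric` (Relevancy: how relevant the answer is to the input)
- `FaithfulnessMetric` (Faithfulness: hallucination detection against the retrieval context)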

## 2026-05-29

### T19: baseline-deepeval 3-run results

All 30 scenarios × 3 runs completed (some scenarios hit incidental errors and were recovered by rerunning them).

| run | answer_correctness | answer_relevancy | faithfulness | Passed threshold |
|-----|-------------------|-----------------|--------------|---------|
| run-1 | 0.96 | 0.97 | 0.97 | 30/30, all metrics |
| run-2 | 0.99 | 0.96 | 0.97 | 30/30, all metrics |
| run-3 | 0.97 | 0.96 | 0.98 | 30/30, all metrics |

Every metric passes the threshold (≥0.5) at a 100% rate, and the scores are stable (0.96 to 0.99).
This is fixed as the new baseline (`baseline-deepeval/`).
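
The stability claim can be checked by aggregating the three rows of the table. The snippet below is a minimal sketch, not the project's aggregation step: `summarize` is a hypothetical function, and the values are the per-run scores copied from the table above.

```python
from statistics import mean

THRESHOLD = 0.5  # pass threshold stated in the notes

# Per-run scores copied from the table above.
runs = {
    "run-1": {"answer_correctness": 0.96, "answer_relevancy": 0.97, "faithfulness": 0.97},
    "run-2": {"answer_correctness": 0.99, "answer_relevancy": 0.96, "faithfulness": 0.97},
    "run-3": {"answer_correctness": 0.97, "answer_relevancy": 0.96, "faithfulness": 0.98},
}


def summarize(runs):
    """Return mean/min/max per metric across runs (hypothetical helper)."""
    metrics = list(next(iter(runs.values())).keys())
    summary = {}
    for metric in metrics:
        values = [scores[metric] for scores in runs.values()]
        summary[metric] = {
            "mean": round(mean(values), 2),
            "min": min(values),
            "max": max(values),
        }
    return summary


summary = summarize(runs)
all_pass = all(stats["min"] >= THRESHOLD for stats in summary.values())
for metric, stats in summary.items():
    print(metric, stats)
print("all runs above threshold:", all_pass)
```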

→ Mapping to the existing benchmark:
- `accuracy` (existing) ↔ `GEval` (Answer Correctness)
- `hallucination` (existing) ↔ `FaithfulnessMetric`

**4. Mapping to LLMTestCase**

Mapping from the existing data to `LLMTestCase`:
- `input` ← `scenario["when"]["input"]` (the scenario question)
- `actual_output` ← contents of `answer.md`
- `expected_output` ← `must.facts` joined with newlines (for Answer Correctness/GEval)
- `retrieval_context` ← content of each section in `diagnostics.search_sections` (a list of section refs)
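
The mapping above can be sketched as a plain function that assembles the four fields. This is only an illustration: `build_test_case_kwargs` and the injected `load_section` callable are hypothetical, and the caller is assumed to have already extracted `must.facts` and the section refs from the existing data. The resulting dict uses the `LLMTestCase` field names listed above, so it should be usable as keyword arguments to that class.

```python
from typing import Callable, Dict, List


def build_test_case_kwargs(
    scenario: Dict,
    answer_text: str,
    must_facts: List[str],
    section_refs: List[str],
    load_section: Callable[[str], str],
) -> Dict:
    """Assemble LLMTestCase-style fields from benchmark data (hypothetical helper)."""
    return {
        "input": scenario["when"]["input"],
        "actual_output": answer_text,
        # must.facts joined with newlines, used by the GEval correctness check
        "expected_output": "\n".join(must_facts),
        # one context string per retrieved section ref
        "retrieval_context": [load_section(ref) for ref in section_refs],
    }


# Example with made-up data and a stand-in loader.
fake_sections = {"path/to/file.json:s1": "Section one text."}
kwargs = build_test_case_kwargs(
    scenario={"when": {"input": "How do I do X?"}},
    answer_text="Use Y.",
    must_facts=["Y is required", "Z is optional"],
    section_refs=["path/to/file.json:s1"],
    load_section=fake_sections.__getitem__,
)
print(kwargs)
```

Injecting the loader keeps the sketch independent of the real `load_section_content()` signature, which is not shown in these notes.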

**Note**: evaluation.json has no `workflow_details.step3.selected_pages`.
The actual retrieval context is `diagnostics.search_sections` (section_id format: `path/to/file.json:sN`).
The existing `load_section_content()` function can fetch the content.
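
A section ref in the `path/to/file.json:sN` format splits naturally on its last colon. The helper below is a hypothetical sketch of that parsing step and does not reflect how `load_section_content()` handles refs internally.

```python
from typing import Tuple


def split_section_ref(ref: str) -> Tuple[str, str]:
    """Split 'path/to/file.json:sN' into (file path, section id).

    Hypothetical helper; splits on the last colon so that the path part
    is kept whole.
    """
    path, sep, section_id = ref.rpartition(":")
    if not sep or not path or not section_id:
        raise ValueError(f"Not a section ref: {ref!r}")
    return path, section_id


print(split_section_ref("path/to/file.json:s3"))
```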

**5. Changes needed to tasks T2 onward**

T4 (evaluate.py):
- Take `retrieval_context` from `diagnostics.search_sections` (not `workflow_details.step3.selected_pages`)
- The three metrics are `GEval` (answer_correctness), `AnswerRelevancyMetric` (answer_relevancy), and `FaithfulnessMetric` (faithfulness)
- Model configuration: `AmazonBedrockModel(model=os.environ.get('BEDROCK_MODEL_ID', 'jp.anthropic.claude-sonnet-4-6'), region=os.environ.get('AWS_REGION', 'ap-northeast-1'))`
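
The model configuration line can be factored into a small helper that resolves the two settings from the environment. This is a sketch under the assumptions stated in the notes (the `model`/`region` keyword names and the two defaults); `bedrock_model_kwargs` itself is a hypothetical name, and the snippet deliberately stops short of constructing `AmazonBedrockModel` so that it runs without DeepEval or AWS credentials.

```python
import os

DEFAULT_MODEL_ID = "jp.anthropic.claude-sonnet-4-6"
DEFAULT_REGION = "ap-northeast-1"


def bedrock_model_kwargs(env=None):
    """Resolve judge model settings from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        "model": env.get("BEDROCK_MODEL_ID", DEFAULT_MODEL_ID),
        "region": env.get("AWS_REGION", DEFAULT_REGION),
    }


# Example: override only the region.
print(bedrock_model_kwargs({"AWS_REGION": "us-east-1"}))
```

The returned dict is meant to be passed as `AmazonBedrockModel(**bedrock_model_kwargs())`, matching the call shown in the notes.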

T2 (requirements.txt):
- Add both `deepeval` and `aiobotocore`
78 changes: 78 additions & 0 deletions .work/00361/tasks.md
@@ -0,0 +1,78 @@
# Tasks: Replace LLM judge with DeepEval RAG metrics in QA benchmark

**PR**: #362
**Issue**: #361
**Updated**: 2026-05-29

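## Rules (added today)
- Added `DEEPEVAL_TELEMETRY_OPT_OUT=true` to `.claude/settings.json` (Apache 2.0 license, opt-out permitted)

## Rules

- Investigate, work, and decide from facts, not guesses. Do not guess the impact scope without reading the code. Confirm with grep before writing.
- 1 task = 1 commit (investigation tasks are complete once recorded in the notes)
- Write tests before implementing (TDD: RED → GREEN)
- Commit and push tasks.md immediately after each task is completed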

---

## In Progress

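### T22: Re-acquire the benchmark (3 runs)

**Background**: After the T21 fix, re-acquire the baseline from a clean state.

**Work**:
- [x] Run run-1 → saved to `baseline-deepeval/run-1/run/` (29/30, qa-11a timed out)
- [x] Run run-2 → saved to `baseline-deepeval/run-2/run/` (26/30, 3 timeouts + oos-qa-01 error)
- [ ] Rerun the failed scenarios from run-1/2 individually and overwrite the results (HOW-TO-RUN.md timeout rerun procedure)
  - run-1: qa-11a (timeout)
  - run-2: review-07, qa-02, qa-06 (timeout), oos-qa-01 (missing Workflow Details)
- [ ] Run run-3 → save to `baseline-deepeval/run-3/`
  - **Interrupted state**: 26 scenarios already completed in `tools/benchmark/results/20260529-150210/` (no summary.json)
  - Remaining 4 scenarios: qa-14, qa-15, oos-impact-01, oos-qa-01
  - Reuse the interrupted data (`tools/benchmark/results/20260529-150210/`) (confirmed with the user)
  - Run the remaining 4 scenarios individually with `--scenario-ids qa-14,qa-15,oos-impact-01,oos-qa-01`
  - After completion, copy the results into `20260529-150210/` and save it as `baseline-deepeval/run-3/run/`
- [ ] After each run, generate the report with `report.py` and check for scores below the threshold (HOW-TO-RUN.md step 3)
- [ ] Aggregate the 3 runs (step 4a)
- [ ] Decide on improvements for below-threshold scenarios (step 5)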
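
The resume step for run-3 amounts to subtracting the completed scenarios from the full list and passing the rest to `--scenario-ids`. The sketch below shows that calculation only; `remaining_scenarios` and `scenario_ids_flag` are hypothetical helpers, and how completed scenarios are detected in the results directory is left to the caller because it is not specified here.

```python
from typing import Iterable, List


def remaining_scenarios(all_ids: Iterable[str], completed_ids: Iterable[str]) -> List[str]:
    """Return the ids still to run, preserving the original order."""
    done = set(completed_ids)
    return [scenario_id for scenario_id in all_ids if scenario_id not in done]


def scenario_ids_flag(ids: Iterable[str]) -> str:
    """Format ids as the --scenario-ids argument used in the task list."""
    return "--scenario-ids " + ",".join(ids)


# Shortened example: one completed scenario plus the four that remain.
all_ids = ["qa-13", "qa-14", "qa-15", "oos-impact-01", "oos-qa-01"]
completed = ["qa-13"]
print(scenario_ids_flag(remaining_scenarios(all_ids, completed)))
```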

**Commit**: `chore: save baseline-deepeval QA benchmark results (3 runs)`

**Location of intermediate data**:
- run-1: `tools/benchmark/results/baseline-deepeval/run-1/run/` (git-tracked? No, untracked)
- run-2: `tools/benchmark/results/baseline-deepeval/run-2/run/` (untracked)
- run-3 (interrupted): `tools/benchmark/results/20260529-150210/` (untracked)

---

### T20: Diff check + update diff-check.md

**Commit**: `docs: update diff check for LLM judge removal`

---

## Done

- [x] T21: Fix e2e-prompt.md / run_qa.py (introduce the Answer marker) - committed `6c5213430`
- [x] T19: Run the full QA benchmark and acquire the new baseline (3 runs) - 30/30 scenarios, all metrics 0.96 to 0.99 (discarded because it predates the T21 fix)

- [x] T1: Investigation - confirm the DeepEval judge LLM connection method and the LLMTestCase input mapping - `5530ab20`
- [x] T2: Create requirements.txt + setup.sh - `93669a7b`
- [x] T3: Add tests (RED) - unit tests for computing the 3 DeepEval metrics - `1efc394e`
- [x] T4: Implement evaluate.py (GREEN) - add functions computing the 3 DeepEval metrics - `1c7a6a0e`
- [x] T5: report.py - add DeepEval metric columns to the report - `d87da7de`
- [x] T6: docs/benchmark-design.md - append the DeepEval metrics design - `93101e85`
- [x] T7: Verification (1 scenario run) and SSL fix - `77a43974`
- [x] T8: Verification (3 scenarios run) - (run only)
- [x] T9: Full run + correlation analysis (SC2) - `bbcc37a50`
- [x] T10: Update HOW-TO-RUN.md (to be overwritten in T13) - `f6195085c`
- [x] T11: Diff check (to be updated in T19) - `7d1a0d52d`
- [x] T12: Update docs/benchmark-design.md - `4682e518`
- [x] T13: Update tools/benchmark/HOW-TO-RUN.md - `03206b0b`
- [x] T14: Update tests (RED) - `e202bbb9`
- [x] T15: Change the evaluate.py implementation (GREEN) - `00bcd0e1`
- [x] T16: Change the report.py implementation - `5513641a`
- [x] T17: Remove the --with-deepeval flag from run_qa.py - `4d97f74d`
- [x] T18: Verification (1 scenario run) - run only, no commit
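
The Answer marker introduced in T21 isolates the answer from the workflow narration that precedes it. A minimal sketch of that idea follows; it is not the code in `run_qa.py`. `extract_answer` is a hypothetical function, and both the exact-line match on `### Answer` and the fallback to the full output when the marker is missing are assumptions.

```python
ANSWER_MARKER = "### Answer"


def extract_answer(output: str) -> str:
    """Return the text after the first '### Answer' line.

    Hypothetical helper. Assumption: when the marker is absent, the whole
    output is treated as the answer.
    """
    lines = output.splitlines()
    for index, line in enumerate(lines):
        if line.strip() == ANSWER_MARKER:
            return "\n".join(lines[index + 1:]).strip()
    return output.strip()


raw = "Step 1: searched the knowledge base.\n### Answer\nUse the date provider.\n"
print(extract_answer(raw))
```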