Update CustomerSatisfactionEvaluator to support multi-turn evaluation #4925
AliMahmoudzadeh merged 10 commits into main
Add session-level (messages) input path to CustomerSatisfactionEvaluator, mirroring the pattern from TaskCompletionEvaluator PR #4922.

Changes:
- Add MessagesOrQueryResponseInputValidator support to ConversationValidator
- Add serialize_messages, _preprocess_messages, _drop_mcp_approval_messages, _normalize_function_call_types helper functions
- Add _do_eval_multi_turn method and extract shared _parse_prompty_output
- Add __call__ overload for messages parameter
- Create customer_satisfaction_multi_turn.prompty for session-level scoring
- Fix query/response check from 'and' to 'or' (correctness bug)
- Update spec.yaml: add messages field, anyOf required, bump version to 2
- Add 11 behavioral tests for the multi-turn messages path

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
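For readers skimming the 'and' to 'or' fix, here is a minimal sketch of why the old condition was a correctness bug. The function names are hypothetical and this is not the evaluator's actual code; it only assumes that a missing input is represented as `None`.

```python
from typing import Optional


def is_missing_input_old(query: Optional[str], response: Optional[str]) -> bool:
    # Old-style check (hypothetical reconstruction): flags the input
    # only when BOTH query and response are missing.
    return query is None and response is None


def is_missing_input_fixed(query: Optional[str], response: Optional[str]) -> bool:
    # Fixed check: flags the input when EITHER query or response is missing.
    return query is None or response is None


# Each case is (query, response).
cases = [("hi", "hello"), ("hi", None), (None, "hello"), (None, None)]

# Pair of (old result, fixed result) per case.
results = [
    (is_missing_input_old(q, r), is_missing_input_fixed(q, r)) for q, r in cases
]
```

With the old condition, a call that supplies only one of the two inputs slips through the check; with the fixed condition, it is flagged.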
Test Results for assets-test: 64 tests, 64 ✅, 2s ⏱️. Results for commit 6824f1e.
These files were accidentally modified and are unrelated to the customer satisfaction multi-turn evaluation feature.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add DEVELOPER role to MessageRole enum
- Add deep messages validation: per-item dict/role checks, user+assistant required, last must be assistant with text
- Set _OPTIONAL_PARAMS = ['messages']
- Handle developer role in serialize_messages
- Update serialize_messages docstring to match reference format
- Remove redundant _is_intermediate_response check from _do_eval_multi_turn (validator now catches this)
- Update test_messages_intermediate_response to expect exception
- Add 10 new validation tests: rejects invalid role, missing role key, no user/assistant, ending with user/tool, non-dict items; allows consecutive user/assistant, developer role

64 tests pass (44 existing + 20 session-level).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
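The validation rules listed in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the validator from the PR: `validate_messages`, `_has_text`, and `ALLOWED_ROLES` are hypothetical names, the exact role set and content shapes are assumed, and `ValueError` stands in for whatever exception type the evaluator actually raises.

```python
from typing import Any, List

# Assumed role set; the commit only confirms that a developer role was added.
ALLOWED_ROLES = {"system", "developer", "user", "assistant", "tool"}


def _has_text(content: Any) -> bool:
    """Return True if content is a non-empty string or a list with a text part."""
    if isinstance(content, str):
        return bool(content.strip())
    if isinstance(content, list):
        return any(
            isinstance(part, dict)
            and isinstance(part.get("text"), str)
            and bool(part["text"].strip())
            for part in content
        )
    return False


def validate_messages(messages: List[Any]) -> None:
    """Raise ValueError if the flat messages list breaks any of the listed rules."""
    if not isinstance(messages, list) or not messages:
        raise ValueError("messages must be a non-empty list")

    # Per-item checks: every item is a dict with a known role.
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            raise ValueError(f"messages[{i}] must be a dict")
        if "role" not in msg:
            raise ValueError(f"messages[{i}] is missing the 'role' key")
        role = msg["role"]
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            raise ValueError(f"messages[{i}] has invalid role {role!r}")

    # The session must contain at least one user and one assistant message.
    roles = {msg["role"] for msg in messages}
    if not {"user", "assistant"} <= roles:
        raise ValueError("messages must include both user and assistant roles")

    # The last message must be an assistant message that carries text.
    last = messages[-1]
    if last["role"] != "assistant" or not _has_text(last.get("content")):
        raise ValueError("last message must be an assistant message with text")
```

Consecutive messages with the same role are deliberately not rejected here, matching the "allows consecutive user/assistant" test case named above.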
🔍 5-Model Parallel Code Review: PR #4925
CustomerSatisfactionEvaluator: Multi-turn support

✅ Positive: No prompty template mismatch (unlike PR #4922)
All 5 models confirmed that this PR does NOT have the critical [...] The [...]

🔴 Finding 1: Valid [...]

| Finding | Opus 4.6 | Sonnet 4.6 | Sonnet 4 | GPT-5.4 | Gemini | Consensus |
|---|---|---|---|---|---|---|
| output_text rejected | ✅ | ✅ | ✅ | ✅ | ✅ | 5/5 |
| In-place mutation | ✅ | ✅ | ✅ | ✅ | ✅ | 5/5 |
| System msg overwrite | — | ✅ | ✅ | ✅ | ✅ | 4/5 |
| Schema validation gap | — | — | — | ✅ | — | 1/5 |
| No prompty mismatch ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 5/5 |
Review generated by 5-model parallel review — Claude Opus 4.6 · Claude Sonnet 4.6 · Claude Sonnet 4 · GPT-5.4 · Gemini
Test Results for scripts-test: 123 tests, 122 ✅, 41m 17s ⏱️. For more details on these failures, see this check. Results for commit 8eef3a2.
Summary
Add session-level (`messages`) input path to `CustomerSatisfactionEvaluator`, mirroring the pattern from `TaskCompletionEvaluator` PR #4922.

Changes

`_customer_satisfaction.py`
- Updated `ConversationValidator.validate_eval_input` to handle flat `messages` list input (in addition to existing `query`/`response` and `conversation` paths)
- Added helper functions `_drop_mcp_approval_messages`, `_normalize_function_call_types`, `_preprocess_messages`, `serialize_messages`
- Added `_MULTI_TURN_PROMPTY_FILE` class attr and loaded `_multi_turn_flow` in `__init__`
- Added `__call__` overload for `messages` parameter
- Routed `_do_eval` to new `_do_eval_multi_turn` when `messages` is provided
- Extracted shared `_parse_prompty_output` method (used by both single-turn and multi-turn)
- Fixed the query/response check from `and` to `or` (correctness bug: old code only raised if both were missing)

`customer_satisfaction_multi_turn.prompty` (new)

`spec.yaml`
- Added `messages` to `dataMappingSchema`
- Changed `required` to `anyOf: [query, response] | [messages]`

Tests