
Update CustomerSatisfactionEvaluator to support multi-turn evaluation#4925

Merged
AliMahmoudzadeh merged 10 commits into main from
customer-satisfaction-multi-turn
Apr 16, 2026

Conversation

@AliMahmoudzadeh
Contributor

Update CustomerSatisfactionEvaluator to support multi-turn evaluation

Summary

Add session-level (messages) input path to CustomerSatisfactionEvaluator, mirroring the pattern from TaskCompletionEvaluator PR #4922.

Changes

_customer_satisfaction.py

  • Extended ConversationValidator.validate_eval_input to handle flat messages list input (in addition to existing query/response and conversation paths)
  • Added preprocessing helpers: _drop_mcp_approval_messages, _normalize_function_call_types, _preprocess_messages, serialize_messages
  • Added _MULTI_TURN_PROMPTY_FILE class attr and loaded _multi_turn_flow in __init__
  • Added __call__ overload for messages parameter
  • Routed _do_eval to new _do_eval_multi_turn when messages is provided
  • Extracted shared _parse_prompty_output method (used by both single-turn and multi-turn)
  • Fixed query/response presence check from 'and' to 'or' (correctness bug — the old code only raised when both were missing)

customer_satisfaction_multi_turn.prompty (new)

  • Multi-turn prompt adapted for 1-5 Likert satisfaction scoring across full conversation sessions
  • Evaluates helpfulness, completeness, and tone dimensions

spec.yaml

  • Added messages to dataMappingSchema
  • Changed required to anyOf: [query, response] | [messages]
  • Bumped version to 2
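The spec changes above amount to a fragment along these lines (a sketch only — the PR does not show the schema's exact layout, so the property types here are assumptions):

```yaml
dataMappingSchema:
  properties:
    query:    { type: string }
    response: { type: string }
    messages: { type: array }
  anyOf:
    - required: [query, response]
    - required: [messages]
version: 2
```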

Tests

  • Added 11 session-level behavioral tests covering: valid input, empty/invalid messages, system messages, tool calls, intermediate responses, string content, MCP approval, flow routing, and non-dict items
  • All 55 tests pass (44 existing + 11 new)

Ali Mahmoudzadeh and others added 6 commits April 9, 2026 09:03
Add session-level (messages) input path to CustomerSatisfactionEvaluator,
mirroring the pattern from TaskCompletionEvaluator PR #4922.

Changes:
- Add MessagesOrQueryResponseInputValidator support to ConversationValidator
- Add serialize_messages, _preprocess_messages, _drop_mcp_approval_messages,
  _normalize_function_call_types helper functions
- Add _do_eval_multi_turn method and extract shared _parse_prompty_output
- Add __call__ overload for messages parameter
- Create customer_satisfaction_multi_turn.prompty for session-level scoring
- Fix query/response check from 'and' to 'or' (correctness bug)
- Update spec.yaml: add messages field, anyOf required, bump version to 2
- Add 11 behavioral tests for the multi-turn messages path

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

github-actions bot commented Apr 9, 2026

Test Results for assets-test

64 tests   64 ✅  2s ⏱️
 1 suites   0 💤
 1 files     0 ❌

Results for commit 6824f1e.

♻️ This comment has been updated with latest results.

Comment thread assets/benchmarkspecs/builtin/bigbenchhard/spec.yaml
Comment thread assets/pipelines/data/global_dataset/generic_csv/bigbenchhard/asset.yaml Outdated
Comment thread assets/pipelines/data/global_dataset/generic_csv/bigbenchhard/spec.yaml Outdated
These files were accidentally modified and are unrelated to the
customer satisfaction multi-turn evaluation feature.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add DEVELOPER role to MessageRole enum
- Add deep messages validation: per-item dict/role checks,
  user+assistant required, last must be assistant with text
- Set _OPTIONAL_PARAMS = ['messages']
- Handle developer role in serialize_messages
- Update serialize_messages docstring to match reference format
- Remove redundant _is_intermediate_response check from
  _do_eval_multi_turn (validator now catches this)
- Update test_messages_intermediate_response to expect exception
- Add 10 new validation tests: rejects invalid role, missing
  role key, no user/assistant, ending with user/tool, non-dict
  items; allows consecutive user/assistant, developer role

64 tests pass (44 existing + 20 session-level).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
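The validation rules listed in that commit can be sketched as a standalone checker. The names and error messages here are illustrative, not the PR's actual code; the final-message check only accepts "text" blocks, matching the behavior the commit describes:

```python
VALID_ROLES = {"system", "developer", "user", "assistant", "tool"}


def _has_text(msg: dict) -> bool:
    """True if the message carries non-empty text content."""
    content = msg.get("content")
    if isinstance(content, str):
        return bool(content.strip())
    if isinstance(content, list):
        return any(
            isinstance(block, dict) and block.get("type") == "text"
            for block in content
        )
    return False


def validate_messages(messages: list) -> None:
    """Per-item dict/role checks, user+assistant required,
    last message must be an assistant turn with text."""
    if not messages:
        raise ValueError("messages must be a non-empty list")
    roles = []
    for item in messages:
        if not isinstance(item, dict):
            raise ValueError("each message must be a dict")
        role = item.get("role")
        if role is None:
            raise ValueError("each message must have a 'role' key")
        if role not in VALID_ROLES:
            raise ValueError(f"invalid role: {role!r}")
        roles.append(role)
    if "user" not in roles or "assistant" not in roles:
        raise ValueError("need at least one user and one assistant turn")
    if roles[-1] != "assistant" or not _has_text(messages[-1]):
        raise ValueError("conversation must end with an assistant text message")
```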
@AliMahmoudzadeh AliMahmoudzadeh marked this pull request as ready for review April 14, 2026 23:50
@AliMahmoudzadeh AliMahmoudzadeh requested review from a team as code owners April 14, 2026 23:50
@imatiach-msft
Contributor

imatiach-msft commented Apr 15, 2026

🔍 5-Model Parallel Code Review — PR #4925

CustomerSatisfactionEvaluator: Multi-turn support
Reviewed by: Claude Opus 4.6, Claude Sonnet 4.6, Claude Sonnet 4, GPT-5.4, Gemini (Goldeneye)
Consensus scoring: X/5 models agreeing on each finding


✅ Positive: No prompty template mismatch (unlike PR #4922)

All 5 models confirmed that this PR does NOT have the critical {{conversation}} binding bug found in PR #4922. The prompty input name (conversation), template variable ({{conversation}}), and Python kwarg ({"conversation": conversation_text}) all match correctly. 👍

The 'and' → 'or' bug fix for the query/response presence check is also confirmed correct — both query and response are required inputs, so rejecting when either is missing is the right behavior.


🔴 Finding 1: Valid output_text final responses are rejected (5/5 models)

File: _customer_satisfaction.py — final assistant content check
Severity: High

The validator only accepts type == "text" as valid text content in the final assistant message. However, the OpenAI Responses API returns assistant content items with type: "output_text". Conversations captured from agents using the Responses API will be incorrectly rejected as "mid-execution."

The existing evaluator code already treats output_text as valid elsewhere, and _normalize_function_call_types does not normalize output_text → text, so this gap is real.

Recommendation: Accept both "text" and "output_text" in the final assistant text check, or add an output_text normalization step to the preprocessing pipeline.


🔴 Finding 2: Preprocessing mutates caller's message objects in-place (5/5 models)

File: _customer_satisfaction.py — _normalize_function_call_types
Severity: High

_normalize_function_call_types() directly modifies item["type"] and pop()s fields like function_call_output / openapi_call_output on the caller's original data. If the same messages list is reused across multiple evaluators in a batch/continuous evaluation pipeline (CustomerSatisfaction → TaskCompletion → etc.), the mutation from one evaluator silently corrupts the input for subsequent evaluators.

Recommendation: Deep-copy the messages at the entry point of _preprocess_messages before performing any in-place mutations.


🟡 Finding 3: System/developer messages overwritten — only last one kept (4/5 models)

File: _customer_satisfaction.py — serialize_messages()
Severity: Medium

serialize_messages() stores system/developer context in a single system_message variable and overwrites it on each system/developer turn. Only the last such message reaches the prompty, so earlier instructions/context are silently lost.

Additionally, Gemini flagged a crash risk: if a system/developer message has list-based content blocks (e.g., [{"type": "input_text", "text": "..."}]), msg.get("content", "") stores the raw list, which will raise a TypeError during string concatenation in _pretty_format_conversation_history().

Recommendation: Accumulate all system/developer messages, and normalize list-based content to plain text before formatting.
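Both parts of that recommendation can be sketched together — accumulate every system/developer turn and flatten list-based content to a string first. Function names are hypothetical:

```python
def _content_to_text(content) -> str:
    """Flatten string or list-of-blocks content to plain text so later
    string concatenation cannot hit a TypeError on a raw list."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        return " ".join(
            block.get("text", "") for block in content
            if isinstance(block, dict)
        )
    return ""


def collect_system_context(messages: list) -> str:
    """Keep every system/developer message, not just the last one."""
    parts = [
        _content_to_text(m.get("content"))
        for m in messages
        if isinstance(m, dict) and m.get("role") in ("system", "developer")
    ]
    return "\n".join(p for p in parts if p)
```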


ℹ️ Finding 4: messages path skips per-message schema validation (1/5 models — GPT-5.4)

File: _customer_satisfaction.py — ConversationValidator.validate_eval_input
Severity: Low

The new messages branch only checks roles and the final assistant turn. It does not run the existing per-message schema validation (_validate_message_dict / _validate_input_messages_list), so malformed items like {"role": "user"} (missing content) are accepted. serialize_messages() then silently skips bad messages, potentially producing a score for an incomplete conversation.

Recommendation: Consider reusing existing per-message validators for the messages path.
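The kind of per-message schema check the finding wants reused would look roughly like this — a stand-in, not the repository's actual _validate_message_dict:

```python
def validate_message_item(item) -> None:
    """Reject items that are not dicts or lack 'role'/'content', so that
    e.g. {"role": "user"} fails loudly instead of being silently skipped."""
    if not isinstance(item, dict):
        raise ValueError("message must be a dict")
    for key in ("role", "content"):
        if key not in item:
            raise ValueError(f"message missing required key: {key!r}")


def validate_messages_schema(messages: list) -> None:
    for item in messages:
        validate_message_item(item)
```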


Verdict: APPROVE WITH SUGGESTIONS

The core multi-turn flow is well-designed and avoids the critical prompty mismatch bug from PR #4922. The andor fix is correct. The behavioral tests are comprehensive.

The output_text rejection (Finding 1) is the most impactful issue — it will cause real failures for Responses API users. The in-place mutation (Finding 2) is a correctness concern for batch pipelines. Both are worth addressing before or shortly after merge.


📊 Model Agreement Matrix
| Finding | Consensus |
| --- | --- |
| output_text rejected | 5/5 |
| In-place mutation | 5/5 |
| System msg overwrite | 4/5 |
| Schema validation gap | 1/5 (GPT-5.4 only) |
| No prompty mismatch ✅ | 5/5 |

Review generated by 5-model parallel review — Claude Opus 4.6 · Claude Sonnet 4.6 · Claude Sonnet 4 · GPT-5.4 · Gemini

imatiach-msft
imatiach-msft previously approved these changes Apr 15, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 15, 2026

Test Results for scripts-test

123 tests   122 ✅  41m 17s ⏱️
  1 suites    0 💤
  1 files      1 ❌

For more details on these failures, see this check.

Results for commit 8eef3a2.

♻️ This comment has been updated with latest results.

@AliMahmoudzadeh AliMahmoudzadeh merged commit c1b7ddf into main Apr 16, 2026
38 checks passed
@AliMahmoudzadeh AliMahmoudzadeh deleted the customer-satisfaction-multi-turn branch April 16, 2026 04:07