
Update CustomerSatisfactionEvaluator to support multi-turn evaluation#4925

Merged
AliMahmoudzadeh merged 10 commits into main from
customer-satisfaction-multi-turn
Apr 16, 2026

Conversation

@AliMahmoudzadeh
Contributor

Update CustomerSatisfactionEvaluator to support multi-turn evaluation

Summary

Add session-level (messages) input path to CustomerSatisfactionEvaluator, mirroring the pattern from TaskCompletionEvaluator PR #4922.

Changes

_customer_satisfaction.py

  • Extended ConversationValidator.validate_eval_input to handle flat messages list input (in addition to existing query/response and conversation paths)
  • Added preprocessing helpers: _drop_mcp_approval_messages, _normalize_function_call_types, _preprocess_messages, serialize_messages
  • Added _MULTI_TURN_PROMPTY_FILE class attr and loaded _multi_turn_flow in __init__
  • Added __call__ overload for messages parameter
  • Routed _do_eval to new _do_eval_multi_turn when messages is provided
  • Extracted shared _parse_prompty_output method (used by both single-turn and multi-turn)
  • Fixed query/response presence check from 'and' to 'or' (correctness bug — the old code only raised when both were missing)

customer_satisfaction_multi_turn.prompty (new)

  • Multi-turn prompt adapted for 1-5 Likert satisfaction scoring across full conversation sessions
  • Evaluates helpfulness, completeness, and tone dimensions

spec.yaml

  • Added messages to dataMappingSchema
  • Changed required to anyOf: [query, response] | [messages]
  • Bumped version to 2
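The spec changes above amount to a fragment along these lines (a sketch only — the PR does not show the schema's exact layout, so the property types here are assumptions):

```yaml
dataMappingSchema:
  properties:
    query:    { type: string }
    response: { type: string }
    messages: { type: array }
  anyOf:
    - required: [query, response]
    - required: [messages]
version: 2
```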

Tests

  • Added 11 session-level behavioral tests covering: valid input, empty/invalid messages, system messages, tool calls, intermediate responses, string content, MCP approval, flow routing, and non-dict items
  • All 55 tests pass (44 existing + 11 new)

Ali Mahmoudzadeh and others added 6 commits April 9, 2026 09:03
Add session-level (messages) input path to CustomerSatisfactionEvaluator,
mirroring the pattern from TaskCompletionEvaluator PR #4922.

Changes:
- Add MessagesOrQueryResponseInputValidator support to ConversationValidator
- Add serialize_messages, _preprocess_messages, _drop_mcp_approval_messages,
  _normalize_function_call_types helper functions
- Add _do_eval_multi_turn method and extract shared _parse_prompty_output
- Add __call__ overload for messages parameter
- Create customer_satisfaction_multi_turn.prompty for session-level scoring
- Fix query/response check from 'and' to 'or' (correctness bug)
- Update spec.yaml: add messages field, anyOf required, bump version to 2
- Add 11 behavioral tests for the multi-turn messages path

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

github-actions bot commented Apr 9, 2026

Test Results for assets-test

64 tests   64 ✅  2s ⏱️
 1 suites   0 💤
 1 files     0 ❌

Results for commit 6824f1e.

♻️ This comment has been updated with latest results.

Comment thread assets/benchmarkspecs/builtin/bigbenchhard/spec.yaml
Comment thread assets/pipelines/data/global_dataset/generic_csv/bigbenchhard/asset.yaml Outdated
Comment thread assets/pipelines/data/global_dataset/generic_csv/bigbenchhard/spec.yaml Outdated
These files were accidentally modified and are unrelated to the
customer satisfaction multi-turn evaluation feature.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add DEVELOPER role to MessageRole enum
- Add deep messages validation: per-item dict/role checks,
  user+assistant required, last must be assistant with text
- Set _OPTIONAL_PARAMS = ['messages']
- Handle developer role in serialize_messages
- Update serialize_messages docstring to match reference format
- Remove redundant _is_intermediate_response check from
  _do_eval_multi_turn (validator now catches this)
- Update test_messages_intermediate_response to expect exception
- Add 10 new validation tests: rejects invalid role, missing
  role key, no user/assistant, ending with user/tool, non-dict
  items; allows consecutive user/assistant, developer role

64 tests pass (44 existing + 20 session-level).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
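The validation rules listed in that commit can be sketched as a standalone checker. The names and error messages here are illustrative, not the PR's actual code; the final-message check only accepts "text" blocks, matching the behavior the commit describes:

```python
VALID_ROLES = {"system", "developer", "user", "assistant", "tool"}


def _has_text(msg: dict) -> bool:
    """True if the message carries non-empty text content."""
    content = msg.get("content")
    if isinstance(content, str):
        return bool(content.strip())
    if isinstance(content, list):
        return any(
            isinstance(block, dict) and block.get("type") == "text"
            for block in content
        )
    return False


def validate_messages(messages: list) -> None:
    """Per-item dict/role checks, user+assistant required,
    last message must be an assistant turn with text."""
    if not messages:
        raise ValueError("messages must be a non-empty list")
    roles = []
    for item in messages:
        if not isinstance(item, dict):
            raise ValueError("each message must be a dict")
        role = item.get("role")
        if role is None:
            raise ValueError("each message must have a 'role' key")
        if role not in VALID_ROLES:
            raise ValueError(f"invalid role: {role!r}")
        roles.append(role)
    if "user" not in roles or "assistant" not in roles:
        raise ValueError("need at least one user and one assistant turn")
    if roles[-1] != "assistant" or not _has_text(messages[-1]):
        raise ValueError("conversation must end with an assistant text message")
```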
@AliMahmoudzadeh AliMahmoudzadeh marked this pull request as ready for review April 14, 2026 23:50
@AliMahmoudzadeh AliMahmoudzadeh requested review from a team as code owners April 14, 2026 23:50
@imatiach-msft
Contributor

imatiach-msft commented Apr 15, 2026

🔍 5-Model Parallel Code Review — PR #4925

CustomerSatisfactionEvaluator: Multi-turn support
Reviewed by: Claude Opus 4.6, Claude Sonnet 4.6, Claude Sonnet 4, GPT-5.4, Gemini (Goldeneye)
Consensus scoring: X/5 models agreeing on each finding


✅ Positive: No prompty template mismatch (unlike PR #4922)

All 5 models confirmed that this PR does NOT have the critical {{conversation}} binding bug found in PR #4922. The prompty input name (conversation), template variable ({{conversation}}), and Python kwarg ({"conversation": conversation_text}) all match correctly. 👍

The 'and' → 'or' bug fix for the query/response presence check is also confirmed correct — both query and response are required inputs, so rejecting when either is missing is the right behavior.


🔴 Finding 1: Valid output_text final responses are rejected (5/5 models)

File: _customer_satisfaction.py — final assistant content check
Severity: High

The validator only accepts type == "text" as valid text content in the final assistant message. However, the OpenAI Responses API returns assistant content items with type: "output_text". Conversations captured from agents using the Responses API will be incorrectly rejected as "mid-execution."

The existing evaluator code already treats output_text as valid elsewhere, and _normalize_function_call_types does not normalize output_text → text, so this gap is real.

Recommendation: Accept both "text" and "output_text" in the final assistant text check, or add an output_text normalization step to the preprocessing pipeline.


🔴 Finding 2: Preprocessing mutates caller's message objects in-place (5/5 models)

File: _customer_satisfaction.py — _normalize_function_call_types
Severity: High

_normalize_function_call_types() directly modifies item["type"] and pop()s fields like function_call_output / openapi_call_output on the caller's original data. If the same messages list is reused across multiple evaluators in a batch/continuous evaluation pipeline (CustomerSatisfaction → TaskCompletion → etc.), the mutation from one evaluator silently corrupts the input for subsequent evaluators.

Recommendation: Deep-copy the messages at the entry point of _preprocess_messages before performing any in-place mutations.


🟡 Finding 3: System/developer messages overwritten — only last one kept (4/5 models)

File: _customer_satisfaction.py — serialize_messages()
Severity: Medium

serialize_messages() stores system/developer context in a single system_message variable and overwrites it on each system/developer turn. Only the last such message reaches the prompty, so earlier instructions/context are silently lost.

Additionally, Gemini flagged a crash risk: if a system/developer message has list-based content blocks (e.g., [{"type": "input_text", "text": "..."}]), msg.get("content", "") stores the raw list, which will raise a TypeError during string concatenation in _pretty_format_conversation_history().

Recommendation: Accumulate all system/developer messages, and normalize list-based content to plain text before formatting.
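Both parts of that recommendation can be sketched together — accumulate every system/developer turn and flatten list-based content to a string first. Function names are hypothetical:

```python
def _content_to_text(content) -> str:
    """Flatten string or list-of-blocks content to plain text so later
    string concatenation cannot hit a TypeError on a raw list."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        return " ".join(
            block.get("text", "") for block in content
            if isinstance(block, dict)
        )
    return ""


def collect_system_context(messages: list) -> str:
    """Keep every system/developer message, not just the last one."""
    parts = [
        _content_to_text(m.get("content"))
        for m in messages
        if isinstance(m, dict) and m.get("role") in ("system", "developer")
    ]
    return "\n".join(p for p in parts if p)
```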


ℹ️ Finding 4: messages path skips per-message schema validation (1/5 models — GPT-5.4)

File: _customer_satisfaction.py — ConversationValidator.validate_eval_input
Severity: Low

The new messages branch only checks roles and the final assistant turn. It does not run the existing per-message schema validation (_validate_message_dict / _validate_input_messages_list), so malformed items like {"role": "user"} (missing content) are accepted. serialize_messages() then silently skips bad messages, potentially producing a score for an incomplete conversation.

Recommendation: Consider reusing existing per-message validators for the messages path.
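The kind of per-message schema check the finding wants reused would look roughly like this — a stand-in, not the repository's actual _validate_message_dict:

```python
def validate_message_item(item) -> None:
    """Reject items that are not dicts or lack 'role'/'content', so that
    e.g. {"role": "user"} fails loudly instead of being silently skipped."""
    if not isinstance(item, dict):
        raise ValueError("message must be a dict")
    for key in ("role", "content"):
        if key not in item:
            raise ValueError(f"message missing required key: {key!r}")


def validate_messages_schema(messages: list) -> None:
    for item in messages:
        validate_message_item(item)
```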


Verdict: APPROVE WITH SUGGESTIONS

The core multi-turn flow is well-designed and avoids the critical prompty mismatch bug from PR #4922. The andor fix is correct. The behavioral tests are comprehensive.

The output_text rejection (Finding 1) is the most impactful issue — it will cause real failures for Responses API users. The in-place mutation (Finding 2) is a correctness concern for batch pipelines. Both are worth addressing before or shortly after merge.


📊 Model Agreement Matrix
| Finding | Consensus |
| --- | --- |
| output_text rejected | 5/5 |
| In-place mutation | 5/5 |
| System msg overwrite | 4/5 |
| Schema validation gap | 1/5 (GPT-5.4 only) |
| No prompty mismatch ✅ | 5/5 |

Review generated by 5-model parallel review — Claude Opus 4.6 · Claude Sonnet 4.6 · Claude Sonnet 4 · GPT-5.4 · Gemini

imatiach-msft
imatiach-msft previously approved these changes Apr 15, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 15, 2026

Test Results for scripts-test

123 tests   122 ✅  41m 17s ⏱️
  1 suites    0 💤
  1 files      1 ❌

For more details on these failures, see this check.

Results for commit 8eef3a2.

♻️ This comment has been updated with latest results.

@AliMahmoudzadeh AliMahmoudzadeh merged commit c1b7ddf into main Apr 16, 2026
38 checks passed
@AliMahmoudzadeh AliMahmoudzadeh deleted the customer-satisfaction-multi-turn branch April 16, 2026 04:07