
feat(eval): multi-turn conversational test case — live turn-by-turn evaluation #1052

@christso

Description

Summary

Support evaluating multi-turn conversations where the agent generates each assistant turn, with per-turn grading. Today, multi-turn evals must script every intermediate assistant response in input — the LLM generates only the final response.

Research basis

Cross-framework comparison in agentevals-research PR #57 (8 frameworks analyzed). Key findings that shaped this design:

| Finding | Impact on design |
| --- | --- |
| All frameworks converge on a turns array | Confirms our schema choice |
| Azure SDK uses min aggregation ("weakest link") alongside mean | Add `aggregation: mean \| min \| max` |
| Braintrust shows per-turn grading can't catch conversation-level failures | Add top-level assertions for post-conversation grading |
| — | `on_turn_failure: stop \| continue` is sufficient |
| Google ADK + DeepEval ship simulated user as core | Move to Phase 2, not Phase 3 |
| Azure EvaluatorBase makes any grader work in multi-turn mode | All existing AgentV graders should work in mode: conversation |
| DeepEval sliding window (default 10 turns) for context bounding | Add window_size option for long conversations |
| MultiChallenge taxonomy (instruction retention, inference memory, versioned editing, self-coherence) | Informs built-in conversation grader prompts |

Proposed schema

Uses AgentV's string shorthand for assertions — plain strings become rubric criteria automatically (see examples/features/rubric/).

tests:
  - id: travel-planning
    mode: conversation
    criteria: Agent maintains context and provides relevant travel advice across turns
    aggregation: mean   # mean (default) | min ("weakest link") | max
    input:
      - role: system
        content: You are a helpful travel planning assistant.

    turns:

      - input: I'm planning a two-week trip to Japan next spring. What areas should I focus on?
        # expected_output is a reference for grading — the actual LLM response
        # (not this) carries forward to the next turn
        expected_output: |-
          Japan in spring is a wonderful choice! I'd recommend focusing on
          Kyoto for its incredible temple complexes, the Japanese Alps for
          stunning mountain hiking, and the Kumano Kodo pilgrimage trails.
          Spring is perfect timing — cherry blossom season runs late March
          through mid-April, and the weather is mild for outdoor exploration.
        assertions:
          - Recommends specific Japan regions or cities
          - Acknowledges spring timing (cherry blossom season, weather)


      - input: I'm mostly interested in traditional culture and nature, not big cities.
        expected_output: |-
          Great choices! For traditional culture and nature, I'd suggest
          staying at a temple lodging in Koyasan, visiting the thatched-roof
          villages of Shirakawa-go, and exploring rural Kyoto areas like
          Ohara and Kurama. The Kumano Kodo trail I mentioned would also
          be perfect for you.
        assertions:
          - Shifts recommendations toward rural/traditional areas
          - Does not lead with Tokyo/Osaka nightlife or urban attractions
          - References or builds on regions mentioned in previous turn


      - input: What about the budget? I'm thinking around $3000 not including flights.
        assertions:
          - Provides advice within $3000 budget
          - Budget advice accounts for cultural/nature preference from prior turns
          - Does not suggest luxury resorts or expensive city hotels


      - input: Can you suggest a rough day-by-day itinerary based on everything we discussed?
        assertions:
          - Itinerary covers approximately 14 days
          - Emphasizes traditional culture and nature activities
          - Feasible within $3000 budget
          - Includes specific locations discussed in earlier turns

    # Conversation-level
    assertions:
      - Agent consistently remembers destination, preferences, budget, and timing across all turns
      - Agent never contradicts its own prior recommendations
      - Each turn builds on prior context rather than starting fresh

Mixing shorthand with structured assertions

Per-turn assertions can also use the full evaluator config when you need weights, required, or non-rubric types; both forms normalize to the same evaluator config (sketched after this example):

      - input: Queue
        assertions:
          - Asks about required tags (RTR, RTK, WFC, etc.)
          - type: contains
            value: CSS
          - type: rubrics
            criteria:
              - id: no-premature-diagnosis
                outcome: Does not jump to a solution before gathering information
                weight: 2
                required: true
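
A sketch of the intended normalization — the types here are simplified stand-ins (the real ones live in packages/core/src/evaluation/types.ts), and the existing top-level parser already behaves this way for string shorthand:

```ts
// Sketch — simplified types; AgentV's real definitions are richer.
interface RubricCriterion { id?: string; outcome: string; weight?: number; required?: boolean }
type EvaluatorConfig =
  | { type: "contains"; value: string }
  | { type: "rubrics"; criteria: RubricCriterion[] };

// Plain strings become single-criterion rubrics; structured configs pass through.
function normalizeAssertion(entry: string | EvaluatorConfig): EvaluatorConfig {
  if (typeof entry === "string") {
    return { type: "rubrics", criteria: [{ outcome: entry }] };
  }
  return entry;
}
```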

How it works

  1. input provides system prompt and initial context (same as today)
  2. For each entry in turns:
    a. Append the user message to the accumulated message history
    b. Call the provider with the full history — LLM generates an assistant response
    c. Grade the response against that turn's assertions (if present) and expected_output (if present, compared via implicit llm-grader)
    d. Append the actual LLM response (not expected_output) to history for the next turn
  3. After all turns: run top-level assertions over the full transcript (if present). Also run top-level criteria as implicit llm-grader if no per-turn or conversation assertions exist (backward-compatible fallback).
  4. Final test score = the configured aggregation applied over the per-turn scores and conversation assertion scores. Turns without assertions/expected_output score 1.0. (The loop is sketched below.)
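
A minimal sketch of this loop in TypeScript, assuming hypothetical callProvider and gradeTurn helpers — the real implementation would live in runEvalCase:

```ts
// Sketch only — callProvider and gradeTurn stand in for the provider call
// and per-turn grading that runEvalCase would perform.
interface Message { role: "system" | "user" | "assistant"; content: string }
interface Turn { input: string; expected_output?: string; assertions?: unknown[] }
interface TurnResult { score: number; verdict: "pass" | "fail" }

declare function callProvider(history: Message[]): Promise<string>;
declare function gradeTurn(turn: Turn, reply: string, history: Message[]): Promise<TurnResult>;

async function runConversation(
  initial: Message[],
  turns: Turn[],
  onTurnFailure: "continue" | "stop" = "continue",
): Promise<{ history: Message[]; turnScores: number[] }> {
  const history = [...initial];
  const turnScores: number[] = [];
  for (let i = 0; i < turns.length; i++) {
    history.push({ role: "user", content: turns[i].input });  // (a) append user message
    const reply = await callProvider(history);                // (b) LLM generates this turn
    const result = await gradeTurn(turns[i], reply, history); // (c) grade this turn only
    history.push({ role: "assistant", content: reply });      // (d) actual reply carries forward
    turnScores.push(result.score);
    if (onTurnFailure === "stop" && result.verdict === "fail") {
      // Remaining turns are skipped and scored 0.
      turnScores.push(...new Array<number>(turns.length - i - 1).fill(0));
      break;
    }
  }
  return { history, turnScores };
}
```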

Turn schema

interface ConversationTurn {
  // Each turn is a user message. The runner generates the assistant response.
  readonly input: TestMessageContent;  // string or structured content
  readonly expected_output?: TestMessageContent;
  readonly assertions?: readonly (string | EvaluatorConfig)[];  // strings = rubric shorthand
}

Grader template variables per turn

Existing template variables — no new variables needed (their per-turn assembly is sketched below):

  • {{ input }} — full conversation history up to and including this user message
  • {{ output }} — the LLM's response for this turn only
  • {{ criteria }} — the top-level test criteria (shared across all turns)
  • {{ expected_output }} — this turn's expected_output (if present)

For top-level assertions:

  • {{ input }} — the full conversation transcript (all turns)
  • {{ output }} — the last assistant response
  • {{ criteria }} — the top-level test criteria
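
For illustration, a sketch of how these variables could be assembled per turn. The transcript rendering and helper names are assumptions, not the existing template plumbing, and the window is applied over messages here for simplicity (the proposal's window_size counts turns):

```ts
// Sketch — renderTranscript and buildTurnVars are illustrative helpers.
interface Msg { role: string; content: string }

function renderTranscript(messages: Msg[]): string {
  return messages.map((m) => `${m.role}: ${m.content}`).join("\n");
}

function buildTurnVars(
  history: Msg[],        // accumulated history, ending with this turn's user message
  output: string,        // this turn's assistant response only
  criteria: string,      // shared top-level test criteria
  expectedOutput?: string,
  windowSize?: number,   // window_size: bound the context passed to the grader
): Record<string, string> {
  const windowed = windowSize === undefined ? history : history.slice(-windowSize);
  const vars: Record<string, string> = {
    input: renderTranscript(windowed),
    output,
    criteria,
  };
  if (expectedOutput !== undefined) vars.expected_output = expectedOutput;
  return vars;
}
```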

Validation rules

  • turns requires mode: conversation — error if turns present without it
  • mode: conversation requires turns — error if mode set but turns missing/empty
  • Each turn must have non-empty input
  • Tests without mode/turns behave identically to today (no regression)
  • turns is incompatible with top-level expected_output — error if both present
  • aggregation only valid when mode: conversation is set (these rules are sketched as Zod refinements below)
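
A sketch of these rules as Zod refinements, assuming a simplified test shape — the real schema in eval-file.schema.ts carries many more fields:

```ts
import { z } from "zod";

// Simplified sketch — field names follow this proposal, not the full schema.
const turnSchema = z.object({
  input: z.string().min(1, "Each turn must have non-empty input"),
  expected_output: z.string().optional(),
  assertions: z.array(z.union([z.string(), z.record(z.unknown())])).optional(),
});

const testSchema = z
  .object({
    mode: z.literal("conversation").optional(),
    turns: z.array(turnSchema).optional(),
    aggregation: z.enum(["mean", "min", "max"]).optional(),
    expected_output: z.string().optional(),
  })
  .superRefine((test, ctx) => {
    const issue = (message: string) =>
      ctx.addIssue({ code: z.ZodIssueCode.custom, message });
    if (test.turns && test.mode !== "conversation")
      issue("turns requires mode: conversation");
    if (test.mode === "conversation" && !(test.turns && test.turns.length > 0))
      issue("mode: conversation requires non-empty turns");
    if (test.turns && test.expected_output !== undefined)
      issue("turns is incompatible with top-level expected_output");
    if (test.aggregation && test.mode !== "conversation")
      issue("aggregation is only valid with mode: conversation");
  });
```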

Interaction with existing features

| Feature | Interaction |
| --- | --- |
| depends_on | Works normally — a conversation test can depend on other tests and vice versa |
| trials | Each trial runs the full conversation independently |
| workspace | Not applicable in v1 (LLM providers only) |
| assertions (top-level) | Grades the full conversation after all turns complete; complementary with per-turn assertions |
| on_turn_failure | continue (default) or stop. When stop, remaining turns are skipped and scored 0 |
| window_size | Optional. Limits context passed to per-turn graders (default: all turns). Useful for long conversations |

Output shape

One EvaluationResult per test. Per-turn rubric assertions show exactly what passed/failed:

{
  "test_id": "travel-planning",
  "score": 0.88,
  "execution_status": "ok",
  "scores": [
    {
      "name": "turn-1", "type": "rubrics", "score": 1.0, "verdict": "pass",
      "assertions": [
        {"text": "Recommends specific Japan regions or cities", "passed": true},
        {"text": "Acknowledges spring timing", "passed": true}
      ]
    },
    {
      "name": "turn-2", "type": "rubrics", "score": 0.67, "verdict": "fail",
      "assertions": [
        {"text": "Shifts recommendations toward rural/traditional areas", "passed": true},
        {"text": "Does not lead with urban attractions", "passed": true},
        {"text": "References or builds on regions mentioned in previous turn", "passed": false}
      ]
    },
    {
      "name": "turn-3", "type": "rubrics", "score": 1.0, "verdict": "pass",
      "assertions": [
        {"text": "Provides advice within $3000 budget", "passed": true},
        {"text": "Budget accounts for culture/nature preference", "passed": true},
        {"text": "Does not suggest luxury resorts", "passed": true}
      ]
    },
    {
      "name": "turn-4", "type": "rubrics", "score": 0.75, "verdict": "fail",
      "assertions": [
        {"text": "Itinerary covers approximately 14 days", "passed": true},
        {"text": "Emphasizes traditional culture and nature", "passed": true},
        {"text": "Feasible within $3000 budget", "passed": true},
        {"text": "Includes specific locations from earlier turns", "passed": false}
      ]
    },
    {
      "name": "assertions", "type": "rubrics", "score": 0.67, "verdict": "fail",
      "assertions": [
        {"text": "Remembers destination, preferences, budget, timing across all turns", "passed": true},
        {"text": "Never contradicts own prior recommendations", "passed": true},
        {"text": "Each turn builds on prior context", "passed": false}
      ]
    }
  ],
  "output": [
    {"role": "user", "content": "I'm planning a two-week trip to Japan..."},
    {"role": "assistant", "content": "Japan in spring is wonderful! I'd recommend..."},
    {"role": "user", "content": "I'm mostly interested in traditional culture..."},
    {"role": "assistant", "content": "For traditional culture and nature, consider..."},
    {"role": "user", "content": "What about the budget?..."},
    {"role": "assistant", "content": "With $3000, you can comfortably..."},
    {"role": "user", "content": "Can you suggest a rough itinerary?..."},
    {"role": "assistant", "content": "Here's a 14-day itinerary..."}
  ]
}

Implementation

Phase 1: Core turn runner + conversation assertions

  • Add ConversationTurn type; add mode, turns, on_turn_failure, window_size, and aggregation to EvalTest
  • Zod schema and YAML parser updates
  • Turn-by-turn loop in runEvalCase: accumulate messages, call provider, grade, repeat
  • Conversation assertions run after all turns
  • Aggregation: mean (default), min, max (sketched after this list)
  • All existing grader types work per-turn (no grader changes needed)
  • String shorthand in per-turn assertions works identically to top-level (existing parser handles this)
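
A sketch of the aggregation step, assuming per-turn and conversation scores are already normalized to [0, 1]:

```ts
// Sketch — assumes all scores are normalized to [0, 1].
type Aggregation = "mean" | "min" | "max";

function aggregateScores(scores: number[], mode: Aggregation = "mean"): number {
  if (scores.length === 0) return 1.0; // no gradable checks: vacuous pass (assumption)
  if (mode === "min") return Math.min(...scores); // "weakest link": one bad turn caps the test
  if (mode === "max") return Math.max(...scores);
  return scores.reduce((sum, s) => sum + s, 0) / scores.length; // mean
}
```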

Phase 2: Simulated user + advanced patterns

  • LLM-as-user: define user persona, let the LLM generate user messages dynamically (sketched below)
  • max_turns limit for open-ended conversations
  • Agent provider support (session-based)
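
A rough sketch of the Phase 2 loop, where userLLM is a hypothetical persona-conditioned provider and agentLLM stands in for the system under test:

```ts
// Phase 2 sketch only — userLLM/agentLLM are hypothetical stand-ins.
interface ChatMessage { role: "user" | "assistant"; content: string }

declare function agentLLM(history: ChatMessage[]): Promise<string>;
declare function userLLM(persona: string, history: ChatMessage[]): Promise<string>;

async function simulateConversation(
  persona: string,
  opener: string,
  maxTurns: number, // max_turns: hard stop for open-ended conversations
): Promise<ChatMessage[]> {
  const history: ChatMessage[] = [{ role: "user", content: opener }];
  for (let t = 0; t < maxTurns; t++) {
    history.push({ role: "assistant", content: await agentLLM(history) });
    if (t + 1 < maxTurns) {
      // Simulated user replies in persona, conditioned on the transcript so far.
      history.push({ role: "user", content: await userLLM(persona, history) });
    }
  }
  return history;
}
```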

Files to modify

| File | Change |
| --- | --- |
| packages/core/src/evaluation/types.ts | Add ConversationTurn; add mode, turns, on_turn_failure, window_size, aggregation to EvalTest |
| packages/core/src/evaluation/validation/eval-file.schema.ts | Zod schema for new fields |
| packages/core/src/evaluation/yaml-parser.ts | Parse new fields |
| packages/core/src/evaluation/orchestrator.ts | Conversation runner path in runEvalCase |
| packages/core/src/evaluation/formatting/prompt-builder.ts | Per-turn prompt building with accumulated history |
| packages/core/scripts/generate-eval-schema.ts | Regenerate JSON schema |
| packages/core/test/evaluation/conversation-mode.test.ts | New test file |
| examples/features/multi-turn-conversation/ | Update example |

Acceptance criteria

  • Tests without mode/turns behave identically (no regression)
  • Each turn gets a fresh LLM call with accumulated history
  • Per-turn assertions (string shorthand and structured) grade only that turn's response
  • Actual LLM response (not expected_output) carries forward to next turn
  • Top-level assertions run after all turns with full transcript
  • aggregation: min uses weakest turn score
  • on_turn_failure: stop skips remaining turns
  • on_turn_failure: continue (default) runs all turns regardless
  • window_size limits context for per-turn graders
  • output in results contains full conversation transcript
  • scores array has per-criterion assertions for each turn and conversation-level check
  • Works with depends_on and trials
  • Validation rejects invalid combinations
  • Example updated and passing

Example source

The example above (travel planning) is adapted from promptfoo's eval-conversation-relevance example (MIT license). It tests context retention across 4 turns where each turn builds on prior preferences.

For a larger benchmark, MT-Bench (Apache-2.0) provides 80 two-turn eval pairs with reference answers across 8 categories.

Use cases

  1. Decision tree troubleshooting: Verify the agent asks the right diagnostic questions in order
  2. Customer support flows: Test context retention, persona maintenance, and resolution across turns
  3. User correction handling: Test graceful recovery when the user corrects a previous answer
  4. Workflow plugin eval: Verify an agent follows a prescribed skill pipeline

Research references

| Framework | Key pattern adopted |
| --- | --- |
| DeepEval | ConversationalTestCase turns array, sliding window, conversation-level metrics |
| Azure AI Eval SDK | aggregation: min ("weakest link"), EvaluatorBase universal multi-turn support |
| Braintrust | Top-level assertions for conversation-level grading (separate from per-turn) |
| Google ADK | Simulated user as Phase 2 priority |
| MT-Bench-101 | "Weakest link" scoring semantics |
| MultiChallenge | Evaluation dimensions taxonomy (instruction retention, self-coherence, versioned editing) |
| promptfoo | YAML conversation eval format, conversation-relevance assertion |

Full research: agentevals-research PR #57

Prior art in this repo

Related cleanup (separate issue)

Remove conversation_id top-level field from EvalTest. Hard removal, no deprecation.
