## Summary

Support evaluating multi-turn conversations where the agent generates each assistant turn, with per-turn grading. Today, multi-turn evals script all intermediate assistant responses in `input` — the LLM only generates the last response.
## Research basis
Cross-framework comparison in agentevals-research PR #57 (8 frameworks analyzed). Key findings that shaped this design:
| Finding | Impact on design |
| --- | --- |
| All frameworks converge on a `turns` array | Confirms our schema choice |
| Azure SDK uses `min` aggregation ("weakest link") alongside `mean` | Add `aggregation: mean \| min \| max` |
| Braintrust shows per-turn grading can't catch conversation-level failures | Add top-level `assertions` for post-conversation grading |
| `on_turn_failure: stop \| continue` is sufficient | Add `on_turn_failure` option |
| Google ADK + DeepEval ship simulated user as core | Move to Phase 2, not Phase 3 |
| Azure `EvaluatorBase` makes any grader work in multi-turn mode | All existing AgentV graders should work in `mode: conversation` |
| DeepEval sliding window (default 10 turns) for context bounding | Add `window_size` option for long conversations |
| MultiChallenge taxonomy (instruction retention, inference memory, versioned editing, self-coherence) | Informs built-in conversation grader prompts |
## Proposed schema

Uses AgentV's string shorthand for assertions — plain strings become rubric criteria automatically (see `examples/features/rubric/`).
```yaml
tests:
  - id: travel-planning
    mode: conversation
    criteria: Agent maintains context and provides relevant travel advice across turns
    aggregation: mean # mean (default) | min ("weakest link") | max
    input:
      - role: system
        content: You are a helpful travel planning assistant.
    turns:
      - input: I'm planning a two-week trip to Japan next spring. What areas should I focus on?
        # expected_output is a reference for grading — the actual LLM response
        # (not this) carries forward to the next turn
        expected_output: |-
          Japan in spring is a wonderful choice! I'd recommend focusing on
          Kyoto for its incredible temple complexes, the Japanese Alps for
          stunning mountain hiking, and the Kumano Kodo pilgrimage trails.
          Spring is perfect timing — cherry blossom season runs late March
          through mid-April, and the weather is mild for outdoor exploration.
        assertions:
          - Recommends specific Japan regions or cities
          - Acknowledges spring timing (cherry blossom season, weather)
      - input: I'm mostly interested in traditional culture and nature, not big cities.
        expected_output: |-
          Great choices! For traditional culture and nature, I'd suggest
          staying at a temple lodging in Koyasan, visiting the thatched-roof
          villages of Shirakawa-go, and exploring rural Kyoto areas like
          Ohara and Kurama. The Kumano Kodo trail I mentioned would also
          be perfect for you.
        assertions:
          - Shifts recommendations toward rural/traditional areas
          - Does not lead with Tokyo/Osaka nightlife or urban attractions
          - References or builds on regions mentioned in previous turn
      - input: What about the budget? I'm thinking around $3000 not including flights.
        assertions:
          - Provides advice within $3000 budget
          - Budget advice accounts for cultural/nature preference from prior turns
          - Does not suggest luxury resorts or expensive city hotels
      - input: Can you suggest a rough day-by-day itinerary based on everything we discussed?
        assertions:
          - Itinerary covers approximately 14 days
          - Emphasizes traditional culture and nature activities
          - Feasible within $3000 budget
          - Includes specific locations discussed in earlier turns
    # Conversation-level assertions graded over the full transcript
    assertions:
      - Agent consistently remembers destination, preferences, budget, and timing across all turns
      - Agent never contradicts its own prior recommendations
      - Each turn builds on prior context rather than starting fresh
```
### Mixing shorthand with structured assertions

Per-turn assertions can also use the full evaluator config when you need weights, `required`, or non-rubric types:
```yaml
- input: Queue
  assertions:
    - Asks about required tags (RTR, RTK, WFC, etc.)
    - type: contains
      value: CSS
    - type: rubrics
      criteria:
        - id: no-premature-diagnosis
          outcome: Does not jump to a solution before gathering information
          weight: 2
          required: true
```
## How it works

1. `input` provides the system prompt and initial context (same as today).
2. For each entry in `turns` (sketched in code below):
   a. Append the user message to the accumulated message history.
   b. Call the provider with the full history — the LLM generates an assistant response.
   c. Grade the response against that turn's `assertions` (if present) and `expected_output` (if present, compared via an implicit llm-grader).
   d. Append the actual LLM response (not `expected_output`) to the history for the next turn.
3. After all turns: run top-level `assertions` over the full transcript (if present). If no per-turn or conversation assertions exist, run the top-level `criteria` as an implicit llm-grader (backward-compatible fallback).
4. Final test score = `aggregation` of per-turn scores plus conversation assertion scores. Turns without `assertions`/`expected_output` score 1.0.
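A minimal TypeScript sketch of this loop, assuming hypothetical `callProvider` and `gradeTurn` helpers (the real path lives in `runEvalCase`; conversation-level assertions are omitted for brevity):

```typescript
// Minimal local types; the real ConversationTurn is defined in the next section.
type Message = { role: "system" | "user" | "assistant"; content: string };
type Turn = { input: string; expected_output?: string; assertions?: readonly unknown[] };

// Hypothetical helpers standing in for the provider call and the grader pipeline.
declare function callProvider(history: readonly Message[]): Promise<string>;
declare function gradeTurn(turn: Turn, history: readonly Message[], output: string): Promise<number>;

async function runConversation(test: {
  input: readonly Message[];
  turns: readonly Turn[];
  aggregation?: "mean" | "min" | "max"; // default: mean
  on_turn_failure?: "stop" | "continue"; // default: continue
}): Promise<number> {
  const history: Message[] = [...test.input];
  const scores: number[] = [];

  for (const turn of test.turns) {
    history.push({ role: "user", content: turn.input });
    const output = await callProvider(history); // LLM generates this assistant turn
    const hasGrading = turn.assertions !== undefined || turn.expected_output !== undefined;
    const score = hasGrading ? await gradeTurn(turn, history, output) : 1.0; // ungraded turns score 1.0
    scores.push(score);
    history.push({ role: "assistant", content: output }); // the actual response carries forward
    // Assumption: any score below 1.0 counts as a failed turn for on_turn_failure purposes.
    if (test.on_turn_failure === "stop" && score < 1.0) {
      while (scores.length < test.turns.length) scores.push(0); // skipped turns score 0
      break;
    }
  }

  switch (test.aggregation ?? "mean") {
    case "min": return Math.min(...scores);
    case "max": return Math.max(...scores);
    default: return scores.reduce((a, b) => a + b, 0) / scores.length;
  }
}
```

Keeping aggregation as a pure function over the per-turn score array is what makes `min`/`max` cheap to support alongside `mean`.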
Turn schema
interface ConversationTurn {
// Each turn is a user message. The runner generates the assistant response.
readonly input: TestMessageContent; // string or structured content
readonly expected_output?: TestMessageContent;
readonly assertions?: readonly (string | EvaluatorConfig)[]; // strings = rubric shorthand
}
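For orientation, the new `EvalTest` fields proposed in this design could be typed roughly as follows (a sketch; exact placement and naming in `types.ts` may differ):

```typescript
// Illustrative additions to EvalTest; ConversationTurn as defined above.
interface EvalTestConversationFields {
  readonly mode?: "conversation";
  readonly turns?: readonly ConversationTurn[];
  readonly aggregation?: "mean" | "min" | "max"; // default: mean
  readonly on_turn_failure?: "stop" | "continue"; // default: continue
  readonly window_size?: number; // default: all turns
}
```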
## Grader template variables per turn

Existing template variables — no new variables needed:

- `{{ input }}` — full conversation history up to and including this user message
- `{{ output }}` — the LLM's response for this turn only
- `{{ criteria }}` — the top-level test criteria (shared across all turns)
- `{{ expected_output }}` — this turn's `expected_output` (if present)

For top-level `assertions`:

- `{{ input }}` — the full conversation transcript (all turns)
- `{{ output }}` — the last assistant response
- `{{ criteria }}` — the top-level test criteria
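A sketch of how these variables could be assembled per turn, including the optional `window_size` bound described under "Interaction with existing features" (the helper name and the turn-to-message mapping are assumptions):

```typescript
type Message = { role: string; content: string };

// Hypothetical helper: build grader template variables for one turn.
function buildTurnVars(
  history: Message[], // accumulated messages up to and including this user turn
  output: string, // the LLM's response for this turn only
  criteria: string,
  expectedOutput?: string,
  windowSize?: number, // optional sliding window over prior turns
): Record<string, string | undefined> {
  // window_size bounds the context passed to per-turn graders (default: all turns).
  // Assumption: one "turn" is a user + assistant pair, i.e. two messages.
  const windowed = windowSize ? history.slice(-windowSize * 2) : history;
  return {
    input: windowed.map((m) => `${m.role}: ${m.content}`).join("\n"),
    output,
    criteria,
    expected_output: expectedOutput,
  };
}
```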
## Validation rules

- `turns` requires `mode: conversation` — error if `turns` is present without it
- `mode: conversation` requires `turns` — error if `mode` is set but `turns` is missing or empty
- Each turn must have a non-empty `input`
- Tests without `mode`/`turns` behave identically to today (no regression)
- `turns` is incompatible with top-level `expected_output` — error if both are present
- `aggregation` is only valid when `mode: conversation` is set
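These cross-field rules map naturally onto a Zod `superRefine`. A minimal sketch, not the full eval schema (`input` is simplified to a string here; the real schema allows structured content):

```typescript
import { z } from "zod";

const conversationChecks = z
  .object({
    mode: z.enum(["conversation"]).optional(),
    turns: z.array(z.object({ input: z.string().min(1) })).optional(),
    expected_output: z.string().optional(),
    aggregation: z.enum(["mean", "min", "max"]).optional(),
  })
  .superRefine((test, ctx) => {
    const issue = (message: string, path: string[]) =>
      ctx.addIssue({ code: z.ZodIssueCode.custom, message, path });
    if (test.turns && test.mode !== "conversation")
      issue("turns requires mode: conversation", ["turns"]);
    if (test.mode === "conversation" && (!test.turns || test.turns.length === 0))
      issue("mode: conversation requires non-empty turns", ["mode"]);
    if (test.turns && test.expected_output !== undefined)
      issue("turns is incompatible with top-level expected_output", ["expected_output"]);
    if (test.aggregation && test.mode !== "conversation")
      issue("aggregation is only valid with mode: conversation", ["aggregation"]);
  });
```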
## Interaction with existing features

| Feature | Interaction |
| --- | --- |
| `depends_on` | Works normally — a conversation test can depend on other tests and vice versa |
| `trials` | Each trial runs the full conversation independently |
| `workspace` | Not applicable in v1 (LLM providers only) |
| `assertions` (top-level) | Grades the full conversation after all turns complete. Complementary with per-turn assertions |
| `on_turn_failure` | `continue` (default) or `stop`. When `stop`, remaining turns are skipped and scored 0 |
| `window_size` | Optional. Limits context passed to per-turn graders (default: all turns). Useful for long conversations |
## Output shape

One `EvaluationResult` per test. Per-turn rubric assertions show exactly what passed/failed:
```json
{
  "test_id": "travel-planning",
  "score": 0.88,
  "execution_status": "ok",
  "scores": [
    {
      "name": "turn-1", "type": "rubrics", "score": 1.0, "verdict": "pass",
      "assertions": [
        {"text": "Recommends specific Japan regions or cities", "passed": true},
        {"text": "Acknowledges spring timing", "passed": true}
      ]
    },
    {
      "name": "turn-2", "type": "rubrics", "score": 0.67, "verdict": "fail",
      "assertions": [
        {"text": "Shifts recommendations toward rural/traditional areas", "passed": true},
        {"text": "Does not lead with urban attractions", "passed": true},
        {"text": "References or builds on regions mentioned in previous turn", "passed": false}
      ]
    },
    {
      "name": "turn-3", "type": "rubrics", "score": 1.0, "verdict": "pass",
      "assertions": [
        {"text": "Provides advice within $3000 budget", "passed": true},
        {"text": "Budget accounts for culture/nature preference", "passed": true},
        {"text": "Does not suggest luxury resorts", "passed": true}
      ]
    },
    {
      "name": "turn-4", "type": "rubrics", "score": 0.75, "verdict": "fail",
      "assertions": [
        {"text": "Itinerary covers approximately 14 days", "passed": true},
        {"text": "Emphasizes traditional culture and nature", "passed": true},
        {"text": "Feasible within $3000 budget", "passed": true},
        {"text": "Includes specific locations from earlier turns", "passed": false}
      ]
    },
    {
      "name": "assertions", "type": "rubrics", "score": 0.67, "verdict": "fail",
      "assertions": [
        {"text": "Remembers destination, preferences, budget, timing across all turns", "passed": true},
        {"text": "Never contradicts own prior recommendations", "passed": true},
        {"text": "Each turn builds on prior context", "passed": false}
      ]
    }
  ],
  "output": [
    {"role": "user", "content": "I'm planning a two-week trip to Japan..."},
    {"role": "assistant", "content": "Japan in spring is wonderful! I'd recommend..."},
    {"role": "user", "content": "I'm mostly interested in traditional culture..."},
    {"role": "assistant", "content": "For traditional culture and nature, consider..."},
    {"role": "user", "content": "What about the budget?..."},
    {"role": "assistant", "content": "With $3000, you can comfortably..."},
    {"role": "user", "content": "Can you suggest a rough itinerary?..."},
    {"role": "assistant", "content": "Here's a 14-day itinerary..."}
  ]
}
```
## Implementation

### Phase 1: Core turn runner + conversation assertions

- Add the `ConversationTurn` type and the `mode`, `turns`, `on_turn_failure`, `window_size`, and `aggregation` fields to `EvalTest`
- Zod schema and YAML parser updates
- Turn-by-turn loop in `runEvalCase`: accumulate messages, call provider, grade, repeat
- Conversation assertions run after all turns
- Aggregation: `mean` (default), `min`, `max`
- All existing grader types work per-turn (no grader changes needed)
- String shorthand in per-turn assertions works identically to top-level (the existing parser handles this)
### Phase 2: Simulated user + advanced patterns

- LLM-as-user: define a user persona, let the LLM generate user messages dynamically (see the sketch after this list)
- `max_turns` limit for open-ended conversations
- Agent provider support (session-based)
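A rough sketch of the simulated-user loop, assuming a hypothetical `generateUserMessage` that prompts a second LLM with the persona and the transcript so far (everything here is illustrative):

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

declare function callProvider(history: readonly Message[]): Promise<string>;
// Hypothetical: a second LLM plays the user, returning null when it considers
// the conversation complete.
declare function generateUserMessage(persona: string, history: readonly Message[]): Promise<string | null>;

async function runSimulatedConversation(
  persona: string,
  seed: readonly Message[],
  maxTurns: number, // max_turns bounds open-ended conversations
): Promise<Message[]> {
  const history: Message[] = [...seed];
  for (let turn = 0; turn < maxTurns; turn++) {
    const userMessage = await generateUserMessage(persona, history);
    if (userMessage === null) break; // simulated user decided the conversation is done
    history.push({ role: "user", content: userMessage });
    const reply = await callProvider(history);
    history.push({ role: "assistant", content: reply });
  }
  return history; // graded afterwards, same as scripted turns
}
```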
## Files to modify

| File | Change |
| --- | --- |
| `packages/core/src/evaluation/types.ts` | Add `ConversationTurn` and the `mode`, `turns`, `on_turn_failure`, `window_size`, `aggregation` fields to `EvalTest` |
| `packages/core/src/evaluation/validation/eval-file.schema.ts` | Zod schema for new fields |
| `packages/core/src/evaluation/yaml-parser.ts` | Parse new fields |
| `packages/core/src/evaluation/orchestrator.ts` | Conversation runner path in `runEvalCase` |
| `packages/core/src/evaluation/formatting/prompt-builder.ts` | Per-turn prompt building with accumulated history |
| `packages/core/scripts/generate-eval-schema.ts` | Regenerate JSON schema |
| `packages/core/test/evaluation/conversation-mode.test.ts` | New test file |
| `examples/features/multi-turn-conversation/` | Update example |
## Acceptance criteria

- Tests without `mode`/`turns` behave identically (no regression)
- `aggregation: min` uses the weakest turn score
- `on_turn_failure: stop` skips remaining turns
- `on_turn_failure: continue` (default) runs all turns regardless
- `window_size` limits context for per-turn graders
- `output` in results contains the full conversation transcript
- The `scores` array has per-criterion assertions for each turn and the conversation-level check
- Conversation tests compose with `depends_on` and `trials`
## Example source
The example above (travel planning) is adapted from promptfoo's eval-conversation-relevance example (MIT license). It tests context retention across 4 turns where each turn builds on prior preferences.
For a larger benchmark, MT-Bench (Apache-2.0) provides 80 two-turn eval pairs with reference answers across 8 categories.
## Use cases
- Decision tree troubleshooting: Verify the agent asks the right diagnostic questions in order
- Customer support flows: Test context retention, persona maintenance, and resolution across turns
- User correction handling: Test graceful recovery when the user corrects a previous answer
- Workflow plugin eval: Verify an agent follows a prescribed skill pipeline
Research references
| Framework |
Key pattern adopted |
| DeepEval |
ConversationalTestCase turns array, sliding window, conversation-level metrics |
| Azure AI Eval SDK |
aggregation: min ("weakest link"), EvaluatorBase universal multi-turn support |
| Braintrust |
Top-level assertions for conversation-level grading (separate from per-turn) |
| Google ADK |
Simulated user as Phase 2 priority |
| MT-Bench-101 |
"Weakest link" scoring semantics |
| MultiChallenge |
Evaluation dimensions taxonomy (instruction retention, self-coherence, versioned editing) |
| promptfoo |
YAML conversation eval format, conversation-relevance assertion |
Full research: agentevals-research PR #57
## Prior art in this repo

- `depends_on` DAG scheduler
## Related cleanup (separate issue)

Remove the `conversation_id` top-level field from `EvalTest`. Hard removal, no deprecation.