
feat(eval): multi-turn conversational test case — live turn-by-turn evaluation #1052

@christso

Description

Summary

Support evaluating multi-turn conversations where the agent generates each assistant turn, with per-turn grading. Today, multi-turn evals must script every intermediate assistant response in input — the LLM generates only the final response.

Research basis

Cross-framework comparison in agentevals-research PR #57 (8 frameworks analyzed). Key findings that shaped this design:

| Finding | Impact on design |
| --- | --- |
| All frameworks converge on a turns array | Confirms our schema choice |
| Azure SDK uses min aggregation ("weakest link") alongside mean | Add `aggregation: mean \| min \| max` |
| Braintrust shows per-turn grading can't catch conversation-level failures | Add top-level assertions for post-conversation grading |
| — | `on_turn_failure: stop \| continue` is sufficient |
| Google ADK + DeepEval ship simulated user as core | Move to Phase 2, not Phase 3 |
| Azure EvaluatorBase makes any grader work in multi-turn mode | All existing AgentV graders should work in mode: conversation |
| DeepEval sliding window (default 10 turns) for context bounding | Add window_size option for long conversations |
| MultiChallenge taxonomy (instruction retention, inference memory, versioned editing, self-coherence) | Informs built-in conversation grader prompts |

Proposed schema

Uses AgentV's string shorthand for assertions — plain strings become rubric criteria automatically (see examples/features/rubric/).

tests:
  - id: travel-planning
    mode: conversation
    criteria: Agent maintains context and provides relevant travel advice across turns
    aggregation: mean   # mean (default) | min ("weakest link") | max
    input:
      - role: system
        content: You are a helpful travel planning assistant.

    turns:

      - input: I'm planning a two-week trip to Japan next spring. What areas should I focus on?
        # expected_output is a reference for grading — the actual LLM response
        # (not this) carries forward to the next turn
        expected_output: |-
          Japan in spring is a wonderful choice! I'd recommend focusing on
          Kyoto for its incredible temple complexes, the Japanese Alps for
          stunning mountain hiking, and the Kumano Kodo pilgrimage trails.
          Spring is perfect timing — cherry blossom season runs late March
          through mid-April, and the weather is mild for outdoor exploration.
        assertions:
          - Recommends specific Japan regions or cities
          - Acknowledges spring timing (cherry blossom season, weather)


      - input: I'm mostly interested in traditional culture and nature, not big cities.
        expected_output: |-
          Great choices! For traditional culture and nature, I'd suggest
          staying at a temple lodging in Koyasan, visiting the thatched-roof
          villages of Shirakawa-go, and exploring rural Kyoto areas like
          Ohara and Kurama. The Kumano Kodo trail I mentioned would also
          be perfect for you.
        assertions:
          - Shifts recommendations toward rural/traditional areas
          - Does not lead with Tokyo/Osaka nightlife or urban attractions
          - References or builds on regions mentioned in previous turn


      - input: What about the budget? I'm thinking around $3000 not including flights.
        assertions:
          - Provides advice within $3000 budget
          - Budget advice accounts for cultural/nature preference from prior turns
          - Does not suggest luxury resorts or expensive city hotels


      - input: Can you suggest a rough day-by-day itinerary based on everything we discussed?
        assertions:
          - Itinerary covers approximately 14 days
          - Emphasizes traditional culture and nature activities
          - Feasible within $3000 budget
          - Includes specific locations discussed in earlier turns

    # Conversation-level
    assertions:
      - Agent consistently remembers destination, preferences, budget, and timing across all turns
      - Agent never contradicts its own prior recommendations
      - Each turn builds on prior context rather than starting fresh

Mixing shorthand with structured assertions

Per-turn assertions can also use the full evaluator config when you need weights, required, or non-rubric types; both forms normalize to the same evaluator config (sketched after this example):

      - input: Queue
        assertions:
          - Asks about required tags (RTR, RTK, WFC, etc.)
          - type: contains
            value: CSS
          - type: rubrics
            criteria:
              - id: no-premature-diagnosis
                outcome: Does not jump to a solution before gathering information
                weight: 2
                required: true
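
A sketch of the intended normalization — the types here are simplified stand-ins (the real ones live in packages/core/src/evaluation/types.ts), and the existing top-level parser already behaves this way for string shorthand:

```ts
// Sketch — simplified types; AgentV's real definitions are richer.
interface RubricCriterion { id?: string; outcome: string; weight?: number; required?: boolean }
type EvaluatorConfig =
  | { type: "contains"; value: string }
  | { type: "rubrics"; criteria: RubricCriterion[] };

// Plain strings become single-criterion rubrics; structured configs pass through.
function normalizeAssertion(entry: string | EvaluatorConfig): EvaluatorConfig {
  if (typeof entry === "string") {
    return { type: "rubrics", criteria: [{ outcome: entry }] };
  }
  return entry;
}
```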

How it works

  1. input provides system prompt and initial context (same as today)
  2. For each entry in turns:
    a. Append the user message to the accumulated message history
    b. Call the provider with the full history — LLM generates an assistant response
    c. Grade the response against that turn's assertions (if present) and expected_output (if present, compared via implicit llm-grader)
    d. Append the actual LLM response (not expected_output) to history for the next turn
  3. After all turns: run top-level assertions over the full transcript (if present). Also run top-level criteria as implicit llm-grader if no per-turn or conversation assertions exist (backward-compatible fallback).
  4. Final test score = the configured aggregation applied over the per-turn scores and conversation assertion scores. Turns without assertions/expected_output score 1.0. (The loop is sketched below.)
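
A minimal sketch of this loop in TypeScript, assuming hypothetical callProvider and gradeTurn helpers — the real implementation would live in runEvalCase:

```ts
// Sketch only — callProvider and gradeTurn stand in for the provider call
// and per-turn grading that runEvalCase would perform.
interface Message { role: "system" | "user" | "assistant"; content: string }
interface Turn { input: string; expected_output?: string; assertions?: unknown[] }
interface TurnResult { score: number; verdict: "pass" | "fail" }

declare function callProvider(history: Message[]): Promise<string>;
declare function gradeTurn(turn: Turn, reply: string, history: Message[]): Promise<TurnResult>;

async function runConversation(
  initial: Message[],
  turns: Turn[],
  onTurnFailure: "continue" | "stop" = "continue",
): Promise<{ history: Message[]; turnScores: number[] }> {
  const history = [...initial];
  const turnScores: number[] = [];
  for (let i = 0; i < turns.length; i++) {
    history.push({ role: "user", content: turns[i].input });  // (a) append user message
    const reply = await callProvider(history);                // (b) LLM generates this turn
    const result = await gradeTurn(turns[i], reply, history); // (c) grade this turn only
    history.push({ role: "assistant", content: reply });      // (d) actual reply carries forward
    turnScores.push(result.score);
    if (onTurnFailure === "stop" && result.verdict === "fail") {
      // Remaining turns are skipped and scored 0.
      turnScores.push(...new Array<number>(turns.length - i - 1).fill(0));
      break;
    }
  }
  return { history, turnScores };
}
```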

Turn schema

interface ConversationTurn {
  // Each turn is a user message. The runner generates the assistant response.
  readonly input: TestMessageContent;  // string or structured content
  readonly expected_output?: TestMessageContent;
  readonly assertions?: readonly (string | EvaluatorConfig)[];  // strings = rubric shorthand
}

Grader template variables per turn

Existing template variables — no new variables needed (their per-turn assembly is sketched below):

  • {{ input }} — full conversation history up to and including this user message
  • {{ output }} — the LLM's response for this turn only
  • {{ criteria }} — the top-level test criteria (shared across all turns)
  • {{ expected_output }} — this turn's expected_output (if present)

For top-level assertions:

  • {{ input }} — the full conversation transcript (all turns)
  • {{ output }} — the last assistant response
  • {{ criteria }} — the top-level test criteria
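
For illustration, a sketch of how these variables could be assembled per turn. The transcript rendering and helper names are assumptions, not the existing template plumbing, and the window is applied over messages here for simplicity (the proposal's window_size counts turns):

```ts
// Sketch — renderTranscript and buildTurnVars are illustrative helpers.
interface Msg { role: string; content: string }

function renderTranscript(messages: Msg[]): string {
  return messages.map((m) => `${m.role}: ${m.content}`).join("\n");
}

function buildTurnVars(
  history: Msg[],        // accumulated history, ending with this turn's user message
  output: string,        // this turn's assistant response only
  criteria: string,      // shared top-level test criteria
  expectedOutput?: string,
  windowSize?: number,   // window_size: bound the context passed to the grader
): Record<string, string> {
  const windowed = windowSize === undefined ? history : history.slice(-windowSize);
  const vars: Record<string, string> = {
    input: renderTranscript(windowed),
    output,
    criteria,
  };
  if (expectedOutput !== undefined) vars.expected_output = expectedOutput;
  return vars;
}
```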

Validation rules

  • turns requires mode: conversation — error if turns present without it
  • mode: conversation requires turns — error if mode set but turns missing/empty
  • Each turn must have non-empty input
  • Tests without mode/turns behave identically to today (no regression)
  • turns is incompatible with top-level expected_output — error if both present
  • aggregation only valid when mode: conversation is set (these rules are sketched as Zod refinements below)
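
A sketch of these rules as Zod refinements, assuming a simplified test shape — the real schema in eval-file.schema.ts carries many more fields:

```ts
import { z } from "zod";

// Simplified sketch — field names follow this proposal, not the full schema.
const turnSchema = z.object({
  input: z.string().min(1, "Each turn must have non-empty input"),
  expected_output: z.string().optional(),
  assertions: z.array(z.union([z.string(), z.record(z.unknown())])).optional(),
});

const testSchema = z
  .object({
    mode: z.literal("conversation").optional(),
    turns: z.array(turnSchema).optional(),
    aggregation: z.enum(["mean", "min", "max"]).optional(),
    expected_output: z.string().optional(),
  })
  .superRefine((test, ctx) => {
    const issue = (message: string) =>
      ctx.addIssue({ code: z.ZodIssueCode.custom, message });
    if (test.turns && test.mode !== "conversation")
      issue("turns requires mode: conversation");
    if (test.mode === "conversation" && !(test.turns && test.turns.length > 0))
      issue("mode: conversation requires non-empty turns");
    if (test.turns && test.expected_output !== undefined)
      issue("turns is incompatible with top-level expected_output");
    if (test.aggregation && test.mode !== "conversation")
      issue("aggregation is only valid with mode: conversation");
  });
```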

Interaction with existing features

| Feature | Interaction |
| --- | --- |
| depends_on | Works normally — a conversation test can depend on other tests and vice versa |
| trials | Each trial runs the full conversation independently |
| workspace | Not applicable in v1 (LLM providers only) |
| assertions (top-level) | Grades the full conversation after all turns complete; complementary with per-turn assertions |
| on_turn_failure | continue (default) or stop. When stop, remaining turns are skipped and scored 0 |
| window_size | Optional. Limits context passed to per-turn graders (default: all turns). Useful for long conversations |

Output shape

One EvaluationResult per test. Per-turn rubric assertions show exactly what passed/failed:

{
  "test_id": "travel-planning",
  "score": 0.88,
  "execution_status": "ok",
  "scores": [
    {
      "name": "turn-1", "type": "rubrics", "score": 1.0, "verdict": "pass",
      "assertions": [
        {"text": "Recommends specific Japan regions or cities", "passed": true},
        {"text": "Acknowledges spring timing", "passed": true}
      ]
    },
    {
      "name": "turn-2", "type": "rubrics", "score": 0.67, "verdict": "fail",
      "assertions": [
        {"text": "Shifts recommendations toward rural/traditional areas", "passed": true},
        {"text": "Does not lead with urban attractions", "passed": true},
        {"text": "References or builds on regions mentioned in previous turn", "passed": false}
      ]
    },
    {
      "name": "turn-3", "type": "rubrics", "score": 1.0, "verdict": "pass",
      "assertions": [
        {"text": "Provides advice within $3000 budget", "passed": true},
        {"text": "Budget accounts for culture/nature preference", "passed": true},
        {"text": "Does not suggest luxury resorts", "passed": true}
      ]
    },
    {
      "name": "turn-4", "type": "rubrics", "score": 0.75, "verdict": "fail",
      "assertions": [
        {"text": "Itinerary covers approximately 14 days", "passed": true},
        {"text": "Emphasizes traditional culture and nature", "passed": true},
        {"text": "Feasible within $3000 budget", "passed": true},
        {"text": "Includes specific locations from earlier turns", "passed": false}
      ]
    },
    {
      "name": "assertions", "type": "rubrics", "score": 0.67, "verdict": "fail",
      "assertions": [
        {"text": "Remembers destination, preferences, budget, timing across all turns", "passed": true},
        {"text": "Never contradicts own prior recommendations", "passed": true},
        {"text": "Each turn builds on prior context", "passed": false}
      ]
    }
  ],
  "output": [
    {"role": "user", "content": "I'm planning a two-week trip to Japan..."},
    {"role": "assistant", "content": "Japan in spring is wonderful! I'd recommend..."},
    {"role": "user", "content": "I'm mostly interested in traditional culture..."},
    {"role": "assistant", "content": "For traditional culture and nature, consider..."},
    {"role": "user", "content": "What about the budget?..."},
    {"role": "assistant", "content": "With $3000, you can comfortably..."},
    {"role": "user", "content": "Can you suggest a rough itinerary?..."},
    {"role": "assistant", "content": "Here's a 14-day itinerary..."}
  ]
}

Implementation

Phase 1: Core turn runner + conversation assertions

  • Add ConversationTurn type; add mode, turns, on_turn_failure, window_size, and aggregation to EvalTest
  • Zod schema and YAML parser updates
  • Turn-by-turn loop in runEvalCase: accumulate messages, call provider, grade, repeat
  • Conversation assertions run after all turns
  • Aggregation: mean (default), min, max (sketched after this list)
  • All existing grader types work per-turn (no grader changes needed)
  • String shorthand in per-turn assertions works identically to top-level (existing parser handles this)
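
A sketch of the aggregation step, assuming per-turn and conversation scores are already normalized to [0, 1]:

```ts
// Sketch — assumes all scores are normalized to [0, 1].
type Aggregation = "mean" | "min" | "max";

function aggregateScores(scores: number[], mode: Aggregation = "mean"): number {
  if (scores.length === 0) return 1.0; // no gradable checks: vacuous pass (assumption)
  if (mode === "min") return Math.min(...scores); // "weakest link": one bad turn caps the test
  if (mode === "max") return Math.max(...scores);
  return scores.reduce((sum, s) => sum + s, 0) / scores.length; // mean
}
```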

Phase 2: Simulated user + advanced patterns

  • LLM-as-user: define user persona, let the LLM generate user messages dynamically (sketched below)
  • max_turns limit for open-ended conversations
  • Agent provider support (session-based)
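
A rough sketch of the Phase 2 loop, where userLLM is a hypothetical persona-conditioned provider and agentLLM stands in for the system under test:

```ts
// Phase 2 sketch only — userLLM/agentLLM are hypothetical stand-ins.
interface ChatMessage { role: "user" | "assistant"; content: string }

declare function agentLLM(history: ChatMessage[]): Promise<string>;
declare function userLLM(persona: string, history: ChatMessage[]): Promise<string>;

async function simulateConversation(
  persona: string,
  opener: string,
  maxTurns: number, // max_turns: hard stop for open-ended conversations
): Promise<ChatMessage[]> {
  const history: ChatMessage[] = [{ role: "user", content: opener }];
  for (let t = 0; t < maxTurns; t++) {
    history.push({ role: "assistant", content: await agentLLM(history) });
    if (t + 1 < maxTurns) {
      // Simulated user replies in persona, conditioned on the transcript so far.
      history.push({ role: "user", content: await userLLM(persona, history) });
    }
  }
  return history;
}
```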

Files to modify

| File | Change |
| --- | --- |
| packages/core/src/evaluation/types.ts | Add ConversationTurn; add mode, turns, on_turn_failure, window_size, aggregation to EvalTest |
| packages/core/src/evaluation/validation/eval-file.schema.ts | Zod schema for new fields |
| packages/core/src/evaluation/yaml-parser.ts | Parse new fields |
| packages/core/src/evaluation/orchestrator.ts | Conversation runner path in runEvalCase |
| packages/core/src/evaluation/formatting/prompt-builder.ts | Per-turn prompt building with accumulated history |
| packages/core/scripts/generate-eval-schema.ts | Regenerate JSON schema |
| packages/core/test/evaluation/conversation-mode.test.ts | New test file |
| examples/features/multi-turn-conversation/ | Update example |

Acceptance criteria

  • Tests without mode/turns behave identically (no regression)
  • Each turn gets a fresh LLM call with accumulated history
  • Per-turn assertions (string shorthand and structured) grade only that turn's response
  • Actual LLM response (not expected_output) carries forward to next turn
  • Top-level assertions run after all turns with full transcript
  • aggregation: min uses weakest turn score
  • on_turn_failure: stop skips remaining turns
  • on_turn_failure: continue (default) runs all turns regardless
  • window_size limits context for per-turn graders
  • output in results contains full conversation transcript
  • scores array has per-criterion assertions for each turn and conversation-level check
  • Works with depends_on and trials
  • Validation rejects invalid combinations
  • Example updated and passing

Example source

The example above (travel planning) is adapted from promptfoo's eval-conversation-relevance example (MIT license). It tests context retention across 4 turns where each turn builds on prior preferences.

For a larger benchmark, MT-Bench (Apache-2.0) provides 80 two-turn eval pairs with reference answers across 8 categories.

Use cases

  1. Decision tree troubleshooting: Verify the agent asks the right diagnostic questions in order
  2. Customer support flows: Test context retention, persona maintenance, and resolution across turns
  3. User correction handling: Test graceful recovery when the user corrects a previous answer
  4. Workflow plugin eval: Verify an agent follows a prescribed skill pipeline

Research references

| Framework | Key pattern adopted |
| --- | --- |
| DeepEval | ConversationalTestCase turns array, sliding window, conversation-level metrics |
| Azure AI Eval SDK | aggregation: min ("weakest link"), EvaluatorBase universal multi-turn support |
| Braintrust | Top-level assertions for conversation-level grading (separate from per-turn) |
| Google ADK | Simulated user as Phase 2 priority |
| MT-Bench-101 | "Weakest link" scoring semantics |
| MultiChallenge | Evaluation dimensions taxonomy (instruction retention, self-coherence, versioned editing) |
| promptfoo | YAML conversation eval format, conversation-relevance assertion |

Full research: agentevals-research PR #57

Prior art in this repo

Related cleanup (separate issue)

Remove conversation_id top-level field from EvalTest. Hard removal, no deprecation.
