55 changes: 55 additions & 0 deletions docs/plans/1052-conversation-mode.md
@@ -0,0 +1,55 @@
# Issue #1052: Multi-turn Conversational Test Case — Live Turn-by-Turn Evaluation

## Problem

Today, multi-turn evals must script every intermediate assistant response in `input`; the LLM generates only the final response. As a result, conversation context retention, progressive reasoning, and turn-by-turn response quality cannot be measured independently.

## Solution

Add `mode: conversation` with a `turns` array that drives turn-by-turn LLM evaluation with per-turn and conversation-level grading.

### New Schema Fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `mode` | `'conversation'` | - | Enables conversation evaluation mode |
| `turns` | `ConversationTurn[]` | - | Ordered user messages; each generates an LLM call |
| `aggregation` | `'mean' \| 'min' \| 'max'` | `'mean'` | How turn scores combine into final score |
| `on_turn_failure` | `'continue' \| 'stop'` | `'continue'` | What to do when a turn's assertions fail |
| `window_size` | `number` | all turns | Sliding window for context passed to graders |
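
A sketch of how these fields might land on `EvalTest` in `types.ts` (the field names and defaults come from the table above; the interface shapes and the `withDefaults` helper are illustrative assumptions, not the shipped code):

```typescript
// Illustrative sketch only: field names and defaults are from the design
// table; everything else here is hypothetical.
interface ConversationTurn {
  input: string;            // non-empty user message for this turn
  assertions?: string[];    // per-turn graders (string-shorthand rubrics)
  expected_output?: string; // optional per-turn reference answer
}

interface ConversationTest {
  mode: 'conversation';
  turns: ConversationTurn[];
  aggregation?: 'mean' | 'min' | 'max';   // how turn scores combine
  on_turn_failure?: 'continue' | 'stop';  // behavior when a turn's assertions fail
  window_size?: number;                   // grader context window; default is all turns
}

// Apply the documented defaults.
function withDefaults(t: ConversationTest) {
  return {
    ...t,
    aggregation: t.aggregation ?? 'mean',
    on_turn_failure: t.on_turn_failure ?? 'continue',
  };
}
```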

### How It Works

1. `input` provides the system prompt and initial context (same as today).
2. For each entry in `turns`:
   a. Append the user message to the accumulated history.
   b. Call the provider with the full history; the LLM generates the assistant response.
   c. Grade the response against the turn's `assertions` and `expected_output`.
   d. Append the actual LLM response (not `expected_output`) to the history.
3. After all turns, run the top-level `assertions` over the full transcript.
4. Final score = aggregation of the per-turn and conversation-level assertion scores.
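
Sketched in TypeScript, the loop looks roughly like this (`callProvider` and `grade` are stand-ins for the real provider and grader plumbing, and the conversation-level assertions from step 3 are omitted for brevity):

```typescript
type Msg = { role: 'system' | 'user' | 'assistant'; content: string };

// Simplified illustration of the turn loop inside runEvalCase.
async function runConversation(
  history: Msg[],                                   // seeded from `input`
  turns: { input: string; assertions?: string[] }[],
  callProvider: (h: Msg[]) => Promise<string>,
  grade: (response: string, assertions: string[]) => Promise<number>,
  aggregation: 'mean' | 'min' | 'max' = 'mean',
  onTurnFailure: 'continue' | 'stop' = 'continue',
): Promise<number> {
  const scores: number[] = [];
  for (const turn of turns) {
    history.push({ role: 'user', content: turn.input });        // step (a)
    const response = await callProvider(history);               // step (b)
    const score = await grade(response, turn.assertions ?? []); // step (c)
    scores.push(score);
    // Step (d): the actual LLM response goes into history, not expected_output.
    history.push({ role: 'assistant', content: response });
    if (score < 1 && onTurnFailure === 'stop') break;
  }
  if (aggregation === 'min') return Math.min(...scores);
  if (aggregation === 'max') return Math.max(...scores);
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}
```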

### Validation Rules

- `turns` requires `mode: conversation`
- `mode: conversation` requires `turns`
- `turns` incompatible with top-level `expected_output`
- `aggregation` only valid with `mode: conversation`
- Each turn must have non-empty `input`
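
The same rules, written as plain guards rather than the Zod schema that actually enforces them in `eval-file.schema.ts` (a simplified sketch):

```typescript
// Mirrors the validation rules above as standalone checks; the real
// implementation lives in the Zod schema.
function validateConversationFields(test: {
  mode?: string;
  turns?: { input?: string }[];
  aggregation?: string;
  expected_output?: string;
}): string[] {
  const errors: string[] = [];
  if (test.turns && test.mode !== 'conversation')
    errors.push('`turns` requires `mode: conversation`');
  if (test.mode === 'conversation' && !test.turns?.length)
    errors.push('`mode: conversation` requires `turns`');
  if (test.turns && test.expected_output !== undefined)
    errors.push('`turns` is incompatible with top-level `expected_output`');
  if (test.aggregation && test.mode !== 'conversation')
    errors.push('`aggregation` is only valid with `mode: conversation`');
  for (const [i, turn] of (test.turns ?? []).entries())
    if (!turn.input?.trim()) errors.push(`turns[${i}] must have non-empty input`);
  return errors;
}
```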

### Files Modified

| File | Change |
|------|--------|
| `packages/core/src/evaluation/types.ts` | ConversationTurn, mode, turns, etc. on EvalTest |
| `packages/core/src/evaluation/validation/eval-file.schema.ts` | Zod schema for new fields |
| `packages/core/src/evaluation/yaml-parser.ts` | Parse conversation fields |
| `packages/core/src/evaluation/orchestrator.ts` | Conversation runner in runEvalCase |
| `packages/core/test/evaluation/conversation-mode.test.ts` | Unit tests |
| `examples/features/multi-turn-conversation-live/` | UAT example |

## References

- Issue: #1052
- Research: agentevals-research PR #57
- Prior art: #505 / PR #507 (scripted multi-turn), #331 / PR #1051 (depends_on)
22 changes: 22 additions & 0 deletions examples/features/multi-turn-conversation-live/README.md
@@ -0,0 +1,22 @@
# Multi-Turn Conversation (Live)

This example demonstrates **live turn-by-turn conversation evaluation**, where the LLM generates each assistant response (unlike `multi-turn-conversation/`, which scripts the intermediate turns).

## Features Shown

- `mode: conversation` — enables live turn-by-turn evaluation
- `turns[]` — each entry is a user message that generates an LLM call
- Per-turn `assertions` — string shorthand (rubric) and structured evaluators
- `aggregation: mean | min | max` — how turn scores combine
- `on_turn_failure: stop | continue` — behavior on assertion failure
- Top-level `assertions` — conversation-level grading after all turns
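
A minimal test using these fields looks roughly like this (abridged from `evals/dataset.eval.yaml` in this example):

```yaml
tests:
  - id: context-retention
    mode: conversation
    aggregation: mean
    input:
      - role: system
        content: You are a helpful math tutor.
    turns:
      - input: What is 15% of 200?
        assertions:
          - Correctly calculates 15% of 200 as 30
      - input: Now double that result.
        assertions:
          - References the previous answer of 30
```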

## Running

```bash
# With default target
bun apps/cli/src/cli.ts eval examples/features/multi-turn-conversation-live/evals/dataset.eval.yaml

# With specific test
bun apps/cli/src/cli.ts eval examples/features/multi-turn-conversation-live/evals/dataset.eval.yaml --test-id context-retention
```
105 changes: 105 additions & 0 deletions examples/features/multi-turn-conversation-live/evals/dataset.eval.yaml
@@ -0,0 +1,105 @@
# Multi-turn conversation evaluation (live turn-by-turn)
# Each turn generates a fresh LLM call; per-turn assertions grade each response.
# This is different from multi-turn-conversation/ which scripts intermediate turns.

description: Live multi-turn conversation evaluation with per-turn grading

execution:
  target: llm

tests:
  # Test 1: Basic context retention across turns
  - id: context-retention
    mode: conversation
    criteria: Agent maintains context and provides relevant responses across turns
    aggregation: mean
    input:
      - role: system
        content: |-
          You are a helpful math tutor. Be concise and accurate.
          Always show your work step by step.
    turns:
      - input: What is 15% of 200?
        assertions:
          - Correctly calculates 15% of 200 as 30
          - Shows the calculation steps
      - input: Now double that result.
        assertions:
          - References the previous answer of 30
          - Correctly calculates double as 60
      - input: What were the original numbers I asked about?
        assertions:
          - Recalls that the user asked about 15% and 200
          - Demonstrates memory of the conversation context

  # Test 2: With aggregation: min (weakest-link scoring)
  - id: weakest-link-scoring
    mode: conversation
    criteria: Agent provides accurate, well-structured responses
    aggregation: min
    input:
      - role: system
        content: You are a concise geography expert. Answer in 1-2 sentences.
    turns:
      - input: What is the capital of France?
        assertions:
          - Correctly identifies Paris as the capital of France
      - input: What country is it in?
        assertions:
          - Recognizes the question refers to Paris from the previous turn
          - Confirms Paris is in France

  # Test 3: With on_turn_failure: stop
  - id: stop-on-failure
    mode: conversation
    on_turn_failure: stop
    criteria: Agent follows instructions precisely
    input:
      - role: system
        content: You are a helpful assistant. Be precise and accurate.
    turns:
      - input: What is 2 + 2?
        assertions:
          - Answers with 4
      - input: Multiply that by 3.
        assertions:
          - References the previous answer
          - Calculates 12 correctly

  # Test 4: Mixed string and structured assertions
  - id: mixed-assertions
    mode: conversation
    criteria: Agent writes correct, well-formed Python code
    input:
      - role: system
        content: You are a helpful coding assistant.
    turns:
      - input: Write a Python function that adds two numbers.
        assertions:
          - Contains a Python function definition
          - type: contains
            value: def
      - input: Now add type hints to the function.
        assertions:
          - Includes type hints (int, float, or similar)
          - type: contains
            value: "->"

  # Test 5: Conversation-level assertions
  - id: conversation-coherence
    mode: conversation
    criteria: Agent maintains a coherent, helpful conversation
    input:
      - role: system
        content: You are a helpful travel advisor. Be concise.
    turns:
      - input: I want to visit somewhere warm in December.
        assertions:
          - Suggests at least one warm destination
      - input: I prefer beaches over cities.
        assertions:
          - Adjusts recommendations toward beach destinations
          - Does not suggest purely urban destinations
    assertions:
      - Agent maintains consistency — later suggestions align with earlier preferences
      - Agent does not contradict its own prior recommendations