55 changes: 55 additions & 0 deletions docs/plans/1052-conversation-mode.md
@@ -0,0 +1,55 @@
# Issue #1052: Multi-turn Conversational Test Case — Live Turn-by-Turn Evaluation

## Problem

Today, multi-turn evals must script every intermediate assistant response in `input`; the LLM generates only the final response. As a result, conversation context retention, progressive reasoning, and turn-by-turn response quality cannot be measured independently.

## Solution

Add `mode: conversation` with a `turns` array that drives turn-by-turn LLM evaluation with per-turn and conversation-level grading.

### New Schema Fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `mode` | `'conversation'` | - | Enables conversation evaluation mode |
| `turns` | `ConversationTurn[]` | - | Ordered user messages; each generates an LLM call |
| `aggregation` | `'mean' \| 'min' \| 'max'` | `'mean'` | How turn scores combine into final score |
| `on_turn_failure` | `'continue' \| 'stop'` | `'continue'` | What to do when a turn's assertions fail |
| `window_size` | `number` | all turns | Sliding window for context passed to graders |
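
A sketch of how these fields might land on `EvalTest` in `types.ts` (the field names and defaults come from the table above; the interface shapes and the `withDefaults` helper are illustrative assumptions, not the shipped code):

```typescript
// Illustrative sketch only: field names and defaults are from the design
// table; everything else here is hypothetical.
interface ConversationTurn {
  input: string;            // non-empty user message for this turn
  assertions?: string[];    // per-turn graders (string-shorthand rubrics)
  expected_output?: string; // optional per-turn reference answer
}

interface ConversationTest {
  mode: 'conversation';
  turns: ConversationTurn[];
  aggregation?: 'mean' | 'min' | 'max';   // how turn scores combine
  on_turn_failure?: 'continue' | 'stop';  // behavior when a turn's assertions fail
  window_size?: number;                   // grader context window; default is all turns
}

// Apply the documented defaults.
function withDefaults(t: ConversationTest) {
  return {
    ...t,
    aggregation: t.aggregation ?? 'mean',
    on_turn_failure: t.on_turn_failure ?? 'continue',
  };
}
```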

### How It Works

1. `input` provides the system prompt and initial context (same as today).
2. For each entry in `turns`:
   a. Append the user message to the accumulated history.
   b. Call the provider with the full history; the LLM generates the assistant response.
   c. Grade the response against the turn's `assertions` and `expected_output`.
   d. Append the actual LLM response (not `expected_output`) to the history.
3. After all turns, run the top-level `assertions` over the full transcript.
4. Final score = aggregation of the per-turn and conversation-level assertion scores.
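
Sketched in TypeScript, the loop looks roughly like this (`callProvider` and `grade` are stand-ins for the real provider and grader plumbing, and the conversation-level assertions from step 3 are omitted for brevity):

```typescript
type Msg = { role: 'system' | 'user' | 'assistant'; content: string };

// Simplified illustration of the turn loop inside runEvalCase.
async function runConversation(
  history: Msg[],                                   // seeded from `input`
  turns: { input: string; assertions?: string[] }[],
  callProvider: (h: Msg[]) => Promise<string>,
  grade: (response: string, assertions: string[]) => Promise<number>,
  aggregation: 'mean' | 'min' | 'max' = 'mean',
  onTurnFailure: 'continue' | 'stop' = 'continue',
): Promise<number> {
  const scores: number[] = [];
  for (const turn of turns) {
    history.push({ role: 'user', content: turn.input });        // step (a)
    const response = await callProvider(history);               // step (b)
    const score = await grade(response, turn.assertions ?? []); // step (c)
    scores.push(score);
    // Step (d): the actual LLM response goes into history, not expected_output.
    history.push({ role: 'assistant', content: response });
    if (score < 1 && onTurnFailure === 'stop') break;
  }
  if (aggregation === 'min') return Math.min(...scores);
  if (aggregation === 'max') return Math.max(...scores);
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}
```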

### Validation Rules

- `turns` requires `mode: conversation`
- `mode: conversation` requires `turns`
- `turns` incompatible with top-level `expected_output`
- `aggregation` only valid with `mode: conversation`
- Each turn must have non-empty `input`
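
The same rules, written as plain guards rather than the Zod schema that actually enforces them in `eval-file.schema.ts` (a simplified sketch):

```typescript
// Mirrors the validation rules above as standalone checks; the real
// implementation lives in the Zod schema.
function validateConversationFields(test: {
  mode?: string;
  turns?: { input?: string }[];
  aggregation?: string;
  expected_output?: string;
}): string[] {
  const errors: string[] = [];
  if (test.turns && test.mode !== 'conversation')
    errors.push('`turns` requires `mode: conversation`');
  if (test.mode === 'conversation' && !test.turns?.length)
    errors.push('`mode: conversation` requires `turns`');
  if (test.turns && test.expected_output !== undefined)
    errors.push('`turns` is incompatible with top-level `expected_output`');
  if (test.aggregation && test.mode !== 'conversation')
    errors.push('`aggregation` is only valid with `mode: conversation`');
  for (const [i, turn] of (test.turns ?? []).entries())
    if (!turn.input?.trim()) errors.push(`turns[${i}] must have non-empty input`);
  return errors;
}
```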

### Files Modified

| File | Change |
|------|--------|
| `packages/core/src/evaluation/types.ts` | ConversationTurn, mode, turns, etc. on EvalTest |
| `packages/core/src/evaluation/validation/eval-file.schema.ts` | Zod schema for new fields |
| `packages/core/src/evaluation/yaml-parser.ts` | Parse conversation fields |
| `packages/core/src/evaluation/orchestrator.ts` | Conversation runner in runEvalCase |
| `packages/core/test/evaluation/conversation-mode.test.ts` | Unit tests |
| `examples/features/multi-turn-conversation-live/` | UAT example |

## References

- Issue: #1052
- Research: agentevals-research PR #57
- Prior art: #505 / PR #507 (scripted multi-turn), #331 / PR #1051 (depends_on)
22 changes: 22 additions & 0 deletions examples/features/multi-turn-conversation-live/README.md
@@ -0,0 +1,22 @@
# Multi-Turn Conversation (Live)

This example demonstrates **live turn-by-turn conversation evaluation**, where the LLM generates each assistant response (unlike `multi-turn-conversation/`, which scripts the intermediate turns).

## Features Shown

- `mode: conversation` — enables live turn-by-turn evaluation
- `turns[]` — each entry is a user message that generates an LLM call
- Per-turn `assertions` — string shorthand (rubric) and structured evaluators
- `aggregation: mean | min | max` — how turn scores combine
- `on_turn_failure: stop | continue` — behavior on assertion failure
- Top-level `assertions` — conversation-level grading after all turns
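
A minimal test using these fields looks roughly like this (abridged from `evals/dataset.eval.yaml` in this example):

```yaml
tests:
  - id: context-retention
    mode: conversation
    aggregation: mean
    input:
      - role: system
        content: You are a helpful math tutor.
    turns:
      - input: What is 15% of 200?
        assertions:
          - Correctly calculates 15% of 200 as 30
      - input: Now double that result.
        assertions:
          - References the previous answer of 30
```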

## Running

```bash
# With default target
bun apps/cli/src/cli.ts eval examples/features/multi-turn-conversation-live/evals/dataset.eval.yaml

# With specific test
bun apps/cli/src/cli.ts eval examples/features/multi-turn-conversation-live/evals/dataset.eval.yaml --test-id context-retention
```
105 changes: 105 additions & 0 deletions examples/features/multi-turn-conversation-live/evals/dataset.eval.yaml
@@ -0,0 +1,105 @@
# Multi-turn conversation evaluation (live turn-by-turn)
# Each turn generates a fresh LLM call; per-turn assertions grade each response.
# This is different from multi-turn-conversation/ which scripts intermediate turns.

description: Live multi-turn conversation evaluation with per-turn grading

execution:
  target: llm

tests:
  # Test 1: Basic context retention across turns
  - id: context-retention
    mode: conversation
    criteria: Agent maintains context and provides relevant responses across turns
    aggregation: mean
    input:
      - role: system
        content: |-
          You are a helpful math tutor. Be concise and accurate.
          Always show your work step by step.
    turns:
      - input: What is 15% of 200?
        assertions:
          - Correctly calculates 15% of 200 as 30
          - Shows the calculation steps
      - input: Now double that result.
        assertions:
          - References the previous answer of 30
          - Correctly calculates double as 60
      - input: What were the original numbers I asked about?
        assertions:
          - Recalls that the user asked about 15% and 200
          - Demonstrates memory of the conversation context

  # Test 2: With aggregation: min (weakest-link scoring)
  - id: weakest-link-scoring
    mode: conversation
    criteria: Agent provides accurate, well-structured responses
    aggregation: min
    input:
      - role: system
        content: You are a concise geography expert. Answer in 1-2 sentences.
    turns:
      - input: What is the capital of France?
        assertions:
          - Correctly identifies Paris as the capital of France
      - input: What country is it in?
        assertions:
          - Recognizes the question refers to Paris from the previous turn
          - Confirms Paris is in France

  # Test 3: With on_turn_failure: stop
  - id: stop-on-failure
    mode: conversation
    on_turn_failure: stop
    criteria: Agent follows instructions precisely
    input:
      - role: system
        content: You are a helpful assistant. Be precise and accurate.
    turns:
      - input: What is 2 + 2?
        assertions:
          - Answers with 4
      - input: Multiply that by 3.
        assertions:
          - References the previous answer
          - Calculates 12 correctly

  # Test 4: Mixed string and structured assertions
  - id: mixed-assertions
    mode: conversation
    criteria: Agent writes correct, well-formed Python code
    input:
      - role: system
        content: You are a helpful coding assistant.
    turns:
      - input: Write a Python function that adds two numbers.
        assertions:
          - Contains a Python function definition
          - type: contains
            value: def
      - input: Now add type hints to the function.
        assertions:
          - Includes type hints (int, float, or similar)
          - type: contains
            value: "->"

  # Test 5: Conversation-level assertions
  - id: conversation-coherence
    mode: conversation
    criteria: Agent maintains a coherent, helpful conversation
    input:
      - role: system
        content: You are a helpful travel advisor. Be concise.
    turns:
      - input: I want to visit somewhere warm in December.
        assertions:
          - Suggests at least one warm destination
      - input: I prefer beaches over cities.
        assertions:
          - Adjusts recommendations toward beach destinations
          - Does not suggest purely urban destinations
    assertions:
      - Agent maintains consistency — later suggestions align with earlier preferences
      - Agent does not contradict its own prior recommendations