feat(eval): multi-turn conversation mode with turn-by-turn evaluation #1054

Merged
christso merged 4 commits into main from feat/1052-conversation-mode on Apr 12, 2026

Conversation

@christso
Collaborator

Closes #1052

Summary

Adds `mode: conversation` with a `turns` array for live, turn-by-turn LLM evaluation. Each turn is graded on its own, and the conversation as a whole is graded once all turns have run.

Changes

  • Types: ConversationTurn, ConversationMode, ConversationAggregation, TurnFailurePolicy
  • Zod schema updates for new YAML fields
  • YAML parser support for conversation turns
  • Conversation runner in orchestrator with turn-by-turn provider calls
  • Score aggregation (mean/min/max), on_turn_failure (continue/stop), window_size
  • Cross-field validation rules
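
The new fields and the cross-field rules listed above could be sketched roughly as follows. This is an illustration, not the code from this PR: only the field names (`mode`, `turns`, `aggregation`, `on_turn_failure`, `window_size`) and the rules "turns requires mode: conversation" and "reject whitespace-only and empty-array turn inputs" come from the PR text. The shape of `ConversationTurn`, the `EvalTestSketch` interface, the `validateConversationFields` helper, and the "at least one turn" rule are assumptions, and plain TypeScript stands in for the Zod schema.

```typescript
// Hypothetical shapes: only the field names come from the PR description.
type ConversationAggregation = "mean" | "min" | "max";
type TurnFailurePolicy = "continue" | "stop";

interface ConversationTurn {
  // Assumed: a turn carries user input as a string or a list of message parts.
  input: string | unknown[];
  // Assumed: optional per-turn assertions, with string shorthand allowed.
  assertions?: Array<string | Record<string, unknown>>;
}

interface EvalTestSketch {
  id: string;
  mode?: "conversation";
  turns?: ConversationTurn[];
  aggregation?: ConversationAggregation;
  on_turn_failure?: TurnFailurePolicy;
  window_size?: number;
}

// Returns human-readable validation errors; an empty array means valid.
function validateConversationFields(test: EvalTestSketch): string[] {
  const errors: string[] = [];
  const isConversation = test.mode === "conversation";

  // Rule named in the PR: turns requires mode: conversation.
  if (test.turns !== undefined && !isConversation) {
    errors.push("turns requires mode: conversation");
  }
  // Assumed rule: conversation mode needs at least one turn.
  if (isConversation && (test.turns === undefined || test.turns.length === 0)) {
    errors.push("mode: conversation requires at least one turn");
  }
  // Rule named in the PR: reject whitespace-only and empty-array inputs.
  (test.turns ?? []).forEach((turn, i) => {
    const input = turn.input;
    const isEmpty =
      typeof input === "string" ? input.trim().length === 0 : input.length === 0;
    if (isEmpty) {
      errors.push(`turns[${i}].input must not be empty`);
    }
  });
  return errors;
}
```

Returning a list of errors rather than throwing on the first one lets a validator report every problem in a test case in a single pass.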

Verification

  • bun run typecheck
  • bun run build
  • bun run test ✅ (all 1944 tests pass)

…tion

Implements issue #1052: support for evaluating multi-turn conversations
where the agent generates each assistant turn with per-turn grading.

- Add ConversationTurn type, mode, turns, aggregation, on_turn_failure, window_size to EvalTest
- Zod schema and YAML parser updates for new fields
- Turn-by-turn loop in orchestrator: accumulate messages, call provider, grade, repeat
- Conversation assertions run after all turns
- Aggregation: mean (default), min (weakest-link), max
- String shorthand in per-turn assertions works identically to top-level
- Cross-field validation (turns requires mode:conversation, etc.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
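
A minimal sketch of the loop this commit describes (accumulate messages, call provider, grade, repeat), followed by score aggregation. It is not the orchestrator code from the PR. `Provider`, `Grader`, `runConversation`, `aggregateScores`, and the `passThreshold` option are hypothetical names, the callbacks are synchronous to keep the example short (a real provider call would be async), and `window_size` is left out because the PR text does not describe its semantics. Scoring a turn without assertions as 1.0 follows the later review-fix commit in this PR.

```typescript
type Aggregation = "mean" | "min" | "max";
type FailurePolicy = "continue" | "stop";

interface Message { role: "user" | "assistant"; content: string; }
interface Turn { input: string; assertions?: string[]; }

// Hypothetical callbacks standing in for the real provider and grader.
type Provider = (history: Message[]) => string;
type Grader = (reply: string, assertions: string[]) => number; // score in [0, 1]

interface RunOptions {
  aggregation?: Aggregation;      // defaults to "mean", as in the PR
  onTurnFailure?: FailurePolicy;  // default "continue" is an assumption
  passThreshold?: number;         // assumed: scores below this count as a failure
}

function aggregateScores(scores: number[], mode: Aggregation = "mean"): number {
  if (scores.length === 0) return 0; // assumed convention when no turn ran
  if (mode === "min") return Math.min(...scores); // weakest-link
  if (mode === "max") return Math.max(...scores);
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

function runConversation(
  turns: Turn[],
  provider: Provider,
  grader: Grader,
  opts: RunOptions = {},
): { history: Message[]; turnScores: number[]; score: number } {
  const { aggregation = "mean", onTurnFailure = "continue", passThreshold = 0.5 } = opts;
  const history: Message[] = [];
  const turnScores: number[] = [];

  for (const turn of turns) {
    // Accumulate the user message, then ask the provider for the next reply.
    history.push({ role: "user", content: turn.input });
    const reply = provider(history);
    history.push({ role: "assistant", content: reply });

    // Turns without their own assertions score 1.0.
    const assertions = turn.assertions ?? [];
    const score = assertions.length > 0 ? grader(reply, assertions) : 1.0;
    turnScores.push(score);

    // With the "stop" policy, a failing turn ends the conversation early.
    if (score < passThreshold && onTurnFailure === "stop") break;
  }

  return { history, turnScores, score: aggregateScores(turnScores, aggregation) };
}
```

With `min`, one failed turn pulls the whole conversation score down to that turn's score, which is why the commit message calls it weakest-link; `mean` spreads the penalty across all turns that ran.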
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Apr 12, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 8affcd0
Status: ✅  Deploy successful!
Preview URL: https://9530562a.agentv.pages.dev
Branch Preview URL: https://feat-1052-conversation-mode.agentv.pages.dev

christso and others added 3 commits April 12, 2026 10:36
Adds examples/features/multi-turn-conversation-live/ with 5 test cases
exercising conversation mode features: context retention, aggregation
modes, on_turn_failure, mixed assertions, and conversation-level assertions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Tests for conversation-mode orchestrator, validation rules,
and score aggregation (mean/min/max).

Also fixes buildTurnAssertions to emit type: 'llm-grader' with rubrics
instead of type: 'rubrics' (which is not registered in the builtin registry).
The evaluator-parser uses the same pattern for YAML-sourced rubrics.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- YAML loader: include `turns` in completeness gate so conversation-only
  cases (no top-level criteria/assertions) are not silently skipped
- Orchestrator: stop falling back to evalCase.assertions per-turn — turns
  without own assertions score 1.0 instead of double-counting top-level
- Orchestrator: pass full transcript as candidate for conversation-level
  grading instead of only the last assistant reply
- Orchestrator: serialize structured message content with JSON.stringify
  instead of producing [object Object] in transcript strings
- Validator: reject whitespace-only and empty-array turn inputs
- Tests: add regression coverage for double-counting, transcript candidate,
  and whitespace input validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
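
The transcript fixes in this commit (pass the full transcript as the grading candidate, and serialize structured content with JSON.stringify rather than emitting [object Object]) might look something like the sketch below. `TranscriptMessage`, `contentToText`, `formatTranscript`, and the `role: content` line format are assumptions made for illustration; the PR text does not show the real helper.

```typescript
// Hypothetical message shape: content may be a string or structured parts.
interface TranscriptMessage {
  role: string;
  content: unknown;
}

function contentToText(content: unknown): string {
  if (typeof content === "string") return content;
  // Template-literal or String() coercion of a plain object yields
  // "[object Object]"; JSON.stringify keeps the structure readable.
  const json: string | undefined = JSON.stringify(content);
  // JSON.stringify returns undefined at runtime for inputs like undefined.
  return json === undefined ? String(content) : json;
}

// Joins every message, not just the last assistant reply, into one string
// that can serve as the candidate for conversation-level grading.
function formatTranscript(messages: TranscriptMessage[]): string {
  return messages.map((m) => `${m.role}: ${contentToText(m.content)}`).join("\n");
}
```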
@christso christso marked this pull request as ready for review April 12, 2026 12:48
@christso christso merged commit bdcd007 into main Apr 12, 2026
4 checks passed
@christso christso deleted the feat/1052-conversation-mode branch April 12, 2026 12:48
Development

Successfully merging this pull request may close these issues.

feat(eval): multi-turn conversational test case — live turn-by-turn evaluation
