Skip to content

fix: preserve multi-turn history in Strands ConversationTurn#454

Merged
tejaskash merged 2 commits intomainfrom
fix/conversation-turn-multi-message-dropping
May 5, 2026
Merged

fix: preserve multi-turn history in Strands ConversationTurn#454
tejaskash merged 2 commits intomainfrom
fix/conversation-turn-multi-message-dropping

Conversation

@tejaskash
Copy link
Copy Markdown
Contributor

Summary

StrandsEventParser.extract_conversation_turn silently drops conversation history when a span carries multiple gen_ai.*.message events — common on every multi-turn agent invocation. This PR preserves the full history, separates prior assistant turns from the current model output, and keeps the existing API working via a deprecation shim.

The bug

Two interacting defects in span_to_adot_serializer/strands_converter.py:

1. User-message collapseextract_conversation_turn stored gen_ai.user.message content in a scalar:

case cls.EVENT_USER_MESSAGE:
    user_message = event_attrs.get("content", "")  # overwrites on every event

A span carrying N gen_ai.user.message events (the conversation history a Strands agent replays to the model) would retain only the last one. For a 9-turn conversation, 8 user turns were silently dropped before the span was ever serialized.

2. Assistant-message miscategorizationgen_ai.assistant.message events and gen_ai.choice events were both .append()ed to the same assistant_messages list:

case cls.EVENT_CHOICE:           # current-turn model output
    ...
    assistant_messages.append(...)
case cls.EVENT_ASSISTANT_MESSAGE:  # prior-turn history input
    ...
    assistant_messages.append(...)

Per OTel GenAI semconv, gen_ai.{system,user,assistant,tool}.message events describe input context (including prior assistant turns the model reads as history), while gen_ai.choice describes the current-turn output. Mixing them means downstream consumers — ADOTDocumentBuilder.build_conversation_log_record, Online Evaluations, Observability dashboards — cannot tell history apart from new output and receive a ConversationTurn that misrepresents what the model actually did.

Reproduction

A span from a 3-turn conversation (3 user msgs, 2 prior assistant msgs, 1 new choice) produced a ConversationTurn with:

Field Expected Actual (before fix)
user content retained 3 turns 1 (newest only)
assistant_messages 1 entry (choice) 3 (2 history + choice)
history_messages 2 entries field did not exist

Fix

ConversationTurn (adot_models.py) adds two fields:

  • user_messages: List[str] — every user turn observed on the span, in order.
  • history_messages: List[Dict] — prior assistant turns from gen_ai.assistant.message events, kept separate from the current-turn output.

user_message is retained as a @property that returns the most recent entry and emits a DeprecationWarning when it truncates more than one turn, so existing callers keep working while being nudged toward user_messages. Constructing ConversationTurn(user_message=...) still works — the initializer accepts either the legacy scalar or the new list.

StrandsEventParser.extract_conversation_turn (strands_converter.py) appends every gen_ai.user.message to user_messages, routes gen_ai.assistant.message to history_messages, and keeps assistant_messages scoped to gen_ai.choice output only.

ADOTDocumentBuilder.build_conversation_log_record (adot_models.py) emits all user turns followed by all history assistant turns as input.messages, so the serialized ADOT record carries the full context the model actually received.

Tests

  • Updated test_extract_conversation_turn_assistant_messagegen_ai.assistant.message now lands in history_messages, not assistant_messages.
  • New test_extract_conversation_turn_preserves_all_user_messages — multiple gen_ai.user.message events all retained.
  • New test_extract_conversation_turn_separates_history_from_choice — the end-to-end multi-turn case from the bug report.
  • New ConversationTurn dataclass tests: legacy scalar init, list init, deprecation-warning behavior, rejection of both-at-once, history preservation.
  • New test_build_conversation_log_record_multi_turn_history — builder emits [user, user, user, assistant-history, assistant-history] on input and single-entry output.

Full suite: 1726 passed, 0 failed. Serializer module: 41/41. ruff, ruff format, and pre-commit clean.

Backwards compatibility

  • ConversationTurn(user_message=...) constructor form still works.
  • turn.user_message still returns a string (the most recent user turn).
  • ADOT wire format body.input.messages / body.output.messages shape is unchanged; input.messages just contains more entries on multi-turn spans.
  • Consumers that read assistant_messages now get only the current-turn output. If a caller previously relied on the buggy behavior where history messages showed up in assistant_messages, they should migrate to history_messages. This is the intended correctness change.

Related

There is a companion bug with the same failure mode in aws-opentelemetry-distro's LLOHandler.process_spans (llo_handler.py:329 dedupes events by name; llo_handler.py:148-152 tags gen_ai.assistant.message with source: "output"). That needs a separate fix in the aws-observability/aws-otel-python-instrumentation repo; this PR only addresses the AgentCore SDK side.

Test plan

  • Existing serializer tests pass unchanged (where correct) or updated (where they asserted the buggy behavior)
  • New regression tests cover multi-turn user, multi-turn history, and mixed-role preservation
  • Full tests/ suite passes (1726 passed)
  • ruff check / ruff format / pre-commit hooks clean
  • End-to-end repro shows user_messages holds all turns and assistant_messages holds only the current gen_ai.choice

StrandsEventParser.extract_conversation_turn previously stored the
gen_ai.user.message content in a scalar, so spans carrying multiple
user-message events (common in multi-turn agent conversations) dropped
all but the newest entry. It also appended gen_ai.assistant.message
events — which per OTel GenAI semconv are prior-turn input context —
into the same list as gen_ai.choice outputs, mixing history with the
current model response.

ConversationTurn now exposes user_messages (List[str]) and
history_messages (List[Dict]) alongside the existing assistant_messages
and tool_results. user_message is retained as a backwards-compatible
property that returns the most recent user turn and emits a
DeprecationWarning when it truncates. ADOTDocumentBuilder now emits
every user turn plus all prior assistant turns as input.messages, so
downstream consumers (Online Evaluations, Observability dashboards)
see the full context the model received.

Refs: V2197498381. Companion fix needed upstream in
aws-opentelemetry-distro LLOHandler, which has the same class of bug.
@tejaskash tejaskash requested a review from a team May 5, 2026 15:26
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

⚠️ Breaking Change Warning

Found 3 potential breaking change(s) in this PR:

�[1msrc/bedrock_agentcore/evaluation/span_to_adot_serializer/adot_models.py�[0m:88: ConversationTurn.assistant_messages: �[33mAttribute value was changed�[39m: None -> list(assistant_messages or [])
�[1msrc/bedrock_agentcore/evaluation/span_to_adot_serializer/adot_models.py�[0m:89: ConversationTurn.tool_results: �[33mAttribute value was changed�[39m: None -> list(tool_results or [])
�[1msrc/bedrock_agentcore/evaluation/span_to_adot_serializer/adot_models.py�[0m:65: ConversationTurn.init: �[33mReturn types are incompatible�[39m: None -> None


Note: This is an automated static analysis check. Some flagged changes may be intentional.
Please confirm each item is expected and, if so, add a migration note to CHANGELOG.md.

@tejaskash
Copy link
Copy Markdown
Contributor Author

Thanks, bot. Walking through the three flags:

  1. ConversationTurn.assistant_messages: None -> list(assistant_messages) if assistant_messages is not None else [] — false positive. This is a signature diff, not a value change. The class moved from an auto-generated @dataclass __init__ (where the field annotation was List[Dict[str, Any]] with no default, so callers had to pass a value) to a custom __init__ that defaults missing args to []. Existing callers that passed assistant_messages=... get identical behavior; callers that omitted it previously raised TypeError and now get an empty list. Not a breaking change.

  2. ConversationTurn.tool_results: None -> list(tool_results) if tool_results is not None else [] — same as above. Signature change from required-positional to defaulted, non-breaking.

  3. ConversationTurn.__init__: Return types are incompatible: None -> None — false positive, the bot is diffing an auto-synthesized dataclass __init__ against the explicit one. Both return None.

The checker did miss the one intentional behavior change worth calling out: assistant_messages no longer contains prior gen_ai.assistant.message turns — those move to the new history_messages field. This is the bug fix (per OTel GenAI semconv, those events are input context, not current-turn output). Callers that iterated assistant_messages expecting both should either read both fields, or read body.input.messages from build_conversation_log_record, which now carries the full history. The legacy user_message scalar accessor and constructor form continue to work via a deprecation shim.

aidandaly24
aidandaly24 previously approved these changes May 5, 2026
Copy link
Copy Markdown
Contributor

@aidandaly24 aidandaly24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean fix for a real data-loss bug. Backwards compatibility is well handled, tests are thorough. Two minor nits inline — neither is blocking.

tool_results.append(content)

if user_message and assistant_messages:
if user_messages and (assistant_messages or history_messages):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The condition changed from user_message and assistant_messages to user_messages and (assistant_messages or history_messages). This means a span with user_messages + tool_results but no assistant_messages or history_messages now returns None — which could drop valid tool-use-only turns that previously returned a ConversationTurn (when history messages were incorrectly bucketed into assistant_messages).

Might be worth adding a test that confirms the intended behavior for a user-message + tool-result-only span (no choice, no history), and deciding whether tool_results should also satisfy this guard.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ agree

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — gate is back to requiring assistant_messages. See the reply in the parent thread for details.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added test_extract_conversation_turn_tool_results_only_returns_none for your suggested case. Quick verification of the "previously returned ConversationTurn" claim: the old guard was user_message and assistant_messages, and tool_results were never in the guard — so a user+tool_result-only span with no assistant events returned None in the old code too. No regression. The gate was broadened to user_messages and (assistant_messages or history_messages) in the first pass of this PR, but per jariys comment it is now tightened back to require assistant_messages. Whether tool-only turns should produce a ConversationTurn is a separate feature decision for the Evaluations team; out of scope for this bugfix.

input_messages: List[Dict[str, Any]] = [
{"content": {"content": user_msg}, "role": "user"} for user_msg in conversation.user_messages
]
input_messages.extend(conversation.history_messages)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This serializes input as [user, user, user, assistant-history, assistant-history] rather than preserving the original interleaving ([user, assistant, user, assistant, user]). The test asserts this flat order so it seems intentional — but worth confirming with the Observability/Evaluations team whether downstream consumers expect role-grouped input or chronological interleaving. If they ever need to reconstruct the actual conversation flow, this ordering makes it impossible.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with chronological. ConversationTurn.input_messages now interleaves user and assistant events in arrival order — for [u1, a1, u2, a2, u3] downstream sees [user, assistant, user, assistant, user], matching the real conversation the model received. Test asserts that order.

Comment thread src/bedrock_agentcore/evaluation/span_to_adot_serializer/adot_models.py Outdated
returning the most recent user turn; new code should read ``user_messages``.
"""

user_messages: List[str] = field(default_factory=list)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These four field(default_factory=list) declarations are dead weight — the custom __init__ below sets the instance vars directly, so these are never used. Delete them.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gone. The new class is a plain class with explicit attribute assignment in __init__.

Comment thread src/bedrock_agentcore/evaluation/span_to_adot_serializer/adot_models.py Outdated
Comment thread src/bedrock_agentcore/evaluation/span_to_adot_serializer/adot_models.py Outdated
case cls.EVENT_USER_MESSAGE:
user_message = event_attrs.get("content", "")
content = event_attrs.get("content", "")
if content:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if content: silently drops empty events. The old scalar path had different behavior — user_message = "" would fail the outer if user_message gate and reject the whole turn. Now empty events are skipped individually. If the intent is to drop empty content, add a log line so it's visible.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added logger.debug(...) on all three empty-content skip paths (user, assistant, tool). The outer-gate semantics are also restored — a parse that yields only an empty user event now fails the gate (has_user_input requires a non-empty input_messages entry with role=user), matching the old if user_message behavior.

Comment thread src/bedrock_agentcore/evaluation/span_to_adot_serializer/strands_converter.py Outdated

def test_user_message_alias_returns_latest(self):
"""user_message returns the latest entry; accessing it with history emits a DeprecationWarning."""
import warnings
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import warnings inside the test body — move it to the module-level imports. Same issue at line ~199 with import pytest, which is already imported at line 5.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hoisted import warnings to the module-level imports. import pytest was already imported at line 5 — the duplicate inside the test body is gone.

Comment thread tests/bedrock_agentcore/evaluation/span_to_adot_serializer/test_adot_models.py Outdated
@@ -129,7 +129,43 @@ def test_extract_conversation_turn_assistant_message(self, mock_event):
turn = StrandsEventParser.extract_conversation_turn(events)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a test for the no-choice path here — [user.message, assistant.message] with no gen_ai.choice event. The broadened gate enables it now but there's no coverage for what output.messages looks like in that case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added test_extract_conversation_turn_assistant_message_only_returns_none. Since the gate now requires assistant_messages, the no-choice path returns None and never reaches build_conversation_log_record.

@jariy17
Copy link
Copy Markdown
Contributor

jariy17 commented May 5, 2026

⚠️ Breaking Change Warning

Found 3 potential breaking change(s) in this PR:

�[1msrc/bedrock_agentcore/evaluation/span_to_adot_serializer/adot_models.py�[0m:91: ConversationTurn.assistant_messages: �[33mAttribute value was changed�[39m: None -> list(assistant_messages) if assistant_messages is not None else [] �[1msrc/bedrock_agentcore/evaluation/span_to_adot_serializer/adot_models.py�[0m:92: ConversationTurn.tool_results: �[33mAttribute value was changed�[39m: None -> list(tool_results) if tool_results is not None else [] �[1msrc/bedrock_agentcore/evaluation/span_to_adot_serializer/adot_models.py�[0m:70: ConversationTurn.init: �[33mReturn types are incompatible�[39m: None -> None

Note: This is an automated static analysis check. Some flagged changes may be intentional.
Please confirm each item is expected and, if so, add a migration note to CHANGELOG.md.

look into these too

…model

Addresses review comments from jariy17, aidandaly24, notgitika on #454.

Data model: replace `user_messages: List[str]` + `history_messages: List[Dict]`
split with a single `input_messages: List[Dict]` field that preserves the
chronological arrival order of `gen_ai.user.message` and
`gen_ai.assistant.message` events. For a span `[u1, a1, u2, a2, u3, choice]`,
downstream now receives `[user, assistant, user, assistant, user]` so the
actual conversation flow is reconstructable, per OTel GenAI semconv.

ConversationTurn: drop @DataClass decorator (it was conflicting with the
custom __init__ and leaving the class-level field defaults dead), write
explicit __init__, __repr__, __eq__ so equality compares actual instance
state. `user_message: str` property retained for backwards compatibility,
now returns the most recent user entry from `input_messages` and warns when
history is present.

Parser: gate tightened back to require `assistant_messages` so a span with
no `gen_ai.choice` does not emit an empty-output ConversationTurn. Empty
event content now logs at debug level instead of silently skipping.

Tests: move duplicated fixtures into a shared conftest.py; hoist
`import warnings` / `import pytest` to module-level; update the multi-turn
regression test to assert chronological ordering rather than role-grouped;
add regression tests for no-choice-path and user+tool_result-only-path
returning None.
@tejaskash tejaskash merged commit 10da345 into main May 5, 2026
41 of 42 checks passed
@tejaskash tejaskash deleted the fix/conversation-turn-multi-message-dropping branch May 5, 2026 20:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants