fix: preserve multi-turn history in Strands ConversationTurn #454
`StrandsEventParser.extract_conversation_turn` previously stored the `gen_ai.user.message` content in a scalar, so spans carrying multiple user-message events (common in multi-turn agent conversations) dropped all but the newest entry. It also appended `gen_ai.assistant.message` events — which per OTel GenAI semconv are prior-turn input context — into the same list as `gen_ai.choice` outputs, mixing history with the current model response.

`ConversationTurn` now exposes `user_messages` (`List[str]`) and `history_messages` (`List[Dict]`) alongside the existing `assistant_messages` and `tool_results`. `user_message` is retained as a backwards-compatible property that returns the most recent user turn and emits a `DeprecationWarning` when it truncates.

`ADOTDocumentBuilder` now emits every user turn plus all prior assistant turns as `input.messages`, so downstream consumers (Online Evaluations, Observability dashboards) see the full context the model received.

Refs: V2197498381. A companion fix is needed upstream in aws-opentelemetry-distro `LLOHandler`, which has the same class of bug.
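A minimal sketch of the backwards-compatible `user_message` property described above, assuming the field names from this PR (the real class carries more fields and validation than shown here):

```python
import warnings
from typing import Dict, List, Optional


class ConversationTurn:
    """Simplified sketch -- the real class also holds assistant_messages and tool_results."""

    def __init__(self, user_messages: Optional[List[str]] = None,
                 history_messages: Optional[List[Dict]] = None):
        self.user_messages: List[str] = list(user_messages or [])
        self.history_messages: List[Dict] = list(history_messages or [])

    @property
    def user_message(self) -> str:
        # Backwards-compatible alias: returns the newest user turn and
        # warns when it is silently truncating older turns away.
        if len(self.user_messages) > 1:
            warnings.warn(
                f"user_message returns only the latest of {len(self.user_messages)} "
                "user turns; new code should read user_messages.",
                DeprecationWarning,
                stacklevel=2,
            )
        return self.user_messages[-1] if self.user_messages else ""


turn = ConversationTurn(user_messages=["hi", "and my second question"])
```

Existing callers of `turn.user_message` keep working; only multi-turn spans trigger the warning.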
Thanks, bot. Walking through the three flags:
The checker did miss the one intentional behavior change worth calling out:
aidandaly24 left a comment:
Clean fix for a real data-loss bug. Backwards compatibility is well handled, tests are thorough. Two minor nits inline — neither is blocking.
```diff
     tool_results.append(content)

-    if user_message and assistant_messages:
+    if user_messages and (assistant_messages or history_messages):
```
nit: The condition changed from `user_message and assistant_messages` to `user_messages and (assistant_messages or history_messages)`. This means a span with `user_messages` + `tool_results` but no `assistant_messages` or `history_messages` now returns `None` — which could drop valid tool-use-only turns that previously returned a `ConversationTurn` (when history messages were incorrectly bucketed into `assistant_messages`).
Might be worth adding a test that confirms the intended behavior for a user-message + tool-result-only span (no choice, no history), and deciding whether `tool_results` should also satisfy this guard.
Fixed — the gate is back to requiring `assistant_messages`. See the reply in the parent thread for details.
Added `test_extract_conversation_turn_tool_results_only_returns_none` for your suggested case. Quick verification of the "previously returned ConversationTurn" claim: the old guard was `user_message and assistant_messages`, and `tool_results` were never in the guard — so a user+tool_result-only span with no assistant events returned `None` in the old code too. No regression. The gate was broadened to `user_messages and (assistant_messages or history_messages)` in the first pass of this PR, but per jariy17's comment it is now tightened back to require `assistant_messages`. Whether tool-only turns should produce a `ConversationTurn` is a separate feature decision for the Evaluations team; out of scope for this bugfix.
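The gate behavior under discussion can be sketched as follows — `gate_turn` is an illustrative stand-in for the guard inside the parser, not the SDK's actual function:

```python
from typing import Dict, List, Optional


def gate_turn(user_messages: List[str],
              assistant_messages: List[Dict],
              tool_results: List[Dict]) -> Optional[dict]:
    # Tightened gate: a turn is only emitted when the span produced actual
    # model output (a gen_ai.choice event). tool_results alone never
    # satisfied the old guard either, so rejecting it is not a regression.
    if user_messages and assistant_messages:
        return {"input": user_messages, "output": assistant_messages}
    return None


# user + tool_result only, no choice: rejected, as in the old code
assert gate_turn(["run the tool"], [], [{"tool": "result"}]) is None
```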
```python
input_messages: List[Dict[str, Any]] = [
    {"content": {"content": user_msg}, "role": "user"} for user_msg in conversation.user_messages
]
input_messages.extend(conversation.history_messages)
```
nit: This serializes input as `[user, user, user, assistant-history, assistant-history]` rather than preserving the original interleaving (`[user, assistant, user, assistant, user]`). The test asserts this flat order, so it seems intentional — but worth confirming with the Observability/Evaluations team whether downstream consumers expect role-grouped input or chronological interleaving. If they ever need to reconstruct the actual conversation flow, this ordering makes it impossible.
Went with chronological. `ConversationTurn.input_messages` now interleaves user and assistant events in arrival order — for `[u1, a1, u2, a2, u3]` downstream sees `[user, assistant, user, assistant, user]`, matching the real conversation the model received. The test asserts that order.
```python
    returning the most recent user turn; new code should read ``user_messages``.
    """

    user_messages: List[str] = field(default_factory=list)
```
These four `field(default_factory=list)` declarations are dead weight — the custom `__init__` below sets the instance vars directly, so these are never used. Delete them.
Gone. The new class is a plain class with explicit attribute assignment in `__init__`.
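A rough sketch of the resulting plain-class shape — field names are illustrative and the real class carries more state than shown:

```python
class ConversationTurn:
    """Plain class: explicit __init__ instead of @dataclass, so there are
    no unused class-level field(default_factory=...) declarations left behind."""

    def __init__(self, input_messages=None, assistant_messages=None, tool_results=None):
        self.input_messages = list(input_messages or [])
        self.assistant_messages = list(assistant_messages or [])
        self.tool_results = list(tool_results or [])

    def __repr__(self):
        return (f"ConversationTurn(input_messages={self.input_messages!r}, "
                f"assistant_messages={self.assistant_messages!r}, "
                f"tool_results={self.tool_results!r})")

    def __eq__(self, other):
        # Equality compares actual instance state, not class-level defaults.
        if not isinstance(other, ConversationTurn):
            return NotImplemented
        return (self.input_messages == other.input_messages
                and self.assistant_messages == other.assistant_messages
                and self.tool_results == other.tool_results)
```

Writing `__repr__` and `__eq__` by hand is the cost of dropping `@dataclass`; the benefit is that `__init__` can accept either the legacy scalar or the new list without fighting generated code.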
```diff
     case cls.EVENT_USER_MESSAGE:
-        user_message = event_attrs.get("content", "")
+        content = event_attrs.get("content", "")
+        if content:
```
`if content:` silently drops empty events. The old scalar path had different behavior — `user_message = ""` would fail the outer `if user_message` gate and reject the whole turn. Now empty events are skipped individually. If the intent is to drop empty content, add a log line so it's visible.
Added `logger.debug(...)` on all three empty-content skip paths (user, assistant, tool). The outer-gate semantics are also restored — a parse that yields only an empty user event now fails the gate (`has_user_input` requires a non-empty `input_messages` entry with `role=user`), matching the old `if user_message` behavior.
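A minimal sketch of the skip-with-logging pattern, assuming a module-level logger; the handler function name is illustrative, not the SDK's actual API:

```python
import logging

logger = logging.getLogger(__name__)


def handle_user_message_event(event_attrs, input_messages):
    # Empty content is still skipped per-event, but no longer silently:
    # a debug line makes dropped events visible in verbose logs.
    content = event_attrs.get("content", "")
    if not content:
        logger.debug("Skipping gen_ai.user.message event with empty content")
        return
    input_messages.append({"role": "user", "content": {"content": content}})
```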
```python
def test_user_message_alias_returns_latest(self):
    """user_message returns the latest entry; accessing it with history emits a DeprecationWarning."""
    import warnings
```
`import warnings` inside the test body — move it to the module-level imports. Same issue at line ~199 with `import pytest`, which is already imported at line 5.
Hoisted `import warnings` to the module-level imports. `import pytest` was already at module level (line 5); the duplicate inside the test body is gone.
```diff
@@ -129,7 +129,43 @@ def test_extract_conversation_turn_assistant_message(self, mock_event):
     turn = StrandsEventParser.extract_conversation_turn(events)
```
Let's add a test for the no-choice path here — `[user.message, assistant.message]` with no `gen_ai.choice` event. The broadened gate enables it now, but there's no coverage for what `output.messages` looks like in that case.
Added `test_extract_conversation_turn_assistant_message_only_returns_none`. Since the gate now requires `assistant_messages`, the no-choice path returns `None` and never reaches `build_conversation_log_record`.
look into these too
…model

Addresses review comments from jariy17, aidandaly24, notgitika on #454.

Data model: replace the `user_messages: List[str]` + `history_messages: List[Dict]` split with a single `input_messages: List[Dict]` field that preserves the chronological arrival order of `gen_ai.user.message` and `gen_ai.assistant.message` events. For a span `[u1, a1, u2, a2, u3, choice]`, downstream now receives `[user, assistant, user, assistant, user]` so the actual conversation flow is reconstructable, per OTel GenAI semconv.

ConversationTurn: drop the `@dataclass` decorator (it was conflicting with the custom `__init__` and leaving the class-level field defaults dead); write explicit `__init__`, `__repr__`, `__eq__` so equality compares actual instance state. The `user_message: str` property is retained for backwards compatibility; it now returns the most recent user entry from `input_messages` and warns when history is present.

Parser: gate tightened back to require `assistant_messages`, so a span with no `gen_ai.choice` does not emit an empty-output ConversationTurn. Empty event content now logs at debug level instead of being silently skipped.

Tests: move duplicated fixtures into a shared conftest.py; hoist `import warnings` / `import pytest` to module level; update the multi-turn regression test to assert chronological ordering rather than role-grouped; add regression tests for the no-choice path and the user+tool_result-only path returning None.
Summary

`StrandsEventParser.extract_conversation_turn` silently drops conversation history when a span carries multiple `gen_ai.*.message` events — common on every multi-turn agent invocation. This PR preserves the full history, separates prior assistant turns from the current model output, and keeps the existing API working via a deprecation shim.

The bug

Two interacting defects in `span_to_adot_serializer/strands_converter.py`:

1. **User-message collapse** — `extract_conversation_turn` stored `gen_ai.user.message` content in a scalar. A span carrying N `gen_ai.user.message` events (the conversation history a Strands agent replays to the model) would retain only the last one. For a 9-turn conversation, 8 user turns were silently dropped before the span was ever serialized.

2. **Assistant-message miscategorization** — `gen_ai.assistant.message` events and `gen_ai.choice` events were both `.append()`ed to the same `assistant_messages` list. Per OTel GenAI semconv, `gen_ai.{system,user,assistant,tool}.message` events describe input context (including prior assistant turns the model reads as history), while `gen_ai.choice` describes the current-turn output. Mixing them means downstream consumers — `ADOTDocumentBuilder.build_conversation_log_record`, Online Evaluations, Observability dashboards — cannot tell history apart from new output and receive a `ConversationTurn` that misrepresents what the model actually did.

Reproduction

A span from a 3-turn conversation (3 user msgs, 2 prior assistant msgs, 1 new choice) produced a `ConversationTurn` with:

- `assistant_messages`: …
- `history_messages`: …

Fix

- `ConversationTurn` (`adot_models.py`) adds two fields: `user_messages: List[str]` — every user turn observed on the span, in order — and `history_messages: List[Dict]` — prior assistant turns from `gen_ai.assistant.message` events, kept separate from the current-turn output. `user_message` is retained as a `@property` that returns the most recent entry and emits a `DeprecationWarning` when it truncates more than one turn, so existing callers keep working while being nudged toward `user_messages`. Constructing `ConversationTurn(user_message=...)` still works — the initializer accepts either the legacy scalar or the new list.
- `StrandsEventParser.extract_conversation_turn` (`strands_converter.py`) appends every `gen_ai.user.message` to `user_messages`, routes `gen_ai.assistant.message` to `history_messages`, and keeps `assistant_messages` scoped to `gen_ai.choice` output only.
- `ADOTDocumentBuilder.build_conversation_log_record` (`adot_models.py`) emits all user turns followed by all history assistant turns as `input.messages`, so the serialized ADOT record carries the full context the model actually received.

Tests

- `test_extract_conversation_turn_assistant_message` — `gen_ai.assistant.message` now lands in `history_messages`, not `assistant_messages`.
- `test_extract_conversation_turn_preserves_all_user_messages` — multiple `gen_ai.user.message` events all retained.
- `test_extract_conversation_turn_separates_history_from_choice` — the end-to-end multi-turn case from the bug report.
- `ConversationTurn` dataclass tests: legacy scalar init, list init, deprecation-warning behavior, rejection of both-at-once, history preservation.
- `test_build_conversation_log_record_multi_turn_history` — builder emits `[user, user, user, assistant-history, assistant-history]` on input and single-entry output.

Full suite: 1726 passed, 0 failed. Serializer module: 41/41. `ruff`, `ruff format`, and pre-commit clean.

Backwards compatibility

- The `ConversationTurn(user_message=...)` constructor form still works.
- `turn.user_message` still returns a string (the most recent user turn).
- The `body.input.messages` / `body.output.messages` shape is unchanged; `input.messages` just contains more entries on multi-turn spans.
- `assistant_messages` now gets only the current-turn output. If a caller previously relied on the buggy behavior where history messages showed up in `assistant_messages`, they should migrate to `history_messages`. This is the intended correctness change.

Related

There is a companion bug with the same failure mode in `aws-opentelemetry-distro`'s `LLOHandler.process_spans` (`llo_handler.py:329` dedupes events by name; `llo_handler.py:148-152` tags `gen_ai.assistant.message` with `source: "output"`). That needs a separate fix in the `aws-observability/aws-otel-python-instrumentation` repo; this PR only addresses the AgentCore SDK side.

Test plan

- `tests/` suite passes (1726 passed)
- `ruff check` / `ruff format` / pre-commit hooks clean
- `user_messages` holds all turns and `assistant_messages` holds only the current `gen_ai.choice`