test: add multi-turn conversation serialization integration test#455
test: add multi-turn conversation serialization integration test#455
Conversation
End-to-end regression coverage for the fix in #454. Runs a real 4-turn Strands agent against Bedrock, locates the final `chat` span (the span kind that carries full history as `gen_ai.{user,assistant}.message` events), and asserts that `convert_strands_to_adot` produces: - Every prior user turn preserved in `body.input.messages` (Bug 1 regression). - Every prior assistant turn in `body.input.messages`, not `output.messages` (Bug 2 regression). - Exactly one entry in `body.output.messages` — the current turn's `gen_ai.choice` — with no stale history content. - Chronological role interleaving in `input_messages` (`[user, assistant, user, assistant, ..., user]`), not role-grouped. Unit tests cover the same assertions with mocked events; this test pins the real Strands span shape so future changes to Strands instrumentation or the serializer are caught at the integration boundary. Refs: V2197498381, #454.
aidandaly24
left a comment
There was a problem hiding this comment.
Good integration test — high value for pinning real Strands behavior. One high-risk concern and one nit inline.
| f"Expected chronological role interleaving {expected_pattern}; got {roles_in_order}. " | ||
| f"Pre-fix role-grouped ordering would have emitted " | ||
| f"[{'user, ' * len(turns)}{'assistant, ' * expected_history_count}] instead." | ||
| ) |
There was a problem hiding this comment.
Potential contradiction with PR #454's builder ordering.
This asserts chronological interleaving: [user, assistant, user, assistant, ..., user].
However, build_conversation_log_record in PR #454 serializes as role-grouped:
input_messages = [{...} for user_msg in conversation.user_messages] # all users
input_messages.extend(conversation.history_messages) # then all assistantsThat would produce [user, user, user, user, assistant, assistant, assistant] — failing this assertion.
Can you confirm: does convert_strands_to_adot have separate ordering logic that produces interleaved output? Or was the builder updated to interleave after my review on #454? If neither, this test will fail in CI against the current build_conversation_log_record implementation.
| ): | ||
| assert stale not in joined_output, ( | ||
| f"output.messages contains stale history entry containing {stale!r}: {output_texts}" | ||
| ) |
There was a problem hiding this comment.
nit: These stale-content assertions are fragile — they assume specific model responses ("nice to meet you, taro", etc.) that Claude may not produce. When the model responds differently, these pass vacuously without actually testing anything. The len(output_messages) == 1 assertion above already structurally validates that history isn't leaking into output. Consider removing these or replacing with a structural check (e.g., verify no output text substring appears in any input entry).
Summary
Follow-up to #454. That PR fixed two bugs in
StrandsEventParser/ConversationTurn/ADOTDocumentBuilder, and the unit tests covered them with MagicMock-based spans. This adds an end-to-end integration test that exercises the full pipeline — real Strands agent → real Bedrock call → real OTel spans →convert_strands_to_adot→ ADOT document shape assertions.What the test does
Runs a 4-turn Strands conversation against Claude (default model, configurable region via
BEDROCK_TEST_REGION), captures the in-memory spans, and locates the finalchatspan — the Strands span kind that carries the full conversation history asgen_ai.user.message/gen_ai.assistant.message/gen_ai.choiceevents. Asserts the ADOT log record derived from that span:body.input.messagesbody.input.messages(notoutput.messages)body.output.messages, containing the current-turngen_ai.choiceinput_messagesrole order interleaves[user, assistant, user, assistant, ..., user]Why this test matters over the unit tests
Unit tests assert against a MagicMock event stream shaped the way I thought Strands emitted spans. This test pins the actual shape Strands produces in production. Specifically, it discovered during development that Strands emits three span kinds per turn (
chat,execute_event_loop_cycle,invoke_agent) — onlychatcarries the full history events where the bug manifested. Generic "grab the last log record" checks would miss this.If Strands upstream changes its instrumentation (renames events, splits span kinds, changes attribute shapes), this test catches it. If a future refactor of
span_to_adot_serializerbreaks serialization for real span objects in ways mocked spans don't reflect, this test catches it.Running it
Requires Bedrock model access (default model is whatever Strands picks, currently
us.anthropic.claude-sonnet-4-*).Verification
Ran locally against Bedrock in
us-east-1. Final-turn chat span produced:input.messages= 4 user entries + 3 assistant-history entries (all 4 prior user turns preserved, all 3 prior assistant turns land in input)output.messages= 1 entry (the currentgen_ai.choice, content"Your name is Taro and you live in Seattle.")[user, assistant, user, assistant, user, assistant, user](chronological)Pre-fix, this test would fail at the very first assertion (4 expected user entries, 1 observed).
Test plan
ruff check/ruff formatclean