test: add multi-turn conversation serialization integration test by tejaskash · Pull Request #455 · aws/bedrock-agentcore-sdk-python

tejaskash · 2026-05-05T20:53:31Z

Summary

Follow-up to #454. That PR fixed two bugs in StrandsEventParser / ConversationTurn / ADOTDocumentBuilder, and the unit tests covered them with MagicMock-based spans. This adds an end-to-end integration test that exercises the full pipeline — real Strands agent → real Bedrock call → real OTel spans → convert_strands_to_adot → ADOT document shape assertions.

What the test does

Runs a 4-turn Strands conversation against Claude (default model, configurable region via BEDROCK_TEST_REGION), captures the in-memory spans, and locates the final chat span — the Strands span kind that carries the full conversation history as gen_ai.user.message / gen_ai.assistant.message / gen_ai.choice events. Asserts the ADOT log record derived from that span:

Assertion	Rationale
Every prior user turn appears in `body.input.messages`	Bug 1 regression — the scalar-overwrite dedup would have left only the newest
Every prior assistant turn appears in `body.input.messages` (not `output.messages`)	Bug 2 regression — prior turns were being bucketed as current-turn output
Exactly one entry in `body.output.messages`, containing the current-turn `gen_ai.choice`	Output contamination check
Output does not contain any stale history text	Explicit contamination check
`input_messages` role order interleaves `[user, assistant, user, assistant, ..., user]`	Chronological ordering — the ordering issue raised during review of #454

Why this test matters over the unit tests

Unit tests assert against a MagicMock event stream shaped the way I thought Strands emitted spans. This test pins the actual shape Strands produces in production. Specifically, it discovered during development that Strands emits three span kinds per turn (chat, execute_event_loop_cycle, invoke_agent) — only chat carries the full history events where the bug manifested. Generic "grab the last log record" checks would miss this.

If Strands upstream changes its instrumentation (renames events, splits span kinds, changes attribute shapes), this test catches it. If a future refactor of span_to_adot_serializer breaks serialization for real span objects in ways mocked spans don't reflect, this test catches it.

Running it

AWS_PROFILE=<profile> BEDROCK_TEST_REGION=us-west-2 \
  uv run pytest tests_integ/evaluation/integrations/strands_agents_evals/test_multi_turn_conversation_serialization.py -xvs

Requires Bedrock model access (default model is whatever Strands picks, currently us.anthropic.claude-sonnet-4-*).

Verification

Ran locally against Bedrock in us-east-1. Final-turn chat span produced:

input.messages = 4 user entries + 3 assistant-history entries (all 4 prior user turns preserved, all 3 prior assistant turns land in input)
output.messages = 1 entry (the current gen_ai.choice, content "Your name is Taro and you live in Seattle.")
Role order = [user, assistant, user, assistant, user, assistant, user] (chronological)

Pre-fix, this test would fail at the very first assertion (4 expected user entries, 1 observed).

Test plan

Runs green against real Bedrock (us-east-1, Claude Sonnet 4, 4-turn conversation)
ruff check / ruff format clean
Would fail pre-fix: preserve multi-turn history in Strands ConversationTurn #454 fix — assertions target the exact shape the bug produced

End-to-end regression coverage for the fix in #454. Runs a real 4-turn Strands agent against Bedrock, locates the final `chat` span (the span kind that carries full history as `gen_ai.{user,assistant}.message` events), and asserts that `convert_strands_to_adot` produces: - Every prior user turn preserved in `body.input.messages` (Bug 1 regression). - Every prior assistant turn in `body.input.messages`, not `output.messages` (Bug 2 regression). - Exactly one entry in `body.output.messages` — the current turn's `gen_ai.choice` — with no stale history content. - Chronological role interleaving in `input_messages` (`[user, assistant, user, assistant, ..., user]`), not role-grouped. Unit tests cover the same assertions with mocked events; this test pins the real Strands span shape so future changes to Strands instrumentation or the serializer are caught at the integration boundary. Refs: V2197498381, #454.

aidandaly24

Good integration test — high value for pinning real Strands behavior. One high-risk concern and one nit inline.

aidandaly24 · 2026-05-05T21:33:54Z

+            f"Expected chronological role interleaving {expected_pattern}; got {roles_in_order}. "
+            f"Pre-fix role-grouped ordering would have emitted "
+            f"[{'user, ' * len(turns)}{'assistant, ' * expected_history_count}] instead."
+        )


Potential contradiction with PR #454's builder ordering.

This asserts chronological interleaving: [user, assistant, user, assistant, ..., user].

However, build_conversation_log_record in PR #454 serializes as role-grouped:

input_messages = [{...} for user_msg in conversation.user_messages] # all users input_messages.extend(conversation.history_messages) # then all assistants

That would produce [user, user, user, user, assistant, assistant, assistant] — failing this assertion.

Can you confirm: does convert_strands_to_adot have separate ordering logic that produces interleaved output? Or was the builder updated to interleave after my review on #454? If neither, this test will fail in CI against the current build_conversation_log_record implementation.

aidandaly24 · 2026-05-05T21:33:55Z

+        ):
+            assert stale not in joined_output, (
+                f"output.messages contains stale history entry containing {stale!r}: {output_texts}"
+            )


nit: These stale-content assertions are fragile — they assume specific model responses ("nice to meet you, taro", etc.) that Claude may not produce. When the model responds differently, these pass vacuously without actually testing anything. The len(output_messages) == 1 assertion above already structurally validates that history isn't leaking into output. Consider removing these or replacing with a structural check (e.g., verify no output text substring appears in any input entry).

tejaskash requested a review from a team May 5, 2026 20:53

tejaskash temporarily deployed to auto-approve May 5, 2026 20:59 — with GitHub Actions Inactive

tejaskash deployed to auto-approve May 5, 2026 20:59 — with GitHub Actions Active

tejaskash temporarily deployed to auto-approve May 5, 2026 20:59 — with GitHub Actions Inactive

aidandaly24 reviewed May 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: add multi-turn conversation serialization integration test#455

test: add multi-turn conversation serialization integration test#455
tejaskash wants to merge 1 commit intomainfrom
test/multi-turn-conversation-serialization-integ

tejaskash commented May 5, 2026

Uh oh!

aidandaly24 left a comment

Uh oh!

aidandaly24 May 5, 2026

Uh oh!

aidandaly24 May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tejaskash commented May 5, 2026

Summary

What the test does

Why this test matters over the unit tests

Running it

Verification

Test plan

Uh oh!

aidandaly24 left a comment

Choose a reason for hiding this comment

Uh oh!

aidandaly24 May 5, 2026

Choose a reason for hiding this comment

Uh oh!

aidandaly24 May 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants