Skip to content

test: add multi-turn conversation serialization integration test#455

Open
tejaskash wants to merge 1 commit intomainfrom
test/multi-turn-conversation-serialization-integ
Open

test: add multi-turn conversation serialization integration test#455
tejaskash wants to merge 1 commit intomainfrom
test/multi-turn-conversation-serialization-integ

Conversation

@tejaskash
Copy link
Copy Markdown
Contributor

Summary

Follow-up to #454. That PR fixed two bugs in StrandsEventParser / ConversationTurn / ADOTDocumentBuilder, and the unit tests covered them with MagicMock-based spans. This adds an end-to-end integration test that exercises the full pipeline — real Strands agent → real Bedrock call → real OTel spans → convert_strands_to_adot → ADOT document shape assertions.

What the test does

Runs a 4-turn Strands conversation against Claude (default model, configurable region via BEDROCK_TEST_REGION), captures the in-memory spans, and locates the final chat span — the Strands span kind that carries the full conversation history as gen_ai.user.message / gen_ai.assistant.message / gen_ai.choice events. Asserts the ADOT log record derived from that span:

Assertion Rationale
Every prior user turn appears in body.input.messages Bug 1 regression — the scalar-overwrite dedup would have left only the newest
Every prior assistant turn appears in body.input.messages (not output.messages) Bug 2 regression — prior turns were being bucketed as current-turn output
Exactly one entry in body.output.messages, containing the current-turn gen_ai.choice Output contamination check
Output does not contain any stale history text Explicit contamination check
input_messages role order interleaves [user, assistant, user, assistant, ..., user] Chronological ordering — the ordering issue raised during review of #454

Why this test matters over the unit tests

Unit tests assert against a MagicMock event stream shaped the way I thought Strands emitted spans. This test pins the actual shape Strands produces in production. Specifically, it discovered during development that Strands emits three span kinds per turn (chat, execute_event_loop_cycle, invoke_agent) — only chat carries the full history events where the bug manifested. Generic "grab the last log record" checks would miss this.

If Strands upstream changes its instrumentation (renames events, splits span kinds, changes attribute shapes), this test catches it. If a future refactor of span_to_adot_serializer breaks serialization for real span objects in ways mocked spans don't reflect, this test catches it.

Running it

AWS_PROFILE=<profile> BEDROCK_TEST_REGION=us-west-2 \
  uv run pytest tests_integ/evaluation/integrations/strands_agents_evals/test_multi_turn_conversation_serialization.py -xvs

Requires Bedrock model access (default model is whatever Strands picks, currently us.anthropic.claude-sonnet-4-*).

Verification

Ran locally against Bedrock in us-east-1. Final-turn chat span produced:

  • input.messages = 4 user entries + 3 assistant-history entries (all 4 prior user turns preserved, all 3 prior assistant turns land in input)
  • output.messages = 1 entry (the current gen_ai.choice, content "Your name is Taro and you live in Seattle.")
  • Role order = [user, assistant, user, assistant, user, assistant, user] (chronological)

Pre-fix, this test would fail at the very first assertion (4 expected user entries, 1 observed).

Test plan

End-to-end regression coverage for the fix in #454. Runs a real 4-turn
Strands agent against Bedrock, locates the final `chat` span (the span kind
that carries full history as `gen_ai.{user,assistant}.message` events), and
asserts that `convert_strands_to_adot` produces:

- Every prior user turn preserved in `body.input.messages` (Bug 1 regression).
- Every prior assistant turn in `body.input.messages`, not `output.messages`
  (Bug 2 regression).
- Exactly one entry in `body.output.messages` — the current turn's
  `gen_ai.choice` — with no stale history content.
- Chronological role interleaving in `input_messages`
  (`[user, assistant, user, assistant, ..., user]`), not role-grouped.

Unit tests cover the same assertions with mocked events; this test pins
the real Strands span shape so future changes to Strands instrumentation
or the serializer are caught at the integration boundary.

Refs: V2197498381, #454.
@tejaskash tejaskash requested a review from a team May 5, 2026 20:53
@tejaskash tejaskash deployed to auto-approve May 5, 2026 20:59 — with GitHub Actions Active
Copy link
Copy Markdown
Contributor

@aidandaly24 aidandaly24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good integration test — high value for pinning real Strands behavior. One high-risk concern and one nit inline.

f"Expected chronological role interleaving {expected_pattern}; got {roles_in_order}. "
f"Pre-fix role-grouped ordering would have emitted "
f"[{'user, ' * len(turns)}{'assistant, ' * expected_history_count}] instead."
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential contradiction with PR #454's builder ordering.

This asserts chronological interleaving: [user, assistant, user, assistant, ..., user].

However, build_conversation_log_record in PR #454 serializes as role-grouped:

input_messages = [{...} for user_msg in conversation.user_messages]  # all users
input_messages.extend(conversation.history_messages)  # then all assistants

That would produce [user, user, user, user, assistant, assistant, assistant] — failing this assertion.

Can you confirm: does convert_strands_to_adot have separate ordering logic that produces interleaved output? Or was the builder updated to interleave after my review on #454? If neither, this test will fail in CI against the current build_conversation_log_record implementation.

):
assert stale not in joined_output, (
f"output.messages contains stale history entry containing {stale!r}: {output_texts}"
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: These stale-content assertions are fragile — they assume specific model responses ("nice to meet you, taro", etc.) that Claude may not produce. When the model responds differently, these pass vacuously without actually testing anything. The len(output_messages) == 1 assertion above already structurally validates that history isn't leaking into output. Consider removing these or replacing with a structural check (e.g., verify no output text substring appears in any input entry).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants