fix(payments): T06 cap memory polling at the documented 5-minute ceiling #1676

Closed
fahadfa-aws wants to merge 1 commit into
awslabs:mainfrom
fahadfa-aws:fix/payments-t06-memory-poll-timeout

@fahadfa-aws

Issue

research_agent_with_memory.py:205 polls get_memory in a while True: loop with no upper bound. The inline print at line 201 says "usually 30-90s" but the README at line 163 already documents a 5-minute ceiling: "If it stays CREATING beyond 5 minutes, call get_memory once to inspect failureReason. Check CloudWatch logs in the bedrock-agentcore log group for index build errors."

The script never enforces the ceiling. In the original test session the memory took 150s to reach ACTIVE, already past "usually 30-90s", with no signal that anything might be unusual. A genuinely stuck memory polls forever.

Changes

research_agent_with_memory.py:200-220 — add MAX_WAIT_SECONDS = 300 (matching the README's documented ceiling). When elapsed crosses it, raise TimeoutError with the exact get-memory CLI invocation from the README so the operator has the next step inline. The existing FAILED-state RuntimeError path is unchanged. The existing finally-block memory cleanup still runs on either exit. Update the inline "waiting" message to surface the ceiling so users have a calibrated expectation.

12 lines added, 1 changed. No behavior change for the happy path.
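
For readers without the diff open, the shape of the change can be sketched roughly as follows. This is an illustration under stated assumptions, not the patched code itself: `wait_for_memory_active`, the injected `get_status` callable (a stand-in for the script's get_memory call), and `POLL_INTERVAL_SECONDS = 10` (inferred from the "every 10s" status updates) are hypothetical names. Only `MAX_WAIT_SECONDS = 300` and the two exception types come from the description above.

```python
import time

MAX_WAIT_SECONDS = 300      # README's documented ceiling
POLL_INTERVAL_SECONDS = 10  # assumption: inferred from the "every 10s" updates


def wait_for_memory_active(get_status, clock=time.monotonic, sleep=time.sleep):
    """Poll until ACTIVE; raise on FAILED or once the ceiling is reached.

    get_status is a hypothetical zero-argument callable standing in for the
    script's get_memory call; it returns the memory's status string.
    clock and sleep are injectable so the loop can be driven by a fake clock.
    """
    start = clock()
    while True:
        status = get_status()
        elapsed = clock() - start
        if status == "ACTIVE":
            return elapsed
        if status == "FAILED":
            raise RuntimeError("Memory creation FAILED; inspect failureReason")
        if elapsed >= MAX_WAIT_SECONDS:
            raise TimeoutError(
                f"Memory still {status} after {MAX_WAIT_SECONDS}s; call get_memory "
                "once to inspect failureReason and check CloudWatch logs in the "
                "bedrock-agentcore log group"
            )
        print(f"  status={status} ({elapsed:.0f}s elapsed, ceiling {MAX_WAIT_SECONDS}s)")
        sleep(POLL_INTERVAL_SECONDS)
```

Checking the ceiling after the status checks means a memory that turns ACTIVE on the final poll still returns normally instead of timing out.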

Verification

Mocked-clock unit tests — three scenarios, all pass:

  • Stuck memory (always returns CREATING) → TimeoutError raised at exactly 300s, 31 poll calls
  • FAILED status on 3rd poll → RuntimeError raised before timeout
  • Happy path (ACTIVE on 5th poll, ~40s) → returns normally, no behavior change
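
The three scenarios above can be reproduced without real waiting by driving a loop with a fake clock. The sketch below is a hedged reconstruction, not the PR's test file: `FakeClock` and `poll` are hypothetical names, `poll` is a compact stand-in for the patched loop, and the 10s interval is an assumption consistent with the 31-poll count (polls at 0s, 10s, ..., 300s).

```python
import itertools


class FakeClock:
    """Deterministic clock: sleep() advances time instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def poll(statuses, clock, max_wait=300, interval=10):
    """Compact stand-in for the patched loop (hypothetical, not the PR's code).

    statuses is an iterable of status strings, one per poll.
    Returns (elapsed_seconds, poll_count) once ACTIVE is seen.
    """
    start = clock.time()
    polls = 0
    for status in statuses:
        polls += 1
        elapsed = clock.time() - start
        if status == "ACTIVE":
            return elapsed, polls
        if status == "FAILED":
            raise RuntimeError(f"FAILED after {polls} polls")
        if elapsed >= max_wait:
            raise TimeoutError(f"still {status} at {elapsed:.0f}s after {polls} polls")
        clock.sleep(interval)


if __name__ == "__main__":
    # Happy path: ACTIVE on the 5th poll, 40s of simulated time.
    print(poll(["CREATING"] * 4 + ["ACTIVE"], FakeClock()))
    # Stuck memory: an endless CREATING stream trips the ceiling.
    try:
        poll(itertools.repeat("CREATING"), FakeClock())
    except TimeoutError as exc:
        print(exc)
```

Because the clock only advances inside `sleep`, the stuck case finishes immediately while still exercising all 31 polls.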

Live AWS happy-path — created a real AgentCore::Memory in us-west-2, ran the patched loop end-to-end:

  • Memory reached ACTIVE at 150s (matching the original test-session timing)
  • Loop printed status updates every 10s, stayed within MAX_WAIT_SECONDS, exited cleanly
  • DeleteMemory cleanup succeeded afterward

Stuck-state probe — tried provoking a CREATING-stuck state with deliberately malformed inputs (empty namespace ///, 500-char namespace). The service reached ACTIVE within 80-120s in every case, so bad inputs alone did not induce a stuck state. This suggests the genuinely stuck case is a tail or service-side condition, which is what the timeout exists to surface.

AWS Knowledge MCP cross-check — no official AgentCore Memory creation-time SLA published; the README's 5-minute ceiling is the strongest available reference, which this fix matches exactly.

The polling loop waiting for AgentCore Memory to reach ACTIVE used
`while True:` with no upper bound. The inline print told users
"usually 30-90s" but the README at line 163 already documents a
5-minute practical ceiling and tells operators to "call get_memory
once to inspect failureReason" if it stays CREATING beyond that.

The script never enforces the ceiling. A genuinely stuck memory polls
forever, and the original test run at 150s was already past the
"usually" range without any signal that something might be unusual.

Add MAX_WAIT_SECONDS = 300 (matching the README's ceiling). When the
elapsed time crosses it, raise TimeoutError with the exact get-memory
command from the README so the operator has the next step in front of
them. The message also points at CloudWatch logs in the
bedrock-agentcore log group.

The existing FAILED-state RuntimeError path is unchanged; the existing
finally-block memory cleanup still runs on either exit. Update the
inline waiting message to mention the ceiling so users have a calibrated
expectation rather than "30-90s" optimism.

Verified:
- Mocked-clock tests covering stuck/FAILED/happy-path scenarios all pass.
- Live AWS happy-path: real memory creation in us-west-2 reached ACTIVE
  at 150s through the patched loop, then cleaned up normally.
@fahadfa-aws

Author

@mvangara10 — flagging this for your review when you have a moment. Tagged across the full set of payments-tutorial fixes I've been pushing today; happy to walk through any of them. Audit logs and test evidence are referenced in the PR description.

@fahadfa-aws

Author

Superseded by #1738 (consolidated PR)
