fix(payments): T06 cap memory polling at the documented 5-minute ceiling#1676
Closed
fahadfa-aws wants to merge 1 commit into
Closed
fix(payments): T06 cap memory polling at the documented 5-minute ceiling#1676fahadfa-aws wants to merge 1 commit into
fahadfa-aws wants to merge 1 commit into
Conversation
The polling loop waiting for AgentCore Memory to reach ACTIVE used `while True:` with no upper bound. The inline print told users "usually 30-90s" but the README at line 163 already documents a 5-minute practical ceiling and tells operators to "call get_memory once to inspect failureReason" if it stays CREATING beyond that. The script never enforces the ceiling. A genuinely stuck memory polls forever, and the original test run at 150s was already past the "usually" range without any signal that something might be unusual. Add MAX_WAIT_SECONDS = 300 (matching the README's ceiling). When the elapsed time crosses it, raise TimeoutError with the exact get-memory command from the README so the operator has the next step in front of them. The message also points at CloudWatch logs in the bedrock-agentcore log group. The existing FAILED-state RuntimeError path is unchanged; the existing finally-block memory cleanup still runs on either exit. Update the inline waiting message to mention the ceiling so users have a calibrated expectation rather than "30-90s" optimism. Verified: - Mocked-clock tests covering stuck/FAILED/happy-path scenarios all pass. - Live AWS happy-path: real memory creation in us-west-2 reached ACTIVE at 150s through the patched loop, then cleaned up normally.
Author
|
@mvangara10 — flagging this for your review when you have a moment. Tagged across the full set of payments-tutorial fixes I've been pushing today; happy to walk through any of them. Audit logs and test evidence are referenced in the PR description. |
Author
|
Superseded by #1738 (consolidated PR) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Issue
research_agent_with_memory.py:205pollsget_memoryin awhile True:loop with no upper bound. The inline print at line 201 says "usually 30-90s" but the README at line 163 already documents a 5-minute ceiling: "If it stays CREATING beyond 5 minutes, callget_memoryonce to inspectfailureReason. Check CloudWatch logs in the bedrock-agentcore log group for index build errors."The script never enforces the ceiling. The original test session ran at 150s — already past "usually 30-90s" — with no signal that anything might be unusual. A genuinely stuck memory polls forever.
Changes
research_agent_with_memory.py:200-220— addMAX_WAIT_SECONDS = 300(matching the README's documented ceiling). When elapsed crosses it, raiseTimeoutErrorwith the exactget-memoryCLI invocation from the README so the operator has the next step inline. The existingFAILED-stateRuntimeErrorpath is unchanged. The existingfinally-block memory cleanup still runs on either exit. Update the inline "waiting" message to surface the ceiling so users have a calibrated expectation.12 lines added, 1 changed. No behavior change for the happy path.
Verification
Mocked-clock unit tests — three scenarios, all pass:
TimeoutErrorraised at exactly 300s, 31 poll callsRuntimeErrorraised before timeoutLive AWS happy-path — created a real
AgentCore::Memoryinus-west-2, ran the patched loop end-to-end:Stuck-state probe — tried provoking a CREATING-stuck state with deliberately weird inputs (empty namespace
///, 500-char namespace). The service reached ACTIVE within 80-120s in every case, so I couldn't induce a true stuck state with bad inputs alone — that's evidence the genuinely-stuck case is a tail/service-side condition, exactly what the timeout exists to surface.AWS Knowledge MCP cross-check — no official AgentCore Memory creation-time SLA published; the README's 5-minute ceiling is the strongest available reference, which this fix matches exactly.