fix(deploy): handle PYTHON_3_14 region mismatch and recover from REVIEW_IN_PROGRESS stacks#1157
fix(deploy): handle PYTHON_3_14 region mismatch and recover from REVIEW_IN_PROGRESS stacks#1157aidandaly24 wants to merge 6 commits into
Conversation
…, stack-status)
- validate command: render warnings (yellow) before printing 'Valid'
so PYTHON_3_14/region mismatch is actually surfaced to users
- schema/constants.ts: drop unused isRuntimeAvailableInRegion helper;
callers compare PYTHON_3_14_SUPPORTED_REGIONS directly
- deploy/actions.ts: drop dynamic import of recoverReviewInProgressStack
and add it to the static cloudformation import block; wire
isEarlyValidationError into the deploy catch handler to upgrade the
user-facing message and recommend --recover / PYTHON_3_13
- stack-cleanup.ts: split status-vs-execution-status sets; treat
Status=CREATE_COMPLETE + ExecutionStatus in {AVAILABLE, EXECUTE_FAILED,
UNAVAILABLE, OBSOLETE} as recoverable; refuse to delete when
ListChangeSets returns zero summaries (we lose the safety evidence)
- stack-status.ts: clarify RECOVERABLE_REVIEW_STATUSES comment — flag
advertises eligibility, real safety check lives in stack-cleanup
- validate/action.ts: deduplicate unsupportedRegions before joining
- stack-cleanup tests: add cases for empty list refusal, AVAILABLE +
CREATE_COMPLETE, and EXECUTE_FAILED
- collectRuntimeRegionWarnings: extend to MCP runtime tools and gateway Lambda/AgentCoreRuntime targets so PYTHON_3_14 warnings cover the full MCP surface (was: agents only) - schema/constants.ts: add cross-repo consistency note pointing at the sibling agentcore-l3-cdk-constructs constant - errors.ts: relax isEarlyValidationError to match any AWS::EarlyValidation::* hook subtype (was: PropertyValidation only, with redundant alternatives) - deploy/actions.ts: wire isReviewInProgressError into the deploy catch block to surface a friendly --recover hint when a retried deploy hits a stuck stack - stack-cleanup.ts: empty-summaries case now logs a warning via new onWarning callback and proceeds (REVIEW_IN_PROGRESS is itself authoritative evidence the stack has no resources). Also use do/while for the post-DeleteStack poll so timeoutMs=0 still gets one iteration - tests: add coverage for multiple PYTHON_3_14 agents, no-runtimes case, MCP tool + gateway target warnings, broader EarlyValidation hooks, recovery timeout, non-ValidationError propagation, empty-summaries warn-and-proceed
- stack-cleanup.ts: broaden stack-not-found detection to also match ValidationException and any error message containing 'does not exist' (AWS SDK v3 surface variations could otherwise cause us to poll until timeout even after a successful delete) - stack-cleanup.ts: add onProgress callback emitting heartbeats with the current StackStatus and elapsed time for each poll iteration so users don't perceive the recovery flow as hung - deploy/actions.ts: drain *all* stacks stuck in REVIEW_IN_PROGRESS in a recovery loop bounded by stackNames.length, so multi-stack deploys don't force the user to invoke --recover repeatedly - deploy/actions.ts: log an explicit warning when --recover is requested but the blocking stack is not auto-recoverable, so the flag never appears to silently do nothing - deploy/actions.ts: pipe the new onWarning + onProgress callbacks from recoverReviewInProgressStack through to the deploy logger - deploy/command.tsx: document that --recover should be combined with --yes when bootstrap may also be needed - validate/action.ts: case-insensitive region comparison so non-canonical casing in aws-targets.json doesn't trigger a false PYTHON_3_14 warning - tests: add coverage for ValidationException, 'does not exist' message variants, onProgress heartbeats, and case-insensitive region matching Notes on round-3 findings already addressed in earlier rounds: - collectRuntimeRegionWarnings inspects mcpRuntimeTools (round 2) - isReviewInProgressError wired into deploy catch (round 2) - empty-summaries case warns and proceeds (round 2)
- stack-cleanup.ts: tighten do/while comment to accurately describe
the synchronous-completion case (was: overstated guarantee about
timeoutMs=0)
- stack-cleanup.ts: drop redundant 'changeSetSummaries.length === 0 ||'
guard since Array#every already returns true for empty arrays
- errors.ts: tighten isReviewInProgressError to match specific
CloudFormation phrasings ('is in REVIEW_IN_PROGRESS', 'currently in
REVIEW_IN_PROGRESS', 'REVIEW_IN_PROGRESS state') instead of a bare
token, so passing references in event-stream messages don't trigger
a misleading --recover hint. Added negative test for substring trap.
- deploy/actions.ts: replace integer counter with a Set<string> of
recovered stack names; loop exits if the same blocking stack
reappears (eventual-consistency lag or evolving deployability
semantics) rather than retrying. Added comment documenting the
'at most one recovery per stack' contract.
Round 4 HIGH 'unable to perform review' is a reviewer-environment
issue (no git binary in sandbox); no code change required.
agentcore-cli-automation
left a comment
There was a problem hiding this comment.
Thanks for the thorough work on this — the three-pronged approach (new-project default, validate-time warning, recover-from-stuck-state) is well factored and the test coverage is solid. The stack-cleanup helper in particular is careful about the right things (paginated ListChangeSets, non-executed change-set guard, warn-and-proceed on empty summaries, onProgress heartbeat, robust stack-not-found detection, do/while polling for the synchronous-delete case).
I have one serious concern that I think needs to be resolved before this can merge, detailed in the inline comments. In short: the sibling PR in aws/agentcore-l3-cdk-constructs that lowers DEFAULT_MCP_PYTHON_VERSION isn't on main yet, and because the CLI bundles the L3 constructs at build time (scripts/bundle.mjs), the shipped artifact will still default MCP Lambda / AgentCoreRuntime compute to PYTHON_3_14 whenever pythonVersion is unset in the user's config. The PR checklist item "Any dependent changes have been merged and published" is therefore not yet accurate, and the validate-time warning as written won't catch this case (it only flags an explicit 'PYTHON_3_14', not a missing/defaulted value), so the user gets a clean agentcore validate followed by a stuck-stack deploy — the exact regression #907 is trying to prevent.
Any of the following resolves it:
- Land the L3 constructs PR first, bump that package version, update this CLI to depend on the new version, and re-run
npm run bundleto confirm the shipped constructs carry PYTHON_3_13. - Add a commit here that pins/patches the bundled constructs to the forked branch so this PR's shipped artifact is internally coherent.
- Extend
collectRuntimeRegionWarningsto also flag MCP Lambda and AgentCoreRuntime compute entries that have nopythonVersionset when the target region is unsupported (since the L3 default would silently pick PYTHON_3_14). This makes the CLI safe by itself even if the L3 fix slips.
| } | ||
|
|
||
| for (const tool of projectSpec.mcpRuntimeTools ?? []) { | ||
| if (getPythonVersionFromCompute(tool.compute) === 'PYTHON_3_14') { |
There was a problem hiding this comment.
The === 'PYTHON_3_14' check only fires when pythonVersion is explicitly set to 'PYTHON_3_14'. When compute.pythonVersion (Lambda) or compute.runtime.pythonVersion (AgentCoreRuntime) is simply absent, getPythonVersionFromCompute returns undefined and this branch is skipped.
That's a problem given how the bundled CDK constructs behave today: in aws/agentcore-l3-cdk-constructs on main, McpLambdaCompute.ts has const pythonVersion = compute.pythonVersion ?? DEFAULT_MCP_PYTHON_VERSION; where DEFAULT_MCP_PYTHON_VERSION = 'PYTHON_3_14'. So a config with an MCP Lambda compute that omits pythonVersion will validate clean in eu-central-1, then synth a PYTHON_3_14 Lambda, then fail CloudFormation early-validation — the exact loop #907 is trying to stop.
This is tied to the main review thread (the sibling L3 constructs PR isn't on main yet). If you choose option 1 or 2 from the review body (bump the L3 dep and let its default become PYTHON_3_13), this is moot. If you choose option 3, please also flag unset-pythonVersion compute entries here. Same comment applies to the gateway-target loop a few lines down (line 152).
| * package, so they can drift; consider extracting a shared package if a third | ||
| * region becomes supported. See https://github.com/aws/agentcore-cli/issues/907. | ||
| */ | ||
| export const PYTHON_3_14_SUPPORTED_REGIONS: readonly string[] = ['us-east-1', 'us-west-2'] as const; |
There was a problem hiding this comment.
The cross-repo-consistency note here is great, but it flags the wrong symbol in the sibling repo for this PR's release story. The comment asks future maintainers to update DEFAULT_MCP_PYTHON_VERSION "when the list changes," but right now the sibling repo's main still has DEFAULT_MCP_PYTHON_VERSION = 'PYTHON_3_14' and lambda.Runtime.PYTHON_3_14 as the fallback in McpLambdaCompute.ts. Until that lands and a new L3-constructs version is published (and this CLI pins it + re-bundles), the CLI default here (PYTHON_3_13) and the bundled-CDK default (PYTHON_3_14) disagree. For MCP Lambda compute without an explicit pythonVersion, the CDK side wins and the original bug persists.
Could you either (a) wait for the L3 PR to land and bump the dependency here, or (b) note in the PR description that this CLI PR must be merged strictly after the L3 PR + a version bump? The "Any dependent changes have been merged and published" checkbox in the PR description is currently checked but the dependent change isn't merged.
| logger.log( | ||
| `--recover requested, but stack "${deployabilityCheck.blockingStack}" is not in an auto-recoverable state. Skipping recovery.`, | ||
| 'warn' | ||
| ); |
There was a problem hiding this comment.
Minor UX nit (not a blocker): if a stack reappears as REVIEW_IN_PROGRESS after recovery (e.g. CloudFormation eventual-consistency lag), the while loop exits via the recoveredStacks Set guard, and here isRecoverableReview is still true, so this warning block is skipped. We then fall through to the generic deployabilityCheck.message which points the user at agentcore deploy --recover — the thing they just ran — which is confusing.
Probably worth an extra branch that detects "we already recovered this stack this run" and emits a clearer message ("stack X was just recovered but is still reported as REVIEW_IN_PROGRESS; this is usually CFN eventual consistency, retry in a few seconds"). Not a merge blocker, but the current output is a dead-end loop from the user's perspective.
| const message = getErrorMessage(err).toLowerCase(); | ||
| if (message.includes('is in review_in_progress')) return true; | ||
| if (message.includes('currently in review_in_progress')) return true; | ||
| if (message.includes('review_in_progress state')) return true; |
There was a problem hiding this comment.
Very minor: isReviewInProgressError matches on three substrings, one of which is review_in_progress state (no word boundary). That's strong enough to avoid the obvious false positives (you have a test for the event-stream history case), but it would also match something like "... completed review_in_progress state handoff ..." in a future CFN phrasing change. Consider \bis (currently )?in REVIEW_IN_PROGRESS\b as a regex to be explicit. Not a blocker.
|
Closing: test run from batch agent |
Description
Fixes #907 —
PYTHON_3_14was accepted byagentcore validatebut rejected by CloudFormation early-validation in regions where it is not yet GA, leaving the stack permanently stuck inREVIEW_IN_PROGRESSand requiring manual deletion before any further deploys could proceed.This PR addresses the bug on three fronts:
1. Stop new projects from tripping the bug
src/schema/constants.ts:DEFAULT_PYTHON_VERSIONlowered from'PYTHON_3_14'→'PYTHON_3_13'(PYTHON_3_13 is GA in every AgentCore region).src/cli/tui/screens/mcp/types.ts: reordered the runtime picker soPYTHON_3_13is first / Recommended;PYTHON_3_14is still selectable but labelledPreview – not available in all regions.aws/agentcore-l3-cdk-constructs(branchfix/907-cc0e034e) lowersDEFAULT_MCP_PYTHON_VERSIONand thelambda.Runtimefallback inMcpLambdaCompute.tsto match.2. Catch the mismatch at validate time
src/schema/constants.ts: newPYTHON_3_14_SUPPORTED_REGIONS = ['us-east-1', 'us-west-2']constant with a cross-repo consistency note.src/cli/commands/validate/{action,command}.tsx:agentcore validatenow emits a non-fatal yellow warning when any agent, MCP runtime tool, or gateway Lambda/AgentCoreRuntime target usesPYTHON_3_14against an unsupported region. Region comparison is case-insensitive; offending components are listed individually (agent "x",MCP tool "y",gateway "g" target "t").3. Recover from the stuck-stack state
src/cli/cloudformation/stack-status.ts: splitREVIEW_IN_PROGRESSout of the generic in-progress bucket and tag itisRecoverableReview: truewith an actionable message.src/cli/cloudformation/stack-cleanup.ts(new):recoverReviewInProgressStack()helper that:REVIEW_IN_PROGRESS.FAILED/OBSOLETEor ExecutionStatus in{UNAVAILABLE, AVAILABLE, EXECUTE_FAILED, OBSOLETE}are recoverable).onWarningand proceeds sinceREVIEW_IN_PROGRESSitself is authoritative evidence of zero resources.DeleteStackCommandand pollsDescribeStacksuntil the stack disappears, with anonProgressheartbeat callback so the deploy step doesn't appear hung.ValidationError,ValidationException, or any error message containingdoes not exist).src/cli/commands/deploy/{command.tsx,types.ts,actions.ts}+src/cli/operations/deploy/preflight.ts: new--recoverflag wired through. When set, the preflight drains allREVIEW_IN_PROGRESSstacks one at a time (tracked by aSet<string>so the loop exits if the same stack reappears). When--recoveris requested but the blocking stack is in a non-recoverable state, this is logged explicitly so the flag never silently does nothing.src/cli/errors.ts:isEarlyValidationError()andisReviewInProgressError()helpers, wired into thehandleDeploycatch block to upgrade the user-facing error with concrete remediation hints (tryPYTHON_3_13, switch region, or runagentcore deploy --recover). Matchers are scoped to specific CFN phrasings to avoid false positives from event-stream messages that mention statuses in passing.Tests
stack-status.test.ts: REVIEW_IN_PROGRESS now exercises theisRecoverableReviewpath.stack-cleanup.test.ts(new, 13 cases): happy path, wrong status, missing stack, executed-changeset guard, paginated change sets, DELETE_FAILED, empty-summaries warn-and-proceed, AVAILABLE/EXECUTE_FAILED accepted, ValidationException + arbitrary "does not exist" surface variations, timeout, non-ValidationError propagation, onProgress heartbeats.errors.test.ts: new cases forisEarlyValidationError(PropertyValidation, broaderEarlyValidation::*namespace, hook(s)/validation framing) andisReviewInProgressError(specific phrasings + negative case for substring trap).validate/__tests__/action.test.ts: warning emission, single-aggregated warning for multiple agents, no-runtimes case, MCP tool detection, gateway Lambda target detection, case-insensitive region comparison.Related Issue
Closes #907
Documentation PR
N/A — no doc changes are required; the new
--recoverflag is documented inline via Commander's help text and the validate-time warning is self-describing.Type of Change
Testing
How have you tested the change?
npm run test:unitandnpm run test:integ(targeted runs locally; CI will execute the full suite)npm run typechecknpm run lintsrc/assets/, I rannpm run test:update-snapshotsand committed the updated snapshotsTargeted test runs that all pass locally:
src/cli/cloudformation/__tests__/stack-status.test.tssrc/cli/cloudformation/__tests__/stack-cleanup.test.tssrc/cli/__tests__/errors.test.tssrc/cli/commands/validate/__tests__/action.test.tssrc/cli/operations/deploy/__tests__/preflight-utils.test.tsChecklist
Review history
This PR went through five rounds of code review with all findings addressed:
isRuntimeAvailableInRegion, wiredisEarlyValidationErrorinto deploy catch, replaced dynamic import with static, fixed change-set status sets, deduplicated regions, etc.collectRuntimeRegionWarningsto MCP runtime tools and gateway targets, cross-repo consistency note, broader EarlyValidation matcher, wiredisReviewInProgressError, downgraded empty-summaries to warn-and-proceed, do/while polling.onProgressheartbeat, multi-stack recovery loop, explicit "skipping recovery" log when not auto-recoverable, case-insensitive region matching.isReviewInProgressErrormatcher, replaced counter withSet<string>for recovery loop.By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the
terms of your choice.