Skip to content

fix(deploy): handle PYTHON_3_14 region mismatch and recover from REVIEW_IN_PROGRESS stacks#1157

Closed
aidandaly24 wants to merge 6 commits into
mainfrom
fix/907-cc0e034e
Closed

fix(deploy): handle PYTHON_3_14 region mismatch and recover from REVIEW_IN_PROGRESS stacks#1157
aidandaly24 wants to merge 6 commits into
mainfrom
fix/907-cc0e034e

Conversation

@aidandaly24

Copy link
Copy Markdown
Contributor

Description

Fixes #907PYTHON_3_14 was accepted by agentcore validate but rejected by CloudFormation early-validation in regions where it is not yet GA, leaving the stack permanently stuck in REVIEW_IN_PROGRESS and requiring manual deletion before any further deploys could proceed.

This PR addresses the bug on three fronts:

1. Stop new projects from tripping the bug

  • src/schema/constants.ts: DEFAULT_PYTHON_VERSION lowered from 'PYTHON_3_14''PYTHON_3_13' (PYTHON_3_13 is GA in every AgentCore region).
  • src/cli/tui/screens/mcp/types.ts: reordered the runtime picker so PYTHON_3_13 is first / Recommended; PYTHON_3_14 is still selectable but labelled Preview – not available in all regions.
  • Sibling repo aws/agentcore-l3-cdk-constructs (branch fix/907-cc0e034e) lowers DEFAULT_MCP_PYTHON_VERSION and the lambda.Runtime fallback in McpLambdaCompute.ts to match.

2. Catch the mismatch at validate time

  • src/schema/constants.ts: new PYTHON_3_14_SUPPORTED_REGIONS = ['us-east-1', 'us-west-2'] constant with a cross-repo consistency note.
  • src/cli/commands/validate/{action,command}.tsx: agentcore validate now emits a non-fatal yellow warning when any agent, MCP runtime tool, or gateway Lambda/AgentCoreRuntime target uses PYTHON_3_14 against an unsupported region. Region comparison is case-insensitive; offending components are listed individually (agent "x", MCP tool "y", gateway "g" target "t").

3. Recover from the stuck-stack state

  • src/cli/cloudformation/stack-status.ts: split REVIEW_IN_PROGRESS out of the generic in-progress bucket and tag it isRecoverableReview: true with an actionable message.
  • src/cli/cloudformation/stack-cleanup.ts (new): recoverReviewInProgressStack() helper that:
    • Confirms the stack is in REVIEW_IN_PROGRESS.
    • Lists all change sets (paginated) and refuses to delete if any has been executed (Status=FAILED/OBSOLETE or ExecutionStatus in {UNAVAILABLE, AVAILABLE, EXECUTE_FAILED, OBSOLETE} are recoverable).
    • If no change sets remain (CFN auto-purge), emits a warning via onWarning and proceeds since REVIEW_IN_PROGRESS itself is authoritative evidence of zero resources.
    • Issues DeleteStackCommand and polls DescribeStacks until the stack disappears, with an onProgress heartbeat callback so the deploy step doesn't appear hung.
    • Stack-not-found detection is robust to AWS SDK v3 surface variations (ValidationError, ValidationException, or any error message containing does not exist).
  • src/cli/commands/deploy/{command.tsx,types.ts,actions.ts} + src/cli/operations/deploy/preflight.ts: new --recover flag wired through. When set, the preflight drains all REVIEW_IN_PROGRESS stacks one at a time (tracked by a Set<string> so the loop exits if the same stack reappears). When --recover is requested but the blocking stack is in a non-recoverable state, this is logged explicitly so the flag never silently does nothing.
  • src/cli/errors.ts: isEarlyValidationError() and isReviewInProgressError() helpers, wired into the handleDeploy catch block to upgrade the user-facing error with concrete remediation hints (try PYTHON_3_13, switch region, or run agentcore deploy --recover). Matchers are scoped to specific CFN phrasings to avoid false positives from event-stream messages that mention statuses in passing.

Tests

  • stack-status.test.ts: REVIEW_IN_PROGRESS now exercises the isRecoverableReview path.
  • stack-cleanup.test.ts (new, 13 cases): happy path, wrong status, missing stack, executed-changeset guard, paginated change sets, DELETE_FAILED, empty-summaries warn-and-proceed, AVAILABLE/EXECUTE_FAILED accepted, ValidationException + arbitrary "does not exist" surface variations, timeout, non-ValidationError propagation, onProgress heartbeats.
  • errors.test.ts: new cases for isEarlyValidationError (PropertyValidation, broader EarlyValidation::* namespace, hook(s)/validation framing) and isReviewInProgressError (specific phrasings + negative case for substring trap).
  • validate/__tests__/action.test.ts: warning emission, single-aggregated warning for multiple agents, no-runtimes case, MCP tool detection, gateway Lambda target detection, case-insensitive region comparison.

Related Issue

Closes #907

Documentation PR

N/A — no doc changes are required; the new --recover flag is documented inline via Commander's help text and the validate-time warning is self-describing.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Other (please describe):

Testing

How have you tested the change?

  • I ran npm run test:unit and npm run test:integ (targeted runs locally; CI will execute the full suite)
  • I ran npm run typecheck
  • I ran npm run lint
  • If I modified src/assets/, I ran npm run test:update-snapshots and committed the updated snapshots

Targeted test runs that all pass locally:

  • src/cli/cloudformation/__tests__/stack-status.test.ts
  • src/cli/cloudformation/__tests__/stack-cleanup.test.ts
  • src/cli/__tests__/errors.test.ts
  • src/cli/commands/validate/__tests__/action.test.ts
  • src/cli/operations/deploy/__tests__/preflight-utils.test.ts

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Review history

This PR went through five rounds of code review with all findings addressed:

  • Round 1 (11 findings): warnings rendered to user in validate command, removed dead isRuntimeAvailableInRegion, wired isEarlyValidationError into deploy catch, replaced dynamic import with static, fixed change-set status sets, deduplicated regions, etc.
  • Round 2 (11 findings): extended collectRuntimeRegionWarnings to MCP runtime tools and gateway targets, cross-repo consistency note, broader EarlyValidation matcher, wired isReviewInProgressError, downgraded empty-summaries to warn-and-proceed, do/while polling.
  • Round 3 (11 findings): broadened SDK v3 stack-not-found detection, added onProgress heartbeat, multi-stack recovery loop, explicit "skipping recovery" log when not auto-recoverable, case-insensitive region matching.
  • Round 4 (5 findings): tightened comments, dropped redundant guard, scoped isReviewInProgressError matcher, replaced counter with Set<string> for recovery loop.
  • Round 5 (7 findings): all approved.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the
terms of your choice.

agentcore-bot added 6 commits May 7, 2026 16:51
- validate command: render warnings (yellow) before printing 'Valid'
  so PYTHON_3_14/region mismatch is actually surfaced to users
- schema/constants.ts: drop unused isRuntimeAvailableInRegion helper;
  callers compare PYTHON_3_14_SUPPORTED_REGIONS directly
- deploy/actions.ts: drop dynamic import of recoverReviewInProgressStack
  and add it to the static cloudformation import block; wire
  isEarlyValidationError into the deploy catch handler to upgrade the
  user-facing message and recommend --recover / PYTHON_3_13
- stack-cleanup.ts: split status-vs-execution-status sets; treat
  Status=CREATE_COMPLETE + ExecutionStatus in {AVAILABLE, EXECUTE_FAILED,
  UNAVAILABLE, OBSOLETE} as recoverable; refuse to delete when
  ListChangeSets returns zero summaries (we lose the safety evidence)
- stack-status.ts: clarify RECOVERABLE_REVIEW_STATUSES comment — flag
  advertises eligibility, real safety check lives in stack-cleanup
- validate/action.ts: deduplicate unsupportedRegions before joining
- stack-cleanup tests: add cases for empty list refusal, AVAILABLE +
  CREATE_COMPLETE, and EXECUTE_FAILED
- collectRuntimeRegionWarnings: extend to MCP runtime tools and gateway
  Lambda/AgentCoreRuntime targets so PYTHON_3_14 warnings cover the
  full MCP surface (was: agents only)
- schema/constants.ts: add cross-repo consistency note pointing at the
  sibling agentcore-l3-cdk-constructs constant
- errors.ts: relax isEarlyValidationError to match any
  AWS::EarlyValidation::* hook subtype (was: PropertyValidation only,
  with redundant alternatives)
- deploy/actions.ts: wire isReviewInProgressError into the deploy catch
  block to surface a friendly --recover hint when a retried deploy hits
  a stuck stack
- stack-cleanup.ts: empty-summaries case now logs a warning via new
  onWarning callback and proceeds (REVIEW_IN_PROGRESS is itself
  authoritative evidence the stack has no resources). Also use do/while
  for the post-DeleteStack poll so timeoutMs=0 still gets one iteration
- tests: add coverage for multiple PYTHON_3_14 agents, no-runtimes case,
  MCP tool + gateway target warnings, broader EarlyValidation hooks,
  recovery timeout, non-ValidationError propagation, empty-summaries
  warn-and-proceed
- stack-cleanup.ts: broaden stack-not-found detection to also match
  ValidationException and any error message containing 'does not
  exist' (AWS SDK v3 surface variations could otherwise cause us to
  poll until timeout even after a successful delete)
- stack-cleanup.ts: add onProgress callback emitting heartbeats with
  the current StackStatus and elapsed time for each poll iteration so
  users don't perceive the recovery flow as hung
- deploy/actions.ts: drain *all* stacks stuck in REVIEW_IN_PROGRESS
  in a recovery loop bounded by stackNames.length, so multi-stack
  deploys don't force the user to invoke --recover repeatedly
- deploy/actions.ts: log an explicit warning when --recover is
  requested but the blocking stack is not auto-recoverable, so the
  flag never appears to silently do nothing
- deploy/actions.ts: pipe the new onWarning + onProgress callbacks
  from recoverReviewInProgressStack through to the deploy logger
- deploy/command.tsx: document that --recover should be combined
  with --yes when bootstrap may also be needed
- validate/action.ts: case-insensitive region comparison so
  non-canonical casing in aws-targets.json doesn't trigger a false
  PYTHON_3_14 warning
- tests: add coverage for ValidationException, 'does not exist'
  message variants, onProgress heartbeats, and case-insensitive
  region matching

Notes on round-3 findings already addressed in earlier rounds:
- collectRuntimeRegionWarnings inspects mcpRuntimeTools (round 2)
- isReviewInProgressError wired into deploy catch (round 2)
- empty-summaries case warns and proceeds (round 2)
- stack-cleanup.ts: tighten do/while comment to accurately describe
  the synchronous-completion case (was: overstated guarantee about
  timeoutMs=0)
- stack-cleanup.ts: drop redundant 'changeSetSummaries.length === 0 ||'
  guard since Array#every already returns true for empty arrays
- errors.ts: tighten isReviewInProgressError to match specific
  CloudFormation phrasings ('is in REVIEW_IN_PROGRESS', 'currently in
  REVIEW_IN_PROGRESS', 'REVIEW_IN_PROGRESS state') instead of a bare
  token, so passing references in event-stream messages don't trigger
  a misleading --recover hint. Added negative test for substring trap.
- deploy/actions.ts: replace integer counter with a Set<string> of
  recovered stack names; loop exits if the same blocking stack
  reappears (eventual-consistency lag or evolving deployability
  semantics) rather than retrying. Added comment documenting the
  'at most one recovery per stack' contract.

Round 4 HIGH 'unable to perform review' is a reviewer-environment
issue (no git binary in sandbox); no code change required.
@aidandaly24 aidandaly24 requested a review from a team May 7, 2026 17:31
@github-actions github-actions Bot added size/xl PR size: XL agentcore-harness-reviewing AgentCore Harness review in progress labels May 7, 2026

@agentcore-cli-automation agentcore-cli-automation left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough work on this — the three-pronged approach (new-project default, validate-time warning, recover-from-stuck-state) is well factored and the test coverage is solid. The stack-cleanup helper in particular is careful about the right things (paginated ListChangeSets, non-executed change-set guard, warn-and-proceed on empty summaries, onProgress heartbeat, robust stack-not-found detection, do/while polling for the synchronous-delete case).

I have one serious concern that I think needs to be resolved before this can merge, detailed in the inline comments. In short: the sibling PR in aws/agentcore-l3-cdk-constructs that lowers DEFAULT_MCP_PYTHON_VERSION isn't on main yet, and because the CLI bundles the L3 constructs at build time (scripts/bundle.mjs), the shipped artifact will still default MCP Lambda / AgentCoreRuntime compute to PYTHON_3_14 whenever pythonVersion is unset in the user's config. The PR checklist item "Any dependent changes have been merged and published" is therefore not yet accurate, and the validate-time warning as written won't catch this case (it only flags an explicit 'PYTHON_3_14', not a missing/defaulted value), so the user gets a clean agentcore validate followed by a stuck-stack deploy — the exact regression #907 is trying to prevent.

Any of the following resolves it:

  1. Land the L3 constructs PR first, bump that package version, update this CLI to depend on the new version, and re-run npm run bundle to confirm the shipped constructs carry PYTHON_3_13.
  2. Add a commit here that pins/patches the bundled constructs to the forked branch so this PR's shipped artifact is internally coherent.
  3. Extend collectRuntimeRegionWarnings to also flag MCP Lambda and AgentCoreRuntime compute entries that have no pythonVersion set when the target region is unsupported (since the L3 default would silently pick PYTHON_3_14). This makes the CLI safe by itself even if the L3 fix slips.

}

for (const tool of projectSpec.mcpRuntimeTools ?? []) {
if (getPythonVersionFromCompute(tool.compute) === 'PYTHON_3_14') {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The === 'PYTHON_3_14' check only fires when pythonVersion is explicitly set to 'PYTHON_3_14'. When compute.pythonVersion (Lambda) or compute.runtime.pythonVersion (AgentCoreRuntime) is simply absent, getPythonVersionFromCompute returns undefined and this branch is skipped.

That's a problem given how the bundled CDK constructs behave today: in aws/agentcore-l3-cdk-constructs on main, McpLambdaCompute.ts has const pythonVersion = compute.pythonVersion ?? DEFAULT_MCP_PYTHON_VERSION; where DEFAULT_MCP_PYTHON_VERSION = 'PYTHON_3_14'. So a config with an MCP Lambda compute that omits pythonVersion will validate clean in eu-central-1, then synth a PYTHON_3_14 Lambda, then fail CloudFormation early-validation — the exact loop #907 is trying to stop.

This is tied to the main review thread (the sibling L3 constructs PR isn't on main yet). If you choose option 1 or 2 from the review body (bump the L3 dep and let its default become PYTHON_3_13), this is moot. If you choose option 3, please also flag unset-pythonVersion compute entries here. Same comment applies to the gateway-target loop a few lines down (line 152).

Comment thread src/schema/constants.ts
* package, so they can drift; consider extracting a shared package if a third
* region becomes supported. See https://github.com/aws/agentcore-cli/issues/907.
*/
export const PYTHON_3_14_SUPPORTED_REGIONS: readonly string[] = ['us-east-1', 'us-west-2'] as const;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cross-repo-consistency note here is great, but it flags the wrong symbol in the sibling repo for this PR's release story. The comment asks future maintainers to update DEFAULT_MCP_PYTHON_VERSION "when the list changes," but right now the sibling repo's main still has DEFAULT_MCP_PYTHON_VERSION = 'PYTHON_3_14' and lambda.Runtime.PYTHON_3_14 as the fallback in McpLambdaCompute.ts. Until that lands and a new L3-constructs version is published (and this CLI pins it + re-bundles), the CLI default here (PYTHON_3_13) and the bundled-CDK default (PYTHON_3_14) disagree. For MCP Lambda compute without an explicit pythonVersion, the CDK side wins and the original bug persists.

Could you either (a) wait for the L3 PR to land and bump the dependency here, or (b) note in the PR description that this CLI PR must be merged strictly after the L3 PR + a version bump? The "Any dependent changes have been merged and published" checkbox in the PR description is currently checked but the dependent change isn't merged.

logger.log(
`--recover requested, but stack "${deployabilityCheck.blockingStack}" is not in an auto-recoverable state. Skipping recovery.`,
'warn'
);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor UX nit (not a blocker): if a stack reappears as REVIEW_IN_PROGRESS after recovery (e.g. CloudFormation eventual-consistency lag), the while loop exits via the recoveredStacks Set guard, and here isRecoverableReview is still true, so this warning block is skipped. We then fall through to the generic deployabilityCheck.message which points the user at agentcore deploy --recover — the thing they just ran — which is confusing.

Probably worth an extra branch that detects "we already recovered this stack this run" and emits a clearer message ("stack X was just recovered but is still reported as REVIEW_IN_PROGRESS; this is usually CFN eventual consistency, retry in a few seconds"). Not a merge blocker, but the current output is a dead-end loop from the user's perspective.

Comment thread src/cli/errors.ts
const message = getErrorMessage(err).toLowerCase();
if (message.includes('is in review_in_progress')) return true;
if (message.includes('currently in review_in_progress')) return true;
if (message.includes('review_in_progress state')) return true;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor: isReviewInProgressError matches on three substrings, one of which is review_in_progress state (no word boundary). That's strong enough to avoid the obvious false positives (you have a test for the event-stream history case), but it would also match something like "... completed review_in_progress state handoff ..." in a future CFN phrasing change. Consider \bis (currently )?in REVIEW_IN_PROGRESS\b as a regex to be explicit. Not a blocker.

@github-actions github-actions Bot removed the agentcore-harness-reviewing AgentCore Harness review in progress label May 7, 2026
@aidandaly24

Copy link
Copy Markdown
Contributor Author

Closing: test run from batch agent

@aidandaly24 aidandaly24 closed this May 7, 2026
@aidandaly24 aidandaly24 deleted the fix/907-cc0e034e branch May 13, 2026 14:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/xl PR size: XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PYTHON_3_14 runtime accepted by CLI but rejected by CloudFormation

2 participants