Resume a paused Redshift cluster before deleting it #68922

Open
seanghaeli wants to merge 1 commit into apache:main from aws-mwaa:feature/redshift-resume-before-delete

@seanghaeli

Contributor

Why

A paused Redshift cluster cannot be deleted. delete_cluster raises:

InvalidClusterStateFault: There is an operation running on the Cluster. Please try to delete it at a later time.

RedshiftDeleteClusterOperator already retries this error, but the retry cannot recover a paused cluster — a paused cluster never leaves that state on its own, so every attempt hits the same fault. The retries exhaust and the cluster is left behind, silently leaked until external cleanup reaps it.
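The failure mode can be reproduced with a toy model. This is only a sketch: `FakePausedCluster`, `delete_with_retries`, and the local `InvalidClusterStateFault` class are hypothetical stand-ins, not the operator's real retry loop or the botocore exception.

```python
class InvalidClusterStateFault(Exception):
    """Local stand-in for the AWS error of the same name (not the botocore class)."""


class FakePausedCluster:
    """Toy model: a paused cluster rejects delete and never changes state by itself."""

    def __init__(self):
        self.status = "paused"
        self.delete_attempts = 0

    def delete_cluster(self):
        self.delete_attempts += 1
        if self.status != "available":
            raise InvalidClusterStateFault("There is an operation running on the Cluster.")
        self.status = "deleting"


def delete_with_retries(cluster, max_attempts=3, sleep=lambda seconds: None):
    """Retry the delete on InvalidClusterStateFault; return True if it was accepted."""
    for _ in range(max_attempts):
        try:
            cluster.delete_cluster()
            return True
        except InvalidClusterStateFault:
            # Waiting does not help here: nothing moves the cluster out of "paused".
            sleep(1)
    return False


cluster = FakePausedCluster()
deleted = delete_with_retries(cluster)
print(deleted, cluster.delete_attempts, cluster.status)  # False 3 paused
```

Every attempt sees the same state, so raising the retry count or interval only delays the failure.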

This was observed in practice with the example_redshift system test: a cluster left in the paused phase (e.g. after an upstream task failed before resume_cluster) was never deleted by the delete_cluster teardown task, and accumulated as a stale resource.

What

Add _resume_if_paused() and call it at the start of execute(): if the cluster is paused, resume it and wait for available before issuing the delete. Clusters in any other state are unaffected (early return), and the existing busy-retry loop for transient InvalidClusterStateFault during deletion is unchanged.
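For illustration, a minimal sketch of the resume-then-wait step, written against a boto3-shaped Redshift client rather than the provider's hook. The function name `resume_if_paused`, its parameters, and the stub classes are assumptions; the actual `_resume_if_paused()` in this PR may be structured differently. `describe_clusters`, `resume_cluster`, and the `cluster_available` waiter are existing boto3 Redshift client calls.

```python
def resume_if_paused(client, cluster_identifier, poll_interval=30, max_attempts=30):
    """Resume a paused cluster and block until it is available.

    Returns True if a resume was issued, False if the cluster was not paused.
    """
    clusters = client.describe_clusters(ClusterIdentifier=cluster_identifier)["Clusters"]
    status = clusters[0]["ClusterStatus"] if clusters else None
    if status != "paused":
        return False  # early return: any other state is left untouched

    client.resume_cluster(ClusterIdentifier=cluster_identifier)
    client.get_waiter("cluster_available").wait(
        ClusterIdentifier=cluster_identifier,
        WaiterConfig={"Delay": poll_interval, "MaxAttempts": max_attempts},
    )
    return True


class StubWaiter:
    """Hypothetical stand-in for a boto3 waiter; flips the stub cluster to available."""

    def __init__(self, client):
        self._client = client

    def wait(self, **kwargs):
        self._client.calls.append(("wait", kwargs["ClusterIdentifier"]))
        self._client.status = "available"


class StubRedshiftClient:
    """Hypothetical in-memory client exposing only the calls the helper uses."""

    def __init__(self, status):
        self.status = status
        self.calls = []

    def describe_clusters(self, ClusterIdentifier):
        return {"Clusters": [{"ClusterIdentifier": ClusterIdentifier, "ClusterStatus": self.status}]}

    def resume_cluster(self, ClusterIdentifier):
        self.calls.append(("resume_cluster", ClusterIdentifier))
        self.status = "resuming"

    def get_waiter(self, name):
        self.calls.append(("get_waiter", name))
        return StubWaiter(self)


client = StubRedshiftClient("paused")
print(resume_if_paused(client, "demo-cluster"), client.status)  # True available
```

With a real client, the same function would be called with `boto3.client("redshift")` in place of the stub.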

Tests

  • test_delete_paused_cluster_resumes_first — a paused cluster is resumed, waited on (cluster_available), then deleted.
  • test_delete_available_cluster_does_not_resume — a non-paused cluster is deleted directly, with no spurious resume.
  • Existing delete-operator tests (deferrable paths, busy-retry exhaustion) unchanged and passing.

Verified locally in Breeze: all TestDeleteClusterOperator tests pass.
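A rough sketch of the shape the two new tests could take, using `unittest.mock`. The function under test here, `delete_resuming_first`, is a hypothetical stand-in for the operator's `execute()` path and works on a boto3-shaped client; the real tests in this PR exercise the operator itself and may assert differently.

```python
from unittest import mock


def delete_resuming_first(client, cluster_identifier):
    """Hypothetical stand-in for the delete path: resume if paused, then delete."""
    described = client.describe_clusters(ClusterIdentifier=cluster_identifier)
    if described["Clusters"][0]["ClusterStatus"] == "paused":
        client.resume_cluster(ClusterIdentifier=cluster_identifier)
        client.get_waiter("cluster_available").wait(ClusterIdentifier=cluster_identifier)
    client.delete_cluster(ClusterIdentifier=cluster_identifier, SkipFinalClusterSnapshot=True)


def make_client(status):
    """Build a mock client whose describe_clusters reports the given status."""
    client = mock.MagicMock()
    client.describe_clusters.return_value = {"Clusters": [{"ClusterStatus": status}]}
    return client


def test_delete_paused_cluster_resumes_first():
    client = make_client("paused")
    delete_resuming_first(client, "demo")

    client.resume_cluster.assert_called_once_with(ClusterIdentifier="demo")
    client.get_waiter.assert_called_once_with("cluster_available")
    client.get_waiter.return_value.wait.assert_called_once_with(ClusterIdentifier="demo")
    # The resume must be recorded before the delete.
    names = [name for name, _args, _kwargs in client.mock_calls]
    assert names.index("resume_cluster") < names.index("delete_cluster")


def test_delete_available_cluster_does_not_resume():
    client = make_client("available")
    delete_resuming_first(client, "demo")

    client.resume_cluster.assert_not_called()
    client.get_waiter.assert_not_called()
    client.delete_cluster.assert_called_once()


if __name__ == "__main__":
    test_delete_paused_cluster_resumes_first()
    test_delete_available_cluster_does_not_resume()
```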

Generated-by: Claude Code (Opus via Claude Code) on behalf of Sean Ghaeli

A ``paused`` Redshift cluster cannot be deleted: ``delete_cluster`` raises
``InvalidClusterStateFault`` ("There is an operation running on the Cluster"),
and ``RedshiftDeleteClusterOperator``'s retry loop cannot recover because a
paused cluster never leaves that state on its own. The retries exhaust and the
cluster is left behind -- silently leaked until external cleanup reaps it.

Resume the cluster first when it is paused (and wait until it is ``available``)
before issuing the delete. Clusters that are not paused are unaffected.

Generated-by: Claude Code (Opus via Claude Code) on behalf of Sean Ghaeli
@seanghaeli seanghaeli requested a review from o-nikolas as a code owner June 23, 2026 23:24
@boring-cyborg boring-cyborg Bot added area:providers provider:amazon AWS/Amazon - related issues labels Jun 23, 2026
def execute(self, context: Context):
    # A paused cluster cannot be deleted; resume it first (otherwise the retry loop below
    # would exhaust against InvalidClusterStateFault and the cluster would be leaked).
    self._resume_if_paused()

Contributor

Should we do something to keep this transactional? If we resume the cluster and then fail at some point between the resume and the deletion, the cluster is left running when the user thought it was paused (essentially the inverse of the situation we are in now).

If we can't make the operation transactional, we should at least make it opt-in.
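One possible shape for this suggestion, as a sketch only. The flag name `resume_if_paused`, the function `delete_cluster`, and the best-effort re-pause on failure are all assumptions, not anything in this PR. `pause_cluster` is an existing boto3 Redshift client call, but re-pausing is a compensating action rather than a true transaction and may itself fail, for example if the cluster is still resuming.

```python
import logging
from unittest import mock

log = logging.getLogger(__name__)


def delete_cluster(client, cluster_identifier, resume_if_paused=False):
    """Delete a cluster; only resume a paused one when the caller opted in."""
    described = client.describe_clusters(ClusterIdentifier=cluster_identifier)
    status = described["Clusters"][0]["ClusterStatus"]

    resumed = False
    if resume_if_paused and status == "paused":
        client.resume_cluster(ClusterIdentifier=cluster_identifier)
        resumed = True

    try:
        if resumed:
            client.get_waiter("cluster_available").wait(ClusterIdentifier=cluster_identifier)
        client.delete_cluster(ClusterIdentifier=cluster_identifier, SkipFinalClusterSnapshot=True)
    except Exception:
        if resumed:
            # Best effort: try to put the cluster back into the state the user expected.
            try:
                client.pause_cluster(ClusterIdentifier=cluster_identifier)
            except Exception:
                log.warning("Could not re-pause %s after failed delete", cluster_identifier)
        raise


# Demonstration with a mock client: the delete fails after the resume.
client = mock.MagicMock()
client.describe_clusters.return_value = {"Clusters": [{"ClusterStatus": "paused"}]}
client.delete_cluster.side_effect = RuntimeError("simulated failure after resume")
try:
    delete_cluster(client, "demo-cluster", resume_if_paused=True)
except RuntimeError:
    pass
client.pause_cluster.assert_called_once_with(ClusterIdentifier="demo-cluster")
```

With the flag left at its default, a paused cluster is never resumed and the delete proceeds exactly as it does today.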
