Skip to content

OCPBUGS-86690: Fix Azure cluster deletion hang when resource groups are already deleted#8682

Open
vsolanki12 wants to merge 1 commit into
openshift:mainfrom
vsolanki12:OCPBUGS-86690-azure-orphaned-machine-deletion
Open

OCPBUGS-86690: Fix Azure cluster deletion hang when resource groups are already deleted#8682
vsolanki12 wants to merge 1 commit into
openshift:mainfrom
vsolanki12:OCPBUGS-86690-azure-orphaned-machine-deletion

Conversation

@vsolanki12
Copy link
Copy Markdown
Contributor

@vsolanki12 vsolanki12 commented Jun 5, 2026

What this PR does / why we need it:

When Azure resource groups are deleted before the HostedCluster (e.g. manually or by another process), cluster deletion hangs indefinitely. Two issues cause this:

  1. CAPI AzureMachine finalizers block deletion cascade: The Azure CAPI provider controller cannot remove finalizers because the backing Azure resources no longer exist. This causes the CAPI Cluster object (and thus the HostedCluster) to wait forever for machine cleanup.

  2. CLI destroy command hard-fails on missing resource group: The hypershift destroy cluster azure command validates the resource group exists before proceeding. When the RG is already deleted, this pre-validation returns a hard error, blocking the entire destroy flow.

Fix

  • Operator: Implement the OrphanDeleter interface for Azure. During HC deletion, detect AzureMachines stuck in deletion for longer than 5 minutes and remove only the Azure-specific finalizer (azuremachine.infrastructure.cluster.x-k8s.io) using controllerutil.RemoveFinalizer, preserving any finalizers from other controllers.

  • CLI: Change the resource group pre-validation to detect 404 responses, log a warning, and continue with deletion instead of returning a hard error. Non-404 errors (e.g. auth failures) still fail.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-86690

Special notes for your reviewer:

  • The Azure implementation follows the same pattern as the existing AWS DeleteOrphanedMachines (aws.go:380-402), but uses a time-based heuristic (5 min stuck in deletion) instead of credential status checking, since Azure doesn't expose an equivalent credential validity signal.
  • Unlike the AWS implementation which clears all finalizers (Finalizers = []string{}), this implementation only removes the Azure-specific finalizer using controllerutil.RemoveFinalizer to avoid interfering with other controllers. This improvement was identified during CodeRabbit review.
  • Tested on a live Azure HCP cluster (4.22.0) with simulated stuck AzureMachines. Operator logs confirm both machines had their Azure finalizers removed and the deletion cascade proceeded without hanging.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features

    • Automatic cleanup of orphaned Azure machines stuck in deleting state after a short threshold.
  • Bug Fixes

    • Azure cluster destruction now gracefully proceeds if the specified resource group is missing.
  • Tests

    • Added unit tests covering orphaned-machine cleanup behavior.
  • Refactor

    • Compile-time assertion added to ensure the Azure platform implements orphan-deletion capability.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 5, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 5, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jun 5, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@vsolanki12: This pull request references Jira Issue OCPBUGS-86690, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

What this PR does / why we need it:

When Azure resource groups are deleted before the HostedCluster (e.g. manually or by another process), cluster deletion hangs indefinitely. Two issues cause this:

  1. CAPI AzureMachine finalizers block deletion cascade: The Azure CAPI provider controller cannot remove finalizers because the backing Azure resources no longer exist. This causes the CAPI Cluster object (and thus the HostedCluster) to wait forever for machine cleanup.

  2. CLI destroy command hard-fails on missing resource group: The hypershift destroy cluster azure command validates the resource group exists before proceeding. When the RG is already deleted, this pre-validation returns a hard error, blocking the entire destroy flow.

Fix

  • Operator: Implement the OrphanDeleter interface for Azure. During HC deletion, detect AzureMachines stuck in deletion for longer than 5 minutes and remove only the Azure-specific finalizer (azuremachine.infrastructure.cluster.x-k8s.io) using controllerutil.RemoveFinalizer, preserving any finalizers from other controllers.

  • CLI: Change the resource group pre-validation to detect 404 responses, log a warning, and continue with deletion instead of returning a hard error. Non-404 errors (e.g. auth failures) still fail.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-86690

Special notes for your reviewer:

  • The Azure implementation follows the same pattern as the existing AWS DeleteOrphanedMachines (aws.go:380-402), but uses a time-based heuristic (5 min stuck in deletion) instead of credential status checking, since Azure doesn't expose an equivalent credential validity signal.
  • Unlike the AWS implementation which clears all finalizers (Finalizers = []string{}), this implementation only removes the Azure-specific finalizer using controllerutil.RemoveFinalizer to avoid interfering with other controllers. This improvement was identified during CodeRabbit review.
  • Tested on a live Azure HCP cluster (4.22.0) with simulated stuck AzureMachines. Operator logs confirm both machines had their Azure finalizers removed and the deletion cascade proceeded without hanging.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: efd7cbcb-b39a-43dd-a2a9-0c264e73b604

📥 Commits

Reviewing files that changed from the base of the PR and between 79425af and ba7737f.

📒 Files selected for processing (4)
  • cmd/cluster/azure/destroy.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/azure/azure.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/azure/azure_test.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/platform.go
✅ Files skipped from review due to trivial changes (1)
  • hypershift-operator/controllers/hostedcluster/internal/platform/platform.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • cmd/cluster/azure/destroy.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/azure/azure.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/azure/azure_test.go

📝 Walkthrough

Walkthrough

This PR adds Azure cluster cleanup enhancements across two areas. First, it introduces a new DeleteOrphanedMachines method to the Azure platform provider that cleans up machines stuck in deletion by removing their finalizers after a 5-minute threshold; the implementation is accompanied by comprehensive table-driven tests and integrated into the platform provider interface. Second, it improves the cluster destroy flow by gracefully handling missing Azure resource groups—when a resource group lookup returns a 404, the destroy operation now logs and continues instead of failing.

Sequence Diagram(s)

sequenceDiagram
  participant Controller
  participant KubeAPI as Kubernetes API (client)
  participant AzureMachine
  Controller->>KubeAPI: List AzureMachine in controlPlaneNamespace
  KubeAPI-->>Controller: AzureMachine list
  Controller->>AzureMachine: check DeletionTimestamp and age > 5m
  alt older than threshold
    Controller->>KubeAPI: MergePatch remove capiazure.MachineFinalizer
    KubeAPI-->>Controller: Patch result (success/error)
  else skip
    Controller-->>Controller: no-op
  end
  Controller-->>Controller: aggregate patch errors and return
Loading

Suggested reviewers

  • muraee
  • jparrill
  • sjenning
🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Test Structure And Quality ❓ Inconclusive The check specifies "Ginkgo test code" but the added test uses standard Go testing with Gomega assertions, not Ginkgo patterns (no Describe/It/BeforeEach blocks). Clarify if check applies to standard Go tests or Ginkgo-only. For standard Go tests, the test is well-structured with single-responsibility subtests, good assertions messages, and codebase-consistent patterns.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly summarizes the main change: fixing Azure cluster deletion hang when resource groups are already deleted, which matches the core problem and solution described in the PR objectives.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed Tests use standard Go testing with t.Run() not Ginkgo. All test names are stable with no dynamic values like timestamps, UUIDs, or random suffixes.
Topology-Aware Scheduling Compatibility ✅ Passed PR introduces only controller logic for Azure machine cleanup with no deployment manifests, affinity rules, topology constraints, or scheduling assumptions incompatible with OpenShift topologies.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests added; only standard Go unit test (TestDeleteOrphanedMachines) with fake client, no IPv4 assumptions, and no external connectivity requirements.
No-Weak-Crypto ✅ Passed No weak cryptography detected. The PR modifies Azure cluster deletion and orphan machine cleanup logic using standard libraries and Kubernetes controller utilities with no crypto-related code.
Container-Privileges ✅ Passed All four modified files are Go source code; no K8s manifests with privileged container configs (privileged, hostPID/Network/IPC, SYS_ADMIN, allowPrivilegeEscalation) were introduced.
No-Sensitive-Data-In-Logs ✅ Passed New logging statements log only resource group names, machine identifiers, and timestamps—non-sensitive operational data. No credentials, tokens, API keys, PII, or passwords are exposed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/cli Indicates the PR includes changes for CLI area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/azure PR/issue for Azure (AzurePlatform) platform and removed do-not-merge/needs-area labels Jun 5, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vsolanki12
Once this PR has been reviewed and has the lgtm label, please assign devguyio for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link
Copy Markdown

@vsolanki12: This pull request references Jira Issue OCPBUGS-86690, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

What this PR does / why we need it:

When Azure resource groups are deleted before the HostedCluster (e.g. manually or by another process), cluster deletion hangs indefinitely. Two issues cause this:

  1. CAPI AzureMachine finalizers block deletion cascade: The Azure CAPI provider controller cannot remove finalizers because the backing Azure resources no longer exist. This causes the CAPI Cluster object (and thus the HostedCluster) to wait forever for machine cleanup.

  2. CLI destroy command hard-fails on missing resource group: The hypershift destroy cluster azure command validates the resource group exists before proceeding. When the RG is already deleted, this pre-validation returns a hard error, blocking the entire destroy flow.

Fix

  • Operator: Implement the OrphanDeleter interface for Azure. During HC deletion, detect AzureMachines stuck in deletion for longer than 5 minutes and remove only the Azure-specific finalizer (azuremachine.infrastructure.cluster.x-k8s.io) using controllerutil.RemoveFinalizer, preserving any finalizers from other controllers.

  • CLI: Change the resource group pre-validation to detect 404 responses, log a warning, and continue with deletion instead of returning a hard error. Non-404 errors (e.g. auth failures) still fail.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-86690

Special notes for your reviewer:

  • The Azure implementation follows the same pattern as the existing AWS DeleteOrphanedMachines (aws.go:380-402), but uses a time-based heuristic (5 min stuck in deletion) instead of credential status checking, since Azure doesn't expose an equivalent credential validity signal.
  • Unlike the AWS implementation which clears all finalizers (Finalizers = []string{}), this implementation only removes the Azure-specific finalizer using controllerutil.RemoveFinalizer to avoid interfering with other controllers. This improvement was identified during CodeRabbit review.
  • Tested on a live Azure HCP cluster (4.22.0) with simulated stuck AzureMachines. Operator logs confirm both machines had their Azure finalizers removed and the deletion cascade proceeded without hanging.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features

  • Added automatic cleanup of orphaned Azure machines stuck in deleting state for extended periods.

  • Bug Fixes

  • Improved Azure cluster destruction to gracefully handle missing resource groups.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 5, 2026

Codecov Report

❌ Patch coverage is 53.84615% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.43%. Comparing base (25817d4) to head (ba7737f).
⚠️ Report is 13 commits behind head on main.

Files with missing lines Patch % Lines
cmd/cluster/azure/destroy.go 0.00% 6 Missing ⚠️
...ers/hostedcluster/internal/platform/azure/azure.go 70.00% 4 Missing and 2 partials ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8682   +/-   ##
=======================================
  Coverage   41.43%   41.43%           
=======================================
  Files         756      756           
  Lines       93658    93671   +13     
=======================================
+ Hits        38807    38816    +9     
- Misses      52128    52132    +4     
  Partials     2723     2723           
Files with missing lines Coverage Δ
...ollers/hostedcluster/internal/platform/platform.go 0.00% <ø> (ø)
cmd/cluster/azure/destroy.go 17.18% <0.00%> (-0.56%) ⬇️
...ers/hostedcluster/internal/platform/azure/azure.go 31.89% <70.00%> (+2.03%) ⬆️

... and 1 file with indirect coverage changes

Flag Coverage Δ
cmd-support 34.87% <0.00%> (-0.01%) ⬇️
cpo-hostedcontrolplane 43.50% <ø> (ø)
cpo-other 42.74% <ø> (ø)
hypershift-operator 51.58% <70.00%> (+0.01%) ⬆️
other 31.64% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…s are already deleted

When Azure resource groups are deleted before the HostedCluster, two
things block cleanup: CAPI AzureMachine finalizers can never be removed
because the provider controller cannot reach the deleted resources, and
the CLI destroy command hard-fails on the resource group pre-validation
check.

Implement the OrphanDeleter interface for Azure by detecting
AzureMachines stuck in deletion for longer than 5 minutes and
force-removing their finalizers, allowing the CAPI deletion cascade to
proceed. Also change the destroy command's RG pre-validation to log a
warning and continue on 404 instead of failing.

Signed-off-by: Vimal Solanki <vsolanki@redhat.com>
@vsolanki12 vsolanki12 force-pushed the OCPBUGS-86690-azure-orphaned-machine-deletion branch from 79425af to ba7737f Compare June 5, 2026 11:47
@vsolanki12
Copy link
Copy Markdown
Contributor Author

I have tried to reproduce and tested the fix in my test cluster as below:

Environment: Azure HCP cluster vsolanki-86690, CP version 4.22.0

Custom hypershift-operator image deployed:

    $ oc get pods -n hypershift -l name=operator \
        -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.containers[0].image}{"\n"}{end}'
    operator-5c78c99-b5m4v: quay.io/vsolanki/hypershift-operator:fix-86690
    operator-5c78c99-srdl6: quay.io/vsolanki/hypershift-operator:fix-86690

Test scenario: Created two AzureMachines with azuremachine.infrastructure.cluster.x-k8s.io finalizer in the control plane namespace, then deleted them. Without a working Azure CAPI provider controller, they remain stuck in deletion indefinitely simulating the exact bug scenario where resource groups are already deleted.

Stuck AzureMachines (before HC deletion):

    $ oc get azuremachines -n clusters-vsolanki-86690 \
        -o jsonpath='{range .items[*]}{.metadata.name}: deletionTimestamp={.metadata.deletionTimestamp}, finalizers={.metadata.finalizers}{"\n"}{end}'
    test-azure-machine-1: deletionTimestamp=2026-06-05T10:39:26Z, finalizers=["azuremachine.infrastructure.cluster.x-k8s.io"]
    test-azure-machine-2: deletionTimestamp=2026-06-05T10:39:26Z, finalizers=["azuremachine.infrastructure.cluster.x-k8s.io"]

After fix triggered HC deletion, AzureMachines cleaned up:

    $ oc delete hc vsolanki-86690 -n clusters
    hostedcluster.hypershift.openshift.io "vsolanki-86690" deleted
    
    $ oc get azuremachines -n clusters-vsolanki-86690
    No resources found in clusters-vsolanki-86690 namespace.

Operator logs confirm finalizer removal:

    {"level":"info","ts":"2026-06-05T11:16:38Z",
     "msg":"Removed finalizers from orphaned AzureMachine stuck in deletion",
     "controller":"hostedcluster",
     "machine":{"name":"test-azure-machine-1","namespace":"clusters-vsolanki-86690"},
     "deletionTimestamp":"2026-06-05T10:39:26Z"}

    {"level":"info","ts":"2026-06-05T11:16:38Z",
     "msg":"Removed finalizers from orphaned AzureMachine stuck in deletion",
     "controller":"hostedcluster",
     "machine":{"name":"test-azure-machine-2","namespace":"clusters-vsolanki-86690"},
     "deletionTimestamp":"2026-06-05T10:39:26Z"}

Both AzureMachines had their finalizers removed (stuck for 37 minutes, well past the 5-minute threshold). HC deletion cascade proceeded without hanging.

@vsolanki12 vsolanki12 marked this pull request as ready for review June 5, 2026 12:00
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 5, 2026
@openshift-ci openshift-ci Bot requested review from Nirshal and enxebre June 5, 2026 12:01
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 5, 2026

@vsolanki12: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cli Indicates the PR includes changes for CLI area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/azure PR/issue for Azure (AzurePlatform) platform jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants