OCPBUGS-86690: Fix Azure cluster deletion hang when resource groups are already deleted by vsolanki12 · Pull Request #8682 · openshift/hypershift

vsolanki12 · 2026-06-05T11:34:12Z

What this PR does / why we need it:

When Azure resource groups are deleted before the HostedCluster (e.g. manually or by another process), cluster deletion hangs indefinitely. Two issues cause this:

CAPI AzureMachine finalizers block deletion cascade: The Azure CAPI provider controller cannot remove finalizers because the backing Azure resources no longer exist. This causes the CAPI Cluster object (and thus the HostedCluster) to wait forever for machine cleanup.
CLI destroy command hard-fails on missing resource group: The hypershift destroy cluster azure command validates the resource group exists before proceeding. When the RG is already deleted, this pre-validation returns a hard error, blocking the entire destroy flow.

Fix

Operator: Implement the OrphanDeleter interface for Azure. During HC deletion, detect AzureMachines stuck in deletion for longer than 5 minutes and remove only the Azure-specific finalizer (azuremachine.infrastructure.cluster.x-k8s.io) using controllerutil.RemoveFinalizer, preserving any finalizers from other controllers.
CLI: Change the resource group pre-validation to detect 404 responses, log a warning, and continue with deletion instead of returning a hard error. Non-404 errors (e.g. auth failures) still fail.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-86690

Special notes for your reviewer:

The Azure implementation follows the same pattern as the existing AWS DeleteOrphanedMachines (aws.go:380-402), but uses a time-based heuristic (5 min stuck in deletion) instead of credential status checking, since Azure doesn't expose an equivalent credential validity signal.
Unlike the AWS implementation which clears all finalizers (Finalizers = []string{}), this implementation only removes the Azure-specific finalizer using controllerutil.RemoveFinalizer to avoid interfering with other controllers. This improvement was identified during CodeRabbit review.
Tested on a live Azure HCP cluster (4.22.0) with simulated stuck AzureMachines. Operator logs confirm both machines had their Azure finalizers removed and the deletion cascade proceeded without hanging.

Checklist:

Subject and description added to both, commit and PR.
Relevant issues have been referenced.
This change includes docs.
This change includes unit tests.

Summary by CodeRabbit

New Features
- Automatic cleanup of orphaned Azure machines stuck in deleting state after a short threshold.
Bug Fixes
- Azure cluster destruction now gracefully proceeds if the specified resource group is missing.
Tests
- Added unit tests covering orphaned-machine cleanup behavior.
Refactor
- Compile-time assertion added to ensure the Azure platform implements orphan-deletion capability.

openshift-merge-bot · 2026-06-05T11:34:15Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci · 2026-06-05T11:34:16Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci-robot · 2026-06-05T11:34:19Z

@vsolanki12: This pull request references Jira Issue OCPBUGS-86690, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

What this PR does / why we need it:

When Azure resource groups are deleted before the HostedCluster (e.g. manually or by another process), cluster deletion hangs indefinitely. Two issues cause this:

CAPI AzureMachine finalizers block deletion cascade: The Azure CAPI provider controller cannot remove finalizers because the backing Azure resources no longer exist. This causes the CAPI Cluster object (and thus the HostedCluster) to wait forever for machine cleanup.

CLI destroy command hard-fails on missing resource group: The hypershift destroy cluster azure command validates the resource group exists before proceeding. When the RG is already deleted, this pre-validation returns a hard error, blocking the entire destroy flow.

Fix

Operator: Implement the OrphanDeleter interface for Azure. During HC deletion, detect AzureMachines stuck in deletion for longer than 5 minutes and remove only the Azure-specific finalizer (azuremachine.infrastructure.cluster.x-k8s.io) using controllerutil.RemoveFinalizer, preserving any finalizers from other controllers.

CLI: Change the resource group pre-validation to detect 404 responses, log a warning, and continue with deletion instead of returning a hard error. Non-404 errors (e.g. auth failures) still fail.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-86690

Special notes for your reviewer:

The Azure implementation follows the same pattern as the existing AWS DeleteOrphanedMachines (aws.go:380-402), but uses a time-based heuristic (5 min stuck in deletion) instead of credential status checking, since Azure doesn't expose an equivalent credential validity signal.

Unlike the AWS implementation which clears all finalizers (Finalizers = []string{}), this implementation only removes the Azure-specific finalizer using controllerutil.RemoveFinalizer to avoid interfering with other controllers. This improvement was identified during CodeRabbit review.

Tested on a live Azure HCP cluster (4.22.0) with simulated stuck AzureMachines. Operator logs confirm both machines had their Azure finalizers removed and the deletion cascade proceeded without hanging.

Checklist:

Subject and description added to both, commit and PR.

Relevant issues have been referenced.

This change includes docs.

This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-06-05T11:34:27Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: efd7cbcb-b39a-43dd-a2a9-0c264e73b604

📥 Commits

Reviewing files that changed from the base of the PR and between 79425af and ba7737f.

📒 Files selected for processing (4)

cmd/cluster/azure/destroy.go
hypershift-operator/controllers/hostedcluster/internal/platform/azure/azure.go
hypershift-operator/controllers/hostedcluster/internal/platform/azure/azure_test.go
hypershift-operator/controllers/hostedcluster/internal/platform/platform.go

✅ Files skipped from review due to trivial changes (1)

hypershift-operator/controllers/hostedcluster/internal/platform/platform.go

🚧 Files skipped from review as they are similar to previous changes (3)

cmd/cluster/azure/destroy.go
hypershift-operator/controllers/hostedcluster/internal/platform/azure/azure.go
hypershift-operator/controllers/hostedcluster/internal/platform/azure/azure_test.go

📝 Walkthrough

Walkthrough

This PR adds Azure cluster cleanup enhancements across two areas. First, it introduces a new DeleteOrphanedMachines method to the Azure platform provider that cleans up machines stuck in deletion by removing their finalizers after a 5-minute threshold; the implementation is accompanied by comprehensive table-driven tests and integrated into the platform provider interface. Second, it improves the cluster destroy flow by gracefully handling missing Azure resource groups—when a resource group lookup returns a 404, the destroy operation now logs and continues instead of failing.

Sequence Diagram(s)

sequenceDiagram
  participant Controller
  participant KubeAPI as Kubernetes API (client)
  participant AzureMachine
  Controller->>KubeAPI: List AzureMachine in controlPlaneNamespace
  KubeAPI-->>Controller: AzureMachine list
  Controller->>AzureMachine: check DeletionTimestamp and age > 5m
  alt older than threshold
    Controller->>KubeAPI: MergePatch remove capiazure.MachineFinalizer
    KubeAPI-->>Controller: Patch result (success/error)
  else skip
    Controller-->>Controller: no-op
  end
  Controller-->>Controller: aggregate patch errors and return

Suggested reviewers

muraee
jparrill
sjenning

🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Test Structure And Quality	❓ Inconclusive	The check specifies "Ginkgo test code" but the added test uses standard Go testing with Gomega assertions, not Ginkgo patterns (no Describe/It/BeforeEach blocks).	Clarify if check applies to standard Go tests or Ginkgo-only. For standard Go tests, the test is well-structured with single-responsibility subtests, good assertions messages, and codebase-consistent patterns.

✅ Passed checks (10 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly summarizes the main change: fixing Azure cluster deletion hang when resource groups are already deleted, which matches the core problem and solution described in the PR objectives.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	Tests use standard Go testing with t.Run() not Ginkgo. All test names are stable with no dynamic values like timestamps, UUIDs, or random suffixes.
Topology-Aware Scheduling Compatibility	✅ Passed	PR introduces only controller logic for Azure machine cleanup with no deployment manifests, affinity rules, topology constraints, or scheduling assumptions incompatible with OpenShift topologies.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No Ginkgo e2e tests added; only standard Go unit test (TestDeleteOrphanedMachines) with fake client, no IPv4 assumptions, and no external connectivity requirements.
No-Weak-Crypto	✅ Passed	No weak cryptography detected. The PR modifies Azure cluster deletion and orphan machine cleanup logic using standard libraries and Kubernetes controller utilities with no crypto-related code.
Container-Privileges	✅ Passed	All four modified files are Go source code; no K8s manifests with privileged container configs (privileged, hostPID/Network/IPC, SYS_ADMIN, allowPrivilegeEscalation) were introduced.
No-Sensitive-Data-In-Logs	✅ Passed	New logging statements log only resource group names, machine identifiers, and timestamps—non-sensitive operational data. No credentials, tokens, API keys, PII, or passwords are exposed.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-06-05T11:34:53Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vsolanki12
Once this PR has been reviewed and has the lgtm label, please assign devguyio for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2026-06-05T11:36:03Z

@vsolanki12: This pull request references Jira Issue OCPBUGS-86690, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

What this PR does / why we need it:

When Azure resource groups are deleted before the HostedCluster (e.g. manually or by another process), cluster deletion hangs indefinitely. Two issues cause this:

CAPI AzureMachine finalizers block deletion cascade: The Azure CAPI provider controller cannot remove finalizers because the backing Azure resources no longer exist. This causes the CAPI Cluster object (and thus the HostedCluster) to wait forever for machine cleanup.

CLI destroy command hard-fails on missing resource group: The hypershift destroy cluster azure command validates the resource group exists before proceeding. When the RG is already deleted, this pre-validation returns a hard error, blocking the entire destroy flow.

Fix

Operator: Implement the OrphanDeleter interface for Azure. During HC deletion, detect AzureMachines stuck in deletion for longer than 5 minutes and remove only the Azure-specific finalizer (azuremachine.infrastructure.cluster.x-k8s.io) using controllerutil.RemoveFinalizer, preserving any finalizers from other controllers.

CLI: Change the resource group pre-validation to detect 404 responses, log a warning, and continue with deletion instead of returning a hard error. Non-404 errors (e.g. auth failures) still fail.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-86690

Special notes for your reviewer:

The Azure implementation follows the same pattern as the existing AWS DeleteOrphanedMachines (aws.go:380-402), but uses a time-based heuristic (5 min stuck in deletion) instead of credential status checking, since Azure doesn't expose an equivalent credential validity signal.

Unlike the AWS implementation which clears all finalizers (Finalizers = []string{}), this implementation only removes the Azure-specific finalizer using controllerutil.RemoveFinalizer to avoid interfering with other controllers. This improvement was identified during CodeRabbit review.

Tested on a live Azure HCP cluster (4.22.0) with simulated stuck AzureMachines. Operator logs confirm both machines had their Azure finalizers removed and the deletion cascade proceeded without hanging.

Checklist:

Subject and description added to both, commit and PR.

Relevant issues have been referenced.

This change includes docs.

This change includes unit tests.

Summary by CodeRabbit

New Features

Added automatic cleanup of orphaned Azure machines stuck in deleting state for extended periods.

Bug Fixes

Improved Azure cluster destruction to gracefully handle missing resource groups.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

codecov · 2026-06-05T11:45:46Z

Codecov Report

❌ Patch coverage is 53.84615% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.43%. Comparing base (25817d4) to head (ba7737f).
⚠️ Report is 13 commits behind head on main.

Files with missing lines	Patch %	Lines
cmd/cluster/azure/destroy.go	0.00%	6 Missing ⚠️
...ers/hostedcluster/internal/platform/azure/azure.go	70.00%	4 Missing and 2 partials ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #8682   +/-   ##
=======================================
  Coverage   41.43%   41.43%           
=======================================
  Files         756      756           
  Lines       93658    93671   +13     
=======================================
+ Hits        38807    38816    +9     
- Misses      52128    52132    +4     
  Partials     2723     2723

Files with missing lines	Coverage Δ
...ollers/hostedcluster/internal/platform/platform.go	`0.00% <ø> (ø)`
cmd/cluster/azure/destroy.go	`17.18% <0.00%> (-0.56%)`	⬇️
...ers/hostedcluster/internal/platform/azure/azure.go	`31.89% <70.00%> (+2.03%)`	⬆️

... and 1 file with indirect coverage changes

Flag	Coverage Δ
cmd-support	`34.87% <0.00%> (-0.01%)`	⬇️
cpo-hostedcontrolplane	`43.50% <ø> (ø)`
cpo-other	`42.74% <ø> (ø)`
hypershift-operator	`51.58% <70.00%> (+0.01%)`	⬆️
other	`31.64% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…s are already deleted When Azure resource groups are deleted before the HostedCluster, two things block cleanup: CAPI AzureMachine finalizers can never be removed because the provider controller cannot reach the deleted resources, and the CLI destroy command hard-fails on the resource group pre-validation check. Implement the OrphanDeleter interface for Azure by detecting AzureMachines stuck in deletion for longer than 5 minutes and force-removing their finalizers, allowing the CAPI deletion cascade to proceed. Also change the destroy command's RG pre-validation to log a warning and continue on 404 instead of failing. Signed-off-by: Vimal Solanki <vsolanki@redhat.com>

vsolanki12 · 2026-06-05T11:52:43Z

I have tried to reproduce and tested the fix in my test cluster as below:

Environment: Azure HCP cluster vsolanki-86690, CP version 4.22.0

Custom hypershift-operator image deployed:

    $ oc get pods -n hypershift -l name=operator \
        -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.containers[0].image}{"\n"}{end}'
    operator-5c78c99-b5m4v: quay.io/vsolanki/hypershift-operator:fix-86690
    operator-5c78c99-srdl6: quay.io/vsolanki/hypershift-operator:fix-86690

Test scenario: Created two AzureMachines with azuremachine.infrastructure.cluster.x-k8s.io finalizer in the control plane namespace, then deleted them. Without a working Azure CAPI provider controller, they remain stuck in deletion indefinitely simulating the exact bug scenario where resource groups are already deleted.

Stuck AzureMachines (before HC deletion):

    $ oc get azuremachines -n clusters-vsolanki-86690 \
        -o jsonpath='{range .items[*]}{.metadata.name}: deletionTimestamp={.metadata.deletionTimestamp}, finalizers={.metadata.finalizers}{"\n"}{end}'
    test-azure-machine-1: deletionTimestamp=2026-06-05T10:39:26Z, finalizers=["azuremachine.infrastructure.cluster.x-k8s.io"]
    test-azure-machine-2: deletionTimestamp=2026-06-05T10:39:26Z, finalizers=["azuremachine.infrastructure.cluster.x-k8s.io"]

After fix triggered HC deletion, AzureMachines cleaned up:

    $ oc delete hc vsolanki-86690 -n clusters
    hostedcluster.hypershift.openshift.io "vsolanki-86690" deleted
    
    $ oc get azuremachines -n clusters-vsolanki-86690
    No resources found in clusters-vsolanki-86690 namespace.

Operator logs confirm finalizer removal:

    {"level":"info","ts":"2026-06-05T11:16:38Z",
     "msg":"Removed finalizers from orphaned AzureMachine stuck in deletion",
     "controller":"hostedcluster",
     "machine":{"name":"test-azure-machine-1","namespace":"clusters-vsolanki-86690"},
     "deletionTimestamp":"2026-06-05T10:39:26Z"}

    {"level":"info","ts":"2026-06-05T11:16:38Z",
     "msg":"Removed finalizers from orphaned AzureMachine stuck in deletion",
     "controller":"hostedcluster",
     "machine":{"name":"test-azure-machine-2","namespace":"clusters-vsolanki-86690"},
     "deletionTimestamp":"2026-06-05T10:39:26Z"}

Both AzureMachines had their finalizers removed (stuck for 37 minutes, well past the 5-minute threshold). HC deletion cascade proceeded without hanging.

openshift-ci · 2026-06-05T12:35:00Z

@vsolanki12: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 5, 2026

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jun 5, 2026

openshift-ci Bot added the do-not-merge/needs-area label Jun 5, 2026

vsolanki12 force-pushed the OCPBUGS-86690-azure-orphaned-machine-deletion branch from 79425af to ba7737f Compare June 5, 2026 11:47

vsolanki12 marked this pull request as ready for review June 5, 2026 12:00

openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 5, 2026

openshift-ci Bot requested review from Nirshal and enxebre June 5, 2026 12:01

Conversation

vsolanki12 commented Jun 5, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does / why we need it:

Fix

Which issue(s) this PR fixes:

Special notes for your reviewer:

Checklist:

Summary by CodeRabbit

Uh oh!

openshift-merge-bot Bot commented Jun 5, 2026

Uh oh!

openshift-ci Bot commented Jun 5, 2026

Uh oh!

openshift-ci-robot commented Jun 5, 2026

What this PR does / why we need it:

Fix

Which issue(s) this PR fixes:

Special notes for your reviewer:

Checklist:

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Sequence Diagram(s)

Suggested reviewers

❌ Failed checks (1 inconclusive)

Uh oh!

openshift-ci Bot commented Jun 5, 2026

Uh oh!

openshift-ci-robot commented Jun 5, 2026

What this PR does / why we need it:

Fix

Which issue(s) this PR fixes:

Special notes for your reviewer:

Checklist:

Summary by CodeRabbit

Uh oh!

codecov Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

vsolanki12 commented Jun 5, 2026

Uh oh!

openshift-ci Bot commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vsolanki12 commented Jun 5, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

codecov Bot commented Jun 5, 2026 •

edited

Loading