OCPBUGS-86690: Fix Azure cluster deletion hang when resource groups are already deleted#8682
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Skipping CI for Draft Pull Request. |
|
@vsolanki12: This pull request references Jira Issue OCPBUGS-86690, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (4)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughThis PR adds Azure cluster cleanup enhancements across two areas. First, it introduces a new Sequence Diagram(s)sequenceDiagram
participant Controller
participant KubeAPI as Kubernetes API (client)
participant AzureMachine
Controller->>KubeAPI: List AzureMachine in controlPlaneNamespace
KubeAPI-->>Controller: AzureMachine list
Controller->>AzureMachine: check DeletionTimestamp and age > 5m
alt older than threshold
Controller->>KubeAPI: MergePatch remove capiazure.MachineFinalizer
KubeAPI-->>Controller: Patch result (success/error)
else skip
Controller-->>Controller: no-op
end
Controller-->>Controller: aggregate patch errors and return
Suggested reviewers
🚥 Pre-merge checks | ✅ 10 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: vsolanki12 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@vsolanki12: This pull request references Jira Issue OCPBUGS-86690, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #8682 +/- ##
=======================================
Coverage 41.43% 41.43%
=======================================
Files 756 756
Lines 93658 93671 +13
=======================================
+ Hits 38807 38816 +9
- Misses 52128 52132 +4
Partials 2723 2723
... and 1 file with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
…s are already deleted When Azure resource groups are deleted before the HostedCluster, two things block cleanup: CAPI AzureMachine finalizers can never be removed because the provider controller cannot reach the deleted resources, and the CLI destroy command hard-fails on the resource group pre-validation check. Implement the OrphanDeleter interface for Azure by detecting AzureMachines stuck in deletion for longer than 5 minutes and force-removing their finalizers, allowing the CAPI deletion cascade to proceed. Also change the destroy command's RG pre-validation to log a warning and continue on 404 instead of failing. Signed-off-by: Vimal Solanki <vsolanki@redhat.com>
79425af to
ba7737f
Compare
|
I have tried to reproduce and tested the fix in my test cluster as below: Environment: Azure HCP cluster vsolanki-86690, CP version 4.22.0 Custom hypershift-operator image deployed: Test scenario: Created two AzureMachines with Stuck AzureMachines (before HC deletion): After fix triggered HC deletion, AzureMachines cleaned up: Operator logs confirm finalizer removal: Both AzureMachines had their finalizers removed (stuck for 37 minutes, well past the 5-minute threshold). HC deletion cascade proceeded without hanging. |
|
@vsolanki12: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What this PR does / why we need it:
When Azure resource groups are deleted before the HostedCluster (e.g. manually or by another process), cluster deletion hangs indefinitely. Two issues cause this:
CAPI AzureMachine finalizers block deletion cascade: The Azure CAPI provider controller cannot remove finalizers because the backing Azure resources no longer exist. This causes the CAPI Cluster object (and thus the HostedCluster) to wait forever for machine cleanup.
CLI destroy command hard-fails on missing resource group: The
hypershift destroy cluster azurecommand validates the resource group exists before proceeding. When the RG is already deleted, this pre-validation returns a hard error, blocking the entire destroy flow.Fix
Operator: Implement the
OrphanDeleterinterface for Azure. During HC deletion, detect AzureMachines stuck in deletion for longer than 5 minutes and remove only the Azure-specific finalizer (azuremachine.infrastructure.cluster.x-k8s.io) usingcontrollerutil.RemoveFinalizer, preserving any finalizers from other controllers.CLI: Change the resource group pre-validation to detect 404 responses, log a warning, and continue with deletion instead of returning a hard error. Non-404 errors (e.g. auth failures) still fail.
Which issue(s) this PR fixes:
Fixes https://issues.redhat.com/browse/OCPBUGS-86690
Special notes for your reviewer:
DeleteOrphanedMachines(aws.go:380-402), but uses a time-based heuristic (5 min stuck in deletion) instead of credential status checking, since Azure doesn't expose an equivalent credential validity signal.Finalizers = []string{}), this implementation only removes the Azure-specific finalizer usingcontrollerutil.RemoveFinalizerto avoid interfering with other controllers. This improvement was identified during CodeRabbit review.Checklist:
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Refactor