OCPBUGS-86798: spot MHC not created when autoRepair=true and ignition endpoint not reached#8645
OCPBUGS-86798: spot MHC not created when autoRepair=true and ignition endpoint not reached#8645dpateriya wants to merge 1 commit into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughThis PR decouples spot instance health checks from the auto-repair gating logic. The Reconcile method now creates or updates a spot-specific MachineHealthCheck early based solely on whether spot instances are enabled, and deletes it if spot is disabled. The previous conditional block tied to auto-repair and ignition-endpoint status was removed. A new table-driven test verifies spot MHC behavior across combinations of spot enablement, auto-repair setting, and ignition endpoint readiness. Suggested reviewers
🚥 Pre-merge checks | ✅ 10 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: dpateriya The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@dpateriya: This pull request references Jira Issue OCPBUGS-86798, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
be91e33 to
a0f1b6d
Compare
|
/jira refresh |
|
@dpateriya: This pull request references Jira Issue OCPBUGS-86798, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@hypershift-operator/controllers/nodepool/capi_test.go`:
- Around line 3301-3334: The test case name strings in the table of cases (the
struct entries that include fields like name, autoRepair, ignitionReached,
spotEnabled, expectSpotMHC, expectRegularMHC, expectAutoRepairStatus) do not
follow the repo policy; update each name to the required format starting with
"When ..." and then "it should ..." (e.g. "When spot enabled and autoRepair true
and ignition NOT reached it should create spot MHC and not create regular MHC"),
preserving the rest of the case fields and semantics; change only the name
values in the test cases inside capi_test.go so they conform to the "When ... it
should ..." pattern.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: f9c051ee-dacb-4f42-a86c-f7ad5906603b
📒 Files selected for processing (2)
hypershift-operator/controllers/nodepool/capi.gohypershift-operator/controllers/nodepool/capi_test.go
|
@dpateriya: This pull request references Jira Issue OCPBUGS-86798, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
… endpoint not reached The spot-specific MachineHealthCheck was placed after the autoRepair gate in CAPI.Reconcile(). When autoRepair is true and the ignition endpoint has not been reached, the function returns early, preventing the spot MHC from being created. This causes a permanent deadlock: spot instances that fail to provision never get remediated because the MHC (which would detect the stuck Machine via NodeStartupTimeout) is never created, and the ignition endpoint is never reached because no node ever joins. Move the spot MHC reconciliation before the autoRepair gate so it is always created when spot instances are enabled, regardless of the autoRepair/ignition state. Co-authored-by: Cursor <cursoragent@cursor.com>
a0f1b6d to
1a9eec3
Compare
|
@dpateriya: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Now I have the complete picture. Let me compile the final report. Here's a summary of all failures found:
Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryPR #8645 has six independent failures across three categories. The primary blocker is a merge conflict — the PR's Root CauseThere are three PR-authoring issues and two transient CI infrastructure issues: 1. Merge Conflict (tide ERROR — primary blocker) 2. Lint Failures (2 issues in PR code)
3. Gitlint Failure (commit message format) 4. CI Infrastructure Issues (transient, NOT caused by this PR) Recommendations
Evidence
|
Summary
autoRepair: trueandReachedIgnitionEndpointis FalseProblem
The spot MHC block was placed after the autoRepair gate in
CAPI.Reconcile(). WhenautoRepair: trueand the ignition endpoint hasn't been reached yet, the function returns early viareturn nilat line 158, preventing the spot MHC from being created.This creates a permanent deadlock when a spot instance fails to provision:
PendingwithInstanceProvisionFailedReachedIgnitionEndpointstaysFalseforeverautoRepair: true+ ignition not reached →return nil→ spot MHC never createdNodeStartupTimeout→ no remediation → Machine stuck foreverFix
Move the spot MHC reconciliation block before the autoRepair gate. The spot MHC is independent of autoRepair — it serves as a safety net for spot instance failures using
maxUnhealthy: 100%and a 20-minuteNodeStartupTimeout.Reproduction
With spot capacity unavailable, the Machine stays
Pendingindefinitely and no<nodepool>-spotMHC is created.Test plan
TestSpotMHCCreatedIndependentlyOfAutoRepairIgnitionGatewith 4 cases:TestCAPIReconciletests pass (no regression)TestReconcileMachineHealthChecktests passnodepoolpackage tests passgo vetcleanSummary by CodeRabbit
Bug Fixes
Tests