OCPBUGS-81544: requeue when AutoNodeEnabled is progressing #8497
maxcao13 wants to merge 2 commits
Conversation
Pipeline controller notification: For optional jobs, comment. This repository is configured in LGTM mode.
@maxcao13: This pull request explicitly references no jira issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: maxcao13. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
@maxcao13: This pull request references Jira Issue OCPBUGS-81544, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID:
📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough
The reconciler now captures the second return value from `reconcileAutoNodeEnabledCondition`.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Event as Event/Queue
    participant Reconciler as HostedCluster Reconciler
    participant AutoNode as reconcileAutoNodeEnabledCondition
    participant Karpenter as Karpenter Components/Deployments
    participant Timer as Requeue Timer
    Event->>Reconciler: trigger reconcile(hcluster)
    Reconciler->>AutoNode: evaluate AutoNode state
    AutoNode->>Karpenter: check components & deployments
    Karpenter-->>AutoNode: report missing/rolling/terminating OR stable
    AutoNode-->>Reconciler: return (Condition, progressing)
    Reconciler->>Reconciler: set hcluster.Status.Conditions = Condition
    alt progressing == true
        Reconciler->>Timer: ensure RequeueAfter = min(existing or +inf, 15s)
    else progressing == false
        Reconciler->>Timer: leave RequeueAfter unchanged
    end
    Reconciler-->>Event: requeue per Timer
```
Possibly related PRs
🚥 Pre-merge checks: ✅ 10 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (10 passed)
@maxcao13: This pull request references Jira Issue OCPBUGS-81544, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
hypershift-operator/controllers/hostedcluster/karpenter.go (1)
183-187: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win: Requeue should stay enabled when AutoNode evaluation fails
At Line 187 and Line 245, transient client List/Get errors return `progressing=false`, which can leave `AutoNodeEnabled` stuck in `Unknown` until an unrelated event. Returning `true` here preserves polling and avoids stale status.

Suggested patch:

```diff
- return condition, false
+ return condition, true
...
- return condition, false
+ return condition, true
```

Also applies to: 241-245
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/hostedcluster/karpenter.go` around lines 183 - 187, The client.List and client.Get error paths in the AutoNode evaluation currently return (condition, false) which disables requeue; update the error returns in the r.Client.List(...) and r.Client.Get(...) error branches (the blocks that set condition.Status = metav1.ConditionUnknown, condition.Reason = hyperv1.AutoNodeEvaluationFailedReason and set condition.Message with the fmt.Sprintf of the error) to return (condition, true) so retry/polling stays enabled and AutoNodeEnabled doesn't remain stale.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go`:
- Around line 2106-2108: When autoNodeProgressing is true, don't unconditionally
overwrite result.RequeueAfter; instead set it to the minimum non-zero interval
between the existing result.RequeueAfter and 15s. Update the autoNodeProgressing
branch to: if result.RequeueAfter is zero, assign 15s; otherwise assign 15s only
if 15s is less than the current result.RequeueAfter (i.e., result.RequeueAfter =
minNonZero(result.RequeueAfter, 15*time.Second)). This preserves shorter
previously-computed intervals while ensuring a 15s fallback when none was set.
---
Outside diff comments:
In `@hypershift-operator/controllers/hostedcluster/karpenter.go`:
- Around line 183-187: The client.List and client.Get error paths in the
AutoNode evaluation currently return (condition, false) which disables requeue;
update the error returns in the r.Client.List(...) and r.Client.Get(...) error
branches (the blocks that set condition.Status = metav1.ConditionUnknown,
condition.Reason = hyperv1.AutoNodeEvaluationFailedReason and set
condition.Message with the fmt.Sprintf of the error) to return (condition, true)
so retry/polling stays enabled and AutoNodeEnabled doesn't remain stale.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 6d9b0b24-a636-484f-8709-0f8ab232c9e2
📒 Files selected for processing (3)
- hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
- hypershift-operator/controllers/hostedcluster/karpenter.go
- hypershift-operator/controllers/hostedcluster/karpenter_test.go
Codecov Report
❌ Patch coverage is
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #8497      +/-   ##
==========================================
+ Coverage   39.95%   39.99%    +0.04%
==========================================
  Files         751      751
  Lines       92733    92843      +110
==========================================
+ Hits        37048    37137       +89
- Misses      52998    53018       +20
- Partials     2687     2688        +1
```

... and 21 files with indirect coverage changes
Flags with carried forward coverage won't be shown.
The HC reconciler reads ControlPlaneComponent status to determine AutoNodeEnabled but does not watch CPCs. Without a requeue, the condition stays stale until an unrelated resource triggers reconciliation, causing the e2e lifecycle test to time out waiting for the condition to flip to True. Return a progressing signal from reconcileAutoNodeEnabledCondition and set a 15s requeue so the condition converges promptly.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Max Cao <macao@redhat.com>
83f477e to caefe2a
@maxcao13: This pull request references Jira Issue OCPBUGS-81544, which is valid. 3 validation(s) were run on this bug
/test e2e-aws-autonode
The parallel provisioning subtests all captured the same *HostedCluster pointer. When two goroutines concurrently called mgtClient.Get() into that shared object, the JSON deserializer triggered a "concurrent map writes" panic. DeepCopy the pointer at the start of every parallel subtest so each goroutine works on its own object.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Max Cao <macao@redhat.com>
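The race and the fix described in the commit above can be sketched as follows; `HostedCluster` and its `DeepCopy` here are simplified stand-ins for the generated HyperShift API types:

```go
package main

import (
	"fmt"
	"sync"
)

// HostedCluster is a simplified stand-in for the e2e test's shared object.
type HostedCluster struct {
	Annotations map[string]string
}

// DeepCopy mimics a generated deepcopy: the clone gets its own map, so
// writes into it never touch the shared object's map.
func (hc *HostedCluster) DeepCopy() *HostedCluster {
	out := &HostedCluster{Annotations: make(map[string]string, len(hc.Annotations))}
	for k, v := range hc.Annotations {
		out.Annotations[k] = v
	}
	return out
}

func main() {
	shared := &HostedCluster{Annotations: map[string]string{"seed": "v"}}

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Deserializing into `shared` from several goroutines would be
			// a concurrent map write; each subtest clones first instead.
			hc := shared.DeepCopy()
			hc.Annotations[fmt.Sprintf("worker-%d", i)] = "done"
		}(i)
	}
	wg.Wait()
	fmt.Println(len(shared.Annotations)) // 1: the shared object is untouched
}
```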
Added a commit to fix a data race: we ran into concurrent map writes related to the shared HostedCluster pointer object.

/test e2e-aws-autonode
AI Test Failure Analysis
Job:
Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6
Test Results: e2e-aws
/test e2e-aws
AI Test Failure Analysis
Job:
Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6
/test e2e-aws
/test e2e-aws-4-22
The PR only modifies the HyperShift operator's Karpenter/AutoNode reconciliation logic. The failure is in Azure VM provisioning at the infrastructure level, completely unrelated to the PR.

Test Failure Analysis Complete
Job Information
Test Failure Analysis
Error Summary
Root Cause: The root cause is an Azure infrastructure-level failure.
Detailed failure chain:
Why this is unrelated to PR #8497:
Recommendations
Evidence
6f44bab to c575c75
@maxcao13: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
What this PR does / why we need it:
The HostedCluster reconciler reads `ControlPlaneComponent` status to set the `AutoNodeEnabled` condition, but it does not watch CPC resources. When AutoNode is enabled or disabled, the condition gets stuck at `AutoNodeProgressing` until an unrelated resource change happens to trigger reconciliation. In CI this causes the `TestKarpenter/Main/AutoNode_enable/disable_lifecycle` e2e test to time out: the karpenter CPC reports `RolloutComplete=True` within ~1 minute, but the HC condition is never updated.

This adds a 15-second requeue when `reconcileAutoNodeEnabledCondition` reports a progressing state, so the HC reconciler polls until the transition completes. This is a targeted fix that avoids adding a broad CPC watch (which would fire for all ~30-40 CPCs on every status change across every HostedCluster).
Which issue(s) this PR fixes:
Fixes periodic `TestKarpenter/Main/AutoNode_enable/disable_lifecycle` failures in `periodic-ci-openshift-hypershift-release-5.0-periodics-e2e-aws-ovn`.

Special notes for your reviewer:
The alternative considered was adding `ControlPlaneComponent` to the HC reconciler's `managedResources()` watch list. That would provide immediate reactivity, but at the cost of watching all CPCs across all HCP namespaces (~30-40 per HostedCluster), when only 2 karpenter CPCs matter during enable/disable transitions. A bounded 15s poll during transitions is the better tradeoff.
Checklist:
Assisted-by: Cursor Agent
Made with Cursor
Summary by CodeRabbit
Bug Fixes
Tests