OCPBUGS-81544: requeue when AutoNodeEnabled is progressing #8497
maxcao13 wants to merge 2 commits
Conversation
Pipeline controller notification: For optional jobs, comment. This repository is configured in LGTM mode.
@maxcao13: This pull request explicitly references no jira issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: maxcao13. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
@maxcao13: This pull request references Jira Issue OCPBUGS-81544, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID:
📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough
The reconciler now captures the second return value from `reconcileAutoNodeEnabledCondition`.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Event as Event/Queue
    participant Reconciler as HostedCluster Reconciler
    participant AutoNode as reconcileAutoNodeEnabledCondition
    participant Karpenter as Karpenter Components/Deployments
    participant Timer as Requeue Timer
    Event->>Reconciler: trigger reconcile(hcluster)
    Reconciler->>AutoNode: evaluate AutoNode state
    AutoNode->>Karpenter: check components & deployments
    Karpenter-->>AutoNode: report missing/rolling/terminating OR stable
    AutoNode-->>Reconciler: return (Condition, progressing)
    Reconciler->>Reconciler: set hcluster.Status.Conditions = Condition
    alt progressing == true
        Reconciler->>Timer: ensure RequeueAfter = min(existing or +inf, 15s)
    else progressing == false
        Reconciler->>Timer: leave RequeueAfter unchanged
    end
    Reconciler-->>Event: requeue per Timer
```
Possibly related PRs
🚥 Pre-merge checks: ✅ 10 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (10 passed)
@maxcao13: This pull request references Jira Issue OCPBUGS-81544, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
hypershift-operator/controllers/hostedcluster/karpenter.go (1)
183-187: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win: Requeue should stay enabled when AutoNode evaluation fails
At Line 187 and Line 245, transient client List/Get errors return `progressing=false`, which can leave `AutoNodeEnabled` stuck in `Unknown` until an unrelated event. Returning `true` here preserves polling and avoids stale status.

Suggested patch:

```diff
- return condition, false
+ return condition, true
...
- return condition, false
+ return condition, true
```

Also applies to: 241-245
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/hostedcluster/karpenter.go` around lines 183 - 187, The client.List and client.Get error paths in the AutoNode evaluation currently return (condition, false) which disables requeue; update the error returns in the r.Client.List(...) and r.Client.Get(...) error branches (the blocks that set condition.Status = metav1.ConditionUnknown, condition.Reason = hyperv1.AutoNodeEvaluationFailedReason and set condition.Message with the fmt.Sprintf of the error) to return (condition, true) so retry/polling stays enabled and AutoNodeEnabled doesn't remain stale.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go`:
- Around line 2106-2108: When autoNodeProgressing is true, don't unconditionally
overwrite result.RequeueAfter; instead set it to the minimum non-zero interval
between the existing result.RequeueAfter and 15s. Update the autoNodeProgressing
branch to: if result.RequeueAfter is zero, assign 15s; otherwise assign 15s only
if 15s is less than the current result.RequeueAfter (i.e., result.RequeueAfter =
minNonZero(result.RequeueAfter, 15*time.Second)). This preserves shorter
previously-computed intervals while ensuring a 15s fallback when none was set.
---
Outside diff comments:
In `@hypershift-operator/controllers/hostedcluster/karpenter.go`:
- Around line 183-187: The client.List and client.Get error paths in the
AutoNode evaluation currently return (condition, false) which disables requeue;
update the error returns in the r.Client.List(...) and r.Client.Get(...) error
branches (the blocks that set condition.Status = metav1.ConditionUnknown,
condition.Reason = hyperv1.AutoNodeEvaluationFailedReason and set
condition.Message with the fmt.Sprintf of the error) to return (condition, true)
so retry/polling stays enabled and AutoNodeEnabled doesn't remain stale.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 6d9b0b24-a636-484f-8709-0f8ab232c9e2
📒 Files selected for processing (3)
- hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
- hypershift-operator/controllers/hostedcluster/karpenter.go
- hypershift-operator/controllers/hostedcluster/karpenter_test.go
Codecov Report
❌ Patch coverage is
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #8497      +/-   ##
==========================================
+ Coverage   39.95%   39.99%    +0.04%
==========================================
  Files         751      751
  Lines       92733    92843      +110
==========================================
+ Hits        37048    37137       +89
- Misses      52998    53018       +20
- Partials     2687     2688        +1
```

... and 21 files with indirect coverage changes
Flags with carried forward coverage won't be shown.
The HC reconciler reads ControlPlaneComponent status to determine AutoNodeEnabled but does not watch CPCs. Without a requeue, the condition stays stale until an unrelated resource triggers reconciliation, causing the e2e lifecycle test to time out waiting for the condition to flip to True. Return a progressing signal from reconcileAutoNodeEnabledCondition and set a 15s requeue so the condition converges promptly.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Max Cao <macao@redhat.com>
83f477e to caefe2a
@maxcao13: This pull request references Jira Issue OCPBUGS-81544, which is valid. 3 validation(s) were run on this bug
/test e2e-aws-autonode
The parallel provisioning subtests all captured the same *HostedCluster pointer. When two goroutines concurrently called mgtClient.Get() into that shared object, the JSON deserializer triggered a "concurrent map writes" panic. DeepCopy the pointer at the start of every parallel subtest so each goroutine works on its own object.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Max Cao <macao@redhat.com>
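The race and the fix described in the commit above can be sketched as follows; `HostedCluster` and its `DeepCopy` here are simplified stand-ins for the generated HyperShift API types:

```go
package main

import (
	"fmt"
	"sync"
)

// HostedCluster is a simplified stand-in for the e2e test's shared object.
type HostedCluster struct {
	Annotations map[string]string
}

// DeepCopy mimics a generated deepcopy: the clone gets its own map, so
// writes into it never touch the shared object's map.
func (hc *HostedCluster) DeepCopy() *HostedCluster {
	out := &HostedCluster{Annotations: make(map[string]string, len(hc.Annotations))}
	for k, v := range hc.Annotations {
		out.Annotations[k] = v
	}
	return out
}

func main() {
	shared := &HostedCluster{Annotations: map[string]string{"seed": "v"}}

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Deserializing into `shared` from several goroutines would be
			// a concurrent map write; each subtest clones first instead.
			hc := shared.DeepCopy()
			hc.Annotations[fmt.Sprintf("worker-%d", i)] = "done"
		}(i)
	}
	wg.Wait()
	fmt.Println(len(shared.Annotations)) // 1: the shared object is untouched
}
```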
Added a commit to fix a data race: we ran into concurrent map writes related to the shared HostedCluster pointer object.

/test e2e-aws-autonode
AI Test Failure Analysis
Job:
Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6
Test Results: e2e-aws
/test e2e-aws
AI Test Failure Analysis
Job:
Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6
/test e2e-aws
/test e2e-aws-4-22
The PR only modifies the HyperShift operator's Karpenter/AutoNode reconciliation logic. The failure is in Azure VM provisioning at the infrastructure level, completely unrelated to the PR.

Test Failure Analysis Complete
Job Information
Test Failure Analysis
Error Summary
Root Cause: The root cause is an Azure infrastructure-level failure.
Detailed failure chain:
Why this is unrelated to PR #8497:
Recommendations
Evidence
6f44bab to c575c75
@maxcao13: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
What this PR does / why we need it:
The HostedCluster reconciler reads `ControlPlaneComponent` status to set the `AutoNodeEnabled` condition, but it does not watch CPC resources. When AutoNode is enabled or disabled, the condition gets stuck at `AutoNodeProgressing` until an unrelated resource change happens to trigger reconciliation. In CI this causes the `TestKarpenter/Main/AutoNode_enable/disable_lifecycle` e2e test to time out: the karpenter CPC reports `RolloutComplete=True` within ~1 minute, but the HC condition is never updated.

This adds a 15-second requeue when `reconcileAutoNodeEnabledCondition` reports a progressing state, so the HC reconciler polls until the transition completes. This is a targeted fix that avoids adding a broad CPC watch (which would fire for all ~30-40 CPCs on every status change across every HostedCluster).
Which issue(s) this PR fixes:
Fixes periodic `TestKarpenter/Main/AutoNode_enable/disable_lifecycle` failures in `periodic-ci-openshift-hypershift-release-5.0-periodics-e2e-aws-ovn`.

Special notes for your reviewer:
The alternative considered was adding `ControlPlaneComponent` to the HC reconciler's `managedResources()` watch list. That would provide immediate reactivity, but at the cost of watching all CPCs across all HCP namespaces (~30-40 per HostedCluster), when only 2 karpenter CPCs matter during enable/disable transitions. A bounded 15s poll during transitions is the better tradeoff.
Checklist:
Assisted-by: Cursor Agent
Made with Cursor
Summary by CodeRabbit
Bug Fixes
Tests