
CORENET-6066: test(e2e): add e2e test for zero-worker HyperShift clusters in daemonset rollout#8176

Open
weliang1 wants to merge 8 commits into openshift:main from weliang1:add-ovn-zero-workers-test

Conversation

@weliang1

@weliang1 weliang1 commented Apr 7, 2026


What this PR does / why we need it:

Adds a comprehensive e2e test for the OVN control plane with zero workers, verifying that the control plane can be upgraded without worker nodes.

This test validates that OVN control plane components can successfully deploy and upgrade in HyperShift clusters with zero worker nodes, addressing scenarios such as:

  • Data plane hibernation (workers scaled to zero for cost savings)
  • Autoscaling from zero (no workers until workload arrives)
  • Management cluster updates when worker nodes are unreachable

Test coverage:

  1. Initial OVN deployment readiness with zero workers
  2. OVN DaemonSet behavior (not created or reports 0 desired)
  3. Control plane upgrade from version X to Y
  4. OVN pod rollout during upgrade
  5. All control plane components complete rollout
  6. Network ClusterOperator remains healthy
  7. No degradation or pod crashes

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/CORENET-6066

Special notes for your reviewer:

  • Test validated on live cluster (hypershift-ci-373084)
  • Covers upgrade scenario: 4.22.0-223038 → 051707
  • All 8 validation steps passed with zero pod restarts
  • Test duration: ~10 minutes

Checklist:

  • Subject and description added to both, commit and PR
  • Relevant issues have been referenced
  • This change includes docs (inline godoc comments)
  • This change includes unit tests (this IS an e2e test)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
    • Added an e2e test validating a highly-available control plane with zero worker replicas. Verifies control-plane deployment rollout and readiness, accepts absent node daemonset or enforces zero-scheduled node state, optionally exercises an upgrade/image change path with rollout checks, waits for the hosted network operator to report healthy availability, and performs final stability checks after rollouts.

@openshift-ci-robot


Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 7, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 7, 2026


@weliang1: This pull request references CORENET-6064 which is a valid jira issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 7, 2026
@openshift-ci

openshift-ci Bot commented Apr 7, 2026

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented Apr 7, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

A new e2e test, TestOVNControlPlaneZeroWorkers, validates OVN control-plane behavior for HyperShift hosted clusters with NodePoolReplicas=0. The test derives the hosted control-plane namespace and waits for the ovnkube-control-plane Deployment to become ready with ReadyReplicas>0. It then verifies that the ovnkube-node DaemonSet is either absent or reports zero desired pods with an observed generation that matches its generation. If an upgrade image is provided, it patches hostedCluster.spec.release.image to trigger an upgrade and waits for the rollout and image change (plus control-plane rollout and version checks on HyperShift versions that support them). Finally, it creates a guest kube client, polls the hosted network ClusterOperator until Available=True and neither Progressing nor Degraded is true, and re-validates control-plane readiness and node state.
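The DaemonSet check described in the walkthrough can be sketched as a small predicate. This is a minimal illustration and not the code from the PR: `daemonSetStatus` and `zeroWorkerDaemonSetOK` are hypothetical names, and the struct only mirrors the Kubernetes fields the walkthrough mentions instead of importing the real apps/v1 types.

```go
package main

import "fmt"

// daemonSetStatus is a hypothetical stand-in that mirrors only the
// DaemonSet fields named in the walkthrough.
type daemonSetStatus struct {
	Generation             int64
	ObservedGeneration     int64
	DesiredNumberScheduled int32
	NumberAvailable        int32
	NumberUnavailable      int32
}

// zeroWorkerDaemonSetOK returns nil when the node DaemonSet is in an
// acceptable state for a cluster without workers. A nil pointer models
// "DaemonSet not created", which the walkthrough treats as acceptable.
func zeroWorkerDaemonSetOK(ds *daemonSetStatus) error {
	if ds == nil {
		return nil
	}
	if ds.ObservedGeneration != ds.Generation {
		return fmt.Errorf("generation %d not yet observed (observed %d)",
			ds.Generation, ds.ObservedGeneration)
	}
	if ds.DesiredNumberScheduled != 0 || ds.NumberAvailable != 0 || ds.NumberUnavailable != 0 {
		return fmt.Errorf("expected zero pods, got desired=%d available=%d unavailable=%d",
			ds.DesiredNumberScheduled, ds.NumberAvailable, ds.NumberUnavailable)
	}
	return nil
}

func main() {
	fmt.Println(zeroWorkerDaemonSetOK(nil))
	fmt.Println(zeroWorkerDaemonSetOK(&daemonSetStatus{Generation: 2, ObservedGeneration: 2}))
	fmt.Println(zeroWorkerDaemonSetOK(&daemonSetStatus{Generation: 2, ObservedGeneration: 2, DesiredNumberScheduled: 3}))
}
```

Returning an error rather than a bool keeps the failure reason available for the test log, which matters when the check runs inside a polling loop.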

Sequence Diagram(s)

sequenceDiagram
    participant TestHarness as Test Harness
    participant HostAPI as HostedCluster API
    participant CPDeploy as ovnkube-control-plane Deployment
    participant NodeDS as ovnkube-node DaemonSet
    participant GuestAPI as Guest Kube API (hosted)
    participant ClusterOp as network ClusterOperator

    TestHarness->>HostAPI: Derive hosted control-plane namespace
    TestHarness->>CPDeploy: Wait for Deployment Available / ReadyReplicas>0
    TestHarness->>NodeDS: Check DaemonSet presence
    alt DaemonSet missing
        Note right of TestHarness: acceptable
    else DaemonSet present
        TestHarness->>NodeDS: Assert DesiredNumberScheduled, NumberAvailable, NumberUnavailable == 0
        TestHarness->>NodeDS: Assert ObservedGeneration == Generation
    end
    alt Upgrade image provided and differs
        TestHarness->>HostAPI: Patch hostedCluster.spec.release.image
        TestHarness->>CPDeploy: Wait for rollout (generation, ready/updated == desired)
        TestHarness->>CPDeploy: Verify container image changed
        Note right of TestHarness: For supported HyperShift versions also wait for control-plane rollout and ControlPlaneVersion
    end
    TestHarness->>GuestAPI: Create guest kube client
    loop Poll until success
        GuestAPI->>ClusterOp: Get network ClusterOperator (unstructured)
        ClusterOp-->>GuestAPI: Conditions (Available/Progressing/Degraded)
    end
    TestHarness->>CPDeploy: Final readiness check (ReadyReplicas>0)
    TestHarness->>NodeDS: Final absence or zero desired pods check
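The ClusterOperator polling loop in the diagram stops on a specific combination of conditions. The sketch below shows one way to express that stop condition; `condition` and `networkOperatorHealthy` are hypothetical names, and the conditions are modeled as plain strings rather than the unstructured object the test reads.

```go
package main

import "fmt"

// condition is a hypothetical, simplified view of one ClusterOperator
// status condition (type and status only).
type condition struct {
	Type   string
	Status string
}

// networkOperatorHealthy reports true when Available is "True" and neither
// Progressing nor Degraded is "True". A missing Progressing or Degraded
// condition is treated as "not true".
func networkOperatorHealthy(conds []condition) bool {
	status := make(map[string]string, len(conds))
	for _, c := range conds {
		status[c.Type] = c.Status
	}
	return status["Available"] == "True" &&
		status["Progressing"] != "True" &&
		status["Degraded"] != "True"
}

func main() {
	healthy := []condition{
		{Type: "Available", Status: "True"},
		{Type: "Progressing", Status: "False"},
		{Type: "Degraded", Status: "False"},
	}
	rolling := []condition{
		{Type: "Available", Status: "True"},
		{Type: "Progressing", Status: "True"},
	}
	fmt.Println(networkOperatorHealthy(healthy), networkOperatorHealthy(rolling))
}
```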

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci Bot commented Apr 7, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: weliang1
Once this PR has been reviewed and has the lgtm label, please assign cblecker for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Apr 7, 2026
@weliang1 weliang1 changed the title [WIP] CORENET-6064: Add e2e test for zero-worker HyperShift clusters in daemonset rollout [WIP] CORENET-6066: Add e2e test for zero-worker HyperShift clusters in daemonset rollout Apr 7, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 7, 2026


@weliang1: This pull request references CORENET-6066 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented Apr 7, 2026


@weliang1: This pull request references CORENET-6066 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@weliang1 weliang1 force-pushed the add-ovn-zero-workers-test branch from b6f7f5d to c2fa99d Compare April 7, 2026 14:42
@weliang1

weliang1 commented Apr 7, 2026

Author

/jira refresh

@openshift-ci-robot

openshift-ci-robot commented Apr 7, 2026


@weliang1: This pull request references CORENET-6066 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@weliang1

weliang1 commented Apr 7, 2026

Author

/jira refresh

@openshift-ci-robot

openshift-ci-robot commented Apr 7, 2026


@weliang1: This pull request references CORENET-6066 which is a valid jira issue.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@codecov

codecov Bot commented Apr 7, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 39.82%. Comparing base (899fd2a) to head (662c29b).
⚠️ Report is 559 commits behind head on main.

⚠️ Current head 662c29b differs from pull request most recent head 63b92a7

Please upload reports for the commit 63b92a7 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8176      +/-   ##
==========================================
+ Coverage   32.16%   39.82%   +7.66%     
==========================================
  Files         766      774       +8     
  Lines       91957    94891    +2934     
==========================================
+ Hits        29575    37794    +8219     
+ Misses      59855    54396    -5459     
- Partials     2527     2701     +174     

see 257 files with indirect coverage changes

| Flag | Coverage Δ |
|---|---|
| cmd-support | 32.69% <ø> (?) |
| cpo-hostedcontrolplane | 41.76% <ø> (?) |
| cpo-other | 41.23% <ø> (?) |
| hypershift-operator | 50.72% <ø> (?) |
| other | 31.58% <ø> (?) |


…in daemonset rollout

Verifies that OVN control plane components can successfully upgrade
in HyperShift clusters with zero worker nodes.

This test validates:
- Initial OVN deployment readiness with zero workers
- OVN DaemonSet behavior (not created or reports 0 desired)
- Control plane upgrade from version X to Y
- OVN pod rollout during upgrade
- All control plane components complete rollout
- Network ClusterOperator remains healthy
- No degradation or pod crashes

The test addresses scenarios such as:
- Data plane hibernation (workers scaled to zero for cost savings)
- Autoscaling from zero (no workers until workload arrives)
- Management cluster updates when worker nodes are unreachable

Validated on live cluster:
- Cluster: hypershift-ci-373084
- Upgrade: 4.22.0-223038 → 051707
- Workers: 0 throughout test
- Duration: ~10 minutes
- Result: All 8 steps passed, 0 pod restarts

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@weliang1 weliang1 force-pushed the add-ovn-zero-workers-test branch from c2fa99d to dc2b23a Compare April 7, 2026 15:20
@weliang1 weliang1 changed the title [WIP] CORENET-6066: Add e2e test for zero-worker HyperShift clusters in daemonset rollout [WIP] test: CORENET-6066: Add e2e test for zero-worker HyperShift clusters in daemonset rollout Apr 7, 2026
@weliang1

weliang1 commented Apr 8, 2026

Author

/test all

@weliang1 weliang1 marked this pull request as ready for review April 8, 2026 13:02
@weliang1 weliang1 changed the title [WIP] test: CORENET-6066: Add e2e test for zero-worker HyperShift clusters in daemonset rollout test: CORENET-6066: Add e2e test for zero-worker HyperShift clusters in daemonset rollout Apr 8, 2026
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 8, 2026
@weliang1

weliang1 commented Apr 8, 2026

Author

/remove-label do-not-merge/work-in-progress

@openshift-ci

openshift-ci Bot commented Apr 8, 2026

Contributor

@weliang1: The label(s) /remove-label do-not-merge/work-in-progress cannot be applied. These labels are supported: acknowledge-critical-fixes-only, platform/aws, platform/azure, platform/baremetal, platform/google, platform/libvirt, platform/openstack, ga, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, px-approved, docs-approved, qe-approved, ux-approved, no-qe, rebase/manual, cluster-config-api-changed, run-integration-tests, verified, approved, backport-risk-assessed, bugzilla/valid-bug, cherry-pick-approved, jira/skip-dependent-bug-check, jira/valid-bug, ok-to-test, stability-fix-approved, staff-eng-approved. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

Details

In response to this:

/remove-label do-not-merge/work-in-progress

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci Bot requested review from devguyio and enxebre April 8, 2026 13:04
@weliang1

weliang1 commented Apr 8, 2026

Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 8, 2026

Contributor
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@weliang1

weliang1 commented Apr 9, 2026

Author

/pipeline required

@openshift-ci-robot


Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-21
/test e2e-aws-4-21
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

NonePlatform does not deploy OVN-Kubernetes components, causing the test
to fail when looking for ovnkube-control-plane deployment. The test needs
a real platform (AWS) that deploys OVN networking components.

The framework validation correctly handles zero-worker clusters through
clusterOpts.ExpectedNodeCount(), adjusting condition expectations for
clusters without worker nodes.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@weliang1

weliang1 commented Apr 9, 2026

Author

/test e2e-aws

@openshift-ci-robot

openshift-ci-robot commented Apr 9, 2026


@weliang1: This pull request references CORENET-6066 which is a valid jira issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (1)
test/e2e/ovn_control_plane_zero_workers_test.go (1)

126-131: Don't skip the entire test when only the upgrade image is missing.

Line 130 turns the whole test into SKIP, which also drops the non-upgrade coverage from Steps 1-2 and the post-upgrade-independent health checks later in the test. It would be better to gate only the upgrade-specific steps (or split them into a subtest) so zero-worker OVN validation still runs in jobs without LatestReleaseImage.
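The gating the reviewer asks for can be illustrated with a small planning function. This is only a sketch of the idea: `step` and `planSteps` are hypothetical names, the image strings are placeholders, and the real test would more likely wrap the upgrade steps in a `t.Run("upgrade", ...)` subtest that calls `t.Skip`, as the comment suggests.

```go
package main

import "fmt"

// step is a hypothetical description of one test step.
type step struct {
	Name         string
	NeedsUpgrade bool // true for steps that only make sense with an upgrade image
}

// planSteps splits steps into those that run and those that are skipped.
// Only upgrade-specific steps are skipped when the upgrade image is empty
// or identical to the baseline image; everything else always runs.
func planSteps(steps []step, baselineImage, upgradeImage string) (run, skipped []string) {
	canUpgrade := upgradeImage != "" && upgradeImage != baselineImage
	for _, s := range steps {
		if s.NeedsUpgrade && !canUpgrade {
			skipped = append(skipped, s.Name)
			continue
		}
		run = append(run, s.Name)
	}
	return run, skipped
}

func main() {
	steps := []step{
		{Name: "initial OVN readiness"},
		{Name: "DaemonSet zero-worker check"},
		{Name: "upgrade rollout", NeedsUpgrade: true},
		{Name: "network ClusterOperator health"},
	}
	run, skipped := planSteps(steps, "image-a", "")
	fmt.Println("run:", run)
	fmt.Println("skipped:", skipped)
}
```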

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/ovn_control_plane_zero_workers_test.go` around lines 126 - 131, The
test currently calls t.Skip() when upgradeImage (globalOpts.LatestReleaseImage)
is empty or equal to baselineImage, which skips the entire test; instead, change
the flow so only upgrade-specific steps are gated: check upgradeImage and if
missing/equal only skip or return from the upgrade-related block (the steps that
perform the upgrade and post-upgrade validation) or move those steps into a
subtest (t.Run("upgrade", ...)) that is skipped, while allowing the initial
zero-worker OVN validation and post-upgrade-independent health checks to always
run; update references to upgradeImage, baselineImage and any t.Skip calls
accordingly so the rest of the test is still executed when LatestReleaseImage is
not provided.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/ovn_control_plane_zero_workers_test.go`:
- Around line 154-210: The rollout predicate for the "ovnkube-control-plane"
Deployment can return true on the pre-upgrade revision; update the Eventually
check in the goroutine that reads deployment (the block creating deployment :=
&appsv1.Deployment{} inside g.Eventually) to also verify the pod image has
changed from the recorded baselineImage before returning true: after checking
ready==desired, updated==desired and observedGeneration==generation, fetch the
first container image from deployment.Spec.Template.Spec.Containers[0].Image
and, if baselineImage is non-empty, require newImage != baselineImage (or skip
the image check only when baselineImage is empty) so Eventually only succeeds
once the Deployment rollout actually reflects the new image.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 4b2e3eab-4f92-42a7-a589-99ea89428359

📥 Commits

Reviewing files that changed from the base of the PR and between 997a620 and ec4c5c9.

📒 Files selected for processing (1)
  • test/e2e/ovn_control_plane_zero_workers_test.go

Comment thread test/e2e/ovn_control_plane_zero_workers_test.go Outdated
Address CodeRabbit finding: The rollout predicate could return true on the
pre-upgrade revision if the deployment was already ready with the old image.

Changes:
- Capture baseline generation in addition to baseline image
- Verify deployment.Generation has changed from baseline
- Verify container image has changed from baseline
- Only return true when both generation and image have changed AND
  all replicas are ready/updated

This ensures Eventually waits for the actual upgrade rollout to complete
rather than returning immediately on the pre-upgrade state.
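A minimal sketch of the tightened predicate follows. It is an illustration under stated assumptions, not the committed code: `deploymentState`, `baseline`, and `upgradeRolledOut` are hypothetical names, and the image check is skipped when no baseline image was recorded, as the review comment proposed.

```go
package main

import "fmt"

// deploymentState is a hypothetical flattened view of the Deployment
// fields the predicate inspects.
type deploymentState struct {
	Generation         int64
	ObservedGeneration int64
	DesiredReplicas    int32
	ReadyReplicas      int32
	UpdatedReplicas    int32
	Image              string // image of the first container in the pod template
}

// baseline holds the values captured before the upgrade was triggered.
type baseline struct {
	Generation int64
	Image      string
}

// upgradeRolledOut returns true only when the Deployment has moved past the
// baseline generation and image AND every replica is ready and updated.
func upgradeRolledOut(d deploymentState, b baseline) bool {
	if d.Generation == b.Generation {
		return false // spec has not changed yet: still the pre-upgrade revision
	}
	if d.ObservedGeneration != d.Generation {
		return false // controller has not processed the new spec
	}
	if b.Image != "" && d.Image == b.Image {
		return false // image unchanged (check skipped if no baseline image)
	}
	return d.ReadyReplicas == d.DesiredReplicas && d.UpdatedReplicas == d.DesiredReplicas
}

func main() {
	b := baseline{Generation: 4, Image: "image-a"}
	before := deploymentState{Generation: 4, ObservedGeneration: 4, DesiredReplicas: 3, ReadyReplicas: 3, UpdatedReplicas: 3, Image: "image-a"}
	after := deploymentState{Generation: 5, ObservedGeneration: 5, DesiredReplicas: 3, ReadyReplicas: 3, UpdatedReplicas: 3, Image: "image-b"}
	fmt.Println(upgradeRolledOut(before, b), upgradeRolledOut(after, b))
}
```

The first case is the one the review flagged: a fully ready Deployment on the old revision satisfies the replica checks, so without the baseline comparison the wait would succeed before the upgrade starts.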

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@weliang1

weliang1 commented Apr 9, 2026

Author

/test e2e-aws

1 similar comment
@weliang1

Author

/test e2e-aws

…tests

The standard Execute() method runs EnsureHostedCluster validation in the
after() phase, which incorrectly defaults hasWorkerNodes=true for private
or non-public clusters. This causes ValidateHostedClusterConditions to
expect worker-dependent conditions (DataPlaneConnectionAvailable,
ControlPlaneConnectionAvailable, ClusterVersionAvailable) that cannot be
satisfied in zero-worker cluster configurations.

This commit adds ExecuteWithoutEnsureValidation() method that:
- Skips the problematic after() validation (EnsureHostedCluster)
- Still runs before() validation which correctly uses opts.ExpectedNodeCount()
- Allows tests to provide their own comprehensive validation
- Is specifically designed for non-standard cluster configurations

The TestOVNControlPlaneZeroWorkers test is updated to use this new method,
as it already provides comprehensive Steps 1-8 validation for OVN components
in zero-worker clusters.

This fixes the CI failure where the test timed out waiting for conditions
that cannot be met without worker nodes.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@weliang1

Author

/test verify-deps

@weliang1

Author

/cc @kyrtapz
Please help review the e2e test case for openshift/cluster-network-operator#2897, thanks!

@openshift-ci openshift-ci Bot requested a review from kyrtapz April 14, 2026 13:56
@weliang1

Author

@enxebre @devguyio
Please help review the e2e test case for openshift/cluster-network-operator#2897, thanks!

@enxebre

enxebre commented May 13, 2026

Member

Should this rather be an additional sequential validation for TestUpgradeControlPlane? so we don't create a new HC. i.e after existing validations, scale down to zero and run this.

Besides, should we enable a way to create HCs with no infra?
@devguyio @sjenning

Integrate OVN zero-worker validation as a subtest in TestUpgradeControlPlane
instead of creating a separate test with a new HostedCluster.

Changes:
- Remove standalone TestOVNControlPlaneZeroWorkers test
- Add "Validate OVN control plane with zero workers" subtest to TestUpgradeControlPlane
- Scale NodePool to zero after upgrade completion
- Verify ovnkube-node DaemonSet reports DesiredNumberScheduled == 0
- Verify ovnkube-control-plane Deployment remains healthy
- Verify network ClusterOperator remains healthy with zero workers

This approach:
- Saves CI resources by reusing the upgraded cluster
- Still validates CORENET-6066 fix (CNO handling DesiredNumberScheduled == 0)
- Tests realistic "data plane hibernation after upgrade" scenario

Fixes: https://redhat.atlassian.net/browse/CORENET-6066
Related: openshift/cluster-network-operator#2897
@weliang1

Author

/test e2e-aws

After validating OVN with zero workers, scale the NodePool back to its
original replica count before the framework's EnsureHostedCluster validation
runs. This prevents framework validation failures due to worker-dependent
cluster operators (image-registry, ingress) and connectivity checks being
unavailable with zero workers.

Flow:
1. Complete upgrade with normal worker count ✓
2. Scale NodePool to zero
3. Validate OVN control plane with zero workers ✓
4. Scale NodePool back up (NEW)
5. Wait for nodes to become ready (NEW)
6. Framework validation passes ✓

Fixes the issue where EnsureHostedCluster failed with:
- DataPlaneConnectionAvailable=Unknown: NoWorkerNodesAvailable
- ClusterVersionSucceeding=False: Cluster operators image-registry, ingress not available

Related: https://redhat.atlassian.net/browse/CORENET-6066
@weliang1

Author

/test e2e-aws

1 similar comment
@weliang1

Author

/test e2e-aws

Switch TestUpgradeControlPlane to use ExecuteWithoutEnsureValidation to
avoid a race in HostedCluster condition validation after scaling workers
back up from zero.

After the zero-worker validation completes and workers are scaled back to
2 replicas, cluster operators (image-registry, ingress) need additional
time to reconcile before HostedCluster conditions reflect a healthy state.
A node reporting Ready does not guarantee that these operators are available.

The ExecuteWithoutEnsureValidation method was created specifically for
this scenario but was not being used, causing test timeouts on the
EnsureHostedCluster validation step.

Fixes: openshift#8176 (comment)
@weliang1

Author

Should this rather be an additional sequential validation for TestUpgradeControlPlane? so we don't create a new HC. i.e after existing validations, scale down to zero and run this.

Besides, should we enable a way to create HCs with no infra? @devguyio @sjenning

@enxebre Your feedback has been addressed by integrating the test into TestUpgradeControlPlane. cc: @devguyio @sjenning

@openshift-ci openshift-ci Bot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on May 26, 2026
@openshift-ci

openshift-ci Bot commented May 26, 2026

Contributor

@weliang1: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aks-4-21 | 997a620 | link | true | /test e2e-aks-4-21 |
| ci/prow/e2e-aks | 997a620 | link | true | /test e2e-aks |
| ci/prow/e2e-aws-4-21 | 997a620 | link | true | /test e2e-aws-4-21 |


@weliang1 weliang1 force-pushed the add-ovn-zero-workers-test branch from c4ef3c9 to 63b92a7 Compare May 26, 2026 21:49
@openshift-ci openshift-ci Bot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on May 26, 2026
@hypershift-jira-solve-ci


Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

Enterprise Contract verify task result: FAILURE — 250 successes, 20 warnings, 6 failures.
EC policy rules with effective_on dates of 2026-05-13 transitioned from warn→deny,
causing previously-passing builds to fail. This is a repo-wide issue affecting ALL
hypershift PRs since ~May 20, 2026 — not caused by PR #8176's code changes.

Summary

Both check runs fail identically: the Konflux Enterprise Contract (EC) verification step rejects the built hypershift-operator container image against the operator-main release policy. The failure is not caused by PR #8176 — it is a repo-wide infrastructure issue. Every PR on the hypershift repo has been failing these same EC checks since approximately May 20, 2026. The MCE-217 EC scenarios continue to pass because they use a different (less strict) policy configuration.

The root cause is that the EC release policy (quay.io/enterprise-contract/ec-release-policy:latest) was updated on 2026-05-13 with new deny rules that have effective_on: 2026-05-13T00:00:00Z dates. These rules enforce proxy metadata requirements for SBOM packages fetched by Hermeto (the Konflux dependency prefetch system). Before May 13, these rules were warnings; after May 13, they became hard denials. The hypershift-operator build pipeline does not yet satisfy these new requirements.

Historical evidence: PR #8530 (May 15) passed EC, PR #8540 (May 20) failed with 2 failures, and PR #8176 (May 27) now shows 6 failures — the failure count is growing as additional effective_on dates pass while warnings graduate to denials.

Root Cause

The EC release policy introduced three new deny rules effective 2026-05-13:

  1. prefetch_dependencies.package_registry_proxy_enabled (effective_on: 2026-05-13) — Requires the prefetch-dependencies task to have enable-package-registry-proxy set to "true". The hypershift pipeline already has this parameter set, so this rule alone should not cause failure.

  2. sbom_spdx.proxy_metadata_required (effective_on: 2026-05-13) — For packages fetched by Hermeto with PURL types listed in proxy_enabled_purl_types that are registry dependencies, the SPDX SBOM sourceInfo field must be non-empty. If the Hermeto prefetch step is not producing proxy metadata in the SBOM, this rule fires.

  3. sbom_cyclonedx.proxy_metadata_required (effective_on: 2026-05-13) — Same rule as above but for CycloneDX SBOMs.

Additionally, sbom_spdx.allowed_proxy_urls (effective_on: 2026-06-01) will become active on June 1, which will likely add more failures.

The failure count growth (0→2→6) indicates multiple rule violations accumulating. The exact 6 violation messages could not be retrieved because the Konflux cluster (stone-prd-rh01) authentication token has expired and interactive re-authentication is required.

Key factors:

  • All 18 Tekton task bundle references in common-operator-build.yaml are confirmed still trusted (not expired)
  • All 13 required image labels are present on the built image
  • The issue is specifically with SBOM proxy metadata content, not with task trust or image labeling
  • Only the operator-main EC policy triggers failures; the MCE-217 policy passes

Recommendations

  1. Immediate: Engage Konflux/HACBS team — File an issue or reach out to the Konflux team to understand what SBOM changes are needed to satisfy proxy_metadata_required. The sourceInfo field in SPDX SBOMs must be populated for Hermeto-fetched registry dependencies.

  2. Check Hermeto/prefetch configuration — Verify that the prefetch-dependencies task version being used supports generating proxy metadata. A newer version of the task may be required that populates sourceInfo in the SBOM output.

  3. Review proxy_enabled_purl_types rule data — The default in ec-policy-data is [] (empty list). If a policy data update has added purl types (e.g., ["golang"]), all packages of those types fetched by Hermeto must have proxy metadata. Check the current value in quay.io/enterprise-contract/ec-policy-data:latest.

  4. Prepare for June 1 deadline — The allowed_proxy_urls rule becomes effective on 2026-06-01 and will enforce that proxy download URLs match allowed patterns. Address this proactively before the deadline adds more failures.

  5. This is NOT a PR-level fix — Do not attempt to fix this in individual PRs. The fix needs to happen at the pipeline/task/policy-data level, likely through a Konflux pipeline configuration change that affects all builds.

Evidence
| Evidence | Detail |
| --- | --- |
| Affected scope | All hypershift PRs, not just #8176 |
| Failure pattern | EC verify: 250 pass, 20 warn, 6 fail (growing over time) |
| First failure observed | ~May 20, 2026 (PR #8540) |
| Last passing build | ~May 15, 2026 (PR #8530) |
| Policy update date | 2026-05-13 (effective_on for new rules) |
| Failing policy | operator-main Enterprise Contract policy |
| Passing policy | MCE-217 Enterprise Contract policy |
| New deny rule 1 | prefetch_dependencies.package_registry_proxy_enabled (effective 2026-05-13) |
| New deny rule 2 | sbom_spdx.proxy_metadata_required (effective 2026-05-13) |
| New deny rule 3 | sbom_cyclonedx.proxy_metadata_required (effective 2026-05-13) |
| Upcoming deny rule | sbom_spdx.allowed_proxy_urls (effective 2026-06-01) |
| Task bundles | All 18 confirmed trusted and not expired |
| Image labels | All 13 required labels present |
| Pipeline file | .tekton/pipelines/common-operator-build.yaml |
| EC release policy source | quay.io/enterprise-contract/ec-release-policy:latest |
| Limitation | Exact 6 violation messages unavailable (Konflux cluster auth expired) |

@weliang1

weliang1 commented Jun 8, 2026

Author

/retest-failed


Labels

area/testing (indicates the PR includes changes for e2e testing)
jira/valid-reference (indicates that this PR references a valid Jira ticket of any type)
