
OCPBUGS-86238: add CPO overrides for ARO swift-nic resource limits#8610

Merged
celebdor merged 2 commits into openshift:main from celebdor:worktree-cpo-overrides-swift-nic
May 29, 2026

Conversation

@celebdor
Collaborator

@celebdor celebdor commented May 27, 2026

Summary

  • Add CPO override entries for OCP 4.20, 4.21, and 4.22 so ARO clusters get a CPO image that sets limits == requests for the aro.openshift.io/swift-nic extended resource, fixing pod admission failures
  • 4.20 (OCPBUGS-86567): Update all existing entries to new Konflux image, add missing 4.20.7, extend to 4.20.24
  • 4.21 (OCPBUGS-86416): New entries for 4.21.0–4.21.18
  • 4.22 (OCPBUGS-86354): New entries for 4.22.0–4.22.1

All override images are multi-arch (amd64 + arm64) OCI image indexes built by Konflux and verified to contain the respective fix PRs (#8593, #8565, #8564).
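
For context, Kubernetes requires that an extended resource's limit equal its request when both are set, which is why the fixed CPO image aligns the two for aro.openshift.io/swift-nic. A minimal sketch of that alignment follows. The `resources` type and `alignExtendedResource` helper are hypothetical stand-ins for illustration, not the code from the fix PRs.

```go
package main

import "fmt"

// swiftNIC is the extended resource named in the PR description.
const swiftNIC = "aro.openshift.io/swift-nic"

// resources is a simplified, hypothetical stand-in for a container's
// resource requirements (quantities are kept as plain strings here).
type resources struct {
	Requests map[string]string
	Limits   map[string]string
}

// alignExtendedResource copies the request for name into the limits so that
// limits == requests. It does nothing when no request is set for name.
func alignExtendedResource(r *resources, name string) {
	qty, ok := r.Requests[name]
	if !ok {
		return
	}
	if r.Limits == nil {
		r.Limits = map[string]string{}
	}
	r.Limits[name] = qty
}

func main() {
	r := resources{Requests: map[string]string{swiftNIC: "1", "cpu": "100m"}}
	alignExtendedResource(&r, swiftNIC)
	fmt.Println("limits match requests:", r.Limits[swiftNIC] == r.Requests[swiftNIC])
}
```

Only the named extended resource is touched; other requests such as cpu keep whatever limits they already had.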

Test plan

  • Override unit tests pass (go test ./hypershift-operator/controlplaneoperator-overrides/...)
  • Verify CPO override resolution returns the correct image for Azure platform + each version

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Chores
    • Updated platform overrides to refresh control-plane operator images for OpenShift 4.20 (through 4.20.24) and 4.21 (through 4.21.18); clarified that 4.22 does not require an override.
  • New Features
    • Added deployment/stream definitions to enable automated delivery for control-plane-operator v4.21 and v4.22.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 27, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 27, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 27, 2026
@openshift-ci-robot

@celebdor: This pull request references Jira Issue OCPBUGS-86238, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is ON_QA instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Contributor

coderabbitai Bot commented May 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 8283e198-1b2e-49d3-986d-863dfb897916

📥 Commits

Reviewing files that changed from the base of the PR and between 412a258 and 60d9fe4.

📒 Files selected for processing (3)
  • contrib/konflux/cpo_4_21_stream.yaml
  • contrib/konflux/cpo_4_22_stream.yaml
  • hypershift-operator/controlplaneoperator-overrides/assets/overrides.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • contrib/konflux/cpo_4_21_stream.yaml
  • contrib/konflux/cpo_4_22_stream.yaml

📝 Walkthrough


This PR replaces the Azure CPO 4.20 override mappings with a shared cpoImage digest and extends 4.20 coverage through 4.20.24 (OCPBUGS-86567). It adds an Azure CPO override block for 4.21.0–4.21.18 using a shared 4.21 digest (OCPBUGS-86416) and inserts comments stating 4.22 does not require an override. It also adds two Konflux ProjectDevelopmentStream manifests to create control-plane-operator-v4-21 and control-plane-operator-v4-22 streams wired to the hypershift-cpo-template.
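
The override mapping described above can be pictured as a platform-and-version lookup in which a whole patch range shares one image. The sketch below is an illustration only: the `overrides` type, its methods, and the image strings are hypothetical placeholders and do not reflect the real overrides.yaml schema or digests.

```go
package main

import "fmt"

// overrides maps platform -> OCP version -> CPO image reference.
// Hypothetical structure for illustration.
type overrides map[string]map[string]string

// addRange registers one shared image for every patch version
// minor.first through minor.last, inclusive.
func (o overrides) addRange(platform, minor string, first, last int, image string) {
	if o[platform] == nil {
		o[platform] = map[string]string{}
	}
	for z := first; z <= last; z++ {
		o[platform][fmt.Sprintf("%s.%d", minor, z)] = image
	}
}

// resolve returns the override image for a platform and version,
// and false when no override applies.
func (o overrides) resolve(platform, version string) (string, bool) {
	img, ok := o[platform][version]
	return img, ok
}

func main() {
	o := overrides{}
	// Placeholder image names, not the digests used in the PR.
	o.addRange("azure", "4.20", 0, 24, "example.invalid/cpo:4.20-fix")
	o.addRange("azure", "4.21", 0, 18, "example.invalid/cpo:4.21-fix")

	for _, v := range []string{"4.20.7", "4.21.18", "4.22.0"} {
		img, ok := o.resolve("azure", v)
		fmt.Printf("azure %s -> override=%t image=%q\n", v, ok, img)
	}
}
```

A version with no entry, such as 4.22.0 here, simply falls through to the release's own CPO image, which matches the note that 4.22 needs no override.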


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Stable And Deterministic Test Names | ❌ Error | PR adds overrides_test.go with dynamic test names using t.Run(fmt.Sprintf("test-%d", i+1)) generating "test-1", "test-2", etc. based on loop index, violating stable test name requirements. | Replace dynamic fmt.Sprintf("test-%d", i+1) with descriptive static names (e.g., based on platform and version being tested). |
| Title check | ⚠️ Warning | The title references OCPBUGS-86238 but the PR objectives show this bug is invalid for the actual changes, which address OCPBUGS-86567, OCPBUGS-86416, and OCPBUGS-86354. The changes add CPO overrides for multiple OCP versions (4.20-4.22) with Konflux stream manifests, not specifically for ARO swift-nic resource limits as the title suggests. | Update the title to accurately reflect the main changes: adding CPO overrides for OCP 4.20-4.22 via OCPBUGS-86567, OCPBUGS-86416, and OCPBUGS-86354, or use the correct primary bug reference. |
✅ Passed checks (9 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Test Structure And Quality | ✅ Passed | PR contains only YAML configuration and manifest changes; no Ginkgo tests added or modified. The check is not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR changes are configuration/build files only: CPO image version overrides and Konflux build stream definitions. No deployment manifests, pod specs, or scheduling constraints are introduced. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR adds only YAML configuration and manifest files (overrides.yaml, cpo_4_21_stream.yaml, cpo_4_22_stream.yaml) with no new Ginkgo e2e tests; check not applicable. |
| No-Weak-Crypto | ✅ Passed | PR contains only YAML config and standard Go code. No weak crypto algorithms, custom crypto implementations, or insecure secret comparisons detected. |
| Container-Privileges | ✅ Passed | None of the modified files contain container security violations. The files are configuration data (image overrides) and build stream definitions with no privileged container specifications. |
| No-Sensitive-Data-In-Logs | ✅ Passed | PR contains no logging code and no sensitive data. All files are configuration (YAML) with public image digests and comments only. |
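
The resolution suggested for the Stable And Deterministic Test Names check could look roughly like the sketch below, which derives each subtest name from the case's own data instead of its loop index. The `overrideCase` type and `name` method are hypothetical and are not taken from overrides_test.go.

```go
package main

import "fmt"

// overrideCase is a hypothetical table-test entry.
type overrideCase struct {
	platform string
	version  string
}

// name builds a stable, descriptive subtest name from the case data,
// so reordering or inserting cases does not rename existing subtests.
func (c overrideCase) name() string {
	return fmt.Sprintf("%s_%s", c.platform, c.version)
}

func main() {
	cases := []overrideCase{
		{platform: "azure", version: "4.20.7"},
		{platform: "azure", version: "4.21.18"},
	}
	for _, c := range cases {
		// In a real test this would be:
		//   t.Run(c.name(), func(t *testing.T) { ... })
		fmt.Println(c.name())
	}
}
```

Names built this way stay the same across runs, so a failing subtest can be selected directly with `go test -run`.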

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels May 27, 2026
@csrwng
Contributor

csrwng commented May 27, 2026

/approve

@openshift-ci
Contributor

openshift-ci Bot commented May 27, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: celebdor, csrwng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 27, 2026
celebdor and others added 2 commits May 27, 2026 22:18
…e limits

Add Azure CPO image overrides for 4.20 and 4.21 to fix the
aro.openshift.io/swift-nic extended resource limits not matching
requests, which causes Kubernetes pod admission failures.

4.22 does not need an override: the fix (PR openshift#8564, commit d6c72d1)
landed before rc.5 and will be included in the GA release.

- 4.20.0-4.20.24: OCPBUGS-86567 (PR openshift#8593 cherry-pick to release-4.20)
- 4.21.0-4.21.18: OCPBUGS-86416 (PR openshift#8565 cherry-pick to release-4.21)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add ProjectDevelopmentStream YAML files for control-plane-operator
versions 4.21 and 4.22, matching the existing pattern used for 4.19
and 4.20.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@celebdor celebdor force-pushed the worktree-cpo-overrides-swift-nic branch from 412a258 to 60d9fe4 Compare May 27, 2026 20:19
@celebdor celebdor added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 27, 2026
@celebdor celebdor marked this pull request as ready for review May 27, 2026 20:20
@openshift-ci-robot

@celebdor: This pull request references Jira Issue OCPBUGS-86238, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is ON_QA instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Retaining the jira/valid-bug label as it was manually added.


@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 27, 2026
@codecov

codecov Bot commented May 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.61%. Comparing base (2f52041) to head (60d9fe4).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8610   +/-   ##
=======================================
  Coverage   40.61%   40.61%           
=======================================
  Files         755      755           
  Lines       93227    93227           
=======================================
  Hits        37864    37864           
  Misses      52640    52640           
  Partials     2723     2723           
| Flag | Coverage Δ |
| --- | --- |
| cmd-support | 34.70% <ø> (ø) |
| cpo-hostedcontrolplane | 41.77% <ø> (ø) |
| cpo-other | 41.06% <ø> (ø) |
| hypershift-operator | 50.75% <ø> (ø) |
| other | 31.58% <ø> (ø) |


@openshift-ci openshift-ci Bot requested review from Nirshal and cblecker May 27, 2026 20:31
@cblecker
Member

/uncc

@hypershift-jira-solve-ci

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-aws-override
  • Build ID: 2059731962586730496
  • Target: e2e-aws-override
  • Test: TestAutoscaling/Main/TestAutoscalerRespectsNodePoolPause
  • Result: 277 tests, 74 skipped, 3 failures (1 unique failing subtest)

Test Failure Analysis

Error

eventually.go:225: observed *v1.Deployment e2e-clusters-4mzc5-autoscaling-g5k8s/cluster-autoscaler invalid at RV 107544 after 5m0s: autoscaler deployment missing --scale-down-unneeded-time=60s arg
autoscaling_test.go:533: Failed to wait for autoscaler deployment to have required settings and be ready in 5m0s: context deadline exceeded

Summary

The TestAutoscalerRespectsNodePoolPause test failed because it sets HostedCluster.Spec.Autoscaling.ScaleDown configuration (with UnneededDurationSeconds=60) and then waits 5 minutes for the cluster-autoscaler deployment to contain --scale-down-unneeded-time=60s in its args. The arg never appeared because the e2e-aws-override job runs hosted clusters with the 4.17.x OCP release image, which means the Control Plane Operator (CPO) running inside the control plane namespace is a 4.17-era override image (ocp-v4.0-art-dev@sha256:75f141...). That older CPO does not understand the ClusterAutoscaling.ScaleDown API field (added post-4.17 on main) and therefore never propagates the scale-down args to the autoscaler deployment. This failure is unrelated to PR #8610, which only modifies Azure-platform CPO override entries and Konflux stream definitions — the AWS override section and testing configuration are completely untouched.

Root Cause

The root cause is a missing version gate on the TestAutoscalerRespectsNodePoolPause test (a pre-existing bug in the test suite, not introduced by this PR).

Detailed chain of causation:

  1. The e2e-aws-override job creates hosted clusters using the AWS testing release images from overrides.yaml: latest: 4.17.43 / previous: 4.17.20.

  2. For AWS 4.17.43, the override table maps to CPO image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:75f141026e4c4efb68ba1691942ac2d2abae906b402bdce85ed5f967712d1e7e — a CPO built from the OCP 4.17 release branch.

  3. TestAutoscalerRespectsNodePoolPause (a regression test for OCPBUGS-78152) updates HostedCluster.Spec.Autoscaling with ScaleDown config including UnneededDurationSeconds: 60, then calls waitForAutoscalerDeploymentReady() expecting the cluster-autoscaler deployment to contain --scale-down-unneeded-time=60s.

  4. The ClusterAutoscaling.ScaleDown field and the ScaleDownArgs() function in the CPO (control-plane-operator/controllers/hostedcontrolplane/v2/autoscaler/deployment.go) were added on the main branch (4.18+). The 4.17 CPO override image does not have this code, so it ignores the ScaleDown configuration entirely and never sets the --scale-down-unneeded-time argument.

  5. The test times out after 5 minutes waiting for an arg that will never appear.

  6. Unlike TestAutoscalingBalancing which correctly gates itself with e2eutil.AtLeast(t, e2eutil.Version420), TestAutoscalerRespectsNodePoolPause has no version gate and runs unconditionally against any hosted cluster version — including 4.17.x where the feature doesn't exist.

Why this is unrelated to PR #8610: The PR only modifies overrides.yaml entries under the azure: platform section (adding CPO override images for ARO swift-nic resource limits in 4.20/4.21) and adds Konflux stream definitions. The aws: override entries and aws: testing: configuration are completely unchanged. The failing test would fail identically on any PR that triggers the e2e-aws-override job.
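
To make steps 3 and 4 concrete, the sketch below shows how a scale-down setting could be turned into the cluster-autoscaler argument the test waits for. `scaleDownConfig` and `scaleDownArgs` are hypothetical names for illustration, not the actual ScaleDownArgs() implementation. A CPO built without such a translation step would never emit the flag, which is the behavior the analysis attributes to the 4.17 override image.

```go
package main

import "fmt"

// scaleDownConfig is a hypothetical, reduced view of the ScaleDown settings.
type scaleDownConfig struct {
	UnneededDurationSeconds *int32
}

// scaleDownArgs translates the config into cluster-autoscaler arguments.
// It returns no arguments when the config or the field is unset.
func scaleDownArgs(cfg *scaleDownConfig) []string {
	if cfg == nil || cfg.UnneededDurationSeconds == nil {
		return nil
	}
	return []string{
		fmt.Sprintf("--scale-down-unneeded-time=%ds", *cfg.UnneededDurationSeconds),
	}
}

func main() {
	secs := int32(60)
	fmt.Println(scaleDownArgs(&scaleDownConfig{UnneededDurationSeconds: &secs}))
	fmt.Println(len(scaleDownArgs(nil)))
}
```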

Recommendations
  1. Add a version gate to TestAutoscalerRespectsNodePoolPause: The test should call e2eutil.AtLeast(t, e2eutil.Version418) (or whichever version first shipped the ClusterAutoscaling.ScaleDown support in the CPO) at the beginning of testAutoscalerRespectsNodePoolPause(), matching the pattern used by TestAutoscalingBalancing. This will cause the test to skip when running against 4.17.x release images in the override job.

  2. Retest / override: This failure is a pre-existing test issue and is safe to /retest or override. The PR's Azure CPO override changes are data-only and cannot affect AWS e2e behavior.

  3. File a bug for the missing version gate on TestAutoscalerRespectsNodePoolPause to prevent this flake from blocking future PRs that trigger the e2e-aws-override job.
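
The version gate in recommendation 1 amounts to comparing the hosted cluster's release against a minimum and skipping the test when it is older. The following is a rough sketch of that comparison. `majorMinor` and `atLeast` are hypothetical helpers written for this example and are not the e2eutil.AtLeast implementation; the 4.18 minimum is likewise an assumption taken from the recommendation's own hedge.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorMinor parses the first two components of a version such as "4.17.43".
func majorMinor(v string) (int, int, error) {
	parts := strings.Split(v, ".")
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("invalid version %q", v)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, fmt.Errorf("invalid major in %q: %w", v, err)
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, 0, fmt.Errorf("invalid minor in %q: %w", v, err)
	}
	return major, minor, nil
}

// atLeast reports whether release is at or above minimum,
// comparing only the major and minor components.
func atLeast(release, minimum string) (bool, error) {
	rMaj, rMin, err := majorMinor(release)
	if err != nil {
		return false, err
	}
	mMaj, mMin, err := majorMinor(minimum)
	if err != nil {
		return false, err
	}
	return rMaj > mMaj || (rMaj == mMaj && rMin >= mMin), nil
}

func main() {
	for _, release := range []string{"4.17.43", "4.20.24"} {
		ok, err := atLeast(release, "4.18")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		// A real test would call t.Skip when ok is false.
		fmt.Printf("%s: run=%t\n", release, ok)
	}
}
```

With a gate like this at the top of the test, a run against the 4.17.x override images would skip rather than time out.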

Evidence

| Evidence | Detail |
| --- | --- |
| Failing test | TestAutoscaling/Main/TestAutoscalerRespectsNodePoolPause (300.06s timeout) |
| Error message | autoscaler deployment missing --scale-down-unneeded-time=60s arg after 5m0s |
| AWS testing release | latest: 4.17.43, previous: 4.17.20 (from overrides.yaml AWS testing section) |
| AWS CPO override image (4.17.43) | ocp-v4.0-art-dev@sha256:75f141026e4c4efb68ba1691942ac2d2abae906b402bdce85ed5f967712d1e7e |
| Missing version gate | testAutoscalerRespectsNodePoolPause has no e2eutil.AtLeast() call, unlike testAutoscalingBalancing which uses e2eutil.AtLeast(t, e2eutil.Version420) |
| ScaleDown support location | control-plane-operator/controllers/hostedcontrolplane/v2/autoscaler/deployment.go:153: ScaleDownArgs() added on main branch, absent in 4.17 CPO |
| PR #8610 scope | Only modifies azure: section of overrides.yaml + adds Konflux stream files; AWS section untouched |
| Files changed by PR | contrib/konflux/cpo_4_21_stream.yaml, contrib/konflux/cpo_4_22_stream.yaml, overrides.yaml (Azure section only) |
| Overall test results | 277 tests, 74 skipped, 3 failures (parent tests TestAutoscaling and TestAutoscaling/Main fail because subtest failed) |
| Other passing tests | TestCreateClusterCustomConfig, TestUpgradeControlPlane, TestNodePoolAutoscalingScaleFromZero, TestCreateClusterProxy, TestCreateClusterPrivate, TestCreateClusterRequestServingIsolation all PASS |

@enxebre
Member

enxebre commented May 28, 2026

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 28, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22

@enxebre
Member

enxebre commented May 28, 2026

validated via #8616

@openshift-ci
Contributor

openshift-ci Bot commented May 28, 2026

@celebdor: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-override | 60d9fe4 | link | true | /test e2e-aws-override |
| ci/prow/e2e-aks-override | 60d9fe4 | link | true | /test e2e-aks-override |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@enxebre
Member

enxebre commented May 28, 2026

PR desc needs updating

@celebdor celebdor merged commit 84e6827 into openshift:main May 29, 2026
32 of 35 checks passed
@openshift-ci-robot

@celebdor: Jira Issue OCPBUGS-86238 is in an unrecognized state (ON_QA) and will not be moved to the MODIFIED state.


@openshift-merge-robot
Contributor

Fix included in release 5.0.0-0.nightly-2026-05-30-072431

enxebre added a commit to enxebre/hypershift that referenced this pull request Jun 3, 2026
Add 4.20.25 and 4.21.19 override entries using the same images
from PR openshift#8610 to exercise the validate-pr-override-images workflow
end-to-end.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.


6 participants