AUTOSCALE-681: various karpenter and karpenterupgrade test fixes#8510

Open
maxcao13 wants to merge 3 commits into
openshift:mainfrom
maxcao13:fix-karpenter-e2e-parallel-races

@maxcao13 maxcao13 commented May 13, 2026

What this PR does / why we need it:

Scope waitForReadyKarpenterPods to workload labels: The function listed all pods in the default namespace, which picked up pods from other parallel subtests running concurrently. This caused count mismatches and wrong-node assertions. Now each caller passes its workload's app label so it only sees its own pods.
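
As a rough illustration of the scoping idea, the sketch below filters a shared pod list by a label set so that one subtest never counts another subtest's pods. It is not the helper from this PR: the `pod` struct, `matchesLabels`, and `readyPodsWithLabels` are hypothetical stand-ins for the Kubernetes types and the real `waitForReadyKarpenterPods` logic, and the pod names are made up.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for a Kubernetes Pod, reduced to the
// fields this sketch needs. It is not the corev1.Pod type.
type pod struct {
	Name   string
	Labels map[string]string
	Ready  bool
}

// matchesLabels reports whether every key/value pair in want is present in have.
// An empty want matches every pod, which is the unscoped behaviour that let
// parallel subtests see each other's pods.
func matchesLabels(have, want map[string]string) bool {
	for k, v := range want {
		if got, ok := have[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// readyPodsWithLabels (hypothetical) returns the names of ready pods that
// carry all of podLabels.
func readyPodsWithLabels(pods []pod, podLabels map[string]string) []string {
	var names []string
	for _, p := range pods {
		if p.Ready && matchesLabels(p.Labels, podLabels) {
			names = append(names, p.Name)
		}
	}
	return names
}

func main() {
	// Pods from two parallel subtests sharing the default namespace.
	pods := []pod{
		{Name: "web-app-1", Labels: map[string]string{"app": "web-app"}, Ready: true},
		{Name: "web-app-2", Labels: map[string]string{"app": "web-app"}, Ready: true},
		{Name: "arm-app-1", Labels: map[string]string{"app": "arm-app"}, Ready: true},
	}

	// Unscoped: every ready pod is counted, including the other subtest's.
	fmt.Println(len(readyPodsWithLabels(pods, nil)))

	// Scoped: only the caller's own workload is visible.
	fmt.Println(readyPodsWithLabels(pods, map[string]string{"app": "web-app"}))
}
```

In the real tests the filtering would more likely be pushed into the list call as a label selector rather than done client-side; the sketch only shows why the count changes once the label is applied.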

Use correct version gate for TestKarpenterUpgradeControlPlane: Switch from AtLeast(Version422) to ShouldRunKarpenterTests, which is the standard gate used by the other karpenter tests and respects the RUN_KARPENTER_TESTS env var.
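
For readers unfamiliar with env-var gating, here is a minimal sketch of the general pattern. It is an assumption-heavy illustration, not the real `e2eutil.ShouldRunKarpenterTests`: the function name `shouldRunKarpenterTests`, its signature, and the choice to parse the value as a boolean are all hypothetical, and the real gate may also check versions or platform.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// shouldRunKarpenterTests is a hypothetical env-var gate. It takes the lookup
// function as a parameter so it can be exercised without touching the real
// process environment. ASSUMPTION: the variable is treated as a boolean and an
// unset or unparsable value means "skip"; the real helper may behave differently.
func shouldRunKarpenterTests(lookup func(string) (string, bool)) bool {
	raw, ok := lookup("RUN_KARPENTER_TESTS")
	if !ok {
		return false
	}
	enabled, err := strconv.ParseBool(raw)
	return err == nil && enabled
}

func main() {
	// os.LookupEnv has the matching signature: func(string) (string, bool).
	if !shouldRunKarpenterTests(os.LookupEnv) {
		fmt.Println("skipping karpenter tests")
		return
	}
	fmt.Println("running karpenter tests")
}
```

Routing every Karpenter test through one gate like this keeps the skip condition in a single place, which is the consistency argument made above.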

Use arm-compatible image in ARM64 test: quay.io/openshift/origin-pod is amd64-only and can't actually run on arm nodes. Use quay.io/hypershift/sleep:multiarch (multi-arch) for the arm test via a new testWorkloadWithImage helper.
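
One plausible shape for such a helper is sketched below: the existing constructor keeps its default image and delegates to a variant that accepts the image explicitly. The `workload` struct, the constant names, and both function signatures are hypothetical; the real `testWorkloadWithImage` presumably builds a Kubernetes workload object and may take different parameters.

```go
package main

import "fmt"

const (
	// Image names are taken from the PR description; the constant names are made up.
	defaultImage   = "quay.io/openshift/origin-pod"       // amd64-only per the description
	multiArchImage = "quay.io/hypershift/sleep:multiarch" // used for the arm test
)

// workload is a hypothetical stand-in for the object the real helper returns.
type workload struct {
	Name     string
	Replicas int
	Image    string
	Labels   map[string]string
}

// testWorkload keeps the old call sites unchanged by delegating with the default image.
func testWorkload(name string, replicas int) workload {
	return testWorkloadWithImage(name, replicas, defaultImage)
}

// testWorkloadWithImage lets a caller override the container image. The app
// label is derived from the name so each workload can be selected on its own.
func testWorkloadWithImage(name string, replicas int, image string) workload {
	return workload{
		Name:     name,
		Replicas: replicas,
		Image:    image,
		Labels:   map[string]string{"app": name},
	}
}

func main() {
	web := testWorkload("web-app", 2)
	arm := testWorkloadWithImage("arm-app", 1, multiArchImage)
	fmt.Println(web.Name, web.Image)
	fmt.Println(arm.Name, arm.Image)
}
```

Keeping the original function as a thin wrapper means only the ARM64 test needs to change its call.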

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

I was unaware of #8466 when I merged #8498. This PR reconciles that.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Tests
    • Gate Karpenter upgrade test with environment-aware runner to skip when not applicable
    • Add pod label filtering (e.g., app=web-app) to Karpenter readiness checks before and after upgrades
    • Generalize readiness helper to accept label selectors for more precise verification
    • Add ARM64 workload handling and configurable test workload images for broader platform coverage

maxcao13 and others added 2 commits May 13, 2026 15:49
waitForReadyKarpenterPods listed all pods in the default namespace,
which picked up pods from other parallel subtests. This caused
count mismatches and wrong-node assertions.

Filter by the workload's app label so each test only sees its own
pods.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Max Cao <macao@redhat.com>
@openshift-merge-bot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 13, 2026

openshift-ci Bot commented May 13, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci-robot commented May 13, 2026

@maxcao13: This pull request references AUTOSCALE-681 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Scope waitForReadyKarpenterPods to workload labels: The function listed all pods in the default namespace, which picked up pods from other parallel subtests running concurrently. This caused count mismatches and wrong-node assertions. Now each caller passes its workload's app label so it only sees its own pods.

Use correct version gate for TestKarpenterUpgradeControlPlane: Switch from AtLeast(Version422) to ShouldRunKarpenterTests, which is the standard gate used by the other karpenter tests and respects the RUN_KARPENTER_TESTS env var.

Use arm-compatible image in ARM64 test: quay.io/openshift/origin-pod is amd64-only and can't actually run on arm nodes. Use registry.k8s.io/pause:3.10 (multi-arch) for the arm test via a new testWorkloadWithImage helper.

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

I was unaware of #8466 when I merged #8498. This PR reconciles that.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

@openshift-ci openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-area area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels May 13, 2026

coderabbitai Bot commented May 13, 2026

📝 Walkthrough

Walkthrough

The pull request updates two Karpenter E2E test files: the control-plane upgrade test now uses e2eutil.ShouldRunKarpenterTests(t) for gating and applies a label selector when waiting for Karpenter pods; the main Karpenter test refactors waitForReadyKarpenterPods to accept podLabels, scopes readiness checks by those labels, and introduces testWorkloadWithImage to allow specifying workload container images (used for ARM64 compatibility).

Suggested reviewers

  • clebs
  • enxebre
  • bryan-cox
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Single Node Openshift (Sno) Test Compatibility | ⚠️ Warning | Tests TestKarpenter and TestKarpenterUpgradeControlPlane require Karpenter node auto-scaling. No SNO skip protection. Tests will fail on single-node clusters. | Add [Skipped:SingleReplicaTopology] label to test names, or guard with exutil.IsSingleNode() check. Karpenter provisioning assumes multiple nodes available for auto-scaling. |
✅ Passed checks (10 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly identifies the main changes: test fixes for karpenter and karpenter upgrade tests, with specific reference to the AUTOSCALE-681 issue. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test names are stable and deterministic. Main test functions and all subtests via t.Run() use static strings with no dynamic content. |
| Test Structure And Quality | ✅ Passed | Helper filters pods by labels, preventing parallel test interference. Proper timeouts, messages, and cleanup. t.Cleanup() usage and gate alignment with patterns all follow quality standards. |
| Microshift Test Compatibility | ✅ Passed | PR modifies only existing tests, not adding new ones. Both test functions have platform protection (AWS-only, Karpenter-specific gates). No MicroShift-incompatible APIs used. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Check scope is deployment manifests, operator code, or controllers. PR modifies only E2E test files (test/e2e/ with e2e build tag). No production code affected. |
| Ote Binary Stdout Contract | ✅ Passed | PR changes do not violate OTE Binary Stdout Contract. No process-level stdout writes, klog without stderr redirection, or problematic Ginkgo suite configuration detected. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | Custom check applies only to NEW Ginkgo tests (It(), Describe(), Context(), When()). PR modifies existing traditional Go tests using func Test*() and t.Run(), not Ginkgo. Check is not applicable. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@maxcao13

/test e2e-aws

codecov Bot commented May 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.00%. Comparing base (674d92a) to head (9b68e6a).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8510   +/-   ##
=======================================
  Coverage   40.00%   40.00%           
=======================================
  Files         751      751           
  Lines       92838    92838           
=======================================
  Hits        37137    37137           
  Misses      53014    53014           
  Partials     2687     2687           
| Flag | Coverage Δ |
|---|---|
| cmd-support | 34.09% <ø> (ø) |
| cpo-hostedcontrolplane | 40.56% <ø> (ø) |
| cpo-other | 40.14% <ø> (ø) |
| hypershift-operator | 50.53% <ø> (ø) |
| other | 31.54% <ø> (ø) |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/karpenter_test.go`:
- Around line 318-320: The ARM64 e2e workload is using a public registry image
("registry.k8s.io/pause:3.10") in the armWorkLoads created via
testWorkloadWithImage, which introduces external network dependency; update the
image argument passed to testWorkloadWithImage for the "arm-app" workload (and
any related armNodePool usage) to a CI-backed/mirrored multi-arch image hosted
on our internal registry (or a known-mirrored-by-CI image) that supports arm64
so the test no longer pulls from the public internet; ensure the chosen image is
multi-arch and referenced consistently in the test setup.

📥 Commits

Reviewing files that changed from the base of the PR and between 674d92a and 80fe16f.

📒 Files selected for processing (2)
  • test/e2e/karpenter_control_plane_upgrade_test.go
  • test/e2e/karpenter_test.go

Comment thread test/e2e/karpenter_test.go
The quay.io/openshift/origin-pod manifest is not multi-arch and cannot run on arm nodes,
although it can still be scheduled there. We use quay.io/hypershift/sleep:multiarch specifically for the arm test.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Max Cao <macao@redhat.com>
@maxcao13 maxcao13 force-pushed the fix-karpenter-e2e-parallel-races branch from 80fe16f to 9b68e6a Compare May 14, 2026 01:36
@maxcao13 maxcao13 marked this pull request as ready for review May 14, 2026 01:37
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 14, 2026
@openshift-ci openshift-ci Bot requested review from cblecker and muraee May 14, 2026 01:38
@maxcao13

/test e2e-aws
/test e2e-aws-autonode
/test e2e-aws-4.22

@cwbotbot

Test Results

e2e-aws

openshift-ci Bot commented May 14, 2026

@maxcao13: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-aws | 9b68e6a | link | true | /test e2e-aws |

enxebre commented May 14, 2026

/approve
/hold
needs openshift/release#79051

feel free to cancel afterwards with the test passing

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 14, 2026

openshift-ci Bot commented May 14, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre, maxcao13

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 14, 2026
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/testing Indicates the PR includes changes for e2e testing do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

4 participants