OCPBUGS-86662: Optimize CPO deployment polling interval in tests of additional trust bundle propagation by YamunadeviShanmugam · Pull Request #8617 · openshift/hypershift

YamunadeviShanmugam · 2026-05-28T08:27:44Z

This PR addresses inconsistent test timeouts in the trust bundle propagation tests. The root cause was aggressive API polling during the CPO deployment check, which lacked an explicit interval configuration and defaulted to the global 3s threshold.

By explicitly setting this check to 15s, we reduce API polling frequency by 5x, preventing API throttling

Summary by CodeRabbit

Tests
- Centralized and reused timing parameters in node pool end-to-end tests for improved maintainability and consistency
- Increased timeout for node pool status validation to reduce flakiness and improve reliability

openshift-merge-bot · 2026-05-28T08:27:47Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci-robot · 2026-05-28T08:27:52Z

@YamunadeviShanmugam: This pull request references Jira Issue OCPBUGS-86662, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Problem
CI fails in the TestAdditionalTrustBundlePropagation test suite due to timeout paired with overly aggressive API polling. During the test, injecting a custom trust bundle triggers a rolling update that requires spinning up new cloud vms, executing Ignition configs, and waiting for the new nodes to fully register which takes ~30mins to complete successfully. Because the test cuts off at exactly 20 minutes, it flags healthy, progressing rollouts as false-negative timeouts. Compounding this issue, different segments of the test poll the K8s API server for status updates as frequently as every 3 to 10 seconds. This hyper-aggressive polling floods the management cluster with traffic, triggering client-side rate limiters that cause API starvation and crash the test blocks.

Fix
To resolve this instability, the test has been refactored to replace the hardcoded time constraints with constants. Timeouts and Polling interval are updated accordingly.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2026-05-28T08:28:22Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: YamunadeviShanmugam
Once this PR has been reviewed and has the lgtm label, please assign jparrill for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-05-28T08:30:17Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR centralizes E2E test polling and timeout values by adding shared constants in test/e2e/nodepool_additionalTrustBundlePropagation_test.go and applying them to all related waits. It increases the NodePool condition wait in test/e2e/nodepool_test.go from 20 to 30 minutes. Dockerfile.e2e is updated to install azure-cli via dnf with added flags: --nobest and --setopt=install_weak_deps=False.

🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Title check	⚠️ Warning	The title mentions optimizing CPO deployment polling interval, but the actual changes also address inconsistent timeouts and include unrelated Docker build optimization for Azure CLI installation.	Update the title to reflect all key changes: optimizing polling intervals, increasing timeouts for node pool status validation, and updating Azure CLI installation options. Consider a more comprehensive title like 'OCPBUGS-86662: Fix inconsistent timeouts and aggressive API polling in additional trust bundle propagation tests'.

✅ Passed checks (10 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	PR introduces no dynamic test names. Test in nodepool_additionalTrustBundlePropagation_test.go uses static name; nodepool_test.go only modified timeout values.
Test Structure And Quality	✅ Passed	Test code meets quality standards: proper timeout constants defined, all Eventually calls have explicit timeouts, descriptive error messages present, consistent e2eutil patterns used throughout.
Topology-Aware Scheduling Compatibility	✅ Passed	PR modifies only E2E test files and test Docker infrastructure. No deployment manifests, operator code, controllers, or topology-aware scheduling constraints are introduced.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	PR doesn't add Ginkgo e2e tests (check requirement). New test uses Go testing.T/t.Run() pattern. No IPv4 assumptions or external connectivity detected in code.
No-Weak-Crypto	✅ Passed	No weak cryptography found. Changes involve test timing parameters, timeout adjustments, and Docker build configuration only—no crypto implementations or insecure patterns.
Container-Privileges	✅ Passed	PR contains no Kubernetes manifests or container privilege escalation. Changes are E2E test timing refactoring and Dockerfile dnf install constraint flags (security improvement).
No-Sensitive-Data-In-Logs	✅ Passed	No sensitive data exposure in logs. Changes involve timeout refactoring and Docker build parameters with existing logging only showing test names and hardcoded constants.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci-robot · 2026-05-28T08:31:21Z

@YamunadeviShanmugam: This pull request references Jira Issue OCPBUGS-86662, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

Problem
CI fails in the TestAdditionalTrustBundlePropagation test suite due to timeout paired with overly aggressive API polling. During the test, injecting a custom trust bundle triggers a rolling update that requires spinning up new cloud vms, executing Ignition configs, and waiting for the new nodes to fully register which takes ~30mins to complete successfully. Because the test cuts off at exactly 20 minutes, it flags healthy, progressing rollouts as false-negative timeouts. Compounding this issue, different segments of the test poll the K8s API server for status updates as frequently as every 3 to 10 seconds. This hyper-aggressive polling floods the management cluster with traffic, triggering client-side rate limiters that cause API starvation and crash the test blocks.

Fix
To resolve this instability, the test has been refactored to replace the hardcoded time constraints with constants. Timeouts and Polling interval are updated accordingly.

Summary by CodeRabbit

Tests

Refactored timing parameters in node pool end-to-end tests for improved maintainability

Increased validation timeout for node pool status checks to enhance test reliability

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

YamunadeviShanmugam · 2026-05-28T08:32:24Z

/pipeline required

openshift-merge-bot · 2026-05-28T08:32:27Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

coderabbitai

🧹 Nitpick comments (1)

test/e2e/nodepool_additionalTrustBundlePropagation_test.go (1)
137-137: ⚡ Quick win

Extract hardcoded timeout to a constant for consistency.

Both locations use a hardcoded 5*time.Minute timeout for user-ca-bundle operations, which is inconsistent with the PR's stated objective to "replace hardcoded time constraints with constants." Consider extracting this value as userCABundleTimeout in the const block (lines 57-62) and referencing it here.
♻️ Suggested refactor

Add to the const block:
 const (
 	nodePoolConfigUpdateStartTimeout  = 5 * time.Minute
 	nodePoolConfigUpdateFinishTimeout = 30 * time.Minute
 	nodePoolConfigUpdatePollInterval  = 15 * time.Second
 	cpoDeploymentUpdateTimeout        = 10 * time.Minute
+	userCABundleTimeout               = 5 * time.Minute
 )
Then update both call sites:
-		e2eutil.WithInterval(nodePoolConfigUpdatePollInterval), e2eutil.WithTimeout(5*time.Minute),
+		e2eutil.WithInterval(nodePoolConfigUpdatePollInterval), e2eutil.WithTimeout(userCABundleTimeout),
Also applies to: 219-219
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/nodepool_additionalTrustBundlePropagation_test.go` at line 137,
Replace the hardcoded 5*time.Minute timeout used in e2eutil.WithTimeout calls
with a named constant by adding userCABundleTimeout to the existing const block
(currently containing other timeouts) and then change the
e2eutil.WithTimeout(5*time.Minute) call sites (e.g., the call that currently
uses e2eutil.WithInterval(nodePoolConfigUpdatePollInterval),
e2eutil.WithTimeout(5*time.Minute)) to use
e2eutil.WithTimeout(userCABundleTimeout) so both locations reference the new
constant.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/e2e/nodepool_additionalTrustBundlePropagation_test.go`:
- Line 137: Replace the hardcoded 5*time.Minute timeout used in
e2eutil.WithTimeout calls with a named constant by adding userCABundleTimeout to
the existing const block (currently containing other timeouts) and then change
the e2eutil.WithTimeout(5*time.Minute) call sites (e.g., the call that currently
uses e2eutil.WithInterval(nodePoolConfigUpdatePollInterval),
e2eutil.WithTimeout(5*time.Minute)) to use
e2eutil.WithTimeout(userCABundleTimeout) so both locations reference the new
constant.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 1a09827c-9c36-47e4-8c0e-6753421c9155

📥 Commits

Reviewing files that changed from the base of the PR and between 0aa20fc and e746a44.

📒 Files selected for processing (2)

test/e2e/nodepool_additionalTrustBundlePropagation_test.go
test/e2e/nodepool_test.go

codecov · 2026-05-28T08:37:28Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.43%. Comparing base (f13c62d) to head (3b08f70).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #8617   +/-   ##
=======================================
  Coverage   41.43%   41.43%           
=======================================
  Files         756      756           
  Lines       93647    93647           
=======================================
  Hits        38802    38802           
  Misses      52124    52124           
  Partials     2721     2721

Flag	Coverage Δ
cmd-support	`34.87% <ø> (ø)`
cpo-hostedcontrolplane	`43.50% <ø> (ø)`
cpo-other	`42.74% <ø> (ø)`
hypershift-operator	`51.57% <ø> (ø)`
other	`31.64% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

cwbotbot · 2026-05-28T08:50:05Z

Test Results

e2e-aws

Status: ✅ PASS
Started: 2026-05-28T14:22:48Z
View Job
View Job History

e2e-aks

Status: ✅ PASS
Started: 2026-06-02T09:10:32Z
View Job
View Job History

YamunadeviShanmugam · 2026-05-28T09:09:05Z

/pipeline required

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

Dockerfile.e2e (3)

1-35: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

HEALTHCHECK is missing.

No HEALTHCHECK directive is defined. Container orchestrators cannot determine container health without it. As per coding guidelines, "HEALTHCHECK defined."

🏥 Proposed fix to add HEALTHCHECK

Add at the end of the Dockerfile:

     dnf install -y azure-cli-2.85.0-1.el9 && \
     dnf clean all
+
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+    CMD ["/hypershift/bin/hypershift", "version"] || exit 1

Note: Adjust the health check command based on the most appropriate validation for your e2e container.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Dockerfile.e2e` around lines 1 - 35, The Dockerfile is missing a HEALTHCHECK
directive; add a HEALTHCHECK at the end of the file that runs an appropriate
lightweight check (for example invoking /hypershift/hack/ci-test-e2e.sh or
/hypershift/bin/test-e2e with a quick status subcommand) using CMD-SHELL, and
include sensible options like --interval=30s --timeout=10s --retries=3; ensure
the referenced script/binary (ci-test-e2e.sh or test-e2e) is executable in the
image and that the health command exits with 0 on success so container
orchestrators can detect unhealthy containers.

1-35: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Container runs as root - critical security violation.

No USER directive is specified. The container will run as root, which violates the security guideline. E2E test containers should run as a non-root user. As per coding guidelines, "USER non-root; never run as root."

🔒 Proposed fix to add non-root user

Add before the final RUN instruction:

 COPY --from=builder /hypershift/hack/run-reqserving-e2e.sh /hypershift/hack/run-reqserving-e2e.sh

+RUN useradd -m -u 1001 -s /bin/bash hypershift && \
+    chown -R 1001:1001 /hypershift
+
+USER 1001
+
 RUN rpm --import https://packages.microsoft.com/keys/microsoft.asc && \

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Dockerfile.e2e` around lines 1 - 35, The image currently runs as root because
there's no USER directive; before the final RUN that installs packages (the RUN
starting with "rpm --import ...") add commands to create a non-root user and
group (e.g., create "hypershift" with a fixed UID/GID), chown the WORKDIR and
copied bins (refer to WORKDIR /hypershift and the copied paths like
/hypershift/bin and /hypershift/hack), and then set USER hypershift so the
remaining layers run unprivileged; ensure the created user has a valid shell and
ownership of /hypershift to allow the subsequent scripts (ci-test-e2e.sh,
run-reqserving-e2e.sh) and binaries (test-e2e, hypershift, etc.) to be
executable by that user.

10-12: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Build tools present in final image.

The final stage reuses the golang builder image, which includes the Go compiler and build toolchain. This violates the security guideline. Consider extracting only the necessary runtime dependencies or using a minimal base image for the final stage. As per coding guidelines, "Multi-stage builds; no build tools in final image."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Dockerfile.e2e` around lines 10 - 12, The final image currently reuses the
full Go builder image (the FROM line referencing
registry.ci.openshift.org/openshift/release:rhel-9-release-golang-1.25-openshift-4.23)
which leaves build tools in the runtime image; change the Dockerfile to use a
proper multi-stage build: keep a builder stage that compiles artifacts, then add
a distinct final stage FROM a minimal runtime base (e.g., ubi-minimal or
scratch/ubi) and COPY only the built binaries, configs and any runtime
dependencies into that final stage; if ci-test-e2e.sh truly requires the go tool
at runtime, either modify the script to avoid needing go or include only the
minimal Go runtime binary/artifacts copied from the builder into the final stage
rather than inheriting the entire builder image, and remove any leftover
compiler/toolchain files so no build tools remain in the final image.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Dockerfile.e2e`:
- Line 33: Replace the current version-only dnf install of azure-cli (the line
installing "azure-cli-2.85.0-1.el9") with a digest-pinned workflow: download the
exact Microsoft RPM artifact for azure-cli-2.85.0-1.el9, verify its SHA256 (and
signature if available), install the verified local RPM (instead of installing
from repos), then lock the exact NEVRA with the package manager (e.g., dnf
versionlock) so the exact build is fixed; remove the plain "dnf install
azure-cli-2.85.0-1.el9" step and add a final step to re-run your
vulnerability/support scanner against that specific RPM build to record its
advisories.

---

Outside diff comments:
In `@Dockerfile.e2e`:
- Around line 1-35: The Dockerfile is missing a HEALTHCHECK directive; add a
HEALTHCHECK at the end of the file that runs an appropriate lightweight check
(for example invoking /hypershift/hack/ci-test-e2e.sh or
/hypershift/bin/test-e2e with a quick status subcommand) using CMD-SHELL, and
include sensible options like --interval=30s --timeout=10s --retries=3; ensure
the referenced script/binary (ci-test-e2e.sh or test-e2e) is executable in the
image and that the health command exits with 0 on success so container
orchestrators can detect unhealthy containers.
- Around line 1-35: The image currently runs as root because there's no USER
directive; before the final RUN that installs packages (the RUN starting with
"rpm --import ...") add commands to create a non-root user and group (e.g.,
create "hypershift" with a fixed UID/GID), chown the WORKDIR and copied bins
(refer to WORKDIR /hypershift and the copied paths like /hypershift/bin and
/hypershift/hack), and then set USER hypershift so the remaining layers run
unprivileged; ensure the created user has a valid shell and ownership of
/hypershift to allow the subsequent scripts (ci-test-e2e.sh,
run-reqserving-e2e.sh) and binaries (test-e2e, hypershift, etc.) to be
executable by that user.
- Around line 10-12: The final image currently reuses the full Go builder image
(the FROM line referencing
registry.ci.openshift.org/openshift/release:rhel-9-release-golang-1.25-openshift-4.23)
which leaves build tools in the runtime image; change the Dockerfile to use a
proper multi-stage build: keep a builder stage that compiles artifacts, then add
a distinct final stage FROM a minimal runtime base (e.g., ubi-minimal or
scratch/ubi) and COPY only the built binaries, configs and any runtime
dependencies into that final stage; if ci-test-e2e.sh truly requires the go tool
at runtime, either modify the script to avoid needing go or include only the
minimal Go runtime binary/artifacts copied from the builder into the final stage
rather than inheriting the entire builder image, and remove any leftover
compiler/toolchain files so no build tools remain in the final image.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 22f8217c-50b2-43f3-a324-993080a88d86

📥 Commits

Reviewing files that changed from the base of the PR and between e746a44 and 091bbaf.

📒 Files selected for processing (1)

Dockerfile.e2e

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Dockerfile.e2e`:
- Around line 32-34: The final image's RUN that installs python3-pip (the shown
"RUN dnf install -y python3-pip && pip3 install --no-cache-dir azure-cli && dnf
clean all") must not leave build tools in runtime; move the pip installation and
"pip3 install --no-cache-dir azure-cli" into the builder stage, install
azure-cli there, then copy only the runtime artifacts (eg. /usr/local/bin/az and
the installed site-packages or relevant azure-cli files) into the final stage
and remove "python3-pip" and any pip invocation from the final stage's RUN so
the final image contains no build tools.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 415cdd0b-56f3-4f0e-95d4-eed965d53f02

📥 Commits

Reviewing files that changed from the base of the PR and between 091bbaf and 3ce0f5a.

📒 Files selected for processing (1)

Dockerfile.e2e

YamunadeviShanmugam · 2026-05-28T09:37:05Z

/pipeline required

openshift-merge-bot · 2026-05-28T09:37:09Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

Dockerfile.e2e (2)

1-34: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add USER directive to run as non-root.

The Dockerfile does not set a USER directive, so the container runs as root by default. This violates the container security guideline requiring non-root execution.

🔒 Proposed fix

 COPY --from=builder /hypershift/hack/ci-test-e2e.sh /hypershift/hack/ci-test-e2e.sh
 COPY --from=builder /hypershift/hack/run-reqserving-e2e.sh /hypershift/hack/run-reqserving-e2e.sh

+RUN useradd -r -u 1001 -g root e2euser && \
+    chown -R 1001:root /hypershift && \
+    chmod -R g+rwX /hypershift
+
 RUN rpm --import https://packages.microsoft.com/keys/microsoft.asc && \
     dnf install -y https://packages.microsoft.com/config/rhel/9/packages-microsoft-prod.rpm && \
     mv /etc/yum.repos.d/microsoft-prod.repo /etc/yum.repos.d/ci/ && \
     dnf install -y azure-cli --setopt=install_weak_deps=False && \
     dnf clean all

+USER 1001
+

As per coding guidelines, "USER non-root; never run as root".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Dockerfile.e2e` around lines 1 - 34, The Dockerfile runs as root; create a
non-root user and switch to it by adding steps in the final stage to create a
dedicated user/group (e.g., "hypershift" or UID 1000), chown the application
dirs and binaries under /hypershift (all files copied from builder, and
/hypershift/bin and /hypershift/hack), set a sensible HOME, and add a USER
directive (USER hypershift or USER 1000) before the image entrypoint; ensure any
install steps that require root remain in the builder or are done before
changing ownership so run-time does not require root.

1-34: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add HEALTHCHECK directive.

The Dockerfile does not define a HEALTHCHECK, which is required by the container security guidelines. While this is an e2e test image (ephemeral workload), the guideline mandates a health check for all containers.

🏥 Proposed fix

 USER 1001

+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+  CMD ["/hypershift/bin/hypershift", "version"] || exit 1
+

As per coding guidelines, "HEALTHCHECK defined".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Dockerfile.e2e` around lines 1 - 34, Add a HEALTHCHECK directive to the
Dockerfile (at the end of the final image stage) so the container reports
liveness; for example, add a line like HEALTHCHECK --interval=30s --timeout=5s
--start-period=10s --retries=3 CMD-SHELL "test -x /hypershift/bin/hypershift ||
exit 1" (or point it at an existing probe script such as
/hypershift/hack/ci-test-e2e.sh if you prefer); ensure the directive is placed
after the last RUN/COPY in the final stage so the image builds with the
healthcheck baked in.

♻️ Duplicate comments (1)

Dockerfile.e2e (1)

33-33: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Pin azure-cli by digest and version.

The azure-cli package is installed from a third-party (Microsoft) repository without digest or version pinning, violating the guideline that non-Red Hat images must be pinned by digest. This creates supply-chain and reproducibility risks.

As per coding guidelines, "non-RH images: pin by digest".

Run the following script to identify the exact azure-cli RPM and its digest/checksum:

#!/bin/bash
# Description: Query the Microsoft repo for available azure-cli versions and checksums

echo "=== Available azure-cli packages from Microsoft repo ==="
# List recent azure-cli RPMs
curl -s https://packages.microsoft.com/yumrepos/azure-cli/Packages/a/ | \
  grep -oP 'href="azure-cli-[0-9]+\.[0-9]+\.[0-9]+-[0-9]+\.el9[^"]*"' | \
  sed 's/href="//;s/"//' | sort -V | tail -10

echo ""
echo "=== Example: Fetch SHA256 for a specific RPM ==="
echo "# Download the RPM to compute its digest:"
echo "# curl -sO https://packages.microsoft.com/yumrepos/azure-cli/Packages/a/azure-cli-VERSION.el9.x86_64.rpm"
echo "# sha256sum azure-cli-VERSION.el9.x86_64.rpm"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Dockerfile.e2e` at line 33, The Dockerfile currently installs "azure-cli"
without pinning; replace the generic dnf install line that references azure-cli
with a deterministic installation: download the exact azure-cli RPM for a
specific version/release (azure-cli-VERSION.el9.x86_64.rpm) from the Microsoft
repo, verify its SHA256 digest matches the pinned checksum, and then install
that RPM (or install the exact package name including version-release) instead
of the floating package; update the Dockerfile.e2e azure-cli install step to use
the pinned version and checksum so the build is reproducible and supply-chain
safe.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Dockerfile.e2e`:
- Line 33: Remove the --nobest flag from the dnf install invocation to avoid
silent downgrades; locate the RUN line that contains "dnf install -y azure-cli
--nobest --setopt=install_weak_deps=False" and change it to install azure-cli
without --nobest (keeping -y and --setopt=install_weak_deps=False), so
dependency conflicts surface as build failures instead of allowing older package
versions.
- Line 32: The mv command in Dockerfile.e2e currently moves microsoft-prod.repo
to a mistyped directory `/etc/yum.repos.art/ci/`; update the destination to the
correct yum repo directory `/etc/yum.repos.d/` (and ensure the target exists by
creating it with mkdir -p if your Dockerfile sequence might require it) so the
mv in the line showing "mv /etc/yum.repos.d/microsoft-prod.repo
/etc/yum.repos.art/ci/ && \" correctly places the repo file where dnf/yum will
find it.

---

Outside diff comments:
In `@Dockerfile.e2e`:
- Around line 1-34: The Dockerfile runs as root; create a non-root user and
switch to it by adding steps in the final stage to create a dedicated user/group
(e.g., "hypershift" or UID 1000), chown the application dirs and binaries under
/hypershift (all files copied from builder, and /hypershift/bin and
/hypershift/hack), set a sensible HOME, and add a USER directive (USER
hypershift or USER 1000) before the image entrypoint; ensure any install steps
that require root remain in the builder or are done before changing ownership so
run-time does not require root.
- Around line 1-34: Add a HEALTHCHECK directive to the Dockerfile (at the end of
the final image stage) so the container reports liveness; for example, add a
line like HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3
CMD-SHELL "test -x /hypershift/bin/hypershift || exit 1" (or point it at an
existing probe script such as /hypershift/hack/ci-test-e2e.sh if you prefer);
ensure the directive is placed after the last RUN/COPY in the final stage so the
image builds with the healthcheck baked in.

---

Duplicate comments:
In `@Dockerfile.e2e`:
- Line 33: The Dockerfile currently installs "azure-cli" without pinning;
replace the generic dnf install line that references azure-cli with a
deterministic installation: download the exact azure-cli RPM for a specific
version/release (azure-cli-VERSION.el9.x86_64.rpm) from the Microsoft repo,
verify its SHA256 digest matches the pinned checksum, and then install that RPM
(or install the exact package name including version-release) instead of the
floating package; update the Dockerfile.e2e azure-cli install step to use the
pinned version and checksum so the build is reproducible and supply-chain safe.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: b7ce2411-ba0e-4ee7-aaf9-b439d9d0a716

📥 Commits

Reviewing files that changed from the base of the PR and between 3ce0f5a and 14cdf7f.

📒 Files selected for processing (1)

Dockerfile.e2e

YamunadeviShanmugam · 2026-05-28T14:22:33Z

/test e2e-aws

YamunadeviShanmugam · 2026-05-29T00:45:09Z

/retest

sdminonne

I like the idea of this PR (consolidating constants try reducing flakyness is always great) but I've some concerns.

Not sure why we need modifying validateNodePoolConditions which is impacting several tests
Moving from a 10 minutes timeout to 30 minutes looks too much to me, while reducing flakyness is great, failing fast has value too, and I would say we need to address both

YamunadeviShanmugam · 2026-06-01T09:34:52Z

I like the idea of this PR (consolidating constants try reducing flakyness is always great) but I've some concerns.

Not sure why we need modifying validateNodePoolConditions which is impacting several tests

Moving from a 10 minutes timeout to 30 minutes looks too much to me, while reducing flakyness is great, failing fast has value too, and I would say we need to address both

@sdminonne : Point 1 is addressed by reverting the changes, for Point 2 kept the timeout to 20 mins aligned with other tests

YamunadeviShanmugam · 2026-06-01T12:45:57Z

/retest

YamunadeviShanmugam · 2026-06-02T09:10:16Z

/test e2e-aks

sdminonne · 2026-06-02T16:53:47Z

Thanks, mind updating the PR description to reflect the code modification in this PR? Ideally can you also squash the commits?

YamunadeviShanmugam · 2026-06-02T17:11:56Z

Thanks, mind updating the PR description to reflect the code modification in this PR? Ideally can you also squash the commits?

PR description is updated and the commits are squashed, thanks.

jparrill

Dropped some comments. Thanks!

…I throttling Configures the CPO deployment check to 15s instead of 3s default

YamunadeviShanmugam · 2026-06-05T13:47:28Z

/test e2e-aws

YamunadeviShanmugam · 2026-06-05T16:44:38Z

/test e2e-aws

openshift-ci · 2026-06-05T18:48:46Z

@YamunadeviShanmugam: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-v2-gke	`3ce0f5a`	link	true	`/test e2e-v2-gke`
ci/prow/e2e-aws-upgrade-hypershift-operator	`3ce0f5a`	link	true	`/test e2e-aws-upgrade-hypershift-operator`
ci/prow/e2e-v2-aws	`3ce0f5a`	link	true	`/test e2e-v2-aws`
ci/prow/e2e-aks-4-22	`3ce0f5a`	link	true	`/test e2e-aks-4-22`
ci/prow/e2e-azure-self-managed	`3ce0f5a`	link	true	`/test e2e-azure-self-managed`
ci/prow/e2e-aws-4-22	`3ce0f5a`	link	true	`/test e2e-aws-4-22`
ci/prow/e2e-aws	`3b08f70`	link	true	`/test e2e-aws`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

hypershift-jira-solve-ci · 2026-06-06T00:38:39Z

Test Failure Analysis Complete

Job Information

Prow Job: pull-ci-openshift-hypershift-main-e2e-aws
Build ID: 2062938697933787136
Target: e2e-aws
Failed Step: e2e-aws-hypershift-aws-run-e2e-nested
PR: OCPBUGS-86662: Optimize CPO deployment polling interval in tests of additional trust bundle propagation #8617 (OCPBUGS-86662: Optimize CPO deployment polling interval in tests of additional trust bundle propagation)

Test Failure Analysis

Error

failed to wait for DaemonSet global-pull-secret-syncer to be ready: context deadline exceeded

Summary

The test TestCreateCluster/Main/EnsureGlobalPullSecret/When_management-cluster_hostedCluster.Spec.PullSecret_is_updated_in-place_it_should_propagate_to_guest_without_rollout timed out after ~20 minutes waiting for the global-pull-secret-syncer DaemonSet to reach full readiness (2/3 pods ready, never reaching 3/3). This is a pre-existing flake unrelated to PR #8617 — the PR only modifies nodepool_additionalTrustBundlePropagation_test.go and nodepool_lifecycle_test.go, which both passed successfully. The EnsureGlobalPullSecret test that failed was not touched by this PR.

Root Cause

The hosted cluster create-cluster-xlbqb is deployed in HighlyAvailable mode across 3 AZs (us-east-1a, us-east-1b, us-east-1c) with 3 separate NodePools, each with 1 replica — giving 3 worker nodes total. The global-pull-secret-syncer DaemonSet schedules one pod per node, so DesiredNumberScheduled=3.

However, one of the 3 syncer pods persistently failed to reach Ready state. From the very first readiness check (build-log line 2422) to the final timeout (line 6614), the DaemonSet was stuck at 2/3 pods ready — the pattern repeated 366 times across a 20+ minute window without ever converging.

The waitForDaemonSetReady function (util.go:2266) has a 20-minute timeout and requires ready == desired (i.e., all 3 pods must be ready). Since one pod never became ready, the function hit its context deadline.

The secondary failure Check_if_the_config.json_is_correct_in_all_of_the_nodes (0.03s) is a cascading failure caused by the kubelet-config-verifier DaemonSet already existing from the prior subtest's cleanup failure, resulting in a 409 Conflict: "kubelet-config-verifier" already exists error.

The test uses *np.Spec.Replicas (=1) as minExpected for waitForDaemonSetReady, but the DaemonSet's DesiredNumberScheduled is 3 (one per node across all AZs). The minExpected sanity check passes (3 ≥ 1), but the readiness check waits for ready == desired (2 ≠ 3), which is the correct behavior — the issue is that one of the 3 pods genuinely failed to become ready, likely due to a transient node-level issue on one of the 3 worker nodes (e.g., kubelet not responding, container runtime issue, or resource pressure on the affected node).

Recommendations

Rerun the job — this is a pre-existing flake in TestCreateCluster/Main/EnsureGlobalPullSecret, not caused by the PR's changes. The PR's test (TestAdditionalTrustBundlePropagation) passed successfully.
File a separate bug for the EnsureGlobalPullSecret flake — the global-pull-secret-syncer DaemonSet pod on one node persistently fails to reach Ready state in multi-AZ HA clusters, causing a 20-minute timeout.
Consider improving the test to diagnose which node's pod is not ready and log pod conditions/events when the DaemonSet fails to converge, making future occurrences easier to diagnose.
Consider cleanup improvement — the kubelet-config-verifier DaemonSet should be cleaned up even when the preceding subtest fails, to prevent cascading AlreadyExists errors in subsequent subtests.

Evidence

Evidence	Detail
Primary failure	`TestCreateCluster/Main/EnsureGlobalPullSecret/When_management-cluster_hostedCluster.Spec.PullSecret_is_updated_in-place_it_should_propagate_to_guest_without_rollout` (1205.07s)
Error message	`failed to wait for DaemonSet global-pull-secret-syncer to be ready: context deadline exceeded`
DaemonSet status	Persistently stuck at 2/3 pods ready (366 occurrences across 20+ minutes)
Cluster topology	HA across 3 AZs: us-east-1a, us-east-1b, us-east-1c — 3 NodePools, 3 nodes, 3 desired DaemonSet pods
NodePool replicas	1 per NodePool (minExpected=1 passed, but DesiredNumberScheduled=3)
Secondary failure	`Check_if_the_config.json_is_correct_in_all_of_the_nodes` — cascading `409 AlreadyExists` for kubelet-config-verifier DaemonSet
PR's test result	`TestAdditionalTrustBundlePropagation` — PASSED (1290.12s)
PR relevance	PR modifies `nodepool_additionalTrustBundlePropagation_test.go` and `nodepool_lifecycle_test.go` — neither file is related to the failing `EnsureGlobalPullSecret` test
Total test results	623 tests, 30 skipped, 5 failures (all 5 from the same `TestCreateCluster/EnsureGlobalPullSecret` cascade)

openshift-ci Bot added the do-not-merge/needs-area label May 28, 2026

openshift-ci Bot requested review from sdminonne and sjenning May 28, 2026 08:28

openshift-ci Bot added area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels May 28, 2026

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

Comment thread Dockerfile.e2e Outdated

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

Comment thread Dockerfile.e2e Outdated

This was referenced May 28, 2026

OCPBUGS-86661: Konnectivity retry proxy connection on timeout #8579

Open

CNTRLPLANE-3510: Enable additional golangci-lint linters #8567

Merged

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

Comment thread Dockerfile.e2e

Comment thread Dockerfile.e2e Outdated

This was referenced May 28, 2026

NO-JIRA: feat(skills): add validate-pr-override-images skill #8616

Merged

CNTRLPLANE-2775: Expose KAS availability and latency metrics from the control-plane-operator #7749

Open

YamunadeviShanmugam force-pushed the fix-e2e-tests-inconsistent-timeout-agressive-api-polling branch from 14cdf7f to f312239 Compare May 28, 2026 23:57

YamunadeviShanmugam force-pushed the fix-e2e-tests-inconsistent-timeout-agressive-api-polling branch from b7e4283 to 52b9cf3 Compare May 29, 2026 08:16

openshift-ci Bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 30, 2026

YamunadeviShanmugam force-pushed the fix-e2e-tests-inconsistent-timeout-agressive-api-polling branch from c32627d to 8c798f7 Compare June 1, 2026 01:54

sdminonne suggested changes Jun 1, 2026

View reviewed changes

openshift-ci Bot assigned sdminonne Jun 1, 2026

hypershift-jira-solve-ci Bot mentioned this pull request Jun 1, 2026

CNTRLPLANE-2207: Upgrade to CAPI 1.11 #8594

Merged

12 tasks

YamunadeviShanmugam requested a review from sdminonne June 1, 2026 09:36

hypershift-jira-solve-ci Bot mentioned this pull request Jun 1, 2026

OCPBUGS-54790: Move packageserver PDB from guest cluster to management cluster #8459

Open

4 tasks

YamunadeviShanmugam force-pushed the fix-e2e-tests-inconsistent-timeout-agressive-api-polling branch from b26858c to 22360d6 Compare June 2, 2026 17:10

hypershift-jira-solve-ci Bot mentioned this pull request Jun 2, 2026

CNTRLPLANE-3167: Auto-detect credential type for HCPEtcdBackup #8368

Merged

5 tasks

jparrill reviewed Jun 3, 2026

View reviewed changes

Comment thread test/e2e/nodepool_additionalTrustBundlePropagation_test.go

Comment thread test/e2e/nodepool_additionalTrustBundlePropagation_test.go

Comment thread test/e2e/nodepool_additionalTrustBundlePropagation_test.go Outdated

YamunadeviShanmugam force-pushed the fix-e2e-tests-inconsistent-timeout-agressive-api-polling branch from 22360d6 to 92a58d5 Compare June 3, 2026 13:43

YamunadeviShanmugam changed the title ~~OCPBUGS-86662: Fix inconsistent timeout and aggressive api polling in tests of additional trust bundle propagation~~ OCPBUGS-86662: Optimize CPO deployment polling interval in tests of additional trust bundle propagation Jun 3, 2026

YamunadeviShanmugam requested a review from jparrill June 4, 2026 08:49

YamunadeviShanmugam force-pushed the fix-e2e-tests-inconsistent-timeout-agressive-api-polling branch from d074723 to aa13ec2 Compare June 5, 2026 13:00

openshift-ci Bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 5, 2026

fix(e2etests): optimize CPO deployment polling interval to prevent AP…

3b08f70

…I throttling Configures the CPO deployment check to 15s instead of 3s default

YamunadeviShanmugam force-pushed the fix-e2e-tests-inconsistent-timeout-agressive-api-polling branch from cfb0b0c to 3b08f70 Compare June 5, 2026 13:05

Conversation

YamunadeviShanmugam commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

openshift-merge-bot Bot commented May 28, 2026

Uh oh!

openshift-ci-robot commented May 28, 2026

Uh oh!

openshift-ci Bot commented May 28, 2026

Uh oh!

coderabbitai Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

❌ Failed checks (1 warning)

Uh oh!

openshift-ci-robot commented May 28, 2026

Summary by CodeRabbit

Uh oh!

YamunadeviShanmugam commented May 28, 2026

Uh oh!

openshift-merge-bot Bot commented May 28, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

cwbotbot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results

e2e-aws

e2e-aks

Uh oh!

YamunadeviShanmugam commented May 28, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

YamunadeviShanmugam commented May 28, 2026

Uh oh!

openshift-merge-bot Bot commented May 28, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

YamunadeviShanmugam commented May 28, 2026

Uh oh!

YamunadeviShanmugam commented May 29, 2026

Uh oh!

sdminonne left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

YamunadeviShanmugam commented Jun 1, 2026

Uh oh!

YamunadeviShanmugam commented Jun 1, 2026

Uh oh!

YamunadeviShanmugam commented Jun 2, 2026

Uh oh!

sdminonne commented Jun 2, 2026

Uh oh!

YamunadeviShanmugam commented Jun 2, 2026

Uh oh!

jparrill left a comment

Choose a reason for hiding this comment

YamunadeviShanmugam commented May 28, 2026 •

edited

Loading

coderabbitai Bot commented May 28, 2026 •

edited

Loading

codecov Bot commented May 28, 2026 •

edited

Loading

cwbotbot commented May 28, 2026 •

edited

Loading