
CNTRLPLANE-3371: Fix AllowedCIDRs e2e test for Route-based KAS #8469

Open
bryan-cox wants to merge 1 commit into
openshift:main from
bryan-cox:CNTRLPLANE-3371

Conversation

@bryan-cox
Member

@bryan-cox bryan-cox commented May 8, 2026

What

Fixes the ValidateKubeAPIServerAllowedCIDRs e2e test so it passes on v2 Azure self-managed clusters where KAS uses Route publishing strategy (via --external-dns-domain).

Why

The test was skipped in v2 CI (--ginkgo.skip="KAS allowed CIDRs") because it always failed. Both v1 and v2 Azure self-managed use Route strategy for KAS, but v1 passes while v2 fails due to a difference in cluster lifecycle timing combined with HTTP/2 connection reuse.

Root cause: HTTP/2 connection reuse

The test reuses a single kubeclient.Clientset across all ServerVersion() poll iterations. Go's HTTP/2 transport multiplexes all requests over a single persistent TCP connection. If the first poll succeeds before Azure NSG rules take effect, all subsequent polls reuse that connection and never observe the expected failure.

Why v1 passes but v2 fails: In v1, the cluster is created fresh inside TestCreateCluster, so the CPO is in its initial reconciliation burst — the router service's LoadBalancerSourceRanges and corresponding Azure NSG rules are updated before the first ServerVersion() call. In v2, the cluster is pre-created and shared across tests, so the CPO is in steady-state with longer reconciliation intervals. The first ServerVersion() call succeeds before the NSG rules catch up, and HTTP/2 holds that connection open for all subsequent polls.
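A minimal, illustrative sketch of the failure mode (hypothetical package and function names, not the actual code in test/e2e/util/util.go):

package sketch

import (
    "context"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
    kubeclient "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// pollWithReusedClient builds one Clientset up front and reuses it for every poll.
// Go's HTTP/2 transport keeps a single TCP connection alive, so if the first
// ServerVersion() succeeds before the cloud firewall rules land, every later poll
// rides the same established connection and never observes the expected failure.
func pollWithReusedClient(ctx context.Context, guestConfig *rest.Config) error {
    client, err := kubeclient.NewForConfig(guestConfig)
    if err != nil {
        return err
    }
    return wait.PollUntilContextTimeout(ctx, 10*time.Second, 5*time.Minute, true, func(ctx context.Context) (bool, error) {
        _, err := client.Discovery().ServerVersion()
        return err != nil, nil // waiting for the call to start failing once the CIDR rules apply
    })
}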

Additional fix: missing downstream service wait

The test waits for AllowedCIDRBlocks to propagate from the HostedCluster to the HostedControlPlane, but does not wait for the CPO to reconcile the downstream LoadBalancer service's LoadBalancerSourceRanges. This is a race condition that exists in both v1 and v2 — v1 just happens to win the race due to CPO being in active reconciliation. Adding an explicit wait makes the test correct rather than relying on timing.
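Roughly, the added wait looks like the sketch below (the helper name and signature here are approximations; the actual target service is chosen by the new allowedCIDRsTargetService() helper described in the Changes section):

package sketch

import (
    "context"

    . "github.com/onsi/gomega"
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/types"
    crclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForSourceRanges blocks until the CPO has propagated the expected CIDRs into the
// downstream service's spec.loadBalancerSourceRanges, so the reachability check that
// follows asserts against rules that actually exist.
func waitForSourceRanges(ctx context.Context, g Gomega, c crclient.Client, svc types.NamespacedName, want []string) {
    g.Eventually(func(g Gomega) {
        service := &corev1.Service{}
        g.Expect(c.Get(ctx, svc, service)).To(Succeed())
        g.Expect(service.Spec.LoadBalancerSourceRanges).To(ConsistOf(want))
    }).WithContext(ctx).Should(Succeed())
}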

Changes

test/e2e/util/util.go — single file, three changes:

  1. ensureAPIServerAllowedCIDRs signature change: *kubeclient.Clientset → *rest.Config, to enable fresh client creation per poll
  2. Fresh kubeclient per poll: Each ServerVersion() iteration creates a new client via kubeclient.NewForConfig(rest.CopyConfig(guestConfig)), preventing HTTP/2 connection reuse (see the sketch after this list).
  3. Strategy-aware service wait: New allowedCIDRsTargetService() helper determines the correct LB service based on APIServer publishing strategy (Route → router, LoadBalancer → platform-specific KAS LB). An Eventually block waits for the service's LoadBalancerSourceRanges to match before checking KAS reachability.
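For the HTTP/2 fix, a minimal sketch of the fresh-client idea (hypothetical names; note that, as flagged in the review below, rest.CopyConfig alone still hits client-go's TLS transport cache, so the final version also sets a custom Dial on the copied config):

package sketch

import (
    "net"
    "time"

    kubeclient "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// freshGuestClient builds a brand-new client for a single poll attempt. The custom Dial
// gives client-go's transport cache a unique key (a new DialHolder pointer), forcing a
// fresh *http.Transport and therefore a fresh TCP connection per attempt.
func freshGuestClient(guestConfig *rest.Config) (*kubeclient.Clientset, error) {
    cfg := rest.CopyConfig(guestConfig)
    cfg.Dial = (&net.Dialer{Timeout: 30 * time.Second, KeepAlive: 30 * time.Second}).DialContext
    return kubeclient.NewForConfig(cfg)
}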

Test Plan

  • go build -tags e2e ./test/e2e/... — compiles
  • go build -tags e2ev2 ./test/e2e/v2/... — compiles
  • go vet -tags e2e ./test/e2e/... — passes
  • Re-run v2 rehearsal on openshift/release#79048 after merge

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Improved API server CIDR restriction validation to ensure network rules are reconciled to the correct downstream service across publishing strategies (Route vs LoadBalancer), including platform-specific selection and skipping checks for non-applicable cases.
  • Tests

    • Strengthened reachability tests by waiting for reconciliation, recreating client connections per attempt, and adding a new test to verify correct downstream service selection for various platforms and publishing strategies.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci
Contributor

openshift-ci Bot commented May 8, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 8, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 8, 2026
@openshift-ci-robot

openshift-ci-robot commented May 8, 2026

@bryan-cox: This pull request references CNTRLPLANE-3371 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What

Fixes the ValidateKubeAPIServerAllowedCIDRs e2e test so it passes on v2 Azure self-managed clusters where KAS uses Route publishing strategy (via --external-dns-domain).

Why

The test was skipped in v2 CI (--ginkgo.skip="KAS allowed CIDRs") because it always failed. Root cause: two issues compound to make the test pass on v1 but fail on v2.

1. Missing downstream service wait

The test waits for AllowedCIDRBlocks to propagate from the HostedCluster to the HostedControlPlane, but does not wait for the CPO to reconcile the downstream LoadBalancer service's LoadBalancerSourceRanges. With Route strategy, the relevant service is the router LB service (not a KAS LB). The CPO reconciliation adds a delay that the test doesn't account for.

2. HTTP/2 connection reuse

The test reuses a single kubeclient.Clientset across all ServerVersion() poll iterations. Go's HTTP/2 transport multiplexes all requests over a single persistent TCP connection. If the first poll succeeds before Azure NSG rules take effect, all subsequent polls reuse that connection and never observe the expected failure.

Changes

test/e2e/util/util.go — single file, three changes:

  1. ensureAPIServerAllowedCIDRs signature: *kubeclient.Clientset → *rest.Config to enable fresh client creation per poll
  2. Strategy-aware service wait: New allowedCIDRsTargetService() helper determines the correct LB service based on APIServer publishing strategy (Route → router, LoadBalancer → platform-specific KAS LB). An Eventually block waits for the service's LoadBalancerSourceRanges to match before checking KAS reachability.
  3. Fresh kubeclient per poll: Each ServerVersion() iteration creates a new client via kubeclient.NewForConfig(rest.CopyConfig(guestConfig)), preventing HTTP/2 connection reuse.

Test Plan

  • go build -tags e2e ./test/e2e/... — compiles
  • go build -tags e2ev2 ./test/e2e/v2/... — compiles
  • go vet -tags e2e ./test/e2e/... — passes
  • Re-run v2 rehearsal on openshift/release#79048 after merge

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Contributor

coderabbitai Bot commented May 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: e731499a-64d1-4a88-8f08-48ac6dc5e339

📥 Commits

Reviewing files that changed from the base of the PR and between 6b609b0 and 43d818b.

📒 Files selected for processing (2)
  • test/e2e/util/util.go
  • test/e2e/util/util_test.go

📝 Walkthrough

The test utility ValidateKubeAPIServerAllowedCIDRs now passes the guest cluster REST config into ensureAPIServerAllowedCIDRs. The helper waits for the control-plane to reconcile HostedCluster.Spec.Networking.APIServer.AllowedCIDRBlocks into the downstream Service.spec.LoadBalancerSourceRanges (target Service chosen by publishing strategy and cloud-specific rules). Once reconciled, reachability is polled by creating a fresh guest kubeclient per attempt (copying rest.Config with a custom Dial) and calling ServerVersion() to validate network restrictions.

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Harness
    participant CP as Control-Plane Reconciler
    participant LB as Downstream Service/LoadBalancer
    participant GuestAPI as Guest kube-apiserver

    Test->>CP: Set AllowedCIDRBlocks on HostedCluster spec
    Note right of CP: Reconciler updates target Service based on publishing strategy/cloud
    CP->>LB: Update spec.LoadBalancerSourceRanges
    loop Poll for reconciliation
        Test->>LB: GET Service.spec.LoadBalancerSourceRanges
        alt ranges match expected
            Note right of Test: perform reachability checks
            loop Reachability attempts
                Test->>GuestAPI: Create fresh kubeclient (copied rest.Config + custom Dial) and call ServerVersion()
                GuestAPI-->>Test: respond (reachable / unreachable)
            end
        else not yet reconciled
            Test-->>Test: wait and retry
        end
    end

Suggested reviewers

  • clebs
🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 42.86%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (11 passed)

  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title clearly and specifically identifies the main change: fixing the AllowedCIDRs e2e test for Route-based KAS, which is the core purpose of the changeset.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names (✅ Passed): PR adds only standard Go test functions using t.Run(), not Ginkgo tests. The custom check targets Ginkgo naming (It(), Describe(), etc.), so it is not applicable.
  • Test Structure And Quality (✅ Passed): Well-structured unit test with 9 focused scenarios covering the new allowedCIDRsTargetService helper. Matches codebase patterns, uses table-driven design and Gomega assertions.
  • Microshift Test Compatibility (✅ Passed): No Ginkgo e2e tests added. TestAllowedCIDRsTargetService is a standard Go unit test (func Test* with t.Run), not Ginkgo. The check applies only to Ginkgo tests.
  • Single Node Openshift (Sno) Test Compatibility (✅ Passed): The new test is a standard Go unit test, not Ginkgo e2e. The modified helper tests API server CIDR filtering and service configuration, which work on SNO and don't require multiple nodes.
  • Topology-Aware Scheduling Compatibility (✅ Passed): Changes are test utilities only (test/e2e/util/). No deployment manifests, operators, or scheduling constraints introduced.
  • Ote Binary Stdout Contract (✅ Passed): Test utility functions modified without violating the OTE stdout contract. No process-level code with stdout writes, only test-level logging via t.Log/t.Logf, which is intercepted by the framework.
  • Ipv6 And Disconnected Network Test Compatibility (✅ Passed): TestAllowedCIDRsTargetService is a standard Go unit test, not a Ginkgo e2e test. The check applies to new Ginkgo patterns (It(), Describe(), Context(), When()), which are absent here.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels May 8, 2026
@codecov

codecov Bot commented May 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.00%. Comparing base (b0a10c5) to head (43d818b).
⚠️ Report is 97 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8469      +/-   ##
==========================================
+ Coverage   37.49%   40.00%   +2.50%     
==========================================
  Files         751      751              
  Lines       91984    92838     +854     
==========================================
+ Hits        34487    37137    +2650     
+ Misses      54854    53014    -1840     
- Partials     2643     2687      +44     

see 57 files with indirect coverage changes

Flag Coverage Δ
cmd-support 34.09% <ø> (+1.45%) ⬆️
cpo-hostedcontrolplane 40.56% <ø> (+3.79%) ⬆️
cpo-other 40.14% <ø> (+2.41%) ⬆️
hypershift-operator 50.53% <ø> (+2.59%) ⬆️
other 31.54% <ø> (+3.76%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.


@bryan-cox bryan-cox marked this pull request as ready for review May 8, 2026 19:29
@bryan-cox
Member Author

/pipeline required

@bryan-cox
Member Author

/pipeline required

@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 11, 2026
@cwbotbot

cwbotbot commented May 11, 2026

Test Results

e2e-aws

e2e-aks

Failed Tests

Total failed tests: 3

  • TestCreateCluster
  • TestCreateCluster/Main
  • TestCreateCluster/Main/EnsureAzureWorkloadIdentityWebhookMutation

@bryan-cox
Member Author

/retest

@bryan-cox
Member Author

/test e2e-aws

@bryan-cox
Member Author

/test e2e-aks-4-22

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2053786112610013184 | Cost: $4.889827649999997 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@bryan-cox
Member Author

/test e2e-aks

@bryan-cox
Member Author

/test e2e-aws

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2053827692964352000 | Cost: $4.6627350000000005 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@bryan-cox
Member Author

/test e2e-aws

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented May 11, 2026

I have all the evidence needed. Here is the complete analysis:

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

Pod scheduling timeout. 0/56 nodes are available: 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had untolerated taint {node-role.kubernetes.io/ci-builds-tmpfs-worker: ci-builds-tmpfs-worker}, 1 node(s) had untolerated taint {node-role.kubernetes.io/ci-longtests-worker: ci-longtests-worker}, 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 16 node(s) didn't match Pod's node affinity/selector, 2 Insufficient memory, 24 node(s) had untolerated taint {node-role.kubernetes.io/ci-tests-worker: ci-tests-worker}, 3 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 4 node(s) had untolerated taint {node-role.kubernetes.io/ci-builds-worker: ci-builds-worker}.

Summary

This is a CI infrastructure failure, not a test or code failure. The ci-operator pod for the security job was never scheduled on the build01 cluster because no suitable node was available among all 56 nodes for the entire 30-minute scheduling timeout window. The pod remained in Pending state until Prow terminated it with a "Pod scheduling timeout" error. No test code was executed — the PR changes are not implicated in this failure.

Root Cause

The CI pod could not be scheduled on the build01 cluster due to resource exhaustion and node constraints. The Kubernetes scheduler evaluated all 56 nodes and found none suitable:

  • 24 nodes had untolerated taint ci-tests-worker (reserved for test workloads, not ci-operator build pods)
  • 16 nodes didn't match the pod's node affinity/selector (the pod has multiarch.openshift.io preferred node affinity for amd64)
  • 4 nodes had untolerated taint ci-builds-worker (reserved for a different build workload class)
  • 3 nodes had untolerated taint master (control plane nodes)
  • 3 nodes had untolerated taint infra (infrastructure nodes)
  • 2 nodes had insufficient memory (eligible nodes but out of resources)
  • 1 node had untolerated taint ci-builds-tmpfs-worker
  • 1 node had untolerated taint ci-longtests-worker
  • 1 node had untolerated taint not-ready (unhealthy node)
  • 1 node failed pod anti-affinity rules

The 2 nodes that were actually eligible for this pod type did not have enough memory to schedule it. Preemption was also not possible — the scheduler found no viable preemption victims on the memory-constrained nodes. The pod waited for 30 minutes (the default Prow scheduling timeout) before being terminated.

This is a transient cluster capacity issue on build01, completely unrelated to the PR changes.

Recommendations
  1. Retest the PR — Run /test security on the PR to trigger a new attempt. This is a transient infrastructure issue and is very likely to succeed on retry.
  2. No code changes needed — The PR (CNTRLPLANE-3371) was not involved in this failure. No test code was executed.
  3. If retests continue to fail with the same error, the build01 cluster may be under sustained capacity pressure. In that case, escalate to the CI infrastructure team (Test Platform / DPTP) to investigate node capacity on build01.
Evidence

  • Failure type: CI infrastructure — pod scheduling timeout
  • Job state: error (not failure — indicates infra issue, not test failure)
  • Pod phase: Failed — pod never reached Running
  • PodScheduled condition: False / Unschedulable
  • Cluster: build01 (56 nodes evaluated, 0 schedulable)
  • Eligible nodes: 2 nodes matched selectors/tolerations but had insufficient memory
  • Preemption attempted: Yes — no viable victims found
  • Scheduling wait: 30 minutes (15:42:54Z → 16:12:54Z)
  • Build log: Not present — no build log artifact was generated (pod never started)
  • Test execution: None — ci-operator never ran; no test code was evaluated
  • Container statuses: Empty — no containers were ever created

@bryan-cox
Member Author

/test security

@bryan-cox
Member Author

/test e2e-aws

@bryan-cox
Member Author

/auto-cc

@openshift-ci openshift-ci Bot requested review from cblecker and enxebre May 12, 2026 00:33
Member

@cblecker cblecker left a comment


The root cause analysis here is solid — the HTTP/2 connection reuse explanation is clear and the fix (fresh client per poll + waiting for the downstream service to reflect the updated source ranges before testing reachability) is the right approach rather than just a timing band-aid. The allowedCIDRsTargetService helper is a nice encapsulation of the CPO service selection logic.

A few comments inline, the most notable being a potential issue with the ARO HCP guard in the Route case.

Comment thread test/e2e/util/util.go Outdated
}
switch strategy.Type {
case hyperv1.Route:
if azureutil.IsAroHCP() && !netutil.IsPrivateHC(hc) {
Member


The compound condition here doesn't quite match how the CPO makes this decision. The CPO uses IsAroHCP() as a standalone check when handling router services (infra.go:459 — deletes RouterPublicService for ARO HCP unconditionally). ARO HCP never has a public router LB service with LoadBalancerSourceRanges, since Swift handles connectivity.

The issue is with PublicAndPrivate topology: IsPublicHC returns true (so the top guard passes), but IsPrivateHC also returns true (via the topology check), making !IsPrivateHC false. The guard doesn't fire and we return RouterPublicService — a service the CPO actively deletes for ARO HCP. The downstream Eventually would then time out waiting for LoadBalancerSourceRanges on a service that doesn't exist.

Simplifying to just azureutil.IsAroHCP() matches the CPO's logic:

if azureutil.IsAroHCP() {
    return nil
}

Member Author


Done. Simplified to just azureutil.IsAroHCP() — good catch on the PublicAndPrivate topology case where both IsPublicHC and IsPrivateHC return true.


AI-assisted response via Claude Code

Comment thread test/e2e/util/util.go Outdated
// allowedCIDRsTargetService returns the LoadBalancer service that enforces AllowedCIDRBlocks
// based on the HostedCluster's APIServer publishing strategy. Returns nil when no LB service
// carries source ranges (private clusters, NodePort, ARO HCP).
// Mirrors service selection in CPO: infra.go:reconcileAPIServerService, kas/service.go:ReconcileService.
Member


This citation is incomplete — it only covers the LoadBalancer path. For the Route case (which is the main path this PR is fixing), the relevant CPO code is infra.go:reconcileHCPRouterServices → ingress/router.go:ReconcileRouterService. The cited kas/service.go:ReconcileService sets LoadBalancerSourceRanges only in the LoadBalancer case, not for Route.

These file-level references are also going to get stale as the CPO migrates to the v2 component framework. Something like this would age better:

// Mirrors CPO's API server and router service reconciliation logic.

Member Author


Done. Simplified to a generic reference that won't go stale with the v2 component migration.


AI-assisted response via Claude Code

@bryan-cox bryan-cox force-pushed the CNTRLPLANE-3371 branch 2 times, most recently from f0d3966 to 87a1a7e Compare May 12, 2026 23:03
@bryan-cox
Member Author

/retest

@bryan-cox
Member Author

/retest

@bryan-cox
Member Author

/test e2e-aws

@bryan-cox
Member Author

/retest

Member

@cblecker cblecker left a comment


Second round overall looks good — all the previous feedback was addressed. A few new things came up on closer look, one of which is a real bug in the HTTP/2 fix.

Comment thread test/e2e/util/util.go Outdated
// subsequent requests reuse that connection and bypass the restriction.
g.Eventually(func(g Gomega) {
_, err = guestClient.ServerVersion()
freshClient, err := kubeclient.NewForConfig(rest.CopyConfig(guestConfig))
Member


Unfortunately rest.CopyConfig doesn't actually give you a new HTTP transport here. I traced through the client-go source: kubeclient.NewForConfig → rest.HTTPClientFor → transport.New(), which calls tlsCache.get() when config.Transport is nil. The cache key (tlsCacheKey in transport/cache.go) is built from the TLS data values — string(c.TLS.CAData), string(c.TLS.CertData), string(c.TLS.KeyData), etc. — not pointers. CopyConfig copies the same byte content, and since Dial is nil, TransportConfig() leaves DialHolder nil too. Both the original config and the copy produce identical cache keys, so the cache returns the same *http.Transport instance with its existing HTTP/2 connection pool.

The simplest fix is to set Dial on the copied config before creating the client. TransportConfig() wraps a non-nil Dial in a new &DialHolder{} each time, making the pointer unique and busting the cache:

cfg := rest.CopyConfig(guestConfig)
cfg.Dial = (&net.Dialer{Timeout: 30 * time.Second, KeepAlive: 30 * time.Second}).DialContext
freshClient, err := kubeclient.NewForConfig(cfg)

Member Author


Done. Set cfg.Dial to create a unique *transport.DialHolder pointer per iteration, busting the TLS transport cache.


AI-assisted response via Claude Code

Comment thread test/e2e/util/util.go Outdated

// Create a fresh kubeclient per poll to avoid HTTP/2 connection reuse. Go's HTTP/2
// transport multiplexes requests over a single persistent TCP connection. If a prior
// successful request established a connection before NSG rules took effect, all
Member


nit: "NSG rules" is Azure-specific — this function runs on AWS (security groups) and GCP (firewall rules) too. Something like "network restrictions" or "load balancer source-range enforcement" would be accurate across platforms.

Member Author


Done. Replaced "NSG rules" with "load balancer source-range restrictions".


AI-assisted response via Claude Code

},
},
wantNil: true,
},
Member


These two test cases (NodePort and no-strategy) don't actually reach the branch they claim to test. Both create an AWSPlatform HostedCluster without setting Platform.AWS, so IsPublicHC evaluates ptr.Deref(nil, AWSPlatformSpec{}).EndpointAccess == "" — which matches neither Public nor PublicAndPrivate — and returns false. The function exits at the !IsPublicHC(hc) guard before the strategy switch is ever reached.

Using the publicHC helper fixes this (it correctly sets EndpointAccess: hyperv1.Public for AWS):

{
    name:    "When NodePort strategy it should return nil",
    hc:      publicHC(hyperv1.AWSPlatform, hyperv1.NodePort),
    wantNil: true,
},
{
    name: "When no APIServer strategy it should return nil",
    hc: func() *hyperv1.HostedCluster {
        hc := publicHC(hyperv1.AWSPlatform, hyperv1.Route)
        hc.Spec.Services = nil
        return hc
    }(),
    wantNil: true,
},

The tests still return wantNil: true either way, but a bug in the default switch case or the nil-strategy guard wouldn't be caught as-is.

Member Author


Done. NodePort now uses publicHC(hyperv1.AWSPlatform, hyperv1.NodePort) so it passes IsPublicHC and exercises the switch default. No-strategy case uses publicHC with hc.Spec.Services = nil so it reaches the strategy == nil guard.


AI-assisted response via Claude Code


for _, tc := range tests {
t.Run(tc.name, func(t *testing.T) {
g := NewWithT(t)
Member


nit: azureutil.SetAsAroHCPTest(t) already does exactly this — might as well use the helper for consistency with how the rest of the codebase sets up ARO HCP test environments.

Member Author


Done. Switched to azureutil.SetAsAroHCPTest(t).


AI-assisted response via Claude Code

The ValidateKubeAPIServerAllowedCIDRs test fails on v2 Azure
self-managed clusters because KAS uses Route publishing strategy
(via external-dns-domain), not LoadBalancer.

Two fixes:

1. Wait for the downstream LB service (router or KAS LB) to have its
   LoadBalancerSourceRanges updated by the CPO before asserting KAS
   reachability. The target service is determined by the HC's APIServer
   publishing strategy.

2. Create a fresh kubeclient per poll iteration to prevent HTTP/2
   connection reuse. Go's HTTP/2 transport multiplexes all requests over
   a single persistent TCP connection — if a prior request succeeded
   before Azure NSG rules took effect, subsequent requests bypass the
   restriction on the same connection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Member

@cblecker cblecker left a comment


All round-2 feedback addressed correctly — the cfg.Dial fix properly busts the TLS transport cache, the test branch fixes reach the right code paths, and the nits are cleaned up.

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 13, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@openshift-ci
Contributor

openshift-ci Bot commented May 13, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox, cblecker

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cblecker
Member

@coderabbitai resume

@coderabbitai
Contributor

coderabbitai Bot commented May 13, 2026

✅ Actions performed

Reviews resumed.

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2054695287196225536 | Cost: $3.345031900000001 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2054695299795914752 | Cost: $3.15317325 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@bryan-cox
Member Author

/retest

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2054732206756073472 | Cost: $4.0600609500000004 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@bryan-cox
Member Author

/retest

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2054732206865125376 | Cost: $2.7961020000000008 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@openshift-ci
Contributor

openshift-ci Bot commented May 14, 2026

@bryan-cox: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • ci/prow/e2e-azure-self-managed (commit 43d818b, required): rerun with /test e2e-azure-self-managed
  • ci/prow/e2e-aks (commit 43d818b, required): rerun with /test e2e-aks
  • ci/prow/e2e-aks-4-22 (commit 43d818b, required): rerun with /test e2e-aks-4-22

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@enxebre
Member

enxebre commented May 14, 2026

cc @muraee


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/testing: Indicates the PR includes changes for e2e testing.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
