CNTRLPLANE-3430: Make HA break-glass credentials test infraless by sdminonne · Pull Request #8546 · openshift/hypershift

sdminonne · 2026-05-19T12:57:53Z

Summary

Make TestCreateClusterHABreakGlassCredentials work on non-OpenShift management clusters (e.g. AKS) by setting explicit hostnames on Route-type services in BeforeApply, so the CPO can resolve infrastructure readiness without a default ingress domain.
Add early KAS pod scheduling failure detection (10-minute timeout) to fail fast under resource pressure instead of waiting for the full WaitForGuestClient timeout.
Skip the test on release versions < 4.23 via e2eutil.AtLeast(t, e2eutil.Version423) because older CPO versions lack the LoadBalancer.Hostname bypass fix (merged in PR OCPBUGS-87028: Fix KAS ReconcileServiceStatus to check LoadBalancer.Hostname before LB provisioning #8663).
Poll SSR in validateCertificateAuth instead of failing immediately on Unauthorized, allowing time for the control-plane-pki-operator to reconcile the break-glass signer CA into the client-ca trust bundle.

Depends on: #8663

Test plan

e2e-aks-4-22: test should be skipped (not failed) due to AtLeast(Version423)
e2e-aks (4.23+): test should pass as before
e2e-aws: test should pass (cloud LB provisions external endpoints)

🤖 Generated with Claude Code

openshift-ci-robot · 2026-05-19T12:57:58Z

@sdminonne: This pull request references CNTRLPLANE-3430 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Switches TestCreateClusterHABreakGlassCredentials from the cloud platform to NonePlatform.

This test validates control-plane PKI operations (CSR creation/approval/signing, certificate
revocation, per-pod KAS cert reload) with NodePoolReplicas=0 and an HA control plane (3 KAS
replicas). None of these operations require cloud infrastructure.

The management cluster's cloud controller manager provisions a real LoadBalancer for the
kube-apiserver service regardless of the hosted cluster's platform type, so the guest API
server remains reachable. This follows the same pattern used by TestHAEtcdChaos.

Removing cloud infrastructure provisioning reduces cost, execution time, and flakiness from
cloud API interactions — particularly under resource pressure when multiple hosted clusters
compete for management cluster capacity (as seen in PR #8338's e2e-aks job).

Which issue(s) this PR fixes:

Fixes CNTRLPLANE-3430

Special notes for your reviewer:

Single-line change. The NonePlatform pattern is already proven by TestHAEtcdChaos
(test/e2e/chaos_test.go:32-66) which runs an HA control plane on NonePlatform with
NodePoolReplicas=0 and successfully calls WaitForGuestClient.

The framework handles NonePlatform correctly for this test:

before(): skips ValidatePublicCluster — OK, the test has explicit waits

after(): skips EnsureAdmissionPolicies — OK (OCPBUGS-61291)

after(): skips ValidateHostedClusterConditions — OK for NonePlatform

RunTestControlPlanePKIOperatorBreakGlassCredentials: only skips KubeVirt and GCP

Checklist:

Subject and description added to both, commit and PR.

Relevant issues have been referenced.

This change includes docs.

This change includes unit tests.

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2026-05-19T12:57:58Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-merge-bot · 2026-05-19T12:58:05Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

sdminonne · 2026-05-19T12:59:25Z

/test e2e-aks

coderabbitai · 2026-05-19T13:00:20Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds support function netutil.RouteHost that chooses the best hostname from a Route (Spec.Host, then Status.Ingress[0].Host, then RouterCanonicalHostname). Reconciler code in OAuth and KAS now uses this helper to populate service status (setting host and port or returning an error when no host is found). Unit tests for RouteHost and additional route-resolution cases were added. A single e2e test was updated to pass hyperv1.NonePlatform and to wait for hosted control plane bootstrap before running break-glass checks.

Sequence Diagram(s)

sequenceDiagram
  participant Reconciler as ReconcileServiceStatus / ReconcileKonnectivityServerServiceStatus
  participant Route as routev1.Route
  participant RouteHost as netutil.RouteHost
  participant ServiceStatus as ServiceStatusUpdater
  Reconciler->>Route: read route resource
  Reconciler->>RouteHost: call RouteHost(route)
  RouteHost-->>Reconciler: host (or "")
  alt host != ""
    Reconciler->>ServiceStatus: set host and port (443 / RouteExternalPort)
  else host == ""
    Reconciler-->>ServiceStatus: set error message and return
  end

Suggested reviewers

muraee
Nirshal

🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Microshift Test Compatibility	⚠️ Warning	Test uses HighlyAvailable control plane (3 KAS replicas) which violates MicroShift's single-node architecture. No protective [Skipped:MicroShift] or [apigroup:...] labels present.	Add [Skipped:MicroShift] label to test name or guard with exutil.IsMicroShiftCluster() check, since HighlyAvailable control planes unsupported on single-node MicroShift.

✅ Passed checks (10 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: converting the HA break-glass credentials test to use NonePlatform instead of cloud platform.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	All test names in the PR are stable and deterministic. TestRouteHost (7 cases) and new Route cases in TestReconcileServiceStatus (2 cases) use static, descriptive strings with no dynamic content.
Test Structure And Quality	✅ Passed	Test meets quality standards: single responsibility, proper cleanup, explicit 20-min timeout, meaningful assertion messages, consistent patterns.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	No new Ginkgo e2e tests were added. The PR modifies an existing test (TestCreateClusterHABreakGlassCredentials) and adds only a unit test (TestRouteHost), which is outside the scope of this check.
Topology-Aware Scheduling Compatibility	✅ Passed	PR introduces no scheduling constraints. Changes are for route hostname resolution, infrastructure logging, and e2e test platform configuration only.
Ote Binary Stdout Contract	✅ Passed	PR uses logr.Logger for all logging (framework-controlled). No fmt.Print/klog/stdout writes in process-level code. ValidateHostedClusterConditions is called within test bodies.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No new Ginkgo e2e tests were added. New unit tests use domain names, not IPv4 hardcoding, and have no external connectivity requirements.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-05-19T13:07:50Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.44%. Comparing base (6a6d8c1) to head (c08ad8d).
⚠️ Report is 142 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8546      +/-   ##
==========================================
+ Coverage   40.68%   41.44%   +0.75%     
==========================================
  Files         755      756       +1     
  Lines       93368    93644     +276     
==========================================
+ Hits        37985    38807     +822     
+ Misses      52649    52114     -535     
+ Partials     2734     2723      -11

see 56 files with indirect coverage changes

Flag	Coverage Δ
cmd-support	`34.87% <ø> (+0.17%)`	⬆️
cpo-hostedcontrolplane	`43.50% <ø> (+1.69%)`	⬆️
cpo-other	`42.79% <ø> (+1.40%)`	⬆️
hypershift-operator	`51.56% <ø> (+0.75%)`	⬆️
other	`31.64% <ø> (+0.02%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

sdminonne · 2026-05-19T13:21:42Z

@coderabbitai full review

coderabbitai · 2026-05-19T13:21:49Z

✅ Actions performed

Full review triggered.

hypershift-jira-solve-ci · 2026-05-19T15:13:58Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2056721417461829632 | Cost: $4.631663149999999 | Failed step: hypershift-azure-run-e2e

View full analysis report

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

cwbotbot · 2026-05-19T16:00:03Z

Test Results

e2e-aws

Status: ✅ PASS
Started: 2026-05-27T10:19:46Z
View Job
View Job History

e2e-aks

Status: ❌ FAIL
Started: 2026-05-27T10:19:53Z
View Job
View Job History

Failed Tests

Total failed tests: 15

TestNodePool
TestNodePool/HostedCluster0
TestNodePool/HostedCluster0/Main
TestNodePool/HostedCluster0/Main/TestMirrorConfigs
TestNodePool/HostedCluster0/Main/TestNTOMachineConfigAppliedInPlace

... and 10 more failed tests

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@control-plane-operator/controllers/hostedcontrolplane/oauth/service.go`:
- Around line 95-98: The code currently calls netutil.RouteHost(route) and then
dereferences route.ObjectMeta.CreationTimestamp.Time which can panic if route is
nil; add an explicit nil check for route before calling netutil.RouteHost (e.g.,
if route == nil set message to a clear "OAuth service route is missing" or
similar including the duration since creation and return), otherwise proceed to
compute routeHost and use route.ObjectMeta.CreationTimestamp.Time; ensure you
update the branch that currently handles empty routeHost to avoid dereferencing
route when route is nil.

In `@support/netutil/route_test.go`:
- Around line 472-559: Add a test case in TestRouteHost that asserts RouteHost
prefers a non-empty host from a later ingress when the first ingress is empty:
create a routev1.Route with Status.Ingress containing two entries where
Ingress[0] is empty and Ingress[1].Host (or RouterCanonicalHostname) is set, and
expect RouteHost(route) to return that second ingress host; ensure the new case
references RouteHost and the routev1.Route.Status.Ingress slice so it guards
multi-ingress fallback behavior.

In `@support/netutil/route.go`:
- Around line 163-169: The current logic only inspects route.Status.Ingress[0]
and returns its Host or RouterCanonicalHostname, which misses later ingress
entries; update the code that references route.Status.Ingress[0] to iterate over
all entries in route.Status.Ingress, checking each entry's Host first and then
RouterCanonicalHostname, and return the first non-empty value found (falling
back to empty/no-host if none exist); ensure you reference the same symbols
(route.Status.Ingress, Host, RouterCanonicalHostname) so the loop replaces the
existing index-0 checks.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 815d35bd-ddc0-4a15-b370-ec61f7400400

📥 Commits

Reviewing files that changed from the base of the PR and between 880b44f and 4a99314.

📒 Files selected for processing (5)

control-plane-operator/controllers/hostedcontrolplane/kas/service.go
control-plane-operator/controllers/hostedcontrolplane/oauth/service.go
control-plane-operator/controllers/hostedcontrolplane/oauth/service_test.go
support/netutil/route.go
support/netutil/route_test.go

coderabbitai · 2026-05-19T17:21:00Z

+func TestRouteHost(t *testing.T) {
+	testCases := []struct {
+		name         string
+		route        *routev1.Route
+		expectedHost string
+	}{
+		{
+			name:         "When route is nil, it should return empty string",
+			route:        nil,
+			expectedHost: "",
+		},
+		{
+			name: "When Spec.Host is set, it should return Spec.Host",
+			route: &routev1.Route{
+				Spec: routev1.RouteSpec{
+					Host: "oauth.example.com",
+				},
+			},
+			expectedHost: "oauth.example.com",
+		},
+		{
+			name: "When Spec.Host is set and Status.Ingress is also populated, it should prefer Spec.Host",
+			route: &routev1.Route{
+				Spec: routev1.RouteSpec{
+					Host: "oauth.example.com",
+				},
+				Status: routev1.RouteStatus{
+					Ingress: []routev1.RouteIngress{
+						{
+							Host:                    "ingress.example.com",
+							RouterCanonicalHostname: "canonical.example.com",
+						},
+					},
+				},
+			},
+			expectedHost: "oauth.example.com",
+		},
+		{
+			name: "When Spec.Host is empty and Status.Ingress[0].Host is set, it should return Ingress Host",
+			route: &routev1.Route{
+				Status: routev1.RouteStatus{
+					Ingress: []routev1.RouteIngress{
+						{
+							Host: "ingress.example.com",
+						},
+					},
+				},
+			},
+			expectedHost: "ingress.example.com",
+		},
+		{
+			name: "When Spec.Host and Ingress Host are empty but RouterCanonicalHostname is set, it should return RouterCanonicalHostname",
+			route: &routev1.Route{
+				Status: routev1.RouteStatus{
+					Ingress: []routev1.RouteIngress{
+						{
+							RouterCanonicalHostname: "router-canonical.example.com",
+						},
+					},
+				},
+			},
+			expectedHost: "router-canonical.example.com",
+		},
+		{
+			name: "When all host fields are empty, it should return empty string",
+			route: &routev1.Route{
+				Status: routev1.RouteStatus{
+					Ingress: []routev1.RouteIngress{
+						{},
+					},
+				},
+			},
+			expectedHost: "",
+		},
+		{
+			name:         "When Status.Ingress slice is empty, it should return empty string",
+			route:        &routev1.Route{},
+			expectedHost: "",
+		},
+	}
+
+	for _, tc := range testCases {
+		t.Run(tc.name, func(t *testing.T) {
+			g := NewWithT(t)
+			g.Expect(RouteHost(tc.route)).To(Equal(tc.expectedHost))
+		})
+	}
+}


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a multi-ingress test case to lock expected fallback behavior.

TestRouteHost should include a case where Status.Ingress[0] is empty and Status.Ingress[1] has Host (or RouterCanonicalHostname). This guards against regressions in real multi-router status layouts.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@support/netutil/route_test.go` around lines 472 - 559, Add a test case in TestRouteHost that asserts RouteHost prefers a non-empty host from a later ingress when the first ingress is empty: create a routev1.Route with Status.Ingress containing two entries where Ingress[0] is empty and Ingress[1].Host (or RouterCanonicalHostname) is set, and expect RouteHost(route) to return that second ingress host; ensure the new case references RouteHost and the routev1.Route.Status.Ingress slice so it guards multi-ingress fallback behavior.

This follows from the multi-ingress comment — since a route in HyperShift is admitted by exactly one router, there is no real scenario where Ingress[0] is empty but Ingress[1] has the host. Adding a test for that would codify fallback behavior we intentionally do not support and do not want to guarantee.

@sdminonne, that's a clear and valid constraint — since HyperShift routes are admitted by exactly one router, Ingress[0] being empty while Ingress[1] carries the host is not a real scenario, and locking in fallback behavior that isn't intended would be misleading. Happy to drop this suggestion.

✏️ Learnings added

Learnt from: sdminonne Repo: openshift/hypershift PR: 8546 File: support/netutil/route_test.go:472-559 Timestamp: 2026-05-19T18:45:39.602Z Learning: In HyperShift, a Route is admitted by exactly one router, so `Status.Ingress` will never have a meaningful host in a later index while `Ingress[0]` is empty. Multi-ingress fallback behavior in `RouteHost` (support/netutil/route.go) is intentionally not supported and should not be tested or codified.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

coderabbitai · 2026-05-19T17:21:00Z

+	if len(route.Status.Ingress) > 0 {
+		if route.Status.Ingress[0].Host != "" {
+			return route.Status.Ingress[0].Host
+		}
+		if route.Status.Ingress[0].RouterCanonicalHostname != "" {
+			return route.Status.Ingress[0].RouterCanonicalHostname
+		}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle multi-ingress routes instead of only index 0.

Line 163 assumes Status.Ingress[0] is the only useful entry. Routes can have multiple ingress records, so this may return empty even when a later entry has a usable host, which can stall readiness.

🔧 Suggested fix

func RouteHost(route *routev1.Route) string { if route == nil { return "" } if route.Spec.Host != "" { return route.Spec.Host } - if len(route.Status.Ingress) > 0 { - if route.Status.Ingress[0].Host != "" { - return route.Status.Ingress[0].Host - } - if route.Status.Ingress[0].RouterCanonicalHostname != "" { - return route.Status.Ingress[0].RouterCanonicalHostname - } - } - return "" + var canonicalFallback string + for _, ingress := range route.Status.Ingress { + if ingress.Host != "" { + return ingress.Host + } + if canonicalFallback == "" && ingress.RouterCanonicalHostname != "" { + canonicalFallback = ingress.RouterCanonicalHostname + } + } + return canonicalFallback }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if len(route.Status.Ingress) > 0 {

if route.Status.Ingress[0].Host != "" {

return route.Status.Ingress[0].Host

}

if route.Status.Ingress[0].RouterCanonicalHostname != "" {

return route.Status.Ingress[0].RouterCanonicalHostname

}

func RouteHost(route *routev1.Route) string {

if route == nil {

return ""

}

if route.Spec.Host != "" {

return route.Spec.Host

}

var canonicalFallback string

for _, ingress := range route.Status.Ingress {

if ingress.Host != "" {

return ingress.Host

}

if canonicalFallback == "" && ingress.RouterCanonicalHostname != "" {

canonicalFallback = ingress.RouterCanonicalHostname

}

}

return canonicalFallback

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@support/netutil/route.go` around lines 163 - 169, The current logic only inspects route.Status.Ingress[0] and returns its Host or RouterCanonicalHostname, which misses later ingress entries; update the code that references route.Status.Ingress[0] to iterate over all entries in route.Status.Ingress, checking each entry's Host first and then RouterCanonicalHostname, and return the first non-empty value found (falling back to empty/no-host if none exist); ensure you reference the same symbols (route.Status.Ingress, Host, RouterCanonicalHostname) so the loop replaces the existing index-0 checks.

In HyperShift, a route is admitted by exactly one router — either the OpenShift default router or the HyperShift shared ingress controller. There is no scenario where Ingress[0] is empty but Ingress[1] contains the host. Multiple ingress entries represent different routers admitting the same route, and the first entry is always the relevant one for the controller that owns the route.

Iterating all entries would add complexity for a scenario that does not exist in this codebase and could mask bugs by silently falling through to an unrelated router's entry.

@sdminonne, thank you for the clear explanation of HyperShift's routing model. That's a strong argument — since a route in HyperShift is always admitted by exactly one router, Status.Ingress[0] is definitively the correct (and only) entry to inspect. Iterating further would indeed add unnecessary complexity and could silently mask bugs by falling through to an unrelated router's entry. I'll withdraw the suggestion.

✏️ Learnings added

Learnt from: sdminonne Repo: openshift/hypershift PR: 8546 File: support/netutil/route.go:163-169 Timestamp: 2026-05-19T18:45:30.452Z Learning: In HyperShift, a Route is admitted by exactly one router — either the OpenShift default router or the HyperShift shared ingress controller. There is no scenario where `Status.Ingress[0]` is empty but a later entry (e.g. `Status.Ingress[1]`) contains the host. Multiple ingress entries would represent different routers admitting the same route, but that does not occur in HyperShift. Therefore, reading only `Status.Ingress[0]` (as in `netutil.RouteHost` in `support/netutil/route.go`) is intentional and correct; iterating over all entries is unnecessary and could mask bugs.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

sdminonne · 2026-05-19T18:46:19Z

/retest

sdminonne · 2026-05-19T18:46:26Z

/test e2e-aks

hypershift-jira-solve-ci · 2026-05-19T21:13:37Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2056808766044639232 | Cost: $6.988275500000003 | Failed step: hypershift-azure-run-e2e

View full analysis report

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

sdminonne · 2026-05-20T07:06:44Z

/test e2e-aks

hypershift-jira-solve-ci · 2026-05-20T08:51:24Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2056995254766145536 | Cost: $2.7953644999999994 | Failed step: hypershift-azure-run-e2e

View full analysis report

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

sdminonne · 2026-05-20T11:01:32Z

/test e2e-aks

hypershift-jira-solve-ci · 2026-05-20T13:56:18Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2057054221647220736 | Cost: $3.1325262499999997 | Failed step: hypershift-azure-run-e2e

View full analysis report

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

sdminonne · 2026-05-20T15:32:37Z

/test e2e-aks

hypershift-jira-solve-ci · 2026-05-20T17:15:17Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2057122430584885248 | Cost: $3.4875462500000003 | Failed step: hypershift-azure-run-e2e

View full analysis report

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

openshift-ci-robot · 2026-05-29T12:11:26Z

@sdminonne: This pull request references CNTRLPLANE-3430 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Makes TestCreateClusterHABreakGlassCredentials truly infraless by eliminating
the dependency on cloud LoadBalancer provisioning, and fixes a CPO code
inconsistency that prevented the LoadBalancer.Hostname bypass from working.

This test validates control-plane PKI operations (CSR creation/approval/signing,
certificate revocation, per-pod KAS cert reload) with NodePoolReplicas=0 and
an HA control plane (3 KAS replicas). None of these operations require cloud
infrastructure.

Which issue(s) this PR fixes:

Fixes CNTRLPLANE-3430

Why e2e-aks-4-22 fails while e2e-aws-4-22 passes

Both jobs use the CPO from the 4.22 release payload — not the PR-built code.
The 4.22 CPO has the old ordering in kas.ReconcileServiceStatus that calls
CollectLBMessageIfNotProvisioned before checking LoadBalancer.Hostname,
so the hostname bypass has no effect on either platform.

The difference is LB provisioning speed by the management cluster's cloud
controller manager. On AWS, the NLB provisions in seconds —
CollectLBMessageIfNotProvisioned finds svc.Status.LoadBalancer.Ingress
populated quickly and the CPO proceeds. On AKS, Azure LB provisioning is
slower, and under resource pressure the CPO pod itself may not schedule
promptly — causing the test to timeout waiting for infrastructure readiness.

Since the CPO fix (hostname check before LB wait) won't be in the release
payload until 4.23, the test is skipped on release versions < 4.23 via
e2eutil.AtLeast(t, e2eutil.Version423).

How the fix makes the test truly infraless

The CPO blocks ALL component deployment (etcd, KAS, PKI, etc.) until
InfrastructureStatus.IsReady() returns true. IsReady() requires three
hostnames: APIHost, KonnectivityHost, and OAuthHost. On NonePlatform,
the KAS service uses LoadBalancer type, while Konnectivity and OAuth use
Route type.

The problem was in kas.ReconcileServiceStatus: it called
CollectLBMessageIfNotProvisioned before checking
strategy.LoadBalancer.Hostname. This meant that even when the test set an
explicit hostname, the function returned early with a "load balancer is not
provisioned" message — the hostname check was never reached. OAuth and
Konnectivity already had the correct ordering (hostname check first), so this
was a KAS-specific inconsistency.

The fix reorders the check: LoadBalancer.Hostname is evaluated first.
Combined with the test's BeforeApply that sets explicit hostnames on all
services (Route hostnames for OAuth/Konnectivity, LoadBalancer hostname for
KAS pointing to the internal ClusterIP DNS), all three resolve immediately
without waiting for any cloud infrastructure.

Changes

Commit 1: fix(hypershift-operator) — RouteHost helper

Add RouteHost() in support/netutil with a fallback chain (Spec.Host →
Status.Ingress[0].Host) for resolving route hostnames.
RouterCanonicalHostname is intentionally not used as a fallback: it is
the shared router LB address, not a route-specific hostname, so it cannot be
used for SNI routing, OAuth callback URLs, PKI certificate SANs, or
konnectivity agent connect addresses.

Commit 2: fix(control-plane-operator) — KAS LB hostname fix

Reorder kas.ReconcileServiceStatus to check LoadBalancer.Hostname before
CollectLBMessageIfNotProvisioned, aligning with the existing pattern in
OAuth and Konnectivity service status functions.

Use RouteHost() in oauth.ReconcileServiceStatus and
kas.ReconcileKonnectivityServerServiceStatus.

Guard the shared ingress path in reconcileAPIServerServiceStatus with
serviceStrategy.Type == Route to prevent empty apiHost when
UseSharedIngress()=true but strategy is LoadBalancer.

Add fmt.Stringer on InfrastructureStatus for structured logging when
infra is not ready.

Unit tests for ReconcileServiceStatus, ReconcileKonnectivityServerServiceStatus,
ReconcileOAuthServiceStatus, shared ingress guard, and NonePlatform infra
integration test with fixtures.

Commit 3: test(e2e) — infraless test + version skip

Set explicit hostnames on all Route-type and LoadBalancer-type services via
BeforeApply so the CPO resolves infrastructure readiness without depending
on cloud LB provisioning or a default ingress domain.

Add early KAS pod scheduling failure detection (5-minute timeout) to fail
fast under resource pressure.

Skip the test on release versions < 4.23 via e2eutil.AtLeast(t, e2eutil.Version423) because the CPO from those payloads does not contain
the LoadBalancer.Hostname bypass fix.

Poll SelfSubjectReview in validateCertificateAuth with a retry loop to
tolerate slow PKI reconciliation on NonePlatform clusters where KAS may be
serving before the break-glass signer CA is reconciled into the trust bundle.

Checklist:

Subject and description added to both, commit and PR.

Relevant issues have been referenced.

This change includes docs.

This change includes unit tests.

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

sdminonne · 2026-05-29T12:15:08Z

/hold cancel

Nirshal · 2026-05-29T12:33:22Z

/lgtm

openshift-merge-bot · 2026-05-29T12:33:37Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

hypershift-jira-solve-ci · 2026-05-29T15:55:07Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2060339158013972480 | Cost: $1.9629789999999994 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

openshift-ci · 2026-06-02T09:17:45Z

New changes are detected. LGTM label has been removed.

sdminonne · 2026-06-02T13:13:19Z

/test e2e-aks

sdminonne · 2026-06-02T13:13:27Z

/test e2e-aws

hypershift-jira-solve-ci · 2026-06-02T16:10:24Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2061798353355673600 | Cost: $2.8396047500000003 | Failed step: hypershift-azure-run-e2e

View full analysis report

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

sdminonne · 2026-06-03T07:22:04Z

/test e2e-aks

sdminonne · 2026-06-03T07:22:17Z

/pipeline-required

sdminonne · 2026-06-03T07:23:13Z

/test e2e-aws

sdminonne · 2026-06-03T07:23:37Z

/test e2e-aws-4-22

sdminonne · 2026-06-03T13:10:20Z

/test e2e-aws

sdminonne · 2026-06-03T13:11:03Z

/test e2e-aks-4-22

sdminonne · 2026-06-03T15:14:07Z

/test e2e-aws-upgrade-hypershift-operator

hypershift-jira-solve-ci · 2026-06-03T15:25:28Z

MANAGED_SERVICE=ARO-HCP — that means UseSharedIngress() returns true!

Now the bug is crystal clear. In the old code (line 530 of infra.go):

if sharedingress.UseSharedIngress() || (hcp.Spec.Platform.Type == hyperv1.IBMCloudPlatform && serviceStrategy.Type == hyperv1.Route) {
    return sharedingress.KasRouteHostname(hcp), sharedingress.ExternalDNSLBPort, "", nil
}

Since UseSharedIngress() is true (ARO-HCP environment), this enters the shared ingress path for the KAS service even though the strategy is LoadBalancer, not Route. The KasRouteHostname() function returns an empty hostname because there's no KAS Route configured (KAS is using LoadBalancer), so APIHost stays empty, and IsReady() returns false forever.

The PR fixes this with:

if serviceStrategy.Type == hyperv1.Route && (sharedingress.UseSharedIngress() || hcp.Spec.Platform.Type == hyperv1.IBMCloudPlatform) {

This correctly checks the strategy type FIRST — only entering the shared ingress path if the strategy is Route. But this fix is in the CPO code which comes from the release image, not the PR build.

Now I have the complete root cause. Let me output the report:

Test Failure Analysis Complete

Job Information

Prow Job: pull-ci-openshift-hypershift-main-e2e-aks-4-22
Build ID: 2062163112106135553
Target: e2e-aks-4-22
Test: TestCreateClusterHABreakGlassCredentials
PR: #8546 — CNTRLPLANE-3430: Make HA break-glass credentials test infraless

Test Failure Analysis

Error

TestCreateClusterHABreakGlassCredentials/Main: Failed to wait for at least one KAS pod
to be scheduled in 10m0s: context deadline exceeded
  - observed **v1.Pod collection invalid: no KAS pods found yet

HostedCluster conditions stuck at:
  KubeAPIServerAvailable=False: NotFound(Kube APIServer deployment not found)
  EtcdAvailable=False: StatefulSetNotFound
  InfrastructureReady=False: WaitingOnInfrastructureReady(Cluster infrastructure is still provisioning)

Summary

The test creates a platform: None (infraless) HostedCluster on an AKS management cluster running in an ARO-HCP environment (MANAGED_SERVICE=ARO-HCP). The control-plane-operator (CPO) gets stuck in an infinite loop reporting "Infrastructure is not yet ready" for 10+ minutes, never deploying etcd or kube-apiserver. The root cause is a pre-existing bug in the CPO's infra.go shared ingress condition (line 530) that the PR fixes — but the fix only applies to the HyperShift operator binary, not the CPO binary, which is sourced from the OCP release image and therefore runs the old (unfixed) code.

Root Cause

The failure is caused by a bug in the shared ingress condition order in the control-plane-operator's infra.go, combined with the fact that the PR's CPO fix is not deployed because the CPO runs from the OCP release image, not from the PR build.

The bug (old code at infra.go:530):

if sharedingress.UseSharedIngress() || (hcp.Spec.Platform.Type == hyperv1.IBMCloudPlatform && serviceStrategy.Type == hyperv1.Route) {
    return sharedingress.KasRouteHostname(hcp), sharedingress.ExternalDNSLBPort, "", nil
}

On the AKS management cluster, MANAGED_SERVICE=ARO-HCP is set, so UseSharedIngress() returns true. This causes the function to enter the shared ingress path for the KAS service even though the strategy is LoadBalancer, not Route. Since there is no KAS Route (it uses LoadBalancer), KasRouteHostname() returns an empty string. This means APIHost stays empty, causing InfrastructureStatus.IsReady() to return false permanently — blocking all control plane component deployment (etcd, KAS, etc.).

The PR's fix (corrected condition):

if serviceStrategy.Type == hyperv1.Route && (sharedingress.UseSharedIngress() || hcp.Spec.Platform.Type == hyperv1.IBMCloudPlatform) {
    return sharedingress.KasRouteHostname(hcp), sharedingress.ExternalDNSLBPort, "", nil
}

This correctly checks the strategy type FIRST, only entering shared ingress when the strategy is Route.

Why the fix doesn't take effect: The HyperShift CI builds the HyperShift operator and test binary from the PR, but the control-plane-operator (CPO) is deployed from the OCP release image specified in the HostedCluster's release.image field. The PR changes to kas/service.go, infra/infra.go, and support/netutil/route.go all live in the CPO binary — which is NOT built from the PR. Therefore, the old buggy code runs, and infrastructure readiness is permanently blocked.

Evidence from CPO logs confirming old code: The CPO logs show "Infrastructure is not yet ready" without the "infraStatus" field that the PR adds, confirming the unpatched CPO binary is running.

Recommendations

The infra.go shared ingress condition fix must be present in the OCP release image's CPO — it needs to be backported/merged into the OCP release payload that the test uses, OR the test must use a release image that already contains the fix. Since this is a chicken-and-egg problem (the PR fixes CPO code but the CPO comes from the release), consider:
Override the CPO image in the test — Set --control-plane-operator-image to use the CI-built image from the PR instead of the release image's CPO. This would ensure the PR's fixes to infra.go, kas/service.go, and netutil/route.go are actually active.
Alternatively, separate the CPO fix from the test change — Land the infra.go condition order fix as an independent PR first (it's a correctness bug affecting any ARO-HCP cluster using KAS LoadBalancer strategy). Once it's in the release payload, the test change in this PR will work.
Short-term workaround — Change the test to use APIServer Route strategy (instead of LoadBalancer) so the shared ingress condition doesn't misfire. This avoids the need for the CPO fix but changes the test semantics.

Evidence

Evidence	Detail
CPO log pattern	All 239 lines show repeating "Infrastructure is not yet ready" → "Waiting for infrastructure to be ready before proceeding" from 13:56:39 to 14:06:24 (10 min)
Missing `infraStatus` field	CPO logs lack the `"infraStatus"` diagnostic field added by the PR, confirming old CPO code runs
`MANAGED_SERVICE` env	CPO pod has `MANAGED_SERVICE=ARO-HCP`, making `UseSharedIngress()` return true
Buggy condition (infra.go:530)	`sharedingress.UseSharedIngress() \|\| (IBMCloud && Route)` — enters shared ingress path for ANY strategy when UseSharedIngress is true
KAS service strategy	HostedCluster spec: `service: APIServer, type: LoadBalancer` — no Route, so `KasRouteHostname()` returns empty
KAS LB provisioned	Service `kube-apiserver` has `status.loadBalancer.ingress[0].ip: 57.165.204.32` — LB is ready but never used
Routes have hostnames	konnectivity: `konnectivity.hypershift.azure.devcluster.openshift.com`, oauth: `oauthserver.hypershift.azure.devcluster.openshift.com` — both Admitted
HCP conditions	`KubeAPIServerAvailable=False: NotFound`, `EtcdAvailable=False: StatefulSetNotFound` — no CP components deployed
Platform type	`platform: type: None` (infraless) with `controllerAvailabilityPolicy: HighlyAvailable`
Test duration	`TestCreateClusterHABreakGlassCredentials/Main` ran for exactly 600.00s (10 min timeout)

… < 4.23 - Set explicit hostnames on Route-type services in BeforeApply so the CPO can resolve infrastructure readiness on non-OpenShift management clusters (e.g. AKS) without a default ingress domain. - Skip the test on release versions < 4.23 via AtLeast(Version423) because older CPO versions lack the LoadBalancer.Hostname bypass fix. - Add early KAS pod scheduling failure detection (10-minute timeout) to fail fast under resource pressure. - Poll SSR in validateCertificateAuth instead of failing immediately on Unauthorized, allowing time for the control-plane-pki-operator to reconcile the break-glass signer CA into the client-ca trust bundle. - Increase WaitForGuestClient timeout to 15 minutes to accommodate cloud LB provisioning time on Azure/AWS. Signed-off-by: Salvatore Dario Minonne <sminonne@redhat.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci · 2026-06-04T10:13:20Z

@sdminonne: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aks-4-22	`2f43cbe`	link	true	`/test e2e-aks-4-22`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

sdminonne · 2026-06-04T12:47:37Z

/retest

sdminonne · 2026-06-04T14:25:57Z

/retest

sdminonne · 2026-06-04T20:58:13Z

/pipeline-required

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 19, 2026

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 19, 2026

openshift-ci Bot added the do-not-merge/needs-area label May 19, 2026

openshift-ci Bot added area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels May 19, 2026

openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release labels May 19, 2026

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

sdminonne force-pushed the CNTRLPLANE-3430/infraless-ha-break-glass-creds branch from 97d69f9 to 1f3e9bc Compare May 21, 2026 07:37

hypershift-jira-solve-ci Bot mentioned this pull request May 21, 2026

OCPBUGS-85151: Re-enable serviceaccount-pull-secrets controller when registry managementState changes from Removed #8522

Merged

4 tasks

sdminonne force-pushed the CNTRLPLANE-3430/infraless-ha-break-glass-creds branch from 1f3e9bc to 5a6f1f4 Compare May 21, 2026 08:21

openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label May 29, 2026

openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label May 29, 2026

openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 29, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 29, 2026

sdminonne force-pushed the CNTRLPLANE-3430/infraless-ha-break-glass-creds branch from 78d403d to 2f43cbe Compare June 2, 2026 09:17

openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 2, 2026

sdminonne force-pushed the CNTRLPLANE-3430/infraless-ha-break-glass-creds branch from 2f43cbe to 618a7ba Compare June 4, 2026 08:50

sdminonne force-pushed the CNTRLPLANE-3430/infraless-ha-break-glass-creds branch from 618a7ba to c08ad8d Compare June 4, 2026 09:57

-	if len(route.Status.Ingress) > 0 {
-		if route.Status.Ingress[0].Host != "" {
-			return route.Status.Ingress[0].Host
-		}
-		if route.Status.Ingress[0].RouterCanonicalHostname != "" {
-			return route.Status.Ingress[0].RouterCanonicalHostname
-		}
+func RouteHost(route *routev1.Route) string {
+	if route == nil {
+		return ""
+	}
+	if route.Spec.Host != "" {
+		return route.Spec.Host
+	}
+	var canonicalFallback string
+	for _, ingress := range route.Status.Ingress {
+		if ingress.Host != "" {
+			return ingress.Host
+		}
+		if canonicalFallback == "" && ingress.RouterCanonicalHostname != "" {
+			canonicalFallback = ingress.RouterCanonicalHostname
+		}
+	}
+	return canonicalFallback
+}

Conversation

sdminonne commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

openshift-ci-robot commented May 19, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Checklist:

Uh oh!

openshift-ci Bot commented May 19, 2026

Uh oh!

openshift-merge-bot Bot commented May 19, 2026

Uh oh!

sdminonne commented May 19, 2026

Uh oh!

coderabbitai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Sequence Diagram(s)

Suggested reviewers

❌ Failed checks (2 warnings)

Uh oh!

codecov Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

sdminonne commented May 19, 2026

Uh oh!

coderabbitai Bot commented May 19, 2026

Uh oh!

hypershift-jira-solve-ci Bot commented May 19, 2026

AI Test Failure Analysis

Uh oh!

cwbotbot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results

e2e-aws

e2e-aks

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sdminonne May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sdminonne May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

sdminonne commented May 19, 2026

Uh oh!

sdminonne commented May 19, 2026

Uh oh!

hypershift-jira-solve-ci Bot commented May 19, 2026

AI Test Failure Analysis

Uh oh!

sdminonne commented May 20, 2026

Uh oh!

hypershift-jira-solve-ci Bot commented May 20, 2026

AI Test Failure Analysis

Uh oh!

sdminonne commented May 19, 2026 •

edited

Loading

openshift-ci-robot commented May 19, 2026 •

edited by openshift-ci Bot

Loading

coderabbitai Bot commented May 19, 2026 •

edited

Loading

codecov Bot commented May 19, 2026 •

edited

Loading

cwbotbot commented May 19, 2026 •

edited

Loading

coderabbitai Bot May 19, 2026 •

edited

Loading

coderabbitai Bot May 19, 2026 •

edited

Loading

openshift-ci-robot commented May 29, 2026 •

edited by openshift-ci Bot

Loading

Why `e2e-aks-4-22` fails while `e2e-aws-4-22` passes

Commit 1: `fix(hypershift-operator)` — RouteHost helper

Commit 2: `fix(control-plane-operator)` — KAS LB hostname fix

Commit 3: `test(e2e)` — infraless test + version skip

hypershift-jira-solve-ci Bot commented Jun 3, 2026 •

edited by openshift-ci Bot

Loading