Skip to content

CNTRLPLANE-3430: Make HA break-glass credentials test infraless#8546

Open
sdminonne wants to merge 1 commit into
openshift:mainfrom
sdminonne:CNTRLPLANE-3430/infraless-ha-break-glass-creds
Open

CNTRLPLANE-3430: Make HA break-glass credentials test infraless#8546
sdminonne wants to merge 1 commit into
openshift:mainfrom
sdminonne:CNTRLPLANE-3430/infraless-ha-break-glass-creds

Conversation

@sdminonne
Copy link
Copy Markdown
Contributor

@sdminonne sdminonne commented May 19, 2026

Summary

  • Make TestCreateClusterHABreakGlassCredentials work on non-OpenShift management clusters (e.g. AKS) by setting explicit hostnames on Route-type services in BeforeApply, so the CPO can resolve infrastructure readiness without a default ingress domain.
  • Add early KAS pod scheduling failure detection (10-minute timeout) to fail fast under resource pressure instead of waiting for the full WaitForGuestClient timeout.
  • Skip the test on release versions < 4.23 via e2eutil.AtLeast(t, e2eutil.Version423) because older CPO versions lack the LoadBalancer.Hostname bypass fix (merged in PR OCPBUGS-87028: Fix KAS ReconcileServiceStatus to check LoadBalancer.Hostname before LB provisioning #8663).
  • Poll SSR in validateCertificateAuth instead of failing immediately on Unauthorized, allowing time for the control-plane-pki-operator to reconcile the break-glass signer CA into the client-ca trust bundle.

Depends on: #8663

Test plan

  • e2e-aks-4-22: test should be skipped (not failed) due to AtLeast(Version423)
  • e2e-aks (4.23+): test should pass as before
  • e2e-aws: test should pass (cloud LB provisions external endpoints)

🤖 Generated with Claude Code

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 19, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 19, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented May 19, 2026

@sdminonne: This pull request references CNTRLPLANE-3430 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Switches TestCreateClusterHABreakGlassCredentials from the cloud platform to NonePlatform.

This test validates control-plane PKI operations (CSR creation/approval/signing, certificate
revocation, per-pod KAS cert reload) with NodePoolReplicas=0 and an HA control plane (3 KAS
replicas). None of these operations require cloud infrastructure.

The management cluster's cloud controller manager provisions a real LoadBalancer for the
kube-apiserver service regardless of the hosted cluster's platform type, so the guest API
server remains reachable. This follows the same pattern used by TestHAEtcdChaos.

Removing cloud infrastructure provisioning reduces cost, execution time, and flakiness from
cloud API interactions — particularly under resource pressure when multiple hosted clusters
compete for management cluster capacity (as seen in PR #8338's e2e-aks job).

Which issue(s) this PR fixes:

Fixes CNTRLPLANE-3430

Special notes for your reviewer:

Single-line change. The NonePlatform pattern is already proven by TestHAEtcdChaos
(test/e2e/chaos_test.go:32-66) which runs an HA control plane on NonePlatform with
NodePoolReplicas=0 and successfully calls WaitForGuestClient.

The framework handles NonePlatform correctly for this test:

  • before(): skips ValidatePublicCluster — OK, the test has explicit waits
  • after(): skips EnsureAdmissionPolicies — OK (OCPBUGS-61291)
  • after(): skips ValidateHostedClusterConditions — OK for NonePlatform
  • RunTestControlPlanePKIOperatorBreakGlassCredentials: only skips KubeVirt and GCP

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 19, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels May 19, 2026
@sdminonne
Copy link
Copy Markdown
Contributor Author

/test e2e-aks

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds support function netutil.RouteHost that chooses the best hostname from a Route (Spec.Host, then Status.Ingress[0].Host, then RouterCanonicalHostname). Reconciler code in OAuth and KAS now uses this helper to populate service status (setting host and port or returning an error when no host is found). Unit tests for RouteHost and additional route-resolution cases were added. A single e2e test was updated to pass hyperv1.NonePlatform and to wait for hosted control plane bootstrap before running break-glass checks.

Sequence Diagram(s)

sequenceDiagram
  participant Reconciler as ReconcileServiceStatus / ReconcileKonnectivityServerServiceStatus
  participant Route as routev1.Route
  participant RouteHost as netutil.RouteHost
  participant ServiceStatus as ServiceStatusUpdater
  Reconciler->>Route: read route resource
  Reconciler->>RouteHost: call RouteHost(route)
  RouteHost-->>Reconciler: host (or "")
  alt host != ""
    Reconciler->>ServiceStatus: set host and port (443 / RouteExternalPort)
  else host == ""
    Reconciler-->>ServiceStatus: set error message and return
  end
Loading

Suggested reviewers

  • muraee
  • Nirshal
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Microshift Test Compatibility ⚠️ Warning Test uses HighlyAvailable control plane (3 KAS replicas) which violates MicroShift's single-node architecture. No protective [Skipped:MicroShift] or [apigroup:...] labels present. Add [Skipped:MicroShift] label to test name or guard with exutil.IsMicroShiftCluster() check, since HighlyAvailable control planes unsupported on single-node MicroShift.
✅ Passed checks (10 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: converting the HA break-glass credentials test to use NonePlatform instead of cloud platform.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names in the PR are stable and deterministic. TestRouteHost (7 cases) and new Route cases in TestReconcileServiceStatus (2 cases) use static, descriptive strings with no dynamic content.
Test Structure And Quality ✅ Passed Test meets quality standards: single responsibility, proper cleanup, explicit 20-min timeout, meaningful assertion messages, consistent patterns.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No new Ginkgo e2e tests were added. The PR modifies an existing test (TestCreateClusterHABreakGlassCredentials) and adds only a unit test (TestRouteHost), which is outside the scope of this check.
Topology-Aware Scheduling Compatibility ✅ Passed PR introduces no scheduling constraints. Changes are for route hostname resolution, infrastructure logging, and e2e test platform configuration only.
Ote Binary Stdout Contract ✅ Passed PR uses logr.Logger for all logging (framework-controlled). No fmt.Print/klog/stdout writes in process-level code. ValidateHostedClusterConditions is called within test bodies.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No new Ginkgo e2e tests were added. New unit tests use domain names, not IPv4 hardcoding, and have no external connectivity requirements.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.44%. Comparing base (6a6d8c1) to head (c08ad8d).
⚠️ Report is 142 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8546      +/-   ##
==========================================
+ Coverage   40.68%   41.44%   +0.75%     
==========================================
  Files         755      756       +1     
  Lines       93368    93644     +276     
==========================================
+ Hits        37985    38807     +822     
+ Misses      52649    52114     -535     
+ Partials     2734     2723      -11     

see 56 files with indirect coverage changes

Flag Coverage Δ
cmd-support 34.87% <ø> (+0.17%) ⬆️
cpo-hostedcontrolplane 43.50% <ø> (+1.69%) ⬆️
cpo-other 42.79% <ø> (+1.40%) ⬆️
hypershift-operator 51.56% <ø> (+0.75%) ⬆️
other 31.64% <ø> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sdminonne
Copy link
Copy Markdown
Contributor Author

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 19, 2026

✅ Actions performed

Full review triggered.

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2056721417461829632 | Cost: $4.631663149999999 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@cwbotbot
Copy link
Copy Markdown

cwbotbot commented May 19, 2026

Test Results

e2e-aws

e2e-aks

Failed Tests

Total failed tests: 15

  • TestNodePool
  • TestNodePool/HostedCluster0
  • TestNodePool/HostedCluster0/Main
  • TestNodePool/HostedCluster0/Main/TestMirrorConfigs
  • TestNodePool/HostedCluster0/Main/TestNTOMachineConfigAppliedInPlace

... and 10 more failed tests

@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release labels May 19, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@control-plane-operator/controllers/hostedcontrolplane/oauth/service.go`:
- Around line 95-98: The code currently calls netutil.RouteHost(route) and then
dereferences route.ObjectMeta.CreationTimestamp.Time which can panic if route is
nil; add an explicit nil check for route before calling netutil.RouteHost (e.g.,
if route == nil set message to a clear "OAuth service route is missing" or
similar including the duration since creation and return), otherwise proceed to
compute routeHost and use route.ObjectMeta.CreationTimestamp.Time; ensure you
update the branch that currently handles empty routeHost to avoid dereferencing
route when route is nil.

In `@support/netutil/route_test.go`:
- Around line 472-559: Add a test case in TestRouteHost that asserts RouteHost
prefers a non-empty host from a later ingress when the first ingress is empty:
create a routev1.Route with Status.Ingress containing two entries where
Ingress[0] is empty and Ingress[1].Host (or RouterCanonicalHostname) is set, and
expect RouteHost(route) to return that second ingress host; ensure the new case
references RouteHost and the routev1.Route.Status.Ingress slice so it guards
multi-ingress fallback behavior.

In `@support/netutil/route.go`:
- Around line 163-169: The current logic only inspects route.Status.Ingress[0]
and returns its Host or RouterCanonicalHostname, which misses later ingress
entries; update the code that references route.Status.Ingress[0] to iterate over
all entries in route.Status.Ingress, checking each entry's Host first and then
RouterCanonicalHostname, and return the first non-empty value found (falling
back to empty/no-host if none exist); ensure you reference the same symbols
(route.Status.Ingress, Host, RouterCanonicalHostname) so the loop replaces the
existing index-0 checks.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 815d35bd-ddc0-4a15-b370-ec61f7400400

📥 Commits

Reviewing files that changed from the base of the PR and between 880b44f and 4a99314.

📒 Files selected for processing (5)
  • control-plane-operator/controllers/hostedcontrolplane/kas/service.go
  • control-plane-operator/controllers/hostedcontrolplane/oauth/service.go
  • control-plane-operator/controllers/hostedcontrolplane/oauth/service_test.go
  • support/netutil/route.go
  • support/netutil/route_test.go

Comment thread control-plane-operator/controllers/hostedcontrolplane/oauth/service.go Outdated
Comment thread support/netutil/route_test.go Outdated
Comment on lines +472 to +559
func TestRouteHost(t *testing.T) {
testCases := []struct {
name string
route *routev1.Route
expectedHost string
}{
{
name: "When route is nil, it should return empty string",
route: nil,
expectedHost: "",
},
{
name: "When Spec.Host is set, it should return Spec.Host",
route: &routev1.Route{
Spec: routev1.RouteSpec{
Host: "oauth.example.com",
},
},
expectedHost: "oauth.example.com",
},
{
name: "When Spec.Host is set and Status.Ingress is also populated, it should prefer Spec.Host",
route: &routev1.Route{
Spec: routev1.RouteSpec{
Host: "oauth.example.com",
},
Status: routev1.RouteStatus{
Ingress: []routev1.RouteIngress{
{
Host: "ingress.example.com",
RouterCanonicalHostname: "canonical.example.com",
},
},
},
},
expectedHost: "oauth.example.com",
},
{
name: "When Spec.Host is empty and Status.Ingress[0].Host is set, it should return Ingress Host",
route: &routev1.Route{
Status: routev1.RouteStatus{
Ingress: []routev1.RouteIngress{
{
Host: "ingress.example.com",
},
},
},
},
expectedHost: "ingress.example.com",
},
{
name: "When Spec.Host and Ingress Host are empty but RouterCanonicalHostname is set, it should return RouterCanonicalHostname",
route: &routev1.Route{
Status: routev1.RouteStatus{
Ingress: []routev1.RouteIngress{
{
RouterCanonicalHostname: "router-canonical.example.com",
},
},
},
},
expectedHost: "router-canonical.example.com",
},
{
name: "When all host fields are empty, it should return empty string",
route: &routev1.Route{
Status: routev1.RouteStatus{
Ingress: []routev1.RouteIngress{
{},
},
},
},
expectedHost: "",
},
{
name: "When Status.Ingress slice is empty, it should return empty string",
route: &routev1.Route{},
expectedHost: "",
},
}

for _, tc := range testCases {
t.Run(tc.name, func(t *testing.T) {
g := NewWithT(t)
g.Expect(RouteHost(tc.route)).To(Equal(tc.expectedHost))
})
}
}
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a multi-ingress test case to lock expected fallback behavior.

TestRouteHost should include a case where Status.Ingress[0] is empty and Status.Ingress[1] has Host (or RouterCanonicalHostname). This guards against regressions in real multi-router status layouts.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@support/netutil/route_test.go` around lines 472 - 559, Add a test case in
TestRouteHost that asserts RouteHost prefers a non-empty host from a later
ingress when the first ingress is empty: create a routev1.Route with
Status.Ingress containing two entries where Ingress[0] is empty and
Ingress[1].Host (or RouterCanonicalHostname) is set, and expect RouteHost(route)
to return that second ingress host; ensure the new case references RouteHost and
the routev1.Route.Status.Ingress slice so it guards multi-ingress fallback
behavior.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This follows from the multi-ingress comment — since a route in HyperShift is admitted by exactly one router, there is no real scenario where Ingress[0] is empty but Ingress[1] has the host. Adding a test for that would codify fallback behavior we intentionally do not support and do not want to guarantee.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sdminonne, that's a clear and valid constraint — since HyperShift routes are admitted by exactly one router, Ingress[0] being empty while Ingress[1] carries the host is not a real scenario, and locking in fallback behavior that isn't intended would be misleading. Happy to drop this suggestion.


✏️ Learnings added
Learnt from: sdminonne
Repo: openshift/hypershift PR: 8546
File: support/netutil/route_test.go:472-559
Timestamp: 2026-05-19T18:45:39.602Z
Learning: In HyperShift, a Route is admitted by exactly one router, so `Status.Ingress` will never have a meaningful host in a later index while `Ingress[0]` is empty. Multi-ingress fallback behavior in `RouteHost` (support/netutil/route.go) is intentionally not supported and should not be tested or codified.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment thread support/netutil/route.go Outdated
Comment on lines +163 to +169
if len(route.Status.Ingress) > 0 {
if route.Status.Ingress[0].Host != "" {
return route.Status.Ingress[0].Host
}
if route.Status.Ingress[0].RouterCanonicalHostname != "" {
return route.Status.Ingress[0].RouterCanonicalHostname
}
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle multi-ingress routes instead of only index 0.

Line 163 assumes Status.Ingress[0] is the only useful entry. Routes can have multiple ingress records, so this may return empty even when a later entry has a usable host, which can stall readiness.

🔧 Suggested fix
 func RouteHost(route *routev1.Route) string {
 	if route == nil {
 		return ""
 	}
 	if route.Spec.Host != "" {
 		return route.Spec.Host
 	}
-	if len(route.Status.Ingress) > 0 {
-		if route.Status.Ingress[0].Host != "" {
-			return route.Status.Ingress[0].Host
-		}
-		if route.Status.Ingress[0].RouterCanonicalHostname != "" {
-			return route.Status.Ingress[0].RouterCanonicalHostname
-		}
-	}
-	return ""
+	var canonicalFallback string
+	for _, ingress := range route.Status.Ingress {
+		if ingress.Host != "" {
+			return ingress.Host
+		}
+		if canonicalFallback == "" && ingress.RouterCanonicalHostname != "" {
+			canonicalFallback = ingress.RouterCanonicalHostname
+		}
+	}
+	return canonicalFallback
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if len(route.Status.Ingress) > 0 {
if route.Status.Ingress[0].Host != "" {
return route.Status.Ingress[0].Host
}
if route.Status.Ingress[0].RouterCanonicalHostname != "" {
return route.Status.Ingress[0].RouterCanonicalHostname
}
func RouteHost(route *routev1.Route) string {
if route == nil {
return ""
}
if route.Spec.Host != "" {
return route.Spec.Host
}
var canonicalFallback string
for _, ingress := range route.Status.Ingress {
if ingress.Host != "" {
return ingress.Host
}
if canonicalFallback == "" && ingress.RouterCanonicalHostname != "" {
canonicalFallback = ingress.RouterCanonicalHostname
}
}
return canonicalFallback
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@support/netutil/route.go` around lines 163 - 169, The current logic only
inspects route.Status.Ingress[0] and returns its Host or
RouterCanonicalHostname, which misses later ingress entries; update the code
that references route.Status.Ingress[0] to iterate over all entries in
route.Status.Ingress, checking each entry's Host first and then
RouterCanonicalHostname, and return the first non-empty value found (falling
back to empty/no-host if none exist); ensure you reference the same symbols
(route.Status.Ingress, Host, RouterCanonicalHostname) so the loop replaces the
existing index-0 checks.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In HyperShift, a route is admitted by exactly one router — either the OpenShift default router or the HyperShift shared ingress controller. There is no scenario where Ingress[0] is empty but Ingress[1] contains the host. Multiple ingress entries represent different routers admitting the same route, and the first entry is always the relevant one for the controller that owns the route.

Iterating all entries would add complexity for a scenario that does not exist in this codebase and could mask bugs by silently falling through to an unrelated router's entry.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sdminonne, thank you for the clear explanation of HyperShift's routing model. That's a strong argument — since a route in HyperShift is always admitted by exactly one router, Status.Ingress[0] is definitively the correct (and only) entry to inspect. Iterating further would indeed add unnecessary complexity and could silently mask bugs by falling through to an unrelated router's entry. I'll withdraw the suggestion.


✏️ Learnings added
Learnt from: sdminonne
Repo: openshift/hypershift PR: 8546
File: support/netutil/route.go:163-169
Timestamp: 2026-05-19T18:45:30.452Z
Learning: In HyperShift, a Route is admitted by exactly one router — either the OpenShift default router or the HyperShift shared ingress controller. There is no scenario where `Status.Ingress[0]` is empty but a later entry (e.g. `Status.Ingress[1]`) contains the host. Multiple ingress entries would represent different routers admitting the same route, but that does not occur in HyperShift. Therefore, reading only `Status.Ingress[0]` (as in `netutil.RouteHost` in `support/netutil/route.go`) is intentional and correct; iterating over all entries is unnecessary and could mask bugs.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@sdminonne
Copy link
Copy Markdown
Contributor Author

/retest

@sdminonne
Copy link
Copy Markdown
Contributor Author

/test e2e-aks

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2056808766044639232 | Cost: $6.988275500000003 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@sdminonne
Copy link
Copy Markdown
Contributor Author

/test e2e-aks

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2056995254766145536 | Cost: $2.7953644999999994 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@sdminonne
Copy link
Copy Markdown
Contributor Author

/test e2e-aks

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2057054221647220736 | Cost: $3.1325262499999997 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@sdminonne
Copy link
Copy Markdown
Contributor Author

/test e2e-aks

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2057122430584885248 | Cost: $3.4875462500000003 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label May 29, 2026
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label May 29, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented May 29, 2026

@sdminonne: This pull request references CNTRLPLANE-3430 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Makes TestCreateClusterHABreakGlassCredentials truly infraless by eliminating
the dependency on cloud LoadBalancer provisioning, and fixes a CPO code
inconsistency that prevented the LoadBalancer.Hostname bypass from working.

This test validates control-plane PKI operations (CSR creation/approval/signing,
certificate revocation, per-pod KAS cert reload) with NodePoolReplicas=0 and
an HA control plane (3 KAS replicas). None of these operations require cloud
infrastructure.

Which issue(s) this PR fixes:

Fixes CNTRLPLANE-3430

Why e2e-aks-4-22 fails while e2e-aws-4-22 passes

Both jobs use the CPO from the 4.22 release payload — not the PR-built code.
The 4.22 CPO has the old ordering in kas.ReconcileServiceStatus that calls
CollectLBMessageIfNotProvisioned before checking LoadBalancer.Hostname,
so the hostname bypass has no effect on either platform.

The difference is LB provisioning speed by the management cluster's cloud
controller manager. On AWS, the NLB provisions in seconds —
CollectLBMessageIfNotProvisioned finds svc.Status.LoadBalancer.Ingress
populated quickly and the CPO proceeds. On AKS, Azure LB provisioning is
slower, and under resource pressure the CPO pod itself may not schedule
promptly — causing the test to timeout waiting for infrastructure readiness.

Since the CPO fix (hostname check before LB wait) won't be in the release
payload until 4.23, the test is skipped on release versions < 4.23 via
e2eutil.AtLeast(t, e2eutil.Version423).

How the fix makes the test truly infraless

The CPO blocks ALL component deployment (etcd, KAS, PKI, etc.) until
InfrastructureStatus.IsReady() returns true. IsReady() requires three
hostnames: APIHost, KonnectivityHost, and OAuthHost. On NonePlatform,
the KAS service uses LoadBalancer type, while Konnectivity and OAuth use
Route type.

The problem was in kas.ReconcileServiceStatus: it called
CollectLBMessageIfNotProvisioned before checking
strategy.LoadBalancer.Hostname. This meant that even when the test set an
explicit hostname, the function returned early with a "load balancer is not
provisioned" message — the hostname check was never reached. OAuth and
Konnectivity already had the correct ordering (hostname check first), so this
was a KAS-specific inconsistency.

The fix reorders the check: LoadBalancer.Hostname is evaluated first.
Combined with the test's BeforeApply that sets explicit hostnames on all
services (Route hostnames for OAuth/Konnectivity, LoadBalancer hostname for
KAS pointing to the internal ClusterIP DNS), all three resolve immediately
without waiting for any cloud infrastructure.

Changes

Commit 1: fix(hypershift-operator) — RouteHost helper

  • Add RouteHost() in support/netutil with a fallback chain (Spec.Host
    Status.Ingress[0].Host) for resolving route hostnames.
    RouterCanonicalHostname is intentionally not used as a fallback: it is
    the shared router LB address, not a route-specific hostname, so it cannot be
    used for SNI routing, OAuth callback URLs, PKI certificate SANs, or
    konnectivity agent connect addresses.

Commit 2: fix(control-plane-operator) — KAS LB hostname fix

  • Reorder kas.ReconcileServiceStatus to check LoadBalancer.Hostname before
    CollectLBMessageIfNotProvisioned, aligning with the existing pattern in
    OAuth and Konnectivity service status functions.
  • Use RouteHost() in oauth.ReconcileServiceStatus and
    kas.ReconcileKonnectivityServerServiceStatus.
  • Guard the shared ingress path in reconcileAPIServerServiceStatus with
    serviceStrategy.Type == Route to prevent empty apiHost when
    UseSharedIngress()=true but strategy is LoadBalancer.
  • Add fmt.Stringer on InfrastructureStatus for structured logging when
    infra is not ready.
  • Unit tests for ReconcileServiceStatus, ReconcileKonnectivityServerServiceStatus,
    ReconcileOAuthServiceStatus, shared ingress guard, and NonePlatform infra
    integration test with fixtures.

Commit 3: test(e2e) — infraless test + version skip

  • Set explicit hostnames on all Route-type and LoadBalancer-type services via
    BeforeApply so the CPO resolves infrastructure readiness without depending
    on cloud LB provisioning or a default ingress domain.
  • Add early KAS pod scheduling failure detection (5-minute timeout) to fail
    fast under resource pressure.
  • Skip the test on release versions < 4.23 via e2eutil.AtLeast(t, e2eutil.Version423) because the CPO from those payloads does not contain
    the LoadBalancer.Hostname bypass fix.
  • Poll SelfSubjectReview in validateCertificateAuth with a retry loop to
    tolerate slow PKI reconciliation on NonePlatform clusters where KAS may be
    serving before the break-glass signer CA is reconciled into the trust bundle.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sdminonne
Copy link
Copy Markdown
Contributor Author

/hold cancel

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 29, 2026
@Nirshal
Copy link
Copy Markdown
Contributor

Nirshal commented May 29, 2026

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 29, 2026
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2060339158013972480 | Cost: $1.9629789999999994 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@sdminonne sdminonne force-pushed the CNTRLPLANE-3430/infraless-ha-break-glass-creds branch from 78d403d to 2f43cbe Compare June 2, 2026 09:17
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 2, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 2, 2026

New changes are detected. LGTM label has been removed.

@sdminonne
Copy link
Copy Markdown
Contributor Author

/test e2e-aks

@sdminonne
Copy link
Copy Markdown
Contributor Author

/test e2e-aws

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2061798353355673600 | Cost: $2.8396047500000003 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@sdminonne
Copy link
Copy Markdown
Contributor Author

/test e2e-aks

@sdminonne
Copy link
Copy Markdown
Contributor Author

/pipeline-required

@sdminonne
Copy link
Copy Markdown
Contributor Author

/test e2e-aws

@sdminonne
Copy link
Copy Markdown
Contributor Author

/test e2e-aws-4-22

@sdminonne
Copy link
Copy Markdown
Contributor Author

/test e2e-aws

@sdminonne
Copy link
Copy Markdown
Contributor Author

/test e2e-aks-4-22

@sdminonne
Copy link
Copy Markdown
Contributor Author

/test e2e-aws-upgrade-hypershift-operator

@hypershift-jira-solve-ci
Copy link
Copy Markdown

hypershift-jira-solve-ci Bot commented Jun 3, 2026

MANAGED_SERVICE=ARO-HCP — that means UseSharedIngress() returns true!

Now the bug is crystal clear. In the old code (line 530 of infra.go):

if sharedingress.UseSharedIngress() || (hcp.Spec.Platform.Type == hyperv1.IBMCloudPlatform && serviceStrategy.Type == hyperv1.Route) {
    return sharedingress.KasRouteHostname(hcp), sharedingress.ExternalDNSLBPort, "", nil
}

Since UseSharedIngress() is true (ARO-HCP environment), this enters the shared ingress path for the KAS service even though the strategy is LoadBalancer, not Route. The KasRouteHostname() function returns an empty hostname because there's no KAS Route configured (KAS is using LoadBalancer), so APIHost stays empty, and IsReady() returns false forever.

The PR fixes this with:

if serviceStrategy.Type == hyperv1.Route && (sharedingress.UseSharedIngress() || hcp.Spec.Platform.Type == hyperv1.IBMCloudPlatform) {

This correctly checks the strategy type FIRST — only entering the shared ingress path if the strategy is Route. But this fix is in the CPO code which comes from the release image, not the PR build.

Now I have the complete root cause. Let me output the report:

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-aks-4-22
  • Build ID: 2062163112106135553
  • Target: e2e-aks-4-22
  • Test: TestCreateClusterHABreakGlassCredentials
  • PR: #8546CNTRLPLANE-3430: Make HA break-glass credentials test infraless

Test Failure Analysis

Error

TestCreateClusterHABreakGlassCredentials/Main: Failed to wait for at least one KAS pod
to be scheduled in 10m0s: context deadline exceeded
  - observed **v1.Pod collection invalid: no KAS pods found yet

HostedCluster conditions stuck at:
  KubeAPIServerAvailable=False: NotFound(Kube APIServer deployment not found)
  EtcdAvailable=False: StatefulSetNotFound
  InfrastructureReady=False: WaitingOnInfrastructureReady(Cluster infrastructure is still provisioning)

Summary

The test creates a platform: None (infraless) HostedCluster on an AKS management cluster running in an ARO-HCP environment (MANAGED_SERVICE=ARO-HCP). The control-plane-operator (CPO) gets stuck in an infinite loop reporting "Infrastructure is not yet ready" for 10+ minutes, never deploying etcd or kube-apiserver. The root cause is a pre-existing bug in the CPO's infra.go shared ingress condition (line 530) that the PR fixes — but the fix only applies to the HyperShift operator binary, not the CPO binary, which is sourced from the OCP release image and therefore runs the old (unfixed) code.

Root Cause

The failure is caused by a bug in the shared ingress condition order in the control-plane-operator's infra.go, combined with the fact that the PR's CPO fix is not deployed because the CPO runs from the OCP release image, not from the PR build.

The bug (old code at infra.go:530):

if sharedingress.UseSharedIngress() || (hcp.Spec.Platform.Type == hyperv1.IBMCloudPlatform && serviceStrategy.Type == hyperv1.Route) {
    return sharedingress.KasRouteHostname(hcp), sharedingress.ExternalDNSLBPort, "", nil
}

On the AKS management cluster, MANAGED_SERVICE=ARO-HCP is set, so UseSharedIngress() returns true. This causes the function to enter the shared ingress path for the KAS service even though the strategy is LoadBalancer, not Route. Since there is no KAS Route (it uses LoadBalancer), KasRouteHostname() returns an empty string. This means APIHost stays empty, causing InfrastructureStatus.IsReady() to return false permanently — blocking all control plane component deployment (etcd, KAS, etc.).

The PR's fix (corrected condition):

if serviceStrategy.Type == hyperv1.Route && (sharedingress.UseSharedIngress() || hcp.Spec.Platform.Type == hyperv1.IBMCloudPlatform) {
    return sharedingress.KasRouteHostname(hcp), sharedingress.ExternalDNSLBPort, "", nil
}

This correctly checks the strategy type FIRST, only entering shared ingress when the strategy is Route.

Why the fix doesn't take effect: The HyperShift CI builds the HyperShift operator and test binary from the PR, but the control-plane-operator (CPO) is deployed from the OCP release image specified in the HostedCluster's release.image field. The PR changes to kas/service.go, infra/infra.go, and support/netutil/route.go all live in the CPO binary — which is NOT built from the PR. Therefore, the old buggy code runs, and infrastructure readiness is permanently blocked.

Evidence from CPO logs confirming old code: The CPO logs show "Infrastructure is not yet ready" without the "infraStatus" field that the PR adds, confirming the unpatched CPO binary is running.

Recommendations
  1. The infra.go shared ingress condition fix must be present in the OCP release image's CPO — it needs to be backported/merged into the OCP release payload that the test uses, OR the test must use a release image that already contains the fix. Since this is a chicken-and-egg problem (the PR fixes CPO code but the CPO comes from the release), consider:

  2. Override the CPO image in the test — Set --control-plane-operator-image to use the CI-built image from the PR instead of the release image's CPO. This would ensure the PR's fixes to infra.go, kas/service.go, and netutil/route.go are actually active.

  3. Alternatively, separate the CPO fix from the test change — Land the infra.go condition order fix as an independent PR first (it's a correctness bug affecting any ARO-HCP cluster using KAS LoadBalancer strategy). Once it's in the release payload, the test change in this PR will work.

  4. Short-term workaround — Change the test to use APIServer Route strategy (instead of LoadBalancer) so the shared ingress condition doesn't misfire. This avoids the need for the CPO fix but changes the test semantics.

Evidence
Evidence Detail
CPO log pattern All 239 lines show repeating "Infrastructure is not yet ready" → "Waiting for infrastructure to be ready before proceeding" from 13:56:39 to 14:06:24 (10 min)
Missing infraStatus field CPO logs lack the "infraStatus" diagnostic field added by the PR, confirming old CPO code runs
MANAGED_SERVICE env CPO pod has MANAGED_SERVICE=ARO-HCP, making UseSharedIngress() return true
Buggy condition (infra.go:530) sharedingress.UseSharedIngress() || (IBMCloud && Route) — enters shared ingress path for ANY strategy when UseSharedIngress is true
KAS service strategy HostedCluster spec: service: APIServer, type: LoadBalancer — no Route, so KasRouteHostname() returns empty
KAS LB provisioned Service kube-apiserver has status.loadBalancer.ingress[0].ip: 57.165.204.32 — LB is ready but never used
Routes have hostnames konnectivity: konnectivity.hypershift.azure.devcluster.openshift.com, oauth: oauthserver.hypershift.azure.devcluster.openshift.com — both Admitted
HCP conditions KubeAPIServerAvailable=False: NotFound, EtcdAvailable=False: StatefulSetNotFound — no CP components deployed
Platform type platform: type: None (infraless) with controllerAvailabilityPolicy: HighlyAvailable
Test duration TestCreateClusterHABreakGlassCredentials/Main ran for exactly 600.00s (10 min timeout)

@sdminonne sdminonne force-pushed the CNTRLPLANE-3430/infraless-ha-break-glass-creds branch from 2f43cbe to 618a7ba Compare June 4, 2026 08:50
… < 4.23

- Set explicit hostnames on Route-type services in BeforeApply so the CPO
  can resolve infrastructure readiness on non-OpenShift management clusters
  (e.g. AKS) without a default ingress domain.
- Skip the test on release versions < 4.23 via AtLeast(Version423) because
  older CPO versions lack the LoadBalancer.Hostname bypass fix.
- Add early KAS pod scheduling failure detection (10-minute timeout) to
  fail fast under resource pressure.
- Poll SSR in validateCertificateAuth instead of failing immediately on
  Unauthorized, allowing time for the control-plane-pki-operator to
  reconcile the break-glass signer CA into the client-ca trust bundle.
- Increase WaitForGuestClient timeout to 15 minutes to accommodate cloud
  LB provisioning time on Azure/AWS.

Signed-off-by: Salvatore Dario Minonne <sminonne@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sdminonne sdminonne force-pushed the CNTRLPLANE-3430/infraless-ha-break-glass-creds branch from 618a7ba to c08ad8d Compare June 4, 2026 09:57
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 4, 2026

@sdminonne: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aks-4-22 2f43cbe link true /test e2e-aks-4-22

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@sdminonne
Copy link
Copy Markdown
Contributor Author

/retest

1 similar comment
@sdminonne
Copy link
Copy Markdown
Contributor Author

/retest

@sdminonne
Copy link
Copy Markdown
Contributor Author

/pipeline-required

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/testing Indicates the PR includes changes for e2e testing jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants