CNTRLPLANE-3430: Make HA break-glass credentials test infraless#8546
CNTRLPLANE-3430: Make HA break-glass credentials test infraless#8546sdminonne wants to merge 1 commit into
Conversation
|
@sdminonne: This pull request references CNTRLPLANE-3430 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
/test e2e-aks |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds support function netutil.RouteHost that chooses the best hostname from a Route (Spec.Host, then Status.Ingress[0].Host, then RouterCanonicalHostname). Reconciler code in OAuth and KAS now uses this helper to populate service status (setting host and port or returning an error when no host is found). Unit tests for RouteHost and additional route-resolution cases were added. A single e2e test was updated to pass hyperv1.NonePlatform and to wait for hosted control plane bootstrap before running break-glass checks. Sequence Diagram(s)sequenceDiagram
participant Reconciler as ReconcileServiceStatus / ReconcileKonnectivityServerServiceStatus
participant Route as routev1.Route
participant RouteHost as netutil.RouteHost
participant ServiceStatus as ServiceStatusUpdater
Reconciler->>Route: read route resource
Reconciler->>RouteHost: call RouteHost(route)
RouteHost-->>Reconciler: host (or "")
alt host != ""
Reconciler->>ServiceStatus: set host and port (443 / RouteExternalPort)
else host == ""
Reconciler-->>ServiceStatus: set error message and return
end
Suggested reviewers
🚥 Pre-merge checks | ✅ 10 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #8546 +/- ##
==========================================
+ Coverage 40.68% 41.44% +0.75%
==========================================
Files 755 756 +1
Lines 93368 93644 +276
==========================================
+ Hits 37985 38807 +822
+ Misses 52649 52114 -535
+ Partials 2734 2723 -11 see 56 files with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
Test Resultse2e-aws
e2e-aks
Failed TestsTotal failed tests: 15
... and 10 more failed tests |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@control-plane-operator/controllers/hostedcontrolplane/oauth/service.go`:
- Around line 95-98: The code currently calls netutil.RouteHost(route) and then
dereferences route.ObjectMeta.CreationTimestamp.Time which can panic if route is
nil; add an explicit nil check for route before calling netutil.RouteHost (e.g.,
if route == nil set message to a clear "OAuth service route is missing" or
similar including the duration since creation and return), otherwise proceed to
compute routeHost and use route.ObjectMeta.CreationTimestamp.Time; ensure you
update the branch that currently handles empty routeHost to avoid dereferencing
route when route is nil.
In `@support/netutil/route_test.go`:
- Around line 472-559: Add a test case in TestRouteHost that asserts RouteHost
prefers a non-empty host from a later ingress when the first ingress is empty:
create a routev1.Route with Status.Ingress containing two entries where
Ingress[0] is empty and Ingress[1].Host (or RouterCanonicalHostname) is set, and
expect RouteHost(route) to return that second ingress host; ensure the new case
references RouteHost and the routev1.Route.Status.Ingress slice so it guards
multi-ingress fallback behavior.
In `@support/netutil/route.go`:
- Around line 163-169: The current logic only inspects route.Status.Ingress[0]
and returns its Host or RouterCanonicalHostname, which misses later ingress
entries; update the code that references route.Status.Ingress[0] to iterate over
all entries in route.Status.Ingress, checking each entry's Host first and then
RouterCanonicalHostname, and return the first non-empty value found (falling
back to empty/no-host if none exist); ensure you reference the same symbols
(route.Status.Ingress, Host, RouterCanonicalHostname) so the loop replaces the
existing index-0 checks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 815d35bd-ddc0-4a15-b370-ec61f7400400
📒 Files selected for processing (5)
control-plane-operator/controllers/hostedcontrolplane/kas/service.gocontrol-plane-operator/controllers/hostedcontrolplane/oauth/service.gocontrol-plane-operator/controllers/hostedcontrolplane/oauth/service_test.gosupport/netutil/route.gosupport/netutil/route_test.go
| func TestRouteHost(t *testing.T) { | ||
| testCases := []struct { | ||
| name string | ||
| route *routev1.Route | ||
| expectedHost string | ||
| }{ | ||
| { | ||
| name: "When route is nil, it should return empty string", | ||
| route: nil, | ||
| expectedHost: "", | ||
| }, | ||
| { | ||
| name: "When Spec.Host is set, it should return Spec.Host", | ||
| route: &routev1.Route{ | ||
| Spec: routev1.RouteSpec{ | ||
| Host: "oauth.example.com", | ||
| }, | ||
| }, | ||
| expectedHost: "oauth.example.com", | ||
| }, | ||
| { | ||
| name: "When Spec.Host is set and Status.Ingress is also populated, it should prefer Spec.Host", | ||
| route: &routev1.Route{ | ||
| Spec: routev1.RouteSpec{ | ||
| Host: "oauth.example.com", | ||
| }, | ||
| Status: routev1.RouteStatus{ | ||
| Ingress: []routev1.RouteIngress{ | ||
| { | ||
| Host: "ingress.example.com", | ||
| RouterCanonicalHostname: "canonical.example.com", | ||
| }, | ||
| }, | ||
| }, | ||
| }, | ||
| expectedHost: "oauth.example.com", | ||
| }, | ||
| { | ||
| name: "When Spec.Host is empty and Status.Ingress[0].Host is set, it should return Ingress Host", | ||
| route: &routev1.Route{ | ||
| Status: routev1.RouteStatus{ | ||
| Ingress: []routev1.RouteIngress{ | ||
| { | ||
| Host: "ingress.example.com", | ||
| }, | ||
| }, | ||
| }, | ||
| }, | ||
| expectedHost: "ingress.example.com", | ||
| }, | ||
| { | ||
| name: "When Spec.Host and Ingress Host are empty but RouterCanonicalHostname is set, it should return RouterCanonicalHostname", | ||
| route: &routev1.Route{ | ||
| Status: routev1.RouteStatus{ | ||
| Ingress: []routev1.RouteIngress{ | ||
| { | ||
| RouterCanonicalHostname: "router-canonical.example.com", | ||
| }, | ||
| }, | ||
| }, | ||
| }, | ||
| expectedHost: "router-canonical.example.com", | ||
| }, | ||
| { | ||
| name: "When all host fields are empty, it should return empty string", | ||
| route: &routev1.Route{ | ||
| Status: routev1.RouteStatus{ | ||
| Ingress: []routev1.RouteIngress{ | ||
| {}, | ||
| }, | ||
| }, | ||
| }, | ||
| expectedHost: "", | ||
| }, | ||
| { | ||
| name: "When Status.Ingress slice is empty, it should return empty string", | ||
| route: &routev1.Route{}, | ||
| expectedHost: "", | ||
| }, | ||
| } | ||
|
|
||
| for _, tc := range testCases { | ||
| t.Run(tc.name, func(t *testing.T) { | ||
| g := NewWithT(t) | ||
| g.Expect(RouteHost(tc.route)).To(Equal(tc.expectedHost)) | ||
| }) | ||
| } | ||
| } |
There was a problem hiding this comment.
Add a multi-ingress test case to lock expected fallback behavior.
TestRouteHost should include a case where Status.Ingress[0] is empty and Status.Ingress[1] has Host (or RouterCanonicalHostname). This guards against regressions in real multi-router status layouts.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@support/netutil/route_test.go` around lines 472 - 559, Add a test case in
TestRouteHost that asserts RouteHost prefers a non-empty host from a later
ingress when the first ingress is empty: create a routev1.Route with
Status.Ingress containing two entries where Ingress[0] is empty and
Ingress[1].Host (or RouterCanonicalHostname) is set, and expect RouteHost(route)
to return that second ingress host; ensure the new case references RouteHost and
the routev1.Route.Status.Ingress slice so it guards multi-ingress fallback
behavior.
There was a problem hiding this comment.
This follows from the multi-ingress comment — since a route in HyperShift is admitted by exactly one router, there is no real scenario where Ingress[0] is empty but Ingress[1] has the host. Adding a test for that would codify fallback behavior we intentionally do not support and do not want to guarantee.
There was a problem hiding this comment.
@sdminonne, that's a clear and valid constraint — since HyperShift routes are admitted by exactly one router, Ingress[0] being empty while Ingress[1] carries the host is not a real scenario, and locking in fallback behavior that isn't intended would be misleading. Happy to drop this suggestion.
✏️ Learnings added
Learnt from: sdminonne
Repo: openshift/hypershift PR: 8546
File: support/netutil/route_test.go:472-559
Timestamp: 2026-05-19T18:45:39.602Z
Learning: In HyperShift, a Route is admitted by exactly one router, so `Status.Ingress` will never have a meaningful host in a later index while `Ingress[0]` is empty. Multi-ingress fallback behavior in `RouteHost` (support/netutil/route.go) is intentionally not supported and should not be tested or codified.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
| if len(route.Status.Ingress) > 0 { | ||
| if route.Status.Ingress[0].Host != "" { | ||
| return route.Status.Ingress[0].Host | ||
| } | ||
| if route.Status.Ingress[0].RouterCanonicalHostname != "" { | ||
| return route.Status.Ingress[0].RouterCanonicalHostname | ||
| } |
There was a problem hiding this comment.
Handle multi-ingress routes instead of only index 0.
Line 163 assumes Status.Ingress[0] is the only useful entry. Routes can have multiple ingress records, so this may return empty even when a later entry has a usable host, which can stall readiness.
🔧 Suggested fix
func RouteHost(route *routev1.Route) string {
if route == nil {
return ""
}
if route.Spec.Host != "" {
return route.Spec.Host
}
- if len(route.Status.Ingress) > 0 {
- if route.Status.Ingress[0].Host != "" {
- return route.Status.Ingress[0].Host
- }
- if route.Status.Ingress[0].RouterCanonicalHostname != "" {
- return route.Status.Ingress[0].RouterCanonicalHostname
- }
- }
- return ""
+ var canonicalFallback string
+ for _, ingress := range route.Status.Ingress {
+ if ingress.Host != "" {
+ return ingress.Host
+ }
+ if canonicalFallback == "" && ingress.RouterCanonicalHostname != "" {
+ canonicalFallback = ingress.RouterCanonicalHostname
+ }
+ }
+ return canonicalFallback
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if len(route.Status.Ingress) > 0 { | |
| if route.Status.Ingress[0].Host != "" { | |
| return route.Status.Ingress[0].Host | |
| } | |
| if route.Status.Ingress[0].RouterCanonicalHostname != "" { | |
| return route.Status.Ingress[0].RouterCanonicalHostname | |
| } | |
| func RouteHost(route *routev1.Route) string { | |
| if route == nil { | |
| return "" | |
| } | |
| if route.Spec.Host != "" { | |
| return route.Spec.Host | |
| } | |
| var canonicalFallback string | |
| for _, ingress := range route.Status.Ingress { | |
| if ingress.Host != "" { | |
| return ingress.Host | |
| } | |
| if canonicalFallback == "" && ingress.RouterCanonicalHostname != "" { | |
| canonicalFallback = ingress.RouterCanonicalHostname | |
| } | |
| } | |
| return canonicalFallback | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@support/netutil/route.go` around lines 163 - 169, The current logic only
inspects route.Status.Ingress[0] and returns its Host or
RouterCanonicalHostname, which misses later ingress entries; update the code
that references route.Status.Ingress[0] to iterate over all entries in
route.Status.Ingress, checking each entry's Host first and then
RouterCanonicalHostname, and return the first non-empty value found (falling
back to empty/no-host if none exist); ensure you reference the same symbols
(route.Status.Ingress, Host, RouterCanonicalHostname) so the loop replaces the
existing index-0 checks.
There was a problem hiding this comment.
In HyperShift, a route is admitted by exactly one router — either the OpenShift default router or the HyperShift shared ingress controller. There is no scenario where Ingress[0] is empty but Ingress[1] contains the host. Multiple ingress entries represent different routers admitting the same route, and the first entry is always the relevant one for the controller that owns the route.
Iterating all entries would add complexity for a scenario that does not exist in this codebase and could mask bugs by silently falling through to an unrelated router's entry.
There was a problem hiding this comment.
@sdminonne, thank you for the clear explanation of HyperShift's routing model. That's a strong argument — since a route in HyperShift is always admitted by exactly one router, Status.Ingress[0] is definitively the correct (and only) entry to inspect. Iterating further would indeed add unnecessary complexity and could silently mask bugs by falling through to an unrelated router's entry. I'll withdraw the suggestion.
✏️ Learnings added
Learnt from: sdminonne
Repo: openshift/hypershift PR: 8546
File: support/netutil/route.go:163-169
Timestamp: 2026-05-19T18:45:30.452Z
Learning: In HyperShift, a Route is admitted by exactly one router — either the OpenShift default router or the HyperShift shared ingress controller. There is no scenario where `Status.Ingress[0]` is empty but a later entry (e.g. `Status.Ingress[1]`) contains the host. Multiple ingress entries would represent different routers admitting the same route, but that does not occur in HyperShift. Therefore, reading only `Status.Ingress[0]` (as in `netutil.RouteHost` in `support/netutil/route.go`) is intentional and correct; iterating over all entries is unnecessary and could mask bugs.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
|
/retest |
|
/test e2e-aks |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
/test e2e-aks |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
/test e2e-aks |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
/test e2e-aks |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
97d69f9 to
1f3e9bc
Compare
1f3e9bc to
5a6f1f4
Compare
|
@sdminonne: This pull request references CNTRLPLANE-3430 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/hold cancel |
|
/lgtm |
|
Scheduling tests matching the |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
78d403d to
2f43cbe
Compare
|
New changes are detected. LGTM label has been removed. |
|
/test e2e-aks |
|
/test e2e-aws |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
/test e2e-aks |
|
/pipeline-required |
|
/test e2e-aws |
|
/test e2e-aws-4-22 |
|
/test e2e-aws |
|
/test e2e-aks-4-22 |
|
/test e2e-aws-upgrade-hypershift-operator |
|
Now the bug is crystal clear. In the old code (line 530 of if sharedingress.UseSharedIngress() || (hcp.Spec.Platform.Type == hyperv1.IBMCloudPlatform && serviceStrategy.Type == hyperv1.Route) {
return sharedingress.KasRouteHostname(hcp), sharedingress.ExternalDNSLBPort, "", nil
}Since The PR fixes this with: if serviceStrategy.Type == hyperv1.Route && (sharedingress.UseSharedIngress() || hcp.Spec.Platform.Type == hyperv1.IBMCloudPlatform) {This correctly checks the strategy type FIRST — only entering the shared ingress path if the strategy is Route. But this fix is in the CPO code which comes from the release image, not the PR build. Now I have the complete root cause. Let me output the report: Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThe test creates a Root CauseThe failure is caused by a bug in the shared ingress condition order in the control-plane-operator's The bug (old code at if sharedingress.UseSharedIngress() || (hcp.Spec.Platform.Type == hyperv1.IBMCloudPlatform && serviceStrategy.Type == hyperv1.Route) {
return sharedingress.KasRouteHostname(hcp), sharedingress.ExternalDNSLBPort, "", nil
}On the AKS management cluster, The PR's fix (corrected condition): if serviceStrategy.Type == hyperv1.Route && (sharedingress.UseSharedIngress() || hcp.Spec.Platform.Type == hyperv1.IBMCloudPlatform) {
return sharedingress.KasRouteHostname(hcp), sharedingress.ExternalDNSLBPort, "", nil
}This correctly checks the strategy type FIRST, only entering shared ingress when the strategy is Route. Why the fix doesn't take effect: The HyperShift CI builds the HyperShift operator and test binary from the PR, but the control-plane-operator (CPO) is deployed from the OCP release image specified in the HostedCluster's Evidence from CPO logs confirming old code: The CPO logs show Recommendations
Evidence
|
2f43cbe to
618a7ba
Compare
… < 4.23 - Set explicit hostnames on Route-type services in BeforeApply so the CPO can resolve infrastructure readiness on non-OpenShift management clusters (e.g. AKS) without a default ingress domain. - Skip the test on release versions < 4.23 via AtLeast(Version423) because older CPO versions lack the LoadBalancer.Hostname bypass fix. - Add early KAS pod scheduling failure detection (10-minute timeout) to fail fast under resource pressure. - Poll SSR in validateCertificateAuth instead of failing immediately on Unauthorized, allowing time for the control-plane-pki-operator to reconcile the break-glass signer CA into the client-ca trust bundle. - Increase WaitForGuestClient timeout to 15 minutes to accommodate cloud LB provisioning time on Azure/AWS. Signed-off-by: Salvatore Dario Minonne <sminonne@redhat.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
618a7ba to
c08ad8d
Compare
|
@sdminonne: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/retest |
1 similar comment
|
/retest |
|
/pipeline-required |
Summary
TestCreateClusterHABreakGlassCredentialswork on non-OpenShift management clusters (e.g. AKS) by setting explicit hostnames on Route-type services inBeforeApply, so the CPO can resolve infrastructure readiness without a default ingress domain.WaitForGuestClienttimeout.e2eutil.AtLeast(t, e2eutil.Version423)because older CPO versions lack theLoadBalancer.Hostnamebypass fix (merged in PR OCPBUGS-87028: Fix KAS ReconcileServiceStatus to check LoadBalancer.Hostname before LB provisioning #8663).validateCertificateAuthinstead of failing immediately on Unauthorized, allowing time for the control-plane-pki-operator to reconcile the break-glass signer CA into the client-ca trust bundle.Depends on: #8663
Test plan
e2e-aks-4-22: test should be skipped (not failed) due toAtLeast(Version423)e2e-aks(4.23+): test should pass as beforee2e-aws: test should pass (cloud LB provisions external endpoints)🤖 Generated with Claude Code