CNTRLPLANE-3371: Fix AllowedCIDRs e2e test for Route-based KAS #8469
bryan-cox wants to merge 1 commit into main
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Skipping CI for Draft Pull Request. |
|
@bryan-cox: This pull request references CNTRLPLANE-3371 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉
Recent review info: Configuration used: Repository YAML (base), Central YAML (inherited). Review profile: CHILL. Plan: Enterprise. Files selected for processing (2)
📝 Walkthrough
The test utility ValidateKubeAPIServerAllowedCIDRs now passes the guest cluster REST config into ensureAPIServerAllowedCIDRs. The helper waits for the control plane to reconcile HostedCluster.Spec.Networking.APIServer.AllowedCIDRBlocks into the downstream Service.spec.LoadBalancerSourceRanges (target Service chosen by publishing strategy and cloud-specific rules). Once reconciled, reachability is polled by creating a fresh guest kubeclient per attempt (copying rest.Config with a custom Dial) and calling ServerVersion() to validate network restrictions.

Sequence Diagram(s)
sequenceDiagram
participant Test as Test Harness
participant CP as Control-Plane Reconciler
participant LB as Downstream Service/LoadBalancer
participant GuestAPI as Guest kube-apiserver
Test->>CP: Set AllowedCIDRBlocks on HostedCluster spec
Note right of CP: Reconciler updates target Service based on publishing strategy/cloud
CP->>LB: Update spec.LoadBalancerSourceRanges
loop Poll for reconciliation
Test->>LB: GET Service.spec.LoadBalancerSourceRanges
alt ranges match expected
Note right of Test: perform reachability checks
loop Reachability attempts
Test->>GuestAPI: Create fresh kubeclient (copied rest.Config + custom Dial) and call ServerVersion()
GuestAPI-->>Test: respond (reachable / unreachable)
end
else not yet reconciled
Test-->>Test: wait and retry
end
end
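To make the reconciliation-wait step in the diagram concrete, here is a minimal Go sketch of that check. It is illustrative only: the function name, the controller-runtime client, and the timeouts are assumptions, not the merged test code.

```go
// waitForAllowedCIDRsOnService is a hypothetical helper: it polls the management cluster
// until the control plane has copied the HostedCluster's AllowedCIDRBlocks into the target
// Service's spec.loadBalancerSourceRanges. Imports assume Gomega, k8s.io/api/core/v1 (corev1),
// and sigs.k8s.io/controller-runtime/pkg/client (crclient).
func waitForAllowedCIDRsOnService(ctx context.Context, g Gomega, c crclient.Client, key crclient.ObjectKey, expected []string) {
	g.Eventually(func(g Gomega) {
		svc := &corev1.Service{}
		g.Expect(c.Get(ctx, key, svc)).To(Succeed())
		// Reachability checks are only meaningful once the source ranges have converged.
		g.Expect(svc.Spec.LoadBalancerSourceRanges).To(ConsistOf(expected))
	}, 10*time.Minute, 15*time.Second).Should(Succeed())
}
```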
Suggested reviewers
🚥 Pre-merge checks | ✅ 11 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. |
Codecov Report: ✅ All modified and coverable lines are covered by tests. Additional details and impacted files:
@@ Coverage Diff @@
## main #8469 +/- ##
==========================================
+ Coverage 37.49% 40.00% +2.50%
==========================================
Files 751 751
Lines 91984 92838 +854
==========================================
+ Hits 34487 37137 +2650
+ Misses 54854 53014 -1840
- Partials 2643 2687 +44
see 57 files with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more.
|
|
/pipeline required |
Force-pushed 51d7116 to 6b609b0
|
/pipeline required |
|
Scheduling tests matching the |
Test Results
e2e-aws
e2e-aks
Failed Tests: Total failed tests: 3
|
|
/retest |
|
/test e2e-aws |
|
/test e2e-aks-4-22 |
AI Test Failure Analysis Job: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
Force-pushed 6b609b0 to 29672c9
|
/test e2e-aks |
|
/test e2e-aws |
AI Test Failure Analysis Job: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
/test e2e-aws |
|
I have all the evidence needed. Here is the complete analysis: Test Failure Analysis Complete
Job Information
Test Failure Analysis
Error
Summary: This is a CI infrastructure failure, not a test or code failure. The ci-operator pod for the
Root Cause: The CI pod could not be scheduled on the
The 2 nodes that were actually eligible for this pod type did not have enough memory to schedule it. Preemption was also not possible; the scheduler found no viable preemption victims on the memory-constrained nodes. The pod waited for 30 minutes (the default Prow scheduling timeout) before being terminated. This is a transient cluster capacity issue on
Recommendations
Evidence
|
|
/test security |
|
/test e2e-aws |
|
/auto-cc |
cblecker left a comment
The root cause analysis here is solid — the HTTP/2 connection reuse explanation is clear and the fix (fresh client per poll + waiting for the downstream service to reflect the updated source ranges before testing reachability) is the right approach rather than just a timing band-aid. The allowedCIDRsTargetService helper is a nice encapsulation of the CPO service selection logic.
A few comments inline, the most notable being a potential issue with the ARO HCP guard in the Route case.
}
switch strategy.Type {
case hyperv1.Route:
	if azureutil.IsAroHCP() && !netutil.IsPrivateHC(hc) {
The compound condition here doesn't quite match how the CPO makes this decision. The CPO uses IsAroHCP() as a standalone check when handling router services (infra.go:459 — deletes RouterPublicService for ARO HCP unconditionally). ARO HCP never has a public router LB service with LoadBalancerSourceRanges, since Swift handles connectivity.
The issue is with PublicAndPrivate topology: IsPublicHC returns true (so the top guard passes), but IsPrivateHC also returns true (via the topology check), making !IsPrivateHC false. The guard doesn't fire and we return RouterPublicService — a service the CPO actively deletes for ARO HCP. The downstream Eventually would then time out waiting for LoadBalancerSourceRanges on a service that doesn't exist.
Simplifying to just azureutil.IsAroHCP() matches the CPO's logic:
if azureutil.IsAroHCP() {
return nil
}
Done. Simplified to just azureutil.IsAroHCP() — good catch on the PublicAndPrivate topology case where both IsPublicHC and IsPrivateHC return true.
AI-assisted response via Claude Code
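For readers following along without the diff, the Route branch after this simplification reads roughly like the sketch below. The helper name, parameter, and the "router" Service lookup are assumptions drawn from the discussion and the PR description ("Route → router"), not the merged code.

```go
// Illustrative only: the Route branch of the target-service selection after dropping the
// IsPrivateHC condition. ARO HCP never exposes a public router LB that carries
// LoadBalancerSourceRanges (Swift handles connectivity), so there is nothing to wait on there.
func routeTargetService(controlPlaneNamespace string) *corev1.Service {
	if azureutil.IsAroHCP() {
		return nil
	}
	return &corev1.Service{ObjectMeta: metav1.ObjectMeta{
		Namespace: controlPlaneNamespace,
		Name:      "router",
	}}
}
```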
// allowedCIDRsTargetService returns the LoadBalancer service that enforces AllowedCIDRBlocks
// based on the HostedCluster's APIServer publishing strategy. Returns nil when no LB service
// carries source ranges (private clusters, NodePort, ARO HCP).
// Mirrors service selection in CPO: infra.go:reconcileAPIServerService, kas/service.go:ReconcileService.
This citation is incomplete — it only covers the LoadBalancer path. For the Route case (which is the main path this PR is fixing), the relevant CPO code is infra.go:reconcileHCPRouterServices → ingress/router.go:ReconcileRouterService. The cited kas/service.go:ReconcileService sets LoadBalancerSourceRanges only in the LoadBalancer case, not for Route.
These file-level references are also going to get stale as the CPO migrates to the v2 component framework. Something like this would age better:
// Mirrors CPO's API server and router service reconciliation logic.
Done. Simplified to a generic reference that won't go stale with the v2 component migration.
AI-assisted response via Claude Code
Force-pushed f0d3966 to 87a1a7e
|
/retest |
Force-pushed 87a1a7e to b0ea145
|
/retest |
|
/test e2e-aws |
|
/retest |
cblecker left a comment
Second round overall looks good — all the previous feedback was addressed. A few new things came up on closer look, one of which is a real bug in the HTTP/2 fix.
// subsequent requests reuse that connection and bypass the restriction.
g.Eventually(func(g Gomega) {
	_, err = guestClient.ServerVersion()
	freshClient, err := kubeclient.NewForConfig(rest.CopyConfig(guestConfig))
Unfortunately rest.CopyConfig doesn't actually give you a new HTTP transport here. I traced through the client-go source: kubeclient.NewForConfig → rest.HTTPClientFor → transport.New(), which calls tlsCache.get() when config.Transport is nil. The cache key (tlsCacheKey in transport/cache.go) is built from the TLS data values — string(c.TLS.CAData), string(c.TLS.CertData), string(c.TLS.KeyData), etc. — not pointers. CopyConfig copies the same byte content, and since Dial is nil, TransportConfig() leaves DialHolder nil too. Both the original config and the copy produce identical cache keys, so the cache returns the same *http.Transport instance with its existing HTTP/2 connection pool.
The simplest fix is to set Dial on the copied config before creating the client. TransportConfig() wraps a non-nil Dial in a new &DialHolder{} each time, making the pointer unique and busting the cache:
cfg := rest.CopyConfig(guestConfig)
cfg.Dial = (&net.Dialer{Timeout: 30 * time.Second, KeepAlive: 30 * time.Second}).DialContext
freshClient, err := kubeclient.NewForConfig(cfg)
Done. Set cfg.Dial to create a unique *transport.DialHolder pointer per iteration, busting the TLS transport cache.
AI-assisted response via Claude Code
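Putting this thread together, the per-poll construction ends up looking roughly like the sketch below, assumed to run inside the Eventually callback with guestConfig being the guest *rest.Config; the dialer values are illustrative and only the non-nil Dial matters for the cache bust.

```go
// A non-nil Dial gets wrapped in a fresh transport.DialHolder, so client-go's TLS
// transport cache cannot hand back the previously cached *http.Transport and its
// pooled HTTP/2 connection for this copy of the config.
cfg := rest.CopyConfig(guestConfig)
cfg.Dial = (&net.Dialer{Timeout: 30 * time.Second, KeepAlive: 30 * time.Second}).DialContext
freshClient, err := kubeclient.NewForConfig(cfg)
g.Expect(err).NotTo(HaveOccurred())
// On a brand-new connection, a request from outside the allowed CIDRs should now fail.
_, err = freshClient.ServerVersion()
g.Expect(err).To(HaveOccurred())
```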
|
|
|
// Create a fresh kubeclient per poll to avoid HTTP/2 connection reuse. Go's HTTP/2
// transport multiplexes requests over a single persistent TCP connection. If a prior
// successful request established a connection before NSG rules took effect, all
nit: "NSG rules" is Azure-specific — this function runs on AWS (security groups) and GCP (firewall rules) too. Something like "network restrictions" or "load balancer source-range enforcement" would be accurate across platforms.
Done. Replaced "NSG rules" with "load balancer source-range restrictions".
AI-assisted response via Claude Code
	},
},
wantNil: true,
},
These two test cases (NodePort and no-strategy) don't actually reach the branch they claim to test. Both create an AWSPlatform HostedCluster without setting Platform.AWS, so IsPublicHC evaluates ptr.Deref(nil, AWSPlatformSpec{}).EndpointAccess == "" — which matches neither Public nor PublicAndPrivate — and returns false. The function exits at the !IsPublicHC(hc) guard before the strategy switch is ever reached.
Using the publicHC helper fixes this (it correctly sets EndpointAccess: hyperv1.Public for AWS):
{
name: "When NodePort strategy it should return nil",
hc: publicHC(hyperv1.AWSPlatform, hyperv1.NodePort),
wantNil: true,
},
{
name: "When no APIServer strategy it should return nil",
hc: func() *hyperv1.HostedCluster {
hc := publicHC(hyperv1.AWSPlatform, hyperv1.Route)
hc.Spec.Services = nil
return hc
}(),
wantNil: true,
},
The tests still return wantNil: true either way, but a bug in the default switch case or the nil-strategy guard wouldn't be caught as-is.
There was a problem hiding this comment.
Done. NodePort now uses publicHC(hyperv1.AWSPlatform, hyperv1.NodePort) so it passes IsPublicHC and exercises the switch default. No-strategy case uses publicHC with hc.Spec.Services = nil so it reaches the strategy == nil guard.
AI-assisted response via Claude Code
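For context, the publicHC helper referenced above presumably looks something like the following. This is a hypothetical reconstruction inferred from its call sites, not the actual helper in the repo.

```go
// Hypothetical reconstruction of the publicHC test helper (signature inferred from how it
// is called above); builds a public HostedCluster with the given APIServer publishing strategy.
func publicHC(platform hyperv1.PlatformType, strategy hyperv1.PublishingStrategyType) *hyperv1.HostedCluster {
	hc := &hyperv1.HostedCluster{}
	hc.Spec.Platform.Type = platform
	if platform == hyperv1.AWSPlatform {
		// Public endpoint access is what lets IsPublicHC pass so the strategy switch is reached.
		hc.Spec.Platform.AWS = &hyperv1.AWSPlatformSpec{EndpointAccess: hyperv1.Public}
	}
	hc.Spec.Services = []hyperv1.ServicePublishingStrategyMapping{{
		Service:                   hyperv1.APIServer,
		ServicePublishingStrategy: hyperv1.ServicePublishingStrategy{Type: strategy},
	}}
	return hc
}
```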
|
|
|
for _, tc := range tests {
	t.Run(tc.name, func(t *testing.T) {
		g := NewWithT(t)
nit: azureutil.SetAsAroHCPTest(t) already does exactly this — might as well use the helper for consistency with how the rest of the codebase sets up ARO HCP test environments.
Done. Switched to azureutil.SetAsAroHCPTest(t).
AI-assisted response via Claude Code
The ValidateKubeAPIServerAllowedCIDRs test fails on v2 Azure self-managed clusters because KAS uses Route publishing strategy (via external-dns-domain), not LoadBalancer. Two fixes:

1. Wait for the downstream LB service (router or KAS LB) to have its LoadBalancerSourceRanges updated by the CPO before asserting KAS reachability. The target service is determined by the HC's APIServer publishing strategy.
2. Create a fresh kubeclient per poll iteration to prevent HTTP/2 connection reuse. Go's HTTP/2 transport multiplexes all requests over a single persistent TCP connection; if a prior request succeeded before Azure NSG rules took effect, subsequent requests bypass the restriction on the same connection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed b0ea145 to 43d818b
|
Scheduling tests matching the |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bryan-cox, cblecker. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@coderabbitai resume |
✅ Actions performedReviews resumed. |
AI Test Failure Analysis Job: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
AI Test Failure Analysis Job: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
/retest |
AI Test Failure Analysis Job: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
/retest |
AI Test Failure Analysis Job: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
@bryan-cox: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
cc @muraee |
What
Fixes the ValidateKubeAPIServerAllowedCIDRs e2e test so it passes on v2 Azure self-managed clusters where KAS uses Route publishing strategy (via --external-dns-domain).

Why
The test was skipped in v2 CI (--ginkgo.skip="KAS allowed CIDRs") because it always failed. Both v1 and v2 Azure self-managed use Route strategy for KAS, but v1 passes while v2 fails due to a difference in cluster lifecycle timing combined with HTTP/2 connection reuse.

Root cause: HTTP/2 connection reuse
The test reuses a single kubeclient.Clientset across all ServerVersion() poll iterations. Go's HTTP/2 transport multiplexes all requests over a single persistent TCP connection. If the first poll succeeds before Azure NSG rules take effect, all subsequent polls reuse that connection and never observe the expected failure.

Why v1 passes but v2 fails: In v1, the cluster is created fresh inside TestCreateCluster, so the CPO is in its initial reconciliation burst: the router service's LoadBalancerSourceRanges and corresponding Azure NSG rules are updated before the first ServerVersion() call. In v2, the cluster is pre-created and shared across tests, so the CPO is in steady state with longer reconciliation intervals. The first ServerVersion() call succeeds before the NSG rules catch up, and HTTP/2 holds that connection open for all subsequent polls.

Additional fix: missing downstream service wait
The test waits for AllowedCIDRBlocks to propagate from the HostedCluster to the HostedControlPlane, but does not wait for the CPO to reconcile the downstream LoadBalancer service's LoadBalancerSourceRanges. This is a race condition that exists in both v1 and v2; v1 just happens to win the race due to the CPO being in active reconciliation. Adding an explicit wait makes the test correct rather than relying on timing.

Changes
test/e2e/util/util.go (single file, three changes):
- ensureAPIServerAllowedCIDRs signature: *kubeclient.Clientset → *rest.Config, to enable fresh client creation per poll.
- Each ServerVersion() iteration creates a new client via kubeclient.NewForConfig(rest.CopyConfig(guestConfig)), preventing HTTP/2 connection reuse.
- New allowedCIDRsTargetService() helper determines the correct LB service based on APIServer publishing strategy (Route → router, LoadBalancer → platform-specific KAS LB). An Eventually block waits for the service's LoadBalancerSourceRanges to match before checking KAS reachability.

Test Plan
- go build -tags e2e ./test/e2e/... compiles
- go build -tags e2ev2 ./test/e2e/v2/... compiles
- go vet -tags e2e ./test/e2e/... passes

🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests