CNTRLPLANE-3277: Add Azure OAuth LoadBalancer private topology e2e test#8584
CNTRLPLANE-3277: Add Azure OAuth LoadBalancer private topology e2e test#8584Nirshal wants to merge 1 commit into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@Nirshal: This pull request references CNTRLPLANE-3277 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
📝 WalkthroughWalkthroughAdds --oauth-publishing-strategy=LoadBalancer and the private NAT subnet arg to the Azure private cluster ExtraArgs, expands the Azure private test label selector to include self-managed-azure-oauth-lb-private, and adds an ordered E2E test (AzureOAuthLoadBalancerPrivateTest) that waits for oauth-openshift to be a LoadBalancer, asserts the Azure internal LoadBalancer annotation, and validates the OAuth token flow via the LoadBalancer endpoint. Sequence Diagram(s)sequenceDiagram
participant TestRunner
participant HostedCluster
participant AzureLoadBalancer
participant ManagementClient
TestRunner->>HostedCluster: wait for oauth-openshift Service type=LoadBalancer
HostedCluster->>AzureLoadBalancer: allocate LB and ingress endpoint
AzureLoadBalancer->>HostedCluster: annotate Service with InternalLoadBalancerAnnotation
TestRunner->>ManagementClient: perform OAuth token flow via LB endpoint
ManagementClient->>HostedCluster: validate token exchange through LB ingress
Suggested reviewers
🚥 Pre-merge checks | ✅ 10 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Nirshal The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/e2e/v2/tests/hosted_cluster_azure_test.go (1)
304-320: ⚡ Quick winMove platform/topology guard skips to
BeforeEachwith standardized skip text.In Line 304-320, the new suite uses
BeforeAlland custom skip messages for platform checks. Please switch toBeforeEachand the repository’s required message format for platform-specific skips.♻️ Suggested adjustment
- BeforeAll(func() { + BeforeEach(func() { testCtx = getTestCtx() hc = testCtx.GetHostedCluster() if hc == nil || hc.Spec.Platform.Type != hyperv1.AzurePlatform { - Skip("Azure OAuth LB Private tests are only for Azure platform") + Skip("[Azure OAuth LB Private] test is only for Azure platform") } if hc.Spec.Platform.Azure == nil || hc.Spec.Platform.Azure.Topology != hyperv1.AzureTopologyPrivate { Skip("Azure OAuth LB Private tests require Private topology") }As per coding guidelines, "
test/e2e/v2/tests/**/*_test.go: UseBeforeEachwithSkip()for platform-specific tests with message format:[Feature] test is only for [Platform] platform".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 304 - 320, Replace the platform/topology guard block currently inside BeforeAll with a BeforeEach that performs the same checks (use getTestCtx()/testCtx.GetHostedCluster(), validate hc.Spec.Platform.Type, hc.Spec.Platform.Azure.Topology, and netutil.ServicePublishingStrategyByTypeByHC(hc, hyperv1.OAuthServer)), move controlPlaneNamespace assignment into this BeforeEach, and call Skip() with the repository-standard messages: for platform use "Azure OAuth LB Private tests is only for Azure platform" and for topology use "Azure OAuth LB Private tests requires Private topology" (also keep the existing Skip for publishing strategy if present); update references to BeforeAll -> BeforeEach and keep the same symbol names (getTestCtx, testCtx.GetHostedCluster, controlPlaneNamespace, netutil.ServicePublishingStrategyByTypeByHC, hyperv1.OAuthServer).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/e2e/v2/tests/hosted_cluster_azure_test.go`:
- Around line 304-320: Replace the platform/topology guard block currently
inside BeforeAll with a BeforeEach that performs the same checks (use
getTestCtx()/testCtx.GetHostedCluster(), validate hc.Spec.Platform.Type,
hc.Spec.Platform.Azure.Topology, and
netutil.ServicePublishingStrategyByTypeByHC(hc, hyperv1.OAuthServer)), move
controlPlaneNamespace assignment into this BeforeEach, and call Skip() with the
repository-standard messages: for platform use "Azure OAuth LB Private tests is
only for Azure platform" and for topology use "Azure OAuth LB Private tests
requires Private topology" (also keep the existing Skip for publishing strategy
if present); update references to BeforeAll -> BeforeEach and keep the same
symbol names (getTestCtx, testCtx.GetHostedCluster, controlPlaneNamespace,
netutil.ServicePublishingStrategyByTypeByHC, hyperv1.OAuthServer).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 4942b228-db2d-42ed-83bf-2dd3738e5e94
📒 Files selected for processing (2)
test/e2e/v2/lifecycle/azure.gotest/e2e/v2/tests/hosted_cluster_azure_test.go
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #8584 +/- ##
==========================================
+ Coverage 40.41% 40.53% +0.12%
==========================================
Files 755 755
Lines 93235 93235
==========================================
+ Hits 37679 37794 +115
+ Misses 52854 52740 -114
+ Partials 2702 2701 -1 see 3 files with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Add e2e validation for OAuth LoadBalancer publishing on Azure Private topology. Reuse the existing private cluster variant by adding --oauth-publishing-strategy=LoadBalancer to its creation args, avoiding an additional Azure cluster in CI. Signed-off-by: Alessandro Rossi <alesross@redhat.com> Commit-Message-Assisted-by: Claude (via Claude Code)
|
Re: CodeRabbit suggestion to change This This matches the existing pattern in |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/v2/tests/hosted_cluster_azure_test.go`:
- Around line 379-382: The test calls
ValidateOAuthWithIdentityProviderViaLoadBalancer which mutates the HostedCluster
OAuth config and creates an htpasswd Secret without restoring them; capture the
original HostedCluster.Spec.OAuth (and any other mutated fields) and the
existence/data of the htpasswd Secret before calling
ValidateOAuthWithIdentityProviderViaLoadBalancer, then defer a restoration
function that re-applies the original OAuth config and deletes or restores the
htpasswd Secret, and in that deferred cleanup handle apierrors.IsNotFound(error)
when attempting to restore/delete resources so cleanup tolerates resources
already removed; use the HostedCluster object and the Secret name returned/used
by ValidateOAuthWithIdentityProviderViaLoadBalancer to identify what to restore.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: bcc13421-364c-476a-94f1-73cf7c97a172
📒 Files selected for processing (2)
test/e2e/v2/lifecycle/azure.gotest/e2e/v2/tests/hosted_cluster_azure_test.go
| It("should complete OAuth token flow through LoadBalancer endpoint", func() { | ||
| ctx := testCtx.Context | ||
| e2eutil.ValidateOAuthWithIdentityProviderViaLoadBalancer(GinkgoTB(), ctx, testCtx.MgmtClient, hc) | ||
| }) |
There was a problem hiding this comment.
Restore cluster auth state after OAuth flow mutation.
Line 381 calls ValidateOAuthWithIdentityProviderViaLoadBalancer, which mutates HostedCluster OAuth config and creates an htpasswd Secret without restoring/deleting them. This makes outcomes order-dependent on shared clusters. Please capture original state and defer restoration, with apierrors.IsNotFound() handling during cleanup.
As per coding guidelines: "When a test mutates cluster state, capture original state before mutation and defer restoration; check apierrors.IsNotFound() in cleanup to handle already-deleted resources".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 379 - 382, The
test calls ValidateOAuthWithIdentityProviderViaLoadBalancer which mutates the
HostedCluster OAuth config and creates an htpasswd Secret without restoring
them; capture the original HostedCluster.Spec.OAuth (and any other mutated
fields) and the existence/data of the htpasswd Secret before calling
ValidateOAuthWithIdentityProviderViaLoadBalancer, then defer a restoration
function that re-applies the original OAuth config and deletes or restores the
htpasswd Secret, and in that deferred cleanup handle apierrors.IsNotFound(error)
when attempting to restore/delete resources so cleanup tolerates resources
already removed; use the HostedCluster object and the Secret name returned/used
by ValidateOAuthWithIdentityProviderViaLoadBalancer to identify what to restore.
There was a problem hiding this comment.
The mutation happens inside the shared helper ValidateOAuthWithIdentityProviderViaLoadBalancer (in test/e2e/util/oauth.go), which is also used by the existing AzureOAuthLoadBalancerTest (line 284, same file) without cleanup. Adding cleanup in the caller would mean duplicating restore logic that belongs in the helper itself.
This is a pre-existing concern, not introduced by this PR. Noted in the PR description as a known tradeoff. If we address it, it should be done in the shared helper as a follow-up.
There was a problem hiding this comment.
@Nirshal, that's a fair point — the concern is pre-existing and your reasoning is correct. Duplicating restore logic in each caller would be the wrong approach; the cleanup belongs inside ValidateOAuthWithIdentityProviderViaLoadBalancer itself so all callers benefit automatically.
Would you like me to open a follow-up GitHub issue to track adding htpasswd Secret cleanup (and OAuth config restoration) to the shared helper?
🧠 Learnings used
Learnt from: cblecker
Repo: openshift/hypershift PR: 8511
File: test/e2e/v2/tests/hosted_cluster_metrics_test.go:203-245
Timestamp: 2026-05-15T03:34:00.446Z
Learning: When reviewing HyperShift v2 e2e tests under test/e2e/v2/, do not require MicroShift “guard” logic or markers. HyperShift v2 e2e tests in this directory run only against hosted clusters (no MicroShift targets), so missing exutil.IsMicroShiftCluster() checks and/or [Skipped:MicroShift] labels should not be flagged.
Learnt from: bryan-cox
Repo: openshift/hypershift PR: 8527
File: test/e2e/v2/tests/nodepool_autoscaling_test.go:236-236
Timestamp: 2026-05-15T14:34:25.062Z
Learning: In openshift/hypershift v2 e2e lifecycle CI tests under test/e2e/v2/tests/, it is OK to reference external image registries because the CI environment is not air-gapped. Image references like registry.access.redhat.com/ubi9/ubi-minimal:latest in lifecycle DaemonSet definitions should not be flagged for requiring internal registries. For the autoscaling workload created by newAutoscalingWorkload (nodepool_autoscaling_test.go), ensure it uses registry.k8s.io/pause:3.9 (not ubi-minimal).
Learnt from: bryan-cox
Repo: openshift/hypershift PR: 8469
File: test/e2e/util/util_test.go:17-131
Timestamp: 2026-05-17T22:37:02.057Z
Learning: In Go e2e tests under test/e2e, do not call t.Parallel() (neither at the top-level test nor in any subtest) when the test (or any helper it calls) uses t.Setenv—for example, when calling azureutil.SetAsAroHCPTest(t). The Go test runner panics with "testing: test using t.Setenv can not use t.Parallel" because environment variables are process-global. Keep such tests serial to ensure the environment setup is safe and deterministic.
|
/test ? |
1 similar comment
|
/test ? |
|
/test e2e-azure-v2-self-managed |
|
@Nirshal: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Now the picture is crystal clear:
So there's only ONE execution of this test (the private group), and it failed. The "successful" output at lines 369-378 is also from this same test — it shows the The test has TWO parts: (1) OAuth token flow (lines 375-377, succeeded) and (2) healthcheck via raw TLS dial (lines 396-405, failed). The test sequence is: token flow succeeded first, then the healthcheck timed out. Now I have full understanding. Let me produce the final report. Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThe new e2e test introduced by this PR ( Root CausePrimary cause: Network unreachability of Azure internal LoadBalancer IP from CI test runner. The test The test calls Why the OAuth token flow step succeeded: The same test's earlier step ( Contrast with the non-private variant: The Secondary failures: The Recommendations
Evidence
|
cblecker
left a comment
There was a problem hiding this comment.
The test structure itself is solid — good use of Ordered + BeforeAll, thorough skip guards, and the internal LB annotation check is a meaningful addition for private topology coverage.
Two things to address before this can merge:
1. Healthcheck unreachable from CI (blocking): WaitForOAuthLoadBalancerReady dials the OAuth hostname directly via TLS, which resolves to the internal LB IP for private topology — unreachable from the build cluster. The CI run confirmed this (job 2059278516129632256, oauth.go:375, 300s timeout). The token flow step worked; it's only the raw healthcheck that fails. Fix needs to happen in the shared helper.
2. Test isolation on the shared cluster: The public AzureOAuthLoadBalancerTest works because it has a dedicated cluster. This private variant shares a cluster with read-only tests and relies on declaration order for isolation. Either a dedicated cluster, or a sequential matrix step, would give the same guarantee the public variant has.
Minor nit: the Service type + endpoint predicates in the first It block are nearly identical to the public variant (~30 lines). Not blocking, but worth a shared helper if more LB variants get added.
| It("should complete OAuth token flow through LoadBalancer endpoint", func() { | ||
| ctx := testCtx.Context | ||
| e2eutil.ValidateOAuthWithIdentityProviderViaLoadBalancer(GinkgoTB(), ctx, testCtx.MgmtClient, hc) | ||
| }) |
There was a problem hiding this comment.
The CI run showed this test can't pass as-is — WaitForOAuthLoadBalancerReady (in test/e2e/util/oauth.go:350-375) does a direct TLS dial to https://<oauth-hostname>/healthz. For private topology that hostname resolves to the Azure internal LB IP (e.g., 10.0.0.22), which is unreachable from the CI build cluster.
Interestingly the OAuth token flow itself succeeds because it routes through the guest kubeconfig transport — it's specifically the raw healthcheck that fails. So the underlying validation logic works; it's just the connectivity check that's incompatible with private topology.
The fix needs to happen in the shared helper. A few approaches:
- Skip
WaitForOAuthLoadBalancerReadyentirely for private topology and rely on the token flow validation alone — it already exercises the full OAuth stack end-to-end. - Use the guest kubeconfig transport for the healthcheck too, the same way the token flow step does.
- Route it through a pod on the management cluster that has VNet connectivity.
This needs to be addressed in the helper before this can land.
| AzureOAuthLoadBalancerPrivateTest(getTestCtx) | ||
| } | ||
|
|
||
| var _ = Describe("Hosted Cluster Azure", Label("hosted-cluster-azure"), func() { |
There was a problem hiding this comment.
The public variant (AzureOAuthLoadBalancerTest) works cleanly because it runs on a dedicated cluster (cluster-name-oauth-lb) created specifically for it — mutation is safe there since nothing else runs on that cluster. This private variant is trying to replicate that pattern but without the dedicated cluster, sharing cluster-name-private with the read-only AzurePrivateTopologyTest instead.
Registering this last is a reasonable mitigation, but declaration order between Context blocks isn't enforced by Ginkgo's Ordered decorator — that only applies within a single Context. If the registration order changes, or a new test gets inserted between the two, the read-only private topology tests could run after the HC has been mutated (htpasswd IDP created, kubeadmin secret deleted).
The cleanest fix that matches how the public variant is structured would be a dedicated cluster-name-oauth-lb-private cluster. I understand the reviewer feedback to reduce cluster count, so if a dedicated cluster is off the table, the other option is to make this a sequential step in the test matrix (after the read-only private tests), the way etcd-chaos runs sequentially after upgrade on the upgrade cluster. That would give the same isolation guarantee without an extra cluster.
What this PR does / why we need it:
Adds the
AzureOAuthLoadBalancerPrivateTestto the v2 Ginkgo e2e suite, validatingthat the OAuth server works correctly when published via LoadBalancer on an Azure
Private topology cluster.
The test verifies:
oauth-openshiftService is created as type LoadBalancer with an allocated endpointChanges to
lifecycle/azure.go:--oauth-publishing-strategy=LoadBalancerto the existingprivatecluster variant, reusing it instead of spinning up a separate cluster (per reviewer feedback to reduce Azure CI cluster count)self-managed-azure-oauth-lb-privateto the private TestGroup's LabelFilter so the new test runs on the same clusterNote: The OAuth token flow test requires the test runner to have network connectivity
to the Azure VNet (e.g., running in the management cluster or via VPN/peering).
Which issue(s) this PR fixes:
Fixes https://redhat.atlassian.net/browse/CNTRLPLANE-3277
Special notes for your reviewer:
AzureOAuthLoadBalancerPrivateTestmutates the HostedCluster (creates htpasswd IDP, triggers kubeadmin secret deletion). It is registered last for the private cluster inRegisterHostedClusterAzureTeststo avoid affecting other private topology tests.openshift/releaseis needed since the private cluster already exists in CI.Checklist:
Summary by CodeRabbit