Skip to content

CNTRLPLANE-3277: Add Azure OAuth LoadBalancer private topology e2e test#8584

Open
Nirshal wants to merge 1 commit into
openshift:mainfrom
Nirshal:CNTRLPLANE-3277
Open

CNTRLPLANE-3277: Add Azure OAuth LoadBalancer private topology e2e test#8584
Nirshal wants to merge 1 commit into
openshift:mainfrom
Nirshal:CNTRLPLANE-3277

Conversation

@Nirshal

@Nirshal Nirshal commented May 25, 2026

Copy link
Copy Markdown
Contributor

What this PR does / why we need it:

Adds the AzureOAuthLoadBalancerPrivateTest to the v2 Ginkgo e2e suite, validating
that the OAuth server works correctly when published via LoadBalancer on an Azure
Private topology cluster.

The test verifies:

  • The oauth-openshift Service is created as type LoadBalancer with an allocated endpoint
  • The Service carries the Azure internal LoadBalancer annotation (confirming it's an ILB, not public)
  • The OAuth token flow (kubeadmin login + htpasswd IDP setup + token request) works through the private endpoint

Changes to lifecycle/azure.go:

  • Adds --oauth-publishing-strategy=LoadBalancer to the existing private cluster variant, reusing it instead of spinning up a separate cluster (per reviewer feedback to reduce Azure CI cluster count)
  • Adds self-managed-azure-oauth-lb-private to the private TestGroup's LabelFilter so the new test runs on the same cluster

Note: The OAuth token flow test requires the test runner to have network connectivity
to the Azure VNet (e.g., running in the management cluster or via VPN/peering).

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/CNTRLPLANE-3277

Special notes for your reviewer:

  • This builds on top of CNTRLPLANE-3222: Port v1 lifecycle tests to v2 Ginkgo framework #8527 (CNTRLPLANE-3222) which ported the v1 lifecycle tests to v2 and established the Azure self-managed test infrastructure.
  • The private cluster variant now always creates with OAuth via LoadBalancer. This means private topology with default OAuth (Route) is no longer tested in CI. This tradeoff was agreed upon to avoid an additional Azure cluster.
  • AzureOAuthLoadBalancerPrivateTest mutates the HostedCluster (creates htpasswd IDP, triggers kubeadmin secret deletion). It is registered last for the private cluster in RegisterHostedClusterAzureTests to avoid affecting other private topology tests.
  • No companion PR in openshift/release is needed since the private cluster already exists in CI.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests. No unit tests needed — changes are e2e test code and CI cluster configuration only, with no new business logic.

Summary by CodeRabbit

  • Tests
    • Added end-to-end test coverage for Azure private-topology clusters running OAuth behind a LoadBalancer, verifying service type, internal LB annotation, and token flow through the LoadBalancer endpoint.
    • Introduced an ordered validation suite for private Azure deployments to ensure OAuth LoadBalancer behavior.
    • Broadened the Azure test selection to include an additional private OAuth LoadBalancer variant.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 25, 2026
@openshift-ci-robot

openshift-ci-robot commented May 25, 2026

Copy link
Copy Markdown

@Nirshal: This pull request references CNTRLPLANE-3277 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Adds the AzureOAuthLoadBalancerPrivateTest to the v2 Ginkgo e2e suite, validating
that the OAuth server works correctly when published via LoadBalancer on an Azure
Private topology cluster.

The test verifies:

  • The oauth-openshift Service is created as type LoadBalancer with an allocated endpoint
  • The Service carries the Azure internal LoadBalancer annotation (confirming it's an ILB, not public)
  • The OAuth token flow (kubeadmin login + htpasswd IDP setup + token request) works through the private endpoint

Also adds:

  • oauth-lb-private cluster spec in lifecycle/azure.go combining --endpoint-access=Private with --oauth-publishing-strategy=LoadBalancer
  • Corresponding test matrix entry with label self-managed-azure-oauth-lb-private

Note: The OAuth token flow test requires the test runner to have network connectivity
to the Azure VNet (e.g., running in the management cluster or via VPN/peering).

Companion PR needed: openshift/release changes to create/destroy/dump the
oauth-lb-private cluster and wire the label filter in the CI job.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/CNTRLPLANE-3277

Special notes for your reviewer:

  • This builds on top of CNTRLPLANE-3222: Port v1 lifecycle tests to v2 Ginkgo framework #8527 (CNTRLPLANE-3222) which ported the v1 lifecycle tests to v2 and established the Azure self-managed test infrastructure.
  • The naming oauth-lb-private follows the existing oauth-lb convention rather than private-oauth-lb to stay consistent with Bryan's naming. Open to renaming both to public-oauth-lb / private-oauth-lb for clarity if preferred.
  • The ValidateOAuthWithIdentityProviderViaLoadBalancer helper in test/e2e/util/oauth.go does not clean up the htpasswd Secret it creates — this is pre-existing and safe since each test variant uses a separate cluster, but worth a follow-up.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests. No unit tests needed — changes are e2e test code and CI cluster configuration only, with no new business logic.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 25, 2026
@openshift-ci

openshift-ci Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

Adds --oauth-publishing-strategy=LoadBalancer and the private NAT subnet arg to the Azure private cluster ExtraArgs, expands the Azure private test label selector to include self-managed-azure-oauth-lb-private, and adds an ordered E2E test (AzureOAuthLoadBalancerPrivateTest) that waits for oauth-openshift to be a LoadBalancer, asserts the Azure internal LoadBalancer annotation, and validates the OAuth token flow via the LoadBalancer endpoint.

Sequence Diagram(s)

sequenceDiagram
  participant TestRunner
  participant HostedCluster
  participant AzureLoadBalancer
  participant ManagementClient
  TestRunner->>HostedCluster: wait for oauth-openshift Service type=LoadBalancer
  HostedCluster->>AzureLoadBalancer: allocate LB and ingress endpoint
  AzureLoadBalancer->>HostedCluster: annotate Service with InternalLoadBalancerAnnotation
  TestRunner->>ManagementClient: perform OAuth token flow via LB endpoint
  ManagementClient->>HostedCluster: validate token exchange through LB ingress
Loading

Suggested reviewers

  • enxebre
  • cblecker
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning ValidateOAuthWithIdentityProviderViaLoadBalancer creates htpasswd Secret and mutates HostedCluster OAuth config without cleanup, causing test order-dependency on shared clusters. Add AfterAll block to restore original OAuth config and delete htpasswd Secret with apierrors.IsNotFound checks during cleanup.
Microshift Test Compatibility ⚠️ Warning Test uses config.openshift.io/v1, user.openshift.io/v1, and osin.openshift.io/v1 APIs unavailable on MicroShift without protection mechanisms like [apigroup:config.openshift.io] label. Add [apigroup:config.openshift.io] tag to test name, or add [Skipped:MicroShift] label, or guard with exutil.IsMicroShiftCluster() check.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main change: adding an Azure OAuth LoadBalancer private topology e2e test. It is concise, specific, and directly matches the pull request's primary objective.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names are stable and deterministic with no dynamic content; dynamic values correctly appear only in test bodies, not titles.
Single Node Openshift (Sno) Test Compatibility ✅ Passed The new AzureOAuthLoadBalancerPrivateTest does not make multi-node assumptions; it tests Service creation, annotations, and OAuth token flow which all work on single-node clusters.
Topology-Aware Scheduling Compatibility ✅ Passed This PR only modifies e2e test files. It does not introduce deployment manifests, operator code, or controllers with scheduling constraints.
Ote Binary Stdout Contract ✅ Passed PR introduces no OTE stdout contract violations: adds only configuration strings and Ginkgo test functions (inside Context/It blocks), with no process-level stdout writes.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed Test uses DNS hostnames validated by regex for URL construction, not raw IP addresses. No hardcoded IPv4/IPv6 addresses or public internet connectivity detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/platform/azure PR/issue for Azure (AzurePlatform) platform area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels May 25, 2026
@openshift-ci

openshift-ci Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Nirshal
Once this PR has been reviewed and has the lgtm label, please assign enxebre for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/e2e/v2/tests/hosted_cluster_azure_test.go (1)

304-320: ⚡ Quick win

Move platform/topology guard skips to BeforeEach with standardized skip text.

In Line 304-320, the new suite uses BeforeAll and custom skip messages for platform checks. Please switch to BeforeEach and the repository’s required message format for platform-specific skips.

♻️ Suggested adjustment
-		BeforeAll(func() {
+		BeforeEach(func() {
 			testCtx = getTestCtx()
 			hc = testCtx.GetHostedCluster()
 			if hc == nil || hc.Spec.Platform.Type != hyperv1.AzurePlatform {
-				Skip("Azure OAuth LB Private tests are only for Azure platform")
+				Skip("[Azure OAuth LB Private] test is only for Azure platform")
 			}
 			if hc.Spec.Platform.Azure == nil || hc.Spec.Platform.Azure.Topology != hyperv1.AzureTopologyPrivate {
 				Skip("Azure OAuth LB Private tests require Private topology")
 			}

As per coding guidelines, "test/e2e/v2/tests/**/*_test.go: Use BeforeEach with Skip() for platform-specific tests with message format: [Feature] test is only for [Platform] platform".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 304 - 320,
Replace the platform/topology guard block currently inside BeforeAll with a
BeforeEach that performs the same checks (use
getTestCtx()/testCtx.GetHostedCluster(), validate hc.Spec.Platform.Type,
hc.Spec.Platform.Azure.Topology, and
netutil.ServicePublishingStrategyByTypeByHC(hc, hyperv1.OAuthServer)), move
controlPlaneNamespace assignment into this BeforeEach, and call Skip() with the
repository-standard messages: for platform use "Azure OAuth LB Private tests is
only for Azure platform" and for topology use "Azure OAuth LB Private tests
requires Private topology" (also keep the existing Skip for publishing strategy
if present); update references to BeforeAll -> BeforeEach and keep the same
symbol names (getTestCtx, testCtx.GetHostedCluster, controlPlaneNamespace,
netutil.ServicePublishingStrategyByTypeByHC, hyperv1.OAuthServer).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/e2e/v2/tests/hosted_cluster_azure_test.go`:
- Around line 304-320: Replace the platform/topology guard block currently
inside BeforeAll with a BeforeEach that performs the same checks (use
getTestCtx()/testCtx.GetHostedCluster(), validate hc.Spec.Platform.Type,
hc.Spec.Platform.Azure.Topology, and
netutil.ServicePublishingStrategyByTypeByHC(hc, hyperv1.OAuthServer)), move
controlPlaneNamespace assignment into this BeforeEach, and call Skip() with the
repository-standard messages: for platform use "Azure OAuth LB Private tests is
only for Azure platform" and for topology use "Azure OAuth LB Private tests
requires Private topology" (also keep the existing Skip for publishing strategy
if present); update references to BeforeAll -> BeforeEach and keep the same
symbol names (getTestCtx, testCtx.GetHostedCluster, controlPlaneNamespace,
netutil.ServicePublishingStrategyByTypeByHC, hyperv1.OAuthServer).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4942b228-db2d-42ed-83bf-2dd3738e5e94

📥 Commits

Reviewing files that changed from the base of the PR and between 7625684 and b8a5e30.

📒 Files selected for processing (2)
  • test/e2e/v2/lifecycle/azure.go
  • test/e2e/v2/tests/hosted_cluster_azure_test.go

@codecov

codecov Bot commented May 25, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.53%. Comparing base (6d994e4) to head (5d52eb0).
⚠️ Report is 20 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8584      +/-   ##
==========================================
+ Coverage   40.41%   40.53%   +0.12%     
==========================================
  Files         755      755              
  Lines       93235    93235              
==========================================
+ Hits        37679    37794     +115     
+ Misses      52854    52740     -114     
+ Partials     2702     2701       -1     

see 3 files with indirect coverage changes

Flag Coverage Δ
cmd-support 34.45% <ø> (+<0.01%) ⬆️
cpo-hostedcontrolplane 41.76% <ø> (ø)
cpo-other 41.23% <ø> (+0.91%) ⬆️
hypershift-operator 50.72% <ø> (ø)
other 31.58% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Add e2e validation for OAuth LoadBalancer publishing on Azure Private
topology. Reuse the existing private cluster variant by adding
--oauth-publishing-strategy=LoadBalancer to its creation args, avoiding
an additional Azure cluster in CI.

Signed-off-by: Alessandro Rossi <alesross@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
@Nirshal Nirshal force-pushed the CNTRLPLANE-3277 branch from b8a5e30 to 5d52eb0 Compare May 26, 2026 13:51
@Nirshal

Nirshal commented May 26, 2026

Copy link
Copy Markdown
Contributor Author

Re: CodeRabbit suggestion to change BeforeAll to BeforeEach — intentionally keeping BeforeAll here.

This Context uses the Ordered decorator with shared state (testCtx, hc, controlPlaneNamespace) across three sequential It blocks. BeforeAll runs once and initializes the shared variables; BeforeEach would redundantly re-fetch the test context and re-run the guards before every It.

This matches the existing pattern in AzurePrivateTopologyTest (same file, line 91) which also uses Ordered + BeforeAll for the same reason.

@Nirshal Nirshal marked this pull request as ready for review May 26, 2026 13:56
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 26, 2026
@openshift-ci openshift-ci Bot requested review from cblecker and sjenning May 26, 2026 13:58

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/v2/tests/hosted_cluster_azure_test.go`:
- Around line 379-382: The test calls
ValidateOAuthWithIdentityProviderViaLoadBalancer which mutates the HostedCluster
OAuth config and creates an htpasswd Secret without restoring them; capture the
original HostedCluster.Spec.OAuth (and any other mutated fields) and the
existence/data of the htpasswd Secret before calling
ValidateOAuthWithIdentityProviderViaLoadBalancer, then defer a restoration
function that re-applies the original OAuth config and deletes or restores the
htpasswd Secret, and in that deferred cleanup handle apierrors.IsNotFound(error)
when attempting to restore/delete resources so cleanup tolerates resources
already removed; use the HostedCluster object and the Secret name returned/used
by ValidateOAuthWithIdentityProviderViaLoadBalancer to identify what to restore.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: bcc13421-364c-476a-94f1-73cf7c97a172

📥 Commits

Reviewing files that changed from the base of the PR and between b8a5e30 and 5d52eb0.

📒 Files selected for processing (2)
  • test/e2e/v2/lifecycle/azure.go
  • test/e2e/v2/tests/hosted_cluster_azure_test.go

Comment on lines +379 to +382
It("should complete OAuth token flow through LoadBalancer endpoint", func() {
ctx := testCtx.Context
e2eutil.ValidateOAuthWithIdentityProviderViaLoadBalancer(GinkgoTB(), ctx, testCtx.MgmtClient, hc)
})

@coderabbitai coderabbitai Bot May 26, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restore cluster auth state after OAuth flow mutation.

Line 381 calls ValidateOAuthWithIdentityProviderViaLoadBalancer, which mutates HostedCluster OAuth config and creates an htpasswd Secret without restoring/deleting them. This makes outcomes order-dependent on shared clusters. Please capture original state and defer restoration, with apierrors.IsNotFound() handling during cleanup.

As per coding guidelines: "When a test mutates cluster state, capture original state before mutation and defer restoration; check apierrors.IsNotFound() in cleanup to handle already-deleted resources".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 379 - 382, The
test calls ValidateOAuthWithIdentityProviderViaLoadBalancer which mutates the
HostedCluster OAuth config and creates an htpasswd Secret without restoring
them; capture the original HostedCluster.Spec.OAuth (and any other mutated
fields) and the existence/data of the htpasswd Secret before calling
ValidateOAuthWithIdentityProviderViaLoadBalancer, then defer a restoration
function that re-applies the original OAuth config and deletes or restores the
htpasswd Secret, and in that deferred cleanup handle apierrors.IsNotFound(error)
when attempting to restore/delete resources so cleanup tolerates resources
already removed; use the HostedCluster object and the Secret name returned/used
by ValidateOAuthWithIdentityProviderViaLoadBalancer to identify what to restore.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mutation happens inside the shared helper ValidateOAuthWithIdentityProviderViaLoadBalancer (in test/e2e/util/oauth.go), which is also used by the existing AzureOAuthLoadBalancerTest (line 284, same file) without cleanup. Adding cleanup in the caller would mean duplicating restore logic that belongs in the helper itself.

This is a pre-existing concern, not introduced by this PR. Noted in the PR description as a known tradeoff. If we address it, it should be done in the shared helper as a follow-up.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Nirshal, that's a fair point — the concern is pre-existing and your reasoning is correct. Duplicating restore logic in each caller would be the wrong approach; the cleanup belongs inside ValidateOAuthWithIdentityProviderViaLoadBalancer itself so all callers benefit automatically.

Would you like me to open a follow-up GitHub issue to track adding htpasswd Secret cleanup (and OAuth config restoration) to the shared helper?


🧠 Learnings used
Learnt from: cblecker
Repo: openshift/hypershift PR: 8511
File: test/e2e/v2/tests/hosted_cluster_metrics_test.go:203-245
Timestamp: 2026-05-15T03:34:00.446Z
Learning: When reviewing HyperShift v2 e2e tests under test/e2e/v2/, do not require MicroShift “guard” logic or markers. HyperShift v2 e2e tests in this directory run only against hosted clusters (no MicroShift targets), so missing exutil.IsMicroShiftCluster() checks and/or [Skipped:MicroShift] labels should not be flagged.

Learnt from: bryan-cox
Repo: openshift/hypershift PR: 8527
File: test/e2e/v2/tests/nodepool_autoscaling_test.go:236-236
Timestamp: 2026-05-15T14:34:25.062Z
Learning: In openshift/hypershift v2 e2e lifecycle CI tests under test/e2e/v2/tests/, it is OK to reference external image registries because the CI environment is not air-gapped. Image references like registry.access.redhat.com/ubi9/ubi-minimal:latest in lifecycle DaemonSet definitions should not be flagged for requiring internal registries. For the autoscaling workload created by newAutoscalingWorkload (nodepool_autoscaling_test.go), ensure it uses registry.k8s.io/pause:3.9 (not ubi-minimal).

Learnt from: bryan-cox
Repo: openshift/hypershift PR: 8469
File: test/e2e/util/util_test.go:17-131
Timestamp: 2026-05-17T22:37:02.057Z
Learning: In Go e2e tests under test/e2e, do not call t.Parallel() (neither at the top-level test nor in any subtest) when the test (or any helper it calls) uses t.Setenv—for example, when calling azureutil.SetAsAroHCPTest(t). The Go test runner panics with "testing: test using t.Setenv can not use t.Parallel" because environment variables are process-global. Keep such tests serial to ensure the environment setup is safe and deterministic.

@Nirshal

Nirshal commented May 26, 2026

Copy link
Copy Markdown
Contributor Author

/test ?

1 similar comment
@Nirshal

Nirshal commented May 26, 2026

Copy link
Copy Markdown
Contributor Author

/test ?

@Nirshal

Nirshal commented May 26, 2026

Copy link
Copy Markdown
Contributor Author

/test e2e-azure-v2-self-managed

@openshift-ci

openshift-ci Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor

@Nirshal: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-azure-v2-self-managed 5d52eb0 link false /test e2e-azure-v2-self-managed

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented May 26, 2026

Copy link
Copy Markdown

Now the picture is crystal clear:

  • Lines 358-378: The oauth-lb group binary runs tests tagged self-managed-azure-oauth-lb against cluster oauth-lb-0f50cdadbc. But confusingly, lines 369-378 show the private topology test (tagged self-managed-azure-oauth-lb-private) running against cluster private-0f50cdadbc. This output is interleaved from the private binary — the output of multiple parallel processes goes to stdout together.

  • Lines 392-395: The oauth-lb group finishes: "Ran 2 of 919 Specs" → PASSED.

  • Lines 396-440: The private group binary runs 7 specs. The should complete OAuth token flow through LoadBalancer endpoint test (line 408) with label self-managed-azure-oauth-lb-private fails after 300 seconds of i/o timeouts to 10.0.0.22:443.

So there's only ONE execution of this test (the private group), and it failed. The "successful" output at lines 369-378 is also from this same test — it shows the ValidateOAuthWithIdentityProviderViaLoadBalancer step which successfully got an OAuth token. But AFTER that, the WaitForOAuthLoadBalancerReady healthcheck at oauth.go:375 fails trying to do a raw TCP health check to 10.0.0.22:443.

The test has TWO parts: (1) OAuth token flow (lines 375-377, succeeded) and (2) healthcheck via raw TLS dial (lines 396-405, failed). The test sequence is: token flow succeeded first, then the healthcheck timed out.

Now I have full understanding. Let me produce the final report.

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-azure-v2-self-managed
  • Build ID: 2059278516129632256
  • Target: e2e-azure-v2-self-managed
  • PR: #8584CNTRLPLANE-3277: Add Azure OAuth LoadBalancer private topology e2e test
  • State: failure
  • Duration: ~3h 42m (14:20 → 18:02 UTC)
  • Failed Test: Hosted Cluster Azure > Azure OAuth LoadBalancer in Private Topology > should complete OAuth token flow through LoadBalancer endpoint
  • Test Groups: PASS: oauth-lb, upgrade, etcd-chaos, autoscaling, public | FAIL: private

Test Failure Analysis

Error

[FAILED] [300.202 seconds]
Hosted Cluster Azure > Azure OAuth LoadBalancer in Private Topology > should complete OAuth token flow through LoadBalancer endpoint

failed waiting for OAuth LoadBalancer oauth-private-0f50cdadbc.sm.hcp-sm-azure.azure.devcluster.openshift.com to be healthy
Unexpected error:
    <context.deadlineExceededError>:
    context deadline exceeded
    {}
occurred

In [It] at: /hypershift/test/e2e/util/oauth.go:375

Repeated errors: dial tcp 10.0.0.22:443: i/o timeout (9 consecutive attempts over 300 seconds)

Summary

The new e2e test introduced by this PR (AzureOAuthLoadBalancerPrivateTest) calls WaitForOAuthLoadBalancerReady() which performs a raw TLS healthcheck (/healthz endpoint) against the OAuth LoadBalancer. For the private hosted cluster (private-0f50cdadbc), the OAuth LoadBalancer is backed by an Azure internal LB (annotation service.beta.kubernetes.io/azure-load-balancer-internal: "true") and gets a private VNet IP (10.0.0.22). The CI test runner pod runs outside this VNet on the build cluster and cannot reach the private IP, causing the TLS healthcheck to time out after 300 seconds. Notably, the OAuth token flow step within the same test does succeed (via ValidateOAuthWithIdentityProviderViaLoadBalancer) because it uses a different code path that goes through the management cluster kubeconfig / proxy — but the subsequent raw healthcheck dials the private IP directly and fails. The secondary failures in the destroy-guests and destroy-management-cluster post-steps are caused by the Azure PrivateLinkService NIC resource dependency preventing resource group cleanup — a teardown consequence of the OAuth PLS created by the test.

Root Cause

Primary cause: Network unreachability of Azure internal LoadBalancer IP from CI test runner.

The test should complete OAuth token flow through LoadBalancer endpoint (label: self-managed-azure-oauth-lb-private) runs against the private hosted cluster variant, which uses --endpoint-access=Private. This causes the oauth-openshift Service to be created as an Azure internal LoadBalancer with the annotation service.beta.kubernetes.io/azure-load-balancer-internal: "true", resulting in a private VNet IP (10.0.0.22).

The test calls WaitForOAuthLoadBalancerReady() at oauth.go:375, which performs a raw TLS GET /healthz against the OAuth hostname. DNS resolves oauth-private-0f50cdadbc.sm.hcp-sm-azure.azure.devcluster.openshift.com to 10.0.0.22, which is on the Azure VNet's internal subnet. The CI test runner pod runs on the OpenShift build cluster (build01) which has no VNet peering, VPN, or Private Endpoint to this Azure VNet, making the IP unreachable — hence dial tcp 10.0.0.22:443: i/o timeout.

Why the OAuth token flow step succeeded: The same test's earlier step (ValidateOAuthWithIdentityProviderViaLoadBalancer) obtained an OAuth token successfully. This step uses an HTTP client configured with the hosted cluster's kubeconfig/proxy settings, which routes through the management cluster's API server — not via direct TCP dial to the internal LB IP. This is why line 377 shows "OAuth token retrieved successfully for user testuser" in ~6 seconds while the raw healthcheck at line 396+ fails.

Contrast with the non-private variant: The oauth-lb test group runs the same test type against cluster oauth-lb-0f50cdadbc, which uses a public LoadBalancer with external IP 52.154.160.248. The healthcheck succeeds immediately because the external IP is reachable from the CI build cluster.

Secondary failures: The destroy-guests step failed because the private hosted cluster couldn't be finalized within the 40-minute timeout (context deadline exceeded). The destroy-management-cluster step failed with Azure error NicInUseWithPrivateLinkService — the NIC associated with the OAuth PrivateLinkService (pls-oauth-openshift-*) could not be deleted because the PLS was still attached. Both are teardown cascading failures caused by the OAuth PLS resources created for the private topology test.

Recommendations
  1. Remove the raw healthcheck for private topology tests. The WaitForOAuthLoadBalancerReady() call at oauth.go:375 performs a direct TLS dial that is fundamentally incompatible with private/internal LB endpoints. Since the OAuth token flow already validates end-to-end functionality via ValidateOAuthWithIdentityProviderViaLoadBalancer, the raw healthcheck is redundant for private topology. Skip or remove the WaitForOAuthLoadBalancerReady() call when endpointAccess == Private.

  2. Alternatively, proxy the healthcheck through the management cluster. If a raw healthcheck is required, route it through the management cluster's network (e.g., by exec'ing into a pod on the management cluster that has VNet connectivity) rather than dialing the private IP directly from the test runner.

  3. Ensure PrivateLinkService teardown is handled. The destroy failures indicate the PLS NIC dependency blocks resource group deletion. The destroy sequence should explicitly delete the PrivateLinkService resource before attempting to delete the resource group, or the hypershift destroy cluster azure command should handle this cleanup order.

  4. Consider adding a guard label. If the private OAuth healthcheck is intentionally not supported from outside the VNet, add explicit documentation and a skip annotation so future test runs don't regress on this known limitation.

Evidence
Evidence Detail
Failed test Azure OAuth LoadBalancer in Private Topology > should complete OAuth token flow through LoadBalancer endpoint
Label self-managed-azure-oauth-lb-private
Failure location /hypershift/test/e2e/util/oauth.go:375WaitForOAuthLoadBalancerReady()
Target IP 10.0.0.22 (Azure internal LB, private VNet IP)
Error dial tcp 10.0.0.22:443: i/o timeoutcontext deadline exceeded after 300s
OAuth token flow Succeeded (line 377: "OAuth token retrieved successfully for user testuser" in 6.2s)
Non-private variant Passed — oauth-lb-0f50cdadbc used external IP 52.154.160.248, healthcheck succeeded instantly
Private cluster config --endpoint-access=Private → internal LB annotation azure-load-balancer-internal: true
Test runner location CI build cluster build01, no VNet peering to Azure hosted cluster VNet
Destroy-guests failure hostedcluster wasn't finalized, aborting delete: context deadline exceeded (40m timeout)
Destroy-mgmt failure NicInUseWithPrivateLinkService — OAuth PLS NIC (pls-oauth-openshift-*) blocked resource group deletion
PR introducing tests #8584 — this is the first run of these new tests
Test groups 5 passed (oauth-lb, upgrade, etcd-chaos, autoscaling, public), 1 failed (private)
Build-log timestamp Private tests failed at 15:21:43 UTC; destroy steps failed at 17:42:00 and 18:01:56 UTC

@cblecker cblecker left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test structure itself is solid — good use of Ordered + BeforeAll, thorough skip guards, and the internal LB annotation check is a meaningful addition for private topology coverage.

Two things to address before this can merge:

1. Healthcheck unreachable from CI (blocking): WaitForOAuthLoadBalancerReady dials the OAuth hostname directly via TLS, which resolves to the internal LB IP for private topology — unreachable from the build cluster. The CI run confirmed this (job 2059278516129632256, oauth.go:375, 300s timeout). The token flow step worked; it's only the raw healthcheck that fails. Fix needs to happen in the shared helper.

2. Test isolation on the shared cluster: The public AzureOAuthLoadBalancerTest works because it has a dedicated cluster. This private variant shares a cluster with read-only tests and relies on declaration order for isolation. Either a dedicated cluster, or a sequential matrix step, would give the same guarantee the public variant has.

Minor nit: the Service type + endpoint predicates in the first It block are nearly identical to the public variant (~30 lines). Not blocking, but worth a shared helper if more LB variants get added.

It("should complete OAuth token flow through LoadBalancer endpoint", func() {
ctx := testCtx.Context
e2eutil.ValidateOAuthWithIdentityProviderViaLoadBalancer(GinkgoTB(), ctx, testCtx.MgmtClient, hc)
})

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CI run showed this test can't pass as-is — WaitForOAuthLoadBalancerReady (in test/e2e/util/oauth.go:350-375) does a direct TLS dial to https://<oauth-hostname>/healthz. For private topology that hostname resolves to the Azure internal LB IP (e.g., 10.0.0.22), which is unreachable from the CI build cluster.

Interestingly the OAuth token flow itself succeeds because it routes through the guest kubeconfig transport — it's specifically the raw healthcheck that fails. So the underlying validation logic works; it's just the connectivity check that's incompatible with private topology.

The fix needs to happen in the shared helper. A few approaches:

  1. Skip WaitForOAuthLoadBalancerReady entirely for private topology and rely on the token flow validation alone — it already exercises the full OAuth stack end-to-end.
  2. Use the guest kubeconfig transport for the healthcheck too, the same way the token flow step does.
  3. Route it through a pod on the management cluster that has VNet connectivity.

This needs to be addressed in the helper before this can land.

AzureOAuthLoadBalancerPrivateTest(getTestCtx)
}

var _ = Describe("Hosted Cluster Azure", Label("hosted-cluster-azure"), func() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The public variant (AzureOAuthLoadBalancerTest) works cleanly because it runs on a dedicated cluster (cluster-name-oauth-lb) created specifically for it — mutation is safe there since nothing else runs on that cluster. This private variant is trying to replicate that pattern but without the dedicated cluster, sharing cluster-name-private with the read-only AzurePrivateTopologyTest instead.

Registering this last is a reasonable mitigation, but declaration order between Context blocks isn't enforced by Ginkgo's Ordered decorator — that only applies within a single Context. If the registration order changes, or a new test gets inserted between the two, the read-only private topology tests could run after the HC has been mutated (htpasswd IDP created, kubeadmin secret deleted).

The cleanest fix that matches how the public variant is structured would be a dedicated cluster-name-oauth-lb-private cluster. I understand the reviewer feedback to reduce cluster count, so if a dedicated cluster is off the table, the other option is to make this a sequential step in the test matrix (after the read-only private tests), the way etcd-chaos runs sequentially after upgrade on the upgrade cluster. That would give the same isolation guarantee without an extra cluster.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/platform/azure PR/issue for Azure (AzurePlatform) platform area/testing Indicates the PR includes changes for e2e testing jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants