Skip to content

fix: OCPBUGS-86693: add machineNetwork CIDRs to KAS NO_PROXY to prevent x509 errors#8628

Closed
dpateriya wants to merge 1 commit into
openshift:mainfrom
dpateriya:fix/kas-noproxy-machine-network
Closed

fix: OCPBUGS-86693: add machineNetwork CIDRs to KAS NO_PROXY to prevent x509 errors#8628
dpateriya wants to merge 1 commit into
openshift:mainfrom
dpateriya:fix/kas-noproxy-machine-network

Conversation

@dpateriya
Copy link
Copy Markdown
Contributor

@dpateriya dpateriya commented May 28, 2026

Summary

When the management cluster uses an HTTP proxy, the KAS container in the hosted control plane inherits the management cluster's proxy environment via proxy.SetEnvVars() in the control-plane-operator. The NO_PROXY list for KAS already includes the guest cluster's clusterNetwork (pod) and serviceNetwork CIDRs, but omits the machineNetwork CIDRs.

KAS connects to kubelets on node IPs (which fall in the machineNetwork range) via the konnectivity tunnel for subresource operations (oc logs, oc exec, oc attach). Without the machineNetwork CIDRs in NO_PROXY, these HTTPS connections are routed through the management cluster's HTTP proxy. The proxy performs TLS interception, presenting its own certificate which KAS does not trust, resulting in:

x509: certificate signed by unknown authority

Root Cause

In control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment.go, the additionalNoProxyCIDRS slice only includes clusterNetwork and serviceNetwork, but not machineNetwork:

var additionalNoProxyCIDRS []string
additionalNoProxyCIDRS = append(additionalNoProxyCIDRS, netutil.ClusterCIDRs(hcp.Spec.Networking.ClusterNetwork)...)
additionalNoProxyCIDRS = append(additionalNoProxyCIDRS, netutil.ServiceCIDRs(hcp.Spec.Networking.ServiceNetwork)...)
// machineNetwork was missing here
proxy.SetEnvVars(&c.Env, additionalNoProxyCIDRS...)

Fix

Add machineNetwork CIDRs to the KAS NO_PROXY list:

additionalNoProxyCIDRS = append(additionalNoProxyCIDRS, netutil.MachineCIDRs(hcp.Spec.Networking.MachineNetwork)...)

How to Reproduce

  1. Deploy a management cluster behind an HTTP proxy
  2. Create a HostedCluster where machineNetwork is NOT included in the management cluster's NO_PROXY
  3. Run oc logs <any-pod> from the hosted cluster
  4. Observe: x509: certificate signed by unknown authority error

Customer Validation

A customer confirmed this fix resolves the issue — after manually adding the guest cluster's machineNetwork (along with clusterNetwork and serviceNetwork) to the management cluster's NO_PROXY, the oc logs x509 error was immediately resolved. This PR automates that by ensuring the control-plane-operator always includes machineNetwork in KAS's NO_PROXY.

Test Plan

  • Existing unit tests pass (go test ./control-plane-operator/controllers/hostedcontrolplane/v2/kas/...)
  • Verify on a proxied management cluster that oc logs, oc exec, oc attach work without manually adding machineNetwork to management cluster NO_PROXY

Made with Cursor

Summary by CodeRabbit

  • Bug Fixes
    • Proxy exemption handling updated so machine network CIDRs are excluded from proxy interception alongside pod and service networks, preventing TLS handshake failures caused by proxy interception and improving control-plane connectivity reliability.
  • Tests
    • Added tests validating NO_PROXY is populated correctly, including cases with and without machine network CIDRs and dual-stack networks.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

📝 Walkthrough

Walkthrough

The kube-apiserver deployment configuration now extends the no-proxy exemption list to include machine network CIDRs in addition to cluster and service CIDRs. This change prevents proxy-based routing of management traffic through konnectivity, which would cause TLS certificate validation failures. The updateMainContainer function appends the machine CIDRs via netutil.MachineCIDRs(...) to the existing proxy environment variable configuration.

Possibly related PRs

  • openshift/hypershift#8569: Propagates proxy environment variables (including NO_PROXY contents) into the konnectivity sidecar, complementing this PR's expansion of no-proxy exemptions in the KAS container.

Suggested reviewers

  • enxebre
  • sdminonne
  • bryan-cox
🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Test Structure And Quality ❓ Inconclusive Custom check specifies "Ginkgo test code" review, but PR adds standard Go unit tests (using testing.T, t.Run) with Gomega assertions, not Ginkgo-style tests. Clarify if check applies to standard Go tests. If so, test passes all requirements: single responsibility, appropriate setup (t.Setenv), meaningful assertions, and follows repo patterns.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and clearly summarizes the main change: adding machineNetwork CIDRs to KAS NO_PROXY environment variable to fix x509 TLS errors. It is specific, concise, and accurately reflects the primary objective of the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed The PR contains only standard Go tests (testing.T), not Ginkgo tests. All test names are static, descriptive strings with no dynamic values (pod names, timestamps, UUIDs, IPs, etc.).
Topology-Aware Scheduling Compatibility ✅ Passed PR adds machineNetwork CIDRs to NO_PROXY env var only. No scheduling constraints, pod affinity, topology spread, nodeSelector, replica count logic, or taints/tolerations are introduced.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed The PR adds unit tests (TestUpdateMainContainerNoProxy), not Ginkgo e2e tests. The custom check applies only to Ginkgo e2e tests (It, Describe, Context, When, etc.), so it is not applicable here.
No-Weak-Crypto ✅ Passed PR contains no weak crypto algorithms, custom crypto, or non-constant-time secret comparisons. Changes only add machine network CIDRs to proxy NO_PROXY environment configuration.
Container-Privileges ✅ Passed PR only modifies NO_PROXY environment variable configuration; no privileged container settings, hostPID/Network/IPC, capabilities, or security contexts are introduced or modified.
No-Sensitive-Data-In-Logs ✅ Passed PR contains no logging statements and only manipulates network CIDR environment variables; no sensitive data exposure risk identified.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from Nirshal and bryan-cox May 28, 2026 17:02
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dpateriya
Once this PR has been reviewed and has the lgtm label, please assign devguyio for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release and removed do-not-merge/needs-area labels May 28, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment.go`:
- Line 192: Add a unit test that exercises updateMainContainer so NO_PROXY
includes the machineNetwork CIDRs: instantiate an HCP with
Spec.Networking.MachineNetwork set to known CIDRs, call the code path that runs
updateMainContainer (or extract and call the helper that computes
additionalNoProxyCIDRS), and assert that the env var produced via
proxy.SetEnvVars contains the expected
netutil.MachineCIDRs(hcp.Spec.Networking.MachineNetwork) values; reference
updateMainContainer, netutil.MachineCIDRs and proxy.SetEnvVars when locating
code and add the test to the existing deployment_test.go (or a new focused test
file) similar to support/proxy/proxy_test.go so the NO_PROXY propagation from
MachineNetwork is covered.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: fe1c654d-7425-4506-8b76-6a59b45737b7

📥 Commits

Reviewing files that changed from the base of the PR and between adfbcdd and 407d826.

📒 Files selected for processing (1)
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment.go

@codecov
Copy link
Copy Markdown

codecov Bot commented May 28, 2026

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.79%. Comparing base (adfbcdd) to head (407d826).

Files with missing lines Patch % Lines
...ontrollers/hostedcontrolplane/v2/kas/deployment.go 0.00% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8628      +/-   ##
==========================================
- Coverage   45.84%   41.79%   -4.05%     
==========================================
  Files         440      252     -188     
  Lines       52824    16414   -36410     
==========================================
- Hits        24218     6861   -17357     
+ Misses      26816     9061   -17755     
+ Partials     1790      492    -1298     
Files with missing lines Coverage Δ
...ontrollers/hostedcontrolplane/v2/kas/deployment.go 26.47% <0.00%> (-0.32%) ⬇️

... and 188 files with indirect coverage changes

Flag Coverage Δ
cpo-hostedcontrolplane 41.79% <0.00%> (-0.02%) ⬇️
cpo-other ?
hypershift-operator ?

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dpateriya dpateriya force-pushed the fix/kas-noproxy-machine-network branch from 407d826 to ef9280f Compare May 28, 2026 17:16
@dpateriya dpateriya changed the title Bug: KAS NO_PROXY missing machineNetwork CIDRs causes x509 errors for oc logs/exec/attach fix: OCPBUGS-86693: add machineNetwork CIDRs to KAS NO_PROXY to prevent x509 errors May 28, 2026
@openshift-ci-robot openshift-ci-robot added jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 28, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@dpateriya: This pull request references Jira Issue OCPBUGS-86693, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

When the management cluster uses an HTTP proxy, the KAS container in the hosted control plane inherits the management cluster's proxy environment via proxy.SetEnvVars() in the control-plane-operator. The NO_PROXY list for KAS already includes the guest cluster's clusterNetwork (pod) and serviceNetwork CIDRs, but omits the machineNetwork CIDRs.

KAS connects to kubelets on node IPs (which fall in the machineNetwork range) via the konnectivity tunnel for subresource operations (oc logs, oc exec, oc attach). Without the machineNetwork CIDRs in NO_PROXY, these HTTPS connections are routed through the management cluster's HTTP proxy. The proxy performs TLS interception, presenting its own certificate which KAS does not trust, resulting in:

x509: certificate signed by unknown authority

Root Cause

In control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment.go, the additionalNoProxyCIDRS slice only includes clusterNetwork and serviceNetwork, but not machineNetwork:

var additionalNoProxyCIDRS []string
additionalNoProxyCIDRS = append(additionalNoProxyCIDRS, netutil.ClusterCIDRs(hcp.Spec.Networking.ClusterNetwork)...)
additionalNoProxyCIDRS = append(additionalNoProxyCIDRS, netutil.ServiceCIDRs(hcp.Spec.Networking.ServiceNetwork)...)
// machineNetwork was missing here
proxy.SetEnvVars(&c.Env, additionalNoProxyCIDRS...)

Fix

Add machineNetwork CIDRs to the KAS NO_PROXY list:

additionalNoProxyCIDRS = append(additionalNoProxyCIDRS, netutil.MachineCIDRs(hcp.Spec.Networking.MachineNetwork)...)

How to Reproduce

  1. Deploy a management cluster behind an HTTP proxy
  2. Create a HostedCluster where machineNetwork is NOT included in the management cluster's NO_PROXY
  3. Run oc logs <any-pod> from the hosted cluster
  4. Observe: x509: certificate signed by unknown authority error

Customer Validation

A customer confirmed this fix resolves the issue — after manually adding the guest cluster's machineNetwork (along with clusterNetwork and serviceNetwork) to the management cluster's NO_PROXY, the oc logs x509 error was immediately resolved. This PR automates that by ensuring the control-plane-operator always includes machineNetwork in KAS's NO_PROXY.

Test Plan

  • Existing unit tests pass (go test ./control-plane-operator/controllers/hostedcontrolplane/v2/kas/...)
  • Verify on a proxied management cluster that oc logs, oc exec, oc attach work without manually adding machineNetwork to management cluster NO_PROXY

Made with Cursor

Summary by CodeRabbit

  • Bug Fixes
  • Fixed potential TLS failures in proxy configurations by ensuring machine network traffic is properly exempted from proxy rules.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@openshift-ci-robot
Copy link
Copy Markdown

@dpateriya: This pull request references Jira Issue OCPBUGS-86693, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Summary

When the management cluster uses an HTTP proxy, the KAS container in the hosted control plane inherits the management cluster's proxy environment via proxy.SetEnvVars() in the control-plane-operator. The NO_PROXY list for KAS already includes the guest cluster's clusterNetwork (pod) and serviceNetwork CIDRs, but omits the machineNetwork CIDRs.

KAS connects to kubelets on node IPs (which fall in the machineNetwork range) via the konnectivity tunnel for subresource operations (oc logs, oc exec, oc attach). Without the machineNetwork CIDRs in NO_PROXY, these HTTPS connections are routed through the management cluster's HTTP proxy. The proxy performs TLS interception, presenting its own certificate which KAS does not trust, resulting in:

x509: certificate signed by unknown authority

Root Cause

In control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment.go, the additionalNoProxyCIDRS slice only includes clusterNetwork and serviceNetwork, but not machineNetwork:

var additionalNoProxyCIDRS []string
additionalNoProxyCIDRS = append(additionalNoProxyCIDRS, netutil.ClusterCIDRs(hcp.Spec.Networking.ClusterNetwork)...)
additionalNoProxyCIDRS = append(additionalNoProxyCIDRS, netutil.ServiceCIDRs(hcp.Spec.Networking.ServiceNetwork)...)
// machineNetwork was missing here
proxy.SetEnvVars(&c.Env, additionalNoProxyCIDRS...)

Fix

Add machineNetwork CIDRs to the KAS NO_PROXY list:

additionalNoProxyCIDRS = append(additionalNoProxyCIDRS, netutil.MachineCIDRs(hcp.Spec.Networking.MachineNetwork)...)

How to Reproduce

  1. Deploy a management cluster behind an HTTP proxy
  2. Create a HostedCluster where machineNetwork is NOT included in the management cluster's NO_PROXY
  3. Run oc logs <any-pod> from the hosted cluster
  4. Observe: x509: certificate signed by unknown authority error

Customer Validation

A customer confirmed this fix resolves the issue — after manually adding the guest cluster's machineNetwork (along with clusterNetwork and serviceNetwork) to the management cluster's NO_PROXY, the oc logs x509 error was immediately resolved. This PR automates that by ensuring the control-plane-operator always includes machineNetwork in KAS's NO_PROXY.

Test Plan

  • Existing unit tests pass (go test ./control-plane-operator/controllers/hostedcontrolplane/v2/kas/...)
  • Verify on a proxied management cluster that oc logs, oc exec, oc attach work without manually adding machineNetwork to management cluster NO_PROXY

Made with Cursor

Summary by CodeRabbit

  • Bug Fixes
  • Improved proxy exemption handling so machine network ranges are now excluded from proxy interception alongside pod and service networks. This prevents TLS handshake failures caused by egress/konnectivity proxies touching machine-network traffic, reducing connectivity errors and improving reliability for control-plane communications.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@dpateriya
Copy link
Copy Markdown
Contributor Author

/jira refresh

@openshift-ci-robot
Copy link
Copy Markdown

@dpateriya: This pull request references Jira Issue OCPBUGS-86693, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@dpateriya dpateriya force-pushed the fix/kas-noproxy-machine-network branch from ef9280f to 5aa05e6 Compare May 28, 2026 17:21
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment_test.go (1)

15-15: ⚡ Quick win

Use Gherkin-style naming for the new test cases.

Please rename these tests to follow the required “When ... it should ...” pattern (or wrap assertions in t.Run cases with that phrasing), so they match repo test conventions.

💡 Example update
-func TestUpdateMainContainerNoProxy(t *testing.T) {
+func TestUpdateMainContainer_WhenProxyConfiguredWithMachineNetwork_ItShouldIncludeClusterServiceMachineAndKubeAPIServerInNoProxy(t *testing.T) {

...

-func TestUpdateMainContainerNoProxyWithoutMachineNetwork(t *testing.T) {
+func TestUpdateMainContainer_WhenProxyConfiguredWithoutMachineNetwork_ItShouldIncludeClusterAndServiceInNoProxy(t *testing.T) {

As per coding guidelines, "**/*_test.go: Always use 'When ... it should ...' format for describing test cases when creating unit tests" and "Prefer Gherkin Syntax to define unit test cases (e.g., 'When... it should...')".

Also applies to: 58-58

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment_test.go`
at line 15, Rename the top-level test functions to use Gherkin-style
descriptions (e.g., "When ... it should ...") or wrap existing assertions in
t.Run subtests using that phrasing; specifically change
TestUpdateMainContainerNoProxy and the other test in the file (the one around
the Test* at the other noted location) to names like
TestWhenUpdatingMainContainerWithoutProxyItShould... or use t.Run("When updating
main container without proxy it should ...", func(t *testing.T) { ... }) inside
the existing Test* wrappers so the function/ subtest labels follow the "When ...
it should ..." convention and match repo test naming guidelines while leaving
the test logic unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment_test.go`:
- Line 15: Rename the top-level test functions to use Gherkin-style descriptions
(e.g., "When ... it should ...") or wrap existing assertions in t.Run subtests
using that phrasing; specifically change TestUpdateMainContainerNoProxy and the
other test in the file (the one around the Test* at the other noted location) to
names like TestWhenUpdatingMainContainerWithoutProxyItShould... or use
t.Run("When updating main container without proxy it should ...", func(t
*testing.T) { ... }) inside the existing Test* wrappers so the function/ subtest
labels follow the "When ... it should ..." convention and match repo test naming
guidelines while leaving the test logic unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 20d74fd2-993f-4c15-8d84-fb4f84cccd34

📥 Commits

Reviewing files that changed from the base of the PR and between ef9280f and 5aa05e6.

📒 Files selected for processing (2)
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment.go

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Add a dual-stack test case. MachineNetwork supports up to 2 entries. A test with both IPv4 and IPv6 CIDRs (e.g., 192.168.1.0/24 + fd00::/48) would validate the iteration works correctly for dual-stack clusters.

…nt x509 errors

When the management cluster uses an HTTP proxy, the kube-apiserver (KAS)
container in the hosted control plane inherits the management cluster proxy
environment. The NO_PROXY list for KAS already includes the guest cluster's
pod network (clusterNetwork) and service network (serviceNetwork) CIDRs,
but omits the machine network (machineNetwork) CIDRs.

Because KAS connects to kubelets on node IPs (in the machineNetwork range)
via the konnectivity tunnel for operations like oc logs, oc exec, and
oc attach, these HTTPS connections are routed through the management
cluster's HTTP proxy. The proxy performs TLS interception, presenting its
own certificate which is not trusted by KAS, resulting in:

  x509: certificate signed by unknown authority

Adding the machineNetwork CIDRs to the NO_PROXY list ensures these
connections bypass the proxy and go directly through konnectivity.

Co-authored-by: Cursor <cursoragent@cursor.com>
@dpateriya dpateriya force-pushed the fix/kas-noproxy-machine-network branch from 5aa05e6 to de5fab1 Compare May 31, 2026 14:37
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment_test.go (1)

15-91: ⚡ Quick win

Consider adding a test case for when proxy environment variables are not set.

The current test cases all set HTTP_PROXY and HTTPS_PROXY. Adding a test case where these are not set would validate that updateMainContainer doesn't add NO_PROXY when proxy is not configured, ensuring no regression in non-proxy environments.

✨ Suggested test case
+		{
+			name:           "When proxy is not configured it should not set NO_PROXY",
+			clusterNetwork: []hyperv1.ClusterNetworkEntry{{CIDR: *ipnet.MustParseCIDR("10.128.0.0/14")}},
+			serviceNetwork: []hyperv1.ServiceNetworkEntry{{CIDR: *ipnet.MustParseCIDR("172.30.0.0/16")}},
+			machineNetwork: []hyperv1.MachineNetworkEntry{{CIDR: *ipnet.MustParseCIDR("192.168.1.0/24")}},
+			expectedCIDRs:  []string{},
+		},

And update the test logic to handle this case:

 	for _, tc := range testCases {
 		t.Run(tc.name, func(t *testing.T) {
-			t.Setenv("HTTP_PROXY", "http://proxy.example.com:3128")
-			t.Setenv("HTTPS_PROXY", "http://proxy.example.com:3128")
-			t.Setenv("NO_PROXY", "")
+			if len(tc.expectedCIDRs) > 0 {
+				t.Setenv("HTTP_PROXY", "http://proxy.example.com:3128")
+				t.Setenv("HTTPS_PROXY", "http://proxy.example.com:3128")
+				t.Setenv("NO_PROXY", "")
+			}
 
 			// ... rest of test
 
 			g := NewWithT(t)
 			var noProxyValue string
 			for _, env := range podSpec.Containers[0].Env {
 				if env.Name == "NO_PROXY" {
 					noProxyValue = env.Value
 					break
 				}
 			}
-			g.Expect(noProxyValue).ToNot(BeEmpty(), "NO_PROXY env var should be set when proxy is configured")
+			if len(tc.expectedCIDRs) > 0 {
+				g.Expect(noProxyValue).ToNot(BeEmpty(), "NO_PROXY env var should be set when proxy is configured")
+			} else {
+				g.Expect(noProxyValue).To(BeEmpty(), "NO_PROXY env var should not be set when proxy is not configured")
+			}
 			for _, cidr := range tc.expectedCIDRs {
 				g.Expect(noProxyValue).To(ContainSubstring(cidr), "NO_PROXY should contain %s", cidr)
 			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment_test.go`
around lines 15 - 91, Add a new sub-test inside TestUpdateMainContainerNoProxy
that leaves HTTP_PROXY and HTTPS_PROXY unset (do not call t.Setenv for them) and
sets NO_PROXY to "" or leaves it unset, construct the same hcp/podSpec used by
the other cases, call updateMainContainer(podSpec, hcp) and assert that the
NO_PROXY env var is not added to podSpec.Containers[0].Env (i.e., iterate the
Env for name "NO_PROXY" and Expect it to be empty/not found). This ensures
updateMainContainer (and the ComponentName container) does not inject NO_PROXY
when no proxy is configured.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment_test.go`:
- Around line 15-91: Add a new sub-test inside TestUpdateMainContainerNoProxy
that leaves HTTP_PROXY and HTTPS_PROXY unset (do not call t.Setenv for them) and
sets NO_PROXY to "" or leaves it unset, construct the same hcp/podSpec used by
the other cases, call updateMainContainer(podSpec, hcp) and assert that the
NO_PROXY env var is not added to podSpec.Containers[0].Env (i.e., iterate the
Env for name "NO_PROXY" and Expect it to be empty/not found). This ensures
updateMainContainer (and the ComponentName container) does not inject NO_PROXY
when no proxy is configured.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 15019770-7dfb-4b60-aa64-be4fa1080794

📥 Commits

Reviewing files that changed from the base of the PR and between 5aa05e6 and de5fab1.

📒 Files selected for processing (2)
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment.go

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 31, 2026

@dpateriya: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/images de5fab1 link true /test images
ci/prow/okd-scos-images de5fab1 link true /test okd-scos-images

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@dpateriya
Copy link
Copy Markdown
Contributor Author

Hi @bryan-cox,

I guess the hcp.Spec.Networking.MachineNetwork never gets populated, even if customer has a HCP KubeVirt cluster with secondary network or HCP agent cluster or any other cluster, I guess this hcp.Spec.Networking.MachineNetwork gets never populated. So no point in fixing this because even if this is fixed the CPO will read it from the hcp.Spec.Networking.MachineNetwork which will be nil/empty.

The only remaining option for customer is to have management cluster no_proxy updated with the secondary machine network.

Anything we can do, like some sort of an RFE to get this hcp.Spec.Networking.MachineNetwork populated with correct networks?

@dpateriya
Copy link
Copy Markdown
Contributor Author

dpateriya commented May 31, 2026

As I deployed HCP AWS and I see the hosted cluster got machine network populated here:

  networking:
    allocateNodeCIDRs: Disabled
    clusterNetwork:
    - cidr: 10.132.0.0/14
    machineNetwork:
    - cidr: 10.0.0.0/16
    networkType: OVNKubernetes
    serviceNetwork:
    - cidr: 172.31.0.0/16

So this PR can be valid for cloud deployment of HCP @bryan-cox

@hypershift-jira-solve-ci
Copy link
Copy Markdown

Test Failure Analysis Complete

Job Information

  • Prow Job 1: pull-ci-openshift-hypershift-main-okd-scos-images
  • Build ID: 2061094627028504576
  • Target: [images] (variant: okd-scos)
  • Cluster: build01
  • Duration: 6m43s (14:37:19Z – 14:44:02Z)
  • Prow Job 2: pull-ci-openshift-hypershift-main-images
  • Build ID: 2061094626990755840
  • Target: [images], [release:latest], hypershift-cli, hypershift-operator, hypershift-tests
  • Cluster: build01
  • Duration: 11m50s (14:37:19Z – 14:49:09Z)

Test Failure Analysis

Error

Job 1 (okd-scos-images):
  step [release-inputs:latest] failed: could not resolve source imagestream origin/scos-4.21
  for release latest: imagestreams.image.openshift.io "scos-4.21" not found

Job 2 (images):
  step [release-inputs:latest] failed: could not resolve source imagestream ocp/5.0
  for release latest: imagestreams.image.openshift.io "5.0" not found

Summary

Both failures are caused by a CI infrastructure issue — the imagestreams origin/scos-4.21 and ocp/5.0 were deleted or became unavailable on the build01 CI cluster on May 31, 2026. These imagestreams are used by ci-operator to assemble the "release latest" image for testing. All actual code compilation and image build steps succeeded (src, hypershift, hypershift-operator, hypershift-tests, hypershift-cli). The failures are completely unrelated to PR #8628's code changes (which add machineNetwork CIDRs to KAS NO_PROXY) and are confirmed to affect other unrelated PRs (e.g., PR #8645) with identical errors on the same day.

Root Cause

The root cause is a CI infrastructure issue on the build01 OpenShift CI cluster where two imagestreams required by ci-operator to build release images have been deleted or are temporarily unavailable:

  1. origin/scos-4.21 — Used by the okd-scos variant of the hypershift CI config to build the OKD/SCOS release stream. The ci-operator config at https://config.ci.openshift.org references origin/scos-4.21 as the integrated stream for the okd-scos variant, but the imagestream scos-4.21 no longer exists in the origin namespace on build01.

  2. ocp/5.0 — Used by the main (non-variant) CI config to build the OCP 5.0 release stream. The ci-operator config references ocp/5.0 as the integrated stream, but the imagestream 5.0 no longer exists in the ocp namespace on build01.

Both imagestreams were available as recently as May 30 (PR #8643's okd-scos-images job succeeded at 2026-05-30T00:43:53Z; PR #8640's images job succeeded at 2026-05-30T00:39:41Z). The deletion or unavailability occurred sometime between May 30 and May 31, likely due to a CI infrastructure maintenance operation, an imagestream garbage collection, or a release controller configuration change.

The [release-inputs:latest] step fails immediately (~8ms execution time) when the imagestream lookup returns a 404, causing the entire job to be reported as failed even though all compilation and image build steps completed successfully.

Recommendations
  1. No action required on PR fix: OCPBUGS-86693: add machineNetwork CIDRs to KAS NO_PROXY to prevent x509 errors #8628 — These failures are not caused by the PR's code changes. The PR modifies KAS NO_PROXY configuration and has no relation to CI imagestream resolution.

  2. Retry the jobs — Once the CI infrastructure issue is resolved, simply re-trigger both jobs with /retest or /test images and /test okd-scos-images.

  3. Report to TRT / CI infrastructure team — File a bug or ping #forum-ocp-testplatform on Slack reporting that the origin/scos-4.21 and ocp/5.0 imagestreams are missing from the build01 cluster. This is blocking all PRs in openshift/hypershift (and likely other repos that depend on these streams).

  4. Check release controller status — Verify that the release controller is still maintaining the scos-4.21 and 5.0 imagestreams. The imagestreams may have been intentionally removed as part of a version lifecycle change, in which case the ci-operator configs in openshift/release need updating.

Evidence
Evidence Detail
Failed step (both jobs) [release-inputs:latest] — imagestream resolution for release assembly
Error (okd-scos-images) imagestreams.image.openshift.io "scos-4.21" not found in namespace origin
Error (images) imagestreams.image.openshift.io "5.0" not found in namespace ocp
Build cluster build01 (both jobs)
All image builds ✅ Succeeded — src, hypershift, hypershift-operator, hypershift-tests, hypershift-cli
PR #8643 okd-scos-images (May 30) ✅ Succeeded — origin/scos-4.21 was available
PR #8640 images (May 30) ✅ Succeeded — ocp/5.0 was available
PR #8645 okd-scos-images (May 31) ❌ Same failure — origin/scos-4.21 not found
PR #8645 images (May 31) ❌ Same failure — ocp/5.0 not found
Failure reason (ci-operator) executing_graph:step_failed:creating_release_images
Step execution time <8ms (immediate failure on imagestream lookup)
PR #8628 code changes Adds machineNetwork CIDRs to KAS NO_PROXY — completely unrelated to CI config

@bryan-cox
Copy link
Copy Markdown
Member

Anything we can do, like some sort of an RFE to get this hcp.Spec.Networking.MachineNetwork populated with correct networks?

@dpateriya - yeah you're welcome to open an RFE for this. Since this PR doesn't fix the kubevirt thing and it needs an RFE, I think we should close the bug ticket out and this PR.

@dpateriya
Copy link
Copy Markdown
Contributor Author

Thanks @bryan-cox.

Closing this in favor of RFE.

@dpateriya dpateriya closed this Jun 2, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@dpateriya: This pull request references Jira Issue OCPBUGS-86693. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.

Details

In response to this:

Summary

When the management cluster uses an HTTP proxy, the KAS container in the hosted control plane inherits the management cluster's proxy environment via proxy.SetEnvVars() in the control-plane-operator. The NO_PROXY list for KAS already includes the guest cluster's clusterNetwork (pod) and serviceNetwork CIDRs, but omits the machineNetwork CIDRs.

KAS connects to kubelets on node IPs (which fall in the machineNetwork range) via the konnectivity tunnel for subresource operations (oc logs, oc exec, oc attach). Without the machineNetwork CIDRs in NO_PROXY, these HTTPS connections are routed through the management cluster's HTTP proxy. The proxy performs TLS interception, presenting its own certificate which KAS does not trust, resulting in:

x509: certificate signed by unknown authority

Root Cause

In control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment.go, the additionalNoProxyCIDRS slice only includes clusterNetwork and serviceNetwork, but not machineNetwork:

var additionalNoProxyCIDRS []string
additionalNoProxyCIDRS = append(additionalNoProxyCIDRS, netutil.ClusterCIDRs(hcp.Spec.Networking.ClusterNetwork)...)
additionalNoProxyCIDRS = append(additionalNoProxyCIDRS, netutil.ServiceCIDRs(hcp.Spec.Networking.ServiceNetwork)...)
// machineNetwork was missing here
proxy.SetEnvVars(&c.Env, additionalNoProxyCIDRS...)

Fix

Add machineNetwork CIDRs to the KAS NO_PROXY list:

additionalNoProxyCIDRS = append(additionalNoProxyCIDRS, netutil.MachineCIDRs(hcp.Spec.Networking.MachineNetwork)...)

How to Reproduce

  1. Deploy a management cluster behind an HTTP proxy
  2. Create a HostedCluster where machineNetwork is NOT included in the management cluster's NO_PROXY
  3. Run oc logs <any-pod> from the hosted cluster
  4. Observe: x509: certificate signed by unknown authority error

Customer Validation

A customer confirmed this fix resolves the issue — after manually adding the guest cluster's machineNetwork (along with clusterNetwork and serviceNetwork) to the management cluster's NO_PROXY, the oc logs x509 error was immediately resolved. This PR automates that by ensuring the control-plane-operator always includes machineNetwork in KAS's NO_PROXY.

Test Plan

  • Existing unit tests pass (go test ./control-plane-operator/controllers/hostedcontrolplane/v2/kas/...)
  • Verify on a proxied management cluster that oc logs, oc exec, oc attach work without manually adding machineNetwork to management cluster NO_PROXY

Made with Cursor

Summary by CodeRabbit

  • Bug Fixes
  • Proxy exemption handling updated so machine network CIDRs are excluded from proxy interception alongside pod and service networks, preventing TLS handshake failures caused by proxy interception and improving control-plane connectivity reliability.
  • Tests
  • Added tests validating NO_PROXY is populated correctly, including cases with and without machine network CIDRs and dual-stack networks.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants