Skip to content

OCPBUGS-86661: Konnectivity retry proxy connection on timeout#8579

Open
YamunadeviShanmugam wants to merge 1 commit into
openshift:mainfrom
YamunadeviShanmugam:fix-add-startup-probe-konnectivity-sock5-sidecar
Open

OCPBUGS-86661: Konnectivity retry proxy connection on timeout#8579
YamunadeviShanmugam wants to merge 1 commit into
openshift:mainfrom
YamunadeviShanmugam:fix-add-startup-probe-konnectivity-sock5-sidecar

Conversation

@YamunadeviShanmugam
Copy link
Copy Markdown
Contributor

@YamunadeviShanmugam YamunadeviShanmugam commented May 22, 2026

This PR fixes the intermittent E2E test failures where the konnectivity-proxy-socks5 sidecar container crashes and restarts during cluster bootstrap. Fix includes retry loop, native Kubernetes startup probes, and explicit component dependency ordering.
Proxy Internal Retry Logic

  • Added a new retry infrastructure with exponential backoff to handle transient connection errors safely. The container process now stays alive while waiting for the network tunnel to become available.

Native Startup Probes
konnectivity-container.go: Added a native TCP startup probe to all Konnectivity sidecar containers. This gives the sidecars a safe 60-second window to initialize during cluster bootstrap before any readiness or liveness checks trigger.

Component Dependency Ordering
Deployment graph is updated with dependency check. The openshift-oauth-apiserver pods will now explicitly wait for the konnectivity-agent to be completely Available and RolloutComplete before they begin scheduling.

Test Fixtures are updated accordingly

Summary by CodeRabbit

  • New Features

    • Konnectivity SOCKS5 proxy guard with automatic retries/backoff, graceful shutdown, and guarded startup.
    • Default serving port constant and TCP startup probe for konnectivity proxy; mode-specific serving-port selection.
    • Multiple hosted control-plane components now declare the Konnectivity agent as an explicit dependency.
  • Bug Fixes

    • Startup errors now propagate and exit cleanly instead of panicking.
  • Tests

    • Extensive unit tests for guard/bootstrap, transient-error detection, dialing, CLI flags, container build/probes, and graceful shutdown.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 22, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 22, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds a guarded Konnectivity-backed SOCKS5 proxy (bootstrap, TCP verify, transient retry/backoff, graceful shutdown), moves proxy startup to a signal-aware guard in main, adds a TCP StartupProbe and DefaultServingPort (8090) for the konnectivity proxy container, and updates multiple hosted-control-plane components to declare the konnectivity agent as a dependency. Comprehensive unit tests for the guard, dial checks, main command, and container probe behavior are included.

Sequence Diagram(s)

sequenceDiagram
  participant SignalHandler
  participant KonnectivityGuard
  participant KubernetesAPI
  participant KonnectivityDialer
  participant SOCKS5Server

  SignalHandler->>KonnectivityGuard: start(ctx)
  KonnectivityGuard->>KubernetesAPI: create controller-runtime client
  KonnectivityGuard->>KonnectivityDialer: tryBootstrapKonnectivity()
  KonnectivityDialer->>KubernetesAPI: use injected client
  KonnectivityDialer->>KonnectivityServer: TCP dial host:port (verify)
  KonnectivityGuard->>SOCKS5Server: serveWithGracefulShutdown(server, KonnectivityDialer)
  SOCKS5Server-->>KonnectivityGuard: exit error
  KonnectivityGuard->>KonnectivityGuard: isTransientKonnectivityError? -> backoff & retry
Loading
🚥 Pre-merge checks | ✅ 7 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Test Structure And Quality ❓ Inconclusive Tests use standard Go testing, not Ginkgo; check explicitly requires "Ginkgo test code" with Ginkgo patterns (It blocks, BeforeEach/AfterEach) absent here. Clarify: does check apply to standard Go tests? If yes, assertion messages need improvement (some assertions lack meaningful messages for failure diagnosis).
✅ Passed checks (7 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed The PR adds no Ginkgo tests—all new test files (guard_test.go, main_test.go, konnectivity-container_test.go) use standard Go testing with t.Run, not Ginkgo It()/Describe(). Check does not apply.
Topology-Aware Scheduling Compatibility ✅ Passed No topology-breaking constraints: startup probes are simple TCP checks, no affinity/topology spread rules, no node selectors, no replica calculations based on node count.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds only unit tests (guard_test.go, main_test.go, konnectivity-container_test.go), not Ginkgo e2e tests. Custom check applies only to new Ginkgo e2e tests; check is not applicable here.
Title check ✅ Passed The title clearly describes the main change: adding retry logic for konnectivity proxy connections on timeout, which is the core feature introduced across guard.go, main.go, and related component dependencies.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels May 22, 2026
@YamunadeviShanmugam YamunadeviShanmugam force-pushed the fix-add-startup-probe-konnectivity-sock5-sidecar branch from e81da62 to c1f6355 Compare May 22, 2026 17:40
@YamunadeviShanmugam
Copy link
Copy Markdown
Contributor Author

/test all

@codecov
Copy link
Copy Markdown

codecov Bot commented May 22, 2026

Codecov Report

❌ Patch coverage is 75.25773% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.85%. Comparing base (f13c62d) to head (fa71f72).

Files with missing lines Patch % Lines
konnectivity-socks5-proxy/guard.go 77.01% 32 Missing and 5 partials ⚠️
konnectivity-socks5-proxy/main.go 0.00% 11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8579      +/-   ##
==========================================
+ Coverage   41.43%   41.85%   +0.42%     
==========================================
  Files         756      757       +1     
  Lines       93647    93822     +175     
==========================================
+ Hits        38802    39272     +470     
+ Misses      52124    51826     -298     
- Partials     2721     2724       +3     
Files with missing lines Coverage Δ
...hostedcontrolplane/v2/oauth_apiserver/component.go 100.00% <100.00%> (+72.41%) ⬆️
...t/controlplane-component/konnectivity-container.go 71.00% <100.00%> (+36.92%) ⬆️
konnectivity-socks5-proxy/main.go 66.00% <0.00%> (+66.00%) ⬆️
konnectivity-socks5-proxy/guard.go 77.01% <77.01%> (ø)

... and 8 files with indirect coverage changes

Flag Coverage Δ
cmd-support 35.11% <100.00%> (+0.23%) ⬆️
cpo-hostedcontrolplane 44.87% <100.00%> (+1.36%) ⬆️
cpo-other 42.74% <ø> (ø)
hypershift-operator 51.57% <ø> (ø)
other 32.80% <72.09%> (+1.16%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (3)
konnectivity-socks5-proxy/guard_test.go (2)

28-58: ⚡ Quick win

Consider refactoring to a table-driven test.

The test cases all follow the same structure (create error, call isTransientKonnectivityError, assert result). Table-driven tests improve maintainability and make it easier to add new cases. As per coding guidelines, prefer table-driven tests where possible.

♻️ Proposed table-driven refactor
 func TestIsTransientKonnectivityError(t *testing.T) {
-	t.Run("When error is nil, it should not be transient", func(t *testing.T) {
-		g := NewGomegaWithT(t)
-		g.Expect(isTransientKonnectivityError(nil)).To(BeFalse())
-	})
-
-	t.Run("When error is context deadline exceeded, it should be transient", func(t *testing.T) {
-		g := NewGomegaWithT(t)
-		g.Expect(isTransientKonnectivityError(context.DeadlineExceeded)).To(BeTrue())
-	})
-
-	t.Run("When error is connection refused, it should be transient", func(t *testing.T) {
-		g := NewGomegaWithT(t)
-		g.Expect(isTransientKonnectivityError(fmt.Errorf("dial tcp: connection refused"))).To(BeTrue())
-	})
-
-	t.Run("When error is syscall ECONNREFUSED, it should be transient", func(t *testing.T) {
-		g := NewGomegaWithT(t)
-		g.Expect(isTransientKonnectivityError(syscall.ECONNREFUSED)).To(BeTrue())
-	})
-
-	t.Run("When error is validation failure, it should not be transient", func(t *testing.T) {
-		g := NewGomegaWithT(t)
-		g.Expect(isTransientKonnectivityError(errors.New("failed validation: KonnectivityHost is required"))).To(BeFalse())
-	})
-
-	t.Run("When error is file not found, it should not be transient", func(t *testing.T) {
-		g := NewGomegaWithT(t)
-		g.Expect(isTransientKonnectivityError(os.ErrNotExist)).To(BeFalse())
-	})
+	testCases := []struct {
+		name       string
+		err        error
+		isTransient bool
+	}{
+		{
+			name:       "When error is nil, it should not be transient",
+			err:        nil,
+			isTransient: false,
+		},
+		{
+			name:       "When error is context deadline exceeded, it should be transient",
+			err:        context.DeadlineExceeded,
+			isTransient: true,
+		},
+		{
+			name:       "When error is connection refused, it should be transient",
+			err:        fmt.Errorf("dial tcp: connection refused"),
+			isTransient: true,
+		},
+		{
+			name:       "When error is syscall ECONNREFUSED, it should be transient",
+			err:        syscall.ECONNREFUSED,
+			isTransient: true,
+		},
+		{
+			name:       "When error is validation failure, it should not be transient",
+			err:        errors.New("failed validation: KonnectivityHost is required"),
+			isTransient: false,
+		},
+		{
+			name:       "When error is file not found, it should not be transient",
+			err:        os.ErrNotExist,
+			isTransient: false,
+		},
+	}
+
+	for _, tc := range testCases {
+		t.Run(tc.name, func(t *testing.T) {
+			g := NewGomegaWithT(t)
+			if tc.isTransient {
+				g.Expect(isTransientKonnectivityError(tc.err)).To(BeTrue())
+			} else {
+				g.Expect(isTransientKonnectivityError(tc.err)).To(BeFalse())
+			}
+		})
+	}
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@konnectivity-socks5-proxy/guard_test.go` around lines 28 - 58, Refactor
TestIsTransientKonnectivityError into a table-driven test: create a slice of
cases with fields name, err and want (bool) containing the current scenarios
(nil, context.DeadlineExceeded, fmt.Errorf("dial tcp: connection refused"),
syscall.ECONNREFUSED, errors.New("failed validation..."), os.ErrNotExist), then
loop over cases calling t.Run(case.name, func(t *testing.T){ g :=
NewGomegaWithT(t);
g.Expect(isTransientKonnectivityError(case.err)).To(Equal(case.want)) }); keep
the test function name TestIsTransientKonnectivityError and reuse
isTransientKonnectivityError and NewGomegaWithT to preserve existing assertions.

95-103: ⚡ Quick win

Avoid mutating package-level state in tests.

Modifying the package-level dialKonnectivityServer variable violates the guideline to avoid global state in tests. While the defer restores the original value, this pattern is unsafe for parallel test execution and can cause race conditions. As per coding guidelines, tests should run with parallel execution enabled.

Consider refactoring the code under test to accept the dialer as a parameter (dependency injection) rather than relying on a package-level variable. This would make tests safer, support parallel execution, and improve testability.

Example approach:

// In guard.go, make dialKonnectivityServer a parameter or field
func retryDialKonnectivityServer(ctx context.Context, log logr.Logger, host string, port uint32, dialFunc func(string, uint32) error) error {
    // ... use dialFunc instead of global dialKonnectivityServer
}

// In test
func TestDialKonnectivityServerRetries(t *testing.T) {
    attempts := 0
    mockDial := func(host string, port uint32) error {
        attempts++
        if attempts < 2 {
            return fmt.Errorf("dial tcp: connection refused")
        }
        return dialKonnectivityServerTCP(host, port)
    }
    
    err := retryDialKonnectivityServer(ctx, log, "127.0.0.1", port, mockDial)
    // ... assertions
}

As per coding guidelines, avoid global state in tests and run tests with parallel execution.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@konnectivity-socks5-proxy/guard_test.go` around lines 95 - 103, The test
mutates the package-level dialKonnectivityServer which is unsafe for parallel
tests; refactor retryDialKonnectivityServer (in guard.go) to accept a dial
function parameter (e.g., dialFunc func(string,uint32) error or as a field on a
struct) and update the test to call retryDialKonnectivityServer with a local
mockDial closure that increments attempts and returns the simulated error on the
first call and delegates to the real dial (e.g., dialKonnectivityServerTCP) on
success; remove any package-level reassignment in tests and use dependency
injection to supply the mock instead.
support/controlplane-component/konnectivity-container.go (1)

184-198: ⚡ Quick win

Refactor to eliminate duplication and add defensive default case.

The HTTPS and Socks5 branches contain identical logic, violating DRY. Additionally, the switch lacks a default case—if opts.Mode is neither HTTPS nor Socks5, servingPort remains zero, creating an invalid probe configuration.

♻️ Proposed refactoring
-	var servingPort int32
-	switch opts.Mode {
-	case HTTPS:
-		if opts.HTTPSOptions.ServingPort != 0 {
-			servingPort = int32(opts.HTTPSOptions.ServingPort)
-		} else {
-			servingPort = 8090
-		}
-	case Socks5:
-		if opts.Socks5Options.ServingPort != 0 {
-			servingPort = int32(opts.Socks5Options.ServingPort)
-		} else {
-			servingPort = 8090
-		}
-	}
+	servingPort := int32(8090)
+	switch opts.Mode {
+	case HTTPS:
+		if opts.HTTPSOptions.ServingPort != 0 {
+			servingPort = int32(opts.HTTPSOptions.ServingPort)
+		}
+	case Socks5:
+		if opts.Socks5Options.ServingPort != 0 {
+			servingPort = int32(opts.Socks5Options.ServingPort)
+		}
+	default:
+		// Should never occur given validation at line 81-84, but defensive.
+		panic(fmt.Sprintf("unexpected konnectivity proxy mode: %s", opts.Mode))
+	}

As per coding guidelines, follow the 100-go-mistakes rules: prefer eliminating code duplication and using defensive programming patterns.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@support/controlplane-component/konnectivity-container.go` around lines 184 -
198, The servingPort selection duplicates identical logic for HTTPS and Socks5
and lacks a default branch, so update the code that sets servingPort (variable
servingPort using opts.Mode, HTTPS, Socks5, opts.HTTPSOptions.ServingPort,
opts.Socks5Options.ServingPort) to remove duplication by first selecting the
candidate port (e.g., from opts.HTTPSOptions.ServingPort or
opts.Socks5Options.ServingPort based on opts.Mode) and then applying a single
fallback to 8090 if the candidate is zero; also add a defensive default for the
switch (or handle unknown modes) to ensure servingPort is always set to a valid
default.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@konnectivity-socks5-proxy/guard.go`:
- Around line 59-60: The retry backoff currently uses time.Sleep(delay()) which
ignores ctx cancellation; update both retry sites in guard.go (where
time.Sleep(delay()) is called) to use a cancellable wait: replace
time.Sleep(delay()) with a select that waits on time.After(delay()) and
ctx.Done(), returning or breaking when ctx is cancelled. Ensure the surrounding
function (the loop that calls delay()) respects ctx cancellation (e.g., check
ctx.Err() or return on ctx.Done()) so retries exit immediately when the provided
ctx is cancelled.
- Around line 30-35: coreGuardBackoff currently has Steps==0 so DelayFunc
returns a constant delay and exponential growth never happens; update
coreGuardBackoff to set a positive Steps (e.g., 5 or 10) so Factor and Cap take
effect, and replace every time.Sleep(delay()) call (where delay is produced by
coreGuardBackoff.DelayFunc or similar) with a ctx-aware wait: compute delay :=
delay() then use select { case <-ctx.Done(): return ctx.Err() (or break out
appropriately); case <-time.After(delay): /* continue retry */ } so retries exit
promptly on cancellation; reference coreGuardBackoff, DelayFunc and the sites
using time.Sleep(delay()) when making these edits.

In `@konnectivity-socks5-proxy/main.go`:
- Around line 55-71: signals.SetupSignalHandler() cancels ctx but the current
runWithCoreGuard call blocks on server.ListenAndServe and never observes
ctx.Done(), so SIGTERM doesn't stop the proxy and context cancellation becomes a
fatal error; change the socks5 server start to create a net.Listener (net.Listen
with fmt.Sprintf(":%d", servingPort)), run server.Serve(listener) in a
goroutine, and in the main goroutine select on ctx.Done() vs serve result: on
ctx.Done() call listener.Close() and wait for Serve to return and then return
nil (treat context cancellation as clean shutdown), and if Serve returns a
non-context error propagate it as before; update the block that constructs
socks5.Config and calls server.ListenAndServe to use this listener/Serve + ctx
cancellation handling so signals.SetupSignalHandler(), runWithCoreGuard,
server.Serve, and listener.Close() coordinate graceful shutdown.

---

Nitpick comments:
In `@konnectivity-socks5-proxy/guard_test.go`:
- Around line 28-58: Refactor TestIsTransientKonnectivityError into a
table-driven test: create a slice of cases with fields name, err and want (bool)
containing the current scenarios (nil, context.DeadlineExceeded,
fmt.Errorf("dial tcp: connection refused"), syscall.ECONNREFUSED,
errors.New("failed validation..."), os.ErrNotExist), then loop over cases
calling t.Run(case.name, func(t *testing.T){ g := NewGomegaWithT(t);
g.Expect(isTransientKonnectivityError(case.err)).To(Equal(case.want)) }); keep
the test function name TestIsTransientKonnectivityError and reuse
isTransientKonnectivityError and NewGomegaWithT to preserve existing assertions.
- Around line 95-103: The test mutates the package-level dialKonnectivityServer
which is unsafe for parallel tests; refactor retryDialKonnectivityServer (in
guard.go) to accept a dial function parameter (e.g., dialFunc
func(string,uint32) error or as a field on a struct) and update the test to call
retryDialKonnectivityServer with a local mockDial closure that increments
attempts and returns the simulated error on the first call and delegates to the
real dial (e.g., dialKonnectivityServerTCP) on success; remove any package-level
reassignment in tests and use dependency injection to supply the mock instead.

In `@support/controlplane-component/konnectivity-container.go`:
- Around line 184-198: The servingPort selection duplicates identical logic for
HTTPS and Socks5 and lacks a default branch, so update the code that sets
servingPort (variable servingPort using opts.Mode, HTTPS, Socks5,
opts.HTTPSOptions.ServingPort, opts.Socks5Options.ServingPort) to remove
duplication by first selecting the candidate port (e.g., from
opts.HTTPSOptions.ServingPort or opts.Socks5Options.ServingPort based on
opts.Mode) and then applying a single fallback to 8090 if the candidate is zero;
also add a defensive default for the switch (or handle unknown modes) to ensure
servingPort is always set to a valid default.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 56b1509a-1ce4-4bfd-9308-e453f94028b4

📥 Commits

Reviewing files that changed from the base of the PR and between d24af10 and e81da62.

⛔ Files ignored due to path filters (40)
  • control-plane-operator/controllers/hostedcontrolplane/testdata/catalog-operator/AROSwift/zz_fixture_TestControlPlaneComponents_catalog_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/catalog-operator/GCP/zz_fixture_TestControlPlaneComponents_catalog_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/catalog-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_catalog_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/catalog-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_catalog_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/catalog-operator/zz_fixture_TestControlPlaneComponents_catalog_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-network-operator/AROSwift/zz_fixture_TestControlPlaneComponents_cluster_network_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-network-operator/GCP/zz_fixture_TestControlPlaneComponents_cluster_network_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-network-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_cluster_network_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-network-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_cluster_network_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-network-operator/zz_fixture_TestControlPlaneComponents_cluster_network_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/ingress-operator/AROSwift/zz_fixture_TestControlPlaneComponents_ingress_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/ingress-operator/GCP/zz_fixture_TestControlPlaneComponents_ingress_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/ingress-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_ingress_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/ingress-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_ingress_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/ingress-operator/zz_fixture_TestControlPlaneComponents_ingress_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/oauth-openshift/AROSwift/zz_fixture_TestControlPlaneComponents_oauth_openshift_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/oauth-openshift/GCP/zz_fixture_TestControlPlaneComponents_oauth_openshift_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/oauth-openshift/IBMCloud/zz_fixture_TestControlPlaneComponents_oauth_openshift_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/oauth-openshift/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_oauth_openshift_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/oauth-openshift/zz_fixture_TestControlPlaneComponents_oauth_openshift_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/olm-operator/AROSwift/zz_fixture_TestControlPlaneComponents_olm_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/olm-operator/GCP/zz_fixture_TestControlPlaneComponents_olm_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/olm-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_olm_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/olm-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_olm_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/olm-operator/zz_fixture_TestControlPlaneComponents_olm_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-apiserver/AROSwift/zz_fixture_TestControlPlaneComponents_openshift_apiserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-apiserver/GCP/zz_fixture_TestControlPlaneComponents_openshift_apiserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-apiserver/IBMCloud/zz_fixture_TestControlPlaneComponents_openshift_apiserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-apiserver/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_openshift_apiserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-apiserver/zz_fixture_TestControlPlaneComponents_openshift_apiserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-oauth-apiserver/AROSwift/zz_fixture_TestControlPlaneComponents_openshift_oauth_apiserver_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-oauth-apiserver/GCP/zz_fixture_TestControlPlaneComponents_openshift_oauth_apiserver_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-oauth-apiserver/IBMCloud/zz_fixture_TestControlPlaneComponents_openshift_oauth_apiserver_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-oauth-apiserver/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_openshift_oauth_apiserver_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-oauth-apiserver/zz_fixture_TestControlPlaneComponents_openshift_oauth_apiserver_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/packageserver/AROSwift/zz_fixture_TestControlPlaneComponents_packageserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/packageserver/GCP/zz_fixture_TestControlPlaneComponents_packageserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/packageserver/IBMCloud/zz_fixture_TestControlPlaneComponents_packageserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/packageserver/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_packageserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/packageserver/zz_fixture_TestControlPlaneComponents_packageserver_deployment.yaml is excluded by !**/testdata/**
📒 Files selected for processing (5)
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth_apiserver/component.go
  • konnectivity-socks5-proxy/guard.go
  • konnectivity-socks5-proxy/guard_test.go
  • konnectivity-socks5-proxy/main.go
  • support/controlplane-component/konnectivity-container.go

Comment thread konnectivity-socks5-proxy/guard.go Outdated
Comment thread konnectivity-socks5-proxy/guard.go Outdated
Comment thread konnectivity-socks5-proxy/main.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
konnectivity-socks5-proxy/guard_test.go (1)

79-110: 🏗️ Heavy lift

Test the production retry path instead of a local clone.

retryDialKonnectivityServer reimplements its own loop, so this test can stay green while bootstrapKonnectivity or runWithCoreGuard regress. Prefer injecting the dial/bootstrap dependency into the real code path and asserting retries there.

As per coding guidelines, "Always include unit tests when creating new functions or modifying existing ones" and "Unit test any code changes and additions."

Also applies to: 113-131

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@konnectivity-socks5-proxy/guard_test.go` around lines 79 - 110, The test
currently calls retryDialKonnectivityServer directly and stubs
dialKonnectivityServer, which doesn't exercise the real production path;
refactor the code to inject the dialing/bootstrap dependency (e.g., accept a
DialFunc or a KonnectivityBootstrap interface) into
bootstrapKonnectivity/runWithCoreGuard and update the test to call the real
entrypoint (bootstrapKonnectivity or runWithCoreGuard) with a test dial
implementation that fails once then succeeds, assert the entrypoint eventually
succeeds and that the injected dial was invoked multiple times; keep references
to dialKonnectivityServer, retryDialKonnectivityServer, bootstrapKonnectivity,
and runWithCoreGuard to locate and change signatures, and ensure tests restore
any global state and use context timeouts as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@konnectivity-socks5-proxy/guard.go`:
- Around line 36-37: Thread a context through bootstrapKonnectivity →
tryBootstrapKonnectivity → dialKonnectivityServerTCP so the TCP probe is
cancellable: add a ctx parameter to bootstrapKonnectivity and propagate it into
tryBootstrapKonnectivity and dialKonnectivityServerTCP, then replace the
net.DialTimeout(...) call inside dialKonnectivityServerTCP with
(&net.Dialer{Timeout: konnectivityDialTimeout}).DialContext(ctx, network, addr)
so the in-progress probe is aborted when ctx is cancelled; update all callers to
pass the ctx accordingly.

---

Nitpick comments:
In `@konnectivity-socks5-proxy/guard_test.go`:
- Around line 79-110: The test currently calls retryDialKonnectivityServer
directly and stubs dialKonnectivityServer, which doesn't exercise the real
production path; refactor the code to inject the dialing/bootstrap dependency
(e.g., accept a DialFunc or a KonnectivityBootstrap interface) into
bootstrapKonnectivity/runWithCoreGuard and update the test to call the real
entrypoint (bootstrapKonnectivity or runWithCoreGuard) with a test dial
implementation that fails once then succeeds, assert the entrypoint eventually
succeeds and that the injected dial was invoked multiple times; keep references
to dialKonnectivityServer, retryDialKonnectivityServer, bootstrapKonnectivity,
and runWithCoreGuard to locate and change signatures, and ensure tests restore
any global state and use context timeouts as before.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a56c9242-6df5-4a69-85b4-450d2a131e1a

📥 Commits

Reviewing files that changed from the base of the PR and between e81da62 and c1f6355.

⛔ Files ignored due to path filters (40)
  • control-plane-operator/controllers/hostedcontrolplane/testdata/catalog-operator/AROSwift/zz_fixture_TestControlPlaneComponents_catalog_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/catalog-operator/GCP/zz_fixture_TestControlPlaneComponents_catalog_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/catalog-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_catalog_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/catalog-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_catalog_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/catalog-operator/zz_fixture_TestControlPlaneComponents_catalog_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-network-operator/AROSwift/zz_fixture_TestControlPlaneComponents_cluster_network_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-network-operator/GCP/zz_fixture_TestControlPlaneComponents_cluster_network_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-network-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_cluster_network_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-network-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_cluster_network_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-network-operator/zz_fixture_TestControlPlaneComponents_cluster_network_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/ingress-operator/AROSwift/zz_fixture_TestControlPlaneComponents_ingress_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/ingress-operator/GCP/zz_fixture_TestControlPlaneComponents_ingress_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/ingress-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_ingress_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/ingress-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_ingress_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/ingress-operator/zz_fixture_TestControlPlaneComponents_ingress_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/oauth-openshift/AROSwift/zz_fixture_TestControlPlaneComponents_oauth_openshift_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/oauth-openshift/GCP/zz_fixture_TestControlPlaneComponents_oauth_openshift_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/oauth-openshift/IBMCloud/zz_fixture_TestControlPlaneComponents_oauth_openshift_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/oauth-openshift/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_oauth_openshift_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/oauth-openshift/zz_fixture_TestControlPlaneComponents_oauth_openshift_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/olm-operator/AROSwift/zz_fixture_TestControlPlaneComponents_olm_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/olm-operator/GCP/zz_fixture_TestControlPlaneComponents_olm_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/olm-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_olm_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/olm-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_olm_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/olm-operator/zz_fixture_TestControlPlaneComponents_olm_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-apiserver/AROSwift/zz_fixture_TestControlPlaneComponents_openshift_apiserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-apiserver/GCP/zz_fixture_TestControlPlaneComponents_openshift_apiserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-apiserver/IBMCloud/zz_fixture_TestControlPlaneComponents_openshift_apiserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-apiserver/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_openshift_apiserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-apiserver/zz_fixture_TestControlPlaneComponents_openshift_apiserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-oauth-apiserver/AROSwift/zz_fixture_TestControlPlaneComponents_openshift_oauth_apiserver_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-oauth-apiserver/GCP/zz_fixture_TestControlPlaneComponents_openshift_oauth_apiserver_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-oauth-apiserver/IBMCloud/zz_fixture_TestControlPlaneComponents_openshift_oauth_apiserver_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-oauth-apiserver/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_openshift_oauth_apiserver_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-oauth-apiserver/zz_fixture_TestControlPlaneComponents_openshift_oauth_apiserver_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/packageserver/AROSwift/zz_fixture_TestControlPlaneComponents_packageserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/packageserver/GCP/zz_fixture_TestControlPlaneComponents_packageserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/packageserver/IBMCloud/zz_fixture_TestControlPlaneComponents_packageserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/packageserver/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_packageserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/packageserver/zz_fixture_TestControlPlaneComponents_packageserver_deployment.yaml is excluded by !**/testdata/**
📒 Files selected for processing (5)
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth_apiserver/component.go
  • konnectivity-socks5-proxy/guard.go
  • konnectivity-socks5-proxy/guard_test.go
  • konnectivity-socks5-proxy/main.go
  • support/controlplane-component/konnectivity-container.go

Comment thread konnectivity-socks5-proxy/guard.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
konnectivity-socks5-proxy/main.go (1)

94-96: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Treat bootstrap-time context cancellation as a clean exit.

If runWithCoreGuard returns context.Canceled while still bootstrapping, Lines 94-96 still print an error and os.Exit(1). That makes an intentional SIGTERM during startup look like a crash.

Suggested change
 import (
+	"context"
+	"errors"
 	"fmt"
 	"net"
 	"os"
@@
 		})
 		if err != nil {
+			if errors.Is(err, context.Canceled) {
+				return
+			}
 			fmt.Fprintf(os.Stderr, "Error: %v\n", err)
 			os.Exit(1)
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@konnectivity-socks5-proxy/main.go` around lines 94 - 96, When handling the
error returned from runWithCoreGuard, treat bootstrap-time cancellation as a
clean exit by checking if errors.Is(err, context.Canceled) (or err ==
context.Canceled) and, if so, do not print an error and exit 0; otherwise keep
the existing fmt.Fprintf(os.Stderr, ...) and os.Exit(1). Update the error
handling around runWithCoreGuard to perform this conditional check (and add an
import for the errors package if needed) so a SIGTERM during startup is treated
as a normal shutdown.
🧹 Nitpick comments (1)
konnectivity-socks5-proxy/main_test.go (1)

100-134: ⚡ Quick win

Exercise the production shutdown path instead of a copied closure.

This subtest reimplements the listener/Serve/ctx.Done() flow locally, so it can stay green even if NewStartCommand drifts. Extract that block into a small package-private helper and call the helper from both main.go and this test.

As per coding guidelines, "Unit test any code changes and additions".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@konnectivity-socks5-proxy/main_test.go` around lines 100 - 134, The test
duplicates the listener/Serve/ctx.Done() shutdown logic (serveFunc) instead of
exercising the real production path; refactor that logic into a package-private
helper (e.g., create a function serveWithGracefulShutdown(ctx context.Context,
servingPort uint32, server *socks5.Server) error) and replace the test-local
closure and the implementation in main.go to call this helper; update
NewStartCommand (or the function in main.go that currently inlines the logic) to
call serveWithGracefulShutdown so the test can simply invoke the real code path
and verify graceful shutdown, and adjust tests to pass a server instance and
listener port to the new helper.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@konnectivity-socks5-proxy/guard_test.go`:
- Around line 213-219: The test currently dials a hardcoded port 54321 which can
be occupied on CI; instead create a short-lived listener on "127.0.0.1:0" to get
an ephemeral port, read its actual port from listener.Addr(), close the listener
to free the port, then call dialKonnectivityServerTCP(ctx, "127.0.0.1", port)
and assert error with isTransientKonnectivityError; update the anonymous t.Run
block to use this ephemeral-port pattern and keep the same assertions using
dialKonnectivityServerTCP and isTransientKonnectivityError.

---

Duplicate comments:
In `@konnectivity-socks5-proxy/main.go`:
- Around line 94-96: When handling the error returned from runWithCoreGuard,
treat bootstrap-time cancellation as a clean exit by checking if errors.Is(err,
context.Canceled) (or err == context.Canceled) and, if so, do not print an error
and exit 0; otherwise keep the existing fmt.Fprintf(os.Stderr, ...) and
os.Exit(1). Update the error handling around runWithCoreGuard to perform this
conditional check (and add an import for the errors package if needed) so a
SIGTERM during startup is treated as a normal shutdown.

---

Nitpick comments:
In `@konnectivity-socks5-proxy/main_test.go`:
- Around line 100-134: The test duplicates the listener/Serve/ctx.Done()
shutdown logic (serveFunc) instead of exercising the real production path;
refactor that logic into a package-private helper (e.g., create a function
serveWithGracefulShutdown(ctx context.Context, servingPort uint32, server
*socks5.Server) error) and replace the test-local closure and the implementation
in main.go to call this helper; update NewStartCommand (or the function in
main.go that currently inlines the logic) to call serveWithGracefulShutdown so
the test can simply invoke the real code path and verify graceful shutdown, and
adjust tests to pass a server instance and listener port to the new helper.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 099bd414-6684-4f69-8d67-a453ed5f21b3

📥 Commits

Reviewing files that changed from the base of the PR and between c1f6355 and 922b57f.

📒 Files selected for processing (5)
  • konnectivity-socks5-proxy/guard.go
  • konnectivity-socks5-proxy/guard_test.go
  • konnectivity-socks5-proxy/main.go
  • konnectivity-socks5-proxy/main_test.go
  • support/controlplane-component/konnectivity-container_test.go

Comment thread konnectivity-socks5-proxy/guard_test.go
@YamunadeviShanmugam
Copy link
Copy Markdown
Contributor Author

/test all

@YamunadeviShanmugam
Copy link
Copy Markdown
Contributor Author

/pipeline required

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@YamunadeviShanmugam
Copy link
Copy Markdown
Contributor Author

/test e2e-v2-gke

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2058851957484818432 | Cost: $2.74653275 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@cwbotbot
Copy link
Copy Markdown

cwbotbot commented May 25, 2026

Test Results

e2e-aws

e2e-aks

@YamunadeviShanmugam
Copy link
Copy Markdown
Contributor Author

/test e2e-aws

@jparrill
Copy link
Copy Markdown
Contributor

/ok-to-test

@openshift-ci openshift-ci Bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label May 26, 2026
@jparrill
Copy link
Copy Markdown
Contributor

/area dependencies

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 26, 2026

@jparrill: The label(s) area/dependencies cannot be applied, because the repository doesn't have them.

Details

In response to this:

/area dependencies

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Copy Markdown
Contributor

@jparrill jparrill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped some comments. Thanks

Apart from those comments inline, you will need to:

  • Create a bug in OCPBUGS project
  • /retitle OCPBUGS-XXXXX: Konnectivity retry proxy connection on timeout
  • /jira refresh

Comment thread konnectivity-socks5-proxy/guard.go Outdated
Comment thread konnectivity-socks5-proxy/guard.go Outdated
Comment thread konnectivity-socks5-proxy/guard.go Outdated
Comment thread konnectivity-socks5-proxy/guard.go
Comment thread konnectivity-socks5-proxy/main.go Outdated
Comment thread konnectivity-socks5-proxy/guard_test.go
Comment thread konnectivity-socks5-proxy/guard_test.go Outdated
Comment thread konnectivity-socks5-proxy/guard_test.go Outdated
Comment thread konnectivity-socks5-proxy/main_test.go Outdated
Comment thread support/controlplane-component/konnectivity-container_test.go Outdated
Comment thread konnectivity-socks5-proxy/guard.go
@YamunadeviShanmugam
Copy link
Copy Markdown
Contributor Author

/pipeline required

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@jparrill
Copy link
Copy Markdown
Contributor

jparrill commented Jun 2, 2026

/retest-required

@jparrill
Copy link
Copy Markdown
Contributor

jparrill commented Jun 3, 2026

Re-review: dependency scope and bootstrap impact

I've verified the latest push addresses most of my previous review comments (struct conversion, delay reset, error wrapping, client hoisting, test helpers). Good work there.

However, the dependency scope has grown significantly since my first review, and I want to flag this explicitly:

8 components now depend on konnectivity-agent

The original bug (OCPBUGS-86661) was about the konnectivity-socks5-proxy sidecar crashing during bootstrap. The retry logic + startup probe should handle that at the container level. But this PR also adds WithDependencies(konnectivityagent.ComponentName) to 8 components:

  • cno
  • ingress-operator
  • openshift-apiserver (oapi)
  • oauth
  • oauth-apiserver
  • catalog-operator
  • olm-operator
  • packageserver

This means none of these components will start scheduling until konnectivity-agent reports Available + RolloutComplete. That's a significant change to the control plane bootstrap graph — these components are now serialized behind konnectivity-agent.

Questions:

  1. Has the impact on bootstrap time been measured? Even a few seconds of added latency per component compounds across 8.
  2. If the retry loop + startup probe is sufficient to handle the sidecar race (which it should be — that's the whole point of the guard), do we actually need the component-level dependency at all? The dependency is a belt-and-suspenders approach that also constrains upgrades/rollouts.
  3. Previously only oauth-apiserver had this dependency. What changed to justify all 8? The answer "all of them inject a konnectivity sidecar" is true but was also true before this bug — if they worked without the dependency before (modulo the sidecar restart), and the retry/probe now handles the restart, what's the net justification?

I'd like to see data or at minimum a clear rationale before approving this change to the deployment graph.

@jparrill
Copy link
Copy Markdown
Contributor

jparrill commented Jun 3, 2026

@enxebre — ping on this PR. Your question from May 26 about the root cause and whether this needs more than dependency + simple wait.ExponentialBackoff is still the key design question.

Yamuna responded explaining that DialContext is lazy (not called during bootstrap), so the sidecar can start listening but fail all proxied requests if konnectivity-server isn't reachable yet. That justifies the dial check in the guard. But the scope has expanded: there are now 3 layers of defense (retry guard, startup probe, AND component dependency on konnectivity-agent for 8 components), which feels over-engineered for a transient bootstrap race.

Could you take a look at the current state and share whether:

  1. You're OK with the approach, or prefer something simpler
  2. The 8-component dependency addition is acceptable or should be scoped down

This PR has been open for ~2 weeks with no approvals, so it'd be good to unblock or redirect. Thanks!

@YamunadeviShanmugam
Copy link
Copy Markdown
Contributor Author

@jparrill : I looked closer at the changes, and I agree with you
we only need to keep WithDependencies(konnectivity-agent) on oauth-apiserver. I'm reverting it for the other 7 components (cno, ingress-operator, openshift-apiserver, oauth, catalog-operator, olm-operator, and packageserver).
oauth-apiserver is a special case as itneeds the dependency because data plane webhooks have to reach back into it to function. It's a hard requirement, not a timing workaround. For other 7 components it is not needed.

@openshift-ci openshift-ci Bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 5, 2026
@YamunadeviShanmugam YamunadeviShanmugam force-pushed the fix-add-startup-probe-konnectivity-sock5-sidecar branch from 0116c4d to 2b7902d Compare June 5, 2026 12:24
@openshift-ci openshift-ci Bot added area/ai Indicates the PR includes changes related to AI - Claude agents, Cursor rules, etc. area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/control-plane-pki-operator Indicates the PR includes changes for the control plane PKI operator - in an OCP release area/documentation Indicates the PR includes changes for documentation area/karpenter-operator Indicates the PR includes changes related to the Karpenter operator area/platform/aws PR/issue for AWS (AWSPlatform) platform area/platform/azure PR/issue for Azure (AzurePlatform) platform area/platform/gcp PR/issue for GCP (GCPPlatform) platform area/platform/ibmcloud PR/issue for IBMCloud (IBMCloudPlatform) platform area/platform/kubevirt PR/issue for KubeVirt (KubevirtPlatform) platform area/platform/openstack PR/issue for OpenStack (OpenStackPlatform) platform area/platform/powervs PR/issue for PowerVS (PowerVSPlatform) platform area/testing Indicates the PR includes changes for e2e testing labels Jun 5, 2026
@YamunadeviShanmugam YamunadeviShanmugam force-pushed the fix-add-startup-probe-konnectivity-sock5-sidecar branch from 2b7902d to 582c8ea Compare June 5, 2026 12:27
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: YamunadeviShanmugam
Once this PR has been reviewed and has the lgtm label, please assign jparrill for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@YamunadeviShanmugam YamunadeviShanmugam force-pushed the fix-add-startup-probe-konnectivity-sock5-sidecar branch from 582c8ea to 95f2d94 Compare June 5, 2026 12:29
@hypershift-jira-solve-ci
Copy link
Copy Markdown

I have all the information needed. Here is the complete analysis:

Test Failure Analysis Complete

Job Information

  • Prow Job: lint / Lint (GitHub Actions)
  • Build ID: 27014909840 / Job 79727348963
  • PR: #8579 — OCPBUGS-86661: Konnectivity retry proxy connection on timeout
  • Branch: fix-add-startup-probe-konnectivity-sock5-sidecar
  • Commit: 95f2d9459e4be5350faf78a147aa478e3864a6f5
  • Duration: ~7.5 minutes (12:29:57 → 12:37:32 UTC)

Test Failure Analysis

Error

konnectivity-socks5-proxy/guard.go:109:29: net.Listen must not be called. use (*net.ListenConfig).Listen (noctx)
konnectivity-socks5-proxy/guard_test.go:186:24: net.Listen must not be called. use (*net.ListenConfig).Listen (noctx)

2 issues: * noctx: 2
make: *** [Makefile:120: lint] Error 1

Summary

The PR introduces two new files (guard.go and guard_test.go) that use net.Listen() directly instead of the context-aware (*net.ListenConfig).Listen() API. The noctx linter — one of 16 active linters in the project's golangci-lint configuration — flags bare net.Listen calls because they bypass context propagation, preventing graceful cancellation and timeout control. Both violations are in newly added code from this PR, not pre-existing issues.

Root Cause

The noctx linter enforces that network operations use context-aware APIs. In Go, net.Listen("tcp", addr) is a convenience wrapper that does not accept a context.Context, meaning the call cannot be cancelled or timed out externally.

Violation 1 — guard.go:109:

listener, err := net.Listen("tcp", fmt.Sprintf(":%d", servingPort))

This creates the SOCKS5 serving listener in the runWithCoreGuard function. It should use (*net.ListenConfig).Listen(ctx, ...) so the listener creation respects the context already available in the function signature (ctx context.Context).

Violation 2 — guard_test.go:186:

ln, err := net.Listen("tcp", "127.0.0.1:0")

This creates a test listener in TestDialKonnectivityServerTCP. Even in test code, the linter applies. It should use (*net.ListenConfig).Listen(context.Background(), ...) or the test's context.

The noctx linter is enabled project-wide (confirmed in the active linter list: [dupword durationcheck errcheck errorlint fatcontext gci gocyclo govet ineffassign misspell nilerr noctx staticcheck unparam unused usestdlibvars]), and the diff filter confirmed both issues are in new code (2/2 passed the diff processor), so there is no workaround via exclusion — the code must be fixed.

Recommendations
  1. Fix guard.go:109 — Replace the bare net.Listen call with a net.ListenConfig-based call using the existing ctx parameter:

    // Before:
    listener, err := net.Listen("tcp", fmt.Sprintf(":%d", servingPort))
    
    // After:
    lc := net.ListenConfig{}
    listener, err := lc.Listen(ctx, "tcp", fmt.Sprintf(":%d", servingPort))

    This is the correct fix since runWithCoreGuard already receives a context.Context — propagating it to the listener ensures the listen operation can be cancelled during graceful shutdown.

  2. Fix guard_test.go:186 — Replace the bare net.Listen in the test with a context-aware variant:

    // Before:
    ln, err := net.Listen("tcp", "127.0.0.1:0")
    
    // After:
    lc := net.ListenConfig{}
    ln, err := lc.Listen(context.Background(), "tcp", "127.0.0.1:0")

    Alternatively, use t.Context() (Go 1.24+) or context.TODO() if the test doesn't have a context available.

  3. No changes to linter configuration are needed — these are legitimate findings that improve the code's correctness for cancellation and shutdown handling.

Evidence
Evidence Detail
Linter noctx — enforces context-aware network calls
Issue count 2 issues total, both noctx violations
File 1 konnectivity-socks5-proxy/guard.go:109net.Listen("tcp", ...) in runWithCoreGuard()
File 2 konnectivity-socks5-proxy/guard_test.go:186net.Listen("tcp", ...) in TestDialKonnectivityServerTCP
Required fix Use (*net.ListenConfig).Listen(ctx, "tcp", addr) instead
File status Both files are newly added in this PR (status: added)
Lint processing "Issues before processing: 1019, after processing: 2" — only these 2 issues survived diff filtering
Exit code make: *** [Makefile:120: lint] Error 1 → process exit code 2

The proxy was exiting hard on startup timeouts. This change adds
exponential backoff retry logic with startup probes.
@YamunadeviShanmugam YamunadeviShanmugam force-pushed the fix-add-startup-probe-konnectivity-sock5-sidecar branch from 95f2d94 to fa71f72 Compare June 5, 2026 12:57
@YamunadeviShanmugam
Copy link
Copy Markdown
Contributor Author

/test e2e-aks

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 5, 2026

@YamunadeviShanmugam: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/ai Indicates the PR includes changes related to AI - Claude agents, Cursor rules, etc. area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/control-plane-pki-operator Indicates the PR includes changes for the control plane PKI operator - in an OCP release area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/karpenter-operator Indicates the PR includes changes related to the Karpenter operator area/platform/aws PR/issue for AWS (AWSPlatform) platform area/platform/azure PR/issue for Azure (AzurePlatform) platform area/platform/gcp PR/issue for GCP (GCPPlatform) platform area/platform/ibmcloud PR/issue for IBMCloud (IBMCloudPlatform) platform area/platform/kubevirt PR/issue for KubeVirt (KubevirtPlatform) platform area/platform/openstack PR/issue for OpenStack (OpenStackPlatform) platform area/platform/powervs PR/issue for PowerVS (PowerVSPlatform) platform area/testing Indicates the PR includes changes for e2e testing jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. ok-to-test Indicates a non-member PR verified by an org member that is safe to test.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants