Skip to content

Add graceful connection draining on tunnel close#350

Open
myleshorton wants to merge 3 commits intomainfrom
graceful-drain-on-tunnel-close
Open

Add graceful connection draining on tunnel close#350
myleshorton wants to merge 3 commits intomainfrom
graceful-drain-on-tunnel-close

Conversation

@myleshorton
Copy link
Contributor

Summary

When the tunnel closes (e.g. during server location change, routing mode toggle, or split tunneling change), wait up to 5 seconds for active connections to finish before tearing down the TUN interface. This avoids forcibly killing all in-flight connections (e.g. video streams), reducing the user-visible disruption from ~1 minute to a few seconds.

Problem

From ticket #168459: user reports ~1 minute streaming disruption when changing server location, routing mode, or split tunneling on Windows. Logs show 158 connections forcibly terminated on a single tunnel stop event. Video players then need time to detect the failure and reconnect.

See engineering#3055 for the full diagnosis.

Changes

  • vpn/tunnel.go: Add drainConnections() that polls conntrack.Count() until active connections are gone or DrainTimeout (5s) expires. Called at the start of tunnel.close(), before closing resources.
  • vpn/drain_test.go: 4 tests covering immediate return (no connections), graceful wait, timeout, and tunnel close integration.
  • .github/workflows/go.yml: Add with_conntrack build tag to CI so connection tracking is enabled in tests.

Testing

All new tests pass:

=== RUN   TestDrainConnectionsNoConnections       --- PASS
=== RUN   TestDrainConnectionsWaitsForConnections  --- PASS (0.20s)
=== RUN   TestDrainConnectionsTimeout              --- PASS (0.20s)
=== RUN   TestTunnelCloseCallsDrain                --- PASS

Existing tests (TestConnection, TestUpdateServers) continue to pass.

Copilot AI review requested due to automatic review settings March 5, 2026 17:16
When the tunnel closes (e.g. during server location change, routing mode
toggle, or split tunneling change), wait up to 5 seconds for active
connections to finish before tearing down the TUN interface. This avoids
forcibly killing all in-flight connections (e.g. video streams), reducing
the disruption from ~1 minute to a few seconds.

- Add drainConnections() that polls conntrack.Count() until connections
  are gone or DrainTimeout (5s) expires
- Call drain phase at the start of tunnel.close(), before closing resources
- Add with_conntrack build tag to CI for connection tracking in tests
- Add drain_test.go with tests for immediate return, graceful wait,
  timeout, and tunnel close integration

Fixes getlantern/engineering#3055
@myleshorton myleshorton force-pushed the graceful-drain-on-tunnel-close branch from b8bc7b2 to 0559cbc Compare March 5, 2026 17:18
Move drainConnections() before t.cancel() so in-flight requests can
finish gracefully before context cancellation tears them down. Fix
misleading "forcibly closing" log to "proceeding with tunnel teardown"
since drain doesn't actually force-close connections.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@myleshorton myleshorton requested a review from Copilot March 5, 2026 17:34
@garmr-ulfr
Copy link
Collaborator

garmr-ulfr commented Mar 5, 2026

@myleshorton, sing-box doesn't work that way. It doesn't close/stop anything until box.Close is called. The context is only used for storing and passing values between goroutines.

I haven't looked at the logs yet but, toggling split tunneling or adding/removing filter items won't affect connections. If the user manually switches servers, then connections being closed immediately is expected and typical with VPNs. Same with changing the routing mode as it modifies network config. This would really only be an bug if it happened when using auto-connect.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

@myleshorton
Copy link
Contributor Author

If the user manually switches servers, then connections being closed immediately is expected and typical with VPNs. Same with changing the routing mode as it modifies network config

My understanding is that this is where the user experience was impacted because it took the clients (youtube in a browser) ~1m to switch over to the new connection, whereas the intention here is to gracefully drain those connections to avoid negatively impacting the user experience.

…lose

- Replace time.After with time.NewTimer + defer Stop() to avoid
  leaked timers on early return
- Guard against DrainPollInterval <= 0 to prevent ticker panic
- Remove duplicate tracked.Close() in TestDrainConnectionsTimeout
  (defer already handles cleanup)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@garmr-ulfr
Copy link
Collaborator

Isn't closing the connection what triggers clients to attempt to reconnect? If the connection isn't closed, they won't switch over.

And, as mentioned previously, sing-box doesn't support that. Connections won't be closed until box.Close is called.

@myleshorton
Copy link
Contributor Author

Isn't closing the connection what triggers clients to attempt to reconnect? If the connection isn't closed, they won't switch over.

Yeah I agree, but I suppose if there's a lot of data in flight on those connections then not draining them could have the observed impact on the user experience? I also don't have a better explanation, but it seems plausible. The ideal would be to just reproduce it. Let me see if I can do that.

@garmr-ulfr
Copy link
Collaborator

I think the issue might actually be the time it takes VPN and tunnel to close and reopen. The user is on Windows so the double IPC and double TUN will definitely add some overhead.

@garmr-ulfr
Copy link
Collaborator

But how would we drain them? If the clients won't attempt a new connection until the existing one is closed, aren't we just delaying the issue until the timeout?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants