Add graceful connection draining on tunnel close#350
Add graceful connection draining on tunnel close#350myleshorton wants to merge 3 commits intomainfrom
Conversation
When the tunnel closes (e.g. during server location change, routing mode toggle, or split tunneling change), wait up to 5 seconds for active connections to finish before tearing down the TUN interface. This avoids forcibly killing all in-flight connections (e.g. video streams), reducing the disruption from ~1 minute to a few seconds. - Add drainConnections() that polls conntrack.Count() until connections are gone or DrainTimeout (5s) expires - Call drain phase at the start of tunnel.close(), before closing resources - Add with_conntrack build tag to CI for connection tracking in tests - Add drain_test.go with tests for immediate return, graceful wait, timeout, and tunnel close integration Fixes getlantern/engineering#3055
b8bc7b2 to
0559cbc
Compare
Move drainConnections() before t.cancel() so in-flight requests can finish gracefully before context cancellation tears them down. Fix misleading "forcibly closing" log to "proceeding with tunnel teardown" since drain doesn't actually force-close connections. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@myleshorton, sing-box doesn't work that way. It doesn't close/stop anything until I haven't looked at the logs yet but, toggling split tunneling or adding/removing filter items won't affect connections. If the user manually switches servers, then connections being closed immediately is expected and typical with VPNs. Same with changing the routing mode as it modifies network config. This would really only be an bug if it happened when using auto-connect. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
My understanding is that this is where the user experience was impacted because it took the clients (youtube in a browser) ~1m to switch over to the new connection, whereas the intention here is to gracefully drain those connections to avoid negatively impacting the user experience. |
…lose - Replace time.After with time.NewTimer + defer Stop() to avoid leaked timers on early return - Guard against DrainPollInterval <= 0 to prevent ticker panic - Remove duplicate tracked.Close() in TestDrainConnectionsTimeout (defer already handles cleanup) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Isn't closing the connection what triggers clients to attempt to reconnect? If the connection isn't closed, they won't switch over. And, as mentioned previously, sing-box doesn't support that. Connections won't be closed until |
Yeah I agree, but I suppose if there's a lot of data in flight on those connections then not draining them could have the observed impact on the user experience? I also don't have a better explanation, but it seems plausible. The ideal would be to just reproduce it. Let me see if I can do that. |
|
I think the issue might actually be the time it takes VPN and tunnel to close and reopen. The user is on Windows so the double IPC and double TUN will definitely add some overhead. |
|
But how would we drain them? If the clients won't attempt a new connection until the existing one is closed, aren't we just delaying the issue until the timeout? |
Summary
When the tunnel closes (e.g. during server location change, routing mode toggle, or split tunneling change), wait up to 5 seconds for active connections to finish before tearing down the TUN interface. This avoids forcibly killing all in-flight connections (e.g. video streams), reducing the user-visible disruption from ~1 minute to a few seconds.
Problem
From ticket #168459: user reports ~1 minute streaming disruption when changing server location, routing mode, or split tunneling on Windows. Logs show 158 connections forcibly terminated on a single tunnel stop event. Video players then need time to detect the failure and reconnect.
See engineering#3055 for the full diagnosis.
Changes
vpn/tunnel.go: AdddrainConnections()that pollsconntrack.Count()until active connections are gone orDrainTimeout(5s) expires. Called at the start oftunnel.close(), before closing resources.vpn/drain_test.go: 4 tests covering immediate return (no connections), graceful wait, timeout, and tunnel close integration..github/workflows/go.yml: Addwith_conntrackbuild tag to CI so connection tracking is enabled in tests.Testing
All new tests pass:
Existing tests (
TestConnection,TestUpdateServers) continue to pass.