feat: add connection lifecycle protections (idle/max-duration/throughput floor, per-IP cap, TCP keepalive, dial timeout) by yaoge123 · Pull Request #37 · ustclug/rsync-proxy

yaoge123 · 2026-06-02T01:03:19Z

Summary

This PR adds a comprehensive set of connection-lifecycle protection mechanisms to defend rsync-proxy against slow-loris-style abuse, half-open / hung connections, dead upstreams, and single-IP exhaustion. All settings default to off (0), preserving existing behavior, and can be enabled per-deployment via [proxy] config.

The new metrics that go alongside these features make it possible to observe what's happening in production and tune the limits with evidence rather than guesses.

The commits are organized so each one can be reviewed independently:

feat: add relay-phase idle timeout — mirrors rsyncd.conf timeout = semantics for the bidirectional relay phase.
feat: add connection controls (TCP keepalive, per-IP cap, relay max duration) — three independent knobs.
feat: add dial timeout, throughput floor, and termination-cause counters — defensive dial timeout, slow-throughput detection, and per-cause counters that distinguish watcher-killed from natural connection completions.

Motivation

Production experience operating this proxy in front of a busy public mirror surfaced several pain points that the existing max_active_connections / max_queued_connections knobs cannot address:

Connections that sit idle indefinitely (clients suspended, NAT entries expired, network half-open) hold a slot until the OS TCP keepalive kicks in (~2 hours by default).
Single-IP misuse — one client opening many parallel rsync connections to the same upstream — easily fills the active-conn pool and causes legitimate users to hit the queue.
"Slow loris" style abuse — clients that read at byte-per-minute rates — keep slots locked for hours without ever triggering an idle timeout.
A dead upstream causes every new dial to wait ~75 s on kernel SYN retries before failing, amplifying the impact during outages.
When a connection ends, there was no way to tell from metrics whether it ended naturally or was force-closed by the proxy, making it hard to tune the new timeouts safely.

What's added

`[proxy]` configuration knobs

Setting	Type	Default	Behavior
`relay_idle_timeout`	int (s)	`0`	Close the relay if neither direction has any I/O for the given duration. Mirrors `rsyncd.conf timeout =` semantics. `0` disables.
`relay_max_duration`	int (s)	`0`	Hard ceiling on a single relay's lifetime. Long-running rsync clients reconnect transparently. `0` disables.
`tcp_keepalive`	int (s)	`0`	TCP keepalive period applied to both accepted client connections and dialed upstream connections. `0` keeps OS defaults.
`dial_timeout`	int (s)	`0`	`net.Dialer.Timeout` used when dialing upstreams. `0` keeps OS SYN-retry defaults (~75 s).
`per_ip_max_active_connections`	int	`0`	Proxy-wide default cap on simultaneous relayed connections per (upstream, client IP). `0` disables. Per-upstream override available.
`min_throughput_bytes`	int64	`0`	Together with `min_throughput_window`, defines a floor: if a connection moves fewer than this many bytes (sent + received) over any window after the grace period, it is closed. `0` disables.
`min_throughput_window`	int (s)	`60`	Sliding window length for the throughput floor check. Inactive unless `min_throughput_bytes > 0`.
`min_throughput_grace`	int (s)	`=window`	Grace period after relay start during which the throughput floor is not enforced. Defaults to the window value if unset. Useful for letting the rsync file-list phase complete on very large modules without triggering the floor.

`[upstreams.<name>]` override

Setting	Type	Default	Behavior
`per_ip_max_active_connections`	int	inherits proxy-wide	Per-upstream override of the cap. `0` falls back to the proxy-wide setting; if both are `0` there is no cap.

New Prometheus metrics

Metric	Type	Labels	Increments when
`rsync_proxy_per_ip_rejected_total`	counter	`upstream`	A connection is rejected because the (upstream, IP) cap was already at limit.
`rsync_proxy_relay_idle_timeout_terminated_total`	counter	`upstream`	The relay watcher closed a connection due to `relay_idle_timeout`.
`rsync_proxy_relay_max_duration_terminated_total`	counter	`upstream`	The relay watcher closed a connection due to `relay_max_duration`.
`rsync_proxy_throughput_floor_terminated_total`	counter	`upstream`	The relay watcher closed a connection due to falling below `min_throughput_bytes` over a window.

These three *_terminated_total counters are deliberately separate from the existing completed_connections_total so operators can distinguish natural completions from enforced terminations and tune limits without flying blind.

Implementation notes

Watcher consolidation

The three relay-phase watchers (idle, max-duration, throughput floor) are implemented as a single goroutine with one time.Ticker. The tick interval is computed as min(idle/4, maxDuration/4, window/4) over the enabled checks, with a 1 s lower bound. The watcher is only started if at least one check is enabled, so connections with all checks disabled have zero overhead beyond what's already there.

Race-safety with config reload

All new timing fields are read through helpers that take reloadLock.RLock() (getRelayTimings, getTCPKeepAlive, getDialer, getPerIPLimitForUpstream). getRelayTimings returns a relayTimings struct snapshot to keep all six values mutually consistent within one tick. getDialer returns a copy of s.dialer so that the dialer's Timeout and KeepAlive can be updated by reload without racing with in-flight dials.

Per-IP counters

Stored in a sync.Map keyed by {upstream, ip} with *atomic.Int64 values. Lazy-initialized on first use; entries are not removed when the count returns to zero (kept simple — the working set is bounded by len(upstreams) × distinct active IPs, which is small in practice).

The check is CAS-style: increment first, decrement+reject if over limit, otherwise defer counter.Add(-1). This avoids any TOCTOU race between checking and reserving the slot.

TCP keepalive

Applied via a small helper applyTCPKeepAlive(conn, period) that no-ops on non-TCP connections (e.g., unix sockets) and on period <= 0. Called both on the accepted client connection in handleConn and on the dialed upstream connection after relay() connects.

Throughput floor

The watcher samples info.SentBytes + info.ReceivedBytes (already tracked atomically by the countingReader). On each tick after the grace period, it checks whether the per-window delta has fallen below the configured byte floor. The grace period is critical: it allows the rsync file-list / generator phase to complete on modules with millions of small files without false positives. The grace defaults to the window length, which is enough for most modules; operators with very large modules (e.g. debian-archive, opensuse-source-vault) should set it higher.

Why throughput floor and idle timeout are both needed

A "slow loris" style client that sends one byte per second will never trigger an idle timeout (there's always recent activity), but will trivially trip a sub-MiB-per-minute throughput floor. Conversely, a fully suspended client triggers idle quickly without needing the throughput check.

Backward compatibility

All new knobs default to 0 and the new behaviors are off unless explicitly enabled. Existing deployments see no behavioral change.

The new counters all have upstream labels matching the existing PR #34 convention (config name, not IP:port). They are pure additions — no existing series are renamed or removed.

Testing

Each commit ships with focused unit tests:

TestRelayIdleTimeoutClosesIdleConnection / TestRelayIdleTimeoutNotTriggeredWhenActive
TestPerIPCapRejectsConnectionsBeyondLimit
TestRelayMaxDurationClosesLongConnection
TestApplyTCPKeepAliveOnTCPConn
TestLoadConfigPropagatesTCPKeepAliveToDialer
TestLoadConfigRejectsNegativeTimings (table-driven)
TestThroughputFloorTerminatesSlowConnection
TestLoadConfigPropagatesDialTimeoutAndThroughputSettings (incl. grace-defaults-to-window)
TestLoadConfigRejectsNegativeNewSettings
Tests assert both the underlying counters (upstreamCounters.*Terminated) and the rendered /metrics output.

go vet, go build, go test -race ./..., and golangci-lint run all pass.

Production validation

This change set has been running on a busy public rsync mirror for several weeks. The new counters surfaced previously-invisible problems immediately — most notably a sustained rate of queue-full rejections during peaks and individual relays that had stayed open for multiple days at a time — and the protection knobs brought the situation under control.

Once a client connection passes the rsync handshake and enters the bidirectional relay phase, the existing ReadTimeout/WriteTimeout no longer apply: the proxy explicitly clears deadlines and just runs io.Copy in both directions. A misbehaving or stuck client could then hold a connection open indefinitely without transferring data. Add an opt-in idle timeout for the relay phase, mirroring rsyncd's "timeout" semantics (rsyncd.conf(5)): * New config option [proxy].relay_idle_timeout (seconds). 0 (default) disables the timeout, matching the previous behavior. * countingReader now optionally records a UnixNano timestamp of the last non-zero read. A single timestamp is shared between the two directions, so any traffic in either direction resets the clock. * relay() starts a watcher goroutine when the timeout is enabled. It ticks at idle_timeout/4 (minimum 1s) and closes both endpoints once the configured duration has elapsed without activity, logging an access-log entry. * The expected I/O errors that follow such a forced close are suppressed from the error log via an idleTimedOut atomic flag. Tests cover both the timeout-fires path (idle upstream is torn down and an access-log line is written) and the steady-traffic path (a stream slower than the timeout but continuously active is not cut).

…uration) Three new mechanisms to mitigate clients holding upstream slots without making progress ("squatters"): * tcp_keepalive (proxy-wide, seconds): enables TCP keepalive with the configured period on both accepted client connections and dialed upstream connections, so a half-open peer is detected within minutes rather than the OS default (~2 hours, or disabled). 0 keeps OS-default behavior. * per_ip_max_active_connections (proxy-wide default with per-upstream override): caps the number of simultaneous active relay connections a single client IP may have to one upstream. Counted before queue admission so the cap also bounds queueing. Rejected requests get an @error line and are tracked via the new rsync_proxy_per_ip_rejected_total{upstream} counter. * relay_max_duration (proxy-wide, seconds): hard wall-clock cap on the bidirectional relay phase. Connections still alive past this duration are closed regardless of activity, complementing the existing relay_idle_timeout (which only fires on no-I/O). The existing relay watcher goroutine now handles both checks. The reload-mutated timing fields (RelayIdleTimeout, RelayMaxDuration, TCPKeepAlive, dialer.KeepAlive) are now read under the reload lock via getRelayTimings/getTCPKeepAlive/getDialer helpers to avoid data races with config reloads. Tests cover per-IP rejection (counter + metric + access log), max-duration enforcement (with idle disabled), the keepalive helper, config propagation to the dialer, and rejection of negative values for the three new settings.

Three additions to the connection-controls family that close gaps the existing idle/max-duration/per-IP/keepalive controls leave open and make the existing teardown paths observable. dial_timeout (proxy-wide) Bounds upstream Dial latency, sourced from net.Dialer.Timeout. The previous default behaviour relied on the kernel SYN-retry budget (~75s on Linux), which let unreachable backends pile up handshake- blocked goroutines and exhaust per-upstream queues. min_throughput_bytes / min_throughput_window / min_throughput_grace Sliding-window throughput floor for the relay phase. Idle timeout cannot catch slow drips (a client that sips a few hundred bytes every second never goes idle). The watcher samples info.SentBytes+info.ReceivedBytes once per window after grace and tears the connection down if the delta falls below the configured byte budget. min_throughput_grace defaults to min_throughput_window when unset to avoid false positives during handshake. Termination-cause counters rsync_proxy_relay_idle_timeout_terminated_total{upstream} rsync_proxy_relay_max_duration_terminated_total{upstream} rsync_proxy_throughput_floor_terminated_total{upstream} Until now an operator could not distinguish a watcher-killed connection from a normal close on the lifetime counter; these separate the three forced-shutdown paths. Watcher restructured to share one ticker across all three checks; the interval is min(idle/4, maxDuration/4, window/4) bounded at 1s. New config values are validated for non-negative seconds. The race-safe helper getRelayTimings() now returns a relayTimings struct so the RLock-guarded snapshot covers all six fields atomically. Tests cover counter increments and metrics output for idle, max- duration and throughput-floor paths, propagation of dial_timeout and min_throughput_* settings into the dialer and Server fields, the grace-defaults-to-window fallback, and rejection of negative values for all four new settings.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR introduces new connection-control features for the rsync proxy (per-IP concurrency caps, relay idle/max-duration timeouts, TCP keepalive, and throughput floor enforcement), along with Prometheus metrics and tests to validate behavior.

Changes:

Add new proxy/upstream config options (timeouts, keepalive, throughput floor, per-IP cap) and propagate them into server behavior.
Add relay-phase watcher logic and per-upstream counters, exposing new Prometheus metrics.
Add extensive end-to-end tests plus documentation updates in the example config.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
pkg/server/server.go	Implements new relay controls (watchers, keepalive, per-IP cap), config propagation, and new counters.
pkg/server/metrics.go	Exposes new per-upstream counters via Prometheus metrics.
pkg/server/config.go	Adds TOML fields and documentation for new proxy/upstream settings.
pkg/server/server_test.go	Adds integration tests for idle timeout, max duration, per-IP cap, keepalive helper, config parsing, and throughput floor.
assets/config.example.toml	Documents new configuration knobs with examples and explanations.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

yaoge123 · 2026-06-02T12:55:08Z

+	if perIPLimit := s.getPerIPLimitForUpstream(target.Upstream); perIPLimit > 0 {
+		counter := s.getPerIPCounter(target.Upstream, ip)
+		if n := counter.Add(1); int(n) > perIPLimit {
+			counter.Add(-1)
+			s.getUpstreamCounters(target.Upstream).perIPRejected.Add(1)
+			s.accessLog.F("client %s rejected for upstream %s module %s: per-IP cap of %d reached", ip, target.Upstream, moduleName, perIPLimit)
+			_, _ = writeWithTimeout(downConn, fmt.Appendf(nil, "@ERROR: per-IP connection limit of %d for upstream %s reached, retry later\n", perIPLimit, target.Upstream), writeTimeout)
+			return nil
+		}
+		defer counter.Add(-1)
+	}


Fixed in af5c335 — comparing as n > int64(perIPLimit) directly, dropping the conversion. In practice cap values are tiny (single-digit defaults are common) so overflow was theoretical, but the int64 form is cleaner.

yaoge123 · 2026-06-02T12:55:29Z

+	// Per-(upstream, client IP) active-connection counters.
+	// Lazy-initialized via getPerIPCounter.
+	// map key is perIPCountKey. Value is *atomic.Int64.
+	perIPCounts sync.Map


Fair concern — tracked as a follow-up rather than rolling into this PR. The clean fix needs a CAS-style decrement-and-delete to avoid a race with a concurrent Add(1) that just observed the entry, which felt like material worth its own commit and review. As a rough bound, the working set is Σ(distinct active client IPs per upstream) and entries are tiny (*atomic.Int64), so the practical leak is small even on busy public mirrors. Will open a follow-up issue once these PRs land.

yaoge123 · 2026-06-02T12:55:40Z

+	srv.modules = map[string][]Target{
+		"fake": {{Upstream: "u1", Addr: fakeRsync.Listener.Addr().String()}},
+	}
+	srv.upstreams = []upstreamConfig{{Name: "u1"}}
+	srv.upstreamQueues = map[string]*queue.Queue{"u1": queue.New(0, 0)}


The pattern (mutating srv.modules / srv.upstreamQueues after startServer without taking reloadLock) is the established convention in the existing test file (see e.g. TestClientReadTimeout, TestRelayBetweenClientAndUpstream). The mutations land before any net.Dial against the listener, so the listener loop has not observed any traffic at that point and -race has not flagged it across many runs. Locking these for consistency is reasonable but feels like a separate cleanup of the existing tests rather than something to ship in this PR.

yaoge123 · 2026-06-02T12:55:10Z

+		// Wake at most a few times per the smallest of the
+		// configured windows so timeouts are detected within roughly
+		// 1.25x the configured value in the worst case, while keeping
+		// wakeup overhead negligible.


Fixed in af5c335 — the comment now explicitly documents the 1s floor and the resulting degradation for sub-4s settings (only relevant in tests, since production-meaningful values are well above 4 seconds).

yaoge123 · 2026-06-02T12:55:25Z

+		if interval < time.Second {
+			interval = time.Second
+		}


Same fix as the comment on line 978 — see af5c335.

yaoge123 · 2026-06-02T12:55:43Z

+// applyTCPKeepAlive enables TCP keepalive on the given connection if
+// it is a *net.TCPConn and a positive period is provided. Other
+// connection types (e.g. unix sockets) are silently ignored.
+func applyTCPKeepAlive(conn net.Conn, period time.Duration) {
+	if period <= 0 {
+		return
+	}
+	tc, ok := conn.(*net.TCPConn)
+	if !ok {
+		return
+	}
+	_ = tc.SetKeepAlive(true)
+	_ = tc.SetKeepAlivePeriod(period)
+}


Intentional. applyTCPKeepAlive is a best-effort optimization: if the platform / connection type does not support the syscall (e.g. unix sockets, exotic listeners) the proxy still works correctly, just falling back to the OS default. Logging a warning per accepted connection on such platforms would be more noise than signal. Open to adding a one-shot startup probe if reviewers feel observability there is important.

- Use int64 comparison for per-IP cap. perIPLimit values from config are always small (typical caps are single-digit to low-double-digit), so the practical risk of overflow on 32-bit GOARCH was zero. Still, comparing the int64 counter against int64(perIPLimit) is more correct and removes the conversion entirely from review noise. - Clarify the watcher interval comment. The 1.25x worst-case detection claim only holds when all enabled timeouts are above 4 seconds, because of the explicit 1s floor on the wakeup interval. Document the trade-off explicitly so future maintainers know that sub-second timeouts (used only in tests) trade detection latency for kernel timer overhead.

yaoge123 added 3 commits June 2, 2026 08:55

Copilot AI review requested due to automatic review settings June 2, 2026 01:03

Copilot AI reviewed Jun 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add connection lifecycle protections (idle/max-duration/throughput floor, per-IP cap, TCP keepalive, dial timeout)#37

feat: add connection lifecycle protections (idle/max-duration/throughput floor, per-IP cap, TCP keepalive, dial timeout)#37
yaoge123 wants to merge 4 commits into
ustclug:masterfrom
yaoge123:upstream-pr/connection-protections

yaoge123 commented Jun 2, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

yaoge123 Jun 2, 2026

Uh oh!

yaoge123 Jun 2, 2026

Uh oh!

yaoge123 Jun 2, 2026

Uh oh!

yaoge123 Jun 2, 2026

Uh oh!

yaoge123 Jun 2, 2026

Uh oh!

yaoge123 Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

yaoge123 commented Jun 2, 2026

Summary

Motivation

What's added

[proxy] configuration knobs

[upstreams.<name>] override

New Prometheus metrics

Implementation notes

Watcher consolidation

Race-safety with config reload

Per-IP counters

TCP keepalive

Throughput floor

Why throughput floor and idle timeout are both needed

Backward compatibility

Testing

Production validation

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

yaoge123 Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

yaoge123 Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

yaoge123 Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

yaoge123 Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

yaoge123 Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

yaoge123 Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`[proxy]` configuration knobs

`[upstreams.<name>]` override