fix(traefik): finite readTimeout on entrypoints to stop fd-leak unreachability#108

Merged
max-tet merged 1 commit into main from fix/traefik-read-timeout
Jun 21, 2026
@ClaydeCode

Contributor

Summary

Follow-up to the #306 / shard tdyz60 incident. This fixes the root cause of the unreachability, which #309's log rotation does not address.

Traefik's static config (data/traefik.yml, data/traefik_no_ssl.yml) sets no connection timeouts, so it inherits Traefik's v2.x defaults: readTimeout=0 and writeTimeout=0, both meaning "no timeout".

On a shard's public IP with 80/443/8883 open to the internet, background scanners (Shodan/Censys/masscan/exploit bots) continuously open TCP connections that never complete a request: silent connects, abandoned TLS handshakes, slowloris. With readTimeout=0 each one is held open forever, consuming one file descriptor. They accumulate over days until traefik hits its open-file ceiling and accept() fails with EMFILE (too many open files):

  • traefik can no longer accept any new connection → shard unreachable (process never exits, so restart: always never fires);
  • the EMFILE error is logged in a hot accept-retry loop → 38 GB in 3 days → root disk full → bricked the core upgrade (the original #306 symptom).

Change

Set readTimeout: "300s" via transport.respondingTimeouts on the http and https entrypoints in both static configs. Abandoned connections are now reaped instead of leaking fds.

  • 300s stays generous for slow/large uploads (personal-cloud file sync).
  • writeTimeout is intentionally left at default 0 so large downloads, SSE, and long-poll responses are not cut off.
  • idleTimeout unchanged (default 180s).
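
As a rough illustration of the change described above, the entrypoint section of the static config might look like the fragment below. This is a sketch, not a copy of data/traefik.yml: the entrypoint names (http, https, mqtt) come from the test plan, while the addresses are assumed from the ports mentioned in the summary, and any other keys the real files carry (TLS, redirects, ACME) are omitted.

```yaml
# Sketch of the relevant part of the Traefik static config (assumed layout).
entryPoints:
  http:
    address: ":80"            # assumed from the ports listed in the summary
    transport:
      respondingTimeouts:
        readTimeout: "300s"   # reap connections that never finish sending a request
        # writeTimeout is deliberately not set (default 0 = no timeout)
        # idleTimeout is deliberately not set (default 180s)
  https:
    address: ":443"           # assumed
    transport:
      respondingTimeouts:
        readTimeout: "300s"
  mqtt:
    address: ":8883"          # assumed; TCP entrypoint, left untouched
```

The same block would be repeated in data/traefik_no_ssl.yml, presumably with whichever of these entrypoints that file defines.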

Scope / what this does NOT cover

  • MQTT (:8883) is a TCP entrypoint; respondingTimeouts are HTTP-only and do not apply. That fd vector is handled by a nofile ulimit on the traefik service in the companion controller PR (defense-in-depth; raises the ceiling so any residual leak degrades gracefully instead of bricking).
  • Does not change why traefik is reachable by scanners at all (inherent to a public server OS).
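
The companion controller PR is not shown here, so the following is only a guess at what a nofile ulimit on the traefik service could look like, assuming the service is defined in a Docker Compose file. The service name matches the wording above, but the limit values are hypothetical placeholders, not the values chosen in that PR.

```yaml
# Hypothetical Compose fragment; the real change lives in the controller PR.
services:
  traefik:
    restart: always      # mentioned in the summary; never fires while the process stays alive
    ulimits:
      nofile:
        soft: 65536      # placeholder value, not taken from the companion PR
        hard: 65536      # placeholder value
```

Raising the ceiling does not stop a leak on the TCP entrypoint; it only lengthens the time before accept() starts failing, which is why it is described as defense-in-depth.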

Test plan

  • data/traefik.yml and data/traefik_no_ssl.yml parse as valid YAML; entryPoints.{http,https}.transport.respondingTimeouts.readTimeout present, mqtt untouched.
  • _copy_traefik_static_config() renders the file with Jinja2 (only {{ acme_email }}); added keys contain no template syntax, so rendering is unaffected.

Recommended reading order

  1. data/traefik.yml
  2. data/traefik_no_ssl.yml

🤖 Generated with Claude Code

Traefik's static config set no connection timeouts, inheriting v2.x
defaults of readTimeout=0 / writeTimeout=0 ("no timeout"). On a shard's
public IP, internet scanners constantly open connections that never
complete a request (silent connects, abandoned TLS handshakes,
slowloris). With readTimeout=0 these are held open forever, each
consuming a file descriptor, until traefik hits its open-file ceiling
and accept() fails with EMFILE -- at which point the shard is unreachable
and the error is logged in a hot loop (the 38 GB log that filled root in
the tdyz60/#306 incident).

Set readTimeout=300s on the http and https entrypoints so abandoned
connections are reaped. 300s stays generous for slow/large uploads;
writeTimeout is intentionally left at default 0 so large downloads, SSE,
and long-poll responses are not cut off. The mqtt (8883) entrypoint is
TCP, where respondingTimeouts do not apply -- that vector is covered by a
nofile ulimit on the traefik service (separate controller PR).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2 participants