fix(traefik): finite readTimeout on entrypoints to stop fd-leak unreachability by ClaydeCode · Pull Request #108 · FreeshardBase/freeshard

ClaydeCode · 2026-06-20T17:13:23Z

Summary

Follow-up to the #306 / shard tdyz60 incident. This addresses the root cause of the unreachability, which the log rotation in #309 does not.

Traefik's static config (data/traefik.yml, data/traefik_no_ssl.yml) sets no connection timeouts, so it inherits traefik's v2.x defaults: readTimeout=0 and writeTimeout=0, both meaning "no timeout".

On a shard's public IP with 80/443/8883 open to the internet, background scanners (Shodan/Censys/masscan/exploit bots) continuously open TCP connections that never complete a request — silent connects, abandoned TLS handshakes, slowloris. With readTimeout=0 each is held open forever, consuming one file descriptor. They accumulate over days until traefik hits its open-file ceiling and accept() returns EMFILE (too many open files):

traefik can no longer accept any new connection → shard unreachable (process never exits, so restart: always never fires);
the EMFILE error is logged in a hot accept-retry loop → 38 GB in 3 days → root disk full → bricked the core upgrade (the original #306 symptom).

Change

Set readTimeout: "300s" via transport.respondingTimeouts on the http and https entrypoints in both static configs. Abandoned connections are now reaped instead of leaking fds.

300s stays generous for slow/large uploads (personal-cloud file sync).
writeTimeout is intentionally left at default 0 so large downloads, SSE, and long-poll responses are not cut off.
idleTimeout unchanged (default 180s).
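For orientation, the change described above could look roughly like the fragment below. This is a sketch, not the merged diff: the entrypoint names (http, https, mqtt) and ports come from this description, but the mapping of each name to its address is an assumption, and the real files may carry additional keys that are omitted here.

```yaml
# Sketch of the static config after the change (data/traefik.yml).
# Assumption: http listens on :80, https on :443, mqtt on :8883.
entryPoints:
  http:
    address: ":80"
    transport:
      respondingTimeouts:
        readTimeout: "300s"   # reap connections that never finish sending a request
        # writeTimeout left unset (default 0) so long responses are not cut off
  https:
    address: ":443"
    transport:
      respondingTimeouts:
        readTimeout: "300s"
  mqtt:
    address: ":8883"          # untouched by this PR
```

The same keys would be added to data/traefik_no_ssl.yml for whichever of these entrypoints that file defines.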

Scope / what this does NOT cover

MQTT (:8883) is a TCP entrypoint; respondingTimeouts are HTTP-only and do not apply. That fd vector is handled by a nofile ulimit on the traefik service in the companion controller PR (defense-in-depth; raises the ceiling so any residual leak degrades gracefully instead of bricking).
Does not change why traefik is reachable by scanners at all (inherent to any server exposed on a public IP).
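The nofile ulimit mentioned above lives in the companion controller PR, which is not shown here. As a rough illustration only, a Docker Compose service can raise its open-file limit as follows; the service layout and the limit values are hypothetical and not taken from that PR.

```yaml
# Hypothetical compose fragment; the limit values are illustrative assumptions.
services:
  traefik:
    restart: always
    ulimits:
      nofile:
        soft: 65536   # assumed value, not from the controller PR
        hard: 65536   # assumed value, not from the controller PR
```

A higher ceiling does not stop a leak. It only buys time before accept() starts failing with EMFILE, which is why it is described as defense-in-depth rather than the fix.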

Test plan

data/traefik.yml and data/traefik_no_ssl.yml parse as valid YAML; entryPoints.{http,https}.transport.respondingTimeouts.readTimeout present, mqtt untouched.
_copy_traefik_static_config() renders the file with Jinja2 (only {{ acme_email }}); added keys contain no template syntax, so rendering is unaffected.

Recommended reading order

data/traefik.yml
data/traefik_no_ssl.yml

🤖 Generated with Claude Code

Traefik's static config set no connection timeouts, inheriting v2.x defaults of readTimeout=0 / writeTimeout=0 ("no timeout"). On a shard's public IP, internet scanners constantly open connections that never complete a request (silent connects, abandoned TLS handshakes, slowloris). With readTimeout=0 these are held open forever, each consuming a file descriptor, until traefik hits its open-file ceiling and accept() fails with EMFILE -- at which point the shard is unreachable and the error is logged in a hot loop (the 38 GB log that filled root in the tdyz60/#306 incident).

Set readTimeout=300s on the http and https entrypoints so abandoned connections are reaped. 300s stays generous for slow/large uploads; writeTimeout is intentionally left at default 0 so large downloads, SSE, and long-poll responses are not cut off.

The mqtt (8883) entrypoint is TCP, where respondingTimeouts do not apply -- that vector is covered by a nofile ulimit on the traefik service (separate controller PR).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

max-tet merged commit 9ad875e into main Jun 21, 2026
6 checks passed

max-tet deleted the fix/traefik-read-timeout branch June 21, 2026 18:39

This was referenced Jun 23, 2026

Release image with #108 (finite readTimeout) — fd-leak fix is merged but in no released tag #111

Open

fix(traefik): use Host() instead of HostRegexp for dashboard router (v3 prep) #112

Closed

set version to 0.39.4 (release #108 readTimeout) #113

Closed
