fix(ci): relax backend healthcheck timeout so batch load doesn't flap it unhealthy (#1230) #1232
Open
dolho wants to merge 1 commit into
…althy (#1230)

The prod backend Docker healthcheck intermittently tripped to `unhealthy` while the service was fully up serving /health 200s. With only 2 uvicorn workers, scheduled-batch windows make them GIL-contended; a /health probe then waits out the 10s timeout, and 3 consecutive aborts (Retries=3) flip the container to `unhealthy` until the batch eases (observed live on eu2).

It is a false negative, but any consumer of Docker health (autoheal, depends_on: service_healthy, LB drain) can act on it and restart the backend mid-batch, orphaning in-flight executions.

Relax the probe (config only, docker-compose.prod.yml backend):
- timeout 10s -> 30s
- start_period 10s -> 60s
- retries 3 -> 5

A genuine outage (all probes fail) still trips after ~5 intervals (~2.5 min); transient load spikes no longer do. Other services' probes are left unchanged (not the GIL-contended 2-worker flapper).

Related to #1230
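The "~5 intervals (~2.5 min)" estimate can be checked with a little arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it assumes Docker starts each probe `interval` after the previous probe finished, so probes that hang until the timeout stretch the window.

```python
def seconds_to_unhealthy(interval_s, retries, probe_duration_s=0.0):
    """Approximate time from the last passing probe to the `unhealthy` flip.

    Assumes every later probe fails and that each probe starts
    `interval_s` after the previous one finished (an assumption about
    Docker's scheduling, not something stated in this PR).
    """
    return retries * (interval_s + probe_duration_s)


# New settings, probes failing immediately (e.g. connection refused).
fast_fail = seconds_to_unhealthy(interval_s=30, retries=5)

# New settings, probes hanging until the 30s timeout aborts them.
hung = seconds_to_unhealthy(interval_s=30, retries=5, probe_duration_s=30)

print(f"fast-failing probes: {fast_fail / 60:.1f} min")
print(f"probes hanging to timeout: {hung / 60:.1f} min")
```

Under these assumptions the fast-failing case gives 2.5 minutes, matching the estimate above, while an outage in which every probe hangs to the full timeout would take roughly 5 minutes to register.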
Problem
The `trinity-backend` container's Docker healthcheck intermittently trips to `unhealthy` while the service is fully up and serving `/health` 200s (observed live on eu2, prod compose). With only 2 uvicorn workers, scheduled-batch windows leave them GIL-contended; a `/health` probe then waits longer than the 10s timeout and is aborted. Three consecutive aborts (`retries: 3`) flip the container to `unhealthy`; once the batch eases, the next probe succeeds and it flips back.

It's a false negative (the service serves 200s throughout, `RestartCount=0`), but a latent risk: any consumer of Docker health (autoheal sidecar, compose `depends_on: condition: service_healthy`, health-based LB drain) can act on the false `unhealthy` and restart or evict the backend mid-batch, orphaning in-flight executions.

Fix (config only, `docker-compose.prod.yml` backend)
- timeout 10s -> 30s
- start_period 10s -> 60s
- retries 3 -> 5

`interval` stays 30s. A genuine outage (all probes fail) still trips after ~5 intervals (~2.5 min); transient batch spikes no longer do. The change is scoped to the backend probe only: the scheduler/mcp/vector probes aren't the GIL-contended 2-worker flapper, so they're left unchanged (minimal scope).

A dedicated lightweight liveness route (no DB/work touch) would be even more robust, as the issue notes, but the timeout bump alone eliminates the observed flapping; it is left as a possible follow-up.
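To make the flapping mechanism described under Problem concrete, here is a minimal sketch of a Docker-style health state machine replayed against old and new settings. The function name and the latency values are hypothetical, chosen only for illustration; `start_period` is ignored for simplicity.

```python
def health_transitions(probe_latencies_s, timeout_s, retries):
    """Replay probe latencies through a simplified health state machine.

    A probe fails when its latency exceeds the timeout. The container turns
    "unhealthy" after `retries` consecutive failures; a single success
    resets the streak and returns it to "healthy".
    """
    status = "healthy"
    streak = 0
    history = []
    for latency in probe_latencies_s:
        if latency > timeout_s:
            streak += 1
            if streak >= retries:
                status = "unhealthy"
        else:
            streak = 0
            status = "healthy"
        history.append(status)
    return history


# Hypothetical /health latencies (seconds) across a batch window.
batch_window = [0.2, 12.0, 14.0, 11.0, 0.3, 0.2]

old = health_transitions(batch_window, timeout_s=10, retries=3)
new = health_transitions(batch_window, timeout_s=30, retries=5)

# A genuine outage: every probe hangs indefinitely.
outage = health_transitions([float("inf")] * 5, timeout_s=30, retries=5)

print("old:", old)
print("new:", new)
print("outage:", outage)
```

With these made-up latencies, the old settings flip to `unhealthy` on the third slow probe and back on the next fast one, while the new settings stay `healthy` throughout; the simulated outage still trips on the fifth failed probe.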
Verification
`docker-compose.prod.yml` parses; backend healthcheck resolves to `{interval: 30s, timeout: 30s, retries: 5, start_period: 60s}`.

Related to #1230
🤖 Generated with Claude Code