Skip to content

fix(ci): relax backend healthcheck timeout so batch load doesn't flap it unhealthy (#1230)#1232

Open
dolho wants to merge 1 commit into
devfrom
fix/1230-backend-healthcheck-timeout
Open

fix(ci): relax backend healthcheck timeout so batch load doesn't flap it unhealthy (#1230)#1232
dolho wants to merge 1 commit into
devfrom
fix/1230-backend-healthcheck-timeout

Conversation

@dolho

@dolho dolho commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Problem

The trinity-backend container's Docker healthcheck intermittently trips to unhealthy while the service is fully up and serving /health 200s (observed live on eu2, prod compose). With only 2 uvicorn workers, scheduled-batch windows leave them GIL-contended; a /health probe then waits longer than the 10s timeout and is aborted. Three consecutive aborts (retries: 3) flip the container to unhealthy; once the batch eases the next probe succeeds and it flips back.

It's a false negative (service serves 200s throughout, RestartCount=0), but a latent risk: any consumer of Docker health — autoheal sidecar, compose depends_on: condition: service_healthy, health-based LB drain — can act on the false unhealthy and restart/evict the backend mid-batch, orphaning in-flight executions.

Fix (config only — docker-compose.prod.yml backend)

field before after
timeout 10s 30s
start_period 10s 60s
retries 3 5

interval stays 30s. A genuine outage (all probes fail) still trips after ~5 intervals (~2.5 min); transient batch spikes no longer do. Scoped to the backend probe only — the scheduler/mcp/vector probes aren't the GIL-contended 2-worker flapper, so they're left unchanged (minimal scope).

A dedicated lightweight liveness route (no DB/work touch) would be even more robust, as the issue notes, but the timeout bump alone eliminates the observed flapping; left as a possible follow-up.

Verification

docker-compose.prod.yml parses; backend healthcheck resolves to {interval: 30s, timeout: 30s, retries: 5, start_period: 60s}.

Related to #1230

🤖 Generated with Claude Code

…althy (#1230)

The prod backend Docker healthcheck intermittently tripped to `unhealthy`
while the service was fully up serving /health 200s. With only 2 uvicorn
workers, scheduled-batch windows make them GIL-contended; a /health probe
then waits out the 10s timeout, and 3 consecutive aborts (Retries=3) flip the
container to `unhealthy` until the batch eases (observed live on eu2). It is a
false negative, but any consumer of Docker health (autoheal, depends_on:
service_healthy, LB drain) can act on it and restart the backend mid-batch,
orphaning in-flight executions.

Relax the probe (config only, docker-compose.prod.yml backend):
- timeout      10s -> 30s
- start_period 10s -> 60s
- retries       3  -> 5

A genuine outage (all probes fail) still trips after ~5 intervals (~2.5 min);
transient load spikes no longer do. Other services' probes are left unchanged
(not the GIL-contended 2-worker flapper).

Related to #1230
@dolho dolho requested a review from vybe June 16, 2026 13:21
@github-actions

Copy link
Copy Markdown

⚠️ Nightly unit-suite check skipped — merge conflict against dev.

Resolve by running git merge dev locally and pushing the result. The next nightly run will re-test once the conflict is gone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant