fix(ci): relax backend healthcheck timeout so batch load doesn't flap it unhealthy (#1230) by dolho · Pull Request #1232 · Abilityai/trinity

dolho · 2026-06-16T13:17:58Z

Problem

The trinity-backend container's Docker healthcheck intermittently trips to unhealthy while the service is fully up and serving /health 200s (observed live on eu2, prod compose). With only 2 uvicorn workers, scheduled-batch windows leave them GIL-contended; a /health probe then waits longer than the 10s timeout and is aborted. Three consecutive aborts (retries: 3) flip the container to unhealthy; once the batch eases the next probe succeeds and it flips back.
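The flap mechanism described above can be replayed with a small simulation. This is only a sketch of Docker's consecutive-failure rule: the latency trace is hypothetical (not measured on eu2), `health_states` is an illustrative helper, and `start_period` is ignored for simplicity.

```python
from typing import Iterable, List


def health_states(latencies_s: Iterable[float], timeout_s: float, retries: int) -> List[str]:
    """Replay probe latencies through a consecutive-failure rule.

    A probe counts as failed when /health would need longer than timeout_s.
    The state becomes "unhealthy" after `retries` consecutive failures and
    returns to "healthy" on the next successful probe.
    """
    state = "healthy"
    failing_streak = 0
    states = []
    for latency in latencies_s:
        if latency > timeout_s:
            failing_streak += 1
            if failing_streak >= retries:
                state = "unhealthy"
        else:
            failing_streak = 0
            state = "healthy"
        states.append(state)
    return states


# Hypothetical time (seconds) /health needs to answer across one batch window.
batch_window = [0.2, 12.0, 14.0, 18.0, 22.0, 0.3]

# Old probe settings: the third slow probe in a row flips the container.
print(health_states(batch_window, timeout_s=10, retries=3))
# → ['healthy', 'healthy', 'healthy', 'unhealthy', 'unhealthy', 'healthy']

# Relaxed settings: the same slow probes finish inside the timeout.
print(health_states(batch_window, timeout_s=30, retries=5))
# → ['healthy', 'healthy', 'healthy', 'healthy', 'healthy', 'healthy']
```

The service answers every probe in both runs; only the probe's patience differs, which is why the old settings report a false negative.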

It's a false negative (the service serves 200s throughout, RestartCount=0), but it is a latent risk: any consumer of Docker health (an autoheal sidecar, compose `depends_on: condition: service_healthy`, health-based LB drain) can act on the false unhealthy and restart/evict the backend mid-batch, orphaning in-flight executions.

Fix (config only — `docker-compose.prod.yml` backend)

| field | before | after |
| --- | --- | --- |
| `timeout` | 10s | 30s |
| `start_period` | 10s | 60s |
| `retries` | 3 | 5 |

`interval` stays at 30s. A genuine outage (all probes fail) still trips after ~5 intervals (~2.5 min); transient batch spikes no longer do. The change is scoped to the backend probe only: the scheduler/mcp/vector probes do not show the GIL-contended 2-worker flapping, so they're left unchanged (minimal scope).
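The detection-time arithmetic can be worked through as follows. This is a rough sketch, assuming Docker starts each probe `interval` seconds after the previous one completes (as its HEALTHCHECK documentation describes), that the outage begins right after a successful probe, and that `start_period` has already elapsed. `seconds_to_unhealthy` is an illustrative helper, not part of the PR.

```python
def seconds_to_unhealthy(interval_s: float, retries: int, probe_duration_s: float = 0.0) -> float:
    """Approximate worst-case time from outage onset to the unhealthy flip.

    Each of the `retries` failing probes is preceded by one interval and then
    runs for probe_duration_s before it is counted as a failure.
    """
    return retries * (interval_s + probe_duration_s)


for label, retries, timeout_s in (("before", 3, 10), ("after", 5, 30)):
    fast = seconds_to_unhealthy(30, retries)
    hung = seconds_to_unhealthy(30, retries, probe_duration_s=timeout_s)
    print(f"{label}: {fast:.0f}s if probes fail fast, {hung:.0f}s if each hangs to the timeout")
# before: 90s if probes fail fast, 120s if each hangs to the timeout
# after: 150s if probes fail fast, 300s if each hangs to the timeout
```

The ~2.5 min figure corresponds to the fail-fast case (for example, connection refused). If a dead backend instead makes every probe hang until the 30s timeout, detection under these assumptions could take closer to 5 minutes.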

A dedicated lightweight liveness route (no DB/work touch) would be even more robust, as the issue notes, but the timeout bump alone eliminates the observed flapping; left as a possible follow-up.
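For illustration only, a dedicated liveness route of the kind mentioned above might look like the following bare ASGI sketch. The `/livez` path, the response body, and the `liveness_app` and `probe` names are all hypothetical; the PR does not say which framework the backend uses or how such a route would be mounted.

```python
import asyncio
import json


async def liveness_app(scope, receive, send):
    """Minimal ASGI app: answers /livez with 200 and touches no DB or other work."""
    if scope["type"] != "http":
        return
    if scope["path"] == "/livez":
        status, body = 200, json.dumps({"status": "alive"}).encode()
    else:
        status, body = 404, b'{"detail": "not found"}'
    await send({
        "type": "http.response.start",
        "status": status,
        "headers": [
            (b"content-type", b"application/json"),
            (b"content-length", str(len(body)).encode()),
        ],
    })
    await send({"type": "http.response.body", "body": body})


async def probe(path: str):
    """Call the app in-process with a fake request and return (status, body)."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    await liveness_app({"type": "http", "method": "GET", "path": path}, receive, send)
    return sent[0]["status"], sent[1]["body"]


print(asyncio.run(probe("/livez")))  # → (200, b'{"status": "alive"}')
```

A route like this only removes the handler's own work from the probe path. It still needs a worker to schedule it, so it complements rather than replaces the relaxed timeout.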

Verification

docker-compose.prod.yml parses; backend healthcheck resolves to {interval: 30s, timeout: 30s, retries: 5, start_period: 60s}.
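A check along these lines could be scripted. The sketch below assumes the resolved config is available as a dict, for example parsed from the JSON printed by `docker compose -f docker-compose.prod.yml config --format json`, and that the service key is `backend`; both are assumptions, as are the helper names. Durations are normalized to seconds because the resolved output may print 60s in another form such as `1m0s`.

```python
import re

# Values the PR says the backend healthcheck should resolve to.
EXPECTED_SECONDS = {"interval": 30, "timeout": 30, "start_period": 60}
EXPECTED_RETRIES = 5

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def duration_to_seconds(text: str) -> float:
    """Convert a duration such as '30s', '60s' or '1m0s' to seconds."""
    parts = re.findall(r"(\d+(?:\.\d+)?)(h|m|s)", text)
    # Reject anything the pattern did not consume completely (e.g. '500ms').
    if not parts or "".join(number + unit for number, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(float(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def healthcheck_mismatches(resolved: dict, service: str = "backend") -> list:
    """Return human-readable mismatches; an empty list means the check passes."""
    healthcheck = resolved["services"][service]["healthcheck"]
    problems = []
    for field, want in EXPECTED_SECONDS.items():
        if duration_to_seconds(healthcheck[field]) != want:
            problems.append(f"{field}: expected {want}s, got {healthcheck[field]}")
    if int(healthcheck["retries"]) != EXPECTED_RETRIES:
        problems.append(f"retries: expected {EXPECTED_RETRIES}, got {healthcheck['retries']}")
    return problems


# Hypothetical resolved config, trimmed to the fields being checked.
sample = {"services": {"backend": {"healthcheck": {
    "interval": "30s", "timeout": "30s", "retries": 5, "start_period": "1m0s",
}}}}
print(healthcheck_mismatches(sample))  # → []
```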

Related to #1230

🤖 Generated with Claude Code

…althy (#1230)

The prod backend Docker healthcheck intermittently tripped to `unhealthy` while the service was fully up serving /health 200s. With only 2 uvicorn workers, scheduled-batch windows make them GIL-contended; a /health probe then waits out the 10s timeout, and 3 consecutive aborts (Retries=3) flip the container to `unhealthy` until the batch eases (observed live on eu2). It is a false negative, but any consumer of Docker health (autoheal, depends_on: service_healthy, LB drain) can act on it and restart the backend mid-batch, orphaning in-flight executions.

Relax the probe (config only, docker-compose.prod.yml backend):

- timeout 10s -> 30s
- start_period 10s -> 60s
- retries 3 -> 5

A genuine outage (all probes fail) still trips after ~5 intervals (~2.5 min); transient load spikes no longer do. Other services' probes are left unchanged (not the GIL-contended 2-worker flapper).

Related to #1230

github-actions · 2026-06-17T11:11:41Z

⚠️ Nightly unit-suite check skipped — merge conflict against dev.

Resolve by running git merge dev locally and pushing the result. The next nightly run will re-test once the conflict is gone.

dolho requested a review from vybe June 16, 2026 13:21

dolho wants to merge 1 commit into dev from fix/1230-backend-healthcheck-timeout
