Summary
The trinity-backend container's Docker healthcheck intermittently trips to unhealthy while the service is fully up and serving /health 200s. Observed on the eu2 instance (prod compose, version c52714b6). It is a false negative caused by a too-strict probe, not an actual backend failure — but it is a latent risk anywhere an unhealthy state triggers a restart/eviction (autoheal, orchestrator, LB drain).
Healthcheck definition (current)
{"Test":["CMD","curl","-f","http://localhost:8000/health"],
"Interval":30s,"Timeout":10s,"StartPeriod":10s,"Retries":3}
Backend command: uvicorn main:app --host 0.0.0.0 --port 8000 --workers 2 --timeout-worker-healthcheck 120
What happens
With only 2 uvicorn workers, during scheduled-batch windows the workers are busy (observed backend at ~47% CPU mid-batch) and GIL-contended. A /health probe then waits longer than the 10s timeout and is aborted. Three consecutive aborts (Retries=3) flip the container to unhealthy; once the batch eases, the next probe succeeds and it flips back to healthy.
Evidence (eu2, captured live)
Probe log around a flap — three timeouts (exit=-1) then recovery (exit=0):
2026-06-16 12:21:58Z | exit=-1
2026-06-16 12:22:39Z | exit=-1
2026-06-16 12:23:19Z | exit=-1
2026-06-16 12:23:59Z | exit=0
2026-06-16 12:24:29Z | exit=0
At time of capture: docker ps showed Up 4 hours (unhealthy) while curl /health returned {"status":"healthy"}. Shortly after: Health=healthy FailingStreak=0. RestartCount=0, OOMKilled=false — the service never went down.
Impact
- Cosmetic today (service stays up;
/health returns 200 throughout).
- Risk: any consumer of Docker health state (autoheal sidecar, compose
depends_on: condition: service_healthy, external health-based routing) can act on the false unhealthy and restart/evict the backend mid-batch — which orphans in-flight executions (we already see "recovered on backend restart" rows after backend recreates).
- Misleads operators/monitoring that read container health rather than the
/health endpoint.
Suggested fix (config only, docker-compose.prod.yml)
Relax the probe so it tolerates transient load without hiding real outages:
timeout: 10s → 30s
start_period: 10s → 30–60s
- optionally
retries: 3 → 5
A dedicated lightweight liveness route (no DB/work touch) would be even more robust, but the timeout bump alone should eliminate the flapping.
Related
Filed by trinity-ops-agent from live eu2 observation (2026-06-16).
Summary
The
trinity-backendcontainer's Docker healthcheck intermittently trips tounhealthywhile the service is fully up and serving/health200s. Observed on the eu2 instance (prod compose, versionc52714b6). It is a false negative caused by a too-strict probe, not an actual backend failure — but it is a latent risk anywhere anunhealthystate triggers a restart/eviction (autoheal, orchestrator, LB drain).Healthcheck definition (current)
{"Test":["CMD","curl","-f","http://localhost:8000/health"], "Interval":30s,"Timeout":10s,"StartPeriod":10s,"Retries":3}Backend command:
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 2 --timeout-worker-healthcheck 120What happens
With only 2 uvicorn workers, during scheduled-batch windows the workers are busy (observed backend at ~47% CPU mid-batch) and GIL-contended. A
/healthprobe then waits longer than the 10s timeout and is aborted. Three consecutive aborts (Retries=3) flip the container tounhealthy; once the batch eases, the next probe succeeds and it flips back tohealthy.Evidence (eu2, captured live)
Probe log around a flap — three timeouts (
exit=-1) then recovery (exit=0):At time of capture:
docker psshowedUp 4 hours (unhealthy)whilecurl /healthreturned{"status":"healthy"}. Shortly after:Health=healthy FailingStreak=0.RestartCount=0,OOMKilled=false— the service never went down.Impact
/healthreturns 200 throughout).depends_on: condition: service_healthy, external health-based routing) can act on the falseunhealthyand restart/evict the backend mid-batch — which orphans in-flight executions (we already see "recovered on backend restart" rows after backend recreates)./healthendpoint.Suggested fix (config only,
docker-compose.prod.yml)Relax the probe so it tolerates transient load without hiding real outages:
timeout: 10s → 30sstart_period: 10s → 30–60sretries: 3 → 5A dedicated lightweight liveness route (no DB/work touch) would be even more robust, but the timeout bump alone should eliminate the flapping.
Related
--timeout-worker-healthcheck); already applied here (0.37.0 + 120s) but does not address probe timeoutFiled by trinity-ops-agent from live eu2 observation (2026-06-16).