Skip to content

bug: backend Docker healthcheck flaps to "unhealthy" under batch load (10s timeout too tight for 2 workers) #1230

@obasilakis

Description

@obasilakis

Summary

The trinity-backend container's Docker healthcheck intermittently trips to unhealthy while the service is fully up and serving /health 200s. Observed on the eu2 instance (prod compose, version c52714b6). It is a false negative caused by a too-strict probe, not an actual backend failure — but it is a latent risk anywhere an unhealthy state triggers a restart/eviction (autoheal, orchestrator, LB drain).

Healthcheck definition (current)

{"Test":["CMD","curl","-f","http://localhost:8000/health"],
 "Interval":30s,"Timeout":10s,"StartPeriod":10s,"Retries":3}

Backend command: uvicorn main:app --host 0.0.0.0 --port 8000 --workers 2 --timeout-worker-healthcheck 120

What happens

With only 2 uvicorn workers, during scheduled-batch windows the workers are busy (observed backend at ~47% CPU mid-batch) and GIL-contended. A /health probe then waits longer than the 10s timeout and is aborted. Three consecutive aborts (Retries=3) flip the container to unhealthy; once the batch eases, the next probe succeeds and it flips back to healthy.

Evidence (eu2, captured live)

Probe log around a flap — three timeouts (exit=-1) then recovery (exit=0):

2026-06-16 12:21:58Z | exit=-1
2026-06-16 12:22:39Z | exit=-1
2026-06-16 12:23:19Z | exit=-1
2026-06-16 12:23:59Z | exit=0
2026-06-16 12:24:29Z | exit=0

At time of capture: docker ps showed Up 4 hours (unhealthy) while curl /health returned {"status":"healthy"}. Shortly after: Health=healthy FailingStreak=0. RestartCount=0, OOMKilled=false — the service never went down.

Impact

  • Cosmetic today (service stays up; /health returns 200 throughout).
  • Risk: any consumer of Docker health state (autoheal sidecar, compose depends_on: condition: service_healthy, external health-based routing) can act on the false unhealthy and restart/evict the backend mid-batch — which orphans in-flight executions (we already see "recovered on backend restart" rows after backend recreates).
  • Misleads operators/monitoring that read container health rather than the /health endpoint.

Suggested fix (config only, docker-compose.prod.yml)

Relax the probe so it tolerates transient load without hiding real outages:

  • timeout: 10s → 30s
  • start_period: 10s → 30–60s
  • optionally retries: 3 → 5

A dedicated lightweight liveness route (no DB/work touch) would be even more robust, but the timeout bump alone should eliminate the flapping.

Related

Filed by trinity-ops-agent from live eu2 observation (2026-06-16).

Metadata

Metadata

Assignees

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions