bug: backend Docker healthcheck flaps to "unhealthy" under batch load (10s timeout too tight for 2 workers)

## Summary

The `trinity-backend` container's Docker healthcheck intermittently trips to **`unhealthy`** while the service is fully up and serving `/health` 200s. Observed on the **eu2** instance (prod compose, version `c52714b6`). It is a false negative caused by a too-strict probe, not an actual backend failure — but it is a latent risk anywhere an `unhealthy` state triggers a restart/eviction (autoheal, orchestrator, LB drain).

## Healthcheck definition (current)

```json
{"Test":["CMD","curl","-f","http://localhost:8000/health"],
 "Interval":30s,"Timeout":10s,"StartPeriod":10s,"Retries":3}
```
Backend command: `uvicorn main:app --host 0.0.0.0 --port 8000 --workers 2 --timeout-worker-healthcheck 120`

## What happens

With only **2 uvicorn workers**, during scheduled-batch windows the workers are busy (observed backend at ~47% CPU mid-batch) and GIL-contended. A `/health` probe then waits longer than the **10s timeout** and is aborted. Three consecutive aborts (Retries=3) flip the container to `unhealthy`; once the batch eases, the next probe succeeds and it flips back to `healthy`.

## Evidence (eu2, captured live)

Probe log around a flap — three timeouts (`exit=-1`) then recovery (`exit=0`):
```
2026-06-16 12:21:58Z | exit=-1
2026-06-16 12:22:39Z | exit=-1
2026-06-16 12:23:19Z | exit=-1
2026-06-16 12:23:59Z | exit=0
2026-06-16 12:24:29Z | exit=0
```
At time of capture: `docker ps` showed `Up 4 hours (unhealthy)` while `curl /health` returned `{"status":"healthy"}`. Shortly after: `Health=healthy FailingStreak=0`. `RestartCount=0`, `OOMKilled=false` — the service never went down.

## Impact

- Cosmetic today (service stays up; `/health` returns 200 throughout).
- **Risk:** any consumer of Docker health state (autoheal sidecar, compose `depends_on: condition: service_healthy`, external health-based routing) can act on the false `unhealthy` and restart/evict the backend mid-batch — which orphans in-flight executions (we already see "recovered on backend restart" rows after backend recreates).
- Misleads operators/monitoring that read container health rather than the `/health` endpoint.

## Suggested fix (config only, `docker-compose.prod.yml`)

Relax the probe so it tolerates transient load without hiding real outages:
- `timeout: 10s → 30s`
- `start_period: 10s → 30–60s`
- optionally `retries: 3 → 5`

A dedicated lightweight liveness route (no DB/work touch) would be even more robust, but the timeout bump alone should eliminate the flapping.

## Related
- #105 / #131 — uvicorn worker recycle (`--timeout-worker-healthcheck`); already applied here (0.37.0 + 120s) but does not address probe timeout
- #408 — long-running backend→agent calls dropped on worker recycle
- #904 — agent OOM cascades into backend worker saturation (same load-induced-unresponsiveness family)

Filed by trinity-ops-agent from live eu2 observation (2026-06-16).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: backend Docker healthcheck flaps to "unhealthy" under batch load (10s timeout too tight for 2 workers) #1230

Summary

Healthcheck definition (current)

What happens

Evidence (eu2, captured live)

Impact

Suggested fix (config only, `docker-compose.prod.yml`)

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

bug: backend Docker healthcheck flaps to "unhealthy" under batch load (10s timeout too tight for 2 workers) #1230

Description

Summary

Healthcheck definition (current)

What happens

Evidence (eu2, captured live)

Impact

Suggested fix (config only, docker-compose.prod.yml)

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Suggested fix (config only, `docker-compose.prod.yml`)