cc @beastoin
Note: the UID quoted below is the reporter's own account, shared intentionally — safe to leave in public.
Problem
backend/database/sync_jobs.py rewrites a job's status to failed with the error string "Job timed out (background worker likely died)" whenever time.time() - updated_at > STALE_THRESHOLD_SECONDS (600s) and the job is still in queued or processing.
    # Stale-job detection: if processing for too long, mark failed
    if job['status'] in ('queued', 'processing'):
        updated_at = job.get('updated_at') or job.get('created_at', 0)
        if time.time() - updated_at > STALE_THRESHOLD_SECONDS:
            job['status'] = 'failed'
            job['error'] = 'Job timed out (background worker likely died)'
That doesn't distinguish two very different situations:
- A job was actually picked up by a worker (status='processing') and the worker died mid-pipeline. Correctly failed.
- A job is still status='queued': never picked up by any worker because the pools are saturated. Not a failure; the backend just hasn't had capacity.
Today both get the same failed + "background worker likely died" treatment. For case 2 the message is also factually wrong (no worker died — none was available).
Impact (client-side)
From a real user log (UID WmUjAoglahVanVihp1AD07ZZvbH2 — reporter's own account):
12:09:43 job ceee6bb3 → "queued", 0/0 segments
12:10:03 job 1a805f6d → "queued", 0/0
12:14:21 job ceee6bb3 → "queued" (~5 min in)
12:16:21 job ceee6bb3 → "queued"
12:18:22 job ceee6bb3 → "queued"
12:20:22 REVERT job 64457303 → "failed", error="Job timed out (background worker likely died)"
Pipeline reality at the time (also in service logs):
WARNING:utils.executors:executor_pool_health:
[{'name': 'storage', 'active_count': 96, 'max_workers': 96, 'queue_depth': 29, 'utilization_pct': 100.0},
{'name': 'postprocess', 'active_count': 24, 'max_workers': 24, 'queue_depth': 80, 'utilization_pct': 100.0}]
So uploads are accepted (202), assigned a jobId, sit in the queue because workers are saturated, and exactly at the 10-minute mark the stale guard flips them to failed. The client's reconciler reads failed and (until the PR below) reverted the WAL and bumped retry, surfacing as "Couldn't process — retrying". Users tap Retry → new job → same cycle → retry budget exhausted → red "Failed — tap Retry".
What this PR (#7451) does on the client
Shipped a workaround in feat/sync-rate-limit-handling so users stop seeing this as a failure:
- Reconciler detects the specific error string "background worker likely died" and treats it as backend capacity, not a content failure.
- WAL reverts to miss without bumping retryCount → row stays calm grey "Waiting to sync".
- Triggers the global rate-limit cooldown (10 min) so we stop submitting more jobs that will also stale out and add to the backlog.
- Status card surfaces a distinct message: "Omi servers are busy — your recordings will sync once capacity returns".
The client mitigation works, but it's pattern-matching a server error string. The right fix is on the backend.
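The decision the reconciler makes can be sketched as follows. This is an illustration only, written in Python to match the backend snippet above; the actual client code is not shown in this issue, and `WalEntry`, `SyncState`, `reconcile_failed_job`, the `'uploaded'` starting state, and the returned labels are all hypothetical names. Only the matched substring, the revert to miss, the untouched retry count, and the 10-minute cooldown come from the description above.

```python
from dataclasses import dataclass

BACKEND_BUSY_MARKER = 'background worker likely died'  # substring matched by the workaround
COOLDOWN_SECONDS = 10 * 60  # global rate-limit cooldown described above


@dataclass
class WalEntry:
    status: str = 'uploaded'  # hypothetical state before reconciliation
    retry_count: int = 0


@dataclass
class SyncState:
    cooldown_until: float = 0.0  # no new submissions before this timestamp


def reconcile_failed_job(wal: WalEntry, state: SyncState, error: str, now: float) -> str:
    """Handle a job the server reported as failed (hypothetical helper)."""
    wal.status = 'miss'  # the recording must be re-submitted in either case
    if BACKEND_BUSY_MARKER in (error or ''):
        # Capacity problem: keep the retry budget intact and back off globally.
        state.cooldown_until = now + COOLDOWN_SECONDS
        return 'backend_busy'
    # Any other failure still consumes one retry.
    wal.retry_count += 1
    return 'failed'


wal, state = WalEntry(), SyncState()
outcome = reconcile_failed_job(
    wal, state, 'Job timed out (background worker likely died)', now=0.0
)
print(outcome, wal.status, wal.retry_count, state.cooldown_until)
```

The fragility is visible in the sketch: if the server ever rewords that message, the substring check silently stops matching and the client falls back to the retry-bumping path.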
Proposed backend fixes (order of impact)
- Capacity: bump storage and postprocess pool sizes in utils/executors.py, or raise the backend-sync Cloud Run max-instance / concurrency. Saturation is the actual root cause.
- Stale guard should distinguish queued vs. processing in database/sync_jobs.py:get_sync_job:
  - status == 'processing' and stale → keep as failed with the current "worker likely died" message (the original intent).
  - status == 'queued' and stale → either leave it queued (let the natural 24h TTL clean it up), or use a much longer threshold, or mark with a distinct status / error like 'queued_too_long' so the client can react differently. A genuinely never-picked-up job is not a job failure.
- Fix the error string for the queued case even if you keep flipping to failed: something like 'Backend capacity exhausted before pickup' is honest and lets log readers find it.
- Observability: log the stale-guard rewrites in get_sync_job (currently silent, so the transition is invisible in stdout), and include the UID in the POST 202 accept log so per-user job traces are possible without waiting for completion.
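One possible shape for the split guard, combining the "much longer threshold", the honest error string, and the logging suggestion. This is a sketch, not a patch: `apply_split_stale_guard`, `QUEUED_STALE_THRESHOLD_SECONDS` and its 6-hour value, the `now` parameter, and the `id` field used in the log line are assumptions made for illustration.

```python
import logging
import time
from typing import Optional

logger = logging.getLogger(__name__)

STALE_THRESHOLD_SECONDS = 600  # existing 10-minute threshold
QUEUED_STALE_THRESHOLD_SECONDS = 6 * 3600  # assumption: illustrative "much longer" value


def apply_split_stale_guard(job: dict, now: Optional[float] = None) -> dict:
    """Hypothetical replacement for the stale guard in get_sync_job."""
    now = time.time() if now is None else now
    status = job.get('status')
    if status not in ('queued', 'processing'):
        return job

    updated_at = job.get('updated_at') or job.get('created_at', 0)
    age = now - updated_at

    if status == 'processing' and age > STALE_THRESHOLD_SECONDS:
        # A worker held the job and went silent: the original intent of the guard.
        error = 'Job timed out (background worker likely died)'
    elif status == 'queued' and age > QUEUED_STALE_THRESHOLD_SECONDS:
        # Never picked up: report the capacity problem instead of blaming a worker.
        error = 'Backend capacity exhausted before pickup'
    else:
        return job  # still within its threshold; leave untouched

    job['status'] = 'failed'
    job['error'] = error
    # Make the rewrite visible in the service logs.
    logger.warning('stale guard: job %s %s -> failed after %.0fs (%s)',
                   job.get('id'), status, age, error)
    return job
```

With this shape, a job that has been queued for 11 minutes stays queued, while a job stuck in processing for the same time is still failed as before. Leaving queued jobs entirely to the 24h TTL would amount to dropping the `elif` branch.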
Links
- PR #7451 (feat/sync-rate-limit-handling)
- feat(sync): reconciler detects backend-busy stale-guard fails; no retryCount bump
- feat(sync): manual sync_page status card picks message by rate-limit reason
🤖 Generated with Claude Code