sync-job stale guard incorrectly marks healthy queued jobs as 'failed' (background-busy is mislabelled) #7469

@mdmohsin7

cc @beastoin

Note: the UID quoted below is the reporter's own account, shared intentionally — safe to leave in public.

Problem

backend/database/sync_jobs.py rewrites a job's status to failed with the error string "Job timed out (background worker likely died)" whenever time.time() - updated_at > STALE_THRESHOLD_SECONDS (600s) and the job is still in queued or processing.

# Stale-job detection: if processing for too long, mark failed
if job['status'] in ('queued', 'processing'):
    updated_at = job.get('updated_at') or job.get('created_at', 0)
    if time.time() - updated_at > STALE_THRESHOLD_SECONDS:
        job['status'] = 'failed'
        job['error'] = 'Job timed out (background worker likely died)'

That doesn't distinguish two very different situations:

  1. A job was actually picked up by a worker (status='processing') and the worker died mid-pipeline. Correctly failed.
  2. A job is still status='queued' — never picked up by any worker because the pools are saturated. Not a failure; the backend just hasn't had capacity.

Today both get the same failed + "background worker likely died" treatment. For case 2 the message is also factually wrong (no worker died — none was available).
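To make the conflation concrete, here is a minimal sketch that wraps the quoted check in a standalone function and feeds it one job of each kind. The function name `current_stale_guard`, the fixed `now` value, and the two sample job dicts are assumptions made for illustration; only the check itself and the 600s threshold come from the snippet above.

```python
STALE_THRESHOLD_SECONDS = 600  # threshold quoted in this issue


def current_stale_guard(job: dict, now: float) -> dict:
    """Replica of the quoted check (hypothetical wrapper, not the real function)."""
    job = dict(job)  # copy so the sample inputs stay untouched
    if job['status'] in ('queued', 'processing'):
        updated_at = job.get('updated_at') or job.get('created_at', 0)
        if now - updated_at > STALE_THRESHOLD_SECONDS:
            job['status'] = 'failed'
            job['error'] = 'Job timed out (background worker likely died)'
    return job


now = 1_000_000.0  # fixed clock so the example is deterministic

# Case 1: a worker picked the job up, then went silent for more than 600s.
dead_worker = {'status': 'processing', 'updated_at': now - 601}
# Case 2: no worker ever picked the job up; it has only a creation time.
never_picked_up = {'status': 'queued', 'created_at': now - 601}

a = current_stale_guard(dead_worker, now)
b = current_stale_guard(never_picked_up, now)

# Both cases come back with an identical status and error string.
print(a['status'], '|', a['error'])
print(b['status'], '|', b['error'])
```

A client reading either result has no way to tell which of the two situations it is looking at.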

Impact (client-side)

From a real user log (UID WmUjAoglahVanVihp1AD07ZZvbH2 — reporter's own account):

12:09:43  job ceee6bb3 → "queued", 0/0 segments
12:10:03  job 1a805f6d → "queued", 0/0
12:14:21  job ceee6bb3 → "queued"  (~5 min in)
12:16:21  job ceee6bb3 → "queued"
12:18:22  job ceee6bb3 → "queued"
12:20:22  REVERT job 64457303 → "failed", error="Job timed out (background worker likely died)"

Pipeline reality at the time (also in service logs):

WARNING:utils.executors:executor_pool_health:
  [{'name': 'storage', 'active_count': 96, 'max_workers': 96, 'queue_depth': 29, 'utilization_pct': 100.0},
   {'name': 'postprocess', 'active_count': 24, 'max_workers': 24, 'queue_depth': 80, 'utilization_pct': 100.0}]

So uploads are accepted (202), assigned a jobId, and sit in the queue because workers are saturated. As soon as they pass the 10-minute mark (STALE_THRESHOLD_SECONDS = 600s), the stale guard flips them to failed. The client's reconciler reads failed and (until the PR below) reverted the WAL and bumped retry, surfacing as "Couldn't process — retrying". Users tap Retry → new job → same cycle → retry budget exhausted → red "Failed — tap Retry".

What PR #7451 does on the client

Shipped a workaround in feat/sync-rate-limit-handling so users stop seeing this as a failure:

  • Reconciler detects the specific error string "background worker likely died" and treats it as backend capacity, not a content failure.
  • WAL reverts to miss without bumping retryCount → row stays calm grey "Waiting to sync".
  • Triggers the global rate-limit cooldown (10 min) so we stop submitting more jobs that will also stale out and add to the backlog.
  • Status card surfaces a distinct message: "Omi servers are busy — your recordings will sync once capacity returns".

The client mitigation works, but it's pattern-matching a server error string. The right fix is on the backend.
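The reconciler behaviour listed above can be sketched roughly as follows. This is an illustration in Python, not the actual client code: `WalRow`, `SyncState`, `reconcile_failed_job`, and the `backend_busy` flag are hypothetical names, and the assumption that any other error bumps the retry counter is inferred from the pre-PR behaviour described earlier. The matched substring, the revert to miss, and the 10-minute cooldown come from the list above.

```python
from dataclasses import dataclass

# Substring the issue says the client matches on.
CAPACITY_MARKER = "background worker likely died"
# Global rate-limit cooldown described in the issue (10 min).
COOLDOWN_SECONDS = 10 * 60


@dataclass
class WalRow:  # hypothetical stand-in for one WAL entry
    state: str = "uploaded"
    retry_count: int = 0


@dataclass
class SyncState:  # hypothetical stand-in for global sync state
    cooldown_until: float = 0.0
    backend_busy: bool = False


def reconcile_failed_job(row: WalRow, sync: SyncState, error: str, now: float) -> None:
    """Handle a job the server reported as failed."""
    row.state = "miss"  # the WAL entry is reverted in both branches
    if CAPACITY_MARKER in (error or ""):
        # Backend capacity, not a content failure: leave retry_count alone,
        # pause new submissions, and show the "servers are busy" message.
        sync.cooldown_until = now + COOLDOWN_SECONDS
        sync.backend_busy = True
    else:
        # Assumed: any other failure still counts against the retry budget.
        row.retry_count += 1


row, sync = WalRow(), SyncState()
reconcile_failed_job(row, sync, "Job timed out (background worker likely died)", now=1000.0)
print(row, sync)
```

The sketch also shows why the approach is fragile: if the server ever rewords the error, the substring check silently falls through to the retry-bumping branch.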

Proposed backend fixes (order of impact)

  1. Capacity — bump storage and postprocess pool sizes in utils/executors.py, or raise the backend-sync Cloud Run max-instance / concurrency. Saturation is the actual root cause.

  2. Stale guard should distinguish queued vs. processing in database/sync_jobs.py:get_sync_job:

    • status == 'processing' and stale → keep as failed with the current "worker likely died" message (the original intent).
    • status == 'queued' and stale → either leave it queued (let the natural 24h TTL clean it up), or use a much longer threshold, or mark with a distinct status / error like 'queued_too_long' so the client can react differently. A genuinely never-picked-up job is not a job failure.

  3. Fix the error string for the queued case even if you keep flipping to failed: something like 'Backend capacity exhausted before pickup' is honest and lets log readers find it.

  4. Observability: log the stale-guard rewrites in get_sync_job (currently silent — the transition is invisible in stdout), and include the UID in the POST 202 accept log so per-user job traces are possible without waiting for completion.


Labels: backend (Backend Task, python), bug (Something isn't working), capture (Layer: Audio recording, device pairing, BLE), p1 (Priority: Critical, score 22-29)
