sync-job stale guard incorrectly marks healthy queued jobs as 'failed' (background-busy is mislabelled) #7469

cc @beastoin

> Note: the UID quoted below is the reporter's own account, shared intentionally — safe to leave in public.

## Problem

`backend/database/sync_jobs.py` rewrites a job's status to `failed` with the error string `"Job timed out (background worker likely died)"` whenever `time.time() - updated_at > STALE_THRESHOLD_SECONDS` (600s) and the job is still in `queued` or `processing`.

```python
# Stale-job detection: if processing for too long, mark failed
if job['status'] in ('queued', 'processing'):
    updated_at = job.get('updated_at') or job.get('created_at', 0)
    if time.time() - updated_at > STALE_THRESHOLD_SECONDS:
        job['status'] = 'failed'
        job['error'] = 'Job timed out (background worker likely died)'
```

That doesn't distinguish two very different situations:

1. A job was actually picked up by a worker (`status='processing'`) and the worker died mid-pipeline. **Correctly failed.**
2. A job is still `status='queued'` — never picked up by any worker because the pools are saturated. **Not a failure; the backend just hasn't had capacity.**

Today both get the same `failed` + `"background worker likely died"` treatment. For case 2 the message is also factually wrong (no worker died — none was available).
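
To make the conflation concrete, here is a minimal, self-contained sketch that wraps the guard quoted above in a helper. The name `apply_stale_guard`, the `now` parameter, and the sample job dicts are assumptions for illustration only; the real code in `sync_jobs.py` may be structured differently.

```python
import time
from typing import Optional

STALE_THRESHOLD_SECONDS = 600  # value quoted in this issue


def apply_stale_guard(job: dict, now: Optional[float] = None) -> dict:
    """Hypothetical wrapper around the quoted guard; mutates and returns `job`."""
    now = time.time() if now is None else now
    if job['status'] in ('queued', 'processing'):
        updated_at = job.get('updated_at') or job.get('created_at', 0)
        if now - updated_at > STALE_THRESHOLD_SECONDS:
            job['status'] = 'failed'
            job['error'] = 'Job timed out (background worker likely died)'
    return job


now = 10_000.0  # fixed clock so the example is deterministic

# Case 1: a worker picked the job up, then went silent for 11 minutes.
died = apply_stale_guard({'status': 'processing', 'updated_at': now - 660}, now)

# Case 2: no worker ever picked the job up; it has only waited 11 minutes.
waiting = apply_stale_guard({'status': 'queued', 'created_at': now - 660}, now)

# Both cases come out with the same status and the same error string,
# so a client cannot tell them apart.
same_status = died['status'] == waiting['status']
same_error = died['error'] == waiting['error']
```

Under these assumptions `same_status` and `same_error` are both true, which is exactly the ambiguity described above.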

## Impact (client-side)

From a real user log (UID `WmUjAoglahVanVihp1AD07ZZvbH2` — reporter's own account):

```
12:09:43  job ceee6bb3 → "queued", 0/0 segments
12:10:03  job 1a805f6d → "queued", 0/0
12:14:21  job ceee6bb3 → "queued"  (~5 min in)
12:16:21  job ceee6bb3 → "queued"
12:18:22  job ceee6bb3 → "queued"
12:20:22  REVERT job 64457303 → "failed", error="Job timed out (background worker likely died)"
```

Pipeline reality at the time (also in service logs):
```
WARNING:utils.executors:executor_pool_health:
  [{'name': 'storage', 'active_count': 96, 'max_workers': 96, 'queue_depth': 29, 'utilization_pct': 100.0},
   {'name': 'postprocess', 'active_count': 24, 'max_workers': 24, 'queue_depth': 80, 'utilization_pct': 100.0}]
```

So uploads are accepted (202), assigned a `jobId`, sit in the queue because workers are saturated, and exactly at the 10-minute mark the stale guard flips them to `failed`. The client's reconciler reads `failed` and (until the PR below) reverted the WAL and bumped retry, surfacing as **"Couldn't process — retrying"**. Users tap Retry → new job → same cycle → retry budget exhausted → red **"Failed — tap Retry"**.

## What the client PR (#7451) does

PR #7451 ships a workaround in `feat/sync-rate-limit-handling` so users stop seeing this as a failure:

- Reconciler detects the specific error string `"background worker likely died"` and treats it as backend capacity, **not** a content failure.
- WAL reverts to `miss` **without bumping `retryCount`** → row stays calm grey **"Waiting to sync"**.
- Triggers the global rate-limit cooldown (10 min) so we stop submitting more jobs that will also stale out and add to the backlog.
- Status card surfaces a distinct message: **"Omi servers are busy — your recordings will sync once capacity returns"**.

The client mitigation works, but it's pattern-matching a server error string. The right fix is on the backend.
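
The reconciler rule described in the bullets can be sketched as follows. This is written in Python only to match the backend snippet in this issue; the actual client code is not shown here and will differ. `WalEntry`, its `pending` starting status, and `reconcile_failed_job` are hypothetical names, while the matched substring, the `miss` status, and the 10-minute cooldown come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Substring the reconciler matches on, per this issue.
BACKEND_BUSY_MARKER = 'background worker likely died'
COOLDOWN_SECONDS = 600  # the 10-minute global cooldown described above


@dataclass
class WalEntry:
    """Hypothetical stand-in for one client WAL row."""
    status: str = 'pending'  # assumed starting state, for illustration
    retry_count: int = 0


def reconcile_failed_job(entry: WalEntry, error: Optional[str], now: float) -> Optional[float]:
    """Revert the WAL row; return a cooldown deadline if the backend looks busy."""
    entry.status = 'miss'
    if BACKEND_BUSY_MARKER in (error or ''):
        # Capacity problem: do not bump retry_count, start the cooldown instead.
        return now + COOLDOWN_SECONDS
    # Any other failure counts against the retry budget.
    entry.retry_count += 1
    return None


busy = WalEntry()
deadline = reconcile_failed_job(
    busy, 'Job timed out (background worker likely died)', now=100.0
)

broken = WalEntry()
no_deadline = reconcile_failed_job(broken, 'Unsupported audio format', now=100.0)
```

The weakness is visible in the `in` check: if the backend ever rewords the error, the match silently stops working and the row goes back to burning retries.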

## Proposed backend fixes (order of impact)

1. **Capacity** — bump `storage` and `postprocess` pool sizes in `utils/executors.py`, or raise the `backend-sync` Cloud Run max-instance / concurrency. Saturation is the actual root cause.

2. **Stale guard should distinguish queued vs. processing** in `backend/database/sync_jobs.py:get_sync_job`:
   - **`status == 'processing'` and stale** → keep as `failed` with the current "worker likely died" message (the original intent).
   - **`status == 'queued'` and stale** → either leave it queued (let the natural 24h TTL clean it up), or use a much longer threshold, or mark with a distinct status / error like `'queued_too_long'` so the client can react differently. A genuinely never-picked-up job is not a job failure.

3. **Fix the error string for the queued case** even if you keep flipping to `failed` — something like `'Backend capacity exhausted before pickup'` is honest and lets log readers find it.

4. **Observability**: log the stale-guard rewrites in `get_sync_job` (currently silent — the transition is invisible in stdout), and include the UID in the POST 202 accept log so per-user job traces are possible without waiting for completion.
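
A rough sketch of how items 2 to 4 could fit together, offered as a starting point rather than a drop-in patch. The helper name `apply_stale_guard_v2`, the `fail_stale_queued` flag, and the `job.get('id')` key are assumptions; only the threshold, the two statuses, and the two error strings come from this issue.

```python
import logging
import time
from typing import Optional

logger = logging.getLogger(__name__)

STALE_THRESHOLD_SECONDS = 600  # existing threshold quoted in this issue
WORKER_DIED_ERROR = 'Job timed out (background worker likely died)'
CAPACITY_ERROR = 'Backend capacity exhausted before pickup'  # wording from item 3


def apply_stale_guard_v2(
    job: dict,
    now: Optional[float] = None,
    fail_stale_queued: bool = False,  # hypothetical switch between the item 2 options
) -> dict:
    """Treat stale 'processing' and stale 'queued' jobs differently."""
    now = time.time() if now is None else now
    updated_at = job.get('updated_at') or job.get('created_at', 0)
    age = now - updated_at
    if age <= STALE_THRESHOLD_SECONDS:
        return job

    if job['status'] == 'processing':
        # A worker picked this up and went silent: the original intent of the guard.
        job['status'] = 'failed'
        job['error'] = WORKER_DIED_ERROR
        logger.warning('stale guard: processing job %s failed after %.0fs', job.get('id'), age)
    elif job['status'] == 'queued':
        if fail_stale_queued:
            # Still failing the job, but with an honest, searchable message.
            job['status'] = 'failed'
            job['error'] = CAPACITY_ERROR
        # Otherwise leave it queued and let the TTL clean it up.
        logger.warning('stale guard: job %s queued for %.0fs without pickup', job.get('id'), age)
    return job


now = 10_000.0
processing = apply_stale_guard_v2({'status': 'processing', 'updated_at': now - 660}, now)
queued = apply_stale_guard_v2({'status': 'queued', 'created_at': now - 660}, now)
queued_failed = apply_stale_guard_v2(
    {'status': 'queued', 'created_at': now - 660}, now, fail_stale_queued=True
)
```

With the flag left off, a stale queued job keeps its `queued` status and only produces a log line; with it on, the job still fails but carries the capacity message, so clients no longer need to pattern-match the "worker likely died" text. Using a longer threshold for queued jobs or a separate `'queued_too_long'` status would be equally valid variations.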

## Links

- Client PR mitigating this: #7451 (`feat/sync-rate-limit-handling`)
- Specific commits handling this case:
  - reconciler detection + cooldown: `feat(sync): reconciler detects backend-busy stale-guard fails; no retryCount bump`
  - UI surfacing: `feat(sync): manual sync_page status card picks message by rate-limit reason`

🤖 Generated with [Claude Code](https://claude.com/claude-code)
