Sync infra changes — May 28: storage pool, autoscaling, and app-side rate-limit fixes #7531

@mdmohsin7

Context

Users were reporting offline sync issues — files sitting in backlog for days, "Fair-use limit reached" 429 popups, and "Omi servers are busy" banners persisting even when the backend was healthy. Investigation today traced the issues to a combination of fair-use enforcement, Cloud Run autoscaling behavior, and shared storage_executor saturation between sync and playback workloads.

This issue documents the changes made today and the remaining work.

Diagnosis (what we found in logs/metrics)

  1. Fair-use 429s — HTTP 429 fires from is_hard_restricted() in backend/utils/fair_use.py:437 when a user is in stage='restrict'. About 14 users each hit 1000+ 429s over the last 24h. Free-tier and paid users are currently in the same bucket; there is no paid-user bypass yet.

  2. Cloud Run autoscaler underscaling for background work — backend-sync was running 5–10 instances against a 25-instance ceiling. The autoscaler signals on HTTP concurrency + CPU, but the sync v2 pipeline returns 202 in ~0.2s and does the heavy, I/O-bound work in a background task. Result: low CPU + low HTTP concurrency → no scale-up, while the per-instance worker pools were pegged.

  3. Load-balancer concentration — One instance was absorbing 84% of HTTP requests because the LB sends to the "least busy" instance by HTTP concurrency, and our fast 202 path keeps HTTP concurrency low. The background work then stayed pinned to that same instance.

  4. storage_executor saturation — The 96-worker pool was at 96–100% util with queue depth up to 67. Shared between sync pipeline GCS work and playback flows (audio_merge, precache). 48h logs showed 21,236 audio_merge events — 77% speculative warming (process_conversation auto-precache + /precache endpoint), 23% on-demand playback (/urls).

  5. App-side stale backendBusy state — SyncRateLimiter persisted both rateLimit and backendBusy cooldowns to SharedPreferences, so the "Omi servers are busy" banner survived app restarts even after the backend recovered (verified: 0 stale-guard fires in the last 7 days, yet some users still saw the banner).
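The utilization and queue-depth figures in item 4 are the kind of numbers a thin wrapper around a thread pool can report. The following is a minimal sketch of such a wrapper, not the code behind storage_executor; the class name MeteredExecutor and its stats() shape are hypothetical.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class MeteredExecutor:
    """Hypothetical wrapper that tracks running and waiting tasks."""

    def __init__(self, max_workers: int):
        self.max_workers = max_workers
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self._active = 0  # tasks currently running on a worker thread
        self._queued = 0  # tasks submitted but not yet picked up

    def submit(self, fn, *args, **kwargs) -> Future:
        with self._lock:
            self._queued += 1

        def tracked():
            # Runs on the worker thread: move the task from queued to active.
            with self._lock:
                self._queued -= 1
                self._active += 1
            try:
                return fn(*args, **kwargs)
            finally:
                with self._lock:
                    self._active -= 1

        return self._pool.submit(tracked)

    def stats(self) -> dict:
        with self._lock:
            return {
                "utilization": self._active / self.max_workers,
                "queue_depth": self._queued,
            }

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)
```

Logging stats() periodically, or whenever utilization reaches 1.0, would give a signal that stays visible even when CPU and HTTP concurrency look idle.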

Changes shipped today

Cloud Run service config (backend-sync, no code change)

| Setting | Before | After |
| --- | --- | --- |
| containerConcurrency | 12 | 6 |
| minScale | 1 | 10 |
| maxScale | 25 | 25 (unchanged) |
| CPU | 2 vCPU | 2 vCPU (unchanged) |
| Memory | 8 GiB | 8 GiB (unchanged) |

Rationale: lower concurrency forces the LB to spread HTTP across instances; minScale=10 keeps the fleet warm at the peak we already observed (peak was 10, never approached the 25 ceiling). Result: HTTP routing went from one instance at 84% → 8.9–11.6% per instance across 10 instances.
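The arithmetic behind this trade-off can be sketched as follows. This is an illustration only: ServiceConfig and its methods are hypothetical helpers, and the numbers are the before/after settings listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceConfig:
    """Hypothetical container for the three Cloud Run settings discussed."""
    container_concurrency: int
    min_scale: int
    max_scale: int

    def request_slots(self, instances: int) -> int:
        # Max concurrent HTTP requests the fleet accepts at this instance count.
        return self.container_concurrency * instances


def even_share_pct(instances: int) -> float:
    # Share of traffic per instance if requests were spread perfectly evenly.
    return 100.0 / instances


before = ServiceConfig(container_concurrency=12, min_scale=1, max_scale=25)
after = ServiceConfig(container_concurrency=6, min_scale=10, max_scale=25)

warm_before = before.request_slots(before.min_scale)    # 12 slots on 1 instance
warm_after = after.request_slots(after.min_scale)       # 60 slots on 10 instances
ceiling_before = before.request_slots(before.max_scale)  # 300 slots
ceiling_after = after.request_slots(after.max_scale)     # 150 slots
```

An even split across 10 instances is 10% each, which matches the observed 8.9–11.6% range. The cost is that the theoretical HTTP ceiling at maxScale drops from 300 to 150 concurrent requests.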

Backend code

App code (mobile)

  • app(sync): keep backendBusy cooldown in-memory only #7527 — SyncRateLimiter keeps backendBusy cooldowns in-memory only; only rateLimit (server-side fair-use) is persisted. Constructor clears any pre-existing persisted backendBusy entry so users upgrading from older versions get unstuck immediately. Ships in the next mobile release.
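The persistence split described above can be sketched in Python, even though the real class is Dart. This is a rough model, not the code from #7527: the storage keys, method names, and the plain dict standing in for SharedPreferences are all assumptions.

```python
import time
from typing import Callable, Optional

# Hypothetical storage keys; the real names in sync_rate_limiter.dart may differ.
RATE_LIMIT_KEY = "sync_rate_limit_until"
BACKEND_BUSY_KEY = "sync_backend_busy_until"


class SyncRateLimiter:
    """Python model of the behaviour; `prefs` stands in for SharedPreferences."""

    def __init__(self, prefs: dict, now: Callable[[], float] = time.time):
        self._prefs = prefs
        self._now = now
        self._backend_busy_until: Optional[float] = None  # in-memory only
        # Drop any backendBusy entry written by an older app version.
        self._prefs.pop(BACKEND_BUSY_KEY, None)

    def on_rate_limited(self, cooldown_s: float) -> None:
        # Server-side fair-use limit: persisted, so it survives restarts.
        self._prefs[RATE_LIMIT_KEY] = self._now() + cooldown_s

    def on_backend_busy(self, cooldown_s: float) -> None:
        # Transient backend pressure: never written to disk.
        self._backend_busy_until = self._now() + cooldown_s

    def is_rate_limited(self) -> bool:
        return self._now() < self._prefs.get(RATE_LIMIT_KEY, 0.0)

    def is_backend_busy(self) -> bool:
        until = self._backend_busy_until
        return until is not None and self._now() < until
```

Constructing a new SyncRateLimiter over the same prefs models an app restart: the rateLimit cooldown is still in effect, while the backendBusy state is gone.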

Measured impact (rev 00617 vs baseline, 25-min windows)

| Metric | Baseline (start of day) | Today (rev 617) | Δ |
| --- | --- | --- | --- |
| Instances active | 5–9 (max 10) | 10 (floor) | floor raised |
| HTTP distribution top instance | 84% | 11.6% | spread |
| sync_v2 bg complete decode_ms p95 | 252s | 68s | −73% |
| sync_v2 bg complete total_ms p95 | 297s | 89s | −70% |
| Storage pool at 100% util | 92% of warnings | 61% of warnings | less peak saturation |
| Storage pool queue p50 | 19 | 8 | −58% |
| Memory p99 per inst | 96–100% | 13% | down |
| CPU p99 per inst | 94% | 63% | down |
| 5xx errors | 0 | 0 | unchanged |
| Fair-use 429s | ~15% of /v2/sync-local-files | unchanged | not addressed |
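The percentage deltas can be re-derived from the before/after values. A small sketch, with pct_change as a hypothetical helper name:

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (metric, baseline, today) taken from the rows that report a percentage.
rows = [
    ("decode_ms p95 (s)", 252, 68),
    ("total_ms p95 (s)", 297, 89),
    ("storage pool queue p50", 19, 8),
]

deltas = {name: round(pct_change(b, a)) for name, b, a in rows}
```

The rounded results agree with the reported −73%, −70%, and −58%.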

Known issues / next steps (not done today)

  • Fair-use 429s for paid users — paid-user bypass for is_hard_restricted() not yet shipped. ~14 heavy users still locked out. Should be a small backend PR (early-return False if is_paid_plan(subscription)).
  • No Retry-After header on 429s — backend's 429 has no Retry-After, so the app falls back to a 30-min default cooldown instead of a server-driven value. Backend change.
  • Per-device cooldown — SyncRateLimiter cooldown is per-install via SharedPreferences, not synced across a user's iPhone/iPad/desktop. Multi-device users hit 429 on each device separately.
  • Precache is still on storage_executor — 128 buys us time but the architectural fix is to move precache to an async queue (Cloud Tasks / Pub/Sub) so it doesn't compete with sync hot path. ~1–2 days work.
  • Sync v2 pipeline doesn't propagate private_cloud_sync_enabled to conversations — process_segment calls CreateConversation(...) without it, so offline-synced conversations have audio_files = [] and aren't playable. Likely a bug; fixing it would also dump precache load onto sync (which is why we should ship the queue solution first).
  • Audio merge cache hit rate — currently logged at DEBUG level (invisible in prod). Worth bumping to INFO and measuring; if it's low, we're doing repeat merge work needlessly.
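The first two items could look roughly like the sketch below. It is not the code in backend/utils/fair_use.py: FairUseState, its restrict_remaining_s field, the is_paid flag (standing in for is_paid_plan(subscription)), and the helper names are all assumptions. Only the 30-minute fallback comes from the text above.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

DEFAULT_COOLDOWN_S = 30 * 60  # the app's current fallback, per the issue text


@dataclass
class FairUseState:
    """Hypothetical view of a user's fair-use record."""
    stage: str                     # e.g. "restrict"
    restrict_remaining_s: int = 0  # assumed: seconds until the restriction lifts


def is_hard_restricted(state: FairUseState, is_paid: bool) -> bool:
    # Proposed bypass: paid users are never hard-restricted.
    if is_paid:
        return False
    return state.stage == "restrict"


def build_429(state: FairUseState) -> Tuple[int, Dict[str, str]]:
    # Backend side: attach a server-driven Retry-After (delta-seconds form).
    return 429, {"Retry-After": str(max(1, state.restrict_remaining_s))}


def client_cooldown_s(headers: Dict[str, str]) -> int:
    # App side: prefer the server value, else fall back to the default.
    # Simplified: exact-case header lookup, and the HTTP-date form of
    # Retry-After is not parsed.
    raw = headers.get("Retry-After")
    if raw is not None and raw.strip().isdigit():
        return int(raw.strip())
    return DEFAULT_COOLDOWN_S
```

With a header present the client waits exactly as long as the server asks; without one it behaves as it does today.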

Quick reference — files touched today

  • backend/utils/other/storage.py — _PRECACHE_FILE_SEM
  • backend/utils/executors.py — storage_executor max_workers
  • app/lib/services/wals/sync_rate_limiter.dart — backendBusy persistence

Cloud Run config changes are not in source — they live on the service spec (gcloud run services describe backend-sync --region=us-central1).


Posted by Caleb (AI agent) on behalf of Mohsin
