fix(webapp): stop locked-version triggers failing on stale replica reads #3930
A locked-version trigger such as triggerAndWait resolved the task's metadata from the read replica and, on a miss, threw a non-retryable "task not found on locked version" even though the task was registered. A read replica can return an empty result for a row that already exists on the primary, so this surfaced as intermittent, self-recovering trigger failures. The locked worker is already resolved on the primary in the same request, so the resolver now re-checks the primary when the replica returns no row, and only reports the task missing when the primary genuinely lacks it. This runs on the cache-miss path only and leaves the hot path unchanged.
Walkthrough
This PR fixes locked-version trigger failures caused by read-replica staleness. When DefaultQueueManager resolves task metadata on a cache miss, it now queries the read replica first; if no row is found and a separate writer database client exists, it falls back to the writer and logs a staleness warning. A new private helper method
## Summary

Adds observability to the task metadata cache that backs the trigger hot path. Follow-up to #3930, which made locked-version triggers fall back to the primary when the read replica returns no row; this makes the cache's effectiveness (and that fallback) measurable instead of inferred.

## What it emits

A single bounded counter `task_meta_cache.resolve`, labeled by lookup path (`locked` / `current`) and the source that satisfied it (`cache` / `replica` / `writer` / `miss`):

- `cache / total` is the cache hit rate (its complement, `1 - cache / total`, is how cold the cache runs).
- `writer / total` is how often the read replica returned empty for a row the primary had (the condition #3930 recovers from).

Labels are bounded, with no per-env / worker / slug cardinality.

TRI-10873
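The two ratios above can be derived from the counter's per-source totals. The following is a minimal sketch, assuming the totals have already been read out of the metrics backend into a plain object; the `SourceCounts` type, the `resolveRates` function, and the sample numbers are all hypothetical and not taken from the PR.

```typescript
// Hypothetical per-source totals for one lookup path of task_meta_cache.resolve.
type SourceCounts = { cache: number; replica: number; writer: number; miss: number };

// Derive the two ratios described in the PR text from raw totals.
function resolveRates(c: SourceCounts): { hitRate: number; staleReplicaRate: number } {
  const total = c.cache + c.replica + c.writer + c.miss;
  if (total === 0) {
    // No lookups recorded yet; avoid dividing by zero.
    return { hitRate: 0, staleReplicaRate: 0 };
  }
  return {
    hitRate: c.cache / total, // cache / total
    staleReplicaRate: c.writer / total, // writer / total
  };
}

// Invented sample: 1000 lookups, 900 served from cache, 5 recovered via the writer.
const sample = resolveRates({ cache: 900, replica: 90, writer: 5, miss: 5 });
// sample.hitRate is 0.9 and sample.staleReplicaRate is 0.005
```

A sustained rise in `staleReplicaRate` would point at replica staleness, while a falling `hitRate` would point at the cache itself.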
Summary
`triggerAndWait` (and other locked-version triggers) could intermittently fail with `Task '<id>' not found on locked version '<version>'` for a task that was registered on that version. The failures came in bursts and recovered on their own, so a retry minutes later would succeed.
Root cause
For a locked-version trigger, the queue resolver looks up the task's `BackgroundWorkerTask` metadata from the read replica (behind a Redis cache). On a cache miss it queried the replica, and a `null` result was treated as "task not registered" and turned into a non-retryable 422. A read replica can return an empty result for a row that already exists on the primary, so a momentarily-behind replica produced a false negative even though the locked worker (resolved on the primary in the same request) clearly had the task.
Fix
On a cache miss, when the replica returns no row the resolver now re-checks the primary before concluding the task is missing. If the primary has the row it is used (and the cache is back-filled); the error fires only when the primary genuinely lacks it, which is the only case where the 422 is correct. The extra read happens on the cache-miss-and-replica-empty path only, so the hot path is unchanged.
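The fallback order described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's actual code: `TaskMeta`, `Lookup`, and `resolveOnCacheMiss` are hypothetical names, the two database clients are modeled as plain async functions, and the cache back-fill and warning log are reduced to comments.

```typescript
// Hypothetical shape of the task metadata row; field names are illustrative.
type TaskMeta = { slug: string; queueName: string };

// A lookup resolves to the row, or null when that database returns no row.
type Lookup = (workerId: string, slug: string) => Promise<TaskMeta | null>;

type Source = "replica" | "writer" | "miss";

// Cache-miss path only: try the replica first, then re-check the primary
// before concluding that the task is missing on the locked version.
async function resolveOnCacheMiss(
  replica: Lookup,
  writer: Lookup | undefined, // undefined when there is no separate writer client
  workerId: string,
  slug: string
): Promise<{ meta: TaskMeta | null; source: Source }> {
  const fromReplica = await replica(workerId, slug);
  if (fromReplica) {
    return { meta: fromReplica, source: "replica" };
  }

  if (writer) {
    const fromWriter = await writer(workerId, slug);
    if (fromWriter) {
      // The replica was behind the primary for this row. A real resolver
      // would log a staleness warning and back-fill the cache here.
      return { meta: fromWriter, source: "writer" };
    }
  }

  // Only now is "task not found on locked version" the correct answer.
  return { meta: null, source: "miss" };
}
```

The extra primary read happens only after both the cache and the replica come back empty, so requests served from the cache or the replica pay no additional cost.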
Observability
Adds a `task_meta_cache.resolve` counter on the trigger path, labeled by lookup path (`locked` / `current`) and the source that satisfied it (`cache` / `replica` / `writer` / `miss`). `cache / total` is the cache hit rate; `writer / total` is how often the read replica returned empty for a row the primary had. Bounded labels only, no per-env / worker / slug cardinality.
Verified with a unit test (replica stub vs. real primary) and end-to-end against a local streaming replica with replication paused to reproduce the stale read.
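One way to keep such a counter's labels bounded is to restrict them to closed union types, so the number of distinct series is fixed at compile time. The sketch below is an in-memory stand-in under that assumption; `ResolveCounter` is a hypothetical class, and the PR's real counter presumably goes through the application's metrics library rather than a `Map`.

```typescript
// Closed label sets taken from the PR description.
type LookupPath = "locked" | "current";
type ResolveSource = "cache" | "replica" | "writer" | "miss";

// In-memory stand-in for a labeled counter (hypothetical, for illustration).
class ResolveCounter {
  private counts = new Map<string, number>();

  // Increment the series for one (path, source) pair.
  record(path: LookupPath, source: ResolveSource): void {
    const key = `${path}:${source}`;
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }

  // Current value of one series; zero if it was never recorded.
  get(path: LookupPath, source: ResolveSource): number {
    return this.counts.get(`${path}:${source}`) ?? 0;
  }

  // Number of distinct series seen so far; at most 2 paths * 4 sources = 8.
  seriesCount(): number {
    return this.counts.size;
  }
}

const counter = new ResolveCounter();
counter.record("locked", "cache");
counter.record("locked", "cache");
counter.record("locked", "writer");
```

Because environment IDs, worker IDs, and task slugs are not valid label values here, the series count cannot grow with the number of tenants or tasks.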
TRI-10868