fix(webapp): stop locked-version triggers failing on stale replica reads#3930

Merged
ericallam merged 1 commit into main from fix/locked-version-trigger-stale-replica
Jun 12, 2026
Conversation

@ericallam ericallam commented Jun 12, 2026

Member

Summary

`triggerAndWait` (and other locked-version triggers) could intermittently fail with `Task '<id>' not found on locked version '<version>'` for a task that was registered on that version. The failures came in bursts and recovered on their own, so a retry minutes later would succeed.

Root cause

For a locked-version trigger, the queue resolver looks up the task's BackgroundWorkerTask metadata from the read replica (behind a Redis cache). On a cache miss it queried the replica, and a null result was treated as "task not registered" and turned into a non-retryable 422. A read replica can return an empty result for a row that already exists on the primary, so a momentarily-behind replica produced a false negative even though the locked worker (resolved on the primary in the same request) clearly had the task.

Fix

On a cache miss, when the replica returns no row the resolver now re-checks the primary before concluding the task is missing. If the primary has the row it is used (and the cache is back-filled); the error fires only when the primary genuinely lacks it, which is the only case where the 422 is correct. The extra read happens on the cache-miss-and-replica-empty path only, so the hot path is unchanged.
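
The lookup order described above can be sketched as follows. This is a minimal illustration, not the webapp's actual code: `resolveLockedTask`, `TaskRow`, `Lookup`, and the in-memory `Map` standing in for the Redis cache are all hypothetical names, and filling the cache on a replica hit is an assumption (the description only states that the primary result is back-filled).

```typescript
// Hypothetical shapes for illustration only.
type TaskRow = { slug: string; queueName: string };

// A lookup resolves to the row, or null when that database returns no row.
type Lookup = (workerId: string, slug: string) => Promise<TaskRow | null>;

type Source = "cache" | "replica" | "writer" | "miss";

interface ResolveResult {
  row: TaskRow | null;
  source: Source;
}

async function resolveLockedTask(
  cache: Map<string, TaskRow>, // stand-in for the Redis cache
  findOnReplica: Lookup,
  findOnPrimary: Lookup,
  workerId: string,
  slug: string
): Promise<ResolveResult> {
  const key = `${workerId}:${slug}`;

  // Hot path: a cache hit touches neither database.
  const cached = cache.get(key);
  if (cached) return { row: cached, source: "cache" };

  // Cache miss: ask the replica first.
  const fromReplica = await findOnReplica(workerId, slug);
  if (fromReplica) {
    cache.set(key, fromReplica); // assumption: replica hits are cached too
    return { row: fromReplica, source: "replica" };
  }

  // The replica returned no row. It may simply be behind, so re-check the primary.
  const fromPrimary = await findOnPrimary(workerId, slug);
  if (fromPrimary) {
    cache.set(key, fromPrimary); // back-fill so later lookups hit the cache
    return { row: fromPrimary, source: "writer" };
  }

  // Only now is the task considered missing; the caller would raise the 422 here.
  return { row: null, source: "miss" };
}
```

In this sketch the primary is read only on the branch where the cache missed and the replica returned nothing, which is why the cache-hit and replica-hit paths cost the same as before.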

Observability

Adds a `task_meta_cache.resolve` counter on the trigger path, labeled by lookup path (`locked` / `current`) and the source that satisfied it (`cache` / `replica` / `writer` / `miss`). `cache / total` is the cache hit rate; `writer / total` is how often the read replica returned empty for a row the primary had. Bounded labels only, no per-env / worker / slug cardinality.
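
To make the "bounded labels" point concrete, here is a small sketch using an in-memory stand-in for the counter. `ResolveCounter` and its methods are hypothetical and do not reflect the metrics library the webapp uses; only the label values come from the description above. With two lookup paths and four sources, the counter can never produce more than eight series.

```typescript
const PATHS = ["locked", "current"] as const;
const SOURCES = ["cache", "replica", "writer", "miss"] as const;

type LookupPath = (typeof PATHS)[number];
type Source = (typeof SOURCES)[number];

// Upper bound on distinct label combinations: 2 paths x 4 sources = 8.
const MAX_SERIES: number = PATHS.length * SOURCES.length;

// Hypothetical in-memory counter keyed only by the two bounded labels.
class ResolveCounter {
  private counts = new Map<string, number>();

  record(path: LookupPath, source: Source): void {
    const key = `${path}/${source}`;
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }

  get(path: LookupPath, source: Source): number {
    return this.counts.get(`${path}/${source}`) ?? 0;
  }

  total(): number {
    let sum = 0;
    this.counts.forEach((n) => {
      sum += n;
    });
    return sum;
  }

  // Number of label combinations seen so far; never exceeds MAX_SERIES.
  seriesCount(): number {
    return this.counts.size;
  }
}
```

Because the label types are closed unions, a per-environment or per-task value cannot be passed as a label without a type error, which keeps cardinality fixed by construction.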

Verified with a unit test (replica stub vs. real primary) and end-to-end against a local streaming replica with replication paused to reproduce the stale read.

TRI-10868

A locked-version trigger such as triggerAndWait resolved the task's
metadata from the read replica and, on a miss, threw a non-retryable
"task not found on locked version" even though the task was registered.
A read replica can return an empty result for a row that already exists
on the primary, so this surfaced as intermittent, self-recovering
trigger failures.

The locked worker is already resolved on the primary in the same
request, so the resolver now re-checks the primary when the replica
returns no row, and only reports the task missing when the primary
genuinely lacks it. This runs on the cache-miss path only and leaves
the hot path unchanged.

changeset-bot Bot commented Jun 12, 2026


⚠️ No Changeset found

Latest commit: 87fa522

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

coderabbitai Bot commented Jun 12, 2026

Contributor

Walkthrough

This PR fixes locked-version trigger failures caused by read-replica staleness. When DefaultQueueManager resolves task metadata on cache miss, it now queries the read replica first; if no row is found and a separate writer database client exists, it falls back to the writer and logs a staleness warning. A new private helper method findLockedTaskRow encapsulates the Prisma query used by both replica and writer lookups. A new integration test validates that the fallback successfully triggers locked tasks while unregistered tasks still fail as expected.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly and clearly summarizes the main fix: addressing locked-version trigger failures caused by stale replica reads, which is the core purpose of this PR. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request provides a comprehensive description covering root cause, fix, and verification approach that goes beyond the required template sections. |

@ericallam ericallam marked this pull request as ready for review June 12, 2026 15:58

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

@ericallam ericallam enabled auto-merge (squash) June 12, 2026 16:21
@ericallam ericallam merged commit 5232067 into main Jun 12, 2026
33 checks passed
@ericallam ericallam deleted the fix/locked-version-trigger-stale-replica branch June 12, 2026 16:29
ericallam added a commit that referenced this pull request Jun 12, 2026
## Summary

Adds observability to the task metadata cache that backs the trigger hot
path. Follow-up to #3930, which made locked-version triggers fall back
to the primary when the read replica returns no row; this makes the
cache's effectiveness (and that fallback) measurable instead of
inferred.

## What it emits

A single bounded counter `task_meta_cache.resolve`, labeled by lookup
path (`locked` / `current`) and the source that satisfied it (`cache` /
`replica` / `writer` / `miss`):

- `cache / total` is the cache hit rate (its complement, `1 - cache / total`,
is how cold the cache runs).
- `writer / total` is how often the read replica returned empty for a
row the primary had (the condition #3930 recovers from).

Labels are bounded, with no per-env / worker / slug cardinality.
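
As a worked example of the two ratios, the sketch below derives them from per-source counts. `resolveRatios` and `SourceCounts` are hypothetical names, and the counts in the usage note are made up for illustration, not measured values.

```typescript
type Source = "cache" | "replica" | "writer" | "miss";
type SourceCounts = Record<Source, number>;

interface ResolveRatios {
  cacheHitRate: number; // cache / total
  writerFallbackRate: number; // writer / total
}

// Returns null when nothing has been recorded, to avoid dividing by zero.
function resolveRatios(counts: SourceCounts): ResolveRatios | null {
  const total = counts.cache + counts.replica + counts.writer + counts.miss;
  if (total === 0) return null;
  return {
    cacheHitRate: counts.cache / total,
    writerFallbackRate: counts.writer / total,
  };
}
```

For instance, with illustrative counts of 900 `cache`, 90 `replica`, 8 `writer`, and 2 `miss`, the total is 1000, so the cache hit rate is 0.9 and the writer fallback rate is 0.008.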

TRI-10873

2 participants