fix(webapp): stop locked-version triggers failing on stale replica reads by ericallam · Pull Request #3930 · triggerdotdev/trigger.dev

ericallam · 2026-06-12T15:57:38Z

Summary

triggerAndWait (and other locked-version triggers) could intermittently fail with "Task '<id>' not found on locked version '<version>'" for a task that was in fact registered on that version. The failures came in bursts and recovered on their own, so a retry minutes later would succeed.

Root cause

For a locked-version trigger, the queue resolver looked up the task's BackgroundWorkerTask metadata from the read replica (behind a Redis cache). On a cache miss it queried the replica, treated a null result as "task not registered", and turned it into a non-retryable 422. A read replica that lags the primary can return an empty result for a row that already exists on the primary, so a momentarily-behind replica produced a false negative even though the locked worker (resolved on the primary in the same request) clearly had the task.

Fix

On a cache miss, when the replica returns no row the resolver now re-checks the primary before concluding the task is missing. If the primary has the row it is used (and the cache is back-filled); the error fires only when the primary genuinely lacks it, which is the only case where the 422 is correct. The extra read happens on the cache-miss-and-replica-empty path only, so the hot path is unchanged.
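The lookup order described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `TaskRow`, `TaskStore`, `TaskNotFoundError`, `resolveLockedTask`, and the in-memory `Map` standing in for the Redis cache are all hypothetical names chosen for the example, and back-filling the cache on a replica hit is an assumption.

```typescript
// Hypothetical shapes for illustration; none of these names come from the PR.
type TaskRow = { slug: string; queueName: string };

interface TaskStore {
  findTask(workerId: string, slug: string): Promise<TaskRow | null>;
}

type ResolvedFrom = "cache" | "replica" | "writer";

// Stand-in for the non-retryable 422 described in the PR.
class TaskNotFoundError extends Error {
  readonly status = 422;
}

async function resolveLockedTask(
  cache: Map<string, TaskRow>, // stand-in for the Redis cache
  replica: TaskStore,
  primary: TaskStore,
  workerId: string,
  slug: string
): Promise<{ row: TaskRow; source: ResolvedFrom }> {
  const key = `${workerId}:${slug}`;

  // Hot path: a cache hit never touches either database.
  const cached = cache.get(key);
  if (cached) return { row: cached, source: "cache" };

  // Cache miss: ask the replica first (assumed to back-fill the cache).
  const fromReplica = await replica.findTask(workerId, slug);
  if (fromReplica) {
    cache.set(key, fromReplica);
    return { row: fromReplica, source: "replica" };
  }

  // Replica returned nothing: re-check the primary before giving up.
  const fromPrimary = await primary.findTask(workerId, slug);
  if (fromPrimary) {
    cache.set(key, fromPrimary); // back-fill so later calls hit the cache
    return { row: fromPrimary, source: "writer" };
  }

  // Only now is the task known to be missing.
  throw new TaskNotFoundError(`Task '${slug}' not found on locked version`);
}
```

The extra primary read sits behind two earlier returns, which is why the cache-hit and replica-hit paths cost the same as before.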

Observability

Adds a task_meta_cache.resolve counter on the trigger path, labeled by lookup path (locked / current) and the source that satisfied it (cache / replica / writer / miss). cache / total is the cache hit rate; writer / total is how often the read replica returned empty for a row the primary had. Bounded labels only, no per-env / worker / slug cardinality.
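To make the label shape and the two ratios concrete, here is a small in-memory stand-in for the counter. It is only a sketch: the real counter is presumably emitted through the webapp's metrics library, `ResolveCounter` and its methods are hypothetical, and computing `total` per lookup path (rather than across both paths) is an assumption.

```typescript
// The two label dimensions named in the PR; both are closed sets.
type LookupPath = "locked" | "current";
type ResolveSource = "cache" | "replica" | "writer" | "miss";

// Hypothetical in-memory stand-in for the task_meta_cache.resolve counter.
class ResolveCounter {
  private readonly counts = new Map<string, number>();

  // Increment the counter for one resolved lookup.
  record(path: LookupPath, source: ResolveSource): void {
    const key = `${path}|${source}`;
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }

  count(path: LookupPath, source: ResolveSource): number {
    return this.counts.get(`${path}|${source}`) ?? 0;
  }

  // Total lookups for one path, summed over every source label.
  total(path: LookupPath): number {
    const sources: ResolveSource[] = ["cache", "replica", "writer", "miss"];
    return sources.reduce((sum, s) => sum + this.count(path, s), 0);
  }

  // source / total for a path; 0 when nothing has been recorded yet.
  ratio(path: LookupPath, source: ResolveSource): number {
    const total = this.total(path);
    return total === 0 ? 0 : this.count(path, source) / total;
  }
}
```

With this shape, `ratio("locked", "cache")` corresponds to the cache hit rate and `ratio("locked", "writer")` to the share of lookups rescued by the primary re-check. Because both labels are closed unions, the series count is capped at 2 × 4.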

Verified with a unit test (replica stub vs. real primary) and end-to-end against a local streaming replica with replication paused to reproduce the stale read.

TRI-10868

A locked-version trigger such as triggerAndWait resolved the task's metadata from the read replica and, on a miss, threw a non-retryable "task not found on locked version" even though the task was registered. A read replica can return an empty result for a row that already exists on the primary, so this surfaced as intermittent, self-recovering trigger failures. The locked worker is already resolved on the primary in the same request, so the resolver now re-checks the primary when the replica returns no row, and only reports the task missing when the primary genuinely lacks it. This runs on the cache-miss path only and leaves the hot path unchanged.

changeset-bot · 2026-06-12T15:57:52Z

⚠️ No Changeset found

Latest commit: 87fa522

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.


coderabbitai · 2026-06-12T15:57:54Z

Walkthrough

This PR fixes locked-version trigger failures caused by read-replica staleness. When DefaultQueueManager resolves task metadata on cache miss, it now queries the read replica first; if no row is found and a separate writer database client exists, it falls back to the writer and logs a staleness warning. A new private helper method findLockedTaskRow encapsulates the Prisma query used by both replica and writer lookups. A new integration test validates that the fallback successfully triggers locked tasks while unregistered tasks still fail as expected.
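The walkthrough mentions two details worth illustrating: one query helper shared by both lookups, and a fallback that only runs when a separate writer client exists. A rough sketch under stated assumptions follows. `DbClient` is a hypothetical stand-in for a Prisma client, `findLockedTaskRow` is written as a free function here rather than the private method the walkthrough describes, and emitting the warning only when the writer actually has the row is a guess at the behavior.

```typescript
type LockedTaskRow = { slug: string; queueName: string };

// Hypothetical minimal client interface standing in for a Prisma client.
interface DbClient {
  findFirstTask(where: { workerId: string; slug: string }): Promise<LockedTaskRow | null>;
}

// One query definition, reused for the replica and the writer so the
// two lookups cannot drift apart.
function findLockedTaskRow(
  client: DbClient,
  workerId: string,
  slug: string
): Promise<LockedTaskRow | null> {
  return client.findFirstTask({ workerId, slug });
}

async function lookupWithWriterFallback(
  replica: DbClient,
  writer: DbClient,
  workerId: string,
  slug: string,
  warn: (message: string) => void
): Promise<LockedTaskRow | null> {
  const fromReplica = await findLockedTaskRow(replica, workerId, slug);
  if (fromReplica) return fromReplica;

  // With a single client configured there is no separate writer to
  // consult, and repeating the same query would add nothing.
  if (writer === replica) return null;

  const fromWriter = await findLockedTaskRow(writer, workerId, slug);
  if (fromWriter) {
    // Assumption: warn only when the writer proves the replica was stale.
    warn(`replica returned no row for task '${slug}' but the writer has it`);
  }
  return fromWriter;
}
```

Comparing client identity keeps single-database deployments on exactly one query per cache miss.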

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly and clearly summarizes the main fix: addressing locked-version trigger failures caused by stale replica reads, which is the core purpose of this PR.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The pull request provides a comprehensive description covering root cause, fix, and verification approach that goes beyond the required template sections.

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

## Summary

Adds observability to the task metadata cache that backs the trigger hot path. Follow-up to #3930, which made locked-version triggers fall back to the primary when the read replica returns no row; this makes the cache's effectiveness (and that fallback) measurable instead of inferred.

## What it emits

A single bounded counter `task_meta_cache.resolve`, labeled by lookup path (`locked` / `current`) and the source that satisfied it (`cache` / `replica` / `writer` / `miss`):

- `cache / total` is the cache hit rate (its inverse is how cold the cache runs).
- `writer / total` is how often the read replica returned empty for a row the primary had (the condition #3930 recovers from).

Labels are bounded, with no per-env / worker / slug cardinality.

TRI-10873

ericallam marked this pull request as ready for review June 12, 2026 15:58

devin-ai-integration Bot reviewed Jun 12, 2026

ericallam enabled auto-merge (squash) June 12, 2026 16:21

myftija approved these changes Jun 12, 2026

ericallam merged commit 5232067 into main Jun 12, 2026
33 checks passed

ericallam deleted the fix/locked-version-trigger-stale-replica branch June 12, 2026 16:29

ericallam mentioned this pull request Jun 12, 2026

feat(webapp): add task metadata cache resolution metrics #3934

Merged
