fix(run-engine): retry getSnapshotsSince on the replica then primary when the read replica lags (#3889)
## Summary
When `RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLED` is on,
`RunEngine.getSnapshotsSince` reads from the read replica. During write
spikes the replica can briefly lag, so the snapshot id a runner just
learned from the writer isn't visible there yet. Before this change the
lookup threw, the worker route returned a 500, and the runner waited for
its next poll, turning sub-second snapshot notifications into
poll-interval latency exactly when things are busiest.

This PR makes the flag safe to enable. A replica miss of the since
snapshot gets one jittered retry on the replica; most lag windows are
shorter than the ~50–200ms wait, so in the common case the writer is
never touched. If the retry also misses, the read falls back to the
primary. Each recovery is observed via a new
`run_engine.snapshots_since.replica_miss` counter with an `outcome`
attribute (`replica_retry` vs `primary`). Only genuine misses, absent
on the primary too, remain errors.
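
The recovery order described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `readWithReplicaRecovery`, the `Reader` type, and the `onMiss` callback are hypothetical names, and the callback stands in for incrementing `run_engine.snapshots_since.replica_miss`. Only the error class name and the two `outcome` values come from the description.

```typescript
// Hypothetical sketch: replica read, one delayed replica retry, then primary.
class ExecutionSnapshotNotFoundError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ExecutionSnapshotNotFoundError";
    // Keeps `instanceof` working even when compiled to older JS targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

type Snapshot = { id: string; runId: string };
type Reader = (runId: string, sinceId: string) => Promise<Snapshot[]>;
type Outcome = "replica_retry" | "primary";

const isLagMiss = (e: unknown): boolean =>
  e instanceof ExecutionSnapshotNotFoundError;

async function readWithReplicaRecovery(
  replica: Reader,
  primary: Reader,
  runId: string,
  sinceId: string,
  opts: { retryDelayMs: number; onMiss: (outcome: Outcome) => void }
): Promise<Snapshot[]> {
  try {
    return await replica(runId, sinceId);
  } catch (e) {
    // Anything other than the typed not-found error is a real failure.
    if (!isLagMiss(e)) throw e;
  }

  // A delay of 0 means "skip the replica retry" (assumed to mirror MAX_MS=0).
  if (opts.retryDelayMs > 0) {
    await new Promise((resolve) => setTimeout(resolve, opts.retryDelayMs));
    try {
      const rows = await replica(runId, sinceId);
      opts.onMiss("replica_retry");
      return rows;
    } catch (e) {
      if (!isLagMiss(e)) throw e;
    }
  }

  // If the primary also throws, the miss is genuine: it propagates uncounted.
  const rows = await primary(runId, sinceId);
  opts.onMiss("primary");
  return rows;
}
```

Counting only after a successful read is what keeps a bogus snapshot id out of the lag metric in this sketch: a miss on both sides never reaches either `onMiss` call.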
## Design
- `getExecutionSnapshotsSince` now throws a typed
`ExecutionSnapshotNotFoundError` so the engine can distinguish the
expected lag miss from real failures. The message string is unchanged
and the error never leaves the engine.
- The recovery path only engages when the flag is on, a distinct replica
client is configured, and no transaction client was passed. With the
flag off, the path is behaviorally identical to before.
- Retry delay bounds are configurable
(`RUN_ENGINE_SNAPSHOTS_SINCE_REPLICA_RETRY_MIN_MS`/`MAX_MS`, default
50/200; `MAX_MS=0` skips the replica retry and goes straight to the
primary).
- The warn log fires only when the primary serves the read (the writer
spill is the operationally interesting event); replica-retry recoveries
are counted but quiet. A permanently-missing snapshot id stays an
error-level failure with a `failedDuring` field, so lag metrics aren't
polluted by bogus ids.
- Stale-tail lag (replica has the since snapshot but not newer rows)
deliberately still returns the replica's view; the next poll catches up.
- The since-snapshot anchor lookup is now scoped to the polled run
(`where: { id, runId }`), so a snapshot id from a different run raises
not-found instead of silently anchoring a too-wide window of the run's
snapshots.
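
To make the gating and delay rules in the list above concrete, here is a minimal sketch. The function and field names (`shouldUseReplicaRecovery`, `jitteredRetryDelayMs`, `RetryBounds`) are hypothetical, and the uniform jitter and the clamping of a minimum larger than the maximum are assumptions; the description only states the 50/200 defaults and that `MAX_MS=0` skips the replica retry.

```typescript
// Hypothetical helpers illustrating the conditions and the jittered delay.
type RecoveryGate = {
  flagEnabled: boolean;          // RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLED
  hasDistinctReplica: boolean;   // a replica client separate from the primary
  hasTransactionClient: boolean; // caller passed its own transaction client
};

function shouldUseReplicaRecovery(gate: RecoveryGate): boolean {
  return gate.flagEnabled && gate.hasDistinctReplica && !gate.hasTransactionClient;
}

type RetryBounds = { minMs: number; maxMs: number };

// Returns 0 to signal "skip the replica retry and go straight to the primary".
function jitteredRetryDelayMs(
  bounds: RetryBounds,
  random: () => number = Math.random
): number {
  if (bounds.maxMs <= 0) return 0;
  // Assumption: a min above max is clamped down rather than rejected.
  const min = Math.max(0, Math.min(bounds.minMs, bounds.maxMs));
  return Math.round(min + random() * (bounds.maxMs - min));
}

const defaults: RetryBounds = { minMs: 50, maxMs: 200 };
const delay = jitteredRetryDelayMs(defaults);
console.log(`retrying replica in ${delay}ms`);
```

Injecting `random` keeps the delay deterministic under test, which matters when a test needs to assert that the replica caught up inside the retry window.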
## Test plan
All vitest + testcontainers, no mocks. A new `schemaOnlyPrisma` fixture
(migrated-but-empty clone database) simulates a replica that hasn't
caught up, and a real in-memory OTel meter pins the counter semantics
per outcome.
- [x] Replica catches up during the jittered retry window → served by
the replica, `outcome=replica_retry` = 1, primary never consulted
- [x] Replica permanently missing the since snapshot → served by the
primary, `outcome=primary` = 1
- [x] Snapshot missing on both replica and primary → null, counter = 0
- [x] Replica has the since snapshot but lags by one → the replica's
view is served, no fallback (verified discriminating power: the test
fails if reads secretly hit the primary)
- [x] Flag off with a replica configured → primary serves the read
- [x] Transaction client provided → bypasses the replica entirely
- [x] Since snapshot belonging to a different run → null
- [x] Existing getSnapshotsSince + waitpoints suites green; run-engine,
testcontainers, and webapp typechecks pass
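
For readers unfamiliar with the different-run case, the following in-memory stand-in shows why scoping the anchor lookup by both `id` and `runId` matters. It is not the Prisma query from this PR: `snapshotsSince`, `SnapshotRow`, and the `createdAt` ordering column are assumptions made for illustration.

```typescript
// Hypothetical in-memory model of the scoped since-snapshot lookup.
type SnapshotRow = { id: string; runId: string; createdAt: number };

function snapshotsSince(
  rows: SnapshotRow[],
  runId: string,
  sinceId: string
): SnapshotRow[] | null {
  // Scoped anchor: the since snapshot must belong to the polled run.
  const anchor = rows.find((r) => r.id === sinceId && r.runId === runId);
  if (!anchor) return null;

  return rows
    .filter((r) => r.runId === runId && r.createdAt > anchor.createdAt)
    .sort((a, b) => a.createdAt - b.createdAt);
}

const table: SnapshotRow[] = [
  { id: "a1", runId: "run_a", createdAt: 10 },
  { id: "b1", runId: "run_b", createdAt: 5 },
  { id: "a2", runId: "run_a", createdAt: 20 },
];

// Anchor from the same run: only newer snapshots of that run come back.
console.log(snapshotsSince(table, "run_a", "a1")?.map((r) => r.id));
// Anchor from another run: rejected instead of widening the window.
console.log(snapshotsSince(table, "run_a", "b1"));
```

With an id-only anchor, `b1` (created earlier than anything in `run_a`) would anchor a window covering every `run_a` snapshot, which is the too-wide result the scoped lookup prevents.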
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Run snapshot polling no longer errors or pays extra latency when the database read replica hasn't yet replicated the snapshot the runner is polling from (`RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLED`): the read is briefly retried on the replica and served from the primary if it still hasn't caught up. Polling also now rejects a since-snapshot id that doesn't belong to the run being polled.