fix(server): recover from transient consensus errors during the fence by SebastianThiebaud · Pull Request #227 · prisma-risk/tsoracle

SebastianThiebaud · 2026-05-23T21:20:34Z

Problem

The test job intermittently fails with new leader's serving-state stream closed before reaching Serving (example run). It's a real availability bug, not just a flaky test.

The failover fence (run_leader_watch, fence.rs) propagates every ConsensusError from load_high_water/persist_high_water via ?, which terminates the leader-watch task. serve_with_* stops the server when that task ends (server.rs:331-338), so a single recoverable error during the volatile post-election window — a leadership flap (ForwardToLeader → NotLeader) or a momentary quorum blip (TransientDriver) — tears the new leader's tso server down. Because the node remains the raft leader, no survivor can serve either: the embedded cluster is stuck with no TSO until that process restarts. Standalone deployments recover via process restart; embedded users (the piggyback examples) have no restart, so the smoke test never reaches Serving and the watch channel closes under it.

The fail-stop was incidental, not a deliberate safety requirement: it's day-one ? propagation, and the follow-up commits (#29, #124) only hardened the fail-safe poisoning (don't keep publishing Serving while the watch is dead) — they never argued that transient errors should be fatal. Retrying preserves the documented monotonicity invariant.

Fix

ConsensusError already separates recoverable (NotLeader/Fenced/TransientDriver) from permanent (PermanentDriver). The Leader branch now honors that split:

TransientDriver → retry with bounded exponential backoff (≤ ~1.4s); the node is still the elected leader and no fresh leadership event is coming. An exhausted budget steps down to NotServing and awaits the next event (no teardown).
NotLeader / Fenced → step down to NotServing and continue the watch loop; the stream delivers the next state (and a fresh Leader event if we re-win).
PermanentDriver / allocator-invariant → still fatal: propagate so into_router poisons serving state and stops serving.

Serving stays NotServing until an attempt fully succeeds, so the "never publish Serving at a stale epoch" invariant holds on every path.

Tests

New FaultyDriver fake (mirrors StallableDriver) injects ConsensusError kinds on persist_high_water.
Deterministic regression tests: fence_retries_transient_persist_error_then_serves, fence_steps_down_on_not_leader_then_serves_next_election, and fence_permanent_error_terminates_watch (guards that permanent errors stay fatal).
Rewrote the failpoint test fence_aborted_after_load_does_not_advance_to_serving → fence_recovers_after_transient_load_error (the old name pinned the now-fixed fail-stop).
Full tsoracle-server suite (all features) green, including leadership_churn (back-to-back failovers), fence_yieldpoint, and serve_shutdown. openraft-piggyback smoke run 20/20.

Notes

Independent of chore(release): version crates independently to fix release-plz resolution #223 (version independence); both target main. The release-plz PR still needs chore(release): version crates independently to fix release-plz resolution #223 to resolve.
Docs updated: docs/key-subsystems.md (leader-watch error handling), docs/failpoint-testing.md (failpoint test row).

The failover fence (`run_leader_watch`) propagated every `ConsensusError` from `load_high_water`/`persist_high_water` via `?`, terminating the leader-watch task. Because `serve_with_*` stops the server when that task ends, a single recoverable error during the volatile post-election window tore the whole tso server down -- and since the node stays raft-leader, no survivor could serve either. Standalone deployments recover by process restart, but embedded users (e.g. the openraft-piggyback example) have no restart, so a momentary quorum blip or a leadership flap permanently disabled serving. This surfaced as the intermittent "new leader's serving-state stream closed before reaching Serving" failure in the piggyback smoke test. `ConsensusError` already classifies errors as recoverable (`NotLeader`/`Fenced`/`TransientDriver`) vs permanent (`PermanentDriver`). Honor that split in the Leader branch instead of failing fast on all: * TransientDriver -- retry the fence with bounded exponential backoff; the node is still the elected leader and no fresh leadership event is coming to re-drive it. An exhausted budget steps down to NotServing and awaits the next event rather than tearing the server down. * NotLeader / Fenced -- leadership moved under us; step down to NotServing and continue the watch loop so the stream delivers the next state. * PermanentDriver / allocator-invariant -- still fatal: propagate so into_router poisons serving state and stops serving. Serving stays NotServing until a fence attempt fully succeeds, so the "never publish Serving at a stale epoch" invariant the poisoning path protects holds on every path. Adds a FaultyDriver test fake plus deterministic regression tests for the transient-retry and NotLeader-step-down paths, a guard that permanent errors stay fatal, and rewrites the failpoint test that previously pinned the fail-stop behavior to assert recovery.

coveralls · 2026-05-23T21:26:08Z

Coverage Report for CI Build 26343849815

Warning

Build has drifted: This PR's base is out of sync with its target branch, so coverage data may include unrelated changes.
Quick fix: rebase this PR. Learn more →

Coverage increased (+0.03%) to 94.71%

Details

Coverage increased (+0.03%) from the base build.
Patch coverage: 4 uncovered changes across 2 files (97 of 101 lines covered, 96.04%).
No coverage regressions found.

Uncovered Changes

File	Changed	Covered	%
crates/tsoracle-server/src/test_fakes.rs	45	42	93.33%
crates/tsoracle-server/src/fence.rs	56	55	98.21%

Coverage Regressions

No coverage regressions found.

Coverage Stats


Relevant Lines:	10776
Covered Lines:	10206
Line Coverage:	94.71%
Coverage Strength:	1516817.58 hits per line

💛 - Coveralls

SebastianThiebaud merged commit eab3648 into main May 23, 2026
22 checks passed

SebastianThiebaud deleted the fix/fence-transient-recovery branch May 23, 2026 21:27

prismarisk-public-release Bot mentioned this pull request May 23, 2026

chore: release #228

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(server): recover from transient consensus errors during the fence#227

fix(server): recover from transient consensus errors during the fence#227
SebastianThiebaud merged 1 commit into
mainfrom
fix/fence-transient-recovery

SebastianThiebaud commented May 23, 2026

Uh oh!

coveralls commented May 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

SebastianThiebaud commented May 23, 2026

Problem

Fix

Tests

Notes

Uh oh!

coveralls commented May 23, 2026

Coverage Report for CI Build 26343849815

Coverage increased (+0.03%) to 94.71%

Details

Uncovered Changes

Coverage Regressions

Coverage Stats

💛 - Coveralls

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants