Skip to content

fix(server): recover from transient consensus errors during the fence#227

Merged
SebastianThiebaud merged 1 commit into
mainfrom
fix/fence-transient-recovery
May 23, 2026
Merged

fix(server): recover from transient consensus errors during the fence#227
SebastianThiebaud merged 1 commit into
mainfrom
fix/fence-transient-recovery

Conversation

@SebastianThiebaud
Copy link
Copy Markdown
Contributor

Problem

The test job intermittently fails with new leader's serving-state stream closed before reaching Serving (example run). It's a real availability bug, not just a flaky test.

The failover fence (run_leader_watch, fence.rs) propagates every ConsensusError from load_high_water/persist_high_water via ?, which terminates the leader-watch task. serve_with_* stops the server when that task ends (server.rs:331-338), so a single recoverable error during the volatile post-election window — a leadership flap (ForwardToLeader → NotLeader) or a momentary quorum blip (TransientDriver) — tears the new leader's tso server down. Because the node remains the raft leader, no survivor can serve either: the embedded cluster is stuck with no TSO until that process restarts. Standalone deployments recover via process restart; embedded users (the piggyback examples) have no restart, so the smoke test never reaches Serving and the watch channel closes under it.

The fail-stop was incidental, not a deliberate safety requirement: it's day-one ? propagation, and the follow-up commits (#29, #124) only hardened the fail-safe poisoning (don't keep publishing Serving while the watch is dead) — they never argued that transient errors should be fatal. Retrying preserves the documented monotonicity invariant.

Fix

ConsensusError already separates recoverable (NotLeader/Fenced/TransientDriver) from permanent (PermanentDriver). The Leader branch now honors that split:

  • TransientDriver → retry with bounded exponential backoff (≤ ~1.4s); the node is still the elected leader and no fresh leadership event is coming. An exhausted budget steps down to NotServing and awaits the next event (no teardown).
  • NotLeader / Fenced → step down to NotServing and continue the watch loop; the stream delivers the next state (and a fresh Leader event if we re-win).
  • PermanentDriver / allocator-invariant → still fatal: propagate so into_router poisons serving state and stops serving.

Serving stays NotServing until an attempt fully succeeds, so the "never publish Serving at a stale epoch" invariant holds on every path.

Tests

  • New FaultyDriver fake (mirrors StallableDriver) injects ConsensusError kinds on persist_high_water.
  • Deterministic regression tests: fence_retries_transient_persist_error_then_serves, fence_steps_down_on_not_leader_then_serves_next_election, and fence_permanent_error_terminates_watch (guards that permanent errors stay fatal).
  • Rewrote the failpoint test fence_aborted_after_load_does_not_advance_to_servingfence_recovers_after_transient_load_error (the old name pinned the now-fixed fail-stop).
  • Full tsoracle-server suite (all features) green, including leadership_churn (back-to-back failovers), fence_yieldpoint, and serve_shutdown. openraft-piggyback smoke run 20/20.

Notes

The failover fence (`run_leader_watch`) propagated every `ConsensusError`
from `load_high_water`/`persist_high_water` via `?`, terminating the
leader-watch task. Because `serve_with_*` stops the server when that task
ends, a single recoverable error during the volatile post-election window
tore the whole tso server down -- and since the node stays raft-leader, no
survivor could serve either. Standalone deployments recover by process
restart, but embedded users (e.g. the openraft-piggyback example) have no
restart, so a momentary quorum blip or a leadership flap permanently
disabled serving. This surfaced as the intermittent "new leader's
serving-state stream closed before reaching Serving" failure in the
piggyback smoke test.

`ConsensusError` already classifies errors as recoverable
(`NotLeader`/`Fenced`/`TransientDriver`) vs permanent (`PermanentDriver`).
Honor that split in the Leader branch instead of failing fast on all:

  * TransientDriver -- retry the fence with bounded exponential backoff;
    the node is still the elected leader and no fresh leadership event is
    coming to re-drive it. An exhausted budget steps down to NotServing and
    awaits the next event rather than tearing the server down.
  * NotLeader / Fenced -- leadership moved under us; step down to NotServing
    and continue the watch loop so the stream delivers the next state.
  * PermanentDriver / allocator-invariant -- still fatal: propagate so
    into_router poisons serving state and stops serving.

Serving stays NotServing until a fence attempt fully succeeds, so the
"never publish Serving at a stale epoch" invariant the poisoning path
protects holds on every path.

Adds a FaultyDriver test fake plus deterministic regression tests for the
transient-retry and NotLeader-step-down paths, a guard that permanent
errors stay fatal, and rewrites the failpoint test that previously pinned
the fail-stop behavior to assert recovery.
@coveralls
Copy link
Copy Markdown

Coverage Report for CI Build 26343849815

Warning

Build has drifted: This PR's base is out of sync with its target branch, so coverage data may include unrelated changes.
Quick fix: rebase this PR. Learn more →

Coverage increased (+0.03%) to 94.71%

Details

  • Coverage increased (+0.03%) from the base build.
  • Patch coverage: 4 uncovered changes across 2 files (97 of 101 lines covered, 96.04%).
  • No coverage regressions found.

Uncovered Changes

File Changed Covered %
crates/tsoracle-server/src/test_fakes.rs 45 42 93.33%
crates/tsoracle-server/src/fence.rs 56 55 98.21%

Coverage Regressions

No coverage regressions found.


Coverage Stats

Coverage Status
Relevant Lines: 10776
Covered Lines: 10206
Line Coverage: 94.71%
Coverage Strength: 1516817.58 hits per line

💛 - Coveralls

@SebastianThiebaud SebastianThiebaud merged commit eab3648 into main May 23, 2026
22 checks passed
@SebastianThiebaud SebastianThiebaud deleted the fix/fence-transient-recovery branch May 23, 2026 21:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants