fix(server): recover from transient consensus errors during the fence#227
Merged
Conversation
The failover fence (`run_leader_watch`) propagated every `ConsensusError`
from `load_high_water`/`persist_high_water` via `?`, terminating the
leader-watch task. Because `serve_with_*` stops the server when that task
ends, a single recoverable error during the volatile post-election window
tore the whole tso server down -- and since the node stays raft-leader, no
survivor could serve either. Standalone deployments recover by process
restart, but embedded users (e.g. the openraft-piggyback example) have no
restart, so a momentary quorum blip or a leadership flap permanently
disabled serving. This surfaced as the intermittent "new leader's
serving-state stream closed before reaching Serving" failure in the
piggyback smoke test.
`ConsensusError` already classifies errors as recoverable
(`NotLeader`/`Fenced`/`TransientDriver`) vs permanent (`PermanentDriver`).
Honor that split in the Leader branch instead of failing fast on all:
* TransientDriver -- retry the fence with bounded exponential backoff;
the node is still the elected leader and no fresh leadership event is
coming to re-drive it. An exhausted budget steps down to NotServing and
awaits the next event rather than tearing the server down.
* NotLeader / Fenced -- leadership moved under us; step down to NotServing
and continue the watch loop so the stream delivers the next state.
* PermanentDriver / allocator-invariant -- still fatal: propagate so
into_router poisons serving state and stops serving.
Serving stays NotServing until a fence attempt fully succeeds, so the
"never publish Serving at a stale epoch" invariant the poisoning path
protects holds on every path.
Adds a FaultyDriver test fake plus deterministic regression tests for the
transient-retry and NotLeader-step-down paths, a guard that permanent
errors stay fatal, and rewrites the failpoint test that previously pinned
the fail-stop behavior to assert recovery.
Coverage Report for CI Build 26343849815Warning Build has drifted: This PR's base is out of sync with its target branch, so coverage data may include unrelated changes. Coverage increased (+0.03%) to 94.71%Details
Uncovered Changes
Coverage RegressionsNo coverage regressions found. Coverage Stats
💛 - Coveralls |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The
testjob intermittently fails withnew leader's serving-state stream closed before reaching Serving(example run). It's a real availability bug, not just a flaky test.The failover fence (
run_leader_watch,fence.rs) propagates everyConsensusErrorfromload_high_water/persist_high_watervia?, which terminates the leader-watch task.serve_with_*stops the server when that task ends (server.rs:331-338), so a single recoverable error during the volatile post-election window — a leadership flap (ForwardToLeader → NotLeader) or a momentary quorum blip (TransientDriver) — tears the new leader's tso server down. Because the node remains the raft leader, no survivor can serve either: the embedded cluster is stuck with no TSO until that process restarts. Standalone deployments recover via process restart; embedded users (the piggyback examples) have no restart, so the smoke test never reachesServingand the watch channel closes under it.The fail-stop was incidental, not a deliberate safety requirement: it's day-one
?propagation, and the follow-up commits (#29, #124) only hardened the fail-safe poisoning (don't keep publishingServingwhile the watch is dead) — they never argued that transient errors should be fatal. Retrying preserves the documented monotonicity invariant.Fix
ConsensusErroralready separates recoverable (NotLeader/Fenced/TransientDriver) from permanent (PermanentDriver). The Leader branch now honors that split:TransientDriver→ retry with bounded exponential backoff (≤ ~1.4s); the node is still the elected leader and no fresh leadership event is coming. An exhausted budget steps down toNotServingand awaits the next event (no teardown).NotLeader/Fenced→ step down toNotServingand continue the watch loop; the stream delivers the next state (and a freshLeaderevent if we re-win).PermanentDriver/ allocator-invariant → still fatal: propagate sointo_routerpoisons serving state and stops serving.Serving stays
NotServinguntil an attempt fully succeeds, so the "never publishServingat a stale epoch" invariant holds on every path.Tests
FaultyDriverfake (mirrorsStallableDriver) injectsConsensusErrorkinds onpersist_high_water.fence_retries_transient_persist_error_then_serves,fence_steps_down_on_not_leader_then_serves_next_election, andfence_permanent_error_terminates_watch(guards that permanent errors stay fatal).fence_aborted_after_load_does_not_advance_to_serving→fence_recovers_after_transient_load_error(the old name pinned the now-fixed fail-stop).tsoracle-serversuite (all features) green, includingleadership_churn(back-to-back failovers),fence_yieldpoint, andserve_shutdown. openraft-piggyback smoke run 20/20.Notes
main. The release-plz PR still needs chore(release): version crates independently to fix release-plz resolution #223 to resolve.docs/key-subsystems.md(leader-watch error handling),docs/failpoint-testing.md(failpoint test row).