ring: clamp far-future heartbeat timestamps to prevent permanently stuck instances#942
Open
domgoodwin wants to merge 1 commit intografana:mainfrom
Open
ring: clamp far-future heartbeat timestamps to prevent permanently stuck instances#942domgoodwin wants to merge 1 commit intografana:mainfrom
domgoodwin wants to merge 1 commit intografana:mainfrom
Conversation
Contributor
|
Hi! 👋 We recently merged #947 which changed the CI configuration for unit tests. To get CI passing on this PR, please rebase on the latest Thanks! |
…uck instances
A crash can corrupt a ring instance's heartbeat timestamp to a
far-future value (e.g. year 2034). When this happens, the instance
becomes permanently stuck in the gossip ring because:
- The merge logic uses "highest timestamp wins", so no legitimate
heartbeat (using time.Now()) can ever override the corrupted entry
- The localCAS path marks missing instances as LEFT with the current
time, but other nodes reject this via gossip since their copy has the
higher corrupted timestamp
- IsHeartbeatHealthy treats future timestamps as healthy (the time
difference is negative, always within the timeout), so the instance
can never be considered unhealthy or evicted
Fix by clamping timestamps that exceed now + 1 hour (maxAllowedClockDrift)
to the current time in both the merge function and IsHeartbeatHealthy.
The merge fix is self-healing: each node independently clamps corrupted
entries and gossips the correction, allowing normal operations to resume
on the next heartbeat cycle.
af765bf to
ec87a4c
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does:
A crash can corrupt a ring instance's heartbeat timestamp to a
far-future value (e.g. year 2034). When this happens, the instance
becomes permanently stuck in the gossip ring because:
Fix by clamping timestamps that exceed now + 1 hour (maxAllowedClockDrift)
to the current time in both the merge function and IsHeartbeatHealthy.
The merge fix is self-healing: each node independently clamps corrupted
entries and gossips the correction, allowing normal operations to resume
on the next heartbeat cycle.
Which issue(s) this PR fixes:
Fixes grafana/mimir#13810
That issue is the same issue we encountered on our cluster: we had an
instance with a heartbeat timestamp in 2034 in UP status, the only way
we could fix it was by having a fake member join, emit a DOWN status for
the broken instance with a timestamp into the future (to beat out the 2034
year).
Checklist
Note
Medium Risk
Changes ring gossip merge and heartbeat health semantics by clamping future timestamps; mistakes could cause instances to be marked unhealthy or allow stale state to win, but the change is bounded by a 1h drift window and covered by targeted tests.
Overview
Prevents corrupted far-future heartbeat timestamps from permanently winning ring gossip merges by introducing
maxAllowedClockDrift(1h) and clamping any timestamp beyondnow + driftback tonow.mergeWithTimenow self-heals local corrupted entries (and gossips the correction via the returned change set) and clamps incoming entries before doing the "highest timestamp wins" comparison;IsHeartbeatHealthynow treats far-future heartbeats as unhealthy instead of always healthy.Adds unit tests covering clamping behavior, merge outcomes (including
localCASmarking missing instancesLEFT), and acceptance of small legitimate clock skew.Reviewed by Cursor Bugbot for commit ec87a4c. Bugbot is set up for automated code reviews on this repo. Configure here.