Skip to content

ring: clamp far-future heartbeat timestamps to prevent permanently stuck instances#942

Open
domgoodwin wants to merge 1 commit intografana:mainfrom
domgoodwin:fix-gossip-ring-future-timestamp
Open

ring: clamp far-future heartbeat timestamps to prevent permanently stuck instances#942
domgoodwin wants to merge 1 commit intografana:mainfrom
domgoodwin:fix-gossip-ring-future-timestamp

Conversation

@domgoodwin
Copy link
Copy Markdown

@domgoodwin domgoodwin commented Mar 26, 2026

What this PR does:
A crash can corrupt a ring instance's heartbeat timestamp to a
far-future value (e.g. year 2034). When this happens, the instance
becomes permanently stuck in the gossip ring because:

  • The merge logic uses "highest timestamp wins", so no legitimate heartbeat (using time.Now()) can ever override the corrupted entry
  • The localCAS path marks missing instances as LEFT with the current time, but other nodes reject this via gossip since their copy has the higher corrupted timestamp
  • IsHeartbeatHealthy treats future timestamps as healthy (the time difference is negative, always within the timeout), so the instance can never be considered unhealthy or evicted

Fix by clamping timestamps that exceed now + 1 hour (maxAllowedClockDrift)
to the current time in both the merge function and IsHeartbeatHealthy.
The merge fix is self-healing: each node independently clamps corrupted
entries and gossips the correction, allowing normal operations to resume
on the next heartbeat cycle.

Which issue(s) this PR fixes:

Fixes grafana/mimir#13810

That issue is the same issue we encountered on our cluster: we had an
instance with a heartbeat timestamp in 2034 in UP status, the only way
we could fix it was by having a fake member join, emit a DOWN status for
the broken instance with a timestamp into the future (to beat out the 2034
year).

Checklist

  • Tests updated

Note

Medium Risk
Changes ring gossip merge and heartbeat health semantics by clamping future timestamps; mistakes could cause instances to be marked unhealthy or allow stale state to win, but the change is bounded by a 1h drift window and covered by targeted tests.

Overview
Prevents corrupted far-future heartbeat timestamps from permanently winning ring gossip merges by introducing maxAllowedClockDrift (1h) and clamping any timestamp beyond now + drift back to now.

mergeWithTime now self-heals local corrupted entries (and gossips the correction via the returned change set) and clamps incoming entries before doing the "highest timestamp wins" comparison; IsHeartbeatHealthy now treats far-future heartbeats as unhealthy instead of always healthy.

Adds unit tests covering clamping behavior, merge outcomes (including localCAS marking missing instances LEFT), and acceptance of small legitimate clock skew.

Reviewed by Cursor Bugbot for commit ec87a4c. Bugbot is set up for automated code reviews on this repo. Configure here.

@cla-assistant
Copy link
Copy Markdown

cla-assistant Bot commented Mar 26, 2026

CLA assistant check
All committers have signed the CLA.

@pracucci
Copy link
Copy Markdown
Contributor

pracucci commented Apr 7, 2026

Hi! 👋 We recently merged #947 which changed the CI configuration for unit tests. To get CI passing on this PR, please rebase on the latest main branch:

git fetch origin main
git rebase origin/main
git push --force-with-lease

Thanks!

…uck instances

  A crash can corrupt a ring instance's heartbeat timestamp to a
  far-future value (e.g. year 2034). When this happens, the instance
  becomes permanently stuck in the gossip ring because:

  - The merge logic uses "highest timestamp wins", so no legitimate
    heartbeat (using time.Now()) can ever override the corrupted entry
  - The localCAS path marks missing instances as LEFT with the current
    time, but other nodes reject this via gossip since their copy has the
    higher corrupted timestamp
  - IsHeartbeatHealthy treats future timestamps as healthy (the time
    difference is negative, always within the timeout), so the instance
    can never be considered unhealthy or evicted

  Fix by clamping timestamps that exceed now + 1 hour (maxAllowedClockDrift)
  to the current time in both the merge function and IsHeartbeatHealthy.
  The merge fix is self-healing: each node independently clamps corrupted
  entries and gossips the correction, allowing normal operations to resume
  on the next heartbeat cycle.
@domgoodwin domgoodwin force-pushed the fix-gossip-ring-future-timestamp branch from af765bf to ec87a4c Compare April 14, 2026 12:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Persistent ghost instance in the member ring

2 participants