feat: bimodal_clock severity — surface flaky-RTC nodes instead of hiding as 'No Clock' #845

@Kpa-clawbot

Symptom

Some nodes emit adverts with bimodal clocks: most timestamps are roughly correct (skew within ±60s), but a meaningful fraction are wildly off (often −56M to −60M sec, about −1.8 to −1.9 yr). This is almost certainly a firmware bug where the RTC isn't initialized after a reset and the node briefly transmits with epoch-near-zero timestamps before re-syncing.

Currently #789's Theil-Sen + recent-window median handles isolated outliers, but when ≥50% of the recent window is broken, the median IS broken and the node is reported as severity: "no_clock". The actually-good adverts are hidden, and operators just see "No Clock" for a node that demonstrably has a working clock most of the time.

Repro

  • Staging node f81d265c03c5c1b2508d172633e2160eabe3126ef05538c20d94c8c1f2783a61
  • 31,013 samples, recent samples (chronological): −56,877,071s (broken), −6.8s, −60,497,935s, −59,813,698s, −13.6s
  • API: severity:"no_clock", recentMedianSkewSec:-56877071.6, medianSkewSec:-7.8 (long-window median IS sane)

Operator need

Operators need to know "this node has a flaky clock" — not have it hidden behind "No Clock". The bad-timestamp adverts can confuse downstream analysis (path attribution, time-windowed queries) so visibility is the point.

Proposed fix — new severity tier bimodal_clock

Backend (cmd/server/clock_skew.go)

  • After Theil-Sen + recent-median, compute goodFraction = fraction of recent samples with |skew| < 3600s.
  • Classification (in order):
    • no_clock: goodFraction < 0.10 (essentially never has a real clock)
    • bimodal_clock: 0.10 ≤ goodFraction < 0.80 AND at least one sample with |skew| > 3600s in recent window (NEW)
    • severe, warn, ok — as today, computed from the median of the GOOD samples (|skew| < 3600s) so operators see the real working-clock skew
  • Add goodFraction and recentBadSampleCount to /api/nodes/{pk}/clock-skew response
  • Re-classify per-hash detection in clock_skew.go:351 and fleet/node aggregations at :422-423

Frontend (public/roles.js)

  • New CSS class skew-badge--bimodal_clock (color: amber, like a warning, distinct from severe red)
  • SKEW_SEVERITY_LABELS.bimodal_clock = 'Bimodal'
  • In formatSkew/badge rendering: show the GOOD-sample skew (e.g. -7s) with the bimodal badge so operators see "node clock is mostly fine, here's the real skew" plus a flag "but X% of adverts have nonsense timestamps"
  • Tooltip: "Bimodal clock: NN% of recent adverts have nonsense timestamps. Median of good samples: -7s"
  • In node-detail clock card: add a line "⚠️ N of last M adverts had nonsense timestamps (likely RTC reset)"

Acceptance

  • Node f81d265c…3a61 on staging shows badge Bimodal with −7s (good-sample median) instead of No Clock
  • Hover shows good fraction + bad count
  • True no-clock nodes (no good samples ever) still show No Clock
  • All-good-clock nodes still show OK

Out of scope

  • Auto-detecting WHICH firmware version causes this (correlate by node firmware_version/role + report)
  • Filtering bimodal-bad samples out of downstream time-windowed analytics — separate issue

Tests

  • Unit test bimodal scenario with 60% bad / 40% good in window → severity bimodal_clock, displayed skew = good-median
  • 5% good / 95% bad → still no_clock
  • 95% good / 5% bad → ok (single outlier doesn't trigger bimodal)
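
The three scenarios above can be sketched as a table-driven check. This is only an illustration under assumptions: window, goodStats and tier are hypothetical helpers, the 20-sample window size is arbitrary, and "existing" stands for "fall through to today's severe/warn/ok logic" rather than a real severity value.

```go
package main

import (
	"fmt"
	"math"
)

// window builds n synthetic skews: `bad` broken samples followed by good ones.
// The values are taken from the Repro section (one broken skew, roughly -7s good).
func window(n, bad int) []float64 {
	w := make([]float64, 0, n)
	for i := 0; i < n; i++ {
		if i < bad {
			w = append(w, -56877071)
		} else {
			w = append(w, -7)
		}
	}
	return w
}

// goodStats returns the fraction of samples with |skew| < 3600s and the bad count.
func goodStats(skews []float64) (float64, int) {
	good := 0
	for _, s := range skews {
		if math.Abs(s) < 3600 {
			good++
		}
	}
	return float64(good) / float64(len(skews)), len(skews) - good
}

// tier applies only the two new checks from the proposal.
func tier(goodFraction float64, badCount int) string {
	switch {
	case goodFraction < 0.10:
		return "no_clock"
	case goodFraction < 0.80 && badCount > 0:
		return "bimodal_clock"
	default:
		return "existing"
	}
}

func main() {
	cases := []struct {
		name string
		bad  int // bad samples out of 20
		want string
	}{
		{"60% bad / 40% good", 12, "bimodal_clock"},
		{"95% bad / 5% good", 19, "no_clock"},
		{"5% bad / 95% good", 1, "existing"},
	}
	for _, c := range cases {
		frac, bad := goodStats(window(20, c.bad))
		got := tier(frac, bad)
		fmt.Printf("%s: got %s, want %s\n", c.name, got, c.want)
	}
}
```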

References: #789 (Theil-Sen + recentMedianSkewSec)

Labels: enhancement (New feature or request)
