Symptom
Some nodes emit adverts with bimodal clocks: most timestamps are roughly correct (skew within ±60s), but a meaningful fraction are wildly off (often −56M to −60M sec, roughly −1.8 to −1.9 yr). This is most likely a firmware bug in which the RTC isn't initialized after a reset, so the node briefly transmits uninitialized default timestamps before re-syncing.
Currently #789's Theil-Sen + recent-window median handles isolated outliers, but when ≥50% of the recent window is broken, the median IS broken and the node is reported as severity: "no_clock". The actually-good adverts are hidden, and operators just see "No Clock" for a node that demonstrably has a working clock most of the time.
Repro
- Staging node f81d265c03c5c1b2508d172633e2160eabe3126ef05538c20d94c8c1f2783a61
- 31,013 samples, recent samples (chronological): −56,877,071s (broken), −6.8s, −60,497,935s, −59,813,698s, −13.6s
- API: severity:"no_clock", recentMedianSkewSec:-56877071.6, medianSkewSec:-7.8 (long-window median IS sane)
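To see why the recent-window median collapses here, the five samples above can be run through a plain median, with and without the |skew| < 3600s filter proposed below. This is a minimal Go sketch, not the code from #789: median, goodOnly and reproSamples are hypothetical names, and the five values are only the samples quoted in this issue, not the full recent window.

```go
package main

import (
	"math"
	"sort"
)

// goodSkewLimitSec is the |skew| cutoff from the proposal (3600s).
const goodSkewLimitSec = 3600.0

// reproSamples are the five recent skews quoted in the repro, in seconds.
var reproSamples = []float64{-56877071, -6.8, -60497935, -59813698, -13.6}

// median returns the median of xs without modifying the caller's slice.
// It returns NaN for an empty slice.
func median(xs []float64) float64 {
	if len(xs) == 0 {
		return math.NaN()
	}
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

// goodOnly keeps the samples whose absolute skew is under the cutoff.
func goodOnly(xs []float64) []float64 {
	var good []float64
	for _, x := range xs {
		if math.Abs(x) < goodSkewLimitSec {
			good = append(good, x)
		}
	}
	return good
}
```

With three of the five samples broken, median(reproSamples) is −56,877,071, which is what drives the no_clock verdict. median(goodOnly(reproSamples)) is −10.2 for these five samples alone.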
Operator need
Operators need to know "this node has a flaky clock" — not have it hidden behind "No Clock". The bad-timestamp adverts can confuse downstream analysis (path attribution, time-windowed queries) so visibility is the point.
Proposed fix — new severity tier bimodal_clock
Backend (cmd/server/clock_skew.go)
- After Theil-Sen + recent-median, compute goodFraction = fraction of recent samples with |skew| < 3600s.
- Classification (in order):
  - no_clock: goodFraction < 0.10 (essentially never has a real clock)
  - bimodal_clock: 0.10 ≤ goodFraction < 0.80 AND at least one sample with |skew| > 3600s in recent window (NEW)
  - severe, warn, ok: as today, computed from the median of the GOOD samples (|skew| < 3600s) so operators see the real working-clock skew
- Add goodFraction and recentBadSampleCount to /api/nodes/{pk}/clock-skew response
- Re-classify per-hash detection in clock_skew.go:351 and fleet/node aggregations at :422-423
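The classification order above could be sketched roughly as follows. This is an illustration under assumptions, not the actual contents of clock_skew.go: Classification, classifyRecent, GoodMedianSkewSec and the severityFromSkew callback (a stand-in for the existing severe/warn/ok thresholds, which this issue does not restate) are hypothetical names, and treating an empty window as no_clock is also an assumption.

```go
package main

import (
	"math"
	"sort"
)

const (
	goodSkewLimitSec = 3600.0 // |skew| below this counts as a good sample
	noClockBelow     = 0.10   // goodFraction below this means no_clock
	bimodalBelow     = 0.80   // goodFraction below this, with bad samples, means bimodal_clock
)

// Classification is a hypothetical result type. Only goodFraction and
// recentBadSampleCount are field names taken from the proposal.
type Classification struct {
	Severity             string  `json:"severity"`
	GoodMedianSkewSec    float64 `json:"goodMedianSkewSec"` // hypothetical field name
	GoodFraction         float64 `json:"goodFraction"`
	RecentBadSampleCount int     `json:"recentBadSampleCount"`
}

// medianOf returns the median of a non-empty slice, leaving the input untouched.
func medianOf(xs []float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

// classifyRecent applies the proposed tiers, in order, to the recent window.
// severityFromSkew is assumed to map a skew in seconds to severe, warn or ok.
func classifyRecent(recent []float64, severityFromSkew func(float64) string) Classification {
	if len(recent) == 0 {
		return Classification{Severity: "no_clock"} // assumption: no data means no clock
	}
	var good []float64
	bad := 0
	for _, s := range recent {
		switch a := math.Abs(s); {
		case a < goodSkewLimitSec:
			good = append(good, s)
		case a > goodSkewLimitSec:
			bad++
		}
	}
	res := Classification{
		GoodFraction:         float64(len(good)) / float64(len(recent)),
		RecentBadSampleCount: bad,
	}
	if len(good) > 0 {
		res.GoodMedianSkewSec = medianOf(good)
	}
	switch {
	case res.GoodFraction < noClockBelow:
		res.Severity = "no_clock"
	case res.GoodFraction < bimodalBelow && bad > 0:
		res.Severity = "bimodal_clock"
	default:
		res.Severity = severityFromSkew(res.GoodMedianSkewSec)
	}
	return res
}
```

Passing the existing severity logic in as a callback keeps the sketch independent of the current thresholds. For the five repro samples this yields bimodal_clock with goodFraction 0.4 and recentBadSampleCount 3.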
Frontend (public/roles.js)
- New CSS class skew-badge--bimodal_clock (color: amber, like a warning, distinct from severe red)
- SKEW_SEVERITY_LABELS.bimodal_clock = 'Bimodal'
- In formatSkew/badge rendering: show the GOOD-sample skew (e.g. -7s) with the bimodal badge so operators see "node clock is mostly fine, here's the real skew" plus a flag "but X% of adverts have nonsense timestamps"
- Tooltip: "Bimodal clock: NN% of recent adverts have nonsense timestamps. Median of good samples: -7s"
- In node-detail clock card: add a line "⚠️ N of last M adverts had nonsense timestamps (likely RTC reset)"
Acceptance
- Node f81d265c…3a61 on staging shows badge Bimodal with −7s (good-sample median) instead of No Clock
- Hover shows good fraction + bad count
- True no-clock nodes (no good samples ever) still show No Clock
- All-good-clock nodes still show OK
Out of scope
- Auto-detecting WHICH firmware version causes this (correlate by node firmware_version/role + report)
- Filtering bimodal-bad samples out of downstream time-windowed analytics — separate issue
Tests
- Unit test bimodal scenario with 60% bad / 40% good in window → severity bimodal_clock, displayed skew = good-median
- 5% good / 95% bad → still no_clock
- 95% good / 5% bad → ok (single outlier doesn't trigger bimodal)
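The three scenarios above could be driven from a small table, for example as below. This is a sketch under assumptions: scenario, makeWindow and failedScenarios are hypothetical names, the classifier under test is passed in as a function rather than taken from clock_skew.go, and the sample values (−7s good, −56,877,071s bad) are borrowed from the repro. The same table could equally be wrapped in a standard testing table test.

```go
package main

const (
	goodSkewSec = -7.0        // a sane skew, matching the expected good-sample median
	badSkewSec  = -56877071.0 // a broken skew taken from the repro
)

// scenario describes one synthetic recent window and what the issue expects.
type scenario struct {
	name         string
	goodPct      int    // percentage of samples in the window that are good
	wantSeverity string // expected severity tier
	checkSkew    bool   // also require displayed skew == good-sample median
}

// proposedScenarios mirrors the three cases listed in this issue.
var proposedScenarios = []scenario{
	{"60% bad / 40% good", 40, "bimodal_clock", true},
	{"5% good / 95% bad", 5, "no_clock", false},
	{"95% good / 5% bad", 95, "ok", false},
}

// makeWindow builds n samples of which goodPct percent are good.
// Good samples come first; order does not affect a fraction-based rule.
func makeWindow(n, goodPct int) []float64 {
	w := make([]float64, 0, n)
	good := n * goodPct / 100
	for i := 0; i < n; i++ {
		if i < good {
			w = append(w, goodSkewSec)
		} else {
			w = append(w, badSkewSec)
		}
	}
	return w
}

// failedScenarios runs classify over every scenario with a 100-sample window
// and returns the names of the scenarios whose result does not match.
func failedScenarios(classify func([]float64) (severity string, displayedSkew float64)) []string {
	var failed []string
	for _, s := range proposedScenarios {
		sev, skew := classify(makeWindow(100, s.goodPct))
		if sev != s.wantSeverity || (s.checkSkew && skew != goodSkewSec) {
			failed = append(failed, s.name)
		}
	}
	return failed
}
```

An empty result from failedScenarios means the classifier satisfies all three cases. A classifier that still uses the raw recent-window median would be expected to fail the first one.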
References: #789 (Theil-Sen + recentMedianSkewSec)