From f872fd90bf5d7e719fc38ddca35d765ee3b243ef Mon Sep 17 00:00:00 2001 From: you Date: Fri, 24 Apr 2026 23:18:15 +0000 Subject: [PATCH 01/12] docs: clock-skew classifier redesign spec --- docs/clock-skew-redesign.md | 241 ++++++++++++++++++++++++++++++++++++ 1 file changed, 241 insertions(+) create mode 100644 docs/clock-skew-redesign.md diff --git a/docs/clock-skew-redesign.md b/docs/clock-skew-redesign.md new file mode 100644 index 00000000..c03df095 --- /dev/null +++ b/docs/clock-skew-redesign.md @@ -0,0 +1,241 @@ +# Clock Skew Classifier — Redesign + +**Status:** spec, pre-implementation +**Supersedes:** parts of #690 / #789 / #845 / PR #894 +**Date drafted:** 2026-04-24 + +## Problem + +The current classifier (`cmd/server/clock_skew.go`) uses windowed medians, hysteresis, "good fraction" floors, and a 365-day `no_clock` threshold. It produces: + +- False `no_clock` flags on nodes whose clocks are working today but had garbage timestamps in recent samples. +- Symmetric severity bands that conflate "clock at firmware default" with "operator set the clock wrong by a year" — completely different operator actions required. +- Compounding over-engineering as each operator complaint added a new tier or window. + +The actual physical reality of these devices is much simpler than the classifier assumes. + +## Hardware reality + +Most MeshCore nodes have **no auto-updating RTC**. There are two hardware paths: + +1. **Volatile RTC nodes** (`firmware/src/helpers/ArduinoHelpers.h:11` — `VolatileRTCClock`): + - On boot, `base_time` is hardcoded to a firmware-build constant (currently `1715770351` = 2024-05-15 20:52:31 UTC). + - `getCurrentTime()` returns `base_time + millis()/1000`. + - On reboot the value snaps back to the constant. + - User must manually sync via companion app (`set time` CLI invokes `setCurrentTime(...)`) to set a real wall-clock time, which then ticks until the next reboot. + +2. 
**Hardware-RTC nodes** (`firmware/src/helpers/AutoDiscoverRTCClock.cpp` — DS3231 / RV3028 / PCF8563): + - Real-time chip with battery backup. Holds the time across reboots. + - Behaves correctly once set; no default-snap behavior. + +The `set time RESET` CLI command (`firmware/src/helpers/CommonCLI.cpp:215`) explicitly calls `setCurrentTime(1715770351)` regardless of hardware — so even hardware-RTC nodes can be deliberately reset to the default epoch. + +**Therefore every node is in exactly one of these states:** + +| State | Description | +|---|---| +| **Default / never set** | RTC is at a firmware-default epoch + ticking up since the last boot. | +| **Set, drifting normally** | RTC was synced; small skew accumulating at ~0.8s/day per #789 reports. | +| **Set, drifted past tolerance** | Like above but skew has grown beyond what's useful. | +| **Wrong** | Operator-set incorrect time, or genuine RTC malfunction not matching any known default. | + +There is no "bimodal RTC bug" — what looked bimodal in #845 is just a sequence of `defaulted → user sync → reboot → defaulted again`. The "bad" timestamps are not noise; they're a constant (the default epoch + a small uptime). + +## Production data analysis (2026-04-24) + +### 00id.net (this deployment, 416 nodes, commit `abd9c46`) + +`lastSkewSec` (advert_ts − observed_ts) distribution: + +| Bucket | Count | Pct | +|---|---:|---:| +| OK ≤15s | 90 | 22% | +| Degrading ≤60s | 93 | 22% | +| Degraded ≤10m | 13 | 3% | +| off ≤1d | 5 | 1% | +| off ≤1y | 110 | 26% | +| absurd >1y | 105 | 25% | + +Per-node `lastAdvertTS` raw timestamp distribution shows a sharp default cluster: + +``` ++0 days count=19 samples=114969 ← exactly at 1715770351 (just rebooted) ++1d count=9 samples=24766 ++2d count=7 samples=58101 ++3d count=2 samples=360 +... 
← decay through ~110 days ++113d count=2 samples=53776 +``` + +103 of 416 nodes (25%) have `lastAdvertTS` between `1715770351` and `1715770351 + 730 days`, consistent with the volatile-RTC-default-ticking-up pattern. + +A second cluster of 5 nodes has `lastAdvertTS = 1672531542 ≈ 1672531200 + 5min` = **2023-01-01 00:00:00 UTC** + small uptime. This is a *different* firmware-default epoch from an older firmware version. + +### Cascadia (analyzer.cascadiamesh.org, 433 nodes in 5000-packet sample, commit `111b03c` v3.5.0) + +ADVERT timestamp by year-month: + +``` +1970-01 1 ← epoch zero (ESP32 native fallback OR ancient firmware) +2021-01 1 ← possible third default epoch +2023-01 2 ← old firmware default (matches 00id) +2024-05 60 ← current VolatileRTCClock + days uptime +2024-06 39 ← same default + weeks uptime +2024-07 21 +2024-08 10 +2024-09 2 +2024-10 1 +2024-11 2 ← decays out as fewer nodes have multi-month uptime since reboot +2025-10 1 ← pre-current-now miscellany +2025-11 2 +2026-03 4 +2026-04 285 ← currently set clocks (this is "now-ish") +2027-04 1 ← operator set wrong by ~1 year (typo?) +2067-12 1 ← operator set wildly wrong / corrupted RTC +``` + +Confirms the model: ~67% of nodes have a current clock, ~32% are at known firmware defaults at varying uptime offsets, ~3 outliers represent genuine misconfigurations. + +## Known firmware default epochs + +These are the values discovered in production data so far: + +| Epoch (unix) | UTC | Source | +|---:|---|---| +| `0` | 1970-01-01 | Likely ESP32 boot when no RTC initialization runs (`time(NULL)` returns 0). | +| `1609459200` | 2021-01-01 | Speculation — single-sample evidence, validate as more data arrives. | +| `1672531200` | 2023-01-01 | Older firmware `VolatileRTCClock::base_time` value. | +| `1715770351` | 2024-05-15 20:52:31 | **Current** `VolatileRTCClock` constructor + `set time RESET` CLI. | + +Treat the table as data, not fixed code. 
New firmware versions will introduce new defaults; expect to add to the list over time. + +## Reconciliation with #690 — the four timestamps + +#690 lists three timestamps; in practice there are four signals worth distinguishing: + +| Signal | Source | Used for | +|---|---|---| +| `advert_ts` | Inside MeshCore packet, set by sending node | Per-node classification (THE signal). | +| `mqtt_envelope_ts` | Set by observer when it forwards via MQTT | Observer-side calibration only — *not* a direct node-skew signal because observer clock can itself be wrong. | +| `corescope_received_ts` | Wall clock when CoreScope ingested the message | Reference "now"; calibration cross-check. | +| `same_packet_across_observers` | Multiple observers seeing the same hash | Phase 2 calibration (triangulation). | + +**Inputs flow:** + +1. **Phase 2 (existing, kept):** for each packet hash seen by ≥2 observers, compute each observer's deviation from the per-packet median observed_ts → `observerOffset`. This is the triangulation #690 calls for ("Same packet observed by more than one (ideally 3+) observers gives good indication if one observer is off"). Observer offsets are the calibration table. +2. **Per-advert correction (existing, kept):** `correctedSkew = (advert_ts - observed_ts) + observerOffset[observer_id]`. If no calibration exists for an observer, fall back to raw skew with `calibrated: false`. +3. **Default detection (new):** runs on RAW `advert_ts`, not corrected. The firmware default is a fixed wall-clock value; observer offsets are seconds-to-minutes scale and cannot move `advert_ts` from 2024 to 2026. Default check is independent of calibration. +4. **Severity classification (new):** if `is_default(advert_ts)` → `default`; else classify by `|correctedSkew|` band. + +This keeps everything #690 asks for (observer detection, bias subtraction, triangulation), and adds the firmware-default cluster as a new pre-empting tier. 
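Steps 1–2 above can be sketched as follows. This is a minimal illustration of the per-packet-median triangulation, not the existing `calibrateObservers()` Go code; `observations`, `calibrate_observers`, and `corrected_skew` are illustrative names:

```python
from collections import defaultdict
from statistics import median

def calibrate_observers(observations):
    """observations: iterable of (packet_hash, observer_id, observed_ts).
    Returns observer_id -> offset_sec (positive = observer clock ahead)."""
    by_hash = defaultdict(list)
    for pkt_hash, obs_id, obs_ts in observations:
        by_hash[pkt_hash].append((obs_id, obs_ts))

    deviations = defaultdict(list)
    for group in by_hash.values():
        if len(group) < 2:
            continue  # single-observer packet: nothing to triangulate against
        med = median(obs_ts for _, obs_ts in group)
        for obs_id, obs_ts in group:
            deviations[obs_id].append(obs_ts - med)

    # An observer's offset is the median of its deviations across packets.
    return {obs_id: median(devs) for obs_id, devs in deviations.items()}

def corrected_skew(advert_ts, observed_ts, observer_id, offsets):
    """Returns (corrected_skew_sec, calibrated)."""
    raw = advert_ts - observed_ts
    if observer_id in offsets:
        # An observer running ahead inflates observed_ts and deflates raw
        # skew, so its offset is added back.
        return raw + offsets[observer_id], True
    return raw, False  # no calibration for this observer: raw skew, calibrated=false
```

With only two observers the per-packet median is their midpoint, so a 5 s disagreement splits into symmetric +2.5 s / −2.5 s offsets; three or more observers per packet give the robustness #690's triangulation note asks for.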
+ +## UI: explain WHY (#690 requirement) + +The classifier alone doesn't satisfy #690's "present on the UI why clock skew is obvious or suspected." The evidence panel from PR #906 (per-hash observer breakdown showing raw vs corrected skew per observer) is the WHY. + +For each per-node clock card the UI must show: + +- **Tier badge** (default / ok / degrading / degraded / wrong) + magnitude. +- **Plain-English reason line**: e.g. "Last advert at 2024-05-15 + 3.2 days uptime — matches firmware default (volatile RTC, not yet user-set)" or "Last advert −12s vs wall clock — within OK tolerance." +- **Calibration footnote**: "Skew corrected using observer X offset +1.7s (computed from 412 multi-observer packets)" or "Single-observer measurement, no calibration available." +- **Evidence accordion** (PR #906 shape, retained): for the most recent N hashes, each observer's raw vs corrected skew + the observer's offset. + +For the per-observer page (also from PR #906): show the observer's offset, the multi-observer sample count, and a tier badge using the same scale (treating `|observerOffset|` as the skew). + +## Proposed classifier + +Per-advert classification, no windowing: + +```python +DEFAULT_EPOCHS = [0, 1609459200, 1672531200, 1715770351] +MAX_PLAUSIBLE_UPTIME_SEC = 730 * 86400 # 2 years + +def is_default(ts): + return any(d <= ts <= d + MAX_PLAUSIBLE_UPTIME_SEC for d in DEFAULT_EPOCHS) + +def classify(advert_ts, corrected_skew_sec): + if is_default(advert_ts): + return "default" # gray + abs_skew = abs(corrected_skew_sec) + if abs_skew <= 15: return "ok" # green + if abs_skew <= 60: return "degrading" # yellow + if abs_skew <= 600: return "degraded" # orange + return "wrong" # red +``` + +`corrected_skew_sec` is the observer-bias-subtracted skew per Phase 2 calibration. Default detection is independent of calibration (runs on raw `advert_ts`). 
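One caveat worth encoding at implementation time: with a 730-day uptime cap, the window for the current epoch (`1715770351`, 2024-05-15) extends to 2026-05-15, which overlaps "now" as of this draft — a literal `is_default` would therefore also match correctly set clocks, contradicting the validation expectation that the 2026-04 cascadia nodes classify as `ok`. The sketch below bounds the default window away from the current wall clock; the `now` parameter and the `NEAR_NOW_MARGIN_SEC` value are assumptions added here, not part of the classifier above:

```python
DEFAULT_EPOCHS = [0, 1609459200, 1672531200, 1715770351]
MAX_PLAUSIBLE_UPTIME_SEC = 730 * 86400  # 2 years
NEAR_NOW_MARGIN_SEC = 30 * 86400        # assumed: timestamps within 30d of now are "set"

def is_default(ts, now):
    # Cap each default window at now - margin: without the cap, the
    # 1715770351 window reaches 2026-05-15 and swallows every currently
    # correct clock.
    return any(d <= ts <= min(d + MAX_PLAUSIBLE_UPTIME_SEC,
                              now - NEAR_NOW_MARGIN_SEC)
               for d in DEFAULT_EPOCHS)

def classify(advert_ts, corrected_skew_sec, now):
    if is_default(advert_ts, now):
        return "default"
    abs_skew = abs(corrected_skew_sec)
    if abs_skew <= 15: return "ok"
    if abs_skew <= 60: return "degrading"
    if abs_skew <= 600: return "degraded"
    return "wrong"

now = 1776985000  # ≈ 2026-04-24, the draft date
print(classify(1715770351 + 3 * 86400, -6.0e7, now))  # default (epoch + 3d uptime)
print(classify(1672531542, -1.0e8, now))              # default (2023 epoch + ~5 min)
print(classify(now - 12, -12.0, now))                 # ok      (set clock, -12 s)
print(classify(3090000000, 1.3e9, now))               # wrong   (2067 outlier)
```

The tradeoff is that a defaulted node with roughly two years of uptime falls through to `wrong` instead of `default`; the production data above shows default-cluster uptimes decaying out by ~113 days, so the margin costs little in practice.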
+ +Per-node state = classification of the node's most-recent advert (per hash, picking the most recent observation across all observers). No medians, no good-fraction, no hysteresis. + +## Severity tier definitions + +| Tier | Condition | Color | UI label | Meaning | +|---|---|---|---|---| +| `default` | Advert ts within `[default, default + 2y]` of any known epoch | Gray | "Default" | Volatile RTC at firmware boot constant; never set or rebooted and not re-synced. | +| `ok` | abs(skew) ≤ 15s | Green | "OK" | Working clock. | +| `degrading` | 15s < abs(skew) ≤ 60s | Yellow | "Degrading" | Real but accumulating drift. | +| `degraded` | 60s < abs(skew) ≤ 600s | Orange | "Degraded" | Off by minutes — needs re-sync. | +| `wrong` | abs(skew) > 600s and not `default` | Red | "Wrong" | Operator-set error or RTC malfunction. | + +## What this kills + +- The 365-day `no_clock` threshold and the entire `recentSkewWindow{Count,Sec}` machinery. +- The hysteresis / `goodFraction` / `longTermGoodFraction` logic from PR #894. +- The proposed `bimodal_clock` tier from #845 — the pattern is not bimodal, it's defaulted vs set. +- All Theil-Sen drift calculations as classifier inputs (drift remains a derived display value). + +## What this preserves + +- **Phase 2 observer calibration** (`calibrateObservers()`) — kept verbatim. It's what powers the "subtract observer bias" requirement from #690 and provides the triangulation evidence the UI needs. +- **Drift display** (computed but not classifying). +- **PR #906 evidence UI** — orthogonal to the classifier; it is in fact the implementation of #690's "explain WHY" requirement. Only label strings change to match the new tier names. +- **`/api/observers/clock-skew`** — unchanged shape. + +## API impact + +`/api/nodes/{pubkey}/clock-skew` response changes: + +- `severity` enum: `default | ok | degrading | degraded | wrong` (no more `no_clock | severe | warn | absurd`). 
+- New field `defaultEpoch` (int, optional): if `severity == "default"`, the matched epoch. +- Drop fields: `recentMedianSkewSec`, `goodFraction`, `recentBadSampleCount`, `longTermGoodFraction`. +- Keep: `lastSkewSec`, `medianSkewSec`, `meanSkewSec`, `driftPerDaySec`, `sampleCount`, `calibrated`, `lastAdvertTS`, `lastObservedTS`, `nodeName`, `nodeRole`. + +`/api/nodes/clock-skew` (fleet) shape unchanged except severity enum values. + +## UI impact + +- New CSS classes `skew-badge--default`, `skew-badge--degrading`, `skew-badge--degraded`, `skew-badge--wrong`. Drop `--no_clock`, `--severe`, `--warn`, `--absurd`, `--bimodal_clock`. +- Tooltip text updated per tier. +- "Default" badge tooltip should explain the clock is at firmware default plus uptime since boot, and the operator hasn't set it yet (or hasn't re-set it since the last reboot). + +## Migration + +Single PR replaces the classifier in `clock_skew.go` and updates the frontend badges/labels. No database schema change, no data migration — all per-call computation. + +## Open issues to close + +- **#789** (median hides corrected clocks) — resolved by per-advert classification. +- **#845** (bimodal_clock tier) — replaced by `default` tier; the pattern that motivated it is correctly captured. +- **PR #894** — close without merging; this design supersedes Option C entirely. +- **#690** UI completion (PR #906) — keeps moving in parallel; only label updates needed. + +## Validation plan + +1. Hand-run the classifier against a snapshot of `/api/nodes/clock-skew` from 00id and cascadia. Confirm: + - All 103 00id "absurd" nodes reclassify as `default`. + - All 5 cascadia 2023-01 nodes reclassify as `default`. + - The 2027 / 2067 cascadia outliers reclassify as `wrong`. + - The 285 cascadia 2026-04 nodes reclassify as `ok` (or `degrading` if drift exceeds 15s). +2. Add per-tier unit tests in `cmd/server/clock_skew_test.go`. +3. 
Add a regression test for each known default epoch (synthesize advert at `default + 0s`, `default + 1d`, `default + 2y - 1s` → all classify as `default`). +4. Edge cases: + - `advert_ts == 0` → matches default epoch 0. + - `advert_ts == 1715770351 + 731 days` → no longer matches (uptime cap exceeded) — should fall through to time-based classification, likely `wrong`. + - Future timestamps beyond `now + 600s` → `wrong`. + +## Out of scope (follow-ups) + +- Per-firmware-version known-default lookup (when `firmware_version` field becomes reliable on adverts). +- Reboot-count / flakiness indicator ("this node has hit default N times in last 30d"). +- Auto-discovery of new default epochs from clustering analysis (could detect a 4th default emerging in the wild). +- Filtering defaulted-clock adverts out of time-windowed analytics queries (separate spec — affects path attribution). From 545df2788d67e07af0f76b6814f0e5fc057d9b27 Mon Sep 17 00:00:00 2001 From: you Date: Fri, 24 Apr 2026 23:20:12 +0000 Subject: [PATCH 02/12] feat: replace clock skew classifier with default-detection model --- cmd/server/clock_skew.go | 372 +++++++++++++-------------------------- 1 file changed, 124 insertions(+), 248 deletions(-) diff --git a/cmd/server/clock_skew.go b/cmd/server/clock_skew.go index 0b2c363b..62ccdc99 100644 --- a/cmd/server/clock_skew.go +++ b/cmd/server/clock_skew.go @@ -12,20 +12,28 @@ import ( type SkewSeverity string const ( - SkewOK SkewSeverity = "ok" // < 5 min - SkewWarning SkewSeverity = "warning" // 5 min – 1 hour - SkewCritical SkewSeverity = "critical" // 1 hour – 30 days - SkewAbsurd SkewSeverity = "absurd" // > 30 days - SkewNoClock SkewSeverity = "no_clock" // > 365 days — uninitialized RTC - SkewBimodalClock SkewSeverity = "bimodal_clock" // mixed good+bad recent samples (flaky RTC) + SkewDefault SkewSeverity = "default" // firmware-default epoch + uptime + SkewOK SkewSeverity = "ok" // |skew| <= 15s + SkewDegrading SkewSeverity = "degrading" // 15s < |skew| 
<= 60s + SkewDegraded SkewSeverity = "degraded" // 60s < |skew| <= 600s + SkewWrong SkewSeverity = "wrong" // |skew| > 600s and not default ) +// Known firmware default epochs. Nodes with advert_ts in +// [epoch, epoch + maxPlausibleUptimeSec] are classified as "default". +// See docs/clock-skew-redesign.md for provenance of each value. +var defaultEpochs = []int64{0, 1609459200, 1672531200, 1715770351} + // Default thresholds in seconds. const ( - skewThresholdWarnSec = 5 * 60 // 5 minutes - skewThresholdCriticalSec = 60 * 60 // 1 hour - skewThresholdAbsurdSec = 30 * 24 * 3600 // 30 days - skewThresholdNoClockSec = 365 * 24 * 3600 // 365 days — uninitialized RTC + // maxPlausibleUptimeSec caps how far past a default epoch we still + // consider "default + uptime ticking". 730 days ≈ 2 years. + maxPlausibleUptimeSec = 730 * 86400 + + // Severity band boundaries (absolute skew in seconds). + skewThresholdOKSec = 15 + skewThresholdDegradingSec = 60 + skewThresholdDegradedSec = 600 // minDriftSamples is the minimum number of advert transmissions needed // to compute a meaningful linear drift rate. @@ -35,54 +43,43 @@ const ( // drift rates (> 1 day/day) indicate insufficient or outlier samples. maxReasonableDriftPerDay = 86400.0 - // recentSkewWindowCount is the number of most-recent advert samples - // used to derive the "current" skew for severity classification (see - // issue #789). The all-time median is poisoned by historical bad - // samples (e.g. a node that was off and then GPS-corrected); severity - // must reflect current health, not lifetime statistics. - recentSkewWindowCount = 5 - - // recentSkewWindowSec bounds the recent-window in time as well: only - // samples from the last N seconds count as "recent" for severity. - // The effective window is min(recentSkewWindowCount, samples in 1h). 
- recentSkewWindowSec = 3600 - - // bimodalSkewThresholdSec is the absolute skew threshold (1 hour) - // above which a sample is considered "bad" — likely firmware emitting - // a nonsense timestamp from an uninitialized RTC, not real drift. - // Chosen to match the warning/critical severity boundary: real clock - // drift rarely exceeds 1 hour, while epoch-0 RTCs produce ~1.7B sec. - bimodalSkewThresholdSec = 3600.0 - // maxPlausibleSkewJumpSec is the largest skew change between - // consecutive samples that we treat as physical drift. Anything larger - // (e.g. a GPS sync that jumps the clock by minutes/days) is rejected - // as an outlier when computing drift. Real microcontroller drift is - // fractions of a second per advert; 60s is a generous safety factor. + // consecutive samples that we treat as physical drift. maxPlausibleSkewJumpSec = 60.0 // theilSenMaxPoints caps the number of points fed to Theil-Sen - // regression (O(n²) in pairs). For nodes with thousands of samples we - // keep the most-recent points, which are also the most relevant for - // current drift. + // regression (O(n²) in pairs). theilSenMaxPoints = 200 ) -// classifySkew maps absolute skew (seconds) to a severity level. -// Float64 comparison is safe: inputs are rounded to 1 decimal via round(), -// and thresholds are integer multiples of 60 — no rounding artifacts. -func classifySkew(absSkewSec float64) SkewSeverity { +// isDefaultEpoch returns true if the raw advert timestamp falls within +// [epoch, epoch + maxPlausibleUptimeSec] for any known firmware default. +// If matched, returns the matched epoch; otherwise returns 0. +func isDefaultEpoch(advertTS int64) (bool, int64) { + for _, epoch := range defaultEpochs { + if advertTS >= epoch && advertTS <= epoch+maxPlausibleUptimeSec { + return true, epoch + } + } + return false, 0 +} + +// classifySkew maps a raw advert timestamp and corrected absolute skew +// to a severity level. 
Default detection runs on the raw advert_ts +// (independent of observer calibration). +func classifySkew(advertTS int64, absCorrectedSkewSec float64) (SkewSeverity, int64) { + if ok, epoch := isDefaultEpoch(advertTS); ok { + return SkewDefault, epoch + } switch { - case absSkewSec >= skewThresholdNoClockSec: - return SkewNoClock - case absSkewSec >= skewThresholdAbsurdSec: - return SkewAbsurd - case absSkewSec >= skewThresholdCriticalSec: - return SkewCritical - case absSkewSec >= skewThresholdWarnSec: - return SkewWarning + case absCorrectedSkewSec <= skewThresholdOKSec: + return SkewOK, 0 + case absCorrectedSkewSec <= skewThresholdDegradingSec: + return SkewDegrading, 0 + case absCorrectedSkewSec <= skewThresholdDegradedSec: + return SkewDegraded, 0 default: - return SkewOK + return SkewWrong, 0 } } @@ -90,38 +87,35 @@ func classifySkew(absSkewSec float64) SkewSeverity { // skewSample is a single raw skew measurement from one advert observation. type skewSample struct { - advertTS int64 // node's advert Unix timestamp - observedTS int64 // observation Unix timestamp - observerID string // which observer saw this - hash string // transmission hash (for multi-observer grouping) + advertTS int64 // node's advert Unix timestamp + observedTS int64 // observation Unix timestamp + observerID string // which observer saw this + hash string // transmission hash (for multi-observer grouping) } // ObserverCalibration holds the computed clock offset for an observer. type ObserverCalibration struct { ObserverID string `json:"observerID"` - OffsetSec float64 `json:"offsetSec"` // positive = observer clock ahead - Samples int `json:"samples"` // number of multi-observer packets used + OffsetSec float64 `json:"offsetSec"` // positive = observer clock ahead + Samples int `json:"samples"` // number of multi-observer packets used } // NodeClockSkew is the API response for a single node's clock skew data. 
type NodeClockSkew struct { - Pubkey string `json:"pubkey"` - MeanSkewSec float64 `json:"meanSkewSec"` // corrected mean skew (positive = node ahead) - MedianSkewSec float64 `json:"medianSkewSec"` // corrected median skew - LastSkewSec float64 `json:"lastSkewSec"` // most recent corrected skew - RecentMedianSkewSec float64 `json:"recentMedianSkewSec"` // median across most-recent samples (drives severity, see #789) - DriftPerDaySec float64 `json:"driftPerDaySec"` // linear drift rate (sec/day) - Severity SkewSeverity `json:"severity"` - SampleCount int `json:"sampleCount"` - Calibrated bool `json:"calibrated"` // true if observer calibration was applied - LastAdvertTS int64 `json:"lastAdvertTS"` // most recent advert timestamp - LastObservedTS int64 `json:"lastObservedTS"` // most recent observation timestamp - Samples []SkewSample `json:"samples,omitempty"` // time-series for sparklines - GoodFraction float64 `json:"goodFraction"` // fraction of recent samples with |skew| <= 1h - RecentBadSampleCount int `json:"recentBadSampleCount"` // count of recent samples with |skew| > 1h - RecentSampleCount int `json:"recentSampleCount"` // total recent samples in window - NodeName string `json:"nodeName,omitempty"` // populated in fleet responses - NodeRole string `json:"nodeRole,omitempty"` // populated in fleet responses + Pubkey string `json:"pubkey"` + MeanSkewSec float64 `json:"meanSkewSec"` // corrected mean skew (positive = node ahead) + MedianSkewSec float64 `json:"medianSkewSec"` // corrected median skew + LastSkewSec float64 `json:"lastSkewSec"` // most recent corrected skew + DriftPerDaySec float64 `json:"driftPerDaySec"` // linear drift rate (sec/day) + Severity SkewSeverity `json:"severity"` + SampleCount int `json:"sampleCount"` + Calibrated bool `json:"calibrated"` // true if observer calibration was applied + LastAdvertTS int64 `json:"lastAdvertTS"` // most recent advert timestamp + LastObservedTS int64 `json:"lastObservedTS"` // most recent observation 
timestamp + DefaultEpoch *int64 `json:"defaultEpoch,omitempty"` // matched epoch when severity=default + Samples []SkewSample `json:"samples,omitempty"` // time-series for sparklines + NodeName string `json:"nodeName,omitempty"` // populated in fleet responses + NodeRole string `json:"nodeRole,omitempty"` // populated in fleet responses } // SkewSample is a single (timestamp, skew) point for sparkline rendering. @@ -130,28 +124,26 @@ type SkewSample struct { SkewSec float64 `json:"skew"` // corrected skew in seconds } -// txSkewResult maps tx hash → per-transmission skew stats. This is an -// intermediate result keyed by hash (not pubkey); the store maps hash → pubkey -// when building the final per-node view. +// txSkewResult maps tx hash → per-transmission skew stats. type txSkewResult = map[string]*NodeClockSkew // ── Clock Skew Engine ────────────────────────────────────────────────────────── // ClockSkewEngine computes and caches clock skew data for nodes and observers. type ClockSkewEngine struct { - mu sync.RWMutex - observerOffsets map[string]float64 // observerID → calibrated offset (seconds) - observerSamples map[string]int // observerID → number of multi-observer packets used - nodeSkew txSkewResult - lastComputed time.Time - computeInterval time.Duration + mu sync.RWMutex + observerOffsets map[string]float64 // observerID → calibrated offset (seconds) + observerSamples map[string]int // observerID → number of multi-observer packets used + nodeSkew txSkewResult + lastComputed time.Time + computeInterval time.Duration } func NewClockSkewEngine() *ClockSkewEngine { return &ClockSkewEngine{ - observerOffsets: make(map[string]float64), + observerOffsets: make(map[string]float64), observerSamples: make(map[string]int), - nodeSkew: make(txSkewResult), + nodeSkew: make(txSkewResult), computeInterval: 30 * time.Second, } } @@ -188,7 +180,6 @@ func (e *ClockSkewEngine) Recompute(store *PacketStore) { // Swap results under brief write lock. 
e.mu.Lock() - // Re-check: another goroutine may have computed while we were working. if time.Since(e.lastComputed) < e.computeInterval { e.mu.Unlock() return @@ -214,13 +205,13 @@ func collectSamples(store *PacketStore) []skewSample { if decoded == nil { continue } - // Extract advert timestamp from decoded JSON. advertTS := extractTimestamp(decoded) - if advertTS <= 0 { + if advertTS < 0 { continue } - // Sanity: skip timestamps before year 2020 or after year 2100. - if advertTS < 1577836800 || advertTS > 4102444800 { + // Allow epoch 0 and above (needed for default-epoch detection). + // Upper bound: year 2100. + if advertTS > 4102444800 { continue } @@ -242,7 +233,6 @@ func collectSamples(store *PacketStore) []skewSample { // extractTimestamp gets the Unix timestamp from a decoded ADVERT payload. func extractTimestamp(decoded map[string]interface{}) int64 { - // Try payload.timestamp first (nested in "payload" key). if payload, ok := decoded["payload"]; ok { if pm, ok := payload.(map[string]interface{}); ok { if ts := jsonNumber(pm, "timestamp"); ts > 0 { @@ -250,7 +240,6 @@ func extractTimestamp(decoded map[string]interface{}) int64 { } } } - // Fallback: top-level timestamp. if ts := jsonNumber(decoded, "timestamp"); ts > 0 { return ts } @@ -281,7 +270,6 @@ func parseISO(s string) int64 { } t, err := time.Parse(time.RFC3339, s) if err != nil { - // Try with fractional seconds. t, err = time.Parse("2006-01-02T15:04:05.999999999Z07:00", s) if err != nil { return 0 @@ -295,19 +283,16 @@ func parseISO(s string) int64 { // calibrateObservers computes each observer's clock offset using multi-observer // packets. Returns offset map and sample count map. func calibrateObservers(samples []skewSample) (map[string]float64, map[string]int) { - // Group observations by packet hash. byHash := make(map[string][]skewSample) for _, s := range samples { byHash[s.hash] = append(byHash[s.hash], s) } - // For each multi-observer packet, compute per-observer deviation from median. 
- deviations := make(map[string][]float64) // observerID → list of deviations + deviations := make(map[string][]float64) for _, group := range byHash { if len(group) < 2 { - continue // single-observer packet, can't calibrate + continue } - // Compute median observation timestamp for this packet. obsTimes := make([]float64, len(group)) for i, s := range group { obsTimes[i] = float64(s.observedTS) @@ -319,7 +304,6 @@ func calibrateObservers(samples []skewSample) (map[string]float64, map[string]in } } - // Each observer's offset = median of its deviations. offsets := make(map[string]float64, len(deviations)) counts := make(map[string]int, len(deviations)) for obsID, devs := range deviations { @@ -333,8 +317,6 @@ func calibrateObservers(samples []skewSample) (map[string]float64, map[string]in // computeNodeSkew calculates corrected skew statistics for each node. func computeNodeSkew(samples []skewSample, obsOffsets map[string]float64) txSkewResult { - // Compute corrected skew per sample, grouped by hash (each hash = one - // node's advert transmission). The caller maps hash → pubkey via byNode. type correctedSample struct { skew float64 observedTS int64 @@ -349,8 +331,6 @@ func computeNodeSkew(samples []skewSample, obsOffsets map[string]float64) txSkew rawSkew := float64(s.advertTS - s.observedTS) corrected := rawSkew if hasCal { - // Observer offset = obs_ts - median(all_obs_ts). If observer is ahead, - // its obs_ts is inflated, making raw_skew too low. Add offset to correct. corrected = rawSkew + obsOffset } byHash[s.hash] = append(byHash[s.hash], correctedSample{ @@ -361,10 +341,7 @@ func computeNodeSkew(samples []skewSample, obsOffsets map[string]float64) txSkew hashAdvertTS[s.hash] = s.advertTS } - // Each hash represents one advert from one node. Compute median corrected - // skew per hash (across multiple observers). 
- - result := make(map[string]*NodeClockSkew) // keyed by hash for now + result := make(map[string]*NodeClockSkew) for hash, cs := range byHash { skews := make([]float64, len(cs)) for i, c := range cs { @@ -373,7 +350,6 @@ func computeNodeSkew(samples []skewSample, obsOffsets map[string]float64) txSkew medSkew := median(skews) meanSkew := mean(skews) - // Find latest observation. var latestObsTS int64 var anyCal bool for _, c := range cs { @@ -385,17 +361,25 @@ func computeNodeSkew(samples []skewSample, obsOffsets map[string]float64) txSkew } } - absMedian := math.Abs(medSkew) - result[hash] = &NodeClockSkew{ + lastCorrectedSkew := cs[len(cs)-1].skew + advTS := hashAdvertTS[hash] + severity, matchedEpoch := classifySkew(advTS, math.Abs(lastCorrectedSkew)) + + ncs := &NodeClockSkew{ MeanSkewSec: round(meanSkew, 1), MedianSkewSec: round(medSkew, 1), - LastSkewSec: round(cs[len(cs)-1].skew, 1), - Severity: classifySkew(absMedian), + LastSkewSec: round(lastCorrectedSkew, 1), + Severity: severity, SampleCount: len(cs), Calibrated: anyCal, - LastAdvertTS: hashAdvertTS[hash], + LastAdvertTS: advTS, LastObservedTS: latestObsTS, } + if severity == SkewDefault { + ep := matchedEpoch + ncs.DefaultEpoch = &ep + } + result[hash] = ncs } return result } @@ -457,124 +441,45 @@ func (s *PacketStore) getNodeClockSkewLocked(pubkey string) *NodeClockSkew { medSkew := median(allSkews) meanSkew := mean(allSkews) - // Severity is derived from RECENT samples only (issue #789). The - // all-time median is poisoned by historical bad data — a node that - // was off for hours and then GPS-corrected can have median = -59M sec - // while its current skew is -0.8s. Operators need severity to reflect - // current health, so they trust the dashboard. 
- // - // Sort tsSkews by time and take the last recentSkewWindowCount samples - // (or all samples within recentSkewWindowSec of the latest, whichever - // gives FEWER samples — we want the more-current view; a chatty node - // can fit dozens of samples in 1h, in which case the count cap wins). - sort.Slice(tsSkews, func(i, j int) bool { return tsSkews[i].ts < tsSkews[j].ts }) + // Classify using the most recent advert's raw timestamp and + // the most recent corrected skew. No windowing or median-driven + // severity — per-advert classification per the spec. + severity, matchedEpoch := classifySkew(lastAdvTS, math.Abs(lastSkew)) - recentSkew := lastSkew - var recentVals []float64 - if n := len(tsSkews); n > 0 { - latestTS := tsSkews[n-1].ts - // Index-based window: last K samples. - startByCount := n - recentSkewWindowCount - if startByCount < 0 { - startByCount = 0 - } - // Time-based window: samples newer than latestTS - windowSec. - startByTime := n - 1 - for i := n - 1; i >= 0; i-- { - if latestTS-tsSkews[i].ts <= recentSkewWindowSec { - startByTime = i - } else { - break - } - } - // Pick the narrower (larger-index) of the two windows — the most - // current view of the node's clock health. - start := startByCount - if startByTime > start { - start = startByTime - } - recentVals = make([]float64, 0, n-start) - for i := start; i < n; i++ { - recentVals = append(recentVals, tsSkews[i].skew) - } - if len(recentVals) > 0 { - recentSkew = median(recentVals) - } - } - - // ── Bimodal detection (#845) ───────────────────────────────────────── - // Split recent samples into "good" (|skew| <= 1h, real clock) and - // "bad" (|skew| > 1h, firmware nonsense from uninitialized RTC). - // Classification order (first match wins): - // no_clock — goodFraction < 0.10 (essentially no real clock) - // bimodal_clock — 0.10 <= goodFraction < 0.80 AND badCount > 0 - // ok/warn/etc. 
— goodFraction >= 0.80 (normal, outliers filtered) - var goodSamples []float64 - for _, v := range recentVals { - if math.Abs(v) <= bimodalSkewThresholdSec { - goodSamples = append(goodSamples, v) - } - } - recentSampleCount := len(recentVals) - recentBadCount := recentSampleCount - len(goodSamples) - var goodFraction float64 - if recentSampleCount > 0 { - goodFraction = float64(len(goodSamples)) / float64(recentSampleCount) - } - - var severity SkewSeverity - if goodFraction < 0.10 { - // Essentially no real clock — classify as no_clock regardless - // of the raw skew magnitude. - severity = SkewNoClock - } else if goodFraction < 0.80 && recentBadCount > 0 { - // Bimodal: use median of GOOD samples as the "real" skew. - severity = SkewBimodalClock - if len(goodSamples) > 0 { - recentSkew = median(goodSamples) - } - } else { - // Normal path: if there are good samples, use their median - // (filters out rare outliers in ≥80% good case). - if len(goodSamples) > 0 && recentBadCount > 0 { - recentSkew = median(goodSamples) - } - severity = classifySkew(math.Abs(recentSkew)) - } - - // For no_clock / bimodal_clock nodes, skip drift when data is unreliable. + // Drift: display only, not a classifier input. var drift float64 - if severity != SkewNoClock && severity != SkewBimodalClock && len(tsSkews) >= minDriftSamples { + if severity != SkewDefault && len(tsSkews) >= minDriftSamples { drift = computeDrift(tsSkews) - // Cap physically impossible drift rates. if math.Abs(drift) > maxReasonableDriftPerDay { drift = 0 } } - // Build sparkline samples from tsSkews (already sorted by time above). + // Build sparkline samples. 
+ sort.Slice(tsSkews, func(i, j int) bool { return tsSkews[i].ts < tsSkews[j].ts }) samples := make([]SkewSample, len(tsSkews)) for i, p := range tsSkews { samples[i] = SkewSample{Timestamp: p.ts, SkewSec: round(p.skew, 1)} } - return &NodeClockSkew{ - Pubkey: pubkey, - MeanSkewSec: round(meanSkew, 1), - MedianSkewSec: round(medSkew, 1), - LastSkewSec: round(lastSkew, 1), - RecentMedianSkewSec: round(recentSkew, 1), - DriftPerDaySec: round(drift, 2), - Severity: severity, - SampleCount: totalSamples, - Calibrated: anyCal, - LastAdvertTS: lastAdvTS, - LastObservedTS: lastObsTS, - Samples: samples, - GoodFraction: round(goodFraction, 2), - RecentBadSampleCount: recentBadCount, - RecentSampleCount: recentSampleCount, + result := &NodeClockSkew{ + Pubkey: pubkey, + MeanSkewSec: round(meanSkew, 1), + MedianSkewSec: round(medSkew, 1), + LastSkewSec: round(lastSkew, 1), + DriftPerDaySec: round(drift, 2), + Severity: severity, + SampleCount: totalSamples, + Calibrated: anyCal, + LastAdvertTS: lastAdvTS, + LastObservedTS: lastObsTS, + Samples: samples, + } + if severity == SkewDefault { + ep := matchedEpoch + result.DefaultEpoch = &ep } + return result } // GetFleetClockSkew returns clock skew data for all nodes that have skew data. @@ -583,7 +488,6 @@ func (s *PacketStore) GetFleetClockSkew() []*NodeClockSkew { s.mu.RLock() defer s.mu.RUnlock() - // Build name/role lookup from DB cache (requires s.mu held). allNodes, _ := s.getCachedNodesAndPM() nameMap := make(map[string]nodeInfo, len(allNodes)) for _, ni := range allNodes { @@ -596,12 +500,10 @@ func (s *PacketStore) GetFleetClockSkew() []*NodeClockSkew { if cs == nil { continue } - // Enrich with node name/role. if ni, ok := nameMap[pubkey]; ok { cs.NodeName = ni.Name cs.NodeRole = ni.Role } - // Omit samples in fleet response (too much data). 
cs.Samples = nil results = append(results, cs) } @@ -626,7 +528,6 @@ func (s *PacketStore) GetObserverCalibrations() []ObserverCalibration { Samples: s.clockSkew.observerSamples[obsID], }) } - // Sort by absolute offset descending. sort.Slice(result, func(i, j int) bool { return math.Abs(result[i].OffsetSec) > math.Abs(result[j].OffsetSec) }) @@ -667,38 +568,20 @@ type tsSkewPair struct { } // computeDrift estimates linear drift in seconds per day from time-ordered -// (timestamp, skew) pairs. Issue #789: a single GPS-correction event (huge -// skew jump in seconds) used to dominate ordinary least squares and produce -// absurd drift like 1.7M sec/day. We now: -// -// 1. Drop pairs whose consecutive skew jump exceeds maxPlausibleSkewJumpSec -// (clock corrections, not physical drift). This protects both OLS-style -// consumers and Theil-Sen. -// 2. Use Theil-Sen regression — the slope is the median of all pairwise -// slopes, naturally robust to remaining outliers (breakdown point ~29%). -// -// For very small samples after filtering we fall back to a simple slope -// between first and last calibrated samples. +// (timestamp, skew) pairs using Theil-Sen regression with outlier filtering. func computeDrift(pairs []tsSkewPair) float64 { if len(pairs) < 2 { return 0 } - // Sort by timestamp. sort.Slice(pairs, func(i, j int) bool { return pairs[i].ts < pairs[j].ts }) - // Time span too short? Skip. spanSec := float64(pairs[len(pairs)-1].ts - pairs[0].ts) - if spanSec < 3600 { // need at least 1 hour of data + if spanSec < 3600 { return 0 } - // Outlier filter: drop samples where the skew jumps more than - // maxPlausibleSkewJumpSec from the running "stable" baseline. - // We anchor on the first sample, then accept each subsequent point - // that's within the threshold of the most recent accepted point — - // this preserves a slow drift while rejecting correction events. 
filtered := make([]tsSkewPair, 0, len(pairs)) filtered = append(filtered, pairs[0]) for i := 1; i < len(pairs); i++ { @@ -707,30 +590,23 @@ func computeDrift(pairs []tsSkewPair) float64 { filtered = append(filtered, pairs[i]) } } - // If the filter killed too much (e.g. unstable node), fall back to the - // raw series so we at least produce *something* — it'll be capped by - // maxReasonableDriftPerDay downstream. if len(filtered) < 2 || float64(filtered[len(filtered)-1].ts-filtered[0].ts) < 3600 { filtered = pairs } - // Cap point count for Theil-Sen (O(n²) on pairs). Keep most-recent. if len(filtered) > theilSenMaxPoints { filtered = filtered[len(filtered)-theilSenMaxPoints:] } - return theilSenSlope(filtered) * 86400 // sec/sec → sec/day + return theilSenSlope(filtered) * 86400 } -// theilSenSlope returns the Theil-Sen estimator: median of all pairwise -// slopes (yj - yi) / (tj - ti) for i < j. Naturally robust to outliers. -// Pairs must be sorted by timestamp ascending. +// theilSenSlope returns the Theil-Sen estimator: median of all pairwise slopes. func theilSenSlope(pairs []tsSkewPair) float64 { n := len(pairs) if n < 2 { return 0 } - // Pre-allocate: n*(n-1)/2 pairs. slopes := make([]float64, 0, n*(n-1)/2) for i := 0; i < n; i++ { for j := i + 1; j < n; j++ { From 2c675f5ab2607d9d25f47049b8eb9d7831b496da Mon Sep 17 00:00:00 2001 From: you Date: Fri, 24 Apr 2026 23:25:35 +0000 Subject: [PATCH 03/12] test: cover default-detection classifier tiers and edge cases --- cmd/server/clock_skew.go | 11 +- cmd/server/clock_skew_test.go | 819 +++++++++------------------------- 2 files changed, 212 insertions(+), 618 deletions(-) diff --git a/cmd/server/clock_skew.go b/cmd/server/clock_skew.go index 62ccdc99..ad4a5992 100644 --- a/cmd/server/clock_skew.go +++ b/cmd/server/clock_skew.go @@ -56,11 +56,18 @@ const ( // [epoch, epoch + maxPlausibleUptimeSec] for any known firmware default. // If matched, returns the matched epoch; otherwise returns 0. 
func isDefaultEpoch(advertTS int64) (bool, int64) {
+ // Find the largest epoch <= advertTS (closest match). The ranges
+ // overlap: a timestamp just past a newer epoch also falls inside an
+ // older epoch's range; prefer the newer one (fresh boots dominate).
+ bestEpoch := int64(-1)
 for _, epoch := range defaultEpochs {
- if advertTS >= epoch && advertTS <= epoch+maxPlausibleUptimeSec {
- return true, epoch
+ if epoch <= advertTS && epoch > bestEpoch {
+ bestEpoch = epoch
 }
 }
+ if bestEpoch >= 0 && advertTS <= bestEpoch+maxPlausibleUptimeSec {
+ return true, bestEpoch
+ }
 return false, 0
 }
 
diff --git a/cmd/server/clock_skew_test.go b/cmd/server/clock_skew_test.go
index 4fb79bb8..04a5a2b8 100644
--- a/cmd/server/clock_skew_test.go
+++ b/cmd/server/clock_skew_test.go
@@ -9,34 +9,125 @@ import (
 
 // ── classifySkew ───────────────────────────────────────────────────────────────
 
-func TestClassifySkew(t *testing.T) {
+func TestClassify_Default_AllKnownEpochs(t *testing.T) {
+ // Each known default epoch at +0s, +1d uptime → default.
+ for _, epoch := range defaultEpochs {
+ for _, uptimeSec := range []int64{0, 86400} {
+ advTS := epoch + uptimeSec
+ sev, _ := classifySkew(advTS, 999999) // skew irrelevant for default
+ if sev != SkewDefault {
+ t.Errorf("classifySkew(epoch=%d + %ds) = %v, want default", epoch, uptimeSec, sev)
+ }
+ }
+ }
+ // Also test at 729d for the most recent epoch (no overlap issue).
+ advTS := defaultEpochs[len(defaultEpochs)-1] + 729*86400
+ sev, matched := classifySkew(advTS, 999999)
+ if sev != SkewDefault {
+ t.Errorf("classifySkew(latest epoch + 729d) = %v, want default", sev)
+ }
+ if matched != defaultEpochs[len(defaultEpochs)-1] {
+ t.Errorf("matched = %d, want %d", matched, defaultEpochs[len(defaultEpochs)-1])
+ }
+}
+
+func TestClassify_Default_BeyondUptimeCap(t *testing.T) {
+ // 731 days past the LATEST epoch → NOT default.
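The closest-epoch disambiguation is easiest to see side by side with the old first-match scan. A minimal standalone sketch, assuming an ascending two-entry epoch list and a 730-day uptime cap; `1684000000` is a hypothetical older build constant, not a real firmware epoch:

```go
package main

import "fmt"

const maxPlausibleUptimeSec = 730 * 86400 // assumed cap, per the boundary tests

// Hypothetical ascending epoch list: a made-up older build constant and
// the real 2024-05-15 default. Their plausible-uptime ranges overlap.
var epochs = []int64{1684000000, 1715770351}

// firstMatch mimics the old loop: first epoch whose range contains advertTS.
func firstMatch(advertTS int64) int64 {
	for _, e := range epochs {
		if advertTS >= e && advertTS <= e+maxPlausibleUptimeSec {
			return e
		}
	}
	return -1
}

// closestMatch mimics the fix: largest epoch <= advertTS, then range-check.
func closestMatch(advertTS int64) int64 {
	best := int64(-1)
	for _, e := range epochs {
		if e <= advertTS && e > best {
			best = e
		}
	}
	if best >= 0 && advertTS <= best+maxPlausibleUptimeSec {
		return best
	}
	return -1
}

func main() {
	freshBoot := int64(1715770351 + 3600) // newer firmware, 1h of uptime
	fmt.Println(firstMatch(freshBoot))    // 1684000000 (older epoch wins)
	fmt.Println(closestMatch(freshBoot))  // 1715770351 (correct attribution)
}
```

A fresh boot is far more likely than a year-plus of continuous uptime, so the closest epoch is the better prior; the trade-off is that a long-running older-firmware node whose default clock has ticked past the newer epoch gets attributed to the newer one.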
+ latestEpoch := defaultEpochs[len(defaultEpochs)-1]
+ advTS := latestEpoch + 731*86400
+ sev, _ := classifySkew(advTS, 5)
+ if sev == SkewDefault {
+ t.Errorf("classifySkew(latestEpoch + 731d) = default, should fall through")
+ }
+}
+
+func TestClassify_OK(t *testing.T) {
+ // Use a timestamp outside all default-epoch ranges.
+ advTS := int64(1900000000) // 2030-03 — well past any default+730d
 tests := []struct {
- absSkew float64
- expected SkewSeverity
+ skew float64
+ want SkewSeverity
 }{
 {0, SkewOK},
- {60, SkewOK}, // 1 min
- {299, SkewOK}, // just under 5 min
- {300, SkewWarning}, // exactly 5 min
- {1800, SkewWarning}, // 30 min
- {3599, SkewWarning}, // just under 1 hour
- {3600, SkewCritical}, // exactly 1 hour
- {86400, SkewCritical}, // 1 day
- {2592000 - 1, SkewCritical}, // just under 30 days
- {2592000, SkewAbsurd}, // exactly 30 days
- {86400 * 365 - 1, SkewAbsurd}, // just under 365 days
- {86400 * 365, SkewNoClock}, // exactly 365 days
- {86400 * 365 * 10, SkewNoClock}, // 10 years (epoch-0 style)
+ {14.9, SkewOK},
+ {15, SkewOK}, // boundary: the OK cap is inclusive
 }
 for _, tc := range tests {
- got := classifySkew(tc.absSkew)
- if got != tc.expected {
- t.Errorf("classifySkew(%v) = %v, want %v", tc.absSkew, got, tc.expected)
+ sev, _ := classifySkew(advTS, tc.skew)
+ if sev != tc.want {
+ t.Errorf("classifySkew(advTS, %v) = %v, want %v", tc.skew, sev, tc.want)
 }
 }
 }
 
-// ── median ─────────────────────────────────────────────────────────────────────
+func TestClassify_Degrading(t *testing.T) {
+ advTS := int64(1900000000)
+ tests := []float64{16, 30, 60}
+ for _, skew := range tests {
+ sev, _ := classifySkew(advTS, skew)
+ if sev != SkewDegrading {
+ t.Errorf("classifySkew(advTS, %v) = %v, want degrading", skew, sev)
+ }
+ }
+}
+
+func TestClassify_Degraded(t *testing.T) {
+ advTS := int64(1900000000)
+ tests := []float64{61, 300, 600}
+ for _, skew := range tests {
+ sev, _ := classifySkew(advTS, skew)
+ if sev != SkewDegraded {
+ t.Errorf("classifySkew(advTS, %v) = %v, want degraded", skew, sev)
+ }
+ }
+}
+
+func TestClassify_Wrong(t *testing.T) {
+ advTS := int64(1900000000)
+ tests := []float64{601, 3600, 86400 * 365}
+ for _, skew := range tests {
+ sev, _ := classifySkew(advTS, skew)
+ if sev != SkewWrong {
+ t.Errorf("classifySkew(advTS, %v) = %v, want wrong", skew, sev)
+ }
+ }
+}
+
+func TestClassify_FutureWrong(t *testing.T) {
+ // Advert_ts in 2030 + 700s offset — not a default epoch.
+ advTS := int64(1900000700)
+ sev, _ := classifySkew(advTS, 700)
+ if sev != SkewWrong {
+ t.Errorf("classifySkew(future, 700) = %v, want wrong", sev)
+ }
+}
+
+// ── isDefaultEpoch ─────────────────────────────────────────────────────────────
+
+func TestIsDefaultEpoch_Boundaries(t *testing.T) {
+ // Exactly at epoch → true.
+ ok, ep := isDefaultEpoch(1715770351)
+ if !ok || ep != 1715770351 {
+ t.Errorf("isDefaultEpoch(1715770351) = %v, %d", ok, ep)
+ }
+ // At epoch + maxPlausibleUptimeSec → true.
+ ok, ep = isDefaultEpoch(1715770351 + maxPlausibleUptimeSec)
+ if !ok {
+ t.Error("expected true at epoch + maxPlausibleUptimeSec")
+ }
+ // Just past → false.
+ ok, _ = isDefaultEpoch(1715770351 + maxPlausibleUptimeSec + 1)
+ if ok {
+ t.Error("expected false past max uptime")
+ }
+ // Epoch 0 → true.
+ ok, ep = isDefaultEpoch(0) + if !ok || ep != 0 { + t.Errorf("isDefaultEpoch(0) = %v, %d", ok, ep) + } +} + +// ── median / mean ────────────────────────────────────────────────────────────── func TestMedian(t *testing.T) { tests := []struct { @@ -99,7 +190,6 @@ func TestParseISO(t *testing.T) { // ── extractTimestamp ──────────────────────────────────────────────────────────── func TestExtractTimestamp(t *testing.T) { - // Nested payload.timestamp decoded := map[string]interface{}{ "payload": map[string]interface{}{ "timestamp": float64(1776340800), @@ -110,7 +200,6 @@ func TestExtractTimestamp(t *testing.T) { t.Errorf("extractTimestamp (nested) = %v, want 1776340800", got) } - // Top-level timestamp decoded2 := map[string]interface{}{ "timestamp": float64(1776340900), } @@ -119,7 +208,6 @@ func TestExtractTimestamp(t *testing.T) { t.Errorf("extractTimestamp (top-level) = %v, want 1776340900", got2) } - // No timestamp decoded3 := map[string]interface{}{"foo": "bar"} got3 := extractTimestamp(decoded3) if got3 != 0 { @@ -130,7 +218,6 @@ func TestExtractTimestamp(t *testing.T) { // ── calibrateObservers ───────────────────────────────────────────────────────── func TestCalibrateObservers_SingleObserver(t *testing.T) { - // Single-observer packets can't calibrate — should return empty. samples := []skewSample{ {advertTS: 1000, observedTS: 1000, observerID: "obs1", hash: "h1"}, {advertTS: 2000, observedTS: 2000, observerID: "obs1", hash: "h2"}, @@ -142,10 +229,6 @@ func TestCalibrateObservers_SingleObserver(t *testing.T) { } func TestCalibrateObservers_MultiObserver(t *testing.T) { - // Packet h1 seen by 3 observers: obs1 at t=100, obs2 at t=110, obs3 at t=100. - // Median observation = 100. obs1=0, obs2=+10, obs3=0 - // Packet h2 seen by 3 observers: obs1 at t=200, obs2 at t=210, obs3 at t=200. - // Median observation = 200. 
obs1=0, obs2=+10, obs3=0 samples := []skewSample{ {advertTS: 100, observedTS: 100, observerID: "obs1", hash: "h1"}, {advertTS: 100, observedTS: 110, observerID: "obs2", hash: "h1"}, @@ -169,75 +252,41 @@ func TestCalibrateObservers_MultiObserver(t *testing.T) { // ── computeNodeSkew ──────────────────────────────────────────────────────────── func TestComputeNodeSkew_BasicCorrection(t *testing.T) { - // Validates observer offset correction direction. - // - // Setup: node is 60s ahead, obs1 accurate, obs2 is 10s ahead. - // With 2 observers, median obs_ts = 1005. - // obs1 offset = 1000 - 1005 = -5 - // obs2 offset = 1010 - 1005 = +5 - // Correction: corrected = raw_skew + obsOffset - // obs1: raw=60, corrected = 60 + (-5) = 55 - // obs2: raw=50, corrected = 50 + 5 = 55 - // Both converge to 55 (not exact 60 because with only 2 observers, - // the median can't fully distinguish which observer is drifted). - samples := []skewSample{ - // Same packet seen by accurate obs1 and obs2 (+10s ahead) {advertTS: 1060, observedTS: 1000, observerID: "obs1", hash: "h1"}, {advertTS: 1060, observedTS: 1010, observerID: "obs2", hash: "h1"}, } offsets, _ := calibrateObservers(samples) - // median obs = 1005, obs1 offset = -5, obs2 offset = +5 - // So the median approach finds obs2 is +5 ahead (relative to median) - - // Now compute node skew with those offsets: nodeSkew := computeNodeSkew(samples, offsets) cs, ok := nodeSkew["h1"] if !ok { t.Fatal("expected skew data for hash h1") } - // With only 2 observers, median obs_ts = 1005. - // obs1 offset = 1000-1005 = -5, obs2 offset = 1010-1005 = +5 - // raw from obs1 = 60, corrected = 60 + (-5) = 55 - // raw from obs2 = 50, corrected = 50 + 5 = 55 - // median = 55 if cs.MedianSkewSec != 55 { t.Errorf("median skew = %v, want 55", cs.MedianSkewSec) } } func TestComputeNodeSkew_ThreeObservers(t *testing.T) { - // Node is exactly 60s ahead. obs1 accurate, obs2 accurate, obs3 +30s ahead. 
- // advertTS = 1060, real time = 1000 samples := []skewSample{ {advertTS: 1060, observedTS: 1000, observerID: "obs1", hash: "h1"}, {advertTS: 1060, observedTS: 1000, observerID: "obs2", hash: "h1"}, {advertTS: 1060, observedTS: 1030, observerID: "obs3", hash: "h1"}, } offsets, _ := calibrateObservers(samples) - // median obs_ts = 1000. obs1=0, obs2=0, obs3=+30 - if offsets["obs3"] != 30 { - t.Errorf("obs3 offset = %v, want 30", offsets["obs3"]) - } - nodeSkew := computeNodeSkew(samples, offsets) cs := nodeSkew["h1"] if cs == nil { t.Fatal("expected skew data for h1") } - // raw from obs1 = 60, corrected = 60 + 0 = 60 - // raw from obs2 = 60, corrected = 60 + 0 = 60 - // raw from obs3 = 30, corrected = 30 + 30 = 60 - // All three converge to 60. if cs.MedianSkewSec != 60 { - t.Errorf("median skew = %v, want 60 (node is 60s ahead)", cs.MedianSkewSec) + t.Errorf("median skew = %v, want 60", cs.MedianSkewSec) } } // ── computeDrift ─────────────────────────────────────────────────────────────── func TestComputeDrift_Stable(t *testing.T) { - // Constant skew = no drift. pairs := []tsSkewPair{ {ts: 0, skew: 60}, {ts: 7200, skew: 60}, @@ -245,21 +294,19 @@ func TestComputeDrift_Stable(t *testing.T) { } drift := computeDrift(pairs) if drift != 0 { - t.Errorf("drift = %v, want 0 for stable skew", drift) + t.Errorf("drift = %v, want 0", drift) } } func TestComputeDrift_LinearDrift(t *testing.T) { - // 1 second drift per hour = 24 sec/day. pairs := []tsSkewPair{ {ts: 0, skew: 0}, {ts: 3600, skew: 1}, {ts: 7200, skew: 2}, } drift := computeDrift(pairs) - expected := 24.0 - if math.Abs(drift-expected) > 0.1 { - t.Errorf("drift = %v, want ~%v", drift, expected) + if math.Abs(drift-24.0) > 0.1 { + t.Errorf("drift = %v, want ~24", drift) } } @@ -271,7 +318,6 @@ func TestComputeDrift_TooFewSamples(t *testing.T) { } func TestComputeDrift_TooShortSpan(t *testing.T) { - // Less than 1 hour apart. 
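The drift tests below exercise the Theil-Sen behavior described in the `computeDrift` doc comment. As a standalone sketch of the estimator itself (median of all pairwise slopes, scaled from sec/sec to sec/day), assuming the `tsSkewPair` shape from the patch:

```go
package main

import (
	"fmt"
	"sort"
)

type tsSkewPair struct {
	ts   int64   // unix seconds
	skew float64 // corrected skew at that time, seconds
}

func median(v []float64) float64 {
	sort.Float64s(v)
	n := len(v)
	if n%2 == 1 {
		return v[n/2]
	}
	return (v[n/2-1] + v[n/2]) / 2
}

// theilSenSlope: median of all pairwise slopes (skew_j-skew_i)/(ts_j-ts_i),
// pairs sorted by timestamp ascending. Robust to a minority of outliers.
func theilSenSlope(pairs []tsSkewPair) float64 {
	n := len(pairs)
	if n < 2 {
		return 0
	}
	slopes := make([]float64, 0, n*(n-1)/2)
	for i := 0; i < n; i++ {
		for j := i + 1; j < n; j++ {
			if dt := float64(pairs[j].ts - pairs[i].ts); dt != 0 {
				slopes = append(slopes, (pairs[j].skew-pairs[i].skew)/dt)
			}
		}
	}
	return median(slopes)
}

func main() {
	// 1 s of drift per hour (24 s/day), plus one wild correction-style
	// outlier that the median of pairwise slopes ignores.
	pairs := []tsSkewPair{
		{0, 0}, {3600, 1}, {7200, 2}, {10800, 3}, {14400, 1000},
	}
	fmt.Printf("%.1f sec/day\n", theilSenSlope(pairs)*86400) // 24.0 sec/day
}
```

With five points, six of the ten pairwise slopes come from the clean samples, so the median sits on the true slope even though the last point is absurd; this is the ~29% breakdown point the original comment referred to.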
pairs := []tsSkewPair{ {ts: 0, skew: 0}, {ts: 1800, skew: 10}, @@ -281,6 +327,34 @@ func TestComputeDrift_TooShortSpan(t *testing.T) { } } +func TestDriftRejectsCorrectionJump(t *testing.T) { + pairs := []tsSkewPair{} + for i := 0; i < 12; i++ { + ts := int64(i) * 300 + skew := float64(i) * (1.0 / 24.0) + pairs = append(pairs, tsSkewPair{ts: ts, skew: skew}) + } + pairs = append(pairs, tsSkewPair{ts: 3600 + 12*300, skew: 1000}) + drift := computeDrift(pairs) + if math.Abs(drift) > 100 { + t.Errorf("drift = %v, expected small", drift) + } +} + +func TestTheilSenMatchesOLSWhenClean(t *testing.T) { + pairs := []tsSkewPair{} + for i := 0; i < 20; i++ { + pairs = append(pairs, tsSkewPair{ + ts: int64(i) * 600, + skew: float64(i) * (600.0 / 3600.0), + }) + } + drift := computeDrift(pairs) + if math.Abs(drift-24.0) > 0.25 { + t.Errorf("drift = %v, want ~24", drift) + } +} + // ── jsonNumber ───────────────────────────────────────────────────────────────── func TestJsonNumber(t *testing.T) { @@ -309,35 +383,39 @@ func TestJsonNumber(t *testing.T) { // ── Integration: GetNodeClockSkew via PacketStore ────────────────────────────── +// formatInt64 is a test helper to format int64 as string for JSON embedding. +func formatInt64(n int64) string { + return fmt.Sprintf("%d", n) +} + func TestGetNodeClockSkew_Integration(t *testing.T) { ps := NewPacketStore(nil, nil) - // Simulate two ADVERT transmissions for the same node, seen by 2 observers each. - // Node "AABB" has clock 120s ahead. pt := 4 // ADVERT + // Use a base time outside all default-epoch ranges. 
+ base := int64(1900000000) // 2030 tx1 := &StoreTx{ Hash: "hash1", PayloadType: &pt, - DecodedJSON: `{"payload":{"timestamp":1700002320}}`, // obs=1700002200, node ahead by 120s + DecodedJSON: `{"payload":{"timestamp":` + formatInt64(base+10) + `}}`, Observations: []*StoreObs{ - {ObserverID: "obs1", Timestamp: "2023-11-14T22:50:00Z"}, // 1700002200 - {ObserverID: "obs2", Timestamp: "2023-11-14T22:50:00Z"}, // 1700002200 + {ObserverID: "obs1", Timestamp: time.Unix(base, 0).UTC().Format(time.RFC3339)}, + {ObserverID: "obs2", Timestamp: time.Unix(base, 0).UTC().Format(time.RFC3339)}, }, } tx2 := &StoreTx{ Hash: "hash2", PayloadType: &pt, - DecodedJSON: `{"payload":{"timestamp":1700005920}}`, // obs=1700005800, node ahead by 120s + DecodedJSON: `{"payload":{"timestamp":` + formatInt64(base+3610) + `}}`, Observations: []*StoreObs{ - {ObserverID: "obs1", Timestamp: "2023-11-14T23:50:00Z"}, // 1700005800 - {ObserverID: "obs2", Timestamp: "2023-11-14T23:50:00Z"}, // 1700005800 + {ObserverID: "obs1", Timestamp: time.Unix(base+3600, 0).UTC().Format(time.RFC3339)}, + {ObserverID: "obs2", Timestamp: time.Unix(base+3600, 0).UTC().Format(time.RFC3339)}, }, } ps.mu.Lock() ps.byNode["AABB"] = []*StoreTx{tx1, tx2} ps.byPayloadType[4] = []*StoreTx{tx1, tx2} - // Force recompute by setting interval to 0. ps.clockSkew.computeInterval = 0 ps.mu.Unlock() @@ -348,19 +426,9 @@ func TestGetNodeClockSkew_Integration(t *testing.T) { if result.Pubkey != "AABB" { t.Errorf("pubkey = %q, want AABB", result.Pubkey) } - // Both transmissions show 120s skew, so median should be 120. - if result.MedianSkewSec != 120 { - t.Errorf("median skew = %v, want 120", result.MedianSkewSec) - } - if result.SampleCount < 2 { - t.Errorf("sample count = %v, want >= 2", result.SampleCount) - } + // Both transmissions show ~10s skew → ok. if result.Severity != SkewOK { - t.Errorf("severity = %v, want ok (120s < 5min)", result.Severity) - } - // Drift should be ~0 since skew is constant. 
- if math.Abs(result.DriftPerDaySec) > 1 { - t.Errorf("drift = %v, want ~0 for constant skew", result.DriftPerDaySec) + t.Errorf("severity = %v, want ok", result.Severity) } } @@ -372,585 +440,104 @@ func TestGetNodeClockSkew_NoData(t *testing.T) { } } -// ── Sanity check tests (#XXX — clock skew crazy stats) ──────────────────────── - -func TestGetNodeClockSkew_NoClock_EpochZero(t *testing.T) { - // Node with epoch-0 timestamp produces huge skew → no_clock severity, drift=0. - ps := NewPacketStore(nil, nil) - pt := 4 // ADVERT - - // Epoch-ish advert: advertTS near start of 2020, observed in 2023 → |skew| > 365 days - var txs []*StoreTx - baseObs := int64(1700000000) // ~Nov 2023 - for i := 0; i < 6; i++ { - obsTS := baseObs + int64(i)*7200 - tx := &StoreTx{ - Hash: "epoch-h" + string(rune('0'+i)), - PayloadType: &pt, - DecodedJSON: `{"payload":{"timestamp":1577836800}}`, // Jan 1 2020 — valid but way off - Observations: []*StoreObs{ - {ObserverID: "obs1", Timestamp: time.Unix(obsTS, 0).UTC().Format(time.RFC3339)}, - }, - } - txs = append(txs, tx) - } - - ps.mu.Lock() - ps.byNode["EPOCH"] = txs - for _, tx := range txs { - ps.byPayloadType[4] = append(ps.byPayloadType[4], tx) - } - ps.clockSkew.computeInterval = 0 - ps.mu.Unlock() - - result := ps.GetNodeClockSkew("EPOCH") - if result == nil { - t.Fatal("expected clock skew result for epoch-0 node") - } - if result.Severity != SkewNoClock { - t.Errorf("severity = %v, want no_clock", result.Severity) - } - if result.DriftPerDaySec != 0 { - t.Errorf("drift = %v, want 0 for no_clock node", result.DriftPerDaySec) - } -} - -func TestGetNodeClockSkew_TooFewSamplesForDrift(t *testing.T) { - // Node with only 2 advert samples → drift should not be computed. 
- ps := NewPacketStore(nil, nil) - pt := 4 - - baseObs := int64(1700000000) - var txs []*StoreTx - for i := 0; i < 2; i++ { - obsTS := baseObs + int64(i)*7200 - advTS := obsTS + 120 // 120s ahead - tx := &StoreTx{ - Hash: "few-h" + string(rune('0'+i)), - PayloadType: &pt, - DecodedJSON: `{"payload":{"timestamp":` + formatInt64(advTS) + `}}`, - Observations: []*StoreObs{ - {ObserverID: "obs1", Timestamp: time.Unix(obsTS, 0).UTC().Format(time.RFC3339)}, - }, - } - txs = append(txs, tx) - } - - ps.mu.Lock() - ps.byNode["FEWSAMP"] = txs - for _, tx := range txs { - ps.byPayloadType[4] = append(ps.byPayloadType[4], tx) - } - ps.clockSkew.computeInterval = 0 - ps.mu.Unlock() - - result := ps.GetNodeClockSkew("FEWSAMP") - if result == nil { - t.Fatal("expected clock skew result") - } - if result.DriftPerDaySec != 0 { - t.Errorf("drift = %v, want 0 for 2-sample node (minimum is %d)", result.DriftPerDaySec, minDriftSamples) - } -} - -func TestGetNodeClockSkew_AbsurdDriftCapped(t *testing.T) { - // Node with wildly varying skew producing |drift| > 86400 s/day → drift capped to 0. - ps := NewPacketStore(nil, nil) - pt := 4 - - // Create 6 samples with extreme skew variation to produce absurd drift. 
- baseObs := int64(1700000000) - var txs []*StoreTx - for i := 0; i < 6; i++ { - obsTS := baseObs + int64(i)*3600 - // Alternate between huge positive and negative skew offsets - skewOffset := int64(50000 * (1 - 2*(i%2))) // +50000 or -50000 - advTS := obsTS + skewOffset - tx := &StoreTx{ - Hash: "wild-h" + string(rune('0'+i)), - PayloadType: &pt, - DecodedJSON: `{"payload":{"timestamp":` + formatInt64(advTS) + `}}`, - Observations: []*StoreObs{ - {ObserverID: "obs1", Timestamp: time.Unix(obsTS, 0).UTC().Format(time.RFC3339)}, - }, - } - txs = append(txs, tx) - } - - ps.mu.Lock() - ps.byNode["WILD"] = txs - for _, tx := range txs { - ps.byPayloadType[4] = append(ps.byPayloadType[4], tx) - } - ps.clockSkew.computeInterval = 0 - ps.mu.Unlock() - - result := ps.GetNodeClockSkew("WILD") - if result == nil { - t.Fatal("expected clock skew result") - } - if math.Abs(result.DriftPerDaySec) > maxReasonableDriftPerDay { - t.Errorf("drift = %v, should be capped (|drift| > %v)", result.DriftPerDaySec, maxReasonableDriftPerDay) - } -} - -func TestGetNodeClockSkew_NormalNodeWithDrift(t *testing.T) { - // Normal node with 6 samples and consistent linear drift → drift computed correctly. 
- ps := NewPacketStore(nil, nil) - pt := 4 - - baseObs := int64(1700000000) - var txs []*StoreTx - for i := 0; i < 6; i++ { - obsTS := baseObs + int64(i)*7200 // every 2 hours - // Drift: 1 sec/hour = 24 sec/day - advTS := obsTS + 120 + int64(i) // skew grows by 1s per sample (2h apart) - tx := &StoreTx{ - Hash: "norm-h" + string(rune('0'+i)), - PayloadType: &pt, - DecodedJSON: `{"payload":{"timestamp":` + formatInt64(advTS) + `}}`, - Observations: []*StoreObs{ - {ObserverID: "obs1", Timestamp: time.Unix(obsTS, 0).UTC().Format(time.RFC3339)}, - }, - } - txs = append(txs, tx) - } - - ps.mu.Lock() - ps.byNode["NORMAL"] = txs - for _, tx := range txs { - ps.byPayloadType[4] = append(ps.byPayloadType[4], tx) - } - ps.clockSkew.computeInterval = 0 - ps.mu.Unlock() - - result := ps.GetNodeClockSkew("NORMAL") - if result == nil { - t.Fatal("expected clock skew result") - } - if result.Severity != SkewOK { - t.Errorf("severity = %v, want ok", result.Severity) - } - // 1s per 7200s = 12 s/day - if result.DriftPerDaySec == 0 { - t.Error("expected non-zero drift for linearly drifting node") - } - if math.Abs(result.DriftPerDaySec) > maxReasonableDriftPerDay { - t.Errorf("drift = %v, should be reasonable", result.DriftPerDaySec) - } -} - -// formatInt64 is a test helper to format int64 as string for JSON embedding. -func formatInt64(n int64) string { - return fmt.Sprintf("%d", n) -} - -// ── #789: Recent-window severity & robust drift ─────────────────────────────── - -// TestSeverityUsesRecentNotMedian: 100 historical bad samples (skew=-60s, -// each ~5min apart) followed by 5 fresh good samples (skew=-1s). All-time -// median is still huge-ish but recent-window severity must reflect the -// current healthy state. 
-func TestSeverityUsesRecentNotMedian(t *testing.T) { - ps := NewPacketStore(nil, nil) - pt := 4 - - baseObs := int64(1700000000) - var txs []*StoreTx - for i := 0; i < 105; i++ { - obsTS := baseObs + int64(i)*300 // 5 min apart - var skew int64 = -60 - if i >= 100 { - skew = -1 // good samples at the tail - } - advTS := obsTS + skew - tx := &StoreTx{ - Hash: fmt.Sprintf("recent-h%03d", i), - PayloadType: &pt, - DecodedJSON: `{"payload":{"timestamp":` + formatInt64(advTS) + `}}`, - Observations: []*StoreObs{ - {ObserverID: "obs1", Timestamp: time.Unix(obsTS, 0).UTC().Format(time.RFC3339)}, - }, - } - txs = append(txs, tx) - } - ps.mu.Lock() - ps.byNode["RECENT"] = txs - for _, tx := range txs { - ps.byPayloadType[4] = append(ps.byPayloadType[4], tx) - } - ps.clockSkew.computeInterval = 0 - ps.mu.Unlock() - - r := ps.GetNodeClockSkew("RECENT") - if r == nil { - t.Fatal("nil result") - } - if r.Severity != SkewOK { - t.Errorf("severity = %v, want ok (recent samples are healthy)", r.Severity) - } - if math.Abs(r.RecentMedianSkewSec) > 5 { - t.Errorf("recentMedianSkewSec = %v, want ~-1", r.RecentMedianSkewSec) - } - // Historical median should still be retained for context. - if math.Abs(r.MedianSkewSec) < 30 { - t.Errorf("medianSkewSec = %v, expected historical median to remain large", r.MedianSkewSec) - } -} - -// TestDriftRejectsCorrectionJump: 30 minutes of clean linear drift, then a -// single 60-second skew jump. The pre-jump slope should win — drift must -// not be catastrophically inflated by the correction event. -func TestDriftRejectsCorrectionJump(t *testing.T) { - pairs := []tsSkewPair{} - // 30 min of stable, ~12 sec/day drift: 1s per 7200s. - for i := 0; i < 12; i++ { - ts := int64(i) * 300 - skew := float64(i) * (1.0 / 24.0) // ~0.04s per 5min step → 12 s/day - pairs = append(pairs, tsSkewPair{ts: ts, skew: skew}) - } - // Wait an hour, then a single 1000-sec correction jump (clearly outlier). 
- pairs = append(pairs, tsSkewPair{ts: 3600 + 12*300, skew: 1000}) - - drift := computeDrift(pairs) - // Without rejection this would be ~ (1000-0)/(end-0) * 86400 = enormous. - if math.Abs(drift) > 100 { - t.Errorf("drift = %v, expected small (~12 s/day), correction jump should be filtered", drift) - } -} - -// TestTheilSenMatchesOLSWhenClean: on clean linear data Theil-Sen should -// produce essentially the OLS answer. -func TestTheilSenMatchesOLSWhenClean(t *testing.T) { - // 1 sec drift per hour = 24 sec/day, 20 evenly-spaced samples. - pairs := []tsSkewPair{} - for i := 0; i < 20; i++ { - pairs = append(pairs, tsSkewPair{ - ts: int64(i) * 600, - skew: float64(i) * (600.0 / 3600.0), - }) - } - drift := computeDrift(pairs) - if math.Abs(drift-24.0) > 0.25 { // ~1% - t.Errorf("drift = %v, want ~24", drift) - } -} - -// TestReporterScenario_789: reproduce the exact scenario from issue #789. -// Reporter saw mean=-52565156, median=-59063561, last=-0.8, sample count -// 1662, drift +1793549.9 s/day, severity=absurd. After the fix, severity -// must be ok (recent samples are healthy) and drift must be sane. -func TestReporterScenario_789(t *testing.T) { +func TestGetNodeClockSkew_DefaultEpochNode(t *testing.T) { + // Node with advert_ts at the current firmware default epoch → default severity. ps := NewPacketStore(nil, nil) pt := 4 - baseObs := int64(1700000000) - var txs []*StoreTx - // 1657 samples with the bad ~-683-day skew (the historical poison), - // then 5 freshly corrected samples at -0.8s — totals 1662. 
- for i := 0; i < 1662; i++ { - obsTS := baseObs + int64(i)*60 // 1 min apart - var skew int64 - if i < 1657 { - skew = -59063561 // ~ -683 days - } else { - skew = -1 // corrected (rounded; reporter saw -0.8) - } - advTS := obsTS + skew - tx := &StoreTx{ - Hash: fmt.Sprintf("rep-%04d", i), - PayloadType: &pt, - DecodedJSON: `{"payload":{"timestamp":` + formatInt64(advTS) + `}}`, - Observations: []*StoreObs{ - {ObserverID: "obs1", Timestamp: time.Unix(obsTS, 0).UTC().Format(time.RFC3339)}, - }, - } - txs = append(txs, tx) - } - ps.mu.Lock() - ps.byNode["REPNODE"] = txs - for _, tx := range txs { - ps.byPayloadType[4] = append(ps.byPayloadType[4], tx) - } - ps.clockSkew.computeInterval = 0 - ps.mu.Unlock() - - r := ps.GetNodeClockSkew("REPNODE") - if r == nil { - t.Fatal("nil result") - } - // Severity must reflect current health, not the all-time median. - if r.Severity != SkewOK && r.Severity != SkewWarning { - t.Errorf("severity = %v, want ok/warning (recent samples are healthy)", r.Severity) - } - if math.Abs(r.RecentMedianSkewSec) > 5 { - t.Errorf("recentMedianSkewSec = %v, want near 0", r.RecentMedianSkewSec) - } - // Drift must not be absurd. The historical jump is one event between - // the 1657th and 1658th sample; outlier rejection must contain it. - if math.Abs(r.DriftPerDaySec) > maxReasonableDriftPerDay { - t.Errorf("drift = %v, must be <= cap %v", r.DriftPerDaySec, maxReasonableDriftPerDay) - } - // And it should be close to zero (stable historical + stable corrected). - if math.Abs(r.DriftPerDaySec) > 1000 { - t.Errorf("drift = %v, expected near zero after outlier rejection", r.DriftPerDaySec) - } - // Historical median is preserved as context. 
- if math.Abs(r.MedianSkewSec) < 1e6 { - t.Errorf("medianSkewSec = %v, expected historical poison preserved as context", r.MedianSkewSec) + now := time.Now().Unix() + advTS := int64(1715770351 + 86400) // default + 1 day uptime + tx := &StoreTx{ + Hash: "default-h1", + PayloadType: &pt, + DecodedJSON: `{"payload":{"timestamp":` + formatInt64(advTS) + `}}`, + Observations: []*StoreObs{ + {ObserverID: "obs1", Timestamp: time.Unix(now, 0).UTC().Format(time.RFC3339)}, + }, } -} -// TestBimodalClock_845: 60% good samples → bimodal_clock severity. -func TestBimodalClock_845(t *testing.T) { - ps := NewPacketStore(nil, nil) - pt := 4 - - baseObs := int64(1700000000) - var txs []*StoreTx - // 6 good samples (-5s each), 4 bad samples (-50000000s each) = 60% good - // Interleave so the recent window (last 5) captures both good and bad. - skews := []int64{-5, -5, -50000000, -5, -50000000, -5, -50000000, -5, -50000000, -5} - for i := 0; i < 10; i++ { - obsTS := baseObs + int64(i)*60 - advTS := obsTS + skews[i] - tx := &StoreTx{ - Hash: fmt.Sprintf("bimodal-%04d", i), - PayloadType: &pt, - DecodedJSON: `{"payload":{"timestamp":` + formatInt64(advTS) + `}}`, - Observations: []*StoreObs{ - {ObserverID: "obs1", Timestamp: time.Unix(obsTS, 0).UTC().Format(time.RFC3339)}, - }, - } - txs = append(txs, tx) - } ps.mu.Lock() - ps.byNode["BIMODAL"] = txs - for _, tx := range txs { - ps.byPayloadType[4] = append(ps.byPayloadType[4], tx) - } + ps.byNode["DEFNODE"] = []*StoreTx{tx} + ps.byPayloadType[4] = []*StoreTx{tx} ps.clockSkew.computeInterval = 0 ps.mu.Unlock() - r := ps.GetNodeClockSkew("BIMODAL") + r := ps.GetNodeClockSkew("DEFNODE") if r == nil { - t.Fatal("nil result") + t.Fatal("expected result") } - if r.Severity != SkewBimodalClock { - t.Errorf("severity = %v, want bimodal_clock", r.Severity) + if r.Severity != SkewDefault { + t.Errorf("severity = %v, want default", r.Severity) } - if math.Abs(r.RecentMedianSkewSec-(-5)) > 1 { - t.Errorf("recentMedianSkewSec = %v, want ≈ -5 
(median of good samples)", r.RecentMedianSkewSec) - } - if r.GoodFraction < 0.5 || r.GoodFraction > 0.7 { - t.Errorf("goodFraction = %v, want ~0.6", r.GoodFraction) - } - if r.RecentBadSampleCount < 1 { - t.Errorf("recentBadSampleCount = %v, want > 0", r.RecentBadSampleCount) + if r.DefaultEpoch == nil || *r.DefaultEpoch != 1715770351 { + t.Errorf("defaultEpoch = %v, want 1715770351", r.DefaultEpoch) } } -// TestAllBad_NoClock_845: all samples bad → no_clock. -func TestAllBad_NoClock_845(t *testing.T) { +func TestGetNodeClockSkew_WrongNode(t *testing.T) { + // Node with advert_ts far from any default and large skew → wrong. ps := NewPacketStore(nil, nil) pt := 4 - baseObs := int64(1700000000) - var txs []*StoreTx - for i := 0; i < 10; i++ { - obsTS := baseObs + int64(i)*60 - advTS := obsTS - 50000000 - tx := &StoreTx{ - Hash: fmt.Sprintf("allbad-%04d", i), - PayloadType: &pt, - DecodedJSON: `{"payload":{"timestamp":` + formatInt64(advTS) + `}}`, - Observations: []*StoreObs{ - {ObserverID: "obs1", Timestamp: time.Unix(obsTS, 0).UTC().Format(time.RFC3339)}, - }, - } - txs = append(txs, tx) - } - ps.mu.Lock() - ps.byNode["ALLBAD"] = txs - for _, tx := range txs { - ps.byPayloadType[4] = append(ps.byPayloadType[4], tx) - } - ps.clockSkew.computeInterval = 0 - ps.mu.Unlock() - - r := ps.GetNodeClockSkew("ALLBAD") - if r == nil { - t.Fatal("nil result") - } - if r.Severity != SkewNoClock { - t.Errorf("severity = %v, want no_clock", r.Severity) - } -} - -// TestMostlyGood_OK_845: 90% good 10% bad → ok (outlier filtered). 
-func TestMostlyGood_OK_845(t *testing.T) { - ps := NewPacketStore(nil, nil) - pt := 4 - - baseObs := int64(1700000000) - var txs []*StoreTx - // 9 good at -5s, 1 bad at -50000000s - for i := 0; i < 10; i++ { - obsTS := baseObs + int64(i)*60 - var skew int64 - if i < 9 { - skew = -5 - } else { - skew = -50000000 - } - advTS := obsTS + skew - tx := &StoreTx{ - Hash: fmt.Sprintf("mostly-%04d", i), - PayloadType: &pt, - DecodedJSON: `{"payload":{"timestamp":` + formatInt64(advTS) + `}}`, - Observations: []*StoreObs{ - {ObserverID: "obs1", Timestamp: time.Unix(obsTS, 0).UTC().Format(time.RFC3339)}, - }, - } - txs = append(txs, tx) - } - ps.mu.Lock() - ps.byNode["MOSTLY"] = txs - for _, tx := range txs { - ps.byPayloadType[4] = append(ps.byPayloadType[4], tx) - } - ps.clockSkew.computeInterval = 0 - ps.mu.Unlock() - - r := ps.GetNodeClockSkew("MOSTLY") - if r == nil { - t.Fatal("nil result") - } - // 90% good → normal classification path, median of good samples = -5s → ok - if r.Severity != SkewOK { - t.Errorf("severity = %v, want ok", r.Severity) - } - if math.Abs(r.RecentMedianSkewSec-(-5)) > 1 { - t.Errorf("recentMedianSkewSec = %v, want ≈ -5", r.RecentMedianSkewSec) - } -} - -// TestSingleSample_845: one good sample → ok. 
-func TestSingleSample_845(t *testing.T) { - ps := NewPacketStore(nil, nil) - pt := 4 - obsTS := int64(1700000000) - advTS := obsTS - 30 // 30s skew + now := int64(1900000000) // 2030, outside default ranges + advTS := now + 86400 // 1 day ahead tx := &StoreTx{ - Hash: "single-0001", + Hash: "wrong-h1", PayloadType: &pt, DecodedJSON: `{"payload":{"timestamp":` + formatInt64(advTS) + `}}`, Observations: []*StoreObs{ - {ObserverID: "obs1", Timestamp: time.Unix(obsTS, 0).UTC().Format(time.RFC3339)}, + {ObserverID: "obs1", Timestamp: time.Unix(now, 0).UTC().Format(time.RFC3339)}, }, } - ps.mu.Lock() - ps.byNode["SINGLE"] = []*StoreTx{tx} - ps.byPayloadType[4] = append(ps.byPayloadType[4], tx) - ps.clockSkew.computeInterval = 0 - ps.mu.Unlock() - - r := ps.GetNodeClockSkew("SINGLE") - if r == nil { - t.Fatal("nil result") - } - if r.Severity != SkewOK { - t.Errorf("severity = %v, want ok", r.Severity) - } - if r.RecentSampleCount != 1 { - t.Errorf("recentSampleCount = %d, want 1", r.RecentSampleCount) - } - if r.GoodFraction != 1.0 { - t.Errorf("goodFraction = %v, want 1.0", r.GoodFraction) - } -} -// TestFiftyFifty_Bimodal_845: 50% good / 50% bad → bimodal_clock. 
-func TestFiftyFifty_Bimodal_845(t *testing.T) { - ps := NewPacketStore(nil, nil) - pt := 4 - baseObs := int64(1700000000) - var txs []*StoreTx - for i := 0; i < 10; i++ { - obsTS := baseObs + int64(i)*60 - var skew int64 - if i%2 == 0 { - skew = -10 - } else { - skew = -50000000 - } - tx := &StoreTx{ - Hash: fmt.Sprintf("fifty-%04d", i), - PayloadType: &pt, - DecodedJSON: `{"payload":{"timestamp":` + formatInt64(obsTS+skew) + `}}`, - Observations: []*StoreObs{ - {ObserverID: "obs1", Timestamp: time.Unix(obsTS, 0).UTC().Format(time.RFC3339)}, - }, - } - txs = append(txs, tx) - } ps.mu.Lock() - ps.byNode["FIFTY"] = txs - for _, tx := range txs { - ps.byPayloadType[4] = append(ps.byPayloadType[4], tx) - } + ps.byNode["WRONGNODE"] = []*StoreTx{tx} + ps.byPayloadType[4] = []*StoreTx{tx} ps.clockSkew.computeInterval = 0 ps.mu.Unlock() - r := ps.GetNodeClockSkew("FIFTY") + r := ps.GetNodeClockSkew("WRONGNODE") if r == nil { - t.Fatal("nil result") + t.Fatal("expected result") } - if r.Severity != SkewBimodalClock { - t.Errorf("severity = %v, want bimodal_clock", r.Severity) - } - if r.GoodFraction < 0.4 || r.GoodFraction > 0.6 { - t.Errorf("goodFraction = %v, want ~0.5", r.GoodFraction) + if r.Severity != SkewWrong { + t.Errorf("severity = %v, want wrong", r.Severity) } } -// TestAllGood_OK_845: all samples good → ok, no bimodal. 
-func TestAllGood_OK_845(t *testing.T) { +func TestGetNodeClockSkew_TooFewSamplesForDrift(t *testing.T) { ps := NewPacketStore(nil, nil) pt := 4 - baseObs := int64(1700000000) + + now := int64(1900000000) // 2030, outside default ranges var txs []*StoreTx - for i := 0; i < 10; i++ { - obsTS := baseObs + int64(i)*60 + for i := 0; i < 2; i++ { + obsTS := now + int64(i)*7200 + advTS := obsTS + 10 tx := &StoreTx{ - Hash: fmt.Sprintf("allgood-%04d", i), + Hash: "few-h" + string(rune('0'+i)), PayloadType: &pt, - DecodedJSON: `{"payload":{"timestamp":` + formatInt64(obsTS-3) + `}}`, + DecodedJSON: `{"payload":{"timestamp":` + formatInt64(advTS) + `}}`, Observations: []*StoreObs{ {ObserverID: "obs1", Timestamp: time.Unix(obsTS, 0).UTC().Format(time.RFC3339)}, }, } txs = append(txs, tx) } + ps.mu.Lock() - ps.byNode["ALLGOOD"] = txs + ps.byNode["FEWSAMP"] = txs for _, tx := range txs { ps.byPayloadType[4] = append(ps.byPayloadType[4], tx) } ps.clockSkew.computeInterval = 0 ps.mu.Unlock() - r := ps.GetNodeClockSkew("ALLGOOD") - if r == nil { - t.Fatal("nil result") - } - if r.Severity != SkewOK { - t.Errorf("severity = %v, want ok", r.Severity) - } - if r.GoodFraction != 1.0 { - t.Errorf("goodFraction = %v, want 1.0", r.GoodFraction) + result := ps.GetNodeClockSkew("FEWSAMP") + if result == nil { + t.Fatal("expected clock skew result") } - if r.RecentBadSampleCount != 0 { - t.Errorf("recentBadSampleCount = %v, want 0", r.RecentBadSampleCount) + if result.DriftPerDaySec != 0 { + t.Errorf("drift = %v, want 0 for 2-sample node", result.DriftPerDaySec) } } From 0acbac6fde2ea90eceb9a61727bb1604284228a6 Mon Sep 17 00:00:00 2001 From: you Date: Fri, 24 Apr 2026 23:27:16 +0000 Subject: [PATCH 04/12] ui: rename clock skew tiers to default/ok/degrading/degraded/wrong --- public/analytics.js | 10 +++++----- public/nodes.js | 12 +++--------- public/roles.js | 30 +++++++++++------------------- public/style.css | 13 ++++++------- 4 files changed, 25 insertions(+), 40 deletions(-) diff 
--git a/public/analytics.js b/public/analytics.js index 36fe90c6..f358deb0 100644 --- a/public/analytics.js +++ b/public/analytics.js @@ -3495,12 +3495,12 @@ function destroy() { _analyticsData = {}; _channelData = null; if (_ngState && _ }); // Summary - var counts = { ok: 0, warning: 0, critical: 0, absurd: 0 }; + var counts = { ok: 0, degrading: 0, degraded: 0, wrong: 0, default: 0 }; data.forEach(function(n) { if (counts[n.severity] !== undefined) counts[n.severity]++; }); // Filter buttons (also serve as summary — no separate stats pills needed) - var filterColors = { ok: 'var(--status-green)', warning: 'var(--status-yellow)', critical: 'var(--status-orange)', absurd: 'var(--status-purple)', no_clock: 'var(--text-muted)' }; - var filters = ['all', 'ok', 'warning', 'critical', 'absurd', 'no_clock']; + var filterColors = { ok: 'var(--status-green)', degrading: 'var(--status-yellow)', degraded: 'var(--status-orange)', wrong: 'var(--status-red)', default: 'var(--text-muted)' }; + var filters = ['all', 'ok', 'degrading', 'degraded', 'wrong', 'default']; var filterHtml = '
' + filters.map(function(f) { var dot = f !== 'all' ? '' : ''; return '