
perf(nodes): batch relay stats to fix O(N×M) /api/nodes regression#1164

Open
efiten wants to merge 3 commits into Kpa-clawbot:master from efiten:fix/nodes-relay-stats-batch

Conversation


@efiten efiten commented May 7, 2026

Problem

handleNodes enriches each repeater/room node by calling GetRepeaterRelayInfo and GetRepeaterUsefulnessScore per node inside a loop. GetRepeaterUsefulnessScore acquires s.mu.RLock() and then iterates all byPayloadType entries to compute the non-advert denominator — once per node.

On a deployment with ~1500 repeater/room nodes and ~145K transmissions in memory, this is ~220M iterations per /api/nodes request, plus ~3000 separate lock acquisitions. Response times of 18–44 seconds have been observed in production, especially during startup backfill when write-lock contention compounds the issue.

Fix

Add GetRepeaterNodeStatsBatch(pubkeys []string, windowHours float64) map[string]RepeaterNodeStats to repeater_usefulness.go:

  • Takes one s.mu.RLock() for the entire node list
  • Computes the non-advert denominator once (shared across all nodes)
  • Snapshots byPathHop slice headers for all requested pubkeys under that single lock
  • Processes timestamps and counts outside the lock

Update handleNodes to collect repeater/room pubkeys first, call the batch method once, and apply results.

Complexity: O(M + N) instead of O(N × M) per request (M = total transmissions, N = repeater nodes).
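For illustration, the single-lock batch shape can be sketched with toy types (a minimal sketch — `store`, `byNode`, and `batchStats` are hypothetical stand-ins, not the actual diff against PacketStore):

```go
package main

import "sync"

// Toy sketch of the batch shape: one RLock for the whole node list, the
// shared denominator computed once, per-node data copied out, and scoring
// done after the lock is released. Hypothetical types, not the real store.
type store struct {
	mu            sync.RWMutex
	byPayloadType map[int]int      // payload type -> transmission count
	byNode        map[string][]int // per-node relay samples
}

type nodeStats struct{ Score float64 }

func (s *store) batchStats(pubkeys []string) map[string]nodeStats {
	out := make(map[string]nodeStats, len(pubkeys))

	s.mu.RLock()
	// Denominator computed once per request, not once per node.
	denom := 0
	for pt, n := range s.byPayloadType {
		if pt != 0 { // treat payload type 0 as "advert" in this toy
			denom += n
		}
	}
	// Copy the per-node values needed for scoring while the lock is held.
	counts := make(map[string]int, len(pubkeys))
	for _, pk := range pubkeys {
		counts[pk] = len(s.byNode[pk])
	}
	s.mu.RUnlock()

	// O(N) scoring outside the lock; every requested pubkey gets a result.
	for _, pk := range pubkeys {
		var sc float64
		if denom > 0 {
			sc = float64(counts[pk]) / float64(denom)
		}
		out[pk] = nodeStats{Score: sc}
	}
	return out
}
```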

GetRepeaterRelayInfo and GetRepeaterUsefulnessScore are unchanged — they are still correct for single-node calls (e.g. handleNodeDetail).

Test plan

  • go build ./cmd/server passes
  • /api/nodes response is correct (relay_active, relay_count_1h/24h, usefulness_score fields present for repeaters)
  • No change in output for /api/nodes/{pubkey} (uses existing single-node methods)
  • CI passes

🤖 Generated with Claude Code

handleNodes looped over repeater/room nodes calling GetRepeaterRelayInfo +
GetRepeaterUsefulnessScore per node. GetRepeaterUsefulnessScore iterates ALL
byPayloadType entries (denominator) under a separate RLock on every call,
giving O(N × totalTx) complexity — ~220M iterations per request on a node
with 1500 repeaters and 145K transmissions.

Add GetRepeaterNodeStatsBatch: takes ONE RLock, computes the non-advert
denominator once, snapshots byPathHop slice headers for all requested
pubkeys, then processes everything outside the lock. Update handleNodes to
collect repeater/room pubkeys first, call the batch method once, and apply
results — reducing /api/nodes relay enrichment from O(N × M) to O(M + N).

GetRepeaterRelayInfo and GetRepeaterUsefulnessScore are unchanged for the
single-node detail endpoint.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
efiten added a commit to efiten/meshcore-analyzer that referenced this pull request May 7, 2026
handleNodes looped over repeater/room nodes calling GetRepeaterRelayInfo +
GetRepeaterUsefulnessScore per node. GetRepeaterUsefulnessScore iterates ALL
byPayloadType entries under a separate RLock on every call — O(N × totalTx)
per request (~220M iterations with 1500 nodes and 145K transmissions).

Add GetRepeaterNodeStatsBatch: ONE RLock, denominator computed once,
byPathHop headers snapshotted for all nodes, processing outside the lock.
Reduces /api/nodes relay enrichment from O(N × M) to O(M + N).

Upstream PR: Kpa-clawbot#1164

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Owner

@Kpa-clawbot Kpa-clawbot left a comment


Independent review — PR #1164 (perf: batch relay stats)

Thanks for the perf work — the diagnosis is right, the denominator-once + single-RLock shape is the correct fix, and the EnrichNodes-style call site change in routes.go is clean. I'd ship this with the items below addressed.

Must-fix

1. Data race: slice headers snapshotted under RLock, elements read after RUnlock.

In GetRepeaterNodeStatsBatch:

s.mu.RLock()
// ...
snaps[pk] = nodeSnap{txList: s.byPathHop[key]}
// ...
s.mu.RUnlock()

// ... outside lock ...
for _, tx := range snap.txList {  // racy
    if tx != nil { uniq[tx.ID] = struct{}{} }
}

Only the slice header is copied under the lock; the backing array is shared with writers. Writers DO mutate the backing array in place:

  • removeTxFromSlice (store.go:2886): idx[key] = append(list[:i], list[i+1:]...) — this shifts elements left in the same backing array. A reader iterating the snapshot will see torn / shifted / duplicated *StoreTx entries.
  • append(s.byPathHop[key], tx) (store.go:619, 1630, 1996, 2117) — when the slice is under capacity, append writes the new element into spare capacity past the snapshot's length (invisible to the snapshot, so safe by itself); at capacity it allocates a fresh backing array. Neither is a problem alone, but combined with the in-place shift of the remove path above, the shared backing array is mutated under a live snapshot.
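The in-place shift from the remove path can be reproduced in isolation (toy int slice, not the store's types) to show why a header copy is not a snapshot:

```go
package main

// removeAt mirrors the append(list[:i], list[i+1:]...) idiom from
// removeTxFromSlice: it shifts elements left in the SAME backing array.
func removeAt(list []int, i int) []int {
	return append(list[:i], list[i+1:]...)
}

// snapshotThenRemove copies only the slice header, then mutates the list.
// The "snapshot" observes the shift: it still has length 3, but its
// contents have become {20, 30, 30}.
func snapshotThenRemove() (snap, after []int) {
	list := []int{10, 20, 30}
	snap = list               // header copy; backing array is shared
	after = removeAt(list, 0) // shifts 20, 30 left in place
	return snap, after
}
```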

Compare with the existing GetRepeaterRelayInfo (repeater_liveness.go), which does collect(txList) / collect(prefixList) before s.mu.RUnlock() — i.e. it reads every tx.ID / tx.PayloadType / tx.FirstSeen it needs while still holding the read lock and only the resulting scratch slice escapes. Please mirror that: extract the per-node (ts, pt, isInTxList) tuples into a per-pubkey []entry (or a single flat slice keyed by pubkey index) inside the RLock, then release and process. Keeps the perf win — the denominator is still computed once and the lock is still held only once across the whole node set — but stops crossing the lock boundary with mutable state.

This will also be flagged by go test -race on a contention-heavy fixture, which is part of why a real test (item 2) matters.

2. No behavior tests for the new public API.

Per AGENTS.md TDD rule, this is a behavior change — GetRepeaterNodeStatsBatch is a new public method on PacketStore. The "Test plan" in the PR body is go build + "CI passes", which are not behavior tests. Required, in cmd/server/repeater_usefulness_test.go (or a new repeater_batch_test.go):

  • Parity test: build a PacketStore with a representative mix (one node with full-key hops only, one with prefix-only hops, one with both, one with no hops, one repeater that sees only adverts). Call GetRepeaterNodeStatsBatch(pubkeys, w). For each pubkey, assert result[pk].Info equals GetRepeaterRelayInfo(pk, w) field-by-field and result[pk].Score equals GetRepeaterUsefulnessScore(pk).
  • Empty input: GetRepeaterNodeStatsBatch(nil, 24) and (...[]string{}, 24) return non-nil empty map.
  • Nodes with no byPathHop entries appear in the result map with zero Info and zero Score (this is what routes.go's if stats, ok := relayStats[pk]; ok relies on, and it currently always succeeds because the loop unconditionally writes result[pk] — that's the right behavior, please pin it with a test so a future refactor can't regress it).
  • -race test that hammers GetRepeaterNodeStatsBatch from one goroutine while another calls a mutating store method (e.g. an upsert that triggers addTxToPathHopIndex / removeTxFromPathHopIndex). With the current snapshot-then-release shape this should fail; with the fix from item 1 it should pass.

The red→green sequence on the parity test should also be visible in the commit history per AGENTS.md "TDD IS MANDATORY".
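The parity check itself is mechanical; a minimal version of the pattern (a float score stands in here for the field-by-field RepeaterNodeStats comparison the real test needs):

```go
package main

// checkParity returns the pubkeys whose batch result disagrees with the
// single-node method. In the real test, the comparison would be
// field-by-field over RepeaterNodeStats; a float64 stands in here.
func checkParity(batch map[string]float64, single func(string) float64, pubkeys []string) []string {
	var mismatches []string
	for _, pk := range pubkeys {
		if batch[pk] != single(pk) {
			mismatches = append(mismatches, pk)
		}
	}
	return mismatches
}
```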

3. uniq map is dead work.

uniq := make(map[int]struct{}, len(snap.txList)+len(snap.prefixList))
for _, tx := range snap.txList   { if tx != nil { uniq[tx.ID] = struct{}{} } }
for _, tx := range snap.prefixList { if tx != nil { uniq[tx.ID] = struct{}{} } }
entries := make([]entry, 0, len(uniq))
seen := make(map[int]bool, len(uniq))
collect := func(list []*StoreTx) { /* ... uses `seen`, not `uniq` ... */ }
collect(snap.txList); collect(snap.prefixList)

uniq is built only to size entries / seen, then seen re-does the exact same dedup. Either drop uniq and oversize entries/seen with len(snap.txList)+len(snap.prefixList) (the existing GetRepeaterRelayInfo accepts that ~2× over-allocation in exchange for one fewer pass), or drop seen and key off uniq. Two passes of dedup per node, in a hot path you're explicitly trying to deflate, is the wrong direction.
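The first option (drop uniq, accept the over-allocation) would look roughly like this — toy txn type, hypothetical names:

```go
package main

// txn is a stand-in for *StoreTx; only the ID matters for dedup.
type txn struct{ ID int }

// collectIDs dedups both lists in a single pass: the scratch structures
// are sized by the combined lengths (accepting the ~2x over-allocation,
// as GetRepeaterRelayInfo already does) and one `seen` map does the work.
func collectIDs(txList, prefixList []*txn) []int {
	n := len(txList) + len(prefixList)
	seen := make(map[int]bool, n)
	ids := make([]int, 0, n)
	for _, list := range [][]*txn{txList, prefixList} {
		for _, t := range list {
			if t != nil && !seen[t.ID] {
				seen[t.ID] = true
				ids = append(ids, t.ID)
			}
		}
	}
	return ids
}
```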

4. Type and helper duplication with GetRepeaterRelayInfo.

type entry struct { ts string; pt int }, the collect closure, the cutoff math (cutoff1h, cutoff24h, windowCutoff), and the latest-tracking + RelayCount1h/24h accumulation are copy-pasted verbatim from repeater_liveness.go. After fix #1 you'll have to keep two copies of the snapshot-read shape in lockstep. Please extract a small helper, e.g.

func computeRelayInfoFromEntries(entries []entry, now time.Time, windowHours float64) RepeaterRelayInfo

and call it from both GetRepeaterRelayInfo and the batch method. Same applies to the prefix-lookup logic (if len(key) >= 2 { prefix := key[:2]; if prefix != key { ... } }) — that's now in two places; one helper.

5. Doc comment claim doesn't match implementation.

The comment says "Snapshots byPathHop slice headers for all requested pubkeys under that single lock — Processes timestamps and counts outside the lock." That description IS the bug from item 1. After the fix, please rewrite to: "Copies the per-node (timestamp, payload-type) tuples under the single read lock; lock is released before parsing/scoring." (And the doc-comment paragraph in the PR body should match.)

Out-of-scope (not in this diff, file separately if you agree)

  • GetRepeaterUsefulnessScore recomputes totalNonAdvert on every call — even single-node detail pages eat O(byPayloadType) per call. The TODO already mentions memoization off store invalidation; worth its own issue.
  • byPathHop keying by 1-byte raw prefix is acknowledged-lossy in repeater_liveness.go's comment and the same caveat now applies to the batch method's prefix lookup. Disambiguation via the path-listing endpoint's resolved-path SQL post-filter is referenced but not done here. Track separately.

Verdict

Right diagnosis, right shape, but the lock-release ordering as written is unsafe and there are no behavior tests gating it. Once items 1–3 land (and ideally 4–5), happy to re-review.

efiten and others added 2 commits May 9, 2026 17:59
GetRepeaterNodeStatsBatch over ~1900 repeaters takes 20-30s on large
datasets. Add GetRepeaterNodeStatsBatchCached with a 30s TTL so repeated
/api/nodes requests (map view, live view) serve from cache after the
first cold computation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GetRepeaterNodeStatsBatchCached: TTL was 30s but cold compute takes
~25s on live data (~1900 repeaters, 142K packets) → ~82% CPU duty
cycle. 300s TTL drops it to ~8% (one cold pass per 5 minutes).

asyncPersistResolvedPathsAndEdges: obs resolved_path updates now
written in pages of 200 rows per transaction (was one big transaction)
with a 5ms yield between pages, limiting SQLite write-lock hold time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>