Revision history: v1 → v2 (folds in expert critique comment 1 and implementer review comment 2). Strikethroughs removed for readability; see comments for the original v1 critique threads.
Problem
Heap profiling on a representative server (388K observations, 37K transmissions) shows the in-memory packet store consumes ~630 MB. The dominant cost (per pprof inuse_space):
| Source | Heap | % |
|---|---|---|
| `PacketStore.Load()` | 445 MB | 71% |
| ↳ `unmarshalResolvedPath` (JSON decode of `resolved_path` column) | 205 MB | 33% |
| ↳ `sqlite.columnText` (raw column reads) | 100 MB | 16% |
| ↳ `reflect.New` (JSON reflection) | 41 MB | 7% |
| `IngestNewObservations` (steady state) | 123 MB | 20% |
| `IngestNewFromDB` | 48 MB | 8% |
| Subpath/path-hop indexes (combined) | ~10 MB | 1.5% |
Extrapolating to a database with 1.66M observations (real user report in #791): roughly 1 GB just to deserialize and hold ResolvedPath slices on every StoreTx/StoreObs — directly responsible for OOM kills during startup on memory-constrained deployments.
Why ResolvedPath is so expensive
StoreTx.ResolvedPath and StoreObs.ResolvedPath are []*string — one full hex pubkey per hop, parallel-aligned with PathJSON. Loading 1M+ observations means:
1M+ JSON arrays decoded via encoding/json reflection
1M × ~5 hops × ~64 bytes per hex string = ~320 MB of small string allocations
Each []*string slice header is ~24 bytes, plus pointer slots
Held in heap for the lifetime of the process
Complete Audit of ResolvedPath / resolved_path Consumers

| # | Location | Function | What it does | Refactor strategy |
|---|---|---|---|---|
| 1 | store.go:553–580 | `addToByNode` | Indexes relay nodes from resolved hops into `byNode` | Decode-window: extract pubkeys at ingest before discarding |
| 2 | store.go:603–645 | `touchRelayLastSeen` | Updates relay `last_seen` in DB (#660 / #755 area) | Decode-window consumer |
| 3 | store.go:513 | `pickBestObservation` | Resolved-path propagation obs→tx | — |
| 4 | store.go:832–833, 2177, 2213, 2237 | `txToMap`/`obsToMap` | Serializes packets for API responses | On-demand SQL fetch (D) |
| 5 | store.go:1572, 1622, 1830–1831, 1921 | `IngestNewObservations`/`IngestNewFromDB` | Ingest paths | Pass `resolved_path` JSON straight into broadcast map and persist batch — never store on struct |
| 6 | store.go:2305–2336 | `nodeInResolvedPath` | Membership check for "Paths through node" | Replace with `resolvedPubkeyIndex` lookup + collision-safety check |
| 7 | store.go:2441–2456 | `addTxToPathHopIndex` | Populates `byPathHop` | Populate `byPathHop` at ingest before discarding strings; also feed the new `resolvedPubkeyIndex` |
| 8 | store.go:2469–2490 | `removeTxFromPathHopIndex` | Eviction cleanup of `byPathHop` | txID → []uint64 `indexedHashes` so eviction can find what to remove |
| 9 | routes.go:2211, 2235 | `mapSliceToStoreTxs`/`mapSliceToObservations` | Map → struct conversion | — |
| 10 | neighbor_persist.go:384, 451, 493 | `backfillResolvedPathsAsync` | Writes `obs.ResolvedPath` back to struct + persists to SQL | Persist to SQL + update index and `byPathHop` resolved-key entries; never touches struct |
| 11 | config.go:95 | `BackfillResolvedPathHours` | Config | — |
| 12 | main.go:154–161 | `ensureResolvedPathColumn` | Schema bootstrap | — |

Frontend (no API contract change required — `resolved_path` stays in broadcast maps and API responses):
public/packet-helpers.js:40–49 — `getResolvedPath()` parser
public/hop-resolver.js:277 — `resolveFromServer()`
public/packets.js:252, 387, 1183, 1768, 1961, 2328 — caching + detail panes
public/live.js:547, 2026, 2112–2124 — broadcast consumption
Proposed refactor (revised)
Remove ResolvedPath []*string from StoreTx and StoreObs. Replace with three narrowly-scoped data structures and one code-flow rule:
A) Membership index (forward)
```go
// resolvedPubkeyIndex maps an 8-byte xxhash64 of a resolved pubkey
// to the list of transmission IDs whose resolved path contains it.
resolvedPubkeyIndex map[uint64][]int // pubkeyHash → []txID
```
Built once during Load() from the same SQL data (decode → hash → insert → discard the strings).
Replaces nodeInResolvedPath(tx, pk) with txID in resolvedPubkeyIndex[hash(pk)] followed by collision-safety check.
Coverage: indexes pubkeys from every observation's resolved_path on the tx, not just the "best" observation. This matches current nodeInResolvedPath semantics (it scans all obs).
B) Reverse map for eviction (NEW — addresses critique #6)
```go
// resolvedPubkeyReverse maps txID to the set of pubkey hashes it was
// indexed under. Required for clean removal on eviction / path change.
resolvedPubkeyReverse map[int][]uint64 // txID → []pubkeyHash
```
Populated at the same time as the forward index.
removeTxFromResolvedPubkeyIndex(txID) reads this, deletes from forward index, deletes the reverse entry. O(hops) per tx.
C) Decode-window discipline (NEW — addresses gotchas #1–#5, #7, #10)
Rule: resolved_path JSON bytes flow through ingest exactly once, and during that one decode window:
The JSON is parsed into a temporary []*string (stack-allocated when small).
All consumers are fed in order: addToByNode, touchRelayLastSeen, addTxToPathHopIndex (resolved-pubkey branch), resolvedPubkeyIndex insert, resolvedPubkeyReverse insert, broadcast map (raw JSON bytes), persist batch (raw JSON bytes).
The []*string is dropped at the end of the function — never lands on StoreTx/StoreObs.
This single rule makes the refactor coherent: every consumer that needs resolved pubkeys gets them at the only time they're guaranteed cheap to produce.
D) On-demand SQL fetch for cold-path API responses
For txToMap/obsToMap calls that serve packets that came in via Load() (not the live ingest path), do a SQL lookup:
```sql
SELECT id, resolved_path FROM observations WHERE id IN (?, ?, ?, ...)
```
Single batch query per API request, returning ≤500 rows. Live-path packets carry resolved_path directly.
E) Backfill refactor (addresses critique #3, gotcha #10)
backfillResolvedPathsAsync no longer writes obs.ResolvedPath back onto the struct: it persists to SQL, then updates the forward/reverse index and the byPathHop resolved-key entries. The remove-old/add-new sequence works because the reverse map tells us exactly what to remove — even though the field is gone.
F) xxhash64 collision strategy (addresses critique #8)
With a 64-bit hash, any fixed pair of distinct pubkeys collides with probability 2⁻⁶⁴; the birthday bound across 1M unique pubkeys puts the chance of any collision around 3×10⁻⁸. In practice, expect zero collisions.
Detection + safety: when resolvedPubkeyIndex[h] returns candidates, the /api/nodes/{pubkey}/paths handler does a single batched SQL query against the resolved_path column to confirm the exact pubkey appears in each candidate's resolved path. Same query, same code path as the on-demand SQL fetch (D).
No collision lists, no secondary hash, no open addressing. The SQL safety net handles it.
Code: confirmResolvedPathContains(txIDs []int, pk string) []int — one helper, one query, called only on the "Paths through node" hot path.
G) Schema migration
No schema change required. The resolved_path SQLite column stays — it's still the source of truth for ingest-time resolution, the on-demand decode path, and the collision-safety check.
Out of scope
Replacing JSON with protobuf/CBOR for the resolved_path column. The point is to stop decoding the column en masse at startup, after which the on-disk format becomes irrelevant for OOM. JSON is fine for the hundreds of rows decoded per API request.
Changing the neighbor affinity graph or backfill resolution logic.
Removing the byPathHop raw-byte half (graceful degradation when resolved_path is NULL is preserved).
Required tests (revised)
Unit (cmd/server/store_test.go, routes_test.go, neighbor_persist_test.go)
TestResolvedPubkeyIndex_BuildFromLoad — fixture with known pubkeys → after Load(), forward + reverse maps are consistent.
TestResolvedPubkeyIndex_HashCollision — crafted-vector test with two different pubkeys producing the same xxhash64. Confirm /api/nodes/{pubkey}/paths returns only the true match (collision-safety SQL filters the false candidate).
TestResolvedPubkeyIndex_IngestUpdate — after IngestNewObservations, both maps reflect new entries; no ResolvedPath field exists on the struct.
TestResolvedPubkeyIndex_RemoveOnEvict — eviction removes all entries via reverse map; forward map has no orphan txIDs.
TestResolvedPubkeyIndex_PerObsCoverage — fixture where best obs and a non-best obs have different resolved paths; both sets of pubkeys appear in the index.
TestStoreTx_NoResolvedPathField — compile-time guard on struct shape.
NEW: TestAddToByNode_WithoutResolvedPathField — relay nodes still indexed into byNode post-refactor.
NEW: TestTouchRelayLastSeen_WithoutResolvedPathField — relay last_seen still updated; behavior from the #660 / #755 area untouched.
NEW: TestWebSocketBroadcast_IncludesResolvedPath — broadcast maps still carry resolved_path for the frontend (decode-window capture works).
NEW: TestBackfill_UpdatesIndexAndByPathHop — backfill writes to SQL AND populates the new forward/reverse index AND updates byPathHop resolved-key entries; verify with a fixture starting from all-NULL.
NEW: TestBackfill_RemoveOldOnReBackfill — if an obs is backfilled twice (e.g., graph improved), reverse map enables removing old hash entries before adding new ones.
Endpoint (routes_test.go)
TestPathsThroughNode_PrecisionAfterRefactor — identical results before/after on a prefix-collision fixture.
TestPathsThroughNode_NilResolvedPathFallback — packets with NULL resolved_path still returned via raw-byte index match.
TestPathsThroughNode_CollisionSafety — when the hash index returns candidates due to a (crafted) collision, the SQL safety check filters the false candidate.
TestPacketsAPI_OnDemandResolvedPath — /api/packets includes resolved_path for cold packets via SQL fetch.
TestPacketsAPI_OnDemandResolvedPath_LRUHit — second request for the same packets uses the LRU cache (verify with cache hit counter).
TestPacketsAPI_OnDemandResolvedPath_Empty — NULL resolved_path in DB returns null (or omitted) in API.
TestLivePolling_ResolvedPathFromBroadcast — live polling responses include resolved_path from the in-flight broadcast cache, no SQL hit.
TestLivePolling_LRUUnderConcurrentIngest — under 100 concurrent live polls + ingest writes, p95 latency < 50ms.
Integration / regression
TestRepeaterLiveness_StillAccurate — relay-detection logic from the #662 / #755 area returns identical counts.
TestSubpathDetail_StillWorks — /api/analytics/subpath-detail returns identical results (uses resolved hop names indirectly).
TestNodeDetailPage_RelayCount — Node Detail "relayed N packets" count matches pre-refactor.
Memory / performance
BenchmarkLoad_BeforeAfter — 100K observations fixture; runtime.MemStats.HeapAlloc after Load(). Target: ≥80% reduction (was ≥50% in v1; revised based on revised budget).
BenchmarkResolvedPubkeyIndex_Memory — actual heap of the new structures at realistic 50K and 500K unique-pubkey distributions; verify within revised budget.
BenchmarkPathsThroughNode_Latency — query node with ~5K candidates. Target: equal or faster.
BenchmarkPacketsAPI_FirstPage — /api/packets?limit=100 end-to-end. Target: <20 ms regression.
BenchmarkLivePolling_UnderIngest — sustained live polling at 1 Hz under continuous ingest at production rates. Target: p99 < 100 ms.
Manual validation (release checklist)
On Node Detail → Paths tab: open a node known to have prefix collisions. Verify only that node's actual paths appear.
On Live page: verify hop names render correctly without flicker on new packets.
On Packets browser: filter by node, verify resolved hop column shows correct names.
After running for 1h: verify backfill log shows "[store] async resolved_path backfill complete" with no errors and the index reflects backfilled entries.
Risks (revised)
Hash collisions: with a 64-bit hash, the birthday bound at 1M keys gives roughly a 3×10⁻⁸ chance of any collision. Handled by the SQL safety check on the only hot lookup path. Risk: low.
On-demand SQL latency under live polling: mitigated by mandatory LRU cache. Risk: medium until benchmarked.
Concurrent index update: the forward + reverse maps share s.mu with everything else in the store. No new locking surface, but the backfill path now does more work under the write lock — verify backfill batch sizes don't cause noticeable API hiccups. Risk: low–medium.
Backfill re-indexing: addressed by the reverse map. Risk: low.
Subtle correctness bug in production: addressed by feature flag (see acceptance criteria).
Acceptance criteria (revised)
/api/nodes/{pubkey}/paths returns byte-identical results before and after on the regression fixture, including the prefix-collision fixture.
BenchmarkLoad shows ≥80% heap reduction.
Feature flag useResolvedPathIndex bool (default true in v3.6.0; default-off path falls back to the current per-StoreTx behavior for one release as a rollback safety net).
No new force-pushes, no schema migration required, no changes to firmware assumptions.
No new ResolvedPath field on StoreTx / StoreObs (compile-time guard test).
Estimated effort
8–12h for a senior Go developer familiar with the codebase (revised from v1's optimistic 4–6h):
4h: Core refactor (field removal, Load() index build, all decode-window consumers wired)
2h: On-demand SQL fetch + LRU cache for API endpoints
1h: Backfill refactor + reverse-map maintenance
1h: Memory accounting helper updates
2h: Unit + endpoint + collision tests
1h: Integration testing, benchmarks, edge cases (NULL resolved_path during backfill window, feature flag toggle)
Related: #791, #743, #748, #555, #660, #755. Replaces the approach in PR #796 (closed).