perf: remove per-StoreTx ResolvedPath, replace with membership index + on-demand API decode #799


Revision history: v1 → v2 (folds in expert critique comment 1 and implementer review comment 2). Strikethroughs removed for readability; see comments for the original v1 critique threads.

Problem

Heap profiling on a representative server (388K observations, 37K transmissions) shows the in-memory packet store consumes ~630 MB. The dominant cost (per pprof inuse_space):

| Source | Heap | % |
|---|---:|---:|
| PacketStore.Load() | 445 MB | 71% |
| unmarshalResolvedPath (JSON decode of resolved_path column) | 205 MB | 33% |
| sqlite.columnText (raw column reads) | 100 MB | 16% |
| reflect.New (json reflection) | 41 MB | 7% |
| IngestNewObservations (steady state) | 123 MB | 20% |
| IngestNewFromDB | 48 MB | 8% |
| Subpath/path-hop indexes (combined) | ~10 MB | 1.5% |

Extrapolating to a database with 1.66M observations (real user report in #791): roughly 1 GB just to deserialize and hold ResolvedPath slices on every StoreTx/StoreObs — directly responsible for OOM kills during startup on memory-constrained deployments.

Why ResolvedPath is so expensive

StoreTx.ResolvedPath and StoreObs.ResolvedPath are []*string — one full hex pubkey per hop, parallel-aligned with PathJSON. Loading 1M+ observations means:

  • 1M+ JSON arrays decoded via encoding/json reflection
  • 1M × ~5 hops × ~64 bytes per hex string = ~320 MB of small string allocations
  • Each []*string slice header is ~24 bytes, plus pointer slots
  • Held in heap for the lifetime of the process

Complete Audit of ResolvedPath / resolved_path Consumers

| # | Location | Function | What it does | Refactor strategy |
|---|---|---|---|---|
| 1 | store.go:553–580 | addToByNode | Indexes relay nodes from resolved hops into byNode | Decode-window: extract pubkeys at ingest before discarding |
| 2 | store.go:603–645 | touchRelayLastSeen | Updates relay last_seen in DB (#660 / #755 area) | Decode-window: same as #1 |
| 3 | store.go:513 | pickBestObservation | Propagates obs→tx: copies the best obs's resolved path to the tx | Goes away with the field; consumers fed via decode window |
| 4 | store.go:832–833, 2177, 2213, 2237 | txToMap / obsToMap | REST API serialization | On-demand SQL fetch (part D) for cold packets; broadcast capture for hot |
| 5 | store.go:1572, 1622, 1830–1831, 1921 | IngestNewObservations / IngestNewFromDB | WebSocket broadcast + persist | Decode-window: pull resolved_path JSON straight into the broadcast map and persist batch; never store it on the struct |
| 6 | store.go:2305–2336 | nodeInResolvedPath | Disambiguation post-filter for "Paths through node X" | Replaced by membership-index lookup + collision-safety SQL fallback |
| 7 | store.go:2441–2456 | addTxToPathHopIndex | Indexes tx under raw bytes AND full resolved pubkeys in byPathHop | Decode-window: feed pubkeys into byPathHop at ingest before discarding the strings; also feed the new resolvedPubkeyIndex |
| 8 | store.go:2469–2490 | removeTxFromPathHopIndex | Removes tx from index entries on eviction/path change | NEW: maintain a reverse map txID → []uint64 of indexed hashes so eviction can find what to remove |
| 9 | routes.go:2211, 2235 | mapSliceToStoreTxs / mapSliceToObservations | Rehydrates from broadcast | Dead-code removal: broadcast still carries the JSON bytes, but they don't land on the struct |
| 10 | neighbor_persist.go:384, 451, 493 | backfillResolvedPathsAsync | Writes obs.ResolvedPath back to the struct + persists to SQL | Refactored: write only to SQL, update the membership index, and update byPathHop resolved-key entries; never touch the struct |
| 11 | config.go:95 | BackfillResolvedPathHours | Config knob | Unchanged |
| 12 | main.go:154–161 | ensureResolvedPathColumn | Schema migration | Unchanged; the column stays |

Frontend (no API contract change required — resolved_path stays in broadcast maps and API responses):

  • public/packet-helpers.js:40–49 — getResolvedPath() parser
  • public/hop-resolver.js:277 — resolveFromServer()
  • public/packets.js:252, 387, 1183, 1768, 1961, 2328 — caching + detail panes
  • public/live.js:547, 2026, 2112–2124 — broadcast consumption

Proposed refactor (revised)

Remove ResolvedPath []*string from StoreTx and StoreObs. Replace it with three narrowly scoped data structures and one code-flow rule:

A) Membership index (forward)

// resolvedPubkeyIndex maps an 8-byte xxhash64 of a resolved pubkey
// to the list of transmission IDs whose resolved path contains it.
resolvedPubkeyIndex map[uint64][]int   // pubkeyHash → []txID
  • Built once during Load() from the same SQL data (decode → hash → insert → discard the strings).
  • Replaces nodeInResolvedPath(tx, pk) with txID in resolvedPubkeyIndex[hash(pk)] followed by a collision-safety check (sketched after this list).
  • Coverage: indexes pubkeys from every observation's resolved_path on the tx, not just the "best" observation. This matches current nodeInResolvedPath semantics (it scans all obs).
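
As a sketch, the replacement lookup could look like the following, assuming the forward map above and the confirmResolvedPathContains helper from section F (the method name here is illustrative; the hash is github.com/cespare/xxhash/v2):

```go
// Sketch: candidates come from the forward index; exactness comes from
// the collision-safety SQL check described in section F.
func (s *PacketStore) txIDsThroughNode(pk string) []int {
	h := xxhash.Sum64String(pk)
	s.mu.Lock() // same single store lock discussed under Risks
	candidates := append([]int(nil), s.resolvedPubkeyIndex[h]...) // copy out under lock
	s.mu.Unlock()
	if len(candidates) == 0 {
		return nil
	}
	return s.confirmResolvedPathContains(candidates, pk)
}
```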

B) Reverse map for eviction (NEW — addresses critique #6)

// resolvedPubkeyReverse maps txID to the set of pubkey hashes it was
// indexed under. Required for clean removal on eviction / path change.
resolvedPubkeyReverse map[int][]uint64 // txID → []pubkeyHash
  • Populated at the same time as the forward index.
  • removeTxFromResolvedPubkeyIndex(txID) reads this, deletes from the forward index, then deletes the reverse entry. O(hops) per tx (see the sketch below).
  • Memory cost: ~5 hops × 8 bytes × 1M tx = ~40 MB + map overhead. Cheap.
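
A sketch of that removal, assuming the two maps above and the store's existing single-lock discipline (caller holds s.mu):

```go
// removeTxFromResolvedPubkeyIndex deletes every forward-index entry the
// reverse map recorded for txID, then drops the reverse entry itself.
func (s *PacketStore) removeTxFromResolvedPubkeyIndex(txID int) {
	for _, h := range s.resolvedPubkeyReverse[txID] {
		ids := s.resolvedPubkeyIndex[h]
		for i, id := range ids {
			if id == txID {
				ids[i] = ids[len(ids)-1] // swap-remove; bucket order is not meaningful
				ids = ids[:len(ids)-1]
				break // one reverse entry removes exactly one forward occurrence
			}
		}
		if len(ids) == 0 {
			delete(s.resolvedPubkeyIndex, h) // no orphan buckets
		} else {
			s.resolvedPubkeyIndex[h] = ids
		}
	}
	delete(s.resolvedPubkeyReverse, txID)
}
```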

C) Decode-window discipline (NEW — addresses gotchas #1–#5, #7, #10)

Rule: resolved_path JSON bytes flow through ingest exactly once, and during that one decode window:

  1. The JSON is parsed into a temporary []*string that lives only for the duration of the call (short-lived garbage rather than retained heap).
  2. All consumers are fed in order: addToByNode, touchRelayLastSeen, addTxToPathHopIndex (resolved-pubkey branch), resolvedPubkeyIndex insert, resolvedPubkeyReverse insert, broadcast map (raw JSON bytes), persist batch (raw JSON bytes).
  3. The []*string is dropped at the end of the function — never lands on StoreTx/StoreObs.

This single rule makes the refactor coherent: every consumer that needs resolved pubkeys gets them at the only time they're guaranteed cheap to produce.
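
A minimal sketch of what that window could look like at one ingest call site. Everything except resolvedPubkeyIndex / resolvedPubkeyReverse is a hypothetical name standing in for the consumers listed above; the hash import is github.com/cespare/xxhash/v2:

```go
import (
	"encoding/json"

	"github.com/cespare/xxhash/v2"
)

// ingestResolvedPath is an illustrative shape, not final code: the JSON is
// decoded once, every consumer is fed, and the temporary slice is dropped.
func (s *PacketStore) ingestResolvedPath(txID, obsID int, rpJSON []byte) error {
	var hops []*string // temporary: never assigned to StoreTx/StoreObs
	if err := json.Unmarshal(rpJSON, &hops); err != nil {
		return err
	}
	s.mu.Lock()
	for _, pk := range hops {
		if pk == nil {
			continue // unresolved hop: raw-byte indexing still covers it
		}
		h := xxhash.Sum64String(*pk)
		s.addToByNodeRelay(txID, *pk)        // hypothetical stand-in: consumer #1 (addToByNode)
		s.addResolvedKeyToPathHop(txID, *pk) // hypothetical stand-in: consumer #7 (byPathHop)
		s.resolvedPubkeyIndex[h] = append(s.resolvedPubkeyIndex[h], txID)
		s.resolvedPubkeyReverse[txID] = append(s.resolvedPubkeyReverse[txID], h)
	}
	s.mu.Unlock()
	s.broadcastResolvedPath(obsID, rpJSON) // hypothetical: raw bytes, no re-encode
	s.persistResolvedPath(obsID, rpJSON)   // hypothetical: raw bytes into persist batch
	return nil // hops goes out of scope here and becomes short-lived garbage
}
```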

D) On-demand SQL fetch for cold-path API responses

For txToMap/obsToMap calls serving packets that arrived via Load() (not the live ingest path), do a SQL lookup:

SELECT id, resolved_path FROM observations WHERE id IN (?, ?, ?, ...)
  • Single batch query per API request, returning ≤500 rows.
  • An LRU cache (10K entries) is mandatory for the live-polling path (addresses critique #5); a sketch follows this list.
  • For hot packets (those still in the broadcast pipeline), no SQL needed — the broadcast carries resolved_path directly.
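
One way this could be wired, with the LRU in front of the batched SELECT. Everything here is illustrative: the cache library (hashicorp/golang-lru/v2), the database/sql binding (the codebase uses a direct sqlite driver), and the function name are assumptions; the query matches the one above.

```go
import (
	"database/sql"
	"strings"

	lru "github.com/hashicorp/golang-lru/v2"
)

// rpCache fronts the cold-path SQL fetch; 10K entries per section D.
var rpCache, _ = lru.New[int, []byte](10_000) // obsID → raw resolved_path JSON

// fetchResolvedPaths is a hypothetical helper: batch-fetch resolved_path
// for the observation IDs on one API page, LRU-first, one query for misses.
func fetchResolvedPaths(db *sql.DB, obsIDs []int) (map[int][]byte, error) {
	out := make(map[int][]byte, len(obsIDs))
	miss := make([]int, 0, len(obsIDs))
	for _, id := range obsIDs {
		if v, ok := rpCache.Get(id); ok {
			out[id] = v
			continue
		}
		miss = append(miss, id)
	}
	if len(miss) == 0 {
		return out, nil // fully served from cache: the live-polling fast path
	}
	// One batched query per API request (≤500 rows), as described above.
	q := "SELECT id, resolved_path FROM observations WHERE id IN (?" +
		strings.Repeat(",?", len(miss)-1) + ")"
	args := make([]any, len(miss))
	for i, id := range miss {
		args[i] = id
	}
	rows, err := db.Query(q, args...)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	for rows.Next() {
		var (
			id int
			rp []byte // NULL resolved_path scans as nil
		)
		if err := rows.Scan(&id, &rp); err != nil {
			return nil, err
		}
		out[id] = rp
		rpCache.Add(id, rp)
	}
	return out, rows.Err()
}
```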

E) Backfill refactor (addresses critique #3, gotcha #10)

Pseudocode for the new backfillResolvedPathsAsync inner loop:

for batch := range pendingObs {
    for _, obs := range batch {
        rp := resolvePathForObs(obs.PathJSON, obs.ObserverID, tx, pm, graph)
        rpJSON := marshalResolvedPath(rp)
        // 1. SQL update (keeps existing behavior)
        sqlUpdate(obs.ID, rpJSON)
        // 2. Update forward + reverse index
        s.mu.Lock()
        // Drop stale entries first: the reverse map records exactly what
        // this tx was indexed under, even though the struct field is gone.
        // (Must run before the append loop, not inside it, or it would
        // delete the entries just added.)
        removeOldHashes(obs.TransmissionID)
        for _, pk := range rp {
            if pk == nil { continue }
            h := xxhash64(*pk)
            s.resolvedPubkeyIndex[h] = append(s.resolvedPubkeyIndex[h], obs.TransmissionID)
            s.resolvedPubkeyReverse[obs.TransmissionID] = append(s.resolvedPubkeyReverse[obs.TransmissionID], h)
        }
        // 3. Update byPathHop resolved-key entries (similar pattern)
        s.mu.Unlock()
    }
}

The remove-old/add-new sequence works because the reverse map tells us exactly what to remove — even though the field is gone.

F) xxhash64 collision strategy (addresses critique #8)

  • Probability: at 1M unique pubkeys, the birthday bound gives roughly n²/2⁶⁵ ≈ 2.7×10⁻⁸, i.e. about a 1-in-37-million chance of any collision. In practice, expect zero.
  • Detection + safety: when resolvedPubkeyIndex[h] returns candidates, the /api/nodes/{pubkey}/paths handler does a single batched SQL query against the resolved_path column to confirm the exact pubkey appears in each candidate's resolved path. Same query, same code path as the on-demand SQL fetch (D).
  • No collision lists, no secondary hash, no open addressing. The SQL safety net handles it.
  • Code: confirmResolvedPathContains(txIDs []int, pk string) []int — one helper, one query, called only on the "Paths through node" hot path.
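
A sketch of that helper; fetchResolvedPathsForTxs is a hypothetical stand-in for the shared batched SELECT from (D):

```go
// confirmResolvedPathContains filters hash-index candidates down to txs
// whose resolved_path actually contains pk. Called only on the
// "Paths through node" path, and almost always a no-op filter.
func (s *PacketStore) confirmResolvedPathContains(txIDs []int, pk string) []int {
	paths, err := s.fetchResolvedPathsForTxs(txIDs) // hypothetical batched SELECT, as in (D)
	if err != nil {
		return txIDs // one possible policy: serve unfiltered candidates on DB error
	}
	var confirmed []int
	for _, id := range txIDs {
		var hops []*string
		if json.Unmarshal(paths[id], &hops) != nil {
			continue // NULL or malformed resolved_path cannot confirm a match
		}
		for _, hop := range hops {
			if hop != nil && *hop == pk {
				confirmed = append(confirmed, id)
				break
			}
		}
	}
	return confirmed
}
```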

G) Schema migration

No schema change required. The resolved_path SQLite column stays — it's still the source of truth for ingest-time resolution, the on-demand decode path, and the collision-safety check.

Memory budget (revised, addresses critique #2)

| Structure | Estimate at 1M obs / 50K unique pubkeys | Estimate at 1M obs / 500K unique pubkeys |
|---|---:|---:|
| resolvedPubkeyIndex (forward map) | ~50 MB | ~120 MB |
| resolvedPubkeyReverse (reverse map) | ~40 MB | ~40 MB |
| LRU cache for API on-demand decode (10K × ~200 B) | ~2 MB | ~2 MB |
| Total new structures | ~92 MB | ~162 MB |
| Removed: per-StoreTx/Obs ResolvedPath slices | ~1 GB | ~1 GB |
| Net savings | ~900 MB | ~840 MB |

Out of scope

  • Replacing JSON with protobuf/CBOR for the resolved_path column. The point is to stop decoding the column en masse at startup, after which the on-disk format becomes irrelevant for OOM. JSON is fine for the hundreds of rows decoded per API request.
  • Changing the neighbor affinity graph or backfill resolution logic.
  • Removing the byPathHop raw-byte half (graceful degradation when resolved_path is NULL is preserved).

Required tests (revised)

Unit (cmd/server/store_test.go, routes_test.go, neighbor_persist_test.go)

  • TestResolvedPubkeyIndex_BuildFromLoad — fixture with known pubkeys → after Load(), forward + reverse maps are consistent.
  • TestResolvedPubkeyIndex_HashCollision — crafted-vector test with two different pubkeys producing the same xxhash64. Confirm /api/nodes/{pubkey}/paths returns only the true match (collision-safety SQL filters the false candidate).
  • TestResolvedPubkeyIndex_IngestUpdate — after IngestNewObservations, both maps reflect new entries; no ResolvedPath field exists on the struct.
  • TestResolvedPubkeyIndex_RemoveOnEvict — eviction removes all entries via reverse map; forward map has no orphan txIDs.
  • TestResolvedPubkeyIndex_PerObsCoverage — fixture where best obs and a non-best obs have different resolved paths; both sets of pubkeys appear in the index.
  • TestStoreTx_NoResolvedPathField — guard on struct shape (see the sketch after this list).
  • NEW: TestAddToByNode_WithoutResolvedPathField — relay nodes still indexed into byNode post-refactor.
  • NEW: TestTouchRelayLastSeen_WithoutResolvedPathField — relay last_seen still updated; integration with the #660 / #755 work untouched.
  • NEW: TestWebSocketBroadcast_IncludesResolvedPath — broadcast maps still carry resolved_path for the frontend (decode-window capture works).
  • NEW: TestBackfill_UpdatesIndexAndByPathHop — backfill writes to SQL AND populates the new forward/reverse index AND updates byPathHop resolved-key entries; verify with a fixture starting from all-NULL.
  • NEW: TestBackfill_RemoveOldOnReBackfill — if an obs is backfilled twice (e.g., graph improved), reverse map enables removing old hash entries before adding new ones.
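
For the struct-shape guard, one possibility is a reflection check rather than a strictly compile-time one, since Go cannot assert a field's absence at compile time:

```go
package main // tests live alongside store.go in cmd/server

import (
	"reflect"
	"testing"
)

// TestStoreTx_NoResolvedPathField fails if the removed field ever
// reappears on either struct.
func TestStoreTx_NoResolvedPathField(t *testing.T) {
	for _, typ := range []reflect.Type{
		reflect.TypeOf(StoreTx{}),
		reflect.TypeOf(StoreObs{}),
	} {
		if _, ok := typ.FieldByName("ResolvedPath"); ok {
			t.Fatalf("%s must not carry ResolvedPath after the refactor", typ.Name())
		}
	}
}
```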

Endpoint (routes_test.go)

  • TestPathsThroughNode_PrecisionAfterRefactor — identical results before/after on a prefix-collision fixture.
  • TestPathsThroughNode_NilResolvedPathFallback — packets with NULL resolved_path still returned via raw-byte index match.
  • TestPathsThroughNode_CollisionSafety — when the hash index returns candidates due to a (crafted) collision, the SQL safety check filters the false candidate.
  • TestPacketsAPI_OnDemandResolvedPath — /api/packets includes resolved_path for cold packets via SQL fetch.
  • TestPacketsAPI_OnDemandResolvedPath_LRUHit — second request for the same packets uses the LRU cache (verify with cache hit counter).
  • TestPacketsAPI_OnDemandResolvedPath_Empty — NULL resolved_path in DB returns null (or omitted) in API.
  • TestLivePolling_ResolvedPathFromBroadcast — live polling responses include resolved_path from the in-flight broadcast cache, no SQL hit.
  • TestLivePolling_LRUUnderConcurrentIngest — under 100 concurrent live polls + ingest writes, p95 latency < 50ms.

Integration / regression

  • TestRepeaterLiveness_StillAccurate — relay-detection logic from the #755 area returns identical counts.
  • TestSubpathDetail_StillWorks — /api/analytics/subpath-detail returns identical results (uses resolved hop names indirectly).
  • TestNodeDetailPage_RelayCount — Node Detail "relayed N packets" count matches pre-refactor.

Memory / performance

  • BenchmarkLoad_BeforeAfter — 100K-observation fixture; measure runtime.MemStats.HeapAlloc after Load(). Target: ≥80% reduction (was ≥50% in v1; revised per the memory budget above). A sketch follows this list.
  • BenchmarkResolvedPubkeyIndex_Memory — actual heap of the new structures at realistic 50K and 500K unique-pubkey distributions; verify within revised budget.
  • BenchmarkPathsThroughNode_Latency — query node with ~5K candidates. Target: equal or faster.
  • BenchmarkPacketsAPI_FirstPage — /api/packets?limit=100 end-to-end. Target: <20 ms regression.
  • BenchmarkLivePolling_UnderIngest — sustained live polling at 1 Hz under continuous ingest at production rates. Target: p99 < 100 ms.
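
A sketch of the first benchmark; newStoreFromFixture is a hypothetical fixture loader, and the metric is simply live heap after Load():

```go
// BenchmarkLoad_BeforeAfter reports post-Load() heap so before/after runs
// can be diffed; the ≥80% target applies to this number.
func BenchmarkLoad_BeforeAfter(b *testing.B) {
	for i := 0; i < b.N; i++ {
		store := newStoreFromFixture(b, "testdata/100k_observations.db") // hypothetical
		runtime.GC()
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		b.ReportMetric(float64(m.HeapAlloc)/(1<<20), "heapMB/load")
		runtime.KeepAlive(store) // store must stay live while heap is read
	}
}
```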

Manual validation (release checklist)

  • Start the server with the #791 user's database (or a synthetic 1M-observation DB). Verify startup completes in a 1 GB container.
  • On Node Detail → Paths tab: open a node known to have prefix collisions. Verify only that node's actual paths appear.
  • On Live page: verify hop names render correctly without flicker on new packets.
  • On Packets browser: filter by node, verify resolved hop column shows correct names.
  • After running for 1h: verify backfill log shows "[store] async resolved_path backfill complete" with no errors and the index reflects backfilled entries.

Risks (revised)

  • Hash collisions: ~1 in 37 million chance of any collision at 1M keys (birthday bound). Handled by the SQL safety check on the only hot lookup path. Risk: low.
  • On-demand SQL latency under live polling: mitigated by mandatory LRU cache. Risk: medium until benchmarked.
  • Concurrent index update: the forward + reverse maps share s.mu with everything else in the store. No new locking surface, but the backfill path now does more work under the write lock — verify backfill batch sizes don't cause noticeable API hiccups. Risk: low–medium.
  • Backfill re-indexing: addressed by the reverse map. Risk: low.
  • Subtle correctness bug in production: addressed by feature flag (see acceptance criteria).

Acceptance criteria (revised)

  1. All required tests pass.
  2. Startup heap with the #791 database fits within a 1 GB container limit.
  3. /api/nodes/{pubkey}/paths returns byte-identical results before and after on the regression fixture, including the prefix-collision fixture.
  4. BenchmarkLoad shows ≥80% heap reduction.
  5. Feature flag useResolvedPathIndex bool (default true in v3.6.0; default-off path falls back to the current per-StoreTx behavior for one release as a rollback safety net).
  6. No new force-pushes, no schema migration required, no changes to firmware assumptions.
  7. No new ResolvedPath field on StoreTx / StoreObs (compile-time guard test).

Estimated effort

8–12h for a senior Go developer familiar with the codebase (revised from v1's optimistic 4–6h):

  • 4h: Core refactor (field removal, Load() index build, all decode-window consumers wired)
  • 2h: On-demand SQL fetch + LRU cache for API endpoints
  • 1h: Backfill refactor + reverse-map maintenance
  • 1h: Memory accounting helper updates
  • 2h: Unit + endpoint + collision tests
  • 1h: Integration testing, benchmarks, edge cases (NULL resolved_path during backfill window, feature flag toggle)

Related: #791, #743, #748, #555, #660, #755. Replaces the approach in PR #796 (closed).
