perf: remove per-StoreTx ResolvedPath, replace with membership index + on-demand decode (final spec) #800

@Kpa-clawbot

Goal

Cut server startup heap by ~900 MB on databases with 1M+ observations by removing per-StoreTx/StoreObs ResolvedPath slices. Unblocks #791.

Discussion, profiling data, and full review history: #799 (closed in favor of this spec).

Profile (in brief)

pprof inuse_space on a representative server (388K obs, 630 MB heap):

Design

Remove

ResolvedPath []*string field from StoreTx and StoreObs. Compile-time guard test ensures it stays gone.
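A guard of this kind is sometimes written as a reflection check that fails at test time if the field reappears. The sketch below is one hedged way to do that; the `StoreTx` and `StoreObs` definitions are hypothetical stand-ins with made-up fields, and `hasField` is an illustrative helper, not a function from the codebase. A strictly compile-time guard would need a different mechanism.

```go
package main

import "reflect"

// StoreTx and StoreObs are hypothetical stand-ins for the real structs;
// their fields are invented for illustration only.
type StoreTx struct {
	ID   int
	Hash string
}

type StoreObs struct {
	ID int
}

// hasField reports whether the struct type of v (or the struct v points to)
// declares a field with the given name.
func hasField(v any, name string) bool {
	t := reflect.TypeOf(v)
	if t == nil {
		return false
	}
	if t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t.Kind() != reflect.Struct {
		return false
	}
	_, ok := t.FieldByName(name)
	return ok
}
```

A test such as TestStoreTx_NoResolvedPathField could then fail whenever `hasField(StoreTx{}, "ResolvedPath")` or `hasField(StoreObs{}, "ResolvedPath")` returns true.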

Add

| Structure | Purpose | Estimated cost (1M obs) |
| --- | --- | --- |
| resolvedPubkeyIndex map[uint64][]int | xxhash64(pubkey) → []txID. Forward index for "Paths through node X" + collision-safety candidates. | 50–120 MB |
| resolvedPubkeyReverse map[int][]uint64 | txID → []hashes it was indexed under. Required for clean removal on eviction / backfill re-index. | ~40 MB |
| apiResolvedPathLRU (sized 10K, ~200 B each) | Cache for on-demand API decode of resolved_path. Mandatory for live polling path. | ~2 MB |
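To make the relationship between the forward and reverse maps concrete, here is a minimal sketch of how they might be kept in step. The type `pubkeyIndex` and its methods are hypothetical names. FNV-1a from the standard library stands in for xxhash64 so the example has no external dependency. Locking is left to the caller.

```go
package main

import "hash/fnv"

// hashPubkey stands in for xxhash64(pubkey). FNV-1a is an assumption made
// only to keep this sketch dependency-free.
func hashPubkey(pubkey string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(pubkey))
	return h.Sum64()
}

// pubkeyIndex is a hypothetical container for the two maps.
type pubkeyIndex struct {
	forward map[uint64][]int // hash(pubkey) -> txIDs
	reverse map[int][]uint64 // txID -> hashes it was indexed under
}

func newPubkeyIndex() *pubkeyIndex {
	return &pubkeyIndex{
		forward: make(map[uint64][]int),
		reverse: make(map[int][]uint64),
	}
}

func containsHash(hs []uint64, h uint64) bool {
	for _, x := range hs {
		if x == h {
			return true
		}
	}
	return false
}

// add indexes txID under every pubkey, skipping hashes already recorded
// for that txID so repeated inserts do not create duplicates.
func (ix *pubkeyIndex) add(txID int, pubkeys []string) {
	for _, pk := range pubkeys {
		h := hashPubkey(pk)
		if containsHash(ix.reverse[txID], h) {
			continue
		}
		ix.forward[h] = append(ix.forward[h], txID)
		ix.reverse[txID] = append(ix.reverse[txID], h)
	}
}

// remove uses the reverse map to delete every forward entry for txID,
// so no orphan txIDs remain after eviction or re-index.
func (ix *pubkeyIndex) remove(txID int) {
	for _, h := range ix.reverse[txID] {
		ids := ix.forward[h]
		kept := ids[:0]
		for _, id := range ids {
			if id != txID {
				kept = append(kept, id)
			}
		}
		if len(kept) == 0 {
			delete(ix.forward, h)
		} else {
			ix.forward[h] = kept
		}
	}
	delete(ix.reverse, txID)
}

// candidates returns the txIDs indexed under the pubkey's hash. Results
// may include false positives on a hash collision.
func (ix *pubkeyIndex) candidates(pubkey string) []int {
	return ix.forward[hashPubkey(pubkey)]
}
```

Without the reverse map, `remove` would have to scan every forward bucket or re-decode the transaction's resolved path to learn which hashes to clear.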

Decode-window discipline (single rule)

resolved_path JSON is decoded at exactly one place per packet (ingest / Load() row). During that decode window, all consumers are fed in this order, then the temporary []*string is dropped — never lands on the struct:

  1. addToByNode — relay node indexing
  2. touchRelayLastSeen — relay liveness DB updates (#660 / #755)
  3. addTxToPathHopIndex resolved-pubkey branch — byPathHop full-pubkey keys
  4. resolvedPubkeyIndex + resolvedPubkeyReverse insert
  5. WebSocket broadcast map (carries the raw JSON bytes — no struct mutation)
  6. Persist batch (carries the raw JSON bytes for SQL UPDATE)

Enforced by: (a) struct field gone (compile-time), (b) godoc on ingestObservationDecoded() documenting the contract, (c) test that broadcast maps include resolved_path post-refactor.
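The rule can be pictured as a single function that decodes once, feeds each consumer in order, and hands back only the raw bytes. This is a sketch under assumptions: `consumer` and `ingestDecoded` are hypothetical names (not the real ingestObservationDecoded()), and the JSON shape is assumed to be an array of strings with null for unresolved hops, inferred from the old []*string field type.

```go
package main

import "encoding/json"

// consumer is a hypothetical stand-in for steps 1-4 (addToByNode,
// touchRelayLastSeen, and so on). It sees the decoded path only for the
// duration of the call.
type consumer func(txID int, path []*string)

// ingestDecoded decodes raw exactly once, calls the consumers in order,
// and returns the untouched raw bytes for the broadcast map and the
// persist batch (steps 5 and 6). The decoded slice is a local variable,
// so it cannot end up stored on a struct.
func ingestDecoded(txID int, raw []byte, consumers []consumer) ([]byte, error) {
	if len(raw) == 0 {
		return raw, nil // NULL resolved_path: nothing to decode
	}
	var path []*string
	if err := json.Unmarshal(raw, &path); err != nil {
		return nil, err
	}
	for _, c := range consumers {
		c(txID, path)
	}
	return raw, nil
}
```

Because the consumers receive the slice as an argument and nothing returns it, keeping the decoded path alive would require a consumer to store it deliberately, which is easier to spot in review.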

On-demand SQL fetch (cold path)

txToMap / obsToMap API serializers and the eviction byNode/nodeHashes cleanup query SQLite for resolved_path when needed:

SELECT id, resolved_path FROM observations WHERE id IN (?, ?, ?, ...)
  • Single batched query per request (≤500 rows).
  • Result cached in apiResolvedPathLRU keyed by obs ID.
  • LRU cache invalidation: backfill writes call apiResolvedPathLRU.Delete(obsID) after committing the SQL UPDATE.
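As an illustration of the batched fetch, the helpers below build the IN (...) query with one placeholder per ID and split an ID list into groups of at most 500. Both function names are hypothetical, and the chunking helper is an assumption about how a caller might respect the 500-row cap; the spec itself only states one batched query per request.

```go
package main

import "strings"

// buildResolvedPathQuery returns the batched SELECT and its arguments for
// the given observation IDs. It returns an empty query for an empty list,
// since "IN ()" is not valid SQL.
func buildResolvedPathQuery(ids []int) (string, []any) {
	if len(ids) == 0 {
		return "", nil
	}
	placeholders := strings.TrimSuffix(strings.Repeat("?, ", len(ids)), ", ")
	args := make([]any, len(ids))
	for i, id := range ids {
		args[i] = id
	}
	query := "SELECT id, resolved_path FROM observations WHERE id IN (" + placeholders + ")"
	return query, args
}

// chunkIDs splits ids into consecutive groups of at most size elements.
// The sub-slices share the backing array of ids and are meant to be read,
// not appended to.
func chunkIDs(ids []int, size int) [][]int {
	if size <= 0 {
		return nil
	}
	var out [][]int
	for len(ids) > size {
		out = append(out, ids[:size])
		ids = ids[size:]
	}
	if len(ids) > 0 {
		out = append(out, ids)
	}
	return out
}
```

The query string and argument slice can be passed to `(*sql.DB).Query(query, args...)`. IDs already present in apiResolvedPathLRU would be filtered out before the query is built.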

Collision safety

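xxhash64 is a 64-bit hash, so any given pair of distinct keys collides with probability 2⁻⁶⁴ (~1 in 1.8 × 10¹⁹); across 1M unique keys the chance of at least one collision is roughly n²/2⁶⁵ ≈ 3 × 10⁻⁸. Collisions are rare but not impossible, so they are handled explicitly. When resolvedPubkeyIndex[h] returns candidates, /api/nodes/{pubkey}/paths runs one batched SQL query to verify the exact pubkey appears in each candidate's resolved path. Same query path as the on-demand SQL fetch — no separate code.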
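The verification step might look like the following sketch. It assumes resolved_path is a JSON array of pubkey strings with null for unresolved hops, and that the batched query result has already been collected into a map keyed by candidate ID. `pathContainsPubkey` and `verifyCandidates` are hypothetical names, and the comparison is an exact, case-sensitive string match.

```go
package main

import "encoding/json"

// pathContainsPubkey reports whether the resolved_path JSON contains the
// exact pubkey. Missing, NULL, or malformed JSON is treated as "no match".
func pathContainsPubkey(raw []byte, pubkey string) bool {
	var hops []*string
	if err := json.Unmarshal(raw, &hops); err != nil {
		return false
	}
	for _, h := range hops {
		if h != nil && *h == pubkey {
			return true
		}
	}
	return false
}

// verifyCandidates keeps only the candidate IDs whose fetched
// resolved_path really contains pubkey, dropping hash-collision false
// positives and candidates with no fetched row.
func verifyCandidates(candidates []int, fetched map[int][]byte, pubkey string) []int {
	var out []int
	for _, id := range candidates {
		if pathContainsPubkey(fetched[id], pubkey) {
			out = append(out, id)
		}
	}
	return out
}
```

In this sketch a colliding candidate is only dropped from the result. The index is left unchanged, because the entry is still valid for the other pubkey that shares the hash.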

Backfill refactor

backfillResolvedPathsAsync:

  1. SQL UPDATE (unchanged)
  2. Use reverse map to remove old hash entries for the obs's tx
  3. Insert new hash entries into forward + reverse maps
  4. Update byPathHop resolved-key entries
  5. Invalidate LRU cache for the obs ID
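The ordering above can be sketched as one function that stops before touching any in-memory state if the SQL UPDATE fails. The `backfillSteps` struct and `backfillOne` are hypothetical; each field stands in for the real operation named in the corresponding step.

```go
package main

// backfillSteps bundles hypothetical hooks for the five steps so the
// ordering can be shown without the real store.
type backfillSteps struct {
	updateSQL     func(obsID int, rawJSON []byte) error // step 1
	removeOld     func(txID int)                        // step 2, via reverse map
	insertNew     func(txID int, pubkeys []string)      // step 3
	updatePathHop func(txID int, pubkeys []string)      // step 4
	invalidateLRU func(obsID int)                       // step 5
}

// backfillOne runs the steps in order for one observation. If the UPDATE
// fails, the indexes and the cache are left exactly as they were.
func backfillOne(s backfillSteps, obsID, txID int, rawJSON []byte, pubkeys []string) error {
	if err := s.updateSQL(obsID, rawJSON); err != nil {
		return err
	}
	s.removeOld(txID)
	s.insertNew(txID, pubkeys)
	s.updatePathHop(txID, pubkeys)
	s.invalidateLRU(obsID)
	return nil
}
```

Invalidating the cache last means a concurrent reader cannot repopulate it with the old value after the invalidation, provided the UPDATE has been committed by then.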

Schema

No schema change. SQLite resolved_path column stays — source of truth for ingest-time resolution, on-demand decode, and collision-safety check.

Feature flag

useResolvedPathIndex bool (default true in v3.6.0). The off-path keeps the old per-StoreTx field as a one-release rollback safety net; the flag and the off-path are removed in v3.7.0.

Consumers (audit)

All Go consumers of ResolvedPath / resolved_path and their post-refactor strategy:

| Function | File | Strategy |
| --- | --- | --- |
| addToByNode | store.go | Decode-window |
| touchRelayLastSeen | store.go | Decode-window |
| pickBestObservation (obs→tx propagation) | store.go | Removed (no field to propagate) |
| txToMap / obsToMap (REST API) | store.go | On-demand SQL + LRU |
| IngestNewObservations / IngestNewFromDB (broadcast + persist) | store.go | Decode-window: raw JSON straight to broadcast/persist, never struct |
| nodeInResolvedPath | store.go | Replaced by index lookup + collision-safety SQL |
| addTxToPathHopIndex (resolved-key branch) | store.go | Decode-window |
| removeTxFromPathHopIndex | store.go | New reverse map drives removal |
| mapSliceToStoreTxs / mapSliceToObservations | routes.go | Dead code — delete |
| backfillResolvedPathsAsync | neighbor_persist.go | New flow above |
| Eviction path: byNode / nodeHashes cleanup | store.go | On-demand SQL fetch (cold path, cheap) |

Frontend: no API contract change required. resolved_path remains in broadcast maps and API responses.

Tests

Unit

  • TestResolvedPubkeyIndex_BuildFromLoad — forward + reverse maps consistent after Load()
  • TestResolvedPubkeyIndex_HashCollision — crafted-vector collision; SQL safety filters false candidate
  • TestResolvedPubkeyIndex_IngestUpdate — both maps reflect new ingests; struct has no field
  • TestResolvedPubkeyIndex_RemoveOnEvict — eviction removes via reverse map; no orphan txIDs
  • TestResolvedPubkeyIndex_PerObsCoverage — non-best obs's resolved pubkeys are also indexed
  • TestStoreTx_NoResolvedPathField — compile-time guard
  • TestAddToByNode_WithoutResolvedPathField — relay nodes still in byNode
  • TestTouchRelayLastSeen_WithoutResolvedPathField — relay last_seen still updated
  • TestWebSocketBroadcast_IncludesResolvedPath — broadcast carries resolved_path
  • TestBackfill_UpdatesIndexAndByPathHop — backfill populates new structures
  • TestBackfill_RemoveOldOnReBackfill — re-backfill removes old hashes via reverse map
  • TestBackfill_InvalidatesLRU — LRU cache evicts the obs after backfill UPDATE
  • TestEviction_ByNodeCleanup_OnDemandSQL — eviction path SQL-fetches resolved_path to clean byNode / nodeHashes

Endpoint

  • TestPathsThroughNode_PrecisionAfterRefactor — identical results before/after on prefix-collision fixture
  • TestPathsThroughNode_NilResolvedPathFallback — NULL resolved_path packets still returned via raw-byte fallback
  • TestPathsThroughNode_CollisionSafety — crafted hash collision filtered by SQL safety check
  • TestPacketsAPI_OnDemandResolvedPath — /api/packets includes resolved_path for cold packets
  • TestPacketsAPI_OnDemandResolvedPath_LRUHit — second request hits cache
  • TestPacketsAPI_OnDemandResolvedPath_Empty — NULL returns null/omitted
  • TestLivePolling_ResolvedPathFromBroadcast — live poll uses in-flight broadcast cache, no SQL
  • TestLivePolling_LRUUnderConcurrentIngest — 100 concurrent live polls + ingest writes; p95 < 50 ms

Feature flag

  • TestFeatureFlag_OffPath_PreservesOldBehavior — with useResolvedPathIndex=false, struct still has field, all existing tests pass
  • TestFeatureFlag_Toggle_NoStateLeak — toggling the flag at runtime doesn't corrupt state (or document it as restart-only)

Concurrency

  • TestReverseMap_NoLeakOnPartialFailure — if backfill UPDATE succeeds but index insert panics, recovery doesn't leave the reverse map in an inconsistent state
  • TestDecodeWindow_LockHoldTimeBounded — measure write-lock duration during ingest decode window; document budget

Integration / regression

Benchmarks

  • BenchmarkLoad_BeforeAfter — 100K obs fixture; target ≥80% heap reduction
  • BenchmarkResolvedPubkeyIndex_Memory — at 50K and 500K unique-pubkey distributions; verify within budget
  • BenchmarkPathsThroughNode_Latency — 5K candidates; equal or faster
  • BenchmarkPacketsAPI_FirstPage — /api/packets?limit=100; <20 ms regression
  • BenchmarkLivePolling_UnderIngest — 1 Hz live polling under continuous ingest; p99 < 100 ms

Manual validation

Acceptance criteria

  1. All tests above pass
  2. #791 user's DB starts within a 1 GB container limit
  3. /api/nodes/{pubkey}/paths byte-identical results before/after on regression fixture
  4. BenchmarkLoad ≥80% heap reduction
  5. Feature flag works; off-path preserves current behavior
  6. No schema migration; no firmware assumptions changed; no force-pushes
  7. StoreTx / StoreObs have no ResolvedPath field (compile-time)

Estimated effort

10–14h for a senior Go developer familiar with the codebase:

  • 4h core refactor (field removal + decode-window plumbing)
  • 2h on-demand SQL + LRU + invalidation
  • 1h backfill refactor + reverse-map maintenance
  • 1h eviction byNode/nodeHashes cleanup via on-demand SQL
  • 1h memory accounting + feature flag plumbing
  • 3h tests (unit + endpoint + collision + concurrency + feature flag)
  • 2h integration, benchmarks, edge cases

References
