feat: remove runtime manifest dependency + streaming rewards #197

7layermagik · 2026-01-28T05:26:03Z

Summary

This PR eliminates the runtime dependency on the snapshot manifest file for replay and replaces the memory-intensive full stake cache with streaming stake account processing from AccountsDB.

Key Architectural Changes

1. Remove Manifest Runtime Dependency

State file v2 now captures all manifest-derived fields at parse time
Replay no longer opens or reads manifest files during execution
All epoch boundary processing uses AccountsDB directly

2. Streaming Rewards (Zero Memory Spike)

Replace full stake cache (~1GB for 2.5M accounts) with streaming from AccountsDB
Two-pass streaming rewards calculation:
- Pass 1: Stream stakes → compute total points
- Pass 2: Stream stakes → recompute points + calculate rewards → write to spool
Per-partition spool files for O(1) reads during distribution
RAM stays flat across epoch boundaries

3. Stake Pubkey Index

Persistent index file (stake_pubkeys.idx) tracks all known stake accounts
New stake accounts appended incrementally during replay
Enables efficient streaming without full account scan

Changes by Component

State File (pkg/state/)

Add ManifestEpochStakes, ManifestEpochAuthorizedVoters, ManifestTransactionCount, ManifestEpochAcctsHash fields
Require v2 state file format (no manifest fallbacks)
Remove all runtime manifest file access

Global Context (pkg/global/)

Add StreamStakeAccounts() for parallel batched streaming from AccountsDB
Add stake pubkey index load/save/append helpers
Add TrackNewStakePubkey() for incremental index updates
Add voteStakeTotals aggregate map (replaces full stake cache at startup)

Epoch Processing (pkg/replay/epoch.go)

updateStakeHistorySysvar() → stream from AccountsDB
updateEpochStakesAndRefreshVoteCache() → single streaming pass builds both vote totals and effective stakes

Rewards (pkg/rewards/)

Add spool.go with per-partition binary spool files:
- PartitionedSpoolWriters for buffered writes (1MB buffer)
- PartitionReader for sequential reads per partition
- 88-byte record format: stake_pubkey(32) + vote_pubkey(32) + stake(8) + credits(8) + reward(8)
Add CalculateRewardsStreaming() for two-pass streaming calculation
Add DistributeStakingRewardsFromSpool() for partition-based distribution
Channel-based single writer pattern with error capture
Strict error policy: any failure returns error (consensus correctness)
Remove ~300 lines of dead code (old partition tracking, points accumulator, etc.)

Snapshot (pkg/snapshot/)

Add manifest_seed.go for extracting epoch stakes from manifest at parse time

Performance Characteristics

Metric	Before	After
Startup stake cache	~1GB	~50MB (aggregates only)
Epoch boundary RAM spike	~1.5GB	~0 (streaming)
Spool file I/O	N/A	Sequential, buffered 1MB
Rewards calculation	2 passes over cache	2 streaming passes
Rewards distribution	Random AccountsDB reads	Sequential spool + batch AccountsDB

File Changes

 pkg/global/global_ctx.go      | +140 lines (streaming, index helpers)
 pkg/replay/block.go           | refactored (remove manifest reads)
 pkg/replay/epoch.go           | refactored (streaming stake processing)
 pkg/replay/rewards.go         | +53 lines (spool-based distribution)
 pkg/rewards/rewards.go        | +583/-541 (streaming calculation)
 pkg/rewards/spool.go          | +218 lines (new spool infrastructure)
 pkg/rewards/partitions.go     | -46 lines (deleted - dead code)
 pkg/rewards/points.go         | -38 lines (deleted - dead code)
 pkg/snapshot/manifest_seed.go | +133 lines (epoch stakes extraction)
 pkg/state/state.go            | +77 lines (manifest fields in state)

🤖 Generated with Claude Code

Replay now reads all seed data from state file instead of manifest at runtime. This eliminates the need for the manifest file after AccountsDB is built. Changes: - Add manifest_* fields to MithrilState schema (v2) - Create PopulateManifestSeed() to copy manifest data at build time - Update configureInitialBlock/FromResume to use state file - Update newReplayCtx to prefer state file over manifest - Update buildInitialEpochStakesCache to use ManifestEpochStakes - Clear ManifestEpochStakes after first replayed slot past snapshot - Backwards compat: fall back to manifest for old state files Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Require state schema version 2 (no v0/v1 migration, error instead) - Remove manifest fallback in configureInitialBlock (fatal if missing) - Remove manifest fallback in buildInitialEpochStakesCache (fatal if missing) - Remove manifest fallback in newReplayCtx (fatal if missing) - Remove snapshotManifest parameter from ReplayBlocks signature - Remove snapshotManifest parameter from configureInitialBlock signature - Add ManifestTransactionCount and ManifestEpochAuthorizedVoters fields - Add proper error handling for corrupted state file decoding - Clean up unused imports (epochstakes, snapshot from replay package) This ensures replay NEVER reads from manifest after AccountsDB build. Old state files will error with "delete AccountsDB and rebuild from snapshot". Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Make ManifestEvictedBlockhash required in strict v2 mode (affects transaction age validation for first block) - Switch manifest_seed.go from mr-tron/base58 to internal pkg/base58 for consistency with block.go decode path Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…e timestamps - EpochAuthorizedVoters: change from map[string]string to map[string][]string to support multiple authorized voters per vote account (matches original manifest behavior where PutEntry appends to a slice) - VoteTimestamps: populate from ALL vote accounts in cache, not just those with non-zero stake (matches original manifest behavior where all vote accounts from Bank.Stakes.VoteAccounts had timestamps populated) Note: This changes the state file schema for manifest_epoch_authorized_voters. Existing state files will require rebuild from snapshot. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Stop populating global.StakeCache at startup. Instead, build an aggregate map of vote pubkey → total stake directly from AccountsDB scan. Changes: - Add voteStakeTotals map + mutex + helpers in global_ctx.go - Modify setupInitialVoteAcctsAndStakeAccts to build aggregate only: - Remove PutStakeCacheItemBulk calls - Each batch worker builds local map, merges into shared under mutex - Call SetVoteStakeTotals() to store aggregate for later use - Full stake cache no longer populated at startup (memory savings) Note: Epoch boundary functions still use StakeCache and will fail until Step 2 adds on-demand cache rebuild. This is expected. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Replace full stake cache with streaming from AccountsDB for epoch boundary processing and rewards calculation. Key changes: - Add StreamStakeAccounts() for parallel stake account streaming - Add spool file infrastructure (SpoolWriter/SpoolReader) for binary serialization of stake rewards during calculation - Modify updateStakeHistorySysvar() to stream from AccountsDB - Modify updateEpochStakesAndRefreshVoteCache() to stream in single pass - Add CalculateRewardsStreaming() for two-pass streaming rewards: - Pass 1: Calculate total points - Pass 2: Calculate rewards and write to spool file - Add DistributeStakingRewardsFromSpool() for partition distribution - Update recordStakeDelegation() to only track new pubkeys for index Architecture: - No full stake cache held in memory - All stake data streamed directly from AccountsDB - Rewards written to binary spool file during calculation - Distribution reads from spool file partition-by-partition - RAM stays flat across epoch boundaries Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Addresses issues identified in ChatGPT review of ce71c54: 1. Per-partition spool files (O(N) → O(1) reads per partition) - Write to reward_spool_<slot>_p<partition>.bin during calculation - Sequential read per partition during distribution (no indexing) - Eliminates O(N × partitions) file scanning 2. Channel-based single writer for error capture - All spool writes routed through one goroutine - Write errors captured in atomic.Value and propagated - Cleanup on failure 3. Silent account error tracking - Track failed account reads/unmarshals with atomic counters - Log warning with count and first error when failures occur 4. stakePointsCache elimination (already in ce71c54) - Two-pass streaming: Pass 1 for total points, Pass 2 recomputes + writes - Trades ~2x CPU for 0 extra RAM (~140MB saved) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Clean PartitionedRewardDistributionInfo: remove unused Credits, RewardPartitions, StakingRewards, WorkerPool fields - Remove dead code: rewardDistributionTask, rewardDistributionWorker, InitWorkerPool, ReleaseWorkerPool, DistributeStakingRewardsForPartition - DistributeStakingRewardsFromSpool now streams with reader.Next() loop instead of ReadAll() for flat RAM during distribution - Use dynamic slices with append (no pre-allocated arrays with nils) - Implement strict error policy: any account read/unmarshal/marshal failure returns error immediately for consensus correctness - Add buffered I/O (1MB) to PartitionReader and partitionWriter - Remove all legacy single-file spool types from spool.go Net reduction: ~294 lines removed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Replace atomic.Value with atomic.Pointer[error] to avoid potential issues with uninitialized atomic.Value (CompareAndSwap on zero value) - Check workerPool.Invoke() return value and fail fast if pool rejects task (balance wg.Done since worker won't run) - Proper pointer dereference (*werr, *ferr) for error formatting Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Delete pkg/rewards/partitions.go (old in-memory partition tracking) - Delete pkg/rewards/points.go (CalculatedStakePointsAccumulator) - Remove CalculateStakeRewardsAndPartitions, CalculateStakePoints, delegationAndPubkey, idxAndReward from rewards.go - Remove unused rpc import - Fix workerPool.Invoke error handling in StreamStakeAccounts Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

StreamStakeAccounts now waits for all already-queued batches to complete before returning an error, preventing workers from running against shared state after the caller has moved on. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The function and its call referenced the removed snapshot.SnapshotManifest parameter, causing build failures. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

7layermagik and others added 12 commits January 27, 2026 23:03

fix: remove leftover nilifySnapshotManifest after manifest removal

f610577

The function and its call referenced the removed snapshot.SnapshotManifest parameter, causing build failures. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: remove runtime manifest dependency + streaming rewards #197

feat: remove runtime manifest dependency + streaming rewards #197

Uh oh!

7layermagik commented Jan 28, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat: remove runtime manifest dependency + streaming rewards #197

Are you sure you want to change the base?

feat: remove runtime manifest dependency + streaming rewards #197

Uh oh!

Conversation

7layermagik commented Jan 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key Architectural Changes

Changes by Component

State File (pkg/state/)

Global Context (pkg/global/)

Epoch Processing (pkg/replay/epoch.go)

Rewards (pkg/rewards/)

Snapshot (pkg/snapshot/)

Performance Characteristics

File Changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

7layermagik commented Jan 28, 2026 •

edited

Loading