From 4dac51bf3035fd079c08794dbc19598182a13a27 Mon Sep 17 00:00:00 2001
From: Kris Zyp
Date: Tue, 26 May 2026 07:19:38 -0600
Subject: [PATCH 1/2] docs(storage): document storage.rocks.* memory config options
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds documentation for four new storage configuration parameters
introduced in v5.1.0:

- storage.rocks.blockCacheSize — explicit override for the shared
  RocksDB block cache size
- storage.rocks.writeBufferManagerSize — process-wide cap on memtable
  memory across all databases
- storage.rocks.writeBufferManagerCostToCache — share memtable
  accounting with the block cache for unified observability
- storage.rocks.writeBufferManagerAllowStall — hard cap vs soft cap
  behavior when the WriteBufferManager limit is reached

- Adds bullet-list entries to reference/configuration/options.md under
  the existing `storage` section.
- Adds a new "RocksDB Memory" detail section to
  reference/database/storage-tuning.md between "Read & Write Behavior"
  and "Storage Reclamation", with workload-recipe-style guidance for
  when to lower the cache or enable the manager.

Documents the multi-tenant shared-host scenario as the primary use case
for tuning these knobs.

Source PR: HarperFast/harper#780

Co-Authored-By: Claude Sonnet 4.7
---
 reference/configuration/options.md   |  4 ++
 reference/database/storage-tuning.md | 93 ++++++++++++++++++++++++++++
 2 files changed, 97 insertions(+)

diff --git a/reference/configuration/options.md b/reference/configuration/options.md
index ad8a59a6..4b29433c 100644
--- a/reference/configuration/options.md
+++ b/reference/configuration/options.md
@@ -239,6 +239,10 @@ storage:
 - `reclamation.threshold` — Free-space ratio below which reclamation begins evicting from caching tables; _Default_: `0.4` (Added in: v4.5.0)
 - `reclamation.interval` — Free-space check interval; _Default_: `1h`
 - `reclamation.evictionFactor` — Heuristic factor for early eviction under disk pressure; _Default_: `100000`. See [Storage Tuning — Reclamation](../database/storage-tuning.md#storage-reclamation)
+- `rocks.blockCacheSize` — RocksDB shared block cache size in bytes; _Default_: 25% of constrained memory. See [Storage Tuning — RocksDB Memory](../database/storage-tuning.md#rocksdb-memory) (Added in: v5.1.0)
+- `rocks.writeBufferManagerSize` — Process-wide cap (bytes) on RocksDB memtable memory across all databases. `0` disables; _Default_: `0`. See [Storage Tuning — RocksDB Memory](../database/storage-tuning.md#rocksdb-memory) (Added in: v5.1.0)
+- `rocks.writeBufferManagerCostToCache` — Charge memtable memory against the block cache so both share a single accounting pool; _Default_: `false`. Has no effect when `writeBufferManagerSize` is `0`. (Added in: v5.1.0)
+- `rocks.writeBufferManagerAllowStall` — Stall writes when memtable memory exceeds `writeBufferManagerSize` (hard cap) instead of allowing brief overshoot with more aggressive flushing (soft cap); _Default_: `false`. (Added in: v5.1.0)

 ---

diff --git a/reference/database/storage-tuning.md b/reference/database/storage-tuning.md
index 98a30283..23789fc7 100644
--- a/reference/database/storage-tuning.md
+++ b/reference/database/storage-tuning.md
@@ -138,6 +138,99 @@ Default: `true`

 In-memory record caching of decoded records. Disable to reduce heap usage when records are large and unlikely to be re-read in the same process.

+## RocksDB Memory
+
+RocksDB uses two large native memory pools that Harper exposes for tuning: a shared **block cache** for hot SST blocks, and an optional **WriteBufferManager** that caps total memtable memory across every database in the process. These are RocksDB-specific — when `storage.engine` is `lmdb`, none of these options apply.
+
+For single-tenant deployments the defaults are appropriate. The knobs below are intended for shared-host or memory-constrained environments where multiple Harper databases coexist with other workloads and total memory must be bounded predictably.
+
+### `storage.rocks.blockCacheSize`
+
+
+
+Type: `number` (bytes)
+
+Default: 25% of constrained (cgroup) or total memory
+
+The shared LRU cache for decompressed SST blocks. Every RocksDB database in the process draws from this single pool — sizing it correctly is a balance between read-cache hit rate and leaving room for memtables, the heap, and OS page cache.
+
+The default sizes the cache to 25% of available memory, computed once at startup. This is reasonable for single-tenant servers but can be excessive in multi-tenant deployments where the cache is rarely filled and the unused capacity contributes to the process's idle-state memory floor (the cache itself does not shrink on idle — entries persist until LRU eviction or a manual `SetCapacity` change).
+
+```yaml
+storage:
+  rocks:
+    blockCacheSize: 268435456 # 256 MB
+```
+
+Lower the cache size when:
+
+- Multiple Harper instances or other memory-heavy processes share the host.
+- Read access patterns favor warm data far smaller than 25% of memory.
+- The instance is provisioned with a strict cgroup limit and the headroom is needed for memtables or application heap.
+
+Raise it (or leave at the default) when reads dominate and the working set is large.
+
+### `storage.rocks.writeBufferManagerSize`
+
+
+
+Type: `number` (bytes)
+
+Default: `0` (disabled)
+
+When set, Harper attaches a single RocksDB `WriteBufferManager` to every opened database in the process. Total memtable memory — including active memtables, immutable memtables awaiting flush, and the maintain-window history used by [OptimisticTransactionDB](./storage-algorithm.md) for conflict checking — is capped at this size across the entire process.
+
+Without a `WriteBufferManager`, each column family manages its own memtable budget. For databases with many tables (column families), this can grow unbounded: every column family retains roughly `max_write_buffer_size_to_maintain` worth of recently-flushed memtables for snapshot reads and conflict detection, so a 14-table database can hold 1–2 GB of resident anonymous memory before any cap is reached.
+
+Enabling the manager bounds that growth at a single configurable limit:
+
+```yaml
+storage:
+  rocks:
+    writeBufferManagerSize: 268435456 # 256 MB total memtable budget
+```
+
+The manager affects new databases opened after it is configured; existing open databases retain whatever budget they were attached with.
+
+### `storage.rocks.writeBufferManagerCostToCache`
+
+
+
+Type: `boolean`
+
+Default: `false`
+
+When `true`, memtable memory tracked by the `WriteBufferManager` is **charged against the block cache** as pinned cache entries. The block cache and write buffers then share a single accounting pool, visible through one operational metric (`rocksdb.block-cache-usage`).
+
+This does not let the cache "shrink" to make room for writes — pinned entries cannot be evicted by LRU — but it unifies observability and bounds the combined memory footprint when `writeBufferManagerSize` is at or below `blockCacheSize`.
+
+Has no effect when `storage.rocks.writeBufferManagerSize` is `0` or when the block cache is disabled.
+
+```yaml
+storage:
+  rocks:
+    blockCacheSize: 536870912 # 512 MB
+    writeBufferManagerSize: 268435456 # 256 MB
+    writeBufferManagerCostToCache: true
+```
+
+### `storage.rocks.writeBufferManagerAllowStall`
+
+
+
+Type: `boolean`
+
+Default: `false`
+
+Controls behavior when memtable memory reaches `writeBufferManagerSize`:
+
+- `false` (soft cap) — Memtables may briefly exceed the limit. RocksDB compensates by flushing more aggressively. Writes proceed without latency spikes; total memory may temporarily overshoot during bursts.
+- `true` (hard cap) — Writes are stalled until flushes free up memory. Total memtable memory is strictly bounded; write latency can spike during bursts.
+
+Use the default (`false`) for most workloads. Enable stalling only when a strict OOM-prevention guarantee is required and the application can tolerate occasional write-latency spikes.
+
+This option is the only `WriteBufferManager` setting that can be changed at runtime — `costToCache` is fixed at first creation.
+
 ## Storage Reclamation

 `storage.reclamation` controls how Harper evicts data from caching tables (tables with [`sourcedFrom`](../resources/resource-api.md#sourcedfromresource-options)) when disk usage runs high. Reclamation does **not** affect non-caching tables — those rely on explicit deletion, TTL expiration, or [compaction](./compaction.md).

From 2b57700abf1e5dfea7d30920466cadd68ab90c41 Mon Sep 17 00:00:00 2001
From: Kris Zyp
Date: Tue, 26 May 2026 11:47:42 -0600
Subject: [PATCH 2/2] docs(storage): generalize RocksDB memory section, add cache hierarchy

Reframes the RocksDB Memory section to focus on general memory
management rather than Fabric-specific multi-tenant tuning (that context
now lives in host-manager's DESIGN.md, where the actual tuning decision
is made).

Adds a new "How RocksDB reads are cached" subsection explaining the
three-tier hierarchy:

    block cache -> OS page cache -> disk

with concrete latency expectations and the implication for sizing:
shrinking the block cache shifts hits from the decompressed in-process
cache to the kernel's dynamically-sized compressed page cache, not
directly to disk. This helps operators reason about why a smaller block
cache isn't necessarily worse on memory-constrained hosts.

Drops the "shared-host / multi-tenant" framing from the section intro
and from the blockCacheSize discussion. Softens the WriteBufferManager
rationale to talk about "many tables" generically rather than naming a
14-table customer profile.

Co-Authored-By: Claude Sonnet 4.7
---
 reference/database/storage-tuning.md | 30 ++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/reference/database/storage-tuning.md b/reference/database/storage-tuning.md
index 23789fc7..36bcc4e5 100644
--- a/reference/database/storage-tuning.md
+++ b/reference/database/storage-tuning.md
@@ -140,9 +140,19 @@ In-memory record caching of decoded records. Disable to reduce heap usage when r

 ## RocksDB Memory

-RocksDB uses two large native memory pools that Harper exposes for tuning: a shared **block cache** for hot SST blocks, and an optional **WriteBufferManager** that caps total memtable memory across every database in the process. These are RocksDB-specific — when `storage.engine` is `lmdb`, none of these options apply.
+RocksDB exposes two large native memory pools that Harper makes tunable: a shared **block cache** for hot SST blocks, and an optional **WriteBufferManager** that caps total memtable memory across every database in the process. These options apply only when `storage.engine` is `rocksdb`.

-For single-tenant deployments the defaults are appropriate. The knobs below are intended for shared-host or memory-constrained environments where multiple Harper databases coexist with other workloads and total memory must be bounded predictably.
+### How RocksDB reads are cached
+
+A read of a record that isn't in the memtable goes through three tiers before reaching disk:
+
+1. **Block cache** (in-process, decompressed) — sized by `storage.rocks.blockCacheSize`. A hit returns in roughly a microsecond with no syscall and no decompression cost.
+2. **OS page cache** (kernel, compressed SST file pages) — sized dynamically by the kernel from whatever memory isn't claimed by the process. A block-cache miss that hits the page cache costs a `read` syscall plus decompression — still on the order of microseconds, just an order of magnitude slower than the block cache.
+3. **Disk** — if neither cache holds the page, RocksDB reads from the SST file directly.
+
+Harper uses buffered I/O, so the OS page cache is always in play. The implication for sizing: shrinking the block cache doesn't directly translate to more disk reads — it shifts hits from the block cache (decompressed) to the OS page cache (compressed). The OS page cache also adjusts dynamically to host-wide memory pressure, which the block cache does not. Reserving less memory for the block cache leaves more for the page cache and for unrelated allocations on the host.
+
+The trade-off favors a larger block cache when read latency matters and the working set fits; it favors a smaller block cache when memory pressure or noisy neighbors are the dominant concern.

 ### `storage.rocks.blockCacheSize`

@@ -152,9 +162,9 @@ Type: `number` (bytes)

 Default: 25% of constrained (cgroup) or total memory

-The shared LRU cache for decompressed SST blocks. Every RocksDB database in the process draws from this single pool — sizing it correctly is a balance between read-cache hit rate and leaving room for memtables, the heap, and OS page cache.
+The shared LRU cache for decompressed SST blocks. Every RocksDB database in the process draws from this single pool.

-The default sizes the cache to 25% of available memory, computed once at startup. This is reasonable for single-tenant servers but can be excessive in multi-tenant deployments where the cache is rarely filled and the unused capacity contributes to the process's idle-state memory floor (the cache itself does not shrink on idle — entries persist until LRU eviction or a manual `SetCapacity` change).
+The cache fills as blocks are read; it does **not** shrink on idle. Once the cache reaches its high-water mark for a workload, entries persist until LRU eviction or a manual capacity change. A long-running instance with a brief burst of activity will hold the cached blocks for the lifetime of the process.

 ```yaml
 storage:
@@ -164,11 +174,11 @@ storage:

 Lower the cache size when:

-- Multiple Harper instances or other memory-heavy processes share the host.
-- Read access patterns favor warm data far smaller than 25% of memory.
-- The instance is provisioned with a strict cgroup limit and the headroom is needed for memtables or application heap.
+- The host has limited memory headroom and the OS page cache is a meaningful second tier.
+- Read access patterns favor a warm working set far smaller than 25% of memory.
+- The instance runs under a strict cgroup limit and the headroom is needed for memtables or application heap.

-Raise it (or leave at the default) when reads dominate and the working set is large.
+Raise it (or leave at the default) when reads dominate and the working set comfortably fits at 25%.

 ### `storage.rocks.writeBufferManagerSize`

@@ -178,9 +188,9 @@ Type: `number` (bytes)

 Default: `0` (disabled)

-When set, Harper attaches a single RocksDB `WriteBufferManager` to every opened database in the process. Total memtable memory — including active memtables, immutable memtables awaiting flush, and the maintain-window history used by [OptimisticTransactionDB](./storage-algorithm.md) for conflict checking — is capped at this size across the entire process.
+When set, Harper attaches a single RocksDB `WriteBufferManager` to every opened database in the process. Total memtable memory — including active memtables, immutable memtables awaiting flush, and the maintain-window history that RocksDB's OptimisticTransactionDB retains for conflict checking — is capped at this size across the entire process.

-Without a `WriteBufferManager`, each column family manages its own memtable budget. For databases with many tables (column families), this can grow unbounded: every column family retains roughly `max_write_buffer_size_to_maintain` worth of recently-flushed memtables for snapshot reads and conflict detection, so a 14-table database can hold 1–2 GB of resident anonymous memory before any cap is reached.
+Without a `WriteBufferManager`, each column family (table) manages its own memtable budget. The total grows with the number of column families: each one retains roughly `max_write_buffer_size_to_maintain` worth of recently-flushed memtables for snapshot reads and conflict detection. A database with many tables can accumulate hundreds of megabytes to a few gigabytes of resident anonymous memory before any cap is reached.

 Enabling the manager bounds that growth at a single configurable limit: