From 4dac51bf3035fd079c08794dbc19598182a13a27 Mon Sep 17 00:00:00 2001
From: Kris Zyp <kriszyp@gmail.com>
Date: Tue, 26 May 2026 07:19:38 -0600
Subject: [PATCH 1/2] docs(storage): document storage.rocks.* memory config
 options
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds documentation for four new storage configuration parameters
introduced in v5.1.0:

- storage.rocks.blockCacheSize — explicit override for the shared
  RocksDB block cache size
- storage.rocks.writeBufferManagerSize — process-wide cap on
  memtable memory across all databases
- storage.rocks.writeBufferManagerCostToCache — share memtable
  accounting with the block cache for unified observability
- storage.rocks.writeBufferManagerAllowStall — hard cap vs soft
  cap behavior when the WriteBufferManager limit is reached

- Adds bullet-list entries to reference/configuration/options.md
  under the existing `storage` section.
- Adds a new "RocksDB Memory" detail section to
  reference/database/storage-tuning.md between "Read & Write
  Behavior" and "Storage Reclamation", with workload-recipe-style
  guidance for when to lower the cache or enable the manager.

Documents the multi-tenant shared-host scenario as the primary use
case for tuning these knobs.

Source PR: HarperFast/harper#780

Co-Authored-By: Claude Sonnet 4.7 <noreply@anthropic.com>
---
 reference/configuration/options.md   |  4 ++
 reference/database/storage-tuning.md | 93 ++++++++++++++++++++++++++++
 2 files changed, 97 insertions(+)
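
Reviewer note (not part of the commit message): the byte values in the YAML examples are binary megabytes, and the default block cache is documented as 25% of constrained or total memory. The sketch below reproduces that arithmetic in Python. The helper names `mib`, `default_block_cache_size`, and `estimated_native_budget` are hypothetical and are not part of Harper or RocksDB; the combined-budget rule is only a rough reading of the cost-to-cache description in this patch and ignores soft-cap overshoot.

```python
MIB = 1024 * 1024  # bytes in one binary megabyte


def mib(n: int) -> int:
    """Convert binary megabytes to the byte count used by storage.rocks.* options."""
    return n * MIB


def default_block_cache_size(memory_limit_bytes: int) -> int:
    """Documented default: 25% of constrained (cgroup) or total memory."""
    return memory_limit_bytes // 4


def estimated_native_budget(block_cache: int, write_buffer: int,
                            cost_to_cache: bool) -> int:
    """Rough upper bound on block cache plus memtable memory (an assumption).

    With cost-to-cache enabled and the write buffer limit at or below the
    cache size, memtables are charged against the cache, so the cache size
    bounds both. Otherwise the two pools are simply added.
    """
    if cost_to_cache and 0 < write_buffer <= block_cache:
        return block_cache
    return block_cache + write_buffer


print(mib(256))                                        # 268435456
print(mib(512))                                        # 536870912
print(default_block_cache_size(4 * 1024 * MIB))        # 1073741824
print(estimated_native_budget(mib(512), mib(256), True))   # 536870912
print(estimated_native_budget(mib(512), mib(256), False))  # 805306368
```
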

diff --git a/reference/configuration/options.md b/reference/configuration/options.md
index ad8a59a6..4b29433c 100644
--- a/reference/configuration/options.md
+++ b/reference/configuration/options.md
@@ -239,6 +239,10 @@ storage:
 - `reclamation.threshold` — Free-space ratio below which reclamation begins evicting from caching tables; _Default_: `0.4` (Added in: v4.5.0)
 - `reclamation.interval` — Free-space check interval; _Default_: `1h`
 - `reclamation.evictionFactor` — Heuristic factor for early eviction under disk pressure; _Default_: `100000`. See [Storage Tuning — Reclamation](../database/storage-tuning.md#storage-reclamation)
+- `rocks.blockCacheSize` — RocksDB shared block cache size in bytes; _Default_: 25% of constrained memory. See [Storage Tuning — RocksDB Memory](../database/storage-tuning.md#rocksdb-memory) (Added in: v5.1.0)
+- `rocks.writeBufferManagerSize` — Process-wide cap (bytes) on RocksDB memtable memory across all databases. `0` disables; _Default_: `0`. See [Storage Tuning — RocksDB Memory](../database/storage-tuning.md#rocksdb-memory) (Added in: v5.1.0)
+- `rocks.writeBufferManagerCostToCache` — Charge memtable memory against the block cache so both share a single accounting pool; _Default_: `false`. Has no effect when `writeBufferManagerSize` is `0`. (Added in: v5.1.0)
+- `rocks.writeBufferManagerAllowStall` — Stall writes when memtable memory exceeds `writeBufferManagerSize` (hard cap) instead of allowing brief overshoot with more aggressive flushing (soft cap); _Default_: `false`. (Added in: v5.1.0)
 
 ---
 
diff --git a/reference/database/storage-tuning.md b/reference/database/storage-tuning.md
index 98a30283..23789fc7 100644
--- a/reference/database/storage-tuning.md
+++ b/reference/database/storage-tuning.md
@@ -138,6 +138,99 @@ Default: `true`
 
 In-memory record caching of decoded records. Disable to reduce heap usage when records are large and unlikely to be re-read in the same process.
 
+## RocksDB Memory
+
+RocksDB uses two large native memory pools that Harper exposes for tuning: a shared **block cache** for hot SST blocks, and an optional **WriteBufferManager** that caps total memtable memory across every database in the process. These are RocksDB-specific — when `storage.engine` is `lmdb`, none of these options apply.
+
+For single-tenant deployments the defaults are appropriate. The knobs below are intended for shared-host or memory-constrained environments where multiple Harper databases coexist with other workloads and total memory must be bounded predictably.
+
+### `storage.rocks.blockCacheSize`
+
+<VersionBadge version="v5.1.0" />
+
+Type: `number` (bytes)
+
+Default: 25% of constrained (cgroup) or total memory
+
+The shared LRU cache for decompressed SST blocks. Every RocksDB database in the process draws from this single pool — sizing it correctly is a balance between read-cache hit rate and leaving room for memtables, the heap, and OS page cache.
+
+The default sizes the cache to 25% of available memory, computed once at startup. This is reasonable for single-tenant servers but can be excessive in multi-tenant deployments where the cache is rarely filled and the unused capacity contributes to the process's idle-state memory floor (the cache itself does not shrink on idle — entries persist until LRU eviction or a manual `SetCapacity` change).
+
+```yaml
+storage:
+  rocks:
+    blockCacheSize: 268435456 # 256 MiB
+```
+
+Lower the cache size when:
+
+- Multiple Harper instances or other memory-heavy processes share the host.
+- Read access patterns favor warm data far smaller than 25% of memory.
+- The instance is provisioned with a strict cgroup limit and the headroom is needed for memtables or application heap.
+
+Raise it (or leave at the default) when reads dominate and the working set is large.
+
+### `storage.rocks.writeBufferManagerSize`
+
+<VersionBadge version="v5.1.0" />
+
+Type: `number` (bytes)
+
+Default: `0` (disabled)
+
+When set, Harper attaches a single RocksDB `WriteBufferManager` to every opened database in the process. Total memtable memory — including active memtables, immutable memtables awaiting flush, and the maintain-window history used by [OptimisticTransactionDB](./storage-algorithm.md) for conflict checking — is capped at this size across the entire process.
+
+Without a `WriteBufferManager`, each column family manages its own memtable budget. For databases with many tables (column families), this can grow unbounded: every column family retains roughly `max_write_buffer_size_to_maintain` worth of recently-flushed memtables for snapshot reads and conflict detection, so a 14-table database can hold 1–2 GB of resident anonymous memory before any cap is reached.
+
+Enabling the manager bounds that growth at a single configurable limit:
+
+```yaml
+storage:
+  rocks:
+    writeBufferManagerSize: 268435456 # 256 MiB total memtable budget
+```
+
+The manager applies only to databases opened after it is configured; databases that are already open keep the memtable budget they were opened with.
+
+### `storage.rocks.writeBufferManagerCostToCache`
+
+<VersionBadge version="v5.1.0" />
+
+Type: `boolean`
+
+Default: `false`
+
+When `true`, memtable memory tracked by the `WriteBufferManager` is **charged against the block cache** as pinned cache entries. The block cache and write buffers then share a single accounting pool, visible through one operational metric (`rocksdb.block-cache-usage`).
+
+This does not let the cache "shrink" to make room for writes — pinned entries cannot be evicted by LRU — but it unifies observability and bounds the combined memory footprint when `writeBufferManagerSize` is at or below `blockCacheSize`.
+
+Has no effect when `storage.rocks.writeBufferManagerSize` is `0` or when the block cache is disabled.
+
+```yaml
+storage:
+  rocks:
+    blockCacheSize: 536870912 # 512 MiB
+    writeBufferManagerSize: 268435456 # 256 MiB
+    writeBufferManagerCostToCache: true
+```
+
+### `storage.rocks.writeBufferManagerAllowStall`
+
+<VersionBadge version="v5.1.0" />
+
+Type: `boolean`
+
+Default: `false`
+
+Controls behavior when memtable memory reaches `writeBufferManagerSize`:
+
+- `false` (soft cap) — Memtables may briefly exceed the limit. RocksDB compensates by flushing more aggressively. Writes proceed without latency spikes; total memory may temporarily overshoot during bursts.
+- `true` (hard cap) — Writes are stalled until flushes free up memory. Total memtable memory is strictly bounded; write latency can spike during bursts.
+
+Use the default (`false`) for most workloads. Enable stalling only when a strict OOM-prevention guarantee is required and the application can tolerate occasional write-latency spikes.
+
+This is the only `WriteBufferManager` setting that can be changed at runtime; `storage.rocks.writeBufferManagerCostToCache` is fixed when the manager is first created.
+
 ## Storage Reclamation
 
 `storage.reclamation` controls how Harper evicts data from caching tables (tables with [`sourcedFrom`](../resources/resource-api.md#sourcedfromresource-options)) when disk usage runs high. Reclamation does **not** affect non-caching tables — those rely on explicit deletion, TTL expiration, or [compaction](./compaction.md).

From 2b57700abf1e5dfea7d30920466cadd68ab90c41 Mon Sep 17 00:00:00 2001
From: Kris Zyp <kriszyp@gmail.com>
Date: Tue, 26 May 2026 11:47:42 -0600
Subject: [PATCH 2/2] docs(storage): generalize RocksDB memory section, add
 cache hierarchy

Reframes the RocksDB Memory section to focus on general memory
management rather than Fabric-specific multi-tenant tuning (that
context now lives in host-manager's DESIGN.md, where the actual
tuning decision is made).

Adds a new "How RocksDB reads are cached" subsection explaining
the three-tier hierarchy:

  block cache -> OS page cache -> disk

with concrete latency expectations and the implication for sizing:
shrinking the block cache shifts hits from the decompressed in-process
cache to the kernel's dynamically-sized compressed page cache, not
directly to disk. This helps operators reason about why a smaller
block cache isn't necessarily worse on memory-constrained hosts.

Drops the "shared-host / multi-tenant" framing from the section
intro and from the blockCacheSize discussion. Softens the
WriteBufferManager rationale to talk about "many tables" generically
rather than naming a 14-table customer profile.

Co-Authored-By: Claude Sonnet 4.7 <noreply@anthropic.com>
---
 reference/database/storage-tuning.md | 30 ++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)
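
Reviewer note (not part of the commit message): the cache hierarchy described in this patch can be illustrated with a toy model. This is a sketch under stated assumptions, not RocksDB's or Harper's implementation: `LruCache` and `read_block` are hypothetical names, both caches are modeled as simple LRU maps, and compression and the kernel's dynamic page cache sizing are ignored. It only shows that a block evicted from a small block cache can still be served from the second tier rather than from disk.

```python
from collections import OrderedDict


class LruCache:
    """Tiny LRU map standing in for a cache tier (toy model only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value) -> None:
        self.entries[key] = value
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


def read_block(key, block_cache: LruCache, page_cache: LruCache, disk: dict) -> str:
    """Return the name of the tier that served the read, filling caches on the way."""
    if block_cache.get(key) is not None:
        return "block_cache"
    value = page_cache.get(key)
    if value is not None:
        block_cache.put(key, value)
        return "page_cache"
    value = disk[key]
    page_cache.put(key, value)
    block_cache.put(key, value)
    return "disk"


disk = {"a": b"block-a", "b": b"block-b"}
block_cache = LruCache(capacity=1)  # deliberately small block cache
page_cache = LruCache(capacity=4)   # larger second tier

served = [read_block(k, block_cache, page_cache, disk) for k in ["a", "a", "b", "a"]]
print(served)  # ['disk', 'block_cache', 'disk', 'page_cache']
```

The last read of `a` misses the one-entry block cache but is served by the second tier, which is the behavior the new subsection describes for a shrunken block cache.
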

diff --git a/reference/database/storage-tuning.md b/reference/database/storage-tuning.md
index 23789fc7..36bcc4e5 100644
--- a/reference/database/storage-tuning.md
+++ b/reference/database/storage-tuning.md
@@ -140,9 +140,19 @@ In-memory record caching of decoded records. Disable to reduce heap usage when r
 
 ## RocksDB Memory
 
-RocksDB uses two large native memory pools that Harper exposes for tuning: a shared **block cache** for hot SST blocks, and an optional **WriteBufferManager** that caps total memtable memory across every database in the process. These are RocksDB-specific — when `storage.engine` is `lmdb`, none of these options apply.
+RocksDB exposes two large native memory pools that Harper makes tunable: a shared **block cache** for hot SST blocks, and an optional **WriteBufferManager** that caps total memtable memory across every database in the process. These options apply only when `storage.engine` is `rocksdb`.
 
-For single-tenant deployments the defaults are appropriate. The knobs below are intended for shared-host or memory-constrained environments where multiple Harper databases coexist with other workloads and total memory must be bounded predictably.
+### How RocksDB reads are cached
+
+A read of a record that isn't in the memtable checks up to three tiers, in order, with disk as the last resort:
+
+1. **Block cache** (in-process, decompressed) — sized by `storage.rocks.blockCacheSize`. A hit returns in roughly a microsecond with no syscall and no decompression cost.
+2. **OS page cache** (kernel, compressed SST file pages) — sized dynamically by the kernel from whatever memory isn't claimed by the process. A block-cache miss that hits the page cache costs a `read` syscall plus decompression — still on the order of microseconds, just an order of magnitude slower than the block cache.
+3. **Disk** — if neither cache holds the page, RocksDB reads from the SST file directly.
+
+Harper uses buffered I/O, so the OS page cache is always in play. The implication for sizing: shrinking the block cache doesn't directly translate to more disk reads — it shifts hits from the block cache (decompressed) to the OS page cache (compressed). The OS page cache also adjusts dynamically to host-wide memory pressure, which the block cache does not. Reserving less memory for the block cache leaves more for the page cache and for unrelated allocations on the host.
+
+The trade-off favors a larger block cache when read latency matters and the working set fits; it favors a smaller block cache when memory pressure or noisy neighbors are the dominant concern.
 
 ### `storage.rocks.blockCacheSize`
 
@@ -152,9 +162,9 @@ Type: `number` (bytes)
 
 Default: 25% of constrained (cgroup) or total memory
 
-The shared LRU cache for decompressed SST blocks. Every RocksDB database in the process draws from this single pool — sizing it correctly is a balance between read-cache hit rate and leaving room for memtables, the heap, and OS page cache.
+The shared LRU cache for decompressed SST blocks. Every RocksDB database in the process draws from this single pool.
 
-The default sizes the cache to 25% of available memory, computed once at startup. This is reasonable for single-tenant servers but can be excessive in multi-tenant deployments where the cache is rarely filled and the unused capacity contributes to the process's idle-state memory floor (the cache itself does not shrink on idle — entries persist until LRU eviction or a manual `SetCapacity` change).
+The cache fills as blocks are read; it does **not** shrink on idle. Once the cache reaches its high-water mark for a workload, entries persist until LRU eviction or a manual capacity change. A long-running instance that sees only a brief burst of activity therefore keeps the blocks cached during that burst until they are evicted or the process exits.
 
 ```yaml
 storage:
@@ -164,11 +174,11 @@ storage:
 
 Lower the cache size when:
 
-- Multiple Harper instances or other memory-heavy processes share the host.
-- Read access patterns favor warm data far smaller than 25% of memory.
-- The instance is provisioned with a strict cgroup limit and the headroom is needed for memtables or application heap.
+- The host has limited memory headroom and the OS page cache is a meaningful second tier.
+- Read access patterns favor a warm working set far smaller than 25% of memory.
+- The instance runs under a strict cgroup limit and the headroom is needed for memtables or application heap.
 
-Raise it (or leave at the default) when reads dominate and the working set is large.
+Leave it at the default when reads dominate and the working set fits comfortably within 25% of memory; raise it when the working set is larger and the host has memory to spare.
 
 ### `storage.rocks.writeBufferManagerSize`
 
@@ -178,9 +188,9 @@ Type: `number` (bytes)
 
 Default: `0` (disabled)
 
-When set, Harper attaches a single RocksDB `WriteBufferManager` to every opened database in the process. Total memtable memory — including active memtables, immutable memtables awaiting flush, and the maintain-window history used by [OptimisticTransactionDB](./storage-algorithm.md) for conflict checking — is capped at this size across the entire process.
+When set, Harper attaches a single RocksDB `WriteBufferManager` to every opened database in the process. Total memtable memory — including active memtables, immutable memtables awaiting flush, and the maintain-window history that RocksDB's OptimisticTransactionDB retains for conflict checking — is capped at this size across the entire process.
 
-Without a `WriteBufferManager`, each column family manages its own memtable budget. For databases with many tables (column families), this can grow unbounded: every column family retains roughly `max_write_buffer_size_to_maintain` worth of recently-flushed memtables for snapshot reads and conflict detection, so a 14-table database can hold 1–2 GB of resident anonymous memory before any cap is reached.
+Without a `WriteBufferManager`, each column family (table) manages its own memtable budget. The total grows with the number of column families: each one retains roughly `max_write_buffer_size_to_maintain` worth of recently-flushed memtables for snapshot reads and conflict detection. A database with many tables can accumulate hundreds of megabytes to a few gigabytes of resident anonymous memory before any cap is reached.
 
 Enabling the manager bounds that growth at a single configurable limit: