IsaacCheng9 · IsaacCheng9 · May 10, 2026 · May 10, 2026 · May 10, 2026 · May 10, 2026
diff --git a/README.md b/README.md
@@ -2,11 +2,11 @@
 
 [![Test](https://github.com/IsaacCheng9/kv-engine/actions/workflows/test.yml/badge.svg)](https://github.com/IsaacCheng9/kv-engine/actions/workflows/test.yml)
 
-A C++23 LSM-tree key-value store with crash recovery and a gRPC API
-supporting point operations and server-streaming range scans.
+A C++23 LSM-tree key-value store with crash recovery and a gRPC API supporting
+point operations and server-streaming range scans.
 
-Modelled after LevelDB and RocksDB, with the LSM-tree design from O'Neil
-et al. (1996).
+Modelled after LevelDB and RocksDB, with the LSM-tree design from O'Neil et al.
+(1996).
 
 ## Key Features
 
@@ -23,13 +23,12 @@ et al. (1996).
 - **SSTable reader cache** – parsed readers stay resident for each file's
   lifetime and serve concurrent `get()` callers via positioned reads,
   eliminating per-lookup open and index-parse cost
-- **Per-SSTable Bloom filter** – probabilistic membership test consulted
-  before the binary search on `get()`, short-circuiting lookups for keys
-  guaranteed not to be in the file (no false negatives, ~1% false positive
-  rate)
+- **Per-SSTable Bloom filter** – probabilistic membership test consulted before
+  the binary search on `get()`, short-circuiting lookups for keys guaranteed not
+  to be in the file (no false negatives, ~1% false positive rate)
 - **Key range pruning** – cached min/max keys let `get()` skip SSTables whose
-  key range cannot contain the lookup key, avoiding the Bloom check and
-  binary search entirely
+  key range cannot contain the lookup key, avoiding the Bloom check and binary
+  search entirely
 - **gRPC API** – `Put` / `Get` / `Delete` as unary RPCs and `Scan` as
   server-streaming, with snapshot semantics isolating in-flight scans from
   concurrent writes, flushes, and compactions
@@ -38,6 +37,20 @@ et al. (1996).
 
 - Raft consensus for distributed replication across multiple nodes
 
+## Performance
+
+Measured on M1 Max in Release build. Full numbers in
+[`docs/2026_05_05_grpc_with_scan_baseline.txt`](docs/2026_05_05_grpc_with_scan_baseline.txt).
+
+| Workload            |     Throughput | Latency (p50) | Notes                                                            |
+| ------------------- | -------------: | ------------: | ---------------------------------------------------------------- |
+| Memtable read       |   2.6M ops/sec |       0.33 µs | Hot in-memory path                                               |
+| SSTable read        |   114k ops/sec |       8.54 µs | Cached reader + Bloom filter + range pruning                     |
+| Negative lookup     |    73k ops/sec |      13.58 µs | All read-path optimisations short-circuit                        |
+| Write (`put`)       |    16k ops/sec |         42 µs | `fsync`-bound on the WAL                                         |
+| gRPC unary read     |   7.3k ops/sec |       ~130 µs | Loopback overhead vs direct in-process call                      |
+| gRPC streaming scan | ~117k rows/sec |   ~8.5 µs/row | ~15x amortisation vs unary (HTTP/2 framing paid once per stream) |
+
 ## Architecture
 
 ```mermaid
@@ -46,10 +59,13 @@ flowchart TD
     grpc -->|engine API| engine[Engine]
     engine -->|writes| wal[Write-Ahead Log]
     engine -->|writes / reads| memtable[Memtable<br/>sorted in-memory]
+    engine -->|scan creates| snapshot[Scan Snapshot<br/>memtable copy + SSTable readers]
     memtable -->|flush when full| l0[L0 SSTables<br/>overlapping key ranges]
     l0 -->|background compaction| l1[L1 SSTables<br/>merged, deduplicated]
-    engine -.reads.-> l0
-    engine -.reads.-> l1
+    engine -.point reads.-> l0
+    engine -.point reads.-> l1
+    snapshot -.range reads.-> l0
+    snapshot -.range reads.-> l1
     wal -.replay on startup.-> memtable
 ```
 
@@ -145,17 +161,12 @@ the scan don't change what it yields. Tombstones are collapsed and shadowed
 older versions of a key are discarded; the caller sees only the newest live
 value per key in `[start_key, end_key)` order.
 
-### Performance
-
-On loopback (no real network RTT), gRPC adds ~130 µs round-trip vs direct
-in-process engine calls – HTTP/2 framing + protobuf serialise/deserialise +
-kernel TCP loopback. See the `grpc_*` rows in
-`docs/2026_05_05_grpc_with_scan_baseline.txt` for full numbers.
+### Why Server-Streaming for `Scan`
 
-Streaming RPCs amortise that overhead: `grpc_scan` measures ~8.5 µs per row vs
-~130 µs per unary call. Server-streaming pays the HTTP/2 framing cost once per
-stream rather than once per row, so the per-operation gRPC tax shrinks ~15x
-for range queries. This is the argument for using server-streaming `Scan` over
+Streaming RPCs amortise gRPC overhead: `grpc_scan` measures **~8.5 µs per row vs
+~130 µs per unary call.** Server-streaming pays the HTTP/2 framing cost once per
+stream rather than once per row, so the **per-operation gRPC tax shrinks ~15x
+for range queries.** This is the argument for using server-streaming `Scan` over
 a cursor-based unary API for `Scan`-shaped workloads.
 
 ## Benchmarks