First off: thanks for the work! I made an interesting observation: running 2 processes of DS4-Q2 seems to increase aggregated tok/s by 60%. Claude made a lengthy write-up below:
Following up on the multi-session discussion (#209) and the batching one (#275) with some empirical numbers, in case they're useful to others running DS4 on big-RAM Macs.
I'm taking the path you've already endorsed for multi-user — run separate ds4-server instances and route between them (in #209 you noted decoding is serialized and the OS page cache already helps; in #275 you closed in-process batching as ~zero-gain on a Mac). So this isn't a request to add batching — it's data on how far the multi-instance route actually goes, and where it hits a wall that I don't think is documented yet.
Setup: M3 Ultra, 256 GB, a tiny round-robin reverse proxy in front of N ds4-server processes, each with its own DS4_LOCK_FILE + --kv-disk-dir. Two quants: Q2 (IQ2XXS-class, gguf ≈ 80.8 GB) and Q4 (gguf ≈ 153 GB). Load = identical ~300-token completions fired concurrently; aggregate = total completion tokens / wall time.
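For anyone wanting to reproduce the measurement, here is a minimal sketch of the load driver and the aggregate metric as defined above (total completion tokens / wall time). It is not the actual proxy or harness used for these numbers: `run_load` and the `send` callback are hypothetical names, and `send` stands in for whatever HTTP call you make to a ds4-server instance.

```python
import itertools
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(backends, n_requests, send):
    """Fire n_requests concurrently, assigned round-robin across backends.

    `send(backend)` is a caller-supplied function (hypothetical here) that
    performs one completion against `backend` and returns the number of
    completion tokens it produced.

    Returns (total_tokens, wall_seconds, aggregate_tok_per_s).
    """
    # Round-robin assignment: backend 0, 1, ..., N-1, 0, 1, ...
    targets = list(itertools.islice(itertools.cycle(backends), n_requests))

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_requests) as pool:
        token_counts = list(pool.map(send, targets))
    wall = time.perf_counter() - start

    total = sum(token_counts)
    # Aggregate throughput = all completion tokens / wall time of the batch.
    return total, wall, total / wall
```

Because the wall clock covers the whole concurrent batch, a slow or hung instance drags the aggregate down, which is the behaviour you want when comparing 1 vs. 2 instances.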
Results — Q2 (80.8 GB)
| concurrency | aggregate tok/s | per-instance tok/s | scaling |
|---|---|---|---|
| 1 | ~33 | 33 | 1.00× |
| 2 | ~49–50 | ~25 each | ~1.5–1.6× |
| 3 | n/a | n/a | thrashes (calls hang/timeout) |
| 4 | n/a | n/a | won't warm up (Metal OOM) |
Two instances give ~1.5–1.6× aggregate. Per-instance decode drops ~24% under 2-way load (33 → ~25 tok/s), so it's partly memory-bandwidth-bound — not free, but a real net win. (Amusingly the same magnitude as the MTP speedup TrevorS measured in #244, via a totally different mechanism.)
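The arithmetic behind those figures, as a small sketch that takes the rounded table values at face value. `scaling_summary` is a hypothetical helper name, and "efficiency" (scaling divided by instance count) is my own convenience metric, not something measured separately.

```python
def scaling_summary(single_tok_s, aggregate_tok_s, n_instances):
    """Derive scaling figures from one single-instance and one N-instance run."""
    per_instance = aggregate_tok_s / n_instances
    return {
        # Aggregate throughput relative to one instance running alone.
        "scaling": aggregate_tok_s / single_tok_s,
        # Decode speed each instance sees under N-way load.
        "per_instance_tok_s": per_instance,
        # Fractional slowdown of each instance vs. running alone.
        "per_instance_drop": 1 - per_instance / single_tok_s,
        # 1.0 would mean perfectly linear scaling.
        "efficiency": (aggregate_tok_s / single_tok_s) / n_instances,
    }


# Table values: ~33 tok/s alone, ~50 tok/s aggregate with 2 instances.
summary = scaling_summary(single_tok_s=33, aggregate_tok_s=50, n_instances=2)
print(summary)
```

With these inputs the per-instance drop comes out at roughly 24%, matching the 33 → ~25 tok/s observation.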
The part I couldn't find documented: RAM is not the ceiling, wired residency is
The intuition "N instances of a 153 GB model = N×153 GB" is wrong here, in both directions:
- mmap shares the weights. The gguf is mmap'd into the page cache once and all instances read the same physical pages (log: Metal mapped mmaped model as 1 overlapping shared buffers). Measured: 4× Q2 loaded idle = 214 GB PhysMem total, ~3.3 GB wired, not 4×80.8 ≈ 323 GB. RAM capacity is cheap. This is the same effect discussed in llama.cpp #21223, where the conclusion was roughly "mmap sharing works on Apple Silicon thanks to unified memory."
- But that's not the whole story. The binding constraint during active inference is wired / GPU-resident working set, which is largely per-process and not shared:
  - Q2: 2 active engines fit fine (~1.6×). 3 over-commit and thrash. A 4th won't warm up: Metal model warmup failed: Insufficient Memory (kIOGPUCommandBufferCallbackErrorOutOfMemory).
  - Q4 (153 GB): with just one engine active the box was already at ~145 GB wired / ~253 GB used / ~1.6 GB free. A second engine's warmup OOMs instantly. So 2× Q4 is infeasible on 256 GB: not because of RAM (mmap shares the 153 GB) but because the per-engine wired footprint during inference is too large to host two.
So on Apple Silicon unified memory, mmap-sharing makes the page-cache cost ~1× the model, but it does not make you immune to a memory ceiling — the wired residency per active process still bites. The useful planning number isn't gguf-size × N; it's wired footprint during active inference. On 256 GB that works out to "2 active Q2 engines, or 1 Q4."
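To make that planning number concrete, here is a rough sketch of one way to sample wired memory and turn it into an engine count. Assumptions, loudly: it reads wired pages from the macOS `vm_stat` tool (not necessarily how the figures above were collected), reports decimal GB, and the helper names are hypothetical. The wired budget is left as an input because the usable GPU-wired limit is lower than total RAM and I have not measured it; the per-engine wired footprint for Q2 was also not isolated above, so only the Q4 figure (~145 GB) is used in the example.

```python
import re
import subprocess


def wired_gb_from_vm_stat(text):
    """Parse `vm_stat` output and return wired memory in decimal GB."""
    page_size = int(re.search(r"page size of (\d+) bytes", text).group(1))
    wired_pages = int(re.search(r"Pages wired down:\s+(\d+)\.", text).group(1))
    return wired_pages * page_size / 1e9


def read_vm_stat():
    """Run vm_stat (macOS only) and return its stdout."""
    return subprocess.run(
        ["vm_stat"], capture_output=True, text=True, check=True
    ).stdout


def max_active_engines(wired_budget_gb, per_engine_wired_gb):
    """How many concurrently active engines fit in a given wired budget.

    Both arguments are things you have to measure on your own box while
    inference is running; gguf size is deliberately not an input.
    """
    return int(wired_budget_gb // per_engine_wired_gb)


# Q4 on the 256 GB box: ~145 GB wired with one engine active.
# Using all 256 GB as the budget is an optimistic upper bound.
print(max_active_engines(wired_budget_gb=256, per_engine_wired_gb=145))
```

Sampling `wired_gb_from_vm_stat(read_vm_stat())` once idle and once mid-generation gives the per-engine delta to plug in for other quants or context sizes.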
(Tangentially, this lines up with your note in #244 that swapping mmap for HBM device buffers "created pressure" and bad logits — mmap being safer is the same wired-pressure phenomenon seen from the other side.)
Why I'm posting
Mostly as a data point for the next person sizing this on a 256/512 GB Mac: the multi-instance route gets ~1.6× for the small quant and nothing for the big quant, and the limit is wired residency, not total RAM — which I think sharpens (and partly corrects) the optimistic "unified memory solves multi-instance" reading. Happy to re-run with other params, ctx sizes, or quants if useful.