Poor man's parallelism: ~1.6× from 2 processes #314

First off: thanks for the work! I made an interesting observation: running 2 processes of DS4-Q2 seems to increase aggregate tok/s by about 60%. Claude made a lengthy write-up below:



Following up on the multi-session discussion (#209) and the batching one (#275) with some empirical numbers, in case they're useful to others running DS4 on big-RAM Macs.

I'm taking the path you've already endorsed for multi-user — **run separate `ds4-server` instances and route between them** (in #209 you noted decoding is serialized and the OS page cache already helps; in #275 you closed in-process batching as ~zero-gain on a Mac). So this isn't a request to add batching — it's data on how far the multi-instance route actually goes, and where it hits a wall that I don't think is documented yet.

Setup: **M3 Ultra, 256 GB**, a tiny round-robin reverse proxy in front of N `ds4-server` processes, each with its own `DS4_LOCK_FILE` + `--kv-disk-dir`. Two quants: **Q2** (`IQ2XXS`-class, gguf ≈ 80.8 GB) and **Q4** (gguf ≈ 153 GB). Load = identical ~300-token completions fired concurrently; aggregate = total completion tokens / wall time.
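
For anyone reproducing the setup, here is a minimal sketch of the two moving parts described above: round-robin backend selection and the aggregate tok/s metric. The backend URLs and function names are hypothetical placeholders, not the actual proxy used for these runs.

```python
import itertools

# Hypothetical backend addresses; the ports are placeholders,
# not taken from the original setup.
BACKENDS = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]


def round_robin(backends):
    """Return a callable that yields the next backend on every call."""
    cycle = itertools.cycle(backends)
    return lambda: next(cycle)


def aggregate_tok_per_s(completion_tokens, wall_seconds):
    """Total completion tokens across all requests divided by wall time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return sum(completion_tokens) / wall_seconds
```

Two concurrent 300-token completions finishing in 12 s of wall time would count as `aggregate_tok_per_s([300, 300], 12.0)`, i.e. 50 tok/s.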

## Results — Q2 (80.8 GB)

| concurrency | aggregate tok/s | per-instance tok/s | scaling |
|---|---|---|---|
| 1 | ~33 | 33 | 1.00× |
| 2 | ~49–50 | ~25 each | **~1.5–1.6×** |
| 3 | — | — | thrashes (calls hang/timeout) |
| 4 | — | — | won't warm up (Metal OOM) |

Two instances give **~1.5–1.6× aggregate**. Per-instance decode drops ~24% under 2-way load (33 → ~25 tok/s), so it's *partly* memory-bandwidth-bound — not free, but a real net win. (Amusingly the same magnitude as the MTP speedup TrevorS measured in #244, via a totally different mechanism.)
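
The arithmetic behind those two figures, as a small sketch using the rounded values from the table (the function names are mine, for illustration only). With the rounded numbers the ratio lands at roughly 1.5×, the low end of the quoted range.

```python
def scaling(aggregate_tok_s, single_tok_s):
    """Aggregate throughput relative to a single instance."""
    return aggregate_tok_s / single_tok_s


def per_instance_drop(single_tok_s, loaded_tok_s):
    """Fractional slowdown of one instance when others run alongside it."""
    return 1 - loaded_tok_s / single_tok_s


# Rounded values from the Q2 table: 33 tok/s alone, ~25 tok/s each under 2-way load.
two_way_scaling = scaling(2 * 25, 33)
two_way_drop = per_instance_drop(33, 25)
```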

## The part I couldn't find documented: RAM is not the ceiling, wired residency is

The intuition "N instances of a 153 GB model = N×153 GB" is wrong here, in both directions:

- **mmap shares the weights.** The gguf is mmap'd into the page cache once and all instances read the same physical pages (log: `Metal mapped mmaped model as 1 overlapping shared buffers`). Measured: **4× Q2 loaded idle = 214 GB PhysMem total, ~3.3 GB wired**, not 4×80.8 ≈ 323 GB. RAM *capacity* is cheap. This is the same effect discussed in llama.cpp [#21223](https://github.com/ggml-org/llama.cpp/discussions/21223), where the conclusion was roughly "mmap sharing works on Apple Silicon thanks to unified memory."
- **But that's not the whole story.** The binding constraint during *active* inference is **wired / GPU-resident working set**, which is largely per-process and *not* shared:
  - **Q2:** 2 active engines fit fine (~1.6×). 3 over-commit and thrash. A 4th won't warm up — `Metal model warmup failed: Insufficient Memory (kIOGPUCommandBufferCallbackErrorOutOfMemory)`.
  - **Q4 (153 GB):** with just **one** engine active the box was already at **~145 GB wired / ~253 GB used / ~1.6 GB free**. A second engine's warmup OOMs instantly. So 2× Q4 is infeasible on 256 GB — *not* because of RAM (mmap shares the 153 GB) but because the per-engine wired footprint during inference is too large to host two.
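
One way to watch the wired figure while engines warm up is to parse the output of macOS `vm_stat`, which reports a page size in its header and a `Pages wired down` count. This is a sketch under that assumption, not necessarily how the numbers above were collected; the sample text is illustrative, not a measurement.

```python
import re


def wired_bytes(vm_stat_output: str) -> int:
    """Convert the 'Pages wired down' count from vm_stat text into bytes."""
    page_size = re.search(r"page size of (\d+) bytes", vm_stat_output)
    wired = re.search(r"Pages wired down:\s+(\d+)\.", vm_stat_output)
    if page_size is None or wired is None:
        raise ValueError("unrecognized vm_stat output")
    return int(page_size.group(1)) * int(wired.group(1))


# Illustrative sample only; the page counts are made up.
SAMPLE = """Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                              100000.
Pages wired down:                          1000.
"""
```

On a Mac, the live value would come from something like `wired_bytes(subprocess.check_output(["vm_stat"], text=True))`, sampled before and after each engine starts serving.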

So on Apple Silicon unified memory, mmap-sharing makes the page-cache cost ~1× the model, but it does **not** make you immune to a memory ceiling — the wired residency per active process still bites. The useful planning number isn't gguf-size × N; it's **wired footprint during active inference**. On 256 GB that works out to "2 active Q2 engines, or 1 Q4."
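
The planning rule can be written down as a rough sketch. Both function names are hypothetical, and treating total RAM as the wired budget is an optimistic assumption (the real ceiling is likely lower), so read the result as an upper bound.

```python
import math


def naive_estimate_gb(gguf_gb, n_instances):
    """The misleading estimate: model size multiplied by instance count."""
    return gguf_gb * n_instances


def max_active_engines(wired_budget_gb, wired_per_engine_gb):
    """Upper bound on concurrently active engines given a wired-memory budget."""
    if wired_per_engine_gb <= 0:
        raise ValueError("wired_per_engine_gb must be positive")
    return math.floor(wired_budget_gb / wired_per_engine_gb)


# Q4 from the measurements above: ~145 GB wired with one active engine on a 256 GB box.
q4_engines = max_active_engines(256, 145)
```

For Q4 this gives 1, matching what was observed. The per-engine wired footprint for Q2 was not measured separately here, so it would need to be plugged in from your own readings.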

(Tangentially, this lines up with your note in #244 that swapping mmap for HBM device buffers "created pressure" and bad logits — mmap being safer is the same wired-pressure phenomenon seen from the other side.)

## Why I'm posting

Mostly as a data point for the next person sizing this on a 256/512 GB Mac: the multi-instance route gets ~1.6× for the small quant and **nothing** for the big quant, and the limit is wired residency, not total RAM — which I think sharpens (and partly corrects) the optimistic "unified memory solves multi-instance" reading. Happy to re-run with other params, ctx sizes, or quants if useful.

