Poor man's parallelism: ~1.6× from 2 processes #314

@tdamsma

First off: thanks for the work! I made an interesting observation: running 2 processes of DS4-Q2 seems to increase aggregate tok/s by roughly 50–60%. Claude made a lengthy write-up below:

Following up on the multi-session discussion (#209) and the batching one (#275) with some empirical numbers, in case they're useful to others running DS4 on big-RAM Macs.

I'm taking the path you've already endorsed for multi-user — run separate ds4-server instances and route between them (in #209 you noted decoding is serialized and the OS page cache already helps; in #275 you closed in-process batching as ~zero-gain on a Mac). So this isn't a request to add batching — it's data on how far the multi-instance route actually goes, and where it hits a wall that I don't think is documented yet.

Setup: M3 Ultra, 256 GB, a tiny round-robin reverse proxy in front of N ds4-server processes, each with its own DS4_LOCK_FILE + --kv-disk-dir. Two quants: Q2 (IQ2XXS-class, gguf ≈ 80.8 GB) and Q4 (gguf ≈ 153 GB). Load = identical ~300-token completions fired concurrently; aggregate = total completion tokens / wall time.
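To make the measurement concrete, here is a minimal Python sketch of the two pieces described in the setup: round-robin selection across backends, and the aggregate metric (total completion tokens divided by wall time). It is not the actual proxy used for these numbers. The function names, the port numbers, and the 12-second wall time are all hypothetical and only illustrate the arithmetic.

```python
from itertools import cycle


def make_round_robin(backends):
    """Return a callable that hands out backends in round-robin order."""
    if not backends:
        raise ValueError("need at least one backend")
    ring = cycle(backends)
    return lambda: next(ring)


def aggregate_tok_per_s(completion_tokens, wall_seconds):
    """Total completion tokens across all requests divided by wall time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return sum(completion_tokens) / wall_seconds


# Hypothetical local ports for two ds4-server processes.
pick = make_round_robin(["http://127.0.0.1:8001", "http://127.0.0.1:8002"])
assignments = [pick() for _ in range(4)]

# Illustrative numbers only: two concurrent 300-token completions in 12 s.
rate = aggregate_tok_per_s([300, 300], 12.0)
print(assignments)
print(f"{rate:.1f} tok/s aggregate")
```

Measuring wall time around the whole concurrent batch, rather than summing per-request rates, is what makes the figure an aggregate throughput.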

Results — Q2 (80.8 GB)

| concurrency | aggregate tok/s | per-instance | scaling |
|---|---|---|---|
| 1 | ~33 | 33 | 1.00× |
| 2 | ~49–50 | ~25 each | ~1.5–1.6× |
| 3 | thrashes (calls hang/timeout) | | |
| 4 | won't warm up (Metal OOM) | | |

Two instances give ~1.5–1.6× aggregate. Per-instance decode drops ~24% under 2-way load (33 → ~25 tok/s), so it's partly memory-bandwidth-bound — not free, but a real net win. (Amusingly the same magnitude as the MTP speedup TrevorS measured in #244, via a totally different mechanism.)
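For anyone who wants to redo the arithmetic with their own measurements, a small sketch follows. The helper names are hypothetical and the inputs are the rounded figures from the Q2 results. From those rounded values the scaling comes out near 1.5×, at the lower end of the quoted range, so treat the exact ratio as approximate.

```python
def scaling(aggregate_n, aggregate_1):
    """Aggregate throughput with N instances relative to a single instance."""
    return aggregate_n / aggregate_1


def per_instance_drop(single, loaded):
    """Fractional slowdown of one instance when others run alongside it."""
    return (single - loaded) / single


single = 33.0               # tok/s, one instance
low, high = 49.0, 50.0      # tok/s aggregate, two instances
per_instance_loaded = 25.0  # tok/s per instance under 2-way load

print(f"scaling: {scaling(low, single):.2f}x to {scaling(high, single):.2f}x")
print(f"per-instance drop: {per_instance_drop(single, per_instance_loaded):.1%}")
```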

The part I couldn't find documented: RAM is not the ceiling, wired residency is

The intuition "N instances of a 153 GB model = N×153 GB" is wrong here, in both directions:

  • mmap shares the weights. The gguf is mmap'd into the page cache once and all instances read the same physical pages (log: Metal mapped mmaped model as 1 overlapping shared buffers). Measured: 4× Q2 loaded idle = 214 GB PhysMem total, ~3.3 GB wired, not 4 × 80.8 ≈ 323 GB. RAM capacity is cheap. This is the same effect discussed in llama.cpp #21223, where the conclusion was roughly "mmap sharing works on Apple Silicon thanks to unified memory."
  • But that's not the whole story. The binding constraint during active inference is wired / GPU-resident working set, which is largely per-process and not shared:
    • Q2: 2 active engines fit fine (~1.6×). 3 over-commit and thrash. A 4th won't warm up — Metal model warmup failed: Insufficient Memory (kIOGPUCommandBufferCallbackErrorOutOfMemory).
    • Q4 (153 GB): with just one engine active the box was already at ~145 GB wired / ~253 GB used / ~1.6 GB free. A second engine's warmup OOMs instantly. So 2× Q4 is infeasible on 256 GB — not because of RAM (mmap shares the 153 GB) but because the per-engine wired footprint during inference is too large to host two.

So on Apple Silicon unified memory, mmap-sharing makes the page-cache cost ~1× the model, but it does not make you immune to a memory ceiling — the wired residency per active process still bites. The useful planning number isn't gguf-size × N; it's wired footprint during active inference. On 256 GB that works out to "2 active Q2 engines, or 1 Q4."
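As a rough planning aid, the contrast between the naive estimate and the wired-footprint estimate can be sketched as below. Everything here is an assumption for illustration: the function names are hypothetical, and the wired budget is taken to be the full 256 GB, an upper bound that the real usable limit will sit below. Only the Q4 per-engine figure (~145 GB wired) comes from the measurements above. The Q2 per-engine wired footprint was not measured separately, so it is left out.

```python
def naive_ram_estimate_gb(gguf_gb, n_instances):
    """The 'gguf-size x N' intuition, which ignores mmap sharing."""
    return gguf_gb * n_instances


def max_active_engines(wired_budget_gb, per_engine_wired_gb):
    """How many engines fit if each needs its own wired working set."""
    if per_engine_wired_gb <= 0:
        raise ValueError("per_engine_wired_gb must be positive")
    return int(wired_budget_gb // per_engine_wired_gb)


# Naive estimate for 4x Q2; the measured idle total was 214 GB instead.
print(f"naive 4x Q2: {naive_ram_estimate_gb(80.8, 4):.1f} GB")

# Q4: ~145 GB wired per active engine; the 256 GB budget is an assumed upper bound.
print(f"active Q4 engines: {max_active_engines(256, 145)}")
```

The point of the second function is that the divisor is the measured wired footprint during active inference, not the gguf size.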

(Tangentially, this lines up with your note in #244 that swapping mmap for HBM device buffers "created pressure" and bad logits — mmap being safer is the same wired-pressure phenomenon seen from the other side.)

Why I'm posting

Mostly as a data point for the next person sizing this on a 256/512 GB Mac: the multi-instance route gets ~1.6× for the small quant and nothing for the big quant, and the limit is wired residency, not total RAM — which I think sharpens (and partly corrects) the optimistic "unified memory solves multi-instance" reading. Happy to re-run with other params, ctx sizes, or quants if useful.
