Summary
A distributed CUDA worker host-registers the ENTIRE gguf mapping for device
access instead of only its assigned --layers slice. A Q4 worker therefore needs
~150GB and fails on a 128GB machine, even though its slice is only ~78GB.
Q2 (~81GB whole file) fits and runs fine.
Setup
- Coordinator: MacBook Pro M5 Max 128GB (Metal), --layers 0:20
- Worker: NVIDIA DGX Spark (GB10, sm_121) 128GB (CUDA), --layers 21:output
- 5GbE link; route forms fine ("complete route ready")
- Same commit on both; q4-imatrix / q2-imatrix from download_model.sh
Q4 (fails)
ds4: restricting cuda model map to layers 21:output (48 spans, 78.21 GiB tensor span)
ds4: CUDA host registration skipped: out of memory
ds4: cuda backend initialized for graph diagnostics
(coordinator) prompt processing failed: cuda layer-slice failed
Q2 (works)
ds4: restricting cuda model map to layers 21:output (48 spans, 41.08 GiB tensor span)
ds4: CUDA registered 80.76 GiB model mapping for device access
prefill: 9.36 t/s, generation: 16.59 t/s
Likely cause
"registered 80.76 GiB" matches the whole Q2 file (~81GB) and is roughly twice
the 41.08 GiB slice, so registration appears to cover the full mmap rather than
the slice. For Q4 the whole file (~150GB) exceeds the 128GB worker, so
cudaHostRegister fails with out of memory and the slice can't run.
Registering only the slice's pages (~78 GiB) should fit and let Q4 run
distributed, which is the whole point of the layer split. GB10 has coherent
unified memory, so host pinning may be avoidable on this platform entirely.
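
A rough sketch of the slice-only idea, not the project's actual code: given the tensor spans of the assigned layers as (offset, length) pairs within the mapping, round each out to page boundaries, merge the overlaps, and register only those ranges (for example one cudaHostRegister call per merged range). The function names, the example spans, and the 4096-byte page size are all assumptions for illustration; the real page size should be queried on the worker.

```python
PAGE = 4096  # assumed page size; query the real value on the target system


def page_aligned_ranges(spans, page=PAGE):
    """Round (offset, length) spans out to page boundaries and merge overlaps.

    Returns a sorted list of (offset, length) ranges covering every span.
    """
    aligned = sorted(
        (off // page * page, -(-(off + length) // page) * page)
        for off, length in spans
        if length > 0
    )
    merged = []
    for start, end in aligned:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(start, end - start) for start, end in merged]


def total_bytes(ranges):
    """Total number of bytes that would be registered for these ranges."""
    return sum(length for _, length in ranges)


# Hypothetical spans, only to show the shape of the result.
example_spans = [(100, 5000), (6000, 3000), (1 << 20, 10)]
example_ranges = page_aligned_ranges(example_spans)
```

With ranges computed this way, the amount registered tracks the slice's tensor span plus at most one page of rounding at each end of a merged range, instead of the size of the whole file.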