Distributed CUDA worker host-registers the whole GGUF, not just its layer slice → Q4 OOM on 128GB DGX Spark

## Summary
A distributed CUDA worker host-registers the entire GGUF mapping for device
access instead of only its assigned --layers slice. A Q4 worker therefore has to
register ~150GB and fails on a 128GB machine, even though its slice is only
~78 GiB. Q2 (~81 GiB whole file, 41 GiB slice) fits and runs fine.

## Setup
- Coordinator: MacBook Pro M5 Max 128GB (Metal), --layers 0:20
- Worker: NVIDIA DGX Spark (GB10, sm_121) 128GB (CUDA), --layers 21:output
- 5GbE link; route forms fine ("complete route ready")
- Same commit on both; q4-imatrix / q2-imatrix from download_model.sh

## Q4 (fails)
ds4: restricting cuda model map to layers 21:output (48 spans, 78.21 GiB tensor span)
ds4: CUDA host registration skipped: out of memory
ds4: cuda backend initialized for graph diagnostics
(coordinator) prompt processing failed: cuda layer-slice failed

## Q2 (works)
ds4: restricting cuda model map to layers 21:output (48 spans, 41.08 GiB tensor span)
ds4: CUDA registered 80.76 GiB model mapping for device access
prefill: 9.36 t/s, generation: 16.59 t/s

## Likely cause
In the Q2 run, "registered 80.76 GiB" ≈ the whole Q2 file, while the slice
reported one line earlier is only 41.08 GiB. So registration appears to cover
the full mmap, not the slice. For Q4 the whole file (~150GB) exceeds the 128GB
worker, so cudaHostRegister fails with out of memory and the slice can't run.
Registering only the slice's pages (78.21 GiB per the log) would fit and let Q4
run distributed, which is the whole point of the layer split. GB10 has coherent
unified memory, so full host pinning may be avoidable on this platform entirely.
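
A minimal sketch of what slice-only registration could look like, assuming the worker already has the slice's tensor byte ranges (the "48 spans" in the log). `span_t`, `spans_page_coalesce` and `spans_total` are hypothetical names for illustration, not ds4 code. The idea is to round each span out to page boundaries, merge neighbours, and register only the merged ranges rather than the whole mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical: one tensor's byte range inside the mmap'd GGUF file. */
typedef struct {
    uint64_t off; /* start offset in bytes from the mapping base */
    uint64_t len; /* length in bytes */
} span_t;

static int span_cmp(const void *a, const void *b) {
    const span_t *x = a, *y = b;
    return (x->off > y->off) - (x->off < y->off);
}

/*
 * Round every span out to page boundaries, sort by offset, and merge spans
 * that overlap or touch. Rewrites `s` in place and returns the new count.
 */
static size_t spans_page_coalesce(span_t *s, size_t n, uint64_t page) {
    assert(page > 0);
    if (n == 0) return 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t start = s[i].off / page * page;
        uint64_t end = (s[i].off + s[i].len + page - 1) / page * page;
        s[i].off = start;
        s[i].len = end - start;
    }

    qsort(s, n, sizeof *s, span_cmp);

    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        uint64_t cur_end = s[out].off + s[out].len;
        if (s[i].off <= cur_end) {
            /* Overlapping or adjacent: extend the current merged span. */
            uint64_t end = s[i].off + s[i].len;
            if (end > cur_end) s[out].len = end - s[out].off;
        } else {
            s[++out] = s[i];
        }
    }
    return out + 1;
}

/* Total bytes that would be pinned after coalescing. */
static uint64_t spans_total(const span_t *s, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) total += s[i].len;
    return total;
}

/*
 * Assumed usage on the worker (not compiled here, needs the CUDA runtime):
 *
 *   n = spans_page_coalesce(spans, n, (uint64_t)sysconf(_SC_PAGESIZE));
 *   for (i = 0; i < n; i++)
 *       cudaHostRegister((char *)map_base + spans[i].off, spans[i].len,
 *                        cudaHostRegisterDefault);
 *
 * A read-only file mapping may need cudaHostRegisterReadOnly instead.
 */
```

With the Q4 numbers above, `spans_total` should come out close to the 78.21 GiB tensor span plus page rounding, which is well under 128GB. Whether the GB10 needs any registration at all is a separate question that this sketch does not answer.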

Distributed CUDA worker host-registers the whole GGUF, not just its layer slice → Q4 OOM on 128GB DGX Spark #293
