--ssd-streaming works on Flash q2-imatrix but fails on Flash q4-imatrix (M5 Max, latest main)
Same machine and flags; only the model differs.
q2-imatrix: runs fine (coherent output), prefill 3.54 / gen 13.19 t/s.
q4-imatrix: fails immediately with:
non-routed weights: 8.20 GiB, routed expert 13.50 MiB, cached 5902 (77.81 GiB)
initial model map restricted to token embedding (0.99 GiB)
Metal model range 0.01..3.39 GiB is not covered by mapped model views
prompt processing failed: metal prefill failed
Same failure with --ssd-streaming-cold and --ssd-streaming-cache-experts 16GB.
The streaming path itself works (q2 runs), so this looks specific to the q4
tensor layout (Q4_K experts + F16 indexer/compressor/HC); the map never covers
the low-offset (0..3.4 GiB) non-routed F16 tensors for q4.
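
To make the suspected failure concrete, here is a minimal Python sketch of an interval-coverage check of the kind the error message seems to describe. This is not the project's code: the `uncovered` helper is hypothetical, and placing the token-embedding view at offset 0 is purely an assumption for illustration. Only the 0.01..3.39 GiB range and the 0.99 GiB view size come from the log above.

```python
GIB = 1024 ** 3

def uncovered(lo, hi, views):
    """Return the sub-ranges of [lo, hi) not covered by any (start, end) view.

    Hypothetical helper: views are half-open byte ranges into the model file.
    """
    gaps = []
    cursor = lo  # first byte not yet known to be covered
    for start, end in sorted(views):
        if end <= cursor:
            continue  # view lies entirely before the uncovered region
        if start >= hi:
            break  # view (and all later ones) start past the requested range
        if start > cursor:
            gaps.append((cursor, start))  # hole between cursor and this view
        cursor = max(cursor, end)
        if cursor >= hi:
            break
    if cursor < hi:
        gaps.append((cursor, hi))  # tail of the range is unmapped
    return gaps

# Assumed layout: a single 0.99 GiB view at offset 0 (the token embedding).
views = [(0, int(0.99 * GIB))]
# Range taken from the error message: 0.01..3.39 GiB.
missing = uncovered(int(0.01 * GIB), int(3.39 * GIB), views)
for start, end in missing:
    print(f"unmapped: {start / GIB:.2f}..{end / GIB:.2f} GiB")
```

Under that assumed layout, any non-routed tensor placed beyond the embedding view would fail the check, which would match the q4 behaviour if its F16 tensors sit at low offsets outside the initial map.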
Is q4/Flash intended to be supported by --ssd-streaming, or is it PRO/q4-layout WIP?