--ssd-streaming broken for Flash q4-imatrix: forward pass hits model ranges not covered by mapped views (q2-imatrix works) #341

@ngt1999

--ssd-streaming works on Flash q2-imatrix but fails on Flash q4-imatrix (M5 Max, latest main)

Same machine and flags; only the model differs.

q2-imatrix: runs fine (coherent output), prefill 3.54 / gen 13.19 t/s.
q4-imatrix: fails immediately with:
non-routed weights: 8.20 GiB, routed expert 13.50 MiB, cached 5902 (77.81 GiB)
initial model map restricted to token embedding (0.99 GiB)
Metal model range 0.01..3.39 GiB is not covered by mapped model views
prompt processing failed: metal prefill failed

Same failure with --ssd-streaming-cold and --ssd-streaming-cache-experts 16GB.
The streaming path itself works (q2 runs), so this looks specific to the q4
tensor layout (Q4_K experts + F16 indexer/compressor/HC); the map never covers
the low-offset (0..3.4 GiB) non-routed F16 tensors for q4.
Is q4/Flash intended to be supported by --ssd-streaming, or is it PRO/q4-layout WIP?
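
To illustrate what the "not covered by mapped model views" message seems to imply, here is a minimal sketch of a range-coverage check. It is only an illustration under assumptions, not the project's actual code: the function name `uncovered_ranges`, the (offset, length) view representation, and the view offset used in the example are all hypothetical. Only the 0.01..3.39 GiB range comes from the log above.

```python
GIB = 1024 ** 3

def uncovered_ranges(start, end, views):
    """Return the parts of [start, end) not covered by any view.

    `views` is a list of (offset, length) byte ranges. This representation
    is an assumption made for illustration only.
    """
    gaps = []
    cursor = start  # first byte not yet known to be covered
    for off, length in sorted(views):
        view_end = off + length
        if view_end <= cursor:
            continue  # view lies entirely before the uncovered part
        if off >= end:
            break  # remaining views start past the requested range
        if off > cursor:
            gaps.append((cursor, min(off, end)))  # hole before this view
        cursor = max(cursor, view_end)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))  # tail of the range is unmapped
    return gaps

# Range reported in the q4 log; the single view below is hypothetical
# (one ~0.99 GiB view placed at an assumed 4 GiB offset).
needed = (int(0.01 * GIB), int(3.39 * GIB))
views = [(4 * GIB, int(0.99 * GIB))]
gaps = uncovered_ranges(needed[0], needed[1], views)
if gaps:
    print("range not covered by mapped views:", gaps)
```

With a single view that sits above the requested range, the whole 0.01..3.39 GiB span comes back as one gap, which would match the symptom described here if the initial map for q4 only contains the token embedding.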
