feat: port TQ3_0 KV cache from llama-turboquant#2
carlosfundora wants to merge 6 commits into PrismML-Eng:prism-v1
Conversation
…CUDA)

Adds two 1-bit quantization types:
- Q1_0: block size 32, ~1.5 bpw
- Q1_0_g128: block size 128, ~1.125 bpw

Backend support: CPU (x86 SSE/AVX + ARM NEON), Metal, CUDA. Kernel implementations follow Q4_0 as boilerplate, adapted for 1-bit sign-based dequantization. CUDA MMQ kernels are included but disabled (cuBLAS fallback used for prompt processing) pending accuracy debugging.

Made-with: Cursor
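A minimal sketch of the 1-bit sign-based dequantization described above (the struct name and field layout here are illustrative, not the PR's actual block definitions): each weight stores only a sign bit, and a shared per-block scale supplies the magnitude. With a block of 32 weights plus one fp16 scale that is (32 + 16) / 32 = 1.5 bits per weight, matching the ~1.5 bpw figure; at group size 128 it drops to (128 + 16) / 128 = 1.125 bpw.

```c
#include <stdint.h>

#define Q1_BLOCK 32  // assumed block size for Q1_0

// Hypothetical block layout: one scale + 1 sign bit per weight.
typedef struct {
    float   d;                    // per-block scale (fp16 in a real format; float for clarity)
    uint8_t signs[Q1_BLOCK / 8];  // 1 sign bit per weight
} block_q1_0_sketch;

// Dequantize one block: each weight becomes +d or -d depending on its sign bit.
static void dequant_q1_0_sketch(const block_q1_0_sketch *b, float *out) {
    for (int i = 0; i < Q1_BLOCK; ++i) {
        int bit = (b->signs[i / 8] >> (i % 8)) & 1;
        out[i] = bit ? -b->d : b->d;
    }
}
```

Because every weight in a block shares one magnitude, the dot-product kernels reduce to sign flips and a single multiply per block, which is why a Q4_0-style kernel skeleton adapts naturally.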
[cuda] Fix mmq/mma path
TurboQuant 3-bit (3.5 bpw) KV cache compression:
- Per-block WHT rotation with 4-centroid MSE codebook
- QJL residual signs for error correction
- GPU kernels: vec_dot, MMVQ, convert, set-rows, cpy
- CPU: quantize/dequantize with WHT butterfly transform
- Flash attention auto-disabled for TQ3_0 K cache

Combined with PrismML's Q1_0 GPU inference, this enables 1-bit weights + 3-bit KV cache in a single build.
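The "WHT butterfly transform" in the CPU path refers to the standard in-place Walsh-Hadamard butterfly; a self-contained sketch (not the PR's actual kernel):

```c
// In-place Walsh-Hadamard transform via butterfly passes.
// n must be a power of two; applying it twice scales the input by n.
static void wht_inplace(float *v, int n) {
    for (int len = 1; len < n; len <<= 1) {          // passes over strides 1, 2, 4, ...
        for (int i = 0; i < n; i += len << 1) {
            for (int j = i; j < i + len; ++j) {
                float a = v[j], b = v[j + len];
                v[j]       = a + b;                  // butterfly: sum
                v[j + len] = a - b;                  // butterfly: difference
            }
        }
    }
}
```

Rotating each block with the WHT before quantizing spreads outlier magnitudes across the whole block, which is what makes a tiny 4-centroid codebook viable at 3-bit rates.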
Thanks, this is pretty cool. How does it work? Is it good? Our main focus right now is getting our changes into llama.cpp, so we might not have time to look into the details yet, but we'd love to see the speed/quality output if you have tried it.
VRAM usage was reduced by roughly 35%.
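Back-of-envelope arithmetic (illustrative numbers, not measurements from this PR) shows why a saving of that order is plausible when the K cache moves from f16 (16 bpw) to TQ3_0 (3.5 bpw):

```c
// Bytes of KV cache needed per token across all layers, given
// bits-per-weight for the K and V caches separately.
// The model shape numbers used below are hypothetical.
static double kv_bytes_per_token(int n_layer, int n_embd_kv,
                                 double k_bpw, double v_bpw) {
    return n_layer * n_embd_kv * (k_bpw + v_bpw) / 8.0;
}
```

For a hypothetical 32-layer model with 1024 KV dims per layer, f16 K+V costs 128 KiB per token, while TQ3_0 K + f16 V costs 78 KiB, a roughly 39% KV cache reduction; the overall VRAM saving depends on how much of total memory the KV cache occupies next to the weights.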
Oh, how does it work with SGLang for 1-bit? Was it easy to add support there?
So-so; patience and preparation were key. I also crafted a few agents to run methodical research and debugging at a very detailed and slow scale during smoke tests. It ran well for chat in SGLang; I benchmarked it and moved on to implementing P-EAGLE, so I've been training heads for the models all day. I haven't yet tried it on any coding tasks, but I'm excited to see how they do. If you guys would be kind enough to release a 1-2 B or 0.3-0.6 B range 1-bit quant, that would be amazing and make it much easier for me to rapidly create PRs for advanced speculative decoding architectures. 🧠🤌
@carlosfundora Sounds exciting, yeah, good ideas. Let's chat more on the Discord server next week (I think you were there, right?)
Excellent! Will this work on Apple Silicon? If yes, I can report back with memory footprint improvements.
Yes, I commented there today. Looking forward to it.
Pull request overview
This PR introduces a new GGML quantization type (GGML_TYPE_TQ3_0) intended for TurboQuant-style KV-cache compression, wiring it through GGML core traits, CUDA/HIP paths, and CLI/tooling so it can be selected as a KV cache type (tested on ROCm gfx1030 per description).
Changes:
- Add `GGML_TYPE_TQ3_0` type definition/traits and (de)quantization hooks in GGML core + CPU integration.
- Add CUDA kernels/support for writing (`SET_ROWS`) and using (`MMVQ` vecdot) TQ3_0 KV cache blocks.
- Expose `tq3_0` via CLI and bench tooling; disable flash attention when `type_k == TQ3_0`.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| tools/llama-bench/llama-bench.cpp | Adds tq3_0 string → type mapping for benchmarking. |
| src/llama-context.cpp | Forces flash-attention off when using TQ3_0 K-cache. |
| ggml/src/ggml.c | Registers TQ3_0 type traits and enables chunk quantization dispatch. |
| ggml/src/ggml-quants.h | Adds TQ3_0 quantize/dequantize API declarations. |
| ggml/src/ggml-quants.c | Implements TQ3_0 reference quantize + dequantize + quantize wrapper and row validation. |
| ggml/src/ggml-cuda/vecdotq.cuh | Adds fused TQ3_0×Q8_1 vecdot for MMVQ. |
| ggml/src/ggml-cuda/set-rows.cu | Enables SET_ROWS into TQ3_0 buffers (KV updates). |
| ggml/src/ggml-cuda/mmvq.cu | Wires TQ3_0 into MMVQ type switches. |
| ggml/src/ggml-cuda/ggml-cuda.cu | Marks additional ops/types as CUDA-supported (incl. TQ3_0). |
| ggml/src/ggml-cuda/cpy-utils.cuh | Adds device quantization helper for TQ3_0 blocks. |
| ggml/src/ggml-cuda/convert.cu | Adds CUDA dequantization kernel for TQ3_0 → fp16/fp32. |
| ggml/src/ggml-cuda/common.cuh | Adds CUDA type-traits for TQ3_0 (qk/qr/qi). |
| ggml/src/ggml-cpu/quants.h | Declares CPU quantize entrypoint for TQ3_0. |
| ggml/src/ggml-cpu/quants.c | Implements CPU quantize wrapper calling reference quantizer. |
| ggml/src/ggml-cpu/ops.cpp | Allows TQ3_0 through quantized op switch cases. |
| ggml/src/ggml-cpu/ggml-cpu.cpp | Tightens CPU op support checks for MUL_MAT and FLASH_ATTN_EXT. |
| ggml/src/ggml-cpu/ggml-cpu.c | Registers TQ3_0 CPU type-traits (from_float). |
| ggml/src/ggml-common.h | Defines block_tq3_0 layout and constants. |
| ggml/include/ggml.h | Adds GGML_TYPE_TQ3_0 to the public enum/API. |
| common/arg.cpp | Exposes tq3_0 as an allowed KV cache type via CLI. |
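One way a block layout in `ggml-common.h` could reach exactly 3.5 bpw with a 4-centroid codebook plus residual signs (this struct is a hypothetical sketch; the PR's real `block_tq3_0` may differ in field order, block size, and padding):

```c
#include <stdint.h>

#define TQ3_BLOCK 32   // assumed block size

// Hypothetical TQ3_0 block consistent with the stated 3.5 bpw.
typedef struct {
    uint16_t d;                    // per-block scale, fp16 stored as raw bits
    uint8_t  qs[TQ3_BLOCK / 4];    // 2-bit centroid index per weight (4 centroids)
    uint8_t  signs[TQ3_BLOCK / 8]; // 1 QJL residual sign bit per weight
} block_tq3_0_sketch;

// 2 + 8 + 4 = 14 bytes = 112 bits for 32 weights -> 3.5 bits per weight.
```

Under these assumptions the rate decomposes as 2 bits for the centroid index, 1 bit for the residual sign, and 0.5 bits of per-block scale overhead.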
Good news, our first CPU PR just got merged into the llama.cpp master branch. If you are still working on this, please rebase onto PrismML's master (it just pulled the main llama.cpp). Changes: the Q1_0_g128 naming is gone now; the original Q1_0 with group size 32 was deleted, and Q1_0_g128 was renamed to Q1_0, which now has group size 128 by default. https://github.com/PrismML-Eng/llama.cpp/tree/master This one only has a generic CPU path (slow) and an ARM NEON path; we are planning to gather the best x86 kernels from here and send a PR there (and tag all the contributors).
Now a few more of our backends are merged into main llama.cpp.
@carlosfundora Do you still want to merge this? There is some outdated stuff too; for example, we removed Q1_0_g128 since they did not like the name, and removed the old Q1_0 (group size 32). Now Q1_0 (group size 128) is the only new quantization that is added (and many backends are already merged into llama.cpp). Which part does the TurboQuant-related things? I am curious to try, but I need to rebase this and fix some conflicts.

TurboQuant 3-bit (3.5 bpw) KV cache compression combined with PrismML's Q1_0 GPU inference. Ported the TQ3_0 implementation from llama-turboquant, tested on ROCm gfx1030.