feat: port TQ3_0 KV cache from llama-turboquant#2

Open
carlosfundora wants to merge 6 commits into PrismML-Eng:prism-v1 from carlosfundora:feature/tq3_0-kv-cache

Conversation

@carlosfundora

TurboQuant 3-bit (3.5 bpw) KV-cache compression combined with PrismML's Q1_0 GPU inference. Ported the TQ3_0 implementation from llama-turboquant; tested on ROCm gfx1030.

khosravipasha and others added 6 commits March 2, 2026 10:49
…CUDA)

Adds two 1-bit quantization types:
- Q1_0: block size 32, ~1.5 bpw
- Q1_0_g128: block size 128, ~1.125 bpw

Backend support: CPU (x86 SSE/AVX + ARM NEON), Metal, CUDA.
Kernel implementations use the Q4_0 kernels as a template, adapted
for 1-bit sign-based dequantization.

CUDA MMQ kernels included but disabled (cuBLAS fallback used for
prompt processing) pending accuracy debugging.
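For intuition, 1-bit sign-based dequantization of the kind described above can be sketched as follows (the struct and function names are illustrative, not the actual ggml definitions):

```c
#include <assert.h>
#include <stdint.h>

// Illustrative 1-bit block: 32 weights stored as one sign bit each plus
// a single scale. With an fp16 scale this is (16 + 32) / 32 = 1.5 bits
// per weight, matching the ~1.5 bpw quoted for Q1_0 above.
typedef struct {
    float    d;     // per-block scale (fp16 in a real ggml block)
    uint32_t signs; // one sign bit per weight, LSB = weight 0
} block_q1_0_sketch;

// Dequantize: every weight is +d or -d depending on its sign bit.
static void dequantize_q1_0_sketch(const block_q1_0_sketch *b, float out[32]) {
    for (int i = 0; i < 32; ++i) {
        out[i] = ((b->signs >> i) & 1) ? -b->d : b->d;
    }
}
```

A group size of 128 (as in Q1_0_g128) amortizes the same scale over four times as many weights, which is where the ~1.125 bpw figure comes from.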

Made-with: Cursor
[cuda] Fix mmq/mma path
TurboQuant 3-bit (3.5 bpw) KV cache compression:
- Per-block WHT rotation with 4-centroid MSE codebook
- QJL residual signs for error correction
- GPU kernels: vec_dot, MMVQ, convert, set-rows, cpy
- CPU: quantize/dequantize with WHT butterfly transform
- Flash attention auto-disabled for TQ3_0 K cache

Combined with PrismML's Q1_0 GPU inference, this enables
1-bit weights + 3-bit KV cache on a single build.
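The per-block WHT rotation mentioned above is a standard Walsh-Hadamard butterfly; a minimal CPU sketch (illustrative only — the actual kernels fuse this with the codebook quantization) looks like:

```c
#include <assert.h>

// In-place unnormalized Walsh-Hadamard transform over a power-of-two
// length n. Applying it twice yields n times the original vector, so
// dequantization can undo the rotation with the same butterfly plus a
// 1/n rescale.
static void wht_inplace(float *v, int n) {
    for (int len = 1; len < n; len <<= 1) {
        for (int i = 0; i < n; i += len << 1) {
            for (int j = i; j < i + len; ++j) {
                const float a = v[j];
                const float b = v[j + len];
                v[j]       = a + b;
                v[j + len] = a - b;
            }
        }
    }
}
```

Rotating each block before quantization spreads outlier values across the block, which is what makes a tiny 4-centroid codebook viable.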
@khosravipasha
Collaborator

Thanks, this is pretty cool. How does it work? Is it good?

Our main focus right now is getting our changes into llama.cpp, so we might not have time to look into the details yet, but we'd love to see the speed/quality output if you have tried it.
What is the VRAM usage with long context after this change?

@carlosfundora
Author

(screenshot attached)

It works great. I have SGLang nearly wired up for 1-bit support and TurboQuant as well.

@carlosfundora
Author

VRAM usage was reduced by roughly 35%.
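For context, the KV cache itself shrinks by roughly 16 / 3.5 ≈ 4.6x versus an f16 cache; the overall ~35% figure depends on how much VRAM the weights occupy. A back-of-envelope helper (all model dimensions below are illustrative assumptions, not measurements from this PR):

```c
#include <assert.h>

// Rough KV-cache size in GiB: K and V caches, one value per layer per
// context position per KV-head dimension, at a given bits per value.
static double kv_cache_gib(int n_layer, int n_ctx, int n_kv_head,
                           int head_dim, double bits_per_value) {
    const double n_values =
        2.0 * n_layer * (double) n_ctx * n_kv_head * head_dim;
    return n_values * bits_per_value / 8.0 / (1024.0 * 1024.0 * 1024.0);
}
```

For example, a hypothetical 32-layer model with 8 KV heads of dimension 128 at 32k context needs 4.0 GiB of KV cache at f16 but only ~0.875 GiB at 3.5 bpw.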

@khosravipasha
Collaborator

Oh, how does it work with SGLang for 1-bit? Was it easy to add support there?

@carlosfundora
Author

carlosfundora commented Apr 4, 2026

So-so; patience and preparation were key. I also crafted a few agents to run methodical research and debugging at a very detailed and slow scale during smoke tests. It ran well for chat in SGLang; I benchmarked it and moved on to implementing P-EAGLE, so I've been training heads for the models all day. I haven't yet tried it on any coding tasks, but I'm excited to see how they do.

If you guys would be kind enough to release a 1-2 B or 0.3-0.6 B range 1-bit quant, that would be amazing and make it much easier for me to rapidly create PRs for advanced speculative decoding architectures. 🧠🤌

@khosravipasha
Collaborator

@carlosfundora Sounds exciting, yeah, good ideas. Let's chat more on the Discord server next week (I think you were there, right?)

@rosmur

rosmur commented Apr 5, 2026

Excellent! Will this work on Apple Silicon? If yes, I can report back with the memory footprint improvement.

@carlosfundora
Author

@carlosfundora Sounds exciting, yeah, good ideas. Let's chat more on the Discord server next week (I think you were there, right?)

Yes, I commented on there today. Looking forward to it.


Copilot AI left a comment


Pull request overview

This PR introduces a new GGML quantization type (GGML_TYPE_TQ3_0) intended for TurboQuant-style KV-cache compression, wiring it through GGML core traits, CUDA/HIP paths, and CLI/tooling so it can be selected as a KV cache type (tested on ROCm gfx1030 per description).

Changes:

  • Add GGML_TYPE_TQ3_0 type definition/traits and (de)quantization hooks in GGML core + CPU integration.
  • Add CUDA kernels/support for writing (SET_ROWS) and using (MMVQ vecdot) TQ3_0 KV cache blocks.
  • Expose tq3_0 via CLI and bench tooling; disable flash-attention when type_k == TQ3_0.
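Assuming the PR wires tq3_0 into the standard llama.cpp cache-type flags (the flag spelling here follows upstream llama.cpp and is an assumption about this branch), selecting it would look something like:

```shell
# Illustrative invocation: choose TQ3_0 for the K cache. Per the PR,
# flash attention is then forced off automatically for the K cache.
./llama-cli -m model.gguf --cache-type-k tq3_0 -p "Hello"
```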

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 10 comments.

File Description
tools/llama-bench/llama-bench.cpp Adds tq3_0 string → type mapping for benchmarking.
src/llama-context.cpp Forces flash-attention off when using TQ3_0 K-cache.
ggml/src/ggml.c Registers TQ3_0 type traits and enables chunk quantization dispatch.
ggml/src/ggml-quants.h Adds TQ3_0 quantize/dequantize API declarations.
ggml/src/ggml-quants.c Implements TQ3_0 reference quantize + dequantize + quantize wrapper and row validation.
ggml/src/ggml-cuda/vecdotq.cuh Adds fused TQ3_0×Q8_1 vecdot for MMVQ.
ggml/src/ggml-cuda/set-rows.cu Enables SET_ROWS into TQ3_0 buffers (KV updates).
ggml/src/ggml-cuda/mmvq.cu Wires TQ3_0 into MMVQ type switches.
ggml/src/ggml-cuda/ggml-cuda.cu Marks additional ops/types as CUDA-supported (incl. TQ3_0).
ggml/src/ggml-cuda/cpy-utils.cuh Adds device quantization helper for TQ3_0 blocks.
ggml/src/ggml-cuda/convert.cu Adds CUDA dequantization kernel for TQ3_0 → fp16/fp32.
ggml/src/ggml-cuda/common.cuh Adds CUDA type-traits for TQ3_0 (qk/qr/qi).
ggml/src/ggml-cpu/quants.h Declares CPU quantize entrypoint for TQ3_0.
ggml/src/ggml-cpu/quants.c Implements CPU quantize wrapper calling reference quantizer.
ggml/src/ggml-cpu/ops.cpp Allows TQ3_0 through quantized op switch cases.
ggml/src/ggml-cpu/ggml-cpu.cpp Tightens CPU op support checks for MUL_MAT and FLASH_ATTN_EXT.
ggml/src/ggml-cpu/ggml-cpu.c Registers TQ3_0 CPU type-traits (from_float).
ggml/src/ggml-common.h Defines block_tq3_0 layout and constants.
ggml/include/ggml.h Adds GGML_TYPE_TQ3_0 to the public enum/API.
common/arg.cpp Exposes tq3_0 as an allowed KV cache type via CLI.


Comment thread ggml/src/ggml-cuda/vecdotq.cuh
Comment thread ggml/src/ggml-quants.c
Comment thread ggml/src/ggml-quants.c
Comment thread ggml/src/ggml-common.h
Comment thread ggml/include/ggml.h
Comment thread ggml/src/ggml-cpu/ggml-cpu.c
Comment thread common/arg.cpp
Comment thread ggml/src/ggml-cuda/convert.cu
Comment thread ggml/src/ggml-cuda/cpy-utils.cuh
Comment thread ggml/src/ggml-cuda/vecdotq.cuh
@khosravipasha
Collaborator

Good news: our first CPU PR just got merged into the llama.cpp master branch. If you are still working on this, please rebase onto PrismML's master (it just pulled the main llama.cpp).

Changes: the Q1_0_g128 name is gone now; the original Q1_0 (group size 32) was deleted, and Q1_0_g128 was renamed to Q1_0, which now defaults to group size 128.

https://github.com/PrismML-Eng/llama.cpp/tree/master

This one only has the generic CPU path (slow) and the ARM NEON path; we're planning to gather the best x86 kernels from here and send a PR there (and tag all the contributors).

@khosravipasha
Collaborator

Now a few more of our backends are merged into main llama.cpp.
I have more free time to help with testing now, to see how it affects evals etc.
Also, I remember you mentioned an SGLang option.
You might want to rebase this from the master branch (the generic CPU, Metal, and Vulkan backends are already in the master branch; CUDA and x86 are pending).

@khosravipasha
Collaborator

@carlosfundora
Just revamped the prism branch by grabbing the llama.cpp master and applying our two pending PRs (x86 and CUDA backend).

Do you still want to merge this? There is some outdated stuff too; for example, we removed Q1_0_g128 since they did not like the name, and removed the old Q1_0 (group size 32). Now Q1_0 (group size 128) is the only new quantization that is added (and many backends are already merged into llama.cpp).

Which part does the TurboQuant-related things? I am curious to try it, but I need to rebase this and fix some conflicts.



4 participants