feat: port TQ3_0 KV cache from llama-turboquant#2
carlosfundora wants to merge 6 commits into PrismML-Eng:prism-v1
Conversation
…CUDA)

Adds two 1-bit quantization types:
- Q1_0: block size 32, ~1.5 bpw
- Q1_0_g128: block size 128, ~1.125 bpw

Backend support: CPU (x86 SSE/AVX + ARM NEON), Metal, CUDA. Kernel implementations follow Q4_0 as boilerplate, adapted for 1-bit sign-based dequantization. CUDA MMQ kernels are included but disabled (cuBLAS fallback used for prompt processing) pending accuracy debugging.

Made-with: Cursor
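A minimal sketch of the 1-bit sign-based dequantization described above (the struct name and field layout here are illustrative, not the PR's actual block definitions): each weight stores only a sign bit, and a shared per-block scale supplies the magnitude. With a block of 32 weights plus one fp16 scale that is (32 + 16) / 32 = 1.5 bits per weight, matching the ~1.5 bpw figure; at group size 128 it drops to (128 + 16) / 128 = 1.125 bpw.

```c
#include <stdint.h>

#define Q1_BLOCK 32  // assumed block size for Q1_0

// Hypothetical block layout: one scale + 1 sign bit per weight.
typedef struct {
    float   d;                    // per-block scale (fp16 in a real format; float for clarity)
    uint8_t signs[Q1_BLOCK / 8];  // 1 sign bit per weight
} block_q1_0_sketch;

// Dequantize one block: each weight becomes +d or -d depending on its sign bit.
static void dequant_q1_0_sketch(const block_q1_0_sketch *b, float *out) {
    for (int i = 0; i < Q1_BLOCK; ++i) {
        int bit = (b->signs[i / 8] >> (i % 8)) & 1;
        out[i] = bit ? -b->d : b->d;
    }
}
```

Because every weight in a block shares one magnitude, the dot-product kernels reduce to sign flips and a single multiply per block, which is why a Q4_0-style kernel skeleton adapts naturally.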
[cuda] Fix mmq/mma path
TurboQuant 3-bit (3.5 bpw) KV cache compression:
- Per-block WHT rotation with 4-centroid MSE codebook
- QJL residual signs for error correction
- GPU kernels: vec_dot, MMVQ, convert, set-rows, cpy
- CPU: quantize/dequantize with WHT butterfly transform
- Flash attention auto-disabled for TQ3_0 K cache

Combined with PrismML's Q1_0 GPU inference, this enables 1-bit weights + 3-bit KV cache in a single build.
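The "WHT butterfly transform" in the CPU path refers to the standard in-place Walsh-Hadamard butterfly; a self-contained sketch (not the PR's actual kernel):

```c
// In-place Walsh-Hadamard transform via butterfly passes.
// n must be a power of two; applying it twice scales the input by n.
static void wht_inplace(float *v, int n) {
    for (int len = 1; len < n; len <<= 1) {          // passes over strides 1, 2, 4, ...
        for (int i = 0; i < n; i += len << 1) {
            for (int j = i; j < i + len; ++j) {
                float a = v[j], b = v[j + len];
                v[j]       = a + b;                  // butterfly: sum
                v[j + len] = a - b;                  // butterfly: difference
            }
        }
    }
}
```

Rotating each block with the WHT before quantizing spreads outlier magnitudes across the whole block, which is what makes a tiny 4-centroid codebook viable at 3-bit rates.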
Thanks, this is pretty cool. How does it work? Is it good? Our main focus right now is getting our changes into llama.cpp, so we might not have time to look into the details yet, but we'd love to see the speed/quality output if you have tried it.
VRAM usage was reduced by roughly 35%.
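Back-of-envelope arithmetic (illustrative numbers, not measurements from this PR) shows why a saving of that order is plausible when the K cache moves from f16 (16 bpw) to TQ3_0 (3.5 bpw):

```c
// Bytes of KV cache needed per token across all layers, given
// bits-per-weight for the K and V caches separately.
// The model shape numbers used below are hypothetical.
static double kv_bytes_per_token(int n_layer, int n_embd_kv,
                                 double k_bpw, double v_bpw) {
    return n_layer * n_embd_kv * (k_bpw + v_bpw) / 8.0;
}
```

For a hypothetical 32-layer model with 1024 KV dims per layer, f16 K+V costs 128 KiB per token, while TQ3_0 K + f16 V costs 78 KiB, a roughly 39% KV cache reduction; the overall VRAM saving depends on how much of total memory the KV cache occupies next to the weights.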
Oh, how does it work with SGLang for 1-bit? Was it easy to add support there?
So-so; patience and preparation were key. I also crafted a few agents to run methodical research and debugging at a very detailed and slow scale during smoke tests. It ran well for chat in SGLang; I benchmarked it and moved on to implementing P-EAGLE, so I've been training heads for the models all day. I haven't yet tried it on any coding tasks, but I'm excited to see how they do. If you guys would be kind enough to release a 1-2 B or 0.3-0.6 B range 1-bit quant, that would be amazing and make it much easier for me to rapidly create PRs for advanced speculative decoding architectures. 🧠🤌
@carlosfundora Sounds exciting, yeah, good ideas. Let's chat more on the Discord server next week (I think you were there, right?)
Excellent! Will this work on Apple Silicon? If yes, I can report back with memory footprint improvements.
Yes, I commented there today. Looking forward to it.
Pull request overview
This PR introduces a new GGML quantization type (GGML_TYPE_TQ3_0) intended for TurboQuant-style KV-cache compression, wiring it through GGML core traits, CUDA/HIP paths, and CLI/tooling so it can be selected as a KV cache type (tested on ROCm gfx1030 per description).
Changes:
- Add `GGML_TYPE_TQ3_0` type definition/traits and (de)quantization hooks in GGML core + CPU integration.
- Add CUDA kernels/support for writing (`SET_ROWS`) and using (`MMVQ` vecdot) TQ3_0 KV cache blocks.
- Expose `tq3_0` via CLI and bench tooling; disable flash attention when `type_k == TQ3_0`.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| tools/llama-bench/llama-bench.cpp | Adds tq3_0 string → type mapping for benchmarking. |
| src/llama-context.cpp | Forces flash-attention off when using TQ3_0 K-cache. |
| ggml/src/ggml.c | Registers TQ3_0 type traits and enables chunk quantization dispatch. |
| ggml/src/ggml-quants.h | Adds TQ3_0 quantize/dequantize API declarations. |
| ggml/src/ggml-quants.c | Implements TQ3_0 reference quantize + dequantize + quantize wrapper and row validation. |
| ggml/src/ggml-cuda/vecdotq.cuh | Adds fused TQ3_0×Q8_1 vecdot for MMVQ. |
| ggml/src/ggml-cuda/set-rows.cu | Enables SET_ROWS into TQ3_0 buffers (KV updates). |
| ggml/src/ggml-cuda/mmvq.cu | Wires TQ3_0 into MMVQ type switches. |
| ggml/src/ggml-cuda/ggml-cuda.cu | Marks additional ops/types as CUDA-supported (incl. TQ3_0). |
| ggml/src/ggml-cuda/cpy-utils.cuh | Adds device quantization helper for TQ3_0 blocks. |
| ggml/src/ggml-cuda/convert.cu | Adds CUDA dequantization kernel for TQ3_0 → fp16/fp32. |
| ggml/src/ggml-cuda/common.cuh | Adds CUDA type-traits for TQ3_0 (qk/qr/qi). |
| ggml/src/ggml-cpu/quants.h | Declares CPU quantize entrypoint for TQ3_0. |
| ggml/src/ggml-cpu/quants.c | Implements CPU quantize wrapper calling reference quantizer. |
| ggml/src/ggml-cpu/ops.cpp | Allows TQ3_0 through quantized op switch cases. |
| ggml/src/ggml-cpu/ggml-cpu.cpp | Tightens CPU op support checks for MUL_MAT and FLASH_ATTN_EXT. |
| ggml/src/ggml-cpu/ggml-cpu.c | Registers TQ3_0 CPU type-traits (from_float). |
| ggml/src/ggml-common.h | Defines block_tq3_0 layout and constants. |
| ggml/include/ggml.h | Adds GGML_TYPE_TQ3_0 to the public enum/API. |
| common/arg.cpp | Exposes tq3_0 as an allowed KV cache type via CLI. |
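One way a block layout in `ggml-common.h` could reach exactly 3.5 bpw with a 4-centroid codebook plus residual signs (this struct is a hypothetical sketch; the PR's real `block_tq3_0` may differ in field order, block size, and padding):

```c
#include <stdint.h>

#define TQ3_BLOCK 32   // assumed block size

// Hypothetical TQ3_0 block consistent with the stated 3.5 bpw.
typedef struct {
    uint16_t d;                    // per-block scale, fp16 stored as raw bits
    uint8_t  qs[TQ3_BLOCK / 4];    // 2-bit centroid index per weight (4 centroids)
    uint8_t  signs[TQ3_BLOCK / 8]; // 1 QJL residual sign bit per weight
} block_tq3_0_sketch;

// 2 + 8 + 4 = 14 bytes = 112 bits for 32 weights -> 3.5 bits per weight.
```

Under these assumptions the rate decomposes as 2 bits for the centroid index, 1 bit for the residual sign, and 0.5 bits of per-block scale overhead.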
Good news, our first CPU PR just got merged into the llama.cpp master branch. If you are still working on this, please rebase onto PrismML's master (it just pulled the main llama.cpp). Changes: the Q1_0_g128 naming is gone now; the original Q1_0 with group size 32 was deleted, and Q1_0_g128 was renamed to Q1_0, which now has group size 128 by default. https://github.com/PrismML-Eng/llama.cpp/tree/master This one only has a generic CPU path (slow) and an ARM NEON path; we are planning to gather the best x86 kernels from here and send a PR there (and tag all the contributors).
Now a few more of our backends are merged into main llama.cpp.
@carlosfundora Do you still want to merge this? There is some outdated stuff too; for example, we removed Q1_0_g128 since they did not like the name, and removed the old Q1_0 (group size 32). Now Q1_0 (group size 128) is the only new quantization that is added (and many backends are already merged into llama.cpp). Which part does the TurboQuant-related things? I am curious to try, but I need to rebase this and fix some conflicts.

TurboQuant 3-bit (3.5 bpw) KV cache compression combined with PrismML's Q1_0 GPU inference. Ported the TQ3_0 implementation from llama-turboquant, tested on ROCm gfx1030.