Users report that models "support 128K context" but OOM on consumer hardware. The culprit is the KV cache, which grows linearly with context length and often exceeds the model size itself.
KV compression makes 32K context possible on an 8GB Mac, where FP32 KV would OOM at ~16K.
Model: Llama 3.2 3B Instruct Q8_0 (~3.2 GB) Hardware: Apple M1 Pro, 16 GB RAM (8 threads, CPU-only, no Metal) Date: 2026-04-09
| KV Type | K per token | V per token | K+V per token | vs FP32 |
|---|---|---|---|---|
| FP32 | 112.00 KB | 112.00 KB | 224.00 KB | 1.00x |
| turbo_kv_4b | 15.75 KB | 56.00 KB | 71.75 KB | 3.12x |
| Context | FP32 KV total | turbo_kv_4b total | 8GB Mac | 16GB Mac |
|---|---|---|---|---|
| 4K | 4.1 GB | 3.5 GB | Both OK | Both OK |
| 8K | 5.0 GB | 3.8 GB | Both OK | Both OK |
| 16K | 6.7 GB | 4.3 GB | FP32 borderline | Both OK |
| 32K | 10.4 GB | 5.5 GB | FP32 OOM / compressed OK | Both OK |
| 64K | 17.5 GB | 7.7 GB | Both OOM | FP32 OOM / compressed OK |
| 128K | 31.7 GB | 12.3 GB | Both OOM | FP32 OOM / compressed OK |
| KV Type | tok/s | vs FP32 |
|---|---|---|
| FP32 | 6.9 | baseline |
| turbo_kv_4b | 7.8 | +13% |
KV compression is not just smaller — it's faster. The NEON vqtbl1q_s8
table-lookup attention kernel processes compressed K blocks in fewer
instructions than the FP32 multiply-accumulate chain.
Both KV types produce coherent, grammatical output at 32K context:
- FP32: "In the year 2154, in a world where robots had surpassed human intelligence, a lone robot named Assistant stood tall among the bustling streets of New Tokyo."
- turbo_kv_4b: "The robots rise to the top of the hill, their metallic bodies glinting in the sunlight as they march in lockstep towards their destination."
Formal PPL comparison at 3B scale is documented in the main benchmark table (turbo_kv_4b: +3.8% PPL vs FP32 on 957-token eval).
"Your 8GB Mac just got 32K context."
KV compression extends usable context 2-4x on the same hardware, with no speed penalty (actually +13% faster) and minimal quality loss (+3.8% PPL).
# 32K context with KV compression
build/quant models/Llama-3.2-3B-Instruct-Q8_0.gguf \
--ctx 32768 -k turbo_kv_4b -j 8 -p "Tell me a story" -n 50 -c -M
# Compare: 32K context without compression
build/quant models/Llama-3.2-3B-Instruct-Q8_0.gguf \
--ctx 32768 -k fp32 -j 8 -p "Tell me a story" -n 50 -c -MPython (v0.9.2+):
from quantcpp import Model
m = Model("Llama-3.2-3B-Q8_0.gguf", context_length=32768) # KV compression ON by default
print(m.ask("Tell me a long story about a robot"))