## Introduction
TurboQuant (ICLR 2026) brings data-oblivious compression to KV caches and embedding vectors, achieving ~3.5 bits per value with provably near-optimal geometry preservation. Unlike codebook-based approaches (GPTQ, AWQ), TurboQuant requires no training, no dataset-specific tuning, and works online — compressing vectors as they arrive.
This issue tracks the full implementation of TurboQuant within the ruvLLM engine, from the core compressor through KV cache integration, embedding store, and production-grade optimizations.
## Why TurboQuant?
| Property | TurboQuant | GPTQ/AWQ | PiQ3 (existing) |
|---|---|---|---|
| Training required | No | Yes (calibration set) | No |
| Online compression | Yes | No (batch) | Yes |
| Geometry preservation | Provable (2.7× optimal) | Empirical | Empirical |
| KV cache compatible | Native | Retrofit | Not designed for |
| Memory reduction | ~6× vs FP16 | ~4× vs FP16 | ~5× vs FP16 |
| Attention speedup | Up to 8× | Limited | N/A |
## Algorithm Overview

TurboQuant is a two-stage pipeline:

1. **PolarQuant**: random Hadamard rotation followed by scalar quantization per coordinate.
   - The rotation makes dimensions approximately independent (Beta-distributed).
   - This enables optimal per-coordinate scalar quantization without codebooks.
2. **QJL Residual**: 1-bit Quantized Johnson-Lindenstrauss projection on the residual.
   - Corrects quantization error with just 1 extra bit per dimension.
   - Produces an unbiased inner product estimator.
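To make the two stages concrete, here is a minimal, self-contained sketch of the pipeline: an orthonormal fast Walsh–Hadamard rotation, uniform per-coordinate quantization, and a 1-bit sign residual standing in for the QJL correction. Function names, the level count, and the residual encoding are illustrative, not the actual ruvLLM API.

```rust
/// In-place fast Walsh–Hadamard transform; `x.len()` must be a power of two.
/// Normalized by 1/sqrt(n) so the rotation is orthonormal (norm-preserving).
fn fwht(x: &mut [f32]) {
    let n = x.len();
    assert!(n.is_power_of_two());
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(2 * h) {
            for j in i..i + h {
                let (a, b) = (x[j], x[j + h]);
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (n as f32).sqrt();
    for v in x.iter_mut() {
        *v *= scale;
    }
}

/// Quantize each rotated coordinate to `levels` uniform levels over
/// [-amax, amax], and record the sign of the residual as a stand-in for the
/// 1-bit QJL correction.
fn quantize(x: &[f32], levels: u32) -> (Vec<u32>, Vec<bool>, f32) {
    let amax = x.iter().fold(0f32, |m, v| m.max(v.abs())).max(1e-12);
    let step = 2.0 * amax / (levels - 1) as f32;
    let mut codes = Vec::with_capacity(x.len());
    let mut residual_signs = Vec::with_capacity(x.len());
    for &v in x {
        let code = (((v + amax) / step).round() as u32).min(levels - 1);
        let deq = code as f32 * step - amax;
        codes.push(code);
        residual_signs.push(v - deq >= 0.0);
    }
    (codes, residual_signs, amax)
}

fn main() {
    let mut v = vec![1.0, -2.0, 3.0, 0.5, -1.5, 2.5, -0.25, 4.0];
    let orig = v.clone();
    fwht(&mut v);
    // Orthonormality: the rotation preserves the squared norm.
    let n_orig: f32 = orig.iter().map(|a| a * a).sum();
    let n_rot: f32 = v.iter().map(|a| a * a).sum();
    assert!((n_orig - n_rot).abs() < 1e-3);
    // 11 levels is roughly log2(11) ≈ 3.46 bits per coordinate.
    let (codes, signs, amax) = quantize(&v, 11);
    println!("codes={:?} signs={:?} amax={}", codes, signs, amax);
}
```

A real implementation would also apply a random sign flip per coordinate before the Hadamard transform and bit-pack the codes; both are omitted here for brevity.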
## Implementation Status

### Core Compressor (`turbo_quant.rs`) — ✅ Complete

- `TurboQuantCompressor` with Hadamard rotation + scalar quantization
- QJL residual correction (optional, improves inner products)
- 4 bit-width configurations: 2.5, 3.0, 3.5, 4.0 bits
- Batch compression/decompression
- Asymmetric inner product computation
- Non-power-of-2 dimension handling (auto-padding)
- 13 unit tests passing
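The auto-padding item relies on a simple fact: Hadamard transforms need power-of-two lengths, and zero-padding preserves norms and inner products exactly, so vectors can be padded before rotation and truncated back after decompression. A minimal sketch (the function name is illustrative, not the real API):

```rust
/// Zero-pad a vector up to the next power of two so it can be fed to a
/// Hadamard rotation. Padding with zeros changes neither the norm nor any
/// inner product; the decompression path truncates back to the original length.
fn pad_to_pow2(v: &[f32]) -> Vec<f32> {
    let mut out = v.to_vec();
    out.resize(v.len().next_power_of_two(), 0.0);
    out
}

fn main() {
    let padded = pad_to_pow2(&[1.0, 2.0, 3.0]);
    assert_eq!(padded.len(), 4);
    assert_eq!(padded[3], 0.0);
    // Already a power of two: left unchanged.
    assert_eq!(pad_to_pow2(&[0.0; 8]).len(), 8);
}
```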
### KV Cache Integration (`kv_cache.rs`) — ✅ Complete

- `TurboQuantCacheTier` — compressed storage for cold tokens
- `TurboQuantKvCache` — three-tier cache (FP16 hot + TurboQuant cold)
- Auto-migration from hot to cold tier based on `tail_length`
- Configurable migration batch size
- Token eviction (oldest-first) with max token enforcement
- `CacheTier::TurboQuant` enum variant
- `CacheQuantization::TurboQuantHybrid` configuration
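The hot/cold migration policy can be illustrated with a toy two-tier cache: the newest `tail_length` tokens stay in full precision and older tokens migrate to a compressed tier in batches. All names here are illustrative stand-ins, not the real `TurboQuantKvCache` API, and the compressor is a stub.

```rust
use std::collections::VecDeque;

/// Toy two-tier KV cache: recent tokens stay uncompressed, older tokens are
/// migrated to a compressed cold tier in batches of `migration_batch`.
struct TieredKvCache {
    hot: VecDeque<Vec<f32>>, // FP16 in the real engine; f32 here for simplicity
    cold: Vec<Vec<u8>>,      // compressed codes
    tail_length: usize,
    migration_batch: usize,
}

impl TieredKvCache {
    fn push(&mut self, kv: Vec<f32>) {
        self.hot.push_back(kv);
        // Migrate only once a full batch of tokens has aged out of the tail.
        while self.hot.len() >= self.tail_length + self.migration_batch {
            for _ in 0..self.migration_batch {
                let old = self.hot.pop_front().unwrap();
                self.cold.push(compress_stub(&old));
            }
        }
    }
}

/// Placeholder for the TurboQuant compressor: one byte per value.
fn compress_stub(v: &[f32]) -> Vec<u8> {
    v.iter().map(|&x| x.round() as i8 as u8).collect()
}

fn main() {
    let mut cache = TieredKvCache {
        hot: VecDeque::new(),
        cold: Vec::new(),
        tail_length: 2,
        migration_batch: 2,
    };
    for t in 0..6 {
        cache.push(vec![t as f32; 4]);
    }
    // The newest `tail_length` tokens stay hot; the rest were migrated.
    assert_eq!(cache.hot.len(), 2);
    assert_eq!(cache.cold.len(), 4);
}
```

Batching the migration (rather than compressing one token at a time) amortizes the rotation cost across the batch, which is presumably why the batch size is configurable.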
### Embedding Store (`turbo_quant.rs`) — ✅ Complete

- `TurboQuantEmbeddingStore` for RuVector-compatible compressed search
- Batch build from embeddings
- Asymmetric search (exact query × compressed vectors)
- ID-based retrieval and decompression
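"Asymmetric" here means the query is never quantized: only the stored side loses precision, which roughly halves the quantization error of the inner product versus quantizing both sides. A simplified sketch with a uniform dequantizer (function names and the scoring scheme are illustrative, not the store's actual API):

```rust
/// Reconstruct coordinates from uniform codes over [-amax, amax].
fn dequantize(codes: &[u32], levels: u32, amax: f32) -> Vec<f32> {
    let step = 2.0 * amax / (levels - 1) as f32;
    codes.iter().map(|&c| c as f32 * step - amax).collect()
}

/// Asymmetric inner product: full-precision query against a compressed vector.
fn asymmetric_score(query: &[f32], codes: &[u32], levels: u32, amax: f32) -> f32 {
    dequantize(codes, levels, amax)
        .iter()
        .zip(query)
        .map(|(x, q)| x * q)
        .sum()
}

fn main() {
    // Codes 0 and 2 with 3 levels over [-1, 1] decode to -1.0 and 1.0.
    let score = asymmetric_score(&[2.0, 3.0], &[0, 2], 3, 1.0);
    assert!((score - 1.0).abs() < 1e-6); // -1*2 + 1*3 = 1
}
```

In practice the dequantize-then-dot loop would be fused and SIMD-vectorized rather than materializing the decoded vector.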
## Optimization Roadmap

### Phase 1: Inner Product Optimization — 🔄 In Progress

- Rotated-domain inner product: skip the inverse Hadamard by computing `<Hq, Hk>` directly (orthogonal invariance)
- Batch query rotation: rotate the query once, reuse across all compressed vectors
- SIMD-accelerated dequantize: NEON/AVX2 bitstream unpacking
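The rotated-domain trick rests on orthogonal invariance: for any orthonormal H, ⟨Hq, Hk⟩ = ⟨q, k⟩, so scores can be computed directly on rotated (compressed-domain) vectors with no inverse transform. A standalone numerical check, using an orthonormal fast Walsh–Hadamard transform (a sketch, not the engine's implementation):

```rust
/// Orthonormal fast Walsh–Hadamard transform; length must be a power of two.
fn fwht(x: &mut [f32]) {
    let n = x.len();
    assert!(n.is_power_of_two());
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(2 * h) {
            for j in i..i + h {
                let (a, b) = (x[j], x[j + h]);
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (n as f32).sqrt();
    for v in x.iter_mut() {
        *v *= scale;
    }
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let q = vec![0.5, -1.0, 2.0, 0.25];
    let k = vec![1.5, 0.5, -0.75, 2.0];
    let (mut hq, mut hk) = (q.clone(), k.clone());
    fwht(&mut hq);
    fwht(&mut hk);
    // Orthogonal invariance: <Hq, Hk> == <q, k>, so the inverse transform
    // can be skipped entirely when scoring in the rotated domain.
    assert!((dot(&hq, &hk) - dot(&q, &k)).abs() < 1e-4);
}
```

This is also what makes batch query rotation pay off: one forward transform of the query amortizes over every compressed key it is scored against.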
### Phase 2: Benchmarks & Profiling
- Criterion benchmarks for all operations (compress, decompress, inner product, cache ops)
- Dimension scaling analysis (64 → 1024)
- Batch size analysis (1 → 1000)
- Memory efficiency measurements at each bit width
- Comparison with existing PiQ3 compression
### Phase 3: Production Hardening

- Thread-safe `TurboQuantKvCache` stress tests
- Large-scale cache simulation (100K+ tokens)
- Integration with the ruvLLM inference pipeline (`TransformerBlock`)
- Attention kernel integration (compressed attention computation)
- ANE-optimized memory layouts for Apple Silicon
### Phase 4: Advanced Features
- Streaming compression (process tokens individually without batch overhead)
- Adaptive bit-width selection based on layer sensitivity
- Hierarchical quantization (different bits for different attention heads)
- Mixed-precision attention (FP16 query × TurboQuant keys → FP32 accumulator)
## Key Metrics
| Metric | Target | Current |
|---|---|---|
| Compression ratio vs FP16 | >4× | ~4.6× (3.5-bit) |
| Inner product relative error | <15% | <15% (tested) |
| Compression throughput | >1M vec/s | TBD (benchmarks needed) |
| Attention latency reduction | >4× | TBD (integration needed) |
| Max sequence length at 8GB | >128K | TBD (simulation needed) |
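The first row is a direct arithmetic consequence of the bit widths: FP16 stores 16 bits per value, so the 3.5-bit configuration gives 16 / 3.5 ≈ 4.57×, matching the ~4.6× in the table (ignoring any per-block scale overhead):

```rust
fn main() {
    // FP16 = 16 bits per value; TurboQuant's 3.5-bit mode = 3.5 bits per value.
    let ratio = 16.0_f32 / 3.5;
    println!("compression ratio ≈ {:.2}x", ratio);
    assert!((ratio - 4.57).abs() < 0.01);
}
```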
## Files

| File | Description |
|---|---|
| `crates/ruvllm/src/quantize/turbo_quant.rs` | Core compressor, cache tier, embedding store |
| `crates/ruvllm/src/quantize/mod.rs` | Module exports |
| `crates/ruvllm/src/kv_cache.rs` | Three-tier KV cache with TurboQuant integration |
| `docs/research/quantization-edge/08-turboquant-kv-cache-compression.md` | Research document |
## References
- TurboQuant (ICLR 2026): Data-oblivious KV cache compression
- PolarQuant (AISTATS 2026): Random rotation quantization
- QJL: Quantized Johnson-Lindenstrauss projection
- PR feat(ruvllm): TurboQuant KV cache & vector compression #297: Initial implementation
## Related

- ADR-090: Quantization pipeline architecture
- Existing PiQ3 quantization in `crates/ruvllm/src/quantize/pi_quant.rs`
- Hadamard transform in `crates/ruvllm/src/quantize/hadamard.rs`