## Introduction
TurboQuant (ICLR 2026) brings data-oblivious compression to KV caches and embedding vectors, achieving ~3.5 bits per value with provably near-optimal geometry preservation. Unlike codebook-based approaches (GPTQ, AWQ), TurboQuant requires no training, no dataset-specific tuning, and works online — compressing vectors as they arrive.
This issue tracks the full implementation of TurboQuant within the ruvLLM engine, from the core compressor through KV cache integration, embedding store, and production-grade optimizations.
## Why TurboQuant?
| Property | TurboQuant | GPTQ/AWQ | PiQ3 (existing) |
|---|---|---|---|
| Training required | No | Yes (calibration set) | No |
| Online compression | Yes | No (batch) | Yes |
| Geometry preservation | Provable (2.7× optimal) | Empirical | Empirical |
| KV cache compatible | Native | Retrofit | Not designed for |
| Memory reduction | ~6× vs FP16 | ~4× vs FP16 | ~5× vs FP16 |
| Attention speedup | Up to 8× | Limited | N/A |
## Algorithm Overview

TurboQuant is a two-stage pipeline:

1. **PolarQuant**: random Hadamard rotation followed by scalar quantization per coordinate.
   - The rotation makes dimensions approximately independent (Beta-distributed).
   - This enables optimal per-coordinate scalar quantization without codebooks.
2. **QJL Residual**: 1-bit Quantized Johnson-Lindenstrauss projection on the residual.
   - Corrects quantization error with just 1 extra bit per dimension.
   - Produces an unbiased inner product estimator.
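To make the two stages concrete, here is a minimal, self-contained sketch of the pipeline: an orthonormal fast Walsh–Hadamard rotation, uniform per-coordinate quantization, and a 1-bit sign residual standing in for the QJL correction. Function names, the level count, and the residual encoding are illustrative, not the actual ruvLLM API.

```rust
/// In-place fast Walsh–Hadamard transform; `x.len()` must be a power of two.
/// Normalized by 1/sqrt(n) so the rotation is orthonormal (norm-preserving).
fn fwht(x: &mut [f32]) {
    let n = x.len();
    assert!(n.is_power_of_two());
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(2 * h) {
            for j in i..i + h {
                let (a, b) = (x[j], x[j + h]);
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (n as f32).sqrt();
    for v in x.iter_mut() {
        *v *= scale;
    }
}

/// Quantize each rotated coordinate to `levels` uniform levels over
/// [-amax, amax], and record the sign of the residual as a stand-in for the
/// 1-bit QJL correction.
fn quantize(x: &[f32], levels: u32) -> (Vec<u32>, Vec<bool>, f32) {
    let amax = x.iter().fold(0f32, |m, v| m.max(v.abs())).max(1e-12);
    let step = 2.0 * amax / (levels - 1) as f32;
    let mut codes = Vec::with_capacity(x.len());
    let mut residual_signs = Vec::with_capacity(x.len());
    for &v in x {
        let code = (((v + amax) / step).round() as u32).min(levels - 1);
        let deq = code as f32 * step - amax;
        codes.push(code);
        residual_signs.push(v - deq >= 0.0);
    }
    (codes, residual_signs, amax)
}

fn main() {
    let mut v = vec![1.0, -2.0, 3.0, 0.5, -1.5, 2.5, -0.25, 4.0];
    let orig = v.clone();
    fwht(&mut v);
    // Orthonormality: the rotation preserves the squared norm.
    let n_orig: f32 = orig.iter().map(|a| a * a).sum();
    let n_rot: f32 = v.iter().map(|a| a * a).sum();
    assert!((n_orig - n_rot).abs() < 1e-3);
    // 11 levels is roughly log2(11) ≈ 3.46 bits per coordinate.
    let (codes, signs, amax) = quantize(&v, 11);
    println!("codes={:?} signs={:?} amax={}", codes, signs, amax);
}
```

A real implementation would also apply a random sign flip per coordinate before the Hadamard transform and bit-pack the codes; both are omitted here for brevity.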
## Implementation Status

### Core Compressor (`turbo_quant.rs`) — ✅ Complete

- `TurboQuantCompressor` with Hadamard rotation + scalar quantization
- QJL residual correction (optional, improves inner products)
- 4 bit-width configurations: 2.5, 3.0, 3.5, 4.0 bits
- Batch compression/decompression
- Asymmetric inner product computation
- Non-power-of-2 dimension handling (auto-padding)
- 13 unit tests passing
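The auto-padding item relies on a simple fact: Hadamard transforms need power-of-two lengths, and zero-padding preserves norms and inner products exactly, so vectors can be padded before rotation and truncated back after decompression. A minimal sketch (the function name is illustrative, not the real API):

```rust
/// Zero-pad a vector up to the next power of two so it can be fed to a
/// Hadamard rotation. Padding with zeros changes neither the norm nor any
/// inner product; the decompression path truncates back to the original length.
fn pad_to_pow2(v: &[f32]) -> Vec<f32> {
    let mut out = v.to_vec();
    out.resize(v.len().next_power_of_two(), 0.0);
    out
}

fn main() {
    let padded = pad_to_pow2(&[1.0, 2.0, 3.0]);
    assert_eq!(padded.len(), 4);
    assert_eq!(padded[3], 0.0);
    // Already a power of two: left unchanged.
    assert_eq!(pad_to_pow2(&[0.0; 8]).len(), 8);
}
```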
### KV Cache Integration (`kv_cache.rs`) — ✅ Complete

- `TurboQuantCacheTier` — compressed storage for cold tokens
- `TurboQuantKvCache` — three-tier cache (FP16 hot + TurboQuant cold)
- Auto-migration from hot to cold tier based on `tail_length`
- Configurable migration batch size
- Token eviction (oldest-first) with max token enforcement
- `CacheTier::TurboQuant` enum variant
- `CacheQuantization::TurboQuantHybrid` configuration
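The hot/cold migration policy can be illustrated with a toy two-tier cache: the newest `tail_length` tokens stay in full precision and older tokens migrate to a compressed tier in batches. All names here are illustrative stand-ins, not the real `TurboQuantKvCache` API, and the compressor is a stub.

```rust
use std::collections::VecDeque;

/// Toy two-tier KV cache: recent tokens stay uncompressed, older tokens are
/// migrated to a compressed cold tier in batches of `migration_batch`.
struct TieredKvCache {
    hot: VecDeque<Vec<f32>>, // FP16 in the real engine; f32 here for simplicity
    cold: Vec<Vec<u8>>,      // compressed codes
    tail_length: usize,
    migration_batch: usize,
}

impl TieredKvCache {
    fn push(&mut self, kv: Vec<f32>) {
        self.hot.push_back(kv);
        // Migrate only once a full batch of tokens has aged out of the tail.
        while self.hot.len() >= self.tail_length + self.migration_batch {
            for _ in 0..self.migration_batch {
                let old = self.hot.pop_front().unwrap();
                self.cold.push(compress_stub(&old));
            }
        }
    }
}

/// Placeholder for the TurboQuant compressor: one byte per value.
fn compress_stub(v: &[f32]) -> Vec<u8> {
    v.iter().map(|&x| x.round() as i8 as u8).collect()
}

fn main() {
    let mut cache = TieredKvCache {
        hot: VecDeque::new(),
        cold: Vec::new(),
        tail_length: 2,
        migration_batch: 2,
    };
    for t in 0..6 {
        cache.push(vec![t as f32; 4]);
    }
    // The newest `tail_length` tokens stay hot; the rest were migrated.
    assert_eq!(cache.hot.len(), 2);
    assert_eq!(cache.cold.len(), 4);
}
```

Batching the migration (rather than compressing one token at a time) amortizes the rotation cost across the batch, which is presumably why the batch size is configurable.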
### Embedding Store (`turbo_quant.rs`) — ✅ Complete

- `TurboQuantEmbeddingStore` for RuVector-compatible compressed search
- Batch build from embeddings
- Asymmetric search (exact query × compressed vectors)
- ID-based retrieval and decompression
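"Asymmetric" here means the query is never quantized: only the stored side loses precision, which roughly halves the quantization error of the inner product versus quantizing both sides. A simplified sketch with a uniform dequantizer (function names and the scoring scheme are illustrative, not the store's actual API):

```rust
/// Reconstruct coordinates from uniform codes over [-amax, amax].
fn dequantize(codes: &[u32], levels: u32, amax: f32) -> Vec<f32> {
    let step = 2.0 * amax / (levels - 1) as f32;
    codes.iter().map(|&c| c as f32 * step - amax).collect()
}

/// Asymmetric inner product: full-precision query against a compressed vector.
fn asymmetric_score(query: &[f32], codes: &[u32], levels: u32, amax: f32) -> f32 {
    dequantize(codes, levels, amax)
        .iter()
        .zip(query)
        .map(|(x, q)| x * q)
        .sum()
}

fn main() {
    // Codes 0 and 2 with 3 levels over [-1, 1] decode to -1.0 and 1.0.
    let score = asymmetric_score(&[2.0, 3.0], &[0, 2], 3, 1.0);
    assert!((score - 1.0).abs() < 1e-6); // -1*2 + 1*3 = 1
}
```

In practice the dequantize-then-dot loop would be fused and SIMD-vectorized rather than materializing the decoded vector.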
## Optimization Roadmap

### Phase 1: Inner Product Optimization — 🔄 In Progress

- Rotated-domain inner product: skip the inverse Hadamard by computing `<Hq, Hk>` directly (orthogonal invariance)
- Batch query rotation: rotate the query once, reuse across all compressed vectors
- SIMD-accelerated dequantize: NEON/AVX2 bitstream unpacking
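The rotated-domain trick rests on orthogonal invariance: for any orthonormal H, ⟨Hq, Hk⟩ = ⟨q, k⟩, so scores can be computed directly on rotated (compressed-domain) vectors with no inverse transform. A standalone numerical check, using an orthonormal fast Walsh–Hadamard transform (a sketch, not the engine's implementation):

```rust
/// Orthonormal fast Walsh–Hadamard transform; length must be a power of two.
fn fwht(x: &mut [f32]) {
    let n = x.len();
    assert!(n.is_power_of_two());
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(2 * h) {
            for j in i..i + h {
                let (a, b) = (x[j], x[j + h]);
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (n as f32).sqrt();
    for v in x.iter_mut() {
        *v *= scale;
    }
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let q = vec![0.5, -1.0, 2.0, 0.25];
    let k = vec![1.5, 0.5, -0.75, 2.0];
    let (mut hq, mut hk) = (q.clone(), k.clone());
    fwht(&mut hq);
    fwht(&mut hk);
    // Orthogonal invariance: <Hq, Hk> == <q, k>, so the inverse transform
    // can be skipped entirely when scoring in the rotated domain.
    assert!((dot(&hq, &hk) - dot(&q, &k)).abs() < 1e-4);
}
```

This is also what makes batch query rotation pay off: one forward transform of the query amortizes over every compressed key it is scored against.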
### Phase 2: Benchmarks & Profiling
- Criterion benchmarks for all operations (compress, decompress, inner product, cache ops)
- Dimension scaling analysis (64 → 1024)
- Batch size analysis (1 → 1000)
- Memory efficiency measurements at each bit width
- Comparison with existing PiQ3 compression
### Phase 3: Production Hardening

- Thread-safe `TurboQuantKvCache` stress tests
- Large-scale cache simulation (100K+ tokens)
- Integration with the ruvLLM inference pipeline (`TransformerBlock`)
- Attention kernel integration (compressed attention computation)
- ANE-optimized memory layouts for Apple Silicon
### Phase 4: Advanced Features
- Streaming compression (process tokens individually without batch overhead)
- Adaptive bit-width selection based on layer sensitivity
- Hierarchical quantization (different bits for different attention heads)
- Mixed-precision attention (FP16 query × TurboQuant keys → FP32 accumulator)
## Key Metrics
| Metric | Target | Current |
|---|---|---|
| Compression ratio vs FP16 | >4× | ~4.6× (3.5-bit) |
| Inner product relative error | <15% | <15% (tested) |
| Compression throughput | >1M vec/s | TBD (benchmarks needed) |
| Attention latency reduction | >4× | TBD (integration needed) |
| Max sequence length at 8GB | >128K | TBD (simulation needed) |
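The first row is a direct arithmetic consequence of the bit widths: FP16 stores 16 bits per value, so the 3.5-bit configuration gives 16 / 3.5 ≈ 4.57×, matching the ~4.6× in the table (ignoring any per-block scale overhead):

```rust
fn main() {
    // FP16 = 16 bits per value; TurboQuant's 3.5-bit mode = 3.5 bits per value.
    let ratio = 16.0_f32 / 3.5;
    println!("compression ratio ≈ {:.2}x", ratio);
    assert!((ratio - 4.57).abs() < 0.01);
}
```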
## Files

| File | Description |
|---|---|
| `crates/ruvllm/src/quantize/turbo_quant.rs` | Core compressor, cache tier, embedding store |
| `crates/ruvllm/src/quantize/mod.rs` | Module exports |
| `crates/ruvllm/src/kv_cache.rs` | Three-tier KV cache with TurboQuant integration |
| `docs/research/quantization-edge/08-turboquant-kv-cache-compression.md` | Research document |
## References
- TurboQuant (ICLR 2026): Data-oblivious KV cache compression
- PolarQuant (AISTATS 2026): Random rotation quantization
- QJL: Quantized Johnson-Lindenstrauss projection
- PR feat(ruvllm): TurboQuant KV cache & vector compression #297: Initial implementation
## Related

- ADR-090: Quantization pipeline architecture
- Existing PiQ3 quantization in `crates/ruvllm/src/quantize/pi_quant.rs`
- Hadamard transform in `crates/ruvllm/src/quantize/hadamard.rs`