
feat(ruvllm): TurboQuant KV Cache & Vector Compression — Full Implementation Plan #298

@ruvnet

Description

Introduction

TurboQuant (ICLR 2026) brings data-oblivious compression to KV caches and embedding vectors, achieving ~3.5 bits per value with provably near-optimal geometry preservation. Unlike codebook-based approaches (GPTQ, AWQ), TurboQuant requires no training, no dataset-specific tuning, and works online — compressing vectors as they arrive.

This issue tracks the full implementation of TurboQuant within the ruvLLM engine, from the core compressor through KV cache integration, embedding store, and production-grade optimizations.

Why TurboQuant?

| Property | TurboQuant | GPTQ/AWQ | PiQ3 (existing) |
|---|---|---|---|
| Training required | No | Yes (calibration set) | No |
| Online compression | Yes | No (batch) | Yes |
| Geometry preservation | Provable (2.7× optimal) | Empirical | Empirical |
| KV cache compatible | Native | Retrofit | Not designed for it |
| Memory reduction | ~6× vs FP16 | ~4× vs FP16 | ~5× vs FP16 |
| Attention speedup | Up to 8× | Limited | N/A |

Algorithm Overview

TurboQuant is a two-stage pipeline (a minimal code sketch follows the list):

  1. PolarQuant: Random Hadamard rotation → scalar quantization per coordinate

    • Rotation makes dimensions approximately independent (Beta-distributed)
    • Enables optimal per-coordinate scalar quantization without codebooks
  2. QJL Residual: 1-bit Quantized Johnson-Lindenstrauss on the residual

    • Corrects quantization error with just 1 extra bit per dimension
    • Produces an unbiased inner product estimator
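
Below is a minimal, self-contained sketch of both stages, assuming fixed rather than random signs, a plain uniform scalar quantizer, and no bitstream packing; names like `fwht` and `compress` are illustrative and are not the ruvLLM API.

```rust
/// In-place fast Walsh-Hadamard transform; length must be a power of two.
/// Normalized so the rotation is orthonormal (inner products are preserved).
fn fwht(x: &mut [f32]) {
    let n = x.len();
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (x[j], x[j + h]);
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (n as f32).sqrt();
    x.iter_mut().for_each(|v| *v *= scale);
}

/// Stage 1 (PolarQuant): sign flips + Hadamard rotation, then uniform
/// scalar quantization per coordinate. Stage 2 (QJL residual): one sign
/// bit of the quantization error per coordinate.
fn compress(x: &[f32], signs: &[f32], levels: u32) -> (Vec<u32>, Vec<bool>, f32) {
    let mut r: Vec<f32> = x.iter().zip(signs).map(|(v, s)| v * s).collect();
    fwht(&mut r);
    let max = r.iter().fold(f32::EPSILON, |m, v| m.max(v.abs()));
    let step = 2.0 * max / levels as f32;
    let codes: Vec<u32> = r
        .iter()
        .map(|v| (((v + max) / step) as u32).min(levels - 1))
        .collect();
    let residual_signs: Vec<bool> = r
        .iter()
        .zip(&codes)
        .map(|(v, &c)| v - ((c as f32 + 0.5) * step - max) >= 0.0)
        .collect();
    (codes, residual_signs, max)
}

fn main() {
    let x = [0.5f32, -1.2, 0.3, 2.0];
    let signs = [1.0f32, -1.0, 1.0, -1.0]; // random ±1 per dimension in the real scheme
    let (codes, res, scale) = compress(&x, &signs, 8); // 3-bit codes + 1 residual bit
    println!("codes={codes:?} residual_signs={res:?} scale={scale}");
}
```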

Implementation Status

Core Compressor (turbo_quant.rs) — ✅ Complete

  • TurboQuantCompressor with Hadamard rotation + scalar quantization
  • QJL residual correction (optional, improves inner products)
  • 4 bit-width configurations: 2.5, 3.0, 3.5, 4.0 bits
  • Batch compression/decompression
  • Asymmetric inner product computation
  • Non-power-of-2 dimension handling (auto-padding; see the helper sketch after this list)
  • 13 unit tests passing
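
To illustrate the auto-padding item above: zero-padding to the next power of two keeps the Hadamard transform applicable and leaves inner products unchanged, since the padded coordinates contribute nothing. The helper below is a sketch, not the crate's actual code.

```rust
/// Zero-pad a vector to the next power-of-two length so the Hadamard
/// transform applies; zero coordinates leave inner products unchanged.
fn pad_to_pow2(v: &[f32]) -> Vec<f32> {
    let n = v.len().next_power_of_two();
    let mut out = v.to_vec();
    out.resize(n, 0.0);
    out
}
```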

KV Cache Integration (kv_cache.rs) — ✅ Complete

  • TurboQuantCacheTier — compressed storage for cold tokens
  • TurboQuantKvCache — three-tier cache (FP16 hot + TurboQuant cold)
  • Auto-migration from hot to cold tier based on tail_length (sketched after this list)
  • Configurable migration batch size
  • Token eviction (oldest-first) with max token enforcement
  • CacheTier::TurboQuant enum variant
  • CacheQuantization::TurboQuantHybrid configuration
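
A hedged sketch of the hot-to-cold migration policy, assuming a toy per-token layout; the real TurboQuantKvCache types and fields will differ. The newest tail_length tokens stay in FP16, and older tokens are compressed a batch at a time:

```rust
use std::collections::VecDeque;

/// Toy two-level store illustrating hot->cold migration by `tail_length`.
struct TieredCache {
    hot: VecDeque<Vec<f32>>,     // recent tokens, full precision
    cold: Vec<(Vec<u32>, f32)>,  // (scalar codes, scale) per migrated token
    tail_length: usize,          // tokens kept uncompressed
    migration_batch: usize,      // compress this many tokens at a time (> 0)
}

impl TieredCache {
    /// Migrate the oldest hot tokens once the hot tier exceeds the tail
    /// plus one full batch, amortizing compression cost over many appends.
    fn maybe_migrate(&mut self) {
        while self.hot.len() >= self.tail_length + self.migration_batch {
            for _ in 0..self.migration_batch {
                if let Some(token) = self.hot.pop_front() {
                    self.cold.push(quantize(&token));
                }
            }
        }
    }
}

/// Stand-in for the real compressor (see the pipeline sketch above);
/// 4-bit uniform codes plus a per-token scale.
fn quantize(x: &[f32]) -> (Vec<u32>, f32) {
    let max = x.iter().fold(f32::EPSILON, |m, v| m.max(v.abs()));
    let codes = x.iter().map(|v| ((v / max * 7.5 + 7.5) as u32).min(15)).collect();
    (codes, max)
}
```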

Embedding Store (turbo_quant.rs) — ✅ Complete

  • TurboQuantEmbeddingStore for RuVector-compatible compressed search
  • Batch build from embeddings
  • Asymmetric search (exact query × compressed vectors; see the sketch after this list)
  • ID-based retrieval and decompression
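
Asymmetric search keeps the query exact and quantizes only the stored vectors. Here is a sketch of the per-coordinate inner product, assuming the code/scale layout from the compressor sketch above rather than the store's real format:

```rust
/// Asymmetric inner product: full-precision rotated query against one
/// compressed vector, dequantizing each stored coordinate on the fly.
fn asymmetric_ip(query_rot: &[f32], codes: &[u32], scale: f32, levels: u32) -> f32 {
    let step = 2.0 * scale / levels as f32;
    query_rot
        .iter()
        .zip(codes)
        .map(|(q, &c)| q * ((c as f32 + 0.5) * step - scale))
        .sum()
}
```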

Optimization Roadmap

Phase 1: Inner Product Optimization — 🔄 In Progress

  • Rotated-domain inner product: skip the inverse Hadamard by computing <Hq, Hk> directly (orthogonal invariance; a quick check follows this list)
  • Batch query rotation: Rotate query once, reuse across all compressed vectors
  • SIMD-accelerated dequantize: NEON/AVX2 bitstream unpacking
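
A quick check of the identity behind the rotated-domain item: the normalized Hadamard rotation H is orthonormal, so <Hq, Hk> = <q, k> and scores can be computed entirely in the rotated domain, skipping the inverse transform. This demo reuses the fwht helper from the pipeline sketch above.

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let mut q = vec![0.3f32, -1.1, 0.7, 2.2];
    let mut k = vec![1.5f32, 0.2, -0.9, 0.4];
    let before = dot(&q, &k);
    fwht(&mut q); // normalized transforms from the earlier sketch
    fwht(&mut k);
    let after = dot(&q, &k);
    assert!((before - after).abs() < 1e-4); // orthogonal invariance holds
    println!("<q,k>={before}, <Hq,Hk>={after}");
}
```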

Phase 2: Benchmarks & Profiling

  • Criterion benchmarks for all operations (compress, decompress, inner product, cache ops); a skeleton follows this list
  • Dimension scaling analysis (64 → 1024)
  • Batch size analysis (1 → 1000)
  • Memory efficiency measurements at each bit width
  • Comparison with existing PiQ3 compression
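
One possible Criterion skeleton for the dimension-scaling benchmark, assuming criterion as a dev-dependency; the local compress stub stands in for the real compressor entry points.

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

/// Stand-in for the real compressor entry point (4-bit uniform codes).
fn compress(v: &[f32]) -> Vec<u32> {
    let max = v.iter().fold(f32::EPSILON, |m, x| m.max(x.abs()));
    v.iter().map(|x| ((x / max * 7.5 + 7.5) as u32).min(15)).collect()
}

fn bench_compress(c: &mut Criterion) {
    let mut group = c.benchmark_group("turboquant/compress");
    // Mirrors the 64 -> 1024 dimension-scaling analysis above.
    for dim in [64usize, 128, 256, 512, 1024] {
        let v: Vec<f32> = (0..dim).map(|i| (i as f32).sin()).collect();
        group.bench_with_input(BenchmarkId::from_parameter(dim), &v, |b, v| {
            b.iter(|| compress(v))
        });
    }
    group.finish();
}

criterion_group!(benches, bench_compress);
criterion_main!(benches);
```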

Phase 3: Production Hardening

  • Thread-safe TurboQuantKvCache stress tests
  • Large-scale cache simulation (100K+ tokens)
  • Integration with ruvLLM inference pipeline (TransformerBlock)
  • Attention kernel integration (compressed attention computation)
  • ANE-optimized memory layouts for Apple Silicon

Phase 4: Advanced Features

  • Streaming compression (process tokens individually without batch overhead)
  • Adaptive bit-width selection based on layer sensitivity
  • Hierarchical quantization (different bits for different attention heads)
  • Mixed-precision attention (FP16 query × TurboQuant keys → FP32 accumulator)
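
For the mixed-precision item, a sketch of an FP16 query against TurboQuant-coded keys with an FP32 accumulator, assuming the half crate for FP16 scalars and the code/scale layout from the compressor sketch:

```rust
use half::f16; // assumes the `half` crate for FP16 scalars

/// FP16 query x TurboQuant-coded key, accumulated in FP32 so rounding
/// error does not compound across long sequences.
fn mixed_precision_dot(query: &[f16], key_codes: &[u32], scale: f32, levels: u32) -> f32 {
    let step = 2.0 * scale / levels as f32;
    query
        .iter()
        .zip(key_codes)
        .fold(0.0f32, |acc, (q, &c)| {
            let k = (c as f32 + 0.5) * step - scale; // dequantize key coordinate
            acc + q.to_f32() * k                     // widen, then accumulate
        })
}
```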

Key Metrics

| Metric | Target | Current |
|---|---|---|
| Compression ratio vs FP16 | >4× | ~4.6× (3.5-bit) |
| Inner product relative error | <15% | <15% (tested) |
| Compression throughput | >1M vec/s | TBD (benchmarks needed) |
| Attention latency reduction | >4× | TBD (integration needed) |
| Max sequence length at 8 GB | >128K | TBD (simulation needed) |
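
As a sanity check on the first row: 16 bits / 3.5 bits ≈ 4.57, which rounds to the ~4.6× ratio above; any per-vector scale metadata (an assumption about the storage layout) would reduce the effective figure slightly.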

Files

| File | Description |
|---|---|
| crates/ruvllm/src/quantize/turbo_quant.rs | Core compressor, cache tier, embedding store |
| crates/ruvllm/src/quantize/mod.rs | Module exports |
| crates/ruvllm/src/kv_cache.rs | Three-tier KV cache with TurboQuant integration |
| docs/research/quantization-edge/08-turboquant-kv-cache-compression.md | Research document |

Related

  • ADR-090: Quantization pipeline architecture
  • Existing PiQ3 quantization in crates/ruvllm/src/quantize/pi_quant.rs
  • Hadamard transform in crates/ruvllm/src/quantize/hadamard.rs
