TurboQuant Reproduction

This repository provides a reproduction implementation for TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate, with a focus on the Llama 3.1 8B Instruct LongBench Table 1 KV cache compression experiments.

The reproduced settings include:

Full Cache
TurboQuant 2.5 bit
TurboQuant 3.5 bit

TurboQuant is an online, data-oblivious vector quantization method for high-dimensional vectors. It applies randomized rotation, scalar Lloyd Max quantization on rotated coordinates, and an optional residual 1 bit QJL style correction stage for inner product estimation. The original paper evaluates TurboQuant on KV cache compression for long-context LLM inference and nearest-neighbor search. This repository focuses on the LongBench KV cache compression setting.

Scope

This repository focuses on reproducing the Llama 3.1 8B Instruct LongBench Table 1 experiments. It does not attempt to reproduce every experiment in the TurboQuant paper.

Implemented components include:

TurboQuant KV cache quantization
Full cache baseline evaluation
LongBench prompt formatting
LongBench-compatible scoring
experiment launch scripts
result aggregation scripts
comparison table builders
unit tests for quantization and cache behavior

Repository Layout

turboquant/                 Core TurboQuant and LongBench utilities
experiments/longbench/      LongBench generation runner
scripts/                    Data preparation, scoring, job queue, and report builders
configs/                    Local model and dataset path configuration
tests/                      Unit tests for quantization and cache behavior
reproduce/                  Reproduction plans, reports, and final comparison tables

Important result files:

reproduce/TABLE1_OFFICIAL_COMPARISON.md
reproduce/TABLE1_OFFICIAL_COMPARISON.json
reproduce/TABLE1_OFFICIAL_COMPARISON.csv
reproduce/REPRODUCTION_MANIFEST.json

Environment

Create the conda environment:

conda env create -f environment.yml
conda activate turboquant

Run tests:

python -m pytest tests -q

Expected result:

9 passed

Data and Model Preparation

The reproduction uses:

Model: meta-llama/Llama-3.1-8B-Instruct
Benchmark: LongBench English Table 1 tasks

Set local model and dataset paths in:

configs/paths.yaml
configs/llama_first.yaml

A typical configuration uses the following environment variables:

export HF_HOME=/path/to/huggingface/cache
export DATA_ROOT=/path/to/turboquant/data
export MODEL_PATH=/path/to/Llama-3.1-8B-Instruct

Prepare or update the LongBench cache entries:

python scripts/prepare_longbench_cache.py \
  --cache-root "$DATA_ROOT/hf_cache" \
  --output-report reproduce/logs/longbench_cache_prepare_report.json

How to Run

Single LongBench Task

Full Cache example:

python experiments/longbench/run_full_cache_eval.py \
  --dataset-key longbench_2wikimqa \
  --device cuda:0 \
  --cache-mode full \
  --prompt-mode longbench \
  --chat-template-mode auto \
  --start-index 0 \
  --end-index 200 \
  --output reproduce/runs/table1_official/longbench_2wikimqa_full_cache_all.jsonl \
  --progress-every 20

TurboQuant 2.5 bit example:

python experiments/longbench/run_full_cache_eval.py \
  --dataset-key longbench_2wikimqa \
  --device cuda:0 \
  --cache-mode turboquant \
  --kv-bits 2.5 \
  --turboquant-fast-materialized-eval \
  --prompt-mode longbench \
  --chat-template-mode auto \
  --start-index 0 \
  --end-index 200 \
  --output reproduce/runs/table1_official/longbench_2wikimqa_turboquant_2_5bit_all.jsonl \
  --progress-every 20

TurboQuant 3.5 bit example:

python experiments/longbench/run_full_cache_eval.py \
  --dataset-key longbench_2wikimqa \
  --device cuda:0 \
  --cache-mode turboquant \
  --kv-bits 3.5 \
  --turboquant-fast-materialized-eval \
  --prompt-mode longbench \
  --chat-template-mode auto \
  --start-index 0 \
  --end-index 200 \
  --output reproduce/runs/table1_official/longbench_2wikimqa_turboquant_3_5bit_all.jsonl \
  --progress-every 20

Score a generated JSONL file:

python scripts/summarize_jsonl_accuracy.py \
  reproduce/runs/table1_official/longbench_2wikimqa_turboquant_2_5bit_all.jsonl \
  --recompute-longbench-score \
  --output reproduce/runs/table1_official/longbench_2wikimqa_turboquant_2_5bit_all.aggregate.json

Full Table 1 Reproduction

Run Full Cache jobs:

bash scripts/run_table1_full_cache_parallel.sh

Run TurboQuant jobs:

python scripts/queue_table1_turboquant_jobs.py

Recompute LongBench-compatible metrics for all Table 1 outputs:

python scripts/recompute_table1_official_metrics.py \
  --run-root reproduce/runs/table1_official \
  --workers 8

Build final comparison tables:

python scripts/build_table1_official_comparison.py \
  --run-root reproduce/runs/table1_official \
  --output-prefix reproduce/TABLE1_OFFICIAL_COMPARISON

Experimental Results

Scope:

Model: meta-llama/Llama-3.1-8B-Instruct
Benchmark: LongBench V1 English Table 1 tasks
Settings: Full Cache, TurboQuant 2.5 bit, TurboQuant 3.5 bit

Category Level Results

Method	KV Size	Source	SingleQA	MultiQA	Summarization	Few shot	Synthetic	Code	Average
Full Cache	16.0	paper	45.29	45.16	26.55	68.38	59.54	46.28	50.06
Full Cache	16.0	local	44.45	44.18	29.29	69.53	52.60	62.26	50.38
TurboQuant	2.5	paper	44.16	44.96	24.80	68.01	59.65	45.76	49.44
TurboQuant	2.5	local	38.61	36.01	27.54	67.12	48.22	55.02	45.42
TurboQuant	3.5	paper	45.01	45.31	26.00	68.63	59.95	46.17	50.06
TurboQuant	3.5	local	42.73	43.04	28.72	68.59	52.06	61.16	49.38

Task Level Local Scores

Category	Dataset	Full Cache	TurboQuant 2.5 bit	TurboQuant 3.5 bit
SingleQA	narrativeqa	31.13	25.35	29.25
SingleQA	qasper	46.77	42.44	46.32
SingleQA	multifieldqa_en	55.44	48.03	52.61
MultiQA	hotpotqa	55.09	44.83	54.65
MultiQA	2wikimqa	46.36	37.96	44.47
MultiQA	musique	31.09	25.24	30.01
Summarization	gov_report	34.98	32.02	34.08
Summarization	qmsum	25.56	24.38	25.33
Summarization	multi_news	27.32	26.22	26.77
Few shot	trec	72.50	70.50	72.00
Few shot	triviaqa	92.48	89.39	91.34
Few shot	samsum	43.61	41.47	42.43
Synthetic	passage_retrieval_en	99.50	93.00	100.00
Synthetic	passage_count	5.71	3.45	4.12
Code	lcc	66.61	57.04	64.04
Code	repobench-p	57.90	53.00	58.29

Run Completeness

Method	Complete Tasks	Expected Tasks
Full Cache	16	16
TurboQuant 2.5 bit	16	16
TurboQuant 3.5 bit	16	16

Reproduction Notes

The final comparison table is generated by:

python scripts/build_table1_official_comparison.py \
  --run-root reproduce/runs/table1_official \
  --output-prefix reproduce/TABLE1_OFFICIAL_COMPARISON

Scores are computed with LongBench-compatible prompt templates, maximum generation lengths, and metrics.

Large generated outputs under the following directory are intentionally excluded from version control:

reproduce/runs/

Limitations

This repository focuses on the Llama 3.1 8B Instruct LongBench Table 1 KV cache experiments. Results may differ from the paper because of environment differences, model checkpoint revisions, tokenizer or chat template behavior, decoding settings, hardware, and dependency versions.

This repository does not currently include optimized low-level kernels for packed low-bit KV cache inference. The current implementation is designed for reproducible evaluation and analysis.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TurboQuant Reproduction

Scope

Repository Layout

Environment

Data and Model Preparation

How to Run

Single LongBench Task

Full Table 1 Reproduction

Experimental Results

Category Level Results

Task Level Local Scores

Run Completeness

Reproduction Notes

Limitations

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
configs		configs
experiments		experiments
reproduce		reproduce
scripts		scripts
tests		tests
turboquant		turboquant
.gitignore		.gitignore
README.md		README.md
environment.yml		environment.yml
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

TurboQuant Reproduction

Scope

Repository Layout

Environment

Data and Model Preparation

How to Run

Single LongBench Task

Full Table 1 Reproduction

Experimental Results

Category Level Results

Task Level Local Scores

Run Completeness

Reproduction Notes

Limitations

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages