This is a Bonsai-focused fork of llama.cpp that adds TurboQuant KV cache support for Bonsai's 1-bit GGUF model.
The important distinction is that the model weights stay 1-bit quantized. TurboQuant changes the KV cache format, which is where the large VRAM savings come from.
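To see roughly where those savings come from, here is a back-of-envelope estimate of KV cache size at different per-element widths. The architecture numbers (layer count, KV heads, head dimension) are placeholders for an 8B-class model, not Bonsai-8B's actual configuration, and the 3-/4-bit figures ignore whatever per-block scale overhead the TurboQuant formats add.

```python
# Rough KV cache sizing sketch -- hypothetical 8B-class architecture,
# not Bonsai-8B's real config. TurboQuant blocks likely carry extra
# scale/metadata bytes that are not counted here.
n_layers   = 32      # transformer layers (assumed)
n_kv_heads = 8       # KV heads after GQA (assumed)
head_dim   = 128     # dimension per head (assumed)
n_ctx      = 32768   # context length

def kv_bytes(bits_per_elem: float) -> float:
    # K and V caches: 2 tensors per layer, each n_ctx x (n_kv_heads * head_dim)
    elems = 2 * n_layers * n_ctx * n_kv_heads * head_dim
    return elems * bits_per_elem / 8

for name, bits in [("f16", 16), ("tbq4_0 (~4-bit)", 4), ("tbq3_0 (~3-bit)", 3)]:
    print(f"{name:18s} ~{kv_bytes(bits) / 2**30:.1f} GiB")
```

With these placeholder numbers the cache shrinks from about 4 GiB at f16 to roughly 1 GiB at 4 bits and 0.8 GiB at 3 bits, while the 1-bit model weights are unchanged.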
- `tbq3_0` and `tbq4_0` KV cache types in GGML.
- CPU and CUDA backend support for the new cache formats.
- `llama-server`, `llama-cli`, and `llama-bench` support for selecting those cache types.
- Validation against Bonsai's `Bonsai-8B.gguf` model with the OpenAI-compatible server.
CUDA build:
```
cmake -S . -B build-tbq-cuda -DGGML_CUDA=ON -DLLAMA_BUILD_SERVER=ON -DLLAMA_BUILD_TOOLS=ON
cmake --build build-tbq-cuda -j
```

If you only need CPU support, build without `-DGGML_CUDA=ON`.
Run the server with the TurboQuant cache types:

```
./build-tbq-cuda/bin/llama-server -m /path/to/Bonsai-8B.gguf \
  --cache-type-k tbq4_0 \
  --cache-type-v tbq3_0
```

The same cache flags work with `llama-cli` and `llama-bench`.
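Once the server is up, you can exercise the OpenAI-compatible endpoint. A minimal sketch, assuming the default host and port (`127.0.0.1:8080`) and the `requests` package; adjust the URL if you pass `--host`/`--port`.

```python
# Minimal smoke test against llama-server's OpenAI-compatible API.
# Assumes the server from the command above is listening on the
# default 127.0.0.1:8080.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "bonsai-8b",  # placeholder; the server answers with the loaded GGUF
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```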
If you use the Bonsai wrapper repo, point it at this build and set:
```
BONSAI_LLAMA_BIN_DIR=/path/to/build-tbq-cuda/bin
BONSAI_CACHE_TYPE_K=tbq4_0
BONSAI_CACHE_TYPE_V=tbq3_0
```

This fork is based on PrismML's llama.cpp branch and inherits the upstream project history and license. For the full upstream feature set and docs, see ggml-org/llama.cpp.
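As an aside, here is a hypothetical sketch of how a wrapper script could consume those environment variables to launch this build. Only the variable names and CLI flags come from this README; everything else (defaults, the model-path argument) is illustrative, not the Bonsai wrapper's actual code.

```python
# Hypothetical illustration of turning the environment variables above
# into a llama-server invocation. Not the Bonsai wrapper's real code.
import os
import subprocess
import sys

bin_dir = os.environ["BONSAI_LLAMA_BIN_DIR"]           # e.g. /path/to/build-tbq-cuda/bin
cache_k = os.environ.get("BONSAI_CACHE_TYPE_K", "tbq4_0")
cache_v = os.environ.get("BONSAI_CACHE_TYPE_V", "tbq3_0")
model   = sys.argv[1]                                   # path to Bonsai-8B.gguf

subprocess.run(
    [
        os.path.join(bin_dir, "llama-server"),
        "-m", model,
        "--cache-type-k", cache_k,
        "--cache-type-v", cache_v,
    ],
    check=True,
)
```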