llama.cpp-1bit-turboquant

This is a Bonsai-focused fork of llama.cpp that adds TurboQuant KV cache support for Bonsai's 1-bit GGUF model.

The important distinction: the model weights stay 1-bit quantized, and TurboQuant changes only the KV cache format, which is where the large VRAM savings come from.
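As a rough back-of-envelope (assuming a Llama-style 8B geometry of 32 layers, 8 KV heads, and head dimension 128, which may not match Bonsai-8B exactly): an f16 KV cache costs 2 × 32 × 8 × 128 × 2 bytes = 128 KiB per token, or about 4 GiB at a 32k context. A roughly 4-bit cache format, as the tbq4_0 name suggests, cuts that to about 1 GiB, while the 1-bit weights are unaffected.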

What this fork adds

  • tbq3_0 and tbq4_0 KV cache types in GGML.
  • CPU and CUDA backend support for the new cache formats.
  • llama-server, llama-cli, and llama-bench support for selecting those cache types.
  • Validation against Bonsai's Bonsai-8B.gguf model with the OpenAI-compatible server.

Build

CUDA build:

cmake -S . -B build-tbq-cuda -DGGML_CUDA=ON -DLLAMA_BUILD_SERVER=ON -DLLAMA_BUILD_TOOLS=ON
cmake --build build-tbq-cuda -j

If you only need CPU support, build without -DGGML_CUDA=ON.
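For reference, a CPU-only configure could look like this (build-tbq-cpu is just an example build directory name):

cmake -S . -B build-tbq-cpu -DLLAMA_BUILD_SERVER=ON -DLLAMA_BUILD_TOOLS=ON
cmake --build build-tbq-cpu -j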

Run

./build-tbq-cuda/bin/llama-server -m /path/to/Bonsai-8B.gguf \
  --cache-type-k tbq4_0 \
  --cache-type-v tbq3_0
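
Once the server is up, you can smoke-test the OpenAI-compatible endpoint with curl. This sketch assumes llama-server's defaults (host 127.0.0.1, port 8080); adjust if you pass --host or --port:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32}'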

The same cache flags work with llama-cli and llama-bench.
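For a quick before/after comparison of the cache formats, something like the following should work with llama-bench (which spells the flags -ctk/-ctv):

# baseline: default f16 KV cache
./build-tbq-cuda/bin/llama-bench -m /path/to/Bonsai-8B.gguf -ctk f16 -ctv f16
# TurboQuant KV cache
./build-tbq-cuda/bin/llama-bench -m /path/to/Bonsai-8B.gguf -ctk tbq4_0 -ctv tbq3_0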

Bonsai integration

If you use the Bonsai wrapper repo, point it at this build and set:

BONSAI_LLAMA_BIN_DIR=/path/to/build-tbq-cuda/bin
BONSAI_CACHE_TYPE_K=tbq4_0
BONSAI_CACHE_TYPE_V=tbq3_0
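
In a shell that launches the wrapper, that amounts to the following (this assumes the wrapper reads these from the environment; its own start command is whatever its docs specify and is not shown here):

export BONSAI_LLAMA_BIN_DIR=/path/to/build-tbq-cuda/bin
export BONSAI_CACHE_TYPE_K=tbq4_0
export BONSAI_CACHE_TYPE_V=tbq3_0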

Upstream

This fork is based on PrismML's llama.cpp branch and inherits the upstream project history and license. For the full upstream feature set and docs, see ggml-org/llama.cpp.
