This is a Bonsai-focused fork of llama.cpp that adds TurboQuant KV cache support for Bonsai's 1-bit GGUF model.
The important distinction is that the model weights stay 1-bit quantized. TurboQuant changes the KV cache format, which is where the large VRAM savings come from.
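To see roughly where those savings come from, here is a back-of-envelope estimate of KV cache size at different per-element widths. The architecture numbers (layer count, KV heads, head dimension) are placeholders for an 8B-class model, not Bonsai-8B's actual configuration, and the 3-/4-bit figures ignore whatever per-block scale overhead the TurboQuant formats add.

```python
# Rough KV cache sizing sketch -- hypothetical 8B-class architecture,
# not Bonsai-8B's real config. TurboQuant blocks likely carry extra
# scale/metadata bytes that are not counted here.
n_layers   = 32      # transformer layers (assumed)
n_kv_heads = 8       # KV heads after GQA (assumed)
head_dim   = 128     # dimension per head (assumed)
n_ctx      = 32768   # context length

def kv_bytes(bits_per_elem: float) -> float:
    # K and V caches: 2 tensors per layer, each n_ctx x (n_kv_heads * head_dim)
    elems = 2 * n_layers * n_ctx * n_kv_heads * head_dim
    return elems * bits_per_elem / 8

for name, bits in [("f16", 16), ("tbq4_0 (~4-bit)", 4), ("tbq3_0 (~3-bit)", 3)]:
    print(f"{name:18s} ~{kv_bytes(bits) / 2**30:.1f} GiB")
```

With these placeholder numbers the cache shrinks from about 4 GiB at f16 to roughly 1 GiB at 4 bits and 0.8 GiB at 3 bits, while the 1-bit model weights are unchanged.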
- `tbq3_0` and `tbq4_0` KV cache types in GGML.
- CPU and CUDA backend support for the new cache formats.
- `llama-server`, `llama-cli`, and `llama-bench` support for selecting those cache types.
- Validation against Bonsai's `Bonsai-8B.gguf` model with the OpenAI-compatible server.
CUDA build:
```
cmake -S . -B build-tbq-cuda -DGGML_CUDA=ON -DLLAMA_BUILD_SERVER=ON -DLLAMA_BUILD_TOOLS=ON
cmake --build build-tbq-cuda -j
```

If you only need CPU support, build without `-DGGML_CUDA=ON`.
Run the server with the TurboQuant cache types:

```
./build-tbq-cuda/bin/llama-server -m /path/to/Bonsai-8B.gguf \
  --cache-type-k tbq4_0 \
  --cache-type-v tbq3_0
```

The same cache flags work with `llama-cli` and `llama-bench`.
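Once the server is up, you can exercise the OpenAI-compatible endpoint. A minimal sketch, assuming the default host and port (`127.0.0.1:8080`) and the `requests` package; adjust the URL if you pass `--host`/`--port`.

```python
# Minimal smoke test against llama-server's OpenAI-compatible API.
# Assumes the server from the command above is listening on the
# default 127.0.0.1:8080.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "bonsai-8b",  # placeholder; the server answers with the loaded GGUF
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```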
If you use the Bonsai wrapper repo, point it at this build and set:
```
BONSAI_LLAMA_BIN_DIR=/path/to/build-tbq-cuda/bin
BONSAI_CACHE_TYPE_K=tbq4_0
BONSAI_CACHE_TYPE_V=tbq3_0
```

This fork is based on PrismML's llama.cpp branch and inherits the upstream project history and license. For the full upstream feature set and docs, see ggml-org/llama.cpp.
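As an aside, here is a hypothetical sketch of how a wrapper script could consume those environment variables to launch this build. Only the variable names and CLI flags come from this README; everything else (defaults, the model-path argument) is illustrative, not the Bonsai wrapper's actual code.

```python
# Hypothetical illustration of turning the environment variables above
# into a llama-server invocation. Not the Bonsai wrapper's real code.
import os
import subprocess
import sys

bin_dir = os.environ["BONSAI_LLAMA_BIN_DIR"]           # e.g. /path/to/build-tbq-cuda/bin
cache_k = os.environ.get("BONSAI_CACHE_TYPE_K", "tbq4_0")
cache_v = os.environ.get("BONSAI_CACHE_TYPE_V", "tbq3_0")
model   = sys.argv[1]                                   # path to Bonsai-8B.gguf

subprocess.run(
    [
        os.path.join(bin_dir, "llama-server"),
        "-m", model,
        "--cache-type-k", cache_k,
        "--cache-type-v", cache_v,
    ],
    check=True,
)
```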