This guide explains how to build vlut.cpp and reproduce the evaluation results reported in the paper.
Complete Preparation first, then follow one of the three workflows:
| Workflow | Goal | Estimated time |
|---|---|---|
| Quick Functional Check | Verify the build runs correctly. | < 5 min |
| Minimal Reproduction | Reproduce vlut.cpp results on one device and model. | < 1 hr |
| Full Comparison | Compare against all baselines from the paper. | > 1 day per device |
Preparation itself takes about 30 minutes (build + model download).
vlut.cpp targets modern x86 and ARM CPUs.
Minimum requirements:
| Resource | Inference only | With model conversion |
|---|---|---|
| CPU | x86_64 with AVX2, or ARMv8 with NEON/SVE | same |
| RAM | 4 GB | 24 GB |
| Disk | 16 GB | 64 GB (full comparison) |
Software: Linux, WSL2, or Android (Termux). Python 3.10+.
For stable benchmarking results:
- Use only performance cores (P-cores) on heterogeneous CPUs.
- Minimize background processes during benchmarking.
- Set power management to high-performance mode (e.g., set the battery to performance mode on laptops, or set the CPU governor to `performance` on Linux).
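As a concrete illustration of these tips, the following sketch pins a run to fixed cores and sets their frequency governor. It assumes a Linux machine with the `cpufreq` sysfs interface and the `taskset` utility; the core IDs `0-3` are placeholders for your machine's P-cores.

```shell
# Illustration only: core IDs 0-3 are placeholders for your P-cores.
# Set the frequency governor of cores 0-3 to "performance" (requires root).
for c in 0 1 2 3; do
  echo performance | sudo tee /sys/devices/system/cpu/cpu"$c"/cpufreq/scaling_governor
done

# Pin the benchmark process to the same cores so the scheduler cannot
# migrate it onto efficiency cores mid-run.
taskset -c 0-3 ./build/bin/llama-batched -m model.gguf -p "test" -n 16
```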
vlut.cpp supports several ternary model families. The paper evaluates three of them:
| Short name | HuggingFace repository |
|---|---|
| HF BitNet 3B | 1bitLLM/bitnet_b1_58-3B |
| Llama3 8B | HF1BitLLM/Llama3-8B-1.58-100B-tokens |
| Falcon3 1B | tiiuae/Falcon3-1B-Instruct-1.58bit |
Pre-converted GGUF models are hosted in this HuggingFace collection:
| Model | Pre-converted repository |
|---|---|
| HF BitNet 3B | XXXXyu/bitnet_b1_58-3B-vlut-gguf |
| Llama3 8B | XXXXyu/Llama3-8B-1.58-100B-tokens-vlut-gguf |
| Falcon3 1B | XXXXyu/Falcon3-1B-Instruct-1.58bit-vlut-gguf |
Store models locally using the model's short name as the directory name:
```
~/models/
├── Llama3-8B-1.58-100B-tokens/
│   ├── ggml-model-I1_V_2.gguf
│   └── ggml-model-I2_V_4.gguf
├── bitnet_b1_58-3B/
│   └── ...
└── Falcon3-1B-Instruct-1.58bit/
    └── ...
```
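The layout above can be created up front; a small sketch:

```shell
# Create the expected per-model directories under ~/models.
mkdir -p ~/models/Llama3-8B-1.58-100B-tokens \
         ~/models/bitnet_b1_58-3B \
         ~/models/Falcon3-1B-Instruct-1.58bit
```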
```shell
cmake -B build
cmake --build build --config Release -j4   # adjust -j to your CPU core count
```

Optional CMake flags:
| Flag | Purpose |
|---|---|
| `-DVLUT_SVE=ON` | Enable SVE intrinsics (e.g., AWS Graviton 3). |
| `-DTABLE_ENTRY_SIZE=<N>` | Set the N-tile size as described in the paper (default: 32). |
```shell
conda create -y -n vlut-cpp python=3.10
conda activate vlut-cpp
python -m pip install huggingface_hub pandas matplotlib
```

Download the pre-converted models. For example, Llama3 8B (~2 GB per .gguf file):
```shell
hf download XXXXyu/Llama3-8B-1.58-100B-tokens-vlut-gguf \
    ggml-model-I1_V_2.gguf \
    ggml-model-I2_V_4.gguf \
    --local-dir ~/models/Llama3-8B-1.58-100B-tokens
```

Other models (BitNet 3B, Falcon3 1B) are smaller. Replace the repository ID and `--local-dir` accordingly.
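To fetch all three pre-converted models in one go, a loop along these lines should work. It relies on the repository naming pattern `<short name>-vlut-gguf` from the table above, and omits explicit file names so each repository is downloaded in full:

```shell
# Download each pre-converted repository into ~/models/<short name>.
for m in bitnet_b1_58-3B Llama3-8B-1.58-100B-tokens Falcon3-1B-Instruct-1.58bit; do
  hf download "XXXXyu/${m}-vlut-gguf" --local-dir ~/models/"$m"
done
```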
Skip this section if you use the pre-converted GGUF models above.
Install conversion dependencies:
```shell
python -m pip install -r requirements.txt
```

Download the original HuggingFace weights (e.g., Llama3 8B):
```shell
hf download HF1BitLLM/Llama3-8B-1.58-100B-tokens \
    --local-dir ~/models/Llama3-8B-1.58-100B-tokens
```

Convert to the vlut.cpp intermediate GGUF format:
```shell
python ./convert_hf_to_gguf_vlut.py \
    ~/models/Llama3-8B-1.58-100B-tokens \
    --outfile ~/models/Llama3-8B-1.58-100B-tokens/Llama3-8B-1.58-100B-tokens.vlut.gguf
```

Quantize the intermediate GGUF into one or more Vec-LUT packings:
```shell
./build/bin/llama-quantize \
    ~/models/Llama3-8B-1.58-100B-tokens/Llama3-8B-1.58-100B-tokens.vlut.gguf I1_V_2
./build/bin/llama-quantize \
    ~/models/Llama3-8B-1.58-100B-tokens/Llama3-8B-1.58-100B-tokens.vlut.gguf I2_V_4
./build/bin/llama-quantize \
    ~/models/Llama3-8B-1.58-100B-tokens/Llama3-8B-1.58-100B-tokens.vlut.gguf I2_V_8
```

This produces files in the same directory with default names (`ggml-model-I1_V_2.gguf`, etc.). Do not rename them — the evaluation scripts depend on these names.
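The three invocations above can equivalently be written as a loop, which is convenient when quantizing several models:

```shell
# Quantize one intermediate GGUF into all three Vec-LUT packings.
SRC=~/models/Llama3-8B-1.58-100B-tokens/Llama3-8B-1.58-100B-tokens.vlut.gguf
for q in I1_V_2 I2_V_4 I2_V_8; do
  ./build/bin/llama-quantize "$SRC" "$q"
done
```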
Run a single batched-decoding pass to verify the build:
```shell
./build/bin/llama-batched \
    -m ~/models/Llama3-8B-1.58-100B-tokens/ggml-model-I1_V_2.gguf \
    -p "I believe" \
    -np 32 \
    -n 16 \
    -t 4 \
    -ngl 0 \
    --temp 0.5 \
    --repeat-penalty 1.5
```

Expected: the model loads without errors and the program prints generated text from 32 parallel sequences. This corresponds to `evaluation/demo/run_batched_decode.sh`.
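If loading fails, a quick way to check that a download or conversion is not truncated or corrupted is to inspect the file's magic bytes: GGUF files begin with the four ASCII bytes `GGUF`. A small helper sketch (the function name and the path in the usage comment are placeholders):

```shell
# Print OK if the file starts with the GGUF magic, BAD otherwise.
check_gguf() {
  if [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]; then
    echo "OK: $1"
  else
    echo "BAD: $1"
  fi
}

# Example usage (path is a placeholder):
# check_gguf ~/models/Llama3-8B-1.58-100B-tokens/ggml-model-I1_V_2.gguf
```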
This section walks through a minimal evaluation of vlut.cpp on one device and one model. Only vlut.cpp and pre-converted models are needed — no baseline frameworks. The same wrapper scripts are used in the full comparison workflow; here they skip missing baselines with warnings.
To evaluate additional devices or models, repeat the steps below with different hardware and a different MODEL_DIR.
The minimal workflow runs three benchmarks:
| # | Benchmark | What it measures | Script |
|---|---|---|---|
| 1 | GeMM | Raw matrix-multiply throughput. | bench-gemm.sh |
| 2 | End-to-end prefilling | Prompt processing speed. | bench-e2e-prefill.sh |
| 3 | End-to-end batched decoding | Parallel token generation. | bench-e2e-batch.sh |
Choose a device identifier and point to the model directory. The device identifier is used to distinguish results from different machines when plotting them together.
```shell
export DEVICE_NAME=mydevice
export MODEL_DIR=~/models/Llama3-8B-1.58-100B-tokens
```

Measures raw ternary GeMM throughput for the built-in model shapes.
```shell
# Single-threaded
./evaluation/scripts/bench-gemm.sh "$DEVICE_NAME" 1 256

# Multi-threaded (adjust thread count to your CPU)
./evaluation/scripts/bench-gemm.sh "$DEVICE_NAME" 4 256
```

Arguments: `<device_name> <threads> <sequence_length> [entry_size]`
- `device_name` — identifier used in output directory names.
- `threads` — number of threads.
- `sequence_length` — the N dimension (sequence length to benchmark).
- `entry_size` — optional, LUT entry size in bytes (default: 32).
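To sweep several thread counts in one go, a convenience loop such as the following can be used (the thread counts shown are arbitrary examples, not values mandated by the paper):

```shell
# Run the GeMM benchmark at sequence length 256 for several thread counts.
for t in 1 2 4 8; do
  ./evaluation/scripts/bench-gemm.sh "$DEVICE_NAME" "$t" 256
done
```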
Output: evaluation/results_gemm_${DEVICE_NAME}/ containing .log and .csv files, one pair per model shape.
Troubleshooting: If you encounter an `awk` error, edit `bench-gemm.sh` and replace `$SCRIPT_DIR/test-to-csv.sh` with `$SCRIPT_DIR/test-to-csv-backup.sh`.
Measures prompt processing throughput (tokens/sec) across prompt lengths and thread counts.
```shell
DEVICE_NAME="$DEVICE_NAME" \
MODEL_DIR="$MODEL_DIR" \
./evaluation/scripts/bench-e2e-prefill.sh
```

Optional environment variables:
| Variable | Default | Description |
|---|---|---|
| `PROMPT_LENGTH` | `128,256,512` | Comma-separated prompt lengths to test. |
| `THREAD_COUNT` | `1,4` | Comma-separated thread counts. |
| `REPEAT_COUNT` | `3` | Repetitions per configuration. |
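For example, to run a longer sweep, the variables can be combined freely on the command line (the values below are illustrative, not the paper's settings):

```shell
PROMPT_LENGTH=128,256,512,1024 \
THREAD_COUNT=1,2,4,8 \
REPEAT_COUNT=5 \
DEVICE_NAME="$DEVICE_NAME" \
MODEL_DIR="$MODEL_DIR" \
./evaluation/scripts/bench-e2e-prefill.sh
```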
Output: evaluation/results_e2e_prefill_${DEVICE_NAME}/<model_name>/ with .txt log and .csv files for each quantization variant found in MODEL_DIR.
Note: This script deletes the target result directory before writing, so previous results for the same device and model are overwritten.
Measures parallel decoding throughput (tokens/sec) across batch sizes and thread counts.
```shell
DEVICE_NAME="$DEVICE_NAME" \
MODEL_DIR="$MODEL_DIR" \
./evaluation/scripts/bench-e2e-batch.sh
```

Optional environment variables:
| Variable | Default | Description |
|---|---|---|
| `PREFILL_LEN` | `16` | Prompt length before decoding. |
| `TOKEN_GEN_LENS` | `16` | Comma-separated generation lengths. |
| `PARALLEL_SEQS` | `64,128,256` | Comma-separated batch sizes. |
| `THREAD_COUNT` | `4` | Comma-separated thread counts. |
Output: evaluation/results_e2e_batch_${DEVICE_NAME}/<model_name>/ with .txt log and .csv files for each quantization variant found in MODEL_DIR.
Note: This script also deletes the target result directory before writing.
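Since both e2e scripts wipe their output directory, a small helper can snapshot previous results before a re-run. This is a sketch; the function name and backup naming scheme are arbitrary:

```shell
# Copy a results directory to a timestamped backup if it exists.
backup_dir() {
  [ -d "$1" ] || return 0
  cp -r "$1" "$1.bak-$(date +%Y%m%d_%H%M%S)"
}

backup_dir "evaluation/results_e2e_prefill_${DEVICE_NAME}"
backup_dir "evaluation/results_e2e_batch_${DEVICE_NAME}"
```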
After all three steps, you should have the following directory structure:
```
evaluation/
├── results_gemm_mydevice/
│   ├── llama3_8b_t1_ns256_s32.log
│   ├── llama3_8b_t1_ns256_s32.csv
│   ├── llama3_8b_t4_ns256_s32.log
│   └── llama3_8b_t4_ns256_s32.csv
├── results_e2e_prefill_mydevice/
│   └── Llama3-8B-1.58-100B-tokens/
│       ├── ggml-model-I1_V_2_p128-256-512_t1-4_r3_<date>_<time>.txt
│       ├── ggml-model-I1_V_2_p128-256-512_t1-4_r3_<date>_<time>.csv
│       ├── ggml-model-I2_V_4_p128-256-512_t1-4_r3_<date>_<time>.txt
│       └── ggml-model-I2_V_4_p128-256-512_t1-4_r3_<date>_<time>.csv
└── results_e2e_batch_mydevice/
    └── Llama3-8B-1.58-100B-tokens/
        ├── ggml-model-I1_V_2_npp16_ntg16_npl64-128-256_t4_<date>_<time>.txt
        ├── ggml-model-I1_V_2_npp16_ntg16_npl64-128-256_t4_<date>_<time>.csv
        ├── ggml-model-I2_V_4_npp16_ntg16_npl64-128-256_t4_<date>_<time>.txt
        └── ggml-model-I2_V_4_npp16_ntg16_npl64-128-256_t4_<date>_<time>.csv
```
To extend the evaluation:
- Another device: run on different hardware with a new `DEVICE_NAME` (e.g., `laptop_amd`). Results are stored in separate directories per device.
- Another model: change `MODEL_DIR` to a different model directory (e.g., `~/models/bitnet_b1_58-3B`) and re-run Steps 2–3. The GeMM benchmark (Step 1) uses built-in shapes and does not depend on model files, so it only needs to be run once per device.
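The per-model repetition can be scripted; a sketch assuming all three model directories exist under `~/models` and `DEVICE_NAME` is already exported:

```shell
# Run the two model-dependent benchmarks for each model in turn.
for m in Llama3-8B-1.58-100B-tokens bitnet_b1_58-3B Falcon3-1B-Instruct-1.58bit; do
  MODEL_DIR=~/models/"$m" ./evaluation/scripts/bench-e2e-prefill.sh
  MODEL_DIR=~/models/"$m" ./evaluation/scripts/bench-e2e-batch.sh
done
```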
When run in a vlut.cpp-only workspace, the wrapper scripts:
- Run all vlut.cpp quantization variants found in `MODEL_DIR`.
- Skip missing baseline frameworks (`llama.cpp`, `T-MAC`, `bitnet.cpp`) with warnings.
- Skip missing quantization files (e.g., `ggml-model-I2_V_8.gguf`) with warnings.
- Return a nonzero exit status only if an attempted benchmark actually fails.
This workflow uses the same scripts as the minimal workflow. The difference is that the baseline repositories and model files are present, so the scripts also benchmark llama.cpp, T-MAC, and bitnet.cpp.
Place all repositories as siblings:
```
workspace/
├── BitNet/       # bitnet.cpp
├── T-MAC/
├── llama.cpp/
└── vlut.cpp/
```
The wrapper scripts assume this layout by default.
Build the baselines following their upstream instructions:
| Baseline | Build instructions |
|---|---|
| llama.cpp | https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md |
| bitnet.cpp | https://github.com/microsoft/BitNet/blob/main/README.md#installation |
| T-MAC | https://github.com/microsoft/T-MAC/blob/main/README.md#installation |
Place the baseline model files in MODEL_DIR with these exact names:
| Framework | Required model files |
|---|---|
| llama.cpp | ggml-model-TQ2_0.gguf, ggml-model-TQ1_0.gguf |
| T-MAC | <model-name>.INT_N.gguf |
| bitnet.cpp | ggml-model-tl2.gguf, ggml-model-tl1.gguf, ggml-model-i2_s.gguf (when applicable) |
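Because the scripts silently skip missing files, it is worth confirming the expected names are present before committing to a day-long run. A small check; the file list below covers only some of the names from the table (extend it per framework and model as needed):

```shell
# Report which expected baseline files are missing from a model directory.
check_models() {
  dir=$1; shift
  for f in "$@"; do
    [ -f "$dir/$f" ] || echo "missing: $f"
  done
}

check_models "$MODEL_DIR" \
  ggml-model-TQ2_0.gguf ggml-model-TQ1_0.gguf \
  ggml-model-tl2.gguf ggml-model-i2_s.gguf
```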
Important:
- `T-MAC` and `bitnet.cpp` may need re-compilation per model or quantization variant.
- Do not mix converted model files across frameworks.
The commands are identical to the Minimal Reproduction steps. With baselines present, the same scripts automatically include them:
GeMM:
```shell
./evaluation/scripts/bench-gemm.sh "$DEVICE_NAME" 1 256
./evaluation/scripts/bench-gemm.sh "$DEVICE_NAME" 4 256
```

For T-MAC GeMM results specifically:
```shell
./evaluation/scripts/bench-gemm-tmac.sh \
    --device "$DEVICE_NAME" \
    --tmac_path ../T-MAC
```

End-to-end prefilling:
```shell
DEVICE_NAME="$DEVICE_NAME" \
MODEL_DIR="$MODEL_DIR" \
./evaluation/scripts/bench-e2e-prefill.sh
```

With baselines present, this also benchmarks llama.cpp (`TQ2_0`, `TQ1_0`), T-MAC (`INT_N`), and bitnet.cpp (`tl2`, `tl1`, `i2_s`).
End-to-end batched decoding:
```shell
DEVICE_NAME="$DEVICE_NAME" \
MODEL_DIR="$MODEL_DIR" \
./evaluation/scripts/bench-e2e-batch.sh
```

Note: T-MAC does not always build `llama-batched-bench` by default. If needed, build it manually in `T-MAC/3rdparty/llama.cpp` after each T-MAC rebuild.
The plotting scripts read the raw result directories and produce PDF figures and CSV summary reports.
```shell
python evaluation/scripts/plot/plot_gemm_combined.py --both
python evaluation/scripts/plot/plot_e2e_prefill_combined.py --both
python evaluation/scripts/plot/plot_e2e_batch_combined.py --both
```

Each script accepts `--single-thread`, `--multi-thread`, or `--both` (default).
Output:
| Type | Directory |
|---|---|
| Figures (PDF) | evaluation/figures/ |
| GeMM reports (CSV) | evaluation/reports_gemm/ |
| Prefill reports (CSV) | evaluation/reports_e2e_prefill/ |
| Batch reports (CSV) | evaluation/reports_e2e_batch/ |
The device identifier `mydevice` is recognized by the plotting scripts as a built-in placeholder. The paper's canonical device identifiers are: `pc_intel`, `laptop_amd`, `orangepi`, `smartphone`, `aws_arm`.
Note: For the minimal reproduction (without baselines), the CSV speedup reports may be incomplete because they require baseline results for comparison. This is expected. You can still verify the raw result CSV files and the plot figures, which show vlut.cpp results on their own.
Absolute throughput varies with hardware, compiler, and thermal conditions. The relative trends should match the paper:
- `vlut.cpp` generally outperforms baselines in GeMM throughput, prefilling speed, and parallel decoding throughput.
- The `I1` quantization remains competitive while using fewer bits; other sub-2-bit baselines degrade significantly relative to 2-bit.
- On AWS Graviton 3, `llama.cpp`'s `Q4_0` fallback for HF BitNet 3B can be relatively strong due to ARM-specific 4-bit kernel optimizations.
To find the best performance on a given device, you can tune the TABLE_ENTRY_SIZE and quantization type.
vlut.cpp exposes two performance-critical tuning dimensions:
| Parameter | Set at | Description |
|---|---|---|
| `TABLE_ENTRY_SIZE` | CMake configure time | N-tile size as described in the paper (default: 32). |
| Quantization type (`I1_V_2`, `I2_V_4`, `I2_V_8`) | Quantization time | K-tiling / packed weight layout. |
Example — building with a different entry size:
```shell
cmake -B build-entry64 -DTABLE_ENTRY_SIZE=64
cmake --build build-entry64 --config Release -j4
```

`evaluation/scripts/search-config.sh` sweeps over entry sizes (16, 32, 64) and thread counts (1, 4, 8) to find the best configuration for your hardware:
```shell
./evaluation/scripts/search-config.sh 1   # search I1 variants
./evaluation/scripts/search-config.sh 2   # search I2 variants
```

Output:
- `evaluation/results_search/scores1.csv` or `scores2.csv`.
- The per-thread-count optimal configuration, printed to stdout.
| Symptom | Fix |
|---|---|
| `awk` errors in `bench-gemm.sh` | Replace `$SCRIPT_DIR/test-to-csv.sh` with `$SCRIPT_DIR/test-to-csv-backup.sh` in the script. |
| Previous results overwritten | bench-e2e-prefill.sh and bench-e2e-batch.sh delete target directories before writing; back up results if needed. |
| T-MAC `llama-batched-bench` missing | Build it manually in `T-MAC/3rdparty/llama.cpp`. |
| Incomplete CSV speedup reports | Expected when baselines are missing; verify raw results and figures instead. |