# Evaluation Guide

This guide explains how to build vlut.cpp and reproduce the evaluation results reported in the paper.

Complete Preparation first, then follow one of the three workflows:

| Workflow | Goal | Estimated time |
|---|---|---|
| Quick Functional Check | Verify the build runs correctly. | < 5 min |
| Minimal Reproduction | Reproduce vlut.cpp results on one device and model. | < 1 hr |
| Full Comparison | Compare against all baselines from the paper. | > 1 day per device |

Preparation itself takes about 30 minutes (build + model download).


## Preparation

### Environment

vlut.cpp targets modern x86 and ARM CPUs.

Minimum requirements:

| Resource | Inference only | With model conversion |
|---|---|---|
| CPU | x86_64 with AVX2, or ARMv8 with NEON/SVE | same |
| RAM | 4 GB | 24 GB |
| Disk | 16 GB | 64 GB (full comparison) |

Software: Linux, WSL2, or Android (Termux). Python 3.10+.

For stable benchmarking results:

- Use only performance cores (P-cores) on heterogeneous CPUs.
- Minimize background processes during benchmarking.
- Set power management to high-performance mode (e.g., battery set to performance mode on laptops, or CPU governor set to performance on Linux).
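On Linux, the governor setting can be applied as follows (a sketch; the sysfs path and available governors vary by system, and `cpupower` may need to be installed separately):

```shell
# Set all CPUs to the performance governor (requires root)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Or, if the cpupower utility is installed:
sudo cpupower frequency-set -g performance
```

On heterogeneous CPUs, prefixing a benchmark command with `taskset -c <p-core-list>` pins it to the P-cores.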

### Models

vlut.cpp supports several ternary model families. The paper evaluates three of them:

| Short name | HuggingFace repository |
|---|---|
| HF BitNet 3B | 1bitLLM/bitnet_b1_58-3B |
| Llama3 8B | HF1BitLLM/Llama3-8B-1.58-100B-tokens |
| Falcon3 1B | tiiuae/Falcon3-1B-Instruct-1.58bit |

Pre-converted GGUF models are hosted in this HuggingFace collection:

https://huggingface.co/collections/XXXXyu/vlutcpp

| Model | Pre-converted repository |
|---|---|
| HF BitNet 3B | XXXXyu/bitnet_b1_58-3B-vlut-gguf |
| Llama3 8B | XXXXyu/Llama3-8B-1.58-100B-tokens-vlut-gguf |
| Falcon3 1B | XXXXyu/Falcon3-1B-Instruct-1.58bit-vlut-gguf |

Store models locally using the model's short name as the directory name:

```text
~/models/
├── Llama3-8B-1.58-100B-tokens/
│   ├── ggml-model-I1_V_2.gguf
│   └── ggml-model-I2_V_4.gguf
├── bitnet_b1_58-3B/
│   └── ...
└── Falcon3-1B-Instruct-1.58bit/
    └── ...
```
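The layout above can be created up front, so that each download command has a target directory:

```shell
# One directory per model, named after its short name
mkdir -p ~/models/Llama3-8B-1.58-100B-tokens \
         ~/models/bitnet_b1_58-3B \
         ~/models/Falcon3-1B-Instruct-1.58bit
```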

### Build vlut.cpp

```shell
cmake -B build
cmake --build build --config Release -j4    # adjust -j to your CPU core count
```

Optional CMake flags:

| Flag | Purpose |
|---|---|
| `-DVLUT_SVE=ON` | Enable SVE intrinsics (e.g., AWS Graviton 3). |
| `-DTABLE_ENTRY_SIZE=<N>` | Set the N-tile size as described in the paper (default: 32). |

### Set Up Python

```shell
conda create -y -n vlut-cpp python=3.10
conda activate vlut-cpp
python -m pip install huggingface_hub pandas matplotlib
```

### Download Pre-converted Models

Download the pre-converted models. For example, Llama3 8B (~2 GB per .gguf file):

```shell
hf download XXXXyu/Llama3-8B-1.58-100B-tokens-vlut-gguf \
  ggml-model-I1_V_2.gguf \
  ggml-model-I2_V_4.gguf \
  --local-dir ~/models/Llama3-8B-1.58-100B-tokens
```

Other models (BitNet 3B, Falcon3 1B) are smaller. Replace the repository ID and --local-dir accordingly.
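For example, Falcon3 1B (repository name taken from the table above; omitting the file arguments fetches the whole repository, which avoids assuming its exact file names):

```shell
hf download XXXXyu/Falcon3-1B-Instruct-1.58bit-vlut-gguf \
  --local-dir ~/models/Falcon3-1B-Instruct-1.58bit
```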

### Convert Models (Optional)

Skip this section if you use the pre-converted GGUF models above.

#### Step 1 — Convert HuggingFace weights to GGUF

Install conversion dependencies:

```shell
python -m pip install -r requirements.txt
```

Download the original HuggingFace weights (e.g., Llama3 8B):

```shell
hf download HF1BitLLM/Llama3-8B-1.58-100B-tokens \
  --local-dir ~/models/Llama3-8B-1.58-100B-tokens
```

Convert to the vlut.cpp intermediate GGUF format:

```shell
python ./convert_hf_to_gguf_vlut.py \
  ~/models/Llama3-8B-1.58-100B-tokens \
  --outfile ~/models/Llama3-8B-1.58-100B-tokens/Llama3-8B-1.58-100B-tokens.vlut.gguf
```

#### Step 2 — Quantize to Vec-LUT packings

Quantize the intermediate GGUF into one or more Vec-LUT packings:

```shell
./build/bin/llama-quantize \
  ~/models/Llama3-8B-1.58-100B-tokens/Llama3-8B-1.58-100B-tokens.vlut.gguf I1_V_2

./build/bin/llama-quantize \
  ~/models/Llama3-8B-1.58-100B-tokens/Llama3-8B-1.58-100B-tokens.vlut.gguf I2_V_4

./build/bin/llama-quantize \
  ~/models/Llama3-8B-1.58-100B-tokens/Llama3-8B-1.58-100B-tokens.vlut.gguf I2_V_8
```

This produces files in the same directory with default names (ggml-model-I1_V_2.gguf, etc.). Do not rename them — the evaluation scripts depend on these names.
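A quick sanity check that the quantized files landed under the expected names (a sketch; adjust the path to your model directory):

```shell
ls ~/models/Llama3-8B-1.58-100B-tokens/ggml-model-I*_V_*.gguf
```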


## Quick Functional Check

Run a single batched-decoding pass to verify the build:

```shell
./build/bin/llama-batched \
  -m ~/models/Llama3-8B-1.58-100B-tokens/ggml-model-I1_V_2.gguf \
  -p "I believe" \
  -np 32 \
  -n 16 \
  -t 4 \
  -ngl 0 \
  --temp 0.5 \
  --repeat-penalty 1.5
```

Expected: The model loads without errors and the program prints generated text from 32 parallel sequences. This corresponds to evaluation/demo/run_batched_decode.sh.


## Minimal Reproduction

This section walks through a minimal evaluation of vlut.cpp on one device and one model. Only vlut.cpp and pre-converted models are needed — no baseline frameworks. The same wrapper scripts are used in the full comparison workflow; here they skip missing baselines with warnings.

To evaluate additional devices or models, repeat the steps below with different hardware and a different MODEL_DIR.

### Overview

The minimal workflow runs three benchmarks:

| # | Benchmark | What it measures | Script |
|---|---|---|---|
| 1 | GeMM | Raw matrix-multiply throughput. | bench-gemm.sh |
| 2 | End-to-end prefilling | Prompt processing speed. | bench-e2e-prefill.sh |
| 3 | End-to-end batched decoding | Parallel token generation. | bench-e2e-batch.sh |

### Set Variables

Choose a device identifier and point to the model directory. The device identifier is used to distinguish results from different machines when plotting them together.

```shell
export DEVICE_NAME=mydevice
export MODEL_DIR=~/models/Llama3-8B-1.58-100B-tokens
```

### Step 1 — GeMM Benchmark

Measures raw ternary GeMM throughput for the built-in model shapes.

```shell
# Single-threaded
./evaluation/scripts/bench-gemm.sh "$DEVICE_NAME" 1 256

# Multi-threaded (adjust thread count to your CPU)
./evaluation/scripts/bench-gemm.sh "$DEVICE_NAME" 4 256
```

Arguments: `<device_name> <threads> <sequence_length> [entry_size]`

- device_name — identifier used in output directory names.
- threads — number of threads.
- sequence_length — the N dimension (sequence length to benchmark).
- entry_size — optional, LUT entry size in bytes (default: 32).

Output: evaluation/results_gemm_${DEVICE_NAME}/ containing .log and .csv files, one pair per model shape.

Troubleshooting: If you encounter an awk error, edit bench-gemm.sh and replace $SCRIPT_DIR/test-to-csv.sh with $SCRIPT_DIR/test-to-csv-backup.sh.

### Step 2 — End-to-End Prefilling

Measures prompt processing throughput (tokens/sec) across prompt lengths and thread counts.

```shell
DEVICE_NAME="$DEVICE_NAME" \
MODEL_DIR="$MODEL_DIR" \
./evaluation/scripts/bench-e2e-prefill.sh
```

Optional environment variables:

| Variable | Default | Description |
|---|---|---|
| PROMPT_LENGTH | 128,256,512 | Comma-separated prompt lengths to test. |
| THREAD_COUNT | 1,4 | Comma-separated thread counts. |
| REPEAT_COUNT | 3 | Repetitions per configuration. |

Output: evaluation/results_e2e_prefill_${DEVICE_NAME}/`<model_name>`/ with .txt log and .csv files for each quantization variant found in MODEL_DIR.

Note: This script deletes the target result directory before writing, so previous results for the same device and model are overwritten.
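For a quick smoke run, the defaults above can be narrowed to a single configuration (a sketch using the environment variables documented in the table):

```shell
# One prompt length, one thread count, one repetition
DEVICE_NAME="$DEVICE_NAME" \
MODEL_DIR="$MODEL_DIR" \
PROMPT_LENGTH=128 \
THREAD_COUNT=4 \
REPEAT_COUNT=1 \
./evaluation/scripts/bench-e2e-prefill.sh
```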

### Step 3 — End-to-End Batched Decoding

Measures parallel decoding throughput (tokens/sec) across batch sizes and thread counts.

```shell
DEVICE_NAME="$DEVICE_NAME" \
MODEL_DIR="$MODEL_DIR" \
./evaluation/scripts/bench-e2e-batch.sh
```

Optional environment variables:

| Variable | Default | Description |
|---|---|---|
| PREFILL_LEN | 16 | Prompt length before decoding. |
| TOKEN_GEN_LENS | 16 | Comma-separated generation lengths. |
| PARALLEL_SEQS | 64,128,256 | Comma-separated batch sizes. |
| THREAD_COUNT | 4 | Comma-separated thread counts. |

Output: evaluation/results_e2e_batch_${DEVICE_NAME}/`<model_name>`/ with .txt log and .csv files for each quantization variant found in MODEL_DIR.

Note: This script also deletes the target result directory before writing.

### Verifying Results

After all three steps, you should have the following directory structure:

```text
evaluation/
├── results_gemm_mydevice/
│   ├── llama3_8b_t1_ns256_s32.log
│   ├── llama3_8b_t1_ns256_s32.csv
│   ├── llama3_8b_t4_ns256_s32.log
│   └── llama3_8b_t4_ns256_s32.csv
├── results_e2e_prefill_mydevice/
│   └── Llama3-8B-1.58-100B-tokens/
│       ├── ggml-model-I1_V_2_p128-256-512_t1-4_r3_<date>_<time>.txt
│       ├── ggml-model-I1_V_2_p128-256-512_t1-4_r3_<date>_<time>.csv
│       ├── ggml-model-I2_V_4_p128-256-512_t1-4_r3_<date>_<time>.txt
│       └── ggml-model-I2_V_4_p128-256-512_t1-4_r3_<date>_<time>.csv
└── results_e2e_batch_mydevice/
    └── Llama3-8B-1.58-100B-tokens/
        ├── ggml-model-I1_V_2_npp16_ntg16_npl64-128-256_t4_<date>_<time>.txt
        ├── ggml-model-I1_V_2_npp16_ntg16_npl64-128-256_t4_<date>_<time>.csv
        ├── ggml-model-I2_V_4_npp16_ntg16_npl64-128-256_t4_<date>_<time>.txt
        └── ggml-model-I2_V_4_npp16_ntg16_npl64-128-256_t4_<date>_<time>.csv
```

### Repeating for More Devices or Models

To extend the evaluation:

1. Another device: Run on different hardware with a new DEVICE_NAME (e.g., laptop_amd). Results are stored in separate directories per device.
2. Another model: Change MODEL_DIR to a different model directory (e.g., ~/models/bitnet_b1_58-3B) and re-run Steps 2–3. The GeMM benchmark (Step 1) uses built-in shapes and does not depend on model files, so it only needs to be run once per device.
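Re-running Steps 2–3 for every model can be scripted (a sketch; short names taken from the Models table, assuming each model directory is already populated):

```shell
for NAME in Llama3-8B-1.58-100B-tokens bitnet_b1_58-3B Falcon3-1B-Instruct-1.58bit; do
  DEVICE_NAME="$DEVICE_NAME" MODEL_DIR=~/models/$NAME \
    ./evaluation/scripts/bench-e2e-prefill.sh
  DEVICE_NAME="$DEVICE_NAME" MODEL_DIR=~/models/$NAME \
    ./evaluation/scripts/bench-e2e-batch.sh
done
```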

### How the Wrapper Scripts Behave

When run in a vlut.cpp-only workspace, the wrapper scripts:

- Run all vlut.cpp quantization variants found in MODEL_DIR.
- Skip missing baseline frameworks (llama.cpp, T-MAC, bitnet.cpp) with warnings.
- Skip missing quantization files (e.g., ggml-model-I2_V_8.gguf) with warnings.
- Return a nonzero exit status only if an attempted benchmark actually fails.

## Full Comparison

This workflow uses the same scripts as the minimal workflow. The difference is that the baseline repositories and model files are present, so the scripts also benchmark llama.cpp, T-MAC, and bitnet.cpp.

### Workspace Layout

Place all repositories as siblings:

```text
workspace/
├── BitNet/          # bitnet.cpp
├── T-MAC/
├── llama.cpp/
└── vlut.cpp/
```

The wrapper scripts assume this layout by default.
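Assuming the upstream GitHub locations listed under Baseline Setup, the sibling layout can be created as follows (vlut.cpp itself is not cloned here, since its repository URL is not given in this guide):

```shell
mkdir -p workspace && cd workspace
git clone https://github.com/microsoft/BitNet        # bitnet.cpp
git clone https://github.com/microsoft/T-MAC
git clone https://github.com/ggml-org/llama.cpp
# Place your vlut.cpp checkout alongside these three directories.
```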

### Baseline Setup

Build the baselines following their upstream instructions:

| Baseline | Build instructions |
|---|---|
| llama.cpp | https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md |
| bitnet.cpp | https://github.com/microsoft/BitNet/blob/main/README.md#installation |
| T-MAC | https://github.com/microsoft/T-MAC/blob/main/README.md#installation |

Place the baseline model files in MODEL_DIR with these exact names:

| Framework | Required model files |
|---|---|
| llama.cpp | ggml-model-TQ2_0.gguf, ggml-model-TQ1_0.gguf |
| T-MAC | `<model-name>`.INT_N.gguf |
| bitnet.cpp | ggml-model-tl2.gguf, ggml-model-tl1.gguf, ggml-model-i2_s.gguf (when applicable) |

Important:

- T-MAC and bitnet.cpp may need re-compilation per model or quantization variant.
- Do not mix converted model files across frameworks.

### Running Benchmarks

The commands are identical to the Minimal Reproduction steps. With baselines present, the same scripts automatically include them.

GeMM:

```shell
./evaluation/scripts/bench-gemm.sh "$DEVICE_NAME" 1 256
./evaluation/scripts/bench-gemm.sh "$DEVICE_NAME" 4 256
```

For T-MAC GeMM results specifically:

```shell
./evaluation/scripts/bench-gemm-tmac.sh \
  --device "$DEVICE_NAME" \
  --tmac_path ../T-MAC
```

End-to-end prefilling:

```shell
DEVICE_NAME="$DEVICE_NAME" \
MODEL_DIR="$MODEL_DIR" \
./evaluation/scripts/bench-e2e-prefill.sh
```

With baselines present, this also benchmarks llama.cpp (TQ2_0, TQ1_0), T-MAC (INT_N), and bitnet.cpp (tl2, tl1, i2_s).

End-to-end batched decoding:

```shell
DEVICE_NAME="$DEVICE_NAME" \
MODEL_DIR="$MODEL_DIR" \
./evaluation/scripts/bench-e2e-batch.sh
```

Note: T-MAC does not always build llama-batched-bench by default. If needed, build it manually in T-MAC/3rdparty/llama.cpp after each T-MAC rebuild.


## Plotting and Reports

The plotting scripts read the raw result directories and produce PDF figures and CSV summary reports.

```shell
python evaluation/scripts/plot/plot_gemm_combined.py --both
python evaluation/scripts/plot/plot_e2e_prefill_combined.py --both
python evaluation/scripts/plot/plot_e2e_batch_combined.py --both
```

Each script accepts --single-thread, --multi-thread, or --both (default).

Output:

| Type | Directory |
|---|---|
| Figures (PDF) | evaluation/figures/ |
| GeMM reports (CSV) | evaluation/reports_gemm/ |
| Prefill reports (CSV) | evaluation/reports_e2e_prefill/ |
| Batch reports (CSV) | evaluation/reports_e2e_batch/ |

The device identifier mydevice is included as a built-in dummy identifier in the plotting scripts. The paper's canonical device identifiers are pc_intel, laptop_amd, orangepi, smartphone, and aws_arm.

Note: For the minimal reproduction (without baselines), the CSV speedup reports may be incomplete because they require baseline results for comparison. This is expected. You can still verify the raw result CSV files and the plot figures, which show vlut.cpp results on their own.


## Expected Results

Absolute throughput varies with hardware, compiler, and thermal conditions. The relative trends should match the paper:

- vlut.cpp generally outperforms the baselines in GeMM throughput, prefilling speed, and parallel decoding throughput.
- The I1 quantization remains competitive while using fewer bits; other sub-2-bit baselines degrade significantly relative to 2-bit.
- On AWS Graviton 3, llama.cpp's Q4_0 fallback for HF BitNet 3B can be relatively strong due to ARM-specific 4-bit kernel optimizations.

## Configuration Tuning (Optional)

To find the best performance on a given device, you can tune the TABLE_ENTRY_SIZE and quantization type.

vlut.cpp exposes two performance-critical tuning dimensions:

| Parameter | Set at | Description |
|---|---|---|
| TABLE_ENTRY_SIZE | CMake configure time | N-tile size as described in the paper (default: 32). |
| Quantization type (I1_V_2, I2_V_4, I2_V_8) | Quantization time | K-tiling / packed weight layout. |

Example — building with a different entry size:

```shell
cmake -B build-entry64 -DTABLE_ENTRY_SIZE=64
cmake --build build-entry64 --config Release -j4
```
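The alternate entry size can then be benchmarked via the optional fourth argument of bench-gemm.sh documented in Step 1 (a sketch; this assumes the script dispatches to the build matching the requested entry size):

```shell
# Fourth argument = LUT entry size in bytes (default 32)
./evaluation/scripts/bench-gemm.sh "$DEVICE_NAME" 4 256 64
```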

### Automated Configuration Search

evaluation/scripts/search-config.sh sweeps over entry sizes (16, 32, 64) and thread counts (1, 4, 8) to find the best configuration for your hardware:

```shell
./evaluation/scripts/search-config.sh 1   # search I1 variants
./evaluation/scripts/search-config.sh 2   # search I2 variants
```

Output:

- evaluation/results_search/scores1.csv or scores2.csv.
- The per-thread-count optimal configuration, printed to stdout.

## Troubleshooting

| Symptom | Fix |
|---|---|
| awk errors in bench-gemm.sh | Replace $SCRIPT_DIR/test-to-csv.sh with $SCRIPT_DIR/test-to-csv-backup.sh in the script. |
| Previous results overwritten | bench-e2e-prefill.sh and bench-e2e-batch.sh delete target directories before writing; back up results if needed. |
| T-MAC llama-batched-bench missing | Build it manually in T-MAC/3rdparty/llama.cpp. |
| Incomplete CSV speedup reports | Expected when baselines are missing; verify raw results and figures instead. |