dohu012/llm-quantization-benchmark

LLM Quantization Benchmark: FP16 vs INT8 vs AWQ

A hands-on benchmark comparing three LLM inference precision modes on an NVIDIA T4 GPU.
Model: Qwen2.5-0.5B-Instruct | Hardware: Google Colab T4 (15 GB VRAM)


Results

| Method | Throughput (tok/s) | VRAM Usage | Speed vs FP16 | VRAM Saved |
|---|---|---|---|---|
| FP16 (baseline) | 24.4 | 0.998 GB | 1.00x | - |
| INT8 (bitsandbytes) | ~6.3 | 0.642 GB | 0.27x | 36% |
| AWQ INT4 | 17.1 | 0.468 GB | 0.70x | 53% |

Tested on 20 prompts × 128 max new tokens, greedy decoding, after 1 warm-up run.
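
The derived columns follow directly from the raw throughput and VRAM numbers. The sketch below recomputes them; the dictionary layout and the function name `relative_to_baseline` are illustrative and not taken from benchmark.py. Because the INT8 throughput is only quoted approximately (~6.3), the recomputed INT8 ratio lands near 0.26 rather than exactly on the 0.27x listed.

```python
# Raw measurements copied from the results table.
RESULTS = {
    "FP16": {"tok_per_s": 24.4, "vram_gb": 0.998},
    "INT8 (bitsandbytes)": {"tok_per_s": 6.3, "vram_gb": 0.642},  # throughput is approximate
    "AWQ INT4": {"tok_per_s": 17.1, "vram_gb": 0.468},
}


def relative_to_baseline(results, baseline="FP16"):
    """Return {method: (speed ratio vs baseline, VRAM saved as a fraction)}."""
    base = results[baseline]
    derived = {}
    for method, row in results.items():
        speed_ratio = row["tok_per_s"] / base["tok_per_s"]
        vram_saved = 1.0 - row["vram_gb"] / base["vram_gb"]
        derived[method] = (speed_ratio, vram_saved)
    return derived


if __name__ == "__main__":
    for method, (speed, saved) in relative_to_baseline(RESULTS).items():
        print(f"{method:<22} {speed:.2f}x speed  {saved:.0%} VRAM saved")
```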


Key Findings

1. bitsandbytes INT8 saves memory but hurts throughput (0.27x)

bitsandbytes with load_in_8bit=True stores the weights as INT8 and quantizes activations on the fly (the LLM.int8() scheme), routing outlier features through a separate FP16 path.
The weights stay in INT8 in VRAM, so the 36% memory reduction is genuine.
The cost is extra work around every matrix multiply: quantizing inputs, handling outliers, and rescaling outputs.
The result is 3.7× slower decode throughput than FP16.

Use case: Load a model that barely fits in VRAM, when speed is not critical.
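
The runtime overhead comes from the extra quantize and rescale steps wrapped around each matrix multiply. The sketch below is a deliberately simplified, pure-Python picture of that idea using absmax scaling. It is not the bitsandbytes kernel, it omits the FP16 outlier handling entirely, and the function names are made up for illustration.

```python
def absmax_quantize(vec):
    """Map floats to integers in [-127, 127] plus one scale factor."""
    absmax = max(abs(v) for v in vec)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in vec]
    return q, scale


def int8_matvec(q_rows, row_scales, x):
    """y = W @ x with INT8 weights: quantize x, integer dot product, rescale."""
    qx, x_scale = absmax_quantize(x)  # runtime quantization of the activations
    out = []
    for q_row, w_scale in zip(q_rows, row_scales):
        acc = sum(qw * qa for qw, qa in zip(q_row, qx))  # integer accumulation
        out.append(acc * w_scale * x_scale)              # rescale back to float
    return out


if __name__ == "__main__":
    weights = [[0.12, -0.50, 0.33, 0.05], [1.20, 0.07, -0.81, 0.40]]
    x = [1.0, 2.0, -1.0, 0.5]
    packed = [absmax_quantize(row) for row in weights]  # done once, at load time
    q_rows = [q for q, _ in packed]
    scales = [s for _, s in packed]
    exact = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    print("exact :", exact)
    print("approx:", int8_matvec(q_rows, scales, x))
```

The weight quantization happens once at load time, while the activation quantization and the final rescale repeat on every call. That per-call work is what shows up in the decode numbers.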

2. AWQ INT4 saves more memory (53%) but is also slower on T4 (0.70x)

AWQ uses static weight-only quantization: weights are calibrated and compressed to INT4 offline, so no quantization statistics have to be computed at inference time.
AWQ is a W4A16 scheme, though. The INT4 weights are unpacked to FP16 inside the matmul kernel and the arithmetic itself runs in FP16, so the kernel does extra work per weight in exchange for reading fewer bytes.
On a T4 GPU (Turing architecture, 2018) with a model this small, that trade did not pay off in this benchmark: the AWQ kernels ran at 0.70x the speed of the plain FP16 path.

Use case: Maximize VRAM savings with less speed penalty than INT8 on legacy GPUs.
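
To make "compressed offline" concrete, here is a minimal sketch of 4-bit group quantization with nibble packing. It assumes a simple min/scale scheme per group and a low-nibble-first layout; the real AutoAWQ format uses its own packing order, larger groups (commonly 128 weights), and activation-aware scaling, none of which is reproduced here. All function names are hypothetical.

```python
def quantize_group_int4(group):
    """Quantize one group of float weights to integers in 0..15."""
    lo, hi = min(group), max(group)
    scale = (hi - lo) / 15.0 if hi > lo else 1.0
    q = [max(0, min(15, round((w - lo) / scale))) for w in group]
    return q, scale, lo


def dequantize_group(q, scale, lo):
    """Recover approximate float weights from 4-bit codes."""
    return [v * scale + lo for v in q]


def pack_int4(values):
    """Pack 4-bit integers, eight per 32-bit word, low nibble first."""
    if len(values) % 8 != 0:
        raise ValueError("number of values must be a multiple of 8")
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            word |= (v & 0xF) << (4 * j)
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4."""
    return [(word >> (4 * j)) & 0xF for word in words for j in range(8)]


if __name__ == "__main__":
    group = [-0.40, -0.10, 0.00, 0.30, 0.80, -0.70, 0.25, 0.50]
    q, scale, lo = quantize_group_int4(group)
    words = pack_int4(q)  # 8 weights stored in one 32-bit word
    restored = dequantize_group(unpack_int4(words), scale, lo)
    print("packed words:", [hex(w) for w in words])
    print("max abs error:", max(abs(a - b) for a, b in zip(group, restored)))
```

Eight weights occupy 4 bytes here instead of the 16 bytes they would take in FP16, plus a small per-group overhead for the scale and minimum.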

3. Quantization speedup is GPU-architecture-dependent

Common misconception: "Quantization always speeds up inference."
Reality: Speed gains depend on the GPU and on kernels that exploit it for the target data type.

| GPU | Expected AWQ speedup vs FP16 (estimate, only T4 was measured here) |
|---|---|
| NVIDIA T4 (2018) | Slower or marginal |
| NVIDIA A100 (2020) | ~1.3–1.6× |
| NVIDIA H100 (2022) | ~1.5–2.0× |
| NVIDIA RTX 4090 (2022) | ~1.3–1.8× |

On capable hardware, AWQ can provide both memory savings and throughput improvements.
On T4, quantization is best understood as a memory management tool, not a speed tool.


Setup

pip install torch transformers accelerate bitsandbytes autoawq
python benchmark.py

Requires an NVIDIA GPU with CUDA. Tested on Python 3.12, CUDA 12.2.


Project Structure

llm-quantization-benchmark/
├── benchmark.py    # Full benchmark: FP16 / INT8 / AWQ comparison
└── README.md

Technical Notes

Why measure memory_allocated() not max_memory_allocated()
max_memory_allocated() returns the peak since the last reset, so if FP16 runs first and sets a high watermark, the INT8 measurement appears identical. Reading memory_allocated() right after each model loads gives that method's own weight footprint.
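
The high-watermark effect is easy to reproduce with a toy model of the two counters. The class below is not PyTorch's allocator, only a stand-in with a current value and a peak, fed with the footprints from the results table. In real code the peak can also be cleared between runs with torch.cuda.reset_peak_memory_stats().

```python
class ToyAllocator:
    """Tracks a current size and a peak, like a much-simplified CUDA allocator."""

    def __init__(self):
        self.allocated = 0.0  # analogous to memory_allocated()
        self.peak = 0.0       # analogous to max_memory_allocated()

    def alloc(self, gb):
        self.allocated += gb
        self.peak = max(self.peak, self.allocated)

    def free(self, gb):
        self.allocated -= gb


def measure(load_sizes_gb):
    """Load and unload each model in turn, reading both counters after load."""
    gpu = ToyAllocator()
    readings = {}
    for name, size in load_sizes_gb.items():
        gpu.alloc(size)
        readings[name] = {"allocated": gpu.allocated, "peak": gpu.peak}
        gpu.free(size)
    return readings


if __name__ == "__main__":
    sizes = {"FP16": 0.998, "INT8": 0.642, "AWQ": 0.468}
    for name, r in measure(sizes).items():
        print(f"{name:<5} allocated={r['allocated']:.3f} GB  peak={r['peak']:.3f} GB")
```

The peak column stays at the FP16 value for every later method, while the allocated column tracks each model's own size.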

Why separate prefill and decode timing
Prefill (processing the input prompt) and decode (generating tokens one by one) have different computational characteristics. Prefill is compute-bound (large matrix multiply); decode is memory-bandwidth-bound (loading weights for each new token). Quantization affects these two phases differently.
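
One way to separate the two phases is to read the clock once after prefill and once after the decode loop. The helper below is a sketch under assumptions, not the code in benchmark.py: `time_phases`, `prefill_fn`, `decode_step_fn`, and `sync_fn` are hypothetical names, and the demo uses stub functions instead of a model. On a GPU, `sync_fn` would typically be torch.cuda.synchronize so that queued kernels finish before each clock read.

```python
import time


def time_phases(prefill_fn, decode_step_fn, n_new_tokens, sync_fn=None):
    """Time one prefill call and n_new_tokens decode steps separately."""

    def now():
        if sync_fn is not None:
            sync_fn()  # wait for pending device work before reading the clock
        return time.perf_counter()

    t0 = now()
    state = prefill_fn()              # process the whole prompt once
    t1 = now()
    for _ in range(n_new_tokens):     # generate tokens one at a time
        state = decode_step_fn(state)
    t2 = now()

    decode_s = t2 - t1
    return {
        "prefill_s": t1 - t0,
        "decode_s": decode_s,
        "decode_tok_per_s": n_new_tokens / decode_s if decode_s > 0 else float("inf"),
        "final_state": state,
    }


if __name__ == "__main__":
    # Stub "model": prefill returns the prompt tokens, each step appends one token.
    result = time_phases(lambda: [0] * 4, lambda toks: toks + [1], n_new_tokens=128)
    print(f"prefill: {result['prefill_s'] * 1e3:.3f} ms")
    print(f"decode : {result['decode_tok_per_s']:.0f} tok/s")
```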

Why bitsandbytes INT8 decode is especially slow
Decode loads model weights for every single generated token. With bitsandbytes INT8, every linear layer also repeats its quantize, outlier-handling, and rescale steps on each token, so the per-token cost includes that fixed overhead on top of the weight load, making decode disproportionately slower.
