LLM Quantization Benchmark: FP16 vs INT8 vs AWQ

A hands-on benchmark comparing three LLM inference precision modes on an NVIDIA T4 GPU.
Model: Qwen2.5-0.5B-Instruct | Hardware: Google Colab T4 (15 GB VRAM)

Results

Method	Throughput (tok/s)	VRAM Usage	vs FP16 Speed	VRAM Saved
FP16 (baseline)	24.4	0.998 GB	1.00x	—
INT8 (bitsandbytes)	~6.3	0.642 GB	0.27x	36%
AWQ INT4	17.1	0.468 GB	0.70x	53%
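
The two derived columns follow from the raw measurements. A minimal sketch of that arithmetic, using the numbers from the table above; the dictionary and function names are illustrative and not taken from benchmark.py:

```python
# Raw measurements copied from the results table: (tokens/s, VRAM in GB).
RESULTS = {
    "FP16": (24.4, 0.998),
    "INT8 (bitsandbytes)": (6.3, 0.642),  # throughput is approximate (~6.3)
    "AWQ INT4": (17.1, 0.468),
}


def relative_to_baseline(results, baseline="FP16"):
    """Return {method: (speed ratio vs. baseline, fraction of VRAM saved)}."""
    base_tps, base_gb = results[baseline]
    return {
        name: (tps / base_tps, 1.0 - gb / base_gb)
        for name, (tps, gb) in results.items()
    }


if __name__ == "__main__":
    for name, (speed, saved) in relative_to_baseline(RESULTS).items():
        print(f"{name:<22} {speed:.2f}x speed, {saved:.0%} VRAM saved")
```

Because the INT8 throughput is only given approximately, the ratio computed this way (about 0.26x) can differ from the table's 0.27x in the last digit.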

Tested on 20 prompts × 128 max new tokens, greedy decoding, after 1 warm-up run.
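
The measurement procedure described above (warm-up first, then timed generation over all prompts) can be sketched as follows. This is a generic harness under stated assumptions, not the code in benchmark.py: `measure_throughput` and the `generate` callable are hypothetical names, and the model call is replaced by a stand-in so the example runs without a GPU.

```python
import time


def measure_throughput(generate, prompts, max_new_tokens=128, warmup_runs=1):
    """Return end-to-end generation throughput in tokens/s over all prompts.

    `generate(prompt, max_new_tokens)` must return the number of tokens it
    actually produced, since generation can stop early at an EOS token.
    """
    # Warm-up runs are excluded from timing (context setup, kernel caches).
    for _ in range(warmup_runs):
        generate(prompts[0], max_new_tokens)

    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += generate(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


if __name__ == "__main__":
    def fake_generate(prompt, max_new_tokens):
        # Stand-in for a real model call: sleep roughly 1 ms per "token".
        time.sleep(0.001 * max_new_tokens)
        return max_new_tokens

    prompts = [f"prompt {i}" for i in range(20)]
    print(f"{measure_throughput(fake_generate, prompts, 16):.1f} tok/s")
```

With a real model on a GPU, the callable should call torch.cuda.synchronize() before returning, because CUDA kernels are launched asynchronously and the timer would otherwise stop too early.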

Key Findings

1. bitsandbytes INT8 saves memory but hurts throughput (0.27x)

bitsandbytes with load_in_8bit=True implements the LLM.int8() scheme:
weights are stored as INT8, activations are quantized on the fly at every linear layer,
and outlier feature dimensions are computed separately in FP16.
Only the stored weights shrink; every matrix multiply gains quantization, outlier-handling, and dequantization steps.
The result is roughly 3.7× slower decode throughput despite a genuine 36% VRAM reduction.

Use case: Load a model that barely fits in VRAM, when speed is not critical.
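
To make the "stored as INT8, converted back when needed" idea concrete, here is a simplified pure-Python sketch of absmax quantization for one row of weights. It only illustrates the principle: bitsandbytes does this in CUDA kernels and also treats outlier features separately, which is not shown, and the function names are hypothetical.

```python
def quantize_absmax_int8(row):
    """Quantize a list of floats to INT8 codes plus one floating-point scale."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Clamp to the symmetric INT8 range after rounding.
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float values from INT8 codes and their scale."""
    return [v * scale for v in q]


if __name__ == "__main__":
    weights = [0.02, -0.51, 0.33, 1.27, -0.08]
    q, scale = quantize_absmax_int8(weights)
    restored = dequantize_int8(q, scale)
    print("int8 codes:", q)
    print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Each weight costs one byte instead of two, at the price of a rounding error of at most half a quantization step and extra arithmetic whenever the values are used.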

2. AWQ INT4 saves more memory (53%) but is also slower on T4 (0.70x)

AWQ uses static weight-only quantization: weights are calibrated and compressed to INT4 offline,
avoiding the extra runtime quantization steps that slow down bitsandbytes INT8.
The multiply itself still runs in FP16. AWQ kernels unpack the INT4 weights to FP16 on the fly,
so any speedup comes from moving fewer bytes from memory, not from INT4 Tensor Core arithmetic.
On the T4 (Turing architecture, 2018) used here, the unpacking overhead apparently outweighed the
bandwidth savings, and AWQ did not outperform the plain FP16 path.

Use case: Maximize VRAM savings with a smaller speed penalty than bitsandbytes INT8 on older GPUs such as the T4.
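
The offline compression step stores several 4-bit codes in one machine word, and the kernel has to unpack them again before the multiply. A minimal sketch of that packing and unpacking, assuming eight codes per 32-bit word in simple low-to-high order plus a per-group scale and zero point; the real AutoAWQ memory layout and ordering differ, and all function names are hypothetical.

```python
def pack_int4(values):
    """Pack unsigned 4-bit codes (0..15) into 32-bit words, 8 per word."""
    if len(values) % 8 != 0:
        raise ValueError("length must be a multiple of 8")
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            if not 0 <= v <= 15:
                raise ValueError("value out of 4-bit range")
            word |= v << (4 * j)  # code j occupies bits 4j..4j+3
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: extract eight 4-bit codes from each word."""
    return [(word >> (4 * j)) & 0xF for word in words for j in range(8)]


def dequantize_int4(codes, scale, zero_point):
    """Map 4-bit codes back to floats: w is approximately (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in codes]


if __name__ == "__main__":
    codes = [1, 2, 3, 4, 5, 6, 7, 8]
    packed = pack_int4(codes)
    print([hex(w) for w in packed])
    print(dequantize_int4(unpack_int4(packed), scale=0.5, zero_point=8))
```

The unpack and dequantize steps are the per-multiply work that a weight-only INT4 kernel has to hide behind its memory savings.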

3. Quantization speedup is GPU-architecture-dependent

Common misconception: "Quantization always speeds up inference."
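Reality: Speed gains depend on the GPU and on how well the quantized kernels are optimized for it.

GPU	Architecture	Expected AWQ Speedup
NVIDIA T4 (2018)	Turing	Slower or marginal
NVIDIA A100 (2020)	Ampere	~1.3–1.6×
NVIDIA H100 (2022)	Hopper	~1.5–2.0×
NVIDIA RTX 4090 (2022)	Ada Lovelace	~1.3–1.8×

Only the T4 was measured in this benchmark; the other rows are rough expectations, not results.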

On newer GPUs, AWQ can provide both memory savings and throughput improvements.
On T4, quantization is best understood as a memory management tool, not a speed tool.
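
Before interpreting a result, it helps to record which architecture the benchmark actually ran on. A small sketch, assuming PyTorch is installed for the device query; the lookup table covers only the GPUs discussed above, and `architecture_name` is a hypothetical helper.

```python
# Compute capability -> architecture name for the GPUs discussed above.
ARCH_BY_CAPABILITY = {
    (7, 5): "Turing",        # e.g. T4
    (8, 0): "Ampere",        # e.g. A100
    (8, 9): "Ada Lovelace",  # e.g. RTX 4090
    (9, 0): "Hopper",        # e.g. H100
}


def architecture_name(capability):
    """Return the architecture name for a (major, minor) compute capability."""
    return ARCH_BY_CAPABILITY.get(tuple(capability), "unknown")


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        print(torch.cuda.get_device_name(0), cap, architecture_name(cap))
    else:
        print("No CUDA device visible; nothing to report.")
```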

Setup

pip install torch transformers accelerate bitsandbytes autoawq

python benchmark.py

Requires an NVIDIA GPU with CUDA. Tested on Python 3.12, CUDA 12.2.

Project Structure

llm-quantization-benchmark/
├── benchmark.py        # Full benchmark: FP16 / INT8 / AWQ comparison
├── requirements.txt
└── README.md

Technical Notes

Why measure memory_allocated(), not max_memory_allocated()
max_memory_allocated() returns the peak since the last reset. If FP16 runs first and sets a high watermark, a later INT8 measurement reports that same peak unless the counter is reset in between. memory_allocated(), read right after loading, gives the footprint of each method on its own.
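
The difference between the two counters can be shown with a toy model that runs without a GPU. `ToyAllocator` is a hypothetical stand-in that mimics how torch.cuda.memory_allocated(), torch.cuda.max_memory_allocated(), and torch.cuda.reset_peak_memory_stats() relate to each other; it is not how PyTorch is implemented.

```python
class ToyAllocator:
    """Tracks current and peak usage the way the two CUDA counters report them."""

    def __init__(self):
        self.current = 0.0  # analogous to memory_allocated()
        self.peak = 0.0     # analogous to max_memory_allocated()

    def alloc(self, gb):
        self.current += gb
        self.peak = max(self.peak, self.current)

    def free(self, gb):
        self.current -= gb

    def reset_peak(self):
        # Analogous to reset_peak_memory_stats(): peak restarts at current.
        self.peak = self.current


if __name__ == "__main__":
    mem = ToyAllocator()
    mem.alloc(0.998)  # load the FP16 model
    mem.free(0.998)   # unload it
    mem.alloc(0.642)  # load the INT8 model
    print(f"current={mem.current:.3f} GB, peak={mem.peak:.3f} GB")
```

After the INT8 load, the current counter shows 0.642 GB while the peak still shows the 0.998 GB left behind by FP16, which is the trap described above.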

Why separate prefill and decode timing
Prefill (processing the input prompt) and decode (generating tokens one by one) have different computational characteristics. Prefill is compute-bound (large matrix multiply); decode is memory-bandwidth-bound (loading weights for each new token). Quantization affects these two phases differently.
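
One way to separate the two phases is to take a timestamp after the prompt pass and another after the token loop. The sketch below assumes the model can be driven through two callables, `prefill` and `decode_step`; both names are hypothetical and the stand-ins only sleep, so this shows the timing structure, not real model behavior.

```python
import time


def time_prefill_and_decode(prefill, decode_step, num_new_tokens):
    """Time the prompt pass and the token-by-token loop separately.

    `prefill()` processes the whole prompt and returns a state object;
    `decode_step(state)` produces one token and returns the updated state.
    Returns (prefill_seconds, decode_tokens_per_second).
    """
    t0 = time.perf_counter()
    state = prefill()
    t1 = time.perf_counter()
    for _ in range(num_new_tokens):
        state = decode_step(state)
    t2 = time.perf_counter()
    return t1 - t0, num_new_tokens / (t2 - t1)


if __name__ == "__main__":
    def fake_prefill():
        time.sleep(0.02)   # one large pass over the whole prompt
        return 0

    def fake_decode_step(state):
        time.sleep(0.002)  # one small pass per generated token
        return state + 1

    prefill_s, decode_tps = time_prefill_and_decode(fake_prefill, fake_decode_step, 32)
    print(f"prefill: {prefill_s * 1000:.1f} ms, decode: {decode_tps:.1f} tok/s")
```

On a GPU, each callable would need to synchronize the device before returning so that the timestamps reflect finished work.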

Why bitsandbytes INT8 decode is especially slow
Decode runs every linear layer once per generated token. With bitsandbytes INT8, each of those calls pays for the extra quantization and dequantization steps on top of the weight load, and at one token per step there is little compute to amortize that overhead, so decode becomes disproportionately slower.
