A hands-on benchmark comparing three LLM inference precision modes on an NVIDIA T4 GPU.
Model: Qwen2.5-0.5B-Instruct | Hardware: Google Colab T4 (15 GB VRAM)
| Method | Throughput (tok/s) | VRAM Usage | vs FP16 Speed | VRAM Saved |
|---|---|---|---|---|
| FP16 (baseline) | 24.4 | 0.998 GB | 1.00x | — |
| INT8 (bitsandbytes) | ~6.3 | 0.642 GB | 0.27x | 36% |
| AWQ INT4 | 17.1 | 0.468 GB | 0.70x | 53% |
Tested on 20 prompts × 128 max new tokens, greedy decoding, after 1 warm-up run.
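The two derived columns in the table follow directly from the raw throughput and VRAM figures. The sketch below recomputes them; the dictionary layout and function names are illustrative assumptions, and the numbers are copied from the table.

```python
# Raw measurements copied from the results table.
results = {
    "FP16": {"tok_s": 24.4, "vram_gb": 0.998},
    "INT8": {"tok_s": 6.3, "vram_gb": 0.642},   # throughput is only given approximately (~6.3)
    "AWQ INT4": {"tok_s": 17.1, "vram_gb": 0.468},
}


def relative_speed(method, baseline="FP16"):
    """Throughput of `method` as a fraction of the baseline's throughput."""
    return results[method]["tok_s"] / results[baseline]["tok_s"]


def vram_saved(method, baseline="FP16"):
    """Fraction of the baseline's VRAM that `method` avoids using."""
    return 1.0 - results[method]["vram_gb"] / results[baseline]["vram_gb"]


for name in results:
    print(f"{name:9s} speed={relative_speed(name):.2f}x  saved={vram_saved(name):.0%}")
```

Because the INT8 throughput is listed only as an approximate value, the recomputed ratio comes out near 0.26x, not exactly the table's 0.27x.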
bitsandbytes with load_in_8bit=True (the LLM.int8() scheme) does its quantization work at run time:
weights are stored as INT8, but every linear layer quantizes its activations on the fly, routes outlier features through a separate FP16 path, and converts the INT8 result back to FP16.
The weights stay small in memory, but this extra work is paid on every matrix multiply,
resulting in 3.7× slower decode throughput despite a genuine 36% VRAM reduction.
Use case: Load a model that barely fits in VRAM, when speed is not critical.
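To make the trade-off concrete, here is a minimal pure-Python sketch of absmax INT8 quantization: values are stored as small integers plus one scale, and converting them back to floats is extra work at compute time. This is an illustration of the general idea only; the function names and sample values are assumptions, and it does not reproduce bitsandbytes' actual kernels.

```python
def quantize_absmax_int8(row):
    """Map a list of floats to INT8 codes plus one float scale (absmax scheme)."""
    absmax = max(abs(x) for x in row) or 1.0  # fall back to 1.0 for an all-zero row
    scale = absmax / 127.0
    codes = [max(-127, min(127, round(x / scale))) for x in row]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from INT8 codes."""
    return [c * scale for c in codes]


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical sample data, chosen only for illustration.
weights = [0.12, -0.50, 0.33, 0.07]
activations = [1.0, 2.0, -1.0, 0.5]

codes, scale = quantize_absmax_int8(weights)  # done once, when the model is loaded
restored = dequantize(codes, scale)           # extra work paid when the weights are used

print("INT8 codes:", codes)
print("exact dot :", dot(weights, activations))
print("quant dot :", dot(restored, activations))
```

Each weight shrinks from two bytes (FP16) to one, which is where the memory saving comes from; the conversion step is where time is lost.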
AWQ uses static weight-only quantization: weights are calibrated and compressed to INT4 offline,
so the quantization work itself is finished before inference starts.
At inference time the kernels still unpack the INT4 weights to FP16 and do the multiply in FP16 (W4A16), so any speedup comes from reading fewer weight bytes, not from INT4 Tensor Core arithmetic.
On a T4 GPU (Turing architecture, 2018), the AWQ kernels did not turn that saving into speed in this benchmark
and failed to outperform FP16's native Tensor Core path.
Use case: Maximize VRAM savings with less speed penalty than INT8 on legacy GPUs.
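To illustrate what compressing weights to INT4 means for storage, here is a minimal sketch that packs two 4-bit codes per byte and recovers floats with a per-group scale and zero point. The byte layout, function names, and sample values are assumptions for illustration; they do not mirror AutoAWQ's actual tensor format or kernels.

```python
def pack_int4(values):
    """Pack unsigned 4-bit values (0..15) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    return bytes(lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2]))


def unpack_int4(packed):
    """Inverse of pack_int4: split every byte back into two 4-bit codes."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)  # low nibble
        out.append(byte >> 4)    # high nibble
    return out


def dequantize_group(codes, scale, zero_point):
    """Turn 4-bit codes back into floats: w ~= (code - zero_point) * scale."""
    return [(c - zero_point) * scale for c in codes]


# Hypothetical codes for one small group of weights.
group_codes = [3, 15, 0, 8]
packed = pack_int4(group_codes)

print(len(group_codes), "codes stored in", len(packed), "bytes")
print(dequantize_group(unpack_int4(packed), scale=0.1, zero_point=8))
```

The unpack and rescale steps are cheap per value, but they run for every weight on every forward pass, which is why kernel quality matters so much for quantized throughput.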
Common misconception: "Quantization always speeds up inference."
Reality: Speed gains depend on the kernels and on the GPU they run on; a smaller data type alone guarantees nothing.
| GPU | Architecture | Expected AWQ Speedup |
|---|---|---|
| NVIDIA T4 (2018) | Turing | Slower or marginal |
| NVIDIA A100 (2020) | Ampere | ~1.3–1.6× |
| NVIDIA H100 (2022) | Hopper | ~1.5–2.0× |
| NVIDIA RTX 4090 (2022) | Ada Lovelace | ~1.3–1.8× |
On capable hardware, AWQ can provide both memory savings and throughput improvements.
On T4, quantization is best understood as a memory management tool, not a speed tool.
pip install torch transformers accelerate bitsandbytes autoawq
python benchmark.py
Requires an NVIDIA GPU with CUDA. Tested on Python 3.12, CUDA 12.2.
llm-quantization-benchmark/
├── benchmark.py # Full benchmark: FP16 / INT8 / AWQ comparison
└── README.md
Why measure memory_allocated() not max_memory_allocated()
torch.cuda.max_memory_allocated() returns the peak since the last reset. If FP16 runs first and sets a high watermark, a later INT8 measurement simply reports that same FP16 peak. Reading torch.cuda.memory_allocated() right after each model loads gives the true footprint per method.
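The watermark problem can be reproduced without a GPU. The toy class below is a hypothetical stand-in for the CUDA allocator (not PyTorch code); its fields mimic what torch.cuda.memory_allocated(), torch.cuda.max_memory_allocated(), and torch.cuda.reset_peak_memory_stats() report, using the VRAM figures from the results table.

```python
class ToyAllocator:
    """Tiny stand-in for a GPU allocator that tracks current and peak usage."""

    def __init__(self):
        self.current = 0.0  # analogous to torch.cuda.memory_allocated()
        self.peak = 0.0     # analogous to torch.cuda.max_memory_allocated()

    def alloc(self, gb):
        self.current += gb
        self.peak = max(self.peak, self.current)

    def free(self, gb):
        self.current -= gb  # the peak is deliberately left untouched

    def reset_peak(self):
        self.peak = self.current  # analogous to torch.cuda.reset_peak_memory_stats()


gpu = ToyAllocator()

gpu.alloc(0.998)             # load the FP16 model
gpu.free(0.998)              # unload it before the next method

gpu.alloc(0.642)             # load the INT8 model
int8_current = gpu.current   # the real INT8 footprint
int8_peak = gpu.peak         # still the FP16 watermark

print(f"current={int8_current:.3f} GB  peak={int8_peak:.3f} GB")
```

Resetting the peak between methods would also work, but reading the current allocation at load time avoids depending on the order of the runs.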
Why separate prefill and decode timing
Prefill (processing the input prompt) and decode (generating tokens one by one) have different computational characteristics. Prefill is compute-bound (large matrix multiply); decode is memory-bandwidth-bound (loading weights for each new token). Quantization affects these two phases differently.
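A minimal sketch of timing the two phases separately is shown below. The prefill_fn and decode_step_fn hooks are hypothetical placeholders for whatever runs the model, and the fake functions only simulate work with sleeps; this is not the code in benchmark.py.

```python
import time


def time_phases(prefill_fn, decode_step_fn, num_new_tokens):
    """Time one prefill call and `num_new_tokens` decode steps separately.

    Both callables are hypothetical hooks supplied by the caller. On a real GPU,
    call torch.cuda.synchronize() before each clock read, because CUDA kernels
    run asynchronously.
    """
    start = time.perf_counter()
    state = prefill_fn()
    prefill_s = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(num_new_tokens):
        state = decode_step_fn(state)
    decode_s = time.perf_counter() - start

    return {
        "prefill_s": prefill_s,
        "decode_s": decode_s,
        "decode_tok_s": num_new_tokens / decode_s if decode_s > 0 else float("inf"),
    }


def fake_prefill():
    time.sleep(0.02)   # pretend to process the whole prompt at once
    return 0


def fake_decode_step(tokens_so_far):
    time.sleep(0.001)  # pretend to generate one token
    return tokens_so_far + 1


stats = time_phases(fake_prefill, fake_decode_step, num_new_tokens=16)
print(stats)
```

Reporting decode tokens per second from the decode window alone keeps a long prompt from distorting the throughput figure.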
Why bitsandbytes INT8 decode is especially slow
Decode reads the model weights for every single generated token. With bitsandbytes INT8, each of those reads also carries run-time conversion work between INT8 and FP16, so the per-token cost includes both a memory load and a conversion step, making decode disproportionately slower.