This project explores inference-time acceleration of large language models (LLMs) by combining parameter-efficient fine-tuning, weight quantization, speculative decoding, and high-performance inference engines.
We focus on improving both throughput and model quality for meta-llama/Llama-3.2-3B-Instruct on constrained hardware such as NVIDIA T4 and RTX 3090 GPUs.
- Reduce inference latency and memory usage of LLMs without significant loss in accuracy.
- Combine complementary techniques:
- LoRA fine-tuning for better perplexity
- Post-training quantization (GPTQ / AWQ)
- Speculative decoding for multi-token speedup
- High-throughput inference frameworks (vLLM, SGLang)
We follow a workflow-oriented order: analyze the model → improve accuracy → compress weights → accelerate inference.
For each transformer layer in Llama-3.2-3B-Instruct, per-module parameter counts follow mlp.gate_proj = mlp.up_proj = mlp.down_proj > self_attn.q_proj = self_attn.o_proj > self_attn.v_proj = self_attn.k_proj: the MLP projections dominate, and k_proj/v_proj are smallest because grouped-query attention uses fewer KV heads than query heads.
Understanding this structure guides us in choosing layers for fine-tuning or quantization.
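A quick way to verify this ordering is to print the parameter shapes of a single decoder layer. A minimal sketch using the Hugging Face transformers API (downloading the gated checkpoint and accepting its license is assumed):

```python
# Sketch: inspect per-module parameter shapes in one decoder layer.
# Assumes access to the gated meta-llama/Llama-3.2-3B-Instruct checkpoint.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
layer = model.model.layers[0]
for name, param in layer.named_parameters():
    print(f"{name:35s} {tuple(param.shape)}  ->  {param.numel():,} params")
```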
LoRA inserts a pair of trainable low-rank matrices into linear layers. For a weight matrix $W \in \mathbb{R}^{d \times k}$, LoRA freezes $W$ and learns an update $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and rank $r \ll \min(d, k)$, so the layer computes

$$h = Wx + \frac{\alpha}{r} BAx,$$

where $\alpha$ is a scaling hyperparameter. Only $A$ and $B$ are trained, which reduces the number of trainable parameters by orders of magnitude.
**Our setup**
- Dataset: Salesforce/wikitext (wikitext-2-raw-v1)
- Hardware: RTX3090
- Training: ~15 min
- Strategy: use a low learning rate and monitor perplexity frequently to avoid degenerate models that produce no response
For details, see train_lora.py.
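A minimal sketch of this kind of LoRA setup, assuming the peft and transformers libraries; the hyperparameters shown here are illustrative and may differ from the exact values in train_lora.py:

```python
# Sketch: LoRA fine-tuning setup (hyperparameters are illustrative,
# not necessarily those used in train_lora.py).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
lora_config = LoraConfig(
    r=16,                                  # low-rank dimension r
    lora_alpha=32,                         # scaling factor alpha
    target_modules=["q_proj", "v_proj"],   # which linear layers get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only a small fraction trains
```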
We evaluate two weight-only post-training quantization methods.
GPTQ minimizes the layer-wise post-quantization error

$$\hat{W} = \arg\min_{\hat{W}} \lVert WX - \hat{W}X \rVert_2^2,$$

where $X$ contains calibration activations. Weights are quantized one column at a time, and second-order information from the Hessian $H = 2XX^\top$ is used to update the remaining unquantized weights, compensating for the error introduced at each step.
**Our setup**
- Merge PEFT weights
- 4-bit quantization, group size = 128
- Achieves good trade-off between compression and accuracy
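A sketch of this merge-then-quantize step, assuming the transformers GPTQConfig integration (which delegates to optimum/auto-gptq); the output paths are illustrative placeholders:

```python
# Sketch: merge LoRA weights, then apply 4-bit GPTQ (group size 128).
# Paths are illustrative placeholders.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
merged = PeftModel.from_pretrained(base, "out/lora-adapter").merge_and_unload()
merged.save_pretrained("out/merged-fp16")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="wikitext2", tokenizer=tokenizer)
quantized = AutoModelForCausalLM.from_pretrained(
    "out/merged-fp16",
    quantization_config=gptq_config,  # runs calibration during loading
    device_map="auto",
)
quantized.save_pretrained("out/llama3.2-3b-gptq-4bit")
```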
AWQ observes that roughly 1% of weight channels, identified by activation magnitude, dominate quantization error. Rather than keeping them in higher precision (FP16/INT8), which is hardware-unfriendly, it protects these salient channels with per-channel scaling and then quantizes all weights to low bit-width.
**Our setup**
- Merge PEFT weights
- 4-bit group quantization (group = 128)
- Protects high-influence channels to keep perplexity low
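A sketch of the AWQ step, assuming the AutoAWQ library (autoawq package); the quantization config mirrors the setup above, and paths are illustrative placeholders:

```python
# Sketch: 4-bit AWQ quantization (group size 128) of the merged model.
# Assumes the autoawq package; paths are illustrative placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "out/merged-fp16"
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)  # activation-aware calibration

model.save_quantized("out/llama3.2-3b-awq-4bit")
tokenizer.save_pretrained("out/llama3.2-3b-awq-4bit")
```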
Speculative decoding uses a two-model approach:
- A smaller, faster draft model proposes several candidate tokens.
- The larger target model verifies them in a single forward pass; accepted tokens are emitted at once, and generation resumes from the first rejected position with the target model's own prediction.
This yields multi-token parallelism during generation.
**Our setup**
- Draft model: Llama-3.2-1B-Instruct, fine-tuned and quantized as above
- Integrated in the vLLM framework (see the wiring sketch after this list)
- Yields a significant throughput improvement on GPU inference
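A sketch of how such a draft/target pair can be wired up in vLLM. The `speculative_config` dict follows recent vLLM releases (older versions exposed `speculative_model` and `num_speculative_tokens` as direct arguments); model paths and the token count are illustrative:

```python
# Sketch: speculative decoding in vLLM with a small draft model.
# Argument shape follows recent vLLM releases and may differ across versions;
# model paths and num_speculative_tokens are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="out/llama3.2-3b-gptq-4bit",           # target model
    speculative_config={
        "model": "out/llama3.2-1b-gptq-4bit",    # draft model
        "num_speculative_tokens": 5,             # tokens proposed per step
    },
)
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Explain speculative decoding in one sentence."], params)
print(outputs[0].outputs[0].text)
```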
We tested two high-performance engines:
| Framework | KV-Cache Strategy | Batching |
|---|---|---|
| vLLM | Paged Attention – splits KV cache into memory pages for dynamic reuse and defragmentation | Continuous batching |
| SGLang | Radix Attention – stores KV entries in a radix tree so requests that share a prefix reuse the cached entries | Persistent batching |
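For context, a minimal way to drive SGLang from Python (the vLLM counterpart appears in the speculative decoding section above); `sgl.Engine` is SGLang's offline entry point, and the model path and sampling settings here are illustrative:

```python
# Sketch: minimal offline generation with SGLang's engine API.
# Model path and sampling settings are illustrative placeholders.
import sglang as sgl

llm = sgl.Engine(model_path="out/llama3.2-3b-gptq-4bit")
outputs = llm.generate(
    ["The capital of France is"],
    {"temperature": 0.0, "max_new_tokens": 32},
)
print(outputs[0]["text"])
llm.shutdown()
```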
- vLLM natively supports speculative decoding.
- SGLang limits speculative decoding to EAGLE/EAGLE-3, which ties the draft head to the target model's hidden size and often leads to GPU resource contention.
- Backend selection in vLLM: `VLLM_ATTENTION_BACKEND=TORCH_SDPA|FLASH_ATTN|XFORMERS|ROCM_FLASH|FLASHINFER|FLASHMLA`. On T4 GPUs, `XFORMERS` gave the highest throughput.
- Avoid crashes on NYCU T4 servers by adjusting CUDA graph settings (a full usage sketch follows the snippet):
```python
compilation_config = {
    "cudagraph_capture_sizes": [1, 2, 4, 8, 16],
    "max_capture_size": 16,
}
```
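How these two knobs combine in practice; a sketch assuming a recent vLLM version in which `LLM()` accepts a `compilation_config` dict (the attention backend is still chosen via the environment variable, which must be set before vllm is imported):

```python
# Sketch: applying the attention backend and CUDA graph settings together.
# Assumes a recent vLLM version where LLM() accepts a compilation_config dict.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"  # fastest backend on T4 in our tests

from vllm import LLM

llm = LLM(
    model="out/llama3.2-3b-gptq-4bit",  # illustrative path
    compilation_config={
        "cudagraph_capture_sizes": [1, 2, 4, 8, 16],
        "max_capture_size": 16,
    },
)
```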
Evaluation setup: NVIDIA Tesla T4, dataset wikitext-2-raw-v1, base model perplexity 11.12.

| Method | Framework | Throughput (tok/s) | PPL |
|---|---|---|---|
| GPTQ | vLLM | 84.01 | 11.12 |
| GPTQ + Speculative Decoding | vLLM | 91.97 | 11.12 |
| GPTQ | SGLang | 90.10 | 11.12 |
| GPTQ + Speculative Decoding | SGLang | 57.10 | 11.12 |
| AWQ + Speculative Decoding | vLLM | 71.87 | 10.78 |
| AWQ | vLLM | 19.35 | 10.78 |
| HQQ (baseline) | – | 3.90 | 11.23 |
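The PPL column can be reproduced with a standard sliding-window perplexity loop; a minimal sketch over wikitext-2-raw-v1, assuming the transformers and datasets libraries (window and stride values are illustrative):

```python
# Sketch: perplexity on wikitext-2-raw-v1 (window/stride are illustrative).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B-Instruct"  # or a quantized checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto").eval()

text = "\n\n".join(load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

window, stride, nlls = 2048, 512, []
for start in range(0, ids.size(1) - window, stride):
    chunk = ids[:, start : start + window]
    labels = chunk.clone()
    labels[:, :-stride] = -100  # score only the last `stride` tokens
    with torch.no_grad():
        nlls.append(model(chunk, labels=labels).loss * stride)
print(f"PPL = {torch.exp(torch.stack(nlls).sum() / (len(nlls) * stride)).item():.2f}")
```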
**Key observations**
- GPTQ consistently outperforms AWQ in throughput, both with and without speculative decoding.
- Without speculative decoding, SGLang > vLLM due to better low-level kernel optimizations and Radix Attention.
- With speculative decoding, vLLM > SGLang because SGLang’s EAGLE-based approach restricts draft–target model choices and causes resource contention.
| Metric | Radix Attention (SGLang) | Paged Attention (vLLM) |
|---|---|---|
| Problem solved | Redundant KV-cache recomputation across requests | KV-cache memory fragmentation in long contexts |
| Core idea | Radix tree over token prefixes; requests sharing a prefix reuse cached entries | KV-cache paging (like OS virtual memory) |
| Best use case | Workloads with repeated prefixes (few-shot prompts, multi-turn chat) | Long-context tasks (RAG, document QA) |
| Designed for long contexts | No | Yes |
| Inference-speed gain | Significant | Significant |
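The memory pressure that Paged Attention addresses is easy to quantify; a back-of-the-envelope sketch using the Llama-3.2-3B architecture numbers (28 layers, 8 KV heads, head dim 128), with an fp16 cache assumed:

```python
# Sketch: KV-cache size arithmetic for Llama-3.2-3B (fp16 cache assumed).
num_layers, num_kv_heads, head_dim, bytes_per_elem = 28, 8, 128, 2
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K and V
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")                      # ~112 KiB
print(f"{kv_bytes_per_token * 4096 / 2**20:.0f} MiB for a 4k-token context")  # ~448 MiB
# Paged Attention allocates this cache in fixed-size blocks (e.g. 16 tokens),
# committing memory on demand instead of reserving the maximum length up front.
```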
JIT Layout Planning in SGLang optimizes the KV-cache memory layout at runtime, taking batch shape, prompt length, and GPU memory alignment into account. Without speculative decoding, this contributes to SGLang's +6.09 tok/s advantage over vLLM (90.10 vs. 84.01 tok/s in the results table above).
- Hu et al., LoRA: Low-Rank Adaptation of Large Language Models, 2021
- Frantar et al., GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, 2022
- Lin et al., AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, 2024
- Li et al., EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, 2024
- Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention, 2023
- Zheng et al., SGLang: Efficient Execution of Structured Language Model Programs (introduces RadixAttention), 2024
- vLLM: https://github.com/vllm-project/vllm
- SGLang: https://github.com/sgl-project/sglang