ARCQuant is a high-performance quantization framework designed to resolve the conflict between accuracy and inference efficiency in low-bit LLMs.
While fine-grained quantization (e.g., block-wise/NVFP4) effectively isolates quantization noise, activation outliers still degrade performance in critical channels. Traditional mixed-precision methods address this by splitting computation into separate branches, which introduces significant kernel-launch overhead and memory fragmentation.
ARCQuant takes a different approach. Instead of treating outliers separately, we leverage the structural sparsity of quantization errors in fine-grained settings. We capture the quantization residuals of these critical channels and fuse them back into the computation as Augmented Residual Channels (ARC).
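The core idea can be illustrated in a few lines of NumPy. This is a toy sketch under our own assumptions (simulated symmetric block-wise quantization, channels ranked by max quantization error), not the actual ARCQuant implementation or its NVFP4 kernels:

```python
import numpy as np

def quantize_blockwise(x, block=16, bits=4):
    """Simulated symmetric per-block quantization (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    xb = x.reshape(-1, block)
    scale = np.abs(xb).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0
    q = np.round(xb / scale).clip(-qmax, qmax)
    return (q * scale).reshape(x.shape)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 64))
X[:, 3] *= 50.0                        # an activation outlier channel
W = rng.normal(size=(64, 32))

Xq = quantize_blockwise(X)
err = np.abs(X - Xq).max(axis=0)       # per-channel quantization error
crit = np.argsort(err)[-8:]            # the most damaged (critical) channels
R = X[:, crit] - Xq[:, crit]           # their quantization residuals

# Fuse the residuals back as Augmented Residual Channels: append R as
# extra input columns and duplicate the matching weight rows, so the
# correction rides along in the SAME GEMM instead of a separate branch.
X_aug = np.concatenate([Xq, R], axis=1)
W_aug = np.concatenate([W, W[crit, :]], axis=0)

Y_ref = X @ W
Y_arc = X_aug @ W_aug                  # == Xq @ W + R @ W[crit, :]
```

The augmented GEMM exactly cancels the quantization error contributed by the critical channels, so the remaining error comes only from the well-behaved ones.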
To do:
- Release the arXiv version of ARCQuant.
- Release code for reproducing results.
- Release CUDA kernels on NVFP4.
- Release calibration and preprocessing scripts.
- Support vLLM integration.
- Model Support: Add support for more model families:
  - Qwen3
  - Mixtral
  - Wan2.2
```shell
conda create -n arcquant python=3.10 -y
conda activate arcquant
```
Please make sure that CUDA 12.8 is in your environment.
```shell
git clone --recurse-submodules https://github.com/actypedef/ARCQuant.git
cd ARCQuant
pip install -r requirements.txt
sudo apt-get update
sudo apt-get install python3-dev
conda install pybind11
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
cd kernels/
bash remake.sh
```
This might take a few minutes.
The reorder indices and `select_num` are required for quantization:
```shell
python reorder_indices.py --model /PATH/TO/YOUR/MODEL/ --samples 128 --seqlen 2048 --act_sort_metric max
```
Results are saved in `./saved/`.
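With `--act_sort_metric max`, channels are presumably ranked by their maximum absolute activation over the calibration samples. A minimal sketch of that metric (the function name and exact reduction are our assumptions, not the script's actual code):

```python
import numpy as np

def reorder_indices_by_max(acts):
    """Rank channels by max |activation| over calibration samples,
    largest first (hypothetical version of the 'max' sort metric)."""
    metric = np.abs(acts).max(axis=0)   # per-channel max magnitude
    return np.argsort(metric)[::-1]     # outlier-heavy channels first

# acts: (num_samples, num_channels); channel 1 is a clear outlier
acts = np.array([[0.1, 9.0, 0.3],
                 [0.2, 7.5, 0.1]])
print(reorder_indices_by_max(acts))     # prints [1 2 0]
```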
```shell
bash evaluate.sh /PATH/TO/YOUR/MODEL/
```
FlashInfer:
```shell
cd third-party/flashinfer
python -m pip install -v .
```
We will release our vLLM evaluation very soon.
```bibtex
@article{meng2026arcquant,
  title={ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs},
  author={Meng, Haoqian and Luo, Yilun and Zhao, Yafei and Liu, Wenyuan and Zhang, Peng and Ma, Xindian},
  journal={arXiv preprint arXiv:2601.07475},
  year={2026}
}
```
Our code is built on the following repos; thank you for your contributions to the community:
