Hi, thanks for QLLM. I’ve been quantizing Qwen/Qwen3-1.7B and hit a couple of issues around GPTQ and ONNX:
Summary
- 4-bit ORT path: 4-bit quantization + ORT pack mode gives garbage generations already in PyTorch, and the exported ONNX model also produces repetitive/random tokens.
- 8-bit GPTQ path: 8-bit GPTQ quantization works correctly in PyTorch (good generations), but the exported ONNX model fails to load in onnxruntime due to MatMulNBits / protobuf parsing issues (and possibly IR-version interactions I added myself while experimenting).
Environment
- Python: 3.10+
- torch >= 2.0
- transformers 4.46.3
- onnx (latest from PyPI at time of testing)
- onnxruntime-gpu (latest from PyPI at time of testing)
- qllm installed from this repo: pip install -e .
4-bit quantization (PyTorch and ONNX both output garbage)
Command:
python -m qllm \
--model Qwen/Qwen3-1.7B \
--quant_method gptq \
--wbits 4 \
--groupsize 128 \
--pack_mode ORT \
--dataset pileval \
--nsamples 128 \
--sym \
--act-order \
--save ./qwen3-1_7b-gptq4_ort \
--export_onnx ./onnx_qwen3-1_7b-gptq4_ort \
--verify_save
Behavior:
- The 4-bit PyTorch quantized model itself generates highly repetitive, random-looking tokens (garbage), so the problem seems to start in the quantized model, not just in ONNX.
- decoder_merged.onnx from this run loads in onnxruntime, but ONNX generation shows the same bad behavior (garbage / repetitive tokens), consistent with the underlying 4-bit quantized model being wrong.
This suggests an issue in the 4-bit ORT GPTQ/packing path (quantization, kernels, or numerical accuracy), rather than purely an export problem.
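To put a number on "highly repetitive", one option is a distinct n-gram ratio over the generated token IDs. The sketch below is only an illustration: the function name, the choice of trigrams, and the sample sequences are assumptions, not part of QLLM or of my actual test script.

```python
def distinct_ngram_ratio(token_ids, n=3):
    """Fraction of n-grams in `token_ids` that are unique (1.0 means no repetition)."""
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    if not ngrams:
        return 1.0  # sequence shorter than n: nothing to compare
    return len(set(ngrams)) / len(ngrams)


# Made-up token IDs: one sequence stuck in a loop, one with no repeats.
looping = [11, 22, 33] * 10
varied = list(range(30))

looping_ratio = distinct_ngram_ratio(looping)
varied_ratio = distinct_ngram_ratio(varied)
```

A ratio close to 0 on the quantized model's output, next to a clearly higher ratio for the fp16 model on the same prompt, would back up the "garbage" observation with a comparable figure.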
8-bit GPTQ quantization (PyTorch OK, ONNX fails to load)
Command:
python -m qllm \
--model Qwen/Qwen3-1.7B \
--quant_method gptq \
--wbits 8 \
--groupsize 128 \
--pack_mode GPTQ \
--dataset pileval \
--nsamples 128 \
--sym \
--act-order \
--save ./qwen3-1_7b-gptq8 \
--export_onnx ./onnx_qwen3-1_7b-gptq8 \
--verify_save
Behavior:
- The 8-bit GPTQ PyTorch model loads and generates correct, sensible text (so 8-bit GPTQ itself looks good).
- The exported decoder_merged.onnx fails to load in onnxruntime.
From code inspection and experiments:
- The 4-bit ORT path uses something like QuantLinearORT, which emits com.microsoft::MatMulNBits with bits=4 and no packing attribute; my onnxruntime build accepts this schema but, as noted, the quantized model already generates garbage.
- The 8-bit GPTQ path uses QuantLinearGPTQ, which emits com.microsoft::MatMulNBits with bits=8 and a packing/packing_s="gptq" attribute. onnxruntime reports errors such as:
  Unrecognized attribute: packing for operator MatMulNBits
- I tried to work around this by editing the exported ONNX model with onnx to strip the packing attribute and re-saving. After that, onnxruntime started failing with protobuf parsing errors (e.g. INVALID_PROTOBUF / external data issues), suggesting that my manual graph editing corrupted the model or its external data.
IR version / downgrade
- I manually added IR-downgrade logic in the exporter to force the merged ONNX model’s IR version down to 11 (to match an older onnxruntime build), and later added a flag --onnx_no_ir_downgrade to keep the natural IR (e.g. IR 13) so I could try newer onnxruntime builds.
- I did see protobuf/parsing issues when loading the 8-bit GPTQ ONNX after downgrading IR, and I’m not sure yet whether those are caused by my IR-downgrade change or purely by the MatMulNBits + packing behavior.
- I’m still in the process of setting up and testing with the “normal” (undowngraded) IR and a fresh, up-to-date onnxruntime-gpu to rule in or out the IR change as the root cause. At least in earlier experiments, the MatMulNBits + packing combination still looked like a blocker for loading the 8-bit ONNX.
Questions
- For 4-bit ORT:
  - Is there a known issue with the current 4-bit ORT GPTQ/packing path or kernels that could explain a 4-bit quantized model (PyTorch) already generating garbage text?
- For 8-bit GPTQ:
  - What is the expected MatMulNBits schema for GPTQ-style packing (attribute name/type, allowed values, etc.) that onnxruntime supports?
  - Should the exporter avoid emitting packing/packing_s for now, or is there a specific onnxruntime version / execution provider combination that’s expected to work with the current schema?
- More generally:
  - What IR/opset + onnxruntime-gpu version combo do you recommend for this export path?
  - Do you expect the 8-bit GPTQ ONNX model for Qwen3-1.7B to load and run out-of-the-box with that combination?
I can share exact stack traces, ONNX runtime snippets, and environment details if that would help. TIA!