Qwen3-1.7B GPTQ: 4-bit quantized model outputs garbage (PyTorch + ONNX); 8-bit GPTQ ONNX fails to load (MatMulNBits / protobuf) #170

@Niketagarwal10

Hi, thanks for QLLM. I’ve been quantizing Qwen/Qwen3-1.7B and hit a couple of issues around GPTQ and ONNX:

Summary

  • 4-bit ORT path: 4-bit quantization + ORT pack mode gives garbage generations already in PyTorch, and the exported ONNX model also produces repetitive/random tokens.
  • 8-bit GPTQ path: 8-bit GPTQ quantization works correctly in PyTorch (good generations), but the exported ONNX model fails to load in onnxruntime due to MatMulNBits / protobuf parsing issues (and possibly IR-version interactions I added myself while experimenting).

Environment

  • Python: 3.10+
  • torch >= 2.0
  • transformers 4.46.3
  • onnx (latest from PyPI at time of testing)
  • onnxruntime-gpu (latest from PyPI at time of testing)
  • qllm installed from this repo: pip install -e .
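
Since several entries above are only "latest from PyPI", a small script like the following could pin the exact versions for the report. This is a minimal sketch using only the standard library; the distribution names in PACKAGES are assumptions (in particular, that the editable install registers itself as "qllm").

```python
from importlib import metadata

# Distribution names as they would appear on PyPI (assumed, adjust as needed).
PACKAGES = ["torch", "transformers", "onnx", "onnxruntime-gpu", "qllm"]


def installed_versions(names):
    """Map each distribution name to its installed version, or 'not installed'."""
    report = {}
    for name in names:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}=={version}")
```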

4-bit quantization (PyTorch and ONNX both output garbage)

Command:

python -m qllm \
  --model Qwen/Qwen3-1.7B \
  --quant_method gptq \
  --wbits 4 \
  --groupsize 128 \
  --pack_mode ORT \
  --dataset pileval \
  --nsamples 128 \
  --sym \
  --act-order \
  --save ./qwen3-1_7b-gptq4_ort \
  --export_onnx ./onnx_qwen3-1_7b-gptq4_ort \
  --verify_save

Behavior:

  • The 4-bit PyTorch quantized model itself generates highly repetitive, random-looking tokens (garbage), so the problem seems to start already in the quantized model, not just in ONNX.
  • decoder_merged.onnx from this run loads in onnxruntime, but its generations show the same bad behavior (garbage / repetitive tokens), consistent with the underlying 4-bit quantized model being wrong.

This suggests an issue in the 4-bit ORT GPTQ/packing path itself (in quantization, the kernels, or numerical accuracy), rather than purely an export problem.
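
To make "repetitive" measurable when comparing the 4-bit and 8-bit models on the same prompt, one option is a distinct n-gram ratio over the generated token IDs. This is a minimal sketch, not part of QLLM; the function name and the choice of n=3 are illustrative assumptions, and it only detects repetition, not random-looking output.

```python
def distinct_ngram_ratio(token_ids, n=3):
    """Fraction of n-grams that are unique; values near 0 mean heavy repetition."""
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    if not ngrams:
        # Sequence shorter than n: nothing to compare, treat as non-repetitive.
        return 1.0
    return len(set(ngrams)) / len(ngrams)
```

A healthy generation of a few hundred tokens would normally score well above a degenerate loop, so reporting this number for both bit widths gives a rough, reproducible comparison.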

8-bit GPTQ quantization (PyTorch OK, ONNX fails to load)

Command:

python -m qllm \
  --model Qwen/Qwen3-1.7B \
  --quant_method gptq \
  --wbits 8 \
  --groupsize 128 \
  --pack_mode GPTQ \
  --dataset pileval \
  --nsamples 128 \
  --sym \
  --act-order \
  --save ./qwen3-1_7b-gptq8 \
  --export_onnx ./onnx_qwen3-1_7b-gptq8 \
  --verify_save

Behavior:

  • The 8-bit GPTQ PyTorch model loads and generates correct, sensible text (so 8-bit GPTQ itself looks good).
  • The exported decoder_merged.onnx fails to load in onnxruntime.

From code inspection and experiments:

  • The 4-bit ORT path uses something like QuantLinearORT, which emits com.microsoft::MatMulNBits with bits=4 and no packing attribute; my onnxruntime build accepts this schema but, as noted, the quantized model already generates garbage.

  • The 8-bit GPTQ path uses QuantLinearGPTQ, which emits com.microsoft::MatMulNBits with bits=8 and a packing/packing_s="gptq" attribute. onnxruntime reports errors such as:

    Unrecognized attribute: packing for operator MatMulNBits
    
  • I tried to work around this by using the onnx package to strip the packing attribute from the exported model and re-saving it. After that, onnxruntime began failing with protobuf parsing errors (e.g. INVALID_PROTOBUF / external data issues), which suggests that my manual graph editing corrupted the model or its external data.
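
For reference, the attribute-stripping workaround described above can be sketched roughly as follows. This is not my exact script and not QLLM code: strip_attribute and rewrite_model are hypothetical helper names, the attribute name "packing" is taken from the error message, and the data file name is an assumption. The helper recurses into subgraphs (in case the merged decoder wraps its branches in an If node) and re-saves with external data so the weights are written next to the new model file. It only removes the attribute; whether the kernel then interprets the GPTQ-packed weights correctly is a separate question.

```python
# Enum values of onnx.AttributeProto.GRAPH and onnx.AttributeProto.GRAPHS.
GRAPH, GRAPHS = 5, 10


def strip_attribute(graph, op_type, attr_name):
    """Delete attr_name from every op_type node in graph and its subgraphs.

    Returns the number of attributes removed.
    """
    removed = 0
    for node in graph.node:
        # Walk backwards so deleting by index does not shift later entries.
        for i in reversed(range(len(node.attribute))):
            attr = node.attribute[i]
            if node.op_type == op_type and attr.name == attr_name:
                del node.attribute[i]
                removed += 1
            elif attr.type == GRAPH:
                removed += strip_attribute(attr.g, op_type, attr_name)
            elif attr.type == GRAPHS:
                for sub in attr.graphs:
                    removed += strip_attribute(sub, op_type, attr_name)
    return removed


def rewrite_model(src, dst, data_file="decoder_merged.onnx.data"):
    """Load src (with its external data), strip 'packing', save to dst."""
    import onnx  # imported lazily so strip_attribute stays dependency-free

    model = onnx.load(src)  # external data is resolved relative to src
    count = strip_attribute(model.graph, "MatMulNBits", "packing")
    onnx.save_model(
        model,
        dst,
        save_as_external_data=True,
        all_tensors_to_one_file=True,
        location=data_file,  # hypothetical file name, written next to dst
    )
    return count
```

Writing dst into a fresh, empty directory avoids mixing the new external data file with the one from the original export.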

IR version / downgrade

  • I manually added IR-downgrade logic in the exporter to force the merged ONNX model’s IR version down to 11 (to match an older onnxruntime build), and later added a flag --onnx_no_ir_downgrade to keep the natural IR (e.g. IR 13) so I could try newer onnxruntime builds.
  • I did see protobuf/parsing issues when loading the 8-bit GPTQ ONNX after downgrading IR, and I’m not sure yet whether those are caused by my IR-downgrade change or purely by the MatMulNBits + packing behavior.
  • I’m still setting up and testing with the “normal” (non-downgraded) IR and a fresh, up-to-date onnxruntime-gpu to rule the IR change in or out as the root cause. At least in earlier experiments, the MatMulNBits + packing combination still looked like a blocker for loading the 8-bit ONNX.
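
To separate the IR question from the MatMulNBits question, the exported file can be inspected without loading the weights. The sketch below is an illustration with hypothetical helper names (summarize_model, summarize_file); it reports the IR version, the opset imports, and which attribute names actually appear on MatMulNBits nodes, including nodes inside subgraphs.

```python
from collections import Counter


def summarize_model(model, op_type="MatMulNBits"):
    """Return IR version, opset imports, and attribute-name counts for op_type."""
    attr_names = Counter()

    def walk(graph):
        for node in graph.node:
            for attr in node.attribute:
                if node.op_type == op_type:
                    attr_names[attr.name] += 1
                if attr.type == 5:  # onnx.AttributeProto.GRAPH
                    walk(attr.g)
                elif attr.type == 10:  # onnx.AttributeProto.GRAPHS
                    for sub in attr.graphs:
                        walk(sub)

    walk(model.graph)
    return {
        "ir_version": model.ir_version,
        # An empty domain string denotes the default ai.onnx operator set.
        "opsets": {imp.domain or "ai.onnx": imp.version for imp in model.opset_import},
        "attr_names": dict(attr_names),
    }


def summarize_file(path):
    """Summarize an ONNX file on disk, skipping the external weight data."""
    import onnx  # imported lazily so summarize_model stays dependency-free

    return summarize_model(onnx.load(path, load_external_data=False))
```

Running summarize_file on the 4-bit and 8-bit exports side by side should show directly whether they differ only in bits and the packing attribute, or also in IR version and opsets.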

Questions

  • For 4-bit ORT:
    • Is there a known issue with the current 4-bit ORT GPTQ/packing path or kernels that could explain a 4-bit quantized model (PyTorch) already generating garbage text?
  • For 8-bit GPTQ:
    • What is the expected MatMulNBits schema for GPTQ-style packing (attribute name/type, allowed values, etc.) that onnxruntime supports?
    • Should the exporter avoid emitting packing/packing_s for now, or is there a specific onnxruntime version / execution provider combination that’s expected to work with the current schema?
  • More generally:
    • What IR/opset + onnxruntime-gpu version combo do you recommend for this export path?
    • Do you expect the 8-bit GPTQ ONNX model for Qwen3-1.7B to load and run out-of-the-box with that combination?

I can share exact stack traces, ONNX runtime snippets, and environment details if that would help. TIA!
