Qwen3-1.7B GPTQ: 4-bit quantized model outputs garbage (PyTorch + ONNX); 8-bit GPTQ ONNX fails to load (MatMulNBits / protobuf)

Hi, thanks for QLLM. I’ve been quantizing Qwen/Qwen3-1.7B and hit a couple of issues around GPTQ and ONNX:

### Summary

- **4-bit ORT path**: 4-bit quantization + ORT pack mode gives **garbage generations already in PyTorch**, and the exported ONNX model also produces repetitive/random tokens.
- **8-bit GPTQ path**: 8-bit GPTQ quantization works correctly in PyTorch (good generations), but the exported ONNX model fails to load in onnxruntime due to `MatMulNBits` / protobuf parsing issues (possibly compounded by IR-version changes I made myself while experimenting).

### Environment

- Python: 3.10+
- `torch` >= 2.0
- `transformers` 4.46.3
- `onnx` (latest from PyPI at time of testing)
- `onnxruntime-gpu` (latest from PyPI at time of testing)
- `qllm` installed from this repo: `pip install -e .`

### 4-bit quantization (PyTorch and ONNX both output garbage)

Command:

```bash
python -m qllm \
  --model Qwen/Qwen3-1.7B \
  --quant_method gptq \
  --wbits 4 \
  --groupsize 128 \
  --pack_mode ORT \
  --dataset pileval \
  --nsamples 128 \
  --sym \
  --act-order \
  --save ./qwen3-1_7b-gptq4_ort \
  --export_onnx ./onnx_qwen3-1_7b-gptq4_ort \
  --verify_save
```

Behavior:

- The **4-bit PyTorch quantized model itself** generates highly repetitive, random-looking tokens (garbage), so the problem seems to start already in the quantized model, not just in ONNX.
- `decoder_merged.onnx` from this run loads in onnxruntime, but its generations are just as bad (garbage / repetitive tokens), which is consistent with the underlying 4-bit quantized model being wrong.

This suggests an issue in the 4-bit ORT GPTQ/packing path (quantization, kernels, or accuracy) rather than a pure export problem.
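
To make "repetitive" less subjective when comparing the quantized model against the FP16 baseline, a simple distinct n-gram ratio over the generated token IDs can help. This is only a rough sketch: `distinct_ngram_ratio` is a hypothetical helper (not part of QLLM), and it only measures repetition, so random-looking but non-repeating output would not be flagged by it.

```python
def distinct_ngram_ratio(token_ids, n=2):
    """Fraction of n-grams that are unique; values near 0 mean heavy repetition."""
    # Slide a window of size n over the generated token IDs.
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    if not ngrams:
        return 1.0  # sequence too short to judge
    return len(set(ngrams)) / len(ngrams)
```

Running the same prompt through the FP16 and the 4-bit model and comparing the two ratios gives a number that can be pasted into the report alongside the raw generations.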

### 8-bit GPTQ quantization (PyTorch OK, ONNX fails to load)

Command:

```bash
python -m qllm \
  --model Qwen/Qwen3-1.7B \
  --quant_method gptq \
  --wbits 8 \
  --groupsize 128 \
  --pack_mode GPTQ \
  --dataset pileval \
  --nsamples 128 \
  --sym \
  --act-order \
  --save ./qwen3-1_7b-gptq8 \
  --export_onnx ./onnx_qwen3-1_7b-gptq8 \
  --verify_save
```

Behavior:

- The **8-bit GPTQ PyTorch model** loads and generates correct, sensible text (so 8-bit GPTQ itself looks good).
- The exported `decoder_merged.onnx` fails to load in onnxruntime.

From code inspection and experiments:

- The 4-bit ORT path uses something like `QuantLinearORT`, which emits `com.microsoft::MatMulNBits` with `bits=4` and **no `packing` attribute**; my onnxruntime build accepts this schema but, as noted, the quantized model already generates garbage.
- The 8-bit GPTQ path uses `QuantLinearGPTQ`, which emits `com.microsoft::MatMulNBits` with `bits=8` and a `packing`/`packing_s="gptq"` attribute. onnxruntime reports errors such as:

  ```text
  Unrecognized attribute: packing for operator MatMulNBits
  ```

- I tried to work around this by editing the exported ONNX model with `onnx` to strip the `packing` attribute and re-saving it. After that, onnxruntime started failing with protobuf parsing errors (e.g. `INVALID_PROTOBUF` / external data issues), suggesting that the manual graph edit corrupted the model or its external data.
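
For anyone reproducing the attribute-stripping experiment, here is a minimal sketch of one way to do it while keeping the external weight data attached. Assumptions: `strip_attribute` and `strip_and_resave` are hypothetical helper names (not part of QLLM), the two numeric constants mirror `onnx.AttributeProto.GRAPH` and `onnx.AttributeProto.GRAPHS`, and the destination is a fresh, empty directory. The helper recurses into subgraphs in case the merged decoder wraps its branches in `If` nodes. Note that removing the attribute only changes what the schema check sees; it does not repack the weights, so a model that loads this way may still compute wrong results.

```python
import os

GRAPH, GRAPHS = 5, 10  # values of onnx.AttributeProto.GRAPH / GRAPHS


def strip_attribute(graph, op_type, attr_name):
    """Remove attr_name from every op_type node, including nodes in subgraphs.

    Returns the number of attributes removed.
    """
    removed = 0
    for node in graph.node:
        # Visit subgraphs first (e.g. the branches of an If node).
        for attr in node.attribute:
            if attr.type == GRAPH:
                removed += strip_attribute(attr.g, op_type, attr_name)
            elif attr.type == GRAPHS:
                for sub in attr.graphs:
                    removed += strip_attribute(sub, op_type, attr_name)
        if node.op_type == op_type:
            # Delete by index, back to front, so remaining indices stay valid.
            for i in reversed(range(len(node.attribute))):
                if node.attribute[i].name == attr_name:
                    del node.attribute[i]
                    removed += 1
    return removed


def strip_and_resave(src, dst, attr_name="packing"):
    """Load src with its external data, strip the attribute, re-save to dst."""
    import onnx  # imported lazily so strip_attribute has no dependencies

    # Pull the external tensors into memory so they travel with the model.
    model = onnx.load(src, load_external_data=True)
    removed = strip_attribute(model.graph, "MatMulNBits", attr_name)
    # Write the weights back out as one external file next to dst.
    # Assumption: dst lives in a fresh directory with no stale .data file.
    onnx.save_model(
        model,
        dst,
        save_as_external_data=True,
        all_tensors_to_one_file=True,
        location=os.path.basename(dst) + ".data",
        size_threshold=1024,
    )
    return removed
```

Re-saving with `save_as_external_data=True` keeps the `.onnx` file itself small, which avoids protobuf's 2 GB single-file limit; whether that limit was the cause of the `INVALID_PROTOBUF` error here is not confirmed.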

### IR version / downgrade

- I **manually** added IR-downgrade logic in the exporter to force the merged ONNX model’s IR version down to 11 (to match an older onnxruntime build), and later added a flag `--onnx_no_ir_downgrade` to keep the natural IR (e.g. IR 13) so I could try newer onnxruntime builds.
- I did see protobuf/parsing issues when loading the 8-bit GPTQ ONNX after downgrading IR, and I’m not sure yet whether those are caused by my IR-downgrade change or purely by the `MatMulNBits` + `packing` behavior.
- I’m still in the process of setting up and testing with the “normal” (undowngraded) IR and a fresh, up-to-date onnxruntime-gpu to rule in or out the IR change as the root cause. At least in earlier experiments, the `MatMulNBits` + `packing` combination still looked like a blocker for loading the 8-bit ONNX.
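
To separate the IR-version question from the `packing` question, it helps to record what the exported file actually declares. The sketch below summarizes the IR version, the opset imports, and the attribute names found on `MatMulNBits` nodes. Assumptions: `summarize_model` and `inspect_file` are hypothetical helper names (not part of QLLM), and the constants 5 and 10 mirror `onnx.AttributeProto.GRAPH` and `GRAPHS`.

```python
from collections import Counter


def summarize_model(model, op_type="MatMulNBits"):
    """Return IR version, opset imports, and attribute-name sets for op_type."""
    attr_sets = Counter()

    def walk(graph):
        for node in graph.node:
            if node.op_type == op_type:
                # Count nodes by their sorted attribute names.
                attr_sets[tuple(sorted(a.name for a in node.attribute))] += 1
            for a in node.attribute:
                if a.type == 5:  # AttributeProto.GRAPH
                    walk(a.g)
                elif a.type == 10:  # AttributeProto.GRAPHS
                    for sub in a.graphs:
                        walk(sub)

    walk(model.graph)
    return {
        "ir_version": model.ir_version,
        # The default ONNX domain is stored as an empty string.
        "opsets": {imp.domain or "ai.onnx": imp.version for imp in model.opset_import},
        "attr_sets": dict(attr_sets),
    }


def inspect_file(path):
    """Summarize an ONNX file without reading its external weight data."""
    import onnx  # imported lazily so summarize_model has no dependencies

    model = onnx.load(path, load_external_data=False)
    return summarize_model(model)
```

Calling `inspect_file` on the exported `decoder_merged.onnx` before and after the IR downgrade shows whether only `ir_version` changed or whether the node attributes differ as well.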

### Questions

- For **4-bit ORT**:
  - Is there a known issue with the current 4-bit ORT GPTQ/packing path or kernels that could explain a 4-bit quantized model (PyTorch) already generating garbage text?
- For **8-bit GPTQ**:
  - What is the expected `MatMulNBits` schema for GPTQ-style packing (attribute name/type, allowed values, etc.) that onnxruntime supports?
  - Should the exporter avoid emitting `packing`/`packing_s` for now, or is there a specific onnxruntime version / execution provider combination that’s expected to work with the current schema?
- More generally:
  - What IR/opset + onnxruntime-gpu version combo do you recommend for this export path?
  - Do you expect the 8-bit GPTQ ONNX model for Qwen3-1.7B to load and run out-of-the-box with that combination?

I can share exact stack traces, onnxruntime snippets, and environment details if that would help. TIA!
