Describe the bug
I am using the Model-Optimizer/examples/torch_onnx/llm_export.py script to convert a .safetensors LLM model to the ONNX format and quantize it. The model is then supposed to be converted to the TRT format for use with TensorRT. The produced ONNX model has "input_ids", "logits", and "present_key_values*" nodes but is missing the "position_ids", "attention_mask", and "past_kv*" nodes.
Steps/Code to reproduce bug
Install packages
python -m pip install nvidia-modelopt[all]
python -m pip install onnx==1.18.0
python -m pip install onnxruntime[gpu]==1.23.0
plus any others requested on demand while running llm_export.py. Set up paths:
export LD_LIBRARY_PATH=<path/to/cuda/libs>:<path/to/cudnn/lib>
export PATH=<path/to/cuda/bin>:$PATH
Clone the Model-Optimizer repo in order to use the example scripts
git clone https://github.com/NVIDIA/Model-Optimizer.git
Navigate to torch_onnx example
cd Model-Optimizer/examples/torch_onnx
and launch conversion of the HF model to ONNX with INT4 AWQ quantization:
python llm_export.py --hf_model_path=meta-llama/Llama-3.1-8B-Instruct --dtype=int4_awq --calib_size=512 --output_dir=models/Llama-3.1-8B-Instruct-ONNX-INT4
Result: the produced ONNX model is missing "position_ids", "attention_mask" and "past_kv*" nodes.
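For reference, a minimal sketch (plain Python) of the check behind this claim. The input-name sets below are taken from this report, not from any spec; in practice the observed names would be read from the exported graph (e.g. via `onnx.load(...).graph.input`):

```python
# Inputs expected for a decoder-only LLM export vs. inputs observed in the
# produced ONNX graph. The kv-cache inputs (the "past_kv*" family) are elided
# here since their exact names vary per model; both sets come from the report.
expected = {"input_ids", "attention_mask", "position_ids"}
observed = {"input_ids"}

missing = sorted(expected - observed)
print(missing)  # the graph inputs absent from the exported model
```

Running this prints `['attention_mask', 'position_ids']`, matching the result described above.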
Expected behavior
A typical LLM ONNX model must have "input_ids", "attention_mask", "position_ids", "logits", and the past and present kv-cache nodes. Here, several of them are missing.
System information
- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): Ubuntu 20.04
- CPU architecture (x86_64, aarch64): x86_64
- GPU memory size: enough
- Library versions (if applicable):
- Python: 3.12
- ModelOpt version or commit hash: >=0.39
- CUDA: 12.3
- PyTorch: 2.7.1+cu118
- Transformers: 4.57.3
- onnxruntime-gpu: 1.23.0