NVIDIA TensorRT Model Optimizer Examples

Quantization

  • PTQ for LLMs covers how to apply Post-training quantization (PTQ) to popular pre-trained LLMs and export them to TensorRT-LLM for deployment.
  • PTQ for DeepSeek shows how to quantize the DeepSeek model with FP4 and export to TensorRT-LLM.
  • PTQ for Diffusers walks through how to quantize a diffusion model with FP8 or INT8, export it to ONNX, and deploy it with TensorRT. The Diffusers example in this repo complements the demoDiffusion example in the TensorRT repo and includes FP8 plugins as well as the latest updates on INT8 quantization.
  • PTQ for VLMs covers how to use Post-training quantization (PTQ) and export to TensorRT-LLM for deployment for popular Vision Language Models (VLMs).
  • PTQ for ONNX Models shows how to quantize ONNX models in INT4 or INT8 mode. The examples also cover deploying the quantized ONNX models with TensorRT.
  • QAT for LLMs demonstrates the recipe and workflow for Quantization-aware Training (QAT), which can further preserve model accuracy at low precisions (e.g., INT4, or FP4 on the NVIDIA Blackwell platform).
  • QAT for CNNs demonstrates the recipe and workflow for Quantization-aware Training (QAT) of CNN models, which can further preserve model accuracy at low precisions such as INT8 and FP8.
  • AutoDeploy for AutoQuant LLM models demonstrates how to deploy mixed-precision models using ModelOpt's AutoQuant and TRT-LLM's AutoDeploy.
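
The core idea shared by the PTQ examples above can be sketched in a few lines: calibrate a scale from observed values, then round onto the integer grid and clamp. The following is a pure-Python illustration of symmetric INT8 quantization, not the ModelOpt API; all function names here are hypothetical.

```python
# Conceptual sketch of symmetric post-training quantization (PTQ).
# These helpers are illustrative only, not part of ModelOpt.

def calibrate_scale(values, num_bits=8):
    """Pick a symmetric scale so the max observed magnitude maps to the INT range."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    amax = max(abs(v) for v in values)
    return amax / qmax if amax else 1.0

def quantize(values, scale, num_bits=8):
    """Round real values onto the signed integer grid and clamp to range."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(v / scale))) for v in values]

def dequantize(qvalues, scale):
    """Map integer codes back to approximate real values."""
    return [q * scale for q in qvalues]

weights = [0.51, -1.27, 0.02, 1.27]
scale = calibrate_scale(weights)   # calibration pass over sample data
q = quantize(weights, scale)       # INT8 codes
recovered = dequantize(q, scale)   # close to the original weights
```

Real PTQ calibrates per-tensor or per-channel scales from activation statistics over a calibration dataset; the round-trip above is the smallest version of that loop.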

Pruning

  • Pruning demonstrates how to optimally prune Linear and Conv layers, as well as Transformer attention heads, MLP width, and depth, using the Model Optimizer.
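
At its simplest, pruning removes the least important weights. A minimal sketch of magnitude-based pruning, assuming importance is approximated by absolute weight value (a hypothetical helper, not ModelOpt's structured-pruning API):

```python
# Toy magnitude-based pruning: zero out the smallest-magnitude weights
# until a target sparsity is reached. Illustrative only.

def prune_by_magnitude(weights, sparsity):
    """Return a copy of `weights` with the smallest-|w| entries set to 0.0."""
    n_prune = int(len(weights) * sparsity)
    # Indices of the n_prune smallest-magnitude entries
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

layer = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
pruned = prune_by_magnitude(layer, sparsity=0.5)
# Half the entries (the three smallest in magnitude) are now zero.
```

Structured pruning of attention heads or whole layers follows the same pattern, except the unit that is scored and dropped is a head or block rather than a single weight.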

Distillation

  • Distillation for LLMs demonstrates how to use Knowledge Distillation, which can increase the accuracy and/or convergence speed of fine-tuning / QAT.
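
The heart of Knowledge Distillation is a loss that pulls the student's output distribution toward the teacher's, usually with temperature-softened softmax. A pure-Python sketch of that loss (real recipes typically scale by T² and mix in a standard cross-entropy term, omitted here):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax: higher T flattens the distribution."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Zero when the student matches the teacher, positive otherwise.
```

Minimizing this loss during fine-tuning or QAT is what transfers the teacher's "dark knowledge" (its relative confidences over wrong answers) to the student.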

Speculative Decoding

  • Speculative Decoding demonstrates how to use speculative decoding to accelerate the text generation of large language models.
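
The mechanism behind speculative decoding: a cheap draft model proposes several tokens, the target model verifies them in one pass, and the longest agreed prefix is kept. A sketch of the greedy variant, with stand-in functions instead of real models (production systems use probabilistic acceptance rather than exact match):

```python
# Conceptual sketch of greedy speculative decoding. The "models" are
# stand-in next-token functions, not real LLMs.

def speculative_step(prefix, draft_next, target_next, k=4):
    """Propose k draft tokens, keep the longest prefix the target agrees with,
    then append one token from the target itself."""
    # 1) Draft phase: cheap model proposes k tokens autoregressively.
    proposal = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # 2) Verify phase: target checks each proposed token in order.
    accepted = []
    ctx = list(prefix)
    for tok in proposal:
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    # 3) The target always contributes one token, so progress is guaranteed.
    accepted.append(target_next(ctx))
    return list(prefix) + accepted
```

When the draft model agrees with the target often, each target-model pass yields several tokens instead of one, which is where the speedup comes from.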

Sparsity

  • Sparsity for LLMs shows how to perform Post-training Sparsification and Sparsity-aware fine-tuning on a pre-trained Hugging Face model.
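
NVIDIA hardware accelerates the 2:4 structured sparsity pattern: in every contiguous group of four weights, at most two are nonzero. A toy sketch of imposing that pattern by magnitude (illustrative only, not ModelOpt's sparsification API):

```python
# Toy 2:4 structured sparsity: in each group of 4 weights,
# keep the 2 largest magnitudes and zero the rest.

def sparsify_2_4(weights):
    assert len(weights) % 4 == 0, "2:4 pattern applies to groups of four"
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest-magnitude entries in this group
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return out

row = [0.1, -0.8, 0.05, 0.6, 0.3, -0.2, 0.9, 0.0]
sparse_row = sparsify_2_4(row)  # exactly half the entries are zero
```

Sparsity-aware fine-tuning then retrains the surviving weights so the model recovers the accuracy lost when the pattern was imposed.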

Evaluation

  • Evaluation for LLMs shows how to evaluate the performance of LLMs on popular benchmarks for quantized models or TensorRT-LLM engines.
  • Evaluation for VLMs shows how to evaluate the performance of VLMs on popular benchmarks for quantized models or TensorRT-LLM engines.

Chaining

  • Chained Optimizations shows how to chain multiple optimizations together (e.g. Pruning + Distillation + Quantization).
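
Chaining simply means each optimization operates on the previous step's output. A minimal sketch composing pruning and quantization on a toy weight vector (hypothetical helpers, not the ModelOpt chaining API; distillation would slot in between as a retraining step):

```python
# Toy chain: prune first, then quantize the surviving weights.

def prune(ws, sparsity=0.5):
    """Zero out the smallest-magnitude half of the weights."""
    n = int(len(ws) * sparsity)
    drop = set(sorted(range(len(ws)), key=lambda i: abs(ws[i]))[:n])
    return [0.0 if i in drop else w for i, w in enumerate(ws)]

def quantize_int8(ws):
    """Symmetric INT8 quantization; returns integer codes and the scale."""
    amax = max(abs(w) for w in ws) or 1.0
    scale = amax / 127
    return [round(w / scale) for w in ws], scale

weights = [0.9, -0.05, 0.4, 0.01]
q, scale = quantize_int8(prune(weights))  # quantize operates on the pruned model
```

Ordering matters: pruning before quantization lets the quantizer calibrate its scale on the weights that actually survive.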

Model Hub

  • Model Hub provides an example of deploying and running the quantized Llama 3.1 8B Instruct model from NVIDIA's Hugging Face model hub on both TensorRT-LLM and vLLM.

Windows

  • Windows contains examples for Model Optimizer on Windows.