NVIDIA TensorRT Model Optimizer Examples

Quantization

  • PTQ for LLMs covers how to apply Post-training quantization (PTQ) to popular pre-trained LLMs and export them to TensorRT-LLM for deployment.
  • PTQ for DeepSeek shows how to quantize the DeepSeek model with FP4 and export to TensorRT-LLM.
  • PTQ for Diffusers walks through how to quantize a diffusion model with FP8 or INT8, export it to ONNX, and deploy it with TensorRT. The Diffusers example in this repo complements the demoDiffusion example in the TensorRT repo and includes FP8 plugins as well as the latest updates on INT8 quantization.
  • PTQ for VLMs covers how to use Post-training quantization (PTQ) and export to TensorRT-LLM for deployment for popular Vision Language Models (VLMs).
  • PTQ for ONNX Models shows how to quantize ONNX models in INT4 or INT8 mode. The examples also cover deploying the quantized ONNX models with TensorRT.
  • QAT for LLMs demonstrates the recipe and workflow for Quantization-aware Training (QAT), which can further preserve model accuracy at low precisions (e.g., INT4, or FP4 on the NVIDIA Blackwell platform).
  • QAT for CNNs demonstrates the recipe and workflow for Quantization-aware Training (QAT) of CNN models, which can further preserve model accuracy at low precisions such as INT8 and FP8.
  • AutoDeploy for AutoQuant LLM models demonstrates how to deploy mixed-precision models using ModelOpt's AutoQuant and TRT-LLM's AutoDeploy.
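
The core idea shared by the PTQ examples above can be sketched in a few lines: calibrate a scale from observed values, then round onto the integer grid and clamp. The following is a pure-Python illustration of symmetric INT8 quantization, not the ModelOpt API; all function names here are hypothetical.

```python
# Conceptual sketch of symmetric post-training quantization (PTQ).
# These helpers are illustrative only, not part of ModelOpt.

def calibrate_scale(values, num_bits=8):
    """Pick a symmetric scale so the max observed magnitude maps to the INT range."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    amax = max(abs(v) for v in values)
    return amax / qmax if amax else 1.0

def quantize(values, scale, num_bits=8):
    """Round real values onto the signed integer grid and clamp to range."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(v / scale))) for v in values]

def dequantize(qvalues, scale):
    """Map integer codes back to approximate real values."""
    return [q * scale for q in qvalues]

weights = [0.51, -1.27, 0.02, 1.27]
scale = calibrate_scale(weights)   # calibration pass over sample data
q = quantize(weights, scale)       # INT8 codes
recovered = dequantize(q, scale)   # close to the original weights
```

Real PTQ calibrates per-tensor or per-channel scales from activation statistics over a calibration dataset; the round-trip above is the smallest version of that loop.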

Pruning

  • Pruning demonstrates how to optimally prune Linear and Conv layers, as well as Transformer attention heads, MLP width, and depth, using the Model Optimizer.
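
At its simplest, pruning removes the least important weights. A minimal sketch of magnitude-based pruning, assuming importance is approximated by absolute weight value (a hypothetical helper, not ModelOpt's structured-pruning API):

```python
# Toy magnitude-based pruning: zero out the smallest-magnitude weights
# until a target sparsity is reached. Illustrative only.

def prune_by_magnitude(weights, sparsity):
    """Return a copy of `weights` with the smallest-|w| entries set to 0.0."""
    n_prune = int(len(weights) * sparsity)
    # Indices of the n_prune smallest-magnitude entries
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

layer = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
pruned = prune_by_magnitude(layer, sparsity=0.5)
# Half the entries (the three smallest in magnitude) are now zero.
```

Structured pruning of attention heads or whole layers follows the same pattern, except the unit that is scored and dropped is a head or block rather than a single weight.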

Distillation

  • Distillation for LLMs demonstrates how to use Knowledge Distillation, which can increase the accuracy and/or convergence speed of fine-tuning / QAT.
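
The heart of Knowledge Distillation is a loss that pulls the student's output distribution toward the teacher's, usually with temperature-softened softmax. A pure-Python sketch of that loss (real recipes typically scale by T² and mix in a standard cross-entropy term, omitted here):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax: higher T flattens the distribution."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Zero when the student matches the teacher, positive otherwise.
```

Minimizing this loss during fine-tuning or QAT is what transfers the teacher's "dark knowledge" (its relative confidences over wrong answers) to the student.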

Speculative Decoding

  • Speculative Decoding demonstrates how to use speculative decoding to accelerate the text generation of large language models.
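
The mechanism behind speculative decoding: a cheap draft model proposes several tokens, the target model verifies them in one pass, and the longest agreed prefix is kept. A sketch of the greedy variant, with stand-in functions instead of real models (production systems use probabilistic acceptance rather than exact match):

```python
# Conceptual sketch of greedy speculative decoding. The "models" are
# stand-in next-token functions, not real LLMs.

def speculative_step(prefix, draft_next, target_next, k=4):
    """Propose k draft tokens, keep the longest prefix the target agrees with,
    then append one token from the target itself."""
    # 1) Draft phase: cheap model proposes k tokens autoregressively.
    proposal = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # 2) Verify phase: target checks each proposed token in order.
    accepted = []
    ctx = list(prefix)
    for tok in proposal:
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    # 3) The target always contributes one token, so progress is guaranteed.
    accepted.append(target_next(ctx))
    return list(prefix) + accepted
```

When the draft model agrees with the target often, each target-model pass yields several tokens instead of one, which is where the speedup comes from.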

Sparsity

  • Sparsity for LLMs shows how to perform Post-training Sparsification and Sparsity-aware fine-tuning on a pre-trained Hugging Face model.
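
NVIDIA hardware accelerates the 2:4 structured sparsity pattern: in every contiguous group of four weights, at most two are nonzero. A toy sketch of imposing that pattern by magnitude (illustrative only, not ModelOpt's sparsification API):

```python
# Toy 2:4 structured sparsity: in each group of 4 weights,
# keep the 2 largest magnitudes and zero the rest.

def sparsify_2_4(weights):
    assert len(weights) % 4 == 0, "2:4 pattern applies to groups of four"
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest-magnitude entries in this group
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return out

row = [0.1, -0.8, 0.05, 0.6, 0.3, -0.2, 0.9, 0.0]
sparse_row = sparsify_2_4(row)  # exactly half the entries are zero
```

Sparsity-aware fine-tuning then retrains the surviving weights so the model recovers the accuracy lost when the pattern was imposed.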

Evaluation

  • Evaluation for LLMs shows how to evaluate the performance of LLMs on popular benchmarks for quantized models or TensorRT-LLM engines.
  • Evaluation for VLMs shows how to evaluate the performance of VLMs on popular benchmarks for quantized models or TensorRT-LLM engines.

Chaining

  • Chained Optimizations shows how to chain multiple optimizations together (e.g. Pruning + Distillation + Quantization).
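
Chaining simply means each optimization operates on the previous step's output. A minimal sketch composing pruning and quantization on a toy weight vector (hypothetical helpers, not the ModelOpt chaining API; distillation would slot in between as a retraining step):

```python
# Toy chain: prune first, then quantize the surviving weights.

def prune(ws, sparsity=0.5):
    """Zero out the smallest-magnitude half of the weights."""
    n = int(len(ws) * sparsity)
    drop = set(sorted(range(len(ws)), key=lambda i: abs(ws[i]))[:n])
    return [0.0 if i in drop else w for i, w in enumerate(ws)]

def quantize_int8(ws):
    """Symmetric INT8 quantization; returns integer codes and the scale."""
    amax = max(abs(w) for w in ws) or 1.0
    scale = amax / 127
    return [round(w / scale) for w in ws], scale

weights = [0.9, -0.05, 0.4, 0.01]
q, scale = quantize_int8(prune(weights))  # quantize operates on the pruned model
```

Ordering matters: pruning before quantization lets the quantizer calibrate its scale on the weights that actually survive.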

Model Hub

  • Model Hub provides an example of deploying and running the quantized Llama 3.1 8B Instruct model from NVIDIA's Hugging Face model hub on both TensorRT-LLM and vLLM.

Windows

  • Windows contains examples for Model Optimizer on Windows.