Refactor llm_qat example with YAML configs and ModelOpt argument parser
Replace shell-based launch script with YAML config files and integrate
ModelOpt's HfArgumentParser plugin for cleaner argument handling. Add
auto-generated ARGUMENTS.md, update README with new usage patterns, and
add unit tests for the argument parser plugin.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
## QuantizationArguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
|`--quant_cfg`|`str`|`None`| Specify the quantization format for PTQ/QAT. If specified, PTQ/QAT will be enabled with the specified quantization format. |
|`--calib_size`|`int`|`512`| Specify the calibration size for quantization. The calibration dataset is used to set up the quantization scale parameters for PTQ/QAT. |
|`--compress`|`bool`|`False`| Whether to compress the model weights after quantization for QLoRA. This is useful for reducing the model size. |
## DataArguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
|`--dataset`|`str`|`"Daring-Anteater"`| Specify the dataset. |
|`--train_size`|`int`|`0`| Number of training samples to use. If `0`, use default training size. |
|`--eval_size`|`int`|`0`| Number of evaluation samples to use. If `0`, use default evaluation size. |
|`--teacher_model`|`str`|`None`| The name or path of the teacher model to use for distillation. |
## TrainingArguments

Extends [HuggingFace TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments). Only additional/overridden arguments are shown below.

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
|`--cache_dir`|`str`|`None`||
|`--model_max_length`|`int`|`4096`| Maximum sequence length. Sequences will be right padded (and possibly truncated). |
|`--lora`|`bool`|`False`| Whether to add a LoRA (Low-Rank Adaptation) adapter before training. When using real quantization, the LoRA adapter must be set, as quantized weights will be frozen during training. |
|`--distill`|`bool`|`False`| Whether to train with distillation. |

See [ARGUMENTS.md](ARGUMENTS.md) for the full argument reference.
#### QAT Example Workflow
In QAT, a model quantized using [mtq.quantize()](https://nvidia.github.io/Model-Optimizer/reference/generated/modelopt.torch.quantization.model_quant.html#modelopt.torch.quantization.model_quant.quantize) can be directly fine-tuned with the original training pipeline. During QAT, the scaling factors inside quantizers are frozen and the model weights are fine-tuned.
To train larger models with distributed training, please refer to [End-to-end QAT Example](#end-to-end-qat-example).
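For illustration, here is a minimal sketch of this flow (the quantization config, calibration dataloader, and trainer object below are assumptions, not code from this example):

```python
import modelopt.torch.quantization as mtq

# Calibration forward loop used by mtq.quantize() to collect activation statistics.
def forward_loop(model):
    for batch in calib_dataloader:  # assumed calibration dataloader
        model(**batch)

# Quantize in place with a chosen format (FP8 shown only as an example), then
# fine-tune with the original training pipeline; the quantizer scales stay frozen.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
trainer.train()  # assumed pre-configured HuggingFace Trainer
```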
#### QATTrainer Example Workflow

`QATTrainer` is a drop-in replacement for HuggingFace's `Trainer` that handles quantization internally, so there is no need to manually call `mtq.quantize()`. Quantization is configured via `quant_args`.
```python
from modelopt.torch.quantization.plugins.transformers_trainer import QATTrainer, QuantizationArguments

...

# [Not shown] load model, tokenizer, data loaders etc

# Sketch of the remaining steps (field and variable names below are assumptions; see ARGUMENTS.md):
quant_args = QuantizationArguments(quant_cfg="NVFP4_DEFAULT_CFG")
trainer = QATTrainer(model=model, args=training_args, train_dataset=train_dataset, quant_args=quant_args)
trainer.train()
```
This will generate a fine-tuned checkpoint in `output_dir` specified above. You can load this checkpoint, quantize the model, evaluate PTQ results or run additional QAT.
This can be accomplished by specifying the quantization format.
In this example, we are quantizing the model with INT4 block-wise weights and INT8 per-tensor activation quantization.

To perform PTQ evaluation, run:

```sh
# Load the checkpoint from previous fine-tuning stage, quantize the model and evaluate without additional training
```
You may alternatively perform QAT with any other quantization formats from **ModelOpt**. Please see more details on the supported quantization formats and how to use them as shown below:
```python
import modelopt.torch.quantization as mtq

help(mtq.config)
```
You could also add your own customized quantization format to `CUSTOM_QUANT_CFG` from `main.py` and perform QAT.
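For example, a custom format entry could look like the sketch below (the entry name, module patterns, and values are illustrative assumptions; the actual `CUSTOM_QUANT_CFG` structure lives in `main.py`):

```python
# Hypothetical custom entry: INT4 block-wise weights with INT8 per-tensor activations.
CUSTOM_QUANT_CFG = {
    "int4_weight_int8_activation": {
        "quant_cfg": {
            "*weight_quantizer": {"num_bits": 4, "block_sizes": {-1: 128}, "enable": True},
            "*input_quantizer": {"num_bits": 8, "axis": None, "enable": True},
            "*lm_head*": {"enable": False},  # keep the output layer unquantized
            "default": {"enable": False},
        },
        "algorithm": "max",
    }
}
```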
> **_NOTE:_** QAT requires more memory than full-precision fine-tuning. A solution to avoid this extra memory usage is to use [activation checkpointing](https://pytorch.org/docs/stable/checkpoint.html) or gradient checkpointing. Activation checkpointing can be enabled easily with training frameworks such as Hugging Face by adding the additional argument `gradient_checkpointing True`. Learn more [here](https://huggingface.co/docs/transformers/v4.20.1/en/perf_train_gpu_one#gradient-checkpointing). Activation (gradient) checkpointing is enabled by default in this example.
> **_NOTE:_** Like any other model training, the QAT model accuracy can be further improved by optimizing the training hyper-parameters such as learning rate, training duration, etc.
> **_NOTE:_** This example defaults to `LlamaDecoderLayer` as the transformer layer class for FSDP wrapping. If your model uses a different class, pass `--fsdp_transformer_layer_cls_to_wrap <your_layer_class>` as an additional argument. For example, for `Qwen/Qwen3-8B`, specify `--fsdp_transformer_layer_cls_to_wrap Qwen3DecoderLayer`.
### Results
Here is an example result following the workflow above with slightly different hyper-parameters (we used an effective batch size of 128 by adjusting `--per_device_train_batch_size` and `--gradient_accumulation_steps` as per the available GPU memory).
As we can see below, QAT has improved the validation perplexity.
You could get slightly different numbers depending on your hyper-parameters; however, you should be able to see consistent improvement.

> **_NOTE:_** QAD doesn't support the FSDP1 (<https://docs.pytorch.org/docs/stable/fsdp.html>) backend, only FSDP2.
See more details on deployment of quantized model [here](../llm_ptq/README.md).
## End-to-end QLoRA with Real Quantization
[QLoRA](https://arxiv.org/pdf/2305.14314) is a technique mainly intended for further reducing the training memory requirement of LoRA. In QLoRA, the LoRA backbone weights are quantized to reduce the model footprint. Unlike QAT, which uses simulated quantization, QLoRA requires real quantization. To compress the model weights after quantization, we use the `mtq.compress()` function, which currently supports FP8, FP4, and INT4 formats. This feature can be enabled by passing `--compress True`. For detailed configuration options and patterns, please refer to the `modelopt.torch.quantization.compress` documentation.
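As a rough sketch of what `--compress True` does under the hood (the config choice and calibration loop below are assumptions):

```python
import modelopt.torch.quantization as mtq

# Real-quantize the backbone weights, then compress them so the quantized
# representation (e.g. FP4/INT4) is actually stored, shrinking the memory footprint.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)  # assumed calibration forward loop
mtq.compress(model)
# LoRA adapters are then trained on top of the frozen, compressed backbone.
```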
To evaluate the QLoRA quantized model before training, run:
```sh
# Load the HF checkpoint, quantize the model and evaluate without additional training
```