Merged
3 changes: 2 additions & 1 deletion examples/vllm_serve/README.md
@@ -28,6 +28,7 @@ You can either edit the `quant_config` dictionary in `vllm_serve_fakequant.py`,
| QUANT_FILE_PATH | Optional path to exported quantizer state dict `quantizer_state.pth` | None |
| MODELOPT_STATE_PATH | Optional path to exported `vllm_fq_modelopt_state.pth` (restores quantizer state and parameters) | None |
| CALIB_BATCH_SIZE | Calibration batch size | 1 |
| RECIPE_PATH | Optional path to a ModelOpt PTQ recipe YAML | None |

Set these variables in your shell or Docker environment as needed to customize calibration.
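As a hypothetical illustration (the paths and values below are placeholders, not defaults shipped with the example), the variables in the table above could be set like this before launching the server:

```shell
# Placeholder values -- adjust to your model and recipe before launching.
export QUANT_CFG="NVFP4_DEFAULT_CFG"                     # name of an mtq config attribute
export RECIPE_PATH="/workspace/recipes/ptq_nvfp4.yaml"   # optional ModelOpt PTQ recipe
export CALIB_BATCH_SIZE=4                                # calibration batch size (default: 1)

# Then launch as usual, e.g.:
# python vllm_serve_fakequant.py <args...>
```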

@@ -65,7 +66,7 @@ Step 1: export the model with bf16 weights and quantizer state. To export the mo
```bash
python ../llm_ptq/hf_ptq.py \
--pyt_ckpt_path <MODEL_PATH> \
--qformat nvfp4 \
--recipe <PATH_TO_RECIPE> \
--calib_size 512 \
--export_path <EXPORT_DIR> \
--vllm_fakequant_export \
2 changes: 2 additions & 0 deletions examples/vllm_serve/fakequant_worker.py
@@ -43,6 +43,7 @@
"quant_file_path": os.environ.get("QUANT_FILE_PATH", None),
"modelopt_state_path": os.environ.get("MODELOPT_STATE_PATH", None),
"calib_batch_size": int(os.environ.get("CALIB_BATCH_SIZE", 1)),
"recipe_path": os.environ.get("RECIPE_PATH", None),
}
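For reference, the env-var-with-default pattern used above can be exercised standalone. A minimal sketch, assuming nothing beyond the standard library; `load_calib_config` is a hypothetical helper for illustration, not part of the worker:

```python
import os

def load_calib_config(env=None):
    """Read calibration settings, falling back to defaults when a variable is unset."""
    env = os.environ if env is None else env
    return {
        "recipe_path": env.get("RECIPE_PATH", None),               # str path or None
        "modelopt_state_path": env.get("MODELOPT_STATE_PATH", None),
        "calib_batch_size": int(env.get("CALIB_BATCH_SIZE", 1)),   # cast, as in the diff
    }

cfg = load_calib_config({"CALIB_BATCH_SIZE": "4"})
```

Note the explicit `int(...)` cast: environment variables are always strings, so numeric settings must be converted before use.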


@@ -138,6 +139,7 @@ def compile_or_warm_up_model(self) -> None:
quant_config["quant_cfg"]
or quant_config["kv_quant_cfg"]
or quant_config["modelopt_state_path"]
or quant_config["recipe_path"]
):
_fakequant_run_prolog_worker(self)
super().compile_or_warm_up_model()
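The gating condition above can be checked in isolation. A sketch for illustration only; `should_run_prolog` is a hypothetical helper mirroring the or-chain, not a function in the worker:

```python
def should_run_prolog(quant_config):
    """Any configured source of quantizer state triggers the fakequant prolog."""
    return bool(
        quant_config.get("quant_cfg")
        or quant_config.get("kv_quant_cfg")
        or quant_config.get("modelopt_state_path")
        or quant_config.get("recipe_path")
    )
```

With this change, setting only `RECIPE_PATH` is enough to enter calibration, without also setting `QUANT_CFG`.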
40 changes: 25 additions & 15 deletions examples/vllm_serve/vllm_ptq_utils.py
@@ -24,6 +24,7 @@
from vllm.v1.core.sched.output import CachedRequestData, NewRequestData, SchedulerOutput

import modelopt.torch.quantization as mtq
from modelopt.recipe import ModelOptPTQRecipe, load_recipe


def _create_new_data_cls(data_cls, **kwargs):
@@ -141,22 +142,31 @@ def update_kv_cfg_for_mla(model: torch.nn.Module, kv_quant_cfg: list) -> list:
def get_quant_config(quant_config: dict[str, Any], model: Any) -> dict[str, Any]:
import copy

quant_cfg = (
copy.deepcopy(getattr(mtq, quant_config["quant_cfg"])) if quant_config["quant_cfg"] else {}
)
quant_kv_cfg = (
copy.deepcopy(getattr(mtq, quant_config["kv_quant_cfg"]))
if quant_config["kv_quant_cfg"]
else {}
)
if quant_config["recipe_path"]:
recipe = load_recipe(quant_config["recipe_path"])
assert isinstance(recipe, ModelOptPTQRecipe), (
f"Expected PTQ recipe, but got {type(recipe).__name__} from {quant_config['recipe_path']}"
)
quant_cfg = recipe.quantize
Comment on lines +145 to +150
⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Verify current assertion usage in this path.
rg -n -C2 'assert isinstance\(recipe, ModelOptPTQRecipe\)' --type=py

# Demonstrate Python optimization removing assert checks.
python - <<'PY'
src = """def f(x):
    assert isinstance(x, int), "bad type"
    return x
"""
ns = {}
exec(compile(src, "<inline>", "exec", optimize=1), ns)
import dis
dis.dis(ns["f"])
PY
```

Repository: NVIDIA/Model-Optimizer


Use explicit exception instead of assert for runtime validation.

At line 147, using assert isinstance() is unsafe because assertions can be disabled when Python runs with optimization flags (e.g., python -O), allowing invalid recipe types to bypass this check silently. Use an explicit if/raise ValueError() pattern instead.

Proposed fix

```diff
     if quant_config["recipe_path"]:
         recipe = load_recipe(quant_config["recipe_path"])
-        assert isinstance(recipe, ModelOptPTQRecipe), (
-            f"Expected PTQ recipe, but got {type(recipe).__name__} from {quant_config['recipe_path']}"
-        )
+        if not isinstance(recipe, ModelOptPTQRecipe):
+            raise ValueError(
+                f"Expected PTQ recipe, but got {type(recipe).__name__} from {quant_config['recipe_path']}"
+            )
         quant_cfg = recipe.quantize
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
     if quant_config["recipe_path"]:
         recipe = load_recipe(quant_config["recipe_path"])
-        assert isinstance(recipe, ModelOptPTQRecipe), (
-            f"Expected PTQ recipe, but got {type(recipe).__name__} from {quant_config['recipe_path']}"
-        )
+        if not isinstance(recipe, ModelOptPTQRecipe):
+            raise ValueError(
+                f"Expected PTQ recipe, but got {type(recipe).__name__} from {quant_config['recipe_path']}"
+            )
         quant_cfg = recipe.quantize
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/vllm_serve/vllm_ptq_utils.py` around lines 145-150, replace the
unsafe assert in the recipe validation with an explicit runtime check: after
calling load_recipe(quant_config["recipe_path"]) and assigning to recipe, verify
type with if not isinstance(recipe, ModelOptPTQRecipe): raise a ValueError
containing the same descriptive message (including the actual type and recipe
path) and then set quant_cfg = recipe.quantize; this ensures load_recipe,
ModelOptPTQRecipe, quant_cfg and recipe behavior remains the same but prevents
the check from being skipped under Python optimizations.

else:
quant_cfg = (
copy.deepcopy(getattr(mtq, quant_config["quant_cfg"]))
if quant_config["quant_cfg"]
else {}
)
quant_kv_cfg = (
copy.deepcopy(getattr(mtq, quant_config["kv_quant_cfg"]))
if quant_config["kv_quant_cfg"]
else {}
)

# Check if model has MLA and update KV config accordingly
if quant_kv_cfg:
quant_kv_cfg["quant_cfg"] = update_kv_cfg_for_mla(model, quant_kv_cfg["quant_cfg"])

if quant_kv_cfg:
quant_cfg = mtq.utils.update_quant_cfg_with_kv_cache_quant(
quant_cfg, quant_kv_cfg["quant_cfg"]
)

return quant_cfg
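The review's recommendation can be demonstrated in isolation. A minimal, self-contained sketch: `FakeRecipe`/`FakePTQRecipe` are stand-ins for the real recipe classes, and `resolve_quant_cfg` is a hypothetical helper, not ModelOpt API:

```python
class FakeRecipe:
    """Stand-in base class for a loaded recipe."""

class FakePTQRecipe(FakeRecipe):
    """Stand-in PTQ recipe exposing a `quantize` config, as in the diff."""
    quantize = {"quant_cfg": {"*weight_quantizer": {"num_bits": 4}}}

def resolve_quant_cfg(recipe, recipe_path):
    # Explicit check survives `python -O`, unlike `assert isinstance(...)`.
    if not isinstance(recipe, FakePTQRecipe):
        raise ValueError(
            f"Expected PTQ recipe, but got {type(recipe).__name__} from {recipe_path}"
        )
    return recipe.quantize
```

Because `assert` statements are compiled out under `-O`, the `if`/`raise` form is the only one that reliably rejects a non-PTQ recipe in production.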
1 change: 1 addition & 0 deletions examples/vllm_serve/vllm_serve_fakequant.py
@@ -78,6 +78,7 @@
"KV_QUANT_CFG",
"MODELOPT_STATE_PATH",
"CALIB_BATCH_SIZE",
"RECIPE_PATH",
"TRUST_REMOTE_CODE",
}
