Add ONNX export, INT8/FP8 quantization, and SafeTensors inference loading (#14)
* feat(onnx): add QUANT_FORMAT int8/fp8 PTQ via modelopt.onnx
- Add _collect_calibration_dets module-level helper that samples
detector inputs from the inference dataloader for ONNX calibration
- Parse QUANT_FORMAT env var (int8, fp8) in OnnxWorkflow export path;
invalid values are ignored with a warning
- Two-step export: always write FP32 ONNX first, then optionally apply
modelopt.onnx.quantization.quantize() for the requested format
- fp8 is fail-fast on error; int8 silently falls back to FP32 ONNX
- Add QUANT_CALIB_SAMPLES env var (default 256) to control calibration
sample count
- Add test_onnx_quant_workflow.py: 13 CPU-only unit tests covering
the calibration helper and QUANT_FORMAT routing logic
* fix(onnx): re-derive engine_path from final onnx_path after quant fallback
* review: fix run.py, temp file cleanup, YAPF, README ONNX section
- run.py: remove emoji from print statements (style inconsistency)
- run.py: remove no-op torch.compile(disable=True) calls
- run.py: extract _resolve_dir() helper to replace 4 copies of the
current_file/project_root path resolution pattern
- run.py: replace bare torch.load/load_state_dict with _load_state_dict_from_pt()
which handles model_state_dict/state_dict/bare-dict formats and strips
the DDP "module." prefix — consistent with checkpoint_to_safetensors.py
- tests: add addCleanup(os.unlink) for all NamedTemporaryFile paths
- YAPF: reformat logical_error_rate.py and test_onnx_quant_workflow.py
- README: add ONNX export and quantization section documenting
ONNX_WORKFLOW modes, QUANT_FORMAT, QUANT_CALIB_SAMPLES
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
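The checkpoint handling described for _load_state_dict_from_pt() above can be sketched as follows. This is a minimal illustration, not the code in run.py: the function name unwrap_state_dict is hypothetical, it works on an already-loaded dict (the torch.load call is left out so the sketch has no dependencies), and the order in which the wrapper keys are tried is an assumption.

```python
from typing import Any, Dict


def unwrap_state_dict(checkpoint: Dict[str, Any]) -> Dict[str, Any]:
    """Return the bare parameter mapping from a loaded checkpoint object."""
    # Wrapped formats: try "model_state_dict" first, then "state_dict"
    # (assumed order). A bare dict falls through unchanged.
    for key in ("model_state_dict", "state_dict"):
        if key in checkpoint and isinstance(checkpoint[key], dict):
            checkpoint = checkpoint[key]
            break
    # DDP saves parameters as "module.<name>"; strip only a leading prefix.
    prefix = "module."
    return {
        (name[len(prefix):] if name.startswith(prefix) else name): value
        for name, value in checkpoint.items()
    }
```

Stripping only the leading prefix (rather than calling str.replace) leaves parameter names that merely contain "module." elsewhere untouched.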
* quantize only CNN layers
* fix(ci): YAPF, move nvidia-modelopt to train reqs, add prerequisite tests
- YAPF: reformat 3 long lines in logical_error_rate.py introduced by the
"quantize only CNN layers" commit (d7b8217)
- Move nvidia-modelopt[onnx] from requirements_public_inference.txt to
requirements_public_train.txt; it is only needed for ONNX PTQ export
(QUANT_FORMAT env var), not for pure inference, and has no Python 3.13
build — keeping it in inference reqs broke unit-tests/py3.13 in CI
- Add python_version<"3.13" marker so the CI train matrix installs it on
supported Python versions without failing on 3.13
- Add TestModeloptPrerequisite in test_onnx_quant_workflow.py:
- asserts nvidia-modelopt is declared in requirements_public_train.txt
- asserts it is absent from requirements_public_inference.txt
- conditionally checks the import is resolvable when the package is
present (skipped on Python 3.13+ and when not installed)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(onnx): add onnxruntime INT8 fallback for Python 3.13+
nvidia-modelopt does not support Python 3.13+. Add a conditional backend
dispatch so QUANT_FORMAT=int8 works on all supported Python versions:
- Add _ort_quantize_int8() module-level helper that uses
onnxruntime.quantization.quantize_static() with QDQ/QInt8 format and
a CalibrationDataReader wrapping the pre-collected calib_dets array
- In the quantization block, branch on sys.version_info >= (3, 13):
- Python 3.13+: call _ort_quantize_int8(); raise immediately for FP8
(no viable 3.13-compatible FP8 PTQ library available)
- Python <3.13: keep existing modelopt path unchanged
- Add onnxruntime (python_version >= "3.13") to requirements_public_train.txt
- Expand TestOrtQuantizeInt8 tests:
- round-trip test (build tiny Gemm ONNX, quantize, validate) on 3.13+
- dispatch test verifying _ort_quantize_int8 is called on 3.13+
- FP8-on-3.13 raises RuntimeError
- Expand TestModeloptPrerequisite: assert onnxruntime appears in train
requirements and both quant packages are absent from inference requirements
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(prereqs): document tensorrt as optional GPU dep, add fallback tests
tensorrt is a heavy CUDA-only SDK (~500 MB) that cannot be pip-installed
in CPU-only CI, so it is not added as an active pip requirement.
Instead:
- Add a comment block in requirements_public_inference.txt documenting
tensorrt as an optional prerequisite for ONNX_WORKFLOW=2/3 paths,
with the install command and a note about graceful fallback
- Add test_tensorrt_fallback.py with three test classes:
- TestTensorrtDocumented: asserts the requirements comment exists and
tensorrt is NOT an active pip requirement
- TestTensorrtFallback: verifies both TRT import sites (USE_ENGINE_ONLY
and EXPORT_AND_USE_TRT) set trt_context=None on ImportError and do
not propagate the exception to the caller
- TestTensorrtImportable: checks key TRT symbols (Logger, Runtime,
Builder, BuilderFlag, LayerInformationFormat) when tensorrt is
installed; skipped silently on CPU-only environments
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(onnx): use import-based dispatch for modelopt/ort; install modelopt on py3.13
nvidia-modelopt works on Python 3.13 when installed with
--ignore-requires-python (confirmed by modelopt maintainers).
- logical_error_rate.py: replace sys.version_info dispatch with an
ImportError-based dispatch — try modelopt first (INT8+FP8), fall back
to _ort_quantize_int8 only when modelopt is not importable; FP8 raises
RuntimeError with the --ignore-requires-python install hint
- check_python_compat.sh: after the main requirements install, re-install
nvidia-modelopt[onnx] with --ignore-requires-python when MODE=train and
Python >= 3.13, so GPU CI on 3.13 uses the full modelopt path
- requirements_public_train.txt: add comment documenting the 3.13 install
approach for manual setups
- test_onnx_quant_workflow.py:
- remove py3.13-specific skip from test_ort_quantize_int8_produces_output_file
(now skips when onnxruntime is not installed, regardless of version)
- replace test_ort_quantize_int8_dispatch_on_py313 with
test_ort_quantize_int8_called_on_modelopt_import_error
- replace test_fp8_raises_on_py313 with test_fp8_raises_on_modelopt_import_error
- remove py3.13 version guard from test_modelopt_importable_when_installed
- remove py3.13 version guard from test_ort_importable_when_installed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
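A rough sketch of the ImportError-based dispatch described above, assuming the decision can be reduced to a backend name. select_quant_backend and its import_module parameter are hypothetical (the parameter exists only so the logic can be exercised without either package installed); the real block in logical_error_rate.py calls the quantizers directly.

```python
import importlib
from typing import Callable


def select_quant_backend(
        quant_format: str,
        import_module: Callable[[str], object] = importlib.import_module) -> str:
    """Return "modelopt" or "onnxruntime" for the requested format."""
    try:
        # modelopt is preferred because it covers both int8 and fp8.
        import_module("modelopt.onnx.quantization")
        return "modelopt"
    except ImportError as exc:
        if quant_format == "fp8":
            # No FP8 fallback exists, so fail fast with an install hint.
            raise RuntimeError(
                "QUANT_FORMAT=fp8 requires nvidia-modelopt; on Python 3.13 "
                "install it with pip's --ignore-requires-python option"
            ) from exc
        # int8 can still be produced by the onnxruntime fallback.
        return "onnxruntime"
```

Dispatching on importability rather than sys.version_info means a 3.13 environment with modelopt installed takes the full modelopt path.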
* fix(test): pin ONNX IR version 8 in ort quantize test
modelopt[onnx] pulls in onnxruntime-gpu~=1.22.0 as a dependency on all
Python versions. Newer ONNX packages (1.19+) default model.ir_version
to 12, but onnxruntime-gpu 1.22.0 only supports up to IR version 10,
causing test_ort_quantize_int8_produces_output_file to fail on the GPU
CI for py3.11, py3.12, and py3.13.
Pin model.ir_version = 8 (the minimum required for opset 17) before
saving the test model so the calibration InferenceSession succeeds with
any onnxruntime version that supports IR ≤ 10.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* test(onnx): add end-to-end mq.quantize() tests for modelopt
Previous coverage only verified that modelopt.onnx.quantization was
importable. Add TestModeloptQuantize with two tests that actually call
mq.quantize() on a real ONNX model:
- test_mq_quantize_int8_produces_valid_onnx: verifies the output file is
created and passes onnx.checker (confirms modelopt works at runtime,
not just at import time — this is the key Python 3.13 regression check)
- test_mq_quantize_int8_output_differs_from_fp32: verifies QDQ nodes were
inserted (output graph has more nodes than the FP32 source)
Both tests share a _build_tiny_model() helper that creates a minimal
Gemm ONNX model with input "dets" and 16 calibration rows, matching the
production calibration_data={"dets": calib_dets} call convention.
model.ir_version is pinned to 8 for onnxruntime-gpu 1.22.0 compatibility.
Tests are skipped when nvidia-modelopt is not installed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(test): use float32 calibration data in TestModeloptQuantize
mq.quantize() runs an internal ONNX inference session to profile
MatMul nodes; feeding uint8 calibration data to a float-input model
caused InvalidArgument. Switch to np.random.randn(...).astype(float32).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(trt): raise RuntimeError when tensorrt missing for ONNX_WORKFLOW=2/3
Previously both TRT import sites caught ImportError inside a broad
`except Exception` block and silently fell back to PyTorch with a
print. This masked misconfiguration: the user explicitly selected
ONNX_WORKFLOW=2 or 3, so a missing tensorrt install is always a hard
error.
Changes:
- USE_ENGINE_ONLY (workflow=3): ImportError now raises RuntimeError
with install hint; other TRT errors (bad engine file) still fall
back gracefully.
- EXPORT_AND_USE_TRT (workflow=2): same split.
- test_tensorrt_fallback.py: replace the old "falls back on ImportError"
tests with "raises RuntimeError on ImportError" tests; add chained
cause check and non-import fallback tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
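The split between a missing tensorrt install and other TRT failures can be sketched as below. This is an illustration under assumptions, not the production code: load_trt_context, build_context, and import_module are hypothetical names, and the messages are placeholders.

```python
import importlib
from typing import Callable, Optional


def load_trt_context(
        build_context: Callable[[object], object],
        import_module: Callable[[str], object] = importlib.import_module
) -> Optional[object]:
    """Return a TRT context, or None when TRT is present but setup fails."""
    try:
        trt = import_module("tensorrt")
    except ImportError as exc:
        # The user explicitly asked for a TRT workflow: hard error,
        # chained to the original ImportError.
        raise RuntimeError(
            "ONNX_WORKFLOW=2/3 requires tensorrt; install it or pick "
            "another workflow") from exc
    try:
        return build_context(trt)
    except Exception as exc:  # e.g. a bad engine file
        print(f"TensorRT setup failed ({exc}); falling back to PyTorch")
        return None
```

Keeping the import in its own try block is what prevents the broad handler from swallowing the ImportError.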
* fix(test): fix ORT calibration for ort quantize test
ORT's MinMaxCalibrater augments the model to expose intermediate
tensors for calibration, but graph *inputs* are not included in the
augmented outputs. When the test model had dets->Gemm directly, ORT
never collected calibration stats for 'dets', causing:
ValueError: Quantization parameters are not specified for param dets.
Fix: insert a Relu node (dets -> Relu -> dets_relu -> Gemm) so the
Gemm input is an intermediate tensor that gets calibrated. Also
switch the calibration array to float32 (consistent with model dtype)
and add rewind() to _DetCalibReader in production code for calibration
methods that make multiple passes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
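For illustration, a calibration reader with a rewind() method might look like the sketch below. It is duck-typed and dependency-free by design: the production _DetCalibReader presumably subclasses onnxruntime's CalibrationDataReader and feeds NumPy float32 arrays, whereas this version uses plain lists, and the class name, constructor arguments, and batch-of-one layout are assumptions.

```python
from typing import Dict, List, Optional, Sequence


class DetCalibReader:
    """Yields one feed dict per calibration sample, then None."""

    def __init__(self,
                 calib_dets: Sequence[Sequence[float]],
                 input_name: str = "dets") -> None:
        self._rows = [list(row) for row in calib_dets]
        self._input_name = input_name
        self._index = 0

    def get_next(self) -> Optional[Dict[str, List[List[float]]]]:
        # None tells the calibrator that the data is exhausted.
        if self._index >= len(self._rows):
            return None
        row = self._rows[self._index]
        self._index += 1
        return {self._input_name: [row]}  # batch containing one sample

    def rewind(self) -> None:
        # Restart from the first sample for calibrators that make
        # more than one pass over the data.
        self._index = 0
```

Without rewind(), a second pass would see an exhausted reader and collect no statistics.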
* fix(test): skip ort quantize output test when modelopt is installed
_ort_quantize_int8 is only invoked when modelopt is absent. When
modelopt IS installed its mq.quantize() call leaves ORT's execution-
provider state dirty (failed TRT EP init), causing the calibration
InferenceSession to run silently without producing stats, which makes
quantize_static raise:
ValueError: Quantization parameters are not specified for param dets.
The test is meaningless in that environment anyway — if modelopt is
present the ort path is never taken. Skip when modelopt is importable.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* review: address PR #14 review comments
- README: ONNX_WORKFLOW=1 runs PyTorch inference after export; it does
not stop once the export finishes (bmhowe23 suggestion)
- LER: cast calib_dets to float32 before passing to mq.quantize();
_collect_calibration_dets returns uint8 but modelopt expects float
(sacpis: bug report on line 1077)
- LER: use Path.with_suffix('.engine') instead of str.replace
(sacpis nit on line 1104)
- LER: add pathlib.Path import
- test: remove spurious @skipUnless from _build_tiny_model helper;
it is not a test method and the decorator has no effect
(sacpis nit on line 299)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
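The Path.with_suffix change above matters whenever ".onnx" appears somewhere other than the final extension. A small sketch with a made-up path (the directory and file names are hypothetical):

```python
from pathlib import Path

# Hypothetical export path whose directory name also contains ".onnx".
onnx_path = "exports/run.onnx.d/model_int8.onnx"

# str.replace rewrites every occurrence, including the directory name.
fragile = onnx_path.replace(".onnx", ".engine")

# with_suffix swaps only the final extension of the last path component.
robust = Path(onnx_path).with_suffix(".engine").as_posix()

print(fragile)
print(robust)
```

Here fragile points into a directory that does not exist, while robust stays next to the ONNX file.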
* refactor: extract _parse_quant_format() helper from LER
Move the QUANT_FORMAT env-var read/validate/warn block into a
module-level helper so the test can call the real production logic
instead of re-implementing it.
- Add _parse_quant_format(rank=0) -> str in logical_error_rate.py
- Replace inline parsing block in run_inference_and_decode with a
single _parse_quant_format(rank=dist.rank) call
- Import _parse_quant_format in test_onnx_quant_workflow.py and
simplify _run_quant_block to delegate to it
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
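A minimal sketch of what a helper like _parse_quant_format() could look like. The name parse_quant_format, the lower-casing of the value, the empty-string return for "no quantization", and warning only on rank 0 are all assumptions made for illustration; only the accepted values (int8, fp8) and the warn-and-ignore behaviour come from the commit descriptions.

```python
import os
import warnings

_VALID_QUANT_FORMATS = ("int8", "fp8")


def parse_quant_format(rank: int = 0) -> str:
    """Return "int8", "fp8", or "" when QUANT_FORMAT is unset or invalid."""
    raw = os.environ.get("QUANT_FORMAT", "").strip().lower()
    if raw and raw not in _VALID_QUANT_FORMATS:
        if rank == 0:  # warn once in a multi-process run (assumed)
            warnings.warn(
                f"Ignoring unsupported QUANT_FORMAT={raw!r}; "
                "exporting FP32 ONNX only")
        return ""
    return raw
```

Because the helper reads the environment itself, a test can set QUANT_FORMAT and call it directly instead of re-implementing the parsing.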
* fix: guard against num_obs < 1 in _collect_calibration_dets
Because -0 == 0, the slice [:, :-0] is equivalent to [:, :0] and
silently returns an empty tensor rather than the full row. Add an
explicit check so the caller gets a clear ValueError instead of a
confusing width-mismatch error downstream.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
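The slicing pitfall is not specific to tensors; plain Python lists show the same behaviour. The sketch below uses a hypothetical helper, drop_trailing_obs, to illustrate the kind of guard described above (the real check lives in _collect_calibration_dets and operates on tensors).

```python
from typing import List, Sequence


def drop_trailing_obs(rows: Sequence[Sequence[int]],
                      num_obs: int) -> List[List[int]]:
    """Drop the last num_obs columns from every row."""
    if num_obs < 1:
        # Without this guard, row[:-0] would return [] for every row.
        raise ValueError(f"num_obs must be >= 1, got {num_obs}")
    return [list(row[:-num_obs]) for row in rows]


row = [1, 0, 1, 1]
empty = row[:-0]  # -0 == 0, so this is row[:0]
trimmed = drop_trailing_obs([row], num_obs=1)
```

After running this, empty is an empty list while trimmed keeps the first three columns of the row.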
* docs: fix tensorrt comment — missing TRT now raises RuntimeError
The comment said "Absent at runtime causes graceful fallback to the
PyTorch path", but since the TRT ImportError fix (ae0f3b1) both
ONNX_WORKFLOW=2 and =3 raise RuntimeError instead of falling back.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Igor Baratta <ialmeidabara@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>