Add ONNX export, INT8/FP8 quantization, and SafeTensors inference loading by ivanbasov · Pull Request #14 · NVIDIA/Ising-Decoding

ivanbasov · 2026-03-11T15:29:19Z

Summary

Adds ONNX export of the pre-decoder inference pipeline with optional INT8 and FP8 post-training quantization via ModelOpt/TensorRT. Controlled by ONNX_WORKFLOW (0=torch only, 1=export ONNX, 2=export+TRT, 3=engine only) and QUANT_FORMAT (int8 / fp8) env vars.
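A minimal sketch of how the ONNX_WORKFLOW value might be mapped to a mode. The member names TORCH_ONLY and EXPORT_ONNX, the helper read_onnx_workflow, and the default of 0 when the variable is unset are assumptions for illustration; EXPORT_AND_USE_TRT and USE_ENGINE_ONLY follow names used in this PR's commit messages.

```python
import os
from enum import IntEnum


class OnnxWorkflow(IntEnum):
    # Integer values and meanings come from the PR summary;
    # the first two member names are illustrative.
    TORCH_ONLY = 0          # PyTorch only, no export
    EXPORT_ONNX = 1         # export ONNX
    EXPORT_AND_USE_TRT = 2  # export ONNX and build/use a TensorRT engine
    USE_ENGINE_ONLY = 3     # load an existing engine only


def read_onnx_workflow(env=None):
    """Parse ONNX_WORKFLOW from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    raw = env.get("ONNX_WORKFLOW", "0")  # assumed default: torch only
    try:
        return OnnxWorkflow(int(raw))
    except ValueError as exc:
        raise ValueError(
            f"ONNX_WORKFLOW must be 0, 1, 2 or 3, got {raw!r}") from exc
```

Passing the environment as a plain mapping keeps the parser testable without mutating os.environ.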
Adds _collect_calibration_dets in evaluation/logical_error_rate.py to extract representative detector inputs from the test dataloader for PTQ calibration.
Adds code/export/ module: safetensors_utils.py (fp16/fp32 save/load) and checkpoint_to_safetensors.py (CLI to convert .pt → .safetensors).
Adds PREDECODER_SAFETENSORS_CHECKPOINT env var in workflows/run.py to load a model directly from a .safetensors file at inference time; model_id and dtype are read from file metadata.
Updates local_run.sh and README.md with SafeTensors and ONNX/quantization usage instructions.
Adds unit tests: test_safetensors_export.py (round-trip fp32/fp16, metadata auto-detect, error cases) and test_onnx_quant_workflow.py (calibration data collection, QUANT_FORMAT parsing, quantize routing).

Ported from internal MR !38.

- Add _collect_calibration_dets module-level helper that samples detector inputs from the inference dataloader for ONNX calibration
- Parse QUANT_FORMAT env var (int8, fp8) in OnnxWorkflow export path; invalid values are ignored with a warning
- Two-step export: always write FP32 ONNX first, then optionally apply modelopt.onnx.quantization.quantize() for the requested format
- fp8 is fail-fast on error; int8 silently falls back to FP32 ONNX
- Add QUANT_CALIB_SAMPLES env var (default 256) to control calibration sample count
- Add test_onnx_quant_workflow.py: 13 CPU-only unit tests covering the calibration helper and QUANT_FORMAT routing logic
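The two-step export and its error policy can be sketched as follows. This is not the code in logical_error_rate.py: the function name choose_onnx_after_quant and the injected quantize callable are hypothetical, and only the policy (FP8 fails fast, INT8 falls back to the FP32 file) is taken from the commit message.

```python
def choose_onnx_after_quant(fp32_path, quant_format, quantize):
    """Return the ONNX path to use after the optional quantization step.

    `quantize(fp32_path, quant_format)` is a hypothetical callable that
    writes a quantized model and returns its path.
    """
    if not quant_format:
        return fp32_path  # no quantization requested
    try:
        return quantize(fp32_path, quant_format)
    except Exception as exc:
        if quant_format == "fp8":
            # FP8 is fail-fast: surface the error to the caller.
            raise RuntimeError("FP8 quantization failed") from exc
        # INT8 falls back to the FP32 model that was already written.
        return fp32_path
```

Because the FP32 file is always written first, the INT8 fallback needs no extra export work.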

…lback

- run.py: remove emoji from print statements (style inconsistency)
- run.py: remove no-op torch.compile(disable=True) calls
- run.py: extract _resolve_dir() helper to replace 4 copies of the current_file/project_root path resolution pattern
- run.py: replace bare torch.load/load_state_dict with _load_state_dict_from_pt(), which handles model_state_dict/state_dict/bare-dict formats and strips the DDP "module." prefix, consistent with checkpoint_to_safetensors.py
- tests: add addCleanup(os.unlink) for all NamedTemporaryFile paths
- YAPF: reformat logical_error_rate.py and test_onnx_quant_workflow.py
- README: add ONNX export and quantization section documenting ONNX_WORKFLOW modes, QUANT_FORMAT, QUANT_CALIB_SAMPLES

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
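The checkpoint normalization described for _load_state_dict_from_pt() could look roughly like this. The sketch works on an already-loaded dictionary and omits the torch.load call; the function name normalize_state_dict and the lookup order of the two wrapper keys are assumptions.

```python
def normalize_state_dict(checkpoint):
    """Unwrap a loaded checkpoint and strip the DDP "module." prefix.

    Accepts {"model_state_dict": ...}, {"state_dict": ...}, or a bare
    name -> tensor mapping. The key precedence is an assumption.
    """
    state = checkpoint
    for wrapper_key in ("model_state_dict", "state_dict"):
        if wrapper_key in state:
            state = state[wrapper_key]
            break
    prefix = "module."
    return {
        (name[len(prefix):] if name.startswith(prefix) else name): value
        for name, value in state.items()
    }
```

Stripping the prefix per key lets checkpoints saved from a DistributedDataParallel wrapper load into an unwrapped model.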

…edecoder into igor/onnx_quantization # Conflicts: # code/evaluation/logical_error_rate.py

…ests

- YAPF: reformat 3 long lines in logical_error_rate.py introduced by the "quantize only CNN layers" commit (d7b8217)
- Move nvidia-modelopt[onnx] from requirements_public_inference.txt to requirements_public_train.txt; it is only needed for ONNX PTQ export (QUANT_FORMAT env var), not for pure inference, and has no Python 3.13 build, so keeping it in inference reqs broke unit-tests/py3.13 in CI
- Add python_version<"3.13" marker so the CI train matrix installs it on supported Python versions without failing on 3.13
- Add TestModeloptPrerequisite in test_onnx_quant_workflow.py:
  - asserts nvidia-modelopt is declared in requirements_public_train.txt
  - asserts it is absent from requirements_public_inference.txt
  - conditionally checks the import is resolvable when the package is present (skipped on Python 3.13+ and when not installed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

nvidia-modelopt does not support Python 3.13+. Add a conditional backend dispatch so QUANT_FORMAT=int8 works on all supported Python versions:

- Add _ort_quantize_int8() module-level helper that uses onnxruntime.quantization.quantize_static() with QDQ/QInt8 format and a CalibrationDataReader wrapping the pre-collected calib_dets array
- In the quantization block, branch on sys.version_info >= (3, 13):
  - Python 3.13+: call _ort_quantize_int8(); raise immediately for FP8 (no viable 3.13-compatible FP8 PTQ library available)
  - Python <3.13: keep existing modelopt path unchanged
- Add onnxruntime (python_version >= "3.13") to requirements_public_train.txt
- Expand TestOrtQuantizeInt8 tests:
  - round-trip test (build tiny Gemm ONNX, quantize, validate) on 3.13+
  - dispatch test verifying _ort_quantize_int8 is called on 3.13+
  - FP8-on-3.13 raises RuntimeError
- Expand TestModeloptPrerequisite: assert onnxruntime appears in train requirements and both quant packages are absent from inference requirements

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ivanbasov · 2026-03-16T18:25:27Z

@bmhowe23 and @IgorBaratta, please note that nvidia-modelopt, which is used for quantization into FP, is available only for Python < 3.13, so I have removed its support on Python 3.13. I added INT8 quantization for Python 3.13 via onnxruntime, but I am not sure we need it. Please let me know whether we should remove quantization on Python 3.13 altogether or whether you see other solutions.

tensorrt is a heavy CUDA-only SDK (~500 MB) that cannot be pip-installed in CPU-only CI, so it is not added as an active pip requirement. Instead:

- Add a comment block in requirements_public_inference.txt documenting tensorrt as an optional prerequisite for ONNX_WORKFLOW=2/3 paths, with the install command and a note about graceful fallback
- Add test_tensorrt_fallback.py with three test classes:
  - TestTensorrtDocumented: asserts the requirements comment exists and tensorrt is NOT an active pip requirement
  - TestTensorrtFallback: verifies both TRT import sites (USE_ENGINE_ONLY and EXPORT_AND_USE_TRT) set trt_context=None on ImportError and do not propagate the exception to the caller
  - TestTensorrtImportable: checks key TRT symbols (Logger, Runtime, Builder, BuilderFlag, LayerInformationFormat) when tensorrt is installed; skipped silently on CPU-only environments

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
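A test that distinguishes a documented package from an active requirement has to ignore comments. The helper below is a hypothetical sketch of that check, not the code in test_tensorrt_fallback.py; the sample file contents are invented for illustration.

```python
import re


def active_requirements(text):
    """Return lowercase package names that pip would actually install.

    Comment lines and trailing comments are ignored, so a package that
    is only mentioned in a comment does not count.
    """
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.add(match.group(0).lower())
    return names


# Invented sample resembling a requirements file with a documentation comment.
sample = (
    "# tensorrt is optional: needed only for ONNX_WORKFLOW=2/3\n"
    'nvidia-modelopt[onnx]; python_version < "3.13"\n'
    "numpy>=1.24  # illustrative pin\n"
)
assert "tensorrt" not in active_requirements(sample)
assert "nvidia-modelopt" in active_requirements(sample)
```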

bmhowe23 · 2026-03-16T19:05:38Z

I posted this question to the nvidia-modelopt team. Let's see what they say: NVIDIA/Model-Optimizer#217 (comment)

(In that thread, they claim it should work for 3.13, but it does not.)

…pt on py3.13

nvidia-modelopt works on Python 3.13 when installed with --ignore-requires-python (confirmed by modelopt maintainers).

- logical_error_rate.py: replace sys.version_info dispatch with an ImportError-based dispatch: try modelopt first (INT8+FP8), fall back to _ort_quantize_int8 only when modelopt is not importable; FP8 raises RuntimeError with the --ignore-requires-python install hint
- check_python_compat.sh: after the main requirements install, re-install nvidia-modelopt[onnx] with --ignore-requires-python when MODE=train and Python >= 3.13, so GPU CI on 3.13 uses the full modelopt path
- requirements_public_train.txt: add comment documenting the 3.13 install approach for manual setups
- test_onnx_quant_workflow.py:
  - remove py3.13-specific skip from test_ort_quantize_int8_produces_output_file (now skips when onnxruntime is not installed, regardless of version)
  - replace test_ort_quantize_int8_dispatch_on_py313 with test_ort_quantize_int8_called_on_modelopt_import_error
  - replace test_fp8_raises_on_py313 with test_fp8_raises_on_modelopt_import_error
  - remove py3.13 version guard from test_modelopt_importable_when_installed
  - remove py3.13 version guard from test_ort_importable_when_installed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
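The ImportError-based dispatch can be sketched as below. The function pick_quant_backend and its string return values are hypothetical; the real code calls the quantizers directly. The module name is a parameter only so the sketch can be exercised without modelopt installed.

```python
import importlib


def pick_quant_backend(quant_format,
                       modelopt_module="modelopt.onnx.quantization"):
    """Choose a quantization backend based on what can be imported."""
    try:
        importlib.import_module(modelopt_module)
    except ImportError as exc:
        if quant_format == "fp8":
            # No FP8 fallback exists, so fail with an actionable hint.
            raise RuntimeError(
                "FP8 quantization requires nvidia-modelopt; on Python 3.13 "
                "install it with pip's --ignore-requires-python option"
            ) from exc
        return "onnxruntime"  # INT8 fallback path
    return "modelopt"         # handles both INT8 and FP8
```

Dispatching on the import result rather than on sys.version_info means a Python 3.13 environment with modelopt force-installed still takes the full modelopt path.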

modelopt[onnx] pulls in onnxruntime-gpu~=1.22.0 as a dependency on all Python versions. Newer ONNX packages (1.19+) default model.ir_version to 12, but onnxruntime-gpu 1.22.0 only supports up to IR version 10, causing test_ort_quantize_int8_produces_output_file to fail on the GPU CI for py3.11, py3.12, and py3.13. Pin model.ir_version = 8 (the minimum required for opset 17) before saving the test model so the calibration InferenceSession succeeds with any onnxruntime version that supports IR ≤ 10. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ivanbasov · 2026-03-16T21:17:19Z

Thanks, @bmhowe23! It seems to work. However, installation on Python 3.13 is still semi-manual (it requires --ignore-requires-python). As far as I understand, introducing pyproject.toml by itself does not make the installation smoother. However, if we are allowed to replace pip with uv pip, it could help with running everything under a consistent configuration. Please let me know if we can/should move to uv pip.

Previous coverage only verified that modelopt.onnx.quantization was importable. Add TestModeloptQuantize with two tests that actually call mq.quantize() on a real ONNX model:

- test_mq_quantize_int8_produces_valid_onnx: verifies the output file is created and passes onnx.checker (confirms modelopt works at runtime, not just at import time; this is the key Python 3.13 regression check)
- test_mq_quantize_int8_output_differs_from_fp32: verifies QDQ nodes were inserted (output graph has more nodes than the FP32 source)

Both tests share a _build_tiny_model() helper that creates a minimal Gemm ONNX model with input "dets" and 16 calibration rows, matching the production calibration_data={"dets": calib_dets} call convention. model.ir_version is pinned to 8 for onnxruntime-gpu 1.22.0 compatibility. Tests are skipped when nvidia-modelopt is not installed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

mq.quantize() runs an internal ONNX inference session to profile MatMul nodes; feeding uint8 calibration data to a float-input model caused InvalidArgument. Switch to np.random.randn(...).astype(float32). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Previously both TRT import sites caught ImportError inside a broad `except Exception` block and silently fell back to PyTorch with a print. This masked misconfiguration: the user explicitly selected ONNX_WORKFLOW=2 or 3, so a missing tensorrt install is always a hard error.

Changes:

- USE_ENGINE_ONLY (workflow=3): ImportError now raises RuntimeError with install hint; other TRT errors (bad engine file) still fall back gracefully.
- EXPORT_AND_USE_TRT (workflow=2): same split.
- test_tensorrt_fallback.py: replace the old "falls back on ImportError" tests with "raises RuntimeError on ImportError" tests; add chained cause check and non-import fallback tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
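The split between a hard error for a missing install and a graceful fallback for other TensorRT failures might be structured like this. Both callables (import_trt, build_context) and the function name are hypothetical stand-ins that keep the sketch runnable without tensorrt.

```python
def load_trt_context(import_trt, build_context):
    """Return a TRT context, or None to signal PyTorch fallback.

    `import_trt()` stands in for `import tensorrt`; `build_context(trt)`
    stands in for engine deserialization. Both are hypothetical.
    """
    try:
        trt = import_trt()
    except ImportError as exc:
        # The user explicitly asked for a TRT workflow: fail loudly
        # and keep the original error as the chained cause.
        raise RuntimeError(
            "ONNX_WORKFLOW=2/3 requires the tensorrt package, "
            "which is not installed") from exc
    try:
        return build_context(trt)
    except Exception as exc:
        # Other failures (for example a bad engine file) still fall back.
        print(f"TensorRT setup failed ({exc}); falling back to PyTorch")
        return None
```

Keeping the import in its own try block is what prevents the broad handler from swallowing ImportError again.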

ORT's MinMaxCalibrater augments the model to expose intermediate tensors for calibration, but graph *inputs* are not included in the augmented outputs. When the test model had dets->Gemm directly, ORT never collected calibration stats for 'dets', causing: ValueError: Quantization parameters are not specified for param dets. Fix: insert a Relu node (dets -> Relu -> dets_relu -> Gemm) so the Gemm input is an intermediate tensor that gets calibrated. Also switch the calibration array to float32 (consistent with model dtype) and add rewind() to _DetCalibReader in production code for calibration methods that make multiple passes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
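The reader interface mentioned above can be illustrated without importing onnxruntime. The class below is a duck-typed sketch, not the production _DetCalibReader: it mirrors the get_next() contract of onnxruntime's CalibrationDataReader (a dict of input name to data, then None when exhausted) and adds rewind(). The class name and the use of plain lists instead of float32 arrays are assumptions.

```python
class DetCalibReaderSketch:
    """Feeds pre-collected calibration rows one at a time."""

    def __init__(self, calib_rows, input_name="dets"):
        self._rows = list(calib_rows)
        self._input_name = input_name
        self._index = 0

    def get_next(self):
        # Return one feed dict per call, then None when exhausted.
        if self._index >= len(self._rows):
            return None
        row = self._rows[self._index]
        self._index += 1
        return {self._input_name: row}

    def rewind(self):
        # Allow calibration methods that make several passes to restart.
        self._index = 0
```

Without rewind(), a second pass over the reader would see None immediately and collect no statistics.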

_ort_quantize_int8 is only invoked when modelopt is absent. When modelopt IS installed, its mq.quantize() call leaves ORT's execution-provider state dirty (failed TRT EP init), causing the calibration InferenceSession to run silently without producing stats, which makes quantize_static raise: ValueError: Quantization parameters are not specified for param dets. The test is meaningless in that environment anyway: if modelopt is present, the ort path is never taken. Skip when modelopt is importable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- README: ONNX_WORKFLOW=1 runs PyTorch inference after export (not stop) (bmhowe23 suggestion)
- LER: cast calib_dets to float32 before passing to mq.quantize(); _collect_calibration_dets returns uint8 but modelopt expects float (sacpis: bug report on line 1077)
- LER: use Path.with_suffix('.engine') instead of str.replace (sacpis nit on line 1104)
- LER: add pathlib.Path import
- test: remove spurious @skipUnless from _build_tiny_model helper; it is not a test method and the decorator has no effect (sacpis nit on line 299)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
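The with_suffix change matters when ".onnx" appears somewhere other than the file extension. A short illustration, with an invented directory name and a hypothetical helper engine_path_for:

```python
from pathlib import Path


def engine_path_for(onnx_path):
    """Derive the engine path by swapping only the final suffix."""
    return Path(onnx_path).with_suffix(".engine")


# Hypothetical path whose directory name also contains ".onnx".
onnx_path = "exports.onnx_v2/model.onnx"

# str.replace rewrites every occurrence, including the directory name.
naive = onnx_path.replace(".onnx", ".engine")
assert naive == "exports.engine_v2/model.engine"

# with_suffix touches only the extension of the last path component.
assert engine_path_for(onnx_path).as_posix() == "exports.onnx_v2/model.engine"
```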

Move the QUANT_FORMAT env-var read/validate/warn block into a module-level helper so the test can call the real production logic instead of re-implementing it.

- Add _parse_quant_format(rank=0) -> str in logical_error_rate.py
- Replace inline parsing block in run_inference_and_decode with a single _parse_quant_format(rank=dist.rank) call
- Import _parse_quant_format in test_onnx_quant_workflow.py and simplify _run_quant_block to delegate to it

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
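A helper with the read/validate/warn shape described here might look like the following. Only the (rank=0) -> str signature and the int8/fp8 values come from the PR; returning an empty string for unset or invalid values, case-insensitive matching, warning on rank 0 only, and the extra env parameter are assumptions.

```python
import os

VALID_QUANT_FORMATS = ("int8", "fp8")


def parse_quant_format(rank=0, env=None):
    """Return "int8", "fp8", or "" when quantization is not requested."""
    env = os.environ if env is None else env
    raw = env.get("QUANT_FORMAT", "").strip().lower()
    if not raw:
        return ""
    if raw not in VALID_QUANT_FORMATS:
        if rank == 0:  # avoid duplicate warnings from other ranks
            print(f"Warning: ignoring unsupported QUANT_FORMAT={raw!r}; "
                  f"expected one of {VALID_QUANT_FORMATS}")
        return ""
    return raw
```

A module-level function like this can be imported by the test directly, which is the point of the refactor.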

Python's [:, :-0] is equivalent to [:, :0] and silently returns an empty tensor rather than the full row. Add an explicit check so the caller gets a clear ValueError instead of a confusing width-mismatch error downstream. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
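The pitfall is easy to reproduce with plain lists, since -0 == 0 in Python and the same slicing rule applies. The helper name strip_trailing_observables and the assumption that the observables occupy the trailing num_obs columns are illustrative.

```python
def strip_trailing_observables(rows, num_obs):
    """Drop the last `num_obs` columns from each row.

    Raises ValueError for num_obs < 1, because row[:-0] is row[:0],
    which is empty rather than the full row.
    """
    if num_obs < 1:
        raise ValueError(f"num_obs must be >= 1, got {num_obs}")
    return [row[:-num_obs] for row in rows]


row = [1, 0, 1, 1]
assert row[:-0] == []  # the silent failure mode the guard prevents
assert strip_trailing_observables([row], 1) == [[1, 0, 1]]
```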

sacpis

Overall LGTM. Thanks @ivanbasov. Left a comment.

The comment said "Absent at runtime causes graceful fallback to the PyTorch path", but since the TRT ImportError fix (ae0f3b1) both ONNX_WORKFLOW=2 and =3 raise RuntimeError instead of falling back. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ivanbasov · 2026-03-19T19:09:19Z

@bmhowe23 and @IgorBaratta , could you please review?

bmhowe23

This LGTM. Thanks, @IgorBaratta and @ivanbasov!

* Replace proprietary license headers with Apache-2.0
  Update all SPDX headers from LicenseRef-NvidiaProprietary to Apache-2.0 across all 70 tracked source files. Also updates spdx_headers.py to generate Apache-2.0 headers and replace old proprietary headers in-place.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(headers): apply Apache-2.0 headers to files added after branch cut
  Files added by PRs #13, #14, and #17 still carried the proprietary LicenseRef-NvidiaProprietary header. Replace with Apache-2.0 to match the rest of the codebase after the header migration.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* style: apply YAPF formatting after header replacement
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(headers): restore full file content truncated during rebase
  The first rebase used --theirs to resolve header conflicts, which took the old PR branch content instead of main's newer content for 5 files. Restore from upstream/main and apply Apache-2.0 header correctly. Affected files:
  - code/qec/noise_model.py
  - code/qec/surface_code/homological_equivalence_torch.py
  - code/tests/mid/test_homological_equivalence.py
  - code/tests/test_noise_model.py
  - code/training/train.py
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(test): remove PreDecoderModelMemory_v2 test removed by PR #18
  PR #18 removed the unused v2 model architecture. Drop the corresponding test class and import to fix the ImportError in CI.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

…ding (#14)

* feat(onnx): add QUANT_FORMAT int8/fp8 PTQ via modelopt.onnx
* fix(onnx): re-derive engine_path from final onnx_path after quant fallback
* review: fix run.py, temp file cleanup, YAPF, README ONNX section
* quantize only CNN layers
* fix(ci): YAPF, move nvidia-modelopt to train reqs, add prerequisite tests
* feat(onnx): add onnxruntime INT8 fallback for Python 3.13+
* fix(prereqs): document tensorrt as optional GPU dep, add fallback tests
* fix(onnx): use import-based dispatch for modelopt/ort; install modelopt on py3.13
* fix(test): pin ONNX IR version 8 in ort quantize test
* test(onnx): add end-to-end mq.quantize() tests for modelopt
* fix(test): use float32 calibration data in TestModeloptQuantize
* fix(trt): raise RuntimeError when tensorrt missing for ONNX_WORKFLOW=2/3
* fix(test): fix ORT calibration for ort quantize test
* fix(test): skip ort quantize output test when modelopt is installed
* review: address PR #14 review comments
* refactor: extract _parse_quant_format() helper from LER
* fix: guard against num_obs < 1 in _collect_calibration_dets
* docs: fix tensorrt comment - missing TRT now raises RuntimeError
---------
Co-authored-by: Igor Baratta <ialmeidabara@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

test_modelopt_importable_when_installed - remove py3.13 version guard from test_ort_importable_when_installed Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(test): pin ONNX IR version 8 in ort quantize test modelopt[onnx] pulls in onnxruntime-gpu~=1.22.0 as a dependency on all Python versions. Newer ONNX packages (1.19+) default model.ir_version to 12, but onnxruntime-gpu 1.22.0 only supports up to IR version 10, causing test_ort_quantize_int8_produces_output_file to fail on the GPU CI for py3.11, py3.12, and py3.13. Pin model.ir_version = 8 (the minimum required for opset 17) before saving the test model so the calibration InferenceSession succeeds with any onnxruntime version that supports IR ≤ 10. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * test(onnx): add end-to-end mq.quantize() tests for modelopt Previous coverage only verified that modelopt.onnx.quantization was importable. Add TestModeloptQuantize with two tests that actually call mq.quantize() on a real ONNX model: - test_mq_quantize_int8_produces_valid_onnx: verifies the output file is created and passes onnx.checker (confirms modelopt works at runtime, not just at import time — this is the key Python 3.13 regression check) - test_mq_quantize_int8_output_differs_from_fp32: verifies QDQ nodes were inserted (output graph has more nodes than the FP32 source) Both tests share a _build_tiny_model() helper that creates a minimal Gemm ONNX model with input "dets" and 16 calibration rows, matching the production calibration_data={"dets": calib_dets} call convention. model.ir_version is pinned to 8 for onnxruntime-gpu 1.22.0 compatibility. Tests are skipped when nvidia-modelopt is not installed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(test): use float32 calibration data in TestModeloptQuantize mq.quantize() runs an internal ONNX inference session to profile MatMul nodes; feeding uint8 calibration data to a float-input model caused InvalidArgument. 
Switch to np.random.randn(...).astype(float32). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(trt): raise RuntimeError when tensorrt missing for ONNX_WORKFLOW=2/3 Previously both TRT import sites caught ImportError inside a broad `except Exception` block and silently fell back to PyTorch with a print. This masked misconfiguration: the user explicitly selected ONNX_WORKFLOW=2 or 3, so a missing tensorrt install is always a hard error. Changes: - USE_ENGINE_ONLY (workflow=3): ImportError now raises RuntimeError with install hint; other TRT errors (bad engine file) still fall back gracefully. - EXPORT_AND_USE_TRT (workflow=2): same split. - test_tensorrt_fallback.py: replace the old "falls back on ImportError" tests with "raises RuntimeError on ImportError" tests; add chained cause check and non-import fallback tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(test): fix ORT calibration for ort quantize test ORT's MinMaxCalibrater augments the model to expose intermediate tensors for calibration, but graph *inputs* are not included in the augmented outputs. When the test model had dets->Gemm directly, ORT never collected calibration stats for 'dets', causing: ValueError: Quantization parameters are not specified for param dets. Fix: insert a Relu node (dets -> Relu -> dets_relu -> Gemm) so the Gemm input is an intermediate tensor that gets calibrated. Also switch the calibration array to float32 (consistent with model dtype) and add rewind() to _DetCalibReader in production code for calibration methods that make multiple passes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(test): skip ort quantize output test when modelopt is installed _ort_quantize_int8 is only invoked when modelopt is absent. 
When modelopt IS installed its mq.quantize() call leaves ORT's execution- provider state dirty (failed TRT EP init), causing the calibration InferenceSession to run silently without producing stats, which makes quantize_static raise: ValueError: Quantization parameters are not specified for param dets. The test is meaningless in that environment anyway — if modelopt is present the ort path is never taken. Skip when modelopt is importable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * review: address PR #14 review comments - README: ONNX_WORKFLOW=1 runs PyTorch inference after export (not stop) (bmhowe23 suggestion) - LER: cast calib_dets to float32 before passing to mq.quantize(); _collect_calibration_dets returns uint8 but modelopt expects float (sacpis: bug report on line 1077) - LER: use Path.with_suffix('.engine') instead of str.replace (sacpis nit on line 1104) - LER: add pathlib.Path import - test: remove spurious @skipUnless from _build_tiny_model helper; it is not a test method and the decorator has no effect (sacpis nit on line 299) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * refactor: extract _parse_quant_format() helper from LER Move the QUANT_FORMAT env-var read/validate/warn block into a module-level helper so the test can call the real production logic instead of re-implementing it. - Add _parse_quant_format(rank=0) -> str in logical_error_rate.py - Replace inline parsing block in run_inference_and_decode with a single _parse_quant_format(rank=dist.rank) call - Import _parse_quant_format in test_onnx_quant_workflow.py and simplify _run_quant_block to delegate to it Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: guard against num_obs < 1 in _collect_calibration_dets Python's [:, :-0] is equivalent to [:, :0] and silently returns an empty tensor rather than the full row. Add an explicit check so the caller gets a clear ValueError instead of a confusing width-mismatch error downstream. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs: fix tensorrt comment — missing TRT now raises RuntimeError The comment said "Absent at runtime causes graceful fallback to the PyTorch path", but since the TRT ImportError fix (ae0f3b1) both ONNX_WORKFLOW=2 and =3 raise RuntimeError instead of falling back. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Igor Baratta <ialmeidabara@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

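The slicing pitfall behind the "guard against num_obs < 1" commit can be reproduced in plain Python, because `-0 == 0` holds for list slices just as it does for tensor slices. The sketch below is illustrative only: `detector_columns` and the sample rows are hypothetical, and it assumes the trailing `num_obs` columns of each row are the ones being dropped.

```python
from typing import List


def detector_columns(rows: List[List[int]], num_obs: int) -> List[List[int]]:
    """Drop the trailing `num_obs` columns from every row (hypothetical helper)."""
    if num_obs < 1:
        # Without this check, num_obs == 0 would slice with [:-0], i.e. [:0],
        # and silently return empty rows instead of the full rows.
        raise ValueError(f"num_obs must be >= 1, got {num_obs}")
    return [row[:-num_obs] for row in rows]


rows = [[1, 0, 1, 1], [0, 0, 1, 0]]

# The unguarded slice: -0 is just 0, so every row collapses to an empty list.
unguarded = [row[:-0] for row in rows]

# The guarded helper with a valid num_obs keeps the leading columns.
guarded = detector_columns(rows, 1)

print(unguarded)
print(guarded)
```

Raising early puts the error at the point where `num_obs` is wrong, rather than at a later shape check.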
* Replace proprietary license headers with Apache-2.0

  Update all SPDX headers from LicenseRef-NvidiaProprietary to Apache-2.0 across all 70 tracked source files. Also updates spdx_headers.py to generate Apache-2.0 headers and replace old proprietary headers in-place.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(headers): apply Apache-2.0 headers to files added after branch cut

  Files added by PRs #13, #14, and #17 still carried the proprietary LicenseRef-NvidiaProprietary header. Replace with Apache-2.0 to match the rest of the codebase after the header migration.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: apply YAPF formatting after header replacement

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(headers): restore full file content truncated during rebase

  The first rebase used --theirs to resolve header conflicts, which took the old PR branch content instead of main's newer content for 5 files. Restore from upstream/main and apply Apache-2.0 header correctly.

  Affected files:

  - code/qec/noise_model.py
  - code/qec/surface_code/homological_equivalence_torch.py
  - code/tests/mid/test_homological_equivalence.py
  - code/tests/test_noise_model.py
  - code/training/train.py

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(test): remove PreDecoderModelMemory_v2 test removed by PR #18

  PR #18 removed the unused v2 model architecture. Drop the corresponding test class and import to fix the ImportError in CI.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

IgorBaratta and others added 3 commits March 11, 2026 05:58

fix(onnx): re-derive engine_path from final onnx_path after quant fallback

dd82393

ivanbasov changed the title ~~onnx quantization~~ Add ONNX export, INT8/FP8 quantization, and SafeTensors inference loading Mar 11, 2026

ivanbasov requested review from IgorBaratta and bmhowe23 March 11, 2026 15:44

IgorBaratta and others added 4 commits March 13, 2026 07:56

quantize only CNN layers

d7b8217

Merge branch 'igor/onnx_quantization' of github.com:NVIDIA/quantum-predecoder into igor/onnx_quantization

1ce0713

Conflicts: code/evaluation/logical_error_rate.py

ivanbasov and others added 2 commits March 16, 2026 12:39

ivanbasov and others added 5 commits March 16, 2026 14:47

bmhowe23 reviewed Mar 17, 2026

Comment thread README.md Outdated

sacpis reviewed Mar 17, 2026

ivanbasov and others added 3 commits March 18, 2026 17:55

ivanbasov requested review from bmhowe23 and sacpis March 19, 2026 01:07

sacpis approved these changes Mar 19, 2026

Comment thread code/requirements_public_inference.txt Outdated

bmhowe23 approved these changes Mar 19, 2026

IgorBaratta approved these changes Mar 19, 2026

ivanbasov merged commit 993e797 into main Mar 19, 2026
12 checks passed

ivanbasov deleted the igor/onnx_quantization branch March 19, 2026 22:58

Add ONNX export, INT8/FP8 quantization, and SafeTensors inference loading #14

ivanbasov merged 19 commits into main from igor/onnx_quantization

ivanbasov commented Mar 11, 2026 (edited)

ivanbasov commented Mar 16, 2026

bmhowe23 commented Mar 16, 2026 (edited)

ivanbasov commented Mar 16, 2026

sacpis left a comment

ivanbasov commented Mar 19, 2026

bmhowe23 left a comment

4 participants
