Add ONNX export, INT8/FP8 quantization, and SafeTensors inference loading #14

Merged
ivanbasov merged 19 commits into main from igor/onnx_quantization on Mar 19, 2026

@ivanbasov
Member

@ivanbasov ivanbasov commented Mar 11, 2026

Summary

  • Adds ONNX export of the pre-decoder inference pipeline with optional INT8 and FP8 post-training quantization via ModelOpt/TensorRT. Controlled by ONNX_WORKFLOW (0=torch only, 1=export ONNX, 2=export+TRT, 3=engine only) and QUANT_FORMAT (int8 / fp8) env vars.
  • Adds _collect_calibration_dets in evaluation/logical_error_rate.py to extract representative detector inputs from the test dataloader for PTQ calibration.
  • Adds code/export/ module: safetensors_utils.py (fp16/fp32 save/load) and checkpoint_to_safetensors.py (CLI to convert .pt → .safetensors).
  • Adds PREDECODER_SAFETENSORS_CHECKPOINT env var in workflows/run.py to load a model directly from a .safetensors file at inference time; model_id and dtype are read from file metadata.
  • Updates local_run.sh and README.md with SafeTensors and ONNX/quantization usage instructions.
  • Adds unit tests: test_safetensors_export.py (round-trip fp32/fp16, metadata auto-detect, error cases) and test_onnx_quant_workflow.py (calibration data collection, QUANT_FORMAT parsing, quantize routing).
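
The ONNX_WORKFLOW modes listed in the summary could be modelled roughly as below. This is a sketch under assumptions, not the code in this PR: the member names TORCH_ONLY and EXPORT_ONNX, the default of 0, the helper names, and raising on an unknown value are all hypothetical. Only EXPORT_AND_USE_TRT, USE_ENGINE_ONLY, and the numeric meanings come from the PR text.

```python
import os
from enum import IntEnum


class OnnxWorkflow(IntEnum):
    # Numeric meanings follow the PR summary.
    # TORCH_ONLY and EXPORT_ONNX are assumed names for modes 0 and 1.
    TORCH_ONLY = 0
    EXPORT_ONNX = 1
    EXPORT_AND_USE_TRT = 2
    USE_ENGINE_ONLY = 3


def read_onnx_workflow(env=None):
    """Parse ONNX_WORKFLOW from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    raw = env.get("ONNX_WORKFLOW", "0").strip()  # default of 0 is an assumption
    try:
        return OnnxWorkflow(int(raw))
    except ValueError as exc:
        # Covers both non-integer strings and integers outside 0..3.
        raise ValueError(f"ONNX_WORKFLOW must be 0, 1, 2 or 3, got {raw!r}") from exc


def needs_tensorrt(mode):
    """Modes 2 and 3 are the ones that touch TensorRT."""
    return mode in (OnnxWorkflow.EXPORT_AND_USE_TRT, OnnxWorkflow.USE_ENGINE_ONLY)
```

Passing a plain dict as `env` keeps the parsing testable without touching the real environment.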

Ported from internal MR !38.

IgorBaratta and others added 3 commits March 11, 2026 05:58
  - Add _collect_calibration_dets module-level helper that samples
    detector inputs from the inference dataloader for ONNX calibration
  - Parse QUANT_FORMAT env var (int8, fp8) in OnnxWorkflow export path;
    invalid values are ignored with a warning
  - Two-step export: always write FP32 ONNX first, then optionally apply
    modelopt.onnx.quantization.quantize() for the requested format
  - fp8 is fail-fast on error; int8 silently falls back to FP32 ONNX
  - Add QUANT_CALIB_SAMPLES env var (default 256) to control calibration
    sample count
  - Add test_onnx_quant_workflow.py: 13 CPU-only unit tests covering
    the calibration helper and QUANT_FORMAT routing logic
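
The two-step export and the differing failure policies for fp8 and int8 described in this commit can be sketched as follows. The function name and the `quantize` callable are hypothetical stand-ins; the real code calls modelopt.onnx.quantization.quantize(), which is not reproduced here.

```python
def export_with_optional_quant(fp32_path, quant_format, quantize):
    """Return the ONNX path to use after the optional quantization step.

    fp32_path    -- path of the FP32 ONNX file, assumed already written
    quant_format -- "int8", "fp8", or "" for no quantization
    quantize     -- stand-in callable (src_path, fmt) -> quantized path
    """
    if not quant_format:
        return fp32_path
    try:
        return quantize(fp32_path, quant_format)
    except Exception as exc:
        if quant_format == "fp8":
            # fp8 is fail-fast: surface the error to the caller.
            raise RuntimeError("FP8 quantization failed") from exc
        # int8 falls back to the FP32 model written in step one.
        return fp32_path
```

Because the FP32 file is always written first, the int8 fallback has a valid model to return without re-exporting.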
- run.py: remove emoji from print statements (style inconsistency)
- run.py: remove no-op torch.compile(disable=True) calls
- run.py: extract _resolve_dir() helper to replace 4 copies of the
  current_file/project_root path resolution pattern
- run.py: replace bare torch.load/load_state_dict with _load_state_dict_from_pt()
  which handles model_state_dict/state_dict/bare-dict formats and strips
  the DDP "module." prefix — consistent with checkpoint_to_safetensors.py
- tests: add addCleanup(os.unlink) for all NamedTemporaryFile paths
- YAPF: reformat logical_error_rate.py and test_onnx_quant_workflow.py
- README: add ONNX export and quantization section documenting
  ONNX_WORKFLOW modes, QUANT_FORMAT, QUANT_CALIB_SAMPLES
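
The checkpoint handling attributed to _load_state_dict_from_pt() above (three container layouts plus the DDP "module." prefix) might look like this once the file has been loaded. It is a sketch on plain dicts: the function name is hypothetical and the torch.load call is deliberately left out.

```python
def unwrap_state_dict(checkpoint):
    """Return a flat state dict from any of the three layouts named in the PR.

    Accepts {"model_state_dict": {...}}, {"state_dict": {...}}, or a bare
    state dict, and strips a leading "module." left by DistributedDataParallel.
    """
    for key in ("model_state_dict", "state_dict"):
        if isinstance(checkpoint, dict) and key in checkpoint:
            checkpoint = checkpoint[key]
            break
    prefix = "module."
    return {
        (name[len(prefix):] if name.startswith(prefix) else name): value
        for name, value in checkpoint.items()
    }
```

Only a leading prefix is stripped, so a parameter such as `encoder.module.weight` keeps its name.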

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ivanbasov ivanbasov changed the title from "onnx quantization" to "Add ONNX export, INT8/FP8 quantization, and SafeTensors inference loading" Mar 11, 2026
IgorBaratta and others added 4 commits March 13, 2026 07:56
…edecoder into igor/onnx_quantization

# Conflicts:
#	code/evaluation/logical_error_rate.py
…ests

- YAPF: reformat 3 long lines in logical_error_rate.py introduced by the
  "quantize only CNN layers" commit (d7b8217)
- Move nvidia-modelopt[onnx] from requirements_public_inference.txt to
  requirements_public_train.txt; it is only needed for ONNX PTQ export
  (QUANT_FORMAT env var), not for pure inference, and has no Python 3.13
  build — keeping it in inference reqs broke unit-tests/py3.13 in CI
- Add python_version<"3.13" marker so the CI train matrix installs it on
  supported Python versions without failing on 3.13
- Add TestModeloptPrerequisite in test_onnx_quant_workflow.py:
  - asserts nvidia-modelopt is declared in requirements_public_train.txt
  - asserts it is absent from requirements_public_inference.txt
  - conditionally checks the import is resolvable when the package is
    present (skipped on Python 3.13+ and when not installed)
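
A requirements-file check of the kind TestModeloptPrerequisite performs could be built on a small parser like the one below. The helper name and the sample file contents are hypothetical; the real test reads requirements_public_train.txt and requirements_public_inference.txt from the repository.

```python
import re


def active_requirements(text):
    """Return lower-cased package names from non-comment requirement lines."""
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.add(match.group(0).lower())
    return names


# Hypothetical file contents, for illustration only.
train_reqs = 'torch\nnvidia-modelopt[onnx]; python_version < "3.13"\n'
inference_reqs = "# tensorrt is optional and GPU-only\nnumpy\n"

assert "nvidia-modelopt" in active_requirements(train_reqs)
assert "nvidia-modelopt" not in active_requirements(inference_reqs)
```

Extras and environment markers are cut off by the regex, so `nvidia-modelopt[onnx]; python_version < "3.13"` is reported as `nvidia-modelopt`.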

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
nvidia-modelopt does not support Python 3.13+. Add a conditional backend
dispatch so QUANT_FORMAT=int8 works on all supported Python versions:

- Add _ort_quantize_int8() module-level helper that uses
  onnxruntime.quantization.quantize_static() with QDQ/QInt8 format and
  a CalibrationDataReader wrapping the pre-collected calib_dets array
- In the quantization block, branch on sys.version_info >= (3, 13):
  - Python 3.13+: call _ort_quantize_int8(); raise immediately for FP8
    (no viable 3.13-compatible FP8 PTQ library available)
  - Python <3.13: keep existing modelopt path unchanged
- Add onnxruntime (python_version >= "3.13") to requirements_public_train.txt
- Expand TestOrtQuantizeInt8 tests:
  - round-trip test (build tiny Gemm ONNX, quantize, validate) on 3.13+
  - dispatch test verifying _ort_quantize_int8 is called on 3.13+
  - FP8-on-3.13 raises RuntimeError
- Expand TestModeloptPrerequisite: assert onnxruntime appears in train
  requirements and both quant packages are absent from inference requirements

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ivanbasov
Member Author

@bmhowe23 and @IgorBaratta, please note that nvidia-modelopt, which is used for FP8 quantization, is available only for Python < 3.13, so I have removed it on Python 3.13. I added INT8 quantization for Python 3.13 via onnxruntime, but I am not sure we need it. Please let me know whether we should drop quantization on Python 3.13 altogether or whether you see other solutions.

tensorrt is a heavy CUDA-only SDK (~500 MB) that cannot be pip-installed
in CPU-only CI, so it is not added as an active pip requirement.
Instead:

- Add a comment block in requirements_public_inference.txt documenting
  tensorrt as an optional prerequisite for ONNX_WORKFLOW=2/3 paths,
  with the install command and a note about graceful fallback
- Add test_tensorrt_fallback.py with three test classes:
  - TestTensorrtDocumented: asserts the requirements comment exists and
    tensorrt is NOT an active pip requirement
  - TestTensorrtFallback: verifies both TRT import sites (USE_ENGINE_ONLY
    and EXPORT_AND_USE_TRT) set trt_context=None on ImportError and do
    not propagate the exception to the caller
  - TestTensorrtImportable: checks key TRT symbols (Logger, Runtime,
    Builder, BuilderFlag, LayerInformationFormat) when tensorrt is
    installed; skipped silently on CPU-only environments

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@bmhowe23
Collaborator

bmhowe23 commented Mar 16, 2026

@bmhowe23 and @IgorBaratta, please note that nvidia-modelopt used for quantization into FP is available only for <python3.13. So, I have removed its support for python3.13. I added a quantization to INT8 for python3.13 via onnxruntime but not sure if we need it. Please let me know if we just need to remove quantization at python 3.13 at all or you see other solutions.

I posted this question to the nvidia-modelopt team. Let's see what they say: NVIDIA/Model-Optimizer#217 (comment)

(In that thread, they claim it should work for 3.13, but it does not.)

ivanbasov and others added 2 commits March 16, 2026 12:39
…pt on py3.13

nvidia-modelopt works on Python 3.13 when installed with
--ignore-requires-python (confirmed by modelopt maintainers).

- logical_error_rate.py: replace sys.version_info dispatch with an
  ImportError-based dispatch — try modelopt first (INT8+FP8), fall back
  to _ort_quantize_int8 only when modelopt is not importable; FP8 raises
  RuntimeError with the --ignore-requires-python install hint
- check_python_compat.sh: after the main requirements install, re-install
  nvidia-modelopt[onnx] with --ignore-requires-python when MODE=train and
  Python >= 3.13, so GPU CI on 3.13 uses the full modelopt path
- requirements_public_train.txt: add comment documenting the 3.13 install
  approach for manual setups
- test_onnx_quant_workflow.py:
  - remove py3.13-specific skip from test_ort_quantize_int8_produces_output_file
    (now skips when onnxruntime is not installed, regardless of version)
  - replace test_ort_quantize_int8_dispatch_on_py313 with
    test_ort_quantize_int8_called_on_modelopt_import_error
  - replace test_fp8_raises_on_py313 with test_fp8_raises_on_modelopt_import_error
  - remove py3.13 version guard from test_modelopt_importable_when_installed
  - remove py3.13 version guard from test_ort_importable_when_installed
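
The ImportError-based dispatch described in this commit can be sketched as below. The function name, the returned labels, and the injectable `import_module` parameter are assumptions made so the logic can be exercised without either backend installed; the error message wording is also illustrative.

```python
import importlib


def pick_quant_backend(quant_format, import_module=importlib.import_module):
    """Choose a PTQ backend by probing for modelopt instead of checking the Python version."""
    try:
        import_module("modelopt.onnx.quantization")
        return "modelopt"  # handles both int8 and fp8
    except ImportError as exc:
        if quant_format == "fp8":
            # No FP8 fallback exists, so fail with an install hint.
            raise RuntimeError(
                "FP8 quantization needs nvidia-modelopt; on Python 3.13 install "
                "it with pip's --ignore-requires-python flag"
            ) from exc
        return "onnxruntime"  # int8-only fallback
```

Probing the import rather than `sys.version_info` means a Python 3.13 environment with modelopt force-installed still takes the full modelopt path.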

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
modelopt[onnx] pulls in onnxruntime-gpu~=1.22.0 as a dependency on all
Python versions.  Newer ONNX packages (1.19+) default model.ir_version
to 12, but onnxruntime-gpu 1.22.0 only supports up to IR version 10,
causing test_ort_quantize_int8_produces_output_file to fail on the GPU
CI for py3.11, py3.12, and py3.13.

Pin model.ir_version = 8 (the minimum required for opset 17) before
saving the test model so the calibration InferenceSession succeeds with
any onnxruntime version that supports IR ≤ 10.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ivanbasov
Member Author

@bmhowe23 and @IgorBaratta, please note that nvidia-modelopt used for quantization into FP is available only for <python3.13. So, I have removed its support for python3.13. I added a quantization to INT8 for python3.13 via onnxruntime but not sure if we need it. Please let me know if we just need to remove quantization at python 3.13 at all or you see other solutions.

I posted this question to the nvidia-modelopt team. Let's see what they say: NVIDIA/Model-Optimizer#217 (comment)

(In that thread, they claim it should work for 3.13, but it does not.)

Thanks, @bmhowe23! It seems to work. However, installation on Python 3.13 is still semi-manual (it requires --ignore-requires-python). As far as I understand, introducing pyproject.toml by itself does not make the installation smoother. However, if we are allowed to replace pip with uv pip, it could help run everything under a consistent configuration. Please let me know if we can/should move to uv pip.

ivanbasov and others added 5 commits March 16, 2026 14:47
Previous coverage only verified that modelopt.onnx.quantization was
importable.  Add TestModeloptQuantize with two tests that actually call
mq.quantize() on a real ONNX model:

- test_mq_quantize_int8_produces_valid_onnx: verifies the output file is
  created and passes onnx.checker (confirms modelopt works at runtime,
  not just at import time — this is the key Python 3.13 regression check)
- test_mq_quantize_int8_output_differs_from_fp32: verifies QDQ nodes were
  inserted (output graph has more nodes than the FP32 source)

Both tests share a _build_tiny_model() helper that creates a minimal
Gemm ONNX model with input "dets" and 16 calibration rows, matching the
production calibration_data={"dets": calib_dets} call convention.
model.ir_version is pinned to 8 for onnxruntime-gpu 1.22.0 compatibility.
Tests are skipped when nvidia-modelopt is not installed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mq.quantize() runs an internal ONNX inference session to profile
MatMul nodes; feeding uint8 calibration data to a float-input model
caused InvalidArgument. Switch to np.random.randn(...).astype(float32).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously both TRT import sites caught ImportError inside a broad
`except Exception` block and silently fell back to PyTorch with a
print.  This masked misconfiguration: the user explicitly selected
ONNX_WORKFLOW=2 or 3, so a missing tensorrt install is always a hard
error.

Changes:
- USE_ENGINE_ONLY (workflow=3): ImportError now raises RuntimeError
  with install hint; other TRT errors (bad engine file) still fall
  back gracefully.
- EXPORT_AND_USE_TRT (workflow=2): same split.
- test_tensorrt_fallback.py: replace the old "falls back on ImportError"
  tests with "raises RuntimeError on ImportError" tests; add chained
  cause check and non-import fallback tests.
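
The split between a hard error for a missing tensorrt install and a soft fallback for other TRT failures might be structured like this. All names here are hypothetical; `import_trt` and `build_context` are injected callables standing in for `import tensorrt` and the engine setup code.

```python
def load_trt_context(import_trt, build_context, workflow):
    """Return a TRT context, or None when a non-import TRT error occurs.

    import_trt    -- stand-in callable that imports and returns the tensorrt module
    build_context -- stand-in callable (trt_module) -> execution context
    workflow      -- the ONNX_WORKFLOW value (2 or 3), used in the message
    """
    try:
        trt = import_trt()
    except ImportError as exc:
        # The user explicitly asked for TRT, so a missing install is fatal.
        raise RuntimeError(
            f"ONNX_WORKFLOW={workflow} requires tensorrt, which is not installed"
        ) from exc
    try:
        return build_context(trt)
    except Exception as exc:
        # Other failures (for example a bad engine file) still fall back.
        print(f"TensorRT setup failed ({exc}); falling back to PyTorch")
        return None
```

Using `raise ... from exc` preserves the original ImportError as `__cause__`, which is what a chained-cause test can assert on.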

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ORT's MinMaxCalibrater augments the model to expose intermediate
tensors for calibration, but graph *inputs* are not included in the
augmented outputs.  When the test model had dets->Gemm directly, ORT
never collected calibration stats for 'dets', causing:
  ValueError: Quantization parameters are not specified for param dets.

Fix: insert a Relu node (dets -> Relu -> dets_relu -> Gemm) so the
Gemm input is an intermediate tensor that gets calibrated.  Also
switch the calibration array to float32 (consistent with model dtype)
and add rewind() to _DetCalibReader in production code for calibration
methods that make multiple passes.
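
A calibration reader with the rewind() behaviour mentioned above could look like the sketch below. It is duck-typed and uses plain lists so it runs without onnxruntime or NumPy; the production _DetCalibReader is described as wrapping the calib_dets array for onnxruntime's quantize_static(), and the class name, constructor arguments, and batching here are assumptions.

```python
class DetCalibReader:
    """Feeds calibration rows one batch at a time through get_next()."""

    def __init__(self, calib_rows, input_name="dets", batch_size=1):
        self._rows = list(calib_rows)
        self._input_name = input_name
        self._batch_size = batch_size
        self._pos = 0

    def get_next(self):
        """Return {input_name: batch}, or None once the data is exhausted."""
        if self._pos >= len(self._rows):
            return None
        batch = self._rows[self._pos:self._pos + self._batch_size]
        self._pos += self._batch_size
        return {self._input_name: batch}

    def rewind(self):
        """Reset the cursor so a calibrator can make another pass."""
        self._pos = 0
```

Without rewind(), a second pass over the reader would see None immediately and collect no statistics.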

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_ort_quantize_int8 is only invoked when modelopt is absent.  When
modelopt IS installed its mq.quantize() call leaves ORT's execution-
provider state dirty (failed TRT EP init), causing the calibration
InferenceSession to run silently without producing stats, which makes
quantize_static raise:
  ValueError: Quantization parameters are not specified for param dets.

The test is meaningless in that environment anyway — if modelopt is
present the ort path is never taken.  Skip when modelopt is importable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread README.md Outdated
Comment thread code/evaluation/logical_error_rate.py Outdated
Comment thread code/evaluation/logical_error_rate.py Outdated
Comment thread code/evaluation/logical_error_rate.py
Comment thread code/tests/test_onnx_quant_workflow.py Outdated
Comment thread code/tests/test_onnx_quant_workflow.py Outdated
Comment thread code/evaluation/logical_error_rate.py
ivanbasov and others added 3 commits March 18, 2026 17:55
- README: ONNX_WORKFLOW=1 runs PyTorch inference after export (not stop)
  (bmhowe23 suggestion)
- LER: cast calib_dets to float32 before passing to mq.quantize();
  _collect_calibration_dets returns uint8 but modelopt expects float
  (sacpis: bug report on line 1077)
- LER: use Path.with_suffix('.engine') instead of str.replace
  (sacpis nit on line 1104)
- LER: add pathlib.Path import
- test: remove spurious @skipUnless from _build_tiny_model helper;
  it is not a test method and the decorator has no effect
  (sacpis nit on line 299)
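
The reason for preferring Path.with_suffix('.engine') over str.replace can be shown with a path whose directory name also contains ".onnx". The path and the helper name below are hypothetical.

```python
from pathlib import Path


def engine_path_for(onnx_path):
    """Derive the TensorRT engine path by swapping only the final suffix."""
    return Path(onnx_path).with_suffix(".engine")


onnx_path = "exports.onnx/predecoder_int8.onnx"  # hypothetical path

# str.replace rewrites every occurrence, including the directory name.
naive = onnx_path.replace(".onnx", ".engine")
# naive == "exports.engine/predecoder_int8.engine"

# with_suffix touches only the last component's extension.
safe = engine_path_for(onnx_path).as_posix()
# safe == "exports.onnx/predecoder_int8.engine"
```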

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move the QUANT_FORMAT env-var read/validate/warn block into a
module-level helper so the test can call the real production logic
instead of re-implementing it.

- Add _parse_quant_format(rank=0) -> str in logical_error_rate.py
- Replace inline parsing block in run_inference_and_decode with a
  single _parse_quant_format(rank=dist.rank) call
- Import _parse_quant_format in test_onnx_quant_workflow.py and
  simplify _run_quant_block to delegate to it

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Python's [:, :-0] is equivalent to [:, :0] and silently returns an
empty tensor rather than the full row.  Add an explicit check so the
caller gets a clear ValueError instead of a confusing width-mismatch
error downstream.
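
The slicing pitfall behind this fix is easy to reproduce with a plain list, since `-0 == 0` in Python. The sketch assumes the observables are the trailing entries of each row, which the `[:, :-num_obs]` pattern suggests; the function name is hypothetical.

```python
def split_dets_and_obs(row, num_obs):
    """Split a flat row into (detectors, observables), observables last."""
    if num_obs < 1:
        # Without this guard, row[:-0] would silently return an empty slice.
        raise ValueError(f"num_obs must be >= 1, got {num_obs}")
    return row[:-num_obs], row[-num_obs:]


row = [1, 0, 1, 1]
assert row[:-0] == []  # the pitfall: same as row[:0]
assert split_dets_and_obs(row, 1) == ([1, 0, 1], [1])
```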

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ivanbasov ivanbasov requested review from bmhowe23 and sacpis March 19, 2026 01:07
Collaborator

@sacpis sacpis left a comment

Overall LGTM. Thanks @ivanbasov. Left a comment.

Comment thread code/requirements_public_inference.txt Outdated
The comment said "Absent at runtime causes graceful fallback to the
PyTorch path", but since the TRT ImportError fix (ae0f3b1) both
ONNX_WORKFLOW=2 and =3 raise RuntimeError instead of falling back.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ivanbasov
Member Author

@bmhowe23 and @IgorBaratta , could you please review?

Collaborator

@bmhowe23 bmhowe23 left a comment

This LGTM. Thanks, @IgorBaratta and @ivanbasov!

@ivanbasov ivanbasov merged commit 993e797 into main Mar 19, 2026
12 checks passed
@ivanbasov ivanbasov deleted the igor/onnx_quantization branch March 19, 2026 22:58
ivanbasov added a commit that referenced this pull request Mar 23, 2026
* Replace proprietary license headers with Apache-2.0

Update all SPDX headers from LicenseRef-NvidiaProprietary to Apache-2.0
across all 70 tracked source files. Also updates spdx_headers.py to
generate Apache-2.0 headers and replace old proprietary headers in-place.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(headers): apply Apache-2.0 headers to files added after branch cut

Files added by PRs #13, #14, and #17 still carried the proprietary
LicenseRef-NvidiaProprietary header. Replace with Apache-2.0 to match
the rest of the codebase after the header migration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: apply YAPF formatting after header replacement

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(headers): restore full file content truncated during rebase

The first rebase used --theirs to resolve header conflicts, which took
the old PR branch content instead of main's newer content for 5 files.
Restore from upstream/main and apply Apache-2.0 header correctly.

Affected files:
- code/qec/noise_model.py
- code/qec/surface_code/homological_equivalence_torch.py
- code/tests/mid/test_homological_equivalence.py
- code/tests/test_noise_model.py
- code/training/train.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(test): remove PreDecoderModelMemory_v2 test removed by PR #18

PR #18 removed the unused v2 model architecture. Drop the corresponding
test class and import to fix the ImportError in CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
ivanbasov added a commit that referenced this pull request Apr 10, 2026
…ding (#14)

* feat(onnx): add QUANT_FORMAT int8/fp8 PTQ via modelopt.onnx
  - Add _collect_calibration_dets module-level helper that samples
    detector inputs from the inference dataloader for ONNX calibration
  - Parse QUANT_FORMAT env var (int8, fp8) in OnnxWorkflow export path;
    invalid values are ignored with a warning
  - Two-step export: always write FP32 ONNX first, then optionally apply
    modelopt.onnx.quantization.quantize() for the requested format
  - fp8 is fail-fast on error; int8 silently falls back to FP32 ONNX
  - Add QUANT_CALIB_SAMPLES env var (default 256) to control calibration
    sample count
  - Add test_onnx_quant_workflow.py: 13 CPU-only unit tests covering
    the calibration helper and QUANT_FORMAT routing logic

* fix(onnx): re-derive engine_path from final onnx_path after quant fallback

* review: fix run.py, temp file cleanup, YAPF, README ONNX section

- run.py: remove emoji from print statements (style inconsistency)
- run.py: remove no-op torch.compile(disable=True) calls
- run.py: extract _resolve_dir() helper to replace 4 copies of the
  current_file/project_root path resolution pattern
- run.py: replace bare torch.load/load_state_dict with _load_state_dict_from_pt()
  which handles model_state_dict/state_dict/bare-dict formats and strips
  the DDP "module." prefix — consistent with checkpoint_to_safetensors.py
- tests: add addCleanup(os.unlink) for all NamedTemporaryFile paths
- YAPF: reformat logical_error_rate.py and test_onnx_quant_workflow.py
- README: add ONNX export and quantization section documenting
  ONNX_WORKFLOW modes, QUANT_FORMAT, QUANT_CALIB_SAMPLES

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* quantize only CNN layers

* fix(ci): YAPF, move nvidia-modelopt to train reqs, add prerequisite tests

- YAPF: reformat 3 long lines in logical_error_rate.py introduced by the
  "quantize only CNN layers" commit (d7b8217)
- Move nvidia-modelopt[onnx] from requirements_public_inference.txt to
  requirements_public_train.txt; it is only needed for ONNX PTQ export
  (QUANT_FORMAT env var), not for pure inference, and has no Python 3.13
  build — keeping it in inference reqs broke unit-tests/py3.13 in CI
- Add python_version<"3.13" marker so the CI train matrix installs it on
  supported Python versions without failing on 3.13
- Add TestModeloptPrerequisite in test_onnx_quant_workflow.py:
  - asserts nvidia-modelopt is declared in requirements_public_train.txt
  - asserts it is absent from requirements_public_inference.txt
  - conditionally checks the import is resolvable when the package is
    present (skipped on Python 3.13+ and when not installed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(onnx): add onnxruntime INT8 fallback for Python 3.13+

nvidia-modelopt does not support Python 3.13+. Add a conditional backend
dispatch so QUANT_FORMAT=int8 works on all supported Python versions:

- Add _ort_quantize_int8() module-level helper that uses
  onnxruntime.quantization.quantize_static() with QDQ/QInt8 format and
  a CalibrationDataReader wrapping the pre-collected calib_dets array
- In the quantization block, branch on sys.version_info >= (3, 13):
  - Python 3.13+: call _ort_quantize_int8(); raise immediately for FP8
    (no viable 3.13-compatible FP8 PTQ library available)
  - Python <3.13: keep existing modelopt path unchanged
- Add onnxruntime (python_version >= "3.13") to requirements_public_train.txt
- Expand TestOrtQuantizeInt8 tests:
  - round-trip test (build tiny Gemm ONNX, quantize, validate) on 3.13+
  - dispatch test verifying _ort_quantize_int8 is called on 3.13+
  - FP8-on-3.13 raises RuntimeError
- Expand TestModeloptPrerequisite: assert onnxruntime appears in train
  requirements and both quant packages are absent from inference requirements

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(prereqs): document tensorrt as optional GPU dep, add fallback tests

tensorrt is a heavy CUDA-only SDK (~500 MB) that cannot be pip-installed
in CPU-only CI, so it is not added as an active pip requirement.
Instead:

- Add a comment block in requirements_public_inference.txt documenting
  tensorrt as an optional prerequisite for ONNX_WORKFLOW=2/3 paths,
  with the install command and a note about graceful fallback
- Add test_tensorrt_fallback.py with three test classes:
  - TestTensorrtDocumented: asserts the requirements comment exists and
    tensorrt is NOT an active pip requirement
  - TestTensorrtFallback: verifies both TRT import sites (USE_ENGINE_ONLY
    and EXPORT_AND_USE_TRT) set trt_context=None on ImportError and do
    not propagate the exception to the caller
  - TestTensorrtImportable: checks key TRT symbols (Logger, Runtime,
    Builder, BuilderFlag, LayerInformationFormat) when tensorrt is
    installed; skipped silently on CPU-only environments

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(onnx): use import-based dispatch for modelopt/ort; install modelopt on py3.13

nvidia-modelopt works on Python 3.13 when installed with
--ignore-requires-python (confirmed by modelopt maintainers).

- logical_error_rate.py: replace sys.version_info dispatch with an
  ImportError-based dispatch — try modelopt first (INT8+FP8), fall back
  to _ort_quantize_int8 only when modelopt is not importable; FP8 raises
  RuntimeError with the --ignore-requires-python install hint
- check_python_compat.sh: after the main requirements install, re-install
  nvidia-modelopt[onnx] with --ignore-requires-python when MODE=train and
  Python >= 3.13, so GPU CI on 3.13 uses the full modelopt path
- requirements_public_train.txt: add comment documenting the 3.13 install
  approach for manual setups
- test_onnx_quant_workflow.py:
  - remove py3.13-specific skip from test_ort_quantize_int8_produces_output_file
    (now skips when onnxruntime is not installed, regardless of version)
  - replace test_ort_quantize_int8_dispatch_on_py313 with
    test_ort_quantize_int8_called_on_modelopt_import_error
  - replace test_fp8_raises_on_py313 with test_fp8_raises_on_modelopt_import_error
  - remove py3.13 version guard from test_modelopt_importable_when_installed
  - remove py3.13 version guard from test_ort_importable_when_installed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(test): pin ONNX IR version 8 in ort quantize test

modelopt[onnx] pulls in onnxruntime-gpu~=1.22.0 as a dependency on all
Python versions.  Newer ONNX packages (1.19+) default model.ir_version
to 12, but onnxruntime-gpu 1.22.0 only supports up to IR version 10,
causing test_ort_quantize_int8_produces_output_file to fail on the GPU
CI for py3.11, py3.12, and py3.13.

Pin model.ir_version = 8 (the minimum required for opset 17) before
saving the test model so the calibration InferenceSession succeeds with
any onnxruntime version that supports IR ≤ 10.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(onnx): add end-to-end mq.quantize() tests for modelopt

Previous coverage only verified that modelopt.onnx.quantization was
importable.  Add TestModeloptQuantize with two tests that actually call
mq.quantize() on a real ONNX model:

- test_mq_quantize_int8_produces_valid_onnx: verifies the output file is
  created and passes onnx.checker (confirms modelopt works at runtime,
  not just at import time — this is the key Python 3.13 regression check)
- test_mq_quantize_int8_output_differs_from_fp32: verifies QDQ nodes were
  inserted (output graph has more nodes than the FP32 source)

Both tests share a _build_tiny_model() helper that creates a minimal
Gemm ONNX model with input "dets" and 16 calibration rows, matching the
production calibration_data={"dets": calib_dets} call convention.
model.ir_version is pinned to 8 for onnxruntime-gpu 1.22.0 compatibility.
Tests are skipped when nvidia-modelopt is not installed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(test): use float32 calibration data in TestModeloptQuantize

mq.quantize() runs an internal ONNX inference session to profile
MatMul nodes; feeding uint8 calibration data to a float-input model
caused InvalidArgument. Switch to np.random.randn(...).astype(float32).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(trt): raise RuntimeError when tensorrt missing for ONNX_WORKFLOW=2/3

Previously both TRT import sites caught ImportError inside a broad
`except Exception` block and silently fell back to PyTorch with a
print.  This masked misconfiguration: the user explicitly selected
ONNX_WORKFLOW=2 or 3, so a missing tensorrt install is always a hard
error.

Changes:
- USE_ENGINE_ONLY (workflow=3): ImportError now raises RuntimeError
  with install hint; other TRT errors (bad engine file) still fall
  back gracefully.
- EXPORT_AND_USE_TRT (workflow=2): same split.
- test_tensorrt_fallback.py: replace the old "falls back on ImportError"
  tests with "raises RuntimeError on ImportError" tests; add chained
  cause check and non-import fallback tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(test): fix ORT calibration for ort quantize test

ORT's MinMaxCalibrater augments the model to expose intermediate
tensors for calibration, but graph *inputs* are not included in the
augmented outputs.  When the test model had dets->Gemm directly, ORT
never collected calibration stats for 'dets', causing:
  ValueError: Quantization parameters are not specified for param dets.

Fix: insert a Relu node (dets -> Relu -> dets_relu -> Gemm) so the
Gemm input is an intermediate tensor that gets calibrated.  Also
switch the calibration array to float32 (consistent with model dtype)
and add rewind() to _DetCalibReader in production code for calibration
methods that make multiple passes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(test): skip ort quantize output test when modelopt is installed

_ort_quantize_int8 is only invoked when modelopt is absent.  When
modelopt IS installed its mq.quantize() call leaves ORT's execution-
provider state dirty (failed TRT EP init), causing the calibration
InferenceSession to run silently without producing stats, which makes
quantize_static raise:
  ValueError: Quantization parameters are not specified for param dets.

The test is meaningless in that environment anyway — if modelopt is
present the ort path is never taken.  Skip when modelopt is importable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* review: address PR #14 review comments

- README: ONNX_WORKFLOW=1 runs PyTorch inference after export (not stop)
  (bmhowe23 suggestion)
- LER: cast calib_dets to float32 before passing to mq.quantize();
  _collect_calibration_dets returns uint8 but modelopt expects float
  (sacpis: bug report on line 1077)
- LER: use Path.with_suffix('.engine') instead of str.replace
  (sacpis nit on line 1104)
- LER: add pathlib.Path import
- test: remove spurious @skipUnless from _build_tiny_model helper;
  it is not a test method and the decorator has no effect
  (sacpis nit on line 299)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: extract _parse_quant_format() helper from LER

Move the QUANT_FORMAT env-var read/validate/warn block into a
module-level helper so the test can call the real production logic
instead of re-implementing it.

- Add _parse_quant_format(rank=0) -> str in logical_error_rate.py
- Replace inline parsing block in run_inference_and_decode with a
  single _parse_quant_format(rank=dist.rank) call
- Import _parse_quant_format in test_onnx_quant_workflow.py and
  simplify _run_quant_block to delegate to it

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: guard against num_obs < 1 in _collect_calibration_dets

Python's [:, :-0] is equivalent to [:, :0] and silently returns an
empty tensor rather than the full row.  Add an explicit check so the
caller gets a clear ValueError instead of a confusing width-mismatch
error downstream.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: fix tensorrt comment — missing TRT now raises RuntimeError

The comment said "Absent at runtime causes graceful fallback to the
PyTorch path", but since the TRT ImportError fix (ae0f3b1) both
ONNX_WORKFLOW=2 and =3 raise RuntimeError instead of falling back.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Igor Baratta <ialmeidabara@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
ivanbasov added a commit that referenced this pull request Apr 10, 2026
* Replace proprietary license headers with Apache-2.0

Update all SPDX headers from LicenseRef-NvidiaProprietary to Apache-2.0
across all 70 tracked source files. Also updates spdx_headers.py to
generate Apache-2.0 headers and replace old proprietary headers in-place.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(headers): apply Apache-2.0 headers to files added after branch cut

Files added by PRs #13, #14, and #17 still carried the proprietary
LicenseRef-NvidiaProprietary header. Replace with Apache-2.0 to match
the rest of the codebase after the header migration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: apply YAPF formatting after header replacement

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(headers): restore full file content truncated during rebase

The first rebase used --theirs to resolve header conflicts, which took
the old PR branch content instead of main's newer content for 5 files.
Restore from upstream/main and apply Apache-2.0 header correctly.

Affected files:
- code/qec/noise_model.py
- code/qec/surface_code/homological_equivalence_torch.py
- code/tests/mid/test_homological_equivalence.py
- code/tests/test_noise_model.py
- code/training/train.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(test): remove test for PreDecoderModelMemory_v2, which PR #18 removed

PR #18 removed the unused v2 model architecture. Drop the corresponding
test class and import to fix the ImportError in CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
ivanbasov added a commit that referenced this pull request Apr 10, 2026
…ding (#14)

* feat(onnx): add QUANT_FORMAT int8/fp8 PTQ via modelopt.onnx
  - Add _collect_calibration_dets module-level helper that samples
    detector inputs from the inference dataloader for ONNX calibration
  - Parse QUANT_FORMAT env var (int8, fp8) in OnnxWorkflow export path;
    invalid values are ignored with a warning
  - Two-step export: always write FP32 ONNX first, then optionally apply
    modelopt.onnx.quantization.quantize() for the requested format
  - fp8 is fail-fast on error; int8 silently falls back to FP32 ONNX
  - Add QUANT_CALIB_SAMPLES env var (default 256) to control calibration
    sample count
  - Add test_onnx_quant_workflow.py: 13 CPU-only unit tests covering
    the calibration helper and QUANT_FORMAT routing logic
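
The calibration-sampling idea can be sketched without any ML dependencies. Everything named here is an assumption for illustration: `collect_calibration_rows` is a hypothetical stand-in for the real helper, batches are plain nested lists rather than tensors, and each row is assumed to hold detector bits followed by `num_obs` observable bits.

```python
import os
from itertools import islice


def collect_calibration_rows(batches, num_obs, max_samples):
    """Gather up to max_samples detector rows from an iterable of batches.

    Assumed layout: each row is [dets..., obs...]; the num_obs trailing
    observable columns are dropped before calibration.
    """
    if num_obs < 1:
        raise ValueError("num_obs must be >= 1")
    rows = (row[:-num_obs] for batch in batches for row in batch)
    # islice stops pulling batches as soon as enough rows were collected.
    return list(islice(rows, max_samples))


# QUANT_CALIB_SAMPLES defaults to 256 per the commit message.
default_samples = int(os.environ.get("QUANT_CALIB_SAMPLES", "256"))

batches = [[[1, 0, 1], [0, 0, 1]], [[1, 1, 0]]]
print(collect_calibration_rows(batches, num_obs=1, max_samples=2))
# [[1, 0], [0, 0]]
```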

* fix(onnx): re-derive engine_path from final onnx_path after quant fallback

* review: fix run.py, temp file cleanup, YAPF, README ONNX section

- run.py: remove emoji from print statements (style inconsistency)
- run.py: remove no-op torch.compile(disable=True) calls
- run.py: extract _resolve_dir() helper to replace 4 copies of the
  current_file/project_root path resolution pattern
- run.py: replace bare torch.load/load_state_dict with _load_state_dict_from_pt()
  which handles model_state_dict/state_dict/bare-dict formats and strips
  the DDP "module." prefix — consistent with checkpoint_to_safetensors.py
- tests: add addCleanup(os.unlink) for all NamedTemporaryFile paths
- YAPF: reformat logical_error_rate.py and test_onnx_quant_workflow.py
- README: add ONNX export and quantization section documenting
  ONNX_WORKFLOW modes, QUANT_FORMAT, QUANT_CALIB_SAMPLES
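
The checkpoint unwrapping described for `_load_state_dict_from_pt()` might look roughly like the sketch below. It works on an already-loaded dict so that it runs without PyTorch; `normalize_state_dict` is a hypothetical name, and a real loader would first obtain `checkpoint` from `torch.load()`.

```python
def normalize_state_dict(checkpoint):
    """Unwrap a checkpoint dict and strip the DDP "module." prefix.

    Hypothetical stand-in for the unwrapping step only.
    """
    # Accept {"model_state_dict": ...}, {"state_dict": ...}, or a bare dict.
    for key in ("model_state_dict", "state_dict"):
        if isinstance(checkpoint.get(key), dict):
            checkpoint = checkpoint[key]
            break
    prefix = "module."
    return {
        (name[len(prefix):] if name.startswith(prefix) else name): value
        for name, value in checkpoint.items()
    }


wrapped = {"model_state_dict": {"module.conv.weight": 1, "module.conv.bias": 2}}
print(normalize_state_dict(wrapped))
# {'conv.weight': 1, 'conv.bias': 2}
```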

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* quantize only CNN layers

* fix(ci): YAPF, move nvidia-modelopt to train reqs, add prerequisite tests

- YAPF: reformat 3 long lines in logical_error_rate.py introduced by the
  "quantize only CNN layers" commit (d7b8217)
- Move nvidia-modelopt[onnx] from requirements_public_inference.txt to
  requirements_public_train.txt; it is only needed for ONNX PTQ export
  (QUANT_FORMAT env var), not for pure inference, and has no Python 3.13
  build — keeping it in inference reqs broke unit-tests/py3.13 in CI
- Add python_version<"3.13" marker so the CI train matrix installs it on
  supported Python versions without failing on 3.13
- Add TestModeloptPrerequisite in test_onnx_quant_workflow.py:
  - asserts nvidia-modelopt is declared in requirements_public_train.txt
  - asserts it is absent from requirements_public_inference.txt
  - conditionally checks the import is resolvable when the package is
    present (skipped on Python 3.13+ and when not installed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(onnx): add onnxruntime INT8 fallback for Python 3.13+

nvidia-modelopt does not support Python 3.13+. Add a conditional backend
dispatch so QUANT_FORMAT=int8 works on all supported Python versions:

- Add _ort_quantize_int8() module-level helper that uses
  onnxruntime.quantization.quantize_static() with QDQ/QInt8 format and
  a CalibrationDataReader wrapping the pre-collected calib_dets array
- In the quantization block, branch on sys.version_info >= (3, 13):
  - Python 3.13+: call _ort_quantize_int8(); raise immediately for FP8
    (no viable 3.13-compatible FP8 PTQ library available)
  - Python <3.13: keep existing modelopt path unchanged
- Add onnxruntime (python_version >= "3.13") to requirements_public_train.txt
- Expand TestOrtQuantizeInt8 tests:
  - round-trip test (build tiny Gemm ONNX, quantize, validate) on 3.13+
  - dispatch test verifying _ort_quantize_int8 is called on 3.13+
  - FP8-on-3.13 raises RuntimeError
- Expand TestModeloptPrerequisite: assert onnxruntime appears in train
  requirements and both quant packages are absent from inference requirements

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(prereqs): document tensorrt as optional GPU dep, add fallback tests

tensorrt is a heavy CUDA-only SDK (~500 MB) that cannot be pip-installed
in CPU-only CI, so it is not added as an active pip requirement.
Instead:

- Add a comment block in requirements_public_inference.txt documenting
  tensorrt as an optional prerequisite for ONNX_WORKFLOW=2/3 paths,
  with the install command and a note about graceful fallback
- Add test_tensorrt_fallback.py with three test classes:
  - TestTensorrtDocumented: asserts the requirements comment exists and
    tensorrt is NOT an active pip requirement
  - TestTensorrtFallback: verifies both TRT import sites (USE_ENGINE_ONLY
    and EXPORT_AND_USE_TRT) set trt_context=None on ImportError and do
    not propagate the exception to the caller
  - TestTensorrtImportable: checks key TRT symbols (Logger, Runtime,
    Builder, BuilderFlag, LayerInformationFormat) when tensorrt is
    installed; skipped silently on CPU-only environments

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(onnx): use import-based dispatch for modelopt/ort; install modelopt on py3.13

nvidia-modelopt works on Python 3.13 when installed with
--ignore-requires-python (confirmed by modelopt maintainers).

- logical_error_rate.py: replace sys.version_info dispatch with an
  ImportError-based dispatch — try modelopt first (INT8+FP8), fall back
  to _ort_quantize_int8 only when modelopt is not importable; FP8 raises
  RuntimeError with the --ignore-requires-python install hint
- check_python_compat.sh: after the main requirements install, re-install
  nvidia-modelopt[onnx] with --ignore-requires-python when MODE=train and
  Python >= 3.13, so GPU CI on 3.13 uses the full modelopt path
- requirements_public_train.txt: add comment documenting the 3.13 install
  approach for manual setups
- test_onnx_quant_workflow.py:
  - remove py3.13-specific skip from test_ort_quantize_int8_produces_output_file
    (now skips when onnxruntime is not installed, regardless of version)
  - replace test_ort_quantize_int8_dispatch_on_py313 with
    test_ort_quantize_int8_called_on_modelopt_import_error
  - replace test_fp8_raises_on_py313 with test_fp8_raises_on_modelopt_import_error
  - remove py3.13 version guard from test_modelopt_importable_when_installed
  - remove py3.13 version guard from test_ort_importable_when_installed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(test): pin ONNX IR version 8 in ort quantize test

modelopt[onnx] pulls in onnxruntime-gpu~=1.22.0 as a dependency on all
Python versions.  Newer ONNX packages (1.19+) default model.ir_version
to 12, but onnxruntime-gpu 1.22.0 only supports up to IR version 10,
causing test_ort_quantize_int8_produces_output_file to fail on the GPU
CI for py3.11, py3.12, and py3.13.

Pin model.ir_version = 8 (the minimum required for opset 17) before
saving the test model so the calibration InferenceSession succeeds with
any onnxruntime version that supports IR ≤ 10.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(onnx): add end-to-end mq.quantize() tests for modelopt

Previous coverage only verified that modelopt.onnx.quantization was
importable.  Add TestModeloptQuantize with two tests that actually call
mq.quantize() on a real ONNX model:

- test_mq_quantize_int8_produces_valid_onnx: verifies the output file is
  created and passes onnx.checker (confirms modelopt works at runtime,
  not just at import time — this is the key Python 3.13 regression check)
- test_mq_quantize_int8_output_differs_from_fp32: verifies QDQ nodes were
  inserted (output graph has more nodes than the FP32 source)

Both tests share a _build_tiny_model() helper that creates a minimal
Gemm ONNX model with input "dets" and 16 calibration rows, matching the
production calibration_data={"dets": calib_dets} call convention.
model.ir_version is pinned to 8 for onnxruntime-gpu 1.22.0 compatibility.
Tests are skipped when nvidia-modelopt is not installed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(test): use float32 calibration data in TestModeloptQuantize

mq.quantize() runs an internal ONNX inference session to profile
MatMul nodes; feeding uint8 calibration data to a float-input model
caused InvalidArgument. Switch to np.random.randn(...).astype(float32).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(trt): raise RuntimeError when tensorrt missing for ONNX_WORKFLOW=2/3

Previously both TRT import sites caught ImportError inside a broad
`except Exception` block and silently fell back to PyTorch with a
print.  This masked misconfiguration: the user explicitly selected
ONNX_WORKFLOW=2 or 3, so a missing tensorrt install is always a hard
error.

Changes:
- USE_ENGINE_ONLY (workflow=3): ImportError now raises RuntimeError
  with install hint; other TRT errors (bad engine file) still fall
  back gracefully.
- EXPORT_AND_USE_TRT (workflow=2): same split.
- test_tensorrt_fallback.py: replace the old "falls back on ImportError"
  tests with "raises RuntimeError on ImportError" tests; add chained
  cause check and non-import fallback tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(test): fix ORT calibration for ort quantize test

ORT's MinMaxCalibrater augments the model to expose intermediate
tensors for calibration, but graph *inputs* are not included in the
augmented outputs.  When the test model had dets->Gemm directly, ORT
never collected calibration stats for 'dets', causing:
  ValueError: Quantization parameters are not specified for param dets.

Fix: insert a Relu node (dets -> Relu -> dets_relu -> Gemm) so the
Gemm input is an intermediate tensor that gets calibrated.  Also
switch the calibration array to float32 (consistent with model dtype)
and add rewind() to _DetCalibReader in production code for calibration
methods that make multiple passes.
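
A reader with `get_next()` and `rewind()` can be sketched as follows. The class below is duck-typed and dependency-free; it is an assumption that the production `_DetCalibReader` subclasses `onnxruntime.quantization.CalibrationDataReader`, and real feeds would hold float32 NumPy arrays rather than nested lists. The `batch_size` parameter is hypothetical.

```python
class DetCalibReader:
    """Duck-typed calibration reader that yields one feed dict per batch.

    Sketch only: mirrors the get_next()/rewind() protocol without
    importing onnxruntime.
    """

    def __init__(self, calib_rows, input_name="dets", batch_size=2):
        self._rows = calib_rows
        self._input_name = input_name
        self._batch_size = batch_size
        self._pos = 0

    def get_next(self):
        # Returning None signals that the data is exhausted.
        if self._pos >= len(self._rows):
            return None
        batch = self._rows[self._pos:self._pos + self._batch_size]
        self._pos += self._batch_size
        return {self._input_name: batch}

    def rewind(self):
        # Lets a calibration method iterate over the data more than once.
        self._pos = 0


reader = DetCalibReader([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print(reader.get_next())  # {'dets': [[0.0, 1.0], [1.0, 0.0]]}
```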

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(test): skip ort quantize output test when modelopt is installed

_ort_quantize_int8 is only invoked when modelopt is absent.  When
modelopt IS installed its mq.quantize() call leaves ORT's execution-
provider state dirty (failed TRT EP init), causing the calibration
InferenceSession to run silently without producing stats, which makes
quantize_static raise:
  ValueError: Quantization parameters are not specified for param dets.

The test is meaningless in that environment anyway — if modelopt is
present the ort path is never taken.  Skip when modelopt is importable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* review: address PR #14 review comments

- README: ONNX_WORKFLOW=1 runs PyTorch inference after export instead of
  stopping (bmhowe23 suggestion)
- LER: cast calib_dets to float32 before passing to mq.quantize();
  _collect_calibration_dets returns uint8 but modelopt expects float
  (sacpis: bug report on line 1077)
- LER: use Path.with_suffix('.engine') instead of str.replace
  (sacpis nit on line 1104)
- LER: add pathlib.Path import
- test: remove spurious @skipUnless from _build_tiny_model helper;
  it is not a test method and the decorator has no effect
  (sacpis nit on line 299)
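
The reason for preferring `with_suffix` over `str.replace` shows up when ".onnx" also appears elsewhere in the path. The path below is a made-up example, and `PurePosixPath` is used only so the output is the same on every platform.

```python
from pathlib import PurePosixPath

onnx_path = "runs/exp.onnx_v2/model.onnx"

# str.replace rewrites every ".onnx" occurrence, including the directory.
replaced = onnx_path.replace(".onnx", ".engine")
print(replaced)  # runs/exp.engine_v2/model.engine

# with_suffix only changes the extension of the final path component.
derived = str(PurePosixPath(onnx_path).with_suffix(".engine"))
print(derived)  # runs/exp.onnx_v2/model.engine
```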

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: extract _parse_quant_format() helper from LER

Move the QUANT_FORMAT env-var read/validate/warn block into a
module-level helper so the test can call the real production logic
instead of re-implementing it.

- Add _parse_quant_format(rank=0) -> str in logical_error_rate.py
- Replace inline parsing block in run_inference_and_decode with a
  single _parse_quant_format(rank=dist.rank) call
- Import _parse_quant_format in test_onnx_quant_workflow.py and
  simplify _run_quant_block to delegate to it
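
A helper of this shape might look like the sketch below. Several details are assumptions rather than facts from the PR: the empty-string return for "no quantization", the case-insensitive comparison, the rank-0-only warning, and the injectable `environ` parameter, which the real `_parse_quant_format(rank=0)` signature does not have.

```python
import os

VALID_QUANT_FORMATS = ("int8", "fp8")


def parse_quant_format(rank=0, environ=os.environ):
    """Return "int8", "fp8", or "" when QUANT_FORMAT is unset or invalid."""
    raw = environ.get("QUANT_FORMAT", "").strip().lower()
    if not raw:
        return ""
    if raw not in VALID_QUANT_FORMATS:
        if rank == 0:
            # Warn once (rank 0 only) and continue without quantization.
            print(f"Warning: ignoring unsupported QUANT_FORMAT={raw!r}")
        return ""
    return raw


print(parse_quant_format(environ={"QUANT_FORMAT": "FP8"}))  # fp8
```

Keeping the parsing in one module-level function lets tests call the production logic directly instead of re-implementing it.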

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: guard against num_obs < 1 in _collect_calibration_dets

Python's [:, :-0] is equivalent to [:, :0] and silently returns an
empty tensor rather than the full row.  Add an explicit check so the
caller gets a clear ValueError instead of a confusing width-mismatch
error downstream.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: fix tensorrt comment — missing TRT now raises RuntimeError

The comment said "Absent at runtime causes graceful fallback to the
PyTorch path", but since the TRT ImportError fix (ae0f3b1) both
ONNX_WORKFLOW=2 and =3 raise RuntimeError instead of falling back.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Igor Baratta <ialmeidabara@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>