feat(decoder_ablation): TRT pre-decoder backend + cudaq-qec docs and tests by ivanbasov · Pull Request #34 · NVIDIA/Ising-Decoding

ivanbasov · 2026-04-01T18:17:52Z

Summary

TRT integration in decoder_ablation: extends decoder_ablation_study to honour the same ONNX_WORKFLOW env-var used by the inference workflow, enabling a full GPU pipeline — neural pre-decoder via TensorRT (FP16/INT8/FP8) feeding residual syndromes directly to cudaq-qec and other global decoders.
16 new tests covering env-var parsing, engine-missing fallback, ONNX export path, and a full mock-TRT execution path (CPU-safe via patched tensorrt module + _MockCUDADevice).
README + local_run.sh docs: new "Decoder ablation study with cudaq-qec" section with command examples for each ONNX_WORKFLOW mode and a decoder variant table.
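
As a rough illustration of the env-var handling summarized above, here is a minimal sketch. `WorkflowMode` and `parse_workflow` are hypothetical stand-ins for the repo's `OnnxWorkflow` parsing, the member names are invented, and both the per-mode meanings (inferred from this PR's text) and the fall-back-to-0 behaviour on invalid values are assumptions, not the actual implementation.

```python
import os
from enum import IntEnum


class WorkflowMode(IntEnum):
    """Hypothetical stand-in for OnnxWorkflow; names and meanings are assumed."""
    PYTORCH = 0           # pre-decoder stays in PyTorch
    EXPORT_ONNX = 1       # attempt an ONNX export
    EXPORT_AND_BUILD = 2  # export, build a TRT engine, run inference through it
    LOAD_ENGINE = 3       # load an existing TRT engine


def parse_workflow(env=None):
    """Read ONNX_WORKFLOW; fall back to PYTORCH when unset or invalid (assumed)."""
    env = os.environ if env is None else env
    raw = env.get("ONNX_WORKFLOW", "0")
    try:
        # int() rejects non-numeric text, the enum rejects out-of-range numbers;
        # both raise ValueError.
        return WorkflowMode(int(raw))
    except ValueError:
        return WorkflowMode.PYTORCH
```

Passing a plain dict instead of `os.environ` keeps the function easy to test without touching the process environment.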

Motivation

PR #17 added the decoder_ablation workflow with cudaq-qec global decoders, but the pre-decoder always ran in PyTorch even though the inference workflow already supported full TRT execution. This PR closes that gap so users can benchmark the complete GPU pipeline end-to-end.

PR #32 (open) only touches cuStabilizer DEM sampling — no overlap here.

What changed

`code/evaluation/failure_analysis.py`

Import OnnxWorkflow, PreDecoderMemoryEvalModule, _parse_quant_format from logical_error_rate.
In decoder_ablation_study: read ONNX_WORKFLOW env-var; for workflow ≠ 0, wrap model in PreDecoderMemoryEvalModule, export ONNX from the already-loaded stim_dets (no extra dataloader needed), optionally build/load TRT engine.
In the batch loop: when trt_context is set, skip loading trainX/x_syn_diff/z_syn_diff and feed baseline_detectors_batch directly to context.execute_v2; parse L_and_residual_dets output (col 0 = pre_L, cols 1: = residual) for the global decoder pass.
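
The output layout described above (column 0 holds `pre_L`, the remaining columns hold the residual detectors) can be sketched with NumPy as follows. `split_engine_output` is a hypothetical helper name and the sample array is made-up illustration data, not output from a real engine.

```python
import numpy as np


def split_engine_output(l_and_residual):
    """Split a (batch, 1 + n_residual) array into pre_L and residual detectors."""
    out = np.asarray(l_and_residual)
    if out.ndim != 2 or out.shape[1] < 2:
        raise ValueError("expected a 2-D array with at least two columns")
    pre_l = out[:, 0]      # column 0: logical prediction from the pre-decoder
    residual = out[:, 1:]  # columns 1 onward: residual detectors for the global decoder
    return pre_l, residual


# Made-up batch of two shots with three residual detectors each.
batch = np.array([[1, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=np.uint8)
pre_l, residual = split_engine_output(batch)
```

Slicing one host-side array this way matches the single device-to-host copy mentioned later in the review, rather than transferring the two parts separately.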

`code/tests/test_failure_analysis.py`

| Class | Tests | What it covers |
| --- | --- | --- |
| `TestOnnxWorkflowParsing` | 3 | Env-var parsing, valid/invalid values |
| `TestDecoderAblationStudyTRTFallback` | 4 | `ONNX_WORKFLOW=3` with missing engine → PyTorch fallback |
| `TestDecoderAblationStudyOnnxExport` | 2 | `ONNX_WORKFLOW=1` export attempt + export-failure fallback |
| `TestDecoderAblationStudyTRTExecution` | 7 | Full mock-TRT path (mock `tensorrt` module, `_MockCUDADevice`, `torch.Tensor.to` redirect) |
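
One way to get a CPU-safe mock `tensorrt` module like the one these tests rely on is to place a `MagicMock` in `sys.modules` for the duration of a test. The sketch below only illustrates that general technique; `make_fake_tensorrt` and `run_with_fake_tensorrt` are hypothetical names, and the tests in this PR may wire the mock differently.

```python
import sys
from unittest import mock


def make_fake_tensorrt():
    """Build a MagicMock that stands in for the tensorrt module."""
    fake = mock.MagicMock(name="tensorrt")
    engine = fake.Runtime.return_value.deserialize_cuda_engine.return_value
    context = engine.create_execution_context.return_value
    # Assumption for this sketch: execute_v2 simply reports success.
    context.execute_v2.return_value = True
    return fake


def run_with_fake_tensorrt():
    """Exercise a TRT-style call chain without a GPU or a real TensorRT install."""
    with mock.patch.dict(sys.modules, {"tensorrt": make_fake_tensorrt()}):
        import tensorrt as trt  # resolves to the mock placed in sys.modules

        runtime = trt.Runtime(trt.Logger())
        engine = runtime.deserialize_cuda_engine(b"not-a-real-engine")
        context = engine.create_execution_context()
        return context.execute_v2([0, 0])
```

`mock.patch.dict` restores `sys.modules` on exit, so the fake module does not leak into other tests.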

`code/scripts/local_run.sh` / `README.md`

All four ONNX_WORKFLOW modes documented for both inference and decoder_ablation.
New README section "Decoder ablation study with cudaq-qec" with TRT + cudaq pipeline examples and a decoder variant table.

Test plan

Existing test_failure_analysis.py tests still pass
New tests pass (require ldpc, stim, beliefmatching — CI-only deps)
Manual smoke: ONNX_WORKFLOW=2 WORKFLOW=decoder_ablation bash code/scripts/local_run.sh on a GPU node with tensorrt and cudaq-qec installed

🤖 Generated with Claude Code

…fault torch.compile=on combined with DataLoader spawn workers during LER validation causes a segfault (20 leaked semaphores, core dumped). Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…vent segfault" This reverts commit 7f0f6c8.

…umentation

Extends decoder_ablation_study to support the same ONNX_WORKFLOW env-var used by the inference workflow, enabling a full GPU pipeline where the neural pre-decoder runs via TensorRT (FP16/INT8/FP8) while cudaq-qec decoders handle the residual syndromes.

- failure_analysis.py: honour ONNX_WORKFLOW=1/2/3 in decoder_ablation_study; add PreDecoderMemoryEvalModule wrapping, TRT engine export/load, and a direct TRT batch execution path that feeds raw stim_dets into the engine and reads L_and_residual_dets without calling _model_forward_and_residual
- test_failure_analysis.py: 16 new tests across 4 classes covering env-var parsing, graceful fallback when the engine file is missing, ONNX export path (workflow=1), and full mock-TRT execution path (CPU-safe via patched tensorrt module and _MockCUDADevice)
- local_run.sh: document TRT + decoder_ablation command examples
- README.md: new "Decoder ablation study with cudaq-qec" section with TRT + cudaq-qec full GPU pipeline examples and decoder variant table

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…re_analysis Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…TRT tests torch.zeros/empty called with device=_MockCUDADevice raised TypeError; extend _patch_tensor_to_for_mock_cuda to redirect mock device to CPU for all tensor creation calls in addition to Tensor.to. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

arange/full/ones etc. also receive device=_MockCUDADevice from the call chain; replace per-function patches with an ExitStack loop over all common torch factory function names. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… compat Instead of patching every torch factory function, make _MockCUDADevice a real torch.device subclass backed by cpu so all C-level tensor ops work natively. Override the type property to return "cuda" for branch coverage. Only torch.cuda.synchronize needs stubbing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…points torch.device cannot be subclassed (TypeError at import). Revert to a plain Python class for _MockCUDADevice and restore comprehensive patching via ExitStack: Tensor.to, nn.Module.to, all factory fns (zeros/arange/full/as_tensor/…), and torch.cuda.synchronize (no-op stub). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
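
The ExitStack-based patching described in this commit can be sketched roughly as below. To stay dependency-free, a `types.SimpleNamespace` called `fake_torch` stands in for the real `torch` module, and `MockCUDADevice`, `_redirect_to_cpu`, and `patch_factories` are hypothetical names; the actual test helper patches `torch` itself and may differ in detail.

```python
import types
from contextlib import ExitStack, contextmanager
from unittest import mock


class MockCUDADevice:
    """Plain Python object that claims to be a CUDA device (hypothetical)."""
    type = "cuda"


def _strict_factory(kind):
    """Stand-in factory that, like a C-level entry point, rejects unknown device objects."""
    def factory(size, device="cpu"):
        if not isinstance(device, str):
            raise TypeError(f"{kind}: unsupported device {device!r}")
        return (kind, size, device)
    return factory


# Stand-in for the torch module, used only to keep this sketch self-contained.
fake_torch = types.SimpleNamespace(
    zeros=_strict_factory("zeros"),
    ones=_strict_factory("ones"),
    arange=_strict_factory("arange"),
)


def _redirect_to_cpu(fn):
    """Wrap fn so a MockCUDADevice passed as device= is swapped for "cpu"."""
    def wrapper(*args, **kwargs):
        if isinstance(kwargs.get("device"), MockCUDADevice):
            kwargs["device"] = "cpu"
        return fn(*args, **kwargs)
    return wrapper


@contextmanager
def patch_factories(module, names):
    """Patch every named factory on module; all patches are undone together on exit."""
    with ExitStack() as stack:
        for name in names:
            original = getattr(module, name)
            stack.enter_context(
                mock.patch.object(module, name, _redirect_to_cpu(original)))
        yield
```

Looping over names inside one `ExitStack` avoids a deeply nested `with` statement and guarantees that every patch is reverted even if a test fails partway through.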

- test_failure_analysis.py: remove bare os.environ/os.path usages (original file never imported os; use Path and simplified assertions)
- failure_analysis.py: move all_baseline_weights.extend() to after the T < 2 guard so skipped batches (PyTorch path, T < 2) do not inflate baseline weight counts; this restores the behaviour of the original code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

sacpis

Overall LGTM. Thanks @ivanbasov. Left a few comments.

- Extract TRT setup to _setup_trt_for_ablation() helper function
- Move T < 2 guard outside if/else so both TRT and PyTorch paths skip short rounds consistently
- Cache _trt_out_ncols before batch loop to avoid per-batch engine query
- Use pinned-memory H2D transfer (torch.as_tensor + pin_memory + non_blocking=True) instead of from_numpy().to()
- Single D2H transfer: L_and_residual_out.cpu().numpy() then slice, avoiding two separate round trips
- Add note about execute_v2 deprecation in TRT >= 10
- Use setUpClass in TRTFallback and TRTExecution test classes to run the ablation study once per class instead of once per test method

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
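
The `setUpClass` change in this commit follows a standard `unittest` pattern: run the expensive step once per class and let each test method inspect the shared result. A minimal sketch, where `expensive_study` and its return value are hypothetical placeholders for the ablation study:

```python
import io
import unittest


def expensive_study():
    """Placeholder for a costly run; counts how often it is invoked."""
    expensive_study.calls += 1
    return {"backend": "pytorch", "num_batches": 3}  # made-up result


expensive_study.calls = 0


class TestSharedRun(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Runs once for the whole class, not once per test method.
        cls.result = expensive_study()

    def test_backend(self):
        self.assertEqual(self.result["backend"], "pytorch")

    def test_batch_count(self):
        self.assertEqual(self.result["num_batches"], 3)


def run_suite():
    """Run the class's tests quietly and return the TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSharedRun)
    return unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
```

The trade-off is that test methods share state, so they should treat `cls.result` as read-only.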

ivanbasov · 2026-04-02T22:10:08Z

Thank you, @sacpis! I addressed all the comments. Could you please review once again?

sacpis

Left a few comments.

- Fix shape comments: stim_dets is (N, (2*T+1)*half), not (2*T*half): boundary detectors add one extra half-width round
- Fix distributed sync: replace barrier() with broadcast_object_list() so non-zero ranks learn about rank-0 ONNX export failures and skip the TRT build instead of hitting FileNotFoundError silently
- Add TestDecoderAblationStudyExportAndBuildTRT covering ONNX_WORKFLOW=2 (export + engine build + TRT inference) with mocked onnx.export and tensorrt, using setUpClass for a single shared run

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
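
The distributed-sync fix rests on a general point: a barrier only aligns timing, while `torch.distributed.broadcast_object_list` also carries rank 0's outcome to the other ranks. The sketch below shows that idea under stated assumptions; `share_export_status` is a hypothetical helper, and the single-process fallback is a convenience of this sketch rather than something the PR describes.

```python
def share_export_status(ok_on_this_rank):
    """Return rank 0's export status on every rank.

    Falls back to the local value when torch is missing or no process
    group is initialized (single-process runs).
    """
    try:
        import torch.distributed as dist
    except ImportError:
        return bool(ok_on_this_rank)

    if not (dist.is_available() and dist.is_initialized()):
        return bool(ok_on_this_rank)

    # Only rank 0's entry matters; other ranks supply a placeholder that
    # broadcast_object_list overwrites in place.
    status = [bool(ok_on_this_rank) if dist.get_rank() == 0 else None]
    dist.broadcast_object_list(status, src=0)
    return bool(status[0])


def should_build_engine(export_ok_on_this_rank):
    """Every rank skips the engine build when rank 0's export failed."""
    return share_export_status(export_ok_on_this_rank)
```

With this shape, a non-zero rank never tries to open an ONNX file that rank 0 failed to write.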

ivanbasov · 2026-04-03T00:37:25Z

> Left a few comments.

Thanks again! Processed

sacpis

LGTM. Thanks @ivanbasov.

…tests (#34)

* fix(ci): disable torch.compile in orientation training to prevent segfault

  torch.compile=on combined with DataLoader spawn workers during LER validation causes a segfault (20 leaked semaphores, core dumped). Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "fix(ci): disable torch.compile in orientation training to prevent segfault"

  This reverts commit 7f0f6c8.

* feat(decoder_ablation): add TRT pre-decoder backend and cudaq-qec documentation

  Extends decoder_ablation_study to support the same ONNX_WORKFLOW env-var used by the inference workflow, enabling a full GPU pipeline where the neural pre-decoder runs via TensorRT (FP16/INT8/FP8) while cudaq-qec decoders handle the residual syndromes.

  - failure_analysis.py: honour ONNX_WORKFLOW=1/2/3 in decoder_ablation_study; add PreDecoderMemoryEvalModule wrapping, TRT engine export/load, and a direct TRT batch execution path that feeds raw stim_dets into the engine and reads L_and_residual_dets without calling _model_forward_and_residual
  - test_failure_analysis.py: 16 new tests across 4 classes covering env-var parsing, graceful fallback when the engine file is missing, ONNX export path (workflow=1), and full mock-TRT execution path (CPU-safe via patched tensorrt module and _MockCUDADevice)
  - local_run.sh: document TRT + decoder_ablation command examples
  - README.md: new "Decoder ablation study with cudaq-qec" section with TRT + cudaq-qec full GPU pipeline examples and decoder variant table

  Signed-off-by: Ivan Basov <ibasov@nvidia.com>
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(ci): add missing os import in tests; fix yapf formatting in failure_analysis

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tests): patch torch.zeros and torch.empty for _MockCUDADevice in TRT tests

  torch.zeros/empty called with device=_MockCUDADevice raised TypeError; extend _patch_tensor_to_for_mock_cuda to redirect mock device to CPU for all tensor creation calls in addition to Tensor.to.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tests): patch all torch factory fns for _MockCUDADevice in TRT tests

  arange/full/ones etc. also receive device=_MockCUDADevice from the call chain; replace per-function patches with an ExitStack loop over all common torch factory function names.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tests): make _MockCUDADevice inherit torch.device("cpu") for full compat

  Instead of patching every torch factory function, make _MockCUDADevice a real torch.device subclass backed by cpu so all C-level tensor ops work natively. Override the type property to return "cuda" for branch coverage. Only torch.cuda.synchronize needs stubbing.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tests): revert to plain _MockCUDADevice; patch all torch C-entry-points

  torch.device cannot be subclassed (TypeError at import). Revert to a plain Python class for _MockCUDADevice and restore comprehensive patching via ExitStack: Tensor.to, nn.Module.to, all factory fns (zeros/arange/full/as_tensor/…), and torch.cuda.synchronize (no-op stub).

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(review): remove os import from tests; fix weight accumulation order

  - test_failure_analysis.py: remove bare os.environ/os.path usages (original file never imported os; use Path and simplified assertions)
  - failure_analysis.py: move all_baseline_weights.extend() to after the T < 2 guard so skipped batches (PyTorch path, T < 2) do not inflate baseline weight counts; this restores the behaviour of the original code

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: address PR review comments from sacpis

  - Extract TRT setup to _setup_trt_for_ablation() helper function
  - Move T < 2 guard outside if/else so both TRT and PyTorch paths skip short rounds consistently
  - Cache _trt_out_ncols before batch loop to avoid per-batch engine query
  - Use pinned-memory H2D transfer (torch.as_tensor + pin_memory + non_blocking=True) instead of from_numpy().to()
  - Single D2H transfer: L_and_residual_out.cpu().numpy() then slice, avoiding two separate round trips
  - Add note about execute_v2 deprecation in TRT >= 10
  - Use setUpClass in TRTFallback and TRTExecution test classes to run the ablation study once per class instead of once per test method

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: shape comments, distributed sync bug, add ONNX_WORKFLOW=2 test

  - Fix shape comments: stim_dets is (N, (2*T+1)*half), not (2*T*half): boundary detectors add one extra half-width round
  - Fix distributed sync: replace barrier() with broadcast_object_list() so non-zero ranks learn about rank-0 ONNX export failures and skip the TRT build instead of hitting FileNotFoundError silently
  - Add TestDecoderAblationStudyExportAndBuildTRT covering ONNX_WORKFLOW=2 (export + engine build + TRT inference) with mocked onnx.export and tensorrt, using setUpClass for a single shared run

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

ivanbasov and others added 8 commits March 30, 2026 11:54

Revert "fix(ci): disable torch.compile in orientation training to pre…

9d3fa08

…vent segfault" This reverts commit 7f0f6c8.

fix(ci): add missing os import in tests; fix yapf formatting in failu…

6e2568b

…re_analysis Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ivanbasov requested review from bmhowe23 and sacpis April 2, 2026 00:03

ivanbasov marked this pull request as ready for review April 2, 2026 00:03

sacpis reviewed Apr 2, 2026


sacpis reviewed Apr 3, 2026


Comment thread code/evaluation/failure_analysis.py Outdated

Comment thread code/evaluation/failure_analysis.py Outdated

Comment thread code/evaluation/failure_analysis.py

Comment thread code/tests/test_failure_analysis.py

sacpis approved these changes Apr 3, 2026


ivanbasov merged commit 8a884c6 into NVIDIA:main Apr 3, 2026
12 checks passed

ivanbasov merged 11 commits into NVIDIA:main from ivanbasov:worktree-trt_decoder
