feat(decoder_ablation): TRT pre-decoder backend + cudaq-qec docs and tests#34

Merged
ivanbasov merged 11 commits into NVIDIA:main from ivanbasov:worktree-trt_decoder
Apr 3, 2026

@ivanbasov
Member

Summary

  • TRT integration in decoder_ablation: extends decoder_ablation_study to honour the same ONNX_WORKFLOW env-var used by the inference workflow, enabling a full GPU pipeline — neural pre-decoder via TensorRT (FP16/INT8/FP8) feeding residual syndromes directly to cudaq-qec and other global decoders.
  • 16 new tests covering env-var parsing, engine-missing fallback, ONNX export path, and a full mock-TRT execution path (CPU-safe via patched tensorrt module + _MockCUDADevice).
  • README + local_run.sh docs: new "Decoder ablation study with cudaq-qec" section with command examples for each ONNX_WORKFLOW mode and a decoder variant table.

Motivation

PR #17 added the decoder_ablation workflow with cudaq-qec global decoders, but the pre-decoder always ran in PyTorch even though the inference workflow already supported full TRT execution. This PR closes that gap so users can benchmark the complete GPU pipeline end-to-end.

PR #32 (open) only touches cuStabilizer DEM sampling — no overlap here.

What changed

code/evaluation/failure_analysis.py

  • Import OnnxWorkflow, PreDecoderMemoryEvalModule, _parse_quant_format from logical_error_rate.
  • In decoder_ablation_study: read ONNX_WORKFLOW env-var; for workflow ≠ 0, wrap model in PreDecoderMemoryEvalModule, export ONNX from the already-loaded stim_dets (no extra dataloader needed), optionally build/load TRT engine.
  • In the batch loop: when trt_context is set, skip loading trainX/x_syn_diff/z_syn_diff and feed baseline_detectors_batch directly to context.execute_v2; parse L_and_residual_dets output (col 0 = pre_L, cols 1: = residual) for the global decoder pass.
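The output layout described in the last bullet can be illustrated with a small NumPy sketch. The helper name `split_l_and_residual` and the toy values are made up for illustration; only the column convention (column 0 holds pre_L, the remaining columns hold the residual detectors) comes from this PR.

```python
import numpy as np


def split_l_and_residual(l_and_residual_dets: np.ndarray):
    """Split a (batch, 1 + n_dets) array into pre_L and residual detectors.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if l_and_residual_dets.ndim != 2 or l_and_residual_dets.shape[1] < 2:
        raise ValueError("expected shape (batch, 1 + n_dets) with n_dets >= 1")
    pre_l = l_and_residual_dets[:, 0]      # column 0: pre_L
    residual = l_and_residual_dets[:, 1:]  # columns 1 onward: residual detectors
    return pre_l, residual


# Toy batch of 2 shots with 3 residual detectors each (made-up values).
out = np.array([[1, 0, 1, 0],
                [0, 0, 0, 1]], dtype=np.uint8)
pre_l, residual = split_l_and_residual(out)
```

Slicing a single host-side array this way keeps the two views consistent with each other, which is presumably why the global decoder pass reads both from the same output.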

code/tests/test_failure_analysis.py

| Class | Tests | What it covers |
| --- | --- | --- |
| TestOnnxWorkflowParsing | 3 | Env-var parsing, valid/invalid values |
| TestDecoderAblationStudyTRTFallback | 4 | ONNX_WORKFLOW=3 with missing engine → PyTorch fallback |
| TestDecoderAblationStudyOnnxExport | 2 | ONNX_WORKFLOW=1 export attempt + export-failure fallback |
| TestDecoderAblationStudyTRTExecution | 7 | Full mock-TRT path (mock tensorrt module, _MockCUDADevice, torch.Tensor.to redirect) |
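As a rough illustration of the behaviour the first two test classes exercise, the sketch below parses `ONNX_WORKFLOW` and falls back to PyTorch when the engine file is missing. The function names, the default of 0 when the variable is unset, the choice to raise on invalid values, and the mapping of modes 2 and 3 to TRT execution are assumptions made for this example; the real logic lives in `failure_analysis.py` and may differ.

```python
import os
from pathlib import Path

VALID_WORKFLOWS = (0, 1, 2, 3)


def read_onnx_workflow(environ=None) -> int:
    """Return ONNX_WORKFLOW as an int (assumed default: 0 when unset)."""
    environ = os.environ if environ is None else environ
    raw = environ.get("ONNX_WORKFLOW", "0").strip()
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"ONNX_WORKFLOW must be an integer, got {raw!r}") from None
    if value not in VALID_WORKFLOWS:
        raise ValueError(f"ONNX_WORKFLOW must be one of {VALID_WORKFLOWS}, got {value}")
    return value


def choose_backend(workflow: int, engine_path: Path) -> str:
    """Pick 'trt' or 'pytorch' (hypothetical helper).

    Assumption: mode 3 loads an existing engine and falls back to PyTorch
    when the file is missing; mode 2 builds the engine itself.
    """
    if workflow == 3 and not engine_path.is_file():
        return "pytorch"
    return "trt" if workflow in (2, 3) else "pytorch"
```

Passing the environment as a plain mapping keeps the parser testable without mutating `os.environ`, for example `read_onnx_workflow({"ONNX_WORKFLOW": "2"})`.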

code/scripts/local_run.sh / README.md

  • All four ONNX_WORKFLOW modes documented for both inference and decoder_ablation.
  • New README section "Decoder ablation study with cudaq-qec" with TRT + cudaq pipeline examples and a decoder variant table.

Test plan

  • Existing test_failure_analysis.py tests still pass
  • New tests pass (require ldpc, stim, beliefmatching — CI-only deps)
  • Manual smoke: ONNX_WORKFLOW=2 WORKFLOW=decoder_ablation bash code/scripts/local_run.sh on a GPU node with tensorrt and cudaq-qec installed

🤖 Generated with Claude Code

ivanbasov and others added 8 commits March 30, 2026 11:54
…fault

torch.compile=on combined with DataLoader spawn workers during LER
validation causes a segfault (20 leaked semaphores, core dumped).
Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…umentation

Extends decoder_ablation_study to support the same ONNX_WORKFLOW env-var
used by the inference workflow, enabling a full GPU pipeline where the
neural pre-decoder runs via TensorRT (FP16/INT8/FP8) while cudaq-qec
decoders handle the residual syndromes.

- failure_analysis.py: honour ONNX_WORKFLOW=1/2/3 in decoder_ablation_study;
  add PreDecoderMemoryEvalModule wrapping, TRT engine export/load, and a
  direct TRT batch execution path that feeds raw stim_dets into the engine
  and reads L_and_residual_dets without calling _model_forward_and_residual
- test_failure_analysis.py: 16 new tests across 4 classes covering env-var
  parsing, graceful fallback when the engine file is missing, ONNX export
  path (workflow=1), and full mock-TRT execution path (CPU-safe via patched
  tensorrt module and _MockCUDADevice)
- local_run.sh: document TRT + decoder_ablation command examples
- README.md: new "Decoder ablation study with cudaq-qec" section with
  TRT + cudaq-qec full GPU pipeline examples and decoder variant table

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…re_analysis

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…TRT tests

torch.zeros/empty called with device=_MockCUDADevice raised TypeError;
extend _patch_tensor_to_for_mock_cuda to redirect mock device to CPU
for all tensor creation calls in addition to Tensor.to.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
arange/full/ones etc. also receive device=_MockCUDADevice from the call
chain; replace per-function patches with an ExitStack loop over all
common torch factory function names.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… compat

Instead of patching every torch factory function, make _MockCUDADevice a
real torch.device subclass backed by cpu so all C-level tensor ops work
natively. Override the type property to return "cuda" for branch coverage.
Only torch.cuda.synchronize needs stubbing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…points

torch.device cannot be subclassed (TypeError at import). Revert to a plain
Python class for _MockCUDADevice and restore comprehensive patching via
ExitStack: Tensor.to, nn.Module.to, all factory fns (zeros/arange/full/
as_tensor/…), and torch.cuda.synchronize (no-op stub).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
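The ExitStack-based patching described in the commits above can be sketched without PyTorch installed. Here `fake_lib` is a stand-in namespace playing the role of `torch`, and `MockCUDADevice`, `make_factory`, and `redirect_mock_device` are hypothetical names; the point is only the pattern of entering one `mock.patch.object` context per factory function and rewriting a mock device argument to CPU.

```python
from contextlib import ExitStack
from types import SimpleNamespace
from unittest import mock


class MockCUDADevice:
    """Plain Python stand-in for a CUDA device (hypothetical)."""
    type = "cuda"


def make_factory(name):
    # Toy factory: records which device it was asked to allocate on.
    def factory(*shape, device="cpu"):
        return {"fn": name, "shape": shape, "device": device}
    return factory


FACTORY_NAMES = ("zeros", "empty", "ones", "full", "arange")
# Stand-in for the real library namespace so the sketch runs anywhere.
fake_lib = SimpleNamespace(**{n: make_factory(n) for n in FACTORY_NAMES})


def redirect_mock_device(original):
    """Wrap a factory so a MockCUDADevice argument is rewritten to 'cpu'."""
    def wrapper(*args, **kwargs):
        if isinstance(kwargs.get("device"), MockCUDADevice):
            kwargs["device"] = "cpu"
        return original(*args, **kwargs)
    return wrapper


with ExitStack() as stack:
    for name in FACTORY_NAMES:
        original = getattr(fake_lib, name)
        stack.enter_context(
            mock.patch.object(fake_lib, name, redirect_mock_device(original)))
    patched = fake_lib.zeros(2, 3, device=MockCUDADevice())

# Outside the with-block every patch has been undone.
restored = fake_lib.zeros(1, device=MockCUDADevice())
```

Looping over names inside one `ExitStack` avoids a deeply nested `with` statement and guarantees that every patch is reverted together when the block exits.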
@ivanbasov ivanbasov requested review from bmhowe23 and sacpis April 2, 2026 00:03
@ivanbasov ivanbasov marked this pull request as ready for review April 2, 2026 00:03
- test_failure_analysis.py: remove bare os.environ/os.path usages
  (original file never imported os; use Path and simplified assertions)
- failure_analysis.py: move all_baseline_weights.extend() to after the
  T < 2 guard so skipped batches (PyTorch path, T < 2) do not inflate
  baseline weight counts — restores behaviour of the original code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
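The ordering fix in the second bullet can be shown with a toy loop. The `(T, weights)` batch layout and the function name are invented for this sketch; what it illustrates is that extending the accumulator only after the `T < 2` guard keeps skipped batches out of the totals.

```python
def collect_baseline_weights(batches):
    """Accumulate baseline weights only for batches that are processed.

    Each batch is a (T, weights) pair; this layout is hypothetical.
    """
    all_baseline_weights = []
    for num_rounds, weights in batches:
        if num_rounds < 2:
            continue  # skipped batch: must not contribute to the totals
        all_baseline_weights.extend(weights)
    return all_baseline_weights


# The first batch has T = 1, so its weights are ignored.
batches = [(1, [9, 9]), (3, [1, 2]), (5, [3])]
collected = collect_baseline_weights(batches)
```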
Collaborator

@sacpis sacpis left a comment

Overall LGTM. Thanks, @ivanbasov. Left a few comments.

Comment thread code/evaluation/failure_analysis.py Outdated
Comment thread code/evaluation/failure_analysis.py Outdated
Comment thread code/evaluation/failure_analysis.py
Comment thread code/evaluation/failure_analysis.py Outdated
Comment thread code/evaluation/failure_analysis.py Outdated
Comment thread code/evaluation/failure_analysis.py Outdated
Comment thread code/tests/test_failure_analysis.py
- Extract TRT setup to _setup_trt_for_ablation() helper function
- Move T < 2 guard outside if/else so both TRT and PyTorch paths skip
  short rounds consistently
- Cache _trt_out_ncols before batch loop to avoid per-batch engine query
- Use pinned-memory H2D transfer (torch.as_tensor + pin_memory +
  non_blocking=True) instead of from_numpy().to()
- Single D2H transfer: L_and_residual_out.cpu().numpy() then slice,
  avoiding two separate round trips
- Add note about execute_v2 deprecation in TRT >= 10
- Use setUpClass in TRTFallback and TRTExecution test classes to run the
  ablation study once per class instead of once per test method

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
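The `setUpClass` change in the last bullet follows a standard `unittest` pattern, sketched here with a stand-in for the expensive call. `run_expensive_study` and its return value are hypothetical; the real tests run the ablation study instead.

```python
import unittest

RUN_COUNT = 0


def run_expensive_study():
    """Stand-in for an expensive call; counts how often it runs."""
    global RUN_COUNT
    RUN_COUNT += 1
    return {"num_batches": 4, "backend": "pytorch"}


class TestStudyOnce(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Runs once for the whole class rather than once per test method.
        cls.result = run_expensive_study()

    def test_backend(self):
        self.assertEqual(self.result["backend"], "pytorch")

    def test_num_batches(self):
        self.assertEqual(self.result["num_batches"], 4)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestStudyOnce)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

The trade-off is that test methods now share state, so they should treat `cls.result` as read-only.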
@ivanbasov
Member Author

Thank you, @sacpis! I have addressed all the comments. Could you please review once again?

Collaborator

@sacpis sacpis left a comment

Left a few comments.

Comment thread code/evaluation/failure_analysis.py Outdated
Comment thread code/evaluation/failure_analysis.py Outdated
Comment thread code/evaluation/failure_analysis.py
Comment thread code/tests/test_failure_analysis.py
- Fix shape comments: stim_dets is (N, (2*T+1)*half) not (2*T*half)
  — boundary detectors add one extra half-width round
- Fix distributed sync: replace barrier() with broadcast_object_list()
  so non-zero ranks learn about rank-0 ONNX export failures and skip
  the TRT build instead of hitting FileNotFoundError silently
- Add TestDecoderAblationStudyExportAndBuildTRT covering ONNX_WORKFLOW=2
  (export + engine build + TRT inference) with mocked onnx.export and
  tensorrt, using setUpClass for a single shared run

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
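The distributed sync fix relies on the difference between `barrier()`, which only synchronises timing, and `broadcast_object_list()`, which also carries a value from one rank to the others. The sketch below shows one way the export status could be shared; `share_export_status` and `should_build_engine` are hypothetical names, and the single-process shortcut is an assumption added so the example runs without a process group.

```python
try:
    import torch.distributed as dist
except ImportError:  # keeps the sketch importable on machines without PyTorch
    dist = None


def share_export_status(export_ok: bool, src: int = 0) -> bool:
    """Return rank `src`'s export status on every rank."""
    if dist is None or not dist.is_available() or not dist.is_initialized():
        return export_ok  # single-process run: nothing to synchronise
    # Only the source rank's entry matters; other ranks receive it in place.
    payload = [export_ok if dist.get_rank() == src else None]
    dist.broadcast_object_list(payload, src=src)
    return bool(payload[0])


def should_build_engine(export_ok: bool) -> bool:
    """All ranks skip the TRT build when the source rank's export failed."""
    return share_export_status(export_ok)
```

With this shape, a failed export on rank 0 makes `should_build_engine` return `False` everywhere, so the other ranks never try to open a missing ONNX file.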
@ivanbasov
Member Author

Left a few comments.

Thanks again! All addressed.

Collaborator

@sacpis sacpis left a comment

LGTM. Thanks @ivanbasov.

@ivanbasov ivanbasov merged commit 8a884c6 into NVIDIA:main Apr 3, 2026
12 checks passed
ivanbasov added a commit that referenced this pull request Apr 10, 2026
…tests (#34)

* fix(ci): disable torch.compile in orientation training to prevent segfault

torch.compile=on combined with DataLoader spawn workers during LER
validation causes a segfault (20 leaked semaphores, core dumped).
Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "fix(ci): disable torch.compile in orientation training to prevent segfault"

This reverts commit 7f0f6c8.

* feat(decoder_ablation): add TRT pre-decoder backend and cudaq-qec documentation

Extends decoder_ablation_study to support the same ONNX_WORKFLOW env-var
used by the inference workflow, enabling a full GPU pipeline where the
neural pre-decoder runs via TensorRT (FP16/INT8/FP8) while cudaq-qec
decoders handle the residual syndromes.

- failure_analysis.py: honour ONNX_WORKFLOW=1/2/3 in decoder_ablation_study;
  add PreDecoderMemoryEvalModule wrapping, TRT engine export/load, and a
  direct TRT batch execution path that feeds raw stim_dets into the engine
  and reads L_and_residual_dets without calling _model_forward_and_residual
- test_failure_analysis.py: 16 new tests across 4 classes covering env-var
  parsing, graceful fallback when the engine file is missing, ONNX export
  path (workflow=1), and full mock-TRT execution path (CPU-safe via patched
  tensorrt module and _MockCUDADevice)
- local_run.sh: document TRT + decoder_ablation command examples
- README.md: new "Decoder ablation study with cudaq-qec" section with
  TRT + cudaq-qec full GPU pipeline examples and decoder variant table

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(ci): add missing os import in tests; fix yapf formatting in failure_analysis

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tests): patch torch.zeros and torch.empty for _MockCUDADevice in TRT tests

torch.zeros/empty called with device=_MockCUDADevice raised TypeError;
extend _patch_tensor_to_for_mock_cuda to redirect mock device to CPU
for all tensor creation calls in addition to Tensor.to.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tests): patch all torch factory fns for _MockCUDADevice in TRT tests

arange/full/ones etc. also receive device=_MockCUDADevice from the call
chain; replace per-function patches with an ExitStack loop over all
common torch factory function names.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tests): make _MockCUDADevice inherit torch.device("cpu") for full compat

Instead of patching every torch factory function, make _MockCUDADevice a
real torch.device subclass backed by cpu so all C-level tensor ops work
natively. Override the type property to return "cuda" for branch coverage.
Only torch.cuda.synchronize needs stubbing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tests): revert to plain _MockCUDADevice; patch all torch C-entry-points

torch.device cannot be subclassed (TypeError at import). Revert to a plain
Python class for _MockCUDADevice and restore comprehensive patching via
ExitStack: Tensor.to, nn.Module.to, all factory fns (zeros/arange/full/
as_tensor/…), and torch.cuda.synchronize (no-op stub).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(review): remove os import from tests; fix weight accumulation order

- test_failure_analysis.py: remove bare os.environ/os.path usages
  (original file never imported os; use Path and simplified assertions)
- failure_analysis.py: move all_baseline_weights.extend() to after the
  T < 2 guard so skipped batches (PyTorch path, T < 2) do not inflate
  baseline weight counts — restores behaviour of the original code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: address PR review comments from sacpis

- Extract TRT setup to _setup_trt_for_ablation() helper function
- Move T < 2 guard outside if/else so both TRT and PyTorch paths skip
  short rounds consistently
- Cache _trt_out_ncols before batch loop to avoid per-batch engine query
- Use pinned-memory H2D transfer (torch.as_tensor + pin_memory +
  non_blocking=True) instead of from_numpy().to()
- Single D2H transfer: L_and_residual_out.cpu().numpy() then slice,
  avoiding two separate round trips
- Add note about execute_v2 deprecation in TRT >= 10
- Use setUpClass in TRTFallback and TRTExecution test classes to run the
  ablation study once per class instead of once per test method

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: shape comments, distributed sync bug, add ONNX_WORKFLOW=2 test

- Fix shape comments: stim_dets is (N, (2*T+1)*half) not (2*T*half)
  — boundary detectors add one extra half-width round
- Fix distributed sync: replace barrier() with broadcast_object_list()
  so non-zero ranks learn about rank-0 ONNX export failures and skip
  the TRT build instead of hitting FileNotFoundError silently
- Add TestDecoderAblationStudyExportAndBuildTRT covering ONNX_WORKFLOW=2
  (export + engine build + TRT inference) with mocked onnx.export and
  tensorrt, using setUpClass for a single shared run

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>