feat(decoder_ablation): TRT pre-decoder backend + cudaq-qec docs and tests #34
Merged
…fault torch.compile=on combined with DataLoader spawn workers during LER validation causes a segfault (20 leaked semaphores, core dumped). Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…vent segfault" This reverts commit 7f0f6c8.
…umentation Extends decoder_ablation_study to support the same ONNX_WORKFLOW env-var used by the inference workflow, enabling a full GPU pipeline where the neural pre-decoder runs via TensorRT (FP16/INT8/FP8) while cudaq-qec decoders handle the residual syndromes. - failure_analysis.py: honour ONNX_WORKFLOW=1/2/3 in decoder_ablation_study; add PreDecoderMemoryEvalModule wrapping, TRT engine export/load, and a direct TRT batch execution path that feeds raw stim_dets into the engine and reads L_and_residual_dets without calling _model_forward_and_residual - test_failure_analysis.py: 16 new tests across 4 classes covering env-var parsing, graceful fallback when the engine file is missing, ONNX export path (workflow=1), and full mock-TRT execution path (CPU-safe via patched tensorrt module and _MockCUDADevice) - local_run.sh: document TRT + decoder_ablation command examples - README.md: new "Decoder ablation study with cudaq-qec" section with TRT + cudaq-qec full GPU pipeline examples and decoder variant table Signed-off-by: Ivan Basov <ibasov@nvidia.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…re_analysis Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…TRT tests torch.zeros/empty called with device=_MockCUDADevice raised TypeError; extend _patch_tensor_to_for_mock_cuda to redirect mock device to CPU for all tensor creation calls in addition to Tensor.to. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
arange/full/ones etc. also receive device=_MockCUDADevice from the call chain; replace per-function patches with an ExitStack loop over all common torch factory function names. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… compat Instead of patching every torch factory function, make _MockCUDADevice a real torch.device subclass backed by cpu so all C-level tensor ops work natively. Override the type property to return "cuda" for branch coverage. Only torch.cuda.synchronize needs stubbing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…points torch.device cannot be subclassed (TypeError at import). Revert to a plain Python class for _MockCUDADevice and restore comprehensive patching via ExitStack: Tensor.to, nn.Module.to, all factory fns (zeros/arange/full/ as_tensor/…), and torch.cuda.synchronize (no-op stub). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
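The plain-class-plus-ExitStack approach described in the commit above can be sketched without torch. This is only an illustration of the pattern, not the test code from this PR: `MockDevice`, `make_fake_lib`, `redirect_to_cpu`, and `run_with_patches` are hypothetical names, and a `SimpleNamespace` stands in for the torch factory functions.

```python
from contextlib import ExitStack
from types import SimpleNamespace
from unittest import mock


class MockDevice:
    """Plain Python class standing in for a CUDA device (not a device subclass)."""
    type = "cuda"


def make_fake_lib():
    """Hypothetical stand-in for a library whose factories reject unknown devices."""
    def factory(name):
        def fn(n, device="cpu"):
            if not isinstance(device, str):
                raise TypeError(f"{name}: unsupported device {device!r}")
            return {"op": name, "n": n, "device": device}
        return fn
    return SimpleNamespace(
        zeros=factory("zeros"), ones=factory("ones"), arange=factory("arange"))


def redirect_to_cpu(original):
    """Wrap a factory so that a MockDevice argument is swapped for "cpu"."""
    def wrapper(n, device="cpu"):
        if isinstance(device, MockDevice):
            device = "cpu"
        return original(n, device=device)
    return wrapper


def run_with_patches(lib, names=("zeros", "ones", "arange")):
    """Patch every listed factory in one ExitStack, call each, then restore all."""
    with ExitStack() as stack:
        for name in names:
            stack.enter_context(
                mock.patch.object(lib, name, redirect_to_cpu(getattr(lib, name))))
        return [getattr(lib, name)(2, device=MockDevice()) for name in names]
```

The single ExitStack keeps the list of patched names in one place and guarantees that every patch is undone when the block exits, even if one of the calls raises.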
- test_failure_analysis.py: remove bare os.environ/os.path usages (original file never imported os; use Path and simplified assertions) - failure_analysis.py: move all_baseline_weights.extend() to after the T < 2 guard so skipped batches (PyTorch path, T < 2) do not inflate baseline weight counts — restores behaviour of the original code Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
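The ordering fix for weight accumulation can be illustrated with a minimal sketch. The batch layout (a dict with `T` and `weights` keys) and the function name are assumptions made for the example; only the idea of extending the accumulator after the T < 2 guard comes from the commit message.

```python
def collect_baseline_weights(batches):
    """Accumulate weights only from batches that pass the T >= 2 guard."""
    all_baseline_weights = []
    for batch in batches:
        if batch["T"] < 2:
            # Skipped batch: leave the loop body before touching the accumulator,
            # so short rounds cannot inflate the baseline weight counts.
            continue
        all_baseline_weights.extend(batch["weights"])
    return all_baseline_weights
```

If the `extend` call sat above the guard, the weights of every skipped batch would still be counted.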
sacpis
reviewed
Apr 2, 2026
Collaborator
sacpis
left a comment
Overall LGTM. Thanks @ivanbasov. Left a few comments.
- Extract TRT setup to _setup_trt_for_ablation() helper function - Move T < 2 guard outside if/else so both TRT and PyTorch paths skip short rounds consistently - Cache _trt_out_ncols before batch loop to avoid per-batch engine query - Use pinned-memory H2D transfer (torch.as_tensor + pin_memory + non_blocking=True) instead of from_numpy().to() - Single D2H transfer: L_and_residual_out.cpu().numpy() then slice, avoiding two separate round trips - Add note about execute_v2 deprecation in TRT >= 10 - Use setUpClass in TRTFallback and TRTExecution test classes to run the ablation study once per class instead of once per test method Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
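The setUpClass change in the list above follows a standard unittest pattern. A minimal sketch, assuming a hypothetical `run_expensive_study` stand-in for the ablation run (the real test classes and their assertions are not reproduced here):

```python
import io
import unittest


def run_expensive_study(counter):
    """Hypothetical stand-in for the ablation run; counts how often it is called."""
    counter["calls"] += 1
    return {"status": "ok", "num_batches": 4}


class SharedRunSketch(unittest.TestCase):
    calls = {"calls": 0}

    @classmethod
    def setUpClass(cls):
        # Runs once for the whole class, not once per test method.
        cls.result = run_expensive_study(cls.calls)

    def test_status(self):
        self.assertEqual(self.result["status"], "ok")

    def test_num_batches(self):
        self.assertEqual(self.result["num_batches"], 4)


def run_sketch():
    """Run the class once and report (all tests passed, number of study runs)."""
    SharedRunSketch.calls["calls"] = 0
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SharedRunSketch)
    outcome = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
    return outcome.wasSuccessful(), SharedRunSketch.calls["calls"]
```

With `setUp` instead of `setUpClass`, the study would run once per test method; sharing the result is only safe while the tests treat it as read-only.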
Member
Author
Thank you, @sacpis! I addressed all the comments. Could you please review once again?
sacpis
reviewed
Apr 3, 2026
- Fix shape comments: stim_dets is (N, (2*T+1)*half) not (2*T*half) — boundary detectors add one extra half-width round - Fix distributed sync: replace barrier() with broadcast_object_list() so non-zero ranks learn about rank-0 ONNX export failures and skip the TRT build instead of hitting FileNotFoundError silently - Add TestDecoderAblationStudyExportAndBuildTRT covering ONNX_WORKFLOW=2 (export + engine build + TRT inference) with mocked onnx.export and tensorrt, using setUpClass for a single shared run Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
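The corrected shape and the output layout can be checked with a few lines of plain Python. This is a sketch using nested lists rather than tensors; the function names are hypothetical, `half` is the per-round half width named in the commit message, and the column convention (column 0 is pre_L, the remaining columns are residual detectors) is taken from the PR description.

```python
def stim_dets_width(T, half):
    """Columns in stim_dets for T rounds: (2*T + 1) * half, not 2*T*half."""
    return (2 * T + 1) * half


def split_l_and_residual(rows):
    """Split each output row into pre_L (column 0) and residual detectors (columns 1 onward)."""
    pre_l = [row[0] for row in rows]
    residual = [row[1:] for row in rows]
    return pre_l, residual
```

For example, T = 3 rounds with half = 4 gives 28 columns, one half-width round more than the 24 that the old comment implied.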
Member
Author
Thanks again! Processed.
sacpis
approved these changes
Apr 3, 2026
Collaborator
sacpis
left a comment
LGTM. Thanks @ivanbasov.
ivanbasov
added a commit
that referenced
this pull request
Apr 10, 2026
…tests (#34)

* fix(ci): disable torch.compile in orientation training to prevent segfault
  torch.compile=on combined with DataLoader spawn workers during LER validation causes a segfault (20 leaked semaphores, core dumped). Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "fix(ci): disable torch.compile in orientation training to prevent segfault"
  This reverts commit 7f0f6c8.

* feat(decoder_ablation): add TRT pre-decoder backend and cudaq-qec documentation
  Extends decoder_ablation_study to support the same ONNX_WORKFLOW env-var used by the inference workflow, enabling a full GPU pipeline where the neural pre-decoder runs via TensorRT (FP16/INT8/FP8) while cudaq-qec decoders handle the residual syndromes.
  - failure_analysis.py: honour ONNX_WORKFLOW=1/2/3 in decoder_ablation_study; add PreDecoderMemoryEvalModule wrapping, TRT engine export/load, and a direct TRT batch execution path that feeds raw stim_dets into the engine and reads L_and_residual_dets without calling _model_forward_and_residual
  - test_failure_analysis.py: 16 new tests across 4 classes covering env-var parsing, graceful fallback when the engine file is missing, ONNX export path (workflow=1), and full mock-TRT execution path (CPU-safe via patched tensorrt module and _MockCUDADevice)
  - local_run.sh: document TRT + decoder_ablation command examples
  - README.md: new "Decoder ablation study with cudaq-qec" section with TRT + cudaq-qec full GPU pipeline examples and decoder variant table
  Signed-off-by: Ivan Basov <ibasov@nvidia.com>
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(ci): add missing os import in tests; fix yapf formatting in failure_analysis
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tests): patch torch.zeros and torch.empty for _MockCUDADevice in TRT tests
  torch.zeros/empty called with device=_MockCUDADevice raised TypeError; extend _patch_tensor_to_for_mock_cuda to redirect mock device to CPU for all tensor creation calls in addition to Tensor.to.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tests): patch all torch factory fns for _MockCUDADevice in TRT tests
  arange/full/ones etc. also receive device=_MockCUDADevice from the call chain; replace per-function patches with an ExitStack loop over all common torch factory function names.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tests): make _MockCUDADevice inherit torch.device("cpu") for full compat
  Instead of patching every torch factory function, make _MockCUDADevice a real torch.device subclass backed by cpu so all C-level tensor ops work natively. Override the type property to return "cuda" for branch coverage. Only torch.cuda.synchronize needs stubbing.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tests): revert to plain _MockCUDADevice; patch all torch C-entry-points
  torch.device cannot be subclassed (TypeError at import). Revert to a plain Python class for _MockCUDADevice and restore comprehensive patching via ExitStack: Tensor.to, nn.Module.to, all factory fns (zeros/arange/full/as_tensor/…), and torch.cuda.synchronize (no-op stub).
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(review): remove os import from tests; fix weight accumulation order
  - test_failure_analysis.py: remove bare os.environ/os.path usages (original file never imported os; use Path and simplified assertions)
  - failure_analysis.py: move all_baseline_weights.extend() to after the T < 2 guard so skipped batches (PyTorch path, T < 2) do not inflate baseline weight counts (restores behaviour of the original code)
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: address PR review comments from sacpis
  - Extract TRT setup to _setup_trt_for_ablation() helper function
  - Move T < 2 guard outside if/else so both TRT and PyTorch paths skip short rounds consistently
  - Cache _trt_out_ncols before batch loop to avoid per-batch engine query
  - Use pinned-memory H2D transfer (torch.as_tensor + pin_memory + non_blocking=True) instead of from_numpy().to()
  - Single D2H transfer: L_and_residual_out.cpu().numpy() then slice, avoiding two separate round trips
  - Add note about execute_v2 deprecation in TRT >= 10
  - Use setUpClass in TRTFallback and TRTExecution test classes to run the ablation study once per class instead of once per test method
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: shape comments, distributed sync bug, add ONNX_WORKFLOW=2 test
  - Fix shape comments: stim_dets is (N, (2*T+1)*half) not (2*T*half): boundary detectors add one extra half-width round
  - Fix distributed sync: replace barrier() with broadcast_object_list() so non-zero ranks learn about rank-0 ONNX export failures and skip the TRT build instead of hitting FileNotFoundError silently
  - Add TestDecoderAblationStudyExportAndBuildTRT covering ONNX_WORKFLOW=2 (export + engine build + TRT inference) with mocked onnx.export and tensorrt, using setUpClass for a single shared run
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------
Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
- decoder_ablation: extends decoder_ablation_study to honour the same ONNX_WORKFLOW env-var used by the inference workflow, enabling a full GPU pipeline: neural pre-decoder via TensorRT (FP16/INT8/FP8) feeding residual syndromes directly to cudaq-qec and other global decoders.
- … tensorrt module + _MockCUDADevice).
- local_run.sh docs: new "Decoder ablation study with cudaq-qec" section with command examples for each ONNX_WORKFLOW mode and a decoder variant table.

Motivation
PR #17 added the decoder_ablation workflow with cudaq-qec global decoders, but the pre-decoder always ran in PyTorch even though the inference workflow already supported full TRT execution. This PR closes that gap so users can benchmark the complete GPU pipeline end-to-end.

PR #32 (open) only touches cuStabilizer DEM sampling, so there is no overlap here.

What changed

code/evaluation/failure_analysis.py
- … OnnxWorkflow, PreDecoderMemoryEvalModule, _parse_quant_format from logical_error_rate.
- decoder_ablation_study: read ONNX_WORKFLOW env-var; for workflow ≠ 0, wrap model in PreDecoderMemoryEvalModule, export ONNX from the already-loaded stim_dets (no extra dataloader needed), optionally build/load TRT engine.
- … trt_context is set, skip loading trainX / x_syn_diff / z_syn_diff and feed baseline_detectors_batch directly to context.execute_v2; parse L_and_residual_dets output (col 0 = pre_L, cols 1: = residual) for the global decoder pass.

code/tests/test_failure_analysis.py
- TestOnnxWorkflowParsing: …
- TestDecoderAblationStudyTRTFallback: ONNX_WORKFLOW=3 with missing engine → PyTorch fallback
- TestDecoderAblationStudyOnnxExport: ONNX_WORKFLOW=1 export attempt + export-failure fallback
- TestDecoderAblationStudyTRTExecution: … tensorrt module, _MockCUDADevice, torch.Tensor.to redirect)

code/scripts/local_run.sh / README.md
- … ONNX_WORKFLOW modes documented for both inference and decoder_ablation.

Test plan
- … test_failure_analysis.py tests still pass
- … (ldpc, stim, beliefmatching: CI-only deps)
- ONNX_WORKFLOW=2 WORKFLOW=decoder_ablation bash code/scripts/local_run.sh on a GPU node with tensorrt and cudaq-qec installed

🤖 Generated with Claude Code
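For readers unfamiliar with the ONNX_WORKFLOW modes referred to throughout this PR, the env-var handling can be sketched as follows. This is not the project's code: `OnnxWorkflowSketch` and its member names are hypothetical stand-ins for OnnxWorkflow, the mode meanings are inferred from the commit messages (1 = ONNX export, 2 = export + engine build + TRT inference, 3 = load an existing engine), and falling back to PyTorch on an invalid value is an assumed policy.

```python
import os
from enum import IntEnum


class OnnxWorkflowSketch(IntEnum):
    """Hypothetical stand-in for the project's OnnxWorkflow; member names are assumed."""
    PYTORCH = 0           # plain PyTorch pre-decoder, no ONNX or TRT
    EXPORT_ONLY = 1       # export the pre-decoder to ONNX
    EXPORT_AND_BUILD = 2  # export ONNX, build a TRT engine, run TRT inference
    LOAD_ENGINE = 3       # load an existing TRT engine


def parse_onnx_workflow(env=None):
    """Read ONNX_WORKFLOW; fall back to PYTORCH on missing or invalid values (assumed)."""
    env = os.environ if env is None else env
    raw = env.get("ONNX_WORKFLOW", "0").strip()
    try:
        # int() rejects non-numeric text; the enum rejects out-of-range numbers.
        return OnnxWorkflowSketch(int(raw))
    except ValueError:
        return OnnxWorkflowSketch.PYTORCH
```

Passing a plain dict as `env` keeps the parser testable without mutating the real process environment.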