fix(ler): force num_workers=0 when torch.compile is active to prevent segfault by ivanbasov · Pull Request #41 · NVIDIA/Ising-Decoding

ivanbasov · 2026-04-03T00:35:46Z

Summary

torch.compile(model) is applied at line 924 of code/evaluation/logical_error_rate.py, before the DataLoader is created at line 1057 with num_workers=16 and multiprocessing_context="spawn".
Spawning new processes after torch.compile creates a CUDA context conflict in the child processes → segfault + ~20 leaked semaphores.
The SDR evaluation path is not affected because its DataLoader is created prior to calling torch.compile.

Fix: immediately after the existing container/worker-override logic block (line 1048), check whether _applied_compile is True and num_workers > 0. If so, force num_workers=0 so data loading runs in the main process, avoiding the fork/spawn-after-compile hazard entirely. No workflow YAML changes are needed; the previous CI workaround (commit 7f0f6c8) was already reverted (9d3fa08) in favour of this proper code fix.
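
A minimal sketch of the guard described above. The helper name `resolve_num_workers` and the logger are hypothetical; only `_applied_compile` and `num_workers` come from this description, and the actual code in logical_error_rate.py may be structured differently.

```python
import logging

logger = logging.getLogger(__name__)


def resolve_num_workers(num_workers: int, applied_compile: bool) -> int:
    """Return a worker count that is safe to pass to the DataLoader.

    Hypothetical helper mirroring the check on `_applied_compile` and
    `num_workers > 0` described in this PR.
    """
    if applied_compile and num_workers > 0:
        # Compilation already happened, so keep data loading in the main process.
        logger.warning(
            "torch.compile is active; forcing num_workers=0 (was %d) "
            "to avoid spawning DataLoader workers after compilation.",
            num_workers,
        )
        return 0
    return num_workers
```

With this shape, the call site would pass the returned value as `num_workers` when constructing the DataLoader, and the warning gives the log line that the test plan asks to confirm.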

Failed CI run that motivated this fix: https://github.com/NVIDIA/quantum-predecoder/actions/runs/23924233265/job/69777476304

Test plan

Run LER evaluation with torch.compile enabled and num_workers > 0 configured — confirm no segfault and no semaphore leaks.
Confirm num_workers is forced to 0 in logs when _applied_compile=True.
Run SDR evaluation to confirm it is unaffected.
Run existing CI suite to confirm no regressions.

🤖 Generated with Claude Code

…fault torch.compile=on combined with DataLoader spawn workers during LER validation causes a segfault (20 leaked semaphores, core dumped). Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…vent segfault" This reverts commit 7f0f6c8.

bmhowe23 · 2026-04-03T20:10:59Z

What was the first PR on main that seems to have introduced this problem?

ivanbasov · 2026-04-03T20:41:19Z

What was the first PR on main that seems to have introduced this problem?

It seems the problem is with my fork's main, not with upstream/main. I somehow missed a fix that I had made in a branch and that was already merged into upstream main. It only prevented me from running long-running tests against my branch; nothing was affected globally.

… segfault Spawning DataLoader workers (multiprocessing_context="spawn") after torch.compile has been applied causes a CUDA context conflict in the spawned subprocesses, resulting in a segfault and ~20 leaked semaphores. In the LER evaluation path the model is compiled (line 924) before the DataLoader is created (line 1057), so the conflict is triggered. The SDR path is unaffected because its DataLoader is created prior to torch.compile. Fix: when _applied_compile is True and num_workers > 0, reset num_workers to 0 so that data loading happens in the main process, avoiding the fork/spawn-after-compile hazard entirely. Fixes: https://github.com/NVIDIA/quantum-predecoder/actions/runs/23924233265/job/69777476304 Signed-off-by: Ivan Basov <ibasov@nvidia.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…s_percent sdr_as_percent has been accidentally passed to compute_syndrome_density() twice, each time causing a TypeError only caught by long-running GPU tests. Add a short-tier signature inspection test so the next attempt fails fast on every pre-merge CI run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
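
A signature inspection test of this kind could look roughly like the sketch below. The stand-in `compute_syndrome_density` and the helper `accepts_keyword` are assumptions for illustration; the real function's signature is not shown in this PR, and the real test would import it from the project instead.

```python
import inspect


def compute_syndrome_density(syndromes):
    """Stand-in for the real function; its actual signature is an assumption."""
    return sum(syndromes) / len(syndromes)


def accepts_keyword(func, name: str) -> bool:
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter accepts any keyword.
    has_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    named = name in params and params[name].kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    return named or has_var_kwargs


def test_compute_syndrome_density_rejects_sdr_as_percent():
    # Needs no GPU, so it can run in a short pre-merge tier.
    assert not accepts_keyword(compute_syndrome_density, "sdr_as_percent")
```

Because the check only inspects the signature, it runs in milliseconds and fails as soon as a caller would start passing an unsupported keyword.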

… apply PREDECODER_TRAIN_EPOCHS, PREDECODER_TRAIN_SAMPLES, etc. were gated behind PREDECODER_TIMING_RUN=1. smoke_run.sh sets that flag, so full-epoch-training CI worked. run_orientations_long.sh does not set it, so orientation-inference CI silently ignored all overrides and trained with default settings (100 epochs, millions of samples), hitting the 1h30m timeout. Remove the gate so these env vars are always honoured. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
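
The ungated behaviour can be sketched as follows. The function name `apply_env_overrides` and the config keys `epochs` and `train_samples` are hypothetical; only the environment variable names come from the commit message, and the real training code may parse them differently.

```python
import os
from typing import Mapping, Optional

# Environment variable -> config key. The config keys are assumptions.
ENV_OVERRIDES = {
    "PREDECODER_TRAIN_EPOCHS": "epochs",
    "PREDECODER_TRAIN_SAMPLES": "train_samples",
}


def apply_env_overrides(
    config: dict, environ: Optional[Mapping[str, str]] = None
) -> dict:
    """Return a copy of `config` with any set override variables applied.

    There is deliberately no check on PREDECODER_TIMING_RUN: the overrides
    are honoured whenever the variables are set.
    """
    environ = os.environ if environ is None else environ
    result = dict(config)
    for env_name, key in ENV_OVERRIDES.items():
        raw = environ.get(env_name)
        if raw:  # set and non-empty
            result[key] = int(raw)
    return result
```

Passing `environ` explicitly keeps the sketch easy to test without mutating the process environment.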

Each epoch with 32768 samples takes ~40s (3s training + ~37s fixed SDR/LER/val overhead). 4 orientations x 10 epochs x 40s ≈ 27 min, well within the 1h budget. 1 epoch was too thin; 100 (the code default) would time out. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
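
The budget estimate above is simple arithmetic; a small sketch makes it easy to re-run for other epoch counts. The function name is hypothetical, and the 40 s per epoch figure is the approximate value quoted in the commit message.

```python
def estimated_minutes(orientations: int, epochs: int, seconds_per_epoch: float) -> float:
    """Estimated wall-clock minutes for training all orientations sequentially."""
    return orientations * epochs * seconds_per_epoch / 60


# 4 orientations x 10 epochs x 40 s = 1600 s
print(round(estimated_minutes(4, 10, 40)))  # 27
```

The same formula with the code default of 100 epochs gives roughly 267 minutes, which is why the default would exceed a 1-hour budget.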

…model is saved With only 32k training samples, the model's syndrome density reduction stays at ~1.00x across all 10 epochs — below the hardcoded 1.5x threshold in train.py. This causes every epoch to be rejected by the SDR gate even though validation loss improves, leaving best_model/ empty and causing inference to fail with "No valid PreDecoderModelMemory files found". Setting PREDECODER_DISABLE_SDR=1 sets syndrome_density_reduction=None, so sdr_not_computed=True, bypassing the gate and allowing the best_model checkpoint to be saved. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
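
The gate logic described above can be sketched as below. The function name `passes_sdr_gate` and the use of `>=` for the comparison are assumptions; the commit message only states that there is a hardcoded 1.5x threshold in train.py and that a value of None bypasses it.

```python
from typing import Optional

SDR_THRESHOLD = 1.5  # hardcoded threshold quoted in the commit message


def passes_sdr_gate(
    syndrome_density_reduction: Optional[float],
    threshold: float = SDR_THRESHOLD,
) -> bool:
    """Return True if a checkpoint may be saved as the best model.

    When SDR is disabled, the value is None and the gate is bypassed.
    """
    sdr_not_computed = syndrome_density_reduction is None
    # Assumption: values equal to the threshold are accepted.
    return sdr_not_computed or syndrome_density_reduction >= threshold
```

Under these assumptions, an epoch with an SDR of about 1.00x is rejected, while the same epoch with SDR disabled passes the gate and can be saved.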

10 epochs completed in ~16 min, leaving headroom for 30 epochs within the 1-hour job budget. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… segfault (#41)

* fix(ci): disable torch.compile in orientation training to prevent segfault

torch.compile=on combined with DataLoader spawn workers during LER validation causes a segfault (20 leaked semaphores, core dumped). Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "fix(ci): disable torch.compile in orientation training to prevent segfault"

This reverts commit 7f0f6c8.

* fix(ler): force num_workers=0 when torch.compile is active to prevent segfault

Spawning DataLoader workers (multiprocessing_context="spawn") after torch.compile has been applied causes a CUDA context conflict in the spawned subprocesses, resulting in a segfault and ~20 leaked semaphores. In the LER evaluation path the model is compiled (line 924) before the DataLoader is created (line 1057), so the conflict is triggered. The SDR path is unaffected because its DataLoader is created prior to torch.compile.

Fix: when _applied_compile is True and num_workers > 0, reset num_workers to 0 so that data loading happens in the main process, avoiding the fork/spawn-after-compile hazard entirely.

Fixes: https://github.com/NVIDIA/quantum-predecoder/actions/runs/23924233265/job/69777476304

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(metrics): guard compute_syndrome_density signature against sdr_as_percent

sdr_as_percent has been accidentally passed to compute_syndrome_density() twice, each time causing a TypeError only caught by long-running GPU tests. Add a short-tier signature inspection test so the next attempt fails fast on every pre-merge CI run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(train): remove PREDECODER_TIMING_RUN gate so env overrides always apply

PREDECODER_TRAIN_EPOCHS, PREDECODER_TRAIN_SAMPLES, etc. were gated behind PREDECODER_TIMING_RUN=1. smoke_run.sh sets that flag, so full-epoch-training CI worked. run_orientations_long.sh does not set it, so orientation-inference CI silently ignored all overrides and trained with default settings (100 epochs, millions of samples), hitting the 1h30m timeout. Remove the gate so these env vars are always honoured.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci(orientation): increase PREDECODER_TRAIN_EPOCHS from 1 to 10

Each epoch with 32768 samples takes ~40s (3s training + ~37s fixed SDR/LER/val overhead). 4 orientations x 10 epochs x 40s ≈ 27 min, well within the 1h budget. 1 epoch was too thin; 100 (the code default) would time out.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(ci): disable SDR gate during orientation training to ensure best_model is saved

With only 32k training samples, the model's syndrome density reduction stays at ~1.00x across all 10 epochs, below the hardcoded 1.5x threshold in train.py. This causes every epoch to be rejected by the SDR gate even though validation loss improves, leaving best_model/ empty and causing inference to fail with "No valid PreDecoderModelMemory files found". Setting PREDECODER_DISABLE_SDR=1 sets syndrome_density_reduction=None, so sdr_not_computed=True, bypassing the gate and allowing the best_model checkpoint to be saved.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci(orientation): increase PREDECODER_TRAIN_EPOCHS from 10 to 30

10 epochs completed in ~16 min, leaving headroom for 30 epochs within the 1-hour job budget.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

ivanbasov and others added 2 commits March 30, 2026 11:54

Revert "fix(ci): disable torch.compile in orientation training to pre…

9d3fa08

…vent segfault" This reverts commit 7f0f6c8.

ivanbasov requested a review from bmhowe23 April 3, 2026 18:45

Merge remote-tracking branch 'upstream/main'

838d14f

ivanbasov and others added 6 commits April 3, 2026 13:42

ci(orientation): increase PREDECODER_TRAIN_EPOCHS from 10 to 30

0cdd8aa

10 epochs completed in ~16 min, leaving headroom for 30 epochs within the 1-hour job budget. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ivanbasov force-pushed the fix/ler-compile-spawn-segfault branch from 2912b81 to 0cdd8aa Compare April 3, 2026 20:43

bmhowe23 approved these changes Apr 3, 2026

ivanbasov merged commit 1be4aee into NVIDIA:main Apr 3, 2026
22 checks passed

ivanbasov deleted the fix/ler-compile-spawn-segfault branch April 3, 2026 21:44

ivanbasov merged 9 commits into NVIDIA:main from ivanbasov:fix/ler-compile-spawn-segfault
