fix(ler): force num_workers=0 when torch.compile is active to prevent segfault #41
Merged
…fault torch.compile=on combined with DataLoader spawn workers during LER validation causes a segfault (20 leaked semaphores, core dumped). Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…vent segfault" This reverts commit 7f0f6c8.
Collaborator
What was the first PR on main that seems to have introduced this problem?
Member
Author
It seems the problem is with my fork's main, not with upstream/main. I somehow missed a fix that I had made in a branch and that was already merged into upstream main. It only prevented me from running long-running tests against my branch; nothing was affected globally.
… segfault Spawning DataLoader workers (multiprocessing_context="spawn") after torch.compile has been applied causes a CUDA context conflict in the spawned subprocesses, resulting in a segfault and ~20 leaked semaphores. In the LER evaluation path the model is compiled (line 924) before the DataLoader is created (line 1057), so the conflict is triggered. The SDR path is unaffected because its DataLoader is created prior to torch.compile. Fix: when _applied_compile is True and num_workers > 0, reset num_workers to 0 so that data loading happens in the main process, avoiding the fork/spawn-after-compile hazard entirely. Fixes: https://github.com/NVIDIA/quantum-predecoder/actions/runs/23924233265/job/69777476304 Signed-off-by: Ivan Basov <ibasov@nvidia.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
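The guard described in this commit message can be sketched as below. This is a minimal illustration, not the code from logical_error_rate.py: the helper name resolve_num_workers and its parameters are hypothetical, and only the rule itself (compiled model plus num_workers > 0 means fall back to 0) comes from the commit message.

```python
def resolve_num_workers(num_workers: int, applied_compile: bool) -> int:
    """Return a DataLoader worker count that avoids spawn-after-compile.

    Hypothetical helper: mirrors the rule from the commit message, where
    data loading is moved into the main process (num_workers=0) whenever
    torch.compile has already been applied.
    """
    if applied_compile and num_workers > 0:
        # Log the override so it is visible in CI output.
        print(f"torch.compile is active; forcing num_workers {num_workers} -> 0")
        return 0
    return num_workers


if __name__ == "__main__":
    print(resolve_num_workers(16, applied_compile=True))
    print(resolve_num_workers(16, applied_compile=False))
```

The resolved value would then be passed as num_workers when the DataLoader is constructed, so no worker subprocess is spawned once the model is compiled.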
…s_percent sdr_as_percent has been accidentally passed to compute_syndrome_density() twice, each time causing a TypeError only caught by long-running GPU tests. Add a short-tier signature inspection test so the next attempt fails fast on every pre-merge CI run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
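One plausible shape for such a short-tier signature check is sketched below using the standard inspect module. The stand-in compute_syndrome_density and its parameters are assumptions made for illustration (the real signature is not shown in this PR), and accepts_keyword is a hypothetical helper.

```python
import inspect


def compute_syndrome_density(syndromes, num_detectors):
    """Hypothetical stand-in for the real function; parameters are assumed."""
    return sum(syndromes) / num_detectors


def accepts_keyword(func, name):
    """Return True if func can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return True
    param = params.get(name)
    return param is not None and param.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )


def test_compute_syndrome_density_has_no_sdr_as_percent():
    # Runs without a GPU, so a mismatch surfaces in pre-merge CI.
    assert not accepts_keyword(compute_syndrome_density, "sdr_as_percent")
```

Because the check only inspects the signature, it needs no model, data, or GPU, which is what lets it run in the short test tier.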
… apply PREDECODER_TRAIN_EPOCHS, PREDECODER_TRAIN_SAMPLES, etc. were gated behind PREDECODER_TIMING_RUN=1. smoke_run.sh sets that flag, so full-epoch-training CI worked. run_orientations_long.sh does not set it, so orientation-inference CI silently ignored all overrides and trained with default settings (100 epochs, millions of samples), hitting the 1h30m timeout. Remove the gate so these env vars are always honoured. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
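A minimal sketch of ungated override handling follows. The helper read_int_override is hypothetical and is not taken from train.py; the variable name PREDECODER_TRAIN_EPOCHS and the default of 100 epochs come from the commit message.

```python
import os


def read_int_override(name, default, env=None):
    """Return the integer value of env var `name`, or `default` if unset.

    Hypothetical helper: the override is honoured unconditionally, with no
    dependence on PREDECODER_TIMING_RUN.
    """
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


if __name__ == "__main__":
    # 100 is the code default quoted in the commit message.
    epochs = read_int_override("PREDECODER_TRAIN_EPOCHS", 100)
    print(f"training for {epochs} epochs")
```

Passing env explicitly keeps the helper easy to unit test without mutating the process environment.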
Each epoch with 32768 samples takes ~40s (3s training + ~37s fixed SDR/LER/val overhead). 4 orientations x 10 epochs x 40s ≈ 27 min, well within the 1h budget. 1 epoch was too thin; 100 (the code default) would time out. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
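The budget arithmetic in this commit message can be reproduced with a few lines. The function name estimated_minutes is made up for illustration; the inputs (4 orientations, ~40s per epoch) are the figures quoted above, and the estimate assumes the per-epoch cost stays constant.

```python
def estimated_minutes(orientations, epochs, seconds_per_epoch):
    """Estimate total wall-clock minutes, assuming a constant cost per epoch."""
    return orientations * epochs * seconds_per_epoch / 60


if __name__ == "__main__":
    print(round(estimated_minutes(4, 10, 40), 1))   # 26.7
    print(round(estimated_minutes(4, 100, 40), 1))  # 266.7
```

The second figure is only a lower bound for the code default, since the default run also trains on far more samples per epoch.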
…model is saved With only 32k training samples, the model's syndrome density reduction stays at ~1.00x across all 10 epochs — below the hardcoded 1.5x threshold in train.py. This causes every epoch to be rejected by the SDR gate even though validation loss improves, leaving best_model/ empty and causing inference to fail with "No valid PreDecoderModelMemory files found". Setting PREDECODER_DISABLE_SDR=1 sets syndrome_density_reduction=None, so sdr_not_computed=True, bypassing the gate and allowing the best_model checkpoint to be saved. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
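The gate behaviour described here can be sketched as follows. This is an assumed reconstruction, not the logic from train.py: should_save_best is a hypothetical name, and whether the real comparison is >= or > is not stated in the commit message. The 1.5x threshold and the sdr_not_computed bypass are taken from the text above.

```python
from typing import Optional

SDR_THRESHOLD = 1.5  # threshold value quoted in the commit message


def should_save_best(val_loss_improved: bool,
                     sdr: Optional[float],
                     threshold: float = SDR_THRESHOLD) -> bool:
    """Decide whether to save a best_model checkpoint (hypothetical sketch)."""
    # With SDR disabled the value is None, which bypasses the gate.
    sdr_not_computed = sdr is None
    # Assumption: the gate passes when SDR meets or exceeds the threshold.
    passes_gate = sdr_not_computed or sdr >= threshold
    return val_loss_improved and passes_gate


if __name__ == "__main__":
    print(should_save_best(True, 1.0))   # SDR ~1.00x: rejected by the gate
    print(should_save_best(True, None))  # SDR disabled: checkpoint saved
```

Under this sketch an SDR of ~1.00x never passes, which matches the reported symptom of an empty best_model/ directory even while validation loss improves.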
10 epochs completed in ~16 min, leaving headroom for 30 epochs within the 1-hour job budget. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2912b81 to 0cdd8aa
bmhowe23 approved these changes Apr 3, 2026
ivanbasov added a commit that referenced this pull request Apr 10, 2026
… segfault (#41)

* fix(ci): disable torch.compile in orientation training to prevent segfault

torch.compile=on combined with DataLoader spawn workers during LER validation causes a segfault (20 leaked semaphores, core dumped). Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "fix(ci): disable torch.compile in orientation training to prevent segfault"

This reverts commit 7f0f6c8.

* fix(ler): force num_workers=0 when torch.compile is active to prevent segfault

Spawning DataLoader workers (multiprocessing_context="spawn") after torch.compile has been applied causes a CUDA context conflict in the spawned subprocesses, resulting in a segfault and ~20 leaked semaphores.

In the LER evaluation path the model is compiled (line 924) before the DataLoader is created (line 1057), so the conflict is triggered. The SDR path is unaffected because its DataLoader is created prior to torch.compile.

Fix: when _applied_compile is True and num_workers > 0, reset num_workers to 0 so that data loading happens in the main process, avoiding the fork/spawn-after-compile hazard entirely.

Fixes: https://github.com/NVIDIA/quantum-predecoder/actions/runs/23924233265/job/69777476304

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(metrics): guard compute_syndrome_density signature against sdr_as_percent

sdr_as_percent has been accidentally passed to compute_syndrome_density() twice, each time causing a TypeError only caught by long-running GPU tests. Add a short-tier signature inspection test so the next attempt fails fast on every pre-merge CI run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(train): remove PREDECODER_TIMING_RUN gate so env overrides always apply

PREDECODER_TRAIN_EPOCHS, PREDECODER_TRAIN_SAMPLES, etc. were gated behind PREDECODER_TIMING_RUN=1. smoke_run.sh sets that flag, so full-epoch-training CI worked. run_orientations_long.sh does not set it, so orientation-inference CI silently ignored all overrides and trained with default settings (100 epochs, millions of samples), hitting the 1h30m timeout. Remove the gate so these env vars are always honoured.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci(orientation): increase PREDECODER_TRAIN_EPOCHS from 1 to 10

Each epoch with 32768 samples takes ~40s (3s training + ~37s fixed SDR/LER/val overhead). 4 orientations x 10 epochs x 40s ≈ 27 min, well within the 1h budget. 1 epoch was too thin; 100 (the code default) would time out.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(ci): disable SDR gate during orientation training to ensure best_model is saved

With only 32k training samples, the model's syndrome density reduction stays at ~1.00x across all 10 epochs, below the hardcoded 1.5x threshold in train.py. This causes every epoch to be rejected by the SDR gate even though validation loss improves, leaving best_model/ empty and causing inference to fail with "No valid PreDecoderModelMemory files found".

Setting PREDECODER_DISABLE_SDR=1 sets syndrome_density_reduction=None, so sdr_not_computed=True, bypassing the gate and allowing the best_model checkpoint to be saved.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci(orientation): increase PREDECODER_TRAIN_EPOCHS from 10 to 30

10 epochs completed in ~16 min, leaving headroom for 30 epochs within the 1-hour job budget.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
torch.compile(model) is applied at line 924 of code/evaluation/logical_error_rate.py, before the DataLoader is created at line 1057 with num_workers=16 and multiprocessing_context="spawn".
Spawning worker processes after torch.compile creates a CUDA context conflict in the child processes → segfault + ~20 leaked semaphores.
The SDR path is unaffected because its DataLoader is created before torch.compile.
Fix: immediately after the existing container/worker-override logic block (line 1048), check whether _applied_compile is True and num_workers > 0. If so, force num_workers=0 so data loading runs in the main process, avoiding the fork/spawn-after-compile hazard entirely. No workflow YAML changes are needed; the previous CI workaround (commit 7f0f6c8) was already reverted (9d3fa08) in favour of this proper code fix.
Failed CI run that motivated this fix: https://github.com/NVIDIA/quantum-predecoder/actions/runs/23924233265/job/69777476304
Test plan
Run the LER evaluation with torch.compile enabled and num_workers > 0 configured; confirm no segfault and no semaphore leaks.
Confirm that num_workers is forced to 0 in logs when _applied_compile=True.
🤖 Generated with Claude Code