Commit 1be4aee
fix(ler): force num_workers=0 when torch.compile is active to prevent segfault (#41)
* fix(ci): disable torch.compile in orientation training to prevent segfault
With torch.compile enabled, DataLoader spawn workers started during LER
validation cause a segfault (20 leaked semaphores, core dumped).
Set PREDECODER_TORCH_COMPILE=0 for the "Train all orientations" step.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Revert "fix(ci): disable torch.compile in orientation training to prevent segfault"
This reverts commit 7f0f6c8.
* fix(ler): force num_workers=0 when torch.compile is active to prevent segfault
Spawning DataLoader workers (multiprocessing_context="spawn") after
torch.compile has been applied causes a CUDA context conflict in the
spawned subprocesses, resulting in a segfault and ~20 leaked semaphores.
In the LER evaluation path the model is compiled (line 924) before the
DataLoader is created (line 1057), so the conflict is triggered. The
SDR path is unaffected because its DataLoader is created prior to
torch.compile.
Fix: when _applied_compile is True and num_workers > 0, reset
num_workers to 0 so that data loading happens in the main process,
avoiding the fork/spawn-after-compile hazard entirely.
Fixes: https://github.com/NVIDIA/quantum-predecoder/actions/runs/23924233265/job/69777476304
Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
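The guard described above can be sketched as a small pure function. This is only an illustration of the rule "compiled model and num_workers > 0 implies num_workers = 0"; the helper name `resolve_num_workers` is hypothetical and does not appear in the repository, and the actual change may be written inline.

```python
def resolve_num_workers(requested: int, applied_compile: bool) -> int:
    """Return a DataLoader worker count that avoids spawn-after-compile.

    `applied_compile` mirrors the `_applied_compile` flag named in the
    commit message; the function itself is a hypothetical helper.
    """
    if applied_compile and requested > 0:
        # Load data in the main process so no worker is spawned after
        # torch.compile has been applied.
        return 0
    return requested


# Intended use (assumed call site, not taken from the diff):
#   loader = DataLoader(dataset,
#                       num_workers=resolve_num_workers(num_workers, _applied_compile))
print(resolve_num_workers(4, applied_compile=True))   # 0
print(resolve_num_workers(4, applied_compile=False))  # 4
```

Keeping the decision in one place makes it easy to unit-test without a GPU or a compiled model.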
* test(metrics): guard compute_syndrome_density signature against sdr_as_percent
sdr_as_percent has been accidentally passed to compute_syndrome_density() twice,
each time causing a TypeError only caught by long-running GPU tests. Add a
short-tier signature inspection test so the next attempt fails fast on every
pre-merge CI run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
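A signature guard of this kind might look like the sketch below. The body of `compute_syndrome_density` here is a stand-in written only for the example (the real function's parameters are not shown in this commit page), and `accepts_keyword` is a hypothetical helper; only `inspect.signature` is a real standard-library call.

```python
import inspect


def compute_syndrome_density(syndromes):
    """Stand-in for the real function: fraction of non-zero syndrome bits."""
    flat = [bit for row in syndromes for bit in row]
    return sum(1 for bit in flat if bit) / len(flat)


def accepts_keyword(func, name):
    """Return True if `func` could be called with `name` as a keyword argument."""
    params = inspect.signature(func).parameters
    if name in params:
        return params[name].kind is not inspect.Parameter.POSITIONAL_ONLY
    # A **kwargs parameter would silently accept any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def test_compute_syndrome_density_has_no_sdr_as_percent():
    # Fails immediately, without a GPU, if the signature ever changes
    # in a way that affects callers passing sdr_as_percent.
    assert not accepts_keyword(compute_syndrome_density, "sdr_as_percent")


test_compute_syndrome_density_has_no_sdr_as_percent()
```

Because the check only inspects the signature, it runs in milliseconds and fits a short pre-merge tier.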
* fix(train): remove PREDECODER_TIMING_RUN gate so env overrides always apply
PREDECODER_TRAIN_EPOCHS, PREDECODER_TRAIN_SAMPLES, etc. were gated behind
PREDECODER_TIMING_RUN=1. smoke_run.sh sets that flag, so full-epoch-training
CI worked. run_orientations_long.sh does not set it, so orientation-inference
CI silently ignored all overrides and trained with default settings (100 epochs,
millions of samples), hitting the 1h30m timeout.
Remove the gate so these env vars are always honoured.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
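The behaviour after removing the gate can be illustrated roughly as follows. The environment variable names come from the commit message; the config keys `epochs` and `samples`, the default sample count, and the function `apply_env_overrides` are assumptions made for the sketch, not the layout of train.py.

```python
import os

# Env var -> config key. The keys on the right are hypothetical.
ENV_OVERRIDES = {
    "PREDECODER_TRAIN_EPOCHS": "epochs",
    "PREDECODER_TRAIN_SAMPLES": "samples",
}


def apply_env_overrides(config, environ=None):
    """Return a copy of `config` with any set env overrides applied.

    No PREDECODER_TIMING_RUN check is made: overrides always apply.
    """
    environ = os.environ if environ is None else environ
    result = dict(config)
    for var, key in ENV_OVERRIDES.items():
        raw = environ.get(var)
        if raw:  # unset or empty leaves the default untouched
            result[key] = int(raw)
    return result


defaults = {"epochs": 100, "samples": 1_000_000}  # illustrative defaults
print(apply_env_overrides(defaults, {"PREDECODER_TRAIN_EPOCHS": "10"}))
# {'epochs': 10, 'samples': 1000000}
```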
* ci(orientation): increase PREDECODER_TRAIN_EPOCHS from 1 to 10
Each epoch with 32768 samples takes ~40s (3s training + ~37s fixed SDR/LER/val
overhead). 4 orientations x 10 epochs x 40s ≈ 27 min, well within the 1h budget.
1 epoch was too thin; 100 (the code default) would time out.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
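The budget estimate above is plain arithmetic and can be checked with a few lines. The function name is hypothetical, and the 100-epoch figure assumes the same 40 s per epoch, which would only hold at 32768 samples.

```python
def estimated_minutes(orientations, epochs, seconds_per_epoch):
    """Total wall-clock estimate in minutes for sequential orientation runs."""
    return orientations * epochs * seconds_per_epoch / 60


seconds_per_epoch = 3 + 37  # ~3 s training + ~37 s fixed SDR/LER/val overhead

print(round(estimated_minutes(4, 10, seconds_per_epoch)))   # 27
print(round(estimated_minutes(4, 100, seconds_per_epoch)))  # 267
```

At roughly 267 minutes, the 100-epoch default exceeds a 1-hour budget several times over, even before accounting for the larger default sample count.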
* fix(ci): disable SDR gate during orientation training to ensure best_model is saved
With only 32k training samples, the model's syndrome density reduction
stays at ~1.00x across all 10 epochs — below the hardcoded 1.5x threshold
in train.py. This causes every epoch to be rejected by the SDR gate even
though validation loss improves, leaving best_model/ empty and causing
inference to fail with "No valid PreDecoderModelMemory files found".
Setting PREDECODER_DISABLE_SDR=1 sets syndrome_density_reduction=None,
so sdr_not_computed=True, bypassing the gate and allowing the best_model
checkpoint to be saved.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
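A simplified model of the gate helps show why best_model/ stayed empty. The 1.5x threshold and the `sdr_not_computed` flag come from the commit message; the function `should_save_best`, its parameters, and the use of `>=` for the comparison are assumptions, since train.py is not shown here.

```python
SDR_THRESHOLD = 1.5  # hardcoded threshold mentioned in the commit message


def should_save_best(val_loss, best_val_loss, syndrome_density_reduction):
    """Decide whether an epoch's checkpoint may replace best_model.

    Hypothetical sketch: an epoch passes when validation loss improves and
    the SDR gate is either satisfied or bypassed because SDR was not computed.
    """
    sdr_not_computed = syndrome_density_reduction is None
    sdr_ok = sdr_not_computed or syndrome_density_reduction >= SDR_THRESHOLD
    return sdr_ok and val_loss < best_val_loss


# SDR stuck at ~1.00x: rejected even though validation loss improved.
print(should_save_best(0.40, 0.50, 1.00))  # False
# PREDECODER_DISABLE_SDR=1 -> SDR is None -> gate bypassed.
print(should_save_best(0.40, 0.50, None))  # True
```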
* ci(orientation): increase PREDECODER_TRAIN_EPOCHS from 10 to 30
10 epochs completed in ~16 min, leaving headroom for 30 epochs within
the 1-hour job budget.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 8a884c6 commit 1be4aee
4 files changed
Lines changed: 61 additions & 34 deletions
File tree
- .github/workflows
- code
  - evaluation
  - tests
  - training