fix(ler): force num_workers=0 when torch.compile is active to prevent segfault#41

Merged
ivanbasov merged 9 commits into NVIDIA:main from ivanbasov:fix/ler-compile-spawn-segfault
Apr 3, 2026

Conversation

@ivanbasov
Member

Summary

  • torch.compile(model) is applied at line 924 of code/evaluation/logical_error_rate.py, before the DataLoader is created at line 1057 with num_workers=16 and multiprocessing_context="spawn".
  • Spawning new processes after torch.compile creates a CUDA context conflict in the child processes → segfault + ~20 leaked semaphores.
  • The SDR evaluation path is not affected because its DataLoader is created prior to calling torch.compile.

Fix: immediately after the existing container/worker-override logic block (line 1048), check whether _applied_compile is True and num_workers > 0. If so, force num_workers=0 so data loading runs in the main process, avoiding the fork/spawn-after-compile hazard entirely. No workflow YAML changes are needed; the previous CI workaround (commit 7f0f6c8) was already reverted (9d3fa08) in favour of this proper code fix.
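The guard described above can be sketched as a small pure function. This is an illustration only, not the code from logical_error_rate.py: the names resolve_num_workers and loader_kwargs are hypothetical, and the sketch assumes the compile flag is available as a plain boolean.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)


def resolve_num_workers(requested: int, applied_compile: bool) -> int:
    """Return a worker count that avoids spawning workers after torch.compile.

    Hypothetical helper: mirrors the rule "if _applied_compile is True and
    num_workers > 0, force num_workers=0".
    """
    if applied_compile and requested > 0:
        logger.warning(
            "torch.compile is active; forcing num_workers=0 (was %d)", requested
        )
        return 0
    return requested


def loader_kwargs(requested: int, applied_compile: bool) -> Dict[str, Any]:
    """Build DataLoader keyword arguments consistent with the resolved count.

    Assumption: the caller passes these to torch.utils.data.DataLoader.
    A spawn context is only requested when workers are actually used.
    """
    workers = resolve_num_workers(requested, applied_compile)
    return {
        "num_workers": workers,
        "multiprocessing_context": "spawn" if workers > 0 else None,
    }
```

For example, loader_kwargs(16, applied_compile=True) yields num_workers=0 with no multiprocessing context, while loader_kwargs(16, applied_compile=False) keeps the 16 spawn workers. The context is dropped in the first case because PyTorch's DataLoader raises a ValueError when a multiprocessing_context is supplied together with num_workers=0; whether the merged patch handles this the same way is not shown in this conversation.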

Failed CI run that motivated this fix: https://github.com/NVIDIA/quantum-predecoder/actions/runs/23924233265/job/69777476304

Test plan

  • Run LER evaluation with torch.compile enabled and num_workers > 0 configured — confirm no segfault and no semaphore leaks.
  • Confirm num_workers is forced to 0 in logs when _applied_compile=True.
  • Run SDR evaluation to confirm it is unaffected.
  • Run existing CI suite to confirm no regressions.

🤖 Generated with Claude Code

ivanbasov and others added 2 commits March 30, 2026 11:54
…fault

torch.compile=on combined with DataLoader spawn workers during LER
validation causes a segfault (20 leaked semaphores, core dumped).
Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ivanbasov ivanbasov requested a review from bmhowe23 April 3, 2026 18:45
@bmhowe23
Collaborator

bmhowe23 commented Apr 3, 2026

What was the first PR on main that seems to have introduced this problem?

@ivanbasov
Member Author

What was the first PR on main that seems to have introduced this problem?

It seems the problem is with my fork's main, not with upstream/main. I somehow missed a fix that I had made in a branch and that was already merged into upstream main. It only prevented me from running long-running tests against my branch; nothing was affected globally.

ivanbasov and others added 6 commits April 3, 2026 13:42
… segfault

Spawning DataLoader workers (multiprocessing_context="spawn") after
torch.compile has been applied causes a CUDA context conflict in the
spawned subprocesses, resulting in a segfault and ~20 leaked semaphores.

In the LER evaluation path the model is compiled (line 924) before the
DataLoader is created (line 1057), so the conflict is triggered.  The
SDR path is unaffected because its DataLoader is created prior to
torch.compile.

Fix: when _applied_compile is True and num_workers > 0, reset
num_workers to 0 so that data loading happens in the main process,
avoiding the fork/spawn-after-compile hazard entirely.

Fixes: https://github.com/NVIDIA/quantum-predecoder/actions/runs/23924233265/job/69777476304

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s_percent

sdr_as_percent has been accidentally passed to compute_syndrome_density() twice,
each time causing a TypeError only caught by long-running GPU tests.  Add a
short-tier signature inspection test so the next attempt fails fast on every
pre-merge CI run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
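A short-tier signature check of the kind this commit message describes might look roughly like the sketch below. The compute_syndrome_density defined here is a hypothetical stand-in with made-up parameters (the real function lives in the repository), and accepts_keyword is an illustrative helper, not the test that was merged.

```python
import inspect
from typing import Callable, Sequence


def compute_syndrome_density(syndromes: Sequence[Sequence[int]], num_detectors: int) -> float:
    """Hypothetical stand-in: fraction of detectors that fired across all shots."""
    fired = sum(sum(row) for row in syndromes)
    return fired / (len(syndromes) * num_detectors)


def accepts_keyword(func: Callable, name: str) -> bool:
    """Return True if func can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        # Positional-only parameters cannot be passed by keyword.
        return params[name].kind is not inspect.Parameter.POSITIONAL_ONLY
    # A **kwargs catch-all would silently accept any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def test_compute_syndrome_density_has_no_sdr_as_percent():
    # Runs without a GPU, so it can sit in the fast pre-merge tier.
    assert not accepts_keyword(compute_syndrome_density, "sdr_as_percent")
```

Because the check only inspects the signature, it needs no GPU and no data, which is what makes it suitable for the short tier.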
… apply

PREDECODER_TRAIN_EPOCHS, PREDECODER_TRAIN_SAMPLES, etc. were gated behind
PREDECODER_TIMING_RUN=1. smoke_run.sh sets that flag, so full-epoch-training
CI worked. run_orientations_long.sh does not set it, so orientation-inference
CI silently ignored all overrides and trained with default settings (100 epochs,
millions of samples), hitting the 1h30m timeout.

Remove the gate so these env vars are always honoured.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
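The effect of removing the gate can be illustrated with a minimal sketch of an always-honoured override. The helper env_int is hypothetical and not taken from train.py; only the variable name PREDECODER_TRAIN_EPOCHS and the default of 100 epochs come from the commit message.

```python
import os
from typing import Mapping, Optional


def env_int(name: str, default: int, env: Optional[Mapping[str, str]] = None) -> int:
    """Read an integer override from the environment, with no gating flag.

    `env` exists only to make the helper easy to test; it defaults to os.environ.
    """
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


# Before the fix, a lookup like this reportedly ran only when
# PREDECODER_TIMING_RUN=1 was set; here it always applies.
epochs = env_int("PREDECODER_TRAIN_EPOCHS", default=100)
```

With this shape, a script such as run_orientations_long.sh can export PREDECODER_TRAIN_EPOCHS without also setting PREDECODER_TIMING_RUN.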
Each epoch with 32768 samples takes ~40s (3s training + ~37s fixed SDR/LER/val
overhead). 4 orientations x 10 epochs x 40s ≈ 27 min, well within the 1h budget.
1 epoch was too thin; 100 (the code default) would time out.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
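The budget estimate in this commit message is simple arithmetic, reproduced here as a small sketch. The function name estimated_minutes is illustrative; the inputs (4 orientations, ~40 s per epoch, a 1-hour budget) are the figures quoted above.

```python
def estimated_minutes(orientations: int, epochs: int, seconds_per_epoch: float) -> float:
    """Total wall-clock estimate in minutes for training every orientation."""
    return orientations * epochs * seconds_per_epoch / 60.0


BUDGET_MINUTES = 60.0

ten_epochs = estimated_minutes(4, 10, 40.0)       # 1600 s, about 26.7 min
hundred_epochs = estimated_minutes(4, 100, 40.0)  # 16000 s, about 266.7 min

fits_budget = ten_epochs < BUDGET_MINUTES          # True
default_fits = hundred_epochs < BUDGET_MINUTES     # False: the default would time out
```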
…model is saved

With only 32k training samples, the model's syndrome density reduction
stays at ~1.00x across all 10 epochs — below the hardcoded 1.5x threshold
in train.py. This causes every epoch to be rejected by the SDR gate even
though validation loss improves, leaving best_model/ empty and causing
inference to fail with "No valid PreDecoderModelMemory files found".

Setting PREDECODER_DISABLE_SDR=1 sets syndrome_density_reduction=None,
so sdr_not_computed=True, bypassing the gate and allowing the best_model
checkpoint to be saved.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
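The gate behaviour described in this commit message can be sketched as follows. This is a reconstruction from the message only, not the code in train.py: the function name should_save_best, its parameters, and the use of >= for the threshold comparison are assumptions.

```python
from typing import Optional

SDR_THRESHOLD = 1.5  # the hardcoded 1.5x threshold named in the commit message


def should_save_best(
    val_loss: float,
    best_val_loss: float,
    syndrome_density_reduction: Optional[float],
) -> bool:
    """Decide whether an epoch may replace the best_model checkpoint.

    Assumed logic: validation loss must improve, and the SDR gate must either
    pass or be skipped because SDR was not computed.
    """
    improved = val_loss < best_val_loss
    sdr_not_computed = syndrome_density_reduction is None
    sdr_ok = sdr_not_computed or syndrome_density_reduction >= SDR_THRESHOLD
    return improved and sdr_ok
```

Under these assumptions, should_save_best(0.4, 0.5, 1.0) is False because an SDR of ~1.00x fails the gate even though the loss improved, while should_save_best(0.4, 0.5, None) is True, which corresponds to running with PREDECODER_DISABLE_SDR=1.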
10 epochs completed in ~16 min, leaving headroom for 30 epochs within
the 1-hour job budget.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ivanbasov ivanbasov force-pushed the fix/ler-compile-spawn-segfault branch from 2912b81 to 0cdd8aa Compare April 3, 2026 20:43
@ivanbasov ivanbasov merged commit 1be4aee into NVIDIA:main Apr 3, 2026
22 checks passed
@ivanbasov ivanbasov deleted the fix/ler-compile-spawn-segfault branch April 3, 2026 21:44
ivanbasov added a commit that referenced this pull request Apr 10, 2026
… segfault (#41)

* fix(ci): disable torch.compile in orientation training to prevent segfault

torch.compile=on combined with DataLoader spawn workers during LER
validation causes a segfault (20 leaked semaphores, core dumped).
Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "fix(ci): disable torch.compile in orientation training to prevent segfault"

This reverts commit 7f0f6c8.

* fix(ler): force num_workers=0 when torch.compile is active to prevent segfault

Spawning DataLoader workers (multiprocessing_context="spawn") after
torch.compile has been applied causes a CUDA context conflict in the
spawned subprocesses, resulting in a segfault and ~20 leaked semaphores.

In the LER evaluation path the model is compiled (line 924) before the
DataLoader is created (line 1057), so the conflict is triggered.  The
SDR path is unaffected because its DataLoader is created prior to
torch.compile.

Fix: when _applied_compile is True and num_workers > 0, reset
num_workers to 0 so that data loading happens in the main process,
avoiding the fork/spawn-after-compile hazard entirely.

Fixes: https://github.com/NVIDIA/quantum-predecoder/actions/runs/23924233265/job/69777476304

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(metrics): guard compute_syndrome_density signature against sdr_as_percent

sdr_as_percent has been accidentally passed to compute_syndrome_density() twice,
each time causing a TypeError only caught by long-running GPU tests.  Add a
short-tier signature inspection test so the next attempt fails fast on every
pre-merge CI run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(train): remove PREDECODER_TIMING_RUN gate so env overrides always apply

PREDECODER_TRAIN_EPOCHS, PREDECODER_TRAIN_SAMPLES, etc. were gated behind
PREDECODER_TIMING_RUN=1. smoke_run.sh sets that flag, so full-epoch-training
CI worked. run_orientations_long.sh does not set it, so orientation-inference
CI silently ignored all overrides and trained with default settings (100 epochs,
millions of samples), hitting the 1h30m timeout.

Remove the gate so these env vars are always honoured.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci(orientation): increase PREDECODER_TRAIN_EPOCHS from 1 to 10

Each epoch with 32768 samples takes ~40s (3s training + ~37s fixed SDR/LER/val
overhead). 4 orientations x 10 epochs x 40s ≈ 27 min, well within the 1h budget.
1 epoch was too thin; 100 (the code default) would time out.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(ci): disable SDR gate during orientation training to ensure best_model is saved

With only 32k training samples, the model's syndrome density reduction
stays at ~1.00x across all 10 epochs — below the hardcoded 1.5x threshold
in train.py. This causes every epoch to be rejected by the SDR gate even
though validation loss improves, leaving best_model/ empty and causing
inference to fail with "No valid PreDecoderModelMemory files found".

Setting PREDECODER_DISABLE_SDR=1 sets syndrome_density_reduction=None,
so sdr_not_computed=True, bypassing the gate and allowing the best_model
checkpoint to be saved.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci(orientation): increase PREDECODER_TRAIN_EPOCHS from 10 to 30

10 epochs completed in ~16 min, leaving headroom for 30 epochs within
the 1-hour job budget.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2 participants