Commit 1be4aee

ivanbasov and claude authored
fix(ler): force num_workers=0 when torch.compile is active to prevent segfault (#41)
* fix(ci): disable torch.compile in orientation training to prevent segfault

  torch.compile=on combined with DataLoader spawn workers during LER
  validation causes a segfault (20 leaked semaphores, core dumped).
  Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "fix(ci): disable torch.compile in orientation training to prevent segfault"

  This reverts commit 7f0f6c8.

* fix(ler): force num_workers=0 when torch.compile is active to prevent segfault

  Spawning DataLoader workers (multiprocessing_context="spawn") after
  torch.compile has been applied causes a CUDA context conflict in the
  spawned subprocesses, resulting in a segfault and ~20 leaked semaphores.

  In the LER evaluation path the model is compiled (line 924) before the
  DataLoader is created (line 1057), so the conflict is triggered. The SDR
  path is unaffected because its DataLoader is created prior to torch.compile.

  Fix: when _applied_compile is True and num_workers > 0, reset num_workers
  to 0 so that data loading happens in the main process, avoiding the
  fork/spawn-after-compile hazard entirely.

  Fixes: https://github.com/NVIDIA/quantum-predecoder/actions/runs/23924233265/job/69777476304

  Signed-off-by: Ivan Basov <ibasov@nvidia.com>
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(metrics): guard compute_syndrome_density signature against sdr_as_percent

  sdr_as_percent has been accidentally passed to compute_syndrome_density()
  twice, each time causing a TypeError only caught by long-running GPU tests.
  Add a short-tier signature inspection test so the next attempt fails fast
  on every pre-merge CI run.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(train): remove PREDECODER_TIMING_RUN gate so env overrides always apply

  PREDECODER_TRAIN_EPOCHS, PREDECODER_TRAIN_SAMPLES, etc. were gated behind
  PREDECODER_TIMING_RUN=1. smoke_run.sh sets that flag, so
  full-epoch-training CI worked. run_orientations_long.sh does not set it,
  so orientation-inference CI silently ignored all overrides and trained
  with default settings (100 epochs, millions of samples), hitting the
  1h30m timeout.

  Remove the gate so these env vars are always honoured.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci(orientation): increase PREDECODER_TRAIN_EPOCHS from 1 to 10

  Each epoch with 32768 samples takes ~40s (3s training + ~37s fixed
  SDR/LER/val overhead). 4 orientations x 10 epochs x 40s ≈ 27 min, well
  within the 1h budget. 1 epoch was too thin; 100 (the code default) would
  time out.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(ci): disable SDR gate during orientation training to ensure best_model is saved

  With only 32k training samples, the model's syndrome density reduction
  stays at ~1.00x across all 10 epochs, below the hardcoded 1.5x threshold
  in train.py. This causes every epoch to be rejected by the SDR gate even
  though validation loss improves, leaving best_model/ empty and causing
  inference to fail with "No valid PreDecoderModelMemory files found".

  Setting PREDECODER_DISABLE_SDR=1 sets syndrome_density_reduction=None, so
  sdr_not_computed=True, bypassing the gate and allowing the best_model
  checkpoint to be saved.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci(orientation): increase PREDECODER_TRAIN_EPOCHS from 10 to 30

  10 epochs completed in ~16 min, leaving headroom for 30 epochs within the
  1-hour job budget.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 8a884c6 commit 1be4aee

4 files changed

Lines changed: 61 additions & 34 deletions

.github/workflows/long-running-tests.yml

Lines changed: 2 additions & 1 deletion
@@ -189,7 +189,8 @@ jobs:
           PREDECODER_TRAIN_SAMPLES: "32768"
           PREDECODER_VAL_SAMPLES: "4096"
           PREDECODER_TEST_SAMPLES: "4096"
-          PREDECODER_TRAIN_EPOCHS: "1"
+          PREDECODER_TRAIN_EPOCHS: "30"
+          PREDECODER_DISABLE_SDR: "1"
 
       - name: Multi-orientation inference (O1–O4) with LER output check
         shell: bash
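
The commit message describes the SDR gate only in prose, so the following is a minimal sketch of that decision, not the code in train.py. The function name `should_save_best`, its arguments, and the choice of `>=` for the threshold comparison are assumptions made for illustration; only the 1.5x threshold and the "None means not computed, so bypass the gate" behaviour come from the message.

```python
from typing import Optional

# The "hardcoded 1.5x threshold" named in the commit message.
SDR_THRESHOLD = 1.5


def should_save_best(
    val_loss: float,
    best_val_loss: float,
    sdr: Optional[float],
    threshold: float = SDR_THRESHOLD,
) -> bool:
    """Hypothetical gate: save only if loss improved and the SDR check passes.

    sdr is None when SDR computation is disabled (PREDECODER_DISABLE_SDR=1
    in the commit message), in which case the SDR check is skipped.
    """
    improved = val_loss < best_val_loss
    sdr_not_computed = sdr is None
    # Assumption: the real comparison may be ">" rather than ">=".
    sdr_ok = sdr_not_computed or sdr >= threshold
    return improved and sdr_ok
```

Under these assumptions, an epoch with improving validation loss and an SDR of about 1.00x is rejected, while the same epoch with SDR disabled is accepted, which matches the empty best_model/ symptom the workflow change addresses.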

code/evaluation/logical_error_rate.py

Lines changed: 5 additions & 0 deletions
@@ -1055,6 +1055,11 @@ def run_inference_and_decode_pre_decoder_memory(model, device, dist, cfg) -> dic
             pass
     except Exception:
         pass
+    # torch.compile + spawn workers causes a segfault (CUDA context conflict in
+    # spawned subprocesses after the model is compiled). Fall back to in-process
+    # loading when torch.compile has been applied.
+    if _applied_compile and int(test_loader_kwargs.get("num_workers", 0)) > 0:
+        test_loader_kwargs["num_workers"] = 0
     # Handle prefetch_factor when num_workers=0
     if test_loader_kwargs.get('num_workers', 0) == 0:
         test_loader_kwargs.pop('prefetch_factor', None)
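
The guard and the existing prefetch_factor cleanup can be read as one small transformation of the DataLoader kwargs. The sketch below isolates that logic in a pure function so it can be checked without a GPU or torch installed. The helper name `sanitize_loader_kwargs` is hypothetical and does not appear in the repository.

```python
from typing import Any, Dict


def sanitize_loader_kwargs(loader_kwargs: Dict[str, Any], applied_compile: bool) -> Dict[str, Any]:
    """Return a copy of the kwargs adjusted the way the hunk above adjusts them."""
    kwargs = dict(loader_kwargs)  # do not mutate the caller's dict

    # Same condition as the added lines: no worker processes once the
    # model has been compiled.
    if applied_compile and int(kwargs.get("num_workers", 0)) > 0:
        kwargs["num_workers"] = 0

    # Same cleanup as the pre-existing lines: prefetch_factor only makes
    # sense when worker processes are used.
    if kwargs.get("num_workers", 0) == 0:
        kwargs.pop("prefetch_factor", None)

    return kwargs
```

For example, `sanitize_loader_kwargs({"num_workers": 4, "prefetch_factor": 2}, applied_compile=True)` yields `{"num_workers": 0}`, while the same input with `applied_compile=False` is returned unchanged.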

code/tests/test_metrics_extras.py

Lines changed: 21 additions & 1 deletion
@@ -14,6 +14,7 @@
 # limitations under the License.
 """Tests for evaluation.metrics (configure_metrics, _extract_reduction_factor)."""
 
+import inspect
 import sys
 import unittest
 from pathlib import Path
@@ -22,7 +23,7 @@
 if str(_repo_code) not in sys.path:
     sys.path.insert(0, str(_repo_code))
 
-from evaluation.metrics import configure_metrics, _extract_reduction_factor
+from evaluation.metrics import configure_metrics, _extract_reduction_factor, compute_syndrome_density
 
 
 class TestConfigureMetrics(unittest.TestCase):
@@ -63,3 +64,22 @@ def test_extract_from_empty_dict_returns_none(self):
     def test_extract_from_nested_stim(self):
         result = {"other": 1, "stim": {"reduction factor (X/Z)": 2.5}}
         self.assertEqual(_extract_reduction_factor(result), 2.5)
+
+
+class TestComputeSyndromeDensitySignature(unittest.TestCase):
+    """Regression guard: sdr_as_percent must not appear in compute_syndrome_density().
+
+    This kwarg is a display-only flag owned by train.py (controls "%" vs "x" in log
+    output). It has been accidentally passed to compute_syndrome_density() twice,
+    causing a TypeError that is only caught by long-running GPU tests. This test
+    keeps the contract cheap to verify on every short CI run.
+    """
+
+    def test_sdr_as_percent_not_a_parameter(self):
+        sig = inspect.signature(compute_syndrome_density)
+        self.assertNotIn(
+            "sdr_as_percent",
+            sig.parameters,
+            "sdr_as_percent is a display-only flag in train.py and must not be added "
+            "to compute_syndrome_density(); passing it causes TypeError at runtime.",
+        )
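
The test above checks the callee's signature. A complementary check, sketched here under assumptions, validates a call site's arguments with `inspect.Signature.bind`, which raises TypeError for an unexpected keyword without executing the function body. `compute_syndrome_density_stub` and its parameters are hypothetical stand-ins, since the real signature is not shown in this diff.

```python
import inspect


def compute_syndrome_density_stub(syndromes, num_rounds=1):
    """Hypothetical stand-in for the real function; parameters are assumed."""
    return sum(syndromes) / (len(syndromes) * num_rounds)


def forbidden_params(func, names):
    """Return the subset of `names` that appear as parameters of `func`."""
    params = inspect.signature(func).parameters
    return sorted(n for n in names if n in params)


def call_is_valid(func, *args, **kwargs):
    """Check that func(*args, **kwargs) would bind, without calling func."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True
```

With the stub, `call_is_valid(compute_syndrome_density_stub, [0, 1], sdr_as_percent=True)` returns False, which is the same mistake the commit message says previously surfaced only in long-running GPU tests. Note that neither check helps if the target function accepts `**kwargs`.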

code/training/train.py

Lines changed: 33 additions & 32 deletions
@@ -687,38 +687,39 @@ def init_process_group_with_timeout(*args, **kwargs):
     if getattr(cfg.train, "epochs", None) is None:
         cfg.train.epochs = 100
 
-    # Optional timing-mode overrides (env-based) for short measurement runs.
-    if os.environ.get("PREDECODER_TIMING_RUN", "0") == "1":
-        train_samples_env = os.environ.get("PREDECODER_TRAIN_SAMPLES")
-        val_samples_env = os.environ.get("PREDECODER_VAL_SAMPLES")
-        test_samples_env = os.environ.get("PREDECODER_TEST_SAMPLES")
-        epochs_env = os.environ.get("PREDECODER_TRAIN_EPOCHS")
-        try:
-            if train_samples_env:
-                cfg.train.num_samples = int(train_samples_env)
-        except Exception:
-            pass
-        try:
-            if val_samples_env:
-                cfg.val.num_samples = int(val_samples_env)
-        except Exception:
-            pass
-        try:
-            if test_samples_env:
-                cfg.test.num_samples = int(test_samples_env)
-        except Exception:
-            pass
-        try:
-            if epochs_env:
-                cfg.train.epochs = int(epochs_env)
-        except Exception:
-            pass
-        milestones_env = os.environ.get("PREDECODER_LR_MILESTONES")
-        try:
-            if milestones_env:
-                cfg.lr_scheduler.milestones = [float(x) for x in milestones_env.split(",")]
-        except Exception:
-            pass
+    # Env-based overrides for samples, epochs, and LR milestones.
+    # These apply unconditionally so that CI jobs and quick local runs can
+    # override config values without needing PREDECODER_TIMING_RUN=1.
+    train_samples_env = os.environ.get("PREDECODER_TRAIN_SAMPLES")
+    val_samples_env = os.environ.get("PREDECODER_VAL_SAMPLES")
+    test_samples_env = os.environ.get("PREDECODER_TEST_SAMPLES")
+    epochs_env = os.environ.get("PREDECODER_TRAIN_EPOCHS")
+    try:
+        if train_samples_env:
+            cfg.train.num_samples = int(train_samples_env)
+    except Exception:
+        pass
+    try:
+        if val_samples_env:
+            cfg.val.num_samples = int(val_samples_env)
+    except Exception:
+        pass
+    try:
+        if test_samples_env:
+            cfg.test.num_samples = int(test_samples_env)
+    except Exception:
+        pass
+    try:
+        if epochs_env:
+            cfg.train.epochs = int(epochs_env)
+    except Exception:
+        pass
+    milestones_env = os.environ.get("PREDECODER_LR_MILESTONES")
+    try:
+        if milestones_env:
+            cfg.lr_scheduler.milestones = [float(x) for x in milestones_env.split(",")]
+    except Exception:
+        pass
 
     if dist.rank == 0:
         print(f"Effective workflow.task: {cfg.workflow.task}")
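
The five try/except blocks in this hunk repeat one pattern: read an environment variable, parse it, and keep the config value if the variable is unset or malformed. A minimal sketch of that pattern as a helper follows. `env_override` and `parse_milestones` are hypothetical names, not part of train.py, and unlike the `except Exception` in the diff this sketch only swallows ValueError from parsing.

```python
import os
from typing import Callable, List, Mapping, Optional, TypeVar

T = TypeVar("T")


def env_override(
    name: str,
    parse: Callable[[str], T],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[T]:
    """Return the parsed value of `name`, or None if unset, empty, or unparsable."""
    source = os.environ if env is None else env
    raw = source.get(name)
    if not raw:
        return None
    try:
        return parse(raw)
    except ValueError:
        # Same outcome as the diff's `except Exception: pass`:
        # a malformed value leaves the config untouched.
        return None


def parse_milestones(raw: str) -> List[float]:
    """Parse a comma-separated list such as "0.5,0.75" into floats."""
    return [float(x) for x in raw.split(",")]
```

A caller would then apply an override only when a value comes back, for example `epochs = env_override("PREDECODER_TRAIN_EPOCHS", int)` followed by `if epochs is not None: cfg.train.epochs = epochs`. Passing `env` explicitly keeps the helper testable without touching the process environment.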
