fix: guard against double torch.compile by kvmto · Pull Request #39 · NVIDIA/Ising-Decoding

kvmto · 2026-04-02T11:04:02Z

Summary

Fix segfault from double torch.compile when SDR runs before LER on the same model
Guard both compile sites with _is_compiled() check
Clean up SDR's DataLoader workers to stop semaphore leaks

Test plan

Ran 1-epoch train with PREDECODER_DISABLE_SDR=0 PREDECODER_LER_FINAL_ONLY=0 on 2x A6000 — SDR 44.96x, LER 0.003010, no segfault, no leaked semaphores

When SDR runs before LER, the same model object gets torch.compile'd twice, producing a nested OptimizedModule that segfaults during the first forward pass. Skip compilation when the model is already compiled. Also eagerly tear down SDR's DataLoader workers before LER starts to prevent leaked /dev/shm semaphores.

Signed-off-by: kvmto <kmato@nvidia.com>

bmhowe23 · 2026-04-02T13:55:56Z

Does this fix the nightly CI failure that @ivanbasov was seeing?

bmhowe23 · 2026-04-02T14:02:54Z

> Does this fix the nightly CI failure that @ivanbasov was seeing?

Never mind. I see an internal email now.

bmhowe23

LGTM

kvmto requested review from bmhowe23 and ivanbasov and removed request for ivanbasov April 2, 2026 11:11

bmhowe23 approved these changes Apr 2, 2026

ivanbasov approved these changes Apr 2, 2026

kvmto merged commit f6c7864 into NVIDIA:main Apr 2, 2026
13 checks passed

kvmto merged 1 commit into NVIDIA:main from kvmto:fix/double-compile-segfault
