fix: guard against double torch.compile #39

Merged: kvmto merged 1 commit into NVIDIA:main from kvmto:fix/double-compile-segfault on Apr 2, 2026

kvmto (Collaborator) commented Apr 2, 2026

Summary

  • Fix segfault from double torch.compile when SDR runs before LER on the same model
  • Guard both compile sites with _is_compiled() check
  • Clean up SDR's DataLoader workers to stop semaphore leaks

Test plan

  • Ran 1-epoch train with PREDECODER_DISABLE_SDR=0 PREDECODER_LER_FINAL_ONLY=0 on 2x A6000 — SDR 44.96x, LER 0.003010, no segfault, no leaked semaphores

When SDR runs before LER, the same model object gets torch.compile'd
twice, producing a nested OptimizedModule that segfaults during the
first forward pass. Skip compilation when the model is already compiled.

Also eagerly tear down SDR's DataLoader workers before LER starts to
prevent leaked /dev/shm semaphores.
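
One possible shape for the eager teardown is sketched below. The PR does not show how the workers are stopped, so everything here is an assumption: the helper name `shutdown_loader_workers` is hypothetical, and it reaches into the private `_iterator` attribute of `torch.utils.data.DataLoader` and the private `_shutdown_workers()` method of its multiprocessing iterator, both of which may change between PyTorch releases.

```python
import gc
from typing import Any


def shutdown_loader_workers(loader: Any) -> None:
    """Best-effort, idempotent worker shutdown for a DataLoader-like object (hypothetical helper)."""
    # With persistent workers, the DataLoader keeps its iterator alive in `_iterator` (private).
    iterator = getattr(loader, "_iterator", None)
    shutdown = getattr(iterator, "_shutdown_workers", None)
    if callable(shutdown):
        # Private PyTorch method; joins worker processes and closes their queues.
        shutdown()
    if iterator is not None:
        # Drop the reference so the iterator is not reused or shut down twice.
        loader._iterator = None
    # Collect any remaining cycles so finalizers run before the next stage starts.
    gc.collect()
```

Under these assumptions, SDR would call `shutdown_loader_workers(sdr_loader)` once it finishes and before LER builds its own loader, where `sdr_loader` stands in for whatever loader object SDR holds.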

Signed-off-by: kvmto <kmato@nvidia.com>
@kvmto kvmto requested review from bmhowe23 and ivanbasov and removed request for ivanbasov April 2, 2026 11:11
bmhowe23 commented Apr 2, 2026

Does this fix the nightly CI failure that @ivanbasov was seeing?

Never mind. I see an internal email now.

@bmhowe23 bmhowe23 left a comment

LGTM

@kvmto kvmto merged commit f6c7864 into NVIDIA:main Apr 2, 2026
13 checks passed
ivanbasov pushed a commit that referenced this pull request Apr 10, 2026
When SDR runs before LER, the same model object gets torch.compile'd
twice, producing a nested OptimizedModule that segfaults during the
first forward pass. Skip compilation when the model is already compiled.

Also eagerly tear down SDR's DataLoader workers before LER starts to
prevent leaked /dev/shm semaphores.

Signed-off-by: kvmto <kmato@nvidia.com>
3 participants