Skip to content

Fix three silent hang causes in unet3d training benchmark#394

Merged
russfellows merged 5 commits into
mainfrom
FileSystemGuy-une3d-hang
Jun 1, 2026
Merged

Fix three silent hang causes in unet3d training benchmark#394
russfellows merged 5 commits into
mainfrom
FileSystemGuy-une3d-hang

Conversation

@FileSystemGuy

@FileSystemGuy FileSystemGuy commented May 26, 2026

Copy link
Copy Markdown
Contributor

Summary

  • utils.py: Replace readline() with buffer.read1() in CommandExecutor to prevent an indefinite block when DLIO or MPI writes \r-terminated progress output or any partial line without a trailing newline.
  • cluster_collector.py: Wrap collect_local_info() in try/except inside MPI_COLLECTOR_SCRIPT so every rank unconditionally reaches comm.gather(), preventing a deadlock when any rank exits early due to a collection error.
  • utils.py (2e74360): Replace the post-loop blocking TextIOWrapper.read() drain calls with a select()+read1() loop (0.5s timeout). PyTorch DataLoader workers are forked inside DLIO MPI ranks after MPI_Init; when mpirun exits, those orphaned grandchildren still hold the pipe write-end open, causing the original read() to block indefinitely. The select() approach returns immediately on EOF in the normal (no-orphan) case — zero added latency — and times out after 0.5s when orphans are present, un-sticking the orchestrator.
  • tests/unit/test_cluster_collector.py: Add TestMPICollectorScriptMain (6 tests) covering the gather-always-called invariant, using exec() + mocked mpi4py — no real MPI processes required.

Test plan

  • uv run pytest tests/unit/test_cluster_collector.py::TestMPICollectorScriptMain -v — all 6 new tests pass
  • uv run pytest tests/unit -q — no regressions (956 passed, 8 pre-existing failures unrelated to these changes)
  • uv run pytest tests/unit -q after 2e74360 — 176 tests pass, no regressions

- utils.py: replace readline() with buffer.read1() in CommandExecutor to
  prevent blocking on partial/\r-terminated output from DLIO/MPI processes

- cluster_collector.py: wrap collect_local_info() in try/except inside
  MPI_COLLECTOR_SCRIPT so every rank always reaches comm.gather(), preventing
  a deadlock when any rank exits early due to a collection error

- tests/unit/test_cluster_collector.py: add TestMPICollectorScriptMain with
  6 tests covering the gather-always-called invariant
@FileSystemGuy FileSystemGuy requested a review from a team May 26, 2026 18:58
@github-actions

github-actions Bot commented May 26, 2026

Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

idevasena
idevasena previously approved these changes May 26, 2026

@idevasena idevasena left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested it on my end and fix works @FileSystemGuy

FileSystemGuy and others added 2 commits May 26, 2026 12:53
Handoff for the third unet3d hang fix. The pipe-drain blocking .read()
replacement (select+read1 with 0.5s timeout) is applied but not yet
committed. See .planning/.continue-here.md for full context.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…locking on orphaned DataLoader workers

PyTorch DataLoader workers are forked inside DLIO MPI ranks after MPI_Init.
When mpirun exits (its direct children done), those orphaned grandchildren
still hold the pipe write-end open, causing TextIOWrapper.read() in the
post-loop drain to block indefinitely. Replacing the blocking .read() calls
with a select()+read1() loop (0.5s timeout) un-sticks the orchestrator;
the normal (no-orphan) case completes immediately on EOF with zero added latency.
@FileSystemGuy FileSystemGuy changed the title Fix two silent hang causes in unet3d training benchmark Fix three silent hang causes in unet3d training benchmark May 26, 2026
@FileSystemGuy

Copy link
Copy Markdown
Contributor Author

Please review: @russfellows

@russfellows russfellows left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #394 — Fix three silent hang causes in unet3d training benchmark
Author: FileSystemGuy | +184 / -20 | 3 files | Status: blocked, review requested from @russfellows (that's you)

Summary: Fixes three distinct I/O blocking/hang causes in CommandExecutor and the MPI collector script, plus adds 6 unit tests.

Assessment: ✅ LGTM — this is high quality and directly addresses real hangs.

The three fixes are independent and each addresses a real problem:

readline() → read1(65536): Correct. readline() blocks until \n; DLIO/MPI progress output uses \r-terminated lines that never trigger it, causing a hang. read1() on the raw buffer returns whatever bytes are available immediately. The decode('utf-8', errors='replace') is the right way to handle it.

comm.gather() always reached in MPI_COLLECTOR_SCRIPT: Correct. MPI collective operations require all ranks to participate; an unguarded exception in any rank causes all others to deadlock waiting at gather(). The try/except with a sentinel dict ({'_collection_error': str(e)}) is exactly the right pattern. Tests validate the sentinel structure.

Post-loop drain using select()+read1() instead of .read(): Correct. The root cause explanation is accurate — PyTorch DataLoader workers are forked inside DLIO MPI ranks; those grandchildren inherit the pipe's write-end and hold it open after mpirun exits, causing the original .read() to block forever. The select(timeout=0.5) approach is correct: zero added latency on the normal path (EOF is immediately readable when write-end is closed), 0.5s grace period when orphans are present.

Note: This PR overlaps thematically with our PR #400 (they're both fixing hang/crash issues in the training pipeline). These can merge independently. However, PR #394's fix #3 (orphaned DataLoader worker grandchildren holding pipe open) is arguably the primary cause of #391, complementing our fix of the fork-safety issue inside dlio_benchmark itself. Both fixes are needed.

@russfellows russfellows merged commit 25fb40b into main Jun 1, 2026
2 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 1, 2026
@russfellows russfellows deleted the FileSystemGuy-une3d-hang branch June 1, 2026 00:12
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants