Fix three silent hang causes in unet3d training benchmark#394
Conversation
- utils.py: replace readline() with buffer.read1() in CommandExecutor to prevent blocking on partial/\r-terminated output from DLIO/MPI processes - cluster_collector.py: wrap collect_local_info() in try/except inside MPI_COLLECTOR_SCRIPT so every rank always reaches comm.gather(), preventing a deadlock when any rank exits early due to a collection error - tests/unit/test_cluster_collector.py: add TestMPICollectorScriptMain with 6 tests covering the gather-always-called invariant
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
idevasena
left a comment
There was a problem hiding this comment.
Tested it on my end and fix works @FileSystemGuy
Handoff for the third unet3d hang fix. The pipe-drain blocking .read() replacement (select+read1 with 0.5s timeout) is applied but not yet committed. See .planning/.continue-here.md for full context. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…locking on orphaned DataLoader workers PyTorch DataLoader workers are forked inside DLIO MPI ranks after MPI_Init. When mpirun exits (its direct children done), those orphaned grandchildren still hold the pipe write-end open, causing TextIOWrapper.read() in the post-loop drain to block indefinitely. Replacing the blocking .read() calls with a select()+read1() loop (0.5s timeout) un-sticks the orchestrator; the normal (no-orphan) case completes immediately on EOF with zero added latency.
|
Please review: @russfellows |
russfellows
left a comment
There was a problem hiding this comment.
PR #394 — Fix three silent hang causes in unet3d training benchmark
Author: FileSystemGuy | +184 / -20 | 3 files | Status: blocked, review requested from @russfellows (that's you)
Summary: Fixes three distinct I/O blocking/hang causes in CommandExecutor and the MPI collector script, plus adds 6 unit tests.
Assessment: ✅ LGTM — this is high quality and directly addresses real hangs.
The three fixes are independent and each addresses a real problem:
readline() → read1(65536): Correct. readline() blocks until \n; DLIO/MPI progress output uses \r-terminated lines that never trigger it, causing a hang. read1() on the raw buffer returns whatever bytes are available immediately. The decode('utf-8', errors='replace') is the right way to handle it.
comm.gather() always reached in MPI_COLLECTOR_SCRIPT: Correct. MPI collective operations require all ranks to participate; an unguarded exception in any rank causes all others to deadlock waiting at gather(). The try/except with a sentinel dict ({'_collection_error': str(e)}) is exactly the right pattern. Tests validate the sentinel structure.
Post-loop drain using select()+read1() instead of .read(): Correct. The root cause explanation is accurate — PyTorch DataLoader workers are forked inside DLIO MPI ranks; those grandchildren inherit the pipe's write-end and hold it open after mpirun exits, causing the original .read() to block forever. The select(timeout=0.5) approach is correct: zero added latency on the normal path (EOF is immediately readable when write-end is closed), 0.5s grace period when orphans are present.
Note: This PR overlaps thematically with our PR #400 (they're both fixing hang/crash issues in the training pipeline). These can merge independently. However, PR #394's fix #3 (orphaned DataLoader worker grandchildren holding pipe open) is arguably the primary cause of #391, complementing our fix of the fork-safety issue inside dlio_benchmark itself. Both fixes are needed.
Summary
utils.py: Replacereadline()withbuffer.read1()inCommandExecutorto prevent an indefinite block when DLIO or MPI writes\r-terminated progress output or any partial line without a trailing newline.cluster_collector.py: Wrapcollect_local_info()in try/except insideMPI_COLLECTOR_SCRIPTso every rank unconditionally reachescomm.gather(), preventing a deadlock when any rank exits early due to a collection error.utils.py(2e74360): Replace the post-loop blockingTextIOWrapper.read()drain calls with aselect()+read1()loop (0.5s timeout). PyTorch DataLoader workers are forked inside DLIO MPI ranks afterMPI_Init; when mpirun exits, those orphaned grandchildren still hold the pipe write-end open, causing the originalread()to block indefinitely. Theselect()approach returns immediately on EOF in the normal (no-orphan) case — zero added latency — and times out after 0.5s when orphans are present, un-sticking the orchestrator.tests/unit/test_cluster_collector.py: AddTestMPICollectorScriptMain(6 tests) covering the gather-always-called invariant, using exec() + mocked mpi4py — no real MPI processes required.Test plan
uv run pytest tests/unit/test_cluster_collector.py::TestMPICollectorScriptMain -v— all 6 new tests passuv run pytest tests/unit -q— no regressions (956 passed, 8 pre-existing failures unrelated to these changes)uv run pytest tests/unit -qafter 2e74360 — 176 tests pass, no regressions