Fix three silent hang causes in unet3d training benchmark by FileSystemGuy · Pull Request #394 · mlcommons/storage

FileSystemGuy · 2026-05-26T18:58:03Z

Summary

utils.py: Replace readline() with buffer.read1() in CommandExecutor to prevent an indefinite block when DLIO or MPI writes \r-terminated progress output or any partial line without a trailing newline.
cluster_collector.py: Wrap collect_local_info() in try/except inside MPI_COLLECTOR_SCRIPT so every rank unconditionally reaches comm.gather(), preventing a deadlock when any rank exits early due to a collection error.
utils.py (2e74360): Replace the post-loop blocking TextIOWrapper.read() drain calls with a select()+read1() loop (0.5s timeout). PyTorch DataLoader workers are forked inside DLIO MPI ranks after MPI_Init; when mpirun exits, those orphaned grandchildren still hold the pipe write-end open, causing the original read() to block indefinitely. The select() approach returns immediately on EOF in the normal (no-orphan) case — zero added latency — and times out after 0.5s when orphans are present, un-sticking the orchestrator.
tests/unit/test_cluster_collector.py: Add TestMPICollectorScriptMain (6 tests) covering the gather-always-called invariant, using exec() + mocked mpi4py — no real MPI processes required.

Test plan

uv run pytest tests/unit/test_cluster_collector.py::TestMPICollectorScriptMain -v — all 6 new tests pass
uv run pytest tests/unit -q — no regressions (956 passed, 8 pre-existing failures unrelated to these changes)
uv run pytest tests/unit -q after 2e74360 — 176 tests pass, no regressions

- utils.py: replace readline() with buffer.read1() in CommandExecutor to prevent blocking on partial/\r-terminated output from DLIO/MPI processes - cluster_collector.py: wrap collect_local_info() in try/except inside MPI_COLLECTOR_SCRIPT so every rank always reaches comm.gather(), preventing a deadlock when any rank exits early due to a collection error - tests/unit/test_cluster_collector.py: add TestMPICollectorScriptMain with 6 tests covering the gather-always-called invariant

github-actions · 2026-05-26T18:58:19Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

idevasena

Tested it on my end and fix works @FileSystemGuy

Handoff for the third unet3d hang fix. The pipe-drain blocking .read() replacement (select+read1 with 0.5s timeout) is applied but not yet committed. See .planning/.continue-here.md for full context. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…locking on orphaned DataLoader workers PyTorch DataLoader workers are forked inside DLIO MPI ranks after MPI_Init. When mpirun exits (its direct children done), those orphaned grandchildren still hold the pipe write-end open, causing TextIOWrapper.read() in the post-loop drain to block indefinitely. Replacing the blocking .read() calls with a select()+read1() loop (0.5s timeout) un-sticks the orchestrator; the normal (no-orphan) case completes immediately on EOF with zero added latency.

FileSystemGuy · 2026-05-28T15:30:56Z

Please review: @russfellows

russfellows

PR #394 — Fix three silent hang causes in unet3d training benchmark
Author: FileSystemGuy | +184 / -20 | 3 files | Status: blocked, review requested from @russfellows (that's you)

Summary: Fixes three distinct I/O blocking/hang causes in CommandExecutor and the MPI collector script, plus adds 6 unit tests.

Assessment: ✅ LGTM — this is high quality and directly addresses real hangs.

The three fixes are independent and each addresses a real problem:

readline() → read1(65536): Correct. readline() blocks until \n; DLIO/MPI progress output uses \r-terminated lines that never trigger it, causing a hang. read1() on the raw buffer returns whatever bytes are available immediately. The decode('utf-8', errors='replace') is the right way to handle it.

comm.gather() always reached in MPI_COLLECTOR_SCRIPT: Correct. MPI collective operations require all ranks to participate; an unguarded exception in any rank causes all others to deadlock waiting at gather(). The try/except with a sentinel dict ({'_collection_error': str(e)}) is exactly the right pattern. Tests validate the sentinel structure.

Post-loop drain using select()+read1() instead of .read(): Correct. The root cause explanation is accurate — PyTorch DataLoader workers are forked inside DLIO MPI ranks; those grandchildren inherit the pipe's write-end and hold it open after mpirun exits, causing the original .read() to block forever. The select(timeout=0.5) approach is correct: zero added latency on the normal path (EOF is immediately readable when write-end is closed), 0.5s grace period when orphans are present.

Note: This PR overlaps thematically with our PR #400 (they're both fixing hang/crash issues in the training pipeline). These can merge independently. However, PR #394's fix #3 (orphaned DataLoader worker grandchildren holding pipe open) is arguably the primary cause of #391, complementing our fix of the fork-safety issue inside dlio_benchmark itself. Both fixes are needed.

FileSystemGuy requested a review from a team May 26, 2026 18:58

idevasena previously approved these changes May 26, 2026

View reviewed changes

FileSystemGuy and others added 2 commits May 26, 2026 12:53

FileSystemGuy dismissed idevasena’s stale review via 2e74360 May 26, 2026 19:57

FileSystemGuy added 2 commits May 26, 2026 12:58

Delete .planning/.continue-here.md

3d9a9a8

Delete .planning/HANDOFF.json

b8a4dee

FileSystemGuy changed the title ~~Fix two silent hang causes in unet3d training benchmark~~ Fix three silent hang causes in unet3d training benchmark May 26, 2026

idevasena approved these changes May 27, 2026

View reviewed changes

russfellows approved these changes Jun 1, 2026

View reviewed changes

russfellows merged commit 25fb40b into main Jun 1, 2026
2 checks passed

github-actions Bot locked and limited conversation to collaborators Jun 1, 2026

russfellows deleted the FileSystemGuy-une3d-hang branch June 1, 2026 00:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix three silent hang causes in unet3d training benchmark#394

Fix three silent hang causes in unet3d training benchmark#394
russfellows merged 5 commits into
mainfrom
FileSystemGuy-une3d-hang

FileSystemGuy commented May 26, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 26, 2026 •

edited

Loading

Uh oh!

idevasena left a comment

Uh oh!

FileSystemGuy commented May 28, 2026

Uh oh!

russfellows left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

FileSystemGuy commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

github-actions Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

idevasena left a comment

Choose a reason for hiding this comment

Uh oh!

FileSystemGuy commented May 28, 2026

Uh oh!

russfellows left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

FileSystemGuy commented May 26, 2026 •

edited

Loading

github-actions Bot commented May 26, 2026 •

edited

Loading