fix: pin dlio-benchmark to PR #22 head to resolve issue #415 by FileSystemGuy · Pull Request #436 · mlcommons/storage

FileSystemGuy · 2026-06-12T18:57:58Z

Summary

Fixes #415. DLIO PyTorch DataLoader workers were aborting at the start of epoch 1 with "MPI_Init_thread on a NULL communicator" whenever reader.read_threads > 0 (e.g. retinanet_b200, unet3d_*).

Root cause lives in dlio_benchmark, not here:

dlio_benchmark/utils/statscounter.py:29 runs `from mpi4py import MPI` at module top level.
dlio_benchmark/utils/utility.py:176 has a comment documenting that auto-init is unsafe for spawn-context workers, but the mpi4py.rc.initialize / rc.finalize flags that actually disable auto-init were never set.
With the default mpi4py.rc.initialize == True, every spawn-context DataLoader child re-imports statscounter, triggers MPI_Init_thread() outside mpirun's PMIX namespace, and aborts with No permission (-17).
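
The mechanism above can be sketched in a few lines. This is an illustration only, not the actual diff of mlcommons/DLIO_local_changes#22, which is not reproduced here. The helper name `disable_mpi_auto_init` is hypothetical; `mpi4py.rc.initialize` and `mpi4py.rc.finalize` are the mpi4py runtime options named in the root-cause notes. The sketch assumes the flags must be set before the first `from mpi4py import MPI` in a given process, and it degrades to a no-op when mpi4py is not installed.

```python
import importlib.util


def disable_mpi_auto_init() -> bool:
    """Turn off mpi4py's import-time MPI initialization and finalization.

    Hypothetical helper for illustration. Returns True if mpi4py is
    installed and the flags were set, False if mpi4py is unavailable.
    """
    if importlib.util.find_spec("mpi4py") is None:
        return False

    # Importing the top-level package does not initialize MPI; only
    # importing the `mpi4py.MPI` submodule does.
    import mpi4py

    # These flags are consulted when `mpi4py.MPI` is first imported, so
    # they have to be set before any `from mpi4py import MPI` runs.
    mpi4py.rc.initialize = False  # skip automatic MPI_Init_thread()
    mpi4py.rc.finalize = False    # skip automatic MPI_Finalize() at exit
    return True
```

With the flags cleared, a spawn-context DataLoader child that re-imports a module containing `from mpi4py import MPI` should no longer attempt MPI_Init_thread() on import. A process that does need MPI would then have to initialize it explicitly, for example with `MPI.Init_thread()`.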

The upstream fix is mlcommons/DLIO_local_changes#22 (+2 lines). This PR re-pins our dlio-benchmark source from branch = "main" to rev = 60fd3b8e7ae9cc8be644b47df0661366ac2c8bd6 (PR #22 head) so storage picks up the fix immediately. The lockfile is regenerated to match.

Changes

pyproject.toml: switch dlio-benchmark source from branch = "main" → rev = <PR #22 head SHA>.
uv.lock: regenerated via uv lock --upgrade-package dlio-benchmark.

Follow-up

Once mlcommons/DLIO_local_changes#22 merges to main, revert the pyproject.toml change back to branch = "main" and re-lock.

Test plan

uv sync succeeds on a fresh checkout
mlpstorage training run --model retinanet -g b200 -na 8 ... no longer aborts with MPI_Init_thread on a NULL communicator
mlpstorage training run --model unet3d (which uses read_threads: 4) completes epoch 1
No regression in existing unit tests: pytest tests/unit -v

PyTorch DataLoader workers were aborting with "MPI_Init_thread on a NULL communicator" whenever a DLIO workload used reader.read_threads > 0 (e.g. retinanet_b200, unet3d_*).

Root cause is in dlio_benchmark: the top-level `from mpi4py import MPI` in statscounter.py triggers MPI_Init_thread() in every spawn-context DataLoader child, which has no PMIX/ORTE environment because it was not launched by mpirun. The upstream comment in dlio_benchmark/utils/utility.py:176 already documented the hazard, but the mpi4py.rc.initialize / rc.finalize flags that actually disable the auto-init were never set. mlcommons/DLIO_local_changes#22 adds those two lines.

Pin to that PR's head commit (60fd3b8e) so storage picks up the fix immediately. Revert to branch = "main" once #22 merges upstream.

github-actions · 2026-06-12T18:58:08Z

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

FileSystemGuy requested a review from a team June 12, 2026 18:57

idevasena approved these changes Jun 12, 2026

FileSystemGuy merged commit aa68b5b into main Jun 12, 2026
2 checks passed

github-actions Bot locked and limited conversation to collaborators Jun 12, 2026

FileSystemGuy deleted the FileSystemGuy-DLIO-fix branch June 13, 2026 00:37
