fix: pin dlio-benchmark to PR #22 head to resolve issue #415 (#436)
Merged
PyTorch DataLoader workers were aborting with "MPI_Init_thread on a NULL communicator" whenever a DLIO workload used reader.read_threads > 0 (e.g. retinanet_b200, unet3d_*). Root cause is in dlio_benchmark: the top-level `from mpi4py import MPI` in statscounter.py triggers MPI_Init_thread() in every spawn-context DataLoader child, which has no PMIX/ORTE environment because it was not launched by mpirun. The upstream comment in dlio_benchmark/utils/utility.py:176 already documented the hazard, but the mpi4py.rc.initialize / rc.finalize flags that actually disable the auto-init were never set. mlcommons/DLIO_local_changes#22 adds those two lines. Pin to that PR's head commit (60fd3b8e) so storage picks up the fix immediately. Revert to branch = "main" once #22 merges upstream.
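The auto-init hazard described above can be sketched as follows. This is a minimal illustration, not the actual contents of mlcommons/DLIO_local_changes#22: the helper name `disable_mpi_auto_init` is hypothetical, and the only mpi4py features it relies on are the documented `mpi4py.rc.initialize` and `mpi4py.rc.finalize` options, which must be set before the first `from mpi4py import MPI`.

```python
def disable_mpi_auto_init():
    """Turn off mpi4py's automatic MPI_Init_thread()/MPI_Finalize().

    Hypothetical helper for illustration. Returns True if the flags were
    set, False if mpi4py is not installed in this environment.
    """
    try:
        # Importing the package alone does not initialize MPI;
        # only importing the mpi4py.MPI submodule does.
        import mpi4py
    except ImportError:
        return False

    # These flags only take effect if set before the first
    # `from mpi4py import MPI` in the process.
    mpi4py.rc.initialize = False  # skip MPI_Init_thread() at import time
    mpi4py.rc.finalize = False    # skip MPI_Finalize() at interpreter exit
    return True


if __name__ == "__main__":
    if disable_mpi_auto_init():
        # Safe in a spawn-context DataLoader child: the import no longer
        # tries to initialize MPI outside mpirun's environment.
        from mpi4py import MPI
        print("MPI initialized:", MPI.Is_initialized())
    else:
        print("mpi4py not installed; nothing to disable")
```

With auto-init disabled, a process that does need MPI is expected to initialize it explicitly (for example with `MPI.Init_thread()`); whether and where dlio_benchmark does so is not shown here.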
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
idevasena approved these changes on Jun 12, 2026
Summary

Fixes #415. DLIO PyTorch DataLoader workers were aborting at the start of epoch 1 with `MPI_Init_thread` on a NULL communicator whenever `reader.read_threads > 0` (retinanet_b200, unet3d_*, etc.).

Root cause lives in `dlio_benchmark`, not here:

- `dlio_benchmark/utils/statscounter.py:29` does `from mpi4py import MPI` at module top level.
- `dlio_benchmark/utils/utility.py:176` has a comment documenting that auto-init is unsafe for spawn-context workers, but the `mpi4py.rc.initialize` / `rc.finalize` flags that actually disable auto-init were never set.
- With `mpi4py.rc.initialize == True`, every spawn-context DataLoader child re-imports `statscounter`, triggers `MPI_Init_thread()` outside `mpirun`'s PMIX namespace, and aborts with `No permission (-17)`.

The upstream fix is mlcommons/DLIO_local_changes#22 (+2 lines). This PR re-pins our `dlio-benchmark` source from `branch = "main"` to `rev = 60fd3b8e7ae9cc8be644b47df0661366ac2c8bd6` (PR #22 head) so storage picks up the fix immediately. The lockfile is regenerated to match.

Changes

- `pyproject.toml`: switch dlio-benchmark source from `branch = "main"` → `rev = <PR #22 head SHA>`.
- `uv.lock`: regenerated via `uv lock --upgrade-package dlio-benchmark`.

Follow-up

Once mlcommons/DLIO_local_changes#22 merges to `main`, revert this file change back to `branch = "main"` and re-lock.

Test plan

- `uv sync` succeeds on a fresh checkout
- `mlpstorage training run --model retinanet -g b200 -na 8 ...` no longer aborts with `MPI_Init_thread on a NULL communicator`
- `mlpstorage training run --model unet3d` (which uses `read_threads: 4`) completes epoch 1
- `pytest tests/unit -v`
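As a rough illustration of the re-pin, the sketch below parses a before and after version of the source entry and reports whether it is pinned to a commit. Several details are assumptions rather than facts from this PR: the `[tool.uv.sources]` table layout, the repository URL (inferred from the `mlcommons/DLIO_local_changes#22` reference), and the helper name `pinned_rev`. Only the `branch = "main"` to `rev = ...` switch and the SHA come from the description above.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # assumed fallback package on older interpreters

# Assumed shape of the entry before this PR: tracks the moving branch.
BEFORE = """
[tool.uv.sources]
dlio-benchmark = { git = "https://github.com/mlcommons/DLIO_local_changes", branch = "main" }
"""

# Assumed shape after this PR: pinned to the PR #22 head commit.
AFTER = """
[tool.uv.sources]
dlio-benchmark = { git = "https://github.com/mlcommons/DLIO_local_changes", rev = "60fd3b8e7ae9cc8be644b47df0661366ac2c8bd6" }
"""


def pinned_rev(pyproject_text, package="dlio-benchmark"):
    """Return the pinned commit for `package`, or None if it tracks a branch."""
    source = tomllib.loads(pyproject_text)["tool"]["uv"]["sources"][package]
    return source.get("rev")


if __name__ == "__main__":
    print("before:", pinned_rev(BEFORE))
    print("after: ", pinned_rev(AFTER))
```

A check like this could also serve the follow-up: once the entry is reverted to `branch = "main"`, `pinned_rev` would return `None` again.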