fix: pin dlio-benchmark to PR #22 head to resolve issue #415 #436

Merged: FileSystemGuy merged 1 commit into main from FileSystemGuy-DLIO-fix on Jun 12, 2026

FileSystemGuy (Contributor) commented Jun 12, 2026

Summary

Fixes #415. DLIO PyTorch DataLoader workers were aborting at the start of epoch 1 with MPI_Init_thread on a NULL communicator whenever reader.read_threads > 0 (retinanet_b200, unet3d_*, etc.).

Root cause lives in dlio_benchmark, not here:

  • dlio_benchmark/utils/statscounter.py:29 does `from mpi4py import MPI` at module top level.
  • dlio_benchmark/utils/utility.py:176 has a comment documenting that auto-init is unsafe for spawn-context workers, but the mpi4py.rc.initialize / rc.finalize flags that actually disable auto-init were never set.
  • With the default mpi4py.rc.initialize == True, every spawn-context DataLoader child re-imports statscounter, triggers MPI_Init_thread() outside mpirun's PMIX namespace, and aborts with No permission (-17).
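The mechanism in the bullets above can be sketched as follows. This is a minimal illustration, not the upstream patch: `mpi4py.rc.initialize` and `mpi4py.rc.finalize` are mpi4py's documented runtime options, but the helper name `disable_mpi_auto_init` and the placement of the call are assumptions made for this example. The flags only take effect if they are set before the first `from mpi4py import MPI` in the process.

```python
import importlib.util
from types import SimpleNamespace


def disable_mpi_auto_init(rc):
    """Switch off import-time MPI_Init_thread and exit-time MPI_Finalize.

    `rc` is expected to be `mpi4py.rc` (or a stand-in with the same
    attributes). Hypothetical helper, not part of dlio_benchmark.
    """
    rc.initialize = False  # do not call MPI_Init_thread() when MPI is imported
    rc.finalize = False    # do not call MPI_Finalize() at interpreter exit
    return rc


if importlib.util.find_spec("mpi4py") is not None:
    import mpi4py

    # Must run before the first `from mpi4py import MPI` in this process.
    disable_mpi_auto_init(mpi4py.rc)
    # A later `from mpi4py import MPI` no longer initializes MPI on import;
    # code that needs MPI can check MPI.Is_initialized() and call MPI.Init().
else:
    # mpi4py is not installed here; show the effect on a stand-in object.
    demo_rc = disable_mpi_auto_init(SimpleNamespace(initialize=True, finalize=None))
    print(demo_rc.initialize, demo_rc.finalize)
```

With the flags cleared, a spawn-context DataLoader child that re-imports a module containing `from mpi4py import MPI` should load the bindings without attempting to join an MPI job it was never launched into.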

The upstream fix is mlcommons/DLIO_local_changes#22 (+2 lines). This PR re-pins our dlio-benchmark source from branch = "main" to rev = 60fd3b8e7ae9cc8be644b47df0661366ac2c8bd6 (PR #22 head) so storage picks up the fix immediately. The lockfile is regenerated to match.

Changes

  • pyproject.toml: switch dlio-benchmark source from branch = "main" to rev = <PR #22 head SHA>.
  • uv.lock: regenerated via uv lock --upgrade-package dlio-benchmark.

Follow-up

Once mlcommons/DLIO_local_changes#22 merges to main, revert the pyproject.toml source back to branch = "main" and re-lock.

Test plan

  • uv sync succeeds on a fresh checkout
  • mlpstorage training run --model retinanet -g b200 -na 8 ... no longer aborts with MPI_Init_thread on a NULL communicator
  • mlpstorage training run --model unet3d (which uses read_threads: 4) completes epoch 1
  • No regression in existing unit tests: pytest tests/unit -v

PyTorch DataLoader workers were aborting with "MPI_Init_thread on a NULL
communicator" whenever a DLIO workload used reader.read_threads > 0
(e.g. retinanet_b200, unet3d_*). Root cause is in dlio_benchmark: the
top-level `from mpi4py import MPI` in statscounter.py triggers
MPI_Init_thread() in every spawn-context DataLoader child, which has no
PMIX/ORTE environment because it was not launched by mpirun.

The upstream comment in dlio_benchmark/utils/utility.py:176 already
documented the hazard, but the mpi4py.rc.initialize / rc.finalize flags
that actually disable the auto-init were never set.

mlcommons/DLIO_local_changes#22 adds those two lines. Pin to that PR's
head commit (60fd3b8e) so storage picks up the fix immediately. Revert
to branch = "main" once #22 merges upstream.

FileSystemGuy requested a review from a team June 12, 2026 18:57

github-actions: MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

FileSystemGuy merged commit aa68b5b into main Jun 12, 2026 (2 checks passed)
github-actions Bot locked and limited conversation to collaborators Jun 12, 2026
FileSystemGuy deleted the FileSystemGuy-DLIO-fix branch June 13, 2026 00:37
Development

Successfully merging this pull request may close these issues.

DLIO PyTorch DataLoader workers abort with MPI_Init "No permission" when read_threads > 0