chore(deps): track DLIO main @ PR #25 merge (fixes #455) #470

Merged
idevasena merged 1 commit into main from FileSystemGuy-dlio-pr25 on Jun 18, 2026

@FileSystemGuy

Contributor

Summary

Bumps the dlio-benchmark git pin to pick up DLIO PR #25 (cbe20010), which fixes the inter-epoch sampler deadlock reported in #455.

  • Pin: 1d11f982 -> cbe20010
  • Storage version: 3.0.12 -> 3.0.13
  • uv.lock regenerated.

Why

dlio_sampler computed each rank's shard size as math.ceil(num_samples / comm_size), with a clamp on the last rank. When num_samples % comm_size != 0, the last rank produced fewer batches per epoch, so the per-step and end-of-epoch MPI_Barriers in main._train() were mismatched across iterations. The result was a silent, CPU-spinning deadlock at the next epoch boundary with no diagnostic.
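
To make the imbalance concrete, here is a minimal sketch of ceil + clamp sharding. It is a reconstruction from the description above, not DLIO's actual code: the function names `ceil_clamp_shard` and `shard_sizes` are hypothetical, and the clamp is applied with `min()` on every rank, which is an assumption that gives the same result as clamping only the last rank in this example.

```python
import math


def ceil_clamp_shard(num_samples: int, comm_size: int, rank: int) -> range:
    """Hypothetical ceil + clamp sharding (illustrative, not DLIO's code)."""
    per_rank = math.ceil(num_samples / comm_size)
    start = min(rank * per_rank, num_samples)
    # Clamp the end of the shard so it never runs past the dataset.
    end = min(start + per_rank, num_samples)
    return range(start, end)


def shard_sizes(num_samples: int, comm_size: int) -> list[int]:
    """Number of samples each rank would iterate over in one epoch."""
    return [len(ceil_clamp_shard(num_samples, comm_size, r))
            for r in range(comm_size)]


if __name__ == "__main__":
    # 10 samples over 4 ranks: the last rank gets 1 sample instead of 3.
    print(shard_sizes(10, 4))  # [3, 3, 3, 1]
    # 8 samples over 4 ranks: evenly divisible, every rank gets 2.
    print(shard_sizes(8, 4))   # [2, 2, 2, 2]
```

With one barrier per step, ranks 0 to 2 in the first case would each enter three per-step barriers while rank 3 enters only one, which is the kind of mismatch described above.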

PR #25 replaces ceil + clamp with floor division at three call sites (torch_data_loader.dlio_sampler, config.build_sample_map_iter, config.get_global_map_index), emits a rank-0 warning when the floor drops samples, and fixes a pre-existing Sampler.__len__ contract violation (caught by @idevasena during review) so len(sampler) == len(list(iter(sampler))). Includes three regression tests.
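
The floor-division approach can be sketched as follows. This is an illustration under stated assumptions, not the code from PR #25: the class `FloorShardSampler`, its attributes, and the exact warning text are hypothetical, and it uses the standard-library `warnings` module where DLIO may use its own logger.

```python
import warnings


class FloorShardSampler:
    """Hypothetical sampler with equal per-rank shards (not DLIO's API)."""

    def __init__(self, num_samples: int, comm_size: int, rank: int):
        # Floor division: every rank gets exactly the same number of samples.
        self.per_rank = num_samples // comm_size
        self.start = rank * self.per_rank
        # Samples left over after equal sharding are dropped.
        self.dropped = num_samples - self.per_rank * comm_size
        if rank == 0 and self.dropped:
            # Assumed wording; only rank 0 warns to avoid duplicate output.
            warnings.warn(
                f"dropping {self.dropped} sample(s) so that all "
                f"{comm_size} ranks see {self.per_rank} samples each"
            )

    def __len__(self) -> int:
        # Must agree with the number of items __iter__ yields.
        return self.per_rank

    def __iter__(self):
        return iter(range(self.start, self.start + self.per_rank))


if __name__ == "__main__":
    samplers = [FloorShardSampler(10, 4, r) for r in range(4)]
    print([len(s) for s in samplers])   # [2, 2, 2, 2]
    print([list(s) for s in samplers])  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

Because every rank iterates over the same number of samples, each rank reaches the same number of per-step barriers, at the cost of skipping the trailing `num_samples % comm_size` samples.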

Closes #455.

Test plan

  • mlpstorage --version reports 3.0.13.
  • pip install -e . resolves DLIO at cbe20010.
  • Multi-rank run with num_samples % comm_size != 0 (e.g. unet3d configured so num_files_train is not a multiple of num-accelerators) completes past epoch 1 without hanging.
  • DLIO emits the "dropping N sample(s)" warning on rank 0 when N is uneven.

Bumps the DLIO_local_changes pin from 1d11f982 to cbe20010, picking up
PR #25 (sampler floor division — equal per-rank shards). Fixes the
inter-epoch deadlock reported in #455 when num_samples is not an even
multiple of comm_size.

Also bumps mlpstorage version 3.0.12 -> 3.0.13 so users can tell from
`mlpstorage --version` whether the fix is present.

Regenerated uv.lock.
@FileSystemGuy FileSystemGuy requested a review from a team June 18, 2026 14:36
@github-actions

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@FileSystemGuy

Copy link
Copy Markdown
Contributor Author

@idevasena Just finishing off the DLIO work you helped fix up!

@idevasena idevasena left a comment

Contributor

Looks good! Thank you @FileSystemGuy

@idevasena idevasena merged commit 19be847 into main Jun 18, 2026
3 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 18, 2026
@russfellows russfellows deleted the FileSystemGuy-dlio-pr25 branch June 18, 2026 15:55
Development

Successfully merging this pull request may close these issues.

dlio_sampler uses ceil(N/size) causing inter-epoch deadlock when num_samples is not divisible by comm_size