fix: add EXIT_CODE.INTERRUPTED to resolve AttributeError on SIGTERM (fixes #392, #393) #400
Merged
When dlio_benchmark exits, OpenMPI sends SIGTERM to the parent process group. The mlpstorage signal handler calls sys.exit(EXIT_CODE.INTERRUPTED), which crashed with AttributeError because INTERRUPTED was missing from the EXIT_CODE enum in config.py. Add INTERRUPTED = 8 to the enum. Fixes #392 Fixes #393
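The failure mode described above can be reproduced in isolation. The sketch below is a stand-in, not the project's code: the enum base class is an assumption and only two of the pre-fix members are shown. It illustrates that a missing enum member is only detected when the attribute is looked up, which is why the error surfaced inside the signal handler rather than at import time.

```python
import enum


class EXIT_CODE(enum.IntEnum):
    # Hypothetical stand-in for the pre-fix enum; members 1-6 are omitted.
    SUCCESS = 0
    TIMEOUT = 7


def handler_exit_code():
    # Mirrors the lookup the signal handler performs before calling sys.exit().
    return EXIT_CODE.INTERRUPTED


try:
    handler_exit_code()
    error_name = None
except AttributeError as exc:
    # The lookup fails here, at call time, not when the module is imported.
    error_name = type(exc).__name__

print(error_name)
```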
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
idevasena
approved these changes
Jun 1, 2026
Contributor
Author
Wow, that was fast, thanks Devasena. It would be great if you could review
the PR into dlio_benchmark maybe sometime this week.
Regards,
--
Thanks,
--Russ
FileSystemGuy
approved these changes
Jun 2, 2026
Contributor
Author
Curtis, and reviewers.
Thanks for merging, but in order for these fixes in MLCommons storage to work properly, we need the fixes in DLIO_Benchmark merged in as well.
Here is that PR if you all can review. It passed the CI reviews.
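mlcommons/DLIO_local_changes#21 (fix: restrict TorchIterableDatasetSimple to S3/AISTORE; gate s3dlio Parquet gen on storage type; fixes #391, #385)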
Regards,
—Russ
Contributor
Author
Devasena,
Great, appreciate your time here. I think the MLCommons storage suite is coming together… finally.
It would also be good for you all to consider the email conversation I had with the Huawei engineer. One possible “fix” for the DLRM workload is to remove the Python + MPI barrier time from the AU calculations. That is, if we measure the Python + MPI barrier overhead time, we should add that time to the GPU sleep time for the workload. That way, infinitely fast storage would score 100% AU, as it must. Currently, infinitely fast storage achieves an AU of only about 15–20%, which makes no sense.
I think that method is both logical and defensible, and should likely be applied to all of the workloads. However, it will only have a real impact on workloads like DLRM, where the GPU sleep time is under 500 microseconds. For workloads with GPU compute time above that value, it will make little if any difference, which is as it should be.
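To make the arithmetic concrete, here is a rough sketch of the proposal. It assumes AU is computed as compute time divided by total step time, which may not match the exact DLIO formula. The function name, its parameters, and the timing values are all illustrative, not measurements.

```python
def accelerator_utilization(compute_s, io_wait_s, overhead_s, credit_overhead):
    """Return AU as a fraction of the step time.

    compute_s:       emulated GPU sleep time per step
    io_wait_s:       time the step spends waiting on storage
    overhead_s:      measured Python + MPI barrier time per step
    credit_overhead: if True, count the overhead as compute (the proposal)
    """
    total = compute_s + overhead_s + io_wait_s
    busy = compute_s + overhead_s if credit_overhead else compute_s
    return busy / total


# Illustrative numbers only: 100 us of GPU sleep, 450 us of framework overhead.
compute_s, overhead_s = 100e-6, 450e-6

# io_wait_s = 0.0 models infinitely fast storage.
current = accelerator_utilization(compute_s, 0.0, overhead_s, credit_overhead=False)
proposed = accelerator_utilization(compute_s, 0.0, overhead_s, credit_overhead=True)
print(f"current: {current:.1%}, proposed: {proposed:.1%}")
```

With these assumed values the uncredited AU lands near 18% even though storage contributes no wait time at all, while crediting the overhead gives 100%.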
Regards,
—Russ
… On Jun 2, 2026, at 10:31 AM, Devasena Inupakutika ***@***.***> wrote:
Hello Russ,
Sure, going to review DLIO PR today, run tests and update. Thank you!
--
Regards,
Devasena
Summary
When dlio_benchmark finishes (successfully or with an error), OpenMPI sends SIGTERM to the parent process group as part of its normal cleanup. The mlpstorage signal handler catches this and calls sys.exit(EXIT_CODE.INTERRUPTED). That call crashed with AttributeError: EXIT_CODE has no attribute 'INTERRUPTED', because the INTERRUPTED member was missing from the EXIT_CODE enum in mlpstorage_py/config.py. The enum had SUCCESS through TIMEOUT (0–7) and a # Add more as needed comment where INTERRUPTED should have been.
Root Cause
main.py references EXIT_CODE.INTERRUPTED in two places (the signal handler at lines 63 and 319), but the enum value was never defined. Any benchmark invocation that reaches the MPI cleanup phase triggers the SIGTERM → signal handler → AttributeError crash path, which was misreported as a general failure rather than a clean interrupted exit.
Fix
mlpstorage_py/config.py: add INTERRUPTED = 8 to the EXIT_CODE enum.
Issues Fixed
- AttributeError: EXIT_CODE has no attribute 'INTERRUPTED' (#392, #393)
Testing
- uv run python -m pytest tests/unit/ -q
- EXIT_CODE.INTERRUPTED == 8 and str(EXIT_CODE.INTERRUPTED) == "INTERRUPTED (8)"
Files Changed
- mlpstorage_py/config.py: adds INTERRUPTED = 8 to the EXIT_CODE enum
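For reference, a minimal sketch of the fixed enum and the exit path, under stated assumptions: the IntEnum base class, the __str__ implementation, and the signal_handler body are guesses chosen to be consistent with the checks listed under Testing, and members 1 through 6 are omitted rather than invented. In a real program the handler would be registered with signal.signal(signal.SIGTERM, signal_handler). Here it is called directly so the exit status can be inspected.

```python
import enum
import signal
import sys


class EXIT_CODE(enum.IntEnum):
    # Sketch only: intermediate members (1-6) are omitted, not guessed.
    SUCCESS = 0
    TIMEOUT = 7
    INTERRUPTED = 8  # the member this PR adds

    def __str__(self):
        # Assumed formatting, chosen to match the "INTERRUPTED (8)" check.
        return f"{self.name} ({self.value})"


def signal_handler(signum, frame):
    # Hypothetical handler: exit with the dedicated "interrupted" status.
    sys.exit(EXIT_CODE.INTERRUPTED)


try:
    # Call the handler directly instead of delivering a real SIGTERM.
    signal_handler(signal.SIGTERM, None)
except SystemExit as exc:
    exit_status = exc.code

print(int(exit_status), str(EXIT_CODE.INTERRUPTED))
```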