fix(main): handle slow flush as transient; expose DLIO_DROP_CACHES_TIMEOUT#28
Merged
Merged
Conversation
…_CACHES_TIMEOUT
Fixes mlcommons/storage #487.
The current code path treats every drop_caches failure the same: warn
once and disable the flush for the rest of the run. That was correct for
the original bug (#391: an interactive sudo prompt that hung for ~16
hours), but it misfires on hosts where the kernel itself genuinely needs
more than 30s to drop a large page cache. Those operators have
NOPASSWD sudo configured; sudo -n returns 0 quickly, then the kernel
takes its time. The current code interprets that as a fatal sudo error,
silently suppresses the flush for the run, and inflates throughput.
Split the except handler into two cases:
* subprocess.TimeoutExpired -> the slow-kernel case. sudo -n already
succeeded. Warn once with hardware-appropriate guidance, do NOT
disable, and let the next epoch retry.
* everything else -> the sudo-refused / sudo-missing case. Keep the
existing warn-once-and-disable behavior so we don't loop on a
failure mode that originally hung interactively.
Expose the per-call timeout via DLIO_DROP_CACHES_TIMEOUT so operators on
large-RAM hosts can raise the ceiling without changing this file. The
mlpstorage CLI will expose a corresponding --drop-caches-timeout-seconds
knob in a follow-up PR that bumps the storage repo's DLIO pin.
Author
|
@russfellows please review when you have a moment. |
Open
4 tasks
Author
|
@idevasena @dslik You can review/approve as well. mlcommons/.storage#492 is the PR that will takes a CLI arg and creates the environment variable that this PR requires. |
Author
|
@russfellows This one too, please! |
|
I think I have approved and merged everything I can.
There are just two open, 492 and 493, which both require merge edits I believe.
—Russ
… On Jun 23, 2026, at 2:43 PM, Curtis Anderson ***@***.***> wrote:
FileSystemGuy
left a comment
(mlcommons/DLIO_local_changes#28)
<#28 (comment)>
@russfellows <https://github.com/russfellows> This one too, please!
—
Reply to this email directly, view it on GitHub <#28?email_source=notifications&email_token=AF64UJ4D4MCREIUBPJ4XUBT5BLTXVA5CNFSNUABFM5UWIORPF5TWS5BNNB2WEL2JONZXKZKDN5WW2ZLOOQXTINZYGMZDSMRTGM22M4TFMFZW63VHNVSW45DJN5XKKZLWMVXHJLDGN5XXIZLSL5RWY2LDNM#issuecomment-4783292335>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AF64UJ665J2WMUJUJDDSJY35BLTXVAVCNFSNUABGKJSXA33TNF2G64TZHMYTCNBUGQ4TIMRVHA5US43TOVSTWNBXGIYTMNRQGU3DHILWAI>.
You are receiving this because you were mentioned.
|
Author
|
@russfellows Approving and merging this one is required before we can merge mlcommons/storage PR#492 |
russfellows
approved these changes
Jun 23, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes mlcommons/storage#487.
The per-epoch page-cache flush (
sudo -n sh -c 'echo 3 > /proc/sys/vm/drop_caches') currently treats every failure the same: warn once and disable the flush for the rest of the run. That was correct for the original bug (mlcommons/storage#391: an interactive sudo prompt that hung for ~16 hours), but it misfires on hosts where the kernel itself needs more than 30s to drop a large page cache. Operators on large-RAM systems have NOPASSWD sudo configured;sudo -nreturns 0 quickly, then the kernel takes its time. The current code interprets the kernel timeout as a fatal sudo error, silently suppresses the flush for the run, and inflates throughput.Change
Split the
excepthandler into two cases:subprocess.TimeoutExpired— slow-kernel case.sudo -nalready succeeded (otherwise we'd see a quick non-zero exit, not a timeout). Warn once with hardware-appropriate guidance, do not disable, and let the next epoch retry.Expose the per-call timeout via
DLIO_DROP_CACHES_TIMEOUT(default 30, minimum 1). Invalid/empty/sub-1 values fall back to the default; the lower bound matters becausesubprocess.run(timeout=...)rejects 0 / negative values at call time, so a typo in an operator's env must not crash DLIO.Extracted the parsing into a module-level
_resolve_drop_caches_timeout()so it can be unit-tested in isolation.mlpstorage-side follow-up
mlpstorage will plumb a
--drop-caches-timeout-secondsCLI knob through to this env var in a separate PR that also bumps thedlio-benchmarkgit pin to this merge commit.Risk
except Exceptionbranch still warns once and disables for the run).sudo -nsucceeded AND the kernel exceeded the timeout.Test plan
pytest tests/test_drop_caches_timeout.py— 22 cases pass (env var unset / empty / whitespace / unparseable / valid integer / whitespace-stripped / zero / negative / unrelated-env-noise / os.environ fallback / default-constant guard).python3 -c "import ast; ast.parse(open('dlio_benchmark/main.py').read())"— syntax OK.mlpstorage closed training retinanet run fileon a host without NOPASSWD sudo — verify the warn-once-and-disable text still fires (regression check for #391).DLIO_DROP_CACHES_TIMEOUT=1on a host with non-trivial cache) — verify the new warning fires once and the flush retries on subsequent epochs.DLIO_DROP_CACHES_TIMEOUT=300on the host from #487 — verify the flush completes and no warning is emitted.