fix(main): handle slow flush as transient; expose DLIO_DROP_CACHES_TIMEOUT by FileSystemGuy · Pull Request #28 · mlcommons/DLIO_local_changes

FileSystemGuy · 2026-06-23T00:43:18Z

Summary

The per-epoch page-cache flush (sudo -n sh -c 'echo 3 > /proc/sys/vm/drop_caches') currently treats every failure the same: warn once and disable the flush for the rest of the run. That was correct for the original bug (mlcommons/storage#391: an interactive sudo prompt that hung for ~16 hours), but it misfires on hosts where the kernel itself needs more than 30s to drop a large page cache. Operators on large-RAM systems have NOPASSWD sudo configured; sudo -n returns 0 quickly, then the kernel takes its time. The current code interprets the kernel timeout as a fatal sudo error, silently suppresses the flush for the run, and inflates throughput.

Change

Split the except handler into two cases:

subprocess.TimeoutExpired — slow-kernel case. sudo -n already succeeded (otherwise we'd see a quick non-zero exit, not a timeout). Warn once with hardware-appropriate guidance, do not disable, and let the next epoch retry.
everything else — sudo refused / missing / permissions. Keep the existing warn-once-and-disable behavior so we don't loop on a failure mode that originally hung interactively.
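
The two cases above can be sketched roughly as follows. This is a minimal illustration, not the code in dlio_benchmark/main.py: the class name PageCacheFlusher, its attributes, the injectable run parameter, and the warning text are all hypothetical. Only the command line and the TimeoutExpired / everything-else split come from the description.

```python
import logging
import subprocess

log = logging.getLogger("dlio_sketch")

# The flush command quoted in the Summary.
DROP_CACHES_CMD = ["sudo", "-n", "sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"]


class PageCacheFlusher:
    """Hypothetical holder for the warn-once / disable state."""

    def __init__(self, timeout=30, run=subprocess.run):
        self.timeout = timeout
        self.enabled = True        # cleared only on non-timeout failures
        self._warned_slow = False  # slow-kernel warning is logged once
        self._run = run            # injectable so the sketch runs without sudo

    def flush(self):
        """Try to drop the page cache; return True if the flush completed."""
        if not self.enabled:
            return False
        try:
            self._run(DROP_CACHES_CMD, check=True, timeout=self.timeout,
                      stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            return True
        except subprocess.TimeoutExpired:
            # Slow-kernel case: warn once, stay enabled, retry next epoch.
            # This clause must come first because TimeoutExpired is itself
            # a subclass of Exception.
            if not self._warned_slow:
                log.warning("drop_caches exceeded %ss; consider raising "
                            "DLIO_DROP_CACHES_TIMEOUT", self.timeout)
                self._warned_slow = True
            return False
        except Exception as exc:
            # sudo refused / missing / permissions: warn once and disable.
            log.warning("drop_caches failed (%s); disabling flush for this run", exc)
            self.enabled = False
            return False
```

Injecting run keeps the branching testable on a machine without NOPASSWD sudo; a real implementation may keep this state elsewhere.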

Expose the per-call timeout via DLIO_DROP_CACHES_TIMEOUT (default 30, minimum 1). Invalid/empty/sub-1 values fall back to the default; the lower bound matters because subprocess.run(timeout=...) rejects 0 / negative values at call time, so a typo in an operator's env must not crash DLIO.

Extracted the parsing into a module-level _resolve_drop_caches_timeout() so it can be unit-tested in isolation.

mlpstorage-side follow-up

mlpstorage will plumb a --drop-caches-timeout-seconds CLI knob through to this env var in a separate PR that also bumps the dlio-benchmark git pin to this merge commit.
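
For orientation only, the plumbing could look something like the sketch below. The flag name and the env var come from this description; build_parser, dlio_env, and the program name are hypothetical and are not taken from the mlpstorage code.

```python
import argparse
import os


def build_parser():
    """Hypothetical parser exposing only the knob discussed here."""
    parser = argparse.ArgumentParser(prog="mlpstorage-sketch")
    parser.add_argument(
        "--drop-caches-timeout-seconds",
        type=int,
        default=None,
        help="per-call timeout for the page-cache flush (seconds)",
    )
    return parser


def dlio_env(args, base_env=None):
    """Return a copy of the environment to hand to the DLIO child process."""
    env = dict(os.environ if base_env is None else base_env)
    if args.drop_caches_timeout_seconds is not None:
        # Only set the variable when the operator asked for it, so DLIO's
        # own default applies otherwise.
        env["DLIO_DROP_CACHES_TIMEOUT"] = str(args.drop_caches_timeout_seconds)
    return env
```

Leaving the variable unset when the flag is absent keeps the default defined in one place, on the DLIO side.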

Risk

Behavior on hosts without NOPASSWD sudo is unchanged (the except Exception branch still warns once and disables for the run).
Behavior on hosts where the flush completes within 30s is unchanged.
New behavior triggers only when sudo -n succeeded AND the kernel exceeded the timeout.

Test plan

pytest tests/test_drop_caches_timeout.py — 22 cases pass (env var unset / empty / whitespace / unparseable / valid integer / whitespace-stripped / zero / negative / unrelated-env-noise / os.environ fallback / default-constant guard).
python3 -c "import ast; ast.parse(open('dlio_benchmark/main.py').read())" — syntax OK.
Smoke (operator-side, after merge): invoke mlpstorage closed training retinanet run file on a host without NOPASSWD sudo — verify the warn-once-and-disable text still fires (regression check for #391).
Smoke: invoke the same on a host with NOPASSWD sudo and force a timeout (e.g., DLIO_DROP_CACHES_TIMEOUT=1 on a host with non-trivial cache) — verify the new warning fires once and the flush retries on subsequent epochs.
Smoke: invoke with DLIO_DROP_CACHES_TIMEOUT=300 on the host from #487 — verify the flush completes and no warning is emitted.

…_CACHES_TIMEOUT

Fixes mlcommons/storage #487.

The current code path treats every drop_caches failure the same: warn once and disable the flush for the rest of the run. That was correct for the original bug (#391: an interactive sudo prompt that hung for ~16 hours), but it misfires on hosts where the kernel itself genuinely needs more than 30s to drop a large page cache. Those operators have NOPASSWD sudo configured; sudo -n returns 0 quickly, then the kernel takes its time. The current code interprets that as a fatal sudo error, silently suppresses the flush for the run, and inflates throughput.

Split the except handler into two cases:

* subprocess.TimeoutExpired -> the slow-kernel case. sudo -n already succeeded. Warn once with hardware-appropriate guidance, do NOT disable, and let the next epoch retry.
* everything else -> the sudo-refused / sudo-missing case. Keep the existing warn-once-and-disable behavior so we don't loop on a failure mode that originally hung interactively.

Expose the per-call timeout via DLIO_DROP_CACHES_TIMEOUT so operators on large-RAM hosts can raise the ceiling without changing this file. The mlpstorage CLI will expose a corresponding --drop-caches-timeout-seconds knob in a follow-up PR that bumps the storage repo's DLIO pin.

FileSystemGuy · 2026-06-23T00:43:57Z

@russfellows please review when you have a moment.

FileSystemGuy · 2026-06-23T01:14:53Z

@idevasena @dslik You can review/approve as well.

mlcommons/storage#492 is the PR that takes a CLI arg and sets the environment variable that this PR reads.

FileSystemGuy · 2026-06-23T20:43:15Z

@russfellows This one too, please!

russfellows · 2026-06-23T20:50:34Z

I think I have approved and merged everything I can. There are just two open, 492 and 493, which both require merge edits I believe. —Russ


FileSystemGuy · 2026-06-23T21:08:23Z

@russfellows Approving and merging this one is required before we can merge mlcommons/storage PR#492

FileSystemGuy requested a review from a team June 23, 2026 00:43

FileSystemGuy mentioned this pull request Jun 23, 2026

Could not flush page cache between epochs warning mlcommons/storage#487

Closed

FileSystemGuy mentioned this pull request Jun 23, 2026

training: expose --drop-caches-timeout-seconds; plumb to DLIO via env var mlcommons/storage#492

Open

4 tasks

russfellows approved these changes Jun 23, 2026

russfellows merged commit 3667ed1 into main Jun 23, 2026
7 checks passed

russfellows merged 1 commit into main from FileSystemGuy-issue487-drop-caches-timeout
