Skip to content

fix(main): handle slow flush as transient; expose DLIO_DROP_CACHES_TIMEOUT#28

Merged
russfellows merged 1 commit into
mainfrom
FileSystemGuy-issue487-drop-caches-timeout
Jun 23, 2026
Merged

fix(main): handle slow flush as transient; expose DLIO_DROP_CACHES_TIMEOUT#28
russfellows merged 1 commit into
mainfrom
FileSystemGuy-issue487-drop-caches-timeout

Conversation

@FileSystemGuy

Copy link
Copy Markdown

Summary

Fixes mlcommons/storage#487.

The per-epoch page-cache flush (sudo -n sh -c 'echo 3 > /proc/sys/vm/drop_caches') currently treats every failure the same: warn once and disable the flush for the rest of the run. That was correct for the original bug (mlcommons/storage#391: an interactive sudo prompt that hung for ~16 hours), but it misfires on hosts where the kernel itself needs more than 30s to drop a large page cache. Operators on large-RAM systems have NOPASSWD sudo configured; sudo -n returns 0 quickly, then the kernel takes its time. The current code interprets the kernel timeout as a fatal sudo error, silently suppresses the flush for the run, and inflates throughput.

Change

Split the except handler into two cases:

  • subprocess.TimeoutExpired — slow-kernel case. sudo -n already succeeded (otherwise we'd see a quick non-zero exit, not a timeout). Warn once with hardware-appropriate guidance, do not disable, and let the next epoch retry.
  • everything else — sudo refused / missing / permissions. Keep the existing warn-once-and-disable behavior so we don't loop on a failure mode that originally hung interactively.

Expose the per-call timeout via DLIO_DROP_CACHES_TIMEOUT (default 30, minimum 1). Invalid/empty/sub-1 values fall back to the default; the lower bound matters because subprocess.run(timeout=...) rejects 0 / negative values at call time, so a typo in an operator's env must not crash DLIO.

Extracted the parsing into a module-level _resolve_drop_caches_timeout() so it can be unit-tested in isolation.

mlpstorage-side follow-up

mlpstorage will plumb a --drop-caches-timeout-seconds CLI knob through to this env var in a separate PR that also bumps the dlio-benchmark git pin to this merge commit.

Risk

  • Behavior on hosts without NOPASSWD sudo is unchanged (the except Exception branch still warns once and disables for the run).
  • Behavior on hosts where the flush completes within 30s is unchanged.
  • New behavior triggers only when sudo -n succeeded AND the kernel exceeded the timeout.

Test plan

  • pytest tests/test_drop_caches_timeout.py — 22 cases pass (env var unset / empty / whitespace / unparseable / valid integer / whitespace-stripped / zero / negative / unrelated-env-noise / os.environ fallback / default-constant guard).
  • python3 -c "import ast; ast.parse(open('dlio_benchmark/main.py').read())" — syntax OK.
  • Smoke (operator-side, after merge): invoke mlpstorage closed training retinanet run file on a host without NOPASSWD sudo — verify the warn-once-and-disable text still fires (regression check for #391).
  • Smoke: invoke the same on a host with NOPASSWD sudo and force a timeout (e.g., DLIO_DROP_CACHES_TIMEOUT=1 on a host with non-trivial cache) — verify the new warning fires once and the flush retries on subsequent epochs.
  • Smoke: invoke with DLIO_DROP_CACHES_TIMEOUT=300 on the host from #487 — verify the flush completes and no warning is emitted.

…_CACHES_TIMEOUT

Fixes mlcommons/storage #487.

The current code path treats every drop_caches failure the same: warn
once and disable the flush for the rest of the run. That was correct for
the original bug (#391: an interactive sudo prompt that hung for ~16
hours), but it misfires on hosts where the kernel itself genuinely needs
more than 30s to drop a large page cache. Those operators have
NOPASSWD sudo configured; sudo -n returns 0 quickly, then the kernel
takes its time. The current code interprets that as a fatal sudo error,
silently suppresses the flush for the run, and inflates throughput.

Split the except handler into two cases:

  * subprocess.TimeoutExpired -> the slow-kernel case. sudo -n already
    succeeded. Warn once with hardware-appropriate guidance, do NOT
    disable, and let the next epoch retry.

  * everything else -> the sudo-refused / sudo-missing case. Keep the
    existing warn-once-and-disable behavior so we don't loop on a
    failure mode that originally hung interactively.

Expose the per-call timeout via DLIO_DROP_CACHES_TIMEOUT so operators on
large-RAM hosts can raise the ceiling without changing this file. The
mlpstorage CLI will expose a corresponding --drop-caches-timeout-seconds
knob in a follow-up PR that bumps the storage repo's DLIO pin.
@FileSystemGuy

Copy link
Copy Markdown
Author

@russfellows please review when you have a moment.

@FileSystemGuy

Copy link
Copy Markdown
Author

@idevasena @dslik You can review/approve as well.

mlcommons/.storage#492 is the PR that will takes a CLI arg and creates the environment variable that this PR requires.

@FileSystemGuy

Copy link
Copy Markdown
Author

@russfellows This one too, please!

@russfellows

russfellows commented Jun 23, 2026 via email

Copy link
Copy Markdown

@FileSystemGuy

Copy link
Copy Markdown
Author

@russfellows Approving and merging this one is required before we can merge mlcommons/storage PR#492

@russfellows russfellows merged commit 3667ed1 into main Jun 23, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Could not flush page cache between epochs warning

2 participants