training: expose --drop-caches-timeout-seconds; plumb to DLIO via env var by FileSystemGuy · Pull Request #492 · mlcommons/storage

FileSystemGuy · 2026-06-23T01:12:35Z

Summary

mlpstorage-side half of #487. The DLIO half is mlcommons/DLIO_local_changes#28, which adds the DLIO_DROP_CACHES_TIMEOUT env var so large-RAM hosts can raise the per-call timeout for DLIO's per-epoch page-cache flush.

This PR adds a deployment knob to the training run subcommand only and plumbs the value through to DLIO via that env var.

Scope

Subcommand: training run only. datasize / datagen / configview do not invoke the per-epoch flush loop, so they don't need the knob.
Benchmark: training only. Checkpointing, VectorDB, and KV Cache do not use the auto-flush code path (checkpointing uses O_DIRECT / fadvise on the data path and a manual operator-side clear between phases; VDB/KVCache don't invoke drop_caches at all).
Mode: all (closed/open/whatif). It's a deployment knob, not a submission tunable, matching the existing pattern for --dlio-bin-path.

Plumbing

CLI (mlpstorage_py/cli/training_args.py): argparse validates positive int (>= 1) via a _positive_int type. Default None means no override; DLIO falls back to its built-in 30s.
Env-var set (mlpstorage_py/benchmarks/dlio.py:TrainingBenchmark.__init__): when the flag is provided, set os.environ['DLIO_DROP_CACHES_TIMEOUT'] to the stringified value. subprocess.Popen inherits parent env, so single-process invocations see it.
MPI forwarding (generate_dlio_command()): when the env var is set and exec_type == MPI, append -x DLIO_DROP_CACHES_TIMEOUT to the MPI prefix. OpenMPI does not forward arbitrary env vars to ranks by default; -x opts this one in so multi-host training honors the CLI choice.

Operator usage (post-merge)

After the DLIO PR lands and this repo's dlio-benchmark pin is bumped:

mlpstorage closed training retinanet run file \
    --client-host-memory-in-gb 64 --num-accelerators 4 \
    --accelerator-type b200 --data-dir ... --results-dir ... \
    --drop-caches-timeout-seconds 300

(DLIO_DROP_CACHES_TIMEOUT=300 exported directly into DLIO's env is the alternative path; both arrive at the same place.)
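
To illustrate why the two paths are equivalent, here is a small sketch in which a child Python process stands in for DLIO. The helper `child_sees` is hypothetical; the point is only that a child started through `subprocess` sees the same value whether the parent set `os.environ` or the variable was placed into the child's environment explicitly.

```python
import os
import subprocess
import sys

ENV_VAR = "DLIO_DROP_CACHES_TIMEOUT"
# The child prints whatever value it finds for the variable.
CHILD_CODE = f"import os; print(os.environ.get({ENV_VAR!r}))"


def child_sees(env=None):
    """Run a child interpreter and return the value it reads for ENV_VAR."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,  # None means "inherit the parent's environment"
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Path 1: the parent sets os.environ (what the CLI flag does); the child inherits.
previous = os.environ.get(ENV_VAR)
os.environ[ENV_VAR] = "300"
try:
    via_flag = child_sees()
finally:
    # Restore the parent's environment so the sketch leaves no trace.
    if previous is None:
        del os.environ[ENV_VAR]
    else:
        os.environ[ENV_VAR] = previous

# Path 2: the variable is exported directly into the child's environment.
via_export = child_sees(env={**os.environ, ENV_VAR: "300"})

print(via_flag, via_export)
```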

Test plan

pytest tests/unit/test_drop_caches_timeout_cli.py — 25 cases pass (flag scope, value validation, env-var plumbing, MPI forwarding, end-to-end).
pytest tests/unit --ignore=tests/unit/test_benchmarks_base.py --ignore=tests/unit/test_parquet_reader.py --ignore=tests/unit/test_vdb_modular_fake_backend.py — 1399 pass, 4 skipped, 0 failures.
Manual end-to-end: invoked parse_arguments() with --drop-caches-timeout-seconds 300, confirmed args.drop_caches_timeout_seconds == 300, the env var is set, and the MPI prefix gets -x DLIO_DROP_CACHES_TIMEOUT appended.
Manual rejection: --drop-caches-timeout-seconds -5 exits with argument --drop-caches-timeout-seconds: expected positive integer (>= 1), got -5.

Tests added (25 cases)

TestFlagScope (6 cases): flag present on training run; rejected on training datasize/datagen/configview, checkpointing run, vectordb run, kvcache run.
TestValueValidation (12 cases): positive ints accepted (1, 30, 300, 7200, 86400); 0, negative, non-numeric, empty, float rejected; default is None when omitted.
TestEnvVarPlumbing (3 cases): env var set when flag provided, unset otherwise, stored as string (not int).
TestMpiForwarding (2 cases): -x DLIO_DROP_CACHES_TIMEOUT present when env set, absent when unset.
TestEndToEnd (1 case): full argparse → env-set chain reaches the expected state.
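
For readers unfamiliar with how such cases are usually written, the following sketch shows the shape of a flag-scope test and two env-var tests. It is not taken from tests/unit/test_drop_caches_timeout_cli.py: `build_parser` is a hypothetical two-subcommand stand-in for the real CLI, `apply_flag` is a hypothetical stand-in for the env-var step, and plain `int` replaces the real validator for brevity.

```python
import argparse
import contextlib
import io
import os
import unittest
from unittest import mock

ENV_VAR = "DLIO_DROP_CACHES_TIMEOUT"
FLAG = "--drop-caches-timeout-seconds"


def build_parser():
    """Hypothetical miniature CLI: only `run` carries the flag."""
    parser = argparse.ArgumentParser(prog="mini")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run")
    run.add_argument(FLAG, type=int, default=None)
    sub.add_parser("datagen")  # no flag registered here
    return parser


def apply_flag(args):
    """Hypothetical stand-in for the env-var step in the benchmark constructor."""
    value = getattr(args, "drop_caches_timeout_seconds", None)
    if value is not None:
        os.environ[ENV_VAR] = str(value)


class TestSketch(unittest.TestCase):
    def test_flag_rejected_outside_run(self):
        # argparse reports unknown options by exiting, so expect SystemExit.
        with contextlib.redirect_stderr(io.StringIO()):
            with self.assertRaises(SystemExit):
                build_parser().parse_args(["datagen", FLAG, "300"])

    def test_env_set_as_string(self):
        # patch.dict restores os.environ when the block exits.
        with mock.patch.dict(os.environ):
            os.environ.pop(ENV_VAR, None)
            apply_flag(build_parser().parse_args(["run", FLAG, "300"]))
            self.assertEqual(os.environ[ENV_VAR], "300")

    def test_env_unset_when_flag_omitted(self):
        with mock.patch.dict(os.environ):
            os.environ.pop(ENV_VAR, None)
            apply_flag(build_parser().parse_args(["run"]))
            self.assertNotIn(ENV_VAR, os.environ)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSketch)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Wrapping each env-var case in `mock.patch.dict(os.environ)` keeps one test from leaking the variable into the next, which matters because the value lives in process-global state.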

Dependencies

This PR's CLI knob is functional regardless of the DLIO pin — the env var simply gets set in the subprocess environment. The behavior change in DLIO (slow-flush retry + reading the env var) requires mlcommons/DLIO_local_changes#28. The follow-up PR that bumps the dlio-benchmark pin in pyproject.toml will connect the two.
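
For context, the consumer side might look something like the sketch below. This is an assumption, not the code in mlcommons/DLIO_local_changes#28: the function name `resolve_drop_caches_timeout` is hypothetical, and falling back to the default on a malformed value is a guess at reasonable behavior. Only the variable name and the 30-second default come from this description.

```python
import os

ENV_VAR = "DLIO_DROP_CACHES_TIMEOUT"
DEFAULT_TIMEOUT_SECONDS = 30  # built-in default stated in the PR description


def resolve_drop_caches_timeout(environ=os.environ):
    """Return the per-call timeout in seconds, falling back to the default.

    Assumption: unset, non-integer, or non-positive values use the default.
    """
    raw = environ.get(ENV_VAR)
    if raw is None:
        return DEFAULT_TIMEOUT_SECONDS
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_TIMEOUT_SECONDS
    return value if value >= 1 else DEFAULT_TIMEOUT_SECONDS


print(resolve_drop_caches_timeout({}))
print(resolve_drop_caches_timeout({ENV_VAR: "300"}))
```

With a DLIO build that does not read the variable, setting it has no effect, which is why the knob is safe to ship ahead of the pin bump.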

… var

mlpstorage-side half of mlcommons/storage #487. The DLIO half lives at mlcommons/DLIO_local_changes #28, which adds the DLIO_DROP_CACHES_TIMEOUT env var so large-RAM hosts can raise the per-call timeout for the per-epoch page-cache flush.

Add a deployment knob (--drop-caches-timeout-seconds) to the training `run` subcommand only. Available in all modes (closed/open/whatif), matching the existing pattern for --dlio-bin-path. Scoped to `run` because the per-epoch flush only happens during training execution; datasize/datagen/configview don't invoke it. Not exposed on checkpointing/vectordb/kvcache because those benchmarks don't use the auto-flush code path.

Plumbing:
1. argparse validates positive int (>= 1). Default None means no override; DLIO falls back to its built-in 30s.
2. TrainingBenchmark.__init__ sets os.environ['DLIO_DROP_CACHES_TIMEOUT'] when the flag is provided. subprocess.Popen inherits the parent env, so single-process invocations see the value.
3. generate_dlio_command appends `-x DLIO_DROP_CACHES_TIMEOUT` to the MPI prefix when the env var is set. OpenMPI does not forward arbitrary env vars to ranks by default; -x opts this one in so multi-host training honors the operator's CLI choice.

Tests (25 cases):
- Flag scope: present on training run; rejected on training datasize/datagen/configview, checkpointing, vectordb, kvcache.
- Validation: positive ints accepted (1, 30, 300, 7200, 86400); 0, negative, non-numeric, empty, float rejected.
- Env-var plumbing: set when flag provided, unset otherwise, stringified for subprocess.
- MPI forwarding: -x present when env set, absent when unset.
- End-to-end: full argparse -> env-set chain produces the expected state.

After the DLIO PR merges and the storage repo bumps its dlio-benchmark pin, the user from #487 can do:

mlpstorage closed training retinanet run file \
    --client-host-memory-in-gb 64 --num-accelerators 4 \
    --accelerator-type b200 --data-dir ... --results-dir ... \
    --drop-caches-timeout-seconds 300

github-actions · 2026-06-23T01:12:45Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

FileSystemGuy · 2026-06-23T01:15:50Z

Resolves #475

FileSystemGuy requested a review from a team June 23, 2026 01:12

FileSystemGuy marked this pull request as draft June 23, 2026 01:12

FileSystemGuy and others added 4 commits June 23, 2026 08:07

Merge branch 'main' into feat/drop-caches-timeout-cli

5308765

Merge branch 'main' into feat/drop-caches-timeout-cli

71346f9

Merge branch 'main' into feat/drop-caches-timeout-cli

f80392a

Merge branch 'main' into feat/drop-caches-timeout-cli

dbed87f

FileSystemGuy marked this pull request as ready for review June 23, 2026 22:16

FileSystemGuy mentioned this pull request Jun 24, 2026

fix(kvcache): consolidate per-option workload params in mlpstorage (#498, #500) #501

Open

3 tasks

FileSystemGuy wants to merge 5 commits into main from feat/drop-caches-timeout-cli
