training: expose --drop-caches-timeout-seconds; plumb to DLIO via env var #492

Open
FileSystemGuy wants to merge 5 commits into main from feat/drop-caches-timeout-cli

@FileSystemGuy

Contributor

Summary

mlpstorage-side half of #487. The DLIO half is mlcommons/DLIO_local_changes#28, which adds the DLIO_DROP_CACHES_TIMEOUT env var so large-RAM hosts can raise the per-call timeout for DLIO's per-epoch page-cache flush.

This PR adds a deployment knob to the training run subcommand only and plumbs the value through to DLIO via that env var.

Scope

  • Subcommand: training run only. datasize / datagen / configview do not invoke the per-epoch flush loop, so they don't need the knob.
  • Benchmark: training only. Checkpointing, VectorDB, and KV Cache do not use the auto-flush code path (checkpointing uses O_DIRECT / fadvise on the data path and a manual operator-side clear between phases; VDB/KVCache don't invoke drop_caches at all).
  • Mode: all (closed/open/whatif). It's a deployment knob, not a submission tunable, matching the existing pattern for --dlio-bin-path.

Plumbing

  1. CLI (mlpstorage_py/cli/training_args.py): argparse validates positive int (>= 1) via a _positive_int type. Default None means no override; DLIO falls back to its built-in 30s.
  2. Env-var set (mlpstorage_py/benchmarks/dlio.py:TrainingBenchmark.__init__): when the flag is provided, set os.environ['DLIO_DROP_CACHES_TIMEOUT'] to the stringified value. subprocess.Popen inherits parent env, so single-process invocations see it.
  3. MPI forwarding (generate_dlio_command()): when the env var is set and exec_type == MPI, append -x DLIO_DROP_CACHES_TIMEOUT to the MPI prefix. OpenMPI does not forward arbitrary env vars to ranks by default; -x opts this one in so multi-host training honors the CLI choice.
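The three plumbing steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `build_parser`, `apply_timeout_env`, and `mpi_prefix` are hypothetical helper names, the environment is passed in as a plain mapping so the sketch is easy to test, and only `_positive_int`, the flag name, and the env-var name come from the description.

```python
import argparse


def _positive_int(value):
    """argparse type callable: accept only integers >= 1."""
    try:
        parsed = int(value)
    except ValueError:
        parsed = 0  # non-numeric, empty, and float strings share the error below
    if parsed < 1:
        raise argparse.ArgumentTypeError(
            f"expected positive integer (>= 1), got {value}"
        )
    return parsed


def build_parser():
    # Hypothetical stand-in for the real `training run` subparser.
    parser = argparse.ArgumentParser(prog="training-run-sketch")
    parser.add_argument(
        "--drop-caches-timeout-seconds",
        type=_positive_int,
        default=None,  # None means "no override"
    )
    return parser


def apply_timeout_env(args, env):
    """Step 2: set the env var only when the flag was provided."""
    if args.drop_caches_timeout_seconds is not None:
        env["DLIO_DROP_CACHES_TIMEOUT"] = str(args.drop_caches_timeout_seconds)


def mpi_prefix(base, env, use_mpi=True):
    """Step 3: opt the variable in to rank forwarding with -x."""
    prefix = list(base)
    if use_mpi and "DLIO_DROP_CACHES_TIMEOUT" in env:
        prefix += ["-x", "DLIO_DROP_CACHES_TIMEOUT"]
    return prefix
```

In the real code path the mapping would presumably be `os.environ`; taking it as a parameter here just keeps the sketch free of global side effects.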

Operator usage (post-merge)

After the DLIO PR lands and this repo's dlio-benchmark pin is bumped:

mlpstorage closed training retinanet run file \
    --client-host-memory-in-gb 64 --num-accelerators 4 \
    --accelerator-type b200 --data-dir ... --results-dir ... \
    --drop-caches-timeout-seconds 300

(DLIO_DROP_CACHES_TIMEOUT=300 exported directly into DLIO's env is the alternative path; both arrive at the same place.)
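To illustrate why both paths "arrive at the same place", here is a rough sketch of how a consumer could resolve the timeout from the environment. The DLIO-side code is not shown in this PR, so `resolve_drop_caches_timeout` is a hypothetical name, and falling back to the default on an unparsable or non-positive value is an assumption; only the variable name and the 30s built-in default come from the description.

```python
import os

DEFAULT_TIMEOUT_SECONDS = 30  # built-in default mentioned in the description


def resolve_drop_caches_timeout(env=None):
    """Return the per-call flush timeout in seconds.

    Hypothetical helper: how DLIO treats invalid values is an assumption.
    """
    env = os.environ if env is None else env
    raw = env.get("DLIO_DROP_CACHES_TIMEOUT")
    if raw is None:
        return DEFAULT_TIMEOUT_SECONDS
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_TIMEOUT_SECONDS
    return value if value >= 1 else DEFAULT_TIMEOUT_SECONDS
```

Whether the value was set by the CLI flag or exported by the operator, the consumer only ever sees the string in its environment.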

Test plan

  • pytest tests/unit/test_drop_caches_timeout_cli.py — 25 cases pass (flag scope, value validation, env-var plumbing, MPI forwarding, end-to-end).
  • pytest tests/unit --ignore=tests/unit/test_benchmarks_base.py --ignore=tests/unit/test_parquet_reader.py --ignore=tests/unit/test_vdb_modular_fake_backend.py — 1399 pass, 4 skipped, 0 failures.
  • Manual end-to-end: invoked parse_arguments() with --drop-caches-timeout-seconds 300, confirmed args.drop_caches_timeout_seconds == 300, the env var is set, and the MPI prefix gets -x DLIO_DROP_CACHES_TIMEOUT appended.
  • Manual rejection: --drop-caches-timeout-seconds -5 exits with argument --drop-caches-timeout-seconds: expected positive integer (>= 1), got -5.

Tests added (25 cases)

  • TestFlagScope (6 cases): flag present on training run; rejected on training datasize/datagen/configview, checkpointing run, vectordb run, kvcache run.
  • TestValueValidation (12 cases): positive ints accepted (1, 30, 300, 7200, 86400); 0, negative, non-numeric, empty, float rejected; default is None when omitted.
  • TestEnvVarPlumbing (3 cases): env var set when flag provided, unset otherwise, stored as string (not int).
  • TestMpiForwarding (2 cases): -x DLIO_DROP_CACHES_TIMEOUT present when env set, absent when unset.
  • TestEndToEnd (1 case): full argparse → env-set chain reaches the expected state.
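For reviewers who want to see the env-var behavior in isolation, the following standalone sketch shows the two properties the plumbing tests rely on: a child process sees what is in the environment it is given, and `os.environ` only accepts strings, which is why the value is stringified. `child_sees` is a hypothetical helper, not part of the test file in this PR.

```python
import os
import subprocess
import sys


def child_sees(name, env):
    """Spawn a child Python process and report the value it sees for `name`."""
    code = f"import os; print(os.environ.get({name!r}, ''))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def rejects_non_string(value):
    """Return True if os.environ refuses to store `value` (non-str input)."""
    try:
        os.environ["DLIO_SKETCH_THROWAWAY_KEY"] = value
    except TypeError:
        return True
    # Clean up if the assignment unexpectedly succeeded.
    os.environ.pop("DLIO_SKETCH_THROWAWAY_KEY", None)
    return False
```

Passing `env=None` to `subprocess.run` makes the child inherit the parent environment, which is the single-process case described under Plumbing.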

Dependencies

This PR's CLI knob is functional regardless of the DLIO pin — the env var simply gets set in the subprocess environment. The behavior change in DLIO (slow-flush retry + reading the env var) requires mlcommons/DLIO_local_changes#28. The follow-up PR that bumps the dlio-benchmark pin in pyproject.toml will connect the two.

mlpstorage-side half of mlcommons/storage #487.  The DLIO half lives at
mlcommons/DLIO_local_changes #28, which adds the DLIO_DROP_CACHES_TIMEOUT
env var so large-RAM hosts can raise the per-call timeout for the
per-epoch page-cache flush.

Add a deployment knob (--drop-caches-timeout-seconds) to the training
`run` subcommand only.  Available in all modes (closed/open/whatif),
matching the existing pattern for --dlio-bin-path.  Scoped to `run`
because the per-epoch flush only happens during training execution —
datasize/datagen/configview don't invoke it.  Not exposed on
checkpointing/vectordb/kvcache because those benchmarks don't use the
auto-flush code path.

Plumbing:
  1. argparse validates positive int (>= 1).  Default None means no
     override; DLIO falls back to its built-in 30s.
  2. TrainingBenchmark.__init__ sets os.environ['DLIO_DROP_CACHES_TIMEOUT']
     when the flag is provided.  subprocess.Popen inherits the parent env,
     so single-process invocations see the value.
  3. generate_dlio_command appends `-x DLIO_DROP_CACHES_TIMEOUT` to the
     MPI prefix when the env var is set.  OpenMPI does not forward
     arbitrary env vars to ranks by default; -x opts this one in so
     multi-host training honors the operator's CLI choice.

Tests (25 cases):
  - Flag scope: present on training run; rejected on training
    datasize/datagen/configview, checkpointing, vectordb, kvcache.
  - Validation: positive ints accepted (1, 30, 300, 7200, 86400);
    0, negative, non-numeric, empty, float rejected.
  - Env-var plumbing: set when flag provided, unset otherwise,
    stringified for subprocess.
  - MPI forwarding: -x present when env set, absent when unset.
  - End-to-end: full argparse -> env-set chain produces the expected
    state.

After the DLIO PR merges and the storage repo bumps its dlio-benchmark
pin, the user from #487 can do:
    mlpstorage closed training retinanet run file \
        --client-host-memory-in-gb 64 --num-accelerators 4 \
        --accelerator-type b200 --data-dir ... --results-dir ... \
        --drop-caches-timeout-seconds 300
@FileSystemGuy FileSystemGuy requested a review from a team June 23, 2026 01:12
@github-actions

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@FileSystemGuy FileSystemGuy marked this pull request as draft June 23, 2026 01:12
@FileSystemGuy

Contributor Author

Resolves #475
