training: expose --drop-caches-timeout-seconds; plumb to DLIO via env var #492
Open
FileSystemGuy wants to merge 5 commits into
… var

mlpstorage-side half of mlcommons/storage#487. The DLIO half lives at
mlcommons/DLIO_local_changes#28, which adds the DLIO_DROP_CACHES_TIMEOUT
env var so large-RAM hosts can raise the per-call timeout for the
per-epoch page-cache flush.

Add a deployment knob (--drop-caches-timeout-seconds) to the training
`run` subcommand only. Available in all modes (closed/open/whatif),
matching the existing pattern for --dlio-bin-path. Scoped to `run`
because the per-epoch flush only happens during training execution;
datasize/datagen/configview don't invoke it. Not exposed on
checkpointing/vectordb/kvcache because those benchmarks don't use the
auto-flush code path.

Plumbing:
1. argparse validates positive int (>= 1). Default None means no
   override; DLIO falls back to its built-in 30s.
2. TrainingBenchmark.__init__ sets os.environ['DLIO_DROP_CACHES_TIMEOUT']
   when the flag is provided. subprocess.Popen inherits the parent env,
   so single-process invocations see the value.
3. generate_dlio_command appends `-x DLIO_DROP_CACHES_TIMEOUT` to the
   MPI prefix when the env var is set. OpenMPI does not forward
   arbitrary env vars to ranks by default; -x opts this one in so
   multi-host training honors the operator's CLI choice.

Tests (25 cases):
- Flag scope: present on training run; rejected on training
  datasize/datagen/configview, checkpointing, vectordb, kvcache.
- Validation: positive ints accepted (1, 30, 300, 7200, 86400);
  0, negative, non-numeric, empty, float rejected.
- Env-var plumbing: set when flag provided, unset otherwise,
  stringified for subprocess.
- MPI forwarding: -x present when env set, absent when unset.
- End-to-end: full argparse -> env-set chain produces the expected
  state.

After the DLIO PR merges and the storage repo bumps its dlio-benchmark
pin, the user from #487 can do:

    mlpstorage closed training retinanet run file \
        --client-host-memory-in-gb 64 --num-accelerators 4 \
        --accelerator-type b200 --data-dir ... --results-dir ... \
        --drop-caches-timeout-seconds 300
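
For readers who want to see what the validation in plumbing step 1 amounts to, here is a minimal sketch of an argparse `type` callable that accepts only integers >= 1. It is an illustration, not the PR's code: `positive_int` and `build_parser` are hypothetical names (the PR's own helper is `_positive_int`, whose body is not shown here), and the parser is a stand-alone stand-in for the training `run` subparser. The error text mirrors the message quoted in the PR's test plan.

```python
import argparse


def positive_int(raw: str) -> int:
    """Hypothetical argparse type: accept only integers >= 1."""
    message = f"expected positive integer (>= 1), got {raw}"
    try:
        value = int(raw)  # rejects "", "abc", and floats such as "1.5"
    except ValueError:
        raise argparse.ArgumentTypeError(message) from None
    if value < 1:  # rejects 0 and negative values
        raise argparse.ArgumentTypeError(message)
    return value


def build_parser() -> argparse.ArgumentParser:
    """Stand-alone stand-in for the training `run` subparser (assumption)."""
    parser = argparse.ArgumentParser(prog="run-sketch")
    parser.add_argument(
        "--drop-caches-timeout-seconds",
        type=positive_int,
        default=None,  # None means "no override"; DLIO keeps its own default
        help="Per-call timeout for the per-epoch page-cache flush.",
    )
    return parser
```

With this sketch, `build_parser().parse_args(["--drop-caches-timeout-seconds", "300"])` yields `drop_caches_timeout_seconds == 300`, and omitting the flag leaves the attribute at `None`. Raising `argparse.ArgumentTypeError` lets argparse produce its usual `argument --drop-caches-timeout-seconds: ...` error and exit.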
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Resolves #475
Summary

mlpstorage-side half of #487. The DLIO half is mlcommons/DLIO_local_changes#28, which adds the `DLIO_DROP_CACHES_TIMEOUT` env var so large-RAM hosts can raise the per-call timeout for DLIO's per-epoch page-cache flush.

This PR adds a deployment knob to the training `run` subcommand only and plumbs the value through to DLIO via that env var.

Scope

- Training `run` only. `datasize`/`datagen`/`configview` do not invoke the per-epoch flush loop, so they don't need the knob.
- Not exposed on checkpointing/vectordb/kvcache: those benchmarks don't use the auto-flush code path (… `drop_caches` at all).
- Available in all modes (closed/open/whatif), matching the existing pattern for `--dlio-bin-path`.

Plumbing

1. `mlpstorage_py/cli/training_args.py`: argparse validates a positive int (>= 1) via a `_positive_int` type. The default `None` means no override; DLIO falls back to its built-in 30s.
2. `mlpstorage_py/benchmarks/dlio.py:TrainingBenchmark.__init__`: when the flag is provided, set `os.environ['DLIO_DROP_CACHES_TIMEOUT']` to the stringified value. `subprocess.Popen` inherits the parent env, so single-process invocations see it.
3. `generate_dlio_command()`: when the env var is set and `exec_type == MPI`, append `-x DLIO_DROP_CACHES_TIMEOUT` to the MPI prefix. OpenMPI does not forward arbitrary env vars to ranks by default; `-x` opts this one in so multi-host training honors the CLI choice.

Operator usage (post-merge)

After the DLIO PR lands and this repo's `dlio-benchmark` pin is bumped, operators can append `--drop-caches-timeout-seconds 300` to an `mlpstorage closed training retinanet run file` invocation. (Exporting `DLIO_DROP_CACHES_TIMEOUT=300` directly into DLIO's env is the alternative path; both arrive at the same place.)

Test plan

- `pytest tests/unit/test_drop_caches_timeout_cli.py`: 25 cases pass (flag scope, value validation, env-var plumbing, MPI forwarding, end-to-end).
- `pytest tests/unit --ignore=tests/unit/test_benchmarks_base.py --ignore=tests/unit/test_parquet_reader.py --ignore=tests/unit/test_vdb_modular_fake_backend.py`: 1399 pass, 4 skipped, 0 failures.
- Ran `parse_arguments()` with `--drop-caches-timeout-seconds 300`; confirmed `args.drop_caches_timeout_seconds == 300`, the env var is set, and the MPI prefix gets `-x DLIO_DROP_CACHES_TIMEOUT` appended.
- `--drop-caches-timeout-seconds -5` exits with `argument --drop-caches-timeout-seconds: expected positive integer (>= 1), got -5`.

Tests added (25 cases)

- Flag scope: present on training `run`; rejected on training `datasize`/`datagen`/`configview`, checkpointing, vectordb, kvcache.
- Validation: positive ints accepted (1, 30, 300, 7200, 86400); 0, negative, non-numeric, empty, float rejected.
- Env-var plumbing: set when flag provided, unset otherwise, stringified for subprocess.
- MPI forwarding: `-x DLIO_DROP_CACHES_TIMEOUT` present when env set, absent when unset.
- End-to-end: full argparse -> env-set chain produces the expected state.

Dependencies

This PR's CLI knob is functional regardless of the DLIO pin: the env var simply gets set in the subprocess environment. The behavior change in DLIO (slow-flush retry + reading the env var) requires mlcommons/DLIO_local_changes#28. The follow-up PR that bumps the `dlio-benchmark` pin in `pyproject.toml` will connect the two.
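
To make the env-var plumbing and MPI forwarding concrete, the following is a rough sketch of the two steps under stated assumptions. `apply_timeout_override` and `mpi_prefix` are hypothetical helper names, not functions from this PR; the real logic lives in `TrainingBenchmark.__init__` and `generate_dlio_command()` and writes to `os.environ` directly. The sketch takes the environment as a mapping argument so it can be exercised without mutating the process environment, and a boolean `use_mpi` stands in for the `exec_type == MPI` check.

```python
from typing import List, Mapping, MutableMapping, Optional

ENV_VAR = "DLIO_DROP_CACHES_TIMEOUT"


def apply_timeout_override(
    timeout_seconds: Optional[int],
    env: MutableMapping[str, str],
) -> None:
    """Set the env var only when the flag was provided (hypothetical helper)."""
    if timeout_seconds is not None:
        # Environment values must be strings for child processes.
        env[ENV_VAR] = str(timeout_seconds)


def mpi_prefix(
    base: List[str],
    env: Mapping[str, str],
    use_mpi: bool = True,
) -> List[str]:
    """Return the launcher prefix, forwarding the env var to ranks if set.

    `use_mpi` is an assumption standing in for the `exec_type == MPI` check.
    """
    prefix = list(base)  # copy so the caller's list is not mutated
    if use_mpi and ENV_VAR in env:
        # Open MPI's mpirun exports the named variable to ranks with -x.
        prefix += ["-x", ENV_VAR]
    return prefix
```

In real use the mapping would be `os.environ`, so a child started with `subprocess.Popen` inherits the value. For a prefix such as `["mpirun", "-n", "4"]`, the sketch appends `-x DLIO_DROP_CACHES_TIMEOUT` only when the override is present, which matches the "present when env set, absent when unset" test cases.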