Reproducible streaming data order across runs (fixed data_seed + pinned num_workers)#70
Open
ksd3 wants to merge 2 commits into
Shuffle the streaming train set with a fixed `data_seed` BEFORE sharding across DDP ranks, rather than after sharding with a per-rank seed (`1337 + seed_offset`). The global example order now depends only on `data_seed` -- not on the GPU count or the per-rank model seed -- so every run/config in a scaling study trains on the identical example order, while each rank still takes a deterministic, disjoint slice of that one canonical order. Model-init/DDP seeding (`torch.manual_seed(1337 + seed_offset)`) is unchanged.
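The shuffle-then-shard idea can be illustrated with a minimal pure-Python sketch. This is not the code in `scripts/train.py`: the function names `canonical_order` and `rank_slice` are hypothetical, the example set is a small in-memory list, and a full shuffle stands in for whatever buffered shuffle the real stream uses. It only shows why one rank-independent seed plus strided slicing gives the same global order for any rank count.

```python
import random
from itertools import islice

DATA_SEED = 1337  # fixed; independent of rank and GPU count


def canonical_order(examples, data_seed=DATA_SEED):
    """Shuffle the whole example list once with a rank-independent seed."""
    order = list(examples)
    random.Random(data_seed).shuffle(order)
    return order


def rank_slice(order, rank, world_size):
    """Deterministic, disjoint slice: every world_size-th example, starting at rank."""
    return list(islice(order, rank, None, world_size))


examples = list(range(20))  # toy stand-in for the streaming train set
order = canonical_order(examples)

for world_size in (1, 2, 4):
    slices = [rank_slice(order, r, world_size) for r in range(world_size)]
    # Reading one example per rank per step reproduces the canonical order.
    rebuilt = [slices[i % world_size][i // world_size] for i in range(len(order))]
    print(world_size, rebuilt == order)
```

Because the seed never involves `seed_offset`, changing the number of ranks only changes how the one canonical order is dealt out, not the order itself.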
Lower `num_workers` from 32 to 8 in every config/pythia-like config. The values were already identical across configs (so already reproducible), but 32 workers/rank x 4 ranks fans out to ~128 concurrent HF range-requests, which get rate-limited (HTTP 429) when streaming the 764 GB Smith42/galaxies set. 8/rank keeps ~32 streams/node -- 429-safe -- and, being a single fixed value across all 12 configs, means no launcher needs to override `num_workers` per-run (which would change DataLoader worker interleaving and so the example order). Together with the fixed `data_seed` shuffle, this makes the data order genuinely identical across the scaling study.
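To see why an overridden `num_workers` changes the example order, here is a toy model of worker interleaving. It is an assumption-laden simulation, not PyTorch code: worker `w` is assumed to read every `num_workers`-th example, and the loader is assumed to take one batch from each worker in turn. How a real streaming dataset assigns shards to workers may differ; the function name `simulated_loader_order` is hypothetical.

```python
from itertools import islice


def simulated_loader_order(stream, num_workers, batch_size):
    """Toy model: strided split across workers, batches consumed round-robin."""
    per_worker = [list(islice(stream, w, None, num_workers)) for w in range(num_workers)]
    batches = [
        [shard[i:i + batch_size] for i in range(0, len(shard), batch_size)]
        for shard in per_worker
    ]
    order = []
    for step in range(max(len(b) for b in batches)):
        for worker_batches in batches:
            if step < len(worker_batches):
                order.extend(worker_batches[step])
    return order


stream = list(range(16))  # already-shuffled canonical order, as a toy list
order_2 = simulated_loader_order(stream, num_workers=2, batch_size=2)
order_4 = simulated_loader_order(stream, num_workers=4, batch_size=2)
print(order_2[:8])  # [0, 2, 1, 3, 4, 6, 5, 7]
print(order_4[:8])  # [0, 4, 1, 5, 2, 6, 3, 7]
```

Both runs see the same examples, but in a different sequence, which is why a single fixed worker count across configs matters for identical batch order.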
Force-pushed: 95c393b to 80132c5, then 80132c5 to f21d4cd.
Make the streaming training data appear in the same example order for every run/config, so a scaling study compares model/objective on identical data.
1. `scripts/train.py`: fixed `data_seed`, shuffle before shard. The stream was shuffled after node-sharding with a rank-dependent seed (`1337 + seed_offset`), so the global order depended on GPU count and the model seed. Now it shuffles the full stream with a fixed `data_seed = 1337` before sharding; each rank takes a deterministic disjoint slice. Model/DDP seeding unchanged; single-GPU behavior unchanged.
2. `config/pythia-like/*.py`: pin `num_workers` 32 → 8. Byte-identical batch order also needs the same `num_workers` (DataLoader worker interleaving sets the order). 32/rank × 4 ranks ≈ 128 concurrent HF range-requests on the 764 GB stream → HTTP 429, forcing per-run overrides that break reproducibility. A fixed `8` (~32 streams/node) is 429-safe and identical across all 12 configs.

(The one-epoch `max_iters` change is split out into #71.)
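The stream counts quoted in item 2 follow from a simple product. The sketch below assumes each DataLoader worker holds roughly one open range-request at a time, which is a simplification; `concurrent_streams` is a hypothetical helper, not part of the repository.

```python
def concurrent_streams(num_workers_per_rank, ranks_per_node):
    """Approximate simultaneous streams per node, assuming one request per worker."""
    return num_workers_per_rank * ranks_per_node


print(concurrent_streams(32, 4))  # 128, the setting that hit HTTP 429
print(concurrent_streams(8, 4))   # 32, the pinned setting
```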