Set scaling-study configs to exactly one epoch #71
Open
ksd3 wants to merge 1 commit into
max_iters was an inherited placeholder (30000), which at effective batch 640 is 19.2M presentations = ~2.27 passes over the 8,474,566-galaxy train set. Set max_iters = 13241 = floor(8,474,566 / 640) so each run does exactly one epoch (~2.17B tokens, ~Chinchilla-optimal for the 100M model), and shrink lr_decay_iters to match so the cosine LR fully decays to min_lr by the end of the run rather than stopping mid-decay.
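The step count above can be re-derived from the three figures quoted in the description. The following is a minimal sketch for checking the arithmetic; the helper name `epoch_budget` and the constant names are illustrative and not taken from the repository.

```python
# Figures quoted in the PR description.
TRAIN_EXAMPLES = 8_474_566   # galaxies in the train split
EFFECTIVE_BATCH = 640        # examples consumed per optimizer step
OLD_MAX_ITERS = 30_000       # inherited placeholder value


def epoch_budget(n_examples: int, batch: int) -> tuple[int, int]:
    """Return (full optimizer steps in one pass, examples left over)."""
    return divmod(n_examples, batch)


# Hypothetical helper, used here only to reproduce the quoted numbers.
new_max_iters, leftover = epoch_budget(TRAIN_EXAMPLES, EFFECTIVE_BATCH)

# How much data the old placeholder would have consumed.
old_presentations = OLD_MAX_ITERS * EFFECTIVE_BATCH
old_passes = old_presentations / TRAIN_EXAMPLES

print(f"max_iters for one pass: {new_max_iters} (leftover examples: {leftover})")
print(f"old setting: {old_presentations:,} presentations, {old_passes:.2f} passes")
```

Using `divmod` also exposes the leftover examples, since the train set size is not a multiple of the effective batch.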
8abd9ec to
8f8ad66
`max_iters = 30000` in the `config/pythia-like` configs was an inherited nanoGPT placeholder, not derived from the dataset. At effective batch 640 that's 19.2M presentations = ~2.27 passes over the 8,474,566-galaxy `Smith42/galaxies` train split.

This sets `max_iters = 13241 = floor(8,474,566 / 640)`, one pass over the train set (~2.17B image-patch tokens, ≈ Chinchilla-optimal for the 100M run), and shrinks `lr_decay_iters` to match so the cosine LR fully decays to `min_lr` by the end of the run rather than stopping mid-decay.

(Note: 13241 is one pass minus a 326-example remainder, since 8,474,566 isn't divisible by the effective batch; truly exact one-pass coverage under DDP needs len-aware sampling, which is a separate change.)
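To illustrate why `lr_decay_iters` has to shrink along with `max_iters`, here is a rough sketch of a warmup-plus-cosine schedule in the style nanoGPT uses. The function name `cosine_lr` and the values for `learning_rate`, `min_lr`, and `warmup_iters` are placeholder assumptions, not the values in these configs, and the old `lr_decay_iters` is assumed to have been 30000 to match the old `max_iters`.

```python
import math


def cosine_lr(it, *, learning_rate, min_lr, warmup_iters, lr_decay_iters):
    """Linear warmup, then cosine decay from learning_rate down to min_lr."""
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    if it >= lr_decay_iters:
        return min_lr
    progress = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + coeff * (learning_rate - min_lr)


# Illustrative hyperparameters only (assumed, not read from the PR).
HYPER = dict(learning_rate=6e-4, min_lr=6e-5, warmup_iters=2000)
MAX_ITERS = 13241

# Decay horizon left at the assumed old value: the run ends part-way down the cosine.
lr_end_old = cosine_lr(MAX_ITERS, lr_decay_iters=30000, **HYPER)
# Decay horizon matched to max_iters: the run ends at min_lr.
lr_end_new = cosine_lr(MAX_ITERS, lr_decay_iters=MAX_ITERS, **HYPER)

print(f"final LR with lr_decay_iters=30000: {lr_end_old:.2e}")
print(f"final LR with lr_decay_iters=13241: {lr_end_new:.2e}")
```

With the mismatched horizon the final learning rate is still several times `min_lr`, which is the mid-decay stop the description refers to.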