15' + 1 hour record: Document-level shuffling #76
Conversation
Cycle WD in anti-phase with SWA's LR: low (0.77x base) at each SWA epoch start (high LR, exploration), high (1.23x base) at epoch end (near-zero LR, settling). Pre-SWA: hold at 1.0x then linear decay to 0.77x over 40% -> SWA start. Stacks on top of PR qlabs-eng#76 doc shuffling (3.204) for val loss ~3.200.
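A minimal sketch of that schedule as described, assuming the 40% point is measured as a fraction of the pre-SWA steps and that the multiplier scales a base weight decay; all names here are hypothetical, not this repo's code:

```python
def wd_multiplier(step, swa_start, swa_epoch_len,
                  lo=0.77, hi=1.23, decay_from=0.4):
    """Weight-decay multiplier cycled in anti-phase with SWA's LR."""
    if step < swa_start:
        # Pre-SWA: hold at 1.0x, then decay linearly to `lo` so the
        # first SWA epoch begins in the low-WD / high-LR phase.
        knee = int(decay_from * swa_start)
        if step < knee:
            return 1.0
        t = (step - knee) / max(1, swa_start - knee)
        return 1.0 + t * (lo - 1.0)
    # SWA phase: ramp `lo` -> `hi` within each epoch, opposite the LR decay.
    t = ((step - swa_start) % swa_epoch_len) / swa_epoch_len
    return lo + t * (hi - lo)
```

With a torch-style optimizer this would be applied each step as `optimizer.param_groups[0]["weight_decay"] = base_wd * wd_multiplier(step, ...)`.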
akshayvegesna left a comment:
Thank you, left a couple of minor comments. To merge the update to prepare_data, we also need to ensure that unlimited/train.py (and any other code) doesn't break. Can you update the Dataloader so that the no-document-level-shuffling path is equivalent to the previous behavior?
print()
verify_hash(val_path)
verify_hash(train_path)
for p in (val_path, train_path):
Can you add an assert that the hash doesn't change? It's nice to have confidence that nothing has changed upstream in the huggingface dataset.
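Something like the sketch below would do it (not the PR's code; the shard filenames and pinned digests are placeholders):

```python
import hashlib
import os

# Placeholder digests -- pin the real SHA-256 values once computed.
EXPECTED_SHA256 = {
    "val.bin": "<pinned sha256 hex digest>",
    "train.bin": "<pinned sha256 hex digest>",
}

def verify_hash(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    expected = EXPECTED_SHA256[os.path.basename(path)]
    assert h.hexdigest() == expected, (
        f"{path}: got {h.hexdigest()}, expected {expected} -- "
        "the upstream huggingface dataset may have changed"
    )
```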
@@ -12,12 +12,17 @@
import math
Since we don't have a record here, let's leave these file changes out until we do.
EVAL_TOKENS = 10_000_000
DATA_DIR = "fineweb_data"
BOS_ID = 50256  # <|endoftext|>
RUNS_DIR = "runs"
Let's add runs/ to .gitignore
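i.e. something like (assuming all run artifacts land under the runs/ directory from RUNS_DIR above):

```
# .gitignore
runs/
```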
def resolve_run_dir(run_name):
    if run_name:
        return run_name, os.path.join(RUNS_DIR, run_name)
    name = "".join(secrets.choice(string.ascii_lowercase + string.digits) for _ in range(6))
nit: I suggest we keep the datetime run id.
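For example (a sketch; the signature follows the diff above, while the timestamp format is my assumption):

```python
import os
from datetime import datetime

RUNS_DIR = "runs"  # as in the diff above

def resolve_run_dir(run_name):
    if run_name:
        return run_name, os.path.join(RUNS_DIR, run_name)
    # Datetime run id instead of a random suffix: sorts chronologically
    # and makes it obvious when a run was launched.
    name = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    return name, os.path.join(RUNS_DIR, name)
```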
Alright, all have been changed. All the non-record dataloaders (unlimited/, two_hour/, research/) should be token-for-token identical to before. Thanks!
Thank you, looks good. Merging.
Instead of shuffling the order of a fixed set of batches every epoch, shuffle the order of the documents in the training set, then tokenize into batches and re-shuffle.
15' log file:
Old record: 3.345
New record: 3.332
1hr log file:
Old record: 3.211
New record: 3.204
This takes just one of the many components of #67 so that comparison to previous results is straightforward. It does not change the eval tokens or batching at all -- all it does is make training shuffle the documents before re-batching. This adds training diversity, since any given token is preceded by different tokens every epoch (see the sketch below).
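A minimal sketch of the idea, assuming documents arrive as arrays of token ids; the function and argument names (including the doc_shuffle flag mirroring --no-doc-shuffle) are illustrative, not the PR's actual code:

```python
import numpy as np

def epoch_batches(docs, seq_len, doc_shuffle=True, seed=0):
    """Build one epoch of (seq_len,)-shaped training batches.

    docs: list of 1-D arrays of token ids, one per document.
    With doc_shuffle=False this reduces to the old behavior:
    a fixed token stream whose batches are merely reordered.
    """
    rng = np.random.default_rng(seed)
    order = np.arange(len(docs))
    if doc_shuffle:
        rng.shuffle(order)  # new: reorder documents each epoch
    stream = np.concatenate([np.asarray(docs[i]) for i in order])
    n = len(stream) // seq_len
    batches = stream[: n * seq_len].reshape(n, seq_len)
    return batches[rng.permutation(n)]  # both paths re-shuffle batch order
```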
I also added a 2-hour record attempt, which was pre-empted while running. Submitting now, hoping someone can run it. The code is backwards compatible with --no-doc-shuffle, and I added some QOL things like saving the log, training file, and checkpoints to a run directory. This not only makes dev easier, but also replication. I hope people adopt this standard of attaching logs.