implemented NCA pre-pretraining (val/loss - 3.207276) #63
Conversation
Hi Idhant, this is a really great contribution! A bunch of folks have been interested in NCA on Slowrun so this will be cool for them. A couple of questions:
Generally we want this repo to be a home for research on synthetic data from simple programs; maybe we make a separate track to support this. The unlimited/ track would probably not incentivize research here given its no-time-limit setting. Give us a little while to think that through, and let's develop the code in the dev/ directory for now.
thank you for the feedback! answering your questions:
happy to perform any other ablations / experiments on this, and to answer any questions :)
Yup the ablation would be great! And yes, I would suggest putting it into dev/nca/ with train, pre_pre_train, and run.
sounds good
also a further clarification: i didn't change the model size, it's still 2.7B
So the 2.7B in the README was the baseline during the initial release. The current parameter count in the main 1 hour track is:
oh okay, i see. for the time being i'll go with the 2.7B model (to save on time and compute lol) to define the baseline w/o NCA training. just another piece of feedback: i noticed that these small / minor details weren't mentioned clearly in the repo. for newer participants, it would be nice to clarify the baseline scores and hyper-params just to avoid confusion going forward.
Yeah sounds good. I do think the defaults are set properly to minimize confusion, but the 2.7B parameter point being confusing is fair; some folks got confused after we updated the baseline. Will clean that up soon.
done with the w/o NCA training. the val/loss is higher at 3.28241 compared to 3.207276; updated the wandb report. i noticed that w/o NCA tends to overfit faster than w/ NCA, reaching a lower train/loss sooner. also noticed that the w/ NCA run was much more stable than the w/o NCA run.
train/loss comparison
val/loss comparison

w/o NCA training script [also in …]:

WANDB_MODE=online torchrun --standalone --nproc_per_node=8 train.py \
--run lang-pre-train-v11-no-nca \
--n_embd 2560 --n_head 20 --n_layer 30 \
--num-epochs 25 \
--total-batch-size 524288 \
--lr_multiplier 0.25 \
--weight-decay 1.3 \
--dropout 0.1 \
--warmdown-ratio 0.2 \
--dupe-start-epoch 12 \
--logit-avg 7 \
--swa-last-epochs 6
w/o NCA training logs - pre-training-no-nca.log and no-nca-oom-error.log
I am a bit reluctant to merge this -- my expectation is that increasing the model size to 2.7B would make val loss significantly lower (even if you just run the main track code with only that change). Not sure if it will get to 3.207, but it should get close (we get 3.227 with a 1.4B model). I think we should tune the baseline w/o NCA so the comparison is fairer. I just want us to be confident that NCA gives the gain, and that the baseline size increase is not completely responsible for it. Your current new flags kind of confuse the story a bit, and I can't tell whether the w/o NCA baseline performs worse because of those flags or because removing NCA is genuinely what hurts. I think it would be sufficient to test the main track with the updated model size, and also try increasing the number of epochs to ensure it is trained for a comparable amount of time. Generally down to merge this if NCA shows any significant gain, but over the fair baseline. Fine to work at a smaller model scale to test this as well, so we can get confidence without spending a lot of compute.
that sounds fair. i can run the training again in order to make it comparable against the leaderboard. just to be clear, i'll be running the training on the 1.4B model. are there any other params i should keep matched? and would it be fine to have it run for more epochs / longer? so i'll be running 2 training runs: (1) w/ NCA and (2) w/o NCA. does that sound good?
Yes that sounds good! If there is a gain from NCA, this is good to go. Any parameters that you use in the Fineweb training with NCA should match the Fineweb training configuration without NCA (including epoch count).
sounds good! i'll just redo the same training run that i did before, with just the corrected model size.
When the paper came out, I tried running the NCA experiments using their repo but with 2 slight differences: a) using Muon on the 2d weights and b) using Fineweb as the dataset. The NCA gains held when using Adam and Fineweb. But when I used Muon, NCA actually hurt performance -- making it comparable to Adam w/ no NCA (step-wise comparison). So roughly, Muon (NCA) ~= Adam (no NCA), and Muon (no NCA) ~= Adam (NCA). My rough intuition for why is that the spectral normalization of Muon makes the weight updates "move more" than in Adam, kind of like how Muon erases the number encoded in the weights (see https://jeremybernste.in/writing/deriving-muon). Given this, I tried a gentler LR warmup when init'ing after NCA pre-pre-training, but that didn't change the results much.
Given that NCA is a fixed cost and Muon is a per-step overhead, maybe there is some step count where Adam+NCA does better than Muon w/ no NCA when measuring wall-clock time. I don't think this is the case, though, but it would be interesting for someone to explore more fully. Given the many implementation choices (e.g. maybe mixing the pre-pre-training data throughout pre-training, Muon requiring different NCA hparams, ...), a negative result here doesn't mean that much, but putting this here in case it is useful. I love the idea of algorithmic pre-pre-training and hope someone can get it to work :)
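To make the "move more" intuition a bit more concrete, here is a minimal sketch of the orthogonalization step Muon applies to each 2d update (a generic Newton-Schulz iteration with the commonly used coefficients, not the repo's actual optimizer code):

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Push every singular value of G toward 1 while keeping its singular vectors,
    # i.e. approximate the nearest semi-orthogonal matrix. Only the *directions*
    # of the update survive; the magnitude structure is flattened away.
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients
    X = G / (G.norm() + 1e-7)           # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# Adam rescales each gradient coordinate; Muon instead applies the step above to the
# momentum-averaged update of every 2d weight before taking the step.
```

Because every singular value of the step is pushed toward 1, each update is close to full-rank, which plausibly overwrites whatever structure NCA pre-pre-training baked into the weights faster than Adam's coordinate-wise steps would.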
@samacqua this is a very interesting observation and helpful in this case, thank you. also, i had a question: did you keep the optimizer constant (muon / adam) across both the pre-pre-training and pre-training phases? further, what about the other hyperparams like LR, WD, etc.? did you change / tune them, and if so, to what degree?
so, a couple of training runs with slightly different configs, with the model parameters being -. didn't really get that much success in terms of val/loss. ran each of the runs for 25 ep while changing some of the NCA-specific params; final min val/loss -. after looking into what @samacqua mentioned, i have an idea or two in mind to try: a hybrid optimizer approach wherein AdamW is applied only to the matrix params that were loaded from the NCA checkpoint, and Muon to everything else. a sketch of that split is below. but apart from that, if there's any other feedback, i'm open to it.
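roughly what i have in mind for the hybrid split, as a sketch (the helper name is hypothetical, it assumes the NCA checkpoint's keys tell us which matrices were loaded, and the Muon class is passed in since its exact signature depends on the implementation -- not the repo's actual code):

```python
import torch

def build_hybrid_optimizers(model, nca_ckpt_keys, muon_cls, lr=6e-4, weight_decay=0.1):
    # Hypothetical split: AdamW on the 2d weights that were initialized from the
    # NCA checkpoint, Muon on every other 2d weight, and AdamW without weight decay
    # on the remaining 1d params (norms, biases, embeddings).
    nca_matrices, other_matrices, non_matrices = [], [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and name in nca_ckpt_keys:
            nca_matrices.append(p)
        elif p.ndim >= 2:
            other_matrices.append(p)
        else:
            non_matrices.append(p)

    adamw = torch.optim.AdamW(
        [{"params": nca_matrices},
         {"params": non_matrices, "weight_decay": 0.0}],
        lr=lr, weight_decay=weight_decay, betas=(0.9, 0.95),
    )
    # muon_cls is whatever Muon implementation is available; kwargs are illustrative.
    muon = muon_cls(other_matrices, lr=0.02, momentum=0.95)
    return adamw, muon
```

each training step would then call .step() (and .zero_grad()) on both optimizers.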




this PR includes the implementation of NCA pre-pretraining (original paper: https://arxiv.org/abs/2603.10055, original open-source code: https://github.com/danihyunlee/nca-pre-pretraining)
final best val/loss: 3.207276
total training time: 71m (pre-pre-training) + 240m (pre-training) = 5h 11m
total wall time: 113m (pre-pre-training) + 272m (pre-training) = 6h 25m
training command / script used for this is present in run.sh
training stats report: wandb report
nca pre-pre-training:
implementation for the nca pre-pretraining is present in pre_pre_train.py
for this, 50M tokens were used per epoch for 6 epochs (i.e., 300M total NCA tokens)
WANDB_MODE=online torchrun --standalone --nproc_per_node=8 pre_pre_train.py \
--tokens-per-epoch 50000000 --num-eval-tokens 16000000 \
--num-epochs 6 \
--regen-data \
--grid 12 --patch 2 --num-colors 10 \
--temperature 0.0001 --dT 1 --init-rollout-steps 10 \
--filter-rules --filter-threshold 0.5 --filter-upper-bound 1.0 \
--n-layer 30 --n-head 20 --n-embd 2560 \
--seq-len 1024 --window-pattern SSSL \
--device-batch-size 8 --total-batch-size 65536 \
--lr 6e-4 --weight-decay 0.3 --dropout 0.1 \
--warmdown-ratio 0.15 \
--save-dir nca_ckpts/ppt_v8 \
--run nca-ppt-v8
final train/loss: 3.339553
final val/loss: 3.420781
log - nca-ppt-training.log
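pre_pre_train.py itself isn't reproduced here, but as a rough illustration of what the flags above refer to (a 12x12 grid, 10 colors, short rollouts), a toy cellular-automaton rollout serialized into tokens could look something like the sketch below. the names and the rule sampling are illustrative only, not the actual implementation (it ignores --patch, --temperature, and the rule filtering):

```python
import numpy as np

def sample_ca_sequence(grid=12, num_colors=10, rollout_steps=10, rng=None):
    # Toy illustration: evolve a random grid under a randomly sampled local rule
    # and serialize the rollout as a flat token sequence (one token per cell).
    rng = rng or np.random.default_rng()
    state = rng.integers(0, num_colors, size=(grid, grid))
    # Random table-based rule on the sum of the 3x3 neighborhood (illustrative only).
    rule = rng.integers(0, num_colors, size=9 * (num_colors - 1) + 1)
    tokens = list(state.flatten())
    for _ in range(rollout_steps):
        padded = np.pad(state, 1, mode="wrap")
        neigh_sum = sum(
            padded[1 + dy : 1 + dy + grid, 1 + dx : 1 + dx + grid]
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        )
        state = rule[neigh_sum]
        tokens.extend(state.flatten())
    return tokens  # would then be packed into --seq-len sized chunks for training
```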
pre-training:
also some edits were made to train.py in order to accommodate pre-pre-training, like the new --pretrained-ckpt / --nca-* flags used below:
WANDB_MODE=online torchrun --standalone --nproc_per_node=8 train.py \
--run lang-pre-train-v10 \
--pretrained-ckpt nca_ckpts/ppt_v8/nca_pretrained_best.pt \
--n_embd 2560 --n_head 20 --n_layer 30 \
--num-epochs 25 \
--total-batch-size 524288 \
--lr_multiplier 0.25 \
--weight-decay 1.3 \
--dropout 0.1 \
--nca-load-mode attn \
--nca-warmup-steps 250 --nca-rampup-steps 200 \
--warmdown-ratio 0.2 \
--dupe-start-epoch 12 \
--logit-avg 5 \
--swa-last-epochs 4
log - pre-training.log
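the train.py diff isn't shown inline, but roughly, --pretrained-ckpt / --nca-load-mode attn load a subset of the NCA checkpoint and --nca-warmup-steps / --nca-rampup-steps ease the loaded weights back into normal training. a minimal sketch of that kind of logic (function names and the LR-scale interpretation of the warmup/rampup flags are assumptions, not the actual diff):

```python
import torch

def load_nca_checkpoint(model, ckpt_path, load_mode="attn"):
    # Load only the matching subset of weights from the NCA pre-pre-training
    # checkpoint, e.g. just the attention projections when load_mode == "attn".
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt.get("model", ckpt)
    if load_mode == "attn":
        state = {k: v for k, v in state.items() if "attn" in k}
    missing, _unexpected = model.load_state_dict(state, strict=False)
    print(f"loaded {len(state)} tensors, {len(missing)} params left at random init")

def nca_lr_scale(step, warmup_steps=250, rampup_steps=200):
    # Hold the LR low for warmup_steps, then ramp linearly up to 1.0 over
    # rampup_steps, so the pre-pre-trained weights aren't blown away immediately.
    if step < warmup_steps:
        return 0.1
    return min(1.0, 0.1 + 0.9 * (step - warmup_steps) / rampup_steps)
```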
compute
all the training was done on a single node of 8xH100 (via modal.com)