implemented NCA pre-pretraining (val/loss - 3.207276)#63

Open
idhantgulati wants to merge 15 commits into qlabs-eng:main from idhantgulati:main

Conversation


@idhantgulati idhantgulati commented Apr 7, 2026

this PR includes the implementation of NCA pre-pretraining (original paper: https://arxiv.org/abs/2603.10055, original open-source code: https://github.com/danihyunlee/nca-pre-pretraining)

final best val/loss: 3.207276

total training time: 71m (pre-pre-training) + 240m (pre-training) = 5h 11m
total wall time: 113m (pre-pre-training) + 272m (pre-training) = 6h 25m

training command / script used for this is present in run.sh

training stats report: wandb report

nca pre-pre-training:

the implementation of the nca pre-pretraining is present in pre_pre_train.py

for this, 50M tokens were used per epoch for 6 epochs (i.e., 300M total NCA tokens); a rough sketch of the data generation is included after the log link below

WANDB_MODE=online torchrun --standalone --nproc_per_node=8 pre_pre_train.py \
    --tokens-per-epoch 50000000 --num-eval-tokens 16000000 \
    --num-epochs 6 \
    --regen-data \
    --grid 12 --patch 2 --num-colors 10 \
    --temperature 0.0001 --dT 1 --init-rollout-steps 10 \
    --filter-rules --filter-threshold 0.5 --filter-upper-bound 1.0 \
    --n-layer 30 --n-head 20 --n-embd 2560 \
    --seq-len 1024 --window-pattern SSSL \
    --device-batch-size 8 --total-batch-size 65536 \
    --lr 6e-4 --weight-decay 0.3 --dropout 0.1 \
    --warmdown-ratio 0.15 \
    --save-dir nca_ckpts/ppt_v8 \
    --run nca-ppt-v8

final train/loss: 3.339553
final val/loss: 3.420781

log - nca-ppt-training.log
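
for intuition, here is a rough, hypothetical sketch of what NCA-style data generation could look like given the flags above (random local-rule cellular-automaton rollouts, tokenized patch by patch). the rule family and tokenization here are illustrative assumptions on my part, not the actual pre_pre_train.py code:

import numpy as np

def rollout(grid=12, num_colors=10, steps=10, rng=None):
    # roll a randomly sampled local-update rule forward `steps` times
    rng = rng or np.random.default_rng()
    state = rng.integers(0, num_colors, size=(grid, grid))
    # hypothetical rule family: next color is a lookup on (own color, neighbor sum)
    max_nbr_sum = 4 * (num_colors - 1)
    rule = rng.integers(0, num_colors, size=(num_colors, max_nbr_sum + 1))
    frames = [state.copy()]
    for _ in range(steps):
        nbr_sum = (np.roll(state, 1, 0) + np.roll(state, -1, 0)
                   + np.roll(state, 1, 1) + np.roll(state, -1, 1))
        state = rule[state, nbr_sum]
        frames.append(state.copy())
    return frames

def to_tokens(frames, patch=2):
    # flatten each frame patch-by-patch so local structure stays contiguous
    toks = []
    for f in frames:
        for i in range(0, f.shape[0], patch):
            for j in range(0, f.shape[1], patch):
                toks.extend(int(c) for c in f[i:i+patch, j:j+patch].ravel())
    return toks

the --filter-rules / --filter-threshold flags presumably discard rules whose rollouts are too trivial or too chaotic; the exact criterion is in pre_pre_train.py.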

pre-training:

also, some edits were made to train.py to accommodate pre-pre-training (e.g., the --pretrained-ckpt and --nca-* flags used below; a hypothetical sketch of the selective loading follows the log link):

WANDB_MODE=online torchrun --standalone --nproc_per_node=8 train.py \
    --run lang-pre-train-v10 \
    --pretrained-ckpt nca_ckpts/ppt_v8/nca_pretrained_best.pt \
    --n_embd 2560 --n_head 20 --n_layer 30 \
    --num-epochs 25 \
    --total-batch-size 524288 \
    --lr_multiplier 0.25 \
    --weight-decay 1.3 \
    --dropout 0.1 \
    --nca-load-mode attn \
    --nca-warmup-steps 250 --nca-rampup-steps 200 \
    --warmdown-ratio 0.2 \
    --dupe-start-epoch 12 \
    --logit-avg 5 \
    --swa-last-epochs 4

log - pre-training.log
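
for reference, a minimal hypothetical sketch of what --nca-load-mode attn might do (copying only the attention weights from the NCA checkpoint). the "attn" substring match and the checkpoint layout are my assumptions, not the actual train.py code:

import torch

def load_nca_checkpoint(model, ckpt_path, mode="attn"):
    # hypothetical selective load: mode="attn" keeps only attention weights,
    # mode="all" keeps everything (parameter names assumed to contain "attn")
    state = torch.load(ckpt_path, map_location="cpu")
    if mode == "attn":
        state = {k: v for k, v in state.items() if "attn" in k}
    missing, unexpected = model.load_state_dict(state, strict=False)
    return missing, unexpected

presumably --nca-warmup-steps / --nca-rampup-steps then hold down and ramp up the learning rate (or contribution) of the loaded weights early in pre-training; the exact mechanism is in train.py.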

compute

all the training was done on a single node of 8xH100 (via modal.com)

PS: noticed that towards the end of training the model started to overfit, which made the val/loss increase. hence, the last couple of epochs were not included in the final best val/loss.

also thanks to @MarmikChaudhari for the initial idea :)

@akshayvegesna
Contributor

Hi Idhant, this is a really great contribution! A bunch of folks have been interested in NCA on Slowrun, so this will be cool for them.

A couple questions:

  1. Have you ablated the contribution of NCA in this PR? We should check that it helps at least a little bit in improving the performance of the pre-training. I would assume that some of your changes, like increasing the model size, improve the pre-training itself, so we should ensure that all the gains don't just come from there.
  2. I think we should merge this into the dev/ folder for now, in an nca/ directory. For new folks, we don't want the iteration time on the main track to be too long. Can you make that change? Then this is good to go.

Generally we want this repo to be a home for research on synthetic data from simple programs -- maybe we make a separate track to support this. The unlimited/ track would probably not incentivize research here given the no-time-limit setting. Give us a little while to think that through, and let's develop the code in the dev/ directory for now.

@idhantgulati
Author

thank you for the feedback!

answering your questions,

  1. yes, did a training run with and without NCA, but since then some things have diverged. i'll run another training run (w/o NCA) while keeping everything else the same and will post the comparison here by tomorrow or so. one thing i noticed is that w/o NCA seems to converge faster initially, i believe just because of the warmup and rampup steps.
  2. sounds good. just so i understand, i should put my changes into dev/, right?

happy to perform any other ablations / experiments on this, and answer any questions :)

@akshayvegesna
Contributor

Yup, the ablation would be great! And yes, I would suggest putting it into dev/nca/ with train, pre_pre_train, and run.

@idhantgulati
Author

sounds good

I would assume that some of your changes like increasing the model size improves the pre-training itself

also a further clarification: i didn't change the model size. it's still 2.7B [Parameters: 2,734,834,155 (transformer: 2,378,973,600, value_embeds: 98,304,000, lm_head: 128,778,240, other: 128,778,315)]

@akshayvegesna
Contributor

So the 2.7B in the README was the baseline during the initial release.

The current parameter count in the main 1 hour track is:
Parameters: 1,398,287,931 (transformer: 1,169,829,360, value_embeds: 48,168,960, lm_head: 90,144,768, other: 90,144,843)
So relative to that there is a decent bump in the parameter count, probably due in particular to n-embd=2560.
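
(As a rough sanity check, the standard 12 * n_layer * n_embd^2 estimate for the transformer blocks roughly reproduces the reported count, assuming that approximation applies to this architecture:)

# rough transformer-block parameter estimate: 12 * n_layer * n_embd^2
n_layer, n_embd = 30, 2560
print(12 * n_layer * n_embd ** 2)  # 2,359,296,000 vs the reported 2,378,973,600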

@idhantgulati
Author

idhantgulati commented Apr 7, 2026

oh okay. i see.

for the time being, i'll go with the 2.7B model (to save on time and compute lol) to define the base / w/o NCA training.

just some additional feedback: i noticed that these small / minor details aren't mentioned clearly in the repo. for newer participants, it would be nice to clarify the baseline scores and hyper-params just to avoid confusion going forward.

@akshayvegesna
Contributor

Yeah sounds good. I do think the defaults are set properly to minimize confusion, but the 2.7B parameter point being confusing is fair; some folks got confused after we updated the baseline. Will clean that up soon.

@idhantgulati
Author

idhantgulati commented Apr 8, 2026

done with the w/o NCA training, and its val/loss is higher at 3.28241 compared to 3.207276.

updated the wandb report

i noticed that w/o NCA tends to overfit faster than w/ NCA, reaching a lower train/loss sooner. also noticed that training w/ NCA was much more stable than w/o NCA.

[screenshot: train/loss comparison]

[screenshot: val/loss comparison]

w/o NCA training script [also in run.sh]

WANDB_MODE=online torchrun --standalone --nproc_per_node=8 train.py \
    --run lang-pre-train-v11-no-nca \
    --n_embd 2560 --n_head 20 --n_layer 30 \
    --num-epochs 25 \
    --total-batch-size 524288 \
    --lr_multiplier 0.25 \
    --weight-decay 1.3 \
    --dropout 0.1 \
    --warmdown-ratio 0.2 \
    --dupe-start-epoch 12 \
    --logit-avg 7 \
    --swa-last-epochs 6

NOTE: for some reason, the baseline training crashed right before finishing the 25th (final) epoch due to an OOM error. i believe this shouldn't affect the score either way, since the val/loss had started to worsen (overfitting) well before that. i computed the eval logit-avg val/loss from the saved checkpoints (epochs 18-24 instead of 19-25).

w/o NCA train log - pre-training-no-nca.log and no-nca-oom-error.log
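
(for anyone following along, a hypothetical sketch of the logit-avg eval mentioned in the NOTE above -- averaging logits across the last-N epoch checkpoints before scoring; the actual implementation lives in train.py:)

import torch
import torch.nn.functional as F

@torch.no_grad()
def logit_avg_val_loss(models, x, y):
    # average logits across the last-N epoch checkpoints, then score once
    logits = torch.stack([m(x) for m in models]).mean(dim=0)
    return F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))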

@akshayvegesna
Contributor

I am a bit reluctant to merge this -- my expectation is that increasing model size to 2.7B would make val loss significantly lower (even if you just run the main track code with just that change). Not sure if it will get to 3.207, but it should get close (we get 3.227 with a 1.4B model). I think we should tune the baseline w/o NCA so the comparison is fairer. I just want us to be confident that NCA gives the gain, and that the baseline size increase is not completely responsible for it. Your current new flags kind of confuse the story a bit, and I can't tell whether the w/o NCA baseline performs worse because of those flags or because it is genuinely worse.

I think it would be sufficient to test the main track with the updated model size, and also try increasing number of epochs to ensure it is trained for a comparable amount of time.

Generally down to merge this if NCA shows any significant gain, but over the fair baseline. Fine to work at a smaller model scale in order to test this as well so we can get confidence without spending a lot of compute.

@idhantgulati
Author

that sounds fair. i can run the training again to make it comparable against the leaderboard.

just to be clear, i'll be running the training on the 1.4B model. are there any other params i should match / keep the same? and would it be fine to have it run for more epochs / longer?

so i'll be running 2 training runs (1) w/ NCA and (2) w/o NCA. does that sound good?

@idhantgulati
Author

@akshayvegesna

@akshayvegesna
Contributor

so i'll be running 2 training runs (1) w/ NCA and (2) w/o NCA. does that sound good?

Yes, that sounds good! If there is a gain from NCA, this is good to go. Any parameters that you use in the Fineweb training with NCA should match the Fineweb training configuration without NCA (including epoch count).

@idhantgulati
Author

sounds good!

i'll just redo the same training run that i did before, with just the right model size.

@samacqua
Contributor

When the paper came out, I tried running the NCA experiments using their repo but with 2 slight differences: (a) using Muon on the 2d weights and (b) using Fineweb as the dataset. The NCA gains held when using Adam and Fineweb. But when I used Muon, NCA actually hurt the performance -- making it comparable to Adam w/ no NCA (step-wise comparison).

So roughly, Muon (NCA) ~= Adam (no NCA), and Muon (no NCA) ~= Adam (NCA).

My rough intuition for why is that the spectral normalization of Muon makes the weight updates "move more" than in Adam, kind of like how Muon erases the number encoded in the weights in the image below (from https://jeremybernste.in/writing/deriving-muon). Given this, I tried a more gentle LR warmup when init'ing after NCA pre-pre-training, but that didn't change results much.

[image: Muon erasing the number encoded in the weights, from the linked post]
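
For concreteness, a minimal sketch of the Muon-style update (momentum followed by Newton-Schulz orthogonalization, as in the common open-source implementations; the coefficients below are the widely used ones, not necessarily what I ran):

import torch

def newton_schulz(G, steps=5, eps=1e-7):
    # approximately orthogonalize G with the quintic Newton-Schulz iteration
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(param, momentum_buf, grad, lr=0.02, beta=0.95):
    # the orthogonalized update has roughly uniform singular values, which is
    # the "moves more" behavior described above
    momentum_buf.mul_(beta).add_(grad)
    param.add_(newton_schulz(momentum_buf), alpha=-lr)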

Given that NCA is a fixed cost and Muon is per-step overhead, maybe there is some step count where Adam+NCA does better than Muon w/ no NCA when measuring wall-clock time. However, I don't think this is the case, but it would be interesting for someone to explore this more fully.

Given the many implementation choices (e.g. maybe mixing the pre-pre-training data throughout pre-training, different Muon requires different NCA hparams, ...), a negative result here doesn't mean that much, but putting this in case it is useful. I love the idea of algorithmic pre-pre-training and hope someone can get it to work :)

@idhantgulati
Author

@samacqua this is a very interesting observation and helpful in this case. thank you.

also i had a question, did you keep the optimizer constant (like muon / adam) for both pre-pre-training and pre-training phase?

further, what about the other hyperparams like LR, WD, etc.? did you change / tune them? if so, to what degree?

@idhantgulati
Author

idhantgulati commented Apr 15, 2026

so, did a couple of training runs with slightly different configs, with the model parameters being 1,398,287,931 (transformer: 1,169,829,360, value_embeds: 48,168,960, lm_head: 90,144,768, other: 90,144,843)

didn't really have that much success in terms of val/loss. ran each of the runs for 25 epochs while changing some of the NCA-specific params.

final min val/loss:
baseline (w/o NCA) - 3.177205 [2 hours 6 min]
it-1 - 3.187808
it-2 - 3.208140 [bumped ppt epochs from 6 -> 10; nca load mode = all instead of attn; warm/ramp = 500/500 instead of 250/200]
it-3 - 3.205975 [nca load mode = all; warmup steps = 0, rampup = 500]
it-4 - 3.193332 [back to load mode attn; warm/ramp = 500/500]

after looking into what @samacqua found, i have an idea or two in mind to try. i was thinking of a hybrid optimizer approach wherein AdamW is applied only to the matrix params that were loaded from the NCA checkpoint, and Muon to everything else (rough sketch below).

but apart from that, if there's any other feedback, i'm open to it.
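
rough sketch of the split i have in mind (hypothetical; assumes a Muon optimizer class is available, e.g. the repo's, and that nca_keys is the set of state-dict keys that were loaded from the NCA checkpoint):

import torch

def build_optimizers(model, nca_keys):
    # AdamW on params initialized from the NCA checkpoint, Muon on the
    # remaining 2D matrices, AdamW on everything else (embeddings, norms, ...)
    nca, matrices, rest = [], [], []
    for name, p in model.named_parameters():
        if name in nca_keys:
            nca.append(p)
        elif p.ndim == 2:
            matrices.append(p)
        else:
            rest.append(p)
    adamw = torch.optim.AdamW(nca + rest, lr=6e-4, weight_decay=0.3)
    muon = Muon(matrices, lr=0.02)  # hypothetical: Muon class from the repo
    return adamw, muon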

@samacqua
Contributor

also i had a question, did you keep the optimizer constant (like muon / adam) for both pre-pre-training and pre-training phase?

Yes, I tried the cross product of (no pre-pre-training, adam pre-pre-training, muon pre-pre-training) and (adam pretraining, muon pretraining). I just found the old results:

[image: results table for the (pre-pre-training optimizer) x (pre-training optimizer) cross-product]

Barely tuned hparams. The only change was to increase the LR warmup for muon pretraining in the hopes that this would preserve the pre-pre-training structure, to no avail. So muon pre-pre-training gets better pre-pre-training loss and helps downstream pre-training more for both pre-training optimizers.

