implemented NCA pre-pretraining (val/loss - 3.207276)#63

Open
idhantgulati wants to merge 15 commits into qlabs-eng:main from idhantgulati:main

Conversation


@idhantgulati idhantgulati commented Apr 7, 2026

this PR includes the implementation of NCA pre-pretraining (original paper: https://arxiv.org/abs/2603.10055, original open-source code: https://github.com/danihyunlee/nca-pre-pretraining)

final best val/loss: 3.207276

total training time: 71m (pre-pre-training) + 240m (pre-training) = 5h 11m
total wall time: 113m (pre-pre-training) + 272m (pre-training) = 6h 25m

training command / script used for this is present in run.sh

training stats report: wandb report

nca pre-pre-training:

the implementation of the nca pre-pretraining is present in pre_pre_train.py

for this, 50M tokens were used per epoch for 6 epochs (i.e., 300M total NCA tokens); a rough sketch of the data generation is included after the log link below

WANDB_MODE=online torchrun --standalone --nproc_per_node=8 pre_pre_train.py \
    --tokens-per-epoch 50000000 --num-eval-tokens 16000000 \
    --num-epochs 6 \
    --regen-data \
    --grid 12 --patch 2 --num-colors 10 \
    --temperature 0.0001 --dT 1 --init-rollout-steps 10 \
    --filter-rules --filter-threshold 0.5 --filter-upper-bound 1.0 \
    --n-layer 30 --n-head 20 --n-embd 2560 \
    --seq-len 1024 --window-pattern SSSL \
    --device-batch-size 8 --total-batch-size 65536 \
    --lr 6e-4 --weight-decay 0.3 --dropout 0.1 \
    --warmdown-ratio 0.15 \
    --save-dir nca_ckpts/ppt_v8 \
    --run nca-ppt-v8

final train/loss: 3.339553
final val/loss: 3.420781

log - nca-ppt-training.log
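
for intuition, here is a rough, hypothetical sketch of what NCA-style data generation could look like given the flags above (random local-rule cellular-automaton rollouts, tokenized patch by patch). the rule family and tokenization here are illustrative assumptions on my part, not the actual pre_pre_train.py code:

import numpy as np

def rollout(grid=12, num_colors=10, steps=10, rng=None):
    # roll a randomly sampled local-update rule forward `steps` times
    rng = rng or np.random.default_rng()
    state = rng.integers(0, num_colors, size=(grid, grid))
    # hypothetical rule family: next color is a lookup on (own color, neighbor sum)
    max_nbr_sum = 4 * (num_colors - 1)
    rule = rng.integers(0, num_colors, size=(num_colors, max_nbr_sum + 1))
    frames = [state.copy()]
    for _ in range(steps):
        nbr_sum = (np.roll(state, 1, 0) + np.roll(state, -1, 0)
                   + np.roll(state, 1, 1) + np.roll(state, -1, 1))
        state = rule[state, nbr_sum]
        frames.append(state.copy())
    return frames

def to_tokens(frames, patch=2):
    # flatten each frame patch-by-patch so local structure stays contiguous
    toks = []
    for f in frames:
        for i in range(0, f.shape[0], patch):
            for j in range(0, f.shape[1], patch):
                toks.extend(int(c) for c in f[i:i+patch, j:j+patch].ravel())
    return toks

the --filter-rules / --filter-threshold flags presumably discard rules whose rollouts are too trivial or too chaotic; the exact criterion is in pre_pre_train.py.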

pre-training:

also, some edits were made to train.py to accommodate pre-pre-training (e.g., the --pretrained-ckpt and --nca-* flags used below; a hypothetical sketch of the selective loading follows the log link):

WANDB_MODE=online torchrun --standalone --nproc_per_node=8 train.py \
    --run lang-pre-train-v10 \
    --pretrained-ckpt nca_ckpts/ppt_v8/nca_pretrained_best.pt \
    --n_embd 2560 --n_head 20 --n_layer 30 \
    --num-epochs 25 \
    --total-batch-size 524288 \
    --lr_multiplier 0.25 \
    --weight-decay 1.3 \
    --dropout 0.1 \
    --nca-load-mode attn \
    --nca-warmup-steps 250 --nca-rampup-steps 200 \
    --warmdown-ratio 0.2 \
    --dupe-start-epoch 12 \
    --logit-avg 5 \
    --swa-last-epochs 4

log - pre-training.log
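
for reference, a minimal hypothetical sketch of what --nca-load-mode attn might do (copying only the attention weights from the NCA checkpoint). the "attn" substring match and the checkpoint layout are my assumptions, not the actual train.py code:

import torch

def load_nca_checkpoint(model, ckpt_path, mode="attn"):
    # hypothetical selective load: mode="attn" keeps only attention weights,
    # mode="all" keeps everything (parameter names assumed to contain "attn")
    state = torch.load(ckpt_path, map_location="cpu")
    if mode == "attn":
        state = {k: v for k, v in state.items() if "attn" in k}
    missing, unexpected = model.load_state_dict(state, strict=False)
    return missing, unexpected

presumably --nca-warmup-steps / --nca-rampup-steps then hold down and ramp up the learning rate (or contribution) of the loaded weights early in pre-training; the exact mechanism is in train.py.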

compute

all the training was done on a single node of 8xH100 (via modal.com)

PS: noticed that towards the end of training the model started to overfit, which made the val/loss increase. hence, the last couple of epochs were not included in the final best val/loss.

also thanks to @MarmikChaudhari for the initial idea :)

@akshayvegesna
Contributor

Hi Idhant, this is a really great contribution! A bunch of folks have been interested in NCA on Slowrun, so this will be cool for them.

A couple questions:

  1. Have you ablated the contribution of NCA in this PR? We should check that it helps at least a little bit in improving the performance of the pre-training. I would assume that some of your changes, like increasing the model size, improve the pre-training itself, so we should ensure that all the gains don't just come from there.
  2. I think we should merge this into the dev/ folder for now, in an nca/ directory. For new folks, we don't want the iteration time on the main track to be too long. Can you make that change? Then this is good to go.

Generally we want this repo to be a home for research on synthetic data from simple programs -- maybe we make a separate track to support this. The unlimited/ track would probably not incentivize research here given the no-time-limit setting. Give us a little while to think that through, and let's develop the code in the dev/ directory for now.

@idhantgulati
Author

thank you for the feedback!

answering your questions,

  1. yes, did a training run with and without NCA, but since then some things have diverged. i'll run another training run (w/o NCA) while keeping everything else the same and will post the comparison here by tomorrow or so. one thing i noticed is that w/o NCA seems to converge faster initially, i believe just because of the warmup and rampup steps.
  2. sounds good. just so i understand, i should put my changes into dev/, right?

happy to perform any other ablations / experiments on this, and answer any questions :)

@akshayvegesna
Contributor

Yup, the ablation would be great! And yes, I would suggest putting it into dev/nca/ with train, pre_pre_train, and run.

@idhantgulati
Author

sounds good

I would assume that some of your changes like increasing the model size improves the pre-training itself

also a further clarification: i didn't change the model size. it's still 2.7B [Parameters: 2,734,834,155 (transformer: 2,378,973,600, value_embeds: 98,304,000, lm_head: 128,778,240, other: 128,778,315)]

@akshayvegesna
Contributor

So the 2.7B in the README was the baseline during the initial release.

The current parameter count in the main 1 hour track is:
Parameters: 1,398,287,931 (transformer: 1,169,829,360, value_embeds: 48,168,960, lm_head: 90,144,768, other: 90,144,843)
So relative to that there is a decent bump in the parameter count, probably due in particular to n-embd=2560.
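
(As a rough sanity check, the standard 12 * n_layer * n_embd^2 estimate for the transformer blocks roughly reproduces the reported count, assuming that approximation applies to this architecture:)

# rough transformer-block parameter estimate: 12 * n_layer * n_embd^2
n_layer, n_embd = 30, 2560
print(12 * n_layer * n_embd ** 2)  # 2,359,296,000 vs the reported 2,378,973,600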

@idhantgulati
Author

idhantgulati commented Apr 7, 2026

oh okay. i see.

for the time being, i'll go with the 2.7B model (to save on time and compute lol) to define the base / w/o NCA training.

just some additional feedback: i noticed that these small / minor details aren't mentioned clearly in the repo. for newer participants, it would be nice to clarify the baseline scores and hyper-params just to avoid confusion going forward.

@akshayvegesna
Contributor

Yeah sounds good. I do think the defaults are set properly to minimize confusion, but the 2.7B parameter point being confusing is fair; some folks got confused after we updated the baseline. Will clean that up soon.

@idhantgulati
Author

idhantgulati commented Apr 8, 2026

done with the w/o NCA training, and its val/loss is higher at 3.28241 compared to 3.207276.

updated the wandb report

i noticed that w/o NCA tends to overfit faster than w/ NCA, reaching a lower train/loss sooner. also noticed that training w/ NCA was much more stable than w/o NCA.

[screenshot: train/loss comparison]

[screenshot: val/loss comparison]

w/o NCA training script [also in run.sh]

WANDB_MODE=online torchrun --standalone --nproc_per_node=8 train.py \
    --run lang-pre-train-v11-no-nca \
    --n_embd 2560 --n_head 20 --n_layer 30 \
    --num-epochs 25 \
    --total-batch-size 524288 \
    --lr_multiplier 0.25 \
    --weight-decay 1.3 \
    --dropout 0.1 \
    --warmdown-ratio 0.2 \
    --dupe-start-epoch 12 \
    --logit-avg 7 \
    --swa-last-epochs 6

NOTE: for some reason, the baseline training crashed right before finishing the 25th (final) epoch due to an OOM error. i believe this shouldn't affect the score either way, since the val/loss had started to worsen (overfitting) well before that. i computed the eval logit-avg val/loss from the saved checkpoints (epochs 18-24 instead of 19-25).

w/o NCA train log - pre-training-no-nca.log and no-nca-oom-error.log
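
(for anyone following along, a hypothetical sketch of the logit-avg eval mentioned in the NOTE above -- averaging logits across the last-N epoch checkpoints before scoring; the actual implementation lives in train.py:)

import torch
import torch.nn.functional as F

@torch.no_grad()
def logit_avg_val_loss(models, x, y):
    # average logits across the last-N epoch checkpoints, then score once
    logits = torch.stack([m(x) for m in models]).mean(dim=0)
    return F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))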

@akshayvegesna
Contributor

I am a bit reluctant to merge this -- my expectation is that increasing model size to 2.7B would make val loss significantly lower (even if you just run the main track code with just that change). Not sure if it will get to 3.207, but it should get close (we get 3.227 with a 1.4B model). I think we should tune the baseline w/o NCA so the comparison is fairer. I just want us to be confident that NCA gives the gain, and that the baseline size increase is not completely responsible for it. Your current new flags kind of confuse the story a bit, and I can't tell whether the w/o NCA baseline performs worse because of those flags or because it is genuinely worse.

I think it would be sufficient to test the main track with the updated model size, and also try increasing number of epochs to ensure it is trained for a comparable amount of time.

Generally down to merge this if NCA shows any significant gain, but over the fair baseline. Fine to work at a smaller model scale in order to test this as well so we can get confidence without spending a lot of compute.

@idhantgulati
Author

that sounds fair. i can run the training again to make it comparable against the leaderboard.

just to be clear, i'll be running the training on the 1.4B model. are there any other params i should match / keep the same? and would it be fine to have it run for more epochs / longer?

so i'll be running 2 training runs (1) w/ NCA and (2) w/o NCA. does that sound good?

@idhantgulati
Author

@akshayvegesna

@akshayvegesna
Contributor

so i'll be running 2 training runs (1) w/ NCA and (2) w/o NCA. does that sound good?

Yes, that sounds good! If there is a gain from NCA, this is good to go. Any parameters that you use in the Fineweb training with NCA should match the Fineweb training configuration without NCA (including epoch count).

@idhantgulati
Author

sounds good!

i'll just redo the same training run that i did before, with just the right model size.

@samacqua
Contributor

When the paper came out, I tried running the NCA experiments using their repo but with 2 slight differences: (a) using Muon on the 2d weights and (b) using Fineweb as the dataset. The NCA gains held when using Adam and Fineweb. But when I used Muon, NCA actually hurt the performance -- making it comparable to Adam w/ no NCA (step-wise comparison).

So roughly, Muon (NCA) ~= Adam (no NCA), and Muon (no NCA) ~= Adam (NCA).

My rough intuition for why is that the spectral normalization of Muon makes the weight updates "move more" than in Adam, kind of like how Muon erases the number encoded in the weights in the image below (from https://jeremybernste.in/writing/deriving-muon). Given this, I tried a more gentle LR warmup when init'ing after NCA pre-pre-training, but that didn't change results much.

[image: Muon erasing the number encoded in the weights, from the linked post]
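
For concreteness, a minimal sketch of the Muon-style update (momentum followed by Newton-Schulz orthogonalization, as in the common open-source implementations; the coefficients below are the widely used ones, not necessarily what I ran):

import torch

def newton_schulz(G, steps=5, eps=1e-7):
    # approximately orthogonalize G with the quintic Newton-Schulz iteration
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(param, momentum_buf, grad, lr=0.02, beta=0.95):
    # the orthogonalized update has roughly uniform singular values, which is
    # the "moves more" behavior described above
    momentum_buf.mul_(beta).add_(grad)
    param.add_(newton_schulz(momentum_buf), alpha=-lr)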

Given that NCA is a fixed cost and Muon is per-step overhead, maybe there is some step count where Adam+NCA does better than Muon w/ no NCA when measuring wall-clock time. However, I don't think this is the case, but it would be interesting for someone to explore this more fully.

Given the many implementation choices (e.g. maybe mixing the pre-pre-training data throughout pre-training, different Muon requires different NCA hparams, ...), a negative result here doesn't mean that much, but putting this in case it is useful. I love the idea of algorithmic pre-pre-training and hope someone can get it to work :)

@idhantgulati
Author

@samacqua this is a very interesting observation and helpful in this case. thank you.

also i had a question, did you keep the optimizer constant (like muon / adam) for both pre-pre-training and pre-training phase?

further, what about the other hyperparams like LR, WD, etc.? did you change / tune them? if so, to what degree?

@idhantgulati
Author

idhantgulati commented Apr 15, 2026

so, did a couple of training runs with slightly different configs, with the model parameters being 1,398,287,931 (transformer: 1,169,829,360, value_embeds: 48,168,960, lm_head: 90,144,768, other: 90,144,843)

didn't really have that much success in terms of val/loss. ran each of the runs for 25 epochs while changing some of the NCA-specific params.

final min val/loss:
baseline (w/o NCA) - 3.177205 [2 hours 6 min]
it-1 - 3.187808
it-2 - 3.208140 [bumped ppt epochs from 6 -> 10; nca load mode = all instead of attn; warm/ramp = 500/500 instead of 250/200]
it-3 - 3.205975 [nca load mode = all; warmup steps = 0, rampup = 500]
it-4 - 3.193332 [back to load mode attn; warm/ramp = 500/500]

after looking into what @samacqua found, i have an idea or two in mind to try. i was thinking of a hybrid optimizer approach wherein AdamW is applied only to the matrix params that were loaded from the NCA checkpoint, and Muon to everything else (rough sketch below).

but apart from that, if there's any other feedback, i'm open to it.
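
rough sketch of the split i have in mind (hypothetical; assumes a Muon optimizer class is available, e.g. the repo's, and that nca_keys is the set of state-dict keys that were loaded from the NCA checkpoint):

import torch

def build_optimizers(model, nca_keys):
    # AdamW on params initialized from the NCA checkpoint, Muon on the
    # remaining 2D matrices, AdamW on everything else (embeddings, norms, ...)
    nca, matrices, rest = [], [], []
    for name, p in model.named_parameters():
        if name in nca_keys:
            nca.append(p)
        elif p.ndim == 2:
            matrices.append(p)
        else:
            rest.append(p)
    adamw = torch.optim.AdamW(nca + rest, lr=6e-4, weight_decay=0.3)
    muon = Muon(matrices, lr=0.02)  # hypothetical: Muon class from the repo
    return adamw, muon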

@samacqua
Contributor

also i had a question, did you keep the optimizer constant (like muon / adam) for both pre-pre-training and pre-training phase?

Yes, I tried the cross product of (no pre-pre-training, adam pre-pre-training, muon pre-pre-training) and (adam pretraining, muon pretraining). I just found the old results:

[image: results table for the (pre-pre-training optimizer) x (pre-training optimizer) cross-product]

Barely tuned hparams. The only change was to increase the LR warmup for muon pretraining in the hopes that this would preserve the pre-pre-training structure, to no avail. So muon pre-pre-training gets better pre-pre-training loss and helps downstream pre-training more for both pre-training optimizers.

