81 commits
a477efb
Use ReGLU, Decoder-only, tune hyperparameters, torch.jit, .to optimiz…
AlstonTang Apr 6, 2026
117effb
Fix some stuff
AlstonTang Apr 7, 2026
9611cea
Integer lol
AlstonTang Apr 7, 2026
b4c48cb
FEAT: Colab
AlstonTang Apr 7, 2026
506ef4c
LFMShortConv
AlstonTang Apr 7, 2026
253f227
Compile + No LFMConv
AlstonTang Apr 7, 2026
38eadfd
Update Hyperparameters
AlstonTang Apr 7, 2026
eef75a1
Partial RoPE + Hyperparameter Tuning
AlstonTang Apr 7, 2026
e05af07
Minor optimizations
AlstonTang Apr 7, 2026
ed564a5
Hyperparam tuning, resid scaling, and attention qkv projection fusing
AlstonTang Apr 7, 2026
ae5ce67
Hyperparameter tuning
AlstonTang Apr 7, 2026
ad7ba33
LoRA and stuff
AlstonTang Apr 8, 2026
9679737
Attempt LoRA improvement
AlstonTang Apr 8, 2026
c4446e4
Hyperparam config
AlstonTang Apr 8, 2026
1893d80
Revert LoRA for now. Trying extreme hyperparameters.
AlstonTang Apr 8, 2026
a6fb815
Resid + Model Adjustments + Better GPU contiguous dimensions
AlstonTang Apr 8, 2026
dc5a514
Dynamic U-shaped KV head count
AlstonTang Apr 8, 2026
94b6c69
Optimize
AlstonTang Apr 8, 2026
5debd92
Decent
AlstonTang Apr 9, 2026
6efb59e
Testing
AlstonTang Apr 9, 2026
e667cd5
QK and V split
AlstonTang Apr 9, 2026
b390a28
Architecture tuning
AlstonTang Apr 9, 2026
cfe4bdd
fp8
AlstonTang Apr 10, 2026
faced9e
Revert fp8 and try new architecture
AlstonTang Apr 14, 2026
79303f6
ZerO
AlstonTang Apr 15, 2026
b036344
Hyperparameter tuning
AlstonTang Apr 16, 2026
b69df5e
Just one more layer
AlstonTang Apr 16, 2026
d83e32a
Whoops fixed too many parameters
AlstonTang Apr 16, 2026
f74eda1
DDP BEFORE Compile + Pin memory
AlstonTang Apr 16, 2026
850e81b
Actually compile before ddp
AlstonTang Apr 16, 2026
24b71e6
Maybe no static graph?
AlstonTang Apr 16, 2026
6736fe4
Optimizations attempt 1
AlstonTang Apr 20, 2026
2c103aa
Optims pt 2
AlstonTang Apr 20, 2026
eb41e01
data
AlstonTang Apr 20, 2026
00cf003
Readd skips
AlstonTang Apr 20, 2026
472be7e
Step things up
AlstonTang Apr 21, 2026
441ecca
Skip correction
AlstonTang Apr 21, 2026
adf8b12
rope and reduce parameters
AlstonTang Apr 21, 2026
de67696
ZerO stuff and redo ReLU^2
AlstonTang Apr 21, 2026
81bde6f
Attempt optimizations
AlstonTang Apr 21, 2026
e62da0d
warmup plus slight increase of qk_gain and mlp_mult
AlstonTang Apr 21, 2026
c55d3f1
Reducing sequential steps
AlstonTang Apr 23, 2026
2cae1e7
Do compile
AlstonTang Apr 23, 2026
f60e102
Revert + Fix
AlstonTang Apr 23, 2026
42beafa
SwiGLU again
AlstonTang Apr 23, 2026
3c518d4
Hyperparam tuning
AlstonTang Apr 23, 2026
ecfac55
RoPE changes
AlstonTang Apr 23, 2026
15527e5
Reduce learning rate
AlstonTang Apr 24, 2026
bd45028
Hail mary
AlstonTang Apr 30, 2026
d34a510
One can only hope
AlstonTang Apr 30, 2026
ac0a728
Restore original eval_val implementation to ensure correctness
AlstonTang Apr 30, 2026
effba2a
AI-based implementation optimizations
AlstonTang Apr 30, 2026
412a17e
Delete masking
AlstonTang Apr 30, 2026
d6ce05e
Tuning to squeeze every last parameter
AlstonTang Apr 30, 2026
8a95441
Squeeze part 2
AlstonTang Apr 30, 2026
25f583c
Too many params
AlstonTang Apr 30, 2026
cd72e75
Longer
AlstonTang Apr 30, 2026
bf03d00
Unique hyperparams + geometric rope progression
AlstonTang Apr 30, 2026
a2905ff
It's unique for sure
AlstonTang May 1, 2026
8ffd7a9
Merge branch 'openai:main' into main
AlstonTang May 1, 2026
528747f
Add submission
AlstonTang May 1, 2026
9993ab1
Restore original train_gpt.py
AlstonTang May 1, 2026
31f2ebb
Get train_gpt and update readme
AlstonTang May 1, 2026
4221127
Delete check.py
AlstonTang May 1, 2026
9a2be88
Rectify trailing line in original train_gpt.py
AlstonTang May 1, 2026
3f4fc45
Update README.md
AlstonTang May 1, 2026
25e9856
Update README.md
AlstonTang May 1, 2026
fef7edc
Touch up README.md
AlstonTang May 1, 2026
1ca0503
Utilize the more original version of ZerO
AlstonTang May 1, 2026
22d8e51
Compilation flags
AlstonTang May 1, 2026
5685ab3
No norms?
AlstonTang May 1, 2026
93b44fb
Revert no norms since instability
AlstonTang May 1, 2026
014c14c
Can't reduce overhead either
AlstonTang May 1, 2026
a93e162
Fix up initialization function
AlstonTang May 1, 2026
02b4d37
One more layer + adam
AlstonTang May 1, 2026
41a38ab
Restore muon
AlstonTang May 1, 2026
907f5bf
Gear up for another submission
AlstonTang May 1, 2026
b6eabb8
Update README.md
AlstonTang May 1, 2026
3d1a1c1
Update README.md
AlstonTang May 1, 2026
48e24c6
Include logs and update readme.md
AlstonTang May 1, 2026
3e35d0c
Update README.md
AlstonTang May 1, 2026
@@ -0,0 +1,52 @@
# Constrained by Time
*(Not in the traditional sense)*

Between schoolwork and other commitments, this was definitely an on-and-off type of project.
- Prior to this, I had never trained a language model of any kind, let alone in a speedrunning context.
- I was planning on full-scale testing but had to wait until 4/30/2026 for a compute grant sufficient for H100 runs. With only a day to really experiment, there wasn't much time for me to validate anything or find very good ideas.
- Smaller-scale testing was done on Google Colab and the RTX 5090s provided by RunPod, but these were slow, and it was hard to know what actually worked when iterating at that pace.

There were many (frankly) strange implementations that I experimented with, with the help of AI.
- Extreme depth/low dim
- Low depth/high dim
- Rolling data inside each batch (giving the model one more chance at "harder" data; see the sketch after this list)
- Dynamic loss scaling
- You'll find more by going through [my repo's](https://github.com/AlstonTang/parameter-golf) commit history.
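
As an example of how speculative these were, here is a minimal sketch of one way the batch-rolling idea could work. Everything here (the function name, the rule of replacing the easiest sequences with copies of the hardest ones, the `frac` value) is an illustrative assumption, not the repo's actual implementation:

```python
import torch

def roll_hard_examples(batch: torch.Tensor, per_token_loss: torch.Tensor,
                       frac: float = 0.25) -> torch.Tensor:
    """Hypothetical sketch of 'rolling' hard data back into a batch.

    Replaces the easiest sequences with copies of the highest-loss ones,
    giving the model a second pass over them on the next step.
    """
    seq_loss = per_token_loss.mean(dim=1)    # (batch,) mean loss per sequence
    k = max(1, int(frac * batch.size(0)))
    hard = seq_loss.topk(k).indices          # hardest k sequences
    easy = (-seq_loss).topk(k).indices       # easiest k sequences
    batch = batch.clone()
    batch[easy] = batch[hard]                # recycle the hard data in-place
    return batch
```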

Although this probably isn't going to shatter any records (at all), I do hope it at least shines an interesting light on some potential ideas that could be integrated into future GPT training/speedruns.
- Even if this ultimately doesn't go very far on its own, it would be nice if some of the ideas in this implementation were explored further. Perhaps there's too much going on in this implementation, and the pieces conflict with one another. Or perhaps it's merely one hyperparameter configuration away from solid results.

My focus ended up not being on Test-Time Training (TTT) or any significant implementation-specific optimizations (e.g. fp8 training). Rather, it was on the underlying architecture itself, and on trying to push the limits of what a conventional transformer can do.
- I may continue experimenting even after this competition, since research isn't just one-and-done! I may take some ideas from this implementation and iteratively add them to new language models I train in the future, as I learn more about LLMs and deep learning in general.
- Some commits show attempts at implementation-specific optimizations (e.g. my attempt at fp8 training), but these usually either failed or led to training instability during experimentation.

## ZerO Initialization
I was interested in this paper: https://arxiv.org/pdf/2110.12661
- Zhao et al. describe how deep networks can perform better and be more reproducible through a deterministic initialization method built from identity and Hadamard matrices.

The actual usage of this initialization does involve some level of non-determinism (and likely somewhat deviates from the original algorithm), but it's far less pronounced than fully random initialization.
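
For reference, here is a minimal sketch of the flavor of scheme the paper proposes; `_hadamard` and `zero_init_` are illustrative names, and the scaling in the dimension-increasing branch is my simplification rather than a verbatim transcription of the paper's Algorithm 1:

```python
import math
import torch

def _hadamard(n: int) -> torch.Tensor:
    """Hadamard matrix via Sylvester's construction; n must be a power of two."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H

@torch.no_grad()
def zero_init_(weight: torch.Tensor) -> None:
    """Deterministic ZerO-style init for an (out_features, in_features) weight."""
    out_f, in_f = weight.shape
    if out_f == in_f:
        weight.copy_(torch.eye(out_f))            # square: plain identity
    elif out_f < in_f:
        weight.copy_(torch.eye(out_f, in_f))      # dim decrease: partial identity
    else:
        # Dim increase: a zero-padded identity would leave dead rows, so
        # route the copied signal through a Hadamard transform to spread it.
        m = 2 ** math.ceil(math.log2(out_f))
        H = _hadamard(m) / math.sqrt(m)           # orthonormal scaling
        weight.copy_((H @ torch.eye(m, in_f))[:out_f])
```

Applied to every linear layer, something like this makes the starting point (nearly) deterministic, which is what makes cross-seed reproducibility plausible.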

## Progression of Various Model Hyperparameters
Across the layers, I used a progression of KV head count, RoPE proportion, and MLP expansion factor, all of which increase with the layer's depth in the model. The rationale is as follows:
1. Earlier layers most likely focus on nearby context and shouldn't worry about long-range dependencies.
2. As tokens go further into the model, more information is going to be needed.

Whether these should all be scaled linearly, geometrically, inversely, or something else is a question for another day.
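
As a concrete illustration, a linear version of such a schedule might look like the sketch below. The endpoint values are made-up placeholders rather than the submission's actual configuration (per the commit history, the RoPE progression ended up geometric):

```python
def layer_schedule(layer: int, n_layers: int,
                   kv_heads=(1, 4), rope_frac=(0.25, 1.0), mlp_mult=(2.0, 4.0)):
    """Per-layer hyperparameters that grow with depth (placeholder endpoints)."""
    t = layer / max(n_layers - 1, 1)            # 0.0 at the first layer, 1.0 at the last
    lerp = lambda lo, hi: lo + t * (hi - lo)
    return {
        "n_kv_heads": round(lerp(*kv_heads)),   # more KV heads deeper in
        "rope_frac": lerp(*rope_frac),          # larger rotary fraction deeper in
        "mlp_mult": lerp(*mlp_mult),            # wider MLP deeper in
    }

# e.g. [layer_schedule(i, 12)["n_kv_heads"] for i in range(12)] ramps from 1 to 4
```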

## So, could it work?
Well, maybe, given more time (for both experimentation and training) and a better implementation.
- Of course, hyperparameter choice also plays a role, but I didn't have much time to test anything properly.

## What would I have done if I had more time?
1. Tuning hyperparameters.
2. More experimentation with other strange hypotheses.
3. Optimizing the implementation.

If I weren't constrained by the constraints, I would also test this on larger-scale models (e.g. 100 million+ parameters).
- Perhaps the model was too small to really realize the potential gains of my proposed implementation.
- Going beyond 16 MB would be a nice way to see whether my ideas could fly!

## Why just one Result?
Time...
- If I had more time, I would submit more results (and probably more refined ones).
- The result should (hopefully) be reproducible across seeds thanks to the more deterministic [ZerO Initialization](#zero-initialization) being used.
@@ -0,0 +1,10 @@
numpy
tqdm
torch
huggingface-hub
kernels
setuptools
typing-extensions==4.15.0
datasets
tiktoken
sentencepiece
@@ -0,0 +1,11 @@
{
"track": "non_record_16mb",
"date": "2026-04-30",
"name": "ZerO Initalization + Progressive KV, RoPE proportion and base, and MLP mult",
"author": "Alston Tang",
"github_id": "AlstonTang",
"val_bpb": 1.25212989,
"val_loss": 2.11416887,
"bytes_total": 15703926
}
