## Commits (17)
- `6ed2fa5` Add aggressive submission: QAT + BigramHash 12288 + stride 32 (EthanYangTW, Mar 21, 2026)
- `bccc688` Add submission: QAT + BigramHash 12K + Stride 32 (EthanYangTW, Mar 21, 2026)
- `8a0ebf2` Add LoRA TTT (test-time training) to submission (EthanYangTW, Mar 21, 2026)
- `db5c5dd` Remove TTT, bump BigramHash to 13312 (EthanYangTW, Mar 21, 2026)
- `65f54ac` Revert BigramHash to 12288 (13312 over 16MB) (EthanYangTW, Mar 21, 2026)
- `32790dd` Add training log and update submission with 8xH100 results (EthanYangTW, Mar 21, 2026)
- `4bd048c` Fix SWA description: 50 steps not 25 (EthanYangTW, Mar 21, 2026)
- `a3b1212` Update records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32… (EthanYangTW, Mar 21, 2026)
- `38dff06` Add #315/#388 full stack: 11L, XSA4, Partial RoPE, LN Scale, EMA, Lat… (EthanYangTW, Mar 22, 2026)
- `67fa031` Fix speed and artifact size: disable FA3, reduce BigramHash, EMA ever… (EthanYangTW, Mar 22, 2026)
- `943597d` Reduce BigramHash to 2048, increase pruning to 10% to fit under 16MB (EthanYangTW, Mar 22, 2026)
- `308ed62` Remove BigramHash, increase pruning to 15% — must fit under 16MB (EthanYangTW, Mar 22, 2026)
- `78f998e` New record: 11L XSA4 + Tight SWA + TTT (based on PR #374) (EthanYangTW, Mar 22, 2026)
- `4c37972` Fix TTT: compile + DDP + 3 epochs + batch 64 for speed (EthanYangTW, Mar 22, 2026)
- `e83a277` Fix FA3 import: add fallback to flash_attn and SDPA (EthanYangTW, Mar 22, 2026)
- `573f735` New record: 11L XSA4 + Tight SWA + Two-Phase TTT (1.1258 BPB) (EthanYangTW, Mar 22, 2026)
- `358a426` Update: FA3 Hopper + aggressive two-phase TTT (val_bpb=1.1216) (EthanYangTW, Mar 22, 2026)
@@ -0,0 +1,31 @@
# QAT + BigramHash(12288) + Stride 32

## Summary

Built on the current SOTA (`10L_Int5MLP_MuonWD04_SWA50`) with the following improvements:

- **QAT (Quantization-Aware Training):** straight-through-estimator (STE) fake quantization during training, int5 for MLP layers and int6 for attention, which reduces post-quantization degradation.
- **BigramHash 12288:** Increased from 10240 to 12288 buckets for better bigram coverage.
- **Eval stride 32:** Reduced from 64 to 32 for more overlapping context windows during evaluation.
- **Magnitude pruning 5%:** Increased from 3% to improve compression ratio.
- **SWA every 50 steps:** stochastic weight averaging (checkpoint averaging) during the warmdown phase.
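As a rough illustration of the QAT step, the fake-quantization forward pass can be sketched as follows (a minimal NumPy sketch, not the repo's code; only the bit widths come from this submission, and the symmetric per-tensor scaling is an assumption). During training, the STE backward treats the rounding as identity so gradients flow to the float weights:

```python
import numpy as np

def fake_quantize(w, bits):
    """Symmetric per-tensor fake quantization: round weights to an
    int grid, then map back to float. In QAT the backward pass treats
    the rounding as identity (straight-through estimator)."""
    qmax = 2 ** (bits - 1) - 1                 # 15 for int5, 31 for int6
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale

w = np.random.randn(4, 4).astype(np.float32)   # toy weight tensor
w_q, scale = fake_quantize(w, bits=5)          # int5, as used for MLP layers
levels = np.round(w_q / scale)                 # every value sits on the int grid
assert np.allclose(w_q, levels * scale)
assert levels.min() >= -16 and levels.max() <= 15
```

Training against these fake-quantized weights lets the network adapt to the quantization grid, so the final int5/int6 export loses less quality.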

## Architecture

- 10 transformer layers, dim=512, 8 heads, 4 KV heads
- 3x MLP with SmearGate
- BigramHash(12288) with bigram_dim=128
- Mixed quantization: int5 MLP, int6 attention
- zstd-22 compression
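The BigramHash component can be sketched like this (a hedged illustration: the bucket count and embedding width match the Architecture list above, but the hash mixing constants and lookup scheme are assumptions, not the repo's actual implementation):

```python
import numpy as np

N_BUCKETS = 12288   # bucket count from this submission
BIGRAM_DIM = 128    # bigram_dim from the Architecture section

def bigram_bucket(prev_tok, tok, n_buckets=N_BUCKETS):
    """Hash a (previous token, current token) pair into one of
    n_buckets rows, giving bigram features without a vocab^2 table.
    The mixing constants here are illustrative only."""
    return (prev_tok * 1000003 + tok * 8191) % n_buckets

# Small learned table of bigram embeddings (zeros here as a stand-in).
table = np.zeros((N_BUCKETS, BIGRAM_DIM), dtype=np.float32)

tokens = [17, 42, 42, 7]
buckets = [bigram_bucket(p, t) for p, t in zip(tokens, tokens[1:])]
feats = table[buckets]          # (seq_len - 1, BIGRAM_DIM) bigram features
assert all(0 <= b < N_BUCKETS for b in buckets)
assert feats.shape == (3, BIGRAM_DIM)
```

More buckets mean fewer hash collisions between distinct bigrams, which is why the bucket count was pushed as high as the 16MB artifact budget allowed.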

## Results

```
seed=2024: val_bpb=1.14443, artifact=15,902,583 bytes
```
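The stride-32 evaluation can be illustrated with the following sketch (assumptions flagged: the window length of 1024 and the "score only the last `stride` tokens of each window" scheme are a common overlapping-window BPB evaluation pattern, not confirmed details of this repo; only the stride value comes from the submission):

```python
def eval_windows(seq_len, window=1024, stride=32):
    """Slide a full context window forward by `stride` and score only
    the last `stride` tokens of each window, so every token is
    predicted with near-full left context. Returns (start, end,
    n_scored) spans; smaller strides give more context per token at
    the cost of more forward passes."""
    spans, pos = [], 0
    while pos < seq_len:
        start = max(0, pos + stride - window)
        end = min(pos + stride, seq_len)
        spans.append((start, end, end - pos))
        pos = end
    return spans

spans = eval_windows(seq_len=100, window=64, stride=32)
# every token is scored exactly once across the overlapping windows
assert sum(n for _, _, n in spans) == 100
```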

## Command

```bash
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
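The 5% magnitude pruning mentioned in the summary can be sketched as follows (a minimal NumPy illustration under the assumption of global magnitude pruning; the actual repo may prune per-tensor or exclude some layers). Zeroed weights compress well under zstd, which is how pruning helps the artifact fit the 16MB budget:

```python
import numpy as np

def magnitude_prune(w, frac=0.05):
    """Zero the smallest `frac` of weights by absolute value.
    Runs of zeros compress much better under zstd-22, improving the
    artifact size at a small quality cost."""
    k = int(frac * w.size)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value is the pruning threshold
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out

w = np.random.randn(100, 100).astype(np.float32)
pruned = magnitude_prune(w, frac=0.05)
assert (pruned == 0).mean() >= 0.05     # at least 5% of weights zeroed
assert pruned.shape == w.shape
```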
@@ -0,0 +1,9 @@
{
"name": "QAT + BigramHash(12288) + Stride 32",
"val_loss": 1.14443,
"bytes_total": 15902583,
"blurb": "10 layers, QAT with STE (int5 MLP / int6 attn), BigramHash 12288, eval stride 32, magnitude pruning 5%, SWA every 50 steps, zstd-22. Based on 10L_Int5MLP_MuonWD04_SWA50.",
"author": "fbedev",
"github_id": "fbedev",
"date": "2026-03-21"
}