Record: 11L XSA4 + Tight SWA + FA3 + Two-Phase TTT (val_bpb=1.1216) #410

Changes from all commits: 6ed2fa5, bccc688, 8a0ebf2, db5c5dd, 65f54ac, 32790dd, 4bd048c, a3b1212, 38dff06, 67fa031, 943597d, 308ed62, 78f998e, 4c37972, e83a277, 573f735, 358a426

@@ -0,0 +1,31 @@
# QAT + BigramHash(12288) + Stride 32

## Summary

Built on the current SOTA (`10L_Int5MLP_MuonWD04_SWA50`) with the following improvements:

- **QAT (Quantization-Aware Training):** STE fake-quantize during training (int5 for MLP layers, int6 for attention). Reduces post-quantization degradation; see the STE sketch after this list.
- **BigramHash 12288:** Increased from 10240 to 12288 buckets for better bigram coverage (a sketch follows the Architecture list below).
- **Eval stride 32:** Reduced from 64 to 32 for more overlapping context windows during evaluation (a sketch follows the Results section below).
- **Magnitude pruning 5%:** Increased from 3% to improve the compression ratio; see the pruning sketch after this list.
- **SWA every 50 steps:** Checkpoint averaging during warmdown; see the SWA sketch after this list.
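
As a rough illustration of the QAT step, here is a minimal PyTorch sketch of straight-through fake quantization. The symmetric per-tensor scaling and the function name `fake_quantize` are assumptions for illustration, not the PR's actual code:

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    # Symmetric per-tensor scale (an assumption; the PR's scheme may differ).
    qmax = 2 ** (bits - 1) - 1                    # 15 for int5, 31 for int6
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax, qmax) * scale
    # Straight-through estimator: forward sees w_q, backward treats it as identity.
    return w + (w_q - w).detach()
```

Under this scheme, MLP weights would pass through `fake_quantize(w, 5)` and attention weights through `fake_quantize(w, 6)` during training, so the model learns weights that survive the later real int5/int6 rounding.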
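
Magnitude pruning at 5% presumably zeroes the smallest-magnitude weights so the artifact compresses better under zstd. A sketch, assuming a per-tensor quantile threshold (the actual granularity is not stated in this PR):

```python
import torch

@torch.no_grad()
def magnitude_prune(w: torch.Tensor, frac: float = 0.05) -> torch.Tensor:
    # Zero the smallest `frac` of entries by absolute value.
    threshold = w.abs().flatten().quantile(frac)
    return w * (w.abs() >= threshold).to(w.dtype)
```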
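
SWA here is checkpoint averaging during warmdown: every 50 steps the live weights are folded into a running mean, and the averaged weights are what get evaluated. A minimal sketch, with `swa_update` and `avg_state` as hypothetical names:

```python
import torch

@torch.no_grad()
def swa_update(avg_state: dict, model: torch.nn.Module, n_averaged: int) -> int:
    # Running mean over checkpoints: avg += (w - avg) / (n + 1).
    for name, w in model.state_dict().items():
        if name not in avg_state:
            avg_state[name] = w.detach().clone().float()
        else:
            avg_state[name] += (w.float() - avg_state[name]) / (n_averaged + 1)
    return n_averaged + 1
```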

## Architecture

- 10 transformer layers, dim=512, 8 heads, 4 KV heads
- 3x MLP with SmearGate
- BigramHash(12288) with bigram_dim=128 (see the sketch below)
- Mixed quantization: int5 MLP, int6 attention
- zstd-22 compression
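
For context on BigramHash(12288): it presumably hashes each (previous token, current token) pair into one of 12288 buckets and looks up a learned 128-dim embedding. A sketch; the class layout, the hash mixing constant, and how the output joins the residual stream are all assumptions:

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    # Hashed bigram embedding: (prev, cur) token pairs -> n_buckets.
    def __init__(self, n_buckets: int = 12288, bigram_dim: int = 128):
        super().__init__()
        self.n_buckets = n_buckets
        self.emb = nn.Embedding(n_buckets, bigram_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, T) int64
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no real predecessor at position 0
        # Multiplicative hash of the pair (constant is illustrative).
        h = (prev * 1000003 + tokens) % self.n_buckets
        return self.emb(h)  # (B, T, bigram_dim)
```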

## Results

```
seed=2024: val_bpb=1.14443, artifact=15,902,583 bytes
```
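
On eval stride 32: the evaluation presumably slides a fixed-length window over the validation stream in steps of 32 tokens and scores only the last 32 targets of each window, so nearly every token is scored with close-to-full left context. A sketch, assuming a byte-level tokenizer (tokens and bytes coincide) and hypothetical `model`/`seq_len` arguments:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_bpb(model, tokens: torch.Tensor, seq_len: int = 1024, stride: int = 32) -> float:
    # Overlapping windows: only the final `stride` targets of each window are scored.
    nll_nats, n_scored = 0.0, 0
    for start in range(0, tokens.numel() - seq_len - 1, stride):
        window = tokens[start : start + seq_len + 1]
        logits = model(window[:-1].unsqueeze(0))  # (1, seq_len, vocab)
        loss = F.cross_entropy(logits[0, -stride:], window[-stride:], reduction="sum")
        nll_nats += loss.item()
        n_scored += stride
    # Bits per byte, assuming one token per byte.
    return nll_nats / n_scored / math.log(2)
```

Halving the stride from 64 to 32 roughly doubles eval compute but gives each scored token more context, which is presumably where the small bpb improvement comes from.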

## Command

```bash
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

@@ -0,0 +1,9 @@
{
  "name": "QAT + BigramHash(12288) + Stride 32",
  "val_loss": 1.14443,
  "bytes_total": 15902583,
  "blurb": "10 layers, QAT with STE (int5 MLP / int6 attn), BigramHash 12288, eval stride 32, magnitude pruning 5%, SWA every 50 steps, zstd-22. Based on 10L_Int5MLP_MuonWD04_SWA50.",

Suggested change:

- "blurb": "10 layers, QAT with STE (int5 MLP / int6 attn), BigramHash 12288, eval stride 32, magnitude pruning 5%, SWA every 50 steps, zstd-22. Based on 10L_Int5MLP_MuonWD04_SWA50.",
+ "blurb": "10 layers, QAT with STE (int5 MLP / int6 attn), BigramHash 12288, eval stride 32, magnitude pruning 15%, SWA every 25 steps, zstd-22. Based on 10L_Int5MLP_MuonWD04_SWA50.",
The README states “Magnitude pruning 5%” (and mentions SWA every 50 steps), but the accompanying script prunes at a fixed 15% quantile (`0.15`) and uses `SWA_EVERY=25` by default. Please align the README, or make the pruning percentage and SWA cadence match the configuration actually used for this record.