# 11L Gradient-Guided Quant + EMA + Sliding Eval

**val_bpb: 1.1396** (post int8+zstd quantization roundtrip, sliding window eval stride=64, full validation coverage)

## Run Command

```bash
# Setup
pip install sentencepiece zstandard
python3 data/cached_challenge_fineweb.py

# Train + evaluate (defaults baked into train_gpt.py)
torchrun --nproc_per_node=8 records/track_10min_16mb/2026-03-22_11L_GradQuant_EMA_SlidingEval/train_gpt.py
```

All hyperparameters are set as defaults in `train_gpt.py`. No env vars needed.

## Architecture

- 11 layers, 512 dim, 8 heads, 4 KV heads (GQA)
- MLP 3x expansion (hidden=1536), relu^2 activation
- Encoder-decoder skip connections with learned weights
- SmearGate residual mixing
- NTK-aware RoPE positional encoding
- XSA (cross-sequence attention) on last 4 layers
- Orthogonal initialization
- Tied input/output embeddings
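
The NTK-aware RoPE variant is commonly implemented by stretching the rotary base so that high-frequency components stay near the original while low frequencies are interpolated. A minimal sketch under that assumption — the function name, default base of 10000, and the `scale` parameter are illustrative, not values confirmed by `train_gpt.py`:

```python
def ntk_rope_inv_freq(head_dim: int, base: float = 10000.0, scale: float = 1.0):
    """Inverse rotary frequencies with NTK-aware base rescaling.

    The base is stretched by scale**(d/(d-2)), a common NTK-aware rule:
    the highest frequency (i=0) is unchanged, while lower frequencies
    are progressively interpolated as scale grows.
    """
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    # One frequency per rotated pair of dimensions.
    return [ntk_base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
```

With `scale=1.0` this reduces to vanilla RoPE; larger scales shrink only the low-frequency tail.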

## Training

- Muon optimizer: matrix_lr=0.025, scalar_lr=0.025, momentum=0.99
- AdamW for embeddings/scalars: WD=0.04
- Momentum warmup: 0.92 -> 0.99 over 1500 steps
- Adaptive warmdown: 3000 iters (auto-capped to 55% of total steps for hardware robustness)
- Warmup: 20 steps
- seq_len=2048, batch=786K tokens
- grad_clip=0.3
- EMA: alpha=0.997, initialized from model init
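
The momentum warmup and EMA bullets can be sketched as two small helpers. The linear shape of the warmup schedule is an assumption (the record only states 0.92 -> 0.99 over 1500 steps), and both function names are hypothetical:

```python
def muon_momentum(step: int, warmup_steps: int = 1500,
                  start: float = 0.92, end: float = 0.99) -> float:
    """Warm Muon's momentum from start to end over warmup_steps,
    assuming a linear ramp, then hold at end."""
    frac = min(step / warmup_steps, 1.0)
    return start + (end - start) * frac

def ema_update(shadow: dict, params: dict, alpha: float = 0.997) -> None:
    """In-place EMA of parameters: shadow <- alpha*shadow + (1-alpha)*param.
    Per the log, the shadow is initialized from the model's init weights."""
    for name, p in params.items():
        shadow[name] = alpha * shadow[name] + (1.0 - alpha) * p
```

At eval time the shadow weights replace the live weights (`ema:loading shadow weights` in the log).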

## Compression

- Gradient-guided adaptive quantization: per-tensor bit assignment based on gradient sensitivity
- Top 45% (highest gradient): int7 (127 values)
- Middle 40%: int6 (63 values)
- Bottom 15% (lowest gradient): int5 (31 values)
- zstd level 22 compression
- Artifact: 15,913,419 bytes (code: ~59KB, model: ~15.9MB)

## Evaluation

- Sliding window eval with stride=64, full validation set coverage (~121K windows/GPU)
- Per-token loss scoring: only the last 64 tokens of each 2048-token window are scored, so every scored token sees the full 2048-token context
- Post-quantization roundtrip: quantize -> decompress -> evaluate
- Eval time: ~6 min (runs after training completes)
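
The window schedule implied by the bullets above can be sketched as follows. The function name is hypothetical, and the sketch ignores the first-window edge case (how the opening tokens of the validation set get scored is not specified here):

```python
def sliding_eval_windows(num_tokens: int, window: int = 2048, stride: int = 64):
    """Yield (window_start, score_start) pairs: each `window`-token slice
    is scored only on its last `stride` tokens, and consecutive windows
    advance by `stride`, so together they cover every validation token."""
    start = 0
    while start + window <= num_tokens:
        yield start, start + window - stride
        start += stride
```

At stride 64 over the logged 62,021,632 validation tokens split across 8 GPUs, this works out to roughly 121K windows per GPU, matching the figure above.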
{
"author": "Alberto Luengo",
"github_id": "albertorkive",
"name": "11L Gradient-Guided Quant + EMA + Sliding Eval",
"blurb": "11 layers, 512 dim, MLP 3x, gradient-guided adaptive int5/int6/int7 quantization, EMA (alpha=0.997, from init), SmearGate, NTK-RoPE, XSA (last 4 layers), zstd-22 compression, sliding window eval stride=64. Muon optimizer with momentum warmup, orthogonal init, adaptive warmdown.",
"date": "2026-03-22T17:17:00Z",
"val_loss": 1.92417619,
"val_bpb": 1.13960858,
"bytes_total": 15913419,
"bytes_code": 59010,
"bytes_model_int8_zstd": 15854409
}
W0322 17:17:48.534000 125682743755392 torch/distributed/run.py:779]
W0322 17:17:48.534000 125682743755392 torch/distributed/run.py:779] *****************************************
W0322 17:17:48.534000 125682743755392 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0322 17:17:48.534000 125682743755392 torch/distributed/run.py:779] *****************************************
logs/ccc19fc1-632c-4c37-83d3-75602d4e859d.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:26502232
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=True math=True
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
seed:1337
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
ema:initialized from model init (alpha=0.997)
step:0/20000 val_loss:6.9321 val_bpb:4.1056 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:6.9309 train_time:157ms step_avg:157.17ms
step:2/20000 train_loss:8.1553 train_time:259ms step_avg:129.68ms
step:3/20000 train_loss:8.4565 train_time:365ms step_avg:121.76ms
step:4/20000 train_loss:7.4809 train_time:471ms step_avg:117.77ms
step:5/20000 train_loss:6.7898 train_time:577ms step_avg:115.37ms
step:6/20000 train_loss:6.3670 train_time:683ms step_avg:113.79ms
step:7/20000 train_loss:6.2948 train_time:788ms step_avg:112.63ms
step:8/20000 train_loss:6.2930 train_time:894ms step_avg:111.73ms
step:9/20000 train_loss:6.0769 train_time:999ms step_avg:111.05ms
step:10/20000 train_loss:5.9117 train_time:1106ms step_avg:110.56ms
step:200/20000 train_loss:2.4001 train_time:21628ms step_avg:108.14ms
step:400/20000 train_loss:2.4496 train_time:43541ms step_avg:108.85ms
step:600/20000 train_loss:2.3498 train_time:65273ms step_avg:108.79ms
step:800/20000 train_loss:2.2478 train_time:87264ms step_avg:109.08ms
step:1000/20000 train_loss:2.2832 train_time:109011ms step_avg:109.01ms
step:1000/20000 val_loss:2.1985 val_bpb:1.3021 train_time:109016ms step_avg:109.02ms
step:1200/20000 train_loss:2.3597 train_time:130809ms step_avg:109.01ms
step:1400/20000 train_loss:2.1911 train_time:152694ms step_avg:109.07ms
step:1600/20000 train_loss:2.0803 train_time:174403ms step_avg:109.00ms
step:1800/20000 train_loss:2.1607 train_time:196374ms step_avg:109.10ms
step:2000/20000 train_loss:2.0712 train_time:218101ms step_avg:109.05ms
step:2000/20000 val_loss:2.0976 val_bpb:1.2423 train_time:218106ms step_avg:109.05ms
step:2200/20000 train_loss:2.1394 train_time:239841ms step_avg:109.02ms
step:2400/20000 train_loss:2.0699 train_time:261442ms step_avg:108.93ms
grad_guided_quant: started accumulating (79 tensors)
step:2600/20000 train_loss:2.1105 train_time:283954ms step_avg:109.21ms
step:2800/20000 train_loss:2.1478 train_time:307106ms step_avg:109.68ms
step:3000/20000 train_loss:2.1486 train_time:329999ms step_avg:110.00ms
step:3000/20000 val_loss:2.0416 val_bpb:1.2091 train_time:329999ms step_avg:110.00ms
step:3200/20000 train_loss:2.1548 train_time:353047ms step_avg:110.33ms
step:3400/20000 train_loss:1.9952 train_time:375958ms step_avg:110.58ms
step:3600/20000 train_loss:2.0697 train_time:399077ms step_avg:110.85ms
step:3800/20000 train_loss:2.0410 train_time:421956ms step_avg:111.04ms
step:4000/20000 train_loss:1.9437 train_time:445096ms step_avg:111.27ms
step:4000/20000 val_loss:1.9923 val_bpb:1.1799 train_time:445096ms step_avg:111.27ms
step:4200/20000 train_loss:2.1129 train_time:468136ms step_avg:111.46ms
step:4400/20000 train_loss:1.9926 train_time:490961ms step_avg:111.58ms
step:4600/20000 train_loss:1.7964 train_time:514095ms step_avg:111.76ms
step:4800/20000 train_loss:2.3775 train_time:537003ms step_avg:111.88ms
step:5000/20000 train_loss:2.0476 train_time:560152ms step_avg:112.03ms
step:5000/20000 val_loss:1.9315 val_bpb:1.1439 train_time:560153ms step_avg:112.03ms
step:5200/20000 train_loss:1.9847 train_time:582960ms step_avg:112.11ms
step:5347/20000 val_loss:1.9145 val_bpb:1.1339 train_time:600060ms step_avg:112.22ms
stopping_early: wallclock_cap train_time:600060ms step:5347/20000
peak memory allocated: 28149 MiB reserved: 28408 MiB
ema:loading shadow weights (alpha=0.997)
Serialized model: 105001444 bytes
Code size: 59010 bytes
Total submission size: 105060454 bytes
grad_guided_quant: 79 tensors assigned adaptive bits
bit distribution: {5: 12, 6: 31, 7: 36}
Serialized model int8+zstd: 15854409 bytes (payload:26659168 raw_torch:26714058 payload_ratio:3.94x)
Total submission size int8+zstd: 15913419 bytes
/workspace/parameter-golf/records/track_10min_16mb/2026-03-22_11L_GradQuant_EMA_SlidingEval/train_gpt.py:1285: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
quant_state = torch.load(io.BytesIO(_decompressed), map_location="cpu")
final_int8_zlib_roundtrip val_loss:1.9242 val_bpb:1.1396 eval_time:374873ms
final_int8_zlib_roundtrip_exact val_loss:1.92417619 val_bpb:1.13960858