@@ -0,0 +1,61 @@
# Non-record: 6-Technique Stack — Catalytic Residuals + Value Residual + Gated Attention + BigramHash(10240) + 12L (val_bpb=1.1690)

This is the first submission to combine six architecture improvements that were each validated independently but have not previously been stacked in a single entry.

## Results (8xH100 SXM)

| Metric | Value |
|--------|-------|
| **Sliding window val_bpb** | **1.1690** |
| Post-quant roundtrip val_bpb | 1.2043 |
| Pre-quant val_bpb | 1.1911 |
| Steps | 6,981 |
| Step avg | 85.78 ms |
| Training time | 598.8s |
| Artifact size | 15,372,888 bytes (15.37 MB) |
| Compressed model | 15,312,258 bytes (int6+zstd) |
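
The figures in this table can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check on the reported numbers; it assumes the "16MB budget" mentioned later means 16,000,000 bytes, which this README does not state explicitly.

```python
# Figures copied from the results table above.
steps = 6981
step_avg_ms = 85.78
artifact_bytes = 15_372_888
compressed_model_bytes = 15_312_258
pre_quant_bpb = 1.1911
post_quant_bpb = 1.2043
sliding_window_bpb = 1.1690

# ASSUMPTION: the budget is 16,000,000 bytes (decimal megabytes).
BUDGET_BYTES = 16_000_000

# Wall-clock training time implied by step count and average step time.
train_seconds = steps * step_avg_ms / 1000.0

# Bytes in the artifact that are not the compressed model.
non_model_bytes = artifact_bytes - compressed_model_bytes

# Space left under the assumed budget.
headroom_bytes = BUDGET_BYTES - artifact_bytes

# Cost of the int6 roundtrip, and what sliding-window eval recovers.
quant_penalty_bpb = post_quant_bpb - pre_quant_bpb
sliding_gain_bpb = post_quant_bpb - sliding_window_bpb

print(f"implied training time : {train_seconds:.1f} s")
print(f"non-model bytes       : {non_model_bytes:,}")
print(f"headroom under budget : {headroom_bytes:,} bytes")
print(f"quantization penalty  : {quant_penalty_bpb:+.4f} bpb")
print(f"sliding-window gain   : {sliding_gain_bpb:.4f} bpb")
```

The implied training time agrees with the reported 598.8s, so the step count and step average are consistent with each other.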

## Architecture

- **12 layers**, 512 dim, 8 heads / 4 KV heads (GQA), 3x MLP (ReLU-squared)
- Vocab 1024 (SentencePiece BPE), seq len 1024, tied embeddings
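
A rough parameter count follows from these dimensions. The sketch below assumes bias-free linear layers, a head dimension of `dim / heads`, and a two-matrix (non-gated) MLP. It leaves out norms, learned scalars, and the BigramHash table, none of which are sized in this README. `count_block_params` is a hypothetical helper, not a function from `train_gpt.py`.

```python
def count_block_params(dim, num_heads, num_kv_heads, mlp_mult):
    """Weights in one transformer block (ASSUMPTION: no biases, no norms)."""
    head_dim = dim // num_heads
    kv_dim = num_kv_heads * head_dim      # GQA: K and V use fewer heads
    attn = dim * dim                      # Q projection
    attn += 2 * dim * kv_dim              # K and V projections
    attn += dim * dim                     # output projection
    mlp = 2 * dim * (mlp_mult * dim)      # up and down projections
    return attn + mlp


block_params = count_block_params(dim=512, num_heads=8, num_kv_heads=4, mlp_mult=3)
embedding_params = 1024 * 512             # counted once: embeddings are tied
total_params = 12 * block_params + embedding_params

print(f"per block : {block_params:,}")
print(f"embedding : {embedding_params:,}")
print(f"total     : {total_params:,}")
```

Under these assumptions the attention and MLP weights dominate, which is consistent with those being the tensors stored as int6.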

## Novel Technique Combination

Each technique was validated in an earlier PR or the merged SOTA baseline (see the Source column), with controlled ablation data behind each measured impact. This is the first submission to combine all six:

| Technique | Source | Measured Impact | Description |
|-----------|--------|-----------------|-------------|
| **Catalytic Residuals** | PR #450 | -0.024 bpb | `x + c * f(x)` with learned per-dim scalar c (init 1.0). Zero compute overhead. |
| **Value Residual (ResFormer)** | PR #413, arXiv:2410.17897 | -0.015 bpb | Cache layer-0 V vectors, mix into subsequent layers via learned scalars. |
| **Gated Attention** | PR #413, arXiv:2505.06708 | -0.003 bpb | Per-head sigmoid gate after attention output. |
| **BigramHash(10240)** | PR #450 | -0.070 bpb (vs 2048) | Hash-based bigram embedding with 10240 buckets. |
| **12 Layers** | PR #450 | -0.023 bpb (vs 11L) | Deeper model within 16MB budget. |
| **3x MLP** | Merged SOTA | Standard | 3x expansion vs baseline 2x. |
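
The first four rows can be illustrated on plain Python lists. This is a minimal sketch of the ideas as the table describes them, not the code from `train_gpt.py`: all function names are hypothetical, the exact form of the value mixing is assumed to be a two-scalar blend, the gate logits are taken as given, and the bigram hash is an arbitrary stand-in because the real hash is not given here.

```python
import math


def catalytic_residual(x, fx, c):
    """x + c * f(x), with one learned scale c[i] per channel."""
    return [xi + ci * fi for xi, fi, ci in zip(x, fx, c)]


def mix_value_residual(v_current, v_first, lam_current, lam_first):
    """Blend this layer's V vector with the cached layer-0 V vector.

    ASSUMPTION: a two-scalar linear blend; the source only says
    "learned scalars".
    """
    return [lam_current * vc + lam_first * vf
            for vc, vf in zip(v_current, v_first)]


def gate_heads(head_outputs, gate_logits):
    """Scale each head's attention output by sigmoid(gate logit)."""
    gated = []
    for out, logit in zip(head_outputs, gate_logits):
        g = 1.0 / (1.0 + math.exp(-logit))
        gated.append([g * o for o in out])
    return gated


def bigram_bucket(prev_token, token, num_buckets=10240):
    """Map a (previous, current) token pair to an embedding-table row.

    HYPOTHETICAL hash: the multiplier is an arbitrary constant chosen
    for illustration only.
    """
    return (prev_token * 1_000_003 + token) % num_buckets


# With c initialised to 1.0 the block starts as an ordinary residual.
print(catalytic_residual([1.0, 2.0], [0.5, -1.0], [1.0, 1.0]))
print(mix_value_residual([1.0, 2.0], [3.0, 4.0], 0.5, 0.5))
print(gate_heads([[2.0, 4.0]], [0.0]))
print(bigram_bucket(17, 42))
```

With a vocabulary of 1024 there are 1024 × 1024 = 1,048,576 possible bigrams, so 10240 buckets means roughly 102 bigrams share each row on average.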

## Additional Techniques

- **OrthoInit**: Orthogonal weight init with muP-style projection scaling
- **Muon WD=0.04**: Decoupled weight decay on both Muon and AdamW
- **SWA**: Stochastic weight averaging from last 20% of warmdown
- **Late QAT (threshold 0.25)**: STE int6 fake-quantize during warmdown
- **Sliding window eval (stride 64)**: Overlapping windows for final BPB
- **Logit softcap 30.0**
- **Int6+zstd compression**: Mixed int6 (mlp+attn) / int8 (embed) with zstandard level 22
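
Three of the items above are easy to sketch in isolation. The code below is illustrative only and uses hypothetical function names: the softcap is assumed to be the common `cap * tanh(x / cap)` form, the int6 grid is assumed to be symmetric with levels -31..31 and a single scale per weight group, and the sliding-window schedule is assumed to score each token exactly once, in the window where it has the most left context. Only the forward pass of fake quantization is shown; with a straight-through estimator the backward pass would treat the rounding as the identity.

```python
import math


def softcap(logit, cap=30.0):
    """Squash a logit smoothly into (-cap, cap)."""
    return cap * math.tanh(logit / cap)


def fake_quantize_int6(weights):
    """Round-trip weights through a 6-bit grid (forward pass only).

    ASSUMPTION: symmetric levels -31..31 with one scale for the group.
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / 31.0
    return [max(-31, min(31, round(w / scale))) * scale for w in weights]


def sliding_windows(num_tokens, seq_len=1024, stride=64):
    """Return (window_start, window_end, first_scored) spans.

    Tokens in [first_scored, window_end) are scored in that window;
    earlier tokens in the window serve only as context.
    """
    spans = []
    scored_until = 0
    start = 0
    while scored_until < num_tokens:
        end = min(start + seq_len, num_tokens)
        spans.append((start, end, scored_until))
        scored_until = end
        start += stride
    return spans


print(softcap(1000.0))
print(fake_quantize_int6([0.0, 0.3, -1.0]))
for span in sliding_windows(1200):
    print(span)
```

In this schedule every token after the first window is scored with at least `seq_len - stride` = 960 tokens of context, which is the usual reason sliding-window BPB comes out lower than a non-overlapping evaluation.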

## Run Command

```bash
pip install sentencepiece zstandard
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All hyperparameters are set as defaults, so no environment variables are needed for the standard run.

## Dependencies

- PyTorch >= 2.5 (native GQA via `enable_gqa=True` in SDPA)
- sentencepiece
- zstandard
- numpy
@@ -0,0 +1,23 @@
{
  "name": "Joshua Warren",
  "github": "joshuaswarren",
  "val_bpb": "1.1690",
  "val_loss": "1.9738",
  "num_layers": 12,
  "model_dim": 512,
  "num_heads": 8,
  "num_kv_heads": 4,
  "mlp_mult": 3,
  "vocab_size": 1024,
  "seq_len": 1024,
  "steps": 6981,
  "step_avg_ms": 85.78,
  "artifact_bytes": 15372888,
  "compressed_model_bytes": 15312258,
  "training_time_seconds": 598.84,
  "hardware": "8xH100 SXM",
  "quantization": "int6+zstd",
  "eval_method": "sliding_window_stride_64",
  "techniques": "Catalytic Residuals, Value Residual (ResFormer), Gated Attention, BigramHash(10240), 12L, 3xMLP, OrthoInit, Muon WD=0.04, SWA, Late QAT, Sliding Window Eval",
  "novel_contribution": "First submission combining 6 independently-proven architecture improvements never stacked together: Catalytic Residuals (-0.024), Value Residual (-0.015), Gated Attention (-0.003), BigramHash(10240) (-0.070 vs 2048), 12 layers (-0.023), 3xMLP."
}