<!-- PLAN.md (new file, +269 lines) -->
# Parameter Golf — Fractal Transformer Research Plan
**DGX Spark · GB10 · March 2026**

---

## Challenge Summary

| Constraint | Value |
|------------|-------|
| Artifact size | ≤16MB (code + int8 quantized + zlib compressed weights) |
| Training time | ≤10 minutes on 8×H100 |
| Metric | bits-per-byte (BPB) on FineWeb validation set |
| Baseline | 1.2244 BPB |
| Record threshold | ≤1.2194 BPB (must beat by ≥0.005) |
| 4-hour unlimited baseline | 1.2074 BPB |
| Challenge window | March 18 → April 30, 2026 |
| Repo | https://github.com/newjordan/parameter-golf |
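
Bits-per-byte relates the model's cross-entropy to the raw byte count of the evaluation text. A minimal sketch of the conversion (the 1-token-per-byte ratio in the comment is purely illustrative; the challenge's tokenizer maps multiple bytes per token):

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (nats, over all predicted tokens)
    into bits per byte of the underlying evaluation text."""
    return total_loss_nats / (math.log(2) * total_bytes)

# Illustration only: a mean loss of 0.8487 nats/token at 1 token per byte
# gives roughly the baseline BPB of 1.2244.
```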

---

## Our Approach: Fractal Transformer + Gravity + AttnRes

### Core Thesis

Weight-shared transformer layers with learned gravitational auxiliary losses
and attention residuals will achieve lower BPB than the baseline's 9-unique-layer
architecture within the same 16MB parameter budget.

### Three Innovations Combined

**1. Fractal Architecture (Weight Sharing / Depth Recurrence)**

Instead of 9 unique layers, use 3 unique layers repeated in 3 loops.

```
CURRENT BASELINE:
9 unique layers × 512 dim = ~14M params

OUR APPROACH:
3 unique layers × 3 loops = 9 effective layers
Wider layers (~700 dim) with same total param count
Loop position embedding tells shared weights which pass they're on
```

Why this helps:
- Fewer unique parameters → more room in 16MB budget → wider layers
- Wider layers = richer features per layer
- Weight sharing compresses extremely well under int8+zlib
- Depth recurrence explicitly encouraged by the challenge README
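
A back-of-envelope for the width reclaimed by sharing, assuming per-layer parameter count scales with dim² (the constant cancels; embeddings, the lm_head, and the exact attention/MLP shapes pull the achievable width below this ceiling, which is why the plan targets ~700 rather than the bound):

```python
import math

def width_for_shared_budget(n_shared: int, n_baseline: int, base_dim: int) -> int:
    """Width at which n_shared unique layers cost the same as n_baseline
    unique layers of base_dim, under params-per-layer ~ c * dim^2."""
    return int(base_dim * math.sqrt(n_baseline / n_shared))

# width_for_shared_budget(3, 9, 512) -> 886, an upper bound before
# embedding and lm_head costs are subtracted from the budget.
```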

**2. Gravity (Learned Auxiliary Losses)**

At the end of each loop, peek at the output using the shared lm_head and
compute an auxiliary cross-entropy loss. The weights are LEARNED parameters.

```python
# In __init__ (assumes torch.nn as nn, torch.nn.functional as F):
self.gravity_weights = nn.Parameter(torch.tensor([0.1, 0.3, 1.0]))

# In forward(): accumulate a weighted loss from every loop.
total_loss = 0.0
for loop in range(3):
    x = run_shared_layers(x, loop_pos=loop)   # same 3 shared layers each pass
    loop_logits = lm_head(rms_norm(x))        # peek with the shared lm_head
    loop_loss = F.cross_entropy(loop_logits, targets)
    total_loss += F.softplus(self.gravity_weights[loop]) * loop_loss
```

Why this helps:
- 3× gradient signal — every layer gets direct supervision, not diluted backprop
- Model discovers optimal loop weighting during training
- Especially powerful with weight sharing: same weights receive gradient from 3 depths
- Zero new parameters (3 scalars for weights, reuses existing lm_head)
- ~1.2% compute overhead (2 extra lm_head calls)

The "gravity" analogy:
- Loop 1 output is far from the target → strong pull, large updates
- Loop 2 is closer → medium pull, refinement
- Loop 3 is nearest → full weight, precision
- Each loop starts from a better position because the previous loop was already pulled toward the answer

**3. AttnRes (Attention Residuals)**

Replace fixed skip connections with learned, input-dependent attention over depth.
From Moonshot's paper (arxiv:2603.15031).

```
Standard residuals: x = x + layer_output (fixed, uniform weight)
AttnRes: x = softmax(query · [prev_outputs]) · [prev_outputs]
```

Each layer has a single learned query vector w_l ∈ R^d that attends over all
previous loop outputs. The softmax produces content-aware, input-dependent
weights instead of fixed uniform accumulation.

Why this helps:
- Paper shows 1.25× compute equivalent for near-zero parameter cost
- Replaces BOTH the baseline's U-Net skips AND resid_mix
- Only 9 × dim new parameters (a few thousand at the planned width)
- Critical for weight sharing: lets later loops selectively reference earlier loops
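
A minimal sketch of the mechanism described above (class and method names are our own; the paper's exact formulation may differ in normalization or scaling):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnRes(nn.Module):
    """One learned query per layer attends over all stored depth outputs;
    the residual stream becomes their softmax-weighted mix (sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(dim))  # w_l in R^d

    def forward(self, history):
        # history: list of (batch, seq, dim) tensors from earlier loops
        stacked = torch.stack(history)                 # (L, B, T, D)
        scores = torch.einsum("lbtd,d->lbt", stacked, self.query)
        weights = F.softmax(scores, dim=0)             # attend over depth
        return torch.einsum("lbt,lbtd->btd", weights, stacked)
```

With a zero-initialized query the softmax is uniform, so training starts from a plain average of the stored states and learns to specialize from there.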

### What We Remove From Baseline

| Component | Parameters | Replaced By |
|-----------|-----------|-------------|
| U-Net encoder/decoder split | structural | Fractal loops |
| skip_weights (9 × 512) | 4,608 | AttnRes queries |
| resid_mix (9 × 2 × 512) | 9,216 | AttnRes |
| **Total removed** | **~13,824** | |

### What We Add

| Component | Parameters | Purpose |
|-----------|-----------|---------|
| AttnRes queries (9 layers × ~700 dim) | ~6,300 | Selective depth attention |
| Loop position embeddings (3 loops × ~700 dim) | ~2,100 | Tell weights which loop they're in |
| Gravity weights (3 scalars) | 3 | Learned auxiliary loss weighting |
| **Total added** | **~8,403** | |

**Net: ~5,421 parameters saved → reinvested into wider layers.**

---

## Architecture Diagram

```
INPUT TOKENS (1024 vocab)
EMBEDDING (1024 × ~700 dim)
LOOP 1 (broad strokes):
├── Layer A (attention + MLP, loop_pos=0)
├── Layer B (attention + MLP, loop_pos=0)
├── Layer C (attention + MLP, loop_pos=0)
├── GRAVITY: peek → compute loss₁ (learned weight ~0.1)
└── Store loop 1 output for AttnRes
LOOP 2 (refinement):
├── AttnRes: attend over [embedding, loop1_output]
├── Layer A (attention + MLP, loop_pos=1) ← same weights as loop 1
├── Layer B (attention + MLP, loop_pos=1)
├── Layer C (attention + MLP, loop_pos=1)
├── GRAVITY: peek → compute loss₂ (learned weight ~0.3)
└── Store loop 2 output for AttnRes
LOOP 3 (precision):
├── AttnRes: attend over [embedding, loop1_output, loop2_output]
├── Layer A (attention + MLP, loop_pos=2) ← same weights again
├── Layer B (attention + MLP, loop_pos=2)
├── Layer C (attention + MLP, loop_pos=2)
└── FINAL LOSS: full cross-entropy (weight = 1.0)
OUTPUT: logits → BPB
```

Each loop tightens the representation:
- Loop 1: rough sketch (only sees embedding)
- Loop 2: refinement (sees embedding + loop 1 output via AttnRes)
- Loop 3: precision (sees full history, committed to answer)
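
The loop position signal in the diagram can be as simple as one learned vector per loop, added to the stream before the shared layers run (a sketch; names are ours):

```python
import torch
import torch.nn as nn

class LoopPositionEmbedding(nn.Module):
    """One learned vector per loop so the shared weights can tell
    which pass they are executing."""
    def __init__(self, num_loops, dim):
        super().__init__()
        self.emb = nn.Embedding(num_loops, dim)

    def forward(self, x, loop):
        # x: (batch, seq, dim); broadcast the loop vector over batch and seq
        return x + self.emb.weight[loop]
```

Parameter cost is num_loops × dim, matching the ~2,100 figure at ~700 dim.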

---

## Information Tightening Mechanisms

### Gravity (primary — Frosty's intuition)
Each loop is pulled toward the final answer by its own loss signal. Later loops
start from better positions because earlier loops were already course-correcting.
The model learns how hard each loop should pull (learned gravity weights).

### AttnRes (secondary — from Moonshot paper)
Selective attention over previous loop outputs. Later loops can choose which
earlier representations are useful for each specific token, not a fixed blend.

### Future: Ring Buffer + Temperature Cooling (Phase 4)
- Ring buffer: bounded memory with eviction of unhelpful previous states
- Temperature: AttnRes attention sharpens with depth (soft early, committed late)
- Only add if Phase 1-3 show signal
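
If Phase 4 goes ahead, temperature cooling could be as small as dividing the AttnRes depth scores by a per-loop temperature before the softmax (values below are illustrative placeholders, not a tuned schedule):

```python
import torch
import torch.nn.functional as F

def depth_attention_weights(scores, loop, temps=(2.0, 1.0, 0.5)):
    """Soft mixing early (high temperature), committed late (low)."""
    return F.softmax(scores / temps[loop], dim=0)
```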

---

## Experiment Sequence

### Phase 1: Establish Weight Sharing Baselines
1. Run baseline as-is → establish local BPB reference
2. 3 shared layers × 3 loops, same total params, ~512 dim → does sharing work?
3. 3 shared layers × 3 loops, wider ~700 dim → does width help?
4. 2 shared layers × 4 loops, widest ~850 dim → more loops?
5. 4 shared layers × 2 loops, ~620 dim → fewer loops?

### Phase 2: Add Gravity
6. Best config from Phase 1 + gravity with learned weights
7. Compare: gravity learned vs gravity fixed [0.1, 0.3, 1.0] vs no gravity

### Phase 3: Add AttnRes
8. Best from Phase 2 + full AttnRes
9. Test: AttnRes before attention only / before MLP only / both
10. Test: AttnRes with vs without gravity

### Phase 4: Advanced Mechanisms
11. Add ring buffer (bounded memory with eviction)
12. Add temperature cooling on AttnRes
13. Try combining all mechanisms

### Phase 5: Optimize for Submission
14. Verify int8+zlib artifact ≤16MB
15. Tune width to maximize quality within size budget
16. Port winning config to official train_gpt.py style
17. Run on cloud 8×H100, verify 10-minute timing
18. Prepare submission folder for /records
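
Step 14 can be smoke-tested locally with a rough stand-in for the official packaging (symmetric per-tensor int8 plus zlib here; the challenge's exact quantization and compression settings may differ):

```python
import zlib

import torch

def artifact_size_mb(state_dict):
    """Approximate compressed artifact size: per-tensor symmetric int8
    quantization followed by zlib level 9."""
    blobs = []
    for t in state_dict.values():
        t = t.detach().float()
        scale = t.abs().max() / 127 + 1e-12   # guard against all-zero tensors
        q = torch.clamp((t / scale).round(), -127, 127).to(torch.int8)
        blobs.append(q.numpy().tobytes())
    return len(zlib.compress(b"".join(blobs), level=9)) / 2**20

# e.g. check artifact_size_mb(model.state_dict()) <= 16.0 before submitting
```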

---

## Workflow

### Local (DGX Spark, free, unlimited)
- Adapted research fork without Triton/torch.compile dependency
- Shorter training budget (2 min per experiment)
- Smaller batch size
- Same model, data, tokenizer, BPB metric
- Results won't match H100 numbers but relative ordering transfers
- Run 50-100 experiments to find winning configuration
- Autoresearch agent runs overnight (Phase 1-4)

### Cloud (H100s, paid, limited)
- Take best configuration from local experiments
- Run at full scale: 8×H100, 10 minutes, full batch
- Verify BPB, artifact size, timing
- Prepare official submission

---

## Source Material

### Attention Residuals (Moonshot)
- Paper: arxiv:2603.15031
- Repo: https://github.com/MoonshotAI/Attention-Residuals
- Core: replace fixed residual connections with softmax attention over depth
- Result: matches 1.25× compute baseline at near-zero parameter cost

### Autoresearch (Karpathy)
- Repo: https://github.com/karpathy/autoresearch
- Core: AI agent modifies train.py, trains 5 min, keeps/discards, loops forever
- Adapted as our outer optimization loop

### Parameter Golf Baseline
- Repo: https://github.com/openai/parameter-golf
- Architecture: 9-layer GPT, 512 dim, 1024 vocab, GQA, Muon optimizer
- Key features: U-Net skip connections, resid_mix, ReLU², logit softcapping
- BPB: 1.2244 (10 min), 1.2074 (4 hour)

---

## Key Insight

The competition rewards compression quality per parameter. Weight sharing is
the ultimate compression — the same function applied repeatedly. AttnRes gives
that repeated function the ability to selectively reference its earlier outputs.
Gravity ensures every repetition is actively pulled toward the correct answer.

The fractal structure means each loop genuinely tightens the representation:
same weights, progressively richer input, direct loss supervision at every
stage. The model isn't just repeating — it's refining.

---

*Plan authored by Octavian + Frosty · Spark-2949 · 2026-03-18*
<!-- RESULTS.md (new file, +69 lines) -->
# Parameter Golf — Local Experiment Results
**DGX Spark GB10 · 2026-03-18**

## Experiment Ladder (300 steps, 1 train shard, 1M eval tokens)

| # | Config | val_bpb | Δ vs baseline | params | dim | ms/step |
|---|--------|--------:|----------:|-------:|----:|--------:|
| 1 | Baseline (9 unique layers, 512d) | 2.7927 | — | 17.05M | 512 | 167 |
| 2 | **Fractal only (3×3, 864d)** | **2.5953** | **-0.1975** | 16.57M | 864 | 333 |
| 3 | Fractal + Gravity (3×3, 864d) | 2.6149 | -0.1779 | 16.57M | 864 | 347 |
| 4 | Fractal + Gravity + AttnRes (3×3, 864d) | 2.6084 | -0.1843 | 16.58M | 864 | 425 |

## Training Loss Comparison (300 steps)

| Step | Baseline | Fractal | Fractal+Gravity | Fractal+Grav+AttnRes |
|------|----------|---------|-----------------|---------------------|
| 50 | 5.8850 | — | 5.8229 | — |
| 100 | 5.2427 | — | 5.0172 | — |
| 150 | 4.8926 | — | 4.6254 | — |
| 200 | 4.7830 | — | 4.5360 | — |
| 250 | 4.7162 | — | 4.4521 | — |
| 300 | 4.6554 | 4.3473 | 4.3794 | 4.3751 |

## Key Findings

1. **Weight sharing + wider layers is the dominant effect.** Fractal-only beats baseline
by 0.197 BPB (a 7.1% relative improvement) with fewer total parameters. The 864d shared
layers are significantly more expressive than the baseline's 512d unique layers.

2. **Gravity slightly hurts at 300 steps.** The auxiliary losses on early loops add gradient
noise before those loops learn to produce useful predictions. The model learned effective
weights of [0.13, 0.13, 0.70], pushing early-loop influence toward zero (softplus can
never reach zero exactly).

3. **AttnRes partially recovers the gravity penalty.** Selective depth attention helps
the model route around noisy early-loop outputs.

4. **All fractal variants beat baseline convincingly.** Even the worst fractal config
(fractal+gravity at 2.6149) still beats baseline (2.7927) by 0.18 BPB.

## Hypothesis for Full-Scale Runs

Gravity and AttnRes should improve with more training steps because:
- Early loops need many steps to learn useful intermediate predictions
- At 13,000+ steps (H100 10-minute budget), the gravity signal should become useful
- The learned gravity weights should evolve from [0.13, 0.13, 0.70] toward something
that actually leverages early loops

## Learned Gravity Weights (Experiments 3 & 4)

Both converged to approximately `[0.127, 0.127, 0.699]`:
- softplus(-2.0) ≈ 0.127 (early loops, barely contributing)
- softplus(0.0) = ln 2 ≈ 0.693 (the final loop's raw weight settled just above zero, giving 0.699)
- The model essentially learned to "turn off" early gravity, confirming that at
  300 steps, direct early-loop supervision is noise rather than signal
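
The reported effective weights are consistent with raw parameters near -2.0 (early loops) and just above 0.0 (final loop), which is easy to verify directly:

```python
import math

def softplus(x):
    return math.log1p(math.exp(x))

# softplus(-2.0) ≈ 0.1269 and softplus(0.0) = ln 2 ≈ 0.6931; the observed
# 0.699 corresponds to a raw final-loop weight slightly above zero.
```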

## Next Steps

1. Try gravity with warmup: zero gravity for first 100 steps, then ramp up
2. Try different loop configs: 2×4, 4×2, 2×5
3. Ship fractal-only (best local result) to cloud H100s for official timing
4. Ship fractal+gravity+attnres as second cloud experiment to test if it
overtakes with more training

## Environment
- Hardware: DGX Spark GB10, 130.7GB unified VRAM
- PyTorch: 2.10.0+cu130 (no torch.compile, no Triton)
- Data: FineWeb sp1024, 1 train shard, ~100M train tokens
- Eval: 1M validation tokens (truncated for speed)
- Optimizer: AdamW (not Muon — local simplification)
---
## Record: 11L TTT Burst + EMA + GPTQ-lite + warmdown3500 + QAT@0.15

**val_bpb: 1.1236** (sliding window stride=64, 2-seed mean) | **15.59 MB** (mean) | 8xH100 SXM, 600s

### Key Innovation Over PR #414

| Change | PR #414 | This | Impact |
|--------|---------|------|--------|
| **TTT Burst** | None | 2-epoch replay of last 100 training batches at 10% LR before EMA | -0.0001 BPB |

Everything else inherited from PR #414: EMA(0.997), GPTQ-lite(5 percentiles), warmdown 3500, Late QAT@0.15, int6+zstd-22.

### TTT Burst: Late-Stage Sharpening

After the main training loop and before EMA application, we replay the last 100 training batches for 2 epochs at 10% of base LR. EMA is updated during the burst so it absorbs the sharpened signal. This gives the model a final sharpening pass on recent data before weight averaging and quantization.
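
A sketch of the burst as described (function and argument names are ours; hyperparameters follow the text: last ~100 batches, 2 epochs, 10% LR, EMA decay 0.997):

```python
import torch

def ttt_burst(model, optimizer, replay_batches, base_lr, ema_model, decay=0.997):
    """Replay recent batches at reduced LR, updating the EMA throughout
    so it absorbs the sharpened weights before quantization."""
    for group in optimizer.param_groups:
        group["lr"] = base_lr * 0.10            # 10% of base LR
    for _ in range(2):                          # 2 replay epochs
        for inputs, targets in replay_batches:  # last ~100 training batches
            loss = model(inputs, targets)       # model returns its training loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            with torch.no_grad():               # EMA updated during the burst
                for p_ema, p in zip(ema_model.parameters(), model.parameters()):
                    p_ema.mul_(decay).add_(p, alpha=1 - decay)
```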

### Results (3 seeds, 8xH100 SXM)

| Seed | Steps | val_loss | Sliding BPB (s64) | Artifact |
|------|-------|----------|-------------------|----------|
| **1337** | 6991 | 1.9246 | **1.1232** | 15.68 MB |
| 42 | 6994 | 1.9262 | 1.1240 | 16.37 MB* |
| **2024** | 6987 | 1.9255 | **1.1239** | 15.50 MB |

**Mean (1337+2024): 1.1236 | Std: 0.0004**

*Seed 42 artifact over size limit due to compression variance; BPB validates the approach.

### Architecture

11L, 512d, 8H/4KV, MLP 3x (relu^2), U-Net skips, XSA4, Partial RoPE 16/64, LN Scale, VE128, SmearGate, BigramHash(2048), FA3, Muon WD=0.04, EMA(0.997), Tight SWA, Late QAT@0.15, TTT Burst(2ep/10%LR), int6+zstd-22, GPTQ-lite.

### Run Command

```bash
SEED=1337 torchrun --nproc_per_node=8 train_gpt.py
```

### Test plan

- [x] All seeds train in 600s on 8xH100
- [x] Seeds 1337, 2024 under 16MB (15.68 MB, 15.50 MB)
- [x] Post-quant int6 roundtrip verified
- [x] Sliding window eval (stride=64) consistent across seeds (std=0.0004)
- [x] train_gpt.py under 1500 lines (1443)
- [x] No TTT on validation data