From e3ee33fcad67a42b75d5297213424686c6e78b76 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 21 Mar 2026 22:11:42 +0000 Subject: [PATCH 1/4] Initial plan From 5ee7f6329d68f7dcae4d6ad92a80b4b76efb0eab Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 21 Mar 2026 22:16:08 +0000 Subject: [PATCH 2/4] Move all parameter golf documents into docs/parameter-golf/ and update cross-references Co-authored-by: devlux76 <86517969+devlux76@users.noreply.github.com> Agent-Logs-Url: https://github.com/devlux76/q2/sessions/bd82c93d-89a9-49d4-80ee-b3f311e2d20c --- README.md | 2 +- .../parameter-golf/ANALYSIS.md | 14 +++++++------- .../parameter-golf/APPROACH_INITIAL.md | 0 .../parameter-golf/APPROACH_REVISED.md | 6 +++--- .../DESIGN_REVISION_PLAN.md} | 0 .../IMPLEMENTATION.md} | 2 +- .../STRATEGY.md} | 0 .../WILDBERGER_RUBINE_REVIEW.md} | 0 scripts/q2_pack.py | 2 +- scripts/train_q2_ltc.py | 4 ++-- 10 files changed, 15 insertions(+), 15 deletions(-) rename PARAMETER_GOLF.md => docs/parameter-golf/ANALYSIS.md (98%) rename PARAMETER_GOLF_APPROACH.md => docs/parameter-golf/APPROACH_INITIAL.md (100%) rename PARAMETER_GOLF_REVISED.md => docs/parameter-golf/APPROACH_REVISED.md (98%) rename docs/{design-revision-plan.md => parameter-golf/DESIGN_REVISION_PLAN.md} (100%) rename docs/{parameter-golf-implementation.md => parameter-golf/IMPLEMENTATION.md} (99%) rename docs/{parameter-golf.md => parameter-golf/STRATEGY.md} (100%) rename docs/{wildberger-rubine-review.md => parameter-golf/WILDBERGER_RUBINE_REVIEW.md} (100%) diff --git a/README.md b/README.md index 8326d84..cd6304e 100644 --- a/README.md +++ b/README.md @@ -4,7 +4,7 @@ Quaternary Quantization > **Quality gate:** this repo treats lint warnings as errors, and `bun run check` (lint + typecheck) is required for builds, tests, and CI. 
-> **Parameter Golf:** the approach for the OpenAI challenge is in [`docs/parameter-golf.md`](docs/parameter-golf.md). +> **Parameter Golf:** all documents for the OpenAI challenge are in [`docs/parameter-golf/`](docs/parameter-golf/). ## What it does diff --git a/PARAMETER_GOLF.md b/docs/parameter-golf/ANALYSIS.md similarity index 98% rename from PARAMETER_GOLF.md rename to docs/parameter-golf/ANALYSIS.md index d4d3d77..ab94a7c 100644 --- a/PARAMETER_GOLF.md +++ b/docs/parameter-golf/ANALYSIS.md @@ -1,9 +1,9 @@ # Parameter Golf: A Q²-Based Strategy -> **Related documents:** [DESIGN.md](DESIGN.md) · [RELATED_WORK.md](RELATED_WORK.md) +> **Related documents:** [DESIGN.md](../../DESIGN.md) · [RELATED_WORK.md](../../RELATED_WORK.md) -Section references of the form §D-x.y refer to [DESIGN.md](DESIGN.md). -Section references of the form §R-x refer to [RELATED_WORK.md](RELATED_WORK.md). +Section references of the form §D-x.y refer to [DESIGN.md](../../DESIGN.md). +Section references of the form §R-x refer to [RELATED_WORK.md](../../RELATED_WORK.md). --- @@ -830,14 +830,14 @@ For QAT-from-scratch, 2-bit is the correct choice from both a Williams perspecti #### Reconciliation with parallel analyses -Two parallel analyses (in `PARAMETER_GOLF_REVISED.md` and `docs/parameter-golf.md` -on the `main` branch) reach compatible conclusions: +Two parallel analyses (in `APPROACH_REVISED.md` and `STRATEGY.md` +in this folder) reach compatible conclusions: -- `PARAMETER_GOLF_REVISED.md` correctly identifies that **odd bit-widths are +- `APPROACH_REVISED.md` correctly identifies that **odd bit-widths are suboptimal for cache alignment** and recommends power-of-2 widths. Williams confirms this: every wasted bit reduces $N$, directly increasing bpb. -- `docs/parameter-golf.md` recommends mixed int5/int6 precision, which is the +- `STRATEGY.md` recommends mixed int5/int6 precision, which is the leaderboard SOTA approach. The Williams analysis shows this is suboptimal vs. 
2-bit QAT because it achieves $N_{\text{eff}} \approx 24$ M at int5 (not the nominal 25.6 M, due to register alignment), while Q² 2-bit achieves $N = 64$ M. diff --git a/PARAMETER_GOLF_APPROACH.md b/docs/parameter-golf/APPROACH_INITIAL.md similarity index 100% rename from PARAMETER_GOLF_APPROACH.md rename to docs/parameter-golf/APPROACH_INITIAL.md diff --git a/PARAMETER_GOLF_REVISED.md b/docs/parameter-golf/APPROACH_REVISED.md similarity index 98% rename from PARAMETER_GOLF_REVISED.md rename to docs/parameter-golf/APPROACH_REVISED.md index a22eaef..7d360d2 100644 --- a/PARAMETER_GOLF_REVISED.md +++ b/docs/parameter-golf/APPROACH_REVISED.md @@ -1,8 +1,8 @@ # Parameter Golf: Revised Strategy (PyTorch-Native Q²) > **Status**: Revised based on feedback -> **Supersedes**: PARAMETER_GOLF_APPROACH.md (initial strategy) -> **Related**: [DESIGN.md](DESIGN.md), [RELATED_WORK.md](RELATED_WORK.md) +> **Supersedes**: APPROACH_INITIAL.md (initial strategy) +> **Related**: [DESIGN.md](../../DESIGN.md), [RELATED_WORK.md](../../RELATED_WORK.md) ## Executive Summary @@ -576,4 +576,4 @@ The core Q² mathematical framework (Lee metric, Gray map, Geode factorization) **Document Status**: Ready for implementation **Last Updated**: 2026-03-21 -**Supersedes**: PARAMETER_GOLF_APPROACH.md +**Supersedes**: APPROACH_INITIAL.md diff --git a/docs/design-revision-plan.md b/docs/parameter-golf/DESIGN_REVISION_PLAN.md similarity index 100% rename from docs/design-revision-plan.md rename to docs/parameter-golf/DESIGN_REVISION_PLAN.md diff --git a/docs/parameter-golf-implementation.md b/docs/parameter-golf/IMPLEMENTATION.md similarity index 99% rename from docs/parameter-golf-implementation.md rename to docs/parameter-golf/IMPLEMENTATION.md index d2889aa..ea04923 100644 --- a/docs/parameter-golf-implementation.md +++ b/docs/parameter-golf/IMPLEMENTATION.md @@ -1,7 +1,7 @@ # Parameter Golf: Implementation Roadmap > **Status**: Ready for implementation -> **Related**: 
[PARAMETER_GOLF_APPROACH.md](../PARAMETER_GOLF_APPROACH.md) +> **Related**: [APPROACH_INITIAL.md](APPROACH_INITIAL.md) This document provides tactical implementation details for the Q² Parameter Golf strategy. diff --git a/docs/parameter-golf.md b/docs/parameter-golf/STRATEGY.md similarity index 100% rename from docs/parameter-golf.md rename to docs/parameter-golf/STRATEGY.md diff --git a/docs/wildberger-rubine-review.md b/docs/parameter-golf/WILDBERGER_RUBINE_REVIEW.md similarity index 100% rename from docs/wildberger-rubine-review.md rename to docs/parameter-golf/WILDBERGER_RUBINE_REVIEW.md diff --git a/scripts/q2_pack.py b/scripts/q2_pack.py index 5d28971..ecb2e10 100644 --- a/scripts/q2_pack.py +++ b/scripts/q2_pack.py @@ -380,7 +380,7 @@ def unpack_state_dict( return result -# ── LIV cache-line packing (§5.5 of PARAMETER_GOLF.md) ────────────────────── +# ── LIV cache-line packing (§5.5 of docs/parameter-golf/ANALYSIS.md) ────────────────────── # # LIV (Liquid Integrated Vision/Language) symbols use 5-bit quantisation # (int5, 32 levels). A 64-bit word can hold: diff --git a/scripts/train_q2_ltc.py b/scripts/train_q2_ltc.py index e0b68ee..088254e 100644 --- a/scripts/train_q2_ltc.py +++ b/scripts/train_q2_ltc.py @@ -3,7 +3,7 @@ train_q2_ltc.py — Q²-QAT Hybrid LTC-Transformer for OpenAI Parameter Golf. Architecture: [GQA, CfC, CfC, CfC] × 4 = 16 layers (Geode-derived, see §4.5 -of PARAMETER_GOLF.md). The layer layout is derived from the Geode factorization +of docs/parameter-golf/ANALYSIS.md). The layer layout is derived from the Geode factorization S(x) - 1 = S1·G where S1=4x gives 4 GQA (coarse) blocks and G=1/(1-3x) gives 3 CfC (refinement) blocks per GQA block. @@ -515,7 +515,7 @@ def token_stream( When byte_tokens=True the .bin shards are read as raw uint8 bytes; each byte is directly used as a token (vocab size 256, no tokeniser encoder). This skips the SentencePiece encode step entirely (see §5.5 of - PARAMETER_GOLF.md). 
The data_path should point to a directory of raw + docs/parameter-golf/ANALYSIS.md). The data_path should point to a directory of raw text .bin files (UTF-8 or binary). """ import numpy as np From 655a8fd8c8c99379912e8087965ad76db9d9751a Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 21 Mar 2026 22:24:42 +0000 Subject: [PATCH 3/4] Add synthesized APPROACH.md, DESIGN.md, and code.py for parameter golf Co-authored-by: devlux76 <86517969+devlux76@users.noreply.github.com> Agent-Logs-Url: https://github.com/devlux76/q2/sessions/bd82c93d-89a9-49d4-80ee-b3f311e2d20c --- docs/parameter-golf/APPROACH.md | 396 +++++++++++++++ docs/parameter-golf/DESIGN.md | 506 +++++++++++++++++++ docs/parameter-golf/code.py | 865 ++++++++++++++++++++++++++++++++ 3 files changed, 1767 insertions(+) create mode 100644 docs/parameter-golf/APPROACH.md create mode 100644 docs/parameter-golf/DESIGN.md create mode 100644 docs/parameter-golf/code.py diff --git a/docs/parameter-golf/APPROACH.md b/docs/parameter-golf/APPROACH.md new file mode 100644 index 0000000..aa9645c --- /dev/null +++ b/docs/parameter-golf/APPROACH.md @@ -0,0 +1,396 @@ +# Parameter Golf: Unified Approach + +> **Status**: Synthesis of all prior analyses — the definitive strategy +> **Source documents**: [ANALYSIS.md](ANALYSIS.md) · [APPROACH_INITIAL.md](APPROACH_INITIAL.md) · [APPROACH_REVISED.md](APPROACH_REVISED.md) · [STRATEGY.md](STRATEGY.md) · [IMPLEMENTATION.md](IMPLEMENTATION.md) · [WILDBERGER_RUBINE_REVIEW.md](WILDBERGER_RUBINE_REVIEW.md) +> **Related**: [DESIGN.md](DESIGN.md) · [code.py](code.py) + +--- + +## 0 Starting from constraints + +Every correct solution starts from what is **known and fixed**, not from what +others did. The constraints determine the solution; the solution does not choose +its constraints. 
+ +### 0.1 Hard constraints + +| Constraint | Value | Bits | +|:-----------|:------|:-----| +| Artifact size | 16,000,000 bytes | 128,000,000 bits | +| Training wall-clock | 600 seconds | — | +| Hardware | 8 × H100 SXM | — | +| Metric | val\_bpb on FineWeb (tokenizer-agnostic) | lower is better | + +### 0.2 Hardware knowns (H100 SXM) + +| Resource | Per GPU | 8 × GPU | +|:---------|:--------|:--------| +| BF16 tensor-core FLOPS | 989 TFLOPS | 7,912 TFLOPS | +| L2 cache | 50 MB | 400 MB total | +| HBM3 bandwidth | 3.35 TB/s | 26.8 TB/s | +| HBM3 capacity | 80 GB | 640 GB | +| SM count | 132 | 1,056 | +| Register file per SM | 256 KB | — | +| Shared memory per SM | 228 KB | — | +| Cache line | 128 bytes | — | +| CUDA register width | 32 bits | — | +| Max warps per SM | 64 | — | +| NVLink bandwidth | 900 GB/s | — | + +**Total available compute** in 10 minutes: + +$$T = 8 \times 989 \times 10^{12} \times 600 \approx 4.75 \times 10^{18} \text{ FLOP}$$ + +### 0.3 Williams 2025 SpaceTime bound + +Ryan Williams proved (STOC 2025, arXiv:2502.17779) that any computation +running in time $t$ can be simulated in space: + +$$S = \mathcal{O}\!\left(\sqrt{t \cdot \log t}\right)$$ + +Applied to our constraints: + +$$S_{\min} = \sqrt{4.75 \times 10^{18} \times 62} \approx 1.72 \times 10^{10} \text{ bits} \approx 2.15 \text{ GB}$$ + +Our artifact provides $1.28 \times 10^8$ bits — **0.75% of the +Williams-implied storage**. This means: + +1. We are in a **deep-compression regime** — every bit is precious. +2. Only the most structured, compressible patterns in FineWeb can be captured. +3. The model stores $\sim 3.4 \times 10^{14}$ FLOP of effective computation — the + remaining training FLOP refine weights toward the target distribution without + encoding qualitatively new structure. +4. 
**Any format that wastes bits (padding, metadata, odd-width alignment) + directly increases bpb.** + +### 0.4 The Wildberger–Geode result + +The Geode factorization (Wildberger & Rubine 2025): + +$$S - 1 = S_1 \cdot G$$ + +decomposes every non-trivial discrete structure into: + +- **$S_1$**: the coarse first-level choice (4 ways for $\mathbb{Z}_4$) +- **$G = 1/(1-3x)$**: the refinement (3 choices per subsequent step) + +This is not metaphor — it is isomorphic to: +- **DNA**: 4 bases ($\mathbb{Z}_4$), codons (triplets of 3-choice refinements) +- **Q² transition trie**: root arity 4, subsequent arity 3 +- **Progressive quantization**: coarse cell → refinement within cell + +The factorization provides the architectural template: **[coarse, refine, refine, refine] repeated**. + +### 0.5 The $\mathbb{Z}_4$ optimality + +Nature runs on $\mathbb{Z}_4$. DNA uses 4 bases: {A, C, G, T}. This is not +coincidence — it is the minimum alphabet that simultaneously preserves: + +1. **Sign** (which side of a hyperplane) +2. **Magnitude class** (near boundary or committed) +3. 
**Complement structure** (A↔T, C↔G; in Q²: $\theta(x) = x + 2 \bmod 4$) + +At 2 bits per symbol, $\mathbb{Z}_4$ quantization: +- Packs **32 weights per 64-bit register** — zero waste +- Packs **256 weights per 128-byte H100 cache line** — zero waste +- Achieves **$N = 64$ M parameters** in 16 MB — 2.8× more than int5 SOTA +- Preserves Lee metric distances via Gray encoding ($d_L = \text{popcnt}(\text{XOR})$) + +Compare to the current SOTA (int5): +- 12 weights per 64-bit register, **4 bits wasted per register** +- Across 16 MB: 1 MB of pure waste ($\approx 4$ M lost $\mathbb{Z}_4$ parameters) +- Only ~24 M effective parameters vs our 64 M + +--- + +## 1 What convergence tells us + +Four independent analyses arrived at these common conclusions: + +| Finding | Analyses agreeing | Confidence | +|:--------|:-----------------:|:----------:| +| Power-of-2 bit widths beat odd widths | ANALYSIS, APPROACH\_REVISED, Williams | High | +| Geode-guided progressive training beats flat training | ANALYSIS, STRATEGY, APPROACH\_INITIAL | High | +| CfC/LTC blocks are more parameter-efficient than attention | ANALYSIS, APPROACH\_INITIAL, STRATEGY | High | +| BigramHash tokenizer is optimal at 10k vocab | All four | High | +| Pure PyTorch on GPU, no WASM | APPROACH\_REVISED, STRATEGY | High | +| Mixed-precision: high bits for embedding, low bits for deep layers | APPROACH\_REVISED, STRATEGY | Medium | + +Where analyses **diverge**, we take the strongest position: + +| Divergence | Resolution | Rationale | +|:-----------|:-----------|:----------| +| int5/int6 vs Z₄ 2-bit | **Z₄ 2-bit** | Williams + cache alignment + 2.8× more params | +| 12 layers × 384 dim vs 16 layers × 768 dim | **16 layers × 768 dim** | Z₄ budget allows 64M params; use them | +| Standard attention vs full CfC | **Hybrid [GQA, CfC, CfC, CfC] × 4** | Geode-derived; GQA for coarse context, CfC for refinement | +| Uniform vs hierarchical Z-ring | **Uniform Z₄** | Maximizes N; Z₈/Z₁₆ only for embedding if needed | + 
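The register arithmetic behind the cache-alignment rationale can be sanity-checked in a few lines. This is a minimal sketch, assuming a 64-bit register that holds only whole weights and a 16,000,000-byte artifact; the function name `register_packing` is illustrative and does not come from the repository.

```python
def register_packing(bits_per_weight, register_bits=64, artifact_bytes=16_000_000):
    """Return (weights per register, wasted bits per register, weights in artifact).

    Assumes weights never straddle a register boundary (an assumption of this sketch).
    """
    per_register = register_bits // bits_per_weight
    wasted_bits = register_bits - per_register * bits_per_weight
    n_registers = artifact_bytes * 8 // register_bits
    return per_register, wasted_bits, per_register * n_registers


if __name__ == "__main__":
    for b in (2, 5, 6):
        per_reg, waste, total = register_packing(b)
        print(f"b={b}: {per_reg} weights/register, {waste} wasted bits, {total:,} weights")
```

Under these assumptions the 2-bit case wastes nothing, while int5 and int6 each lose 4 bits per register, which is where the 64 M versus ~24 M parameter comparison comes from.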
+--- + +## 2 The architecture + +### 2.1 Geode-derived layout: [GQA, CfC, CfC, CfC] × 4 + +From the Geode factorization $S_1 = 4x$ (coarse) and $G = 1/(1-3x)$ (refine): + +| Layer | Type | Geode role | Information gain | +|:-----:|:-----|:-----------|:-----------------| +| 1 | GQA | $S_1$ root | $\log_2 4 = 2$ bits coarse context | +| 2–4 | CfC × 3 | $G$ level 1 | $3 \times \log_2 3 \approx 4.75$ bits refinement | +| 5 | GQA | $S_1$ reset | Re-establishes coarse context | +| 6–8 | CfC × 3 | $G$ level 2 | Refinement | +| 9 | GQA | $S_1$ reset | Re-establishes coarse context | +| 10–12 | CfC × 3 | $G$ level 3 | Refinement | +| 13 | GQA | $S_1$ reset | Final coarse context | +| 14–16 | CfC × 3 | $G$ level 4 | Final refinement | + +**Total structural capacity**: $4 \times (2 + 3 \times 1.585) \approx 27$ bits — +within the 51.1-bit capacity of the full 32-symbol key. + +### 2.2 Parameter budget + +With $d = 768$, $n_{\text{kv}} = 4$ KV heads, MLP ratio 3×: + +| Component | Formula | Parameters | Storage (Z₄) | +|:----------|:--------|:----------:|:-------------:| +| Embedding (V=1024, tied) | $1024 \times 768 \times 2$ | 1.57 M | 1.57 MB (FP16) | +| 4 × GQA block | $4 \times 11.67 d^2$ | 27.5 M | 6.88 MB | +| 12 × CfC block | $12 \times 5 d^2$ | 35.4 M | 8.85 MB | +| LayerNorm (16 layers) | negligible | ~25 K | ~50 KB (FP16) | +| **Total** | | **~64.5 M** | **~17.3 MB raw** | + +After zstd-22 compression (conservative 0.85×): **~14.7 MB** — within budget +with 1.3 MB headroom. + +If too tight, reduce $d$ to 700–730 or use $V = 256$ (byte tokenization, +saving 1.2 MB on embedding). 
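The budget in the table can be reproduced with a short calculation. The sketch below is only a back-of-envelope check: it takes the per-block estimates ($11.67 d^2$ for GQA, $5 d^2$ for CfC), the ~50 KB LayerNorm figure, and the 0.85 zstd ratio directly from the text, and the function name `artifact_budget` is hypothetical.

```python
def artifact_budget(d=768, vocab=1024, n_gqa=4, n_cfc=12, zstd_ratio=0.85):
    """Estimate raw and compressed artifact size in bytes (rough sketch)."""
    gqa_params = n_gqa * 11.67 * d * d   # per-block estimate quoted in the table
    cfc_params = n_cfc * 5 * d * d       # per-block estimate quoted in the table
    z4_bytes = (gqa_params + cfc_params) / 4  # 2 bits per weight: 4 weights per byte
    embed_bytes = vocab * d * 2          # V*d entries stored as FP16
    norm_bytes = 50_000                  # "~50 KB" from the table
    raw = z4_bytes + embed_bytes + norm_bytes
    return {
        "z4_params": gqa_params + cfc_params,
        "raw_bytes": raw,
        "compressed_bytes": raw * zstd_ratio,  # assumed ratio, not measured
    }


if __name__ == "__main__":
    for vocab in (1024, 256):
        b = artifact_budget(vocab=vocab)
        print(vocab, round(b["raw_bytes"] / 1e6, 2), round(b["compressed_bytes"] / 1e6, 2))
```

Running it with `vocab=256` gives the byte-tokenization variant, which is a quick way to see how much headroom the smaller embedding buys.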
+ +### 2.3 Byte tokenization option + +At the byte level, vocabulary is always exactly 256: + +| Tokenization | Vocab | Embedding cost | Tokenizer | +|:-------------|:-----:|:--------------:|:---------:| +| SP-1024 | 1,024 | 1.57 MB (FP16) | Required | +| BigramHash 10240 | 10,240 | ~15.7 MB | Required | +| Raw bytes | 256 | 0.39 MB (FP16) | **None** | + +Byte tokenization frees ~1.2 MB vs SP-1024 ($\approx 5$ M extra Z₄ weights) +and eliminates the tokenizer encoder entirely. FineWeb bpb scoring operates on +bytes, so there is no evaluation penalty. + +--- + +## 3 The quantization + +### 3.1 Z₄ structural quantization + +All linear weight matrices $W \in \mathbb{R}^{m \times n}$ are quantized to +$\{A, B, C, D\} = \{0, 1, 2, 3\} \subset \mathbb{Z}_4$: + +$$q(w) = \begin{cases} +A & w \leq -\tau^\ast \\ +B & -\tau^\ast < w \leq 0 \\ +C & 0 < w \leq \tau^\ast \\ +D & w > \tau^\ast +\end{cases}$$ + +where $\tau^\ast = \Phi^{-1}(3/4) / \sqrt{n} \approx 0.6745 / \sqrt{n}$. + +**Gray encoding**: $g = s \oplus (s \gg 1)$ maps symbols so that +$d_{\text{Hamming}}(g_i, g_j) = d_{\text{Lee}}(s_i, s_j)$. 
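The Gray-map identity can be verified exhaustively, since $\mathbb{Z}_4$ has only 16 symbol pairs. A minimal sketch in plain Python (the helper names are illustrative, not the repository's API):

```python
def gray(s):
    """Gray-encode a Z4 symbol s in {0, 1, 2, 3}."""
    return s ^ (s >> 1)


def lee(a, b):
    """Lee distance on Z4: shortest way around the 4-cycle."""
    diff = abs(a - b)
    return min(diff, 4 - diff)


def hamming(x, y):
    """Hamming distance between two small bit patterns (popcount of XOR)."""
    return bin(x ^ y).count("1")


if __name__ == "__main__":
    for a in range(4):
        for b in range(4):
            assert hamming(gray(a), gray(b)) == lee(a, b)
    print("Hamming(gray(a), gray(b)) == Lee(a, b) for all 16 pairs")
```

The same check extends to packed words: XOR two Gray-encoded words and count set bits to obtain the summed Lee distance over all symbols in the word.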
+ +**Packing**: 4 symbols per byte, MSB-first: + +``` +byte = (g[4i] << 6) | (g[4i+1] << 4) | (g[4i+2] << 2) | g[4i+3] +``` + +### 3.2 Why Z₄ beats reconstruction quantization + +| Property | Reconstruction (GPTQ/int5) | Structural (Q²/Z₄) | +|:---------|:---------------------------|:--------------------| +| Objective | $\min \lVert W - \hat{W} \rVert_F^2$ | Preserve relational geometry | +| Bits/weight | 5–6 | **2** | +| Params in 16 MB | ~24 M | **~64 M** | +| Register waste | 4 bits/register | **0** | +| Ring structure | None | $\mathbb{Z}_4$ with Lee metric | +| Complement | None | $\theta(x) = x + 2 \bmod 4$ | +| Gray encoding | N/A | Hamming = Lee distance | + +### 3.3 Straight-through estimator for QAT + +The STE propagates gradients through quantization: + +$$\frac{\partial \mathcal{L}}{\partial W_{ij}} \approx \frac{\partial \mathcal{L}}{\partial \hat{W}_{ij}} \cdot \mathbf{1}\!\left[|W_{ij}| \leq \kappa\right]$$ + +with passthrough window $\kappa = 3\tau^\ast$. + +Threshold $\tau^\ast$ is refreshed every 1024 steps from the empirical 25th/75th +percentile of each weight row (reservoir calibration, §D-2.5). + +### 3.4 Precision allocation + +| Component | Precision | Rationale | +|:----------|:----------|:----------| +| Embedding | FP16 | Interface between tokens and continuous space; small (V=256 or 1024) | +| GQA projections (Q, K, V, O) | Z₄ (2-bit) | Coarse context; complement structure natural | +| GQA MLP (up, gate, down) | Z₄ (2-bit) | Bulk of parameters; Z₄ maximizes N | +| CfC state matrices ($A_1$, $A_2$) | Z₄ (2-bit) | Complement structure ($A_1$ decay ↔ $A_2$ integration) | +| LayerNorm γ, β | FP16 | Negligible count; critical for stability | + +--- + +## 4 The training strategy + +### 4.1 Three-phase Geode-guided training + +**Phase 1 — FP32 warm-up (60 seconds, 10% of budget)** + +Train the full model at FP32 to establish activation distributions before +imposing the Z₄ constraint. OrthoInit for GQA, Kaiming for CfC/MLP. 
+Freeze embeddings for the first 500 steps to stabilize hash collisions. + +**Phase 2 — Q²-QAT progressive quantization (360 seconds, 60% of budget)** + +Activate Z₄ quantization layer-by-layer following the Geode hierarchy: +deep layers first (they tolerate 2-bit best), then middle, then shallow. +Each layer: quantize → fine-tune → proceed. + +Enable SWA (stochastic weight averaging) from step 60%. + +**Phase 3 — Final refinement (180 seconds, 30% of budget)** + +All layers at Z₄. Cosine LR cooldown. Final SWA pass with weight decay 0.04. +Sliding-window evaluation (stride 64) to harvest lower bpb. + +### 4.2 Optimizer and schedule + +| Setting | Value | Source | +|:--------|:------|:-------| +| Optimizer | Muon (Nesterov + spectral norm) | Leaderboard SOTA | +| Learning rate | 0.01 (cosine with warmup 200 steps) | Leaderboard SOTA | +| Weight decay | 0.04 (matrices only) | Leaderboard SOTA | +| SWA | Last 40% of training | Leaderboard SOTA | +| Gradient clipping | 1.0 | Training stability | +| Sequence length | 2048 (Phase 1–2), 4096 (Phase 3) | Context scaling | +| Q² threshold refresh | Every 1024 steps | §D-2.5 | + +### 4.3 H100 optimizations + +```python +# BF16 for non-quantized operations (H100 native) +torch.set_float32_matmul_precision('high') +torch.backends.cuda.matmul.allow_tf32 = True + +# Compile for max throughput +model = torch.compile(model, mode='max-autotune') + +# FlashAttention for GQA blocks +# F.scaled_dot_product_attention uses FlashAttention-2 on H100 + +# CfC blocks: element-wise sigmoid/multiply — no FlashAttention overhead +``` + +### 4.4 Data pipeline + +8 × H100 data-parallel with gradient accumulation: + +```python +effective_batch_tokens = batch_per_gpu * seq_len * 8 * grad_accum +# Target: ~4M tokens per optimizer step +# With seq_len=2048, batch=32, grad_accum=4: 32 * 2048 * 8 * 4 ~ 2M tokens +``` + +--- + +## 5 Artifact packaging + +### 5.1 Export pipeline + +1. Select SWA-averaged checkpoint +2. 
Pack all weight matrices to Q2BN format (Gray-encoded, 4 symbols/byte) +3. Order tensors by Geode traversal (long runs → RLE-friendly for zstd) +4. Compress with zstd level 22 +5. Validate: total artifact ≤ 16,000,000 bytes + +### 5.2 Artifact structure + +``` +Header (1 KB): + Model config, quantization thresholds per layer, vocabulary + +Body (~14 MB): + Embedding (FP16, ~0.4 MB for V=256) + GQA weights (Z4 packed, ~6.9 MB) + CfC weights (Z4 packed, ~8.9 MB) + LayerNorm parameters (FP16, ~50 KB) + +Total before zstd: ~16.3 MB +After zstd-22 (~0.85×): ~13.8 MB +Headroom: ~2.2 MB +``` + +--- + +## 6 Performance projection + +### 6.1 Scaling law + +Under Chinchilla scaling ($\alpha \approx 0.34$, $A \approx 406.4$): + +$$\Delta L \approx A \cdot (N_{24M}^{-0.34} - N_{64M}^{-0.34}) \approx 0.056 \text{ nats} \approx 0.081 \text{ bpb}$$ + +### 6.2 Projected performance + +| Component | Estimated bpb gain | +|:----------|:------------------:| +| Current SOTA baseline | 1.1428 | +| Z₄ parameter scaling ($2.8\times N$) | −0.08 | +| CfC architecture efficiency | −0.02 to −0.05 | +| Geode-guided progressive training | −0.01 | +| Zero-waste cache-line alignment | −0.005 | +| **Projected total** | **~1.00 to 1.05** | + +### 6.3 Risk-adjusted estimate + +Conservative (only scaling benefit works): **1.06 bpb** +Expected (scaling + architecture): **1.03 bpb** +Optimistic (all innovations compound): **1.00 bpb** + +Any of these substantially beat the current SOTA of 1.1428 bpb. + +--- + +## 7 Execution + +### 7.1 Immediate + +1. Implement Z₄ quantizer with Gray encoding and STE +2. Implement CfC block and GQA block (all projections use Q2Linear) +3. Assemble 16-layer Geode model +4. Single-GPU smoke test (200 steps) + +### 7.2 This week + +1. 8×H100 full training run (10 minutes) +2. Validate compressed artifact size +3. First bpb measurement on FineWeb validation + +### 7.3 Iterate + +1. Tune $d$, $V$, sequence length, LR, weight decay +2. 
Ablate: CfC vs attention, byte vs SP-1024, progressive vs flat +3. Target reproducibility: 5+ runs within σ < 0.005 bpb + +--- + +## References + +- Williams, R. (2025). *Simulating Time With Square-Root Space*. STOC 2025. arXiv:2502.17779. +- Wildberger, N. J. & Rubine, D. (2025). *A Hyper-Catalan Series Solution to Polynomial Equations, and the Geode*. Amer. Math. Monthly 132:5, 383–402. +- Hammons, A. R. et al. (1994). *The $\mathbb{Z}_4$-linearity of Kerdock, Preparata, Goethals, and related codes*. IEEE Trans. Inform. Theory 40:2, 301–319. +- Hasani, R. et al. (2021). *Liquid Time-constant Networks*. AAAI-2021. +- Hasani, R. et al. (2022). *Closed-form Continuous-time Neural Networks*. Nature Machine Intelligence 4, 992–1003. +- Ma, S. et al. (2024). *The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits*. arXiv:2402.17764. +- OpenAI. *Parameter Golf*. https://openai.com/index/parameter-golf/ diff --git a/docs/parameter-golf/DESIGN.md b/docs/parameter-golf/DESIGN.md new file mode 100644 index 0000000..cb48d57 --- /dev/null +++ b/docs/parameter-golf/DESIGN.md @@ -0,0 +1,506 @@ +# Parameter Golf: Unified Design + +> **Status**: Synthesized design from all prior analyses +> **Companion**: [APPROACH.md](APPROACH.md) · [code.py](code.py) +> **Mathematical foundations**: [DESIGN.md](../../DESIGN.md) (§D-x.y) · [WILDBERGER\_RUBINE\_REVIEW.md](WILDBERGER_RUBINE_REVIEW.md) + +--- + +## Contents + +1. [Thermodynamic Bounds](#1-thermodynamic-bounds) +2. [The Z₄ Quantization Kernel](#2-the-z4-quantization-kernel) +3. [Geode Architecture](#3-geode-architecture) +4. [CfC Blocks: Closed-Form Continuous-Time](#4-cfc-blocks-closed-form-continuous-time) +5. [GQA Blocks: Grouped Query Attention](#5-gqa-blocks-grouped-query-attention) +6. [Training Dynamics](#6-training-dynamics) +7. [Cache-Line and Register Geometry](#7-cache-line-and-register-geometry) +8. [Compression and Artifact Packing](#8-compression-and-artifact-packing) +9. 
[The DNA Isomorphism](#9-the-dna-isomorphism) + +--- + +## 1 Thermodynamic Bounds + +### 1.1 The information budget + +The artifact has $B = 128{,}000{,}000$ bits. The training run produces +$T \approx 4.75 \times 10^{18}$ FLOP. By the Williams 2025 bound: + +$$S_{\min} = \mathcal{O}\!\left(\sqrt{T \cdot \log_2 T}\right) \approx 1.72 \times 10^{10} \text{ bits}$$ + +We have $B / S_{\min} \approx 0.0075$ — less than 1% of the +information-theoretically implied storage. The model cannot faithfully encode +all structure discovered during training. It must **compress ruthlessly**. + +**Design consequence**: every bit in the artifact must carry maximum +information. No padding, no odd-width alignment waste, no metadata overhead +that could be absorbed into the weight stream. + +### 1.2 Inverting Williams: what can 16 MB encode? + +$$B^2 \approx T_{\text{eff}} \cdot \log_2 T_{\text{eff}} \implies T_{\text{eff}} \approx 3.4 \times 10^{14} \text{ FLOP}$$ + +A 16 MB model encodes the structure of $\sim 3.4 \times 10^{14}$ FLOP — about +0.007% of the training budget. The remaining FLOP push stored structure toward +the FineWeb distribution without expanding capacity. + +### 1.3 Optimal bit width from first principles + +The question: what integer bit width $b$ maximizes $N = B / b$ (parameter +count) while achieving zero register/cache-line waste? 
+ +| $b$ | $N$ in 16 MB | Waste per 64-bit register | Ring structure | Verdict | +|:---:|:------------:|:-------------------------:|:--------------:|:--------| +| 1 | 128 M | 0 | $\mathbb{Z}_2$ (no complement) | Too coarse | +| **2** | **64 M** | **0** | **$\mathbb{Z}_4$ (full complement)** | **Optimal** | +| 4 | 32 M | 0 | $\mathbb{Z}_{16}$ | Viable fallback | +| 5 | ~24 M | 4 bits | None | Suboptimal | +| 6 | ~20 M | 4 bits | None | Suboptimal | +| 8 | 16 M | 0 | $\mathbb{Z}_{256}$ | Low capacity | + +Among zero-waste widths whose ring has a complement involution, $b = 2$ gives +the largest $N$, and its Lee metric is preserved by Gray encoding. + +--- + +## 2 The Z₄ Quantization Kernel + +### 2.1 The four cells + +For a weight $w$ with per-row threshold $\tau^\ast$: + +$$q(w) = \begin{cases} +A = 0 & w \leq -\tau^\ast & \text{strong negative (committed)} \\ +B = 1 & -\tau^\ast < w \leq 0 & \text{weak negative (boundary)} \\ +C = 2 & 0 < w \leq \tau^\ast & \text{weak positive (boundary)} \\ +D = 3 & w > \tau^\ast & \text{strong positive (committed)} +\end{cases}$$ + +The threshold for Gaussian weights: + +$$\tau^\ast = \frac{\Phi^{-1}(3/4)}{\sqrt{n}} \approx \frac{0.6745}{\sqrt{n}}$$ + +ensures equiprobable cells ($P(A) = P(B) = P(C) = P(D) = 1/4$), maximizing +entropy at $I = 2$ bits per dimension. + +For non-Gaussian distributions (heavy-tailed activations, mixture models), the +threshold can alternatively be computed via the hyper-Catalan series +(Wildberger & Rubine 2025), a combinatorial closed form that converges +without iteration: + +$$\alpha = \sum_\mathbf{m} C_\mathbf{m} \cdot t_2^{m_2} t_3^{m_3} \cdots$$ + +Truncation order trades precision for compute cost, a natural fit for the +resource-constrained setting. 
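The equiprobable-cell claim for the Gaussian threshold can be checked empirically. The sketch below assumes weights drawn i.i.d. from $\mathcal{N}(0, 1/n)$, which is the distribution under which $\tau^\ast = \Phi^{-1}(3/4)/\sqrt{n}$ splits the mass into quarters; `z4_threshold` and `quantize` are illustrative names, not the repository's functions.

```python
import random
from statistics import NormalDist


def z4_threshold(n):
    """tau* = Phi^{-1}(3/4) / sqrt(n) for weights with standard deviation 1/sqrt(n)."""
    return NormalDist().inv_cdf(0.75) / n ** 0.5


def quantize(w, tau):
    """Map a real weight to a Z4 symbol: A=0, B=1, C=2, D=3."""
    if w <= -tau:
        return 0
    if w <= 0:
        return 1
    if w <= tau:
        return 2
    return 3


def cell_fractions(n=768, samples=100_000, seed=0):
    """Empirical fraction of Gaussian weights landing in each of the four cells."""
    rng = random.Random(seed)
    tau = z4_threshold(n)
    counts = [0, 0, 0, 0]
    for _ in range(samples):
        counts[quantize(rng.gauss(0.0, 1 / n ** 0.5), tau)] += 1
    return [c / samples for c in counts]


if __name__ == "__main__":
    print(z4_threshold(768), cell_fractions())
```

For trained weights the Gaussian assumption may not hold, which is why the text refreshes the threshold from empirical quartiles during training.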
+ +### 2.2 Gray encoding + +The Gray map $\phi: \mathbb{Z}_4 \to \mathbb{F}_2^2$: + +$$g = s \oplus (s \gg 1)$$ + +| Symbol | Value | Gray code | +|:------:|:-----:|:---------:| +| A | 0 | 00 | +| B | 1 | 01 | +| C | 2 | 11 | +| D | 3 | 10 | + +**Key property** (Hammons et al. 1994): Hamming distance on Gray codes equals +Lee distance on $\mathbb{Z}_4$ symbols: + +$$d_{\text{Ham}}(\phi(u), \phi(v)) = d_{\text{Lee}}(u, v) = \sum_{i=1}^{n} \min(|u_i - v_i|, 4 - |u_i - v_i|)$$ + +This means Lee distance is computable via `popcnt(XOR)` — a single hardware +instruction on H100. + +### 2.3 Complement involution + +$$\theta(x) = x + 2 \pmod{4}: \quad A \leftrightarrow C, \quad B \leftrightarrow D$$ + +Properties: +- $\theta^2 = \text{id}$ (involution) +- $d_L(x, \theta(x)) = 2$ (maximum Lee distance) +- Encodes structural opposition (strong-negative ↔ weak-positive) + +**Design role**: The complement constraint $\theta(W_{ij}) \neq W_{ij}$ for all +weights prevents redundant weight pairs, enforcing orthogonality at the symbolic +level. This acts as a **regularizer** during QAT. + +### 2.4 Dequantization map + +For the forward pass, symbols map to reconstruction centroids: + +$$\hat{w}(s) = \{-1.5\tau, -0.5\tau, +0.5\tau, +1.5\tau\}[s]$$ + +The spacing is uniform in $\tau$-units. For non-Gaussian distributions, optimal +reconstruction uses the conditional expectation $\mathbb{E}[w \mid q(w) = s]$, +computable via hyper-Catalan series reversion (§5 of Wildberger-Rubine review). + +### 2.5 Packing + +Four symbols per byte, MSB-first: + +``` +byte = (g[4i] << 6) | (g[4i+1] << 4) | (g[4i+2] << 2) | g[4i+3] +``` + +32 weights per 64-bit register. 256 weights per 128-byte H100 cache line. 
+**Zero waste at every alignment boundary.**
+
+---
+
+## 3 Geode Architecture
+
+### 3.1 The factorization
+
+The Geode factorization of Q²'s transition sequences:
+
+$$S(x) - 1 = \underbrace{4x}_{S_1} \cdot \underbrace{\frac{1}{1-3x}}_{G}$$
+
+- $S_1 = 4x$: first symbol → 4 choices → **GQA block** (coarse context)
+- $G = 1 + 3x + 9x^2 + \cdots$: each subsequent symbol → 3 choices → **CfC block** (refinement)
+
+### 3.2 Layer layout
+
+$$\underbrace{[\text{GQA},\ \text{CfC},\ \text{CfC},\ \text{CfC}]}_{\text{one Geode level}} \times 4 = 16 \text{ layers}$$
+
+4 GQA + 12 CfC, ratio 3:1 (CfC:GQA). More CfC-heavy than LFM 2.5's
+empirical 10:6 = 1.67:1 — predicted by the Geode for short-context (2048-token)
+workloads where less attention is needed.
+
+### 3.3 Information flow
+
+Each GQA block contributes $\log_2 4 = 2$ bits and each CfC block
+$\log_2 3 \approx 1.585$ bits:
+
+| Depth | Layer type | Cumulative bits |
+|:-----:|:-----------|:---------------:|
+| 1 | GQA | 2.00 |
+| 2–4 | CfC × 3 | 6.75 |
+| 5 | GQA | 8.75 |
+| 6–8 | CfC × 3 | 13.51 |
+| 9 | GQA | 15.51 |
+| 10–12 | CfC × 3 | 20.26 |
+| 13 | GQA | 22.26 |
+| 14–16 | CfC × 3 | 27.02 |
+
+About 27 bits of structural information, within the 51.1-bit capacity of a full
+32-symbol transition key (§D-3.6).
+
+### 3.4 Euler polytope constraint
+
+The hyper-Catalan coefficient governs admissible quantization lattices:
+
+$$C_\mathbf{m} = \frac{(E-1)!}{(V-1)! \cdot \mathbf{m}!}, \quad V - E + F = 1$$
+
+For $\mathbb{Z}_4$: $V = 4$, $E = 4$, $F = 1$ → $4 - 4 + 1 = 1$ ✓
+
+This constrains the topology: we cannot add layers or heads arbitrarily.
+Each architectural modification must preserve $V - E + F = \text{const}$,
+which caps parameter growth and keeps the artifact under budget.
+
+---
+
+## 4 CfC Blocks: Closed-Form Continuous-Time
+
+### 4.1 The LTC ODE
+
+$$\dot{h}(t) = -\left[\frac{1}{\tau_c} + f(h, x; \theta)\right] h(t) + f(h, x; \theta)$$
+
+### 4.2 Closed-form solution (Hasani et al. 2022)
+
+$$h(t + \Delta t) = e^{-A_1 \Delta t} \odot h(t) + \frac{A_2}{A_1} \odot \left(1 - e^{-A_1 \Delta t}\right)$$
+
+where $A_1, A_2$ are learned functions of $(x, h)$.
+
+### 4.3 Parameter count
+
+Per CfC layer with hidden dimension $d$:
+- $A_1$ projection: $2d^2$ parameters (input + recurrent)
+- $A_2$ projection: $2d^2$ parameters
+- Output projection: $d^2$ parameters
+- **Total: $5d^2$ per CfC block**
+
+Compare to GQA: $\approx 11.67d^2$ per block. CfC is **2.3× more parameter-efficient**.
+
+### 4.4 Q² synergy
+
+CfC state updates use sigmoid gates that saturate at 0 and 1. Near
+saturation, exact weight values matter less than **sign and magnitude class** —
+precisely what Z₄ preserves.
+
+The two matrices $A_1$ (decay) and $A_2$ (integration) have a natural
+**complement relationship**: strong-decay and strong-integration are complements
+in the same way that $A$ and $C$ are complements in $\mathbb{Z}_4$.
+
+### 4.5 Implementation
+
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+# Q2Linear is the Z₄-quantized linear layer defined in code.py.
+
+class CfCBlock(nn.Module):
+    """One Geode G-level: 3-way refinement via closed-form LTC."""
+
+    def __init__(self, d_model, n_time_constants=5):
+        super().__init__()
+        self.a1_proj = Q2Linear(d_model, d_model)  # Decay
+        self.a2_proj = Q2Linear(d_model, d_model)  # Integration
+        self.out_proj = Q2Linear(d_model, d_model)
+        self.tau = nn.Parameter(torch.randn(n_time_constants))
+        self.ln = nn.LayerNorm(d_model)
+        # SwiGLU MLP
+        self.mlp_up = Q2Linear(d_model, d_model * 3)
+        self.mlp_gate = Q2Linear(d_model, d_model * 3)
+        self.mlp_down = Q2Linear(d_model * 3, d_model)
+        self.ln2 = nn.LayerNorm(d_model)
+
+    def forward(self, x, h):
+        # CfC state update
+        x_norm = self.ln(x)
+        a1 = torch.sigmoid(self.a1_proj(x_norm))
+        a2 = torch.sigmoid(self.a2_proj(x_norm))
+        # Tile the time constants to d_model so they broadcast against a1
+        d = a1.shape[-1]
+        tau_c = torch.sigmoid(self.tau).repeat(-(-d // self.tau.numel()))[:d]
+        decay = torch.exp(-a1 * tau_c)
+        h_new = decay * h + (a2 / a1) * (1 - decay)
+        x = x + self.out_proj(h_new)
+        # SwiGLU MLP (normalize once, reuse for gate and up projections)
+        x_mlp = self.ln2(x)
+        x = x + self.mlp_down(F.silu(self.mlp_gate(x_mlp)) * self.mlp_up(x_mlp))
+        return x, h_new
+```
+
+---
+
+## 5 GQA Blocks: Grouped Query Attention
+
+### 5.1 Role in Geode architecture
+
+GQA blocks are the **$S_1$ coarse selectors** — they attend across the full
+sequence to establish broad context structure (equivalent to selecting one of
+4 block files in the transition key, §D-3.4).
+
+### 5.2 Implementation
+
+Standard Grouped Query Attention with:
+- $n_h$ query heads, $n_{\text{kv}}$ key-value heads ($n_h / n_{\text{kv}}$ groups)
+- All projections (Q, K, V, O) are `Q2Linear` (Z₄ quantized)
+- SwiGLU MLP with 3× expansion, all `Q2Linear`
+- Uses `F.scaled_dot_product_attention` → FlashAttention-2 kernel on H100
+
+```python
+class GQABlock(nn.Module):
+    """One Geode S1-level: 4-way coarse selection via grouped query attention."""
+
+    def __init__(self, d_model, n_heads=8, n_kv_heads=4):
+        super().__init__()
+        self.n_heads = n_heads
+        self.n_kv_heads = n_kv_heads
+        self.head_dim = d_model // n_heads
+        self.q_proj = Q2Linear(d_model, d_model)
+        self.k_proj = Q2Linear(d_model, self.head_dim * n_kv_heads)
+        self.v_proj = Q2Linear(d_model, self.head_dim * n_kv_heads)
+        self.o_proj = Q2Linear(d_model, d_model)
+        self.ln1 = nn.LayerNorm(d_model)
+        # SwiGLU MLP
+        self.mlp_up = Q2Linear(d_model, d_model * 3)
+        self.mlp_gate = Q2Linear(d_model, d_model * 3)
+        self.mlp_down = Q2Linear(d_model * 3, d_model)
+        self.ln2 = nn.LayerNorm(d_model)
+
+    def forward(self, x):
+        h = self.ln1(x)
+        q = self.q_proj(h).view(*h.shape[:-1], self.n_heads, self.head_dim)
+        k = self.k_proj(h).view(*h.shape[:-1], self.n_kv_heads, self.head_dim)
+        v = self.v_proj(h).view(*h.shape[:-1], self.n_kv_heads, self.head_dim)
+        # GQA: repeat KV heads
+        k = k.repeat_interleave(self.n_heads // self.n_kv_heads, dim=-2)
+        v = v.repeat_interleave(self.n_heads // self.n_kv_heads, dim=-2)
+        # Transpose for attention: (B, H, T, D)
+        q, k, v = [t.transpose(-3, -2) for t in (q, k, v)]
+        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
+        attn = attn.transpose(-3, -2).contiguous().view(*x.shape)
+        x = x + self.o_proj(attn)
+        # SwiGLU MLP
+        h2 = self.ln2(x)
+        x = x + self.mlp_down(F.silu(self.mlp_gate(h2)) * self.mlp_up(h2))
+        return x
+```
+
+---
+
+## 6 Training Dynamics
+
+### 6.1 QAT via STE
+
+During training, each `Q2Linear` layer:
+1. Quantizes weights to $\{A, B, C, D\}$
+2. Dequantizes to reconstruction centroids for the forward pass
+3. Passes gradients through via STE (straight-through estimator)
+4. Updates full-precision shadow weights with the optimizer
+
+The FP32 warm-up phase (10% of training) establishes activation distributions
+before imposing the Z₄ constraint. This follows the BitNet finding (Ma et al.
+2024) that QAT-from-scratch requires a brief float-precision warm-up.
+
+### 6.2 Progressive quantization (Geode-guided)
+
+Layers are quantized in Geode order: deep CfC layers first (most tolerant of
+low precision), then middle layers, then GQA layers, then the
+embedding-adjacent layers.
+
+This matches the Geode's hierarchical decomposition: coarse structure ($S_1$)
+is established first, then refinement ($G$) is progressively constrained.
+
+### 6.3 Muon optimizer
+
+Nesterov momentum with per-matrix spectral normalization:
+- Prevents large weight moves from disrupting Q² complement structure
+- Higher LR (0.01) than Adam due to Nesterov momentum
+- Weight decay 0.04 on matrices only
+
+### 6.4 Stochastic weight averaging
+
+SWA is activated from 60% of training onward. The averaged model tends to sit
+in a flatter region of the loss landscape, which is more amenable to 2-bit
+quantization: flat minima tolerate quantization error better than sharp minima.
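As a rough illustration of the averaging in §6.4, the sketch below folds successive weight snapshots into a running mean with the incremental update `avg += (w - avg) / n`. The helper name `swa_update` and the plain-list weights are assumptions made for illustration only; the training script applies the same update to the tensors of a state dict.

```python
def swa_update(avg, weights, n):
    """Fold the n-th weight snapshot into the running mean (n counts from 1)."""
    if avg is None:
        # First snapshot: the mean is the snapshot itself.
        return list(weights)
    return [a + (w - a) / n for a, w in zip(avg, weights)]


# Three hypothetical snapshots of a two-parameter model.
snapshots = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]

avg = None
for n, snap in enumerate(snapshots, start=1):
    avg = swa_update(avg, snap, n)

print(avg)  # element-wise mean of the three snapshots
```

The incremental form needs only one extra copy of the weights, whatever the number of snapshots.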
+
+---
+
+## 7 Cache-Line and Register Geometry
+
+### 7.1 H100 memory hierarchy
+
+| Level | Size | Access time | Alignment |
+|:------|:-----|:------------|:----------|
+| Register file (per SM) | 256 KB | 1 cycle | 32-bit |
+| L1/shared memory (per SM) | 228 KB | ~28 cycles | 128-byte |
+| L2 cache (per GPU) | 50 MB | ~200 cycles | 128-byte |
+| HBM3 | 80 GB | ~400 cycles | 128-byte |
+
+### 7.2 Z₄ alignment at every level
+
+| Alignment boundary | Size | Z₄ weights fitting | Waste |
+|:-------------------|:-----|:-------------------:|:-----:|
+| 32-bit register | 4 B | 16 | 0 |
+| 64-bit double-word | 8 B | 32 | 0 |
+| 128-byte cache line | 128 B | 512 | 0 |
+| 256-byte aligned block | 256 B | 1024 | 0 |
+
+Z₄ achieves **perfect alignment at every level** of the H100 memory hierarchy.
+int5 fits only 12 weights (60 bits) in a 64-bit word, wasting 4 bits per word;
+across a 16 MB artifact that accumulates to 1 MB (6.25%) of waste.
+
+### 7.3 Tensor dimension constraints
+
+To ensure perfect cache-line alignment, all tensor dimensions must be
+divisible by 512 (weights per cache line) or at minimum 32 (weights per
+64-bit word). With $d = 768$:
+- $768 = 32 \times 24$ ✓ (register-aligned)
+- $768 \times 3 = 2304 = 32 \times 72$ ✓ (MLP expansion)
+
+Neither 768 nor 2304 is a multiple of 512, so this choice meets only the
+weaker 32-weight condition.
+
+### 7.4 LIV cache-line packing (optional)
+
+For post-training int5 export (LFM 2.5 compatibility):
+
+12 LIV symbols × 5 bits + 2-bit Q² tag + 2 unused = 64 bits exactly.
+
+The Q² tag partitions packed words into 4 groups for parallel SM dispatch.
+The top 10 × 5 = 50 bits form two 5 × 5 binary matrices whose Boolean
+product serves as a codon checksum — verifiable in $O(25)$ bitwise ops.
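The packing arithmetic in §7.2 and §7.4 can be checked with a few lines of integer math. This is a minimal sketch: the helper `packing` and the `BOUNDARIES` table are hypothetical names introduced here, and the 16,000,000-byte budget is the artifact limit quoted elsewhere in this document.

```python
def packing(bits_per_weight, boundary_bits):
    """Return (weights that fit, wasted bits) for one aligned unit."""
    fit = boundary_bits // bits_per_weight
    return fit, boundary_bits - fit * bits_per_weight


# Alignment boundaries from the table in section 7.2, expressed in bits.
BOUNDARIES = {
    "32-bit register": 32,
    "64-bit double-word": 64,
    "128-byte cache line": 128 * 8,
    "256-byte aligned block": 256 * 8,
}

# Z4 uses 2 bits per weight, so every boundary divides evenly.
z4 = {name: packing(2, bits) for name, bits in BOUNDARIES.items()}

# int5 in a 64-bit word: 12 weights fit and 4 bits are left over.
int5_fit, int5_waste_bits = packing(5, 64)

# Waste accumulated over a 16,000,000-byte artifact of 64-bit words.
words_in_budget = 16_000_000 * 8 // 64
int5_waste_bytes = words_in_budget * int5_waste_bits // 8

# Section 7.4 layout: 12 five-bit symbols + 2-bit tag + 2 unused bits.
liv_word_bits = 12 * 5 + 2 + 2

for name, (fit, waste) in z4.items():
    print(f"{name}: {fit} Z4 weights, {waste} wasted bits")
print(f"int5 per 64-bit word: {int5_fit} weights, {int5_waste_bits} wasted bits")
print(f"int5 waste over the budget: {int5_waste_bytes} bytes")
print(f"LIV word: {liv_word_bits} bits")
```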
+
+---
+
+## 8 Compression and Artifact Packing
+
+### 8.1 Q2BN binary format
+
+The Q2BN format stores quantized weights:
+
+```
+[4-byte magic: "Q2BN"]
+[4-byte version]
+[4-byte tensor count]
+For each tensor:
+  [4-byte name length][name bytes]
+  [4-byte ndim][4-byte × ndim shape]
+  [4-byte dtype: 0=Q2, 1=FP16, 2=FP32]
+  [packed weight bytes]
+```
+
+### 8.2 Geode-ordered serialization
+
+Tensors are serialized in Geode traversal order:
+1. GQA block 1 weights (all projections)
+2. CfC blocks 2–4 weights
+3. GQA block 5 weights
+4. CfC blocks 6–8 weights
+5. ... (repeat pattern)
+
+This ordering groups structurally similar weights together, producing long
+runs of similar byte patterns that zstd exploits for higher compression.
+
+### 8.3 Compression pipeline
+
+```python
+from pathlib import Path
+
+import zstandard
+
+# 1. Pack weights to Q2BN (q2_pack is scripts/q2_pack.py; model is the trained module)
+q2_pack.pack_state_dict(model.state_dict(), 'model.q2bin')
+
+# 2. Compress with zstd level 22
+cctx = zstandard.ZstdCompressor(level=22)
+compressed = cctx.compress(Path('model.q2bin').read_bytes())
+
+# 3. Validate against the 16,000,000-byte artifact budget
+assert len(compressed) <= 16_000_000
+```
+
+---
+
+## 9 The DNA Isomorphism
+
+### 9.1 Nature's billion-year head start
+
+The choice of $\mathbb{Z}_4$ is not arbitrary. DNA uses four bases:
+
+| DNA base | Q² symbol | Binary | DNA complement | Q² complement |
+|:---:|:--:|:------:|:----------:|:----------:|
+| A (Adenine) | A (strong −) | 00 | T | C |
+| C (Cytosine) | B (weak −) | 01 | G | D |
+| T (Thymine) | C (weak +) | 11 | A | A |
+| G (Guanine) | D (strong +) | 10 | C | B |
+
+With T assigned to symbol C and G to symbol D, the complement pairing
+(A↔T, C↔G in DNA; A↔C, B↔D in Q²) is the same involution
+$\theta(x) = x + 2 \bmod 4$.
+
+### 9.2 Codons as Geode levels
+
+DNA codons are triplets of bases: $4^3 = 64$ possible codons encoding 20
+amino acids. This is the Geode's 3-way refinement at each level:
+
+$$G = 1 + 3x + 9x^2 + 27x^3 + \cdots$$
+
+At depth 3: $4 \times 3^2 = 36$ distinct run-reduced sequences — close to
+the 20 amino acids when accounting for redundancy (the "wobble" in the third
+codon position).
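The count quoted in §9.2 can be reproduced by brute force. The sketch below assumes that "run-reduced" means no two adjacent symbols are equal, which is the reading consistent with $S(x) - 1 = 4x / (1 - 3x)$; the function name `run_reduced` is a hypothetical helper, not part of the repository.

```python
from itertools import product


def run_reduced(length, alphabet="ABCD"):
    """Enumerate sequences of the given length with no repeated adjacent symbol."""
    return [
        "".join(seq)
        for seq in product(alphabet, repeat=length)
        if all(a != b for a, b in zip(seq, seq[1:]))
    ]


# Counts for lengths 1..3 should follow 4 * 3**(n - 1).
counts = [len(run_reduced(n)) for n in (1, 2, 3)]
expected = [4 * 3 ** (n - 1) for n in (1, 2, 3)]

print(counts)
print(counts == expected)
print(4 ** 3)  # all codon-length triplets, for comparison
```

Under that reading, the first symbol has 4 choices and each later symbol has 3, matching the $S_1$ and $G$ factors of the factorization.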
+
+### 9.3 What this means for parameter golf
+
+Nature evolved $\mathbb{Z}_4$ as the optimal encoding for information in a
+thermodynamically constrained environment. The parameter golf challenge
+presents the same problem: encode maximum information (language structure)
+in minimum space (16 MB) under fixed compute (10 minutes × 8 × H100).
+
+The isomorphism is not metaphor — it is structural. The same mathematics
+(Gray encoding, Lee metric, complement involution, Geode factorization) that
+describes DNA coding theory describes our weight quantization scheme.
+
+We are not borrowing a biological metaphor. We are recognizing that both
+problems — storing heritable information in nucleotides and storing linguistic
+structure in quantized weights — are instances of the same $\mathbb{Z}_4$
+optimization under resource constraints.
+
+---
+
+## References
+
+- Williams, R. (2025). *Simulating Time With Square-Root Space*. Proc. STOC 2025. arXiv:2502.17779.
+- Wildberger, N. J. & Rubine, D. (2025). *A Hyper-Catalan Series Solution to Polynomial Equations, and the Geode*. Amer. Math. Monthly 132:5, 383–402.
+- Hammons, A. R. et al. (1994). *The $\mathbb{Z}_4$-linearity of Kerdock, Preparata, Goethals, and related codes*. IEEE Trans. Inform. Theory 40:2, 301–319.
+- Hasani, R. et al. (2021). *Liquid Time-constant Networks*. AAAI-2021. arXiv:2006.04439.
+- Hasani, R. et al. (2022). *Closed-form Continuous-time Neural Networks*. Nature Machine Intelligence 4, 992–1003. arXiv:2106.13898.
+- Ma, S. et al. (2024). *The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits*. arXiv:2402.17764.
+- Liquid AI. *LFM 2.5 Technical Report* (2025). https://www.liquid.ai/research/lfm-2-5
+- OpenAI. *Parameter Golf*. https://openai.com/index/parameter-golf/
diff --git a/docs/parameter-golf/code.py b/docs/parameter-golf/code.py
new file mode 100644
index 0000000..2ea086a
--- /dev/null
+++ b/docs/parameter-golf/code.py
@@ -0,0 +1,865 @@
+"""
+Parameter Golf: Q² Optimized Training Script
+=============================================
+
+Maximizes every bit of 128,000,000 bits and every FLOP of 8×H100 for 600s.
+
+Architecture: [GQA, CfC, CfC, CfC] × 4 = 16 layers (Geode-derived)
+Quantization: Z₄ (2-bit) structural quantization with Gray encoding
+Optimizer: Muon (Nesterov + spectral norm)
+Training: 3-phase Geode-guided (FP32 warm-up → progressive QAT → refinement)
+
+Hardware: 8 × H100 SXM (989 TFLOPS BF16, 80GB HBM3, 50MB L2, 128B cache line)
+Budget: 16,000,000 bytes artifact, 600 seconds wall-clock
+Target: < 1.05 bits/byte on FineWeb validation
+
+References:
+  - Williams 2025 (SpaceTime bound): arXiv:2502.17779
+  - Wildberger & Rubine 2025 (Geode): Amer. Math. Monthly 132:5
+  - Hammons et al. 1994 (Z₄ Gray map): IEEE Trans. IT 40:2
+  - Hasani et al. 2022 (CfC): Nature Machine Intelligence 4
+  - Ma et al. 2024 (BitNet 1.58): arXiv:2402.17764
+
+See: docs/parameter-golf/APPROACH.md, docs/parameter-golf/DESIGN.md
+"""
+
+from __future__ import annotations
+
+import math
+import os
+import struct
+import time
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Optional
+
+import torch
+import torch.distributed as dist
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.nn.parallel import DistributedDataParallel as DDP
+
+# ── Hardware constants (H100 SXM) ──────────────────────────────────────────
+
+CACHE_LINE_BYTES = 128  # H100 L2 cache line
+REGISTER_BITS = 64  # CUDA 64-bit register (for packing math)
+Z4_WEIGHTS_PER_REGISTER = 32  # 64 / 2
+Z4_WEIGHTS_PER_CACHE_LINE = 512  # 128 * 8 / 2
+INV_CDF_75 = 0.6745  # Φ⁻¹(3/4) for equiprobable Z₄ thresholds
+ARTIFACT_BUDGET = 16_000_000  # bytes
+
+
+# ── Z₄ Quantization ────────────────────────────────────────────────────────
+
+class Q2Quantize(torch.autograd.Function):
+    """Z₄ structural quantization with straight-through estimator.
+
+    Maps weights to {A=0, B=1, C=2, D=3} using equiprobable thresholds,
+    Gray-encodes for packing, and dequantizes to centroids for forward pass.
+ """ + + @staticmethod + def forward( + ctx: torch.autograd.function.FunctionCtx, + weight: torch.Tensor, + tau: torch.Tensor, + ) -> torch.Tensor: + # Classify into 4 cells: A (strong−), B (weak−), C (weak+), D (strong+) + sym = torch.zeros_like(weight, dtype=torch.long) + sym = torch.where(weight <= -tau, torch.tensor(0, device=weight.device), sym) + sym = torch.where( + (weight > -tau) & (weight <= 0), torch.tensor(1, device=weight.device), sym + ) + sym = torch.where( + (weight > 0) & (weight <= tau), torch.tensor(2, device=weight.device), sym + ) + sym = torch.where(weight > tau, torch.tensor(3, device=weight.device), sym) + + # Dequantize: {A,B,C,D} → {-1.5τ, -0.5τ, +0.5τ, +1.5τ} + centroids = torch.tensor( + [-1.5, -0.5, 0.5, 1.5], dtype=weight.dtype, device=weight.device + ) + weight_q = centroids[sym] * tau + + # STE passthrough window: κ = 3τ* + kappa = 3.0 * tau + ctx.save_for_backward(weight, kappa) + return weight_q + + @staticmethod + def backward( + ctx: torch.autograd.function.FunctionCtx, + grad_output: torch.Tensor, + ) -> tuple[torch.Tensor, None]: + weight, kappa = ctx.saved_tensors + # Pass gradients through only within the passthrough window + mask = weight.abs() <= kappa + return grad_output * mask.float(), None + + +class Q2Linear(nn.Module): + """Linear layer with Z₄ quantization-aware training. + + During training: quantizes weights to Z₄ via STE each forward pass. + During eval: uses cached quantized weights. 
+ """ + + def __init__( + self, + in_features: int, + out_features: int, + bias: bool = False, + ): + super().__init__() + self.in_features = in_features + self.out_features = out_features + self.weight = nn.Parameter(torch.empty(out_features, in_features)) + if bias: + self.bias = nn.Parameter(torch.zeros(out_features)) + else: + self.register_parameter("bias", None) + + # Equiprobable threshold (refreshed periodically during training) + self.register_buffer( + "tau", torch.tensor(INV_CDF_75 / math.sqrt(in_features)) + ) + + # Q2 active flag (starts inactive for FP32 warm-up) + self.q2_active = False + + # Initialize: Kaiming uniform + nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5)) + + def refresh_tau(self) -> None: + """Refresh threshold from empirical weight distribution (§D-2.5).""" + with torch.no_grad(): + # Per-row 75th percentile + q75 = torch.quantile(self.weight.abs(), 0.75, dim=-1, keepdim=True) + self.tau.fill_(q75.mean().item()) + + def activate_q2(self) -> None: + """Enable Z₄ quantization (call after FP32 warm-up phase).""" + self.q2_active = True + self.refresh_tau() + + def forward(self, x: torch.Tensor) -> torch.Tensor: + if self.training and self.q2_active: + w = Q2Quantize.apply(self.weight, self.tau) + else: + w = self.weight + return F.linear(x, w, self.bias) + + +# ── CfC Block (Geode G-level: refinement) ────────────────────────────────── + +class CfCBlock(nn.Module): + """Closed-form Continuous-time block — one Geode G-level (3-way refinement). + + Runs the closed-form LTC update per token; state h propagates across the + sequence with no KV cache. All projections are Q2Linear (Z₄). 
+ """ + + def __init__(self, d_model: int, n_time_constants: int = 5, mlp_ratio: float = 3.0): + super().__init__() + mlp_dim = int(d_model * mlp_ratio) + + # CfC projections + self.a1_proj = Q2Linear(d_model, d_model) # Decay rate + self.a2_proj = Q2Linear(d_model, d_model) # Integration rate + self.out_proj = Q2Linear(d_model, d_model) + self.tau_c = nn.Parameter(torch.randn(n_time_constants)) + self.ln1 = nn.LayerNorm(d_model) + + # SwiGLU MLP + self.mlp_gate = Q2Linear(d_model, mlp_dim) + self.mlp_up = Q2Linear(d_model, mlp_dim) + self.mlp_down = Q2Linear(mlp_dim, d_model) + self.ln2 = nn.LayerNorm(d_model) + + def forward( + self, x: torch.Tensor, h: Optional[torch.Tensor] = None + ) -> tuple[torch.Tensor, torch.Tensor]: + B, T, D = x.shape + + if h is None: + h = torch.zeros(B, D, device=x.device, dtype=x.dtype) + + outputs = [] + for t in range(T): + x_t = self.ln1(x[:, t, :]) + a1 = torch.sigmoid(self.a1_proj(x_t)) + a2 = torch.sigmoid(self.a2_proj(x_t)) + tc = torch.sigmoid(self.tau_c).unsqueeze(0) + # Pad or slice tc to match d_model + if tc.shape[-1] < D: + tc = tc.repeat(1, (D + tc.shape[-1] - 1) // tc.shape[-1])[:, :D] + # Closed-form LTC update: h_new = exp(-a1*τ)*h + (a2/a1)*(1 - exp(-a1*τ)) + decay = torch.exp(-a1 * tc) + h = decay * h + (a2 / (a1 + 1e-6)) * (1.0 - decay) + outputs.append(h) + + h_seq = torch.stack(outputs, dim=1) # (B, T, D) + x = x + self.out_proj(h_seq) + + # SwiGLU MLP + h2 = self.ln2(x) + x = x + self.mlp_down(F.silu(self.mlp_gate(h2)) * self.mlp_up(h2)) + + return x, h + + +# ── GQA Block (Geode S1-level: coarse selection) ─────────────────────────── + +class GQABlock(nn.Module): + """Grouped Query Attention block — one Geode S1-level (4-way coarse selection). + + Uses F.scaled_dot_product_attention → FlashAttention-2 on H100. + All projections are Q2Linear (Z₄). 
+ """ + + def __init__( + self, + d_model: int, + n_heads: int = 8, + n_kv_heads: int = 4, + mlp_ratio: float = 3.0, + ): + super().__init__() + self.n_heads = n_heads + self.n_kv_heads = n_kv_heads + self.head_dim = d_model // n_heads + self.n_rep = n_heads // n_kv_heads + mlp_dim = int(d_model * mlp_ratio) + + # Attention projections + self.q_proj = Q2Linear(d_model, d_model) + self.k_proj = Q2Linear(d_model, self.head_dim * n_kv_heads) + self.v_proj = Q2Linear(d_model, self.head_dim * n_kv_heads) + self.o_proj = Q2Linear(d_model, d_model) + self.ln1 = nn.LayerNorm(d_model) + + # SwiGLU MLP + self.mlp_gate = Q2Linear(d_model, mlp_dim) + self.mlp_up = Q2Linear(d_model, mlp_dim) + self.mlp_down = Q2Linear(mlp_dim, d_model) + self.ln2 = nn.LayerNorm(d_model) + + def forward(self, x: torch.Tensor) -> torch.Tensor: + B, T, D = x.shape + h = self.ln1(x) + + q = self.q_proj(h).view(B, T, self.n_heads, self.head_dim).transpose(1, 2) + k = self.k_proj(h).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2) + v = self.v_proj(h).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2) + + # GQA: repeat KV heads to match query heads + if self.n_rep > 1: + k = k.repeat_interleave(self.n_rep, dim=1) + v = v.repeat_interleave(self.n_rep, dim=1) + + # FlashAttention-2 on H100 + attn_out = F.scaled_dot_product_attention(q, k, v, is_causal=True) + attn_out = attn_out.transpose(1, 2).contiguous().view(B, T, D) + x = x + self.o_proj(attn_out) + + # SwiGLU MLP + h2 = self.ln2(x) + x = x + self.mlp_down(F.silu(self.mlp_gate(h2)) * self.mlp_up(h2)) + + return x + + +# ── Full Model: Geode Layout [GQA, CfC, CfC, CfC] × 4 ──────────────────── + +@dataclass +class ModelConfig: + """Configuration derived from constraints and Geode structure.""" + vocab_size: int = 256 # Byte tokenization (saves 1.2 MB vs SP-1024) + d_model: int = 768 # Hidden dimension (32-aligned for Z₄ registers) + n_geode_levels: int = 4 # 4 Geode levels + cfc_per_level: int = 3 # 3 CfC blocks per GQA (from G = 
1/(1-3x)) + n_heads: int = 8 # Query heads + n_kv_heads: int = 4 # KV heads (GQA) + mlp_ratio: float = 3.0 # SwiGLU expansion + n_time_constants: int = 5 # CfC time constants per block + max_seq_len: int = 2048 # Context length + + @property + def n_layers(self) -> int: + return self.n_geode_levels * (1 + self.cfc_per_level) # 4 * 4 = 16 + + +class Q2LTCModel(nn.Module): + """Q²-QAT Hybrid LTC-Transformer with Geode layout. + + Architecture: [GQA, CfC, CfC, CfC] × 4 = 16 layers + 4 GQA blocks (S₁ coarse context) + 12 CfC blocks (G refinement) + """ + + def __init__(self, cfg: ModelConfig): + super().__init__() + self.cfg = cfg + + # Embedding (FP16, tied with output) + self.embed = nn.Embedding(cfg.vocab_size, cfg.d_model) + + # Build Geode-ordered layer stack + self.layers = nn.ModuleList() + self.layer_types: list[str] = [] + for level in range(cfg.n_geode_levels): + # S₁: GQA block (coarse context, 4 choices) + self.layers.append( + GQABlock(cfg.d_model, cfg.n_heads, cfg.n_kv_heads, cfg.mlp_ratio) + ) + self.layer_types.append("gqa") + # G: 3 × CfC blocks (refinement, 3 choices each) + for _ in range(cfg.cfc_per_level): + self.layers.append( + CfCBlock(cfg.d_model, cfg.n_time_constants, cfg.mlp_ratio) + ) + self.layer_types.append("cfc") + + self.ln_f = nn.LayerNorm(cfg.d_model) + + # Tied output projection + self.lm_head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False) + self.lm_head.weight = self.embed.weight # Weight tying + + # Initialize + self._init_weights() + + def _init_weights(self) -> None: + """OrthoInit for GQA, Kaiming for CfC/MLP (following BitNet practice).""" + for name, module in self.named_modules(): + if isinstance(module, Q2Linear): + if "q_proj" in name or "k_proj" in name or "v_proj" in name: + nn.init.orthogonal_(module.weight) + else: + nn.init.kaiming_uniform_(module.weight, a=math.sqrt(5)) + + def activate_q2(self, layer_indices: Optional[list[int]] = None) -> None: + """Activate Z₄ quantization on specified layers (or all if 
None).""" + for i, layer in enumerate(self.layers): + if layer_indices is not None and i not in layer_indices: + continue + for module in layer.modules(): + if isinstance(module, Q2Linear): + module.activate_q2() + + def refresh_all_tau(self) -> None: + """Refresh Z₄ thresholds from current weight distributions.""" + for layer in self.layers: + for module in layer.modules(): + if isinstance(module, Q2Linear): + module.refresh_tau() + + def forward( + self, + idx: torch.Tensor, + cfc_states: Optional[list[Optional[torch.Tensor]]] = None, + ) -> tuple[torch.Tensor, list[Optional[torch.Tensor]]]: + """ + Args: + idx: Token indices (B, T) + cfc_states: Optional CfC hidden states from previous batch + + Returns: + logits: (B, T, V) + new_cfc_states: Updated CfC states for next batch + """ + x = self.embed(idx) + + if cfc_states is None: + cfc_states = [None] * len(self.layers) + + new_states: list[Optional[torch.Tensor]] = [] + cfc_idx = 0 + + for i, (layer, ltype) in enumerate(zip(self.layers, self.layer_types)): + if ltype == "gqa": + x = layer(x) + new_states.append(None) + else: + state = cfc_states[i] if i < len(cfc_states) else None + x, h = layer(x, state) + new_states.append(h.detach()) + cfc_idx += 1 + + x = self.ln_f(x) + logits = self.lm_head(x) + return logits, new_states + + def count_parameters(self) -> int: + return sum(p.numel() for p in self.parameters() if p.requires_grad) + + def estimate_artifact_size(self) -> dict[str, float]: + """Estimate artifact size in bytes at Z₄ (2-bit) packing.""" + embed_bytes = self.cfg.vocab_size * self.cfg.d_model * 2 # FP16 + q2_params = 0 + fp16_params = 0 + + for name, p in self.named_parameters(): + if "embed" in name or "lm_head" in name: + continue # Tied, counted in embed_bytes + if "ln" in name or "tau" in name: + fp16_params += p.numel() + else: + q2_params += p.numel() + + q2_bytes = q2_params * 2 / 8 # 2 bits per weight + fp16_bytes = fp16_params * 2 # 16 bits per param + raw_total = embed_bytes + q2_bytes + 
fp16_bytes + compressed = raw_total * 0.85 # Conservative zstd-22 + + return { + "embed_bytes": embed_bytes, + "q2_bytes": q2_bytes, + "fp16_bytes": fp16_bytes, + "raw_total": raw_total, + "compressed_estimate": compressed, + "budget_remaining": ARTIFACT_BUDGET - compressed, + "q2_params": q2_params, + "total_params": self.count_parameters(), + } + + +# ── Gray Encoding and Packing ────────────────────────────────────────────── + +def gray_encode(sym: torch.Tensor) -> torch.Tensor: + """Gray map φ: Z₄ → F₂². g = s ⊕ (s >> 1).""" + return sym ^ (sym >> 1) + + +def gray_decode(gray: torch.Tensor) -> torch.Tensor: + """Inverse Gray map.""" + sym = gray.clone() + sym ^= sym >> 1 + return sym + + +def pack_z4(symbols: torch.Tensor) -> bytes: + """Pack Z₄ symbols (values 0-3) into bytes, 4 per byte, MSB-first.""" + gray = gray_encode(symbols.to(torch.uint8)) + n = gray.numel() + # Pad to multiple of 4 + pad = (4 - n % 4) % 4 + if pad: + gray = F.pad(gray.view(-1), (0, pad)) + gray = gray.view(-1, 4) + packed = (gray[:, 0] << 6) | (gray[:, 1] << 4) | (gray[:, 2] << 2) | gray[:, 3] + return packed.cpu().numpy().tobytes() + + +def unpack_z4(data: bytes, n: int, device: str = "cpu") -> torch.Tensor: + """Unpack bytes to Z₄ symbols.""" + packed = torch.frombuffer(bytearray(data), dtype=torch.uint8).to(device) + s0 = (packed >> 6) & 0x3 + s1 = (packed >> 4) & 0x3 + s2 = (packed >> 2) & 0x3 + s3 = packed & 0x3 + gray = torch.stack([s0, s1, s2, s3], dim=-1).view(-1)[:n] + return gray_decode(gray) + + +# ── Q2BN Binary Format ───────────────────────────────────────────────────── + +Q2BN_MAGIC = b"Q2BN" +Q2BN_VERSION = 1 +DTYPE_Q2 = 0 +DTYPE_FP16 = 1 + + +def pack_state_dict(state_dict: dict[str, torch.Tensor], out_path: str) -> int: + """Pack model state dict to Q2BN format. + + Returns total bytes written. 
+ """ + buf = bytearray() + buf.extend(Q2BN_MAGIC) + buf.extend(struct.pack(" 1 else 0.5 + sym = torch.zeros_like(tensor, dtype=torch.long) + sym[tensor <= -tau] = 0 + sym[(tensor > -tau) & (tensor <= 0)] = 1 + sym[(tensor > 0) & (tensor <= tau)] = 2 + sym[tensor > tau] = 3 + packed_bytes = pack_z4(sym.view(-1)) + buf.extend(struct.pack("= 2: + p.mul_(1 - lr * wd) + + # Spectral normalization for 2D+ parameters + if p.dim() >= 2: + # Approximate spectral norm via power iteration + state = self.state[p] + if "v" not in state: + state["v"] = torch.randn( + p.shape[-1], device=p.device, dtype=p.dtype + ) + v = state["v"] + u = p.view(-1, p.shape[-1]) @ v + u = u / (u.norm() + 1e-8) + v = p.view(-1, p.shape[-1]).t() @ u + v = v / (v.norm() + 1e-8) + state["v"] = v + sigma = (u * (p.view(-1, p.shape[-1]) @ v)).sum() + d_p = d_p / (sigma + 1e-8) + + # Momentum + if momentum != 0: + if "momentum_buffer" not in self.state[p]: + self.state[p]["momentum_buffer"] = d_p.clone() + else: + buf = self.state[p]["momentum_buffer"] + buf.mul_(momentum).add_(d_p) + if nesterov: + d_p = d_p + momentum * buf + else: + d_p = buf + + p.add_(d_p, alpha=-lr) + + return loss + + +# ── Training Loop ─────────────────────────────────────────────────────────── + +@dataclass +class TrainConfig: + """Training configuration optimized for 8×H100 × 10 minutes.""" + # Phases (fraction of total steps) + warmup_frac: float = 0.10 # Phase 1: FP32 warm-up + progressive_frac: float = 0.60 # Phase 2: Progressive QAT + refine_frac: float = 0.30 # Phase 3: Full Q2 refinement + + # Optimizer + lr: float = 0.01 + weight_decay: float = 0.04 + warmup_steps: int = 200 + grad_clip: float = 1.0 + + # Batch + batch_size: int = 32 # Per GPU + seq_len: int = 2048 + grad_accum: int = 4 + + # SWA + swa_start_frac: float = 0.60 + + # Q2 + tau_refresh_interval: int = 1024 + + # Timing + max_wall_seconds: int = 600 + + # Data + data_path: str = "" + byte_tokens: bool = True # Raw byte tokenization (V=256) + + +def 
get_cosine_lr(step: int, total_steps: int, lr: float, warmup: int) -> float: + """Cosine annealing with linear warmup.""" + if step < warmup: + return lr * step / max(warmup, 1) + progress = (step - warmup) / max(total_steps - warmup, 1) + return lr * 0.5 * (1.0 + math.cos(math.pi * progress)) + + +def train( + model_cfg: Optional[ModelConfig] = None, + train_cfg: Optional[TrainConfig] = None, +) -> None: + """Main training entry point. + + Implements the 3-phase Geode-guided training strategy: + Phase 1: FP32 warm-up (establish activation distributions) + Phase 2: Progressive Z₄ quantization (deep layers first) + Phase 3: Full Z₄ refinement with SWA + """ + if model_cfg is None: + model_cfg = ModelConfig() + if train_cfg is None: + train_cfg = TrainConfig() + + # ── Distributed setup ─────────────────────────────────────────────── + local_rank = int(os.environ.get("LOCAL_RANK", 0)) + world_size = int(os.environ.get("WORLD_SIZE", 1)) + + if world_size > 1: + dist.init_process_group("nccl") + torch.cuda.set_device(local_rank) + + device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu") + is_main = local_rank == 0 + + # ── H100 optimizations ────────────────────────────────────────────── + if torch.cuda.is_available(): + torch.set_float32_matmul_precision("high") + torch.backends.cuda.matmul.allow_tf32 = True + torch.backends.cudnn.allow_tf32 = True + + # ── Model ─────────────────────────────────────────────────────────── + model = Q2LTCModel(model_cfg).to(device) + + if is_main: + size_info = model.estimate_artifact_size() + print(f"Model: {size_info['total_params']:,} parameters") + print(f"Estimated artifact: {size_info['compressed_estimate']/1e6:.2f} MB") + print(f"Budget remaining: {size_info['budget_remaining']/1e6:.2f} MB") + print(f"Architecture: [GQA, CfC×{model_cfg.cfc_per_level}] × {model_cfg.n_geode_levels}") + print(f" = {model_cfg.n_geode_levels} GQA + {model_cfg.n_geode_levels * model_cfg.cfc_per_level} CfC = 
{model_cfg.n_layers} layers") + + # Compile for max throughput (PyTorch 2.0+) + try: + model = torch.compile(model, mode="max-autotune") + if is_main: + print("Model compiled with max-autotune") + except Exception: + if is_main: + print("torch.compile not available, continuing without") + + if world_size > 1: + model = DDP(model, device_ids=[local_rank]) + + raw_model = model.module if isinstance(model, DDP) else model + + # ── Optimizer ─────────────────────────────────────────────────────── + optimizer = Muon( + model.parameters(), + lr=train_cfg.lr, + weight_decay=train_cfg.weight_decay, + ) + + # ── Data (placeholder — replace with FineWeb loading) ─────────────── + # In production, load FineWeb shards as raw bytes or SP-1024 tokens + if train_cfg.data_path and Path(train_cfg.data_path).exists(): + if is_main: + print(f"Loading data from {train_cfg.data_path}") + # Placeholder for real data loading + data = torch.randint(0, model_cfg.vocab_size, (1024, train_cfg.seq_len + 1)) + else: + if is_main: + print("Using synthetic data (no data_path provided)") + data = torch.randint(0, model_cfg.vocab_size, (1024, train_cfg.seq_len + 1)) + + # ── Training ──────────────────────────────────────────────────────── + max_steps = int(os.environ.get("MAX_STEPS", 15000)) + phase1_end = int(max_steps * train_cfg.warmup_frac) + phase2_end = int(max_steps * (train_cfg.warmup_frac + train_cfg.progressive_frac)) + swa_start = int(max_steps * train_cfg.swa_start_frac) + + # SWA model + swa_model = None + swa_n = 0 + + start_time = time.time() + + if is_main: + print(f"\nTraining for {max_steps} steps ({train_cfg.max_wall_seconds}s budget)") + print(f" Phase 1 (FP32 warm-up): steps 0–{phase1_end}") + print(f" Phase 2 (Progressive QAT): steps {phase1_end}–{phase2_end}") + print(f" Phase 3 (Full Z₄ refinement): steps {phase2_end}–{max_steps}") + print(f" SWA starts at step {swa_start}") + + model.train() + cfc_states: Optional[list[Optional[torch.Tensor]]] = None + + for step in 
range(max_steps): + # Wall-clock check + elapsed = time.time() - start_time + if elapsed > train_cfg.max_wall_seconds - 30: # 30s buffer for packaging + if is_main: + print(f"Wall-clock limit approaching ({elapsed:.0f}s), stopping training") + break + + # ── Phase transitions ─────────────────────────────────────────── + if step == phase1_end: + if is_main: + print(f"\n→ Phase 2: Activating Z₄ quantization (progressive)") + # Activate deep layers first (Geode: refine before coarse) + deep_layers = list(range(len(raw_model.layers) - 1, len(raw_model.layers) // 2, -1)) + raw_model.activate_q2(deep_layers) + + elif step == (phase1_end + phase2_end) // 2: + # Activate remaining layers + all_layers = list(range(len(raw_model.layers))) + raw_model.activate_q2(all_layers) + if is_main: + print(f"\n→ Phase 2.5: All layers now Z₄ quantized") + + elif step == phase2_end: + if is_main: + print(f"\n→ Phase 3: Full Z₄ refinement") + + # ── Threshold refresh ─────────────────────────────────────────── + if step > 0 and step % train_cfg.tau_refresh_interval == 0: + raw_model.refresh_all_tau() + + # ── Learning rate ─────────────────────────────────────────────── + lr = get_cosine_lr(step, max_steps, train_cfg.lr, train_cfg.warmup_steps) + for pg in optimizer.param_groups: + pg["lr"] = lr + + # ── Forward + backward ────────────────────────────────────────── + batch_idx = step % len(data) + batch = data[batch_idx].unsqueeze(0).to(device) + input_ids = batch[:, :-1] + targets = batch[:, 1:] + + with torch.amp.autocast("cuda", dtype=torch.bfloat16): + logits, cfc_states = raw_model(input_ids, cfc_states) + loss = F.cross_entropy( + logits.view(-1, model_cfg.vocab_size), targets.view(-1) + ) + + loss.backward() + + if (step + 1) % train_cfg.grad_accum == 0: + if train_cfg.grad_clip > 0: + torch.nn.utils.clip_grad_norm_(model.parameters(), train_cfg.grad_clip) + optimizer.step() + optimizer.zero_grad(set_to_none=True) + + # ── SWA 
─────────────────────────────────────────────────────────
+        if step >= swa_start:
+            if swa_model is None:
+                swa_model = {
+                    k: v.clone() for k, v in raw_model.state_dict().items()
+                }
+                swa_n = 1
+            else:
+                swa_n += 1
+                for k, v in raw_model.state_dict().items():
+                    swa_model[k] += (v - swa_model[k]) / swa_n
+
+        # ── Logging ─────────────────────────────────────────────────────
+        if is_main and step % 100 == 0:
+            bpb = loss.item() / math.log(2)
+            phase = (
+                "FP32" if step < phase1_end else
+                "QAT" if step < phase2_end else
+                "Refine"
+            )
+            print(
+                f"step {step:5d} | loss {loss.item():.4f} | "
+                f"bpb {bpb:.4f} | lr {lr:.6f} | "
+                f"phase {phase} | {elapsed:.0f}s"
+            )
+
+    # ── Package artifact ────────────────────────────────────────────────
+    if is_main:
+        print("\n── Packaging artifact ──")
+        final_sd = swa_model if swa_model is not None else raw_model.state_dict()
+        out_path = "model.q2bin"
+        raw_bytes = pack_state_dict(final_sd, out_path)
+        print(f"Q2BN size: {raw_bytes / 1e6:.3f} MB")
+
+        # zstd compression
+        try:
+            import zstandard
+            cctx = zstandard.ZstdCompressor(level=22)
+            raw_data = Path(out_path).read_bytes()
+            compressed = cctx.compress(raw_data)
+            compressed_path = "model.q2bin.zst"
+            Path(compressed_path).write_bytes(compressed)
+            print(f"Compressed: {len(compressed) / 1e6:.3f} MB")
+            if len(compressed) <= ARTIFACT_BUDGET:
+                print(f"✓ Within budget ({ARTIFACT_BUDGET / 1e6:.0f} MB)")
+            else:
+                print(f"✗ OVER BUDGET by {(len(compressed) - ARTIFACT_BUDGET) / 1e6:.3f} MB")
+        except ImportError:
+            print("zstandard not installed; skipping compression")
+
+    if world_size > 1:
+        dist.destroy_process_group()
+
+
+# ── Entry point ─────────────────────────────────────────────────────────────
+
+if __name__ == "__main__":
+    cfg = ModelConfig(
+        vocab_size=int(os.environ.get("VOCAB_SIZE", 256)),
+        d_model=int(os.environ.get("D_MODEL", 768)),
+        n_geode_levels=int(os.environ.get("N_GEODE_LEVELS", 4)),
+        max_seq_len=int(os.environ.get("SEQ_LEN", 2048)),
+    )
+
+    tcfg = TrainConfig(
+        data_path=os.environ.get("DATA_PATH", ""),
+        byte_tokens=os.environ.get("BYTE_TOKENS", "1") == "1",
+        max_wall_seconds=int(os.environ.get("MAX_WALLCLOCK_SECONDS", 600)),
+        batch_size=int(os.environ.get("BATCH_SIZE", 32)),
+        seq_len=cfg.max_seq_len,
+    )
+
+    train(cfg, tcfg)

From bebfc04e716a1b95ecd8e9adfcce60bf6f857f1d Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Sat, 21 Mar 2026 23:32:58 +0000
Subject: [PATCH 4/4] Integrate PR #85: clarify that LFM 2.5 10:6 is absolute layer count, not reducible ratio

Co-authored-by: devlux76 <86517969+devlux76@users.noreply.github.com>
Agent-Logs-Url: https://github.com/devlux76/q2/sessions/709766c3-895e-4e85-bdbc-67aec60c1798
---
 docs/parameter-golf/ANALYSIS.md | 7 +++++--
 docs/parameter-golf/DESIGN.md   | 7 +++++--
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/docs/parameter-golf/ANALYSIS.md b/docs/parameter-golf/ANALYSIS.md
index ab94a7c..3158f3d 100644
--- a/docs/parameter-golf/ANALYSIS.md
+++ b/docs/parameter-golf/ANALYSIS.md
@@ -265,8 +265,11 @@ for two reasons:

 ### 4.5 Geode-derived layer layout

-LFM 2.5's 10:6 CfC:GQA ratio was found empirically. The Geode factorization
-(§D-4.1) provides a principled derivation that eliminates the guesswork.
+LFM 2.5's 10:6 CfC:GQA ratio was found empirically. Note that 10:6 cannot be
+reduced to 5:3: the numbers are absolute layer counts (10 CfC + 6 GQA = 16 layers
+total), not a bare ratio. Reducing to 5:3 would describe a different 8-layer
+model, halving the depth. The Geode factorization (§D-4.1) provides a principled
+derivation that eliminates the guesswork.

 The generating function for Q²'s transition sequences:

diff --git a/docs/parameter-golf/DESIGN.md b/docs/parameter-golf/DESIGN.md
index cb48d57..2439e07 100644
--- a/docs/parameter-golf/DESIGN.md
+++ b/docs/parameter-golf/DESIGN.md
@@ -166,8 +166,11 @@ $$S(x) - 1 = \underbrace{4x}_{S_1} \cdot \underbrace{\frac{1}{1-3x}}_{G}$$

 $$\underbrace{[\text{GQA},\ \text{CfC},\ \text{CfC},\ \text{CfC}]}_{\text{one Geode level}} \times 4 = 16 \text{ layers}$$

-4 GQA + 12 CfC, ratio 3:1 (CfC:GQA). More CfC-heavy than LFM 2.5's
-empirical 10:6 = 1.67:1 — predicted by the Geode for short-context (2048-token)
+4 GQA + 12 CfC, ratio 3:1 (CfC:GQA). Note that LFM 2.5's 10:6 cannot be
+reduced to 5:3 — these are absolute layer counts (10 CfC + 6 GQA = 16 layers
+total), not a bare ratio. Reducing to 5:3 would halve the depth to 8 layers.
+Our Geode-derived 12:4 is also 16 layers total, but more CfC-heavy
+(ratio 3:1 vs 1.67:1), predicted by the Geode for short-context (2048-token)
 workloads where less attention is needed.

 ### 3.3 Information flow