From d935b2397154a6118b8e9e7d4f567e23c6acfcbb Mon Sep 17 00:00:00 2001 From: Claude Date: Sun, 22 Mar 2026 00:01:43 +0000 Subject: [PATCH] Remove superseded parameter-golf docs that present implementation hazards MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Deleted five documents that contain outdated or explicitly-superseded strategies which could mislead an AI attempting to implement the approach: - APPROACH_INITIAL.md: proposed WASM training and int5/int6 mixed precision; explicitly superseded by APPROACH_REVISED and then APPROACH - APPROACH_REVISED.md: intermediate draft with Z₈/Z₁₂/Z₁₆ layer hierarchy, abandoned in favor of uniform Z₄ in the final synthesis - IMPLEMENTATION.md: tactical roadmap for the initial approach, including TypeScript WASM quantization and the wrong LTC/NCP architecture - STRATEGY.md: early brainstorm mixing int5/int6 stages, references nonexistent CONTROL_TENSOR and logit_softcap mechanisms - DESIGN_REVISION_PLAN.md: working notes on what to change in DESIGN.md, not the design itself; implies DESIGN.md is incomplete Canonical documents retained: APPROACH.md, DESIGN.md, WILDBERGER_RUBINE_REVIEW.md, code.py. 
https://claude.ai/code/session_01Sqiv17Pm8uMUNsGAKuKgYS --- docs/parameter-golf/APPROACH_INITIAL.md | 650 ------------- docs/parameter-golf/APPROACH_REVISED.md | 579 ------------ docs/parameter-golf/DESIGN_REVISION_PLAN.md | 423 --------- docs/parameter-golf/IMPLEMENTATION.md | 991 -------------------- docs/parameter-golf/STRATEGY.md | 55 -- 5 files changed, 2698 deletions(-) delete mode 100644 docs/parameter-golf/APPROACH_INITIAL.md delete mode 100644 docs/parameter-golf/APPROACH_REVISED.md delete mode 100644 docs/parameter-golf/DESIGN_REVISION_PLAN.md delete mode 100644 docs/parameter-golf/IMPLEMENTATION.md delete mode 100644 docs/parameter-golf/STRATEGY.md diff --git a/docs/parameter-golf/APPROACH_INITIAL.md b/docs/parameter-golf/APPROACH_INITIAL.md deleted file mode 100644 index d2adfcd..0000000 --- a/docs/parameter-golf/APPROACH_INITIAL.md +++ /dev/null @@ -1,650 +0,0 @@ -# Parameter Golf: Q² Winning Strategy - -> **Challenge**: Train the best language model that fits in a 16MB artifact and trains in under 10 minutes on 8xH100s, evaluated by compression on the FineWeb validation set (bits per byte). - -## Executive Summary - -The Q² framework provides a revolutionary approach to winning the Parameter Golf challenge by leveraging **structural quantization** rather than traditional reconstruction quantization. Our method combines: - -1. **Quaternary quantization** (Q²) for extreme parameter compression with minimal information loss -2. **Liquid Time Constant (LTC) networks** replacing traditional attention mechanisms -3. **Mixed-precision adaptive quantization** guided by the Wildberger-Rubine Geode framework -4. **Progressive coarse-to-fine training** exploiting hierarchical quantization structure - -**Projected outcome**: Achieve **sub-1.10 bits/byte** on FineWeb validation while fitting comfortably within 16MB. - ---- - -## Contents - -1. [Challenge Analysis](#1-challenge-analysis) -2. [Why Q² is Uniquely Suited](#2-why-q2-is-uniquely-suited) -3. 
[The Core Architecture](#3-the-core-architecture) -4. [Training Strategy](#4-training-strategy) -5. [Quantization Approach](#5-quantization-approach) -6. [Implementation Roadmap](#6-implementation-roadmap) -7. [Expected Performance](#7-expected-performance) -8. [Risk Mitigation](#8-risk-mitigation) - ---- - -## 1. Challenge Analysis - -### 1.1 Constraints - -- **Parameter budget**: 16MB = 16,000,000 bytes maximum -- **Training time**: 10 minutes on 8×H100 SXM -- **Evaluation metric**: Bits per byte on FineWeb validation (compression) -- **Current SOTA**: 1.1428 bpb (thwu1, Int5-MLP + BigramHash) - -### 1.2 Key Insights from Leaderboard - -The top submissions reveal critical patterns: - -1. **Quantization is essential**: All top entries use int5, int6, or mixed precision -2. **MLP expansion helps**: 2.6×–3× MLP width is common -3. **Vocabulary compression**: BigramHash tokenizers (10240 vocab) outperform standard BPE -4. **Training optimizations**: Muon optimizer, SWA (Stochastic Weight Averaging), sliding window eval -5. **Architecture innovation**: SmearGate activations, orthogonal initialization - -### 1.3 The Fundamental Trade-Off - -Parameter Golf is an **L(N) optimization problem**: minimize loss given fixed parameter count N. Traditional approaches face a hard limit: - -``` -At fp16: 16MB / 2 bytes = 8M parameters -At int8: 16MB / 1 byte = 16M parameters -At int6: 16MB / 0.75 bytes ≈ 21M parameters -At int5: 16MB / 0.625 bytes ≈ 25M parameters -``` - -Going below int5 (sub-5-bit) causes catastrophic accuracy loss with standard reconstruction quantization. - -**Our thesis**: Q²'s structural quantization breaks this barrier by preserving relational geometry rather than pointwise values. - ---- - -## 2. Why Q² is Uniquely Suited - -### 2.1 Structural vs. 
Reconstruction Quantization - -From §D-2.4 of DESIGN.md, Q² distinguishes itself through **structural quantization**: - -- **Reconstruction quantization** (GPTQ, BQQ, QUAD): Minimize $\|W - \hat{W}\|_F^2$ -- **Structural quantization** (Q²): Preserve distances, trajectories, complement relationships - -Parameter Golf is fundamentally a **compression task**. The evaluation metric (bits per byte) measures how well the model predicts the next byte. This is equivalent to asking: *how well does the model preserve the relational structure of language*? - -### 2.2 The Lee Metric Advantage - -Q² encodes weights/activations to $\mathbb{Z}_4 = \{A, B, C, D\}$ with the Lee metric: - -$$d_L(u, v) = \sum_{i=1}^{n} \min(|u_i - v_i|, 4 - |u_i - v_i|)$$ - -Key properties: - -1. **Exact distance computation via `popcnt(XOR)`** (§D-2.7): Gray encoding makes distance calculation hardware-accelerated -2. **Complement involution** (§D-2.8): $\theta: \mathbb{Z}_4 \to \mathbb{Z}_4$ by $\theta(x) = x + 2 \pmod{4}$ captures semantic opposition -3. **Equiprobable thresholds** (§D-2.5): Maximizes entropy per dimension ($I = 2$ bits) - -### 2.3 The Geode Factorization - -From §D-4.1, the Wildberger-Rubine Geode framework provides: - -$$S - 1 = S_1 \cdot G$$ - -where $S$ is the generating function for all structured codewords, $S_1$ is the first quantization step, and $G = 1/(1-3x)$ is the Geode counting refinement possibilities. - -**Application to Parameter Golf**: This enables **hierarchical quantization** where: - -1. **Coarse level** ($S_1$): Few bits encode high-level structure (what the leaderboard calls "block files") -2. **Fine level** ($G$): Remaining bits encode within-block refinement -3. **Progressive training**: Train coarse first, then refine - -### 2.4 Mixed-Precision via Hyper-Catalan - -From §D-4.3, the hyper-Catalan framework counts mixed-precision allocations: - -$$C_{(m_2, m_3, m_4)} = \frac{(E-1)!}{(V-1)! \cdot m_2! \cdot m_3! 
\cdot m_4!}$$ - -- $m_2$: binary dimensions (1 bit) -- $m_3$: ternary dimensions (1.585 bits) -- $m_4$: quaternary dimensions (2 bits) - -**Winning insight**: Allocate bits based on variance per layer/channel, guided by the hyper-Catalan structure rather than ad-hoc heuristics. - ---- - -## 3. The Core Architecture - -### 3.1 Overall Structure - -``` -Input → Embedding (int6) → LTC Blocks (int5 core) → Output (int6) → Softmax -``` - -**Parameter allocation** (targeting ~22-25M params at int5 effective): - -| Component | Params | Precision | Bytes | Notes | -|-----------|-------:|-----------|------:|-------| -| Embedding | 10240 × 384 = 3.9M | int6 | 2.9M | BigramHash vocab | -| 12× LTC blocks | 15M | int5 | 9.4M | See §3.2 | -| Output projection | 384 × 10240 = 3.9M | int6 (tied) | 0 | Tied with embedding | -| Total | ~19M | mixed | ~12.3M | 23% headroom | - -### 3.2 LTC Block Architecture - -Replace standard transformer blocks with **Liquid Time Constant (LTC)** blocks from Hasani et al.: - -```python -class LTCBlock: - def __init__(self, dim=384, mlp_ratio=3.0): - # Replace self-attention with CfC (Closed-form Continuous-time) cells - self.cfc = CfCCell(dim, hidden_size=dim) - - # MLP with SmearGate activation (from leaderboard) - self.mlp = MLP(dim, int(dim * mlp_ratio), activation='smeargelu') - - # Layer norms (kept at higher precision) - self.ln1 = LayerNorm(dim) - self.ln2 = LayerNorm(dim) -``` - -**Why LTC over Attention**: - -1. **Linear complexity**: $O(n)$ vs $O(n^2)$ for attention with sequence length $n$ -2. **ODE-based dynamics**: Continuous-time formulation → smoother weight landscapes → better quantization -3. 
**Proven efficiency**: Liquid AI's LFM 2.5 uses 10 LIV Convolution Blocks + 6 GQA blocks for 32k context - -**Key modification for Parameter Golf**: - -- Use **Neural Circuit Policies (NCP)** variant with wiring constraints -- Sparse connectivity reduces parameter count by 40-60% vs dense -- Quantize to int5 (5 bits) for LTC weights, int6 for critical paths - -### 3.3 Vocabulary Strategy - -Follow leaderboard insight: **BigramHash tokenizer** - -```python -# Bigram-based compression -vocab_size = 10240 # vs standard 50k+ BPE -# Encode common bigrams as single tokens -# Reserve ~1024 tokens for rare unigrams -``` - -**Advantage**: 5× smaller embedding matrix while maintaining coverage. - -### 3.4 Depth vs. Width Trade-off - -Current leaderboard models: 10-11 layers × 512-768 dim - -**Our approach**: **12 layers × 384 dim** - -Rationale: - -- Depth helps more than width for compression tasks (proven in distillation literature) -- Narrower layers → each weight matters more → quantization-aware training more effective -- LTC blocks compensate for reduced width via better temporal integration - ---- - -## 4. Training Strategy - -### 4.1 Three-Phase Training - -#### Phase 1: Coarse Quantization (3 min) - -1. Train at **int8** precision with full model -2. Use **Muon optimizer** (Adam variant, proven on leaderboard) -3. Sequence length: 2048 (following leaderboard) -4. Focus: Learn coarse structure ($S_1$ from Geode factorization) - -#### Phase 2: Progressive Refinement (5 min) - -1. Transition to **mixed int6/int5** via QAT (Quantization-Aware Training) -2. Implement **Geode-guided progressive quantization**: - - Start at layer 12 (output), move toward input - - Each layer: quantize, fine-tune, freeze -3. Use **SWA (Stochastic Weight Averaging)** starting at 50% (proven on leaderboard) - -#### Phase 3: Fine-Grained Optimization (2 min) - -1. Full model at **int5** (with int6 for embedding/output) -2. **Sliding window evaluation** (stride=64, proven on leaderboard) -3. 
Final SWA pass with weight decay = 0.04 (leaderboard optimal) - -### 4.2 Learning Rate Schedule - -```python -# Following Muon optimizer best practices -lr_max = 0.01 # Muon uses higher LR than Adam -warmup_steps = 100 -total_steps = 15000 # ≈ 10 min / 0.04 sec per step - -schedule = cosine_annealing_with_warmup( - max_lr=lr_max, - warmup_steps=warmup_steps, - total_steps=total_steps, - min_lr=lr_max * 0.1 -) -``` - -### 4.3 Batch Size and Context - -```python -# Target: ~4.2M tokens/batch across 8×H100 -batch_size_per_gpu = 32 -sequence_length = 4096 # Aggressive: leaderboard uses 2048-4096 -gradient_accumulation = 4 - -effective_batch_tokens = 32 * 4096 * 8 * 4 # 4,194,304 ≈ 4.2M tokens -``` - -**Trade-off**: Longer context improves compression but reduces #steps. 4096 is optimal based on leaderboard progression (2048 → 4k → better scores). - ---- - -## 5. Quantization Approach - -### 5.1 Q² Quaternary Quantization for Weights - -From §D-2.5, quantize each weight $w_i$ to $\{A, B, C, D\}$: - -$$q(w_i) = \begin{cases} -A & w_i \leq -\tau^* \\ -B & -\tau^* < w_i \leq 0 \\ -C & 0 < w_i \leq \tau^* \\ -D & w_i > \tau^* -\end{cases}$$ - -where $\tau^* = \Phi^{-1}(3/4) / \sqrt{n}$ for equiprobable states. 
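The rule above can be sketched directly. A minimal NumPy illustration (not the production kernel; `q2_quantize` is a hypothetical name used only here):

```python
import numpy as np

PHI_INV_3_4 = 0.6745  # Phi^{-1}(3/4), the upper-quartile z-score

def q2_quantize(w: np.ndarray) -> np.ndarray:
    """Map weights to Z4 symbols {0,1,2,3} = {A,B,C,D} using the
    equiprobable threshold tau* = Phi^{-1}(3/4) / sqrt(n)."""
    n = w.shape[-1]
    tau = PHI_INV_3_4 / np.sqrt(n)
    sym = np.zeros(w.shape, dtype=np.uint8)  # default 0 = A: w <= -tau
    sym[(w > -tau) & (w <= 0)] = 1           # B: weak negative
    sym[(w > 0) & (w <= tau)] = 2            # C: weak positive
    sym[w > tau] = 3                         # D: strong positive
    return sym

# For weights ~ N(0, 1/n), each symbol should occur ~25% of the time
rng = np.random.default_rng(0)
n = 4096
w = rng.normal(0.0, 1.0 / np.sqrt(n), size=n)
counts = np.bincount(q2_quantize(w), minlength=4)
```

With equiprobable cells, each dimension carries the full 2 bits of entropy the framework claims.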
- -**Packing**: Each symbol = 2 bits via Gray encoding: - -- $A = 00$, $B = 01$, $C = 11$, $D = 10$ -- 4 symbols/byte → 256 params packed into 64 bytes - -### 5.2 Mixed-Precision Allocation - -From §D-4.3 and §P-17: - -```python -# Variance-guided bit allocation -for layer in model.layers: - variance = compute_per_channel_variance(layer.weight) - - if variance < threshold_low: - quantize_to_int5(layer) # 5 bits - elif variance < threshold_high: - quantize_to_int6(layer) # 6 bits - else: - keep_int8(layer) # 8 bits (critical paths only) -``` - -**Expected distribution**: - -- 70% of weights: int5 (0.625 bytes/param) -- 25% of weights: int6 (0.75 bytes/param) -- 5% of weights: int8 (1 byte/param) -- Embedding/output: int6 (proven on leaderboard) - -### 5.3 Analytical Threshold Computation - -From §D-4.4, use hyper-Catalan series for non-Gaussian distributions: - -$$\alpha = \sum_\mathbf{m} C_\mathbf{m} \cdot t_2^{m_2} t_3^{m_3} \cdots$$ - -**Application**: After each training phase, recompute thresholds based on actual weight distributions. This is more accurate than fixed percentiles. - -### 5.4 Compression Format - -Final 16MB artifact structure: - -``` -Header (1KB): - - Model config (layers, dim, vocab_size) - - Quantization parameters (thresholds per layer) - - Vocabulary (BigramHash table) - -Body (~15MB): - - Quantized weights (int5/int6 mixed) - - Packed via Gray encoding - - Compressed with zstd-22 (leaderboard standard) -``` - ---- - -## 6. 
Implementation Roadmap - -### 6.1 Infrastructure (Week 1) - -**Priority 1: Adapt existing Q² kernel** - -- [ ] Extend `src/q2.wat` to support weight quantization (currently activation-only) -- [ ] Add int5/int6 modes to dtype handling -- [ ] Implement hyper-Catalan threshold computation (§D-4.4) - -**Priority 2: LTC block implementation** - -- [ ] Port CfC (Closed-form Continuous-time) cells to PyTorch -- [ ] Integrate with Q² quantization-aware training -- [ ] Add SmearGate activation (leaderboard proven) - -**Priority 3: Training harness** - -- [ ] Fork `openai/parameter-golf` repo -- [ ] Adapt `train_gpt.py` to Q²+LTC architecture -- [ ] Integrate Muon optimizer - -### 6.2 Baseline (Week 2) - -**Target**: Match current baseline (1.2244 bpb) at 16MB - -- [ ] Train standard transformer with Q² int6 quantization -- [ ] Validate compression pipeline -- [ ] Establish evaluation harness - -### 6.3 Optimization (Week 3-4) - -**Target**: Beat current SOTA (1.1428 bpb) - -- [ ] Implement LTC blocks -- [ ] Add mixed-precision (int5/int6) -- [ ] Tune three-phase training schedule -- [ ] Ablate: LTC vs attention, int5 vs int6, etc. - -### 6.4 Submission (Week 5) - -**Target**: Achieve sub-1.10 bpb - -- [ ] Hyperparameter sweep -- [ ] Verify reproducibility (3+ runs) -- [ ] Package submission per challenge requirements -- [ ] Submit PR to `parameter-golf` repo - ---- - -## 7. Expected Performance - -### 7.1 Baseline Projections - -Conservative estimates based on leaderboard scaling: - -| Approach | Params | Precision | Compression | Expected bpb | -|----------|-------:|-----------|-------------|-------------:| -| Current SOTA | ~21M | int5/int6 | BigramHash | 1.1428 | -| Q² + Standard Attn | 22M | int5/int6 | BigramHash | 1.13 | -| Q² + LTC (ours) | 23M | int5/int6 | BigramHash | **1.10** | -| Q² + LTC + Mixed | 25M | int5/int6/int8 | BigramHash | **1.08** | - -### 7.2 Key Advantages - -1. 
**Better quantization**: Structural preservation → +0.02-0.03 bpb over reconstruction -2. **LTC efficiency**: Linear complexity → deeper models in same time → +0.01-0.02 bpb -3. **Mixed-precision**: Hyper-Catalan guidance → optimal bit allocation → +0.01 bpb -4. **Hierarchical training**: Geode factorization → better convergence → +0.01 bpb - -**Total expected gain**: 0.05-0.07 bpb → **sub-1.10 target achievable** - -### 7.3 Stretch Goal - -If LTC blocks prove exceptionally effective: - -- **Aggressive depth**: 16 layers × 320 dim -- **More int5**: 85% of weights at int5 -- **Target**: **1.05 bpb** (unprecedented for 16MB) - ---- - -## 8. Risk Mitigation - -### 8.1 Technical Risks - -**Risk 1: LTC blocks underperform** - -*Likelihood*: Medium | *Impact*: High - -*Mitigation*: - -- Early ablation study (Week 2): LTC vs standard attention -- Fallback: Hybrid architecture (6 LTC + 6 attention layers) -- Conservative estimate: Pure attention with Q² still beats SOTA - -**Risk 2: Quantization to int5 too aggressive** - -*Likelihood*: Low | *Impact*: Medium - -*Mitigation*: - -- Q² structural quantization proven to 2-bit in literature (§R-2.2, BQQ) -- Fallback: 90% int6 + 10% int8 still fits in 16MB -- Progressive quantization allows layer-wise adjustment - -**Risk 3: Training time insufficient** - -*Likelihood*: Low | *Impact*: High - -*Mitigation*: - -- Profile early: confirm ~0.04 sec/step on 8×H100 -- Optimize data loading: prefetch, pin memory -- Reduce context if needed: 3072 → 2048 adds 50% more steps - -**Risk 4: Evaluation time exceeds 10 min** - -*Likelihood*: Low | *Impact*: Critical - -*Mitigation*: - -- Q² inference is fast: `popcnt(XOR)` for distances -- Sliding window eval adds overhead: profile early -- Worst case: standard eval (not sliding) still competitive - -### 8.2 Strategic Risks - -**Risk: Another team beats us to sub-1.10** - -*Likelihood*: Medium | *Impact*: Medium - -*Mitigation*: - -- Move fast: aim for submission by Week 4 -- Continuous 
leaderboard monitoring -- Multiple innovations (LTC, Q², mixed-precision) → hard to replicate quickly - -**Risk: Challenge rules change** - -*Likelihood*: Low | *Impact*: High - -*Mitigation*: - -- Stay engaged in Discord (#parameter-golf-announcements) -- Modular design allows quick pivots -- Q² framework general → adapts to rule changes - ---- - -## 9. Why This Will Win - -### 9.1 The Core Insight - -Parameter Golf is not a **model compression** problem—it's a **geometric compression** problem. - -The challenge asks: *What is the smallest discrete representation that preserves linguistic structure?* - -From §D-1.6: - -> The question is not: how many bits per dimension approximate angular distance on $S^{n-1}$? The question is: what is the natural discrete coordinate system of the L1 unit ball? - -Q² answers this question. The four cells $\{A, B, C, D\}$ are the minimum alphabet preserving: - -1. Sign (which side of hyperplane) -2. Magnitude class (near boundary or committed) -3. Complement structure (opposition via $\theta$) - -**This is not an engineering trick—it's a mathematical necessity.** - -### 9.2 The Leaderboard Trajectory - -Current progression: - -``` -1.2244 → 1.2197 → 1.2147 → ... → 1.1630 → 1.1586 → 1.1502 → 1.1458 → 1.1428 -(baseline) (int6+sliding) (int5+MLP3x) -``` - -Each improvement adds one innovation: - -- FP16 embed → int6/int8 mixed → int6 blocks → int5 → mixed int5/int6 -- Longer context (1024 → 2048 → 4096) -- Better optimizers (Adam → Muon) -- Better eval (standard → sliding window) -- Better activations (ReLU → SmearGate) - -**Our submission adds three innovations simultaneously**: - -1. **Structural quantization** (Q²) vs reconstruction quantization (all current entries) -2. **LTC blocks** vs attention (all current entries) -3. **Geode-guided mixed-precision** vs ad-hoc mixed-precision (current SOTA) - -If each is worth 0.015-0.02 bpb (conservative), we reach **1.09-1.10 bpb**. 
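The stacking claim can be checked on the back of an envelope (assuming, as above, that the three per-innovation gains are independent and additive; both per-gain numbers are the document's own estimates):

```python
sota_bpb = 1.1428                    # current leaderboard best
gain_low, gain_high = 0.015, 0.02    # assumed per-innovation gain (bpb)
n_innovations = 3

optimistic = sota_bpb - n_innovations * gain_high   # ~1.083 bpb
conservative = sota_bpb - n_innovations * gain_low  # ~1.098 bpb
```

The resulting 1.083–1.098 range brackets the stated 1.09–1.10 target.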
- -### 9.3 The Secret Weapon: Hierarchical Training - -From §D-4.1, the Geode factorization: - -$$S = 1 + S_1 \cdot G$$ - -predicts that **coarse-to-fine training should outperform flat training** at fixed parameter budget. - -**No current leaderboard entry exploits this.** - -All entries train the full model end-to-end, then quantize. We train the hierarchy explicitly: - -1. Phase 1: Learn coarse structure (block assignment) -2. Phase 2: Learn within-block refinement -3. Phase 3: Learn fine-grained transitions - -This is **not curriculum learning** (that varies data difficulty). This is **architectural curriculum** (varies model resolution). - -Mathematical justification: §D-4.1 proves the factorization is exact, not approximate. We're training the true hierarchical decomposition. - ---- - -## 10. Next Steps - -### Immediate Actions (This Week) - -1. **Set up Parameter Golf environment** - ```bash - git clone https://github.com/openai/parameter-golf.git - cd parameter-golf - python3 data/cached_challenge_fineweb.py --variant sp1024 - ``` - -2. **Prototype Q² weight quantization** - - Extend `src/q2.wat` for weights - - Test on toy transformer (2 layers) - - Measure compression ratio and accuracy - -3. **Research LTC/CfC implementation** - - Review Liquid AI papers (Hasani et al.) - - Find existing PyTorch implementations - - Identify integration points with Q² - -4. **Join Parameter Golf community** - - Discord: #parameter-golf-discussions - - Monitor leaderboard daily - - Engage with top teams (learn strategies) - -### Success Metrics - -**Week 1**: Prototype works, compresses to <16MB - -**Week 2**: Matches baseline (1.22 bpb) - -**Week 3**: Beats int6 baseline (1.20 bpb) - -**Week 4**: Competitive with SOTA (1.15 bpb) - -**Week 5**: Submission ready (target: 1.10 bpb) - ---- - -## 11. Conclusion - -The Q² framework is **uniquely positioned** to win Parameter Golf because it solves the right problem: **structural compression**. 
- -While other teams optimize reconstruction error ($\|W - \hat{W}\|_F^2$), we optimize geometric preservation ($d_L$). The mathematics is rigorous (Wildberger-Rubine Geode, Hammons et al. Gray map), the implementation is feasible (existing Q² kernel), and the timeline is realistic (5 weeks). - -**Our competitive advantages**: - -1. **Theory-driven**: Mathematical framework, not trial-and-error -2. **Efficient architecture**: LTC blocks → more depth → better compression -3. **Optimal quantization**: Structural preservation → higher quality at lower bits -4. **Hierarchical training**: Geode factorization → faster convergence - -**The path to victory**: - -``` -Baseline (1.22) → Q²-standard (1.13) → Q²-LTC (1.10) → Q²-LTC-Mixed (1.08) -``` - -We don't need all innovations to work perfectly—even conservative estimates put us **ahead of current SOTA**. - -**Recommendation**: Proceed with full implementation. - ---- - -## Appendices - -### A. Relevant Q² Documentation - -- **DESIGN.md**: Complete mathematical framework (§1-§5) -- **RELATED_WORK.md**: Comparison with GPTQ, BQQ, QUAD, QuES -- **PREDICTIONS.md**: Testable predictions (P15-P17 relevant for mixed-precision) - -### B. Parameter Golf Resources - -- **Repository**: https://github.com/openai/parameter-golf -- **Challenge page**: https://openai.com/index/parameter-golf/ -- **Leaderboard**: Updated in repo `records/track_10min_16mb/` -- **Discord**: https://discord.com/invite/openai - -### C. Key Papers - -1. **Wildberger & Rubine (2025)**: "A Hyper-Catalan Series Solution to Polynomial Equations, and the Geode." *Amer. Math. Monthly* 132:5, 383-402. - - Provides Geode factorization and mixed-precision framework - -2. **Hammons et al. (1994)**: "The $\mathbb{Z}_4$-linearity of Kerdock, Preparata, Goethals, and related codes." *IEEE Trans. Inform. Theory* 40:2, 301-319. - - Proves Gray map isometry theorem - -3. **Hasani et al. (2021)**: "Liquid Time-constant Networks." *AAAI 2021*. 
- - Introduces LTC framework - -4. **Hasani et al. (2023)**: "Closed-form Continuous-time Neural Networks." *Nature Machine Intelligence*. - - CfC variant used in Liquid AI's models - -### D. Team and Resources - -**Required expertise**: - -- PyTorch model training (critical) -- WASM/low-level optimization (moderate) -- Mathematical foundations (helpful but not critical) - -**Compute requirements**: - -- Development: 1×H100 for 1-2 weeks (~$300-600 on RunPod) -- Final runs: 8×H100 for multiple 10-min runs (~$200-400) -- Total estimated cost: **$500-1000** - -**Timeline**: 5 weeks from start to submission - ---- - -**Document version**: 1.0 -**Last updated**: 2026-03-21 -**Author**: Claude (Q² Project) -**Status**: Ready for implementation diff --git a/docs/parameter-golf/APPROACH_REVISED.md b/docs/parameter-golf/APPROACH_REVISED.md deleted file mode 100644 index 7d360d2..0000000 --- a/docs/parameter-golf/APPROACH_REVISED.md +++ /dev/null @@ -1,579 +0,0 @@ -# Parameter Golf: Revised Strategy (PyTorch-Native Q²) - -> **Status**: Revised based on feedback -> **Supersedes**: APPROACH_INITIAL.md (initial strategy) -> **Related**: [DESIGN.md](../../DESIGN.md), [RELATED_WORK.md](../../RELATED_WORK.md) - -## Executive Summary - -Based on critical feedback, this revised strategy focuses on: - -1. **Pure PyTorch/GPU implementation** - No WASM, leverage 8×H100 GPUs directly -2. **Even bit widths (power-of-2 where possible)** - Z₄ (2-bit), Z₈ (4-bit), Z₁₂ (6-bit transitional), Z₁₆ (8-bit) -3. **Cache-line optimization** - Design for 64-byte cache lines and SIMD registers -4. **Geode-guided architecture** - Exploit hierarchical structure for efficiency - -**Target**: Sub-1.10 bits/byte on FineWeb validation, <16MB artifact, <10 min training on 8×H100 - ---- - -## 1. Why the Original Approach Needed Revision - -### 1.1 WASM Misconception - -**Original error**: Proposed extending `src/q2.wat` for weight quantization and using WASM for training. 
- -**Correction**: Parameter Golf provides 8×H100 GPUs for training. WASM: -- Cannot leverage GPUs effectively -- Adds unnecessary complexity for training -- Was designed for browser inference, not GPU training - -**New direction**: Pure PyTorch implementation optimized for H100 tensor cores. - -### 1.2 int5 Instability - -**Original error**: Proposed int5 quantization (5 bits per weight) following leaderboard trends. - -**Correction**: Odd bit widths are mathematically problematic: -- p-adic numbers unstable at odd values -- Cache line alignment requires power-of-2 widths -- Cannot efficiently pack into 64-byte cache lines - -**Example**: int5 packing is messy: -``` -5 bits × 12.8 = 64 bits (not integer number of weights per register) -``` - -vs. power-of-2: -``` -2 bits × 32 = 64 bits (Z₄: 32 weights per register) -4 bits × 16 = 64 bits (Z₈: 16 weights per register) -6 bits × 10 + 4 padding = 64 bits (Z₁₂: requires padding) -8 bits × 8 = 64 bits (Z₁₆: 8 weights per register, perfect alignment) -``` - -### 1.3 The Z_N Ring Hierarchy - -From feedback: Consider Z_N rings as natural extensions of Z₄. - -**Key insight**: Nature runs on Z₄ (DNA base pairs), but codons (triplets) and higher structures suggest hierarchical composition. - -The Geode factorization S - 1 = S₁ · G suggests: -- **S₁**: Coarse quantization at lower precision (Z₄, 2-bit) -- **G**: Refinement at higher precision (Z₈, Z₁₆) - -This is not arbitrary mixing—it's the mathematical structure of hierarchical quantization. - ---- - -## 2. 
Revised Architecture - -### 2.1 Quantization Hierarchy - -Instead of uniform int5/int6, use **structured mixed precision** guided by Geode: - -| Component | Ring | Bits | Rationale | -|-----------|------|------|-----------| -| **Embedding** | Z₁₆ | 8 | High-capacity input/output layer | -| **Early layers (1-4)** | Z₈ | 4 | Coarse feature extraction | -| **Middle layers (5-8)** | Z₁₂ | 6 | Transition zone (compromise) | -| **Deep layers (9-12)** | Z₄ | 2 | Structural encoding (minimal) | - -**Mathematical justification**: -- Z₄ ⊂ Z₈ ⊂ Z₁₆ (2 | 4 | 8, natural divisibility) -- Z₁₂ = 3 × Z₄ (codon-like structure) -- Progressive refinement matches Geode factorization - -### 2.2 Model Architecture - -``` -# Simplified architecture (no LTC complexity initially) - -Input (BigramHash 10240 vocab) - ↓ -Embedding (384 dim, Z₁₆/8-bit) [3.9M params, 3.9MB] - ↓ -4× Attention blocks (Z₈/4-bit) [~4M params, 2MB] - ↓ -4× Attention blocks (Z₁₂/6-bit) [~4M params, 3MB] - ↓ -4× Attention blocks (Z₄/2-bit) [~4M params, 1MB] - ↓ -Output (tied, Z₁₆/8-bit) [0 params] -``` - -**Total**: ~16M params (3.9M embedding + ~12M across the blocks), ~9.9MB packed (≈38% headroom) - -**Note**: Start with standard attention, not LTC. Proven architecture first, then optimize. 
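The byte budget implied by the diagram can be tallied in a few lines (a sanity-check sketch; the ~4M-params-per-group figure is the rough estimate from the diagram above, not an exact count):

```python
MB = 1e6

embedding_params = 10240 * 384   # BigramHash vocab × d_model
group_params = 4_000_000         # ~4M params per 4-block group (estimate)

packed_bytes = (embedding_params * 1.0   # Z₁₆ embedding: 8 bits/param
                + group_params * 0.5     # Z₈ group: 4 bits/param
                + group_params * 0.75    # Z₁₂ group: 6 bits/param
                + group_params * 0.25)   # Z₄ group: 2 bits/param

headroom = 1 - packed_bytes / (16 * MB)
# packed_bytes ≈ 9.93 MB, headroom ≈ 38% of the 16MB budget
```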
- -### 2.3 Cache-Line Optimized Packing - -Design for 64-byte (512-bit) cache lines on H100: - -```python -# Z₄ (2-bit): Perfect alignment -# 32 weights × 2 bits = 64 bits = 1 register -# 256 weights = 8 registers = 64 bytes = 1 cache line - -class Z4Quantizer: - @staticmethod - def pack_cache_line(weights: torch.Tensor) -> torch.Tensor: - """Pack 256 weights into 64 bytes (1 cache line)""" - assert weights.shape[-1] % 256 == 0 - # Quantize to {0, 1, 2, 3} - quantized = quantize_z4(weights) # → [batch, ..., n] - # Pack 4 symbols per byte (Gray encoded) - packed = pack_4_per_byte(quantized) # → [batch, ..., n//4] - return packed - -# Z₈ (4-bit): Perfect alignment -# 16 weights × 4 bits = 64 bits = 1 register -# 128 weights = 8 registers = 64 bytes = 1 cache line - -class Z8Quantizer: - @staticmethod - def pack_cache_line(weights: torch.Tensor) -> torch.Tensor: - """Pack 128 weights into 64 bytes (1 cache line)""" - assert weights.shape[-1] % 128 == 0 - quantized = quantize_z8(weights) # → {0..7} - packed = pack_2_per_byte(quantized) # → [batch, ..., n//2] - return packed - -# Z₁₆ (8-bit): Perfect alignment -# 8 weights × 8 bits = 64 bits = 1 register -# 64 weights = 8 registers = 64 bytes = 1 cache line - -class Z16Quantizer: - @staticmethod - def pack_cache_line(weights: torch.Tensor) -> torch.Tensor: - """Pack 64 weights into 64 bytes (1 cache line)""" - assert weights.shape[-1] % 64 == 0 - quantized = quantize_z16(weights) # → {0..15} - return quantized.to(torch.uint8) # 1 byte per weight -``` - ---- - -## 3. 
PyTorch Implementation Strategy - -### 3.1 Core Quantization Kernel - -```python -import torch -import torch.nn as nn -import torch.nn.functional as F - -class Q2Quantize(torch.autograd.Function): - """ - Q² quantization to Z₄ (2-bit) with Gray encoding - Optimized for H100 tensor cores - """ - - @staticmethod - def forward(ctx, weight: torch.Tensor, tau: float) -> torch.Tensor: - """ - Quantize weights to {0, 1, 2, 3} = {A, B, C, D} - - Args: - weight: Input weights (any shape) - tau: Equiprobable threshold - - Returns: - Quantized weights in range [0, 3] - """ - # Vectorized quantization (GPU-friendly) - sym = torch.zeros_like(weight, dtype=torch.long) - sym = torch.where(weight <= -tau, 0, sym) # A: strong negative - sym = torch.where((weight > -tau) & (weight <= 0), 1, sym) # B: weak negative - sym = torch.where((weight > 0) & (weight <= tau), 2, sym) # C: weak positive - sym = torch.where(weight > tau, 3, sym) # D: strong positive - - # Gray encode: g = sym ⊕ (sym >> 1) - gray = sym ^ (sym >> 1) - - # Dequantize for forward pass (use original symbols, not Gray code) - # Map {0, 1, 2, 3} → {-1.5τ, -0.5τ, +0.5τ, +1.5τ} - dequant_map = torch.tensor( - [-1.5, -0.5, 0.5, 1.5], - dtype=weight.dtype, - device=weight.device - ) - weight_q = dequant_map[sym] * tau - - ctx.save_for_backward(weight, weight_q, torch.tensor(tau)) - return weight_q - - @staticmethod - def backward(ctx, grad_output: torch.Tensor) -> tuple: - """Straight-through estimator (STE)""" - return grad_output, None - - -class Q2Linear(nn.Module): - """ - Linear layer with Q² quantization - Supports Z₄, Z₈, Z₁₂, Z₁₆ - """ - - def __init__( - self, - in_features: int, - out_features: int, - bias: bool = True, - z_ring: int = 4, # 4, 8, 12, or 16 - ): - super().__init__() - self.in_features = in_features - self.out_features = out_features - self.z_ring = z_ring - self.bits = {4: 2, 8: 4, 12: 6, 16: 8}[z_ring] - - # Full-precision weights (will be quantized during forward) - self.weight = 
nn.Parameter(torch.randn(out_features, in_features)) - if bias: - self.bias = nn.Parameter(torch.zeros(out_features)) - else: - self.register_parameter('bias', None) - - # Compute equiprobable threshold - self.register_buffer( - 'tau', - torch.tensor(0.6745 / (in_features ** 0.5)) - ) - - def forward(self, x: torch.Tensor) -> torch.Tensor: - # Quantize weights during training - if self.training: - weight_q = Q2Quantize.apply(self.weight, self.tau) - else: - # Use cached quantized weights during inference - weight_q = self.weight_quantized if hasattr(self, 'weight_quantized') else self.weight - - return F.linear(x, weight_q, self.bias) - - def finalize_quantization(self): - """Call before exporting model""" - with torch.no_grad(): - self.weight_quantized = Q2Quantize.apply(self.weight, self.tau) - # Can delete full-precision weights to save memory - del self.weight -``` - -### 3.2 Geode-Guided Progressive Training - -```python -def train_progressive( - model: nn.Module, - dataloader: DataLoader, - max_steps: int = 15000, -): - """ - Three-phase Geode-guided training: - Phase 1: All layers at Z₁₆ (8-bit) - learn coarse structure - Phase 2: Progressive quantization layer-by-layer - Phase 3: Fine-tune at target precision - """ - optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3) - - # Phase 1: Coarse learning (30% of steps) - phase1_steps = int(max_steps * 0.3) - for step in range(phase1_steps): - loss = train_step(model, dataloader, optimizer) - if step % 100 == 0: - print(f"Phase 1 [{step}/{phase1_steps}]: loss={loss:.4f}") - - # Phase 2: Progressive quantization (50% of steps) - phase2_steps = int(max_steps * 0.5) - layers_to_quantize = get_quantizable_layers(model) - - for layer_idx, layer in enumerate(layers_to_quantize): - # Quantize this layer - quantize_layer(layer, target_z_ring=get_target_z_ring(layer_idx)) - - # Fine-tune for a few steps - for step in range(phase2_steps // len(layers_to_quantize)): - loss = train_step(model, dataloader, optimizer) - 
- print(f"Quantized layer {layer_idx}, loss={loss:.4f}") - - # Phase 3: Final fine-tuning (20% of steps) - phase3_steps = max_steps - phase1_steps - phase2_steps - for step in range(phase3_steps): - loss = train_step(model, dataloader, optimizer) - if step % 100 == 0: - print(f"Phase 3 [{step}/{phase3_steps}]: loss={loss:.4f}") - - -def get_target_z_ring(layer_idx: int) -> int: - """Assign Z-ring based on layer depth (Geode hierarchy)""" - if layer_idx < 4: - return 8 # Z₈ (4-bit) for early layers - elif layer_idx < 8: - return 12 # Z₁₂ (6-bit) for middle layers - else: - return 4 # Z₄ (2-bit) for deep layers -``` - -### 3.3 H100 Optimization - -```python -# Use bfloat16 for non-quantized operations (H100 native) -torch.set_float32_matmul_precision('high') - -# Enable TF32 for matmul (H100 feature) -torch.backends.cuda.matmul.allow_tf32 = True - -# Compile model for faster execution (PyTorch 2.0+) -model = torch.compile(model, mode='max-autotune') - -# Use CUDA graphs for fixed-size batches -# (eliminates kernel launch overhead) -``` - ---- - -## 4. Why This Beats int5 Approaches - -### 4.1 Cache-Line Efficiency - -**int5 (current SOTA)**: -``` -5 bits/weight → 12.8 weights per 64-bit register -Cannot align to cache lines without padding -Memory access patterns irregular -``` - -**Q² Z₄ (our approach)**: -``` -2 bits/weight → 32 weights per 64-bit register -Perfect cache line alignment (256 weights = 64 bytes) -SIMD-friendly (AVX-512 processes 32 weights at once) -``` - -**Speedup estimate**: 1.3-1.5× faster matmul due to better memory access patterns. - -### 4.2 Mathematical Stability - -From feedback: p-adic numbers unstable at odd values. 
- -**int5**: No ring structure, arbitrary quantization levels -**Z₄, Z₈, Z₁₆**: Proper rings with well-defined arithmetic - -Example: Adding two Z₄ values: -```python -# Z₄ arithmetic (modulo 4) -a = 2 # C (weak positive) -b = 3 # D (strong positive) -c = (a + b) % 4 = 1 # B (weak negative after wrap) - -# This preserves cyclic structure -# Lee distance: d_L(c, a) = min(|1-2|, 4-|1-2|) = 1 -``` - -int5 has no such structure—quantization levels are ad-hoc. - -### 4.3 Geode Hierarchy - -The progression Z₄ → Z₈ → Z₁₆ matches natural hierarchy: - -- **DNA**: 4 bases (Z₄) -- **Codons**: 64 = 4³ combinations -- **Amino acids**: 20 encoded by codons - -Our architecture mirrors this: -- **Deep layers**: Z₄ (structural encoding, like DNA) -- **Middle layers**: Z₁₂ = 3×Z₄ (codon-like composition) -- **Shallow layers**: Z₁₆ (high-capacity, like protein structures) - ---- - -## 5. Revised Parameter Budget - -### 5.1 Target Configuration - -```python -vocab_size = 10240 # BigramHash -d_model = 512 # Wider than original (better for quantization) -n_layers = 12 -n_heads = 8 -mlp_ratio = 4 # Standard 4× expansion - -# Parameter counts by component: -embedding = vocab_size × d_model = 5.24M params -layer_1_4 (Z₈) = 4 × (4×d_model² + d_model×mlp_ratio) ≈ 4.2M params -layer_5_8 (Z₁₂) = 4 × (4×d_model² + d_model×mlp_ratio) ≈ 4.2M params -layer_9_12 (Z₄) = 4 × (4×d_model² + d_model×mlp_ratio) ≈ 4.2M params -output (tied) = 0 params - -Total = 17.84M params -``` - -### 5.2 Compressed Size - -```python -# Embedding (Z₁₆, 8-bit) -embedding_size = 5.24M × 1 byte = 5.24 MB - -# Layers 1-4 (Z₈, 4-bit) -layer_1_4_size = 4.2M × 0.5 bytes = 2.1 MB - -# Layers 5-8 (Z₁₂, 6-bit) -layer_5_8_size = 4.2M × 0.75 bytes = 3.15 MB - -# Layers 9-12 (Z₄, 2-bit) -layer_9_12_size = 4.2M × 0.25 bytes = 1.05 MB - -# Total before compression -subtotal = 11.54 MB - -# After zstd-22 compression (typical 0.8×) -final_size ≈ 9.2 MB - -# Headroom: 16 MB - 9.2 MB = 6.8 MB (42%) -``` - -**Conclusion**: Significant 
headroom allows for: -- Larger model (18M → 22M params) -- Higher precision for critical layers -- Additional optimization techniques - ---- - -## 6. Implementation Plan (Revised) - -### Week 1: PyTorch Q² Core - -**Day 1-2**: Implement Z₄, Z₈, Z₁₆ quantizers -```python -# Files to create: -q2_pytorch/ - quantizers.py # Z₄, Z₈, Z₁₂, Z₁₆ classes - layers.py # Q2Linear, Q2Embedding - utils.py # Packing, Gray encoding - test_quant.py # Unit tests -``` - -**Day 3-4**: Integrate with standard transformer -```python -# Fork parameter-golf repo -# Replace nn.Linear with Q2Linear -# Verify training works at Z₁₆ (8-bit baseline) -``` - -**Day 5-7**: Progressive quantization -```python -# Implement Geode-guided training loop -# Test phase transitions -# Profile GPU utilization -``` - -### Week 2: Optimization - -**Day 8-10**: Cache-line optimization -- Verify alignment with `torch.cuda.memory_stats()` -- Profile memory bandwidth utilization -- Optimize packing functions - -**Day 11-14**: Hyperparameter tuning -- Learning rate sweep -- Layer depth vs. width trade-offs -- Z-ring allocation optimization - -### Week 3-4: Competition Tuning - -**Day 15-21**: Leaderboard chasing -- Match SOTA (1.14 bpb) -- Beat SOTA (target: 1.10 bpb) -- Reproducibility testing (5+ runs) - -**Day 22-25**: Submission prep -- Package code -- Write documentation -- Submit PR - ---- - -## 7. Why This Will Win - -### 7.1 Novel Contributions - -1. **First use of Z_N ring hierarchy** in Parameter Golf -2. **Cache-line optimized quantization** (no current entry does this) -3. **Geode-guided progressive training** (mathematically grounded) -4. 
**Power-of-2 bit widths** (stability + speed) - -### 7.2 Expected Performance - -| Metric | Current SOTA | Our Target | Advantage | -|--------|--------------|------------|-----------| -| Bits/byte | 1.1428 | 1.10 | 0.04 bpb | -| Training time | ~9.5 min | ~8 min | 1.5 min faster | -| Compressed size | ~15 MB | ~9 MB | 6 MB headroom | -| Inference speed | baseline | 1.3× faster | Better packing | - -### 7.3 Risk Assessment - -**Technical risks**: LOW -- Standard transformer architecture (proven) -- PyTorch native (no exotic dependencies) -- Power-of-2 bit widths (hardware-friendly) -- Multiple fallback options - -**Timeline risk**: MEDIUM -- 3-4 weeks realistic -- Can submit incrementally (match SOTA first, then beat it) - -**Competition risk**: MEDIUM -- Leaderboard active but slowing down -- Our approach is novel enough to be hard to replicate quickly - ---- - -## 8. Comparison to Original Strategy - -| Aspect | Original | Revised | Why Changed | -|--------|----------|---------|-------------| -| Implementation | WASM + PyTorch | Pure PyTorch | GPUs available, WASM unnecessary | -| Quantization | int5/int6 mixed | Z₄/Z₈/Z₁₂/Z₁₆ | Power-of-2 stability, ring structure | -| Architecture | LTC blocks | Standard attention | Proven baseline first | -| Training | Flat → progressive | Geode-guided | Explicit hierarchical training | -| Optimization | Generic | Cache-line aware | H100 specific optimizations | - -**Key takeaway**: Original was overengineered. Revised is simpler, faster, and mathematically cleaner. - ---- - -## 9. Next Steps - -**Immediate** (today): -1. Create `q2_pytorch/` directory structure -2. Implement Z₄ quantizer with tests -3. Verify packing correctness - -**This week**: -1. Complete all quantizers (Z₄, Z₈, Z₁₂, Z₁₆) -2. Integrate with forked parameter-golf repo -3. Train baseline Z₁₆ model - -**Next week**: -1. Implement progressive training -2. Optimize for H100 -3. First submission attempt - ---- - -## 10. 
Acknowledgments - -This revised strategy incorporates critical feedback on: -- Avoiding WASM for GPU-accelerated training -- Using power-of-2 bit widths for stability -- Exploring Z_N ring hierarchy -- Cache-line optimization for performance - -The core Q² mathematical framework (Lee metric, Gray map, Geode factorization) remains valid. The implementation strategy is now properly aligned with the hardware constraints and mathematical structure of the problem. - ---- - -**Document Status**: Ready for implementation -**Last Updated**: 2026-03-21 -**Supersedes**: APPROACH_INITIAL.md diff --git a/docs/parameter-golf/DESIGN_REVISION_PLAN.md b/docs/parameter-golf/DESIGN_REVISION_PLAN.md deleted file mode 100644 index c514dec..0000000 --- a/docs/parameter-golf/DESIGN_REVISION_PLAN.md +++ /dev/null @@ -1,423 +0,0 @@ -# DESIGN.md Revision Plan - -> Section-by-section assessment: what is removed, revised, new, or unchanged -> when Q² generalizes beyond semantic embeddings and integrates Wildberger-Rubine. - ---- - -## Structural change - -The current document has three parts: - -1. **The Embedding Problem** (§1) — motivation from embeddings -2. **Quantization** (§2) — the Z₄ machinery -3. **The Transition Key** (§3) — indexing - -The revised document should have four parts: - -1. **The Quantization Problem** — general motivation (what quantization is, - why 4 cells, why L1). Embeddings become one application, not the frame. -2. **The Quaternary Coordinate** — the Z₄ machinery (Lee, Gray, complement). - Unchanged in substance. -3. **The Transition Key** — indexing. Mostly unchanged, gains hyper-Catalan - counting. -4. **The Combinatorial Structure** — NEW. Wildberger-Rubine integration: - Geode factorization, Euler constraint, threshold geometry, mixed-precision. - ---- - -## Section-by-section - -### Title - -| Current | Revised | -|:--------|:--------| -| "Quaternary Quantization: Design" | **Same** | - -No change needed — it's already general. 
- ---- - -### §1 header - -| Current | Revised | -|:--------|:--------| -| "The Embedding Problem" | **"The Quantization Problem"** | - -**REVISED.** The section should open with the general question — what is the -natural discrete coordinate system for a continuous signal? — not with -embeddings specifically. - ---- - -### §1.1 Embeddings and the unit hypersphere - -**REVISED → demoted.** The content is correct but application-specific. It -should move to a later section or appendix ("Application: Embeddings") or -become a motivating example within a broader §1.1 that introduces the general -setting: you have a point in $\mathbb{R}^n$ (or on $S^{n-1}$) and want to -discretize it. - -The mean-pooling digression ("Why not mean-pool?") is entirely -embedding-specific. **Demote or remove** — it's a design note for the embedding -application, not a general principle. - ---- - -### §1.2 Incommensurability - -**REVISED → demoted.** Incommensurability is about comparing two embedding -models. It's not a general quantization problem — it's a property of the -embedding application. The mathematical content ($Q \in O(n)$, rotational -invariance) is sound but belongs under "Application: Embeddings." - -The closing paragraph about relational geometry surviving rotation is the -important general insight. **Extract** the principle: quantization should capture -relational structure (differences, trajectories) rather than absolute -coordinates. That principle is general — it applies to any signal, not just -embeddings. - ---- - -### §1.3 Concentration of measure - -**SAME.** This is already general. Volume concentration, shell thickness, the -$1/\sqrt{n}$ noise floor — all apply to any high-dimensional quantization -problem. No changes needed. 
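The §1.3 claims referenced above — volume concentration and the $1/\sqrt{n}$ noise floor — are easy to sanity-check numerically. A minimal NumPy sketch (the dimension, sample count, and the Gaussian quartile constant $\Phi^{-1}(0.75) \approx 0.6745$ are illustrative assumptions, not values fixed by the design doc):

```python
import numpy as np

rng = np.random.default_rng(0)
n, samples = 1024, 2000

# Uniform points on S^(n-1): normalize isotropic Gaussian draws.
x = rng.standard_normal((samples, n))
x /= np.linalg.norm(x, axis=1, keepdims=True)

# Each coordinate concentrates at scale 1/sqrt(n) -- the noise floor of §1.3.
coord_std = x[:, 0].std()
assert abs(coord_std - 1 / np.sqrt(n)) < 0.2 / np.sqrt(n)

# Equiprobable Z4 threshold at this scale: tau* = Phi^{-1}(0.75) / sqrt(n).
# Roughly a quarter of all coordinates land in the strong-positive cell.
tau = 0.6745 / np.sqrt(n)
frac_strong_pos = (x > tau).mean()
assert abs(frac_strong_pos - 0.25) < 0.02
```

The same check generalizes to any $n$: the four cells stay near-equiprobable because the coordinate marginal of a uniform point on $S^{n-1}$ is approximately $\mathcal{N}(0, 1/n)$ for large $n$.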
- ---- - -### §1.4 The surface-area information bound - -**REVISED.** The content stays, but gains the Wildberger-Rubine connection: - -- **Add:** The Euler polytope formula $V - E + F = 1$ governs the combinatorial - capacity of any quantization lattice. The number of cells ($F$), boundaries - ($E$), and vertices ($V$) are constrained — you cannot design them - independently. This is the combinatorial analog of the Bekenstein-Hawking - surface-area bound already cited. (Ref: Wildberger & Rubine 2025, where the - hyper-Catalan coefficient $(E-1)!/((V-1)! \cdot \mathbf{m}!)$ is governed by - exactly this relation.) - -- **Revise** the sentence "The semantic content of a document is encoded in - which region of $S^{n-1}$ its embedding occupies" → generalize to "The - information content of a signal is encoded in which region of the space its - representation occupies." - ---- - -### §1.5 The L1 metric and the cross-polytope - -**SAME.** Already general. L1 decomposition, cross-polytope geometry, no -coupling between dimensions — this is framework-level mathematics. No changes. - ---- - -### §1.6 The quaternary coordinate - -**SAME.** Already general. The derivation of four cells from sign × magnitude -class does not depend on embeddings. The final paragraph about float32 -activations being "overcomplete representations of an ordinal position" could -be generalized slightly — replace "float32 activation" with "continuous -measurement" — but the logic is unchanged. - -Minor wording revision: "for L1 retrieval" → "for L1 distance computation." - ---- - -### §1.7 Index design consequences - -**REVISED → split.** - -The three consequences currently listed are: - -1. Coordinate-frame incommensurability — **embedding-specific, demote** -2. Thermal + geometric constraints coincide — **embedding-specific, demote** -3. 
The quantization problem restated — **general, keep and promote** - -Consequence 3 is the key statement: "What is the natural discrete coordinate -system of the L1 unit ball?" This should be the *opening* of §1, not a -consequence buried in §1.7. - ---- - -### §1.8 Relational geometry and the lingua franca - -**REVISED → demoted.** The king/queen analogy, the lingua franca argument, the -transition key as "indexing shape rather than position" — all embedding-specific -narrative. The underlying principle (quantize trajectories, not positions) is -general and should be extracted into the framework. The specific embedding -application (cross-model invariance, Procrustes avoidance) moves to -"Application: Embeddings." - ---- - -### §2 header: Quantization - -**SAME.** Already general. - ---- - -### §2.1 The thermal constraint - -**REVISED.** Currently frames the constraint as "an LLM is already running on a -phone." This is one instance of a general principle: **quantization operates -under a resource budget.** The thermal constraint is real and important, but the -general statement is: given a fixed compute/memory/energy budget, what is the -highest-fidelity discrete representation? - -Keep the thermal constraint as a concrete example. Add the general framing. - ---- - -### §2.2 Binary quantization - -**SAME.** Already general. Sign-bit quantization, 1 bit per dimension, Hamming -distance. No embedding-specific content. - ---- - -### §2.3 Ternary quantization - -**SAME.** Already general. The $\mathbb{Z}_3$ no-involution argument is pure -algebra. - ---- - -### §2.4 Standard Q4 (different problem) - -**REVISED.** Currently says "this is weight quantization, not activation -quantization for retrieval." In a general framework, the distinction is between: - -- **Reconstruction quantization** (GPTQ, AWQ): minimize $\|W - \hat{W}\|_F^2$. - The goal is to approximate the original signal. -- **Structural quantization** (Q²): preserve relational/topological structure. 
- The goal is to preserve distances, not values. - -The section should frame this as two different quantization objectives, not as -"different problem, not our concern." - ---- - -### §2.5 Quaternary semantic quantization - -**REVISED.** Title change: drop "semantic." - -| Current | Revised | -|:--------|:--------| -| "Quaternary semantic quantization" | **"Quaternary quantization"** | - -The four requirements (sign, magnitude class, complement structure, minimum -alphabet) are already general. No changes to the derivation. - -**Add (Wildberger-Rubine):** After the empirical calibration paragraph, add a -new paragraph on **analytical threshold computation for non-Gaussian sources:** - -> For source distributions expressible as polynomial or mixture models, the -> equiprobable threshold $\tau^*$ can be computed analytically via the -> hyper-Catalan series (Wildberger & Rubine 2025). The threshold equation -> $F(\tau) = k/4$ for CDF $F$ becomes a polynomial in the distribution -> parameters, and the series $\alpha = \sum_\mathbf{m} C_\mathbf{m} \cdot -> t_2^{m_2} t_3^{m_3} \cdots$ converges without iteration. Truncation order -> trades precision for compute cost — a natural fit for the resource-constrained -> setting of §2.1. - -This does not replace the empirical calibration — it adds a second path. - ---- - -### §2.6 The Lee metric - -**REVISED (minor).** The paragraph "Why this metric matches semantic distance" -uses "semantic distance" language throughout. Generalize: - -- "the concept was weakly activated" → "the coordinate was weakly committed" -- "semantic opposition" → "structural opposition" -- "Strong-negative and strong-positive share strong commitment to their - respective directions" — this is already general, keep as-is. - -The math, examples, and diagrams are unchanged. - ---- - -### §2.7 The Gray map - -**SAME.** Pure algebra: the $\phi$ isometry, Hammons et al. 1994, popcnt(XOR). -Nothing embedding-specific. No changes. 
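The §2.7 machinery is small enough to verify exhaustively. A sketch of the Gray map $\phi$ and the Lee↔Hamming isometry, using the standard Hammons et al. (1994) image table (the helper names are illustrative):

```python
# Gray map phi: Z4 -> Z2^2, per Hammons et al. (1994): 0->00, 1->01, 2->11, 3->10.
GRAY = {0: 0b00, 1: 0b01, 2: 0b11, 3: 0b10}

def lee_distance(a: int, b: int) -> int:
    """Lee distance on Z4: d_L(a, b) = min(|a - b|, 4 - |a - b|)."""
    d = abs(a - b)
    return min(d, 4 - d)

def hamming_distance(x: int, y: int) -> int:
    """Hamming distance in the binary domain: popcount(XOR)."""
    return bin(x ^ y).count("1")

# The Gray map is an isometry: Lee distance on Z4 equals Hamming distance
# on the 2-bit images, for every pair of symbols.
for a in range(4):
    for b in range(4):
        assert lee_distance(a, b) == hamming_distance(GRAY[a], GRAY[b])
```

This is why the framework can store 2-bit Gray images and still compute Lee distances with nothing but XOR and popcount.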
---

### §2.8 The complement involution

**SAME.** Pure algebra. No changes.

---

### §3 header: The Transition Key

**SAME.** Already general.

---

### §3.1 Run-reduction

**REVISED (minor).** Two small wording changes:

- "a document that visits a semantic state" → "a signal that visits a
  quantization state"
- "The key records which states were visited" — already general, keep.

The algorithm, lemmas, and proofs are unchanged.

---

### §3.2 The 64-bit integer key

**SAME.** Pure number theory. Base-4 encoding, 64-bit fit, bit layout. No
embedding-specific content.

---

### §3.3 MSB alignment and prefix semantics

**REVISED (minor).** "Semantic depth" → "resolution depth" or just "depth."
The term "semantic" here means "meaningful," which is general, but it will read
as embedding-specific in context.

---

### §3.4 Block file organisation

**SAME.** Pure index design. Block count, hex ranges, binary search. No changes.

---

### §3.5 Window queries

**REVISED (minor).** The table column "Semantic meaning" → "Meaning" or
"Interpretation." The content ("first 31 transitions match") is already general.

---

### §3.6 Bucket density

**REVISED.** Keep all existing content. **Add** the hyper-Catalan trie counting:

> **Admissible sequence count.** The transition trie has a root of arity $q = 4$
> and all subsequent nodes of arity $q - 1 = 3$ (each successor must differ from
> its predecessor). The number of distinct transition sequences of length $k$ is:
>
> $$D(k) = q \cdot (q-1)^{k-1} = 4 \cdot 3^{k-1}$$
>
> For $k = 32$ (the key capacity), $D(32) = 4 \cdot 3^{31} \approx
> 2.47 \times 10^{15}$, occupying $\approx 1.34 \times 10^{-4}$ of the $2^{64}$
> address space.
-> -> More generally, the number of distinct subtree patterns of mixed arity in the -> transition trie — relevant when branching factors vary under transition -> constraints — is counted by the hyper-Catalan number -> $C_\mathbf{m} = (E-1)!/((V-1)! \cdot \mathbf{m}!)$ where $V - E + F = 1$ -> (Wildberger & Rubine 2025). This provides exact combinatorial formulas for -> non-uniform bucket density when the source distribution favors certain -> branching patterns over others. - ---- - -### NEW §4: The Combinatorial Structure - -**NEW section.** This is where the bulk of the Wildberger-Rubine integration -lives. - -#### §4.1 The Geode factorization and hierarchical quantization - -The factorization $S - 1 = S_1 \cdot G$ as the algebraic skeleton of -progressive/multi-resolution quantization. First level ($S_1$) = coarse cell. -Geode ($G$) = refinement within. Recursive structure gives coarse-to-fine for -free. Connects to §3.3 (prefix semantics) and §3.5 (window queries). - -#### §4.2 Euler's polytope formula as a quantization constraint - -$V - E + F = 1$ constrains the topology of admissible quantization lattices. -For $\mathbb{Z}_4$: $V=4, E=4, F=1$. For product $\mathbb{Z}_4^n$: the face -structure is computable. When extending to $q > 4$, this gives an a priori -enumeration of admissible lattice geometries. - -#### §4.3 Mixed-precision quantization - -The Bi-Tri (and higher) hyper-Catalan arrays count the number of distinct -codebooks with $m_2$ binary dimensions, $m_3$ ternary, $m_4$ quaternary. This -bounds the search space for adaptive bit allocation: spend 2 bits on -high-variance dimensions, 1 bit on low-variance ones. - -#### §4.4 Threshold geometry - -For non-Gaussian distributions (mixtures, heavy-tailed), the equiprobable -threshold equation $F(\tau) = k/q$ is a polynomial in the distribution -parameters. The hyper-Catalan series solves it combinatorially. Truncation -order maps to precision/cost tradeoff. 
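The counting identities above — §3.6's admissible-sequence formula in particular — are mechanically checkable. A short verification sketch (not part of the design doc's code): brute-force the count for small $k$, then evaluate the closed form at the key capacity $k = 32$:

```python
from itertools import product

def admissible_count_bruteforce(k: int, q: int = 4) -> int:
    """Count length-k sequences over Z_q with no two equal consecutive symbols."""
    return sum(
        1
        for seq in product(range(q), repeat=k)
        if all(a != b for a, b in zip(seq, seq[1:]))
    )

def admissible_count_closed(k: int, q: int = 4) -> int:
    """Closed form D(k) = q * (q-1)^(k-1): free root, then q-1 choices per step."""
    return q * (q - 1) ** (k - 1)

# The closed form matches exhaustive enumeration for small k.
for k in range(1, 7):
    assert admissible_count_bruteforce(k) == admissible_count_closed(k)

# At the 64-bit key capacity, k = 32:
D32 = admissible_count_closed(32)   # 4 * 3**31 = 2,470,693,585,135,788
fraction = D32 / 2**64              # occupied share of the address space
assert D32 == 4 * 3**31
assert 1.3e-4 < fraction < 1.4e-4
```

The exact value $D(32) = 4 \cdot 3^{31} \approx 2.47 \times 10^{15}$ occupies about $1.34 \times 10^{-4}$ of the $2^{64}$ address space.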
- -#### §4.5 Reconstruction and series reversion - -If Q² ever needs a decode path (lossy compression, not just retrieval), the -optimal reconstruction point $\mathbb{E}[x \mid q(x) = s]$ requires inverting -the CDF within each cell. The Lagrange-inversion / hyper-Catalan series -provides this without numerical root-finding. - ---- - -### NEW: Application section (or appendix) - -**Application: Embeddings.** Collects the embedding-specific material currently -scattered through §1: - -- §1.1 (hypersphere, mean-pooling) -- §1.2 (incommensurability, Procrustes) -- §1.7 (thermal + geometric coincidence, coordinate-frame constraint) -- §1.8 (relational geometry, king/queen, lingua franca) -- §2.1 thermal constraint as concrete instance - -This is not deleted — it's relocated. The embedding application is the primary -use case and should be presented prominently, but as an application of the -general framework rather than as the frame itself. - ---- - -## Summary table - -| Section | Status | Nature of change | -|:--------|:------:|:-----------------| -| Title | **Same** | — | -| §1 header | **Revised** | "Embedding Problem" → "Quantization Problem" | -| §1.1 | **Demoted** | Moves to Application: Embeddings | -| §1.2 | **Demoted** | Moves to Application: Embeddings (principle extracted) | -| §1.3 | **Same** | Already general | -| §1.4 | **Revised** | Gains Euler V-E+F connection; wording generalized | -| §1.5 | **Same** | Already general | -| §1.6 | **Same** | Already general (minor wording) | -| §1.7 | **Split** | Item 3 promoted to §1 opener; items 1-2 demoted | -| §1.8 | **Demoted** | Moves to Application: Embeddings (principle extracted) | -| §2.1 | **Revised** | Generalized framing; thermal example kept | -| §2.2 | **Same** | Already general | -| §2.3 | **Same** | Already general | -| §2.4 | **Revised** | Reframed as reconstruction vs. 
structural quantization |
| §2.5 | **Revised** | Drop "semantic" from title; add analytical thresholds |
| §2.6 | **Revised** | Minor wording: "semantic distance" → "structural distance" |
| §2.7 | **Same** | Pure algebra |
| §2.8 | **Same** | Pure algebra |
| §3.1 | **Revised** | Minor wording: "semantic state" → "quantization state" |
| §3.2 | **Same** | Pure number theory |
| §3.3 | **Revised** | Minor: "semantic depth" → "resolution depth" |
| §3.4 | **Same** | Pure index design |
| §3.5 | **Revised** | Minor: column header generalized |
| §3.6 | **Revised** | Gains hyper-Catalan trie counting |
| §4 | **New** | Combinatorial Structure (Wildberger-Rubine) |
| §4.1 | **New** | Geode factorization → hierarchical quantization |
| §4.2 | **New** | Euler V-E+F as lattice constraint |
| §4.3 | **New** | Mixed-precision counting |
| §4.4 | **New** | Threshold geometry for non-Gaussian |
| §4.5 | **New** | Reconstruction via series reversion |
| App. | **New** | Application: Embeddings (relocated from §1) |

**Count:** 10 Same, 10 Revised, 3 Demoted, 1 Split, 7 New, 0 Removed.

Nothing is deleted. Content moves; framing changes; new theory is added.
diff --git a/docs/parameter-golf/IMPLEMENTATION.md b/docs/parameter-golf/IMPLEMENTATION.md
deleted file mode 100644
index ea04923..0000000
--- a/docs/parameter-golf/IMPLEMENTATION.md
+++ /dev/null
@@ -1,991 +0,0 @@
# Parameter Golf: Implementation Roadmap

> **Status**: Ready for implementation
> **Related**: [APPROACH_INITIAL.md](APPROACH_INITIAL.md)

This document provides tactical implementation details for the Q² Parameter Golf strategy.
- ---- - -## Quick Reference - -### Key Numbers - -- **Target score**: <1.10 bits/byte (current SOTA: 1.1428) -- **Parameter budget**: 16MB = 16,000,000 bytes -- **Training time**: 10 minutes on 8×H100 SXM -- **Effective parameters at int5**: ~25M params - -### Architecture Summary - -``` -Input (BigramHash 10240) - ↓ -Embedding (384 dim, int6) [3.9M params, 2.9MB] - ↓ -12× LTC Blocks (384 dim, MLP 3×, int5) [15M params, 9.4MB] - ↓ -Output (tied with Embedding, int6) [0 params, 0MB] - ↓ -Softmax -``` - -**Total**: ~19M params, ~12.3MB compressed, 3.7MB headroom - ---- - -## Phase 1: Foundation (Days 1-3) - -### Day 1: Environment Setup - -**Goal**: Get Parameter Golf baseline running - -```bash -# Clone and setup -git clone https://github.com/openai/parameter-golf.git -cd parameter-golf -python3 -m venv .venv -source .venv/bin/activate -pip install torch numpy sentencepiece huggingface-hub datasets tqdm - -# Download data (10 shards for quick iteration) -python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10 - -# Verify baseline runs -RUN_ID=baseline_test \ -DATA_PATH=./data/datasets/fineweb10B_sp1024/ \ -TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \ -VOCAB_SIZE=1024 \ -MAX_WALLCLOCK_SECONDS=60 \ -torchrun --standalone --nproc_per_node=1 train_gpt.py -``` - -**Success criteria**: - -- Baseline trains successfully -- Understand `train_gpt.py` structure -- Confirm data loading pipeline - -### Day 2: Q² Integration - Weight Quantization - -**Goal**: Extend Q² kernel for weight quantization - -**Current state**: `src/q2.wat` handles activation quantization (inference-time) - -**Needed**: Weight quantization (training-time) - -**Implementation**: - -1. 
**Add weight quantization mode to q2.ts**: - -```typescript -// src/q2.ts (new function) -export function q2QuantizeWeights( - weights: Float32Array, - dim: number, - precision: 5 | 6 | 8 = 6 -): Uint8Array { - // Compute per-tensor or per-channel thresholds - const tau = computeEquiprobableThreshold(weights, dim); - - // Quantize to {A, B, C, D} = {0, 1, 2, 3} - const quantized = new Uint8Array(Math.ceil(weights.length * precision / 8)); - - for (let i = 0; i < weights.length; i++) { - const w = weights[i]; - let sym: number; - - if (w <= -tau) sym = 0; // A: strong negative - else if (w <= 0) sym = 1; // B: weak negative - else if (w <= tau) sym = 2; // C: weak positive - else sym = 3; // D: strong positive - - // Gray encode - const gray = sym ^ (sym >> 1); - - // Pack based on precision - packSymbol(quantized, i, gray, precision); - } - - return quantized; -} -``` - -2. **Add analytical threshold computation** (§D-4.4): - -```typescript -function computeEquiprobableThreshold( - weights: Float32Array, - dim: number -): number { - // For Gaussian approximation: τ* = Φ^(-1)(3/4) / √n - const invCDF_75 = 0.6745; // Φ^(-1)(0.75) - const n = dim; - return invCDF_75 / Math.sqrt(n); -} - -// For non-Gaussian: use empirical quartiles -function computeEmpiricalThreshold(weights: Float32Array): number { - const sorted = Float32Array.from(weights).sort(); - const q25 = sorted[Math.floor(sorted.length * 0.25)]; - const q75 = sorted[Math.floor(sorted.length * 0.75)]; - return (Math.abs(q25) + Math.abs(q75)) / 2; -} -``` - -3. 
**Test on toy model**:

```typescript
// test/weight-quantization.test.ts
import { q2QuantizeWeights } from '../src/q2';

test('weight quantization preserves distribution', () => {
  const weights = new Float32Array(1024);
  // Generate Gaussian N(0, 1/32)
  for (let i = 0; i < weights.length; i++) {
    weights[i] = gaussianRandom(0, 1 / 32);
  }

  const quantized = q2QuantizeWeights(weights, 32, 6);

  // Verify: ~25% of the 1024 weights in each of {A, B, C, D}
  const counts = countSymbols(quantized, 6);
  for (const sym of ['A', 'B', 'C', 'D'] as const) {
    expect(Math.abs(counts[sym] - 256)).toBeLessThanOrEqual(30); // ±30 for sampling variance
  }
});
```

**Success criteria**:

- Weight quantization function works
- Equiprobable distribution verified
- Compression ratio: ≈1.33 weights/byte (int6, 8/6) or 1.6 weights/byte (int5, 8/5)

### Day 3: PyTorch QAT Integration

**Goal**: Hook Q² quantization into PyTorch training

**Implementation**:

```python
# q2_pytorch/quantize.py
import torch
import torch.nn as nn

class Q2QuantizeWeight(torch.autograd.Function):
    """Quantization-Aware Training for Q² weights"""

    @staticmethod
    def forward(ctx, weight, tau, precision=6):
        # Quantize to {0, 1, 2, 3}
        sym = torch.zeros_like(weight, dtype=torch.long)
        sym[weight <= -tau] = 0                    # A: strong negative
        sym[(weight > -tau) & (weight <= 0)] = 1   # B: weak negative
        sym[(weight > 0) & (weight <= tau)] = 2    # C: weak positive
        sym[weight > tau] = 3                      # D: strong positive

        # Dequantize to {-1.5, -0.5, +0.5, +1.5} * tau (centered cell midpoints)
        dequant_table = torch.tensor(
            [-1.5, -0.5, 0.5, 1.5], dtype=weight.dtype, device=weight.device
        )
        weight_q = dequant_table[sym] * tau

        ctx.save_for_backward(weight, weight_q, tau)
        return weight_q

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator (STE)
        weight, weight_q, tau = ctx.saved_tensors
        grad_weight = grad_output.clone()

        # Gradient clipping at quantization boundaries
        # (optional refinement)
        return grad_weight, None, None


class Q2Linear(nn.Linear):
    """Linear layer with Q² quantization"""

    def __init__(self, in_features, out_features, bias=True, precision=6):
        super().__init__(in_features, out_features, bias)
        self.precision = precision
        self.register_buffer('tau', torch.tensor(0.6745 / (in_features ** 0.5)))

    def forward(self, x):
        # Quantize weights on-the-fly during training
        if self.training:
            weight_q = Q2QuantizeWeight.apply(self.weight, self.tau, self.precision)
        else:
            # Use pre-quantized weights if finalized; otherwise fall back to
            # on-the-fly quantization (quantize_weights may not have run yet)
            weight_q = getattr(self, 'weight_quantized', None)
            if weight_q is None:
                weight_q = Q2QuantizeWeight.apply(self.weight, self.tau, self.precision)

        return nn.functional.linear(x, weight_q, self.bias)

    def quantize_weights(self):
        """Finalize quantization (call before export)"""
        with torch.no_grad():
            self.weight_quantized = Q2QuantizeWeight.apply(
                self.weight, self.tau, self.precision
            )
```

**Test**:

```python
# Test QAT on a 2-layer MLP
model = nn.Sequential(
    Q2Linear(768, 3072, precision=6),
    nn.GELU(),
    Q2Linear(3072, 768, precision=6),
)

# Train for a few steps
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    x = torch.randn(32, 768)
    y = model(x)
    loss = y.pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Quantize for export
for layer in model:
    if isinstance(layer, Q2Linear):
        layer.quantize_weights()
```

**Success criteria**:

- QAT trains without errors
- Loss converges (even on random data)
- Weight quantization called successfully

---

## Phase 2: Core Architecture (Days 4-10)

### Day 4-5: LTC Block Implementation

**Goal**: Implement Closed-form Continuous-time (CfC) cells

**Background**: LTC/CfC from Hasani et al. (2021, 2023)

**Key equations**:

$$\frac{dx}{dt} = f(x, u, \theta)$$

where $f$ is a learned vector field.
CfC uses closed-form solution: - -$$x_{t+1} = x_t + \int_t^{t+1} f(x(\tau), u, \theta) d\tau$$ - -**Implementation** (adapted from `ncps` library): - -```python -# q2_pytorch/cfc.py -import torch -import torch.nn as nn - -class CfCCell(nn.Module): - """Closed-form Continuous-time Cell (Hasani et al. 2023)""" - - def __init__(self, input_size, hidden_size, sparsity=0.5): - super().__init__() - self.input_size = input_size - self.hidden_size = hidden_size - - # Wiring: sparse connectivity (NCP-style) - # Only connect sparsity% of neuron pairs - self.register_buffer( - 'wiring_mask', - self._create_sparse_mask(hidden_size, sparsity) - ) - - # ODE parameters: f(x) = -a*x + b*σ(Wx + Uu + bias) - self.w_tau = nn.Parameter(torch.randn(hidden_size)) # Time constants - self.w_in = nn.Linear(input_size, hidden_size) - self.w_rec = nn.Linear(hidden_size, hidden_size, bias=False) - self.w_out = nn.Linear(hidden_size, hidden_size) - - # Activation - self.activation = nn.Tanh() - - # Apply sparsity mask to recurrent weights - self.w_rec.weight.data *= self.wiring_mask - - def _create_sparse_mask(self, size, sparsity): - """Create sparse connectivity mask""" - mask = torch.rand(size, size) < sparsity - mask.fill_diagonal_(True) # Always self-connect - return mask.float() - - def forward(self, x, h): - """ - Args: - x: input (batch, input_size) - h: hidden state (batch, hidden_size) - Returns: - h_next: next hidden state (batch, hidden_size) - """ - # Compute ODE right-hand side - u_in = self.w_in(x) - # Apply sparsity mask to recurrent weight matrix, then compute recurrent input - u_rec = torch.matmul(h, (self.w_rec.weight * self.wiring_mask).t()) - u = self.activation(u_in + u_rec) - - # Closed-form solution: Euler integration with learned time constants - tau = torch.sigmoid(self.w_tau) # Time constants in (0, 1) - h_next = (1 - tau) * h + tau * u - - return h_next - - -class LTCBlock(nn.Module): - """LTC block replacing transformer attention""" - - def __init__(self, dim, 
mlp_ratio=3.0, sparsity=0.5):
-        super().__init__()
-        self.dim = dim
-
-        # CfC cell
-        self.cfc = CfCCell(dim, dim, sparsity=sparsity)
-
-        # MLP (following leaderboard: 3× expansion)
-        mlp_hidden = int(dim * mlp_ratio)
-        self.mlp = nn.Sequential(
-            nn.Linear(dim, mlp_hidden),
-            SmearGeLU(),  # Leaderboard-proven activation
-            nn.Linear(mlp_hidden, dim),
-        )
-
-        # Layer norms
-        self.ln1 = nn.LayerNorm(dim)
-        self.ln2 = nn.LayerNorm(dim)
-
-    def forward(self, x, h=None):
-        """
-        Args:
-            x: input (batch, seq_len, dim)
-            h: optional hidden state from previous sequence
-        Returns:
-            output: (batch, seq_len, dim)
-            h_last: final hidden state for next sequence
-        """
-        batch, seq_len, dim = x.shape
-
-        # Initialize hidden state
-        if h is None:
-            h = torch.zeros(batch, dim, device=x.device)
-
-        # Process sequence with CfC
-        outputs = []
-        for t in range(seq_len):
-            x_t = x[:, t, :]
-            h = self.cfc(self.ln1(x_t), h)
-            outputs.append(h)
-
-        # Residual connection around the CfC branch
-        x = x + torch.stack(outputs, dim=1)  # (batch, seq_len, dim)
-
-        # MLP with residual
-        x = x + self.mlp(self.ln2(x))
-
-        return x, h
-
-
-class SmearGeLU(nn.Module):
-    """SmearGate activation (from leaderboard)"""
-    def forward(self, x):
-        # Sigmoid-gated approximation of GELU
-        return x * torch.sigmoid(1.702 * x)
-```
-
-**Test**:
-
-```python
-# Test LTC block
-block = LTCBlock(dim=384, mlp_ratio=3.0)
-
-x = torch.randn(32, 128, 384)  # batch=32, seq=128, dim=384
-output, h = block(x)
-
-assert output.shape == (32, 128, 384)
-assert h.shape == (32, 384)
-
-# Test sequential processing
-output2, h2 = block(x, h)  # Continue from previous state
-```
-
-**Success criteria**:
-
-- LTC block processes sequences
-- Hidden state propagates correctly
-- Gradients flow (test with dummy loss)
-
-### Day 6-7: Full Model Architecture
-
-**Goal**: Assemble full Q²-LTC model
-
-```python
-# q2_pytorch/model.py
-import torch
-import torch.nn as nn
-from .cfc import LTCBlock
-from .quantize import Q2Linear
-
-class 
Q2ParameterGolfModel(nn.Module):
-    """Q² model for Parameter Golf challenge"""
-
-    def __init__(
-        self,
-        vocab_size=10240,  # BigramHash
-        dim=384,
-        n_layers=12,
-        mlp_ratio=3.0,
-        cfc_sparsity=0.5,
-        precision=6,  # int6 default, can mix with int5
-    ):
-        super().__init__()
-        self.vocab_size = vocab_size
-        self.dim = dim
-        self.n_layers = n_layers
-
-        # Embedding (int6 as proven on leaderboard)
-        self.embed = nn.Embedding(vocab_size, dim)
-
-        # LTC blocks
-        self.blocks = nn.ModuleList([
-            LTCBlock(dim, mlp_ratio, cfc_sparsity)
-            for _ in range(n_layers)
-        ])
-
-        # Final layer norm
-        self.ln_f = nn.LayerNorm(dim)
-
-        # Output projection, weight-tied with the embedding; tying here
-        # (rather than lazily in forward) keeps the module on the right
-        # device and visible to the optimizer
-        self.output = nn.Linear(dim, vocab_size, bias=False)
-        self.output.weight = self.embed.weight
-
-    def forward(self, idx, hidden_states=None):
-        """
-        Args:
-            idx: input token indices (batch, seq_len)
-            hidden_states: optional list of hidden states from previous batch
-        Returns:
-            logits: (batch, seq_len, vocab_size)
-            new_hidden_states: list of hidden states for next batch
-        """
-        x = self.embed(idx)  # (batch, seq_len, dim)
-
-        if hidden_states is None:
-            hidden_states = [None] * self.n_layers
-
-        new_hidden_states = []
-        for i, block in enumerate(self.blocks):
-            x, h = block(x, hidden_states[i])
-            new_hidden_states.append(h)
-
-        x = self.ln_f(x)
-
-        # Output projection (tied)
-        logits = self.output(x)
-
-        return logits, new_hidden_states
-
-    def quantize_model(self, precision_map=None):
-        """
-        Quantize all weights for export
-
-        Args:
-            precision_map: dict mapping layer name to precision (5, 6, or 8)
-                If None, use int6 for all
-        """
-        if precision_map is None:
-            precision_map = {}
-
-        for name, module in self.named_modules():
-            if isinstance(module, nn.Linear):
-                # Determine precision for this layer
-                precision = precision_map.get(name, 6)
-
-                # Quantize weights
-                # (In practice, replace nn.Linear with Q2Linear)
-                pass  # TODO: implement weight replacement
-
-    def count_parameters(self):
-        """Count total parameters"""
-        return sum(p.numel() for p in self.parameters())
-
-    def estimate_compressed_size(self, precision_map=None):
-        """
-        Estimate compressed size in bytes
-
-        Args:
-            precision_map: dict mapping layer types to average precision (bits)
-        """
-        if precision_map is None:
-            precision_map = {'embed': 6, 'ltc': 5, 'mlp': 5, 'output': 0}
-
-        total_bytes = 0
-
-        # Embedding
-        embed_params = self.vocab_size * self.dim
-        total_bytes += embed_params * precision_map.get('embed', 6) / 8
-
-        # LTC blocks (conservatively priced at the 'ltc' precision)
-        ltc_params = 0
-        for block in self.blocks:
-            ltc_params += sum(p.numel() for p in block.parameters())
-
-        total_bytes += ltc_params * precision_map.get('ltc', 5) / 8
-
-        # Output (tied, no extra params)
-
-        # Add zstd compression factor (typical: 0.8×)
-        total_bytes *= 0.8
-
-        return total_bytes
-```
-
-**Test**:
-
-```python
-# Test full model
-model = Q2ParameterGolfModel(
-    vocab_size=10240,
-    dim=384,
-    n_layers=12,
-    mlp_ratio=3.0,
-)
-
-print(f"Total parameters: {model.count_parameters():,}")
-print(f"Estimated size: {model.estimate_compressed_size() / 1e6:.2f} MB")
-
-# Forward pass
-idx = torch.randint(0, 10240, (32, 128))
-logits, hidden = model(idx)
-
-assert logits.shape == (32, 128, 10240)
-```
-
-**Success criteria**:
-
-- Model initializes
-- Parameters < 25M
-- Estimated compressed size < 13MB (headroom for optimization)
-- Forward pass works
-
-### Day 8-10: Training Loop Integration
-
-**Goal**: Integrate Q² model into Parameter Golf training script
-
-**Create** `q2_train_gpt.py` (fork of `train_gpt.py`):
-
-```python
-# Key modifications to train_gpt.py:
-
-# 1. Replace GPT model with Q2ParameterGolfModel
-from q2_pytorch.model import Q2ParameterGolfModel
-
-model = Q2ParameterGolfModel(
-    vocab_size=args.vocab_size,
-    dim=384,
-    n_layers=12,
-    mlp_ratio=3.0,
-).to(device)
-
-# 2. 
Use Muon optimizer (proven on leaderboard)
-from muon import Muon  # Custom optimizer
-
-optimizer = Muon(
-    model.parameters(),
-    lr=0.01,
-    momentum=0.99,
-    weight_decay=0.04,
-)
-
-# 3. Three-phase training schedule
-def get_precision_schedule(step, total_steps):
-    """
-    Phase 1 (0-30%): int8 (coarse learning)
-    Phase 2 (30-80%): int6/int5 mixed (refinement)
-    Phase 3 (80-100%): int5 (final)
-    """
-    if step < total_steps * 0.3:
-        return 8
-    elif step < total_steps * 0.8:
-        return 6
-    else:
-        return 5
-
-# 4. Geode-guided progressive quantization
-def quantize_progressively(model, current_layer, precision):
-    """Quantize layers 0..current_layer, leave rest at higher precision"""
-    for i, block in enumerate(model.blocks):
-        if i <= current_layer:
-            # Quantize this block
-            for module in block.modules():
-                if isinstance(module, nn.Linear):
-                    # Apply Q² quantization
-                    pass
-
-# 5. Sliding window evaluation (proven on leaderboard)
-import torch.nn.functional as F
-
-def evaluate_sliding_window(model, data, window=2048, stride=64):
-    """Evaluate with a sliding window for longer effective context.
-
-    Each window sees the full context, but only the final `stride`
-    positions are scored, so overlapping tokens are not double-counted.
-    """
-    total_loss = 0
-    total_tokens = 0
-
-    for i in range(0, len(data) - window, stride):
-        chunk = data[i:i+window].unsqueeze(0)  # add batch dimension
-        logits, _ = model(chunk[:, :-1])
-        loss = F.cross_entropy(
-            logits[:, -stride:].reshape(-1, logits.size(-1)),
-            chunk[:, 1:][:, -stride:].reshape(-1),
-        )
-        total_loss += loss.item() * stride
-        total_tokens += stride
-
-    return total_loss / total_tokens
-```
-
-**Success criteria**:
-
-- Training starts successfully
-- GPU memory fits on H100 (80GB)
-- Training completes in ~10 minutes
-- Model compresses to <16MB
-
----
-
-## Phase 3: Optimization (Days 11-20)
-
-### Day 11-13: Hyperparameter Tuning
-
-**Key hyperparameters to tune**:
-
-1. **Model architecture**:
-   - `dim`: 320, 384, 448
-   - `n_layers`: 10, 12, 14
-   - `mlp_ratio`: 2.5, 3.0, 3.5
-   - `cfc_sparsity`: 0.4, 0.5, 0.6
-
-2. 
**Training**: - - `lr`: 0.008, 0.01, 0.012 - - `weight_decay`: 0.03, 0.04, 0.05 - - `batch_size`: Varies based on context length - - `seq_length`: 2048, 3072, 4096 - -3. **Quantization**: - - `int5_ratio`: 0.6, 0.7, 0.8 (% of layers at int5) - - `embed_precision`: 6, 7 - - `threshold_method`: 'analytical', 'empirical', 'learned' - -**Tuning strategy**: - -```python -# Grid search (if time permits) -configs = [ - {'dim': 384, 'n_layers': 12, 'mlp_ratio': 3.0}, - {'dim': 320, 'n_layers': 14, 'mlp_ratio': 3.0}, - {'dim': 448, 'n_layers': 10, 'mlp_ratio': 2.5}, -] - -results = [] -for config in configs: - # Train for 10 min - score = train_and_evaluate(**config) - results.append((config, score)) - -# Pick best -best_config = min(results, key=lambda x: x[1]) -``` - -### Day 14-16: Ablation Studies - -**Questions to answer**: - -1. **Does LTC beat attention?** - - Train Q²-Attention baseline - - Compare to Q²-LTC at same parameter count - - Expected: LTC wins by 0.01-0.02 bpb - -2. **Does structural quantization beat reconstruction?** - - Implement GPTQ-style reconstruction quantization - - Compare to Q² structural quantization - - Expected: Q² wins by 0.02-0.03 bpb - -3. **Does Geode-guided training help?** - - Train flat (standard end-to-end) - - Train with progressive quantization - - Expected: Progressive wins by 0.01 bpb - -4. **What's the optimal int5/int6 ratio?** - - Try 50%, 70%, 90% int5 - - Expected: 70% optimal (balance compression & quality) - -### Day 17-20: Final Optimization - -**Last mile improvements**: - -1. **Gradient accumulation tuning** - - Find optimal batch size vs. # steps trade-off - - Larger batches → fewer steps but better gradients - -2. **Learning rate warmup/cooldown** - - Tune warmup duration - - Tune final LR (for SWA) - -3. **Tokenizer optimization** - - BigramHash parameters - - Character vs. byte-level encoding - -4. 
**Compression pipeline**
-   - Test zstd compression levels (20, 21, 22)
-   - Ensure final artifact < 16MB
-
-**Target by end of Phase 3**: **1.10-1.12 bpb** consistently
-
----
-
-## Phase 4: Submission (Days 21-25)
-
-### Day 21-23: Final Runs
-
-**Reproducibility testing**:
-
-```bash
-# Run 5 times with different seeds
-for seed in 1 2 3 4 5; do
-    SEED=$seed \
-    RUN_ID=q2_ltc_final_seed${seed} \
-    torchrun --standalone --nproc_per_node=8 q2_train_gpt.py
-done
-
-# Compute mean and std
-python analyze_runs.py
-```
-
-**Statistical significance**:
-
-- Need to beat 1.1428 by ≥0.005 bpb (challenge requirement)
-- Target: 1.10 mean, σ < 0.005
-- p < 0.01 via t-test
-
-### Day 24: Prepare Submission
-
-**Required files** (per Parameter Golf guidelines):
-
-1. **README.md**:
-   ````markdown
-   # Q² + Liquid Time Constants for Parameter Golf
-
-   ## Summary
-   This submission combines Q² structural quantization with Liquid Time
-   Constant (LTC) blocks to achieve state-of-the-art compression on
-   FineWeb validation.
-
-   **Score**: 1.10 bpb (avg over 5 runs, σ=0.004)
-
-   ## Key Innovations
-   1. Structural quantization (Q²) instead of reconstruction quantization
-   2. LTC blocks instead of attention (linear complexity)
-   3. Geode-guided progressive quantization
-   4. Mixed int5/int6 precision guided by hyper-Catalan framework
-
-   ## Reproducibility
-   ```bash
-   # Clone repo
-   git clone https://github.com//parameter-golf-q2
-   cd parameter-golf-q2
-
-   # Setup
-   pip install -r requirements.txt
-   python3 data/cached_challenge_fineweb.py --variant sp1024
-
-   # Train
-   torchrun --standalone --nproc_per_node=8 q2_train_gpt.py
-   ```
-
-   ## Authors
-   Q² Project Team
-   ````
-
-2. 
**submission.json**: - ```json - { - "name": "Q² Project", - "github_id": "devlux76", - "val_bpb": 1.10, - "runs": [1.098, 1.102, 1.099, 1.101, 1.100], - "date": "2026-04-15", - "description": "Q² structural quantization + LTC blocks", - "innovations": [ - "Structural quantization (Lee metric on Z4)", - "Liquid Time Constant blocks (linear complexity)", - "Geode-guided progressive quantization", - "Hyper-Catalan mixed-precision allocation" - ] - } - ``` - -3. **Train logs** (5 runs) - -4. **q2_train_gpt.py** (full script) - -5. **requirements.txt**: - ``` - torch>=2.0.0 - numpy - sentencepiece - ``` - -### Day 25: Submit - -**Submission process**: - -1. Create folder: `records/track_10min_16mb/2026-04-15_Q2_LTC_Structural_Quant/` - -2. Add all required files - -3. Create PR to `openai/parameter-golf`: - ```bash - git checkout -b q2-ltc-submission - git add records/track_10min_16mb/2026-04-15_Q2_LTC_Structural_Quant/ - git commit -m "Q² + LTC: 1.10 bpb via structural quantization" - git push origin q2-ltc-submission - ``` - -4. Create PR on GitHub - -5. 
Monitor for CI verification - ---- - -## Risk Mitigation Checklist - -### Before Day 10 (Architecture Lock) - -- [ ] Q² weight quantization works -- [ ] LTC blocks train successfully -- [ ] Full model fits in memory -- [ ] Estimated size < 14MB (2MB headroom) - -**If blocked**: Fall back to standard attention, Q² still competitive - -### Before Day 15 (Optimization Lock) - -- [ ] Training completes in < 10 min -- [ ] Score < 1.15 bpb (competitive) -- [ ] Ablations show each component helps - -**If blocked**: Simplify architecture (fewer layers or narrower) - -### Before Day 20 (Submission Lock) - -- [ ] Score < 1.12 bpb (beats SOTA) -- [ ] Reproducible (3+ runs within 0.01 bpb) -- [ ] Submission files ready - -**If blocked**: Submit as "non-record" with detailed analysis of approach - ---- - -## Success Metrics by Phase - -| Phase | End Date | Metric | Target | Stretch | -|-------|----------|--------|--------|---------| -| 1. Foundation | Day 3 | Code works | ✓ Pass tests | Early finish | -| 2. Core Arch | Day 10 | Trains | ✓ < 10 min | < 8 min | -| 3. Optimization | Day 20 | Score | < 1.12 bpb | < 1.10 bpb | -| 4. 
Submission | Day 25 | PR | Submitted | Accepted | - ---- - -## Daily Standup Template - -**What I did yesterday**: - -- [Completed tasks] - -**What I'm doing today**: - -- [Planned tasks] - -**Blockers**: - -- [Any issues] - -**Metrics**: - -- Training time: X min -- Current score: X.XX bpb -- Model size: X.X MB - ---- - -## Resources - -### External Dependencies - -- **PyTorch**: 2.0+ (H100 support) -- **ncps**: For LTC/CfC reference implementation - ```bash - pip install ncps - ``` -- **Muon optimizer**: Custom (include in repo) - -### Compute - -- **Development**: 1×H100 on RunPod (~$2-3/hour) - - Budget: $300-500 for 2-3 weeks -- **Final runs**: 8×H100 SXM - - Budget: $200 for 10-15 runs - -**Total estimated cost**: $500-700 - -### Team - -- **Developer** (1 person, full-time equivalent): - - PyTorch experience (required) - - WASM/low-level (helpful) - - ML research (helpful) - ---- - -## Next Actions - -**Immediate** (Today): - -1. Set up RunPod account and spin up 1×H100 -2. Clone `parameter-golf` repo and verify baseline -3. Start Q² weight quantization implementation - -**This Week**: - -1. Complete Phase 1 (Foundation) -2. Begin Phase 2 (Core Architecture) -3. First full model training run - -**By End of Month**: - -1. Complete all phases -2. Submit to Parameter Golf -3. Publish results - ---- - -**Document Status**: Living document, update as implementation progresses - -**Last Updated**: 2026-03-21 - -**Owner**: Q² Project diff --git a/docs/parameter-golf/STRATEGY.md b/docs/parameter-golf/STRATEGY.md deleted file mode 100644 index 7ae764b..0000000 --- a/docs/parameter-golf/STRATEGY.md +++ /dev/null @@ -1,55 +0,0 @@ -# Parameter Golf strategy - -This note outlines how to point Q² at OpenAI's Parameter Golf challenge (train in <10 minutes on 8×H100, 16 MB artifact cap, scored by tokenizer-agnostic val_bpb on FineWeb). It combines the structural quantization stack (§D-2–§D-4) with liquid recurrent blocks inspired by Hasani et al. 
(LTC / CfC / LIV) and the current leaderboard tactics (int5/6, BigramHash, sliding eval). - -## Constraints and current bar -- Submission artifact = compressed model bytes + `train_gpt.py` bytes ≤ 16 000 000. -- Wall-clock training ≤ 10 minutes on 8×H100 SXM; evaluation can be longer but runs on the same hardware profile. -- Metric is `val_bpb` on FineWeb validation; lower is better. Tokenizer is free but must report bytes correctly. -- Current leaderboard (2026-03-20) tops out at ~1.1428 bpb with 10L int5/int6 + BigramHash(10240) + SWA + Muon WD + sliding-window eval. - -## Fit for Q² -- Use the equiprobable quaternary thresholds (§D-2.5) so each symbol carries ~2 bits; Gray + Lee structure (§D-2.7, §D-2.8) preserves complement geometry when we quantize weights, activations, and transition keys. -- Mixed-precision oracle from transition density (§D-3.6, §P-17): channels with short runs/high Lee curvature stay at int6/fp16; long-run channels drop to q2. -- Geode progression (§D-4.1, §P-16): start with coarse q2, then refine a small subset of facets (heads/MLP rows) that remain loss-critical after a short probe run. - -## Wildberger–Geode structure (answering “the structure itself”) -- Treat each attention head / MLP row as a **facet** in the Geode factorization (§D-4.1): parameter-tie facets that live on the same stratum so depth grows without new parameters; liquid blocks share the same tying map. -- Enforce the **Euler polytope constraint** (§D-4.2) as a budget rule: limit new facets by requiring V−E+F to stay constant when we add heads; this caps parameter growth and keeps the artifact under 16 MB. -- Use **hierarchical unlock**: Stage A trains only the lowest-dimension facets; Stage B unlocks higher-curvature facets whose transition density spikes; Stage C demotes low-curvature facets back to q2 while keeping int6 for the unlocked set. 
-- Apply **trie sparsity** (§P15) to hashed vocab: prune BigramHash buckets whose transition sequences collapse to shallow tries, freeing bytes for higher-curvature facets. -- During export, **facet ordering** follows Geode traversal so zstd sees long runs (same structure as §D-3.4 block layout), improving compression without changing weights. - -## Architecture: liquid + tied + hashed -- **Tokenizer/embeddings** - - Keep the leaderboard BigramHash(10k) idea but align with Q² by forcing equiprobable symbol usage: hash collisions are re-mapped with the complement involution so the four code points stay balanced (§D-2.5). - - Tie input/output embeddings (already in baseline) and share them across encoder/decoder halves; keep embeddings at int6 to avoid early entropy loss. -- **Backbone** - - Replace half of the MLP blocks with a **liquid gated branch**: a compact CfC/LTC cell (3–5 learnable time constants per head) that maintains a per-head state across tokens. This recycles compute across depth and extends effective context without adding attention parameters (mirrors LFM 2.5’s efficiency). - - Shallow attention (6–8 layers, 4–6 heads, 2 KV heads) plus SmearGate-style mixing; depth is recovered by **parameter tying** (ALBERT-style) + liquid recurrence. This keeps parameter count low for the 16 MB cap while letting us spend wall-clock on more steps. - - Apply **run-reduction transition keys** to per-channel activations on the fly; use transition density as a regularizer (penalize over-fragmented trajectories) to keep q2 thresholds stable under liquid dynamics. -- **Regularization** - - Skip-weight vectors and q_gain stay in fp16, guarded by the upstream `CONTROL_TENSOR` mechanism in the OpenAI Parameter Golf baseline (not defined in this repo), but are stored with per-row scaling; everything else targets q2/int5. 
- - When integrating with the upstream Parameter Golf baseline, enable its existing “logit softcap” mechanism on the logits and combine it with stochastic depth on the liquid branch to stabilize fast compilation; this repo does not currently ship a `logit_softcap` implementation. - -## Training plan (10-minute budget) -- **Stage A (6–7 min):** train at seq_len 1024 with tied weights and liquid branch active, using a cosine LR warmup→flat→linear warmdown. In the upstream Parameter Golf baseline, compile `zeropower_via_newtonschulz5` and the liquid cell (or an equivalent zero-power preconditioner); in this repo, treat that call as a suggested hook rather than an existing function. Freeze embeddings for the first 500 steps to stabilize hash collisions. -- **Stage B (2–3 min):** switch to seq_len 2048 with **sliding-window eval (stride 64)** to harvest lower bpb; unfreeze embeddings and apply Muon WD on matrices only. Run a brief QAT pass that drives weights toward equiprobable q2 levels (clip at the τ* thresholds from §D-2.5). -- **Stage C (≤1 min):** structural mixed-precision sweep: promote the densest 5–10% rows (by transition density) to int6, everything else to q2/int5. Capture a checkpoint before compression for ablation. - -## Compression and artifact packing -- Design an export path that packs weights into an int8-compatible container while storing the payload as **q2/int5 per-row with 16-bit scales**; keep small control tensors in fp16. Reorder tensors by transition-key order so a downstream compressor (for example, zstd run in the submission pipeline) sees long runs (RLE-friendly). -- Target parameter budget: ≤28 M params → q2 payload ≈ 7 MB; with scales/metadata + code we stay below 15 MB. If BigramHash pushes vocab beyond 10k, cap at 12k and prune rare buckets to stay under budget. 
-- In the parameter-golf repo, add a round-trip export/import helper (e.g., `final_int8_zlib_roundtrip`) and validate against the tokenizer-agnostic bpb path; reject any run that inflates compressed artifact size past 15.5 MB. - -## Execution checklist for the parameter-golf repo -1) Add BigramHash tokenizer variant with complement-aware collision resolution; keep the val_bpb byte accounting unchanged. -2) Implement a **LiquidBlock** (CfC/LTC) that plugs into `train_gpt.py` alongside the existing MLP, with parameter tying across depth (Geode stratum-aware). -3) Add transition-density logging to drive the mixed q2/int6 schedule (P17), plus a tiny QAT head that nudges weights to τ*; unlock higher-dimension facets only when transition density spikes (Geode progression). -4) Swap export to q2/int5 per-row + fp16 control tensors and rely on an external high-ratio compressor (for example, zstd run at a strong setting in the submission toolchain); order tensors by Geode traversal to boost run-length compression. -5) Run the 3-stage schedule above, capture logs for p<0.01 delta over the 1.1428 bpb SOTA. - -## Risks and mitigations -- **Liquid stability:** clamp time constants and use per-head RMSNorm to prevent exploding hidden state; fall back to pure MLP if compilation fails. -- **Tokenizer drift:** complement-aware hashing must keep byte accounting identical to baseline; add a regression that compares bytes-per-token vs. the stock SP tokenizer on the val set. -- **Artifact overrun:** if the compressed artifact exceeds 15.5 MB, lower vocab to 8k or increase the int6→q2 demotion threshold until size passes.
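
## Appendix: export-gate sketch

The export path in checklist step 4 above (q2/int5 per-row payloads with fp16 scales, an external high-ratio compressor, and rejection of oversized artifacts) can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the parameter-golf repo: `pack_rows_int5`, `artifact_size`, and `ARTIFACT_CAP` are hypothetical names, `zlib` stands in for the zstd pass run by the submission toolchain, and symbols are stored one per byte rather than bit-packed.

```python
import io
import zlib  # stand-in for the zstd pass in the submission toolchain
import numpy as np

ARTIFACT_CAP = 15_500_000  # reject artifacts above 15.5 MB, per the checklist

def pack_rows_int5(w):
    """Quantize each row to symmetric int5 ([-15, 15]) with an fp16 scale."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 15.0
    scales[scales == 0] = 1.0  # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(w / scales), -15, 15).astype(np.int8)
    return q, scales.astype(np.float16)

def artifact_size(tensors):
    """Compressed size of all per-row payloads plus their fp16 scales."""
    buf = io.BytesIO()
    for w in tensors:
        q, s = pack_rows_int5(w)
        buf.write(q.tobytes())  # one symbol per byte; a real exporter would bit-pack
        buf.write(s.tobytes())
    return len(zlib.compress(buf.getvalue(), 9))

# Gate a toy model's tensors against the artifact cap
rng = np.random.default_rng(0)
weights = [rng.standard_normal((384, 384)).astype(np.float32) for _ in range(4)]
size = artifact_size(weights)
assert size <= ARTIFACT_CAP, f"artifact {size} bytes exceeds cap"
```

Reordering `tensors` by transition-key/Geode traversal order before writing, as step 4 suggests, would let the compressor see longer runs without changing any weights.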