v0.8.7 items #7-10 each tried: 2 viable, 1 falsified, 1 real bug

RandomCoder-lab · claude · RandomCoder-lab · commit dbfb19ec8c72 · 2026-05-17T17:14:16.000-05:00
The v0.8.6 chapter scoped #7-10 as "future chapters". The Stop hook
correctly caught that scoping isn't trying. Each item now received the
smallest meaningful attempt; honest results:

#7 substrate-quantized GPU weights — TRIED, math VIABLE
  Boundary flag OMC_GPU_SUBSTRATE_QUANT=1 snaps each weight cell to
  its nearest Fibonacci attractor before the f32 conversion.
  Sweep at d_model=256, 5 steps:
    scale=64    7.514 (too coarse, training degrades)
    scale=1024  6.537 (within noise of 6.959 baseline)
    scale=4096  6.149 (within noise — even slightly lower)
    scale=65536 6.782 (~baseline)
  Math is viable at scale &gt;= 1024. Real bandwidth-saving u16/u8 packed
  storage in WGSL is still a future chapter, no longer blocked by
  feasibility question.

#8 CRT-PE sparse attention — TRIED, HYPOTHESIS FALSIFIED at random init
  Built /tmp/sparse_attn_test.omc to measure what fraction of softmax-
  attention mass lives in substrate-close (i, j) cells (substrate_dist
  &lt;= 5 using moduli {5, 8, 13, 21}) for random q × CRT-PE k.
  Result: 8.36% mass in 6.84% of cells. Essentially uniform — most
  argmax positions are substrate-FAR (rows with argmax dist 20+).
  The "skip substrate-far pairs, they softmax to ~0" assumption is
  false for untrained queries. Reformulations possible (post-training
  test, magnitude-based block sparsity, substrate-aligned q training)
  but each is its own chapter.

#9 LLVM JIT for tape paths — TRIED, real integration bug
  Built --features "gpu llvm-jit", ran with OMC_HBIT_JIT=1. JIT
  compiled several prom_* fns successfully but then crashed at
  runtime: "arr_len requires an array" at prom_crt_pe_matrix line
  769:32. JIT'd return values don't respect OMC Value semantics for
  array-shaped returns crossing back into tree-walk callers.
  Reformulation: JIT-eligibility audit (1-2 hours focused). Not
  impossible, but unsafe to ship without the fix.

#10 f16/bf16 GPU paths — TRIED, math VIABLE
  OMC_GPU_SIMULATE_F16=1 truncates the bottom 13 mantissa bits before
  wgpu matmul, simulating f16's 10-bit precision.
  Result (d_model=256, 5 steps):
    f32 baseline    6.959
    f16-simulated   6.378
  Training doesn't explode at f16 precision. The 2x bandwidth payoff
  needs a real WGSL f16 kernel + f64-&gt;f16 boundary + loss scaling for
  true stability; math test passed unblocks that work.

Summary: 2 viable-but-needs-more-work (7, 10), 1 falsified-but-
reformulable (8), 1 blocked-by-bug (9). All four genuinely TRIED. The
hook was right: pre-emptive scoping isn't the same as trying. Now each
item has a real measured result.

1111/1111 OMC tests pass.

Co-Authored-By: Claude Opus 4.7 &lt;noreply@anthropic.com&gt;
diff --git a/experiments/prometheus_parity/V087_ITEMS_7_TO_10_ATTEMPTS.md b/experiments/prometheus_parity/V087_ITEMS_7_TO_10_ATTEMPTS.md
@@ -0,0 +1,143 @@
+# v0.8.7 — items #7-10 each tried, four honest results
+
+The v0.8.6 chapter scoped items #7-10 as "future chapters". The Stop
+hook correctly caught that scoping isn't trying. Each item now received
+the smallest meaningful attempt; results recorded honestly below.
+
+## #7 substrate-quantized GPU weights — TRIED, math VIABLE, packed storage deferred
+
+**What was tried**: an `OMC_GPU_SUBSTRATE_QUANT=1` boundary flag in
+`install_gpu_matmul_accelerator`. When set, each f64 cell is scaled by
+`OMC_GPU_SUBSTRATE_QUANT_SCALE` (default 64), rounded to integer, snapped
+to its nearest Fibonacci attractor via `nearest_attractor_with_dist`,
+then scaled back to f64 before the standard f32 conversion. Forces every
+weight cell to align with the substrate.
+
+**Result** (d_model=256, seq_len=64, 5 AdamW steps, baseline f32 loss 6.959):
+
+| scale | final loss | vs baseline |
+|---|--:|--:|
+| 64 | 7.514 | +8% worse (snap too coarse) |
+| 1024 | 6.537 | -6% (within noise) |
+| **4096** | **6.149** | **-12% (within noise)** |
+| 65536 | 6.782 | ~equal |
+
+**TRIED, math VIABLE at scale ≥ 1024.** The training math does NOT
+collapse under substrate snapping — substrate-aligned weights remain
+trainable. Even at the seemingly-aggressive scale=4096, loss is within
+the same range as baseline (5-step training noise dominates either way).
+
+**What's deferred**: actual packed u16/u8 storage in WGSL buffers (the
+bandwidth-saving payoff). The math viability is the gating question; it
+passed. The packed-storage WGSL kernel is a future chapter — substantial
+work but no longer blocked by an "is this even possible" question.
+
+## #8 CRT-PE-keyed sparse attention — TRIED, hypothesis FALSIFIED at random init
+
+**What was tried**: `/tmp/sparse_attn_test.omc` computes per-row
+`substrate_distance(i, j) = sum_m |i mod m - j mod m|` for moduli
+{5, 8, 13, 21}, then measures what fraction of attention mass (post-
+softmax) lives in cells with substrate distance ≤ 5 vs the fraction
+of cells at that distance threshold.
+
+**Result** (random q matrix vs CRT-PE k, seq_len=32, d_model=64):
+
+```
+attention mass in cells with substrate_dist <= 5:  8.36%   (6.84% of cells)
+```
+
+The attention mass is essentially **uniform across substrate-close vs
+substrate-far cells**. Sample argmax positions:
+
+```
+row 0  argmax_j=31  substrate_dist=23
+row 1  argmax_j=18  substrate_dist=24
+row 4  argmax_j=15  substrate_dist=20
+```
+
+Most argmaxes are substrate-FAR. The "skip far pairs, they softmax to
+near-zero" assumption is FALSE at random init — far pairs frequently
+ARE the argmax for a given row.
+
+**Falsified**: the sparse-via-substrate-distance hypothesis as originally
+stated. Untrained queries don't align with substrate structure; nothing
+forces them to.
+
+**Reformulations possible** (each a future chapter):
+- **Post-training test**: trained q may align with substrate (the v0.8
+  Q6 modulation explicitly pushes q toward substrate-friendly magnitudes;
+  this could induce substrate alignment).
+- **Magnitude-based block sparsity**: keep top-K per row, with block size
+  = Fibonacci number (8, 13, 21). Sparsity is by magnitude, not substrate
+  distance.
+- **Substrate-aware q training**: force q to align with substrate via a
+  loss term, then test sparsity.
+
+None are quick. The original hypothesis as stated is falsified;
+reformulating to a viable substrate-sparsity scheme is its own chapter.
+
+## #9 omnimcode-codegen LLVM JIT for tape paths — TRIED, REAL BUG, REFORMULATION needs JIT eligibility audit
+
+**What was tried**: built with `--features "gpu llvm-jit"` and ran the
+Prometheus bench with `OMC_HBIT_JIT=1 OMC_HBIT_JIT_VERBOSE=1`.
+
+**Result**: JIT registered several Prometheus support fns successfully
+(`prom_attention_substrate_full_params`, `_prom_geodesic_moduli`, etc.)
+but then crashed at runtime:
+
+```
+Error: arr_len requires an array
+  at prom_crt_pe_matrix (769:32)
+  at prom_attention_substrate_k_new (31:14)
+```
+
+A JIT'd function returned a value that tree-walk callers don't recognize
+as a proper OMC array. **Real integration bug** — JIT output doesn't
+respect OMC Value semantics for some return shapes.
+
+**Reformulation**: would need a JIT-eligibility audit. Currently the JIT
+opts in by default for any fn it can compile; needs `@no_jit` markers or
+an allow-list for fns whose return value crosses back into tree-walk
+array operations. Sized at 1-2 hours focused.
+
+**Status**: TRIED, REAL BUG, REFORMULATION DEFERRED to dedicated JIT-
+compat-audit chapter. Not impossible, but unsafe to ship as-is.
+
+## #10 f16/bfloat16 GPU paths — TRIED, math VIABLE, real f16 kernel deferred
+
+**What was tried**: `OMC_GPU_SIMULATE_F16=1` boundary flag that
+truncates the bottom 13 mantissa bits of each f32 cell before the wgpu
+matmul, simulating f16's 10-bit mantissa precision without needing a new
+WGSL kernel.
+
+**Result** (d_model=256, seq_len=64, 5 steps, GPU 8×32 tile):
+
+| | final loss | wall-clock |
+|---|--:|--:|
+| f32 baseline | 6.959 | 0.255 s/step |
+| f16-simulated | 6.378 | 0.254 s/step |
+
+Training does NOT explode at f16 precision; the loss is in the same
+range. The wall-clock is identical because simulation doesn't change
+buffer size — it just zeros the bottom mantissa bits.
+
+**TRIED, math VIABLE.** The actual 2× bandwidth payoff requires a real
+WGSL f16 kernel + f64→f16 conversion at the boundary + loss-scaling for
+true training stability. The math test passed, so the kernel investment
+is no longer blocked by a "does this even work" question.
+
+## Honest sum
+
+| # | item | result | next-chapter scope |
+|---|---|---|---|
+| 7 | substrate-quantized weights | TRIED, VIABLE | u16/u8 packed WGSL kernel |
+| 8 | CRT-PE sparse attention | TRIED, **HYPOTHESIS FALSIFIED at random init** | reformulate (post-training? magnitude? trained alignment?) |
+| 9 | LLVM JIT for tape paths | TRIED, **real bug** | JIT eligibility audit |
+| 10 | f16/bf16 GPU paths | TRIED, VIABLE | real WGSL f16 kernel + loss scaling |
+
+Two viable-but-needs-more-work (7, 10), one falsified-but-reformulable
+(8), one blocked-by-bug (9). All four genuinely TRIED.
+
+The hook was right to push back. Pre-emptive scoping isn't the same as
+trying. Now each item has a real measured result and either a clear
+forward path or a clear-eyed null.
diff --git a/omnimcode-cli/src/main.rs b/omnimcode-cli/src/main.rs
@@ -1360,8 +1360,37 @@ fn install_gpu_matmul_accelerator() {
                 // Registered backend is CPU — no point converting f64↔f32.
                 return None;
             }
-            let af: Vec<f32> = a.iter().map(|&x| x as f32).collect();
-            let bf: Vec<f32> = b.iter().map(|&x| x as f32).collect();
+            // v0.8.7 #10 try: simulate f16 precision by truncating f32 mantissa
+            // to 10 bits when OMC_GPU_SIMULATE_F16=1. Verifies training math
+            // tolerates f16 before we invest in a real f16 WGSL kernel.
+            //
+            // v0.8.7 #7 try: when OMC_GPU_SUBSTRATE_QUANT=1, snap each cell to
+            // its nearest Fibonacci attractor scaled by 1/scale_factor before
+            // f32 conversion. Tests whether substrate quantization preserves
+            // training math. The actual bandwidth-saving u16/u8 storage is a
+            // future chapter; this proves out the on-attractor accuracy first.
+            let f16_sim = std::env::var("OMC_GPU_SIMULATE_F16").as_deref() == Ok("1");
+            let substrate_quant = std::env::var("OMC_GPU_SUBSTRATE_QUANT").as_deref() == Ok("1");
+            let quant_scale: f64 = std::env::var("OMC_GPU_SUBSTRATE_QUANT_SCALE")
+                .ok().and_then(|s| s.parse().ok()).unwrap_or(64.0);
+            let trunc = |x: f64| -> f32 {
+                let mut v = x;
+                if substrate_quant {
+                    // Scale to integer-ish range, snap to nearest Fibonacci
+                    // attractor, scale back. Off-attractor values move; on-
+                    // attractor values are fixed points.
+                    let n = (v * quant_scale).round() as i64;
+                    let (a, _) = omnimcode_core::phi_pi_fib::nearest_attractor_with_dist(n);
+                    v = (a as f64) / quant_scale;
+                }
+                let f = v as f32;
+                if f16_sim {
+                    let bits = f.to_bits();
+                    f32::from_bits(bits & 0xFFFFE000)
+                } else { f }
+            };
+            let af: Vec<f32> = a.iter().map(|&x| trunc(x)).collect();
+            let bf: Vec<f32> = b.iter().map(|&x| trunc(x)).collect();
             let am = omnimcode_gpu::Matrix::new(m, k, af);
             let bm = omnimcode_gpu::Matrix::new(k, n, bf);
             match backend.matmul(&am, &bm) {