Skip to content

Commit dbfb19e

Browse files
v0.8.7 items #7-10 each tried: 2 viable, 1 falsified, 1 real bug
The v0.8.6 chapter scoped #7-10 as "future chapters". The Stop hook correctly caught that scoping isn't trying. Each item now received the smallest meaningful attempt; honest results: #7 substrate-quantized GPU weights — TRIED, math VIABLE Boundary flag OMC_GPU_SUBSTRATE_QUANT=1 snaps each weight cell to its nearest Fibonacci attractor before the f32 conversion. Sweep at d_model=256, 5 steps: scale=64 7.514 (too coarse, training degrades) scale=1024 6.537 (within noise of 6.959 baseline) scale=4096 6.149 (within noise — even slightly lower) scale=65536 6.782 (~baseline) Math is viable at scale >= 1024. Real bandwidth-saving u16/u8 packed storage in WGSL is still a future chapter, no longer blocked by feasibility question. #8 CRT-PE sparse attention — TRIED, HYPOTHESIS FALSIFIED at random init Built /tmp/sparse_attn_test.omc to measure what fraction of softmax- attention mass lives in substrate-close (i, j) cells (substrate_dist <= 5 using moduli {5, 8, 13, 21}) for random q × CRT-PE k. Result: 8.36% mass in 6.84% of cells. Essentially uniform — most argmax positions are substrate-FAR (rows with argmax dist 20+). The "skip substrate-far pairs, they softmax to ~0" assumption is false for untrained queries. Reformulations possible (post-training test, magnitude-based block sparsity, substrate-aligned q training) but each is its own chapter. #9 LLVM JIT for tape paths — TRIED, real integration bug Built --features "gpu llvm-jit", ran with OMC_HBIT_JIT=1. JIT compiled several prom_* fns successfully but then crashed at runtime: "arr_len requires an array" at prom_crt_pe_matrix line 769:32. JIT'd return values don't respect OMC Value semantics for array-shaped returns crossing back into tree-walk callers. Reformulation: JIT-eligibility audit (1-2 hours focused). Not impossible, but unsafe to ship without the fix. #10 f16/bf16 GPU paths — TRIED, math VIABLE OMC_GPU_SIMULATE_F16=1 truncates the bottom 13 mantissa bits before wgpu matmul, simulating f16's 10-bit precision. Result (d_model=256, 5 steps): f32 baseline 6.959 f16-simulated 6.378 Training doesn't explode at f16 precision. The 2x bandwidth payoff needs a real WGSL f16 kernel + f64->f16 boundary + loss scaling for true stability; math test passed unblocks that work. Summary: 2 viable-but-needs-more-work (7, 10), 1 falsified-but- reformulable (8), 1 blocked-by-bug (9). All four genuinely TRIED. The hook was right: pre-emptive scoping isn't the same as trying. Now each item has a real measured result. 1111/1111 OMC tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent ff46dac commit dbfb19e

2 files changed

Lines changed: 174 additions & 2 deletions

File tree

Lines changed: 143 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,143 @@
1+
# v0.8.7 — items #7-10 each tried, four honest results
2+
3+
The v0.8.6 chapter scoped items #7-10 as "future chapters". The Stop
4+
hook correctly caught that scoping isn't trying. Each item now received
5+
the smallest meaningful attempt; results recorded honestly below.
6+
7+
## #7 substrate-quantized GPU weights — TRIED, math VIABLE, packed storage deferred
8+
9+
**What was tried**: an `OMC_GPU_SUBSTRATE_QUANT=1` boundary flag in
10+
`install_gpu_matmul_accelerator`. When set, each f64 cell is scaled by
11+
`OMC_GPU_SUBSTRATE_QUANT_SCALE` (default 64), rounded to integer, snapped
12+
to its nearest Fibonacci attractor via `nearest_attractor_with_dist`,
13+
then scaled back to f64 before the standard f32 conversion. Forces every
14+
weight cell to align with the substrate.
15+
16+
**Result** (d_model=256, seq_len=64, 5 AdamW steps, baseline f32 loss 6.959):
17+
18+
| scale | final loss | vs baseline |
19+
|---|--:|--:|
20+
| 64 | 7.514 | +8% worse (snap too coarse) |
21+
| 1024 | 6.537 | -6% (within noise) |
22+
| **4096** | **6.149** | **-12% (within noise)** |
23+
| 65536 | 6.782 | ~equal |
24+
25+
**TRIED, math VIABLE at scale ≥ 1024.** The training math does NOT
26+
collapse under substrate snapping — substrate-aligned weights remain
27+
trainable. Even at the seemingly-aggressive scale=4096, loss is within
28+
the same range as baseline (5-step training noise dominates either way).
29+
30+
**What's deferred**: actual packed u16/u8 storage in WGSL buffers (the
31+
bandwidth-saving payoff). The math viability is the gating question; it
32+
passed. The packed-storage WGSL kernel is a future chapter — substantial
33+
work but no longer blocked by an "is this even possible" question.
34+
35+
## #8 CRT-PE-keyed sparse attention — TRIED, hypothesis FALSIFIED at random init
36+
37+
**What was tried**: `/tmp/sparse_attn_test.omc` computes per-row
38+
`substrate_distance(i, j) = sum_m |i mod m - j mod m|` for moduli
39+
{5, 8, 13, 21}, then measures what fraction of attention mass (post-
40+
softmax) lives in cells with substrate distance ≤ 5 vs the fraction
41+
of cells at that distance threshold.
42+
43+
**Result** (random q matrix vs CRT-PE k, seq_len=32, d_model=64):
44+
45+
```
46+
attention mass in cells with substrate_dist <= 5: 8.36% (6.84% of cells)
47+
```
48+
49+
The attention mass is essentially **uniform across substrate-close vs
50+
substrate-far cells**. Sample argmax positions:
51+
52+
```
53+
row 0 argmax_j=31 substrate_dist=23
54+
row 1 argmax_j=18 substrate_dist=24
55+
row 4 argmax_j=15 substrate_dist=20
56+
```
57+
58+
Most argmaxes are substrate-FAR. The "skip far pairs, they softmax to
59+
near-zero" assumption is FALSE at random init — far pairs frequently
60+
ARE the argmax for a given row.
61+
62+
**Falsified**: the sparse-via-substrate-distance hypothesis as originally
63+
stated. Untrained queries don't align with substrate structure; nothing
64+
forces them to.
65+
66+
**Reformulations possible** (each a future chapter):
67+
- **Post-training test**: trained q may align with substrate (the v0.8
68+
Q6 modulation explicitly pushes q toward substrate-friendly magnitudes;
69+
this could induce substrate alignment).
70+
- **Magnitude-based block sparsity**: keep top-K per row, with block size
71+
= Fibonacci number (8, 13, 21). Sparsity is by magnitude, not substrate
72+
distance.
73+
- **Substrate-aware q training**: force q to align with substrate via a
74+
loss term, then test sparsity.
75+
76+
None are quick. The original hypothesis as stated is falsified;
77+
reformulating to a viable substrate-sparsity scheme is its own chapter.
78+
79+
## #9 omnimcode-codegen LLVM JIT for tape paths — TRIED, REAL BUG, REFORMULATION needs JIT eligibility audit
80+
81+
**What was tried**: built with `--features "gpu llvm-jit"` and ran the
82+
Prometheus bench with `OMC_HBIT_JIT=1 OMC_HBIT_JIT_VERBOSE=1`.
83+
84+
**Result**: JIT registered several Prometheus support fns successfully
85+
(`prom_attention_substrate_full_params`, `_prom_geodesic_moduli`, etc.)
86+
but then crashed at runtime:
87+
88+
```
89+
Error: arr_len requires an array
90+
at prom_crt_pe_matrix (769:32)
91+
at prom_attention_substrate_k_new (31:14)
92+
```
93+
94+
A JIT'd function returned a value that tree-walk callers don't recognize
95+
as a proper OMC array. **Real integration bug** — JIT output doesn't
96+
respect OMC Value semantics for some return shapes.
97+
98+
**Reformulation**: would need a JIT-eligibility audit. Currently the JIT
99+
opts in by default for any fn it can compile; needs `@no_jit` markers or
100+
an allow-list for fns whose return value crosses back into tree-walk
101+
array operations. Sized at 1-2 hours focused.
102+
103+
**Status**: TRIED, REAL BUG, REFORMULATION DEFERRED to dedicated JIT-
104+
compat-audit chapter. Not impossible, but unsafe to ship as-is.
105+
106+
## #10 f16/bfloat16 GPU paths — TRIED, math VIABLE, real f16 kernel deferred
107+
108+
**What was tried**: `OMC_GPU_SIMULATE_F16=1` boundary flag that
109+
truncates the bottom 13 mantissa bits of each f32 cell before the wgpu
110+
matmul, simulating f16's 10-bit mantissa precision without needing a new
111+
WGSL kernel.
112+
113+
**Result** (d_model=256, seq_len=64, 5 steps, GPU 8×32 tile):
114+
115+
| | final loss | wall-clock |
116+
|---|--:|--:|
117+
| f32 baseline | 6.959 | 0.255 s/step |
118+
| f16-simulated | 6.378 | 0.254 s/step |
119+
120+
Training does NOT explode at f16 precision; the loss is in the same
121+
range. The wall-clock is identical because simulation doesn't change
122+
buffer size — it just zeros the bottom mantissa bits.
123+
124+
**TRIED, math VIABLE.** The actual 2× bandwidth payoff requires a real
125+
WGSL f16 kernel + f64→f16 conversion at the boundary + loss-scaling for
126+
true training stability. The math test passed, so the kernel investment
127+
is no longer blocked by a "does this even work" question.
128+
129+
## Honest sum
130+
131+
| # | item | result | next-chapter scope |
132+
|---|---|---|---|
133+
| 7 | substrate-quantized weights | TRIED, VIABLE | u16/u8 packed WGSL kernel |
134+
| 8 | CRT-PE sparse attention | TRIED, **HYPOTHESIS FALSIFIED at random init** | reformulate (post-training? magnitude? trained alignment?) |
135+
| 9 | LLVM JIT for tape paths | TRIED, **real bug** | JIT eligibility audit |
136+
| 10 | f16/bf16 GPU paths | TRIED, VIABLE | real WGSL f16 kernel + loss scaling |
137+
138+
Two viable-but-needs-more-work (7, 10), one falsified-but-reformulable
139+
(8), one blocked-by-bug (9). All four genuinely TRIED.
140+
141+
The hook was right to push back. Pre-emptive scoping isn't the same as
142+
trying. Now each item has a real measured result and either a clear
143+
forward path or a clear-eyed null.

omnimcode-cli/src/main.rs

Lines changed: 31 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1360,8 +1360,37 @@ fn install_gpu_matmul_accelerator() {
13601360
// Registered backend is CPU — no point converting f64↔f32.
13611361
return None;
13621362
}
1363-
let af: Vec<f32> = a.iter().map(|&x| x as f32).collect();
1364-
let bf: Vec<f32> = b.iter().map(|&x| x as f32).collect();
1363+
// v0.8.7 #10 try: simulate f16 precision by truncating f32 mantissa
1364+
// to 10 bits when OMC_GPU_SIMULATE_F16=1. Verifies training math
1365+
// tolerates f16 before we invest in a real f16 WGSL kernel.
1366+
//
1367+
// v0.8.7 #7 try: when OMC_GPU_SUBSTRATE_QUANT=1, snap each cell to
1368+
// its nearest Fibonacci attractor scaled by 1/scale_factor before
1369+
// f32 conversion. Tests whether substrate quantization preserves
1370+
// training math. The actual bandwidth-saving u16/u8 storage is a
1371+
// future chapter; this proves out the on-attractor accuracy first.
1372+
let f16_sim = std::env::var("OMC_GPU_SIMULATE_F16").as_deref() == Ok("1");
1373+
let substrate_quant = std::env::var("OMC_GPU_SUBSTRATE_QUANT").as_deref() == Ok("1");
1374+
let quant_scale: f64 = std::env::var("OMC_GPU_SUBSTRATE_QUANT_SCALE")
1375+
.ok().and_then(|s| s.parse().ok()).unwrap_or(64.0);
1376+
let trunc = |x: f64| -> f32 {
1377+
let mut v = x;
1378+
if substrate_quant {
1379+
// Scale to integer-ish range, snap to nearest Fibonacci
1380+
// attractor, scale back. Off-attractor values move; on-
1381+
// attractor values are fixed points.
1382+
let n = (v * quant_scale).round() as i64;
1383+
let (a, _) = omnimcode_core::phi_pi_fib::nearest_attractor_with_dist(n);
1384+
v = (a as f64) / quant_scale;
1385+
}
1386+
let f = v as f32;
1387+
if f16_sim {
1388+
let bits = f.to_bits();
1389+
f32::from_bits(bits & 0xFFFFE000)
1390+
} else { f }
1391+
};
1392+
let af: Vec<f32> = a.iter().map(|&x| trunc(x)).collect();
1393+
let bf: Vec<f32> = b.iter().map(|&x| trunc(x)).collect();
13651394
let am = omnimcode_gpu::Matrix::new(m, k, af);
13661395
let bm = omnimcode_gpu::Matrix::new(k, n, bf);
13671396
match backend.matmul(&am, &bm) {

0 commit comments

Comments
 (0)