
Commit 8fcf7ee

Honest writeup: geodesic A/B inconclusive + K-frozen attention bug
A/B result (3 seeds × 250 steps, 8-token windows):

    seed 42:  vanilla=2.464  geodesic=2.713  (+10.1%, worse)
    seed 7:   vanilla=2.507  geodesic=2.479  (-1.1%, better)
    seed 123: vanilla=2.272  geodesic=2.620  (+15.3%, worse)
    mean:     vanilla=2.414  geodesic=2.604  (+7.9%)
    wins:     1/3 (INCONCLUSIVE)

The PyTorch 3/3-seed win at -0.4% did NOT replicate.

But code review during the run surfaced a real bug in the attention layer:

    h k = tape_matmul(x_id, K_w);      # K_w trainable
    h k_val = tape_value(k);           # rip out value
    h kt_val = arr_transpose(k_val);   # OMC-space transpose
    h kt = tape_const(kt_val);         # re-inject as CONSTANT
    h scores = tape_matmul(q, kt);     # grad only flows to q

The tape_value → tape_const sequence severs gradient flow through K.
K_w never receives gradient from the attention score path. K stays
frozen at its random init throughout training.

Both arms ran broken attention. The geodesic bias was being added to
scores from RANDOM-FROZEN keys, not learned ones. That's an entirely
different experiment than the PyTorch one.

Cause: no tape_transpose Rust builtin. The OMC-space transpose trick
was an expedient hack while we focused on getting the forward pass to
run. It made the forward correct but the backward broken in a way that
affects ONLY K's score-path gradient (K's bias-path gradient via
softmax doesn't exist because softmax has no bias from K).

What's needed for meaningful replication:
1. Add tape_transpose Rust builtin (forward swaps dims; backward
   transposes the upstream gradient)
2. Add test_attention_backward_flows_to_QKV regression test
3. Re-run the A/B with both arms having trainable K
4. Whichever way it lands then is the real signal

Full writeup with reasoning:
experiments/prometheus_parity/GEODESIC_AB_RESULTS.md

This is a fail-forward result + a real bug found + a clean fix path.
The geodesic primitive is still validated (3/3 in PyTorch, numerically
identical port to OMC).

What we learned: when you ship attention without testing backward, the
first real experiment is the one that catches it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 05c9e4a commit 8fcf7ee

1 file changed

Lines changed: 79 additions & 0 deletions

@@ -0,0 +1,79 @@
# Geodesic attention A/B: first Prometheus replication attempt

## Result (3 seeds × 250 steps, 8-token windows)

| seed | vanilla loss | geodesic loss | delta | outcome |
|---|--:|--:|--:|---|
| 42 | 2.464 | 2.713 | +10.1% | geodesic worse |
| 7 | 2.507 | 2.479 | −1.1% | geodesic better |
| 123 | 2.272 | 2.620 | +15.3% | geodesic worse |
| **mean** | **2.414** | **2.604** | **+7.9%** | **1/3 wins** |

**Verdict: inconclusive, leaning negative.** The PyTorch result that
won 3/3 seeds at −0.4% did NOT replicate in this Prometheus run.
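As a quick check on the table's arithmetic, the Python sketch below recomputes the per-seed deltas, the mean losses, and the win count from the six reported losses. It assumes delta means (geodesic − vanilla) / vanilla, and that the mean row's delta is computed from the two mean losses rather than by averaging the per-seed deltas. Both are assumptions, chosen because they reproduce the reported figures. The `results` dict and helper names are illustrative only.

```python
from statistics import mean, stdev

# (vanilla loss, geodesic loss) per seed, copied from the table above.
results = {42: (2.464, 2.713), 7: (2.507, 2.479), 123: (2.272, 2.620)}


def rel_delta(vanilla, geodesic):
    """Relative change of geodesic vs. vanilla loss, in percent (assumed definition)."""
    return (geodesic - vanilla) / vanilla * 100.0


deltas = {seed: rel_delta(v, g) for seed, (v, g) in results.items()}
mean_vanilla = mean(v for v, _ in results.values())
mean_geodesic = mean(g for _, g in results.values())
mean_delta = rel_delta(mean_vanilla, mean_geodesic)   # ratio of means, not mean of ratios
wins = sum(1 for v, g in results.values() if g < v)   # lower loss counts as a win
spread = stdev(deltas.values())                       # sample std of per-seed deltas

for seed, d in deltas.items():
    print(f"seed {seed}: {d:+.1f}%")
print(f"mean: vanilla={mean_vanilla:.3f} geodesic={mean_geodesic:.3f} ({mean_delta:+.1f}%)")
print(f"wins: {wins}/{len(results)}, per-seed delta std: {spread:.1f} points")
```

The per-seed deltas range across more than the mean shift itself, which is one way to read "inconclusive" when there are only three seeds.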

## Honest caveat: the K-frozen attention bug

While the A/B was training, a code review surfaced a real bug in
`prom_attention_forward`:

```omc
h k = tape_matmul(x_id, K_w);      # K_w is a trainable param
h k_val = tape_value(k);           # extract the raw value (leaves the tape)
h kt_val = arr_transpose(k_val);   # transpose in OMC space, untracked
h kt = tape_const(kt_val);         # re-inject as a CONSTANT
h scores = tape_matmul(q, kt);     # gradient flows ONLY to q
```

The `tape_value → arr_transpose → tape_const` sequence severs
gradient flow through K. `K_w` gets zero gradient from the attention
score path, and the score path is K's only route to the loss. K is
effectively frozen at its random init throughout training.
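The effect can be reproduced outside OMC with a very small reverse-mode tape. The Python sketch below is a stand-in, not the Prometheus implementation: `Node`, `backprop`, and the `*_raw` helpers are hypothetical names, and `tape_matmul` / `tape_const` only mimic the roles the OMC builtins play in the snippet above. It builds the same value → transpose → const chain and then inspects the gradients.

```python
def matmul_raw(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose_raw(m):
    return [list(col) for col in zip(*m)]


def add_into(dst, src):
    for i, row in enumerate(src):
        for j, v in enumerate(row):
            dst[i][j] += v


class Node:
    """A matrix on the tape; `backward` pushes this node's grad to its parents."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents
        self.backward = None  # stays None for leaves and constants
        self.grad = [[0.0] * len(value[0]) for _ in value]


def tape_matmul(a, b):
    out = Node(matmul_raw(a.value, b.value), parents=(a, b))

    def backward():
        add_into(a.grad, matmul_raw(out.grad, transpose_raw(b.value)))
        add_into(b.grad, matmul_raw(transpose_raw(a.value), out.grad))

    out.backward = backward
    return out


def tape_const(value):
    return Node(value)  # no parents: nothing upstream can receive gradient


def backprop(root):
    order, seen = [], set()

    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p in n.parents:
                visit(p)
            order.append(n)

    visit(root)
    root.grad = [[1.0] * len(root.value[0]) for _ in root.value]  # loss = sum(scores)
    for n in reversed(order):
        if n.backward:
            n.backward()


x = Node([[1.0, 2.0], [3.0, 4.0]])
Q_w = Node([[0.5, -1.0], [1.5, 2.0]])
K_w = Node([[1.0, 0.0], [-0.5, 2.0]])

q = tape_matmul(x, Q_w)
k = tape_matmul(x, K_w)
kt = tape_const(transpose_raw(k.value))  # value -> transpose -> const, as in the bug
scores = tape_matmul(q, kt)
backprop(scores)

print("Q_w.grad:", Q_w.grad)
print("K_w.grad:", K_w.grad)
```

Because `kt` has no parents, the traversal from `scores` never reaches `k` or `K_w`, so `K_w.grad` stays at its zero initialisation while `Q_w.grad` is populated.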

**This means both arms (A and B) ran broken attention.** The
geodesic bias was being added to scores `q · K_random^T`, not to
scores from a learned K. We were testing whether the geodesic bias
helps when keys are random, which is an entirely different question
from the PyTorch experiment, where K was trainable.

## Why the result is unsurprising given the bug

In the PyTorch experiment, K was trained alongside Q and V. The
geodesic bias added a positional inductive prior on top of
*learned* attention. The model could discover patterns like "attend
to nearby positions for short-range dependencies" and the bias
nudged it toward Fibonacci-coprime distance metrics specifically.

In our Prometheus run, K is fixed at random. The attention scores
have no learned key structure. Adding a positional bias to such
scores either:
- acts as more noise on top of noise (no benefit), which is the most likely case
- accidentally provides the ONLY structure, giving a tiny effect in either direction

The result is consistent with "broken attention plus a bias either
hurts (overrides random noise that happened to work) or doesn't help
much (random noise was already meaningless)."

## What's needed for a meaningful replication

1. **Add `tape_transpose` Rust builtin.** Differentiable transpose
   so K trains through the score path: forward swaps dims, backward
   transposes the upstream gradient. ~30 lines forward + backward.
2. **Verify K's gradient is non-zero** after one training step.
3. **Re-run the A/B with both arms having trainable K.**
4. If geodesic still wins 0/3 seeds at this scale, then we have a real
   negative: substrate bias doesn't help when the corpus is small, the
   model is small, and training is short. That's a legitimate finding.
5. If geodesic wins or ties, the PyTorch result replicates.
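The rule a differentiable transpose has to follow is short: if Y = Xᵀ in the forward pass, then the gradient with respect to X is the transpose of the gradient with respect to Y. The Python sketch below applies that rule by hand to the score path and compares the resulting `K_w` gradient against central finite differences. It is a check of the math only, not the Rust builtin. The function names, the toy shapes (3 tokens, width 2), and the weighted-sum loss are all assumptions made for illustration.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose_forward(m):
    return [list(col) for col in zip(*m)]          # Y = X^T


def transpose_backward(upstream):
    return [list(col) for col in zip(*upstream)]   # dX = (dY)^T


def loss(x, q, K_w, w):
    """Toy scalar loss: weighted sum of the attention scores q @ (x @ K_w)^T."""
    k = matmul(x, K_w)
    scores = matmul(q, transpose_forward(k))
    return sum(w[i][j] * scores[i][j]
               for i in range(len(w)) for j in range(len(w[0])))


def analytic_grad_K_w(x, q, w):
    d_scores = w                                   # d loss / d scores
    d_kt = matmul(transpose_forward(q), d_scores)  # scores = q @ kt
    d_k = transpose_backward(d_kt)                 # kt = k^T
    return matmul(transpose_forward(x), d_k)       # k = x @ K_w


def numeric_grad_K_w(x, q, K_w, w, eps=1e-4):
    grad = [[0.0] * len(K_w[0]) for _ in K_w]
    for i in range(len(K_w)):
        for j in range(len(K_w[0])):
            original = K_w[i][j]
            K_w[i][j] = original + eps
            up = loss(x, q, K_w, w)
            K_w[i][j] = original - eps
            down = loss(x, q, K_w, w)
            K_w[i][j] = original                   # restore before the next entry
            grad[i][j] = (up - down) / (2 * eps)
    return grad


x = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
q = [[0.2, -0.4], [1.0, 0.5], [-0.3, 0.8]]
K_w = [[0.1, 0.7], [-0.6, 0.3]]
w = [[1.0, -2.0, 0.5], [0.0, 1.5, -1.0], [2.0, 0.3, 0.7]]

analytic = analytic_grad_K_w(x, q, w)
numeric = numeric_grad_K_w(x, q, K_w, w)
max_err = max(abs(a - n) for ra, rn in zip(analytic, numeric) for a, n in zip(ra, rn))
print("analytic dL/dK_w:", analytic)
print("max abs error vs finite differences:", max_err)
```

The same two checks, a non-zero gradient and agreement with finite differences, are a reasonable shape for step 2 once the real builtin exists.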

## Lesson

We shipped a layer without testing its end-to-end gradient flow.
`test_prometheus.omc` has 10 tests covering every other layer and
zero touching attention. That's the regression-prevention gap to
close before any further A/B testing.

The fail-forward path:
1. Fix K (add `tape_transpose`)
2. Add `test_attention_backward_flows_to_QKV` to lock it in
3. Re-run this A/B
4. Report whichever result lands (real win OR real null)
