Commit 690d1ec
Transformerless LM end-to-end: CRT-PE wins -19.9% vs sinusoidal
First end-to-end empirical evidence that the harmonic substrate
substitutions identified in experiments 0-12 carry over to a real
LM training task. Tiny char-level model (102K params, 2 layers,
d_model=64, seq_len=64) trained 600 steps on a small text corpus.
Three architectures with identical parameter count, differing
only in PE and attention scoring:
standard: sinusoidal PE + pure softmax
crt_only: CRT-Fib PE + pure softmax
hybrid: CRT-Fib PE + softmax x HBit-tension gate
Headline numbers (5-seed mean):
arch mean val loss vs standard win rate
standard 0.5095 -- --
crt_only 0.4082 -19.9% 4 / 5
hybrid 0.4831 -5.2% 4 / 5
The CRT-PE substitution alone is the architectural win. CRT also
has lower variance across seeds (range 0.35-0.48 vs standard's
0.35-0.61), making it both better-on-average and more reliable.
The hybrid architecture is also better than standard but worse than
crt_only. Consistent with experiment-12: the gate helps in
adversarial regimes (off-manifold distractors) but pays a cost in
clean training where all keys are on-distribution.
Why CRT wins: sinusoidal periods (1, 1/10000^(2/d), ...) all wrap
quickly within seq_len=64. CRT-Fibonacci periods {5, 8, 13, 21, 34,
55, 89, 144} — by Chinese Remainder Theorem, the joint residue
tuple uniquely identifies positions in [0, ~24M), well past any
practical sequence length. Distinct positional codes let the model
learn position-specific attention patterns more cleanly.
Honest caveats (full list in README):
- Tiny corpus (~1.5 KB), tiny model (102K params)
- Char-level next-token prediction only
- Vaswani sinusoidal is a 2017 baseline; modern transformers use
rotary/ALiBi/T5-relative/learned PE — not compared
- 1 of 5 seeds (123) had standard outperform crt_only — robust win,
not universal
Architectural significance:
- The FIRST end-to-end model where a per-component harmonic
substitution beats the transformer baseline at training time
- The transformerless-LLM thesis now has its first proof point on
the actual training dynamic, not just isolated metrics
- Next: scale (10x bigger model, 100x bigger corpus)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>1 parent 1d1edea commit 690d1ec
5 files changed
Lines changed: 533 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
0 commit comments