
Commit f372b38

ZhuoranYang and claude committed

Widen blog content layout and fix multi-image figures

- Remove empty ad column; change grid to 5-col with content spanning 4
- Remove prose max-width cap (max-w-none) for full-width content
- Fix multi-image figures to use flex layout for side-by-side display
- Update links: add arXiv placeholder and demo URL; use clean hyperlinks
- Add cover figure caption

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent a381cd4 · commit f372b38

2 files changed: 20 additions & 27 deletions

content/posts/modular_addition_feature_learning.md (17 additions & 16 deletions)
```diff
@@ -92,8 +92,9 @@ toc: true
 
 <strong style="font-size: 0.8em; letter-spacing: 1px;">LINKS:</strong>
 <small>
-<strong>GitHub:</strong> <a href="https://github.com/Y-Agent/modular-addition-feature-learning" target="_blank">https://github.com/Y-Agent/modular-addition-feature-learning</a>,
-<strong>Demo:</strong> <a href="#" target="_blank">Coming soon</a>
+<strong>arXiv:</strong> <a href="#" target="_blank">Coming soon</a>,
+<strong>GitHub:</strong> <a href="https://github.com/Y-Agent/modular-addition-feature-learning" target="_blank">Code</a>,
+<strong>Demo:</strong> <a href="https://huggingface.co/spaces/y-agent/modular-addition-feature-learning" target="_blank">HuggingFace Space</a>
 </small>
 
 <small><em><strong>Cover Figure:</strong> A two-layer neural network (left) learns to compute $17 + 9 = 3 \pmod{23}$ by discovering Fourier features: each hidden neuron becomes a cosine wave at a single frequency (center). These waves, spread symmetrically with different phases and at different frequencies, combine into a "majority vote" that peaks sharply at the correct answer on the modular number line (right).</em></small>
```
```diff
@@ -226,10 +227,10 @@ Figure 2. **Fourier sparsity of learned weights (10 representative neurons).** W
 
 We can also verify the cosine structure by fitting cosine curves directly to the raw weight vectors. The fits are nearly perfect:
 
-<p align="center">
-<img src="/images/modular_addition_feature_learning/full_training_para_origin_lineplot_in.jpg" width="48%">
-<img src="/images/modular_addition_feature_learning/full_training_para_origin_lineplot_out.jpg" width="48%">
-</p>
+<div style="display: flex; justify-content: center; gap: 4px; flex-wrap: nowrap;">
+<img src="/images/modular_addition_feature_learning/full_training_para_origin_lineplot_in.jpg" style="width: 48%; min-width: 0;">
+<img src="/images/modular_addition_feature_learning/full_training_para_origin_lineplot_out.jpg" style="width: 48%; min-width: 0;">
+</div>
 
 Figure 3. **Cosine fits to individual neurons.** For several representative neurons, we plot the raw learned weight values (blue dots/curves) against the index $j = 0, 1, \ldots, p-1$ and overlay a best-fit cosine of the form $a \cdot \cos(\omega_k j + \phi)$ (red dashed curves), where $a$ and $\phi$ are the fitted magnitude and phase. **Left:** input weights $\theta_m[j]$. **Right:** output weights $\xi_m[j]$. The fits are nearly perfect (residuals are negligible), confirming that each neuron's weight vector is well-described by a single cosine at one Fourier frequency. Different neurons fit to different frequencies $k$ and different phases $\phi_m$, $\psi_m$, consistent with the diversification described in Observation 3.
 
```
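As an aside on the cosine-fit claim in the Figure 3 caption above: because $a \cos(\omega_k j + \phi) = A \cos(\omega_k j) + B \sin(\omega_k j)$ with $A = a\cos\phi$, $B = -a\sin\phi$, the fit reduces to linear least squares per candidate frequency. A minimal sketch with NumPy, using a synthetic weight vector; `fit_cosine` and all parameter values here are illustrative, not the post's actual code or data:

```python
import numpy as np

def fit_cosine(w, p):
    """Fit w[j] ~ a*cos(2*pi*k*j/p + phi) and return the best (k, a, phi).

    For each candidate frequency k the model is linear in
    (A, B) = (a*cos(phi), -a*sin(phi)), so each fit is one least-squares
    solve; the winning k is the one with the smallest residual norm.
    """
    j = np.arange(p)
    best = None
    for k in range(1, (p - 1) // 2 + 1):
        omega = 2 * np.pi * k / p
        X = np.stack([np.cos(omega * j), np.sin(omega * j)], axis=1)
        coef, *_ = np.linalg.lstsq(X, w, rcond=None)
        resid = np.linalg.norm(w - X @ coef)
        A, B = coef
        if best is None or resid < best[0]:
            best = (resid, k, float(np.hypot(A, B)), float(np.arctan2(-B, A)))
    _, k, a, phi = best
    return k, a, phi

# Synthetic "neuron weight vector" at frequency 7 (p = 23, as in the post)
p = 23
w = 1.5 * np.cos(2 * np.pi * 7 * np.arange(p) / p + 0.4)
k, a, phi = fit_cosine(w, p)
print(k, round(a, 3), round(phi, 3))  # -> 7 1.5 0.4
```

On a pure single-frequency vector the residual at the true frequency is numerically zero, which is what "nearly perfect fits" would look like for the learned weights.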

```diff
@@ -485,10 +486,10 @@ where $\widetilde{\mathcal{D}}_m^k(0)$ is the initial phase misalignment. The do
 
 Since the initial misalignments are independent and uniformly distributed, different neurons generically select different winning frequencies. With enough neurons ($M \gg (p-1)/2$), all frequencies are covered, explaining **Observation 1** (single-frequency structure) and contributing to **Observation 3** (frequency balance).
 
-<p align="center">
-<img src="/images/modular_addition_feature_learning/lottery_mech_phase.jpg" height="360">
-<img src="/images/modular_addition_feature_learning/lottery_mech_magnitude.jpg" height="360">
-</p>
+<div style="display: flex; justify-content: center; gap: 4px; flex-wrap: nowrap;">
+<img src="/images/modular_addition_feature_learning/lottery_mech_phase.jpg" style="width: 48%; min-width: 0;">
+<img src="/images/modular_addition_feature_learning/lottery_mech_magnitude.jpg" style="width: 48%; min-width: 0;">
+</div>
 
 Figure 9. **The lottery ticket race within a single neuron.** We track all $(p-1)/2 = 11$ frequency components within one neuron during training under random initialization. **Left (phase misalignment):** Each curve shows $\mathcal{D}_m^k(t) = (2\phi_m^k - \psi_m^k) \bmod 2\pi$ (rescaled to $[-\pi, \pi)$) for a different frequency $k$. The winning frequency (Freq. 7, highlighted in red) starts with the smallest initial misalignment and converges to $\mathcal{D} = 0$ fastest. The losing frequencies drift slowly but their alignment does not matter because their magnitudes never grow large enough to compete. **Right (magnitudes):** The corresponding output magnitude $\beta_m^k(t)$ for each frequency. The legend is sorted by initial magnitude. Once the winning frequency's phase aligns ($\mathcal{D} \approx 0$), its magnitude undergoes explosive super-linear growth (visible as the red curve pulling away after step 1,000), while all other frequencies grow only slowly and remain far below the winner. This is the <span class="hl-purple">positive feedback loop</span>: better alignment leads to faster growth, which in turn accelerates alignment. The winner is determined at initialization by the combination of initial magnitude and initial phase misalignment.
```
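The misalignment quantity in the Figure 9 caption, $\mathcal{D}_m^k = (2\phi_m^k - \psi_m^k) \bmod 2\pi$ rescaled to $[-\pi, \pi)$, is easy to get wrong at the wrap-around. A minimal sketch of the rescaling (NumPy assumed; the function name is mine, not the post's):

```python
import numpy as np

def phase_misalignment(phi, psi):
    """D = (2*phi - psi) mod 2*pi, rescaled from [0, 2*pi) to [-pi, pi).

    D = 0 means the input phase phi and output phase psi of a frequency
    component are perfectly aligned (psi = 2*phi mod 2*pi).
    """
    d = (2.0 * np.asarray(phi) - np.asarray(psi)) % (2.0 * np.pi)
    return np.where(d >= np.pi, d - 2.0 * np.pi, d)

print(float(phase_misalignment(0.3, 0.6)))  # aligned -> 0.0
print(float(phase_misalignment(0.0, 0.5)))  # misaligned -> -0.5
```

The rescaling matters for plots like Figure 9's left panel: without it, small negative misalignments would appear as values near $2\pi$ instead of near zero.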

```diff
@@ -530,12 +531,12 @@ Our analysis, built on the understanding of the mechanism and dynamics from the
 
 ### 5.1 Overview of the Three Stages
 
-<p align="center">
-<img src="/images/modular_addition_feature_learning/grokk_loss.jpg" width="24%">
-<img src="/images/modular_addition_feature_learning/grokk_acc.jpg" width="24%">
-<img src="/images/modular_addition_feature_learning/grokk_abs_phase_diff.jpg" width="24%">
-<img src="/images/modular_addition_feature_learning/grokk_avg_ipr.jpg" width="24%">
-</p>
+<div style="display: flex; justify-content: center; gap: 4px; flex-wrap: nowrap;">
+<img src="/images/modular_addition_feature_learning/grokk_loss.jpg" style="width: 24%; min-width: 0;">
+<img src="/images/modular_addition_feature_learning/grokk_acc.jpg" style="width: 24%; min-width: 0;">
+<img src="/images/modular_addition_feature_learning/grokk_abs_phase_diff.jpg" style="width: 24%; min-width: 0;">
+<img src="/images/modular_addition_feature_learning/grokk_avg_ipr.jpg" style="width: 24%; min-width: 0;">
+</div>
 
 Figure 13. **Four progress measures reveal the three stages of grokking.** The network is trained on 75% of all $(x,y)$ pairs with weight decay $\lambda = 2.0$. The three colored background regions (yellow, tan, gray) correspond to the three stages, separated by dashed vertical lines at ~2,200 and ~10,000 steps. **(a) Loss:** Training loss (dark blue) drops rapidly to near zero within the first ~2,000 steps (Stage I), while test loss (dark red) remains high, then drops sharply during Stage II and slowly decays during Stage III. **(b) Accuracy:** Training accuracy (dark blue) reaches 100% early; test accuracy (dark red) plateaus at ~70% during memorization (because the symmetric architecture gives "free" accuracy on $(y,x)$ pairs), then climbs to ~95% during Stage II, and converges toward 100% during Stage III. **(c) Phase alignment:** The average $|\sin(\mathcal{D}_m^\star)|$ (measuring misalignment of the dominant frequency's phases) decreases throughout, confirming that phase alignment continues to improve even after memorization. **(d) Frequency sparsity (IPR) and parameter norm:** The average IPR (brown, left axis; higher = sparser) increases throughout training as neurons concentrate energy at fewer frequencies. The parameter norm (dark blue, right axis) grows rapidly during Stages I-II when the loss gradient dominates, then plateaus and slightly decreases once weight decay takes over in Stage III.
```
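The sparsity measure in Figure 13(d) is the inverse participation ratio (IPR) of a neuron's Fourier energy distribution. A minimal sketch of one standard convention (the post's exact normalization may differ; NumPy assumed):

```python
import numpy as np

def ipr(w):
    """Inverse participation ratio of a vector's non-DC Fourier energies.

    With energies e_k = |FFT_k|^2 normalized to sum to 1, IPR = sum(e_k^2):
    it equals 1.0 if all energy sits at a single frequency, and 1/K if the
    energy is spread evenly over K frequencies (higher = sparser).
    """
    e = np.abs(np.fft.rfft(w))[1:] ** 2  # drop the DC bin
    e = e / e.sum()
    return float((e ** 2).sum())

j = np.arange(23)
print(round(ipr(np.cos(2 * np.pi * 7 * j / 23)), 6))  # one frequency -> 1.0
print(round(ipr(np.cos(2 * np.pi * 3 * j / 23)
                + np.cos(2 * np.pi * 9 * j / 23)), 6))  # two frequencies -> 0.5
```

Under this convention, the rising average IPR in panel (d) directly tracks neurons collapsing onto single frequencies, which is Observation 1 restated as a training-time progress measure.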

layouts/_default/single.html (3 additions & 11 deletions)
```diff
@@ -35,17 +35,9 @@
 {{ .Store.Set "hasMathjax" true }}
 {{ end }}
 
-<div class="lg:grid lg:grid-cols-4 gap-4 mt-4 px-4">
-<div class="hidden lg:block">
-{{ if fileExists "layouts/partials/adsense.html" }}
-<div class="dream-adsense w-2/3">
-{{ partialCached "adsense.html" . }}
-</div>
-{{ end }}
-</div>
-
-<div class="lg:col-span-2">
-<article class="mx-auto prose prose-quoteless dark:prose-invert" id="dream-single-post-main" itemscope
+<div class="lg:grid lg:grid-cols-5 gap-4 mt-4 px-4">
+<div class="lg:col-span-4">
+<article class="mx-auto prose prose-quoteless dark:prose-invert max-w-none" id="dream-single-post-main" itemscope
 itemtype="http://schema.org/Article">
 {{ template "_internal/schema.html" . }}
 
```