
Commit b495007

apartsin and claude committed
Fix math rendering: 193 bare math spans, broken equations, ordering inversions
- Fix all 193 bare <span class="math"> tags missing $...$ KaTeX delimiters
- Fix broken S4 state-space and RoPE equations in section-4.3
- Fix prose-in-math-block in section-5.2 (repetition penalty formula)
- Fix 24 caption/output ordering inversions in Parts 9-10
- Create p3_math_rendering.py audit script (6 issue categories)
- Tune p2_missing_output.py: 61 to 22 issues (64% false positive reduction)
- Remaining: 18 ORPHANED_MATH (false positives), 8 HTML_ENTITY (render fine)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent b396b90 commit b495007

45 files changed

Lines changed: 1093 additions & 216 deletions
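The commit body references a new audit script, p3_math_rendering.py, which is among the 45 changed files but not included in the diff excerpt below. As a rough, hypothetical sketch of how a "bare math span" check (one of the script's 6 issue categories) could be implemented — the actual script's category names, structure, and logic may differ:

```python
# Hypothetical sketch only: the real p3_math_rendering.py is not shown in this
# commit excerpt, and its category names and logic may differ.
import re
from pathlib import Path

# Flags <span class="math"> tags whose contents do not start with a $ delimiter,
# i.e. the "bare math span" issue described in the commit message.
BARE_MATH = re.compile(r'<span class="math">(?!\s*\$)(.*?)</span>', re.DOTALL)

def find_bare_math_spans(html: str) -> list[str]:
    """Return the contents of math spans that lack $...$ KaTeX delimiters."""
    return [m.group(1).strip() for m in BARE_MATH.finditer(html)]

if __name__ == "__main__":
    for path in sorted(Path(".").rglob("*.html")):
        for content in find_bare_math_spans(path.read_text(encoding="utf-8")):
            print(f"{path}: BARE_MATH_SPAN: {content!r}")
```

Run against the book's HTML tree, a check of this kind would surface each span that KaTeX's auto-render pass would otherwise leave as plain text.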


appendices/appendix-a-mathematical-foundations/index.html

Lines changed: 12 additions & 12 deletions
@@ -89,7 +89,7 @@ <h3>Matrices and Matrix Multiplication</h3>
 
 </div>
 
-<p>This formula describes a linear layer: input <span class="math">X</span> (a batch of vectors) is multiplied by weight matrix <span class="math">W</span>, then bias vector <span class="math">b</span> is added. Nearly every layer in a neural network begins with this operation. Code Fragment A.2 below puts this into practice.</p>
+<p>This formula describes a linear layer: input <span class="math">$X$</span> (a batch of vectors) is multiplied by weight matrix <span class="math">$W$</span>, then bias vector <span class="math">$b$</span> is added. Nearly every layer in a neural network begins with this operation. Code Fragment A.2 below puts this into practice.</p>
 
 <pre><code class="language-text"># Matrix multiplication in NumPy
 X = np.random.randn(4, 768) # batch of 4 tokens, each 768-dim
@@ -101,11 +101,11 @@ <h3>Matrices and Matrix Multiplication</h3>
 <div class="code-caption"><strong>Code Fragment A.2:</strong> This snippet demonstrates this approach using NumPy. Study the implementation details to understand how each component contributes to the overall computation. Understanding the underlying numerical operations helps build intuition for how the model processes data.</div>
 <h3>Transpose and Symmetry</h3>
 
-<p>The <strong>transpose</strong> of a matrix flips rows and columns: if <span class="math">A</span> has shape (m, n), then <span class="math">$A^T$</span> has shape (n, m). In attention, we compute <span class="math">$Q \cdot K^T$</span> because Q has shape (seq_len, d_k) and K has the same shape; transposing K to (d_k, seq_len) makes the multiplication produce a (seq_len, seq_len) attention matrix.</p>
+<p>The <strong>transpose</strong> of a matrix flips rows and columns: if <span class="math">$A$</span> has shape (m, n), then <span class="math">$A^T$</span> has shape (n, m). In attention, we compute <span class="math">$Q \cdot K^T$</span> because Q has shape (seq_len, d_k) and K has the same shape; transposing K to (d_k, seq_len) makes the multiplication produce a (seq_len, seq_len) attention matrix.</p>
 
 <h3>Eigenvalues and Eigenvectors</h3>
 
-<p>An <strong>eigenvector</strong> of a matrix <span class="math">A</span> is a vector <span class="math">v</span> such that multiplying by the matrix merely scales it:</p>
+<p>An <strong>eigenvector</strong> of a matrix <span class="math">$A$</span> is a vector <span class="math">$v$</span> such that multiplying by the matrix merely scales it:</p>
 
 <div class="math-block">
 $$A \cdot v = \lambda \cdot v$$
@@ -116,7 +116,7 @@ <h3>Eigenvalues and Eigenvectors</h3>
 
 <div class="callout note">
 <div class="callout-title">Practical Connection: Low-Rank Factorization</div>
-<p>LoRA (Low-Rank Adaptation) freezes the original weight matrix and adds a small trainable update: <span class="math">$W' = W + B \cdot A$</span>, where <span class="math">B</span> is (d, r) and <span class="math">A</span> is (r, d) with rank <span class="math">$r << d$</span>. This works because weight updates during fine-tuning tend to live in a low-dimensional subspace, a fact rooted in the eigenstructure of the update matrices.</p>
+<p>LoRA (Low-Rank Adaptation) freezes the original weight matrix and adds a small trainable update: <span class="math">$W' = W + B \cdot A$</span>, where <span class="math">$B$</span> is (d, r) and <span class="math">$A$</span> is (r, d) with rank <span class="math">$r << d$</span>. This works because weight updates during fine-tuning tend to live in a low-dimensional subspace, a fact rooted in the eigenstructure of the update matrices.</p>
 </div>
 
 <div class="section-break">&bull; &bull; &bull;</div>
@@ -209,7 +209,7 @@ <h3>Expected Value and Variance</h3>
 
 <div class="callout practical-example">
 <div class="callout-title">Practical Example: Temperature Sampling</div>
-<p>When generating text, the <strong>temperature</strong> parameter reshapes the probability distribution. Given logits <span class="math">z</span>, we compute <span class="math">$\operatorname{softmax}(z / T)$</span>. A temperature of 1.0 is the default distribution. Temperatures below 1.0 make the distribution sharper (more confident), while temperatures above 1.0 flatten it (more random). At <span class="math">$T \rightarrow 0$</span>, the model always picks the highest-probability token (greedy decoding). At <span class="math">$T \rightarrow \infty$</span>, all tokens become equally likely.</p>
+<p>When generating text, the <strong>temperature</strong> parameter reshapes the probability distribution. Given logits <span class="math">$z$</span>, we compute <span class="math">$\operatorname{softmax}(z / T)$</span>. A temperature of 1.0 is the default distribution. Temperatures below 1.0 make the distribution sharper (more confident), while temperatures above 1.0 flatten it (more random). At <span class="math">$T \rightarrow 0$</span>, the model always picks the highest-probability token (greedy decoding). At <span class="math">$T \rightarrow \infty$</span>, all tokens become equally likely.</p>
 </div>
 
 <h3>The Chain Rule of Probability and Autoregressive Generation</h3>
@@ -237,15 +237,15 @@ <h3>Sampling Strategies: Top-k and Nucleus (Top-p)</h3>
 <p>Raw sampling from a model's full distribution often produces incoherent text because low-probability tokens accumulate enough mass to be selected occasionally. Two truncation strategies address this:</p>
 
 <ul>
-<li><strong>Top-k sampling:</strong> Keep only the <span class="math">k</span> highest-probability tokens, set the rest to zero, and renormalize. With <span class="math">$k = 50$</span>, the model chooses among its 50 best guesses. The limitation is that the right <span class="math">k</span> varies by context: after "The capital of France is" only one token is sensible, but after "I feel" many continuations are valid.</li>
-<li><strong>Nucleus (top-p) sampling:</strong> Instead of a fixed count, keep the smallest set of tokens whose cumulative probability exceeds a threshold <span class="math">p</span> (typically 0.9 or 0.95). This adapts the effective vocabulary size to the model's confidence at each step.</li>
+<li><strong>Top-k sampling:</strong> Keep only the <span class="math">$k$</span> highest-probability tokens, set the rest to zero, and renormalize. With <span class="math">$k = 50$</span>, the model chooses among its 50 best guesses. The limitation is that the right <span class="math">$k$</span> varies by context: after "The capital of France is" only one token is sensible, but after "I feel" many continuations are valid.</li>
+<li><strong>Nucleus (top-p) sampling:</strong> Instead of a fixed count, keep the smallest set of tokens whose cumulative probability exceeds a threshold <span class="math">$p$</span> (typically 0.9 or 0.95). This adapts the effective vocabulary size to the model's confidence at each step.</li>
 </ul>
 
 <p>These strategies are covered in depth with code examples in <a class="cross-ref" href="../../part-2-understanding-llms/module-08-reasoning-test-time-compute/section-8.1.html">Section 8.1</a>. Understanding them requires only the probability concepts above: truncating and renormalizing a distribution.</p>
 
 <h3>Monte Carlo Estimation</h3>
 
-<p><strong>Monte Carlo methods</strong> estimate expected values by averaging samples. If you cannot compute <span class="math">$E_{x \sim P}[f(x)]$</span> analytically (because the distribution is too complex), you draw <span class="math">N</span> samples and approximate:</p>
+<p><strong>Monte Carlo methods</strong> estimate expected values by averaging samples. If you cannot compute <span class="math">$E_{x \sim P}[f(x)]$</span> analytically (because the distribution is too complex), you draw <span class="math">$N$</span> samples and approximate:</p>
 
 <div class="math-block">
 $$E_{x \sim P}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i), \quad x_i \sim P$$
@@ -255,7 +255,7 @@ <h3>Monte Carlo Estimation</h3>
 
 <h3>Confidence Intervals and Statistical Testing for Evaluation</h3>
 
-<p>When comparing two models on a benchmark, a 2% accuracy difference might be meaningful or just noise. <strong>Confidence intervals</strong> quantify this uncertainty. For a metric computed over <span class="math">N</span> test examples with sample mean <span class="math">$\bar{x}$</span> and standard deviation <span class="math">s</span>, the 95% confidence interval is approximately:</p>
+<p>When comparing two models on a benchmark, a 2% accuracy difference might be meaningful or just noise. <strong>Confidence intervals</strong> quantify this uncertainty. For a metric computed over <span class="math">$N$</span> test examples with sample mean <span class="math">$\bar{x}$</span> and standard deviation <span class="math">$s$</span>, the 95% confidence interval is approximately:</p>
 
 <div class="math-block">
 $$\bar{x} \pm 1.96 \cdot \frac{s}{\sqrt{N}}$$
@@ -278,7 +278,7 @@ <h2>A.3 Calculus for Machine Learning</h2>
 
 <h3>Derivatives and Gradients</h3>
 
-<p>A <strong>derivative</strong> tells you how fast a function's output changes when you nudge its input. For a function <span class="math">$f(x)$</span>, the derivative <span class="math">$f'(x)$</span> or <span class="math">$df/dx$</span> is the slope of the function at point <span class="math">x</span>.</p>
+<p>A <strong>derivative</strong> tells you how fast a function's output changes when you nudge its input. For a function <span class="math">$f(x)$</span>, the derivative <span class="math">$f'(x)$</span> or <span class="math">$df/dx$</span> is the slope of the function at point <span class="math">$x$</span>.</p>
 
 <p>When a function has many inputs (as a loss function that depends on millions of weights), we compute <strong>partial derivatives</strong> with respect to each input. The collection of all partial derivatives is called the <strong>gradient</strong>:</p>
 
@@ -411,14 +411,14 @@ <h3>Entropy</h3>
 <div class="code-caption"><strong>Code Fragment A.5:</strong> This snippet demonstrates the entropy function using NumPy. The function encapsulates reusable logic that can be applied across different inputs. Understanding the underlying numerical operations helps build intuition for how the model processes data.</div>
 <h3>Cross-Entropy</h3>
 
-<p><strong>Cross-entropy</strong> measures how well a predicted distribution <span class="math">Q</span> matches a true distribution <span class="math">P</span>:</p>
+<p><strong>Cross-entropy</strong> measures how well a predicted distribution <span class="math">$Q$</span> matches a true distribution <span class="math">$P$</span>:</p>
 
 <div class="math-block">
 $$H(P, Q) = - \Sigma P(x) \cdot \log Q(x)$$
 
 </div>
 
-<p>This is the standard loss function for training language models. The "true distribution" <span class="math">P</span> is the one-hot vector for the actual next token (probability 1 for the correct token, 0 for everything else). The predicted distribution <span class="math">Q</span> is the model's softmax output. Minimizing cross-entropy loss means making the model assign higher probability to the correct next token.</p>
+<p>This is the standard loss function for training language models. The "true distribution" <span class="math">$P$</span> is the one-hot vector for the actual next token (probability 1 for the correct token, 0 for everything else). The predicted distribution <span class="math">$Q$</span> is the model's softmax output. Minimizing cross-entropy loss means making the model assign higher probability to the correct next token.</p>
 
 <div class="callout key-insight">
 <div class="callout-title">Key Insight: Cross-Entropy and Perplexity</div>

appendices/appendix-a-mathematical-foundations/section-a.1.html

Lines changed: 4 additions & 4 deletions
@@ -72,7 +72,7 @@ <h3>Matrices and Matrix Multiplication</h3>
 
 </div>
 
-<p>This formula describes a linear layer: input <span class="math">X</span> (a batch of vectors) is multiplied by weight matrix <span class="math">W</span>, then bias vector <span class="math">b</span> is added. Nearly every layer in a neural network begins with this operation. Code Fragment A.1.2 below puts this into practice.</p>
+<p>This formula describes a linear layer: input <span class="math">$X$</span> (a batch of vectors) is multiplied by weight matrix <span class="math">$W$</span>, then bias vector <span class="math">$b$</span> is added. Nearly every layer in a neural network begins with this operation. Code Fragment A.1.2 below puts this into practice.</p>
 
 <pre><code class="language-text"># Matrix multiplication in NumPy
 X = np.random.randn(4, 768) # batch of 4 tokens, each 768-dim
@@ -84,11 +84,11 @@ <h3>Matrices and Matrix Multiplication</h3>
 <div class="code-caption"><strong>Code Fragment A.1.2:</strong> This snippet demonstrates this approach using NumPy. Study the implementation details to understand how each component contributes to the overall computation. Understanding the underlying numerical operations helps build intuition for how the model processes data.</div>
 <h3>Transpose and Symmetry</h3>
 
-<p>The <strong>transpose</strong> of a matrix flips rows and columns: if <span class="math">A</span> has shape (m, n), then <span class="math">$A^T$</span> has shape (n, m). In attention, we compute <span class="math">$Q \cdot K^T$</span> because Q has shape (seq_len, d_k) and K has the same shape; transposing K to (d_k, seq_len) makes the multiplication produce a (seq_len, seq_len) attention matrix.</p>
+<p>The <strong>transpose</strong> of a matrix flips rows and columns: if <span class="math">$A$</span> has shape (m, n), then <span class="math">$A^T$</span> has shape (n, m). In attention, we compute <span class="math">$Q \cdot K^T$</span> because Q has shape (seq_len, d_k) and K has the same shape; transposing K to (d_k, seq_len) makes the multiplication produce a (seq_len, seq_len) attention matrix.</p>
 
 <h3>Eigenvalues and Eigenvectors</h3>
 
-<p>An <strong>eigenvector</strong> of a matrix <span class="math">A</span> is a vector <span class="math">v</span> such that multiplying by the matrix merely scales it:</p>
+<p>An <strong>eigenvector</strong> of a matrix <span class="math">$A$</span> is a vector <span class="math">$v$</span> such that multiplying by the matrix merely scales it:</p>
 
 <div class="math-block">
 $$A \cdot v = \lambda \cdot v$$
@@ -99,7 +99,7 @@ <h3>Eigenvalues and Eigenvectors</h3>
 
 <div class="callout note">
 <div class="callout-title">Practical Connection: Low-Rank Factorization</div>
-<p>LoRA (Low-Rank Adaptation) freezes the original weight matrix and adds a small trainable update: <span class="math">$W' = W + B \cdot A$</span>, where <span class="math">B</span> is (d, r) and <span class="math">A</span> is (r, d) with rank <span class="math">$r << d$</span>. This works because weight updates during fine-tuning tend to live in a low-dimensional subspace, a fact rooted in the eigenstructure of the update matrices.</p>
+<p>LoRA (Low-Rank Adaptation) freezes the original weight matrix and adds a small trainable update: <span class="math">$W' = W + B \cdot A$</span>, where <span class="math">$B$</span> is (d, r) and <span class="math">$A$</span> is (r, d) with rank <span class="math">$r << d$</span>. This works because weight updates during fine-tuning tend to live in a low-dimensional subspace, a fact rooted in the eigenstructure of the update matrices.</p>
 </div>
 
 <nav class="chapter-nav">

appendices/appendix-a-mathematical-foundations/section-a.2.html

Lines changed: 1 addition & 1 deletion
@@ -116,7 +116,7 @@ <h3>Expected Value and Variance</h3>
 
 <div class="callout practical-example">
 <div class="callout-title">Practical Example: <a class="cross-ref" href="../../part-1-foundations/module-05-decoding-text-generation/section-5.2.html">Temperature</a> Sampling</div>
-<p>When generating text, the <strong>temperature</strong> parameter reshapes the probability distribution. Given logits <span class="math">z</span>, we compute <span class="math">$\operatorname{softmax}(z / T)$</span>. A temperature of 1.0 is the default distribution. Temperatures below 1.0 make the distribution sharper (more confident), while temperatures above 1.0 flatten it (more random). At <span class="math">$T \rightarrow 0$</span>, the model always picks the highest-probability token (greedy decoding). At <span class="math">$T \rightarrow \infty$</span>, all tokens become equally likely.</p>
+<p>When generating text, the <strong>temperature</strong> parameter reshapes the probability distribution. Given logits <span class="math">$z$</span>, we compute <span class="math">$\operatorname{softmax}(z / T)$</span>. A temperature of 1.0 is the default distribution. Temperatures below 1.0 make the distribution sharper (more confident), while temperatures above 1.0 flatten it (more random). At <span class="math">$T \rightarrow 0$</span>, the model always picks the highest-probability token (greedy decoding). At <span class="math">$T \rightarrow \infty$</span>, all tokens become equally likely.</p>
 </div>
 
 <nav class="chapter-nav">

appendices/appendix-a-mathematical-foundations/section-a.3.html

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ <h1>A.3 Calculus for Machine Learning</h1>
 
 <h3>Derivatives and Gradients</h3>
 
-<p>A <strong>derivative</strong> tells you how fast a function's output changes when you nudge its input. For a function <span class="math">$f(x)$</span>, the derivative <span class="math">$f'(x)$</span> or <span class="math">$df/dx$</span> is the slope of the function at point <span class="math">x</span>.</p>
+<p>A <strong>derivative</strong> tells you how fast a function's output changes when you nudge its input. For a function <span class="math">$f(x)$</span>, the derivative <span class="math">$f'(x)$</span> or <span class="math">$df/dx$</span> is the slope of the function at point <span class="math">$x$</span>.</p>
 
 <p>When a function has many inputs (as a loss function that depends on millions of weights), we compute <strong>partial derivatives</strong> with respect to each input. The collection of all partial derivatives is called the <strong>gradient</strong>:</p>
 
appendices/appendix-a-mathematical-foundations/section-a.4.html

Lines changed: 2 additions & 2 deletions
@@ -68,14 +68,14 @@ <h3>Entropy</h3>
 <div class="code-caption"><strong>Code Fragment A.4.1:</strong> This snippet demonstrates the entropy function using <a href="https://numpy.org/" target="_blank" rel="noopener">NumPy</a>. The function encapsulates reusable logic that can be applied across different inputs. Understanding the underlying numerical operations helps build intuition for how the model processes data.</div>
 <h3>Cross-Entropy</h3>
 
-<p><strong><a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">Cross-entropy</a></strong> measures how well a predicted distribution <span class="math">Q</span> matches a true distribution <span class="math">P</span>:</p>
+<p><strong><a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">Cross-entropy</a></strong> measures how well a predicted distribution <span class="math">$Q$</span> matches a true distribution <span class="math">$P$</span>:</p>
 
 <div class="math-block">
 $$H(P, Q) = - \Sigma P(x) \cdot \log Q(x)$$
 
 </div>
 
-<p>This is the standard loss function for training language models. The "true distribution" <span class="math">P</span> is the one-hot vector for the actual next token (probability 1 for the correct token, 0 for everything else). The predicted distribution <span class="math">Q</span> is the model's <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">softmax</a> output. Minimizing cross-entropy loss means making the model assign higher probability to the correct next token.</p>
+<p>This is the standard loss function for training language models. The "true distribution" <span class="math">$P$</span> is the one-hot vector for the actual next token (probability 1 for the correct token, 0 for everything else). The predicted distribution <span class="math">$Q$</span> is the model's <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">softmax</a> output. Minimizing cross-entropy loss means making the model assign higher probability to the correct next token.</p>
 
 <div class="callout key-insight">
 <div class="callout-title">Key Insight: Cross-Entropy and <a class="cross-ref" href="../appendix-b-ml-essentials/section-b.4.html">Perplexity</a></div>
