appendices/appendix-a-mathematical-foundations/index.html (12 additions, 12 deletions)
@@ -89,7 +89,7 @@ <h3>Matrices and Matrix Multiplication</h3>
 
 </div>
 
-<p>This formula describes a linear layer: input <span class="math">X</span> (a batch of vectors) is multiplied by weight matrix <span class="math">W</span>, then bias vector <span class="math">b</span> is added. Nearly every layer in a neural network begins with this operation. Code Fragment A.2 below puts this into practice.</p>
+<p>This formula describes a linear layer: input <span class="math">$X$</span> (a batch of vectors) is multiplied by weight matrix <span class="math">$W$</span>, then bias vector <span class="math">$b$</span> is added. Nearly every layer in a neural network begins with this operation. Code Fragment A.2 below puts this into practice.</p>
 
 <pre><code class="language-text"># Matrix multiplication in NumPy
 X = np.random.randn(4, 768)  # batch of 4 tokens, each 768-dim
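For reference, the linear layer this hunk describes as a runnable NumPy sketch; the 768-in/3072-out shapes are illustrative assumptions, not values fixed by the appendix:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 768))     # batch of 4 tokens, each 768-dim
W = rng.standard_normal((768, 3072))  # weight matrix: 768 inputs -> 3072 outputs (assumed)
b = np.zeros(3072)                    # bias vector, broadcast across the batch
Y = X @ W + b                         # the linear layer Y = XW + b
print(Y.shape)                        # (4, 3072)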
@@ -101,11 +101,11 @@ <h3>Matrices and Matrix Multiplication</h3>
 <div class="code-caption"><strong>Code Fragment A.2:</strong> This snippet demonstrates this approach using NumPy. Study the implementation details to understand how each component contributes to the overall computation. Understanding the underlying numerical operations helps build intuition for how the model processes data.</div>
 <h3>Transpose and Symmetry</h3>
 
-<p>The <strong>transpose</strong> of a matrix flips rows and columns: if <span class="math">A</span> has shape (m, n), then <span class="math">$A^T$</span> has shape (n, m). In attention, we compute <span class="math">$Q \cdot K^T$</span> because Q has shape (seq_len, d_k) and K has the same shape; transposing K to (d_k, seq_len) makes the multiplication produce a (seq_len, seq_len) attention matrix.</p>
+<p>The <strong>transpose</strong> of a matrix flips rows and columns: if <span class="math">$A$</span> has shape (m, n), then <span class="math">$A^T$</span> has shape (n, m). In attention, we compute <span class="math">$Q \cdot K^T$</span> because Q has shape (seq_len, d_k) and K has the same shape; transposing K to (d_k, seq_len) makes the multiplication produce a (seq_len, seq_len) attention matrix.</p>
 
 <h3>Eigenvalues and Eigenvectors</h3>
 
-<p>An <strong>eigenvector</strong> of a matrix <span class="math">A</span> is a vector <span class="math">v</span> such that multiplying by the matrix merely scales it:</p>
+<p>An <strong>eigenvector</strong> of a matrix <span class="math">$A$</span> is a vector <span class="math">$v$</span> such that multiplying by the matrix merely scales it:</p>
 
 <div class="math-block">
 $$A \cdot v = \lambda \cdot v$$
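A numerical check of the identity just above, using np.linalg.eig; the 2x2 symmetric matrix is an invented example (symmetry keeps the eigenpairs real):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])           # symmetric, so eigenvalues are real
lams, V = np.linalg.eig(A)           # columns of V are eigenvectors
v, lam = V[:, 0], lams[0]
print(np.allclose(A @ v, lam * v))   # True: multiplying by A merely scales v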
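And, from the transpose paragraph earlier in this hunk, a shape check of the $Q \cdot K^T$ argument; seq_len = 6 and d_k = 64 are assumed for illustration:

import numpy as np

rng = np.random.default_rng(1)
seq_len, d_k = 6, 64
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
scores = Q @ K.T                     # (6, 64) @ (64, 6) -> (6, 6)
print(K.T.shape, scores.shape)       # (64, 6) (6, 6): one score per token pair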
@@ -116,7 +116,7 @@ <h3>Eigenvalues and Eigenvectors</h3>
-<p>LoRA (Low-Rank Adaptation) freezes the original weight matrix and adds a small trainable update: <span class="math">$W' = W + B \cdot A$</span>, where <span class="math">B</span> is (d, r) and <span class="math">A</span> is (r, d) with rank <span class="math">$r << d$</span>. This works because weight updates during fine-tuning tend to live in a low-dimensional subspace, a fact rooted in the eigenstructure of the update matrices.</p>
+<p>LoRA (Low-Rank Adaptation) freezes the original weight matrix and adds a small trainable update: <span class="math">$W' = W + B \cdot A$</span>, where <span class="math">$B$</span> is (d, r) and <span class="math">$A$</span> is (r, d) with rank <span class="math">$r << d$</span>. This works because weight updates during fine-tuning tend to live in a low-dimensional subspace, a fact rooted in the eigenstructure of the update matrices.</p>
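A minimal sketch of the low-rank update $W' = W + B \cdot A$; the sizes, scale, and zero initialization of B are assumptions in the spirit of the LoRA paper, not the book's code:

import numpy as np

rng = np.random.default_rng(2)
d, r = 768, 8                        # assumed sizes, with rank r much smaller than d
W = rng.standard_normal((d, d))      # frozen pretrained weight
B = np.zeros((d, r))                 # B = 0 so the update starts at zero (assumed init)
A = rng.standard_normal((r, d)) * 0.01
W_adapted = W + B @ A                # rank-r correction; equals W at initialization
print(2 * d * r, "trainable params vs", d * d, "for full fine-tuning")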
@@ -209,7 +209,7 @@ <h3>Expected Value and Variance</h3>
 
 <div class="callout practical-example">
 <div class="callout-title">Practical Example: Temperature Sampling</div>
-<p>When generating text, the <strong>temperature</strong> parameter reshapes the probability distribution. Given logits <span class="math">z</span>, we compute <span class="math">$\operatorname{softmax}(z / T)$</span>. A temperature of 1.0 is the default distribution. Temperatures below 1.0 make the distribution sharper (more confident), while temperatures above 1.0 flatten it (more random). At <span class="math">$T \rightarrow 0$</span>, the model always picks the highest-probability token (greedy decoding). At <span class="math">$T \rightarrow \infty$</span>, all tokens become equally likely.</p>
+<p>When generating text, the <strong>temperature</strong> parameter reshapes the probability distribution. Given logits <span class="math">$z$</span>, we compute <span class="math">$\operatorname{softmax}(z / T)$</span>. A temperature of 1.0 is the default distribution. Temperatures below 1.0 make the distribution sharper (more confident), while temperatures above 1.0 flatten it (more random). At <span class="math">$T \rightarrow 0$</span>, the model always picks the highest-probability token (greedy decoding). At <span class="math">$T \rightarrow \infty$</span>, all tokens become equally likely.</p>
 </div>
 
 <h3>The Chain Rule of Probability and Autoregressive Generation</h3>
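A runnable sketch of the $\operatorname{softmax}(z / T)$ reshaping from the callout; the toy logits are assumptions:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift by max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # toy logits over three tokens
for T in (0.5, 1.0, 2.0):
    print(T, softmax(logits / T).round(3))
# T = 0.5 sharpens the distribution; T = 2.0 flattens it toward uniform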
@@ -237,15 +237,15 @@ <h3>Sampling Strategies: Top-k and Nucleus (Top-p)</h3>
 <p>Raw sampling from a model's full distribution often produces incoherent text because low-probability tokens accumulate enough mass to be selected occasionally. Two truncation strategies address this:</p>
 
 <ul>
-<li><strong>Top-k sampling:</strong> Keep only the <span class="math">k</span> highest-probability tokens, set the rest to zero, and renormalize. With <span class="math">$k = 50$</span>, the model chooses among its 50 best guesses. The limitation is that the right <span class="math">k</span> varies by context: after "The capital of France is" only one token is sensible, but after "I feel" many continuations are valid.</li>
-<li><strong>Nucleus (top-p) sampling:</strong> Instead of a fixed count, keep the smallest set of tokens whose cumulative probability exceeds a threshold <span class="math">p</span> (typically 0.9 or 0.95). This adapts the effective vocabulary size to the model's confidence at each step.</li>
+<li><strong>Top-k sampling:</strong> Keep only the <span class="math">$k$</span> highest-probability tokens, set the rest to zero, and renormalize. With <span class="math">$k = 50$</span>, the model chooses among its 50 best guesses. The limitation is that the right <span class="math">$k$</span> varies by context: after "The capital of France is" only one token is sensible, but after "I feel" many continuations are valid.</li>
+<li><strong>Nucleus (top-p) sampling:</strong> Instead of a fixed count, keep the smallest set of tokens whose cumulative probability exceeds a threshold <span class="math">$p$</span> (typically 0.9 or 0.95). This adapts the effective vocabulary size to the model's confidence at each step.</li>
 </ul>
 
 <p>These strategies are covered in depth with code examples in <a class="cross-ref" href="../../part-2-understanding-llms/module-08-reasoning-test-time-compute/section-8.1.html">Section 8.1</a>. Understanding them requires only the probability concepts above: truncating and renormalizing a distribution.</p>
 
 <h3>Monte Carlo Estimation</h3>
 
-<p><strong>Monte Carlo methods</strong> estimate expected values by averaging samples. If you cannot compute <span class="math">$E_{x \sim P}[f(x)]$</span> analytically (because the distribution is too complex), you draw <span class="math">N</span> samples and approximate:</p>
+<p><strong>Monte Carlo methods</strong> estimate expected values by averaging samples. If you cannot compute <span class="math">$E_{x \sim P}[f(x)]$</span> analytically (because the distribution is too complex), you draw <span class="math">$N$</span> samples and approximate:</p>
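A compact sketch of the two truncation strategies listed in this hunk, applied to a toy distribution; these helper functions are illustrative stand-ins, not the code from Section 8.1:

import numpy as np

def top_k_filter(probs, k):
    keep = np.argsort(probs)[-k:]          # indices of the k most probable tokens
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()                 # renormalize the survivors

def top_p_filter(probs, p):
    order = np.argsort(probs)[::-1]        # most probable first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1   # smallest prefix whose mass reaches p
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]
    return out / out.sum()

probs = np.array([0.55, 0.25, 0.10, 0.05, 0.03, 0.02])
print(top_k_filter(probs, 3))              # keeps the 3 best tokens
print(top_p_filter(probs, 0.9))            # also keeps 3 here: 0.55+0.25+0.10 = 0.90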
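The approximation itself (truncated from this diff) is the sample average (1/N) Σ f(x_i). A Monte Carlo sketch with f(x) = x² under a standard normal, where the true expectation is 1; the example is invented for illustration:

import numpy as np

rng = np.random.default_rng(3)
samples = rng.standard_normal(100_000)   # draws x ~ P (standard normal here)
estimate = np.mean(samples ** 2)         # (1/N) sum of f(x_i) with f(x) = x^2
print(estimate)                          # close to the true E[x^2] = 1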
@@ -255,7 +255,7 @@ <h3>Monte Carlo Estimation</h3>
 
 <h3>Confidence Intervals and Statistical Testing for Evaluation</h3>
 
-<p>When comparing two models on a benchmark, a 2% accuracy difference might be meaningful or just noise. <strong>Confidence intervals</strong> quantify this uncertainty. For a metric computed over <span class="math">N</span> test examples with sample mean <span class="math">$\bar{x}$</span> and standard deviation <span class="math">s</span>, the 95% confidence interval is approximately:</p>
+<p>When comparing two models on a benchmark, a 2% accuracy difference might be meaningful or just noise. <strong>Confidence intervals</strong> quantify this uncertainty. For a metric computed over <span class="math">$N$</span> test examples with sample mean <span class="math">$\bar{x}$</span> and standard deviation <span class="math">$s$</span>, the 95% confidence interval is approximately:</p>
 
 <div class="math-block">
 $$\bar{x} \pm 1.96 \cdot \frac{s}{\sqrt{N}}$$
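A sketch of this interval for a per-example accuracy metric; the simulated 0/1 outcomes are placeholders, not benchmark data:

import numpy as np

rng = np.random.default_rng(4)
correct = rng.random(1000) < 0.72        # simulated 0/1 outcomes on N = 1000 examples
x_bar = correct.mean()                   # sample mean (accuracy)
s = correct.std(ddof=1)                  # sample standard deviation
half = 1.96 * s / np.sqrt(correct.size)  # half-width of the 95% interval
print(f"accuracy {x_bar:.3f} +/- {half:.3f} (95% CI)")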
@@ -278,7 +278,7 @@ <h2>A.3 Calculus for Machine Learning</h2>
 
 <h3>Derivatives and Gradients</h3>
 
-<p>A <strong>derivative</strong> tells you how fast a function's output changes when you nudge its input. For a function <span class="math">$f(x)$</span>, the derivative <span class="math">$f'(x)$</span> or <span class="math">$df/dx$</span> is the slope of the function at point <span class="math">x</span>.</p>
+<p>A <strong>derivative</strong> tells you how fast a function's output changes when you nudge its input. For a function <span class="math">$f(x)$</span>, the derivative <span class="math">$f'(x)$</span> or <span class="math">$df/dx$</span> is the slope of the function at point <span class="math">$x$</span>.</p>
 
 <p>When a function has many inputs (as a loss function that depends on millions of weights), we compute <strong>partial derivatives</strong> with respect to each input. The collection of all partial derivatives is called the <strong>gradient</strong>:</p>
 
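A central-difference sketch of the derivative and gradient definitions in this hunk; the functions f and g are invented examples:

import numpy as np

def f(x):
    return x ** 2

x, h = 3.0, 1e-6
print((f(x + h) - f(x - h)) / (2 * h))   # ~6.0, matching f'(x) = 2x

def g(w):                                # a two-input function
    return w[0] ** 2 + 3 * w[1]

w, grad = np.array([2.0, -1.0]), np.zeros(2)
for i in range(2):                       # one partial derivative per input
    dh = np.zeros(2)
    dh[i] = h
    grad[i] = (g(w + dh) - g(w - dh)) / (2 * h)
print(grad)                              # ~[4., 3.], the gradient of g at w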
@@ -411,14 +411,14 @@ <h3>Entropy</h3>
 <div class="code-caption"><strong>Code Fragment A.5:</strong> This snippet demonstrates the entropy function using NumPy. The function encapsulates reusable logic that can be applied across different inputs. Understanding the underlying numerical operations helps build intuition for how the model processes data.</div>
 <h3>Cross-Entropy</h3>
 
-<p><strong>Cross-entropy</strong> measures how well a predicted distribution <span class="math">Q</span> matches a true distribution <span class="math">P</span>:</p>
+<p><strong>Cross-entropy</strong> measures how well a predicted distribution <span class="math">$Q$</span> matches a true distribution <span class="math">$P$</span>:</p>
 
 <div class="math-block">
 $$H(P, Q) = - \Sigma P(x) \cdot \log Q(x)$$
 
 </div>
 
-<p>This is the standard loss function for training language models. The "true distribution" <span class="math">P</span> is the one-hot vector for the actual next token (probability 1 for the correct token, 0 for everything else). The predicted distribution <span class="math">Q</span> is the model's softmax output. Minimizing cross-entropy loss means making the model assign higher probability to the correct next token.</p>
+<p>This is the standard loss function for training language models. The "true distribution" <span class="math">$P$</span> is the one-hot vector for the actual next token (probability 1 for the correct token, 0 for everything else). The predicted distribution <span class="math">$Q$</span> is the model's softmax output. Minimizing cross-entropy loss means making the model assign higher probability to the correct next token.</p>
 
 <div class="callout key-insight">
 <div class="callout-title">Key Insight: Cross-Entropy and Perplexity</div>
appendices/appendix-a-mathematical-foundations/section-a.1.html (4 additions, 4 deletions)
@@ -72,7 +72,7 @@ <h3>Matrices and Matrix Multiplication</h3>
 
 </div>
 
-<p>This formula describes a linear layer: input <span class="math">X</span> (a batch of vectors) is multiplied by weight matrix <span class="math">W</span>, then bias vector <span class="math">b</span> is added. Nearly every layer in a neural network begins with this operation. Code Fragment A.1.2 below puts this into practice.</p>
+<p>This formula describes a linear layer: input <span class="math">$X$</span> (a batch of vectors) is multiplied by weight matrix <span class="math">$W$</span>, then bias vector <span class="math">$b$</span> is added. Nearly every layer in a neural network begins with this operation. Code Fragment A.1.2 below puts this into practice.</p>
 
 <pre><code class="language-text"># Matrix multiplication in NumPy
 X = np.random.randn(4, 768)  # batch of 4 tokens, each 768-dim
@@ -84,11 +84,11 @@ <h3>Matrices and Matrix Multiplication</h3>
 <div class="code-caption"><strong>Code Fragment A.1.2:</strong> This snippet demonstrates this approach using NumPy. Study the implementation details to understand how each component contributes to the overall computation. Understanding the underlying numerical operations helps build intuition for how the model processes data.</div>
 <h3>Transpose and Symmetry</h3>
 
-<p>The <strong>transpose</strong> of a matrix flips rows and columns: if <span class="math">A</span> has shape (m, n), then <span class="math">$A^T$</span> has shape (n, m). In attention, we compute <span class="math">$Q \cdot K^T$</span> because Q has shape (seq_len, d_k) and K has the same shape; transposing K to (d_k, seq_len) makes the multiplication produce a (seq_len, seq_len) attention matrix.</p>
+<p>The <strong>transpose</strong> of a matrix flips rows and columns: if <span class="math">$A$</span> has shape (m, n), then <span class="math">$A^T$</span> has shape (n, m). In attention, we compute <span class="math">$Q \cdot K^T$</span> because Q has shape (seq_len, d_k) and K has the same shape; transposing K to (d_k, seq_len) makes the multiplication produce a (seq_len, seq_len) attention matrix.</p>
 
 <h3>Eigenvalues and Eigenvectors</h3>
 
-<p>An <strong>eigenvector</strong> of a matrix <span class="math">A</span> is a vector <span class="math">v</span> such that multiplying by the matrix merely scales it:</p>
+<p>An <strong>eigenvector</strong> of a matrix <span class="math">$A$</span> is a vector <span class="math">$v$</span> such that multiplying by the matrix merely scales it:</p>
 
 <div class="math-block">
 $$A \cdot v = \lambda \cdot v$$
@@ -99,7 +99,7 @@ <h3>Eigenvalues and Eigenvectors</h3>
-<p>LoRA (Low-Rank Adaptation) freezes the original weight matrix and adds a small trainable update: <span class="math">$W' = W + B \cdot A$</span>, where <span class="math">B</span> is (d, r) and <span class="math">A</span> is (r, d) with rank <span class="math">$r << d$</span>. This works because weight updates during fine-tuning tend to live in a low-dimensional subspace, a fact rooted in the eigenstructure of the update matrices.</p>
+<p>LoRA (Low-Rank Adaptation) freezes the original weight matrix and adds a small trainable update: <span class="math">$W' = W + B \cdot A$</span>, where <span class="math">$B$</span> is (d, r) and <span class="math">$A$</span> is (r, d) with rank <span class="math">$r << d$</span>. This works because weight updates during fine-tuning tend to live in a low-dimensional subspace, a fact rooted in the eigenstructure of the update matrices.</p>

-<p>When generating text, the <strong>temperature</strong> parameter reshapes the probability distribution. Given logits <span class="math">z</span>, we compute <span class="math">$\operatorname{softmax}(z / T)$</span>. A temperature of 1.0 is the default distribution. Temperatures below 1.0 make the distribution sharper (more confident), while temperatures above 1.0 flatten it (more random). At <span class="math">$T \rightarrow 0$</span>, the model always picks the highest-probability token (greedy decoding). At <span class="math">$T \rightarrow \infty$</span>, all tokens become equally likely.</p>
+<p>When generating text, the <strong>temperature</strong> parameter reshapes the probability distribution. Given logits <span class="math">$z$</span>, we compute <span class="math">$\operatorname{softmax}(z / T)$</span>. A temperature of 1.0 is the default distribution. Temperatures below 1.0 make the distribution sharper (more confident), while temperatures above 1.0 flatten it (more random). At <span class="math">$T \rightarrow 0$</span>, the model always picks the highest-probability token (greedy decoding). At <span class="math">$T \rightarrow \infty$</span>, all tokens become equally likely.</p>
appendices/appendix-a-mathematical-foundations/section-a.3.html (1 addition, 1 deletion)
@@ -36,7 +36,7 @@ <h1>A.3 Calculus for Machine Learning</h1>
 
 <h3>Derivatives and Gradients</h3>
 
-<p>A <strong>derivative</strong> tells you how fast a function's output changes when you nudge its input. For a function <span class="math">$f(x)$</span>, the derivative <span class="math">$f'(x)$</span> or <span class="math">$df/dx$</span> is the slope of the function at point <span class="math">x</span>.</p>
+<p>A <strong>derivative</strong> tells you how fast a function's output changes when you nudge its input. For a function <span class="math">$f(x)$</span>, the derivative <span class="math">$f'(x)$</span> or <span class="math">$df/dx$</span> is the slope of the function at point <span class="math">$x$</span>.</p>
 
 <p>When a function has many inputs (as a loss function that depends on millions of weights), we compute <strong>partial derivatives</strong> with respect to each input. The collection of all partial derivatives is called the <strong>gradient</strong>:</p>
appendices/appendix-a-mathematical-foundations/section-a.4.html (2 additions, 2 deletions)
@@ -68,14 +68,14 @@ <h3>Entropy</h3>
 <div class="code-caption"><strong>Code Fragment A.4.1:</strong> This snippet demonstrates the entropy function using <a href="https://numpy.org/" target="_blank" rel="noopener">NumPy</a>. The function encapsulates reusable logic that can be applied across different inputs. Understanding the underlying numerical operations helps build intuition for how the model processes data.</div>
 <h3>Cross-Entropy</h3>
 
-<p><strong><a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">Cross-entropy</a></strong> measures how well a predicted distribution <span class="math">Q</span> matches a true distribution <span class="math">P</span>:</p>
+<p><strong><a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">Cross-entropy</a></strong> measures how well a predicted distribution <span class="math">$Q$</span> matches a true distribution <span class="math">$P$</span>:</p>
 
 <div class="math-block">
 $$H(P, Q) = - \Sigma P(x) \cdot \log Q(x)$$
 
 </div>
 
-<p>This is the standard loss function for training language models. The "true distribution" <span class="math">P</span> is the one-hot vector for the actual next token (probability 1 for the correct token, 0 for everything else). The predicted distribution <span class="math">Q</span> is the model's <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">softmax</a> output. Minimizing cross-entropy loss means making the model assign higher probability to the correct next token.</p>
+<p>This is the standard loss function for training language models. The "true distribution" <span class="math">$P$</span> is the one-hot vector for the actual next token (probability 1 for the correct token, 0 for everything else). The predicted distribution <span class="math">$Q$</span> is the model's <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">softmax</a> output. Minimizing cross-entropy loss means making the model assign higher probability to the correct next token.</p>
 
 <div class="callout key-insight">
 <div class="callout-title">Key Insight: Cross-Entropy and <a class="cross-ref" href="../appendix-b-ml-essentials/section-b.4.html">Perplexity</a></div>