appendices/appendix-a-mathematical-foundations/index.html (12 additions, 12 deletions)
@@ -89,7 +89,7 @@ <h3>Matrices and Matrix Multiplication</h3>
 
 </div>
 
-<p>This formula describes a linear layer: input <span class="math">X</span> (a batch of vectors) is multiplied by weight matrix <span class="math">W</span>, then bias vector <span class="math">b</span> is added. Nearly every layer in a neural network begins with this operation. Code Fragment A.2 below puts this into practice.</p>
+<p>This formula describes a linear layer: input <span class="math">$X$</span> (a batch of vectors) is multiplied by weight matrix <span class="math">$W$</span>, then bias vector <span class="math">$b$</span> is added. Nearly every layer in a neural network begins with this operation. Code Fragment A.2 below puts this into practice.</p>
 
 <pre><code class="language-text"># Matrix multiplication in NumPy
 X = np.random.randn(4, 768)  # batch of 4 tokens, each 768-dim
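For reference, the linear layer this hunk describes as a runnable NumPy sketch; the 768-in/3072-out shapes are illustrative assumptions, not values fixed by the appendix:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 768))     # batch of 4 tokens, each 768-dim
W = rng.standard_normal((768, 3072))  # weight matrix: 768 inputs -> 3072 outputs (assumed)
b = np.zeros(3072)                    # bias vector, broadcast across the batch
Y = X @ W + b                         # the linear layer Y = XW + b
print(Y.shape)                        # (4, 3072)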
@@ -101,11 +101,11 @@ <h3>Matrices and Matrix Multiplication</h3>
 <div class="code-caption"><strong>Code Fragment A.2:</strong> This snippet demonstrates this approach using NumPy. Study the implementation details to understand how each component contributes to the overall computation. Understanding the underlying numerical operations helps build intuition for how the model processes data.</div>
 <h3>Transpose and Symmetry</h3>
 
-<p>The <strong>transpose</strong> of a matrix flips rows and columns: if <span class="math">A</span> has shape (m, n), then <span class="math">$A^T$</span> has shape (n, m). In attention, we compute <span class="math">$Q \cdot K^T$</span> because Q has shape (seq_len, d_k) and K has the same shape; transposing K to (d_k, seq_len) makes the multiplication produce a (seq_len, seq_len) attention matrix.</p>
+<p>The <strong>transpose</strong> of a matrix flips rows and columns: if <span class="math">$A$</span> has shape (m, n), then <span class="math">$A^T$</span> has shape (n, m). In attention, we compute <span class="math">$Q \cdot K^T$</span> because Q has shape (seq_len, d_k) and K has the same shape; transposing K to (d_k, seq_len) makes the multiplication produce a (seq_len, seq_len) attention matrix.</p>
 
 <h3>Eigenvalues and Eigenvectors</h3>
 
-<p>An <strong>eigenvector</strong> of a matrix <span class="math">A</span> is a vector <span class="math">v</span> such that multiplying by the matrix merely scales it:</p>
+<p>An <strong>eigenvector</strong> of a matrix <span class="math">$A$</span> is a vector <span class="math">$v$</span> such that multiplying by the matrix merely scales it:</p>
 
 <div class="math-block">
 $$A \cdot v = \lambda \cdot v$$
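A numerical check of the identity just above, using np.linalg.eig; the 2x2 symmetric matrix is an invented example (symmetry keeps the eigenpairs real):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])           # symmetric, so eigenvalues are real
lams, V = np.linalg.eig(A)           # columns of V are eigenvectors
v, lam = V[:, 0], lams[0]
print(np.allclose(A @ v, lam * v))   # True: multiplying by A merely scales v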
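And, from the transpose paragraph earlier in this hunk, a shape check of the $Q \cdot K^T$ argument; seq_len = 6 and d_k = 64 are assumed for illustration:

import numpy as np

rng = np.random.default_rng(1)
seq_len, d_k = 6, 64
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
scores = Q @ K.T                     # (6, 64) @ (64, 6) -> (6, 6)
print(K.T.shape, scores.shape)       # (64, 6) (6, 6): one score per token pair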
@@ -116,7 +116,7 @@ <h3>Eigenvalues and Eigenvectors</h3>
-<p>LoRA (Low-Rank Adaptation) freezes the original weight matrix and adds a small trainable update: <span class="math">$W' = W + B \cdot A$</span>, where <span class="math">B</span> is (d, r) and <span class="math">A</span> is (r, d) with rank <span class="math">$r << d$</span>. This works because weight updates during fine-tuning tend to live in a low-dimensional subspace, a fact rooted in the eigenstructure of the update matrices.</p>
+<p>LoRA (Low-Rank Adaptation) freezes the original weight matrix and adds a small trainable update: <span class="math">$W' = W + B \cdot A$</span>, where <span class="math">$B$</span> is (d, r) and <span class="math">$A$</span> is (r, d) with rank <span class="math">$r << d$</span>. This works because weight updates during fine-tuning tend to live in a low-dimensional subspace, a fact rooted in the eigenstructure of the update matrices.</p>
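A minimal sketch of the low-rank update $W' = W + B \cdot A$; the sizes, scale, and zero initialization of B are assumptions in the spirit of the LoRA paper, not the book's code:

import numpy as np

rng = np.random.default_rng(2)
d, r = 768, 8                        # assumed sizes, with rank r much smaller than d
W = rng.standard_normal((d, d))      # frozen pretrained weight
B = np.zeros((d, r))                 # B = 0 so the update starts at zero (assumed init)
A = rng.standard_normal((r, d)) * 0.01
W_adapted = W + B @ A                # rank-r correction; equals W at initialization
print(2 * d * r, "trainable params vs", d * d, "for full fine-tuning")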
@@ -209,7 +209,7 @@ <h3>Expected Value and Variance</h3>
 
 <div class="callout practical-example">
 <div class="callout-title">Practical Example: Temperature Sampling</div>
-<p>When generating text, the <strong>temperature</strong> parameter reshapes the probability distribution. Given logits <span class="math">z</span>, we compute <span class="math">$\operatorname{softmax}(z / T)$</span>. A temperature of 1.0 is the default distribution. Temperatures below 1.0 make the distribution sharper (more confident), while temperatures above 1.0 flatten it (more random). At <span class="math">$T \rightarrow 0$</span>, the model always picks the highest-probability token (greedy decoding). At <span class="math">$T \rightarrow \infty$</span>, all tokens become equally likely.</p>
+<p>When generating text, the <strong>temperature</strong> parameter reshapes the probability distribution. Given logits <span class="math">$z$</span>, we compute <span class="math">$\operatorname{softmax}(z / T)$</span>. A temperature of 1.0 is the default distribution. Temperatures below 1.0 make the distribution sharper (more confident), while temperatures above 1.0 flatten it (more random). At <span class="math">$T \rightarrow 0$</span>, the model always picks the highest-probability token (greedy decoding). At <span class="math">$T \rightarrow \infty$</span>, all tokens become equally likely.</p>
 </div>
 
 <h3>The Chain Rule of Probability and Autoregressive Generation</h3>
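A runnable sketch of the $\operatorname{softmax}(z / T)$ reshaping from the callout; the toy logits are assumptions:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift by max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # toy logits over three tokens
for T in (0.5, 1.0, 2.0):
    print(T, softmax(logits / T).round(3))
# T = 0.5 sharpens the distribution; T = 2.0 flattens it toward uniform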
@@ -237,15 +237,15 @@ <h3>Sampling Strategies: Top-k and Nucleus (Top-p)</h3>
 <p>Raw sampling from a model's full distribution often produces incoherent text because low-probability tokens accumulate enough mass to be selected occasionally. Two truncation strategies address this:</p>
 
 <ul>
-<li><strong>Top-k sampling:</strong> Keep only the <span class="math">k</span> highest-probability tokens, set the rest to zero, and renormalize. With <span class="math">$k = 50$</span>, the model chooses among its 50 best guesses. The limitation is that the right <span class="math">k</span> varies by context: after "The capital of France is" only one token is sensible, but after "I feel" many continuations are valid.</li>
-<li><strong>Nucleus (top-p) sampling:</strong> Instead of a fixed count, keep the smallest set of tokens whose cumulative probability exceeds a threshold <span class="math">p</span> (typically 0.9 or 0.95). This adapts the effective vocabulary size to the model's confidence at each step.</li>
+<li><strong>Top-k sampling:</strong> Keep only the <span class="math">$k$</span> highest-probability tokens, set the rest to zero, and renormalize. With <span class="math">$k = 50$</span>, the model chooses among its 50 best guesses. The limitation is that the right <span class="math">$k$</span> varies by context: after "The capital of France is" only one token is sensible, but after "I feel" many continuations are valid.</li>
+<li><strong>Nucleus (top-p) sampling:</strong> Instead of a fixed count, keep the smallest set of tokens whose cumulative probability exceeds a threshold <span class="math">$p$</span> (typically 0.9 or 0.95). This adapts the effective vocabulary size to the model's confidence at each step.</li>
 </ul>
 
 <p>These strategies are covered in depth with code examples in <a class="cross-ref" href="../../part-2-understanding-llms/module-08-reasoning-test-time-compute/section-8.1.html">Section 8.1</a>. Understanding them requires only the probability concepts above: truncating and renormalizing a distribution.</p>
 
 <h3>Monte Carlo Estimation</h3>
 
-<p><strong>Monte Carlo methods</strong> estimate expected values by averaging samples. If you cannot compute <span class="math">$E_{x \sim P}[f(x)]$</span> analytically (because the distribution is too complex), you draw <span class="math">N</span> samples and approximate:</p>
+<p><strong>Monte Carlo methods</strong> estimate expected values by averaging samples. If you cannot compute <span class="math">$E_{x \sim P}[f(x)]$</span> analytically (because the distribution is too complex), you draw <span class="math">$N$</span> samples and approximate:</p>
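A compact sketch of the two truncation strategies listed in this hunk, applied to a toy distribution; these helper functions are illustrative stand-ins, not the code from Section 8.1:

import numpy as np

def top_k_filter(probs, k):
    keep = np.argsort(probs)[-k:]          # indices of the k most probable tokens
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()                 # renormalize the survivors

def top_p_filter(probs, p):
    order = np.argsort(probs)[::-1]        # most probable first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1   # smallest prefix whose mass reaches p
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]
    return out / out.sum()

probs = np.array([0.55, 0.25, 0.10, 0.05, 0.03, 0.02])
print(top_k_filter(probs, 3))              # keeps the 3 best tokens
print(top_p_filter(probs, 0.9))            # also keeps 3 here: 0.55+0.25+0.10 = 0.90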
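The approximation itself (truncated from this diff) is the sample average (1/N) Σ f(x_i). A Monte Carlo sketch with f(x) = x² under a standard normal, where the true expectation is 1; the example is invented for illustration:

import numpy as np

rng = np.random.default_rng(3)
samples = rng.standard_normal(100_000)   # draws x ~ P (standard normal here)
estimate = np.mean(samples ** 2)         # (1/N) sum of f(x_i) with f(x) = x^2
print(estimate)                          # close to the true E[x^2] = 1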
@@ -255,7 +255,7 @@ <h3>Monte Carlo Estimation</h3>
 
 <h3>Confidence Intervals and Statistical Testing for Evaluation</h3>
 
-<p>When comparing two models on a benchmark, a 2% accuracy difference might be meaningful or just noise. <strong>Confidence intervals</strong> quantify this uncertainty. For a metric computed over <span class="math">N</span> test examples with sample mean <span class="math">$\bar{x}$</span> and standard deviation <span class="math">s</span>, the 95% confidence interval is approximately:</p>
+<p>When comparing two models on a benchmark, a 2% accuracy difference might be meaningful or just noise. <strong>Confidence intervals</strong> quantify this uncertainty. For a metric computed over <span class="math">$N$</span> test examples with sample mean <span class="math">$\bar{x}$</span> and standard deviation <span class="math">$s$</span>, the 95% confidence interval is approximately:</p>
 
 <div class="math-block">
 $$\bar{x} \pm 1.96 \cdot \frac{s}{\sqrt{N}}$$
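A sketch of this interval for a per-example accuracy metric; the simulated 0/1 outcomes are placeholders, not benchmark data:

import numpy as np

rng = np.random.default_rng(4)
correct = rng.random(1000) < 0.72        # simulated 0/1 outcomes on N = 1000 examples
x_bar = correct.mean()                   # sample mean (accuracy)
s = correct.std(ddof=1)                  # sample standard deviation
half = 1.96 * s / np.sqrt(correct.size)  # half-width of the 95% interval
print(f"accuracy {x_bar:.3f} +/- {half:.3f} (95% CI)")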
@@ -278,7 +278,7 @@ <h2>A.3 Calculus for Machine Learning</h2>
 
 <h3>Derivatives and Gradients</h3>
 
-<p>A <strong>derivative</strong> tells you how fast a function's output changes when you nudge its input. For a function <span class="math">$f(x)$</span>, the derivative <span class="math">$f'(x)$</span> or <span class="math">$df/dx$</span> is the slope of the function at point <span class="math">x</span>.</p>
+<p>A <strong>derivative</strong> tells you how fast a function's output changes when you nudge its input. For a function <span class="math">$f(x)$</span>, the derivative <span class="math">$f'(x)$</span> or <span class="math">$df/dx$</span> is the slope of the function at point <span class="math">$x$</span>.</p>
 
 <p>When a function has many inputs (as a loss function that depends on millions of weights), we compute <strong>partial derivatives</strong> with respect to each input. The collection of all partial derivatives is called the <strong>gradient</strong>:</p>
 
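A central-difference sketch of the derivative and gradient definitions in this hunk; the functions f and g are invented examples:

import numpy as np

def f(x):
    return x ** 2

x, h = 3.0, 1e-6
print((f(x + h) - f(x - h)) / (2 * h))   # ~6.0, matching f'(x) = 2x

def g(w):                                # a two-input function
    return w[0] ** 2 + 3 * w[1]

w, grad = np.array([2.0, -1.0]), np.zeros(2)
for i in range(2):                       # one partial derivative per input
    dh = np.zeros(2)
    dh[i] = h
    grad[i] = (g(w + dh) - g(w - dh)) / (2 * h)
print(grad)                              # ~[4., 3.], the gradient of g at w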
@@ -411,14 +411,14 @@ <h3>Entropy</h3>
 <div class="code-caption"><strong>Code Fragment A.5:</strong> This snippet demonstrates the entropy function using NumPy. The function encapsulates reusable logic that can be applied across different inputs. Understanding the underlying numerical operations helps build intuition for how the model processes data.</div>
 <h3>Cross-Entropy</h3>
 
-<p><strong>Cross-entropy</strong> measures how well a predicted distribution <span class="math">Q</span> matches a true distribution <span class="math">P</span>:</p>
+<p><strong>Cross-entropy</strong> measures how well a predicted distribution <span class="math">$Q$</span> matches a true distribution <span class="math">$P$</span>:</p>
 
 <div class="math-block">
 $$H(P, Q) = - \Sigma P(x) \cdot \log Q(x)$$
 
 </div>
 
-<p>This is the standard loss function for training language models. The "true distribution" <span class="math">P</span> is the one-hot vector for the actual next token (probability 1 for the correct token, 0 for everything else). The predicted distribution <span class="math">Q</span> is the model's softmax output. Minimizing cross-entropy loss means making the model assign higher probability to the correct next token.</p>
+<p>This is the standard loss function for training language models. The "true distribution" <span class="math">$P$</span> is the one-hot vector for the actual next token (probability 1 for the correct token, 0 for everything else). The predicted distribution <span class="math">$Q$</span> is the model's softmax output. Minimizing cross-entropy loss means making the model assign higher probability to the correct next token.</p>
 
 <div class="callout key-insight">
 <div class="callout-title">Key Insight: Cross-Entropy and Perplexity</div>
appendices/appendix-a-mathematical-foundations/section-a.1.html (4 additions, 4 deletions)
@@ -72,7 +72,7 @@ <h3>Matrices and Matrix Multiplication</h3>
 
 </div>
 
-<p>This formula describes a linear layer: input <span class="math">X</span> (a batch of vectors) is multiplied by weight matrix <span class="math">W</span>, then bias vector <span class="math">b</span> is added. Nearly every layer in a neural network begins with this operation. Code Fragment A.1.2 below puts this into practice.</p>
+<p>This formula describes a linear layer: input <span class="math">$X$</span> (a batch of vectors) is multiplied by weight matrix <span class="math">$W$</span>, then bias vector <span class="math">$b$</span> is added. Nearly every layer in a neural network begins with this operation. Code Fragment A.1.2 below puts this into practice.</p>
 
 <pre><code class="language-text"># Matrix multiplication in NumPy
 X = np.random.randn(4, 768)  # batch of 4 tokens, each 768-dim
@@ -84,11 +84,11 @@ <h3>Matrices and Matrix Multiplication</h3>
 <div class="code-caption"><strong>Code Fragment A.1.2:</strong> This snippet demonstrates this approach using NumPy. Study the implementation details to understand how each component contributes to the overall computation. Understanding the underlying numerical operations helps build intuition for how the model processes data.</div>
 <h3>Transpose and Symmetry</h3>
 
-<p>The <strong>transpose</strong> of a matrix flips rows and columns: if <span class="math">A</span> has shape (m, n), then <span class="math">$A^T$</span> has shape (n, m). In attention, we compute <span class="math">$Q \cdot K^T$</span> because Q has shape (seq_len, d_k) and K has the same shape; transposing K to (d_k, seq_len) makes the multiplication produce a (seq_len, seq_len) attention matrix.</p>
+<p>The <strong>transpose</strong> of a matrix flips rows and columns: if <span class="math">$A$</span> has shape (m, n), then <span class="math">$A^T$</span> has shape (n, m). In attention, we compute <span class="math">$Q \cdot K^T$</span> because Q has shape (seq_len, d_k) and K has the same shape; transposing K to (d_k, seq_len) makes the multiplication produce a (seq_len, seq_len) attention matrix.</p>
 
 <h3>Eigenvalues and Eigenvectors</h3>
 
-<p>An <strong>eigenvector</strong> of a matrix <span class="math">A</span> is a vector <span class="math">v</span> such that multiplying by the matrix merely scales it:</p>
+<p>An <strong>eigenvector</strong> of a matrix <span class="math">$A$</span> is a vector <span class="math">$v$</span> such that multiplying by the matrix merely scales it:</p>
 
 <div class="math-block">
 $$A \cdot v = \lambda \cdot v$$
@@ -99,7 +99,7 @@ <h3>Eigenvalues and Eigenvectors</h3>
-<p>LoRA (Low-Rank Adaptation) freezes the original weight matrix and adds a small trainable update: <span class="math">$W' = W + B \cdot A$</span>, where <span class="math">B</span> is (d, r) and <span class="math">A</span> is (r, d) with rank <span class="math">$r << d$</span>. This works because weight updates during fine-tuning tend to live in a low-dimensional subspace, a fact rooted in the eigenstructure of the update matrices.</p>
+<p>LoRA (Low-Rank Adaptation) freezes the original weight matrix and adds a small trainable update: <span class="math">$W' = W + B \cdot A$</span>, where <span class="math">$B$</span> is (d, r) and <span class="math">$A$</span> is (r, d) with rank <span class="math">$r << d$</span>. This works because weight updates during fine-tuning tend to live in a low-dimensional subspace, a fact rooted in the eigenstructure of the update matrices.</p>

-<p>When generating text, the <strong>temperature</strong> parameter reshapes the probability distribution. Given logits <span class="math">z</span>, we compute <span class="math">$\operatorname{softmax}(z / T)$</span>. A temperature of 1.0 is the default distribution. Temperatures below 1.0 make the distribution sharper (more confident), while temperatures above 1.0 flatten it (more random). At <span class="math">$T \rightarrow 0$</span>, the model always picks the highest-probability token (greedy decoding). At <span class="math">$T \rightarrow \infty$</span>, all tokens become equally likely.</p>
+<p>When generating text, the <strong>temperature</strong> parameter reshapes the probability distribution. Given logits <span class="math">$z$</span>, we compute <span class="math">$\operatorname{softmax}(z / T)$</span>. A temperature of 1.0 is the default distribution. Temperatures below 1.0 make the distribution sharper (more confident), while temperatures above 1.0 flatten it (more random). At <span class="math">$T \rightarrow 0$</span>, the model always picks the highest-probability token (greedy decoding). At <span class="math">$T \rightarrow \infty$</span>, all tokens become equally likely.</p>
appendices/appendix-a-mathematical-foundations/section-a.3.html (1 addition, 1 deletion)
@@ -36,7 +36,7 @@ <h1>A.3 Calculus for Machine Learning</h1>
 
 <h3>Derivatives and Gradients</h3>
 
-<p>A <strong>derivative</strong> tells you how fast a function's output changes when you nudge its input. For a function <span class="math">$f(x)$</span>, the derivative <span class="math">$f'(x)$</span> or <span class="math">$df/dx$</span> is the slope of the function at point <span class="math">x</span>.</p>
+<p>A <strong>derivative</strong> tells you how fast a function's output changes when you nudge its input. For a function <span class="math">$f(x)$</span>, the derivative <span class="math">$f'(x)$</span> or <span class="math">$df/dx$</span> is the slope of the function at point <span class="math">$x$</span>.</p>
 
 <p>When a function has many inputs (as a loss function that depends on millions of weights), we compute <strong>partial derivatives</strong> with respect to each input. The collection of all partial derivatives is called the <strong>gradient</strong>:</p>
appendices/appendix-a-mathematical-foundations/section-a.4.html (2 additions, 2 deletions)
@@ -68,14 +68,14 @@ <h3>Entropy</h3>
 <div class="code-caption"><strong>Code Fragment A.4.1:</strong> This snippet demonstrates the entropy function using <a href="https://numpy.org/" target="_blank" rel="noopener">NumPy</a>. The function encapsulates reusable logic that can be applied across different inputs. Understanding the underlying numerical operations helps build intuition for how the model processes data.</div>
 <h3>Cross-Entropy</h3>
 
-<p><strong><a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">Cross-entropy</a></strong> measures how well a predicted distribution <span class="math">Q</span> matches a true distribution <span class="math">P</span>:</p>
+<p><strong><a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">Cross-entropy</a></strong> measures how well a predicted distribution <span class="math">$Q$</span> matches a true distribution <span class="math">$P$</span>:</p>
 
 <div class="math-block">
 $$H(P, Q) = - \Sigma P(x) \cdot \log Q(x)$$
 
 </div>
 
-<p>This is the standard loss function for training language models. The "true distribution" <span class="math">P</span> is the one-hot vector for the actual next token (probability 1 for the correct token, 0 for everything else). The predicted distribution <span class="math">Q</span> is the model's <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">softmax</a> output. Minimizing cross-entropy loss means making the model assign higher probability to the correct next token.</p>
+<p>This is the standard loss function for training language models. The "true distribution" <span class="math">$P$</span> is the one-hot vector for the actual next token (probability 1 for the correct token, 0 for everything else). The predicted distribution <span class="math">$Q$</span> is the model's <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">softmax</a> output. Minimizing cross-entropy loss means making the model assign higher probability to the correct next token.</p>
 
 <div class="callout key-insight">
 <div class="callout-title">Key Insight: Cross-Entropy and <a class="cross-ref" href="../appendix-b-ml-essentials/section-b.4.html">Perplexity</a></div>