|
193 | 193 | "\n", |
194 | 194 | "When we minimizing the divergence with respect to the model parameters $\\theta$, we notice the following:\n", |
195 | 195 | "* **The Entropy Term:** $\\mathbb{E}_{\\mathbf{x} \\sim P_{data}} [\\log P_{data}(\\mathbf{x})]$ depends only on the true data. It is a constant with respect to $\\theta$ and does not change during training.\n", |
196 | | - "* **The Empirical Estimate:** Since we don't know $P_{data}$ exactly, we estimate the second term using our samples $\\mathcal{D}$ because we know the data points are sampled from the true distribution:\n", |
| 196 | + "* **The Empirical Estimate:** Since we don't know $P_{data}$ exactly, we estimate the second term using our samples $\\mathcal{D}$ because we know the data points are sampled from the true distribution.\n", |
197 | 197 | " $$\\mathbb{E}_{\\mathbf{x} \\sim P_{data}} [\\log P_{\\theta}(\\mathbf{x})] \\approx \\frac{1}{n} \\sum_{i=1}^{n} \\log P_{\\theta}(\\mathbf{x}^{(i)})$$\n", |
198 | 198 | "\n", |
199 | 199 | "Therefore, minimizing the divergence is equivalent to maximizing the average log-likelihood:\n", |
200 | 200 | "$$\\arg \\min_{\\theta} D_{KL}(P_{data} \\parallel P_{\\theta}) = \\arg \\max_{\\theta} \\sum_{i=1}^{n} \\log P_{\\theta}(\\mathbf{x}^{(i)})$$\n", |
201 | 201 | "\n", |
| 202 | + "Here, we assume each of our $n$ samples was sampled uniformly from the distribution, meaning each has a probability mass of $1/n$. However, the true probablity is not necessarily $1/n$ for each individual sample, making the entire objective of MLE or minimizing KL divergence only an estimate.\n", |
| 203 | + "\n", |
202 | 204 | "## The Lower Bound (Data Entropy)\n", |
203 | 205 | "Note that even in a \"perfect\" training scenario where the model perfectly matches the empirical data, the log-likelihood does not necessarily reach zero. It is bounded by the **Entropy of the data**, $H(P_{data})$. This represents the inherent \"noise\" or uncertainty in the data that no model can eliminate." |
204 | 206 | ] |
205 | 207 | }, |
| 208 | + { |
| 209 | + "cell_type": "markdown", |
| 210 | + "id": "b9e55612", |
| 211 | + "metadata": {}, |
| 212 | + "source": [ |
| 213 | + " # Explicit Distribution Learning\n", |
| 214 | + "Explicit distribution learning means that we learn the distribution and have a clear expression for it (through a neural network). Clearly, modeling data distribution as a whole is computationally infeasible because the model needs to cover an exponentially large data-space and direct sampling could be exponentially hard. Therefore, we use methods like MLE and minimizing KL divergence to estimate the distribution. The objective can be achieved through 3 different modelling apporaches\n", |
| 215 | + "1. Autoregressive Modelling\n", |
| 216 | + "2. Energy-based Modeling\n", |
| 217 | + "3. Flow-based Modeling\n", |
| 218 | + "\n", |
| 219 | + "# Explicit Distribution Learning\n", |
| 220 | + "Explicit distribution learning involves the construction of a generative model that provides an explicit functional form for the probability density function (or mass function), denoted as $p_{\\theta}(\\mathbf{x})$. Unlike implicit models (e.g., GANs), which only provide a mechanism to sample data, explicit models allow for the direct evaluation of the likelihood of a given observation.\n", |
| 221 | + "\n", |
| 222 | + "## The Fundamental Challenge: Intractability and Normalization\n", |
| 223 | + "Modeling a high-dimensional data distribution $p_{data}(\\mathbf{x})$ (where $\\mathbf{x} \\in \\mathbb{R}^D$) is computationally intensive due to the **curse of dimensionality**. A valid probability distribution must satisfy the normalization constraint:\n", |
| 224 | + "$$\\int_{\\mathbf{x}} p_{\\theta}(\\mathbf{x}) d\\mathbf{x} = 1$$\n", |
| 225 | + "In high-dimensional spaces, calculating the denominator (the partition function) required to normalize a neural network's output is usually exponentially hard. Consequently, explicit modeling focuses on architectures that either bypass, simplify, or approximate this normalization while maintaining a clear expression for the density.\n", |
| 226 | + "\n", |
| 227 | + "The standard approach to estimate the distribution based through methods like MLE (minimizing KL divergence). In practice, the MLE objective is ususally implemented through 3 different modelling apporaches\n", |
| 228 | + "\n", |
| 229 | + "1. Autoregressive Modelling\n", |
| 230 | + "2. Energy-based Modeling\n", |
| 231 | + "3. Flow-based Modeling" |
| 232 | + ] |
| 233 | + }, |
| 234 | + { |
| 235 | + "cell_type": "markdown", |
| 236 | + "id": "bccceb15", |
| 237 | + "metadata": {}, |
| 238 | + "source": [ |
| 239 | + "## Autoregressive (AR) Modeling\n", |
| 240 | + "Autoregressive (AR) models belong to the class of **explicit density generative models**. They decompose the high-dimensional joint probability distribution of a data point into a sequence of conditional distributions. This structure allows for exact likelihood computation and stable training via MLE.\n", |
| 241 | + "\n", |
| 242 | + "\n", |
| 243 | + "The core principle of AR modeling is the application of the **Probability Chain Rule**. A $d$-dimensional data sample $\\mathbf{x} = (x_1, x_2, \\dots, x_d)$ is represented as a product of $d$ univariate conditional distributions.\n", |
| 244 | + "\n", |
| 245 | + "For any specific component $x_i$ within a sample, the model predicts the probability of that component given all preceding components:\n", |
| 246 | + "$$p(x_i | \\mathbf{x}_{<i}) = p(x_i | x_1, x_2, \\dots, x_{i-1})$$\n", |
| 247 | + "\n", |
| 248 | + "The entire sample's probability is the product of these conditionals:\n", |
| 249 | + "$$p(\\mathbf{x}) = \\prod_{i=1}^{d} p(x_i | \\mathbf{x}_{<i})$$\n", |
| 250 | + "By converting a complex $d$-dimensional joint distribution into $d$ one-dimensional distributions, the model simplifies the learning task while maintaining an **explicit** representation of the distributions for the sample.\n", |
| 251 | + "\n", |
| 252 | + "AR models assume a fixed ordering of dimensions (e.g., raster scan for images, left-to-right for text). The model assumes that $x_i$ depends only on the observed values $\\mathbf{x}_{<i}$, satisfying the **causal constraint**.\n", |
| 253 | + "\n", |
| 254 | + "### Architectural Pipeline\n", |
| 255 | + "The transformation from input to probability follows this general flow:\n", |
| 256 | + "1. **Input Context:** Previous elements $\\mathbf{x}_{<i}$ are fed into the model.\n", |
| 257 | + "2. **Neural Network Feature Extractor:** A backbone architecture (e.g., RNN, LSTM, Transformer, or Causal CNN) processes the sequence to produce a hidden representation $h_i$.\n", |
| 258 | + "3. **Output Head:**\n", |
| 259 | + " * **Discrete Data (e.g., Text):** A Softmax layer over a vocabulary.\n", |
| 260 | + " * **Continuous Data (e.g., Audio/Images):** A parametric distribution, such as a **Gaussian Mixture Model (GMM)** or a discretized logistic distribution, where the network predicts parameters $\\mu_i$ and $\\sigma_i$.\n", |
| 261 | + "4. **Sampling:** A value is sampled from $p(x_i | \\mathbf{x}_{<i})$ and appended to the context for the next step $i+1$. The first value $p(x_0)$ is sampled from the prior after training.\n", |
| 262 | + "\n", |
| 263 | + "### MLE Objective and Loss Function\n", |
| 264 | + "AR models are trained by maximizing the log-likelihood of the training data $\\mathcal{D}$. This is equivalent to minimizing the **Cross-Entropy Loss**:\n", |
| 265 | + "$$\\mathcal{L}(\\theta) = -\\sum_{\\mathbf{x} \\in \\mathcal{D}} \\sum_{i=1}^{d} \\log p_{\\theta}(x_i | \\mathbf{x}_{<i})$$\n", |
| 266 | + "\n", |
| 267 | + "### Teacher Forcing\n", |
| 268 | + "During training, we use **Teacher Forcing**. Instead of feeding the model's own (potentially erroneous) predictions from step $i-1$ into step $i$, we provide the **ground truth** values from the training set. This allows for parallelization during training (especially in Transformers and Causal CNNs) because all $x_i$ conditionals can be computed simultaneously, which also leads to faster convergence.\n", |
| 269 | + "\n", |
| 270 | + "### Challenges\n", |
| 271 | + "1. **Exposure Bias**: A significant drawback of Teacher Forcing is **Exposure Bias**. During training, the model only sees ground truth context. During inference (generation), it sees its own generated (noisy) samples. Errors accumulate over time, leading to a divergence between the training and testing distributions.\n", |
| 272 | + "\n", |
| 273 | + "2. **Sampling Bottleneck**: While training of AR models can be parallelized, **inference is inherently sequential and recursive** since any future data depends on the past data. To generate $x_{100}$, the model must first generate and process $x_1$ through $x_{99}$. The computational complexity of sampling a single sequence is $O(d)$, where $d$ is the number of dimensions/tokens. This makes AR models significantly slower at test-time compared to parallel generative models like GANs or non-autoregressive Flows." |
| 274 | + ] |
| 275 | + }, |
206 | 276 | { |
207 | 277 | "cell_type": "code", |
208 | 278 | "execution_count": null, |
209 | | - "id": "96176061", |
| 279 | + "id": "252a9932", |
210 | 280 | "metadata": {}, |
211 | 281 | "outputs": [], |
212 | 282 | "source": [] |
|