|
193 | 193 | "\n", |
194 | 194 | "When we minimizing the divergence with respect to the model parameters $\\theta$, we notice the following:\n", |
195 | 195 | "* **The Entropy Term:** $\\mathbb{E}_{\\mathbf{x} \\sim P_{data}} [\\log P_{data}(\\mathbf{x})]$ depends only on the true data. It is a constant with respect to $\\theta$ and does not change during training.\n", |
196 | | - "* **The Empirical Estimate:** Since we don't know $P_{data}$ exactly, we estimate the second term using our samples $\\mathcal{D}$ because we know the data points are sampled from the true distribution:\n", |
| 196 | + "* **The Empirical Estimate:** Since we don't know $P_{data}$ exactly, we estimate the second term using our samples $\\mathcal{D}$ because we know the data points are sampled from the true distribution.\n", |
197 | 197 | " $$\\mathbb{E}_{\\mathbf{x} \\sim P_{data}} [\\log P_{\\theta}(\\mathbf{x})] \\approx \\frac{1}{n} \\sum_{i=1}^{n} \\log P_{\\theta}(\\mathbf{x}^{(i)})$$\n", |
198 | 198 | "\n", |
199 | 199 | "Therefore, minimizing the divergence is equivalent to maximizing the average log-likelihood:\n", |
200 | 200 | "$$\\arg \\min_{\\theta} D_{KL}(P_{data} \\parallel P_{\\theta}) = \\arg \\max_{\\theta} \\sum_{i=1}^{n} \\log P_{\\theta}(\\mathbf{x}^{(i)})$$\n", |
201 | 201 | "\n", |
| 202 | + "Here, we assume each of our $n$ samples was sampled uniformly from the distribution, meaning each has a probability mass of $1/n$. However, the true probablity is not necessarily $1/n$ for each individual sample, making the entire objective of MLE or minimizing KL divergence only an estimate.\n", |
| 203 | + "\n", |
202 | 204 | "## The Lower Bound (Data Entropy)\n", |
203 | 205 | "Note that even in a \"perfect\" training scenario where the model perfectly matches the empirical data, the log-likelihood does not necessarily reach zero. It is bounded by the **Entropy of the data**, $H(P_{data})$. This represents the inherent \"noise\" or uncertainty in the data that no model can eliminate." |
204 | 206 | ] |
205 | 207 | }, |
| 208 | + { |
| 209 | + "cell_type": "markdown", |
| 210 | + "id": "b9e55612", |
| 211 | + "metadata": {}, |
| 212 | + "source": [ |
| 213 | + " # Explicit Distribution Learning\n", |
| 214 | + "Explicit distribution learning means that we learn the distribution and have a clear expression for it (through a neural network). Clearly, modeling data distribution as a whole is computationally infeasible because the model needs to cover an exponentially large data-space and direct sampling could be exponentially hard. Therefore, we use methods like MLE and minimizing KL divergence to estimate the distribution. The objective can be achieved through 3 different modelling apporaches\n", |
| 215 | + "1. Autoregressive Modelling\n", |
| 216 | + "2. Energy-based Modeling\n", |
| 217 | + "3. Flow-based Modeling\n", |
| 218 | + "\n", |
| 219 | + "# Explicit Distribution Learning\n", |
| 220 | + "Explicit distribution learning involves the construction of a generative model that provides an explicit functional form for the probability density function (or mass function), denoted as $p_{\\theta}(\\mathbf{x})$. Unlike implicit models (e.g., GANs), which only provide a mechanism to sample data, explicit models allow for the direct evaluation of the likelihood of a given observation.\n", |
| 221 | + "\n", |
| 222 | + "## The Fundamental Challenge: Intractability and Normalization\n", |
| 223 | + "Modeling a high-dimensional data distribution $p_{data}(\\mathbf{x})$ (where $\\mathbf{x} \\in \\mathbb{R}^D$) is computationally intensive due to the **curse of dimensionality**. A valid probability distribution must satisfy the normalization constraint:\n", |
| 224 | + "$$\\int_{\\mathbf{x}} p_{\\theta}(\\mathbf{x}) d\\mathbf{x} = 1$$\n", |
| 225 | + "In high-dimensional spaces, calculating the denominator (the partition function) required to normalize a neural network's output is usually exponentially hard. Consequently, explicit modeling focuses on architectures that either bypass, simplify, or approximate this normalization while maintaining a clear expression for the density.\n", |
| 226 | + "\n", |
| 227 | + "The standard approach to estimate the distribution based through methods like MLE (minimizing KL divergence). In practice, the MLE objective is ususally implemented through 3 different modelling apporaches\n", |
| 228 | + "\n", |
| 229 | + "1. Autoregressive Modelling\n", |
| 230 | + "2. Energy-based Modeling\n", |
| 231 | + "3. Flow-based Modeling" |
| 232 | + ] |
| 233 | + }, |
| 234 | + { |
| 235 | + "cell_type": "markdown", |
| 236 | + "id": "bccceb15", |
| 237 | + "metadata": {}, |
| 238 | + "source": [ |
| 239 | + "## Autoregressive (AR) Modeling\n", |
| 240 | + "Autoregressive (AR) models belong to the class of **explicit density generative models**. They decompose the high-dimensional joint probability distribution of a data point into a sequence of conditional distributions. This structure allows for exact likelihood computation and stable training via MLE.\n", |
| 241 | + "\n", |
| 242 | + "\n", |
| 243 | + "The core principle of AR modeling is the application of the **Probability Chain Rule**. A $d$-dimensional data sample $\\mathbf{x} = (x_1, x_2, \\dots, x_d)$ is represented as a product of $d$ univariate conditional distributions.\n", |
| 244 | + "\n", |
| 245 | + "For any specific component $x_i$ within a sample, the model predicts the probability of that component given all preceding components:\n", |
| 246 | + "$$p(x_i | \\mathbf{x}_{<i}) = p(x_i | x_1, x_2, \\dots, x_{i-1})$$\n", |
| 247 | + "\n", |
| 248 | + "The entire sample's probability is the product of these conditionals:\n", |
| 249 | + "$$p(\\mathbf{x}) = \\prod_{i=1}^{d} p(x_i | \\mathbf{x}_{<i})$$\n", |
| 250 | + "By converting a complex $d$-dimensional joint distribution into $d$ one-dimensional distributions, the model simplifies the learning task while maintaining an **explicit** representation of the distributions for the sample.\n", |
| 251 | + "\n", |
| 252 | + "AR models assume a fixed ordering of dimensions (e.g., raster scan for images, left-to-right for text). The model assumes that $x_i$ depends only on the observed values $\\mathbf{x}_{<i}$, satisfying the **causal constraint**.\n", |
| 253 | + "\n", |
| 254 | + "### Architectural Pipeline\n", |
| 255 | + "The transformation from input to probability follows this general flow:\n", |
| 256 | + "1. **Input Context:** Previous elements $\\mathbf{x}_{<i}$ are fed into the model.\n", |
| 257 | + "2. **Neural Network Feature Extractor:** A backbone architecture (e.g., RNN, LSTM, Transformer, or Causal CNN) processes the sequence to produce a hidden representation $h_i$.\n", |
| 258 | + "3. **Output Head:**\n", |
| 259 | + " * **Discrete Data (e.g., Text):** A Softmax layer over a vocabulary.\n", |
| 260 | + " * **Continuous Data (e.g., Audio/Images):** A parametric distribution, such as a **Gaussian Mixture Model (GMM)** or a discretized logistic distribution, where the network predicts parameters $\\mu_i$ and $\\sigma_i$.\n", |
| 261 | + "4. **Sampling:** A value is sampled from $p(x_i | \\mathbf{x}_{<i})$ and appended to the context for the next step $i+1$. The first value $p(x_0)$ is sampled from the prior after training.\n", |
| 262 | + "\n", |
| 263 | + "### MLE Objective and Loss Function\n", |
| 264 | + "AR models are trained by maximizing the log-likelihood of the training data $\\mathcal{D}$. This is equivalent to minimizing the **Cross-Entropy Loss**:\n", |
| 265 | + "$$\\mathcal{L}(\\theta) = -\\sum_{\\mathbf{x} \\in \\mathcal{D}} \\sum_{i=1}^{d} \\log p_{\\theta}(x_i | \\mathbf{x}_{<i})$$\n", |
| 266 | + "\n", |
| 267 | + "### Teacher Forcing\n", |
| 268 | + "During training, we use **Teacher Forcing**. Instead of feeding the model's own (potentially erroneous) predictions from step $i-1$ into step $i$, we provide the **ground truth** values from the training set. This allows for parallelization during training (especially in Transformers and Causal CNNs) because all $x_i$ conditionals can be computed simultaneously, which also leads to faster convergence.\n", |
| 269 | + "\n", |
| 270 | + "### Challenges\n", |
| 271 | + "1. **Exposure Bias**: A significant drawback of Teacher Forcing is **Exposure Bias**. During training, the model only sees ground truth context. During inference (generation), it sees its own generated (noisy) samples. Errors accumulate over time, leading to a divergence between the training and testing distributions.\n", |
| 272 | + "\n", |
| 273 | + "2. **Sampling Bottleneck**: While training of AR models can be parallelized, **inference is inherently sequential and recursive** since any future data depends on the past data. To generate $x_{100}$, the model must first generate and process $x_1$ through $x_{99}$. The computational complexity of sampling a single sequence is $O(d)$, where $d$ is the number of dimensions/tokens. This makes AR models significantly slower at test-time compared to parallel generative models like GANs or non-autoregressive Flows." |
| 274 | + ] |
| 275 | + }, |
206 | 276 | { |
207 | 277 | "cell_type": "code", |
208 | 278 | "execution_count": null, |
209 | | - "id": "96176061", |
| 279 | + "id": "252a9932", |
210 | 280 | "metadata": {}, |
211 | 281 | "outputs": [], |
212 | 282 | "source": [] |
|