
Commit a637992

Added AR models
1 parent 1b20a8b commit a637992

1 file changed

Lines changed: 72 additions & 2 deletions

File tree

notebooks/Data Generation.ipynb renamed to notebooks/Explicit Distribution Learning.ipynb

Original file line number | Diff line number | Diff line change
@@ -193,20 +193,90 @@
193193
"\n",
194194
"When we minimizing the divergence with respect to the model parameters $\\theta$, we notice the following:\n",
195195
"* **The Entropy Term:** $\\mathbb{E}_{\\mathbf{x} \\sim P_{data}} [\\log P_{data}(\\mathbf{x})]$ depends only on the true data. It is a constant with respect to $\\theta$ and does not change during training.\n",
196-
"* **The Empirical Estimate:** Since we don't know $P_{data}$ exactly, we estimate the second term using our samples $\\mathcal{D}$ because we know the data points are sampled from the true distribution:\n",
196+
"* **The Empirical Estimate:** Since we don't know $P_{data}$ exactly, we estimate the second term using our samples $\\mathcal{D}$ because we know the data points are sampled from the true distribution.\n",
197197
" $$\\mathbb{E}_{\\mathbf{x} \\sim P_{data}} [\\log P_{\\theta}(\\mathbf{x})] \\approx \\frac{1}{n} \\sum_{i=1}^{n} \\log P_{\\theta}(\\mathbf{x}^{(i)})$$\n",
198198
"\n",
199199
"Therefore, minimizing the divergence is equivalent to maximizing the average log-likelihood:\n",
200200
"$$\\arg \\min_{\\theta} D_{KL}(P_{data} \\parallel P_{\\theta}) = \\arg \\max_{\\theta} \\sum_{i=1}^{n} \\log P_{\\theta}(\\mathbf{x}^{(i)})$$\n",
201201
"\n",
202+
"Here, we assume each of our $n$ samples was sampled uniformly from the distribution, meaning each has a probability mass of $1/n$. However, the true probablity is not necessarily $1/n$ for each individual sample, making the entire objective of MLE or minimizing KL divergence only an estimate.\n",
203+
"\n",
202204
"## The Lower Bound (Data Entropy)\n",
203205
"Note that even in a \"perfect\" training scenario where the model perfectly matches the empirical data, the log-likelihood does not necessarily reach zero. It is bounded by the **Entropy of the data**, $H(P_{data})$. This represents the inherent \"noise\" or uncertainty in the data that no model can eliminate."
204206
]
205207
},
208+
{
209+
"cell_type": "markdown",
210+
"id": "b9e55612",
211+
"metadata": {},
212+
"source": [
219+
"# Explicit Distribution Learning\n",
220+
"Explicit distribution learning involves the construction of a generative model that provides an explicit functional form for the probability density function (or mass function), denoted as $p_{\\theta}(\\mathbf{x})$. Unlike implicit models (e.g., GANs), which only provide a mechanism to sample data, explicit models allow for the direct evaluation of the likelihood of a given observation.\n",
221+
"\n",
222+
"## The Fundamental Challenge: Intractability and Normalization\n",
223+
"Modeling a high-dimensional data distribution $p_{data}(\\mathbf{x})$ (where $\\mathbf{x} \\in \\mathbb{R}^D$) is computationally intensive due to the **curse of dimensionality**. A valid probability distribution must satisfy the normalization constraint:\n",
224+
"$$\\int_{\\mathbf{x}} p_{\\theta}(\\mathbf{x}) d\\mathbf{x} = 1$$\n",
225+
"In high-dimensional spaces, calculating the denominator (the partition function) required to normalize a neural network's output is usually exponentially hard. Consequently, explicit modeling focuses on architectures that either bypass, simplify, or approximate this normalization while maintaining a clear expression for the density.\n",
226+
"\n",
227+
"The standard approach to estimate the distribution based through methods like MLE (minimizing KL divergence). In practice, the MLE objective is ususally implemented through 3 different modelling apporaches\n",
228+
"\n",
229+
"1. Autoregressive Modelling\n",
230+
"2. Energy-based Modeling\n",
231+
"3. Flow-based Modeling"
232+
]
233+
},
234+
{
235+
"cell_type": "markdown",
236+
"id": "bccceb15",
237+
"metadata": {},
238+
"source": [
239+
"## Autoregressive (AR) Modeling\n",
240+
"Autoregressive (AR) models belong to the class of **explicit density generative models**. They decompose the high-dimensional joint probability distribution of a data point into a sequence of conditional distributions. This structure allows for exact likelihood computation and stable training via MLE.\n",
241+
"\n",
242+
"\n",
243+
"The core principle of AR modeling is the application of the **Probability Chain Rule**. A $d$-dimensional data sample $\\mathbf{x} = (x_1, x_2, \\dots, x_d)$ is represented as a product of $d$ univariate conditional distributions.\n",
244+
"\n",
245+
"For any specific component $x_i$ within a sample, the model predicts the probability of that component given all preceding components:\n",
246+
"$$p(x_i | \\mathbf{x}_{<i}) = p(x_i | x_1, x_2, \\dots, x_{i-1})$$\n",
247+
"\n",
248+
"The entire sample's probability is the product of these conditionals:\n",
249+
"$$p(\\mathbf{x}) = \\prod_{i=1}^{d} p(x_i | \\mathbf{x}_{<i})$$\n",
250+
"By converting a complex $d$-dimensional joint distribution into $d$ one-dimensional distributions, the model simplifies the learning task while maintaining an **explicit** representation of the distributions for the sample.\n",
251+
"\n",
252+
"AR models assume a fixed ordering of dimensions (e.g., raster scan for images, left-to-right for text). The model assumes that $x_i$ depends only on the observed values $\\mathbf{x}_{<i}$, satisfying the **causal constraint**.\n",
253+
"\n",
254+
"### Architectural Pipeline\n",
255+
"The transformation from input to probability follows this general flow:\n",
256+
"1. **Input Context:** Previous elements $\\mathbf{x}_{<i}$ are fed into the model.\n",
257+
"2. **Neural Network Feature Extractor:** A backbone architecture (e.g., RNN, LSTM, Transformer, or Causal CNN) processes the sequence to produce a hidden representation $h_i$.\n",
258+
"3. **Output Head:**\n",
259+
" * **Discrete Data (e.g., Text):** A Softmax layer over a vocabulary.\n",
260+
" * **Continuous Data (e.g., Audio/Images):** A parametric distribution, such as a **Gaussian Mixture Model (GMM)** or a discretized logistic distribution, where the network predicts parameters $\\mu_i$ and $\\sigma_i$.\n",
261+
"4. **Sampling:** A value is sampled from $p(x_i | \\mathbf{x}_{<i})$ and appended to the context for the next step $i+1$. The first value $p(x_0)$ is sampled from the prior after training.\n",
262+
"\n",
263+
"### MLE Objective and Loss Function\n",
264+
"AR models are trained by maximizing the log-likelihood of the training data $\\mathcal{D}$. This is equivalent to minimizing the **Cross-Entropy Loss**:\n",
265+
"$$\\mathcal{L}(\\theta) = -\\sum_{\\mathbf{x} \\in \\mathcal{D}} \\sum_{i=1}^{d} \\log p_{\\theta}(x_i | \\mathbf{x}_{<i})$$\n",
266+
"\n",
267+
"### Teacher Forcing\n",
268+
"During training, we use **Teacher Forcing**. Instead of feeding the model's own (potentially erroneous) predictions from step $i-1$ into step $i$, we provide the **ground truth** values from the training set. This allows for parallelization during training (especially in Transformers and Causal CNNs) because all $x_i$ conditionals can be computed simultaneously, which also leads to faster convergence.\n",
269+
"\n",
270+
"### Challenges\n",
271+
"1. **Exposure Bias**: A significant drawback of Teacher Forcing is **Exposure Bias**. During training, the model only sees ground truth context. During inference (generation), it sees its own generated (noisy) samples. Errors accumulate over time, leading to a divergence between the training and testing distributions.\n",
272+
"\n",
273+
"2. **Sampling Bottleneck**: While training of AR models can be parallelized, **inference is inherently sequential and recursive** since any future data depends on the past data. To generate $x_{100}$, the model must first generate and process $x_1$ through $x_{99}$. The computational complexity of sampling a single sequence is $O(d)$, where $d$ is the number of dimensions/tokens. This makes AR models significantly slower at test-time compared to parallel generative models like GANs or non-autoregressive Flows."
274+
]
275+
},
206276
{
207277
"cell_type": "code",
208278
"execution_count": null,
209-
"id": "96176061",
279+
"id": "252a9932",
210280
"metadata": {},
211281
"outputs": [],
212282
"source": []
