Probabilistic PCA on the MNIST dataset

Principal Component Analysis aims to describe as much of the variance in a dataset as possible using orthogonal combinations of features, which makes it well suited to handling multi-collinearity, at the cost of feature interpretability.

In Bengio, Goodfellow and Courville (2017)*, PCA is used to describe the data distribution. Under a Gaussian assumption, it generates samples assuming an ellipsoidal distribution of the data.

Indeed, the covariance matrix of the data can be recovered from the PCA. For instance, it can be used as a regularisation technique when fitting a covariance matrix in a Gaussian mixture model.
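
As a quick illustration of this point, the covariance matrix can be rebuilt from the principal directions and the explained variances; a minimal numpy sketch, not taken from the repository (all names and values are illustrative) :

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))            # toy data, one observation per row
Xc = X - X.mean(axis=0)                   # centre the features

# PCA through the SVD of the centred data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_var = s**2 / (X.shape[0] - 1)   # variance carried by each component

# rank-q approximation of the covariance matrix from the first q components
q = 3
cov_q = (Vt[:q].T * explained_var[:q]) @ Vt[:q]

# keeping all components recovers the empirical covariance
cov_full = (Vt.T * explained_var) @ Vt
print(np.allclose(cov_full, np.cov(X, rowvar=False)))
```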

Pushing this idea further, PCA was later given a probabilistic optimisation framework and interpretation (Tipping & Bishop, 1999*), which allows for a fast approximation.

Probabilistic PCA (PPCA) has other applications, such as handling missing data and enabling maximum-likelihood comparison with other probabilistic models.

Project Overview :

  • Implement Singular Value Decomposition algorithms :
    • Using the QR algorithm on the covariance matrix,
    • Using the Golub-Kahan algorithm,
  • Implement the EM algorithm in the Probabilistic PCA framework,
  • Apply the PCA to generate MNIST digits

Results :

With first SVD components :

With Probabilistic PCA :

TODO :

  • Implement MLE estimation of PPCA with EM algorithm
  • Benchmark execution time between naive SVD and Golub-Kahan algorithm

Using Singular Value Decomposition

The naive SVD algorithm, the pure QR algorithm with Householder reflections and the bidiagonalisation algorithm are taken from Trefethen & Bau*, while the SVD implementation with Givens rotations is from Golub & Van Loan*.
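
As a rough sketch of the "naive" route (this is not the repository's implementation, which relies on Householder reflections rather than `numpy.linalg.qr`), the SVD can be obtained from a QR iteration on the covariance matrix $A^T A$ :

```python
import numpy as np

def qr_algorithm(C, n_iter=500):
    """Unshifted QR iteration on a symmetric matrix C.
    Returns approximate eigenvalues and the accumulated orthogonal factor."""
    Q_total = np.eye(C.shape[0])
    for _ in range(n_iter):
        Q, R = np.linalg.qr(C)
        C = R @ Q                      # similarity transform: same spectrum
        Q_total = Q_total @ Q          # columns converge to eigenvectors
    return np.diag(C), Q_total

def naive_svd(A):
    """SVD of A through the eigendecomposition of the covariance matrix A^T A."""
    eigvals, V = qr_algorithm(A.T @ A)
    order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
    eigvals, V = eigvals[order], V[:, order]
    s = np.sqrt(np.clip(eigvals, 0.0, None))     # singular values
    U = A @ V / s                                # left singular vectors
    return U, s, V.T

A = np.random.default_rng(1).normal(size=(8, 5))
U, s, Vt = naive_svd(A)
print(np.abs(U * s @ Vt - A).max())              # reconstruction error
```

Forming $A^T A$ squares the condition number, which is one reason the Golub-Kahan bidiagonalisation is usually preferred.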

The bidiagonalisation is computed with full orthogonal matrices, that is to say for $A \in \mathbb{R}^{n \times m}$ :

$$ A = U B V^T $$

where $U\in \mathbb{R}^{n \times n}$ and $V \in \mathbb{R}^{m \times m}$ .

A reduced form can also be obtained; for instance if $m > n$, the matrices take the form :

$$ \begin{cases} U\in \mathbb{R}^{n \times n} \\ B\in \mathbb{R}^{n \times n} \\ V\in \mathbb{R}^{m \times n} \end{cases} $$

But in doing so, $V$ is no longer orthogonal.

The matrices $A$ and $V^T$ are decomposed into square blocks $\tilde{A}, \tilde{V}^T$ and remainder blocks $R, Z^T$ :

$$ \begin{cases} A = \begin{pmatrix} \tilde{A} & R \end{pmatrix} \\ V^T = \begin{pmatrix} \tilde{V}^T & Z^T \end{pmatrix} \end{cases} $$

while the standard Golub bidiagonalisation algorithm is applied to $\tilde{A}$, giving $\tilde{A} = U \tilde{B} \tilde{V}^T$ and therefore :

$$ A = \begin{pmatrix} U \tilde{B} \tilde{V}^T & R \end{pmatrix} = U \tilde{B} \begin{pmatrix} \tilde{V}^T & Z^T \end{pmatrix} $$

and $Z$ can be left to compute at the end by solving the system :

$$ \tilde{B} Z^T = U^T R $$

where $\tilde{B}$ is square and upper bi-diagonal.

A similar system is solved if $A$ is thin instead of wide, with the same complexity.
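
A minimal sketch of that final solve step, assuming the bidiagonalisation of the square block has already produced $U$ and $\tilde{B}$ (the matrices below are random stand-ins, not the repository's code); the banded structure makes each right-hand side an $O(n)$ back-substitution :

```python
import numpy as np
from scipy.linalg import solve_banded

n, m = 5, 9                                  # wide matrix: n rows, m > n columns
rng = np.random.default_rng(0)

# Stand-ins for the outputs of the bidiagonalisation of the square block A~
U = np.linalg.qr(rng.normal(size=(n, n)))[0]                       # orthogonal
B = np.triu(np.tril(rng.normal(size=(n, n)) + 5 * np.eye(n), 1))   # upper bidiagonal
R = rng.normal(size=(n, m - n))                                    # leftover block of A

# Solve B~ Z^T = U^T R exploiting the two-diagonal structure
ab = np.zeros((2, n))
ab[0, 1:] = np.diag(B, k=1)   # superdiagonal
ab[1, :] = np.diag(B)         # main diagonal
Zt = solve_banded((0, 1), ab, U.T @ R)

print(np.allclose(B @ Zt, U.T @ R))
```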

Using the EM algorithm

Using the same notation as in Tipping and Bishop's paper*, the probabilistic framework is :

$$ t = W x + \mu + \epsilon $$

where $t \in \mathbb{R}^d$ is the observed variable and $x \in \mathbb{R}^q$ is the latent variable, with $q \ll d$, a standard Gaussian prior $x \sim \mathcal{N}(0, I)$ and isotropic noise $\epsilon \sim \mathcal{N} (0, \sigma^2 I)$.
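
Read as a generative recipe, the model samples a latent code, maps it through $W$ and adds isotropic noise; a minimal sketch with made-up parameter values (none of these numbers come from the repository) :

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, N = 784, 10, 1000           # e.g. flattened MNIST digits, 10 latent dimensions

W = rng.normal(size=(d, q))       # factor loadings (illustrative values)
mu = rng.normal(size=d)           # data mean
sigma2 = 0.1                      # isotropic noise variance

# x ~ N(0, I_q), eps ~ N(0, sigma^2 I_d), t = W x + mu + eps
x = rng.normal(size=(N, q))
eps = np.sqrt(sigma2) * rng.normal(size=(N, d))
t = x @ W.T + mu + eps

# Marginally t ~ N(mu, C) with C = W W^T + sigma^2 I
C = W @ W.T + sigma2 * np.eye(d)
```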

In the following section, the reasoning is detailed, as in the paper, with extra intermediate steps.

The complete-data log-likelihood of the observed and latent variables is then, up to additive constants :

$$\begin{align*} \mathcal{L}(x, t) = &\sum_{i=1}^N \log p(x_i, t_i) \\ = & -\frac{1}{2} \sum_{i=1}^N \left\{ d \log \sigma^2 + \frac{1}{\sigma^2} (t_i-\mu - Wx_i)^T( t_i - \mu - Wx_i) + x_i^Tx_i \right\} \\ = & -\frac{1}{2} \sum_{i=1}^N \left\{ d \log \sigma^2 + \frac{1}{\sigma^2} \text{tr} \left((t_i-\mu) (t_i-\mu)^T\right) + \frac{1}{\sigma^2}\text{tr} \left( W^T W x_i x_i^T\right) - \frac{2}{\sigma^2}(t_i-\mu)^TWx_i + x_i^Tx_i \right\} \end{align*}$$

E step

Taking the expectation of the log-likelihood (conditioned on $t, W$ and $\sigma^2$) gives :

$$\begin{align*} \mathbb{E} \left\{ \mathcal{L(x, t)} | t, W, \sigma^2 \right\} = -\sum_{i=1}^N &\left\{ \frac{d}{2} \log \sigma^2 + \frac{1}{2\sigma^2} \text{tr} \left( (t_i-\mu) (t_i-\mu)^T\right) + \frac{1}{2\sigma^2}\text{tr} \left( W^T W \langle x_i x_i^T \rangle \right) \right. \\ & \left.- \frac{1}{\sigma^2}\text{tr} \left( W\langle x_i \rangle (t_i-\mu)^T \right) + \frac{1}{2} \text{tr} \left( \langle x_i x_i^T \rangle \right) \right\} \end{align*}$$

The final form of this conditional expectation is denoted $\text{el(t, x)}$.

M step

Differentiating $\text{el(t, x)}$ with respect to the model parameters gives :

$$\frac{\partial \text{el(t, x)}}{\partial W} = - \frac{1}{\sigma^2} \sum_{n=1}^N \left\{ W \langle x_n x_n^T \rangle - (t_n - \mu)\langle x_n \rangle^T \right\}$$

and

$$\frac{\partial \text{el(t, x)}}{\partial \sigma^2} = - \frac{1}{\sigma^2} \sum_{n=1}^N \left\{ \frac{d}{2} - \frac{1}{2\sigma^2} \text{tr} \left( (t_n - \mu)(t_n - \mu)^T \right) - \frac{1}{2\sigma^2} \text{tr} \left( W^T W \langle x_n x_n^T \rangle \right) + \frac{1}{\sigma^2} \text{tr} \left( W \langle x_n \rangle (t_n - \mu)^T \right) \right\}$$

Hence the new parameters are given by :

$$\begin{cases} \tilde{W} = \left( \sum_{n=1}^N (t_n - \mu) \langle x_n \rangle^T \right) \left( \sum_{n=1}^N \langle x_n x_n^T \rangle \right)^{-1} \\\ \tilde{\sigma}^2 = \frac{1}{d} \left\{ \text{tr}(S) + \text{tr} \left( \tilde{W} ^T \tilde{W} \frac{1}{N} \sum_{n=1}^N \langle x_n x_n^T \rangle \right) - 2 \text{tr} \left( \tilde{W} \frac{1}{N} \sum_{n=1}^N \langle x_n \rangle (t_n - \mu)^T \right) \right\} \end{cases}$$

Where $S = \frac{1}{N} \sum_{n=1}^N (t_n - \mu) (t_n - \mu)^T$ is the empirical covariance of the observations.

Both expectations $\langle x_n \rangle$ and $\langle x_n x_n^T \rangle$ for any $n$ are analytic given $x | t$ is gaussian.

Indeed, Bayes' formula gives :

$$\begin{align*} p(x|t) & = \frac{p(t|x)p(x)}{p(t)} \\ & \propto \exp \left( - \frac{1}{2\sigma^2} \left\{ (t - \mu - Wx)^T (t - \mu - Wx) + \sigma^2 x^T x \right\} \right) \\ & \propto \exp \left( - \frac{1}{2\sigma^2} \left(x- M^{-1} W^{T}(t-\mu)\right)^T M \left(x- M^{-1} W^{T}(t-\mu)\right) \right) \end{align*}$$

which, by completing the square in $x$, shows that $x | t \sim \mathcal{N}\left(M^{-1} W^{T}(t-\mu), \sigma^2 M^{-1} \right)$, where $M = W^TW + \sigma^2 I$.

From there :

$$\begin{cases} \langle x_n \rangle = M^{-1} W^T (t_n - \mu) \\\ \langle x_n x_n^T \rangle = Var(x_n |t_n, W, \sigma^2) + \langle x_n \rangle \langle x_n \rangle^T = \sigma^2 M^{-1} + M^{-1} W^T (t_n - \mu)(t_n - \mu)^T W M^{-1} \end{cases}$$
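
In code, these two posterior statistics are all the E step has to produce; a minimal numpy sketch of the formulas above (function and variable names are my own, not necessarily the repository's) :

```python
import numpy as np

def e_step(T, W, mu, sigma2):
    """Posterior statistics <x_n> and <x_n x_n^T>, one observation per row of T."""
    q = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(q)
    Minv = np.linalg.inv(M)
    # rows of Xmean are <x_n> = M^-1 W^T (t_n - mu)
    Xmean = (T - mu) @ W @ Minv
    # <x_n x_n^T> = sigma^2 M^-1 + <x_n> <x_n>^T, stacked into an (N, q, q) array
    Xcov = sigma2 * Minv + Xmean[:, :, None] * Xmean[:, None, :]
    return Xmean, Xcov
```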

Substituting these expectations simplifies the formulas for the optimal parameters as follows :

$$\begin{align*} \tilde{W} & = \left( \sum_{n=1}^N (t_n - \mu) (t_n - \mu)^T W M^{-1} \right) \left( N\sigma^2 M^{-1} + M^{-1} W^T \sum_{n=1}^N (t_n - \mu)(t_n - \mu)^T W M^{-1} \right)^{-1} \\ & = SWM^{-1} \left( \sigma^2 M^{-1} + M^{-1} W^T S W M^{-1} \right)^{-1} \\ & = SW \left( \sigma^2 I + M^{-1} W^T S W\right)^{-1} \end{align*}$$

and :

$$\begin{align*} \tilde{\sigma}^2 = & \frac{1}{d} \left\{ \text{tr}(S) + \text{tr} \left( \tilde{W}^T \tilde{W} \sigma^2 M^{-1} + \tilde{W}^T \tilde{W} M^{-1} W^T \frac{1}{N} \sum_{n=1}^N (t_n - \mu)(t_n - \mu)^T W M^{-1} \right) \right. \\ & \left. - 2 \text{tr} \left( \tilde{W} \frac{1}{N} \sum_{n=1}^N M^{-1} W^T (t_n - \mu) (t_n - \mu)^T \right) \right\} \\\ \tilde{\sigma}^2 = & \frac{1}{d} \left\{ \text{tr}(S) + \text{tr} \left( \sigma^2\tilde{W}^T \tilde{W} M^{-1} + \tilde{W}^T \tilde{W} M^{-1} W^T S W M^{-1} \right) - 2 \text{tr} \left( \tilde{W} M^{-1} W^T S \right) \right\} \\\ = & \frac{1}{d} \left\{ \text{tr}(S) + \text{tr} \left( \tilde{W}^T \tilde{W} M^{-1} \left( \sigma^2 I + W^T S W M^{-1} \right) \right) - 2 \text{tr} \left( S W M^{-1} \tilde{W}^T \right) \right\} \\\ = & \frac{1}{d} \left\{ \text{tr}(S) + \text{tr} \left( \tilde{W} M^{-1} \left( \sigma^2 I + W^T S W M^{-1} \right) \tilde{W}^T \right) - 2 \text{tr} \left( S W M^{-1} \tilde{W}^T \right) \right\} \\\ = & \frac{1}{d} \left\{ \text{tr}(S) + \text{tr} \left( \tilde{W} M^{-1} \left( \underbrace{\tilde{W} \left( \sigma^2 I + M^{-1} W^T S W \right)}_{SW} \right)^T \right) - 2 \text{tr} \left( S W M^{-1} \tilde{W}^T \right) \right\} \end{align*}$$

and eventually :

$$\tilde{\sigma}^2 = \frac{1}{d} \text{tr}(S - S W M^{-1}\tilde{W}^T)$$
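
Putting the two updates together gives a compact EM loop over the empirical covariance $S$; a minimal sketch of how it could look (function name, initialisation and iteration count are illustrative choices, not necessarily the repository's) :

```python
import numpy as np

def ppca_em(T, q, n_iter=100, seed=0):
    """EM for probabilistic PCA using the covariance-based updates above.
    T holds one observation per row; returns W, sigma^2 and the data mean."""
    N, d = T.shape
    rng = np.random.default_rng(seed)
    mu = T.mean(axis=0)
    S = (T - mu).T @ (T - mu) / N                 # empirical covariance
    W = rng.normal(size=(d, q))                   # random initialisation
    sigma2 = 1.0
    for _ in range(n_iter):
        M = W.T @ W + sigma2 * np.eye(q)
        Minv = np.linalg.inv(M)
        SW = S @ W
        # W~ = S W (sigma^2 I + M^-1 W^T S W)^-1
        W_new = SW @ np.linalg.inv(sigma2 * np.eye(q) + Minv @ W.T @ SW)
        # sigma~^2 = tr(S - S W M^-1 W~^T) / d
        sigma2 = np.trace(S - SW @ Minv @ W_new.T) / d
        W = W_new
    return W, sigma2, mu
```

Each iteration only requires $q \times q$ inversions, which is cheap when $q \ll d$.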

References :

@article{tipping1999probabilistic,
  title={Probabilistic principal component analysis},
  author={Tipping, Michael E and Bishop, Christopher M},
  journal={Journal of the Royal Statistical Society Series B: Statistical Methodology},
  volume={61},
  number={3},
  pages={611--622},
  year={1999},
  publisher={Oxford University Press}
}
@book{bengio2017deep,
  title={Deep learning},
  author={Bengio, Yoshua and Goodfellow, Ian and Courville, Aaron and others},
  volume={1},
  year={2017},
  publisher={MIT press Cambridge, MA, USA}
}
@book{trefethen2022numerical,
  title={Numerical linear algebra},
  author={Trefethen, Lloyd N and Bau, David},
  year={2022},
  publisher={SIAM}
}
@book{golub2013matrix,
  title={Matrix computations},
  author={Golub, Gene H and Van Loan, Charles F},
  year={2013},
  publisher={JHU press}
}

Also read :

To model heavy-tailed distributions :

@article{collas2021probabilistic,
  title={Probabilistic PCA from heteroscedastic signals: geometric framework and application to clustering},
  author={Collas, Antoine and Bouchard, Florent and Breloy, Arnaud and Ginolhac, Guillaume and Ren, Chengfang and Ovarlez, Jean-Philippe},
  journal={IEEE Transactions on Signal Processing},
  volume={69},
  pages={6546--6560},
  year={2021},
  publisher={IEEE}
}

To adapt PCA to a response variable :

@misc{papazoglou2025covariancesupervisedprincipalcomponent,
      title={Covariance Supervised Principal Component Analysis}, 
      author={Theodosios Papazoglou and Guosheng Yin},
      year={2025},
      eprint={2506.19247},
      archivePrefix={arXiv},
      primaryClass={stat.AP},
      url={https://arxiv.org/abs/2506.19247}, 
}
