You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Principal Components Analysis aims to describe as much variance in a dataset
with orthogonal combinations of features,
which is adapted to handle multi-colinearities,
at the cost of features interpretability.
In
Bengio, Yoshua, Ian Goodfellow, and Aaron Courville, 2017.*,
the PCA is used to describe the data distribution.
With Gaussian assumption, it generates samples assuming an ellipsoid distribution of the data.
Indeed the covariance matrix of the data can be deduced from PCA.
For instance it can be used as a regularization technique
to fit a covariance matrix in a Gaussian mixture model.
Extending the probabilistic framework further, PCA was then bestowed with a probabilistic optimisation framework
and interpretation thereof (Tipping, 1999*)
that allows for a fast approximation.
Probabilistic PCA (PPCA) has other applications like dealing with missing data
and maximum likelihood comparison with other probabilistic models.
Project Overview :
Implement Singular Value Decomposition algorithms :
Using the QR algoritm on the covariance matrix,
Using the Golub-Kahan algorithm,
Implement the EM algorithm in the Probabilistic PCA framework,
Applying the PCA to generate MNIST digits
Results :
With first SVD components :
With Probabilistic PCA :
TODO :
Implement MLE estimation of PPCA with EM algorithm
Benchmark execution time between naive SVD and Golub-Kahan algorithm
Using Singular Value Decomposition
The naive SVD algorithm, pure QR with Householder reflections and bidiagonalisation algorithm are taken from Trehefen*,
while the SVD implementation with Givens rotations is from Golub*.
The bidiagonalisation is computed with full orthogonal matrices,
that is to say for $A \in \mathbb{R}^{n \times m}$ :
$$ A = U B V^T $$
where $U\in \mathbb{R}^{n \times n}$ and $V \in \mathbb{R}^{m \times m}$ .
A reduced form can be found, for instance if $m > n$, the matrices would be of the form :
Decomposing matrices $A$ and $V$ into sub square matrices $\tilde{A}, \tilde{V}$:
$$
\begin{cases}
A = \begin{pmatrix} \tilde{A} & R \end{pmatrix} \\
V = \begin{pmatrix} \tilde{V} & Z \end{pmatrix}
\end{cases}
$$
while applying the normal Golub bidiagonalisation algorithm to $\tilde{A}$ :
$$
A = U \begin{pmatrix} \tilde{B} & R \end{pmatrix}
\begin{pmatrix} \tilde{V}^T \ Z^T \end{pmatrix}
$$
and $Z$ can be left to compute at the end by solving the system :
$$
\tilde{B} Z^T = U^T R
$$
where $\tilde{B}$ is square and upper bi-diagonal.
A similar system is solved if $A$ is thin instead of wide, with the same complexity.
Using the EM algorithm
Using the same notations as in Tipping's paper*,
the probabilistic framework being :
$$
t = W x + \mu + \epsilon
$$
Where $t \in \mathbb{R}^d$ is the observation variable,
$x \in \mathbb{R}^q$ is the latent one,
with $q \ll d$ and $\epsilon \sim \mathcal{N} (0, \sigma^2)$.
In the following section, the reasoning is detailed, as in the paper,
with extra intermediate steps.
Then the
log-likelihood
of the joint observation and latent variables is :
$$\begin{align*}
\tilde{\sigma}^2 = & \frac{1}{d} \left\{ \text{tr}(S) + \text{tr} \left( \tilde{W}^T \tilde{W} \sigma^2 M^{-1} + \tilde{W}^T \tilde{W} M^{-1} W^T \frac{1}{N} \sum_{n=1}^N (t_n - \mu)(t_n - \mu)^T W M^{-1} \right) \right. \\
& \left. - 2 \text{tr} \left( \tilde{W} \frac{1}{N} \sum_{n=1}^N M^{-1} W^T (t_n - \mu) (t_n - \mu)^T \right) \right\} \\\
\tilde{\sigma}^2 = & \frac{1}{d} \left\{ \text{tr}(S) + \text{tr} \left( \sigma^2\tilde{W}^T \tilde{W} M^{-1} + \tilde{W}^T \tilde{W} M^{-1} W^T S W M^{-1} \right) - 2 \text{tr} \left( \tilde{W} M^{-1} W^T S \right) \right\} \\\
= & \frac{1}{d} \left\{ \text{tr}(S) + \text{tr} \left( \tilde{W}^T \tilde{W} M^{-1} \left( \sigma^2 I + W^T S W M^{-1} \right) \right) - 2 \text{tr} \left( S W M^{-1} \tilde{W}^T \right) \right\} \\\
= & \frac{1}{d} \left\{ \text{tr}(S) + \text{tr} \left( \tilde{W} M^{-1} \left( \sigma^2 I + W^T S W M^{-1} \right) \tilde{W}^T \right) - 2 \text{tr} \left( S W M^{-1} \tilde{W}^T \right) \right\} \\\
= & \frac{1}{d} \left\{ \text{tr}(S) + \text{tr} \left( \tilde{W} M^{-1} \left( \underbrace{\tilde{W} \left( \sigma^2 I + M^{-1} W^T S W \right)}_{SW} \right)^T \right) - 2 \text{tr} \left( S W M^{-1} \tilde{W}^T \right) \right\}
\end{align*}$$
eventually :
$$\tilde{\sigma}^2 = \frac{1}{d} \text{tr}(S - S W M^{-1}\tilde{W}^T)$$
References :
@article{tipping1999probabilistic,
title={Probabilistic principal component analysis},
author={Tipping, Michael E and Bishop, Christopher M},
journal={Journal of the Royal Statistical Society Series B: Statistical Methodology},
volume={61},
number={3},
pages={611--622},
year={1999},
publisher={Oxford University Press}
}
@book{bengio2017deep,
title={Deep learning},
author={Bengio, Yoshua and Goodfellow, Ian and Courville, Aaron and others},
volume={1},
year={2017},
publisher={MIT press Cambridge, MA, USA}
}
@book{trefethen2022numerical,
title={Numerical linear algebra},
author={Trefethen, Lloyd N and Bau, David},
year={2022},
publisher={SIAM}
}
@book{golub2013matrix,
title={Matrix computations},
author={Golub, Gene H and Van Loan, Charles F},
year={2013},
publisher={JHU press}
}
Also read :
To model heavy-tailed distributions :
@article{collas2021probabilistic,
title={Probabilistic PCA from heteroscedastic signals: geometric framework and application to clustering},
author={Collas, Antoine and Bouchard, Florent and Breloy, Arnaud and Ginolhac, Guillaume and Ren, Chengfang and Ovarlez, Jean-Philippe},
journal={IEEE Transactions on Signal Processing},
volume={69},
pages={6546--6560},
year={2021},
publisher={IEEE}
}
To adapt PCA to response variable :
@misc{papazoglou2025covariancesupervisedprincipalcomponent,
title={Covariance Supervised Principal Component Analysis},
author={Theodosios Papazoglou and Guosheng Yin},
year={2025},
eprint={2506.19247},
archivePrefix={arXiv},
primaryClass={stat.AP},
url={https://arxiv.org/abs/2506.19247},
}