KonstantinosVard/Dimensionality-Reduction

Fashion-MNIST Dimensionality Reduction: PCA vs Autoencoder

MSc Data Science & Engineering · Machine Learning Course · Project 1/2: Dimensionality Reduction

A comprehensive comparison of linear (PCA) and non-linear (Autoencoder) dimensionality reduction techniques applied to clothing image recognition and clustering using the Fashion-MNIST dataset.

Overview

This project evaluates two dimensionality reduction approaches on Fashion-MNIST clothing images (28×28 pixels, 784 features) across 10 categories. The reduced-dimension representations are assessed through classification (k-NN) and clustering (k-means, GMM) tasks to determine which method better preserves meaningful information in the latent space.

Dataset

  • Source: Fashion-MNIST dataset
  • Images: 28×28 grayscale clothing images (784 features)
  • Categories: 10 clothing types (T-shirt, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle Boot)
  • Sample size: 150 images per category (1,500 total)
  • Split: 80% training (1,200), 20% test (300) with stratified sampling

Data Preprocessing

  • Normalization: Min-max scaling from [0, 255] to [0, 1] pixel values
  • Stratified split: Maintains class balance across train/test sets
  • Validation: 20% of training data (240 images) used for autoencoder validation
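As a minimal sketch, the preprocessing steps above map directly onto scikit-learn; the random array here stands in for the actual Fashion-MNIST pixel data, and the `random_state` value is an arbitrary choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for 1,500 Fashion-MNIST images (150 per class, 784 pixels each)
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(1500, 784)).astype(np.float32)
y = np.repeat(np.arange(10), 150)

# Min-max scaling: pixel values [0, 255] -> [0, 1]
X = X / 255.0

# Stratified 80/20 split keeps 120 train / 30 test images per class
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

The further 20% validation split for the autoencoder can be taken from `X_train` the same way, or passed as `validation_split=0.2` to Keras during training.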

Dimensionality Reduction Methods

1. Principal Component Analysis (PCA)

  • Type: Linear transformation
  • Dimensions tested: 50, 25, 10
  • Best performance: 50D with MAE = 0.067
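The PCA pipeline and its reconstruction MAE can be sketched as follows; random data stands in for the normalized training images:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the normalized 1,200 x 784 training matrix
X = np.random.default_rng(0).random((1200, 784)).astype(np.float32)

pca = PCA(n_components=50)          # also tested: 25, 10
Z = pca.fit_transform(X)            # project 784-D images to 50-D
X_rec = pca.inverse_transform(Z)    # map back to pixel space

# Mean absolute reconstruction error, as reported in the tables below
mae = np.mean(np.abs(X - X_rec))
```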

2. Autoencoder Neural Network

  • Architecture:
    • Encoder: ReLU activation (hidden layers), Linear (output)
    • Decoder: ReLU activation (hidden layers), Sigmoid (output)
  • Dimensions tested: 50, 25, 10
  • Performance: Consistent MAE ≈ 0.073-0.075 across all dimensions

Reconstruction Results

Reconstruction mean absolute error (MAE) by latent dimension:

| Method      | 10D   | 25D   | 50D   |
|-------------|-------|-------|-------|
| Autoencoder | 0.075 | 0.073 | 0.073 |
| PCA         | 0.100 | 0.082 | 0.067 |

Key Findings:

  • PCA reconstruction quality improves significantly with higher dimensions
  • Autoencoder performance remains stable regardless of latent dimension
  • Autoencoders effectively encode information even with only 10 features

Classification Evaluation (k-NN)

Leave-one-out cross-validation with k=1 to 9 neighbors:

| Method          | Mean Accuracy | Std    | Best Configuration |
|-----------------|---------------|--------|--------------------|
| Original (784D) | 0.740         | ±0.020 | k=4: 0.760         |
| PCA-50          | 0.757         | ±0.011 | k=3: 0.773         |
| PCA-25          | 0.754         | ±0.013 | k=3: 0.770         |
| PCA-10          | 0.717         | ±0.009 | k=9: 0.730         |
| AE-50           | 0.718         | ±0.023 | k=7: 0.740         |
| AE-25           | 0.710         | ±0.028 | k=7: 0.730         |
| AE-10           | 0.713         | ±0.023 | k=7: 0.750         |

Best Overall Performance: PCA-50 with k=3 achieving 77.3% accuracy
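The evaluation protocol above can be sketched with scikit-learn's `LeaveOneOut` splitter; the small random feature matrix stands in for the PCA or autoencoder representations:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for reduced features (e.g. PCA-50) with 3 classes of 20 samples
rng = np.random.default_rng(0)
Z = rng.random((60, 50))
y = np.repeat(np.arange(3), 20)

# Leave-one-out accuracy for each neighborhood size k = 1..9
scores = {}
for k in range(1, 10):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, Z, y, cv=LeaveOneOut()).mean()

best_k = max(scores, key=scores.get)
```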

Clustering Analysis

Using PCA-50 features with k=10 clusters:

| Method  | NMI Score |
|---------|-----------|
| k-means | 0.5235    |
| GMM     | 0.5655    |
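Both clustering runs and the NMI scoring can be sketched as below; random features and labels stand in for the PCA-50 representation and the true classes:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import normalized_mutual_info_score

# Stand-in for PCA-50 test features and their true labels
rng = np.random.default_rng(0)
Z = rng.random((300, 50))
y = rng.integers(0, 10, size=300)

# k-means and GMM, both with k = 10 clusters/components
km_labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(Z)
gmm_labels = GaussianMixture(n_components=10, random_state=0).fit_predict(Z)

# Normalized mutual information against the true labels
nmi_km = normalized_mutual_info_score(y, km_labels)
nmi_gmm = normalized_mutual_info_score(y, gmm_labels)
```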

Clustering Insights:

  • Trousers: Best separated category (100% precision in classification)
  • Challenging pairs: Sneaker-Sandal, Coat-Pullover-Shirt show high visual similarity
  • t-SNE visualization confirms these similarity patterns

Synthetic Data Generation

Two approaches were tested in the 10D autoencoder latent space:

  1. Per-class Gaussian: Calculate mean/covariance per category
  2. Gaussian Mixture Model: Fit GMM with 10 components across all categories

Both methods successfully generate realistic clothing images from latent representations.
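Both sampling strategies can be sketched as follows; the random matrix stands in for the real 10D latent codes, and decoding the drawn samples back to images would use the trained decoder:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for 10D autoencoder latent codes of the training set
rng = np.random.default_rng(0)
Z = rng.random((1200, 10))
y = np.repeat(np.arange(10), 120)

# Approach 1: per-class Gaussian (mean/covariance of one category)
Zc = Z[y == 0]
mu, cov = Zc.mean(axis=0), np.cov(Zc, rowvar=False)
samples_1 = rng.multivariate_normal(mu, cov, size=5)

# Approach 2: GMM with 10 components fit across all categories
gmm = GaussianMixture(n_components=10, random_state=0).fit(Z)
samples_2, _ = gmm.sample(5)

# Images are then obtained via the trained decoder, e.g.:
# images = decoder.predict(samples_1)   # shape (5, 784)
```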

Key Findings

  1. PCA superiority for classification: Despite autoencoders' better low-dimensional reconstruction, PCA features achieve superior classification accuracy
  2. Dimension efficiency: Autoencoders maintain consistent performance across dimensions, while PCA benefits from higher dimensions
  3. Feature interpretability: PCA's linear combinations preserve class-discriminative information better than autoencoder's non-linear features
  4. Reconstruction vs. classification trade-off: Better reconstruction doesn't guarantee better downstream task performance

Practical Implications

  • For classification tasks: PCA with sufficient dimensions (≥25) recommended
  • For reconstruction/generation: Autoencoders effective even with minimal dimensions
  • For clustering: GMM outperforms k-means on reduced representations
  • Computational efficiency: Autoencoders more parameter-efficient for very low dimensions

Requirements

  • Python 3.7+
  • TensorFlow/Keras
  • scikit-learn
  • numpy
  • matplotlib
  • seaborn

Author

Konstantinos Vardakas


This project demonstrates the trade-offs between linear and non-linear dimensionality reduction techniques and their impact on downstream machine learning tasks.
