Repository for the second assignment of the Machine Learning course.
This project explores three fundamental types of learning tasks commonly encountered in Machine Learning: Classification, Clustering, and Regression. In addition, the study analyses how two dimensionality reduction techniques, PCA and an Autoencoder, affect model behaviour in each of these scenarios.
Two datasets from Kaggle were used throughout the study:
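For reference, the sketch below shows one way the two reduction techniques can be set up with scikit-learn and Keras; the feature matrix, bottleneck size, and layer widths are illustrative placeholders, not the exact configuration used in this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from tensorflow import keras

# Placeholder feature matrix (e.g. flattened images or tabular rows).
X = np.random.rand(1000, 256).astype("float32")
latent_dim = 64  # placeholder bottleneck size

# --- PCA: linear projection onto the top principal components ---
pca = PCA(n_components=latent_dim)
X_pca = pca.fit_transform(X)

# --- Autoencoder: non-linear compression through a dense bottleneck ---
inputs = keras.Input(shape=(X.shape[1],))
encoded = keras.layers.Dense(128, activation="relu")(inputs)
encoded = keras.layers.Dense(latent_dim, activation="relu")(encoded)
decoded = keras.layers.Dense(128, activation="relu")(encoded)
decoded = keras.layers.Dense(X.shape[1], activation="sigmoid")(decoded)

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)

# Reduced representations fed to the downstream models.
X_ae = encoder.predict(X, verbose=0)
```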
- UC Merced Land Use Dataset: used for the Clustering and Classification tasks. It contains 2,100 images from the USGS National Map Urban Area Imagery collection, covering various urban areas across the United States. {https://www.kaggle.com/datasets/abdulhasibuddin/uc-merced-land-use-dataset}
- Flood prediction dataset: used for the Regression task. It consists of a single flood.csv file with 50,000 rows and 21 columns of features relevant to flood prediction, including environmental factors and socio-economic indicators.
Clustering results for each representation:

| Model | ARI | NMI | Training time |
|---|---|---|---|
| Original | 0.07045 | 0.32849 | 36.39308 s |
| PCA | 0.07004 | 0.34914 | 0.02168 s |
| Autoencoder | 0.03035 | 0.27749 | 0.02045 s |
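A minimal sketch of how these figures can be produced, assuming K-Means with 21 clusters (one per UC Merced class) and scikit-learn's ARI/NMI implementations; the actual clustering algorithm and hyperparameters used in this project may differ.

```python
import time
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def evaluate_clustering(X, y_true, n_clusters=21):
    """Fit K-Means on one representation and report ARI, NMI and fitting time."""
    start = time.time()
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    elapsed = time.time() - start
    return {
        "ARI": adjusted_rand_score(y_true, labels),
        "NMI": normalized_mutual_info_score(y_true, labels),
        "time_s": elapsed,
    }

# e.g. evaluate_clustering(X_original, y), evaluate_clustering(X_pca, y), ...
```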
Regression results on the flood dataset:

| Model | R² | RMSE |
|---|---|---|
| Raw NN | 0.9458 | 0.0116 |
| PCA NN | 0.9467 | 0.0115 |
| AE NN | 0.9643 | 0.0094 |
| RandomForest | 1.0000 | 0.0001 |
> [!NOTE]
> Graphical representations of the error distribution and the Predicted vs. Actual values are available in the metrics folder.
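As a rough illustration of how the R² and RMSE columns can be computed, the snippet below evaluates a RandomForest regressor on synthetic placeholder data; the real pipeline uses the flood dataset and the neural-network variants listed above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split

# Placeholder features and target standing in for the flood.csv columns.
X = np.random.rand(5000, 20)
y = np.random.rand(5000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE as reported in the table
print(f"R^2 = {r2:.4f}, RMSE = {rmse:.4f}")
```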
For classification, validation and training accuracy were measured per epoch during training. The table below shows each model at its best epoch:
| Model | Epoch | Val Accuracy | Train Accuracy | Time |
|---|---|---|---|---|
| Original | 19 | 0.6063 | 1.0000 | 5.5911 s |
| PCA | 18 | 0.5016 | 0.6150 | 0.6150 s |
| Autoencoder | 6 | 0.1270 | 0.0891 | 0.0768 s |
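A rough sketch of how per-epoch validation and training accuracy can be tracked with Keras and the best epoch extracted from the training history; the placeholder data and network architecture are assumptions, not this project's actual setup.

```python
import numpy as np
from tensorflow import keras

# Placeholder data: flattened image features and 21 land-use classes.
# The real pipeline feeds the original, PCA, or autoencoder representations here.
num_classes = 21
X_train, y_train = np.random.rand(800, 256), np.random.randint(num_classes, size=800)
X_val, y_val = np.random.rand(200, 256), np.random.randint(num_classes, size=200)

model = keras.Sequential([
    keras.layers.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                    epochs=20, batch_size=64, verbose=0)

# Best epoch by validation accuracy, as reported in the table above.
val_acc = history.history["val_accuracy"]
best = max(range(len(val_acc)), key=val_acc.__getitem__)
print(f"epoch {best + 1}: val_acc={val_acc[best]:.4f}, "
      f"train_acc={history.history['accuracy'][best]:.4f}")
```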
Classification performance:
| Model | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| Original | 0.59047 | 0.60731 | 0.59047 | 0.58060 |
| PCA | 0.49206 | 0.48042 | 0.49206 | 0.48044 |
| Autoencoder | 0.13650 | 0.04802 | 0.13650 | 0.06918 |
> [!NOTE]
> Confusion matrices and ROC-AUC curves are available in the metrics folder.
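The per-model rows above can be obtained with scikit-learn's metric functions; weighted averaging is assumed here for precision, recall and F1, since the averaging scheme is not stated in the table.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def classification_metrics(y_true, y_pred):
    """Compute the four metrics in the performance table (weighted averages assumed)."""
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "Recall": recall_score(y_true, y_pred, average="weighted", zero_division=0),
        "F1-score": f1_score(y_true, y_pred, average="weighted", zero_division=0),
    }
```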
While dimensionality reduction techniques like PCA and Autoencoders significantly lower computational costs and training times, their impact on model performance is highly task-dependent. In image classification and clustering, these methods degraded performance, most noticeably in the Autoencoder case. Conversely, for regression on tabular data, dimensionality reduction proved highly beneficial by mitigating noise and overfitting, allowing the Autoencoder to achieve the best neural-network performance.
In summary, dimensionality reduction is not universally beneficial, but when applied appropriately, it is a powerful tool for controlling model complexity, reducing overfitting and lowering computational cost.