Repository for the second assignment of the Machine Learning course, Sapienza University of Rome.


Second Assignment: Machine Learning


This project explores three fundamental types of learning task commonly encountered in Machine Learning: classification, clustering, and regression. It also analyzes how two dimensionality reduction techniques, PCA and an autoencoder, affect model behaviour in each of these scenarios.
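As a minimal sketch of the PCA reduction step (assuming NumPy; the assignment's actual preprocessing and the autoencoder architecture are not shown here, and the data below is random illustration only):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)                    # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T            # coordinates in the reduced space

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                 # 100 samples, 20 features
Z = pca_reduce(X, 5)                           # reduced to 5 dimensions
print(Z.shape)                                 # (100, 5)
```

An autoencoder replaces the linear projection above with a learned nonlinear encoder, which is why its behaviour can diverge from PCA's in the results below.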

Datasets

Two datasets from Kaggle were used in this study:

UC Merced Land Use Dataset

Used for the clustering and classification tasks. It consists of 2100 images from the USGS National Map Urban Area Imagery collection, covering various urban areas across the United States. https://www.kaggle.com/datasets/abdulhasibuddin/uc-merced-land-use-dataset

Flood Prediction Dataset

Used for the regression task. The dataset consists of the flood.csv file: 50,000 rows and 21 columns of features relevant to flood prediction, including environmental factors and socio-economic indicators.

Obtained Results

Clustering Metrics

| Model       | ARI     | NMI     | Training time |
|-------------|---------|---------|---------------|
| Original    | 0.07045 | 0.32849 | 36.39308 s    |
| PCA         | 0.07004 | 0.34914 | 0.02168 s     |
| Autoencoder | 0.03035 | 0.27749 | 0.02045 s     |
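ARI and NMI can be computed with scikit-learn's implementations (assuming scikit-learn is installed; the labels below are toy data, not the UC Merced results):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Both metrics compare partitions, not label ids: the predicted cluster
# numbers differ from the true ones, but the grouping is identical,
# so both scores are maximal.
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 2]   # same partition, permuted label ids

print(adjusted_rand_score(true_labels, pred_labels))           # 1.0
print(normalized_mutual_info_score(true_labels, pred_labels))  # 1.0
```

This permutation invariance is what makes both metrics suitable for clustering, where cluster ids are arbitrary.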

Regression Metrics

| Model        | R^2    | RMSE   |
|--------------|--------|--------|
| Raw NN       | 0.9458 | 0.0116 |
| PCA NN       | 0.9467 | 0.0115 |
| AE NN        | 0.9643 | 0.0094 |
| RandomForest | 1.0000 | 0.0001 |
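The two regression metrics reported above can be computed directly (a NumPy sketch; the y values below are illustrative, not taken from the flood data):

```python
import numpy as np

def r2_rmse(y_true, y_pred):
    """Coefficient of determination and root-mean-squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return r2, rmse

r2, rmse = r2_rmse([0.1, 0.4, 0.5, 0.8], [0.12, 0.38, 0.52, 0.79])
```

R^2 = 1 with RMSE near 0, as in the RandomForest row, indicates near-perfect predictions on the evaluation set.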

> [!NOTE]
> Graphic representations of the error distributions and of Predicted vs Actual values are in the metrics folder.

Classification Metrics

For classification, validation and training accuracy were measured per epoch during training. The table below reports each model's best epoch:

| Model       | Epoch | Val Accuracy | Train Accuracy | Time     |
|-------------|-------|--------------|----------------|----------|
| Original    | 19    | 0.6063       | 1.0000         | 5.5911 s |
| PCA         | 18    | 0.5016       | 0.6150         | 0.6150 s |
| Autoencoder | 6     | 0.1270       | 0.0891         | 0.0768 s |

Performance

| Model       | Accuracy | Precision | Recall  | F1-score |
|-------------|----------|-----------|---------|----------|
| Original    | 0.59047  | 0.60731   | 0.59047 | 0.58060  |
| PCA         | 0.49206  | 0.48042   | 0.49206 | 0.48044  |
| Autoencoder | 0.13650  | 0.04802   | 0.13650 | 0.06918  |
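The averaging mode for precision, recall, and F1 is not stated above, but recall equaling accuracy in every row is consistent with support-weighted averaging, which can be sketched as (NumPy, toy labels only):

```python
import numpy as np

def weighted_prf(y_true, y_pred, n_classes):
    """Support-weighted precision, recall, and F1 from a confusion matrix."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                         # rows: true class, cols: predicted
    tp = np.diag(cm)
    col, row = cm.sum(axis=0), cm.sum(axis=1)
    precision = np.divide(tp, col, out=np.zeros(n_classes), where=col > 0)
    recall = np.divide(tp, row, out=np.zeros(n_classes), where=row > 0)
    f1 = np.divide(2 * precision * recall, precision + recall,
                   out=np.zeros(n_classes), where=(precision + recall) > 0)
    w = row / row.sum()                       # weight each class by its support
    return (w * precision).sum(), (w * recall).sum(), (w * f1).sum()

p, r, f = weighted_prf([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
```

With weighted averaging, recall reduces algebraically to overall accuracy (sum of per-class true positives over the total count), which is why the Accuracy and Recall columns coincide in the table.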

> [!NOTE]
> Confusion matrices and ROC-AUC curves are in the metrics folder.

Conclusions

While dimensionality reduction techniques like PCA and autoencoders significantly lower computational cost and training time, their impact on model performance is highly task-dependent. In image classification and clustering, both methods degraded performance, most severely in the autoencoder case. Conversely, for regression on tabular data, dimensionality reduction proved highly beneficial by mitigating noise and overfitting, allowing the autoencoder to achieve the best neural-network performance.

In summary, dimensionality reduction is not universally beneficial, but when applied appropriately, it is a powerful tool for controlling model complexity, reducing overfitting and lowering computational cost.
