Repository for the second assignment of the Machine Learning course.
This project explores three fundamental types of learning tasks commonly encountered in Machine Learning: Classification, Clustering, and Regression. In addition, the study analyses how two dimensionality reduction techniques, PCA and an Autoencoder, affect model behaviour in each of these scenarios.
Two datasets from Kaggle were used throughout the study:
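For reference, the sketch below shows one way the two reduction techniques can be set up with scikit-learn and Keras; the feature matrix, bottleneck size, and layer widths are illustrative placeholders, not the exact configuration used in this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from tensorflow import keras

# Placeholder feature matrix (e.g. flattened images or tabular rows).
X = np.random.rand(1000, 256).astype("float32")
latent_dim = 64  # placeholder bottleneck size

# --- PCA: linear projection onto the top principal components ---
pca = PCA(n_components=latent_dim)
X_pca = pca.fit_transform(X)

# --- Autoencoder: non-linear compression through a dense bottleneck ---
inputs = keras.Input(shape=(X.shape[1],))
encoded = keras.layers.Dense(128, activation="relu")(inputs)
encoded = keras.layers.Dense(latent_dim, activation="relu")(encoded)
decoded = keras.layers.Dense(128, activation="relu")(encoded)
decoded = keras.layers.Dense(X.shape[1], activation="sigmoid")(decoded)

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)

# Reduced representations fed to the downstream models.
X_ae = encoder.predict(X, verbose=0)
```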
- UC Merced Land Use Dataset: used for the Clustering and Classification tasks. It contains 2,100 images from the USGS National Map Urban Area Imagery collection, covering various urban areas across the United States. {https://www.kaggle.com/datasets/abdulhasibuddin/uc-merced-land-use-dataset}
- Flood prediction dataset: used for the Regression task. It consists of a single flood.csv file with 50,000 rows and 21 columns of features relevant to flood prediction, including environmental factors and socio-economic indicators.
Clustering results for each representation:

| Model | ARI | NMI | Training time |
|---|---|---|---|
| Original | 0.07045 | 0.32849 | 36.39308 s |
| PCA | 0.07004 | 0.34914 | 0.02168 s |
| Autoencoder | 0.03035 | 0.27749 | 0.02045 s |
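A minimal sketch of how these figures can be produced, assuming K-Means with 21 clusters (one per UC Merced class) and scikit-learn's ARI/NMI implementations; the actual clustering algorithm and hyperparameters used in this project may differ.

```python
import time
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def evaluate_clustering(X, y_true, n_clusters=21):
    """Fit K-Means on one representation and report ARI, NMI and fitting time."""
    start = time.time()
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    elapsed = time.time() - start
    return {
        "ARI": adjusted_rand_score(y_true, labels),
        "NMI": normalized_mutual_info_score(y_true, labels),
        "time_s": elapsed,
    }

# e.g. evaluate_clustering(X_original, y), evaluate_clustering(X_pca, y), ...
```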
Regression results on the flood dataset:

| Model | R² | RMSE |
|---|---|---|
| Raw NN | 0.9458 | 0.0116 |
| PCA NN | 0.9467 | 0.0115 |
| AE NN | 0.9643 | 0.0094 |
| RandomForest | 1.0000 | 0.0001 |
> [!NOTE]
> Graphical representations of the error distribution and the Predicted vs. Actual values are available in the metrics folder.
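As a rough illustration of how the R² and RMSE columns can be computed, the snippet below evaluates a RandomForest regressor on synthetic placeholder data; the real pipeline uses the flood dataset and the neural-network variants listed above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split

# Placeholder features and target standing in for the flood.csv columns.
X = np.random.rand(5000, 20)
y = np.random.rand(5000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE as reported in the table
print(f"R^2 = {r2:.4f}, RMSE = {rmse:.4f}")
```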
For classification, validation and training accuracy were measured per epoch during training. The table below shows each model at its best epoch:
| Model | Epoch | Val Accuracy | Train Accuracy | Time |
|---|---|---|---|---|
| Original | 19 | 0.6063 | 1.0000 | 5.5911 s |
| PCA | 18 | 0.5016 | 0.6150 | 0.6150 s |
| Autoencoder | 6 | 0.1270 | 0.0891 | 0.0768 s |
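A rough sketch of how per-epoch validation and training accuracy can be tracked with Keras and the best epoch extracted from the training history; the placeholder data and network architecture are assumptions, not this project's actual setup.

```python
import numpy as np
from tensorflow import keras

# Placeholder data: flattened image features and 21 land-use classes.
# The real pipeline feeds the original, PCA, or autoencoder representations here.
num_classes = 21
X_train, y_train = np.random.rand(800, 256), np.random.randint(num_classes, size=800)
X_val, y_val = np.random.rand(200, 256), np.random.randint(num_classes, size=200)

model = keras.Sequential([
    keras.layers.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                    epochs=20, batch_size=64, verbose=0)

# Best epoch by validation accuracy, as reported in the table above.
val_acc = history.history["val_accuracy"]
best = max(range(len(val_acc)), key=val_acc.__getitem__)
print(f"epoch {best + 1}: val_acc={val_acc[best]:.4f}, "
      f"train_acc={history.history['accuracy'][best]:.4f}")
```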
Classification performance:
| Model | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| Original | 0.59047 | 0.60731 | 0.59047 | 0.58060 |
| PCA | 0.49206 | 0.48042 | 0.49206 | 0.48044 |
| Autoencoder | 0.13650 | 0.04802 | 0.13650 | 0.06918 |
> [!NOTE]
> Confusion matrices and ROC-AUC curves are available in the metrics folder.
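The per-model rows above can be obtained with scikit-learn's metric functions; weighted averaging is assumed here for precision, recall and F1, since the averaging scheme is not stated in the table.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def classification_metrics(y_true, y_pred):
    """Compute the four metrics in the performance table (weighted averages assumed)."""
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "Recall": recall_score(y_true, y_pred, average="weighted", zero_division=0),
        "F1-score": f1_score(y_true, y_pred, average="weighted", zero_division=0),
    }
```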
While dimensionality reduction techniques like PCA and Autoencoders significantly lower computational costs and training times, their impact on model performance is highly task-dependent. In image classification and clustering, these methods degraded performance, most noticeably in the Autoencoder case. Conversely, for regression on tabular data, dimensionality reduction proved highly beneficial by mitigating noise and overfitting, allowing the Autoencoder to achieve the best neural-network performance.
In summary, dimensionality reduction is not universally beneficial, but when applied appropriately, it is a powerful tool for controlling model complexity, reducing overfitting and lowering computational cost.