Ludovik99/Breast-Cancer-Cells-Profiling-and-Classification

Breast-Cancer-Cells-Profiling-and-Classification

After a Principal Component Analysis (RStudio) of the cells' morphological characteristics, a Logistic Regression model (scikit-learn) is applied to classify each cell as benign or malignant.

1 - Breast Cancer Cells Analysis using PCA

During my MSc in Data Science, I had the opportunity to analyze, together with my colleague Edoardo Ferrero, digitized images of breast cancer cells, whose characteristics are described by a set of different measurements. Starting from a personal intuition, I wanted to see whether these seemingly redundant characteristics could be reduced to fewer dimensions while still explaining a substantial share of the variance among malignant cancer cells. After filtering the dataset to keep only the cells classified as malignant, we performed a Principal Component Analysis in RStudio and highlighted the features that together account for 82% of the variance in the dataset and may therefore underlie the malignant nature of the cancer cells.
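The analysis above was carried out in RStudio; the following is only a minimal Python sketch of the same idea, not the code from the repository. It assumes that scikit-learn's bundled Wisconsin breast cancer dataset (`load_breast_cancer`) is an acceptable stand-in for the data, since the dataset is not named here. It keeps the malignant rows, standardizes the features, fits a PCA, and finds the smallest number of components whose cumulative explained variance reaches 82%.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: the bundled Wisconsin dataset stands in for the project's data.
data = load_breast_cancer()
X, y = data.data, data.target

# In this bundled dataset, 0 encodes malignant and 1 encodes benign.
X_malignant = X[y == 0]

# Standardize so that features on large scales do not dominate the components.
X_scaled = StandardScaler().fit_transform(X_malignant)

# Fit a full PCA and accumulate the explained variance ratio per component.
pca = PCA()
pca.fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 82%.
n_components = int(np.searchsorted(cumulative, 0.82) + 1)

# Features with the largest absolute loadings on the first component.
order = np.argsort(np.abs(pca.components_[0]))[::-1]
top_features = [str(data.feature_names[i]) for i in order[:5]]

print(f"components needed for 82% of the variance: {n_components}")
print("strongest loadings on PC1:", top_features)
```

The number of components and the leading features printed here depend on the stand-in dataset and need not match the results obtained in RStudio.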

This part of the work has been supervised by Professor Matthias Seifert, PhD.

2 - Classification of breast tumoral masses using Logistic Regression

Another direction that I wanted to explore on my own was building a Machine Learning binary classification model capable of labelling breast tumoral masses as Benign (0 in the model) or Malignant (1 in the model) from their observed morphological characteristics. However, the dataset did not contain enough instances to train the model properly, so I generated synthetic data to complement it. The synthetic instances were generated with Clearbox.AI, an interface for generating synthetic data consistent with the original dataset; the notebook "2 - Original + Syntehtic data" merges the two datasets.

On the merged dataset, I tried several models by trial and error to find the best classification algorithm for this case study: Random Forest, XGBoost, LDA and Logistic Regression. After splitting the data into train and test sets, Logistic Regression offered the best classification performance on unseen (test) data, so it was chosen as the final model; the notebook "3 - Classification model" contains its code.

To obtain the best possible performance, the features were scaled and the model's hyperparameters were set through a Grid Search optimized for Recall. This evaluation metric was chosen based on the nature of the problem: ideally, we want to correctly identify as many malignant tumors as possible, minimizing the number of False Negatives (instances classified as benign that turn out to be malignant). The results are satisfying: the model shows a Recall of 0.986 and an F1 Score of 0.949. When presented with unseen data, it correctly predicted 71 of the 72 malignant tumors.
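The scaling and recall-driven grid search described above might look roughly like the sketch below. It is not the code from the "3 - Classification model" notebook: the data is an assumption (scikit-learn's bundled Wisconsin dataset instead of the merged original and synthetic data), and the `C` grid, split size and random seed are hypothetical values chosen only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: stand-in data; the merged original + synthetic dataset is not used.
data = load_breast_cancer()
X = data.data
# The bundled dataset uses 0 = malignant, 1 = benign; flip the labels so that
# 1 = Malignant and 0 = Benign, matching the coding described in the text.
y = 1 - data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling sits inside the pipeline so it is re-fitted on each training fold.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid over the inverse regularization strength.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# scoring="recall" selects the setting that misses the fewest malignant cases.
search = GridSearchCV(pipeline, param_grid, scoring="recall", cv=5)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test set.
y_pred = search.predict(X_test)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print("best parameters:", search.best_params_)
print(f"recall={recall:.3f}  f1={f1:.3f}  false negatives={fn}")
```

Recall is the share of truly malignant cases that the model labels as malignant, TP / (TP + FN). With 71 of 72 malignant tumors identified, this gives 71 / 72 ≈ 0.986, which agrees with the reported Recall. The scores printed by this sketch come from the stand-in data and should not be expected to reproduce the reported figures.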
