Skip to content

aa2782/Machine-Learning-Models-to-Predict-Breast-Cancer-using-Gene-Expression-Data-

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 

Repository files navigation

Machine-Learning-Models-to-Predict-Breast-Cancer-using-Gene-Expression-Data

Authors: Arun Aryal, Sravana Nuti, Sudan Duwadi

Introduction: Breast cancer occurs when the cells in the mammary gland grow and multiply in an uncontrollable way. It is most prevalent in women, although there are cases with men having breast cancer as well.

Research has revealed different types of breast cancer cells based on the factors involved in the growth of the tumor. The four primary molecular subtypes are determined by the presence or absence of proteins. They are Luminal A, Luminal B, HER2, and basal like. Since each type of breast cancer cell type behaves differently, classifying a breast cell to a molecular type is crucial. Classification allows for more targeted breast cancer treatment because each different molecular subtype has a different treatment option. Each treatment has a different response from the cancer cells, leading to a different clinical outcome.

The goal of this project is to build a machine learning model that will classify a given specimen into one of the four subtypes. It is important to determine, which genes have the most impact in this classification given the data. Subsequently, this gene or a set of genes can be used as a predictor for breast cancer.

This particular Breast Cancer dataset has 151 different measurements with 54,677 features in total. Some features for each measurement include, sample ID, type of breast cancer (Basal, HER, luminal_B, luminal_Aplus, cell_line, or normal tissue), and quantified values for different types of genes. The ‘sample ID’ feature specifies the sample number that was examined. The ‘type’ feature specifies the type of breast cancer that the sample is. The rest of the features each correspond to a single gene, and the values present in the columns indicate the gene expression of each gene in the given sample.

For this model, we are only focused on classifying samples into breast cancer types. Therefore, the ‘Normal’ and the ‘Cell Line’ samples were dropped from the dataset. They will not be used to train the model.

As such, we used the following models to analyze the data: logistic regression, K-neighbor classifier, Random forest classifier, and support vector classifier.

We also used PCA and GridSearchCV to see the performance changes in the models.

Finally we also used the Sequential Deep Learning model.

Data source: https://www.kaggle.com/dataset/134765b899e1df132438692cdd193baa3142bbbd849d5129c4537f56e04671df

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors