Machine-Learning-Models-to-Predict-Breast-Cancer-using-Gene-Expression-Data

Authors: Arun Aryal, Sravana Nuti, Sudan Duwadi

Introduction: Breast cancer occurs when the cells in the mammary gland grow and multiply in an uncontrollable way. It is most prevalent in women, although there are cases with men having breast cancer as well.

Research has revealed different types of breast cancer cells based on the factors involved in the growth of the tumor. The four primary molecular subtypes are determined by the presence or absence of proteins. They are Luminal A, Luminal B, HER2, and basal like. Since each type of breast cancer cell type behaves differently, classifying a breast cell to a molecular type is crucial. Classification allows for more targeted breast cancer treatment because each different molecular subtype has a different treatment option. Each treatment has a different response from the cancer cells, leading to a different clinical outcome.

The goal of this project is to build a machine learning model that will classify a given specimen into one of the four subtypes. It is important to determine, which genes have the most impact in this classification given the data. Subsequently, this gene or a set of genes can be used as a predictor for breast cancer.

This particular Breast Cancer dataset has 151 different measurements with 54,677 features in total. Some features for each measurement include, sample ID, type of breast cancer (Basal, HER, luminal_B, luminal_Aplus, cell_line, or normal tissue), and quantified values for different types of genes. The ‘sample ID’ feature specifies the sample number that was examined. The ‘type’ feature specifies the type of breast cancer that the sample is. The rest of the features each correspond to a single gene, and the values present in the columns indicate the gene expression of each gene in the given sample.

For this model, we are only focused on classifying samples into breast cancer types. Therefore, the ‘Normal’ and the ‘Cell Line’ samples were dropped from the dataset. They will not be used to train the model.

As such, we used the following models to analyze the data: logistic regression, K-neighbor classifier, Random forest classifier, and support vector classifier.

We also used PCA and GridSearchCV to see the performance changes in the models.

Finally we also used the Sequential Deep Learning model.

Data source: https://www.kaggle.com/dataset/134765b899e1df132438692cdd193baa3142bbbd849d5129c4537f56e04671df

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
Deep_Learning_Code.ipynb		Deep_Learning_Code.ipynb
Machine_Learning_Code.ipynb		Machine_Learning_Code.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Machine-Learning-Models-to-Predict-Breast-Cancer-using-Gene-Expression-Data

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Machine-Learning-Models-to-Predict-Breast-Cancer-using-Gene-Expression-Data

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages