This project compares two approaches for breast cancer prediction:
- Approach 1: Traditional machine learning classification models (Logistic Regression, Random Forest, SVM, XGBoost).
- Approach 2: Neural Networks using TensorFlow/Keras.
The goal is to evaluate the performance of both approaches on multiple breast cancer datasets and identify the best-performing model.
The following datasets are used in this project:
- Wisconsin Breast Cancer Dataset:
  - Features: Numerical (e.g., radius, texture, perimeter).
  - Target: Binary diagnosis (malignant/benign).
- Breast Cancer Dataset:
  - Features: Numerical (e.g., age, tumor size) and categorical (e.g., menopause, metastasis).
  - Target: Binary diagnosis.
- BRCA Dataset:
  - Features: Numerical (e.g., age, protein levels) and categorical (e.g., tumor stage, surgery type).
  - Target: Binary patient status.
- German Breast Cancer Dataset:
  - Features: Numerical (e.g., age, tumor size) and categorical (e.g., hormonal status).
  - Target: Binary status.
- SEER Breast Cancer Dataset:
  - Features: Numerical (e.g., age, survival months) and categorical (e.g., tumor stage, estrogen status).
  - Target: Binary status.
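The Wisconsin diagnostic dataset is also bundled with scikit-learn, which makes a quick start easy; a minimal loading sketch (the remaining datasets are assumed to come from local CSV files, whose paths are not shown here):

```python
# Sketch: loading the Wisconsin Breast Cancer Dataset from scikit-learn.
# The other datasets in this project are assumed to be read from CSV files.
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target   # 569 samples, 30 numeric features
print(X.shape)                  # (569, 30)
print(data.target_names)        # ['malignant' 'benign']
```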

Approach 1 trains the following classification models:
- Logistic Regression
- Random Forest
- Support Vector Machine (SVM)
- XGBoost

The Approach 1 workflow:
- Data Preprocessing:
  - Handle missing values.
  - Encode categorical variables.
  - Scale numerical features using StandardScaler.
  - Split data into training and testing sets.
- Model Training:
  - Train each model on the training data.
- Model Evaluation:
  - Evaluate models using Accuracy, ROC-AUC, and Confusion Matrix.
  - Combine predictions using a weighted average to create a fused model.
- Visualization:
  - Plot ROC curves and bar plots for accuracy and AUC across datasets.
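The steps above can be sketched end-to-end on the scikit-learn copy of the Wisconsin dataset. Two of the four models are shown to keep the example short, and the fusion weights are an assumption (the weighting scheme is not specified here):

```python
# Sketch of the Approach 1 workflow: scale, split, train, evaluate, fuse.
# Fusion weights below are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scale numeric features (fit on the training split only to avoid leakage).
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=200, random_state=42),
}
probs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    p = model.predict_proba(X_test)[:, 1]
    probs[name] = p
    print(name, accuracy_score(y_test, p > 0.5), roc_auc_score(y_test, p))

# Fused model: weighted average of predicted probabilities.
weights = {"logreg": 0.5, "rf": 0.5}   # assumed equal weights
fused = sum(weights[n] * probs[n] for n in probs)
print("fused AUC:", roc_auc_score(y_test, fused))
print(confusion_matrix(y_test, fused > 0.5))
```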
Approach 2 uses a Sequential Neural Network with:
- Input layer: 128 neurons, ReLU activation.
- Hidden layers: 64 and 32 neurons, ReLU activation.
- Output layer: Softmax (for multiclass) or Sigmoid (for binary classification).
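In Keras, the described architecture might be sketched as follows; the input width of 30 assumes the Wisconsin feature count, and the binary (sigmoid) head is shown:

```python
# Sketch of the described Sequential architecture (binary head).
# input shape (30,) assumes the Wisconsin dataset's 30 features.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(30,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # softmax + N units for multiclass
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```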

The Approach 2 workflow:
- Data Preprocessing:
  - Handle missing values.
  - Encode categorical variables.
  - Scale numerical features using StandardScaler.
  - Split data into training and testing sets.
- Handle Class Imbalance:
  - Use SMOTE (Synthetic Minority Oversampling Technique) to balance the dataset.
  - Compute class weights for training.
- Model Training:
  - Train the neural network on the balanced dataset.
  - Use Binary Crossentropy (binary classification) or Categorical Crossentropy (multiclass classification) as the loss function.
  - Optimizer: Adam with a learning rate of 0.001.
- Model Evaluation:
  - Evaluate the model using Accuracy, ROC-AUC, and Confusion Matrix.
  - For multiclass classification, use weighted ROC-AUC.
- Visualization:
  - Plot accuracy and ROC curves for each dataset.

Evaluation metrics:
- Accuracy: Measures the proportion of correctly predicted instances.
- ROC-AUC: Measures the model's ability to distinguish between classes across all decision thresholds.
- F1-Score: Harmonic mean of Precision and Recall.
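All three metrics are available in scikit-learn; a toy sketch on hand-written predictions (the labels and scores here are made up for illustration):

```python
# Sketch: computing the three metrics on toy predictions.
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred  = [0, 0, 1, 1, 0, 0, 1, 1]                    # hard labels
y_score = [0.1, 0.3, 0.8, 0.9, 0.4, 0.2, 0.7, 0.6]    # probabilities for ROC-AUC

print("Accuracy:", accuracy_score(y_true, y_pred))
print("ROC-AUC: ", roc_auc_score(y_true, y_score))    # uses scores, not labels
print("F1:      ", f1_score(y_true, y_pred))
```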

Summary of results:
- Approach 1 (Classification Models):
  - Best-performing model: XGBoost (highest accuracy and AUC).
  - The fused model (weighted average of the individual models' predictions) outperforms each individual model.
- Approach 2 (Neural Networks):
  - Achieves competitive performance, especially on imbalanced datasets.
  - SMOTE and class weighting improve model performance.

To run the project:
- Clone the repository:
  `git clone <repository-url>` and `cd <repository-folder>`
- Install dependencies:
  `pip install -r requirements.txt`
- Run the scripts:
  - For Approach 1 (Classification Models): `python classification_models.py`
  - For Approach 2 (Neural Networks): `python neural_network.py`
- View results:
  - ROC curves and accuracy plots are saved in the `results/` folder.

Requirements:
- Python 3.x
- Libraries: `pandas`, `numpy`, `scikit-learn`, `tensorflow`, `imblearn`, `matplotlib`, `seaborn`, `xgboost`

Author: Arpitha Thippeswamy