This project compares two approaches for breast cancer prediction:
- Approach 1: Traditional machine learning classification models (Logistic Regression, Random Forest, SVM, XGBoost).
- Approach 2: Neural Networks using TensorFlow/Keras.
The goal is to evaluate the performance of both approaches on multiple breast cancer datasets and identify the best-performing model.
The following datasets are used in this project:
- Wisconsin Breast Cancer Dataset:
  - Features: Numerical (e.g., radius, texture, perimeter).
  - Target: Binary diagnosis (malignant/benign).
- Breast Cancer Dataset:
  - Features: Numerical (e.g., age, tumor size) and categorical (e.g., menopause, metastasis).
  - Target: Binary diagnosis.
- BRCA Dataset:
  - Features: Numerical (e.g., age, protein levels) and categorical (e.g., tumor stage, surgery type).
  - Target: Binary patient status.
- German Breast Cancer Dataset:
  - Features: Numerical (e.g., age, tumor size) and categorical (e.g., hormonal status).
  - Target: Binary status.
- SEER Breast Cancer Dataset:
  - Features: Numerical (e.g., age, survival months) and categorical (e.g., tumor stage, estrogen status).
  - Target: Binary status.
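The Wisconsin diagnostic dataset is also bundled with scikit-learn, which makes a quick start easy; a minimal loading sketch (the remaining datasets are assumed to come from local CSV files, whose paths are not shown here):

```python
# Sketch: loading the Wisconsin Breast Cancer Dataset from scikit-learn.
# The other datasets in this project are assumed to be read from CSV files.
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target   # 569 samples, 30 numeric features
print(X.shape)                  # (569, 30)
print(data.target_names)        # ['malignant' 'benign']
```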

Approach 1 trains the following classification models:
- Logistic Regression
- Random Forest
- Support Vector Machine (SVM)
- XGBoost

The Approach 1 workflow:
- Data Preprocessing:
  - Handle missing values.
  - Encode categorical variables.
  - Scale numerical features using StandardScaler.
  - Split data into training and testing sets.
- Model Training:
  - Train each model on the training data.
- Model Evaluation:
  - Evaluate models using Accuracy, ROC-AUC, and Confusion Matrix.
  - Combine predictions using a weighted average to create a fused model.
- Visualization:
  - Plot ROC curves and bar plots for accuracy and AUC across datasets.
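The steps above can be sketched end-to-end on the scikit-learn copy of the Wisconsin dataset. Two of the four models are shown to keep the example short, and the fusion weights are an assumption (the weighting scheme is not specified here):

```python
# Sketch of the Approach 1 workflow: scale, split, train, evaluate, fuse.
# Fusion weights below are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scale numeric features (fit on the training split only to avoid leakage).
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=200, random_state=42),
}
probs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    p = model.predict_proba(X_test)[:, 1]
    probs[name] = p
    print(name, accuracy_score(y_test, p > 0.5), roc_auc_score(y_test, p))

# Fused model: weighted average of predicted probabilities.
weights = {"logreg": 0.5, "rf": 0.5}   # assumed equal weights
fused = sum(weights[n] * probs[n] for n in probs)
print("fused AUC:", roc_auc_score(y_test, fused))
print(confusion_matrix(y_test, fused > 0.5))
```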
Approach 2 uses a Sequential Neural Network with:
- Input layer: 128 neurons, ReLU activation.
- Hidden layers: 64 and 32 neurons, ReLU activation.
- Output layer: Softmax (for multiclass) or Sigmoid (for binary classification).
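In Keras, the described architecture might be sketched as follows; the input width of 30 assumes the Wisconsin feature count, and the binary (sigmoid) head is shown:

```python
# Sketch of the described Sequential architecture (binary head).
# input shape (30,) assumes the Wisconsin dataset's 30 features.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(30,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # softmax + N units for multiclass
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```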

The Approach 2 workflow:
- Data Preprocessing:
  - Handle missing values.
  - Encode categorical variables.
  - Scale numerical features using StandardScaler.
  - Split data into training and testing sets.
- Handle Class Imbalance:
  - Use SMOTE (Synthetic Minority Oversampling Technique) to balance the dataset.
  - Compute class weights for training.
- Model Training:
  - Train the neural network on the balanced dataset.
  - Use Binary Crossentropy (binary classification) or Categorical Crossentropy (multiclass classification) as the loss function.
  - Optimizer: Adam with a learning rate of 0.001.
- Model Evaluation:
  - Evaluate the model using Accuracy, ROC-AUC, and Confusion Matrix.
  - For multiclass classification, use weighted ROC-AUC.
- Visualization:
  - Plot accuracy and ROC curves for each dataset.

Evaluation metrics:
- Accuracy: Measures the proportion of correctly predicted instances.
- ROC-AUC: Measures the model's ability to distinguish between classes across all decision thresholds.
- F1-Score: Harmonic mean of Precision and Recall.
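All three metrics are available in scikit-learn; a toy sketch on hand-written predictions (the labels and scores here are made up for illustration):

```python
# Sketch: computing the three metrics on toy predictions.
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred  = [0, 0, 1, 1, 0, 0, 1, 1]                    # hard labels
y_score = [0.1, 0.3, 0.8, 0.9, 0.4, 0.2, 0.7, 0.6]    # probabilities for ROC-AUC

print("Accuracy:", accuracy_score(y_true, y_pred))
print("ROC-AUC: ", roc_auc_score(y_true, y_score))    # uses scores, not labels
print("F1:      ", f1_score(y_true, y_pred))
```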

Summary of results:
- Approach 1 (Classification Models):
  - Best-performing model: XGBoost (highest accuracy and AUC).
  - The fused model (weighted average of the individual models' predictions) outperforms each individual model.
- Approach 2 (Neural Networks):
  - Achieves competitive performance, especially on imbalanced datasets.
  - SMOTE and class weighting improve model performance.

To run the project:
- Clone the repository:
  `git clone <repository-url>` and `cd <repository-folder>`
- Install dependencies:
  `pip install -r requirements.txt`
- Run the scripts:
  - For Approach 1 (Classification Models): `python classification_models.py`
  - For Approach 2 (Neural Networks): `python neural_network.py`
- View results:
  - ROC curves and accuracy plots are saved in the `results/` folder.

Requirements:
- Python 3.x
- Libraries: `pandas`, `numpy`, `scikit-learn`, `tensorflow`, `imblearn`, `matplotlib`, `seaborn`, `xgboost`

Author: Arpitha Thippeswamy