Breast-Cancer-Prediction

This repository focuses on building predictive models for breast cancer diagnosis using medical imaging data. The project involves preprocessing, feature engineering, and the development of machine learning models to classify tumors as either Benign or Malignant.

Dataset Information

Source: Breast Cancer Wisconsin (Diagnostic) Dataset
Sample Size: 569
Predictors:
- Feature Types: Radius, Perimeter, Area, Compactness, Smoothness, Concavity, Concavity Points, Symmetry, Fractal Dimension, Texture.
- Statistic Types: Mean (1), Extreme (2), Standard Error (3).
Response Variable: Diagnosis (Benign or Malignant).

Preprocessing Steps

Dummy Variables: Not applicable (no categorical variables).
Handling Missing Values: No missing values in the dataset.
Handling Highly Correlated Predictors:
- Predictors with a correlation coefficient greater than 0.85 were removed to avoid redundancy.
- Example: Predictors such as concavity1, concave_points1, compactness1, perimeter3, and radius1 were removed, reducing the number of predictors from 30 to 17.
Handling Uncorrelated Predictors:
- Predictors with low correlation to the response variable were retained as they might still contribute valuable independent information to the model.
Box-Cox Transformation: Applied to skewed variables to make their distributions more symmetric.
Centering and Scaling: Ensures predictors with varying magnitudes have equal influence.
Spatial Sign Transformation: Addresses outliers in the dataset.
Data Splitting: 80-20 stratified split for training and testing sets.

Modeling

Algorithms Used

Linear Models: Logistic Regression, LDA, PLSDA, Penalized Models.
Non-linear Models: Neural Networks, SVM, KNN, Naive Bayes, etc.

Performance Metric

Primary Metric: ROC-AUC (ideal for unbalanced binary outcomes).

Handling Correlated and Uncorrelated Data in Modeling

Models trained on uncorrelated data: Logistic Regression, LDA, QDA, RDA, MDA, FDA, KNN.
Models trained on correlated data: PLSDA, Penalized Model, Neural Network, SVM, Naive Bayes.
Using separate datasets for correlated and uncorrelated predictors allowed us to compare model performance effectively.

Best Models

SVM:
- Tuning Parameters: Sigma = 0.007675, C = 4.
- Training ROC-AUC: 0.9959.
- Testing ROC-AUC: 1.0000.
Neural Network:
- Tuning Parameters: Size = 3, Decay = 1.
- Training ROC-AUC: 0.9943.
- Testing ROC-AUC: 0.9997.

Results

Confusion Matrix for SVM:

Reference Benign Malignant

Benign 70 0

Malignant 1 42
Most Important Predictors: Radius, Area, and Texture metrics.

Key Highlights

Correlation Handling: Removing highly correlated predictors improved model generalizability by reducing redundancy.
Dataset Separation: Training separate models on correlated and uncorrelated datasets highlighted differences in performance across algorithms.
SVM and Neural Networks demonstrated exceptional performance, achieving near-perfect ROC-AUC scores.

Getting Started

Clone the repository:

git clone https://github.com/PujaAmmineni/Breast-Cancer-Prediction.git

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.gitignore		.gitignore
Breast Cancer - Predictive Modeling.docx		Breast Cancer - Predictive Modeling.docx
Group 1 Predictive Modeling - 2nd.pdf		Group 1 Predictive Modeling - 2nd.pdf
Handle corelated variables.R		Handle corelated variables.R
README.md		README.md
for uncorelated variables (1).R		for uncorelated variables (1).R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Breast-Cancer-Prediction

Dataset Information

Preprocessing Steps

Modeling

Algorithms Used

Performance Metric

Handling Correlated and Uncorrelated Data in Modeling

Best Models

Results

Key Highlights

Getting Started

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Reference	Benign	Malignant
Benign	70	0
Malignant	1	42

Folders and files

Latest commit

History

Repository files navigation

Breast-Cancer-Prediction

Dataset Information

Preprocessing Steps

Modeling

Algorithms Used

Performance Metric

Handling Correlated and Uncorrelated Data in Modeling

Best Models

Results

Key Highlights

Getting Started

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages