Predicting Stroke Risk Using Machine Learning Models

Project Overview

Stroke is a significant global health issue, contributing to substantial morbidity and mortality. This project aims to develop a predictive model to assess stroke risk by analyzing healthcare data, including demographics, lifestyle, and medical history. By leveraging machine learning techniques, the project seeks to uncover key risk factors and support proactive healthcare management.

Objectives

Analyze the dataset: Identify factors contributing to stroke occurrences.
Preprocess the data: Ensure high quality and relevance of the dataset.
Build machine learning models: Create reliable predictive systems with high accuracy.
Deliver actionable insights: Provide tools to support stroke prevention strategies.

Dataset Overview

Features: Gender, age, hypertension, heart disease, marital status, work type, residence type, average glucose level, BMI, smoking status, and stroke status (target variable).
Size: 5,110 records with 12 columns (the 11 fields listed above plus, presumably, a record identifier).
Target Variable: stroke (1 = stroke, 0 = no stroke).

Data Preprocessing

Key Steps

Exploration:
- Identified missing values in the bmi column (~3.93%).
- Checked for duplicate rows (none found).
- Analyzed stroke rates by gender.
Handling Missing Values:
- Option 1: Dropped rows with missing bmi values (reduced dataset to 4,909 rows).
- Option 2: Imputed missing bmi values using the mean (28.89).
Data Encoding:
- Binary encoding for Residence_type and ever_married.
- One-hot encoding for gender, work_type, and smoking_status.
Final Dataset:
- Transformed all categorical features into numeric for machine learning compatibility.
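The steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual code: the sample rows, the category labels, and the 0/1 mapping direction (for example Urban = 1) are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample using the dataset's column names; values are made up.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "ever_married": ["Yes", "No", "Yes", "Yes"],
    "work_type": ["Private", "Self-employed", "Govt_job", "Private"],
    "Residence_type": ["Urban", "Rural", "Urban", "Rural"],
    "bmi": [36.6, np.nan, 32.5, 24.0],
    "smoking_status": ["formerly smoked", "never smoked", "smokes", "Unknown"],
    "stroke": [1, 0, 0, 0],
})

# Option 1: drop rows with a missing bmi.
dropped = df.dropna(subset=["bmi"])

# Option 2: fill missing bmi with the column mean (mean() skips NaN).
imputed = df.copy()
imputed["bmi"] = imputed["bmi"].fillna(imputed["bmi"].mean())

# Binary encoding for the two-level columns (mapping direction is an assumption).
imputed["Residence_type"] = imputed["Residence_type"].map({"Rural": 0, "Urban": 1})
imputed["ever_married"] = imputed["ever_married"].map({"No": 0, "Yes": 1})

# One-hot encoding for the multi-level columns.
encoded = pd.get_dummies(
    imputed, columns=["gender", "work_type", "smoking_status"], dtype=int
)

print(encoded.dtypes)
```

Dropping rows is simpler but discards about 4% of the records, while mean imputation keeps every row at the cost of flattening the bmi distribution slightly.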

Data Visualization

Age Distribution: Bimodal with peaks at 40–60 and 80 years.
Hypertension and Stroke: Most hypertensive patients did not have a stroke.
BMI vs. Glucose Levels: Stroke patients cluster at higher glucose levels.
Age, Glucose, and BMI Interaction: Stronger correlation with stroke for age and glucose than BMI.

Model Training and Evaluation

Models Used

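Note: the three linear models below are regressors fit to the 0/1 target, so they have no classification accuracy. Their "accuracy" figures are consistent with R² scores and are not comparable to the logistic regression accuracy.

Linear Regression
- R² (reported as accuracy): 9.10%
- RMSE: 22.76% (0.2276 on the 0/1 target)
Lasso Regression
- R² (reported as accuracy): 0.94%
- RMSE: 23.76% (0.2376 on the 0/1 target)
Ridge Regression
- R² (reported as accuracy): 9.10%
- RMSE: 22.76% (0.2276 on the 0/1 target)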
Logistic Regression (best-performing model by accuracy)
- Accuracy: 93.93%
- Precision: 0.50
- Recall: 0.02
- F1 Score: 0.03
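To see how 93.93% accuracy can coexist with a recall of 0.02, it helps to work through one confusion matrix that reproduces the logistic regression figures. The counts below are hypothetical: they assume a 20% held-out test set of 1,022 records, which the source does not state, and they were chosen only because they match the reported metrics.

```python
# Hypothetical confusion-matrix counts, chosen only because they reproduce the
# reported metrics; the source does not give the actual counts.
tn, fp, fn, tp = 959, 1, 61, 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted strokes that were real
recall = tp / (tp + fn)      # share of real strokes that were detected
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.9393 precision=0.50 recall=0.02 f1=0.03
```

Under these assumed counts the model flags only two patients as stroke cases and misses 61 of 62 actual strokes, yet it is still correct on almost 94% of records because the negatives dominate.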

Key Observations

Class Imbalance: Stroke cases make up only about 5% of the dataset.
Logistic Regression:
- Performs well on non-stroke predictions.
- Fails to detect stroke cases (low recall and F1 score).
Precision-Recall Curve: Shows limited predictive power (AP = 0.16).
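The effect of the imbalance on accuracy can be checked with simple arithmetic. The sketch below uses illustrative round numbers (1,000 patients at a 5% stroke rate), not the project's data, and evaluates a trivial baseline that always predicts "no stroke".

```python
# Illustrative numbers only: 1,000 patients at a 5% stroke rate.
n_total = 1000
n_stroke = 50

# A "model" that always predicts no stroke gets every negative right
# and every positive wrong.
baseline_accuracy = (n_total - n_stroke) / n_total
baseline_recall = 0 / n_stroke

print(f"baseline accuracy={baseline_accuracy:.2f}, recall={baseline_recall:.2f}")
# prints: baseline accuracy=0.95, recall=0.00
```

An accuracy near 94% is therefore no better than this do-nothing baseline. For comparison, a no-skill classifier has an average precision roughly equal to the positive rate (about 0.05 here), so AP = 0.16 is above chance but still low.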

Recommendations

Address Class Imbalance:
- Use SMOTE or undersampling techniques.
Focus on Relevant Metrics:
- Prioritize precision, recall, and F1 score over accuracy.
Improve Model Performance:
- Experiment with Gradient Boosting or class-weighted logistic regression.
Expand Feature Analysis:
- Investigate additional factors influencing stroke risk.
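As one possible starting point for the class-imbalance recommendation, the sketch below compares plain and class-weighted logistic regression in scikit-learn. It runs on synthetic stand-in data with roughly 5% positives, not on the stroke dataset, so the numbers it prints say nothing about the project's results. SMOTE is not shown because it needs the separate imbalanced-learn package.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the stroke dataset): about 5% positives, with
# the positive class shifted upward on both features.
rng = np.random.default_rng(0)
n = 4000
y = (rng.random(n) < 0.05).astype(int)
X = rng.normal(size=(n, 2)) + 1.0 * y[:, None]

# Stratify so the train and test sets keep the same positive rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

plain = LogisticRegression().fit(X_train, y_train)
# class_weight="balanced" reweights samples inversely to class frequency.
weighted = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print(f"recall without weights: {recall_plain:.2f}")
print(f"recall with class_weight='balanced': {recall_weighted:.2f}")
```

Weighting typically raises recall at the cost of more false positives, so precision and F1 should be checked alongside it.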

Conclusion

Logistic Regression achieves high accuracy but fails to detect most stroke cases because of class imbalance. To build an effective predictive system, addressing the imbalance and focusing on minority-class metrics are critical.

Future Work

Explore advanced algorithms (e.g., XGBoost, Random Forest).
Perform hyperparameter tuning.
Integrate additional datasets for enhanced predictions.
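A hedged sketch of what the tuning step might look like, using scikit-learn's GridSearchCV with a Random Forest. The data is synthetic, and the parameter grid and the choice of average precision as the scoring metric are assumptions for illustration rather than settings from this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (not the stroke dataset), about 5% positives.
rng = np.random.default_rng(0)
n = 1000
y = (rng.random(n) < 0.05).astype(int)
X = rng.normal(size=(n, 3)) + 1.0 * y[:, None]

search = GridSearchCV(
    estimator=RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5]},  # small example grid
    scoring="average_precision",  # minority-class metric instead of accuracy
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Stratified folds keep the stroke rate similar in every split, which matters when positives are this rare.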

Acknowledgments

This project was developed as part of a comprehensive effort to leverage machine learning for healthcare improvements.
