This repository contains teaching materials and code examples that demonstrate exploratory data analysis (EDA), data preprocessing, and machine learning model training on the classic Pima Indians Diabetes Dataset.
The goal is to predict the onset of diabetes from diagnostic measurements. The target variable is Outcome (0 or 1), and the features are:
- Pregnancies
- Glucose
- BloodPressure
- SkinThickness
- Insulin
- BMI
- DiabetesPedigreeFunction
- Age
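
As a minimal sketch of how the columns listed above might be separated into a feature matrix and a target vector with pandas: the helper name `split_features_target` and the two demo rows are hypothetical and made up for illustration; they are not taken from the dataset or from this repository's code.

```python
import pandas as pd

# Column names as listed above.
FEATURES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]
TARGET = "Outcome"


def split_features_target(df: pd.DataFrame):
    """Check that the expected columns exist and return (X, y).

    Hypothetical helper for illustration only.
    """
    missing = [c for c in FEATURES + [TARGET] if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df[FEATURES], df[TARGET].astype(int)


# Two made-up rows (NOT real records) so the sketch runs without a data file.
demo = pd.DataFrame(
    [
        [1, 100, 70, 20, 80, 25.0, 0.3, 30, 0],
        [4, 160, 80, 0, 0, 33.5, 0.6, 50, 1],
    ],
    columns=FEATURES + [TARGET],
)

X, y = split_features_target(demo)
print(X.shape, y.tolist())  # (2, 8) [0, 1]
```

With the real data, `demo` would be replaced by a DataFrame read from the downloaded CSV file, for example with `pd.read_csv`.
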

The materials cover:
- Exploratory Data Analysis (EDA) using pandas and seaborn
- Data cleaning and imputation (handling 0s in medical data)
- Feature scaling
- Logistic Regression, Decision Trees, Random Forest
- Model evaluation using accuracy, confusion matrix, ROC-AUC
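
The cleaning, scaling, modelling, and evaluation steps listed above could be chained roughly as follows. This is a sketch under stated assumptions, not the repository's own code: the data is synthetic (generated by the hypothetical `make_synthetic` helper so the block runs without a data file), the choice of columns in which 0 is treated as a missing value is an assumption, and median imputation is just one reasonable option.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a 0 in these measurements means "not recorded", not a true zero.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def make_synthetic(n=400, seed=0):
    """Hypothetical stand-in data with the same column names as the dataset."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Pregnancies": rng.integers(0, 10, n),
        "Glucose": rng.normal(120, 30, n).clip(50, 200),
        "BloodPressure": rng.normal(70, 12, n).clip(40, 120),
        "SkinThickness": rng.normal(25, 8, n).clip(7, 60),
        "Insulin": rng.normal(120, 60, n).clip(15, 500),
        "BMI": rng.normal(32, 6, n).clip(18, 60),
        "DiabetesPedigreeFunction": rng.uniform(0.1, 1.5, n),
        "Age": rng.integers(21, 70, n),
    })
    # Made-up label rule: higher glucose and BMI raise the odds of Outcome = 1.
    score = 0.04 * (df["Glucose"] - 120) + 0.08 * (df["BMI"] - 32)
    df["Outcome"] = (score + rng.normal(0, 1, n) > 0).astype(int)
    # Overwrite some entries with 0 to mimic unrecorded measurements.
    for col in ("SkinThickness", "Insulin"):
        df.loc[rng.random(n) < 0.2, col] = 0
    return df


df = make_synthetic()
X = df.drop(columns="Outcome").astype(float)
y = df["Outcome"]

# Data cleaning: mark implausible zeros as missing so they can be imputed.
X[ZERO_AS_MISSING] = X[ZERO_AS_MISSING].replace(0, np.nan)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Imputation and scaling are fitted on the training split only,
# because they live inside the pipeline.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]

acc = accuracy_score(y_test, pred)
cm = confusion_matrix(y_test, pred)
auc = roc_auc_score(y_test, proba)
print(f"accuracy={acc:.3f}  roc_auc={auc:.3f}")
print(cm)
```

To try the tree-based models instead, the last pipeline step can be swapped for scikit-learn's `DecisionTreeClassifier` or `RandomForestClassifier`; the scaling step is then optional, since tree splits do not depend on feature scale.
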

The dataset is released into the public domain under the CC0 license; no attribution is required.
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261–265). IEEE Computer Society Press.
The dataset was provided by the National Institute of Diabetes and Digestive and Kidney Diseases via the UCI Machine Learning Repository and is hosted on Kaggle.