This project is a Data Analytics Using Python (DAP) mini project focused on analyzing and predicting medical insurance charges based on personal and lifestyle attributes. The project combines Exploratory Data Analysis (EDA), statistical hypothesis testing, and machine learning (Linear Regression) to identify key cost-driving factors.
- Analyze the Medical Insurance Charges dataset to identify important factors affecting insurance costs
- Statistically validate differences in charges between smokers and non-smokers
- Build a Linear Regression model to predict insurance charges
- Evaluate model performance using standard regression metrics
- Total records: 1,337 (after cleaning)
- Target variable: `charges`
- Features used:
  - age
  - sex
  - bmi
  - children
  - smoker
- The `region` feature was excluded to keep the baseline model focused.
- Python
- Pandas
- NumPy
- Seaborn & Matplotlib
- SciPy
- Scikit-learn
- Data Cleaning & Preprocessing
  - Removed duplicate records
  - Converted categorical variables using mapping
  - Feature scaling using StandardScaler
- Exploratory Data Analysis (EDA)
  - Distribution analysis of insurance charges
  - Correlation heatmap
  - Scatter plots for multivariate analysis
- Statistical Analysis
  - Independent Two-Sample T-Test
  - Compared insurance charges between smokers and non-smokers
- Predictive Modeling
  - Linear Regression model
  - 80/20 train-test split
  - Model evaluation using R², MAE, and RMSE
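The cleaning, encoding, scaling, and modeling steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: the data is synthetic (generated only so the example runs on its own), the category labels `"male"`/`"female"` and `"yes"`/`"no"` are assumed, and `random_state=42` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the insurance data (assumption: illustrative only,
# these numbers are NOT taken from the real dataset).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["male", "female"], n),
    "bmi": rng.normal(30, 6, n).round(1),
    "children": rng.integers(0, 5, n),
    "smoker": rng.choice(["yes", "no"], n, p=[0.2, 0.8]),
})
df["charges"] = (
    250 * df["age"]
    + 300 * df["bmi"]
    + 20000 * (df["smoker"] == "yes")
    + rng.normal(0, 3000, n)
)

# 1. Remove duplicate records.
df = df.drop_duplicates().reset_index(drop=True)

# 2. Convert categorical variables with explicit mappings (assumed labels).
df["sex"] = df["sex"].map({"female": 0, "male": 1})
df["smoker"] = df["smoker"].map({"no": 0, "yes": 1})

features = ["age", "sex", "bmi", "children", "smoker"]
X, y = df[features], df["charges"]

# 3. 80/20 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Standardize features; the scaler is fitted on the training split only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 5. Fit Linear Regression and evaluate on the held-out split.
model = LinearRegression().fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
print(f"R2={r2:.4f}  MAE={mae:,.0f}  RMSE={rmse:,.0f}")
```

Fitting the scaler on the training split alone is a design choice made here to keep test-set statistics out of preprocessing; the project may have scaled before splitting.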
- Smoker status is the strongest predictor of insurance charges
- Charges increase significantly faster with age for smokers
- Statistical tests confirmed a highly significant difference between smokers and non-smokers
- Linear Regression model explained ~80% of the variance in insurance charges
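The smoker versus non-smoker comparison behind these findings can be sketched with SciPy's `ttest_ind`. The two samples below are synthetic placeholders with arbitrary means and spreads, not values from the dataset. Using `equal_var=False` (Welch's variant) is an assumption; the project may have used the default pooled-variance test.

```python
import numpy as np
from scipy import stats

# Synthetic placeholder samples (assumption: arbitrary values for illustration).
rng = np.random.default_rng(1)
smoker_charges = rng.normal(30000, 10000, 250)
non_smoker_charges = rng.normal(10000, 6000, 1000)

# Independent two-sample t-test.
# equal_var=False selects Welch's t-test, which does not assume equal variances;
# equal_var=True would give the classic Student's t-test.
t_stat, p_value = stats.ttest_ind(
    smoker_charges, non_smoker_charges, equal_var=False
)

alpha = 0.05  # conventional significance level (assumed)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
print("Reject H0 (equal means)" if p_value < alpha else "Fail to reject H0")
```

With real data, the two samples would be the `charges` values filtered by the `smoker` column.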
- R² Score: 0.8046
- Mean Absolute Error (MAE): ~$4,198
- Root Mean Square Error (RMSE): ~$5,991
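For readers unfamiliar with these metrics, the sketch below computes them by hand with NumPy on four made-up values (hypothetical numbers, unrelated to the project's results). MAE is the average absolute error, RMSE is the square root of the average squared error, and R² is one minus the ratio of residual to total sum of squares.

```python
import numpy as np

# Hypothetical actual and predicted charges, for illustration only.
y_true = np.array([10000.0, 20000.0, 30000.0, 40000.0])
y_pred = np.array([12000.0, 18000.0, 33000.0, 37000.0])

residuals = y_true - y_pred

# Mean Absolute Error: average size of the errors, in dollars.
mae = np.mean(np.abs(residuals))

# Root Mean Square Error: penalizes large errors more heavily than MAE.
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: share of the variance in y_true explained by the predictions.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:.2f}")   # MAE:  2500.00
print(f"RMSE: {rmse:.2f}")  # RMSE: 2549.51
print(f"R2:   {r2:.4f}")    # R2:   0.9480
```

RMSE is never smaller than MAE, and a large gap between the two suggests that a few large errors dominate.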
Lifestyle choices, particularly smoking, along with age and BMI, play a dominant role in determining medical insurance costs. The statistical evidence strongly supports the visual patterns observed in EDA, and the same factors account for most of the regression model's predictive power.
Prateek Garg
This project is intended for academic and learning purposes.