Problem Statement: Build a classification model that predicts whether an insurance policyholder will file a claim, directly supporting the core work of State Farm's Property & Casualty Actuarial & Underwriting Modeling Department — pricing accuracy and underwriting risk decisions.
Key Challenge: The dataset exhibits significant class imbalance (far more non-claim records than claim records), which mirrors the real-world distribution insurers face daily. The project must demonstrate techniques to handle this imbalance while producing models that are actionable for business stakeholders.
Dataset: Kaggle — Insurance Claims Prediction dataset (historical policyholder data with demographics, claim history, policy details, risk factors, and external factors).
Language & Tools: Python (end-to-end)
Tools: pandas, numpy
- Download the dataset from Kaggle and load into a pandas DataFrame.
- Run `df.info()`, `df.describe()`, and `df.shape` to understand dimensions, dtypes, and memory usage.
- Identify the target variable (claim filed: yes/no) and confirm the class distribution — quantify the imbalance ratio (e.g., 90/10 or 95/5).
- Catalog features into groups matching the dataset description:
- Policyholder info: age, gender, occupation, marital status, location
- Claim history: past claim amounts, claim types, frequency, durations
- Policy details: coverage type, policy duration, premium, deductibles
- Risk factors: credit score, driving record, health status, property characteristics
- External factors: economic indicators, weather, regulatory variables
- Compute missing-value percentages per column.
- Classify missingness pattern: MCAR, MAR, or MNAR.
- Decision point: drop columns with >60% missing; flag columns with 10–60% for imputation strategy in Phase 2.
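The loading and missingness audit above can be sketched as follows; the toy frame and its column names stand in for the real Kaggle download:

```python
import pandas as pd
import numpy as np

# Toy stand-in for the downloaded dataset; column names are illustrative.
df = pd.DataFrame({
    "age": [34, 51, np.nan, 28, 45],
    "credit_score": [np.nan, np.nan, np.nan, np.nan, 710],
    "claim_filed": [0, 0, 1, 0, 0],
})

# Missing-value percentage per column
missing_pct = df.isna().mean().mul(100)

# Class distribution of the target (quantifies the imbalance ratio)
counts = df["claim_filed"].value_counts(normalize=True)

# Decision point: drop >60% missing, flag 10-60% for imputation
to_drop = missing_pct[missing_pct > 60].index.tolist()
to_impute = missing_pct[(missing_pct >= 10) & (missing_pct <= 60)].index.tolist()
print(missing_pct.to_dict(), counts.to_dict(), to_drop, to_impute)
```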
Tools: matplotlib, seaborn, plotly
- Target distribution: Bar chart showing claim vs. no-claim counts with exact percentages.
- Univariate analysis: Histograms/KDE for continuous features; bar charts for categoricals — all split by the target class.
- Bivariate analysis: Correlation heatmap (continuous features), chi-squared tests (categorical vs. target), point-biserial correlation (continuous vs. binary target).
- Geospatial patterns: If location data permits, map claim rates by region to identify geographic risk clusters.
- Claim history deep-dive: Distribution of past claim frequency and amounts — these are likely the strongest predictors and deserve dedicated attention.
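The bivariate tests listed above can be run with scipy; the synthetic `claim`, `region`, and `premium` arrays below are illustrative stand-ins for real columns:

```python
import numpy as np
from scipy.stats import chi2_contingency, pointbiserialr

rng = np.random.default_rng(0)
claim = rng.integers(0, 2, 500)                        # binary target (toy)
region = rng.choice(["north", "south"], 500)           # categorical feature
premium = 900 + 50 * claim + rng.normal(0, 30, 500)    # continuous feature

# Categorical vs. target: chi-squared test of independence on the contingency table
table = np.array([[np.sum((region == r) & (claim == c)) for c in (0, 1)]
                  for r in ("north", "south")])
chi2, p_chi, _, _ = chi2_contingency(table)

# Continuous vs. binary target: point-biserial correlation
r_pb, p_pb = pointbiserialr(claim, premium)
print(round(p_chi, 3), round(r_pb, 3))
```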
Deliverable: A Jupyter notebook section with clearly narrated EDA, suitable for presenting to non-technical stakeholders (maps to the State Farm responsibility of communicating with business partners).
Tools: pandas, scikit-learn (SimpleImputer, KNNImputer)
- Missing values: Median imputation for skewed numerics, mode for categoricals, KNN imputation for features with complex relationships.
- Outlier treatment: Use IQR method and domain logic (e.g., a $500K claim on a $50K car warrants investigation, not automatic removal).
- Duplicates: Check for duplicate policyholder records; deduplicate or flag.
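A minimal imputation sketch, assuming a toy frame with a skewed numeric `premium` column and a categorical `coverage` column (in the real pipeline, all imputers are fit on training data only):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({
    "premium": [900.0, np.nan, 1200.0, 5000.0],     # skewed numeric -> median
    "coverage": ["basic", "full", np.nan, "full"],  # categorical -> mode
})

# Median for skewed numerics, most-frequent for categoricals
df["premium"] = SimpleImputer(strategy="median").fit_transform(df[["premium"]]).ravel()
df["coverage"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["coverage"]]).ravel()

# KNN imputation for numeric features with interdependencies
X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0]])
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(df["premium"].tolist(), X_filled[1, 1])
```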
This is where domain knowledge about insurance creates differentiation:
- Interaction features:
  - `claim_frequency_per_year` = total past claims / policy duration
  - `premium_to_coverage_ratio` = premium amount / coverage amount (proxy for risk loading)
  - `claims_cost_ratio` = total past claim amounts / total premiums paid
- Binning with business logic:
- Age groups aligned with actuarial brackets (18–25 high-risk, 26–35 moderate, etc.)
- Credit score tiers (poor/fair/good/excellent)
- Encoding categoricals:
- One-hot encoding for low-cardinality features (gender, marital status)
- Target encoding for high-cardinality features (occupation, location) — use cross-validated target encoding to avoid leakage
- Temporal features: If policy start dates exist, extract tenure, season of policy inception, etc.
- External factor transforms: Standardize economic indicators; consider lagging weather variables if temporal alignment allows.
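A short sketch of the ratio and binning features; the column names follow the formulas above but are assumptions about the actual dataset schema:

```python
import pandas as pd

df = pd.DataFrame({
    "total_past_claims": [2, 0, 5],
    "policy_duration_years": [4.0, 2.0, 10.0],
    "premium": [1200.0, 800.0, 2000.0],
    "coverage_amount": [60000.0, 40000.0, 50000.0],
    "age": [22, 40, 67],
})

# Interaction / ratio features (guard against zero durations)
df["claim_frequency_per_year"] = (
    df["total_past_claims"] / df["policy_duration_years"].clip(lower=1e-9)
)
df["premium_to_coverage_ratio"] = df["premium"] / df["coverage_amount"]

# Binning with actuarial-style age brackets
df["age_group"] = pd.cut(df["age"], bins=[17, 25, 35, 55, 120],
                         labels=["18-25", "26-35", "36-55", "56+"])
print(df[["claim_frequency_per_year", "age_group"]])
```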
Tools: scikit-learn (train_test_split, StratifiedKFold)
- Stratified split: 80/20 train/test with stratification on the target to preserve class ratios.
- Hold out the test set entirely — all imputation, scaling, and encoding fit only on training data.
- Cross-validation strategy: Stratified 5-fold CV on the training set for model selection.
- StandardScaler for linear/distance-based models.
- Tree-based models (the likely winners) don't require scaling, but keep the pipeline flexible.
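The split-then-fit discipline above can be enforced with a Pipeline, so the scaler is refit inside every CV fold and never sees held-out data; the synthetic data below stands in for the real features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Imbalanced toy data (~10% positive class)
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=42)

# Stratified 80/20 split preserves the class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaler and model bundled: scaling is fit only on each training fold (no leakage)
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_f1 = cross_val_score(pipe, X_tr, y_tr, cv=cv, scoring="f1")
print(y_tr.mean(), y_te.mean(), len(fold_f1))
```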
This is the central technical challenge. Implement and compare multiple strategies:
Tools: imbalanced-learn (imblearn)
- SMOTE (Synthetic Minority Oversampling Technique): Generate synthetic minority-class samples. Test standard SMOTE and variants (BorderlineSMOTE, ADASYN).
- Random undersampling: Reduce majority class. Fast but loses information.
- Combination: SMOTE + Tomek Links (oversample minority, then clean decision boundaries).
- Class weights: Pass `class_weight='balanced'` to models that support it (logistic regression, random forest; XGBoost uses `scale_pos_weight`).
- Threshold tuning: Instead of the default 0.5 cutoff, optimize the classification threshold using precision-recall curves.
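imblearn's `SMOTE` class wraps the interpolation idea sketched below; this hand-rolled version (not the library implementation) shows concretely what "synthetic minority samples" means:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic samples by moving a random minority point
    part of the way toward one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: each point is its own neighbor
    _, idx = nn.kneighbors(X_min)
    samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i][rng.integers(1, k + 1)]               # skip self at position 0
        lam = rng.random()                               # interpolation weight in [0, 1)
        samples.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(samples)

X_min = np.random.default_rng(1).normal(size=(20, 2))    # toy minority class
X_new = smote_like(X_min, n_new=30)
print(X_new.shape)
```

BorderlineSMOTE and ADASYN refine the same mechanism by choosing which minority points to interpolate from.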
Critical decision: Accuracy is misleading with imbalanced data. Focus on:
- Primary metric: F1-Score (harmonic mean of precision and recall) — balances false positives (insuring too cautiously) and false negatives (missing risky policyholders).
- AUC-ROC: Measures ranking ability across all thresholds.
- AUC-PR (Precision-Recall Curve): More informative than ROC when the positive class is rare.
- Business-oriented metrics: Compute the estimated dollar impact of false negatives (missed claims) vs. false positives (over-pricing) to frame results in underwriting terms.
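Threshold tuning against the precision-recall curve, as described above, reduces to picking the F1-maximizing cutoff; a sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]

# Sweep every candidate threshold along the PR curve and pick the F1-maximizing cutoff
prec, rec, thr = precision_recall_curve(y, proba)
f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
best_threshold = thr[np.argmax(f1[:-1])]   # final (prec, rec) point has no threshold
print(round(best_threshold, 3))
```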
Tools: scikit-learn (LogisticRegression)
- Logistic Regression with `class_weight='balanced'` as the interpretable baseline.
- Record all metrics. This is the "beat this" benchmark.
Tools: scikit-learn (RandomForestClassifier, GradientBoostingClassifier), XGBoost, LightGBM
- Random Forest: Good out of the box; set `class_weight='balanced_subsample'`.
- XGBoost: Tune `scale_pos_weight`, `max_depth`, `learning_rate`, `n_estimators`, `min_child_weight`, `subsample`, `colsample_bytree`.
- LightGBM: Often faster than XGBoost on larger datasets; tune `is_unbalance` or `scale_pos_weight`.
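A sketch of the two weighting mechanisms: the `scale_pos_weight` convention used by XGBoost/LightGBM (the negative-to-positive count ratio, computed here with plain numpy to keep the example library-light) and scikit-learn's `balanced_subsample` option:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# XGBoost/LightGBM convention: scale_pos_weight = n_negative / n_positive
scale_pos_weight = (y == 0).sum() / (y == 1).sum()

# Random Forest re-weights classes inside each bootstrap sample
rf = RandomForestClassifier(n_estimators=100,
                            class_weight="balanced_subsample",
                            random_state=0).fit(X, y)
print(round(scale_pos_weight, 2))
```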
Tools: scikit-learn (RandomizedSearchCV), optuna
- Use Optuna for Bayesian hyperparameter optimization (more efficient than grid search).
- Optimize on F1-Score within stratified 5-fold CV.
- Log all experiments for reproducibility.
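The optimize-F1-inside-stratified-CV loop, shown here with scikit-learn's `RandomizedSearchCV`; an Optuna study follows the same pattern with an objective function wrapping the cross-validated score:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

X, y = make_classification(n_samples=500, weights=[0.85], random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_distributions={"max_depth": randint(2, 10),
                         "n_estimators": randint(50, 200)},
    n_iter=5,
    scoring="f1",                                        # optimize F1, not accuracy
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,                                      # reproducible sampling
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```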
- Combine top 2–3 models using a meta-learner (logistic regression on base model predictions).
- Evaluate whether the stacked model materially improves over the best single model.
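A stacking sketch with a logistic-regression meta-learner fit on out-of-fold base-model predictions (which is what scikit-learn's `StackingClassifier` does internally via its `cv` argument):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, weights=[0.85], random_state=0)

# Meta-learner sees only out-of-fold predictions from the base models
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
f1_stack = cross_val_score(stack, X, y, cv=3, scoring="f1").mean()
print(round(f1_stack, 3))
```

Comparing `f1_stack` against each base model's CV score answers whether stacking is worth the added complexity.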
- Run the final model on the held-out 20% test set (used only once).
- Generate: confusion matrix, classification report, ROC curve, precision-recall curve.
- Compare performance across all imbalance-handling strategies.
Tools: shap, matplotlib
- SHAP (SHapley Additive exPlanations):
- Summary plot: Which features drive predictions overall?
- Dependence plots: How does each key feature affect claim probability?
- Force plots: Explain individual predictions (e.g., "Why was this policyholder flagged as high-risk?").
- Feature importance ranking: Compare SHAP-based importance to the model's native feature importances.
- Partial Dependence Plots (PDPs): Show the marginal effect of features like age, credit score, and claim history on predicted probability — ideal for non-technical presentation.
- Examine whether the model's error rates differ significantly across demographic groups (age, gender, location).
- Document any disparities and recommend mitigations — relevant to regulatory considerations in insurance.
This phase maps directly to State Farm's emphasis on consulting with business partners and driving data-informed decisions.
- Use the model's predicted probabilities to segment policyholders into risk tiers (Low / Medium / High / Very High).
- Profile each tier: average demographics, claim history, policy characteristics.
- Recommend tier-specific underwriting actions (e.g., additional review for High tier, streamlined approval for Low).
- Quantify the expected claim rate per risk tier.
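Quartile-based tiering from predicted probabilities, sketched with simulated scores (in practice the cutoffs would be set with underwriting input rather than by quantile):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
scores = pd.Series(rng.random(1000), name="claim_probability")  # model output (toy)
claims = pd.Series((rng.random(1000) < scores).astype(int))     # simulated outcomes

# Quartile-based risk tiers from the predicted probabilities
tiers = pd.qcut(scores, q=4, labels=["Low", "Medium", "High", "Very High"])

# Expected claim rate per tier
tier_rates = claims.groupby(tiers).mean()
print(tier_rates.round(2).to_dict())
```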
- Estimate the financial impact: "If this model had been used to adjust premiums, the estimated reduction in underwriting loss would be $X."
- Provide a simple cost-benefit analysis of model adoption.
- Based on feature importance, identify which data fields are most valuable and recommend prioritizing their collection quality.
- Flag features with high missingness but high predictive value — these represent data collection investment opportunities.
- Recommend any external data sources (e.g., weather APIs, credit bureau feeds) that could improve model performance.
- Complete Jupyter notebook with narrative markdown cells explaining each decision.
- All code reproducible with a `requirements.txt` and random seeds set.
- Organized with a clear table of contents.
A 5–7 slide summary covering:
- The problem: Why accurate claim prediction matters for pricing and underwriting.
- What the data told us: Key EDA insights in plain language with visuals.
- What we built: Model approach explained without jargon (a "scoring system" that estimates claim likelihood).
- How well it works: Metrics translated to business terms (e.g., "The model correctly identifies 82% of policyholders who will file claims").
- What we recommend: Risk tiers, pricing adjustments, data quality investments.
- Next steps: Production deployment considerations, monitoring plan, model refresh cadence.
insurance-claims-prediction/
├── README.md # Project overview and setup instructions
├── requirements.txt # Python dependencies
├── notebooks/
│ ├── 01_eda.ipynb # Exploratory data analysis
│ ├── 02_preprocessing.ipynb # Cleaning and feature engineering
│ ├── 03_modeling.ipynb # Model training and selection
│ └── 04_evaluation.ipynb # Final evaluation and interpretability
├── src/
│ ├── data_prep.py # Reusable data preparation functions
│ ├── features.py # Feature engineering pipeline
│ ├── models.py # Model training utilities
│ └── evaluation.py # Evaluation and plotting utilities
├── reports/
│ ├── stakeholder_summary.pdf # Non-technical presentation
│ └── model_card.md # Model documentation (performance, limitations, fairness)
└── data/
└── README.md # Data source and download instructions
pandas>=2.0
numpy>=1.24
scikit-learn>=1.3
xgboost>=2.0
lightgbm>=4.0
imbalanced-learn>=0.11
shap>=0.43
optuna>=3.4
matplotlib>=3.7
seaborn>=0.13
plotly>=5.18
jupyter>=1.0
| State Farm Responsibility | Project Component |
|---|---|
| Develop and validate advanced analytic models | Phases 4–5: Multiple classifiers with rigorous cross-validation |
| Build datasets to support model development | Phase 2: Feature engineering pipeline with domain-informed features |
| Collaborate on scoping and decision points | Phase 1: Documented decision framework for each analytical choice |
| Present to non-technical business partners | Phase 7: Stakeholder deck translating results to business language |
| Make strategic data collection recommendations | Phase 6.3: Feature importance-driven data investment analysis |
| Work on complex problems of diverse scope | Phase 3: Multiple imbalance strategies compared systematically |
| Peer review and mentoring | Phase 7: Model card and clean code structure enable team review |
| Open-source contribution mindset | Phase 7.3: Public GitHub repo with modular, documented code |
| Business-oriented data science mindset | Phase 6: Risk segmentation, pricing impact, and cost-benefit analysis |