This repository archives three machine learning projects, two supervised and one unsupervised, completed during the Data Science Infinity program.
Each project demonstrates end-to-end implementation — from data preprocessing and model training to evaluation and business application — using Python and scikit-learn.
These projects collectively showcase a range of core data science skills:
- Regression modeling and prediction
- Supervised classification and evaluation metrics
- Unsupervised clustering and customer segmentation
- Data preprocessing, feature engineering, and model interpretability
Goal: Build predictive regression models to estimate customer loyalty scores for ABC Grocery’s membership program.
Techniques: Linear Regression, Random Forest Regressor, Feature Selection via RFECV.
Highlights:
- Compared multiple regression approaches for predictive accuracy.
- Identified key drivers of customer loyalty such as spending patterns and distance from store.
- Demonstrated robust data preprocessing and feature importance analysis.
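
As a rough illustration of the feature selection step, the sketch below runs RFECV around a linear regression. It is not the project's actual code: the data is synthetic (generated with `make_regression`), and the fold count, scoring metric, and variable names are assumptions chosen for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the loyalty data (assumption): 200 rows,
# 8 candidate features, of which only 3 actually drive the target.
X, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=42
)

# Recursive feature elimination with cross-validation: drop one feature
# per step and keep the subset with the best mean cross-validated R^2.
selector = RFECV(estimator=LinearRegression(), step=1, cv=5, scoring="r2")
selector.fit(X, y)

n_selected = selector.n_features_   # number of features kept
selected_mask = selector.support_   # boolean mask over the 8 columns

print(f"Features kept: {n_selected}")
print(f"Selection mask: {selected_mask}")

# Reduce the matrix to the selected columns before fitting a final model.
X_reduced = selector.transform(X)
```

With real data, `selected_mask` can be applied to the DataFrame's column names to see which inputs survived elimination.
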
Goal: Predict which customers are most likely to sign up for ABC Grocery’s Delivery Club membership using supervised ML classification.
Techniques: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbors (KNN).
Highlights:
- Random Forest achieved the best balance of accuracy (0.935) and recall (0.904).
- Emphasized interpretability via feature and permutation importance.
- Provided actionable business insight: proximity to store was the top predictor of signups.
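
The evaluation and interpretability steps can be sketched roughly as follows. This is a minimal example on synthetic data, not the project's code: the class balance, split size, hyperparameters, and variable names are all assumptions, and the scores it prints will not match the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced stand-in for the signup data (assumption).
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.7, 0.3], random_state=42,
)

# Stratify so the train and test sets keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Accuracy alone can mislead on imbalanced classes, so report recall too.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

# Permutation importance: shuffle one feature at a time on held-out data
# and measure how much the model's score drops.
perm = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42)
ranking = perm.importances_mean.argsort()[::-1]  # most important first

print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
print(f"feature indices by permutation importance: {ranking}")
```

Permutation importance is computed on the test set here so that it reflects what the model relies on for unseen customers, rather than what it memorized during training.
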
Goal: Use unsupervised learning (k-means) to segment grocery customers based on dietary preferences and spending patterns.
Techniques: K-Means Clustering, Feature Scaling, WCSS (Elbow Method).
Highlights:
- Identified 3 main segments (General, Vegetarian, Vegan-like) based on product area spend.
- Provided clear actionable insights for personalized marketing strategies.
- Suggested future applications: deeper subcategory segmentation, integration with demographic data.
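
The scaling and elbow-method steps might look something like the sketch below. The data is synthetic, and `MinMaxScaler`, the range of k values, and the three hypothetical product-area columns are assumptions for illustration only, not details taken from the project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)

# Synthetic share-of-spend across three hypothetical product areas,
# drawn around three made-up customer profiles (assumption).
centers = np.array([
    [0.4, 0.3, 0.3],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
])
X = np.vstack([c + rng.normal(0, 0.03, size=(50, 3)) for c in centers])

# Scale features so no single column dominates the distance calculation.
X_scaled = MinMaxScaler().fit_transform(X)

# Elbow method: record the within-cluster sum of squares (WCSS), which
# scikit-learn exposes as inertia_, for each candidate number of clusters.
wcss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    wcss.append(km.inertia_)

# Fit the final model with the k chosen from the elbow (3 in this sketch).
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

for k, value in enumerate(wcss, start=1):
    print(f"k={k}: WCSS={value:.3f}")
```

Plotting `wcss` against k and looking for the point where the curve flattens is the usual way to pick the cluster count.
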
| Tool / Library | Purpose |
|---|---|
| Python (3.x) | Core scripting language |
| pandas / numpy | Data cleaning & transformation |
| scikit-learn | Machine learning & evaluation |
| matplotlib / seaborn | Visualization |
| pickle | Model persistence |
- Regression and classification modeling
- Clustering and segmentation
- Cross-validation and feature selection
- Evaluating models on imbalanced data (precision, recall, F1)
- Model interpretability and visualization
- Business translation of ML insights
This collection marks the foundation of my applied machine learning journey — moving from conceptual understanding to practical, business-relevant modeling.
Each project emphasizes clarity, reproducibility, and interpretability — demonstrating the bridge between statistical rigor and actionable insight.
© 2025 Samuel Shaw
📍 Seattle, WA
📫 LinkedIn