Unlock the Power of Customer Data!
A Data Science Project to segment customers based on their purchasing behavior using RFM Analysis and K-Means Clustering.
Click on any section to jump directly to it:
- 📂 Project Overview
- 🛠️ Technologies Used
- 📂 Dataset
- ⚙️ Analysis Workflow
- 📈 Key Insights & Results
- 🚀 How to Run
- ✍️ Author
- License
This project focuses on identifying distinct customer segments for an online retail business. By analyzing transactional data, we group customers based on their purchasing habits to create targeted marketing strategies. We utilize RFM Analysis (Recency, Frequency, Monetary) combined with both rule-based segmentation and machine learning (K-Means Clustering).
The project is built using Python and the following powerful libraries:
- Pandas: Data manipulation and analysis.
- NumPy: Numerical computations.
- Matplotlib & Seaborn: Data visualization (Histograms, Boxplots, Bar charts).
- Scikit-Learn: Machine Learning (StandardScaler, K-Means Clustering).
The analysis is based on the Online Retail dataset.
- File Path: `Dataset/Online Retail.xlsx`
- Description: Contains transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer.
We start by ensuring high data quality:
- Missing Values: Handling null values in `CustomerID` and `Description`.
- Duplicates: Removing duplicate transactions to avoid skewing data.
- Data Types: Converting `InvoiceDate` to datetime objects.
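The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `clean_transactions` helper is hypothetical, and the choices to drop rows without a `CustomerID` and to fill missing `Description` values with `"Unknown"` are assumptions, since the README only says null values are handled. The `InvoiceNo` column name is assumed from the public Online Retail dataset.

```python
import pandas as pd


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: handle nulls, drop duplicates, parse dates."""
    # Assumption: a row without a CustomerID cannot be attributed to a customer.
    out = df.dropna(subset=["CustomerID"]).copy()
    # Assumption: a missing description is harmless, so keep the row.
    out["Description"] = out["Description"].fillna("Unknown")
    # Remove exact duplicate rows.
    out = out.drop_duplicates()
    # Convert InvoiceDate to datetime objects.
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"])
    return out.reset_index(drop=True)


# Tiny synthetic sample (not real data) to exercise the helper.
raw = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "536367"],
    "Description": ["MUG", "MUG", None, "LAMP"],
    "InvoiceDate": ["2010-12-01 08:26", "2010-12-01 08:26",
                    "2010-12-01 08:28", "2010-12-01 08:34"],
    "CustomerID": [17850.0, 17850.0, 13047.0, None],
})
clean = clean_transactions(raw)
print(clean)
```

On the real file, the same helper could be applied to the frame returned by `pd.read_excel("Dataset/Online Retail.xlsx")`.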
We visualize the data to understand trends:
- Top Countries: Identifying which countries have the most customers.
- Price Distribution: Analyzing `UnitPrice` to detect outliers.
- Orders per Day: Tracking transaction volume over time.
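The aggregations behind these three views might look like the sketch below. It uses a small synthetic frame rather than the real data, leaves out the plotting calls, and uses the 1.5 × IQR rule for outliers purely as an assumption; the README does not say which outlier criterion the notebook applies. The `Country` and `InvoiceNo` column names are assumed from the public Online Retail dataset.

```python
import pandas as pd

# Synthetic transactions (not real data), one row per invoice line.
tx = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A3", "A4"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-04 10:00", "2011-01-04 10:00", "2011-01-04 15:30",
        "2011-01-05 09:15", "2011-01-05 11:00",
    ]),
    "UnitPrice": [2.5, 3.0, 1.25, 2.0, 400.0],
    "CustomerID": [1, 1, 2, 3, 1],
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom",
                "France", "United Kingdom"],
})

# Top countries: number of distinct customers per country.
customers_per_country = (
    tx.groupby("Country")["CustomerID"].nunique().sort_values(ascending=False)
)

# Price distribution: flag UnitPrice values outside 1.5 * IQR (assumed rule).
q1, q3 = tx["UnitPrice"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = tx[(tx["UnitPrice"] < lower) | (tx["UnitPrice"] > upper)]

# Orders per day: distinct invoices per calendar day.
orders_per_day = tx.groupby(tx["InvoiceDate"].dt.date)["InvoiceNo"].nunique()

print(customers_per_country, outliers, orders_per_day, sep="\n\n")
```

Each resulting Series can be passed to Matplotlib or Seaborn for the bar charts, boxplots, and time series mentioned above.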
We compute the three key metrics for every customer:
- Recency (R): How many days ago was their last purchase?
- Frequency (F): How often do they buy?
- Monetary (M): How much do they spend?
Each metric is then scored from 1 to 5 using quantiles (`pd.qcut`).
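A minimal sketch of the RFM computation and quantile scoring is shown below. Several details are assumptions rather than the notebook's confirmed approach: the helper names `compute_rfm` and `score_rfm`, the `Quantity` and `InvoiceNo` columns (taken from the public Online Retail dataset), the snapshot date of one day after the last transaction, and the use of `rank(method="first")` to break ties so that `pd.qcut` always finds distinct bin edges. Recency labels are reversed so that a more recent purchase earns a higher score.

```python
import pandas as pd


def compute_rfm(tx: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per customer with Recency, Frequency, Monetary."""
    tx = tx.assign(TotalPrice=tx["Quantity"] * tx["UnitPrice"])
    # Assumption: measure recency from the day after the last transaction.
    snapshot_date = tx["InvoiceDate"].max() + pd.Timedelta(days=1)
    grouped = tx.groupby("CustomerID")
    return pd.DataFrame({
        "Recency": (snapshot_date - grouped["InvoiceDate"].max()).dt.days,
        "Frequency": grouped["InvoiceNo"].nunique(),
        "Monetary": grouped["TotalPrice"].sum(),
    })


def score_rfm(rfm: pd.DataFrame, bins: int = 5) -> pd.DataFrame:
    """Hypothetical helper: assign 1..bins scores per metric with pd.qcut."""
    labels = list(range(1, bins + 1))
    scored = rfm.copy()
    # Ranking first breaks ties, so qcut never sees duplicate bin edges.
    scored["R_Score"] = pd.qcut(
        rfm["Recency"].rank(method="first"), bins, labels=labels[::-1]
    ).astype(int)  # fewer days since last purchase -> higher score
    scored["F_Score"] = pd.qcut(
        rfm["Frequency"].rank(method="first"), bins, labels=labels
    ).astype(int)
    scored["M_Score"] = pd.qcut(
        rfm["Monetary"].rank(method="first"), bins, labels=labels
    ).astype(int)
    return scored


# Synthetic transactions for five customers (not real data).
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3, 4, 5],
    "InvoiceNo": ["A1", "A2", "A3", "B1", "B2", "C1", "D1", "E1"],
    "InvoiceDate": pd.to_datetime([
        "2011-12-08", "2011-12-09", "2011-11-01", "2011-11-29",
        "2011-10-01", "2011-09-10", "2011-06-01", "2011-01-15",
    ]),
    "Quantity": [10, 10, 20, 4, 6, 3, 2, 1],
    "UnitPrice": [5.0, 5.0, 5.0, 10.0, 10.0, 10.0, 5.0, 2.0],
})
rfm = compute_rfm(tx)
scored = score_rfm(rfm)
print(scored)
```

Tied values (for example, several customers with a single order) end up in different bins depending on row order, which is a side effect of the tie-breaking choice rather than a property of RFM itself.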
We take it a step further with Machine Learning:
- Log Transformation: To handle skewed data distribution.
- Scaling: Using
StandardScalerto normalize metrics. - Elbow Method: Determining the optimal number of clusters (
k). - Clustering: Grouping customers into mathematical clusters.
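The four steps above could be wired together as in this sketch. It runs on synthetic RFM values, not the project's data, and makes a few assumptions: `np.log1p` as the log transform (which requires non-negative values), a candidate range of `k` from 1 to 6, and a hard-coded `chosen_k = 2` that suits only this toy data. In practice `k` would be read off the elbow plot of inertia against `k`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic RFM table with two obvious groups (not real data).
rng = np.random.default_rng(42)
n = 100
rfm = pd.DataFrame({
    "Recency": np.concatenate([rng.integers(1, 30, n), rng.integers(150, 365, n)]),
    "Frequency": np.concatenate([rng.integers(10, 40, n), rng.integers(1, 4, n)]),
    "Monetary": np.concatenate([rng.uniform(1000, 5000, n), rng.uniform(10, 200, n)]),
})

# 1. Log transformation to reduce skew (log1p assumes values >= 0).
log_rfm = np.log1p(rfm)

# 2. Scaling to zero mean and unit variance.
scaled = StandardScaler().fit_transform(log_rfm)

# 3. Elbow method: record inertia for each candidate k.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(scaled)
    inertias[k] = model.inertia_

# 4. Clustering with the chosen k (hypothetical value for this toy data).
chosen_k = 2
kmeans = KMeans(n_clusters=chosen_k, n_init=10, random_state=42)
rfm["Cluster"] = kmeans.fit_predict(scaled)

print(inertias)
print(rfm.groupby("Cluster")[["Recency", "Frequency", "Monetary"]].mean())
```

Plotting `inertias` against `k` and looking for the point where the curve flattens gives the elbow; the per-cluster means printed at the end help interpret what each cluster represents.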
Based on the RFM Scores, customers are categorized into segments such as:
| Segment | Description | Strategy |
|---|---|---|
| 🏆 Champions | High R, F, M scores. Bought recently, buy often, and spend the most. | Reward them. Can become early adopters of new products. |
| 💎 Loyal Customers | Good Frequency and Monetary scores. | Upsell higher value products. Ask for reviews. |
| | Average Recency, Frequency, and Monetary scores. | Send personalized emails to reconnect, offer renewals. |
| 💤 Hibernating | Low Recency, Frequency, and Monetary scores. | Recreate brand value, offer relevant discounts. |
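For illustration, rule-based segmentation on the 1 to 5 scores might be expressed as below. The README does not state the actual cut-offs, so every threshold here is a hypothetical assumption, as is the `assign_segment` helper and the `"Other"` catch-all for customers that match none of the named rules.

```python
import pandas as pd


def assign_segment(r: int, f: int, m: int) -> str:
    """Hypothetical rules mapping R, F, M scores (1-5) to a segment name."""
    if r >= 4 and f >= 4 and m >= 4:
        return "Champions"        # high on all three scores
    if f >= 3 and m >= 3:
        return "Loyal Customers"  # good Frequency and Monetary scores
    if r <= 2 and f <= 2 and m <= 2:
        return "Hibernating"      # low on all three scores
    return "Other"                # assumed catch-all, not from the table


# Synthetic score table (not real data).
scores = pd.DataFrame({
    "R_Score": [5, 2, 1, 3],
    "F_Score": [5, 4, 1, 2],
    "M_Score": [5, 3, 2, 3],
})
scores["Segment"] = [
    assign_segment(r, f, m)
    for r, f, m in zip(scores["R_Score"], scores["F_Score"], scores["M_Score"])
]
print(scores)
```

Rule order matters: a customer is assigned to the first rule that matches, so the strictest rule (Champions) is checked first.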
- Clone the repository.
- Install dependencies: `pip install pandas numpy matplotlib seaborn scikit-learn openpyxl`
- Run the Notebook: Open `notebook.ipynb` in Jupyter Notebook or VS Code and execute the cells sequentially.
- Name: Mohamed Younis
Add a license that matches how you want others to use your work (e.g., MIT).
Created with ❤️ for Data Science Enthusiasts