End-to-end user behavior analysis on a real-world e-commerce dataset
➡️ Live Dashboard: view the interactive dashboard
➡️ Real-time Prediction App: real-time purchase prediction
This project performs an end-to-end analysis of a real-world e-commerce user behavior dataset (October 2019, sampled to 500,000 rows). It covers 9 analytical modules: data cleaning, funnel analysis, RFM clustering, behavior path analysis, retention analysis (skipped for this single-month sample), purchase prediction (ML), anomaly detection, association rule mining, and an interactive dashboard.
- Source: eCommerce behavior data from multi category store (Kaggle)
- Author: Michael Kechinov
- File used: `2019-Oct.csv` (sampled 500,000 rows × 9 columns to control memory usage)
- Event types: view → cart → purchase
- Read `2019-Oct.csv` with `nrows=500000` to control memory usage
- Filled missing values: `category_code` → `'no_cate'`, `brand` → `'no_bran'`
- Converted `event_time` to datetime format
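The cleaning steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the `load_and_clean` helper and the three-row inline sample are assumptions made so the snippet runs without the Kaggle file, while the column names follow the dataset's published schema.

```python
import io

import pandas as pd

# Tiny inline stand-in for 2019-Oct.csv (hypothetical rows, real column names).
SAMPLE_CSV = """event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
2019-10-01 00:00:00 UTC,view,1001,2001,electronics.smartphone,samsung,250.0,1,s1
2019-10-01 00:00:05 UTC,cart,1001,2001,,samsung,250.0,1,s1
2019-10-01 00:01:00 UTC,view,1002,2002,appliances.kitchen,,35.5,2,s2
"""


def load_and_clean(source, nrows=500_000):
    """Read at most `nrows` rows, fill missing labels, and parse timestamps."""
    df = pd.read_csv(source, nrows=nrows)
    # Placeholder labels as described in the README.
    df["category_code"] = df["category_code"].fillna("no_cate")
    df["brand"] = df["brand"].fillna("no_bran")
    # The raw strings end in " UTC", so parse them as timezone-aware values.
    df["event_time"] = pd.to_datetime(df["event_time"], utc=True)
    return df


df = load_and_clean(io.StringIO(SAMPLE_CSV))
print(df.dtypes)
```

With the real file, the call would presumably be `load_and_clean("2019-Oct.csv")`.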
- Diagnosed a data anomaly: the raw data contains more purchase events (9,758) than cart events (8,409). Investigation confirmed that some users purchase directly without going through the cart step, so the data is valid
- Computed view → cart → purchase conversion rates at each stage
- Visualized with a Plotly interactive funnel chart
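A small sketch of the stage-to-stage conversion calculation described above, using event counts. The `funnel_rates` helper and the toy event list are hypothetical; the real notebook may compute rates differently (for example per user rather than per event).

```python
from collections import Counter

STAGES = ["view", "cart", "purchase"]


def funnel_rates(event_types):
    """Count events per stage and compute conversion between adjacent stages."""
    counts = Counter(event_types)
    rates = {}
    for prev, nxt in zip(STAGES, STAGES[1:]):
        # Guard against an empty stage to avoid division by zero.
        rates[f"{prev}->{nxt}"] = counts[nxt] / counts[prev] if counts[prev] else float("nan")
    rates["view->purchase"] = counts["purchase"] / counts["view"] if counts["view"] else float("nan")
    return counts, rates


# Toy data in which purchases outnumber cart events, as in the README's diagnosis.
events = ["view"] * 20 + ["cart"] * 2 + ["purchase"] * 3
counts, rates = funnel_rates(events)
for step, rate in rates.items():
    print(f"{step}: {rate:.1%}")
```

A cart → purchase rate above 100% is the signature of direct purchases: it signals that the cart step is optional, not that the data is corrupt.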
- Built Recency / Frequency / Monetary (RFM) features for each purchasing user
- Used the Elbow Method and Silhouette Score to determine the optimal K=3
- Segmented users into 3 tiers:
  - Cluster 1, VIP Users: high purchase frequency and high spend; the core high-value users (79 users)
  - Cluster 0, Potential Users: have purchased but at low frequency; room to grow (976 users)
  - Cluster 2, Regular Users: low frequency and low spend (6,307 users)
- Visualized average Frequency and average Monetary by cluster using Plotly bar charts
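The RFM and clustering steps above might look like the sketch below. Everything here is illustrative: the `build_rfm` helper, the synthetic purchases, the snapshot date, and the choice to standardize features before K-Means are assumptions, not details confirmed by the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def build_rfm(purchases, snapshot):
    """Aggregate purchase events into one Recency/Frequency/Monetary row per user."""
    rfm = purchases.groupby("user_id").agg(
        last_purchase=("event_time", "max"),
        frequency=("event_time", "count"),
        monetary=("price", "sum"),
    )
    rfm["recency"] = (snapshot - rfm["last_purchase"]).dt.days
    return rfm[["recency", "frequency", "monetary"]]


# Synthetic purchase events (hypothetical data, seeded for reproducibility).
rng = np.random.default_rng(0)
n_events = 200
purchases = pd.DataFrame({
    "user_id": rng.integers(1, 41, size=n_events),
    "event_time": pd.Timestamp("2019-10-01")
    + pd.to_timedelta(rng.integers(0, 30, size=n_events), unit="D"),
    "price": rng.uniform(5, 500, size=n_events).round(2),
})

snapshot = purchases["event_time"].max() + pd.Timedelta(days=1)
rfm = build_rfm(purchases, snapshot)

# Assumption: scale features so monetary value does not dominate the distances.
X = StandardScaler().fit_transform(rfm)

# Silhouette score for several candidate K values (higher is better).
for k in range(2, 6):
    candidate = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, candidate), 3))

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
rfm["cluster"] = labels
print(rfm.groupby("cluster")[["frequency", "monetary"]].mean())
```

Note that K-Means cluster IDs are arbitrary, so names such as "VIP" have to be assigned after inspecting each cluster's averages.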
- Sorted events by `user_id` and `event_time` to reconstruct each user's behavior sequence
- Defined a custom `trim_paths()` function to remove consecutive duplicate events (e.g. `view→view→view→view` collapses to a single `view`), keeping only meaningful transitions
- Extracted the most frequent paths and visualized user behavior flows with a Plotly Sankey diagram
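One plausible way to write the `trim_paths()` step is with `itertools.groupby`, which groups consecutive equal items. The body below is a guess at the behavior described, not the notebook's implementation, and the `sequences` dictionary is made-up sample data.

```python
from collections import Counter
from itertools import groupby


def trim_paths(events):
    """Collapse runs of consecutive duplicate events into a single event."""
    return [event for event, _run in groupby(events)]


# Hypothetical per-user event sequences, already sorted by time.
sequences = {
    "u1": ["view", "view", "view", "cart", "view", "purchase", "purchase"],
    "u2": ["view", "view", "view", "view"],
    "u3": ["view", "cart", "cart", "purchase"],
    "u4": ["view", "view"],
}

trimmed = {user: trim_paths(seq) for user, seq in sequences.items()}

# Count how often each trimmed path occurs to find the most frequent ones.
path_counts = Counter(" → ".join(path) for path in trimmed.values())
for path, count in path_counts.most_common(3):
    print(count, path)
```

Only adjacent repeats are merged, so a later return to `view` after `cart` is kept as a separate step.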
⚠️ Retention analysis requires multi-period data to track whether users return over time. Because this project uses a single-month sample (October 2019), this module does not apply and was skipped.
Built user-level features from event data:
- `view_count`, `cart_count`, `avg_price`, `max_price`
- (Note: `total_event` was excluded to avoid data leakage)
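A sketch of how these user-level features could be derived with pandas. The `build_user_features` helper, the `purchased` target column, and the toy events are assumptions for illustration. A likely reason `total_event` leaks is that a total event count includes the purchase events that define the target.

```python
import pandas as pd


def build_user_features(events):
    """Aggregate raw events into one feature row per user plus a binary target."""
    counts = (
        events.groupby(["user_id", "event_type"]).size().unstack(fill_value=0)
        # Make sure all three event types exist as columns, even if absent in the data.
        .reindex(columns=["view", "cart", "purchase"], fill_value=0)
    )
    prices = events.groupby("user_id")["price"].agg(avg_price="mean", max_price="max")
    features = pd.DataFrame({
        "view_count": counts["view"],
        "cart_count": counts["cart"],
    }).join(prices)
    # Target: did the user purchase at least once? (hypothetical column name)
    features["purchased"] = (counts["purchase"] > 0).astype(int)
    return features


# Toy events (hypothetical data).
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "event_type": ["view", "cart", "purchase", "view", "view", "view"],
    "price": [10.0, 10.0, 10.0, 20.0, 40.0, 5.0],
})
features = build_user_features(events)
print(features)
```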
| Model | AUC |
|---|---|
| Logistic Regression | 0.7831 |
| Random Forest | 0.7567 |
| XGBoost | 0.8173 ✅ Best |
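For readers who want to reproduce this kind of comparison, the sketch below scores two of the listed model families by test-set AUC. It runs on synthetic data from `make_classification`, so its scores are unrelated to the table above. XGBoost is left out only to keep the dependencies to scikit-learn, and the split ratio and hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the four user-level features.
X, y = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

auc_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # AUC is computed from predicted probabilities of the positive class.
    auc_scores[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

for name, auc in auc_scores.items():
    print(f"{name}: AUC = {auc:.4f}")
```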
- Final model: XGBoost (AUC = 0.8173), deployed as an interactive Streamlit real-time prediction app on Hugging Face
- XGBoost feature importance analysis showed `cart_count` and `view_count` as the top contributors
- SHAP analysis further revealed:
  - `view_count`: widest overall impact range
  - `cart_count`: strongest positive driver of purchase
  - `max_price`: high price is negatively associated with purchase
  - `avg_price`: weakest influence
Built anomaly features per user:
- `total_events`, `total_view`, `total_cart`, `total_purchase`, `total_price`, `avg_price`, `unique_product`
- Applied Isolation Forest (`contamination=0.05`)
- Detected 4,457 anomalous users out of 89,124 total (anomaly rate: 5.00%, which matches the `contamination` setting by construction)
- Compared median behavior between anomalous and normal users via a radar chart
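The Isolation Forest step can be sketched as below, on synthetic per-user features. The data, the reduced feature set, and the `random_state` are assumptions; only `contamination=0.05` comes from the README.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic per-user features (hypothetical data, seeded for reproducibility).
rng = np.random.default_rng(42)
n_users = 1000
features = pd.DataFrame({
    "total_events": rng.poisson(6, n_users) + 1,
    "total_purchase": rng.poisson(0.2, n_users),
    "avg_price": rng.lognormal(4.0, 1.0, n_users),
    "unique_product": rng.poisson(3, n_users) + 1,
})
# A few exaggerated heavy-activity users, to give the model something unusual to find.
features.loc[:9, "total_events"] = 5000

model = IsolationForest(contamination=0.05, random_state=42)
labels = model.fit_predict(features)  # -1 marks anomalies, 1 marks normal users

features["is_anomaly"] = labels == -1
print(f"Anomaly rate: {features['is_anomaly'].mean():.2%}")

# Median profile of each group: the kind of numbers a radar chart would compare.
medians = features.groupby("is_anomaly").median()
print(medians)
```

Because `contamination` sets the score threshold, the flagged share lands near 5% whatever the data looks like; the median comparison is what carries the information.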
- Filtered purchase events and extracted `category_code` per user as transaction records
- Applied the Apriori algorithm (mlxtend) to find frequent itemsets
- Extracted association rules and visualized support vs. confidence vs. lift using a Plotly scatter chart
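To make the three rule metrics concrete, the sketch below computes support, confidence, and lift for item pairs by brute-force counting. It is deliberately not mlxtend's Apriori, which the project uses; the `pair_rules` helper, the threshold, and the toy baskets are assumptions for illustration.

```python
from itertools import combinations


def pair_rules(transactions, min_support=0.3):
    """Compute support, confidence, and lift for every frequent item pair."""
    n = len(transactions)
    baskets = [set(t) for t in transactions]
    items = sorted(set().union(*baskets))
    # Support of a single item: share of baskets that contain it.
    support = {item: sum(item in b for b in baskets) / n for item in items}

    rules = []
    for a, b in combinations(items, 2):
        pair_support = sum(a in bk and b in bk for bk in baskets) / n
        if pair_support < min_support:
            continue
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = pair_support / support[antecedent]
            rules.append({
                "antecedent": antecedent,
                "consequent": consequent,
                "support": pair_support,
                "confidence": confidence,
                "lift": confidence / support[consequent],
            })
    return rules


# Hypothetical per-user category baskets built from purchase events.
transactions = [
    ["electronics.smartphone", "electronics.audio.headphone"],
    ["electronics.smartphone", "electronics.audio.headphone"],
    ["electronics.smartphone"],
    ["appliances.kitchen.kettle"],
    ["appliances.kitchen.kettle", "electronics.smartphone"],
]
rules = pair_rules(transactions)
for rule in rules:
    print(rule)
```

A lift above 1 means the two categories appear together more often than independence would predict.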
- Assembled all visualizations into a single-page HTML dashboard, hosted via GitHub Pages
- Embedded the SHAP plot as a base64 image; all Plotly charts are rendered inline
- Built a Streamlit app for real-time purchase prediction and user profile queries, deployed on Hugging Face
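Embedding an image as base64 means placing the encoded bytes directly in an `<img>` tag's data URI, so the dashboard stays a single HTML file. A minimal sketch using only the standard library follows; the `png_to_img_tag` helper and the placeholder bytes are hypothetical, and a real dashboard would read the saved SHAP figure from disk.

```python
import base64
import html


def png_to_img_tag(png_bytes, alt="SHAP summary plot"):
    """Return an <img> tag with the PNG embedded as a base64 data URI."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f'<img src="data:image/png;base64,{encoded}" alt="{html.escape(alt)}">'


# Placeholder bytes standing in for a real PNG file (not a valid image).
fake_png = b"\x89PNG\r\n\x1a\n" + b"placeholder"
tag = png_to_img_tag(fake_png)
print(tag[:60] + "...")
```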
| Category | Tools |
|---|---|
| Language | Python 3 |
| Data Processing | Pandas, NumPy |
| Visualization | Plotly (Funnel, Sankey, Bar, Scatter), Matplotlib |
| Machine Learning | Scikit-learn (Logistic Regression, Random Forest, Isolation Forest), XGBoost |
| Explainability | SHAP |
| Association Rules | mlxtend (Apriori) |
| Deployment | Streamlit, Hugging Face Spaces, GitHub Pages |
| Environment | Kaggle Notebook |
- Funnel drop-off is severe: overall view-to-purchase conversion is very low, and most users disengage before reaching the cart stage
- Adding to cart is the strongest purchase signal: SHAP analysis shows `cart_count` has the largest positive effect on purchase probability
- High prices suppress purchases: higher `max_price` is negatively associated with purchase probability
- 5% of users flagged as anomalous: Isolation Forest flagged users with unusual behavior profiles, suspected bots or fake-traffic accounts. The 5% share follows from `contamination=0.05` and is not an estimated rate
- VIP users are far more valuable than other groups: Cluster 1 (VIP, 79 users) shows much higher purchase frequency and monetary value than Potential and Regular users
All analysis is in a single Kaggle Notebook:
https://www.kaggle.com/code/yuqingyang5201314/analysis-of-ecommerce-behavior/notebook
Yuqing Yang (杨雨青)
GitHub: @Tracy-yyq
Built as a portfolio project targeting data analyst / data science roles in Shenzhen.