Tracy-yyq/ecommerce-analytics

E-commerce Analytics

End-to-end user behavior analysis on a real-world e-commerce dataset

➡️ Live Dashboard (interactive)
➡️ Real-time Prediction App (purchase prediction)


1. Project Overview

This project performs end-to-end analysis on a real-world e-commerce user behavior dataset (October 2019, limited to 500,000 rows). It is organized into 9 modules: data cleaning, funnel analysis, RFM clustering, behavior path analysis, retention analysis (skipped, see Part 5), purchase prediction (ML), anomaly detection, association rule mining, and an interactive dashboard.


2. Dataset

Source: eCommerce behavior data from multi category store (Kaggle)
Author: Michael Kechinov
File used: 2019-Oct.csv (first 500,000 rows × 9 columns, capped to control memory usage)
Event types: view → cart → purchase


3. Analysis Modules

Part 1 · Data Cleaning

  • Read 2019-Oct.csv with nrows=500000 to control memory usage (this takes the first 500,000 rows rather than a random sample)
  • Filled missing values: category_code → 'no_cate', brand → 'no_bran'
  • Converted event_time to datetime format
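
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names `load_events` and `clean_events` are hypothetical, the in-memory CSV is made-up data with only a subset of the 9 columns, and the `UTC` timestamp suffix is an assumption about the file's format.

```python
import io

import pandas as pd


def load_events(source, nrows=500_000):
    """Read at most `nrows` rows from the top of the file (not a random sample)."""
    return pd.read_csv(source, nrows=nrows)


def clean_events(df):
    """Fill missing category/brand values and parse event_time."""
    out = df.copy()
    out["category_code"] = out["category_code"].fillna("no_cate")
    out["brand"] = out["brand"].fillna("no_bran")
    # Assumption: timestamps look like "2019-10-01 00:00:00 UTC".
    out["event_time"] = pd.to_datetime(out["event_time"], utc=True)
    return out


# Tiny stand-in for 2019-Oct.csv (made-up rows, illustration only).
sample_csv = io.StringIO(
    "event_time,event_type,product_id,category_code,brand,price,user_id\n"
    "2019-10-01 00:00:00 UTC,view,1,electronics.smartphone,apple,900.0,10\n"
    "2019-10-01 00:00:05 UTC,cart,2,,,35.5,11\n"
)
events = clean_events(load_events(sample_csv))
```

With the real file, `load_events("2019-Oct.csv")` would replace the in-memory sample.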

Part 2 · Funnel Analysis

  • Diagnosed an apparent anomaly in the raw data: purchase count (9,758) exceeds cart count (8,409). The data turned out to be valid, because some users purchase directly without going through the cart step
  • Computed view → cart → purchase conversion rates at each stage
  • Visualized with a Plotly interactive funnel chart
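
One way to compute the stage counts and check the "direct purchase" explanation is sketched below. The function names `funnel_table` and `direct_purchasers` and the toy events are hypothetical; the sketch counts events rather than distinct users, which is an assumption about how the notebook defines the funnel.

```python
import pandas as pd


def funnel_table(events, stages=("view", "cart", "purchase")):
    """Event counts per stage plus step-to-step and overall conversion rates."""
    stages = list(stages)
    counts = events["event_type"].value_counts().reindex(stages, fill_value=0)
    table = pd.DataFrame({"events": counts})
    # Rate relative to the previous stage; it can exceed 1 when users skip a stage.
    table["step_rate"] = table["events"] / table["events"].shift(1)
    # Rate relative to the first stage (view).
    table["overall_rate"] = table["events"] / table["events"].iloc[0]
    return table


def direct_purchasers(events):
    """Users with at least one purchase event but no cart event."""
    buyers = set(events.loc[events["event_type"] == "purchase", "user_id"])
    carters = set(events.loc[events["event_type"] == "cart", "user_id"])
    return buyers - carters


# Made-up events: user 1 buys without carting, user 2 uses the cart, user 3 only views.
events = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 2, 3],
        "event_type": ["view", "purchase", "view", "cart", "purchase", "view"],
    }
)
table = funnel_table(events)
direct = direct_purchasers(events)
```

In this toy data the purchase count exceeds the cart count for the same reason described above: user 1 appears in `direct`.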

Part 3 · RFM + KMeans Clustering

  • Built Recency / Frequency / Monetary (RFM) features for each purchasing user
  • Used the Elbow Method and Silhouette Score to choose K=3
  • Segmented users into 3 tiers:
    • Cluster 1, VIP Users: frequent purchases and high spend; the core high-value users (79 users)
    • Cluster 0, Potential Users: have purchased, but at a lower frequency, with room to grow (976 users)
    • Cluster 2, Regular Users: low frequency and low spend (6,307 users)
  • Visualized average Frequency and average Monetary by cluster using Plotly bar charts
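
A minimal sketch of building RFM features and scoring candidate values of K follows. The helpers `build_rfm` and `silhouette_by_k`, the snapshot date, the standardization step, and the synthetic purchases are all assumptions made for illustration; the source does not say how the features were scaled.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def build_rfm(purchases, snapshot):
    """One row per purchasing user: recency (days), frequency, monetary."""
    rfm = purchases.groupby("user_id").agg(
        last_purchase=("event_time", "max"),
        frequency=("event_time", "count"),
        monetary=("price", "sum"),
    )
    rfm["recency"] = (snapshot - rfm["last_purchase"]).dt.days
    return rfm[["recency", "frequency", "monetary"]]


def silhouette_by_k(rfm, k_values=(2, 3, 4), seed=42):
    """Silhouette score for each candidate K on standardized RFM features."""
    X = StandardScaler().fit_transform(rfm)  # assumption: features are standardized
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


# Synthetic purchases (illustration only): most users buy once, a few buy often.
rng = np.random.default_rng(0)
rows = []
for user_id in range(60):
    n_purchases = 1 if user_id < 40 else (4 if user_id < 55 else 12)
    for _ in range(n_purchases):
        day = int(rng.integers(0, 30))
        rows.append(
            {
                "user_id": user_id,
                "event_time": pd.Timestamp("2019-10-01") + pd.Timedelta(days=day),
                "price": float(rng.uniform(10, 500)),
            }
        )
purchases = pd.DataFrame(rows)
rfm = build_rfm(purchases, snapshot=pd.Timestamp("2019-11-01"))
scores = silhouette_by_k(rfm)
```

The K with the highest silhouette score is a reasonable candidate, ideally cross-checked against the elbow in the inertia curve as the project does.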

Part 4 · User Behavior Path Analysis

  • Sorted events by user_id and event_time to reconstruct each user's behavior sequence
  • Defined a custom trim_paths() function that collapses consecutive duplicate events (e.g. view→view→view becomes view), keeping only meaningful transitions
  • Extracted the most frequent paths and visualized them with a Plotly Sankey diagram
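
The path reconstruction could look roughly like this. The name `trim_paths` comes from the project, but this body is only a guess at its behavior based on the description above; `top_paths` and the toy events are hypothetical.

```python
from collections import Counter
from itertools import groupby

import pandas as pd


def trim_paths(event_sequence):
    """Collapse runs of identical events: view, view, cart -> view, cart."""
    return [event for event, _run in groupby(event_sequence)]


def top_paths(events, n=5):
    """Most common trimmed paths across users, as (path, user_count) pairs."""
    ordered = events.sort_values(["user_id", "event_time"])
    paths = ordered.groupby("user_id")["event_type"].apply(
        lambda s: "→".join(trim_paths(s))
    )
    return Counter(paths).most_common(n)


# Made-up events for four users (illustration only).
events = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
        "event_time": pd.date_range("2019-10-01", periods=11, freq="min"),
        "event_type": [
            "view", "view", "cart", "purchase",
            "view", "view", "view",
            "view", "cart", "view",
            "view",
        ],
    }
)
most_common = top_paths(events, n=1)
```

Only consecutive repeats are merged, so a user who views, carts, and views again keeps both view steps.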

Part 5 · User Retention Analysis

⚠️ Retention analysis requires multi-period data to track whether users return over time. Since this project uses a single month of data (October 2019), this module was skipped.


Part 6 · Purchase Prediction (ML Classification)

Built user-level features from event data:

  • view_count, cart_count, avg_price, max_price
  • (Note: total_event was excluded to avoid data leakage, since a total event count would include the purchase events that define the label)

| Model | AUC |
| --- | --- |
| Logistic Regression | 0.7831 |
| Random Forest | 0.7567 |
| XGBoost | 0.8173 ✅ Best |

  • Final model: XGBoost (AUC = 0.8173), deployed as an interactive Streamlit app on Hugging Face
  • XGBoost feature importance analysis showed cart_count and view_count as the top contributors
  • SHAP analysis further revealed:
    • view_count: widest overall impact range
    • cart_count: strongest positive driver of purchase
    • max_price: high price negatively associated with purchase
    • avg_price: weakest influence
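
The user-level feature table for this module might be assembled along these lines. `build_user_features`, the `purchased` label column, and the toy events are hypothetical, and computing the price statistics over all of a user's events is an assumption; the source does not specify which events feed avg_price and max_price.

```python
import pandas as pd


def build_user_features(events):
    """Per-user features plus a binary label; no total event count (leakage)."""
    counts = pd.crosstab(events["user_id"], events["event_type"]).reindex(
        columns=["view", "cart", "purchase"], fill_value=0
    )
    # Assumption: price statistics use every event of the user.
    features = events.groupby("user_id").agg(
        avg_price=("price", "mean"),
        max_price=("price", "max"),
    )
    features["view_count"] = counts["view"]
    features["cart_count"] = counts["cart"]
    # Label: did the user purchase at least once?
    features["purchased"] = (counts["purchase"] > 0).astype(int)
    return features[["view_count", "cart_count", "avg_price", "max_price", "purchased"]]


# Made-up events (illustration only).
events = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 1, 2],
        "event_type": ["view", "view", "cart", "purchase", "view"],
        "price": [100.0, 200.0, 200.0, 200.0, 50.0],
    }
)
features = build_user_features(events)
X = features.drop(columns="purchased")  # model inputs
y = features["purchased"]  # target
```

`X` and `y` can then be passed to any of the classifiers compared in the table, with a held-out split for the AUC estimate.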

Part 7 · Anomaly Detection

Built anomaly features per user:

  • total_events, total_view, total_cart, total_purchase, total_price, avg_price, unique_product

  • Applied Isolation Forest (contamination=0.05)

  • Flagged 4,457 of 89,124 users as anomalous (5.00%, the share fixed by the contamination setting)

  • Compared median behavior between anomalous and normal users via radar chart
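
A small sketch of the detection step, using scikit-learn's IsolationForest on synthetic user features. The helper `flag_anomalies`, the random seed, and the generated data are assumptions for illustration; only three of the seven features listed above are simulated.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def flag_anomalies(user_features, contamination=0.05, seed=42):
    """Boolean Series marking the users Isolation Forest labels as anomalous."""
    model = IsolationForest(contamination=contamination, random_state=seed)
    labels = model.fit_predict(user_features)  # -1 = anomaly, 1 = normal
    return pd.Series(labels == -1, index=user_features.index, name="is_anomaly")


# Synthetic per-user features (illustration only).
rng = np.random.default_rng(0)
n_users = 1000
user_features = pd.DataFrame(
    {
        "total_events": rng.poisson(5, n_users),
        "total_price": rng.gamma(2.0, 100.0, n_users),
        "unique_product": rng.poisson(3, n_users),
    }
)
is_anomaly = flag_anomalies(user_features)
# Median behavior of normal (False) vs anomalous (True) users, e.g. for a radar chart.
medians = user_features.groupby(is_anomaly).median()
```

Because `contamination` sets the score threshold, the flagged share lands at roughly the requested 5% regardless of the data; the interesting output is how the two groups' medians differ.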


Part 8 · Association Rules

  • Filtered purchase events and collected the category_code values purchased by each user as one transaction record
  • Applied the Apriori algorithm (mlxtend) to find frequent itemsets
  • Extracted association rules and visualized support vs. confidence vs. lift using a Plotly scatter chart
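
To make the three rule metrics concrete, the sketch below computes support, confidence, and lift for item pairs in plain Python. It does not call mlxtend and is not the Apriori algorithm itself; it only enumerates pairs. The function `pair_rules`, the `min_support` threshold, and the transactions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.2):
    """Rules A -> B over item pairs, with support, confidence, and lift."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in set(t))
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(set(t)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of transactions containing both items
        if support < min_support:
            continue
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = count / item_counts[antecedent]  # P(consequent | antecedent)
            lift = confidence / (item_counts[consequent] / n)  # vs. baseline P(consequent)
            rules.append(
                {
                    "antecedent": antecedent,
                    "consequent": consequent,
                    "support": support,
                    "confidence": confidence,
                    "lift": lift,
                }
            )
    return rules


# Made-up per-user category baskets (illustration only).
transactions = [
    {"electronics.smartphone", "electronics.audio.headphone"},
    {"electronics.smartphone", "electronics.audio.headphone"},
    {"electronics.smartphone"},
    {"appliances.kitchen.kettle"},
]
rules = pair_rules(transactions)
```

A lift above 1 means the two categories are bought together more often than their individual frequencies would suggest.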

Part 9 · Interactive Dashboard + Deployed App

  • Assembled all visualizations into a single-page HTML dashboard, hosted via GitHub Pages
  • Embedded the SHAP plot as a base64 image; all Plotly charts are rendered inline
  • Built a Streamlit app for real-time purchase prediction and user profile lookup, deployed on Hugging Face
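
The single-file dashboard idea can be sketched with the standard library alone. `image_to_data_uri` and `build_dashboard` are hypothetical helpers, the page layout is an assumption, and the bytes below are a placeholder rather than a real image.

```python
import base64


def image_to_data_uri(png_bytes):
    """Encode PNG bytes as a data URI so the image travels inside the HTML file."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"


def build_dashboard(sections, shap_png):
    """Join (title, html_fragment) pairs and one embedded image into a page.

    Fragments could come from Plotly, e.g. fig.to_html(full_html=False).
    """
    parts = [
        "<!DOCTYPE html>",
        "<html><head><meta charset='utf-8'><title>Dashboard</title></head><body>",
    ]
    for title, fragment in sections:
        parts.append(f"<h2>{title}</h2>{fragment}")
    uri = image_to_data_uri(shap_png)
    parts.append(f"<h2>SHAP summary</h2><img src='{uri}' alt='SHAP summary plot'>")
    parts.append("</body></html>")
    return "\n".join(parts)


# Placeholder bytes standing in for a saved SHAP plot (not a valid image).
fake_png = b"\x89PNG\r\n\x1a\n placeholder"
html = build_dashboard([("Funnel", "<div>funnel chart here</div>")], fake_png)
```

Writing `html` to an `index.html` file yields a page with no external image dependencies, which suits static hosting such as GitHub Pages.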

4. Tech Stack

| Category | Tools |
| --- | --- |
| Language | Python 3 |
| Data Processing | Pandas, NumPy |
| Visualization | Plotly (Funnel, Sankey, Bar, Scatter), Matplotlib |
| Machine Learning | Scikit-learn (Logistic Regression, Random Forest, Isolation Forest), XGBoost |
| Explainability | SHAP |
| Association Rules | mlxtend (Apriori) |
| Deployment | Streamlit, Hugging Face Spaces, GitHub Pages |
| Environment | Kaggle Notebook |

5. Key Business Insights

  1. Funnel drop-off is severe: the overall view-to-purchase conversion rate is very low, and most users disengage before reaching the cart stage

  2. Adding to cart is the strongest purchase signal: SHAP analysis shows cart_count has the most pronounced positive effect on purchase probability

  3. High prices suppress purchases: a higher max_price is associated with a lower purchase probability

  4. 5% of users flagged as anomalous: Isolation Forest marked them as behaviorally unusual (the 5% share is set by contamination=0.05); they are suspected bots or fake-traffic accounts

  5. VIP users are far more valuable than other groups: Cluster 1 (VIP, 79 users) shows markedly higher purchase frequency and monetary value than Potential and Regular users


6. Notebook

All analysis is in a single Kaggle Notebook:
https://www.kaggle.com/code/yuqingyang5201314/analysis-of-ecommerce-behavior/notebook


7. Author

Yuqing Yang (杨雨青)

GitHub: @Tracy-yyq


Built as a portfolio project targeting data analyst / data science roles in Shenzhen.
