GitHub - RameshSTA/clv-long-term-optimization: Production-grade CLV forecasting, churn risk modeling & budget-constrained retention targeting. RF AUC=0.831 · Spearman ρ=0.57 · ROI 10×–26× (Monte Carlo 90% CI) · SHAP · MLflow · PuLP/CBC

Customer Lifetime Value — Long-Term Optimization

A production-grade decision intelligence system — from raw retail transactions to budget-constrained,
fully explainable retention targeting with quantified ROI.

At a Glance

Metric	Value	Context
CLV Spearman rank correlation (ρ)	0.57	Predicted vs. actual holdout revenue, all 4,933 customers
Top-decile lift	6.3×	£5,339 avg holdout revenue vs. £852 population mean
Top-to-bottom decile ratio	22×	£5,339 (D10) vs. £241 (D1) — no rank inversions
Churn model holdout AUC	0.831	Random Forest, out-of-time holdout evaluation
CV average precision	0.883 ± 0.007	5-fold stratified cross-validation
Best optimization ROI	44.5×	£200 budget, 100 customers targeted
Monte Carlo lower bound	positive in all 1,000 runs	90% CI at £2,000: 10×–26× ROI
Revenue concentration (Gini)	0.726	Top 10% of customers → 62% of £12.1M total revenue

Documentation

	Description
	Algorithm selection, SHAP results, calibration, monitoring thresholds
	Problem framing, stakeholder context, success metrics
	System design, data flow, module responsibilities
	Feature design, leakage prevention, cutoff-safe computation
	Holdout methodology, metric rationale, bootstrap CI design
	BG/NBD + Gamma-Gamma derivations, churn label design
	Full ROI narrative, segment strategy, decision framework
	Assumptions, risks, sensitivity bounds
	Production readiness, monitoring plan, retraining triggers
	7-rule cleaning specification, edge case handling

The Business Problem

In retail and e-commerce, every customer team faces the same constraint: limited retention budget, unlimited customers to target. Without a rigorous system, teams default to three failing strategies:

Strategy	What Goes Wrong
Blanket campaigns	Same message to all customers — no differentiation, wasted spend
Heuristic targeting	"Target our biggest spenders" — ignores churn risk; budget spent on customers who would have stayed
Static RFM buckets	Segment labels without economic value attached — no way to prioritise within segments

The result: Budget is spent on the wrong customers, retention ROI is unmeasured, and the business loses high-value customers it could have saved.

What this data reveals

This system was built on the UCI Online Retail II dataset — 1M+ real transactions from a UK-based online retailer (2009–2011). Three facts from this data define the problem:

Revenue is dangerously concentrated. Top 10% of customers generate 62% of total revenue. Gini coefficient = 0.726 — approaching income-inequality levels.

High-value customers churn at an alarming rate. The "At Risk" segment — previously frequent buyers — carries 64% churn probability and represents £1.25M of threatened revenue (10% of total).

Churn prediction alone is not enough. Predicting who might churn does not tell you who to spend your budget on. That requires combining churn probability, expected future value, and cost — simultaneously, under a hard constraint.

The Solution

This project builds an 11-step, config-driven decision intelligence pipeline that answers a single business question:

Given a fixed retention budget, which customers should be targeted to maximise long-term business value — and how confident are you in that ROI?

Layer	Approach	Output
Probabilistic CLV forecasting	BG/NBD + Gamma-Gamma (lifetimes)	Expected future value per customer (£)
Evidence-based churn modeling	4-algorithm CV comparison + SHAP	Calibrated churn probability + feature explanations
Constrained budget optimization	0/1 Knapsack (integer programming)	Optimal targeting list under spend constraint
Business intelligence	RFM segmentation, cohort analysis, Pareto	Revenue concentration, decay curves, segment strategy
Uncertainty quantification	Monte Carlo simulation (1,000 draws)	ROI confidence intervals under assumption uncertainty

Key Results

CLV Model — Holdout Validation

Metric	Value	Interpretation
Spearman rank correlation (ρ)	0.57	Strong alignment between predicted and actual future revenue
Top decile avg holdout revenue	£5,339	vs. £852 population mean — 6.3× lift
Top-to-bottom decile ratio	22×	£5,339 (decile 10) vs. £241 (decile 1)
Monotonic lift	Yes — all 10 deciles	No rank inversions; consistent model quality across the full range

Churn Model — 4-Algorithm Comparison

Model	CV ROC AUC	CV Avg Precision	Status
Random Forest	0.816 ± 0.013	0.883 ± 0.007	Selected
Logistic Regression	0.810 ± 0.015	0.882 ± 0.009	Baseline
XGBoost	0.807 ± 0.014	0.878 ± 0.008	Evaluated
LightGBM	0.792 ± 0.014	0.867 ± 0.008	Evaluated

Holdout ROC AUC = 0.831 | Holdout Avg Precision = 0.878 | Churn base rate = 62.7%

Budget Optimization ROI

Budget	Customers Targeted	ROI	Net Gain
£200	100	44.5×	£8,902
£2,462	1,231	15.4×	£37,809
£4,725	2,362	10.8×	£50,822
£6,422	3,211	8.5×	£54,503

Monte Carlo 90% CI at £2,000 budget: ROI range 10×–26× across 1,000 simulations. ROI is positive in every single simulation.

End-to-End Architecture

Raw Transactions (~1M rows, UCI Online Retail II)
             │
             ▼
 [1] Ingestion ─────────► transactions_raw.parquet
     Schema validation
     + Parquet serialisation
             │
             ▼
 [2] Cleaning ──────────► transactions_clean.parquet
     7 deterministic rules
             │
             ▼
 [3] Feature Engineering ► customer_features.parquet
     Cutoff-safe · No leakage
             │
      ┌──────┴───────┐
      ▼              ▼
[4] CLV Modeling  [5] Churn Risk Modeling
    BG/NBD+GG         4-model CV comparison
    ρ = 0.57          RF wins (AUC = 0.816)
    22× lift          SHAP + calibration
      │              │
      └──────┬───────┘
             ▼
  [6] Budget Optimization
      0/1 Knapsack · PuLP/CBC solver
      maximize Σ xᵢ · net_gainᵢ
      subject to Σ xᵢ · costᵢ ≤ B, xᵢ ∈ {0,1}
             │
             ▼
  [7] Evaluation + Reporting
      Decile lift · Bootstrap CIs · ROI curve · MLflow
             │
      ┌──────┼──────────┐
      ▼      ▼          ▼
[8] RFM    [9] Cohort  [10] Business
Segments   Analysis     Insights
8 segments 25 cohorts   Gini = 0.726
             │
             ▼
  [11] Monte Carlo Sensitivity
       1,000 simulations · 90% CI on ROI

Step-by-Step: Methods and Findings

Steps 1–3: Data Pipeline

Ingestion reads the UCI Online Retail II dataset (two Excel sheets, ~1M rows), validates column schema, coerces types, and writes Parquet. No data manipulation at this stage — only parsing and serialisation.

Cleaning applies 7 deterministic, documented rules in a fixed order. Each rule is tracked separately to measure its impact on row count:

Rule	Rows Removed	Rationale
Remove cancellation invoices (prefix `C`)	~8,905	Cancellations reverse prior revenue; not purchase events
Remove non-positive unit price	~33	Price ≤ 0 indicates internal adjustments
Remove non-positive quantity	~10,624	Negative quantities without `C` prefix are data errors
Remove missing customer IDs	~135,080	Cannot assign revenue to customer without ID
Remove invalid timestamps	~0	Malformed date strings
Deduplicate exact rows	~5,268	Exact duplicates indicate ingestion errors
Compute `revenue = quantity × unit_price`	—	Derived field; applied after cleaning

Feature engineering computes 8 per-customer features, all strictly computed before the cutoff date. A leakage check runs explicitly before writing the output file.

Feature	Description	Primary Signal For
`recency_days`	Days since last purchase at cutoff	Churn (top SHAP driver)
`tenure_days`	Days since first purchase	Customer maturity
`n_invoices`	Distinct purchase events	Frequency (F in RFM)
`total_revenue`	Cumulative spend	Monetary value (M in RFM)
`avg_order_value`	Revenue ÷ invoices	Spend-per-visit pattern
`revenue_last_30d`	Trailing 30-day revenue	Short-term engagement
`revenue_last_90d`	Trailing 90-day revenue	Medium-term trend
`rev_30_to_90_ratio`	Recent ÷ medium-term revenue	Momentum / acceleration

Step 4: CLV Modeling

Why probabilistic models? Unlike regression, BG/NBD explicitly models two simultaneous processes: when a customer buys (Poisson purchase frequency) and whether they are still active (Geometric dropout probability). This produces interpretable, theoretically grounded estimates rather than black-box regression residuals.

BG/NBD model:

Purchase rate: λᵢ ~ Gamma(r, α) — heterogeneous across customers
Dropout probability: pᵢ ~ Beta(a, b) — customer may leave after any purchase
Outputs: E[N_i(H)] (expected future purchases) and P(alive)

Gamma-Gamma model (monetary component):

Transaction value: Mᵢₖ | νᵢ ~ Gamma(p, νᵢ)
Customer heterogeneity: νᵢ ~ Gamma(q, γ)
Outputs: E[μᵢ] (expected average transaction value per customer)

CLV formula:

CLV_i(H) = E[N_i(H)] × E[μ_i] × discount_factor

discount_factor = (1 + daily_rate)^(−H)
daily_rate      = (1 + r_annual)^(1/365) − 1

Time-safe evaluation: Calibration window (before cutoff) trains models. Holdout revenue is measured strictly after cutoff — never seen during training.

Result: CLV Decile Lift

Customers sorted by predicted CLV into 10 equal deciles. Bars show average observed holdout revenue per decile. Error bands are 95% bootstrap confidence intervals (500 resamples, seed=42).

Decile	Customers	Avg Predicted CLV	Avg Holdout Revenue	Lift vs Mean
1 (Lowest)	494	−£1,972	£241	0.28×
2	493	−£924	£159	0.19×
3	493	−£598	£66	0.08×
4	493	−£82	£78	0.09×
5	494	£149	£170	0.20×
6	493	£242	£305	0.36×
7	493	£371	£355	0.42×
8	493	£575	£659	0.77×
9	493	£959	£1,143	1.34×
10 (Highest)	494	£3,553	£5,339	6.26×

Spearman ρ = 0.57 across all 4,933 customers. Holdout revenue is monotonically increasing across all 10 deciles — no rank inversions. The model reliably separates high-value from low-value customers.

The top decile generates 22× more observed revenue than the bottom decile. This is the decision-grade signal needed for budget allocation.

Step 5: Churn Risk Modeling

Selection philosophy: Choosing an algorithm based on assumption ("logistic regression is interpretable") rather than evidence is a methodological error. All viable candidates are trained, compared via cross-validation, and the winner is selected automatically.

5a. Multi-Model Comparison

4 algorithms compared using 5-fold stratified cross-validation (shuffled, random_state=42). Bars show mean CV ROC AUC; error bars show ± 1 standard deviation across folds. The winner (Random Forest) is highlighted.

Model	CV ROC AUC	CV Avg Precision	Pipeline Preprocessing
Random Forest	0.816 ± 0.013	0.883 ± 0.007	Impute (median) only
Logistic Regression	0.810 ± 0.015	0.882 ± 0.009	Impute → StandardScaler
XGBoost	0.807 ± 0.014	0.878 ± 0.008	Impute (median) only
LightGBM	0.792 ± 0.014	0.867 ± 0.008	Impute (median) only

Random Forest wins with the highest mean CV AUC and lowest variance — evidence of both accuracy and stability.

5b. ROC and Precision-Recall Curves

Left: ROC curve — AUC = 0.831 on out-of-time holdout. Right: Precision-Recall curve — AP = 0.878. PR curves are especially informative at 62.7% base rate; high AP confirms the model ranks true churners near the top of its scored list.

The holdout AUC (0.831) exceeds the CV mean (0.816) — confirming that the model generalises well and has not overfit to the training distribution.

5c. Probability Calibration

Left: Reliability diagram — each point compares predicted probability bin (x-axis) to observed churn rate (y-axis). A perfectly calibrated model falls on the diagonal. Right: Score distribution by class — churners and non-churners clearly separated with minimal overlap.

Calibration matters because churn probability is used directly in business calculations. If churn_prob_i does not correspond to real-world churn rates, the ROI calculation produces misleading estimates:

Expected Benefit_i = CLV_i × churn_prob_i × retention_effectiveness

5d. SHAP Feature Importance

Left (bar chart): Mean absolute SHAP value per feature — overall importance ranking. Right (beeswarm): Each dot is one customer. Position on x-axis = SHAP value (positive → pushes prediction toward churn). Colour = feature value (red = high, blue = low).

Reading the beeswarm:

recency_days — top churn driver: customers with high recency (long time since last purchase) have large positive SHAP values → high predicted churn probability. Dominant signal.
revenue_last_90d — strong churn protector: customers with high medium-term revenue push SHAP negative → lower predicted churn. Active spenders are unlikely to churn.
n_invoices — frequency protects: high invoice count pulls SHAP negative → frequent buyers are retained.
rev_30_to_90_ratio — momentum signal: recent acceleration in spending reduces churn risk.

SHAP values are computed via TreeExplainer — exact (not approximate) for tree-based models. Feature directions are data-driven, not assumed.

Step 6: Budget Optimization

Economic proxy:

Expected Benefit_i = CLV_i × churn_prob_i × retention_effectiveness
Net Gain_i         = Expected Benefit_i − cost_i

Optimization problem (0/1 Knapsack):

maximize:   Σ xᵢ · net_gainᵢ
subject to: Σ xᵢ · costᵢ ≤ B
            xᵢ ∈ {0, 1}   ∀i

Solved exactly via PuLP (CBC integer solver) — optimal, not approximate. Greedy fallback if PuLP is unavailable.

Eligibility criteria:

CLV_i > min_clv (configurable threshold)
churn_prob_i > 0
net_gain_i > 0 (only target customers where expected benefit exceeds cost)

This is decision intelligence: it does not simply rank customers — it allocates a scarce resource to maximise expected economic value under a hard budget constraint.

Step 7: Backtesting & Evaluation

ROI vs Budget Curve

Dual-axis chart. Left axis (bars): total net gain (£) at each budget level. Right axis (line): ROI multiplier. As budget grows, more customers are targeted but marginal ROI decreases — classic diminishing returns. The optimal range depends on the organisation's budget envelope.

Number of customers selected by the knapsack solver at each budget level. Growth rate reflects the population of customers with positive net gain at each spend level.

Step 8: RFM Customer Segmentation

Heatmap: Each cell shows average CLV (left) and average churn probability (right) for each of the 8 named segments. Segment size (n) is annotated. Read together, CLV and churn probability determine the economic priority of each segment for retention investment.

Methodology: Each customer is scored on three dimensions using quartile ranking (1–4 scale): R (Recency) — days since last purchase, reversed; F (Frequency) — distinct invoices; M (Monetary) — total historical revenue. Combined into 8 named segments via a priority rule matrix based on RFM marketing literature (Kumar & Reinartz, 2012).

Segment	R	F	Customers	Avg CLV	Churn Risk	Revenue Share
Champions	4	4	681 (13.8%)	£2,436	13%	55.8%
Loyal Customers	≥3	≥3	1,049 (21.3%)	£761	36%	22.3%
Potential Loyalists	≥3	≤2	407 (8.3%)	−£145	49%	2.1%
New Customers	4	1	99 (2.0%)	−£2,992	49%	0.3%
At Risk	≤2	≥3	694 (14.1%)	£309	64%	10.3%
Cant Lose Them	1	4	42 (0.9%)	£302	62%	1.5%
Hibernating	≤2	≤2	952 (19.3%)	−£803	69%	4.0%
Lost	1	≤2	1,009 (20.5%)	−£438	86%	3.7%

Scatter plot: Each dot is a customer. X-axis = churn probability; Y-axis = predicted CLV. Colour = segment. The ideal retention targets occupy the upper-right quadrant: high CLV + high churn risk. Champions (upper-left) are safe; Lost customers (lower-right) have low CLV — low priority for expensive interventions.

Business implications:

Champions (13.8% of customers) drive 55.8% of revenue at only 13% churn risk. Protect but do not over-invest — they are not at risk.
At Risk segment (14.1%) carries 64% churn probability and represents £1.25M threatened revenue. Highest-value retention target.
Lost (20.5% of customers) have 86% churn and negative CLV. Reacquisition cost likely exceeds expected value — deprioritise.

git

Step 9: Cohort Retention Analysis

Retention heatmap: Rows = acquisition cohort (month of first purchase). Columns = months since acquisition (0 = acquisition month). Cell value = % of cohort still active. Darker = higher retention. The rapid colour fade from left to right reveals the natural churn decay curve.

How to read this: Month 0 (acquisition) is always 100% by definition. By Month 1, most cohorts lose 60–80% of customers. By Month 3, only the core loyal base remains. This decay pattern is precisely what CLV-based retention targeting is designed to slow.

Retention decay curves — one line per acquisition cohort, coloured by cohort month. Lines that stay elevated longer indicate higher-quality cohort acquisition. Cohorts acquired in peak trading periods (November–December) tend to have faster initial decay — likely driven by one-time seasonal buyers.

Revenue by acquisition cohort. Bars show total revenue per cohort (left axis). The line shows average revenue per customer (right axis). Early cohorts (2009–2010) generate higher total revenue — they had more time to purchase, and surviving members are likely the most engaged.

Key insight: Cohort analysis reveals that retention decay is steep and early — most churn happens in the first 1–2 months. The BG/NBD model captures this dropout process parametrically and uses it to project forward.

Step 10: Business Intelligence & Pareto

Left (Lorenz curve): Cumulative revenue share (y-axis) vs. cumulative customer share (x-axis), sorted by revenue. The dashed diagonal = perfect equality. The further below the diagonal, the more unequal the distribution. The red crosshairs mark the 80% revenue threshold. Right (bar chart): Revenue share by customer percentile group.

Metric	Value
Gini coefficient	0.726
Customers generating 80% of revenue	~0.3% (inverted Pareto — extreme concentration)
Top 10% customers → revenue share	62% of £12.1M
Revenue at risk (At Risk + Cant Lose)	~12% of total

A Gini of 0.73 approaches income-inequality levels. This means treating all customers equally is structurally wasteful — the vast majority of marketing budget applied to the bottom 80% reaches customers generating only 38% of revenue. CLV-based targeting is not an optimisation; it is a necessity.

Monthly revenue (bars, left axis) and 3-month moving average (line) from December 2009 to December 2011. Right axis: monthly active customer count. Strong November–December seasonal spikes correspond to holiday trading.

Distribution of per-customer lifetime revenue — linear scale (left) and log scale (right). The linear plot shows extreme right-skew. The log plot reveals an approximately log-normal distribution — the statistical basis for why Gamma-Gamma monetary modeling is appropriate and why simple averages are misleading.

Step 11: Monte Carlo Sensitivity Analysis

Why this matters: The budget optimization model uses two assumed parameters: retention_effectiveness (η) and unit_cost. Both are operationally assumed — not measured from A/B test data. A professional analysis must quantify how sensitive the ROI conclusions are to these assumptions.

Methodology: 1,000 Monte Carlo draws:

retention_effectiveness ~ Uniform(0.05, 0.25)   [central: 0.10]
unit_cost               ~ Uniform(£1.00, £5.00) [central: £2.00]

For each draw, the full knapsack policy is recomputed at 12 budget levels.

Monte Carlo ROI Uncertainty Bands

Shaded uncertainty bands: Dark centre line = median ROI (p50). Inner band = 50% CI (p25–p75). Outer band = 90% CI (p5–p95). Even at the pessimistic 5th percentile, ROI remains strongly positive across all budget levels tested.

Budget	Median ROI	90% CI (p5–p95)
£500	~18×	8×–32×
£1,331	~18×	8×–35×
£2,462	~16×	6×–26×
£4,725	~11×	4×–18×

At £2,000 budget: ROI is positive in all 1,000 simulations. Even the most pessimistic combination (5% effectiveness + £5/customer cost) produces a profitable campaign. The business case is robust to assumption uncertainty.

Tornado Chart — One-at-a-Time Sensitivity

Tornado chart: Each bar shows how much ROI changes when the parameter is moved ±50% of its central value (all other parameters held constant). Longer bar = greater sensitivity.

Parameter	Base ROI	Low Value ROI	High Value ROI	Swing
`retention_effectiveness`	16.9×	8.0× (η=5%)	25.9× (η=15%)	17.9×
`unit_cost`	16.9×	24.8× (£1)	13.1× (£3)	11.7×

Reading the tornado: retention_effectiveness has greater total influence on ROI than unit_cost. Empirically measuring retention uplift via A/B testing would have more impact on decision quality than negotiating down channel costs. This is a direct, actionable business recommendation.

Professional DS Practices

Practice	Implementation	Why It Matters
Multi-model comparison	4 algorithms, 5-fold stratified CV, auto-selection by CV AUC	Avoids model-selection bias; choice backed by evidence not assumption
SHAP interpretability	`TreeExplainer` — exact values; bar + beeswarm plots	Stakeholder trust; regulatory readiness; direction + magnitude per feature
Probability calibration	Reliability diagram + score distribution by class	Probabilities must reflect real-world rates to be usable in business math
Dual evaluation curves	ROC and Precision-Recall both reported	PR especially informative at 63% churn base rate
Bootstrap confidence intervals	95% CIs on decile lift (500 resamples, seed=42)	Statistical rigour around point estimates
Monte Carlo sensitivity	1,000 draws; tornado chart; 90% CI bands	Quantifies how robust ROI claims are to assumption uncertainty
Time-safe feature engineering	All features computed before cutoff; leakage check runs explicitly	Realistic simulation of production performance — no look-ahead
Out-of-time evaluation	Train on cutoff C; evaluate on cutoff C+60d	Models tested exactly as they will be used in production
Cohort analysis	25 monthly cohorts × 12-month retention matrix	Reveals natural decay; contextualises why CLV targeting is needed
Spearman rank correlation	ρ(CLV, holdout revenue) = 0.57	Right metric for targeting: ranking quality, not absolute accuracy
MLflow experiment tracking	Params, metrics, and artifacts logged per run	Full reproducibility; enables fair comparison across runs
Config-driven pipeline	All parameters in YAML; zero magic numbers in code	Any parameter change is one YAML edit — nothing is hidden in code
Google-style docstrings	Args, Returns, Raises on every public function	Code is readable by collaborators without opening the implementation
Full type hints	Complete signatures on all functions	IDE assistance; self-documenting interfaces
10 pytest unit tests	Synthetic data fixtures; pure-function design; CI-safe	Regression protection as the codebase evolves
Modern packaging	`pyproject.toml` + `Makefile` + `.pre-commit-config.yaml`	Project is installable, lintable, testable, and deployable in one command

Skills Demonstrated

Skill Area	Demonstrated By
Statistical & probabilistic modeling	BG/NBD + Gamma-Gamma CLV; inactivity-based churn label design; discount rate derivation
Supervised machine learning	4-algorithm comparison; stratified k-fold CV; class imbalance handling (`class_weight="balanced"`)
Model interpretability	SHAP `TreeExplorer`; mean \|SHAP\| bar; beeswarm scatter; direction analysis
Model evaluation	ROC/PR curves; calibration reliability diagram; decile lift; bootstrap CIs; out-of-time holdout
Mathematical optimisation	0/1 Knapsack; integer programming (PuLP/CBC); greedy fallback; economic objective function design
Uncertainty quantification	Monte Carlo simulation; tornado chart (OAT sensitivity); 90% CI bands on ROI
Customer analytics	RFM segmentation (8 named segments); cohort retention matrix; Lorenz curve; Gini coefficient
Data engineering	Multi-layer Parquet pipeline; schema validation; cutoff-safe computation; 7-rule deterministic cleaning
Software engineering	Config-driven YAML; frozen dataclasses; CLI scripts; Google-style docstrings; type hints throughout
MLOps fundamentals	MLflow experiment tracking; reproducible seeds; model card (v2.0); monitoring thresholds
Testing	10 pytest tests; synthetic data fixtures; pure-function design; no I/O in tests
Project packaging	`pyproject.toml`; `Makefile` (11 targets); `.pre-commit-config.yaml` (ruff lint + format)
Business communication	Results-first documentation; quantified claims; explicit assumptions; model card with limitations

How to Run

Prerequisites

Python 3.9+
~2 GB RAM (for 1M-row dataset processing)
UCI Online Retail II dataset (.xlsx) placed at data/raw/online_retail_II.xlsx

Option A: Makefile (recommended)

git clone <repo-url>
cd clv-long-term-optimization

make install        # Creates .venv, installs all dependencies from requirements.txt
make pipeline       # Runs all 11 steps end-to-end (~15–20 min on a laptop)
make test           # Runs 10 unit tests
make mlflow-ui      # Launches MLflow UI at http://localhost:5000
make lint           # ruff lint check
make format         # ruff autoformat
make clean          # Removes generated data and reports (keeps raw data)

Option B: Manual steps

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Full pipeline
python -m src.pipelines.weekly_scoring_pipeline --config-dir config

# Override budget for a one-off run
python -m src.pipelines.weekly_scoring_pipeline --config-dir config --budget 8000

# Individual analysis modules
python -m src.analysis.customer_segmentation
python -m src.analysis.cohort_analysis
python -m src.analysis.business_insights
python -m src.evaluation.sensitivity_analysis

# Tests
pytest -v

Configuration

All parameters are controlled via YAML files — no code edits required for any standard workflow:

File	Controls
`config/project.yaml`	Paths, cutoff date, holdout window
`config/modeling.yaml`	Hyperparameters, CV folds, random state, model type (`auto`)
`config/business.yaml`	Budget, unit cost, retention effectiveness, solver
`config/evaluation.yaml`	Decile count, currency symbol, rounding

Repository Structure

clv-long-term-optimization/
│
├── src/
│   ├── ingestion/
│   │   └── load_data.py               # Step 1: Schema validation + Parquet export
│   ├── cleaning/
│   │   └── clean_transactions.py      # Step 2: 7-rule deterministic cleaning
│   ├── features/
│   │   └── build_features.py          # Step 3: Cutoff-safe RFM + trend features
│   ├── modeling/
│   │   ├── train_clv_models.py        # Step 4: BG/NBD + Gamma-Gamma + MLflow
│   │   └── train_churn_risk.py        # Step 5: 4-model CV + SHAP + calibration
│   ├── optimization/
│   │   └── budget_allocator.py        # Step 6: 0/1 Knapsack (PuLP/CBC)
│   ├── evaluation/
│   │   ├── backtesting.py             # Step 7: Decile lift + 95% CIs + ROI curve
│   │   └── sensitivity_analysis.py   # Step 11: Monte Carlo ROI + tornado chart
│   ├── analysis/
│   │   ├── customer_segmentation.py  # Step 8: RFM segments + CLV/churn overlay
│   │   ├── cohort_analysis.py        # Step 9: Monthly cohort retention heatmap
│   │   └── business_insights.py      # Step 10: Pareto + Lorenz + monthly trend
│   ├── pipelines/
│   │   └── weekly_scoring_pipeline.py # CLI orchestrator — runs all 11 steps
│   └── utils/
│       ├── config_loader.py           # YAML loader with validation
│       └── helpers.py                 # Logging and filesystem utilities
│
├── tests/
│   ├── test_cleaning.py               # 2 tests: cleaning rules and revenue computation
│   ├── test_features.py               # 2 tests: cutoff safety and feature completeness
│   └── test_models.py                 # 6 tests: CLV aggregation, churn labels, snapshot
│
├── config/
│   ├── project.yaml                   # Paths, cutoff date, data locations
│   ├── modeling.yaml                  # Hyperparameters, cv_folds, model_type: "auto"
│   ├── business.yaml                  # Budget, cost, retention effectiveness, solver
│   └── evaluation.yaml                # Decile count, currency, rounding precision
│
├── reports/
│   ├── figures/                       # 17 PNG charts (auto-generated by pipeline)
│   └── tables/                        # 11 CSV tables (auto-generated by pipeline)
│
├── docs/
│   ├── model_card.md
│   ├── business_problem.md
│   ├── architecture_overview.md
│   ├── feature_engineering.md
│   ├── evaluation_strategy.md
│   ├── business_impact_and_roi.md
│   ├── modeling_assumptions.md
│   ├── deployment_plan.md
│   ├── data_quality_rules.md
│   └── Mathintuition_datascienceframing.md
│
├── assets/
│   ├── header_banner.svg
│   ├── footer_banner.svg
│   ├── clv_architecture.png
│   ├── data_pipeline_feature_engineering.png
│   ├── deployment_lifecycle_architecture.png
│   └── evaluation_monitoring_architecture.png
│
├── data/
│   ├── raw/                           # UCI Online Retail II xlsx (~1M rows)
│   ├── interim/                       # transactions_raw, transactions_clean (Parquet)
│   └── processed/                     # features, CLV scores, churn scores, segments, targeting
│
├── notebooks/                         # EDA and model interpretation notebooks
├── pyproject.toml                     # Packaging, ruff lint config, pytest config
├── Makefile                           # install / pipeline / test / lint / mlflow-ui / clean
├── .pre-commit-config.yaml            # ruff lint + format pre-commit hooks
└── requirements.txt                   # Full dependency list with version pins

Assumptions, Risks, and Limitations

Key Assumptions

Assumption	Value	How Sensitivity Is Tested
Retention effectiveness (η)	10%	Monte Carlo: Uniform(5%–25%) — ROI positive throughout
Unit cost per customer	£2	Monte Carlo: Uniform(£1–£5) — second-largest ROI driver
Contact frequency cap	None	Production systems need per-customer contact limits
Channel model	Single channel	Real optimisation would model email/SMS/calls separately

Known Risks

Non-causal: CLV and churn models are predictive, not causal. Expected gains are potential impact — not guaranteed uplift. SUTVA is not satisfied without controlled experiments.

Observational bias: Purchasing behaviour reflects unobserved factors (promotions, seasonality, competitive events) not captured in features.

Stationarity assumption: BG/NBD assumes stationary purchase rates — violated by strong seasonality or structural market shifts.

Concept drift: Model trained on 2009–2011 UK retail data. Performance will degrade without periodic retraining on fresh data.

Appropriate Uses

This system is appropriate for:

Batch-mode retention targeting (weekly or monthly cadence)
Internal CRM decision support and budget planning
Scenario analysis and ROI forecasting

This system is not appropriate for:

Real-time scoring (latency is measured in minutes)
Individual credit or financial eligibility decisions
Campaigns requiring causal proof of uplift (run A/B tests first)

Future Improvements

Improvement	Business Impact	Complexity
Uplift modeling (T-learner / X-learner)	Replace assumed η with measured causal effect	High
A/B test design + power analysis	Scientifically measure true intervention lift	Medium
Channel-aware multi-constraint knapsack	Separate email/SMS/call budgets, different unit costs	Medium
SHAP interaction values	Understand feature × feature effects on churn	Low
Automated drift monitoring	PSI alerts + scheduled retraining (PSI > 0.25 threshold)	Medium
Per-customer variable cost	Higher offers for highest-value customers	Low
Cohort-stratified CLV	Separate BG/NBD model per acquisition cohort	Medium
Model governance layer	Versioning, lineage, approval workflow, audit trail	Medium

License and Copyright

Ramesh Shrestha — Data Scientist · ML Engineer · Sydney, Australia

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.github/workflows		.github/workflows
assets		assets
config		config
data		data
docs		docs
notebooks		notebooks
reports		reports
src		src
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
Makefile		Makefile
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

At a Glance

Documentation

Table of Contents

The Business Problem

What this data reveals

The Solution

Key Results

CLV Model — Holdout Validation

Churn Model — 4-Algorithm Comparison

Budget Optimization ROI

End-to-End Architecture

Step-by-Step: Methods and Findings

Steps 1–3: Data Pipeline

Step 4: CLV Modeling

Result: CLV Decile Lift

Step 5: Churn Risk Modeling

5a. Multi-Model Comparison

5b. ROC and Precision-Recall Curves

5c. Probability Calibration

5d. SHAP Feature Importance

Step 6: Budget Optimization

Step 7: Backtesting & Evaluation

ROI vs Budget Curve

Step 8: RFM Customer Segmentation

Step 9: Cohort Retention Analysis

Step 10: Business Intelligence & Pareto

Step 11: Monte Carlo Sensitivity Analysis

Monte Carlo ROI Uncertainty Bands

Tornado Chart — One-at-a-Time Sensitivity

Professional DS Practices

Skills Demonstrated

How to Run

Prerequisites

Option A: Makefile (recommended)

Option B: Manual steps

Configuration

Repository Structure

Assumptions, Risks, and Limitations

Key Assumptions

Known Risks

Appropriate Uses

Future Improvements

License and Copyright

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages