This document explains, from a data scientist’s perspective:
- How the business problem is translated into data science and mathematical problems
- The math intuition behind each model and component
- How we solve the full system end-to-end in a way that is time-safe and decision-focused
The project is structured as a decision system:
Transactions → Features → CLV (value) + Churn risk (risk) → Economic benefit → Budget optimization → Evaluation
“Who should we target with limited retention budget to maximize long-term value?”
We need to compute, for each customer (i), a decision (x_i \in {0,1}) such that:
- (x_i = 1): we target customer (i) (send offer, call, etc.)
- (x_i = 0): we do not target customer (i)
The business goal becomes:
Choose (x_i) to maximize total expected return subject to a budget constraint.
This requires:
- an estimate of future value if the customer stays active (CLV)
- an estimate of risk of losing that value (churn probability)
- an estimate of benefit gained by intervention (retention effectiveness)
- an optimization step to select customers under limited budget
We observe line-item transactions. After cleaning and aggregation, we represent data as purchase events (invoice-level):
For customer (i), we observe purchase times: [ t_{i1}, t_{i2}, \dots, t_{in_i} ] and monetary values (invoice revenue): [ m_{i1}, m_{i2}, \dots, m_{in_i} ]
We define:
- Recency: time since last purchase [ r_i = T - t_{in_i} ]
- Frequency: number of repeat purchases [ x_i = n_i - 1 ]
- Age / Tenure: length of observation window [ T_i = T - t_{i1} ] where (T) is the cutoff date (decision date).
These are the core sufficient statistics used by classical CLV models.
We always split time into:
- Calibration window: data used to build features and models
- Holdout window: future data used only for evaluation
If we let features include holdout transactions, we leak future information and overestimate performance.
Mathematically, when predicting outcomes after (T):
- Features must depend only on: [ {t_{ij}, m_{ij}\ |\ t_{ij} \le T} ]
- Labels and evaluation depend on: [ {t_{ij}, m_{ij}\ |\ t_{ij} > T} ]
This is crucial for real-world deployment: you never have future data when scoring customers.
CLV is the expected discounted value of a customer’s future purchases over a horizon (H):
[ CLV_i(H) = \mathbb{E}\left[\sum_{k=1}^{N_i(H)} \frac{M_{ik}}{(1+d)^{\tau_{ik}}}\right] ]
Where:
- (N_i(H)) is the random number of future purchases in horizon (H)
- (M_{ik}) is monetary value per purchase
- (d) is discount rate (time value of money)
- (\tau_{ik}) is time until the (k)-th future purchase
We decompose the problem into two parts:
- Purchase frequency → BG/NBD model
- Monetary value → Gamma-Gamma model
This separation is standard and practical.
Customers buy repeatedly over time, but some “drop out” (become inactive).
BG/NBD models:
- how often customers buy while active
- the probability they are still “alive”
-
While alive, customer purchases follow a Poisson process with rate (\lambda_i): [ N(t)\ |\ \lambda_i \sim \text{Poisson}(\lambda_i t) ]
-
Customers are heterogeneous: [ \lambda_i \sim \text{Gamma}(r, \alpha) ] Gamma prior captures that some customers buy frequently and others rarely.
-
After each purchase, customer may drop out with probability (p_i): [ p_i \sim \text{Beta}(a, b) ] Beta distribution captures heterogeneity in dropout tendency.
This combination yields closed-form formulas for:
- expected future purchases
- probability alive
BG/NBD uses:
- (x_i) = number of repeat purchases
- (t_{x_i}) = time of last purchase
- (T_i) = total observation time
We estimate:
- (\mathbb{E}[N_i(H)]): expected number of purchases in the next (H) days
- (P(\text{alive}_i)): probability customer is still active at cutoff
These outputs directly answer:
“Who is likely to purchase again?”
Given a customer purchases, how much do they spend on average?
We model mean transaction value per customer: [ \mu_i = \mathbb{E}[M_{ik}] ]
Assumption (common, but important):
- the spending distribution is independent of purchase timing/frequency (conditional on being a repeat customer)
-
Transaction values for customer (i) are Gamma distributed: [ M_{ik}\ |\ \nu_i \sim \text{Gamma}(p, \nu_i) ]
-
Customer heterogeneity: [ \nu_i \sim \text{Gamma}(q, \gamma) ]
This leads to a closed-form estimate for expected average spend: [ \mathbb{E}[\mu_i\ |\ \text{transactions}] ]
Gamma-Gamma is typically fit only for customers with at least 2 repeat purchases to ensure stability.
A practical CLV estimate over horizon (H) is:
[ CLV_i(H) \approx \mathbb{E}[N_i(H)] \times \mathbb{E}[\mu_i] \times \text{discounting} ]
Discounting uses a daily discount rate (d_{\text{daily}}) derived from annual rate (d_{\text{annual}}):
[ d_{\text{daily}} = (1 + d_{\text{annual}})^{1/365} - 1 ]
This produces:
- a single monetary score per customer: (CLV_i(H))
In many retail businesses, churn is not explicit. A customer can “leave” silently.
So we define churn operationally as inactivity.
Given cutoff (T), inactivity threshold (L) days:
[ y_i = \begin{cases} 1 & \text{if no purchase in } (T, T+L] \ 0 & \text{otherwise} \end{cases} ]
This turns a missing business label into a measurable label using future behavior.
We now have a supervised learning dataset:
- Input features (X_i): computed using transactions up to (T)
- Label (y_i): computed using transactions after (T)
We want: [ \hat{p}_i = P(y_i = 1\ |\ X_i) ]
Logistic regression estimates: [ \hat{p}_i = \sigma(\beta_0 + \beta^T X_i) ] where: [ \sigma(z) = \frac{1}{1 + e^{-z}} ]
Interpretation:
- Each feature contributes linearly to the log-odds of churn.
- Coefficients are interpretable: sign and magnitude show direction and strength.
We standardize features because:
- it stabilizes optimization
- it makes coefficients comparable
Predictions alone are not the business decision.
We need a “value of acting” estimate.
A high-risk customer with high CLV has large value at risk.
Approximate expected value at risk:
[ \text{value_at_risk}_i = CLV_i(H) \times \hat{p}_i ]
If an intervention prevents churn with probability (\eta), then expected benefit is:
[ \text{benefit}_i = CLV_i(H) \times \hat{p}_i \times \eta ]
Where (\eta \in [0,1]) is the assumed effectiveness.
If intervention cost is (c_i), then:
[ \text{net_gain}_i = \text{benefit}_i - c_i ]
If net gain is negative, the rational decision is not to intervene.
[ x_i \in {0,1} ]
Maximize total net gain:
[ \max_{x} \sum_{i=1}^{N} x_i \cdot \text{net_gain}_i ]
[ \sum_{i=1}^{N} x_i \cdot c_i \le B ]
This is a classical 0/1 knapsack.
Ranking by net gain gives a heuristic solution.
Optimization provides the best feasible subset under budget constraints (especially if costs vary).
We implemented:
- exact solution via integer programming (PuLP) or
- greedy approximation for speed
The core goal of CLV in business is ranking: we want top-ranked customers to produce more future revenue.
We evaluate using:
- decile lift
- Spearman rank correlation
Compute correlation between ranks:
[ \rho = \text{corr}(\text{rank}(CLV_i), \text{rank}(\text{holdout_revenue}_i)) ]
High (\rho) means the model orders customers correctly.
Split customers into 10 groups by predicted CLV and compute average holdout revenue by decile.
A good model shows monotonic increase from decile 1 → 10.
We use AUC/AP to ensure the model can separate churn vs non-churn, but business value depends on targeting decisions.
Because we do not have actual campaign outcomes, we evaluate the policy by simulation:
For each budget (B):
- select customers under policy
- compute total expected benefit and net gain
- compute ROI:
[ ROI(B) = \frac{\text{benefit}(B) - \text{cost}(B)}{\text{cost}(B)} ]
A typical curve shows diminishing returns (ROI decreases as budget expands).
The project is structured around the real decision:
- CLV estimates future value ((CLV_i))
- Churn model estimates risk ((\hat{p}_i))
- Combine into expected benefit ((CLV_i \cdot \hat{p}_i \cdot \eta))
- Solve a budget constrained optimization
- Produce an actionable target list with economic justification
- Evaluate with time-safe ranking and ROI curves
This is how real companies move from “analytics” to “decision intelligence”.
In a real business setting, we would replace assumptions with measured effects:
- Estimate (\eta) using A/B tests (uplift / causal inference)
- Allow (c_i) to vary by offer type and customer segment
- Incorporate operational constraints (channel quotas, contact rules)
- Monitor data drift and score drift
But the structure of the decision system remains the same.
This glossary explains all key terms used in the project from both a business and mathematical / data science perspective. It is intended to make the system understandable to analysts, data scientists, and stakeholders.
Business meaning:
The total expected monetary value a customer will generate for the business in the future.
Mathematical meaning:
The expected discounted sum of future transaction values over a fixed horizon.
[ CLV_i(H) = \mathbb{E}\left[\sum_{k=1}^{N_i(H)} \frac{M_{ik}}{(1+d)^{\tau_{ik}}}\right] ]
Where:
- (N_i(H)) is the random number of future purchases
- (M_{ik}) is the value of each purchase
- (d) is the discount rate
- (\tau_{ik}) is the time until each purchase
Business meaning:
How recently a customer has purchased. Customers who purchased recently are usually more engaged.
Mathematical definition:
[
\text{Recency}i = T - t{i,\text{last}}
]
Where (T) is the cutoff date.
Lower recency values generally indicate higher engagement.
Business meaning:
How often a customer purchases.
Mathematical definition:
[
\text{Frequency}_i = n_i - 1
]
Where (n_i) is the total number of purchases observed.
Frequency is a core signal of habitual behavior.
Business meaning:
How much a customer spends per transaction on average.
Mathematical definition:
[
\mu_i = \mathbb{E}[M_{ik}]
]
Estimated using the Gamma-Gamma model.
Business meaning:
How long the customer has been active with the business.
Mathematical definition:
[
\text{Tenure}i = T - t{i,\text{first}}
]
Longer tenure often indicates customer maturity and stability.
Business meaning:
The date at which the business makes a decision (e.g., who to target).
Data science meaning:
A strict boundary separating:
- past data (used for features and training)
- future data (used only for labels and evaluation)
This prevents data leakage.
Definition:
The historical time window used to estimate model parameters.
All features and model fitting must use only data within this window.
Definition:
A future time window used only for evaluation, never for training.
Used to measure:
- actual future revenue
- churn behavior
Full name:
Beta-Geometric / Negative Binomial Distribution model
Purpose:
Estimate how frequently a customer will purchase and whether they are still active.
Core intuition:
- Customers purchase randomly over time while alive
- Each customer has their own purchase rate
- After each purchase, a customer may permanently drop out
Key outputs:
- Expected future number of purchases
- Probability the customer is still “alive”
Meaning:
The probability that a customer has not permanently dropped out as of the cutoff date.
This is not churn probability, but rather a latent activity probability.
Purpose:
Estimate expected average transaction value per customer.
Key assumption:
Transaction value is independent of purchase frequency (conditional on customer).
Used to model spending heterogeneity across customers.
Business meaning:
Future money is worth less than present money.
Mathematical use:
Discount future cash flows when computing CLV.
Daily discount rate: [ d_{\text{daily}} = (1 + d_{\text{annual}})^{1/365} - 1 ]
Business meaning:
A customer who stops purchasing.
Data science proxy (used here):
A customer is considered churned if no purchase occurs within a specified inactivity window after the cutoff date.
Definition:
The number of days without a purchase after which a customer is considered churned.
Example:
- 90 days for fast-moving retail
- 180+ days for seasonal categories
Mathematical meaning:
[
\hat{p}_i = P(\text{churn}_i = 1 \mid X_i)
]
Estimated using a supervised classification model.
Purpose:
Estimate churn probability from behavioral features.
Mathematical form: [ \hat{p}_i = \frac{1}{1 + e^{-(\beta_0 + \beta^T X_i)}} ]
Why used:
- Interpretable
- Stable
- Well-suited for tabular business data
Meaning:
The expected future value that could be lost if the customer churns.
[ \text{Value at Risk}_i = CLV_i \times \hat{p}_i ]
Meaning:
The probability that a retention action successfully prevents churn.
Important:
This is an assumption in this project due to lack of experimental data.
In production, this should be estimated using A/B testing.
Definition:
The expected monetary value preserved by intervening.
[ \text{Expected Benefit}_i = CLV_i \times \hat{p}_i \times \eta ]
Where (\eta) is retention effectiveness.
Business meaning:
How much value the business gains after accounting for intervention cost.
[ \text{Net Gain}_i = \text{Expected Benefit}_i - \text{Cost}_i ]
Only customers with positive net gain should be targeted.
Definition:
A hard limit on total intervention cost:
[ \sum_i x_i \cdot \text{Cost}_i \le B ]
Where (B) is the total budget.
Purpose:
Select a subset of customers that maximizes total net gain under a budget constraint.
Decision variable: [ x_i \in {0,1} ]
This converts predictions into executable business decisions.
Meaning:
A ranking-based evaluation metric.
Customers are sorted by predicted score and split into 10 equal-sized groups (deciles).
A strong model shows increasing actual outcomes from lowest to highest decile.
Purpose:
Measure how well predicted rankings align with actual future outcomes.
[ \rho = \text{corr}(\text{rank}(y), \text{rank}(\hat{y})) ]
Used when absolute prediction accuracy is less important than ordering.
Definition:
[
ROI = \frac{\text{Benefit} - \text{Cost}}{\text{Cost}}
]
Primary metric used by business leadership to assess effectiveness of retention spend.
Meaning:
Evaluation of the decision strategy, not just individual models.
Assesses:
- ROI vs budget
- diminishing returns
- operational feasibility
Definition:
A system that combines:
- data
- models
- economics
- optimization
to produce actionable business decisions, not just predictions.
This project is an example of a decision intelligence system.