Tussar98/experimentiq

ExperimentIQ

A Bayesian A/B testing pipeline on GCP

🔗 Live dashboard →


[Image: ExperimentIQ dashboard]

A production-style experimentation pipeline that runs frequentist + Bayesian analysis with CUPED variance reduction, heterogeneous treatment effect detection, and automated guardrail flagging. Built end-to-end on Google Cloud (BigQuery, Cloud Storage, Cloud Run, Looker Studio) using real MovieLens 25M data with a known-truth treatment effect simulated on the variant arm.


The business question

"We've built a new recommender (Variant B). Should we ship it?"

This is the most common A/B question at any recommender-driven company. The answer is rarely a clean "yes" or "no" — and the value of an experimentation platform is in producing the right kind of nuance: probability that B is better, expected loss if we get it wrong, segment-level effects, and guardrail health.

This project demonstrates that full loop at portfolio scale.


TL;DR — what the pipeline found

For the experiment defined in this repo:

| Metric | Raw lift | CUPED-adjusted lift | P(B>A) | Decision |
|---|---|---|---|---|
| Watch-time (primary) | +13.0% | +9.1% | 0.85 | Ship to light users |
| Sessions (guardrail) | +13.1% | +10.0% | 0.70 | ⚠ Inconclusive |

Recommendation produced by the pipeline:

PROCEED WITH CAUTION. Primary lift +9.1% (P(B>A)=0.85). Guardrail metric inconclusive (+10.0%, CI crosses zero). Consider longer experiment or segmented rollout — segment analysis shows light users have stronger evidence than heavy users.


Why this experiment is interesting

The dataset is deliberately constructed so the raw aggregate lift is misleading. Pre-period covariates show A and B users were balanced before assignment, but a chance imbalance in experiment-period activity inflates the raw lift estimate to +13%. The true treatment effect coded into the data is +7% on watch-time (with a +3% extra bonus for light users) and −2% on sessions.

The pipeline's job is to recover this signal from noisy data. It does so via:

  • Multivariate CUPED with four pre-period covariates, which pulls the lift estimate down from +13% → +9% (closer to the true +7%)
  • Bayesian inference that produces a calibrated P(B>A) of 0.85 — even though the frequentist test fails to reject the null at α=0.05
  • Segment-level analysis that surfaces the heterogeneity (low-activity users +23%, mid +10%, high +4%; light users +11%, heavy users +8%)
  • Guardrail flagging that catches the unstable sessions metric and produces a yellow status with a thoughtful recommendation

Real A/B experiments are often underpowered. The senior DS skill is making good decisions under uncertainty — not waiting for p < 0.05 to feel certain.


Architecture

MovieLens 25M
  │
  ▼
GCS bucket (raw CSVs)
  │
  ▼
BigQuery raw layer (4 tables)
  │
  ▼
BigQuery staging (cohort, assignment, daily metrics)
  │
  ▼
BigQuery analytics (star schema: 3 dims + 2 facts)
  │
  ▼
Cloud Run job (Python: freq + CUPED + Bayes + segments)
  │
  ▼
analytics.experiment_results
  │
  ▼
Looker Studio dashboard (public)

Components

| Component | Purpose | Tech |
|---|---|---|
| Synthetic experiment build | Real MovieLens behavior as control; injects a known counterfactual treatment effect (lift + novelty + heterogeneity + guardrail regression) into the variant arm | BigQuery SQL |
| Star schema | dim_users, dim_movies, dim_experiments, fact_assignments, fact_daily_metrics | BigQuery |
| Analysis library | Welch's t-test, univariate + multivariate CUPED, closed-form Bayesian, segment-level HTE | Python (scipy, pandas, numpy) |
| Cloud Run job | Containerized pipeline runs end-to-end in ~10s, writes structured results to BigQuery | Docker, Artifact Registry, Cloud Run |
| Test suite | 14 tests covering math correctness vs scipy, CUPED variance reduction, Bayesian probability recovery | pytest |
| Dashboard | Single-page exec dashboard with traffic-light decision logic, segment breakdowns, methodology table | Looker Studio |

Technical decisions worth calling out

Closed-form Bayesian instead of MCMC. For Normal-Normal A/B tests with weak priors, the posterior over the difference in means is analytically Student-t. The closed-form implementation runs in 5ms and produces results numerically identical to PyMC sampling (which took 23 minutes during development). The PyMC version is kept as bayesian_mcmc.py for reference. This is the same approach Microsoft ExP and Booking use for normal metrics.
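To make the closed-form idea concrete, here is a minimal sketch, not the repo's bayesian.py. It assumes a flat prior on each arm's mean and approximates the posterior of the difference by a Student-t with Welch-Satterthwaite degrees of freedom. The function name `prob_b_beats_a` and the simulated arms are hypothetical.

```python
import numpy as np
from scipy import stats


def prob_b_beats_a(a, b):
    """Approximate P(mean_B > mean_A) in closed form.

    Assumption: flat priors, so each arm's posterior mean is centered on the
    sample mean; the difference is approximated by a Student-t with
    Welch-Satterthwaite degrees of freedom.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    va = a.var(ddof=1) / a.size  # squared standard error, arm A
    vb = b.var(ddof=1) / b.size  # squared standard error, arm B
    diff = b.mean() - a.mean()
    se = np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return float(stats.t.cdf(diff / se, df))


# Hypothetical data: simulated per-user watch-time for two arms.
rng = np.random.default_rng(0)
control = rng.normal(100.0, 30.0, size=1200)
variant = rng.normal(103.0, 30.0, size=1200)
print(f"P(B>A) = {prob_b_beats_a(control, variant):.3f}")
```

Under these assumptions, P(B>A) equals one minus the one-sided Welch p-value, which is why a probability like 0.85 can coexist with a two-sided test that fails to reject at α=0.05.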

Multivariate CUPED. Univariate CUPED with pre_watch_time_min only achieved 6% variance reduction (ρ=0.25) — the cohort filter compresses the covariate range. Multivariate CUPED with four pre-period covariates achieves 14% reduction (multiple-R=0.38). This is more interesting than the textbook "40% reduction" because it reflects what real production CUPED looks like when covariates are weak. Investigating why CUPED's reduction is modest, and reporting it honestly, is part of the project's value.
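The link between correlation and variance reduction is that the reduction equals the squared (multiple) correlation: 0.25² ≈ 6% and 0.38² ≈ 14%. Below is a minimal sketch of multivariate CUPED as an OLS adjustment on centered pre-period covariates. It is not the repo's cuped.py; the function name `cuped_adjust`, the simulated covariates, and the coefficients are all hypothetical.

```python
import numpy as np


def cuped_adjust(y, X):
    """Subtract the best linear prediction of y from centered covariates X.

    theta should be fit on pooled data (both arms together) so the
    adjustment cannot bias one arm relative to the other.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # univariate CUPED is the one-column case
    Xc = X - X.mean(axis=0)  # centering keeps the mean of y unchanged
    theta, *_ = np.linalg.lstsq(Xc, y - y.mean(), rcond=None)
    return y - Xc @ theta


# Hypothetical data: four weakly predictive pre-period covariates.
rng = np.random.default_rng(42)
n = 5000
X = rng.normal(size=(n, 4))
y = 10 + X @ np.array([0.3, 0.2, 0.1, 0.1]) + rng.normal(size=n)

y_adj = cuped_adjust(y, X)
reduction = 1 - y_adj.var() / y.var()
print(f"variance reduction: {reduction:.1%}")
```

Because the covariates are centered, the adjusted metric keeps the same overall mean; only its variance shrinks, and only by as much as the covariates actually explain.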

Stratified randomization on activity decile. Users are hash-bucketed within NTILE(10) OVER (ORDER BY pre_n_ratings) so that pre-period activity is balanced across arms. The sample ratio mismatch (SRM) check passes (chi-square = 0.205).
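The repo does this step in BigQuery SQL. The Python sketch below only illustrates the idea and makes several assumptions: users are split into ten equal-size strata by rank (an approximation of NTILE), arms alternate in hash order within each stratum, and the SRM check compares counts against an expected 50/50 split. The salt, user IDs, and function names are hypothetical.

```python
import hashlib
import math
from collections import Counter


def _hash(user_id, salt):
    # Deterministic pseudo-random ordering key for a user.
    return hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()


def stratified_assign(pre_n_ratings, salt="exp_001", n_strata=10):
    """Map user_id -> 'A'/'B', alternating in hash order within each stratum."""
    ranked = sorted(pre_n_ratings, key=lambda u: (pre_n_ratings[u], u))
    assignment = {}
    for s in range(n_strata):
        lo = s * len(ranked) // n_strata
        hi = (s + 1) * len(ranked) // n_strata
        stratum = sorted(ranked[lo:hi], key=lambda u: _hash(u, salt))
        for i, user in enumerate(stratum):
            assignment[user] = "A" if i % 2 == 0 else "B"
    return assignment


def srm_check(n_a, n_b):
    """Chi-square statistic and p-value versus a 50/50 split (1 degree of freedom)."""
    expected = (n_a + n_b) / 2
    chi2 = (n_a - expected) ** 2 / expected + (n_b - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, df=1
    return chi2, p_value


# Hypothetical cohort: 2,400 users with synthetic pre-period rating counts.
users = {f"u{i}": (i * 37) % 500 for i in range(2400)}
arms = stratified_assign(users)
counts = Counter(arms.values())
chi2, p_value = srm_check(counts["A"], counts["B"])
print(counts, chi2, p_value)
```

Alternating within strata forces near-exact balance by construction; a plain hash-mod-2 split within each decile would also work and would leave small random imbalances for the SRM check to evaluate.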

Winsorization at p99 (pooled). Watch-time has a heavy tail. Capping at the pooled p99 prevents single users from dominating arm averages without biasing one arm relative to the other.
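A short sketch of pooled winsorization, assuming per-user totals are already available as arrays. The cap is computed once from both arms combined, so the same threshold applies to control and variant. The function name `winsorize_pooled` and the lognormal sample data are hypothetical.

```python
import numpy as np


def winsorize_pooled(control, variant, q=99.0):
    """Cap both arms at the q-th percentile of the pooled values."""
    control = np.asarray(control, dtype=float)
    variant = np.asarray(variant, dtype=float)
    cap = np.percentile(np.concatenate([control, variant]), q)
    return np.minimum(control, cap), np.minimum(variant, cap), cap


# Hypothetical heavy-tailed watch-time totals.
rng = np.random.default_rng(7)
a = rng.lognormal(mean=3.0, sigma=1.0, size=1200)
b = rng.lognormal(mean=3.0, sigma=1.0, size=1200)

a_w, b_w, cap = winsorize_pooled(a, b)
print(f"cap={cap:.1f}, max before={max(a.max(), b.max()):.1f}")
```

Computing a separate p99 per arm would cap the variant at a higher value whenever the treatment shifts the tail, which is the bias the pooled threshold avoids.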

Intent-to-treat analysis. Per-user totals include zero-activity users: every assigned user contributes exactly one value to the analysis, even if that value is zero. This matches how production experimentation platforms handle exposure.
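In pandas terms, intent-to-treat comes down to a left join from assignments onto activity, followed by a zero fill. The sketch below uses tiny hypothetical frames; the column names `user_id`, `arm`, and `watch_time_min` are illustrative and may not match the repo's schema.

```python
import pandas as pd

# Hypothetical assignment table: every assigned user appears exactly once.
assignments = pd.DataFrame(
    {"user_id": [1, 2, 3, 4], "arm": ["A", "B", "A", "B"]}
)
# Hypothetical daily metrics: users 3 and 4 had no activity at all.
daily = pd.DataFrame(
    {"user_id": [1, 1, 2], "watch_time_min": [30.0, 12.5, 40.0]}
)

per_user = daily.groupby("user_id", as_index=False)["watch_time_min"].sum()

# Left join keeps inactive users; their missing totals become zeros.
itt = assignments.merge(per_user, on="user_id", how="left").fillna(
    {"watch_time_min": 0.0}
)
arm_means = itt.groupby("arm")["watch_time_min"].mean()
print(arm_means)
```

An inner join here would silently drop users 3 and 4 and condition the comparison on post-assignment activity, which the treatment itself can influence.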

See DECISIONS.md for fuller treatment of these and other tradeoffs.


How to reproduce

You'll need a GCP project with billing enabled and the bigquery, storage, run, and artifactregistry APIs enabled.

# 1. Setup
git clone https://github.com/<you>/experimentiq.git && cd experimentiq
python -m venv .venv && source .venv/bin/activate  # Windows: .venv\Scripts\Activate.ps1
pip install -e .
pip install -r requirements.txt

# 2. Build the warehouse
python scripts/01_download_movielens.py
python scripts/02_upload_to_gcs.py
python scripts/03_load_to_bigquery.py
python scripts/04_build_experiment.py
python scripts/05_build_star_schema.py

# 3. Run the analysis (writes to analytics.experiment_results)
python scripts/06_run_analysis.py

# 4. Or deploy to Cloud Run
docker build -t $IMAGE .
docker push $IMAGE
gcloud run jobs create experimentiq-analysis --image $IMAGE --region us-central1 \
  --service-account $SA_EMAIL \
  --set-env-vars "GCP_PROJECT_ID=$PROJECT_ID" \
  --set-env-vars "GCS_BUCKET=$BUCKET" \
  --set-env-vars "BQ_LOCATION=US"
gcloud run jobs execute experimentiq-analysis --region us-central1 --wait

# 5. Run tests
pytest tests -v

Project structure

experimentiq/
├── src/
│   └── analysis/
│       ├── data.py              # BigQuery loader, ExperimentData container
│       ├── frequentist.py       # Welch's t-test + CIs + power
│       ├── cuped.py             # univariate + multivariate CUPED
│       ├── bayesian.py          # closed-form Bayesian (production)
│       ├── bayesian_mcmc.py     # PyMC reference implementation
│       └── segments.py          # heterogeneous treatment effects
├── scripts/
│   ├── 01_download_movielens.py
│   ├── 02_upload_to_gcs.py
│   ├── 03_load_to_bigquery.py
│   ├── 04_build_experiment.py
│   ├── 05_build_star_schema.py
│   ├── 06_run_analysis.py
│   ├── main.py                  # Cloud Run entrypoint
│   └── _run_analysis_entry.py   # importable wrapper
├── tests/                       # 14 tests, runs in <2 seconds
├── Dockerfile
├── requirements-prod.txt
├── pyproject.toml
├── DECISIONS.md
└── README.md


Caveats and what I'd do next

This experiment is genuinely underpowered. With ~2,400 users and heavy-tailed engagement, the frequentist test fails to reject the null at α=0.05 even though a nonzero effect is built into the data. The Bayesian and segment analyses do the actual decision-making work. At YouTube scale, the same pipeline running on millions of users would produce a tight CI and a clearly significant result.
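A rough power calculation shows how sample size drives this. The sketch uses a normal approximation for a two-sided test on a relative lift. The coefficient of variation (`cv=1.5`) is a hypothetical stand-in for heavy-tailed engagement, not a value measured in this repo, and the function name `two_sample_power` is illustrative.

```python
from statistics import NormalDist


def two_sample_power(rel_effect, cv, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test on a relative difference in means.

    rel_effect: true lift as a fraction of the control mean (0.07 for +7%).
    cv: coefficient of variation (std / mean) of the per-user metric,
        assumed equal in both arms.
    """
    z = NormalDist()
    se = cv * (2.0 / n_per_arm) ** 0.5  # standard error of the relative lift
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(rel_effect) / se
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical inputs: +7% true lift, cv=1.5, growing sample sizes.
for n in (1_200, 12_000, 120_000):
    print(n, round(two_sample_power(0.07, cv=1.5, n_per_arm=n), 3))
```

With a fixed true lift, power is governed by the ratio of the lift to its standard error, so it rises steeply as the per-arm sample grows from thousands to hundreds of thousands.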

CUPED's modest variance reduction (14%) is structural to the cohort. Cohort filtering compresses the range of pre-period covariates, suppressing correlation. Extensions worth exploring: shorter pre-period windows (7-14 days vs 90), per-segment CUPED (high-activity users have ρ>0.4 individually), or learned residual models.

The chance imbalance is the project's value. A naive frequentist analysis on this data would say "fail to reject null, don't ship." A naive aggregate-only Bayesian would say "P(B>A)=0.91, ship." The pipeline does neither — it surfaces the heterogeneity, flags the guardrail uncertainty, and produces a segmented recommendation (ship to light users, hold for heavy). That's the actual work.


Built with

Python 3.12 · GCP (BigQuery, Cloud Storage, Cloud Run, Artifact Registry) · Docker · scipy · pandas · numpy · pymc · pytest · Looker Studio
