ExperimentIQ

A Bayesian A/B testing pipeline on GCP

A production-style experimentation pipeline that runs frequentist + Bayesian analysis with CUPED variance reduction, heterogeneous treatment effect detection, and automated guardrail flagging. Built end-to-end on Google Cloud (BigQuery, Cloud Storage, Cloud Run, Looker Studio) using real MovieLens 25M data with a known-truth treatment effect simulated on the variant arm.

The business question

"We've built a new recommender (Variant B). Should we ship it?"

This is the most common A/B question at any recommender-driven company. The answer is rarely a clean "yes" or "no" — and the value of an experimentation platform is in producing the right kind of nuance: probability that B is better, expected loss if we get it wrong, segment-level effects, and guardrail health.

This project demonstrates that full loop at portfolio scale.

TL;DR — what the pipeline found

For the experiment defined in this repo:

Metric	Raw lift	CUPED-adjusted	P(B>A)	Decision
Watch-time (primary)	+13.0%	+9.1%	0.85	Ship to light users
Sessions (guardrail)	+13.1%	+10.0%	0.70	⚠ Inconclusive

Recommendation produced by the pipeline:

PROCEED WITH CAUTION. Primary lift +9.1% (P(B>A)=0.85). Guardrail metric inconclusive (+10.0%, CI crosses zero). Consider longer experiment or segmented rollout — segment analysis shows light users have stronger evidence than heavy users.

Why this experiment is interesting

The dataset is deliberately constructed so the raw aggregate lift is misleading. Pre-period covariates show A and B users were balanced before assignment, but a chance imbalance in experiment-period activity inflates the raw lift estimate to +13%. The true treatment effect coded into the data is +7% on watch-time (with a +3% extra bonus for light users) and −2% on sessions.

The pipeline's job is to recover this signal from noisy data. It does so via:

Multivariate CUPED with four pre-period covariates, which pulls the lift estimate down from +13% → +9% (closer to the true +7%)
Bayesian inference that produces a calibrated P(B>A) of 0.85 — even though the frequentist test fails to reject the null at α=0.05
Segment-level analysis that surfaces the heterogeneity (low-activity users +23%, mid +10%, high +4%; light users +11%, heavy users +8%)
Guardrail flagging that catches the unstable sessions metric and produces a yellow status with a thoughtful recommendation
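
The segment-level step in the list above can be sketched roughly as follows. This is an illustrative sketch, not the code in segments.py: the function name `segment_lifts` and the column names `segment`, `arm`, and `watch_time_min` are assumptions made for the example, and it reports only point estimates of relative lift, without the uncertainty a real HTE analysis would attach.

```python
import pandas as pd


def segment_lifts(df, segment_col="segment", arm_col="arm", metric="watch_time_min"):
    """Per-segment relative lift of arm B over arm A (hypothetical column names)."""
    grouped = df.groupby([segment_col, arm_col])[metric]
    means = grouped.mean().unstack(arm_col)   # rows: segments, columns: arms
    counts = grouped.size().unstack(arm_col)
    out = pd.DataFrame(
        {
            "n_a": counts["A"],
            "n_b": counts["B"],
            "mean_a": means["A"],
            "mean_b": means["B"],
        }
    )
    # Relative lift: how much higher the variant mean is than the control mean.
    out["rel_lift"] = out["mean_b"] / out["mean_a"] - 1.0
    return out
```

Small segments produce noisy lifts, so it is worth reading `n_a` and `n_b` next to `rel_lift` before acting on any single row.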

Real A/B experiments are often underpowered. The senior DS skill is making good decisions under uncertainty — not waiting for p < 0.05 to feel certain.

Architecture

MovieLens 25M
      │
      ▼
GCS bucket (raw CSVs)
      │
      ▼
BigQuery raw layer (4 tables)
      │
      ▼
BigQuery staging (cohort, assignment, daily metrics)
      │
      ▼
BigQuery analytics (star schema: 3 dims + 2 facts)
      │
      ▼
Cloud Run job (Python: freq + CUPED + Bayes + segments)
      │
      ▼
analytics.experiment_results
      │
      ▼
Looker Studio dashboard (public)

Components

Component	Purpose	Tech
Synthetic experiment build	Real MovieLens behavior as control; injects a known counterfactual treatment effect (lift + novelty + heterogeneity + guardrail regression) into the variant arm	BigQuery SQL
Star schema	`dim_users`, `dim_movies`, `dim_experiments`, `fact_assignments`, `fact_daily_metrics`	BigQuery
Analysis library	Welch's t-test, univariate + multivariate CUPED, closed-form Bayesian, segment-level HTE	Python (scipy, pandas, numpy)
Cloud Run job	Containerized pipeline runs end-to-end in ~10s, writes structured results to BigQuery	Docker, Artifact Registry, Cloud Run
Test suite	14 tests covering math correctness vs scipy, CUPED variance reduction, Bayesian probability recovery	pytest
Dashboard	Single-page exec dashboard with traffic-light decision logic, segment breakdowns, methodology table	Looker Studio

Technical decisions worth calling out

Closed-form Bayesian instead of MCMC. For Normal-Normal A/B tests with weak priors, the posterior over the difference in means is well approximated by a Student-t distribution. The closed-form implementation runs in 5ms and agrees with PyMC sampling to within Monte Carlo error (the PyMC run took 23 minutes during development). The PyMC version is kept as bayesian_mcmc.py for reference. This is the same approach Microsoft ExP and Booking use for normal metrics.
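
A minimal sketch of the closed-form idea, assuming flat priors on each arm's mean and variance. It is not the code in bayesian.py: `prob_b_beats_a` is a hypothetical name, and the difference of two independent Student-t posteriors is only approximately Student-t, so the sketch uses Welch-Satterthwaite degrees of freedom as a stated approximation.

```python
import numpy as np
from scipy import stats


def prob_b_beats_a(a, b, cred=0.95):
    """Approximate posterior summary for mean(b) - mean(a) under flat priors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each arm mean.
    var_a = a.var(ddof=1) / a.size
    var_b = b.var(ddof=1) / b.size
    diff = b.mean() - a.mean()
    se = np.sqrt(var_a + var_b)
    # Welch-Satterthwaite degrees of freedom (an approximation, see lead-in).
    df = (var_a + var_b) ** 2 / (
        var_a**2 / (a.size - 1) + var_b**2 / (b.size - 1)
    )
    posterior = stats.t(df=df, loc=diff, scale=se)
    lo, hi = posterior.interval(cred)
    return {
        "diff": diff,
        "p_b_gt_a": posterior.sf(0.0),  # posterior mass above zero
        "ci": (lo, hi),                 # central credible interval
    }
```

Under these assumptions P(B>A) is one minus the one-sided Welch p-value, so the Bayesian and frequentist outputs rest on the same sufficient statistics and differ mainly in how the result is read.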

Multivariate CUPED. Univariate CUPED with pre_watch_time_min achieved only 6% variance reduction (ρ=0.25, so ρ² ≈ 0.06) because the cohort filter compresses the covariate range. Multivariate CUPED with four pre-period covariates achieves 14% reduction (multiple-R=0.38, so R² ≈ 0.14). This is more interesting than the textbook "40% reduction" because it reflects what real production CUPED looks like when covariates are weak. Investigating why CUPED's reduction is modest, and reporting it honestly, is part of the project's value.
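
As a rough illustration of the adjustment, the sketch below fits one pooled least-squares coefficient vector on centered pre-period covariates and subtracts the fitted part from the outcome. The names `cuped_adjust` and `variance_reduction` are hypothetical and not taken from cuped.py; the pooled (both arms together) fit is an assumption of this sketch.

```python
import numpy as np


def cuped_adjust(y, X):
    """Return (y_adj, theta): outcome adjusted by a pooled linear fit on centered X."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]          # univariate CUPED is the one-column case
    Xc = X - X.mean(axis=0)     # centering keeps mean(y_adj) equal to mean(y)
    theta, *_ = np.linalg.lstsq(Xc, y - y.mean(), rcond=None)
    return y - Xc @ theta, theta


def variance_reduction(y, y_adj):
    """Share of variance removed by the adjustment (equals R^2 of the pooled fit)."""
    return 1.0 - np.var(y_adj) / np.var(y)
```

Because the covariates are measured before assignment, the adjustment leaves the expected difference between arms unchanged and only shrinks its variance, by roughly the R² of the fit.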

Stratified randomization on activity decile. Hash-based bucketing within NTILE(10) OVER (ORDER BY pre_n_ratings) to ensure pre-period activity is balanced across arms. SRM passes (chi-square = 0.205).
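
The repo does this step in BigQuery SQL. The Python sketch below shows one way to get the same properties (deterministic assignment, balance within each stratum, then an SRM check); it may differ from the actual SQL. The function names, the `salt` value, and the rank-and-alternate rule are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

from scipy import stats


def _hash_key(user_id, salt):
    """Stable pseudo-random sort key for a user."""
    return hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()


def stratified_assign(users, salt="exp-001"):
    """users: iterable of (user_id, stratum) pairs. Returns {user_id: 'A' or 'B'}."""
    by_stratum = defaultdict(list)
    for user_id, stratum in users:
        by_stratum[stratum].append(user_id)
    arms = {}
    for ids in by_stratum.values():
        # Order users by hash within the stratum, then alternate arms,
        # so each stratum is split evenly (to within one user).
        ids.sort(key=lambda uid: _hash_key(uid, salt))
        for rank, uid in enumerate(ids):
            arms[uid] = "A" if rank % 2 == 0 else "B"
    return arms


def srm_check(n_a, n_b, share_a=0.5):
    """Chi-square goodness-of-fit test of observed arm sizes against the planned split."""
    total = n_a + n_b
    expected = [total * share_a, total * (1.0 - share_a)]
    chi2, p_value = stats.chisquare([n_a, n_b], f_exp=expected)
    return float(chi2), float(p_value)
```

A small chi-square statistic with a large p-value is consistent with the planned split; a very small p-value suggests a sample ratio mismatch worth investigating before reading any lift numbers.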

Winsorization at p99 (pooled). Watch-time has a heavy tail. Capping at the pooled p99 prevents single users from dominating arm averages without biasing one arm relative to the other.
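
A short sketch of pooled winsorization, assuming per-user totals are already in two arrays. `winsorize_pooled` is a hypothetical name; the key point is that the cap is computed once on the combined data and applied to both arms.

```python
import numpy as np


def winsorize_pooled(control, variant, pct=99.0):
    """Cap both arms at the same upper percentile of the pooled distribution."""
    control = np.asarray(control, dtype=float)
    variant = np.asarray(variant, dtype=float)
    # One cap from the pooled data, so neither arm gets its own threshold.
    cap = np.percentile(np.concatenate([control, variant]), pct)
    return np.minimum(control, cap), np.minimum(variant, cap), cap
```

Computing a separate p99 per arm would clip the variant less if the treatment lengthens its tail, which would bias the comparison; the shared cap avoids that.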

Intent-to-treat analysis. Per-user totals include zero-activity users — every assigned user contributes one number to the analysis, including zero. This matches how production experimentation platforms handle exposure.
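
In pandas terms, intent-to-treat amounts to a left join from assignments to activity, with missing activity filled as zero. The sketch below assumes hypothetical frames and column names (`user_id`, `arm`, `watch_time_min`); the repo's actual tables may be shaped differently.

```python
import pandas as pd


def itt_user_totals(assignments, daily_metrics, metric="watch_time_min"):
    """One row per assigned user; users with no activity get a total of zero."""
    totals = daily_metrics.groupby("user_id", as_index=False)[metric].sum()
    # Left join keeps every assigned user, active or not.
    out = assignments.merge(totals, on="user_id", how="left")
    out[metric] = out[metric].fillna(0.0)
    return out
```

An inner join here would silently drop inactive users and turn the comparison into an analysis of active users only, which is a different and usually more flattering question.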

See DECISIONS.md for fuller treatment of these and other tradeoffs.

How to reproduce

You'll need a GCP project with billing enabled and the bigquery, storage, run, and artifactregistry APIs enabled.

# 1. Setup
git clone https://github.com/<you>/experimentiq.git && cd experimentiq
python -m venv .venv && source .venv/bin/activate  # Windows: .venv\Scripts\Activate.ps1
pip install -e .
pip install -r requirements.txt

# 2. Build the warehouse
python scripts/01_download_movielens.py
python scripts/02_upload_to_gcs.py
python scripts/03_load_to_bigquery.py
python scripts/04_build_experiment.py
python scripts/05_build_star_schema.py

# 3. Run the analysis (writes to analytics.experiment_results)
python scripts/06_run_analysis.py

# 4. Or deploy to Cloud Run
# (assumes IMAGE, SA_EMAIL, PROJECT_ID, and BUCKET are already set in your shell)
docker build -t "$IMAGE" .
docker push "$IMAGE"
gcloud run jobs create experimentiq-analysis --image $IMAGE --region us-central1 \
  --service-account $SA_EMAIL \
  --set-env-vars "GCP_PROJECT_ID=$PROJECT_ID" \
  --set-env-vars "GCS_BUCKET=$BUCKET" \
  --set-env-vars "BQ_LOCATION=US"
gcloud run jobs execute experimentiq-analysis --region us-central1 --wait

# 5. Run tests
pytest tests -v

Project structure

experimentiq/
├── src/
│   └── analysis/
│       ├── data.py              # BigQuery loader, ExperimentData container
│       ├── frequentist.py       # Welch's t-test + CIs + power
│       ├── cuped.py             # univariate + multivariate CUPED
│       ├── bayesian.py          # closed-form Bayesian (production)
│       ├── bayesian_mcmc.py     # PyMC reference implementation
│       └── segments.py          # heterogeneous treatment effects
├── scripts/
│   ├── 01_download_movielens.py
│   ├── 02_upload_to_gcs.py
│   ├── 03_load_to_bigquery.py
│   ├── 04_build_experiment.py
│   ├── 05_build_star_schema.py
│   ├── 06_run_analysis.py
│   ├── main.py                  # Cloud Run entrypoint
│   └── _run_analysis_entry.py   # importable wrapper
├── tests/                       # 14 tests, runs in <2 seconds
├── Dockerfile
├── requirements-prod.txt
├── pyproject.toml
├── DECISIONS.md
└── README.md

Caveats and what I'd do next

This experiment is underpowered by design. With ~2,400 users and heavy-tailed engagement, the frequentist test fails to reject the null at α=0.05 even though the true effect is real. Bayesian and segment analyses do the actual decision-making work. On a platform at YouTube's scale, the same pipeline running on millions of users would likely produce a tight CI and a clearly significant result.

CUPED's modest variance reduction (14%) is structural to the cohort. Cohort filtering compresses the range of pre-period covariates, suppressing correlation. Extensions worth exploring: shorter pre-period windows (7-14 days vs 90), per-segment CUPED (high-activity users have ρ>0.4 individually), or learned residual models.

The chance imbalance is the project's value. A naive frequentist analysis on this data would say "fail to reject null, don't ship." A naive aggregate-only Bayesian would say "P(B>A)=0.91, ship." The pipeline does neither — it surfaces the heterogeneity, flags the guardrail uncertainty, and produces a segmented recommendation (ship to light users, hold for heavy). That's the actual work.

Built with

Python 3.12 · GCP (BigQuery, Cloud Storage, Cloud Run, Artifact Registry) · Docker · scipy · pandas · numpy · pymc · pytest · Looker Studio
