diff --git a/README.md b/README.md index 17a8229..67b1872 100644 --- a/README.md +++ b/README.md @@ -4,17 +4,18 @@ Plugins that give AI agents Mixpanel expertise. Built on the [Agent Skills](http ## Skills -| Skill | Description | -|---|---| -| [`create-dashboard`](plugins/mixpanel-mcp/skills/create-dashboard/) | Creates a well-designed Mixpanel dashboard with validated data, text cards, and narrative layout. | -| [`deep-research`](plugins/mixpanel-mcp/skills/deep-research/) | Conducts a structured metric investigation in Mixpanel. Use when a user asks *why* a metric changed, what's driving a trend, or requests a deep dive or root cause analysis. | -| [`manage-lexicon`](plugins/mixpanel-mcp/skills/manage-lexicon/) | Audits, scores, enriches, and cleans up Lexicon metadata (events and properties) for a Mixpanel project. Supports scoring health, bulk-filling descriptions/tags, resetting metadata, triaging data quality issues, and managing tags. | -| [`tracking-implementation`](plugins/mixpanel-mcp/skills/tracking-implementation/) | Guides an agent through Mixpanel analytics implementation. Supports Quick Start, Full Implementation, Add Tracking, and Audit modes. | +| Skill | Description | +| --------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| [`create-dashboard`](plugins/mixpanel-mcp/skills/create-dashboard/) | Creates a well-designed Mixpanel dashboard with validated data, text cards, and narrative layout. | +| [`deep-research`](plugins/mixpanel-mcp/skills/deep-research/) | Conducts a structured metric investigation in Mixpanel. 
Use when a user asks _why_ a metric changed, what's driving a trend, or requests a deep dive or root cause analysis. | +| [`interpret-experiment`](plugins/mixpanel-mcp/skills/interpret-experiment/) | Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, make a ship/iterate/kill/wait call, asks why statsig hasn't been reached, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the verdicts that `Get-Experiment` returns — never recomputes thresholds. | +| [`manage-lexicon`](plugins/mixpanel-mcp/skills/manage-lexicon/) | Audits, scores, enriches, and cleans up Lexicon metadata (events and properties) for a Mixpanel project. Supports scoring health, bulk-filling descriptions/tags, resetting metadata, triaging data quality issues, and managing tags. | +| [`tracking-implementation`](plugins/mixpanel-mcp/skills/tracking-implementation/) | Guides an agent through Mixpanel analytics implementation. Supports Quick Start, Full Implementation, Add Tracking, and Audit modes. | ### Internal skills -| Skill | Description | -|---|---| +| Skill | Description | +| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | [`review-skill`](.claude/skills/review-skill/) | Reviews a skill against a weighted quality rubric (8 dimensions, 27 checks) and produces a score with actionable issues. Run `/review-skill ` before requesting a code review. | ## Getting Started @@ -30,21 +31,23 @@ claude plugin marketplace add mixpanel/ai-plugins 2. 
Install the plugin for your region: **US** + ```bash claude plugin install mixpanel-mcp ``` **EU** + ```bash claude plugin install mixpanel-mcp-eu ``` **India** + ```bash claude plugin install mixpanel-mcp-in ``` - ### Cursor Install the plugin from the Cursor marketplace, or have a team admin import this GitHub repository as a team marketplace (Dashboard → Settings → Plugins → Import). diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md new file mode 100644 index 0000000..c370fc0 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md @@ -0,0 +1,129 @@ +--- +name: interpret-experiment +description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `design-experiment` skill. +license: Apache-2.0 +--- + +# Interpret Experiment + +You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a verdict field is missing, say so — do not synthesize one from raw values. + +--- + +# Glossary + +Concepts the rest of this skill uses without redefining. + +- **Variant.** One arm of the experiment. The variant treated as the baseline is the **control**; the others are **treatments**. The platform marks which key is the control. 
+- **Primary / Guardrail / Secondary metric.** + - **Primary** — drives the ship decision. The platform applies multiple-testing correction across primaries when configured. + - **Guardrail** — a metric that must not regress; a guardrail loss vetoes a ship even when primaries win. + - **Secondary** — exploratory only, never decisional, no correction applied. +- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict. +- **Polarity.** Whether a movement is _good for the business_. Combines sign of lift with the metric's `direction` ("up" = bigger is better; "down" = smaller is better). See the **Polarity recipe** in Components. +- **Significance.** The platform's per-row classification: `YES_POSITIVE`, `YES_NEGATIVE`, or `NO`. Read from the response — do not recompute. +- **SRM (Sample Ratio Mismatch).** Variants received traffic in proportions that disagree with the configured split. **Kohavi's #1 trustworthiness check** — when SRM fails, downstream lift, p-values, and CIs cannot be trusted. +- **Retro A/A (pre-experiment bias).** Re-runs the comparison on the pre-exposure period. A failure means cohorts already differed before treatment started. +- **Twyman's Law.** "Any unusually clean or unusually large result is more likely a bug than a discovery." Apply on lifts > ~30% — usually a changed-denominator artifact. +- **CUPED.** Variance reduction using pre-exposure baseline. Cuts required sample 30–70% when it applies. Inert on new-user-only cohorts. +- **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95. +- **MDE (Minimum Detectable Effect).** The smallest lift the experiment was sized to detect. Set during experiment setup. +- **Trustworthiness gate.** The pre-flight check that runs before any results interpretation: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. 
Failing any of these means **do not interpret results yet** — route to the health-check reference. + +--- + +# Components + +The pieces every interpretation uses. Defined here once so they don't drift across the steps and references. + +## Polarity recipe (load-bearing — apply on every metric row) + +The platform's summary buckets (`positive` / `negative` / `no`) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion. + +Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"): + +- `lift is None` or `lift == 0` → **neutral** (no measurement / no effect respectively). +- `direction == "up"` → **positive** if `lift > 0`, else **negative**. +- `direction == "down"` → **positive** if `lift < 0`, else **negative**. + +A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control. + +The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**. + +## Data-source fallback + +Experiment-details has two parallel data paths — live (preferred) and cached. Always prefer live; if live computation failed, fall back to cache with a staleness caveat; if **both** are empty, say "no result was computed" and recommend a re-sync. **Never** silently treat missing data as "no effect." + +## Verdict table + +| Situation | Recommendation | +| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Conclude the experiment via its decide lifecycle action, naming the winning variant and a rationale message. 
**Confirm with the user first — concluding is irreversible.** | +| Trust ✓, primary polarity positive, guardrail polarity negative | **ITERATE.** Investigate the regression; do not auto-ship. | +| Trust ✓, primary polarity neutral after target sample reached | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md). | +| Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)). | +| Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). | + +For multi-variant tests, special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), and the exact decide-call shape, see [references/lifecycle-handoff.md](references/lifecycle-handoff.md). + +--- + +# Steps + +Top-down: what to do, in order. + +## 1. Fetch the experiment + +If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs. Accept the experiment by name or by ID; try ID match first, then case-insensitive name match. + +Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments. + +Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret. + +## 2. Run the trustworthiness gate (the Decision Tree) + +Run steps 2a–2e in order. **Stop at the first failure** — do not proceed if a step flags a problem. The platform attaches verdict fields for each check; consume those verdicts rather than recomputing. + +### 2a. Trustworthiness + +SRM ok? Retro A/A clean? 
Exposures sufficient? Minimum duration met (~3 days)? No misconfiguration? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md). The Misconfigurations section in that reference covers the warning-level signals (multiple-testing off, extreme winsorization, CUPED on new-users-only, etc.). + +### 2b. Statistical significance + +Apply the **polarity recipe** from Components to each non-control variant × primary metric. If nothing is significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md). For translating a single metric's lift / CI / p-value into a phrase, see [references/per-metric-interpretation.md](references/per-metric-interpretation.md). + +### 2c. Guardrail check + +Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a row in `summary.positive` for a `direction: "down"` guardrail is still a regression. + +### 2d. Practical significance + +Convert lift into absolute terms — multiply by the control baseline. Statistically significant ≠ ships. The per-metric reference covers the baseline-fetch fallback when `value` or `sampleSize` is missing, and the **Twyman's Law** check for any lift > ~30%. + +### 2e. Verdict + +Look up the situation in the **Verdict table** in Components. If the recommendation is SHIP or KILL, surface the proposed decide-action parameters and **wait for explicit user confirmation** before executing — concluding an experiment is irreversible. + +## 3. 
Going deeper (open references on demand) + +| User asks about… | Open | +| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | +| SRM failing, Retro A/A failing, exposures insufficient, or any trustworthiness fail | [references/health-check-interpretation.md](references/health-check-interpretation.md) | +| "Translate this lift / CI / p-value into English" | [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | +| "Why hasn't this hit statsig yet? Should we wait or stop?" | [references/why-no-statsig.md](references/why-no-statsig.md) | +| "Which segments should I break this down on?" | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | +| "What does this segment-by-segment result mean?" | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | +| "Can session replays help explain this result?" | [references/session-replay-analysis.md](references/session-replay-analysis.md) | +| "How do I actually conclude this experiment? Multi-variant ship?" | [references/lifecycle-handoff.md](references/lifecycle-handoff.md) | + +## 4. Output + +Default to this shape unless the user asks for something else: + +1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`. +2. **Why**, walking through the trustworthiness-gate steps that mattered (skip steps that were clearly fine). +3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, each polarity-corrected. Include absolute-impact translation for any win. +4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, stale-cache caveat, etc. +5. 
**Suggested next action** — for SHIP / KILL, the proposed decide-action parameters **gated on user confirmation**; for ITERATE / WAIT, the investigation to run next. + +If experiment details are unavailable or return errors, say so — do not invent a verdict. diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md new file mode 100644 index 0000000..1467468 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md @@ -0,0 +1,176 @@ +# Health-Check Interpretation + +Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action. + +--- + +## Kohavi framing — always cite when a health check fails + +> **Sample Ratio Mismatch is the #1 trustworthiness check (Kohavi).** When SRM is failing, do not trust the experiment's lift, p-values, or confidence intervals — the randomization assumption is broken, so the measured effect cannot be attributed to the treatment. +> +> **Twyman's Law**: any unusually clean or unusually large result is more likely a bug than a discovery. A spectacular lift on a failing-SRM experiment is not evidence of a great treatment; it's evidence the bucketing is broken. + +These two principles drive the recommendations below. Lead with them when explaining a failing check to the user. + +--- + +## 1. SRM (Sample Ratio Mismatch) + +**What the platform tells you**: the SRM verdict the experiment-details response carries (live, or cached when live isn't available). The platform tags failing SRMs already — consume the verdict, do not compute chi-square yourself. + +### What it means + +Users were assigned to variants in proportions that disagree with the configured target allocation. The disagreement is too large to be chance. 
Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness. + +### Likely causes, ordered most → least likely + +(Surface in this order — investigate the most probable first.) + +1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees. +2. **biased_assignment** — The assignment criterion correlates with the variant — e.g. assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window. +3. **bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation. +4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the exposure event fires exactly once per user per variant assignment. +5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period. + +### Recommended actions + +- **pause_and_investigate** — Pause the experiment before drawing any conclusions. SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable. +- **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric. +- **investigate_exposure_logging** — Compare exposure event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs. 
+- **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split. + +### Investigation checklist + +1. Compare the actual per-variant exposure ratio to the configured target allocation — which variant is over/under-represented? +2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history. +3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math. +4. Verify SDK version and bucketing logic. Query the exposure event grouped by variant to confirm exposure events are flowing correctly. +5. Check for bot/QA traffic — bots often skew toward control. If QA traffic isn't being excluded, recommend enabling that filter. +6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting. +7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.** + +--- + +## 2. Retro A/A (pre-experiment bias) failure + +**What the platform tells you**: the pre-experiment-bias analysis the platform attaches when that check is enabled in the experiment's settings. + +### What it means + +The same statistical comparison run on the **pre-exposure** period revealed that variant cohorts already differed _before_ the treatment started. Any "lift" measured during the experiment may just be reflecting that pre-existing gap, not the change. + +- Pre-experiment bias on a **primary** metric is a **stop-and-investigate** signal. +- Pre-experiment bias on a **secondary** metric is informational only. + +### Investigation checklist + +1. Identify which metric × variant pair triggered the failure (after the platform's correction). +2. 
Check whether bucketing was deterministic — non-deterministic assignment in the pre-period means users were assigned to different variants than they would have been in production. +3. Look for cohort skew: did one variant disproportionately receive heavy users? Query the metric pre-experiment grouped by variant to confirm. +4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort. +5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing. + +--- + +## 3. Insufficient exposures + +**What the platform tells you**: per-variant exposure counts plus an "insufficient" flag when the count is too low to trust. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue. + +### Investigation checklist + +1. Check per-variant exposure totals — which variant is undersampled? +2. Inspect feature-flag rollout — was rollout dialed back? +3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?). +4. If the experiment is still ACTIVE: extend duration via an experiment update with a new end target. +5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math. + +If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question. + +--- + +## 4. Frequentist peeking + +**What to check**: the experiment's testing model and whether it ended before reaching its configured end condition (sample size or duration, whichever was configured). + +### What it means + +A frequentist test that ends before reaching its configured target has an **inflated false-positive rate**. 
The math assumes a fixed sample size; peeking before that point and stopping on a favorable look is exactly what "p-hacking" looks like in production. + +### Investigation checklist + +1. Confirm the testing model is frequentist (sequential tests don't have this problem). +2. Compare the actual end date against the planned end (date- or sample-based, whichever the experiment was configured with). +3. If the conclusion was premature: results have inflated false-positive rate. Recommend a re-run. +4. If the user wants to keep current results: caveat strongly. Recommend a sequential testing model for the next experiment so they can stop early without penalty. + +(Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.) + +--- + +## 5. Live computation timeout / broken data + +**What the platform tells you**: a non-null error block on the live results, with the live data path empty. + +### Investigation checklist + +1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy. +2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. +3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision. +4. If the cache is recent (within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or empty, the user must resolve the backend issue before any meaningful interpretation. + +--- + +## 6. Experiment ran < 3 days + +**What to compute (this one is local)**: the elapsed time between the experiment's start and end. + +Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. 
**Refuse to interpret.** Tell the user explicitly: + +> _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_ + +If the experiment was sample-size-bounded and a tiny target was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window. + +--- + +## 7. Misconfigurations + +These don't always invalidate results, but they change how to _read_ them. Surface them as warnings during the trustworthiness gate. + +### Multiple-testing correction off with several primaries + +**Correction off AND 2+ primaries × 1+ non-control variants.** Any single significant primary may be a false positive — family-wise error rate scales multiplicatively (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). Look at primaries in aggregate: if most point the same direction, the effect is likely real; if only one or two of many are significant, recommend enabling Benjamini-Hochberg or Bonferroni and re-analyzing. + +### Extreme winsorization percentile + +**Winsorization enabled with a percentile far from the platform default (typically 95).** A percentile near 50 caps almost all data — almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason. + +### SRM check disabled + +**SRM check is off.** Often deliberate — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) and then recommend re-enabling SRM and re-analyzing. 
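
The ~54% figure quoted above for uncorrected primaries can be reproduced with a short calculation. This is a minimal sketch, assuming the comparisons are independent (correlated metrics would give a lower rate); the function name `family_wise_error_rate` is illustrative and is not part of the platform or the skill.

```python
def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across independent tests.

    num_tests is the number of primaries multiplied by the number of
    non-control variants. Independence is an assumption of this sketch.
    """
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    # Each test avoids a false positive with probability (1 - alpha).
    return 1.0 - (1.0 - alpha) ** num_tests


# 15 primaries x 1 non-control variant at alpha = 0.05
print(f"{family_wise_error_rate(15, 0.05):.1%}")  # prints 53.7%
```

This is only for explaining the warning to a user. It does not replace the platform's own Bonferroni or Benjamini-Hochberg correction.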
+ +### CUPED on new-users-only cohort + +**CUPED enabled AND the cohort is "new users only".** CUPED needs pre-exposure data, so it had no effect here — but **results are still valid**, variance reduction just didn't happen. Mention as informational. For future experiments on this surface, suggest extending the cohort to include returning users so CUPED can apply. + +### Non-default confidence level + +**Confidence level differs from the platform default (typically 0.95).** `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out in the verdict and combine with metric count to estimate the family-wise error rate. + +### Broken or placeholder metric entries + +**Metric entries with empty names.** Likely broken or placeholder references. Flag and skip during analysis. + +### Primary metric with no computed result + +**A metric is listed as primary but has no result (live or cached).** This is **"no measurement," not "no effect."** Surface prominently; recommend re-syncing results before any conclusion that depends on this primary. + +--- + +## Output shape when a health check fails + +1. **What failed**, in one sentence (use the verdict the platform attached — do not re-derive). +2. **What that means for trust** — cite the Kohavi framing (SRM is #1) or Twyman's Law where it fits. +3. **Likely causes**, ordered most → least probable. +4. **Recommended action** from the small set above. +5. **Investigation checklist** the user can run. +6. **What NOT to do** — usually, "do not act on the current lift / p-value numbers." 
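
The minimum-duration check in section 6 of the health-check reference is the one check the agent computes locally. A minimal sketch follows, assuming the start and end are available as timezone-aware `datetime` values; the names `MIN_WINDOW` and `ran_long_enough` are hypothetical, and the 3-day floor is the approximate value the reference states.

```python
from datetime import datetime, timedelta, timezone

# Approximate floor taken from the reference ("ran < 3 days").
MIN_WINDOW = timedelta(days=3)


def ran_long_enough(start: datetime, end: datetime,
                    minimum: timedelta = MIN_WINDOW) -> bool:
    """Return True when the experiment window meets the minimum duration."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    return end - start >= minimum


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(ran_long_enough(start, datetime(2024, 1, 3, 12, tzinfo=timezone.utc)))  # False
print(ran_long_enough(start, datetime(2024, 1, 4, tzinfo=timezone.utc)))      # True
```

For an experiment that is still running, one reasonable choice (an assumption, not something the reference specifies) is to pass the current time as `end`.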
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md new file mode 100644 index 0000000..3a9e24c --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md @@ -0,0 +1,39 @@ +# Lifecycle Hand-off + +How to conclude an experiment once the verdict is settled. This reference is **interpretation guidance** — the per-field schema of the decide action lives in the experiment-update tool description. + +--- + +## Confirm before concluding — always + +Concluding an experiment is **irreversible**. Before invoking the decide action, surface the proposed parameters to the user (winning variant, success/fail, rationale message) and wait for explicit confirmation. A SHIP verdict is a recommendation, not an authorization. + +## The three pieces every decide call needs + +A decide call expresses three things: + +1. **Did the experiment succeed?** A win for one of the treatments, or a deliberate stop. +2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special constants below. +3. **Why?** A rationale message — what metrics were evaluated, the polarity reading, the tradeoffs accepted. The platform requires this on every decide call; treat it as a one-paragraph decision record, not a placeholder. + +## Special variant choices for success + +When you have a winning result but no single variant to ship: + +- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. (The platform exposes this as the constant `__no_variant_shipped__`.) +- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The platform exposes this as `__defer_variant_decision__` and shows `SUCCESS_DEFERRED` in the UI.) 
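
The three pieces and the two special constants above can be checked before the parameters are shown to the user for confirmation. The sketch below is illustrative only: `DecideDraft`, its field names, and `validate_decide_draft` are hypothetical and do not reflect the real decide-action schema, which lives in the experiment-update tool description. Only the two constant strings come from this reference.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# Constants named in this reference.
NO_VARIANT_SHIPPED = "__no_variant_shipped__"
DEFER_VARIANT_DECISION = "__defer_variant_decision__"


@dataclass(frozen=True)
class DecideDraft:
    """Hypothetical container for the three pieces of a decide call."""
    success: bool
    message: str
    variant: Optional[str] = None


def validate_decide_draft(draft: DecideDraft, variant_keys: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means the draft is coherent."""
    problems: List[str] = []
    # A rationale message is required on every decide call.
    if not draft.message.strip():
        problems.append("rationale message is required")
    # A variant is required only when the experiment succeeded.
    if draft.success:
        allowed = variant_keys | {NO_VARIANT_SHIPPED, DEFER_VARIANT_DECISION}
        if draft.variant is None:
            problems.append("variant is required when success is true")
        elif draft.variant not in allowed:
            problems.append(f"unknown variant: {draft.variant!r}")
    return problems
```

A passing validation is not an authorization. The draft still has to be surfaced to the user, and the decide action runs only after explicit confirmation.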
+ +When the verdict is KILL — no winner — record success as false. No variant key is needed in that case. + +## Multi-variant experiments + +For a 3+ arm test, the decide action still names a single winning variant. If two treatments are roughly tied: + +- If both clear the practical-significance bar and shipping either is acceptable, pick on simplicity (smaller diff from control, lower implementation cost). +- If the team genuinely cannot pick, use the defer constant above — better than fabricating a winner. + +A multi-variant test where only one treatment is significantly different from control is a clean SHIP for that variant; the inconclusive arms are simply not the winner. + +## After concluding + +The decision record — the rationale message, the shipped variant, and the experiment's terminal status — becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer. diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md new file mode 100644 index 0000000..e46381c --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md @@ -0,0 +1,167 @@ +# Per-Metric Interpretation + +Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_ + +--- + +## The mental model + +Each row in `summary.positive` / `summary.negative` / `summary.no` answers four questions: + +1. **Did the lift go up or down?** — the `summary` bucket name (sign-of-lift, not polarity). +2. **Was the change distinguishable from noise?** — the `significance` field (or the bucket name itself: rows in `summary.positive` / `summary.negative` are significant, rows in `summary.no` are not). +3. 
**Was the change in the goal direction?** — apply the polarity recipe with `metric.direction`. +4. **Was the change big enough to matter?** — multiply `lift` by the control baseline `value` to get absolute impact, then judge against business context. + +A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any one of those and you're shipping the wrong thing. + +--- + +## Polarity recipe + +Apply the polarity recipe from the spine — see the **Components** section of `SKILL.md`. Treat the bucket name in `summary.positive` / `summary.negative` as sign-of-lift only; the business verdict comes from combining it with `metric.direction`. Examples worth remembering: + +- A row in `summary.positive` with `direction: "down"` is a **regression**. +- A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1% interstitials_shown` lift means less interruption). + +--- + +## Reading the p-value in this platform + +Mixpanel runs a frequentist comparison at the experiment's configured confidence level — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). + +The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread. + +For the general meaning of a p-value (the probability under the null), trust the model's baseline knowledge — don't invent thresholds in either direction. + +--- + +## Reading the lift correctly + +``` +lift = (treatment_mean - control_mean) / control_mean +``` + +- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct. 
+- If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect." + +--- + +## Verdict phrasing — a small palette + +Pick the phrase that matches the four-question pattern. These are the words to use with users; they map onto the platform's already-computed numbers, so the agent never has to invent thresholds. + +| Pattern (sig × polarity × magnitude) | Plain-language verdict | +| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Significant, polarity positive, magnitude large vs baseline | "**Clear win** — `` moved `` in the goal direction, which is meaningful at this baseline." (apply Twyman's Law if lift > ~30%) | +| Significant, polarity positive, magnitude small vs baseline | "**Statistically significant but practically small** — `` on a `` baseline is ``; confirm with the user whether that clears the business bar." | +| Significant, polarity negative | "**Regression** — `` moved `` against its goal direction. This is a reason not to ship even if other primaries won." | +| Not significant, lift in goal direction, well-powered | "**Likely no effect at the detectable size.** The experiment had enough power to detect ``; the observed lift is below that threshold." | +| Not significant, lift in goal direction, underpowered | "**Inconclusive — too underpowered to call.** Route to the why-no-statsig playbook to decide between wait / extend / restart." | +| Not significant, lift in wrong direction | "**No detectable harm**, but no win either." | +| `lift is None` | "**No measurement** — this variant's row failed to compute. Surface the failure and re-sync." | +| Lift > ~30% on any metric | Prefix with "**Twyman's Law check:** that lift is unusually large; verify the denominator hasn't changed before celebrating." 
| + +--- + +## Magnitude — make it absolute + +Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful: + +1. Baseline from the control variant's metric value (the experiment-details response carries it on the per-variant row). +2. Lift from the winning row. +3. Absolute lift: `baseline × lift`. Examples: + - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate. + - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`. +4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week." + +### Fallback when the baseline value or sample size is missing + +Common — happens whenever live computation timed out or the cached results were nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.** + +Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation: + +- `unique` (Bernoulli) → conversion **rate** as the baseline. +- `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. Multiplying lift by a raw total double-counts cohort size. + +--- + +## Twyman's Law in practice — changed-denominator lifts + +Before celebrating any lift > ~30%, ask: **did the treatment change who is _exposed_ to this metric, not just how they behave?** + +If the treatment causes more users to _see_ a screen, more events naturally fire — the metric grows because the denominator changed, not because per-user behavior changed. + +- A "Free item" promotion drives more users to checkout → "Checkout Screen Viewed" lifts +1000% mechanically. 
The interesting question is **conversion rate on the screen**, not raw views. +- A new banner makes a feature discoverable → "Feature Page Viewed" lifts dramatically. **Per-discover-er behavior** may be unchanged. + +When you see a > 30% lift, name the risk explicitly: + +> _"This metric measures exposure to the screen/event. The treatment likely caused more users to be exposed; that explains most of the lift mechanically. The interesting question is what those users did once they got there."_ + +--- + +## Metric distribution types + +Different metric types behave differently; cite the relevant nuance in your verdict. + +| Metric type | Distribution | Interpretation nuance | +| -------------------------------- | ------------ | --------------------------------------------------------------------------------------------------------- | +| Unique users / conversion rate | Bernoulli | Variance = `p(1−p)`. Lift on rates near 50% is most powered; rates near 0% or 100% need much more sample. | +| Event counts / sessions per user | Poisson | Variance = mean. Highly sensitive to power users; consider whether one heavy user can swing results. | +| Revenue / numeric properties | Gaussian | Long tails (whales) inflate variance. Strongly consider Winsorization. | + +--- + +## Variance-reduction & outlier settings that change interpretation + +- **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). +- **Winsorization enabled**: extreme values capped at the configured percentile, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. 
A percentile much lower than the platform default (typically 95) is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). + +--- + +## Multiple comparisons & metric tiers — what's decisional and what isn't + +| Tier | How it influences the verdict | +| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Primary** | **Decisional.** The platform auto-applies correction when the experiment is configured for Bonferroni or Benjamini-Hochberg (across primaries × variants). | +| **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude. | +| **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. | + +If multiple-testing correction is off AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled. + +--- + +## When a primary metric is inconclusive + +A "not significant" verdict means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level — **not that there is no effect.** Important when the user is about to call something a null result. + +For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see [why-no-statsig.md](why-no-statsig.md). 
+ +--- + +## Frequentist vs Sequential — what affects per-metric reading + +Concluding a Frequentist experiment before it reaches its configured target is a peeking event — per-metric significance verdicts become unreliable. Sequential experiments are designed for continuous monitoring and don't have this problem. + +For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of [health-check-interpretation.md](health-check-interpretation.md). + +--- + +## Triggered analysis & dilution + +If the change only affects a subset of users (e.g. only triggers when a specific button is shown), the **effect on triggered users** is much larger than the **effect on the full exposed population**. + +- Triggered analysis zooms in on users who actually saw the change. +- Dilution math: `population_lift = triggered_lift × (triggered_users / total_exposed)`. + +The platform doesn't auto-compute triggered analysis. If the change is gated by a condition, ask the user about the trigger rate and walk through the math before declaring the population-level lift "small." + +--- + +## Novelty and primacy + +- **Novelty** — lift is large early, then decays as users habituate. +- **Primacy** — lift is small or negative early, then grows as users learn the new behavior. + +To detect either, look at the line-chart view of the metric (date-segmented). A monotonic decay from day 1 → day 14 is classic novelty; the steady-state lift is what matters for shipping. Call this out when interpreting any experiment shorter than ~2 weeks. 
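The two pieces of arithmetic in this reference, the absolute-impact conversion from the Magnitude section and the dilution formula from the Triggered analysis section, can be sketched as small helpers. This is an illustrative sketch only: the function names are hypothetical, and the inputs are assumed to be numbers already read from the platform's response (nothing here recomputes significance).

```python
def absolute_lift(baseline: float, lift: float) -> float:
    """Absolute change implied by a relative lift: baseline x lift."""
    return baseline * lift


def to_percentage_points(rate_delta: float) -> float:
    """Express a change in a rate (0..1 scale) as percentage points."""
    return rate_delta * 100


def population_lift(triggered_lift: float, triggered_users: int, total_exposed: int) -> float:
    """Dilution math: population_lift = triggered_lift x (triggered_users / total_exposed)."""
    if total_exposed <= 0:
        raise ValueError("total_exposed must be positive")
    return triggered_lift * (triggered_users / total_exposed)


# Examples from the Magnitude section.
conversion_delta = absolute_lift(0.02, 0.04)    # about 0.0008, i.e. about +0.08 percentage points
events_delta = absolute_lift(12.4, -0.05)       # about -0.62 events/user/week

# Hypothetical dilution example: a 20% lift among the 10% of users who triggered the change.
diluted = population_lift(0.20, 1_000, 10_000)  # about 0.02 at the population level
```

For `total` metrics, pass the per-exposure average as `baseline`, not the raw total, for the reason given in the fallback section.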
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md new file mode 100644 index 0000000..98c7bbc --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md @@ -0,0 +1,99 @@ +# Segment-Breakdown Interpretation + +Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. + +--- + +## The mental model + +A segment breakdown asks: _did the treatment affect different user segments differently?_ It has three possible outcomes per segment: + +1. **The segment moved in the same direction as the overall effect**, with similar magnitude → reinforces the overall verdict; nothing new. +2. **The segment moved much more or less than overall**, but in the same direction → heterogeneity; the effect is concentrated in a subset. +3. **The segment moved in the _opposite_ direction** to overall → Simpson's paradox or a real reversal — this is where segment analysis earns its keep. + +Reading a segment breakdown well means recognizing which of those three you're looking at and not mistaking noise for any of them. + +--- + +## Per-segment polarity recipe — apply per row + +The same recipe from the per-metric reference applies _inside_ each segment. Don't take a shortcut. + +- For each segment × metric × non-control variant, look at the row's `lift` and bucket (positive/negative/no). +- Translate sign-of-lift into business polarity using `metric.direction`. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary. +- Filter out the control row in each segment. + +Surprisingly easy to forget when you're scanning a wide table — re-apply polarity per row. 
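The per-row recipe above can be sketched as a small lookup. This is a hedged illustration, not the platform's logic: the bucket names (`positive`, `negative`, `no`) and `direction: "down"` come from the references, while the `"up"` value, the row shape (`variant`, `bucket` keys), and the function names are assumptions made for the example.

```python
from typing import Dict, List


def business_verdict(bucket: str, direction: str) -> str:
    """Translate sign-of-lift (bucket) into business polarity using the metric's goal direction."""
    if bucket == "no":
        return "not significant"
    if bucket not in ("positive", "negative"):
        raise ValueError(f"unexpected bucket: {bucket}")
    if direction not in ("up", "down"):  # "up" is an assumed value
        raise ValueError(f"unexpected direction: {direction}")
    moved_toward_goal = (bucket == "positive") == (direction == "up")
    return "win" if moved_toward_goal else "regression"


def verdicts_for_segment(rows: List[Dict[str, str]], direction: str,
                         control_key: str = "control") -> Dict[str, str]:
    """Apply the recipe to every non-control row in one segment (hypothetical row shape)."""
    return {
        row["variant"]: business_verdict(row["bucket"], direction)
        for row in rows
        if row["variant"] != control_key  # filter out the control row
    }
```

For example, `business_verdict("positive", "down")` yields `"regression"`, matching the reminder that the bucket name is never the business verdict on its own.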
+ +--- + +## Sample-size floor per segment + +Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. The platform surfaces an "insufficient exposures" flag at the overall level — trust that signal over a hand-rolled threshold, and apply the same logic per segment. + +- Segments the platform would flag as insufficient if scoped to them alone → mark "insufficient sample, treat as directional only." +- A "significant" lift on a tiny per-variant segment (e.g. tens of users) is almost always noise. Say so. +- If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice. + +--- + +## Heterogeneity vs Simpson's paradox vs noise + +| What you see | Interpretation | +| --------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Most segments lift positive, one or two negative, all with overlapping CIs | **Noise.** Not heterogeneity. Don't ship a segment-specific story. | +| One segment lifts much more than the rest, with a tight CI and a clear mechanism | **Real heterogeneity.** The change is concentrated in that segment. Consider shipping only to that segment, or revising the hypothesis. | +| Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. | +| Two opposite-direction effects in different segments that roughly cancel overall | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses. 
| + +When you spot Simpson's paradox, route the user to the **SRM** section of [health-check-interpretation.md](health-check-interpretation.md) — bucketing is usually the cause, not a real reversal. + +--- + +## What a "ship only to segment X" recommendation requires + +Don't recommend a segment-scoped ship unless **all** of these hold: + +1. The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it). +2. The segment's per-variant sample clears whatever exposure floor the platform applies to the overall experiment, by a comfortable margin. +3. The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment. +4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product. +5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply. + +Otherwise, the segment-only ship is a post-hoc story dressed up as a decision. Recommend confirming with a follow-up experiment scoped to that segment. + +--- + +## When a segment loses but overall wins + +This is the everyday case of mixed effects. + +- If the losing segment is small and its absolute hit is acceptable, ship to all — but call out the loser in the rationale. +- If the losing segment is large or has a guardrail regression, recommend iterate, not ship. +- If the losing segment is a regulated / strategic cohort (paying tier, top customers, EU), default to iterate — guardrails on the cohort, not just overall. + +--- + +## What NOT to do + +- ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition. +- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment query fallback — they're not corrected unless the platform did it. 
+- ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal. +- ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism. +- ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence). + +--- + +## Output shape + +1. **One-sentence segment-level summary** — homogeneous, heterogeneous, or Simpson's-suspicious. +2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered). +3. **What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's." +4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating). + +--- + +## Platform support status + +Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the rules above. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts. 
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-of-interest-selection.md new file mode 100644 index 0000000..4db49ac --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-of-interest-selection.md @@ -0,0 +1,116 @@ +# Segment-of-Interest Selection + +Pick 3–5 segments **likely to reveal a real effect difference** before slicing every available dimension and ending up p-hacking. + +The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them. + +--- + +## Why this matters: the fishing-expedition problem + +If you slice an experiment by every available property (10 platforms × 20 countries × 5 plan tiers × …), you will find "significant" segment-level effects by chance alone. The family-wise false positive rate explodes the same way it does for too many primary metrics — except there's usually no platform-level correction across segments. **Pre-committing to a small set of segments, ordered by hypothesis-driven probability, is the discipline that makes segment analysis credible.** + +Aim for 3–5 segments, max. If the user wants more, ask which ones are connected to the hypothesis and which are exploration. Mark the exploration set as "hypothesis-generating, not decisional." + +--- + +## The decision tree for picking segments + +Walk through these in order. The first match is the most defensible pick. + +### 1. Segments the hypothesis explicitly names + +If the experiment's `hypothesis` (or `description`) text mentions "new users", "mobile", "Pro tier", "EU customers" — those segments are pre-committed by the experiment design. Always include them. 
+ +Look at: + +- `experiment.hypothesis` +- `experiment.description` +- The setup-side conversation, if present + +These are not exploratory; they're the variables the team committed to test. + +### 2. Segments where the mechanism is expected to matter + +The hypothesis names _what_ the change is and (ideally) _why_ it should work. The "why" tells you which user attributes plausibly moderate the effect: + +| Hypothesis mechanism | Segments likely to moderate the effect | +| ------------------------------------------------- | -------------------------------------------------- | +| "Reduces first-time friction in onboarding" | New vs returning; signup source; locale | +| "Improves discoverability of feature X" | Users who previously used X vs not; tenure | +| "Speeds up a slow flow" | Platform (mobile slower than web); connection type | +| "Lowers payment friction" | Plan tier; payment-method type; geography | +| "Replaces a confusing UI element" | New vs returning (returning users habituated) | +| "Surfaces a feature only relevant to power users" | Engagement-tier cohorts; tenure | +| "Localized copy / pricing change" | Country / language | + +If you can't articulate _why_ a segment should respond differently, it's not a hypothesis-driven slice. Demote it. + +### 3. Segments where the **denominator** plausibly differs + +Some properties don't change _behavior_ but change _who gets exposed_. Slicing on these helps catch changed-denominator artifacts before they're called a win. + +- Triggered vs untriggered cohorts (if the treatment only fires on certain pages). +- Platform / app version (the treatment may only ship on a subset of clients). +- Device class (mobile vs desktop) when the change is platform-specific. + +A 1000% lift in `Checkout Screen Viewed` overall usually disappears once you condition on "users who reached the checkout funnel" — that disappearance is the finding. + +### 4. 
Segments where SRM or baseline shift is suspected + +If overall SRM is borderline (or failing in one variant only), per-segment SRM can localize the bucketing bug to a specific platform / country / cohort. Examples: + +- iOS vs Android (often the SDK bucketing layer differs). +- Bot-suspicious countries (`bot_traffic` cause from health-check). +- A specific app version range that shipped a flag-evaluation change. + +This is diagnostic segmentation, not interpretation segmentation. Use it when the **trustworthiness gate** has already flagged trouble. + +### 5. Segments the platform de facto requires + +Some user dimensions are so foundational that any results report should mention them once: + +- **Platform** — web vs iOS vs Android. +- **New vs returning** — defined as first session within the experiment window vs before. +- **Geo region** — EU vs US vs APAC, when results meaningfully differ by regulatory or payment context. + +Don't include all three blindly — pick the one(s) most likely to vary given the change. + +--- + +## Sanity checks before committing to a slice + +For each segment you want to break down on: + +1. **Does each segment value have enough exposed users per variant to clear the platform's overall sufficiency threshold?** Below that, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment. +2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis. +3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. "Pro tier" boundaries changed mid-test) invalidate the comparison. +4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes — slicing on them is selection-bias, not stratification. 
+ +--- + +## How many slices to commit to + +| Situation | Number of slices | +| ----------------------------------------------------------------- | ------------------------------- | +| Hypothesis-driven, well-powered, decisional | 3–5 segments, named upfront | +| Exploratory ("anything weird?"), flagged as hypothesis-generating | Up to ~10, with explicit caveat | +| Diagnostic (chasing a failing SRM or strange overall result) | Whatever helps localize the bug | + +If the user wants to "just look at everything", push back: pick the top 3–5 with reasoning, then offer a separate exploratory pass that won't be used for the ship decision. + +--- + +## The pre-commit ritual + +Before running the breakdowns, tell the user something like: + +> _"Based on the hypothesis (``), I'd slice by `` and `` because ``. I'm intentionally not slicing `` because they don't connect to the proposed mechanism — looking at every dimension makes false positives almost guaranteed. We can do an exploratory pass after, separately from the ship decision. Sound right?"_ + +Pre-commitment is what separates "segmentation analysis" from "fishing." + +--- + +## Then read the results + +Once the segment breakdown is in hand, switch to [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there. diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/session-replay-analysis.md new file mode 100644 index 0000000..7282bb4 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/session-replay-analysis.md @@ -0,0 +1,109 @@ +# Session-Replay Analysis Guidance + +Turn a quantitative experiment result into a behavior story using session replays. + +> **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. 
Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. + +--- + +## When replays help, when they don't + +| Question | Replays help? | +| ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | +| "Why is conversion lower in treatment?" | Yes — behavior diff is observable. | +| "Why is `Checkout Screen Viewed` 10× higher in treatment?" (changed-denominator suspect) | Yes — replays show whether users are _bouncing_ or _converting_ after they get there. | +| "Why is `time_on_page` higher in treatment?" | Yes — distinguishes engaged reading vs confused dwell. | +| "Is the treatment shipping a regression on iOS only?" | Sometimes — better answered first by segment breakdown. | +| "Why is SRM failing?" | No — replays don't show bucketing. Go to health checks. | +| "What's the lift?" | No — replays are qualitative; they explain _why_, not what. | +| "Why hasn't this hit statsig yet?" | No — that's a sample/power question, not a behavior question. | + +A useful heuristic: replays answer _behavioral_ questions. If the question isn't behavioral, replays will burn time without adding signal. + +--- + +## Cohort selection: which replays to compare + +You're looking for **paired contrast**, not a random sample. Pick the cohort that maximizes signal for the specific question. + +| Question | Cohort A (replays to pull) | Cohort B (replays to pull) | +| -------------------------------------------------------------------- | ---------------------------------------------------------- | ----------------------------------------------------------- | +| Why is primary metric down in treatment? 
| Treatment users who **failed** the primary action | Control users who **succeeded** at the primary action | +| Why is a guardrail regression appearing? | Treatment users who **triggered** the guardrail negatively | Control users who did NOT trigger it | +| Why does treatment have a huge lift in `Screen Viewed` (denom shift) | Treatment users who reached the screen | Same users, looking at whether they completed the next step | +| Why is engagement higher / lower in a specific segment? | Treatment users in that segment | Control users in the same segment | +| What does the new UI look like in practice? | Any treatment users who saw the change | Any control users to confirm the baseline UI | + +**Aim for ~5 replays per cohort.** Fewer and you're anecdote-shopping; many more and you'll just confirm what the first 5 already showed. If the first 5 are inconclusive or contradictory, pull 5 more before changing tactics. + +Filter by recency — replays from the most recent days of the experiment best reflect steady-state behavior (avoid novelty / primacy noise). + +--- + +## What to actually watch for + +Go in with a hypothesis from the quantitative result. Don't watch replays blank-eyed; you'll see "users using the app" and learn nothing. + +### Friction / failure patterns + +- **Hesitation** — long pause before clicking a key element (often signals confusion). +- **Misclicks** — clicking non-interactive elements, or rage-clicking a button that didn't work. +- **Form abandonment** — typing into a field, then leaving without submitting. +- **Back-button bounce** — landing on the page, then immediately backing out. +- **Scroll-and-leave** — scrolling without engaging, then exiting. + +If treatment has more of these than control, you have a behavior explanation for a primary loss or guardrail regression. + +### Layout / discoverability issues + +- **CTA below the fold** — users never scrolling to where the new button is. 
+- **Element overlap on mobile** — the treatment looks fine in desktop testing but breaks on small screens. +- **Hidden state** — a tooltip / modal that fires once and is then gone, so the user never sees the key affordance. + +These usually explain segment heterogeneity (loss concentrated in mobile, or in a specific viewport size). + +### Changed-denominator behavior + +If you're investigating a Twyman's-Law-sized lift, look for: + +- **Users landing on the new screen and immediately leaving** — explains the inflated `Viewed` event without explaining real conversion. +- **Users completing the rest of the funnel at a much lower rate per-arrival** — explains why the headline metric grew but downstream metrics didn't follow. + +If treatment users _arrive_ at a screen more often but _complete_ at a lower per-arrival rate, the "lift" is a denominator artifact and the per-converter behavior is the real story. + +### Variant-specific UI issues + +- **Treatment showed the wrong copy / wrong asset** — surprisingly common; treatment shipped, but to a subset of routes only. +- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to [health-check-interpretation.md](health-check-interpretation.md). +- **Treatment fired twice / persisted state across sessions** — implementation regression. + +--- + +## How to frame the findings + +Replay analysis is qualitative. Be honest about that. + +- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in the experiment's results."_ +- ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict. + +Tie observations back to specific quantitative results from the experiment-details response. 
If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior. + +--- + +## What NOT to do + +- ❌ Use replays to override a clear quantitative verdict. If primaries say "ship" and replays look ugly, the ugliness might be edge cases — confirm with segment analysis first. +- ❌ Cherry-pick a single dramatic replay. n=1 is anecdote. +- ❌ Replace segment analysis with replays. Replays explain _behavior_; segments explain _who_. Different questions. +- ❌ Pull replays from broad cohorts ("all treatment users") — the contrast pair is what reveals signal. +- ❌ Spend more time on replays than on the headline interpretation. The decision tree comes first; replays are the explanation step after it. + +--- + +## Output shape + +1. **The quantitative result the replays are explaining** — link back to the specific metric and verdict. +2. **Cohorts watched** — what filters were applied to A and B, how many replays in each. +3. **Patterns observed**, with counts (e.g. "4 of 5 treatment replays showed X; 0 of 5 control replays did"). +4. **The explanation hypothesis** — careful to frame as hypothesis ("consistent with"), not as proof. +5. **Recommended next action** — usually one of: ship anyway (regression edge-case), iterate (fix the friction), kill (treatment is materially worse), or run a follow-up A/B with the fix. diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md new file mode 100644 index 0000000..37ec069 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md @@ -0,0 +1,115 @@ +# Why Hasn't This Reached Statistical Significance Yet? + +Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts. 
+ +The actual stop / extend math (sample size, power, MDE) is owned by the `design-experiment` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one. + +--- + +## First, rule out a broken result + +Inconclusive can mean two very different things: + +1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about. +2. **The result isn't trustworthy at all** — SRM failing, broken data, peeked frequentist, etc. — and "inconclusive" is the wrong frame entirely. + +Before answering "why no statsig?", run the **trustworthiness gate**. If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power. + +Also check: + +- The primary's lift is missing or null → no measurement, not "no effect." +- The primary is listed on the experiment but has no computed result (live or cached) → "no measurement," not "no effect." +- The live results carry an error block → results are stale or partial; resolve the backend issue before drawing power conclusions. + +--- + +## The five real reasons an experiment hasn't hit statsig + +Walk through these in order. The first one that explains the picture is usually right. + +### 1. Not enough sample yet (not enough exposures) + +**What to check**: per-variant exposure counts against the configured end target (sample size or duration, whichever the experiment was configured with), and which testing model the experiment is using. + +- **Sequential** + target not reached → genuinely too early. Recommend **WAIT**. +- **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe. +- Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5. 
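
The three branches above can be sketched as a small lookup. This is only an illustration: the function name, the string labels, and the `target_reached` flag are hypothetical and are not fields of the experiment-details response.

```python
def sample_size_recommendation(testing_model: str, target_reached: bool) -> str:
    """Map reason 1's branches to a recommendation label (hypothetical labels)."""
    if target_reached:
        # Target hit but still no significance: not a sample-size problem,
        # so the caller should move on to reasons 2-5.
        return "CHECK_OTHER_REASONS"
    model = testing_model.strip().lower()
    if model == "sequential":
        # Genuinely too early; waiting is the right call.
        return "WAIT"
    if model == "frequentist":
        # Also too early, and peeking would inflate false positives.
        return "WAIT_UNTIL_CONFIGURED_END"
    raise ValueError(f"unknown testing model: {testing_model!r}")
```

For example, `sample_size_recommendation("Frequentist", False)` yields the wait-without-peeking label, while any model with `target_reached=True` falls through to the remaining reasons.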
+ +If exposures are falling short of plan because traffic dropped: surface that. Querying the exposure event with a date breakdown shows whether something changed mid-experiment. + +### 2. Observed effect is smaller than the MDE + +**What to check**: the lift on the primary metric, plus the MDE the user planned for (typically captured in the experiment's hypothesis/description, or recovered via the setup-side skill's power math). + +- Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1. +- Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options: + - **Accept the null** — at this size, the change isn't moving the metric. Document and move on. + - **Resize and rerun** — if a smaller effect would still be ship-worthy, re-run with a larger sample (lower MDE). +- Observed lift much **larger** than planned MDE but still not significant → unusual; likely high variance (see reason 3) or insufficient exposures (reason 1). + +### 3. Variance is too high (metric is too noisy) + +**What to check**: the metric's distribution type, plus whether CUPED and Winsorization are enabled. + +- **Gaussian** metric (revenue, time-on-page) with no Winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization on the next run. +- **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume. +- **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the absolute scale of detectable effects. Lifts near 50% rates are easiest; lifts near 0%/100% need much more sample. 
+- **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%. +- **CUPED enabled on a new-user-only cohort** → CUPED has no effect (no pre-exposure data exists). Not a misconfiguration to "fix," but variance reduction simply didn't happen. + +### 4. Traffic split is starving the variant + +**What to check**: the configured traffic split against the actual per-variant exposure counts. + +- Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue. +- Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later. +- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs to clear the platform's per-variant exposure floor in its own right. Adding arms costs power per-comparison. + +Never change traffic allocation mid-Frequentist test — it invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment. + +### 5. Exposure config is filtering more users than the user expects + +**What to check**: exposure event volume, any audience filters on the backing feature flag, and whether QA traffic is being excluded. + +- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query the exposure event to confirm how many users actually got exposed. +- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event. 
+- QA traffic isn't being excluded and you suspect internal traffic is dominating one variant → enable the QA exclusion on the next run (results then are cleaner but also smaller). + +**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md). + +--- + +## Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL? + +Once you know which reason fits, the recommendation almost picks itself. + +| Reason | Recommendation | +| -------------------------------------- | ------------------------------------------------------------------------------------------------------------ | +| Not enough sample yet, still ACTIVE | **WAIT.** Show projected end date based on observed traffic. | +| Not enough sample yet, concluded early | **EXTEND** (Frequentist: relaunch with longer planned duration; Sequential: resume if possible). | +| Effect << MDE | **ACCEPT NULL** if the planned MDE is the smallest ship-worthy effect; otherwise **BOOST POWER** and re-run. | +| Variance too high | **BOOST POWER**: enable CUPED, enable Winsorization, switch to a less noisy metric proxy. | +| Variant starved by traffic split | **EXTEND** (if remaining time is enough) or restart with rebalanced split. | +| Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. | +| Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. | + +When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). 
Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `design-experiment` skill for the power math. + +--- + +## What NOT to suggest + +- ❌ **Stop early on a favorable peek** in a Frequentist test — that's exactly the false-positive inflation problem. +- ❌ **Switch testing model mid-experiment** — restart, don't morph. +- ❌ **Add more primary metrics** to "fish" for a win — multiplies the family-wise FPR. If a single primary is inconclusive, more primaries make the picture worse, not better. +- ❌ **Re-run identical hypothesis on the same audience right after concluding "no effect"** — without a power change, you'll get the same answer. +- ❌ **Claim "no effect"** from an underpowered inconclusive result — the right framing is "the experiment wasn't sized to detect the effect we observed." + +--- + +## Output shape + +1. **The reason** (one of the five above), in one sentence. +2. **The evidence** — concrete numbers from the experiment (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%"). +3. **Recommendation** from the table above, with the specific experiment update or follow-up action. +4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment. diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md new file mode 100644 index 0000000..c370fc0 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md @@ -0,0 +1,129 @@ +--- +name: interpret-experiment +description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. 
Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `design-experiment` skill. +license: Apache-2.0 +--- + +# Interpret Experiment + +You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a verdict field is missing, say so — do not synthesize one from raw values. + +--- + +# Glossary + +Concepts the rest of this skill uses without redefining. + +- **Variant.** One arm of the experiment. The variant treated as the baseline is the **control**; the others are **treatments**. The platform marks which key is the control. +- **Primary / Guardrail / Secondary metric.** + - **Primary** — drives the ship decision. The platform applies multiple-testing correction across primaries when configured. + - **Guardrail** — a metric that must not regress; a guardrail loss vetoes a ship even when primaries win. + - **Secondary** — exploratory only, never decisional, no correction applied. +- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict. +- **Polarity.** Whether a movement is _good for the business_. Combines sign of lift with the metric's `direction` ("up" = bigger is better; "down" = smaller is better). See the **Polarity recipe** in Components. +- **Significance.** The platform's per-row classification: `YES_POSITIVE`, `YES_NEGATIVE`, or `NO`. Read from the response — do not recompute. +- **SRM (Sample Ratio Mismatch).** Variants received traffic in proportions that disagree with the configured split. **Kohavi's #1 trustworthiness check** — when SRM fails, downstream lift, p-values, and CIs cannot be trusted. 
+- **Retro A/A (pre-experiment bias).** Re-runs the comparison on the pre-exposure period. A failure means cohorts already differed before treatment started. +- **Twyman's Law.** "Any unusually clean or unusually large result is more likely a bug than a discovery." Apply on lifts > ~30% — usually a changed-denominator artifact. +- **CUPED.** Variance reduction using pre-exposure baseline. Cuts required sample 30–70% when it applies. Inert on new-user-only cohorts. +- **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95. +- **MDE (Minimum Detectable Effect).** The smallest lift the experiment was sized to detect. Set during experiment setup. +- **Trustworthiness gate.** The pre-flight check that runs before any results interpretation: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference. + +--- + +# Components + +The pieces every interpretation uses. Defined here once so they don't drift across the steps and references. + +## Polarity recipe (load-bearing — apply on every metric row) + +The platform's summary buckets (`positive` / `negative` / `no`) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion. + +Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"): + +- `lift is None` or `lift == 0` → **neutral** (no measurement / no effect respectively). +- `direction == "up"` → **positive** if `lift > 0`, else **negative**. +- `direction == "down"` → **positive** if `lift < 0`, else **negative**. + +A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control. 
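
As a minimal sketch, the recipe above translates almost line for line into Python. The function name `polarity` and the returned strings are illustrative assumptions; only `lift` and `direction` come from the recipe itself.

```python
from typing import Optional


def polarity(lift: Optional[float], direction: str = "up") -> str:
    """Classify one metric row as 'positive', 'negative', or 'neutral'.

    `direction` defaults to "up" (bigger is better), as in the recipe.
    """
    if lift is None or lift == 0:
        # None means no measurement; 0 means no effect. Both read as neutral.
        return "neutral"
    if direction == "up":
        return "positive" if lift > 0 else "negative"
    if direction == "down":
        return "positive" if lift < 0 else "negative"
    raise ValueError(f"unknown direction: {direction!r}")
```

Under this sketch, a row with `lift=0.12` on a `direction: "down"` metric comes back as `"negative"`, which is the regression case the recipe warns about. Control rows are assumed to have been filtered out before calling it.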
+ +The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**. + +## Data-source fallback + +Experiment-details has two parallel data paths — live (preferred) and cached. Always prefer live; if live computation failed, fall back to cache with a staleness caveat; if **both** are empty, say "no result was computed" and recommend a re-sync. **Never** silently treat missing data as "no effect." + +## Verdict table + +| Situation | Recommendation | +| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Conclude the experiment via its decide lifecycle action, naming the winning variant and a rationale message. **Confirm with the user first — concluding is irreversible.** | +| Trust ✓, primary polarity positive, guardrail polarity negative | **ITERATE.** Investigate the regression; do not auto-ship. | +| Trust ✓, primary polarity neutral after target sample reached | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md). | +| Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)). | +| Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). | + +For multi-variant tests, special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), and the exact decide-call shape, see [references/lifecycle-handoff.md](references/lifecycle-handoff.md). + +--- + +# Steps + +Top-down: what to do, in order. 
+ +## 1. Fetch the experiment + +If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs. Accept the experiment by name or by ID; try ID match first, then case-insensitive name match. + +Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments. + +Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret. + +## 2. Run the trustworthiness gate (the Decision Tree) + +Run steps 2a–2e in order. **Stop at the first failure** — do not proceed if a step flags a problem. The platform attaches verdict fields for each check; consume those verdicts rather than recomputing. + +### 2a. Trustworthiness + +SRM ok? Retro A/A clean? Exposures sufficient? Minimum duration met (~3 days)? No misconfiguration? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md). The Misconfigurations section in that reference covers the warning-level signals (multiple-testing off, extreme winsorization, CUPED on new-users-only, etc.). + +### 2b. Statistical significance + +Apply the **polarity recipe** from Components to each non-control variant × primary metric. If nothing is significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md). For translating a single metric's lift / CI / p-value into a phrase, see [references/per-metric-interpretation.md](references/per-metric-interpretation.md). + +### 2c. Guardrail check + +Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a row in `summary.positive` for a `direction: "down"` guardrail is still a regression. + +### 2d. 
Practical significance + +Convert lift into absolute terms — multiply by the control baseline. Statistically significant ≠ ships. The per-metric reference covers the baseline-fetch fallback when `value` or `sampleSize` is missing, and the **Twyman's Law** check for any lift > ~30%. + +### 2e. Verdict + +Look up the situation in the **Verdict table** in Components. If the recommendation is SHIP or KILL, surface the proposed decide-action parameters and **wait for explicit user confirmation** before executing — concluding an experiment is irreversible. + +## 3. Going deeper (open references on demand) + +| User asks about… | Open | +| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | +| SRM failing, Retro A/A failing, exposures insufficient, or any trustworthiness fail | [references/health-check-interpretation.md](references/health-check-interpretation.md) | +| "Translate this lift / CI / p-value into English" | [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | +| "Why hasn't this hit statsig yet? Should we wait or stop?" | [references/why-no-statsig.md](references/why-no-statsig.md) | +| "Which segments should I break this down on?" | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | +| "What does this segment-by-segment result mean?" | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | +| "Can session replays help explain this result?" | [references/session-replay-analysis.md](references/session-replay-analysis.md) | +| "How do I actually conclude this experiment? Multi-variant ship?" | [references/lifecycle-handoff.md](references/lifecycle-handoff.md) | + +## 4. Output + +Default to this shape unless the user asks for something else: + +1. 
**Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`. +2. **Why**, walking through the trustworthiness-gate steps that mattered (skip steps that were clearly fine). +3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, each polarity-corrected. Include absolute-impact translation for any win. +4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, stale-cache caveat, etc. +5. **Suggested next action** — for SHIP / KILL, the proposed decide-action parameters **gated on user confirmation**; for ITERATE / WAIT, the investigation to run next. + +If experiment details are unavailable or return errors, say so — do not invent a verdict. diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md new file mode 100644 index 0000000..1467468 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md @@ -0,0 +1,176 @@ +# Health-Check Interpretation + +Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action. + +--- + +## Kohavi framing — always cite when a health check fails + +> **Sample Ratio Mismatch is the #1 trustworthiness check (Kohavi).** When SRM is failing, do not trust the experiment's lift, p-values, or confidence intervals — the randomization assumption is broken, so the measured effect cannot be attributed to the treatment. +> +> **Twyman's Law**: any unusually clean or unusually large result is more likely a bug than a discovery. A spectacular lift on a failing-SRM experiment is not evidence of a great treatment; it's evidence the bucketing is broken. + +These two principles drive the recommendations below. 
Lead with them when explaining a failing check to the user. + +--- + +## 1. SRM (Sample Ratio Mismatch) + +**What the platform tells you**: the SRM verdict the experiment-details response carries (live, or cached when live isn't available). The platform tags failing SRMs already — consume the verdict, do not compute chi-square yourself. + +### What it means + +Users were assigned to variants in proportions that disagree with the configured target allocation. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness. + +### Likely causes, ordered most → least likely + +(Surface in this order — investigate the most probable first.) + +1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees. +2. **biased_assignment** — The assignment criterion correlates with the variant — e.g. assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window. +3. **bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation. +4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the exposure event fires exactly once per user per variant assignment. +5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period. + +### Recommended actions + +- **pause_and_investigate** — Pause the experiment before drawing any conclusions. 
SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable. +- **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric. +- **investigate_exposure_logging** — Compare exposure event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs. +- **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split. + +### Investigation checklist + +1. Compare the actual per-variant exposure ratio to the configured target allocation — which variant is over/under-represented? +2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history. +3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math. +4. Verify SDK version and bucketing logic. Query the exposure event grouped by variant to confirm exposure events are flowing correctly. +5. Check for bot/QA traffic — bots often skew toward control. If QA traffic isn't being excluded, recommend enabling that filter. +6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting. +7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.** + +--- + +## 2. Retro A/A (pre-experiment bias) failure + +**What the platform tells you**: the pre-experiment-bias analysis the platform attaches when that check is enabled in the experiment's settings. 
+ +### What it means + +The same statistical comparison run on the **pre-exposure** period revealed that variant cohorts already differed _before_ the treatment started. Any "lift" measured during the experiment may just be reflecting that pre-existing gap, not the change. + +- Pre-experiment bias on a **primary** metric is a **stop-and-investigate** signal. +- Pre-experiment bias on a **secondary** metric is informational only. + +### Investigation checklist + +1. Identify which metric × variant pair triggered the failure (after the platform's correction). +2. Check whether bucketing was deterministic — non-deterministic assignment in the pre-period means users were assigned to different variants than they would have been in production. +3. Look for cohort skew: did one variant disproportionately receive heavy users? Query the metric pre-experiment grouped by variant to confirm. +4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort. +5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing. + +--- + +## 3. Insufficient exposures + +**What the platform tells you**: per-variant exposure counts plus an "insufficient" flag when the count is too low to trust. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue. + +### Investigation checklist + +1. Check per-variant exposure totals — which variant is undersampled? +2. Inspect feature-flag rollout — was rollout dialed back? +3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?). +4. If the experiment is still ACTIVE: extend duration via an experiment update with a new end target. +5. If the experiment concluded too early: relaunch with longer planned duration. 
The setup-side skill covers the power-analysis math. + +If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question. + +--- + +## 4. Frequentist peeking + +**What to check**: the experiment's testing model and whether it ended before reaching its configured end condition (sample size or duration, whichever was configured). + +### What it means + +A frequentist test that ends before reaching its configured target has an **inflated false-positive rate**. The math assumes a fixed sample size; peeking before that point and stopping on a favorable look is exactly what "p-hacking" looks like in production. + +### Investigation checklist + +1. Confirm the testing model is frequentist (sequential tests don't have this problem). +2. Compare the actual end date against the planned end (date- or sample-based, whichever the experiment was configured with). +3. If the conclusion was premature: results have inflated false-positive rate. Recommend a re-run. +4. If the user wants to keep current results: caveat strongly. Recommend a sequential testing model for the next experiment so they can stop early without penalty. + +(Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.) + +--- + +## 5. Live computation timeout / broken data + +**What the platform tells you**: a non-null error block on the live results, with the live data path empty. + +### Investigation checklist + +1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy. +2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. +3. 
Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision. +4. If the cache is recent (within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or empty, the user must resolve the backend issue before any meaningful interpretation. + +--- + +## 6. Experiment ran < 3 days + +**What to compute (this one is local)**: the elapsed time between the experiment's start and end. + +Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. **Refuse to interpret.** Tell the user explicitly: + +> _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_ + +If the experiment was sample-size-bounded and a tiny target was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window. + +--- + +## 7. Misconfigurations + +These don't always invalidate results, but they change how to _read_ them. Surface them as warnings during the trustworthiness gate. + +### Multiple-testing correction off with several primaries + +**Correction off AND 2+ primaries × 1+ non-control variants.** Any single significant primary may be a false positive — the family-wise error rate compounds with every added comparison (e.g. 15 primaries × 1 variant at α=0.05 → ~54% chance of at least one false positive, assuming independent metrics). Look at primaries in aggregate: if most point the same direction, the effect is likely real; if only one or two of many are significant, recommend enabling Benjamini-Hochberg or Bonferroni and re-analyzing.
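
The ~54% figure can be reproduced with a short calculation. This is a rough sketch that assumes the comparisons are independent, which real metrics rarely are, so treat it as a heuristic rather than the platform's own computation. The function name is hypothetical.

```python
def family_wise_error_rate(num_comparisons: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent comparisons.

    num_comparisons = primaries x non-control variants (assumption: independence).
    """
    if num_comparisons < 0:
        raise ValueError("num_comparisons must be non-negative")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    # Probability that every comparison avoids a false positive is (1 - alpha)^m.
    return 1.0 - (1.0 - alpha) ** num_comparisons


# 15 primaries x 1 variant at alpha = 0.05, rounded to two decimals
print(round(family_wise_error_rate(15 * 1, alpha=0.05), 2))  # prints 0.54
```

The same helper shows why a non-default confidence level matters: at α = 0.10 the rate climbs faster for the same number of primaries.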
+ +### Extreme winsorization percentile + +**Winsorization enabled with a percentile far from the platform default (typically 95).** A percentile near 50 caps almost all data — almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason. + +### SRM check disabled + +**SRM check is off.** Often deliberate — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) and then recommend re-enabling SRM and re-analyzing. + +### CUPED on new-users-only cohort + +**CUPED enabled AND the cohort is "new users only".** CUPED needs pre-exposure data, so it had no effect here — but **results are still valid**, variance reduction just didn't happen. Mention as informational. For future experiments on this surface, suggest extending the cohort to include returning users so CUPED can apply. + +### Non-default confidence level + +**Confidence level differs from the platform default (typically 0.95).** `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out in the verdict and combine with metric count to estimate the family-wise error rate. + +### Broken or placeholder metric entries + +**Metric entries with empty names.** Likely broken or placeholder references. Flag and skip during analysis. + +### Primary metric with no computed result + +**A metric is listed as primary but has no result (live or cached).** This is **"no measurement," not "no effect."** Surface prominently; recommend re-syncing results before any conclusion that depends on this primary. + +--- + +## Output shape when a health check fails + +1. **What failed**, in one sentence (use the verdict the platform attached — do not re-derive). +2. **What that means for trust** — cite the Kohavi framing (SRM is #1) or Twyman's Law where it fits. +3. 
**Likely causes**, ordered most → least probable. +4. **Recommended action** from the small set above. +5. **Investigation checklist** the user can run. +6. **What NOT to do** — usually, "do not act on the current lift / p-value numbers." diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md new file mode 100644 index 0000000..3a9e24c --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md @@ -0,0 +1,39 @@ +# Lifecycle Hand-off + +How to conclude an experiment once the verdict is settled. This reference is **interpretation guidance** — the per-field schema of the decide action lives in the experiment-update tool description. + +--- + +## Confirm before concluding — always + +Concluding an experiment is **irreversible**. Before invoking the decide action, surface the proposed parameters to the user (winning variant, success/fail, rationale message) and wait for explicit confirmation. A SHIP verdict is a recommendation, not an authorization. + +## The three pieces every decide call needs + +A decide call expresses three things: + +1. **Did the experiment succeed?** A win for one of the treatments, or a deliberate stop. +2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special constants below. +3. **Why?** A rationale message — what metrics were evaluated, the polarity reading, the tradeoffs accepted. The platform requires this on every decide call; treat it as a one-paragraph decision record, not a placeholder. + +## Special variant choices for success + +When you have a winning result but no single variant to ship: + +- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. 
(The platform exposes this as the constant `__no_variant_shipped__`.) +- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The platform exposes this as `__defer_variant_decision__` and shows `SUCCESS_DEFERRED` in the UI.) + +When the verdict is KILL — no winner — record success as false. No variant key is needed in that case. + +## Multi-variant experiments + +For a 3+ arm test, the decide action still names a single winning variant. If two treatments are roughly tied: + +- If both clear the practical-significance bar and shipping either is acceptable, pick on simplicity (smaller diff from control, lower implementation cost). +- If the team genuinely cannot pick, use the defer constant above — better than fabricating a winner. + +A multi-variant test where only one treatment is significantly different from control is a clean SHIP for that variant; the inconclusive arms are simply not the winner. + +## After concluding + +The decision record — the rationale message, the shipped variant, and the experiment's terminal status — becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer. diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md new file mode 100644 index 0000000..e46381c --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md @@ -0,0 +1,167 @@ +# Per-Metric Interpretation + +Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_ + +--- + +## The mental model + +Each row in `summary.positive` / `summary.negative` / `summary.no` answers four questions: + +1. 
**Did the lift go up or down?** — the `summary` bucket name (sign-of-lift, not polarity). +2. **Was the change distinguishable from noise?** — the `significance` field (or the bucket name itself: rows in `summary.positive` / `summary.negative` are significant, rows in `summary.no` are not). +3. **Was the change in the goal direction?** — apply the polarity recipe with `metric.direction`. +4. **Was the change big enough to matter?** — multiply `lift` by the control baseline `value` to get absolute impact, then judge against business context. + +A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any one of those and you're shipping the wrong thing. + +--- + +## Polarity recipe + +Apply the polarity recipe from the spine — see the **Components** section of `SKILL.md`. Treat the bucket name in `summary.positive` / `summary.negative` as sign-of-lift only; the business verdict comes from combining it with `metric.direction`. Examples worth remembering: + +- A row in `summary.positive` with `direction: "down"` is a **regression**. +- A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1% interstitials_shown` lift means less interruption). + +--- + +## Reading the p-value in this platform + +Mixpanel runs a frequentist comparison at the experiment's configured confidence level — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). + +The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread. + +For the general meaning of a p-value (the probability under the null), trust the model's baseline knowledge — don't invent thresholds in either direction. 
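
The four-question model and the polarity recipe earlier in this reference can be sketched as a small helper. This is a minimal sketch under stated assumptions: the function names are hypothetical, `"up"` is assumed to be the counterpart of the `"down"` direction shown in the examples, and the magnitude judgment is passed in as a plain boolean because it depends on business context rather than on a threshold the agent should invent.

```python
from typing import Optional


def polarity(lift: Optional[float], direction: str) -> str:
    """Translate sign-of-lift into a business reading via the metric's goal direction."""
    if direction not in ("up", "down"):  # assumption: only these two values occur
        raise ValueError(f"unexpected direction: {direction!r}")
    if lift is None:
        return "no measurement"  # the calculation failed; this is not "no effect"
    if lift == 0:
        return "flat"
    moved_up = lift > 0
    wants_up = direction == "up"
    return "goal direction" if moved_up == wants_up else "against goal"


def is_win(significant: bool, lift: Optional[float], direction: str, big_enough: bool) -> bool:
    """A win needs yes to questions (2), (3) and (4)."""
    return significant and polarity(lift, direction) == "goal direction" and big_enough


# A positive lift on a metric that should go down is a regression.
print(polarity(0.03, "down"))   # against goal
# A negative lift on a metric that should go down is a win candidate.
print(polarity(-0.01, "down"))  # goal direction
```

Keeping `polarity` separate from `is_win` mirrors the recipe: the bucket name only gives the sign of the lift, and the verdict comes from combining that sign with significance and magnitude.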
+ +--- + +## Reading the lift correctly + +``` +lift = (treatment_mean - control_mean) / control_mean +``` + +- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct. +- If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect." + +--- + +## Verdict phrasing — a small palette + +Pick the phrase that matches the four-question pattern. These are the words to use with users; they map onto the platform's already-computed numbers, so the agent never has to invent thresholds. + +| Pattern (sig × polarity × magnitude) | Plain-language verdict | +| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Significant, polarity positive, magnitude large vs baseline | "**Clear win** — `` moved `` in the goal direction, which is meaningful at this baseline." (apply Twyman's Law if lift > ~30%) | +| Significant, polarity positive, magnitude small vs baseline | "**Statistically significant but practically small** — `` on a `` baseline is ``; confirm with the user whether that clears the business bar." | +| Significant, polarity negative | "**Regression** — `` moved `` against its goal direction. This is a reason not to ship even if other primaries won." | +| Not significant, lift in goal direction, well-powered | "**Likely no effect at the detectable size.** The experiment had enough power to detect ``; the observed lift is below that threshold." 
| +| Not significant, lift in goal direction, underpowered | "**Inconclusive — too underpowered to call.** Route to the why-no-statsig playbook to decide between wait / extend / restart." | +| Not significant, lift in wrong direction | "**No detectable harm**, but no win either." | +| `lift is None` | "**No measurement** — this variant's row failed to compute. Surface the failure and re-sync." | +| Lift > ~30% on any metric | Prefix with "**Twyman's Law check:** that lift is unusually large; verify the denominator hasn't changed before celebrating." | + +--- + +## Magnitude — make it absolute + +Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful: + +1. Baseline from the control variant's metric value (the experiment-details response carries it on the per-variant row). +2. Lift from the winning row. +3. Absolute lift: `baseline × lift`. Examples: + - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate. + - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`. +4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week." + +### Fallback when the baseline value or sample size is missing + +Common — happens whenever live computation timed out or the cached results were nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.** + +Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation: + +- `unique` (Bernoulli) → conversion **rate** as the baseline. +- `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. 
Multiplying lift by a raw total double-counts cohort size. + +--- + +## Twyman's Law in practice — changed-denominator lifts + +Before celebrating any lift > ~30%, ask: **did the treatment change who is _exposed_ to this metric, not just how they behave?** + +If the treatment causes more users to _see_ a screen, more events naturally fire — the metric grows because the denominator changed, not because per-user behavior changed. + +- A "Free item" promotion drives more users to checkout → "Checkout Screen Viewed" lifts +1000% mechanically. The interesting question is **conversion rate on the screen**, not raw views. +- A new banner makes a feature discoverable → "Feature Page Viewed" lifts dramatically. **Per-discover-er behavior** may be unchanged. + +When you see a > 30% lift, name the risk explicitly: + +> _"This metric measures exposure to the screen/event. The treatment likely caused more users to be exposed; that explains most of the lift mechanically. The interesting question is what those users did once they got there."_ + +--- + +## Metric distribution types + +Different metric types behave differently; cite the relevant nuance in your verdict. + +| Metric type | Distribution | Interpretation nuance | +| -------------------------------- | ------------ | --------------------------------------------------------------------------------------------------------- | +| Unique users / conversion rate | Bernoulli | Variance = `p(1−p)`. Lift on rates near 50% is most powered; rates near 0% or 100% need much more sample. | +| Event counts / sessions per user | Poisson | Variance = mean. Highly sensitive to power users; consider whether one heavy user can swing results. | +| Revenue / numeric properties | Gaussian | Long tails (whales) inflate variance. Strongly consider Winsorization. | + +--- + +## Variance-reduction & outlier settings that change interpretation + +- **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. 
Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). +- **Winsorization enabled**: extreme values capped at the configured percentile, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A percentile much lower than the platform default (typically 95) is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). + +--- + +## Multiple comparisons & metric tiers — what's decisional and what isn't + +| Tier | How it influences the verdict | +| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Primary** | **Decisional.** The platform auto-applies correction when the experiment is configured for Bonferroni or Benjamini-Hochberg (across primaries × variants). | +| **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude. | +| **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. | + +If multiple-testing correction is off AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled. 
+ +--- + +## When a primary metric is inconclusive + +A "not significant" verdict means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level — **not that there is no effect.** Important when the user is about to call something a null result. + +For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see [why-no-statsig.md](why-no-statsig.md). + +--- + +## Frequentist vs Sequential — what affects per-metric reading + +Concluding a Frequentist experiment before it reaches its configured target is a peeking event — per-metric significance verdicts become unreliable. Sequential experiments are designed for continuous monitoring and don't have this problem. + +For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of [health-check-interpretation.md](health-check-interpretation.md). + +--- + +## Triggered analysis & dilution + +If the change only affects a subset of users (e.g. only triggers when a specific button is shown), the **effect on triggered users** is much larger than the **effect on the full exposed population**. + +- Triggered analysis zooms in on users who actually saw the change. +- Dilution math: `population_lift = triggered_lift × (triggered_users / total_exposed)`. + +The platform doesn't auto-compute triggered analysis. If the change is gated by a condition, ask the user about the trigger rate and walk through the math before declaring the population-level lift "small." + +--- + +## Novelty and primacy + +- **Novelty** — lift is large early, then decays as users habituate. +- **Primacy** — lift is small or negative early, then grows as users learn the new behavior. + +To detect either, look at the line-chart view of the metric (date-segmented). A monotonic decay from day 1 → day 14 is classic novelty; the steady-state lift is what matters for shipping. Call this out when interpreting any experiment shorter than ~2 weeks. 
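
The dilution formula from the triggered-analysis section above can be sketched as follows. The function name and the numbers are hypothetical, and the formula is the simple approximation given in the text: it treats untriggered users as unaffected and assumes a similar baseline in both groups.

```python
def population_lift(triggered_lift: float, triggered_users: int, total_exposed: int) -> float:
    """Dilute a triggered-cohort lift down to the full exposed population."""
    if total_exposed <= 0:
        raise ValueError("total_exposed must be positive")
    if not 0 <= triggered_users <= total_exposed:
        raise ValueError("triggered_users must be between 0 and total_exposed")
    return triggered_lift * (triggered_users / total_exposed)


# Hypothetical numbers: a 20% lift among the 10% of exposed users who saw the change.
diluted = population_lift(0.20, 1_000, 10_000)
print(f"{diluted:.1%}")  # prints 2.0%
```

Walking the user through this arithmetic before calling a population-level lift "small" makes clear that a modest headline number can hide a large effect on the users who actually saw the change.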
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md new file mode 100644 index 0000000..98c7bbc --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md @@ -0,0 +1,99 @@ +# Segment-Breakdown Interpretation + +Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. + +--- + +## The mental model + +A segment breakdown asks: _did the treatment affect different user segments differently?_ It has three possible outcomes per segment: + +1. **The segment moved in the same direction as the overall effect**, with similar magnitude → reinforces the overall verdict; nothing new. +2. **The segment moved much more or less than overall**, but in the same direction → heterogeneity; the effect is concentrated in a subset. +3. **The segment moved in the _opposite_ direction** to overall → Simpson's paradox or a real reversal — this is where segment analysis earns its keep. + +Reading a segment breakdown well means recognizing which of those three you're looking at and not mistaking noise for any of them. + +--- + +## Per-segment polarity recipe — apply per row + +The same recipe from the per-metric reference applies _inside_ each segment. Don't take a shortcut. + +- For each segment × metric × non-control variant, look at the row's `lift` and bucket (positive/negative/no). +- Translate sign-of-lift into business polarity using `metric.direction`. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary. +- Filter out the control row in each segment. + +Surprisingly easy to forget when you're scanning a wide table — re-apply polarity per row. 
+ +--- + +## Sample-size floor per segment + +Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. The platform surfaces an "insufficient exposures" flag at the overall level — trust that signal over a hand-rolled threshold, and apply the same logic per segment. + +- Segments the platform would flag as insufficient if scoped to that segment alone → mark "insufficient sample, treat as directional only." +- A "significant" lift on a tiny per-variant segment (e.g. tens of users) is almost always noise. Say so. +- If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice. + +--- + +## Heterogeneity vs Simpson's paradox vs noise + +| What you see | Interpretation | +| --------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Most segments lift positive, one or two negative, all with overlapping CIs | **Noise.** Not heterogeneity. Don't ship a segment-specific story. | +| One segment lifts much more than the rest, with a tight CI and a clear mechanism | **Real heterogeneity.** The change is concentrated in that segment. Consider shipping only to that segment, or revising the hypothesis. | +| Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. | +| Two opposite-direction effects in different segments that roughly cancel overall | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses. 
| + +When you spot Simpson's paradox, route the user to the **SRM** section of [health-check-interpretation.md](health-check-interpretation.md) — bucketing is usually the cause, not a real reversal. + +--- + +## What a "ship only to segment X" recommendation requires + +Don't recommend a segment-scoped ship unless **all** of these hold: + +1. The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it). +2. The segment's per-variant sample clears whatever exposure floor the platform applies to the overall experiment, by a comfortable margin. +3. The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment. +4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product. +5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply. + +Otherwise, the segment-only ship is a post-hoc story dressed up as a decision. Recommend confirming with a follow-up experiment scoped to that segment. + +--- + +## When a segment loses but overall wins + +This is the everyday case of mixed effects. + +- If the losing segment is small and its absolute hit is acceptable, ship to all — but call out the loser in the rationale. +- If the losing segment is large or has a guardrail regression, recommend iterate, not ship. +- If the losing segment is a regulated / strategic cohort (paying tier, top customers, EU), default to iterate — guardrails on the cohort, not just overall. + +--- + +## What NOT to do + +- ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition. +- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment query fallback — they're not corrected unless the platform did it. 
+- ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal. +- ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism. +- ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence). + +--- + +## Output shape + +1. **One-sentence segment-level summary** — homogeneous, heterogeneous, or Simpson's-suspicious. +2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered). +3. **What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's." +4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating). + +--- + +## Platform support status + +Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the rules above. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts. 
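
When working from the per-segment query fallback, the Simpson's-paradox pattern described earlier in this reference can be spotted with plain arithmetic on conversions and exposures. The sketch below is illustrative only: the data shape, names, and numbers are hypothetical, it compares raw rates without any significance test, and a `True` result is a prompt to check per-segment SRM rather than a verdict.

```python
from typing import Dict, Tuple

# (conversions, exposed users) per variant within each segment. Hypothetical numbers.
Counts = Dict[str, Dict[str, Tuple[int, int]]]

SEGMENTS: Counts = {
    "mobile": {"control": (10, 100), "treatment": (108, 900)},
    "desktop": {"control": (270, 900), "treatment": (33, 100)},
}


def rate(conversions: int, users: int) -> float:
    return conversions / users


def pooled_rate(segments: Counts, variant: str) -> float:
    """Conversion rate for one variant with all segments pooled together."""
    conversions = sum(seg[variant][0] for seg in segments.values())
    users = sum(seg[variant][1] for seg in segments.values())
    return conversions / users


def simpsons_suspect(segments: Counts) -> bool:
    """True when every segment favours one variant but the pooled rate favours the other."""
    per_segment = {
        rate(*seg["treatment"]) > rate(*seg["control"]) for seg in segments.values()
    }
    if len(per_segment) != 1:
        return False  # segments disagree with each other: heterogeneity or noise instead
    treatment_wins_in_segments = per_segment.pop()
    treatment_wins_overall = pooled_rate(segments, "treatment") > pooled_rate(segments, "control")
    return treatment_wins_in_segments != treatment_wins_overall


print(simpsons_suspect(SEGMENTS))  # True
```

In the hypothetical data, treatment leads in both segments (12% vs 10% on mobile, 33% vs 30% on desktop) yet trails when pooled (14.1% vs 28%), because the variants were exposed in very different proportions per segment. That uneven mix is exactly what a per-segment SRM check is meant to surface.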
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-of-interest-selection.md new file mode 100644 index 0000000..4db49ac --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-of-interest-selection.md @@ -0,0 +1,116 @@ +# Segment-of-Interest Selection + +Pick 3–5 segments **likely to reveal a real effect difference** before slicing every available dimension and ending up p-hacking. + +The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them. + +--- + +## Why this matters: the fishing-expedition problem + +If you slice an experiment by every available property (10 platforms × 20 countries × 5 plan tiers × …), you will find "significant" segment-level effects by chance alone. The family-wise false positive rate explodes the same way it does for too many primary metrics — except there's usually no platform-level correction across segments. **Pre-committing to a small set of segments, ordered by hypothesis-driven probability, is the discipline that makes segment analysis credible.** + +Aim for 3–5 segments, max. If the user wants more, ask which ones are connected to the hypothesis and which are exploration. Mark the exploration set as "hypothesis-generating, not decisional." + +--- + +## The decision tree for picking segments + +Walk through these in order. The first match is the most defensible pick. + +### 1. Segments the hypothesis explicitly names + +If the experiment's `hypothesis` (or `description`) text mentions "new users", "mobile", "Pro tier", "EU customers" — those segments are pre-committed by the experiment design. Always include them. 
+ +Look at: + +- `experiment.hypothesis` +- `experiment.description` +- The setup-side conversation, if present + +These are not exploratory; they're the variables the team committed to test. + +### 2. Segments where the mechanism is expected to matter + +The hypothesis names _what_ the change is and (ideally) _why_ it should work. The "why" tells you which user attributes plausibly moderate the effect: + +| Hypothesis mechanism | Segments likely to moderate the effect | +| ------------------------------------------------- | -------------------------------------------------- | +| "Reduces first-time friction in onboarding" | New vs returning; signup source; locale | +| "Improves discoverability of feature X" | Users who previously used X vs not; tenure | +| "Speeds up a slow flow" | Platform (mobile slower than web); connection type | +| "Lowers payment friction" | Plan tier; payment-method type; geography | +| "Replaces a confusing UI element" | New vs returning (returning users habituated) | +| "Surfaces a feature only relevant to power users" | Engagement-tier cohorts; tenure | +| "Localized copy / pricing change" | Country / language | + +If you can't articulate _why_ a segment should respond differently, it's not a hypothesis-driven slice. Demote it. + +### 3. Segments where the **denominator** plausibly differs + +Some properties don't change _behavior_ but change _who gets exposed_. Slicing on these helps catch changed-denominator artifacts before they're called a win. + +- Triggered vs untriggered cohorts (if the treatment only fires on certain pages). +- Platform / app version (the treatment may only ship on a subset of clients). +- Device class (mobile vs desktop) when the change is platform-specific. + +A 1000% lift in `Checkout Screen Viewed` overall usually disappears once you condition on "users who reached the checkout funnel" — that disappearance is the finding. + +### 4. 
Segments where SRM or baseline shift is suspected + +If overall SRM is borderline (or failing in one variant only), per-segment SRM can localize the bucketing bug to a specific platform / country / cohort. Examples: + +- iOS vs Android (often the SDK bucketing layer differs). +- Bot-suspicious countries (`bot_traffic` cause from health-check). +- A specific app version range that shipped a flag-evaluation change. + +This is diagnostic segmentation, not interpretation segmentation. Use it when the **trustworthiness gate** has already flagged trouble. + +### 5. Segments the platform de facto requires + +Some user dimensions are so foundational that any results report should mention them once: + +- **Platform** — web vs iOS vs Android. +- **New vs returning** — defined as first session within the experiment window vs before. +- **Geo region** — EU vs US vs APAC, when results meaningfully differ by regulatory or payment context. + +Don't include all three blindly — pick the one(s) most likely to vary given the change. + +--- + +## Sanity checks before committing to a slice + +For each segment you want to break down on: + +1. **Does each segment value have enough exposed users per variant to clear the platform's overall sufficiency threshold?** Below that, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment. +2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis. +3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. "Pro tier" boundaries changed mid-test) invalidate the comparison. +4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes — slicing on them is selection-bias, not stratification. 
+ +--- + +## How many slices to commit to + +| Situation | Number of slices | +| ----------------------------------------------------------------- | ------------------------------- | +| Hypothesis-driven, well-powered, decisional | 3–5 segments, named upfront | +| Exploratory ("anything weird?"), flagged as hypothesis-generating | Up to ~10, with explicit caveat | +| Diagnostic (chasing a failing SRM or strange overall result) | Whatever helps localize the bug | + +If the user wants to "just look at everything", push back: pick the top 3–5 with reasoning, then offer a separate exploratory pass that won't be used for the ship decision. + +--- + +## The pre-commit ritual + +Before running the breakdowns, tell the user something like: + +> _"Based on the hypothesis (``), I'd slice by `` and `` because ``. I'm intentionally not slicing `` because they don't connect to the proposed mechanism — looking at every dimension makes false positives almost guaranteed. We can do an exploratory pass after, separately from the ship decision. Sound right?"_ + +Pre-commitment is what separates "segmentation analysis" from "fishing." + +--- + +## Then read the results + +Once the segment breakdown is in hand, switch to [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there. diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/session-replay-analysis.md new file mode 100644 index 0000000..7282bb4 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/session-replay-analysis.md @@ -0,0 +1,109 @@ +# Session-Replay Analysis Guidance + +Turn a quantitative experiment result into a behavior story using session replays. + +> **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. 
Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. + +--- + +## When replays help, when they don't + +| Question | Replays help? | +| ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | +| "Why is conversion lower in treatment?" | Yes — behavior diff is observable. | +| "Why is `Checkout Screen Viewed` 10× higher in treatment?" (changed-denominator suspect) | Yes — replays show whether users are _bouncing_ or _converting_ after they get there. | +| "Why is `time_on_page` higher in treatment?" | Yes — distinguishes engaged reading vs confused dwell. | +| "Is the treatment shipping a regression on iOS only?" | Sometimes — better answered first by segment breakdown. | +| "Why is SRM failing?" | No — replays don't show bucketing. Go to health checks. | +| "What's the lift?" | No — replays are qualitative; they explain _why_, not what. | +| "Why hasn't this hit statsig yet?" | No — that's a sample/power question, not a behavior question. | + +A useful heuristic: replays answer _behavioral_ questions. If the question isn't behavioral, replays will burn time without adding signal. + +--- + +## Cohort selection: which replays to compare + +You're looking for **paired contrast**, not a random sample. Pick the cohort that maximizes signal for the specific question. + +| Question | Cohort A (replays to pull) | Cohort B (replays to pull) | +| -------------------------------------------------------------------- | ---------------------------------------------------------- | ----------------------------------------------------------- | +| Why is primary metric down in treatment? 
| Treatment users who **failed** the primary action | Control users who **succeeded** at the primary action | +| Why is a guardrail regression appearing? | Treatment users who **triggered** the guardrail negatively | Control users who did NOT trigger it | +| Why does treatment have a huge lift in `Screen Viewed` (denom shift) | Treatment users who reached the screen | Same users, looking at whether they completed the next step | +| Why is engagement higher / lower in a specific segment? | Treatment users in that segment | Control users in the same segment | +| What does the new UI look like in practice? | Any treatment users who saw the change | Any control users to confirm the baseline UI | + +**Aim for ~5 replays per cohort.** Fewer and you're anecdote-shopping; many more and you'll just confirm what the first 5 already showed. If the first 5 are inconclusive or contradictory, pull 5 more before changing tactics. + +Filter by recency — replays from the most recent days of the experiment best reflect steady-state behavior (avoid novelty / primacy noise). + +--- + +## What to actually watch for + +Go in with a hypothesis from the quantitative result. Don't watch replays blank-eyed; you'll see "users using the app" and learn nothing. + +### Friction / failure patterns + +- **Hesitation** — long pause before clicking a key element (often signals confusion). +- **Misclicks** — clicking non-interactive elements, or rage-clicking a button that didn't work. +- **Form abandonment** — typing into a field, then leaving without submitting. +- **Back-button bounce** — landing on the page, then immediately backing out. +- **Scroll-and-leave** — scrolling without engaging, then exiting. + +If treatment has more of these than control, you have a behavior explanation for a primary loss or guardrail regression. + +### Layout / discoverability issues + +- **CTA below the fold** — users never scrolling to where the new button is. 
+- **Element overlap on mobile** — the treatment looks fine in desktop testing but breaks on small screens. +- **Hidden state** — a tooltip / modal that fires once and is then gone, so the user never sees the key affordance. + +These usually explain segment heterogeneity (loss concentrated in mobile, or in a specific viewport size). + +### Changed-denominator behavior + +If you're investigating a Twyman's-Law-sized lift, look for: + +- **Users landing on the new screen and immediately leaving** — explains the inflated `Viewed` event without explaining real conversion. +- **Users completing the rest of the funnel at a much lower rate per-arrival** — explains why the headline metric grew but downstream metrics didn't follow. + +If treatment users _arrive_ at a screen more often but _complete_ at a lower per-arrival rate, the "lift" is a denominator artifact and the per-converter behavior is the real story. + +### Variant-specific UI issues + +- **Treatment showed the wrong copy / wrong asset** — surprisingly common; treatment shipped, but to a subset of routes only. +- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to [health-check-interpretation.md](health-check-interpretation.md). +- **Treatment fired twice / persisted state across sessions** — implementation regression. + +--- + +## How to frame the findings + +Replay analysis is qualitative. Be honest about that. + +- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in the experiment's results."_ +- ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict. + +Tie observations back to specific quantitative results from the experiment-details response. 
If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior. + +--- + +## What NOT to do + +- ❌ Use replays to override a clear quantitative verdict. If primaries say "ship" and replays look ugly, the ugliness might be edge cases — confirm with segment analysis first. +- ❌ Cherry-pick a single dramatic replay. n=1 is anecdote. +- ❌ Replace segment analysis with replays. Replays explain _behavior_; segments explain _who_. Different questions. +- ❌ Pull replays from broad cohorts ("all treatment users") — the contrast pair is what reveals signal. +- ❌ Spend more time on replays than on the headline interpretation. The decision tree comes first; replays are the explanation step after it. + +--- + +## Output shape + +1. **The quantitative result the replays are explaining** — link back to the specific metric and verdict. +2. **Cohorts watched** — what filters were applied to A and B, how many replays in each. +3. **Patterns observed**, with counts (e.g. "4 of 5 treatment replays showed X; 0 of 5 control replays did"). +4. **The explanation hypothesis** — careful to frame as hypothesis ("consistent with"), not as proof. +5. **Recommended next action** — usually one of: ship anyway (regression edge-case), iterate (fix the friction), kill (treatment is materially worse), or run a follow-up A/B with the fix. diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md new file mode 100644 index 0000000..37ec069 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md @@ -0,0 +1,115 @@ +# Why Hasn't This Reached Statistical Significance Yet? + +Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts. 
+ +The actual stop / extend math (sample size, power, MDE) is owned by the `design-experiment` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one. + +--- + +## First, rule out a broken result + +Inconclusive can mean two very different things: + +1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about. +2. **The result isn't trustworthy at all** — SRM failing, broken data, peeked frequentist, etc. — and "inconclusive" is the wrong frame entirely. + +Before answering "why no statsig?", run the **trustworthiness gate**. If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power. + +Also check: + +- The primary's lift is missing or null → no measurement, not "no effect." +- The primary is listed on the experiment but has no computed result (live or cached) → "no measurement," not "no effect." +- The live results carry an error block → results are stale or partial; resolve the backend issue before drawing power conclusions. + +--- + +## The five real reasons an experiment hasn't hit statsig + +Walk through these in order. The first one that explains the picture is usually right. + +### 1. Not enough sample yet (not enough exposures) + +**What to check**: per-variant exposure counts against the configured end target (sample size or duration, whichever the experiment was configured with), and which testing model the experiment is using. + +- **Sequential** + target not reached → genuinely too early. Recommend **WAIT**. +- **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe. +- Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5. 
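
The three branches above can be sketched as a small triage helper. This is an illustrative sketch only: the function name `triage_sample`, its arguments, and the `NEXT_REASON` label are hypothetical and not part of any Mixpanel API, and the comparison is a reading aid, not a replacement for the platform's own verdicts.

```python
from typing import Tuple


def triage_sample(observed: float, target: float, testing_model: str) -> Tuple[str, str]:
    """Return (recommendation, note) for reason 1: not enough sample yet.

    `observed` and `target` use whatever unit the experiment's end target
    was configured with (exposures or days). `testing_model` is assumed to
    be "sequential" or "frequentist". All names here are hypothetical.
    """
    if testing_model not in ("sequential", "frequentist"):
        raise ValueError(f"unknown testing model: {testing_model!r}")
    if observed >= target:
        # Target reached and still no significance: not a sample-size problem.
        return ("NEXT_REASON", "Target reached; move on to reasons 2-5.")
    if testing_model == "frequentist":
        # Stopping on an early look inflates the false-positive rate.
        return ("WAIT", "Too early; do not peek-and-call before the configured end.")
    return ("WAIT", "Genuinely too early; keep collecting exposures.")


recommendation, note = triage_sample(4200, 10000, "sequential")
print(recommendation, "-", note)
```

The inputs would come from the experiment's configured end target and its observed per-variant exposure counts, never from numbers made up for the occasion.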
+ +If exposures are falling short of plan because traffic dropped: surface that. Querying the exposure event with a date breakdown shows whether something changed mid-experiment. + +### 2. Observed effect is smaller than the MDE + +**What to check**: the lift on the primary metric, plus the MDE the user planned for (typically captured in the experiment's hypothesis/description, or recovered via the setup-side skill's power math). + +- Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1. +- Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options: + - **Accept the null** — at this size, the change isn't moving the metric. Document and move on. + - **Resize and rerun** — if a smaller effect would still be ship-worthy, re-run with a larger sample (lower MDE). +- Observed lift much **larger** than planned MDE but still not significant → unusual; likely high variance (see reason 3) or insufficient exposures (reason 1). + +### 3. Variance is too high (metric is too noisy) + +**What to check**: the metric's distribution type, plus whether CUPED and Winsorization are enabled. + +- **Gaussian** metric (revenue, time-on-page) with no Winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization on the next run. +- **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume. +- **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the absolute scale of detectable effects. Lifts near 50% rates are easiest; lifts near 0%/100% need much more sample. 
+- **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%. +- **CUPED enabled on a new-user-only cohort** → CUPED has no effect (no pre-exposure data exists). Not a misconfiguration to "fix," but variance reduction simply didn't happen. + +### 4. Traffic split is starving the variant + +**What to check**: the configured traffic split against the actual per-variant exposure counts. + +- Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue. +- Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later. +- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs to clear the platform's per-variant exposure floor in its own right. Adding arms costs power per-comparison. + +Never change traffic allocation mid-Frequentist test — it invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment. + +### 5. Exposure config is filtering more users than the user expects + +**What to check**: exposure event volume, any audience filters on the backing feature flag, and whether QA traffic is being excluded. + +- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query the exposure event to confirm how many users actually got exposed. +- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event. 
+- QA traffic isn't being excluded and you suspect internal traffic is dominating one variant → enable the QA exclusion on the next run (results then are cleaner but also smaller). + +**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md). + +--- + +## Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL? + +Once you know which reason fits, the recommendation almost picks itself. + +| Reason | Recommendation | +| -------------------------------------- | ------------------------------------------------------------------------------------------------------------ | +| Not enough sample yet, still ACTIVE | **WAIT.** Show projected end date based on observed traffic. | +| Not enough sample yet, concluded early | **EXTEND** (Frequentist: relaunch with longer planned duration; Sequential: resume if possible). | +| Effect << MDE | **ACCEPT NULL** if the planned MDE is the smallest ship-worthy effect; otherwise **BOOST POWER** and re-run. | +| Variance too high | **BOOST POWER**: enable CUPED, enable Winsorization, switch to a less noisy metric proxy. | +| Variant starved by traffic split | **EXTEND** (if remaining time is enough) or restart with rebalanced split. | +| Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. | +| Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. | + +When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). 
Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `design-experiment` skill for the power math. + +--- + +## What NOT to suggest + +- ❌ **Stop early on a favorable peek** in a Frequentist test — that's exactly the false-positive inflation problem. +- ❌ **Switch testing model mid-experiment** — restart, don't morph. +- ❌ **Add more primary metrics** to "fish" for a win — multiplies the family-wise FPR. If a single primary is inconclusive, more primaries make the picture worse, not better. +- ❌ **Re-run identical hypothesis on the same audience right after concluding "no effect"** — without a power change, you'll get the same answer. +- ❌ **Claim "no effect"** from an underpowered inconclusive result — the right framing is "the experiment wasn't sized to detect the effect we observed." + +--- + +## Output shape + +1. **The reason** (one of the five above), in one sentence. +2. **The evidence** — concrete numbers from the experiment (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%"). +3. **Recommendation** from the table above, with the specific experiment update or follow-up action. +4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment. diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md new file mode 100644 index 0000000..c370fc0 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md @@ -0,0 +1,129 @@ +--- +name: interpret-experiment +description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. 
Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `design-experiment` skill. +license: Apache-2.0 +--- + +# Interpret Experiment + +You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a verdict field is missing, say so — do not synthesize one from raw values. + +--- + +# Glossary + +Concepts the rest of this skill uses without redefining. + +- **Variant.** One arm of the experiment. The variant treated as the baseline is the **control**; the others are **treatments**. The platform marks which key is the control. +- **Primary / Guardrail / Secondary metric.** + - **Primary** — drives the ship decision. The platform applies multiple-testing correction across primaries when configured. + - **Guardrail** — a metric that must not regress; a guardrail loss vetoes a ship even when primaries win. + - **Secondary** — exploratory only, never decisional, no correction applied. +- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict. +- **Polarity.** Whether a movement is _good for the business_. Combines sign of lift with the metric's `direction` ("up" = bigger is better; "down" = smaller is better). See the **Polarity recipe** in Components. +- **Significance.** The platform's per-row classification: `YES_POSITIVE`, `YES_NEGATIVE`, or `NO`. Read from the response — do not recompute. +- **SRM (Sample Ratio Mismatch).** Variants received traffic in proportions that disagree with the configured split. **Kohavi's #1 trustworthiness check** — when SRM fails, downstream lift, p-values, and CIs cannot be trusted. 
+- **Retro A/A (pre-experiment bias).** Re-runs the comparison on the pre-exposure period. A failure means cohorts already differed before treatment started. +- **Twyman's Law.** "Any unusually clean or unusually large result is more likely a bug than a discovery." Apply on lifts > ~30% — usually a changed-denominator artifact. +- **CUPED.** Variance reduction using pre-exposure baseline. Cuts required sample 30–70% when it applies. Inert on new-user-only cohorts. +- **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95. +- **MDE (Minimum Detectable Effect).** The smallest lift the experiment was sized to detect. Set during experiment setup. +- **Trustworthiness gate.** The pre-flight check that runs before any results interpretation: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference. + +--- + +# Components + +The pieces every interpretation uses. Defined here once so they don't drift across the steps and references. + +## Polarity recipe (load-bearing — apply on every metric row) + +The platform's summary buckets (`positive` / `negative` / `no`) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion. + +Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"): + +- `lift is None` or `lift == 0` → **neutral** (no measurement / no effect respectively). +- `direction == "up"` → **positive** if `lift > 0`, else **negative**. +- `direction == "down"` → **positive** if `lift < 0`, else **negative**. + +A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control. 
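
As a minimal sketch, the recipe above could be written as a single function. The name `polarity` is hypothetical, and treating `lift` and `direction` as plain values pulled from each metric row is an assumption about how the response gets unpacked, not a documented response shape.

```python
from typing import Optional


def polarity(lift: Optional[float], direction: str = "up") -> str:
    """Translate the sign of lift into business polarity.

    Returns "positive", "negative", or "neutral". `direction` defaults to
    "up" (bigger is better), matching the recipe's stated default.
    """
    if lift is None or lift == 0:
        # No measurement (None) or no effect (0): neither a win nor a loss.
        return "neutral"
    if direction == "up":
        return "positive" if lift > 0 else "negative"
    if direction == "down":
        return "positive" if lift < 0 else "negative"
    raise ValueError(f"unknown direction: {direction!r}")


# A positive lift on a "smaller is better" metric is a regression.
print(polarity(0.12, "down"))
```

Run it only on non-control rows. The control row has no lift against itself, so including it would add a meaningless "neutral" entry.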
+ +The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**. + +## Data-source fallback + +Experiment-details has two parallel data paths — live (preferred) and cached. Always prefer live; if live computation failed, fall back to cache with a staleness caveat; if **both** are empty, say "no result was computed" and recommend a re-sync. **Never** silently treat missing data as "no effect." + +## Verdict table + +| Situation | Recommendation | +| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Conclude the experiment via its decide lifecycle action, naming the winning variant and a rationale message. **Confirm with the user first — concluding is irreversible.** | +| Trust ✓, primary polarity positive, guardrail polarity negative | **ITERATE.** Investigate the regression; do not auto-ship. | +| Trust ✓, primary polarity neutral after target sample reached | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md). | +| Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)). | +| Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). | + +For multi-variant tests, special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), and the exact decide-call shape, see [references/lifecycle-handoff.md](references/lifecycle-handoff.md). + +--- + +# Steps + +Top-down: what to do, in order. 
+ +## 1. Fetch the experiment + +If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs. Accept the experiment by name or by ID; try ID match first, then case-insensitive name match. + +Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments. + +Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret. + +## 2. Run the trustworthiness gate (the Decision Tree) + +Run steps 2a–2e in order. **Stop at the first failure** — do not proceed if a step flags a problem. The platform attaches verdict fields for each check; consume those verdicts rather than recomputing. + +### 2a. Trustworthiness + +SRM ok? Retro A/A clean? Exposures sufficient? Minimum duration met (~3 days)? No misconfiguration? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md). The Misconfigurations section in that reference covers the warning-level signals (multiple-testing off, extreme winsorization, CUPED on new-users-only, etc.). + +### 2b. Statistical significance + +Apply the **polarity recipe** from Components to each non-control variant × primary metric. If nothing is significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md). For translating a single metric's lift / CI / p-value into a phrase, see [references/per-metric-interpretation.md](references/per-metric-interpretation.md). + +### 2c. Guardrail check + +Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a row in `summary.positive` for a `direction: "down"` guardrail is still a regression. + +### 2d. 
Practical significance + +Convert lift into absolute terms — multiply by the control baseline. Statistically significant ≠ ships. The per-metric reference covers the baseline-fetch fallback when `value` or `sampleSize` is missing, and the **Twyman's Law** check for any lift > ~30%. + +### 2e. Verdict + +Look up the situation in the **Verdict table** in Components. If the recommendation is SHIP or KILL, surface the proposed decide-action parameters and **wait for explicit user confirmation** before executing — concluding an experiment is irreversible. + +## 3. Going deeper (open references on demand) + +| User asks about… | Open | +| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | +| SRM failing, Retro A/A failing, exposures insufficient, or any trustworthiness fail | [references/health-check-interpretation.md](references/health-check-interpretation.md) | +| "Translate this lift / CI / p-value into English" | [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | +| "Why hasn't this hit statsig yet? Should we wait or stop?" | [references/why-no-statsig.md](references/why-no-statsig.md) | +| "Which segments should I break this down on?" | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | +| "What does this segment-by-segment result mean?" | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | +| "Can session replays help explain this result?" | [references/session-replay-analysis.md](references/session-replay-analysis.md) | +| "How do I actually conclude this experiment? Multi-variant ship?" | [references/lifecycle-handoff.md](references/lifecycle-handoff.md) | + +## 4. Output + +Default to this shape unless the user asks for something else: + +1. 
**Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`. +2. **Why**, walking through the trustworthiness-gate steps that mattered (skip steps that were clearly fine). +3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, each polarity-corrected. Include absolute-impact translation for any win. +4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, stale-cache caveat, etc. +5. **Suggested next action** — for SHIP / KILL, the proposed decide-action parameters **gated on user confirmation**; for ITERATE / WAIT, the investigation to run next. + +If experiment details are unavailable or return errors, say so — do not invent a verdict. diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md new file mode 100644 index 0000000..1467468 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md @@ -0,0 +1,176 @@ +# Health-Check Interpretation + +Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action. + +--- + +## Kohavi framing — always cite when a health check fails + +> **Sample Ratio Mismatch is the #1 trustworthiness check (Kohavi).** When SRM is failing, do not trust the experiment's lift, p-values, or confidence intervals — the randomization assumption is broken, so the measured effect cannot be attributed to the treatment. +> +> **Twyman's Law**: any unusually clean or unusually large result is more likely a bug than a discovery. A spectacular lift on a failing-SRM experiment is not evidence of a great treatment; it's evidence the bucketing is broken. + +These two principles drive the recommendations below. Lead with them when explaining a failing check to the user. 
+ +--- + +## 1. SRM (Sample Ratio Mismatch) + +**What the platform tells you**: the SRM verdict the experiment-details response carries (live, or cached when live isn't available). The platform tags failing SRMs already — consume the verdict, do not compute chi-square yourself. + +### What it means + +Users were assigned to variants in proportions that disagree with the configured target allocation. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness. + +### Likely causes, ordered most → least likely + +(Surface in this order — investigate the most probable first.) + +1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees. +2. **biased_assignment** — The assignment criterion correlates with the variant — e.g. assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window. +3. **bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation. +4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the exposure event fires exactly once per user per variant assignment. +5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period. + +### Recommended actions + +- **pause_and_investigate** — Pause the experiment before drawing any conclusions. 
SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable. +- **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric. +- **investigate_exposure_logging** — Compare exposure event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs. +- **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split. + +### Investigation checklist + +1. Compare the actual per-variant exposure ratio to the configured target allocation — which variant is over/under-represented? +2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history. +3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math. +4. Verify SDK version and bucketing logic. Query the exposure event grouped by variant to confirm exposure events are flowing correctly. +5. Check for bot/QA traffic — bots often skew toward control. If QA traffic isn't being excluded, recommend enabling that filter. +6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting. +7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.** + +--- + +## 2. Retro A/A (pre-experiment bias) failure + +**What the platform tells you**: the pre-experiment-bias analysis the platform attaches when that check is enabled in the experiment's settings. 
+ +### What it means + +The same statistical comparison run on the **pre-exposure** period revealed that variant cohorts already differed _before_ the treatment started. Any "lift" measured during the experiment may just be reflecting that pre-existing gap, not the change. + +- Pre-experiment bias on a **primary** metric is a **stop-and-investigate** signal. +- Pre-experiment bias on a **secondary** metric is informational only. + +### Investigation checklist + +1. Identify which metric × variant pair triggered the failure (after the platform's correction). +2. Check whether bucketing was deterministic — non-deterministic assignment in the pre-period means users were assigned to different variants than they would have been in production. +3. Look for cohort skew: did one variant disproportionately receive heavy users? Query the metric pre-experiment grouped by variant to confirm. +4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort. +5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing. + +--- + +## 3. Insufficient exposures + +**What the platform tells you**: per-variant exposure counts plus an "insufficient" flag when the count is too low to trust. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue. + +### Investigation checklist + +1. Check per-variant exposure totals — which variant is undersampled? +2. Inspect feature-flag rollout — was rollout dialed back? +3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?). +4. If the experiment is still ACTIVE: extend duration via an experiment update with a new end target. +5. If the experiment concluded too early: relaunch with longer planned duration. 
The setup-side skill covers the power-analysis math. + +If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question. + +--- + +## 4. Frequentist peeking + +**What to check**: the experiment's testing model and whether it ended before reaching its configured end condition (sample size or duration, whichever was configured). + +### What it means + +A frequentist test that ends before reaching its configured target has an **inflated false-positive rate**. The math assumes a fixed sample size; peeking before that point and stopping on a favorable look is exactly what "p-hacking" looks like in production. + +### Investigation checklist + +1. Confirm the testing model is frequentist (sequential tests don't have this problem). +2. Compare the actual end date against the planned end (date- or sample-based, whichever the experiment was configured with). +3. If the conclusion was premature: results have inflated false-positive rate. Recommend a re-run. +4. If the user wants to keep current results: caveat strongly. Recommend a sequential testing model for the next experiment so they can stop early without penalty. + +(Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.) + +--- + +## 5. Live computation timeout / broken data + +**What the platform tells you**: a non-null error block on the live results, with the live data path empty. + +### Investigation checklist + +1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy. +2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. +3. 
Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision. +4. If the cache is recent (within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or empty, the user must resolve the backend issue before any meaningful interpretation. + +--- + +## 6. Experiment ran < 3 days + +**What to compute (this one is local)**: the elapsed time between the experiment's start and end. + +Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. **Refuse to interpret.** Tell the user explicitly: + +> _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_ + +If the experiment was sample-size-bounded and a tiny target was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window. + +--- + +## 7. Misconfigurations + +These don't always invalidate results, but they change how to _read_ them. Surface them as warnings during the trustworthiness gate. + +### Multiple-testing correction off with several primaries + +**Correction off AND 2+ primaries × 1+ non-control variants.** Any single significant primary may be a false positive — family-wise error rate scales multiplicatively (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). Look at primaries in aggregate: if most point the same direction, the effect is likely real; if only one or two of many are significant, recommend enabling Benjamini-Hochberg or Bonferroni and re-analyzing. 
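
The family-wise figure quoted above can be sanity-checked directly. This is a minimal sketch, assuming every comparison is independent and tested at the same α, so the chance of at least one false positive is 1 − (1 − α)^m. The function name is illustrative and is not part of any platform API.

```python
def family_wise_error_rate(alpha: float, primaries: int, treatment_variants: int) -> float:
    """Chance of at least one false positive across all uncorrected comparisons.

    Assumption: comparisons are independent and each is tested at the same alpha.
    """
    comparisons = primaries * treatment_variants
    return 1.0 - (1.0 - alpha) ** comparisons


# The example from the text: 15 primaries x 1 non-control variant at alpha = 0.05.
print(f"{family_wise_error_rate(0.05, 15, 1):.0%}")  # about 54%
```

Correlated metrics make the true rate lower than this independent-comparisons figure, so treat it as a rough upper guide when deciding whether to recommend enabling correction.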
+ +### Extreme winsorization percentile + +**Winsorization enabled with a percentile far from the platform default (typically 95).** A percentile near 50 caps almost all data — almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason. + +### SRM check disabled + +**SRM check is off.** Often deliberate — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) and then recommend re-enabling SRM and re-analyzing. + +### CUPED on new-users-only cohort + +**CUPED enabled AND the cohort is "new users only".** CUPED needs pre-exposure data, so it had no effect here — but **results are still valid**, variance reduction just didn't happen. Mention as informational. For future experiments on this surface, suggest extending the cohort to include returning users so CUPED can apply. + +### Non-default confidence level + +**Confidence level differs from the platform default (typically 0.95).** `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out in the verdict and combine with metric count to estimate the family-wise error rate. + +### Broken or placeholder metric entries + +**Metric entries with empty names.** Likely broken or placeholder references. Flag and skip during analysis. + +### Primary metric with no computed result + +**A metric is listed as primary but has no result (live or cached).** This is **"no measurement," not "no effect."** Surface prominently; recommend re-syncing results before any conclusion that depends on this primary. + +--- + +## Output shape when a health check fails + +1. **What failed**, in one sentence (use the verdict the platform attached — do not re-derive). +2. **What that means for trust** — cite the Kohavi framing (SRM is #1) or Twyman's Law where it fits. +3. 
**Likely causes**, ordered most → least probable. +4. **Recommended action** from the small set above. +5. **Investigation checklist** the user can run. +6. **What NOT to do** — usually, "do not act on the current lift / p-value numbers." diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md new file mode 100644 index 0000000..3a9e24c --- /dev/null +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md @@ -0,0 +1,39 @@ +# Lifecycle Hand-off + +How to conclude an experiment once the verdict is settled. This reference is **interpretation guidance** — the per-field schema of the decide action lives in the experiment-update tool description. + +--- + +## Confirm before concluding — always + +Concluding an experiment is **irreversible**. Before invoking the decide action, surface the proposed parameters to the user (winning variant, success/fail, rationale message) and wait for explicit confirmation. A SHIP verdict is a recommendation, not an authorization. + +## The three pieces every decide call needs + +A decide call expresses three things: + +1. **Did the experiment succeed?** A win for one of the treatments, or a deliberate stop. +2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special constants below. +3. **Why?** A rationale message — what metrics were evaluated, the polarity reading, the tradeoffs accepted. The platform requires this on every decide call; treat it as a one-paragraph decision record, not a placeholder. + +## Special variant choices for success + +When you have a winning result but no single variant to ship: + +- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. (The platform exposes this as the constant `__no_variant_shipped__`.) 
+- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The platform exposes this as `__defer_variant_decision__` and shows `SUCCESS_DEFERRED` in the UI.) + +When the verdict is KILL — no winner — record success as false. No variant key is needed in that case. + +## Multi-variant experiments + +For a 3+ arm test, the decide action still names a single winning variant. If two treatments are roughly tied: + +- If both clear the practical-significance bar and shipping either is acceptable, pick on simplicity (smaller diff from control, lower implementation cost). +- If the team genuinely cannot pick, use the defer constant above — better than fabricating a winner. + +A multi-variant test where only one treatment is significantly different from control is a clean SHIP for that variant; the inconclusive arms are simply not the winner. + +## After concluding + +The decision record — the rationale message, the shipped variant, and the experiment's terminal status — becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer. diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md new file mode 100644 index 0000000..e46381c --- /dev/null +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md @@ -0,0 +1,167 @@ +# Per-Metric Interpretation + +Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_ + +--- + +## The mental model + +Each row in `summary.positive` / `summary.negative` / `summary.no` answers four questions: + +1. **Did the lift go up or down?** — the `summary` bucket name (sign-of-lift, not polarity). +2. 
**Was the change distinguishable from noise?** — the `significance` field (or the bucket name itself: rows in `summary.positive` / `summary.negative` are significant, rows in `summary.no` are not). +3. **Was the change in the goal direction?** — apply the polarity recipe with `metric.direction`. +4. **Was the change big enough to matter?** — multiply `lift` by the control baseline `value` to get absolute impact, then judge against business context. + +A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any one of those and you're shipping the wrong thing. + +--- + +## Polarity recipe + +Apply the polarity recipe from the spine — see the **Components** section of `SKILL.md`. Treat the bucket name in `summary.positive` / `summary.negative` as sign-of-lift only; the business verdict comes from combining it with `metric.direction`. Examples worth remembering: + +- A row in `summary.positive` with `direction: "down"` is a **regression**. +- A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1% interstitials_shown` lift means less interruption). + +--- + +## Reading the p-value in this platform + +Mixpanel runs a frequentist comparison at the experiment's configured confidence level — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). + +The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread. + +For the general meaning of a p-value (the probability under the null), trust the model's baseline knowledge — don't invent thresholds in either direction. + +--- + +## Reading the lift correctly + +``` +lift = (treatment_mean - control_mean) / control_mean +``` + +- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. 
The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct. +- If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect." + +--- + +## Verdict phrasing — a small palette + +Pick the phrase that matches the four-question pattern. These are the words to use with users; they map onto the platform's already-computed numbers, so the agent never has to invent thresholds. + +| Pattern (sig × polarity × magnitude) | Plain-language verdict | +| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Significant, polarity positive, magnitude large vs baseline | "**Clear win** — `` moved `` in the goal direction, which is meaningful at this baseline." (apply Twyman's Law if lift > ~30%) | +| Significant, polarity positive, magnitude small vs baseline | "**Statistically significant but practically small** — `` on a `` baseline is ``; confirm with the user whether that clears the business bar." | +| Significant, polarity negative | "**Regression** — `` moved `` against its goal direction. This is a reason not to ship even if other primaries won." | +| Not significant, lift in goal direction, well-powered | "**Likely no effect at the detectable size.** The experiment had enough power to detect ``; the observed lift is below that threshold." | +| Not significant, lift in goal direction, underpowered | "**Inconclusive — too underpowered to call.** Route to the why-no-statsig playbook to decide between wait / extend / restart." | +| Not significant, lift in wrong direction | "**No detectable harm**, but no win either." | +| `lift is None` | "**No measurement** — this variant's row failed to compute. Surface the failure and re-sync." 
| +| Lift > ~30% on any metric | Prefix with "**Twyman's Law check:** that lift is unusually large; verify the denominator hasn't changed before celebrating." | + +--- + +## Magnitude — make it absolute + +Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful: + +1. Baseline from the control variant's metric value (the experiment-details response carries it on the per-variant row). +2. Lift from the winning row. +3. Absolute lift: `baseline × lift`. Examples: + - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate. + - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`. +4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week." + +### Fallback when the baseline value or sample size is missing + +Common — happens whenever live computation timed out or the cached results were nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.** + +Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation: + +- `unique` (Bernoulli) → conversion **rate** as the baseline. +- `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. Multiplying lift by a raw total double-counts cohort size. + +--- + +## Twyman's Law in practice — changed-denominator lifts + +Before celebrating any lift > ~30%, ask: **did the treatment change who is _exposed_ to this metric, not just how they behave?** + +If the treatment causes more users to _see_ a screen, more events naturally fire — the metric grows because the denominator changed, not because per-user behavior changed. 
+ +- A "Free item" promotion drives more users to checkout → "Checkout Screen Viewed" lifts +1000% mechanically. The interesting question is **conversion rate on the screen**, not raw views. +- A new banner makes a feature discoverable → "Feature Page Viewed" lifts dramatically. **Per-discover-er behavior** may be unchanged. + +When you see a > 30% lift, name the risk explicitly: + +> _"This metric measures exposure to the screen/event. The treatment likely caused more users to be exposed; that explains most of the lift mechanically. The interesting question is what those users did once they got there."_ + +--- + +## Metric distribution types + +Different metric types behave differently; cite the relevant nuance in your verdict. + +| Metric type | Distribution | Interpretation nuance | +| -------------------------------- | ------------ | --------------------------------------------------------------------------------------------------------- | +| Unique users / conversion rate | Bernoulli | Variance = `p(1−p)`. Lift on rates near 50% is most powered; rates near 0% or 100% need much more sample. | +| Event counts / sessions per user | Poisson | Variance = mean. Highly sensitive to power users; consider whether one heavy user can swing results. | +| Revenue / numeric properties | Gaussian | Long tails (whales) inflate variance. Strongly consider Winsorization. | + +--- + +## Variance-reduction & outlier settings that change interpretation + +- **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). +- **Winsorization enabled**: extreme values capped at the configured percentile, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. 
A percentile much lower than the platform default (typically 95) is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). + +--- + +## Multiple comparisons & metric tiers — what's decisional and what isn't + +| Tier | How it influences the verdict | +| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Primary** | **Decisional.** The platform auto-applies correction when the experiment is configured for Bonferroni or Benjamini-Hochberg (across primaries × variants). | +| **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude. | +| **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. | + +If multiple-testing correction is off AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled. + +--- + +## When a primary metric is inconclusive + +A "not significant" verdict means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level — **not that there is no effect.** Important when the user is about to call something a null result. + +For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see [why-no-statsig.md](why-no-statsig.md). 
+ +--- + +## Frequentist vs Sequential — what affects per-metric reading + +Concluding a Frequentist experiment before it reaches its configured target is a peeking event — per-metric significance verdicts become unreliable. Sequential experiments are designed for continuous monitoring and don't have this problem. + +For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of [health-check-interpretation.md](health-check-interpretation.md). + +--- + +## Triggered analysis & dilution + +If the change only affects a subset of users (e.g. only triggers when a specific button is shown), the **effect on triggered users** is much larger than the **effect on the full exposed population**. + +- Triggered analysis zooms in on users who actually saw the change. +- Dilution math: `population_lift = triggered_lift × (triggered_users / total_exposed)`. + +The platform doesn't auto-compute triggered analysis. If the change is gated by a condition, ask the user about the trigger rate and walk through the math before declaring the population-level lift "small." + +--- + +## Novelty and primacy + +- **Novelty** — lift is large early, then decays as users habituate. +- **Primacy** — lift is small or negative early, then grows as users learn the new behavior. + +To detect either, look at the line-chart view of the metric (date-segmented). A monotonic decay from day 1 → day 14 is classic novelty; the steady-state lift is what matters for shipping. Call this out when interpreting any experiment shorter than ~2 weeks. 
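
As a closing worked example for this reference, the absolute-lift conversion from the magnitude section and the dilution formula from the triggered-analysis section can be sketched in a few lines. The function names and the trigger-rate numbers are hypothetical; only the two formulas come from the text above.

```python
def absolute_lift(baseline: float, lift: float) -> float:
    """Convert a relative lift into the metric's own units (baseline x lift)."""
    return baseline * lift


def population_lift(triggered_lift: float, triggered_users: int, total_exposed: int) -> float:
    """Dilute a triggered-user lift across everyone exposed to the variant."""
    if total_exposed <= 0:
        raise ValueError("total_exposed must be positive")
    return triggered_lift * (triggered_users / total_exposed)


# Example from the magnitude section: 2% baseline conversion, 4% relative lift.
print(round(absolute_lift(0.02, 0.04) * 100, 4))  # 0.08 percentage points

# Assumed trigger rate: a 20% lift among the 10% of exposed users who saw the change.
print(round(population_lift(0.20, 1_000, 10_000), 4))  # 0.02, i.e. 2% across all exposed users
```

For `total` metrics, pass the per-exposure average as `baseline`, not the raw total, as the fallback section describes.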
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md new file mode 100644 index 0000000..98c7bbc --- /dev/null +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md @@ -0,0 +1,99 @@ +# Segment-Breakdown Interpretation + +Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. + +--- + +## The mental model + +A segment breakdown asks: _did the treatment affect different user segments differently?_ It has three possible outcomes per segment: + +1. **The segment moved in the same direction as the overall effect**, with similar magnitude → reinforces the overall verdict; nothing new. +2. **The segment moved much more or less than overall**, but in the same direction → heterogeneity; the effect is concentrated in a subset. +3. **The segment moved in the _opposite_ direction** to overall → Simpson's paradox or a real reversal — this is where segment analysis earns its keep. + +Reading a segment breakdown well means recognizing which of those three you're looking at and not mistaking noise for any of them. + +--- + +## Per-segment polarity recipe — apply per row + +The same recipe from the per-metric reference applies _inside_ each segment. Don't take a shortcut. + +- For each segment × metric × non-control variant, look at the row's `lift` and bucket (positive/negative/no). +- Translate sign-of-lift into business polarity using `metric.direction`. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary. +- Filter out the control row in each segment. + +Surprisingly easy to forget when you're scanning a wide table — re-apply polarity per row. 
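
The per-row recipe above can be sketched as a small function. This is a minimal illustration, assuming each row has already been reduced to its bucket name (`positive`, `negative`, or `no`) and the metric's `direction`. The value `"up"` is an assumed counterpart to the `"down"` value shown in the per-metric reference, and the function name and return strings are hypothetical, not platform output.

```python
def polarity_reading(bucket: str, direction: str) -> str:
    """Combine a sign-of-lift bucket with the metric's goal direction."""
    if bucket not in {"positive", "negative", "no"}:
        raise ValueError(f"unknown bucket: {bucket!r}")
    if direction not in {"up", "down"}:  # "up" is an assumed value
        raise ValueError(f"unknown direction: {direction!r}")
    if bucket == "no":
        return "not significant"
    moved_up = bucket == "positive"
    wants_up = direction == "up"
    return "goal direction" if moved_up == wants_up else "against goal direction"


# A lift that is positive in sign on a metric that should go down is a regression.
print(polarity_reading("positive", "down"))  # against goal direction
print(polarity_reading("negative", "down"))  # goal direction
```

The result covers polarity only. Whether a "goal direction" row counts as a win still depends on magnitude and on the segment's sample size.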
+ +--- + +## Sample-size floor per segment + +Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. The platform surfaces an "insufficient exposures" flag at the overall level — trust that signal over a hand-rolled threshold, and apply the same logic per segment. + +- Segments the platform would flag as insufficient if the experiment were scoped to them alone → mark "insufficient sample, treat as directional only." +- A "significant" lift on a tiny per-variant segment (e.g. tens of users) is almost always noise. Say so. +- If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice. + +--- + +## Heterogeneity vs Simpson's paradox vs noise + +| What you see | Interpretation | +| --------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Most segments lift positive, one or two negative, all with overlapping CIs | **Noise.** Not heterogeneity. Don't ship a segment-specific story. | +| One segment lifts much more than the rest, with a tight CI and a clear mechanism | **Real heterogeneity.** The change is concentrated in that segment. Consider shipping only to that segment, or revising the hypothesis. | +| Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. | +| Two opposite-direction effects in different segments that roughly cancel overall | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses.
| + +When you spot Simpson's paradox, route the user to the **SRM** section of [health-check-interpretation.md](health-check-interpretation.md) — bucketing is usually the cause, not a real reversal. + +--- + +## What a "ship only to segment X" recommendation requires + +Don't recommend a segment-scoped ship unless **all** of these hold: + +1. The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it). +2. The segment's per-variant sample clears whatever exposure floor the platform applies to the overall experiment, by a comfortable margin. +3. The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment. +4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product. +5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply. + +Otherwise, the segment-only ship is a post-hoc story dressed up as a decision. Recommend confirming with a follow-up experiment scoped to that segment. + +--- + +## When a segment loses but overall wins + +This is the everyday case of mixed effects. + +- If the losing segment is small and its absolute hit is acceptable, ship to all — but call out the loser in the rationale. +- If the losing segment is large or has a guardrail regression, recommend iterate, not ship. +- If the losing segment is a regulated / strategic cohort (paying tier, top customers, EU), default to iterate — guardrails on the cohort, not just overall. + +--- + +## What NOT to do + +- ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition. +- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment query fallback — they're not corrected unless the platform did it. 
+- ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal. +- ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism. +- ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence). + +--- + +## Output shape + +1. **One-sentence segment-level summary** — homogeneous, heterogeneous, or Simpson's-suspicious. +2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered). +3. **What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's." +4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating). + +--- + +## Platform support status + +Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the rules above. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts. 
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-of-interest-selection.md new file mode 100644 index 0000000..4db49ac --- /dev/null +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-of-interest-selection.md @@ -0,0 +1,116 @@ +# Segment-of-Interest Selection + +Pick 3–5 segments **likely to reveal a real effect difference** before slicing every available dimension and ending up p-hacking. + +The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them. + +--- + +## Why this matters: the fishing-expedition problem + +If you slice an experiment by every available property (10 platforms × 20 countries × 5 plan tiers × …), you will find "significant" segment-level effects by chance alone. The family-wise false positive rate explodes the same way it does for too many primary metrics — except there's usually no platform-level correction across segments. **Pre-committing to a small set of segments, ordered by hypothesis-driven probability, is the discipline that makes segment analysis credible.** + +Aim for 3–5 segments, max. If the user wants more, ask which ones are connected to the hypothesis and which are exploration. Mark the exploration set as "hypothesis-generating, not decisional." + +--- + +## The decision tree for picking segments + +Walk through these in order. The first match is the most defensible pick. + +### 1. Segments the hypothesis explicitly names + +If the experiment's `hypothesis` (or `description`) text mentions "new users", "mobile", "Pro tier", "EU customers" — those segments are pre-committed by the experiment design. Always include them. 
+ +Look at: + +- `experiment.hypothesis` +- `experiment.description` +- The setup-side conversation, if present + +These are not exploratory; they're the variables the team committed to test. + +### 2. Segments where the mechanism is expected to matter + +The hypothesis names _what_ the change is and (ideally) _why_ it should work. The "why" tells you which user attributes plausibly moderate the effect: + +| Hypothesis mechanism | Segments likely to moderate the effect | +| ------------------------------------------------- | -------------------------------------------------- | +| "Reduces first-time friction in onboarding" | New vs returning; signup source; locale | +| "Improves discoverability of feature X" | Users who previously used X vs not; tenure | +| "Speeds up a slow flow" | Platform (mobile slower than web); connection type | +| "Lowers payment friction" | Plan tier; payment-method type; geography | +| "Replaces a confusing UI element" | New vs returning (returning users habituated) | +| "Surfaces a feature only relevant to power users" | Engagement-tier cohorts; tenure | +| "Localized copy / pricing change" | Country / language | + +If you can't articulate _why_ a segment should respond differently, it's not a hypothesis-driven slice. Demote it. + +### 3. Segments where the **denominator** plausibly differs + +Some properties don't change _behavior_ but change _who gets exposed_. Slicing on these helps catch changed-denominator artifacts before they're called a win. + +- Triggered vs untriggered cohorts (if the treatment only fires on certain pages). +- Platform / app version (the treatment may only ship on a subset of clients). +- Device class (mobile vs desktop) when the change is platform-specific. + +A 1000% lift in `Checkout Screen Viewed` overall usually disappears once you condition on "users who reached the checkout funnel" — that disappearance is the finding. + +### 4. 
Segments where SRM or baseline shift is suspected + +If overall SRM is borderline (or failing in one variant only), per-segment SRM can localize the bucketing bug to a specific platform / country / cohort. Examples: + +- iOS vs Android (often the SDK bucketing layer differs). +- Bot-suspicious countries (`bot_traffic` cause from health-check). +- A specific app version range that shipped a flag-evaluation change. + +This is diagnostic segmentation, not interpretation segmentation. Use it when the **trustworthiness gate** has already flagged trouble. + +### 5. Segments the platform de facto requires + +Some user dimensions are so foundational that any results report should mention them once: + +- **Platform** — web vs iOS vs Android. +- **New vs returning** — defined as first session within the experiment window vs before. +- **Geo region** — EU vs US vs APAC, when results meaningfully differ by regulatory or payment context. + +Don't include all three blindly — pick the one(s) most likely to vary given the change. + +--- + +## Sanity checks before committing to a slice + +For each segment you want to break down on: + +1. **Does each segment value have enough exposed users per variant to clear the platform's overall sufficiency threshold?** Below that, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment. +2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis. +3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. "Pro tier" boundaries changed mid-test) invalidate the comparison. +4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes — slicing on them is selection-bias, not stratification. 
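
The first sanity check above can be sketched in a few lines. This is a minimal illustration, not the platform's logic: `check_segment_sufficiency`, the input shape, and the `MIN_EXPOSED_PER_VARIANT` value are all hypothetical, and the real floor should come from the platform's own sufficiency threshold rather than a hard-coded number.

```python
from typing import Dict, List, Tuple

# Hypothetical floor for illustration only; use the platform's own threshold.
MIN_EXPOSED_PER_VARIANT = 1000


def check_segment_sufficiency(
    exposures: Dict[str, Dict[str, int]],
    variants: List[str],
    min_per_variant: int = MIN_EXPOSED_PER_VARIANT,
) -> Tuple[List[str], List[str]]:
    """Split segment values into (sufficient, insufficient).

    exposures maps segment value -> {variant name -> exposed user count}.
    A segment value is sufficient only if every expected variant clears
    the floor; a variant missing from the counts is treated as zero.
    """
    sufficient: List[str] = []
    insufficient: List[str] = []
    for segment_value, by_variant in exposures.items():
        smallest = min(by_variant.get(v, 0) for v in variants)
        if smallest >= min_per_variant:
            sufficient.append(segment_value)
        else:
            insufficient.append(segment_value)
    return sufficient, insufficient
```

Segment values that land in the insufficient list are candidates for pooling or for extending the experiment, as the check describes; the sketch deliberately does not decide which.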
+ +--- + +## How many slices to commit to + +| Situation | Number of slices | +| ----------------------------------------------------------------- | ------------------------------- | +| Hypothesis-driven, well-powered, decisional | 3–5 segments, named upfront | +| Exploratory ("anything weird?"), flagged as hypothesis-generating | Up to ~10, with explicit caveat | +| Diagnostic (chasing a failing SRM or strange overall result) | Whatever helps localize the bug | + +If the user wants to "just look at everything", push back: pick the top 3–5 with reasoning, then offer a separate exploratory pass that won't be used for the ship decision. + +--- + +## The pre-commit ritual + +Before running the breakdowns, tell the user something like: + +> _"Based on the hypothesis (``), I'd slice by `` and `` because ``. I'm intentionally not slicing `` because they don't connect to the proposed mechanism — looking at every dimension makes false positives almost guaranteed. We can do an exploratory pass after, separately from the ship decision. Sound right?"_ + +Pre-commitment is what separates "segmentation analysis" from "fishing." + +--- + +## Then read the results + +Once the segment breakdown is in hand, switch to [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there. diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/session-replay-analysis.md new file mode 100644 index 0000000..7282bb4 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/session-replay-analysis.md @@ -0,0 +1,109 @@ +# Session-Replay Analysis Guidance + +Turn a quantitative experiment result into a behavior story using session replays. + +> **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. 
Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. + +--- + +## When replays help, when they don't + +| Question | Replays help? | +| ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | +| "Why is conversion lower in treatment?" | Yes — behavior diff is observable. | +| "Why is `Checkout Screen Viewed` 10× higher in treatment?" (changed-denominator suspect) | Yes — replays show whether users are _bouncing_ or _converting_ after they get there. | +| "Why is `time_on_page` higher in treatment?" | Yes — distinguishes engaged reading vs confused dwell. | +| "Is the treatment shipping a regression on iOS only?" | Sometimes — better answered first by segment breakdown. | +| "Why is SRM failing?" | No — replays don't show bucketing. Go to health checks. | +| "What's the lift?" | No — replays are qualitative; they explain _why_, not what. | +| "Why hasn't this hit statsig yet?" | No — that's a sample/power question, not a behavior question. | + +A useful heuristic: replays answer _behavioral_ questions. If the question isn't behavioral, replays will burn time without adding signal. + +--- + +## Cohort selection: which replays to compare + +You're looking for **paired contrast**, not a random sample. Pick the cohort that maximizes signal for the specific question. + +| Question | Cohort A (replays to pull) | Cohort B (replays to pull) | +| -------------------------------------------------------------------- | ---------------------------------------------------------- | ----------------------------------------------------------- | +| Why is primary metric down in treatment? 
| Treatment users who **failed** the primary action | Control users who **succeeded** at the primary action | +| Why is a guardrail regression appearing? | Treatment users who **triggered** the guardrail negatively | Control users who did NOT trigger it | +| Why does treatment have a huge lift in `Screen Viewed` (denom shift) | Treatment users who reached the screen | Same users, looking at whether they completed the next step | +| Why is engagement higher / lower in a specific segment? | Treatment users in that segment | Control users in the same segment | +| What does the new UI look like in practice? | Any treatment users who saw the change | Any control users to confirm the baseline UI | + +**Aim for ~5 replays per cohort.** Fewer and you're anecdote-shopping; many more and you'll just confirm what the first 5 already showed. If the first 5 are inconclusive or contradictory, pull 5 more before changing tactics. + +Filter by recency — replays from the most recent days of the experiment best reflect steady-state behavior (avoid novelty / primacy noise). + +--- + +## What to actually watch for + +Go in with a hypothesis from the quantitative result. Don't watch replays blank-eyed; you'll see "users using the app" and learn nothing. + +### Friction / failure patterns + +- **Hesitation** — long pause before clicking a key element (often signals confusion). +- **Misclicks** — clicking non-interactive elements, or rage-clicking a button that didn't work. +- **Form abandonment** — typing into a field, then leaving without submitting. +- **Back-button bounce** — landing on the page, then immediately backing out. +- **Scroll-and-leave** — scrolling without engaging, then exiting. + +If treatment has more of these than control, you have a behavior explanation for a primary loss or guardrail regression. + +### Layout / discoverability issues + +- **CTA below the fold** — users never scrolling to where the new button is. 
+- **Element overlap on mobile** — the treatment looks fine in desktop testing but breaks on small screens. +- **Hidden state** — a tooltip / modal that fires once and is then gone, so the user never sees the key affordance. + +These usually explain segment heterogeneity (loss concentrated in mobile, or in a specific viewport size). + +### Changed-denominator behavior + +If you're investigating a Twyman's-Law-sized lift, look for: + +- **Users landing on the new screen and immediately leaving** — explains the inflated `Viewed` event without explaining real conversion. +- **Users completing the rest of the funnel at a much lower rate per-arrival** — explains why the headline metric grew but downstream metrics didn't follow. + +If treatment users _arrive_ at a screen more often but _complete_ at a lower per-arrival rate, the "lift" is a denominator artifact and the per-converter behavior is the real story. + +### Variant-specific UI issues + +- **Treatment showed the wrong copy / wrong asset** — surprisingly common; treatment shipped, but to a subset of routes only. +- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to [health-check-interpretation.md](health-check-interpretation.md). +- **Treatment fired twice / persisted state across sessions** — implementation regression. + +--- + +## How to frame the findings + +Replay analysis is qualitative. Be honest about that. + +- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in the experiment's results."_ +- ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict. + +Tie observations back to specific quantitative results from the experiment-details response. 
If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either an unrepresentative cohort sample or a sign that behavior is more varied than the metric captures. + +--- + +## What NOT to do + +- ❌ Use replays to override a clear quantitative verdict. If primaries say "ship" and replays look ugly, the ugliness might be edge cases — confirm with segment analysis first. +- ❌ Cherry-pick a single dramatic replay. n=1 is an anecdote. +- ❌ Replace segment analysis with replays. Replays explain _behavior_; segments explain _who_. Different questions. +- ❌ Pull replays from broad cohorts ("all treatment users") — the contrast pair is what reveals signal. +- ❌ Spend more time on replays than on the headline interpretation. The decision tree comes first; replays are the explanation step after it. + +--- + +## Output shape + +1. **The quantitative result the replays are explaining** — link back to the specific metric and verdict. +2. **Cohorts watched** — what filters were applied to A and B, how many replays in each. +3. **Patterns observed**, with counts (e.g. "4 of 5 treatment replays showed X; 0 of 5 control replays did"). +4. **The explanation hypothesis** — framed as a hypothesis ("consistent with"), not as proof. +5. **Recommended next action** — usually one of: ship anyway (regression edge-case), iterate (fix the friction), kill (treatment is materially worse), or run a follow-up A/B with the fix. diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md new file mode 100644 index 0000000..37ec069 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md @@ -0,0 +1,115 @@ +# Why Hasn't This Reached Statistical Significance Yet? + +Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts. 
+ +The actual stop / extend math (sample size, power, MDE) is owned by the `design-experiment` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one. + +--- + +## First, rule out a broken result + +Inconclusive can mean two very different things: + +1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about. +2. **The result isn't trustworthy at all** — SRM failing, broken data, peeked frequentist, etc. — and "inconclusive" is the wrong frame entirely. + +Before answering "why no statsig?", run the **trustworthiness gate**. If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power. + +Also check: + +- The primary's lift is missing or null → no measurement, not "no effect." +- The primary is listed on the experiment but has no computed result (live or cached) → "no measurement," not "no effect." +- The live results carry an error block → results are stale or partial; resolve the backend issue before drawing power conclusions. + +--- + +## The five real reasons an experiment hasn't hit statsig + +Walk through these in order. The first one that explains the picture is usually right. + +### 1. Not enough sample yet (not enough exposures) + +**What to check**: per-variant exposure counts against the configured end target (sample size or duration, whichever the experiment was configured with), and which testing model the experiment is using. + +- **Sequential** + target not reached → genuinely too early. Recommend **WAIT**. +- **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe. +- Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5. 
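
The three branches above can be written down as a small lookup. This is a sketch under stated assumptions: `StatsModel`, `reason_one_recommendation`, and the idea of a single per-variant sample-size target are hypothetical names and simplifications (the configured end target may be a duration instead), and the function only routes between recommendations; it does not recompute any platform verdict.

```python
from enum import Enum
from typing import Dict


class StatsModel(Enum):
    """Testing model the experiment was configured with (hypothetical enum)."""

    SEQUENTIAL = "sequential"
    FREQUENTIST = "frequentist"


def reason_one_recommendation(
    model: StatsModel,
    exposures: Dict[str, int],
    target_per_variant: int,
) -> str:
    """Map reason 1 ("not enough sample yet") onto a recommendation.

    exposures maps variant name -> exposed user count. The target is
    treated as a per-variant sample size, which is an assumption here.
    """
    if target_per_variant <= 0:
        raise ValueError("target_per_variant must be positive")
    if not exposures:
        raise ValueError("exposures must contain at least one variant")

    # The slowest variant decides whether the target has been reached.
    target_reached = min(exposures.values()) >= target_per_variant
    if target_reached:
        return "NOT SAMPLE SIZE: target reached, move to reasons 2-5"
    if model is StatsModel.SEQUENTIAL:
        return "WAIT: genuinely too early"
    return "WAIT: run to the configured end, do not peek-and-call"
```

Using the minimum across variants is a design choice for the sketch: a skewed split can leave one arm short of target even when the total looks healthy.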
+ +If exposures are falling short of plan because traffic dropped: surface that. Querying the exposure event with a date breakdown shows whether something changed mid-experiment. + +### 2. Observed effect is smaller than the MDE + +**What to check**: the lift on the primary metric, plus the MDE the user planned for (typically captured in the experiment's hypothesis/description, or recovered via the setup-side skill's power math). + +- Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1. +- Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options: + - **Accept the null** — at this size, the change isn't moving the metric. Document and move on. + - **Resize and rerun** — if a smaller effect would still be ship-worthy, re-run with a larger sample (lower MDE). +- Observed lift much **larger** than planned MDE but still not significant → unusual; likely high variance (see reason 3) or insufficient exposures (reason 1). + +### 3. Variance is too high (metric is too noisy) + +**What to check**: the metric's distribution type, plus whether CUPED and Winsorization are enabled. + +- **Gaussian** metric (revenue, time-on-page) with no Winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization on the next run. +- **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume. +- **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the absolute scale of detectable effects. Lifts near 50% rates are easiest; lifts near 0%/100% need much more sample. 
+- **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%. +- **CUPED enabled on a new-user-only cohort** → CUPED has no effect (no pre-exposure data exists). Not a misconfiguration to "fix," but variance reduction simply didn't happen. + +### 4. Traffic split is starving the variant + +**What to check**: the configured traffic split against the actual per-variant exposure counts. + +- Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue. +- Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later. +- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs to clear the platform's per-variant exposure floor in its own right. Adding arms costs power per-comparison. + +Never change traffic allocation mid-Frequentist test — it invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment. + +### 5. Exposure config is filtering more users than the user expects + +**What to check**: exposure event volume, any audience filters on the backing feature flag, and whether QA traffic is being excluded. + +- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query the exposure event to confirm how many users actually got exposed. +- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event. 
+- QA traffic isn't being excluded and you suspect internal traffic is dominating one variant → enable the QA exclusion on the next run (results then are cleaner but also smaller). + +**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md). + +--- + +## Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL? + +Once you know which reason fits, the recommendation almost picks itself. + +| Reason | Recommendation | +| -------------------------------------- | ------------------------------------------------------------------------------------------------------------ | +| Not enough sample yet, still ACTIVE | **WAIT.** Show projected end date based on observed traffic. | +| Not enough sample yet, concluded early | **EXTEND** (Frequentist: relaunch with longer planned duration; Sequential: resume if possible). | +| Effect << MDE | **ACCEPT NULL** if the planned MDE is the smallest ship-worthy effect; otherwise **BOOST POWER** and re-run. | +| Variance too high | **BOOST POWER**: enable CUPED, enable Winsorization, switch to a less noisy metric proxy. | +| Variant starved by traffic split | **EXTEND** (if remaining time is enough) or restart with rebalanced split. | +| Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. | +| Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. | + +When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). 
Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `design-experiment` skill for the power math. + +--- + +## What NOT to suggest + +- ❌ **Stop early on a favorable peek** in a Frequentist test — that's exactly the false-positive inflation problem. +- ❌ **Switch testing model mid-experiment** — restart, don't morph. +- ❌ **Add more primary metrics** to "fish" for a win — this inflates the family-wise false-positive rate (FPR). If a single primary is inconclusive, more primaries make the picture worse, not better. +- ❌ **Re-run an identical hypothesis on the same audience right after concluding "no effect"** — without a power change, you'll get the same answer. +- ❌ **Claim "no effect"** from an underpowered inconclusive result — the right framing is "the experiment wasn't sized to detect the effect we observed." + +--- + +## Output shape + +1. **The reason** (one of the five above), in one sentence. +2. **The evidence** — concrete numbers from the experiment (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%"). +3. **Recommendation** from the table above, with the specific experiment update or follow-up action. +4. **What NOT to do**, briefly — the wrong-way temptation specific to this experiment.
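
The decision table and the four-part output shape can be combined into one small structure. This is only a sketch: the reason keys, `StatsigDiagnosis`, and `render_diagnosis` are hypothetical names introduced for illustration, and the recommendation strings are shortened from the table rather than taken from any platform response.

```python
from dataclasses import dataclass

# Hypothetical keys; values are abbreviated from the decision table.
RECOMMENDATIONS = {
    "not_enough_sample_active": "WAIT",
    "not_enough_sample_concluded_early": "EXTEND",
    "effect_below_mde": "ACCEPT NULL or BOOST POWER",
    "variance_too_high": "BOOST POWER",
    "variant_starved": "EXTEND or restart with rebalanced split",
    "exposure_config_filtering": "NARROW or EXTEND",
    "finished_well_powered": "ACCEPT NULL",
}


@dataclass
class StatsigDiagnosis:
    reason_key: str  # one of the RECOMMENDATIONS keys
    reason: str      # one-sentence statement of the reason
    evidence: str    # concrete numbers taken from the experiment
    avoid: str       # the wrong-way temptation for this experiment


def render_diagnosis(d: StatsigDiagnosis) -> str:
    """Render the four-part output shape as numbered lines."""
    if d.reason_key not in RECOMMENDATIONS:
        raise KeyError(f"unknown reason key: {d.reason_key}")
    return "\n".join(
        [
            f"1. Reason: {d.reason}",
            f"2. Evidence: {d.evidence}",
            f"3. Recommendation: {RECOMMENDATIONS[d.reason_key]}",
            f"4. What NOT to do: {d.avoid}",
        ]
    )
```

Keeping the evidence as a free-text field is deliberate: the numbers should be quoted from the experiment as returned, not recomputed by the helper.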