diff --git a/README.md b/README.md
index 17a8229..67b1872 100644
--- a/README.md
+++ b/README.md
@@ -4,17 +4,18 @@ Plugins that give AI agents Mixpanel expertise. Built on the [Agent Skills](http
 
 ## Skills
 
-| Skill | Description |
-|---|---|
-| [`create-dashboard`](plugins/mixpanel-mcp/skills/create-dashboard/) | Creates a well-designed Mixpanel dashboard with validated data, text cards, and narrative layout. |
-| [`deep-research`](plugins/mixpanel-mcp/skills/deep-research/) | Conducts a structured metric investigation in Mixpanel. Use when a user asks *why* a metric changed, what's driving a trend, or requests a deep dive or root cause analysis. |
-| [`manage-lexicon`](plugins/mixpanel-mcp/skills/manage-lexicon/) | Audits, scores, enriches, and cleans up Lexicon metadata (events and properties) for a Mixpanel project. Supports scoring health, bulk-filling descriptions/tags, resetting metadata, triaging data quality issues, and managing tags. |
-| [`tracking-implementation`](plugins/mixpanel-mcp/skills/tracking-implementation/) | Guides an agent through Mixpanel analytics implementation. Supports Quick Start, Full Implementation, Add Tracking, and Audit modes. |
+| Skill                                                                             | Description                                                                                                                                                                                                                                                                                                                                                          |
+| --------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| [`create-dashboard`](plugins/mixpanel-mcp/skills/create-dashboard/)               | Creates a well-designed Mixpanel dashboard with validated data, text cards, and narrative layout.                                                                                                                                                                                                                                                                    |
+| [`deep-research`](plugins/mixpanel-mcp/skills/deep-research/)                     | Conducts a structured metric investigation in Mixpanel. Use when a user asks _why_ a metric changed, what's driving a trend, or requests a deep dive or root cause analysis.                                                                                                                                                                                         |
+| [`interpret-experiment`](plugins/mixpanel-mcp/skills/interpret-experiment/)       | Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, make a ship/iterate/kill/wait call, asks why statsig hasn't been reached, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the verdicts that `Get-Experiment` returns — never recomputes thresholds. |
+| [`manage-lexicon`](plugins/mixpanel-mcp/skills/manage-lexicon/)                   | Audits, scores, enriches, and cleans up Lexicon metadata (events and properties) for a Mixpanel project. Supports scoring health, bulk-filling descriptions/tags, resetting metadata, triaging data quality issues, and managing tags.                                                                                                                               |
+| [`tracking-implementation`](plugins/mixpanel-mcp/skills/tracking-implementation/) | Guides an agent through Mixpanel analytics implementation. Supports Quick Start, Full Implementation, Add Tracking, and Audit modes.                                                                                                                                                                                                                                 |
 
 ### Internal skills
 
-| Skill | Description |
-|---|---|
+| Skill                                          | Description                                                                                                                                                                                |
+| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
 | [`review-skill`](.claude/skills/review-skill/) | Reviews a skill against a weighted quality rubric (8 dimensions, 27 checks) and produces a score with actionable issues. Run `/review-skill <skill-name>` before requesting a code review. |
 
 ## Getting Started
@@ -30,21 +31,23 @@ claude plugin marketplace add mixpanel/ai-plugins
 2. Install the plugin for your region:
 
 **US**
+
 ```bash
 claude plugin install mixpanel-mcp
 ```
 
 **EU**
+
 ```bash
 claude plugin install mixpanel-mcp-eu
 ```
 
 **India**
+
 ```bash
 claude plugin install mixpanel-mcp-in
 ```
 
-
 ### Cursor
 
 Install the plugin from the Cursor marketplace, or have a team admin import this GitHub repository as a team marketplace (Dashboard → Settings → Plugins → Import).
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
new file mode 100644
index 0000000..c370fc0
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
@@ -0,0 +1,129 @@
+---
+name: interpret-experiment
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `design-experiment` skill.
+license: Apache-2.0
+---
+
+# Interpret Experiment
+
+You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a verdict field is missing, say so — do not synthesize one from raw values.
+
+---
+
+# Glossary
+
+Concepts the rest of this skill uses without redefining.
+
+- **Variant.** One arm of the experiment. The variant treated as the baseline is the **control**; the others are **treatments**. The platform marks which key is the control.
+- **Primary / Guardrail / Secondary metric.**
+  - **Primary** — drives the ship decision. The platform applies multiple-testing correction across primaries when configured.
+  - **Guardrail** — a metric that must not regress; a guardrail loss vetoes a ship even when primaries win.
+  - **Secondary** — exploratory only, never decisional, no correction applied.
+- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict.
+- **Polarity.** Whether a movement is _good for the business_. Combines sign of lift with the metric's `direction` ("up" = bigger is better; "down" = smaller is better). See the **Polarity recipe** in Components.
+- **Significance.** The platform's per-row classification: `YES_POSITIVE`, `YES_NEGATIVE`, or `NO`. Read from the response — do not recompute.
+- **SRM (Sample Ratio Mismatch).** Variants received traffic in proportions that disagree with the configured split. **Kohavi's #1 trustworthiness check** — when SRM fails, downstream lift, p-values, and CIs cannot be trusted.
+- **Retro A/A (pre-experiment bias).** Re-runs the comparison on the pre-exposure period. A failure means cohorts already differed before treatment started.
+- **Twyman's Law.** "Any figure that looks interesting or different is usually wrong." In practice: an unusually clean or unusually large result is more likely a bug than a discovery. Apply on lifts > ~30%, which are usually a changed-denominator artifact.
+- **CUPED.** Variance reduction using pre-exposure baseline. Cuts required sample 30–70% when it applies. Inert on new-user-only cohorts.
+- **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95.
+- **MDE (Minimum Detectable Effect).** The smallest lift the experiment was sized to detect. Set during experiment setup.
+- **Trustworthiness gate.** The pre-flight check that runs before any results interpretation: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference.
+
+---
+
+# Components
+
+The pieces every interpretation uses. Defined here once so they don't drift across the steps and references.
+
+## Polarity recipe (load-bearing — apply on every metric row)
+
+The platform's summary buckets (`positive` / `negative` / `no`) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion.
+
+Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"):
+
+- `lift is None` or `lift == 0` → **neutral** (no measurement / no effect respectively).
+- `direction == "up"` → **positive** if `lift > 0`, else **negative**.
+- `direction == "down"` → **positive** if `lift < 0`, else **negative**.
+
+A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control.
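
The recipe above can be sketched as a small function. This is an illustration only: the name `polarity` and the way `lift` and `direction` are passed in are assumptions, not the platform's response schema.

```python
from typing import Optional


def polarity(lift: Optional[float], direction: str = "up") -> str:
    """Translate sign of lift plus metric direction into business polarity.

    Hypothetical helper: the real response fields may be named differently.
    """
    # No measurement (None) or no effect (0) are both neutral.
    if lift is None or lift == 0:
        return "neutral"
    if direction == "up":  # bigger is better
        return "positive" if lift > 0 else "negative"
    if direction == "down":  # smaller is better
        return "positive" if lift < 0 else "negative"
    raise ValueError(f"unknown direction: {direction!r}")
```

For example, a row with `lift = 0.05` on a `direction: "down"` metric comes out as `"negative"`, even though the platform would place it in `summary.positive`.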
+
+The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**.
+
+## Data-source fallback
+
+Experiment-details has two parallel data paths — live (preferred) and cached. Always prefer live; if live computation failed, fall back to cache with a staleness caveat; if **both** are empty, say "no result was computed" and recommend a re-sync. **Never** silently treat missing data as "no effect."
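
A minimal sketch of that fallback order, assuming the live and cached results have already been pulled out of the response as two optional dictionaries. The function name, argument names, and return shape are hypothetical.

```python
from typing import Optional


def pick_results(live: Optional[dict], cached: Optional[dict]) -> dict:
    """Choose which data path to interpret, and attach any caveat to report.

    Hypothetical helper: `live` and `cached` stand in for whatever the
    experiment-details response actually carries.
    """
    if live:
        # Preferred path: live computation succeeded.
        return {"source": "live", "results": live, "caveat": None}
    if cached:
        # Live failed or is empty: fall back, but flag staleness.
        return {
            "source": "cache",
            "results": cached,
            "caveat": "Live computation unavailable; cached results may be stale.",
        }
    # Both empty: this is "no measurement", never "no effect".
    return {
        "source": "none",
        "results": None,
        "caveat": "No result was computed; recommend a re-sync.",
    }
```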
+
+## Verdict table
+
+| Situation                                                              | Recommendation                                                                                                                                                                       |
+| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Conclude the experiment via its decide lifecycle action, naming the winning variant and a rationale message. **Confirm with the user first — concluding is irreversible.** |
+| Trust ✓, primary polarity positive, guardrail polarity negative        | **ITERATE.** Investigate the regression; do not auto-ship.                                                                                                                           |
+| Trust ✓, primary polarity neutral after target sample reached          | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md).                                                          |
+| Trust ✓, target sample/duration not yet reached                        | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)).                                                                 |
+| Trust ✗                                                                | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md).                         |
+
+For multi-variant tests, special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), and the exact decide-call shape, see [references/lifecycle-handoff.md](references/lifecycle-handoff.md).
+
+---
+
+# Steps
+
+Top-down: what to do, in order.
+
+## 1. Fetch the experiment
+
+If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs. Accept the experiment by name or by ID; try ID match first, then case-insensitive name match.
+
+Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments.
+
+Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret.
+
+## 2. Run the trustworthiness gate (the Decision Tree)
+
+Run steps 2a–2e in order. **Stop at the first failure** — do not proceed if a step flags a problem. The platform attaches verdict fields for each check; consume those verdicts rather than recomputing.
+
+### 2a. Trustworthiness
+
+SRM ok? Retro A/A clean? Exposures sufficient? Minimum duration met (~3 days)? No misconfiguration? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md). The Misconfigurations section in that reference covers the warning-level signals (multiple-testing off, extreme winsorization, CUPED on new-users-only, etc.).
+
+### 2b. Statistical significance
+
+Apply the **polarity recipe** from Components to each non-control variant × primary metric. If nothing is significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md). For translating a single metric's lift / CI / p-value into a phrase, see [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
+
+### 2c. Guardrail check
+
+Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a row in `summary.positive` for a `direction: "down"` guardrail is still a regression.
+
+### 2d. Practical significance
+
+Convert lift into absolute terms: multiply it by the control baseline. Statistical significance alone does not justify shipping; the absolute effect must also be large enough to matter. The per-metric reference covers the baseline-fetch fallback when `value` or `sampleSize` is missing, and the **Twyman's Law** check for any lift > ~30%.
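
As a rough illustration of the conversion: since lift is `(treatment_mean − control_mean) / control_mean`, multiplying it by the control baseline recovers the absolute difference. The helper names and the example numbers below are made up for illustration; the 30% cutoff is the approximate Twyman threshold this skill uses.

```python
def absolute_impact(lift: float, control_baseline: float) -> float:
    """Absolute change implied by a relative lift over the control baseline."""
    return lift * control_baseline


def needs_twyman_check(lift: float, threshold: float = 0.30) -> bool:
    """Flag lifts whose magnitude exceeds the ~30% Twyman's Law threshold."""
    return abs(lift) > threshold


# Hypothetical example: a 4% lift on a 25% control conversion rate
# is about one percentage point in absolute terms.
impact = absolute_impact(0.04, 0.25)
```

A lift that is statistically significant but translates to a negligible absolute change is a reason to hesitate, not to ship.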
+
+### 2e. Verdict
+
+Look up the situation in the **Verdict table** in Components. If the recommendation is SHIP or KILL, surface the proposed decide-action parameters and **wait for explicit user confirmation** before executing — concluding an experiment is irreversible.
+
+## 3. Going deeper (open references on demand)
+
+| User asks about…                                                                    | Open                                                                                             |
+| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
+| SRM failing, Retro A/A failing, exposures insufficient, or any trustworthiness fail | [references/health-check-interpretation.md](references/health-check-interpretation.md)           |
+| "Translate this lift / CI / p-value into English"                                   | [references/per-metric-interpretation.md](references/per-metric-interpretation.md)               |
+| "Why hasn't this hit statsig yet? Should we wait or stop?"                          | [references/why-no-statsig.md](references/why-no-statsig.md)                                     |
+| "Which segments should I break this down on?"                                       | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md)       |
+| "What does this segment-by-segment result mean?"                                    | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) |
+| "Can session replays help explain this result?"                                     | [references/session-replay-analysis.md](references/session-replay-analysis.md)                   |
+| "How do I actually conclude this experiment? Multi-variant ship?"                   | [references/lifecycle-handoff.md](references/lifecycle-handoff.md)                               |
+
+## 4. Output
+
+Default to this shape unless the user asks for something else:
+
+1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`.
+2. **Why**, walking through the trustworthiness-gate steps that mattered (skip steps that were clearly fine).
+3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, each polarity-corrected. Include absolute-impact translation for any win.
+4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, stale-cache caveat, etc.
+5. **Suggested next action** — for SHIP / KILL, the proposed decide-action parameters **gated on user confirmation**; for ITERATE / WAIT, the investigation to run next.
+
+If experiment details are unavailable or return errors, say so — do not invent a verdict.
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md
new file mode 100644
index 0000000..1467468
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md
@@ -0,0 +1,176 @@
+# Health-Check Interpretation
+
+Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action.
+
+---
+
+## Kohavi framing — always cite when a health check fails
+
+> **Sample Ratio Mismatch is the #1 trustworthiness check (Kohavi).** When SRM is failing, do not trust the experiment's lift, p-values, or confidence intervals — the randomization assumption is broken, so the measured effect cannot be attributed to the treatment.
+>
+> **Twyman's Law**: any unusually clean or unusually large result is more likely a bug than a discovery. A spectacular lift on a failing-SRM experiment is not evidence of a great treatment; it's evidence the bucketing is broken.
+
+These two principles drive the recommendations below. Lead with them when explaining a failing check to the user.
+
+---
+
+## 1. SRM (Sample Ratio Mismatch)
+
+**What the platform tells you**: the SRM verdict the experiment-details response carries (live, or cached when live isn't available). The platform tags failing SRMs already — consume the verdict, do not compute chi-square yourself.
+
+### What it means
+
+Users were assigned to variants in proportions that disagree with the configured target allocation, and the disagreement is too large to be plausibly explained by chance. Bucketing, the experimental machinery itself, is broken. Every downstream number (lift, p-value, CI) inherits that brokenness.
+
+### Likely causes, ordered most → least likely
+
+(Surface in this order — investigate the most probable first.)
+
+1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees.
+2. **biased_assignment** — The assignment criterion correlates with the variant — e.g. assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window.
+3. **bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation.
+4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the exposure event fires exactly once per user per variant assignment.
+5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period.
+
+### Recommended actions
+
+- **pause_and_investigate** — Pause the experiment before drawing any conclusions. SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable.
+- **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric.
+- **investigate_exposure_logging** — Compare exposure event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs.
+- **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split.
+
+### Investigation checklist
+
+1. Compare the actual per-variant exposure ratio to the configured target allocation — which variant is over/under-represented?
+2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history.
+3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math.
+4. Verify SDK version and bucketing logic. Query the exposure event grouped by variant to confirm exposure events are flowing correctly.
+5. Check for bot/QA traffic — bots often skew toward control. If QA traffic isn't being excluded, recommend enabling that filter.
+6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting.
+7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.**
+
+---
+
+## 2. Retro A/A (pre-experiment bias) failure
+
+**What the platform tells you**: the pre-experiment-bias analysis the platform attaches when that check is enabled in the experiment's settings.
+
+### What it means
+
+The same statistical comparison run on the **pre-exposure** period revealed that variant cohorts already differed _before_ the treatment started. Any "lift" measured during the experiment may just be reflecting that pre-existing gap, not the change.
+
+- Pre-experiment bias on a **primary** metric is a **stop-and-investigate** signal.
+- Pre-experiment bias on a **secondary** metric is informational only.
+
+### Investigation checklist
+
+1. Identify which metric × variant pair triggered the failure (after the platform's correction).
+2. Check whether bucketing was deterministic — non-deterministic assignment in the pre-period means users were assigned to different variants than they would have been in production.
+3. Look for cohort skew: did one variant disproportionately receive heavy users? Query the metric pre-experiment grouped by variant to confirm.
+4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort.
+5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing.
+
+---
+
+## 3. Insufficient exposures
+
+**What the platform tells you**: per-variant exposure counts plus an "insufficient" flag when the count is too low to trust. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue.
+
+### Investigation checklist
+
+1. Check per-variant exposure totals — which variant is undersampled?
+2. Inspect feature-flag rollout — was rollout dialed back?
+3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?).
+4. If the experiment is still ACTIVE: extend duration via an experiment update with a new end target.
+5. If the experiment concluded too early: relaunch with a longer planned duration. The setup-side skill covers the power-analysis math.
+
+If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question.
+
+---
+
+## 4. Frequentist peeking
+
+**What to check**: the experiment's testing model and whether it ended before reaching its configured end condition (sample size or duration, whichever was configured).
+
+### What it means
+
+A frequentist test that ends before reaching its configured target has an **inflated false-positive rate**. The math assumes a fixed sample size; peeking before that point and stopping on a favorable look is exactly what "p-hacking" looks like in production.
+
+### Investigation checklist
+
+1. Confirm the testing model is frequentist (sequential tests don't have this problem).
+2. Compare the actual end date against the planned end (date- or sample-based, whichever the experiment was configured with).
+3. If the conclusion was premature: the results have an inflated false-positive rate. Recommend a re-run.
+4. If the user wants to keep current results: caveat strongly. Recommend a sequential testing model for the next experiment so they can stop early without penalty.
+
+(Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.)
+
+---
+
+## 5. Live computation timeout / broken data
+
+**What the platform tells you**: a non-null error block on the live results, with the live data path empty.
+
+### Investigation checklist
+
+1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy.
+2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget.
+3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision.
+4. If the cache is recent (within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or empty, the user must resolve the backend issue before any meaningful interpretation.
+
+---
+
+## 6. Experiment ran < 3 days
+
+**What to compute (this one is local)**: the elapsed time between the experiment's start and end.
+
+Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. **Refuse to interpret.** Tell the user explicitly:
+
+> _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_
+
+If the experiment was sample-size-bounded and a tiny target was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window.
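
Because this check is computed locally, a sketch may help. It assumes the start and end are available as `datetime` values (for a still-running experiment, the current time would stand in for the end); the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

# Approximate minimum window below which results are not interpreted.
MIN_WINDOW = timedelta(days=3)


def ran_long_enough(start: datetime, end: datetime) -> bool:
    """Return True when the elapsed window is at least ~3 days."""
    return (end - start) >= MIN_WINDOW
```

A sample-size target reached in hours still fails this check, which is the intended behavior.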
+
+---
+
+## 7. Misconfigurations
+
+These don't always invalidate results, but they change how to _read_ them. Surface them as warnings during the trustworthiness gate.
+
+### Multiple-testing correction off with several primaries
+
+**Correction off AND 2+ primaries × 1+ non-control variants.** Any single significant primary may be a false positive, because the family-wise error rate compounds with every added comparison: for `m` independent tests it is `1 − (1 − α)^m` (e.g. 15 primaries × 1 variant at α=0.05 → ~54% chance of at least one false positive). Look at primaries in aggregate: if most point the same direction, the effect is likely real; if only one or two of many are significant, recommend enabling Benjamini-Hochberg or Bonferroni and re-analyzing.
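
The ~54% figure can be reproduced with the standard formula for the chance of at least one false positive across independent tests. Independence is an assumption here (correlated metrics would give a somewhat lower rate), and the function name is illustrative.

```python
def family_wise_error_rate(num_comparisons: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** num_comparisons


# 15 primaries x 1 non-control variant at alpha = 0.05 gives roughly 0.54.
fwer_15 = family_wise_error_rate(15, 0.05)
```

The same function covers the non-default confidence case: pass `alpha=0.10` for a 0.9 confidence level to see how quickly the rate grows with metric count.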
+
+### Extreme winsorization percentile
+
+**Winsorization enabled with a percentile far from the platform default (typically 95).** A percentile near 50 caps roughly half of the data or more, which is almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason.
+
+### SRM check disabled
+
+**SRM check is off.** Often deliberate — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) and then recommend re-enabling SRM and re-analyzing.
+
+### CUPED on new-users-only cohort
+
+**CUPED enabled AND the cohort is "new users only".** CUPED needs pre-exposure data, so it had no effect here — but **results are still valid**, variance reduction just didn't happen. Mention as informational. For future experiments on this surface, suggest extending the cohort to include returning users so CUPED can apply.
+
+### Non-default confidence level
+
+**Confidence level differs from the platform default (typically 0.95).** `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out in the verdict and combine with metric count to estimate the family-wise error rate.
+
+### Broken or placeholder metric entries
+
+**Metric entries with empty names.** Likely broken or placeholder references. Flag and skip during analysis.
+
+### Primary metric with no computed result
+
+**A metric is listed as primary but has no result (live or cached).** This is **"no measurement," not "no effect."** Surface prominently; recommend re-syncing results before any conclusion that depends on this primary.
+
+---
+
+## Output shape when a health check fails
+
+1. **What failed**, in one sentence (use the verdict the platform attached — do not re-derive).
+2. **What that means for trust** — cite the Kohavi framing (SRM is #1) or Twyman's Law where it fits.
+3. **Likely causes**, ordered most → least probable.
+4. **Recommended action** from the small set above.
+5. **Investigation checklist** the user can run.
+6. **What NOT to do** — usually, "do not act on the current lift / p-value numbers."
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md
new file mode 100644
index 0000000..3a9e24c
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md
@@ -0,0 +1,39 @@
+# Lifecycle Hand-off
+
+How to conclude an experiment once the verdict is settled. This reference is **interpretation guidance** — the per-field schema of the decide action lives in the experiment-update tool description.
+
+---
+
+## Confirm before concluding — always
+
+Concluding an experiment is **irreversible**. Before invoking the decide action, surface the proposed parameters to the user (winning variant, success/fail, rationale message) and wait for explicit confirmation. A SHIP verdict is a recommendation, not an authorization.
+
+## The three pieces every decide call needs
+
+A decide call expresses three things:
+
+1. **Did the experiment succeed?** A win for one of the treatments, or a deliberate stop.
+2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special constants below.
+3. **Why?** A rationale message — what metrics were evaluated, the polarity reading, the tradeoffs accepted. The platform requires this on every decide call; treat it as a one-paragraph decision record, not a placeholder.
+
+## Special variant choices for success
+
+When you have a winning result but no single variant to ship:
+
+- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. (The platform exposes this as the constant `__no_variant_shipped__`.)
+- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The platform exposes this as `__defer_variant_decision__` and shows `SUCCESS_DEFERRED` in the UI.)
+
+When the verdict is KILL — no winner — record success as false. No variant key is needed in that case.
+
+## Multi-variant experiments
+
+For a 3+ arm test, the decide action still names a single winning variant. If two treatments are roughly tied:
+
+- If both clear the practical-significance bar and shipping either is acceptable, pick on simplicity (smaller diff from control, lower implementation cost).
+- If the team genuinely cannot pick, use the defer constant above — better than fabricating a winner.
+
+A multi-variant test where only one treatment is significantly different from control is a clean SHIP for that variant; the inconclusive arms are simply not the winner.
+
+## After concluding
+
+The decision record — the rationale message, the shipped variant, and the experiment's terminal status — becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer.
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md
new file mode 100644
index 0000000..e46381c
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md
@@ -0,0 +1,167 @@
+# Per-Metric Interpretation
+
+Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_
+
+---
+
+## The mental model
+
+Each row in `summary.positive` / `summary.negative` / `summary.no` answers four questions:
+
+1. **Did the lift go up or down?** — the `summary` bucket name (sign-of-lift, not polarity).
+2. **Was the change distinguishable from noise?** — the `significance` field (or the bucket name itself: rows in `summary.positive` / `summary.negative` are significant, rows in `summary.no` are not).
+3. **Was the change in the goal direction?** — apply the polarity recipe with `metric.direction`.
+4. **Was the change big enough to matter?** — multiply `lift` by the control baseline `value` to get absolute impact, then judge against business context.
+
+A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any one of those and you're shipping the wrong thing.
+
+---
+
+## Polarity recipe
+
+Apply the polarity recipe from the spine — see the **Components** section of `SKILL.md`. Treat the bucket name in `summary.positive` / `summary.negative` as sign-of-lift only; the business verdict comes from combining it with `metric.direction`. Examples worth remembering:
+
+- A row in `summary.positive` with `direction: "down"` is a **regression**.
+- A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1% interstitials_shown` lift means less interruption).
+
+---
+
+## Reading the p-value in this platform
+
+Mixpanel runs a frequentist comparison at the experiment's configured confidence level — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative).
+
+The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread.
+
+For the general meaning of a p-value (the probability of seeing a result at least as extreme as the observed one if the null hypothesis is true), trust the model's baseline knowledge; don't invent thresholds in either direction.
+
+---
+
+## Reading the lift correctly
+
+```
+lift = (treatment_mean - control_mean) / control_mean
+```
+
+- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct.
+- If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect."
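+
+As a worked illustration of the formula above, using hypothetical means (a control rate of 2.00% and a treatment rate of 2.08%, neither taken from a real experiment):
+
+```latex
+\[
+\text{lift} = \frac{0.0208 - 0.0200}{0.0200} = \frac{0.0008}{0.0200} = 0.04 \quad (+4\%)
+\]
+```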
+
+---
+
+## Verdict phrasing — a small palette
+
+Pick the phrase that matches the four-question pattern. These are the words to use with users; they map onto the platform's already-computed numbers, so the agent never has to invent thresholds.
+
+| Pattern (sig × polarity × magnitude)                        | Plain-language verdict                                                                                                                                                    |
+| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Significant, polarity positive, magnitude large vs baseline | "**Clear win** — `<metric>` moved `<lift%>` in the goal direction, which is meaningful at this baseline." (apply Twyman's Law if lift > ~30%)                             |
+| Significant, polarity positive, magnitude small vs baseline | "**Statistically significant but practically small** — `<lift%>` on a `<baseline>` baseline is `<absolute>`; confirm with the user whether that clears the business bar." |
+| Significant, polarity negative                              | "**Regression** — `<metric>` moved `<lift%>` against its goal direction. This is a reason not to ship even if other primaries won."                                       |
+| Not significant, lift in goal direction, well-powered       | "**Likely no effect at the detectable size.** The experiment had enough power to detect `<MDE>`; the observed lift is below that threshold."                              |
+| Not significant, lift in goal direction, underpowered       | "**Inconclusive — too underpowered to call.** Route to the why-no-statsig playbook to decide between wait / extend / restart."                                            |
+| Not significant, lift in wrong direction                    | "**No detectable harm**, but no win either."                                                                                                                              |
+| `lift is None`                                              | "**No measurement** — this variant's row failed to compute. Surface the failure and re-sync."                                                                             |
+| Lift > ~30% on any metric                                   | Prefix with "**Twyman's Law check:** that lift is unusually large; verify the denominator hasn't changed before celebrating."                                             |
+
+---
+
+## Magnitude — make it absolute
+
+Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful:
+
+1. Baseline from the control variant's metric value (the experiment-details response carries it on the per-variant row).
+2. Lift from the winning row.
+3. Absolute lift: `baseline × lift`. Examples:
+   - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate.
+   - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`.
+4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week."
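+
+To make that contrast concrete, here is the arithmetic for the two scenarios quoted in step 4. It rests on two simplifying assumptions that are not stated in the source: the metric is a per-user conversion rate, and every listed user would receive the treatment.
+
+```latex
+\[
+\begin{aligned}
+\text{Scenario A:}\quad & 0.20 \times 0.05 = 0.01 \;(+1\ \text{pp}), & 0.01 \times 1{,}000{,}000 &= 10{,}000\ \text{extra conversions/week} \\
+\text{Scenario B:}\quad & 0.001 \times 0.05 = 0.00005 \;(+0.005\ \text{pp}), & 0.00005 \times 1{,}000 &= 0.05\ \text{extra conversions/week}
+\end{aligned}
+\]
+```
+
+The relative lift is identical in both scenarios, yet the absolute impact differs by a factor of 200,000.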
+
+### Fallback when the baseline value or sample size is missing
+
+This is common: it happens whenever live computation timed out or the cached results were nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.**
+
+Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation:
+
+- `unique` (Bernoulli) → conversion **rate** as the baseline.
+- `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. Multiplying lift by a raw total double-counts cohort size.
+
+---
+
+## Twyman's Law in practice — changed-denominator lifts
+
+Before celebrating any lift > ~30%, ask: **did the treatment change who is _exposed_ to this metric, not just how they behave?**
+
+If the treatment causes more users to _see_ a screen, more events naturally fire — the metric grows because the denominator changed, not because per-user behavior changed.
+
+- A "Free item" promotion drives more users to checkout → "Checkout Screen Viewed" lifts +1000% mechanically. The interesting question is **conversion rate on the screen**, not raw views.
+- A new banner makes a feature discoverable → "Feature Page Viewed" lifts dramatically. **Behavior per discovering user** may be unchanged.
+
+When you see a > 30% lift, name the risk explicitly:
+
+> _"This metric measures exposure to the screen/event. The treatment likely caused more users to be exposed; that explains most of the lift mechanically. The interesting question is what those users did once they got there."_
+
+---
+
+## Metric distribution types
+
+Different metric types behave differently; cite the relevant nuance in your verdict.
+
+| Metric type                      | Distribution | Interpretation nuance                                                                                     |
+| -------------------------------- | ------------ | --------------------------------------------------------------------------------------------------------- |
+| Unique users / conversion rate   | Bernoulli    | Variance = `p(1−p)`. Lift on rates near 50% is most powered; rates near 0% or 100% need much more sample. |
+| Event counts / sessions per user | Poisson      | Variance = mean. Highly sensitive to power users; consider whether one heavy user can swing results.      |
+| Revenue / numeric properties     | Gaussian     | Long tails (whales) inflate variance. Strongly consider Winsorization.                                    |
+
+---
+
+## Variance-reduction & outlier settings that change interpretation
+
+- **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix).
+- **Winsorization enabled**: extreme values capped at the configured percentile, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A percentile much lower than the platform default (typically 95) is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md).
+
+---
+
+## Multiple comparisons & metric tiers — what's decisional and what isn't
+
+| Tier          | How it influences the verdict                                                                                                                                                                                 |
+| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| **Primary**   | **Decisional.** The platform auto-applies correction when the experiment is configured for Bonferroni or Benjamini-Hochberg (across primaries × variants).                                                    |
+| **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude.                                                                                                                                          |
+| **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. |
+
+If multiple-testing correction is off AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled.
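+
+To illustrate why the uncorrected case needs care: a Bonferroni correction divides the significance threshold by the number of comparisons. The counts below (3 primaries, 2 non-control variants, a 0.05 threshold) are hypothetical, and how the platform counts comparisons internally should be verified in product.
+
+```latex
+\[
+\alpha_{\text{per-test}} = \frac{\alpha}{m} = \frac{0.05}{3 \times 2} = \frac{0.05}{6} \approx 0.0083
+\]
+```
+
+Under these assumptions, a primary with a p-value of 0.03 looks significant against 0.05 but would not clear the corrected threshold.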
+
+---
+
+## When a primary metric is inconclusive
+
+A "not significant" verdict means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level — **not that there is no effect.** Important when the user is about to call something a null result.
+
+For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see [why-no-statsig.md](why-no-statsig.md).
+
+---
+
+## Frequentist vs Sequential — what affects per-metric reading
+
+Concluding a Frequentist experiment before it reaches its configured target is a peeking event — per-metric significance verdicts become unreliable. Sequential experiments are designed for continuous monitoring and don't have this problem.
+
+For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of [health-check-interpretation.md](health-check-interpretation.md).
+
+---
+
+## Triggered analysis & dilution
+
+If the change only affects a subset of users (e.g. only triggers when a specific button is shown), the **effect on triggered users** is much larger than the **effect on the full exposed population**.
+
+- Triggered analysis zooms in on users who actually saw the change.
+- Dilution math: `population_lift = triggered_lift × (triggered_users / total_exposed)`.
+
+The platform doesn't auto-compute triggered analysis. If the change is gated by a condition, ask the user about the trigger rate and walk through the math before declaring the population-level lift "small."
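+
+A sketch of the dilution math with hypothetical numbers (a 10% lift among triggered users, and 20,000 triggered users out of 100,000 exposed):
+
+```latex
+\[
+\text{population\_lift} = 0.10 \times \frac{20{,}000}{100{,}000} = 0.10 \times 0.20 = 0.02 \quad (+2\%)
+\]
+```
+
+Read in reverse, a population-level lift of 2% at a 20% trigger rate corresponds to a 10% lift among the users who actually saw the change.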
+
+---
+
+## Novelty and primacy
+
+- **Novelty** — lift is large early, then decays as users habituate.
+- **Primacy** — lift is small or negative early, then grows as users learn the new behavior.
+
+To detect either, look at the line-chart view of the metric (date-segmented). A monotonic decay from day 1 → day 14 is classic novelty; the steady-state lift is what matters for shipping. Call this out when interpreting any experiment shorter than ~2 weeks.
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md
new file mode 100644
index 0000000..98c7bbc
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md
@@ -0,0 +1,99 @@
+# Segment-Breakdown Interpretation
+
+Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
+
+---
+
+## The mental model
+
+A segment breakdown asks: _did the treatment affect different user segments differently?_ It has three possible outcomes per segment:
+
+1. **The segment moved in the same direction as the overall effect**, with similar magnitude → reinforces the overall verdict; nothing new.
+2. **The segment moved much more or less than overall**, but in the same direction → heterogeneity; the effect is concentrated in a subset.
+3. **The segment moved in the _opposite_ direction** to overall → Simpson's paradox or a real reversal — this is where segment analysis earns its keep.
+
+Reading a segment breakdown well means recognizing which of those three you're looking at and not mistaking noise for any of them.
+
+---
+
+## Per-segment polarity recipe — apply per row
+
+The same recipe from the per-metric reference applies _inside_ each segment. Don't take a shortcut.
+
+- For each segment × metric × non-control variant, look at the row's `lift` and bucket (positive/negative/no).
+- Translate sign-of-lift into business polarity using `metric.direction`. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary.
+- Filter out the control row in each segment.
+
+Surprisingly easy to forget when you're scanning a wide table — re-apply polarity per row.
+
+---
+
+## Sample-size floor per segment
+
+Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. The platform surfaces an "insufficient exposures" flag at the overall level — trust that signal over a hand-rolled threshold, and apply the same logic per segment.
+
+- Segments the platform would flag as insufficient if the experiment were scoped to them alone → mark "insufficient sample, treat as directional only."
+- A "significant" lift on a tiny per-variant segment (e.g. tens of users) is almost always noise. Say so.
+- If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice.
+
+---
+
+## Heterogeneity vs Simpson's paradox vs noise
+
+| What you see                                                                                        | Interpretation                                                                                                                                             |
+| --------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Most segments lift positive, one or two negative, all with overlapping CIs                          | **Noise.** Not heterogeneity. Don't ship a segment-specific story.                                                                                         |
+| One segment lifts much more than the rest, with a tight CI and a clear mechanism                    | **Real heterogeneity.** The change is concentrated in that segment. Consider shipping only to that segment, or revising the hypothesis.                    |
+| Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. |
+| Two opposite-direction effects in different segments that roughly cancel overall                    | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses.    |
+
+When you spot Simpson's paradox, route the user to the **SRM** section of [health-check-interpretation.md](health-check-interpretation.md) — bucketing is usually the cause, not a real reversal.
+
+---
+
+## What a "ship only to segment X" recommendation requires
+
+Don't recommend a segment-scoped ship unless **all** of these hold:
+
+1. The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it).
+2. The segment's per-variant sample clears whatever exposure floor the platform applies to the overall experiment, by a comfortable margin.
+3. The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment.
+4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product.
+5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply.
+
+Otherwise, the segment-only ship is a post-hoc story dressed up as a decision. Recommend confirming with a follow-up experiment scoped to that segment.
+
+---
+
+## When a segment loses but overall wins
+
+This is the everyday case of mixed effects.
+
+- If the losing segment is small and its absolute hit is acceptable, ship to all — but call out the loser in the rationale.
+- If the losing segment is large or has a guardrail regression, recommend iterate, not ship.
+- If the losing segment is a regulated / strategic cohort (paying tier, top customers, EU), default to iterate — guardrails on the cohort, not just overall.
+
+---
+
+## What NOT to do
+
+- ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition.
+- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment query fallback — they're not corrected unless the platform did it.
+- ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal.
+- ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism.
+- ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence).
+
+---
+
+## Output shape
+
+1. **One-sentence segment-level summary** — homogeneous, heterogeneous, or Simpson's-suspicious.
+2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered).
+3. **What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's."
+4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating).
+
+---
+
+## Platform support status
+
+Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the rules above. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-of-interest-selection.md
new file mode 100644
index 0000000..4db49ac
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-of-interest-selection.md
@@ -0,0 +1,116 @@
+# Segment-of-Interest Selection
+
+Pick 3–5 segments **likely to reveal a real effect difference** instead of slicing every available dimension and ending up p-hacking.
+
+The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them.
+
+---
+
+## Why this matters: the fishing-expedition problem
+
+If you slice an experiment by every available property (10 platforms × 20 countries × 5 plan tiers × …), you will find "significant" segment-level effects by chance alone. The family-wise false positive rate explodes the same way it does for too many primary metrics — except there's usually no platform-level correction across segments. **Pre-committing to a small set of segments, ordered by hypothesis-driven probability, is the discipline that makes segment analysis credible.**
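+
+A rough sense of how fast this grows, assuming the slices are independent tests at a per-test threshold of 0.05 (a simplification, since real segments overlap and correlate):
+
+```latex
+\[
+\text{FWER} = 1 - (1 - \alpha)^{m}, \qquad
+m = 5:\ 1 - 0.95^{5} \approx 0.23, \qquad
+m = 20:\ 1 - 0.95^{20} \approx 0.64
+\]
+```
+
+Under that assumption, five slices already carry roughly a one-in-four chance of at least one spurious "significant" segment, and twenty slices push it to roughly two in three.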
+
+Aim for 3–5 segments, max. If the user wants more, ask which ones are connected to the hypothesis and which are exploration. Mark the exploration set as "hypothesis-generating, not decisional."
+
+---
+
+## The decision tree for picking segments
+
+Walk through these in order. The first match is the most defensible pick.
+
+### 1. Segments the hypothesis explicitly names
+
+If the experiment's `hypothesis` (or `description`) text mentions "new users", "mobile", "Pro tier", "EU customers" — those segments are pre-committed by the experiment design. Always include them.
+
+Look at:
+
+- `experiment.hypothesis`
+- `experiment.description`
+- The setup-side conversation, if present
+
+These are not exploratory; they're the variables the team committed to test.
+
+### 2. Segments where the mechanism is expected to matter
+
+The hypothesis names _what_ the change is and (ideally) _why_ it should work. The "why" tells you which user attributes plausibly moderate the effect:
+
+| Hypothesis mechanism                              | Segments likely to moderate the effect             |
+| ------------------------------------------------- | -------------------------------------------------- |
+| "Reduces first-time friction in onboarding"       | New vs returning; signup source; locale            |
+| "Improves discoverability of feature X"           | Users who previously used X vs not; tenure         |
+| "Speeds up a slow flow"                           | Platform (mobile slower than web); connection type |
+| "Lowers payment friction"                         | Plan tier; payment-method type; geography          |
+| "Replaces a confusing UI element"                 | New vs returning (returning users habituated)      |
+| "Surfaces a feature only relevant to power users" | Engagement-tier cohorts; tenure                    |
+| "Localized copy / pricing change"                 | Country / language                                 |
+
+If you can't articulate _why_ a segment should respond differently, it's not a hypothesis-driven slice. Demote it.
+
+### 3. Segments where the **denominator** plausibly differs
+
+Some properties don't change _behavior_ but change _who gets exposed_. Slicing on these helps catch changed-denominator artifacts before they're called a win.
+
+- Triggered vs untriggered cohorts (if the treatment only fires on certain pages).
+- Platform / app version (the treatment may only ship on a subset of clients).
+- Device class (mobile vs desktop) when the change is platform-specific.
+
+A 1000% lift in `Checkout Screen Viewed` overall usually disappears once you condition on "users who reached the checkout funnel" — that disappearance is the finding.
+
+### 4. Segments where SRM or baseline shift is suspected
+
+If overall SRM is borderline (or failing in one variant only), per-segment SRM can localize the bucketing bug to a specific platform / country / cohort. Examples:
+
+- iOS vs Android (often the SDK bucketing layer differs).
+- Bot-suspicious countries (`bot_traffic` cause from health-check).
+- A specific app version range that shipped a flag-evaluation change.
+
+This is diagnostic segmentation, not interpretation segmentation. Use it when the **trustworthiness gate** has already flagged trouble.
+
+### 5. Segments every results report de facto requires
+
+Some user dimensions are so foundational that any results report should mention them once:
+
+- **Platform** — web vs iOS vs Android.
+- **New vs returning** — defined as first session within the experiment window vs before.
+- **Geo region** — EU vs US vs APAC, when results meaningfully differ by regulatory or payment context.
+
+Don't include all three blindly — pick the one(s) most likely to vary given the change.
+
+---
+
+## Sanity checks before committing to a slice
+
+For each segment you want to break down on:
+
+1. **Does each segment value have enough exposed users per variant to clear the platform's overall sufficiency threshold?** Below that, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment.
+2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis.
+3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. "Pro tier" boundaries changed mid-test) invalidate the comparison.
+4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes — slicing on them is selection-bias, not stratification.
+
+---
+
+## How many slices to commit to
+
+| Situation                                                         | Number of slices                |
+| ----------------------------------------------------------------- | ------------------------------- |
+| Hypothesis-driven, well-powered, decisional                       | 3–5 segments, named upfront     |
+| Exploratory ("anything weird?"), flagged as hypothesis-generating | Up to ~10, with explicit caveat |
+| Diagnostic (chasing a failing SRM or strange overall result)      | Whatever helps localize the bug |
+
+If the user wants to "just look at everything", push back: pick the top 3–5 with reasoning, then offer a separate exploratory pass that won't be used for the ship decision.
+
+---
+
+## The pre-commit ritual
+
+Before running the breakdowns, tell the user something like:
+
+> _"Based on the hypothesis (`<one-line summary>`), I'd slice by `<segment A>` and `<segment B>` because `<why each matters>`. I'm intentionally not slicing `<X, Y, Z>` because they don't connect to the proposed mechanism — looking at every dimension makes false positives almost guaranteed. We can do an exploratory pass after, separately from the ship decision. Sound right?"_
+
+Pre-commitment is what separates "segmentation analysis" from "fishing."
+
+---
+
+## Then read the results
+
+Once the segment breakdown is in hand, switch to [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there.
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/session-replay-analysis.md
new file mode 100644
index 0000000..7282bb4
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/session-replay-analysis.md
@@ -0,0 +1,109 @@
+# Session-Replay Analysis Guidance
+
+Turn a quantitative experiment result into a behavior story using session replays.
+
+> **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss.
+
+---
+
+## When replays help, when they don't
+
+| Question                                                                                 | Replays help?                                                                         |
+| ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
+| "Why is conversion lower in treatment?"                                                  | Yes — behavior diff is observable.                                                    |
+| "Why is `Checkout Screen Viewed` 10× higher in treatment?" (changed-denominator suspect) | Yes — replays show whether users are _bouncing_ or _converting_ after they get there. |
+| "Why is `time_on_page` higher in treatment?"                                             | Yes — distinguishes engaged reading vs confused dwell.                                |
+| "Is the treatment shipping a regression on iOS only?"                                    | Sometimes — better answered first by segment breakdown.                               |
+| "Why is SRM failing?"                                                                    | No — replays don't show bucketing. Go to health checks.                               |
+| "What's the lift?"                                                                       | No — replays are qualitative; they explain _why_, not what.                           |
+| "Why hasn't this hit statsig yet?"                                                       | No — that's a sample/power question, not a behavior question.                         |
+
+A useful heuristic: replays answer _behavioral_ questions. If the question isn't behavioral, replays will burn time without adding signal.
+
+---
+
+## Cohort selection: which replays to compare
+
+You're looking for **paired contrast**, not a random sample. Pick the cohort that maximizes signal for the specific question.
+
+| Question                                                             | Cohort A (replays to pull)                                 | Cohort B (replays to pull)                                  |
+| -------------------------------------------------------------------- | ---------------------------------------------------------- | ----------------------------------------------------------- |
+| Why is primary metric down in treatment?                             | Treatment users who **failed** the primary action          | Control users who **succeeded** at the primary action       |
+| Why is a guardrail regression appearing?                             | Treatment users who **triggered** the guardrail negatively | Control users who did NOT trigger it                        |
+| Why does treatment have a big lift in `Screen Viewed` (denom shift)? | Treatment users who reached the screen                     | Same users, looking at whether they completed the next step |
+| Why is engagement higher / lower in a specific segment?              | Treatment users in that segment                            | Control users in the same segment                           |
+| What does the new UI look like in practice?                          | Any treatment users who saw the change                     | Any control users to confirm the baseline UI                |
+
+**Aim for ~5 replays per cohort.** Fewer and you're anecdote-shopping; many more and you'll just confirm what the first 5 already showed. If the first 5 are inconclusive or contradictory, pull 5 more before changing tactics.
+
+Filter by recency — replays from the most recent days of the experiment best reflect steady-state behavior (avoid novelty / primacy noise).
+
+---
+
+## What to actually watch for
+
+Go in with a hypothesis from the quantitative result. Don't watch replays blank-eyed; you'll see "users using the app" and learn nothing.
+
+### Friction / failure patterns
+
+- **Hesitation** — long pause before clicking a key element (often signals confusion).
+- **Misclicks** — clicking non-interactive elements, or rage-clicking a button that didn't work.
+- **Form abandonment** — typing into a field, then leaving without submitting.
+- **Back-button bounce** — landing on the page, then immediately backing out.
+- **Scroll-and-leave** — scrolling without engaging, then exiting.
+
+If treatment has more of these than control, you have a behavior explanation for a primary loss or guardrail regression.
+
+### Layout / discoverability issues
+
+- **CTA below the fold** — users never scrolling to where the new button is.
+- **Element overlap on mobile** — the treatment looks fine in desktop testing but breaks on small screens.
+- **Hidden state** — a tooltip / modal that fires once and is then gone, so the user never sees the key affordance.
+
+These usually explain segment heterogeneity (loss concentrated in mobile, or in a specific viewport size).
+
+### Changed-denominator behavior
+
+If you're investigating a Twyman's-Law-sized lift, look for:
+
+- **Users landing on the new screen and immediately leaving** — explains the inflated `Viewed` event without explaining real conversion.
+- **Users completing the rest of the funnel at a much lower rate per-arrival** — explains why the headline metric grew but downstream metrics didn't follow.
+
+If treatment users _arrive_ at a screen more often but _complete_ at a lower per-arrival rate, the "lift" is a denominator artifact and the per-arrival completion rate is the real story.
+
+### Variant-specific UI issues
+
+- **Treatment showed the wrong copy / wrong asset** — surprisingly common; treatment shipped, but to a subset of routes only.
+- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to [health-check-interpretation.md](health-check-interpretation.md).
+- **Treatment fired twice / persisted state across sessions** — implementation regression.
+
+---
+
+## How to frame the findings
+
+Replay analysis is qualitative. Be honest about that.
+
+- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in the experiment's results."_
+- ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict.
+
+Tie observations back to specific quantitative results from the experiment-details response. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either an unrepresentative cohort sample or a sign that user behavior is more varied than expected.
+
+---
+
+## What NOT to do
+
+- ❌ Use replays to override a clear quantitative verdict. If primaries say "ship" and replays look ugly, the ugliness might be edge cases — confirm with segment analysis first.
+- ❌ Cherry-pick a single dramatic replay. n=1 is anecdote.
+- ❌ Replace segment analysis with replays. Replays explain _behavior_; segments explain _who_. Different questions.
+- ❌ Pull replays from broad cohorts ("all treatment users") — the contrast pair is what reveals signal.
+- ❌ Spend more time on replays than on the headline interpretation. The decision tree comes first; replays are the explanation step after it.
+
+---
+
+## Output shape
+
+1. **The quantitative result the replays are explaining** — link back to the specific metric and verdict.
+2. **Cohorts watched** — what filters were applied to A and B, how many replays in each.
+3. **Patterns observed**, with counts (e.g. "4 of 5 treatment replays showed X; 0 of 5 control replays did").
+4. **The explanation hypothesis** — careful to frame as hypothesis ("consistent with"), not as proof.
+5. **Recommended next action**, usually one of: ship anyway (the regression is an edge case), iterate (fix the friction), kill (treatment is materially worse), or run a follow-up A/B with the fix.
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md
new file mode 100644
index 0000000..37ec069
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md
@@ -0,0 +1,115 @@
+# Why Hasn't This Reached Statistical Significance Yet?
+
+Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts.
+
+The actual stop / extend math (sample size, power, MDE) is owned by the `design-experiment` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one.
+
+---
+
+## First, rule out a broken result
+
+Inconclusive can mean two very different things:
+
+1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about.
+2. **The result isn't trustworthy at all** (SRM failing, broken data, a peeked Frequentist test, etc.), in which case "inconclusive" is the wrong frame entirely.
+
+Before answering "why no statsig?", run the **trustworthiness gate**. If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power.
+
+Also check:
+
+- The primary's lift is missing or null → no measurement, not "no effect."
+- The primary is listed on the experiment but has no computed result (live or cached) → "no measurement," not "no effect."
+- The live results carry an error block → results are stale or partial; resolve the backend issue before drawing power conclusions.
+
+---
+
+## The five real reasons an experiment hasn't hit statsig
+
+Walk through these in order. The first one that explains the picture is usually right.
+
+### 1. Not enough sample yet (not enough exposures)
+
+**What to check**: per-variant exposure counts against the configured end target (sample size or duration, whichever the experiment was configured with), and which testing model the experiment is using.
+
+- **Sequential** + target not reached → genuinely too early. Recommend **WAIT**.
+- **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe.
+- Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5.
+
+If exposures are falling short of plan because traffic dropped: surface that. Querying the exposure event with a date breakdown shows whether something changed mid-experiment.
+
+### 2. Observed effect is smaller than the MDE
+
+**What to check**: the lift on the primary metric, plus the MDE the user planned for (typically captured in the experiment's hypothesis/description, or recovered via the `design-experiment` skill's power math).
+
+- Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1.
+- Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options:
+  - **Accept the null** — at this size, the change isn't moving the metric. Document and move on.
+  - **Resize and rerun** — if a smaller effect would still be ship-worthy, re-run with a larger sample (lower MDE).
+- Observed lift much **larger** than planned MDE but still not significant → unusual; likely high variance (see reason 3) or insufficient exposures (reason 1).
+
+### 3. Variance is too high (metric is too noisy)
+
+**What to check**: the metric's distribution type, plus whether CUPED and Winsorization are enabled.
+
+- **Gaussian** metric (revenue, time-on-page) with no Winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization on the next run.
+- **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume.
+- **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the absolute scale of detectable effects. Lifts near 50% rates are easiest; lifts near 0%/100% need much more sample.
+- **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%.
+- **CUPED enabled on a new-user-only cohort** → CUPED has no effect (no pre-exposure data exists). Not a misconfiguration to "fix," but variance reduction simply didn't happen.
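+
+As a rough guide to that 30–70% range, here is the standard CUPED variance identity. It assumes a single pre-exposure covariate with the optimal adjustment coefficient; $\rho$ (the correlation between the metric and its pre-exposure value) is a symbol introduced here for illustration, not a field the platform returns:
+
+$$
+\operatorname{Var}(Y_{\text{cuped}}) = (1 - \rho^2)\,\operatorname{Var}(Y)
+$$
+
+Required sample scales with variance, so the saving is roughly $\rho^2$: a 30% cut corresponds to $\rho \approx 0.55$ and a 70% cut to $\rho \approx 0.84$. When $\rho$ is near zero (or no pre-exposure data exists), the saving is near zero too.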
+
+### 4. Traffic split is starving the variant
+
+**What to check**: the configured traffic split against the actual per-variant exposure counts.
+
+- Even split (50/50) → a balanced split is already optimal for power, so the split is usually not the issue.
+- Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later.
+- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs to clear the platform's per-variant exposure floor in its own right. Adding arms costs power per-comparison.
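+
+To see why a skewed split costs power, a back-of-envelope comparison helps. This is a sketch that assumes equal per-user variance $\sigma^2$ in both arms and a fixed total of $N$ exposed users; $n_c$ and $n_t$ are illustrative symbols for the control and treatment exposure counts:
+
+$$
+\operatorname{Var}(\bar{x}_t - \bar{x}_c) = \sigma^2 \left( \frac{1}{n_c} + \frac{1}{n_t} \right)
+$$
+
+With $N = 10{,}000$, a 50/50 split gives $\tfrac{1}{5000} + \tfrac{1}{5000} = 0.0004$, while a 90/10 split gives $\tfrac{1}{9000} + \tfrac{1}{1000} \approx 0.00111$. That is about 2.8× the variance, or roughly 1.67× wider confidence intervals, from the same total traffic.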
+
+Never change traffic allocation partway through a Frequentist test: doing so invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment.
+
+### 5. Exposure config is filtering more users than the user expects
+
+**What to check**: exposure event volume, any audience filters on the backing feature flag, and whether QA traffic is being excluded.
+
+- A property filter or audience filter on the feature flag is excluding most users → exposures fall short of the traffic the user assumes is available. Inspect the flag's rollout rules; query the exposure event to confirm how many users actually got exposed.
+- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event.
+- QA traffic isn't being excluded and you suspect internal traffic is dominating one variant → enable the QA exclusion on the next run (results are then cleaner, but the sample is also smaller).
+
+**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md).
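+
+A minimal sketch of the dilution arithmetic, assuming users who never reached the changed screen behave identically in both arms. Here $p$ is the share of exposed users who actually triggered the change and $\Delta$ is the absolute difference in the metric between treatment and control; both symbols are introduced for illustration:
+
+$$
+\Delta_{\text{population}} = p \cdot \Delta_{\text{triggered}}
+$$
+
+For example, if only 20% of exposed users reach the changed screen ($p = 0.2$) and the change moves conversion by 1.0 percentage point among them, the effect measured across all exposed users is about 0.2 percentage points, which is much harder to detect at the same sample size.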
+
+---
+
+## Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL?
+
+Once you know which reason fits, the recommendation almost picks itself.
+
+| Reason                                 | Recommendation                                                                                               |
+| -------------------------------------- | ------------------------------------------------------------------------------------------------------------ |
+| Not enough sample yet, still ACTIVE    | **WAIT.** Show projected end date based on observed traffic.                                                 |
+| Not enough sample yet, concluded early | **EXTEND** (Frequentist: relaunch with longer planned duration; Sequential: resume if possible).             |
+| Effect << MDE                          | **ACCEPT NULL** if the planned MDE is the smallest ship-worthy effect; otherwise **BOOST POWER** and re-run. |
+| Variance too high                      | **BOOST POWER**: enable CUPED, enable Winsorization, switch to a less noisy metric proxy.                    |
+| Variant starved by traffic split       | **EXTEND** (if remaining time is enough) or restart with rebalanced split.                                   |
+| Exposure config is filtering           | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample.               |
+| Experiment finished, well-powered      | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters.       |
+
+When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `design-experiment` skill for the power math.
+
+---
+
+## What NOT to suggest
+
+- ❌ **Stop early on a favorable peek** in a Frequentist test — that's exactly the false-positive inflation problem.
+- ❌ **Switch testing model mid-experiment** — restart, don't morph.
+- ❌ **Add more primary metrics** to "fish" for a win — multiplies the family-wise FPR. If a single primary is inconclusive, more primaries make the picture worse, not better.
+- ❌ **Re-run identical hypothesis on the same audience right after concluding "no effect"** — without a power change, you'll get the same answer.
+- ❌ **Claim "no effect"** from an underpowered inconclusive result — the right framing is "the experiment wasn't sized to detect the effect we observed."
+
+---
+
+## Output shape
+
+1. **The reason** (one of the five above), in one sentence.
+2. **The evidence** — concrete numbers from the experiment (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%").
+3. **Recommendation** from the table above, with the specific experiment update or follow-up action.
+4. **What NOT to do**, briefly: the wrong-way temptation specific to this experiment.
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md
new file mode 100644
index 0000000..c370fc0
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md
@@ -0,0 +1,129 @@
+---
+name: interpret-experiment
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `design-experiment` skill.
+license: Apache-2.0
+---
+
+# Interpret Experiment
+
+You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a verdict field is missing, say so — do not synthesize one from raw values.
+
+---
+
+# Glossary
+
+Concepts the rest of this skill uses without redefining.
+
+- **Variant.** One arm of the experiment. The variant treated as the baseline is the **control**; the others are **treatments**. The platform marks which key is the control.
+- **Primary / Guardrail / Secondary metric.**
+  - **Primary** — drives the ship decision. The platform applies multiple-testing correction across primaries when configured.
+  - **Guardrail** — a metric that must not regress; a guardrail loss vetoes a ship even when primaries win.
+  - **Secondary** — exploratory only, never decisional, no correction applied.
+- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict.
+- **Polarity.** Whether a movement is _good for the business_. Combines sign of lift with the metric's `direction` ("up" = bigger is better; "down" = smaller is better). See the **Polarity recipe** in Components.
+- **Significance.** The platform's per-row classification: `YES_POSITIVE`, `YES_NEGATIVE`, or `NO`. Read from the response — do not recompute.
+- **SRM (Sample Ratio Mismatch).** Variants received traffic in proportions that disagree with the configured split. **Kohavi's #1 trustworthiness check** — when SRM fails, downstream lift, p-values, and CIs cannot be trusted.
+- **Retro A/A (pre-experiment bias).** Re-runs the comparison on the pre-exposure period. A failure means cohorts already differed before treatment started.
+- **Twyman's Law.** "Any unusually clean or unusually large result is more likely a bug than a discovery." Apply on lifts > ~30% — usually a changed-denominator artifact.
+- **CUPED.** Variance reduction using pre-exposure baseline. Cuts required sample 30–70% when it applies. Inert on new-user-only cohorts.
+- **Winsorization.** Outlier capping at a configured percentile, applied to the pooled data across variants. The default is the 95th percentile.
+- **MDE (Minimum Detectable Effect).** The smallest lift the experiment was sized to detect. Set during experiment setup.
+- **Trustworthiness gate.** The pre-flight check that runs before any results interpretation: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference.
+
+---
+
+# Components
+
+The pieces every interpretation uses. Defined here once so they don't drift across the steps and references.
+
+## Polarity recipe (load-bearing — apply on every metric row)
+
+The platform's summary buckets (`positive` / `negative` / `no`) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion.
+
+Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"):
+
+- `lift is None` or `lift == 0` → **neutral** (no measurement / no effect respectively).
+- `direction == "up"` → **positive** if `lift > 0`, else **negative**.
+- `direction == "down"` → **positive** if `lift < 0`, else **negative**.
+
+A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control.
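+
+A worked example with made-up numbers, assuming a guardrail such as an error rate configured with `direction: "down"`, a control mean of 0.050, and a treatment mean of 0.046:
+
+$$
+\text{lift} = \frac{0.046 - 0.050}{0.050} = -0.08
+$$
+
+The sign is negative, so a sign-based bucket would label the row `negative`. Because smaller is better for this metric, the recipe reads it as polarity **positive**: an 8% drop in the error rate is an improvement, not a regression.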
+
+The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**.
+
+## Data-source fallback
+
+Experiment-details has two parallel data paths — live (preferred) and cached. Always prefer live; if live computation failed, fall back to cache with a staleness caveat; if **both** are empty, say "no result was computed" and recommend a re-sync. **Never** silently treat missing data as "no effect."
+
+## Verdict table
+
+| Situation                                                              | Recommendation                                                                                                                                                                       |
+| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Conclude the experiment via its decide lifecycle action, naming the winning variant and a rationale message. **Confirm with the user first — concluding is irreversible.** |
+| Trust ✓, primary polarity positive, guardrail polarity negative        | **ITERATE.** Investigate the regression; do not auto-ship.                                                                                                                           |
+| Trust ✓, primary polarity neutral after target sample reached          | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md).                                                          |
+| Trust ✓, target sample/duration not yet reached                        | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)).                                                                 |
+| Trust ✗                                                                | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md).                         |
+
+For multi-variant tests, special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), and the exact decide-call shape, see [references/lifecycle-handoff.md](references/lifecycle-handoff.md).
+
+---
+
+# Steps
+
+Top-down: what to do, in order.
+
+## 1. Fetch the experiment
+
+If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs. Accept the experiment by name or by ID; try ID match first, then case-insensitive name match.
+
+Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments.
+
+Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret.
+
+## 2. Run the trustworthiness gate (the Decision Tree)
+
+Run steps 2a–2e in order. **Stop at the first failure** — do not proceed if a step flags a problem. The platform attaches verdict fields for each check; consume those verdicts rather than recomputing.
+
+### 2a. Trustworthiness
+
+SRM ok? Retro A/A clean? Exposures sufficient? Minimum duration met (~3 days)? No misconfiguration? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md). The Misconfigurations section in that reference covers the warning-level signals (multiple-testing off, extreme winsorization, CUPED on new-users-only, etc.).
+
+### 2b. Statistical significance
+
+Apply the **polarity recipe** from Components to each non-control variant × primary metric. If nothing is significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md). For translating a single metric's lift / CI / p-value into a phrase, see [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
+
+### 2c. Guardrail check
+
+Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a row in `summary.positive` for a `direction: "down"` guardrail is still a regression.
+
+### 2d. Practical significance
+
+Convert lift into absolute terms by multiplying it by the control baseline. A statistically significant lift is not automatically worth shipping. The per-metric reference covers the baseline-fetch fallback when `value` or `sampleSize` is missing, and the **Twyman's Law** check for any lift > ~30%.
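+
+A sketch of that conversion with illustrative numbers (none of them come from a real experiment), assuming a conversion-rate metric with a 2.5% control baseline and a 4% relative lift:
+
+$$
+\Delta_{\text{abs}} = \text{lift} \times \text{control baseline} = 0.04 \times 2.5\% = 0.1 \text{ percentage points}
+$$
+
+At 100,000 exposed users, that is roughly 100 additional conversions. Whether that clears the bar for shipping is a business judgment, not a statistical one.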
+
+### 2e. Verdict
+
+Look up the situation in the **Verdict table** in Components. If the recommendation is SHIP or KILL, surface the proposed decide-action parameters and **wait for explicit user confirmation** before executing — concluding an experiment is irreversible.
+
+## 3. Going deeper (open references on demand)
+
+| User asks about…                                                                    | Open                                                                                             |
+| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
+| SRM failing, Retro A/A failing, exposures insufficient, or any trustworthiness fail | [references/health-check-interpretation.md](references/health-check-interpretation.md)           |
+| "Translate this lift / CI / p-value into English"                                   | [references/per-metric-interpretation.md](references/per-metric-interpretation.md)               |
+| "Why hasn't this hit statsig yet? Should we wait or stop?"                          | [references/why-no-statsig.md](references/why-no-statsig.md)                                     |
+| "Which segments should I break this down on?"                                       | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md)       |
+| "What does this segment-by-segment result mean?"                                    | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) |
+| "Can session replays help explain this result?"                                     | [references/session-replay-analysis.md](references/session-replay-analysis.md)                   |
+| "How do I actually conclude this experiment? Multi-variant ship?"                   | [references/lifecycle-handoff.md](references/lifecycle-handoff.md)                               |
+
+## 4. Output
+
+Default to this shape unless the user asks for something else:
+
+1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`.
+2. **Why**, walking through the trustworthiness-gate steps that mattered (skip steps that were clearly fine).
+3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, each polarity-corrected. Include absolute-impact translation for any win.
+4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, stale-cache caveat, etc.
+5. **Suggested next action** — for SHIP / KILL, the proposed decide-action parameters **gated on user confirmation**; for ITERATE / WAIT, the investigation to run next.
+
+If experiment details are unavailable or return errors, say so — do not invent a verdict.
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md
new file mode 100644
index 0000000..1467468
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md
@@ -0,0 +1,176 @@
+# Health-Check Interpretation
+
+Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action.
+
+---
+
+## Kohavi framing — always cite when a health check fails
+
+> **Sample Ratio Mismatch is the #1 trustworthiness check (Kohavi).** When SRM is failing, do not trust the experiment's lift, p-values, or confidence intervals — the randomization assumption is broken, so the measured effect cannot be attributed to the treatment.
+>
+> **Twyman's Law**: any unusually clean or unusually large result is more likely a bug than a discovery. A spectacular lift on a failing-SRM experiment is not evidence of a great treatment; it's evidence the bucketing is broken.
+
+These two principles drive the recommendations below. Lead with them when explaining a failing check to the user.
+
+---
+
+## 1. SRM (Sample Ratio Mismatch)
+
+**What the platform tells you**: the SRM verdict the experiment-details response carries (live, or cached when live isn't available). The platform tags failing SRMs already — consume the verdict, do not compute chi-square yourself.
+
+### What it means
+
+Users were assigned to variants in proportions that disagree with the configured target allocation. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness.
+
+### Likely causes, ordered most → least likely
+
+(Surface in this order — investigate the most probable first.)
+
+1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees.
+2. **biased_assignment** — The assignment criterion correlates with the variant — e.g. assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window.
+3. **bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation.
+4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the exposure event fires exactly once per user per variant assignment.
+5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period.
+
+### Recommended actions
+
+- **pause_and_investigate** — Pause the experiment before drawing any conclusions. SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable.
+- **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric.
+- **investigate_exposure_logging** — Compare exposure event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs.
+- **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split.
+
+### Investigation checklist
+
+1. Compare the actual per-variant exposure ratio to the configured target allocation — which variant is over/under-represented?
+2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history.
+3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math.
+4. Verify SDK version and bucketing logic. Query the exposure event grouped by variant to confirm exposure events are flowing correctly.
+5. Check for bot/QA traffic — bots often skew toward control. If QA traffic isn't being excluded, recommend enabling that filter.
+6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting.
+7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.**
+
+---
+
+## 2. Retro A/A (pre-experiment bias) failure
+
+**What the platform tells you**: the pre-experiment-bias analysis the platform attaches when that check is enabled in the experiment's settings.
+
+### What it means
+
+The same statistical comparison run on the **pre-exposure** period revealed that variant cohorts already differed _before_ the treatment started. Any "lift" measured during the experiment may just be reflecting that pre-existing gap, not the change.
+
+- Pre-experiment bias on a **primary** metric is a **stop-and-investigate** signal.
+- Pre-experiment bias on a **secondary** metric is informational only.
+
+### Investigation checklist
+
+1. Identify which metric × variant pair triggered the failure (after the platform's correction).
+2. Check whether bucketing was deterministic — non-deterministic assignment in the pre-period means users were assigned to different variants than they would have been in production.
+3. Look for cohort skew: did one variant disproportionately receive heavy users? Query the metric pre-experiment grouped by variant to confirm.
+4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort.
+5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing.
+
+---
+
+## 3. Insufficient exposures
+
+**What the platform tells you**: per-variant exposure counts plus an "insufficient" flag when the count is too low to trust. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue.
+
+### Investigation checklist
+
+1. Check per-variant exposure totals — which variant is undersampled?
+2. Inspect feature-flag rollout — was rollout dialed back?
+3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?).
+4. If the experiment is still ACTIVE: extend duration via an experiment update with a new end target.
+5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math.
+
+If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question.
+
+---
+
+## 4. Frequentist peeking
+
+**What to check**: the experiment's testing model and whether it ended before reaching its configured end condition (sample size or duration, whichever was configured).
+
+### What it means
+
+A frequentist test that ends before reaching its configured target has an **inflated false-positive rate**. The math assumes a fixed sample size; peeking before that point and stopping on a favorable look is exactly what "p-hacking" looks like in production.
+
+### Investigation checklist
+
+1. Confirm the testing model is frequentist (sequential tests don't have this problem).
+2. Compare the actual end date against the planned end (date- or sample-based, whichever the experiment was configured with).
+3. If the conclusion was premature: results have inflated false-positive rate. Recommend a re-run.
+4. If the user wants to keep current results: caveat strongly. Recommend a sequential testing model for the next experiment so they can stop early without penalty.
+
+(Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.)
+
+---
+
+## 5. Live computation timeout / broken data
+
+**What the platform tells you**: a non-null error block on the live results, with the live data path empty.
+
+### Investigation checklist
+
+1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy.
+2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget.
+3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision.
+4. If the cache is recent (within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or empty, the user must resolve the backend issue before any meaningful interpretation.
+
+---
+
+## 6. Experiment ran < 3 days
+
+**What to compute (this one is local)**: the elapsed time between the experiment's start and end.
+
+Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. **Refuse to interpret.** Tell the user explicitly:
+
+> _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_
+
+If the experiment was sample-size-bounded and a tiny target was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window.
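+
+A minimal sketch of the local elapsed-time check, assuming the start and end arrive as ISO 8601 strings. The function name `ran_long_enough` and the `MIN_DAYS` constant are illustrative, not platform fields.
+
+```python
+from datetime import datetime, timedelta
+
+MIN_DAYS = 3  # assumed floor, taken from the "~3 days" guidance above
+
+
+def ran_long_enough(start_iso: str, end_iso: str, min_days: int = MIN_DAYS) -> bool:
+    """Return True when the experiment window spans at least `min_days`."""
+    start = datetime.fromisoformat(start_iso)
+    end = datetime.fromisoformat(end_iso)
+    if end < start:
+        raise ValueError("end precedes start")
+    return (end - start) >= timedelta(days=min_days)
+```
+
+When this returns `False`, the refusal message above applies regardless of how the lift and p-value look.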
+
+---
+
+## 7. Misconfigurations
+
+These don't always invalidate results, but they change how to _read_ them. Surface them as warnings during the trustworthiness gate.
+
+### Multiple-testing correction off with several primaries
+
+**Correction off AND 2+ primaries × 1+ non-control variants.** Any single significant primary may be a false positive: the family-wise error rate compounds as `1 − (1 − α)^m` over `m` independent comparisons (e.g. 15 primaries × 1 variant at α=0.05 → ~54% chance of at least one false positive). Look at primaries in aggregate: if most point the same direction, the effect is likely real; if only one or two of many are significant, recommend enabling Benjamini-Hochberg or Bonferroni and re-analyzing.
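+
+The ~54% figure can be reproduced with a short sketch. It assumes the comparisons are independent, which real metrics rarely are, so treat the output as a rough upper-end estimate; the function names are illustrative.
+
+```python
+def family_wise_error_rate(alpha: float, comparisons: int) -> float:
+    """Chance of at least one false positive across independent comparisons."""
+    if not 0 < alpha < 1:
+        raise ValueError("alpha must be in (0, 1)")
+    if comparisons < 1:
+        raise ValueError("comparisons must be >= 1")
+    return 1 - (1 - alpha) ** comparisons
+
+
+def bonferroni_alpha(alpha: float, comparisons: int) -> float:
+    """Per-comparison threshold that keeps the family-wise rate at or below alpha."""
+    return alpha / comparisons
+
+
+# 15 primaries x 1 non-control variant at alpha = 0.05
+print(round(family_wise_error_rate(0.05, 15), 3))  # 0.537
+```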
+
+### Extreme winsorization percentile
+
+**Winsorization enabled with a percentile far from the platform default (typically 95).** A percentile near 50 caps almost all data — almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason.
+
+### SRM check disabled
+
+**SRM check is off.** Often deliberate — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) and then recommend re-enabling SRM and re-analyzing.
+
+### CUPED on new-users-only cohort
+
+**CUPED enabled AND the cohort is "new users only".** CUPED needs pre-exposure data, so it had no effect here. **Results are still valid**; the variance reduction just didn't happen. Mention as informational. For future experiments on this surface, suggest extending the cohort to include returning users so CUPED can apply.
+
+### Non-default confidence level
+
+**Confidence level differs from the platform default (typically 0.95).** `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out in the verdict and combine with metric count to estimate the family-wise error rate.
+
+### Broken or placeholder metric entries
+
+**Metric entries with empty names.** Likely broken or placeholder references. Flag and skip during analysis.
+
+### Primary metric with no computed result
+
+**A metric is listed as primary but has no result (live or cached).** This is **"no measurement," not "no effect."** Surface prominently; recommend re-syncing results before any conclusion that depends on this primary.
+
+---
+
+## Output shape when a health check fails
+
+1. **What failed**, in one sentence (use the verdict the platform attached — do not re-derive).
+2. **What that means for trust** — cite the Kohavi framing (SRM is #1) or Twyman's Law where it fits.
+3. **Likely causes**, ordered most → least probable.
+4. **Recommended action** from the small set above.
+5. **Investigation checklist** the user can run.
+6. **What NOT to do** — usually, "do not act on the current lift / p-value numbers."
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md
new file mode 100644
index 0000000..3a9e24c
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md
@@ -0,0 +1,39 @@
+# Lifecycle Hand-off
+
+How to conclude an experiment once the verdict is settled. This reference is **interpretation guidance** — the per-field schema of the decide action lives in the experiment-update tool description.
+
+---
+
+## Confirm before concluding — always
+
+Concluding an experiment is **irreversible**. Before invoking the decide action, surface the proposed parameters to the user (winning variant, success/fail, rationale message) and wait for explicit confirmation. A SHIP verdict is a recommendation, not an authorization.
+
+## The three pieces every decide call needs
+
+A decide call expresses three things:
+
+1. **Did the experiment succeed?** A win for one of the treatments, or a deliberate stop.
+2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special constants below.
+3. **Why?** A rationale message — what metrics were evaluated, the polarity reading, the tradeoffs accepted. The platform requires this on every decide call; treat it as a one-paragraph decision record, not a placeholder.
+
+## Special variant choices for success
+
+When you have a winning result but no single variant to ship:
+
+- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. (The platform exposes this as the constant `__no_variant_shipped__`.)
+- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The platform exposes this as `__defer_variant_decision__` and shows `SUCCESS_DEFERRED` in the UI.)
+
+When the verdict is KILL — no winner — record success as false. No variant key is needed in that case.
+
+## Multi-variant experiments
+
+For a 3+ arm test, the decide action still names a single winning variant. If two treatments are roughly tied:
+
+- If both clear the practical-significance bar and shipping either is acceptable, pick on simplicity (smaller diff from control, lower implementation cost).
+- If the team genuinely cannot pick, use the defer constant above — better than fabricating a winner.
+
+A multi-variant test where only one treatment is significantly different from control is a clean SHIP for that variant; the inconclusive arms are simply not the winner.
+
+## After concluding
+
+The decision record — the rationale message, the shipped variant, and the experiment's terminal status — becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer.
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md
new file mode 100644
index 0000000..e46381c
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md
@@ -0,0 +1,167 @@
+# Per-Metric Interpretation
+
+Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_
+
+---
+
+## The mental model
+
+Each row in `summary.positive` / `summary.negative` / `summary.no` answers four questions:
+
+1. **Did the lift go up or down?** — the `summary` bucket name (sign-of-lift, not polarity).
+2. **Was the change distinguishable from noise?** — the `significance` field (or the bucket name itself: rows in `summary.positive` / `summary.negative` are significant, rows in `summary.no` are not).
+3. **Was the change in the goal direction?** — apply the polarity recipe with `metric.direction`.
+4. **Was the change big enough to matter?** — multiply `lift` by the control baseline `value` to get absolute impact, then judge against business context.
+
+A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any one of those and you're shipping the wrong thing.
+
+---
+
+## Polarity recipe
+
+Apply the polarity recipe from the spine — see the **Components** section of `SKILL.md`. Treat the bucket name in `summary.positive` / `summary.negative` as sign-of-lift only; the business verdict comes from combining it with `metric.direction`. Examples worth remembering:
+
+- A row in `summary.positive` with `direction: "down"` is a **regression**.
+- A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1% interstitials_shown` lift means less interruption).
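+
+The two examples above follow one rule, sketched here. This is an illustration only: the authoritative recipe lives in `SKILL.md`, and the `"up"` direction value and the function name are assumptions (only `"down"` appears in the examples).
+
+```python
+def business_verdict(bucket: str, direction: str) -> str:
+    """Combine the sign-of-lift bucket with the metric's goal direction."""
+    if bucket == "no":
+        return "not significant"
+    if bucket not in ("positive", "negative") or direction not in ("up", "down"):
+        raise ValueError(f"unexpected bucket/direction: {bucket!r}, {direction!r}")
+    moved_up = bucket == "positive"   # sign of the lift
+    wants_up = direction == "up"      # assumed value for "higher is better"
+    return "win" if moved_up == wants_up else "regression"
+```
+
+A `"win"` here is a polarity reading only; the magnitude question from the mental model still has to be answered separately.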
+
+---
+
+## Reading the p-value in this platform
+
+Mixpanel runs a frequentist comparison at the experiment's configured confidence level — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative).
+
+The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread.
+
+For the general meaning of a p-value (the probability, under the null hypothesis, of seeing a result at least as extreme as the one observed), trust the model's baseline knowledge; don't invent thresholds in either direction.
+
+---
+
+## Reading the lift correctly
+
+```
+lift = (treatment_mean - control_mean) / control_mean
+```
+
+- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct.
+- If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect."
+
+---
+
+## Verdict phrasing — a small palette
+
+Pick the phrase that matches the four-question pattern. These are the words to use with users; they map onto the platform's already-computed numbers, so the agent never has to invent thresholds.
+
+| Pattern (sig × polarity × magnitude)                        | Plain-language verdict                                                                                                                                                    |
+| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Significant, polarity positive, magnitude large vs baseline | "**Clear win** — `<metric>` moved `<lift%>` in the goal direction, which is meaningful at this baseline." (apply Twyman's Law if lift > ~30%)                             |
+| Significant, polarity positive, magnitude small vs baseline | "**Statistically significant but practically small** — `<lift%>` on a `<baseline>` baseline is `<absolute>`; confirm with the user whether that clears the business bar." |
+| Significant, polarity negative                              | "**Regression** — `<metric>` moved `<lift%>` against its goal direction. This is a reason not to ship even if other primaries won."                                       |
+| Not significant, lift in goal direction, well-powered       | "**Likely no effect at the detectable size.** The experiment had enough power to detect `<MDE>`; the observed lift is below that threshold."                              |
+| Not significant, lift in goal direction, underpowered       | "**Inconclusive — too underpowered to call.** Route to the why-no-statsig playbook to decide between wait / extend / restart."                                            |
+| Not significant, lift in wrong direction                    | "**No detectable harm**, but no win either."                                                                                                                              |
+| `lift is None`                                              | "**No measurement** — this variant's row failed to compute. Surface the failure and re-sync."                                                                             |
+| Lift > ~30% on any metric                                   | Prefix with "**Twyman's Law check:** that lift is unusually large; verify the denominator hasn't changed before celebrating."                                             |
+
+---
+
+## Magnitude — make it absolute
+
+Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful:
+
+1. Baseline from the control variant's metric value (the experiment-details response carries it on the per-variant row).
+2. Lift from the winning row.
+3. Absolute lift: `baseline × lift`. Examples:
+   - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate.
+   - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`.
+4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week."
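+
+The steps above can be sketched as two small helpers. The function names and the `users_per_period` parameter are illustrative; the inputs are the worked numbers from the list.
+
+```python
+def absolute_lift(baseline: float, lift: float) -> float:
+    """Absolute change in the metric: baseline x relative lift."""
+    return baseline * lift
+
+
+def projected_impact(baseline: float, lift: float, users_per_period: float) -> float:
+    """Absolute change scaled to a population over one period."""
+    return absolute_lift(baseline, lift) * users_per_period
+
+
+# Conversion rate: 2% baseline, +4% relative lift
+print(f"{absolute_lift(0.02, 0.04) * 100:.2f} percentage points")  # 0.08 percentage points
+
+# Same 5% lift, very different stakes
+print(f"{projected_impact(0.20, 0.05, 1_000_000):,.0f}")  # 10,000
+print(f"{projected_impact(0.001, 0.05, 1_000):.2f}")      # 0.05
+```
+
+The last two lines are the contrast from step 4: roughly 10,000 extra conversions per week versus a small fraction of one.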
+
+### Fallback when the baseline value or sample size is missing
+
+Common — happens whenever live computation timed out or the cached results were nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.**
+
+Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation:
+
+- `unique` (Bernoulli) → conversion **rate** as the baseline.
+- `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. Multiplying lift by a raw total double-counts cohort size.
+
+---
+
+## Twyman's Law in practice — changed-denominator lifts
+
+Before celebrating any lift > ~30%, ask: **did the treatment change who is _exposed_ to this metric, not just how they behave?**
+
+If the treatment causes more users to _see_ a screen, more events naturally fire — the metric grows because the denominator changed, not because per-user behavior changed.
+
+- A "Free item" promotion drives more users to checkout → "Checkout Screen Viewed" lifts +1000% mechanically. The interesting question is **conversion rate on the screen**, not raw views.
+- A new banner makes a feature discoverable → "Feature Page Viewed" lifts dramatically. **Per-discover-er behavior** may be unchanged.
+
+When you see a > 30% lift, name the risk explicitly:
+
+> _"This metric measures exposure to the screen/event. The treatment likely caused more users to be exposed; that explains most of the lift mechanically. The interesting question is what those users did once they got there."_
+
+---
+
+## Metric distribution types
+
+Different metric types behave differently; cite the relevant nuance in your verdict.
+
+| Metric type                      | Distribution | Interpretation nuance                                                                                     |
+| -------------------------------- | ------------ | --------------------------------------------------------------------------------------------------------- |
+| Unique users / conversion rate   | Bernoulli    | Variance = `p(1−p)`, largest at 50%. For a fixed relative lift, low baseline rates need far more sample.  |
+| Event counts / sessions per user | Poisson      | Variance = mean. Highly sensitive to power users; consider whether one heavy user can swing results.      |
+| Revenue / numeric properties     | Gaussian     | Long tails (whales) inflate variance. Strongly consider Winsorization.                                    |
+
+---
+
+## Variance-reduction & outlier settings that change interpretation
+
+- **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix).
+- **Winsorization enabled**: extreme values capped at the configured percentile, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A percentile much lower than the platform default (typically 95) is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md).
+
+---
+
+## Multiple comparisons & metric tiers — what's decisional and what isn't
+
+| Tier          | How it influences the verdict                                                                                                                                                                                 |
+| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| **Primary**   | **Decisional.** The platform auto-applies correction when the experiment is configured for Bonferroni or Benjamini-Hochberg (across primaries × variants).                                                    |
+| **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude.                                                                                                                                          |
+| **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. |
+
+If multiple-testing correction is off AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled.
+
+---
+
+## When a primary metric is inconclusive
+
+A "not significant" verdict means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level — **not that there is no effect.** Important when the user is about to call something a null result.
+
+For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see [why-no-statsig.md](why-no-statsig.md).
+
+---
+
+## Frequentist vs Sequential — what affects per-metric reading
+
+Concluding a Frequentist experiment before it reaches its configured target is a peeking event — per-metric significance verdicts become unreliable. Sequential experiments are designed for continuous monitoring and don't have this problem.
+
+For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of [health-check-interpretation.md](health-check-interpretation.md).
+
+---
+
+## Triggered analysis & dilution
+
+If the change only affects a subset of users (e.g. only triggers when a specific button is shown), the **effect on triggered users** is much larger than the **effect on the full exposed population**.
+
+- Triggered analysis zooms in on users who actually saw the change.
+- Dilution math: `population_lift = triggered_lift × (triggered_users / total_exposed)`.
+
+The platform doesn't auto-compute triggered analysis. If the change is gated by a condition, ask the user about the trigger rate and walk through the math before declaring the population-level lift "small."
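+
+A minimal sketch of the dilution math, assuming triggered and non-triggered users share a similar baseline (otherwise the simple proportion is only approximate). The function name is illustrative.
+
+```python
+def diluted_lift(triggered_lift: float, triggered_users: int, total_exposed: int) -> float:
+    """Population-level lift implied by a lift seen only among triggered users."""
+    if total_exposed <= 0:
+        raise ValueError("total_exposed must be positive")
+    if not 0 <= triggered_users <= total_exposed:
+        raise ValueError("triggered_users must be between 0 and total_exposed")
+    return triggered_lift * (triggered_users / total_exposed)
+
+
+# A 20% lift among triggered users with a 10% trigger rate
+print(f"{diluted_lift(0.20, 10_000, 100_000):.3f}")  # 0.020
+```
+
+A 2% population-level lift in this case is a 20% effect on the users who actually saw the change, which is worth saying before anyone calls it "small."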
+
+---
+
+## Novelty and primacy
+
+- **Novelty** — lift is large early, then decays as users habituate.
+- **Primacy** — lift is small or negative early, then grows as users learn the new behavior.
+
+To detect either, look at the line-chart view of the metric (date-segmented). A monotonic decay from day 1 → day 14 is classic novelty; the steady-state lift is what matters for shipping. Call this out when interpreting any experiment shorter than ~2 weeks.
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md
new file mode 100644
index 0000000..98c7bbc
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md
@@ -0,0 +1,99 @@
+# Segment-Breakdown Interpretation
+
+Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
+
+---
+
+## The mental model
+
+A segment breakdown asks: _did the treatment affect different user segments differently?_ It has three possible outcomes per segment:
+
+1. **The segment moved in the same direction as the overall effect**, with similar magnitude → reinforces the overall verdict; nothing new.
+2. **The segment moved much more or less than overall**, but in the same direction → heterogeneity; the effect is concentrated in a subset.
+3. **The segment moved in the _opposite_ direction** to overall → Simpson's paradox or a real reversal — this is where segment analysis earns its keep.
+
+Reading a segment breakdown well means recognizing which of those three you're looking at and not mistaking noise for any of them.
+
+---
+
+## Per-segment polarity recipe — apply per row
+
+The same recipe from the per-metric reference applies _inside_ each segment. Don't take a shortcut.
+
+- For each segment × metric × non-control variant, look at the row's `lift` and bucket (positive/negative/no).
+- Translate sign-of-lift into business polarity using `metric.direction`. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary.
+- Filter out the control row in each segment.
+
+Surprisingly easy to forget when you're scanning a wide table — re-apply polarity per row.
+
+---
+
+## Sample-size floor per segment
+
+Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. The platform surfaces an "insufficient exposures" flag at the overall level — trust that signal over a hand-rolled threshold, and apply the same logic per segment.
+
+- Segments the platform would flag as insufficient if the experiment were scoped to them alone → mark "insufficient sample, treat as directional only."
+- A "significant" lift on a tiny per-variant segment (e.g. tens of users) is almost always noise. Say so.
+- If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice.
+
+---
+
+## Heterogeneity vs Simpson's paradox vs noise
+
+| What you see                                                                                        | Interpretation                                                                                                                                             |
+| --------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Most segments lift positive, one or two negative, all with overlapping CIs                          | **Noise.** Not heterogeneity. Don't ship a segment-specific story.                                                                                         |
+| One segment lifts much more than the rest, with a tight CI and a clear mechanism                    | **Real heterogeneity.** The change is concentrated in that segment. Consider shipping only to that segment, or revising the hypothesis.                    |
+| Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. |
+| Two opposite-direction effects in different segments that roughly cancel overall                    | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses.    |
+
+When you spot Simpson's paradox, route the user to the **SRM** section of [health-check-interpretation.md](health-check-interpretation.md) — bucketing is usually the cause, not a real reversal.
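+
+A toy illustration of how an uneven variant mix produces the paradox. The counts are made up for the example and the segment names are placeholders; the point is that treatment wins inside each segment yet loses once the segments are pooled.
+
+```python
+# Illustrative counts only: (users, conversions) per segment and variant.
+counts = {
+    "segment_a": {"control": (800, 400), "treatment": (200, 110)},
+    "segment_b": {"control": (200, 20), "treatment": (800, 96)},
+}
+
+
+def rate(users: int, conversions: int) -> float:
+    return conversions / users
+
+
+def pooled_rate(variant: str) -> float:
+    """Conversion rate for one variant with all segments combined."""
+    users = sum(seg[variant][0] for seg in counts.values())
+    conversions = sum(seg[variant][1] for seg in counts.values())
+    return rate(users, conversions)
+
+
+def treatment_share(segment: str) -> float:
+    """Fraction of a segment's users that landed in treatment."""
+    seg = counts[segment]
+    return seg["treatment"][0] / (seg["control"][0] + seg["treatment"][0])
+
+
+for name, seg in counts.items():
+    print(name, rate(*seg["control"]), rate(*seg["treatment"]), treatment_share(name))
+print("overall", pooled_rate("control"), pooled_rate("treatment"))
+```
+
+The treatment share swings from 20% in one segment to 80% in the other, which is exactly the per-segment allocation skew the SRM section is meant to catch.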
+
+---
+
+## What a "ship only to segment X" recommendation requires
+
+Don't recommend a segment-scoped ship unless **all** of these hold:
+
+1. The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it).
+2. The segment's per-variant sample clears whatever exposure floor the platform applies to the overall experiment, by a comfortable margin.
+3. The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment.
+4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product.
+5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply.
+
+Otherwise, the segment-only ship is a post-hoc story dressed up as a decision. Recommend confirming with a follow-up experiment scoped to that segment.
+
+---
+
+## When a segment loses but overall wins
+
+This is the everyday case of mixed effects.
+
+- If the losing segment is small and its absolute hit is acceptable, ship to all — but call out the loser in the rationale.
+- If the losing segment is large or has a guardrail regression, recommend iterate, not ship.
+- If the losing segment is a regulated / strategic cohort (paying tier, top customers, EU), default to iterate — guardrails on the cohort, not just overall.
+
+---
+
+## What NOT to do
+
+- ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition.
+- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment query fallback — they're not corrected unless the platform did it.
+- ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal.
+- ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism.
+- ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence).
+
+---
+
+## Output shape
+
+1. **One-sentence segment-level summary** — homogeneous, heterogeneous, or Simpson's-suspicious.
+2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered).
+3. **What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's."
+4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating).
+
+---
+
+## Platform support status
+
+Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the rules above. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-of-interest-selection.md
new file mode 100644
index 0000000..4db49ac
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-of-interest-selection.md
@@ -0,0 +1,116 @@
+# Segment-of-Interest Selection
+
+Pick 3–5 segments **likely to reveal a real effect difference** rather than slicing every available dimension and ending up p-hacking.
+
+The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them.
+
+---
+
+## Why this matters: the fishing-expedition problem
+
+If you slice an experiment by every available property (10 platforms × 20 countries × 5 plan tiers × …), you will find "significant" segment-level effects by chance alone. The family-wise false positive rate explodes the same way it does for too many primary metrics — except there's usually no platform-level correction across segments. **Pre-committing to a small set of segments, ordered by hypothesis-driven probability, is the discipline that makes segment analysis credible.**
+
+Aim for 3–5 segments, max. If the user wants more, ask which ones are connected to the hypothesis and which are exploration. Mark the exploration set as "hypothesis-generating, not decisional."
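+
+As a rough illustration of how fast this grows, assume $m$ independent segment comparisons, each tested at a per-comparison false positive rate $\alpha$ (both the independence and the $\alpha = 0.05$ below are simplifying assumptions, not platform settings):
+
+$$
+\text{FWER} = 1 - (1 - \alpha)^m, \qquad
+1 - 0.95^{5} \approx 0.23, \qquad
+1 - 0.95^{20} \approx 0.64
+$$
+
+Under these assumptions, five pre-committed segments already carry roughly a one-in-four chance of at least one spurious "win", and twenty slices make one more likely than not.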
+
+---
+
+## The decision tree for picking segments
+
+Walk through these in order. The first match is the most defensible pick.
+
+### 1. Segments the hypothesis explicitly names
+
+If the experiment's `hypothesis` (or `description`) text mentions "new users", "mobile", "Pro tier", "EU customers" — those segments are pre-committed by the experiment design. Always include them.
+
+Look at:
+
+- `experiment.hypothesis`
+- `experiment.description`
+- The setup-side conversation, if present
+
+These are not exploratory; they're the variables the team committed to test.
+
+### 2. Segments where the mechanism is expected to matter
+
+The hypothesis names _what_ the change is and (ideally) _why_ it should work. The "why" tells you which user attributes plausibly moderate the effect:
+
+| Hypothesis mechanism                              | Segments likely to moderate the effect             |
+| ------------------------------------------------- | -------------------------------------------------- |
+| "Reduces first-time friction in onboarding"       | New vs returning; signup source; locale            |
+| "Improves discoverability of feature X"           | Users who previously used X vs not; tenure         |
+| "Speeds up a slow flow"                           | Platform (mobile slower than web); connection type |
+| "Lowers payment friction"                         | Plan tier; payment-method type; geography          |
+| "Replaces a confusing UI element"                 | New vs returning (returning users habituated)      |
+| "Surfaces a feature only relevant to power users" | Engagement-tier cohorts; tenure                    |
+| "Localized copy / pricing change"                 | Country / language                                 |
+
+If you can't articulate _why_ a segment should respond differently, it's not a hypothesis-driven slice. Demote it.
+
+### 3. Segments where the **denominator** plausibly differs
+
+Some properties don't change _behavior_ but change _who gets exposed_. Slicing on these helps catch changed-denominator artifacts before they're called a win.
+
+- Triggered vs untriggered cohorts (if the treatment only fires on certain pages).
+- Platform / app version (the treatment may only ship on a subset of clients).
+- Device class (mobile vs desktop) when the change is platform-specific.
+
+A 1000% lift in `Checkout Screen Viewed` overall usually disappears once you condition on "users who reached the checkout funnel" — that disappearance is the finding.
+
+### 4. Segments where SRM or baseline shift is suspected
+
+If overall SRM is borderline (or failing in one variant only), per-segment SRM can localize the bucketing bug to a specific platform / country / cohort. Examples:
+
+- iOS vs Android (often the SDK bucketing layer differs).
+- Bot-suspicious countries (`bot_traffic` cause from health-check).
+- A specific app version range that shipped a flag-evaluation change.
+
+This is diagnostic segmentation, not interpretation segmentation. Use it when the **trustworthiness gate** has already flagged trouble.
+
+### 5. Segments the platform de facto requires
+
+Some user dimensions are so foundational that any results report should mention them once:
+
+- **Platform** — web vs iOS vs Android.
+- **New vs returning** — defined as first session within the experiment window vs before.
+- **Geo region** — EU vs US vs APAC, when results meaningfully differ by regulatory or payment context.
+
+Don't include all three blindly — pick the one(s) most likely to vary given the change.
+
+---
+
+## Sanity checks before committing to a slice
+
+For each segment you want to break down on:
+
+1. **Does each segment value have enough exposed users per variant to clear the platform's overall sufficiency threshold?** Below that, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment.
+2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis.
+3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. "Pro tier" boundaries changed mid-test) invalidate the comparison.
+4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes; slicing on them introduces selection bias rather than stratification.
+
+---
+
+## How many slices to commit to
+
+| Situation                                                         | Number of slices                |
+| ----------------------------------------------------------------- | ------------------------------- |
+| Hypothesis-driven, well-powered, decisional                       | 3–5 segments, named upfront     |
+| Exploratory ("anything weird?"), flagged as hypothesis-generating | Up to ~10, with explicit caveat |
+| Diagnostic (chasing a failing SRM or strange overall result)      | Whatever helps localize the bug |
+
+If the user wants to "just look at everything", push back: pick the top 3–5 with reasoning, then offer a separate exploratory pass that won't be used for the ship decision.
+
+---
+
+## The pre-commit ritual
+
+Before running the breakdowns, tell the user something like:
+
+> _"Based on the hypothesis (`<one-line summary>`), I'd slice by `<segment A>` and `<segment B>` because `<why each matters>`. I'm intentionally not slicing `<X, Y, Z>` because they don't connect to the proposed mechanism — looking at every dimension makes false positives almost guaranteed. We can do an exploratory pass after, separately from the ship decision. Sound right?"_
+
+Pre-commitment is what separates "segmentation analysis" from "fishing."
+
+---
+
+## Then read the results
+
+Once the segment breakdown is in hand, switch to [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there.
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/session-replay-analysis.md
new file mode 100644
index 0000000..7282bb4
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/session-replay-analysis.md
@@ -0,0 +1,109 @@
+# Session-Replay Analysis Guidance
+
+Turn a quantitative experiment result into a behavior story using session replays.
+
+> **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss.
+
+---
+
+## When replays help, when they don't
+
+| Question                                                                                 | Replays help?                                                                         |
+| ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
+| "Why is conversion lower in treatment?"                                                  | Yes — behavior diff is observable.                                                    |
+| "Why is `Checkout Screen Viewed` 10× higher in treatment?" (changed-denominator suspect) | Yes — replays show whether users are _bouncing_ or _converting_ after they get there. |
+| "Why is `time_on_page` higher in treatment?"                                             | Yes — distinguishes engaged reading vs confused dwell.                                |
+| "Is the treatment shipping a regression on iOS only?"                                    | Sometimes — better answered first by segment breakdown.                               |
+| "Why is SRM failing?"                                                                    | No — replays don't show bucketing. Go to health checks.                               |
+| "What's the lift?"                                                                       | No — replays are qualitative; they explain _why_, not what.                           |
+| "Why hasn't this hit statsig yet?"                                                       | No — that's a sample/power question, not a behavior question.                         |
+
+A useful heuristic: replays answer _behavioral_ questions. If the question isn't behavioral, replays will burn time without adding signal.
+
+---
+
+## Cohort selection: which replays to compare
+
+You're looking for **paired contrast**, not a random sample. Pick the cohort that maximizes signal for the specific question.
+
+| Question                                                             | Cohort A (replays to pull)                                 | Cohort B (replays to pull)                                  |
+| -------------------------------------------------------------------- | ---------------------------------------------------------- | ----------------------------------------------------------- |
+| Why is primary metric down in treatment?                             | Treatment users who **failed** the primary action          | Control users who **succeeded** at the primary action       |
+| Why is a guardrail regression appearing?                             | Treatment users who **triggered** the guardrail negatively | Control users who did NOT trigger it                        |
+| Why does treatment have a huge lift in `Screen Viewed` (denom shift) | Treatment users who reached the screen                     | Same users, looking at whether they completed the next step |
+| Why is engagement higher / lower in a specific segment?              | Treatment users in that segment                            | Control users in the same segment                           |
+| What does the new UI look like in practice?                          | Any treatment users who saw the change                     | Any control users to confirm the baseline UI                |
+
+**Aim for ~5 replays per cohort.** Fewer and you're anecdote-shopping; many more and you'll just confirm what the first 5 already showed. If the first 5 are inconclusive or contradictory, pull 5 more before changing tactics.
+
+Filter by recency — replays from the most recent days of the experiment best reflect steady-state behavior (avoid novelty / primacy noise).
+
+---
+
+## What to actually watch for
+
+Go in with a hypothesis from the quantitative result. Don't watch replays blank-eyed; you'll see "users using the app" and learn nothing.
+
+### Friction / failure patterns
+
+- **Hesitation** — long pause before clicking a key element (often signals confusion).
+- **Misclicks** — clicking non-interactive elements, or rage-clicking a button that didn't work.
+- **Form abandonment** — typing into a field, then leaving without submitting.
+- **Back-button bounce** — landing on the page, then immediately backing out.
+- **Scroll-and-leave** — scrolling without engaging, then exiting.
+
+If treatment has more of these than control, you have a behavior explanation for a primary loss or guardrail regression.
+
+### Layout / discoverability issues
+
+- **CTA below the fold** — users never scrolling to where the new button is.
+- **Element overlap on mobile** — the treatment looks fine in desktop testing but breaks on small screens.
+- **Hidden state** — a tooltip / modal that fires once and is then gone, so the user never sees the key affordance.
+
+These usually explain segment heterogeneity (loss concentrated in mobile, or in a specific viewport size).
+
+### Changed-denominator behavior
+
+If you're investigating a Twyman's-Law-sized lift, look for:
+
+- **Users landing on the new screen and immediately leaving** — explains the inflated `Viewed` event without explaining real conversion.
+- **Users completing the rest of the funnel at a much lower rate per-arrival** — explains why the headline metric grew but downstream metrics didn't follow.
+
+If treatment users _arrive_ at a screen more often but _complete_ at a lower per-arrival rate, the "lift" is a denominator artifact and the per-arrival completion rate is the real story.
+
+### Variant-specific UI issues
+
+- **Treatment showed the wrong copy / wrong asset** — surprisingly common; treatment shipped, but to a subset of routes only.
+- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to [health-check-interpretation.md](health-check-interpretation.md).
+- **Treatment fired twice / persisted state across sessions** — implementation regression.
+
+---
+
+## How to frame the findings
+
+Replay analysis is qualitative. Be honest about that.
+
+- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in the experiment's results."_
+- ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict.
+
+Tie observations back to specific quantitative results from the experiment-details response. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior.
+
+---
+
+## What NOT to do
+
+- ❌ Use replays to override a clear quantitative verdict. If primaries say "ship" and replays look ugly, the ugliness might be edge cases — confirm with segment analysis first.
+- ❌ Cherry-pick a single dramatic replay. n=1 is anecdote.
+- ❌ Replace segment analysis with replays. Replays explain _behavior_; segments explain _who_. Different questions.
+- ❌ Pull replays from broad cohorts ("all treatment users") — the contrast pair is what reveals signal.
+- ❌ Spend more time on replays than on the headline interpretation. The decision tree comes first; replays are the explanation step after it.
+
+---
+
+## Output shape
+
+1. **The quantitative result the replays are explaining** — link back to the specific metric and verdict.
+2. **Cohorts watched** — what filters were applied to A and B, how many replays in each.
+3. **Patterns observed**, with counts (e.g. "4 of 5 treatment replays showed X; 0 of 5 control replays did").
+4. **The explanation hypothesis** — careful to frame as hypothesis ("consistent with"), not as proof.
+5. **Recommended next action** — usually one of: ship anyway (regression edge-case), iterate (fix the friction), kill (treatment is materially worse), or run a follow-up A/B with the fix.
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
new file mode 100644
index 0000000..37ec069
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
@@ -0,0 +1,115 @@
+# Why Hasn't This Reached Statistical Significance Yet?
+
+Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts.
+
+The actual stop / extend math (sample size, power, MDE) is owned by the `design-experiment` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one.
+
+---
+
+## First, rule out a broken result
+
+Inconclusive can mean two very different things:
+
+1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about.
+2. **The result isn't trustworthy at all** — SRM failing, broken data, peeked frequentist, etc. — and "inconclusive" is the wrong frame entirely.
+
+Before answering "why no statsig?", run the **trustworthiness gate**. If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power.
+
+Also check:
+
+- The primary's lift is missing or null → no measurement, not "no effect."
+- The primary is listed on the experiment but has no computed result (live or cached) → "no measurement," not "no effect."
+- The live results carry an error block → results are stale or partial; resolve the backend issue before drawing power conclusions.
+
+---
+
+## The five real reasons an experiment hasn't hit statsig
+
+Walk through these in order. The first one that explains the picture is usually right.
+
+### 1. Not enough sample yet (not enough exposures)
+
+**What to check**: per-variant exposure counts against the configured end target (sample size or duration, whichever the experiment was configured with), and which testing model the experiment is using.
+
+- **Sequential** + target not reached → genuinely too early. Recommend **WAIT**.
+- **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe.
+- Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5.
+
+If exposures are falling short of plan because traffic dropped: surface that. Querying the exposure event with a date breakdown shows whether something changed mid-experiment.
+
+### 2. Observed effect is smaller than the MDE
+
+**What to check**: the lift on the primary metric, plus the MDE the user planned for (typically captured in the experiment's hypothesis/description, or recovered via the setup-side skill's power math).
+
+- Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1.
+- Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options:
+  - **Accept the null** — at this size, the change isn't moving the metric. Document and move on.
+  - **Resize and rerun** — if a smaller effect would still be ship-worthy, re-run with a larger sample (lower MDE).
+- Observed lift much **larger** than planned MDE but still not significant → unusual; likely high variance (see reason 3) or insufficient exposures (reason 1).
+
+### 3. Variance is too high (metric is too noisy)
+
+**What to check**: the metric's distribution type, plus whether CUPED and Winsorization are enabled.
+
+- **Gaussian** metric (revenue, time-on-page) with no Winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization on the next run.
+- **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume.
+- **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the absolute scale of detectable effects. Near 0%, a given relative lift is a tiny absolute change and needs much more sample; near 100%, there is little headroom left for the metric to move.
+- **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%.
+- **CUPED enabled on a new-user-only cohort** → CUPED has no effect (no pre-exposure data exists). Not a misconfiguration to "fix," but variance reduction simply didn't happen.
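+
+The 30–70% range above can be related to how well the pre-exposure covariate predicts the metric. In the standard single-covariate CUPED formulation (an assumption here; the platform's implementation may differ), with $\rho$ the correlation between the pre-exposure covariate $X$ and the in-experiment metric $Y$:
+
+$$
+\operatorname{Var}(Y_{\text{CUPED}}) = (1 - \rho^2)\,\operatorname{Var}(Y)
+$$
+
+Since required sample scales with variance, a 30% reduction corresponds to $\rho \approx 0.55$ and a 70% reduction to $\rho \approx 0.84$. When no usable pre-exposure covariate exists, nothing is gained, which matches the new-user-only case.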
+
+### 4. Traffic split is starving the variant
+
+**What to check**: the configured traffic split against the actual per-variant exposure counts.
+
+- Even split (50/50) → a balanced split is already optimal for power, so the split is usually not the issue.
+- Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later.
+- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs to clear the platform's per-variant exposure floor in its own right. Adding arms costs power per-comparison.
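+
+To see why a skewed split costs power, a simplified sketch: assume equal per-user variance $\sigma^2$ in both arms, $N$ total exposed users, and a fraction $q$ assigned to treatment. The variance of the difference in means is then
+
+$$
+\operatorname{Var}(\bar{Y}_T - \bar{Y}_C) = \frac{\sigma^2}{qN} + \frac{\sigma^2}{(1-q)N} = \frac{\sigma^2}{N\,q(1-q)}
+$$
+
+At $q = 0.5$ the factor $1/(q(1-q))$ is 4; at $q = 0.1$ it is about 11.1. Under these assumptions a 90/10 split needs roughly 2.8 times the total traffic of a 50/50 split to reach the same precision.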
+
+Never change traffic allocation mid-Frequentist test — it invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment.
+
+### 5. Exposure config is filtering more users than the user expects
+
+**What to check**: exposure event volume, any audience filters on the backing feature flag, and whether QA traffic is being excluded.
+
+- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query the exposure event to confirm how many users actually got exposed.
+- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event.
+- QA traffic isn't being excluded and you suspect internal traffic is dominating one variant → enable the QA exclusion on the next run (results will be cleaner, though the sample will be smaller).
+
+**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md).
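+
+As a simplified illustration (the referenced notes remain the source for the actual method), assume a fraction $f$ of exposed users is triggered, untriggered users show no effect, and both groups share the same baseline rate. A relative lift $\delta_{\text{triggered}}$ among triggered users then shows up at the population level as
+
+$$
+\delta_{\text{overall}} = f \cdot \delta_{\text{triggered}}, \qquad \text{e.g. } 0.2 \times 10\% = 2\%
+$$
+
+so a real 10% effect on the 20% of users who saw the change reads as a 2% lift on the exposed population, which may sit below the planned MDE.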
+
+---
+
+## Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL?
+
+Once you know which reason fits, the recommendation almost picks itself.
+
+| Reason                                 | Recommendation                                                                                               |
+| -------------------------------------- | ------------------------------------------------------------------------------------------------------------ |
+| Not enough sample yet, still ACTIVE    | **WAIT.** Show projected end date based on observed traffic.                                                 |
+| Not enough sample yet, concluded early | **EXTEND** (Frequentist: relaunch with longer planned duration; Sequential: resume if possible).             |
+| Effect << MDE                          | **ACCEPT NULL** if the planned MDE is the smallest ship-worthy effect; otherwise **BOOST POWER** and re-run. |
+| Variance too high                      | **BOOST POWER**: enable CUPED, enable Winsorization, switch to a less noisy metric proxy.                    |
+| Variant starved by traffic split       | **EXTEND** (if remaining time is enough) or restart with rebalanced split.                                   |
+| Exposure config is filtering           | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample.               |
+| Experiment finished, well-powered      | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters.       |
+
+When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `design-experiment` skill for the power math.
+
+---
+
+## What NOT to suggest
+
+- ❌ **Stop early on a favorable peek** in a Frequentist test — that's exactly the false-positive inflation problem.
+- ❌ **Switch testing model mid-experiment** — restart, don't morph.
+- ❌ **Add more primary metrics** to "fish" for a win — multiplies the family-wise FPR. If a single primary is inconclusive, more primaries make the picture worse, not better.
+- ❌ **Re-run identical hypothesis on the same audience right after concluding "no effect"** — without a power change, you'll get the same answer.
+- ❌ **Claim "no effect"** from an underpowered inconclusive result — the right framing is "the experiment wasn't sized to detect the effect we observed."
+
+---
+
+## Output shape
+
+1. **The reason** (one of the five above), in one sentence.
+2. **The evidence** — concrete numbers from the experiment (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%").
+3. **Recommendation** from the table above, with the specific experiment update or follow-up action.
+4. **What NOT to do**, briefly: the wrong-way temptation specific to this experiment.
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
new file mode 100644
index 0000000..c370fc0
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
@@ -0,0 +1,129 @@
+---
+name: interpret-experiment
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `design-experiment` skill.
+license: Apache-2.0
+---
+
+# Interpret Experiment
+
+You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a verdict field is missing, say so — do not synthesize one from raw values.
+
+---
+
+# Glossary
+
+Concepts the rest of this skill uses without redefining.
+
+- **Variant.** One arm of the experiment. The variant treated as the baseline is the **control**; the others are **treatments**. The platform marks which key is the control.
+- **Primary / Guardrail / Secondary metric.**
+  - **Primary** — drives the ship decision. The platform applies multiple-testing correction across primaries when configured.
+  - **Guardrail** — a metric that must not regress; a guardrail loss vetoes a ship even when primaries win.
+  - **Secondary** — exploratory only, never decisional, no correction applied.
+- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict.
+- **Polarity.** Whether a movement is _good for the business_. Combines sign of lift with the metric's `direction` ("up" = bigger is better; "down" = smaller is better). See the **Polarity recipe** in Components.
+- **Significance.** The platform's per-row classification: `YES_POSITIVE`, `YES_NEGATIVE`, or `NO`. Read from the response — do not recompute.
+- **SRM (Sample Ratio Mismatch).** Variants received traffic in proportions that disagree with the configured split. **Kohavi's #1 trustworthiness check** — when SRM fails, downstream lift, p-values, and CIs cannot be trusted.
+- **Retro A/A (pre-experiment bias).** Re-runs the comparison on the pre-exposure period. A failure means cohorts already differed before treatment started.
+- **Twyman's Law.** "Any unusually clean or unusually large result is more likely a bug than a discovery." Apply on lifts > ~30% — usually a changed-denominator artifact.
+- **CUPED.** Variance reduction using pre-exposure baseline. Cuts required sample 30–70% when it applies. Inert on new-user-only cohorts.
+- **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95.
+- **MDE (Minimum Detectable Effect).** The smallest lift the experiment was sized to detect. Set during experiment setup.
+- **Trustworthiness gate.** The pre-flight check that runs before any results interpretation: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference.
+
+---
+
+# Components
+
+The pieces every interpretation uses. Defined here once so they don't drift across the steps and references.
+
+## Polarity recipe (load-bearing — apply on every metric row)
+
+The platform's summary buckets (`positive` / `negative` / `no`) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion.
+
+Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"):
+
+- `lift is None` or `lift == 0` → **neutral** (no measurement / no effect respectively).
+- `direction == "up"` → **positive** if `lift > 0`, else **negative**.
+- `direction == "down"` → **positive** if `lift < 0`, else **negative**.
+
+A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control.
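+
+The recipe above can be sketched as a small function. This is a minimal illustration, not the platform's implementation: the function name `polarity` and its return strings are assumptions chosen for readability.
+
+```python
+from typing import Optional
+
+
+def polarity(lift: Optional[float], direction: str = "up") -> str:
+    """Return 'positive', 'negative', or 'neutral' from the business point of view."""
+    if lift is None or lift == 0:
+        return "neutral"  # no measurement, or no effect
+    if direction == "up":
+        return "positive" if lift > 0 else "negative"
+    if direction == "down":
+        return "positive" if lift < 0 else "negative"
+    raise ValueError(f"unknown direction: {direction!r}")
+
+
+# A row from summary.positive (lift > 0) on a "down" metric is a regression.
+print(polarity(0.04, "down"))   # negative
+print(polarity(-0.01, "down"))  # positive
+print(polarity(None))           # neutral
+```
+
+Raising on an unknown `direction` is a design choice in this sketch: it surfaces a malformed metric definition instead of silently defaulting to "up".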
+
+The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**.
+
+## Data-source fallback
+
+Experiment-details has two parallel data paths — live (preferred) and cached. Always prefer live; if live computation failed, fall back to cache with a staleness caveat; if **both** are empty, say "no result was computed" and recommend a re-sync. **Never** silently treat missing data as "no effect."
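+
+A minimal sketch of that fallback order, assuming the two data paths have already been extracted into `live` and `cached` payloads. Both names, the `pick_results` helper, and the note strings are hypothetical and do not reflect the actual response shape.
+
+```python
+from typing import Optional, Tuple
+
+
+def pick_results(live: Optional[dict], cached: Optional[dict]) -> Tuple[Optional[dict], str]:
+    """Choose which payload to interpret and report where it came from."""
+    if live:
+        return live, "live"
+    if cached:
+        return cached, "cached: add a staleness caveat"
+    # Both paths empty: this is "no measurement", never "no effect".
+    return None, "no result was computed: recommend a re-sync"
+
+
+data, source = pick_results(None, {"lift": 0.04})
+print(source)  # cached: add a staleness caveat
+```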
+
+## Verdict table
+
+| Situation                                                              | Recommendation                                                                                                                                                                       |
+| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Conclude the experiment via its decide lifecycle action, naming the winning variant and a rationale message. **Confirm with the user first — concluding is irreversible.** |
+| Trust ✓, primary polarity positive, guardrail polarity negative        | **ITERATE.** Investigate the regression; do not auto-ship.                                                                                                                           |
+| Trust ✓, primary polarity neutral after target sample reached          | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md).                                                          |
+| Trust ✓, target sample/duration not yet reached                        | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)).                                                                 |
+| Trust ✗                                                                | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md).                         |
+
+For multi-variant tests, special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), and the exact decide-call shape, see [references/lifecycle-handoff.md](references/lifecycle-handoff.md).
+
+---
+
+# Steps
+
+Top-down: what to do, in order.
+
+## 1. Fetch the experiment
+
+If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs. Accept the experiment by name or by ID; try ID match first, then case-insensitive name match.
+
+Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments.
+
+Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret.
+
+## 2. Run the trustworthiness gate (the Decision Tree)
+
+Run steps 2a–2e in order. **Stop at the first failure** — do not proceed if a step flags a problem. The platform attaches verdict fields for each check; consume those verdicts rather than recomputing.
+
+### 2a. Trustworthiness
+
+SRM ok? Retro A/A clean? Exposures sufficient? Minimum duration met (~3 days)? No misconfiguration? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md). The Misconfigurations section in that reference covers the warning-level signals (multiple-testing off, extreme winsorization, CUPED on new-users-only, etc.).
+
+### 2b. Statistical significance
+
+Apply the **polarity recipe** from Components to each non-control variant × primary metric. If nothing is significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md). For translating a single metric's lift / CI / p-value into a phrase, see [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
+
+### 2c. Guardrail check
+
+Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a row in `summary.positive` for a `direction: "down"` guardrail is still a regression.
+
+### 2d. Practical significance
+
+Convert lift into absolute terms by multiplying it by the control baseline. Statistical significance alone does not justify shipping. The per-metric reference covers the baseline-fetch fallback when `value` or `sampleSize` is missing, and the **Twyman's Law** check for any lift > ~30%.
+
+### 2e. Verdict
+
+Look up the situation in the **Verdict table** in Components. If the recommendation is SHIP or KILL, surface the proposed decide-action parameters and **wait for explicit user confirmation** before executing — concluding an experiment is irreversible.
+
+## 3. Going deeper (open references on demand)
+
+| User asks about…                                                                    | Open                                                                                             |
+| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
+| SRM failing, Retro A/A failing, exposures insufficient, or any trustworthiness fail | [references/health-check-interpretation.md](references/health-check-interpretation.md)           |
+| "Translate this lift / CI / p-value into English"                                   | [references/per-metric-interpretation.md](references/per-metric-interpretation.md)               |
+| "Why hasn't this hit statsig yet? Should we wait or stop?"                          | [references/why-no-statsig.md](references/why-no-statsig.md)                                     |
+| "Which segments should I break this down on?"                                       | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md)       |
+| "What does this segment-by-segment result mean?"                                    | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) |
+| "Can session replays help explain this result?"                                     | [references/session-replay-analysis.md](references/session-replay-analysis.md)                   |
+| "How do I actually conclude this experiment? Multi-variant ship?"                   | [references/lifecycle-handoff.md](references/lifecycle-handoff.md)                               |
+
+## 4. Output
+
+Default to this shape unless the user asks for something else:
+
+1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`.
+2. **Why**, walking through the trustworthiness-gate steps that mattered (skip steps that were clearly fine).
+3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, each polarity-corrected. Include absolute-impact translation for any win.
+4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, stale-cache caveat, etc.
+5. **Suggested next action** — for SHIP / KILL, the proposed decide-action parameters **gated on user confirmation**; for ITERATE / WAIT, the investigation to run next.
+
+If experiment details are unavailable or return errors, say so — do not invent a verdict.
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md
new file mode 100644
index 0000000..1467468
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md
@@ -0,0 +1,176 @@
+# Health-Check Interpretation
+
+Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action.
+
+---
+
+## Kohavi framing — always cite when a health check fails
+
+> **Sample Ratio Mismatch is the #1 trustworthiness check (Kohavi).** When SRM is failing, do not trust the experiment's lift, p-values, or confidence intervals — the randomization assumption is broken, so the measured effect cannot be attributed to the treatment.
+>
+> **Twyman's Law**: any unusually clean or unusually large result is more likely a bug than a discovery. A spectacular lift on a failing-SRM experiment is not evidence of a great treatment; it's evidence the bucketing is broken.
+
+These two principles drive the recommendations below. Lead with them when explaining a failing check to the user.
+
+---
+
+## 1. SRM (Sample Ratio Mismatch)
+
+**What the platform tells you**: the SRM verdict the experiment-details response carries (live, or cached when live isn't available). The platform tags failing SRMs already — consume the verdict, do not compute chi-square yourself.
+
+### What it means
+
+Users were assigned to variants in proportions that disagree with the configured target allocation. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness.
+
+### Likely causes, ordered most → least likely
+
+(Surface in this order — investigate the most probable first.)
+
+1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees.
+2. **biased_assignment** — The assignment criterion correlates with the variant — e.g. assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window.
+3. **bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation.
+4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the exposure event fires exactly once per user per variant assignment.
+5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period.
+
+### Recommended actions
+
+- **pause_and_investigate** — Pause the experiment before drawing any conclusions. SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable.
+- **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric.
+- **investigate_exposure_logging** — Compare exposure event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs.
+- **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split.
+
+### Investigation checklist
+
+1. Compare the actual per-variant exposure ratio to the configured target allocation — which variant is over/under-represented?
+2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history.
+3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math.
+4. Verify SDK version and bucketing logic. Query the exposure event grouped by variant to confirm exposure events are flowing correctly.
+5. Check for bot/QA traffic — bots often skew toward control. If QA traffic isn't being excluded, recommend enabling that filter.
+6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting.
+7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.**
+
+---
+
+## 2. Retro A/A (pre-experiment bias) failure
+
+**What the platform tells you**: the pre-experiment-bias analysis the platform attaches when that check is enabled in the experiment's settings.
+
+### What it means
+
+The same statistical comparison run on the **pre-exposure** period revealed that variant cohorts already differed _before_ the treatment started. Any "lift" measured during the experiment may just be reflecting that pre-existing gap, not the change.
+
+- Pre-experiment bias on a **primary** metric is a **stop-and-investigate** signal.
+- Pre-experiment bias on a **secondary** metric is informational only.
+
+### Investigation checklist
+
+1. Identify which metric × variant pair triggered the failure (after the platform's correction).
+2. Check whether bucketing was deterministic — non-deterministic assignment in the pre-period means users were assigned to different variants than they would have been in production.
+3. Look for cohort skew: did one variant disproportionately receive heavy users? Query the metric pre-experiment grouped by variant to confirm.
+4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort.
+5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing.
+
+---
+
+## 3. Insufficient exposures
+
+**What the platform tells you**: per-variant exposure counts plus an "insufficient" flag when the count is too low to trust. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue.
+
+### Investigation checklist
+
+1. Check per-variant exposure totals — which variant is undersampled?
+2. Inspect feature-flag rollout — was rollout dialed back?
+3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?).
+4. If the experiment is still ACTIVE: extend duration via an experiment update with a new end target.
+5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math.
+
+If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question.
+
+---
+
+## 4. Frequentist peeking
+
+**What to check**: the experiment's testing model and whether it ended before reaching its configured end condition (sample size or duration, whichever was configured).
+
+### What it means
+
+A frequentist test that ends before reaching its configured target has an **inflated false-positive rate**. The math assumes a fixed sample size; peeking before that point and stopping on a favorable look is exactly what "p-hacking" looks like in production.
+
+### Investigation checklist
+
+1. Confirm the testing model is frequentist (sequential tests don't have this problem).
+2. Compare the actual end date against the planned end (date- or sample-based, whichever the experiment was configured with).
+3. If the conclusion was premature: results have inflated false-positive rate. Recommend a re-run.
+4. If the user wants to keep current results: caveat strongly. Recommend a sequential testing model for the next experiment so they can stop early without penalty.
+
+(Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.)
+
+---
+
+## 5. Live computation timeout / broken data
+
+**What the platform tells you**: a non-null error block on the live results, with the live data path empty.
+
+### Investigation checklist
+
+1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy.
+2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget.
+3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision.
+4. If the cache is recent (within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or empty, the user must resolve the backend issue before any meaningful interpretation.
+
+---
+
+## 6. Experiment ran < 3 days
+
+**What to compute (this one is local)**: the elapsed time between the experiment's start and end.
+
+Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. **Refuse to interpret.** Tell the user explicitly:
+
+> _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_
+
+If the experiment was sample-size-bounded and a tiny target was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window.
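+
+Because this check is computed locally, a short sketch may help. It assumes the start and end are available as `datetime` values in the same timezone; `MIN_WINDOW` and `window_too_short` are illustrative names, not platform fields.
+
+```python
+from datetime import datetime, timedelta
+
+MIN_WINDOW = timedelta(days=3)  # the ~3-day floor described above
+
+
+def window_too_short(start: datetime, end: datetime) -> bool:
+    """True when the experiment ran for less than the minimum window."""
+    return (end - start) < MIN_WINDOW
+
+
+print(window_too_short(datetime(2024, 1, 1, 9), datetime(2024, 1, 3, 9)))  # True
+print(window_too_short(datetime(2024, 1, 1, 9), datetime(2024, 1, 8, 9)))  # False
+```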
+
+---
+
+## 7. Misconfigurations
+
+These don't always invalidate results, but they change how to _read_ them. Surface them as warnings during the trustworthiness gate.
+
+### Multiple-testing correction off with several primaries
+
+**Correction off AND 2+ primaries × 1+ non-control variants.** Any single significant primary may be a false positive: across `m` independent comparisons the family-wise error rate grows as `1 − (1 − α)^m` (e.g. 15 primaries × 1 variant at α=0.05 → ~54% chance of at least one false positive). Look at primaries in aggregate: if most point the same direction, the effect is likely real; if only one or two of many are significant, recommend enabling Benjamini-Hochberg or Bonferroni and re-analyzing.
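+
+The ~54% figure can be reproduced with a few lines. This sketch assumes the comparisons are independent, which real metrics rarely are, so treat the result as a rough upper-range estimate, not an exact rate. The function name is illustrative.
+
+```python
+def family_wise_error_rate(alpha: float, comparisons: int) -> float:
+    """Chance of at least one false positive across independent comparisons."""
+    return 1.0 - (1.0 - alpha) ** comparisons
+
+
+# 15 primaries x 1 non-control variant at alpha = 0.05
+fwer = family_wise_error_rate(0.05, 15)
+print(f"{fwer:.1%}")  # 53.7%
+```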
+
+### Extreme winsorization percentile
+
+**Winsorization enabled with a percentile far from the platform default (typically 95).** A percentile near 50 clips roughly half of all values to the median, which is almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason.
+
+### SRM check disabled
+
+**SRM check is off.** Often deliberate — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) and then recommend re-enabling SRM and re-analyzing.
+
+### CUPED on new-users-only cohort
+
+**CUPED enabled AND the cohort is "new users only".** CUPED needs pre-exposure data, so it had no effect here — but **results are still valid**, variance reduction just didn't happen. Mention as informational. For future experiments on this surface, suggest extending the cohort to include returning users so CUPED can apply.
+
+### Non-default confidence level
+
+**Confidence level differs from the platform default (typically 0.95).** `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out in the verdict and combine with metric count to estimate the family-wise error rate.
+
+### Broken or placeholder metric entries
+
+**Metric entries with empty names.** Likely broken or placeholder references. Flag and skip during analysis.
+
+### Primary metric with no computed result
+
+**A metric is listed as primary but has no result (live or cached).** This is **"no measurement," not "no effect."** Surface prominently; recommend re-syncing results before any conclusion that depends on this primary.
+
+---
+
+## Output shape when a health check fails
+
+1. **What failed**, in one sentence (use the verdict the platform attached — do not re-derive).
+2. **What that means for trust** — cite the Kohavi framing (SRM is #1) or Twyman's Law where it fits.
+3. **Likely causes**, ordered most → least probable.
+4. **Recommended action** from the small set above.
+5. **Investigation checklist** the user can run.
+6. **What NOT to do** — usually, "do not act on the current lift / p-value numbers."
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md
new file mode 100644
index 0000000..3a9e24c
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md
@@ -0,0 +1,39 @@
+# Lifecycle Hand-off
+
+How to conclude an experiment once the verdict is settled. This reference is **interpretation guidance** — the per-field schema of the decide action lives in the experiment-update tool description.
+
+---
+
+## Confirm before concluding — always
+
+Concluding an experiment is **irreversible**. Before invoking the decide action, surface the proposed parameters to the user (winning variant, success/fail, rationale message) and wait for explicit confirmation. A SHIP verdict is a recommendation, not an authorization.
+
+## The three pieces every decide call needs
+
+A decide call expresses three things:
+
+1. **Did the experiment succeed?** A win for one of the treatments, or a deliberate stop.
+2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special constants below.
+3. **Why?** A rationale message — what metrics were evaluated, the polarity reading, the tradeoffs accepted. The platform requires this on every decide call; treat it as a one-paragraph decision record, not a placeholder.
+
+## Special variant choices for success
+
+When you have a winning result but no single variant to ship:
+
+- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. (The platform exposes this as the constant `__no_variant_shipped__`.)
+- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The platform exposes this as `__defer_variant_decision__` and shows `SUCCESS_DEFERRED` in the UI.)
+
+When the verdict is KILL — no winner — record success as false. No variant key is needed in that case.
+
+## Multi-variant experiments
+
+For a 3+ arm test, the decide action still names a single winning variant. If two treatments are roughly tied:
+
+- If both clear the practical-significance bar and shipping either is acceptable, pick on simplicity (smaller diff from control, lower implementation cost).
+- If the team genuinely cannot pick, use the defer constant above — better than fabricating a winner.
+
+A multi-variant test where only one treatment is significantly different from control is a clean SHIP for that variant; the inconclusive arms are simply not the winner.
+
+## After concluding
+
+The decision record — the rationale message, the shipped variant, and the experiment's terminal status — becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer.
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md
new file mode 100644
index 0000000..e46381c
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md
@@ -0,0 +1,167 @@
+# Per-Metric Interpretation
+
+Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_
+
+---
+
+## The mental model
+
+Each row in `summary.positive` / `summary.negative` / `summary.no` answers four questions:
+
+1. **Did the lift go up or down?** — the `summary` bucket name (sign-of-lift, not polarity).
+2. **Was the change distinguishable from noise?** — the `significance` field (or the bucket name itself: rows in `summary.positive` / `summary.negative` are significant, rows in `summary.no` are not).
+3. **Was the change in the goal direction?** — apply the polarity recipe with `metric.direction`.
+4. **Was the change big enough to matter?** — multiply `lift` by the control baseline `value` to get absolute impact, then judge against business context.
+
+A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any one of those and you're shipping the wrong thing.
+
+---
+
+## Polarity recipe
+
+Apply the polarity recipe from the **Components** section of `SKILL.md`. Treat the bucket name in `summary.positive` / `summary.negative` as sign-of-lift only; the business verdict comes from combining it with `metric.direction`. Examples worth remembering:
+
+- A row in `summary.positive` with `direction: "down"` is a **regression**.
+- A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1%` lift on `interstitials_shown` means less interruption).
+
+---
+
+## Reading the p-value in this platform
+
+Mixpanel runs a frequentist comparison at the experiment's configured confidence level — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative).
+
+The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread.
+
+For the general meaning of a p-value (the probability of seeing a result at least this extreme if the null hypothesis were true), trust the model's baseline knowledge, and don't invent thresholds in either direction.
+
+---
+
+## Reading the lift correctly
+
+```
+lift = (treatment_mean - control_mean) / control_mean
+```
+
+- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct.
+- If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect."
+
+---
+
+## Verdict phrasing — a small palette
+
+Pick the phrase that matches the four-question pattern. These are the words to use with users; they map onto the platform's already-computed numbers, so the agent never has to invent thresholds.
+
+| Pattern (sig × polarity × magnitude)                        | Plain-language verdict                                                                                                                                                    |
+| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Significant, polarity positive, magnitude large vs baseline | "**Clear win** — `<metric>` moved `<lift%>` in the goal direction, which is meaningful at this baseline." (apply Twyman's Law if lift > ~30%)                             |
+| Significant, polarity positive, magnitude small vs baseline | "**Statistically significant but practically small** — `<lift%>` on a `<baseline>` baseline is `<absolute>`; confirm with the user whether that clears the business bar." |
+| Significant, polarity negative                              | "**Regression** — `<metric>` moved `<lift%>` against its goal direction. This is a reason not to ship even if other primaries won."                                       |
+| Not significant, lift in goal direction, well-powered       | "**Likely no effect at the detectable size.** The experiment had enough power to detect `<MDE>`; the observed lift is below that threshold."                              |
+| Not significant, lift in goal direction, underpowered       | "**Inconclusive — too underpowered to call.** Route to the why-no-statsig playbook to decide between wait / extend / restart."                                            |
+| Not significant, lift in wrong direction                    | "**No detectable harm**, but no win either."                                                                                                                              |
+| `lift is None`                                              | "**No measurement** — this variant's row failed to compute. Surface the failure and re-sync."                                                                             |
+| Lift > ~30% on any metric                                   | Prefix with "**Twyman's Law check:** that lift is unusually large; verify the denominator hasn't changed before celebrating."                                             |
+
+---
+
+## Magnitude — make it absolute
+
+Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful:
+
+1. Baseline from the control variant's metric value (the experiment-details response carries it on the per-variant row).
+2. Lift from the winning row.
+3. Absolute lift: `baseline × lift`. Examples:
+   - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate.
+   - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`.
+4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week."
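+
+The conversion steps above can be sketched as follows. The helper name `absolute_impact` and the 1M users/week traffic figure are illustrative assumptions; the baseline and lift values are the ones from the examples in the list.
+
+```python
+def absolute_impact(baseline: float, lift: float) -> float:
+    """Absolute change implied by a relative lift on a control baseline."""
+    return baseline * lift
+
+
+# Conversion-rate example: 2% baseline, +4% relative lift.
+delta = absolute_impact(0.02, 0.04)
+print(f"{delta * 100:+.2f} percentage points")  # +0.08 percentage points
+
+# Count example: 12.4 events/user/week, -5% relative lift.
+print(f"{absolute_impact(12.4, -0.05):+.2f} events/user/week")  # -0.62 events/user/week
+
+# Projection with an assumed (hypothetical) 1M exposed users per week.
+weekly_users = 1_000_000
+print(round(delta * weekly_users))  # 800 extra conversions per week
+```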
+
+### Fallback when the baseline value or sample size is missing
+
+This is common: it happens whenever live computation timed out or the cached results were nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.**
+
+Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation:
+
+- `unique` (Bernoulli) → conversion **rate** as the baseline.
+- `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. Multiplying lift by a raw total double-counts cohort size.
+
+---
+
+## Twyman's Law in practice — changed-denominator lifts
+
+Before celebrating any lift > ~30%, ask: **did the treatment change who is _exposed_ to this metric, not just how they behave?**
+
+If the treatment causes more users to _see_ a screen, more events naturally fire — the metric grows because the denominator changed, not because per-user behavior changed.
+
+- A "Free item" promotion drives more users to checkout → "Checkout Screen Viewed" lifts +1000% mechanically. The interesting question is **conversion rate on the screen**, not raw views.
+- A new banner makes a feature discoverable → "Feature Page Viewed" lifts dramatically. **Behavior per user who discovers the feature** may be unchanged.
+
+When you see a > 30% lift, name the risk explicitly:
+
+> _"This metric measures exposure to the screen/event. The treatment likely caused more users to be exposed; that explains most of the lift mechanically. The interesting question is what those users did once they got there."_
+
+---
+
+## Metric distribution types
+
+Different metric types behave differently; cite the relevant nuance in your verdict.
+
+| Metric type                      | Distribution | Interpretation nuance                                                                                     |
+| -------------------------------- | ------------ | --------------------------------------------------------------------------------------------------------- |
+| Unique users / conversion rate   | Bernoulli    | Variance = `p(1−p)`, largest at 50%. A given relative lift needs far more sample as the rate nears 0%.    |
+| Event counts / sessions per user | Poisson      | Variance = mean. Highly sensitive to power users; consider whether one heavy user can swing results.      |
+| Revenue / numeric properties     | Gaussian     | Long tails (whales) inflate variance. Strongly consider Winsorization.                                    |
+
+---
+
+## Variance-reduction & outlier settings that change interpretation
+
+- **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix).
+- **Winsorization enabled**: extreme values capped at the configured percentile, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A percentile much lower than the platform default (typically 95) is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md).
+
+---
+
+## Multiple comparisons & metric tiers — what's decisional and what isn't
+
+| Tier          | How it influences the verdict                                                                                                                                                                                 |
+| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| **Primary**   | **Decisional.** The platform auto-applies correction when the experiment is configured for Bonferroni or Benjamini-Hochberg (across primaries × variants).                                                    |
+| **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude.                                                                                                                                          |
+| **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. |
+
+If multiple-testing correction is off AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled.
+
+---
+
+## When a primary metric is inconclusive
+
+A "not significant" verdict means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level — **not that there is no effect.** Important when the user is about to call something a null result.
+
+For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see [why-no-statsig.md](why-no-statsig.md).
+
+---
+
+## Frequentist vs Sequential — what affects per-metric reading
+
+Concluding a Frequentist experiment before it reaches its configured target is a peeking event — per-metric significance verdicts become unreliable. Sequential experiments are designed for continuous monitoring and don't have this problem.
+
+For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of [health-check-interpretation.md](health-check-interpretation.md).
+
+---
+
+## Triggered analysis & dilution
+
+If the change only affects a subset of users (e.g. only triggers when a specific button is shown), the **effect on triggered users** is much larger than the **effect on the full exposed population**.
+
+- Triggered analysis zooms in on users who actually saw the change.
+- Dilution math: `population_lift = triggered_lift × (triggered_users / total_exposed)`.
+
+The platform doesn't auto-compute triggered analysis. If the change is gated by a condition, ask the user about the trigger rate and walk through the math before declaring the population-level lift "small."
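+
+A worked sketch of the dilution formula above, with made-up numbers. It assumes untriggered users see no effect and that the lift is measured on the same metric in both populations; `population_lift` is an illustrative helper, not a platform function:
+
+```python
+def population_lift(triggered_lift: float, triggered_users: int, total_exposed: int) -> float:
+    """Dilute a triggered-cohort lift across everyone exposed to the experiment."""
+    if total_exposed <= 0 or not 0 <= triggered_users <= total_exposed:
+        raise ValueError("need total_exposed > 0 and 0 <= triggered_users <= total_exposed")
+    return triggered_lift * (triggered_users / total_exposed)
+
+
+# A 20% lift among the 10% of users who saw the button is about 2% population-wide.
+diluted = population_lift(triggered_lift=0.20, triggered_users=10_000, total_exposed=100_000)
+print(f"{diluted:.1%}")
+```
+
+A population-level lift that looks "small" can therefore sit on top of a large triggered effect when the trigger rate is low.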
+
+---
+
+## Novelty and primacy
+
+- **Novelty** — lift is large early, then decays as users habituate.
+- **Primacy** — lift is small or negative early, then grows as users learn the new behavior.
+
+To detect either, look at the line-chart view of the metric (date-segmented). A monotonic decay from day 1 → day 14 is classic novelty; the steady-state lift is what matters for shipping. Call this out when interpreting any experiment shorter than ~2 weeks.
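+
+As a rough illustration of reading the date-segmented series, the sketch below compares the first and last thirds of a list of daily lifts. The function name, the thirds-based comparison, and the `tolerance` threshold are all assumptions made for this example; it is a heuristic for a positive-is-good lift, not a statistical test:
+
+```python
+from statistics import mean
+
+
+def classify_trend(daily_lifts: list[float], tolerance: float = 0.25) -> str:
+    """Label a daily lift series as "novelty", "primacy", or "stable".
+
+    Compares the mean of the first third with the mean of the last third,
+    relative to the larger magnitude of the two.
+    """
+    if len(daily_lifts) < 3:
+        raise ValueError("need at least 3 daily points")
+    third = len(daily_lifts) // 3
+    early = mean(daily_lifts[:third])
+    late = mean(daily_lifts[-third:])
+    scale = max(abs(early), abs(late))
+    if scale == 0:
+        return "stable"
+    change = (late - early) / scale
+    if change < -tolerance:
+        return "novelty"   # lift fades as users habituate
+    if change > tolerance:
+        return "primacy"   # lift grows as users learn the new behavior
+    return "stable"
+
+
+# Invented 14-day series that decays from +12% to +2%.
+decaying = [0.12, 0.11, 0.09, 0.08, 0.06, 0.05, 0.04,
+            0.04, 0.03, 0.03, 0.02, 0.02, 0.02, 0.02]
+print(classify_trend(decaying))
+```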
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md
new file mode 100644
index 0000000..98c7bbc
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md
@@ -0,0 +1,99 @@
+# Segment-Breakdown Interpretation
+
+Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
+
+---
+
+## The mental model
+
+A segment breakdown asks: _did the treatment affect different user segments differently?_ It has three possible outcomes per segment:
+
+1. **The segment moved in the same direction as the overall effect**, with similar magnitude → reinforces the overall verdict; nothing new.
+2. **The segment moved much more or less than overall**, but in the same direction → heterogeneity; the effect is concentrated in a subset.
+3. **The segment moved in the _opposite_ direction** to overall → Simpson's paradox or a real reversal — this is where segment analysis earns its keep.
+
+Reading a segment breakdown well means recognizing which of those three you're looking at and not mistaking noise for any of them.
+
+---
+
+## Per-segment polarity recipe — apply per row
+
+The same recipe from the per-metric reference applies _inside_ each segment. Don't take a shortcut.
+
+- For each segment × metric × non-control variant, look at the row's `lift` and bucket (positive/negative/no).
+- Translate sign-of-lift into business polarity using `metric.direction`. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary.
+- Filter out the control row in each segment.
+
+Surprisingly easy to forget when you're scanning a wide table — re-apply polarity per row.
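+
+The per-row translation can be sketched as below. The accepted `direction` values (`"increase"` / `"decrease"`) and the returned labels are assumptions for illustration; check what `metric.direction` actually contains before relying on this. It reports polarity only and says nothing about significance:
+
+```python
+from typing import Optional
+
+
+def business_polarity(lift: Optional[float], direction: str) -> str:
+    """Map a row's sign-of-lift to a business polarity.
+
+    `direction` is assumed to be "increase" (higher is better) or
+    "decrease" (lower is better).
+    """
+    if direction not in ("increase", "decrease"):
+        raise ValueError(f"unexpected direction: {direction!r}")
+    if lift is None:
+        return "no measurement"
+    if lift == 0:
+        return "no change"
+    moved_up = lift > 0
+    wants_up = direction == "increase"
+    return "favorable" if moved_up == wants_up else "unfavorable"
+
+
+# A positive-bucket row on a lower-is-better metric is a business loss.
+print(business_polarity(0.08, "decrease"))
+```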
+
+---
+
+## Sample-size floor per segment
+
+Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. The platform surfaces an "insufficient exposures" flag at the overall level — trust that signal over a hand-rolled threshold, and apply the same logic per segment.
+
+- Segments the platform would flag as insufficient if the experiment were scoped to them alone → mark "insufficient sample, treat as directional only."
+- A "significant" lift on a tiny per-variant segment (e.g. tens of users) is almost always noise. Say so.
+- If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice.
+
+---
+
+## Heterogeneity vs Simpson's paradox vs noise
+
+| What you see                                                                                        | Interpretation                                                                                                                                             |
+| --------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Most segments lift positive, one or two negative, all with overlapping CIs                          | **Noise.** Not heterogeneity. Don't ship a segment-specific story.                                                                                         |
+| One segment lifts much more than the rest, with a tight CI and a clear mechanism                    | **Real heterogeneity.** The change is concentrated in that segment. Consider shipping only to that segment, or revising the hypothesis.                    |
+| Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. |
+| Two opposite-direction effects in different segments that roughly cancel overall                    | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses.    |
+
+When you spot Simpson's paradox, route the user to the **SRM** section of [health-check-interpretation.md](health-check-interpretation.md) — bucketing is usually the cause, not a real reversal.
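+
+A toy example of the Simpson's-paradox row, using invented counts. Treatment wins inside each segment yet loses overall, purely because the variant mix is lopsided (900 treatment vs 100 control users on mobile, the reverse on desktop), which is exactly the imbalance a per-segment SRM check would catch:
+
+```python
+# segment -> variant -> (converters, exposed); all numbers are invented.
+data = {
+    "mobile":  {"control": (10, 100),  "treatment": (108, 900)},
+    "desktop": {"control": (360, 900), "treatment": (42, 100)},
+}
+
+
+def rate(converters: int, exposed: int) -> float:
+    return converters / exposed
+
+
+def overall_rate(variant: str) -> float:
+    """Pool converters and exposures across segments for one variant."""
+    converters = sum(seg[variant][0] for seg in data.values())
+    exposed = sum(seg[variant][1] for seg in data.values())
+    return rate(converters, exposed)
+
+
+for name, seg in data.items():
+    print(name, rate(*seg["control"]), rate(*seg["treatment"]))
+print("overall", overall_rate("control"), overall_rate("treatment"))
+```
+
+Here mobile goes 10% → 12% and desktop 40% → 42%, while the pooled rate goes 37% → 15%.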
+
+---
+
+## What a "ship only to segment X" recommendation requires
+
+Don't recommend a segment-scoped ship unless **all** of these hold:
+
+1. The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it).
+2. The segment's per-variant sample clears whatever exposure floor the platform applies to the overall experiment, by a comfortable margin.
+3. The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment.
+4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product.
+5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply.
+
+Otherwise, the segment-only ship is a post-hoc story dressed up as a decision. Recommend confirming with a follow-up experiment scoped to that segment.
+
+---
+
+## When a segment loses but overall wins
+
+This is the everyday case of mixed effects.
+
+- If the losing segment is small and its absolute hit is acceptable, ship to all — but call out the loser in the rationale.
+- If the losing segment is large or has a guardrail regression, recommend iterate, not ship.
+- If the losing segment is a regulated / strategic cohort (paying tier, top customers, EU), default to iterate: guardrails must hold for that cohort, not just overall.
+
+---
+
+## What NOT to do
+
+- ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition.
+- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment query fallback — they're not corrected unless the platform did it.
+- ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal.
+- ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism.
+- ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence).
+
+---
+
+## Output shape
+
+1. **One-sentence segment-level summary** — homogeneous, heterogeneous, or Simpson's-suspicious.
+2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered).
+3. **What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's."
+4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating).
+
+---
+
+## Platform support status
+
+Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the rules above. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-of-interest-selection.md
new file mode 100644
index 0000000..4db49ac
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-of-interest-selection.md
@@ -0,0 +1,116 @@
+# Segment-of-Interest Selection
+
+Pick 3–5 segments **likely to reveal a real effect difference** rather than slicing every available dimension and ending up p-hacking.
+
+The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them.
+
+---
+
+## Why this matters: the fishing-expedition problem
+
+If you slice an experiment by every available property (10 platforms × 20 countries × 5 plan tiers × …), you will find "significant" segment-level effects by chance alone. The family-wise false positive rate explodes the same way it does for too many primary metrics — except there's usually no platform-level correction across segments. **Pre-committing to a small set of segments, ordered by hypothesis-driven probability, is the discipline that makes segment analysis credible.**
+
+Aim for 3–5 segments, max. If the user wants more, ask which ones are connected to the hypothesis and which are exploration. Mark the exploration set as "hypothesis-generating, not decisional."
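+
+To put a number on "by chance alone", here is a small sketch of the family-wise error rate. It assumes independent tests at a fixed per-test `alpha`, which real segment slices only approximate; the helper name is illustrative:
+
+```python
+def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
+    """Chance of at least one false positive across independent tests at level alpha."""
+    return 1 - (1 - alpha) ** num_tests
+
+
+# 35 = 10 platforms + 20 countries + 5 plan tiers, one test per segment value.
+for m in (1, 5, 35):
+    print(m, round(family_wise_error_rate(m), 2))
+```
+
+Under that assumption, five slices already carry about a 23% chance of a spurious "significant" segment, and 35 slices about 83%.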
+
+---
+
+## The decision tree for picking segments
+
+Walk through these in order. The first match is the most defensible pick.
+
+### 1. Segments the hypothesis explicitly names
+
+If the experiment's `hypothesis` (or `description`) text mentions "new users", "mobile", "Pro tier", "EU customers" — those segments are pre-committed by the experiment design. Always include them.
+
+Look at:
+
+- `experiment.hypothesis`
+- `experiment.description`
+- The setup-side conversation, if present
+
+These are not exploratory; they're the variables the team committed to test.
+
+### 2. Segments where the mechanism is expected to matter
+
+The hypothesis names _what_ the change is and (ideally) _why_ it should work. The "why" tells you which user attributes plausibly moderate the effect:
+
+| Hypothesis mechanism                              | Segments likely to moderate the effect             |
+| ------------------------------------------------- | -------------------------------------------------- |
+| "Reduces first-time friction in onboarding"       | New vs returning; signup source; locale            |
+| "Improves discoverability of feature X"           | Users who previously used X vs not; tenure         |
+| "Speeds up a slow flow"                           | Platform (mobile slower than web); connection type |
+| "Lowers payment friction"                         | Plan tier; payment-method type; geography          |
+| "Replaces a confusing UI element"                 | New vs returning (returning users habituated)      |
+| "Surfaces a feature only relevant to power users" | Engagement-tier cohorts; tenure                    |
+| "Localized copy / pricing change"                 | Country / language                                 |
+
+If you can't articulate _why_ a segment should respond differently, it's not a hypothesis-driven slice. Demote it.
+
+### 3. Segments where the **denominator** plausibly differs
+
+Some properties don't change _behavior_ but change _who gets exposed_. Slicing on these helps catch changed-denominator artifacts before they're called a win.
+
+- Triggered vs untriggered cohorts (if the treatment only fires on certain pages).
+- Platform / app version (the treatment may only ship on a subset of clients).
+- Device class (mobile vs desktop) when the change is platform-specific.
+
+A 1000% lift in `Checkout Screen Viewed` overall usually disappears once you condition on "users who reached the checkout funnel" — that disappearance is the finding.
+
+### 4. Segments where SRM or baseline shift is suspected
+
+If overall SRM is borderline (or failing in one variant only), per-segment SRM can localize the bucketing bug to a specific platform / country / cohort. Examples:
+
+- iOS vs Android (often the SDK bucketing layer differs).
+- Bot-suspicious countries (`bot_traffic` cause from health-check).
+- A specific app version range that shipped a flag-evaluation change.
+
+This is diagnostic segmentation, not interpretation segmentation. Use it when the **trustworthiness gate** has already flagged trouble.
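+
+If the platform doesn't report SRM per segment, a hand check can be sketched as a chi-square goodness-of-fit test on each segment's exposure counts. This assumes a two-variant experiment with a known intended split; the `0.001` flagging threshold is a common convention used here as an assumption (the platform's own threshold may differ), and the counts are invented:
+
+```python
+import math
+
+
+def srm_p_value(control_n: int, treatment_n: int, expected_control_share: float = 0.5) -> float:
+    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-variant split."""
+    total = control_n + treatment_n
+    if total <= 0 or not 0 < expected_control_share < 1:
+        raise ValueError("need positive counts and a share strictly between 0 and 1")
+    expected_c = total * expected_control_share
+    expected_t = total - expected_c
+    chi2 = ((control_n - expected_c) ** 2 / expected_c
+            + (treatment_n - expected_t) ** 2 / expected_t)
+    # With 1 degree of freedom, the chi-square tail probability is erfc(sqrt(chi2 / 2)).
+    return math.erfc(math.sqrt(chi2 / 2))
+
+
+# Invented per-segment exposure counts: (control, treatment).
+segments = {"iOS": (5000, 5050), "Android": (5000, 4500)}
+for name, (c, t) in segments.items():
+    p = srm_p_value(c, t)
+    print(name, f"p={p:.2g}", "SRM suspected" if p < 0.001 else "ok")
+```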
+
+### 5. Segments the platform de facto requires
+
+Some user dimensions are so foundational that any results report should mention them once:
+
+- **Platform** — web vs iOS vs Android.
+- **New vs returning** — defined as first session within the experiment window vs before.
+- **Geo region** — EU vs US vs APAC, when results meaningfully differ by regulatory or payment context.
+
+Don't include all three blindly — pick the one(s) most likely to vary given the change.
+
+---
+
+## Sanity checks before committing to a slice
+
+For each segment you want to break down on:
+
+1. **Does each segment value have enough exposed users per variant to clear the platform's overall sufficiency threshold?** Below that, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment.
+2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis.
+3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. "Pro tier" boundaries changed mid-test) invalidate the comparison.
+4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes — slicing on them is selection-bias, not stratification.
+
+---
+
+## How many slices to commit to
+
+| Situation                                                         | Number of slices                |
+| ----------------------------------------------------------------- | ------------------------------- |
+| Hypothesis-driven, well-powered, decisional                       | 3–5 segments, named upfront     |
+| Exploratory ("anything weird?"), flagged as hypothesis-generating | Up to ~10, with explicit caveat |
+| Diagnostic (chasing a failing SRM or strange overall result)      | Whatever helps localize the bug |
+
+If the user wants to "just look at everything", push back: pick the top 3–5 with reasoning, then offer a separate exploratory pass that won't be used for the ship decision.
+
+---
+
+## The pre-commit ritual
+
+Before running the breakdowns, tell the user something like:
+
+> _"Based on the hypothesis (`<one-line summary>`), I'd slice by `<segment A>` and `<segment B>` because `<why each matters>`. I'm intentionally not slicing `<X, Y, Z>` because they don't connect to the proposed mechanism — looking at every dimension makes false positives almost guaranteed. We can do an exploratory pass after, separately from the ship decision. Sound right?"_
+
+Pre-commitment is what separates "segmentation analysis" from "fishing."
+
+---
+
+## Then read the results
+
+Once the segment breakdown is in hand, switch to [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there.
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/session-replay-analysis.md
new file mode 100644
index 0000000..7282bb4
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/session-replay-analysis.md
@@ -0,0 +1,109 @@
+# Session-Replay Analysis Guidance
+
+Turn a quantitative experiment result into a behavior story using session replays.
+
+> **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss.
+
+---
+
+## When replays help, when they don't
+
+| Question                                                                                 | Replays help?                                                                         |
+| ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
+| "Why is conversion lower in treatment?"                                                  | Yes — behavior diff is observable.                                                    |
+| "Why is `Checkout Screen Viewed` 10× higher in treatment?" (changed-denominator suspect) | Yes — replays show whether users are _bouncing_ or _converting_ after they get there. |
+| "Why is `time_on_page` higher in treatment?"                                             | Yes — distinguishes engaged reading vs confused dwell.                                |
+| "Is the treatment shipping a regression on iOS only?"                                    | Sometimes — better answered first by segment breakdown.                               |
+| "Why is SRM failing?"                                                                    | No — replays don't show bucketing. Go to health checks.                               |
+| "What's the lift?"                                                                       | No — replays are qualitative; they explain _why_, not what.                           |
+| "Why hasn't this hit statsig yet?"                                                       | No — that's a sample/power question, not a behavior question.                         |
+
+A useful heuristic: replays answer _behavioral_ questions. If the question isn't behavioral, replays will burn time without adding signal.
+
+---
+
+## Cohort selection: which replays to compare
+
+You're looking for **paired contrast**, not a random sample. Pick the cohort that maximizes signal for the specific question.
+
+| Question                                                             | Cohort A (replays to pull)                                 | Cohort B (replays to pull)                                  |
+| -------------------------------------------------------------------- | ---------------------------------------------------------- | ----------------------------------------------------------- |
+| Why is primary metric down in treatment?                             | Treatment users who **failed** the primary action          | Control users who **succeeded** at the primary action       |
+| Why is a guardrail regression appearing?                             | Treatment users who **triggered** the guardrail negatively | Control users who did NOT trigger it                        |
+| Why is `Screen Viewed` lift so large in treatment? (denom shift)     | Treatment users who reached the screen                     | Same users, looking at whether they completed the next step |
+| Why is engagement higher / lower in a specific segment?              | Treatment users in that segment                            | Control users in the same segment                           |
+| What does the new UI look like in practice?                          | Any treatment users who saw the change                     | Any control users to confirm the baseline UI                |
+
+**Aim for ~5 replays per cohort.** Fewer and you're anecdote-shopping; many more and you'll just confirm what the first 5 already showed. If the first 5 are inconclusive or contradictory, pull 5 more before changing tactics.
+
+Filter by recency — replays from the most recent days of the experiment best reflect steady-state behavior (avoid novelty / primacy noise).
+
+---
+
+## What to actually watch for
+
+Go in with a hypothesis from the quantitative result. Don't watch replays blank-eyed; you'll see "users using the app" and learn nothing.
+
+### Friction / failure patterns
+
+- **Hesitation** — long pause before clicking a key element (often signals confusion).
+- **Misclicks** — clicking non-interactive elements, or rage-clicking a button that didn't work.
+- **Form abandonment** — typing into a field, then leaving without submitting.
+- **Back-button bounce** — landing on the page, then immediately backing out.
+- **Scroll-and-leave** — scrolling without engaging, then exiting.
+
+If treatment has more of these than control, you have a behavior explanation for a primary loss or guardrail regression.
+
+### Layout / discoverability issues
+
+- **CTA below the fold** — users never scrolling to where the new button is.
+- **Element overlap on mobile** — the treatment looks fine in desktop testing but breaks on small screens.
+- **Hidden state** — a tooltip / modal that fires once and is then gone, so the user never sees the key affordance.
+
+These usually explain segment heterogeneity (loss concentrated in mobile, or in a specific viewport size).
+
+### Changed-denominator behavior
+
+If you're investigating a Twyman's-Law-sized lift, look for:
+
+- **Users landing on the new screen and immediately leaving** — explains the inflated `Viewed` event without explaining real conversion.
+- **Users completing the rest of the funnel at a much lower rate per-arrival** — explains why the headline metric grew but downstream metrics didn't follow.
+
+If treatment users _arrive_ at a screen more often but _complete_ at a lower per-arrival rate, the "lift" is a denominator artifact and the per-converter behavior is the real story.
+
+### Variant-specific UI issues
+
+- **Treatment showed the wrong copy / wrong asset** — surprisingly common; treatment shipped, but to a subset of routes only.
+- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to [health-check-interpretation.md](health-check-interpretation.md).
+- **Treatment fired twice / persisted state across sessions** — implementation regression.
+
+---
+
+## How to frame the findings
+
+Replay analysis is qualitative. Be honest about that.
+
+- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in the experiment's results."_
+- ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict.
+
+Tie observations back to specific quantitative results from the experiment-details response. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior.
+
+---
+
+## What NOT to do
+
+- ❌ Use replays to override a clear quantitative verdict. If primaries say "ship" and replays look ugly, the ugliness might be edge cases — confirm with segment analysis first.
+- ❌ Cherry-pick a single dramatic replay. n=1 is anecdote.
+- ❌ Replace segment analysis with replays. Replays explain _behavior_; segments explain _who_. Different questions.
+- ❌ Pull replays from broad cohorts ("all treatment users") — the contrast pair is what reveals signal.
+- ❌ Spend more time on replays than on the headline interpretation. The decision tree comes first; replays are the explanation step after it.
+
+---
+
+## Output shape
+
+1. **The quantitative result the replays are explaining** — link back to the specific metric and verdict.
+2. **Cohorts watched** — what filters were applied to A and B, how many replays in each.
+3. **Patterns observed**, with counts (e.g. "4 of 5 treatment replays showed X; 0 of 5 control replays did").
+4. **The explanation hypothesis** — careful to frame as hypothesis ("consistent with"), not as proof.
+5. **Recommended next action** — usually one of: ship anyway (regression edge-case), iterate (fix the friction), kill (treatment is materially worse), or run a follow-up A/B with the fix.
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
new file mode 100644
index 0000000..37ec069
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
@@ -0,0 +1,115 @@
+# Why Hasn't This Reached Statistical Significance Yet?
+
+Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts.
+
+The actual stop / extend math (sample size, power, MDE) is owned by the `design-experiment` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one.
+
+---
+
+## First, rule out a broken result
+
+Inconclusive can mean two very different things:
+
+1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about.
+2. **The result isn't trustworthy at all** — SRM failing, broken data, peeked frequentist, etc. — and "inconclusive" is the wrong frame entirely.
+
+Before answering "why no statsig?", run the **trustworthiness gate**. If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power.
+
+Also check:
+
+- The primary's lift is missing or null → no measurement, not "no effect."
+- The primary is listed on the experiment but has no computed result (live or cached) → "no measurement," not "no effect."
+- The live results carry an error block → results are stale or partial; resolve the backend issue before drawing power conclusions.
+
+---
+
+## The five real reasons an experiment hasn't hit statsig
+
+Walk through these in order. The first one that explains the picture is usually right.
+
+### 1. Not enough sample yet (not enough exposures)
+
+**What to check**: per-variant exposure counts against the configured end target (sample size or duration, whichever the experiment was configured with), and which testing model the experiment is using.
+
+- **Sequential** + target not reached → genuinely too early. Recommend **WAIT**.
+- **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe.
+- Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5.
+
+If exposures are falling short of plan because traffic dropped: surface that. Querying the exposure event with a date breakdown shows whether something changed mid-experiment.
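
The branching above can be sketched as a small helper. This is a minimal illustration, not the platform's logic: the function name, the string labels, and the assumption that the end target is expressed as a per-variant exposure count are all hypothetical.

```python
def sample_size_verdict(testing_model: str,
                        exposures_by_variant: dict[str, int],
                        target_per_variant: int) -> str:
    """Map the reason-1 checks to a recommendation (illustrative only)."""
    # The least-exposed variant decides whether the target has been reached.
    smallest = min(exposures_by_variant.values())
    if smallest >= target_per_variant:
        return "TARGET_REACHED: not a sample-size problem, move to reasons 2-5"
    if testing_model == "sequential":
        return "WAIT: genuinely too early"
    if testing_model == "frequentist":
        return "WAIT: run to the configured end, do not peek-and-call"
    raise ValueError(f"unknown testing model: {testing_model!r}")


# Hypothetical numbers: 4.2k and 4.1k exposures against a 10k target.
verdict = sample_size_verdict(
    "frequentist", {"control": 4200, "treatment": 4100}, 10_000
)
```

For a duration-based target the same shape applies with elapsed days in place of exposure counts.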
+
+### 2. Observed effect is smaller than the MDE
+
+**What to check**: the lift on the primary metric, plus the MDE the user planned for (typically captured in the experiment's hypothesis/description, or recovered via the `design-experiment` skill's power math).
+
+- Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1.
+- Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options:
+  - **Accept the null** — at this size, the change isn't moving the metric. Document and move on.
+  - **Resize and rerun** — if a smaller effect would still be ship-worthy, re-run with a larger sample (lower MDE).
+- Observed lift **much larger** than planned MDE but still not significant → unusual; likely high variance (see reason 3) or insufficient exposures (reason 1).
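
To give the user a feel for what "resize and rerun" costs, a rough scaling rule helps: with variance, significance level, and power held fixed, required sample grows with the inverse square of the MDE. The sketch below only illustrates that proportionality. The function name is hypothetical, and the actual sample-size formula stays with the `design-experiment` skill.

```python
def sample_multiplier(planned_mde: float, new_mde: float) -> float:
    """Approximate factor by which the sample must grow to detect new_mde.

    Assumes variance, alpha, and power are unchanged, so n scales as 1 / MDE^2.
    """
    if planned_mde <= 0 or new_mde <= 0:
        raise ValueError("MDE values must be positive")
    return (planned_mde / new_mde) ** 2


# Halving the MDE from 5% to 2.5% needs roughly four times the sample.
factor = sample_multiplier(0.05, 0.025)
```

If the multiplier is larger than the traffic the user can realistically commit, accepting the null is usually the more honest outcome.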
+
+### 3. Variance is too high (metric is too noisy)
+
+**What to check**: the metric's distribution type, plus whether CUPED and Winsorization are enabled.
+
+- **Gaussian** metric (revenue, time-on-page) with no Winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization on the next run.
+- **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume.
+- **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the room for a detectable change. For a fixed relative lift, base rates near 0% need much more sample than mid-range rates; near 100% there is little headroom left for the metric to move.
+- **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%.
+- **CUPED enabled on a new-user-only cohort** → CUPED has no effect (no pre-exposure data exists). Not a misconfiguration to "fix," but variance reduction simply didn't happen.
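
The whale effect is easy to show with toy numbers. The sketch below caps values at a fixed upper bound, which is only one simple form of Winsorization; how the platform picks its cap is not stated here, so the `cap` value and the revenue figures are purely illustrative.

```python
import statistics


def winsorize_upper(values: list[float], cap: float) -> list[float]:
    """Clamp every value above `cap` down to `cap`; lower values are untouched."""
    return [min(v, cap) for v in values]


# Nine ordinary users and one whale (made-up revenue figures).
revenue = [10, 12, 9, 11, 13, 10, 12, 11, 9, 5000]
capped = winsorize_upper(revenue, cap=50)

raw_variance = statistics.variance(revenue)
capped_variance = statistics.variance(capped)
```

Comparing `raw_variance` with `capped_variance` shows how a single extreme user dominates the spread, which is what widens the confidence interval.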
+
+### 4. Traffic split is starving the variant
+
+**What to check**: the configured traffic split against the actual per-variant exposure counts.
+
+- Even split (50/50) → a balanced split is optimal for power, so the split is usually not the issue.
+- Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later.
+- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs to clear the platform's per-variant exposure floor in its own right. Adding arms costs power per-comparison.
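
The cost of a skewed split can be quantified with a rough rule. For a two-arm difference in means with equal variance in both arms, the variance of the estimate is proportional to `1 / (f * (1 - f))`, where `f` is the treatment share of traffic. The sketch below expresses that relative to a 50/50 split; the function name is hypothetical and the equal-variance assumption is a simplification.

```python
def split_sample_penalty(treatment_fraction: float) -> float:
    """Total-sample multiplier needed to match the power of a 50/50 split.

    Assumes two arms with equal per-user variance.
    """
    if not 0 < treatment_fraction < 1:
        raise ValueError("treatment_fraction must be strictly between 0 and 1")
    balanced = 0.5 * 0.5
    return balanced / (treatment_fraction * (1 - treatment_fraction))


even = split_sample_penalty(0.5)    # 1.0, no penalty
skewed = split_sample_penalty(0.1)  # about 2.8x the total traffic for a 90/10 split
```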
+
+Never change traffic allocation mid-Frequentist test — it invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment.
+
+### 5. Exposure config is filtering more users than the user expects
+
+**What to check**: exposure event volume, any audience filters on the backing feature flag, and whether QA traffic is being excluded.
+
+- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query the exposure event to confirm how many users actually got exposed.
+- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event.
+- QA traffic isn't being excluded and you suspect internal traffic is dominating one variant → enable the QA exclusion on the next run (the results will be cleaner, but the sample will be smaller).
+
+**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md).
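
A minimal sketch of the dilution arithmetic, assuming users who never reached the treatment show zero effect and that triggered and untriggered users share the same baseline. The function name and the numbers are illustrative, not taken from the platform.

```python
def diluted_lift(triggered_lift: float, trigger_rate: float) -> float:
    """Population-level lift when only `trigger_rate` of exposed users saw the change."""
    if not 0 <= trigger_rate <= 1:
        raise ValueError("trigger_rate must be between 0 and 1")
    return triggered_lift * trigger_rate


# A 10% lift among the 20% of users who reached the changed screen
# shows up as roughly a 2% lift across all exposed users.
overall = diluted_lift(0.10, 0.20)
```

Because the diluted lift is smaller, an experiment sized for the triggered effect will look underpowered at the population level.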
+
+---
+
+## Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL?
+
+Once you know which reason fits, the recommendation almost picks itself.
+
+| Reason                                 | Recommendation                                                                                               |
+| -------------------------------------- | ------------------------------------------------------------------------------------------------------------ |
+| Not enough sample yet, still ACTIVE    | **WAIT.** Show projected end date based on observed traffic.                                                 |
+| Not enough sample yet, concluded early | **EXTEND** (Frequentist: relaunch with longer planned duration; Sequential: resume if possible).             |
+| Effect << MDE                          | **ACCEPT NULL** if the planned MDE is the smallest ship-worthy effect; otherwise **BOOST POWER** and re-run. |
+| Variance too high                      | **BOOST POWER**: enable CUPED, enable Winsorization, switch to a less noisy metric proxy.                    |
+| Variant starved by traffic split       | **EXTEND** (if remaining time is enough) or restart with rebalanced split.                                   |
+| Exposure config is filtering           | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample.               |
+| Experiment finished, well-powered      | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters.       |
+
+When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `design-experiment` skill for the power math.
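
For the WAIT row, the projected end date can be estimated from observed traffic. The sketch below assumes a sample-size target and a steady daily exposure rate; all names and figures are hypothetical, and the target itself should come from the experiment's config rather than from this calculation.

```python
import math
from datetime import date, timedelta


def projected_end_date(today: date,
                       current_exposures: int,
                       target_exposures: int,
                       daily_exposures: float) -> date:
    """Estimate when the exposure target is reached at the observed daily rate."""
    if daily_exposures <= 0:
        raise ValueError("daily_exposures must be positive")
    remaining = max(target_exposures - current_exposures, 0)
    # Round up: a partial day of traffic still takes a calendar day.
    days_needed = math.ceil(remaining / daily_exposures)
    return today + timedelta(days=days_needed)


# 4,200 of 10,000 exposures so far, about 400 new exposures per day.
end = projected_end_date(date(2024, 1, 1), 4200, 10_000, 400)
```

If traffic dropped mid-experiment, use the recent daily rate rather than the lifetime average, since the lifetime figure would make the projection too optimistic.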
+
+---
+
+## What NOT to suggest
+
+- ❌ **Stop early on a favorable peek** in a Frequentist test — that's exactly the false-positive inflation problem.
+- ❌ **Switch testing model mid-experiment** — restart, don't morph.
+- ❌ **Add more primary metrics** to "fish" for a win — multiplies the family-wise FPR. If a single primary is inconclusive, more primaries make the picture worse, not better.
+- ❌ **Re-run identical hypothesis on the same audience right after concluding "no effect"** — without a power change, you'll get the same answer.
+- ❌ **Claim "no effect"** from an underpowered inconclusive result — the right framing is "the experiment wasn't sized to detect the effect we observed."
+
+---
+
+## Output shape
+
+1. **The reason** (one of the five above), in one sentence.
+2. **The evidence** — concrete numbers from the experiment (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%").
+3. **Recommendation** from the table above, with the specific experiment update or follow-up action.
+4. **What NOT to do**, briefly, naming the wrong-way temptation specific to this experiment.
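
One way to keep the four parts together is a small record type. This is only a sketch of the shape; the class and field names are hypothetical and the example values reuse the illustrative numbers from the list above.

```python
from dataclasses import dataclass


@dataclass
class InconclusiveDiagnosis:
    reason: str          # one of the five reasons, in one sentence
    evidence: str        # concrete numbers from the experiment
    recommendation: str  # WAIT / EXTEND / BOOST POWER / NARROW / ACCEPT NULL plus the action
    avoid: str           # the wrong-way temptation specific to this experiment

    def render(self) -> str:
        """Format the diagnosis in the order the output shape prescribes."""
        return "\n".join([
            f"1. Reason: {self.reason}",
            f"2. Evidence: {self.evidence}",
            f"3. Recommendation: {self.recommendation}",
            f"4. Avoid: {self.avoid}",
        ])


diagnosis = InconclusiveDiagnosis(
    reason="Not enough sample yet.",
    evidence="Exposures only at 4.2k of the 10k target.",
    recommendation="WAIT until the configured end target is reached.",
    avoid="Calling the result early on a favorable peek.",
)
report = diagnosis.render()
```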