mixpanel · elliotrfeinberg · Jun 4, 2026 · Jun 5, 2026 · Jun 5, 2026 · Jun 5, 2026
diff --git a/README.md b/README.md
@@ -4,17 +4,18 @@ Plugins that give AI agents Mixpanel expertise. Built on the [Agent Skills](http
 
 ## Skills
 
-| Skill | Description |
-|---|---|
-| [`create-dashboard`](plugins/mixpanel-mcp/skills/create-dashboard/) | Creates a well-designed Mixpanel dashboard with validated data, text cards, and narrative layout. |
-| [`deep-research`](plugins/mixpanel-mcp/skills/deep-research/) | Conducts a structured metric investigation in Mixpanel. Use when a user asks *why* a metric changed, what's driving a trend, or requests a deep dive or root cause analysis. |
-| [`manage-lexicon`](plugins/mixpanel-mcp/skills/manage-lexicon/) | Audits, scores, enriches, and cleans up Lexicon metadata (events and properties) for a Mixpanel project. Supports scoring health, bulk-filling descriptions/tags, resetting metadata, triaging data quality issues, and managing tags. |
-| [`tracking-implementation`](plugins/mixpanel-mcp/skills/tracking-implementation/) | Guides an agent through Mixpanel analytics implementation. Supports Quick Start, Full Implementation, Add Tracking, and Audit modes. |
+| Skill                                                                             | Description                                                                                                                                                                                                                                                                                                                                                          |
+| --------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| [`create-dashboard`](plugins/mixpanel-mcp/skills/create-dashboard/)               | Creates a well-designed Mixpanel dashboard with validated data, text cards, and narrative layout.                                                                                                                                                                                                                                                                    |
+| [`deep-research`](plugins/mixpanel-mcp/skills/deep-research/)                     | Conducts a structured metric investigation in Mixpanel. Use when a user asks _why_ a metric changed, what's driving a trend, or requests a deep dive or root cause analysis.                                                                                                                                                                                         |
+| [`interpret-experiment`](plugins/mixpanel-mcp/skills/interpret-experiment/)       | Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, make a ship/iterate/kill/wait call, asks why statsig hasn't been reached, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the verdicts that `Get-Experiment` returns — never recomputes thresholds. |
+| [`manage-lexicon`](plugins/mixpanel-mcp/skills/manage-lexicon/)                   | Audits, scores, enriches, and cleans up Lexicon metadata (events and properties) for a Mixpanel project. Supports scoring health, bulk-filling descriptions/tags, resetting metadata, triaging data quality issues, and managing tags.                                                                                                                               |
+| [`tracking-implementation`](plugins/mixpanel-mcp/skills/tracking-implementation/) | Guides an agent through Mixpanel analytics implementation. Supports Quick Start, Full Implementation, Add Tracking, and Audit modes.                                                                                                                                                                                                                                 |
 
 ### Internal skills
 
-| Skill | Description |
-|---|---|
+| Skill                                          | Description                                                                                                                                                                                |
+| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
 | [`review-skill`](.claude/skills/review-skill/) | Reviews a skill against a weighted quality rubric (8 dimensions, 27 checks) and produces a score with actionable issues. Run `/review-skill <skill-name>` before requesting a code review. |
 
 ## Getting Started
@@ -30,21 +31,23 @@ claude plugin marketplace add mixpanel/ai-plugins
 2. Install the plugin for your region:
 
 **US**
+
 ```bash
 claude plugin install mixpanel-mcp
 ```
 
 **EU**
+
 ```bash
 claude plugin install mixpanel-mcp-eu
 ```
 
 **India**
+
 ```bash
 claude plugin install mixpanel-mcp-in
 ```
 
-
 ### Cursor
 
 Install the plugin from the Cursor marketplace, or have a team admin import this GitHub repository as a team marketplace (Dashboard → Settings → Plugins → Import).

diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
@@ -0,0 +1,129 @@
+---
+name: interpret-experiment
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `design-experiment` skill.
+license: Apache-2.0
+---
+
+# Interpret Experiment
+
+You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a verdict field is missing, say so — do not synthesize one from raw values.
+
+---
+
+# Glossary
+
+Concepts the rest of this skill uses without redefining.
+
+- **Variant.** One arm of the experiment. The variant treated as the baseline is the **control**; the others are **treatments**. The platform marks which key is the control.
+- **Primary / Guardrail / Secondary metric.**
+  - **Primary** — drives the ship decision. The platform applies multiple-testing correction across primaries when configured.
+  - **Guardrail** — a metric that must not regress; a guardrail loss vetoes a ship even when primaries win.
+  - **Secondary** — exploratory only, never decisional, no correction applied.
+- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict.
+- **Polarity.** Whether a movement is _good for the business_. Combines sign of lift with the metric's `direction` ("up" = bigger is better; "down" = smaller is better). See the **Polarity recipe** in Components.
+- **Significance.** The platform's per-row classification: `YES_POSITIVE`, `YES_NEGATIVE`, or `NO`. Read from the response — do not recompute.
+- **SRM (Sample Ratio Mismatch).** Variants received traffic in proportions that disagree with the configured split. **Kohavi's #1 trustworthiness check** — when SRM fails, downstream lift, p-values, and CIs cannot be trusted.
+- **Retro A/A (pre-experiment bias).** Re-runs the comparison on the pre-exposure period. A failure means cohorts already differed before treatment started.
+- **Twyman's Law.** "Any unusually clean or unusually large result is more likely a bug than a discovery." Apply on lifts > ~30% — usually a changed-denominator artifact.
+- **CUPED.** Variance reduction using pre-exposure baseline. Cuts required sample 30–70% when it applies. Inert on new-user-only cohorts.
+- **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95.
+- **MDE (Minimum Detectable Effect).** The smallest lift the experiment was sized to detect. Set during experiment setup.
+- **Trustworthiness gate.** The pre-flight check that runs before any results interpretation: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference.
+
+---
+
+# Components
+
+The pieces every interpretation uses. Defined here once so they don't drift across the steps and references.
+
+## Polarity recipe (load-bearing — apply on every metric row)
+
+The platform's summary buckets (`positive` / `negative` / `no`) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion.
+
+Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"):
+
+- `lift is None` or `lift == 0` → **neutral** (no measurement / no effect respectively).
+- `direction == "up"` → **positive** if `lift > 0`, else **negative**.
+- `direction == "down"` → **positive** if `lift < 0`, else **negative**.
+
+A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control.
+
+The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**.
+
+## Data-source fallback
+
+Experiment-details has two parallel data paths — live (preferred) and cached. Always prefer live; if live computation failed, fall back to cache with a staleness caveat; if **both** are empty, say "no result was computed" and recommend a re-sync. **Never** silently treat missing data as "no effect."
+
+## Verdict table
+
+| Situation                                                              | Recommendation                                                                                                                                                                       |
+| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Conclude the experiment via its decide lifecycle action, naming the winning variant and a rationale message. **Confirm with the user first — concluding is irreversible.** |
+| Trust ✓, primary polarity positive, guardrail polarity negative        | **ITERATE.** Investigate the regression; do not auto-ship.                                                                                                                           |
+| Trust ✓, primary polarity neutral after target sample reached          | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md).                                                          |
+| Trust ✓, target sample/duration not yet reached                        | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)).                                                                 |
+| Trust ✗                                                                | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md).                         |
+
+For multi-variant tests, special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), and the exact decide-call shape, see [references/lifecycle-handoff.md](references/lifecycle-handoff.md).
+
+---
+
+# Steps
+
+Top-down: what to do, in order.
+
+## 1. Fetch the experiment
+
+If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs. Accept the experiment by name or by ID; try ID match first, then case-insensitive name match.
+
+Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments.
+
+Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret.
+
+## 2. Run the trustworthiness gate (the Decision Tree)
+
+Run steps 2a–2e in order. **Stop at the first failure** — do not proceed if a step flags a problem. The platform attaches verdict fields for each check; consume those verdicts rather than recomputing.
+
+### 2a. Trustworthiness
+
+SRM ok? Retro A/A clean? Exposures sufficient? Minimum duration met (~3 days)? No misconfiguration? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md). The Misconfigurations section in that reference covers the warning-level signals (multiple-testing off, extreme winsorization, CUPED on new-users-only, etc.).
+
+### 2b. Statistical significance
+
+Apply the **polarity recipe** from Components to each non-control variant × primary metric. If nothing is significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md). For translating a single metric's lift / CI / p-value into a phrase, see [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
+
+### 2c. Guardrail check
+
+Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a row in `summary.positive` for a `direction: "down"` guardrail is still a regression.
+
+### 2d. Practical significance
+
+Convert lift into absolute terms — multiply by the control baseline. Statistically significant ≠ ships. The per-metric reference covers the baseline-fetch fallback when `value` or `sampleSize` is missing, and the **Twyman's Law** check for any lift > ~30%.
+
+### 2e. Verdict
+
+Look up the situation in the **Verdict table** in Components. If the recommendation is SHIP or KILL, surface the proposed decide-action parameters and **wait for explicit user confirmation** before executing — concluding an experiment is irreversible.
+
+## 3. Going deeper (open references on demand)
+
+| User asks about…                                                                    | Open                                                                                             |
+| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
+| SRM failing, Retro A/A failing, exposures insufficient, or any trustworthiness fail | [references/health-check-interpretation.md](references/health-check-interpretation.md)           |
+| "Translate this lift / CI / p-value into English"                                   | [references/per-metric-interpretation.md](references/per-metric-interpretation.md)               |
+| "Why hasn't this hit statsig yet? Should we wait or stop?"                          | [references/why-no-statsig.md](references/why-no-statsig.md)                                     |
+| "Which segments should I break this down on?"                                       | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md)       |
+| "What does this segment-by-segment result mean?"                                    | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) |
+| "Can session replays help explain this result?"                                     | [references/session-replay-analysis.md](references/session-replay-analysis.md)                   |
+| "How do I actually conclude this experiment? Multi-variant ship?"                   | [references/lifecycle-handoff.md](references/lifecycle-handoff.md)                               |
+
+## 4. Output
+
+Default to this shape unless the user asks for something else:
+
+1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`.
+2. **Why**, walking through the trustworthiness-gate steps that mattered (skip steps that were clearly fine).
+3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, each polarity-corrected. Include absolute-impact translation for any win.
+4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, stale-cache caveat, etc.
+5. **Suggested next action** — for SHIP / KILL, the proposed decide-action parameters **gated on user confirmation**; for ITERATE / WAIT, the investigation to run next.
+
+If experiment details are unavailable or return errors, say so — do not invent a verdict.