From a6ec43f10d27e2db858a12eec7d8f5b70dad19ab Mon Sep 17 00:00:00 2001
From: "mixpanel-claude-code-agent[bot]"
 <237517943+mixpanel-claude-code-agent[bot]@users.noreply.github.com>
Date: Thu, 4 Jun 2026 22:32:41 +0000
Subject: [PATCH 01/11] Add experiment-results skill

Authors a single home for all results- and health-phase expertise: the
agent loads this skill and reads the verdicts that Get-Experiment
returns, rather than recomputing thresholds. Replaces the
interpretation portion of several superseded per-capability tools.

Skill is structured for progressive disclosure: the spine (5-step
decision tree, polarity recipe, ship/iterate/kill/wait verdict) lives
in SKILL.md, and deep-dive references cover health-check causes,
per-metric phrasing, why-no-statsig, segment-of-interest selection,
segment-breakdown reading, session-replay analysis, and the
Get-Experiment field map. Eval fixtures seeded from PRD customer
quotes (Pelando "+2 others", Confetti "8 metrics for new visitors",
Polarsteps "no documented workaround").

Synced to mixpanel-mcp-eu and mixpanel-mcp-in via make sync-skills.

Linear: https://linear.app/mixpanel/issue/MULTI-582

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
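Reviewer note (not part of the commit message): the polarity recipe
mentioned above is spelled out in SKILL.md as three rules keyed on the
sign of the lift and `metric.direction`. Below is a minimal Python
sketch of those rules, offered only as an illustration. The function
name `polarity` and its signature are assumptions made for this note;
they are not part of the Mixpanel MCP API or of the skill itself.

```python
from typing import Optional


def polarity(lift: Optional[float], direction: Optional[str] = "up") -> str:
    """Translate sign-of-lift into a business reading (hypothetical helper).

    Mirrors the three rules in SKILL.md:
      - lift is None or 0      -> "neutral"
      - direction == "up"      -> "positive" if lift > 0 else "negative"
      - direction == "down"    -> "positive" if lift < 0 else "negative"
    """
    if lift is None or lift == 0:
        # SKILL.md separately asks the agent to surface a None lift as a
        # failed calculation; this sketch only covers the recipe itself.
        return "neutral"

    # SKILL.md states that direction defaults to "up" when unset.
    effective = direction or "up"

    if effective == "up":
        return "positive" if lift > 0 else "negative"
    if effective == "down":
        return "positive" if lift < 0 else "negative"
    raise ValueError(f"unexpected direction: {effective!r}")


if __name__ == "__main__":
    # An "errors" guardrail with direction "down" and a +5% lift is a
    # regression, even though the sign of the lift is positive.
    print(polarity(0.05, "down"))
    print(polarity(-0.05, "down"))
```

Under these assumptions, a metric that lands in `summary.positive` but
has `direction: "down"` reads as "negative", which is the case the
skill warns about.
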
 README.md                                     |  15 +-
 .../skills/experiment-results/SKILL.md        | 236 ++++++++++++++++++
 .../skills/experiment-results/evals/README.md |  34 +++
 .../evals/confetti-8-metrics.yaml             |  48 ++++
 .../evals/pelando-plus-2-others.yaml          |  79 ++++++
 .../evals/polarsteps-no-workaround.yaml       |  61 +++++
 .../references/get-experiment-fields.md       | 161 ++++++++++++
 .../references/health-check-interpretation.md | 158 ++++++++++++
 .../references/per-metric-interpretation.md   | 188 ++++++++++++++
 .../segment-breakdown-interpretation.md       |  95 +++++++
 .../segment-of-interest-selection.md          | 116 +++++++++
 .../references/session-replay-analysis.md     | 109 ++++++++
 .../references/why-no-statsig.md              | 115 +++++++++
 .../skills/experiment-results/SKILL.md        | 236 ++++++++++++++++++
 .../skills/experiment-results/evals/README.md |  34 +++
 .../evals/confetti-8-metrics.yaml             |  48 ++++
 .../evals/pelando-plus-2-others.yaml          |  79 ++++++
 .../evals/polarsteps-no-workaround.yaml       |  61 +++++
 .../references/get-experiment-fields.md       | 161 ++++++++++++
 .../references/health-check-interpretation.md | 158 ++++++++++++
 .../references/per-metric-interpretation.md   | 188 ++++++++++++++
 .../segment-breakdown-interpretation.md       |  95 +++++++
 .../segment-of-interest-selection.md          | 116 +++++++++
 .../references/session-replay-analysis.md     | 109 ++++++++
 .../references/why-no-statsig.md              | 115 +++++++++
 .../skills/experiment-results/SKILL.md        | 236 ++++++++++++++++++
 .../skills/experiment-results/evals/README.md |  34 +++
 .../evals/confetti-8-metrics.yaml             |  48 ++++
 .../evals/pelando-plus-2-others.yaml          |  79 ++++++
 .../evals/polarsteps-no-workaround.yaml       |  61 +++++
 .../references/get-experiment-fields.md       | 161 ++++++++++++
 .../references/health-check-interpretation.md | 158 ++++++++++++
 .../references/per-metric-interpretation.md   | 188 ++++++++++++++
 .../segment-breakdown-interpretation.md       |  95 +++++++
 .../segment-of-interest-selection.md          | 116 +++++++++
 .../references/session-replay-analysis.md     | 109 ++++++++
 .../references/why-no-statsig.md              | 115 +++++++++
 37 files changed, 4209 insertions(+), 6 deletions(-)
 create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md
 create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/evals/README.md
 create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/evals/confetti-8-metrics.yaml
 create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/evals/pelando-plus-2-others.yaml
 create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/evals/polarsteps-no-workaround.yaml
 create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/references/get-experiment-fields.md
 create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md
 create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md
 create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md
 create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-of-interest-selection.md
 create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md
 create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md
 create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md
 create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/evals/README.md
 create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/evals/confetti-8-metrics.yaml
 create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/evals/pelando-plus-2-others.yaml
 create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/evals/polarsteps-no-workaround.yaml
 create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/references/get-experiment-fields.md
 create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md
 create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md
 create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md
 create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-of-interest-selection.md
 create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md
 create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md
 create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/SKILL.md
 create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/evals/README.md
 create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/evals/confetti-8-metrics.yaml
 create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/evals/pelando-plus-2-others.yaml
 create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/evals/polarsteps-no-workaround.yaml
 create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/references/get-experiment-fields.md
 create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md
 create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md
 create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md
 create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/references/segment-of-interest-selection.md
 create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md
 create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md

diff --git a/README.md b/README.md
index c79cb95..3f843f2 100644
--- a/README.md
+++ b/README.md
@@ -4,11 +4,12 @@ Plugins that give AI agents Mixpanel expertise. Built on the [Agent Skills](http
 
 ## Skills
 
-| Skill | Description |
-|---|---|
-| [`tracking-implementation`](plugins/mixpanel-mcp/skills/tracking-implementation/) | Guides an agent through Mixpanel analytics implementation. Supports Quick Start, Full Implementation, Add Tracking, and Audit modes. |
-| [`create-dashboard`](plugins/mixpanel-mcp/skills/create-dashboard/) | Creates a well-designed Mixpanel dashboard with validated data, text cards, and narrative layout. |
-| [`deep-research`](plugins/mixpanel-mcp/skills/deep-research/) | Conducts a structured metric investigation in Mixpanel. Use when a user asks *why* a metric changed, what's driving a trend, or requests a deep dive or root cause analysis. |
+| Skill                                                                             | Description                                                                                                                                                                                                                                                                                                                                                          |
+| --------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| [`tracking-implementation`](plugins/mixpanel-mcp/skills/tracking-implementation/) | Guides an agent through Mixpanel analytics implementation. Supports Quick Start, Full Implementation, Add Tracking, and Audit modes.                                                                                                                                                                                                                                 |
+| [`create-dashboard`](plugins/mixpanel-mcp/skills/create-dashboard/)               | Creates a well-designed Mixpanel dashboard with validated data, text cards, and narrative layout.                                                                                                                                                                                                                                                                    |
+| [`deep-research`](plugins/mixpanel-mcp/skills/deep-research/)                     | Conducts a structured metric investigation in Mixpanel. Use when a user asks _why_ a metric changed, what's driving a trend, or requests a deep dive or root cause analysis.                                                                                                                                                                                         |
+| [`experiment-results`](plugins/mixpanel-mcp/skills/experiment-results/)           | Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, make a ship/iterate/kill/wait call, asks why statsig hasn't been reached, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the verdicts that `Get-Experiment` returns — never recomputes thresholds. |
 
 ## Getting Started
 
@@ -23,21 +24,23 @@ claude plugin marketplace add mixpanel/ai-plugins
 2. Install the plugin for your region:
 
 **US**
+
 ```bash
 claude plugin install mixpanel-mcp
 ```
 
 **EU**
+
 ```bash
 claude plugin install mixpanel-mcp-eu
 ```
 
 **India**
+
 ```bash
 claude plugin install mixpanel-mcp-in
 ```
 
-
 ### Cursor
 
 Install the plugin from the Cursor marketplace, or have a team admin import this GitHub repository as a team marketplace (Dashboard → Settings → Plugins → Import).
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md
new file mode 100644
index 0000000..4e344d3
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md
@@ -0,0 +1,236 @@
+---
+name: experiment-results
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts that `Get-Experiment` returns — never recomputes thresholds.
+license: Apache-2.0
+---
+
+# Experiment Results Interpretation
+
+You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. **Read the Decision Tree first** and use it as the spine of every interpretation. Drop into the deeper references only when the situation calls for it.
+
+## Requirements
+
+- Access to Mixpanel via the MCP server (specifically the `Get-Experiment` tool — and, for ship/kill decisions, `Update-Experiment`).
+- This skill reads the verdicts that `Get-Experiment` already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values.
+
+## When to use this skill
+
+Trigger when the user asks anything about reading an experiment's results or its health. Common phrasings:
+
+- "What do these results mean?" / "Should we ship this?"
+- "Is this experiment trustworthy?" / "Why is SRM failing?"
+- "Why hasn't this hit statistical significance yet?"
+- "Break this down by `<segment>`" / "What segments should I look at?"
+- "What does this Retro A/A failure mean?"
+- "Can you compare the session replays for control vs treatment?"
+
+Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the setup-side skill or tool.
+
+---
+
+## How to read `Get-Experiment` output
+
+Always call `Get-Experiment` with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.**
+
+| Concept                      | Live (preferred)                  | Cached fallback                             |
+| ---------------------------- | --------------------------------- | ------------------------------------------- |
+| Per-variant exposure counts  | `live_exposures`                  | `exposures_cache` (strip `$`-prefixed keys) |
+| SRM check                    | `live_srm_analysis`               | `exposures_cache.$srm_analysis`             |
+| Per-metric per-variant stats | `live_metrics[metricId][variant]` | `results_cache.metrics[metricId][variant]`  |
+| Bucketed summary             | recompute from `live_metrics`     | `results_cache.summary`                     |
+| When was this computed?      | "now"                             | `exposures_cache.$last_computed`            |
+
+If `live_results_errors` is non-null, the live path failed. Use the cache, caveat that data is stale, and surface the error to the user — the underlying failure may need fixing before any decision.
+
+If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat this as "no effect."
+
+See [references/get-experiment-fields.md](references/get-experiment-fields.md) for the full field map and which fields drive each step below.
+
+---
+
+## The Decision Tree
+
+This is the spine of every interpretation. Run the steps **in order**. **Stop at the first failure** — do not proceed to step N+1 if step N flags a problem.
+
+```
+┌─ Step 1: TRUSTWORTHINESS GATE ───────────────┐
+│   SRM ok? → exposures sufficient? →          │
+│   Retro A/A clean? → minimum duration met? → │
+│   no misconfig?                              │
+│        │                                     │
+│      fail → STOP. See references/            │
+│             health-check-interpretation.md   │
+└──────────────┬───────────────────────────────┘
+               ↓ pass
+┌─ Step 2: STATISTICAL SIGNIFICANCE ───────────┐
+│   For each non-control variant × primary,    │
+│   apply the polarity recipe (sign-of-lift +  │
+│   metric.direction). Significant + correct   │
+│   polarity = "win"; significant + wrong      │
+│   polarity = "loss".                         │
+│        │                                     │
+│   nothing significant on primaries →         │
+│   see references/why-no-statsig.md           │
+└──────────────┬───────────────────────────────┘
+               ↓ at least one primary win
+┌─ Step 3: GUARDRAIL CHECK ────────────────────┐
+│   Any guardrail significant in the wrong     │
+│   polarity? → regression → ITERATE not ship  │
+└──────────────┬───────────────────────────────┘
+               ↓ guardrails clean
+┌─ Step 4: PRACTICAL SIGNIFICANCE ─────────────┐
+│   Convert the lift on the primary into       │
+│   absolute terms. Is it big enough to        │
+│   matter to the business?                    │
+│   Statistically significant ≠ ships.         │
+└──────────────┬───────────────────────────────┘
+               ↓ meaningful magnitude
+┌─ Step 5: VERDICT ────────────────────────────┐
+│   Trust ✓ + primary win + guardrails ✓ +     │
+│   meaningful magnitude → SHIP                │
+│   Trust ✓ + primary win + guardrail regress  │
+│     → ITERATE                                │
+│   Trust ✓ + primary neutral after target     │
+│     → KILL or ITERATE                        │
+│   Trust ✗                                    │
+│     → DO NOT DECIDE; report failures         │
+│   Hasn't reached target sample/duration      │
+│     → WAIT (or extend, or restart with more  │
+│       power — see why-no-statsig.md)         │
+└──────────────────────────────────────────────┘
+```
+
+### Step 1 — Trustworthiness gate (consume the verdicts)
+
+Read these fields. Treat the platform's verdict as authoritative — do not reapply thresholds yourself.
+
+| Check                    | Field to read                                                                                          | What "fail" looks like                                                                                                                                         |
+| ------------------------ | ------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| SRM                      | `live_srm_analysis` (or `exposures_cache.$srm_analysis`)                                               | Platform flags as failing — do not compute the chi-square yourself.                                                                                            |
+| Sufficient exposures     | `live_exposures` per variant                                                                           | Platform-flagged "insufficient." If unflagged but per-variant counts look suspicious, route the user to the health-check reference; do not invent a threshold. |
+| Retro A/A (pre-exp bias) | `settings.preExperimentBias` enabled, then the analysis                                                | Platform flags a significant pre-period difference.                                                                                                            |
+| Minimum elapsed time     | `end_date - start_date`                                                                                | Less than ~3 days regardless of sample size — interpretation is unreliable.                                                                                    |
+| Ran for planned duration | `start_date`, `end_date`, `settings.endAfterDays`/`sampleSize`/`endCondition`, `settings.testingModel` | Frequentist: ended before reaching configured target = peeking. Sequential: early stop on significance is allowed.                                             |
+| Misconfiguration         | See [references/get-experiment-fields.md](references/get-experiment-fields.md) §Misconfig              | Any flagged misconfig invalidates analysis.                                                                                                                    |
+
+If any of these fail, **stop**. Tell the user explicitly that results are not trustworthy. Open [references/health-check-interpretation.md](references/health-check-interpretation.md) for the per-failure root-cause checklists, recommended actions, and the Kohavi framing ("SRM is the #1 trustworthiness check; Twyman's Law: any unusually clean result is more likely a bug than a discovery").
+
+### Step 2 — Statistical significance with polarity
+
+**Critical**: `summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by whether the lift is good for the business. You MUST apply the polarity recipe using each metric's `direction` before declaring a winner.
+
+#### Polarity recipe
+
+`metric.direction` is `"up"` or `"down"` (defaults to `"up"` if unset on the source metric).
+
+- `lift is None` or `lift == 0` → **neutral**.
+- `direction == "up"` → **positive** if `lift > 0`, else **negative**.
+- `direction == "down"` → **positive** if `lift < 0`, else **negative**.
+
+A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. Never trust the bucket name as the business verdict.
+
+#### How to read the summary
+
+1. **Filter out the control row.** Use `settings.controlKey` (typically `"control"`; may be empty). Control-vs-control always has lift 0 and inflates the "no effect" count. If `controlKey` is empty, identify control by: (a) the variant literally named `"control"`, (b) the variant whose lift is uniformly 0 across all metrics, or (c) ask the user.
+2. For each non-control variant, look up the metric in `summary.positive` / `summary.negative` / `summary.no`. **Trust the bucket name as the significance signal** — the `significance` field on each item may be `null` even when the bucket is meaningful.
+3. Apply the polarity recipe using `metric.direction` to translate sign-of-lift into win/loss.
+4. If `lift is None` in a summary item, **the calculation failed** for that variant — surface it. Do not interpret as "no effect."
+
+The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is set to `"bonferroni"` or `"benjamini-hochberg"` (across primaries × non-control variants). **Don't re-correct.**
+
+Turning the per-metric numbers into a plain-language verdict (lift + CI + p-value → "small win," "large regression," "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
+
+If nothing on the primaries is significant and the user is asking "why hasn't this hit statsig?", route to [references/why-no-statsig.md](references/why-no-statsig.md).
+
+### Step 3 — Guardrail check
+
+Apply the polarity recipe to every guardrail metric (`metric.type == "guardrail"`).
+
+- A small primary win + a clear guardrail regression → usually **iterate, do not ship**.
+- "Not significant" on a guardrail does NOT mean "no regression." It means the experiment couldn't _detect_ one at the chosen confidence. If the guardrail is critical (latency, error rate, retention), flag whether it was powered to detect a meaningful regression.
+- Polarity matters here too: a guardrail named "errors" with `direction: "down"` and lift `+5%` (significant) is a regression even though it lands in `summary.positive`.
+
+### Step 4 — Practical significance
+
+Statistical significance ≠ business impact. For every primary metric that won:
+
+1. Read the **baseline value** from the control variant: `live_metrics[metricId][controlKey].value`.
+2. Read the **lift** from the winning variant's row.
+3. Compute absolute lift: `baseline_value × lift`.
+4. Project to population per period: ask the user for traffic estimates if not in context.
+
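+A 5% lift on a 20% baseline conversion-rate metric serving 1M users/week is enormous: roughly 10,000 additional conversions per week (1M × 20% × 5%). A 5% lift on a 0.1% baseline metric serving 1k users/week is noise: about 0.05 additional conversions per week (1k × 0.1% × 5%). Always ground the user in absolute terms before declaring a win meaningful.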
+
+**Twyman's Law check**: before celebrating any lift > ~30%, ask: did the treatment change who is _exposed_ to this metric, not just how they behave? See the changed-denominator notes in [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
+
+If `value` or `sampleSize` is `null` (common when live computation timed out), call `Run-Query` on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total.
+
+### Step 5 — Verdict
+
+| Situation                                                              | Recommendation                                                                                                                                               |
+| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** `Update-Experiment(action="decide", success=true, variant=<winner>, message=<rationale>)`                                                          |
+| Trust ✓, primary polarity positive, guardrail polarity negative        | **ITERATE.** Investigate the regression; do not auto-ship.                                                                                                   |
+| Trust ✓, primary polarity neutral after target sample reached          | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md).                                  |
+| Trust ✓, target sample/duration not yet reached                        | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)).                                         |
+| Trust ✗                                                                | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). |
+
+For **multi-variant tests**, pivot the summary by variant and evaluate each treatment independently against control. The winner is the variant with the most polarity-corrected primary wins, zero guardrail regressions, and the largest practical impact. If multiple qualify, prefer the simpler / lower-risk variant. If none qualify, recommend kill or iterate.
+
+`message` is required on every `decide` call — include the rationale, the metrics evaluated, and any tradeoffs accepted.
+
+Special variant constants when `success=true`:
+
+- `__no_variant_shipped__` — ship the change without picking a variant
+- `__defer_variant_decision__` — defer (status becomes `SUCCESS_DEFERRED` in UI)
+
+For a kill, pass `success=false`.
+
+---
+
+## Going deeper
+
+Once the spine is clear, the user often asks one of these follow-ups. Open the relevant reference on demand:
+
+| User asks about…                                                                | Open                                                                                             |
+| ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
+| SRM failing, Retro A/A failing, exposures insufficient, or any Step 1 fail      | [references/health-check-interpretation.md](references/health-check-interpretation.md)           |
+| "Translate this lift / CI / p-value into English"                               | [references/per-metric-interpretation.md](references/per-metric-interpretation.md)               |
+| "Why hasn't this hit statsig yet? Should we wait or stop?"                      | [references/why-no-statsig.md](references/why-no-statsig.md)                                     |
+| "Which segments should I break this down on?"                                   | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md)       |
+| "What does this segment-by-segment result mean?" (when platform support exists) | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) |
+| "Can session replays help explain this result?"                                 | [references/session-replay-analysis.md](references/session-replay-analysis.md)                   |
+| "Which `Get-Experiment` field has X?"                                           | [references/get-experiment-fields.md](references/get-experiment-fields.md)                       |
+
+---
+
+## Output
+
+Default to this shape unless the user asks for something else:
+
+1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`.
+2. **Why**, walking through the decision tree steps that mattered (skip the steps that were clearly fine).
+3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, with the polarity-corrected reading of each. Include the absolute-impact translation for any win.
+4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, etc.
+5. **Suggested next action** — the `Update-Experiment` call to make, or the deeper investigation to run.
+
+If `Get-Experiment` is unavailable or returns errors, say so — do not invent a verdict.
+
+---
+
+## Common pitfalls (cheat sheet)
+
+- ⛔ **Skipping Step 1** because the lifts look exciting (Twyman's Law)
+- ⛔ **Recomputing thresholds** instead of consuming the verdicts the platform already returned
+- ⛔ **Not applying polarity** — reading `summary.positive` as "good" without checking `metric.direction`
+- ⛔ Trusting a >30% lift without checking whether the **denominator changed**
+- ⛔ **Including the control row** when counting wins/losses (filter by `settings.controlKey`)
+- ⛔ Treating a `null` lift as "no effect" — it means computation failed
+- ⛔ Treating a missing primary (in `metrics[]` but not in `live_metrics`/`results_cache.metrics`) as "no effect" — it's "no measurement"
+- ⛔ Interpreting an experiment that has run for less than ~3 days instead of refusing
+- ⛔ Forgetting to call out a **non-default `confidenceLevel`** (0.9 inflates false positives; 0.99 is conservative)
+- ⛔ Treating **secondary-metric significance** as decisional (it isn't, ever)
+- ⛔ Conflating **statistical significance** with **practical significance**
+- ⛔ Ignoring **guardrail regressions** because the primary won
+- ⛔ Calling a single significant primary a "win" when multiple-testing correction is off; look at the aggregate, or enable correction
+- ⛔ Concluding "no effect" from an underpowered inconclusive result (route to [references/why-no-statsig.md](references/why-no-statsig.md))
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/README.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/README.md
new file mode 100644
index 0000000..71278d6
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/README.md
@@ -0,0 +1,34 @@
+# Eval fixtures — `experiment-results`
+
+Each fixture is a self-contained prompt + expected-behavior pair for the `experiment-results` skill. They are seeded from PRD customer quotes — the customer pain that motivated this skill in the first place.
+
+The fixtures are not auto-runnable yet (no harness lives in this repo). They're written for two uses:
+
+1. **Manual rehearsal** — a human (or another agent) can read the prompt, simulate the response the skill should produce, and check it against the `expected_behavior` field.
+2. **Regression checkpoint when a runner exists** — when an eval harness is added in this repo, these prompts plug in directly: each YAML doc becomes one case, the `expected_behavior` field becomes the grader rubric.
+
+When you change `SKILL.md`, walk these fixtures and confirm each one still produces the expected behavior. If a fixture starts failing, decide whether the skill regressed or the fixture itself needs updating.
+
+---
+
+## Fixtures
+
+| Fixture                         | PRD source quote                                                                                                         | What it exercises                                                                              |
+| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------- |
+| `pelando-plus-2-others.yaml`    | Pelando — _"+2 others"_ (results too noisy for the user to triage which results to act on)                               | Decision tree spine + per-metric polarity; ship/iterate verdict against multi-variant noise.   |
+| `confetti-8-metrics.yaml`       | Confetti — _"8 metrics for new visitors"_ (many primaries; user wants segment-of-interest selection on new vs returning) | Segment-of-interest selection; multiple-testing correction warning; per-metric interpretation. |
+| `polarsteps-no-workaround.yaml` | Polarsteps — _"no documented workaround"_ (user wants to understand SRM failure with no canned path forward)             | Health-check interpretation; Kohavi framing; ordered-causes recommendation.                    |
+
+Each YAML doc has the same shape:
+
+```yaml
+name: <slug>
+prd_source: <one-line attribution>
+trigger_phrase: <what the user types>
+get_experiment_summary: <key fields the skill would see; not full response — just enough for the eval>
+expected_behavior:
+  verdict: <SHIP | ITERATE | KILL | WAIT | DO_NOT_DECIDE>
+  must_mention: [<phrases / framings the skill must cover>]
+  must_not_do: [<failure modes the skill should avoid>]
+  references_consulted: [<which reference files the skill should pull open>]
+```
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/confetti-8-metrics.yaml b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/confetti-8-metrics.yaml
new file mode 100644
index 0000000..da61d9e
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/confetti-8-metrics.yaml
@@ -0,0 +1,48 @@
+name: confetti-8-metrics
+prd_source: |
+  Confetti — "8 metrics for new visitors"
+  Customer is running an experiment with 8 primary-ish metrics and explicitly
+  cares about new-visitor behavior. They want a segment-driven read, not a
+  dump of 8 lifts. The skill should pre-commit to segments tied to the
+  hypothesis (new vs returning), call out the multiple-testing concern with
+  8 metrics, and produce a verdict scoped to the segment that matters.
+
+trigger_phrase: |
+  We're tracking 8 metrics on this onboarding redesign experiment and I really
+  care about how new visitors respond. Can you read this and tell me whether
+  it's a ship for the new-user audience?
+
+get_experiment_summary:
+  hypothesis: |
+    If we redesign the first-session onboarding flow, then activation rate
+    among NEW visitors will increase by ≥5% relative, because reducing
+    cold-start friction shortens time-to-first-value.
+  settings:
+    controlKey: "control"
+    multipleTestingCorrection: "off" # mis-configured given 8 primaries
+    testingModel: "sequential"
+    confidenceLevel: 0.95
+  metrics_count: 8
+  primary_metrics_summary: |
+    Of 8 primaries: 2 significant positive (polarity-correct), 1 significant
+    negative (a "Time to First Action" metric with direction=down where
+    lift is -7% — actually a WIN once polarity-applied), 5 inconclusive.
+
+expected_behavior:
+  verdict: WAIT
+  must_mention:
+    - "Pre-commit to the new-vs-returning segment because the hypothesis names new visitors as the cohort that matters"
+    - "Recommend at most 3–5 segments and call new vs returning the primary slice"
+    - "Multiple-testing correction is OFF but there are 8 primaries — flag the inflated family-wise FPR explicitly (rough order: with 8 primaries × 1 variant at α=0.05, family-wise FPR is high enough to make a single significant result inconclusive on its own)"
+    - "Apply polarity recipe per metric — flag the Time to First Action 'negative bucket' as a win because direction=down"
+    - "Without correction enabled, don't ship on a single significant primary; either enable correction and re-analyze or look at the aggregate of all 8"
+    - "Verdict is WAIT (re-analyze with multiple-testing correction enabled, segmented to new visitors) — not SHIP"
+  must_not_do:
+    - "Slice by every available property after the fact (the fishing-expedition warning)"
+    - "Treat the 'Time to First Action' metric in the negative bucket as a loss without polarity-correcting"
+    - "Call the experiment a ship because 2 of 8 primaries are significant positive"
+    - "Pretend the agent can compute the corrected p-values itself — instead, recommend re-running with multipleTestingCorrection enabled"
+  references_consulted:
+    - segment-of-interest-selection.md
+    - per-metric-interpretation.md
+    - health-check-interpretation.md # for the misconfig flag
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/pelando-plus-2-others.yaml b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/pelando-plus-2-others.yaml
new file mode 100644
index 0000000..f634236
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/pelando-plus-2-others.yaml
@@ -0,0 +1,79 @@
+name: pelando-plus-2-others
+prd_source: |
+  Pelando — "+2 others"
+  Customer reported that when a multi-variant test concludes with a winner banner
+  plus a small-print "+2 others", they cannot tell which non-winner variants are
+  benign vs which contain a guardrail regression they need to act on. The skill
+  should pivot the summary per variant, polarity-correct each, and call out the
+  losers, not gloss over them.
+
+trigger_phrase: |
+  Can you make sense of this experiment for me? The UI shows treatment_a winning
+  on the primary plus "+2 others" but I have no idea whether treatment_b or
+  treatment_c are okay to ignore.
+
+get_experiment_summary:
+  settings:
+    controlKey: "control"
+    multipleTestingCorrection: "benjamini-hochberg"
+    testingModel: "sequential"
+  metrics:
+    - id: m_primary
+      type: primary
+      direction: up
+      name: "Activation Rate"
+    - id: m_guardrail_latency
+      type: guardrail
+      direction: down
+      name: "p95 Latency (ms)"
+    - id: m_guardrail_errors
+      type: guardrail
+      direction: down
+      name: "Error Rate"
+  live_exposures:
+    control: 41123
+    treatment_a: 40987
+    treatment_b: 41210
+    treatment_c: 40755
+  live_srm_analysis:
+    # platform-flagged passing
+    p_value: 0.42
+  summary:
+    positive:
+      - {
+          metricId: m_primary,
+          variant: treatment_a,
+          lift: 0.041,
+          liftConfidence: 0.95,
+        }
+      - {
+          metricId: m_guardrail_latency,
+          variant: treatment_b,
+          lift: 0.08,
+          liftConfidence: 0.95,
+        }
+    negative:
+      - {
+          metricId: m_primary,
+          variant: treatment_c,
+          lift: -0.022,
+          liftConfidence: 0.95,
+        }
+    no:
+      - { metricId: m_primary, variant: treatment_b, lift: 0.004 }
+
+expected_behavior:
+  verdict: ITERATE
+  must_mention:
+    - "Pivot the summary by variant before declaring a winner"
+    - "treatment_a wins on the primary but treatment_b shows a latency regression once polarity is applied (direction=down + lift +8% = bad)"
+    - "treatment_c regresses the primary"
+    - "Multi-variant verdict requires each treatment to be judged independently against control"
+    - "Recommend iterate, not ship — at minimum, do not ship treatment_b, and investigate treatment_c before re-running"
+  must_not_do:
+    - "Quietly drop treatment_b and treatment_c into '+2 others' without polarity-checking each"
+    - "Trust the bucket name (positive/negative) as the business verdict"
+    - "Re-apply multiple-testing correction on top of the platform's benjamini-hochberg"
+  references_consulted:
+    - per-metric-interpretation.md
+    - get-experiment-fields.md
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/polarsteps-no-workaround.yaml b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/polarsteps-no-workaround.yaml
new file mode 100644
index 0000000..325a3bf
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/polarsteps-no-workaround.yaml
@@ -0,0 +1,61 @@
+name: polarsteps-no-workaround
+prd_source: |
+  Polarsteps — "no documented workaround"
+  Customer's experiment is failing SRM and they cannot find a documented path
+  forward. The skill should consume the platform's SRM verdict (not recompute
+  chi-square), cite Kohavi ("SRM is the #1 trustworthiness check"), and
+  surface ordered likely causes plus a specific recommended action — not
+  punt with "investigate further."
+
+trigger_phrase: |
+  My experiment is failing SRM and the result lift looks too good to be true
+  (+18% on the primary). The docs just say "investigate" — what does that
+  actually mean here? Should I trust the lift?
+
+get_experiment_summary:
+  settings:
+    controlKey: "control"
+    srm:
+      enabled: true
+      targetAllocations: { control: 50, treatment: 50 }
+    excludeQA: false # potentially relevant
+  live_exposures:
+    control: 18250
+    treatment: 22980
+  live_srm_analysis:
+    # platform-flagged FAILING
+    p_value: 0.00002
+    chi_square: 18.4
+  summary:
+    positive:
+      - {
+          metricId: m_primary,
+          variant: treatment,
+          lift: 0.18,
+          liftConfidence: 0.95,
+        }
+  metrics:
+    - id: m_primary
+      type: primary
+      direction: up
+      name: "Trip Plan Created"
+
+expected_behavior:
+  verdict: DO_NOT_DECIDE
+  must_mention:
+    - "SRM is failing per the platform's verdict — do NOT trust the +18% lift"
+    - "Cite Kohavi: SRM is the #1 trustworthiness check; when SRM is failing, lift, p-values, and confidence intervals cannot be attributed to the treatment"
+    - "Twyman's Law: a +18% lift on a failing-SRM experiment is more likely a bucketing bug than a genuine win"
+    - "Likely causes ordered most → least likely: bucketing_bug, biased_assignment, bot_traffic, exposure_tracking_bug, ramp_up_timing"
+    - "Recommended action: pause_and_investigate — pause before drawing conclusions; randomization assumption is broken"
+    - "Concrete next steps: compare live_exposures to targetAllocations; check feature-flag rules and history via Get-Feature-Flag; Run-Query $experiment_started by variant; enable settings.excludeQA before relaunch given it is currently off"
+    - "Do NOT recompute the SRM chi-square — consume the platform's verdict"
+    - "Restart with fixed bucketing once the cause is found; do NOT re-conclude on the broken data"
+  must_not_do:
+    - "Calculate the chi-square or re-derive an SRM p-value threshold"
+    - "Recommend shipping or treating the +18% lift as real"
+    - "Hand the user a generic 'investigate further' without ordered causes and an action"
+    - "Skip Kohavi framing — it's the whole reason this check is the #1 gate"
+  references_consulted:
+    - health-check-interpretation.md
+    - get-experiment-fields.md
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/get-experiment-fields.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/get-experiment-fields.md
new file mode 100644
index 0000000..efaeae5
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/get-experiment-fields.md
@@ -0,0 +1,161 @@
+# `Get-Experiment` Field Map
+
+Quick reference for which `Get-Experiment` response field drives each interpretation. Always call with `compute_exposures=true, compute_metrics=true`.
+
+This reference is **read-only domain knowledge** for the agent. It does NOT define thresholds — every "fail condition" listed below is a _characterization_ of how the platform itself already classifies the field, not a threshold this skill should re-apply.
+
+---
+
+## Identity & lifecycle
+
+```
+id, name, description, hypothesis, status, start_date, end_date
+creator_email, tags, url, workspace_id
+feature_flag_id                       → for feature-flag-based experiments
+settings.controlKey                   → variant key treated as control (often "control"; may be "")
+```
+
+`status` is one of `"concluded" | "success" | "fail"` (the UI may additionally show `SUCCESS_DEFERRED` for the special variant constant — see "Decision metadata" below).
+
+---
+
+## Trustworthiness
+
+```
+live_srm_analysis                     → SRM verdict (consume — don't recompute)
+  .p_value
+  .chi_square
+live_exposures[<variantKey>]          → per-variant exposure counts (live)
+exposures_cache[<variantKey>]         → per-variant exposure counts (cached fallback)
+exposures_cache.$srm_analysis         → cached SRM analysis
+exposures_cache.$last_computed        → when the cache was last refreshed
+settings.srm.enabled                  → whether the SRM check ran
+settings.srm.targetAllocations        → expected per-variant allocation (percent)
+settings.preExperimentBias            → whether Retro A/A was enabled
+settings.excludeQA                    → whether QA traffic was filtered
+live_results_errors                   → non-null = live computation failed; surface and fall back to cache
+```
+
+---
+
+## Per-metric per-variant results
+
+```
+live_metrics[<metricId>][<variantKey>]
+  .value             → metric value for this variant
+  .sampleSize        → sample size for this variant on this metric
+  .lift              → (treatment - control) / control  (0 for control row)
+  .liftConfidence    → confidence LEVEL used (e.g. 0.95) — NOT the CI width
+  .significance      → "YES_POSITIVE" | "YES_NEGATIVE" | "NO"  (sign-of-lift, NOT polarity)
+
+results_cache.metrics[<metricId>][<variantKey>]  → cached fallback, same shape
+```
+
+---
+
+## Bucketed summary
+
+```
+results_cache.summary.positive[]      → items with significance == "YES_POSITIVE" (lift > 0, sig)
+results_cache.summary.negative[]      → items with significance == "YES_NEGATIVE" (lift < 0, sig)
+results_cache.summary.no[]            → items with significance == "NO"
+
+Each item:
+  .metricId
+  .variant
+  .value
+  .lift
+  .liftConfidence
+  .sampleSize
+  .significance
+```
+
+**Pre-process the summary**: filter rows where `variant == settings.controlKey` (control-vs-control is mechanical noise), then apply the polarity recipe before drawing any conclusion.
+
+---
+
+## Metric catalog (for polarity lookups)
+
+```
+metrics[]
+  .id, .name
+  .type ("primary" | "guardrail" | "secondary")
+  .direction ("up" | "down")          → always set; defaults to "up" if the source metric was unset
+```
+
+Build a lookup `metric_id → (type, direction)` and join to summary rows during interpretation.
+
+---
+
+## Settings that change interpretation
+
+```
+settings.confidenceLevel              → significance threshold (e.g. 0.95)
+settings.testingModel                 → "frequentist" or "sequential"
+settings.endCondition                 → "sample_size" or "days"
+settings.sampleSize / .endAfterDays   → planned end target
+settings.multipleTestingCorrection    → "off" | "bonferroni" | "benjamini-hochberg"
+settings.cuped.enabled                → CUPED variance reduction applied
+settings.cuped.preExposureDatePreset  → pre-exposure window
+settings.winsorization.enabled        → outlier capping applied
+settings.winsorization.percentile     → cap percentile (default 95; lower values are extreme)
+```
+
+---
+
+## Decision metadata (post-decide)
+
+```
+results_cache.message                 → decision rationale
+results_cache.variant                 → shipped variant key (or special constant)
+status                                → "concluded" | "success" | "fail"
+```
+
+Special variant constants for `success=true`:
+
+- `__no_variant_shipped__` — ship the change without picking a variant.
+- `__defer_variant_decision__` — defer (UI shows `SUCCESS_DEFERRED`).
+
+For a kill, pass `success=false`.
+
+---
+
+## Lifecycle hand-off
+
+```
+Update-Experiment(
+  experiment_id=<id>,
+  experiment={
+    "action": "decide",
+    "success": true | false,
+    "variant": "<winner_key>",      # required when success=true
+    "message": "<rationale: metrics evaluated, polarity, tradeoffs accepted>"
+  }
+)
+```
+
+`message` is required on every `decide` call.
+
+---
+
+## Misconfig field map (cross-link)
+
+For _how_ to react to each of these, see [health-check-interpretation.md](health-check-interpretation.md) §7.
+
+- `settings.multipleTestingCorrection in {"off", null}` with 2+ primaries × 1+ non-control variants
+- `settings.winsorization.enabled == true` with `percentile` very low (< ~80) or very high (> ~99)
+- `settings.srm == null` OR `settings.srm.enabled == false` (often intentional — only flag if results look suspicious)
+- `settings.cuped.enabled == true` AND the experiment cohort is "new users only"
+- `settings.confidenceLevel != 0.95`
+- `metrics[]` entries with `name == ""`
+- A primary metric in `metrics[]` but missing from `live_metrics` AND `results_cache.metrics`
+
+---
+
+## When to reach for sibling tools
+
+- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the setup-side skill / `Get-Experiment-Setup-Guidance`.
+- **Raw data for triggered or segmentation analysis** → `Run-Query` on the metric with appropriate filters.
+- **Acting on the recommendation** (ship, kill, extend) → `Update-Experiment` with the appropriate action.
+- **Feature-flag rollout history** for SRM root cause → `Get-Feature-Flag`.
+- **Session replays** for behavioral explanation of a quantitative result → the replay-fetch tool (see [session-replay-analysis.md](session-replay-analysis.md)).
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md
new file mode 100644
index 0000000..4471219
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md
@@ -0,0 +1,158 @@
+# Health-Check Interpretation
+
+Open this when Step 1 of the Decision Tree flags a failure (SRM, Retro A/A, insufficient exposures, peeking, broken-data, < 3-day window, or any misconfiguration). The goal is to turn the platform's already-computed verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action.
+
+**This skill never recomputes thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers.
+
+---
+
+## Kohavi framing — always cite when a health check fails
+
+> **Sample Ratio Mismatch is the #1 trustworthiness check (Kohavi).** When SRM is failing, do not trust the experiment's lift, p-values, or confidence intervals — the randomization assumption is broken, so the measured effect cannot be attributed to the treatment.
+>
+> **Twyman's Law**: any unusually clean or unusually large result is more likely a bug than a discovery. A spectacular lift on a failing-SRM experiment is not evidence of a great treatment; it's evidence the bucketing is broken.
+
+These two principles drive the recommendations below. Lead with them when explaining a failing check to the user.
+
+---
+
+## 1. SRM (Sample Ratio Mismatch)
+
+**Verdict to consume**: `live_srm_analysis` (or `exposures_cache.$srm_analysis`). The platform tags failing SRMs already; do not compute chi-square yourself.
+
+### What it means
+
+Users were assigned to variants in proportions that disagree with the configured `settings.srm.targetAllocations`, and the disagreement is too large to be plausibly explained by chance. Something in the experimental machinery (bucketing or exposure logging) is broken, and every downstream number (lift, p-value, CI) inherits that brokenness.
+
+### Likely causes, ordered most → least likely
+
+(Surface in this order — investigate the most probable first.)
+
+1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees.
+2. **biased_assignment** — The assignment criterion correlates with the variant — e.g. assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window.
+3. **bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation.
+4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the `$experiment_started` event fires exactly once per user per variant assignment.
+5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period.
+
+### Recommended actions
+
+- **pause_and_investigate** — Pause the experiment before drawing any conclusions. SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable.
+- **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric.
+- **investigate_exposure_logging** — Inspect `$experiment_started` event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs.
+- **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split.
+
+### Investigation checklist
+
+1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented?
+2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Use `Get-Feature-Flag` to inspect rollout rules and history.
+3. For multi-variant tests, the platform's SRM threshold is Bonferroni-corrected, so the effective per-variant threshold may be tighter than the headline value. Trust the platform's SRM flag, not raw p-value math.
+4. Verify SDK version and bucketing logic. `Run-Query` for `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly.
+5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it.
+6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting.
+7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.**
+
+---
+
+## 2. Retro A/A (pre-experiment bias) failure
+
+**Verdict to consume**: the analysis the platform attached when `settings.preExperimentBias` is enabled.
+
+### What it means
+
+The same statistical comparison run on the **pre-exposure** period revealed that variant cohorts already differed _before_ the treatment started. Any "lift" measured during the experiment may just be reflecting that pre-existing gap, not the change.
+
+- Pre-experiment bias on a **primary** metric is a **stop-and-investigate** signal.
+- Pre-experiment bias on a **secondary** metric is informational only.
+
+### Investigation checklist
+
+1. Identify which metric × variant pair triggered the failure (after the platform's correction).
+2. Check whether bucketing was deterministic. With non-deterministic assignment, the cohorts compared in the pre-period may not be the same users who were later exposed to each variant in production.
+3. Look for cohort skew: did one variant disproportionately receive heavy users? `Run-Query` on the metric pre-experiment grouped by variant to confirm.
+4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort.
+5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing.
+
+---
+
+## 3. Insufficient exposures
+
+**Verdict to consume**: `live_exposures` per variant, plus any platform-attached "insufficient" flag. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue.
+
+### Investigation checklist
+
+1. Check `live_exposures` totals — which variant is undersampled?
+2. Inspect feature-flag rollout: `Get-Feature-Flag` → was rollout dialed back?
+3. `Run-Query` for the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?).
+4. If the experiment is still ACTIVE: extend duration via `Update-Experiment` with `endAfterDays`.
+5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math.
+
+If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question.
+
+---
+
+## 4. Frequentist peeking
+
+**Verdict to consume**: `settings.testingModel == "frequentist"`, plus `end_date` vs `start_date + endAfterDays` (or `sampleSize` vs `live_exposures.$overall`, depending on `settings.endCondition`).
+
+### What it means
+
+A frequentist test that ends before reaching its configured target has an **inflated false-positive rate**. The math assumes a fixed sample size; peeking before that point and stopping on a favorable look is exactly what "p-hacking" looks like in production.
+
+### Investigation checklist
+
+1. Confirm `settings.testingModel == "frequentist"`.
+2. Compare `end_date` against `start_date + endAfterDays` (or whether `sampleSize` was reached, whichever is the configured `endCondition`).
+3. If the conclusion was premature: results have inflated false-positive rate. Recommend a re-run.
+4. If the user wants to keep current results: caveat strongly. Recommend `testingModel: "sequential"` for the next experiment so they can stop early without penalty.
+
+(Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.)
+
+---
+
+## 5. Live computation timeout / broken data
+
+**Verdict to consume**: `live_results_errors` non-null with `live_*` fields null.
+
+### Investigation checklist
+
+1. Try `Get-Experiment` again — transient backend load may resolve. Wait ~30s between retries.
+2. If repeated failures: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget.
+3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision.
+4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation.
+
+---
+
+## 6. Experiment ran < 3 days
+
+**Verdict to compute (this one is local)**: `end_date - start_date`.
+
+Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. **Refuse to interpret.** Tell the user explicitly:
+
+> _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_
+
+If `endCondition` is `"sample_size"` and a tiny target (e.g. 10k) was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window.
+
+---
+
+## 7. Misconfigurations to flag during Step 1
+
+These don't always invalidate results, but they change how to _read_ them. Surface them as warnings.
+
+- `settings.multipleTestingCorrection in {"off", null}` AND there are 2+ primary metrics across 1+ non-control variants → without correction, any single significant primary may be a false positive. **Don't assume the result is broken**; look at all primary results in aggregate. If most or all primaries point the same direction (all positive or all negative), there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk**, and the user can enable correction (Benjamini-Hochberg or Bonferroni) and re-analyze. The family-wise error rate compounds with the number of comparisons: for m independent tests it is 1 - (1 - α)^m (e.g. 15 primaries × 1 variant at α=0.05 → ~54% chance of at least one false positive).
+- `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` very low (e.g. < ~80) or unusually high (e.g. > ~99) → suspicious cap setting. The platform's default is 95; a percentile near 50 caps almost all data and likely indicates misconfiguration, while a percentile above ~99 caps almost nothing, so winsorization has little effect.
+- `settings.srm == null` OR `settings.srm.enabled == false` → the SRM check didn't run. **SRM is often deliberately disabled** (e.g. when feature-flag rollouts intentionally split traffic unevenly), so do not try to compute it yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) — then suggest the user re-enable SRM and re-analyze.
+- `settings.cuped.enabled == true` AND the experiment cohort is "new users only" → CUPED requires pre-exposure data, which new-user experiments lack, so CUPED simply has no effect. **This does NOT invalidate results** — variance reduction just didn't happen. Mention it as informational.
+- `settings.confidenceLevel != 0.95` → call out explicitly. `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Combine with metric count for a sense of family-wise error rate.
+- `metrics[]` contains entries with `name == ""` → likely a broken or placeholder metric reference. Flag and skip during analysis.
+- A primary metric appears in `metrics[]` but is **missing from `live_metrics` AND `results_cache.metrics`** → no result was computed for that primary. Surface prominently — this is "no measurement," not "no effect." Recommend the user re-sync results.
+
+---
+
+## Output shape when a health check fails
+
+1. **What failed**, in one sentence (use the verdict the platform attached — do not re-derive).
+2. **What that means for trust** — cite the Kohavi framing (SRM is #1) or Twyman's Law where it fits.
+3. **Likely causes**, ordered most → least probable.
+4. **Recommended action** from the small set above.
+5. **Investigation checklist** the user can run.
+6. **What NOT to do** — usually, "do not act on the current lift / p-value numbers."
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md
new file mode 100644
index 0000000..3b44385
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md
@@ -0,0 +1,188 @@
+# Per-Metric Interpretation
+
+Open this when the user wants you to translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_
+
+**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from `Get-Experiment`. Then translate.
+
+---
+
+## The mental model
+
+Each row in `summary.positive` / `summary.negative` / `summary.no` answers four questions:
+
+1. **Did the lift go up or down?** — the `summary` bucket name (sign-of-lift, not polarity).
+2. **Was the change distinguishable from noise?** — the `significance` field (or the bucket name itself: rows in `summary.positive` / `summary.negative` are significant, rows in `summary.no` are not).
+3. **Was the change in the goal direction?** — apply the polarity recipe with `metric.direction`.
+4. **Was the change big enough to matter?** — multiply `lift` by the control baseline `value` to get absolute impact, then judge against business context.
+
+A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any one of those and you're shipping the wrong thing.
+
+---
+
+## Polarity recipe (repeat from the spine — critical)
+
+`metric.direction` is `"up"` or `"down"` (defaults to `"up"`).
+
+- `lift is None` or `lift == 0` → **neutral** (treat as no measurement / no effect respectively).
+- `direction == "up"` → **positive** if `lift > 0`, else **negative**.
+- `direction == "down"` → **positive** if `lift < 0`, else **negative**.
+
+A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. For example, a -1% lift on `interstitials_shown` lands in `summary.negative`, but with `direction: "down"` it is plausibly a **win** (less interruption).
+
+---
+
+## Reading the p-value correctly
+
+The p-value is the probability of observing a difference at least as extreme as the one measured, **assuming the null hypothesis (no real difference) is true**. It is NOT:
+
+- ❌ The probability that the treatment works.
+- ❌ The probability the result will replicate.
+- ❌ A measure of effect size — a tiny lift can be highly significant on a huge sample.
+- ❌ Proof of "no effect" when above threshold (see the "Significance = NO" section below).
+
+Mixpanel uses Welch's t-test (z-test for large samples). Default α = 0.05 at 95% confidence. The confidence level is set on `settings.confidenceLevel`. If it differs from 0.95, call it out in the verdict (`0.9` inflates false positives; `0.99` is conservative).
+
+---
+
+## Reading the lift correctly
+
+```
+lift = (treatment_mean - control_mean) / control_mean
+```
+
+- `liftConfidence` is the **confidence level used** (e.g. 0.95). It is NOT the confidence-interval width.
+- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct.
+- If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect."
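+
+As a quick worked check of the formula above, using made-up means rather than values from any real experiment (the agent still reads `lift` from `Get-Experiment` rather than recomputing it):
+
+```
+control_mean   = 0.0200
+treatment_mean = 0.0208
+lift = (0.0208 - 0.0200) / 0.0200 = 0.04   # a +4% relative lift
+```
+
+The result is a relative change, so it says nothing yet about absolute size; that conversion is covered under "Magnitude".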
+
+---
+
+## Verdict phrasing — a small palette
+
+Pick the phrase that matches the four-question pattern. These are the words to use with users; they map onto the platform's already-computed numbers, so the agent never has to invent thresholds.
+
+| Pattern (sig × polarity × magnitude)                        | Plain-language verdict                                                                                                                                                    |
+| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Significant, polarity positive, magnitude large vs baseline | "**Clear win** — `<metric>` moved `<lift%>` in the goal direction, which is meaningful at this baseline." (apply Twyman's Law if lift > ~30%)                             |
+| Significant, polarity positive, magnitude small vs baseline | "**Statistically significant but practically small** — `<lift%>` on a `<baseline>` baseline is `<absolute>`; confirm with the user whether that clears the business bar." |
+| Significant, polarity negative                              | "**Regression** — `<metric>` moved `<lift%>` against its goal direction. This is a reason not to ship even if other primaries won."                                       |
+| Not significant, lift in goal direction, well-powered       | "**Likely no effect at the detectable size.** The experiment had enough power to detect `<MDE>`; the observed lift is below that threshold."                              |
+| Not significant, lift in goal direction, underpowered       | "**Inconclusive — too underpowered to call.** Route to the why-no-statsig playbook to decide between wait / extend / restart."                                            |
+| Not significant, lift in wrong direction                    | "**No detectable harm**, but no win either."                                                                                                                              |
+| `lift is None`                                              | "**No measurement** — this variant's row failed to compute. Surface the failure and re-sync."                                                                             |
+| Lift > ~30% on any metric                                   | Prefix with "**Twyman's Law check:** that lift is unusually large; verify the denominator hasn't changed before celebrating."                                             |
+
+---
+
+## Magnitude — make it absolute
+
+Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful:
+
+1. Baseline from the control variant: `live_metrics[metricId][controlKey].value` (or the `summary.no` row where `variant == controlKey`).
+2. Lift from the winning row.
+3. Absolute lift: `baseline_value × lift`. Examples:
+   - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate.
+   - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`.
+4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week."
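+
+The two scenarios quoted in step 4 work out as follows. This is illustrative arithmetic only, and it assumes the metric is a per-user conversion rate:
+
+```
+Scenario A: baseline 20%,  lift 5%, 1,000,000 users/week
+  absolute lift  = 0.20  × 0.05 = 0.01      (+1 percentage point)
+  weekly impact  = 0.01  × 1,000,000 = 10,000 extra conversions
+
+Scenario B: baseline 0.1%, lift 5%, 1,000 users/week
+  absolute lift  = 0.001 × 0.05 = 0.00005   (+0.005 percentage points)
+  weekly impact  = 0.00005 × 1,000 = 0.05 extra conversions
+```
+
+The relative lift is identical, yet the weekly impact differs by a factor of roughly 200,000.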
+
+### Fallback when `value` / `sampleSize` are null
+
+This is common: it happens whenever live computation timed out or `results_cache.metrics` was nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.**
+
+Call `Run-Query` on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation:
+
+- `unique` (Bernoulli) → conversion **rate** as the baseline.
+- `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. Multiplying lift by a raw total double-counts cohort size.
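+
+A worked example of the `total` case, with hypothetical numbers (none of these come from a real query):
+
+```
+control raw total   = 50,000 events
+control exposures   = 10,000 users
+baseline            = 50,000 ÷ 10,000 = 5.0 events per exposed user
+lift                = 0.04
+absolute lift       = 5.0 × 0.04 = +0.2 events per exposed user
+```
+
+Multiplying `0.04` by the raw `50,000` would instead give `2,000`, a figure that scales with cohort size rather than with per-user behavior.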
+
+---
+
+## Twyman's Law in practice — changed-denominator lifts
+
+Before celebrating any lift > ~30%, ask: **did the treatment change who is _exposed_ to this metric, not just how they behave?**
+
+If the treatment causes more users to _see_ a screen, more events naturally fire — the metric grows because the denominator changed, not because per-user behavior changed.
+
+- A "Free item" promotion drives more users to checkout → "Checkout Screen Viewed" lifts +1000% mechanically. The interesting question is **conversion rate on the screen**, not raw views.
+- A new banner makes a feature discoverable → "Feature Page Viewed" lifts dramatically. **Behavior per user who discovers the feature** may be unchanged.
+
+When you see a > 30% lift, name the risk explicitly:
+
+> _"This metric measures exposure to the screen/event. The treatment likely caused more users to be exposed; that explains most of the lift mechanically. The interesting question is what those users did once they got there."_
+
+---
+
+## Metric distribution types
+
+Different metric types behave differently; cite the relevant nuance in your verdict.
+
+| Metric type                      | Distribution | Interpretation nuance                                                                                     |
+| -------------------------------- | ------------ | --------------------------------------------------------------------------------------------------------- |
+| Unique users / conversion rate   | Bernoulli    | Variance = `p(1−p)`, which peaks at 50%. For a fixed relative lift, rates near 0% need much more sample.  |
+| Event counts / sessions per user | Poisson      | Variance = mean. Highly sensitive to power users; consider whether one heavy user can swing results.      |
+| Revenue / numeric properties     | Gaussian     | Long tails (whales) inflate variance. Strongly consider Winsorization.                                    |
+
+---
+
+## Variance-reduction & outlier settings that change interpretation
+
+- **CUPED** (`settings.cuped.enabled == true`): mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix).
+- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration (see `health-check-interpretation.md` §Misconfig).
+
+---
+
+## Multiple comparisons & metric tiers — what's decisional and what isn't
+
+| Tier          | How it influences the verdict                                                                                                                                                                                 |
+| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| **Primary**   | **Decisional.** The platform auto-applies correction when `multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` (across primaries × variants).                                              |
+| **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude.                                                                                                                                          |
+| **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. |
+
+If `settings.multipleTestingCorrection` is `"off"` AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled.
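+
+To see why the aggregate matters when correction is off, here is the standard family-wise error calculation. It assumes the comparisons are independent, which real metrics rarely are, so treat it as a rough guide; the counts are hypothetical:
+
+```
+comparisons m = 4 primaries × 2 non-control variants = 8
+α per comparison = 0.05
+
+P(at least one false positive) = 1 − (1 − α)^m
+                               = 1 − 0.95^8
+                               ≈ 1 − 0.663
+                               ≈ 0.34
+
+Bonferroni per-comparison threshold = α / m = 0.05 / 8 = 0.00625
+```
+
+With roughly a one-in-three chance of a spurious hit, one significant primary out of eight is weak evidence on its own.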
+
+---
+
+## "Significance = NO" does NOT mean "no effect"
+
+A row in `summary.no` means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level. **Important when the user is about to call something a null result.**
+
+Options to suggest when a primary metric lands in `summary.no`:
+
+1. **Extend duration** (if the experiment is still ACTIVE).
+2. **Increase traffic allocation** (if there's headroom; never in the middle of a Frequentist test, where changing allocation invalidates the SRM check).
+3. **Use the Sequential testing model** for the next experiment if continuous monitoring fits.
+4. **Enable CUPED** if the metric correlates with pre-exposure behavior.
+5. **Narrow the hypothesis** — test a stronger version, or scope to a more responsive segment.
+6. **Accept the null** — if the experiment was well-powered for the MDE that matters, "no effect" is a real finding.
+
+For the full "why hasn't this hit statsig yet" walk-through, see [why-no-statsig.md](why-no-statsig.md).
+
+---
+
+## Frequentist vs Sequential — what affects per-metric reading
+
+Check `settings.testingModel`:
+
+- `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration.
+- `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended.
+
+Calling `Update-Experiment(action="conclude")` on a Frequentist experiment that hasn't reached its target is a peeking event. Flag it in the verdict.
+
+---
+
+## Triggered analysis & dilution
+
+If the change only affects a subset of users (e.g. only triggers when a specific button is shown), the **effect on triggered users** is much larger than the **effect on the full exposed population**.
+
+- Triggered analysis zooms in on users who actually saw the change.
+- Dilution math: `population_lift = triggered_lift × (triggered_users / total_exposed)`.
+
+The platform doesn't auto-compute triggered analysis. If the change is gated by a condition, ask the user about the trigger rate and walk through the math before declaring the population-level lift "small."
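+
+A worked pass through the dilution formula, using hypothetical figures:
+
+```
+triggered_lift  = 0.20          # +20% among users who saw the change
+triggered_users = 5,000
+total_exposed   = 50,000
+population_lift = 0.20 × (5,000 / 50,000) = 0.20 × 0.10 = 0.02   # +2% overall
+```
+
+This simple form assumes untriggered users are unaffected and that triggered and untriggered users have similar baselines; if baselines differ, weight by each group's share of the metric instead.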
+
+---
+
+## Novelty and primacy
+
+- **Novelty** — lift is large early, then decays as users habituate.
+- **Primacy** — lift is small or negative early, then grows as users learn the new behavior.
+
+To detect either, look at the line-chart view of the metric (date-segmented). A monotonic decay from day 1 → day 14 is classic novelty; the steady-state lift is what matters for shipping. Call this out when interpreting any experiment shorter than ~2 weeks.
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md
new file mode 100644
index 0000000..6877d2a
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md
@@ -0,0 +1,95 @@
+# Segment-Breakdown Interpretation
+
+Open this when the user has per-segment results in hand and wants to read them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
+
+> **Platform support status.** Reading segment-level experiment results in `Get-Experiment` depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment `Run-Query` calls against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If `Get-Experiment` doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the `Run-Query` fallback — do not invent per-segment significance verdicts.
+
+---
+
+## The mental model
+
+A segment breakdown asks: _did the treatment affect different user segments differently?_ It has three possible outcomes per segment:
+
+1. **The segment moved in the same direction as the overall effect**, with similar magnitude → reinforces the overall verdict; nothing new.
+2. **The segment moved much more or less than overall**, but in the same direction → heterogeneity; the effect is concentrated in a subset.
+3. **The segment moved in the _opposite_ direction** to overall → Simpson's paradox or a real reversal — this is where segment analysis earns its keep.
+
+Reading a segment breakdown well means recognizing which of those three you're looking at and not mistaking noise for any of them.
+
+---
+
+## Per-segment polarity recipe — apply per row
+
+The same recipe from the per-metric reference applies _inside_ each segment. Don't take a shortcut.
+
+- For each segment × metric × non-control variant, look at the row's `lift` and bucket (positive/negative/no).
+- Translate sign-of-lift into business polarity using `metric.direction`. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary.
+- Filter out the control row in each segment.
+
+Surprisingly easy to forget when you're scanning a wide table — re-apply polarity per row.
+
+---
+
+## Sample-size floor per segment
+
+Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. As a rule of thumb, the same ~350-per-variant floor used for overall trustworthiness applies per segment.
+
+- Segments below the floor → mark "insufficient sample, treat as directional only."
+- A "significant" lift on a 50-user-per-variant segment is almost always noise. Say so.
+- If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice.
+
+---
+
+## Heterogeneity vs Simpson's paradox vs noise
+
+| What you see                                                                                        | Interpretation                                                                                                                                             |
+| --------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Most segments lift positive, one or two negative, all with overlapping CIs                          | **Noise.** Not heterogeneity. Don't ship a segment-specific story.                                                                                         |
+| One segment lifts much more than the rest, with a tight CI and a clear mechanism                    | **Real heterogeneity.** The change is concentrated in that segment. Consider shipping only to that segment, or revising the hypothesis.                    |
+| Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. |
+| Two opposite-direction effects in different segments that roughly cancel overall                    | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses.    |
+
+When you spot Simpson's paradox, route the user to [health-check-interpretation.md](health-check-interpretation.md) §SRM — it's usually the cause, not a real reversal.
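+
+A small made-up example of the pattern, where treatment wins inside each segment yet loses overall because the variant mix differs by segment:
+
+```
+Segment   Variant     Users   Conversions   Rate
+iOS       control       100        10       10%
+iOS       treatment     900       108       12%
+Android   control       900       270       30%
+Android   treatment     100        32       32%
+
+Overall   control     1,000       280       28%
+Overall   treatment   1,000       140       14%
+```
+
+The overall split is an even 1,000 / 1,000, so a top-level SRM check would pass, while the per-segment splits (100 vs 900) are badly imbalanced. That imbalance is what the per-segment SRM check is meant to catch.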
+
+---
+
+## What a "ship only to segment X" recommendation requires
+
+Don't recommend a segment-scoped ship unless **all** of these hold:
+
+1. The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it).
+2. The segment's per-variant sample clears the ~350 floor by a comfortable margin.
+3. The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment.
+4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product.
+5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply.
+
+Otherwise, the segment-only ship is a post-hoc story dressed up as a decision. Recommend confirming with a follow-up experiment scoped to that segment.
+
+---
+
+## When a segment loses but overall wins
+
+This is the everyday case of mixed effects.
+
+- If the losing segment is small and its absolute hit is acceptable, ship to all — but call out the loser in the rationale.
+- If the losing segment is large or has a guardrail regression, recommend iterate, not ship.
+- If the losing segment is a regulated / strategic cohort (paying tier, top customers, EU), default to iterate: guardrails must hold on that cohort, not just overall.
+
+---
+
+## What NOT to do
+
+- ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition.
+- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment `Run-Query` fallback — they're not corrected unless the platform did it.
+- ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal.
+- ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism.
+- ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence).
+
+---
+
+## Output shape
+
+1. **One-sentence segment-level summary** — homogeneous, heterogeneous, or Simpson's-suspicious.
+2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered).
+3. **What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's."
+4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating).
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-of-interest-selection.md
new file mode 100644
index 0000000..ea9f22b
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-of-interest-selection.md
@@ -0,0 +1,116 @@
+# Segment-of-Interest Selection
+
+Open this when the user wants to break results down by user segments: _"slice this by platform"_, _"which segments should I look at?"_, _"are new users responding differently?"_. The goal is to pick 3–5 segments that are **likely to reveal a real effect difference**, rather than slicing every available dimension and ending up p-hacking.
+
+The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them.
+
+---
+
+## Why this matters: the fishing-expedition problem
+
+If you slice an experiment by every available property (10 platforms × 20 countries × 5 plan tiers × …), you will find "significant" segment-level effects by chance alone. The family-wise false positive rate explodes the same way it does for too many primary metrics — except there's usually no platform-level correction across segments. **Pre-committing to a small set of segments, ordered by hypothesis-driven probability, is the discipline that makes segment analysis credible.**
+
+Aim for 3–5 segments, max. If the user wants more, ask which ones are connected to the hypothesis and which are exploration. Mark the exploration set as "hypothesis-generating, not decisional."
+
+---
+
+## The decision tree for picking segments
+
+Walk through these in order. The first match is the most defensible pick.
+
+### 1. Segments the hypothesis explicitly names
+
+If the experiment's `hypothesis` (or `description`) text mentions "new users", "mobile", "Pro tier", "EU customers" — those segments are pre-committed by the experiment design. Always include them.
+
+Look at:
+
+- `experiment.hypothesis`
+- `experiment.description`
+- The setup-side conversation, if present
+
+These are not exploratory; they're the variables the team committed to test.
+
+### 2. Segments where the mechanism is expected to matter
+
+The hypothesis names _what_ the change is and (ideally) _why_ it should work. The "why" tells you which user attributes plausibly moderate the effect:
+
+| Hypothesis mechanism                              | Segments likely to moderate the effect             |
+| ------------------------------------------------- | -------------------------------------------------- |
+| "Reduces first-time friction in onboarding"       | New vs returning; signup source; locale            |
+| "Improves discoverability of feature X"           | Users who previously used X vs not; tenure         |
+| "Speeds up a slow flow"                           | Platform (mobile slower than web); connection type |
+| "Lowers payment friction"                         | Plan tier; payment-method type; geography          |
+| "Replaces a confusing UI element"                 | New vs returning (returning users habituated)      |
+| "Surfaces a feature only relevant to power users" | Engagement-tier cohorts; tenure                    |
+| "Localized copy / pricing change"                 | Country / language                                 |
+
+If you can't articulate _why_ a segment should respond differently, it's not a hypothesis-driven slice. Demote it.
+
+### 3. Segments where the **denominator** plausibly differs
+
+Some properties don't change _behavior_ but change _who gets exposed_. Slicing on these helps catch changed-denominator artifacts before they're called a win.
+
+- Triggered vs untriggered cohorts (if the treatment only fires on certain pages).
+- Platform / app version (the treatment may only ship on a subset of clients).
+- Device class (mobile vs desktop) when the change is platform-specific.
+
+A 1000% lift in `Checkout Screen Viewed` overall usually disappears once you condition on "users who reached the checkout funnel" — that disappearance is the finding.
+
+### 4. Segments where SRM or baseline shift is suspected
+
+If overall SRM is borderline (or failing in one variant only), per-segment SRM can localize the bucketing bug to a specific platform / country / cohort. Examples:
+
+- iOS vs Android (often the SDK bucketing layer differs).
+- Bot-suspicious countries (`bot_traffic` cause from health-check).
+- A specific app version range that shipped a flag-evaluation change.
+
+This is diagnostic segmentation, not interpretation segmentation. Use it when Step 1 of the 5-step decision tree in `SKILL.md` (not the segment-picking tree in this file) has already flagged trouble.
+
+### 5. Segments the platform de facto requires
+
+Some user dimensions are so foundational that any results report should mention them once:
+
+- **Platform** — web vs iOS vs Android.
+- **New vs returning** — defined as first session within the experiment window vs before.
+- **Geo region** — EU vs US vs APAC, when results meaningfully differ by regulatory or payment context.
+
+Don't include all three blindly — pick the one(s) most likely to vary given the change.
+
+---
+
+## Sanity checks before committing to a slice
+
+For each segment you want to break down on:
+
+1. **Does each segment value have ~350+ exposed users per variant?** Below that floor, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment.
+2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis.
+3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. "Pro tier" boundaries changed mid-test) invalidate the comparison.
+4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes; slicing on them introduces selection bias rather than stratification.
+
+---
+
+## How many slices to commit to
+
+| Situation                                                         | Number of slices                |
+| ----------------------------------------------------------------- | ------------------------------- |
+| Hypothesis-driven, well-powered, decisional                       | 3–5 segments, named upfront     |
+| Exploratory ("anything weird?"), flagged as hypothesis-generating | Up to ~10, with explicit caveat |
+| Diagnostic (chasing a failing SRM or strange overall result)      | Whatever helps localize the bug |
+
+If the user wants to "just look at everything", push back: pick the top 3–5 with reasoning, then offer a separate exploratory pass that won't be used for the ship decision.
+
+---
+
+## The pre-commit ritual
+
+Before running the breakdowns, tell the user something like:
+
+> _"Based on the hypothesis (`<one-line summary>`), I'd slice by `<segment A>` and `<segment B>` because `<why each matters>`. I'm intentionally not slicing `<X, Y, Z>` because they don't connect to the proposed mechanism — looking at every dimension makes false positives almost guaranteed. We can do an exploratory pass after, separately from the ship decision. Sound right?"_
+
+Pre-commitment is what separates "segmentation analysis" from "fishing."
+
+---
+
+## Then read the results
+
+Once the segment breakdown is in hand, switch to [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there.
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md
new file mode 100644
index 0000000..88640f4
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md
@@ -0,0 +1,109 @@
+# Session-Replay Analysis Guidance
+
+Open this when the user wants to use session replays to explain a quantitative experiment result — _"why is conversion down in treatment?"_, _"what are users actually doing in the treatment?"_, _"can replays explain the regression?"_. The goal is to turn a number into a behavior story.
+
+> **Tool boundary.** This skill provides the _interpretation_ guidance for replay analysis. The actual replay-fetching tool — pulling replay IDs for control vs treatment cohorts — lives on the platform side (a separate fetch tool exposed alongside `Get-Experiment`, when available). If the fetch tool isn't yet available, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss.
+
+---
+
+## When replays help, when they don't
+
+| Question                                                                                 | Replays help?                                                                         |
+| ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
+| "Why is conversion lower in treatment?"                                                  | Yes — behavior diff is observable.                                                    |
+| "Why is `Checkout Screen Viewed` 10× higher in treatment?" (changed-denominator suspect) | Yes — replays show whether users are _bouncing_ or _converting_ after they get there. |
+| "Why is `time_on_page` higher in treatment?"                                             | Yes — distinguishes engaged reading vs confused dwell.                                |
+| "Is the treatment shipping a regression on iOS only?"                                    | Sometimes — better answered first by segment breakdown.                               |
+| "Why is SRM failing?"                                                                    | No — replays don't show bucketing. Go to health checks.                               |
+| "What's the lift?"                                                                       | No — replays are qualitative; they explain _why_, not what.                           |
+| "Why hasn't this hit statsig yet?"                                                       | No — that's a sample/power question, not a behavior question.                         |
+
+A useful heuristic: replays answer _behavioral_ questions. If the question isn't behavioral, replays will burn time without adding signal.
+
+---
+
+## Cohort selection: which replays to compare
+
+You're looking for **paired contrast**, not a random sample. Pick the cohort that maximizes signal for the specific question.
+
+| Question                                                             | Cohort A (replays to pull)                                 | Cohort B (replays to pull)                                  |
+| -------------------------------------------------------------------- | ---------------------------------------------------------- | ----------------------------------------------------------- |
+| Why is primary metric down in treatment?                             | Treatment users who **failed** the primary action          | Control users who **succeeded** at the primary action       |
+| Why is a guardrail regression appearing?                             | Treatment users who **triggered** the guardrail negatively | Control users who did NOT trigger it                        |
+| Why does treatment have a huge lift in `Screen Viewed` (denominator shift)? | Treatment users who reached the screen                     | Same users, looking at whether they completed the next step |
+| Why is engagement higher / lower in a specific segment?              | Treatment users in that segment                            | Control users in the same segment                           |
+| What does the new UI look like in practice?                          | Any treatment users who saw the change                     | Any control users to confirm the baseline UI                |
+
+**Aim for ~5 replays per cohort.** Fewer and you're anecdote-shopping; many more and you'll just confirm what the first 5 already showed. If the first 5 are inconclusive or contradictory, pull 5 more before changing tactics.
+
+Filter by recency — replays from the most recent days of the experiment best reflect steady-state behavior (avoid novelty / primacy noise).
+
+---
+
+## What to actually watch for
+
+Go in with a hypothesis from the quantitative result. Don't watch replays blank-eyed; you'll see "users using the app" and learn nothing.
+
+### Friction / failure patterns
+
+- **Hesitation** — long pause before clicking a key element (often signals confusion).
+- **Misclicks** — clicking non-interactive elements, or rage-clicking a button that didn't work.
+- **Form abandonment** — typing into a field, then leaving without submitting.
+- **Back-button bounce** — landing on the page, then immediately backing out.
+- **Scroll-and-leave** — scrolling without engaging, then exiting.
+
+If treatment has more of these than control, you have a behavior explanation for a primary loss or guardrail regression.
+
+### Layout / discoverability issues
+
+- **CTA below the fold** — users never scrolling to where the new button is.
+- **Element overlap on mobile** — the treatment looks fine in desktop testing but breaks on small screens.
+- **Hidden state** — a tooltip / modal that fires once and is then gone, so the user never sees the key affordance.
+
+These usually explain segment heterogeneity (loss concentrated in mobile, or in a specific viewport size).
+
+### Changed-denominator behavior
+
+If you're investigating a Twyman's-Law-sized lift, look for:
+
+- **Users landing on the new screen and immediately leaving** — explains the inflated `Viewed` event without explaining real conversion.
+- **Users completing the rest of the funnel at a much lower rate per-arrival** — explains why the headline metric grew but downstream metrics didn't follow.
+
+If treatment users _arrive_ at a screen more often but _complete_ at a lower per-arrival rate, the "lift" is a denominator artifact and the per-arrival completion rate is the real story.
+
+### Variant-specific UI issues
+
+- **Treatment showed the wrong copy / wrong asset** — surprisingly common; treatment shipped, but to a subset of routes only.
+- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to [health-check-interpretation.md](health-check-interpretation.md).
+- **Treatment fired twice / persisted state across sessions** — implementation regression.
+
+---
+
+## How to frame the findings
+
+Replay analysis is qualitative. Be honest about that.
+
+- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in `live_metrics`."_
+- ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict.
+
+Tie observations back to specific quantitative results from `Get-Experiment`. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior.
+
+---
+
+## What NOT to do
+
+- ❌ Use replays to override a clear quantitative verdict. If primaries say "ship" and replays look ugly, the ugliness might be edge cases — confirm with segment analysis first.
+- ❌ Cherry-pick a single dramatic replay. n=1 is anecdote.
+- ❌ Replace segment analysis with replays. Replays explain _behavior_; segments explain _who_. Different questions.
+- ❌ Pull replays from broad cohorts ("all treatment users") — the contrast pair is what reveals signal.
+- ❌ Spend more time on replays than on the headline interpretation. The decision tree comes first; replays are the explanation step after it.
+
+---
+
+## Output shape
+
+1. **The quantitative result the replays are explaining** — link back to the specific metric and verdict.
+2. **Cohorts watched** — what filters were applied to A and B, how many replays in each.
+3. **Patterns observed**, with counts (e.g. "4 of 5 treatment replays showed X; 0 of 5 control replays did").
+4. **The explanation hypothesis** — careful to frame as hypothesis ("consistent with"), not as proof.
+5. **Recommended next action** — usually one of: ship anyway (regression edge-case), iterate (fix the friction), kill (treatment is materially worse), or run a follow-up A/B with the fix.
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md
new file mode 100644
index 0000000..fdad2cd
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md
@@ -0,0 +1,115 @@
+# Why Hasn't This Reached Statistical Significance Yet?
+
+Open this when the user asks why a primary metric is still inconclusive — _"why isn't this stat-sig yet?"_, _"should I wait or stop?"_, _"is this just underpowered?"_. The goal is to help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null**, _without_ recomputing the platform's verdicts.
+
+The actual stop / extend math (sample size, power, MDE) is owned by the **setup-side skill** — point the user there for the formulas. This skill explains _which_ lever to pull, not how to recompute one.
+
+---
+
+## First, rule out a broken result
+
+Inconclusive can mean two very different things:
+
+1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about.
+2. **The result isn't trustworthy at all** (SRM failing, broken data, a Frequentist test that was peeked at, etc.), in which case "inconclusive" is the wrong frame entirely.
+
+Before answering "why no statsig?", run the trustworthiness gate (Step 1 of the Decision Tree). If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power.
+
+Also check:
+
+- `lift is None` on the primary → no measurement, not "no effect."
+- The primary is in `metrics[]` but missing from `live_metrics` and `results_cache.metrics` → "no measurement."
+- `live_results_errors` is non-null → results are stale or partial; resolve before drawing power conclusions.
+
+---
+
+## The five real reasons an experiment hasn't hit statsig
+
+Walk through these in order. The first one that explains the picture is usually right.
+
+### 1. Not enough sample yet (not enough exposures)
+
+**What to look at**: `live_exposures` per variant vs `settings.sampleSize`; or elapsed days (`end_date - start_date`) vs `settings.endAfterDays`; plus `settings.testingModel`.
+
+- **Sequential** + target not reached → genuinely too early. Recommend **WAIT**.
+- **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe.
+- Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5.
+
+If exposures are falling short of plan because traffic dropped: surface that. `Run-Query` on the exposure event with a date breakdown shows whether something changed mid-experiment.
+
+### 2. Observed effect is smaller than the MDE
+
+**What to look at**: the lift on the primary in `live_metrics[primary][treatment].lift`, plus the MDE the user planned for (typically captured in the experiment's `description` or recovered via the setup-side skill's power math).
+
+- Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1.
+- Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options:
+  - **Accept the null** — at this size, the change isn't moving the metric. Document and move on.
+  - **Resize and rerun** — if a smaller effect would still be ship-worthy, re-run with a larger sample (lower MDE).
+- Observed lift much **larger** than planned MDE but still not significant → unusual; likely high variance (see reason 3) or insufficient exposures (reason 1).
+
+### 3. Variance is too high (metric is too noisy)
+
+**What to look at**: distribution type of the metric, plus `settings.cuped.enabled` and `settings.winsorization.enabled`.
+
+- **Gaussian** metric (revenue, time-on-page) with no winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization (default percentile 95) on the next run.
+- **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume.
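+- **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the absolute size of any given relative lift. A relative lift on a very low baseline rate needs much more sample than the same relative lift on a mid-range rate, and a baseline near 100% leaves little headroom for improvement.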
+- **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%.
+- **CUPED enabled on a new-user-only cohort** → CUPED has no effect (no pre-exposure data exists). Not a misconfiguration to "fix," but variance reduction simply didn't happen.
+
+### 4. Traffic split is starving the variant
+
+**What to look at**: `settings.srm.targetAllocations` and `live_exposures` per variant.
+
+- Even split (50/50) → a balanced split is already optimal for power, so allocation is usually not the issue.
+- Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later.
+- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs its own ~350+ sample for the per-comparison stats to be reliable. Adding arms costs power per-comparison.
+
+Never change traffic allocation partway through a Frequentist test: it invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment.
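+
+To make the cost of a skewed split concrete, the sketch below compares a given two-arm split against a 50/50 split with the same total traffic. It assumes equal per-user variance in both arms and a plain difference-in-means comparison; `relative_efficiency` is a hypothetical helper for illustration, not a platform function or a `Get-Experiment` field.
+
+```python
+def relative_efficiency(treatment_share: float) -> float:
+    """Effective sample size of a two-arm split relative to a 50/50 split.
+
+    With total traffic N and treatment share p, the variance of the
+    difference in means is proportional to 1/(p*N) + 1/((1-p)*N),
+    i.e. 1 / (N * p * (1 - p)). Relative to p = 0.5 this gives
+    4 * p * (1 - p). Assumes equal per-user variance in both arms.
+    """
+    if not 0.0 < treatment_share < 1.0:
+        raise ValueError("treatment_share must be strictly between 0 and 1")
+    return 4.0 * treatment_share * (1.0 - treatment_share)
+
+
+for share in (0.5, 0.2, 0.1):
+    eff = relative_efficiency(share)
+    print(f"{share:.0%} treatment: {eff:.2f}x efficiency, "
+          f"~{1 / eff:.1f}x the total traffic of a 50/50 split")
+```
+
+Under these assumptions a 90/10 split needs roughly 2.8× the total traffic of a 50/50 split to reach the same precision, which is why the smaller variant reaches significance much later.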
+
+### 5. Exposure config is filtering more users than the user expects
+
+**What to look at**: the exposure tracking method (`$experiment_started` event volume), any audience filters on the backing feature flag, and `settings.excludeQA`.
+
+- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." `Get-Feature-Flag` reveals the rollout rules; `Run-Query` on `$experiment_started` confirms how many users actually got exposed.
+- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with `Run-Query`.
+- `settings.excludeQA` was off and you suspect internal traffic is dominating one variant → enable it on the next run (results will be cleaner, but the sample will also be smaller).
+
+**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md).
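+
+A minimal sketch of the dilution arithmetic, under two simplifying assumptions that are not stated by the platform: users who never saw the change show zero treatment effect, and triggered and untriggered users share the same baseline rate. `diluted_lift` is a hypothetical helper name.
+
+```python
+def diluted_lift(triggered_lift: float, trigger_fraction: float) -> float:
+    """Population-level relative lift when only a fraction of users is triggered.
+
+    With baseline b in both groups, the treatment mean is
+    f * b * (1 + L) + (1 - f) * b = b * (1 + f * L),
+    so the population-level relative lift is f * L.
+    """
+    if not 0.0 <= trigger_fraction <= 1.0:
+        raise ValueError("trigger_fraction must be between 0 and 1")
+    return triggered_lift * trigger_fraction
+
+
+# A 10% lift among the 25% of exposed users who reached the changed screen
+print(f"{diluted_lift(0.10, 0.25):.3f}")
+```
+
+In this example a 10% lift among triggered users shows up as a 2.5% lift across everyone exposed, which can sit well below the planned MDE even though the change works where it is seen.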
+
+---
+
+## Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL?
+
+Once you know which reason fits, the recommendation almost picks itself.
+
+| Reason                                 | Recommendation                                                                                               |
+| -------------------------------------- | ------------------------------------------------------------------------------------------------------------ |
+| Not enough sample yet, still ACTIVE    | **WAIT.** Show projected end date based on observed traffic.                                                 |
+| Not enough sample yet, concluded early | **EXTEND** (Frequentist: relaunch with longer planned duration; Sequential: resume if possible).             |
+| Effect << MDE                          | **ACCEPT NULL** if the planned MDE is the smallest ship-worthy effect; otherwise **BOOST POWER** and re-run. |
+| Variance too high                      | **BOOST POWER**: enable CUPED, enable Winsorization, switch to a less noisy metric proxy.                    |
+| Variant starved by traffic split       | **EXTEND** (if remaining time is enough) or restart with rebalanced split.                                   |
+| Exposure config is filtering           | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample.               |
+| Experiment finished, well-powered      | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters.       |
+
+When recommending EXTEND on an active experiment, the call is `Update-Experiment` with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the setup-side skill for the power math.
+
+---
+
+## What NOT to suggest
+
+- ❌ **Stop early on a favorable peek** in a Frequentist test — that's exactly the false-positive inflation problem.
+- ❌ **Switch testing model mid-experiment** — restart, don't morph.
+- ❌ **Add more primary metrics** to "fish" for a win — multiplies the family-wise FPR. If a single primary is inconclusive, more primaries make the picture worse, not better.
+- ❌ **Re-run an identical hypothesis on the same audience right after concluding "no effect"** — without a power change, you'll get the same answer.
+- ❌ **Claim "no effect"** from an underpowered inconclusive result — the right framing is "the experiment wasn't sized to detect the effect we observed."
+
+---
+
+## Output shape
+
+1. **The reason** (one of the five above), in one sentence.
+2. **The evidence from `Get-Experiment`** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.).
+3. **Recommendation** from the table above, with the specific `Update-Experiment` call or follow-up action.
+4. **What NOT to do**, briefly — the wrong-way temptation specific to this experiment.
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md
new file mode 100644
index 0000000..4e344d3
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md
@@ -0,0 +1,236 @@
+---
+name: experiment-results
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts that `Get-Experiment` returns — never recomputes thresholds.
+license: Apache-2.0
+---
+
+# Experiment Results Interpretation
+
+You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. **Read the Decision Tree first** and use it as the spine of every interpretation. Drop into the deeper references only when the situation calls for it.
+
+## Requirements
+
+- Access to Mixpanel via the MCP server (specifically the `Get-Experiment` tool — and, for ship/kill decisions, `Update-Experiment`).
+- This skill reads the verdicts that `Get-Experiment` already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values.
+
+## When to use this skill
+
+Trigger when the user asks anything about reading an experiment's results or its health. Common phrasings:
+
+- "What do these results mean?" / "Should we ship this?"
+- "Is this experiment trustworthy?" / "Why is SRM failing?"
+- "Why hasn't this hit statistical significance yet?"
+- "Break this down by `<segment>`" / "What segments should I look at?"
+- "What does this Retro A/A failure mean?"
+- "Can you compare the session replays for control vs treatment?"
+
+Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the setup-side skill or tool.
+
+---
+
+## How to read `Get-Experiment` output
+
+Always call `Get-Experiment` with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.**
+
+| Concept                      | Live (preferred)                  | Cached fallback                             |
+| ---------------------------- | --------------------------------- | ------------------------------------------- |
+| Per-variant exposure counts  | `live_exposures`                  | `exposures_cache` (strip `$`-prefixed keys) |
+| SRM check                    | `live_srm_analysis`               | `exposures_cache.$srm_analysis`             |
+| Per-metric per-variant stats | `live_metrics[metricId][variant]` | `results_cache.metrics[metricId][variant]`  |
+| Bucketed summary             | recompute from `live_metrics`     | `results_cache.summary`                     |
+| When was this computed?      | "now"                             | `exposures_cache.$last_computed`            |
+
+If `live_results_errors` is non-null, the live path failed. Use the cache, caveat that data is stale, and surface the error to the user — the underlying failure may need fixing before any decision.
+
+If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect."
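+
+The live-then-cache rule above can be sketched as follows. This assumes the `Get-Experiment` response has been parsed into a plain Python dict whose keys mirror the field names in the table; `pick_metric_stats` and `_dig` are hypothetical helpers, and the exact response shape should be checked against the field map.
+
+```python
+from typing import Any, Optional
+
+
+def _dig(obj: Any, *keys: str) -> Optional[Any]:
+    """Walk nested dicts, returning None as soon as a level is missing."""
+    for key in keys:
+        if not isinstance(obj, dict):
+            return None
+        obj = obj.get(key)
+    return obj
+
+
+def pick_metric_stats(response: dict, metric_id: str, variant: str) -> dict:
+    """Prefer live stats, fall back to cache, and always carry errors along."""
+    live = _dig(response, "live_metrics", metric_id, variant)
+    cached = _dig(response, "results_cache", "metrics", metric_id, variant)
+    errors = response.get("live_results_errors")
+
+    if live is not None:
+        return {"source": "live", "stats": live, "errors": errors}
+    if cached is not None:
+        # Stale data: the caller should caveat it and surface `errors`.
+        return {"source": "cache", "stats": cached, "errors": errors}
+    # Neither path has a result: report "no result was computed",
+    # never "no effect".
+    return {"source": "none", "stats": None, "errors": errors}
+```
+
+The `source` key keeps the provenance explicit, so a cached or missing result cannot be silently presented as a fresh one.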
+
+See [references/get-experiment-fields.md](references/get-experiment-fields.md) for the full field map and which fields drive each step below.
+
+---
+
+## The Decision Tree
+
+This is the spine of every interpretation. Run the steps **in order**. **Stop at the first failure** — do not proceed to step N+1 if step N flags a problem.
+
+```
+┌─ Step 1: TRUSTWORTHINESS GATE ───────────────┐
+│   SRM ok? → exposures sufficient? →          │
+│   Retro A/A clean? → minimum duration met? → │
+│   no misconfig?                              │
+│        │                                     │
+│      fail → STOP. See references/            │
+│             health-check-interpretation.md   │
+└──────────────┬───────────────────────────────┘
+               ↓ pass
+┌─ Step 2: STATISTICAL SIGNIFICANCE ───────────┐
+│   For each non-control variant × primary,    │
+│   apply the polarity recipe (sign-of-lift +  │
+│   metric.direction). Significant + correct   │
+│   polarity = "win"; significant + wrong      │
+│   polarity = "loss".                         │
+│        │                                     │
+│   nothing significant on primaries →         │
+│   see references/why-no-statsig.md           │
+└──────────────┬───────────────────────────────┘
+               ↓ at least one primary win
+┌─ Step 3: GUARDRAIL CHECK ────────────────────┐
+│   Any guardrail significant in the wrong     │
+│   polarity? → regression → ITERATE not ship  │
+└──────────────┬───────────────────────────────┘
+               ↓ guardrails clean
+┌─ Step 4: PRACTICAL SIGNIFICANCE ─────────────┐
+│   Convert the lift on the primary into       │
+│   absolute terms. Is it big enough to        │
+│   matter to the business?                    │
+│   Statistically significant ≠ ships.         │
+└──────────────┬───────────────────────────────┘
+               ↓ meaningful magnitude
+┌─ Step 5: VERDICT ────────────────────────────┐
+│   Trust ✓ + primary win + guardrails ✓ +     │
+│   meaningful magnitude → SHIP                │
+│   Trust ✓ + primary win + guardrail regress  │
+│     → ITERATE                                │
+│   Trust ✓ + primary neutral after target     │
+│     → KILL or ITERATE                        │
+│   Trust ✗                                    │
+│     → DO NOT DECIDE; report failures         │
+│   Hasn't reached target sample/duration      │
+│     → WAIT (or extend, or restart with more  │
+│       power — see why-no-statsig.md)         │
+└──────────────────────────────────────────────┘
+```
+
+### Step 1 — Trustworthiness gate (consume the verdicts)
+
+Read these fields. Treat the platform's verdict as authoritative — do not reapply thresholds yourself.
+
+| Check                    | Field to read                                                                                          | What "fail" looks like                                                                                                                                         |
+| ------------------------ | ------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| SRM                      | `live_srm_analysis` (or `exposures_cache.$srm_analysis`)                                               | Platform flags as failing — do not compute the chi-square yourself.                                                                                            |
+| Sufficient exposures     | `live_exposures` per variant                                                                           | Platform-flagged "insufficient." If unflagged but per-variant counts look suspicious, route the user to the health-check reference; do not invent a threshold. |
+| Retro A/A (pre-exp bias) | `settings.preExperimentBias` enabled, then the analysis                                                | Platform flags a significant pre-period difference.                                                                                                            |
+| Minimum elapsed time     | `end_date - start_date`                                                                                | Less than ~3 days regardless of sample size — interpretation is unreliable.                                                                                    |
+| Ran for planned duration | `start_date`, `end_date`, `settings.endAfterDays`/`sampleSize`/`endCondition`, `settings.testingModel` | Frequentist: ended before reaching configured target = peeking. Sequential: early stop on significance is allowed.                                             |
+| Misconfiguration         | See [references/get-experiment-fields.md](references/get-experiment-fields.md) §Misconfig              | Any flagged misconfig invalidates analysis.                                                                                                                    |
+
+If any of these fail, **stop**. Tell the user explicitly that results are not trustworthy. Open [references/health-check-interpretation.md](references/health-check-interpretation.md) for the per-failure root-cause checklists, recommended actions, and the Kohavi framing ("SRM is the #1 trustworthiness check; Twyman's Law: any unusually clean result is more likely a bug than a discovery").
+
+### Step 2 — Statistical significance with polarity
+
+**Critical**: `summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by whether the lift is good for the business. You MUST apply the polarity recipe using each metric's `direction` before declaring a winner.
+
+#### Polarity recipe
+
+`metric.direction` is `"up"` or `"down"` (defaults to `"up"` if unset on the source metric).
+
+- `lift is None` or `lift == 0` → **neutral**.
+- `direction == "up"` → **positive** if `lift > 0`, else **negative**.
+- `direction == "down"` → **positive** if `lift < 0`, else **negative**.
+
+A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. Never trust the bucket name as the business verdict.
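+
+The recipe above can be written as a small function. This is an illustrative sketch: `classify_polarity` is a hypothetical name, and it only translates sign-of-lift into business polarity. Whether the lift is significant still comes from the platform's summary bucket.
+
+```python
+from typing import Optional
+
+
+def classify_polarity(lift: Optional[float], direction: Optional[str] = None) -> str:
+    """Translate sign-of-lift into business polarity using metric.direction."""
+    direction = direction or "up"  # direction defaults to "up" when unset
+    if direction not in ("up", "down"):
+        raise ValueError(f"unexpected direction: {direction!r}")
+    if lift is None or lift == 0:
+        return "neutral"
+    if direction == "up":
+        return "positive" if lift > 0 else "negative"
+    return "positive" if lift < 0 else "negative"
+
+
+# An "errors" metric (direction "down") with a +5% lift is a regression.
+print(classify_polarity(0.05, "down"))
+# The same metric with a -5% lift is a win.
+print(classify_polarity(-0.05, "down"))
+```
+
+A `None` lift comes back as neutral here only because the recipe says so; when reading the summary, a missing lift should still be surfaced as a failed calculation rather than reported as "no effect."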
+
+#### How to read the summary
+
+1. **Filter out the control row.** Use `settings.controlKey` (typically `"control"`; may be empty). Control-vs-control always has lift 0 and inflates the "no effect" count. If `controlKey` is empty, identify control by: (a) the variant literally named `"control"`, (b) the variant whose lift is uniformly 0 across all metrics, or (c) ask the user.
+2. For each non-control variant, look up the metric in `summary.positive` / `summary.negative` / `summary.no`. **Trust the bucket name as the significance signal** — the `significance` field on each item may be `null` even when the bucket is meaningful.
+3. Apply the polarity recipe using `metric.direction` to translate sign-of-lift into win/loss.
+4. If `lift is None` in a summary item, **the calculation failed** for that variant — surface it. Do not interpret as "no effect."
+
+The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is set to `"bonferroni"` or `"benjamini-hochberg"` (across primaries × non-control variants). **Don't re-correct.**
+
+Turning the per-metric numbers into a plain-language verdict (lift + CI + p-value → "small win," "large regression," "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
+
+If nothing on the primaries is significant and the user is asking "why hasn't this hit statsig?", route to [references/why-no-statsig.md](references/why-no-statsig.md).
+
+### Step 3 — Guardrail check
+
+Apply the polarity recipe to every guardrail metric (`metric.type == "guardrail"`).
+
+- A small primary win + a clear guardrail regression → usually **iterate, do not ship**.
+- "Not significant" on a guardrail does NOT mean "no regression." It means the experiment couldn't _detect_ one at the chosen confidence. If the guardrail is critical (latency, error rate, retention), flag whether it was powered to detect a meaningful regression.
+- Polarity matters here too: a guardrail named "errors" with `direction: "down"` and lift `+5%` (significant) is a regression even though it lands in `summary.positive`.
+
+### Step 4 — Practical significance
+
+Statistical significance ≠ business impact. For every primary metric that won:
+
+1. Read the **baseline value** from the control variant: `live_metrics[metricId][controlKey].value`.
+2. Read the **lift** from the winning variant's row.
+3. Compute absolute lift: `baseline_value × lift`.
+4. Project to population per period: ask the user for traffic estimates if not in context.
+
+A 5% lift on a 20% baseline metric serving 1M users/week is enormous. A 5% lift on a 0.1% baseline metric serving 1k users/week is noise. Always ground the user in absolute terms before declaring a win meaningful.
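+
+The two cases above work out as follows. This is a sketch that assumes the metric is a per-user conversion rate and that every user in the period would receive the treatment; `absolute_impact` is a hypothetical helper, and the traffic figures are the ones from the paragraph above.
+
+```python
+def absolute_impact(baseline_rate: float, relative_lift: float,
+                    users_per_period: float) -> float:
+    """Extra conversions per period implied by a relative lift."""
+    absolute_lift = baseline_rate * relative_lift  # Step 3: baseline x lift
+    return absolute_lift * users_per_period        # Step 4: project to population
+
+
+big = absolute_impact(0.20, 0.05, 1_000_000)
+small = absolute_impact(0.001, 0.05, 1_000)
+print(f"20% baseline, 1M users/week:  {big:,.2f} extra conversions/week")
+print(f"0.1% baseline, 1k users/week: {small:,.2f} extra conversions/week")
+```
+
+The same 5% relative lift is about 10,000 extra conversions a week in the first case and about 0.05 in the second, which is the gap between a meaningful win and noise.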
+
+**Twyman's Law check**: before celebrating any lift > ~30%, ask: did the treatment change who is _exposed_ to this metric, not just how they behave? See the changed-denominator notes in [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
+
+If `value` or `sampleSize` is `null` (common when live computation timed out), call `Run-Query` on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total.
+
+### Step 5 — Verdict
+
+| Situation                                                              | Recommendation                                                                                                                                               |
+| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** `Update-Experiment(action="decide", success=true, variant=<winner>, message=<rationale>)`                                                          |
+| Trust ✓, primary polarity positive, guardrail polarity negative        | **ITERATE.** Investigate the regression; do not auto-ship.                                                                                                   |
+| Trust ✓, primary polarity neutral after target sample reached          | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md).                                  |
+| Trust ✓, target sample/duration not yet reached                        | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)).                                         |
+| Trust ✗                                                                | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). |
+
+For **multi-variant tests**, pivot the summary by variant and evaluate each treatment independently against control. The winner is the variant with the most polarity-corrected primary wins, zero guardrail regressions, and the largest practical impact. If multiple qualify, prefer the simpler / lower-risk variant. If none qualify, recommend kill or iterate.
+
+`message` is required on every `decide` call — include the rationale, the metrics evaluated, and any tradeoffs accepted.
+
+Special variant constants when `success=true`:
+
+- `__no_variant_shipped__` — ship the change without picking a variant
+- `__defer_variant_decision__` — defer (status becomes `SUCCESS_DEFERRED` in UI)
+
+For a kill, pass `success=false`.
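+
+As a sketch of how a verdict maps onto the `decide` call, the helper below builds the argument set described above. `build_decide_args` is a hypothetical name, the dict form is an assumption about how the MCP client passes arguments, and omitting `variant` on a kill is also an assumption, since the text above only specifies `success=false`.
+
+```python
+from typing import Optional
+
+
+def build_decide_args(verdict: str, message: str,
+                      variant: Optional[str] = None) -> dict:
+    """Build arguments for Update-Experiment(action="decide", ...)."""
+    if not message or not message.strip():
+        raise ValueError("message is required on every decide call")
+    if verdict == "SHIP":
+        if not variant:
+            raise ValueError("SHIP needs a variant or a special variant constant")
+        return {"action": "decide", "success": True,
+                "variant": variant, "message": message}
+    if verdict == "KILL":
+        # Assumption: no variant is passed on a kill.
+        return {"action": "decide", "success": False, "message": message}
+    # ITERATE, WAIT, and DO NOT DECIDE do not conclude the experiment.
+    raise ValueError(f"{verdict} does not map to a decide call")
+
+
+print(build_decide_args("SHIP", "Primary up 4%, guardrails clean", "treatment"))
+```
+
+Raising on the non-concluding verdicts keeps a WAIT or DO NOT DECIDE outcome from being turned into a decision by accident.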
+
+---
+
+## Going deeper
+
+Once the spine is clear, the user often asks one of these follow-ups. Open the relevant reference on demand:
+
+| User asks about…                                                                | Open                                                                                             |
+| ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
+| SRM failing, Retro A/A failing, exposures insufficient, or any Step 1 fail      | [references/health-check-interpretation.md](references/health-check-interpretation.md)           |
+| "Translate this lift / CI / p-value into English"                               | [references/per-metric-interpretation.md](references/per-metric-interpretation.md)               |
+| "Why hasn't this hit statsig yet? Should we wait or stop?"                      | [references/why-no-statsig.md](references/why-no-statsig.md)                                     |
+| "Which segments should I break this down on?"                                   | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md)       |
+| "What does this segment-by-segment result mean?" (when platform support exists) | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) |
+| "Can session replays help explain this result?"                                 | [references/session-replay-analysis.md](references/session-replay-analysis.md)                   |
+| "Which `Get-Experiment` field has X?"                                           | [references/get-experiment-fields.md](references/get-experiment-fields.md)                       |
+
+---
+
+## Output
+
+Default to this shape unless the user asks for something else:
+
+1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`.
+2. **Why**, walking through the decision tree steps that mattered (skip the steps that were clearly fine).
+3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, with the polarity-corrected reading of each. Include the absolute-impact translation for any win.
+4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, etc.
+5. **Suggested next action** — the `Update-Experiment` call to make, or the deeper investigation to run.
+
+If `Get-Experiment` is unavailable or returns errors, say so — do not invent a verdict.
+
+---
+
+## Common pitfalls (cheat sheet)
+
+- ⛔ **Skipping Step 1** because the lifts look exciting (Twyman's Law)
+- ⛔ **Recomputing thresholds** instead of consuming the verdicts the platform already returned
+- ⛔ **Not applying polarity** — reading `summary.positive` as "good" without checking `metric.direction`
+- ⛔ Trusting a >30% lift without checking whether the **denominator changed**
+- ⛔ **Including the control row** when counting wins/losses (exclude rows where the variant equals `settings.controlKey`)
+- ⛔ Treating a `null` lift as "no effect" — it means computation failed
+- ⛔ Treating a missing primary (in `metrics[]` but not in `live_metrics`/`results_cache.metrics`) as "no effect" — it's "no measurement"
+- ⛔ Interpreting a `< 3 day` experiment instead of refusing
+- ⛔ Forgetting to call out a **non-default `confidenceLevel`** (0.9 inflates false positives; 0.99 is conservative)
+- ⛔ Treating **secondary-metric significance** as decisional (it isn't, ever)
+- ⛔ Conflating **statistical significance** with **practical significance**
+- ⛔ Ignoring **guardrail regressions** because the primary won
+- ⛔ Calling a single significant primary a "win" while **multiple-testing correction is off** (look at the aggregate, or enable correction)
+- ⛔ Concluding "no effect" from an underpowered inconclusive result (route to [references/why-no-statsig.md](references/why-no-statsig.md))
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/README.md b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/README.md
new file mode 100644
index 0000000..71278d6
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/README.md
@@ -0,0 +1,34 @@
+# Eval fixtures — `experiment-results`
+
+Each fixture is a self-contained prompt + expected-behavior pair for the `experiment-results` skill. They are seeded from PRD customer quotes — the customer pain that motivated this skill in the first place.
+
+The fixtures are not auto-runnable yet (no harness lives in this repo). They're written for two uses:
+
+1. **Manual rehearsal** — a human (or another agent) can read the prompt, simulate the response the skill should produce, and check it against the `expected_behavior` field.
+2. **Regression checkpoint when a runner exists** — when an eval harness is added in this repo, these prompts plug in directly: each YAML doc becomes one case, the `expected_behavior` field becomes the grader rubric.
+
+When you change `SKILL.md`, walk these fixtures and confirm each one still produces the expected behavior. If a fixture starts failing, decide whether the skill regressed or the fixture itself needs updating.
+
+---
+
+## Fixtures
+
+| Fixture                         | PRD source quote                                                                                                         | What it exercises                                                                              |
+| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------- |
+| `pelando-plus-2-others.yaml`    | Pelando — _"+2 others"_ (results too noisy for the user to triage which results to act on)                               | Decision tree spine + per-metric polarity; ship/iterate verdict against multi-variant noise.   |
+| `confetti-8-metrics.yaml`       | Confetti — _"8 metrics for new visitors"_ (many primaries; user wants segment-of-interest selection on new vs returning) | Segment-of-interest selection; multiple-testing correction warning; per-metric interpretation. |
+| `polarsteps-no-workaround.yaml` | Polarsteps — _"no documented workaround"_ (user wants to understand SRM failure with no canned path forward)             | Health-check interpretation; Kohavi framing; ordered-causes recommendation.                    |
+
+Each YAML doc has the same shape:
+
+```yaml
+name: <slug>
+prd_source: <one-line attribution>
+trigger_phrase: <what the user types>
+get_experiment_summary: <key fields the skill would see; not full response — just enough for the eval>
+expected_behavior:
+  verdict: <SHIP | ITERATE | KILL | WAIT | DO_NOT_DECIDE>
+  must_mention: [<phrases / framings the skill must cover>]
+  must_not_do: [<failure modes the skill should avoid>]
+  references_consulted: [<which reference files the skill should pull open>]
+```
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/confetti-8-metrics.yaml b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/confetti-8-metrics.yaml
new file mode 100644
index 0000000..da61d9e
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/confetti-8-metrics.yaml
@@ -0,0 +1,48 @@
+name: confetti-8-metrics
+prd_source: |
+  Confetti — "8 metrics for new visitors"
+  Customer is running an experiment with 8 primary-ish metrics and explicitly
+  cares about new-visitor behavior. They want a segment-driven read, not a
+  dump of 8 lifts. The skill should pre-commit to segments tied to the
+  hypothesis (new vs returning), call out the multiple-testing concern with
+  8 metrics, and produce a verdict scoped to the segment that matters.
+
+trigger_phrase: |
+  We're tracking 8 metrics on this onboarding redesign experiment and I really
+  care about how new visitors respond. Can you read this and tell me whether
+  it's a ship for the new-user audience?
+
+get_experiment_summary:
+  hypothesis: |
+    If we redesign the first-session onboarding flow, then activation rate
+    among NEW visitors will increase by ≥5% relative, because reducing
+    cold-start friction shortens time-to-first-value.
+  settings:
+    controlKey: "control"
+    multipleTestingCorrection: "off" # mis-configured given 8 primaries
+    testingModel: "sequential"
+    confidenceLevel: 0.95
+  metrics_count: 8
+  primary_metrics_summary: |
+    Of 8 primaries: 2 significant positive (polarity-correct), 1 significant
+    negative (a "Time to First Action" metric with direction=down where
+    lift is -7% — actually a WIN once polarity-applied), 5 inconclusive.
+
+expected_behavior:
+  verdict: WAIT
+  must_mention:
+    - "Pre-commit to the new-vs-returning segment because the hypothesis names new visitors as the cohort that matters"
+    - "Recommend at most 3–5 segments and call new vs returning the primary slice"
+    - "Multiple-testing correction is OFF but there are 8 primaries — flag the inflated family-wise FPR explicitly (rough order: with 8 primaries × 1 variant at α=0.05, family-wise FPR is high enough to make a single significant result inconclusive on its own)"
+    - "Apply polarity recipe per metric — flag the Time to First Action 'negative bucket' as a win because direction=down"
+    - "Without correction enabled, don't ship on a single significant primary; either enable correction and re-analyze or look at the aggregate of all 8"
+    - "Verdict is WAIT (re-analyze with multiple-testing correction enabled, segmented to new visitors) — not SHIP"
+  must_not_do:
+    - "Slice by every available property after the fact (the fishing-expedition warning)"
+    - "Treat the 'Time to First Action' metric in the negative bucket as a loss without polarity-correcting"
+    - "Call the experiment a ship because 2 of 8 primaries are significant positive"
+    - "Pretend the agent can compute the corrected p-values itself — instead, recommend re-running with multipleTestingCorrection enabled"
+  references_consulted:
+    - segment-of-interest-selection.md
+    - per-metric-interpretation.md
+    - health-check-interpretation.md # for the misconfig flag
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/pelando-plus-2-others.yaml b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/pelando-plus-2-others.yaml
new file mode 100644
index 0000000..f634236
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/pelando-plus-2-others.yaml
@@ -0,0 +1,79 @@
+name: pelando-plus-2-others
+prd_source: |
+  Pelando — "+2 others"
+  Customer reported that when a multi-variant test concludes with a winner banner
+  plus a small-print "+2 others", they cannot tell which non-winner variants are
+  benign vs which contain a guardrail regression they need to act on. The skill
+  should pivot the summary per variant, polarity-correct each, and call out the
+  losers, not gloss over them.
+
+trigger_phrase: |
+  Can you make sense of this experiment for me? The UI shows treatment_a winning
+  on the primary plus "+2 others" but I have no idea whether treatment_b or
+  treatment_c are okay to ignore.
+
+get_experiment_summary:
+  settings:
+    controlKey: "control"
+    multipleTestingCorrection: "benjamini-hochberg"
+    testingModel: "sequential"
+  metrics:
+    - id: m_primary
+      type: primary
+      direction: up
+      name: "Activation Rate"
+    - id: m_guardrail_latency
+      type: guardrail
+      direction: down
+      name: "p95 Latency (ms)"
+    - id: m_guardrail_errors
+      type: guardrail
+      direction: down
+      name: "Error Rate"
+  live_exposures:
+    control: 41123
+    treatment_a: 40987
+    treatment_b: 41210
+    treatment_c: 40755
+  live_srm_analysis:
+    # platform-flagged passing
+    p_value: 0.42
+  summary:
+    positive:
+      - {
+          metricId: m_primary,
+          variant: treatment_a,
+          lift: 0.041,
+          liftConfidence: 0.95,
+        }
+      - {
+          metricId: m_guardrail_latency,
+          variant: treatment_b,
+          lift: 0.08,
+          liftConfidence: 0.95,
+        }
+    negative:
+      - {
+          metricId: m_primary,
+          variant: treatment_c,
+          lift: -0.022,
+          liftConfidence: 0.95,
+        }
+    no:
+      - { metricId: m_primary, variant: treatment_b, lift: 0.004 }
+
+expected_behavior:
+  verdict: ITERATE
+  must_mention:
+    - "Pivot the summary by variant before declaring a winner"
+    - "treatment_a wins on the primary but treatment_b shows a latency regression once polarity is applied (direction=down + lift +8% = bad)"
+    - "treatment_c regresses the primary"
+    - "Multi-variant verdict requires each treatment to be judged independently against control"
+    - "Recommend iterate, not ship — at minimum, do not ship treatment_b, and investigate treatment_c before re-running"
+  must_not_do:
+    - "Quietly drop treatment_b and treatment_c into '+2 others' without polarity-checking each"
+    - "Trust the bucket name (positive/negative) as the business verdict"
+    - "Re-apply multiple-testing correction on top of the platform's benjamini-hochberg"
+  references_consulted:
+    - per-metric-interpretation.md
+    - get-experiment-fields.md
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/polarsteps-no-workaround.yaml b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/polarsteps-no-workaround.yaml
new file mode 100644
index 0000000..325a3bf
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/polarsteps-no-workaround.yaml
@@ -0,0 +1,61 @@
+name: polarsteps-no-workaround
+prd_source: |
+  Polarsteps — "no documented workaround"
+  Customer's experiment is failing SRM and they cannot find a documented path
+  forward. The skill should consume the platform's SRM verdict (not recompute
+  chi-square), cite Kohavi ("SRM is the #1 trustworthiness check"), and
+  surface ordered likely causes plus a specific recommended action — not
+  punt with "investigate further."
+
+trigger_phrase: |
+  My experiment is failing SRM and the result lift looks too good to be true
+  (+18% on the primary). The docs just say "investigate" — what does that
+  actually mean here? Should I trust the lift?
+
+get_experiment_summary:
+  settings:
+    controlKey: "control"
+    srm:
+      enabled: true
+      targetAllocations: { control: 50, treatment: 50 }
+    excludeQA: false # potentially relevant
+  live_exposures:
+    control: 18250
+    treatment: 22980
+  live_srm_analysis:
+    # platform-flagged FAILING
+    p_value: 0.00002
+    chi_square: 18.4
+  summary:
+    positive:
+      - {
+          metricId: m_primary,
+          variant: treatment,
+          lift: 0.18,
+          liftConfidence: 0.95,
+        }
+  metrics:
+    - id: m_primary
+      type: primary
+      direction: up
+      name: "Trip Plan Created"
+
+expected_behavior:
+  verdict: DO_NOT_DECIDE
+  must_mention:
+    - "SRM is failing per the platform's verdict — do NOT trust the +18% lift"
+    - "Cite Kohavi: SRM is the #1 trustworthiness check; when SRM is failing, lift, p-values, and confidence intervals cannot be attributed to the treatment"
+    - "Twyman's Law: a +18% lift on a failing-SRM experiment is more likely a bucketing bug than a genuine win"
+    - "Likely causes ordered most → least likely: bucketing_bug, biased_assignment, bot_traffic, exposure_tracking_bug, ramp_up_timing"
+    - "Recommended action: pause_and_investigate — pause before drawing conclusions; randomization assumption is broken"
+    - "Concrete next steps: compare live_exposures to targetAllocations; check feature-flag rules and history via Get-Feature-Flag; Run-Query $experiment_started by variant; enable settings.excludeQA before relaunch given it is currently off"
+    - "Do NOT recompute the SRM chi-square — consume the platform's verdict"
+    - "Restart with fixed bucketing once the cause is found; do NOT re-conclude on the broken data"
+  must_not_do:
+    - "Calculate the chi-square or re-derive an SRM p-value threshold"
+    - "Recommend shipping or treating the +18% lift as real"
+    - "Hand the user a generic 'investigate further' without ordered causes and an action"
+    - "Skip Kohavi framing — it's the whole reason this check is the #1 gate"
+  references_consulted:
+    - health-check-interpretation.md
+    - get-experiment-fields.md
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/get-experiment-fields.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/get-experiment-fields.md
new file mode 100644
index 0000000..efaeae5
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/get-experiment-fields.md
@@ -0,0 +1,161 @@
+# `Get-Experiment` Field Map
+
+Quick reference for which `Get-Experiment` response field drives each interpretation. Always call with `compute_exposures=true, compute_metrics=true`.
+
+This reference is **read-only domain knowledge** for the agent. It does NOT define thresholds — every "fail condition" listed below is a _characterization_ of how the platform itself already classifies the field, not a threshold this skill should re-apply.
+
+---
+
+## Identity & lifecycle
+
+```
+id, name, description, hypothesis, status, start_date, end_date
+creator_email, tags, url, workspace_id
+feature_flag_id                       → for feature-flag-based experiments
+settings.controlKey                   → variant key treated as control (often "control"; may be "")
+```
+
+`status` is one of `"concluded" | "success" | "fail"` (the UI may additionally show `SUCCESS_DEFERRED` for the special variant constant — see "Decision metadata" below).
+
+---
+
+## Trustworthiness
+
+```
+live_srm_analysis                     → SRM verdict (consume — don't recompute)
+  .p_value
+  .chi_square
+live_exposures[<variantKey>]          → per-variant exposure counts (live)
+exposures_cache[<variantKey>]         → per-variant exposure counts (cached fallback)
+exposures_cache.$srm_analysis         → cached SRM analysis
+exposures_cache.$last_computed        → when the cache was last refreshed
+settings.srm.enabled                  → whether the SRM check ran
+settings.srm.targetAllocations        → expected per-variant allocation (percent)
+settings.preExperimentBias            → whether Retro A/A was enabled
+settings.excludeQA                    → whether QA traffic was filtered
+live_results_errors                   → non-null = live computation failed; surface and fall back to cache
+```
+
+---
+
+## Per-metric per-variant results
+
+```
+live_metrics[<metricId>][<variantKey>]
+  .value             → metric value for this variant
+  .sampleSize        → sample size for this variant on this metric
+  .lift              → (treatment - control) / control  (0 for control row)
+  .liftConfidence    → confidence LEVEL used (e.g. 0.95) — NOT the CI width
+  .significance      → "YES_POSITIVE" | "YES_NEGATIVE" | "NO"  (sign-of-lift, NOT polarity)
+
+results_cache.metrics[<metricId>][<variantKey>]  → cached fallback, same shape
+```
+
+---
+
+## Bucketed summary
+
+```
+results_cache.summary.positive[]      → items with significance == "YES_POSITIVE" (lift > 0, sig)
+results_cache.summary.negative[]      → items with significance == "YES_NEGATIVE" (lift < 0, sig)
+results_cache.summary.no[]            → items with significance == "NO"
+
+Each item:
+  .metricId
+  .variant
+  .value
+  .lift
+  .liftConfidence
+  .sampleSize
+  .significance
+```
+
+**Pre-process the summary**: drop rows where `variant == settings.controlKey` (control-vs-control is mechanical noise), then apply the polarity recipe before drawing any conclusion.
+
+---
+
+## Metric catalog (for polarity lookups)
+
+```
+metrics[]
+  .id, .name
+  .type ("primary" | "guardrail" | "secondary")
+  .direction ("up" | "down")          → always set; defaults to "up" if the source metric was unset
+```
+
+Build a lookup `metric_id → (type, direction)` and join to summary rows during interpretation.
+
+---
+
+## Settings that change interpretation
+
+```
+settings.confidenceLevel              → significance threshold (e.g. 0.95)
+settings.testingModel                 → "frequentist" or "sequential"
+settings.endCondition                 → "sample_size" or "days"
+settings.sampleSize / .endAfterDays   → planned end target
+settings.multipleTestingCorrection    → "off" | "bonferroni" | "benjamini-hochberg"
+settings.cuped.enabled                → CUPED variance reduction applied
+settings.cuped.preExposureDatePreset  → pre-exposure window
+settings.winsorization.enabled        → outlier capping applied
+settings.winsorization.percentile     → cap percentile (default 95; lower values cap more of the distribution)
+```
+
+---
+
+## Decision metadata (post-decide)
+
+```
+results_cache.message                 → decision rationale
+results_cache.variant                 → shipped variant key (or special constant)
+status                                → "concluded" | "success" | "fail"
+```
+
+Special variant constants for `success=true`:
+
+- `__no_variant_shipped__` — ship the change without picking a variant.
+- `__defer_variant_decision__` — defer (UI shows `SUCCESS_DEFERRED`).
+
+For a kill, pass `success=false`.
+
+---
+
+## Lifecycle hand-off
+
+```
+Update-Experiment(
+  experiment_id=<id>,
+  experiment={
+    "action": "decide",
+    "success": true | false,
+    "variant": "<winner_key>",      # required when success=true
+    "message": "<rationale: metrics evaluated, polarity, tradeoffs accepted>"
+  }
+)
+```
+
+`message` is required on every `decide` call.
+
+---
+
+## Misconfig field map (cross-link)
+
+For _how_ to react to each of these, see [health-check-interpretation.md](health-check-interpretation.md) §7.
+
+- `settings.multipleTestingCorrection in {"off", null}` with 2+ primaries × 1+ non-control variants
+- `settings.winsorization.enabled == true` with `percentile` very low (< ~80) or very high (> ~99)
+- `settings.srm == null` OR `settings.srm.enabled == false` (often intentional — only flag if results look suspicious)
+- `settings.cuped.enabled == true` AND the experiment cohort is "new users only"
+- `settings.confidenceLevel != 0.95`
+- `metrics[]` entries with `name == ""`
+- A primary metric in `metrics[]` but missing from `live_metrics` AND `results_cache.metrics`
+
+---
+
+## When to reach for sibling tools
+
+- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the setup-side skill / `Get-Experiment-Setup-Guidance`.
+- **Raw data for triggered or segmentation analysis** → `Run-Query` on the metric with appropriate filters.
+- **Acting on the recommendation** (ship, kill, extend) → `Update-Experiment` with the appropriate action.
+- **Feature-flag rollout history** for SRM root cause → `Get-Feature-Flag`.
+- **Session replays** for behavioral explanation of a quantitative result → the replay-fetch tool (see [session-replay-analysis.md](session-replay-analysis.md)).
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md
new file mode 100644
index 0000000..4471219
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md
@@ -0,0 +1,158 @@
+# Health-Check Interpretation
+
+Open this when Step 1 of the Decision Tree flags a failure (SRM, Retro A/A, insufficient exposures, peeking, broken-data, < 3-day window, or any misconfiguration). The goal is to turn the platform's already-computed verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action.
+
+**This skill never recomputes thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers.
+
+---
+
+## Kohavi framing — always cite when a health check fails
+
+> **Sample Ratio Mismatch is the #1 trustworthiness check (Kohavi).** When SRM is failing, do not trust the experiment's lift, p-values, or confidence intervals — the randomization assumption is broken, so the measured effect cannot be attributed to the treatment.
+>
+> **Twyman's Law**: any unusually clean or unusually large result is more likely a bug than a discovery. A spectacular lift on a failing-SRM experiment is not evidence of a great treatment; it's evidence the bucketing is broken.
+
+These two principles drive the recommendations below. Lead with them when explaining a failing check to the user.
+
+---
+
+## 1. SRM (Sample Ratio Mismatch)
+
+**Verdict to consume**: `live_srm_analysis` (or `exposures_cache.$srm_analysis`). The platform tags failing SRMs already; do not compute chi-square yourself.
+
+### What it means
+
+Users were assigned to variants in proportions that disagree with the configured `settings.srm.targetAllocations`. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness.
+
+### Likely causes, ordered most → least likely
+
+(Surface in this order — investigate the most probable first.)
+
+1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees.
+2. **biased_assignment** — The assignment criterion correlates with the variant — e.g. assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window.
+3. **bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation.
+4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the `$experiment_started` event fires exactly once per user per variant assignment.
+5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period.
+
+### Recommended actions
+
+- **pause_and_investigate** — Pause the experiment before drawing any conclusions. SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable.
+- **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric.
+- **investigate_exposure_logging** — Inspect `$experiment_started` event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs.
+- **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split.
+
+### Investigation checklist
+
+1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented?
+2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Use `Get-Feature-Flag` to inspect rollout rules and history.
+3. For multi-variant tests, the platform's SRM threshold is Bonferroni-corrected, so the effective per-variant threshold may be tighter than the headline. Trust the platform's SRM flag, not raw p-value math.
+4. Verify SDK version and bucketing logic. `Run-Query` for `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly.
+5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it.
+6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting.
+7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.**
+
+---
+
+## 2. Retro A/A (pre-experiment bias) failure
+
+**Verdict to consume**: the analysis the platform attached when `settings.preExperimentBias` is enabled.
+
+### What it means
+
+The same statistical comparison run on the **pre-exposure** period revealed that variant cohorts already differed _before_ the treatment started. Any "lift" measured during the experiment may just be reflecting that pre-existing gap, not the change.
+
+- Pre-experiment bias on a **primary** metric is a **stop-and-investigate** signal.
+- Pre-experiment bias on a **secondary** metric is informational only.
+
+### Investigation checklist
+
+1. Identify which metric × variant pair triggered the failure (after the platform's correction).
+2. Check whether bucketing was deterministic — non-deterministic assignment in the pre-period means users were assigned to different variants than they would have been in production.
+3. Look for cohort skew: did one variant disproportionately receive heavy users? `Run-Query` on the metric pre-experiment grouped by variant to confirm.
+4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort.
+5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing.
+
+---
+
+## 3. Insufficient exposures
+
+**Verdict to consume**: `live_exposures` per variant, plus any platform-attached "insufficient" flag. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue.
+
+### Investigation checklist
+
+1. Check `live_exposures` totals — which variant is undersampled?
+2. Inspect feature-flag rollout: `Get-Feature-Flag` → was rollout dialed back?
+3. `Run-Query` for the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?).
+4. If the experiment is still ACTIVE: extend duration via `Update-Experiment` with `endAfterDays`.
+5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math.
+
+If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question.
+
+---
+
+## 4. Frequentist peeking
+
+**Verdict to consume**: `settings.testingModel == "frequentist"`, plus `end_date` vs `start_date + endAfterDays` (or `sampleSize` vs `live_exposures.$overall`, depending on `settings.endCondition`).
+
+### What it means
+
+A frequentist test that ends before reaching its configured target has an **inflated false-positive rate**. The math assumes a fixed sample size; peeking before that point and stopping on a favorable look is exactly what "p-hacking" looks like in production.
+
+### Investigation checklist
+
+1. Confirm `settings.testingModel == "frequentist"`.
+2. Compare `end_date` against `start_date + endAfterDays` (or whether `sampleSize` was reached, whichever is the configured `endCondition`).
+3. If the conclusion was premature: results have inflated false-positive rate. Recommend a re-run.
+4. If the user wants to keep current results: caveat strongly. Recommend `testingModel: "sequential"` for the next experiment so they can stop early without penalty.
+
+(Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.)
+
+---
+
+## 5. Live computation timeout / broken data
+
+**Verdict to consume**: `live_results_errors` non-null with `live_*` fields null.
+
+### Investigation checklist
+
+1. Try `Get-Experiment` again — transient backend load may resolve. Wait ~30s between retries.
+2. If repeated failures: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget.
+3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision.
+4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation.
+
+---
+
+## 6. Experiment ran < 3 days
+
+**Verdict to compute (this one is local)**: `end_date - start_date`.
+
+Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. **Refuse to interpret.** Tell the user explicitly:
+
+> _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_
+
+If `endCondition: "sample_size"` with a tiny target (e.g. 10k) was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window.
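+
+Because this verdict is computed locally, a small sketch may help. It assumes `start_date` and `end_date` are ISO-8601 strings without a timezone suffix; the function name and the `MIN_RUN_DAYS` constant are illustrative only.
+
+```python
+from datetime import datetime
+
+MIN_RUN_DAYS = 3  # the ~3-day floor described above
+
+
+def ran_too_short(start_date: str, end_date: str, min_days: int = MIN_RUN_DAYS) -> bool:
+    """Return True if the experiment window is shorter than min_days."""
+    elapsed = datetime.fromisoformat(end_date) - datetime.fromisoformat(start_date)
+    return elapsed.total_seconds() < min_days * 86400
+```
+
+The check deliberately ignores sample size: a window this short is refused even when the exposure target was reached.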
+
+---
+
+## 7. Misconfigurations to flag during Step 1
+
+These don't always invalidate results, but they change how to _read_ them. Surface them as warnings.
+
+- `settings.multipleTestingCorrection in {"off", null}` AND there are 2+ primary metrics across 1+ non-control variants → without correction, any single significant primary may be a false positive. **Don't assume the result is broken**; look at all primary results in aggregate. If most or all primaries point the same direction (all positive or all negative), there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk**, and the user can enable correction (Benjamini-Hochberg or Bonferroni) and re-analyze. The family-wise error rate grows quickly with the number of comparisons: for `m` independent tests it is `1 − (1 − α)^m` (e.g. 15 primaries × 1 variant at α=0.05 → ~54% chance of at least one false positive).
+- `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` very low (e.g. < ~80) or unusually high (e.g. > ~99) → suspect outlier handling. A very low percentile caps aggressively and distorts typical-user values; a very high percentile caps almost nothing, so whales still dominate. The platform's default is 95; a percentile near 50 caps a large share of the data and likely indicates misconfiguration.
+- `settings.srm == null` OR `settings.srm.enabled == false` → the SRM check didn't run. **SRM is often deliberately disabled** (e.g. when feature-flag rollouts intentionally split traffic unevenly), so do not try to compute it yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) — then suggest the user re-enable SRM and re-analyze.
+- `settings.cuped.enabled == true` AND the experiment cohort is "new users only" → CUPED requires pre-exposure data, which new-user experiments lack, so CUPED simply has no effect. **This does NOT invalidate results** — variance reduction just didn't happen. Mention it as informational.
+- `settings.confidenceLevel != 0.95` → call out explicitly. `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Combine with metric count for a sense of family-wise error rate.
+- `metrics[]` contains entries with `name == ""` → likely a broken or placeholder metric reference. Flag and skip during analysis.
+- A primary metric appears in `metrics[]` but is **missing from `live_metrics` AND `results_cache.metrics`** → no result was computed for that primary. Surface prominently — this is "no measurement," not "no effect." Recommend the user re-sync results.
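+
+The family-wise arithmetic in the first bullet can be checked with a few lines. This sketch assumes the tests are independent, which real metrics rarely are exactly, so treat the output as an upper-end intuition rather than a platform-computed value.
+
+```python
+def family_wise_error_rate(alpha: float, num_tests: int) -> float:
+    """Chance of at least one false positive across independent tests."""
+    return 1.0 - (1.0 - alpha) ** num_tests
+
+
+def bonferroni_alpha(alpha: float, num_tests: int) -> float:
+    """Per-test threshold that keeps the family-wise rate at or below alpha."""
+    return alpha / num_tests
+
+
+# 15 primaries x 1 non-control variant at alpha = 0.05
+print(f"{family_wise_error_rate(0.05, 15):.1%}")  # prints 53.7%
+```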
+
+---
+
+## Output shape when a health check fails
+
+1. **What failed**, in one sentence (use the verdict the platform attached — do not re-derive).
+2. **What that means for trust** — cite the Kohavi framing (SRM is #1) or Twyman's Law where it fits.
+3. **Likely causes**, ordered most → least probable.
+4. **Recommended action** from the small set above.
+5. **Investigation checklist** the user can run.
+6. **What NOT to do** — usually, "do not act on the current lift / p-value numbers."
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md
new file mode 100644
index 0000000..3b44385
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md
@@ -0,0 +1,188 @@
+# Per-Metric Interpretation
+
+Open this when the user wants you to translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_
+
+**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from `Get-Experiment`. Then translate.
+
+---
+
+## The mental model
+
+Each row in `summary.positive` / `summary.negative` / `summary.no` answers four questions:
+
+1. **Did the lift go up or down?** — the `summary` bucket name (sign-of-lift, not polarity).
+2. **Was the change distinguishable from noise?** — the `significance` field (or the bucket name itself: rows in `summary.positive` / `summary.negative` are significant, rows in `summary.no` are not).
+3. **Was the change in the goal direction?** — apply the polarity recipe with `metric.direction`.
+4. **Was the change big enough to matter?** — multiply `lift` by the control baseline `value` to get absolute impact, then judge against business context.
+
+A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any one of those and you risk shipping the wrong thing.
+
+---
+
+## Polarity recipe (repeat from the spine — critical)
+
+`metric.direction` is `"up"` or `"down"` (defaults to `"up"`).
+
+- `lift is None` or `lift == 0` → **neutral** (treat as no measurement / no effect respectively).
+- `direction == "up"` → **positive** if `lift > 0`, else **negative**.
+- `direction == "down"` → **positive** if `lift < 0`, else **negative**.
+
+A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. For example, a `-1%` lift on `interstitials_shown`, sitting in `summary.negative` with `direction: "down"`, is plausibly a **win** (less interruption).
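+
+The recipe translates directly into a small function. The name `polarity` and the returned strings are illustrative choices, not fields returned by `Get-Experiment`.
+
+```python
+from typing import Optional
+
+
+def polarity(lift: Optional[float], direction: str = "up") -> str:
+    """Translate sign-of-lift into business polarity using metric.direction."""
+    if lift is None or lift == 0:
+        return "neutral"  # no measurement, or no effect
+    if direction == "down":
+        return "positive" if lift < 0 else "negative"
+    # "up" is the default direction
+    return "positive" if lift > 0 else "negative"
+```
+
+For instance, `polarity(-0.01, "down")` yields `"positive"`, matching the `interstitials_shown` case above.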
+
+---
+
+## Reading the p-value correctly
+
+The p-value is the probability of observing a difference at least as extreme as the one measured, **assuming the null hypothesis (no real difference) is true**. It is NOT:
+
+- ❌ The probability that the treatment works.
+- ❌ The probability the result will replicate.
+- ❌ A measure of effect size — a tiny lift can be highly significant on a huge sample.
+- ❌ Proof of "no effect" when above threshold (see "Inconclusive results").
+
+Mixpanel uses Welch's t-test (z-test for large samples). Default α = 0.05 at 95% confidence. The confidence level is set on `settings.confidenceLevel`. If it differs from 0.95, call it out in the verdict (`0.9` inflates false positives; `0.99` is conservative).
+
+---
+
+## Reading the lift correctly
+
+```
+lift = (treatment_mean - control_mean) / control_mean
+```
+
+- `liftConfidence` is the **confidence level used** (e.g. 0.95). It is NOT the confidence-interval width.
+- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct.
+- If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect."
+
+---
+
+## Verdict phrasing — a small palette
+
+Pick the phrase that matches the four-question pattern. These are the words to use with users; they map onto the platform's already-computed numbers, so the agent never has to invent thresholds.
+
+| Pattern (sig × polarity × magnitude)                        | Plain-language verdict                                                                                                                                                    |
+| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Significant, polarity positive, magnitude large vs baseline | "**Clear win** — `<metric>` moved `<lift%>` in the goal direction, which is meaningful at this baseline." (apply Twyman's Law if lift > ~30%)                             |
+| Significant, polarity positive, magnitude small vs baseline | "**Statistically significant but practically small** — `<lift%>` on a `<baseline>` baseline is `<absolute>`; confirm with the user whether that clears the business bar." |
+| Significant, polarity negative                              | "**Regression** — `<metric>` moved `<lift%>` against its goal direction. This is a reason not to ship even if other primaries won."                                       |
+| Not significant, lift in goal direction, well-powered       | "**Likely no effect at the detectable size.** The experiment had enough power to detect `<MDE>`; the observed lift is below that threshold."                              |
+| Not significant, lift in goal direction, underpowered       | "**Inconclusive — too underpowered to call.** Route to the why-no-statsig playbook to decide between wait / extend / restart."                                            |
+| Not significant, lift in wrong direction                    | "**No detectable harm**, but no win either."                                                                                                                              |
+| `lift is None`                                              | "**No measurement** — this variant's row failed to compute. Surface the failure and re-sync."                                                                             |
+| Lift > ~30% on any metric                                   | Prefix with "**Twyman's Law check:** that lift is unusually large; verify the denominator hasn't changed before celebrating."                                             |
+
+---
+
+## Magnitude — make it absolute
+
+Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful:
+
+1. Baseline from the control variant: `live_metrics[metricId][controlKey].value` (or the `summary.no` row where `variant == controlKey`).
+2. Lift from the winning row.
+3. Absolute lift: `baseline_value × lift`. Examples:
+   - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate.
+   - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`.
+4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week."
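+
+Steps 3 and 4 amount to two multiplications. The sketch below uses hypothetical helper names and the traffic figures quoted in step 4; it is an illustration of the arithmetic, not a platform calculation.
+
+```python
+def absolute_lift(baseline_value: float, lift: float) -> float:
+    """Absolute change in the metric: control baseline times relative lift."""
+    return baseline_value * lift
+
+
+def projected_impact(baseline_value: float, lift: float, users_per_period: int) -> float:
+    """Absolute change scaled to the population served in one period."""
+    return absolute_lift(baseline_value, lift) * users_per_period
+
+
+# 5% lift on a 20% baseline, 1M users/week: roughly 10,000 extra conversions/week
+big = projected_impact(0.20, 0.05, 1_000_000)
+# 5% lift on a 0.1% baseline, 1k users/week: roughly 0.05 extra conversions/week
+small = projected_impact(0.001, 0.05, 1_000)
+```
+
+The same relative lift differs by more than five orders of magnitude in absolute terms, which is why the percentage alone should not drive the verdict.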
+
+### Fallback when `value` / `sampleSize` are null
+
+Common — happens whenever live computation timed out or `results_cache.metrics` was nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.**
+
+Call `Run-Query` on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation:
+
+- `unique` (Bernoulli) → conversion **rate** as the baseline.
+- `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. Multiplying lift by a raw total double-counts cohort size.
+
+---
+
+## Twyman's Law in practice — changed-denominator lifts
+
+Before celebrating any lift > ~30%, ask: **did the treatment change who is _exposed_ to this metric, not just how they behave?**
+
+If the treatment causes more users to _see_ a screen, more events naturally fire — the metric grows because the denominator changed, not because per-user behavior changed.
+
+- A "Free item" promotion drives more users to checkout → "Checkout Screen Viewed" lifts +1000% mechanically. The interesting question is **conversion rate on the screen**, not raw views.
+- A new banner makes a feature discoverable → "Feature Page Viewed" lifts dramatically. **Per-discoverer behavior** may be unchanged.
+
+When you see a > 30% lift, name the risk explicitly:
+
+> _"This metric measures exposure to the screen/event. The treatment likely caused more users to be exposed; that explains most of the lift mechanically. The interesting question is what those users did once they got there."_
+
+---
+
+## Metric distribution types
+
+Different metric types behave differently; cite the relevant nuance in your verdict.
+
+| Metric type                      | Distribution | Interpretation nuance                                                                                     |
+| -------------------------------- | ------------ | --------------------------------------------------------------------------------------------------------- |
+| Unique users / conversion rate   | Bernoulli    | Variance = `p(1−p)`. For a fixed relative lift, sample needed grows as `(1−p)/p`: low rates need more.    |
+| Event counts / sessions per user | Poisson      | Variance = mean. Highly sensitive to power users; consider whether one heavy user can swing results.      |
+| Revenue / numeric properties     | Gaussian     | Long tails (whales) inflate variance. Strongly consider Winsorization.                                    |
+
+---
+
+## Variance-reduction & outlier settings that change interpretation
+
+- **CUPED** (`settings.cuped.enabled == true`): the mean is unchanged, but variance is reduced (often by 30–70%, depending on how strongly the metric correlates with pre-exposure behavior), so CIs are narrower and power is higher. Note: CUPED requires users to exist before the experiment, so new-user-only experiments cannot benefit from it; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix).
+- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration (see `health-check-interpretation.md` §Misconfig).
+
+---
+
+## Multiple comparisons & metric tiers — what's decisional and what isn't
+
+| Tier          | How it influences the verdict                                                                                                                                                                                 |
+| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| **Primary**   | **Decisional.** The platform auto-applies correction when `multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` (across primaries × variants).                                              |
+| **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude.                                                                                                                                          |
+| **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. |
+
+If `settings.multipleTestingCorrection` is `"off"` AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled.
+
+---
+
+## "Significance = NO" does NOT mean "no effect"
+
+A row in `summary.no` means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level. **Important when the user is about to call something a null result.**
+
+Options to suggest when a primary metric lands in `summary.no`:
+
+1. **Extend duration** (if the experiment is still ACTIVE).
+2. **Increase traffic allocation** (if there's headroom, but never mid-Frequentist-test: changing the split mid-run invalidates the SRM check).
+3. **Use Sequential testing model** for the next experiment if continuous monitoring fits.
+4. **Enable CUPED** if the metric correlates with pre-exposure behavior.
+5. **Narrow the hypothesis** — test a stronger version, or scope to a more responsive segment.
+6. **Accept the null** — if the experiment was well-powered for the MDE that matters, "no effect" is a real finding.
+
+For the full "why hasn't this hit statsig yet" walk-through, see [why-no-statsig.md](why-no-statsig.md).
+
+---
+
+## Frequentist vs Sequential — what affects per-metric reading
+
+Check `settings.testingModel`:
+
+- `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration.
+- `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended.
+
+Calling `Update-Experiment(action="conclude")` on a Frequentist experiment that hasn't reached its target is a peeking event. Flag it in the verdict.
+
+---
+
+## Triggered analysis & dilution
+
+If the change only affects a subset of users (e.g. only triggers when a specific button is shown), the **effect on triggered users** is much larger than the **effect on the full exposed population**.
+
+- Triggered analysis zooms in on users who actually saw the change.
+- Dilution math: `population_lift = triggered_lift × (triggered_users / total_exposed)`.
+
+The platform doesn't auto-compute triggered analysis. If the change is gated by a condition, ask the user about the trigger rate and walk through the math before declaring the population-level lift "small."
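+
+The dilution formula can be walked through with a short sketch. It assumes untriggered users are unaffected and have a baseline similar to triggered users; the function name is hypothetical.
+
+```python
+def diluted_lift(triggered_lift: float, triggered_users: int, total_exposed: int) -> float:
+    """Population-level lift implied by a lift observed only on triggered users."""
+    if total_exposed <= 0:
+        raise ValueError("total_exposed must be positive")
+    trigger_rate = triggered_users / total_exposed
+    return triggered_lift * trigger_rate
+
+
+# A 20% lift among the 10% of exposed users who triggered the change
+# shows up as roughly a 2% lift on the full exposed population.
+population_lift = diluted_lift(0.20, 10_000, 100_000)
+```
+
+Read in reverse, a "small" 2% population lift at a 10% trigger rate may reflect a substantial effect on the users who actually saw the change.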
+
+---
+
+## Novelty and primacy
+
+- **Novelty** — lift is large early, then decays as users habituate.
+- **Primacy** — lift is small or negative early, then grows as users learn the new behavior.
+
+To detect either, look at the line-chart view of the metric (date-segmented). A monotonic decay from day 1 → day 14 is classic novelty; the steady-state lift is what matters for shipping. Call this out when interpreting any experiment shorter than ~2 weeks.
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md
new file mode 100644
index 0000000..6877d2a
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md
@@ -0,0 +1,95 @@
+# Segment-Breakdown Interpretation
+
+Open this when the user has per-segment results in hand and wants to read them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
+
+> **Platform support status.** Reading segment-level experiment results in `Get-Experiment` depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment `Run-Query` calls against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If `Get-Experiment` doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the `Run-Query` fallback — do not invent per-segment significance verdicts.
+
+---
+
+## The mental model
+
+A segment breakdown asks: _did the treatment affect different user segments differently?_ It has three possible outcomes per segment:
+
+1. **The segment moved in the same direction as the overall effect**, with similar magnitude → reinforces the overall verdict; nothing new.
+2. **The segment moved much more or less than overall**, but in the same direction → heterogeneity; the effect is concentrated in a subset.
+3. **The segment moved in the _opposite_ direction** to overall → Simpson's paradox or a real reversal — this is where segment analysis earns its keep.
+
+Reading a segment breakdown well means recognizing which of those three you're looking at and not mistaking noise for any of them.
+
+---
+
+## Per-segment polarity recipe — apply per row
+
+The same recipe from the per-metric reference applies _inside_ each segment. Don't take a shortcut.
+
+- For each segment × metric × non-control variant, look at the row's `lift` and bucket (positive/negative/no).
+- Translate sign-of-lift into business polarity using `metric.direction`. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary.
+- Filter out the control row in each segment.
+
+Surprisingly easy to forget when you're scanning a wide table — re-apply polarity per row.
+
+---
+
+## Sample-size floor per segment
+
+Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. As a rule of thumb, the same ~350-per-variant floor used for overall trustworthiness applies per segment.
+
+- Segments below the floor → mark "insufficient sample, treat as directional only."
+- A "significant" lift on a 50-user-per-variant segment is almost always noise. Say so.
+- If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice.
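+
+Applying the floor is a simple filter. The sketch below assumes per-segment exposure counts have already been gathered into a nested dictionary (segment → variant → exposed users); that shape is an assumption for illustration, not the `Get-Experiment` response format.
+
+```python
+from typing import Dict, List, Tuple
+
+SAMPLE_FLOOR = 350  # rule-of-thumb per-variant floor from the text
+
+
+def split_by_floor(
+    exposures: Dict[str, Dict[str, int]], floor: int = SAMPLE_FLOOR
+) -> Tuple[List[str], List[str]]:
+    """Separate segments whose smallest variant clears the floor from the rest."""
+    reliable, directional_only = [], []
+    for segment, per_variant in exposures.items():
+        if per_variant and min(per_variant.values()) >= floor:
+            reliable.append(segment)
+        else:
+            directional_only.append(segment)
+    return reliable, directional_only
+```
+
+Segments in the second list are candidates for pooling (for example into "RoW") before re-slicing.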
+
+---
+
+## Heterogeneity vs Simpson's paradox vs noise
+
+| What you see                                                                                        | Interpretation                                                                                                                                             |
+| --------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Most segments lift positive, one or two negative, all with overlapping CIs                          | **Noise.** Not heterogeneity. Don't ship a segment-specific story.                                                                                         |
+| One segment lifts much more than the rest, with a tight CI and a clear mechanism                    | **Real heterogeneity.** The change is concentrated in that segment. Consider shipping only to that segment, or revising the hypothesis.                    |
+| Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. |
+| Two opposite-direction effects in different segments that roughly cancel overall                    | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses.    |
+
+When you spot Simpson's paradox, route the user to [health-check-interpretation.md](health-check-interpretation.md) §SRM — it's usually the cause, not a real reversal.
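+
+To make the Simpson's paradox row concrete, here is a numeric sketch with invented counts (not from any real experiment). Treatment wins inside both segments yet loses overall, purely because the variant split differs sharply between segments, which is exactly what a per-segment SRM check would catch.
+
+```python
+# (conversions, exposed users) per variant; the numbers are made up for illustration
+segments = {
+    "A": {"control": (10, 100), "treatment": (108, 900)},
+    "B": {"control": (450, 900), "treatment": (55, 100)},
+}
+
+
+def rate(conversions: int, users: int) -> float:
+    return conversions / users
+
+
+def per_segment_winner(segs: dict) -> dict:
+    """Name the variant with the higher conversion rate in each segment."""
+    return {
+        name: "treatment" if rate(*v["treatment"]) > rate(*v["control"]) else "control"
+        for name, v in segs.items()
+    }
+
+
+def overall_rates(segs: dict) -> dict:
+    """Pool conversions and users across segments for each variant."""
+    totals = {}
+    for variant in ("control", "treatment"):
+        conversions = sum(v[variant][0] for v in segs.values())
+        users = sum(v[variant][1] for v in segs.values())
+        totals[variant] = rate(conversions, users)
+    return totals
+```
+
+Here segment A is 10% vs 12% and segment B is 50% vs 55% in favor of treatment, while the pooled rates are 46% for control and 16.3% for treatment.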
+
+---
+
+## What a "ship only to segment X" recommendation requires
+
+Don't recommend a segment-scoped ship unless **all** of these hold:
+
+1. The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it).
+2. The segment's per-variant sample clears the ~350 floor by a comfortable margin.
+3. The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment.
+4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product.
+5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply.
+
+Otherwise, the segment-only ship is a post-hoc story dressed up as a decision. Recommend confirming with a follow-up experiment scoped to that segment.
+
+---
+
+## When a segment loses but overall wins
+
+This is the everyday case of mixed effects.
+
+- If the losing segment is small and its absolute hit is acceptable, ship to all — but call out the loser in the rationale.
+- If the losing segment is large or has a guardrail regression, recommend iterate, not ship.
+- If the losing segment is a regulated / strategic cohort (paying tier, top customers, EU), default to iterate: guardrails should hold for that cohort, not just overall.
+
+---
+
+## What NOT to do
+
+- ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition.
+- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment `Run-Query` fallback — they're not corrected unless the platform did it.
+- ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal.
+- ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism.
+- ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence).
+
+---
+
+## Output shape
+
+1. **One-sentence segment-level summary** — homogeneous, heterogeneous, or Simpson's-suspicious.
+2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered).
+3. **What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's."
+4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating).
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-of-interest-selection.md
new file mode 100644
index 0000000..ea9f22b
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-of-interest-selection.md
@@ -0,0 +1,116 @@
+# Segment-of-Interest Selection
+
+Open this when the user wants to break results down by user segments (_"slice this by platform"_, _"which segments should I look at?"_, _"are new users responding differently?"_). The goal is to pick 3–5 segments that are **likely to reveal a real effect difference**, rather than slicing every available dimension and ending up p-hacking.
+
+The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them.
+
+---
+
+## Why this matters: the fishing-expedition problem
+
+If you slice an experiment by every available property (10 platforms × 20 countries × 5 plan tiers × …), you will find "significant" segment-level effects by chance alone. The family-wise false positive rate explodes the same way it does for too many primary metrics — except there's usually no platform-level correction across segments. **Pre-committing to a small set of segments, ordered by hypothesis-driven probability, is the discipline that makes segment analysis credible.**
+
+Aim for 3–5 segments, max. If the user wants more, ask which ones are connected to the hypothesis and which are exploration. Mark the exploration set as "hypothesis-generating, not decisional."
+
+---
+
+## The decision tree for picking segments
+
+Walk through these in order. The first match is the most defensible pick.
+
+### 1. Segments the hypothesis explicitly names
+
+If the experiment's `hypothesis` (or `description`) text mentions "new users", "mobile", "Pro tier", "EU customers" — those segments are pre-committed by the experiment design. Always include them.
+
+Look at:
+
+- `experiment.hypothesis`
+- `experiment.description`
+- The setup-side conversation, if present
+
+These are not exploratory; they're the variables the team committed to test.
+
+### 2. Segments where the mechanism is expected to matter
+
+The hypothesis names _what_ the change is and (ideally) _why_ it should work. The "why" tells you which user attributes plausibly moderate the effect:
+
+| Hypothesis mechanism                              | Segments likely to moderate the effect             |
+| ------------------------------------------------- | -------------------------------------------------- |
+| "Reduces first-time friction in onboarding"       | New vs returning; signup source; locale            |
+| "Improves discoverability of feature X"           | Users who previously used X vs not; tenure         |
+| "Speeds up a slow flow"                           | Platform (mobile slower than web); connection type |
+| "Lowers payment friction"                         | Plan tier; payment-method type; geography          |
+| "Replaces a confusing UI element"                 | New vs returning (returning users habituated)      |
+| "Surfaces a feature only relevant to power users" | Engagement-tier cohorts; tenure                    |
+| "Localized copy / pricing change"                 | Country / language                                 |
+
+If you can't articulate _why_ a segment should respond differently, it's not a hypothesis-driven slice. Demote it.
+
+### 3. Segments where the **denominator** plausibly differs
+
+Some properties don't change _behavior_ but change _who gets exposed_. Slicing on these helps catch changed-denominator artifacts before they're called a win.
+
+- Triggered vs untriggered cohorts (if the treatment only fires on certain pages).
+- Platform / app version (the treatment may only ship on a subset of clients).
+- Device class (mobile vs desktop) when the change is platform-specific.
+
+A 1000% lift in `Checkout Screen Viewed` overall usually disappears once you condition on "users who reached the checkout funnel" — that disappearance is the finding.
+
+### 4. Segments where SRM or baseline shift is suspected
+
+If overall SRM is borderline (or failing in one variant only), per-segment SRM can localize the bucketing bug to a specific platform / country / cohort. Examples:
+
+- iOS vs Android (often the SDK bucketing layer differs).
+- Bot-suspicious countries (`bot_traffic` cause from health-check).
+- A specific app version range that shipped a flag-evaluation change.
+
+This is diagnostic segmentation, not interpretation segmentation. Use it when Step 1 of the Decision Tree has already flagged trouble.
+
+### 5. Segments the platform de facto requires
+
+Some user dimensions are so foundational that any results report should mention them once:
+
+- **Platform** — web vs iOS vs Android.
+- **New vs returning** — defined as first session within the experiment window vs before.
+- **Geo region** — EU vs US vs APAC, when results meaningfully differ by regulatory or payment context.
+
+Don't include all three blindly — pick the one(s) most likely to vary given the change.
+
+---
+
+## Sanity checks before committing to a slice
+
+For each segment you want to break down on:
+
+1. **Does each segment value have ~350+ exposed users per variant?** Below that floor, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment.
+2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis.
+3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. "Pro tier" boundaries changed mid-test) invalidate the comparison.
+4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes; slicing on them introduces selection bias rather than stratifying.
+
+---
+
+## How many slices to commit to
+
+| Situation                                                         | Number of slices                |
+| ----------------------------------------------------------------- | ------------------------------- |
+| Hypothesis-driven, well-powered, decisional                       | 3–5 segments, named upfront     |
+| Exploratory ("anything weird?"), flagged as hypothesis-generating | Up to ~10, with explicit caveat |
+| Diagnostic (chasing a failing SRM or strange overall result)      | Whatever helps localize the bug |
+
+If the user wants to "just look at everything", push back: pick the top 3–5 with reasoning, then offer a separate exploratory pass that won't be used for the ship decision.
+
+---
+
+## The pre-commit ritual
+
+Before running the breakdowns, tell the user something like:
+
+> _"Based on the hypothesis (`<one-line summary>`), I'd slice by `<segment A>` and `<segment B>` because `<why each matters>`. I'm intentionally not slicing `<X, Y, Z>` because they don't connect to the proposed mechanism — looking at every dimension makes false positives almost guaranteed. We can do an exploratory pass after, separately from the ship decision. Sound right?"_
+
+Pre-commitment is what separates "segmentation analysis" from "fishing."
+
+---
+
+## Then read the results
+
+Once the segment breakdown is in hand, switch to [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there.
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md
new file mode 100644
index 0000000..88640f4
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md
@@ -0,0 +1,109 @@
+# Session-Replay Analysis Guidance
+
+Open this when the user wants to use session replays to explain a quantitative experiment result — _"why is conversion down in treatment?"_, _"what are users actually doing in the treatment?"_, _"can replays explain the regression?"_. The goal is to turn a number into a behavior story.
+
+> **Tool boundary.** This skill provides the _interpretation_ guidance for replay analysis. The actual replay-fetching tool — pulling replay IDs for control vs treatment cohorts — lives on the platform side (a separate fetch tool exposed alongside `Get-Experiment`, when available). If the fetch tool isn't yet available, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss.
+
+---
+
+## When replays help, when they don't
+
+| Question                                                                                 | Replays help?                                                                         |
+| ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
+| "Why is conversion lower in treatment?"                                                  | Yes — behavior diff is observable.                                                    |
+| "Why is `Checkout Screen Viewed` 10× higher in treatment?" (changed-denominator suspect) | Yes — replays show whether users are _bouncing_ or _converting_ after they get there. |
+| "Why is `time_on_page` higher in treatment?"                                             | Yes — distinguishes engaged reading vs confused dwell.                                |
+| "Is the treatment shipping a regression on iOS only?"                                    | Sometimes — better answered first by segment breakdown.                               |
+| "Why is SRM failing?"                                                                    | No — replays don't show bucketing. Go to health checks.                               |
+| "What's the lift?"                                                                       | No — replays are qualitative; they explain _why_, not what.                           |
+| "Why hasn't this hit statsig yet?"                                                       | No — that's a sample/power question, not a behavior question.                         |
+
+A useful heuristic: replays answer _behavioral_ questions. If the question isn't behavioral, replays will burn time without adding signal.
+
+---
+
+## Cohort selection: which replays to compare
+
+You're looking for **paired contrast**, not a random sample. Pick the cohort that maximizes signal for the specific question.
+
+| Question                                                             | Cohort A (replays to pull)                                 | Cohort B (replays to pull)                                  |
+| -------------------------------------------------------------------- | ---------------------------------------------------------- | ----------------------------------------------------------- |
+| Why is primary metric down in treatment?                             | Treatment users who **failed** the primary action          | Control users who **succeeded** at the primary action       |
+| Why is a guardrail regression appearing?                             | Treatment users who **triggered** the guardrail negatively | Control users who did NOT trigger it                        |
+| Why does treatment show a huge `Screen Viewed` lift (denom shift)?   | Treatment users who reached the screen                     | Same users, looking at whether they completed the next step |
+| Why is engagement higher / lower in a specific segment?              | Treatment users in that segment                            | Control users in the same segment                           |
+| What does the new UI look like in practice?                          | Any treatment users who saw the change                     | Any control users to confirm the baseline UI                |
+
+**Aim for ~5 replays per cohort.** Fewer and you're anecdote-shopping; many more and you'll just confirm what the first 5 already showed. If the first 5 are inconclusive or contradictory, pull 5 more before changing tactics.
+
+Filter by recency — replays from the most recent days of the experiment best reflect steady-state behavior (avoid novelty / primacy noise).
+
+---
+
+## What to actually watch for
+
+Go in with a hypothesis from the quantitative result. Don't watch replays blank-eyed; you'll see "users using the app" and learn nothing.
+
+### Friction / failure patterns
+
+- **Hesitation** — long pause before clicking a key element (often signals confusion).
+- **Misclicks** — clicking non-interactive elements, or rage-clicking a button that didn't work.
+- **Form abandonment** — typing into a field, then leaving without submitting.
+- **Back-button bounce** — landing on the page, then immediately backing out.
+- **Scroll-and-leave** — scrolling without engaging, then exiting.
+
+If treatment has more of these than control, you have a behavior explanation for a primary loss or guardrail regression.
+
+### Layout / discoverability issues
+
+- **CTA below the fold** — users never scrolling to where the new button is.
+- **Element overlap on mobile** — the treatment looks fine in desktop testing but breaks on small screens.
+- **Hidden state** — a tooltip / modal that fires once and is then gone, so the user never sees the key affordance.
+
+These usually explain segment heterogeneity (loss concentrated in mobile, or in a specific viewport size).
+
+### Changed-denominator behavior
+
+If you're investigating a Twyman's-Law-sized lift, look for:
+
+- **Users landing on the new screen and immediately leaving** — explains the inflated `Viewed` event without explaining real conversion.
+- **Users completing the rest of the funnel at a much lower rate per-arrival** — explains why the headline metric grew but downstream metrics didn't follow.
+
+If treatment users _arrive_ at a screen more often but _complete_ at a lower per-arrival rate, the "lift" is a denominator artifact and the per-arrival completion rate is the real story.
+
+### Variant-specific UI issues
+
+- **Treatment showed the wrong copy / wrong asset** — surprisingly common; treatment shipped, but to a subset of routes only.
+- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to [health-check-interpretation.md](health-check-interpretation.md).
+- **Treatment fired twice / persisted state across sessions** — implementation regression.
+
+---
+
+## How to frame the findings
+
+Replay analysis is qualitative. Be honest about that.
+
+- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in `live_metrics`."_
+- ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict.
+
+Tie observations back to specific quantitative results from `Get-Experiment`. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either an unrepresentative cohort sample or a sign that real behavior is more varied than expected.
+
+---
+
+## What NOT to do
+
+- ❌ Use replays to override a clear quantitative verdict. If primaries say "ship" and replays look ugly, the ugliness might be edge cases — confirm with segment analysis first.
+- ❌ Cherry-pick a single dramatic replay. n=1 is anecdote.
+- ❌ Replace segment analysis with replays. Replays explain _behavior_; segments explain _who_. Different questions.
+- ❌ Pull replays from broad cohorts ("all treatment users") — the contrast pair is what reveals signal.
+- ❌ Spend more time on replays than on the headline interpretation. The decision tree comes first; replays are the explanation step after it.
+
+---
+
+## Output shape
+
+1. **The quantitative result the replays are explaining** — link back to the specific metric and verdict.
+2. **Cohorts watched** — what filters were applied to A and B, how many replays in each.
+3. **Patterns observed**, with counts (e.g. "4 of 5 treatment replays showed X; 0 of 5 control replays did").
+4. **The explanation hypothesis** — careful to frame as hypothesis ("consistent with"), not as proof.
+5. **Recommended next action** — usually one of: ship anyway (regression edge-case), iterate (fix the friction), kill (treatment is materially worse), or run a follow-up A/B with the fix.
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md
new file mode 100644
index 0000000..fdad2cd
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md
@@ -0,0 +1,115 @@
+# Why Hasn't This Reached Statistical Significance Yet?
+
+Open this when the user asks why a primary metric is still inconclusive — _"why isn't this stat-sig yet?"_, _"should I wait or stop?"_, _"is this just underpowered?"_. The goal is to help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null**, _without_ recomputing the platform's verdicts.
+
+The actual stop / extend math (sample size, power, MDE) is owned by the **setup-side skill** — point the user there for the formulas. This skill explains _which_ lever to pull, not how to recompute one.
+
+---
+
+## First, rule out a broken result
+
+Inconclusive can mean two very different things:
+
+1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about.
+2. **The result isn't trustworthy at all** (SRM failing, broken data, a peeked Frequentist test, etc.), and "inconclusive" is the wrong frame entirely.
+
+Before answering "why no statsig?", run the trustworthiness gate (Step 1 of the Decision Tree). If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power.
+
+Also check:
+
+- `lift is None` on the primary → no measurement, not "no effect."
+- The primary is in `metrics[]` but missing from `live_metrics` and `results_cache.metrics` → "no measurement."
+- `live_results_errors` is non-null → results are stale or partial; resolve before drawing power conclusions.
+
+---
+
+## The five real reasons an experiment hasn't hit statsig
+
+Walk through these in order. The first one that explains the picture is usually right.
+
+### 1. Not enough sample yet (not enough exposures)
+
+**What to look at**: `live_exposures` per variant vs `settings.sampleSize`; or elapsed days (`end_date - start_date`) vs `settings.endAfterDays`; plus `settings.testingModel`.
+
+- **Sequential** + target not reached → genuinely too early. Recommend **WAIT**.
+- **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe.
+- Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5.
+
+If exposures are falling short of plan because traffic dropped: surface that. `Run-Query` on the exposure event with a date breakdown shows whether something changed mid-experiment.
+
+### 2. Observed effect is smaller than the MDE
+
+**What to look at**: the lift on the primary in `live_metrics[primary][treatment].lift`, plus the MDE the user planned for (typically captured in the experiment's `description` or recovered via the setup-side skill's power math).
+
+- Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1.
+- Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options:
+  - **Accept the null** — at this size, the change isn't moving the metric. Document and move on.
+  - **Resize and rerun** — if a smaller effect would still be ship-worthy, re-run with a larger sample (lower MDE).
+- Observed lift much **larger** than planned MDE but still not significant → unusual; likely high variance (see reason 3) or insufficient exposures (reason 1).
+
+### 3. Variance is too high (metric is too noisy)
+
+**What to look at**: distribution type of the metric, plus `settings.cuped.enabled` and `settings.winsorization.enabled`.
+
+- **Gaussian** metric (revenue, time-on-page) with no Winsorization → a few extreme users ("whales") inflate variance, widen CIs, and crush power. Recommend enabling Winsorization (default percentile 95) on the next run.
+- **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume.
+- **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the absolute size of a given relative lift. Relative lifts on rare events (rates near 0%) need much more sample than the same relative lift at mid-range rates, and rates near 100% leave little headroom to improve.
+- **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%.
+- **CUPED enabled on a new-user-only cohort** → CUPED has no effect (no pre-exposure data exists). Not a misconfiguration to "fix," but variance reduction simply didn't happen.
+
+### 4. Traffic split is starving the variant
+
+**What to look at**: `settings.srm.targetAllocations` and `live_exposures` per variant.
+
+- Even split (50/50) → a balanced split is already optimal for power, so the allocation is usually not the issue.
+- Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later.
+- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs its own ~350+ exposed users for the per-comparison stats to be reliable. Adding arms costs power per comparison.
+
+Never change traffic allocation mid-Frequentist test — it invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment.
+
+### 5. Exposure config is filtering more users than the user expects
+
+**What to look at**: the exposure tracking method (`$experiment_started` event volume), any audience filters on the backing feature flag, and `settings.excludeQA`.
+
+- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." `Get-Feature-Flag` reveals the rollout rules; `Run-Query` on `$experiment_started` confirms how many users actually got exposed.
+- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with `Run-Query`.
+- `settings.excludeQA` was off and you suspect internal traffic is dominating one variant → enable it on the next run (the sample will be cleaner but also smaller).
+
+**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md).
+
+---
+
+## Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL?
+
+Once you know which reason fits, the recommendation almost picks itself.
+
+| Reason                                 | Recommendation                                                                                               |
+| -------------------------------------- | ------------------------------------------------------------------------------------------------------------ |
+| Not enough sample yet, still ACTIVE    | **WAIT.** Show projected end date based on observed traffic.                                                 |
+| Not enough sample yet, concluded early | **EXTEND** (Frequentist: relaunch with longer planned duration; Sequential: resume if possible).             |
+| Effect << MDE                          | **ACCEPT NULL** if the planned MDE is the smallest ship-worthy effect; otherwise **BOOST POWER** and re-run. |
+| Variance too high                      | **BOOST POWER**: enable CUPED, enable Winsorization, switch to a less noisy metric proxy.                    |
+| Variant starved by traffic split       | **EXTEND** (if remaining time is enough) or restart with rebalanced split.                                   |
+| Exposure config is filtering           | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample.               |
+| Experiment finished, well-powered      | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters.       |
+
+When recommending EXTEND on an active experiment, the call is `Update-Experiment` with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the setup-side skill for the power math.
+
+---
+
+## What NOT to suggest
+
+- ❌ **Stop early on a favorable peek** in a Frequentist test — that's exactly the false-positive inflation problem.
+- ❌ **Switch testing model mid-experiment** — restart, don't morph.
+- ❌ **Add more primary metrics** to "fish" for a win — multiplies the family-wise FPR. If a single primary is inconclusive, more primaries make the picture worse, not better.
+- ❌ **Re-run identical hypothesis on the same audience right after concluding "no effect"** — without a power change, you'll get the same answer.
+- ❌ **Claim "no effect"** from an underpowered inconclusive result — the right framing is "the experiment wasn't sized to detect the effect we observed."
+
+---
+
+## Output shape
+
+1. **The reason** (one of the five above), in one sentence.
+2. **The evidence from `Get-Experiment`** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.).
+3. **Recommendation** from the table above, with the specific `Update-Experiment` call or follow-up action.
+4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment.
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md
new file mode 100644
index 0000000..4e344d3
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md
@@ -0,0 +1,236 @@
+---
+name: experiment-results
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts that `Get-Experiment` returns — never recomputes thresholds.
+license: Apache-2.0
+---
+
+# Experiment Results Interpretation
+
+You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. **Read the Decision Tree first** and use it as the spine of every interpretation. Drop into the deeper references only when the situation calls for it.
+
+## Requirements
+
+- Access to Mixpanel via the MCP server (specifically the `Get-Experiment` tool — and, for ship/kill decisions, `Update-Experiment`).
+- This skill reads the verdicts that `Get-Experiment` already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values.
+
+## When to use this skill
+
+Trigger when the user asks anything about reading an experiment's results or its health. Common phrasings:
+
+- "What do these results mean?" / "Should we ship this?"
+- "Is this experiment trustworthy?" / "Why is SRM failing?"
+- "Why hasn't this hit statistical significance yet?"
+- "Break this down by `<segment>`" / "What segments should I look at?"
+- "What does this Retro A/A failure mean?"
+- "Can you compare the session replays for control vs treatment?"
+
+Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the setup-side skill or tool.
+
+---
+
+## How to read `Get-Experiment` output
+
+Always call `Get-Experiment` with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.**
+
+| Concept                      | Live (preferred)                  | Cached fallback                             |
+| ---------------------------- | --------------------------------- | ------------------------------------------- |
+| Per-variant exposure counts  | `live_exposures`                  | `exposures_cache` (strip `$`-prefixed keys) |
+| SRM check                    | `live_srm_analysis`               | `exposures_cache.$srm_analysis`             |
+| Per-metric per-variant stats | `live_metrics[metricId][variant]` | `results_cache.metrics[metricId][variant]`  |
+| Bucketed summary             | recompute from `live_metrics`     | `results_cache.summary`                     |
+| When was this computed?      | "now"                             | `exposures_cache.$last_computed`            |
+
+If `live_results_errors` is non-null, the live path failed. Use the cache, caveat that data is stale, and surface the error to the user — the underlying failure may need fixing before any decision.
+
+If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat it as "no effect."
+
+See [references/get-experiment-fields.md](references/get-experiment-fields.md) for the full field map and which fields drive each step below.
+
+---
+
+## The Decision Tree
+
+This is the spine of every interpretation. Run the steps **in order**. **Stop at the first failure** — do not proceed to step N+1 if step N flags a problem.
+
+```
+┌─ Step 1: TRUSTWORTHINESS GATE ───────────────┐
+│   SRM ok? → exposures sufficient? →          │
+│   Retro A/A clean? → minimum duration met? → │
+│   no misconfig?                              │
+│        │                                     │
+│      fail → STOP. See references/            │
+│             health-check-interpretation.md   │
+└──────────────┬───────────────────────────────┘
+               ↓ pass
+┌─ Step 2: STATISTICAL SIGNIFICANCE ───────────┐
+│   For each non-control variant × primary,    │
+│   apply the polarity recipe (sign-of-lift +  │
+│   metric.direction). Significant + correct   │
+│   polarity = "win"; significant + wrong      │
+│   polarity = "loss".                         │
+│        │                                     │
+│   nothing significant on primaries →         │
+│   see references/why-no-statsig.md           │
+└──────────────┬───────────────────────────────┘
+               ↓ at least one primary win
+┌─ Step 3: GUARDRAIL CHECK ────────────────────┐
+│   Any guardrail significant in the wrong     │
+│   polarity? → regression → ITERATE not ship  │
+└──────────────┬───────────────────────────────┘
+               ↓ guardrails clean
+┌─ Step 4: PRACTICAL SIGNIFICANCE ─────────────┐
+│   Convert the lift on the primary into       │
+│   absolute terms. Is it big enough to        │
+│   matter to the business?                    │
+│   Statistically significant ≠ ships.         │
+└──────────────┬───────────────────────────────┘
+               ↓ meaningful magnitude
+┌─ Step 5: VERDICT ────────────────────────────┐
+│   Trust ✓ + primary win + guardrails ✓ +     │
+│   meaningful magnitude → SHIP                │
+│   Trust ✓ + primary win + guardrail regress  │
+│     → ITERATE                                │
+│   Trust ✓ + primary neutral after target     │
+│     → KILL or ITERATE                        │
+│   Trust ✗                                    │
+│     → DO NOT DECIDE; report failures         │
+│   Hasn't reached target sample/duration      │
+│     → WAIT (or extend, or restart with more  │
+│       power — see why-no-statsig.md)         │
+└──────────────────────────────────────────────┘
+```
+
+### Step 1 — Trustworthiness gate (consume the verdicts)
+
+Read these fields. Treat the platform's verdict as authoritative — do not reapply thresholds yourself.
+
+| Check                    | Field to read                                                                                          | What "fail" looks like                                                                                                                                         |
+| ------------------------ | ------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| SRM                      | `live_srm_analysis` (or `exposures_cache.$srm_analysis`)                                               | Platform flags as failing — do not compute the chi-square yourself.                                                                                            |
+| Sufficient exposures     | `live_exposures` per variant                                                                           | Platform-flagged "insufficient." If unflagged but per-variant counts look suspicious, route the user to the health-check reference; do not invent a threshold. |
+| Retro A/A (pre-exp bias) | `settings.preExperimentBias` enabled, then the analysis                                                | Platform flags a significant pre-period difference.                                                                                                            |
+| Minimum elapsed time     | `end_date - start_date`                                                                                | Less than ~3 days regardless of sample size — interpretation is unreliable.                                                                                    |
+| Ran for planned duration | `start_date`, `end_date`, `settings.endAfterDays`/`sampleSize`/`endCondition`, `settings.testingModel` | Frequentist: ended before reaching configured target = peeking. Sequential: early stop on significance is allowed.                                             |
+| Misconfiguration         | See [references/get-experiment-fields.md](references/get-experiment-fields.md) §Misconfig              | Any flagged misconfig invalidates analysis.                                                                                                                    |
+
+If any of these fail, **stop**. Tell the user explicitly that results are not trustworthy. Open [references/health-check-interpretation.md](references/health-check-interpretation.md) for the per-failure root-cause checklists, recommended actions, and the Kohavi framing ("SRM is the #1 trustworthiness check; Twyman's Law: any unusually clean result is more likely a bug than a discovery").
+
+### Step 2 — Statistical significance with polarity
+
+**Critical**: `summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by whether the lift is good for the business. You MUST apply the polarity recipe using each metric's `direction` before declaring a winner.
+
+#### Polarity recipe
+
+`metric.direction` is `"up"` or `"down"` (defaults to `"up"` if unset on the source metric).
+
+- `lift is None` or `lift == 0` → **neutral**.
+- `direction == "up"` → **positive** if `lift > 0`, else **negative**.
+- `direction == "down"` → **positive** if `lift < 0`, else **negative**.
+
+A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. Never trust the bucket name as the business verdict.
+
+#### How to read the summary
+
+1. **Filter out the control row.** Use `settings.controlKey` (typically `"control"`; may be empty). Control-vs-control always has lift 0 and inflates the "no effect" count. If `controlKey` is empty, identify control by: (a) the variant literally named `"control"`, (b) the variant whose lift is uniformly 0 across all metrics, or (c) ask the user.
+2. For each non-control variant, look up the metric in `summary.positive` / `summary.negative` / `summary.no`. **Trust the bucket name as the significance signal** — the `significance` field on each item may be `null` even when the bucket is meaningful.
+3. Apply the polarity recipe using `metric.direction` to translate sign-of-lift into win/loss.
+4. If `lift is None` in a summary item, **the calculation failed** for that variant — surface it. Do not interpret as "no effect."
+
+The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is set to `"bonferroni"` or `"benjamini-hochberg"` (across primaries × non-control variants). **Don't re-correct.**
+
+Turning the per-metric numbers into a plain-language verdict (lift + CI + p-value → "small win," "large regression," "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
+
+If nothing on the primaries is significant and the user is asking "why hasn't this hit statsig?", route to [references/why-no-statsig.md](references/why-no-statsig.md).
+
+### Step 3 — Guardrail check
+
+Apply the polarity recipe to every guardrail metric (`metric.type == "guardrail"`).
+
+- A small primary win + a clear guardrail regression → usually **iterate, do not ship**.
+- "Not significant" on a guardrail does NOT mean "no regression." It means the experiment couldn't _detect_ one at the chosen confidence. If the guardrail is critical (latency, error rate, retention), flag whether it was powered to detect a meaningful regression.
+- Polarity matters here too: a guardrail named "errors" with `direction: "down"` and lift `+5%` (significant) is a regression even though it lands in `summary.positive`.
+
+### Step 4 — Practical significance
+
+Statistical significance ≠ business impact. For every primary metric that won:
+
+1. Read the **baseline value** from the control variant: `live_metrics[metricId][controlKey].value`.
+2. Read the **lift** from the winning variant's row.
+3. Compute absolute lift: `baseline_value × lift`.
+4. Project to population per period: ask the user for traffic estimates if not in context.
+
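+For a conversion-rate metric, a 5% relative lift on a 20% baseline serving 1M users/week is one percentage point, or roughly 10,000 extra conversions per week: enormous. The same 5% lift on a 0.1% baseline serving 1k users/week is about 0.05 extra conversions per week: negligible. Always ground the user in absolute terms before declaring a win meaningful.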
+
+**Twyman's Law check**: before celebrating any lift > ~30%, ask: did the treatment change who is _exposed_ to this metric, not just how they behave? See the changed-denominator notes in [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
+
+If `value` or `sampleSize` is `null` (common when live computation timed out), call `Run-Query` on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total.
+
+### Step 5 — Verdict
+
+| Situation                                                              | Recommendation                                                                                                                                               |
+| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** `Update-Experiment(action="decide", success=true, variant=<winner>, message=<rationale>)`                                                          |
+| Trust ✓, primary polarity positive, guardrail polarity negative        | **ITERATE.** Investigate the regression; do not auto-ship.                                                                                                   |
+| Trust ✓, primary polarity neutral after target sample reached          | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md).                                  |
+| Trust ✓, target sample/duration not yet reached                        | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)).                                         |
+| Trust ✗                                                                | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). |
+
+For **multi-variant tests**, pivot the summary by variant and evaluate each treatment independently against control. The winner is the variant with the most polarity-corrected primary wins, zero guardrail regressions, and the largest practical impact. If multiple qualify, prefer the simpler / lower-risk variant. If none qualify, recommend kill or iterate.
+
+`message` is required on every `decide` call — include the rationale, the metrics evaluated, and any tradeoffs accepted.
+
+Special variant constants when `success=true`:
+
+- `__no_variant_shipped__` — ship the change without picking a variant
+- `__defer_variant_decision__` — defer (status becomes `SUCCESS_DEFERRED` in UI)
+
+For a kill, pass `success=false`.
+
+---
+
+## Going deeper
+
+Once the spine is clear, the user often asks one of these follow-ups. Open the relevant reference on demand:
+
+| User asks about…                                                                | Open                                                                                             |
+| ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
+| SRM failing, Retro A/A failing, exposures insufficient, or any Step 1 fail      | [references/health-check-interpretation.md](references/health-check-interpretation.md)           |
+| "Translate this lift / CI / p-value into English"                               | [references/per-metric-interpretation.md](references/per-metric-interpretation.md)               |
+| "Why hasn't this hit statsig yet? Should we wait or stop?"                      | [references/why-no-statsig.md](references/why-no-statsig.md)                                     |
+| "Which segments should I break this down on?"                                   | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md)       |
+| "What does this segment-by-segment result mean?" (when platform support exists) | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) |
+| "Can session replays help explain this result?"                                 | [references/session-replay-analysis.md](references/session-replay-analysis.md)                   |
+| "Which `Get-Experiment` field has X?"                                           | [references/get-experiment-fields.md](references/get-experiment-fields.md)                       |
+
+---
+
+## Output
+
+Default to this shape unless the user asks for something else:
+
+1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`.
+2. **Why**, walking through the decision tree steps that mattered (skip the steps that were clearly fine).
+3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, with the polarity-corrected reading of each. Include the absolute-impact translation for any win.
+4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, etc.
+5. **Suggested next action** — the `Update-Experiment` call to make, or the deeper investigation to run.
+
+If `Get-Experiment` is unavailable or returns errors, say so — do not invent a verdict.
+
+---
+
+## Common pitfalls (cheat sheet)
+
+- ⛔ **Skipping Step 1** because the lifts look exciting (Twyman's Law)
+- ⛔ **Recomputing thresholds** instead of consuming the verdicts the platform already returned
+- ⛔ **Not applying polarity** — reading `summary.positive` as "good" without checking `metric.direction`
+- ⛔ Trusting a >30% lift without checking whether the **denominator changed**
+- ⛔ **Including the control row** when counting wins/losses (filter by `settings.controlKey`)
+- ⛔ Treating a `null` lift as "no effect" — it means computation failed
+- ⛔ Treating a missing primary (in `metrics[]` but not in `live_metrics`/`results_cache.metrics`) as "no effect" — it's "no measurement"
+- ⛔ Interpreting an experiment that has run for fewer than 3 days instead of refusing
+- ⛔ Forgetting to call out a **non-default `confidenceLevel`** (0.9 inflates false positives; 0.99 is conservative)
+- ⛔ Treating **secondary-metric significance** as decisional (it isn't, ever)
+- ⛔ Conflating **statistical significance** with **practical significance**
+- ⛔ Ignoring **guardrail regressions** because the primary won
+- ⛔ Calling a single significant primary a "win" while multiple-testing correction is off; look at the aggregate, or enable correction
+- ⛔ Concluding "no effect" from an underpowered inconclusive result (route to [references/why-no-statsig.md](references/why-no-statsig.md))
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/evals/README.md b/plugins/mixpanel-mcp/skills/experiment-results/evals/README.md
new file mode 100644
index 0000000..71278d6
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/experiment-results/evals/README.md
@@ -0,0 +1,34 @@
+# Eval fixtures — `experiment-results`
+
+Each fixture is a self-contained prompt + expected-behavior pair for the `experiment-results` skill. They are seeded from PRD customer quotes — the customer pain that motivated this skill in the first place.
+
+The fixtures are not auto-runnable yet (no harness lives in this repo). They're written for two uses:
+
+1. **Manual rehearsal** — a human (or another agent) can read the prompt, simulate the response the skill should produce, and check it against the `expected_behavior` field.
+2. **Regression checkpoint when a runner exists** — when an eval harness is added in this repo, these prompts plug in directly: each YAML doc becomes one case, the `expected_behavior` field becomes the grader rubric.
+
+When you change `SKILL.md`, walk these fixtures and confirm each one still produces the expected behavior. If a fixture starts failing, decide whether the skill regressed or the fixture itself needs updating.
+
+---
+
+## Fixtures
+
+| Fixture                         | PRD source quote                                                                                                         | What it exercises                                                                              |
+| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------- |
+| `pelando-plus-2-others.yaml`    | Pelando — _"+2 others"_ (results too noisy for the user to triage which results to act on)                               | Decision tree spine + per-metric polarity; ship/iterate verdict against multi-variant noise.   |
+| `confetti-8-metrics.yaml`       | Confetti — _"8 metrics for new visitors"_ (many primaries; user wants segment-of-interest selection on new vs returning) | Segment-of-interest selection; multiple-testing correction warning; per-metric interpretation. |
+| `polarsteps-no-workaround.yaml` | Polarsteps — _"no documented workaround"_ (user wants to understand SRM failure with no canned path forward)             | Health-check interpretation; Kohavi framing; ordered-causes recommendation.                    |
+
+Each YAML doc has the same shape:
+
+```yaml
+name: <slug>
+prd_source: <one-line attribution>
+trigger_phrase: <what the user types>
+get_experiment_summary: <key fields the skill would see; not full response — just enough for the eval>
+expected_behavior:
+  verdict: <SHIP | ITERATE | KILL | WAIT | DO_NOT_DECIDE>
+  must_mention: [<phrases / framings the skill must cover>]
+  must_not_do: [<failure modes the skill should avoid>]
+  references_consulted: [<which reference files the skill should pull open>]
+```
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/evals/confetti-8-metrics.yaml b/plugins/mixpanel-mcp/skills/experiment-results/evals/confetti-8-metrics.yaml
new file mode 100644
index 0000000..da61d9e
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/experiment-results/evals/confetti-8-metrics.yaml
@@ -0,0 +1,48 @@
+name: confetti-8-metrics
+prd_source: |
+  Confetti — "8 metrics for new visitors"
+  Customer is running an experiment with 8 primary-ish metrics and explicitly
+  cares about new-visitor behavior. They want a segment-driven read, not a
+  dump of 8 lifts. The skill should pre-commit to segments tied to the
+  hypothesis (new vs returning), call out the multiple-testing concern with
+  8 metrics, and produce a verdict scoped to the segment that matters.
+
+trigger_phrase: |
+  We're tracking 8 metrics on this onboarding redesign experiment and I really
+  care about how new visitors respond. Can you read this and tell me whether
+  it's a ship for the new-user audience?
+
+get_experiment_summary:
+  hypothesis: |
+    If we redesign the first-session onboarding flow, then activation rate
+    among NEW visitors will increase by ≥5% relative, because reducing
+    cold-start friction shortens time-to-first-value.
+  settings:
+    controlKey: "control"
+    multipleTestingCorrection: "off" # mis-configured given 8 primaries
+    testingModel: "sequential"
+    confidenceLevel: 0.95
+  metrics_count: 8
+  primary_metrics_summary: |
+    Of 8 primaries: 2 significant positive (polarity-correct), 1 significant
+    negative (a "Time to First Action" metric with direction=down where
+    lift is -7% — actually a WIN once polarity-applied), 5 inconclusive.
+
+expected_behavior:
+  verdict: WAIT
+  must_mention:
+    - "Pre-commit to the new-vs-returning segment because the hypothesis names new visitors as the cohort that matters"
+    - "Recommend at most 3–5 segments and call new vs returning the primary slice"
+    - "Multiple-testing correction is OFF but there are 8 primaries; flag the inflated family-wise FPR explicitly (rough order: with 8 primaries × 1 variant at α=0.05 and no correction, the family-wise FPR would be about 1 − 0.95^8 ≈ 34% if the tests were independent, high enough to make a single significant result inconclusive on its own)"
+    - "Apply polarity recipe per metric — flag the Time to First Action 'negative bucket' as a win because direction=down"
+    - "Without correction enabled, don't ship on a single significant primary; either enable correction and re-analyze or look at the aggregate of all 8"
+    - "Verdict is WAIT (re-analyze with multiple-testing correction enabled, segmented to new visitors) — not SHIP"
+  must_not_do:
+    - "Slice by every available property after the fact (the fishing-expedition warning)"
+    - "Treat the 'Time to First Action' metric in the negative bucket as a loss without polarity-correcting"
+    - "Call the experiment a ship because 2 of 8 primaries are significant positive"
+    - "Pretend the agent can compute the corrected p-values itself — instead, recommend re-running with multipleTestingCorrection enabled"
+  references_consulted:
+    - segment-of-interest-selection.md
+    - per-metric-interpretation.md
+    - health-check-interpretation.md # for the misconfig flag
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/evals/pelando-plus-2-others.yaml b/plugins/mixpanel-mcp/skills/experiment-results/evals/pelando-plus-2-others.yaml
new file mode 100644
index 0000000..f634236
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/experiment-results/evals/pelando-plus-2-others.yaml
@@ -0,0 +1,79 @@
+name: pelando-plus-2-others
+prd_source: |
+  Pelando — "+2 others"
+  Customer reported that when a multi-variant test concludes with a winner banner
+  plus a small-print "+2 others", they cannot tell which non-winner variants are
+  benign vs which contain a guardrail regression they need to act on. The skill
+  should pivot the summary per variant, polarity-correct each, and call out the
+  losers, not gloss over them.
+
+trigger_phrase: |
+  Can you make sense of this experiment for me? The UI shows treatment_a winning
+  on the primary plus "+2 others" but I have no idea whether treatment_b or
+  treatment_c are okay to ignore.
+
+get_experiment_summary:
+  settings:
+    controlKey: "control"
+    multipleTestingCorrection: "benjamini-hochberg"
+    testingModel: "sequential"
+  metrics:
+    - id: m_primary
+      type: primary
+      direction: up
+      name: "Activation Rate"
+    - id: m_guardrail_latency
+      type: guardrail
+      direction: down
+      name: "p95 Latency (ms)"
+    - id: m_guardrail_errors
+      type: guardrail
+      direction: down
+      name: "Error Rate"
+  live_exposures:
+    control: 41123
+    treatment_a: 40987
+    treatment_b: 41210
+    treatment_c: 40755
+  live_srm_analysis:
+    # platform-flagged passing
+    p_value: 0.42
+  summary:
+    positive:
+      - {
+          metricId: m_primary,
+          variant: treatment_a,
+          lift: 0.041,
+          liftConfidence: 0.95,
+        }
+      - {
+          metricId: m_guardrail_latency,
+          variant: treatment_b,
+          lift: 0.08,
+          liftConfidence: 0.95,
+        }
+    negative:
+      - {
+          metricId: m_primary,
+          variant: treatment_c,
+          lift: -0.022,
+          liftConfidence: 0.95,
+        }
+    no:
+      - { metricId: m_primary, variant: treatment_b, lift: 0.004 }
+
+expected_behavior:
+  verdict: ITERATE
+  must_mention:
+    - "Pivot the summary by variant before declaring a winner"
+    - "treatment_a wins on the primary but treatment_b shows a latency regression once polarity is applied (direction=down + lift +8% = bad)"
+    - "treatment_c regresses the primary"
+    - "Multi-variant verdict requires each treatment to be judged independently against control"
+    - "Recommend iterate, not ship — at minimum, do not ship treatment_b, and investigate treatment_c before re-running"
+  must_not_do:
+    - "Quietly drop treatment_b and treatment_c into '+2 others' without polarity-checking each"
+    - "Trust the bucket name (positive/negative) as the business verdict"
+    - "Re-apply multiple-testing correction on top of the platform's benjamini-hochberg"
+  references_consulted:
+    - per-metric-interpretation.md
+    - get-experiment-fields.md
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/evals/polarsteps-no-workaround.yaml b/plugins/mixpanel-mcp/skills/experiment-results/evals/polarsteps-no-workaround.yaml
new file mode 100644
index 0000000..325a3bf
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/experiment-results/evals/polarsteps-no-workaround.yaml
@@ -0,0 +1,61 @@
+name: polarsteps-no-workaround
+prd_source: |
+  Polarsteps — "no documented workaround"
+  Customer's experiment is failing SRM and they cannot find a documented path
+  forward. The skill should consume the platform's SRM verdict (not recompute
+  chi-square), cite Kohavi ("SRM is the #1 trustworthiness check"), and
+  surface ordered likely causes plus a specific recommended action — not
+  punt with "investigate further."
+
+trigger_phrase: |
+  My experiment is failing SRM and the result lift looks too good to be true
+  (+18% on the primary). The docs just say "investigate" — what does that
+  actually mean here? Should I trust the lift?
+
+get_experiment_summary:
+  settings:
+    controlKey: "control"
+    srm:
+      enabled: true
+      targetAllocations: { control: 50, treatment: 50 }
+    excludeQA: false # potentially relevant
+  live_exposures:
+    control: 20180
+    treatment: 21050
+  live_srm_analysis:
+    # platform-flagged FAILING
+    p_value: 0.00002
+    chi_square: 18.4
+  summary:
+    positive:
+      - {
+          metricId: m_primary,
+          variant: treatment,
+          lift: 0.18,
+          liftConfidence: 0.95,
+        }
+  metrics:
+    - id: m_primary
+      type: primary
+      direction: up
+      name: "Trip Plan Created"
+
+expected_behavior:
+  verdict: DO_NOT_DECIDE
+  must_mention:
+    - "SRM is failing per the platform's verdict — do NOT trust the +18% lift"
+    - "Cite Kohavi: SRM is the #1 trustworthiness check; when SRM is failing, lift, p-values, and confidence intervals cannot be attributed to the treatment"
+    - "Twyman's Law: a +18% lift on a failing-SRM experiment is more likely a bucketing bug than a genuine win"
+    - "Likely causes ordered most → least likely: bucketing_bug, biased_assignment, bot_traffic, exposure_tracking_bug, ramp_up_timing"
+    - "Recommended action: pause_and_investigate — pause before drawing conclusions; randomization assumption is broken"
+    - "Concrete next steps: compare live_exposures to targetAllocations; check feature-flag rules and history via Get-Feature-Flag; Run-Query $experiment_started by variant; enable settings.excludeQA before relaunch given it is currently off"
+    - "Do NOT recompute the SRM chi-square — consume the platform's verdict"
+    - "Restart with fixed bucketing once the cause is found; do NOT re-conclude on the broken data"
+  must_not_do:
+    - "Calculate the chi-square or re-derive an SRM p-value threshold"
+    - "Recommend shipping or treating the +18% lift as real"
+    - "Hand the user a generic 'investigate further' without ordered causes and an action"
+    - "Skip Kohavi framing — it's the whole reason this check is the #1 gate"
+  references_consulted:
+    - health-check-interpretation.md
+    - get-experiment-fields.md
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/get-experiment-fields.md b/plugins/mixpanel-mcp/skills/experiment-results/references/get-experiment-fields.md
new file mode 100644
index 0000000..efaeae5
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/experiment-results/references/get-experiment-fields.md
@@ -0,0 +1,161 @@
+# `Get-Experiment` Field Map
+
+Quick reference for which `Get-Experiment` response field drives each interpretation. Always call with `compute_exposures=true, compute_metrics=true`.
+
+This reference is **read-only domain knowledge** for the agent. It does NOT define thresholds — every "fail condition" listed below is a _characterization_ of how the platform itself already classifies the field, not a threshold this skill should re-apply.
+
+---
+
+## Identity & lifecycle
+
+```
+id, name, description, hypothesis, status, start_date, end_date
+creator_email, tags, url, workspace_id
+feature_flag_id                       → for feature-flag-based experiments
+settings.controlKey                   → variant key treated as control (often "control"; may be "")
+```
+
+`status` is one of `"concluded" | "success" | "fail"` (the UI may additionally show `SUCCESS_DEFERRED` for the special variant constant — see "Decision metadata" below).
+
+---
+
+## Trustworthiness
+
+```
+live_srm_analysis                     → SRM verdict (consume — don't recompute)
+  .p_value
+  .chi_square
+live_exposures[<variantKey>]          → per-variant exposure counts (live)
+exposures_cache[<variantKey>]         → per-variant exposure counts (cached fallback)
+exposures_cache.$srm_analysis         → cached SRM analysis
+exposures_cache.$last_computed        → when the cache was last refreshed
+settings.srm.enabled                  → whether the SRM check ran
+settings.srm.targetAllocations        → expected per-variant allocation (percent)
+settings.preExperimentBias            → whether Retro A/A was enabled
+settings.excludeQA                    → whether QA traffic was filtered
+live_results_errors                   → non-null = live computation failed; surface and fall back to cache
+```
+
+---
+
+## Per-metric per-variant results
+
+```
+live_metrics[<metricId>][<variantKey>]
+  .value             → metric value for this variant
+  .sampleSize        → sample size for this variant on this metric
+  .lift              → (treatment - control) / control  (0 for control row)
+  .liftConfidence    → confidence LEVEL used (e.g. 0.95) — NOT the CI width
+  .significance      → "YES_POSITIVE" | "YES_NEGATIVE" | "NO"  (sign-of-lift, NOT polarity)
+
+results_cache.metrics[<metricId>][<variantKey>]  → cached fallback, same shape
+```
+
+---
+
+## Bucketed summary
+
+```
+results_cache.summary.positive[]      → items with significance == "YES_POSITIVE" (lift > 0, sig)
+results_cache.summary.negative[]      → items with significance == "YES_NEGATIVE" (lift < 0, sig)
+results_cache.summary.no[]            → items with significance == "NO"
+
+Each item:
+  .metricId
+  .variant
+  .value
+  .lift
+  .liftConfidence
+  .sampleSize
+  .significance
+```
+
+**Pre-process the summary**: filter out rows where `variant == settings.controlKey` (control-vs-control is mechanical noise), then apply the polarity recipe before drawing any conclusion.
+
+---
+
+## Metric catalog (for polarity lookups)
+
+```
+metrics[]
+  .id, .name
+  .type ("primary" | "guardrail" | "secondary")
+  .direction ("up" | "down")          → always set; defaults to "up" if the source metric was unset
+```
+
+Build a lookup `metric_id → (type, direction)` and join to summary rows during interpretation.
+
+---
+
+## Settings that change interpretation
+
+```
+settings.confidenceLevel              → significance threshold (e.g. 0.95)
+settings.testingModel                 → "frequentist" or "sequential"
+settings.endCondition                 → "sample_size" or "days"
+settings.sampleSize / .endAfterDays   → planned end target
+settings.multipleTestingCorrection    → "off" | "bonferroni" | "benjamini-hochberg"
+settings.cuped.enabled                → CUPED variance reduction applied
+settings.cuped.preExposureDatePreset  → pre-exposure window
+settings.winsorization.enabled        → outlier capping applied
+settings.winsorization.percentile     → cap percentile (default 95; lower values are extreme)
+```
+
+---
+
+## Decision metadata (post-decide)
+
+```
+results_cache.message                 → decision rationale
+results_cache.variant                 → shipped variant key (or special constant)
+status                                → "concluded" | "success" | "fail"
+```
+
+Special variant constants for `success=true`:
+
+- `__no_variant_shipped__` — ship the change without picking a variant.
+- `__defer_variant_decision__` — defer (UI shows `SUCCESS_DEFERRED`).
+
+For a kill, pass `success=false`.
+
+---
+
+## Lifecycle hand-off
+
+```
+Update-Experiment(
+  experiment_id=<id>,
+  experiment={
+    "action": "decide",
+    "success": true | false,
+    "variant": "<winner_key>",      # required when success=true
+    "message": "<rationale: metrics evaluated, polarity, tradeoffs accepted>"
+  }
+)
+```
+
+`message` is required on every `decide` call.
+
+---
+
+## Misconfig field map (cross-link)
+
+For _how_ to react to each of these, see [health-check-interpretation.md](health-check-interpretation.md) §7.
+
+- `settings.multipleTestingCorrection in {"off", null}` with 2+ primaries × 1+ non-control variants
+- `settings.winsorization.enabled == true` with `percentile` very low (< ~80) or very high (> ~99)
+- `settings.srm == null` OR `settings.srm.enabled == false` (often intentional — only flag if results look suspicious)
+- `settings.cuped.enabled == true` AND the experiment cohort is "new users only"
+- `settings.confidenceLevel != 0.95`
+- `metrics[]` entries with `name == ""`
+- A primary metric in `metrics[]` but missing from `live_metrics` AND `results_cache.metrics`
+
+---
+
+## When to reach for sibling tools
+
+- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the setup-side skill / `Get-Experiment-Setup-Guidance`.
+- **Raw data for triggered or segmentation analysis** → `Run-Query` on the metric with appropriate filters.
+- **Acting on the recommendation** (ship, kill, extend) → `Update-Experiment` with the appropriate action.
+- **Feature-flag rollout history** for SRM root cause → `Get-Feature-Flag`.
+- **Session replays** for behavioral explanation of a quantitative result → the replay-fetch tool (see [session-replay-analysis.md](session-replay-analysis.md)).
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md b/plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md
new file mode 100644
index 0000000..4471219
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md
@@ -0,0 +1,158 @@
+# Health-Check Interpretation
+
+Open this when Step 1 of the Decision Tree flags a failure (SRM, Retro A/A, insufficient exposures, peeking, broken-data, < 3-day window, or any misconfiguration). The goal is to turn the platform's already-computed verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action.
+
+**This skill never recomputes thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers.
+
+---
+
+## Kohavi framing — always cite when a health check fails
+
+> **Sample Ratio Mismatch is the #1 trustworthiness check (Kohavi).** When SRM is failing, do not trust the experiment's lift, p-values, or confidence intervals — the randomization assumption is broken, so the measured effect cannot be attributed to the treatment.
+>
+> **Twyman's Law**: any unusually clean or unusually large result is more likely a bug than a discovery. A spectacular lift on a failing-SRM experiment is not evidence of a great treatment; it's evidence the bucketing is broken.
+
+These two principles drive the recommendations below. Lead with them when explaining a failing check to the user.
+
+---
+
+## 1. SRM (Sample Ratio Mismatch)
+
+**Verdict to consume**: `live_srm_analysis` (or `exposures_cache.$srm_analysis`). The platform tags failing SRMs already; do not compute chi-square yourself.
+
+### What it means
+
+Users were assigned to variants in proportions that disagree with the configured `settings.srm.targetAllocations`. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness.
+
+### Likely causes, ordered most → least likely
+
+(Surface in this order — investigate the most probable first.)
+
+1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees.
+2. **biased_assignment** — The assignment criterion is not uniformly random across the user population, for example assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window.
+3. **bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation.
+4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the `$experiment_started` event fires exactly once per user per variant assignment.
+5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period.
+
+### Recommended actions
+
+- **pause_and_investigate** — Pause the experiment before drawing any conclusions. SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable.
+- **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric.
+- **investigate_exposure_logging** — Inspect `$experiment_started` event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs.
+- **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split.
+
+### Investigation checklist
+
+1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented?
+2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Use `Get-Feature-Flag` to inspect rollout rules and history.
+3. For multi-variant tests, the platform's SRM threshold is Bonferroni-corrected — the effective per-variant threshold may be tighter than the headline. Trust the bucket flag, not raw p-value math.
+4. Verify SDK version and bucketing logic. `Run-Query` for `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly.
+5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it.
+6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting.
+7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.**
+
+---
+
+## 2. Retro A/A (pre-experiment bias) failure
+
+**Verdict to consume**: the analysis the platform attached when `settings.preExperimentBias` is enabled.
+
+### What it means
+
+The same statistical comparison run on the **pre-exposure** period revealed that variant cohorts already differed _before_ the treatment started. Any "lift" measured during the experiment may just be reflecting that pre-existing gap, not the change.
+
+- Pre-experiment bias on a **primary** metric is a **stop-and-investigate** signal.
+- Pre-experiment bias on a **secondary** metric is informational only.
+
+### Investigation checklist
+
+1. Identify which metric × variant pair triggered the failure (after the platform's correction).
+2. Check whether bucketing was deterministic — non-deterministic assignment in the pre-period means users were assigned to different variants than they would have been in production.
+3. Look for cohort skew: did one variant disproportionately receive heavy users? `Run-Query` on the metric pre-experiment grouped by variant to confirm.
+4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort.
+5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing.
+
+---
+
+## 3. Insufficient exposures
+
+**Verdict to consume**: `live_exposures` per variant, plus any platform-attached "insufficient" flag. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue.
+
+### Investigation checklist
+
+1. Check `live_exposures` totals — which variant is undersampled?
+2. Inspect feature-flag rollout: `Get-Feature-Flag` → was rollout dialed back?
+3. `Run-Query` for the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?).
+4. If the experiment is still ACTIVE: extend duration via `Update-Experiment` with `endAfterDays`.
+5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math.
+
+If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question.
+
+---
+
+## 4. Frequentist peeking
+
+**Verdict to consume**: `settings.testingModel == "frequentist"`, plus `end_date` vs `start_date + endAfterDays` (or `sampleSize` vs `live_exposures.$overall`, depending on `settings.endCondition`).
+
+### What it means
+
+A frequentist test that ends before reaching its configured target has an **inflated false-positive rate**. The math assumes a fixed sample size; peeking before that point and stopping on a favorable look is exactly what "p-hacking" looks like in production.
+
+### Investigation checklist
+
+1. Confirm `settings.testingModel == "frequentist"`.
+2. Compare `end_date` against `start_date + endAfterDays` (or whether `sampleSize` was reached, whichever is the configured `endCondition`).
+3. If the conclusion was premature: results have inflated false-positive rate. Recommend a re-run.
+4. If the user wants to keep current results: caveat strongly. Recommend `testingModel: "sequential"` for the next experiment so they can stop early without penalty.
+
+(Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.)
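+
+The checklist above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's logic: it assumes the `Get-Experiment` payload has already been parsed into a dict, that `start_date` / `end_date` are `date` objects, and that `endAfterDays` and `sampleSize` sit under `settings` (the exact nesting is an assumption). The function name is hypothetical.
+
+```python
+from datetime import date, timedelta
+
+
+def concluded_before_target(experiment: dict) -> bool:
+    """Return True if a frequentist experiment ended before its configured target."""
+    settings = experiment["settings"]
+    if settings["testingModel"] != "frequentist":
+        return False  # sequential tests are allowed to stop early
+    if settings["endCondition"] == "days":
+        planned_end = experiment["start_date"] + timedelta(days=settings["endAfterDays"])
+        return experiment["end_date"] < planned_end
+    # Assumed fallback: the end condition is a sample-size target.
+    return experiment["live_exposures"]["$overall"] < settings["sampleSize"]
+
+
+# Hypothetical payload: planned for 14 days, concluded after 6.
+experiment = {
+    "start_date": date(2026, 6, 1),
+    "end_date": date(2026, 6, 7),
+    "settings": {"testingModel": "frequentist", "endCondition": "days", "endAfterDays": 14},
+}
+print(concluded_before_target(experiment))  # True
+```
+
+A `True` here is the cue to caveat the results and recommend a re-run; it says nothing about whether the observed lift is real.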
+
+---
+
+## 5. Live computation timeout / broken data
+
+**Verdict to consume**: `live_results_errors` non-null with `live_*` fields null.
+
+### Investigation checklist
+
+1. Try `Get-Experiment` again — transient backend load may resolve. Wait ~30s between retries.
+2. If repeated failures: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget.
+3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision.
+4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation.
+
+---
+
+## 6. Experiment ran < 3 days
+
+**Verdict to compute (this one is local)**: `end_date - start_date`.
+
+Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. **Refuse to interpret.** Tell the user explicitly:
+
+> _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_
+
+If an experiment with `endCondition: "sample_size"` and a tiny target (e.g. 10k) hit that target within hours, increase the target and rerun. Reaching sample size quickly is not the same as running a valid experiment window.
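+
+Because this is the one check computed locally, a small sketch may help. It assumes `start_date` and `end_date` have already been parsed into `datetime` objects; the helper name and the constant are hypothetical.
+
+```python
+from datetime import datetime, timedelta
+
+MIN_WINDOW = timedelta(days=3)  # the ~3-day floor described above
+
+
+def ran_too_short(start_date: datetime, end_date: datetime) -> bool:
+    """Return True when the experiment window is shorter than the 3-day floor."""
+    return end_date - start_date < MIN_WINDOW
+
+
+print(ran_too_short(datetime(2026, 6, 1, 9, 0), datetime(2026, 6, 3, 18, 0)))  # True
+print(ran_too_short(datetime(2026, 6, 1, 9, 0), datetime(2026, 6, 8, 9, 0)))   # False
+```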
+
+---
+
+## 7. Misconfigurations to flag during Step 1
+
+These don't always invalidate results, but they change how to _read_ them. Surface them as warnings.
+
+- `settings.multipleTestingCorrection in {"off", null}` AND there are 2+ primary metrics across 1+ non-control variants → without correction, any single significant primary may be a false positive. **Don't assume the result is broken**: look at all primary results in aggregate. If most or all primaries point the same direction (all positive or all negative), there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk**, and the user can enable correction (Benjamini-Hochberg or Bonferroni) and re-analyze. The family-wise error rate compounds with the number of comparisons: for `m` independent tests it is `1 − (1 − α)^m` (e.g. 15 primaries × 1 variant at α=0.05 → ~54% chance of at least one false positive).
+- `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` very low (e.g. < ~80) or unusually high (e.g. > ~99) → likely misconfigured. A very low percentile caps far too much of the distribution (a percentile near 50 flattens at least half of the data); a very high one caps almost nothing, so Winsorization is effectively off. The platform's default is 95.
+- `settings.srm == null` OR `settings.srm.enabled == false` → the SRM check didn't run. **SRM is often deliberately disabled** (e.g. when feature-flag rollouts intentionally split traffic unevenly), so do not try to compute it yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) — then suggest the user re-enable SRM and re-analyze.
+- `settings.cuped.enabled == true` AND the experiment cohort is "new users only" → CUPED requires pre-exposure data, which new-user experiments lack, so CUPED simply has no effect. **This does NOT invalidate results** — variance reduction just didn't happen. Mention it as informational.
+- `settings.confidenceLevel != 0.95` → call out explicitly. `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Combine with metric count for a sense of family-wise error rate.
+- `metrics[]` contains entries with `name == ""` → likely a broken or placeholder metric reference. Flag and skip during analysis.
+- A primary metric appears in `metrics[]` but is **missing from `live_metrics` AND `results_cache.metrics`** → no result was computed for that primary. Surface prominently — this is "no measurement," not "no effect." Recommend the user re-sync results.
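+
+The family-wise figure quoted for uncorrected primaries in the first bullet can be reproduced with a short calculation. This is a sketch that assumes the comparisons are independent, which real metrics rarely are, so treat the output as a rough ceiling rather than an exact rate. The function names are hypothetical.
+
+```python
+def family_wise_error_rate(alpha: float, comparisons: int) -> float:
+    """Chance of at least one false positive across independent comparisons."""
+    return 1.0 - (1.0 - alpha) ** comparisons
+
+
+def bonferroni_alpha(alpha: float, comparisons: int) -> float:
+    """Per-comparison threshold under a Bonferroni correction."""
+    return alpha / comparisons
+
+
+comparisons = 15 * 1  # 15 primaries x 1 non-control variant
+print(f"{family_wise_error_rate(0.05, comparisons):.0%}")  # 54%
+print(f"{bonferroni_alpha(0.05, comparisons):.4f}")        # 0.0033
+```
+
+Use this only to explain the risk to the user; the significance verdicts themselves should still come from the platform.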
+
+---
+
+## Output shape when a health check fails
+
+1. **What failed**, in one sentence (use the verdict the platform attached — do not re-derive).
+2. **What that means for trust** — cite the Kohavi framing (SRM is #1) or Twyman's Law where it fits.
+3. **Likely causes**, ordered most → least probable.
+4. **Recommended action** from the small set above.
+5. **Investigation checklist** the user can run.
+6. **What NOT to do** — usually, "do not act on the current lift / p-value numbers."
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md b/plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md
new file mode 100644
index 0000000..3b44385
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md
@@ -0,0 +1,188 @@
+# Per-Metric Interpretation
+
+Open this when the user wants you to translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_
+
+**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from `Get-Experiment`. Then translate.
+
+---
+
+## The mental model
+
+Each row in `summary.positive` / `summary.negative` / `summary.no` answers four questions:
+
+1. **Did the lift go up or down?** — the `summary` bucket name (sign-of-lift, not polarity).
+2. **Was the change distinguishable from noise?** — the `significance` field (or the bucket name itself: rows in `summary.positive` / `summary.negative` are significant, rows in `summary.no` are not).
+3. **Was the change in the goal direction?** — apply the polarity recipe with `metric.direction`.
+4. **Was the change big enough to matter?** — multiply `lift` by the control baseline `value` to get absolute impact, then judge against business context.
+
+A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any one of those and you're shipping the wrong thing.
+
+---
+
+## Polarity recipe (repeat from the spine — critical)
+
+`metric.direction` is `"up"` or `"down"` (defaults to `"up"`).
+
+- `lift is None` or `lift == 0` → **neutral** (treat as no measurement / no effect respectively).
+- `direction == "up"` → **positive** if `lift > 0`, else **negative**.
+- `direction == "down"` → **positive** if `lift < 0`, else **negative**.
+
+A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. A `-1% interstitials_shown` lift in `summary.negative` with `direction: "down"` is plausibly a **win** (less interruption).
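+
+The recipe translates directly into a few lines of Python. This is an illustrative sketch of the rules as stated above; the function name is hypothetical and the return values are just labels.
+
+```python
+from typing import Optional
+
+
+def polarity(lift: Optional[float], direction: str = "up") -> str:
+    """Map sign-of-lift plus metric.direction to a business polarity."""
+    if lift is None or lift == 0:
+        return "neutral"  # no measurement, or no effect
+    if direction == "up":
+        return "positive" if lift > 0 else "negative"
+    if direction == "down":
+        return "positive" if lift < 0 else "negative"
+    raise ValueError(f"unexpected direction: {direction!r}")
+
+
+print(polarity(-0.01, "down"))  # positive (e.g. fewer interstitials shown)
+print(polarity(0.05, "down"))   # negative (a regression despite the positive lift)
+print(polarity(None))           # neutral
+```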
+
+---
+
+## Reading the p-value correctly
+
+The p-value is the probability of observing a difference at least as extreme as the one measured, **assuming the null hypothesis (no real difference) is true**. It is NOT:
+
+- ❌ The probability that the treatment works.
+- ❌ The probability the result will replicate.
+- ❌ A measure of effect size — a tiny lift can be highly significant on a huge sample.
+- ❌ Proof of "no effect" when above threshold (see the section on "Significance = NO" below).
+
+Mixpanel uses Welch's t-test (z-test for large samples). Default α = 0.05 at 95% confidence. The confidence level is set on `settings.confidenceLevel`. If it differs from 0.95, call it out in the verdict (`0.9` inflates false positives; `0.99` is conservative).
+
+---
+
+## Reading the lift correctly
+
+```
+lift = (treatment_mean - control_mean) / control_mean
+```
+
+- `liftConfidence` is the **confidence level used** (e.g. 0.95). It is NOT the confidence-interval width.
+- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct.
+- If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect."
+
+---
+
+## Verdict phrasing — a small palette
+
+Pick the phrase that matches the four-question pattern. These are the words to use with users; they map onto the platform's already-computed numbers, so the agent never has to invent thresholds.
+
+| Pattern (sig × polarity × magnitude)                        | Plain-language verdict                                                                                                                                                    |
+| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Significant, polarity positive, magnitude large vs baseline | "**Clear win** — `<metric>` moved `<lift%>` in the goal direction, which is meaningful at this baseline." (apply Twyman's Law if lift > ~30%)                             |
+| Significant, polarity positive, magnitude small vs baseline | "**Statistically significant but practically small** — `<lift%>` on a `<baseline>` baseline is `<absolute>`; confirm with the user whether that clears the business bar." |
+| Significant, polarity negative                              | "**Regression** — `<metric>` moved `<lift%>` against its goal direction. This is a reason not to ship even if other primaries won."                                       |
+| Not significant, lift in goal direction, well-powered       | "**Likely no effect at the detectable size.** The experiment had enough power to detect `<MDE>`; the observed lift is below that threshold."                              |
+| Not significant, lift in goal direction, underpowered       | "**Inconclusive — too underpowered to call.** Route to the why-no-statsig playbook to decide between wait / extend / restart."                                            |
+| Not significant, lift in wrong direction                    | "**No detectable harm**, but no win either."                                                                                                                              |
+| `lift is None`                                              | "**No measurement** — this variant's row failed to compute. Surface the failure and re-sync."                                                                             |
+| Lift > ~30% on any metric                                   | Prefix with "**Twyman's Law check:** that lift is unusually large; verify the denominator hasn't changed before celebrating."                                             |
+
+---
+
+## Magnitude — make it absolute
+
+Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful:
+
+1. Baseline from the control variant: `live_metrics[metricId][controlKey].value` (or the `summary.no` row where `variant == controlKey`).
+2. Lift from the winning row.
+3. Absolute lift: `baseline_value × lift`. Examples:
+   - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate.
+   - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`.
+4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week."
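+
+The four steps above amount to two multiplications. The sketch below works through the examples from this section; the helper names are hypothetical, and the traffic figures are the illustrative ones quoted in step 4, not real data.
+
+```python
+def absolute_lift(baseline_value: float, lift: float) -> float:
+    """Convert a relative lift into the metric's own units."""
+    return baseline_value * lift
+
+
+def projected_impact(baseline_value: float, lift: float, users_per_period: int) -> float:
+    """Scale the absolute lift to a population over one period."""
+    return absolute_lift(baseline_value, lift) * users_per_period
+
+
+# Conversion-rate example: 2% baseline, +4% relative lift.
+print(f"{absolute_lift(0.02, 0.04) * 100:.2f} pp")  # 0.08 pp
+
+# Same 5% lift, very different stakes.
+print(f"{projected_impact(0.20, 0.05, 1_000_000):,.2f}")  # 10,000.00
+print(f"{projected_impact(0.001, 0.05, 1_000):,.2f}")     # 0.05
+```
+
+For `total` metrics, pass the per-exposure average as `baseline_value`, not the raw total.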
+
+### Fallback when `value` / `sampleSize` are null
+
+Common — happens whenever live computation timed out or `results_cache.metrics` was nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.**
+
+Call `Run-Query` on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation:
+
+- `unique` (Bernoulli) → conversion **rate** as the baseline.
+- `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. Multiplying lift by a raw total double-counts cohort size.
+
+---
+
+## Twyman's Law in practice — changed-denominator lifts
+
+Before celebrating any lift > ~30%, ask: **did the treatment change who is _exposed_ to this metric, not just how they behave?**
+
+If the treatment causes more users to _see_ a screen, more events naturally fire — the metric grows because the denominator changed, not because per-user behavior changed.
+
+- A "Free item" promotion drives more users to checkout → "Checkout Screen Viewed" lifts +1000% mechanically. The interesting question is **conversion rate on the screen**, not raw views.
+- A new banner makes a feature discoverable → "Feature Page Viewed" lifts dramatically. **Behavior per user who discovers the feature** may be unchanged.
+
+When you see a > 30% lift, name the risk explicitly:
+
+> _"This metric measures exposure to the screen/event. The treatment likely caused more users to be exposed; that explains most of the lift mechanically. The interesting question is what those users did once they got there."_
+
+---
+
+## Metric distribution types
+
+Different metric types behave differently; cite the relevant nuance in your verdict.
+
+| Metric type                      | Distribution | Interpretation nuance                                                                                     |
+| -------------------------------- | ------------ | --------------------------------------------------------------------------------------------------------- |
+| Unique users / conversion rate   | Bernoulli    | Variance = `p(1−p)`, which peaks at 50%. For a given relative lift, the required sample grows roughly as `(1−p)/p`, so low base rates need far more sample. |
+| Event counts / sessions per user | Poisson      | Variance = mean. Highly sensitive to power users; consider whether one heavy user can swing results.      |
+| Revenue / numeric properties     | Gaussian     | Long tails (whales) inflate variance. Strongly consider Winsorization.                                    |
+
+---
+
+## Variance-reduction & outlier settings that change interpretation
+
+- **CUPED** (`settings.cuped.enabled == true`): mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix).
+- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration (see `health-check-interpretation.md` §Misconfig).
+
+---
+
+## Multiple comparisons & metric tiers — what's decisional and what isn't
+
+| Tier          | How it influences the verdict                                                                                                                                                                                 |
+| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| **Primary**   | **Decisional.** The platform auto-applies correction when `multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` (across primaries × variants).                                              |
+| **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude.                                                                                                                                          |
+| **Secondary** | **Exploratory only.** NOT covered by the multiple-testing correction. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. |
+
+If `settings.multipleTestingCorrection` is `"off"` AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled.
+
+---
+
+## "Significance = NO" does NOT mean "no effect"
+
+A row in `summary.no` means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level. **Important when the user is about to call something a null result.**
+
+Options to suggest when a primary metric lands in `summary.no`:
+
+1. **Extend duration** (if the experiment is still ACTIVE).
+2. **Increase traffic allocation** (if there's headroom — never mid-Frequentist-test, which invalidates SRM).
+3. **Use Sequential testing model** for the next experiment if continuous monitoring fits.
+4. **Enable CUPED** if the metric correlates with pre-exposure behavior.
+5. **Narrow the hypothesis** — test a stronger version, or scope to a more responsive segment.
+6. **Accept the null** — if the experiment was well-powered for the MDE that matters, "no effect" is a real finding.
+
+For the full "why hasn't this hit statsig yet" walk-through, see [why-no-statsig.md](why-no-statsig.md).
+
+---
+
+## Frequentist vs Sequential — what affects per-metric reading
+
+Check `settings.testingModel`:
+
+- `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration.
+- `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended.
+
+Calling `Update-Experiment(action="conclude")` on a Frequentist experiment that hasn't reached its target is a peeking event. Flag it in the verdict.
+
+---
+
+## Triggered analysis & dilution
+
+If the change only affects a subset of users (e.g. only triggers when a specific button is shown), the **effect on triggered users** is much larger than the **effect on the full exposed population**.
+
+- Triggered analysis zooms in on users who actually saw the change.
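+- Dilution math: `population_lift = triggered_lift × (triggered_users / total_exposed)`. This holds exactly for absolute effects; for relative lift it is an approximation that assumes triggered and non-triggered users share a similar baseline.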
+
+The platform doesn't auto-compute triggered analysis. If the change is gated by a condition, ask the user about the trigger rate and walk through the math before declaring the population-level lift "small."
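+
+A worked instance of the dilution formula, as a sketch: the function name and the example numbers are hypothetical, and the calculation assumes the treatment has no effect on users who never hit the trigger.
+
+```python
+def diluted_lift(triggered_lift: float, triggered_users: int, total_exposed: int) -> float:
+    """Scale a lift measured on triggered users down to the full exposed population."""
+    if total_exposed <= 0:
+        raise ValueError("total_exposed must be positive")
+    return triggered_lift * (triggered_users / total_exposed)
+
+
+# Hypothetical numbers: a 20% lift among the 10% of exposed users who saw the button.
+print(f"{diluted_lift(0.20, 5_000, 50_000):.1%}")  # 2.0%
+```
+
+Read the other way, a population-level lift of 2% with a 10% trigger rate is consistent with a much larger effect on the users who actually saw the change.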
+
+---
+
+## Novelty and primacy
+
+- **Novelty** — lift is large early, then decays as users habituate.
+- **Primacy** — lift is small or negative early, then grows as users learn the new behavior.
+
+To detect either, look at the line-chart view of the metric (date-segmented). A monotonic decay from day 1 → day 14 is classic novelty; the steady-state lift is what matters for shipping. Call this out when interpreting any experiment shorter than ~2 weeks.
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md
new file mode 100644
index 0000000..6877d2a
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md
@@ -0,0 +1,95 @@
+# Segment-Breakdown Interpretation
+
+Open this when the user has per-segment results in hand and wants to read them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
+
+> **Platform support status.** Reading segment-level experiment results in `Get-Experiment` depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment `Run-Query` calls against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If `Get-Experiment` doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the `Run-Query` fallback — do not invent per-segment significance verdicts.
+
+---
+
+## The mental model
+
+A segment breakdown asks: _did the treatment affect different user segments differently?_ It has three possible outcomes per segment:
+
+1. **The segment moved in the same direction as the overall effect**, with similar magnitude → reinforces the overall verdict; nothing new.
+2. **The segment moved much more or less than overall**, but in the same direction → heterogeneity; the effect is concentrated in a subset.
+3. **The segment moved in the _opposite_ direction** to overall → Simpson's paradox or a real reversal — this is where segment analysis earns its keep.
+
+Reading a segment breakdown well means recognizing which of those three you're looking at and not mistaking noise for any of them.
+
+---
+
+## Per-segment polarity recipe — apply per row
+
+The same recipe from the per-metric reference applies _inside_ each segment. Don't take a shortcut.
+
+- For each segment × metric × non-control variant, look at the row's `lift` and bucket (positive/negative/no).
+- Translate sign-of-lift into business polarity using `metric.direction`. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary.
+- Filter out the control row in each segment.
+
+Surprisingly easy to forget when you're scanning a wide table — re-apply polarity per row.
+
+---
+
+## Sample-size floor per segment
+
+Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. As a rule of thumb, the same ~350-per-variant floor used for overall trustworthiness applies per segment.
+
+- Segments below the floor → mark "insufficient sample, treat as directional only."
+- A "significant" lift on a 50-user-per-variant segment is almost always noise. Say so.
+- If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice.
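+
+Applying the floor is a simple filter. The sketch below assumes per-segment exposure counts have already been collected (for example from the `Run-Query` fallback) into a nested dict; that shape, the function name, and the sample numbers are all hypothetical.
+
+```python
+SEGMENT_FLOOR = 350  # the rule-of-thumb per-variant floor described above
+
+
+def underpowered_segments(exposures: dict, floor: int = SEGMENT_FLOOR) -> list:
+    """Return segments where any variant has fewer exposed users than the floor."""
+    return sorted(
+        segment
+        for segment, per_variant in exposures.items()
+        if min(per_variant.values()) < floor
+    )
+
+
+exposures = {
+    "US": {"control": 4200, "treatment": 4150},
+    "DE": {"control": 410, "treatment": 340},
+    "NZ": {"control": 55, "treatment": 61},
+}
+print(underpowered_segments(exposures))  # ['DE', 'NZ']
+```
+
+Segments in the returned list are candidates for the "directional only" label or for pooling into a larger bucket.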
+
+---
+
+## Heterogeneity vs Simpson's paradox vs noise
+
+| What you see                                                                                        | Interpretation                                                                                                                                             |
+| --------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Most segments lift positive, one or two negative, all with overlapping CIs                          | **Noise.** Not heterogeneity. Don't ship a segment-specific story.                                                                                         |
+| One segment lifts much more than the rest, with a tight CI and a clear mechanism                    | **Real heterogeneity.** The change is concentrated in that segment. Consider shipping only to that segment, or revising the hypothesis.                    |
+| Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. |
+| Two opposite-direction effects in different segments that roughly cancel overall                    | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses.    |
+
+When you spot Simpson's paradox, route the user to [health-check-interpretation.md](health-check-interpretation.md) §SRM — it's usually the cause, not a real reversal.
+
+---
+
+## What a "ship only to segment X" recommendation requires
+
+Don't recommend a segment-scoped ship unless **all** of these hold:
+
+1. The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it).
+2. The segment's per-variant sample clears the ~350 floor by a comfortable margin.
+3. The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment.
+4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product.
+5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply.
+
+Otherwise, the segment-only ship is a post-hoc story dressed up as a decision. Recommend confirming with a follow-up experiment scoped to that segment.
+
+---
+
+## When a segment loses but overall wins
+
+This is the everyday case of mixed effects.
+
+- If the losing segment is small and its absolute hit is acceptable, ship to all — but call out the loser in the rationale.
+- If the losing segment is large or has a guardrail regression, recommend iterate, not ship.
+- If the losing segment is a regulated / strategic cohort (paying tier, top customers, EU), default to iterate — guardrails on the cohort, not just overall.
+
+---
+
+## What NOT to do
+
+- ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition.
+- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment `Run-Query` fallback — they're not corrected unless the platform did it.
+- ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal.
+- ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism.
+- ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence).
+
+---
+
+## Output shape
+
+1. **One-sentence segment-level summary** — homogeneous, heterogeneous, or Simpson's-suspicious.
+2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered).
+3. **What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's."
+4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating).
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp/skills/experiment-results/references/segment-of-interest-selection.md
new file mode 100644
index 0000000..ea9f22b
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/experiment-results/references/segment-of-interest-selection.md
@@ -0,0 +1,116 @@
+# Segment-of-Interest Selection
+
+Open this when the user wants to break results down by user segments: _"slice this by platform"_, _"which segments should I look at?"_, _"are new users responding differently?"_. The goal is to pick 3–5 segments that are **likely to reveal a real effect difference**, rather than slicing by every available dimension and ending up p-hacking.
+
+The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them.
+
+---
+
+## Why this matters: the fishing-expedition problem
+
+If you slice an experiment by every available property (10 platforms × 20 countries × 5 plan tiers × …), you will find "significant" segment-level effects by chance alone. The family-wise false positive rate explodes the same way it does for too many primary metrics — except there's usually no platform-level correction across segments. **Pre-committing to a small set of segments, ordered by hypothesis-driven probability, is the discipline that makes segment analysis credible.**
+
+Aim for 3–5 segments, max. If the user wants more, ask which ones are connected to the hypothesis and which are exploration. Mark the exploration set as "hypothesis-generating, not decisional."
+
+---
+
+## The decision tree for picking segments
+
+Walk through these in order. The first match is the most defensible pick.
+
+### 1. Segments the hypothesis explicitly names
+
+If the experiment's `hypothesis` (or `description`) text mentions "new users", "mobile", "Pro tier", "EU customers" — those segments are pre-committed by the experiment design. Always include them.
+
+Look at:
+
+- `experiment.hypothesis`
+- `experiment.description`
+- The setup-side conversation, if present
+
+These are not exploratory; they're the variables the team committed to test.
+
+### 2. Segments where the mechanism is expected to matter
+
+The hypothesis names _what_ the change is and (ideally) _why_ it should work. The "why" tells you which user attributes plausibly moderate the effect:
+
+| Hypothesis mechanism                              | Segments likely to moderate the effect             |
+| ------------------------------------------------- | -------------------------------------------------- |
+| "Reduces first-time friction in onboarding"       | New vs returning; signup source; locale            |
+| "Improves discoverability of feature X"           | Users who previously used X vs not; tenure         |
+| "Speeds up a slow flow"                           | Platform (mobile slower than web); connection type |
+| "Lowers payment friction"                         | Plan tier; payment-method type; geography          |
+| "Replaces a confusing UI element"                 | New vs returning (returning users habituated)      |
+| "Surfaces a feature only relevant to power users" | Engagement-tier cohorts; tenure                    |
+| "Localized copy / pricing change"                 | Country / language                                 |
+
+If you can't articulate _why_ a segment should respond differently, it's not a hypothesis-driven slice. Demote it.
+
+### 3. Segments where the **denominator** plausibly differs
+
+Some properties don't change _behavior_ but change _who gets exposed_. Slicing on these helps catch changed-denominator artifacts before they're called a win.
+
+- Triggered vs untriggered cohorts (if the treatment only fires on certain pages).
+- Platform / app version (the treatment may only ship on a subset of clients).
+- Device class (mobile vs desktop) when the change is platform-specific.
+
+A 1000% lift in `Checkout Screen Viewed` overall usually disappears once you condition on "users who reached the checkout funnel" — that disappearance is the finding.
+
+### 4. Segments where SRM or baseline shift is suspected
+
+If overall SRM is borderline (or failing in one variant only), per-segment SRM can localize the bucketing bug to a specific platform / country / cohort. Examples:
+
+- iOS vs Android (often the SDK bucketing layer differs).
+- Bot-suspicious countries (`bot_traffic` cause from health-check).
+- A specific app version range that shipped a flag-evaluation change.
+
+This is diagnostic segmentation, not interpretation segmentation. Use it when Step 1 of the Decision Tree has already flagged trouble.
+
+### 5. Segments the platform de facto requires
+
+Some user dimensions are so foundational that any results report should mention them once:
+
+- **Platform** — web vs iOS vs Android.
+- **New vs returning** — defined as first session within the experiment window vs before.
+- **Geo region** — EU vs US vs APAC, when results meaningfully differ by regulatory or payment context.
+
+Don't include all three blindly — pick the one(s) most likely to vary given the change.
+
+---
+
+## Sanity checks before committing to a slice
+
+For each segment you want to break down on:
+
+1. **Does each segment value have ~350+ exposed users per variant?** Below that floor, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment.
+2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis.
+3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. "Pro tier" boundaries changed mid-test) invalidate the comparison.
+4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes; slicing on them introduces selection bias rather than valid stratification.
+
+---
+
+## How many slices to commit to
+
+| Situation                                                         | Number of slices                |
+| ----------------------------------------------------------------- | ------------------------------- |
+| Hypothesis-driven, well-powered, decisional                       | 3–5 segments, named upfront     |
+| Exploratory ("anything weird?"), flagged as hypothesis-generating | Up to ~10, with explicit caveat |
+| Diagnostic (chasing a failing SRM or strange overall result)      | Whatever helps localize the bug |
+
+If the user wants to "just look at everything", push back: pick the top 3–5 with reasoning, then offer a separate exploratory pass that won't be used for the ship decision.
+
+---
+
+## The pre-commit ritual
+
+Before running the breakdowns, tell the user something like:
+
+> _"Based on the hypothesis (`<one-line summary>`), I'd slice by `<segment A>` and `<segment B>` because `<why each matters>`. I'm intentionally not slicing `<X, Y, Z>` because they don't connect to the proposed mechanism — looking at every dimension makes false positives almost guaranteed. We can do an exploratory pass after, separately from the ship decision. Sound right?"_
+
+Pre-commitment is what separates "segmentation analysis" from "fishing."
+
+---
+
+## Then read the results
+
+Once the segment breakdown is in hand, switch to [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there.
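
As a rough illustration of the fishing-expedition problem and the sample-floor check described in `segment-of-interest-selection.md` above, the sketch below computes the chance of at least one spurious "significant" segment when slices are tested without correction, and flags segment values that fall under the ~350-per-variant floor. It assumes the slices are statistically independent and tested at a per-test alpha of 0.05; both are simplifying assumptions, and the function names are hypothetical, not part of any platform API.

```python
def family_wise_fpr(num_slices: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent, uncorrected tests."""
    return 1.0 - (1.0 - alpha) ** num_slices


def below_floor(exposed: dict[str, dict[str, int]], floor: int = 350) -> list[str]:
    """Return segment values where any variant has fewer than `floor` exposed users."""
    return sorted(
        segment
        for segment, per_variant in exposed.items()
        if min(per_variant.values()) < floor
    )


if __name__ == "__main__":
    # Illustrative slice counts only; real slices are rarely fully independent.
    for k in (1, 5, 10, 35):
        print(f"{k:>3} slices -> {family_wise_fpr(k):.0%} chance of a spurious 'win'")
    # Hypothetical exposure counts per segment value and variant.
    print(below_floor({
        "ios": {"control": 900, "treatment": 880},
        "web": {"control": 340, "treatment": 410},
    }))
```

Under these assumptions, even the recommended ceiling of five slices carries a non-trivial chance of one false positive, which is why the exploratory set is marked as hypothesis-generating rather than decisional.
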
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md b/plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md
new file mode 100644
index 0000000..88640f4
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md
@@ -0,0 +1,109 @@
+# Session-Replay Analysis Guidance
+
+Open this when the user wants to use session replays to explain a quantitative experiment result — _"why is conversion down in treatment?"_, _"what are users actually doing in the treatment?"_, _"can replays explain the regression?"_. The goal is to turn a number into a behavior story.
+
+> **Tool boundary.** This skill provides the _interpretation_ guidance for replay analysis. The actual replay-fetching tool — pulling replay IDs for control vs treatment cohorts — lives on the platform side (a separate fetch tool exposed alongside `Get-Experiment`, when available). If the fetch tool isn't yet available, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss.
+
+---
+
+## When replays help, when they don't
+
+| Question                                                                                 | Replays help?                                                                         |
+| ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
+| "Why is conversion lower in treatment?"                                                  | Yes — behavior diff is observable.                                                    |
+| "Why is `Checkout Screen Viewed` 10× higher in treatment?" (changed-denominator suspect) | Yes — replays show whether users are _bouncing_ or _converting_ after they get there. |
+| "Why is `time_on_page` higher in treatment?"                                             | Yes — distinguishes engaged reading vs confused dwell.                                |
+| "Is the treatment shipping a regression on iOS only?"                                    | Sometimes — better answered first by segment breakdown.                               |
+| "Why is SRM failing?"                                                                    | No — replays don't show bucketing. Go to health checks.                               |
+| "What's the lift?"                                                                       | No — replays are qualitative; they explain _why_, not what.                           |
+| "Why hasn't this hit statsig yet?"                                                       | No — that's a sample/power question, not a behavior question.                         |
+
+A useful heuristic: replays answer _behavioral_ questions. If the question isn't behavioral, replays will burn time without adding signal.
+
+---
+
+## Cohort selection: which replays to compare
+
+You're looking for **paired contrast**, not a random sample. Pick the cohort that maximizes signal for the specific question.
+
+| Question                                                             | Cohort A (replays to pull)                                 | Cohort B (replays to pull)                                  |
+| -------------------------------------------------------------------- | ---------------------------------------------------------- | ----------------------------------------------------------- |
+| Why is primary metric down in treatment?                             | Treatment users who **failed** the primary action          | Control users who **succeeded** at the primary action       |
+| Why is a guardrail regression appearing?                             | Treatment users who **triggered** the guardrail negatively | Control users who did NOT trigger it                        |
+| Why does treatment have a huge `Screen Viewed` lift (denom shift)?   | Treatment users who reached the screen                     | Same users, looking at whether they completed the next step |
+| Why is engagement higher / lower in a specific segment?              | Treatment users in that segment                            | Control users in the same segment                           |
+| What does the new UI look like in practice?                          | Any treatment users who saw the change                     | Any control users to confirm the baseline UI                |
+
+**Aim for ~5 replays per cohort.** Fewer and you're anecdote-shopping; many more and you'll just confirm what the first 5 already showed. If the first 5 are inconclusive or contradictory, pull 5 more before changing tactics.
+
+Filter by recency — replays from the most recent days of the experiment best reflect steady-state behavior (avoid novelty / primacy noise).
+
+---
+
+## What to actually watch for
+
+Go in with a hypothesis from the quantitative result. Don't watch replays blank-eyed; you'll see "users using the app" and learn nothing.
+
+### Friction / failure patterns
+
+- **Hesitation** — long pause before clicking a key element (often signals confusion).
+- **Misclicks** — clicking non-interactive elements, or rage-clicking a button that didn't work.
+- **Form abandonment** — typing into a field, then leaving without submitting.
+- **Back-button bounce** — landing on the page, then immediately backing out.
+- **Scroll-and-leave** — scrolling without engaging, then exiting.
+
+If treatment has more of these than control, you have a behavior explanation for a primary loss or guardrail regression.
+
+### Layout / discoverability issues
+
+- **CTA below the fold** — users never scrolling to where the new button is.
+- **Element overlap on mobile** — the treatment looks fine in desktop testing but breaks on small screens.
+- **Hidden state** — a tooltip / modal that fires once and is then gone, so the user never sees the key affordance.
+
+These usually explain segment heterogeneity (loss concentrated in mobile, or in a specific viewport size).
+
+### Changed-denominator behavior
+
+If you're investigating a Twyman's-Law-sized lift, look for:
+
+- **Users landing on the new screen and immediately leaving** — explains the inflated `Viewed` event without explaining real conversion.
+- **Users completing the rest of the funnel at a much lower rate per-arrival** — explains why the headline metric grew but downstream metrics didn't follow.
+
+If treatment users _arrive_ at a screen more often but _complete_ at a lower per-arrival rate, the "lift" is a denominator artifact and the per-converter behavior is the real story.
+
+### Variant-specific UI issues
+
+- **Treatment showed the wrong copy / wrong asset** — surprisingly common; treatment shipped, but to a subset of routes only.
+- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to [health-check-interpretation.md](health-check-interpretation.md).
+- **Treatment fired twice / persisted state across sessions** — implementation regression.
+
+---
+
+## How to frame the findings
+
+Replay analysis is qualitative. Be honest about that.
+
+- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in `live_metrics`."_
+- ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict.
+
+Tie observations back to specific quantitative results from `Get-Experiment`. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior.
+
+---
+
+## What NOT to do
+
+- ❌ Use replays to override a clear quantitative verdict. If primaries say "ship" and replays look ugly, the ugliness might be edge cases — confirm with segment analysis first.
+- ❌ Cherry-pick a single dramatic replay. n=1 is anecdote.
+- ❌ Replace segment analysis with replays. Replays explain _behavior_; segments explain _who_. Different questions.
+- ❌ Pull replays from broad cohorts ("all treatment users") — the contrast pair is what reveals signal.
+- ❌ Spend more time on replays than on the headline interpretation. The decision tree comes first; replays are the explanation step after it.
+
+---
+
+## Output shape
+
+1. **The quantitative result the replays are explaining** — link back to the specific metric and verdict.
+2. **Cohorts watched** — what filters were applied to A and B, how many replays in each.
+3. **Patterns observed**, with counts (e.g. "4 of 5 treatment replays showed X; 0 of 5 control replays did").
+4. **The explanation hypothesis** — careful to frame as hypothesis ("consistent with"), not as proof.
+5. **Recommended next action** — usually one of: ship anyway (regression edge-case), iterate (fix the friction), kill (treatment is materially worse), or run a follow-up A/B with the fix.
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md b/plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md
new file mode 100644
index 0000000..fdad2cd
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md
@@ -0,0 +1,115 @@
+# Why Hasn't This Reached Statistical Significance Yet?
+
+Open this when the user asks why a primary metric is still inconclusive — _"why isn't this stat-sig yet?"_, _"should I wait or stop?"_, _"is this just underpowered?"_. The goal is to help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null**, _without_ recomputing the platform's verdicts.
+
+The actual stop / extend math (sample size, power, MDE) is owned by the **setup-side skill** — point the user there for the formulas. This skill explains _which_ lever to pull, not how to recompute one.
+
+---
+
+## First, rule out a broken result
+
+Inconclusive can mean two very different things:
+
+1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about.
+2. **The result isn't trustworthy at all** (SRM failing, broken data, a Frequentist test that was peeked at, etc.), in which case "inconclusive" is the wrong frame entirely.
+
+Before answering "why no statsig?", run the trustworthiness gate (Step 1 of the Decision Tree). If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power.
+
+Also check:
+
+- `lift is None` on the primary → no measurement, not "no effect."
+- The primary is in `metrics[]` but missing from `live_metrics` and `results_cache.metrics` → "no measurement."
+- `live_results_errors` is non-null → results are stale or partial; resolve before drawing power conclusions.
+
+---
+
+## The five real reasons an experiment hasn't hit statsig
+
+Walk through these in order. The first one that explains the picture is usually right.
+
+### 1. Not enough sample yet (not enough exposures)
+
+**What to look at**: `live_exposures` per variant vs `settings.sampleSize`; or days elapsed since `start_date` vs `settings.endAfterDays`; plus `settings.testingModel`.
+
+- **Sequential** + target not reached → genuinely too early. Recommend **WAIT**.
+- **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe.
+- Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5.
+
+If exposures are falling short of plan because traffic dropped: surface that. `Run-Query` on the exposure event with a date breakdown shows whether something changed mid-experiment.
+
+### 2. Observed effect is smaller than the MDE
+
+**What to look at**: the lift on the primary in `live_metrics[primary][treatment].lift`, plus the MDE the user planned for (typically captured in the experiment's `description` or recovered via the setup-side skill's power math).
+
+- Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1.
+- Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options:
+  - **Accept the null** — at this size, the change isn't moving the metric. Document and move on.
+  - **Resize and rerun** — if a smaller effect would still be ship-worthy, re-run with a larger sample (lower MDE).
+- Observed lift much **larger** than planned MDE but still not significant → unusual; likely high variance (see reason 3) or insufficient exposures (reason 1).
+
+### 3. Variance is too high (metric is too noisy)
+
+**What to look at**: distribution type of the metric, plus `settings.cuped.enabled` and `settings.winsorization.enabled`.
+
+- **Gaussian** metric (revenue, time-on-page) with no winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization (default percentile 95) on the next run.
+- **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume.
+- **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the absolute size of any plausible effect. At a very low base rate, a given relative lift is a tiny absolute change and needs much more sample; near 100%, there is little headroom left for the metric to move.
+- **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%.
+- **CUPED enabled on a new-user-only cohort** → CUPED has no effect (no pre-exposure data exists). Not a misconfiguration to "fix," but variance reduction simply didn't happen.
+
+### 4. Traffic split is starving the variant
+
+**What to look at**: `settings.srm.targetAllocations` and `live_exposures` per variant.
+
+- Even split (50/50) → a balanced split is already optimal for power, so allocation is usually not the issue.
+- Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later.
+- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs its own ~350+ sample for the per-comparison stats to be reliable. Adding arms costs power per-comparison.
+
+Never change traffic allocation mid-Frequentist test — it invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment.
+
+### 5. Exposure config is filtering more users than the user expects
+
+**What to look at**: the exposure tracking method (`$experiment_started` event volume), any audience filters on the backing feature flag, and `settings.excludeQA`.
+
+- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." `Get-Feature-Flag` reveals the rollout rules; `Run-Query` on `$experiment_started` confirms how many users actually got exposed.
+- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with `Run-Query`.
+- `settings.excludeQA` was off and you suspect internal traffic is dominating one variant → enable it on the next run (results then are cleaner but also smaller).
+
+**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md).
+
+---
+
+## Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL?
+
+Once you know which reason fits, the recommendation almost picks itself.
+
+| Reason                                 | Recommendation                                                                                               |
+| -------------------------------------- | ------------------------------------------------------------------------------------------------------------ |
+| Not enough sample yet, still ACTIVE    | **WAIT.** Show projected end date based on observed traffic.                                                 |
+| Not enough sample yet, concluded early | **EXTEND** (Frequentist: relaunch with longer planned duration; Sequential: resume if possible).             |
+| Effect << MDE                          | **ACCEPT NULL** if the planned MDE is the smallest ship-worthy effect; otherwise **BOOST POWER** and re-run. |
+| Variance too high                      | **BOOST POWER**: enable CUPED, enable Winsorization, switch to a less noisy metric proxy.                    |
+| Variant starved by traffic split       | **EXTEND** (if remaining time is enough) or restart with rebalanced split.                                   |
+| Exposure config is filtering           | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample.               |
+| Experiment finished, well-powered      | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters.       |
+
+When recommending EXTEND on an active experiment, the call is `Update-Experiment` with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the setup-side skill for the power math.
+
+---
+
+## What NOT to suggest
+
+- ❌ **Stop early on a favorable peek** in a Frequentist test — that's exactly the false-positive inflation problem.
+- ❌ **Switch testing model mid-experiment** — restart, don't morph.
+- ❌ **Add more primary metrics** to "fish" for a win — multiplies the family-wise FPR. If a single primary is inconclusive, more primaries make the picture worse, not better.
+- ❌ **Re-run identical hypothesis on the same audience right after concluding "no effect"** — without a power change, you'll get the same answer.
+- ❌ **Claim "no effect"** from an underpowered inconclusive result — the right framing is "the experiment wasn't sized to detect the effect we observed."
+
+---
+
+## Output shape
+
+1. **The reason** (one of the five above), in one sentence.
+2. **The evidence from `Get-Experiment`** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.).
+3. **Recommendation** from the table above, with the specific `Update-Experiment` call or follow-up action.
+4. **What NOT to do**, briefly: the wrong-way temptation specific to this experiment.
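
Two of the reasons in `why-no-statsig.md` above (a skewed traffic split and triggered-cohort dilution) reduce to simple arithmetic, sketched below. The sketch assumes a two-arm test with equal per-user variance in both arms, and assumes that untriggered users experience zero effect; the function names are hypothetical and this is not the platform's power calculation, which remains owned by the setup-side skill.

```python
def split_sample_inflation(small_share: float) -> float:
    """Total-sample multiplier vs a 50/50 split for the same precision.

    The variance of a two-arm difference is proportional to 1/n_a + 1/n_b.
    With total N and share p in the smaller arm, that is (1/p + 1/(1-p)) / N,
    which is minimized at p = 0.5, where the bracketed term equals 4.
    """
    if not 0.0 < small_share < 1.0:
        raise ValueError("small_share must be strictly between 0 and 1")
    return (1.0 / small_share + 1.0 / (1.0 - small_share)) / 4.0


def diluted_effect(triggered_effect: float, triggered_fraction: float) -> float:
    """Population-level absolute effect when only some exposed users saw the change.

    Assumes untriggered users behave identically in control and treatment.
    """
    if not 0.0 <= triggered_fraction <= 1.0:
        raise ValueError("triggered_fraction must be between 0 and 1")
    return triggered_effect * triggered_fraction


if __name__ == "__main__":
    print(f"90/10 split needs {split_sample_inflation(0.10):.2f}x the total sample of 50/50")
    print(f"Diluted effect: {diluted_effect(0.05, 0.20):.3f}")
```

Under these assumptions, a 90/10 split needs roughly 2.8 times the total exposures of a 50/50 split for the same precision, and a 5-point effect among the 20% of users who triggered the change shows up as about a 1-point effect across the whole exposed population.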

From aa0a13c6f25f8fcd156428287aa986a7154b8fec Mon Sep 17 00:00:00 2001
From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com>
Date: Fri, 5 Jun 2026 00:30:00 +0000
Subject: [PATCH 02/11] Remove MCP tool name references and delete eval
 fixtures
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Replace explicit MCP tool names (Get-Experiment, Update-Experiment,
  Run-Query, Get-Feature-Flag, Get-Experiment-Setup-Guidance) with
  agent-agnostic phrasing per the convention from #22. Skills describe
  actions ("request experiment details", "query the metric", "update
  the experiment") rather than specific tool calls.
- Rename references/get-experiment-fields.md → experiment-fields.md so
  the filename doesn't echo a specific MCP tool name.
- Drop evals/ directory — this repo doesn't run evals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../skills/experiment-results/SKILL.md        | 26 +++---
 .../skills/experiment-results/evals/README.md | 34 --------
 .../evals/confetti-8-metrics.yaml             | 48 -----------
 .../evals/pelando-plus-2-others.yaml          | 79 -------------------
 .../evals/polarsteps-no-workaround.yaml       | 61 --------------
 ...eriment-fields.md => experiment-fields.md} | 31 ++++----
 .../references/health-check-interpretation.md | 14 ++--
 .../references/per-metric-interpretation.md   |  6 +-
 .../segment-breakdown-interpretation.md       |  4 +-
 .../references/session-replay-analysis.md     |  4 +-
 .../references/why-no-statsig.md              | 12 +--
 .../skills/experiment-results/SKILL.md        | 26 +++---
 .../skills/experiment-results/evals/README.md | 34 --------
 .../evals/confetti-8-metrics.yaml             | 48 -----------
 .../evals/pelando-plus-2-others.yaml          | 79 -------------------
 .../evals/polarsteps-no-workaround.yaml       | 61 --------------
 ...eriment-fields.md => experiment-fields.md} | 31 ++++----
 .../references/health-check-interpretation.md | 14 ++--
 .../references/per-metric-interpretation.md   |  6 +-
 .../segment-breakdown-interpretation.md       |  4 +-
 .../references/session-replay-analysis.md     |  4 +-
 .../references/why-no-statsig.md              | 12 +--
 .../skills/experiment-results/SKILL.md        | 26 +++---
 .../skills/experiment-results/evals/README.md | 34 --------
 .../evals/confetti-8-metrics.yaml             | 48 -----------
 .../evals/pelando-plus-2-others.yaml          | 79 -------------------
 .../evals/polarsteps-no-workaround.yaml       | 61 --------------
 ...eriment-fields.md => experiment-fields.md} | 31 ++++----
 .../references/health-check-interpretation.md | 14 ++--
 .../references/per-metric-interpretation.md   |  6 +-
 .../segment-breakdown-interpretation.md       |  4 +-
 .../references/session-replay-analysis.md     |  4 +-
 .../references/why-no-statsig.md              | 12 +--
 33 files changed, 141 insertions(+), 816 deletions(-)
 delete mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/evals/README.md
 delete mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/evals/confetti-8-metrics.yaml
 delete mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/evals/pelando-plus-2-others.yaml
 delete mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/evals/polarsteps-no-workaround.yaml
 rename plugins/mixpanel-mcp-eu/skills/experiment-results/references/{get-experiment-fields.md => experiment-fields.md} (84%)
 delete mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/evals/README.md
 delete mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/evals/confetti-8-metrics.yaml
 delete mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/evals/pelando-plus-2-others.yaml
 delete mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/evals/polarsteps-no-workaround.yaml
 rename plugins/mixpanel-mcp-in/skills/experiment-results/references/{get-experiment-fields.md => experiment-fields.md} (84%)
 delete mode 100644 plugins/mixpanel-mcp/skills/experiment-results/evals/README.md
 delete mode 100644 plugins/mixpanel-mcp/skills/experiment-results/evals/confetti-8-metrics.yaml
 delete mode 100644 plugins/mixpanel-mcp/skills/experiment-results/evals/pelando-plus-2-others.yaml
 delete mode 100644 plugins/mixpanel-mcp/skills/experiment-results/evals/polarsteps-no-workaround.yaml
 rename plugins/mixpanel-mcp/skills/experiment-results/references/{get-experiment-fields.md => experiment-fields.md} (84%)

diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md
index 4e344d3..0164c56 100644
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: experiment-results
-description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts that `Get-Experiment` returns — never recomputes thresholds.
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds.
 license: Apache-2.0
 ---
 
@@ -10,8 +10,8 @@ You are helping a user read, interpret, or make a ship/iterate/kill/wait decisio
 
 ## Requirements
 
-- Access to Mixpanel via the MCP server (specifically the `Get-Experiment` tool — and, for ship/kill decisions, `Update-Experiment`).
-- This skill reads the verdicts that `Get-Experiment` already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values.
+- Access to Mixpanel (read experiment details and metrics; update experiment lifecycle for ship/kill decisions).
+- This skill reads the verdicts the platform's experiment-details response already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values.
 
 ## When to use this skill
 
@@ -24,13 +24,13 @@ Trigger when the user asks anything about reading an experiment's results or its
 - "What does this Retro A/A failure mean?"
 - "Can you compare the session replays for control vs treatment?"
 
-Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the setup-side skill or tool.
+Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the `experiment-setup` skill.
 
 ---
 
-## How to read `Get-Experiment` output
+## How to read experiment-details output
 
-Always call `Get-Experiment` with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.**
+Always request experiment details with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.**
 
 | Concept                      | Live (preferred)                  | Cached fallback                             |
 | ---------------------------- | --------------------------------- | ------------------------------------------- |
@@ -44,7 +44,7 @@ If `live_results_errors` is non-null, the live path failed. Use the cache, cavea
 
 If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect."
 
-See [references/get-experiment-fields.md](references/get-experiment-fields.md) for the full field map and which fields drive each step below.
+See [references/experiment-fields.md](references/experiment-fields.md) for the full field map and which fields drive each step below.
 
 ---
 
@@ -111,7 +111,7 @@ Read these fields. Treat the platform's verdict as authoritative — do not reap
 | Retro A/A (pre-exp bias) | `settings.preExperimentBias` enabled, then the analysis                                                | Platform flags a significant pre-period difference.                                                                                                            |
 | Minimum elapsed time     | `end_date - start_date`                                                                                | Less than ~3 days regardless of sample size — interpretation is unreliable.                                                                                    |
 | Ran for planned duration | `start_date`, `end_date`, `settings.endAfterDays`/`sampleSize`/`endCondition`, `settings.testingModel` | Frequentist: ended before reaching configured target = peeking. Sequential: early stop on significance is allowed.                                             |
-| Misconfiguration         | See [references/get-experiment-fields.md](references/get-experiment-fields.md) §Misconfig              | Any flagged misconfig invalidates analysis.                                                                                                                    |
+| Misconfiguration         | See [references/experiment-fields.md](references/experiment-fields.md) §Misconfig                      | Any flagged misconfig invalidates analysis.                                                                                                                    |
 
 If any of these fail, **stop**. Tell the user explicitly that results are not trustworthy. Open [references/health-check-interpretation.md](references/health-check-interpretation.md) for the per-failure root-cause checklists, recommended actions, and the Kohavi framing ("SRM is the #1 trustworthiness check; Twyman's Law: any unusually clean result is more likely a bug than a discovery").
 
@@ -163,13 +163,13 @@ A 5% lift on a 20% baseline metric serving 1M users/week is enormous. A 5% lift
 
 **Twyman's Law check**: before celebrating any lift > ~30%, ask: did the treatment change who is _exposed_ to this metric, not just how they behave? See the changed-denominator notes in [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
 
-If `value` or `sampleSize` is `null` (common when live computation timed out), call `Run-Query` on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total.
+If `value` or `sampleSize` is `null` (common when live computation timed out), run a query on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total.
 
 ### Step 5 — Verdict
 
 | Situation                                                              | Recommendation                                                                                                                                               |
 | ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** `Update-Experiment(action="decide", success=true, variant=<winner>, message=<rationale>)`                                                          |
+| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Use the experiment's `decide` action with `success=true`, `variant=<winner>`, and a `message` rationale.                                           |
 | Trust ✓, primary polarity positive, guardrail polarity negative        | **ITERATE.** Investigate the regression; do not auto-ship.                                                                                                   |
 | Trust ✓, primary polarity neutral after target sample reached          | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md).                                  |
 | Trust ✓, target sample/duration not yet reached                        | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)).                                         |
@@ -200,7 +200,7 @@ Once the spine is clear, the user often asks one of these follow-ups. Open the r
 | "Which segments should I break this down on?"                                   | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md)       |
 | "What does this segment-by-segment result mean?" (when platform support exists) | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) |
 | "Can session replays help explain this result?"                                 | [references/session-replay-analysis.md](references/session-replay-analysis.md)                   |
-| "Which `Get-Experiment` field has X?"                                           | [references/get-experiment-fields.md](references/get-experiment-fields.md)                       |
+| "Which field in the experiment-details response has X?"                         | [references/experiment-fields.md](references/experiment-fields.md)                               |
 
 ---
 
@@ -212,9 +212,9 @@ Default to this shape unless the user asks for something else:
 2. **Why**, walking through the decision tree steps that mattered (skip the steps that were clearly fine).
 3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, with the polarity-corrected reading of each. Include the absolute-impact translation for any win.
 4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, etc.
-5. **Suggested next action** — the `Update-Experiment` call to make, or the deeper investigation to run.
+5. **Suggested next action** — the experiment-decide action to take, or the deeper investigation to run.
 
-If `Get-Experiment` is unavailable or returns errors, say so — do not invent a verdict.
+If experiment details are unavailable or return errors, say so — do not invent a verdict.
 
 ---
 
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/README.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/README.md
deleted file mode 100644
index 71278d6..0000000
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/README.md
+++ /dev/null
@@ -1,34 +0,0 @@
-# Eval fixtures — `experiment-results`
-
-Each fixture is a self-contained prompt + expected-behavior pair for the `experiment-results` skill. They are seeded from PRD customer quotes — the customer pain that motivated this skill in the first place.
-
-The fixtures are not auto-runnable yet (no harness lives in this repo). They're written for two uses:
-
-1. **Manual rehearsal** — a human (or another agent) can read the prompt, simulate the response the skill should produce, and check it against the `expected_behavior` field.
-2. **Regression checkpoint when a runner exists** — when an eval harness is added in this repo, these prompts plug in directly: each YAML doc becomes one case, the `expected_behavior` field becomes the grader rubric.
-
-When you change `SKILL.md`, walk these fixtures and confirm each one still produces the expected behavior. If a fixture starts failing, decide whether the skill regressed or the fixture itself needs updating.
-
----
-
-## Fixtures
-
-| Fixture                         | PRD source quote                                                                                                         | What it exercises                                                                              |
-| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------- |
-| `pelando-plus-2-others.yaml`    | Pelando — _"+2 others"_ (results too noisy for the user to triage which results to act on)                               | Decision tree spine + per-metric polarity; ship/iterate verdict against multi-variant noise.   |
-| `confetti-8-metrics.yaml`       | Confetti — _"8 metrics for new visitors"_ (many primaries; user wants segment-of-interest selection on new vs returning) | Segment-of-interest selection; multiple-testing correction warning; per-metric interpretation. |
-| `polarsteps-no-workaround.yaml` | Polarsteps — _"no documented workaround"_ (user wants to understand SRM failure with no canned path forward)             | Health-check interpretation; Kohavi framing; ordered-causes recommendation.                    |
-
-Each YAML doc has the same shape:
-
-```yaml
-name: <slug>
-prd_source: <one-line attribution>
-trigger_phrase: <what the user types>
-get_experiment_summary: <key fields the skill would see; not full response — just enough for the eval>
-expected_behavior:
-  verdict: <SHIP | ITERATE | KILL | WAIT | DO_NOT_DECIDE>
-  must_mention: [<phrases / framings the skill must cover>]
-  must_not_do: [<failure modes the skill should avoid>]
-  references_consulted: [<which reference files the skill should pull open>]
-```
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/confetti-8-metrics.yaml b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/confetti-8-metrics.yaml
deleted file mode 100644
index da61d9e..0000000
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/confetti-8-metrics.yaml
+++ /dev/null
@@ -1,48 +0,0 @@
-name: confetti-8-metrics
-prd_source: |
-  Confetti — "8 metrics for new visitors"
-  Customer is running an experiment with 8 primary-ish metrics and explicitly
-  cares about new-visitor behavior. They want a segment-driven read, not a
-  dump of 8 lifts. The skill should pre-commit to segments tied to the
-  hypothesis (new vs returning), call out the multiple-testing concern with
-  8 metrics, and produce a verdict scoped to the segment that matters.
-
-trigger_phrase: |
-  We're tracking 8 metrics on this onboarding redesign experiment and I really
-  care about how new visitors respond. Can you read this and tell me whether
-  it's a ship for the new-user audience?
-
-get_experiment_summary:
-  hypothesis: |
-    If we redesign the first-session onboarding flow, then activation rate
-    among NEW visitors will increase by ≥5% relative, because reducing
-    cold-start friction shortens time-to-first-value.
-  settings:
-    controlKey: "control"
-    multipleTestingCorrection: "off" # mis-configured given 8 primaries
-    testingModel: "sequential"
-    confidenceLevel: 0.95
-  metrics_count: 8
-  primary_metrics_summary: |
-    Of 8 primaries: 2 significant positive (polarity-correct), 1 significant
-    negative (a "Time to First Action" metric with direction=down where
-    lift is -7% — actually a WIN once polarity-applied), 5 inconclusive.
-
-expected_behavior:
-  verdict: WAIT
-  must_mention:
-    - "Pre-commit to the new-vs-returning segment because the hypothesis names new visitors as the cohort that matters"
-    - "Recommend at most 3–5 segments and call new vs returning the primary slice"
-    - "Multiple-testing correction is OFF but there are 8 primaries — flag the inflated family-wise FPR explicitly (rough order: with 8 primaries × 1 variant at α=0.05, family-wise FPR is high enough to make a single significant result inconclusive on its own)"
-    - "Apply polarity recipe per metric — flag the Time to First Action 'negative bucket' as a win because direction=down"
-    - "Without correction enabled, don't ship on a single significant primary; either enable correction and re-analyze or look at the aggregate of all 8"
-    - "Verdict is WAIT (re-analyze with multiple-testing correction enabled, segmented to new visitors) — not SHIP"
-  must_not_do:
-    - "Slice by every available property after the fact (the fishing-expedition warning)"
-    - "Treat the 'Time to First Action' metric in the negative bucket as a loss without polarity-correcting"
-    - "Call the experiment a ship because 2 of 8 primaries are significant positive"
-    - "Pretend the agent can compute the corrected p-values itself — instead, recommend re-running with multipleTestingCorrection enabled"
-  references_consulted:
-    - segment-of-interest-selection.md
-    - per-metric-interpretation.md
-    - health-check-interpretation.md # for the misconfig flag
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/pelando-plus-2-others.yaml b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/pelando-plus-2-others.yaml
deleted file mode 100644
index f634236..0000000
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/pelando-plus-2-others.yaml
+++ /dev/null
@@ -1,79 +0,0 @@
-name: pelando-plus-2-others
-prd_source: |
-  Pelando — "+2 others"
-  Customer reported that when a multi-variant test concludes with a winner banner
-  plus a small-print "+2 others", they cannot tell which non-winner variants are
-  benign vs which contain a guardrail regression they need to act on. The skill
-  should pivot the summary per variant, polarity-correct each, and call out the
-  losers, not gloss over them.
-
-trigger_phrase: |
-  Can you make sense of this experiment for me? The UI shows treatment_a winning
-  on the primary plus "+2 others" but I have no idea whether treatment_b or
-  treatment_c are okay to ignore.
-
-get_experiment_summary:
-  settings:
-    controlKey: "control"
-    multipleTestingCorrection: "benjamini-hochberg"
-    testingModel: "sequential"
-  metrics:
-    - id: m_primary
-      type: primary
-      direction: up
-      name: "Activation Rate"
-    - id: m_guardrail_latency
-      type: guardrail
-      direction: down
-      name: "p95 Latency (ms)"
-    - id: m_guardrail_errors
-      type: guardrail
-      direction: down
-      name: "Error Rate"
-  live_exposures:
-    control: 41123
-    treatment_a: 40987
-    treatment_b: 41210
-    treatment_c: 40755
-  live_srm_analysis:
-    # platform-flagged passing
-    p_value: 0.42
-  summary:
-    positive:
-      - {
-          metricId: m_primary,
-          variant: treatment_a,
-          lift: 0.041,
-          liftConfidence: 0.95,
-        }
-      - {
-          metricId: m_guardrail_latency,
-          variant: treatment_b,
-          lift: 0.08,
-          liftConfidence: 0.95,
-        }
-    negative:
-      - {
-          metricId: m_primary,
-          variant: treatment_c,
-          lift: -0.022,
-          liftConfidence: 0.95,
-        }
-    no:
-      - { metricId: m_primary, variant: treatment_b, lift: 0.004 }
-
-expected_behavior:
-  verdict: ITERATE
-  must_mention:
-    - "Pivot the summary by variant before declaring a winner"
-    - "treatment_a wins on the primary but treatment_b shows a latency regression once polarity is applied (direction=down + lift +8% = bad)"
-    - "treatment_c regresses the primary"
-    - "Multi-variant verdict requires each treatment to be judged independently against control"
-    - "Recommend iterate, not ship — at minimum, do not ship treatment_b, and investigate treatment_c before re-running"
-  must_not_do:
-    - "Quietly drop treatment_b and treatment_c into '+2 others' without polarity-checking each"
-    - "Trust the bucket name (positive/negative) as the business verdict"
-    - "Re-apply multiple-testing correction on top of the platform's benjamini-hochberg"
-  references_consulted:
-    - per-metric-interpretation.md
-    - get-experiment-fields.md
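The polarity reading this fixture expects (a `positive` bucket on a `direction: down` metric is a regression, a `negative` bucket on a `direction: down` metric is a win) can be sketched in a few lines of Python. This is only an illustration of that one rule, not the skill's full polarity recipe; `business_reading` is a hypothetical helper, and the bucket names are taken from the `summary` keys shown in the fixture (`positive`, `negative`, `no`).

```python
def business_reading(bucket: str, direction: str) -> str:
    """Translate a summary bucket into a business verdict (hypothetical helper).

    bucket:    "positive" | "negative" | "no"  (summary bucket the row sits in)
    direction: "up" | "down"                    (which way the metric should move)
    """
    if bucket == "no":
        # The platform found no significant movement; nothing to polarity-correct.
        return "inconclusive"
    if bucket not in ("positive", "negative") or direction not in ("up", "down"):
        raise ValueError(f"unexpected bucket/direction: {bucket!r}/{direction!r}")
    moved_up = bucket == "positive"
    wants_up = direction == "up"
    # A move is a win only when it goes the way the metric is supposed to go.
    return "win" if moved_up == wants_up else "regression"
```

Applied to the fixture above, `m_guardrail_latency` for `treatment_b` (`positive` bucket, `direction: down`) reads as a regression, which is why the expected verdict is ITERATE rather than SHIP.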
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/polarsteps-no-workaround.yaml b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/polarsteps-no-workaround.yaml
deleted file mode 100644
index 325a3bf..0000000
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/polarsteps-no-workaround.yaml
+++ /dev/null
@@ -1,61 +0,0 @@
-name: polarsteps-no-workaround
-prd_source: |
-  Polarsteps — "no documented workaround"
-  Customer's experiment is failing SRM and they cannot find a documented path
-  forward. The skill should consume the platform's SRM verdict (not recompute
-  chi-square), cite Kohavi ("SRM is the #1 trustworthiness check"), and
-  surface ordered likely causes plus a specific recommended action — not
-  punt with "investigate further."
-
-trigger_phrase: |
-  My experiment is failing SRM and the result lift looks too good to be true
-  (+18% on the primary). The docs just say "investigate" — what does that
-  actually mean here? Should I trust the lift?
-
-get_experiment_summary:
-  settings:
-    controlKey: "control"
-    srm:
-      enabled: true
-      targetAllocations: { control: 50, treatment: 50 }
-    excludeQA: false # potentially relevant
-  live_exposures:
-    control: 18250
-    treatment: 22980
-  live_srm_analysis:
-    # platform-flagged FAILING
-    p_value: 0.00002
-    chi_square: 18.4
-  summary:
-    positive:
-      - {
-          metricId: m_primary,
-          variant: treatment,
-          lift: 0.18,
-          liftConfidence: 0.95,
-        }
-  metrics:
-    - id: m_primary
-      type: primary
-      direction: up
-      name: "Trip Plan Created"
-
-expected_behavior:
-  verdict: DO_NOT_DECIDE
-  must_mention:
-    - "SRM is failing per the platform's verdict — do NOT trust the +18% lift"
-    - "Cite Kohavi: SRM is the #1 trustworthiness check; when SRM is failing, lift, p-values, and confidence intervals cannot be attributed to the treatment"
-    - "Twyman's Law: a +18% lift on a failing-SRM experiment is more likely a bucketing bug than a genuine win"
-    - "Likely causes ordered most → least likely: bucketing_bug, biased_assignment, bot_traffic, exposure_tracking_bug, ramp_up_timing"
-    - "Recommended action: pause_and_investigate — pause before drawing conclusions; randomization assumption is broken"
-    - "Concrete next steps: compare live_exposures to targetAllocations; check feature-flag rules and history via Get-Feature-Flag; Run-Query $experiment_started by variant; enable settings.excludeQA before relaunch given it is currently off"
-    - "Do NOT recompute the SRM chi-square — consume the platform's verdict"
-    - "Restart with fixed bucketing once the cause is found; do NOT re-conclude on the broken data"
-  must_not_do:
-    - "Calculate the chi-square or re-derive an SRM p-value threshold"
-    - "Recommend shipping or treating the +18% lift as real"
-    - "Hand the user a generic 'investigate further' without ordered causes and an action"
-    - "Skip Kohavi framing — it's the whole reason this check is the #1 gate"
-  references_consulted:
-    - health-check-interpretation.md
-    - get-experiment-fields.md
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/get-experiment-fields.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/experiment-fields.md
similarity index 84%
rename from plugins/mixpanel-mcp-eu/skills/experiment-results/references/get-experiment-fields.md
rename to plugins/mixpanel-mcp-eu/skills/experiment-results/references/experiment-fields.md
index efaeae5..1e65de1 100644
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/get-experiment-fields.md
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/experiment-fields.md
@@ -1,6 +1,6 @@
-# `Get-Experiment` Field Map
+# Experiment-Details Field Map
 
-Quick reference for which `Get-Experiment` response field drives each interpretation. Always call with `compute_exposures=true, compute_metrics=true`.
+Quick reference for which experiment-details response field drives each interpretation. Always request the details with `compute_exposures=true, compute_metrics=true`.
 
 This reference is **read-only domain knowledge** for the agent. It does NOT define thresholds — every "fail condition" listed below is a _characterization_ of how the platform itself already classifies the field, not a threshold this skill should re-apply.
 
@@ -122,16 +122,13 @@ For a kill, pass `success=false`.
 
 ## Lifecycle hand-off
 
+To ship/kill, update the experiment with the `decide` action and these fields:
+
 ```
-Update-Experiment(
-  experiment_id=<id>,
-  experiment={
-    "action": "decide",
-    "success": true | false,
-    "variant": "<winner_key>",      # required when success=true
-    "message": "<rationale: metrics evaluated, polarity, tradeoffs accepted>"
-  }
-)
+action     → "decide"
+success    → true | false
+variant    → "<winner_key>"      # required when success=true
+message    → "<rationale: metrics evaluated, polarity, tradeoffs accepted>"
 ```
 
 `message` is required on every `decide` call.
@@ -152,10 +149,10 @@ For _how_ to react to each of these, see [health-check-interpretation.md](health
 
 ---
 
-## When to reach for sibling tools
+## When to reach for sibling capabilities
 
-- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the setup-side skill / `Get-Experiment-Setup-Guidance`.
-- **Raw data for triggered or segmentation analysis** → `Run-Query` on the metric with appropriate filters.
-- **Acting on the recommendation** (ship, kill, extend) → `Update-Experiment` with the appropriate action.
-- **Feature-flag rollout history** for SRM root cause → `Get-Feature-Flag`.
-- **Session replays** for behavioral explanation of a quantitative result → the replay-fetch tool (see [session-replay-analysis.md](session-replay-analysis.md)).
+- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the `experiment-setup` skill.
+- **Raw data for triggered or segmentation analysis** → run a query on the metric with appropriate filters.
+- **Acting on the recommendation** (ship, kill, extend) → update the experiment with the appropriate action.
+- **Feature-flag rollout history** for SRM root cause → inspect the linked flag's state.
+- **Session replays** for behavioral explanation of a quantitative result → see [session-replay-analysis.md](session-replay-analysis.md).
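The `decide` fields listed under "Lifecycle hand-off" carry two rules: `variant` is required when `success=true`, and `message` is required on every call. A minimal Python sketch of a validator for those rules follows. `build_decide_fields` is a hypothetical helper, not part of any Mixpanel API, and how the resulting fields are submitted is left to whatever update capability is available.

```python
from typing import Optional


def build_decide_fields(success: bool, message: str, variant: Optional[str] = None) -> dict:
    """Assemble and validate the fields for a `decide` update (hypothetical helper)."""
    if not message or not message.strip():
        # `message` is required on every decide call, ship or kill.
        raise ValueError("message is required on every decide call")
    if success and not variant:
        # A ship must name the winning variant key.
        raise ValueError("variant is required when success=true")
    fields = {"action": "decide", "success": success, "message": message.strip()}
    if variant:
        fields["variant"] = variant
    return fields
```

For example, `build_decide_fields(False, "Primary flat after target sample reached")` yields the kill form, with `success` set to `False` and no `variant` key.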
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md
index 4471219..9ec66df 100644
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md
@@ -44,9 +44,9 @@ Users were assigned to variants in proportions that disagree with the configured
 ### Investigation checklist
 
 1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented?
-2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Use `Get-Feature-Flag` to inspect rollout rules and history.
+2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history.
 3. For multi-variant tests, the platform's SRM threshold is Bonferroni-corrected — the effective per-variant threshold may be tighter than the headline. Trust the bucket flag, not raw p-value math.
-4. Verify SDK version and bucketing logic. `Run-Query` for `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly.
+4. Verify SDK version and bucketing logic. Query `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly.
 5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it.
 6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting.
 7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.**
@@ -68,7 +68,7 @@ The same statistical comparison run on the **pre-exposure** period revealed that
 
 1. Identify which metric × variant pair triggered the failure (after the platform's correction).
 2. Check whether bucketing was deterministic — non-deterministic assignment in the pre-period means users were assigned to different variants than they would have been in production.
-3. Look for cohort skew: did one variant disproportionately receive heavy users? `Run-Query` on the metric pre-experiment grouped by variant to confirm.
+3. Look for cohort skew: did one variant disproportionately receive heavy users? Query the metric pre-experiment grouped by variant to confirm.
 4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort.
 5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing.
 
@@ -81,9 +81,9 @@ The same statistical comparison run on the **pre-exposure** period revealed that
 ### Investigation checklist
 
 1. Check `live_exposures` totals — which variant is undersampled?
-2. Inspect feature-flag rollout: `Get-Feature-Flag` → was rollout dialed back?
-3. `Run-Query` for the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?).
-4. If the experiment is still ACTIVE: extend duration via `Update-Experiment` with `endAfterDays`.
+2. Inspect feature-flag rollout — was rollout dialed back?
+3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?).
+4. If the experiment is still ACTIVE: extend duration via an experiment update with a new `endAfterDays`.
 5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math.
 
 If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question.
@@ -115,7 +115,7 @@ A frequentist test that ends before reaching its configured target has an **infl
 
 ### Investigation checklist
 
-1. Try `Get-Experiment` again — transient backend load may resolve. Wait ~30s between retries.
+1. Retry the experiment-details request — transient backend load may resolve. Wait ~30s between retries.
 2. If repeated failures: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget.
 3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision.
 4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation.
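The "prefer live, fall back to cache, surface errors" rule from `SKILL.md`, combined with the cache-freshness guidance in item 4 above, can be sketched as follows. Several things here are assumptions rather than documented behavior: the helper name `choose_results`, the six-hour freshness window (the source only says "within hours"), `$last_computed` being an ISO-8601 string stored on `results_cache`, and live results being passed in directly instead of read from a named response field.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Optional


def choose_results(
    live: Optional[dict],
    live_errors: Any,
    cache: Optional[dict],
    now: datetime,
    fresh_within: timedelta = timedelta(hours=6),  # assumed window, not a platform value
) -> dict:
    """Pick which results to interpret and which caveat to surface (hypothetical helper)."""
    if live and not live_errors:
        return {"source": "live", "results": live, "caveat": None}

    cached = (cache or {}).get("metrics")
    stamp = (cache or {}).get("$last_computed")
    if not cached or not stamp:
        # Both paths are empty: never read this as "no effect".
        return {"source": "none", "results": None,
                "caveat": "no result was computed; recommend a re-sync"}

    age = now - datetime.fromisoformat(stamp)
    if age > fresh_within:
        return {"source": "none", "results": None,
                "caveat": f"cache last computed {stamp} is too old; resolve the backend issue first"}

    caveat = f"stale data, last computed {stamp}"
    if live_errors:
        caveat += f"; live path failed: {live_errors}"
    return {"source": "cache", "results": cached, "caveat": caveat}
```

The sketch works at the level of the whole response. The "no result was computed" rule in `SKILL.md` is stated per metric, so a real implementation would likely apply the same check metric by metric.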
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md
index 3b44385..1e8678c 100644
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md
@@ -2,7 +2,7 @@
 
 Open this when the user wants you to translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_
 
-**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from `Get-Experiment`. Then translate.
+**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from the experiment-details response. Then translate.
 
 ---
 
@@ -88,7 +88,7 @@ Statistical significance ≠ business impact. Always convert a win into absolute
 
 Common — happens whenever live computation timed out or `results_cache.metrics` was nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.**
 
-Call `Run-Query` on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation:
+Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation:
 
 - `unique` (Bernoulli) → conversion **rate** as the baseline.
 - `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. Multiplying lift by a raw total double-counts cohort size.
@@ -165,7 +165,7 @@ Check `settings.testingModel`:
 - `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration.
 - `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended.
 
-Calling `Update-Experiment(action="conclude")` on a Frequentist experiment that hasn't reached its target is a peeking event. Flag it in the verdict.
+Concluding a Frequentist experiment before it reaches its target is a peeking event. Flag it in the verdict.
 
 ---
 
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md
index 6877d2a..fcf9cfd 100644
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md
@@ -2,7 +2,7 @@
 
 Open this when the user has per-segment results in hand and wants to read them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
 
-> **Platform support status.** Reading segment-level experiment results in `Get-Experiment` depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment `Run-Query` calls against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If `Get-Experiment` doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the `Run-Query` fallback — do not invent per-segment significance verdicts.
+> **Platform support status.** Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.
 
 ---
 
@@ -80,7 +80,7 @@ This is the everyday case of mixed effects.
 ## What NOT to do
 
 - ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition.
-- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment `Run-Query` fallback — they're not corrected unless the platform did it.
+- ❌ Assume the overall multiple-testing correction covers segment-level rows from a per-segment query fallback; those rows are uncorrected unless the platform corrected them.
 - ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal.
 - ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism.
 - ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence).
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md
index 88640f4..b758b8e 100644
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md
@@ -2,7 +2,7 @@
 
 Open this when the user wants to use session replays to explain a quantitative experiment result — _"why is conversion down in treatment?"_, _"what are users actually doing in the treatment?"_, _"can replays explain the regression?"_. The goal is to turn a number into a behavior story.
 
-> **Tool boundary.** This skill provides the _interpretation_ guidance for replay analysis. The actual replay-fetching tool — pulling replay IDs for control vs treatment cohorts — lives on the platform side (a separate fetch tool exposed alongside `Get-Experiment`, when available). If the fetch tool isn't yet available, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss.
+> **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss.
 
 ---
 
@@ -86,7 +86,7 @@ Replay analysis is qualitative. Be honest about that.
 - ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in `live_metrics`."_
 - ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict.
 
-Tie observations back to specific quantitative results from `Get-Experiment`. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior.
+Tie observations back to specific quantitative results from the experiment-details response. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either an unrepresentative cohort sample or richer-than-expected behavior.
 
 ---
 
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md
index fdad2cd..142089c 100644
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md
@@ -35,7 +35,7 @@ Walk through these in order. The first one that explains the picture is usually
 - **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe.
 - Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5.
 
-If exposures are falling short of plan because traffic dropped: surface that. `Run-Query` on the exposure event with a date breakdown shows whether something changed mid-experiment.
+If exposures are falling short of plan because traffic dropped: surface that. Querying the exposure event with a date breakdown shows whether something changed mid-experiment.
 
 ### 2. Observed effect is smaller than the MDE
 
@@ -71,8 +71,8 @@ Never change traffic allocation mid-Frequentist test — it invalidates the SRM
 
 **What to look at**: the exposure tracking method (`$experiment_started` event volume), any audience filters on the backing feature flag, and `settings.excludeQA`.
 
-- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." `Get-Feature-Flag` reveals the rollout rules; `Run-Query` on `$experiment_started` confirms how many users actually got exposed.
-- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with `Run-Query`.
+- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query `$experiment_started` to confirm how many users actually got exposed.
+- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event.
 - `settings.excludeQA` was off and you suspect internal traffic is dominating one variant → enable it on the next run (results then are cleaner but also smaller).
 
 **Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md).
@@ -93,7 +93,7 @@ Once you know which reason fits, the recommendation almost picks itself.
 | Exposure config is filtering           | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample.               |
 | Experiment finished, well-powered      | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters.       |
 
-When recommending EXTEND on an active experiment, the call is `Update-Experiment` with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the setup-side skill for the power math.
+When recommending EXTEND on an active experiment, the action is an experiment update with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the `experiment-setup` skill for the power math.
 
 ---
 
@@ -110,6 +110,6 @@ When recommending EXTEND on an active experiment, the call is `Update-Experiment
 ## Output shape
 
 1. **The reason** (one of the five above), in one sentence.
-2. **The evidence from `Get-Experiment`** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.).
-3. **Recommendation** from the table above, with the specific `Update-Experiment` call or follow-up action.
+2. **The evidence from the experiment-details response** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.).
+3. **Recommendation** from the table above, with the specific experiment update or follow-up action.
 4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment.
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md
index 4e344d3..0164c56 100644
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: experiment-results
-description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts that `Get-Experiment` returns — never recomputes thresholds.
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds.
 license: Apache-2.0
 ---
 
@@ -10,8 +10,8 @@ You are helping a user read, interpret, or make a ship/iterate/kill/wait decisio
 
 ## Requirements
 
-- Access to Mixpanel via the MCP server (specifically the `Get-Experiment` tool — and, for ship/kill decisions, `Update-Experiment`).
-- This skill reads the verdicts that `Get-Experiment` already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values.
+- Access to Mixpanel (read experiment details and metrics; update experiment lifecycle for ship/kill decisions).
+- This skill reads the verdicts the platform's experiment-details response already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values.
 
 ## When to use this skill
 
@@ -24,13 +24,13 @@ Trigger when the user asks anything about reading an experiment's results or its
 - "What does this Retro A/A failure mean?"
 - "Can you compare the session replays for control vs treatment?"
 
-Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the setup-side skill or tool.
+Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the `experiment-setup` skill.
 
 ---
 
-## How to read `Get-Experiment` output
+## How to read experiment-details output
 
-Always call `Get-Experiment` with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.**
+Always request experiment details with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.**
 
 | Concept                      | Live (preferred)                  | Cached fallback                             |
 | ---------------------------- | --------------------------------- | ------------------------------------------- |
@@ -44,7 +44,7 @@ If `live_results_errors` is non-null, the live path failed. Use the cache, cavea
 
 If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect."
 
-See [references/get-experiment-fields.md](references/get-experiment-fields.md) for the full field map and which fields drive each step below.
+See [references/experiment-fields.md](references/experiment-fields.md) for the full field map and which fields drive each step below.
 
 ---
 
@@ -111,7 +111,7 @@ Read these fields. Treat the platform's verdict as authoritative — do not reap
 | Retro A/A (pre-exp bias) | `settings.preExperimentBias` enabled, then the analysis                                                | Platform flags a significant pre-period difference.                                                                                                            |
 | Minimum elapsed time     | `end_date - start_date`                                                                                | Less than ~3 days regardless of sample size — interpretation is unreliable.                                                                                    |
 | Ran for planned duration | `start_date`, `end_date`, `settings.endAfterDays`/`sampleSize`/`endCondition`, `settings.testingModel` | Frequentist: ended before reaching configured target = peeking. Sequential: early stop on significance is allowed.                                             |
-| Misconfiguration         | See [references/get-experiment-fields.md](references/get-experiment-fields.md) §Misconfig              | Any flagged misconfig invalidates analysis.                                                                                                                    |
+| Misconfiguration         | See [references/experiment-fields.md](references/experiment-fields.md) §Misconfig                      | Any flagged misconfig invalidates analysis.                                                                                                                    |
 
 If any of these fail, **stop**. Tell the user explicitly that results are not trustworthy. Open [references/health-check-interpretation.md](references/health-check-interpretation.md) for the per-failure root-cause checklists, recommended actions, and the Kohavi framing ("SRM is the #1 trustworthiness check; Twyman's Law: any unusually clean result is more likely a bug than a discovery").
 
@@ -163,13 +163,13 @@ A 5% lift on a 20% baseline metric serving 1M users/week is enormous. A 5% lift
 
 **Twyman's Law check**: before celebrating any lift > ~30%, ask: did the treatment change who is _exposed_ to this metric, not just how they behave? See the changed-denominator notes in [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
 
-If `value` or `sampleSize` is `null` (common when live computation timed out), call `Run-Query` on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total.
+If `value` or `sampleSize` is `null` (common when live computation timed out), run a query on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total.
 
 ### Step 5 — Verdict
 
 | Situation                                                              | Recommendation                                                                                                                                               |
 | ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** `Update-Experiment(action="decide", success=true, variant=<winner>, message=<rationale>)`                                                          |
+| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Use the experiment's `decide` action with `success=true`, `variant=<winner>`, and a `message` rationale.                                           |
 | Trust ✓, primary polarity positive, guardrail polarity negative        | **ITERATE.** Investigate the regression; do not auto-ship.                                                                                                   |
 | Trust ✓, primary polarity neutral after target sample reached          | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md).                                  |
 | Trust ✓, target sample/duration not yet reached                        | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)).                                         |
@@ -200,7 +200,7 @@ Once the spine is clear, the user often asks one of these follow-ups. Open the r
 | "Which segments should I break this down on?"                                   | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md)       |
 | "What does this segment-by-segment result mean?" (when platform support exists) | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) |
 | "Can session replays help explain this result?"                                 | [references/session-replay-analysis.md](references/session-replay-analysis.md)                   |
-| "Which `Get-Experiment` field has X?"                                           | [references/get-experiment-fields.md](references/get-experiment-fields.md)                       |
+| "Which field in the experiment-details response has X?"                         | [references/experiment-fields.md](references/experiment-fields.md)                               |
 
 ---
 
@@ -212,9 +212,9 @@ Default to this shape unless the user asks for something else:
 2. **Why**, walking through the decision tree steps that mattered (skip the steps that were clearly fine).
 3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, with the polarity-corrected reading of each. Include the absolute-impact translation for any win.
 4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, etc.
-5. **Suggested next action** — the `Update-Experiment` call to make, or the deeper investigation to run.
+5. **Suggested next action** — the experiment-decide action to take, or the deeper investigation to run.
 
-If `Get-Experiment` is unavailable or returns errors, say so — do not invent a verdict.
+If experiment details are unavailable or return errors, say so — do not invent a verdict.
 
 ---
 
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/README.md b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/README.md
deleted file mode 100644
index 71278d6..0000000
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/README.md
+++ /dev/null
@@ -1,34 +0,0 @@
-# Eval fixtures — `experiment-results`
-
-Each fixture is a self-contained prompt + expected-behavior pair for the `experiment-results` skill. They are seeded from PRD customer quotes — the customer pain that motivated this skill in the first place.
-
-The fixtures are not auto-runnable yet (no harness lives in this repo). They're written for two uses:
-
-1. **Manual rehearsal** — a human (or another agent) can read the prompt, simulate the response the skill should produce, and check it against the `expected_behavior` field.
-2. **Regression checkpoint when a runner exists** — when an eval harness is added in this repo, these prompts plug in directly: each YAML doc becomes one case, the `expected_behavior` field becomes the grader rubric.
-
-When you change `SKILL.md`, walk these fixtures and confirm each one still produces the expected behavior. If a fixture starts failing, decide whether the skill regressed or the fixture itself needs updating.
-
----
-
-## Fixtures
-
-| Fixture                         | PRD source quote                                                                                                         | What it exercises                                                                              |
-| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------- |
-| `pelando-plus-2-others.yaml`    | Pelando — _"+2 others"_ (results too noisy for the user to triage which results to act on)                               | Decision tree spine + per-metric polarity; ship/iterate verdict against multi-variant noise.   |
-| `confetti-8-metrics.yaml`       | Confetti — _"8 metrics for new visitors"_ (many primaries; user wants segment-of-interest selection on new vs returning) | Segment-of-interest selection; multiple-testing correction warning; per-metric interpretation. |
-| `polarsteps-no-workaround.yaml` | Polarsteps — _"no documented workaround"_ (user wants to understand SRM failure with no canned path forward)             | Health-check interpretation; Kohavi framing; ordered-causes recommendation.                    |
-
-Each YAML doc has the same shape:
-
-```yaml
-name: <slug>
-prd_source: <one-line attribution>
-trigger_phrase: <what the user types>
-get_experiment_summary: <key fields the skill would see; not full response — just enough for the eval>
-expected_behavior:
-  verdict: <SHIP | ITERATE | KILL | WAIT | DO_NOT_DECIDE>
-  must_mention: [<phrases / framings the skill must cover>]
-  must_not_do: [<failure modes the skill should avoid>]
-  references_consulted: [<which reference files the skill should pull open>]
-```
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/confetti-8-metrics.yaml b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/confetti-8-metrics.yaml
deleted file mode 100644
index da61d9e..0000000
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/confetti-8-metrics.yaml
+++ /dev/null
@@ -1,48 +0,0 @@
-name: confetti-8-metrics
-prd_source: |
-  Confetti — "8 metrics for new visitors"
-  Customer is running an experiment with 8 primary-ish metrics and explicitly
-  cares about new-visitor behavior. They want a segment-driven read, not a
-  dump of 8 lifts. The skill should pre-commit to segments tied to the
-  hypothesis (new vs returning), call out the multiple-testing concern with
-  8 metrics, and produce a verdict scoped to the segment that matters.
-
-trigger_phrase: |
-  We're tracking 8 metrics on this onboarding redesign experiment and I really
-  care about how new visitors respond. Can you read this and tell me whether
-  it's a ship for the new-user audience?
-
-get_experiment_summary:
-  hypothesis: |
-    If we redesign the first-session onboarding flow, then activation rate
-    among NEW visitors will increase by ≥5% relative, because reducing
-    cold-start friction shortens time-to-first-value.
-  settings:
-    controlKey: "control"
-    multipleTestingCorrection: "off" # mis-configured given 8 primaries
-    testingModel: "sequential"
-    confidenceLevel: 0.95
-  metrics_count: 8
-  primary_metrics_summary: |
-    Of 8 primaries: 2 significant positive (polarity-correct), 1 significant
-    negative (a "Time to First Action" metric with direction=down where
-    lift is -7% — actually a WIN once polarity-applied), 5 inconclusive.
-
-expected_behavior:
-  verdict: WAIT
-  must_mention:
-    - "Pre-commit to the new-vs-returning segment because the hypothesis names new visitors as the cohort that matters"
-    - "Recommend at most 3–5 segments and call new vs returning the primary slice"
-    - "Multiple-testing correction is OFF but there are 8 primaries — flag the inflated family-wise FPR explicitly (rough order: with 8 primaries × 1 variant at α=0.05, family-wise FPR is high enough to make a single significant result inconclusive on its own)"
-    - "Apply polarity recipe per metric — flag the Time to First Action 'negative bucket' as a win because direction=down"
-    - "Without correction enabled, don't ship on a single significant primary; either enable correction and re-analyze or look at the aggregate of all 8"
-    - "Verdict is WAIT (re-analyze with multiple-testing correction enabled, segmented to new visitors) — not SHIP"
-  must_not_do:
-    - "Slice by every available property after the fact (the fishing-expedition warning)"
-    - "Treat the 'Time to First Action' metric in the negative bucket as a loss without polarity-correcting"
-    - "Call the experiment a ship because 2 of 8 primaries are significant positive"
-    - "Pretend the agent can compute the corrected p-values itself — instead, recommend re-running with multipleTestingCorrection enabled"
-  references_consulted:
-    - segment-of-interest-selection.md
-    - per-metric-interpretation.md
-    - health-check-interpretation.md # for the misconfig flag
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/pelando-plus-2-others.yaml b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/pelando-plus-2-others.yaml
deleted file mode 100644
index f634236..0000000
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/pelando-plus-2-others.yaml
+++ /dev/null
@@ -1,79 +0,0 @@
-name: pelando-plus-2-others
-prd_source: |
-  Pelando — "+2 others"
-  Customer reported that when a multi-variant test concludes with a winner banner
-  plus a small-print "+2 others", they cannot tell which non-winner variants are
-  benign vs which contain a guardrail regression they need to act on. The skill
-  should pivot the summary per variant, polarity-correct each, and call out the
-  losers, not gloss over them.
-
-trigger_phrase: |
-  Can you make sense of this experiment for me? The UI shows treatment_a winning
-  on the primary plus "+2 others" but I have no idea whether treatment_b or
-  treatment_c are okay to ignore.
-
-get_experiment_summary:
-  settings:
-    controlKey: "control"
-    multipleTestingCorrection: "benjamini-hochberg"
-    testingModel: "sequential"
-  metrics:
-    - id: m_primary
-      type: primary
-      direction: up
-      name: "Activation Rate"
-    - id: m_guardrail_latency
-      type: guardrail
-      direction: down
-      name: "p95 Latency (ms)"
-    - id: m_guardrail_errors
-      type: guardrail
-      direction: down
-      name: "Error Rate"
-  live_exposures:
-    control: 41123
-    treatment_a: 40987
-    treatment_b: 41210
-    treatment_c: 40755
-  live_srm_analysis:
-    # platform-flagged passing
-    p_value: 0.42
-  summary:
-    positive:
-      - {
-          metricId: m_primary,
-          variant: treatment_a,
-          lift: 0.041,
-          liftConfidence: 0.95,
-        }
-      - {
-          metricId: m_guardrail_latency,
-          variant: treatment_b,
-          lift: 0.08,
-          liftConfidence: 0.95,
-        }
-    negative:
-      - {
-          metricId: m_primary,
-          variant: treatment_c,
-          lift: -0.022,
-          liftConfidence: 0.95,
-        }
-    no:
-      - { metricId: m_primary, variant: treatment_b, lift: 0.004 }
-
-expected_behavior:
-  verdict: ITERATE
-  must_mention:
-    - "Pivot the summary by variant before declaring a winner"
-    - "treatment_a wins on the primary but treatment_b shows a latency regression once polarity is applied (direction=down + lift +8% = bad)"
-    - "treatment_c regresses the primary"
-    - "Multi-variant verdict requires each treatment to be judged independently against control"
-    - "Recommend iterate, not ship — at minimum, do not ship treatment_b, and investigate treatment_c before re-running"
-  must_not_do:
-    - "Quietly drop treatment_b and treatment_c into '+2 others' without polarity-checking each"
-    - "Trust the bucket name (positive/negative) as the business verdict"
-    - "Re-apply multiple-testing correction on top of the platform's benjamini-hochberg"
-  references_consulted:
-    - per-metric-interpretation.md
-    - get-experiment-fields.md
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/polarsteps-no-workaround.yaml b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/polarsteps-no-workaround.yaml
deleted file mode 100644
index 325a3bf..0000000
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/polarsteps-no-workaround.yaml
+++ /dev/null
@@ -1,61 +0,0 @@
-name: polarsteps-no-workaround
-prd_source: |
-  Polarsteps — "no documented workaround"
-  Customer's experiment is failing SRM and they cannot find a documented path
-  forward. The skill should consume the platform's SRM verdict (not recompute
-  chi-square), cite Kohavi ("SRM is the #1 trustworthiness check"), and
-  surface ordered likely causes plus a specific recommended action — not
-  punt with "investigate further."
-
-trigger_phrase: |
-  My experiment is failing SRM and the result lift looks too good to be true
-  (+18% on the primary). The docs just say "investigate" — what does that
-  actually mean here? Should I trust the lift?
-
-get_experiment_summary:
-  settings:
-    controlKey: "control"
-    srm:
-      enabled: true
-      targetAllocations: { control: 50, treatment: 50 }
-    excludeQA: false # potentially relevant
-  live_exposures:
-    control: 18250
-    treatment: 22980
-  live_srm_analysis:
-    # platform-flagged FAILING
-    p_value: 0.00002
-    chi_square: 18.4
-  summary:
-    positive:
-      - {
-          metricId: m_primary,
-          variant: treatment,
-          lift: 0.18,
-          liftConfidence: 0.95,
-        }
-  metrics:
-    - id: m_primary
-      type: primary
-      direction: up
-      name: "Trip Plan Created"
-
-expected_behavior:
-  verdict: DO_NOT_DECIDE
-  must_mention:
-    - "SRM is failing per the platform's verdict — do NOT trust the +18% lift"
-    - "Cite Kohavi: SRM is the #1 trustworthiness check; when SRM is failing, lift, p-values, and confidence intervals cannot be attributed to the treatment"
-    - "Twyman's Law: a +18% lift on a failing-SRM experiment is more likely a bucketing bug than a genuine win"
-    - "Likely causes ordered most → least likely: bucketing_bug, biased_assignment, bot_traffic, exposure_tracking_bug, ramp_up_timing"
-    - "Recommended action: pause_and_investigate — pause before drawing conclusions; randomization assumption is broken"
-    - "Concrete next steps: compare live_exposures to targetAllocations; check feature-flag rules and history via Get-Feature-Flag; Run-Query $experiment_started by variant; enable settings.excludeQA before relaunch given it is currently off"
-    - "Do NOT recompute the SRM chi-square — consume the platform's verdict"
-    - "Restart with fixed bucketing once the cause is found; do NOT re-conclude on the broken data"
-  must_not_do:
-    - "Calculate the chi-square or re-derive an SRM p-value threshold"
-    - "Recommend shipping or treating the +18% lift as real"
-    - "Hand the user a generic 'investigate further' without ordered causes and an action"
-    - "Skip Kohavi framing — it's the whole reason this check is the #1 gate"
-  references_consulted:
-    - health-check-interpretation.md
-    - get-experiment-fields.md
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/get-experiment-fields.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/experiment-fields.md
similarity index 84%
rename from plugins/mixpanel-mcp-in/skills/experiment-results/references/get-experiment-fields.md
rename to plugins/mixpanel-mcp-in/skills/experiment-results/references/experiment-fields.md
index efaeae5..1e65de1 100644
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/get-experiment-fields.md
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/experiment-fields.md
@@ -1,6 +1,6 @@
-# `Get-Experiment` Field Map
+# Experiment-Details Field Map
 
-Quick reference for which `Get-Experiment` response field drives each interpretation. Always call with `compute_exposures=true, compute_metrics=true`.
+Quick reference for which experiment-details response field drives each interpretation. Always request the details with `compute_exposures=true, compute_metrics=true`.
 
 This reference is **read-only domain knowledge** for the agent. It does NOT define thresholds — every "fail condition" listed below is a _characterization_ of how the platform itself already classifies the field, not a threshold this skill should re-apply.
 
@@ -122,16 +122,13 @@ For a kill, pass `success=false`.
 
 ## Lifecycle hand-off
 
+To ship/kill, update the experiment with the `decide` action and these fields:
+
 ```
-Update-Experiment(
-  experiment_id=<id>,
-  experiment={
-    "action": "decide",
-    "success": true | false,
-    "variant": "<winner_key>",      # required when success=true
-    "message": "<rationale: metrics evaluated, polarity, tradeoffs accepted>"
-  }
-)
+action     → "decide"
+success    → true | false
+variant    → "<winner_key>"      # required when success=true
+message    → "<rationale: metrics evaluated, polarity, tradeoffs accepted>"
 ```
 
 `message` is required on every `decide` call.
@@ -152,10 +149,10 @@ For _how_ to react to each of these, see [health-check-interpretation.md](health
 
 ---
 
-## When to reach for sibling tools
+## When to reach for sibling capabilities
 
-- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the setup-side skill / `Get-Experiment-Setup-Guidance`.
-- **Raw data for triggered or segmentation analysis** → `Run-Query` on the metric with appropriate filters.
-- **Acting on the recommendation** (ship, kill, extend) → `Update-Experiment` with the appropriate action.
-- **Feature-flag rollout history** for SRM root cause → `Get-Feature-Flag`.
-- **Session replays** for behavioral explanation of a quantitative result → the replay-fetch tool (see [session-replay-analysis.md](session-replay-analysis.md)).
+- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the `experiment-setup` skill.
+- **Raw data for triggered or segmentation analysis** → run a query on the metric with appropriate filters.
+- **Acting on the recommendation** (ship, kill, extend) → update the experiment with the appropriate action.
+- **Feature-flag rollout history** for SRM root cause → inspect the linked flag's rollout rules and history.
+- **Session replays** for behavioral explanation of a quantitative result → see [session-replay-analysis.md](session-replay-analysis.md).
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md
index 4471219..9ec66df 100644
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md
@@ -44,9 +44,9 @@ Users were assigned to variants in proportions that disagree with the configured
 ### Investigation checklist
 
 1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented?
-2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Use `Get-Feature-Flag` to inspect rollout rules and history.
+2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history.
 3. For multi-variant tests, the platform's SRM threshold is Bonferroni-corrected — the effective per-variant threshold may be tighter than the headline. Trust the bucket flag, not raw p-value math.
-4. Verify SDK version and bucketing logic. `Run-Query` for `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly.
+4. Verify SDK version and bucketing logic. Query `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly.
 5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it.
 6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting.
 7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.**
@@ -68,7 +68,7 @@ The same statistical comparison run on the **pre-exposure** period revealed that
 
 1. Identify which metric × variant pair triggered the failure (after the platform's correction).
 2. Check whether bucketing was deterministic — non-deterministic assignment in the pre-period means users were assigned to different variants than they would have been in production.
-3. Look for cohort skew: did one variant disproportionately receive heavy users? `Run-Query` on the metric pre-experiment grouped by variant to confirm.
+3. Look for cohort skew: did one variant disproportionately receive heavy users? Query the metric pre-experiment grouped by variant to confirm.
 4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort.
 5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing.
 
@@ -81,9 +81,9 @@ The same statistical comparison run on the **pre-exposure** period revealed that
 ### Investigation checklist
 
 1. Check `live_exposures` totals — which variant is undersampled?
-2. Inspect feature-flag rollout: `Get-Feature-Flag` → was rollout dialed back?
-3. `Run-Query` for the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?).
-4. If the experiment is still ACTIVE: extend duration via `Update-Experiment` with `endAfterDays`.
+2. Inspect feature-flag rollout — was rollout dialed back?
+3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?).
+4. If the experiment is still ACTIVE: extend duration via an experiment update with a new `endAfterDays`.
 5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math.
 
 If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question.
@@ -115,7 +115,7 @@ A frequentist test that ends before reaching its configured target has an **infl
 
 ### Investigation checklist
 
-1. Try `Get-Experiment` again — transient backend load may resolve. Wait ~30s between retries.
+1. Retry the experiment-details request — transient backend load may resolve. Wait ~30s between retries.
 2. If repeated failures: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget.
 3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision.
 4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation.
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md
index 3b44385..1e8678c 100644
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md
@@ -2,7 +2,7 @@
 
 Open this when the user wants you to translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_
 
-**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from `Get-Experiment`. Then translate.
+**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from the experiment-details response. Then translate.
 
 ---
 
@@ -88,7 +88,7 @@ Statistical significance ≠ business impact. Always convert a win into absolute
 
 Common — happens whenever live computation timed out or `results_cache.metrics` was nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.**
 
-Call `Run-Query` on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation:
+Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation:
 
 - `unique` (Bernoulli) → conversion **rate** as the baseline.
 - `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. Multiplying lift by a raw total double-counts cohort size.
@@ -165,7 +165,7 @@ Check `settings.testingModel`:
 - `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration.
 - `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended.
 
-Calling `Update-Experiment(action="conclude")` on a Frequentist experiment that hasn't reached its target is a peeking event. Flag it in the verdict.
+Concluding a Frequentist experiment before it reaches its target is a peeking event. Flag it in the verdict.
 
 ---
 
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md
index 6877d2a..fcf9cfd 100644
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md
@@ -2,7 +2,7 @@
 
 Open this when the user has per-segment results in hand and wants to read them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
 
-> **Platform support status.** Reading segment-level experiment results in `Get-Experiment` depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment `Run-Query` calls against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If `Get-Experiment` doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the `Run-Query` fallback — do not invent per-segment significance verdicts.
+> **Platform support status.** Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.
 
 ---
 
@@ -80,7 +80,7 @@ This is the everyday case of mixed effects.
 ## What NOT to do
 
 - ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition.
-- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment `Run-Query` fallback — they're not corrected unless the platform did it.
+- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment query fallback — they're not corrected unless the platform did it.
 - ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal.
 - ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism.
 - ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence).
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md
index 88640f4..b758b8e 100644
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md
@@ -2,7 +2,7 @@
 
 Open this when the user wants to use session replays to explain a quantitative experiment result — _"why is conversion down in treatment?"_, _"what are users actually doing in the treatment?"_, _"can replays explain the regression?"_. The goal is to turn a number into a behavior story.
 
-> **Tool boundary.** This skill provides the _interpretation_ guidance for replay analysis. The actual replay-fetching tool — pulling replay IDs for control vs treatment cohorts — lives on the platform side (a separate fetch tool exposed alongside `Get-Experiment`, when available). If the fetch tool isn't yet available, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss.
+> **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss.
 
 ---
 
@@ -86,7 +86,7 @@ Replay analysis is qualitative. Be honest about that.
 - ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in `live_metrics`."_
 - ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict.
 
-Tie observations back to specific quantitative results from `Get-Experiment`. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior.
+Tie observations back to specific quantitative results from the experiment-details response. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior.
 
 ---
 
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md
index fdad2cd..142089c 100644
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md
@@ -35,7 +35,7 @@ Walk through these in order. The first one that explains the picture is usually
 - **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe.
 - Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5.
 
-If exposures are falling short of plan because traffic dropped: surface that. `Run-Query` on the exposure event with a date breakdown shows whether something changed mid-experiment.
+If exposures are falling short of plan because traffic dropped: surface that. Querying the exposure event with a date breakdown shows whether something changed mid-experiment.
 
 ### 2. Observed effect is smaller than the MDE
 
@@ -71,8 +71,8 @@ Never change traffic allocation mid-Frequentist test — it invalidates the SRM
 
 **What to look at**: the exposure tracking method (`$experiment_started` event volume), any audience filters on the backing feature flag, and `settings.excludeQA`.
 
-- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." `Get-Feature-Flag` reveals the rollout rules; `Run-Query` on `$experiment_started` confirms how many users actually got exposed.
-- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with `Run-Query`.
+- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query `$experiment_started` to confirm how many users actually got exposed.
+- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event.
 - `settings.excludeQA` was off and you suspect internal traffic is dominating one variant → enable it on the next run (results then are cleaner but also smaller).
 
 **Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md).
@@ -93,7 +93,7 @@ Once you know which reason fits, the recommendation almost picks itself.
 | Exposure config is filtering           | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample.               |
 | Experiment finished, well-powered      | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters.       |
 
-When recommending EXTEND on an active experiment, the call is `Update-Experiment` with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the setup-side skill for the power math.
+When recommending EXTEND on an active experiment, the action is an experiment update with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the `experiment-setup` skill for the power math.
 
 ---
 
@@ -110,6 +110,6 @@ When recommending EXTEND on an active experiment, the call is `Update-Experiment
 ## Output shape
 
 1. **The reason** (one of the five above), in one sentence.
-2. **The evidence from `Get-Experiment`** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.).
-3. **Recommendation** from the table above, with the specific `Update-Experiment` call or follow-up action.
+2. **The evidence from the experiment-details response** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.).
+3. **Recommendation** from the table above, with the specific experiment update or follow-up action.
 4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment.
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md
index 4e344d3..0164c56 100644
--- a/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md
+++ b/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: experiment-results
-description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts that `Get-Experiment` returns — never recomputes thresholds.
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds.
 license: Apache-2.0
 ---
 
@@ -10,8 +10,8 @@ You are helping a user read, interpret, or make a ship/iterate/kill/wait decisio
 
 ## Requirements
 
-- Access to Mixpanel via the MCP server (specifically the `Get-Experiment` tool — and, for ship/kill decisions, `Update-Experiment`).
-- This skill reads the verdicts that `Get-Experiment` already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values.
+- Access to Mixpanel (read experiment details and metrics; update experiment lifecycle for ship/kill decisions).
+- This skill reads the verdicts the platform's experiment-details response already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values.
 
 ## When to use this skill
 
@@ -24,13 +24,13 @@ Trigger when the user asks anything about reading an experiment's results or its
 - "What does this Retro A/A failure mean?"
 - "Can you compare the session replays for control vs treatment?"
 
-Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the setup-side skill or tool.
+Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the `experiment-setup` skill.
 
 ---
 
-## How to read `Get-Experiment` output
+## How to read experiment-details output
 
-Always call `Get-Experiment` with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.**
+Always request experiment details with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.**
 
 | Concept                      | Live (preferred)                  | Cached fallback                             |
 | ---------------------------- | --------------------------------- | ------------------------------------------- |
@@ -44,7 +44,7 @@ If `live_results_errors` is non-null, the live path failed. Use the cache, cavea
 
 If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect."
 
-See [references/get-experiment-fields.md](references/get-experiment-fields.md) for the full field map and which fields drive each step below.
+See [references/experiment-fields.md](references/experiment-fields.md) for the full field map and which fields drive each step below.
 
 ---
 
@@ -111,7 +111,7 @@ Read these fields. Treat the platform's verdict as authoritative — do not reap
 | Retro A/A (pre-exp bias) | `settings.preExperimentBias` enabled, then the analysis                                                | Platform flags a significant pre-period difference.                                                                                                            |
 | Minimum elapsed time     | `end_date - start_date`                                                                                | Less than ~3 days regardless of sample size — interpretation is unreliable.                                                                                    |
 | Ran for planned duration | `start_date`, `end_date`, `settings.endAfterDays`/`sampleSize`/`endCondition`, `settings.testingModel` | Frequentist: ended before reaching configured target = peeking. Sequential: early stop on significance is allowed.                                             |
-| Misconfiguration         | See [references/get-experiment-fields.md](references/get-experiment-fields.md) §Misconfig              | Any flagged misconfig invalidates analysis.                                                                                                                    |
+| Misconfiguration         | See [references/experiment-fields.md](references/experiment-fields.md) §Misconfig                      | Any flagged misconfig invalidates analysis.                                                                                                                    |
 
 If any of these fail, **stop**. Tell the user explicitly that results are not trustworthy. Open [references/health-check-interpretation.md](references/health-check-interpretation.md) for the per-failure root-cause checklists, recommended actions, and the Kohavi framing ("SRM is the #1 trustworthiness check; Twyman's Law: any unusually clean result is more likely a bug than a discovery").
 
@@ -163,13 +163,13 @@ A 5% lift on a 20% baseline metric serving 1M users/week is enormous. A 5% lift
 
 **Twyman's Law check**: before celebrating any lift > ~30%, ask: did the treatment change who is _exposed_ to this metric, not just how they behave? See the changed-denominator notes in [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
 
-If `value` or `sampleSize` is `null` (common when live computation timed out), call `Run-Query` on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total.
+If `value` or `sampleSize` is `null` (common when live computation timed out), run a query on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total.
 
 ### Step 5 — Verdict
 
 | Situation                                                              | Recommendation                                                                                                                                               |
 | ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** `Update-Experiment(action="decide", success=true, variant=<winner>, message=<rationale>)`                                                          |
+| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Use the experiment's `decide` action with `success=true`, `variant=<winner>`, and a `message` rationale.                                           |
 | Trust ✓, primary polarity positive, guardrail polarity negative        | **ITERATE.** Investigate the regression; do not auto-ship.                                                                                                   |
 | Trust ✓, primary polarity neutral after target sample reached          | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md).                                  |
 | Trust ✓, target sample/duration not yet reached                        | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)).                                         |
@@ -200,7 +200,7 @@ Once the spine is clear, the user often asks one of these follow-ups. Open the r
 | "Which segments should I break this down on?"                                   | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md)       |
 | "What does this segment-by-segment result mean?" (when platform support exists) | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) |
 | "Can session replays help explain this result?"                                 | [references/session-replay-analysis.md](references/session-replay-analysis.md)                   |
-| "Which `Get-Experiment` field has X?"                                           | [references/get-experiment-fields.md](references/get-experiment-fields.md)                       |
+| "Which field in the experiment-details response has X?"                         | [references/experiment-fields.md](references/experiment-fields.md)                               |
 
 ---
 
@@ -212,9 +212,9 @@ Default to this shape unless the user asks for something else:
 2. **Why**, walking through the decision tree steps that mattered (skip the steps that were clearly fine).
 3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, with the polarity-corrected reading of each. Include the absolute-impact translation for any win.
 4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, etc.
-5. **Suggested next action** — the `Update-Experiment` call to make, or the deeper investigation to run.
+5. **Suggested next action** — the experiment-decide action to take, or the deeper investigation to run.
 
-If `Get-Experiment` is unavailable or returns errors, say so — do not invent a verdict.
+If the experiment-details request is unavailable or returns errors, say so; do not invent a verdict.
 
 ---
 
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/evals/README.md b/plugins/mixpanel-mcp/skills/experiment-results/evals/README.md
deleted file mode 100644
index 71278d6..0000000
--- a/plugins/mixpanel-mcp/skills/experiment-results/evals/README.md
+++ /dev/null
@@ -1,34 +0,0 @@
-# Eval fixtures — `experiment-results`
-
-Each fixture is a self-contained prompt + expected-behavior pair for the `experiment-results` skill. They are seeded from PRD customer quotes — the customer pain that motivated this skill in the first place.
-
-The fixtures are not auto-runnable yet (no harness lives in this repo). They're written for two uses:
-
-1. **Manual rehearsal** — a human (or another agent) can read the prompt, simulate the response the skill should produce, and check it against the `expected_behavior` field.
-2. **Regression checkpoint when a runner exists** — when an eval harness is added in this repo, these prompts plug in directly: each YAML doc becomes one case, the `expected_behavior` field becomes the grader rubric.
-
-When you change `SKILL.md`, walk these fixtures and confirm each one still produces the expected behavior. If a fixture starts failing, decide whether the skill regressed or the fixture itself needs updating.
-
----
-
-## Fixtures
-
-| Fixture                         | PRD source quote                                                                                                         | What it exercises                                                                              |
-| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------- |
-| `pelando-plus-2-others.yaml`    | Pelando — _"+2 others"_ (results too noisy for the user to triage which results to act on)                               | Decision tree spine + per-metric polarity; ship/iterate verdict against multi-variant noise.   |
-| `confetti-8-metrics.yaml`       | Confetti — _"8 metrics for new visitors"_ (many primaries; user wants segment-of-interest selection on new vs returning) | Segment-of-interest selection; multiple-testing correction warning; per-metric interpretation. |
-| `polarsteps-no-workaround.yaml` | Polarsteps — _"no documented workaround"_ (user wants to understand SRM failure with no canned path forward)             | Health-check interpretation; Kohavi framing; ordered-causes recommendation.                    |
-
-Each YAML doc has the same shape:
-
-```yaml
-name: <slug>
-prd_source: <one-line attribution>
-trigger_phrase: <what the user types>
-get_experiment_summary: <key fields the skill would see; not full response — just enough for the eval>
-expected_behavior:
-  verdict: <SHIP | ITERATE | KILL | WAIT | DO_NOT_DECIDE>
-  must_mention: [<phrases / framings the skill must cover>]
-  must_not_do: [<failure modes the skill should avoid>]
-  references_consulted: [<which reference files the skill should pull open>]
-```
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/evals/confetti-8-metrics.yaml b/plugins/mixpanel-mcp/skills/experiment-results/evals/confetti-8-metrics.yaml
deleted file mode 100644
index da61d9e..0000000
--- a/plugins/mixpanel-mcp/skills/experiment-results/evals/confetti-8-metrics.yaml
+++ /dev/null
@@ -1,48 +0,0 @@
-name: confetti-8-metrics
-prd_source: |
-  Confetti — "8 metrics for new visitors"
-  Customer is running an experiment with 8 primary-ish metrics and explicitly
-  cares about new-visitor behavior. They want a segment-driven read, not a
-  dump of 8 lifts. The skill should pre-commit to segments tied to the
-  hypothesis (new vs returning), call out the multiple-testing concern with
-  8 metrics, and produce a verdict scoped to the segment that matters.
-
-trigger_phrase: |
-  We're tracking 8 metrics on this onboarding redesign experiment and I really
-  care about how new visitors respond. Can you read this and tell me whether
-  it's a ship for the new-user audience?
-
-get_experiment_summary:
-  hypothesis: |
-    If we redesign the first-session onboarding flow, then activation rate
-    among NEW visitors will increase by ≥5% relative, because reducing
-    cold-start friction shortens time-to-first-value.
-  settings:
-    controlKey: "control"
-    multipleTestingCorrection: "off" # mis-configured given 8 primaries
-    testingModel: "sequential"
-    confidenceLevel: 0.95
-  metrics_count: 8
-  primary_metrics_summary: |
-    Of 8 primaries: 2 significant positive (polarity-correct), 1 significant
-    negative (a "Time to First Action" metric with direction=down where
-    lift is -7% — actually a WIN once polarity-applied), 5 inconclusive.
-
-expected_behavior:
-  verdict: WAIT
-  must_mention:
-    - "Pre-commit to the new-vs-returning segment because the hypothesis names new visitors as the cohort that matters"
-    - "Recommend at most 3–5 segments and call new vs returning the primary slice"
-    - "Multiple-testing correction is OFF but there are 8 primaries — flag the inflated family-wise FPR explicitly (rough order: with 8 primaries × 1 variant at α=0.05, family-wise FPR is high enough to make a single significant result inconclusive on its own)"
-    - "Apply polarity recipe per metric — flag the Time to First Action 'negative bucket' as a win because direction=down"
-    - "Without correction enabled, don't ship on a single significant primary; either enable correction and re-analyze or look at the aggregate of all 8"
-    - "Verdict is WAIT (re-analyze with multiple-testing correction enabled, segmented to new visitors) — not SHIP"
-  must_not_do:
-    - "Slice by every available property after the fact (the fishing-expedition warning)"
-    - "Treat the 'Time to First Action' metric in the negative bucket as a loss without polarity-correcting"
-    - "Call the experiment a ship because 2 of 8 primaries are significant positive"
-    - "Pretend the agent can compute the corrected p-values itself — instead, recommend re-running with multipleTestingCorrection enabled"
-  references_consulted:
-    - segment-of-interest-selection.md
-    - per-metric-interpretation.md
-    - health-check-interpretation.md # for the misconfig flag
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/evals/pelando-plus-2-others.yaml b/plugins/mixpanel-mcp/skills/experiment-results/evals/pelando-plus-2-others.yaml
deleted file mode 100644
index f634236..0000000
--- a/plugins/mixpanel-mcp/skills/experiment-results/evals/pelando-plus-2-others.yaml
+++ /dev/null
@@ -1,79 +0,0 @@
-name: pelando-plus-2-others
-prd_source: |
-  Pelando — "+2 others"
-  Customer reported that when a multi-variant test concludes with a winner banner
-  plus a small-print "+2 others", they cannot tell which non-winner variants are
-  benign vs which contain a guardrail regression they need to act on. The skill
-  should pivot the summary per variant, polarity-correct each, and call out the
-  losers, not gloss over them.
-
-trigger_phrase: |
-  Can you make sense of this experiment for me? The UI shows treatment_a winning
-  on the primary plus "+2 others" but I have no idea whether treatment_b or
-  treatment_c are okay to ignore.
-
-get_experiment_summary:
-  settings:
-    controlKey: "control"
-    multipleTestingCorrection: "benjamini-hochberg"
-    testingModel: "sequential"
-  metrics:
-    - id: m_primary
-      type: primary
-      direction: up
-      name: "Activation Rate"
-    - id: m_guardrail_latency
-      type: guardrail
-      direction: down
-      name: "p95 Latency (ms)"
-    - id: m_guardrail_errors
-      type: guardrail
-      direction: down
-      name: "Error Rate"
-  live_exposures:
-    control: 41123
-    treatment_a: 40987
-    treatment_b: 41210
-    treatment_c: 40755
-  live_srm_analysis:
-    # platform-flagged passing
-    p_value: 0.42
-  summary:
-    positive:
-      - {
-          metricId: m_primary,
-          variant: treatment_a,
-          lift: 0.041,
-          liftConfidence: 0.95,
-        }
-      - {
-          metricId: m_guardrail_latency,
-          variant: treatment_b,
-          lift: 0.08,
-          liftConfidence: 0.95,
-        }
-    negative:
-      - {
-          metricId: m_primary,
-          variant: treatment_c,
-          lift: -0.022,
-          liftConfidence: 0.95,
-        }
-    no:
-      - { metricId: m_primary, variant: treatment_b, lift: 0.004 }
-
-expected_behavior:
-  verdict: ITERATE
-  must_mention:
-    - "Pivot the summary by variant before declaring a winner"
-    - "treatment_a wins on the primary but treatment_b shows a latency regression once polarity is applied (direction=down + lift +8% = bad)"
-    - "treatment_c regresses the primary"
-    - "Multi-variant verdict requires each treatment to be judged independently against control"
-    - "Recommend iterate, not ship — at minimum, do not ship treatment_b, and investigate treatment_c before re-running"
-  must_not_do:
-    - "Quietly drop treatment_b and treatment_c into '+2 others' without polarity-checking each"
-    - "Trust the bucket name (positive/negative) as the business verdict"
-    - "Re-apply multiple-testing correction on top of the platform's benjamini-hochberg"
-  references_consulted:
-    - per-metric-interpretation.md
-    - get-experiment-fields.md
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/evals/polarsteps-no-workaround.yaml b/plugins/mixpanel-mcp/skills/experiment-results/evals/polarsteps-no-workaround.yaml
deleted file mode 100644
index 325a3bf..0000000
--- a/plugins/mixpanel-mcp/skills/experiment-results/evals/polarsteps-no-workaround.yaml
+++ /dev/null
@@ -1,61 +0,0 @@
-name: polarsteps-no-workaround
-prd_source: |
-  Polarsteps — "no documented workaround"
-  Customer's experiment is failing SRM and they cannot find a documented path
-  forward. The skill should consume the platform's SRM verdict (not recompute
-  chi-square), cite Kohavi ("SRM is the #1 trustworthiness check"), and
-  surface ordered likely causes plus a specific recommended action — not
-  punt with "investigate further."
-
-trigger_phrase: |
-  My experiment is failing SRM and the result lift looks too good to be true
-  (+18% on the primary). The docs just say "investigate" — what does that
-  actually mean here? Should I trust the lift?
-
-get_experiment_summary:
-  settings:
-    controlKey: "control"
-    srm:
-      enabled: true
-      targetAllocations: { control: 50, treatment: 50 }
-    excludeQA: false # potentially relevant
-  live_exposures:
-    control: 18250
-    treatment: 22980
-  live_srm_analysis:
-    # platform-flagged FAILING
-    p_value: 0.00002
-    chi_square: 18.4
-  summary:
-    positive:
-      - {
-          metricId: m_primary,
-          variant: treatment,
-          lift: 0.18,
-          liftConfidence: 0.95,
-        }
-  metrics:
-    - id: m_primary
-      type: primary
-      direction: up
-      name: "Trip Plan Created"
-
-expected_behavior:
-  verdict: DO_NOT_DECIDE
-  must_mention:
-    - "SRM is failing per the platform's verdict — do NOT trust the +18% lift"
-    - "Cite Kohavi: SRM is the #1 trustworthiness check; when SRM is failing, lift, p-values, and confidence intervals cannot be attributed to the treatment"
-    - "Twyman's Law: a +18% lift on a failing-SRM experiment is more likely a bucketing bug than a genuine win"
-    - "Likely causes ordered most → least likely: bucketing_bug, biased_assignment, bot_traffic, exposure_tracking_bug, ramp_up_timing"
-    - "Recommended action: pause_and_investigate — pause before drawing conclusions; randomization assumption is broken"
-    - "Concrete next steps: compare live_exposures to targetAllocations; check feature-flag rules and history via Get-Feature-Flag; Run-Query $experiment_started by variant; enable settings.excludeQA before relaunch given it is currently off"
-    - "Do NOT recompute the SRM chi-square — consume the platform's verdict"
-    - "Restart with fixed bucketing once the cause is found; do NOT re-conclude on the broken data"
-  must_not_do:
-    - "Calculate the chi-square or re-derive an SRM p-value threshold"
-    - "Recommend shipping or treating the +18% lift as real"
-    - "Hand the user a generic 'investigate further' without ordered causes and an action"
-    - "Skip Kohavi framing — it's the whole reason this check is the #1 gate"
-  references_consulted:
-    - health-check-interpretation.md
-    - get-experiment-fields.md
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/get-experiment-fields.md b/plugins/mixpanel-mcp/skills/experiment-results/references/experiment-fields.md
similarity index 84%
rename from plugins/mixpanel-mcp/skills/experiment-results/references/get-experiment-fields.md
rename to plugins/mixpanel-mcp/skills/experiment-results/references/experiment-fields.md
index efaeae5..1e65de1 100644
--- a/plugins/mixpanel-mcp/skills/experiment-results/references/get-experiment-fields.md
+++ b/plugins/mixpanel-mcp/skills/experiment-results/references/experiment-fields.md
@@ -1,6 +1,6 @@
-# `Get-Experiment` Field Map
+# Experiment-Details Field Map
 
-Quick reference for which `Get-Experiment` response field drives each interpretation. Always call with `compute_exposures=true, compute_metrics=true`.
+Quick reference for which experiment-details response field drives each interpretation. Always request the details with `compute_exposures=true, compute_metrics=true`.
 
 This reference is **read-only domain knowledge** for the agent. It does NOT define thresholds — every "fail condition" listed below is a _characterization_ of how the platform itself already classifies the field, not a threshold this skill should re-apply.
 
@@ -122,16 +122,13 @@ For a kill, pass `success=false`.
 
 ## Lifecycle hand-off
 
+To ship or kill, update the experiment with the `decide` action and these fields:
+
 ```
-Update-Experiment(
-  experiment_id=<id>,
-  experiment={
-    "action": "decide",
-    "success": true | false,
-    "variant": "<winner_key>",      # required when success=true
-    "message": "<rationale: metrics evaluated, polarity, tradeoffs accepted>"
-  }
-)
+action     → "decide"
+success    → true | false
+variant    → "<winner_key>"      # required when success=true
+message    → "<rationale: metrics evaluated, polarity, tradeoffs accepted>"
 ```
 
 `message` is required on every `decide` call.
@@ -152,10 +149,10 @@ For _how_ to react to each of these, see [health-check-interpretation.md](health
 
 ---
 
-## When to reach for sibling tools
+## When to reach for sibling capabilities
 
-- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the setup-side skill / `Get-Experiment-Setup-Guidance`.
-- **Raw data for triggered or segmentation analysis** → `Run-Query` on the metric with appropriate filters.
-- **Acting on the recommendation** (ship, kill, extend) → `Update-Experiment` with the appropriate action.
-- **Feature-flag rollout history** for SRM root cause → `Get-Feature-Flag`.
-- **Session replays** for behavioral explanation of a quantitative result → the replay-fetch tool (see [session-replay-analysis.md](session-replay-analysis.md)).
+- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the `experiment-setup` skill.
+- **Raw data for triggered or segmentation analysis** → run a query on the metric with appropriate filters.
+- **Acting on the recommendation** (ship, kill, extend) → update the experiment with the appropriate action.
+- **Feature-flag rollout history** for SRM root cause → inspect the linked flag's state.
+- **Session replays** for behavioral explanation of a quantitative result → see [session-replay-analysis.md](session-replay-analysis.md).
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md b/plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md
index 4471219..9ec66df 100644
--- a/plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md
+++ b/plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md
@@ -44,9 +44,9 @@ Users were assigned to variants in proportions that disagree with the configured
 ### Investigation checklist
 
 1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented?
-2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Use `Get-Feature-Flag` to inspect rollout rules and history.
+2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history.
 3. For multi-variant tests, the platform's SRM threshold is Bonferroni-corrected — the effective per-variant threshold may be tighter than the headline. Trust the bucket flag, not raw p-value math.
-4. Verify SDK version and bucketing logic. `Run-Query` for `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly.
+4. Verify SDK version and bucketing logic. Query `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly.
 5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it.
 6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting.
 7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.**
@@ -68,7 +68,7 @@ The same statistical comparison run on the **pre-exposure** period revealed that
 
 1. Identify which metric × variant pair triggered the failure (after the platform's correction).
 2. Check whether bucketing was deterministic — non-deterministic assignment in the pre-period means users were assigned to different variants than they would have been in production.
-3. Look for cohort skew: did one variant disproportionately receive heavy users? `Run-Query` on the metric pre-experiment grouped by variant to confirm.
+3. Look for cohort skew: did one variant disproportionately receive heavy users? Query the metric pre-experiment grouped by variant to confirm.
 4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort.
 5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing.
 
@@ -81,9 +81,9 @@ The same statistical comparison run on the **pre-exposure** period revealed that
 ### Investigation checklist
 
 1. Check `live_exposures` totals — which variant is undersampled?
-2. Inspect feature-flag rollout: `Get-Feature-Flag` → was rollout dialed back?
-3. `Run-Query` for the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?).
-4. If the experiment is still ACTIVE: extend duration via `Update-Experiment` with `endAfterDays`.
+2. Inspect the feature flag's rollout rules and history: was the rollout dialed back?
+3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?).
+4. If the experiment is still ACTIVE: extend duration via an experiment update with an increased `endAfterDays`.
 5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math.
 
 If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question.
@@ -115,7 +115,7 @@ A frequentist test that ends before reaching its configured target has an **infl
 
 ### Investigation checklist
 
-1. Try `Get-Experiment` again — transient backend load may resolve. Wait ~30s between retries.
+1. Retry the experiment-details request — transient backend load may resolve. Wait ~30s between retries.
 2. If repeated failures: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget.
 3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision.
 4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation.
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md b/plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md
index 3b44385..1e8678c 100644
--- a/plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md
+++ b/plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md
@@ -2,7 +2,7 @@
 
 Open this when the user wants you to translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_
 
-**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from `Get-Experiment`. Then translate.
+**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from the experiment-details response. Then translate.
 
 ---
 
@@ -88,7 +88,7 @@ Statistical significance ≠ business impact. Always convert a win into absolute
 
 Common — happens whenever live computation timed out or `results_cache.metrics` was nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.**
 
-Call `Run-Query` on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation:
+Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation:
 
 - `unique` (Bernoulli) → conversion **rate** as the baseline.
 - `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. Multiplying lift by a raw total double-counts cohort size.
@@ -165,7 +165,7 @@ Check `settings.testingModel`:
 - `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration.
 - `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended.
 
-Calling `Update-Experiment(action="conclude")` on a Frequentist experiment that hasn't reached its target is a peeking event. Flag it in the verdict.
+Concluding a Frequentist experiment before it reaches its target is a peeking event. Flag it in the verdict.
 
 ---
 
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md
index 6877d2a..fcf9cfd 100644
--- a/plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md
+++ b/plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md
@@ -2,7 +2,7 @@
 
 Open this when the user has per-segment results in hand and wants to read them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
 
-> **Platform support status.** Reading segment-level experiment results in `Get-Experiment` depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment `Run-Query` calls against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If `Get-Experiment` doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the `Run-Query` fallback — do not invent per-segment significance verdicts.
+> **Platform support status.** Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.
 
 ---
 
@@ -80,7 +80,7 @@ This is the everyday case of mixed effects.
 ## What NOT to do
 
 - ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition.
-- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment `Run-Query` fallback — they're not corrected unless the platform did it.
+- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment query fallback — they're not corrected unless the platform did it.
 - ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal.
 - ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism.
 - ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence).
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md b/plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md
index 88640f4..b758b8e 100644
--- a/plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md
+++ b/plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md
@@ -2,7 +2,7 @@
 
 Open this when the user wants to use session replays to explain a quantitative experiment result — _"why is conversion down in treatment?"_, _"what are users actually doing in the treatment?"_, _"can replays explain the regression?"_. The goal is to turn a number into a behavior story.
 
-> **Tool boundary.** This skill provides the _interpretation_ guidance for replay analysis. The actual replay-fetching tool — pulling replay IDs for control vs treatment cohorts — lives on the platform side (a separate fetch tool exposed alongside `Get-Experiment`, when available). If the fetch tool isn't yet available, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss.
+> **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss.
 
 ---
 
@@ -86,7 +86,7 @@ Replay analysis is qualitative. Be honest about that.
 - ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in `live_metrics`."_
 - ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict.
 
-Tie observations back to specific quantitative results from `Get-Experiment`. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior.
+Tie observations back to specific quantitative results from the experiment-details response. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior.
 
 ---
 
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md b/plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md
index fdad2cd..142089c 100644
--- a/plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md
@@ -35,7 +35,7 @@ Walk through these in order. The first one that explains the picture is usually
 - **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe.
 - Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5.
 
-If exposures are falling short of plan because traffic dropped: surface that. `Run-Query` on the exposure event with a date breakdown shows whether something changed mid-experiment.
+If exposures are falling short of plan because traffic dropped: surface that. Querying the exposure event with a date breakdown shows whether something changed mid-experiment.
 
 ### 2. Observed effect is smaller than the MDE
 
@@ -71,8 +71,8 @@ Never change traffic allocation mid-Frequentist test — it invalidates the SRM
 
 **What to look at**: the exposure tracking method (`$experiment_started` event volume), any audience filters on the backing feature flag, and `settings.excludeQA`.
 
-- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." `Get-Feature-Flag` reveals the rollout rules; `Run-Query` on `$experiment_started` confirms how many users actually got exposed.
-- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with `Run-Query`.
+- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query `$experiment_started` to confirm how many users actually got exposed.
+- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event.
 - `settings.excludeQA` was off and you suspect internal traffic is dominating one variant → enable it on the next run (results then are cleaner but also smaller).
 
 **Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md).
@@ -93,7 +93,7 @@ Once you know which reason fits, the recommendation almost picks itself.
 | Exposure config is filtering           | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample.               |
 | Experiment finished, well-powered      | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters.       |
 
-When recommending EXTEND on an active experiment, the call is `Update-Experiment` with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the setup-side skill for the power math.
+When recommending EXTEND on an active experiment, the action is an experiment update with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the `experiment-setup` skill for the power math.
 
 ---
 
@@ -110,6 +110,6 @@ When recommending EXTEND on an active experiment, the call is `Update-Experiment
 ## Output shape
 
 1. **The reason** (one of the five above), in one sentence.
-2. **The evidence from `Get-Experiment`** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.).
-3. **Recommendation** from the table above, with the specific `Update-Experiment` call or follow-up action.
+2. **The evidence from the experiment-details response** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.).
+3. **Recommendation** from the table above, with the specific experiment update or follow-up action.
 4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment.

From 1f13db66d5a0ed0b7b33134f7c8d8dc176257674 Mon Sep 17 00:00:00 2001
From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com>
Date: Fri, 5 Jun 2026 01:26:34 +0000
Subject: [PATCH 03/11] Trim SKILL.md by deferring duplicated content to
 references
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The spine is always loaded; references are lazy. Move spine content
that duplicated reference material out:
- Drop the 47-line ASCII decision tree (numbered list reads equivalently)
- Replace the Step 1 trustworthiness field table with a one-line gate
  (full table lives in experiment-fields.md + health-check-interpretation.md)
- Compress Step 4 baseline-lookup detail to a pointer (full procedure
  in per-metric-interpretation.md)
- Move multi-variant + decide-call shape + special variant constants
  to experiment-fields.md §Lifecycle hand-off (where they already are)
- Drop the 16-line common-pitfalls cheat sheet (each pitfall is
  covered in the relevant reference)

SKILL.md: 236 → 110 lines. Decision-tree spine and verdict table
preserved; polarity recipe stays inline since it's load-bearing for
every step. All references still linked from the spine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../skills/experiment-results/SKILL.md        | 178 +++---------------
 .../skills/experiment-results/SKILL.md        | 178 +++---------------
 .../skills/experiment-results/SKILL.md        | 178 +++---------------
 3 files changed, 78 insertions(+), 456 deletions(-)
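
Note on the retained polarity recipe: the trimmed spine keeps the recipe inline and reduces Step 4 to `baseline_value × lift`. A minimal sketch of both follows, assuming `lift` is a relative fraction (0.05 for +5%) and that `direction` has already been read from `metric.direction`. The function names `business_polarity` and `absolute_lift` are hypothetical and are not part of the skill or the platform API.

```python
from typing import Optional


def business_polarity(lift: Optional[float], direction: str = "up") -> str:
    """Translate sign-of-lift into a business verdict using metric.direction."""
    # A None lift maps to "neutral" per the recipe; callers should still
    # surface it separately, since it means the computation failed.
    if lift is None or lift == 0:
        return "neutral"
    if direction == "down":
        return "positive" if lift < 0 else "negative"
    # "up" is the documented default when direction is unset.
    return "positive" if lift > 0 else "negative"


def absolute_lift(baseline_value: float, lift: float) -> float:
    """Step 4: convert a relative lift into absolute terms."""
    return baseline_value * lift
```

For example, an "errors" guardrail with `direction: "down"` and a lift of +5% classifies as `"negative"` (a regression) even though it lands in `summary.positive`, and a 5% lift on a 20% baseline is roughly one percentage point in absolute terms.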

diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md
index 0164c56..44f7254 100644
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md
@@ -6,12 +6,12 @@ license: Apache-2.0
 
 # Experiment Results Interpretation
 
-You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. **Read the Decision Tree first** and use it as the spine of every interpretation. Drop into the deeper references only when the situation calls for it.
+You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. Use the decision tree below as the spine; open references only when a step needs depth.
 
 ## Requirements
 
 - Access to Mixpanel (read experiment details and metrics; update experiment lifecycle for ship/kill decisions).
-- This skill reads the verdicts the platform's experiment-details response already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values.
+- This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values.
 
 ## When to use this skill
 
@@ -38,134 +38,36 @@ Always request experiment details with `compute_exposures=true, compute_metrics=
 | SRM check                    | `live_srm_analysis`               | `exposures_cache.$srm_analysis`             |
 | Per-metric per-variant stats | `live_metrics[metricId][variant]` | `results_cache.metrics[metricId][variant]`  |
 | Bucketed summary             | recompute from `live_metrics`     | `results_cache.summary`                     |
-| When was this computed?      | "now"                             | `exposures_cache.$last_computed`            |
 
-If `live_results_errors` is non-null, the live path failed. Use the cache, caveat that data is stale, and surface the error to the user — the underlying failure may need fixing before any decision.
+If `live_results_errors` is non-null, use the cache, caveat that data is stale, and surface the error — the underlying failure may need fixing before any decision. If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect."
 
-If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect."
-
-See [references/experiment-fields.md](references/experiment-fields.md) for the full field map and which fields drive each step below.
+The full field map is in [references/experiment-fields.md](references/experiment-fields.md).
 
 ---
 
-## The Decision Tree
-
-This is the spine of every interpretation. Run the steps **in order**. **Stop at the first failure** — do not proceed to step N+1 if step N flags a problem.
-
-```
-┌─ Step 1: TRUSTWORTHINESS GATE ───────────────┐
-│   SRM ok? → exposures sufficient? →          │
-│   Retro A/A clean? → minimum duration met? → │
-│   no misconfig?                              │
-│        │                                     │
-│      fail → STOP. See references/            │
-│             health-check-interpretation.md   │
-└──────────────┬───────────────────────────────┘
-               ↓ pass
-┌─ Step 2: STATISTICAL SIGNIFICANCE ───────────┐
-│   For each non-control variant × primary,    │
-│   apply the polarity recipe (sign-of-lift +  │
-│   metric.direction). Significant + correct   │
-│   polarity = "win"; significant + wrong      │
-│   polarity = "loss".                         │
-│        │                                     │
-│   nothing significant on primaries →         │
-│   see references/why-no-statsig.md           │
-└──────────────┬───────────────────────────────┘
-               ↓ at least one primary win
-┌─ Step 3: GUARDRAIL CHECK ────────────────────┐
-│   Any guardrail significant in the wrong     │
-│   polarity? → regression → ITERATE not ship  │
-└──────────────┬───────────────────────────────┘
-               ↓ guardrails clean
-┌─ Step 4: PRACTICAL SIGNIFICANCE ─────────────┐
-│   Convert the lift on the primary into       │
-│   absolute terms. Is it big enough to        │
-│   matter to the business?                    │
-│   Statistically significant ≠ ships.         │
-└──────────────┬───────────────────────────────┘
-               ↓ meaningful magnitude
-┌─ Step 5: VERDICT ────────────────────────────┐
-│   Trust ✓ + primary win + guardrails ✓ +     │
-│   meaningful magnitude → SHIP                │
-│   Trust ✓ + primary win + guardrail regress  │
-│     → ITERATE                                │
-│   Trust ✓ + primary neutral after target     │
-│     → KILL or ITERATE                        │
-│   Trust ✗                                    │
-│     → DO NOT DECIDE; report failures         │
-│   Hasn't reached target sample/duration      │
-│     → WAIT (or extend, or restart with more  │
-│       power — see why-no-statsig.md)         │
-└──────────────────────────────────────────────┘
-```
-
-### Step 1 — Trustworthiness gate (consume the verdicts)
-
-Read these fields. Treat the platform's verdict as authoritative — do not reapply thresholds yourself.
-
-| Check                    | Field to read                                                                                          | What "fail" looks like                                                                                                                                         |
-| ------------------------ | ------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| SRM                      | `live_srm_analysis` (or `exposures_cache.$srm_analysis`)                                               | Platform flags as failing — do not compute the chi-square yourself.                                                                                            |
-| Sufficient exposures     | `live_exposures` per variant                                                                           | Platform-flagged "insufficient." If unflagged but per-variant counts look suspicious, route the user to the health-check reference; do not invent a threshold. |
-| Retro A/A (pre-exp bias) | `settings.preExperimentBias` enabled, then the analysis                                                | Platform flags a significant pre-period difference.                                                                                                            |
-| Minimum elapsed time     | `end_date - start_date`                                                                                | Less than ~3 days regardless of sample size — interpretation is unreliable.                                                                                    |
-| Ran for planned duration | `start_date`, `end_date`, `settings.endAfterDays`/`sampleSize`/`endCondition`, `settings.testingModel` | Frequentist: ended before reaching configured target = peeking. Sequential: early stop on significance is allowed.                                             |
-| Misconfiguration         | See [references/experiment-fields.md](references/experiment-fields.md) §Misconfig                      | Any flagged misconfig invalidates analysis.                                                                                                                    |
-
-If any of these fail, **stop**. Tell the user explicitly that results are not trustworthy. Open [references/health-check-interpretation.md](references/health-check-interpretation.md) for the per-failure root-cause checklists, recommended actions, and the Kohavi framing ("SRM is the #1 trustworthiness check; Twyman's Law: any unusually clean result is more likely a bug than a discovery").
-
-### Step 2 — Statistical significance with polarity
-
-**Critical**: `summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by whether the lift is good for the business. You MUST apply the polarity recipe using each metric's `direction` before declaring a winner.
-
-#### Polarity recipe
-
-`metric.direction` is `"up"` or `"down"` (defaults to `"up"` if unset on the source metric).
-
-- `lift is None` or `lift == 0` → **neutral**.
-- `direction == "up"` → **positive** if `lift > 0`, else **negative**.
-- `direction == "down"` → **positive** if `lift < 0`, else **negative**.
-
-A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. Never trust the bucket name as the business verdict.
-
-#### How to read the summary
-
-1. **Filter out the control row.** Use `settings.controlKey` (typically `"control"`; may be empty). Control-vs-control always has lift 0 and inflates the "no effect" count. If `controlKey` is empty, identify control by: (a) the variant literally named `"control"`, (b) the variant whose lift is uniformly 0 across all metrics, or (c) ask the user.
-2. For each non-control variant, look up the metric in `summary.positive` / `summary.negative` / `summary.no`. **Trust the bucket name as the significance signal** — the `significance` field on each item may be `null` even when the bucket is meaningful.
-3. Apply the polarity recipe using `metric.direction` to translate sign-of-lift into win/loss.
-4. If `lift is None` in a summary item, **the calculation failed** for that variant — surface it. Do not interpret as "no effect."
-
-The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is set to `"bonferroni"` or `"benjamini-hochberg"` (across primaries × non-control variants). **Don't re-correct.**
-
-Turning the per-metric numbers into a plain-language verdict (lift + CI + p-value → "small win," "large regression," "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
-
-If nothing on the primaries is significant and the user is asking "why hasn't this hit statsig?", route to [references/why-no-statsig.md](references/why-no-statsig.md).
-
-### Step 3 — Guardrail check
-
-Apply the polarity recipe to every guardrail metric (`metric.type == "guardrail"`).
-
-- A small primary win + a clear guardrail regression → usually **iterate, do not ship**.
-- "Not significant" on a guardrail does NOT mean "no regression." It means the experiment couldn't _detect_ one at the chosen confidence. If the guardrail is critical (latency, error rate, retention), flag whether it was powered to detect a meaningful regression.
-- Polarity matters here too: a guardrail named "errors" with `direction: "down"` and lift `+5%` (significant) is a regression even though it lands in `summary.positive`.
-
-### Step 4 — Practical significance
-
-Statistical significance ≠ business impact. For every primary metric that won:
-
-1. Read the **baseline value** from the control variant: `live_metrics[metricId][controlKey].value`.
-2. Read the **lift** from the winning variant's row.
-3. Compute absolute lift: `baseline_value × lift`.
-4. Project to population per period: ask the user for traffic estimates if not in context.
-
-A 5% lift on a 20% baseline metric serving 1M users/week is enormous. A 5% lift on a 0.1% baseline metric serving 1k users/week is noise. Always ground the user in absolute terms before declaring a win meaningful.
+## The decision tree
+
+Run in order. **Stop at the first failure** — do not proceed if a step flags a problem.
+
+1. **Trustworthiness gate** — SRM ok? Exposures sufficient? Retro A/A clean? Minimum duration met (~3 days)? No misconfig? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md).
+2. **Statistical significance** — apply the polarity recipe (below) to each non-control variant × primary. If nothing significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md).
+3. **Guardrail check** — any guardrail significant in the wrong polarity? Regression → ITERATE not ship.
+4. **Practical significance** — convert lift into absolute terms (`baseline_value × lift`). Statistically significant ≠ ships.
+5. **Verdict** — see table below.
+
+### Polarity recipe (load-bearing — keep in mind for every metric)
 
-**Twyman's Law check**: before celebrating any lift > ~30%, ask: did the treatment change who is _exposed_ to this metric, not just how they behave? See the changed-denominator notes in [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
+`summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by business value. `metric.direction` ("up" / "down", defaults to "up") tells you which sign is good:
 
-If `value` or `sampleSize` is `null` (common when live computation timed out), run a query on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total.
+- `lift is None` or `lift == 0` → **neutral**
+- `direction == "up"` → **positive** if `lift > 0`, else **negative**
+- `direction == "down"` → **positive** if `lift < 0`, else **negative**
 
-### Step 5 — Verdict
+A metric in `summary.positive` with `direction: "down"` is a **regression**, not a win. Filter out the control row first (use `settings.controlKey`). The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**.
+
+Per-metric phrasing (translating lift + CI + p-value into "small win" / "large regression" / "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). The same reference covers the changed-denominator check (Twyman's Law) for any lift >~30%, and how to query the baseline if `value` or `sampleSize` is `null`.
+
+### Verdict table
 
 | Situation                                                              | Recommendation                                                                                                                                               |
 | ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
@@ -175,22 +77,13 @@ If `value` or `sampleSize` is `null` (common when live computation timed out), r
 | Trust ✓, target sample/duration not yet reached                        | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)).                                         |
 | Trust ✗                                                                | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). |
 
-For **multi-variant tests**, pivot the summary by variant and evaluate each treatment independently against control. The winner is the variant with the most polarity-corrected primary wins, zero guardrail regressions, and the largest practical impact. If multiple qualify, prefer the simpler / lower-risk variant. If none qualify, recommend kill or iterate.
-
-`message` is required on every `decide` call — include the rationale, the metrics evaluated, and any tradeoffs accepted.
-
-Special variant constants when `success=true`:
-
-- `__no_variant_shipped__` — ship the change without picking a variant
-- `__defer_variant_decision__` — defer (status becomes `SUCCESS_DEFERRED` in UI)
-
-For a kill, pass `success=false`.
+For multi-variant tests, the `decide`-call shape, and special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), see [references/experiment-fields.md](references/experiment-fields.md) §Lifecycle hand-off. `message` is required on every `decide` call.
 
 ---
 
 ## Going deeper
 
-Once the spine is clear, the user often asks one of these follow-ups. Open the relevant reference on demand:
+Open the relevant reference on demand:
 
 | User asks about…                                                                | Open                                                                                             |
 | ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
@@ -215,22 +108,3 @@ Default to this shape unless the user asks for something else:
 5. **Suggested next action** — the experiment-decide action to take, or the deeper investigation to run.
 
 If experiment details are unavailable or return errors, say so — do not invent a verdict.
-
----
-
-## Common pitfalls (cheat sheet)
-
-- ⛔ **Skipping Step 1** because the lifts look exciting (Twyman's Law)
-- ⛔ **Recomputing thresholds** instead of consuming the verdicts the platform already returned
-- ⛔ **Not applying polarity** — reading `summary.positive` as "good" without checking `metric.direction`
-- ⛔ Trusting a >30% lift without checking whether the **denominator changed**
-- ⛔ **Including the control row** when counting wins/losses (filter by `settings.controlKey`)
-- ⛔ Treating a `null` lift as "no effect" — it means computation failed
-- ⛔ Treating a missing primary (in `metrics[]` but not in `live_metrics`/`results_cache.metrics`) as "no effect" — it's "no measurement"
-- ⛔ Interpreting a `< 3 day` experiment instead of refusing
-- ⛔ Forgetting to call out a **non-default `confidenceLevel`** (0.9 inflates false positives; 0.99 is conservative)
-- ⛔ Treating **secondary-metric significance** as decisional (it isn't, ever)
-- ⛔ Conflating **statistical significance** with **practical significance**
-- ⛔ Ignoring **guardrail regressions** because the primary won
-- ⛔ Calling a single significant primary with multiple-testing correction off a "win" — look at the aggregate, or enable correction
-- ⛔ Concluding "no effect" from an underpowered inconclusive result (route to [references/why-no-statsig.md](references/why-no-statsig.md))
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md
index 0164c56..44f7254 100644
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md
@@ -6,12 +6,12 @@ license: Apache-2.0
 
 # Experiment Results Interpretation
 
-You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. **Read the Decision Tree first** and use it as the spine of every interpretation. Drop into the deeper references only when the situation calls for it.
+You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. Use the decision tree below as the spine; open references only when a step needs depth.
 
 ## Requirements
 
 - Access to Mixpanel (read experiment details and metrics; update experiment lifecycle for ship/kill decisions).
-- This skill reads the verdicts the platform's experiment-details response already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values.
+- This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values.
 
 ## When to use this skill
 
@@ -38,134 +38,36 @@ Always request experiment details with `compute_exposures=true, compute_metrics=
 | SRM check                    | `live_srm_analysis`               | `exposures_cache.$srm_analysis`             |
 | Per-metric per-variant stats | `live_metrics[metricId][variant]` | `results_cache.metrics[metricId][variant]`  |
 | Bucketed summary             | recompute from `live_metrics`     | `results_cache.summary`                     |
-| When was this computed?      | "now"                             | `exposures_cache.$last_computed`            |
 
-If `live_results_errors` is non-null, the live path failed. Use the cache, caveat that data is stale, and surface the error to the user — the underlying failure may need fixing before any decision.
+If `live_results_errors` is non-null, use the cache, caveat that data is stale, and surface the error — the underlying failure may need fixing before any decision. If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect."
 
-If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect."
-
-See [references/experiment-fields.md](references/experiment-fields.md) for the full field map and which fields drive each step below.
+The full field map is in [references/experiment-fields.md](references/experiment-fields.md).
 
 ---
 
-## The Decision Tree
-
-This is the spine of every interpretation. Run the steps **in order**. **Stop at the first failure** — do not proceed to step N+1 if step N flags a problem.
-
-```
-┌─ Step 1: TRUSTWORTHINESS GATE ───────────────┐
-│   SRM ok? → exposures sufficient? →          │
-│   Retro A/A clean? → minimum duration met? → │
-│   no misconfig?                              │
-│        │                                     │
-│      fail → STOP. See references/            │
-│             health-check-interpretation.md   │
-└──────────────┬───────────────────────────────┘
-               ↓ pass
-┌─ Step 2: STATISTICAL SIGNIFICANCE ───────────┐
-│   For each non-control variant × primary,    │
-│   apply the polarity recipe (sign-of-lift +  │
-│   metric.direction). Significant + correct   │
-│   polarity = "win"; significant + wrong      │
-│   polarity = "loss".                         │
-│        │                                     │
-│   nothing significant on primaries →         │
-│   see references/why-no-statsig.md           │
-└──────────────┬───────────────────────────────┘
-               ↓ at least one primary win
-┌─ Step 3: GUARDRAIL CHECK ────────────────────┐
-│   Any guardrail significant in the wrong     │
-│   polarity? → regression → ITERATE not ship  │
-└──────────────┬───────────────────────────────┘
-               ↓ guardrails clean
-┌─ Step 4: PRACTICAL SIGNIFICANCE ─────────────┐
-│   Convert the lift on the primary into       │
-│   absolute terms. Is it big enough to        │
-│   matter to the business?                    │
-│   Statistically significant ≠ ships.         │
-└──────────────┬───────────────────────────────┘
-               ↓ meaningful magnitude
-┌─ Step 5: VERDICT ────────────────────────────┐
-│   Trust ✓ + primary win + guardrails ✓ +     │
-│   meaningful magnitude → SHIP                │
-│   Trust ✓ + primary win + guardrail regress  │
-│     → ITERATE                                │
-│   Trust ✓ + primary neutral after target     │
-│     → KILL or ITERATE                        │
-│   Trust ✗                                    │
-│     → DO NOT DECIDE; report failures         │
-│   Hasn't reached target sample/duration      │
-│     → WAIT (or extend, or restart with more  │
-│       power — see why-no-statsig.md)         │
-└──────────────────────────────────────────────┘
-```
-
-### Step 1 — Trustworthiness gate (consume the verdicts)
-
-Read these fields. Treat the platform's verdict as authoritative — do not reapply thresholds yourself.
-
-| Check                    | Field to read                                                                                          | What "fail" looks like                                                                                                                                         |
-| ------------------------ | ------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| SRM                      | `live_srm_analysis` (or `exposures_cache.$srm_analysis`)                                               | Platform flags as failing — do not compute the chi-square yourself.                                                                                            |
-| Sufficient exposures     | `live_exposures` per variant                                                                           | Platform-flagged "insufficient." If unflagged but per-variant counts look suspicious, route the user to the health-check reference; do not invent a threshold. |
-| Retro A/A (pre-exp bias) | `settings.preExperimentBias` enabled, then the analysis                                                | Platform flags a significant pre-period difference.                                                                                                            |
-| Minimum elapsed time     | `end_date - start_date`                                                                                | Less than ~3 days regardless of sample size — interpretation is unreliable.                                                                                    |
-| Ran for planned duration | `start_date`, `end_date`, `settings.endAfterDays`/`sampleSize`/`endCondition`, `settings.testingModel` | Frequentist: ended before reaching configured target = peeking. Sequential: early stop on significance is allowed.                                             |
-| Misconfiguration         | See [references/experiment-fields.md](references/experiment-fields.md) §Misconfig                      | Any flagged misconfig invalidates analysis.                                                                                                                    |
-
-If any of these fail, **stop**. Tell the user explicitly that results are not trustworthy. Open [references/health-check-interpretation.md](references/health-check-interpretation.md) for the per-failure root-cause checklists, recommended actions, and the Kohavi framing ("SRM is the #1 trustworthiness check; Twyman's Law: any unusually clean result is more likely a bug than a discovery").
-
-### Step 2 — Statistical significance with polarity
-
-**Critical**: `summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by whether the lift is good for the business. You MUST apply the polarity recipe using each metric's `direction` before declaring a winner.
-
-#### Polarity recipe
-
-`metric.direction` is `"up"` or `"down"` (defaults to `"up"` if unset on the source metric).
-
-- `lift is None` or `lift == 0` → **neutral**.
-- `direction == "up"` → **positive** if `lift > 0`, else **negative**.
-- `direction == "down"` → **positive** if `lift < 0`, else **negative**.
-
-A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. Never trust the bucket name as the business verdict.
-
-#### How to read the summary
-
-1. **Filter out the control row.** Use `settings.controlKey` (typically `"control"`; may be empty). Control-vs-control always has lift 0 and inflates the "no effect" count. If `controlKey` is empty, identify control by: (a) the variant literally named `"control"`, (b) the variant whose lift is uniformly 0 across all metrics, or (c) ask the user.
-2. For each non-control variant, look up the metric in `summary.positive` / `summary.negative` / `summary.no`. **Trust the bucket name as the significance signal** — the `significance` field on each item may be `null` even when the bucket is meaningful.
-3. Apply the polarity recipe using `metric.direction` to translate sign-of-lift into win/loss.
-4. If `lift is None` in a summary item, **the calculation failed** for that variant — surface it. Do not interpret as "no effect."
-
-The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is set to `"bonferroni"` or `"benjamini-hochberg"` (across primaries × non-control variants). **Don't re-correct.**
-
-Turning the per-metric numbers into a plain-language verdict (lift + CI + p-value → "small win," "large regression," "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
-
-If nothing on the primaries is significant and the user is asking "why hasn't this hit statsig?", route to [references/why-no-statsig.md](references/why-no-statsig.md).
-
-### Step 3 — Guardrail check
-
-Apply the polarity recipe to every guardrail metric (`metric.type == "guardrail"`).
-
-- A small primary win + a clear guardrail regression → usually **iterate, do not ship**.
-- "Not significant" on a guardrail does NOT mean "no regression." It means the experiment couldn't _detect_ one at the chosen confidence. If the guardrail is critical (latency, error rate, retention), flag whether it was powered to detect a meaningful regression.
-- Polarity matters here too: a guardrail named "errors" with `direction: "down"` and lift `+5%` (significant) is a regression even though it lands in `summary.positive`.
-
-### Step 4 — Practical significance
-
-Statistical significance ≠ business impact. For every primary metric that won:
-
-1. Read the **baseline value** from the control variant: `live_metrics[metricId][controlKey].value`.
-2. Read the **lift** from the winning variant's row.
-3. Compute absolute lift: `baseline_value × lift`.
-4. Project to population per period: ask the user for traffic estimates if not in context.
-
-A 5% lift on a 20% baseline metric serving 1M users/week is enormous. A 5% lift on a 0.1% baseline metric serving 1k users/week is noise. Always ground the user in absolute terms before declaring a win meaningful.
+## The decision tree
+
+Run in order. **Stop at the first failure** — do not proceed if a step flags a problem.
+
+1. **Trustworthiness gate** — SRM ok? Exposures sufficient? Retro A/A clean? Minimum duration met (~3 days)? No misconfig? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md).
+2. **Statistical significance** — apply the polarity recipe (below) to each non-control variant × primary. If nothing significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md).
+3. **Guardrail check** — any guardrail significant in the wrong polarity? Regression → ITERATE not ship.
+4. **Practical significance** — convert lift into absolute terms (`baseline_value × lift`). Statistically significant ≠ ships.
+5. **Verdict** — see table below.
+
+### Polarity recipe (load-bearing — keep in mind for every metric)
 
-**Twyman's Law check**: before celebrating any lift > ~30%, ask: did the treatment change who is _exposed_ to this metric, not just how they behave? See the changed-denominator notes in [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
+`summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by business value. `metric.direction` ("up" / "down", defaults to "up") tells you which sign is good:
 
-If `value` or `sampleSize` is `null` (common when live computation timed out), run a query on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total.
+- `lift is None` or `lift == 0` → **neutral**
+- `direction == "up"` → **positive** if `lift > 0`, else **negative**
+- `direction == "down"` → **positive** if `lift < 0`, else **negative**
 
-### Step 5 — Verdict
+A metric in `summary.positive` with `direction: "down"` is a **regression**, not a win. Filter out the control row first (use `settings.controlKey`). The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**.
+
+Per-metric phrasing (translating lift + CI + p-value into "small win" / "large regression" / "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). The same reference covers the changed-denominator check (Twyman's Law) for any lift >~30%, and how to query the baseline if `value` or `sampleSize` is `null`.
+
+### Verdict table
 
 | Situation                                                              | Recommendation                                                                                                                                               |
 | ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
@@ -175,22 +77,13 @@ If `value` or `sampleSize` is `null` (common when live computation timed out), r
 | Trust ✓, target sample/duration not yet reached                        | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)).                                         |
 | Trust ✗                                                                | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). |
 
-For **multi-variant tests**, pivot the summary by variant and evaluate each treatment independently against control. The winner is the variant with the most polarity-corrected primary wins, zero guardrail regressions, and the largest practical impact. If multiple qualify, prefer the simpler / lower-risk variant. If none qualify, recommend kill or iterate.
-
-`message` is required on every `decide` call — include the rationale, the metrics evaluated, and any tradeoffs accepted.
-
-Special variant constants when `success=true`:
-
-- `__no_variant_shipped__` — ship the change without picking a variant
-- `__defer_variant_decision__` — defer (status becomes `SUCCESS_DEFERRED` in UI)
-
-For a kill, pass `success=false`.
+For multi-variant tests, the `decide`-call shape, and special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), see [references/experiment-fields.md](references/experiment-fields.md) §Lifecycle hand-off. `message` is required on every `decide` call.
 
 ---
 
 ## Going deeper
 
-Once the spine is clear, the user often asks one of these follow-ups. Open the relevant reference on demand:
+Open the relevant reference on demand:
 
 | User asks about…                                                                | Open                                                                                             |
 | ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
@@ -215,22 +108,3 @@ Default to this shape unless the user asks for something else:
 5. **Suggested next action** — the experiment-decide action to take, or the deeper investigation to run.
 
 If experiment details are unavailable or return errors, say so — do not invent a verdict.
-
----
-
-## Common pitfalls (cheat sheet)
-
-- ⛔ **Skipping Step 1** because the lifts look exciting (Twyman's Law)
-- ⛔ **Recomputing thresholds** instead of consuming the verdicts the platform already returned
-- ⛔ **Not applying polarity** — reading `summary.positive` as "good" without checking `metric.direction`
-- ⛔ Trusting a >30% lift without checking whether the **denominator changed**
-- ⛔ **Including the control row** when counting wins/losses (filter by `settings.controlKey`)
-- ⛔ Treating a `null` lift as "no effect" — it means computation failed
-- ⛔ Treating a missing primary (in `metrics[]` but not in `live_metrics`/`results_cache.metrics`) as "no effect" — it's "no measurement"
-- ⛔ Interpreting a `< 3 day` experiment instead of refusing
-- ⛔ Forgetting to call out a **non-default `confidenceLevel`** (0.9 inflates false positives; 0.99 is conservative)
-- ⛔ Treating **secondary-metric significance** as decisional (it isn't, ever)
-- ⛔ Conflating **statistical significance** with **practical significance**
-- ⛔ Ignoring **guardrail regressions** because the primary won
-- ⛔ Calling a single significant primary with multiple-testing correction off a "win" — look at the aggregate, or enable correction
-- ⛔ Concluding "no effect" from an underpowered inconclusive result (route to [references/why-no-statsig.md](references/why-no-statsig.md))
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md
index 0164c56..44f7254 100644
--- a/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md
+++ b/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md
@@ -6,12 +6,12 @@ license: Apache-2.0
 
 # Experiment Results Interpretation
 
-You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. **Read the Decision Tree first** and use it as the spine of every interpretation. Drop into the deeper references only when the situation calls for it.
+You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. Use the decision tree below as the spine; open references only when a step needs depth.
 
 ## Requirements
 
 - Access to Mixpanel (read experiment details and metrics; update experiment lifecycle for ship/kill decisions).
-- This skill reads the verdicts the platform's experiment-details response already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values.
+- This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values.
 
 ## When to use this skill
 
@@ -38,134 +38,36 @@ Always request experiment details with `compute_exposures=true, compute_metrics=
 | SRM check                    | `live_srm_analysis`               | `exposures_cache.$srm_analysis`             |
 | Per-metric per-variant stats | `live_metrics[metricId][variant]` | `results_cache.metrics[metricId][variant]`  |
 | Bucketed summary             | recompute from `live_metrics`     | `results_cache.summary`                     |
-| When was this computed?      | "now"                             | `exposures_cache.$last_computed`            |
 
-If `live_results_errors` is non-null, the live path failed. Use the cache, caveat that data is stale, and surface the error to the user — the underlying failure may need fixing before any decision.
+If `live_results_errors` is non-null, use the cache, caveat that data is stale, and surface the error — the underlying failure may need fixing before any decision. If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect."
 
-If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect."
-
-See [references/experiment-fields.md](references/experiment-fields.md) for the full field map and which fields drive each step below.
+The full field map is in [references/experiment-fields.md](references/experiment-fields.md).
 
 ---
 
-## The Decision Tree
-
-This is the spine of every interpretation. Run the steps **in order**. **Stop at the first failure** — do not proceed to step N+1 if step N flags a problem.
-
-```
-┌─ Step 1: TRUSTWORTHINESS GATE ───────────────┐
-│   SRM ok? → exposures sufficient? →          │
-│   Retro A/A clean? → minimum duration met? → │
-│   no misconfig?                              │
-│        │                                     │
-│      fail → STOP. See references/            │
-│             health-check-interpretation.md   │
-└──────────────┬───────────────────────────────┘
-               ↓ pass
-┌─ Step 2: STATISTICAL SIGNIFICANCE ───────────┐
-│   For each non-control variant × primary,    │
-│   apply the polarity recipe (sign-of-lift +  │
-│   metric.direction). Significant + correct   │
-│   polarity = "win"; significant + wrong      │
-│   polarity = "loss".                         │
-│        │                                     │
-│   nothing significant on primaries →         │
-│   see references/why-no-statsig.md           │
-└──────────────┬───────────────────────────────┘
-               ↓ at least one primary win
-┌─ Step 3: GUARDRAIL CHECK ────────────────────┐
-│   Any guardrail significant in the wrong     │
-│   polarity? → regression → ITERATE not ship  │
-└──────────────┬───────────────────────────────┘
-               ↓ guardrails clean
-┌─ Step 4: PRACTICAL SIGNIFICANCE ─────────────┐
-│   Convert the lift on the primary into       │
-│   absolute terms. Is it big enough to        │
-│   matter to the business?                    │
-│   Statistically significant ≠ ships.         │
-└──────────────┬───────────────────────────────┘
-               ↓ meaningful magnitude
-┌─ Step 5: VERDICT ────────────────────────────┐
-│   Trust ✓ + primary win + guardrails ✓ +     │
-│   meaningful magnitude → SHIP                │
-│   Trust ✓ + primary win + guardrail regress  │
-│     → ITERATE                                │
-│   Trust ✓ + primary neutral after target     │
-│     → KILL or ITERATE                        │
-│   Trust ✗                                    │
-│     → DO NOT DECIDE; report failures         │
-│   Hasn't reached target sample/duration      │
-│     → WAIT (or extend, or restart with more  │
-│       power — see why-no-statsig.md)         │
-└──────────────────────────────────────────────┘
-```
-
-### Step 1 — Trustworthiness gate (consume the verdicts)
-
-Read these fields. Treat the platform's verdict as authoritative — do not reapply thresholds yourself.
-
-| Check                    | Field to read                                                                                          | What "fail" looks like                                                                                                                                         |
-| ------------------------ | ------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| SRM                      | `live_srm_analysis` (or `exposures_cache.$srm_analysis`)                                               | Platform flags as failing — do not compute the chi-square yourself.                                                                                            |
-| Sufficient exposures     | `live_exposures` per variant                                                                           | Platform-flagged "insufficient." If unflagged but per-variant counts look suspicious, route the user to the health-check reference; do not invent a threshold. |
-| Retro A/A (pre-exp bias) | `settings.preExperimentBias` enabled, then the analysis                                                | Platform flags a significant pre-period difference.                                                                                                            |
-| Minimum elapsed time     | `end_date - start_date`                                                                                | Less than ~3 days regardless of sample size — interpretation is unreliable.                                                                                    |
-| Ran for planned duration | `start_date`, `end_date`, `settings.endAfterDays`/`sampleSize`/`endCondition`, `settings.testingModel` | Frequentist: ended before reaching configured target = peeking. Sequential: early stop on significance is allowed.                                             |
-| Misconfiguration         | See [references/experiment-fields.md](references/experiment-fields.md) §Misconfig                      | Any flagged misconfig invalidates analysis.                                                                                                                    |
-
-If any of these fail, **stop**. Tell the user explicitly that results are not trustworthy. Open [references/health-check-interpretation.md](references/health-check-interpretation.md) for the per-failure root-cause checklists, recommended actions, and the Kohavi framing ("SRM is the #1 trustworthiness check; Twyman's Law: any unusually clean result is more likely a bug than a discovery").
-
-### Step 2 — Statistical significance with polarity
-
-**Critical**: `summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by whether the lift is good for the business. You MUST apply the polarity recipe using each metric's `direction` before declaring a winner.
-
-#### Polarity recipe
-
-`metric.direction` is `"up"` or `"down"` (defaults to `"up"` if unset on the source metric).
-
-- `lift is None` or `lift == 0` → **neutral**.
-- `direction == "up"` → **positive** if `lift > 0`, else **negative**.
-- `direction == "down"` → **positive** if `lift < 0`, else **negative**.
-
-A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. Never trust the bucket name as the business verdict.
-
-#### How to read the summary
-
-1. **Filter out the control row.** Use `settings.controlKey` (typically `"control"`; may be empty). Control-vs-control always has lift 0 and inflates the "no effect" count. If `controlKey` is empty, identify control by: (a) the variant literally named `"control"`, (b) the variant whose lift is uniformly 0 across all metrics, or (c) ask the user.
-2. For each non-control variant, look up the metric in `summary.positive` / `summary.negative` / `summary.no`. **Trust the bucket name as the significance signal** — the `significance` field on each item may be `null` even when the bucket is meaningful.
-3. Apply the polarity recipe using `metric.direction` to translate sign-of-lift into win/loss.
-4. If `lift is None` in a summary item, **the calculation failed** for that variant — surface it. Do not interpret as "no effect."
-
-The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is set to `"bonferroni"` or `"benjamini-hochberg"` (across primaries × non-control variants). **Don't re-correct.**
-
-Turning the per-metric numbers into a plain-language verdict (lift + CI + p-value → "small win," "large regression," "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
-
-If nothing on the primaries is significant and the user is asking "why hasn't this hit statsig?", route to [references/why-no-statsig.md](references/why-no-statsig.md).
-
-### Step 3 — Guardrail check
-
-Apply the polarity recipe to every guardrail metric (`metric.type == "guardrail"`).
-
-- A small primary win + a clear guardrail regression → usually **iterate, do not ship**.
-- "Not significant" on a guardrail does NOT mean "no regression." It means the experiment couldn't _detect_ one at the chosen confidence. If the guardrail is critical (latency, error rate, retention), flag whether it was powered to detect a meaningful regression.
-- Polarity matters here too: a guardrail named "errors" with `direction: "down"` and lift `+5%` (significant) is a regression even though it lands in `summary.positive`.
-
-### Step 4 — Practical significance
-
-Statistical significance ≠ business impact. For every primary metric that won:
-
-1. Read the **baseline value** from the control variant: `live_metrics[metricId][controlKey].value`.
-2. Read the **lift** from the winning variant's row.
-3. Compute absolute lift: `baseline_value × lift`.
-4. Project to population per period: ask the user for traffic estimates if not in context.
-
-A 5% lift on a 20% baseline metric serving 1M users/week is enormous. A 5% lift on a 0.1% baseline metric serving 1k users/week is noise. Always ground the user in absolute terms before declaring a win meaningful.
+## The decision tree
+
+Run in order. **Stop at the first failure** — do not proceed if a step flags a problem.
+
+1. **Trustworthiness gate** — SRM ok? Exposures sufficient? Retro A/A clean? Minimum duration met (~3 days)? No misconfig? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md).
+2. **Statistical significance** — apply the polarity recipe (below) to each non-control variant × primary. If nothing significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md).
+3. **Guardrail check** — any guardrail significant in the wrong polarity? Regression → ITERATE not ship.
+4. **Practical significance** — convert lift into absolute terms (`baseline_value × lift`). Statistically significant ≠ ships.
+5. **Verdict** — see table below.
+
+### Polarity recipe (load-bearing — keep in mind for every metric)
 
-**Twyman's Law check**: before celebrating any lift > ~30%, ask: did the treatment change who is _exposed_ to this metric, not just how they behave? See the changed-denominator notes in [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
+`summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by business value. `metric.direction` ("up" / "down", defaults to "up") tells you which sign is good:
 
-If `value` or `sampleSize` is `null` (common when live computation timed out), run a query on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total.
+- `lift is None` or `lift == 0` → **neutral**
+- `direction == "up"` → **positive** if `lift > 0`, else **negative**
+- `direction == "down"` → **positive** if `lift < 0`, else **negative**
 
-### Step 5 — Verdict
+A metric in `summary.positive` with `direction: "down"` is a **regression**, not a win. Filter out the control row first (use `settings.controlKey`). The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**.
+
+Per-metric phrasing (translating lift + CI + p-value into "small win" / "large regression" / "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). The same reference covers the changed-denominator check (Twyman's Law) for any lift >~30%, and how to query the baseline if `value` or `sampleSize` is `null`.
+
+### Verdict table
 
 | Situation                                                              | Recommendation                                                                                                                                               |
 | ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
@@ -175,22 +77,13 @@ If `value` or `sampleSize` is `null` (common when live computation timed out), r
 | Trust ✓, target sample/duration not yet reached                        | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)).                                         |
 | Trust ✗                                                                | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). |
 
-For **multi-variant tests**, pivot the summary by variant and evaluate each treatment independently against control. The winner is the variant with the most polarity-corrected primary wins, zero guardrail regressions, and the largest practical impact. If multiple qualify, prefer the simpler / lower-risk variant. If none qualify, recommend kill or iterate.
-
-`message` is required on every `decide` call — include the rationale, the metrics evaluated, and any tradeoffs accepted.
-
-Special variant constants when `success=true`:
-
-- `__no_variant_shipped__` — ship the change without picking a variant
-- `__defer_variant_decision__` — defer (status becomes `SUCCESS_DEFERRED` in UI)
-
-For a kill, pass `success=false`.
+For multi-variant tests, the `decide`-call shape, and special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), see [references/experiment-fields.md](references/experiment-fields.md) §Lifecycle hand-off. `message` is required on every `decide` call.
 
 ---
 
 ## Going deeper
 
-Once the spine is clear, the user often asks one of these follow-ups. Open the relevant reference on demand:
+Open the relevant reference on demand:
 
 | User asks about…                                                                | Open                                                                                             |
 | ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
@@ -215,22 +108,3 @@ Default to this shape unless the user asks for something else:
 5. **Suggested next action** — the experiment-decide action to take, or the deeper investigation to run.
 
 If experiment details are unavailable or return errors, say so — do not invent a verdict.
-
----
-
-## Common pitfalls (cheat sheet)
-
-- ⛔ **Skipping Step 1** because the lifts look exciting (Twyman's Law)
-- ⛔ **Recomputing thresholds** instead of consuming the verdicts the platform already returned
-- ⛔ **Not applying polarity** — reading `summary.positive` as "good" without checking `metric.direction`
-- ⛔ Trusting a >30% lift without checking whether the **denominator changed**
-- ⛔ **Including the control row** when counting wins/losses (filter by `settings.controlKey`)
-- ⛔ Treating a `null` lift as "no effect" — it means computation failed
-- ⛔ Treating a missing primary (in `metrics[]` but not in `live_metrics`/`results_cache.metrics`) as "no effect" — it's "no measurement"
-- ⛔ Interpreting a `< 3 day` experiment instead of refusing
-- ⛔ Forgetting to call out a **non-default `confidenceLevel`** (0.9 inflates false positives; 0.99 is conservative)
-- ⛔ Treating **secondary-metric significance** as decisional (it isn't, ever)
-- ⛔ Conflating **statistical significance** with **practical significance**
-- ⛔ Ignoring **guardrail regressions** because the primary won
-- ⛔ Calling a single significant primary with multiple-testing correction off a "win" — look at the aggregate, or enable correction
-- ⛔ Concluding "no effect" from an underpowered inconclusive result (route to [references/why-no-statsig.md](references/why-no-statsig.md))

From 4b8b01e166972d11b23e8d508118047a68e6554e Mon Sep 17 00:00:00 2001
From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com>
Date: Fri, 5 Jun 2026 23:36:42 +0000
Subject: [PATCH 04/11] Add negative-trigger guidance to experiment-results
 description

Surfaces the setup-skill boundary at routing time. The exclusion
already existed in the SKILL.md body ("Do not trigger for experiment
setup questions"), but skill selection sees only the `description:`
frontmatter, so the agent never reached it.

Sync mixpanel-mcp-eu and mixpanel-mcp-in.

Assisted by Claude
---
 plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md | 2 +-
 plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md | 2 +-
 plugins/mixpanel-mcp/skills/experiment-results/SKILL.md    | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md
index 44f7254..7bc71c4 100644
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: experiment-results
-description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds.
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
 license: Apache-2.0
 ---
 
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md
index 44f7254..7bc71c4 100644
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md
+++ b/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: experiment-results
-description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds.
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
 license: Apache-2.0
 ---
 
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md
index 44f7254..7bc71c4 100644
--- a/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md
+++ b/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: experiment-results
-description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds.
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
 license: Apache-2.0
 ---
 

From 0e59d994a8853fec567ea4b6afa47a768406e433 Mon Sep 17 00:00:00 2001
From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com>
Date: Tue, 9 Jun 2026 04:54:42 +0000
Subject: [PATCH 05/11] Address PR review: restructure skill, rename to
 interpret-experiment
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Addresses @gslopez's review on PR #23.

- Rename skill from `experiment-results` to `interpret-experiment` (verb-noun,
  matches `create-dashboard` / `manage-lexicon` / `deep-research`).
- Restructure SKILL.md into intro / Glossary / Components / Steps shape (mirrors
  `manage-lexicon`). Glossary defines Variant, Primary/Guardrail/Secondary,
  Lift, Polarity, Significance, SRM, Retro A/A, Twyman, CUPED, Winsorization,
  MDE, Trustworthiness gate so later steps can use the terms without redefining.
- Drop API-parameter phrasing (`compute_exposures=true`, decide-call shape).
  Replace with intent — the tool layer maps it to the right call.
- Drop the duplicated negative-trigger paragraph from the SKILL body (already
  in `description:` frontmatter where the loader sees it).
- Define the polarity recipe once in Components; replace the verbatim duplicate
  in per-metric-interpretation.md with a back-reference.
- Replace the live/cache field-path table with a single fallback rule in
  Components; the field schema is the tool's job, not the skill's.
- Add explicit "confirm with the user before concluding — irreversible" to the
  verdict table and to the Steps output shape.
- Delete `experiment-fields.md` (was duplicating tool-response schema docs).
  Promote its domain content into `lifecycle-handoff.md` (decide-call rationale,
  multi-variant choice, special variant constants).
- Break the overloaded misconfig bullets in health-check §7 into seven
  Condition / Interpretation / Action sub-sections; rename §7 to
  "Misconfigurations".
- Replace `§7` / `§SRM` / `§Misconfig` numeric/abbrev cross-refs with section
  titles so they don't rot when sections are reordered.
- Trim generic p-value content in per-metric-interpretation.md to the
  Mixpanel-specific traps (Welch's t, `liftConfidence` is the confidence level
  not the CI width).
- Drop "Open this when…" preambles from every reference — the LLM is already
  reading the file by the time it reaches the preamble.
- Sync eu/in plugin copies via `make sync-skills FORCE=1`.

Assisted by Claude
---
 README.md                                     |  18 +-
 .../skills/experiment-results/SKILL.md        | 110 ------------
 .../references/experiment-fields.md           | 158 ------------------
 .../skills/interpret-experiment/SKILL.md      | 127 ++++++++++++++
 .../references/health-check-interpretation.md |  70 ++++++--
 .../references/lifecycle-handoff.md           |  39 +++++
 .../references/per-metric-interpretation.md   |  27 ++-
 .../segment-breakdown-interpretation.md       |   4 +-
 .../segment-of-interest-selection.md          |   2 +-
 .../references/session-replay-analysis.md     |   2 +-
 .../references/why-no-statsig.md              |   4 +-
 .../skills/experiment-results/SKILL.md        | 110 ------------
 .../references/experiment-fields.md           | 158 ------------------
 .../skills/interpret-experiment/SKILL.md      | 127 ++++++++++++++
 .../references/health-check-interpretation.md |  70 ++++++--
 .../references/lifecycle-handoff.md           |  39 +++++
 .../references/per-metric-interpretation.md   |  27 ++-
 .../segment-breakdown-interpretation.md       |   4 +-
 .../segment-of-interest-selection.md          |   2 +-
 .../references/session-replay-analysis.md     |   2 +-
 .../references/why-no-statsig.md              |   4 +-
 .../skills/experiment-results/SKILL.md        | 110 ------------
 .../references/experiment-fields.md           | 158 ------------------
 .../skills/interpret-experiment/SKILL.md      | 127 ++++++++++++++
 .../references/health-check-interpretation.md |  70 ++++++--
 .../references/lifecycle-handoff.md           |  39 +++++
 .../references/per-metric-interpretation.md   |  27 ++-
 .../segment-breakdown-interpretation.md       |   4 +-
 .../segment-of-interest-selection.md          |   2 +-
 .../references/session-replay-analysis.md     |   2 +-
 .../references/why-no-statsig.md              |   4 +-
 31 files changed, 732 insertions(+), 915 deletions(-)
 delete mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md
 delete mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/references/experiment-fields.md
 create mode 100644 plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
 rename plugins/{mixpanel-mcp/skills/experiment-results => mixpanel-mcp-eu/skills/interpret-experiment}/references/health-check-interpretation.md (73%)
 create mode 100644 plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md
 rename plugins/{mixpanel-mcp-in/skills/experiment-results => mixpanel-mcp-eu/skills/interpret-experiment}/references/per-metric-interpretation.md (87%)
 rename plugins/{mixpanel-mcp/skills/experiment-results => mixpanel-mcp-eu/skills/interpret-experiment}/references/segment-breakdown-interpretation.md (94%)
 rename plugins/{mixpanel-mcp/skills/experiment-results => mixpanel-mcp-eu/skills/interpret-experiment}/references/segment-of-interest-selection.md (95%)
 rename plugins/{mixpanel-mcp/skills/experiment-results => mixpanel-mcp-eu/skills/interpret-experiment}/references/session-replay-analysis.md (96%)
 rename plugins/{mixpanel-mcp/skills/experiment-results => mixpanel-mcp-eu/skills/interpret-experiment}/references/why-no-statsig.md (94%)
 delete mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md
 delete mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/references/experiment-fields.md
 create mode 100644 plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md
 rename plugins/mixpanel-mcp-in/skills/{experiment-results => interpret-experiment}/references/health-check-interpretation.md (73%)
 create mode 100644 plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md
 rename plugins/{mixpanel-mcp/skills/experiment-results => mixpanel-mcp-in/skills/interpret-experiment}/references/per-metric-interpretation.md (87%)
 rename plugins/mixpanel-mcp-in/skills/{experiment-results => interpret-experiment}/references/segment-breakdown-interpretation.md (94%)
 rename plugins/{mixpanel-mcp-eu/skills/experiment-results => mixpanel-mcp-in/skills/interpret-experiment}/references/segment-of-interest-selection.md (95%)
 rename plugins/mixpanel-mcp-in/skills/{experiment-results => interpret-experiment}/references/session-replay-analysis.md (96%)
 rename plugins/{mixpanel-mcp-eu/skills/experiment-results => mixpanel-mcp-in/skills/interpret-experiment}/references/why-no-statsig.md (94%)
 delete mode 100644 plugins/mixpanel-mcp/skills/experiment-results/SKILL.md
 delete mode 100644 plugins/mixpanel-mcp/skills/experiment-results/references/experiment-fields.md
 create mode 100644 plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
 rename plugins/{mixpanel-mcp-eu/skills/experiment-results => mixpanel-mcp/skills/interpret-experiment}/references/health-check-interpretation.md (73%)
 create mode 100644 plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md
 rename plugins/{mixpanel-mcp-eu/skills/experiment-results => mixpanel-mcp/skills/interpret-experiment}/references/per-metric-interpretation.md (87%)
 rename plugins/{mixpanel-mcp-eu/skills/experiment-results => mixpanel-mcp/skills/interpret-experiment}/references/segment-breakdown-interpretation.md (94%)
 rename plugins/{mixpanel-mcp-in/skills/experiment-results => mixpanel-mcp/skills/interpret-experiment}/references/segment-of-interest-selection.md (95%)
 rename plugins/{mixpanel-mcp-eu/skills/experiment-results => mixpanel-mcp/skills/interpret-experiment}/references/session-replay-analysis.md (96%)
 rename plugins/{mixpanel-mcp-in/skills/experiment-results => mixpanel-mcp/skills/interpret-experiment}/references/why-no-statsig.md (94%)

diff --git a/README.md b/README.md
index 3518635..67b1872 100644
--- a/README.md
+++ b/README.md
@@ -4,18 +4,18 @@ Plugins that give AI agents Mixpanel expertise. Built on the [Agent Skills](http
 
 ## Skills
 
-| Skill | Description |
-|---|---|
-| [`create-dashboard`](plugins/mixpanel-mcp/skills/create-dashboard/) | Creates a well-designed Mixpanel dashboard with validated data, text cards, and narrative layout. |
-| [`deep-research`](plugins/mixpanel-mcp/skills/deep-research/) | Conducts a structured metric investigation in Mixpanel. Use when a user asks *why* a metric changed, what's driving a trend, or requests a deep dive or root cause analysis. |
-| [`experiment-results`](plugins/mixpanel-mcp/skills/experiment-results/) | Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, make a ship/iterate/kill/wait call, asks why statsig hasn't been reached, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the verdicts that `Get-Experiment` returns — never recomputes thresholds. |
-| [`manage-lexicon`](plugins/mixpanel-mcp/skills/manage-lexicon/) | Audits, scores, enriches, and cleans up Lexicon metadata (events and properties) for a Mixpanel project. Supports scoring health, bulk-filling descriptions/tags, resetting metadata, triaging data quality issues, and managing tags. |
-| [`tracking-implementation`](plugins/mixpanel-mcp/skills/tracking-implementation/) | Guides an agent through Mixpanel analytics implementation. Supports Quick Start, Full Implementation, Add Tracking, and Audit modes. |
+| Skill                                                                             | Description                                                                                                                                                                                                                                                                                                                                                          |
+| --------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| [`create-dashboard`](plugins/mixpanel-mcp/skills/create-dashboard/)               | Creates a well-designed Mixpanel dashboard with validated data, text cards, and narrative layout.                                                                                                                                                                                                                                                                    |
+| [`deep-research`](plugins/mixpanel-mcp/skills/deep-research/)                     | Conducts a structured metric investigation in Mixpanel. Use when a user asks _why_ a metric changed, what's driving a trend, or requests a deep dive or root cause analysis.                                                                                                                                                                                         |
+| [`interpret-experiment`](plugins/mixpanel-mcp/skills/interpret-experiment/)       | Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, make a ship/iterate/kill/wait call, asks why statsig hasn't been reached, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the verdicts that `Get-Experiment` returns — never recomputes thresholds. |
+| [`manage-lexicon`](plugins/mixpanel-mcp/skills/manage-lexicon/)                   | Audits, scores, enriches, and cleans up Lexicon metadata (events and properties) for a Mixpanel project. Supports scoring health, bulk-filling descriptions/tags, resetting metadata, triaging data quality issues, and managing tags.                                                                                                                               |
+| [`tracking-implementation`](plugins/mixpanel-mcp/skills/tracking-implementation/) | Guides an agent through Mixpanel analytics implementation. Supports Quick Start, Full Implementation, Add Tracking, and Audit modes.                                                                                                                                                                                                                                 |
 
 ### Internal skills
 
-| Skill | Description |
-|---|---|
+| Skill                                          | Description                                                                                                                                                                                |
+| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
 | [`review-skill`](.claude/skills/review-skill/) | Reviews a skill against a weighted quality rubric (8 dimensions, 27 checks) and produces a score with actionable issues. Run `/review-skill <skill-name>` before requesting a code review. |
 
 ## Getting Started
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md
deleted file mode 100644
index 7bc71c4..0000000
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md
+++ /dev/null
@@ -1,110 +0,0 @@
----
-name: experiment-results
-description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
-license: Apache-2.0
----
-
-# Experiment Results Interpretation
-
-You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. Use the decision tree below as the spine; open references only when a step needs depth.
-
-## Requirements
-
-- Access to Mixpanel (read experiment details and metrics; update experiment lifecycle for ship/kill decisions).
-- This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values.
-
-## When to use this skill
-
-Trigger when the user asks anything about reading an experiment's results or its health. Common phrasings:
-
-- "What do these results mean?" / "Should we ship this?"
-- "Is this experiment trustworthy?" / "Why is SRM failing?"
-- "Why hasn't this hit statistical significance yet?"
-- "Break this down by `<segment>`" / "What segments should I look at?"
-- "What does this Retro A/A failure mean?"
-- "Can you compare the session replays for control vs treatment?"
-
-Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the `experiment-setup` skill.
-
----
-
-## How to read experiment-details output
-
-Always request experiment details with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.**
-
-| Concept                      | Live (preferred)                  | Cached fallback                             |
-| ---------------------------- | --------------------------------- | ------------------------------------------- |
-| Per-variant exposure counts  | `live_exposures`                  | `exposures_cache` (strip `$`-prefixed keys) |
-| SRM check                    | `live_srm_analysis`               | `exposures_cache.$srm_analysis`             |
-| Per-metric per-variant stats | `live_metrics[metricId][variant]` | `results_cache.metrics[metricId][variant]`  |
-| Bucketed summary             | recompute from `live_metrics`     | `results_cache.summary`                     |
-
-If `live_results_errors` is non-null, use the cache, caveat that data is stale, and surface the error — the underlying failure may need fixing before any decision. If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect."
-
-The full field map is in [references/experiment-fields.md](references/experiment-fields.md).
-
----
-
-## The decision tree
-
-Run in order. **Stop at the first failure** — do not proceed if a step flags a problem.
-
-1. **Trustworthiness gate** — SRM ok? Exposures sufficient? Retro A/A clean? Minimum duration met (~3 days)? No misconfig? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md).
-2. **Statistical significance** — apply the polarity recipe (below) to each non-control variant × primary. If nothing significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md).
-3. **Guardrail check** — any guardrail significant in the wrong polarity? Regression → ITERATE not ship.
-4. **Practical significance** — convert lift into absolute terms (`baseline_value × lift`). Statistically significant ≠ ships.
-5. **Verdict** — see table below.
-
-### Polarity recipe (load-bearing — keep in mind for every metric)
-
-`summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by business value. `metric.direction` ("up" / "down", defaults to "up") tells you which sign is good:
-
-- `lift is None` or `lift == 0` → **neutral**
-- `direction == "up"` → **positive** if `lift > 0`, else **negative**
-- `direction == "down"` → **positive** if `lift < 0`, else **negative**
-
-A metric in `summary.positive` with `direction: "down"` is a **regression**, not a win. Filter out the control row first (use `settings.controlKey`). The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**.
-
-Per-metric phrasing (translating lift + CI + p-value into "small win" / "large regression" / "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). The same reference covers the changed-denominator check (Twyman's Law) for any lift >~30%, and how to query the baseline if `value` or `sampleSize` is `null`.
-
-### Verdict table
-
-| Situation                                                              | Recommendation                                                                                                                                               |
-| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Use the experiment's `decide` action with `success=true`, `variant=<winner>`, and a `message` rationale.                                           |
-| Trust ✓, primary polarity positive, guardrail polarity negative        | **ITERATE.** Investigate the regression; do not auto-ship.                                                                                                   |
-| Trust ✓, primary polarity neutral after target sample reached          | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md).                                  |
-| Trust ✓, target sample/duration not yet reached                        | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)).                                         |
-| Trust ✗                                                                | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). |
-
-For multi-variant tests, the `decide`-call shape, and special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), see [references/experiment-fields.md](references/experiment-fields.md) §Lifecycle hand-off. `message` is required on every `decide` call.
-
----
-
-## Going deeper
-
-Open the relevant reference on demand:
-
-| User asks about…                                                                | Open                                                                                             |
-| ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
-| SRM failing, Retro A/A failing, exposures insufficient, or any Step 1 fail      | [references/health-check-interpretation.md](references/health-check-interpretation.md)           |
-| "Translate this lift / CI / p-value into English"                               | [references/per-metric-interpretation.md](references/per-metric-interpretation.md)               |
-| "Why hasn't this hit statsig yet? Should we wait or stop?"                      | [references/why-no-statsig.md](references/why-no-statsig.md)                                     |
-| "Which segments should I break this down on?"                                   | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md)       |
-| "What does this segment-by-segment result mean?" (when platform support exists) | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) |
-| "Can session replays help explain this result?"                                 | [references/session-replay-analysis.md](references/session-replay-analysis.md)                   |
-| "Which field in the experiment-details response has X?"                         | [references/experiment-fields.md](references/experiment-fields.md)                               |
-
----
-
-## Output
-
-Default to this shape unless the user asks for something else:
-
-1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`.
-2. **Why**, walking through the decision tree steps that mattered (skip the steps that were clearly fine).
-3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, with the polarity-corrected reading of each. Include the absolute-impact translation for any win.
-4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, etc.
-5. **Suggested next action** — the experiment-decide action to take, or the deeper investigation to run.
-
-If experiment details are unavailable or return errors, say so — do not invent a verdict.
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/experiment-fields.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/experiment-fields.md
deleted file mode 100644
index 1e65de1..0000000
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/experiment-fields.md
+++ /dev/null
@@ -1,158 +0,0 @@
-# Experiment-Details Field Map
-
-Quick reference for which experiment-details response field drives each interpretation. Always request the details with `compute_exposures=true, compute_metrics=true`.
-
-This reference is **read-only domain knowledge** for the agent. It does NOT define thresholds — every "fail condition" listed below is a _characterization_ of how the platform itself already classifies the field, not a threshold this skill should re-apply.
-
----
-
-## Identity & lifecycle
-
-```
-id, name, description, hypothesis, status, start_date, end_date
-creator_email, tags, url, workspace_id
-feature_flag_id                       → for feature-flag-based experiments
-settings.controlKey                   → variant key treated as control (often "control"; may be "")
-```
-
-`status` is one of `"concluded" | "success" | "fail"` (the UI may additionally show `SUCCESS_DEFERRED` for the special variant constant — see "Decision metadata" below).
-
----
-
-## Trustworthiness
-
-```
-live_srm_analysis                     → SRM verdict (consume — don't recompute)
-  .p_value
-  .chi_square
-live_exposures[<variantKey>]          → per-variant exposure counts (live)
-exposures_cache[<variantKey>]         → per-variant exposure counts (cached fallback)
-exposures_cache.$srm_analysis         → cached SRM analysis
-exposures_cache.$last_computed        → when the cache was last refreshed
-settings.srm.enabled                  → whether the SRM check ran
-settings.srm.targetAllocations        → expected per-variant allocation (percent)
-settings.preExperimentBias            → whether Retro A/A was enabled
-settings.excludeQA                    → whether QA traffic was filtered
-live_results_errors                   → non-null = live computation failed; surface and fall back to cache
-```
-
----
-
-## Per-metric per-variant results
-
-```
-live_metrics[<metricId>][<variantKey>]
-  .value             → metric value for this variant
-  .sampleSize        → sample size for this variant on this metric
-  .lift              → (treatment - control) / control  (0 for control row)
-  .liftConfidence    → confidence LEVEL used (e.g. 0.95) — NOT the CI width
-  .significance      → "YES_POSITIVE" | "YES_NEGATIVE" | "NO"  (sign-of-lift, NOT polarity)
-
-results_cache.metrics[<metricId>][<variantKey>]  → cached fallback, same shape
-```
-
----
-
-## Bucketed summary
-
-```
-results_cache.summary.positive[]      → items with significance == "YES_POSITIVE" (lift > 0, sig)
-results_cache.summary.negative[]      → items with significance == "YES_NEGATIVE" (lift < 0, sig)
-results_cache.summary.no[]            → items with significance == "NO"
-
-Each item:
-  .metricId
-  .variant
-  .value
-  .lift
-  .liftConfidence
-  .sampleSize
-  .significance
-```
-
-**Pre-process the summary**: filter rows where `variant == settings.controlKey` (control-vs-control is mechanical noise), then apply the polarity recipe before drawing any conclusion.
-
----
-
-## Metric catalog (for polarity lookups)
-
-```
-metrics[]
-  .id, .name
-  .type ("primary" | "guardrail" | "secondary")
-  .direction ("up" | "down")          → always set; defaults to "up" if the source metric was unset
-```
-
-Build a lookup `metric_id → (type, direction)` and join to summary rows during interpretation.
-
----
-
-## Settings that change interpretation
-
-```
-settings.confidenceLevel              → significance threshold (e.g. 0.95)
-settings.testingModel                 → "frequentist" or "sequential"
-settings.endCondition                 → "sample_size" or "days"
-settings.sampleSize / .endAfterDays   → planned end target
-settings.multipleTestingCorrection    → "off" | "bonferroni" | "benjamini-hochberg"
-settings.cuped.enabled                → CUPED variance reduction applied
-settings.cuped.preExposureDatePreset  → pre-exposure window
-settings.winsorization.enabled        → outlier capping applied
-settings.winsorization.percentile     → cap percentile (default 95; lower values are extreme)
-```
-
----
-
-## Decision metadata (post-decide)
-
-```
-results_cache.message                 → decision rationale
-results_cache.variant                 → shipped variant key (or special constant)
-status                                → "concluded" | "success" | "fail"
-```
-
-Special variant constants for `success=true`:
-
-- `__no_variant_shipped__` — ship the change without picking a variant.
-- `__defer_variant_decision__` — defer (UI shows `SUCCESS_DEFERRED`).
-
-For a kill, pass `success=false`.
-
----
-
-## Lifecycle hand-off
-
-To ship/kill, update the experiment with the `decide` action and these fields:
-
-```
-action     → "decide"
-success    → true | false
-variant    → "<winner_key>"      # required when success=true
-message    → "<rationale: metrics evaluated, polarity, tradeoffs accepted>"
-```
-
-`message` is required on every `decide` call.
-
----
-
-## Misconfig field map (cross-link)
-
-For _how_ to react to each of these, see [health-check-interpretation.md](health-check-interpretation.md) §7.
-
-- `settings.multipleTestingCorrection in {"off", null}` with 2+ primaries × 1+ non-control variants
-- `settings.winsorization.enabled == true` with `percentile` very low (< ~80) or very high (> ~99)
-- `settings.srm == null` OR `settings.srm.enabled == false` (often intentional — only flag if results look suspicious)
-- `settings.cuped.enabled == true` AND the experiment cohort is "new users only"
-- `settings.confidenceLevel != 0.95`
-- `metrics[]` entries with `name == ""`
-- A primary metric in `metrics[]` but missing from `live_metrics` AND `results_cache.metrics`
-
----
-
-## When to reach for sibling capabilities
-
-- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the `experiment-setup` skill.
-- **Raw data for triggered or segmentation analysis** → run a query on the metric with appropriate filters.
-- **Acting on the recommendation** (ship, kill, extend) → update the experiment with the appropriate action.
-- **Feature-flag rollout history** for SRM root cause → inspect the linked flag's state.
-- **Session replays** for behavioral explanation of a quantitative result → see [session-replay-analysis.md](session-replay-analysis.md).
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
new file mode 100644
index 0000000..c2d7591
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
@@ -0,0 +1,127 @@
+---
+name: interpret-experiment
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
+license: Apache-2.0
+---
+
+# Interpret Experiment
+
+You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a verdict field is missing, say so — do not synthesize one from raw values.
+
+---
+
+# Glossary
+
+Concepts the rest of this skill uses without redefining.
+
+- **Variant.** One arm of the experiment. The variant treated as the baseline is the **control**; the others are **treatments**. The platform marks which key is the control.
+- **Primary / Guardrail / Secondary metric.**
+  - **Primary** — drives the ship decision. The platform applies multiple-testing correction across primaries when configured.
+  - **Guardrail** — a metric that must not regress; a guardrail loss vetoes a ship even when primaries win.
+  - **Secondary** — exploratory only, never decisional, no correction applied.
+- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict.
+- **Polarity.** Whether a movement is _good for the business_. Combines sign of lift with the metric's `direction` ("up" = bigger is better; "down" = smaller is better). See the **Polarity recipe** in Components.
+- **Significance.** The platform's per-row classification: `YES_POSITIVE`, `YES_NEGATIVE`, or `NO`. Read from the response — do not recompute.
+- **SRM (Sample Ratio Mismatch).** Variants received traffic in proportions that disagree with the configured split. **Kohavi's #1 trustworthiness check** — when SRM fails, downstream lift, p-values, and CIs cannot be trusted.
+- **Retro A/A (pre-experiment bias).** Re-runs the comparison on the pre-exposure period. A failure means cohorts already differed before treatment started.
+- **Twyman's Law.** "Any figure that looks interesting or different is usually wrong." An unusually clean or unusually large result is more likely a bug than a discovery. Apply on lifts > ~30%, which are usually a changed-denominator artifact.
+- **CUPED.** Variance reduction using pre-exposure baseline. Cuts required sample 30–70% when it applies. Inert on new-user-only cohorts.
+- **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95.
+- **MDE (Minimum Detectable Effect).** The smallest lift the experiment was sized to detect. Set during experiment setup.
+- **Trustworthiness gate.** The pre-flight check in step 2a of the Decision Tree: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet**; route to the health-check reference.
+
+---
+
+# Components
+
+The pieces every interpretation uses. Defined here once so they don't drift across the steps and references.
+
+## Polarity recipe (load-bearing — apply on every metric row)
+
+The platform's summary buckets (`positive` / `negative` / `no`) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion.
+
+Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"):
+
+- `lift is None` or `lift == 0` → **neutral** (no measurement / no effect respectively).
+- `direction == "up"` → **positive** if `lift > 0`, else **negative**.
+- `direction == "down"` → **positive** if `lift < 0`, else **negative**.
+
+A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first (use `settings.controlKey`).
+
+The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**.
+
+## Data-source fallback
+
+Experiment-details has two parallel data paths — live (preferred) and cached. Always prefer live; if live computation failed, fall back to cache with a staleness caveat; if **both** are empty, say "no result was computed" and recommend a re-sync. **Never** silently treat missing data as "no effect."
+
+## Verdict table
+
+| Situation                                                              | Recommendation                                                                                                                                                                       |
+| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Conclude the experiment via its decide lifecycle action, naming the winning variant and a rationale message. **Confirm with the user first — concluding is irreversible.** |
+| Trust ✓, primary polarity positive, guardrail polarity negative        | **ITERATE.** Investigate the regression; do not auto-ship.                                                                                                                           |
+| Trust ✓, primary polarity neutral after target sample reached          | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md).                                                          |
+| Trust ✓, target sample/duration not yet reached                        | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)).                                                                 |
+| Trust ✗                                                                | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md).                         |
+
+For multi-variant tests, special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), and the exact decide-call shape, see [references/lifecycle-handoff.md](references/lifecycle-handoff.md).
+
+---
+
+# Steps
+
+Top-down: what to do, in order.
+
+## 1. Fetch the experiment
+
+Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments.
+
+Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret.
+
+## 2. Run the trustworthiness gate (the Decision Tree)
+
+Run steps 2a–2e in order. **Stop at the first failure** — do not proceed if a step flags a problem. The platform attaches verdict fields for each check; consume those verdicts rather than recomputing.
+
+### 2a. Trustworthiness
+
+SRM ok? Retro A/A clean? Exposures sufficient? Minimum duration met (~3 days)? No misconfiguration? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md). The Misconfigurations section in that reference covers the warning-level signals (multiple-testing off, extreme winsorization, CUPED on new-users-only, etc.).
+
+### 2b. Statistical significance
+
+Apply the **polarity recipe** from Components to each non-control variant × primary metric. If nothing is significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md). For translating a single metric's lift / CI / p-value into a phrase, see [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
+
+### 2c. Guardrail check
+
+Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a row in `summary.positive` for a `direction: "down"` guardrail is still a regression.
+
+### 2d. Practical significance
+
+Convert lift into absolute terms by multiplying it by the control baseline. Statistical significance alone does not justify shipping; the absolute change must also be large enough to matter. The per-metric reference covers the baseline-fetch fallback when `value` or `sampleSize` is missing, and the **Twyman's Law** check for any lift > ~30%.
+
+### 2e. Verdict
+
+Look up the situation in the **Verdict table** in Components. If the recommendation is SHIP or KILL, surface the proposed decide-action parameters and **wait for explicit user confirmation** before executing — concluding an experiment is irreversible.
+
+## 3. Going deeper (open references on demand)
+
+| User asks about…                                                                    | Open                                                                                             |
+| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
+| SRM failing, Retro A/A failing, exposures insufficient, or any trustworthiness fail | [references/health-check-interpretation.md](references/health-check-interpretation.md)           |
+| "Translate this lift / CI / p-value into English"                                   | [references/per-metric-interpretation.md](references/per-metric-interpretation.md)               |
+| "Why hasn't this hit statsig yet? Should we wait or stop?"                          | [references/why-no-statsig.md](references/why-no-statsig.md)                                     |
+| "Which segments should I break this down on?"                                       | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md)       |
+| "What does this segment-by-segment result mean?"                                    | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) |
+| "Can session replays help explain this result?"                                     | [references/session-replay-analysis.md](references/session-replay-analysis.md)                   |
+| "How do I actually conclude this experiment? Multi-variant ship?"                   | [references/lifecycle-handoff.md](references/lifecycle-handoff.md)                               |
+
+## 4. Output
+
+Default to this shape unless the user asks for something else:
+
+1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`.
+2. **Why**, walking through the trustworthiness-gate steps that mattered (skip steps that were clearly fine).
+3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, each polarity-corrected. Include absolute-impact translation for any win.
+4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, stale-cache caveat, etc.
+5. **Suggested next action** — for SHIP / KILL, the proposed decide-action parameters **gated on user confirmation**; for ITERATE / WAIT, the investigation to run next.
+
+If experiment details are unavailable or return errors, say so — do not invent a verdict.
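The polarity recipe in the Components section of `SKILL.md` above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the row shape (dicts with `variant`, `lift`, and `direction` keys) and the function names are assumptions made for the example; only the rules themselves come from the skill text.

```python
from typing import Dict, List, Optional


def polarity(lift: Optional[float], direction: str = "up") -> str:
    """Translate sign-of-lift plus metric direction into business polarity."""
    # A missing lift means no measurement; a zero lift means no effect.
    if lift is None or lift == 0:
        return "neutral"
    if direction == "up":
        return "positive" if lift > 0 else "negative"
    if direction == "down":
        return "positive" if lift < 0 else "negative"
    raise ValueError(f"unknown direction: {direction!r}")


def treatment_polarities(rows: List[dict], control_key: str) -> Dict[str, str]:
    """Drop the control row, then map each remaining variant to its polarity.

    `rows` is a hypothetical list of per-variant result dicts, not a real
    response schema.
    """
    return {
        row["variant"]: polarity(row.get("lift"), row.get("direction", "up"))
        for row in rows
        if row["variant"] != control_key
    }
```

Note that a positive lift on a `direction: "down"` metric comes back as `"negative"`, which is the regression case the skill warns about.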
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md
similarity index 73%
rename from plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md
rename to plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md
index 9ec66df..e9082fa 100644
--- a/plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md
@@ -1,8 +1,8 @@
 # Health-Check Interpretation
 
-Open this when Step 1 of the Decision Tree flags a failure (SRM, Retro A/A, insufficient exposures, peeking, broken-data, < 3-day window, or any misconfiguration). The goal is to turn the platform's already-computed verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action.
+Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action.
 
-**This skill never recomputes thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers.
+**Never recompute thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers.
 
 ---
 
@@ -134,17 +134,65 @@ If `endCondition: "sample_size"` with a tiny target (e.g. 10k) was reached in ho
 
 ---
 
-## 7. Misconfigurations to flag during Step 1
+## 7. Misconfigurations
 
-These don't always invalidate results, but they change how to _read_ them. Surface them as warnings.
+These don't always invalidate results, but they change how to _read_ them. Surface them as warnings during the trustworthiness gate.
 
-- `settings.multipleTestingCorrection in {"off", null}` AND there are 2+ primary metrics across 1+ non-control variants → without correction, any single significant primary may be a false positive. **Don't assume the result is broken** — look at all primary results in aggregate. If most or all primaries point the same direction (all positive or all negative), there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk**, and the user can enable correction (Benjamini-Hochberg or Bonferroni) and re-analyze. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate).
-- `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` very low (e.g. < ~80) or unusually high (e.g. > ~99) → extreme outlier capping. The platform's default is 95; a percentile near 50 caps almost all data and likely indicates misconfiguration.
-- `settings.srm == null` OR `settings.srm.enabled == false` → the SRM check didn't run. **SRM is often deliberately disabled** (e.g. when feature-flag rollouts intentionally split traffic unevenly), so do not try to compute it yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) — then suggest the user re-enable SRM and re-analyze.
-- `settings.cuped.enabled == true` AND the experiment cohort is "new users only" → CUPED requires pre-exposure data, which new-user experiments lack, so CUPED simply has no effect. **This does NOT invalidate results** — variance reduction just didn't happen. Mention it as informational.
-- `settings.confidenceLevel != 0.95` → call out explicitly. `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Combine with metric count for a sense of family-wise error rate.
-- `metrics[]` contains entries with `name == ""` → likely a broken or placeholder metric reference. Flag and skip during analysis.
-- A primary metric appears in `metrics[]` but is **missing from `live_metrics` AND `results_cache.metrics`** → no result was computed for that primary. Surface prominently — this is "no measurement," not "no effect." Recommend the user re-sync results.
+### Multiple-testing correction off with several primaries
+
+**Condition**: `settings.multipleTestingCorrection` is `"off"` or `null` AND there are 2+ primary metrics across 1+ non-control variants.
+
+**Interpretation**: any single significant primary may be a false positive. With m independent comparisons at level α, the family-wise error rate is 1 − (1 − α)^m, so it climbs quickly with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% chance of at least one false positive).
+
+**Action**: look at all primary results in aggregate. If most point the same direction, there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk** — the user can enable Benjamini-Hochberg or Bonferroni and re-analyze.
+
+### Extreme winsorization percentile
+
+**Condition**: `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` is very low (e.g. < ~80) or unusually high (e.g. > ~99).
+
+**Interpretation**: outlier capping is far from the platform's default of 95. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration.
+
+**Action**: ask the user to confirm the percentile was intentional; recommend resetting to 95 unless they have a specific reason.
+
+### SRM check disabled
+
+**Condition**: `settings.srm == null` OR `settings.srm.enabled == false`.
+
+**Interpretation**: the SRM check didn't run. **Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug.
+
+**Action**: only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios). When you do flag, recommend re-enabling SRM and re-analyzing.
+
+### CUPED on new-users-only cohort
+
+**Condition**: `settings.cuped.enabled == true` AND the experiment cohort is "new users only".
+
+**Interpretation**: CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen.
+
+**Action**: mention as informational; no remediation needed for this experiment. For future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply.
+
+### Non-default confidence level
+
+**Condition**: `settings.confidenceLevel != 0.95`.
+
+**Interpretation**: `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative.
+
+**Action**: call out explicitly in the verdict. Combine with metric count to estimate the family-wise error rate.
+
+### Broken or placeholder metric entries
+
+**Condition**: `metrics[]` contains entries with `name == ""`.
+
+**Interpretation**: likely a broken or placeholder metric reference.
+
+**Action**: flag and skip during analysis.
+
+### Primary metric with no computed result
+
+**Condition**: a primary metric appears in `metrics[]` but is **missing from both** `live_metrics` and `results_cache.metrics`.
+
+**Interpretation**: no result was computed for that primary. **This is "no measurement," not "no effect."**
+
+**Action**: surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary.
 
 ---
 
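The ~54% figure quoted in the multiple-testing entry above can be checked with a short calculation. This is a sketch that assumes the comparisons are independent (correlated metrics inflate the rate less), and the function name is illustrative rather than anything the platform exposes. It is arithmetic for explaining the warning to a user, not a replacement for the platform's own correction.

```python
def family_wise_error_rate(alpha: float, comparisons: int) -> float:
    """Chance of at least one false positive across independent comparisons."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    if comparisons < 1:
        raise ValueError("comparisons must be at least 1")
    return 1.0 - (1.0 - alpha) ** comparisons


# 15 primaries x 1 non-control variant at alpha = 0.05, with no correction applied.
uncorrected = family_wise_error_rate(0.05, 15)
print(f"uncorrected family-wise error rate: {uncorrected:.1%}")
```

Raising the comparison count or lowering the confidence level (for example `settings.confidenceLevel` of 0.9, so α = 0.10) pushes the rate up further, which is why the two warnings are worth reading together.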
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md
new file mode 100644
index 0000000..4d8189d
--- /dev/null
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md
@@ -0,0 +1,39 @@
+# Lifecycle Hand-off
+
+How to conclude an experiment once the verdict is settled. This reference is **interpretation guidance** — the per-field schema of the decide action lives in the experiment-update tool description.
+
+---
+
+## Confirm before concluding — always
+
+Concluding an experiment is **irreversible**. Before invoking the decide action, surface the proposed parameters to the user (winning variant, success/fail, rationale message) and wait for explicit confirmation. A SHIP verdict is a recommendation, not an authorization.
+
+## The three pieces every decide call needs
+
+A decide call expresses three things:
+
+1. **Did the experiment succeed?** A win for one of the treatments, or a deliberate stop.
+2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special constants below.
+3. **Why?** A rationale message — what metrics were evaluated, the polarity reading, the tradeoffs accepted. The platform requires this on every decide call; treat it as a one-paragraph decision record, not a placeholder.
+
+## Special variant choices for success
+
+When you have a winning result but no single variant to ship:
+
+- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. (The platform exposes this as the constant `__no_variant_shipped__`.)
+- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The platform exposes this as `__defer_variant_decision__` and shows `SUCCESS_DEFERRED` in the UI.)
+
+When the verdict is KILL — no winner — record success as false. No variant key is needed in that case.
+
+## Multi-variant experiments
+
+For a 3+ arm test, the decide action still names a single winning variant. If two treatments are roughly tied:
+
+- If both clear the practical-significance bar and shipping either is acceptable, pick on simplicity (smaller diff from control, lower implementation cost).
+- If the team genuinely cannot pick, use the defer constant above — better than fabricating a winner.
+
+A multi-variant test where only one treatment is significantly different from control is a clean SHIP for that variant; the inconclusive arms are simply not the winner.
+
+## After concluding
+
+The decision record (`results_cache.message`, `results_cache.variant`, and `status` transitioning to `concluded` / `success` / `fail`) becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer.
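The three-piece rule in the lifecycle hand-off reference above can be expressed as a pre-confirmation check. This is a hedged sketch only: `ProposedDecision` and `validate_decision` are hypothetical names, and the real decide-action schema lives in the experiment-update tool description, which this example does not reproduce. The two special constants are the ones named in the reference.

```python
from dataclasses import dataclass
from typing import List, Optional

# Special variant constants named in the hand-off reference.
NO_VARIANT_SHIPPED = "__no_variant_shipped__"
DEFER_VARIANT_DECISION = "__defer_variant_decision__"


@dataclass
class ProposedDecision:
    """Hypothetical container for what gets surfaced to the user for confirmation."""
    success: bool
    message: str
    variant: Optional[str] = None


def validate_decision(decision: ProposedDecision, variant_keys: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the proposal is well-formed."""
    problems: List[str] = []
    # A rationale message is required on every decide call.
    if not decision.message.strip():
        problems.append("a rationale message is required")
    if decision.success:
        allowed = set(variant_keys) | {NO_VARIANT_SHIPPED, DEFER_VARIANT_DECISION}
        if decision.variant is None:
            problems.append("a variant is required when success is true")
        elif decision.variant not in allowed:
            problems.append(f"unknown variant: {decision.variant!r}")
    # When success is false (a KILL), no variant key is needed.
    return problems
```

A clean result from this check only means the proposal is complete enough to show the user; it does not authorize the decide call, which still waits on explicit confirmation.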
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md
similarity index 87%
rename from plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md
rename to plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md
index 1e8678c..3f272ad 100644
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md
@@ -1,6 +1,6 @@
 # Per-Metric Interpretation
 
-Open this when the user wants you to translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_
+Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_
 
 **Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from the experiment-details response. Then translate.
 
@@ -19,28 +19,22 @@ A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any
 
 ---
 
-## Polarity recipe (repeat from the spine — critical)
+## Polarity recipe
 
-`metric.direction` is `"up"` or `"down"` (defaults to `"up"`).
+Apply the polarity recipe from the spine — see the **Components** section of `SKILL.md`. Treat the bucket name in `summary.positive` / `summary.negative` as sign-of-lift only; the business verdict comes from combining it with `metric.direction`. Examples worth remembering:
 
-- `lift is None` or `lift == 0` → **neutral** (treat as no measurement / no effect respectively).
-- `direction == "up"` → **positive** if `lift > 0`, else **negative**.
-- `direction == "down"` → **positive** if `lift < 0`, else **negative**.
-
-A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. A `-1% interstitials_shown` lift in `summary.negative` with `direction: "down"` is plausibly a **win** (less interruption).
+- A row in `summary.positive` with `direction: "down"` is a **regression**.
+- A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1% interstitials_shown` lift means less interruption).
 
 ---
 
-## Reading the p-value correctly
+## Reading the p-value in this platform
 
-The p-value is the probability of observing a difference at least as extreme as the one measured, **assuming the null hypothesis (no real difference) is true**. It is NOT:
+Mixpanel uses Welch's t-test (z-test for large samples) at α = 0.05 / 95% confidence by default. The confidence level is set on `settings.confidenceLevel`; if it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative).
 
-- ❌ The probability that the treatment works.
-- ❌ The probability the result will replicate.
-- ❌ A measure of effect size — a tiny lift can be highly significant on a huge sample.
-- ❌ Proof of "no effect" when above threshold (see "Inconclusive results").
+The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread.
 
-Mixpanel uses Welch's t-test (z-test for large samples). Default α = 0.05 at 95% confidence. The confidence level is set on `settings.confidenceLevel`. If it differs from 0.95, call it out in the verdict (`0.9` inflates false positives; `0.99` is conservative).
+For the general meaning of a p-value (the probability, assuming the null hypothesis is true, of observing a difference at least as extreme as the one measured), trust the model's baseline knowledge; don't invent thresholds in either direction.
 
 ---
 
@@ -50,7 +44,6 @@ Mixpanel uses Welch's t-test (z-test for large samples). Default α = 0.05 at 95
 lift = (treatment_mean - control_mean) / control_mean
 ```
 
-- `liftConfidence` is the **confidence level used** (e.g. 0.95). It is NOT the confidence-interval width.
 - **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct.
 - If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect."
 
@@ -125,7 +118,7 @@ Different metric types behave differently; cite the relevant nuance in your verd
 ## Variance-reduction & outlier settings that change interpretation
 
 - **CUPED** (`settings.cuped.enabled == true`): mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix).
-- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration (see `health-check-interpretation.md` §Misconfig).
+- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md).
 
 ---
 
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md
similarity index 94%
rename from plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md
rename to plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md
index fcf9cfd..e0c43d2 100644
--- a/plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md
@@ -1,6 +1,6 @@
 # Segment-Breakdown Interpretation
 
-Open this when the user has per-segment results in hand and wants to read them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
+Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
 
 > **Platform support status.** Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.
 
@@ -49,7 +49,7 @@ Each segment value needs its own meaningful per-variant sample for the per-segme
 | Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. |
 | Two opposite-direction effects in different segments that roughly cancel overall                    | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses.    |
 
-When you spot Simpson's paradox, route the user to [health-check-interpretation.md](health-check-interpretation.md) §SRM — it's usually the cause, not a real reversal.
+When you spot Simpson's paradox, route the user to the **SRM** section of [health-check-interpretation.md](health-check-interpretation.md) — bucketing is usually the cause, not a real reversal.
 
 ---
 
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-of-interest-selection.md
similarity index 95%
rename from plugins/mixpanel-mcp/skills/experiment-results/references/segment-of-interest-selection.md
rename to plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-of-interest-selection.md
index ea9f22b..b0c8f58 100644
--- a/plugins/mixpanel-mcp/skills/experiment-results/references/segment-of-interest-selection.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-of-interest-selection.md
@@ -1,6 +1,6 @@
 # Segment-of-Interest Selection
 
-Open this when the user wants to break results down by user segments — _"slice this by platform"_, _"which segments should I look at?"_, _"are new users responding differently?"_. The goal is to pick 3–5 segments that are **likely to reveal a real effect difference**, before slicing every available dimension and ending up p-hacking.
+Pick 3–5 segments **likely to reveal a real effect difference** instead of slicing every available dimension, which ends in p-hacking.
 
 The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them.
 
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/session-replay-analysis.md
similarity index 96%
rename from plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md
rename to plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/session-replay-analysis.md
index b758b8e..59ad25e 100644
--- a/plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/session-replay-analysis.md
@@ -1,6 +1,6 @@
 # Session-Replay Analysis Guidance
 
-Open this when the user wants to use session replays to explain a quantitative experiment result — _"why is conversion down in treatment?"_, _"what are users actually doing in the treatment?"_, _"can replays explain the regression?"_. The goal is to turn a number into a behavior story.
+Turn a quantitative experiment result into a behavior story using session replays.
 
 > **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss.
 
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md
similarity index 94%
rename from plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md
rename to plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md
index 142089c..a4e69d4 100644
--- a/plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md
@@ -1,8 +1,8 @@
 # Why Hasn't This Reached Statistical Significance Yet?
 
-Open this when the user asks why a primary metric is still inconclusive — _"why isn't this stat-sig yet?"_, _"should I wait or stop?"_, _"is this just underpowered?"_. The goal is to help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null**, _without_ recomputing the platform's verdicts.
+Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts.
 
-The actual stop / extend math (sample size, power, MDE) is owned by the **setup-side skill** — point the user there for the formulas. This skill explains _which_ lever to pull, not how to recompute one.
+The actual stop / extend math (sample size, power, MDE) is owned by the `experiment-setup` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one.
 
 ---
 
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md
deleted file mode 100644
index 7bc71c4..0000000
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md
+++ /dev/null
@@ -1,110 +0,0 @@
----
-name: experiment-results
-description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
-license: Apache-2.0
----
-
-# Experiment Results Interpretation
-
-You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. Use the decision tree below as the spine; open references only when a step needs depth.
-
-## Requirements
-
-- Access to Mixpanel (read experiment details and metrics; update experiment lifecycle for ship/kill decisions).
-- This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values.
-
-## When to use this skill
-
-Trigger when the user asks anything about reading an experiment's results or its health. Common phrasings:
-
-- "What do these results mean?" / "Should we ship this?"
-- "Is this experiment trustworthy?" / "Why is SRM failing?"
-- "Why hasn't this hit statistical significance yet?"
-- "Break this down by `<segment>`" / "What segments should I look at?"
-- "What does this Retro A/A failure mean?"
-- "Can you compare the session replays for control vs treatment?"
-
-Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the `experiment-setup` skill.
-
----
-
-## How to read experiment-details output
-
-Always request experiment details with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.**
-
-| Concept                      | Live (preferred)                  | Cached fallback                             |
-| ---------------------------- | --------------------------------- | ------------------------------------------- |
-| Per-variant exposure counts  | `live_exposures`                  | `exposures_cache` (strip `$`-prefixed keys) |
-| SRM check                    | `live_srm_analysis`               | `exposures_cache.$srm_analysis`             |
-| Per-metric per-variant stats | `live_metrics[metricId][variant]` | `results_cache.metrics[metricId][variant]`  |
-| Bucketed summary             | recompute from `live_metrics`     | `results_cache.summary`                     |
-
-If `live_results_errors` is non-null, use the cache, caveat that data is stale, and surface the error — the underlying failure may need fixing before any decision. If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect."
-
-The full field map is in [references/experiment-fields.md](references/experiment-fields.md).
-
----
-
-## The decision tree
-
-Run in order. **Stop at the first failure** — do not proceed if a step flags a problem.
-
-1. **Trustworthiness gate** — SRM ok? Exposures sufficient? Retro A/A clean? Minimum duration met (~3 days)? No misconfig? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md).
-2. **Statistical significance** — apply the polarity recipe (below) to each non-control variant × primary. If nothing significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md).
-3. **Guardrail check** — any guardrail significant in the wrong polarity? Regression → ITERATE not ship.
-4. **Practical significance** — convert lift into absolute terms (`baseline_value × lift`). Statistically significant ≠ ships.
-5. **Verdict** — see table below.
-
-### Polarity recipe (load-bearing — keep in mind for every metric)
-
-`summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by business value. `metric.direction` ("up" / "down", defaults to "up") tells you which sign is good:
-
-- `lift is None` or `lift == 0` → **neutral**
-- `direction == "up"` → **positive** if `lift > 0`, else **negative**
-- `direction == "down"` → **positive** if `lift < 0`, else **negative**
-
-A metric in `summary.positive` with `direction: "down"` is a **regression**, not a win. Filter out the control row first (use `settings.controlKey`). The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**.
-
-Per-metric phrasing (translating lift + CI + p-value into "small win" / "large regression" / "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). The same reference covers the changed-denominator check (Twyman's Law) for any lift >~30%, and how to query the baseline if `value` or `sampleSize` is `null`.
-
-### Verdict table
-
-| Situation                                                              | Recommendation                                                                                                                                               |
-| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Use the experiment's `decide` action with `success=true`, `variant=<winner>`, and a `message` rationale.                                           |
-| Trust ✓, primary polarity positive, guardrail polarity negative        | **ITERATE.** Investigate the regression; do not auto-ship.                                                                                                   |
-| Trust ✓, primary polarity neutral after target sample reached          | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md).                                  |
-| Trust ✓, target sample/duration not yet reached                        | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)).                                         |
-| Trust ✗                                                                | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). |
-
-For multi-variant tests, the `decide`-call shape, and special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), see [references/experiment-fields.md](references/experiment-fields.md) §Lifecycle hand-off. `message` is required on every `decide` call.
-
----
-
-## Going deeper
-
-Open the relevant reference on demand:
-
-| User asks about…                                                                | Open                                                                                             |
-| ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
-| SRM failing, Retro A/A failing, exposures insufficient, or any Step 1 fail      | [references/health-check-interpretation.md](references/health-check-interpretation.md)           |
-| "Translate this lift / CI / p-value into English"                               | [references/per-metric-interpretation.md](references/per-metric-interpretation.md)               |
-| "Why hasn't this hit statsig yet? Should we wait or stop?"                      | [references/why-no-statsig.md](references/why-no-statsig.md)                                     |
-| "Which segments should I break this down on?"                                   | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md)       |
-| "What does this segment-by-segment result mean?" (when platform support exists) | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) |
-| "Can session replays help explain this result?"                                 | [references/session-replay-analysis.md](references/session-replay-analysis.md)                   |
-| "Which field in the experiment-details response has X?"                         | [references/experiment-fields.md](references/experiment-fields.md)                               |
-
----
-
-## Output
-
-Default to this shape unless the user asks for something else:
-
-1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`.
-2. **Why**, walking through the decision tree steps that mattered (skip the steps that were clearly fine).
-3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, with the polarity-corrected reading of each. Include the absolute-impact translation for any win.
-4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, etc.
-5. **Suggested next action** — the experiment-decide action to take, or the deeper investigation to run.
-
-If experiment details are unavailable or return errors, say so — do not invent a verdict.
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/experiment-fields.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/experiment-fields.md
deleted file mode 100644
index 1e65de1..0000000
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/experiment-fields.md
+++ /dev/null
@@ -1,158 +0,0 @@
-# Experiment-Details Field Map
-
-Quick reference for which experiment-details response field drives each interpretation. Always request the details with `compute_exposures=true, compute_metrics=true`.
-
-This reference is **read-only domain knowledge** for the agent. It does NOT define thresholds — every "fail condition" listed below is a _characterization_ of how the platform itself already classifies the field, not a threshold this skill should re-apply.
-
----
-
-## Identity & lifecycle
-
-```
-id, name, description, hypothesis, status, start_date, end_date
-creator_email, tags, url, workspace_id
-feature_flag_id                       → for feature-flag-based experiments
-settings.controlKey                   → variant key treated as control (often "control"; may be "")
-```
-
-`status` is one of `"concluded" | "success" | "fail"` (the UI may additionally show `SUCCESS_DEFERRED` for the special variant constant — see "Decision metadata" below).
-
----
-
-## Trustworthiness
-
-```
-live_srm_analysis                     → SRM verdict (consume — don't recompute)
-  .p_value
-  .chi_square
-live_exposures[<variantKey>]          → per-variant exposure counts (live)
-exposures_cache[<variantKey>]         → per-variant exposure counts (cached fallback)
-exposures_cache.$srm_analysis         → cached SRM analysis
-exposures_cache.$last_computed        → when the cache was last refreshed
-settings.srm.enabled                  → whether the SRM check ran
-settings.srm.targetAllocations        → expected per-variant allocation (percent)
-settings.preExperimentBias            → whether Retro A/A was enabled
-settings.excludeQA                    → whether QA traffic was filtered
-live_results_errors                   → non-null = live computation failed; surface and fall back to cache
-```
-
----
-
-## Per-metric per-variant results
-
-```
-live_metrics[<metricId>][<variantKey>]
-  .value             → metric value for this variant
-  .sampleSize        → sample size for this variant on this metric
-  .lift              → (treatment - control) / control  (0 for control row)
-  .liftConfidence    → confidence LEVEL used (e.g. 0.95) — NOT the CI width
-  .significance      → "YES_POSITIVE" | "YES_NEGATIVE" | "NO"  (sign-of-lift, NOT polarity)
-
-results_cache.metrics[<metricId>][<variantKey>]  → cached fallback, same shape
-```
-
----
-
-## Bucketed summary
-
-```
-results_cache.summary.positive[]      → items with significance == "YES_POSITIVE" (lift > 0, sig)
-results_cache.summary.negative[]      → items with significance == "YES_NEGATIVE" (lift < 0, sig)
-results_cache.summary.no[]            → items with significance == "NO"
-
-Each item:
-  .metricId
-  .variant
-  .value
-  .lift
-  .liftConfidence
-  .sampleSize
-  .significance
-```
-
-**Pre-process the summary**: filter rows where `variant == settings.controlKey` (control-vs-control is mechanical noise), then apply the polarity recipe before drawing any conclusion.
-
----
-
-## Metric catalog (for polarity lookups)
-
-```
-metrics[]
-  .id, .name
-  .type ("primary" | "guardrail" | "secondary")
-  .direction ("up" | "down")          → always set; defaults to "up" if the source metric was unset
-```
-
-Build a lookup `metric_id → (type, direction)` and join to summary rows during interpretation.
-
----
-
-## Settings that change interpretation
-
-```
-settings.confidenceLevel              → significance threshold (e.g. 0.95)
-settings.testingModel                 → "frequentist" or "sequential"
-settings.endCondition                 → "sample_size" or "days"
-settings.sampleSize / .endAfterDays   → planned end target
-settings.multipleTestingCorrection    → "off" | "bonferroni" | "benjamini-hochberg"
-settings.cuped.enabled                → CUPED variance reduction applied
-settings.cuped.preExposureDatePreset  → pre-exposure window
-settings.winsorization.enabled        → outlier capping applied
-settings.winsorization.percentile     → cap percentile (default 95; lower values are extreme)
-```
-
----
-
-## Decision metadata (post-decide)
-
-```
-results_cache.message                 → decision rationale
-results_cache.variant                 → shipped variant key (or special constant)
-status                                → "concluded" | "success" | "fail"
-```
-
-Special variant constants for `success=true`:
-
-- `__no_variant_shipped__` — ship the change without picking a variant.
-- `__defer_variant_decision__` — defer (UI shows `SUCCESS_DEFERRED`).
-
-For a kill, pass `success=false`.
-
----
-
-## Lifecycle hand-off
-
-To ship/kill, update the experiment with the `decide` action and these fields:
-
-```
-action     → "decide"
-success    → true | false
-variant    → "<winner_key>"      # required when success=true
-message    → "<rationale: metrics evaluated, polarity, tradeoffs accepted>"
-```
-
-`message` is required on every `decide` call.
-
----
-
-## Misconfig field map (cross-link)
-
-For _how_ to react to each of these, see [health-check-interpretation.md](health-check-interpretation.md) §7.
-
-- `settings.multipleTestingCorrection in {"off", null}` with 2+ primaries × 1+ non-control variants
-- `settings.winsorization.enabled == true` with `percentile` very low (< ~80) or very high (> ~99)
-- `settings.srm == null` OR `settings.srm.enabled == false` (often intentional — only flag if results look suspicious)
-- `settings.cuped.enabled == true` AND the experiment cohort is "new users only"
-- `settings.confidenceLevel != 0.95`
-- `metrics[]` entries with `name == ""`
-- A primary metric in `metrics[]` but missing from `live_metrics` AND `results_cache.metrics`
-
----
-
-## When to reach for sibling capabilities
-
-- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the `experiment-setup` skill.
-- **Raw data for triggered or segmentation analysis** → run a query on the metric with appropriate filters.
-- **Acting on the recommendation** (ship, kill, extend) → update the experiment with the appropriate action.
-- **Feature-flag rollout history** for SRM root cause → inspect the linked flag's state.
-- **Session replays** for behavioral explanation of a quantitative result → see [session-replay-analysis.md](session-replay-analysis.md).
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md
new file mode 100644
index 0000000..c2d7591
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md
@@ -0,0 +1,127 @@
+---
+name: interpret-experiment
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
+license: Apache-2.0
+---
+
+# Interpret Experiment
+
+You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a verdict field is missing, say so — do not synthesize one from raw values.
+
+---
+
+# Glossary
+
+Concepts the rest of this skill uses without redefining.
+
+- **Variant.** One arm of the experiment. The variant treated as the baseline is the **control**; the others are **treatments**. The platform marks which key is the control.
+- **Primary / Guardrail / Secondary metric.**
+  - **Primary** — drives the ship decision. The platform applies multiple-testing correction across primaries when configured.
+  - **Guardrail** — a metric that must not regress; a guardrail loss vetoes a ship even when primaries win.
+  - **Secondary** — exploratory only, never decisional, no correction applied.
+- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict.
+- **Polarity.** Whether a movement is _good for the business_. Combines sign of lift with the metric's `direction` ("up" = bigger is better; "down" = smaller is better). See the **Polarity recipe** in Components.
+- **Significance.** The platform's per-row classification: `YES_POSITIVE`, `YES_NEGATIVE`, or `NO`. Read from the response — do not recompute.
+- **SRM (Sample Ratio Mismatch).** Variants received traffic in proportions that disagree with the configured split. **Kohavi's #1 trustworthiness check** — when SRM fails, downstream lift, p-values, and CIs cannot be trusted.
+- **Retro A/A (pre-experiment bias).** Re-runs the comparison on the pre-exposure period. A failure means cohorts already differed before treatment started.
+- **Twyman's Law.** "Any figure that looks interesting or different is usually wrong": an unusually clean or unusually large result is more likely a bug than a discovery. Apply on lifts > ~30%, which are usually a changed-denominator artifact.
+- **CUPED.** Variance reduction using a pre-exposure baseline. Cuts the required sample size by 30–70% when it applies. Inert on new-user-only cohorts.
+- **Winsorization.** Outlier capping at a configured percentile, with the cap computed on data pooled across variants. Default 95.
+- **MDE (Minimum Detectable Effect).** The smallest lift the experiment was sized to detect. Set during experiment setup.
+- **Trustworthiness gate.** The pre-flight check that opens the Decision Tree (Step 2a): SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet**; route to the health-check reference.
+
+---
+
+# Components
+
+The pieces every interpretation uses. Defined here once so they don't drift across the steps and references.
+
+## Polarity recipe (load-bearing — apply on every metric row)
+
+The platform's summary buckets (`positive` / `negative` / `no`) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion.
+
+Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"):
+
+- `lift is None` or `lift == 0` → **neutral** (no measurement / no effect respectively).
+- `direction == "up"` → **positive** if `lift > 0`, else **negative**.
+- `direction == "down"` → **positive** if `lift < 0`, else **negative**.
+
+A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first (use `settings.controlKey`).
+
+The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**.
+
+## Data-source fallback
+
+Experiment-details has two parallel data paths — live (preferred) and cached. Always prefer live; if live computation failed, fall back to cache with a staleness caveat; if **both** are empty, say "no result was computed" and recommend a re-sync. **Never** silently treat missing data as "no effect."
+
+## Verdict table
+
+| Situation                                                              | Recommendation                                                                                                                                                                       |
+| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Conclude the experiment via its decide lifecycle action, naming the winning variant and a rationale message. **Confirm with the user first — concluding is irreversible.** |
+| Trust ✓, primary polarity positive, guardrail polarity negative        | **ITERATE.** Investigate the regression; do not auto-ship.                                                                                                                           |
+| Trust ✓, primary polarity neutral after target sample reached          | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md).                                                          |
+| Trust ✓, target sample/duration not yet reached                        | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)).                                                                 |
+| Trust ✗                                                                | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md).                         |
+
+For multi-variant tests, special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), and the exact decide-call shape, see [references/lifecycle-handoff.md](references/lifecycle-handoff.md).
+
+---
+
+# Steps
+
+Top-down: what to do, in order.
+
+## 1. Fetch the experiment
+
+Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments.
+
+Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret.
+
+## 2. Run the Decision Tree
+
+Run steps 2a–2e in order. **Stop at the first failure** — do not proceed if a step flags a problem. The platform attaches verdict fields for each check; consume those verdicts rather than recomputing.
+
+### 2a. Trustworthiness
+
+SRM ok? Retro A/A clean? Exposures sufficient? Minimum duration met (~3 days)? No misconfiguration? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md). The Misconfigurations section in that reference covers the warning-level signals (multiple-testing off, extreme winsorization, CUPED on new-users-only, etc.).
+
+### 2b. Statistical significance
+
+Apply the **polarity recipe** from Components to each non-control variant × primary metric. If nothing is significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md). For translating a single metric's lift / CI / p-value into a phrase, see [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
+
+### 2c. Guardrail check
+
+Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a row in `summary.positive` for a `direction: "down"` guardrail is still a regression.
+
+### 2d. Practical significance
+
+Convert lift into absolute terms — multiply by the control baseline. Statistically significant ≠ ships. The per-metric reference covers the baseline-fetch fallback when `value` or `sampleSize` is missing, and the **Twyman's Law** check for any lift > ~30%.
+
+### 2e. Verdict
+
+Look up the situation in the **Verdict table** in Components. If the recommendation is SHIP or KILL, surface the proposed decide-action parameters and **wait for explicit user confirmation** before executing — concluding an experiment is irreversible.
+
+## 3. Going deeper (open references on demand)
+
+| User asks about…                                                                    | Open                                                                                             |
+| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
+| SRM failing, Retro A/A failing, exposures insufficient, or any trustworthiness fail | [references/health-check-interpretation.md](references/health-check-interpretation.md)           |
+| "Translate this lift / CI / p-value into English"                                   | [references/per-metric-interpretation.md](references/per-metric-interpretation.md)               |
+| "Why hasn't this hit statsig yet? Should we wait or stop?"                          | [references/why-no-statsig.md](references/why-no-statsig.md)                                     |
+| "Which segments should I break this down on?"                                       | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md)       |
+| "What does this segment-by-segment result mean?"                                    | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) |
+| "Can session replays help explain this result?"                                     | [references/session-replay-analysis.md](references/session-replay-analysis.md)                   |
+| "How do I actually conclude this experiment? Multi-variant ship?"                   | [references/lifecycle-handoff.md](references/lifecycle-handoff.md)                               |
+
+## 4. Output
+
+Default to this shape unless the user asks for something else:
+
+1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`.
+2. **Why**, walking through the Decision Tree steps that mattered (skip steps that were clearly fine).
+3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, each polarity-corrected. Include absolute-impact translation for any win.
+4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, stale-cache caveat, etc.
+5. **Suggested next action** — for SHIP / KILL, the proposed decide-action parameters **gated on user confirmation**; for ITERATE / WAIT, the investigation to run next.
+
+If experiment details are unavailable or return errors, say so — do not invent a verdict.
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md
similarity index 73%
rename from plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md
rename to plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md
index 9ec66df..e9082fa 100644
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md
@@ -1,8 +1,8 @@
 # Health-Check Interpretation
 
-Open this when Step 1 of the Decision Tree flags a failure (SRM, Retro A/A, insufficient exposures, peeking, broken-data, < 3-day window, or any misconfiguration). The goal is to turn the platform's already-computed verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action.
+Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action.
 
-**This skill never recomputes thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers.
+**Never recompute thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers.
 
 ---
 
@@ -134,17 +134,65 @@ If `endCondition: "sample_size"` with a tiny target (e.g. 10k) was reached in ho
 
 ---
 
-## 7. Misconfigurations to flag during Step 1
+## 7. Misconfigurations
 
-These don't always invalidate results, but they change how to _read_ them. Surface them as warnings.
+These don't always invalidate results, but they change how to _read_ them. Surface them as warnings during the trustworthiness gate.
 
-- `settings.multipleTestingCorrection in {"off", null}` AND there are 2+ primary metrics across 1+ non-control variants → without correction, any single significant primary may be a false positive. **Don't assume the result is broken** — look at all primary results in aggregate. If most or all primaries point the same direction (all positive or all negative), there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk**, and the user can enable correction (Benjamini-Hochberg or Bonferroni) and re-analyze. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate).
-- `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` very low (e.g. < ~80) or unusually high (e.g. > ~99) → extreme outlier capping. The platform's default is 95; a percentile near 50 caps almost all data and likely indicates misconfiguration.
-- `settings.srm == null` OR `settings.srm.enabled == false` → the SRM check didn't run. **SRM is often deliberately disabled** (e.g. when feature-flag rollouts intentionally split traffic unevenly), so do not try to compute it yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) — then suggest the user re-enable SRM and re-analyze.
-- `settings.cuped.enabled == true` AND the experiment cohort is "new users only" → CUPED requires pre-exposure data, which new-user experiments lack, so CUPED simply has no effect. **This does NOT invalidate results** — variance reduction just didn't happen. Mention it as informational.
-- `settings.confidenceLevel != 0.95` → call out explicitly. `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Combine with metric count for a sense of family-wise error rate.
-- `metrics[]` contains entries with `name == ""` → likely a broken or placeholder metric reference. Flag and skip during analysis.
-- A primary metric appears in `metrics[]` but is **missing from `live_metrics` AND `results_cache.metrics`** → no result was computed for that primary. Surface prominently — this is "no measurement," not "no effect." Recommend the user re-sync results.
+### Multiple-testing correction off with several primaries
+
+**Condition**: `settings.multipleTestingCorrection` is `"off"` or `null` AND there are 2+ primary metrics across 1+ non-control variants.
+
+**Interpretation**: any single significant primary may be a false positive. The family-wise error rate compounds with every added comparison (e.g. 15 primaries × 1 variant at α=0.05 → ~54% chance of at least one false positive, assuming independent tests).
+
+**Action**: look at all primary results in aggregate. If most point the same direction, there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk** — the user can enable Benjamini-Hochberg or Bonferroni and re-analyze.
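+
+As a rough check on the ~54% figure above, the family-wise error rate for $m$ uncorrected comparisons (primaries × non-control variants) can be worked out directly. This is a sketch that assumes the comparisons are independent; positively correlated metrics would typically give a somewhat lower rate. The symbols $m$ and $\alpha$ are introduced here for illustration and are not platform field names.
+
+```latex
+\mathrm{FWER} = 1 - (1 - \alpha)^{m}
+\qquad\Rightarrow\qquad
+1 - (1 - 0.05)^{15} = 1 - 0.95^{15} \approx 1 - 0.463 = 0.537
+```
+
+Under the same assumption, 3 primaries at α=0.05 give roughly 14%, which is why one or two significant results among many primaries carry little weight on their own.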
+
+### Extreme winsorization percentile
+
+**Condition**: `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` is very low (e.g. < ~80) or unusually high (e.g. > ~99).
+
+**Interpretation**: outlier capping is far from the platform's default of 95. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration.
+
+**Action**: ask the user to confirm the percentile was intentional; recommend resetting to 95 unless they have a specific reason.
+
+### SRM check disabled
+
+**Condition**: `settings.srm == null` OR `settings.srm.enabled == false`.
+
+**Interpretation**: the SRM check didn't run. **Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug.
+
+**Action**: only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios). When you do flag, recommend re-enabling SRM and re-analyzing.
+
+### CUPED on new-users-only cohort
+
+**Condition**: `settings.cuped.enabled == true` AND the experiment cohort is "new users only".
+
+**Interpretation**: CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen.
+
+**Action**: mention as informational; no remediation needed for this experiment. For future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply.
+
+### Non-default confidence level
+
+**Condition**: `settings.confidenceLevel != 0.95`.
+
+**Interpretation**: `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative.
+
+**Action**: call out explicitly in the verdict. Combine with metric count to estimate the family-wise error rate.
+
+### Broken or placeholder metric entries
+
+**Condition**: `metrics[]` contains entries with `name == ""`.
+
+**Interpretation**: likely a broken or placeholder metric reference.
+
+**Action**: flag and skip during analysis.
+
+### Primary metric with no computed result
+
+**Condition**: a primary metric appears in `metrics[]` but is **missing from both** `live_metrics` and `results_cache.metrics`.
+
+**Interpretation**: no result was computed for that primary. **This is "no measurement," not "no effect."**
+
+**Action**: surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary.
 
 ---
 
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md
new file mode 100644
index 0000000..4d8189d
--- /dev/null
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md
@@ -0,0 +1,39 @@
+# Lifecycle Hand-off
+
+How to conclude an experiment once the verdict is settled. This reference is **interpretation guidance** — the per-field schema of the decide action lives in the experiment-update tool description.
+
+---
+
+## Confirm before concluding — always
+
+Concluding an experiment is **irreversible**. Before invoking the decide action, surface the proposed parameters to the user (winning variant, success/fail, rationale message) and wait for explicit confirmation. A SHIP verdict is a recommendation, not an authorization.
+
+## The three pieces every decide call needs
+
+A decide call expresses three things:
+
+1. **Did the experiment succeed?** A win for one of the treatments, or a deliberate stop.
+2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special constants below.
+3. **Why?** A rationale message — what metrics were evaluated, the polarity reading, the tradeoffs accepted. The platform requires this on every decide call; treat it as a one-paragraph decision record, not a placeholder.
+
+## Special variant choices for success
+
+When you have a winning result but no single variant to ship:
+
+- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. (The platform exposes this as the constant `__no_variant_shipped__`.)
+- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The platform exposes this as `__defer_variant_decision__` and shows `SUCCESS_DEFERRED` in the UI.)
+
+When the verdict is KILL — no winner — record success as false. No variant key is needed in that case.
+
+## Multi-variant experiments
+
+For a 3+ arm test, the decide action still names a single winning variant. If two treatments are roughly tied:
+
+- If both clear the practical-significance bar and shipping either is acceptable, pick on simplicity (smaller diff from control, lower implementation cost).
+- If the team genuinely cannot pick, use the defer constant above — better than fabricating a winner.
+
+A multi-variant test where only one treatment is significantly different from control is a clean SHIP for that variant; the inconclusive arms are simply not the winner.
+
+## After concluding
+
+The decision record (`results_cache.message`, `results_cache.variant`, and `status` transitioning to `concluded` / `success` / `fail`) becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped or stopped, that record is the answer.
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md
similarity index 87%
rename from plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md
rename to plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md
index 1e8678c..3f272ad 100644
--- a/plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md
@@ -1,6 +1,6 @@
 # Per-Metric Interpretation
 
-Open this when the user wants you to translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_
+Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_
 
 **Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from the experiment-details response. Then translate.
 
@@ -19,28 +19,22 @@ A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any
 
 ---
 
-## Polarity recipe (repeat from the spine — critical)
+## Polarity recipe
 
-`metric.direction` is `"up"` or `"down"` (defaults to `"up"`).
+Apply the polarity recipe from the spine — see the **Components** section of `SKILL.md`. Treat the bucket name in `summary.positive` / `summary.negative` as sign-of-lift only; the business verdict comes from combining it with `metric.direction`. Examples worth remembering:
 
-- `lift is None` or `lift == 0` → **neutral** (treat as no measurement / no effect respectively).
-- `direction == "up"` → **positive** if `lift > 0`, else **negative**.
-- `direction == "down"` → **positive** if `lift < 0`, else **negative**.
-
-A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. A `-1% interstitials_shown` lift in `summary.negative` with `direction: "down"` is plausibly a **win** (less interruption).
+- A row in `summary.positive` with `direction: "down"` is a **regression**.
+- A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1% interstitials_shown` lift means less interruption).
 
 ---
 
-## Reading the p-value correctly
+## Reading the p-value in this platform
 
-The p-value is the probability of observing a difference at least as extreme as the one measured, **assuming the null hypothesis (no real difference) is true**. It is NOT:
+Mixpanel uses Welch's t-test (z-test for large samples) at α = 0.05 / 95% confidence by default. The confidence level is set on `settings.confidenceLevel`; if it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative).
 
-- ❌ The probability that the treatment works.
-- ❌ The probability the result will replicate.
-- ❌ A measure of effect size — a tiny lift can be highly significant on a huge sample.
-- ❌ Proof of "no effect" when above threshold (see "Inconclusive results").
+The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread.
 
-Mixpanel uses Welch's t-test (z-test for large samples). Default α = 0.05 at 95% confidence. The confidence level is set on `settings.confidenceLevel`. If it differs from 0.95, call it out in the verdict (`0.9` inflates false positives; `0.99` is conservative).
+For the general meaning of a p-value (the probability of observing a difference at least as extreme as the one measured, assuming the null hypothesis is true), trust the model's baseline knowledge; don't invent thresholds in either direction.
 
 ---
 
@@ -50,7 +44,6 @@ Mixpanel uses Welch's t-test (z-test for large samples). Default α = 0.05 at 95
 lift = (treatment_mean - control_mean) / control_mean
 ```
 
-- `liftConfidence` is the **confidence level used** (e.g. 0.95). It is NOT the confidence-interval width.
 - **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct.
 - If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect."
 
@@ -125,7 +118,7 @@ Different metric types behave differently; cite the relevant nuance in your verd
 ## Variance-reduction & outlier settings that change interpretation
 
 - **CUPED** (`settings.cuped.enabled == true`): mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix).
-- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration (see `health-check-interpretation.md` §Misconfig).
+- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md).
 
 ---
 
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md
similarity index 94%
rename from plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md
rename to plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md
index fcf9cfd..e0c43d2 100644
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md
@@ -1,6 +1,6 @@
 # Segment-Breakdown Interpretation
 
-Open this when the user has per-segment results in hand and wants to read them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
+Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
 
 > **Platform support status.** Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.
 
@@ -49,7 +49,7 @@ Each segment value needs its own meaningful per-variant sample for the per-segme
 | Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. |
 | Two opposite-direction effects in different segments that roughly cancel overall                    | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses.    |
 
-When you spot Simpson's paradox, route the user to [health-check-interpretation.md](health-check-interpretation.md) §SRM — it's usually the cause, not a real reversal.
+When you spot Simpson's paradox, route the user to the **SRM** section of [health-check-interpretation.md](health-check-interpretation.md) — bucketing is usually the cause, not a real reversal.
 
 ---
 
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-of-interest-selection.md
similarity index 95%
rename from plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-of-interest-selection.md
rename to plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-of-interest-selection.md
index ea9f22b..b0c8f58 100644
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-of-interest-selection.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-of-interest-selection.md
@@ -1,6 +1,6 @@
 # Segment-of-Interest Selection
 
-Open this when the user wants to break results down by user segments — _"slice this by platform"_, _"which segments should I look at?"_, _"are new users responding differently?"_. The goal is to pick 3–5 segments that are **likely to reveal a real effect difference**, before slicing every available dimension and ending up p-hacking.
+Pick 3–5 segments **likely to reveal a real effect difference**, rather than slicing every available dimension and ending up p-hacking.
 
 The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them.
 
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/session-replay-analysis.md
similarity index 96%
rename from plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md
rename to plugins/mixpanel-mcp-in/skills/interpret-experiment/references/session-replay-analysis.md
index b758b8e..59ad25e 100644
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/session-replay-analysis.md
@@ -1,6 +1,6 @@
 # Session-Replay Analysis Guidance
 
-Open this when the user wants to use session replays to explain a quantitative experiment result — _"why is conversion down in treatment?"_, _"what are users actually doing in the treatment?"_, _"can replays explain the regression?"_. The goal is to turn a number into a behavior story.
+Turn a quantitative experiment result into a behavior story using session replays.
 
 > **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss.
 
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
similarity index 94%
rename from plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md
rename to plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
index 142089c..a4e69d4 100644
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
@@ -1,8 +1,8 @@
 # Why Hasn't This Reached Statistical Significance Yet?
 
-Open this when the user asks why a primary metric is still inconclusive — _"why isn't this stat-sig yet?"_, _"should I wait or stop?"_, _"is this just underpowered?"_. The goal is to help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null**, _without_ recomputing the platform's verdicts.
+Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts.
 
-The actual stop / extend math (sample size, power, MDE) is owned by the **setup-side skill** — point the user there for the formulas. This skill explains _which_ lever to pull, not how to recompute one.
+The actual stop / extend math (sample size, power, MDE) is owned by the `experiment-setup` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one.
 
 ---
 
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md
deleted file mode 100644
index 7bc71c4..0000000
--- a/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md
+++ /dev/null
@@ -1,110 +0,0 @@
----
-name: experiment-results
-description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
-license: Apache-2.0
----
-
-# Experiment Results Interpretation
-
-You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. Use the decision tree below as the spine; open references only when a step needs depth.
-
-## Requirements
-
-- Access to Mixpanel (read experiment details and metrics; update experiment lifecycle for ship/kill decisions).
-- This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values.
-
-## When to use this skill
-
-Trigger when the user asks anything about reading an experiment's results or its health. Common phrasings:
-
-- "What do these results mean?" / "Should we ship this?"
-- "Is this experiment trustworthy?" / "Why is SRM failing?"
-- "Why hasn't this hit statistical significance yet?"
-- "Break this down by `<segment>`" / "What segments should I look at?"
-- "What does this Retro A/A failure mean?"
-- "Can you compare the session replays for control vs treatment?"
-
-Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the `experiment-setup` skill.
-
----
-
-## How to read experiment-details output
-
-Always request experiment details with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.**
-
-| Concept                      | Live (preferred)                  | Cached fallback                             |
-| ---------------------------- | --------------------------------- | ------------------------------------------- |
-| Per-variant exposure counts  | `live_exposures`                  | `exposures_cache` (strip `$`-prefixed keys) |
-| SRM check                    | `live_srm_analysis`               | `exposures_cache.$srm_analysis`             |
-| Per-metric per-variant stats | `live_metrics[metricId][variant]` | `results_cache.metrics[metricId][variant]`  |
-| Bucketed summary             | recompute from `live_metrics`     | `results_cache.summary`                     |
-
-If `live_results_errors` is non-null, use the cache, caveat that data is stale, and surface the error — the underlying failure may need fixing before any decision. If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect."
-
-The full field map is in [references/experiment-fields.md](references/experiment-fields.md).
-
----
-
-## The decision tree
-
-Run in order. **Stop at the first failure** — do not proceed if a step flags a problem.
-
-1. **Trustworthiness gate** — SRM ok? Exposures sufficient? Retro A/A clean? Minimum duration met (~3 days)? No misconfig? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md).
-2. **Statistical significance** — apply the polarity recipe (below) to each non-control variant × primary. If nothing significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md).
-3. **Guardrail check** — any guardrail significant in the wrong polarity? Regression → ITERATE not ship.
-4. **Practical significance** — convert lift into absolute terms (`baseline_value × lift`). Statistically significant ≠ ships.
-5. **Verdict** — see table below.
-
-### Polarity recipe (load-bearing — keep in mind for every metric)
-
-`summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by business value. `metric.direction` ("up" / "down", defaults to "up") tells you which sign is good:
-
-- `lift is None` or `lift == 0` → **neutral**
-- `direction == "up"` → **positive** if `lift > 0`, else **negative**
-- `direction == "down"` → **positive** if `lift < 0`, else **negative**
-
-A metric in `summary.positive` with `direction: "down"` is a **regression**, not a win. Filter out the control row first (use `settings.controlKey`). The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**.
-
-Per-metric phrasing (translating lift + CI + p-value into "small win" / "large regression" / "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). The same reference covers the changed-denominator check (Twyman's Law) for any lift >~30%, and how to query the baseline if `value` or `sampleSize` is `null`.
-
-### Verdict table
-
-| Situation                                                              | Recommendation                                                                                                                                               |
-| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Use the experiment's `decide` action with `success=true`, `variant=<winner>`, and a `message` rationale.                                           |
-| Trust ✓, primary polarity positive, guardrail polarity negative        | **ITERATE.** Investigate the regression; do not auto-ship.                                                                                                   |
-| Trust ✓, primary polarity neutral after target sample reached          | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md).                                  |
-| Trust ✓, target sample/duration not yet reached                        | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)).                                         |
-| Trust ✗                                                                | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). |
-
-For multi-variant tests, the `decide`-call shape, and special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), see [references/experiment-fields.md](references/experiment-fields.md) §Lifecycle hand-off. `message` is required on every `decide` call.
-
----
-
-## Going deeper
-
-Open the relevant reference on demand:
-
-| User asks about…                                                                | Open                                                                                             |
-| ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
-| SRM failing, Retro A/A failing, exposures insufficient, or any Step 1 fail      | [references/health-check-interpretation.md](references/health-check-interpretation.md)           |
-| "Translate this lift / CI / p-value into English"                               | [references/per-metric-interpretation.md](references/per-metric-interpretation.md)               |
-| "Why hasn't this hit statsig yet? Should we wait or stop?"                      | [references/why-no-statsig.md](references/why-no-statsig.md)                                     |
-| "Which segments should I break this down on?"                                   | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md)       |
-| "What does this segment-by-segment result mean?" (when platform support exists) | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) |
-| "Can session replays help explain this result?"                                 | [references/session-replay-analysis.md](references/session-replay-analysis.md)                   |
-| "Which field in the experiment-details response has X?"                         | [references/experiment-fields.md](references/experiment-fields.md)                               |
-
----
-
-## Output
-
-Default to this shape unless the user asks for something else:
-
-1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`.
-2. **Why**, walking through the decision tree steps that mattered (skip the steps that were clearly fine).
-3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, with the polarity-corrected reading of each. Include the absolute-impact translation for any win.
-4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, etc.
-5. **Suggested next action** — the experiment-decide action to take, or the deeper investigation to run.
-
-If experiment details are unavailable or return errors, say so — do not invent a verdict.
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/experiment-fields.md b/plugins/mixpanel-mcp/skills/experiment-results/references/experiment-fields.md
deleted file mode 100644
index 1e65de1..0000000
--- a/plugins/mixpanel-mcp/skills/experiment-results/references/experiment-fields.md
+++ /dev/null
@@ -1,158 +0,0 @@
-# Experiment-Details Field Map
-
-Quick reference for which experiment-details response field drives each interpretation. Always request the details with `compute_exposures=true, compute_metrics=true`.
-
-This reference is **read-only domain knowledge** for the agent. It does NOT define thresholds — every "fail condition" listed below is a _characterization_ of how the platform itself already classifies the field, not a threshold this skill should re-apply.
-
----
-
-## Identity & lifecycle
-
-```
-id, name, description, hypothesis, status, start_date, end_date
-creator_email, tags, url, workspace_id
-feature_flag_id                       → for feature-flag-based experiments
-settings.controlKey                   → variant key treated as control (often "control"; may be "")
-```
-
-`status` is one of `"concluded" | "success" | "fail"` (the UI may additionally show `SUCCESS_DEFERRED` for the special variant constant — see "Decision metadata" below).
-
----
-
-## Trustworthiness
-
-```
-live_srm_analysis                     → SRM verdict (consume — don't recompute)
-  .p_value
-  .chi_square
-live_exposures[<variantKey>]          → per-variant exposure counts (live)
-exposures_cache[<variantKey>]         → per-variant exposure counts (cached fallback)
-exposures_cache.$srm_analysis         → cached SRM analysis
-exposures_cache.$last_computed        → when the cache was last refreshed
-settings.srm.enabled                  → whether the SRM check ran
-settings.srm.targetAllocations        → expected per-variant allocation (percent)
-settings.preExperimentBias            → whether Retro A/A was enabled
-settings.excludeQA                    → whether QA traffic was filtered
-live_results_errors                   → non-null = live computation failed; surface and fall back to cache
-```
-
----
-
-## Per-metric per-variant results
-
-```
-live_metrics[<metricId>][<variantKey>]
-  .value             → metric value for this variant
-  .sampleSize        → sample size for this variant on this metric
-  .lift              → (treatment - control) / control  (0 for control row)
-  .liftConfidence    → confidence LEVEL used (e.g. 0.95) — NOT the CI width
-  .significance      → "YES_POSITIVE" | "YES_NEGATIVE" | "NO"  (sign-of-lift, NOT polarity)
-
-results_cache.metrics[<metricId>][<variantKey>]  → cached fallback, same shape
-```
-
----
-
-## Bucketed summary
-
-```
-results_cache.summary.positive[]      → items with significance == "YES_POSITIVE" (lift > 0, sig)
-results_cache.summary.negative[]      → items with significance == "YES_NEGATIVE" (lift < 0, sig)
-results_cache.summary.no[]            → items with significance == "NO"
-
-Each item:
-  .metricId
-  .variant
-  .value
-  .lift
-  .liftConfidence
-  .sampleSize
-  .significance
-```
-
-**Pre-process the summary**: filter rows where `variant == settings.controlKey` (control-vs-control is mechanical noise), then apply the polarity recipe before drawing any conclusion.
-
----
-
-## Metric catalog (for polarity lookups)
-
-```
-metrics[]
-  .id, .name
-  .type ("primary" | "guardrail" | "secondary")
-  .direction ("up" | "down")          → always set; defaults to "up" if the source metric was unset
-```
-
-Build a lookup `metric_id → (type, direction)` and join to summary rows during interpretation.
-
----
-
-## Settings that change interpretation
-
-```
-settings.confidenceLevel              → significance threshold (e.g. 0.95)
-settings.testingModel                 → "frequentist" or "sequential"
-settings.endCondition                 → "sample_size" or "days"
-settings.sampleSize / .endAfterDays   → planned end target
-settings.multipleTestingCorrection    → "off" | "bonferroni" | "benjamini-hochberg"
-settings.cuped.enabled                → CUPED variance reduction applied
-settings.cuped.preExposureDatePreset  → pre-exposure window
-settings.winsorization.enabled        → outlier capping applied
-settings.winsorization.percentile     → cap percentile (default 95; lower values are extreme)
-```
-
----
-
-## Decision metadata (post-decide)
-
-```
-results_cache.message                 → decision rationale
-results_cache.variant                 → shipped variant key (or special constant)
-status                                → "concluded" | "success" | "fail"
-```
-
-Special variant constants for `success=true`:
-
-- `__no_variant_shipped__` — ship the change without picking a variant.
-- `__defer_variant_decision__` — defer (UI shows `SUCCESS_DEFERRED`).
-
-For a kill, pass `success=false`.
-
----
-
-## Lifecycle hand-off
-
-To ship/kill, update the experiment with the `decide` action and these fields:
-
-```
-action     → "decide"
-success    → true | false
-variant    → "<winner_key>"      # required when success=true
-message    → "<rationale: metrics evaluated, polarity, tradeoffs accepted>"
-```
-
-`message` is required on every `decide` call.
-
----
-
-## Misconfig field map (cross-link)
-
-For _how_ to react to each of these, see [health-check-interpretation.md](health-check-interpretation.md) §7.
-
-- `settings.multipleTestingCorrection in {"off", null}` with 2+ primaries × 1+ non-control variants
-- `settings.winsorization.enabled == true` with `percentile` very low (< ~80) or very high (> ~99)
-- `settings.srm == null` OR `settings.srm.enabled == false` (often intentional — only flag if results look suspicious)
-- `settings.cuped.enabled == true` AND the experiment cohort is "new users only"
-- `settings.confidenceLevel != 0.95`
-- `metrics[]` entries with `name == ""`
-- A primary metric in `metrics[]` but missing from `live_metrics` AND `results_cache.metrics`
-
----
-
-## When to reach for sibling capabilities
-
-- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the `experiment-setup` skill.
-- **Raw data for triggered or segmentation analysis** → run a query on the metric with appropriate filters.
-- **Acting on the recommendation** (ship, kill, extend) → update the experiment with the appropriate action.
-- **Feature-flag rollout history** for SRM root cause → inspect the linked flag's state.
-- **Session replays** for behavioral explanation of a quantitative result → see [session-replay-analysis.md](session-replay-analysis.md).
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
new file mode 100644
index 0000000..c2d7591
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
@@ -0,0 +1,127 @@
+---
+name: interpret-experiment
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
+license: Apache-2.0
+---
+
+# Interpret Experiment
+
+You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a verdict field is missing, say so — do not synthesize one from raw values.
+
+---
+
+# Glossary
+
+Concepts the rest of this skill uses without redefining.
+
+- **Variant.** One arm of the experiment. The variant treated as the baseline is the **control**; the others are **treatments**. The platform marks which key is the control.
+- **Primary / Guardrail / Secondary metric.**
+  - **Primary** — drives the ship decision. The platform applies multiple-testing correction across primaries when configured.
+  - **Guardrail** — a metric that must not regress; a guardrail loss vetoes a ship even when primaries win.
+  - **Secondary** — exploratory only, never decisional, no correction applied.
+- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict.
+- **Polarity.** Whether a movement is _good for the business_. Combines sign of lift with the metric's `direction` ("up" = bigger is better; "down" = smaller is better). See the **Polarity recipe** in Components.
+- **Significance.** The platform's per-row classification: `YES_POSITIVE`, `YES_NEGATIVE`, or `NO`. Read from the response — do not recompute.
+- **SRM (Sample Ratio Mismatch).** Variants received traffic in proportions that disagree with the configured split. **Kohavi's #1 trustworthiness check** — when SRM fails, downstream lift, p-values, and CIs cannot be trusted.
+- **Retro A/A (pre-experiment bias).** Re-runs the comparison on the pre-exposure period. A failure means cohorts already differed before treatment started.
+- **Twyman's Law.** "Any figure that looks interesting or different is usually wrong": an unusually clean or unusually large result is more likely a bug than a discovery. Apply on lifts > ~30%, which are usually a changed-denominator artifact.
+- **CUPED.** Variance reduction using a pre-exposure baseline. Cuts the required sample size by 30–70% when it applies. Inert on new-user-only cohorts.
+- **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95.
+- **MDE (Minimum Detectable Effect).** The smallest lift the experiment was sized to detect. Set during experiment setup.
+- **Trustworthiness gate.** The pre-flight check in step 2a of the Decision Tree: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet**; route to the health-check reference.
+
+---
+
+# Components
+
+The pieces every interpretation uses. Defined here once so they don't drift across the steps and references.
+
+## Polarity recipe (load-bearing — apply on every metric row)
+
+The platform's summary buckets (`positive` / `negative` / `no`) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion.
+
+Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"):
+
+- `lift is None` or `lift == 0` → **neutral** (no measurement / no effect respectively).
+- `direction == "up"` → **positive** if `lift > 0`, else **negative**.
+- `direction == "down"` → **positive** if `lift < 0`, else **negative**.
+
+A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first (use `settings.controlKey`).
+
+The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**.
+
+## Data-source fallback
+
+Experiment-details has two parallel data paths — live (preferred) and cached. Always prefer live; if live computation failed, fall back to cache with a staleness caveat; if **both** are empty, say "no result was computed" and recommend a re-sync. **Never** silently treat missing data as "no effect."
+
+## Verdict table
+
+| Situation                                                              | Recommendation                                                                                                                                                                       |
+| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Conclude the experiment via its decide lifecycle action, naming the winning variant and a rationale message. **Confirm with the user first — concluding is irreversible.** |
+| Trust ✓, primary polarity positive, guardrail polarity negative        | **ITERATE.** Investigate the regression; do not auto-ship.                                                                                                                           |
+| Trust ✓, primary polarity neutral after target sample reached          | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md).                                                          |
+| Trust ✓, target sample/duration not yet reached                        | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)).                                                                 |
+| Trust ✗                                                                | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md).                         |
+
+For multi-variant tests, special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), and the exact decide-call shape, see [references/lifecycle-handoff.md](references/lifecycle-handoff.md).
+
+---
+
+# Steps
+
+Top-down: what to do, in order.
+
+## 1. Fetch the experiment
+
+Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments.
+
+Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret.
+
+## 2. Walk the Decision Tree
+
+Run steps 2a–2e in order. **Stop at the first failure** — do not proceed if a step flags a problem. The platform attaches verdict fields for each check; consume those verdicts rather than recomputing.
+
+### 2a. Trustworthiness
+
+SRM ok? Retro A/A clean? Exposures sufficient? Minimum duration met (~3 days)? No misconfiguration? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md). The Misconfigurations section in that reference covers the warning-level signals (multiple-testing off, extreme winsorization, CUPED on new-users-only, etc.).
+
+### 2b. Statistical significance
+
+Apply the **polarity recipe** from Components to each non-control variant × primary metric. If nothing is significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md). For translating a single metric's lift / CI / p-value into a phrase, see [references/per-metric-interpretation.md](references/per-metric-interpretation.md).
+
+### 2c. Guardrail check
+
+Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a row in `summary.positive` for a `direction: "down"` guardrail is still a regression.
+
+### 2d. Practical significance
+
+Convert lift into absolute terms by multiplying it by the control baseline. Statistical significance alone does not justify a ship. The per-metric reference covers the baseline-fetch fallback when `value` or `sampleSize` is missing, and the **Twyman's Law** check for any lift > ~30%.
+
+### 2e. Verdict
+
+Look up the situation in the **Verdict table** in Components. If the recommendation is SHIP or KILL, surface the proposed decide-action parameters and **wait for explicit user confirmation** before executing — concluding an experiment is irreversible.
+
+## 3. Going deeper (open references on demand)
+
+| User asks about…                                                                    | Open                                                                                             |
+| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
+| SRM failing, Retro A/A failing, exposures insufficient, or any trustworthiness fail | [references/health-check-interpretation.md](references/health-check-interpretation.md)           |
+| "Translate this lift / CI / p-value into English"                                   | [references/per-metric-interpretation.md](references/per-metric-interpretation.md)               |
+| "Why hasn't this hit statsig yet? Should we wait or stop?"                          | [references/why-no-statsig.md](references/why-no-statsig.md)                                     |
+| "Which segments should I break this down on?"                                       | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md)       |
+| "What does this segment-by-segment result mean?"                                    | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) |
+| "Can session replays help explain this result?"                                     | [references/session-replay-analysis.md](references/session-replay-analysis.md)                   |
+| "How do I actually conclude this experiment? Multi-variant ship?"                   | [references/lifecycle-handoff.md](references/lifecycle-handoff.md)                               |
+
+## 4. Output
+
+Default to this shape unless the user asks for something else:
+
+1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`.
+2. **Why**, walking through the Decision Tree steps that mattered (skip steps that were clearly fine).
+3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, each polarity-corrected. Include absolute-impact translation for any win.
+4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, stale-cache caveat, etc.
+5. **Suggested next action** — for SHIP / KILL, the proposed decide-action parameters **gated on user confirmation**; for ITERATE / WAIT, the investigation to run next.
+
+If experiment details are unavailable or return errors, say so — do not invent a verdict.
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md
similarity index 73%
rename from plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md
rename to plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md
index 9ec66df..e9082fa 100644
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md
@@ -1,8 +1,8 @@
 # Health-Check Interpretation
 
-Open this when Step 1 of the Decision Tree flags a failure (SRM, Retro A/A, insufficient exposures, peeking, broken-data, < 3-day window, or any misconfiguration). The goal is to turn the platform's already-computed verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action.
+Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action.
 
-**This skill never recomputes thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers.
+**Never recompute thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers.
 
 ---
 
@@ -134,17 +134,65 @@ If `endCondition: "sample_size"` with a tiny target (e.g. 10k) was reached in ho
 
 ---
 
-## 7. Misconfigurations to flag during Step 1
+## 7. Misconfigurations
 
-These don't always invalidate results, but they change how to _read_ them. Surface them as warnings.
+These don't always invalidate results, but they change how to _read_ them. Surface them as warnings during the trustworthiness gate.
 
-- `settings.multipleTestingCorrection in {"off", null}` AND there are 2+ primary metrics across 1+ non-control variants → without correction, any single significant primary may be a false positive. **Don't assume the result is broken** — look at all primary results in aggregate. If most or all primaries point the same direction (all positive or all negative), there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk**, and the user can enable correction (Benjamini-Hochberg or Bonferroni) and re-analyze. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate).
-- `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` very low (e.g. < ~80) or unusually high (e.g. > ~99) → extreme outlier capping. The platform's default is 95; a percentile near 50 caps almost all data and likely indicates misconfiguration.
-- `settings.srm == null` OR `settings.srm.enabled == false` → the SRM check didn't run. **SRM is often deliberately disabled** (e.g. when feature-flag rollouts intentionally split traffic unevenly), so do not try to compute it yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) — then suggest the user re-enable SRM and re-analyze.
-- `settings.cuped.enabled == true` AND the experiment cohort is "new users only" → CUPED requires pre-exposure data, which new-user experiments lack, so CUPED simply has no effect. **This does NOT invalidate results** — variance reduction just didn't happen. Mention it as informational.
-- `settings.confidenceLevel != 0.95` → call out explicitly. `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Combine with metric count for a sense of family-wise error rate.
-- `metrics[]` contains entries with `name == ""` → likely a broken or placeholder metric reference. Flag and skip during analysis.
-- A primary metric appears in `metrics[]` but is **missing from `live_metrics` AND `results_cache.metrics`** → no result was computed for that primary. Surface prominently — this is "no measurement," not "no effect." Recommend the user re-sync results.
+### Multiple-testing correction off with several primaries
+
+**Condition**: `settings.multipleTestingCorrection` is `"off"` or `null` AND there are 2+ primary metrics across 1+ non-control variants.
+
+**Interpretation**: any single significant primary may be a false positive. With m independent comparisons, the family-wise error rate is 1 − (1 − α)^m, so it grows quickly with metric count (e.g. 15 primaries × 1 variant at α = 0.05 → 1 − 0.95^15 ≈ 54% chance of at least one false positive).
+
+**Action**: look at all primary results in aggregate. If most point the same direction, there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk** — the user can enable Benjamini-Hochberg or Bonferroni and re-analyze.
+
+### Extreme winsorization percentile
+
+**Condition**: `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` is very low (e.g. < ~80) or unusually high (e.g. > ~99).
+
+**Interpretation**: outlier capping is far from the platform's default of 95. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration.
+
+**Action**: ask the user to confirm the percentile was intentional; recommend resetting to 95 unless they have a specific reason.
+
+### SRM check disabled
+
+**Condition**: `settings.srm == null` OR `settings.srm.enabled == false`.
+
+**Interpretation**: the SRM check didn't run. **Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug.
+
+**Action**: only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios). When you do flag, recommend re-enabling SRM and re-analyzing.
+
+### CUPED on new-users-only cohort
+
+**Condition**: `settings.cuped.enabled == true` AND the experiment cohort is "new users only".
+
+**Interpretation**: CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen.
+
+**Action**: mention as informational; no remediation needed for this experiment. For future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply.
+
+### Non-default confidence level
+
+**Condition**: `settings.confidenceLevel != 0.95`.
+
+**Interpretation**: `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative.
+
+**Action**: call out explicitly in the verdict. Combine with metric count to estimate the family-wise error rate.
+
+### Broken or placeholder metric entries
+
+**Condition**: `metrics[]` contains entries with `name == ""`.
+
+**Interpretation**: likely a broken or placeholder metric reference.
+
+**Action**: flag and skip during analysis.
+
+### Primary metric with no computed result
+
+**Condition**: a primary metric appears in `metrics[]` but is **missing from both** `live_metrics` and `results_cache.metrics`.
+
+**Interpretation**: no result was computed for that primary. **This is "no measurement," not "no effect."**
+
+**Action**: surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary.
 
 ---
 
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md
new file mode 100644
index 0000000..4d8189d
--- /dev/null
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md
@@ -0,0 +1,39 @@
+# Lifecycle Hand-off
+
+How to conclude an experiment once the verdict is settled. This reference is **interpretation guidance** — the per-field schema of the decide action lives in the experiment-update tool description.
+
+---
+
+## Confirm before concluding — always
+
+Concluding an experiment is **irreversible**. Before invoking the decide action, surface the proposed parameters to the user (winning variant, success/fail, rationale message) and wait for explicit confirmation. A SHIP verdict is a recommendation, not an authorization.
+
+## The three pieces every decide call needs
+
+A decide call expresses three things:
+
+1. **Did the experiment succeed?** True for a winning result; false for a deliberate stop (KILL).
+2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special constants below.
+3. **Why?** A rationale message — what metrics were evaluated, the polarity reading, the tradeoffs accepted. The platform requires this on every decide call; treat it as a one-paragraph decision record, not a placeholder.
+
+## Special variant choices for success
+
+When you have a winning result but no single variant to ship:
+
+- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. (The platform exposes this as the constant `__no_variant_shipped__`.)
+- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The platform exposes this as `__defer_variant_decision__` and shows `SUCCESS_DEFERRED` in the UI.)
+
+When the verdict is KILL — no winner — record success as false. No variant key is needed in that case.
+
+## Multi-variant experiments
+
+For a 3+ arm test, the decide action still names a single winning variant. If two treatments are roughly tied:
+
+- If both clear the practical-significance bar and shipping either is acceptable, pick based on simplicity (smaller diff from control, lower implementation cost).
+- If the team genuinely cannot pick, use the defer constant above — better than fabricating a winner.
+
+A multi-variant test where only one treatment is significantly different from control is a clean SHIP for that variant; the inconclusive arms are simply not the winner.
+
+## After concluding
+
+The decision record (`results_cache.message`, `results_cache.variant`, and `status` transitioning to `concluded` / `success` / `fail`) becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer.
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md
similarity index 87%
rename from plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md
rename to plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md
index 1e8678c..3f272ad 100644
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md
@@ -1,6 +1,6 @@
 # Per-Metric Interpretation
 
-Open this when the user wants you to translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_
+Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_
 
 **Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from the experiment-details response. Then translate.
 
@@ -19,28 +19,22 @@ A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any
 
 ---
 
-## Polarity recipe (repeat from the spine — critical)
+## Polarity recipe
 
-`metric.direction` is `"up"` or `"down"` (defaults to `"up"`).
+Apply the polarity recipe from the spine — see the **Components** section of `SKILL.md`. Treat the bucket name in `summary.positive` / `summary.negative` as sign-of-lift only; the business verdict comes from combining it with `metric.direction`. Examples worth remembering:
 
-- `lift is None` or `lift == 0` → **neutral** (treat as no measurement / no effect respectively).
-- `direction == "up"` → **positive** if `lift > 0`, else **negative**.
-- `direction == "down"` → **positive** if `lift < 0`, else **negative**.
-
-A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. A `-1% interstitials_shown` lift in `summary.negative` with `direction: "down"` is plausibly a **win** (less interruption).
+- A row in `summary.positive` with `direction: "down"` is a **regression**.
+- A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1% interstitials_shown` lift means less interruption).
 
 ---
 
-## Reading the p-value correctly
+## Reading the p-value in this platform
 
-The p-value is the probability of observing a difference at least as extreme as the one measured, **assuming the null hypothesis (no real difference) is true**. It is NOT:
+Mixpanel uses Welch's t-test (z-test for large samples) at α = 0.05 / 95% confidence by default. The confidence level is set on `settings.confidenceLevel`; if it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative).
 
-- ❌ The probability that the treatment works.
-- ❌ The probability the result will replicate.
-- ❌ A measure of effect size — a tiny lift can be highly significant on a huge sample.
-- ❌ Proof of "no effect" when above threshold (see "Inconclusive results").
+The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread.
 
-Mixpanel uses Welch's t-test (z-test for large samples). Default α = 0.05 at 95% confidence. The confidence level is set on `settings.confidenceLevel`. If it differs from 0.95, call it out in the verdict (`0.9` inflates false positives; `0.99` is conservative).
+For the general meaning of a p-value (the probability of seeing a difference at least as extreme as the one measured, assuming the null hypothesis is true), rely on baseline statistical knowledge; don't invent thresholds in either direction.
 
 ---
 
@@ -50,7 +44,6 @@ Mixpanel uses Welch's t-test (z-test for large samples). Default α = 0.05 at 95
 lift = (treatment_mean - control_mean) / control_mean
 ```
 
-- `liftConfidence` is the **confidence level used** (e.g. 0.95). It is NOT the confidence-interval width.
 - **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct.
 - If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect."
 
@@ -125,7 +118,7 @@ Different metric types behave differently; cite the relevant nuance in your verd
 ## Variance-reduction & outlier settings that change interpretation
 
 - **CUPED** (`settings.cuped.enabled == true`): mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix).
-- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration (see `health-check-interpretation.md` §Misconfig).
+- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md).
 
 ---
 
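
As a minimal sketch of the lift formula quoted in this reference, and of the rule that a missing lift is a failed calculation rather than a zero effect, the Python below shows one way to phrase a result row. The helper names `relative_lift` and `describe_lift` are hypothetical, and returning `None` for a zero control mean is an assumption made for illustration. The platform's own `lift` field remains the value to report; this only illustrates the arithmetic.

```python
from typing import Optional


def relative_lift(treatment_mean: float, control_mean: float) -> Optional[float]:
    """Relative lift: (treatment_mean - control_mean) / control_mean."""
    if control_mean == 0:
        # Assumption: an undefined ratio is treated like a failed calculation.
        return None
    return (treatment_mean - control_mean) / control_mean


def describe_lift(lift: Optional[float]) -> str:
    """Phrase a lift without confusing a failed calculation with 'no effect'."""
    if lift is None:
        return "calculation failed for this variant"
    return f"{lift:+.1%} relative to control"


if __name__ == "__main__":
    print(describe_lift(relative_lift(0.11, 0.10)))
    print(describe_lift(None))
```

Keeping `None` as a separate branch means a failed row can never be rendered as a 0% lift.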
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md
similarity index 94%
rename from plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md
rename to plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md
index fcf9cfd..e0c43d2 100644
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md
@@ -1,6 +1,6 @@
 # Segment-Breakdown Interpretation
 
-Open this when the user has per-segment results in hand and wants to read them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
+Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
 
 > **Platform support status.** Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.
 
@@ -49,7 +49,7 @@ Each segment value needs its own meaningful per-variant sample for the per-segme
 | Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. |
 | Two opposite-direction effects in different segments that roughly cancel overall                    | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses.    |
 
-When you spot Simpson's paradox, route the user to [health-check-interpretation.md](health-check-interpretation.md) §SRM — it's usually the cause, not a real reversal.
+When you spot Simpson's paradox, route the user to the **SRM** section of [health-check-interpretation.md](health-check-interpretation.md) — bucketing is usually the cause, not a real reversal.
 
 ---
 
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-of-interest-selection.md
similarity index 95%
rename from plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-of-interest-selection.md
rename to plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-of-interest-selection.md
index ea9f22b..b0c8f58 100644
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-of-interest-selection.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-of-interest-selection.md
@@ -1,6 +1,6 @@
 # Segment-of-Interest Selection
 
-Open this when the user wants to break results down by user segments — _"slice this by platform"_, _"which segments should I look at?"_, _"are new users responding differently?"_. The goal is to pick 3–5 segments that are **likely to reveal a real effect difference**, before slicing every available dimension and ending up p-hacking.
+Pick 3–5 segments **likely to reveal a real effect difference** before slicing every available dimension and ending up p-hacking.
 
 The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them.
 
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/session-replay-analysis.md
similarity index 96%
rename from plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md
rename to plugins/mixpanel-mcp/skills/interpret-experiment/references/session-replay-analysis.md
index b758b8e..59ad25e 100644
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/session-replay-analysis.md
@@ -1,6 +1,6 @@
 # Session-Replay Analysis Guidance
 
-Open this when the user wants to use session replays to explain a quantitative experiment result — _"why is conversion down in treatment?"_, _"what are users actually doing in the treatment?"_, _"can replays explain the regression?"_. The goal is to turn a number into a behavior story.
+Turn a quantitative experiment result into a behavior story using session replays.
 
 > **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss.
 
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
similarity index 94%
rename from plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md
rename to plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
index 142089c..a4e69d4 100644
--- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
@@ -1,8 +1,8 @@
 # Why Hasn't This Reached Statistical Significance Yet?
 
-Open this when the user asks why a primary metric is still inconclusive — _"why isn't this stat-sig yet?"_, _"should I wait or stop?"_, _"is this just underpowered?"_. The goal is to help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null**, _without_ recomputing the platform's verdicts.
+Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts.
 
-The actual stop / extend math (sample size, power, MDE) is owned by the **setup-side skill** — point the user there for the formulas. This skill explains _which_ lever to pull, not how to recompute one.
+The actual stop / extend math (sample size, power, MDE) is owned by the `experiment-setup` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one.
 
 ---
 

From ec34b1297e11903014d861be330f1bb571cf8284 Mon Sep 17 00:00:00 2001
From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com>
Date: Tue, 9 Jun 2026 18:48:06 +0000
Subject: [PATCH 06/11] interpret-experiment: phase-1 review fixes
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Addresses Phase 1 of the hardcore /review-skill pass.

- Drop stale "Step 1 of the Decision Tree" cross-references in SKILL.md
  Glossary, why-no-statsig.md, and segment-of-interest-selection.md. The new
  spine numbers the trustworthiness gate as step 2, but the name "trustworthiness
  gate" is what's stable — use the name.
- Drop the embedded ~30s retry interval in health-check-interpretation.md §5.
  Retry policy belongs to the tool layer; "retry once, then surface" is enough
  for the skill.
- Hedge five unsourced defaults (Welch's t-test choice, 95% winsorization
  default, ~350 per-variant exposure floor cited in three places, Bonferroni
  correction on multi-variant SRM). Each one becomes "the platform's
  configured/default X — verify in product" instead of a flat assertion.

Sync via make sync-skills FORCE=1; make check-skills-sync passes.

Assisted by Claude
---
 .../skills/interpret-experiment/SKILL.md               |  2 +-
 .../references/health-check-interpretation.md          | 10 +++++-----
 .../references/per-metric-interpretation.md            |  2 +-
 .../references/segment-breakdown-interpretation.md     |  8 ++++----
 .../references/segment-of-interest-selection.md        |  4 ++--
 .../interpret-experiment/references/why-no-statsig.md  |  4 ++--
 .../skills/interpret-experiment/SKILL.md               |  2 +-
 .../references/health-check-interpretation.md          | 10 +++++-----
 .../references/per-metric-interpretation.md            |  2 +-
 .../references/segment-breakdown-interpretation.md     |  8 ++++----
 .../references/segment-of-interest-selection.md        |  4 ++--
 .../interpret-experiment/references/why-no-statsig.md  |  4 ++--
 .../mixpanel-mcp/skills/interpret-experiment/SKILL.md  |  2 +-
 .../references/health-check-interpretation.md          | 10 +++++-----
 .../references/per-metric-interpretation.md            |  2 +-
 .../references/segment-breakdown-interpretation.md     |  8 ++++----
 .../references/segment-of-interest-selection.md        |  4 ++--
 .../interpret-experiment/references/why-no-statsig.md  |  4 ++--
 18 files changed, 45 insertions(+), 45 deletions(-)

diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
index c2d7591..c205f29 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
@@ -28,7 +28,7 @@ Concepts the rest of this skill uses without redefining.
 - **CUPED.** Variance reduction using pre-exposure baseline. Cuts required sample 30–70% when it applies. Inert on new-user-only cohorts.
 - **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95.
 - **MDE (Minimum Detectable Effect).** The smallest lift the experiment was sized to detect. Set during experiment setup.
-- **Trustworthiness gate.** The pre-flight check in Step 1 of the Decision Tree: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference.
+- **Trustworthiness gate.** The pre-flight check that runs before any results interpretation: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference.
 
 ---
 
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md
index e9082fa..a0658e2 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md
@@ -45,7 +45,7 @@ Users were assigned to variants in proportions that disagree with the configured
 
 1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented?
 2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history.
-3. For multi-variant tests, the platform's SRM threshold is Bonferroni-corrected — the effective per-variant threshold may be tighter than the headline. Trust the bucket flag, not raw p-value math.
+3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math.
 4. Verify SDK version and bucketing logic. Query `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly.
 5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it.
 6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting.
@@ -115,8 +115,8 @@ A frequentist test that ends before reaching its configured target has an **infl
 
 ### Investigation checklist
 
-1. Retry the experiment-details request — transient backend load may resolve. Wait ~30s between retries.
-2. If repeated failures: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget.
+1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy.
+2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget.
 3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision.
 4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation.
 
@@ -150,9 +150,9 @@ These don't always invalidate results, but they change how to _read_ them. Surfa
 
 **Condition**: `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` is very low (e.g. < ~80) or unusually high (e.g. > ~99).
 
-**Interpretation**: outlier capping is far from the platform's default of 95. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration.
+**Interpretation**: outlier capping is far from the configured platform default (typically 95 — verify in product). A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration.
 
-**Action**: ask the user to confirm the percentile was intentional; recommend resetting to 95 unless they have a specific reason.
+**Action**: ask the user to confirm the percentile was intentional; recommend resetting to the platform default unless they have a specific reason.
 
 ### SRM check disabled
 
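
The first item of the SRM investigation checklist (comparing the `live_exposures` ratio to `settings.srm.targetAllocations`) could be sketched as below. The dictionary shapes are assumptions: the real payload may structure exposures and target allocations differently, and `allocation_drift` is a hypothetical helper. It only shows which variant is over- or under-represented and does not replace the platform's SRM verdict.

```python
from typing import Dict


def allocation_drift(exposures: Dict[str, int], targets: Dict[str, float]) -> Dict[str, float]:
    """Observed share minus target share per variant (positive = over-represented).

    Assumes `exposures` maps variant name to exposed-user count and `targets`
    maps variant name to its configured share (shares summing to 1.0).
    """
    total = sum(exposures.values())
    if total == 0:
        raise ValueError("no exposures recorded yet")
    return {
        variant: exposures.get(variant, 0) / total - share
        for variant, share in targets.items()
    }


if __name__ == "__main__":
    drift = allocation_drift(
        {"control": 5200, "treatment": 4800},
        {"control": 0.5, "treatment": 0.5},
    )
    for variant, delta in drift.items():
        print(f"{variant}: {delta:+.1%} vs target")
```

The output is descriptive only. Whether a given drift counts as an SRM failure is the platform's call.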
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md
index 3f272ad..d8877fb 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md
@@ -30,7 +30,7 @@ Apply the polarity recipe from the spine — see the **Components** section of `
 
 ## Reading the p-value in this platform
 
-Mixpanel uses Welch's t-test (z-test for large samples) at α = 0.05 / 95% confidence by default. The confidence level is set on `settings.confidenceLevel`; if it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative).
+Mixpanel runs a frequentist comparison at the configured `settings.confidenceLevel`, typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out: `0.9` raises the nominal false-positive rate to 10%, while `0.99` lowers it to 1% at the cost of power.
 
 The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread.
 
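
One way to apply the "call it out" rule for `settings.confidenceLevel` is sketched below. The function name `confidence_level_note` and the message wording are hypothetical, and treating 0.95 as the expected value follows the "typically 0.95" hedge in the text rather than a confirmed platform default.

```python
from typing import Optional


def confidence_level_note(confidence_level: float, expected: float = 0.95) -> Optional[str]:
    """Return a caveat when the configured confidence level is not the usual one."""
    if abs(confidence_level - expected) < 1e-9:
        return None  # nothing to flag
    alpha = 1 - confidence_level  # nominal false-positive rate
    stance = "looser" if confidence_level < expected else "stricter"
    return (
        f"confidenceLevel is {confidence_level:g} (alpha = {alpha:.2f}), "
        f"{stance} than the usual {expected:g}"
    )


if __name__ == "__main__":
    for level in (0.9, 0.95, 0.99):
        print(level, "->", confidence_level_note(level))
```

Note that this reads the confidence level from settings. `liftConfidence` on a result row carries the same kind of value and should not be used as a CI width.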
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md
index e0c43d2..f5623e1 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md
@@ -32,10 +32,10 @@ Surprisingly easy to forget when you're scanning a wide table — re-apply polar
 
 ## Sample-size floor per segment
 
-Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. As a rule of thumb, the same ~350-per-variant floor used for overall trustworthiness applies per segment.
+Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. The platform surfaces an "insufficient exposures" flag at the overall level — trust that signal over a hand-rolled threshold, and apply the same logic per segment.
 
-- Segments below the floor → mark "insufficient sample, treat as directional only."
-- A "significant" lift on a 50-user-per-variant segment is almost always noise. Say so.
+- Segments the platform would flag as insufficient if evaluated on their own → mark "insufficient sample, treat as directional only."
+- A "significant" lift on a tiny per-variant segment (e.g. tens of users) is almost always noise. Say so.
 - If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice.
 
 ---
@@ -58,7 +58,7 @@ When you spot Simpson's paradox, route the user to the **SRM** section of [healt
 Don't recommend a segment-scoped ship unless **all** of these hold:
 
 1. The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it).
-2. The segment's per-variant sample clears the ~350 floor by a comfortable margin.
+2. The segment's per-variant sample clears whatever exposure floor the platform applies to the overall experiment, by a comfortable margin.
 3. The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment.
 4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product.
 5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply.
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-of-interest-selection.md
index b0c8f58..4db49ac 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-of-interest-selection.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-of-interest-selection.md
@@ -64,7 +64,7 @@ If overall SRM is borderline (or failing in one variant only), per-segment SRM c
 - Bot-suspicious countries (`bot_traffic` cause from health-check).
 - A specific app version range that shipped a flag-evaluation change.
 
-This is diagnostic segmentation, not interpretation segmentation. Use it when Step 1 of the Decision Tree has already flagged trouble.
+This is diagnostic segmentation, not interpretation segmentation. Use it when the **trustworthiness gate** has already flagged trouble.
 
 ### 5. Segments the platform de facto requires
 
@@ -82,7 +82,7 @@ Don't include all three blindly — pick the one(s) most likely to vary given th
 
 For each segment you want to break down on:
 
-1. **Does each segment value have ~350+ exposed users per variant?** Below that floor, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment.
+1. **Does each segment value have enough exposed users per variant to clear the platform's overall sufficiency threshold?** Below that, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment.
 2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis.
 3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. "Pro tier" boundaries changed mid-test) invalidate the comparison.
 4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes — slicing on them is selection-bias, not stratification.
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md
index a4e69d4..7cc432a 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md
@@ -13,7 +13,7 @@ Inconclusive can mean two very different things:
 1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about.
 2. **The result isn't trustworthy at all** — SRM failing, broken data, peeked frequentist, etc. — and "inconclusive" is the wrong frame entirely.
 
-Before answering "why no statsig?", run the trustworthiness gate (Step 1 of the Decision Tree). If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power.
+Before answering "why no statsig?", run the **trustworthiness gate**. If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power.
 
 Also check:
 
@@ -63,7 +63,7 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu
 
 - Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue.
 - Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later.
-- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs its own ~350+ sample for the per-comparison stats to be reliable. Adding arms costs power per-comparison.
+- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs to clear the platform's per-variant exposure floor in its own right. Adding arms costs power per-comparison.
 
 Never change traffic allocation mid-Frequentist test — it invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment.
 
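
To illustrate why a skewed split is bottlenecked by its smaller arm, the sketch below compares the standard error of a difference in means under a given split with the 50/50 case. It assumes equal per-user variance in both arms and a fixed total sample, which is a simplification; `relative_standard_error` is a hypothetical helper, not a platform calculation, and the actual sizing math stays with the setup skill.

```python
import math


def relative_standard_error(treatment_share: float) -> float:
    """SE of (treatment - control) under this split, relative to a 50/50 split.

    With equal variance and total N, SE is proportional to
    sqrt(1 / n_treatment + 1 / n_control).
    """
    if not 0 < treatment_share < 1:
        raise ValueError("treatment_share must be strictly between 0 and 1")
    skewed = math.sqrt(1 / treatment_share + 1 / (1 - treatment_share))
    balanced = math.sqrt(1 / 0.5 + 1 / 0.5)
    return skewed / balanced


if __name__ == "__main__":
    ratio = relative_standard_error(0.10)
    print(f"SE vs 50/50: {ratio:.2f}x")
    print(f"Total traffic needed for equal precision: {ratio ** 2:.2f}x")
```

Under these assumptions a 90/10 split carries about 1.67 times the standard error of a balanced split, so it needs roughly 2.8 times the total traffic to reach the same precision.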
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md
index c2d7591..c205f29 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md
@@ -28,7 +28,7 @@ Concepts the rest of this skill uses without redefining.
 - **CUPED.** Variance reduction using pre-exposure baseline. Cuts required sample 30–70% when it applies. Inert on new-user-only cohorts.
 - **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95.
 - **MDE (Minimum Detectable Effect).** The smallest lift the experiment was sized to detect. Set during experiment setup.
-- **Trustworthiness gate.** The pre-flight check in Step 1 of the Decision Tree: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference.
+- **Trustworthiness gate.** The pre-flight check that runs before any results interpretation: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference.
 
 ---
 
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md
index e9082fa..a0658e2 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md
@@ -45,7 +45,7 @@ Users were assigned to variants in proportions that disagree with the configured
 
 1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented?
 2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history.
-3. For multi-variant tests, the platform's SRM threshold is Bonferroni-corrected — the effective per-variant threshold may be tighter than the headline. Trust the bucket flag, not raw p-value math.
+3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math.
 4. Verify SDK version and bucketing logic. Query `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly.
 5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it.
 6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting.
@@ -115,8 +115,8 @@ A frequentist test that ends before reaching its configured target has an **infl
 
 ### Investigation checklist
 
-1. Retry the experiment-details request — transient backend load may resolve. Wait ~30s between retries.
-2. If repeated failures: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget.
+1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy.
+2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget.
 3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision.
 4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation.
 
@@ -150,9 +150,9 @@ These don't always invalidate results, but they change how to _read_ them. Surfa
 
 **Condition**: `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` is very low (e.g. < ~80) or unusually high (e.g. > ~99).
 
-**Interpretation**: outlier capping is far from the platform's default of 95. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration.
+**Interpretation**: outlier capping is far from the configured platform default (typically 95 — verify in product). A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration.
 
-**Action**: ask the user to confirm the percentile was intentional; recommend resetting to 95 unless they have a specific reason.
+**Action**: ask the user to confirm the percentile was intentional; recommend resetting to the platform default unless they have a specific reason.
 
 ### SRM check disabled
 
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md
index 3f272ad..d8877fb 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md
@@ -30,7 +30,7 @@ Apply the polarity recipe from the spine — see the **Components** section of `
 
 ## Reading the p-value in this platform
 
-Mixpanel uses Welch's t-test (z-test for large samples) at α = 0.05 / 95% confidence by default. The confidence level is set on `settings.confidenceLevel`; if it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative).
+Mixpanel runs a frequentist comparison at the configured `settings.confidenceLevel`, typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out: `0.9` raises the nominal false-positive rate to 10%, while `0.99` lowers it to 1% at the cost of power.
 
 The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread.
 
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md
index e0c43d2..f5623e1 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md
@@ -32,10 +32,10 @@ Surprisingly easy to forget when you're scanning a wide table — re-apply polar
 
 ## Sample-size floor per segment
 
-Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. As a rule of thumb, the same ~350-per-variant floor used for overall trustworthiness applies per segment.
+Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. The platform surfaces an "insufficient exposures" flag at the overall level — trust that signal over a hand-rolled threshold, and apply the same logic per segment.
 
-- Segments below the floor → mark "insufficient sample, treat as directional only."
-- A "significant" lift on a 50-user-per-variant segment is almost always noise. Say so.
+- Segments the platform would flag as insufficient if evaluated on their own → mark "insufficient sample, treat as directional only."
+- A "significant" lift on a tiny per-variant segment (e.g. tens of users) is almost always noise. Say so.
 - If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice.
 
 ---
@@ -58,7 +58,7 @@ When you spot Simpson's paradox, route the user to the **SRM** section of [healt
 Don't recommend a segment-scoped ship unless **all** of these hold:
 
 1. The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it).
-2. The segment's per-variant sample clears the ~350 floor by a comfortable margin.
+2. The segment's per-variant sample clears whatever exposure floor the platform applies to the overall experiment, by a comfortable margin.
 3. The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment.
 4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product.
 5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply.
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-of-interest-selection.md
index b0c8f58..4db49ac 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-of-interest-selection.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-of-interest-selection.md
@@ -64,7 +64,7 @@ If overall SRM is borderline (or failing in one variant only), per-segment SRM c
 - Bot-suspicious countries (`bot_traffic` cause from health-check).
 - A specific app version range that shipped a flag-evaluation change.
 
-This is diagnostic segmentation, not interpretation segmentation. Use it when Step 1 of the Decision Tree has already flagged trouble.
+This is diagnostic segmentation, not interpretation segmentation. Use it when the **trustworthiness gate** has already flagged trouble.
 
 ### 5. Segments the platform de facto requires
 
@@ -82,7 +82,7 @@ Don't include all three blindly — pick the one(s) most likely to vary given th
 
 For each segment you want to break down on:
 
-1. **Does each segment value have ~350+ exposed users per variant?** Below that floor, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment.
+1. **Does each segment value have enough exposed users per variant to clear the platform's overall sufficiency threshold?** Below that, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment.
 2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis.
 3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. "Pro tier" boundaries changed mid-test) invalidate the comparison.
 4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes — slicing on them is selection-bias, not stratification.
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
index a4e69d4..7cc432a 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
@@ -13,7 +13,7 @@ Inconclusive can mean two very different things:
 1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about.
 2. **The result isn't trustworthy at all** — SRM failing, broken data, peeked frequentist, etc. — and "inconclusive" is the wrong frame entirely.
 
-Before answering "why no statsig?", run the trustworthiness gate (Step 1 of the Decision Tree). If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power.
+Before answering "why no statsig?", run the **trustworthiness gate**. If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power.
 
 Also check:
 
@@ -63,7 +63,7 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu
 
 - Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue.
 - Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later.
-- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs its own ~350+ sample for the per-comparison stats to be reliable. Adding arms costs power per-comparison.
+- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs to clear the platform's per-variant exposure floor in its own right. Adding arms costs power per-comparison.
 
 Never change traffic allocation mid-Frequentist test — it invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment.
 
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
index c2d7591..c205f29 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
@@ -28,7 +28,7 @@ Concepts the rest of this skill uses without redefining.
 - **CUPED.** Variance reduction using pre-exposure baseline. Cuts required sample 30–70% when it applies. Inert on new-user-only cohorts.
 - **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95.
 - **MDE (Minimum Detectable Effect).** The smallest lift the experiment was sized to detect. Set during experiment setup.
-- **Trustworthiness gate.** The pre-flight check in Step 1 of the Decision Tree: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference.
+- **Trustworthiness gate.** The pre-flight check that runs before any results interpretation: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference.
 
 ---
 
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md
index e9082fa..a0658e2 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md
@@ -45,7 +45,7 @@ Users were assigned to variants in proportions that disagree with the configured
 
 1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented?
 2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history.
-3. For multi-variant tests, the platform's SRM threshold is Bonferroni-corrected — the effective per-variant threshold may be tighter than the headline. Trust the bucket flag, not raw p-value math.
+3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math.
 4. Verify SDK version and bucketing logic. Query `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly.
 5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it.
 6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting.
@@ -115,8 +115,8 @@ A frequentist test that ends before reaching its configured target has an **infl
 
 ### Investigation checklist
 
-1. Retry the experiment-details request — transient backend load may resolve. Wait ~30s between retries.
-2. If repeated failures: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget.
+1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy.
+2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget.
 3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision.
 4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation.
 
@@ -150,9 +150,9 @@ These don't always invalidate results, but they change how to _read_ them. Surfa
 
 **Condition**: `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` is very low (e.g. < ~80) or unusually high (e.g. > ~99).
 
-**Interpretation**: outlier capping is far from the platform's default of 95. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration.
+**Interpretation**: outlier capping is far from the configured platform default (typically 95 — verify in product). A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration.
 
-**Action**: ask the user to confirm the percentile was intentional; recommend resetting to 95 unless they have a specific reason.
+**Action**: ask the user to confirm the percentile was intentional; recommend resetting to the platform default unless they have a specific reason.
 
 ### SRM check disabled
 
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md
index 3f272ad..d8877fb 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md
@@ -30,7 +30,7 @@ Apply the polarity recipe from the spine — see the **Components** section of `
 
 ## Reading the p-value in this platform
 
-Mixpanel uses Welch's t-test (z-test for large samples) at α = 0.05 / 95% confidence by default. The confidence level is set on `settings.confidenceLevel`; if it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative).
+Mixpanel runs a frequentist comparison at the configured `settings.confidenceLevel` — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative).
 
 The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread.
 
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md
index e0c43d2..f5623e1 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md
@@ -32,10 +32,10 @@ Surprisingly easy to forget when you're scanning a wide table — re-apply polar
 
 ## Sample-size floor per segment
 
-Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. As a rule of thumb, the same ~350-per-variant floor used for overall trustworthiness applies per segment.
+Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. The platform surfaces an "insufficient exposures" flag at the overall level — trust that signal over a hand-rolled threshold, and apply the same logic per segment.
 
-- Segments below the floor → mark "insufficient sample, treat as directional only."
-- A "significant" lift on a 50-user-per-variant segment is almost always noise. Say so.
+- Segments the platform would flag insufficient if scoped to alone → mark "insufficient sample, treat as directional only."
+- A "significant" lift on a tiny per-variant segment (e.g. tens of users) is almost always noise. Say so.
 - If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice.
 
 ---
@@ -58,7 +58,7 @@ When you spot Simpson's paradox, route the user to the **SRM** section of [healt
 Don't recommend a segment-scoped ship unless **all** of these hold:
 
 1. The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it).
-2. The segment's per-variant sample clears the ~350 floor by a comfortable margin.
+2. The segment's per-variant sample clears whatever exposure floor the platform applies to the overall experiment, by a comfortable margin.
 3. The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment.
 4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product.
 5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply.
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-of-interest-selection.md
index b0c8f58..4db49ac 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-of-interest-selection.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-of-interest-selection.md
@@ -64,7 +64,7 @@ If overall SRM is borderline (or failing in one variant only), per-segment SRM c
 - Bot-suspicious countries (`bot_traffic` cause from health-check).
 - A specific app version range that shipped a flag-evaluation change.
 
-This is diagnostic segmentation, not interpretation segmentation. Use it when Step 1 of the Decision Tree has already flagged trouble.
+This is diagnostic segmentation, not interpretation segmentation. Use it when the **trustworthiness gate** has already flagged trouble.
 
 ### 5. Segments the platform de facto requires
 
@@ -82,7 +82,7 @@ Don't include all three blindly — pick the one(s) most likely to vary given th
 
 For each segment you want to break down on:
 
-1. **Does each segment value have ~350+ exposed users per variant?** Below that floor, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment.
+1. **Does each segment value have enough exposed users per variant to clear the platform's overall sufficiency threshold?** Below that, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment.
 2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis.
 3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. "Pro tier" boundaries changed mid-test) invalidate the comparison.
 4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes — slicing on them is selection-bias, not stratification.
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
index a4e69d4..7cc432a 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
@@ -13,7 +13,7 @@ Inconclusive can mean two very different things:
 1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about.
 2. **The result isn't trustworthy at all** — SRM failing, broken data, peeked frequentist, etc. — and "inconclusive" is the wrong frame entirely.
 
-Before answering "why no statsig?", run the trustworthiness gate (Step 1 of the Decision Tree). If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power.
+Before answering "why no statsig?", run the **trustworthiness gate**. If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power.
 
 Also check:
 
@@ -63,7 +63,7 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu
 
 - Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue.
 - Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later.
-- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs its own ~350+ sample for the per-comparison stats to be reliable. Adding arms costs power per-comparison.
+- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs to clear the platform's per-variant exposure floor in its own right. Adding arms costs power per-comparison.
 
 Never change traffic allocation mid-Frequentist test — it invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment.
 

From 3d1a0913849bcf382c8296d11fcbb65bab532b7d Mon Sep 17 00:00:00 2001
From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com>
Date: Tue, 9 Jun 2026 18:55:57 +0000
Subject: [PATCH 07/11] interpret-experiment: phase-2 review fixes
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Addresses Phase 2 of the hardcore /review-skill pass. Removes the field-path
schema leaks gslopez originally flagged on PR #23, which had survived the
phase-1 cleanup.

- Rewrite every section header in health-check-interpretation.md (sections 1-6
  + the seven §7 misconfig sub-sections) from "Verdict to consume: <field
  path>" to plain-language intent. The Condition/Interpretation/Action
  scaffolding in §7 gets the same treatment: collapsed to "When: <plain
  condition>" plus a free paragraph, dropping the labels the reader doesn't
  need.
- Rewrite every "What to look at" bullet in why-no-statsig.md (reasons 1-5)
  from field-path triplets to intent. Same for the "First, rule out a broken
  result" checks and the EXTEND action.
- Remove the remaining settings.* / live_* / results_cache.* references in
  per-metric-interpretation.md (baseline-fetch, variance/outlier discussion,
  multiple-comparisons section, testing-model section) and SKILL.md's polarity
  recipe (multiple-testing correction).
- Remove field-path leaks from lifecycle-handoff.md (decision-record) and
  session-replay-analysis.md (example user-facing quote).
- Add a one-sentence disambiguation guard at the top of SKILL.md Step 1: if
  the user hasn't named a specific experiment, ask before fetching.
- Expand "SRM" on first mention in the description and replace "hasn't
  reached statistical significance" with "isn't showing a clear winner yet"
  for non-expert legibility. The glossary inside SKILL.md still carries the
  full definition.

After this commit, `grep -rE 'live_|results_cache|exposures_cache|settings\.<…>'
plugins/mixpanel-mcp/skills/interpret-experiment/` returns zero hits.

Sync via make sync-skills FORCE=1; make check-skills-sync passes.

Assisted by Claude
---
 .../skills/interpret-experiment/SKILL.md      |  6 +-
 .../references/health-check-interpretation.md | 80 ++++++++-----------
 .../references/lifecycle-handoff.md           |  2 +-
 .../references/per-metric-interpretation.md   | 18 ++---
 .../references/session-replay-analysis.md     |  2 +-
 .../references/why-no-statsig.md              | 24 +++---
 .../skills/interpret-experiment/SKILL.md      |  6 +-
 .../references/health-check-interpretation.md | 80 ++++++++-----------
 .../references/lifecycle-handoff.md           |  2 +-
 .../references/per-metric-interpretation.md   | 18 ++---
 .../references/session-replay-analysis.md     |  2 +-
 .../references/why-no-statsig.md              | 24 +++---
 .../skills/interpret-experiment/SKILL.md      |  6 +-
 .../references/health-check-interpretation.md | 80 ++++++++-----------
 .../references/lifecycle-handoff.md           |  2 +-
 .../references/per-metric-interpretation.md   | 18 ++---
 .../references/session-replay-analysis.md     |  2 +-
 .../references/why-no-statsig.md              | 24 +++---
 18 files changed, 180 insertions(+), 216 deletions(-)

diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
index c205f29..18b15f7 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: interpret-experiment
-description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
 license: Apache-2.0
 ---
 
@@ -48,7 +48,7 @@ Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"):
 
 A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first (use `settings.controlKey`).
 
-The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**.
+The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**.
 
 ## Data-source fallback
 
@@ -74,6 +74,8 @@ Top-down: what to do, in order.
 
 ## 1. Fetch the experiment
 
+If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs.
+
 Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments.
 
 Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret.
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md
index a0658e2..1edc9fa 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md
@@ -18,11 +18,11 @@ These two principles drive the recommendations below. Lead with them when explai
 
 ## 1. SRM (Sample Ratio Mismatch)
 
-**Verdict to consume**: `live_srm_analysis` (or `exposures_cache.$srm_analysis`). The platform tags failing SRMs already; do not compute chi-square yourself.
+**What the platform tells you**: the SRM verdict the experiment-details response carries (live, or cached when live isn't available). The platform tags failing SRMs already — consume the verdict, do not compute chi-square yourself.
 
 ### What it means
 
-Users were assigned to variants in proportions that disagree with the configured `settings.srm.targetAllocations`. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness.
+Users were assigned to variants in proportions that disagree with the configured target allocation. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness.
 
 ### Likely causes, ordered most → least likely
 
@@ -31,23 +31,23 @@ Users were assigned to variants in proportions that disagree with the configured
 1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees.
 2. **biased_assignment** — The assignment criterion correlates with the variant — e.g. assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window.
 3. **bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation.
-4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the `$experiment_started` event fires exactly once per user per variant assignment.
+4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the exposure event fires exactly once per user per variant assignment.
 5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period.
 
 ### Recommended actions
 
 - **pause_and_investigate** — Pause the experiment before drawing any conclusions. SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable.
 - **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric.
-- **investigate_exposure_logging** — Inspect `$experiment_started` event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs.
+- **investigate_exposure_logging** — Compare exposure event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs.
 - **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split.
 
 ### Investigation checklist
 
-1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented?
+1. Compare the actual per-variant exposure ratio to the configured target allocation — which variant is over/under-represented?
 2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history.
 3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math.
-4. Verify SDK version and bucketing logic. Query `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly.
-5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it.
+4. Verify SDK version and bucketing logic. Query the exposure event grouped by variant to confirm exposure events are flowing correctly.
+5. Check for bot/QA traffic — bots often skew toward control. If QA traffic isn't being excluded, recommend enabling that filter.
 6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting.
 7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.**
 
@@ -55,7 +55,7 @@ Users were assigned to variants in proportions that disagree with the configured
 
 ## 2. Retro A/A (pre-experiment bias) failure
 
-**Verdict to consume**: the analysis the platform attached when `settings.preExperimentBias` is enabled.
+**What the platform tells you**: the pre-experiment-bias analysis the platform attaches when that check is enabled in the experiment's settings.
 
 ### What it means
 
@@ -76,14 +76,14 @@ The same statistical comparison run on the **pre-exposure** period revealed that
 
 ## 3. Insufficient exposures
 
-**Verdict to consume**: `live_exposures` per variant, plus any platform-attached "insufficient" flag. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue.
+**What the platform tells you**: per-variant exposure counts plus an "insufficient" flag when the count is too low to trust. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue.
 
 ### Investigation checklist
 
-1. Check `live_exposures` totals — which variant is undersampled?
+1. Check per-variant exposure totals — which variant is undersampled?
 2. Inspect feature-flag rollout — was rollout dialed back?
 3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?).
-4. If the experiment is still ACTIVE: extend duration via an experiment update with a new `endAfterDays`.
+4. If the experiment is still ACTIVE: extend duration via an experiment update with a new end target.
 5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math.
 
 If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question.
@@ -92,7 +92,7 @@ If the user wants to talk about _why_ a primary metric is still inconclusive eve
 
 ## 4. Frequentist peeking
 
-**Verdict to consume**: `settings.testingModel == "frequentist"`, plus `end_date` vs `start_date + endAfterDays` (or `sampleSize` vs `live_exposures.$overall`, depending on `settings.endCondition`).
+**What to check**: the experiment's testing model and whether it ended before reaching its configured end condition (sample size or duration, whichever was configured).
 
 ### What it means
 
@@ -100,10 +100,10 @@ A frequentist test that ends before reaching its configured target has an **infl
 
 ### Investigation checklist
 
-1. Confirm `settings.testingModel == "frequentist"`.
-2. Compare `end_date` against `start_date + endAfterDays` (or whether `sampleSize` was reached, whichever is the configured `endCondition`).
+1. Confirm the testing model is frequentist (sequential tests don't have this problem).
+2. Compare the actual end date against the planned end (date- or sample-based, whichever the experiment was configured with).
 3. If the conclusion was premature: results have inflated false-positive rate. Recommend a re-run.
-4. If the user wants to keep current results: caveat strongly. Recommend `testingModel: "sequential"` for the next experiment so they can stop early without penalty.
+4. If the user wants to keep current results: caveat strongly. Recommend a sequential testing model for the next experiment so they can stop early without penalty.
 
 (Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.)
 
@@ -111,26 +111,26 @@ A frequentist test that ends before reaching its configured target has an **infl
 
 ## 5. Live computation timeout / broken data
 
-**Verdict to consume**: `live_results_errors` non-null with `live_*` fields null.
+**What the platform tells you**: a non-null error block on the live results, with the live data path empty.
 
 ### Investigation checklist
 
 1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy.
 2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget.
 3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision.
-4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation.
+4. If the cached results are recent (computed within hours), surface them with a "stale data" caveat and the computation timestamp. If the cache is days old or empty, the user must resolve the backend issue before any meaningful interpretation.
 
 ---
 
 ## 6. Experiment ran < 3 days
 
-**Verdict to compute (this one is local)**: `end_date - start_date`.
+**What to compute (this one is local)**: the elapsed time between the experiment's start and end.
 
 Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. **Refuse to interpret.** Tell the user explicitly:
 
 > _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_
 
-If `endCondition: "sample_size"` with a tiny target (e.g. 10k) was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window.
+If the experiment was sample-size-bounded and a tiny target was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window.
 
 ---
 
@@ -140,59 +140,45 @@ These don't always invalidate results, but they change how to _read_ them. Surfa
 
 ### Multiple-testing correction off with several primaries
 
-**Condition**: `settings.multipleTestingCorrection` is `"off"` or `null` AND there are 2+ primary metrics across 1+ non-control variants.
+**When**: multiple-testing correction is off AND there are 2+ primary metrics across 1+ non-control variants.
 
-**Interpretation**: any single significant primary may be a false positive. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate).
-
-**Action**: look at all primary results in aggregate. If most point the same direction, there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk** — the user can enable Benjamini-Hochberg or Bonferroni and re-analyze.
+Any single significant primary may be a false positive. The family-wise error rate compounds with the number of comparisons: for m independent tests it is 1 − (1 − α)^m (e.g. 15 primaries × 1 variant at α=0.05 → ~54% chance of at least one false positive). Look at all primary results in aggregate: if most point the same direction, there is likely a real effect; if only one or two of many are significant, the result is **inconclusive due to false-positive risk** — recommend the user enable Benjamini-Hochberg or Bonferroni and re-analyze.
 
 ### Extreme winsorization percentile
 
-**Condition**: `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` is very low (e.g. < ~80) or unusually high (e.g. > ~99).
-
-**Interpretation**: outlier capping is far from the configured platform default (typically 95 — verify in product). A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration.
+**When**: Winsorization is enabled with a percentile far from the platform's default (typically 95).
 
-**Action**: ask the user to confirm the percentile was intentional; recommend resetting to the platform default unless they have a specific reason.
+A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. Ask the user to confirm the percentile was intentional; recommend resetting to the default unless they have a specific reason.
 
 ### SRM check disabled
 
-**Condition**: `settings.srm == null` OR `settings.srm.enabled == false`.
+**When**: the experiment's SRM check is off.
 
-**Interpretation**: the SRM check didn't run. **Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug.
-
-**Action**: only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios). When you do flag, recommend re-enabling SRM and re-analyzing.
+**Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios); when you do flag, recommend re-enabling SRM and re-analyzing.
 
 ### CUPED on new-users-only cohort
 
-**Condition**: `settings.cuped.enabled == true` AND the experiment cohort is "new users only".
-
-**Interpretation**: CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen.
+**When**: CUPED is enabled AND the experiment cohort is "new users only".
 
-**Action**: mention as informational; no remediation needed for this experiment. For future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply.
+CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen. Mention as informational; for future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply.
 
 ### Non-default confidence level
 
-**Condition**: `settings.confidenceLevel != 0.95`.
+**When**: the experiment is configured for a confidence level other than the platform default (typically 0.95).
 
-**Interpretation**: `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative.
-
-**Action**: call out explicitly in the verdict. Combine with metric count to estimate the family-wise error rate.
+`0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out explicitly in the verdict and combine with metric count to estimate the family-wise error rate.
 
 ### Broken or placeholder metric entries
 
-**Condition**: `metrics[]` contains entries with `name == ""`.
-
-**Interpretation**: likely a broken or placeholder metric reference.
+**When**: the experiment includes metric entries with empty names.
 
-**Action**: flag and skip during analysis.
+Likely a broken or placeholder metric reference. Flag and skip during analysis.
 
 ### Primary metric with no computed result
 
-**Condition**: a primary metric appears in `metrics[]` but is **missing from both** `live_metrics` and `results_cache.metrics`.
-
-**Interpretation**: no result was computed for that primary. **This is "no measurement," not "no effect."**
+**When**: a metric is listed as primary on the experiment but has no result (neither live nor cached).
 
-**Action**: surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary.
+No result was computed for that primary. **This is "no measurement," not "no effect."** Surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary.
 
 ---
 
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md
index 4d8189d..3a9e24c 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md
@@ -36,4 +36,4 @@ A multi-variant test where only one treatment is significantly different from co
 
 ## After concluding
 
-The decision record (`results_cache.message`, `results_cache.variant`, and `status` transitioning to `concluded` / `success` / `fail`) becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer.
+The decision record — the rationale message, the shipped variant, and the experiment's terminal status — becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer.
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md
index d8877fb..576ef9f 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md
@@ -30,7 +30,7 @@ Apply the polarity recipe from the spine — see the **Components** section of `
 
 ## Reading the p-value in this platform
 
-Mixpanel runs a frequentist comparison at the configured `settings.confidenceLevel` — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative).
+Mixpanel runs a frequentist comparison at the experiment's configured confidence level — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative).
 
 The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread.
 
@@ -70,16 +70,16 @@ Pick the phrase that matches the four-question pattern. These are the words to u
 
 Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful:
 
-1. Baseline from the control variant: `live_metrics[metricId][controlKey].value` (or the `summary.no` row where `variant == controlKey`).
+1. Baseline from the control variant's metric value (the experiment-details response carries it on the per-variant row).
 2. Lift from the winning row.
-3. Absolute lift: `baseline_value × lift`. Examples:
+3. Absolute lift: `baseline × lift`. Examples:
    - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate.
    - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`.
 4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week."
 
-### Fallback when `value` / `sampleSize` are null
+### Fallback when the baseline value or sample size is missing
 
-Common — happens whenever live computation timed out or `results_cache.metrics` was nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.**
+Common — happens whenever live computation timed out or the cached results were nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.**
 
 Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation:
 
@@ -117,8 +117,8 @@ Different metric types behave differently; cite the relevant nuance in your verd
 
 ## Variance-reduction & outlier settings that change interpretation
 
-- **CUPED** (`settings.cuped.enabled == true`): mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix).
-- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md).
+- **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix).
+- **Winsorization enabled**: extreme values capped at the configured percentile, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A percentile much lower than the platform default (typically 95) is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md).
 
 ---
 
@@ -130,7 +130,7 @@ Different metric types behave differently; cite the relevant nuance in your verd
 | **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude.                                                                                                                                          |
 | **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. |
 
-If `settings.multipleTestingCorrection` is `"off"` AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled.
+If multiple-testing correction is off AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled.
 
 ---
 
@@ -153,7 +153,7 @@ For the full "why hasn't this hit statsig yet" walk-through, see [why-no-statsig
 
 ## Frequentist vs Sequential — what affects per-metric reading
 
-Check `settings.testingModel`:
+Check the experiment's testing model:
 
 - `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration.
 - `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended.
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/session-replay-analysis.md
index 59ad25e..7282bb4 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/session-replay-analysis.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/session-replay-analysis.md
@@ -83,7 +83,7 @@ If treatment users _arrive_ at a screen more often but _complete_ at a lower per
 
 Replay analysis is qualitative. Be honest about that.
 
-- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in `live_metrics`."_
+- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in the experiment's results."_
 - ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict.
 
 Tie observations back to specific quantitative results from the experiment-details response. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior.
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md
index 7cc432a..dbda2af 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md
@@ -17,9 +17,9 @@ Before answering "why no statsig?", run the **trustworthiness gate**. If anythin
 
 Also check:
 
-- `lift is None` on the primary → no measurement, not "no effect."
-- The primary is in `metrics[]` but missing from `live_metrics` and `results_cache.metrics` → "no measurement."
-- `live_results_errors` is non-null → results are stale or partial; resolve before drawing power conclusions.
+- The primary's lift is missing or null → no measurement, not "no effect."
+- The primary is listed on the experiment but has no computed result (live or cached) → "no measurement," not "no effect."
+- The live results carry an error block → results are stale or partial; resolve the backend issue before drawing power conclusions.
 
 ---
 
@@ -29,7 +29,7 @@ Walk through these in order. The first one that explains the picture is usually
 
 ### 1. Not enough sample yet (not enough exposures)
 
-**What to look at**: `live_exposures` per variant vs `settings.sampleSize`; or `end_date - start_date` vs `start_date + settings.endAfterDays`; plus `settings.testingModel`.
+**What to check**: per-variant exposure counts against the configured end target (sample size or duration, whichever the experiment was configured with), and which testing model the experiment is using.
 
 - **Sequential** + target not reached → genuinely too early. Recommend **WAIT**.
 - **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe.
@@ -39,7 +39,7 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu
 
 ### 2. Observed effect is smaller than the MDE
 
-**What to look at**: the lift on the primary in `live_metrics[primary][treatment].lift`, plus the MDE the user planned for (typically captured in the experiment's `description` or recovered via the setup-side skill's power math).
+**What to check**: the lift on the primary metric, plus the MDE the user planned for (typically captured in the experiment's hypothesis/description, or recovered via the setup-side skill's power math).
 
 - Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1.
 - Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options:
@@ -49,9 +49,9 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu
 
 ### 3. Variance is too high (metric is too noisy)
 
-**What to look at**: distribution type of the metric, plus `settings.cuped.enabled` and `settings.winsorization.enabled`.
+**What to check**: the metric's distribution type, plus whether CUPED and Winsorization are enabled.
 
-- **Gaussian** metric (revenue, time-on-page) with no winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization (default percentile 95) on the next run.
+- **Gaussian** metric (revenue, time-on-page) with no Winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization on the next run.
 - **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume.
 - **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the absolute scale of detectable effects. Lifts near 50% rates are easiest; lifts near 0%/100% need much more sample.
 - **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%.
@@ -59,7 +59,7 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu
 
 ### 4. Traffic split is starving the variant
 
-**What to look at**: `settings.srm.targetAllocations` and `live_exposures` per variant.
+**What to check**: the configured traffic split against the actual per-variant exposure counts.
 
 - Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue.
 - Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later.
@@ -69,11 +69,11 @@ Never change traffic allocation mid-Frequentist test — it invalidates the SRM
 
 ### 5. Exposure config is filtering more users than the user expects
 
-**What to look at**: the exposure tracking method (`$experiment_started` event volume), any audience filters on the backing feature flag, and `settings.excludeQA`.
+**What to check**: exposure event volume, any audience filters on the backing feature flag, and whether QA traffic is being excluded.
 
-- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query `$experiment_started` to confirm how many users actually got exposed.
+- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query the exposure event to confirm how many users actually got exposed.
 - The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event.
-- `settings.excludeQA` was off and you suspect internal traffic is dominating one variant → enable it on the next run (results then are cleaner but also smaller).
+- QA traffic isn't being excluded and you suspect internal traffic is dominating one variant → enable the QA exclusion on the next run (results will be cleaner, though the sample will be smaller).
 
 **Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md).
 
@@ -93,7 +93,7 @@ Once you know which reason fits, the recommendation almost picks itself.
 | Exposure config is filtering           | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample.               |
 | Experiment finished, well-powered      | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters.       |
 
-When recommending EXTEND on an active experiment, the action is an experiment update with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the `experiment-setup` skill for the power math.
+When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `experiment-setup` skill for the power math.
 
 ---
 
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md
index c205f29..18b15f7 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: interpret-experiment
-description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results or to decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
 license: Apache-2.0
 ---
 
@@ -48,7 +48,7 @@ Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"):
 
 A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first (use `settings.controlKey`).
 
-The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**.
+The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**.
 
 ## Data-source fallback
 
@@ -74,6 +74,8 @@ Top-down: what to do, in order.
 
 ## 1. Fetch the experiment
 
+If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs.
+
 Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments.
 
 Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret.
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md
index a0658e2..1edc9fa 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md
@@ -18,11 +18,11 @@ These two principles drive the recommendations below. Lead with them when explai
 
 ## 1. SRM (Sample Ratio Mismatch)
 
-**Verdict to consume**: `live_srm_analysis` (or `exposures_cache.$srm_analysis`). The platform tags failing SRMs already; do not compute chi-square yourself.
+**What the platform tells you**: the SRM verdict the experiment-details response carries (live, or cached when live isn't available). The platform tags failing SRMs already — consume the verdict, do not compute chi-square yourself.
 
 ### What it means
 
-Users were assigned to variants in proportions that disagree with the configured `settings.srm.targetAllocations`. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness.
+Users were assigned to variants in proportions that disagree with the configured target allocation. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness.
 
 ### Likely causes, ordered most → least likely
 
@@ -31,23 +31,23 @@ Users were assigned to variants in proportions that disagree with the configured
 1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees.
 2. **biased_assignment** — The assignment criterion correlates with the variant — e.g. assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window.
 3. **bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation.
-4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the `$experiment_started` event fires exactly once per user per variant assignment.
+4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the exposure event fires exactly once per user per variant assignment.
 5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period.
 
 ### Recommended actions
 
 - **pause_and_investigate** — Pause the experiment before drawing any conclusions. SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable.
 - **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric.
-- **investigate_exposure_logging** — Inspect `$experiment_started` event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs.
+- **investigate_exposure_logging** — Compare exposure event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs.
 - **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split.
 
 ### Investigation checklist
 
-1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented?
+1. Compare the actual per-variant exposure ratio to the configured target allocation — which variant is over/under-represented?
 2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history.
 3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math.
-4. Verify SDK version and bucketing logic. Query `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly.
-5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it.
+4. Verify SDK version and bucketing logic. Query the exposure event grouped by variant to confirm exposure events are flowing correctly.
+5. Check for bot/QA traffic — bots often skew toward control. If QA traffic isn't being excluded, recommend enabling that filter.
 6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting.
 7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.**
 
@@ -55,7 +55,7 @@ Users were assigned to variants in proportions that disagree with the configured
 
 ## 2. Retro A/A (pre-experiment bias) failure
 
-**Verdict to consume**: the analysis the platform attached when `settings.preExperimentBias` is enabled.
+**What the platform tells you**: the pre-experiment-bias analysis the platform attaches when that check is enabled in the experiment's settings.
 
 ### What it means
 
@@ -76,14 +76,14 @@ The same statistical comparison run on the **pre-exposure** period revealed that
 
 ## 3. Insufficient exposures
 
-**Verdict to consume**: `live_exposures` per variant, plus any platform-attached "insufficient" flag. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue.
+**What the platform tells you**: per-variant exposure counts plus an "insufficient" flag when the count is too low to trust. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue.
 
 ### Investigation checklist
 
-1. Check `live_exposures` totals — which variant is undersampled?
+1. Check per-variant exposure totals — which variant is undersampled?
 2. Inspect feature-flag rollout — was rollout dialed back?
 3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?).
-4. If the experiment is still ACTIVE: extend duration via an experiment update with a new `endAfterDays`.
+4. If the experiment is still ACTIVE: extend duration via an experiment update with a new end target.
 5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math.
 
 If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question.
@@ -92,7 +92,7 @@ If the user wants to talk about _why_ a primary metric is still inconclusive eve
 
 ## 4. Frequentist peeking
 
-**Verdict to consume**: `settings.testingModel == "frequentist"`, plus `end_date` vs `start_date + endAfterDays` (or `sampleSize` vs `live_exposures.$overall`, depending on `settings.endCondition`).
+**What to check**: the experiment's testing model and whether it ended before reaching its configured end condition (sample size or duration, whichever was configured).
 
 ### What it means
 
@@ -100,10 +100,10 @@ A frequentist test that ends before reaching its configured target has an **infl
 
 ### Investigation checklist
 
-1. Confirm `settings.testingModel == "frequentist"`.
-2. Compare `end_date` against `start_date + endAfterDays` (or whether `sampleSize` was reached, whichever is the configured `endCondition`).
+1. Confirm the testing model is frequentist (sequential tests don't have this problem).
+2. Compare the actual end date against the planned end (date- or sample-based, whichever the experiment was configured with).
 3. If the conclusion was premature: results have inflated false-positive rate. Recommend a re-run.
-4. If the user wants to keep current results: caveat strongly. Recommend `testingModel: "sequential"` for the next experiment so they can stop early without penalty.
+4. If the user wants to keep current results: caveat strongly. Recommend a sequential testing model for the next experiment so they can stop early without penalty.
 
 (Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.)
 
@@ -111,26 +111,26 @@ A frequentist test that ends before reaching its configured target has an **infl
 
 ## 5. Live computation timeout / broken data
 
-**Verdict to consume**: `live_results_errors` non-null with `live_*` fields null.
+**What the platform tells you**: a non-null error block on the live results, with the live data path empty.
 
 ### Investigation checklist
 
 1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy.
 2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget.
 3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision.
-4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation.
+4. If cached results are recent (computed within hours), surface them with a "stale data" caveat and the cache timestamp. If the cache is days old or empty, the user must resolve the backend issue before any meaningful interpretation.
 
 ---
 
 ## 6. Experiment ran < 3 days
 
-**Verdict to compute (this one is local)**: `end_date - start_date`.
+**What to compute (this one is local)**: the elapsed time between the experiment's start and end.
 
 Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. **Refuse to interpret.** Tell the user explicitly:
 
 > _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_
 
-If `endCondition: "sample_size"` with a tiny target (e.g. 10k) was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window.
+If the experiment was sample-size-bounded and a tiny target was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window.
 
 ---
 
@@ -140,59 +140,45 @@ These don't always invalidate results, but they change how to _read_ them. Surfa
 
 ### Multiple-testing correction off with several primaries
 
-**Condition**: `settings.multipleTestingCorrection` is `"off"` or `null` AND there are 2+ primary metrics across 1+ non-control variants.
+**When**: multiple-testing correction is off AND there are 2+ primary metrics across 1+ non-control variants.
 
-**Interpretation**: any single significant primary may be a false positive. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate).
-
-**Action**: look at all primary results in aggregate. If most point the same direction, there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk** — the user can enable Benjamini-Hochberg or Bonferroni and re-analyze.
+Any single significant primary may be a false positive. With `m` uncorrected comparisons at significance level α, and assuming they are independent, the chance of at least one false positive is `1 - (1 - α)^m` (e.g. 15 primaries × 1 variant at α=0.05 → ~54%). Look at all primary results in aggregate: if most point the same direction, there is likely a real effect; if only one or two of many are significant, the result is **inconclusive due to false-positive risk**, so recommend the user enable Benjamini-Hochberg or Bonferroni and re-analyze.
 
 ### Extreme winsorization percentile
 
-**Condition**: `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` is very low (e.g. < ~80) or unusually high (e.g. > ~99).
-
-**Interpretation**: outlier capping is far from the configured platform default (typically 95 — verify in product). A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration.
+**When**: Winsorization is enabled with a percentile far from the platform's default (typically 95).
 
-**Action**: ask the user to confirm the percentile was intentional; recommend resetting to the platform default unless they have a specific reason.
+A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. Ask the user to confirm the percentile was intentional; recommend resetting to the platform default unless they have a specific reason.
 
 ### SRM check disabled
 
-**Condition**: `settings.srm == null` OR `settings.srm.enabled == false`.
+**When**: the experiment's SRM check is off.
 
-**Interpretation**: the SRM check didn't run. **Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug.
-
-**Action**: only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios). When you do flag, recommend re-enabling SRM and re-analyzing.
+**Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios); when you do flag, recommend re-enabling SRM and re-analyzing.
 
 ### CUPED on new-users-only cohort
 
-**Condition**: `settings.cuped.enabled == true` AND the experiment cohort is "new users only".
-
-**Interpretation**: CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen.
+**When**: CUPED is enabled AND the experiment cohort is "new users only".
 
-**Action**: mention as informational; no remediation needed for this experiment. For future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply.
+CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen. Mention as informational; for future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply.
 
 ### Non-default confidence level
 
-**Condition**: `settings.confidenceLevel != 0.95`.
+**When**: the experiment is configured for a confidence level other than the platform default (typically 0.95).
 
-**Interpretation**: `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative.
-
-**Action**: call out explicitly in the verdict. Combine with metric count to estimate the family-wise error rate.
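+Relative to 0.95, `0.9` (α = 0.10) doubles the per-comparison false-positive rate, while `0.99` (α = 0.01) is more conservative and needs more sample to reach significance. Call out the configured level explicitly in the verdict and combine it with the metric count to estimate the family-wise error rate.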
 
 ### Broken or placeholder metric entries
 
-**Condition**: `metrics[]` contains entries with `name == ""`.
-
-**Interpretation**: likely a broken or placeholder metric reference.
+**When**: the experiment includes metric entries with empty names.
 
-**Action**: flag and skip during analysis.
+Likely a broken or placeholder metric reference. Flag and skip during analysis.
 
 ### Primary metric with no computed result
 
-**Condition**: a primary metric appears in `metrics[]` but is **missing from both** `live_metrics` and `results_cache.metrics`.
-
-**Interpretation**: no result was computed for that primary. **This is "no measurement," not "no effect."**
+**When**: a metric is listed as primary on the experiment but has no result (neither live nor cached).
 
-**Action**: surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary.
+No result was computed for that primary. **This is "no measurement," not "no effect."** Surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary.
 
 ---
 
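A minimal sketch of the family-wise error arithmetic behind the "~54%" figure in the multiple-testing note of health-check-interpretation.md above, assuming the uncorrected comparisons are independent. The function name `family_wise_error_rate` is hypothetical and is not part of any Mixpanel API; it only shows where the number comes from.

```python
def family_wise_error_rate(num_primaries: int, num_treatments: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across all uncorrected comparisons.

    Assumes every primary-by-treatment comparison is independent, which is
    an illustrative simplification, not a statement about the platform.
    """
    comparisons = num_primaries * num_treatments
    return 1.0 - (1.0 - alpha) ** comparisons


# 15 primaries x 1 non-control variant at alpha = 0.05
print(round(family_wise_error_rate(15, 1), 3))  # → 0.537
```

This is only for explaining the risk to the user. When correction is enabled the platform applies it, so the agent should not re-correct results itself.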
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md
index 4d8189d..3a9e24c 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md
@@ -36,4 +36,4 @@ A multi-variant test where only one treatment is significantly different from co
 
 ## After concluding
 
-The decision record (`results_cache.message`, `results_cache.variant`, and `status` transitioning to `concluded` / `success` / `fail`) becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer.
+The decision record — the rationale message, the shipped variant, and the experiment's terminal status — becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer.
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md
index d8877fb..576ef9f 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md
@@ -30,7 +30,7 @@ Apply the polarity recipe from the spine — see the **Components** section of `
 
 ## Reading the p-value in this platform
 
-Mixpanel runs a frequentist comparison at the configured `settings.confidenceLevel` — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative).
+Mixpanel runs a frequentist comparison at the experiment's configured confidence level — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative).
 
 The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread.
 
@@ -70,16 +70,16 @@ Pick the phrase that matches the four-question pattern. These are the words to u
 
 Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful:
 
-1. Baseline from the control variant: `live_metrics[metricId][controlKey].value` (or the `summary.no` row where `variant == controlKey`).
+1. Baseline from the control variant's metric value (the experiment-details response carries it on the per-variant row).
 2. Lift from the winning row.
-3. Absolute lift: `baseline_value × lift`. Examples:
+3. Absolute lift: `baseline × lift`. Examples:
    - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate.
    - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`.
 4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week."
 
-### Fallback when `value` / `sampleSize` are null
+### Fallback when the baseline value or sample size is missing
 
-Common — happens whenever live computation timed out or `results_cache.metrics` was nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.**
+Common — happens whenever live computation timed out or the cached results were nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.**
 
 Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation:
 
@@ -117,8 +117,8 @@ Different metric types behave differently; cite the relevant nuance in your verd
 
 ## Variance-reduction & outlier settings that change interpretation
 
-- **CUPED** (`settings.cuped.enabled == true`): mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix).
-- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md).
+- **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix).
+- **Winsorization enabled**: extreme values capped at the configured percentile, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A percentile much lower than the platform default (typically 95) is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md).
 
 ---
 
@@ -130,7 +130,7 @@ Different metric types behave differently; cite the relevant nuance in your verd
 | **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude.                                                                                                                                          |
 | **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. |
 
-If `settings.multipleTestingCorrection` is `"off"` AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled.
+If multiple-testing correction is off AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled.
 
 ---
 
@@ -153,7 +153,7 @@ For the full "why hasn't this hit statsig yet" walk-through, see [why-no-statsig
 
 ## Frequentist vs Sequential — what affects per-metric reading
 
-Check `settings.testingModel`:
+Check the experiment's testing model:
 
 - `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration.
 - `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended.
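The practical-significance steps in per-metric-interpretation.md above (baseline, lift, absolute lift, projection) can be sketched as plain arithmetic. The helper names `absolute_lift` and `projected_impact` are hypothetical, and the traffic figures are the illustrative ones from the text, not platform output.

```python
def absolute_lift(baseline: float, relative_lift: float) -> float:
    """Convert a relative lift into the metric's own units: baseline x lift."""
    return baseline * relative_lift


def projected_impact(baseline: float, relative_lift: float, users_per_period: float) -> float:
    """Scale the absolute per-user lift to a population for one period."""
    return absolute_lift(baseline, relative_lift) * users_per_period


# Conversion metric: 2% baseline with a 4% relative lift.
conversion_delta = absolute_lift(0.02, 0.04)
print(f"{conversion_delta * 100:+.2f} percentage points")

# Count metric: 12.4 events/user/week with a -5% relative lift.
print(f"{absolute_lift(12.4, -0.05):+.2f} events/user/week")

# The same 5% lift reads very differently at different scales.
print(f"{projected_impact(0.20, 0.05, 1_000_000):,.0f} extra conversions/week")
print(f"{projected_impact(0.001, 0.05, 1_000):.2f} extra conversions/week")
```

The last two lines mirror the contrast in step 4: the relative lift is identical, but the projected volume differs by several orders of magnitude.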
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/session-replay-analysis.md
index 59ad25e..7282bb4 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/session-replay-analysis.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/session-replay-analysis.md
@@ -83,7 +83,7 @@ If treatment users _arrive_ at a screen more often but _complete_ at a lower per
 
 Replay analysis is qualitative. Be honest about that.
 
-- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in `live_metrics`."_
+- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in the experiment's results."_
 - ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict.
 
 Tie observations back to specific quantitative results from the experiment-details response. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior.
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
index 7cc432a..dbda2af 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
@@ -17,9 +17,9 @@ Before answering "why no statsig?", run the **trustworthiness gate**. If anythin
 
 Also check:
 
-- `lift is None` on the primary → no measurement, not "no effect."
-- The primary is in `metrics[]` but missing from `live_metrics` and `results_cache.metrics` → "no measurement."
-- `live_results_errors` is non-null → results are stale or partial; resolve before drawing power conclusions.
+- The primary's lift is missing or null → no measurement, not "no effect."
+- The primary is listed on the experiment but has no computed result (live or cached) → "no measurement," not "no effect."
+- The live results carry an error block → results are stale or partial; resolve the backend issue before drawing power conclusions.
 
 ---
 
@@ -29,7 +29,7 @@ Walk through these in order. The first one that explains the picture is usually
 
 ### 1. Not enough sample yet (not enough exposures)
 
-**What to look at**: `live_exposures` per variant vs `settings.sampleSize`; or `end_date - start_date` vs `start_date + settings.endAfterDays`; plus `settings.testingModel`.
+**What to check**: per-variant exposure counts against the configured end target (sample size or duration, whichever the experiment was configured with), and which testing model the experiment is using.
 
 - **Sequential** + target not reached → genuinely too early. Recommend **WAIT**.
 - **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe.
@@ -39,7 +39,7 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu
 
 ### 2. Observed effect is smaller than the MDE
 
-**What to look at**: the lift on the primary in `live_metrics[primary][treatment].lift`, plus the MDE the user planned for (typically captured in the experiment's `description` or recovered via the setup-side skill's power math).
+**What to check**: the lift on the primary metric, plus the MDE the user planned for (typically captured in the experiment's hypothesis/description, or recovered via the setup-side skill's power math).
 
 - Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1.
 - Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options:
@@ -49,9 +49,9 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu
 
 ### 3. Variance is too high (metric is too noisy)
 
-**What to look at**: distribution type of the metric, plus `settings.cuped.enabled` and `settings.winsorization.enabled`.
+**What to check**: the metric's distribution type, plus whether CUPED and Winsorization are enabled.
 
-- **Gaussian** metric (revenue, time-on-page) with no winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization (default percentile 95) on the next run.
+- **Gaussian** metric (revenue, time-on-page) with no Winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization on the next run.
 - **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume.
 - **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the absolute scale of detectable effects. Lifts near 50% rates are easiest; lifts near 0%/100% need much more sample.
 - **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%.
@@ -59,7 +59,7 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu
 
 ### 4. Traffic split is starving the variant
 
-**What to look at**: `settings.srm.targetAllocations` and `live_exposures` per variant.
+**What to check**: the configured traffic split against the actual per-variant exposure counts.
 
 - Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue.
 - Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later.
@@ -69,11 +69,11 @@ Never change traffic allocation mid-Frequentist test — it invalidates the SRM
 
 ### 5. Exposure config is filtering more users than the user expects
 
-**What to look at**: the exposure tracking method (`$experiment_started` event volume), any audience filters on the backing feature flag, and `settings.excludeQA`.
+**What to check**: exposure event volume, any audience filters on the backing feature flag, and whether QA traffic is being excluded.
 
-- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query `$experiment_started` to confirm how many users actually got exposed.
+- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query the exposure event to confirm how many users actually got exposed.
 - The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event.
-- `settings.excludeQA` was off and you suspect internal traffic is dominating one variant → enable it on the next run (results then are cleaner but also smaller).
+- QA traffic isn't being excluded and you suspect internal traffic is dominating one variant → enable the QA exclusion on the next run (results will be cleaner, but the sample will be smaller).
 
 **Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md).
 
@@ -93,7 +93,7 @@ Once you know which reason fits, the recommendation almost picks itself.
 | Exposure config is filtering           | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample.               |
 | Experiment finished, well-powered      | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters.       |
 
-When recommending EXTEND on an active experiment, the action is an experiment update with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the `experiment-setup` skill for the power math.
+When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `experiment-setup` skill for the power math.
 
 ---
 
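A rough sketch of why a skewed split "starves" the smaller variant, as described under reason 4 of why-no-statsig.md above. It assumes a two-variant test with equal per-user variance in both arms, so the variance of the difference in means is proportional to `1/n_control + 1/n_treatment`. The function name is hypothetical, and this is textbook two-sample arithmetic, not the platform's power calculation.

```python
def traffic_needed_vs_even_split(treatment_share: float) -> float:
    """Total traffic needed, as a multiple of a 50/50 split, for equal precision.

    Assumes two variants with equal per-user variance. A 50/50 split gives
    1/0.5 + 1/0.5 = 4, so the ratio is normalized by 4.
    """
    if not 0.0 < treatment_share < 1.0:
        raise ValueError("treatment_share must be strictly between 0 and 1")
    control_share = 1.0 - treatment_share
    return (1.0 / control_share + 1.0 / treatment_share) / 4.0


print(round(traffic_needed_vs_even_split(0.5), 2))  # → 1.0
print(round(traffic_needed_vs_even_split(0.1), 2))  # → 2.78
```

Under these assumptions a 90/10 split needs roughly 2.8 times the total traffic of a 50/50 split to reach the same precision, which is why the smaller variant reaches significance much later.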
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
index c205f29..18b15f7 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: interpret-experiment
-description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results or to decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns; never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config); those belong to the `experiment-setup` skill.
 license: Apache-2.0
 ---
 
@@ -48,7 +48,7 @@ Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"):
 
 A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first (use `settings.controlKey`).
 
-The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**.
+The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**.
 
 ## Data-source fallback
 
@@ -74,6 +74,8 @@ Top-down: what to do, in order.
 
 ## 1. Fetch the experiment
 
+If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs.
+
 Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments.
 
 Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret.
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md
index a0658e2..1edc9fa 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md
@@ -18,11 +18,11 @@ These two principles drive the recommendations below. Lead with them when explai
 
 ## 1. SRM (Sample Ratio Mismatch)
 
-**Verdict to consume**: `live_srm_analysis` (or `exposures_cache.$srm_analysis`). The platform tags failing SRMs already; do not compute chi-square yourself.
+**What the platform tells you**: the SRM verdict the experiment-details response carries (live, or cached when live isn't available). The platform already tags failing SRMs: consume that verdict and do not compute chi-square yourself.
 
 ### What it means
 
-Users were assigned to variants in proportions that disagree with the configured `settings.srm.targetAllocations`. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness.
+Users were assigned to variants in proportions that disagree with the configured target allocation. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness.
 
 ### Likely causes, ordered most → least likely
 
@@ -31,23 +31,23 @@ Users were assigned to variants in proportions that disagree with the configured
 1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees.
 2. **biased_assignment** — The assignment criterion correlates with the variant — e.g. assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window.
 3. **bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation.
-4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the `$experiment_started` event fires exactly once per user per variant assignment.
+4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the exposure event fires exactly once per user per variant assignment.
 5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period.
 
 ### Recommended actions
 
 - **pause_and_investigate** — Pause the experiment before drawing any conclusions. SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable.
 - **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric.
-- **investigate_exposure_logging** — Inspect `$experiment_started` event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs.
+- **investigate_exposure_logging** — Compare exposure event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs.
 - **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split.
 
 ### Investigation checklist
 
-1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented?
+1. Compare the actual per-variant exposure ratio to the configured target allocation — which variant is over/under-represented?
 2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history.
 3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math.
-4. Verify SDK version and bucketing logic. Query `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly.
-5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it.
+4. Verify SDK version and bucketing logic. Query the exposure event grouped by variant to confirm exposure events are flowing correctly.
+5. Check for bot/QA traffic — bots often skew toward control. If QA traffic isn't being excluded, recommend enabling that filter.
 6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting.
 7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.**
 
@@ -55,7 +55,7 @@ Users were assigned to variants in proportions that disagree with the configured
 
 ## 2. Retro A/A (pre-experiment bias) failure
 
-**Verdict to consume**: the analysis the platform attached when `settings.preExperimentBias` is enabled.
+**What the platform tells you**: the pre-experiment-bias analysis the platform attaches when that check is enabled in the experiment's settings.
 
 ### What it means
 
@@ -76,14 +76,14 @@ The same statistical comparison run on the **pre-exposure** period revealed that
 
 ## 3. Insufficient exposures
 
-**Verdict to consume**: `live_exposures` per variant, plus any platform-attached "insufficient" flag. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue.
+**What the platform tells you**: per-variant exposure counts plus an "insufficient" flag when the count is too low to trust. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue.
 
 ### Investigation checklist
 
-1. Check `live_exposures` totals — which variant is undersampled?
+1. Check per-variant exposure totals — which variant is undersampled?
 2. Inspect feature-flag rollout — was rollout dialed back?
 3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?).
-4. If the experiment is still ACTIVE: extend duration via an experiment update with a new `endAfterDays`.
+4. If the experiment is still ACTIVE: extend duration via an experiment update with a new end target.
 5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math.
 
 If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question.
@@ -92,7 +92,7 @@ If the user wants to talk about _why_ a primary metric is still inconclusive eve
 
 ## 4. Frequentist peeking
 
-**Verdict to consume**: `settings.testingModel == "frequentist"`, plus `end_date` vs `start_date + endAfterDays` (or `sampleSize` vs `live_exposures.$overall`, depending on `settings.endCondition`).
+**What to check**: the experiment's testing model and whether it ended before reaching its configured end condition (sample size or duration, whichever was configured).
 
 ### What it means
 
@@ -100,10 +100,10 @@ A frequentist test that ends before reaching its configured target has an **infl
 
 ### Investigation checklist
 
-1. Confirm `settings.testingModel == "frequentist"`.
-2. Compare `end_date` against `start_date + endAfterDays` (or whether `sampleSize` was reached, whichever is the configured `endCondition`).
+1. Confirm the testing model is frequentist (sequential tests don't have this problem).
+2. Compare the actual end date against the planned end (date- or sample-based, whichever the experiment was configured with).
 3. If the conclusion was premature: results have inflated false-positive rate. Recommend a re-run.
-4. If the user wants to keep current results: caveat strongly. Recommend `testingModel: "sequential"` for the next experiment so they can stop early without penalty.
+4. If the user wants to keep current results: caveat strongly. Recommend a sequential testing model for the next experiment so they can stop early without penalty.
 
 (Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.)
 
@@ -111,26 +111,26 @@ A frequentist test that ends before reaching its configured target has an **infl
 
 ## 5. Live computation timeout / broken data
 
-**Verdict to consume**: `live_results_errors` non-null with `live_*` fields null.
+**What the platform tells you**: a non-null error block on the live results, with the live data path empty.
 
 ### Investigation checklist
 
 1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy.
 2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget.
 3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision.
-4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation.
+4. If the cached results were computed recently (within the last few hours), surface them with a "stale data" caveat and the last-computed timestamp. If the cache is days old or empty, the user must resolve the backend issue before any meaningful interpretation.
 
 ---
 
 ## 6. Experiment ran < 3 days
 
-**Verdict to compute (this one is local)**: `end_date - start_date`.
+**What to compute (this one is local)**: the elapsed time between the experiment's start and end.
 
 Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. **Refuse to interpret.** Tell the user explicitly:
 
 > _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_
 
-If `endCondition: "sample_size"` with a tiny target (e.g. 10k) was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window.
+If the experiment was sample-size-bounded and a tiny target was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window.
 
 ---
 
@@ -140,59 +140,45 @@ These don't always invalidate results, but they change how to _read_ them. Surfa
 
 ### Multiple-testing correction off with several primaries
 
-**Condition**: `settings.multipleTestingCorrection` is `"off"` or `null` AND there are 2+ primary metrics across 1+ non-control variants.
+**When**: multiple-testing correction is off AND there are 2+ primary metrics across 1+ non-control variants.
 
-**Interpretation**: any single significant primary may be a false positive. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate).
-
-**Action**: look at all primary results in aggregate. If most point the same direction, there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk** — the user can enable Benjamini-Hochberg or Bonferroni and re-analyze.
+Any single significant primary may be a false positive. Assuming independent tests, the family-wise error rate is 1 − (1 − α)^m for m comparisons, so it climbs quickly with metric count (e.g. 15 primaries × 1 variant at α=0.05 → 1 − 0.95^15 ≈ 54% chance of at least one false positive). Look at all primary results in aggregate: if most point the same direction, there is likely a real effect; if only one or two of many are significant, the result is **inconclusive due to false-positive risk**, so recommend the user enable Benjamini-Hochberg or Bonferroni and re-analyze.
 
 ### Extreme winsorization percentile
 
-**Condition**: `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` is very low (e.g. < ~80) or unusually high (e.g. > ~99).
-
-**Interpretation**: outlier capping is far from the configured platform default (typically 95 — verify in product). A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration.
+**When**: Winsorization is enabled with a percentile far from the platform's default (typically 95).
 
-**Action**: ask the user to confirm the percentile was intentional; recommend resetting to the platform default unless they have a specific reason.
+A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. Ask the user to confirm the percentile was intentional; recommend resetting to the platform default unless they have a specific reason.
 
 ### SRM check disabled
 
-**Condition**: `settings.srm == null` OR `settings.srm.enabled == false`.
+**When**: the experiment's SRM check is off.
 
-**Interpretation**: the SRM check didn't run. **Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug.
-
-**Action**: only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios). When you do flag, recommend re-enabling SRM and re-analyzing.
+**Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios); when you do flag, recommend re-enabling SRM and re-analyzing.
 
 ### CUPED on new-users-only cohort
 
-**Condition**: `settings.cuped.enabled == true` AND the experiment cohort is "new users only".
-
-**Interpretation**: CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen.
+**When**: CUPED is enabled AND the experiment cohort is "new users only".
 
-**Action**: mention as informational; no remediation needed for this experiment. For future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply.
+CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen. Mention as informational; for future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply.
 
 ### Non-default confidence level
 
-**Condition**: `settings.confidenceLevel != 0.95`.
+**When**: the experiment is configured for a confidence level other than the platform default (typically 0.95).
 
-**Interpretation**: `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative.
-
-**Action**: call out explicitly in the verdict. Combine with metric count to estimate the family-wise error rate.
+`0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call it out explicitly in the verdict, and combine it with the metric count to estimate the family-wise error rate.
 
 ### Broken or placeholder metric entries
 
-**Condition**: `metrics[]` contains entries with `name == ""`.
-
-**Interpretation**: likely a broken or placeholder metric reference.
+**When**: the experiment includes metric entries with empty names.
 
-**Action**: flag and skip during analysis.
+Likely a broken or placeholder metric reference. Flag and skip during analysis.
 
 ### Primary metric with no computed result
 
-**Condition**: a primary metric appears in `metrics[]` but is **missing from both** `live_metrics` and `results_cache.metrics`.
-
-**Interpretation**: no result was computed for that primary. **This is "no measurement," not "no effect."**
+**When**: a metric is listed as primary on the experiment but has no result (neither live nor cached).
 
-**Action**: surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary.
+No result was computed for that primary. **This is "no measurement," not "no effect."** Surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary.
 
 ---
 
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md
index 4d8189d..3a9e24c 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md
@@ -36,4 +36,4 @@ A multi-variant test where only one treatment is significantly different from co
 
 ## After concluding
 
-The decision record (`results_cache.message`, `results_cache.variant`, and `status` transitioning to `concluded` / `success` / `fail`) becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer.
+The decision record — the rationale message, the shipped variant, and the experiment's terminal status — becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer.
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md
index d8877fb..576ef9f 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md
@@ -30,7 +30,7 @@ Apply the polarity recipe from the spine — see the **Components** section of `
 
 ## Reading the p-value in this platform
 
-Mixpanel runs a frequentist comparison at the configured `settings.confidenceLevel` — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative).
+Mixpanel runs a frequentist comparison at the experiment's configured confidence level — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative).
 
 The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread.
 
@@ -70,16 +70,16 @@ Pick the phrase that matches the four-question pattern. These are the words to u
 
 Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful:
 
-1. Baseline from the control variant: `live_metrics[metricId][controlKey].value` (or the `summary.no` row where `variant == controlKey`).
+1. Baseline from the control variant's metric value (the experiment-details response carries it on the per-variant row).
 2. Lift from the winning row.
-3. Absolute lift: `baseline_value × lift`. Examples:
+3. Absolute lift: `baseline × lift`. Examples:
    - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate.
    - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`.
 4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week."
 
-### Fallback when `value` / `sampleSize` are null
+### Fallback when the baseline value or sample size is missing
 
-Common — happens whenever live computation timed out or `results_cache.metrics` was nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.**
+Common — happens whenever live computation timed out or the cached results were nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.**
 
 Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation:
 
@@ -117,8 +117,8 @@ Different metric types behave differently; cite the relevant nuance in your verd
 
 ## Variance-reduction & outlier settings that change interpretation
 
-- **CUPED** (`settings.cuped.enabled == true`): mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix).
-- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md).
+- **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix).
+- **Winsorization enabled**: extreme values capped at the configured percentile, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A percentile much lower than the platform default (typically 95) is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md).
 
 ---
 
@@ -130,7 +130,7 @@ Different metric types behave differently; cite the relevant nuance in your verd
 | **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude.                                                                                                                                          |
 | **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. |
 
-If `settings.multipleTestingCorrection` is `"off"` AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled.
+If multiple-testing correction is off AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled.
 
 ---
 
@@ -153,7 +153,7 @@ For the full "why hasn't this hit statsig yet" walk-through, see [why-no-statsig
 
 ## Frequentist vs Sequential — what affects per-metric reading
 
-Check `settings.testingModel`:
+Check the experiment's testing model:
 
 - `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration.
 - `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended.
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/session-replay-analysis.md
index 59ad25e..7282bb4 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/session-replay-analysis.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/session-replay-analysis.md
@@ -83,7 +83,7 @@ If treatment users _arrive_ at a screen more often but _complete_ at a lower per
 
 Replay analysis is qualitative. Be honest about that.
 
-- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in `live_metrics`."_
+- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in the experiment's results."_
 - ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict.
 
 Tie observations back to specific quantitative results from the experiment-details response. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior.
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
index 7cc432a..dbda2af 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
@@ -17,9 +17,9 @@ Before answering "why no statsig?", run the **trustworthiness gate**. If anythin
 
 Also check:
 
-- `lift is None` on the primary → no measurement, not "no effect."
-- The primary is in `metrics[]` but missing from `live_metrics` and `results_cache.metrics` → "no measurement."
-- `live_results_errors` is non-null → results are stale or partial; resolve before drawing power conclusions.
+- The primary's lift is missing or null → no measurement, not "no effect."
+- The primary is listed on the experiment but has no computed result (live or cached) → "no measurement," not "no effect."
+- The live results carry an error block → results are stale or partial; resolve the backend issue before drawing power conclusions.
 
 ---
 
@@ -29,7 +29,7 @@ Walk through these in order. The first one that explains the picture is usually
 
 ### 1. Not enough sample yet (not enough exposures)
 
-**What to look at**: `live_exposures` per variant vs `settings.sampleSize`; or `end_date - start_date` vs `start_date + settings.endAfterDays`; plus `settings.testingModel`.
+**What to check**: progress against the configured end target (per-variant exposure counts for a sample-size target, or elapsed run time for a duration target, whichever the experiment was configured with), and which testing model the experiment is using.
 
 - **Sequential** + target not reached → genuinely too early. Recommend **WAIT**.
 - **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe.
@@ -39,7 +39,7 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu
 
 ### 2. Observed effect is smaller than the MDE
 
-**What to look at**: the lift on the primary in `live_metrics[primary][treatment].lift`, plus the MDE the user planned for (typically captured in the experiment's `description` or recovered via the setup-side skill's power math).
+**What to check**: the lift on the primary metric, plus the MDE the user planned for (typically captured in the experiment's hypothesis/description, or recovered via the setup-side skill's power math).
 
 - Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1.
 - Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options:
@@ -49,9 +49,9 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu
 
 ### 3. Variance is too high (metric is too noisy)
 
-**What to look at**: distribution type of the metric, plus `settings.cuped.enabled` and `settings.winsorization.enabled`.
+**What to check**: the metric's distribution type, plus whether CUPED and Winsorization are enabled.
 
-- **Gaussian** metric (revenue, time-on-page) with no winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization (default percentile 95) on the next run.
+- **Gaussian** metric (revenue, time-on-page) with no Winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization on the next run.
 - **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume.
 - **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the absolute scale of detectable effects. Lifts near 50% rates are easiest; lifts near 0%/100% need much more sample.
 - **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%.
@@ -59,7 +59,7 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu
 
 ### 4. Traffic split is starving the variant
 
-**What to look at**: `settings.srm.targetAllocations` and `live_exposures` per variant.
+**What to check**: the configured traffic split against the actual per-variant exposure counts.
 
 - Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue.
 - Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later.
@@ -69,11 +69,11 @@ Never change traffic allocation mid-Frequentist test — it invalidates the SRM
 
 ### 5. Exposure config is filtering more users than the user expects
 
-**What to look at**: the exposure tracking method (`$experiment_started` event volume), any audience filters on the backing feature flag, and `settings.excludeQA`.
+**What to check**: exposure event volume, any audience filters on the backing feature flag, and whether QA traffic is being excluded.
 
-- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query `$experiment_started` to confirm how many users actually got exposed.
+- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query the exposure event to confirm how many users actually got exposed.
 - The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event.
-- `settings.excludeQA` was off and you suspect internal traffic is dominating one variant → enable it on the next run (results then are cleaner but also smaller).
+- QA traffic isn't being excluded and you suspect internal traffic is dominating one variant → enable the QA exclusion on the next run (results then are cleaner but also smaller).
 
 **Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md).
 
@@ -93,7 +93,7 @@ Once you know which reason fits, the recommendation almost picks itself.
 | Exposure config is filtering           | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample.               |
 | Experiment finished, well-powered      | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters.       |
 
-When recommending EXTEND on an active experiment, the action is an experiment update with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the `experiment-setup` skill for the power math.
+When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `experiment-setup` skill for the power math.
 
 ---
 

From 55bc4ba2e95d1ddd8b16d747f994a29e3c0882f3 Mon Sep 17 00:00:00 2001
From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com>
Date: Tue, 9 Jun 2026 18:58:58 +0000
Subject: [PATCH 08/11] interpret-experiment: phase-3 editorial cleanup
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Final pass from the hardcore /review-skill audit.

- per-metric-interpretation.md: collapse "Significance = NO does NOT mean
  'no effect'" (was a 14-line duplicate of why-no-statsig.md's options list)
  into 3 lines with a back-reference. Same for "Frequentist vs Sequential —
  what affects per-metric reading" (was 8 lines duplicating
  health-check-interpretation.md §4) → 4 lines with a back-reference.
- health-check-interpretation.md §7 Misconfigurations: drop the
  When/Interpretation/Action triple-label scaffolding. Each of the 7 sub-
  sections is now a single bold "condition" sentence opening a single
  paragraph of consequence + action. Same information, ~25 lines lighter.

Skill total: 988 → 950 lines (-38). health-check-interpretation.md:
206 → 178 (-14%). per-metric-interpretation.md: 181 → 169.

Sync via make sync-skills FORCE=1; make check-skills-sync passes.

Assisted by Claude
---
 .../references/health-check-interpretation.md | 28 +++++--------------
 .../references/per-metric-interpretation.md   | 22 ++++-----------
 .../references/health-check-interpretation.md | 28 +++++--------------
 .../references/per-metric-interpretation.md   | 22 ++++-----------
 .../references/health-check-interpretation.md | 28 +++++--------------
 .../references/per-metric-interpretation.md   | 22 ++++-----------
 6 files changed, 36 insertions(+), 114 deletions(-)
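
Note for reviewers: the ~54% figure in the multiple-testing paragraph is consistent with the standard family-wise error rate formula, 1 - (1 - α)^m, if the m uncorrected comparisons are treated as independent. That independence is an assumption made here for illustration; correlated metrics would give a lower rate. A minimal Python sketch follows; the helper name is hypothetical and is not part of the skill or the platform.

```python
def family_wise_error_rate(alpha: float, num_primaries: int, num_variants: int = 1) -> float:
    """Chance of at least one false positive across all uncorrected comparisons.

    Assumes every comparison is independent, which is an illustrative
    simplification; real experiment metrics are usually correlated.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    if num_primaries < 1 or num_variants < 1:
        raise ValueError("need at least one primary metric and one non-control variant")
    # One comparison per (primary metric, non-control variant) pair.
    comparisons = num_primaries * num_variants
    return 1.0 - (1.0 - alpha) ** comparisons


if __name__ == "__main__":
    # 15 primaries x 1 variant at the 0.95 confidence level (alpha = 0.05).
    print(f"{family_wise_error_rate(0.05, 15):.1%}")  # prints 53.7%
    # Same experiment at the 0.9 confidence level (alpha = 0.10).
    print(f"{family_wise_error_rate(0.10, 15):.1%}")  # prints 79.4%
```

The second call shows why the non-default confidence level paragraph asks for alpha to be combined with metric count: loosening alpha from 0.05 to 0.10 raises the family-wise rate substantially for the same 15 primaries.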

diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md
index 1edc9fa..8875ca2 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md
@@ -140,45 +140,31 @@ These don't always invalidate results, but they change how to _read_ them. Surfa
 
 ### Multiple-testing correction off with several primaries
 
-**When**: multiple-testing correction is off AND there are 2+ primary metrics across 1+ non-control variants.
-
-Any single significant primary may be a false positive. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). Look at all primary results in aggregate: if most point the same direction, there is likely a real effect; if only one or two of many are significant, the result is **inconclusive due to false-positive risk** — recommend the user enable Benjamini-Hochberg or Bonferroni and re-analyze.
+**Correction off AND 2+ primaries × 1+ non-control variants.** Any single significant primary may be a false positive: the family-wise error rate compounds with every added comparison (e.g. 15 primaries × 1 variant at α=0.05 → ~54% chance of at least one false positive, assuming independent tests). Look at primaries in aggregate: if most point the same direction, the effect is likely real; if only one or two of many are significant, recommend enabling Benjamini-Hochberg or Bonferroni and re-analyzing.
 
 ### Extreme winsorization percentile
 
-**When**: Winsorization is enabled with a percentile far from the platform's default (typically 95).
-
-Outlier capping is far from the platform default. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. Ask the user to confirm the percentile was intentional; recommend resetting to the default unless they have a specific reason.
+**Winsorization enabled with a percentile far from the platform default (typically 95).** A percentile near 50 caps almost all data — almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason.
 
 ### SRM check disabled
 
-**When**: the experiment's SRM check is off.
-
-**Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios); when you do flag, recommend re-enabling SRM and re-analyzing.
+**SRM check is off.** Often deliberate — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) and then recommend re-enabling SRM and re-analyzing.
 
 ### CUPED on new-users-only cohort
 
-**When**: CUPED is enabled AND the experiment cohort is "new users only".
-
-CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen. Mention as informational; for future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply.
+**CUPED enabled AND the cohort is "new users only".** CUPED needs pre-exposure data, so it had no effect here, but **results are still valid**; variance reduction just didn't happen. Mention as informational. For future experiments on this surface, suggest extending the cohort to include returning users so CUPED can apply.
 
 ### Non-default confidence level
 
-**When**: the experiment is configured for a confidence level other than the platform default (typically 0.95).
-
-`0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out explicitly in the verdict and combine with metric count to estimate the family-wise error rate.
+**Confidence level differs from the platform default (typically 0.95).** `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out in the verdict and combine with metric count to estimate the family-wise error rate.
 
 ### Broken or placeholder metric entries
 
-**When**: the experiment includes metric entries with empty names.
-
-Likely a broken or placeholder metric reference. Flag and skip during analysis.
+**Metric entries with empty names.** Likely broken or placeholder references. Flag and skip during analysis.
 
 ### Primary metric with no computed result
 
-**When**: a metric is listed as primary on the experiment but has no result (neither live nor cached).
-
-No result was computed for that primary. **This is "no measurement," not "no effect."** Surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary.
+**A metric is listed as primary but has no result (live or cached).** This is **"no measurement," not "no effect."** Surface prominently; recommend re-syncing results before any conclusion that depends on this primary.
 
 ---
 
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md
index 576ef9f..7907e90 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md
@@ -134,31 +134,19 @@ If multiple-testing correction is off AND there are 2+ primaries × 1+ non-contr
 
 ---
 
-## "Significance = NO" does NOT mean "no effect"
+## When a primary metric is inconclusive
 
-A row in `summary.no` means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level. **Important when the user is about to call something a null result.**
+A "not significant" verdict means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level — **not that there is no effect.** Important when the user is about to call something a null result.
 
-Options to suggest when a primary metric lands in `summary.no`:
-
-1. **Extend duration** (if the experiment is still ACTIVE).
-2. **Increase traffic allocation** (if there's headroom — never mid-Frequentist-test, which invalidates SRM).
-3. **Use Sequential testing model** for the next experiment if continuous monitoring fits.
-4. **Enable CUPED** if the metric correlates with pre-exposure behavior.
-5. **Narrow the hypothesis** — test a stronger version, or scope to a more responsive segment.
-6. **Accept the null** — if the experiment was well-powered for the MDE that matters, "no effect" is a real finding.
-
-For the full "why hasn't this hit statsig yet" walk-through, see [why-no-statsig.md](why-no-statsig.md).
+For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see [why-no-statsig.md](why-no-statsig.md).
 
 ---
 
 ## Frequentist vs Sequential — what affects per-metric reading
 
-Check the experiment's testing model:
-
-- `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration.
-- `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended.
+Concluding a Frequentist experiment before it reaches its configured target is a peeking event — per-metric significance verdicts become unreliable. Sequential experiments are designed for continuous monitoring and don't have this problem.
 
-Concluding a Frequentist experiment before it reaches its target is a peeking event. Flag it in the verdict.
+For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of [health-check-interpretation.md](health-check-interpretation.md).
 
 ---
 
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md
index 1edc9fa..8875ca2 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md
@@ -140,45 +140,31 @@ These don't always invalidate results, but they change how to _read_ them. Surfa
 
 ### Multiple-testing correction off with several primaries
 
-**When**: multiple-testing correction is off AND there are 2+ primary metrics across 1+ non-control variants.
-
-Any single significant primary may be a false positive. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). Look at all primary results in aggregate: if most point the same direction, there is likely a real effect; if only one or two of many are significant, the result is **inconclusive due to false-positive risk** — recommend the user enable Benjamini-Hochberg or Bonferroni and re-analyze.
+**Correction off AND 2+ primaries × 1+ non-control variants.** Any single significant primary may be a false positive: the family-wise error rate compounds with every added comparison (e.g. 15 primaries × 1 variant at α=0.05 → ~54% chance of at least one false positive, assuming independent tests). Look at primaries in aggregate: if most point the same direction, the effect is likely real; if only one or two of many are significant, recommend enabling Benjamini-Hochberg or Bonferroni and re-analyzing.
 
 ### Extreme winsorization percentile
 
-**When**: Winsorization is enabled with a percentile far from the platform's default (typically 95).
-
-Outlier capping is far from the platform default. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. Ask the user to confirm the percentile was intentional; recommend resetting to the default unless they have a specific reason.
+**Winsorization enabled with a percentile far from the platform default (typically 95).** A percentile near 50 caps almost all data — almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason.
 
 ### SRM check disabled
 
-**When**: the experiment's SRM check is off.
-
-**Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios); when you do flag, recommend re-enabling SRM and re-analyzing.
+**SRM check is off.** Often deliberate — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) and then recommend re-enabling SRM and re-analyzing.
 
 ### CUPED on new-users-only cohort
 
-**When**: CUPED is enabled AND the experiment cohort is "new users only".
-
-CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen. Mention as informational; for future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply.
+**CUPED enabled AND the cohort is "new users only".** CUPED needs pre-exposure data, so it had no effect here, but **results are still valid**; variance reduction just didn't happen. Mention as informational. For future experiments on this surface, suggest extending the cohort to include returning users so CUPED can apply.
 
 ### Non-default confidence level
 
-**When**: the experiment is configured for a confidence level other than the platform default (typically 0.95).
-
-`0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out explicitly in the verdict and combine with metric count to estimate the family-wise error rate.
+**Confidence level differs from the platform default (typically 0.95).** `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out in the verdict and combine with metric count to estimate the family-wise error rate.
 
 ### Broken or placeholder metric entries
 
-**When**: the experiment includes metric entries with empty names.
-
-Likely a broken or placeholder metric reference. Flag and skip during analysis.
+**Metric entries with empty names.** Likely broken or placeholder references. Flag and skip during analysis.
 
 ### Primary metric with no computed result
 
-**When**: a metric is listed as primary on the experiment but has no result (neither live nor cached).
-
-No result was computed for that primary. **This is "no measurement," not "no effect."** Surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary.
+**A metric is listed as primary but has no result (live or cached).** This is **"no measurement," not "no effect."** Surface prominently; recommend re-syncing results before any conclusion that depends on this primary.
 
 ---
 
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md
index 576ef9f..7907e90 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md
@@ -134,31 +134,19 @@ If multiple-testing correction is off AND there are 2+ primaries × 1+ non-contr
 
 ---
 
-## "Significance = NO" does NOT mean "no effect"
+## When a primary metric is inconclusive
 
-A row in `summary.no` means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level. **Important when the user is about to call something a null result.**
+A "not significant" verdict means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level — **not that there is no effect.** Important when the user is about to call something a null result.
 
-Options to suggest when a primary metric lands in `summary.no`:
-
-1. **Extend duration** (if the experiment is still ACTIVE).
-2. **Increase traffic allocation** (if there's headroom — never mid-Frequentist-test, which invalidates SRM).
-3. **Use Sequential testing model** for the next experiment if continuous monitoring fits.
-4. **Enable CUPED** if the metric correlates with pre-exposure behavior.
-5. **Narrow the hypothesis** — test a stronger version, or scope to a more responsive segment.
-6. **Accept the null** — if the experiment was well-powered for the MDE that matters, "no effect" is a real finding.
-
-For the full "why hasn't this hit statsig yet" walk-through, see [why-no-statsig.md](why-no-statsig.md).
+For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see [why-no-statsig.md](why-no-statsig.md).
 
 ---
 
 ## Frequentist vs Sequential — what affects per-metric reading
 
-Check the experiment's testing model:
-
-- `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration.
-- `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended.
+Concluding a Frequentist experiment before it reaches its configured target is a peeking event — per-metric significance verdicts become unreliable. Sequential experiments are designed for continuous monitoring and don't have this problem.
 
-Concluding a Frequentist experiment before it reaches its target is a peeking event. Flag it in the verdict.
+For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of [health-check-interpretation.md](health-check-interpretation.md).
 
 ---
 
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md
index 1edc9fa..8875ca2 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md
@@ -140,45 +140,31 @@ These don't always invalidate results, but they change how to _read_ them. Surfa
 
 ### Multiple-testing correction off with several primaries
 
-**When**: multiple-testing correction is off AND there are 2+ primary metrics across 1+ non-control variants.
-
-Any single significant primary may be a false positive. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). Look at all primary results in aggregate: if most point the same direction, there is likely a real effect; if only one or two of many are significant, the result is **inconclusive due to false-positive risk** — recommend the user enable Benjamini-Hochberg or Bonferroni and re-analyze.
+**Correction off AND 2+ primaries × 1+ non-control variants.** Any single significant primary may be a false positive: the family-wise error rate compounds with every added comparison (e.g. 15 primaries × 1 variant at α=0.05 → ~54% chance of at least one false positive, assuming independent tests). Look at primaries in aggregate: if most point the same direction, the effect is likely real; if only one or two of many are significant, recommend enabling Benjamini-Hochberg or Bonferroni and re-analyzing.
 
 ### Extreme winsorization percentile
 
-**When**: Winsorization is enabled with a percentile far from the platform's default (typically 95).
-
-Outlier capping is far from the platform default. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. Ask the user to confirm the percentile was intentional; recommend resetting to the default unless they have a specific reason.
+**Winsorization enabled with a percentile far from the platform default (typically 95).** A percentile near 50 caps almost all data — almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason.
 
 ### SRM check disabled
 
-**When**: the experiment's SRM check is off.
-
-**Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios); when you do flag, recommend re-enabling SRM and re-analyzing.
+**SRM check is off.** Often deliberate — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) and then recommend re-enabling SRM and re-analyzing.
 
 ### CUPED on new-users-only cohort
 
-**When**: CUPED is enabled AND the experiment cohort is "new users only".
-
-CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen. Mention as informational; for future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply.
+**CUPED enabled AND the cohort is "new users only".** CUPED needs pre-exposure data, so it had no effect here, but **results are still valid**; variance reduction just didn't happen. Mention as informational. For future experiments on this surface, suggest extending the cohort to include returning users so CUPED can apply.
 
 ### Non-default confidence level
 
-**When**: the experiment is configured for a confidence level other than the platform default (typically 0.95).
-
-`0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out explicitly in the verdict and combine with metric count to estimate the family-wise error rate.
+**Confidence level differs from the platform default (typically 0.95).** `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out in the verdict and combine with metric count to estimate the family-wise error rate.
 
 ### Broken or placeholder metric entries
 
-**When**: the experiment includes metric entries with empty names.
-
-Likely a broken or placeholder metric reference. Flag and skip during analysis.
+**Metric entries with empty names.** Likely broken or placeholder references. Flag and skip during analysis.
 
 ### Primary metric with no computed result
 
-**When**: a metric is listed as primary on the experiment but has no result (neither live nor cached).
-
-No result was computed for that primary. **This is "no measurement," not "no effect."** Surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary.
+**A metric is listed as primary but has no result (live or cached).** This is **"no measurement," not "no effect."** Surface prominently; recommend re-syncing results before any conclusion that depends on this primary.
 
 ---
 
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md
index 576ef9f..7907e90 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md
@@ -134,31 +134,19 @@ If multiple-testing correction is off AND there are 2+ primaries × 1+ non-contr
 
 ---
 
-## "Significance = NO" does NOT mean "no effect"
+## When a primary metric is inconclusive
 
-A row in `summary.no` means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level. **Important when the user is about to call something a null result.**
+A "not significant" verdict means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level — **not that there is no effect.** Important when the user is about to call something a null result.
 
-Options to suggest when a primary metric lands in `summary.no`:
-
-1. **Extend duration** (if the experiment is still ACTIVE).
-2. **Increase traffic allocation** (if there's headroom — never mid-Frequentist-test, which invalidates SRM).
-3. **Use Sequential testing model** for the next experiment if continuous monitoring fits.
-4. **Enable CUPED** if the metric correlates with pre-exposure behavior.
-5. **Narrow the hypothesis** — test a stronger version, or scope to a more responsive segment.
-6. **Accept the null** — if the experiment was well-powered for the MDE that matters, "no effect" is a real finding.
-
-For the full "why hasn't this hit statsig yet" walk-through, see [why-no-statsig.md](why-no-statsig.md).
+For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see [why-no-statsig.md](why-no-statsig.md).
 
 ---
 
 ## Frequentist vs Sequential — what affects per-metric reading
 
-Check the experiment's testing model:
-
-- `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration.
-- `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended.
+Concluding a Frequentist experiment before it reaches its configured target is a peeking event — per-metric significance verdicts become unreliable. Sequential experiments are designed for continuous monitoring and don't have this problem.
 
-Concluding a Frequentist experiment before it reaches its target is a peeking event. Flag it in the verdict.
+For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of [health-check-interpretation.md](health-check-interpretation.md).
 
 ---
 

From 5de808df788bc90352d0c066d636efcc1654b1c9 Mon Sep 17 00:00:00 2001
From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com>
Date: Tue, 9 Jun 2026 19:35:53 +0000
Subject: [PATCH 09/11] interpret-experiment: phase-4 polish
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Final micro-pass from the third hardcore /review-skill audit. Four surgical
edits; ~6 lines of net change.

- SKILL.md polarity recipe: drop the lingering `settings.controlKey`
  reference ("use settings.controlKey" → "the platform marks which variant
  is control"). Same fix in per-metric-interpretation.md's tier table for
  the surviving `multipleTestingCorrection` reference.
- why-no-statsig.md output shape: drop the "which fields told you" phrasing,
  which undid phase-2 right at the moment of summary. The example numbers
  stay; the field-citation framing goes.
- SKILL.md step 1: add one sentence to the disambiguation guard naming the
  identifier-matching convention (ID first, then case-insensitive name).
- health-check-interpretation.md and per-metric-interpretation.md: drop the
  duplicate "Never recompute thresholds" preamble paragraph — the rule lives
  in SKILL.md and is loaded with the spine. The references no longer need to
  restate it.

`grep -rE 'live_|results_cache|exposures_cache|settings\.<…>|multipleTestingCorrection'
plugins/mixpanel-mcp/skills/interpret-experiment/` returns zero hits.

Synced via `make sync-skills FORCE=1`; `make check-skills-sync` passes.

Assisted by Claude
---
 plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md  | 4 ++--
 .../references/health-check-interpretation.md                 | 2 --
 .../references/per-metric-interpretation.md                   | 4 +---
 .../skills/interpret-experiment/references/why-no-statsig.md  | 2 +-
 plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md  | 4 ++--
 .../references/health-check-interpretation.md                 | 2 --
 .../references/per-metric-interpretation.md                   | 4 +---
 .../skills/interpret-experiment/references/why-no-statsig.md  | 2 +-
 plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md     | 4 ++--
 .../references/health-check-interpretation.md                 | 2 --
 .../references/per-metric-interpretation.md                   | 4 +---
 .../skills/interpret-experiment/references/why-no-statsig.md  | 2 +-
 12 files changed, 12 insertions(+), 24 deletions(-)

diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
index 18b15f7..396114c 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
@@ -46,7 +46,7 @@ Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"):
 - `direction == "up"` → **positive** if `lift > 0`, else **negative**.
 - `direction == "down"` → **positive** if `lift < 0`, else **negative**.
 
-A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first (use `settings.controlKey`).
+A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control.
 
 The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**.
 
@@ -74,7 +74,7 @@ Top-down: what to do, in order.
 
 ## 1. Fetch the experiment
 
-If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs.
+If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs. Accept the experiment by name or by ID; try ID match first, then case-insensitive name match.
 
 Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments.
 
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md
index 8875ca2..1467468 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md
@@ -2,8 +2,6 @@
 
 Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action.
 
-**Never recompute thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers.
-
 ---
 
 ## Kohavi framing — always cite when a health check fails
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md
index 7907e90..e46381c 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md
@@ -2,8 +2,6 @@
 
 Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_
 
-**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from the experiment-details response. Then translate.
-
 ---
 
 ## The mental model
@@ -126,7 +124,7 @@ Different metric types behave differently; cite the relevant nuance in your verd
 
 | Tier          | How it influences the verdict                                                                                                                                                                                 |
 | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| **Primary**   | **Decisional.** The platform auto-applies correction when `multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` (across primaries × variants).                                              |
+| **Primary**   | **Decisional.** The platform auto-applies correction when the experiment is configured for Bonferroni or Benjamini-Hochberg (across primaries × variants).                                                    |
 | **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude.                                                                                                                                          |
 | **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. |
 
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md
index dbda2af..6b3d932 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md
@@ -110,6 +110,6 @@ When recommending EXTEND on an active experiment, the action is to update the ex
 ## Output shape
 
 1. **The reason** (one of the five above), in one sentence.
-2. **The evidence from the experiment-details response** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.).
+2. **The evidence** — concrete numbers from the experiment (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%").
 3. **Recommendation** from the table above, with the specific experiment update or follow-up action.
 4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment.
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md
index 18b15f7..396114c 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md
@@ -46,7 +46,7 @@ Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"):
 - `direction == "up"` → **positive** if `lift > 0`, else **negative**.
 - `direction == "down"` → **positive** if `lift < 0`, else **negative**.
 
-A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first (use `settings.controlKey`).
+A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control.
 
 The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**.
 
@@ -74,7 +74,7 @@ Top-down: what to do, in order.
 
 ## 1. Fetch the experiment
 
-If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs.
+If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs. Accept the experiment by name or by ID; try ID match first, then case-insensitive name match.
 
 Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments.
 
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md
index 8875ca2..1467468 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md
@@ -2,8 +2,6 @@
 
 Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action.
 
-**Never recompute thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers.
-
 ---
 
 ## Kohavi framing — always cite when a health check fails
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md
index 7907e90..e46381c 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md
@@ -2,8 +2,6 @@
 
 Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_
 
-**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from the experiment-details response. Then translate.
-
 ---
 
 ## The mental model
@@ -126,7 +124,7 @@ Different metric types behave differently; cite the relevant nuance in your verd
 
 | Tier          | How it influences the verdict                                                                                                                                                                                 |
 | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| **Primary**   | **Decisional.** The platform auto-applies correction when `multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` (across primaries × variants).                                              |
+| **Primary**   | **Decisional.** The platform auto-applies correction when the experiment is configured for Bonferroni or Benjamini-Hochberg (across primaries × variants).                                                    |
 | **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude.                                                                                                                                          |
 | **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. |
 
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
index dbda2af..6b3d932 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
@@ -110,6 +110,6 @@ When recommending EXTEND on an active experiment, the action is to update the ex
 ## Output shape
 
 1. **The reason** (one of the five above), in one sentence.
-2. **The evidence from the experiment-details response** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.).
+2. **The evidence** — concrete numbers from the experiment (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%").
 3. **Recommendation** from the table above, with the specific experiment update or follow-up action.
 4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment.
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
index 18b15f7..396114c 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
@@ -46,7 +46,7 @@ Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"):
 - `direction == "up"` → **positive** if `lift > 0`, else **negative**.
 - `direction == "down"` → **positive** if `lift < 0`, else **negative**.
 
-A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first (use `settings.controlKey`).
+A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control.
 
 The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**.
 
@@ -74,7 +74,7 @@ Top-down: what to do, in order.
 
 ## 1. Fetch the experiment
 
-If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs.
+If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs. Accept the experiment by name or by ID; try ID match first, then case-insensitive name match.
 
 Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments.
 
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md
index 8875ca2..1467468 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md
@@ -2,8 +2,6 @@
 
 Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action.
 
-**Never recompute thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers.
-
 ---
 
 ## Kohavi framing — always cite when a health check fails
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md
index 7907e90..e46381c 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md
@@ -2,8 +2,6 @@
 
 Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_
 
-**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from the experiment-details response. Then translate.
-
 ---
 
 ## The mental model
@@ -126,7 +124,7 @@ Different metric types behave differently; cite the relevant nuance in your verd
 
 | Tier          | How it influences the verdict                                                                                                                                                                                 |
 | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| **Primary**   | **Decisional.** The platform auto-applies correction when `multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` (across primaries × variants).                                              |
+| **Primary**   | **Decisional.** The platform auto-applies correction when the experiment is configured for Bonferroni or Benjamini-Hochberg (across primaries × variants).                                                    |
 | **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude.                                                                                                                                          |
 | **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. |
 
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
index dbda2af..6b3d932 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
@@ -110,6 +110,6 @@ When recommending EXTEND on an active experiment, the action is to update the ex
 ## Output shape
 
 1. **The reason** (one of the five above), in one sentence.
-2. **The evidence from the experiment-details response** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.).
+2. **The evidence** — concrete numbers from the experiment (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%").
 3. **Recommendation** from the table above, with the specific experiment update or follow-up action.
 4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment.

From 6ad6fe72921f4e7001cc707528faf5f4b67b1614 Mon Sep 17 00:00:00 2001
From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com>
Date: Tue, 9 Jun 2026 20:49:15 +0000
Subject: [PATCH 10/11] =?UTF-8?q?interpret-experiment:=20rename=20cross-re?=
 =?UTF-8?q?fs=20experiment-setup=20=E2=86=92=20design-experiment?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The setup-side skill was renamed from `experiment-setup` to
`design-experiment` in its own PR (mixpanel/ai-plugins#24) to follow the
verb-noun convention. Update this skill's cross-references to match.

Sites updated:
- SKILL.md description's negative-trigger sentence
- references/why-no-statsig.md (two mentions)

Synced via `make sync-skills FORCE=1`; `make check-skills-sync` passes.

Assisted by Claude
---
 plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md  | 2 +-
 .../skills/interpret-experiment/references/why-no-statsig.md  | 4 ++--
 plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md  | 2 +-
 .../skills/interpret-experiment/references/why-no-statsig.md  | 4 ++--
 plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md     | 2 +-
 .../skills/interpret-experiment/references/why-no-statsig.md  | 4 ++--
 6 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
index 396114c..c370fc0 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: interpret-experiment
-description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `design-experiment` skill.
 license: Apache-2.0
 ---
 
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md
index 6b3d932..37ec069 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md
@@ -2,7 +2,7 @@
 
 Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts.
 
-The actual stop / extend math (sample size, power, MDE) is owned by the `experiment-setup` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one.
+The actual stop / extend math (sample size, power, MDE) is owned by the `design-experiment` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one.
 
 ---
 
@@ -93,7 +93,7 @@ Once you know which reason fits, the recommendation almost picks itself.
 | Exposure config is filtering           | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample.               |
 | Experiment finished, well-powered      | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters.       |
 
-When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `experiment-setup` skill for the power math.
+When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `design-experiment` skill for the power math.
 
 ---
 
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md
index 396114c..c370fc0 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: interpret-experiment
-description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `design-experiment` skill.
 license: Apache-2.0
 ---
 
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
index 6b3d932..37ec069 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
@@ -2,7 +2,7 @@
 
 Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts.
 
-The actual stop / extend math (sample size, power, MDE) is owned by the `experiment-setup` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one.
+The actual stop / extend math (sample size, power, MDE) is owned by the `design-experiment` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one.
 
 ---
 
@@ -93,7 +93,7 @@ Once you know which reason fits, the recommendation almost picks itself.
 | Exposure config is filtering           | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample.               |
 | Experiment finished, well-powered      | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters.       |
 
-When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `experiment-setup` skill for the power math.
+When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `design-experiment` skill for the power math.
 
 ---
 
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
index 396114c..c370fc0 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: interpret-experiment
-description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `design-experiment` skill.
 license: Apache-2.0
 ---
 
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
index 6b3d932..37ec069 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
@@ -2,7 +2,7 @@
 
 Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts.
 
-The actual stop / extend math (sample size, power, MDE) is owned by the `experiment-setup` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one.
+The actual stop / extend math (sample size, power, MDE) is owned by the `design-experiment` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one.
 
 ---
 
@@ -93,7 +93,7 @@ Once you know which reason fits, the recommendation almost picks itself.
 | Exposure config is filtering           | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample.               |
 | Experiment finished, well-powered      | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters.       |
 
-When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `experiment-setup` skill for the power math.
+When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `design-experiment` skill for the power math.
 
 ---
 

From 67dcb35d2d804b80904fa36d393b30cf15429e7a Mon Sep 17 00:00:00 2001
From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com>
Date: Tue, 9 Jun 2026 21:53:00 +0000
Subject: [PATCH 11/11] Move platform-support disclaimer below the content in
 segment-breakdown-interpretation

The disclaimer about per-segment platform support was the second
paragraph, separating the file's purpose from its content with five
lines of caveats. Moved to a "Platform support status" section at the
end of the file so the reader hits the mental model immediately.

Synced to mixpanel-mcp-eu and mixpanel-mcp-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../references/segment-breakdown-interpretation.md        | 8 ++++++--
 .../references/segment-breakdown-interpretation.md        | 8 ++++++--
 .../references/segment-breakdown-interpretation.md        | 8 ++++++--
 3 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md
index f5623e1..98c7bbc 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md
@@ -2,8 +2,6 @@
 
 Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
 
-> **Platform support status.** Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.
-
 ---
 
 ## The mental model
@@ -93,3 +91,9 @@ This is the everyday case of mixed effects.
 2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered).
 3. **What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's."
 4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating).
+
+---
+
+## Platform support status
+
+Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the rules above. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md
index f5623e1..98c7bbc 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md
@@ -2,8 +2,6 @@
 
 Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
 
-> **Platform support status.** Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.
-
 ---
 
 ## The mental model
@@ -93,3 +91,9 @@ This is the everyday case of mixed effects.
 2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered).
 3. **What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's."
 4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating).
+
+---
+
+## Platform support status
+
+Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the rules above. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md
index f5623e1..98c7bbc 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md
@@ -2,8 +2,6 @@
 
 Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
 
-> **Platform support status.** Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.
-
 ---
 
 ## The mental model
@@ -93,3 +91,9 @@ This is the everyday case of mixed effects.
 2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered).
 3. **What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's."
 4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating).
+
+---
+
+## Platform support status
+
+Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the rules above. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.