From a6ec43f10d27e2db858a12eec7d8f5b70dad19ab Mon Sep 17 00:00:00 2001 From: "mixpanel-claude-code-agent[bot]" <237517943+mixpanel-claude-code-agent[bot]@users.noreply.github.com> Date: Thu, 4 Jun 2026 22:32:41 +0000 Subject: [PATCH 01/11] Add experiment-results skill Authors a single home for all results- and health-phase expertise: the agent loads this skill and reads the verdicts that Get-Experiment returns, rather than recomputing thresholds. Replaces the interpretation portion of several superseded per-capability tools. Skill is structured for progressive disclosure: the spine (5-step decision tree, polarity recipe, ship/iterate/kill/wait verdict) lives in SKILL.md, and deep-dive references cover health-check causes, per-metric phrasing, why-no-statsig, segment-of-interest selection, segment-breakdown reading, session-replay analysis, and the Get-Experiment field map. Eval fixtures seeded from PRD customer quotes (Pelando "+2 others", Confetti "8 metrics for new visitors", Polarsteps "no documented workaround"). Synced to mixpanel-mcp-eu and mixpanel-mcp-in via make sync-skills. 
Linear: https://linear.app/mixpanel/issue/MULTI-582 Co-Authored-By: Claude Opus 4.7 (1M context) --- README.md | 15 +- .../skills/experiment-results/SKILL.md | 236 ++++++++++++++++++ .../skills/experiment-results/evals/README.md | 34 +++ .../evals/confetti-8-metrics.yaml | 48 ++++ .../evals/pelando-plus-2-others.yaml | 79 ++++++ .../evals/polarsteps-no-workaround.yaml | 61 +++++ .../references/get-experiment-fields.md | 161 ++++++++++++ .../references/health-check-interpretation.md | 158 ++++++++++++ .../references/per-metric-interpretation.md | 188 ++++++++++++++ .../segment-breakdown-interpretation.md | 95 +++++++ .../segment-of-interest-selection.md | 116 +++++++++ .../references/session-replay-analysis.md | 109 ++++++++ .../references/why-no-statsig.md | 115 +++++++++ .../skills/experiment-results/SKILL.md | 236 ++++++++++++++++++ .../skills/experiment-results/evals/README.md | 34 +++ .../evals/confetti-8-metrics.yaml | 48 ++++ .../evals/pelando-plus-2-others.yaml | 79 ++++++ .../evals/polarsteps-no-workaround.yaml | 61 +++++ .../references/get-experiment-fields.md | 161 ++++++++++++ .../references/health-check-interpretation.md | 158 ++++++++++++ .../references/per-metric-interpretation.md | 188 ++++++++++++++ .../segment-breakdown-interpretation.md | 95 +++++++ .../segment-of-interest-selection.md | 116 +++++++++ .../references/session-replay-analysis.md | 109 ++++++++ .../references/why-no-statsig.md | 115 +++++++++ .../skills/experiment-results/SKILL.md | 236 ++++++++++++++++++ .../skills/experiment-results/evals/README.md | 34 +++ .../evals/confetti-8-metrics.yaml | 48 ++++ .../evals/pelando-plus-2-others.yaml | 79 ++++++ .../evals/polarsteps-no-workaround.yaml | 61 +++++ .../references/get-experiment-fields.md | 161 ++++++++++++ .../references/health-check-interpretation.md | 158 ++++++++++++ .../references/per-metric-interpretation.md | 188 ++++++++++++++ .../segment-breakdown-interpretation.md | 95 +++++++ .../segment-of-interest-selection.md | 116 
+++++++++ .../references/session-replay-analysis.md | 109 ++++++++ .../references/why-no-statsig.md | 115 +++++++++ 37 files changed, 4209 insertions(+), 6 deletions(-) create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/evals/README.md create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/evals/confetti-8-metrics.yaml create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/evals/pelando-plus-2-others.yaml create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/evals/polarsteps-no-workaround.yaml create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/references/get-experiment-fields.md create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-of-interest-selection.md create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md create mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/evals/README.md create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/evals/confetti-8-metrics.yaml create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/evals/pelando-plus-2-others.yaml create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/evals/polarsteps-no-workaround.yaml create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/references/get-experiment-fields.md create mode 100644 
plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-of-interest-selection.md create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md create mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/SKILL.md create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/evals/README.md create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/evals/confetti-8-metrics.yaml create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/evals/pelando-plus-2-others.yaml create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/evals/polarsteps-no-workaround.yaml create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/references/get-experiment-fields.md create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/references/segment-of-interest-selection.md create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md create mode 100644 plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md diff --git a/README.md b/README.md index c79cb95..3f843f2 100644 --- a/README.md +++ b/README.md @@ -4,11 +4,12 @@ Plugins that give AI agents Mixpanel expertise. 
Built on the [Agent Skills](http ## Skills -| Skill | Description | -|---|---| -| [`tracking-implementation`](plugins/mixpanel-mcp/skills/tracking-implementation/) | Guides an agent through Mixpanel analytics implementation. Supports Quick Start, Full Implementation, Add Tracking, and Audit modes. | -| [`create-dashboard`](plugins/mixpanel-mcp/skills/create-dashboard/) | Creates a well-designed Mixpanel dashboard with validated data, text cards, and narrative layout. | -| [`deep-research`](plugins/mixpanel-mcp/skills/deep-research/) | Conducts a structured metric investigation in Mixpanel. Use when a user asks *why* a metric changed, what's driving a trend, or requests a deep dive or root cause analysis. | +| Skill | Description | +| --------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| [`tracking-implementation`](plugins/mixpanel-mcp/skills/tracking-implementation/) | Guides an agent through Mixpanel analytics implementation. Supports Quick Start, Full Implementation, Add Tracking, and Audit modes. | +| [`create-dashboard`](plugins/mixpanel-mcp/skills/create-dashboard/) | Creates a well-designed Mixpanel dashboard with validated data, text cards, and narrative layout. | +| [`deep-research`](plugins/mixpanel-mcp/skills/deep-research/) | Conducts a structured metric investigation in Mixpanel. Use when a user asks _why_ a metric changed, what's driving a trend, or requests a deep dive or root cause analysis. | +| [`experiment-results`](plugins/mixpanel-mcp/skills/experiment-results/) | Interprets a Mixpanel experiment's results and health checks. 
Use when the user asks to read results, make a ship/iterate/kill/wait call, asks why statsig hasn't been reached, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the verdicts that `Get-Experiment` returns — never recomputes thresholds. | ## Getting Started @@ -23,21 +24,23 @@ claude plugin marketplace add mixpanel/ai-plugins 2. Install the plugin for your region: **US** + ```bash claude plugin install mixpanel-mcp ``` **EU** + ```bash claude plugin install mixpanel-mcp-eu ``` **India** + ```bash claude plugin install mixpanel-mcp-in ``` - ### Cursor Install the plugin from the Cursor marketplace, or have a team admin import this GitHub repository as a team marketplace (Dashboard → Settings → Plugins → Import). diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md new file mode 100644 index 0000000..4e344d3 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md @@ -0,0 +1,236 @@ +--- +name: experiment-results +description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts that `Get-Experiment` returns — never recomputes thresholds. +license: Apache-2.0 +--- + +# Experiment Results Interpretation + +You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. **Read the Decision Tree first** and use it as the spine of every interpretation. Drop into the deeper references only when the situation calls for it. 
+ +## Requirements + +- Access to Mixpanel via the MCP server (specifically the `Get-Experiment` tool — and, for ship/kill decisions, `Update-Experiment`). +- This skill reads the verdicts that `Get-Experiment` already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values. + +## When to use this skill + +Trigger when the user asks anything about reading an experiment's results or its health. Common phrasings: + +- "What do these results mean?" / "Should we ship this?" +- "Is this experiment trustworthy?" / "Why is SRM failing?" +- "Why hasn't this hit statistical significance yet?" +- "Break this down by ``" / "What segments should I look at?" +- "What does this Retro A/A failure mean?" +- "Can you compare the session replays for control vs treatment?" + +Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the setup-side skill or tool. + +--- + +## How to read `Get-Experiment` output + +Always call `Get-Experiment` with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.** + +| Concept | Live (preferred) | Cached fallback | +| ---------------------------- | --------------------------------- | ------------------------------------------- | +| Per-variant exposure counts | `live_exposures` | `exposures_cache` (strip `$`-prefixed keys) | +| SRM check | `live_srm_analysis` | `exposures_cache.$srm_analysis` | +| Per-metric per-variant stats | `live_metrics[metricId][variant]` | `results_cache.metrics[metricId][variant]` | +| Bucketed summary | recompute from `live_metrics` | `results_cache.summary` | +| When was this computed? | "now" | `exposures_cache.$last_computed` | + +If `live_results_errors` is non-null, the live path failed. 
Use the cache, caveat that data is stale, and surface the error to the user — the underlying failure may need fixing before any decision. + +If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect." + +See [references/get-experiment-fields.md](references/get-experiment-fields.md) for the full field map and which fields drive each step below. + +--- + +## The Decision Tree + +This is the spine of every interpretation. Run the steps **in order**. **Stop at the first failure** — do not proceed to step N+1 if step N flags a problem. + +``` +┌─ Step 1: TRUSTWORTHINESS GATE ───────────────┐ +│ SRM ok? → exposures sufficient? → │ +│ Retro A/A clean? → minimum duration met? → │ +│ no misconfig? │ +│ │ │ +│ fail → STOP. See references/ │ +│ health-check-interpretation.md │ +└──────────────┬───────────────────────────────┘ + ↓ pass +┌─ Step 2: STATISTICAL SIGNIFICANCE ───────────┐ +│ For each non-control variant × primary, │ +│ apply the polarity recipe (sign-of-lift + │ +│ metric.direction). Significant + correct │ +│ polarity = "win"; significant + wrong │ +│ polarity = "loss". │ +│ │ │ +│ nothing significant on primaries → │ +│ see references/why-no-statsig.md │ +└──────────────┬───────────────────────────────┘ + ↓ at least one primary win +┌─ Step 3: GUARDRAIL CHECK ────────────────────┐ +│ Any guardrail significant in the wrong │ +│ polarity? → regression → ITERATE not ship │ +└──────────────┬───────────────────────────────┘ + ↓ guardrails clean +┌─ Step 4: PRACTICAL SIGNIFICANCE ─────────────┐ +│ Convert the lift on the primary into │ +│ absolute terms. Is it big enough to │ +│ matter to the business? │ +│ Statistically significant ≠ ships. 
│ +└──────────────┬───────────────────────────────┘ + ↓ meaningful magnitude +┌─ Step 5: VERDICT ────────────────────────────┐ +│ Trust ✓ + primary win + guardrails ✓ + │ +│ meaningful magnitude → SHIP │ +│ Trust ✓ + primary win + guardrail regress │ +│ → ITERATE │ +│ Trust ✓ + primary neutral after target │ +│ → KILL or ITERATE │ +│ Trust ✗ │ +│ → DO NOT DECIDE; report failures │ +│ Hasn't reached target sample/duration │ +│ → WAIT (or extend, or restart with more │ +│ power — see why-no-statsig.md) │ +└──────────────────────────────────────────────┘ +``` + +### Step 1 — Trustworthiness gate (consume the verdicts) + +Read these fields. Treat the platform's verdict as authoritative — do not reapply thresholds yourself. + +| Check | Field to read | What "fail" looks like | +| ------------------------ | ------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| SRM | `live_srm_analysis` (or `exposures_cache.$srm_analysis`) | Platform flags as failing — do not compute the chi-square yourself. | +| Sufficient exposures | `live_exposures` per variant | Platform-flagged "insufficient." If unflagged but per-variant counts look suspicious, route the user to the health-check reference; do not invent a threshold. | +| Retro A/A (pre-exp bias) | `settings.preExperimentBias` enabled, then the analysis | Platform flags a significant pre-period difference. | +| Minimum elapsed time | `end_date - start_date` | Less than ~3 days regardless of sample size — interpretation is unreliable. | +| Ran for planned duration | `start_date`, `end_date`, `settings.endAfterDays`/`sampleSize`/`endCondition`, `settings.testingModel` | Frequentist: ended before reaching configured target = peeking. Sequential: early stop on significance is allowed. 
| +| Misconfiguration | See [references/get-experiment-fields.md](references/get-experiment-fields.md) §Misconfig | Any flagged misconfig invalidates analysis. | + +If any of these fail, **stop**. Tell the user explicitly that results are not trustworthy. Open [references/health-check-interpretation.md](references/health-check-interpretation.md) for the per-failure root-cause checklists, recommended actions, and the Kohavi framing ("SRM is the #1 trustworthiness check; Twyman's Law: any unusually clean result is more likely a bug than a discovery"). + +### Step 2 — Statistical significance with polarity + +**Critical**: `summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by whether the lift is good for the business. You MUST apply the polarity recipe using each metric's `direction` before declaring a winner. + +#### Polarity recipe + +`metric.direction` is `"up"` or `"down"` (defaults to `"up"` if unset on the source metric). + +- `lift is None` or `lift == 0` → **neutral**. +- `direction == "up"` → **positive** if `lift > 0`, else **negative**. +- `direction == "down"` → **positive** if `lift < 0`, else **negative**. + +A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. Never trust the bucket name as the business verdict. + +#### How to read the summary + +1. **Filter out the control row.** Use `settings.controlKey` (typically `"control"`; may be empty). Control-vs-control always has lift 0 and inflates the "no effect" count. If `controlKey` is empty, identify control by: (a) the variant literally named `"control"`, (b) the variant whose lift is uniformly 0 across all metrics, or (c) ask the user. +2. For each non-control variant, look up the metric in `summary.positive` / `summary.negative` / `summary.no`. 
**Trust the bucket name as the significance signal** — the `significance` field on each item may be `null` even when the bucket is meaningful. +3. Apply the polarity recipe using `metric.direction` to translate sign-of-lift into win/loss. +4. If `lift is None` in a summary item, **the calculation failed** for that variant — surface it. Do not interpret as "no effect." + +The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is set to `"bonferroni"` or `"benjamini-hochberg"` (across primaries × non-control variants). **Don't re-correct.** + +Turning the per-metric numbers into a plain-language verdict (lift + CI + p-value → "small win," "large regression," "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). + +If nothing on the primaries is significant and the user is asking "why hasn't this hit statsig?", route to [references/why-no-statsig.md](references/why-no-statsig.md). + +### Step 3 — Guardrail check + +Apply the polarity recipe to every guardrail metric (`metric.type == "guardrail"`). + +- A small primary win + a clear guardrail regression → usually **iterate, do not ship**. +- "Not significant" on a guardrail does NOT mean "no regression." It means the experiment couldn't _detect_ one at the chosen confidence. If the guardrail is critical (latency, error rate, retention), flag whether it was powered to detect a meaningful regression. +- Polarity matters here too: a guardrail named "errors" with `direction: "down"` and lift `+5%` (significant) is a regression even though it lands in `summary.positive`. + +### Step 4 — Practical significance + +Statistical significance ≠ business impact. For every primary metric that won: + +1. Read the **baseline value** from the control variant: `live_metrics[metricId][controlKey].value`. +2. Read the **lift** from the winning variant's row. +3. Compute absolute lift: `baseline_value × lift`. +4. 
Project to population per period: ask the user for traffic estimates if not in context. + +A 5% lift on a 20% baseline metric serving 1M users/week is enormous. A 5% lift on a 0.1% baseline metric serving 1k users/week is noise. Always ground the user in absolute terms before declaring a win meaningful. + +**Twyman's Law check**: before celebrating any lift > ~30%, ask: did the treatment change who is _exposed_ to this metric, not just how they behave? See the changed-denominator notes in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). + +If `value` or `sampleSize` is `null` (common when live computation timed out), call `Run-Query` on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total. + +### Step 5 — Verdict + +| Situation | Recommendation | +| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** `Update-Experiment(action="decide", success=true, variant=, message=)` | +| Trust ✓, primary polarity positive, guardrail polarity negative | **ITERATE.** Investigate the regression; do not auto-ship. | +| Trust ✓, primary polarity neutral after target sample reached | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md). | +| Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)). 
| +| Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). | + +For **multi-variant tests**, pivot the summary by variant and evaluate each treatment independently against control. The winner is the variant with the most polarity-corrected primary wins, zero guardrail regressions, and the largest practical impact. If multiple qualify, prefer the simpler / lower-risk variant. If none qualify, recommend kill or iterate. + +`message` is required on every `decide` call — include the rationale, the metrics evaluated, and any tradeoffs accepted. + +Special variant constants when `success=true`: + +- `__no_variant_shipped__` — ship the change without picking a variant +- `__defer_variant_decision__` — defer (status becomes `SUCCESS_DEFERRED` in UI) + +For a kill, pass `success=false`. + +--- + +## Going deeper + +Once the spine is clear, the user often asks one of these follow-ups. Open the relevant reference on demand: + +| User asks about… | Open | +| ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | +| SRM failing, Retro A/A failing, exposures insufficient, or any Step 1 fail | [references/health-check-interpretation.md](references/health-check-interpretation.md) | +| "Translate this lift / CI / p-value into English" | [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | +| "Why hasn't this hit statsig yet? Should we wait or stop?" | [references/why-no-statsig.md](references/why-no-statsig.md) | +| "Which segments should I break this down on?" | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | +| "What does this segment-by-segment result mean?" 
(when platform support exists) | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | +| "Can session replays help explain this result?" | [references/session-replay-analysis.md](references/session-replay-analysis.md) | +| "Which `Get-Experiment` field has X?" | [references/get-experiment-fields.md](references/get-experiment-fields.md) | + +--- + +## Output + +Default to this shape unless the user asks for something else: + +1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`. +2. **Why**, walking through the decision tree steps that mattered (skip the steps that were clearly fine). +3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, with the polarity-corrected reading of each. Include the absolute-impact translation for any win. +4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, etc. +5. **Suggested next action** — the `Update-Experiment` call to make, or the deeper investigation to run. + +If `Get-Experiment` is unavailable or returns errors, say so — do not invent a verdict. 
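
The polarity recipe from Step 2 and the absolute-lift arithmetic from Step 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the platform's implementation: the function names (`classify_polarity`, `absolute_lift`, `read_summary`) are hypothetical, and the summary item shape (`metricId`, `variant`, `lift`) is assumed to match what `Get-Experiment` returns. It only translates already-bucketed lifts into win/loss readings; it does not recompute significance or any threshold.

```python
from typing import Optional


def classify_polarity(lift: Optional[float], direction: Optional[str] = "up") -> str:
    """Translate sign-of-lift into a business reading (polarity recipe)."""
    # None or zero lift is neutral for polarity purposes. A None lift also
    # means the calculation failed, so the caller should surface it rather
    # than read it as "no effect".
    if lift is None or lift == 0:
        return "neutral"
    # direction defaults to "up" when unset on the source metric.
    if (direction or "up") == "down":
        return "positive" if lift < 0 else "negative"
    return "positive" if lift > 0 else "negative"


def absolute_lift(baseline_value: float, lift: float) -> float:
    """Step 4: absolute lift = control baseline value x relative lift."""
    return baseline_value * lift


def read_summary(summary: dict, directions: dict, control_key: str = "control") -> list:
    """Polarity-correct each non-control item in the significant buckets.

    `directions` is a hypothetical {metricId: "up" | "down"} lookup built
    from the experiment's metric definitions.
    """
    readings = []
    for bucket in ("positive", "negative"):
        for item in summary.get(bucket, []):
            if item["variant"] == control_key:
                continue  # filter out the control row before counting
            direction = directions.get(item["metricId"], "up")
            reading = classify_polarity(item.get("lift"), direction)
            readings.append((item["variant"], item["metricId"], bucket, reading))
    return readings


# Illustrative values only: an "errors" guardrail with direction "down" and a
# +5% lift lands in the positive bucket but reads as a regression.
example = read_summary(
    {"positive": [{"metricId": "errors", "variant": "treatment", "lift": 0.05}]},
    {"errors": "down"},
)
```

Keeping the bucket name and the polarity-corrected reading side by side in each tuple makes it harder to mistake `summary.positive` for "good for the business".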
+ +--- + +## Common pitfalls (cheat sheet) + +- ⛔ **Skipping Step 1** because the lifts look exciting (Twyman's Law) +- ⛔ **Recomputing thresholds** instead of consuming the verdicts the platform already returned +- ⛔ **Not applying polarity** — reading `summary.positive` as "good" without checking `metric.direction` +- ⛔ Trusting a >30% lift without checking whether the **denominator changed** +- ⛔ **Including the control row** when counting wins/losses (filter by `settings.controlKey`) +- ⛔ Treating a `null` lift as "no effect" — it means computation failed +- ⛔ Treating a missing primary (in `metrics[]` but not in `live_metrics`/`results_cache.metrics`) as "no effect" — it's "no measurement" +- ⛔ Interpreting a `< 3 day` experiment instead of refusing +- ⛔ Forgetting to call out a **non-default `confidenceLevel`** (0.9 inflates false positives; 0.99 is conservative) +- ⛔ Treating **secondary-metric significance** as decisional (it isn't, ever) +- ⛔ Conflating **statistical significance** with **practical significance** +- ⛔ Ignoring **guardrail regressions** because the primary won +- ⛔ Calling a single significant primary with multiple-testing correction off a "win" — look at the aggregate, or enable correction +- ⛔ Concluding "no effect" from an underpowered inconclusive result (route to [references/why-no-statsig.md](references/why-no-statsig.md)) diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/README.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/README.md new file mode 100644 index 0000000..71278d6 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/README.md @@ -0,0 +1,34 @@ +# Eval fixtures — `experiment-results` + +Each fixture is a self-contained prompt + expected-behavior pair for the `experiment-results` skill. They are seeded from PRD customer quotes — the customer pain that motivated this skill in the first place. + +The fixtures are not auto-runnable yet (no harness lives in this repo). 
They're written for two uses: + +1. **Manual rehearsal** — a human (or another agent) can read the prompt, simulate the response the skill should produce, and check it against the `expected_behavior` field. +2. **Regression checkpoint when a runner exists** — when an eval harness is added in this repo, these prompts plug in directly: each YAML doc becomes one case, the `expected_behavior` field becomes the grader rubric. + +When you change `SKILL.md`, walk these fixtures and confirm each one still produces the expected behavior. If a fixture starts failing, decide whether the skill regressed or the fixture itself needs updating. + +--- + +## Fixtures + +| Fixture | PRD source quote | What it exercises | +| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------- | +| `pelando-plus-2-others.yaml` | Pelando — _"+2 others"_ (results too noisy for the user to triage which results to act on) | Decision tree spine + per-metric polarity; ship/iterate verdict against multi-variant noise. | +| `confetti-8-metrics.yaml` | Confetti — _"8 metrics for new visitors"_ (many primaries; user wants segment-of-interest selection on new vs returning) | Segment-of-interest selection; multiple-testing correction warning; per-metric interpretation. | +| `polarsteps-no-workaround.yaml` | Polarsteps — _"no documented workaround"_ (user wants to understand SRM failure with no canned path forward) | Health-check interpretation; Kohavi framing; ordered-causes recommendation. 
| + +Each YAML doc has the same shape: + +```yaml +name: +prd_source: +trigger_phrase: +get_experiment_summary: +expected_behavior: + verdict: + must_mention: [] + must_not_do: [] + references_consulted: [] +``` diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/confetti-8-metrics.yaml b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/confetti-8-metrics.yaml new file mode 100644 index 0000000..da61d9e --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/confetti-8-metrics.yaml @@ -0,0 +1,48 @@ +name: confetti-8-metrics +prd_source: | + Confetti — "8 metrics for new visitors" + Customer is running an experiment with 8 primary-ish metrics and explicitly + cares about new-visitor behavior. They want a segment-driven read, not a + dump of 8 lifts. The skill should pre-commit to segments tied to the + hypothesis (new vs returning), call out the multiple-testing concern with + 8 metrics, and produce a verdict scoped to the segment that matters. + +trigger_phrase: | + We're tracking 8 metrics on this onboarding redesign experiment and I really + care about how new visitors respond. Can you read this and tell me whether + it's a ship for the new-user audience? + +get_experiment_summary: + hypothesis: | + If we redesign the first-session onboarding flow, then activation rate + among NEW visitors will increase by ≥5% relative, because reducing + cold-start friction shortens time-to-first-value. + settings: + controlKey: "control" + multipleTestingCorrection: "off" # mis-configured given 8 primaries + testingModel: "sequential" + confidenceLevel: 0.95 + metrics_count: 8 + primary_metrics_summary: | + Of 8 primaries: 2 significant positive (polarity-correct), 1 significant + negative (a "Time to First Action" metric with direction=down where + lift is -7% — actually a WIN once polarity-applied), 5 inconclusive. 
+ +expected_behavior: + verdict: WAIT + must_mention: + - "Pre-commit to the new-vs-returning segment because the hypothesis names new visitors as the cohort that matters" + - "Recommend at most 3–5 segments and call new vs returning the primary slice" + - "Multiple-testing correction is OFF but there are 8 primaries — flag the inflated family-wise FPR explicitly (rough order: with 8 primaries × 1 variant at α=0.05, family-wise FPR is high enough to make a single significant result inconclusive on its own)" + - "Apply polarity recipe per metric — flag the Time to First Action 'negative bucket' as a win because direction=down" + - "Without correction enabled, don't ship on a single significant primary; either enable correction and re-analyze or look at the aggregate of all 8" + - "Verdict is WAIT (re-analyze with multiple-testing correction enabled, segmented to new visitors) — not SHIP" + must_not_do: + - "Slice by every available property after the fact (the fishing-expedition warning)" + - "Treat the 'Time to First Action' metric in the negative bucket as a loss without polarity-correcting" + - "Call the experiment a ship because 2 of 8 primaries are significant positive" + - "Pretend the agent can compute the corrected p-values itself — instead, recommend re-running with multipleTestingCorrection enabled" + references_consulted: + - segment-of-interest-selection.md + - per-metric-interpretation.md + - health-check-interpretation.md # for the misconfig flag diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/pelando-plus-2-others.yaml b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/pelando-plus-2-others.yaml new file mode 100644 index 0000000..f634236 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/pelando-plus-2-others.yaml @@ -0,0 +1,79 @@ +name: pelando-plus-2-others +prd_source: | + Pelando — "+2 others" + Customer reported that when a multi-variant test concludes with a winner banner + plus a small-print 
"+2 others", they cannot tell which non-winner variants are + benign vs which contain a guardrail regression they need to act on. The skill + should pivot the summary per variant, polarity-correct each, and call out the + losers, not gloss over them. + +trigger_phrase: | + Can you make sense of this experiment for me? The UI shows treatment_a winning + on the primary plus "+2 others" but I have no idea whether treatment_b or + treatment_c are okay to ignore. + +get_experiment_summary: + settings: + controlKey: "control" + multipleTestingCorrection: "benjamini-hochberg" + testingModel: "sequential" + metrics: + - id: m_primary + type: primary + direction: up + name: "Activation Rate" + - id: m_guardrail_latency + type: guardrail + direction: down + name: "p95 Latency (ms)" + - id: m_guardrail_errors + type: guardrail + direction: down + name: "Error Rate" + live_exposures: + control: 41123 + treatment_a: 40987 + treatment_b: 41210 + treatment_c: 40755 + live_srm_analysis: + # platform-flagged passing + p_value: 0.42 + summary: + positive: + - { + metricId: m_primary, + variant: treatment_a, + lift: 0.041, + liftConfidence: 0.95, + } + - { + metricId: m_guardrail_latency, + variant: treatment_b, + lift: 0.08, + liftConfidence: 0.95, + } + negative: + - { + metricId: m_primary, + variant: treatment_c, + lift: -0.022, + liftConfidence: 0.95, + } + no: + - { metricId: m_primary, variant: treatment_b, lift: 0.004 } + +expected_behavior: + verdict: ITERATE + must_mention: + - "Pivot the summary by variant before declaring a winner" + - "treatment_a wins on the primary but treatment_b shows a latency regression once polarity is applied (direction=down + lift +8% = bad)" + - "treatment_c regresses the primary" + - "Multi-variant verdict requires each treatment to be judged independently against control" + - "Recommend iterate, not ship — at minimum, do not ship treatment_b, and investigate treatment_c before re-running" + must_not_do: + - "Quietly drop treatment_b and 
treatment_c into '+2 others' without polarity-checking each" + - "Trust the bucket name (positive/negative) as the business verdict" + - "Re-apply multiple-testing correction on top of the platform's benjamini-hochberg" + references_consulted: + - per-metric-interpretation.md + - get-experiment-fields.md diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/polarsteps-no-workaround.yaml b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/polarsteps-no-workaround.yaml new file mode 100644 index 0000000..325a3bf --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/polarsteps-no-workaround.yaml @@ -0,0 +1,61 @@ +name: polarsteps-no-workaround +prd_source: | + Polarsteps — "no documented workaround" + Customer's experiment is failing SRM and they cannot find a documented path + forward. The skill should consume the platform's SRM verdict (not recompute + chi-square), cite Kohavi ("SRM is the #1 trustworthiness check"), and + surface ordered likely causes plus a specific recommended action — not + punt with "investigate further." + +trigger_phrase: | + My experiment is failing SRM and the result lift looks too good to be true + (+18% on the primary). The docs just say "investigate" — what does that + actually mean here? Should I trust the lift? 
+ +get_experiment_summary: + settings: + controlKey: "control" + srm: + enabled: true + targetAllocations: { control: 50, treatment: 50 } + excludeQA: false # potentially relevant + live_exposures: + control: 18250 + treatment: 22980 + live_srm_analysis: + # platform-flagged FAILING + p_value: 0.00002 + chi_square: 18.4 + summary: + positive: + - { + metricId: m_primary, + variant: treatment, + lift: 0.18, + liftConfidence: 0.95, + } + metrics: + - id: m_primary + type: primary + direction: up + name: "Trip Plan Created" + +expected_behavior: + verdict: DO_NOT_DECIDE + must_mention: + - "SRM is failing per the platform's verdict — do NOT trust the +18% lift" + - "Cite Kohavi: SRM is the #1 trustworthiness check; when SRM is failing, lift, p-values, and confidence intervals cannot be attributed to the treatment" + - "Twyman's Law: a +18% lift on a failing-SRM experiment is more likely a bucketing bug than a genuine win" + - "Likely causes ordered most → least likely: bucketing_bug, biased_assignment, bot_traffic, exposure_tracking_bug, ramp_up_timing" + - "Recommended action: pause_and_investigate — pause before drawing conclusions; randomization assumption is broken" + - "Concrete next steps: compare live_exposures to targetAllocations; check feature-flag rules and history via Get-Feature-Flag; Run-Query $experiment_started by variant; enable settings.excludeQA before relaunch given it is currently off" + - "Do NOT recompute the SRM chi-square — consume the platform's verdict" + - "Restart with fixed bucketing once the cause is found; do NOT re-conclude on the broken data" + must_not_do: + - "Calculate the chi-square or re-derive an SRM p-value threshold" + - "Recommend shipping or treating the +18% lift as real" + - "Hand the user a generic 'investigate further' without ordered causes and an action" + - "Skip Kohavi framing — it's the whole reason this check is the #1 gate" + references_consulted: + - health-check-interpretation.md + - get-experiment-fields.md 
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/get-experiment-fields.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/get-experiment-fields.md new file mode 100644 index 0000000..efaeae5 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/get-experiment-fields.md @@ -0,0 +1,161 @@ +# `Get-Experiment` Field Map + +Quick reference for which `Get-Experiment` response field drives each interpretation. Always call with `compute_exposures=true, compute_metrics=true`. + +This reference is **read-only domain knowledge** for the agent. It does NOT define thresholds — every "fail condition" listed below is a _characterization_ of how the platform itself already classifies the field, not a threshold this skill should re-apply. + +--- + +## Identity & lifecycle + +``` +id, name, description, hypothesis, status, start_date, end_date +creator_email, tags, url, workspace_id +feature_flag_id → for feature-flag-based experiments +settings.controlKey → variant key treated as control (often "control"; may be "") +``` + +`status` is one of `"concluded" | "success" | "fail"` (the UI may additionally show `SUCCESS_DEFERRED` for the special variant constant — see "Decision metadata" below). 
+ +--- + +## Trustworthiness + +``` +live_srm_analysis → SRM verdict (consume — don't recompute) + .p_value + .chi_square +live_exposures[] → per-variant exposure counts (live) +exposures_cache[] → per-variant exposure counts (cached fallback) +exposures_cache.$srm_analysis → cached SRM analysis +exposures_cache.$last_computed → when the cache was last refreshed +settings.srm.enabled → whether the SRM check ran +settings.srm.targetAllocations → expected per-variant allocation (percent) +settings.preExperimentBias → whether Retro A/A was enabled +settings.excludeQA → whether QA traffic was filtered +live_results_errors → non-null = live computation failed; surface and fall back to cache +``` + +--- + +## Per-metric per-variant results + +``` +live_metrics[][] + .value → metric value for this variant + .sampleSize → sample size for this variant on this metric + .lift → (treatment - control) / control (0 for control row) + .liftConfidence → confidence LEVEL used (e.g. 0.95) — NOT the CI width + .significance → "YES_POSITIVE" | "YES_NEGATIVE" | "NO" (sign-of-lift, NOT polarity) + +results_cache.metrics[][] → cached fallback, same shape +``` + +--- + +## Bucketed summary + +``` +results_cache.summary.positive[] → items with significance == "YES_POSITIVE" (lift > 0, sig) +results_cache.summary.negative[] → items with significance == "YES_NEGATIVE" (lift < 0, sig) +results_cache.summary.no[] → items with significance == "NO" + +Each item: + .metricId + .variant + .value + .lift + .liftConfidence + .sampleSize + .significance +``` + +**Pre-process the summary**: filter rows where `variant == settings.controlKey` (control-vs-control is mechanical noise), then apply the polarity recipe before drawing any conclusion. 
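
The pre-processing step above can be sketched in Python. This is a minimal illustration, not the platform's implementation: it assumes the `Get-Experiment` response has already been parsed into nested dictionaries that mirror the field paths in this reference, and the helper name `preprocess_summary` is hypothetical.

```python
from typing import Any, Dict, List


def preprocess_summary(experiment: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten the bucketed summary, drop control rows, and attach each
    metric's type and direction so polarity can be applied afterwards.

    Assumption: `experiment` is the parsed Get-Experiment response as a dict.
    """
    control_key = experiment["settings"]["controlKey"]
    # Lookup metric_id -> catalog entry (type, direction).
    catalog = {m["id"]: m for m in experiment.get("metrics", [])}
    summary = experiment["results_cache"]["summary"]

    rows = []
    for bucket in ("positive", "negative", "no"):
        for item in summary.get(bucket) or []:
            if item["variant"] == control_key:
                continue  # control-vs-control is mechanical noise
            metric = catalog.get(item["metricId"], {})
            rows.append({
                **item,
                "bucket": bucket,
                "type": metric.get("type"),
                # The reference says direction defaults to "up" when unset.
                "direction": metric.get("direction", "up"),
            })
    return rows


# Illustrative input only; the values are made up for the example.
example = {
    "settings": {"controlKey": "control"},
    "metrics": [
        {"id": "m_latency", "type": "guardrail", "direction": "down"},
    ],
    "results_cache": {
        "summary": {
            "positive": [
                {"metricId": "m_latency", "variant": "treatment_b", "lift": 0.08},
            ],
            "negative": [],
            "no": [
                {"metricId": "m_latency", "variant": "control", "lift": 0.0},
            ],
        }
    },
}
rows = preprocess_summary(example)
```

Keeping the bucket name on each row makes it easy to show the user both the platform's sign-of-lift bucket and the polarity-corrected reading side by side.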
+ +--- + +## Metric catalog (for polarity lookups) + +``` +metrics[] + .id, .name + .type ("primary" | "guardrail" | "secondary") + .direction ("up" | "down") → always set; defaults to "up" if the source metric was unset +``` + +Build a lookup `metric_id → (type, direction)` and join to summary rows during interpretation. + +--- + +## Settings that change interpretation + +``` +settings.confidenceLevel → significance threshold (e.g. 0.95) +settings.testingModel → "frequentist" or "sequential" +settings.endCondition → "sample_size" or "days" +settings.sampleSize / .endAfterDays → planned end target +settings.multipleTestingCorrection → "off" | "bonferroni" | "benjamini-hochberg" +settings.cuped.enabled → CUPED variance reduction applied +settings.cuped.preExposureDatePreset → pre-exposure window +settings.winsorization.enabled → outlier capping applied +settings.winsorization.percentile → cap percentile (default 95; lower values are extreme) +``` + +--- + +## Decision metadata (post-decide) + +``` +results_cache.message → decision rationale +results_cache.variant → shipped variant key (or special constant) +status → "concluded" | "success" | "fail" +``` + +Special variant constants for `success=true`: + +- `__no_variant_shipped__` — ship the change without picking a variant. +- `__defer_variant_decision__` — defer (UI shows `SUCCESS_DEFERRED`). + +For a kill, pass `success=false`. + +--- + +## Lifecycle hand-off + +``` +Update-Experiment( + experiment_id=, + experiment={ + "action": "decide", + "success": true | false, + "variant": "", # required when success=true + "message": "" + } +) +``` + +`message` is required on every `decide` call. + +--- + +## Misconfig field map (cross-link) + +For _how_ to react to each of these, see [health-check-interpretation.md](health-check-interpretation.md) §7. 
+ +- `settings.multipleTestingCorrection in {"off", null}` with 2+ primaries × 1+ non-control variants +- `settings.winsorization.enabled == true` with `percentile` very low (< ~80) or very high (> ~99) +- `settings.srm == null` OR `settings.srm.enabled == false` (often intentional — only flag if results look suspicious) +- `settings.cuped.enabled == true` AND the experiment cohort is "new users only" +- `settings.confidenceLevel != 0.95` +- `metrics[]` entries with `name == ""` +- A primary metric in `metrics[]` but missing from `live_metrics` AND `results_cache.metrics` + +--- + +## When to reach for sibling tools + +- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the setup-side skill / `Get-Experiment-Setup-Guidance`. +- **Raw data for triggered or segmentation analysis** → `Run-Query` on the metric with appropriate filters. +- **Acting on the recommendation** (ship, kill, extend) → `Update-Experiment` with the appropriate action. +- **Feature-flag rollout history** for SRM root cause → `Get-Feature-Flag`. +- **Session replays** for behavioral explanation of a quantitative result → the replay-fetch tool (see [session-replay-analysis.md](session-replay-analysis.md)). diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md new file mode 100644 index 0000000..4471219 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md @@ -0,0 +1,158 @@ +# Health-Check Interpretation + +Open this when Step 1 of the Decision Tree flags a failure (SRM, Retro A/A, insufficient exposures, peeking, broken-data, < 3-day window, or any misconfiguration). The goal is to turn the platform's already-computed verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action. 
+ +**This skill never recomputes thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers. + +--- + +## Kohavi framing — always cite when a health check fails + +> **Sample Ratio Mismatch is the #1 trustworthiness check (Kohavi).** When SRM is failing, do not trust the experiment's lift, p-values, or confidence intervals — the randomization assumption is broken, so the measured effect cannot be attributed to the treatment. +> +> **Twyman's Law**: any unusually clean or unusually large result is more likely a bug than a discovery. A spectacular lift on a failing-SRM experiment is not evidence of a great treatment; it's evidence the bucketing is broken. + +These two principles drive the recommendations below. Lead with them when explaining a failing check to the user. + +--- + +## 1. SRM (Sample Ratio Mismatch) + +**Verdict to consume**: `live_srm_analysis` (or `exposures_cache.$srm_analysis`). The platform tags failing SRMs already; do not compute chi-square yourself. + +### What it means + +Users were assigned to variants in proportions that disagree with the configured `settings.srm.targetAllocations`. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness. + +### Likely causes, ordered most → least likely + +(Surface in this order — investigate the most probable first.) + +1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees. +2. **biased_assignment** — The assignment criterion correlates with the variant — e.g. assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window. +3. 
**bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation. +4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the `$experiment_started` event fires exactly once per user per variant assignment. +5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period. + +### Recommended actions + +- **pause_and_investigate** — Pause the experiment before drawing any conclusions. SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable. +- **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric. +- **investigate_exposure_logging** — Inspect `$experiment_started` event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs. +- **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split. + +### Investigation checklist + +1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented? +2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Use `Get-Feature-Flag` to inspect rollout rules and history. +3. For multi-variant tests, the platform's SRM threshold is Bonferroni-corrected — the effective per-variant threshold may be tighter than the headline. Trust the bucket flag, not raw p-value math. +4. 
Verify SDK version and bucketing logic. `Run-Query` for `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly. +5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it. +6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting. +7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.** + +--- + +## 2. Retro A/A (pre-experiment bias) failure + +**Verdict to consume**: the analysis the platform attached when `settings.preExperimentBias` is enabled. + +### What it means + +The same statistical comparison run on the **pre-exposure** period revealed that variant cohorts already differed _before_ the treatment started. Any "lift" measured during the experiment may just be reflecting that pre-existing gap, not the change. + +- Pre-experiment bias on a **primary** metric is a **stop-and-investigate** signal. +- Pre-experiment bias on a **secondary** metric is informational only. + +### Investigation checklist + +1. Identify which metric × variant pair triggered the failure (after the platform's correction). +2. Check whether bucketing was deterministic — non-deterministic assignment in the pre-period means users were assigned to different variants than they would have been in production. +3. Look for cohort skew: did one variant disproportionately receive heavy users? `Run-Query` on the metric pre-experiment grouped by variant to confirm. +4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort. +5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing. + +--- + +## 3. 
Insufficient exposures + +**Verdict to consume**: `live_exposures` per variant, plus any platform-attached "insufficient" flag. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue. + +### Investigation checklist + +1. Check `live_exposures` totals — which variant is undersampled? +2. Inspect feature-flag rollout: `Get-Feature-Flag` → was rollout dialed back? +3. `Run-Query` for the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?). +4. If the experiment is still ACTIVE: extend duration via `Update-Experiment` with `endAfterDays`. +5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math. + +If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question. + +--- + +## 4. Frequentist peeking + +**Verdict to consume**: `settings.testingModel == "frequentist"`, plus `end_date` vs `start_date + endAfterDays` (or `sampleSize` vs `live_exposures.$overall`, depending on `settings.endCondition`). + +### What it means + +A frequentist test that ends before reaching its configured target has an **inflated false-positive rate**. The math assumes a fixed sample size; peeking before that point and stopping on a favorable look is exactly what "p-hacking" looks like in production. + +### Investigation checklist + +1. Confirm `settings.testingModel == "frequentist"`. +2. Compare `end_date` against `start_date + endAfterDays` (or whether `sampleSize` was reached, whichever is the configured `endCondition`). +3. If the conclusion was premature: results have inflated false-positive rate. Recommend a re-run. +4. If the user wants to keep current results: caveat strongly. Recommend `testingModel: "sequential"` for the next experiment so they can stop early without penalty. 
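
The comparison in steps 1 and 2 of the checklist above can be sketched as follows. It is only a sketch under stated assumptions: `start_date` and `end_date` are assumed to be ISO `YYYY-MM-DD` strings (the actual format is not specified here), `total_exposures` stands in for `live_exposures.$overall`, and the function name `ended_before_target` is hypothetical.

```python
from datetime import date, timedelta
from typing import Any, Dict, Optional


def ended_before_target(
    settings: Dict[str, Any],
    start_date: str,
    end_date: str,
    total_exposures: Optional[int] = None,
) -> Optional[bool]:
    """Return True if a frequentist experiment ended before its configured
    target, False if it did not, and None if it cannot be determined.

    Assumption: dates are ISO "YYYY-MM-DD" strings.
    """
    if settings.get("testingModel") != "frequentist":
        return False  # sequential tests may stop early by design

    end_condition = settings.get("endCondition")
    if end_condition == "days":
        planned_end = date.fromisoformat(start_date) + timedelta(
            days=settings["endAfterDays"]
        )
        return date.fromisoformat(end_date) < planned_end
    if end_condition == "sample_size":
        if total_exposures is None:
            return None  # exposures unavailable; say so rather than guess
        return total_exposures < settings["sampleSize"]
    return None


# Illustrative values only.
premature = ended_before_target(
    {"testingModel": "frequentist", "endCondition": "days", "endAfterDays": 14},
    start_date="2026-05-01",
    end_date="2026-05-08",
)
```

Returning `None` when a field is missing keeps the sketch in line with this skill's rule of saying a verdict is unavailable instead of synthesizing one.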
+ +(Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.) + +--- + +## 5. Live computation timeout / broken data + +**Verdict to consume**: `live_results_errors` non-null with `live_*` fields null. + +### Investigation checklist + +1. Try `Get-Experiment` again — transient backend load may resolve. Wait ~30s between retries. +2. If repeated failures: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. +3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision. +4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation. + +--- + +## 6. Experiment ran < 3 days + +**Verdict to compute (this one is local)**: `end_date - start_date`. + +Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. **Refuse to interpret.** Tell the user explicitly: + +> _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_ + +If `endCondition: "sample_size"` with a tiny target (e.g. 10k) was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window. + +--- + +## 7. Misconfigurations to flag during Step 1 + +These don't always invalidate results, but they change how to _read_ them. Surface them as warnings. 
+ +- `settings.multipleTestingCorrection in {"off", null}` AND there are 2+ primary metrics across 1+ non-control variants → without correction, any single significant primary may be a false positive. **Don't assume the result is broken** — look at all primary results in aggregate. If most or all primaries point the same direction (all positive or all negative), there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk**, and the user can enable correction (Benjamini-Hochberg or Bonferroni) and re-analyze. The family-wise error rate compounds with the number of comparisons: for m independent tests at level α it is 1 − (1 − α)^m (e.g. 15 primaries × 1 variant at α=0.05 → ~54% chance of at least one false positive). +- `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` very low (e.g. < ~80) or unusually high (e.g. > ~99) → extreme outlier capping. The platform's default is 95; a percentile near 50 caps almost all data and likely indicates misconfiguration. +- `settings.srm == null` OR `settings.srm.enabled == false` → the SRM check didn't run. **SRM is often deliberately disabled** (e.g. when feature-flag rollouts intentionally split traffic unevenly), so do not try to compute it yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) — then suggest the user re-enable SRM and re-analyze. +- `settings.cuped.enabled == true` AND the experiment cohort is "new users only" → CUPED requires pre-exposure data, which new-user experiments lack, so CUPED simply has no effect. **This does NOT invalidate results** — variance reduction just didn't happen. Mention it as informational. +- `settings.confidenceLevel != 0.95` → call out explicitly. `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Combine with metric count for a sense of family-wise error rate.
+- `metrics[]` contains entries with `name == ""` → likely a broken or placeholder metric reference. Flag and skip during analysis. +- A primary metric appears in `metrics[]` but is **missing from `live_metrics` AND `results_cache.metrics`** → no result was computed for that primary. Surface prominently — this is "no measurement," not "no effect." Recommend the user re-sync results. + +--- + +## Output shape when a health check fails + +1. **What failed**, in one sentence (use the verdict the platform attached — do not re-derive). +2. **What that means for trust** — cite the Kohavi framing (SRM is #1) or Twyman's Law where it fits. +3. **Likely causes**, ordered most → least probable. +4. **Recommended action** from the small set above. +5. **Investigation checklist** the user can run. +6. **What NOT to do** — usually, "do not act on the current lift / p-value numbers." diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md new file mode 100644 index 0000000..3b44385 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md @@ -0,0 +1,188 @@ +# Per-Metric Interpretation + +Open this when the user wants you to translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_ + +**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from `Get-Experiment`. Then translate. + +--- + +## The mental model + +Each row in `summary.positive` / `summary.negative` / `summary.no` answers four questions: + +1. **Did the lift go up or down?** — the `summary` bucket name (sign-of-lift, not polarity). +2. 
**Was the change distinguishable from noise?** — the `significance` field (or the bucket name itself: rows in `summary.positive` / `summary.negative` are significant, rows in `summary.no` are not). +3. **Was the change in the goal direction?** — apply the polarity recipe with `metric.direction`. +4. **Was the change big enough to matter?** — multiply `lift` by the control baseline `value` to get absolute impact, then judge against business context. + +A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any one of those and you're shipping the wrong thing. + +--- + +## Polarity recipe (repeat from the spine — critical) + +`metric.direction` is `"up"` or `"down"` (defaults to `"up"`). + +- `lift is None` or `lift == 0` → **neutral** (treat as no measurement / no effect respectively). +- `direction == "up"` → **positive** if `lift > 0`, else **negative**. +- `direction == "down"` → **positive** if `lift < 0`, else **negative**. + +A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. A `-1% interstitials_shown` lift in `summary.negative` with `direction: "down"` is plausibly a **win** (less interruption). + +--- + +## Reading the p-value correctly + +The p-value is the probability of observing a difference at least as extreme as the one measured, **assuming the null hypothesis (no real difference) is true**. It is NOT: + +- ❌ The probability that the treatment works. +- ❌ The probability the result will replicate. +- ❌ A measure of effect size — a tiny lift can be highly significant on a huge sample. +- ❌ Proof of "no effect" when above threshold (see "Inconclusive results"). + +Mixpanel uses Welch's t-test (z-test for large samples). Default α = 0.05 at 95% confidence. The confidence level is set on `settings.confidenceLevel`. If it differs from 0.95, call it out in the verdict (`0.9` inflates false positives; `0.99` is conservative). 
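
The polarity recipe earlier in this reference maps directly onto a small function. The sketch below is illustrative; the function name `polarity` is hypothetical, and it only restates the three rules given in the recipe.

```python
from typing import Optional


def polarity(lift: Optional[float], direction: str = "up") -> str:
    """Classify a lift as "positive", "negative", or "neutral" relative to
    the metric's goal direction, following the polarity recipe."""
    if lift is None or lift == 0:
        # None means no measurement; 0 means no effect. Both read as neutral.
        return "neutral"
    if direction == "down":
        return "positive" if lift < 0 else "negative"
    # direction == "up" (also the default when direction was unset)
    return "positive" if lift > 0 else "negative"


# A latency guardrail (direction "down") with a +8% lift is a regression,
# even though the platform files it under summary.positive.
latency_reading = polarity(0.08, "down")
```

Because the bucket name only encodes the sign of the lift, the function deliberately ignores the bucket and reads `lift` and `direction` alone.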
+ +--- + +## Reading the lift correctly + +``` +lift = (treatment_mean - control_mean) / control_mean +``` + +- `liftConfidence` is the **confidence level used** (e.g. 0.95). It is NOT the confidence-interval width. +- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct. +- If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect." + +--- + +## Verdict phrasing — a small palette + +Pick the phrase that matches the four-question pattern. These are the words to use with users; they map onto the platform's already-computed numbers, so the agent never has to invent thresholds. + +| Pattern (sig × polarity × magnitude) | Plain-language verdict | +| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Significant, polarity positive, magnitude large vs baseline | "**Clear win** — `` moved `` in the goal direction, which is meaningful at this baseline." (apply Twyman's Law if lift > ~30%) | +| Significant, polarity positive, magnitude small vs baseline | "**Statistically significant but practically small** — `` on a `` baseline is ``; confirm with the user whether that clears the business bar." | +| Significant, polarity negative | "**Regression** — `` moved `` against its goal direction. This is a reason not to ship even if other primaries won." | +| Not significant, lift in goal direction, well-powered | "**Likely no effect at the detectable size.** The experiment had enough power to detect ``; the observed lift is below that threshold." 
| +| Not significant, lift in goal direction, underpowered | "**Inconclusive — too underpowered to call.** Route to the why-no-statsig playbook to decide between wait / extend / restart." | +| Not significant, lift in wrong direction | "**No detectable harm**, but no win either." | +| `lift is None` | "**No measurement** — this variant's row failed to compute. Surface the failure and re-sync." | +| Lift > ~30% on any metric | Prefix with "**Twyman's Law check:** that lift is unusually large; verify the denominator hasn't changed before celebrating." | + +--- + +## Magnitude — make it absolute + +Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful: + +1. Baseline from the control variant: `live_metrics[metricId][controlKey].value` (or the `summary.no` row where `variant == controlKey`). +2. Lift from the winning row. +3. Absolute lift: `baseline_value × lift`. Examples: + - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate. + - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`. +4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week." + +### Fallback when `value` / `sampleSize` are null + +Common — happens whenever live computation timed out or `results_cache.metrics` was nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.** + +Call `Run-Query` on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation: + +- `unique` (Bernoulli) → conversion **rate** as the baseline. +- `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. 
Multiplying lift by a raw total double-counts cohort size. + +--- + +## Twyman's Law in practice — changed-denominator lifts + +Before celebrating any lift > ~30%, ask: **did the treatment change who is _exposed_ to this metric, not just how they behave?** + +If the treatment causes more users to _see_ a screen, more events naturally fire — the metric grows because the denominator changed, not because per-user behavior changed. + +- A "Free item" promotion drives more users to checkout → "Checkout Screen Viewed" lifts +1000% mechanically. The interesting question is **conversion rate on the screen**, not raw views. +- A new banner makes a feature discoverable → "Feature Page Viewed" lifts dramatically. **Per-discover-er behavior** may be unchanged. + +When you see a > 30% lift, name the risk explicitly: + +> _"This metric measures exposure to the screen/event. The treatment likely caused more users to be exposed; that explains most of the lift mechanically. The interesting question is what those users did once they got there."_ + +--- + +## Metric distribution types + +Different metric types behave differently; cite the relevant nuance in your verdict. + +| Metric type | Distribution | Interpretation nuance | +| -------------------------------- | ------------ | --------------------------------------------------------------------------------------------------------- | +| Unique users / conversion rate | Bernoulli | Variance = `p(1−p)`. Lift on rates near 50% is most powered; rates near 0% or 100% need much more sample. | +| Event counts / sessions per user | Poisson | Variance = mean. Highly sensitive to power users; consider whether one heavy user can swing results. | +| Revenue / numeric properties | Gaussian | Long tails (whales) inflate variance. Strongly consider Winsorization. 
| + +--- + +## Variance-reduction & outlier settings that change interpretation + +- **CUPED** (`settings.cuped.enabled == true`): mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). +- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration (see `health-check-interpretation.md` §Misconfig). + +--- + +## Multiple comparisons & metric tiers — what's decisional and what isn't + +| Tier | How it influences the verdict | +| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Primary** | **Decisional.** The platform auto-applies correction when `multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` (across primaries × variants). | +| **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude. | +| **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. | + +If `settings.multipleTestingCorrection` is `"off"` AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled. 
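To build intuition for what the two correction modes do, here is a minimal sketch of the textbook Bonferroni and Benjamini-Hochberg procedures. It assumes a plain list of p-values, one per primary × variant comparison; the function names and example values are illustrative, not fields returned by `Get-Experiment`, and the platform's own implementation may differ. It is meant for explanation only, not for recomputing the platform's verdicts.

```python
def bonferroni(p_values, alpha=0.05):
    """Textbook Bonferroni: each test must clear alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Textbook Benjamini-Hochberg step-up procedure (controls FDR)."""
    m = len(p_values)
    # Indices of the p-values, smallest p-value first.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for four primary comparisons.
example = [0.01, 0.02, 0.03, 0.20]
print("bonferroni:", bonferroni(example))
print("benjamini-hochberg:", benjamini_hochberg(example))
```

With these example values, Bonferroni keeps only the strongest result while Benjamini-Hochberg keeps three, which is why the same set of primaries can read differently depending on the configured correction.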
+ +--- + +## "Significance = NO" does NOT mean "no effect" + +A row in `summary.no` means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level. **Important when the user is about to call something a null result.** + +Options to suggest when a primary metric lands in `summary.no`: + +1. **Extend duration** (if the experiment is still ACTIVE). +2. **Increase traffic allocation** (if there's headroom — never mid-Frequentist-test, which invalidates SRM). +3. **Use Sequential testing model** for the next experiment if continuous monitoring fits. +4. **Enable CUPED** if the metric correlates with pre-exposure behavior. +5. **Narrow the hypothesis** — test a stronger version, or scope to a more responsive segment. +6. **Accept the null** — if the experiment was well-powered for the MDE that matters, "no effect" is a real finding. + +For the full "why hasn't this hit statsig yet" walk-through, see [why-no-statsig.md](why-no-statsig.md). + +--- + +## Frequentist vs Sequential — what affects per-metric reading + +Check `settings.testingModel`: + +- `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration. +- `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended. + +Calling `Update-Experiment(action="conclude")` on a Frequentist experiment that hasn't reached its target is a peeking event. Flag it in the verdict. + +--- + +## Triggered analysis & dilution + +If the change only affects a subset of users (e.g. only triggers when a specific button is shown), the **effect on triggered users** is much larger than the **effect on the full exposed population**. 
+ +- Triggered analysis zooms in on users who actually saw the change. +- Dilution math: `population_lift = triggered_lift × (triggered_users / total_exposed)`. + +The platform doesn't auto-compute triggered analysis. If the change is gated by a condition, ask the user about the trigger rate and walk through the math before declaring the population-level lift "small." + +--- + +## Novelty and primacy + +- **Novelty** — lift is large early, then decays as users habituate. +- **Primacy** — lift is small or negative early, then grows as users learn the new behavior. + +To detect either, look at the line-chart view of the metric (date-segmented). A monotonic decay from day 1 → day 14 is classic novelty; the steady-state lift is what matters for shipping. Call this out when interpreting any experiment shorter than ~2 weeks. diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md new file mode 100644 index 0000000..6877d2a --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md @@ -0,0 +1,95 @@ +# Segment-Breakdown Interpretation + +Open this when the user has per-segment results in hand and wants to read them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. + +> **Platform support status.** Reading segment-level experiment results in `Get-Experiment` depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment `Run-Query` calls against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. 
If `Get-Experiment` doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the `Run-Query` fallback — do not invent per-segment significance verdicts. + +--- + +## The mental model + +A segment breakdown asks: _did the treatment affect different user segments differently?_ It has three possible outcomes per segment: + +1. **The segment moved in the same direction as the overall effect**, with similar magnitude → reinforces the overall verdict; nothing new. +2. **The segment moved much more or less than overall**, but in the same direction → heterogeneity; the effect is concentrated in a subset. +3. **The segment moved in the _opposite_ direction** to overall → Simpson's paradox or a real reversal — this is where segment analysis earns its keep. + +Reading a segment breakdown well means recognizing which of those three you're looking at and not mistaking noise for any of them. + +--- + +## Per-segment polarity recipe — apply per row + +The same recipe from the per-metric reference applies _inside_ each segment. Don't take a shortcut. + +- For each segment × metric × non-control variant, look at the row's `lift` and bucket (positive/negative/no). +- Translate sign-of-lift into business polarity using `metric.direction`. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary. +- Filter out the control row in each segment. + +Surprisingly easy to forget when you're scanning a wide table — re-apply polarity per row. + +--- + +## Sample-size floor per segment + +Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. As a rule of thumb, the same ~350-per-variant floor used for overall trustworthiness applies per segment. + +- Segments below the floor → mark "insufficient sample, treat as directional only." +- A "significant" lift on a 50-user-per-variant segment is almost always noise. Say so. 
+- If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice. + +--- + +## Heterogeneity vs Simpson's paradox vs noise + +| What you see | Interpretation | +| --------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Most segments lift positive, one or two negative, all with overlapping CIs | **Noise.** Not heterogeneity. Don't ship a segment-specific story. | +| One segment lifts much more than the rest, with a tight CI and a clear mechanism | **Real heterogeneity.** The change is concentrated in that segment. Consider shipping only to that segment, or revising the hypothesis. | +| Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. | +| Two opposite-direction effects in different segments that roughly cancel overall | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses. | + +When you spot Simpson's paradox, route the user to [health-check-interpretation.md](health-check-interpretation.md) §SRM — it's usually the cause, not a real reversal. + +--- + +## What a "ship only to segment X" recommendation requires + +Don't recommend a segment-scoped ship unless **all** of these hold: + +1. The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it). +2. The segment's per-variant sample clears the ~350 floor by a comfortable margin. +3. 
The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment. +4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product. +5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply. + +Otherwise, the segment-only ship is a post-hoc story dressed up as a decision. Recommend confirming with a follow-up experiment scoped to that segment. + +--- + +## When a segment loses but overall wins + +This is the everyday case of mixed effects. + +- If the losing segment is small and its absolute hit is acceptable, ship to all — but call out the loser in the rationale. +- If the losing segment is large or has a guardrail regression, recommend iterate, not ship. +- If the losing segment is a regulated / strategic cohort (paying tier, top customers, EU), default to iterate — guardrails on the cohort, not just overall. + +--- + +## What NOT to do + +- ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition. +- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment `Run-Query` fallback — they're not corrected unless the platform did it. +- ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal. +- ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism. +- ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence). + +--- + +## Output shape + +1. **One-sentence segment-level summary** — homogeneous, heterogeneous, or Simpson's-suspicious. +2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered). +3. 
**What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's." +4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating). diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-of-interest-selection.md new file mode 100644 index 0000000..ea9f22b --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-of-interest-selection.md @@ -0,0 +1,116 @@ +# Segment-of-Interest Selection + +Open this when the user wants to break results down by user segments — _"slice this by platform"_, _"which segments should I look at?"_, _"are new users responding differently?"_. The goal is to pick 3–5 segments that are **likely to reveal a real effect difference**, before slicing every available dimension and ending up p-hacking. + +The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them. + +--- + +## Why this matters: the fishing-expedition problem + +If you slice an experiment by every available property (10 platforms × 20 countries × 5 plan tiers × …), you will find "significant" segment-level effects by chance alone. The family-wise false positive rate explodes the same way it does for too many primary metrics — except there's usually no platform-level correction across segments. **Pre-committing to a small set of segments, ordered by hypothesis-driven probability, is the discipline that makes segment analysis credible.** + +Aim for 3–5 segments, max. If the user wants more, ask which ones are connected to the hypothesis and which are exploration. Mark the exploration set as "hypothesis-generating, not decisional." 
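The fishing-expedition risk can be sketched numerically. The snippet below assumes every slice is an independent test at the same significance level with no true effect anywhere; real segments overlap and correlate, so treat the numbers as a rough illustration rather than an exact rate. The function names are hypothetical.

```python
def chance_of_false_positive(num_slices, alpha=0.05):
    """Probability that at least one of `num_slices` independent null tests
    looks significant purely by chance."""
    return 1 - (1 - alpha) ** num_slices


def expected_false_positives(num_slices, alpha=0.05):
    """Expected number of chance 'wins' across `num_slices` null tests."""
    return num_slices * alpha


# 5 pre-committed segments vs. 35 single-dimension slices
# (10 platforms + 20 countries + 5 plan tiers).
for k in (5, 35):
    print(
        f"{k} slices: "
        f"P(at least one false positive) = {chance_of_false_positive(k):.2f}, "
        f"expected false positives = {expected_false_positives(k):.2f}"
    )
```

Under these assumptions, even a handful of slices carries a noticeable chance of a spurious "win", and slicing every dimension makes one close to certain. That is the case for naming segments upfront.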
+ +--- + +## The decision tree for picking segments + +Walk through these in order. The first match is the most defensible pick. + +### 1. Segments the hypothesis explicitly names + +If the experiment's `hypothesis` (or `description`) text mentions "new users", "mobile", "Pro tier", "EU customers" — those segments are pre-committed by the experiment design. Always include them. + +Look at: + +- `experiment.hypothesis` +- `experiment.description` +- The setup-side conversation, if present + +These are not exploratory; they're the variables the team committed to test. + +### 2. Segments where the mechanism is expected to matter + +The hypothesis names _what_ the change is and (ideally) _why_ it should work. The "why" tells you which user attributes plausibly moderate the effect: + +| Hypothesis mechanism | Segments likely to moderate the effect | +| ------------------------------------------------- | -------------------------------------------------- | +| "Reduces first-time friction in onboarding" | New vs returning; signup source; locale | +| "Improves discoverability of feature X" | Users who previously used X vs not; tenure | +| "Speeds up a slow flow" | Platform (mobile slower than web); connection type | +| "Lowers payment friction" | Plan tier; payment-method type; geography | +| "Replaces a confusing UI element" | New vs returning (returning users habituated) | +| "Surfaces a feature only relevant to power users" | Engagement-tier cohorts; tenure | +| "Localized copy / pricing change" | Country / language | + +If you can't articulate _why_ a segment should respond differently, it's not a hypothesis-driven slice. Demote it. + +### 3. Segments where the **denominator** plausibly differs + +Some properties don't change _behavior_ but change _who gets exposed_. Slicing on these helps catch changed-denominator artifacts before they're called a win. + +- Triggered vs untriggered cohorts (if the treatment only fires on certain pages). 
+- Platform / app version (the treatment may only ship on a subset of clients). +- Device class (mobile vs desktop) when the change is platform-specific. + +A 1000% lift in `Checkout Screen Viewed` overall usually disappears once you condition on "users who reached the checkout funnel" — that disappearance is the finding. + +### 4. Segments where SRM or baseline shift is suspected + +If overall SRM is borderline (or failing in one variant only), per-segment SRM can localize the bucketing bug to a specific platform / country / cohort. Examples: + +- iOS vs Android (often the SDK bucketing layer differs). +- Bot-suspicious countries (`bot_traffic` cause from health-check). +- A specific app version range that shipped a flag-evaluation change. + +This is diagnostic segmentation, not interpretation segmentation. Use it when Step 1 of the Decision Tree has already flagged trouble. + +### 5. Segments the platform de facto requires + +Some user dimensions are so foundational that any results report should mention them once: + +- **Platform** — web vs iOS vs Android. +- **New vs returning** — defined as first session within the experiment window vs before. +- **Geo region** — EU vs US vs APAC, when results meaningfully differ by regulatory or payment context. + +Don't include all three blindly — pick the one(s) most likely to vary given the change. + +--- + +## Sanity checks before committing to a slice + +For each segment you want to break down on: + +1. **Does each segment value have ~350+ exposed users per variant?** Below that floor, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment. +2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis. +3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. 
"Pro tier" boundaries changed mid-test) invalidate the comparison. +4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes — slicing on them is selection-bias, not stratification. + +--- + +## How many slices to commit to + +| Situation | Number of slices | +| ----------------------------------------------------------------- | ------------------------------- | +| Hypothesis-driven, well-powered, decisional | 3–5 segments, named upfront | +| Exploratory ("anything weird?"), flagged as hypothesis-generating | Up to ~10, with explicit caveat | +| Diagnostic (chasing a failing SRM or strange overall result) | Whatever helps localize the bug | + +If the user wants to "just look at everything", push back: pick the top 3–5 with reasoning, then offer a separate exploratory pass that won't be used for the ship decision. + +--- + +## The pre-commit ritual + +Before running the breakdowns, tell the user something like: + +> _"Based on the hypothesis (``), I'd slice by `` and `` because ``. I'm intentionally not slicing `` because they don't connect to the proposed mechanism — looking at every dimension makes false positives almost guaranteed. We can do an exploratory pass after, separately from the ship decision. Sound right?"_ + +Pre-commitment is what separates "segmentation analysis" from "fishing." + +--- + +## Then read the results + +Once the segment breakdown is in hand, switch to [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there. 
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md new file mode 100644 index 0000000..88640f4 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md @@ -0,0 +1,109 @@ +# Session-Replay Analysis Guidance + +Open this when the user wants to use session replays to explain a quantitative experiment result — _"why is conversion down in treatment?"_, _"what are users actually doing in the treatment?"_, _"can replays explain the regression?"_. The goal is to turn a number into a behavior story. + +> **Tool boundary.** This skill provides the _interpretation_ guidance for replay analysis. The actual replay-fetching tool — pulling replay IDs for control vs treatment cohorts — lives on the platform side (a separate fetch tool exposed alongside `Get-Experiment`, when available). If the fetch tool isn't yet available, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. + +--- + +## When replays help, when they don't + +| Question | Replays help? | +| ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | +| "Why is conversion lower in treatment?" | Yes — behavior diff is observable. | +| "Why is `Checkout Screen Viewed` 10× higher in treatment?" (changed-denominator suspect) | Yes — replays show whether users are _bouncing_ or _converting_ after they get there. | +| "Why is `time_on_page` higher in treatment?" | Yes — distinguishes engaged reading vs confused dwell. | +| "Is the treatment shipping a regression on iOS only?" | Sometimes — better answered first by segment breakdown. | +| "Why is SRM failing?" | No — replays don't show bucketing. 
Go to health checks. | +| "What's the lift?" | No — replays are qualitative; they explain _why_, not what. | +| "Why hasn't this hit statsig yet?" | No — that's a sample/power question, not a behavior question. | + +A useful heuristic: replays answer _behavioral_ questions. If the question isn't behavioral, replays will burn time without adding signal. + +--- + +## Cohort selection: which replays to compare + +You're looking for **paired contrast**, not a random sample. Pick the cohort that maximizes signal for the specific question. + +| Question | Cohort A (replays to pull) | Cohort B (replays to pull) | +| -------------------------------------------------------------------- | ---------------------------------------------------------- | ----------------------------------------------------------- | +| Why is primary metric down in treatment? | Treatment users who **failed** the primary action | Control users who **succeeded** at the primary action | +| Why is a guardrail regression appearing? | Treatment users who **triggered** the guardrail negatively | Control users who did NOT trigger it | +| Why does treatment have a huge lift in `Screen Viewed` (denom shift) | Treatment users who reached the screen | Same users, looking at whether they completed the next step | +| Why is engagement higher / lower in a specific segment? | Treatment users in that segment | Control users in the same segment | +| What does the new UI look like in practice? | Any treatment users who saw the change | Any control users to confirm the baseline UI | + +**Aim for ~5 replays per cohort.** Fewer and you're anecdote-shopping; many more and you'll just confirm what the first 5 already showed. If the first 5 are inconclusive or contradictory, pull 5 more before changing tactics. + +Filter by recency — replays from the most recent days of the experiment best reflect steady-state behavior (avoid novelty / primacy noise). 
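As a rough sketch of the cohort-selection rule above for the "primary metric down in treatment" question, the snippet below picks the most recent replays for each side of the contrast pair. The `Replay` record, its fields, and the `"control"` / `"treatment"` labels are hypothetical; the platform's replay-fetch tool, when available, may expose a different shape.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Replay:
    # Hypothetical record shape, for illustration only.
    replay_id: str
    variant: str             # assumed labels: "control" or "treatment"
    completed_action: bool   # did the user complete the primary action?
    recorded_at: datetime


def most_recent(replays, n):
    """Return the n newest replays, newest first."""
    return sorted(replays, key=lambda r: r.recorded_at, reverse=True)[:n]


def pick_contrast_pair(replays, per_cohort=5):
    """Cohort A: treatment users who failed the primary action.
    Cohort B: control users who succeeded at it."""
    cohort_a = [
        r for r in replays
        if r.variant == "treatment" and not r.completed_action
    ]
    cohort_b = [
        r for r in replays
        if r.variant == "control" and r.completed_action
    ]
    return most_recent(cohort_a, per_cohort), most_recent(cohort_b, per_cohort)
```

If either cohort comes back with fewer than the requested number of replays, say so in the output rather than padding it with replays from a broader cohort.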
+ +--- + +## What to actually watch for + +Go in with a hypothesis from the quantitative result. Don't watch replays blank-eyed; you'll see "users using the app" and learn nothing. + +### Friction / failure patterns + +- **Hesitation** — long pause before clicking a key element (often signals confusion). +- **Misclicks** — clicking non-interactive elements, or rage-clicking a button that didn't work. +- **Form abandonment** — typing into a field, then leaving without submitting. +- **Back-button bounce** — landing on the page, then immediately backing out. +- **Scroll-and-leave** — scrolling without engaging, then exiting. + +If treatment has more of these than control, you have a behavior explanation for a primary loss or guardrail regression. + +### Layout / discoverability issues + +- **CTA below the fold** — users never scrolling to where the new button is. +- **Element overlap on mobile** — the treatment looks fine in desktop testing but breaks on small screens. +- **Hidden state** — a tooltip / modal that fires once and is then gone, so the user never sees the key affordance. + +These usually explain segment heterogeneity (loss concentrated in mobile, or in a specific viewport size). + +### Changed-denominator behavior + +If you're investigating a Twyman's-Law-sized lift, look for: + +- **Users landing on the new screen and immediately leaving** — explains the inflated `Viewed` event without explaining real conversion. +- **Users completing the rest of the funnel at a much lower rate per-arrival** — explains why the headline metric grew but downstream metrics didn't follow. + +If treatment users _arrive_ at a screen more often but _complete_ at a lower per-arrival rate, the "lift" is a denominator artifact and the per-converter behavior is the real story. + +### Variant-specific UI issues + +- **Treatment showed the wrong copy / wrong asset** — surprisingly common; treatment shipped, but to a subset of routes only. 
+- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to [health-check-interpretation.md](health-check-interpretation.md). +- **Treatment fired twice / persisted state across sessions** — implementation regression. + +--- + +## How to frame the findings + +Replay analysis is qualitative. Be honest about that. + +- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in `live_metrics`."_ +- ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict. + +Tie observations back to specific quantitative results from `Get-Experiment`. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior. + +--- + +## What NOT to do + +- ❌ Use replays to override a clear quantitative verdict. If primaries say "ship" and replays look ugly, the ugliness might be edge cases — confirm with segment analysis first. +- ❌ Cherry-pick a single dramatic replay. n=1 is anecdote. +- ❌ Replace segment analysis with replays. Replays explain _behavior_; segments explain _who_. Different questions. +- ❌ Pull replays from broad cohorts ("all treatment users") — the contrast pair is what reveals signal. +- ❌ Spend more time on replays than on the headline interpretation. The decision tree comes first; replays are the explanation step after it. + +--- + +## Output shape + +1. **The quantitative result the replays are explaining** — link back to the specific metric and verdict. +2. **Cohorts watched** — what filters were applied to A and B, how many replays in each. +3. **Patterns observed**, with counts (e.g. "4 of 5 treatment replays showed X; 0 of 5 control replays did"). +4. 
**The explanation hypothesis** — careful to frame as hypothesis ("consistent with"), not as proof. +5. **Recommended next action** — usually one of: ship anyway (regression edge-case), iterate (fix the friction), kill (treatment is materially worse), or run a follow-up A/B with the fix. diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md new file mode 100644 index 0000000..fdad2cd --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md @@ -0,0 +1,115 @@ +# Why Hasn't This Reached Statistical Significance Yet? + +Open this when the user asks why a primary metric is still inconclusive — _"why isn't this stat-sig yet?"_, _"should I wait or stop?"_, _"is this just underpowered?"_. The goal is to help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null**, _without_ recomputing the platform's verdicts. + +The actual stop / extend math (sample size, power, MDE) is owned by the **setup-side skill** — point the user there for the formulas. This skill explains _which_ lever to pull, not how to recompute one. + +--- + +## First, rule out a broken result + +Inconclusive can mean two very different things: + +1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about. +2. **The result isn't trustworthy at all** — SRM failing, broken data, peeked frequentist, etc. — and "inconclusive" is the wrong frame entirely. + +Before answering "why no statsig?", run the trustworthiness gate (Step 1 of the Decision Tree). If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power. + +Also check: + +- `lift is None` on the primary → no measurement, not "no effect." 
+- The primary is in `metrics[]` but missing from `live_metrics` and `results_cache.metrics` → "no measurement." +- `live_results_errors` is non-null → results are stale or partial; resolve before drawing power conclusions. + +--- + +## The five real reasons an experiment hasn't hit statsig + +Walk through these in order. The first one that explains the picture is usually right. + +### 1. Not enough sample yet (not enough exposures) + +**What to look at**: `live_exposures` per variant vs `settings.sampleSize`; or the elapsed duration (`end_date - start_date`) vs `settings.endAfterDays`; plus `settings.testingModel`. + +- **Sequential** + target not reached → genuinely too early. Recommend **WAIT**. +- **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe. +- Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5. + +If exposures are falling short of plan because traffic dropped: surface that. `Run-Query` on the exposure event with a date breakdown shows whether something changed mid-experiment. + +### 2. Observed effect is smaller than the MDE + +**What to look at**: the lift on the primary in `live_metrics[primary][treatment].lift`, plus the MDE the user planned for (typically captured in the experiment's `description` or recovered via the setup-side skill's power math). + +- Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1. +- Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options: + - **Accept the null** — at this size, the change isn't moving the metric. Document and move on. + - **Resize and rerun** — if a smaller effect would still be ship-worthy, re-run with a larger sample (lower MDE).
+- Observed lift much **larger** than planned MDE but still not significant → unusual; likely high variance (see reason 3) or insufficient exposures (reason 1). + +### 3. Variance is too high (metric is too noisy) + +**What to look at**: distribution type of the metric, plus `settings.cuped.enabled` and `settings.winsorization.enabled`. + +- **Gaussian** metric (revenue, time-on-page) with no winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization (default percentile 95) on the next run. +- **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume. +- **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the absolute scale of detectable effects. Lifts near 50% rates are easiest; lifts near 0%/100% need much more sample. +- **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%. +- **CUPED enabled on a new-user-only cohort** → CUPED has no effect (no pre-exposure data exists). Not a misconfiguration to "fix," but variance reduction simply didn't happen. + +### 4. Traffic split is starving the variant + +**What to look at**: `settings.srm.targetAllocations` and `live_exposures` per variant. + +- Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue. +- Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later. +- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. 
Each non-control variant needs its own ~350+ sample for the per-comparison stats to be reliable. Adding arms costs power per-comparison. + +Never change traffic allocation mid-Frequentist test — it invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment. + +### 5. Exposure config is filtering more users than the user expects + +**What to look at**: the exposure tracking method (`$experiment_started` event volume), any audience filters on the backing feature flag, and `settings.excludeQA`. + +- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." `Get-Feature-Flag` reveals the rollout rules; `Run-Query` on `$experiment_started` confirms how many users actually got exposed. +- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with `Run-Query`. +- `settings.excludeQA` was off and you suspect internal traffic is dominating one variant → enable it on the next run (results then are cleaner but also smaller). + +**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md). + +--- + +## Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL? + +Once you know which reason fits, the recommendation almost picks itself. + +| Reason | Recommendation | +| -------------------------------------- | ------------------------------------------------------------------------------------------------------------ | +| Not enough sample yet, still ACTIVE | **WAIT.** Show projected end date based on observed traffic. 
| +| Not enough sample yet, concluded early | **EXTEND** (Frequentist: relaunch with longer planned duration; Sequential: resume if possible). | +| Effect << MDE | **ACCEPT NULL** if the planned MDE is the smallest ship-worthy effect; otherwise **BOOST POWER** and re-run. | +| Variance too high | **BOOST POWER**: enable CUPED, enable Winsorization, switch to a less noisy metric proxy. | +| Variant starved by traffic split | **EXTEND** (if remaining time is enough) or restart with rebalanced split. | +| Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. | +| Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. | + +When recommending EXTEND on an active experiment, the call is `Update-Experiment` with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the setup-side skill for the power math. + +--- + +## What NOT to suggest + +- ❌ **Stop early on a favorable peek** in a Frequentist test — that's exactly the false-positive inflation problem. +- ❌ **Switch testing model mid-experiment** — restart, don't morph. +- ❌ **Add more primary metrics** to "fish" for a win — multiplies the family-wise FPR. If a single primary is inconclusive, more primaries make the picture worse, not better. +- ❌ **Re-run identical hypothesis on the same audience right after concluding "no effect"** — without a power change, you'll get the same answer. +- ❌ **Claim "no effect"** from an underpowered inconclusive result — the right framing is "the experiment wasn't sized to detect the effect we observed." + +--- + +## Output shape + +1. **The reason** (one of the five above), in one sentence. +2. **The evidence from `Get-Experiment`** — which fields told you (e.g. 
"exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.). +3. **Recommendation** from the table above, with the specific `Update-Experiment` call or follow-up action. +4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment. diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md new file mode 100644 index 0000000..4e344d3 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md @@ -0,0 +1,236 @@ +--- +name: experiment-results +description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts that `Get-Experiment` returns — never recomputes thresholds. +license: Apache-2.0 +--- + +# Experiment Results Interpretation + +You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. **Read the Decision Tree first** and use it as the spine of every interpretation. Drop into the deeper references only when the situation calls for it. + +## Requirements + +- Access to Mixpanel via the MCP server (specifically the `Get-Experiment` tool — and, for ship/kill decisions, `Update-Experiment`). +- This skill reads the verdicts that `Get-Experiment` already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values. + +## When to use this skill + +Trigger when the user asks anything about reading an experiment's results or its health. Common phrasings: + +- "What do these results mean?" / "Should we ship this?" +- "Is this experiment trustworthy?" 
/ "Why is SRM failing?" +- "Why hasn't this hit statistical significance yet?" +- "Break this down by ``" / "What segments should I look at?" +- "What does this Retro A/A failure mean?" +- "Can you compare the session replays for control vs treatment?" + +Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the setup-side skill or tool. + +--- + +## How to read `Get-Experiment` output + +Always call `Get-Experiment` with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.** + +| Concept | Live (preferred) | Cached fallback | +| ---------------------------- | --------------------------------- | ------------------------------------------- | +| Per-variant exposure counts | `live_exposures` | `exposures_cache` (strip `$`-prefixed keys) | +| SRM check | `live_srm_analysis` | `exposures_cache.$srm_analysis` | +| Per-metric per-variant stats | `live_metrics[metricId][variant]` | `results_cache.metrics[metricId][variant]` | +| Bucketed summary | recompute from `live_metrics` | `results_cache.summary` | +| When was this computed? | "now" | `exposures_cache.$last_computed` | + +If `live_results_errors` is non-null, the live path failed. Use the cache, caveat that data is stale, and surface the error to the user — the underlying failure may need fixing before any decision. + +If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect." + +See [references/get-experiment-fields.md](references/get-experiment-fields.md) for the full field map and which fields drive each step below. + +--- + +## The Decision Tree + +This is the spine of every interpretation. Run the steps **in order**. **Stop at the first failure** — do not proceed to step N+1 if step N flags a problem. 
+ +``` +┌─ Step 1: TRUSTWORTHINESS GATE ───────────────┐ +│ SRM ok? → exposures sufficient? → │ +│ Retro A/A clean? → minimum duration met? → │ +│ no misconfig? │ +│ │ │ +│ fail → STOP. See references/ │ +│ health-check-interpretation.md │ +└──────────────┬───────────────────────────────┘ + ↓ pass +┌─ Step 2: STATISTICAL SIGNIFICANCE ───────────┐ +│ For each non-control variant × primary, │ +│ apply the polarity recipe (sign-of-lift + │ +│ metric.direction). Significant + correct │ +│ polarity = "win"; significant + wrong │ +│ polarity = "loss". │ +│ │ │ +│ nothing significant on primaries → │ +│ see references/why-no-statsig.md │ +└──────────────┬───────────────────────────────┘ + ↓ at least one primary win +┌─ Step 3: GUARDRAIL CHECK ────────────────────┐ +│ Any guardrail significant in the wrong │ +│ polarity? → regression → ITERATE not ship │ +└──────────────┬───────────────────────────────┘ + ↓ guardrails clean +┌─ Step 4: PRACTICAL SIGNIFICANCE ─────────────┐ +│ Convert the lift on the primary into │ +│ absolute terms. Is it big enough to │ +│ matter to the business? │ +│ Statistically significant ≠ ships. │ +└──────────────┬───────────────────────────────┘ + ↓ meaningful magnitude +┌─ Step 5: VERDICT ────────────────────────────┐ +│ Trust ✓ + primary win + guardrails ✓ + │ +│ meaningful magnitude → SHIP │ +│ Trust ✓ + primary win + guardrail regress │ +│ → ITERATE │ +│ Trust ✓ + primary neutral after target │ +│ → KILL or ITERATE │ +│ Trust ✗ │ +│ → DO NOT DECIDE; report failures │ +│ Hasn't reached target sample/duration │ +│ → WAIT (or extend, or restart with more │ +│ power — see why-no-statsig.md) │ +└──────────────────────────────────────────────┘ +``` + +### Step 1 — Trustworthiness gate (consume the verdicts) + +Read these fields. Treat the platform's verdict as authoritative — do not reapply thresholds yourself. 
+ +| Check | Field to read | What "fail" looks like | +| ------------------------ | ------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| SRM | `live_srm_analysis` (or `exposures_cache.$srm_analysis`) | Platform flags as failing — do not compute the chi-square yourself. | +| Sufficient exposures | `live_exposures` per variant | Platform-flagged "insufficient." If unflagged but per-variant counts look suspicious, route the user to the health-check reference; do not invent a threshold. | +| Retro A/A (pre-exp bias) | `settings.preExperimentBias` enabled, then the analysis | Platform flags a significant pre-period difference. | +| Minimum elapsed time | `end_date - start_date` | Less than ~3 days regardless of sample size — interpretation is unreliable. | +| Ran for planned duration | `start_date`, `end_date`, `settings.endAfterDays`/`sampleSize`/`endCondition`, `settings.testingModel` | Frequentist: ended before reaching configured target = peeking. Sequential: early stop on significance is allowed. | +| Misconfiguration | See [references/get-experiment-fields.md](references/get-experiment-fields.md) §Misconfig | Any flagged misconfig invalidates analysis. | + +If any of these fail, **stop**. Tell the user explicitly that results are not trustworthy. Open [references/health-check-interpretation.md](references/health-check-interpretation.md) for the per-failure root-cause checklists, recommended actions, and the Kohavi framing ("SRM is the #1 trustworthiness check; Twyman's Law: any unusually clean result is more likely a bug than a discovery"). + +### Step 2 — Statistical significance with polarity + +**Critical**: `summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by whether the lift is good for the business. 
You MUST apply the polarity recipe using each metric's `direction` before declaring a winner. + +#### Polarity recipe + +`metric.direction` is `"up"` or `"down"` (defaults to `"up"` if unset on the source metric). + +- `lift is None` or `lift == 0` → **neutral**. +- `direction == "up"` → **positive** if `lift > 0`, else **negative**. +- `direction == "down"` → **positive** if `lift < 0`, else **negative**. + +A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. Never trust the bucket name as the business verdict. + +#### How to read the summary + +1. **Filter out the control row.** Use `settings.controlKey` (typically `"control"`; may be empty). Control-vs-control always has lift 0 and inflates the "no effect" count. If `controlKey` is empty, identify control by: (a) the variant literally named `"control"`, (b) the variant whose lift is uniformly 0 across all metrics, or (c) ask the user. +2. For each non-control variant, look up the metric in `summary.positive` / `summary.negative` / `summary.no`. **Trust the bucket name as the significance signal** — the `significance` field on each item may be `null` even when the bucket is meaningful. +3. Apply the polarity recipe using `metric.direction` to translate sign-of-lift into win/loss. +4. If `lift is None` in a summary item, **the calculation failed** for that variant — surface it. Do not interpret as "no effect." + +The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is set to `"bonferroni"` or `"benjamini-hochberg"` (across primaries × non-control variants). **Don't re-correct.** + +Turning the per-metric numbers into a plain-language verdict (lift + CI + p-value → "small win," "large regression," "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). 
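The polarity recipe earlier in this step can be sketched as a small function. This is a minimal illustration, not platform code: the name `polarity_verdict` and its signature are hypothetical, and it covers only the three recipe rules (neutral on a missing or zero lift, then sign-of-lift read against `direction`).

```python
from typing import Optional


def polarity_verdict(lift: Optional[float], direction: Optional[str] = "up") -> str:
    """Translate sign-of-lift into a business reading using metric.direction.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    # An unset direction defaults to "up", per the recipe.
    direction = direction or "up"
    # The recipe buckets a missing or zero lift as neutral. A None lift inside a
    # summary item still means the calculation failed, so surface it separately.
    if lift is None or lift == 0:
        return "neutral"
    if direction == "up":
        return "positive" if lift > 0 else "negative"
    if direction == "down":
        return "positive" if lift < 0 else "negative"
    raise ValueError(f"unexpected direction: {direction!r}")
```

For example, a guardrail with `direction: "down"` and a lift of `+0.08` reads as `"negative"` (a regression) even though the platform places it in `summary.positive`.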
+ +If nothing on the primaries is significant and the user is asking "why hasn't this hit statsig?", route to [references/why-no-statsig.md](references/why-no-statsig.md). + +### Step 3 — Guardrail check + +Apply the polarity recipe to every guardrail metric (`metric.type == "guardrail"`). + +- A small primary win + a clear guardrail regression → usually **iterate, do not ship**. +- "Not significant" on a guardrail does NOT mean "no regression." It means the experiment couldn't _detect_ one at the chosen confidence. If the guardrail is critical (latency, error rate, retention), flag whether it was powered to detect a meaningful regression. +- Polarity matters here too: a guardrail named "errors" with `direction: "down"` and lift `+5%` (significant) is a regression even though it lands in `summary.positive`. + +### Step 4 — Practical significance + +Statistical significance ≠ business impact. For every primary metric that won: + +1. Read the **baseline value** from the control variant: `live_metrics[metricId][controlKey].value`. +2. Read the **lift** from the winning variant's row. +3. Compute absolute lift: `baseline_value × lift`. +4. Project to population per period: ask the user for traffic estimates if not in context. + +A 5% lift on a 20% baseline metric serving 1M users/week is enormous. A 5% lift on a 0.1% baseline metric serving 1k users/week is noise. Always ground the user in absolute terms before declaring a win meaningful. + +**Twyman's Law check**: before celebrating any lift > ~30%, ask: did the treatment change who is _exposed_ to this metric, not just how they behave? See the changed-denominator notes in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). + +If `value` or `sampleSize` is `null` (common when live computation timed out), call `Run-Query` on the metric scoped to the control variant over the experiment date range to fetch the baseline. 
Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total. + +### Step 5 — Verdict + +| Situation | Recommendation | +| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** `Update-Experiment(action="decide", success=true, variant=, message=)` | +| Trust ✓, primary polarity positive, guardrail polarity negative | **ITERATE.** Investigate the regression; do not auto-ship. | +| Trust ✓, primary polarity neutral after target sample reached | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md). | +| Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)). | +| Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). | + +For **multi-variant tests**, pivot the summary by variant and evaluate each treatment independently against control. The winner is the variant with the most polarity-corrected primary wins, zero guardrail regressions, and the largest practical impact. If multiple qualify, prefer the simpler / lower-risk variant. If none qualify, recommend kill or iterate. + +`message` is required on every `decide` call — include the rationale, the metrics evaluated, and any tradeoffs accepted. + +Special variant constants when `success=true`: + +- `__no_variant_shipped__` — ship the change without picking a variant +- `__defer_variant_decision__` — defer (status becomes `SUCCESS_DEFERRED` in UI) + +For a kill, pass `success=false`. 
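The Step 5 table can be read as a lookup over a handful of yes/no answers gathered in Steps 1 to 4. The sketch below is one possible encoding, under stated assumptions: `step5_verdict` and its parameter names are hypothetical, the inputs are assumed to be already derived from the platform's verdicts (nothing is recomputed here), and WAIT is keyed on a neutral primary before the target is reached, which is one reading of the tree. Combinations the table does not list return `None` rather than a guessed verdict.

```python
from typing import Optional


def step5_verdict(
    trust_ok: bool,
    primary_polarity: str,  # "positive", "neutral", or "negative" after the polarity recipe
    guardrail_regression: bool,
    magnitude_meaningful: bool,
    target_reached: bool,
) -> Optional[str]:
    """Hypothetical encoding of the Step 5 table; returns None for unlisted cases."""
    # Any Step 1 failure blocks a decision outright.
    if not trust_ok:
        return "DO NOT DECIDE"
    if primary_polarity == "positive":
        if guardrail_regression:
            return "ITERATE"
        if magnitude_meaningful:
            return "SHIP"
        # Significant but not practically meaningful: the table has no row for this.
        return None
    if primary_polarity == "neutral":
        return "KILL or ITERATE" if target_reached else "WAIT"
    # A polarity-negative primary is not listed in the table either.
    return None
```

Returning `None` for unlisted combinations keeps the sketch from inventing a verdict the table does not state; those cases still need a human call.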
+ +--- + +## Going deeper + +Once the spine is clear, the user often asks one of these follow-ups. Open the relevant reference on demand: + +| User asks about… | Open | +| ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | +| SRM failing, Retro A/A failing, exposures insufficient, or any Step 1 fail | [references/health-check-interpretation.md](references/health-check-interpretation.md) | +| "Translate this lift / CI / p-value into English" | [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | +| "Why hasn't this hit statsig yet? Should we wait or stop?" | [references/why-no-statsig.md](references/why-no-statsig.md) | +| "Which segments should I break this down on?" | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | +| "What does this segment-by-segment result mean?" (when platform support exists) | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | +| "Can session replays help explain this result?" | [references/session-replay-analysis.md](references/session-replay-analysis.md) | +| "Which `Get-Experiment` field has X?" | [references/get-experiment-fields.md](references/get-experiment-fields.md) | + +--- + +## Output + +Default to this shape unless the user asks for something else: + +1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`. +2. **Why**, walking through the decision tree steps that mattered (skip the steps that were clearly fine). +3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, with the polarity-corrected reading of each. Include the absolute-impact translation for any win. +4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, etc. +5. 
**Suggested next action** — the `Update-Experiment` call to make, or the deeper investigation to run. + +If `Get-Experiment` is unavailable or returns errors, say so — do not invent a verdict. + +--- + +## Common pitfalls (cheat sheet) + +- ⛔ **Skipping Step 1** because the lifts look exciting (Twyman's Law) +- ⛔ **Recomputing thresholds** instead of consuming the verdicts the platform already returned +- ⛔ **Not applying polarity** — reading `summary.positive` as "good" without checking `metric.direction` +- ⛔ Trusting a >30% lift without checking whether the **denominator changed** +- ⛔ **Including the control row** when counting wins/losses (filter by `settings.controlKey`) +- ⛔ Treating a `null` lift as "no effect" — it means computation failed +- ⛔ Treating a missing primary (in `metrics[]` but not in `live_metrics`/`results_cache.metrics`) as "no effect" — it's "no measurement" +- ⛔ Interpreting a `< 3 day` experiment instead of refusing +- ⛔ Forgetting to call out a **non-default `confidenceLevel`** (0.9 inflates false positives; 0.99 is conservative) +- ⛔ Treating **secondary-metric significance** as decisional (it isn't, ever) +- ⛔ Conflating **statistical significance** with **practical significance** +- ⛔ Ignoring **guardrail regressions** because the primary won +- ⛔ Calling a single significant primary with multiple-testing correction off a "win" — look at the aggregate, or enable correction +- ⛔ Concluding "no effect" from an underpowered inconclusive result (route to [references/why-no-statsig.md](references/why-no-statsig.md)) diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/README.md b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/README.md new file mode 100644 index 0000000..71278d6 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/README.md @@ -0,0 +1,34 @@ +# Eval fixtures — `experiment-results` + +Each fixture is a self-contained prompt + expected-behavior pair for the `experiment-results` 
skill. They are seeded from PRD customer quotes — the customer pain that motivated this skill in the first place. + +The fixtures are not auto-runnable yet (no harness lives in this repo). They're written for two uses: + +1. **Manual rehearsal** — a human (or another agent) can read the prompt, simulate the response the skill should produce, and check it against the `expected_behavior` field. +2. **Regression checkpoint when a runner exists** — when an eval harness is added in this repo, these prompts plug in directly: each YAML doc becomes one case, the `expected_behavior` field becomes the grader rubric. + +When you change `SKILL.md`, walk these fixtures and confirm each one still produces the expected behavior. If a fixture starts failing, decide whether the skill regressed or the fixture itself needs updating. + +--- + +## Fixtures + +| Fixture | PRD source quote | What it exercises | +| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------- | +| `pelando-plus-2-others.yaml` | Pelando — _"+2 others"_ (results too noisy for the user to triage which results to act on) | Decision tree spine + per-metric polarity; ship/iterate verdict against multi-variant noise. | +| `confetti-8-metrics.yaml` | Confetti — _"8 metrics for new visitors"_ (many primaries; user wants segment-of-interest selection on new vs returning) | Segment-of-interest selection; multiple-testing correction warning; per-metric interpretation. | +| `polarsteps-no-workaround.yaml` | Polarsteps — _"no documented workaround"_ (user wants to understand SRM failure with no canned path forward) | Health-check interpretation; Kohavi framing; ordered-causes recommendation. 
| + +Each YAML doc has the same shape: + +```yaml +name: +prd_source: +trigger_phrase: +get_experiment_summary: +expected_behavior: + verdict: + must_mention: [] + must_not_do: [] + references_consulted: [] +``` diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/confetti-8-metrics.yaml b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/confetti-8-metrics.yaml new file mode 100644 index 0000000..da61d9e --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/confetti-8-metrics.yaml @@ -0,0 +1,48 @@ +name: confetti-8-metrics +prd_source: | + Confetti — "8 metrics for new visitors" + Customer is running an experiment with 8 primary-ish metrics and explicitly + cares about new-visitor behavior. They want a segment-driven read, not a + dump of 8 lifts. The skill should pre-commit to segments tied to the + hypothesis (new vs returning), call out the multiple-testing concern with + 8 metrics, and produce a verdict scoped to the segment that matters. + +trigger_phrase: | + We're tracking 8 metrics on this onboarding redesign experiment and I really + care about how new visitors respond. Can you read this and tell me whether + it's a ship for the new-user audience? + +get_experiment_summary: + hypothesis: | + If we redesign the first-session onboarding flow, then activation rate + among NEW visitors will increase by ≥5% relative, because reducing + cold-start friction shortens time-to-first-value. + settings: + controlKey: "control" + multipleTestingCorrection: "off" # mis-configured given 8 primaries + testingModel: "sequential" + confidenceLevel: 0.95 + metrics_count: 8 + primary_metrics_summary: | + Of 8 primaries: 2 significant positive (polarity-correct), 1 significant + negative (a "Time to First Action" metric with direction=down where + lift is -7% — actually a WIN once polarity-applied), 5 inconclusive. 
+ +expected_behavior: + verdict: WAIT + must_mention: + - "Pre-commit to the new-vs-returning segment because the hypothesis names new visitors as the cohort that matters" + - "Recommend at most 3–5 segments and call new vs returning the primary slice" + - "Multiple-testing correction is OFF but there are 8 primaries — flag the inflated family-wise FPR explicitly (rough order: with 8 primaries × 1 variant at α=0.05, family-wise FPR is high enough to make a single significant result inconclusive on its own)" + - "Apply polarity recipe per metric — flag the Time to First Action 'negative bucket' as a win because direction=down" + - "Without correction enabled, don't ship on a single significant primary; either enable correction and re-analyze or look at the aggregate of all 8" + - "Verdict is WAIT (re-analyze with multiple-testing correction enabled, segmented to new visitors) — not SHIP" + must_not_do: + - "Slice by every available property after the fact (the fishing-expedition warning)" + - "Treat the 'Time to First Action' metric in the negative bucket as a loss without polarity-correcting" + - "Call the experiment a ship because 2 of 8 primaries are significant positive" + - "Pretend the agent can compute the corrected p-values itself — instead, recommend re-running with multipleTestingCorrection enabled" + references_consulted: + - segment-of-interest-selection.md + - per-metric-interpretation.md + - health-check-interpretation.md # for the misconfig flag diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/pelando-plus-2-others.yaml b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/pelando-plus-2-others.yaml new file mode 100644 index 0000000..f634236 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/pelando-plus-2-others.yaml @@ -0,0 +1,79 @@ +name: pelando-plus-2-others +prd_source: | + Pelando — "+2 others" + Customer reported that when a multi-variant test concludes with a winner banner + plus a small-print 
"+2 others", they cannot tell which non-winner variants are + benign vs which contain a guardrail regression they need to act on. The skill + should pivot the summary per variant, polarity-correct each, and call out the + losers, not gloss over them. + +trigger_phrase: | + Can you make sense of this experiment for me? The UI shows treatment_a winning + on the primary plus "+2 others" but I have no idea whether treatment_b or + treatment_c are okay to ignore. + +get_experiment_summary: + settings: + controlKey: "control" + multipleTestingCorrection: "benjamini-hochberg" + testingModel: "sequential" + metrics: + - id: m_primary + type: primary + direction: up + name: "Activation Rate" + - id: m_guardrail_latency + type: guardrail + direction: down + name: "p95 Latency (ms)" + - id: m_guardrail_errors + type: guardrail + direction: down + name: "Error Rate" + live_exposures: + control: 41123 + treatment_a: 40987 + treatment_b: 41210 + treatment_c: 40755 + live_srm_analysis: + # platform-flagged passing + p_value: 0.42 + summary: + positive: + - { + metricId: m_primary, + variant: treatment_a, + lift: 0.041, + liftConfidence: 0.95, + } + - { + metricId: m_guardrail_latency, + variant: treatment_b, + lift: 0.08, + liftConfidence: 0.95, + } + negative: + - { + metricId: m_primary, + variant: treatment_c, + lift: -0.022, + liftConfidence: 0.95, + } + no: + - { metricId: m_primary, variant: treatment_b, lift: 0.004 } + +expected_behavior: + verdict: ITERATE + must_mention: + - "Pivot the summary by variant before declaring a winner" + - "treatment_a wins on the primary but treatment_b shows a latency regression once polarity is applied (direction=down + lift +8% = bad)" + - "treatment_c regresses the primary" + - "Multi-variant verdict requires each treatment to be judged independently against control" + - "Recommend iterate, not ship — at minimum, do not ship treatment_b, and investigate treatment_c before re-running" + must_not_do: + - "Quietly drop treatment_b and 
treatment_c into '+2 others' without polarity-checking each" + - "Trust the bucket name (positive/negative) as the business verdict" + - "Re-apply multiple-testing correction on top of the platform's benjamini-hochberg" + references_consulted: + - per-metric-interpretation.md + - get-experiment-fields.md diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/polarsteps-no-workaround.yaml b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/polarsteps-no-workaround.yaml new file mode 100644 index 0000000..325a3bf --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/polarsteps-no-workaround.yaml @@ -0,0 +1,61 @@ +name: polarsteps-no-workaround +prd_source: | + Polarsteps — "no documented workaround" + Customer's experiment is failing SRM and they cannot find a documented path + forward. The skill should consume the platform's SRM verdict (not recompute + chi-square), cite Kohavi ("SRM is the #1 trustworthiness check"), and + surface ordered likely causes plus a specific recommended action — not + punt with "investigate further." + +trigger_phrase: | + My experiment is failing SRM and the result lift looks too good to be true + (+18% on the primary). The docs just say "investigate" — what does that + actually mean here? Should I trust the lift? 
+ +get_experiment_summary: + settings: + controlKey: "control" + srm: + enabled: true + targetAllocations: { control: 50, treatment: 50 } + excludeQA: false # potentially relevant + live_exposures: + control: 18250 + treatment: 22980 + live_srm_analysis: + # platform-flagged FAILING + p_value: 0.00002 + chi_square: 18.4 + summary: + positive: + - { + metricId: m_primary, + variant: treatment, + lift: 0.18, + liftConfidence: 0.95, + } + metrics: + - id: m_primary + type: primary + direction: up + name: "Trip Plan Created" + +expected_behavior: + verdict: DO_NOT_DECIDE + must_mention: + - "SRM is failing per the platform's verdict — do NOT trust the +18% lift" + - "Cite Kohavi: SRM is the #1 trustworthiness check; when SRM is failing, lift, p-values, and confidence intervals cannot be attributed to the treatment" + - "Twyman's Law: a +18% lift on a failing-SRM experiment is more likely a bucketing bug than a genuine win" + - "Likely causes ordered most → least likely: bucketing_bug, biased_assignment, bot_traffic, exposure_tracking_bug, ramp_up_timing" + - "Recommended action: pause_and_investigate — pause before drawing conclusions; randomization assumption is broken" + - "Concrete next steps: compare live_exposures to targetAllocations; check feature-flag rules and history via Get-Feature-Flag; Run-Query $experiment_started by variant; enable settings.excludeQA before relaunch given it is currently off" + - "Do NOT recompute the SRM chi-square — consume the platform's verdict" + - "Restart with fixed bucketing once the cause is found; do NOT re-conclude on the broken data" + must_not_do: + - "Calculate the chi-square or re-derive an SRM p-value threshold" + - "Recommend shipping or treating the +18% lift as real" + - "Hand the user a generic 'investigate further' without ordered causes and an action" + - "Skip Kohavi framing — it's the whole reason this check is the #1 gate" + references_consulted: + - health-check-interpretation.md + - get-experiment-fields.md 
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/get-experiment-fields.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/get-experiment-fields.md new file mode 100644 index 0000000..efaeae5 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/get-experiment-fields.md @@ -0,0 +1,161 @@ +# `Get-Experiment` Field Map + +Quick reference for which `Get-Experiment` response field drives each interpretation. Always call with `compute_exposures=true, compute_metrics=true`. + +This reference is **read-only domain knowledge** for the agent. It does NOT define thresholds — every "fail condition" listed below is a _characterization_ of how the platform itself already classifies the field, not a threshold this skill should re-apply. + +--- + +## Identity & lifecycle + +``` +id, name, description, hypothesis, status, start_date, end_date +creator_email, tags, url, workspace_id +feature_flag_id → for feature-flag-based experiments +settings.controlKey → variant key treated as control (often "control"; may be "") +``` + +`status` is one of `"concluded" | "success" | "fail"` (the UI may additionally show `SUCCESS_DEFERRED` for the special variant constant — see "Decision metadata" below). 
+ +--- + +## Trustworthiness + +``` +live_srm_analysis → SRM verdict (consume — don't recompute) + .p_value + .chi_square +live_exposures[] → per-variant exposure counts (live) +exposures_cache[] → per-variant exposure counts (cached fallback) +exposures_cache.$srm_analysis → cached SRM analysis +exposures_cache.$last_computed → when the cache was last refreshed +settings.srm.enabled → whether the SRM check ran +settings.srm.targetAllocations → expected per-variant allocation (percent) +settings.preExperimentBias → whether Retro A/A was enabled +settings.excludeQA → whether QA traffic was filtered +live_results_errors → non-null = live computation failed; surface and fall back to cache +``` + +--- + +## Per-metric per-variant results + +``` +live_metrics[][] + .value → metric value for this variant + .sampleSize → sample size for this variant on this metric + .lift → (treatment - control) / control (0 for control row) + .liftConfidence → confidence LEVEL used (e.g. 0.95) — NOT the CI width + .significance → "YES_POSITIVE" | "YES_NEGATIVE" | "NO" (sign-of-lift, NOT polarity) + +results_cache.metrics[][] → cached fallback, same shape +``` + +--- + +## Bucketed summary + +``` +results_cache.summary.positive[] → items with significance == "YES_POSITIVE" (lift > 0, sig) +results_cache.summary.negative[] → items with significance == "YES_NEGATIVE" (lift < 0, sig) +results_cache.summary.no[] → items with significance == "NO" + +Each item: + .metricId + .variant + .value + .lift + .liftConfidence + .sampleSize + .significance +``` + +**Pre-process the summary**: filter rows where `variant == settings.controlKey` (control-vs-control is mechanical noise), then apply the polarity recipe before drawing any conclusion. 
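
The pre-processing step above can be sketched as follows. This is a minimal illustration, assuming the `Get-Experiment` response has already been parsed into plain Python dicts shaped like this field map. The helper names (`business_polarity`, `preprocess_summary`) and the sample payload are hypothetical; the polarity rule is the one stated in per-metric-interpretation.md (`direction` is `"up"` or `"down"`, and a `None` or zero lift is neutral).

```python
from typing import Optional


def business_polarity(lift: Optional[float], direction: str = "up") -> str:
    """Translate sign-of-lift into business polarity using metric.direction."""
    if lift is None or lift == 0:
        return "neutral"
    if direction == "down":
        return "positive" if lift < 0 else "negative"
    return "positive" if lift > 0 else "negative"


def preprocess_summary(experiment: dict) -> list:
    """Drop control rows, then tag each remaining row with metric type and polarity."""
    control_key = experiment["settings"]["controlKey"]
    catalog = {m["id"]: m for m in experiment["metrics"]}
    summary = experiment["results_cache"]["summary"]
    rows = []
    for bucket in ("positive", "negative", "no"):
        for item in summary.get(bucket) or []:
            if item["variant"] == control_key:
                continue  # control-vs-control rows carry no information
            metric = catalog.get(item["metricId"], {})
            rows.append({
                **item,
                "bucket": bucket,  # sign-of-lift only, never the business verdict
                "type": metric.get("type"),
                "polarity": business_polarity(
                    item.get("lift"), metric.get("direction") or "up"
                ),
            })
    return rows


# Hypothetical payload, for illustration only.
example = {
    "settings": {"controlKey": "control"},
    "metrics": [{"id": "m_interstitials", "type": "guardrail", "direction": "down"}],
    "results_cache": {"summary": {
        "positive": [],
        "negative": [{"metricId": "m_interstitials", "variant": "treatment", "lift": -0.01}],
        "no": [{"metricId": "m_interstitials", "variant": "control", "lift": 0}],
    }},
}
rows = preprocess_summary(example)
print(rows)
```

In this sample the single surviving row sits in the `negative` bucket but resolves to a positive polarity, because the metric's goal direction is `"down"`.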
+ +--- + +## Metric catalog (for polarity lookups) + +``` +metrics[] + .id, .name + .type ("primary" | "guardrail" | "secondary") + .direction ("up" | "down") → always set; defaults to "up" if the source metric was unset +``` + +Build a lookup `metric_id → (type, direction)` and join to summary rows during interpretation. + +--- + +## Settings that change interpretation + +``` +settings.confidenceLevel → significance threshold (e.g. 0.95) +settings.testingModel → "frequentist" or "sequential" +settings.endCondition → "sample_size" or "days" +settings.sampleSize / .endAfterDays → planned end target +settings.multipleTestingCorrection → "off" | "bonferroni" | "benjamini-hochberg" +settings.cuped.enabled → CUPED variance reduction applied +settings.cuped.preExposureDatePreset → pre-exposure window +settings.winsorization.enabled → outlier capping applied +settings.winsorization.percentile → cap percentile (default 95; lower values are extreme) +``` + +--- + +## Decision metadata (post-decide) + +``` +results_cache.message → decision rationale +results_cache.variant → shipped variant key (or special constant) +status → "concluded" | "success" | "fail" +``` + +Special variant constants for `success=true`: + +- `__no_variant_shipped__` — ship the change without picking a variant. +- `__defer_variant_decision__` — defer (UI shows `SUCCESS_DEFERRED`). + +For a kill, pass `success=false`. + +--- + +## Lifecycle hand-off + +``` +Update-Experiment( + experiment_id=, + experiment={ + "action": "decide", + "success": true | false, + "variant": "", # required when success=true + "message": "" + } +) +``` + +`message` is required on every `decide` call. + +--- + +## Misconfig field map (cross-link) + +For _how_ to react to each of these, see [health-check-interpretation.md](health-check-interpretation.md) §7. 
+ +- `settings.multipleTestingCorrection in {"off", null}` with 2+ primaries × 1+ non-control variants +- `settings.winsorization.enabled == true` with `percentile` very low (< ~80) or very high (> ~99) +- `settings.srm == null` OR `settings.srm.enabled == false` (often intentional — only flag if results look suspicious) +- `settings.cuped.enabled == true` AND the experiment cohort is "new users only" +- `settings.confidenceLevel != 0.95` +- `metrics[]` entries with `name == ""` +- A primary metric in `metrics[]` but missing from `live_metrics` AND `results_cache.metrics` + +--- + +## When to reach for sibling tools + +- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the setup-side skill / `Get-Experiment-Setup-Guidance`. +- **Raw data for triggered or segmentation analysis** → `Run-Query` on the metric with appropriate filters. +- **Acting on the recommendation** (ship, kill, extend) → `Update-Experiment` with the appropriate action. +- **Feature-flag rollout history** for SRM root cause → `Get-Feature-Flag`. +- **Session replays** for behavioral explanation of a quantitative result → the replay-fetch tool (see [session-replay-analysis.md](session-replay-analysis.md)). diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md new file mode 100644 index 0000000..4471219 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md @@ -0,0 +1,158 @@ +# Health-Check Interpretation + +Open this when Step 1 of the Decision Tree flags a failure (SRM, Retro A/A, insufficient exposures, peeking, broken-data, < 3-day window, or any misconfiguration). The goal is to turn the platform's already-computed verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action. 
+ +**This skill never recomputes thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers. + +--- + +## Kohavi framing — always cite when a health check fails + +> **Sample Ratio Mismatch is the #1 trustworthiness check (Kohavi).** When SRM is failing, do not trust the experiment's lift, p-values, or confidence intervals — the randomization assumption is broken, so the measured effect cannot be attributed to the treatment. +> +> **Twyman's Law**: any unusually clean or unusually large result is more likely a bug than a discovery. A spectacular lift on a failing-SRM experiment is not evidence of a great treatment; it's evidence the bucketing is broken. + +These two principles drive the recommendations below. Lead with them when explaining a failing check to the user. + +--- + +## 1. SRM (Sample Ratio Mismatch) + +**Verdict to consume**: `live_srm_analysis` (or `exposures_cache.$srm_analysis`). The platform tags failing SRMs already; do not compute chi-square yourself. + +### What it means + +Users were assigned to variants in proportions that disagree with the configured `settings.srm.targetAllocations`. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness. + +### Likely causes, ordered most → least likely + +(Surface in this order — investigate the most probable first.) + +1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees. +2. **biased_assignment** — The assignment criterion correlates with the variant — e.g. assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window. +3. 
**bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation. +4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the `$experiment_started` event fires exactly once per user per variant assignment. +5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period. + +### Recommended actions + +- **pause_and_investigate** — Pause the experiment before drawing any conclusions. SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable. +- **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric. +- **investigate_exposure_logging** — Inspect `$experiment_started` event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs. +- **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split. + +### Investigation checklist + +1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented? +2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Use `Get-Feature-Flag` to inspect rollout rules and history. +3. For multi-variant tests, the platform's SRM threshold is Bonferroni-corrected — the effective per-variant threshold may be tighter than the headline. Trust the bucket flag, not raw p-value math. +4. 
Verify SDK version and bucketing logic. `Run-Query` for `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly. +5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it. +6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting. +7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.** + +--- + +## 2. Retro A/A (pre-experiment bias) failure + +**Verdict to consume**: the analysis the platform attached when `settings.preExperimentBias` is enabled. + +### What it means + +The same statistical comparison run on the **pre-exposure** period revealed that variant cohorts already differed _before_ the treatment started. Any "lift" measured during the experiment may just be reflecting that pre-existing gap, not the change. + +- Pre-experiment bias on a **primary** metric is a **stop-and-investigate** signal. +- Pre-experiment bias on a **secondary** metric is informational only. + +### Investigation checklist + +1. Identify which metric × variant pair triggered the failure (after the platform's correction). +2. Check whether bucketing was deterministic — non-deterministic assignment in the pre-period means users were assigned to different variants than they would have been in production. +3. Look for cohort skew: did one variant disproportionately receive heavy users? `Run-Query` on the metric pre-experiment grouped by variant to confirm. +4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort. +5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing. + +--- + +## 3. 
Insufficient exposures + +**Verdict to consume**: `live_exposures` per variant, plus any platform-attached "insufficient" flag. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue. + +### Investigation checklist + +1. Check `live_exposures` totals — which variant is undersampled? +2. Inspect feature-flag rollout: `Get-Feature-Flag` → was rollout dialed back? +3. `Run-Query` for the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?). +4. If the experiment is still ACTIVE: extend duration via `Update-Experiment` with `endAfterDays`. +5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math. + +If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question. + +--- + +## 4. Frequentist peeking + +**Verdict to consume**: `settings.testingModel == "frequentist"`, plus `end_date` vs `start_date + endAfterDays` (or `sampleSize` vs `live_exposures.$overall`, depending on `settings.endCondition`). + +### What it means + +A frequentist test that ends before reaching its configured target has an **inflated false-positive rate**. The math assumes a fixed sample size; peeking before that point and stopping on a favorable look is exactly what "p-hacking" looks like in production. + +### Investigation checklist + +1. Confirm `settings.testingModel == "frequentist"`. +2. Compare `end_date` against `start_date + endAfterDays` (or whether `sampleSize` was reached, whichever is the configured `endCondition`). +3. If the conclusion was premature: results have inflated false-positive rate. Recommend a re-run. +4. If the user wants to keep current results: caveat strongly. Recommend `testingModel: "sequential"` for the next experiment so they can stop early without penalty. 
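
The comparison in checklist steps 1 and 2 can be sketched as below. This is a rough illustration rather than platform logic: it assumes `start_date` and `end_date` have already been parsed into `datetime.date` values (the wire format is not specified here) and that `total_exposures` was read from `live_exposures.$overall`. The function name `ended_before_target` is hypothetical.

```python
from datetime import date, timedelta


def ended_before_target(settings: dict, start_date: date, end_date: date,
                        total_exposures: int) -> bool:
    """Return True when a frequentist experiment stopped short of its configured target."""
    if settings.get("testingModel") != "frequentist":
        return False  # sequential tests may stop early by design
    if settings.get("endCondition") == "days":
        planned_end = start_date + timedelta(days=settings["endAfterDays"])
        return end_date < planned_end
    # Otherwise treat the end condition as "sample_size".
    return total_exposures < settings["sampleSize"]


# Hypothetical values: a 14-day frequentist test concluded after 7 days.
peeked = ended_before_target(
    {"testingModel": "frequentist", "endCondition": "days", "endAfterDays": 14},
    start_date=date(2026, 1, 1),
    end_date=date(2026, 1, 8),
    total_exposures=40000,
)
print(peeked)
```

A `True` result only says the configured target was not reached; the caveat wording and the re-run recommendation still come from the checklist above.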
+ +(Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.) + +--- + +## 5. Live computation timeout / broken data + +**Verdict to consume**: `live_results_errors` non-null with `live_*` fields null. + +### Investigation checklist + +1. Try `Get-Experiment` again — transient backend load may resolve. Wait ~30s between retries. +2. If repeated failures: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. +3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision. +4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation. + +--- + +## 6. Experiment ran < 3 days + +**Verdict to compute (this one is local)**: `end_date - start_date`. + +Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. **Refuse to interpret.** Tell the user explicitly: + +> _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_ + +If `endCondition: "sample_size"` with a tiny target (e.g. 10k) was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window. + +--- + +## 7. Misconfigurations to flag during Step 1 + +These don't always invalidate results, but they change how to _read_ them. Surface them as warnings. 
+ +- `settings.multipleTestingCorrection in {"off", null}` AND there are 2+ primary metrics across 1+ non-control variants → without correction, any single significant primary may be a false positive. **Don't assume the result is broken** — look at all primary results in aggregate. If most or all primaries point the same direction (all positive or all negative), there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk**, and the user can enable correction (Benjamini-Hochberg or Bonferroni) and re-analyze. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). +- `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` very low (e.g. < ~80) or unusually high (e.g. > ~99) → extreme outlier capping. The platform's default is 95; a percentile near 50 caps almost all data and likely indicates misconfiguration. +- `settings.srm == null` OR `settings.srm.enabled == false` → the SRM check didn't run. **SRM is often deliberately disabled** (e.g. when feature-flag rollouts intentionally split traffic unevenly), so do not try to compute it yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) — then suggest the user re-enable SRM and re-analyze. +- `settings.cuped.enabled == true` AND the experiment cohort is "new users only" → CUPED requires pre-exposure data, which new-user experiments lack, so CUPED simply has no effect. **This does NOT invalidate results** — variance reduction just didn't happen. Mention it as informational. +- `settings.confidenceLevel != 0.95` → call out explicitly. `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Combine with metric count for a sense of family-wise error rate. 
+- `metrics[]` contains entries with `name == ""` → likely a broken or placeholder metric reference. Flag and skip during analysis. +- A primary metric appears in `metrics[]` but is **missing from `live_metrics` AND `results_cache.metrics`** → no result was computed for that primary. Surface prominently — this is "no measurement," not "no effect." Recommend the user re-sync results. + +--- + +## Output shape when a health check fails + +1. **What failed**, in one sentence (use the verdict the platform attached — do not re-derive). +2. **What that means for trust** — cite the Kohavi framing (SRM is #1) or Twyman's Law where it fits. +3. **Likely causes**, ordered most → least probable. +4. **Recommended action** from the small set above. +5. **Investigation checklist** the user can run. +6. **What NOT to do** — usually, "do not act on the current lift / p-value numbers." diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md new file mode 100644 index 0000000..3b44385 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md @@ -0,0 +1,188 @@ +# Per-Metric Interpretation + +Open this when the user wants you to translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_ + +**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from `Get-Experiment`. Then translate. + +--- + +## The mental model + +Each row in `summary.positive` / `summary.negative` / `summary.no` answers four questions: + +1. **Did the lift go up or down?** — the `summary` bucket name (sign-of-lift, not polarity). +2. 
**Was the change distinguishable from noise?** — the `significance` field (or the bucket name itself: rows in `summary.positive` / `summary.negative` are significant, rows in `summary.no` are not). +3. **Was the change in the goal direction?** — apply the polarity recipe with `metric.direction`. +4. **Was the change big enough to matter?** — multiply `lift` by the control baseline `value` to get absolute impact, then judge against business context. + +A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any one of those and you're shipping the wrong thing. + +--- + +## Polarity recipe (repeat from the spine — critical) + +`metric.direction` is `"up"` or `"down"` (defaults to `"up"`). + +- `lift is None` or `lift == 0` → **neutral** (treat as no measurement / no effect respectively). +- `direction == "up"` → **positive** if `lift > 0`, else **negative**. +- `direction == "down"` → **positive** if `lift < 0`, else **negative**. + +A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. A `-1% interstitials_shown` lift in `summary.negative` with `direction: "down"` is plausibly a **win** (less interruption). + +--- + +## Reading the p-value correctly + +The p-value is the probability of observing a difference at least as extreme as the one measured, **assuming the null hypothesis (no real difference) is true**. It is NOT: + +- ❌ The probability that the treatment works. +- ❌ The probability the result will replicate. +- ❌ A measure of effect size — a tiny lift can be highly significant on a huge sample. +- ❌ Proof of "no effect" when above threshold (see "Inconclusive results"). + +Mixpanel uses Welch's t-test (z-test for large samples). Default α = 0.05 at 95% confidence. The confidence level is set on `settings.confidenceLevel`. If it differs from 0.95, call it out in the verdict (`0.9` inflates false positives; `0.99` is conservative). 
+ +--- + +## Reading the lift correctly + +``` +lift = (treatment_mean - control_mean) / control_mean +``` + +- `liftConfidence` is the **confidence level used** (e.g. 0.95). It is NOT the confidence-interval width. +- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct. +- If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect." + +--- + +## Verdict phrasing — a small palette + +Pick the phrase that matches the four-question pattern. These are the words to use with users; they map onto the platform's already-computed numbers, so the agent never has to invent thresholds. + +| Pattern (sig × polarity × magnitude) | Plain-language verdict | +| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Significant, polarity positive, magnitude large vs baseline | "**Clear win** — `` moved `` in the goal direction, which is meaningful at this baseline." (apply Twyman's Law if lift > ~30%) | +| Significant, polarity positive, magnitude small vs baseline | "**Statistically significant but practically small** — `` on a `` baseline is ``; confirm with the user whether that clears the business bar." | +| Significant, polarity negative | "**Regression** — `` moved `` against its goal direction. This is a reason not to ship even if other primaries won." | +| Not significant, lift in goal direction, well-powered | "**Likely no effect at the detectable size.** The experiment had enough power to detect ``; the observed lift is below that threshold." 
| +| Not significant, lift in goal direction, underpowered | "**Inconclusive — too underpowered to call.** Route to the why-no-statsig playbook to decide between wait / extend / restart." | +| Not significant, lift in wrong direction | "**No detectable harm**, but no win either." | +| `lift is None` | "**No measurement** — this variant's row failed to compute. Surface the failure and re-sync." | +| Lift > ~30% on any metric | Prefix with "**Twyman's Law check:** that lift is unusually large; verify the denominator hasn't changed before celebrating." | + +--- + +## Magnitude — make it absolute + +Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful: + +1. Baseline from the control variant: `live_metrics[metricId][controlKey].value` (or the `summary.no` row where `variant == controlKey`). +2. Lift from the winning row. +3. Absolute lift: `baseline_value × lift`. Examples: + - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate. + - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`. +4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week." + +### Fallback when `value` / `sampleSize` are null + +Common — happens whenever live computation timed out or `results_cache.metrics` was nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.** + +Call `Run-Query` on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation: + +- `unique` (Bernoulli) → conversion **rate** as the baseline. +- `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. 
Multiplying lift by a raw total double-counts cohort size. + +--- + +## Twyman's Law in practice — changed-denominator lifts + +Before celebrating any lift > ~30%, ask: **did the treatment change who is _exposed_ to this metric, not just how they behave?** + +If the treatment causes more users to _see_ a screen, more events naturally fire — the metric grows because the denominator changed, not because per-user behavior changed. + +- A "Free item" promotion drives more users to checkout → "Checkout Screen Viewed" lifts +1000% mechanically. The interesting question is **conversion rate on the screen**, not raw views. +- A new banner makes a feature discoverable → "Feature Page Viewed" lifts dramatically. **Per-discover-er behavior** may be unchanged. + +When you see a > 30% lift, name the risk explicitly: + +> _"This metric measures exposure to the screen/event. The treatment likely caused more users to be exposed; that explains most of the lift mechanically. The interesting question is what those users did once they got there."_ + +--- + +## Metric distribution types + +Different metric types behave differently; cite the relevant nuance in your verdict. + +| Metric type | Distribution | Interpretation nuance | +| -------------------------------- | ------------ | --------------------------------------------------------------------------------------------------------- | +| Unique users / conversion rate | Bernoulli | Variance = `p(1−p)`. Lift on rates near 50% is most powered; rates near 0% or 100% need much more sample. | +| Event counts / sessions per user | Poisson | Variance = mean. Highly sensitive to power users; consider whether one heavy user can swing results. | +| Revenue / numeric properties | Gaussian | Long tails (whales) inflate variance. Strongly consider Winsorization. 
| + +--- + +## Variance-reduction & outlier settings that change interpretation + +- **CUPED** (`settings.cuped.enabled == true`): mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). +- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration (see `health-check-interpretation.md` §Misconfig). + +--- + +## Multiple comparisons & metric tiers — what's decisional and what isn't + +| Tier | How it influences the verdict | +| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Primary** | **Decisional.** The platform auto-applies correction when `multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` (across primaries × variants). | +| **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude. | +| **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. | + +If `settings.multipleTestingCorrection` is `"off"` AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled. 
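
To make the false-positive risk of uncorrected primaries concrete, here is a small worked calculation. It assumes the comparisons are independent, which is a simplification (correlated metrics usually give a lower rate), and the function name is hypothetical. Under that assumption the chance of at least one false positive across `m` comparisons at level `α` is `1 - (1 - α)^m`, where `m` is primaries × non-control variants.

```python
def family_wise_error_rate(comparisons: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent comparisons."""
    return 1.0 - (1.0 - alpha) ** comparisons


# 15 primaries x 1 non-control variant at alpha = 0.05.
fwer_15 = family_wise_error_rate(15)
print(f"{fwer_15:.1%}")  # about 54%

# 2 primaries x 1 non-control variant, the smallest case the rule above covers.
fwer_2 = family_wise_error_rate(2)
print(f"{fwer_2:.1%}")
```

This is only meant to explain why a lone significant primary among many deserves caution when correction is off; it is not a threshold for the agent to apply.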
+ +--- + +## "Significance = NO" does NOT mean "no effect" + +A row in `summary.no` means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level. **Important when the user is about to call something a null result.** + +Options to suggest when a primary metric lands in `summary.no`: + +1. **Extend duration** (if the experiment is still ACTIVE). +2. **Increase traffic allocation** (if there's headroom — never mid-Frequentist-test, which invalidates SRM). +3. **Use Sequential testing model** for the next experiment if continuous monitoring fits. +4. **Enable CUPED** if the metric correlates with pre-exposure behavior. +5. **Narrow the hypothesis** — test a stronger version, or scope to a more responsive segment. +6. **Accept the null** — if the experiment was well-powered for the MDE that matters, "no effect" is a real finding. + +For the full "why hasn't this hit statsig yet" walk-through, see [why-no-statsig.md](why-no-statsig.md). + +--- + +## Frequentist vs Sequential — what affects per-metric reading + +Check `settings.testingModel`: + +- `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration. +- `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended. + +Calling `Update-Experiment(action="conclude")` on a Frequentist experiment that hasn't reached its target is a peeking event. Flag it in the verdict. + +--- + +## Triggered analysis & dilution + +If the change only affects a subset of users (e.g. only triggers when a specific button is shown), the **effect on triggered users** is much larger than the **effect on the full exposed population**. 
+ +- Triggered analysis zooms in on users who actually saw the change. +- Dilution math: `population_lift = triggered_lift × (triggered_users / total_exposed)`. + +The platform doesn't auto-compute triggered analysis. If the change is gated by a condition, ask the user about the trigger rate and walk through the math before declaring the population-level lift "small." + +--- + +## Novelty and primacy + +- **Novelty** — lift is large early, then decays as users habituate. +- **Primacy** — lift is small or negative early, then grows as users learn the new behavior. + +To detect either, look at the line-chart view of the metric (date-segmented). A monotonic decay from day 1 → day 14 is classic novelty; the steady-state lift is what matters for shipping. Call this out when interpreting any experiment shorter than ~2 weeks. diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md new file mode 100644 index 0000000..6877d2a --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md @@ -0,0 +1,95 @@ +# Segment-Breakdown Interpretation + +Open this when the user has per-segment results in hand and wants to read them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. + +> **Platform support status.** Reading segment-level experiment results in `Get-Experiment` depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment `Run-Query` calls against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. 
If `Get-Experiment` doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the `Run-Query` fallback — do not invent per-segment significance verdicts. + +--- + +## The mental model + +A segment breakdown asks: _did the treatment affect different user segments differently?_ It has three possible outcomes per segment: + +1. **The segment moved in the same direction as the overall effect**, with similar magnitude → reinforces the overall verdict; nothing new. +2. **The segment moved much more or less than overall**, but in the same direction → heterogeneity; the effect is concentrated in a subset. +3. **The segment moved in the _opposite_ direction** to overall → Simpson's paradox or a real reversal — this is where segment analysis earns its keep. + +Reading a segment breakdown well means recognizing which of those three you're looking at and not mistaking noise for any of them. + +--- + +## Per-segment polarity recipe — apply per row + +The same recipe from the per-metric reference applies _inside_ each segment. Don't take a shortcut. + +- For each segment × metric × non-control variant, look at the row's `lift` and bucket (positive/negative/no). +- Translate sign-of-lift into business polarity using `metric.direction`. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary. +- Filter out the control row in each segment. + +Surprisingly easy to forget when you're scanning a wide table — re-apply polarity per row. + +--- + +## Sample-size floor per segment + +Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. As a rule of thumb, the same ~350-per-variant floor used for overall trustworthiness applies per segment. + +- Segments below the floor → mark "insufficient sample, treat as directional only." +- A "significant" lift on a 50-user-per-variant segment is almost always noise. Say so. 
+- If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice. + +--- + +## Heterogeneity vs Simpson's paradox vs noise + +| What you see | Interpretation | +| --------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Most segments lift positive, one or two negative, all with overlapping CIs | **Noise.** Not heterogeneity. Don't ship a segment-specific story. | +| One segment lifts much more than the rest, with a tight CI and a clear mechanism | **Real heterogeneity.** The change is concentrated in that segment. Consider shipping only to that segment, or revising the hypothesis. | +| Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. | +| Two opposite-direction effects in different segments that roughly cancel overall | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses. | + +When you spot Simpson's paradox, route the user to [health-check-interpretation.md](health-check-interpretation.md) §SRM — it's usually the cause, not a real reversal. + +--- + +## What a "ship only to segment X" recommendation requires + +Don't recommend a segment-scoped ship unless **all** of these hold: + +1. The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it). +2. The segment's per-variant sample clears the ~350 floor by a comfortable margin. +3. 
The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment. +4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product. +5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply. + +Otherwise, the segment-only ship is a post-hoc story dressed up as a decision. Recommend confirming with a follow-up experiment scoped to that segment. + +--- + +## When a segment loses but overall wins + +This is the everyday case of mixed effects. + +- If the losing segment is small and its absolute hit is acceptable, ship to all — but call out the loser in the rationale. +- If the losing segment is large or has a guardrail regression, recommend iterate, not ship. +- If the losing segment is a regulated / strategic cohort (paying tier, top customers, EU), default to iterate — guardrails on the cohort, not just overall. + +--- + +## What NOT to do + +- ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition. +- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment `Run-Query` fallback — they're not corrected unless the platform did it. +- ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal. +- ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism. +- ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence). + +--- + +## Output shape + +1. **One-sentence segment-level summary** — homogeneous, heterogeneous, or Simpson's-suspicious. +2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered). +3. 
**What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's." +4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating). diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-of-interest-selection.md new file mode 100644 index 0000000..ea9f22b --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-of-interest-selection.md @@ -0,0 +1,116 @@ +# Segment-of-Interest Selection + +Open this when the user wants to break results down by user segments — _"slice this by platform"_, _"which segments should I look at?"_, _"are new users responding differently?"_. The goal is to pick 3–5 segments that are **likely to reveal a real effect difference**, before slicing every available dimension and ending up p-hacking. + +The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them. + +--- + +## Why this matters: the fishing-expedition problem + +If you slice an experiment by every available property (10 platforms × 20 countries × 5 plan tiers × …), you will find "significant" segment-level effects by chance alone. The family-wise false positive rate explodes the same way it does for too many primary metrics — except there's usually no platform-level correction across segments. **Pre-committing to a small set of segments, ordered by hypothesis-driven probability, is the discipline that makes segment analysis credible.** + +Aim for 3–5 segments, max. If the user wants more, ask which ones are connected to the hypothesis and which are exploration. Mark the exploration set as "hypothesis-generating, not decisional." 
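
To make the fishing-expedition problem concrete, the sketch below computes the chance of at least one false positive across a set of slices. It assumes every slice is an independent test at the same significance level, which real segments rarely satisfy exactly, so treat the output as an order-of-magnitude guide. The function name and the default `alpha` of 0.05 are assumptions for illustration.

```python
def family_wise_false_positive_rate(num_slices, alpha=0.05):
    """Probability of at least one false positive across independent tests.

    Assumes each of the `num_slices` tests is independent and run at level `alpha`.
    """
    return 1.0 - (1.0 - alpha) ** num_slices


# 5 slices is the upper end of the recommended pre-committed set.
# 35 is the number of segment values if you slice separately by
# 10 platforms, 20 countries, and 5 plan tiers.
for n in (1, 5, 35):
    rate = family_wise_false_positive_rate(n)
    print(f"{n:>2} slices -> {rate:.1%} chance of at least one false positive")
```

Under these assumptions, five slices already carry roughly a one-in-five chance of a spurious "significant" segment, and a few dozen make one close to inevitable.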
+ +--- + +## The decision tree for picking segments + +Walk through these in order. The first match is the most defensible pick. + +### 1. Segments the hypothesis explicitly names + +If the experiment's `hypothesis` (or `description`) text mentions "new users", "mobile", "Pro tier", "EU customers" — those segments are pre-committed by the experiment design. Always include them. + +Look at: + +- `experiment.hypothesis` +- `experiment.description` +- The setup-side conversation, if present + +These are not exploratory; they're the variables the team committed to test. + +### 2. Segments where the mechanism is expected to matter + +The hypothesis names _what_ the change is and (ideally) _why_ it should work. The "why" tells you which user attributes plausibly moderate the effect: + +| Hypothesis mechanism | Segments likely to moderate the effect | +| ------------------------------------------------- | -------------------------------------------------- | +| "Reduces first-time friction in onboarding" | New vs returning; signup source; locale | +| "Improves discoverability of feature X" | Users who previously used X vs not; tenure | +| "Speeds up a slow flow" | Platform (mobile slower than web); connection type | +| "Lowers payment friction" | Plan tier; payment-method type; geography | +| "Replaces a confusing UI element" | New vs returning (returning users habituated) | +| "Surfaces a feature only relevant to power users" | Engagement-tier cohorts; tenure | +| "Localized copy / pricing change" | Country / language | + +If you can't articulate _why_ a segment should respond differently, it's not a hypothesis-driven slice. Demote it. + +### 3. Segments where the **denominator** plausibly differs + +Some properties don't change _behavior_ but change _who gets exposed_. Slicing on these helps catch changed-denominator artifacts before they're called a win. + +- Triggered vs untriggered cohorts (if the treatment only fires on certain pages). 
+- Platform / app version (the treatment may only ship on a subset of clients). +- Device class (mobile vs desktop) when the change is platform-specific. + +A 1000% lift in `Checkout Screen Viewed` overall usually disappears once you condition on "users who reached the checkout funnel" — that disappearance is the finding. + +### 4. Segments where SRM or baseline shift is suspected + +If overall SRM is borderline (or failing in one variant only), per-segment SRM can localize the bucketing bug to a specific platform / country / cohort. Examples: + +- iOS vs Android (often the SDK bucketing layer differs). +- Bot-suspicious countries (`bot_traffic` cause from health-check). +- A specific app version range that shipped a flag-evaluation change. + +This is diagnostic segmentation, not interpretation segmentation. Use it when Step 1 of the Decision Tree has already flagged trouble. + +### 5. Segments the platform de facto requires + +Some user dimensions are so foundational that any results report should mention them once: + +- **Platform** — web vs iOS vs Android. +- **New vs returning** — defined as first session within the experiment window vs before. +- **Geo region** — EU vs US vs APAC, when results meaningfully differ by regulatory or payment context. + +Don't include all three blindly — pick the one(s) most likely to vary given the change. + +--- + +## Sanity checks before committing to a slice + +For each segment you want to break down on: + +1. **Does each segment value have ~350+ exposed users per variant?** Below that floor, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment. +2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis. +3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. 
"Pro tier" boundaries changed mid-test) invalidate the comparison. +4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes — slicing on them is selection-bias, not stratification. + +--- + +## How many slices to commit to + +| Situation | Number of slices | +| ----------------------------------------------------------------- | ------------------------------- | +| Hypothesis-driven, well-powered, decisional | 3–5 segments, named upfront | +| Exploratory ("anything weird?"), flagged as hypothesis-generating | Up to ~10, with explicit caveat | +| Diagnostic (chasing a failing SRM or strange overall result) | Whatever helps localize the bug | + +If the user wants to "just look at everything", push back: pick the top 3–5 with reasoning, then offer a separate exploratory pass that won't be used for the ship decision. + +--- + +## The pre-commit ritual + +Before running the breakdowns, tell the user something like: + +> _"Based on the hypothesis (``), I'd slice by `` and `` because ``. I'm intentionally not slicing `` because they don't connect to the proposed mechanism — looking at every dimension makes false positives almost guaranteed. We can do an exploratory pass after, separately from the ship decision. Sound right?"_ + +Pre-commitment is what separates "segmentation analysis" from "fishing." + +--- + +## Then read the results + +Once the segment breakdown is in hand, switch to [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there. 
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md new file mode 100644 index 0000000..88640f4 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md @@ -0,0 +1,109 @@ +# Session-Replay Analysis Guidance + +Open this when the user wants to use session replays to explain a quantitative experiment result — _"why is conversion down in treatment?"_, _"what are users actually doing in the treatment?"_, _"can replays explain the regression?"_. The goal is to turn a number into a behavior story. + +> **Tool boundary.** This skill provides the _interpretation_ guidance for replay analysis. The actual replay-fetching tool — pulling replay IDs for control vs treatment cohorts — lives on the platform side (a separate fetch tool exposed alongside `Get-Experiment`, when available). If the fetch tool isn't yet available, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. + +--- + +## When replays help, when they don't + +| Question | Replays help? | +| ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | +| "Why is conversion lower in treatment?" | Yes — behavior diff is observable. | +| "Why is `Checkout Screen Viewed` 10× higher in treatment?" (changed-denominator suspect) | Yes — replays show whether users are _bouncing_ or _converting_ after they get there. | +| "Why is `time_on_page` higher in treatment?" | Yes — distinguishes engaged reading vs confused dwell. | +| "Is the treatment shipping a regression on iOS only?" | Sometimes — better answered first by segment breakdown. | +| "Why is SRM failing?" | No — replays don't show bucketing. 
Go to health checks. | +| "What's the lift?" | No — replays are qualitative; they explain _why_, not what. | +| "Why hasn't this hit statsig yet?" | No — that's a sample/power question, not a behavior question. | + +A useful heuristic: replays answer _behavioral_ questions. If the question isn't behavioral, replays will burn time without adding signal. + +--- + +## Cohort selection: which replays to compare + +You're looking for **paired contrast**, not a random sample. Pick the cohort that maximizes signal for the specific question. + +| Question | Cohort A (replays to pull) | Cohort B (replays to pull) | +| -------------------------------------------------------------------- | ---------------------------------------------------------- | ----------------------------------------------------------- | +| Why is primary metric down in treatment? | Treatment users who **failed** the primary action | Control users who **succeeded** at the primary action | +| Why is a guardrail regression appearing? | Treatment users who **triggered** the guardrail negatively | Control users who did NOT trigger it | +| Why does treatment have a huge lift in `Screen Viewed` (denom shift) | Treatment users who reached the screen | Same users, looking at whether they completed the next step | +| Why is engagement higher / lower in a specific segment? | Treatment users in that segment | Control users in the same segment | +| What does the new UI look like in practice? | Any treatment users who saw the change | Any control users to confirm the baseline UI | + +**Aim for ~5 replays per cohort.** Fewer and you're anecdote-shopping; many more and you'll just confirm what the first 5 already showed. If the first 5 are inconclusive or contradictory, pull 5 more before changing tactics. + +Filter by recency — replays from the most recent days of the experiment best reflect steady-state behavior (avoid novelty / primacy noise). 
+ +--- + +## What to actually watch for + +Go in with a hypothesis from the quantitative result. Don't watch replays blank-eyed; you'll see "users using the app" and learn nothing. + +### Friction / failure patterns + +- **Hesitation** — long pause before clicking a key element (often signals confusion). +- **Misclicks** — clicking non-interactive elements, or rage-clicking a button that didn't work. +- **Form abandonment** — typing into a field, then leaving without submitting. +- **Back-button bounce** — landing on the page, then immediately backing out. +- **Scroll-and-leave** — scrolling without engaging, then exiting. + +If treatment has more of these than control, you have a behavior explanation for a primary loss or guardrail regression. + +### Layout / discoverability issues + +- **CTA below the fold** — users never scrolling to where the new button is. +- **Element overlap on mobile** — the treatment looks fine in desktop testing but breaks on small screens. +- **Hidden state** — a tooltip / modal that fires once and is then gone, so the user never sees the key affordance. + +These usually explain segment heterogeneity (loss concentrated in mobile, or in a specific viewport size). + +### Changed-denominator behavior + +If you're investigating a Twyman's-Law-sized lift, look for: + +- **Users landing on the new screen and immediately leaving** — explains the inflated `Viewed` event without explaining real conversion. +- **Users completing the rest of the funnel at a much lower rate per-arrival** — explains why the headline metric grew but downstream metrics didn't follow. + +If treatment users _arrive_ at a screen more often but _complete_ at a lower per-arrival rate, the "lift" is a denominator artifact and the per-converter behavior is the real story. + +### Variant-specific UI issues + +- **Treatment showed the wrong copy / wrong asset** — surprisingly common; treatment shipped, but to a subset of routes only. 
+- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to [health-check-interpretation.md](health-check-interpretation.md). +- **Treatment fired twice / persisted state across sessions** — implementation regression. + +--- + +## How to frame the findings + +Replay analysis is qualitative. Be honest about that. + +- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in `live_metrics`."_ +- ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict. + +Tie observations back to specific quantitative results from `Get-Experiment`. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior. + +--- + +## What NOT to do + +- ❌ Use replays to override a clear quantitative verdict. If primaries say "ship" and replays look ugly, the ugliness might be edge cases — confirm with segment analysis first. +- ❌ Cherry-pick a single dramatic replay. n=1 is anecdote. +- ❌ Replace segment analysis with replays. Replays explain _behavior_; segments explain _who_. Different questions. +- ❌ Pull replays from broad cohorts ("all treatment users") — the contrast pair is what reveals signal. +- ❌ Spend more time on replays than on the headline interpretation. The decision tree comes first; replays are the explanation step after it. + +--- + +## Output shape + +1. **The quantitative result the replays are explaining** — link back to the specific metric and verdict. +2. **Cohorts watched** — what filters were applied to A and B, how many replays in each. +3. **Patterns observed**, with counts (e.g. "4 of 5 treatment replays showed X; 0 of 5 control replays did"). +4. 
**The explanation hypothesis** — careful to frame as hypothesis ("consistent with"), not as proof. +5. **Recommended next action** — usually one of: ship anyway (regression edge-case), iterate (fix the friction), kill (treatment is materially worse), or run a follow-up A/B with the fix. diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md new file mode 100644 index 0000000..fdad2cd --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md @@ -0,0 +1,115 @@ +# Why Hasn't This Reached Statistical Significance Yet? + +Open this when the user asks why a primary metric is still inconclusive — _"why isn't this stat-sig yet?"_, _"should I wait or stop?"_, _"is this just underpowered?"_. The goal is to help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null**, _without_ recomputing the platform's verdicts. + +The actual stop / extend math (sample size, power, MDE) is owned by the **setup-side skill** — point the user there for the formulas. This skill explains _which_ lever to pull, not how to recompute one. + +--- + +## First, rule out a broken result + +Inconclusive can mean two very different things: + +1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about. +2. **The result isn't trustworthy at all** — SRM failing, broken data, peeked frequentist, etc. — and "inconclusive" is the wrong frame entirely. + +Before answering "why no statsig?", run the trustworthiness gate (Step 1 of the Decision Tree). If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power. + +Also check: + +- `lift is None` on the primary → no measurement, not "no effect." 
+- The primary is in `metrics[]` but missing from `live_metrics` and `results_cache.metrics` → "no measurement." +- `live_results_errors` is non-null → results are stale or partial; resolve before drawing power conclusions. + +--- + +## The five real reasons an experiment hasn't hit statsig + +Walk through these in order. The first one that explains the picture is usually right. + +### 1. Not enough sample yet (not enough exposures) + +**What to look at**: `live_exposures` per variant vs `settings.sampleSize`; or elapsed days (`end_date - start_date`) vs `settings.endAfterDays`; plus `settings.testingModel`. + +- **Sequential** + target not reached → genuinely too early. Recommend **WAIT**. +- **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe. +- Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5. + +If exposures are falling short of plan because traffic dropped: surface that. `Run-Query` on the exposure event with a date breakdown shows whether something changed mid-experiment. + +### 2. Observed effect is smaller than the MDE + +**What to look at**: the lift on the primary in `live_metrics[primary][treatment].lift`, plus the MDE the user planned for (typically captured in the experiment's `description` or recovered via the setup-side skill's power math). + +- Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1. +- Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options: + - **Accept the null** — at this size, the change isn't moving the metric. Document and move on. + - **Resize and rerun** — if a smaller effect would still be ship-worthy, re-run with a larger sample (lower MDE).
+- Observed lift much **larger** than planned MDE but still not significant → unusual; likely high variance (see reason 3) or insufficient exposures (reason 1). + +### 3. Variance is too high (metric is too noisy) + +**What to look at**: distribution type of the metric, plus `settings.cuped.enabled` and `settings.winsorization.enabled`. + +- **Gaussian** metric (revenue, time-on-page) with no winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization (default percentile 95) on the next run. +- **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume. +- **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the absolute scale of detectable effects. Lifts near 50% rates are easiest; lifts near 0%/100% need much more sample. +- **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%. +- **CUPED enabled on a new-user-only cohort** → CUPED has no effect (no pre-exposure data exists). Not a misconfiguration to "fix," but variance reduction simply didn't happen. + +### 4. Traffic split is starving the variant + +**What to look at**: `settings.srm.targetAllocations` and `live_exposures` per variant. + +- Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue. +- Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later. +- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. 
Each non-control variant needs its own ~350+ sample for the per-comparison stats to be reliable. Adding arms costs power per-comparison. + +Never change traffic allocation mid-Frequentist test — it invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment. + +### 5. Exposure config is filtering more users than the user expects + +**What to look at**: the exposure tracking method (`$experiment_started` event volume), any audience filters on the backing feature flag, and `settings.excludeQA`. + +- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." `Get-Feature-Flag` reveals the rollout rules; `Run-Query` on `$experiment_started` confirms how many users actually got exposed. +- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with `Run-Query`. +- `settings.excludeQA` was off and you suspect internal traffic is dominating one variant → enable it on the next run (results then are cleaner but also smaller). + +**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md). + +--- + +## Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL? + +Once you know which reason fits, the recommendation almost picks itself. + +| Reason | Recommendation | +| -------------------------------------- | ------------------------------------------------------------------------------------------------------------ | +| Not enough sample yet, still ACTIVE | **WAIT.** Show projected end date based on observed traffic. 
| +| Not enough sample yet, concluded early | **EXTEND** (Frequentist: relaunch with longer planned duration; Sequential: resume if possible). | +| Effect << MDE | **ACCEPT NULL** if the planned MDE is the smallest ship-worthy effect; otherwise **BOOST POWER** and re-run. | +| Variance too high | **BOOST POWER**: enable CUPED, enable Winsorization, switch to a less noisy metric proxy. | +| Variant starved by traffic split | **EXTEND** (if remaining time is enough) or restart with rebalanced split. | +| Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. | +| Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. | + +When recommending EXTEND on an active experiment, the call is `Update-Experiment` with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the setup-side skill for the power math. + +--- + +## What NOT to suggest + +- ❌ **Stop early on a favorable peek** in a Frequentist test — that's exactly the false-positive inflation problem. +- ❌ **Switch testing model mid-experiment** — restart, don't morph. +- ❌ **Add more primary metrics** to "fish" for a win — multiplies the family-wise FPR. If a single primary is inconclusive, more primaries make the picture worse, not better. +- ❌ **Re-run identical hypothesis on the same audience right after concluding "no effect"** — without a power change, you'll get the same answer. +- ❌ **Claim "no effect"** from an underpowered inconclusive result — the right framing is "the experiment wasn't sized to detect the effect we observed." + +--- + +## Output shape + +1. **The reason** (one of the five above), in one sentence. +2. **The evidence from `Get-Experiment`** — which fields told you (e.g. 
"exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.). +3. **Recommendation** from the table above, with the specific `Update-Experiment` call or follow-up action. +4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment. diff --git a/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md new file mode 100644 index 0000000..4e344d3 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md @@ -0,0 +1,236 @@ +--- +name: experiment-results +description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts that `Get-Experiment` returns — never recomputes thresholds. +license: Apache-2.0 +--- + +# Experiment Results Interpretation + +You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. **Read the Decision Tree first** and use it as the spine of every interpretation. Drop into the deeper references only when the situation calls for it. + +## Requirements + +- Access to Mixpanel via the MCP server (specifically the `Get-Experiment` tool — and, for ship/kill decisions, `Update-Experiment`). +- This skill reads the verdicts that `Get-Experiment` already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values. + +## When to use this skill + +Trigger when the user asks anything about reading an experiment's results or its health. Common phrasings: + +- "What do these results mean?" / "Should we ship this?" +- "Is this experiment trustworthy?" / "Why is SRM failing?" 
+- "Why hasn't this hit statistical significance yet?" +- "Break this down by ``" / "What segments should I look at?" +- "What does this Retro A/A failure mean?" +- "Can you compare the session replays for control vs treatment?" + +Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the setup-side skill or tool. + +--- + +## How to read `Get-Experiment` output + +Always call `Get-Experiment` with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.** + +| Concept | Live (preferred) | Cached fallback | +| ---------------------------- | --------------------------------- | ------------------------------------------- | +| Per-variant exposure counts | `live_exposures` | `exposures_cache` (strip `$`-prefixed keys) | +| SRM check | `live_srm_analysis` | `exposures_cache.$srm_analysis` | +| Per-metric per-variant stats | `live_metrics[metricId][variant]` | `results_cache.metrics[metricId][variant]` | +| Bucketed summary | recompute from `live_metrics` | `results_cache.summary` | +| When was this computed? | "now" | `exposures_cache.$last_computed` | + +If `live_results_errors` is non-null, the live path failed. Use the cache, caveat that data is stale, and surface the error to the user — the underlying failure may need fixing before any decision. + +If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect." + +See [references/get-experiment-fields.md](references/get-experiment-fields.md) for the full field map and which fields drive each step below. + +--- + +## The Decision Tree + +This is the spine of every interpretation. Run the steps **in order**. **Stop at the first failure** — do not proceed to step N+1 if step N flags a problem. 
+ +``` +┌─ Step 1: TRUSTWORTHINESS GATE ───────────────┐ +│ SRM ok? → exposures sufficient? → │ +│ Retro A/A clean? → minimum duration met? → │ +│ no misconfig? │ +│ │ │ +│ fail → STOP. See references/ │ +│ health-check-interpretation.md │ +└──────────────┬───────────────────────────────┘ + ↓ pass +┌─ Step 2: STATISTICAL SIGNIFICANCE ───────────┐ +│ For each non-control variant × primary, │ +│ apply the polarity recipe (sign-of-lift + │ +│ metric.direction). Significant + correct │ +│ polarity = "win"; significant + wrong │ +│ polarity = "loss". │ +│ │ │ +│ nothing significant on primaries → │ +│ see references/why-no-statsig.md │ +└──────────────┬───────────────────────────────┘ + ↓ at least one primary win +┌─ Step 3: GUARDRAIL CHECK ────────────────────┐ +│ Any guardrail significant in the wrong │ +│ polarity? → regression → ITERATE not ship │ +└──────────────┬───────────────────────────────┘ + ↓ guardrails clean +┌─ Step 4: PRACTICAL SIGNIFICANCE ─────────────┐ +│ Convert the lift on the primary into │ +│ absolute terms. Is it big enough to │ +│ matter to the business? │ +│ Statistically significant ≠ ships. │ +└──────────────┬───────────────────────────────┘ + ↓ meaningful magnitude +┌─ Step 5: VERDICT ────────────────────────────┐ +│ Trust ✓ + primary win + guardrails ✓ + │ +│ meaningful magnitude → SHIP │ +│ Trust ✓ + primary win + guardrail regress │ +│ → ITERATE │ +│ Trust ✓ + primary neutral after target │ +│ → KILL or ITERATE │ +│ Trust ✗ │ +│ → DO NOT DECIDE; report failures │ +│ Hasn't reached target sample/duration │ +│ → WAIT (or extend, or restart with more │ +│ power — see why-no-statsig.md) │ +└──────────────────────────────────────────────┘ +``` + +### Step 1 — Trustworthiness gate (consume the verdicts) + +Read these fields. Treat the platform's verdict as authoritative — do not reapply thresholds yourself. 
+ +| Check | Field to read | What "fail" looks like | +| ------------------------ | ------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| SRM | `live_srm_analysis` (or `exposures_cache.$srm_analysis`) | Platform flags as failing — do not compute the chi-square yourself. | +| Sufficient exposures | `live_exposures` per variant | Platform-flagged "insufficient." If unflagged but per-variant counts look suspicious, route the user to the health-check reference; do not invent a threshold. | +| Retro A/A (pre-exp bias) | `settings.preExperimentBias` enabled, then the analysis | Platform flags a significant pre-period difference. | +| Minimum elapsed time | `end_date - start_date` | Less than ~3 days regardless of sample size — interpretation is unreliable. | +| Ran for planned duration | `start_date`, `end_date`, `settings.endAfterDays`/`sampleSize`/`endCondition`, `settings.testingModel` | Frequentist: ended before reaching configured target = peeking. Sequential: early stop on significance is allowed. | +| Misconfiguration | See [references/get-experiment-fields.md](references/get-experiment-fields.md) §Misconfig | Any flagged misconfig invalidates analysis. | + +If any of these fail, **stop**. Tell the user explicitly that results are not trustworthy. Open [references/health-check-interpretation.md](references/health-check-interpretation.md) for the per-failure root-cause checklists, recommended actions, and the Kohavi framing ("SRM is the #1 trustworthiness check; Twyman's Law: any unusually clean result is more likely a bug than a discovery"). + +### Step 2 — Statistical significance with polarity + +**Critical**: `summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by whether the lift is good for the business. 
You MUST apply the polarity recipe using each metric's `direction` before declaring a winner. + +#### Polarity recipe + +`metric.direction` is `"up"` or `"down"` (defaults to `"up"` if unset on the source metric). + +- `lift is None` or `lift == 0` → **neutral**. +- `direction == "up"` → **positive** if `lift > 0`, else **negative**. +- `direction == "down"` → **positive** if `lift < 0`, else **negative**. + +A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. Never trust the bucket name as the business verdict. + +#### How to read the summary + +1. **Filter out the control row.** Use `settings.controlKey` (typically `"control"`; may be empty). Control-vs-control always has lift 0 and inflates the "no effect" count. If `controlKey` is empty, identify control by: (a) the variant literally named `"control"`, (b) the variant whose lift is uniformly 0 across all metrics, or (c) ask the user. +2. For each non-control variant, look up the metric in `summary.positive` / `summary.negative` / `summary.no`. **Trust the bucket name as the significance signal** — the `significance` field on each item may be `null` even when the bucket is meaningful. +3. Apply the polarity recipe using `metric.direction` to translate sign-of-lift into win/loss. +4. If `lift is None` in a summary item, **the calculation failed** for that variant — surface it. Do not interpret as "no effect." + +The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is set to `"bonferroni"` or `"benjamini-hochberg"` (across primaries × non-control variants). **Don't re-correct.** + +Turning the per-metric numbers into a plain-language verdict (lift + CI + p-value → "small win," "large regression," "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). 
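
The polarity recipe and the summary-reading steps above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the function names `polarity` and `read_summary_item` are hypothetical, and the bucket names (`"positive"`, `"negative"`, `"no"`) are taken from the `summary` keys described above.

```python
from typing import Optional


def polarity(lift: Optional[float], direction: Optional[str] = "up") -> str:
    """Translate sign-of-lift into a business reading (hypothetical helper)."""
    # An unset direction is treated as "up", matching the documented default.
    direction = direction or "up"
    if lift is None or lift == 0:
        return "neutral"
    if direction == "up":
        return "positive" if lift > 0 else "negative"
    if direction == "down":
        return "positive" if lift < 0 else "negative"
    raise ValueError(f"unexpected direction: {direction!r}")


def read_summary_item(bucket: str, lift: Optional[float], direction: str) -> str:
    """Combine the summary bucket (the significance signal) with polarity."""
    if lift is None:
        # A missing lift means the calculation failed; it is not "no effect".
        return "calculation failed"
    if bucket == "no":
        return "not significant"
    reading = polarity(lift, direction)
    return {"positive": "win", "negative": "loss"}.get(reading, "neutral")
```

With this sketch, an "errors" metric with `direction: "down"` and a significant `+5%` lift reads as a loss even though it sits in `summary.positive`, and a `-7%` lift on a `direction: "down"` metric in `summary.negative` reads as a win.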
+ +If nothing on the primaries is significant and the user is asking "why hasn't this hit statsig?", route to [references/why-no-statsig.md](references/why-no-statsig.md). + +### Step 3 — Guardrail check + +Apply the polarity recipe to every guardrail metric (`metric.type == "guardrail"`). + +- A small primary win + a clear guardrail regression → usually **iterate, do not ship**. +- "Not significant" on a guardrail does NOT mean "no regression." It means the experiment couldn't _detect_ one at the chosen confidence. If the guardrail is critical (latency, error rate, retention), flag whether it was powered to detect a meaningful regression. +- Polarity matters here too: a guardrail named "errors" with `direction: "down"` and lift `+5%` (significant) is a regression even though it lands in `summary.positive`. + +### Step 4 — Practical significance + +Statistical significance ≠ business impact. For every primary metric that won: + +1. Read the **baseline value** from the control variant: `live_metrics[metricId][controlKey].value`. +2. Read the **lift** from the winning variant's row. +3. Compute absolute lift: `baseline_value × lift`. +4. Project to population per period: ask the user for traffic estimates if not in context. + +A 5% lift on a 20% baseline metric serving 1M users/week is enormous. A 5% lift on a 0.1% baseline metric serving 1k users/week is noise. Always ground the user in absolute terms before declaring a win meaningful. + +**Twyman's Law check**: before celebrating any lift > ~30%, ask: did the treatment change who is _exposed_ to this metric, not just how they behave? See the changed-denominator notes in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). + +If `value` or `sampleSize` is `null` (common when live computation timed out), call `Run-Query` on the metric scoped to the control variant over the experiment date range to fetch the baseline. 
Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total. + +### Step 5 — Verdict + +| Situation | Recommendation | +| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** `Update-Experiment(action="decide", success=true, variant=, message=)` | +| Trust ✓, primary polarity positive, guardrail polarity negative | **ITERATE.** Investigate the regression; do not auto-ship. | +| Trust ✓, primary polarity neutral after target sample reached | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md). | +| Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)). | +| Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). | + +For **multi-variant tests**, pivot the summary by variant and evaluate each treatment independently against control. The winner is the variant with the most polarity-corrected primary wins, zero guardrail regressions, and the largest practical impact. If multiple qualify, prefer the simpler / lower-risk variant. If none qualify, recommend kill or iterate. + +`message` is required on every `decide` call — include the rationale, the metrics evaluated, and any tradeoffs accepted. + +Special variant constants when `success=true`: + +- `__no_variant_shipped__` — ship the change without picking a variant +- `__defer_variant_decision__` — defer (status becomes `SUCCESS_DEFERRED` in UI) + +For a kill, pass `success=false`. 
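
Steps 4 and 5 above can be sketched together as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation: `absolute_impact` and `verdict` are hypothetical names, every input is assumed to come from verdicts the platform already returned, and a sequential test that legitimately stopped early is assumed to be passed in with `target_reached=True`. Combinations that the Step 5 table does not list are reported as undecided rather than guessed.

```python
def absolute_impact(baseline: float, lift: float, users_per_period: float) -> float:
    """Step 4: baseline x relative lift, projected onto a traffic estimate."""
    return baseline * lift * users_per_period


def verdict(
    trust_ok: bool,
    target_reached: bool,
    primary: str,
    guardrail_regression: bool,
    magnitude_meaningful: bool,
) -> str:
    """Step 5: map the outcomes of Steps 1-4 to a recommendation.

    `primary` is the polarity-corrected reading: "win", "loss", or "neutral".
    """
    if not trust_ok:
        return "DO NOT DECIDE"
    if not target_reached:
        return "WAIT"
    if primary == "win" and guardrail_regression:
        return "ITERATE"
    if primary == "win" and magnitude_meaningful:
        return "SHIP"
    if primary == "neutral":
        return "KILL or ITERATE"
    # The table does not list this combination (for example, a win that is
    # too small to matter), so the sketch declines to pick a verdict.
    return "UNDECIDED (not covered by the table)"
```

Applied to the two examples in Step 4, and assuming the metric is a per-user rate, a 5% lift on a 20% baseline at 1M users/week works out to roughly 10,000 additional converting users per week, while a 5% lift on a 0.1% baseline at 1k users/week works out to roughly 0.05.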
+ +--- + +## Going deeper + +Once the spine is clear, the user often asks one of these follow-ups. Open the relevant reference on demand: + +| User asks about… | Open | +| ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | +| SRM failing, Retro A/A failing, exposures insufficient, or any Step 1 fail | [references/health-check-interpretation.md](references/health-check-interpretation.md) | +| "Translate this lift / CI / p-value into English" | [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | +| "Why hasn't this hit statsig yet? Should we wait or stop?" | [references/why-no-statsig.md](references/why-no-statsig.md) | +| "Which segments should I break this down on?" | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | +| "What does this segment-by-segment result mean?" (when platform support exists) | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | +| "Can session replays help explain this result?" | [references/session-replay-analysis.md](references/session-replay-analysis.md) | +| "Which `Get-Experiment` field has X?" | [references/get-experiment-fields.md](references/get-experiment-fields.md) | + +--- + +## Output + +Default to this shape unless the user asks for something else: + +1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`. +2. **Why**, walking through the decision tree steps that mattered (skip the steps that were clearly fine). +3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, with the polarity-corrected reading of each. Include the absolute-impact translation for any win. +4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, etc. +5. 
**Suggested next action** — the `Update-Experiment` call to make, or the deeper investigation to run. + +If `Get-Experiment` is unavailable or returns errors, say so — do not invent a verdict. + +--- + +## Common pitfalls (cheat sheet) + +- ⛔ **Skipping Step 1** because the lifts look exciting (Twyman's Law) +- ⛔ **Recomputing thresholds** instead of consuming the verdicts the platform already returned +- ⛔ **Not applying polarity** — reading `summary.positive` as "good" without checking `metric.direction` +- ⛔ Trusting a >30% lift without checking whether the **denominator changed** +- ⛔ **Including the control row** when counting wins/losses (filter by `settings.controlKey`) +- ⛔ Treating a `null` lift as "no effect" — it means computation failed +- ⛔ Treating a missing primary (in `metrics[]` but not in `live_metrics`/`results_cache.metrics`) as "no effect" — it's "no measurement" +- ⛔ Interpreting a `< 3 day` experiment instead of refusing +- ⛔ Forgetting to call out a **non-default `confidenceLevel`** (0.9 inflates false positives; 0.99 is conservative) +- ⛔ Treating **secondary-metric significance** as decisional (it isn't, ever) +- ⛔ Conflating **statistical significance** with **practical significance** +- ⛔ Ignoring **guardrail regressions** because the primary won +- ⛔ Calling a single significant primary with multiple-testing correction off a "win" — look at the aggregate, or enable correction +- ⛔ Concluding "no effect" from an underpowered inconclusive result (route to [references/why-no-statsig.md](references/why-no-statsig.md)) diff --git a/plugins/mixpanel-mcp/skills/experiment-results/evals/README.md b/plugins/mixpanel-mcp/skills/experiment-results/evals/README.md new file mode 100644 index 0000000..71278d6 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-results/evals/README.md @@ -0,0 +1,34 @@ +# Eval fixtures — `experiment-results` + +Each fixture is a self-contained prompt + expected-behavior pair for the `experiment-results` skill. 
They are seeded from PRD customer quotes — the customer pain that motivated this skill in the first place. + +The fixtures are not auto-runnable yet (no harness lives in this repo). They're written for two uses: + +1. **Manual rehearsal** — a human (or another agent) can read the prompt, simulate the response the skill should produce, and check it against the `expected_behavior` field. +2. **Regression checkpoint when a runner exists** — when an eval harness is added in this repo, these prompts plug in directly: each YAML doc becomes one case, the `expected_behavior` field becomes the grader rubric. + +When you change `SKILL.md`, walk these fixtures and confirm each one still produces the expected behavior. If a fixture starts failing, decide whether the skill regressed or the fixture itself needs updating. + +--- + +## Fixtures + +| Fixture | PRD source quote | What it exercises | +| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------- | +| `pelando-plus-2-others.yaml` | Pelando — _"+2 others"_ (results too noisy for the user to triage which results to act on) | Decision tree spine + per-metric polarity; ship/iterate verdict against multi-variant noise. | +| `confetti-8-metrics.yaml` | Confetti — _"8 metrics for new visitors"_ (many primaries; user wants segment-of-interest selection on new vs returning) | Segment-of-interest selection; multiple-testing correction warning; per-metric interpretation. | +| `polarsteps-no-workaround.yaml` | Polarsteps — _"no documented workaround"_ (user wants to understand SRM failure with no canned path forward) | Health-check interpretation; Kohavi framing; ordered-causes recommendation. 
| + +Each YAML doc has the same shape: + +```yaml +name: +prd_source: +trigger_phrase: +get_experiment_summary: +expected_behavior: + verdict: + must_mention: [] + must_not_do: [] + references_consulted: [] +``` diff --git a/plugins/mixpanel-mcp/skills/experiment-results/evals/confetti-8-metrics.yaml b/plugins/mixpanel-mcp/skills/experiment-results/evals/confetti-8-metrics.yaml new file mode 100644 index 0000000..da61d9e --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-results/evals/confetti-8-metrics.yaml @@ -0,0 +1,48 @@ +name: confetti-8-metrics +prd_source: | + Confetti — "8 metrics for new visitors" + Customer is running an experiment with 8 primary-ish metrics and explicitly + cares about new-visitor behavior. They want a segment-driven read, not a + dump of 8 lifts. The skill should pre-commit to segments tied to the + hypothesis (new vs returning), call out the multiple-testing concern with + 8 metrics, and produce a verdict scoped to the segment that matters. + +trigger_phrase: | + We're tracking 8 metrics on this onboarding redesign experiment and I really + care about how new visitors respond. Can you read this and tell me whether + it's a ship for the new-user audience? + +get_experiment_summary: + hypothesis: | + If we redesign the first-session onboarding flow, then activation rate + among NEW visitors will increase by ≥5% relative, because reducing + cold-start friction shortens time-to-first-value. + settings: + controlKey: "control" + multipleTestingCorrection: "off" # mis-configured given 8 primaries + testingModel: "sequential" + confidenceLevel: 0.95 + metrics_count: 8 + primary_metrics_summary: | + Of 8 primaries: 2 significant positive (polarity-correct), 1 significant + negative (a "Time to First Action" metric with direction=down where + lift is -7% — actually a WIN once polarity-applied), 5 inconclusive. 
+ +expected_behavior: + verdict: WAIT + must_mention: + - "Pre-commit to the new-vs-returning segment because the hypothesis names new visitors as the cohort that matters" + - "Recommend at most 3–5 segments and call new vs returning the primary slice" + - "Multiple-testing correction is OFF but there are 8 primaries — flag the inflated family-wise FPR explicitly (rough order: with 8 primaries × 1 variant at α=0.05, family-wise FPR is high enough to make a single significant result inconclusive on its own)" + - "Apply polarity recipe per metric — flag the Time to First Action 'negative bucket' as a win because direction=down" + - "Without correction enabled, don't ship on a single significant primary; either enable correction and re-analyze or look at the aggregate of all 8" + - "Verdict is WAIT (re-analyze with multiple-testing correction enabled, segmented to new visitors) — not SHIP" + must_not_do: + - "Slice by every available property after the fact (the fishing-expedition warning)" + - "Treat the 'Time to First Action' metric in the negative bucket as a loss without polarity-correcting" + - "Call the experiment a ship because 2 of 8 primaries are significant positive" + - "Pretend the agent can compute the corrected p-values itself — instead, recommend re-running with multipleTestingCorrection enabled" + references_consulted: + - segment-of-interest-selection.md + - per-metric-interpretation.md + - health-check-interpretation.md # for the misconfig flag diff --git a/plugins/mixpanel-mcp/skills/experiment-results/evals/pelando-plus-2-others.yaml b/plugins/mixpanel-mcp/skills/experiment-results/evals/pelando-plus-2-others.yaml new file mode 100644 index 0000000..f634236 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-results/evals/pelando-plus-2-others.yaml @@ -0,0 +1,79 @@ +name: pelando-plus-2-others +prd_source: | + Pelando — "+2 others" + Customer reported that when a multi-variant test concludes with a winner banner + plus a small-print "+2 
others", they cannot tell which non-winner variants are + benign vs which contain a guardrail regression they need to act on. The skill + should pivot the summary per variant, polarity-correct each, and call out the + losers, not gloss over them. + +trigger_phrase: | + Can you make sense of this experiment for me? The UI shows treatment_a winning + on the primary plus "+2 others" but I have no idea whether treatment_b or + treatment_c are okay to ignore. + +get_experiment_summary: + settings: + controlKey: "control" + multipleTestingCorrection: "benjamini-hochberg" + testingModel: "sequential" + metrics: + - id: m_primary + type: primary + direction: up + name: "Activation Rate" + - id: m_guardrail_latency + type: guardrail + direction: down + name: "p95 Latency (ms)" + - id: m_guardrail_errors + type: guardrail + direction: down + name: "Error Rate" + live_exposures: + control: 41123 + treatment_a: 40987 + treatment_b: 41210 + treatment_c: 40755 + live_srm_analysis: + # platform-flagged passing + p_value: 0.42 + summary: + positive: + - { + metricId: m_primary, + variant: treatment_a, + lift: 0.041, + liftConfidence: 0.95, + } + - { + metricId: m_guardrail_latency, + variant: treatment_b, + lift: 0.08, + liftConfidence: 0.95, + } + negative: + - { + metricId: m_primary, + variant: treatment_c, + lift: -0.022, + liftConfidence: 0.95, + } + no: + - { metricId: m_primary, variant: treatment_b, lift: 0.004 } + +expected_behavior: + verdict: ITERATE + must_mention: + - "Pivot the summary by variant before declaring a winner" + - "treatment_a wins on the primary but treatment_b shows a latency regression once polarity is applied (direction=down + lift +8% = bad)" + - "treatment_c regresses the primary" + - "Multi-variant verdict requires each treatment to be judged independently against control" + - "Recommend iterate, not ship — at minimum, do not ship treatment_b, and investigate treatment_c before re-running" + must_not_do: + - "Quietly drop treatment_b and 
treatment_c into '+2 others' without polarity-checking each" + - "Trust the bucket name (positive/negative) as the business verdict" + - "Re-apply multiple-testing correction on top of the platform's benjamini-hochberg" + references_consulted: + - per-metric-interpretation.md + - get-experiment-fields.md diff --git a/plugins/mixpanel-mcp/skills/experiment-results/evals/polarsteps-no-workaround.yaml b/plugins/mixpanel-mcp/skills/experiment-results/evals/polarsteps-no-workaround.yaml new file mode 100644 index 0000000..325a3bf --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-results/evals/polarsteps-no-workaround.yaml @@ -0,0 +1,61 @@ +name: polarsteps-no-workaround +prd_source: | + Polarsteps — "no documented workaround" + Customer's experiment is failing SRM and they cannot find a documented path + forward. The skill should consume the platform's SRM verdict (not recompute + chi-square), cite Kohavi ("SRM is the #1 trustworthiness check"), and + surface ordered likely causes plus a specific recommended action — not + punt with "investigate further." + +trigger_phrase: | + My experiment is failing SRM and the result lift looks too good to be true + (+18% on the primary). The docs just say "investigate" — what does that + actually mean here? Should I trust the lift? 
+ +get_experiment_summary: + settings: + controlKey: "control" + srm: + enabled: true + targetAllocations: { control: 50, treatment: 50 } + excludeQA: false # potentially relevant + live_exposures: + control: 18250 + treatment: 22980 + live_srm_analysis: + # platform-flagged FAILING + p_value: 0.00002 + chi_square: 18.4 + summary: + positive: + - { + metricId: m_primary, + variant: treatment, + lift: 0.18, + liftConfidence: 0.95, + } + metrics: + - id: m_primary + type: primary + direction: up + name: "Trip Plan Created" + +expected_behavior: + verdict: DO_NOT_DECIDE + must_mention: + - "SRM is failing per the platform's verdict — do NOT trust the +18% lift" + - "Cite Kohavi: SRM is the #1 trustworthiness check; when SRM is failing, lift, p-values, and confidence intervals cannot be attributed to the treatment" + - "Twyman's Law: a +18% lift on a failing-SRM experiment is more likely a bucketing bug than a genuine win" + - "Likely causes ordered most → least likely: bucketing_bug, biased_assignment, bot_traffic, exposure_tracking_bug, ramp_up_timing" + - "Recommended action: pause_and_investigate — pause before drawing conclusions; randomization assumption is broken" + - "Concrete next steps: compare live_exposures to targetAllocations; check feature-flag rules and history via Get-Feature-Flag; Run-Query $experiment_started by variant; enable settings.excludeQA before relaunch given it is currently off" + - "Do NOT recompute the SRM chi-square — consume the platform's verdict" + - "Restart with fixed bucketing once the cause is found; do NOT re-conclude on the broken data" + must_not_do: + - "Calculate the chi-square or re-derive an SRM p-value threshold" + - "Recommend shipping or treating the +18% lift as real" + - "Hand the user a generic 'investigate further' without ordered causes and an action" + - "Skip Kohavi framing — it's the whole reason this check is the #1 gate" + references_consulted: + - health-check-interpretation.md + - get-experiment-fields.md 
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/get-experiment-fields.md b/plugins/mixpanel-mcp/skills/experiment-results/references/get-experiment-fields.md new file mode 100644 index 0000000..efaeae5 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-results/references/get-experiment-fields.md @@ -0,0 +1,161 @@ +# `Get-Experiment` Field Map + +Quick reference for which `Get-Experiment` response field drives each interpretation. Always call with `compute_exposures=true, compute_metrics=true`. + +This reference is **read-only domain knowledge** for the agent. It does NOT define thresholds — every "fail condition" listed below is a _characterization_ of how the platform itself already classifies the field, not a threshold this skill should re-apply. + +--- + +## Identity & lifecycle + +``` +id, name, description, hypothesis, status, start_date, end_date +creator_email, tags, url, workspace_id +feature_flag_id → for feature-flag-based experiments +settings.controlKey → variant key treated as control (often "control"; may be "") +``` + +`status` is one of `"concluded" | "success" | "fail"` (the UI may additionally show `SUCCESS_DEFERRED` for the special variant constant — see "Decision metadata" below). 
+ +--- + +## Trustworthiness + +``` +live_srm_analysis → SRM verdict (consume — don't recompute) + .p_value + .chi_square +live_exposures[] → per-variant exposure counts (live) +exposures_cache[] → per-variant exposure counts (cached fallback) +exposures_cache.$srm_analysis → cached SRM analysis +exposures_cache.$last_computed → when the cache was last refreshed +settings.srm.enabled → whether the SRM check ran +settings.srm.targetAllocations → expected per-variant allocation (percent) +settings.preExperimentBias → whether Retro A/A was enabled +settings.excludeQA → whether QA traffic was filtered +live_results_errors → non-null = live computation failed; surface and fall back to cache +``` + +--- + +## Per-metric per-variant results + +``` +live_metrics[][] + .value → metric value for this variant + .sampleSize → sample size for this variant on this metric + .lift → (treatment - control) / control (0 for control row) + .liftConfidence → confidence LEVEL used (e.g. 0.95) — NOT the CI width + .significance → "YES_POSITIVE" | "YES_NEGATIVE" | "NO" (sign-of-lift, NOT polarity) + +results_cache.metrics[][] → cached fallback, same shape +``` + +--- + +## Bucketed summary + +``` +results_cache.summary.positive[] → items with significance == "YES_POSITIVE" (lift > 0, sig) +results_cache.summary.negative[] → items with significance == "YES_NEGATIVE" (lift < 0, sig) +results_cache.summary.no[] → items with significance == "NO" + +Each item: + .metricId + .variant + .value + .lift + .liftConfidence + .sampleSize + .significance +``` + +**Pre-process the summary**: filter rows where `variant == settings.controlKey` (control-vs-control is mechanical noise), then apply the polarity recipe before drawing any conclusion. 
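
The pre-processing step above can be sketched as a small filter. This is a minimal sketch, assuming the summary is a dict of bucket name to a list of row dicts with a `variant` key, as in the item shape listed above; `drop_control_rows` is a hypothetical helper name, not part of any tool.

```python
from typing import Any, Dict, List

Summary = Dict[str, List[Dict[str, Any]]]


def drop_control_rows(summary: Summary, control_key: str) -> Summary:
    """Return a copy of the bucketed summary without control-vs-control rows."""
    if not control_key:
        # An empty controlKey means control has to be identified some other
        # way before filtering; refusing here avoids silently keeping it.
        raise ValueError("settings.controlKey is empty; identify control first")
    return {
        bucket: [row for row in rows if row.get("variant") != control_key]
        for bucket, rows in summary.items()
    }
```

The filtered rows can then be passed through the polarity recipe; the input summary itself is left unmodified.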
+ +--- + +## Metric catalog (for polarity lookups) + +``` +metrics[] + .id, .name + .type ("primary" | "guardrail" | "secondary") + .direction ("up" | "down") → always set; defaults to "up" if the source metric was unset +``` + +Build a lookup `metric_id → (type, direction)` and join to summary rows during interpretation. + +--- + +## Settings that change interpretation + +``` +settings.confidenceLevel → significance threshold (e.g. 0.95) +settings.testingModel → "frequentist" or "sequential" +settings.endCondition → "sample_size" or "days" +settings.sampleSize / .endAfterDays → planned end target +settings.multipleTestingCorrection → "off" | "bonferroni" | "benjamini-hochberg" +settings.cuped.enabled → CUPED variance reduction applied +settings.cuped.preExposureDatePreset → pre-exposure window +settings.winsorization.enabled → outlier capping applied +settings.winsorization.percentile → cap percentile (default 95; lower values are extreme) +``` + +--- + +## Decision metadata (post-decide) + +``` +results_cache.message → decision rationale +results_cache.variant → shipped variant key (or special constant) +status → "concluded" | "success" | "fail" +``` + +Special variant constants for `success=true`: + +- `__no_variant_shipped__` — ship the change without picking a variant. +- `__defer_variant_decision__` — defer (UI shows `SUCCESS_DEFERRED`). + +For a kill, pass `success=false`. + +--- + +## Lifecycle hand-off + +``` +Update-Experiment( + experiment_id=, + experiment={ + "action": "decide", + "success": true | false, + "variant": "", # required when success=true + "message": "" + } +) +``` + +`message` is required on every `decide` call. + +--- + +## Misconfig field map (cross-link) + +For _how_ to react to each of these, see [health-check-interpretation.md](health-check-interpretation.md) §7. 
+ +- `settings.multipleTestingCorrection in {"off", null}` with 2+ primaries × 1+ non-control variants +- `settings.winsorization.enabled == true` with `percentile` very low (< ~80) or very high (> ~99) +- `settings.srm == null` OR `settings.srm.enabled == false` (often intentional — only flag if results look suspicious) +- `settings.cuped.enabled == true` AND the experiment cohort is "new users only" +- `settings.confidenceLevel != 0.95` +- `metrics[]` entries with `name == ""` +- A primary metric in `metrics[]` but missing from `live_metrics` AND `results_cache.metrics` + +--- + +## When to reach for sibling tools + +- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the setup-side skill / `Get-Experiment-Setup-Guidance`. +- **Raw data for triggered or segmentation analysis** → `Run-Query` on the metric with appropriate filters. +- **Acting on the recommendation** (ship, kill, extend) → `Update-Experiment` with the appropriate action. +- **Feature-flag rollout history** for SRM root cause → `Get-Feature-Flag`. +- **Session replays** for behavioral explanation of a quantitative result → the replay-fetch tool (see [session-replay-analysis.md](session-replay-analysis.md)). diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md b/plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md new file mode 100644 index 0000000..4471219 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md @@ -0,0 +1,158 @@ +# Health-Check Interpretation + +Open this when Step 1 of the Decision Tree flags a failure (SRM, Retro A/A, insufficient exposures, peeking, broken-data, < 3-day window, or any misconfiguration). The goal is to turn the platform's already-computed verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action. 
+ +**This skill never recomputes thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers. + +--- + +## Kohavi framing — always cite when a health check fails + +> **Sample Ratio Mismatch is the #1 trustworthiness check (Kohavi).** When SRM is failing, do not trust the experiment's lift, p-values, or confidence intervals — the randomization assumption is broken, so the measured effect cannot be attributed to the treatment. +> +> **Twyman's Law**: any unusually clean or unusually large result is more likely a bug than a discovery. A spectacular lift on a failing-SRM experiment is not evidence of a great treatment; it's evidence the bucketing is broken. + +These two principles drive the recommendations below. Lead with them when explaining a failing check to the user. + +--- + +## 1. SRM (Sample Ratio Mismatch) + +**Verdict to consume**: `live_srm_analysis` (or `exposures_cache.$srm_analysis`). The platform tags failing SRMs already; do not compute chi-square yourself. + +### What it means + +Users were assigned to variants in proportions that disagree with the configured `settings.srm.targetAllocations`. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness. + +### Likely causes, ordered most → least likely + +(Surface in this order — investigate the most probable first.) + +1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees. +2. **biased_assignment** — The assignment criterion correlates with the variant — e.g. assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window. +3. 
**bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation. +4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the `$experiment_started` event fires exactly once per user per variant assignment. +5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period. + +### Recommended actions + +- **pause_and_investigate** — Pause the experiment before drawing any conclusions. SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable. +- **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric. +- **investigate_exposure_logging** — Inspect `$experiment_started` event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs. +- **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split. + +### Investigation checklist + +1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented? +2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Use `Get-Feature-Flag` to inspect rollout rules and history. +3. For multi-variant tests, the platform's SRM threshold is Bonferroni-corrected — the effective per-variant threshold may be tighter than the headline. Trust the bucket flag, not raw p-value math. +4. 
Verify SDK version and bucketing logic. `Run-Query` for `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly. +5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it. +6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting. +7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.** + +--- + +## 2. Retro A/A (pre-experiment bias) failure + +**Verdict to consume**: the analysis the platform attached when `settings.preExperimentBias` is enabled. + +### What it means + +The same statistical comparison run on the **pre-exposure** period revealed that variant cohorts already differed _before_ the treatment started. Any "lift" measured during the experiment may just be reflecting that pre-existing gap, not the change. + +- Pre-experiment bias on a **primary** metric is a **stop-and-investigate** signal. +- Pre-experiment bias on a **secondary** metric is informational only. + +### Investigation checklist + +1. Identify which metric × variant pair triggered the failure (after the platform's correction). +2. Check whether bucketing was deterministic — non-deterministic assignment in the pre-period means users were assigned to different variants than they would have been in production. +3. Look for cohort skew: did one variant disproportionately receive heavy users? `Run-Query` on the metric pre-experiment grouped by variant to confirm. +4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort. +5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing. + +--- + +## 3. 
Insufficient exposures + +**Verdict to consume**: `live_exposures` per variant, plus any platform-attached "insufficient" flag. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue. + +### Investigation checklist + +1. Check `live_exposures` totals — which variant is undersampled? +2. Inspect feature-flag rollout: `Get-Feature-Flag` → was rollout dialed back? +3. `Run-Query` for the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?). +4. If the experiment is still ACTIVE: extend duration via `Update-Experiment` with `endAfterDays`. +5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math. + +If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question. + +--- + +## 4. Frequentist peeking + +**Verdict to consume**: `settings.testingModel == "frequentist"`, plus `end_date` vs `start_date + endAfterDays` (or `sampleSize` vs `live_exposures.$overall`, depending on `settings.endCondition`). + +### What it means + +A frequentist test that ends before reaching its configured target has an **inflated false-positive rate**. The math assumes a fixed sample size; peeking before that point and stopping on a favorable look is exactly what "p-hacking" looks like in production. + +### Investigation checklist + +1. Confirm `settings.testingModel == "frequentist"`. +2. Compare `end_date` against `start_date + endAfterDays` (or whether `sampleSize` was reached, whichever is the configured `endCondition`). +3. If the conclusion was premature: results have inflated false-positive rate. Recommend a re-run. +4. If the user wants to keep current results: caveat strongly. Recommend `testingModel: "sequential"` for the next experiment so they can stop early without penalty. 
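
The first two checklist steps can be sketched as a small comparison. This is an illustration only: it assumes `start_date` and `end_date` have already been parsed into `datetime.date` values, that `total_exposures` was read from `live_exposures.$overall`, and that `settings` is a dict with the fields named above. The function name `ended_before_target` is hypothetical.

```python
from datetime import date, timedelta


def ended_before_target(settings: dict, start_date: date, end_date: date,
                        total_exposures: int) -> bool:
    """Return True when a frequentist experiment stopped short of its configured target."""
    if settings.get("testingModel") != "frequentist":
        # Sequential tests are built for early stopping, so this check does not apply.
        return False
    if settings.get("endCondition") == "days":
        planned_end = start_date + timedelta(days=settings["endAfterDays"])
        return end_date < planned_end
    if settings.get("endCondition") == "sample_size":
        return total_exposures < settings["sampleSize"]
    # Unknown end condition: make no claim rather than guess.
    return False
```

A `True` result means the verdicts deserve the strong caveat described in steps 3 and 4; it does not by itself say which metrics are false positives.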
+ +(Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.) + +--- + +## 5. Live computation timeout / broken data + +**Verdict to consume**: `live_results_errors` non-null with `live_*` fields null. + +### Investigation checklist + +1. Try `Get-Experiment` again — transient backend load may resolve. Wait ~30s between retries. +2. If repeated failures: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. +3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision. +4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation. + +--- + +## 6. Experiment ran < 3 days + +**Verdict to compute (this one is local)**: `end_date - start_date`. + +Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. **Refuse to interpret.** Tell the user explicitly: + +> _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_ + +If `endCondition: "sample_size"` with a tiny target (e.g. 10k) was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window. + +--- + +## 7. Misconfigurations to flag during Step 1 + +These don't always invalidate results, but they change how to _read_ them. Surface them as warnings. 
+ +- `settings.multipleTestingCorrection in {"off", null}` AND there are 2+ primary metrics across 1+ non-control variants → without correction, any single significant primary may be a false positive. **Don't assume the result is broken**; look at all primary results in aggregate. If most or all primaries point the same direction (all positive or all negative), there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk**, and the user can enable correction (Benjamini-Hochberg or Bonferroni) and re-analyze. Family-wise error rate compounds with the number of comparisons: for m independent comparisons it is 1 − (1 − α)^m (e.g. 15 primaries × 1 variant at α=0.05 → ~54% chance of at least one false positive). +- `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` very low (e.g. < ~80) or unusually high (e.g. > ~99) → extreme outlier capping. The platform's default is 95; a percentile near 50 caps almost all data and likely indicates misconfiguration. +- `settings.srm == null` OR `settings.srm.enabled == false` → the SRM check didn't run. **SRM is often deliberately disabled** (e.g. when feature-flag rollouts intentionally split traffic unevenly), so do not try to compute it yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios), then suggest the user re-enable SRM and re-analyze. +- `settings.cuped.enabled == true` AND the experiment cohort is "new users only" → CUPED requires pre-exposure data, which new-user experiments lack, so CUPED simply has no effect. **This does NOT invalidate results**; variance reduction just didn't happen. Mention it as informational. +- `settings.confidenceLevel != 0.95` → call out explicitly. `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Combine with metric count for a sense of family-wise error rate. 
+- `metrics[]` contains entries with `name == ""` → likely a broken or placeholder metric reference. Flag and skip during analysis. +- A primary metric appears in `metrics[]` but is **missing from `live_metrics` AND `results_cache.metrics`** → no result was computed for that primary. Surface prominently — this is "no measurement," not "no effect." Recommend the user re-sync results. + +--- + +## Output shape when a health check fails + +1. **What failed**, in one sentence (use the verdict the platform attached — do not re-derive). +2. **What that means for trust** — cite the Kohavi framing (SRM is #1) or Twyman's Law where it fits. +3. **Likely causes**, ordered most → least probable. +4. **Recommended action** from the small set above. +5. **Investigation checklist** the user can run. +6. **What NOT to do** — usually, "do not act on the current lift / p-value numbers." diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md b/plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md new file mode 100644 index 0000000..3b44385 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md @@ -0,0 +1,188 @@ +# Per-Metric Interpretation + +Open this when the user wants you to translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_ + +**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from `Get-Experiment`. Then translate. + +--- + +## The mental model + +Each row in `summary.positive` / `summary.negative` / `summary.no` answers four questions: + +1. **Did the lift go up or down?** — the `summary` bucket name (sign-of-lift, not polarity). +2. 
**Was the change distinguishable from noise?** — the `significance` field (or the bucket name itself: rows in `summary.positive` / `summary.negative` are significant, rows in `summary.no` are not). +3. **Was the change in the goal direction?** — apply the polarity recipe with `metric.direction`. +4. **Was the change big enough to matter?** — multiply `lift` by the control baseline `value` to get absolute impact, then judge against business context. + +A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any one of those and you're shipping the wrong thing. + +--- + +## Polarity recipe (repeat from the spine — critical) + +`metric.direction` is `"up"` or `"down"` (defaults to `"up"`). + +- `lift is None` or `lift == 0` → **neutral** (treat as no measurement / no effect respectively). +- `direction == "up"` → **positive** if `lift > 0`, else **negative**. +- `direction == "down"` → **positive** if `lift < 0`, else **negative**. + +A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. A `-1% interstitials_shown` lift in `summary.negative` with `direction: "down"` is plausibly a **win** (less interruption). + +--- + +## Reading the p-value correctly + +The p-value is the probability of observing a difference at least as extreme as the one measured, **assuming the null hypothesis (no real difference) is true**. It is NOT: + +- ❌ The probability that the treatment works. +- ❌ The probability the result will replicate. +- ❌ A measure of effect size — a tiny lift can be highly significant on a huge sample. +- ❌ Proof of "no effect" when above threshold (see "Inconclusive results"). + +Mixpanel uses Welch's t-test (z-test for large samples). Default α = 0.05 at 95% confidence. The confidence level is set on `settings.confidenceLevel`. If it differs from 0.95, call it out in the verdict (`0.9` inflates false positives; `0.99` is conservative). 
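
To make the "not a measure of effect size" point concrete, here is a sketch for intuition only. It uses a textbook pooled two-proportion z-test, which is an assumption chosen for simplicity and is not claimed to match the platform's exact computation; the skill's rule still holds, so read the platform's `significance` verdict rather than recomputing it. The function name and the sample numbers are hypothetical.

```python
import math


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test (textbook approximation)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


# Same 1% relative lift (2.00% -> 2.02% conversion) at two very different sample sizes.
p_small = two_proportion_p_value(200, 10_000, 202, 10_000)
p_large = two_proportion_p_value(200_000, 10_000_000, 202_000, 10_000_000)
```

The lift is identical in both calls and only the sample size changes: `p_small` is nowhere near the 0.05 threshold while `p_large` is well below it, which is why a p-value cannot stand in for magnitude.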
+ +--- + +## Reading the lift correctly + +``` +lift = (treatment_mean - control_mean) / control_mean +``` + +- `liftConfidence` is the **confidence level used** (e.g. 0.95). It is NOT the confidence-interval width. +- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct. +- If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect." + +--- + +## Verdict phrasing — a small palette + +Pick the phrase that matches the four-question pattern. These are the words to use with users; they map onto the platform's already-computed numbers, so the agent never has to invent thresholds. + +| Pattern (sig × polarity × magnitude) | Plain-language verdict | +| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Significant, polarity positive, magnitude large vs baseline | "**Clear win** — `` moved `` in the goal direction, which is meaningful at this baseline." (apply Twyman's Law if lift > ~30%) | +| Significant, polarity positive, magnitude small vs baseline | "**Statistically significant but practically small** — `` on a `` baseline is ``; confirm with the user whether that clears the business bar." | +| Significant, polarity negative | "**Regression** — `` moved `` against its goal direction. This is a reason not to ship even if other primaries won." | +| Not significant, lift in goal direction, well-powered | "**Likely no effect at the detectable size.** The experiment had enough power to detect ``; the observed lift is below that threshold." 
| +| Not significant, lift in goal direction, underpowered | "**Inconclusive — too underpowered to call.** Route to the why-no-statsig playbook to decide between wait / extend / restart." | +| Not significant, lift in wrong direction | "**No detectable harm**, but no win either." | +| `lift is None` | "**No measurement** — this variant's row failed to compute. Surface the failure and re-sync." | +| Lift > ~30% on any metric | Prefix with "**Twyman's Law check:** that lift is unusually large; verify the denominator hasn't changed before celebrating." | + +--- + +## Magnitude — make it absolute + +Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful: + +1. Baseline from the control variant: `live_metrics[metricId][controlKey].value` (or the `summary.no` row where `variant == controlKey`). +2. Lift from the winning row. +3. Absolute lift: `baseline_value × lift`. Examples: + - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate. + - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`. +4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week." + +### Fallback when `value` / `sampleSize` are null + +Common — happens whenever live computation timed out or `results_cache.metrics` was nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.** + +Call `Run-Query` on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation: + +- `unique` (Bernoulli) → conversion **rate** as the baseline. +- `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. 
Multiplying lift by a raw total double-counts cohort size. + +--- + +## Twyman's Law in practice — changed-denominator lifts + +Before celebrating any lift > ~30%, ask: **did the treatment change who is _exposed_ to this metric, not just how they behave?** + +If the treatment causes more users to _see_ a screen, more events naturally fire — the metric grows because the denominator changed, not because per-user behavior changed. + +- A "Free item" promotion drives more users to checkout → "Checkout Screen Viewed" lifts +1000% mechanically. The interesting question is **conversion rate on the screen**, not raw views. +- A new banner makes a feature discoverable → "Feature Page Viewed" lifts dramatically. **Per-discover-er behavior** may be unchanged. + +When you see a > 30% lift, name the risk explicitly: + +> _"This metric measures exposure to the screen/event. The treatment likely caused more users to be exposed; that explains most of the lift mechanically. The interesting question is what those users did once they got there."_ + +--- + +## Metric distribution types + +Different metric types behave differently; cite the relevant nuance in your verdict. + +| Metric type | Distribution | Interpretation nuance | +| -------------------------------- | ------------ | --------------------------------------------------------------------------------------------------------- | +| Unique users / conversion rate | Bernoulli | Variance = `p(1−p)`. Lift on rates near 50% is most powered; rates near 0% or 100% need much more sample. | +| Event counts / sessions per user | Poisson | Variance = mean. Highly sensitive to power users; consider whether one heavy user can swing results. | +| Revenue / numeric properties | Gaussian | Long tails (whales) inflate variance. Strongly consider Winsorization. 
| + +--- + +## Variance-reduction & outlier settings that change interpretation + +- **CUPED** (`settings.cuped.enabled == true`): mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). +- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration (see `health-check-interpretation.md` §Misconfig). + +--- + +## Multiple comparisons & metric tiers — what's decisional and what isn't + +| Tier | How it influences the verdict | +| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Primary** | **Decisional.** The platform auto-applies correction when `multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` (across primaries × variants). | +| **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude. | +| **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. | + +If `settings.multipleTestingCorrection` is `"off"` AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled. 
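
A short sketch of why uncorrected primaries deserve an aggregate read. It assumes the comparisons are independent (correlated metrics would make the true rate lower) and takes α as `1 - settings.confidenceLevel`; both function names are hypothetical.

```python
def uncorrected_comparisons(primaries: int, non_control_variants: int) -> int:
    """Number of primary-metric comparisons made without correction."""
    return primaries * non_control_variants


def family_wise_error_rate(alpha: float, comparisons: int) -> float:
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


# Example: 15 primaries, 1 non-control variant, 95% confidence level.
m = uncorrected_comparisons(primaries=15, non_control_variants=1)
fwer = family_wise_error_rate(alpha=0.05, comparisons=m)
```

With those example inputs the rate comes out a little over one half, so a lone significant primary among fifteen is weak evidence on its own.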
+ +--- + +## "Significance = NO" does NOT mean "no effect" + +A row in `summary.no` means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level. **Important when the user is about to call something a null result.** + +Options to suggest when a primary metric lands in `summary.no`: + +1. **Extend duration** (if the experiment is still ACTIVE). +2. **Increase traffic allocation** (if there's headroom — never mid-Frequentist-test, which invalidates SRM). +3. **Use Sequential testing model** for the next experiment if continuous monitoring fits. +4. **Enable CUPED** if the metric correlates with pre-exposure behavior. +5. **Narrow the hypothesis** — test a stronger version, or scope to a more responsive segment. +6. **Accept the null** — if the experiment was well-powered for the MDE that matters, "no effect" is a real finding. + +For the full "why hasn't this hit statsig yet" walk-through, see [why-no-statsig.md](why-no-statsig.md). + +--- + +## Frequentist vs Sequential — what affects per-metric reading + +Check `settings.testingModel`: + +- `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration. +- `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended. + +Calling `Update-Experiment(action="conclude")` on a Frequentist experiment that hasn't reached its target is a peeking event. Flag it in the verdict. + +--- + +## Triggered analysis & dilution + +If the change only affects a subset of users (e.g. only triggers when a specific button is shown), the **effect on triggered users** is much larger than the **effect on the full exposed population**. 
+ +- Triggered analysis zooms in on users who actually saw the change. +- Dilution math: `population_lift = triggered_lift × (triggered_users / total_exposed)`. + +The platform doesn't auto-compute triggered analysis. If the change is gated by a condition, ask the user about the trigger rate and walk through the math before declaring the population-level lift "small." + +--- + +## Novelty and primacy + +- **Novelty** — lift is large early, then decays as users habituate. +- **Primacy** — lift is small or negative early, then grows as users learn the new behavior. + +To detect either, look at the line-chart view of the metric (date-segmented). A monotonic decay from day 1 → day 14 is classic novelty; the steady-state lift is what matters for shipping. Call this out when interpreting any experiment shorter than ~2 weeks. diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md new file mode 100644 index 0000000..6877d2a --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md @@ -0,0 +1,95 @@ +# Segment-Breakdown Interpretation + +Open this when the user has per-segment results in hand and wants to read them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. + +> **Platform support status.** Reading segment-level experiment results in `Get-Experiment` depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment `Run-Query` calls against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. 
If `Get-Experiment` doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the `Run-Query` fallback — do not invent per-segment significance verdicts. + +--- + +## The mental model + +A segment breakdown asks: _did the treatment affect different user segments differently?_ It has three possible outcomes per segment: + +1. **The segment moved in the same direction as the overall effect**, with similar magnitude → reinforces the overall verdict; nothing new. +2. **The segment moved much more or less than overall**, but in the same direction → heterogeneity; the effect is concentrated in a subset. +3. **The segment moved in the _opposite_ direction** to overall → Simpson's paradox or a real reversal — this is where segment analysis earns its keep. + +Reading a segment breakdown well means recognizing which of those three you're looking at and not mistaking noise for any of them. + +--- + +## Per-segment polarity recipe — apply per row + +The same recipe from the per-metric reference applies _inside_ each segment. Don't take a shortcut. + +- For each segment × metric × non-control variant, look at the row's `lift` and bucket (positive/negative/no). +- Translate sign-of-lift into business polarity using `metric.direction`. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary. +- Filter out the control row in each segment. + +Surprisingly easy to forget when you're scanning a wide table — re-apply polarity per row. + +--- + +## Sample-size floor per segment + +Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. As a rule of thumb, the same ~350-per-variant floor used for overall trustworthiness applies per segment. + +- Segments below the floor → mark "insufficient sample, treat as directional only." +- A "significant" lift on a 50-user-per-variant segment is almost always noise. Say so. 
+- If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice. + +--- + +## Heterogeneity vs Simpson's paradox vs noise + +| What you see | Interpretation | +| --------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Most segments lift positive, one or two negative, all with overlapping CIs | **Noise.** Not heterogeneity. Don't ship a segment-specific story. | +| One segment lifts much more than the rest, with a tight CI and a clear mechanism | **Real heterogeneity.** The change is concentrated in that segment. Consider shipping only to that segment, or revising the hypothesis. | +| Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. | +| Two opposite-direction effects in different segments that roughly cancel overall | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses. | + +When you spot Simpson's paradox, route the user to [health-check-interpretation.md](health-check-interpretation.md) §SRM — it's usually the cause, not a real reversal. + +--- + +## What a "ship only to segment X" recommendation requires + +Don't recommend a segment-scoped ship unless **all** of these hold: + +1. The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it). +2. The segment's per-variant sample clears the ~350 floor by a comfortable margin. +3. 
The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment. +4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product. +5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply. + +Otherwise, the segment-only ship is a post-hoc story dressed up as a decision. Recommend confirming with a follow-up experiment scoped to that segment. + +--- + +## When a segment loses but overall wins + +This is the everyday case of mixed effects. + +- If the losing segment is small and its absolute hit is acceptable, ship to all — but call out the loser in the rationale. +- If the losing segment is large or has a guardrail regression, recommend iterate, not ship. +- If the losing segment is a regulated / strategic cohort (paying tier, top customers, EU), default to iterate — guardrails on the cohort, not just overall. + +--- + +## What NOT to do + +- ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition. +- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment `Run-Query` fallback — they're not corrected unless the platform did it. +- ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal. +- ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism. +- ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence). + +--- + +## Output shape + +1. **One-sentence segment-level summary** — homogeneous, heterogeneous, or Simpson's-suspicious. +2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered). +3. 
**What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's." +4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating). diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp/skills/experiment-results/references/segment-of-interest-selection.md new file mode 100644 index 0000000..ea9f22b --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-results/references/segment-of-interest-selection.md @@ -0,0 +1,116 @@ +# Segment-of-Interest Selection + +Open this when the user wants to break results down by user segments — _"slice this by platform"_, _"which segments should I look at?"_, _"are new users responding differently?"_. The goal is to pick 3–5 segments that are **likely to reveal a real effect difference**, before slicing every available dimension and ending up p-hacking. + +The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them. + +--- + +## Why this matters: the fishing-expedition problem + +If you slice an experiment by every available property (10 platforms × 20 countries × 5 plan tiers × …), you will find "significant" segment-level effects by chance alone. The family-wise false positive rate explodes the same way it does for too many primary metrics — except there's usually no platform-level correction across segments. **Pre-committing to a small set of segments, ordered by hypothesis-driven probability, is the discipline that makes segment analysis credible.** + +Aim for 3–5 segments, max. If the user wants more, ask which ones are connected to the hypothesis and which are exploration. Mark the exploration set as "hypothesis-generating, not decisional." 
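To make the inflation concrete, here is a minimal sketch of how the family-wise false positive rate grows with the number of segment-level comparisons. It assumes the comparisons are independent and that each one is tested at a per-comparison alpha of 0.05. Both are illustrative assumptions rather than values read from the platform, and `family_wise_fpr` is a hypothetical helper name.

```python
def family_wise_fpr(num_comparisons: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent comparisons.

    Assumes independence and a shared per-comparison alpha (both illustrative).
    """
    if num_comparisons < 0:
        raise ValueError("num_comparisons must be non-negative")
    return 1.0 - (1.0 - alpha) ** num_comparisons


if __name__ == "__main__":
    # 5 pre-committed segments vs. 35 segment values
    # (10 platforms + 20 countries + 5 plan tiers, sliced one dimension at a time).
    for m in (1, 5, 35):
        print(f"{m:>3} comparisons -> {family_wise_fpr(m):.1%} chance of a false positive")
```

Under these assumptions, five comparisons already carry roughly a 23% chance of at least one spurious "significant" segment, and 35 comparisons push it above 80%. Real segments overlap and correlate, so the exact figures will differ, but the direction of the problem is the same.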
+ +--- + +## The decision tree for picking segments + +Walk through these in order. The first match is the most defensible pick. + +### 1. Segments the hypothesis explicitly names + +If the experiment's `hypothesis` (or `description`) text mentions "new users", "mobile", "Pro tier", "EU customers" — those segments are pre-committed by the experiment design. Always include them. + +Look at: + +- `experiment.hypothesis` +- `experiment.description` +- The setup-side conversation, if present + +These are not exploratory; they're the variables the team committed to test. + +### 2. Segments where the mechanism is expected to matter + +The hypothesis names _what_ the change is and (ideally) _why_ it should work. The "why" tells you which user attributes plausibly moderate the effect: + +| Hypothesis mechanism | Segments likely to moderate the effect | +| ------------------------------------------------- | -------------------------------------------------- | +| "Reduces first-time friction in onboarding" | New vs returning; signup source; locale | +| "Improves discoverability of feature X" | Users who previously used X vs not; tenure | +| "Speeds up a slow flow" | Platform (mobile slower than web); connection type | +| "Lowers payment friction" | Plan tier; payment-method type; geography | +| "Replaces a confusing UI element" | New vs returning (returning users habituated) | +| "Surfaces a feature only relevant to power users" | Engagement-tier cohorts; tenure | +| "Localized copy / pricing change" | Country / language | + +If you can't articulate _why_ a segment should respond differently, it's not a hypothesis-driven slice. Demote it. + +### 3. Segments where the **denominator** plausibly differs + +Some properties don't change _behavior_ but change _who gets exposed_. Slicing on these helps catch changed-denominator artifacts before they're called a win. + +- Triggered vs untriggered cohorts (if the treatment only fires on certain pages). 
+- Platform / app version (the treatment may only ship on a subset of clients). +- Device class (mobile vs desktop) when the change is platform-specific. + +A 1000% lift in `Checkout Screen Viewed` overall usually disappears once you condition on "users who reached the checkout funnel" — that disappearance is the finding. + +### 4. Segments where SRM or baseline shift is suspected + +If overall SRM is borderline (or failing in one variant only), per-segment SRM can localize the bucketing bug to a specific platform / country / cohort. Examples: + +- iOS vs Android (often the SDK bucketing layer differs). +- Bot-suspicious countries (`bot_traffic` cause from health-check). +- A specific app version range that shipped a flag-evaluation change. + +This is diagnostic segmentation, not interpretation segmentation. Use it when Step 1 of the Decision Tree has already flagged trouble. + +### 5. Segments the platform de facto requires + +Some user dimensions are so foundational that any results report should mention them once: + +- **Platform** — web vs iOS vs Android. +- **New vs returning** — defined as first session within the experiment window vs before. +- **Geo region** — EU vs US vs APAC, when results meaningfully differ by regulatory or payment context. + +Don't include all three blindly — pick the one(s) most likely to vary given the change. + +--- + +## Sanity checks before committing to a slice + +For each segment you want to break down on: + +1. **Does each segment value have ~350+ exposed users per variant?** Below that floor, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment. +2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis. +3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. 
"Pro tier" boundaries changed mid-test) invalidate the comparison. +4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes — slicing on them is selection-bias, not stratification. + +--- + +## How many slices to commit to + +| Situation | Number of slices | +| ----------------------------------------------------------------- | ------------------------------- | +| Hypothesis-driven, well-powered, decisional | 3–5 segments, named upfront | +| Exploratory ("anything weird?"), flagged as hypothesis-generating | Up to ~10, with explicit caveat | +| Diagnostic (chasing a failing SRM or strange overall result) | Whatever helps localize the bug | + +If the user wants to "just look at everything", push back: pick the top 3–5 with reasoning, then offer a separate exploratory pass that won't be used for the ship decision. + +--- + +## The pre-commit ritual + +Before running the breakdowns, tell the user something like: + +> _"Based on the hypothesis (``), I'd slice by `` and `` because ``. I'm intentionally not slicing `` because they don't connect to the proposed mechanism — looking at every dimension makes false positives almost guaranteed. We can do an exploratory pass after, separately from the ship decision. Sound right?"_ + +Pre-commitment is what separates "segmentation analysis" from "fishing." + +--- + +## Then read the results + +Once the segment breakdown is in hand, switch to [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there. 
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md b/plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md new file mode 100644 index 0000000..88640f4 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md @@ -0,0 +1,109 @@ +# Session-Replay Analysis Guidance + +Open this when the user wants to use session replays to explain a quantitative experiment result — _"why is conversion down in treatment?"_, _"what are users actually doing in the treatment?"_, _"can replays explain the regression?"_. The goal is to turn a number into a behavior story. + +> **Tool boundary.** This skill provides the _interpretation_ guidance for replay analysis. The actual replay-fetching tool — pulling replay IDs for control vs treatment cohorts — lives on the platform side (a separate fetch tool exposed alongside `Get-Experiment`, when available). If the fetch tool isn't yet available, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. + +--- + +## When replays help, when they don't + +| Question | Replays help? | +| ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | +| "Why is conversion lower in treatment?" | Yes — behavior diff is observable. | +| "Why is `Checkout Screen Viewed` 10× higher in treatment?" (changed-denominator suspect) | Yes — replays show whether users are _bouncing_ or _converting_ after they get there. | +| "Why is `time_on_page` higher in treatment?" | Yes — distinguishes engaged reading vs confused dwell. | +| "Is the treatment shipping a regression on iOS only?" | Sometimes — better answered first by segment breakdown. | +| "Why is SRM failing?" | No — replays don't show bucketing. Go to health checks. 
| +| "What's the lift?" | No — replays are qualitative; they explain _why_, not what. | +| "Why hasn't this hit statsig yet?" | No — that's a sample/power question, not a behavior question. | + +A useful heuristic: replays answer _behavioral_ questions. If the question isn't behavioral, replays will burn time without adding signal. + +--- + +## Cohort selection: which replays to compare + +You're looking for **paired contrast**, not a random sample. Pick the cohort that maximizes signal for the specific question. + +| Question | Cohort A (replays to pull) | Cohort B (replays to pull) | +| -------------------------------------------------------------------- | ---------------------------------------------------------- | ----------------------------------------------------------- | +| Why is primary metric down in treatment? | Treatment users who **failed** the primary action | Control users who **succeeded** at the primary action | +| Why is a guardrail regression appearing? | Treatment users who **triggered** the guardrail negatively | Control users who did NOT trigger it | +| Why does treatment have a huge lift in `Screen Viewed` (denom shift) | Treatment users who reached the screen | Same users, looking at whether they completed the next step | +| Why is engagement higher / lower in a specific segment? | Treatment users in that segment | Control users in the same segment | +| What does the new UI look like in practice? | Any treatment users who saw the change | Any control users to confirm the baseline UI | + +**Aim for ~5 replays per cohort.** Fewer and you're anecdote-shopping; many more and you'll just confirm what the first 5 already showed. If the first 5 are inconclusive or contradictory, pull 5 more before changing tactics. + +Filter by recency — replays from the most recent days of the experiment best reflect steady-state behavior (avoid novelty / primacy noise). 
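The first row of the cohort table, combined with the "~5 per cohort" and recency guidance, could be sketched roughly as follows. The `Replay` record and its fields are hypothetical stand-ins for whatever a replay export actually contains, and `pick_contrast_pair` is an illustrative helper, not a platform function.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Replay:
    # Hypothetical record shape; a real replay export will have different fields.
    replay_id: str
    variant: str  # "control" or "treatment"
    recorded_at: datetime
    completed_primary_action: bool


def pick_contrast_pair(replays, per_cohort=5):
    """Return (treatment failures, control successes), most recent first."""
    newest_first = sorted(replays, key=lambda r: r.recorded_at, reverse=True)
    # Cohort A: treatment users who failed the primary action.
    cohort_a = [
        r for r in newest_first
        if r.variant == "treatment" and not r.completed_primary_action
    ]
    # Cohort B: control users who succeeded at the primary action.
    cohort_b = [
        r for r in newest_first
        if r.variant == "control" and r.completed_primary_action
    ]
    return cohort_a[:per_cohort], cohort_b[:per_cohort]
```

Sorting before filtering keeps the recency rule in one place, so the same helper can be reused with a different `per_cohort` when the first five replays are inconclusive and five more are needed.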
+ +--- + +## What to actually watch for + +Go in with a hypothesis from the quantitative result. Don't watch replays blank-eyed; you'll see "users using the app" and learn nothing. + +### Friction / failure patterns + +- **Hesitation** — long pause before clicking a key element (often signals confusion). +- **Misclicks** — clicking non-interactive elements, or rage-clicking a button that didn't work. +- **Form abandonment** — typing into a field, then leaving without submitting. +- **Back-button bounce** — landing on the page, then immediately backing out. +- **Scroll-and-leave** — scrolling without engaging, then exiting. + +If treatment has more of these than control, you have a behavior explanation for a primary loss or guardrail regression. + +### Layout / discoverability issues + +- **CTA below the fold** — users never scrolling to where the new button is. +- **Element overlap on mobile** — the treatment looks fine in desktop testing but breaks on small screens. +- **Hidden state** — a tooltip / modal that fires once and is then gone, so the user never sees the key affordance. + +These usually explain segment heterogeneity (loss concentrated in mobile, or in a specific viewport size). + +### Changed-denominator behavior + +If you're investigating a Twyman's-Law-sized lift, look for: + +- **Users landing on the new screen and immediately leaving** — explains the inflated `Viewed` event without explaining real conversion. +- **Users completing the rest of the funnel at a much lower rate per-arrival** — explains why the headline metric grew but downstream metrics didn't follow. + +If treatment users _arrive_ at a screen more often but _complete_ at a lower per-arrival rate, the "lift" is a denominator artifact and the per-converter behavior is the real story. + +### Variant-specific UI issues + +- **Treatment showed the wrong copy / wrong asset** — surprisingly common; treatment shipped, but to a subset of routes only. 
+- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to [health-check-interpretation.md](health-check-interpretation.md). +- **Treatment fired twice / persisted state across sessions** — implementation regression. + +--- + +## How to frame the findings + +Replay analysis is qualitative. Be honest about that. + +- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in `live_metrics`."_ +- ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict. + +Tie observations back to specific quantitative results from `Get-Experiment`. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior. + +--- + +## What NOT to do + +- ❌ Use replays to override a clear quantitative verdict. If primaries say "ship" and replays look ugly, the ugliness might be edge cases — confirm with segment analysis first. +- ❌ Cherry-pick a single dramatic replay. n=1 is anecdote. +- ❌ Replace segment analysis with replays. Replays explain _behavior_; segments explain _who_. Different questions. +- ❌ Pull replays from broad cohorts ("all treatment users") — the contrast pair is what reveals signal. +- ❌ Spend more time on replays than on the headline interpretation. The decision tree comes first; replays are the explanation step after it. + +--- + +## Output shape + +1. **The quantitative result the replays are explaining** — link back to the specific metric and verdict. +2. **Cohorts watched** — what filters were applied to A and B, how many replays in each. +3. **Patterns observed**, with counts (e.g. "4 of 5 treatment replays showed X; 0 of 5 control replays did"). +4. 
**The explanation hypothesis** — careful to frame as hypothesis ("consistent with"), not as proof. +5. **Recommended next action** — usually one of: ship anyway (regression edge-case), iterate (fix the friction), kill (treatment is materially worse), or run a follow-up A/B with the fix. diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md b/plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md new file mode 100644 index 0000000..fdad2cd --- /dev/null +++ b/plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md @@ -0,0 +1,115 @@ +# Why Hasn't This Reached Statistical Significance Yet? + +Open this when the user asks why a primary metric is still inconclusive — _"why isn't this stat-sig yet?"_, _"should I wait or stop?"_, _"is this just underpowered?"_. The goal is to help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null**, _without_ recomputing the platform's verdicts. + +The actual stop / extend math (sample size, power, MDE) is owned by the **setup-side skill** — point the user there for the formulas. This skill explains _which_ lever to pull, not how to recompute one. + +--- + +## First, rule out a broken result + +Inconclusive can mean two very different things: + +1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about. +2. **The result isn't trustworthy at all** — SRM failing, broken data, peeked frequentist, etc. — and "inconclusive" is the wrong frame entirely. + +Before answering "why no statsig?", run the trustworthiness gate (Step 1 of the Decision Tree). If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power. + +Also check: + +- `lift is None` on the primary → no measurement, not "no effect." 
+- The primary is in `metrics[]` but missing from `live_metrics` and `results_cache.metrics` → "no measurement." +- `live_results_errors` is non-null → results are stale or partial; resolve before drawing power conclusions. + +--- + +## The five real reasons an experiment hasn't hit statsig + +Walk through these in order. The first one that explains the picture is usually right. + +### 1. Not enough sample yet (not enough exposures) + +**What to look at**: `live_exposures` per variant vs `settings.sampleSize`; or elapsed days (`end_date - start_date`) vs `settings.endAfterDays`; plus `settings.testingModel`. + +- **Sequential** + target not reached → genuinely too early. Recommend **WAIT**. +- **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe. +- Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5. + +If exposures are falling short of plan because traffic dropped: surface that. `Run-Query` on the exposure event with a date breakdown shows whether something changed mid-experiment. + +### 2. Observed effect is smaller than the MDE + +**What to look at**: the lift on the primary in `live_metrics[primary][treatment].lift`, plus the MDE the user planned for (typically captured in the experiment's `description` or recovered via the setup-side skill's power math). + +- Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1. +- Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options: + - **Accept the null** — at this size, the change isn't moving the metric. Document and move on. + - **Resize and rerun** — if a smaller effect would still be ship-worthy, re-run with a larger sample (lower MDE).
+- Observed lift much **larger** than planned MDE but still not significant → unusual; likely high variance (see reason 3) or insufficient exposures (reason 1). + +### 3. Variance is too high (metric is too noisy) + +**What to look at**: distribution type of the metric, plus `settings.cuped.enabled` and `settings.winsorization.enabled`. + +- **Gaussian** metric (revenue, time-on-page) with no winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization (default percentile 95) on the next run. +- **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume. +- **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the absolute scale of detectable effects. Lifts near 50% rates are easiest; lifts near 0%/100% need much more sample. +- **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%. +- **CUPED enabled on a new-user-only cohort** → CUPED has no effect (no pre-exposure data exists). Not a misconfiguration to "fix," but variance reduction simply didn't happen. + +### 4. Traffic split is starving the variant + +**What to look at**: `settings.srm.targetAllocations` and `live_exposures` per variant. + +- Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue. +- Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later. +- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. 
Each non-control variant needs its own ~350+ sample for the per-comparison stats to be reliable. Adding arms costs power per-comparison. + +Never change traffic allocation mid-Frequentist test — it invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment. + +### 5. Exposure config is filtering more users than the user expects + +**What to look at**: the exposure tracking method (`$experiment_started` event volume), any audience filters on the backing feature flag, and `settings.excludeQA`. + +- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." `Get-Feature-Flag` reveals the rollout rules; `Run-Query` on `$experiment_started` confirms how many users actually got exposed. +- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with `Run-Query`. +- `settings.excludeQA` was off and you suspect internal traffic is dominating one variant → enable it on the next run (results then are cleaner but also smaller). + +**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md). + +--- + +## Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL? + +Once you know which reason fits, the recommendation almost picks itself. + +| Reason | Recommendation | +| -------------------------------------- | ------------------------------------------------------------------------------------------------------------ | +| Not enough sample yet, still ACTIVE | **WAIT.** Show projected end date based on observed traffic. 
| +| Not enough sample yet, concluded early | **EXTEND** (Frequentist: relaunch with longer planned duration; Sequential: resume if possible). | +| Effect << MDE | **ACCEPT NULL** if the planned MDE is the smallest ship-worthy effect; otherwise **BOOST POWER** and re-run. | +| Variance too high | **BOOST POWER**: enable CUPED, enable Winsorization, switch to a less noisy metric proxy. | +| Variant starved by traffic split | **EXTEND** (if remaining time is enough) or restart with rebalanced split. | +| Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. | +| Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. | + +When recommending EXTEND on an active experiment, the call is `Update-Experiment` with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the setup-side skill for the power math. + +--- + +## What NOT to suggest + +- ❌ **Stop early on a favorable peek** in a Frequentist test — that's exactly the false-positive inflation problem. +- ❌ **Switch testing model mid-experiment** — restart, don't morph. +- ❌ **Add more primary metrics** to "fish" for a win — multiplies the family-wise FPR. If a single primary is inconclusive, more primaries make the picture worse, not better. +- ❌ **Re-run identical hypothesis on the same audience right after concluding "no effect"** — without a power change, you'll get the same answer. +- ❌ **Claim "no effect"** from an underpowered inconclusive result — the right framing is "the experiment wasn't sized to detect the effect we observed." + +--- + +## Output shape + +1. **The reason** (one of the five above), in one sentence. +2. **The evidence from `Get-Experiment`** — which fields told you (e.g. 
"exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.). +3. **Recommendation** from the table above, with the specific `Update-Experiment` call or follow-up action. +4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment. From aa0a13c6f25f8fcd156428287aa986a7154b8fec Mon Sep 17 00:00:00 2001 From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Date: Fri, 5 Jun 2026 00:30:00 +0000 Subject: [PATCH 02/11] Remove MCP tool name references and delete eval fixtures MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Replace explicit MCP tool names (Get-Experiment, Update-Experiment, Run-Query, Get-Feature-Flag, Get-Experiment-Setup-Guidance) with agent-agnostic phrasing per the convention from #22. Skills describe actions ("request experiment details", "query the metric", "update the experiment") rather than specific tool calls. - Rename references/get-experiment-fields.md → experiment-fields.md so the filename doesn't echo a specific MCP tool name. - Drop evals/ directory — this repo doesn't run evals. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../skills/experiment-results/SKILL.md | 26 +++--- .../skills/experiment-results/evals/README.md | 34 -------- .../evals/confetti-8-metrics.yaml | 48 ----------- .../evals/pelando-plus-2-others.yaml | 79 ------------------- .../evals/polarsteps-no-workaround.yaml | 61 -------------- ...eriment-fields.md => experiment-fields.md} | 31 ++++---- .../references/health-check-interpretation.md | 14 ++-- .../references/per-metric-interpretation.md | 6 +- .../segment-breakdown-interpretation.md | 4 +- .../references/session-replay-analysis.md | 4 +- .../references/why-no-statsig.md | 12 +-- .../skills/experiment-results/SKILL.md | 26 +++--- .../skills/experiment-results/evals/README.md | 34 -------- .../evals/confetti-8-metrics.yaml | 48 ----------- .../evals/pelando-plus-2-others.yaml | 79 ------------------- .../evals/polarsteps-no-workaround.yaml | 61 -------------- ...eriment-fields.md => experiment-fields.md} | 31 ++++---- .../references/health-check-interpretation.md | 14 ++-- .../references/per-metric-interpretation.md | 6 +- .../segment-breakdown-interpretation.md | 4 +- .../references/session-replay-analysis.md | 4 +- .../references/why-no-statsig.md | 12 +-- .../skills/experiment-results/SKILL.md | 26 +++--- .../skills/experiment-results/evals/README.md | 34 -------- .../evals/confetti-8-metrics.yaml | 48 ----------- .../evals/pelando-plus-2-others.yaml | 79 ------------------- .../evals/polarsteps-no-workaround.yaml | 61 -------------- ...eriment-fields.md => experiment-fields.md} | 31 ++++---- .../references/health-check-interpretation.md | 14 ++-- .../references/per-metric-interpretation.md | 6 +- .../segment-breakdown-interpretation.md | 4 +- .../references/session-replay-analysis.md | 4 +- .../references/why-no-statsig.md | 12 +-- 33 files changed, 141 insertions(+), 816 deletions(-) delete mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/evals/README.md delete mode 100644 
plugins/mixpanel-mcp-eu/skills/experiment-results/evals/confetti-8-metrics.yaml delete mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/evals/pelando-plus-2-others.yaml delete mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/evals/polarsteps-no-workaround.yaml rename plugins/mixpanel-mcp-eu/skills/experiment-results/references/{get-experiment-fields.md => experiment-fields.md} (84%) delete mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/evals/README.md delete mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/evals/confetti-8-metrics.yaml delete mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/evals/pelando-plus-2-others.yaml delete mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/evals/polarsteps-no-workaround.yaml rename plugins/mixpanel-mcp-in/skills/experiment-results/references/{get-experiment-fields.md => experiment-fields.md} (84%) delete mode 100644 plugins/mixpanel-mcp/skills/experiment-results/evals/README.md delete mode 100644 plugins/mixpanel-mcp/skills/experiment-results/evals/confetti-8-metrics.yaml delete mode 100644 plugins/mixpanel-mcp/skills/experiment-results/evals/pelando-plus-2-others.yaml delete mode 100644 plugins/mixpanel-mcp/skills/experiment-results/evals/polarsteps-no-workaround.yaml rename plugins/mixpanel-mcp/skills/experiment-results/references/{get-experiment-fields.md => experiment-fields.md} (84%) diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md index 4e344d3..0164c56 100644 --- a/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md +++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md @@ -1,6 +1,6 @@ --- name: experiment-results -description: Interprets a Mixpanel experiment's results and health checks. 
Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts that `Get-Experiment` returns — never recomputes thresholds. +description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. license: Apache-2.0 --- @@ -10,8 +10,8 @@ You are helping a user read, interpret, or make a ship/iterate/kill/wait decisio ## Requirements -- Access to Mixpanel via the MCP server (specifically the `Get-Experiment` tool — and, for ship/kill decisions, `Update-Experiment`). -- This skill reads the verdicts that `Get-Experiment` already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values. +- Access to Mixpanel (read experiment details and metrics; update experiment lifecycle for ship/kill decisions). +- This skill reads the verdicts the platform's experiment-details response already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values. ## When to use this skill @@ -24,13 +24,13 @@ Trigger when the user asks anything about reading an experiment's results or its - "What does this Retro A/A failure mean?" - "Can you compare the session replays for control vs treatment?" 
-Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the setup-side skill or tool. +Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the `experiment-setup` skill. --- -## How to read `Get-Experiment` output +## How to read experiment-details output -Always call `Get-Experiment` with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.** +Always request experiment details with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.** | Concept | Live (preferred) | Cached fallback | | ---------------------------- | --------------------------------- | ------------------------------------------- | @@ -44,7 +44,7 @@ If `live_results_errors` is non-null, the live path failed. Use the cache, cavea If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect." -See [references/get-experiment-fields.md](references/get-experiment-fields.md) for the full field map and which fields drive each step below. +See [references/experiment-fields.md](references/experiment-fields.md) for the full field map and which fields drive each step below. --- @@ -111,7 +111,7 @@ Read these fields. Treat the platform's verdict as authoritative — do not reap | Retro A/A (pre-exp bias) | `settings.preExperimentBias` enabled, then the analysis | Platform flags a significant pre-period difference. | | Minimum elapsed time | `end_date - start_date` | Less than ~3 days regardless of sample size — interpretation is unreliable. 
| | Ran for planned duration | `start_date`, `end_date`, `settings.endAfterDays`/`sampleSize`/`endCondition`, `settings.testingModel` | Frequentist: ended before reaching configured target = peeking. Sequential: early stop on significance is allowed. | -| Misconfiguration | See [references/get-experiment-fields.md](references/get-experiment-fields.md) §Misconfig | Any flagged misconfig invalidates analysis. | +| Misconfiguration | See [references/experiment-fields.md](references/experiment-fields.md) §Misconfig | Any flagged misconfig invalidates analysis. | If any of these fail, **stop**. Tell the user explicitly that results are not trustworthy. Open [references/health-check-interpretation.md](references/health-check-interpretation.md) for the per-failure root-cause checklists, recommended actions, and the Kohavi framing ("SRM is the #1 trustworthiness check; Twyman's Law: any unusually clean result is more likely a bug than a discovery"). @@ -163,13 +163,13 @@ A 5% lift on a 20% baseline metric serving 1M users/week is enormous. A 5% lift **Twyman's Law check**: before celebrating any lift > ~30%, ask: did the treatment change who is _exposed_ to this metric, not just how they behave? See the changed-denominator notes in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). -If `value` or `sampleSize` is `null` (common when live computation timed out), call `Run-Query` on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total. +If `value` or `sampleSize` is `null` (common when live computation timed out), run a query on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total. 
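The baseline fallback described above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function names, the `weekly_users` parameter, and the assumption that `lift` is a relative lift expressed as a fraction are all hypothetical and not part of the experiment-details response.

```python
from typing import Optional


def control_baseline(aggregation: str, raw_value: float, exposures: int) -> Optional[float]:
    """Per-exposure baseline for the control variant (hypothetical helper).

    aggregation == "unique": raw_value is the count of converting users,
        so the result is a conversion rate.
    aggregation == "total": raw_value is the raw event total,
        so the result is a per-exposure average, never the raw total.
    """
    if exposures <= 0:
        # No exposures means no baseline; report it as missing rather than guess.
        return None
    if aggregation in ("unique", "total"):
        return raw_value / exposures
    raise ValueError(f"unsupported aggregation: {aggregation!r}")


def absolute_impact(baseline: float, relative_lift: float, weekly_users: int) -> float:
    """Translate a relative lift into extra outcomes per week (assumes lift is relative)."""
    return baseline * relative_lift * weekly_users


# Example from the text: a 5% lift on a 20% baseline serving 1M users/week.
extra_per_week = absolute_impact(0.20, 0.05, 1_000_000)
```

Dividing by exposures in both branches is deliberate: it keeps a `total` metric on a per-user scale, so multiplying by the lift does not double-count cohort size.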
### Step 5 — Verdict | Situation | Recommendation | | ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** `Update-Experiment(action="decide", success=true, variant=, message=)` | +| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Use the experiment's `decide` action with `success=true`, `variant=`, and a `message` rationale. | | Trust ✓, primary polarity positive, guardrail polarity negative | **ITERATE.** Investigate the regression; do not auto-ship. | | Trust ✓, primary polarity neutral after target sample reached | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md). | | Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)). | @@ -200,7 +200,7 @@ Once the spine is clear, the user often asks one of these follow-ups. Open the r | "Which segments should I break this down on?" | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | | "What does this segment-by-segment result mean?" (when platform support exists) | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | | "Can session replays help explain this result?" | [references/session-replay-analysis.md](references/session-replay-analysis.md) | -| "Which `Get-Experiment` field has X?" | [references/get-experiment-fields.md](references/get-experiment-fields.md) | +| "Which field in the experiment-details response has X?" 
| [references/experiment-fields.md](references/experiment-fields.md) | --- @@ -212,9 +212,9 @@ Default to this shape unless the user asks for something else: 2. **Why**, walking through the decision tree steps that mattered (skip the steps that were clearly fine). 3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, with the polarity-corrected reading of each. Include the absolute-impact translation for any win. 4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, etc. -5. **Suggested next action** — the `Update-Experiment` call to make, or the deeper investigation to run. +5. **Suggested next action** — the experiment-decide action to take, or the deeper investigation to run. -If `Get-Experiment` is unavailable or returns errors, say so — do not invent a verdict. +If experiment details are unavailable or return errors, say so — do not invent a verdict. --- diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/README.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/README.md deleted file mode 100644 index 71278d6..0000000 --- a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/README.md +++ /dev/null @@ -1,34 +0,0 @@ -# Eval fixtures — `experiment-results` - -Each fixture is a self-contained prompt + expected-behavior pair for the `experiment-results` skill. They are seeded from PRD customer quotes — the customer pain that motivated this skill in the first place. - -The fixtures are not auto-runnable yet (no harness lives in this repo). They're written for two uses: - -1. **Manual rehearsal** — a human (or another agent) can read the prompt, simulate the response the skill should produce, and check it against the `expected_behavior` field. -2. 
**Regression checkpoint when a runner exists** — when an eval harness is added in this repo, these prompts plug in directly: each YAML doc becomes one case, the `expected_behavior` field becomes the grader rubric. - -When you change `SKILL.md`, walk these fixtures and confirm each one still produces the expected behavior. If a fixture starts failing, decide whether the skill regressed or the fixture itself needs updating. - ---- - -## Fixtures - -| Fixture | PRD source quote | What it exercises | -| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------- | -| `pelando-plus-2-others.yaml` | Pelando — _"+2 others"_ (results too noisy for the user to triage which results to act on) | Decision tree spine + per-metric polarity; ship/iterate verdict against multi-variant noise. | -| `confetti-8-metrics.yaml` | Confetti — _"8 metrics for new visitors"_ (many primaries; user wants segment-of-interest selection on new vs returning) | Segment-of-interest selection; multiple-testing correction warning; per-metric interpretation. | -| `polarsteps-no-workaround.yaml` | Polarsteps — _"no documented workaround"_ (user wants to understand SRM failure with no canned path forward) | Health-check interpretation; Kohavi framing; ordered-causes recommendation. 
| - -Each YAML doc has the same shape: - -```yaml -name: -prd_source: -trigger_phrase: -get_experiment_summary: -expected_behavior: - verdict: - must_mention: [] - must_not_do: [] - references_consulted: [] -``` diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/confetti-8-metrics.yaml b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/confetti-8-metrics.yaml deleted file mode 100644 index da61d9e..0000000 --- a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/confetti-8-metrics.yaml +++ /dev/null @@ -1,48 +0,0 @@ -name: confetti-8-metrics -prd_source: | - Confetti — "8 metrics for new visitors" - Customer is running an experiment with 8 primary-ish metrics and explicitly - cares about new-visitor behavior. They want a segment-driven read, not a - dump of 8 lifts. The skill should pre-commit to segments tied to the - hypothesis (new vs returning), call out the multiple-testing concern with - 8 metrics, and produce a verdict scoped to the segment that matters. - -trigger_phrase: | - We're tracking 8 metrics on this onboarding redesign experiment and I really - care about how new visitors respond. Can you read this and tell me whether - it's a ship for the new-user audience? - -get_experiment_summary: - hypothesis: | - If we redesign the first-session onboarding flow, then activation rate - among NEW visitors will increase by ≥5% relative, because reducing - cold-start friction shortens time-to-first-value. - settings: - controlKey: "control" - multipleTestingCorrection: "off" # mis-configured given 8 primaries - testingModel: "sequential" - confidenceLevel: 0.95 - metrics_count: 8 - primary_metrics_summary: | - Of 8 primaries: 2 significant positive (polarity-correct), 1 significant - negative (a "Time to First Action" metric with direction=down where - lift is -7% — actually a WIN once polarity-applied), 5 inconclusive. 
- -expected_behavior: - verdict: WAIT - must_mention: - - "Pre-commit to the new-vs-returning segment because the hypothesis names new visitors as the cohort that matters" - - "Recommend at most 3–5 segments and call new vs returning the primary slice" - - "Multiple-testing correction is OFF but there are 8 primaries — flag the inflated family-wise FPR explicitly (rough order: with 8 primaries × 1 variant at α=0.05, family-wise FPR is high enough to make a single significant result inconclusive on its own)" - - "Apply polarity recipe per metric — flag the Time to First Action 'negative bucket' as a win because direction=down" - - "Without correction enabled, don't ship on a single significant primary; either enable correction and re-analyze or look at the aggregate of all 8" - - "Verdict is WAIT (re-analyze with multiple-testing correction enabled, segmented to new visitors) — not SHIP" - must_not_do: - - "Slice by every available property after the fact (the fishing-expedition warning)" - - "Treat the 'Time to First Action' metric in the negative bucket as a loss without polarity-correcting" - - "Call the experiment a ship because 2 of 8 primaries are significant positive" - - "Pretend the agent can compute the corrected p-values itself — instead, recommend re-running with multipleTestingCorrection enabled" - references_consulted: - - segment-of-interest-selection.md - - per-metric-interpretation.md - - health-check-interpretation.md # for the misconfig flag diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/pelando-plus-2-others.yaml b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/pelando-plus-2-others.yaml deleted file mode 100644 index f634236..0000000 --- a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/pelando-plus-2-others.yaml +++ /dev/null @@ -1,79 +0,0 @@ -name: pelando-plus-2-others -prd_source: | - Pelando — "+2 others" - Customer reported that when a multi-variant test concludes with a winner banner - plus a 
small-print "+2 others", they cannot tell which non-winner variants are - benign vs which contain a guardrail regression they need to act on. The skill - should pivot the summary per variant, polarity-correct each, and call out the - losers, not gloss over them. - -trigger_phrase: | - Can you make sense of this experiment for me? The UI shows treatment_a winning - on the primary plus "+2 others" but I have no idea whether treatment_b or - treatment_c are okay to ignore. - -get_experiment_summary: - settings: - controlKey: "control" - multipleTestingCorrection: "benjamini-hochberg" - testingModel: "sequential" - metrics: - - id: m_primary - type: primary - direction: up - name: "Activation Rate" - - id: m_guardrail_latency - type: guardrail - direction: down - name: "p95 Latency (ms)" - - id: m_guardrail_errors - type: guardrail - direction: down - name: "Error Rate" - live_exposures: - control: 41123 - treatment_a: 40987 - treatment_b: 41210 - treatment_c: 40755 - live_srm_analysis: - # platform-flagged passing - p_value: 0.42 - summary: - positive: - - { - metricId: m_primary, - variant: treatment_a, - lift: 0.041, - liftConfidence: 0.95, - } - - { - metricId: m_guardrail_latency, - variant: treatment_b, - lift: 0.08, - liftConfidence: 0.95, - } - negative: - - { - metricId: m_primary, - variant: treatment_c, - lift: -0.022, - liftConfidence: 0.95, - } - no: - - { metricId: m_primary, variant: treatment_b, lift: 0.004 } - -expected_behavior: - verdict: ITERATE - must_mention: - - "Pivot the summary by variant before declaring a winner" - - "treatment_a wins on the primary but treatment_b shows a latency regression once polarity is applied (direction=down + lift +8% = bad)" - - "treatment_c regresses the primary" - - "Multi-variant verdict requires each treatment to be judged independently against control" - - "Recommend iterate, not ship — at minimum, do not ship treatment_b, and investigate treatment_c before re-running" - must_not_do: - - "Quietly drop 
treatment_b and treatment_c into '+2 others' without polarity-checking each" - - "Trust the bucket name (positive/negative) as the business verdict" - - "Re-apply multiple-testing correction on top of the platform's benjamini-hochberg" - references_consulted: - - per-metric-interpretation.md - - get-experiment-fields.md diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/polarsteps-no-workaround.yaml b/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/polarsteps-no-workaround.yaml deleted file mode 100644 index 325a3bf..0000000 --- a/plugins/mixpanel-mcp-eu/skills/experiment-results/evals/polarsteps-no-workaround.yaml +++ /dev/null @@ -1,61 +0,0 @@ -name: polarsteps-no-workaround -prd_source: | - Polarsteps — "no documented workaround" - Customer's experiment is failing SRM and they cannot find a documented path - forward. The skill should consume the platform's SRM verdict (not recompute - chi-square), cite Kohavi ("SRM is the #1 trustworthiness check"), and - surface ordered likely causes plus a specific recommended action — not - punt with "investigate further." - -trigger_phrase: | - My experiment is failing SRM and the result lift looks too good to be true - (+18% on the primary). The docs just say "investigate" — what does that - actually mean here? Should I trust the lift? 
- -get_experiment_summary: - settings: - controlKey: "control" - srm: - enabled: true - targetAllocations: { control: 50, treatment: 50 } - excludeQA: false # potentially relevant - live_exposures: - control: 18250 - treatment: 22980 - live_srm_analysis: - # platform-flagged FAILING - p_value: 0.00002 - chi_square: 18.4 - summary: - positive: - - { - metricId: m_primary, - variant: treatment, - lift: 0.18, - liftConfidence: 0.95, - } - metrics: - - id: m_primary - type: primary - direction: up - name: "Trip Plan Created" - -expected_behavior: - verdict: DO_NOT_DECIDE - must_mention: - - "SRM is failing per the platform's verdict — do NOT trust the +18% lift" - - "Cite Kohavi: SRM is the #1 trustworthiness check; when SRM is failing, lift, p-values, and confidence intervals cannot be attributed to the treatment" - - "Twyman's Law: a +18% lift on a failing-SRM experiment is more likely a bucketing bug than a genuine win" - - "Likely causes ordered most → least likely: bucketing_bug, biased_assignment, bot_traffic, exposure_tracking_bug, ramp_up_timing" - - "Recommended action: pause_and_investigate — pause before drawing conclusions; randomization assumption is broken" - - "Concrete next steps: compare live_exposures to targetAllocations; check feature-flag rules and history via Get-Feature-Flag; Run-Query $experiment_started by variant; enable settings.excludeQA before relaunch given it is currently off" - - "Do NOT recompute the SRM chi-square — consume the platform's verdict" - - "Restart with fixed bucketing once the cause is found; do NOT re-conclude on the broken data" - must_not_do: - - "Calculate the chi-square or re-derive an SRM p-value threshold" - - "Recommend shipping or treating the +18% lift as real" - - "Hand the user a generic 'investigate further' without ordered causes and an action" - - "Skip Kohavi framing — it's the whole reason this check is the #1 gate" - references_consulted: - - health-check-interpretation.md - - get-experiment-fields.md 
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/get-experiment-fields.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/experiment-fields.md similarity index 84% rename from plugins/mixpanel-mcp-eu/skills/experiment-results/references/get-experiment-fields.md rename to plugins/mixpanel-mcp-eu/skills/experiment-results/references/experiment-fields.md index efaeae5..1e65de1 100644 --- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/get-experiment-fields.md +++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/experiment-fields.md @@ -1,6 +1,6 @@ -# `Get-Experiment` Field Map +# Experiment-Details Field Map -Quick reference for which `Get-Experiment` response field drives each interpretation. Always call with `compute_exposures=true, compute_metrics=true`. +Quick reference for which experiment-details response field drives each interpretation. Always request the details with `compute_exposures=true, compute_metrics=true`. This reference is **read-only domain knowledge** for the agent. It does NOT define thresholds — every "fail condition" listed below is a _characterization_ of how the platform itself already classifies the field, not a threshold this skill should re-apply. @@ -122,16 +122,13 @@ For a kill, pass `success=false`. ## Lifecycle hand-off +To ship/kill, update the experiment with the `decide` action and these fields: + ``` -Update-Experiment( - experiment_id=, - experiment={ - "action": "decide", - "success": true | false, - "variant": "", # required when success=true - "message": "" - } -) +action → "decide" +success → true | false +variant → "" # required when success=true +message → "" ``` `message` is required on every `decide` call. 
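The field rules for the `decide` action can be checked before the update is sent. The sketch below only assembles and validates the fields listed above; the helper name `build_decide_payload` is hypothetical, and how the payload is actually submitted to the platform is not shown here.

```python
from typing import Any, Dict, Optional


def build_decide_payload(success: bool, message: str,
                         variant: Optional[str] = None) -> Dict[str, Any]:
    """Assemble the fields for a `decide` update (hypothetical helper).

    Enforces the two rules stated in the field map:
    - `message` is required on every decide call.
    - `variant` is required when success is true (a ship).
    """
    if not message or not message.strip():
        raise ValueError("message is required on every decide call")
    if success and not variant:
        raise ValueError("variant is required when success=true")

    payload: Dict[str, Any] = {
        "action": "decide",
        "success": success,
        "message": message,
    }
    if variant:
        payload["variant"] = variant
    return payload


# A ship names the winning variant; a kill passes success=False and needs no variant.
ship = build_decide_payload(True, "Primary up, guardrails clean", variant="treatment_a")
kill = build_decide_payload(False, "Primary neutral after target sample reached")
```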
@@ -152,10 +149,10 @@ For _how_ to react to each of these, see [health-check-interpretation.md](health --- -## When to reach for sibling tools +## When to reach for sibling capabilities -- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the setup-side skill / `Get-Experiment-Setup-Guidance`. -- **Raw data for triggered or segmentation analysis** → `Run-Query` on the metric with appropriate filters. -- **Acting on the recommendation** (ship, kill, extend) → `Update-Experiment` with the appropriate action. -- **Feature-flag rollout history** for SRM root cause → `Get-Feature-Flag`. -- **Session replays** for behavioral explanation of a quantitative result → the replay-fetch tool (see [session-replay-analysis.md](session-replay-analysis.md)). +- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the `experiment-setup` skill. +- **Raw data for triggered or segmentation analysis** → run a query on the metric with appropriate filters. +- **Acting on the recommendation** (ship, kill, extend) → update the experiment with the appropriate action. +- **Feature-flag rollout history** for SRM root cause → inspect the linked flag's state. +- **Session replays** for behavioral explanation of a quantitative result → see [session-replay-analysis.md](session-replay-analysis.md). diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md index 4471219..9ec66df 100644 --- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md @@ -44,9 +44,9 @@ Users were assigned to variants in proportions that disagree with the configured ### Investigation checklist 1. 
Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented? -2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Use `Get-Feature-Flag` to inspect rollout rules and history. +2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history. 3. For multi-variant tests, the platform's SRM threshold is Bonferroni-corrected — the effective per-variant threshold may be tighter than the headline. Trust the bucket flag, not raw p-value math. -4. Verify SDK version and bucketing logic. `Run-Query` for `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly. +4. Verify SDK version and bucketing logic. Query `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly. 5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it. 6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting. 7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.** @@ -68,7 +68,7 @@ The same statistical comparison run on the **pre-exposure** period revealed that 1. Identify which metric × variant pair triggered the failure (after the platform's correction). 2. Check whether bucketing was deterministic — non-deterministic assignment in the pre-period means users were assigned to different variants than they would have been in production. -3. Look for cohort skew: did one variant disproportionately receive heavy users? `Run-Query` on the metric pre-experiment grouped by variant to confirm. +3. Look for cohort skew: did one variant disproportionately receive heavy users? 
Query the metric pre-experiment grouped by variant to confirm. 4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort. 5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing. @@ -81,9 +81,9 @@ The same statistical comparison run on the **pre-exposure** period revealed that ### Investigation checklist 1. Check `live_exposures` totals — which variant is undersampled? -2. Inspect feature-flag rollout: `Get-Feature-Flag` → was rollout dialed back? -3. `Run-Query` for the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?). -4. If the experiment is still ACTIVE: extend duration via `Update-Experiment` with `endAfterDays`. +2. Inspect feature-flag rollout — was rollout dialed back? +3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?). +4. If the experiment is still ACTIVE: extend duration via an experiment update with a new `endAfterDays`. 5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math. If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question. @@ -115,7 +115,7 @@ A frequentist test that ends before reaching its configured target has an **infl ### Investigation checklist -1. Try `Get-Experiment` again — transient backend load may resolve. Wait ~30s between retries. +1. Retry the experiment-details request — transient backend load may resolve. Wait ~30s between retries. 2. If repeated failures: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. 3. 
Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision. 4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation. diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md index 3b44385..1e8678c 100644 --- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md @@ -2,7 +2,7 @@ Open this when the user wants you to translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_ -**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from `Get-Experiment`. Then translate. +**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from the experiment-details response. Then translate. --- @@ -88,7 +88,7 @@ Statistical significance ≠ business impact. Always convert a win into absolute Common — happens whenever live computation timed out or `results_cache.metrics` was nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.** -Call `Run-Query` on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. 
Match the metric's aggregation: +Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation: - `unique` (Bernoulli) → conversion **rate** as the baseline. - `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. Multiplying lift by a raw total double-counts cohort size. @@ -165,7 +165,7 @@ Check `settings.testingModel`: - `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration. - `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended. -Calling `Update-Experiment(action="conclude")` on a Frequentist experiment that hasn't reached its target is a peeking event. Flag it in the verdict. +Concluding a Frequentist experiment before it reaches its target is a peeking event. Flag it in the verdict. --- diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md index 6877d2a..fcf9cfd 100644 --- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md @@ -2,7 +2,7 @@ Open this when the user has per-segment results in hand and wants to read them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. 
-> **Platform support status.** Reading segment-level experiment results in `Get-Experiment` depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment `Run-Query` calls against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If `Get-Experiment` doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the `Run-Query` fallback — do not invent per-segment significance verdicts. +> **Platform support status.** Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts. --- @@ -80,7 +80,7 @@ This is the everyday case of mixed effects. ## What NOT to do - ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition. -- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment `Run-Query` fallback — they're not corrected unless the platform did it. +- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment query fallback — they're not corrected unless the platform did it. - ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal. - ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism. 
- ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence). diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md index 88640f4..b758b8e 100644 --- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md +++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md @@ -2,7 +2,7 @@ Open this when the user wants to use session replays to explain a quantitative experiment result — _"why is conversion down in treatment?"_, _"what are users actually doing in the treatment?"_, _"can replays explain the regression?"_. The goal is to turn a number into a behavior story. -> **Tool boundary.** This skill provides the _interpretation_ guidance for replay analysis. The actual replay-fetching tool — pulling replay IDs for control vs treatment cohorts — lives on the platform side (a separate fetch tool exposed alongside `Get-Experiment`, when available). If the fetch tool isn't yet available, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. +> **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. --- @@ -86,7 +86,7 @@ Replay analysis is qualitative. Be honest about that. - ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. 
This is consistent with the conversion drop in `live_metrics`."_ - ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict. -Tie observations back to specific quantitative results from `Get-Experiment`. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior. +Tie observations back to specific quantitative results from the experiment-details response. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior. --- diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md index fdad2cd..142089c 100644 --- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md +++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md @@ -35,7 +35,7 @@ Walk through these in order. The first one that explains the picture is usually - **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe. - Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5. -If exposures are falling short of plan because traffic dropped: surface that. `Run-Query` on the exposure event with a date breakdown shows whether something changed mid-experiment. +If exposures are falling short of plan because traffic dropped: surface that. Querying the exposure event with a date breakdown shows whether something changed mid-experiment. ### 2. 
Observed effect is smaller than the MDE @@ -71,8 +71,8 @@ Never change traffic allocation mid-Frequentist test — it invalidates the SRM **What to look at**: the exposure tracking method (`$experiment_started` event volume), any audience filters on the backing feature flag, and `settings.excludeQA`. -- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." `Get-Feature-Flag` reveals the rollout rules; `Run-Query` on `$experiment_started` confirms how many users actually got exposed. -- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with `Run-Query`. +- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query `$experiment_started` to confirm how many users actually got exposed. +- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event. - `settings.excludeQA` was off and you suspect internal traffic is dominating one variant → enable it on the next run (results then are cleaner but also smaller). **Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md). @@ -93,7 +93,7 @@ Once you know which reason fits, the recommendation almost picks itself. | Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. 
| | Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. | -When recommending EXTEND on an active experiment, the call is `Update-Experiment` with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the setup-side skill for the power math. +When recommending EXTEND on an active experiment, the action is an experiment update with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the `experiment-setup` skill for the power math. --- @@ -110,6 +110,6 @@ When recommending EXTEND on an active experiment, the call is `Update-Experiment ## Output shape 1. **The reason** (one of the five above), in one sentence. -2. **The evidence from `Get-Experiment`** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.). -3. **Recommendation** from the table above, with the specific `Update-Experiment` call or follow-up action. +2. **The evidence from the experiment-details response** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.). +3. **Recommendation** from the table above, with the specific experiment update or follow-up action. 4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment. 
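The triggered / dilution note in why-no-statsig.md can be made concrete with a small worked sketch. It assumes a conversion-style metric; when no baselines are given it further assumes triggered and untriggered users convert at the same baseline rate. The function and parameter names are illustrative only and are not platform fields.

```python
def diluted_lift(triggered_lift, trigger_fraction,
                 triggered_baseline=None, untriggered_baseline=None):
    """Relative lift across ALL exposed users when only a fraction saw the change.

    triggered_lift:   relative lift among users who actually saw the treatment (0.10 = +10%)
    trigger_fraction: share of exposed users who reached the changed screen (0..1)
    Baselines are optional conversion rates for the two groups (hypothetical inputs).
    """
    if not 0.0 <= trigger_fraction <= 1.0:
        raise ValueError("trigger_fraction must be between 0 and 1")

    # Assumption: equal baselines in both groups, so dilution is a simple product.
    if triggered_baseline is None or untriggered_baseline is None:
        return trigger_fraction * triggered_lift

    # General case: weight the gain by how much the triggered group
    # contributes to the overall control conversion rate.
    gain = trigger_fraction * triggered_baseline * triggered_lift
    control_rate = (trigger_fraction * triggered_baseline
                    + (1.0 - trigger_fraction) * untriggered_baseline)
    if control_rate == 0:
        raise ValueError("overall control rate is zero; relative lift is undefined")
    return gain / control_rate
```

Under the equal-baseline assumption, a 10% lift among the 20% of users who were triggered shows up as roughly a 2% lift at the population level, which is one way an experiment sized for a 5% MDE can end up looking inconclusive.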
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md index 4e344d3..0164c56 100644 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md @@ -1,6 +1,6 @@ --- name: experiment-results -description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts that `Get-Experiment` returns — never recomputes thresholds. +description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. license: Apache-2.0 --- @@ -10,8 +10,8 @@ You are helping a user read, interpret, or make a ship/iterate/kill/wait decisio ## Requirements -- Access to Mixpanel via the MCP server (specifically the `Get-Experiment` tool — and, for ship/kill decisions, `Update-Experiment`). -- This skill reads the verdicts that `Get-Experiment` already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values. +- Access to Mixpanel (read experiment details and metrics; update experiment lifecycle for ship/kill decisions). +- This skill reads the verdicts the platform's experiment-details response already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). 
If a field is missing, say so — do not synthesize a verdict from raw values. ## When to use this skill @@ -24,13 +24,13 @@ Trigger when the user asks anything about reading an experiment's results or its - "What does this Retro A/A failure mean?" - "Can you compare the session replays for control vs treatment?" -Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the setup-side skill or tool. +Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the `experiment-setup` skill. --- -## How to read `Get-Experiment` output +## How to read experiment-details output -Always call `Get-Experiment` with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.** +Always request experiment details with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.** | Concept | Live (preferred) | Cached fallback | | ---------------------------- | --------------------------------- | ------------------------------------------- | @@ -44,7 +44,7 @@ If `live_results_errors` is non-null, the live path failed. Use the cache, cavea If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect." -See [references/get-experiment-fields.md](references/get-experiment-fields.md) for the full field map and which fields drive each step below. +See [references/experiment-fields.md](references/experiment-fields.md) for the full field map and which fields drive each step below. --- @@ -111,7 +111,7 @@ Read these fields. 
Treat the platform's verdict as authoritative — do not reap | Retro A/A (pre-exp bias) | `settings.preExperimentBias` enabled, then the analysis | Platform flags a significant pre-period difference. | | Minimum elapsed time | `end_date - start_date` | Less than ~3 days regardless of sample size — interpretation is unreliable. | | Ran for planned duration | `start_date`, `end_date`, `settings.endAfterDays`/`sampleSize`/`endCondition`, `settings.testingModel` | Frequentist: ended before reaching configured target = peeking. Sequential: early stop on significance is allowed. | -| Misconfiguration | See [references/get-experiment-fields.md](references/get-experiment-fields.md) §Misconfig | Any flagged misconfig invalidates analysis. | +| Misconfiguration | See [references/experiment-fields.md](references/experiment-fields.md) §Misconfig | Any flagged misconfig invalidates analysis. | If any of these fail, **stop**. Tell the user explicitly that results are not trustworthy. Open [references/health-check-interpretation.md](references/health-check-interpretation.md) for the per-failure root-cause checklists, recommended actions, and the Kohavi framing ("SRM is the #1 trustworthiness check; Twyman's Law: any unusually clean result is more likely a bug than a discovery"). @@ -163,13 +163,13 @@ A 5% lift on a 20% baseline metric serving 1M users/week is enormous. A 5% lift **Twyman's Law check**: before celebrating any lift > ~30%, ask: did the treatment change who is _exposed_ to this metric, not just how they behave? See the changed-denominator notes in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). -If `value` or `sampleSize` is `null` (common when live computation timed out), call `Run-Query` on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total. 
+If `value` or `sampleSize` is `null` (common when live computation timed out), run a query on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total. ### Step 5 — Verdict | Situation | Recommendation | | ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** `Update-Experiment(action="decide", success=true, variant=, message=)` | +| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Use the experiment's `decide` action with `success=true`, `variant=`, and a `message` rationale. | | Trust ✓, primary polarity positive, guardrail polarity negative | **ITERATE.** Investigate the regression; do not auto-ship. | | Trust ✓, primary polarity neutral after target sample reached | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md). | | Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)). | @@ -200,7 +200,7 @@ Once the spine is clear, the user often asks one of these follow-ups. Open the r | "Which segments should I break this down on?" | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | | "What does this segment-by-segment result mean?" (when platform support exists) | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | | "Can session replays help explain this result?" 
| [references/session-replay-analysis.md](references/session-replay-analysis.md) | -| "Which `Get-Experiment` field has X?" | [references/get-experiment-fields.md](references/get-experiment-fields.md) | +| "Which field in the experiment-details response has X?" | [references/experiment-fields.md](references/experiment-fields.md) | --- @@ -212,9 +212,9 @@ Default to this shape unless the user asks for something else: 2. **Why**, walking through the decision tree steps that mattered (skip the steps that were clearly fine). 3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, with the polarity-corrected reading of each. Include the absolute-impact translation for any win. 4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, etc. -5. **Suggested next action** — the `Update-Experiment` call to make, or the deeper investigation to run. +5. **Suggested next action** — the experiment-decide action to take, or the deeper investigation to run. -If `Get-Experiment` is unavailable or returns errors, say so — do not invent a verdict. +If experiment details are unavailable or return errors, say so — do not invent a verdict. --- diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/README.md b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/README.md deleted file mode 100644 index 71278d6..0000000 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/README.md +++ /dev/null @@ -1,34 +0,0 @@ -# Eval fixtures — `experiment-results` - -Each fixture is a self-contained prompt + expected-behavior pair for the `experiment-results` skill. They are seeded from PRD customer quotes — the customer pain that motivated this skill in the first place. - -The fixtures are not auto-runnable yet (no harness lives in this repo). They're written for two uses: - -1. 
**Manual rehearsal** — a human (or another agent) can read the prompt, simulate the response the skill should produce, and check it against the `expected_behavior` field. -2. **Regression checkpoint when a runner exists** — when an eval harness is added in this repo, these prompts plug in directly: each YAML doc becomes one case, the `expected_behavior` field becomes the grader rubric. - -When you change `SKILL.md`, walk these fixtures and confirm each one still produces the expected behavior. If a fixture starts failing, decide whether the skill regressed or the fixture itself needs updating. - ---- - -## Fixtures - -| Fixture | PRD source quote | What it exercises | -| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------- | -| `pelando-plus-2-others.yaml` | Pelando — _"+2 others"_ (results too noisy for the user to triage which results to act on) | Decision tree spine + per-metric polarity; ship/iterate verdict against multi-variant noise. | -| `confetti-8-metrics.yaml` | Confetti — _"8 metrics for new visitors"_ (many primaries; user wants segment-of-interest selection on new vs returning) | Segment-of-interest selection; multiple-testing correction warning; per-metric interpretation. | -| `polarsteps-no-workaround.yaml` | Polarsteps — _"no documented workaround"_ (user wants to understand SRM failure with no canned path forward) | Health-check interpretation; Kohavi framing; ordered-causes recommendation. 
| - -Each YAML doc has the same shape: - -```yaml -name: -prd_source: -trigger_phrase: -get_experiment_summary: -expected_behavior: - verdict: - must_mention: [] - must_not_do: [] - references_consulted: [] -``` diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/confetti-8-metrics.yaml b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/confetti-8-metrics.yaml deleted file mode 100644 index da61d9e..0000000 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/confetti-8-metrics.yaml +++ /dev/null @@ -1,48 +0,0 @@ -name: confetti-8-metrics -prd_source: | - Confetti — "8 metrics for new visitors" - Customer is running an experiment with 8 primary-ish metrics and explicitly - cares about new-visitor behavior. They want a segment-driven read, not a - dump of 8 lifts. The skill should pre-commit to segments tied to the - hypothesis (new vs returning), call out the multiple-testing concern with - 8 metrics, and produce a verdict scoped to the segment that matters. - -trigger_phrase: | - We're tracking 8 metrics on this onboarding redesign experiment and I really - care about how new visitors respond. Can you read this and tell me whether - it's a ship for the new-user audience? - -get_experiment_summary: - hypothesis: | - If we redesign the first-session onboarding flow, then activation rate - among NEW visitors will increase by ≥5% relative, because reducing - cold-start friction shortens time-to-first-value. - settings: - controlKey: "control" - multipleTestingCorrection: "off" # mis-configured given 8 primaries - testingModel: "sequential" - confidenceLevel: 0.95 - metrics_count: 8 - primary_metrics_summary: | - Of 8 primaries: 2 significant positive (polarity-correct), 1 significant - negative (a "Time to First Action" metric with direction=down where - lift is -7% — actually a WIN once polarity-applied), 5 inconclusive. 
- -expected_behavior: - verdict: WAIT - must_mention: - - "Pre-commit to the new-vs-returning segment because the hypothesis names new visitors as the cohort that matters" - - "Recommend at most 3–5 segments and call new vs returning the primary slice" - - "Multiple-testing correction is OFF but there are 8 primaries — flag the inflated family-wise FPR explicitly (rough order: with 8 primaries × 1 variant at α=0.05, family-wise FPR is high enough to make a single significant result inconclusive on its own)" - - "Apply polarity recipe per metric — flag the Time to First Action 'negative bucket' as a win because direction=down" - - "Without correction enabled, don't ship on a single significant primary; either enable correction and re-analyze or look at the aggregate of all 8" - - "Verdict is WAIT (re-analyze with multiple-testing correction enabled, segmented to new visitors) — not SHIP" - must_not_do: - - "Slice by every available property after the fact (the fishing-expedition warning)" - - "Treat the 'Time to First Action' metric in the negative bucket as a loss without polarity-correcting" - - "Call the experiment a ship because 2 of 8 primaries are significant positive" - - "Pretend the agent can compute the corrected p-values itself — instead, recommend re-running with multipleTestingCorrection enabled" - references_consulted: - - segment-of-interest-selection.md - - per-metric-interpretation.md - - health-check-interpretation.md # for the misconfig flag diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/pelando-plus-2-others.yaml b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/pelando-plus-2-others.yaml deleted file mode 100644 index f634236..0000000 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/pelando-plus-2-others.yaml +++ /dev/null @@ -1,79 +0,0 @@ -name: pelando-plus-2-others -prd_source: | - Pelando — "+2 others" - Customer reported that when a multi-variant test concludes with a winner banner - plus a 
small-print "+2 others", they cannot tell which non-winner variants are - benign vs which contain a guardrail regression they need to act on. The skill - should pivot the summary per variant, polarity-correct each, and call out the - losers, not gloss over them. - -trigger_phrase: | - Can you make sense of this experiment for me? The UI shows treatment_a winning - on the primary plus "+2 others" but I have no idea whether treatment_b or - treatment_c are okay to ignore. - -get_experiment_summary: - settings: - controlKey: "control" - multipleTestingCorrection: "benjamini-hochberg" - testingModel: "sequential" - metrics: - - id: m_primary - type: primary - direction: up - name: "Activation Rate" - - id: m_guardrail_latency - type: guardrail - direction: down - name: "p95 Latency (ms)" - - id: m_guardrail_errors - type: guardrail - direction: down - name: "Error Rate" - live_exposures: - control: 41123 - treatment_a: 40987 - treatment_b: 41210 - treatment_c: 40755 - live_srm_analysis: - # platform-flagged passing - p_value: 0.42 - summary: - positive: - - { - metricId: m_primary, - variant: treatment_a, - lift: 0.041, - liftConfidence: 0.95, - } - - { - metricId: m_guardrail_latency, - variant: treatment_b, - lift: 0.08, - liftConfidence: 0.95, - } - negative: - - { - metricId: m_primary, - variant: treatment_c, - lift: -0.022, - liftConfidence: 0.95, - } - no: - - { metricId: m_primary, variant: treatment_b, lift: 0.004 } - -expected_behavior: - verdict: ITERATE - must_mention: - - "Pivot the summary by variant before declaring a winner" - - "treatment_a wins on the primary but treatment_b shows a latency regression once polarity is applied (direction=down + lift +8% = bad)" - - "treatment_c regresses the primary" - - "Multi-variant verdict requires each treatment to be judged independently against control" - - "Recommend iterate, not ship — at minimum, do not ship treatment_b, and investigate treatment_c before re-running" - must_not_do: - - "Quietly drop 
treatment_b and treatment_c into '+2 others' without polarity-checking each" - - "Trust the bucket name (positive/negative) as the business verdict" - - "Re-apply multiple-testing correction on top of the platform's benjamini-hochberg" - references_consulted: - - per-metric-interpretation.md - - get-experiment-fields.md diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/polarsteps-no-workaround.yaml b/plugins/mixpanel-mcp-in/skills/experiment-results/evals/polarsteps-no-workaround.yaml deleted file mode 100644 index 325a3bf..0000000 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/evals/polarsteps-no-workaround.yaml +++ /dev/null @@ -1,61 +0,0 @@ -name: polarsteps-no-workaround -prd_source: | - Polarsteps — "no documented workaround" - Customer's experiment is failing SRM and they cannot find a documented path - forward. The skill should consume the platform's SRM verdict (not recompute - chi-square), cite Kohavi ("SRM is the #1 trustworthiness check"), and - surface ordered likely causes plus a specific recommended action — not - punt with "investigate further." - -trigger_phrase: | - My experiment is failing SRM and the result lift looks too good to be true - (+18% on the primary). The docs just say "investigate" — what does that - actually mean here? Should I trust the lift? 
- -get_experiment_summary: - settings: - controlKey: "control" - srm: - enabled: true - targetAllocations: { control: 50, treatment: 50 } - excludeQA: false # potentially relevant - live_exposures: - control: 18250 - treatment: 22980 - live_srm_analysis: - # platform-flagged FAILING - p_value: 0.00002 - chi_square: 18.4 - summary: - positive: - - { - metricId: m_primary, - variant: treatment, - lift: 0.18, - liftConfidence: 0.95, - } - metrics: - - id: m_primary - type: primary - direction: up - name: "Trip Plan Created" - -expected_behavior: - verdict: DO_NOT_DECIDE - must_mention: - - "SRM is failing per the platform's verdict — do NOT trust the +18% lift" - - "Cite Kohavi: SRM is the #1 trustworthiness check; when SRM is failing, lift, p-values, and confidence intervals cannot be attributed to the treatment" - - "Twyman's Law: a +18% lift on a failing-SRM experiment is more likely a bucketing bug than a genuine win" - - "Likely causes ordered most → least likely: bucketing_bug, biased_assignment, bot_traffic, exposure_tracking_bug, ramp_up_timing" - - "Recommended action: pause_and_investigate — pause before drawing conclusions; randomization assumption is broken" - - "Concrete next steps: compare live_exposures to targetAllocations; check feature-flag rules and history via Get-Feature-Flag; Run-Query $experiment_started by variant; enable settings.excludeQA before relaunch given it is currently off" - - "Do NOT recompute the SRM chi-square — consume the platform's verdict" - - "Restart with fixed bucketing once the cause is found; do NOT re-conclude on the broken data" - must_not_do: - - "Calculate the chi-square or re-derive an SRM p-value threshold" - - "Recommend shipping or treating the +18% lift as real" - - "Hand the user a generic 'investigate further' without ordered causes and an action" - - "Skip Kohavi framing — it's the whole reason this check is the #1 gate" - references_consulted: - - health-check-interpretation.md - - get-experiment-fields.md 
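The polarity rule that the fixtures above exercise ("direction=down + lift +8% = bad", and a negative-bucket metric with direction=down counting as a win) can be sketched as a small lookup. This is a minimal illustration, assuming the bucket names match the `summary` keys shown in the fixtures (`positive`, `negative`, `no`); the returned labels are invented for this sketch and are not platform values.

```python
def business_reading(bucket, direction):
    """Translate a platform summary bucket into a polarity-corrected reading.

    bucket:    "positive" | "negative" | "no"  (which way the metric moved, if significant)
    direction: "up" | "down"                   (which way the metric is SUPPOSED to move)
    Returns a hypothetical label: "win", "regression", or "inconclusive".
    """
    if bucket not in ("positive", "negative", "no"):
        raise ValueError(f"unknown bucket: {bucket!r}")
    if direction not in ("up", "down"):
        raise ValueError(f"unknown direction: {direction!r}")

    # No significant movement: the bucket carries no business verdict either way.
    if bucket == "no":
        return "inconclusive"

    moved_up = bucket == "positive"
    wants_up = direction == "up"
    # A win is movement in the desired direction; anything else is a regression.
    return "win" if moved_up == wants_up else "regression"
```

The sketch only restates the recipe; it consumes the bucket the platform already assigned and does not recompute significance.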
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/get-experiment-fields.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/experiment-fields.md similarity index 84% rename from plugins/mixpanel-mcp-in/skills/experiment-results/references/get-experiment-fields.md rename to plugins/mixpanel-mcp-in/skills/experiment-results/references/experiment-fields.md index efaeae5..1e65de1 100644 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/get-experiment-fields.md +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/experiment-fields.md @@ -1,6 +1,6 @@ -# `Get-Experiment` Field Map +# Experiment-Details Field Map -Quick reference for which `Get-Experiment` response field drives each interpretation. Always call with `compute_exposures=true, compute_metrics=true`. +Quick reference for which experiment-details response field drives each interpretation. Always request the details with `compute_exposures=true, compute_metrics=true`. This reference is **read-only domain knowledge** for the agent. It does NOT define thresholds — every "fail condition" listed below is a _characterization_ of how the platform itself already classifies the field, not a threshold this skill should re-apply. @@ -122,16 +122,13 @@ For a kill, pass `success=false`. ## Lifecycle hand-off +To ship/kill, update the experiment with the `decide` action and these fields: + ``` -Update-Experiment( - experiment_id=, - experiment={ - "action": "decide", - "success": true | false, - "variant": "", # required when success=true - "message": "" - } -) +action → "decide" +success → true | false +variant → "" # required when success=true +message → "" ``` `message` is required on every `decide` call. 
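The constraints on the `decide` fields listed above can be checked before the update is sent. The following is a hedged sketch: `build_decide_fields` is a hypothetical helper, not part of any platform SDK, and it only assembles and validates the field set described in this section. Whether `variant` is accepted on a kill is not stated here, so the sketch simply passes it through when the caller supplies one.

```python
def build_decide_fields(success, message, variant=None):
    """Assemble the fields for a `decide` action (hypothetical helper).

    success: True to ship, False to kill
    message: rationale; required on every decide call
    variant: winning variant key; required when success is True
    """
    if not isinstance(message, str) or not message.strip():
        raise ValueError("message is required on every decide call")
    if success and not variant:
        raise ValueError("variant is required when success=true")

    fields = {
        "action": "decide",
        "success": bool(success),
        "message": message.strip(),
    }
    # Assumption: only include variant when one was given (always the case for a ship).
    if variant:
        fields["variant"] = variant
    return fields
```

Validating locally keeps a malformed ship/kill request from reaching the platform, and the rationale in `message` should come from the verdict already reached, not be written after the fact.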
@@ -152,10 +149,10 @@ For _how_ to react to each of these, see [health-check-interpretation.md](health --- -## When to reach for sibling tools +## When to reach for sibling capabilities -- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the setup-side skill / `Get-Experiment-Setup-Guidance`. -- **Raw data for triggered or segmentation analysis** → `Run-Query` on the metric with appropriate filters. -- **Acting on the recommendation** (ship, kill, extend) → `Update-Experiment` with the appropriate action. -- **Feature-flag rollout history** for SRM root cause → `Get-Feature-Flag`. -- **Session replays** for behavioral explanation of a quantitative result → the replay-fetch tool (see [session-replay-analysis.md](session-replay-analysis.md)). +- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the `experiment-setup` skill. +- **Raw data for triggered or segmentation analysis** → run a query on the metric with appropriate filters. +- **Acting on the recommendation** (ship, kill, extend) → update the experiment with the appropriate action. +- **Feature-flag rollout history** for SRM root cause → inspect the linked flag's state. +- **Session replays** for behavioral explanation of a quantitative result → see [session-replay-analysis.md](session-replay-analysis.md). diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md index 4471219..9ec66df 100644 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md @@ -44,9 +44,9 @@ Users were assigned to variants in proportions that disagree with the configured ### Investigation checklist 1. 
Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented? -2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Use `Get-Feature-Flag` to inspect rollout rules and history. +2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history. 3. For multi-variant tests, the platform's SRM threshold is Bonferroni-corrected — the effective per-variant threshold may be tighter than the headline. Trust the bucket flag, not raw p-value math. -4. Verify SDK version and bucketing logic. `Run-Query` for `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly. +4. Verify SDK version and bucketing logic. Query `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly. 5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it. 6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting. 7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.** @@ -68,7 +68,7 @@ The same statistical comparison run on the **pre-exposure** period revealed that 1. Identify which metric × variant pair triggered the failure (after the platform's correction). 2. Check whether bucketing was deterministic — non-deterministic assignment in the pre-period means users were assigned to different variants than they would have been in production. -3. Look for cohort skew: did one variant disproportionately receive heavy users? `Run-Query` on the metric pre-experiment grouped by variant to confirm. +3. Look for cohort skew: did one variant disproportionately receive heavy users? 
Query the metric pre-experiment grouped by variant to confirm. 4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort. 5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing. @@ -81,9 +81,9 @@ The same statistical comparison run on the **pre-exposure** period revealed that ### Investigation checklist 1. Check `live_exposures` totals — which variant is undersampled? -2. Inspect feature-flag rollout: `Get-Feature-Flag` → was rollout dialed back? -3. `Run-Query` for the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?). -4. If the experiment is still ACTIVE: extend duration via `Update-Experiment` with `endAfterDays`. +2. Inspect feature-flag rollout — was rollout dialed back? +3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?). +4. If the experiment is still ACTIVE: extend duration via an experiment update with a new `endAfterDays`. 5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math. If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question. @@ -115,7 +115,7 @@ A frequentist test that ends before reaching its configured target has an **infl ### Investigation checklist -1. Try `Get-Experiment` again — transient backend load may resolve. Wait ~30s between retries. +1. Retry the experiment-details request — transient backend load may resolve. Wait ~30s between retries. 2. If repeated failures: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. 3. 
Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision. 4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation. diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md index 3b44385..1e8678c 100644 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md @@ -2,7 +2,7 @@ Open this when the user wants you to translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_ -**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from `Get-Experiment`. Then translate. +**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from the experiment-details response. Then translate. --- @@ -88,7 +88,7 @@ Statistical significance ≠ business impact. Always convert a win into absolute Common — happens whenever live computation timed out or `results_cache.metrics` was nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.** -Call `Run-Query` on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. 
Match the metric's aggregation: +Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation: - `unique` (Bernoulli) → conversion **rate** as the baseline. - `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. Multiplying lift by a raw total double-counts cohort size. @@ -165,7 +165,7 @@ Check `settings.testingModel`: - `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration. - `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended. -Calling `Update-Experiment(action="conclude")` on a Frequentist experiment that hasn't reached its target is a peeking event. Flag it in the verdict. +Concluding a Frequentist experiment before it reaches its target is a peeking event. Flag it in the verdict. --- diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md index 6877d2a..fcf9cfd 100644 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md @@ -2,7 +2,7 @@ Open this when the user has per-segment results in hand and wants to read them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. 
-> **Platform support status.** Reading segment-level experiment results in `Get-Experiment` depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment `Run-Query` calls against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If `Get-Experiment` doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the `Run-Query` fallback — do not invent per-segment significance verdicts. +> **Platform support status.** Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts. --- @@ -80,7 +80,7 @@ This is the everyday case of mixed effects. ## What NOT to do - ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition. -- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment `Run-Query` fallback — they're not corrected unless the platform did it. +- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment query fallback — they're not corrected unless the platform did it. - ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal. - ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism. 
- ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence). diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md index 88640f4..b758b8e 100644 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md @@ -2,7 +2,7 @@ Open this when the user wants to use session replays to explain a quantitative experiment result — _"why is conversion down in treatment?"_, _"what are users actually doing in the treatment?"_, _"can replays explain the regression?"_. The goal is to turn a number into a behavior story. -> **Tool boundary.** This skill provides the _interpretation_ guidance for replay analysis. The actual replay-fetching tool — pulling replay IDs for control vs treatment cohorts — lives on the platform side (a separate fetch tool exposed alongside `Get-Experiment`, when available). If the fetch tool isn't yet available, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. +> **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. --- @@ -86,7 +86,7 @@ Replay analysis is qualitative. Be honest about that. - ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. 
This is consistent with the conversion drop in `live_metrics`."_ - ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict. -Tie observations back to specific quantitative results from `Get-Experiment`. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior. +Tie observations back to specific quantitative results from the experiment-details response. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior. --- diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md index fdad2cd..142089c 100644 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md @@ -35,7 +35,7 @@ Walk through these in order. The first one that explains the picture is usually - **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe. - Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5. -If exposures are falling short of plan because traffic dropped: surface that. `Run-Query` on the exposure event with a date breakdown shows whether something changed mid-experiment. +If exposures are falling short of plan because traffic dropped: surface that. Querying the exposure event with a date breakdown shows whether something changed mid-experiment. ### 2. 
Observed effect is smaller than the MDE @@ -71,8 +71,8 @@ Never change traffic allocation mid-Frequentist test — it invalidates the SRM **What to look at**: the exposure tracking method (`$experiment_started` event volume), any audience filters on the backing feature flag, and `settings.excludeQA`. -- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." `Get-Feature-Flag` reveals the rollout rules; `Run-Query` on `$experiment_started` confirms how many users actually got exposed. -- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with `Run-Query`. +- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query `$experiment_started` to confirm how many users actually got exposed. +- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event. - `settings.excludeQA` was off and you suspect internal traffic is dominating one variant → enable it on the next run (results then are cleaner but also smaller). **Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md). @@ -93,7 +93,7 @@ Once you know which reason fits, the recommendation almost picks itself. | Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. 
| | Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. | -When recommending EXTEND on an active experiment, the call is `Update-Experiment` with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the setup-side skill for the power math. +When recommending EXTEND on an active experiment, the action is an experiment update with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the `experiment-setup` skill for the power math. --- @@ -110,6 +110,6 @@ When recommending EXTEND on an active experiment, the call is `Update-Experiment ## Output shape 1. **The reason** (one of the five above), in one sentence. -2. **The evidence from `Get-Experiment`** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.). -3. **Recommendation** from the table above, with the specific `Update-Experiment` call or follow-up action. +2. **The evidence from the experiment-details response** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.). +3. **Recommendation** from the table above, with the specific experiment update or follow-up action. 4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment. diff --git a/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md index 4e344d3..0164c56 100644 --- a/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md +++ b/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md @@ -1,6 +1,6 @@ --- name: experiment-results -description: Interprets a Mixpanel experiment's results and health checks. 
Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts that `Get-Experiment` returns — never recomputes thresholds. +description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. license: Apache-2.0 --- @@ -10,8 +10,8 @@ You are helping a user read, interpret, or make a ship/iterate/kill/wait decisio ## Requirements -- Access to Mixpanel via the MCP server (specifically the `Get-Experiment` tool — and, for ship/kill decisions, `Update-Experiment`). -- This skill reads the verdicts that `Get-Experiment` already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values. +- Access to Mixpanel (read experiment details and metrics; update experiment lifecycle for ship/kill decisions). +- This skill reads the verdicts the platform's experiment-details response already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values. ## When to use this skill @@ -24,13 +24,13 @@ Trigger when the user asks anything about reading an experiment's results or its - "What does this Retro A/A failure mean?" - "Can you compare the session replays for control vs treatment?" 
-Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the setup-side skill or tool. +Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the `experiment-setup` skill. --- -## How to read `Get-Experiment` output +## How to read experiment-details output -Always call `Get-Experiment` with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.** +Always request experiment details with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.** | Concept | Live (preferred) | Cached fallback | | ---------------------------- | --------------------------------- | ------------------------------------------- | @@ -44,7 +44,7 @@ If `live_results_errors` is non-null, the live path failed. Use the cache, cavea If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect." -See [references/get-experiment-fields.md](references/get-experiment-fields.md) for the full field map and which fields drive each step below. +See [references/experiment-fields.md](references/experiment-fields.md) for the full field map and which fields drive each step below. --- @@ -111,7 +111,7 @@ Read these fields. Treat the platform's verdict as authoritative — do not reap | Retro A/A (pre-exp bias) | `settings.preExperimentBias` enabled, then the analysis | Platform flags a significant pre-period difference. | | Minimum elapsed time | `end_date - start_date` | Less than ~3 days regardless of sample size — interpretation is unreliable. 
| | Ran for planned duration | `start_date`, `end_date`, `settings.endAfterDays`/`sampleSize`/`endCondition`, `settings.testingModel` | Frequentist: ended before reaching configured target = peeking. Sequential: early stop on significance is allowed. | -| Misconfiguration | See [references/get-experiment-fields.md](references/get-experiment-fields.md) §Misconfig | Any flagged misconfig invalidates analysis. | +| Misconfiguration | See [references/experiment-fields.md](references/experiment-fields.md) §Misconfig | Any flagged misconfig invalidates analysis. | If any of these fail, **stop**. Tell the user explicitly that results are not trustworthy. Open [references/health-check-interpretation.md](references/health-check-interpretation.md) for the per-failure root-cause checklists, recommended actions, and the Kohavi framing ("SRM is the #1 trustworthiness check; Twyman's Law: any unusually clean result is more likely a bug than a discovery"). @@ -163,13 +163,13 @@ A 5% lift on a 20% baseline metric serving 1M users/week is enormous. A 5% lift **Twyman's Law check**: before celebrating any lift > ~30%, ask: did the treatment change who is _exposed_ to this metric, not just how they behave? See the changed-denominator notes in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). -If `value` or `sampleSize` is `null` (common when live computation timed out), call `Run-Query` on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total. +If `value` or `sampleSize` is `null` (common when live computation timed out), run a query on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total. 
### Step 5 — Verdict | Situation | Recommendation | | ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** `Update-Experiment(action="decide", success=true, variant=, message=)` | +| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Use the experiment's `decide` action with `success=true`, `variant=`, and a `message` rationale. | | Trust ✓, primary polarity positive, guardrail polarity negative | **ITERATE.** Investigate the regression; do not auto-ship. | | Trust ✓, primary polarity neutral after target sample reached | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md). | | Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)). | @@ -200,7 +200,7 @@ Once the spine is clear, the user often asks one of these follow-ups. Open the r | "Which segments should I break this down on?" | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | | "What does this segment-by-segment result mean?" (when platform support exists) | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | | "Can session replays help explain this result?" | [references/session-replay-analysis.md](references/session-replay-analysis.md) | -| "Which `Get-Experiment` field has X?" | [references/get-experiment-fields.md](references/get-experiment-fields.md) | +| "Which field in the experiment-details response has X?" 
| [references/experiment-fields.md](references/experiment-fields.md) | --- @@ -212,9 +212,9 @@ Default to this shape unless the user asks for something else: 2. **Why**, walking through the decision tree steps that mattered (skip the steps that were clearly fine). 3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, with the polarity-corrected reading of each. Include the absolute-impact translation for any win. 4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, etc. -5. **Suggested next action** — the `Update-Experiment` call to make, or the deeper investigation to run. +5. **Suggested next action** — the experiment-decide action to take, or the deeper investigation to run. -If `Get-Experiment` is unavailable or returns errors, say so — do not invent a verdict. +If experiment details are unavailable or return errors, say so — do not invent a verdict. --- diff --git a/plugins/mixpanel-mcp/skills/experiment-results/evals/README.md b/plugins/mixpanel-mcp/skills/experiment-results/evals/README.md deleted file mode 100644 index 71278d6..0000000 --- a/plugins/mixpanel-mcp/skills/experiment-results/evals/README.md +++ /dev/null @@ -1,34 +0,0 @@ -# Eval fixtures — `experiment-results` - -Each fixture is a self-contained prompt + expected-behavior pair for the `experiment-results` skill. They are seeded from PRD customer quotes — the customer pain that motivated this skill in the first place. - -The fixtures are not auto-runnable yet (no harness lives in this repo). They're written for two uses: - -1. **Manual rehearsal** — a human (or another agent) can read the prompt, simulate the response the skill should produce, and check it against the `expected_behavior` field. -2. 
**Regression checkpoint when a runner exists** — when an eval harness is added in this repo, these prompts plug in directly: each YAML doc becomes one case, the `expected_behavior` field becomes the grader rubric. - -When you change `SKILL.md`, walk these fixtures and confirm each one still produces the expected behavior. If a fixture starts failing, decide whether the skill regressed or the fixture itself needs updating. - ---- - -## Fixtures - -| Fixture | PRD source quote | What it exercises | -| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------- | -| `pelando-plus-2-others.yaml` | Pelando — _"+2 others"_ (results too noisy for the user to triage which results to act on) | Decision tree spine + per-metric polarity; ship/iterate verdict against multi-variant noise. | -| `confetti-8-metrics.yaml` | Confetti — _"8 metrics for new visitors"_ (many primaries; user wants segment-of-interest selection on new vs returning) | Segment-of-interest selection; multiple-testing correction warning; per-metric interpretation. | -| `polarsteps-no-workaround.yaml` | Polarsteps — _"no documented workaround"_ (user wants to understand SRM failure with no canned path forward) | Health-check interpretation; Kohavi framing; ordered-causes recommendation. 
| - -Each YAML doc has the same shape: - -```yaml -name: -prd_source: -trigger_phrase: -get_experiment_summary: -expected_behavior: - verdict: - must_mention: [] - must_not_do: [] - references_consulted: [] -``` diff --git a/plugins/mixpanel-mcp/skills/experiment-results/evals/confetti-8-metrics.yaml b/plugins/mixpanel-mcp/skills/experiment-results/evals/confetti-8-metrics.yaml deleted file mode 100644 index da61d9e..0000000 --- a/plugins/mixpanel-mcp/skills/experiment-results/evals/confetti-8-metrics.yaml +++ /dev/null @@ -1,48 +0,0 @@ -name: confetti-8-metrics -prd_source: | - Confetti — "8 metrics for new visitors" - Customer is running an experiment with 8 primary-ish metrics and explicitly - cares about new-visitor behavior. They want a segment-driven read, not a - dump of 8 lifts. The skill should pre-commit to segments tied to the - hypothesis (new vs returning), call out the multiple-testing concern with - 8 metrics, and produce a verdict scoped to the segment that matters. - -trigger_phrase: | - We're tracking 8 metrics on this onboarding redesign experiment and I really - care about how new visitors respond. Can you read this and tell me whether - it's a ship for the new-user audience? - -get_experiment_summary: - hypothesis: | - If we redesign the first-session onboarding flow, then activation rate - among NEW visitors will increase by ≥5% relative, because reducing - cold-start friction shortens time-to-first-value. - settings: - controlKey: "control" - multipleTestingCorrection: "off" # mis-configured given 8 primaries - testingModel: "sequential" - confidenceLevel: 0.95 - metrics_count: 8 - primary_metrics_summary: | - Of 8 primaries: 2 significant positive (polarity-correct), 1 significant - negative (a "Time to First Action" metric with direction=down where - lift is -7% — actually a WIN once polarity-applied), 5 inconclusive. 
- -expected_behavior: - verdict: WAIT - must_mention: - - "Pre-commit to the new-vs-returning segment because the hypothesis names new visitors as the cohort that matters" - - "Recommend at most 3–5 segments and call new vs returning the primary slice" - - "Multiple-testing correction is OFF but there are 8 primaries — flag the inflated family-wise FPR explicitly (rough order: with 8 primaries × 1 variant at α=0.05, family-wise FPR is high enough to make a single significant result inconclusive on its own)" - - "Apply polarity recipe per metric — flag the Time to First Action 'negative bucket' as a win because direction=down" - - "Without correction enabled, don't ship on a single significant primary; either enable correction and re-analyze or look at the aggregate of all 8" - - "Verdict is WAIT (re-analyze with multiple-testing correction enabled, segmented to new visitors) — not SHIP" - must_not_do: - - "Slice by every available property after the fact (the fishing-expedition warning)" - - "Treat the 'Time to First Action' metric in the negative bucket as a loss without polarity-correcting" - - "Call the experiment a ship because 2 of 8 primaries are significant positive" - - "Pretend the agent can compute the corrected p-values itself — instead, recommend re-running with multipleTestingCorrection enabled" - references_consulted: - - segment-of-interest-selection.md - - per-metric-interpretation.md - - health-check-interpretation.md # for the misconfig flag diff --git a/plugins/mixpanel-mcp/skills/experiment-results/evals/pelando-plus-2-others.yaml b/plugins/mixpanel-mcp/skills/experiment-results/evals/pelando-plus-2-others.yaml deleted file mode 100644 index f634236..0000000 --- a/plugins/mixpanel-mcp/skills/experiment-results/evals/pelando-plus-2-others.yaml +++ /dev/null @@ -1,79 +0,0 @@ -name: pelando-plus-2-others -prd_source: | - Pelando — "+2 others" - Customer reported that when a multi-variant test concludes with a winner banner - plus a small-print "+2 
others", they cannot tell which non-winner variants are - benign vs which contain a guardrail regression they need to act on. The skill - should pivot the summary per variant, polarity-correct each, and call out the - losers, not gloss over them. - -trigger_phrase: | - Can you make sense of this experiment for me? The UI shows treatment_a winning - on the primary plus "+2 others" but I have no idea whether treatment_b or - treatment_c are okay to ignore. - -get_experiment_summary: - settings: - controlKey: "control" - multipleTestingCorrection: "benjamini-hochberg" - testingModel: "sequential" - metrics: - - id: m_primary - type: primary - direction: up - name: "Activation Rate" - - id: m_guardrail_latency - type: guardrail - direction: down - name: "p95 Latency (ms)" - - id: m_guardrail_errors - type: guardrail - direction: down - name: "Error Rate" - live_exposures: - control: 41123 - treatment_a: 40987 - treatment_b: 41210 - treatment_c: 40755 - live_srm_analysis: - # platform-flagged passing - p_value: 0.42 - summary: - positive: - - { - metricId: m_primary, - variant: treatment_a, - lift: 0.041, - liftConfidence: 0.95, - } - - { - metricId: m_guardrail_latency, - variant: treatment_b, - lift: 0.08, - liftConfidence: 0.95, - } - negative: - - { - metricId: m_primary, - variant: treatment_c, - lift: -0.022, - liftConfidence: 0.95, - } - no: - - { metricId: m_primary, variant: treatment_b, lift: 0.004 } - -expected_behavior: - verdict: ITERATE - must_mention: - - "Pivot the summary by variant before declaring a winner" - - "treatment_a wins on the primary but treatment_b shows a latency regression once polarity is applied (direction=down + lift +8% = bad)" - - "treatment_c regresses the primary" - - "Multi-variant verdict requires each treatment to be judged independently against control" - - "Recommend iterate, not ship — at minimum, do not ship treatment_b, and investigate treatment_c before re-running" - must_not_do: - - "Quietly drop treatment_b and 
treatment_c into '+2 others' without polarity-checking each" - - "Trust the bucket name (positive/negative) as the business verdict" - - "Re-apply multiple-testing correction on top of the platform's benjamini-hochberg" - references_consulted: - - per-metric-interpretation.md - - get-experiment-fields.md diff --git a/plugins/mixpanel-mcp/skills/experiment-results/evals/polarsteps-no-workaround.yaml b/plugins/mixpanel-mcp/skills/experiment-results/evals/polarsteps-no-workaround.yaml deleted file mode 100644 index 325a3bf..0000000 --- a/plugins/mixpanel-mcp/skills/experiment-results/evals/polarsteps-no-workaround.yaml +++ /dev/null @@ -1,61 +0,0 @@ -name: polarsteps-no-workaround -prd_source: | - Polarsteps — "no documented workaround" - Customer's experiment is failing SRM and they cannot find a documented path - forward. The skill should consume the platform's SRM verdict (not recompute - chi-square), cite Kohavi ("SRM is the #1 trustworthiness check"), and - surface ordered likely causes plus a specific recommended action — not - punt with "investigate further." - -trigger_phrase: | - My experiment is failing SRM and the result lift looks too good to be true - (+18% on the primary). The docs just say "investigate" — what does that - actually mean here? Should I trust the lift? 
- -get_experiment_summary: - settings: - controlKey: "control" - srm: - enabled: true - targetAllocations: { control: 50, treatment: 50 } - excludeQA: false # potentially relevant - live_exposures: - control: 18250 - treatment: 22980 - live_srm_analysis: - # platform-flagged FAILING - p_value: 0.00002 - chi_square: 18.4 - summary: - positive: - - { - metricId: m_primary, - variant: treatment, - lift: 0.18, - liftConfidence: 0.95, - } - metrics: - - id: m_primary - type: primary - direction: up - name: "Trip Plan Created" - -expected_behavior: - verdict: DO_NOT_DECIDE - must_mention: - - "SRM is failing per the platform's verdict — do NOT trust the +18% lift" - - "Cite Kohavi: SRM is the #1 trustworthiness check; when SRM is failing, lift, p-values, and confidence intervals cannot be attributed to the treatment" - - "Twyman's Law: a +18% lift on a failing-SRM experiment is more likely a bucketing bug than a genuine win" - - "Likely causes ordered most → least likely: bucketing_bug, biased_assignment, bot_traffic, exposure_tracking_bug, ramp_up_timing" - - "Recommended action: pause_and_investigate — pause before drawing conclusions; randomization assumption is broken" - - "Concrete next steps: compare live_exposures to targetAllocations; check feature-flag rules and history via Get-Feature-Flag; Run-Query $experiment_started by variant; enable settings.excludeQA before relaunch given it is currently off" - - "Do NOT recompute the SRM chi-square — consume the platform's verdict" - - "Restart with fixed bucketing once the cause is found; do NOT re-conclude on the broken data" - must_not_do: - - "Calculate the chi-square or re-derive an SRM p-value threshold" - - "Recommend shipping or treating the +18% lift as real" - - "Hand the user a generic 'investigate further' without ordered causes and an action" - - "Skip Kohavi framing — it's the whole reason this check is the #1 gate" - references_consulted: - - health-check-interpretation.md - - get-experiment-fields.md 
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/get-experiment-fields.md b/plugins/mixpanel-mcp/skills/experiment-results/references/experiment-fields.md similarity index 84% rename from plugins/mixpanel-mcp/skills/experiment-results/references/get-experiment-fields.md rename to plugins/mixpanel-mcp/skills/experiment-results/references/experiment-fields.md index efaeae5..1e65de1 100644 --- a/plugins/mixpanel-mcp/skills/experiment-results/references/get-experiment-fields.md +++ b/plugins/mixpanel-mcp/skills/experiment-results/references/experiment-fields.md @@ -1,6 +1,6 @@ -# `Get-Experiment` Field Map +# Experiment-Details Field Map -Quick reference for which `Get-Experiment` response field drives each interpretation. Always call with `compute_exposures=true, compute_metrics=true`. +Quick reference for which experiment-details response field drives each interpretation. Always request the details with `compute_exposures=true, compute_metrics=true`. This reference is **read-only domain knowledge** for the agent. It does NOT define thresholds — every "fail condition" listed below is a _characterization_ of how the platform itself already classifies the field, not a threshold this skill should re-apply. @@ -122,16 +122,13 @@ For a kill, pass `success=false`. ## Lifecycle hand-off +To ship/kill, update the experiment with the `decide` action and these fields: + ``` -Update-Experiment( - experiment_id=, - experiment={ - "action": "decide", - "success": true | false, - "variant": "", # required when success=true - "message": "" - } -) +action → "decide" +success → true | false +variant → "" # required when success=true +message → "" ``` `message` is required on every `decide` call. 
@@ -152,10 +149,10 @@ For _how_ to react to each of these, see [health-check-interpretation.md](health --- -## When to reach for sibling tools +## When to reach for sibling capabilities -- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the setup-side skill / `Get-Experiment-Setup-Guidance`. -- **Raw data for triggered or segmentation analysis** → `Run-Query` on the metric with appropriate filters. -- **Acting on the recommendation** (ship, kill, extend) → `Update-Experiment` with the appropriate action. -- **Feature-flag rollout history** for SRM root cause → `Get-Feature-Flag`. -- **Session replays** for behavioral explanation of a quantitative result → the replay-fetch tool (see [session-replay-analysis.md](session-replay-analysis.md)). +- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the `experiment-setup` skill. +- **Raw data for triggered or segmentation analysis** → run a query on the metric with appropriate filters. +- **Acting on the recommendation** (ship, kill, extend) → update the experiment with the appropriate action. +- **Feature-flag rollout history** for SRM root cause → inspect the linked flag's state. +- **Session replays** for behavioral explanation of a quantitative result → see [session-replay-analysis.md](session-replay-analysis.md). diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md b/plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md index 4471219..9ec66df 100644 --- a/plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md @@ -44,9 +44,9 @@ Users were assigned to variants in proportions that disagree with the configured ### Investigation checklist 1. 
Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented? -2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Use `Get-Feature-Flag` to inspect rollout rules and history. +2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history. 3. For multi-variant tests, the platform's SRM threshold is Bonferroni-corrected — the effective per-variant threshold may be tighter than the headline. Trust the bucket flag, not raw p-value math. -4. Verify SDK version and bucketing logic. `Run-Query` for `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly. +4. Verify SDK version and bucketing logic. Query `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly. 5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it. 6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting. 7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.** @@ -68,7 +68,7 @@ The same statistical comparison run on the **pre-exposure** period revealed that 1. Identify which metric × variant pair triggered the failure (after the platform's correction). 2. Check whether bucketing was deterministic — non-deterministic assignment in the pre-period means users were assigned to different variants than they would have been in production. -3. Look for cohort skew: did one variant disproportionately receive heavy users? `Run-Query` on the metric pre-experiment grouped by variant to confirm. +3. Look for cohort skew: did one variant disproportionately receive heavy users? 
Query the metric pre-experiment grouped by variant to confirm. 4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort. 5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing. @@ -81,9 +81,9 @@ The same statistical comparison run on the **pre-exposure** period revealed that ### Investigation checklist 1. Check `live_exposures` totals — which variant is undersampled? -2. Inspect feature-flag rollout: `Get-Feature-Flag` → was rollout dialed back? -3. `Run-Query` for the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?). -4. If the experiment is still ACTIVE: extend duration via `Update-Experiment` with `endAfterDays`. +2. Inspect feature-flag rollout — was rollout dialed back? +3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?). +4. If the experiment is still ACTIVE: extend duration via an experiment update with a new `endAfterDays`. 5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math. If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question. @@ -115,7 +115,7 @@ A frequentist test that ends before reaching its configured target has an **infl ### Investigation checklist -1. Try `Get-Experiment` again — transient backend load may resolve. Wait ~30s between retries. +1. Retry the experiment-details request — transient backend load may resolve. Wait ~30s between retries. 2. If repeated failures: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. 3. 
Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision. 4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation. diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md b/plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md index 3b44385..1e8678c 100644 --- a/plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md @@ -2,7 +2,7 @@ Open this when the user wants you to translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_ -**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from `Get-Experiment`. Then translate. +**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from the experiment-details response. Then translate. --- @@ -88,7 +88,7 @@ Statistical significance ≠ business impact. Always convert a win into absolute Common — happens whenever live computation timed out or `results_cache.metrics` was nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.** -Call `Run-Query` on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. 
Match the metric's aggregation: +Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation: - `unique` (Bernoulli) → conversion **rate** as the baseline. - `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. Multiplying lift by a raw total double-counts cohort size. @@ -165,7 +165,7 @@ Check `settings.testingModel`: - `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration. - `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended. -Calling `Update-Experiment(action="conclude")` on a Frequentist experiment that hasn't reached its target is a peeking event. Flag it in the verdict. +Concluding a Frequentist experiment before it reaches its target is a peeking event. Flag it in the verdict. --- diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md index 6877d2a..fcf9cfd 100644 --- a/plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md +++ b/plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md @@ -2,7 +2,7 @@ Open this when the user has per-segment results in hand and wants to read them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. 
-> **Platform support status.** Reading segment-level experiment results in `Get-Experiment` depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment `Run-Query` calls against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If `Get-Experiment` doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the `Run-Query` fallback — do not invent per-segment significance verdicts. +> **Platform support status.** Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts. --- @@ -80,7 +80,7 @@ This is the everyday case of mixed effects. ## What NOT to do - ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition. -- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment `Run-Query` fallback — they're not corrected unless the platform did it. +- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment query fallback — they're not corrected unless the platform did it. - ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal. - ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism. 
- ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence). diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md b/plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md index 88640f4..b758b8e 100644 --- a/plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md +++ b/plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md @@ -2,7 +2,7 @@ Open this when the user wants to use session replays to explain a quantitative experiment result — _"why is conversion down in treatment?"_, _"what are users actually doing in the treatment?"_, _"can replays explain the regression?"_. The goal is to turn a number into a behavior story. -> **Tool boundary.** This skill provides the _interpretation_ guidance for replay analysis. The actual replay-fetching tool — pulling replay IDs for control vs treatment cohorts — lives on the platform side (a separate fetch tool exposed alongside `Get-Experiment`, when available). If the fetch tool isn't yet available, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. +> **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. --- @@ -86,7 +86,7 @@ Replay analysis is qualitative. Be honest about that. - ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. 
This is consistent with the conversion drop in `live_metrics`."_ - ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict. -Tie observations back to specific quantitative results from `Get-Experiment`. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior. +Tie observations back to specific quantitative results from the experiment-details response. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior. --- diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md b/plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md index fdad2cd..142089c 100644 --- a/plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md +++ b/plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md @@ -35,7 +35,7 @@ Walk through these in order. The first one that explains the picture is usually - **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe. - Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5. -If exposures are falling short of plan because traffic dropped: surface that. `Run-Query` on the exposure event with a date breakdown shows whether something changed mid-experiment. +If exposures are falling short of plan because traffic dropped: surface that. Querying the exposure event with a date breakdown shows whether something changed mid-experiment. ### 2. 
Observed effect is smaller than the MDE @@ -71,8 +71,8 @@ Never change traffic allocation mid-Frequentist test — it invalidates the SRM **What to look at**: the exposure tracking method (`$experiment_started` event volume), any audience filters on the backing feature flag, and `settings.excludeQA`. -- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." `Get-Feature-Flag` reveals the rollout rules; `Run-Query` on `$experiment_started` confirms how many users actually got exposed. -- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with `Run-Query`. +- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query `$experiment_started` to confirm how many users actually got exposed. +- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event. - `settings.excludeQA` was off and you suspect internal traffic is dominating one variant → enable it on the next run (results then are cleaner but also smaller). **Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md). @@ -93,7 +93,7 @@ Once you know which reason fits, the recommendation almost picks itself. | Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. 
| | Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. | -When recommending EXTEND on an active experiment, the call is `Update-Experiment` with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the setup-side skill for the power math. +When recommending EXTEND on an active experiment, the action is an experiment update with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the `experiment-setup` skill for the power math. --- @@ -110,6 +110,6 @@ When recommending EXTEND on an active experiment, the call is `Update-Experiment ## Output shape 1. **The reason** (one of the five above), in one sentence. -2. **The evidence from `Get-Experiment`** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.). -3. **Recommendation** from the table above, with the specific `Update-Experiment` call or follow-up action. +2. **The evidence from the experiment-details response** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.). +3. **Recommendation** from the table above, with the specific experiment update or follow-up action. 4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment. From 1f13db66d5a0ed0b7b33134f7c8d8dc176257674 Mon Sep 17 00:00:00 2001 From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Date: Fri, 5 Jun 2026 01:26:34 +0000 Subject: [PATCH 03/11] Trim SKILL.md by deferring duplicated content to references MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The spine is always loaded; references are lazy. 
Move spine content that duplicated reference material out: - Drop the 47-line ASCII decision tree (numbered list reads equivalently) - Replace the Step 1 trustworthiness field table with a one-line gate (full table lives in experiment-fields.md + health-check-interpretation.md) - Compress Step 4 baseline-lookup detail to a pointer (full procedure in per-metric-interpretation.md) - Move multi-variant + decide-call shape + special variant constants to experiment-fields.md §Lifecycle hand-off (where they already are) - Drop the 16-line common-pitfalls cheat sheet (each pitfall is covered in the relevant reference) SKILL.md: 236 → 110 lines. Decision-tree spine and verdict table preserved; polarity recipe stays inline since it's load-bearing for every step. All references still linked from the spine. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../skills/experiment-results/SKILL.md | 178 +++--------------- .../skills/experiment-results/SKILL.md | 178 +++--------------- .../skills/experiment-results/SKILL.md | 178 +++--------------- 3 files changed, 78 insertions(+), 456 deletions(-) diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md index 0164c56..44f7254 100644 --- a/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md +++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md @@ -6,12 +6,12 @@ license: Apache-2.0 # Experiment Results Interpretation -You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. **Read the Decision Tree first** and use it as the spine of every interpretation. Drop into the deeper references only when the situation calls for it. +You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. Use the decision tree below as the spine; open references only when a step needs depth. 
## Requirements - Access to Mixpanel (read experiment details and metrics; update experiment lifecycle for ship/kill decisions). -- This skill reads the verdicts the platform's experiment-details response already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values. +- This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values. ## When to use this skill @@ -38,134 +38,36 @@ Always request experiment details with `compute_exposures=true, compute_metrics= | SRM check | `live_srm_analysis` | `exposures_cache.$srm_analysis` | | Per-metric per-variant stats | `live_metrics[metricId][variant]` | `results_cache.metrics[metricId][variant]` | | Bucketed summary | recompute from `live_metrics` | `results_cache.summary` | -| When was this computed? | "now" | `exposures_cache.$last_computed` | -If `live_results_errors` is non-null, the live path failed. Use the cache, caveat that data is stale, and surface the error to the user — the underlying failure may need fixing before any decision. +If `live_results_errors` is non-null, use the cache, caveat that data is stale, and surface the error — the underlying failure may need fixing before any decision. If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect." -If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect." - -See [references/experiment-fields.md](references/experiment-fields.md) for the full field map and which fields drive each step below. +The full field map is in [references/experiment-fields.md](references/experiment-fields.md). 
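The live-versus-cache fallback described above can be sketched as a small lookup. This is an illustrative sketch only, not platform code: it assumes the experiment-details response has already been parsed into a plain dict shaped like the field table, and the function name `pick_metric_stats` and the source labels it returns are hypothetical. The `live_with_errors` branch in particular is an assumption about a case the text does not spell out.

```python
from typing import Optional, Tuple


def pick_metric_stats(
    details: dict, metric_id: str, variant: str
) -> Tuple[Optional[dict], str]:
    """Return (stats, source) for one metric x variant pair.

    `details` is assumed to be the parsed experiment-details response.
    The source labels are hypothetical names used for illustration.
    """
    live_metrics = details.get("live_metrics") or {}
    cache_metrics = (details.get("results_cache") or {}).get("metrics") or {}

    live = (live_metrics.get(metric_id) or {}).get(variant)
    cached = (cache_metrics.get(metric_id) or {}).get(variant)
    live_failed = details.get("live_results_errors") is not None

    # Healthy live path: use it directly.
    if live is not None and not live_failed:
        return live, "live"
    # Live path failed or is empty: fall back to the cache and caveat staleness.
    if cached is not None:
        return cached, "stale_cache"
    # Assumption: a live row alongside reported errors is surfaced, not trusted.
    if live is not None:
        return live, "live_with_errors"
    # Both empty: report "no result was computed", never "no effect".
    return None, "not_computed"
```

A caller would surface `live_results_errors` to the user whenever the source is anything other than `live`, and recommend a re-sync on `not_computed`.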
--- -## The Decision Tree - -This is the spine of every interpretation. Run the steps **in order**. **Stop at the first failure** — do not proceed to step N+1 if step N flags a problem. - -``` -┌─ Step 1: TRUSTWORTHINESS GATE ───────────────┐ -│ SRM ok? → exposures sufficient? → │ -│ Retro A/A clean? → minimum duration met? → │ -│ no misconfig? │ -│ │ │ -│ fail → STOP. See references/ │ -│ health-check-interpretation.md │ -└──────────────┬───────────────────────────────┘ - ↓ pass -┌─ Step 2: STATISTICAL SIGNIFICANCE ───────────┐ -│ For each non-control variant × primary, │ -│ apply the polarity recipe (sign-of-lift + │ -│ metric.direction). Significant + correct │ -│ polarity = "win"; significant + wrong │ -│ polarity = "loss". │ -│ │ │ -│ nothing significant on primaries → │ -│ see references/why-no-statsig.md │ -└──────────────┬───────────────────────────────┘ - ↓ at least one primary win -┌─ Step 3: GUARDRAIL CHECK ────────────────────┐ -│ Any guardrail significant in the wrong │ -│ polarity? → regression → ITERATE not ship │ -└──────────────┬───────────────────────────────┘ - ↓ guardrails clean -┌─ Step 4: PRACTICAL SIGNIFICANCE ─────────────┐ -│ Convert the lift on the primary into │ -│ absolute terms. Is it big enough to │ -│ matter to the business? │ -│ Statistically significant ≠ ships. │ -└──────────────┬───────────────────────────────┘ - ↓ meaningful magnitude -┌─ Step 5: VERDICT ────────────────────────────┐ -│ Trust ✓ + primary win + guardrails ✓ + │ -│ meaningful magnitude → SHIP │ -│ Trust ✓ + primary win + guardrail regress │ -│ → ITERATE │ -│ Trust ✓ + primary neutral after target │ -│ → KILL or ITERATE │ -│ Trust ✗ │ -│ → DO NOT DECIDE; report failures │ -│ Hasn't reached target sample/duration │ -│ → WAIT (or extend, or restart with more │ -│ power — see why-no-statsig.md) │ -└──────────────────────────────────────────────┘ -``` - -### Step 1 — Trustworthiness gate (consume the verdicts) - -Read these fields. 
Treat the platform's verdict as authoritative — do not reapply thresholds yourself. - -| Check | Field to read | What "fail" looks like | -| ------------------------ | ------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| SRM | `live_srm_analysis` (or `exposures_cache.$srm_analysis`) | Platform flags as failing — do not compute the chi-square yourself. | -| Sufficient exposures | `live_exposures` per variant | Platform-flagged "insufficient." If unflagged but per-variant counts look suspicious, route the user to the health-check reference; do not invent a threshold. | -| Retro A/A (pre-exp bias) | `settings.preExperimentBias` enabled, then the analysis | Platform flags a significant pre-period difference. | -| Minimum elapsed time | `end_date - start_date` | Less than ~3 days regardless of sample size — interpretation is unreliable. | -| Ran for planned duration | `start_date`, `end_date`, `settings.endAfterDays`/`sampleSize`/`endCondition`, `settings.testingModel` | Frequentist: ended before reaching configured target = peeking. Sequential: early stop on significance is allowed. | -| Misconfiguration | See [references/experiment-fields.md](references/experiment-fields.md) §Misconfig | Any flagged misconfig invalidates analysis. | - -If any of these fail, **stop**. Tell the user explicitly that results are not trustworthy. Open [references/health-check-interpretation.md](references/health-check-interpretation.md) for the per-failure root-cause checklists, recommended actions, and the Kohavi framing ("SRM is the #1 trustworthiness check; Twyman's Law: any unusually clean result is more likely a bug than a discovery"). 
- -### Step 2 — Statistical significance with polarity - -**Critical**: `summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by whether the lift is good for the business. You MUST apply the polarity recipe using each metric's `direction` before declaring a winner. - -#### Polarity recipe - -`metric.direction` is `"up"` or `"down"` (defaults to `"up"` if unset on the source metric). - -- `lift is None` or `lift == 0` → **neutral**. -- `direction == "up"` → **positive** if `lift > 0`, else **negative**. -- `direction == "down"` → **positive** if `lift < 0`, else **negative**. - -A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. Never trust the bucket name as the business verdict. - -#### How to read the summary - -1. **Filter out the control row.** Use `settings.controlKey` (typically `"control"`; may be empty). Control-vs-control always has lift 0 and inflates the "no effect" count. If `controlKey` is empty, identify control by: (a) the variant literally named `"control"`, (b) the variant whose lift is uniformly 0 across all metrics, or (c) ask the user. -2. For each non-control variant, look up the metric in `summary.positive` / `summary.negative` / `summary.no`. **Trust the bucket name as the significance signal** — the `significance` field on each item may be `null` even when the bucket is meaningful. -3. Apply the polarity recipe using `metric.direction` to translate sign-of-lift into win/loss. -4. If `lift is None` in a summary item, **the calculation failed** for that variant — surface it. Do not interpret as "no effect." - -The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is set to `"bonferroni"` or `"benjamini-hochberg"` (across primaries × non-control variants). 
**Don't re-correct.** - -Turning the per-metric numbers into a plain-language verdict (lift + CI + p-value → "small win," "large regression," "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). - -If nothing on the primaries is significant and the user is asking "why hasn't this hit statsig?", route to [references/why-no-statsig.md](references/why-no-statsig.md). - -### Step 3 — Guardrail check - -Apply the polarity recipe to every guardrail metric (`metric.type == "guardrail"`). - -- A small primary win + a clear guardrail regression → usually **iterate, do not ship**. -- "Not significant" on a guardrail does NOT mean "no regression." It means the experiment couldn't _detect_ one at the chosen confidence. If the guardrail is critical (latency, error rate, retention), flag whether it was powered to detect a meaningful regression. -- Polarity matters here too: a guardrail named "errors" with `direction: "down"` and lift `+5%` (significant) is a regression even though it lands in `summary.positive`. - -### Step 4 — Practical significance - -Statistical significance ≠ business impact. For every primary metric that won: - -1. Read the **baseline value** from the control variant: `live_metrics[metricId][controlKey].value`. -2. Read the **lift** from the winning variant's row. -3. Compute absolute lift: `baseline_value × lift`. -4. Project to population per period: ask the user for traffic estimates if not in context. - -A 5% lift on a 20% baseline metric serving 1M users/week is enormous. A 5% lift on a 0.1% baseline metric serving 1k users/week is noise. Always ground the user in absolute terms before declaring a win meaningful. +## The decision tree + +Run in order. **Stop at the first failure** — do not proceed if a step flags a problem. + +1. **Trustworthiness gate** — SRM ok? Exposures sufficient? Retro A/A clean? Minimum duration met (~3 days)? No misconfig? 
If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md). +2. **Statistical significance** — apply the polarity recipe (below) to each non-control variant × primary. If nothing significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md). +3. **Guardrail check** — any guardrail significant in the wrong polarity? Regression → ITERATE not ship. +4. **Practical significance** — convert lift into absolute terms (`baseline_value × lift`). Statistically significant ≠ ships. +5. **Verdict** — see table below. + +### Polarity recipe (load-bearing — keep in mind for every metric) -**Twyman's Law check**: before celebrating any lift > ~30%, ask: did the treatment change who is _exposed_ to this metric, not just how they behave? See the changed-denominator notes in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). +`summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by business value. `metric.direction` ("up" / "down", defaults to "up") tells you which sign is good: -If `value` or `sampleSize` is `null` (common when live computation timed out), run a query on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total. +- `lift is None` or `lift == 0` → **neutral** +- `direction == "up"` → **positive** if `lift > 0`, else **negative** +- `direction == "down"` → **positive** if `lift < 0`, else **negative** -### Step 5 — Verdict +A metric in `summary.positive` with `direction: "down"` is a **regression**, not a win. Filter out the control row first (use `settings.controlKey`). The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**. 
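
The polarity recipe above can be sketched as a small function. This is an illustrative sketch only, not the platform's implementation: the function name `business_verdict` is hypothetical, and it assumes `lift` is a signed number (or `None`) and `direction` is the string from `metric.direction`.

```python
from typing import Optional


def business_verdict(lift: Optional[float], direction: Optional[str] = "up") -> str:
    """Translate sign-of-lift into a business verdict using metric.direction.

    Hypothetical helper: mirrors the three bullets of the polarity recipe.
    """
    # An unset direction defaults to "up", as the recipe states.
    direction = direction or "up"
    # A missing or zero lift is neutral, regardless of direction.
    if lift is None or lift == 0:
        return "neutral"
    if direction == "up":
        return "positive" if lift > 0 else "negative"
    if direction == "down":
        return "positive" if lift < 0 else "negative"
    raise ValueError(f"unexpected direction: {direction!r}")


# A metric with a positive lift lands in summary.positive, but with
# direction "down" the business verdict is negative (a regression).
print(business_verdict(0.05, "down"))
```

The sketch only flips the sign interpretation. It does not decide significance, which still comes from the bucket the platform returned.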
+ +Per-metric phrasing (translating lift + CI + p-value into "small win" / "large regression" / "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). The same reference covers the changed-denominator check (Twyman's Law) for any lift >~30%, and how to query the baseline if `value` or `sampleSize` is `null`. + +### Verdict table | Situation | Recommendation | | ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | @@ -175,22 +77,13 @@ If `value` or `sampleSize` is `null` (common when live computation timed out), r | Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)). | | Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). | -For **multi-variant tests**, pivot the summary by variant and evaluate each treatment independently against control. The winner is the variant with the most polarity-corrected primary wins, zero guardrail regressions, and the largest practical impact. If multiple qualify, prefer the simpler / lower-risk variant. If none qualify, recommend kill or iterate. - -`message` is required on every `decide` call — include the rationale, the metrics evaluated, and any tradeoffs accepted. - -Special variant constants when `success=true`: - -- `__no_variant_shipped__` — ship the change without picking a variant -- `__defer_variant_decision__` — defer (status becomes `SUCCESS_DEFERRED` in UI) - -For a kill, pass `success=false`. 
+For multi-variant tests, the `decide`-call shape, and special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), see [references/experiment-fields.md](references/experiment-fields.md) §Lifecycle hand-off. `message` is required on every `decide` call. --- ## Going deeper -Once the spine is clear, the user often asks one of these follow-ups. Open the relevant reference on demand: +Open the relevant reference on demand: | User asks about… | Open | | ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | @@ -215,22 +108,3 @@ Default to this shape unless the user asks for something else: 5. **Suggested next action** — the experiment-decide action to take, or the deeper investigation to run. If experiment details are unavailable or return errors, say so — do not invent a verdict. - ---- - -## Common pitfalls (cheat sheet) - -- ⛔ **Skipping Step 1** because the lifts look exciting (Twyman's Law) -- ⛔ **Recomputing thresholds** instead of consuming the verdicts the platform already returned -- ⛔ **Not applying polarity** — reading `summary.positive` as "good" without checking `metric.direction` -- ⛔ Trusting a >30% lift without checking whether the **denominator changed** -- ⛔ **Including the control row** when counting wins/losses (filter by `settings.controlKey`) -- ⛔ Treating a `null` lift as "no effect" — it means computation failed -- ⛔ Treating a missing primary (in `metrics[]` but not in `live_metrics`/`results_cache.metrics`) as "no effect" — it's "no measurement" -- ⛔ Interpreting a `< 3 day` experiment instead of refusing -- ⛔ Forgetting to call out a **non-default `confidenceLevel`** (0.9 inflates false positives; 0.99 is conservative) -- ⛔ Treating **secondary-metric significance** as decisional (it isn't, ever) -- ⛔ Conflating **statistical significance** with **practical significance** -- ⛔ Ignoring 
**guardrail regressions** because the primary won -- ⛔ Calling a single significant primary with multiple-testing correction off a "win" — look at the aggregate, or enable correction -- ⛔ Concluding "no effect" from an underpowered inconclusive result (route to [references/why-no-statsig.md](references/why-no-statsig.md)) diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md index 0164c56..44f7254 100644 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md @@ -6,12 +6,12 @@ license: Apache-2.0 # Experiment Results Interpretation -You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. **Read the Decision Tree first** and use it as the spine of every interpretation. Drop into the deeper references only when the situation calls for it. +You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. Use the decision tree below as the spine; open references only when a step needs depth. ## Requirements - Access to Mixpanel (read experiment details and metrics; update experiment lifecycle for ship/kill decisions). -- This skill reads the verdicts the platform's experiment-details response already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values. +- This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values. 
## When to use this skill @@ -38,134 +38,36 @@ Always request experiment details with `compute_exposures=true, compute_metrics= | SRM check | `live_srm_analysis` | `exposures_cache.$srm_analysis` | | Per-metric per-variant stats | `live_metrics[metricId][variant]` | `results_cache.metrics[metricId][variant]` | | Bucketed summary | recompute from `live_metrics` | `results_cache.summary` | -| When was this computed? | "now" | `exposures_cache.$last_computed` | -If `live_results_errors` is non-null, the live path failed. Use the cache, caveat that data is stale, and surface the error to the user — the underlying failure may need fixing before any decision. +If `live_results_errors` is non-null, use the cache, caveat that data is stale, and surface the error — the underlying failure may need fixing before any decision. If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect." -If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect." - -See [references/experiment-fields.md](references/experiment-fields.md) for the full field map and which fields drive each step below. +The full field map is in [references/experiment-fields.md](references/experiment-fields.md). --- -## The Decision Tree - -This is the spine of every interpretation. Run the steps **in order**. **Stop at the first failure** — do not proceed to step N+1 if step N flags a problem. - -``` -┌─ Step 1: TRUSTWORTHINESS GATE ───────────────┐ -│ SRM ok? → exposures sufficient? → │ -│ Retro A/A clean? → minimum duration met? → │ -│ no misconfig? │ -│ │ │ -│ fail → STOP. See references/ │ -│ health-check-interpretation.md │ -└──────────────┬───────────────────────────────┘ - ↓ pass -┌─ Step 2: STATISTICAL SIGNIFICANCE ───────────┐ -│ For each non-control variant × primary, │ -│ apply the polarity recipe (sign-of-lift + │ -│ metric.direction). 
Significant + correct │ -│ polarity = "win"; significant + wrong │ -│ polarity = "loss". │ -│ │ │ -│ nothing significant on primaries → │ -│ see references/why-no-statsig.md │ -└──────────────┬───────────────────────────────┘ - ↓ at least one primary win -┌─ Step 3: GUARDRAIL CHECK ────────────────────┐ -│ Any guardrail significant in the wrong │ -│ polarity? → regression → ITERATE not ship │ -└──────────────┬───────────────────────────────┘ - ↓ guardrails clean -┌─ Step 4: PRACTICAL SIGNIFICANCE ─────────────┐ -│ Convert the lift on the primary into │ -│ absolute terms. Is it big enough to │ -│ matter to the business? │ -│ Statistically significant ≠ ships. │ -└──────────────┬───────────────────────────────┘ - ↓ meaningful magnitude -┌─ Step 5: VERDICT ────────────────────────────┐ -│ Trust ✓ + primary win + guardrails ✓ + │ -│ meaningful magnitude → SHIP │ -│ Trust ✓ + primary win + guardrail regress │ -│ → ITERATE │ -│ Trust ✓ + primary neutral after target │ -│ → KILL or ITERATE │ -│ Trust ✗ │ -│ → DO NOT DECIDE; report failures │ -│ Hasn't reached target sample/duration │ -│ → WAIT (or extend, or restart with more │ -│ power — see why-no-statsig.md) │ -└──────────────────────────────────────────────┘ -``` - -### Step 1 — Trustworthiness gate (consume the verdicts) - -Read these fields. Treat the platform's verdict as authoritative — do not reapply thresholds yourself. - -| Check | Field to read | What "fail" looks like | -| ------------------------ | ------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| SRM | `live_srm_analysis` (or `exposures_cache.$srm_analysis`) | Platform flags as failing — do not compute the chi-square yourself. | -| Sufficient exposures | `live_exposures` per variant | Platform-flagged "insufficient." 
If unflagged but per-variant counts look suspicious, route the user to the health-check reference; do not invent a threshold. | -| Retro A/A (pre-exp bias) | `settings.preExperimentBias` enabled, then the analysis | Platform flags a significant pre-period difference. | -| Minimum elapsed time | `end_date - start_date` | Less than ~3 days regardless of sample size — interpretation is unreliable. | -| Ran for planned duration | `start_date`, `end_date`, `settings.endAfterDays`/`sampleSize`/`endCondition`, `settings.testingModel` | Frequentist: ended before reaching configured target = peeking. Sequential: early stop on significance is allowed. | -| Misconfiguration | See [references/experiment-fields.md](references/experiment-fields.md) §Misconfig | Any flagged misconfig invalidates analysis. | - -If any of these fail, **stop**. Tell the user explicitly that results are not trustworthy. Open [references/health-check-interpretation.md](references/health-check-interpretation.md) for the per-failure root-cause checklists, recommended actions, and the Kohavi framing ("SRM is the #1 trustworthiness check; Twyman's Law: any unusually clean result is more likely a bug than a discovery"). - -### Step 2 — Statistical significance with polarity - -**Critical**: `summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by whether the lift is good for the business. You MUST apply the polarity recipe using each metric's `direction` before declaring a winner. - -#### Polarity recipe - -`metric.direction` is `"up"` or `"down"` (defaults to `"up"` if unset on the source metric). - -- `lift is None` or `lift == 0` → **neutral**. -- `direction == "up"` → **positive** if `lift > 0`, else **negative**. -- `direction == "down"` → **positive** if `lift < 0`, else **negative**. - -A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. 
Never trust the bucket name as the business verdict. - -#### How to read the summary - -1. **Filter out the control row.** Use `settings.controlKey` (typically `"control"`; may be empty). Control-vs-control always has lift 0 and inflates the "no effect" count. If `controlKey` is empty, identify control by: (a) the variant literally named `"control"`, (b) the variant whose lift is uniformly 0 across all metrics, or (c) ask the user. -2. For each non-control variant, look up the metric in `summary.positive` / `summary.negative` / `summary.no`. **Trust the bucket name as the significance signal** — the `significance` field on each item may be `null` even when the bucket is meaningful. -3. Apply the polarity recipe using `metric.direction` to translate sign-of-lift into win/loss. -4. If `lift is None` in a summary item, **the calculation failed** for that variant — surface it. Do not interpret as "no effect." - -The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is set to `"bonferroni"` or `"benjamini-hochberg"` (across primaries × non-control variants). **Don't re-correct.** - -Turning the per-metric numbers into a plain-language verdict (lift + CI + p-value → "small win," "large regression," "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). - -If nothing on the primaries is significant and the user is asking "why hasn't this hit statsig?", route to [references/why-no-statsig.md](references/why-no-statsig.md). - -### Step 3 — Guardrail check - -Apply the polarity recipe to every guardrail metric (`metric.type == "guardrail"`). - -- A small primary win + a clear guardrail regression → usually **iterate, do not ship**. -- "Not significant" on a guardrail does NOT mean "no regression." It means the experiment couldn't _detect_ one at the chosen confidence. 
If the guardrail is critical (latency, error rate, retention), flag whether it was powered to detect a meaningful regression. -- Polarity matters here too: a guardrail named "errors" with `direction: "down"` and lift `+5%` (significant) is a regression even though it lands in `summary.positive`. - -### Step 4 — Practical significance - -Statistical significance ≠ business impact. For every primary metric that won: - -1. Read the **baseline value** from the control variant: `live_metrics[metricId][controlKey].value`. -2. Read the **lift** from the winning variant's row. -3. Compute absolute lift: `baseline_value × lift`. -4. Project to population per period: ask the user for traffic estimates if not in context. - -A 5% lift on a 20% baseline metric serving 1M users/week is enormous. A 5% lift on a 0.1% baseline metric serving 1k users/week is noise. Always ground the user in absolute terms before declaring a win meaningful. +## The decision tree + +Run in order. **Stop at the first failure** — do not proceed if a step flags a problem. + +1. **Trustworthiness gate** — SRM ok? Exposures sufficient? Retro A/A clean? Minimum duration met (~3 days)? No misconfig? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md). +2. **Statistical significance** — apply the polarity recipe (below) to each non-control variant × primary. If nothing significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md). +3. **Guardrail check** — any guardrail significant in the wrong polarity? Regression → ITERATE not ship. +4. **Practical significance** — convert lift into absolute terms (`baseline_value × lift`). Statistically significant ≠ ships. +5. **Verdict** — see table below. + +### Polarity recipe (load-bearing — keep in mind for every metric) -**Twyman's Law check**: before celebrating any lift > ~30%, ask: did the treatment change who is _exposed_ to this metric, not just how they behave? 
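
The two examples in the paragraph above can be worked through in a few lines. This is a minimal sketch under stated assumptions: the metric is a conversion rate, `lift` is a relative lift expressed as a fraction (0.05 for 5%), and the helper names `absolute_lift` and `extra_conversions_per_period` are hypothetical, not part of any Mixpanel API.

```python
def absolute_lift(baseline_value: float, lift: float) -> float:
    """Absolute change in the metric: baseline_value x lift (Step 4, item 3)."""
    return baseline_value * lift


def extra_conversions_per_period(baseline_value: float, lift: float, users: int) -> float:
    """Project the absolute lift onto the population exposed per period.

    Assumes a conversion-rate metric, so the result is a count of users.
    """
    return absolute_lift(baseline_value, lift) * users


# 5% lift on a 20% baseline, 1M users/week: roughly 10,000 extra conversions.
large = extra_conversions_per_period(0.20, 0.05, 1_000_000)

# 5% lift on a 0.1% baseline, 1k users/week: roughly 0.05 extra conversions.
small = extra_conversions_per_period(0.001, 0.05, 1_000)

print(f"{large:,.0f} vs {small:.2f} extra conversions per week")
```

The same relative lift differs by several orders of magnitude in absolute terms, which is why the traffic estimate has to come from the user when it is not already in context.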
See the changed-denominator notes in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). +`summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by business value. `metric.direction` ("up" / "down", defaults to "up") tells you which sign is good: -If `value` or `sampleSize` is `null` (common when live computation timed out), run a query on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total. +- `lift is None` or `lift == 0` → **neutral** +- `direction == "up"` → **positive** if `lift > 0`, else **negative** +- `direction == "down"` → **positive** if `lift < 0`, else **negative** -### Step 5 — Verdict +A metric in `summary.positive` with `direction: "down"` is a **regression**, not a win. Filter out the control row first (use `settings.controlKey`). The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**. + +Per-metric phrasing (translating lift + CI + p-value into "small win" / "large regression" / "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). The same reference covers the changed-denominator check (Twyman's Law) for any lift >~30%, and how to query the baseline if `value` or `sampleSize` is `null`. 
+ +### Verdict table | Situation | Recommendation | | ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | @@ -175,22 +77,13 @@ If `value` or `sampleSize` is `null` (common when live computation timed out), r | Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)). | | Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). | -For **multi-variant tests**, pivot the summary by variant and evaluate each treatment independently against control. The winner is the variant with the most polarity-corrected primary wins, zero guardrail regressions, and the largest practical impact. If multiple qualify, prefer the simpler / lower-risk variant. If none qualify, recommend kill or iterate. - -`message` is required on every `decide` call — include the rationale, the metrics evaluated, and any tradeoffs accepted. - -Special variant constants when `success=true`: - -- `__no_variant_shipped__` — ship the change without picking a variant -- `__defer_variant_decision__` — defer (status becomes `SUCCESS_DEFERRED` in UI) - -For a kill, pass `success=false`. +For multi-variant tests, the `decide`-call shape, and special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), see [references/experiment-fields.md](references/experiment-fields.md) §Lifecycle hand-off. `message` is required on every `decide` call. --- ## Going deeper -Once the spine is clear, the user often asks one of these follow-ups. 
Open the relevant reference on demand: +Open the relevant reference on demand: | User asks about… | Open | | ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | @@ -215,22 +108,3 @@ Default to this shape unless the user asks for something else: 5. **Suggested next action** — the experiment-decide action to take, or the deeper investigation to run. If experiment details are unavailable or return errors, say so — do not invent a verdict. - ---- - -## Common pitfalls (cheat sheet) - -- ⛔ **Skipping Step 1** because the lifts look exciting (Twyman's Law) -- ⛔ **Recomputing thresholds** instead of consuming the verdicts the platform already returned -- ⛔ **Not applying polarity** — reading `summary.positive` as "good" without checking `metric.direction` -- ⛔ Trusting a >30% lift without checking whether the **denominator changed** -- ⛔ **Including the control row** when counting wins/losses (filter by `settings.controlKey`) -- ⛔ Treating a `null` lift as "no effect" — it means computation failed -- ⛔ Treating a missing primary (in `metrics[]` but not in `live_metrics`/`results_cache.metrics`) as "no effect" — it's "no measurement" -- ⛔ Interpreting a `< 3 day` experiment instead of refusing -- ⛔ Forgetting to call out a **non-default `confidenceLevel`** (0.9 inflates false positives; 0.99 is conservative) -- ⛔ Treating **secondary-metric significance** as decisional (it isn't, ever) -- ⛔ Conflating **statistical significance** with **practical significance** -- ⛔ Ignoring **guardrail regressions** because the primary won -- ⛔ Calling a single significant primary with multiple-testing correction off a "win" — look at the aggregate, or enable correction -- ⛔ Concluding "no effect" from an underpowered inconclusive result (route to [references/why-no-statsig.md](references/why-no-statsig.md)) diff --git 
a/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md index 0164c56..44f7254 100644 --- a/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md +++ b/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md @@ -6,12 +6,12 @@ license: Apache-2.0 # Experiment Results Interpretation -You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. **Read the Decision Tree first** and use it as the spine of every interpretation. Drop into the deeper references only when the situation calls for it. +You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. Use the decision tree below as the spine; open references only when a step needs depth. ## Requirements - Access to Mixpanel (read experiment details and metrics; update experiment lifecycle for ship/kill decisions). -- This skill reads the verdicts the platform's experiment-details response already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values. +- This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values. ## When to use this skill @@ -38,134 +38,36 @@ Always request experiment details with `compute_exposures=true, compute_metrics= | SRM check | `live_srm_analysis` | `exposures_cache.$srm_analysis` | | Per-metric per-variant stats | `live_metrics[metricId][variant]` | `results_cache.metrics[metricId][variant]` | | Bucketed summary | recompute from `live_metrics` | `results_cache.summary` | -| When was this computed? | "now" | `exposures_cache.$last_computed` | -If `live_results_errors` is non-null, the live path failed. 
Use the cache, caveat that data is stale, and surface the error to the user — the underlying failure may need fixing before any decision. +If `live_results_errors` is non-null, use the cache, caveat that data is stale, and surface the error — the underlying failure may need fixing before any decision. If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect." -If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect." - -See [references/experiment-fields.md](references/experiment-fields.md) for the full field map and which fields drive each step below. +The full field map is in [references/experiment-fields.md](references/experiment-fields.md). --- -## The Decision Tree - -This is the spine of every interpretation. Run the steps **in order**. **Stop at the first failure** — do not proceed to step N+1 if step N flags a problem. - -``` -┌─ Step 1: TRUSTWORTHINESS GATE ───────────────┐ -│ SRM ok? → exposures sufficient? → │ -│ Retro A/A clean? → minimum duration met? → │ -│ no misconfig? │ -│ │ │ -│ fail → STOP. See references/ │ -│ health-check-interpretation.md │ -└──────────────┬───────────────────────────────┘ - ↓ pass -┌─ Step 2: STATISTICAL SIGNIFICANCE ───────────┐ -│ For each non-control variant × primary, │ -│ apply the polarity recipe (sign-of-lift + │ -│ metric.direction). Significant + correct │ -│ polarity = "win"; significant + wrong │ -│ polarity = "loss". │ -│ │ │ -│ nothing significant on primaries → │ -│ see references/why-no-statsig.md │ -└──────────────┬───────────────────────────────┘ - ↓ at least one primary win -┌─ Step 3: GUARDRAIL CHECK ────────────────────┐ -│ Any guardrail significant in the wrong │ -│ polarity? 
→ regression → ITERATE not ship │ -└──────────────┬───────────────────────────────┘ - ↓ guardrails clean -┌─ Step 4: PRACTICAL SIGNIFICANCE ─────────────┐ -│ Convert the lift on the primary into │ -│ absolute terms. Is it big enough to │ -│ matter to the business? │ -│ Statistically significant ≠ ships. │ -└──────────────┬───────────────────────────────┘ - ↓ meaningful magnitude -┌─ Step 5: VERDICT ────────────────────────────┐ -│ Trust ✓ + primary win + guardrails ✓ + │ -│ meaningful magnitude → SHIP │ -│ Trust ✓ + primary win + guardrail regress │ -│ → ITERATE │ -│ Trust ✓ + primary neutral after target │ -│ → KILL or ITERATE │ -│ Trust ✗ │ -│ → DO NOT DECIDE; report failures │ -│ Hasn't reached target sample/duration │ -│ → WAIT (or extend, or restart with more │ -│ power — see why-no-statsig.md) │ -└──────────────────────────────────────────────┘ -``` - -### Step 1 — Trustworthiness gate (consume the verdicts) - -Read these fields. Treat the platform's verdict as authoritative — do not reapply thresholds yourself. - -| Check | Field to read | What "fail" looks like | -| ------------------------ | ------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| SRM | `live_srm_analysis` (or `exposures_cache.$srm_analysis`) | Platform flags as failing — do not compute the chi-square yourself. | -| Sufficient exposures | `live_exposures` per variant | Platform-flagged "insufficient." If unflagged but per-variant counts look suspicious, route the user to the health-check reference; do not invent a threshold. | -| Retro A/A (pre-exp bias) | `settings.preExperimentBias` enabled, then the analysis | Platform flags a significant pre-period difference. 
| -| Minimum elapsed time | `end_date - start_date` | Less than ~3 days regardless of sample size — interpretation is unreliable. | -| Ran for planned duration | `start_date`, `end_date`, `settings.endAfterDays`/`sampleSize`/`endCondition`, `settings.testingModel` | Frequentist: ended before reaching configured target = peeking. Sequential: early stop on significance is allowed. | -| Misconfiguration | See [references/experiment-fields.md](references/experiment-fields.md) §Misconfig | Any flagged misconfig invalidates analysis. | - -If any of these fail, **stop**. Tell the user explicitly that results are not trustworthy. Open [references/health-check-interpretation.md](references/health-check-interpretation.md) for the per-failure root-cause checklists, recommended actions, and the Kohavi framing ("SRM is the #1 trustworthiness check; Twyman's Law: any unusually clean result is more likely a bug than a discovery"). - -### Step 2 — Statistical significance with polarity - -**Critical**: `summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by whether the lift is good for the business. You MUST apply the polarity recipe using each metric's `direction` before declaring a winner. - -#### Polarity recipe - -`metric.direction` is `"up"` or `"down"` (defaults to `"up"` if unset on the source metric). - -- `lift is None` or `lift == 0` → **neutral**. -- `direction == "up"` → **positive** if `lift > 0`, else **negative**. -- `direction == "down"` → **positive** if `lift < 0`, else **negative**. - -A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. Never trust the bucket name as the business verdict. - -#### How to read the summary - -1. **Filter out the control row.** Use `settings.controlKey` (typically `"control"`; may be empty). Control-vs-control always has lift 0 and inflates the "no effect" count. 
If `controlKey` is empty, identify control by: (a) the variant literally named `"control"`, (b) the variant whose lift is uniformly 0 across all metrics, or (c) ask the user. -2. For each non-control variant, look up the metric in `summary.positive` / `summary.negative` / `summary.no`. **Trust the bucket name as the significance signal** — the `significance` field on each item may be `null` even when the bucket is meaningful. -3. Apply the polarity recipe using `metric.direction` to translate sign-of-lift into win/loss. -4. If `lift is None` in a summary item, **the calculation failed** for that variant — surface it. Do not interpret as "no effect." - -The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is set to `"bonferroni"` or `"benjamini-hochberg"` (across primaries × non-control variants). **Don't re-correct.** - -Turning the per-metric numbers into a plain-language verdict (lift + CI + p-value → "small win," "large regression," "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). - -If nothing on the primaries is significant and the user is asking "why hasn't this hit statsig?", route to [references/why-no-statsig.md](references/why-no-statsig.md). - -### Step 3 — Guardrail check - -Apply the polarity recipe to every guardrail metric (`metric.type == "guardrail"`). - -- A small primary win + a clear guardrail regression → usually **iterate, do not ship**. -- "Not significant" on a guardrail does NOT mean "no regression." It means the experiment couldn't _detect_ one at the chosen confidence. If the guardrail is critical (latency, error rate, retention), flag whether it was powered to detect a meaningful regression. -- Polarity matters here too: a guardrail named "errors" with `direction: "down"` and lift `+5%` (significant) is a regression even though it lands in `summary.positive`. - -### Step 4 — Practical significance - -Statistical significance ≠ business impact. 
For every primary metric that won: - -1. Read the **baseline value** from the control variant: `live_metrics[metricId][controlKey].value`. -2. Read the **lift** from the winning variant's row. -3. Compute absolute lift: `baseline_value × lift`. -4. Project to population per period: ask the user for traffic estimates if not in context. - -A 5% lift on a 20% baseline metric serving 1M users/week is enormous. A 5% lift on a 0.1% baseline metric serving 1k users/week is noise. Always ground the user in absolute terms before declaring a win meaningful. +## The decision tree + +Run in order. **Stop at the first failure** — do not proceed if a step flags a problem. + +1. **Trustworthiness gate** — SRM ok? Exposures sufficient? Retro A/A clean? Minimum duration met (~3 days)? No misconfig? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md). +2. **Statistical significance** — apply the polarity recipe (below) to each non-control variant × primary. If nothing significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md). +3. **Guardrail check** — any guardrail significant in the wrong polarity? Regression → ITERATE not ship. +4. **Practical significance** — convert lift into absolute terms (`baseline_value × lift`). Statistically significant ≠ ships. +5. **Verdict** — see table below. + +### Polarity recipe (load-bearing — keep in mind for every metric) -**Twyman's Law check**: before celebrating any lift > ~30%, ask: did the treatment change who is _exposed_ to this metric, not just how they behave? See the changed-denominator notes in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). +`summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by business value. 
`metric.direction` ("up" / "down", defaults to "up") tells you which sign is good: -If `value` or `sampleSize` is `null` (common when live computation timed out), run a query on the metric scoped to the control variant over the experiment date range to fetch the baseline. Match the metric's aggregation — `unique` → conversion rate; `total` → per-exposure average (raw total ÷ exposures), not the raw total. +- `lift is None` or `lift == 0` → **neutral** +- `direction == "up"` → **positive** if `lift > 0`, else **negative** +- `direction == "down"` → **positive** if `lift < 0`, else **negative** -### Step 5 — Verdict +A metric in `summary.positive` with `direction: "down"` is a **regression**, not a win. Filter out the control row first (use `settings.controlKey`). The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**. + +Per-metric phrasing (translating lift + CI + p-value into "small win" / "large regression" / "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). The same reference covers the changed-denominator check (Twyman's Law) for any lift >~30%, and how to query the baseline if `value` or `sampleSize` is `null`. + +### Verdict table | Situation | Recommendation | | ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | @@ -175,22 +77,13 @@ If `value` or `sampleSize` is `null` (common when live computation timed out), r | Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)). 
| | Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). | -For **multi-variant tests**, pivot the summary by variant and evaluate each treatment independently against control. The winner is the variant with the most polarity-corrected primary wins, zero guardrail regressions, and the largest practical impact. If multiple qualify, prefer the simpler / lower-risk variant. If none qualify, recommend kill or iterate. - -`message` is required on every `decide` call — include the rationale, the metrics evaluated, and any tradeoffs accepted. - -Special variant constants when `success=true`: - -- `__no_variant_shipped__` — ship the change without picking a variant -- `__defer_variant_decision__` — defer (status becomes `SUCCESS_DEFERRED` in UI) - -For a kill, pass `success=false`. +For multi-variant tests, the `decide`-call shape, and special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), see [references/experiment-fields.md](references/experiment-fields.md) §Lifecycle hand-off. `message` is required on every `decide` call. --- ## Going deeper -Once the spine is clear, the user often asks one of these follow-ups. Open the relevant reference on demand: +Open the relevant reference on demand: | User asks about… | Open | | ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | @@ -215,22 +108,3 @@ Default to this shape unless the user asks for something else: 5. **Suggested next action** — the experiment-decide action to take, or the deeper investigation to run. If experiment details are unavailable or return errors, say so — do not invent a verdict. 
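The polarity recipe and the practical-significance step described in this skill can be sketched as below. This is a minimal illustration, not the platform's implementation: the function names `polarity` and `absolute_lift` are hypothetical, and only the `lift` and `direction` semantics come from the text.

```python
from typing import Optional


def polarity(lift: Optional[float], direction: str = "up") -> str:
    """Translate sign-of-lift into a business reading (hypothetical helper)."""
    # A None lift reads as neutral here, but the skill text also says it means
    # the calculation failed, so callers should surface it, not report "no effect".
    if lift is None or lift == 0:
        return "neutral"
    if direction == "down":
        # Lower is better: a negative lift is the good outcome.
        return "positive" if lift < 0 else "negative"
    # "up" is the documented default when direction is unset.
    return "positive" if lift > 0 else "negative"


def absolute_lift(baseline_value: float, lift: float) -> float:
    """Practical-significance step: baseline_value x lift (hypothetical helper)."""
    return baseline_value * lift


# The "errors" guardrail from the text: direction "down", lift +5%.
print(polarity(0.05, "down"))
```

Under these assumptions the "errors" guardrail reads as `negative` even though the platform would file it under `summary.positive`, which is the trap the recipe exists to catch.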
- ---- - -## Common pitfalls (cheat sheet) - -- ⛔ **Skipping Step 1** because the lifts look exciting (Twyman's Law) -- ⛔ **Recomputing thresholds** instead of consuming the verdicts the platform already returned -- ⛔ **Not applying polarity** — reading `summary.positive` as "good" without checking `metric.direction` -- ⛔ Trusting a >30% lift without checking whether the **denominator changed** -- ⛔ **Including the control row** when counting wins/losses (filter by `settings.controlKey`) -- ⛔ Treating a `null` lift as "no effect" — it means computation failed -- ⛔ Treating a missing primary (in `metrics[]` but not in `live_metrics`/`results_cache.metrics`) as "no effect" — it's "no measurement" -- ⛔ Interpreting a `< 3 day` experiment instead of refusing -- ⛔ Forgetting to call out a **non-default `confidenceLevel`** (0.9 inflates false positives; 0.99 is conservative) -- ⛔ Treating **secondary-metric significance** as decisional (it isn't, ever) -- ⛔ Conflating **statistical significance** with **practical significance** -- ⛔ Ignoring **guardrail regressions** because the primary won -- ⛔ Calling a single significant primary with multiple-testing correction off a "win" — look at the aggregate, or enable correction -- ⛔ Concluding "no effect" from an underpowered inconclusive result (route to [references/why-no-statsig.md](references/why-no-statsig.md)) From 4b8b01e166972d11b23e8d508118047a68e6554e Mon Sep 17 00:00:00 2001 From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Date: Fri, 5 Jun 2026 23:36:42 +0000 Subject: [PATCH 04/11] Add negative-trigger guidance to experiment-results description Surfaces the setup-skill boundary at routing time. The exclusion existed in the body ("Do not trigger for experiment setup questions") but the agent never reached it during skill selection. Sync mixpanel-mcp-eu and mixpanel-mcp-in. 
Assisted by Claude
---
 plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md | 2 +-
 plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md | 2 +-
 plugins/mixpanel-mcp/skills/experiment-results/SKILL.md | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md
index 44f7254..7bc71c4 100644
--- a/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md
+++ b/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: experiment-results
-description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds.
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
license: Apache-2.0 --- diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md index 44f7254..7bc71c4 100644 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md +++ b/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md @@ -1,6 +1,6 @@ --- name: experiment-results -description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. +description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill. license: Apache-2.0 --- diff --git a/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md index 44f7254..7bc71c4 100644 --- a/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md +++ b/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md @@ -1,6 +1,6 @@ --- name: experiment-results -description: Interprets a Mixpanel experiment's results and health checks. 
Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. +description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill. license: Apache-2.0 --- From 0e59d994a8853fec567ea4b6afa47a768406e433 Mon Sep 17 00:00:00 2001 From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Date: Tue, 9 Jun 2026 04:54:42 +0000 Subject: [PATCH 05/11] Address PR review: restructure skill, rename to interpret-experiment MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses @gslopez's review on PR #23. - Rename skill from `experiment-results` to `interpret-experiment` (verb-noun, matches `create-dashboard` / `manage-lexicon` / `deep-research`). - Restructure SKILL.md into intro / Glossary / Components / Steps shape (mirrors `manage-lexicon`). Glossary defines Variant, Primary/Guardrail/Secondary, Lift, Polarity, Significance, SRM, Retro A/A, Twyman, CUPED, Winsorization, MDE, Trustworthiness gate so later steps can use the terms without redefining. - Drop API-parameter phrasing (`compute_exposures=true`, decide-call shape). Replace with intent — the tool layer maps it to the right call. 
- Drop the duplicated negative-trigger paragraph from the SKILL body (already in `description:` frontmatter where the loader sees it). - Define the polarity recipe once in Components; replace the verbatim duplicate in per-metric-interpretation.md with a back-reference. - Replace the live/cache field-path table with a single fallback rule in Components; the field schema is the tool's job, not the skill's. - Add explicit "confirm with the user before concluding — irreversible" to the verdict table and to the Steps output shape. - Delete `experiment-fields.md` (was duplicating tool-response schema docs). Promote its domain content into `lifecycle-handoff.md` (decide-call rationale, multi-variant choice, special variant constants). - Break the overloaded misconfig bullets in health-check §7 into seven Condition / Interpretation / Action sub-sections; rename §7 to "Misconfigurations". - Replace `§7` / `§SRM` / `§Misconfig` numeric/abbrev cross-refs with section titles so they don't rot when sections reorder. - Trim generic p-value content in per-metric-interpretation.md to the Mixpanel-specific traps (Welch's t, `liftConfidence` is the confidence level not the CI width). - Drop "Open this when…" preambles from every reference — the LLM is already reading the file by the time it opens. - Sync eu/in plugin copies via `make sync-skills FORCE=1`. 
Assisted by Claude --- README.md | 18 +- .../skills/experiment-results/SKILL.md | 110 ------------ .../references/experiment-fields.md | 158 ------------------ .../skills/interpret-experiment/SKILL.md | 127 ++++++++++++++ .../references/health-check-interpretation.md | 70 ++++++-- .../references/lifecycle-handoff.md | 39 +++++ .../references/per-metric-interpretation.md | 27 ++- .../segment-breakdown-interpretation.md | 4 +- .../segment-of-interest-selection.md | 2 +- .../references/session-replay-analysis.md | 2 +- .../references/why-no-statsig.md | 4 +- .../skills/experiment-results/SKILL.md | 110 ------------ .../references/experiment-fields.md | 158 ------------------ .../skills/interpret-experiment/SKILL.md | 127 ++++++++++++++ .../references/health-check-interpretation.md | 70 ++++++-- .../references/lifecycle-handoff.md | 39 +++++ .../references/per-metric-interpretation.md | 27 ++- .../segment-breakdown-interpretation.md | 4 +- .../segment-of-interest-selection.md | 2 +- .../references/session-replay-analysis.md | 2 +- .../references/why-no-statsig.md | 4 +- .../skills/experiment-results/SKILL.md | 110 ------------ .../references/experiment-fields.md | 158 ------------------ .../skills/interpret-experiment/SKILL.md | 127 ++++++++++++++ .../references/health-check-interpretation.md | 70 ++++++-- .../references/lifecycle-handoff.md | 39 +++++ .../references/per-metric-interpretation.md | 27 ++- .../segment-breakdown-interpretation.md | 4 +- .../segment-of-interest-selection.md | 2 +- .../references/session-replay-analysis.md | 2 +- .../references/why-no-statsig.md | 4 +- 31 files changed, 732 insertions(+), 915 deletions(-) delete mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md delete mode 100644 plugins/mixpanel-mcp-eu/skills/experiment-results/references/experiment-fields.md create mode 100644 plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md rename plugins/{mixpanel-mcp/skills/experiment-results => 
mixpanel-mcp-eu/skills/interpret-experiment}/references/health-check-interpretation.md (73%) create mode 100644 plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md rename plugins/{mixpanel-mcp-in/skills/experiment-results => mixpanel-mcp-eu/skills/interpret-experiment}/references/per-metric-interpretation.md (87%) rename plugins/{mixpanel-mcp/skills/experiment-results => mixpanel-mcp-eu/skills/interpret-experiment}/references/segment-breakdown-interpretation.md (94%) rename plugins/{mixpanel-mcp/skills/experiment-results => mixpanel-mcp-eu/skills/interpret-experiment}/references/segment-of-interest-selection.md (95%) rename plugins/{mixpanel-mcp/skills/experiment-results => mixpanel-mcp-eu/skills/interpret-experiment}/references/session-replay-analysis.md (96%) rename plugins/{mixpanel-mcp/skills/experiment-results => mixpanel-mcp-eu/skills/interpret-experiment}/references/why-no-statsig.md (94%) delete mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md delete mode 100644 plugins/mixpanel-mcp-in/skills/experiment-results/references/experiment-fields.md create mode 100644 plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md rename plugins/mixpanel-mcp-in/skills/{experiment-results => interpret-experiment}/references/health-check-interpretation.md (73%) create mode 100644 plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md rename plugins/{mixpanel-mcp/skills/experiment-results => mixpanel-mcp-in/skills/interpret-experiment}/references/per-metric-interpretation.md (87%) rename plugins/mixpanel-mcp-in/skills/{experiment-results => interpret-experiment}/references/segment-breakdown-interpretation.md (94%) rename plugins/{mixpanel-mcp-eu/skills/experiment-results => mixpanel-mcp-in/skills/interpret-experiment}/references/segment-of-interest-selection.md (95%) rename plugins/mixpanel-mcp-in/skills/{experiment-results => interpret-experiment}/references/session-replay-analysis.md 
(96%) rename plugins/{mixpanel-mcp-eu/skills/experiment-results => mixpanel-mcp-in/skills/interpret-experiment}/references/why-no-statsig.md (94%) delete mode 100644 plugins/mixpanel-mcp/skills/experiment-results/SKILL.md delete mode 100644 plugins/mixpanel-mcp/skills/experiment-results/references/experiment-fields.md create mode 100644 plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md rename plugins/{mixpanel-mcp-eu/skills/experiment-results => mixpanel-mcp/skills/interpret-experiment}/references/health-check-interpretation.md (73%) create mode 100644 plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md rename plugins/{mixpanel-mcp-eu/skills/experiment-results => mixpanel-mcp/skills/interpret-experiment}/references/per-metric-interpretation.md (87%) rename plugins/{mixpanel-mcp-eu/skills/experiment-results => mixpanel-mcp/skills/interpret-experiment}/references/segment-breakdown-interpretation.md (94%) rename plugins/{mixpanel-mcp-in/skills/experiment-results => mixpanel-mcp/skills/interpret-experiment}/references/segment-of-interest-selection.md (95%) rename plugins/{mixpanel-mcp-eu/skills/experiment-results => mixpanel-mcp/skills/interpret-experiment}/references/session-replay-analysis.md (96%) rename plugins/{mixpanel-mcp-in/skills/experiment-results => mixpanel-mcp/skills/interpret-experiment}/references/why-no-statsig.md (94%) diff --git a/README.md b/README.md index 3518635..67b1872 100644 --- a/README.md +++ b/README.md @@ -4,18 +4,18 @@ Plugins that give AI agents Mixpanel expertise. Built on the [Agent Skills](http ## Skills -| Skill | Description | -|---|---| -| [`create-dashboard`](plugins/mixpanel-mcp/skills/create-dashboard/) | Creates a well-designed Mixpanel dashboard with validated data, text cards, and narrative layout. | -| [`deep-research`](plugins/mixpanel-mcp/skills/deep-research/) | Conducts a structured metric investigation in Mixpanel. 
Use when a user asks *why* a metric changed, what's driving a trend, or requests a deep dive or root cause analysis. | -| [`experiment-results`](plugins/mixpanel-mcp/skills/experiment-results/) | Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, make a ship/iterate/kill/wait call, asks why statsig hasn't been reached, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the verdicts that `Get-Experiment` returns — never recomputes thresholds. | -| [`manage-lexicon`](plugins/mixpanel-mcp/skills/manage-lexicon/) | Audits, scores, enriches, and cleans up Lexicon metadata (events and properties) for a Mixpanel project. Supports scoring health, bulk-filling descriptions/tags, resetting metadata, triaging data quality issues, and managing tags. | -| [`tracking-implementation`](plugins/mixpanel-mcp/skills/tracking-implementation/) | Guides an agent through Mixpanel analytics implementation. Supports Quick Start, Full Implementation, Add Tracking, and Audit modes. | +| Skill | Description | +| --------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| [`create-dashboard`](plugins/mixpanel-mcp/skills/create-dashboard/) | Creates a well-designed Mixpanel dashboard with validated data, text cards, and narrative layout. | +| [`deep-research`](plugins/mixpanel-mcp/skills/deep-research/) | Conducts a structured metric investigation in Mixpanel. Use when a user asks _why_ a metric changed, what's driving a trend, or requests a deep dive or root cause analysis. 
| +| [`interpret-experiment`](plugins/mixpanel-mcp/skills/interpret-experiment/) | Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, make a ship/iterate/kill/wait call, asks why statsig hasn't been reached, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the verdicts that `Get-Experiment` returns — never recomputes thresholds. | +| [`manage-lexicon`](plugins/mixpanel-mcp/skills/manage-lexicon/) | Audits, scores, enriches, and cleans up Lexicon metadata (events and properties) for a Mixpanel project. Supports scoring health, bulk-filling descriptions/tags, resetting metadata, triaging data quality issues, and managing tags. | +| [`tracking-implementation`](plugins/mixpanel-mcp/skills/tracking-implementation/) | Guides an agent through Mixpanel analytics implementation. Supports Quick Start, Full Implementation, Add Tracking, and Audit modes. | ### Internal skills -| Skill | Description | -|---|---| +| Skill | Description | +| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | [`review-skill`](.claude/skills/review-skill/) | Reviews a skill against a weighted quality rubric (8 dimensions, 27 checks) and produces a score with actionable issues. Run `/review-skill ` before requesting a code review. | ## Getting Started diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md deleted file mode 100644 index 7bc71c4..0000000 --- a/plugins/mixpanel-mcp-eu/skills/experiment-results/SKILL.md +++ /dev/null @@ -1,110 +0,0 @@ ---- -name: experiment-results -description: Interprets a Mixpanel experiment's results and health checks. 
Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill. -license: Apache-2.0 ---- - -# Experiment Results Interpretation - -You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. Use the decision tree below as the spine; open references only when a step needs depth. - -## Requirements - -- Access to Mixpanel (read experiment details and metrics; update experiment lifecycle for ship/kill decisions). -- This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values. - -## When to use this skill - -Trigger when the user asks anything about reading an experiment's results or its health. Common phrasings: - -- "What do these results mean?" / "Should we ship this?" -- "Is this experiment trustworthy?" / "Why is SRM failing?" -- "Why hasn't this hit statistical significance yet?" -- "Break this down by ``" / "What segments should I look at?" -- "What does this Retro A/A failure mean?" -- "Can you compare the session replays for control vs treatment?" - -Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the `experiment-setup` skill. - ---- - -## How to read experiment-details output - -Always request experiment details with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. 
**Always prefer live, fall back to cache, surface errors.** - -| Concept | Live (preferred) | Cached fallback | -| ---------------------------- | --------------------------------- | ------------------------------------------- | -| Per-variant exposure counts | `live_exposures` | `exposures_cache` (strip `$`-prefixed keys) | -| SRM check | `live_srm_analysis` | `exposures_cache.$srm_analysis` | -| Per-metric per-variant stats | `live_metrics[metricId][variant]` | `results_cache.metrics[metricId][variant]` | -| Bucketed summary | recompute from `live_metrics` | `results_cache.summary` | - -If `live_results_errors` is non-null, use the cache, caveat that data is stale, and surface the error — the underlying failure may need fixing before any decision. If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect." - -The full field map is in [references/experiment-fields.md](references/experiment-fields.md). - ---- - -## The decision tree - -Run in order. **Stop at the first failure** — do not proceed if a step flags a problem. - -1. **Trustworthiness gate** — SRM ok? Exposures sufficient? Retro A/A clean? Minimum duration met (~3 days)? No misconfig? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md). -2. **Statistical significance** — apply the polarity recipe (below) to each non-control variant × primary. If nothing significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md). -3. **Guardrail check** — any guardrail significant in the wrong polarity? Regression → ITERATE not ship. -4. **Practical significance** — convert lift into absolute terms (`baseline_value × lift`). Statistically significant ≠ ships. -5. **Verdict** — see table below. 
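The "prefer live, fall back to cache, surface errors" rule from the experiment-details section can be sketched as follows. This is an illustration only: it assumes the response has already been parsed into nested dicts, and the helper names `_lookup` and `pick_metric_stats` are hypothetical. Only the field names `live_metrics`, `results_cache.metrics`, and `live_results_errors` come from the text.

```python
from typing import Any, Optional


def _lookup(table: Optional[dict], metric_id: str, variant: str) -> Any:
    """Return table[metric_id][variant], tolerating missing or null levels."""
    return ((table or {}).get(metric_id) or {}).get(variant)


def pick_metric_stats(details: dict, metric_id: str, variant: str) -> dict:
    """Choose the data path for one metric/variant pair (hypothetical helper)."""
    errors = details.get("live_results_errors")
    live = _lookup(details.get("live_metrics"), metric_id, variant)
    cached = _lookup((details.get("results_cache") or {}).get("metrics"),
                     metric_id, variant)

    # Live data is preferred when the live computation reported no error.
    if live is not None and errors is None:
        return {"source": "live", "stats": live, "errors": None}
    # Otherwise use the cache; the caller should caveat staleness and
    # surface any live error alongside the numbers.
    if cached is not None:
        return {"source": "cache", "stats": cached, "errors": errors}
    # Nothing usable: report "no result was computed", never "no effect".
    # (Live data with an error and no cache is not covered by the text;
    # this sketch conservatively treats it as no result.)
    return {"source": "none", "stats": None, "errors": errors}
```

Returning the chosen source with the stats keeps the staleness caveat and the error visible to whatever writes the final answer, instead of collapsing them into a bare number.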
- -### Polarity recipe (load-bearing — keep in mind for every metric) - -`summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by business value. `metric.direction` ("up" / "down", defaults to "up") tells you which sign is good: - -- `lift is None` or `lift == 0` → **neutral** -- `direction == "up"` → **positive** if `lift > 0`, else **negative** -- `direction == "down"` → **positive** if `lift < 0`, else **negative** - -A metric in `summary.positive` with `direction: "down"` is a **regression**, not a win. Filter out the control row first (use `settings.controlKey`). The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**. - -Per-metric phrasing (translating lift + CI + p-value into "small win" / "large regression" / "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). The same reference covers the changed-denominator check (Twyman's Law) for any lift >~30%, and how to query the baseline if `value` or `sampleSize` is `null`. - -### Verdict table - -| Situation | Recommendation | -| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Use the experiment's `decide` action with `success=true`, `variant=`, and a `message` rationale. | -| Trust ✓, primary polarity positive, guardrail polarity negative | **ITERATE.** Investigate the regression; do not auto-ship. | -| Trust ✓, primary polarity neutral after target sample reached | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md). 
| -| Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)). | -| Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). | - -For multi-variant tests, the `decide`-call shape, and special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), see [references/experiment-fields.md](references/experiment-fields.md) §Lifecycle hand-off. `message` is required on every `decide` call. - ---- - -## Going deeper - -Open the relevant reference on demand: - -| User asks about… | Open | -| ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | -| SRM failing, Retro A/A failing, exposures insufficient, or any Step 1 fail | [references/health-check-interpretation.md](references/health-check-interpretation.md) | -| "Translate this lift / CI / p-value into English" | [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | -| "Why hasn't this hit statsig yet? Should we wait or stop?" | [references/why-no-statsig.md](references/why-no-statsig.md) | -| "Which segments should I break this down on?" | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | -| "What does this segment-by-segment result mean?" (when platform support exists) | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | -| "Can session replays help explain this result?" | [references/session-replay-analysis.md](references/session-replay-analysis.md) | -| "Which field in the experiment-details response has X?" 
| [references/experiment-fields.md](references/experiment-fields.md) | - ---- - -## Output - -Default to this shape unless the user asks for something else: - -1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`. -2. **Why**, walking through the decision tree steps that mattered (skip the steps that were clearly fine). -3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, with the polarity-corrected reading of each. Include the absolute-impact translation for any win. -4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, etc. -5. **Suggested next action** — the experiment-decide action to take, or the deeper investigation to run. - -If experiment details are unavailable or return errors, say so — do not invent a verdict. diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/experiment-fields.md b/plugins/mixpanel-mcp-eu/skills/experiment-results/references/experiment-fields.md deleted file mode 100644 index 1e65de1..0000000 --- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/experiment-fields.md +++ /dev/null @@ -1,158 +0,0 @@ -# Experiment-Details Field Map - -Quick reference for which experiment-details response field drives each interpretation. Always request the details with `compute_exposures=true, compute_metrics=true`. - -This reference is **read-only domain knowledge** for the agent. It does NOT define thresholds — every "fail condition" listed below is a _characterization_ of how the platform itself already classifies the field, not a threshold this skill should re-apply. 
- ---- - -## Identity & lifecycle - -``` -id, name, description, hypothesis, status, start_date, end_date -creator_email, tags, url, workspace_id -feature_flag_id → for feature-flag-based experiments -settings.controlKey → variant key treated as control (often "control"; may be "") -``` - -`status` is one of `"concluded" | "success" | "fail"` (the UI may additionally show `SUCCESS_DEFERRED` for the special variant constant — see "Decision metadata" below). - ---- - -## Trustworthiness - -``` -live_srm_analysis → SRM verdict (consume — don't recompute) - .p_value - .chi_square -live_exposures[] → per-variant exposure counts (live) -exposures_cache[] → per-variant exposure counts (cached fallback) -exposures_cache.$srm_analysis → cached SRM analysis -exposures_cache.$last_computed → when the cache was last refreshed -settings.srm.enabled → whether the SRM check ran -settings.srm.targetAllocations → expected per-variant allocation (percent) -settings.preExperimentBias → whether Retro A/A was enabled -settings.excludeQA → whether QA traffic was filtered -live_results_errors → non-null = live computation failed; surface and fall back to cache -``` - ---- - -## Per-metric per-variant results - -``` -live_metrics[][] - .value → metric value for this variant - .sampleSize → sample size for this variant on this metric - .lift → (treatment - control) / control (0 for control row) - .liftConfidence → confidence LEVEL used (e.g. 
0.95) — NOT the CI width - .significance → "YES_POSITIVE" | "YES_NEGATIVE" | "NO" (sign-of-lift, NOT polarity) - -results_cache.metrics[][] → cached fallback, same shape -``` - ---- - -## Bucketed summary - -``` -results_cache.summary.positive[] → items with significance == "YES_POSITIVE" (lift > 0, sig) -results_cache.summary.negative[] → items with significance == "YES_NEGATIVE" (lift < 0, sig) -results_cache.summary.no[] → items with significance == "NO" - -Each item: - .metricId - .variant - .value - .lift - .liftConfidence - .sampleSize - .significance -``` - -**Pre-process the summary**: filter rows where `variant == settings.controlKey` (control-vs-control is mechanical noise), then apply the polarity recipe before drawing any conclusion. - ---- - -## Metric catalog (for polarity lookups) - -``` -metrics[] - .id, .name - .type ("primary" | "guardrail" | "secondary") - .direction ("up" | "down") → always set; defaults to "up" if the source metric was unset -``` - -Build a lookup `metric_id → (type, direction)` and join to summary rows during interpretation. - ---- - -## Settings that change interpretation - -``` -settings.confidenceLevel → significance threshold (e.g. 
0.95) -settings.testingModel → "frequentist" or "sequential" -settings.endCondition → "sample_size" or "days" -settings.sampleSize / .endAfterDays → planned end target -settings.multipleTestingCorrection → "off" | "bonferroni" | "benjamini-hochberg" -settings.cuped.enabled → CUPED variance reduction applied -settings.cuped.preExposureDatePreset → pre-exposure window -settings.winsorization.enabled → outlier capping applied -settings.winsorization.percentile → cap percentile (default 95; lower values are extreme) -``` - ---- - -## Decision metadata (post-decide) - -``` -results_cache.message → decision rationale -results_cache.variant → shipped variant key (or special constant) -status → "concluded" | "success" | "fail" -``` - -Special variant constants for `success=true`: - -- `__no_variant_shipped__` — ship the change without picking a variant. -- `__defer_variant_decision__` — defer (UI shows `SUCCESS_DEFERRED`). - -For a kill, pass `success=false`. - ---- - -## Lifecycle hand-off - -To ship/kill, update the experiment with the `decide` action and these fields: - -``` -action → "decide" -success → true | false -variant → "" # required when success=true -message → "" -``` - -`message` is required on every `decide` call. - ---- - -## Misconfig field map (cross-link) - -For _how_ to react to each of these, see [health-check-interpretation.md](health-check-interpretation.md) §7. 
- -- `settings.multipleTestingCorrection in {"off", null}` with 2+ primaries × 1+ non-control variants -- `settings.winsorization.enabled == true` with `percentile` very low (< ~80) or very high (> ~99) -- `settings.srm == null` OR `settings.srm.enabled == false` (often intentional — only flag if results look suspicious) -- `settings.cuped.enabled == true` AND the experiment cohort is "new users only" -- `settings.confidenceLevel != 0.95` -- `metrics[]` entries with `name == ""` -- A primary metric in `metrics[]` but missing from `live_metrics` AND `results_cache.metrics` - ---- - -## When to reach for sibling capabilities - -- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the `experiment-setup` skill. -- **Raw data for triggered or segmentation analysis** → run a query on the metric with appropriate filters. -- **Acting on the recommendation** (ship, kill, extend) → update the experiment with the appropriate action. -- **Feature-flag rollout history** for SRM root cause → inspect the linked flag's state. -- **Session replays** for behavioral explanation of a quantitative result → see [session-replay-analysis.md](session-replay-analysis.md). diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md new file mode 100644 index 0000000..c2d7591 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md @@ -0,0 +1,127 @@ +--- +name: interpret-experiment +description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. 
Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill. +license: Apache-2.0 +--- + +# Interpret Experiment + +You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a verdict field is missing, say so — do not synthesize one from raw values. + +--- + +# Glossary + +Concepts the rest of this skill uses without redefining. + +- **Variant.** One arm of the experiment. The variant treated as the baseline is the **control**; the others are **treatments**. The platform marks which key is the control. +- **Primary / Guardrail / Secondary metric.** + - **Primary** — drives the ship decision. The platform applies multiple-testing correction across primaries when configured. + - **Guardrail** — a metric that must not regress; a guardrail loss vetoes a ship even when primaries win. + - **Secondary** — exploratory only, never decisional, no correction applied. +- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict. +- **Polarity.** Whether a movement is _good for the business_. Combines sign of lift with the metric's `direction` ("up" = bigger is better; "down" = smaller is better). See the **Polarity recipe** in Components. +- **Significance.** The platform's per-row classification: `YES_POSITIVE`, `YES_NEGATIVE`, or `NO`. Read from the response — do not recompute. +- **SRM (Sample Ratio Mismatch).** Variants received traffic in proportions that disagree with the configured split. **Kohavi's #1 trustworthiness check** — when SRM fails, downstream lift, p-values, and CIs cannot be trusted. +- **Retro A/A (pre-experiment bias).** Re-runs the comparison on the pre-exposure period. 
A failure means cohorts already differed before treatment started. +- **Twyman's Law.** "Any unusually clean or unusually large result is more likely a bug than a discovery." Apply on lifts > ~30% — usually a changed-denominator artifact. +- **CUPED.** Variance reduction using pre-exposure baseline. Cuts required sample 30–70% when it applies. Inert on new-user-only cohorts. +- **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95. +- **MDE (Minimum Detectable Effect).** The smallest lift the experiment was sized to detect. Set during experiment setup. +- **Trustworthiness gate.** The pre-flight check in Step 1 of the Decision Tree: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference. + +--- + +# Components + +The pieces every interpretation uses. Defined here once so they don't drift across the steps and references. + +## Polarity recipe (load-bearing — apply on every metric row) + +The platform's summary buckets (`positive` / `negative` / `no`) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion. + +Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"): + +- `lift is None` or `lift == 0` → **neutral** (no measurement / no effect respectively). +- `direction == "up"` → **positive** if `lift > 0`, else **negative**. +- `direction == "down"` → **positive** if `lift < 0`, else **negative**. + +A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first (use `settings.controlKey`). + +The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**. 
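
The polarity recipe above can be sketched in a few lines of Python. This is an illustrative sketch, not the platform's implementation: the helper names `polarity` and `classify_rows` are hypothetical, and the row shape (plain dicts with `metricId`, `variant`, and `lift` keys) is an assumption based on the summary fields this patch describes.

```python
from typing import Optional


def polarity(lift: Optional[float], direction: str = "up") -> str:
    """Classify one metric row by business value, not by sign of lift."""
    if lift is None or lift == 0:
        # None means no measurement; 0 means no effect. Both read as neutral.
        return "neutral"
    if direction == "down":
        # Smaller is better, so a negative lift is the good outcome.
        return "positive" if lift < 0 else "negative"
    # Anything else is treated as "up", matching the documented default.
    return "positive" if lift > 0 else "negative"


def classify_rows(rows, metrics, control_key):
    """Drop control rows, then attach a polarity to each remaining row.

    rows        -- summary items (assumed dicts with metricId, variant, lift)
    metrics     -- metric catalog entries (assumed dicts with id, direction)
    control_key -- the value of settings.controlKey
    """
    direction_by_id = {m["id"]: m.get("direction") or "up" for m in metrics}
    classified = []
    for row in rows:
        if row["variant"] == control_key:
            continue  # control-vs-control carries no information
        direction = direction_by_id.get(row["metricId"], "up")
        classified.append({**row, "polarity": polarity(row.get("lift"), direction)})
    return classified
```

The sketch only reads `lift` and `direction`. It deliberately does not touch `significance`, which should still be taken from the response as-is.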
+ +## Data-source fallback + +Experiment-details has two parallel data paths — live (preferred) and cached. Always prefer live; if live computation failed, fall back to cache with a staleness caveat; if **both** are empty, say "no result was computed" and recommend a re-sync. **Never** silently treat missing data as "no effect." + +## Verdict table + +| Situation | Recommendation | +| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Conclude the experiment via its decide lifecycle action, naming the winning variant and a rationale message. **Confirm with the user first — concluding is irreversible.** | +| Trust ✓, primary polarity positive, guardrail polarity negative | **ITERATE.** Investigate the regression; do not auto-ship. | +| Trust ✓, primary polarity neutral after target sample reached | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md). | +| Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)). | +| Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). | + +For multi-variant tests, special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), and the exact decide-call shape, see [references/lifecycle-handoff.md](references/lifecycle-handoff.md). + +--- + +# Steps + +Top-down: what to do, in order. + +## 1. Fetch the experiment + +Request the experiment details with exposure and metric data included. 
The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments. + +Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret. + +## 2. Run the trustworthiness gate (the Decision Tree) + +Run steps 2a–2e in order. **Stop at the first failure** — do not proceed if a step flags a problem. The platform attaches verdict fields for each check; consume those verdicts rather than recomputing. + +### 2a. Trustworthiness + +SRM ok? Retro A/A clean? Exposures sufficient? Minimum duration met (~3 days)? No misconfiguration? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md). The Misconfigurations section in that reference covers the warning-level signals (multiple-testing off, extreme winsorization, CUPED on new-users-only, etc.). + +### 2b. Statistical significance + +Apply the **polarity recipe** from Components to each non-control variant × primary metric. If nothing is significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md). For translating a single metric's lift / CI / p-value into a phrase, see [references/per-metric-interpretation.md](references/per-metric-interpretation.md). + +### 2c. Guardrail check + +Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a row in `summary.positive` for a `direction: "down"` guardrail is still a regression. + +### 2d. Practical significance + +Convert lift into absolute terms — multiply by the control baseline. Statistically significant ≠ ships. The per-metric reference covers the baseline-fetch fallback when `value` or `sampleSize` is missing, and the **Twyman's Law** check for any lift > ~30%. + +### 2e. Verdict + +Look up the situation in the **Verdict table** in Components. 
If the recommendation is SHIP or KILL, surface the proposed decide-action parameters and **wait for explicit user confirmation** before executing — concluding an experiment is irreversible. + +## 3. Going deeper (open references on demand) + +| User asks about… | Open | +| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | +| SRM failing, Retro A/A failing, exposures insufficient, or any trustworthiness fail | [references/health-check-interpretation.md](references/health-check-interpretation.md) | +| "Translate this lift / CI / p-value into English" | [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | +| "Why hasn't this hit statsig yet? Should we wait or stop?" | [references/why-no-statsig.md](references/why-no-statsig.md) | +| "Which segments should I break this down on?" | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | +| "What does this segment-by-segment result mean?" | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | +| "Can session replays help explain this result?" | [references/session-replay-analysis.md](references/session-replay-analysis.md) | +| "How do I actually conclude this experiment? Multi-variant ship?" | [references/lifecycle-handoff.md](references/lifecycle-handoff.md) | + +## 4. Output + +Default to this shape unless the user asks for something else: + +1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`. +2. **Why**, walking through the trustworthiness-gate steps that mattered (skip steps that were clearly fine). +3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, each polarity-corrected. Include absolute-impact translation for any win. +4. 
**Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, stale-cache caveat, etc. +5. **Suggested next action** — for SHIP / KILL, the proposed decide-action parameters **gated on user confirmation**; for ITERATE / WAIT, the investigation to run next. + +If experiment details are unavailable or return errors, say so — do not invent a verdict. diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md similarity index 73% rename from plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md rename to plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md index 9ec66df..e9082fa 100644 --- a/plugins/mixpanel-mcp/skills/experiment-results/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md @@ -1,8 +1,8 @@ # Health-Check Interpretation -Open this when Step 1 of the Decision Tree flags a failure (SRM, Retro A/A, insufficient exposures, peeking, broken-data, < 3-day window, or any misconfiguration). The goal is to turn the platform's already-computed verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action. +Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action. -**This skill never recomputes thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers. +**Never recompute thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers. --- @@ -134,17 +134,65 @@ If `endCondition: "sample_size"` with a tiny target (e.g. 
10k) was reached in ho --- -## 7. Misconfigurations to flag during Step 1 +## 7. Misconfigurations -These don't always invalidate results, but they change how to _read_ them. Surface them as warnings. +These don't always invalidate results, but they change how to _read_ them. Surface them as warnings during the trustworthiness gate. -- `settings.multipleTestingCorrection in {"off", null}` AND there are 2+ primary metrics across 1+ non-control variants → without correction, any single significant primary may be a false positive. **Don't assume the result is broken** — look at all primary results in aggregate. If most or all primaries point the same direction (all positive or all negative), there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk**, and the user can enable correction (Benjamini-Hochberg or Bonferroni) and re-analyze. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). -- `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` very low (e.g. < ~80) or unusually high (e.g. > ~99) → extreme outlier capping. The platform's default is 95; a percentile near 50 caps almost all data and likely indicates misconfiguration. -- `settings.srm == null` OR `settings.srm.enabled == false` → the SRM check didn't run. **SRM is often deliberately disabled** (e.g. when feature-flag rollouts intentionally split traffic unevenly), so do not try to compute it yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) — then suggest the user re-enable SRM and re-analyze. -- `settings.cuped.enabled == true` AND the experiment cohort is "new users only" → CUPED requires pre-exposure data, which new-user experiments lack, so CUPED simply has no effect. 
**This does NOT invalidate results** — variance reduction just didn't happen. Mention it as informational. -- `settings.confidenceLevel != 0.95` → call out explicitly. `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Combine with metric count for a sense of family-wise error rate. -- `metrics[]` contains entries with `name == ""` → likely a broken or placeholder metric reference. Flag and skip during analysis. -- A primary metric appears in `metrics[]` but is **missing from `live_metrics` AND `results_cache.metrics`** → no result was computed for that primary. Surface prominently — this is "no measurement," not "no effect." Recommend the user re-sync results. +### Multiple-testing correction off with several primaries + +**Condition**: `settings.multipleTestingCorrection` is `"off"` or `null` AND there are 2+ primary metrics across 1+ non-control variants. + +**Interpretation**: any single significant primary may be a false positive. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). + +**Action**: look at all primary results in aggregate. If most point the same direction, there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk** — the user can enable Benjamini-Hochberg or Bonferroni and re-analyze. + +### Extreme winsorization percentile + +**Condition**: `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` is very low (e.g. < ~80) or unusually high (e.g. > ~99). + +**Interpretation**: outlier capping is far from the platform's default of 95. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. + +**Action**: ask the user to confirm the percentile was intentional; recommend resetting to 95 unless they have a specific reason. 
+ +### SRM check disabled + +**Condition**: `settings.srm == null` OR `settings.srm.enabled == false`. + +**Interpretation**: the SRM check didn't run. **Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug. + +**Action**: only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios). When you do flag, recommend re-enabling SRM and re-analyzing. + +### CUPED on new-users-only cohort + +**Condition**: `settings.cuped.enabled == true` AND the experiment cohort is "new users only". + +**Interpretation**: CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen. + +**Action**: mention as informational; no remediation needed for this experiment. For future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply. + +### Non-default confidence level + +**Condition**: `settings.confidenceLevel != 0.95`. + +**Interpretation**: `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. + +**Action**: call out explicitly in the verdict. Combine with metric count to estimate the family-wise error rate. + +### Broken or placeholder metric entries + +**Condition**: `metrics[]` contains entries with `name == ""`. + +**Interpretation**: likely a broken or placeholder metric reference. + +**Action**: flag and skip during analysis. + +### Primary metric with no computed result + +**Condition**: a primary metric appears in `metrics[]` but is **missing from both** `live_metrics` and `results_cache.metrics`. + +**Interpretation**: no result was computed for that primary. **This is "no measurement," not "no effect."** + +**Action**: surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary. 
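
For a sense of where the ~54% figure in the multiple-testing entry comes from, here is a small arithmetic sketch. It assumes the comparisons are independent and uncorrected, which real metrics rarely are, so treat the output as a rough estimate rather than a platform-computed value. The function name is hypothetical.

```python
def family_wise_error_rate(num_primaries: int,
                           num_treatment_variants: int = 1,
                           confidence_level: float = 0.95) -> float:
    """Chance of at least one false positive across uncorrected comparisons.

    Assumes every primary-by-variant comparison is independent.
    """
    alpha = 1.0 - confidence_level
    comparisons = num_primaries * num_treatment_variants
    return 1.0 - (1.0 - alpha) ** comparisons


# 15 primaries, 1 treatment, default 95% confidence: roughly 0.54
print(round(family_wise_error_rate(15), 2))
```

The same function shows why a non-default `settings.confidenceLevel` deserves a call-out: lowering the confidence level raises alpha, and the family-wise rate climbs with it.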
--- diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md new file mode 100644 index 0000000..4d8189d --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md @@ -0,0 +1,39 @@ +# Lifecycle Hand-off + +How to conclude an experiment once the verdict is settled. This reference is **interpretation guidance** — the per-field schema of the decide action lives in the experiment-update tool description. + +--- + +## Confirm before concluding — always + +Concluding an experiment is **irreversible**. Before invoking the decide action, surface the proposed parameters to the user (winning variant, success/fail, rationale message) and wait for explicit confirmation. A SHIP verdict is a recommendation, not an authorization. + +## The three pieces every decide call needs + +A decide call expresses three things: + +1. **Did the experiment succeed?** A win for one of the treatments, or a deliberate stop. +2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special constants below. +3. **Why?** A rationale message — what metrics were evaluated, the polarity reading, the tradeoffs accepted. The platform requires this on every decide call; treat it as a one-paragraph decision record, not a placeholder. + +## Special variant choices for success + +When you have a winning result but no single variant to ship: + +- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. (The platform exposes this as the constant `__no_variant_shipped__`.) +- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The platform exposes this as `__defer_variant_decision__` and shows `SUCCESS_DEFERRED` in the UI.) 
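
As a rough sketch of how the three pieces fit together, the helper below assembles the proposed parameters so they can be shown to the user for confirmation. It does not call any API. The field names (`action`, `success`, `variant`, `message`) are taken from the superseded field map elsewhere in this patch and are an assumption here; the authoritative schema is the experiment-update tool description. The function name `build_decide_params` is hypothetical.

```python
from typing import Optional

# Special constants named in this reference.
NO_VARIANT_SHIPPED = "__no_variant_shipped__"
DEFER_VARIANT_DECISION = "__defer_variant_decision__"


def build_decide_params(success: bool,
                        message: str,
                        variant: Optional[str] = None) -> dict:
    """Assemble proposed decide parameters for user review (no API call)."""
    if not message or not message.strip():
        raise ValueError("a rationale message is required on every decide call")
    params = {"action": "decide", "success": success, "message": message.strip()}
    if success:
        if not variant:
            raise ValueError("variant is required when success is true")
        params["variant"] = variant
    # On a kill (success is false) no variant key is attached,
    # even if one was passed in.
    return params
```

A caller would print the returned dict, wait for explicit confirmation, and only then hand it to the tool layer. For a deferred decision, pass `DEFER_VARIANT_DECISION` as the variant.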
+ +When the verdict is KILL — no winner — record success as false. No variant key is needed in that case. + +## Multi-variant experiments + +For a 3+ arm test, the decide action still names a single winning variant. If two treatments are roughly tied: + +- If both clear the practical-significance bar and shipping either is acceptable, pick on simplicity (smaller diff from control, lower implementation cost). +- If the team genuinely cannot pick, use the defer constant above — better than fabricating a winner. + +A multi-variant test where only one treatment is significantly different from control is a clean SHIP for that variant; the inconclusive arms are simply not the winner. + +## After concluding + +The decision record (`results_cache.message`, `results_cache.variant`, and `status` transitioning to `concluded` / `success` / `fail`) becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer. diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md similarity index 87% rename from plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md rename to plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md index 1e8678c..3f272ad 100644 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md @@ -1,6 +1,6 @@ # Per-Metric Interpretation -Open this when the user wants you to translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_ +Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. 
_"what does this single row of `summary` actually mean?"_ **Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from the experiment-details response. Then translate. @@ -19,28 +19,22 @@ A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any --- -## Polarity recipe (repeat from the spine — critical) +## Polarity recipe -`metric.direction` is `"up"` or `"down"` (defaults to `"up"`). +Apply the polarity recipe from the spine — see the **Components** section of `SKILL.md`. Treat the bucket name in `summary.positive` / `summary.negative` as sign-of-lift only; the business verdict comes from combining it with `metric.direction`. Examples worth remembering: -- `lift is None` or `lift == 0` → **neutral** (treat as no measurement / no effect respectively). -- `direction == "up"` → **positive** if `lift > 0`, else **negative**. -- `direction == "down"` → **positive** if `lift < 0`, else **negative**. - -A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. A `-1% interstitials_shown` lift in `summary.negative` with `direction: "down"` is plausibly a **win** (less interruption). +- A row in `summary.positive` with `direction: "down"` is a **regression**. +- A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1% interstitials_shown` lift means less interruption). --- -## Reading the p-value correctly +## Reading the p-value in this platform -The p-value is the probability of observing a difference at least as extreme as the one measured, **assuming the null hypothesis (no real difference) is true**. It is NOT: +Mixpanel uses Welch's t-test (z-test for large samples) at α = 0.05 / 95% confidence by default. 
The confidence level is set on `settings.confidenceLevel`; if it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). -- ❌ The probability that the treatment works. -- ❌ The probability the result will replicate. -- ❌ A measure of effect size — a tiny lift can be highly significant on a huge sample. -- ❌ Proof of "no effect" when above threshold (see "Inconclusive results"). +The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread. -Mixpanel uses Welch's t-test (z-test for large samples). Default α = 0.05 at 95% confidence. The confidence level is set on `settings.confidenceLevel`. If it differs from 0.95, call it out in the verdict (`0.9` inflates false positives; `0.99` is conservative). +For the general meaning of a p-value (the probability under the null), trust the model's baseline knowledge — don't invent thresholds in either direction. --- @@ -50,7 +44,6 @@ Mixpanel uses Welch's t-test (z-test for large samples). Default α = 0.05 at 95 lift = (treatment_mean - control_mean) / control_mean ``` -- `liftConfidence` is the **confidence level used** (e.g. 0.95). It is NOT the confidence-interval width. - **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct. - If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect." @@ -125,7 +118,7 @@ Different metric types behave differently; cite the relevant nuance in your verd ## Variance-reduction & outlier settings that change interpretation - **CUPED** (`settings.cuped.enabled == true`): mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. 
Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). -- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration (see `health-check-interpretation.md` §Misconfig). +- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). --- diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md similarity index 94% rename from plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md rename to plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md index fcf9cfd..e0c43d2 100644 --- a/plugins/mixpanel-mcp/skills/experiment-results/references/segment-breakdown-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md @@ -1,6 +1,6 @@ # Segment-Breakdown Interpretation -Open this when the user has per-segment results in hand and wants to read them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. 
+Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. > **Platform support status.** Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts. @@ -49,7 +49,7 @@ Each segment value needs its own meaningful per-variant sample for the per-segme | Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. | | Two opposite-direction effects in different segments that roughly cancel overall | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses. | -When you spot Simpson's paradox, route the user to [health-check-interpretation.md](health-check-interpretation.md) §SRM — it's usually the cause, not a real reversal. +When you spot Simpson's paradox, route the user to the **SRM** section of [health-check-interpretation.md](health-check-interpretation.md) — bucketing is usually the cause, not a real reversal. 
--- diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-of-interest-selection.md similarity index 95% rename from plugins/mixpanel-mcp/skills/experiment-results/references/segment-of-interest-selection.md rename to plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-of-interest-selection.md index ea9f22b..b0c8f58 100644 --- a/plugins/mixpanel-mcp/skills/experiment-results/references/segment-of-interest-selection.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-of-interest-selection.md @@ -1,6 +1,6 @@ # Segment-of-Interest Selection -Open this when the user wants to break results down by user segments — _"slice this by platform"_, _"which segments should I look at?"_, _"are new users responding differently?"_. The goal is to pick 3–5 segments that are **likely to reveal a real effect difference**, before slicing every available dimension and ending up p-hacking. +Pick 3–5 segments **likely to reveal a real effect difference** before slicing every available dimension and ending up p-hacking. The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them. 
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/session-replay-analysis.md similarity index 96% rename from plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md rename to plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/session-replay-analysis.md index b758b8e..59ad25e 100644 --- a/plugins/mixpanel-mcp/skills/experiment-results/references/session-replay-analysis.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/session-replay-analysis.md @@ -1,6 +1,6 @@ # Session-Replay Analysis Guidance -Open this when the user wants to use session replays to explain a quantitative experiment result — _"why is conversion down in treatment?"_, _"what are users actually doing in the treatment?"_, _"can replays explain the regression?"_. The goal is to turn a number into a behavior story. +Turn a quantitative experiment result into a behavior story using session replays. > **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. 
diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md similarity index 94% rename from plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md rename to plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md index 142089c..a4e69d4 100644 --- a/plugins/mixpanel-mcp/skills/experiment-results/references/why-no-statsig.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md @@ -1,8 +1,8 @@ # Why Hasn't This Reached Statistical Significance Yet? -Open this when the user asks why a primary metric is still inconclusive — _"why isn't this stat-sig yet?"_, _"should I wait or stop?"_, _"is this just underpowered?"_. The goal is to help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null**, _without_ recomputing the platform's verdicts. +Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts. -The actual stop / extend math (sample size, power, MDE) is owned by the **setup-side skill** — point the user there for the formulas. This skill explains _which_ lever to pull, not how to recompute one. +The actual stop / extend math (sample size, power, MDE) is owned by the `experiment-setup` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one. --- diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md deleted file mode 100644 index 7bc71c4..0000000 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/SKILL.md +++ /dev/null @@ -1,110 +0,0 @@ ---- -name: experiment-results -description: Interprets a Mixpanel experiment's results and health checks. 
Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill. -license: Apache-2.0 ---- - -# Experiment Results Interpretation - -You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. Use the decision tree below as the spine; open references only when a step needs depth. - -## Requirements - -- Access to Mixpanel (read experiment details and metrics; update experiment lifecycle for ship/kill decisions). -- This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values. - -## When to use this skill - -Trigger when the user asks anything about reading an experiment's results or its health. Common phrasings: - -- "What do these results mean?" / "Should we ship this?" -- "Is this experiment trustworthy?" / "Why is SRM failing?" -- "Why hasn't this hit statistical significance yet?" -- "Break this down by ``" / "What segments should I look at?" -- "What does this Retro A/A failure mean?" -- "Can you compare the session replays for control vs treatment?" - -Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the `experiment-setup` skill. - ---- - -## How to read experiment-details output - -Always request experiment details with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. 
**Always prefer live, fall back to cache, surface errors.** - -| Concept | Live (preferred) | Cached fallback | -| ---------------------------- | --------------------------------- | ------------------------------------------- | -| Per-variant exposure counts | `live_exposures` | `exposures_cache` (strip `$`-prefixed keys) | -| SRM check | `live_srm_analysis` | `exposures_cache.$srm_analysis` | -| Per-metric per-variant stats | `live_metrics[metricId][variant]` | `results_cache.metrics[metricId][variant]` | -| Bucketed summary | recompute from `live_metrics` | `results_cache.summary` | - -If `live_results_errors` is non-null, use the cache, caveat that data is stale, and surface the error — the underlying failure may need fixing before any decision. If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect." - -The full field map is in [references/experiment-fields.md](references/experiment-fields.md). - ---- - -## The decision tree - -Run in order. **Stop at the first failure** — do not proceed if a step flags a problem. - -1. **Trustworthiness gate** — SRM ok? Exposures sufficient? Retro A/A clean? Minimum duration met (~3 days)? No misconfig? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md). -2. **Statistical significance** — apply the polarity recipe (below) to each non-control variant × primary. If nothing significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md). -3. **Guardrail check** — any guardrail significant in the wrong polarity? Regression → ITERATE not ship. -4. **Practical significance** — convert lift into absolute terms (`baseline_value × lift`). Statistically significant ≠ ships. -5. **Verdict** — see table below. 
- -### Polarity recipe (load-bearing — keep in mind for every metric) - -`summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by business value. `metric.direction` ("up" / "down", defaults to "up") tells you which sign is good: - -- `lift is None` or `lift == 0` → **neutral** -- `direction == "up"` → **positive** if `lift > 0`, else **negative** -- `direction == "down"` → **positive** if `lift < 0`, else **negative** - -A metric in `summary.positive` with `direction: "down"` is a **regression**, not a win. Filter out the control row first (use `settings.controlKey`). The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**. - -Per-metric phrasing (translating lift + CI + p-value into "small win" / "large regression" / "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). The same reference covers the changed-denominator check (Twyman's Law) for any lift >~30%, and how to query the baseline if `value` or `sampleSize` is `null`. - -### Verdict table - -| Situation | Recommendation | -| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Use the experiment's `decide` action with `success=true`, `variant=`, and a `message` rationale. | -| Trust ✓, primary polarity positive, guardrail polarity negative | **ITERATE.** Investigate the regression; do not auto-ship. | -| Trust ✓, primary polarity neutral after target sample reached | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md). 
| -| Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)). | -| Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). | - -For multi-variant tests, the `decide`-call shape, and special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), see [references/experiment-fields.md](references/experiment-fields.md) §Lifecycle hand-off. `message` is required on every `decide` call. - ---- - -## Going deeper - -Open the relevant reference on demand: - -| User asks about… | Open | -| ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | -| SRM failing, Retro A/A failing, exposures insufficient, or any Step 1 fail | [references/health-check-interpretation.md](references/health-check-interpretation.md) | -| "Translate this lift / CI / p-value into English" | [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | -| "Why hasn't this hit statsig yet? Should we wait or stop?" | [references/why-no-statsig.md](references/why-no-statsig.md) | -| "Which segments should I break this down on?" | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | -| "What does this segment-by-segment result mean?" (when platform support exists) | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | -| "Can session replays help explain this result?" | [references/session-replay-analysis.md](references/session-replay-analysis.md) | -| "Which field in the experiment-details response has X?" 
| [references/experiment-fields.md](references/experiment-fields.md) | - ---- - -## Output - -Default to this shape unless the user asks for something else: - -1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`. -2. **Why**, walking through the decision tree steps that mattered (skip the steps that were clearly fine). -3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, with the polarity-corrected reading of each. Include the absolute-impact translation for any win. -4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, etc. -5. **Suggested next action** — the experiment-decide action to take, or the deeper investigation to run. - -If experiment details are unavailable or return errors, say so — do not invent a verdict. diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/experiment-fields.md b/plugins/mixpanel-mcp-in/skills/experiment-results/references/experiment-fields.md deleted file mode 100644 index 1e65de1..0000000 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/experiment-fields.md +++ /dev/null @@ -1,158 +0,0 @@ -# Experiment-Details Field Map - -Quick reference for which experiment-details response field drives each interpretation. Always request the details with `compute_exposures=true, compute_metrics=true`. - -This reference is **read-only domain knowledge** for the agent. It does NOT define thresholds — every "fail condition" listed below is a _characterization_ of how the platform itself already classifies the field, not a threshold this skill should re-apply. 
- ---- - -## Identity & lifecycle - -``` -id, name, description, hypothesis, status, start_date, end_date -creator_email, tags, url, workspace_id -feature_flag_id → for feature-flag-based experiments -settings.controlKey → variant key treated as control (often "control"; may be "") -``` - -`status` is one of `"concluded" | "success" | "fail"` (the UI may additionally show `SUCCESS_DEFERRED` for the special variant constant — see "Decision metadata" below). - ---- - -## Trustworthiness - -``` -live_srm_analysis → SRM verdict (consume — don't recompute) - .p_value - .chi_square -live_exposures[] → per-variant exposure counts (live) -exposures_cache[] → per-variant exposure counts (cached fallback) -exposures_cache.$srm_analysis → cached SRM analysis -exposures_cache.$last_computed → when the cache was last refreshed -settings.srm.enabled → whether the SRM check ran -settings.srm.targetAllocations → expected per-variant allocation (percent) -settings.preExperimentBias → whether Retro A/A was enabled -settings.excludeQA → whether QA traffic was filtered -live_results_errors → non-null = live computation failed; surface and fall back to cache -``` - ---- - -## Per-metric per-variant results - -``` -live_metrics[][] - .value → metric value for this variant - .sampleSize → sample size for this variant on this metric - .lift → (treatment - control) / control (0 for control row) - .liftConfidence → confidence LEVEL used (e.g. 
0.95) — NOT the CI width - .significance → "YES_POSITIVE" | "YES_NEGATIVE" | "NO" (sign-of-lift, NOT polarity) - -results_cache.metrics[][] → cached fallback, same shape -``` - ---- - -## Bucketed summary - -``` -results_cache.summary.positive[] → items with significance == "YES_POSITIVE" (lift > 0, sig) -results_cache.summary.negative[] → items with significance == "YES_NEGATIVE" (lift < 0, sig) -results_cache.summary.no[] → items with significance == "NO" - -Each item: - .metricId - .variant - .value - .lift - .liftConfidence - .sampleSize - .significance -``` - -**Pre-process the summary**: filter rows where `variant == settings.controlKey` (control-vs-control is mechanical noise), then apply the polarity recipe before drawing any conclusion. - ---- - -## Metric catalog (for polarity lookups) - -``` -metrics[] - .id, .name - .type ("primary" | "guardrail" | "secondary") - .direction ("up" | "down") → always set; defaults to "up" if the source metric was unset -``` - -Build a lookup `metric_id → (type, direction)` and join to summary rows during interpretation. - ---- - -## Settings that change interpretation - -``` -settings.confidenceLevel → significance threshold (e.g. 
0.95) -settings.testingModel → "frequentist" or "sequential" -settings.endCondition → "sample_size" or "days" -settings.sampleSize / .endAfterDays → planned end target -settings.multipleTestingCorrection → "off" | "bonferroni" | "benjamini-hochberg" -settings.cuped.enabled → CUPED variance reduction applied -settings.cuped.preExposureDatePreset → pre-exposure window -settings.winsorization.enabled → outlier capping applied -settings.winsorization.percentile → cap percentile (default 95; lower values are extreme) -``` - ---- - -## Decision metadata (post-decide) - -``` -results_cache.message → decision rationale -results_cache.variant → shipped variant key (or special constant) -status → "concluded" | "success" | "fail" -``` - -Special variant constants for `success=true`: - -- `__no_variant_shipped__` — ship the change without picking a variant. -- `__defer_variant_decision__` — defer (UI shows `SUCCESS_DEFERRED`). - -For a kill, pass `success=false`. - ---- - -## Lifecycle hand-off - -To ship/kill, update the experiment with the `decide` action and these fields: - -``` -action → "decide" -success → true | false -variant → "" # required when success=true -message → "" -``` - -`message` is required on every `decide` call. - ---- - -## Misconfig field map (cross-link) - -For _how_ to react to each of these, see [health-check-interpretation.md](health-check-interpretation.md) §7. 
- -- `settings.multipleTestingCorrection in {"off", null}` with 2+ primaries × 1+ non-control variants -- `settings.winsorization.enabled == true` with `percentile` very low (< ~80) or very high (> ~99) -- `settings.srm == null` OR `settings.srm.enabled == false` (often intentional — only flag if results look suspicious) -- `settings.cuped.enabled == true` AND the experiment cohort is "new users only" -- `settings.confidenceLevel != 0.95` -- `metrics[]` entries with `name == ""` -- A primary metric in `metrics[]` but missing from `live_metrics` AND `results_cache.metrics` - ---- - -## When to reach for sibling capabilities - -- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the `experiment-setup` skill. -- **Raw data for triggered or segmentation analysis** → run a query on the metric with appropriate filters. -- **Acting on the recommendation** (ship, kill, extend) → update the experiment with the appropriate action. -- **Feature-flag rollout history** for SRM root cause → inspect the linked flag's state. -- **Session replays** for behavioral explanation of a quantitative result → see [session-replay-analysis.md](session-replay-analysis.md). diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md new file mode 100644 index 0000000..c2d7591 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md @@ -0,0 +1,127 @@ +--- +name: interpret-experiment +description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. 
Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill. +license: Apache-2.0 +--- + +# Interpret Experiment + +You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a verdict field is missing, say so — do not synthesize one from raw values. + +--- + +# Glossary + +Concepts the rest of this skill uses without redefining. + +- **Variant.** One arm of the experiment. The variant treated as the baseline is the **control**; the others are **treatments**. The platform marks which key is the control. +- **Primary / Guardrail / Secondary metric.** + - **Primary** — drives the ship decision. The platform applies multiple-testing correction across primaries when configured. + - **Guardrail** — a metric that must not regress; a guardrail loss vetoes a ship even when primaries win. + - **Secondary** — exploratory only, never decisional, no correction applied. +- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict. +- **Polarity.** Whether a movement is _good for the business_. Combines sign of lift with the metric's `direction` ("up" = bigger is better; "down" = smaller is better). See the **Polarity recipe** in Components. +- **Significance.** The platform's per-row classification: `YES_POSITIVE`, `YES_NEGATIVE`, or `NO`. Read from the response — do not recompute. +- **SRM (Sample Ratio Mismatch).** Variants received traffic in proportions that disagree with the configured split. **Kohavi's #1 trustworthiness check** — when SRM fails, downstream lift, p-values, and CIs cannot be trusted. +- **Retro A/A (pre-experiment bias).** Re-runs the comparison on the pre-exposure period. 
A failure means cohorts already differed before treatment started. +- **Twyman's Law.** "Any unusually clean or unusually large result is more likely a bug than a discovery." Apply on lifts > ~30% — usually a changed-denominator artifact. +- **CUPED.** Variance reduction using pre-exposure baseline. Cuts required sample 30–70% when it applies. Inert on new-user-only cohorts. +- **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95. +- **MDE (Minimum Detectable Effect).** The smallest lift the experiment was sized to detect. Set during experiment setup. +- **Trustworthiness gate.** The pre-flight check in Step 1 of the Decision Tree: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference. + +--- + +# Components + +The pieces every interpretation uses. Defined here once so they don't drift across the steps and references. + +## Polarity recipe (load-bearing — apply on every metric row) + +The platform's summary buckets (`positive` / `negative` / `no`) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion. + +Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"): + +- `lift is None` or `lift == 0` → **neutral** (no measurement / no effect respectively). +- `direction == "up"` → **positive** if `lift > 0`, else **negative**. +- `direction == "down"` → **positive** if `lift < 0`, else **negative**. + +A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first (use `settings.controlKey`). + +The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**. 
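A minimal sketch of the polarity recipe above, in Python. The row shape (dicts with `metricId`, `variant`, `lift`) follows the summary item fields listed in the field map; the function names `polarity` and `classify_rows` and the `metric_directions` lookup are hypothetical helpers for illustration, not part of any Mixpanel API.

```python
from typing import Dict, List, Optional, Tuple


def polarity(lift: Optional[float], direction: str = "up") -> str:
    """Translate sign-of-lift into business polarity for one result row."""
    # Missing lift (no measurement) and zero lift (no effect) are both neutral.
    if lift is None or lift == 0:
        return "neutral"
    if direction == "up":
        return "positive" if lift > 0 else "negative"
    if direction == "down":
        return "positive" if lift < 0 else "negative"
    raise ValueError(f"unexpected direction: {direction!r}")


def classify_rows(
    rows: List[dict],
    metric_directions: Dict[str, str],
    control_key: str,
) -> List[Tuple[str, str, str]]:
    """Drop the control row, then apply the recipe to every remaining row.

    `metric_directions` is an assumed lookup of metric id -> "up" / "down";
    metrics missing from it fall back to "up", matching the stated default.
    """
    result = []
    for row in rows:
        if row["variant"] == control_key:
            continue  # the control row carries no decision signal
        direction = metric_directions.get(row["metricId"], "up")
        result.append((row["metricId"], row["variant"], polarity(row["lift"], direction)))
    return result
```

The sketch only relabels rows; it deliberately does not touch significance or multiple-testing correction, which the platform has already applied.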
+ +## Data-source fallback + +Experiment-details has two parallel data paths — live (preferred) and cached. Always prefer live; if live computation failed, fall back to cache with a staleness caveat; if **both** are empty, say "no result was computed" and recommend a re-sync. **Never** silently treat missing data as "no effect." + +## Verdict table + +| Situation | Recommendation | +| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Conclude the experiment via its decide lifecycle action, naming the winning variant and a rationale message. **Confirm with the user first — concluding is irreversible.** | +| Trust ✓, primary polarity positive, guardrail polarity negative | **ITERATE.** Investigate the regression; do not auto-ship. | +| Trust ✓, primary polarity neutral after target sample reached | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md). | +| Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)). | +| Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). | + +For multi-variant tests, special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), and the exact decide-call shape, see [references/lifecycle-handoff.md](references/lifecycle-handoff.md). + +--- + +# Steps + +Top-down: what to do, in order. + +## 1. Fetch the experiment + +Request the experiment details with exposure and metric data included. 
The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments. + +Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret. + +## 2. Run the trustworthiness gate (the Decision Tree) + +Run steps 2a–2e in order. **Stop at the first failure** — do not proceed if a step flags a problem. The platform attaches verdict fields for each check; consume those verdicts rather than recomputing. + +### 2a. Trustworthiness + +SRM ok? Retro A/A clean? Exposures sufficient? Minimum duration met (~3 days)? No misconfiguration? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md). The Misconfigurations section in that reference covers the warning-level signals (multiple-testing off, extreme winsorization, CUPED on new-users-only, etc.). + +### 2b. Statistical significance + +Apply the **polarity recipe** from Components to each non-control variant × primary metric. If nothing is significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md). For translating a single metric's lift / CI / p-value into a phrase, see [references/per-metric-interpretation.md](references/per-metric-interpretation.md). + +### 2c. Guardrail check + +Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a row in `summary.positive` for a `direction: "down"` guardrail is still a regression. + +### 2d. Practical significance + +Convert lift into absolute terms — multiply by the control baseline. Statistically significant ≠ ships. The per-metric reference covers the baseline-fetch fallback when `value` or `sampleSize` is missing, and the **Twyman's Law** check for any lift > ~30%. + +### 2e. Verdict + +Look up the situation in the **Verdict table** in Components. 
If the recommendation is SHIP or KILL, surface the proposed decide-action parameters and **wait for explicit user confirmation** before executing — concluding an experiment is irreversible. + +## 3. Going deeper (open references on demand) + +| User asks about… | Open | +| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | +| SRM failing, Retro A/A failing, exposures insufficient, or any trustworthiness fail | [references/health-check-interpretation.md](references/health-check-interpretation.md) | +| "Translate this lift / CI / p-value into English" | [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | +| "Why hasn't this hit statsig yet? Should we wait or stop?" | [references/why-no-statsig.md](references/why-no-statsig.md) | +| "Which segments should I break this down on?" | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | +| "What does this segment-by-segment result mean?" | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | +| "Can session replays help explain this result?" | [references/session-replay-analysis.md](references/session-replay-analysis.md) | +| "How do I actually conclude this experiment? Multi-variant ship?" | [references/lifecycle-handoff.md](references/lifecycle-handoff.md) | + +## 4. Output + +Default to this shape unless the user asks for something else: + +1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`. +2. **Why**, walking through the trustworthiness-gate steps that mattered (skip steps that were clearly fine). +3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, each polarity-corrected. Include absolute-impact translation for any win. +4. 
**Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, stale-cache caveat, etc. +5. **Suggested next action** — for SHIP / KILL, the proposed decide-action parameters **gated on user confirmation**; for ITERATE / WAIT, the investigation to run next. + +If experiment details are unavailable or return errors, say so — do not invent a verdict. diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md similarity index 73% rename from plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md rename to plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md index 9ec66df..e9082fa 100644 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md @@ -1,8 +1,8 @@ # Health-Check Interpretation -Open this when Step 1 of the Decision Tree flags a failure (SRM, Retro A/A, insufficient exposures, peeking, broken-data, < 3-day window, or any misconfiguration). The goal is to turn the platform's already-computed verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action. +Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action. -**This skill never recomputes thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers. +**Never recompute thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers. --- @@ -134,17 +134,65 @@ If `endCondition: "sample_size"` with a tiny target (e.g. 
10k) was reached in ho --- -## 7. Misconfigurations to flag during Step 1 +## 7. Misconfigurations -These don't always invalidate results, but they change how to _read_ them. Surface them as warnings. +These don't always invalidate results, but they change how to _read_ them. Surface them as warnings during the trustworthiness gate. -- `settings.multipleTestingCorrection in {"off", null}` AND there are 2+ primary metrics across 1+ non-control variants → without correction, any single significant primary may be a false positive. **Don't assume the result is broken** — look at all primary results in aggregate. If most or all primaries point the same direction (all positive or all negative), there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk**, and the user can enable correction (Benjamini-Hochberg or Bonferroni) and re-analyze. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). -- `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` very low (e.g. < ~80) or unusually high (e.g. > ~99) → extreme outlier capping. The platform's default is 95; a percentile near 50 caps almost all data and likely indicates misconfiguration. -- `settings.srm == null` OR `settings.srm.enabled == false` → the SRM check didn't run. **SRM is often deliberately disabled** (e.g. when feature-flag rollouts intentionally split traffic unevenly), so do not try to compute it yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) — then suggest the user re-enable SRM and re-analyze. -- `settings.cuped.enabled == true` AND the experiment cohort is "new users only" → CUPED requires pre-exposure data, which new-user experiments lack, so CUPED simply has no effect. 
**This does NOT invalidate results** — variance reduction just didn't happen. Mention it as informational. -- `settings.confidenceLevel != 0.95` → call out explicitly. `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Combine with metric count for a sense of family-wise error rate. -- `metrics[]` contains entries with `name == ""` → likely a broken or placeholder metric reference. Flag and skip during analysis. -- A primary metric appears in `metrics[]` but is **missing from `live_metrics` AND `results_cache.metrics`** → no result was computed for that primary. Surface prominently — this is "no measurement," not "no effect." Recommend the user re-sync results. +### Multiple-testing correction off with several primaries + +**Condition**: `settings.multipleTestingCorrection` is `"off"` or `null` AND there are 2+ primary metrics across 1+ non-control variants. + +**Interpretation**: any single significant primary may be a false positive. The family-wise error rate compounds with the number of comparisons (e.g. 15 primaries × 1 variant at α=0.05 → ~54% chance of at least one false positive, assuming independent tests). + +**Action**: look at all primary results in aggregate. If most point the same direction, there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk** — the user can enable Benjamini-Hochberg or Bonferroni and re-analyze. + +### Extreme winsorization percentile + +**Condition**: `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` is very low (e.g. < ~80) or unusually high (e.g. > ~99). + +**Interpretation**: outlier capping is far from the platform's default of 95. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. + +**Action**: ask the user to confirm the percentile was intentional; recommend resetting to 95 unless they have a specific reason.
+ +### SRM check disabled + +**Condition**: `settings.srm == null` OR `settings.srm.enabled == false`. + +**Interpretation**: the SRM check didn't run. **Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug. + +**Action**: only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios). When you do flag, recommend re-enabling SRM and re-analyzing. + +### CUPED on new-users-only cohort + +**Condition**: `settings.cuped.enabled == true` AND the experiment cohort is "new users only". + +**Interpretation**: CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen. + +**Action**: mention as informational; no remediation needed for this experiment. For future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply. + +### Non-default confidence level + +**Condition**: `settings.confidenceLevel != 0.95`. + +**Interpretation**: `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. + +**Action**: call out explicitly in the verdict. Combine with metric count to estimate the family-wise error rate. + +### Broken or placeholder metric entries + +**Condition**: `metrics[]` contains entries with `name == ""`. + +**Interpretation**: likely a broken or placeholder metric reference. + +**Action**: flag and skip during analysis. + +### Primary metric with no computed result + +**Condition**: a primary metric appears in `metrics[]` but is **missing from both** `live_metrics` and `results_cache.metrics`. + +**Interpretation**: no result was computed for that primary. **This is "no measurement," not "no effect."** + +**Action**: surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary. 
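As a rough check on the ~54% figure quoted in the multiple-testing subsection of this section, the family-wise error rate for m uncorrected comparisons can be sketched as 1 − (1 − α)^m. This assumes the tests are independent, which correlated product metrics rarely are, so read the output as an illustration of scale and not as a threshold to apply. The function name is illustrative and not part of the platform.

```python
def family_wise_error_rate(alpha: float, comparisons: int) -> float:
    """Chance of at least one false positive across `comparisons`
    uncorrected tests, ASSUMING the tests are independent."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    if comparisons < 1:
        raise ValueError("comparisons must be at least 1")
    return 1.0 - (1.0 - alpha) ** comparisons


# 15 primaries x 1 non-control variant at alpha = 0.05
fwer = family_wise_error_rate(alpha=0.05, comparisons=15)
print(f"{fwer:.1%}")  # roughly 54%
```

With a single comparison the function returns α itself, which is why the concern only arises once there are 2+ primaries or several non-control variants.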
--- diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md new file mode 100644 index 0000000..4d8189d --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md @@ -0,0 +1,39 @@ +# Lifecycle Hand-off + +How to conclude an experiment once the verdict is settled. This reference is **interpretation guidance** — the per-field schema of the decide action lives in the experiment-update tool description. + +--- + +## Confirm before concluding — always + +Concluding an experiment is **irreversible**. Before invoking the decide action, surface the proposed parameters to the user (winning variant, success/fail, rationale message) and wait for explicit confirmation. A SHIP verdict is a recommendation, not an authorization. + +## The three pieces every decide call needs + +A decide call expresses three things: + +1. **Did the experiment succeed?** A win for one of the treatments, or a deliberate stop. +2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special constants below. +3. **Why?** A rationale message — what metrics were evaluated, the polarity reading, the tradeoffs accepted. The platform requires this on every decide call; treat it as a one-paragraph decision record, not a placeholder. + +## Special variant choices for success + +When you have a winning result but no single variant to ship: + +- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. (The platform exposes this as the constant `__no_variant_shipped__`.) +- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The platform exposes this as `__defer_variant_decision__` and shows `SUCCESS_DEFERRED` in the UI.) 
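The three pieces and the two special constants can be sketched as a small validation helper. This is an illustration only: `DecideParams` and `build_decide_params` are hypothetical names, the field names (`success`, `variant`, `message`) follow the decide fields listed elsewhere in this patch, and the authoritative schema remains the experiment-update tool description. Nothing here calls the platform; it only assembles the parameters to show the user before they confirm.

```python
from dataclasses import dataclass
from typing import Optional

# Special variant constants named in this reference.
NO_VARIANT_SHIPPED = "__no_variant_shipped__"
DEFER_VARIANT_DECISION = "__defer_variant_decision__"


@dataclass(frozen=True)
class DecideParams:
    """Hypothetical container for the parameters surfaced to the user."""
    success: bool
    message: str
    variant: Optional[str] = None


def build_decide_params(success: bool, message: str,
                        variant: Optional[str] = None) -> DecideParams:
    """Assemble decide parameters for user review; executes nothing."""
    if not message.strip():
        raise ValueError("a rationale message is required on every decide call")
    if success and not variant:
        raise ValueError("success=True needs a variant key or a special constant")
    if not success:
        variant = None  # no winner: a variant key is not needed
    return DecideParams(success=success, message=message.strip(), variant=variant)


ship = build_decide_params(True, "Checkout CVR up, guardrails clean.", "treatment_b")
defer = build_decide_params(True, "Direction validated; variant under discussion.",
                            DEFER_VARIANT_DECISION)
kill = build_decide_params(False, "No primary moved after target sample reached.")
print(ship, defer, kill, sep="\n")
```

Keeping the assembly separate from any call that concludes the experiment makes the confirmation step explicit: the parameters exist as a reviewable object before anything irreversible happens.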
+ +When the verdict is KILL — no winner — record success as false. No variant key is needed in that case. + +## Multi-variant experiments + +For a 3+ arm test, the decide action still names a single winning variant. If two treatments are roughly tied: + +- If both clear the practical-significance bar and shipping either is acceptable, pick on simplicity (smaller diff from control, lower implementation cost). +- If the team genuinely cannot pick, use the defer constant above — better than fabricating a winner. + +A multi-variant test where only one treatment is significantly different from control is a clean SHIP for that variant; the inconclusive arms are simply not the winner. + +## After concluding + +The decision record (`results_cache.message`, `results_cache.variant`, and `status` transitioning to `concluded` / `success` / `fail`) becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer. diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md similarity index 87% rename from plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md rename to plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md index 1e8678c..3f272ad 100644 --- a/plugins/mixpanel-mcp/skills/experiment-results/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md @@ -1,6 +1,6 @@ # Per-Metric Interpretation -Open this when the user wants you to translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_ +Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. 
_"what does this single row of `summary` actually mean?"_ **Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from the experiment-details response. Then translate. @@ -19,28 +19,22 @@ A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any --- -## Polarity recipe (repeat from the spine — critical) +## Polarity recipe -`metric.direction` is `"up"` or `"down"` (defaults to `"up"`). +Apply the polarity recipe from the spine — see the **Components** section of `SKILL.md`. Treat the bucket name in `summary.positive` / `summary.negative` as sign-of-lift only; the business verdict comes from combining it with `metric.direction`. Examples worth remembering: -- `lift is None` or `lift == 0` → **neutral** (treat as no measurement / no effect respectively). -- `direction == "up"` → **positive** if `lift > 0`, else **negative**. -- `direction == "down"` → **positive** if `lift < 0`, else **negative**. - -A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. A `-1% interstitials_shown` lift in `summary.negative` with `direction: "down"` is plausibly a **win** (less interruption). +- A row in `summary.positive` with `direction: "down"` is a **regression**. +- A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1% interstitials_shown` lift means less interruption). --- -## Reading the p-value correctly +## Reading the p-value in this platform -The p-value is the probability of observing a difference at least as extreme as the one measured, **assuming the null hypothesis (no real difference) is true**. It is NOT: +Mixpanel uses Welch's t-test (z-test for large samples) at α = 0.05 / 95% confidence by default. 
The confidence level is set on `settings.confidenceLevel`; if it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). -- ❌ The probability that the treatment works. -- ❌ The probability the result will replicate. -- ❌ A measure of effect size — a tiny lift can be highly significant on a huge sample. -- ❌ Proof of "no effect" when above threshold (see "Inconclusive results"). +The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread. -Mixpanel uses Welch's t-test (z-test for large samples). Default α = 0.05 at 95% confidence. The confidence level is set on `settings.confidenceLevel`. If it differs from 0.95, call it out in the verdict (`0.9` inflates false positives; `0.99` is conservative). +For the general meaning of a p-value (the probability under the null), trust the model's baseline knowledge — don't invent thresholds in either direction. --- @@ -50,7 +44,6 @@ Mixpanel uses Welch's t-test (z-test for large samples). Default α = 0.05 at 95 lift = (treatment_mean - control_mean) / control_mean ``` -- `liftConfidence` is the **confidence level used** (e.g. 0.95). It is NOT the confidence-interval width. - **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct. - If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect." @@ -125,7 +118,7 @@ Different metric types behave differently; cite the relevant nuance in your verd ## Variance-reduction & outlier settings that change interpretation - **CUPED** (`settings.cuped.enabled == true`): mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. 
Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). -- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration (see `health-check-interpretation.md` §Misconfig). +- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). --- diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md similarity index 94% rename from plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md rename to plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md index fcf9cfd..e0c43d2 100644 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-breakdown-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md @@ -1,6 +1,6 @@ # Segment-Breakdown Interpretation -Open this when the user has per-segment results in hand and wants to read them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. 
+Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. > **Platform support status.** Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts. @@ -49,7 +49,7 @@ Each segment value needs its own meaningful per-variant sample for the per-segme | Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. | | Two opposite-direction effects in different segments that roughly cancel overall | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses. | -When you spot Simpson's paradox, route the user to [health-check-interpretation.md](health-check-interpretation.md) §SRM — it's usually the cause, not a real reversal. +When you spot Simpson's paradox, route the user to the **SRM** section of [health-check-interpretation.md](health-check-interpretation.md) — bucketing is usually the cause, not a real reversal. 
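A toy illustration of the Simpson's paradox row in the table above, using made-up counts (not platform data): when the variant mix differs sharply across segments, treatment can win inside every segment and still lose overall. The lopsided per-segment allocation is the tell that points back at bucketing.

```python
# Made-up counts for illustration only: (users, conversions) per variant.
segments = {
    "mobile":  {"control": (100, 10),  "treatment": (900, 99)},
    "desktop": {"control": (900, 450), "treatment": (100, 51)},
}


def rate(users: int, conversions: int) -> float:
    """Conversion rate for one variant in one segment."""
    return conversions / users


# Within each segment, treatment converts slightly better than control.
per_segment = {}
for name, arms in segments.items():
    c, t = rate(*arms["control"]), rate(*arms["treatment"])
    per_segment[name] = (c, t)
    print(f"{name}: control {c:.1%}, treatment {t:.1%}")

# Pooled across segments, the ordering flips because treatment is mostly
# mobile (low baseline) while control is mostly desktop (high baseline).
totals = {
    arm: tuple(sum(seg[arm][i] for seg in segments.values()) for i in (0, 1))
    for arm in ("control", "treatment")
}
overall = {arm: rate(*counts) for arm, counts in totals.items()}
print(f"overall: control {overall['control']:.1%}, "
      f"treatment {overall['treatment']:.1%}")
```

In this toy data each variant is split 10/90 or 90/10 across the two segments, which is exactly the kind of imbalance a per-segment SRM check is meant to surface.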
 --- diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-of-interest-selection.md similarity index 95% rename from plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-of-interest-selection.md rename to plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-of-interest-selection.md index ea9f22b..b0c8f58 100644 --- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-of-interest-selection.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-of-interest-selection.md @@ -1,6 +1,6 @@ # Segment-of-Interest Selection -Open this when the user wants to break results down by user segments — _"slice this by platform"_, _"which segments should I look at?"_, _"are new users responding differently?"_. The goal is to pick 3–5 segments that are **likely to reveal a real effect difference**, before slicing every available dimension and ending up p-hacking. +Pick 3–5 segments that are **likely to reveal a real effect difference**, rather than slicing every available dimension and ending up p-hacking. The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them.
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/session-replay-analysis.md similarity index 96% rename from plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md rename to plugins/mixpanel-mcp-in/skills/interpret-experiment/references/session-replay-analysis.md index b758b8e..59ad25e 100644 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/session-replay-analysis.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/session-replay-analysis.md @@ -1,6 +1,6 @@ # Session-Replay Analysis Guidance -Open this when the user wants to use session replays to explain a quantitative experiment result — _"why is conversion down in treatment?"_, _"what are users actually doing in the treatment?"_, _"can replays explain the regression?"_. The goal is to turn a number into a behavior story. +Turn a quantitative experiment result into a behavior story using session replays. > **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. 
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md similarity index 94% rename from plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md rename to plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md index 142089c..a4e69d4 100644 --- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/why-no-statsig.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md @@ -1,8 +1,8 @@ # Why Hasn't This Reached Statistical Significance Yet? -Open this when the user asks why a primary metric is still inconclusive — _"why isn't this stat-sig yet?"_, _"should I wait or stop?"_, _"is this just underpowered?"_. The goal is to help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null**, _without_ recomputing the platform's verdicts. +Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts. -The actual stop / extend math (sample size, power, MDE) is owned by the **setup-side skill** — point the user there for the formulas. This skill explains _which_ lever to pull, not how to recompute one. +The actual stop / extend math (sample size, power, MDE) is owned by the `experiment-setup` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one. --- diff --git a/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md b/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md deleted file mode 100644 index 7bc71c4..0000000 --- a/plugins/mixpanel-mcp/skills/experiment-results/SKILL.md +++ /dev/null @@ -1,110 +0,0 @@ ---- -name: experiment-results -description: Interprets a Mixpanel experiment's results and health checks. 
Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill. -license: Apache-2.0 ---- - -# Experiment Results Interpretation - -You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. Use the decision tree below as the spine; open references only when a step needs depth. - -## Requirements - -- Access to Mixpanel (read experiment details and metrics; update experiment lifecycle for ship/kill decisions). -- This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a field is missing, say so — do not synthesize a verdict from raw values. - -## When to use this skill - -Trigger when the user asks anything about reading an experiment's results or its health. Common phrasings: - -- "What do these results mean?" / "Should we ship this?" -- "Is this experiment trustworthy?" / "Why is SRM failing?" -- "Why hasn't this hit statistical significance yet?" -- "Break this down by ``" / "What segments should I look at?" -- "What does this Retro A/A failure mean?" -- "Can you compare the session replays for control vs treatment?" - -Do **not** trigger for experiment **setup** questions ("how should I size this?", "what metrics should I pick?") — those belong to the `experiment-setup` skill. - ---- - -## How to read experiment-details output - -Always request experiment details with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. 
**Always prefer live, fall back to cache, surface errors.** - -| Concept | Live (preferred) | Cached fallback | -| ---------------------------- | --------------------------------- | ------------------------------------------- | -| Per-variant exposure counts | `live_exposures` | `exposures_cache` (strip `$`-prefixed keys) | -| SRM check | `live_srm_analysis` | `exposures_cache.$srm_analysis` | -| Per-metric per-variant stats | `live_metrics[metricId][variant]` | `results_cache.metrics[metricId][variant]` | -| Bucketed summary | recompute from `live_metrics` | `results_cache.summary` | - -If `live_results_errors` is non-null, use the cache, caveat that data is stale, and surface the error — the underlying failure may need fixing before any decision. If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect." - -The full field map is in [references/experiment-fields.md](references/experiment-fields.md). - ---- - -## The decision tree - -Run in order. **Stop at the first failure** — do not proceed if a step flags a problem. - -1. **Trustworthiness gate** — SRM ok? Exposures sufficient? Retro A/A clean? Minimum duration met (~3 days)? No misconfig? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md). -2. **Statistical significance** — apply the polarity recipe (below) to each non-control variant × primary. If nothing significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md). -3. **Guardrail check** — any guardrail significant in the wrong polarity? Regression → ITERATE not ship. -4. **Practical significance** — convert lift into absolute terms (`baseline_value × lift`). Statistically significant ≠ ships. -5. **Verdict** — see table below. 
- -### Polarity recipe (load-bearing — keep in mind for every metric) - -`summary.positive` and `summary.negative` are bucketed by **sign of lift**, NOT by business value. `metric.direction` ("up" / "down", defaults to "up") tells you which sign is good: - -- `lift is None` or `lift == 0` → **neutral** -- `direction == "up"` → **positive** if `lift > 0`, else **negative** -- `direction == "down"` → **positive** if `lift < 0`, else **negative** - -A metric in `summary.positive` with `direction: "down"` is a **regression**, not a win. Filter out the control row first (use `settings.controlKey`). The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**. - -Per-metric phrasing (translating lift + CI + p-value into "small win" / "large regression" / "noise") is in [references/per-metric-interpretation.md](references/per-metric-interpretation.md). The same reference covers the changed-denominator check (Twyman's Law) for any lift >~30%, and how to query the baseline if `value` or `sampleSize` is `null`. - -### Verdict table - -| Situation | Recommendation | -| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Use the experiment's `decide` action with `success=true`, `variant=`, and a `message` rationale. | -| Trust ✓, primary polarity positive, guardrail polarity negative | **ITERATE.** Investigate the regression; do not auto-ship. | -| Trust ✓, primary polarity neutral after target sample reached | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md). 
| -| Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)). | -| Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). | - -For multi-variant tests, the `decide`-call shape, and special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), see [references/experiment-fields.md](references/experiment-fields.md) §Lifecycle hand-off. `message` is required on every `decide` call. - ---- - -## Going deeper - -Open the relevant reference on demand: - -| User asks about… | Open | -| ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | -| SRM failing, Retro A/A failing, exposures insufficient, or any Step 1 fail | [references/health-check-interpretation.md](references/health-check-interpretation.md) | -| "Translate this lift / CI / p-value into English" | [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | -| "Why hasn't this hit statsig yet? Should we wait or stop?" | [references/why-no-statsig.md](references/why-no-statsig.md) | -| "Which segments should I break this down on?" | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | -| "What does this segment-by-segment result mean?" (when platform support exists) | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | -| "Can session replays help explain this result?" | [references/session-replay-analysis.md](references/session-replay-analysis.md) | -| "Which field in the experiment-details response has X?" 
| [references/experiment-fields.md](references/experiment-fields.md) | - ---- - -## Output - -Default to this shape unless the user asks for something else: - -1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`. -2. **Why**, walking through the decision tree steps that mattered (skip the steps that were clearly fine). -3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, with the polarity-corrected reading of each. Include the absolute-impact translation for any win. -4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, etc. -5. **Suggested next action** — the experiment-decide action to take, or the deeper investigation to run. - -If experiment details are unavailable or return errors, say so — do not invent a verdict. diff --git a/plugins/mixpanel-mcp/skills/experiment-results/references/experiment-fields.md b/plugins/mixpanel-mcp/skills/experiment-results/references/experiment-fields.md deleted file mode 100644 index 1e65de1..0000000 --- a/plugins/mixpanel-mcp/skills/experiment-results/references/experiment-fields.md +++ /dev/null @@ -1,158 +0,0 @@ -# Experiment-Details Field Map - -Quick reference for which experiment-details response field drives each interpretation. Always request the details with `compute_exposures=true, compute_metrics=true`. - -This reference is **read-only domain knowledge** for the agent. It does NOT define thresholds — every "fail condition" listed below is a _characterization_ of how the platform itself already classifies the field, not a threshold this skill should re-apply. 
- ---- - -## Identity & lifecycle - -``` -id, name, description, hypothesis, status, start_date, end_date -creator_email, tags, url, workspace_id -feature_flag_id → for feature-flag-based experiments -settings.controlKey → variant key treated as control (often "control"; may be "") -``` - -`status` is one of `"concluded" | "success" | "fail"` (the UI may additionally show `SUCCESS_DEFERRED` for the special variant constant — see "Decision metadata" below). - ---- - -## Trustworthiness - -``` -live_srm_analysis → SRM verdict (consume — don't recompute) - .p_value - .chi_square -live_exposures[] → per-variant exposure counts (live) -exposures_cache[] → per-variant exposure counts (cached fallback) -exposures_cache.$srm_analysis → cached SRM analysis -exposures_cache.$last_computed → when the cache was last refreshed -settings.srm.enabled → whether the SRM check ran -settings.srm.targetAllocations → expected per-variant allocation (percent) -settings.preExperimentBias → whether Retro A/A was enabled -settings.excludeQA → whether QA traffic was filtered -live_results_errors → non-null = live computation failed; surface and fall back to cache -``` - ---- - -## Per-metric per-variant results - -``` -live_metrics[][] - .value → metric value for this variant - .sampleSize → sample size for this variant on this metric - .lift → (treatment - control) / control (0 for control row) - .liftConfidence → confidence LEVEL used (e.g. 
0.95) — NOT the CI width - .significance → "YES_POSITIVE" | "YES_NEGATIVE" | "NO" (sign-of-lift, NOT polarity) - -results_cache.metrics[][] → cached fallback, same shape -``` - ---- - -## Bucketed summary - -``` -results_cache.summary.positive[] → items with significance == "YES_POSITIVE" (lift > 0, sig) -results_cache.summary.negative[] → items with significance == "YES_NEGATIVE" (lift < 0, sig) -results_cache.summary.no[] → items with significance == "NO" - -Each item: - .metricId - .variant - .value - .lift - .liftConfidence - .sampleSize - .significance -``` - -**Pre-process the summary**: filter rows where `variant == settings.controlKey` (control-vs-control is mechanical noise), then apply the polarity recipe before drawing any conclusion. - ---- - -## Metric catalog (for polarity lookups) - -``` -metrics[] - .id, .name - .type ("primary" | "guardrail" | "secondary") - .direction ("up" | "down") → always set; defaults to "up" if the source metric was unset -``` - -Build a lookup `metric_id → (type, direction)` and join to summary rows during interpretation. - ---- - -## Settings that change interpretation - -``` -settings.confidenceLevel → significance threshold (e.g. 
0.95) -settings.testingModel → "frequentist" or "sequential" -settings.endCondition → "sample_size" or "days" -settings.sampleSize / .endAfterDays → planned end target -settings.multipleTestingCorrection → "off" | "bonferroni" | "benjamini-hochberg" -settings.cuped.enabled → CUPED variance reduction applied -settings.cuped.preExposureDatePreset → pre-exposure window -settings.winsorization.enabled → outlier capping applied -settings.winsorization.percentile → cap percentile (default 95; lower values are extreme) -``` - ---- - -## Decision metadata (post-decide) - -``` -results_cache.message → decision rationale -results_cache.variant → shipped variant key (or special constant) -status → "concluded" | "success" | "fail" -``` - -Special variant constants for `success=true`: - -- `__no_variant_shipped__` — ship the change without picking a variant. -- `__defer_variant_decision__` — defer (UI shows `SUCCESS_DEFERRED`). - -For a kill, pass `success=false`. - ---- - -## Lifecycle hand-off - -To ship/kill, update the experiment with the `decide` action and these fields: - -``` -action → "decide" -success → true | false -variant → "" # required when success=true -message → "" -``` - -`message` is required on every `decide` call. - ---- - -## Misconfig field map (cross-link) - -For _how_ to react to each of these, see [health-check-interpretation.md](health-check-interpretation.md) §7. 
- -- `settings.multipleTestingCorrection in {"off", null}` with 2+ primaries × 1+ non-control variants -- `settings.winsorization.enabled == true` with `percentile` very low (< ~80) or very high (> ~99) -- `settings.srm == null` OR `settings.srm.enabled == false` (often intentional — only flag if results look suspicious) -- `settings.cuped.enabled == true` AND the experiment cohort is "new users only" -- `settings.confidenceLevel != 0.95` -- `metrics[]` entries with `name == ""` -- A primary metric in `metrics[]` but missing from `live_metrics` AND `results_cache.metrics` - ---- - -## When to reach for sibling capabilities - -- **Setup quality questions** ("was this experiment powered correctly?", "what sample size did we need?") → defer to the `experiment-setup` skill. -- **Raw data for triggered or segmentation analysis** → run a query on the metric with appropriate filters. -- **Acting on the recommendation** (ship, kill, extend) → update the experiment with the appropriate action. -- **Feature-flag rollout history** for SRM root cause → inspect the linked flag's state. -- **Session replays** for behavioral explanation of a quantitative result → see [session-replay-analysis.md](session-replay-analysis.md). diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md new file mode 100644 index 0000000..c2d7591 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md @@ -0,0 +1,127 @@ +--- +name: interpret-experiment +description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. 
Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill. +license: Apache-2.0 +--- + +# Interpret Experiment + +You are helping a user read, interpret, or make a ship/iterate/kill/wait decision on a Mixpanel experiment. This skill consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a verdict field is missing, say so — do not synthesize one from raw values. + +--- + +# Glossary + +Concepts the rest of this skill uses without redefining. + +- **Variant.** One arm of the experiment. The variant treated as the baseline is the **control**; the others are **treatments**. The platform marks which key is the control. +- **Primary / Guardrail / Secondary metric.** + - **Primary** — drives the ship decision. The platform applies multiple-testing correction across primaries when configured. + - **Guardrail** — a metric that must not regress; a guardrail loss vetoes a ship even when primaries win. + - **Secondary** — exploratory only, never decisional, no correction applied. +- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict. +- **Polarity.** Whether a movement is _good for the business_. Combines sign of lift with the metric's `direction` ("up" = bigger is better; "down" = smaller is better). See the **Polarity recipe** in Components. +- **Significance.** The platform's per-row classification: `YES_POSITIVE`, `YES_NEGATIVE`, or `NO`. Read from the response — do not recompute. +- **SRM (Sample Ratio Mismatch).** Variants received traffic in proportions that disagree with the configured split. **Kohavi's #1 trustworthiness check** — when SRM fails, downstream lift, p-values, and CIs cannot be trusted. +- **Retro A/A (pre-experiment bias).** Re-runs the comparison on the pre-exposure period. 
A failure means cohorts already differed before treatment started. +- **Twyman's Law.** "Any unusually clean or unusually large result is more likely a bug than a discovery." Apply on lifts > ~30% — usually a changed-denominator artifact. +- **CUPED.** Variance reduction using pre-exposure baseline. Cuts required sample 30–70% when it applies. Inert on new-user-only cohorts. +- **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95. +- **MDE (Minimum Detectable Effect).** The smallest lift the experiment was sized to detect. Set during experiment setup. +- **Trustworthiness gate.** The pre-flight check in Step 2a of the Decision Tree: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference. + +--- + +# Components + +The pieces every interpretation uses. Defined here once so they don't drift across the steps and references. + +## Polarity recipe (load-bearing — apply on every metric row) + +The platform's summary buckets (`positive` / `negative` / `no`) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion. + +Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"): + +- `lift is None` or `lift == 0` → **neutral** (no measurement / no effect respectively). +- `direction == "up"` → **positive** if `lift > 0`, else **negative**. +- `direction == "down"` → **positive** if `lift < 0`, else **negative**. + +A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first (use `settings.controlKey`). + +The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**. 
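The polarity recipe can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the function names `polarity` and `classify_rows` are hypothetical, and the plain-dict shape of the summary rows and metric catalog (keys `variant`, `metricId`, `lift`, `id`, `direction`) is an assumption about how the response has been loaded.

```python
from typing import Optional


def polarity(lift: Optional[float], direction: str = "up") -> str:
    """Translate sign-of-lift into business polarity for one metric row."""
    # A missing lift is "no measurement"; a zero lift is "no effect".
    # Both are reported as neutral, never as a win or a regression.
    if lift is None or lift == 0:
        return "neutral"
    if direction == "down":
        # Smaller is better, so a negative lift is the good outcome.
        return "positive" if lift < 0 else "negative"
    # Default ("up"): bigger is better.
    return "positive" if lift > 0 else "negative"


def classify_rows(rows, metrics, control_key):
    """Drop control rows, then attach a polarity to every remaining row.

    `rows` and `metrics` are assumed to be lists of dicts (hypothetical shape).
    """
    # Lookup metric id -> direction, falling back to "up" when unset.
    directions = {m["id"]: m.get("direction") or "up" for m in metrics}
    classified = []
    for row in rows:
        if row["variant"] == control_key:
            continue  # control-vs-control rows carry no information
        direction = directions.get(row["metricId"], "up")
        classified.append(
            (row["metricId"], row["variant"], polarity(row.get("lift"), direction))
        )
    return classified
```

With this sketch, a significant `+5%` lift on a `direction: "down"` metric comes back as `"negative"`, which is the regression case the recipe warns about.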
+ +## Data-source fallback + +Experiment-details has two parallel data paths — live (preferred) and cached. Always prefer live; if live computation failed, fall back to cache with a staleness caveat; if **both** are empty, say "no result was computed" and recommend a re-sync. **Never** silently treat missing data as "no effect." + +## Verdict table + +| Situation | Recommendation | +| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Conclude the experiment via its decide lifecycle action, naming the winning variant and a rationale message. **Confirm with the user first — concluding is irreversible.** | +| Trust ✓, primary polarity positive, guardrail polarity negative | **ITERATE.** Investigate the regression; do not auto-ship. | +| Trust ✓, primary polarity neutral after target sample reached | **KILL or ITERATE.** Use the inconclusive-results playbook in [references/why-no-statsig.md](references/why-no-statsig.md). | +| Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [references/why-no-statsig.md](references/why-no-statsig.md)). | +| Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [references/health-check-interpretation.md](references/health-check-interpretation.md). | + +For multi-variant tests, special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), and the exact decide-call shape, see [references/lifecycle-handoff.md](references/lifecycle-handoff.md). + +--- + +# Steps + +Top-down: what to do, in order. + +## 1. Fetch the experiment + +Request the experiment details with exposure and metric data included. 
The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments. + +Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret. + +## 2. Run the trustworthiness gate (the Decision Tree) + +Run steps 2a–2e in order. **Stop at the first failure** — do not proceed if a step flags a problem. The platform attaches verdict fields for each check; consume those verdicts rather than recomputing. + +### 2a. Trustworthiness + +SRM ok? Retro A/A clean? Exposures sufficient? Minimum duration met (~3 days)? No misconfiguration? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md). The Misconfigurations section in that reference covers the warning-level signals (multiple-testing off, extreme winsorization, CUPED on new-users-only, etc.). + +### 2b. Statistical significance + +Apply the **polarity recipe** from Components to each non-control variant × primary metric. If nothing is significant on primaries → see [references/why-no-statsig.md](references/why-no-statsig.md). For translating a single metric's lift / CI / p-value into a phrase, see [references/per-metric-interpretation.md](references/per-metric-interpretation.md). + +### 2c. Guardrail check + +Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a row in `summary.positive` for a `direction: "down"` guardrail is still a regression. + +### 2d. Practical significance + +Convert lift into absolute terms — multiply by the control baseline. Statistically significant ≠ ships. The per-metric reference covers the baseline-fetch fallback when `value` or `sampleSize` is missing, and the **Twyman's Law** check for any lift > ~30%. + +### 2e. Verdict + +Look up the situation in the **Verdict table** in Components. 
If the recommendation is SHIP or KILL, surface the proposed decide-action parameters and **wait for explicit user confirmation** before executing — concluding an experiment is irreversible. + +## 3. Going deeper (open references on demand) + +| User asks about… | Open | +| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | +| SRM failing, Retro A/A failing, exposures insufficient, or any trustworthiness fail | [references/health-check-interpretation.md](references/health-check-interpretation.md) | +| "Translate this lift / CI / p-value into English" | [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | +| "Why hasn't this hit statsig yet? Should we wait or stop?" | [references/why-no-statsig.md](references/why-no-statsig.md) | +| "Which segments should I break this down on?" | [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | +| "What does this segment-by-segment result mean?" | [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | +| "Can session replays help explain this result?" | [references/session-replay-analysis.md](references/session-replay-analysis.md) | +| "How do I actually conclude this experiment? Multi-variant ship?" | [references/lifecycle-handoff.md](references/lifecycle-handoff.md) | + +## 4. Output + +Default to this shape unless the user asks for something else: + +1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`. +2. **Why**, walking through the trustworthiness-gate steps that mattered (skip steps that were clearly fine). +3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, each polarity-corrected. Include absolute-impact translation for any win. +4. 
**Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, stale-cache caveat, etc. +5. **Suggested next action** — for SHIP / KILL, the proposed decide-action parameters **gated on user confirmation**; for ITERATE / WAIT, the investigation to run next. + +If experiment details are unavailable or return errors, say so — do not invent a verdict. diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md similarity index 73% rename from plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md rename to plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md index 9ec66df..e9082fa 100644 --- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md @@ -1,8 +1,8 @@ # Health-Check Interpretation -Open this when Step 1 of the Decision Tree flags a failure (SRM, Retro A/A, insufficient exposures, peeking, broken-data, < 3-day window, or any misconfiguration). The goal is to turn the platform's already-computed verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action. +Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action. -**This skill never recomputes thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers. +**Never recompute thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers. --- @@ -134,17 +134,65 @@ If `endCondition: "sample_size"` with a tiny target (e.g. 
10k) was reached in ho --- -## 7. Misconfigurations to flag during Step 1 +## 7. Misconfigurations -These don't always invalidate results, but they change how to _read_ them. Surface them as warnings. +These don't always invalidate results, but they change how to _read_ them. Surface them as warnings during the trustworthiness gate. -- `settings.multipleTestingCorrection in {"off", null}` AND there are 2+ primary metrics across 1+ non-control variants → without correction, any single significant primary may be a false positive. **Don't assume the result is broken** — look at all primary results in aggregate. If most or all primaries point the same direction (all positive or all negative), there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk**, and the user can enable correction (Benjamini-Hochberg or Bonferroni) and re-analyze. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). -- `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` very low (e.g. < ~80) or unusually high (e.g. > ~99) → extreme outlier capping. The platform's default is 95; a percentile near 50 caps almost all data and likely indicates misconfiguration. -- `settings.srm == null` OR `settings.srm.enabled == false` → the SRM check didn't run. **SRM is often deliberately disabled** (e.g. when feature-flag rollouts intentionally split traffic unevenly), so do not try to compute it yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) — then suggest the user re-enable SRM and re-analyze. -- `settings.cuped.enabled == true` AND the experiment cohort is "new users only" → CUPED requires pre-exposure data, which new-user experiments lack, so CUPED simply has no effect. 
**This does NOT invalidate results** — variance reduction just didn't happen. Mention it as informational. -- `settings.confidenceLevel != 0.95` → call out explicitly. `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Combine with metric count for a sense of family-wise error rate. -- `metrics[]` contains entries with `name == ""` → likely a broken or placeholder metric reference. Flag and skip during analysis. -- A primary metric appears in `metrics[]` but is **missing from `live_metrics` AND `results_cache.metrics`** → no result was computed for that primary. Surface prominently — this is "no measurement," not "no effect." Recommend the user re-sync results. +### Multiple-testing correction off with several primaries + +**Condition**: `settings.multipleTestingCorrection` is `"off"` or `null` AND there are 2+ primary metrics across 1+ non-control variants. + +**Interpretation**: any single significant primary may be a false positive. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). + +**Action**: look at all primary results in aggregate. If most point the same direction, there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk** — the user can enable Benjamini-Hochberg or Bonferroni and re-analyze. + +### Extreme winsorization percentile + +**Condition**: `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` is very low (e.g. < ~80) or unusually high (e.g. > ~99). + +**Interpretation**: outlier capping is far from the platform's default of 95. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. + +**Action**: ask the user to confirm the percentile was intentional; recommend resetting to 95 unless they have a specific reason. 
+ +### SRM check disabled + +**Condition**: `settings.srm == null` OR `settings.srm.enabled == false`. + +**Interpretation**: the SRM check didn't run. **Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug. + +**Action**: only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios). When you do flag, recommend re-enabling SRM and re-analyzing. + +### CUPED on new-users-only cohort + +**Condition**: `settings.cuped.enabled == true` AND the experiment cohort is "new users only". + +**Interpretation**: CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen. + +**Action**: mention as informational; no remediation needed for this experiment. For future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply. + +### Non-default confidence level + +**Condition**: `settings.confidenceLevel != 0.95`. + +**Interpretation**: `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. + +**Action**: call out explicitly in the verdict. Combine with metric count to estimate the family-wise error rate. + +### Broken or placeholder metric entries + +**Condition**: `metrics[]` contains entries with `name == ""`. + +**Interpretation**: likely a broken or placeholder metric reference. + +**Action**: flag and skip during analysis. + +### Primary metric with no computed result + +**Condition**: a primary metric appears in `metrics[]` but is **missing from both** `live_metrics` and `results_cache.metrics`. + +**Interpretation**: no result was computed for that primary. **This is "no measurement," not "no effect."** + +**Action**: surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary. 
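The family-wise false-positive figure quoted for uncorrected primaries can be reproduced with a short calculation. This is a minimal sketch that assumes the comparisons are independent (correlated metrics would inflate the rate less) and uses a hypothetical helper name. It is meant for explaining the risk to the user, not for re-correcting the platform's results.

```python
def family_wise_error_rate(
    confidence_level: float, primaries: int, treatments: int = 1
) -> float:
    """Chance of at least one false positive across uncorrected comparisons.

    Assumes independent tests: FWER = 1 - (1 - alpha) ** m.
    """
    alpha = 1.0 - confidence_level        # e.g. 0.95 -> alpha = 0.05
    comparisons = primaries * treatments  # primaries x non-control variants
    return 1.0 - (1.0 - alpha) ** comparisons


# 15 primaries, 1 treatment, default 95% confidence level.
print(f"{family_wise_error_rate(0.95, 15):.1%}")  # prints 53.7%
```

Passing a non-default `settings.confidenceLevel` such as `0.9` to the same helper shows how quickly a looser threshold compounds with metric count.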
--- diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md new file mode 100644 index 0000000..4d8189d --- /dev/null +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md @@ -0,0 +1,39 @@ +# Lifecycle Hand-off + +How to conclude an experiment once the verdict is settled. This reference is **interpretation guidance** — the per-field schema of the decide action lives in the experiment-update tool description. + +--- + +## Confirm before concluding — always + +Concluding an experiment is **irreversible**. Before invoking the decide action, surface the proposed parameters to the user (winning variant, success/fail, rationale message) and wait for explicit confirmation. A SHIP verdict is a recommendation, not an authorization. + +## The three pieces every decide call needs + +A decide call expresses three things: + +1. **Did the experiment succeed?** A win for one of the treatments, or a deliberate stop. +2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special constants below. +3. **Why?** A rationale message — what metrics were evaluated, the polarity reading, the tradeoffs accepted. The platform requires this on every decide call; treat it as a one-paragraph decision record, not a placeholder. + +## Special variant choices for success + +When you have a winning result but no single variant to ship: + +- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. (The platform exposes this as the constant `__no_variant_shipped__`.) +- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The platform exposes this as `__defer_variant_decision__` and shows `SUCCESS_DEFERRED` in the UI.) 
+ +When the verdict is KILL — no winner — record success as false. No variant key is needed in that case. + +## Multi-variant experiments + +For a 3+ arm test, the decide action still names a single winning variant. If two treatments are roughly tied: + +- If both clear the practical-significance bar and shipping either is acceptable, pick on simplicity (smaller diff from control, lower implementation cost). +- If the team genuinely cannot pick, use the defer constant above — better than fabricating a winner. + +A multi-variant test where only one treatment is significantly different from control is a clean SHIP for that variant; the inconclusive arms are simply not the winner. + +## After concluding + +The decision record (`results_cache.message`, `results_cache.variant`, and `status` transitioning to `concluded` / `success` / `fail`) becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer. diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md similarity index 87% rename from plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md rename to plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md index 1e8678c..3f272ad 100644 --- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md @@ -1,6 +1,6 @@ # Per-Metric Interpretation -Open this when the user wants you to translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_ +Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. 
_"what does this single row of `summary` actually mean?"_ **Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from the experiment-details response. Then translate. @@ -19,28 +19,22 @@ A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any --- -## Polarity recipe (repeat from the spine — critical) +## Polarity recipe -`metric.direction` is `"up"` or `"down"` (defaults to `"up"`). +Apply the polarity recipe from the spine — see the **Components** section of `SKILL.md`. Treat the bucket name in `summary.positive` / `summary.negative` as sign-of-lift only; the business verdict comes from combining it with `metric.direction`. Examples worth remembering: -- `lift is None` or `lift == 0` → **neutral** (treat as no measurement / no effect respectively). -- `direction == "up"` → **positive** if `lift > 0`, else **negative**. -- `direction == "down"` → **positive** if `lift < 0`, else **negative**. - -A metric in `summary.positive` with `direction: "down"` is a **regression**. A metric in `summary.negative` with `direction: "down"` is a **win**. A `-1% interstitials_shown` lift in `summary.negative` with `direction: "down"` is plausibly a **win** (less interruption). +- A row in `summary.positive` with `direction: "down"` is a **regression**. +- A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1% interstitials_shown` lift means less interruption). --- -## Reading the p-value correctly +## Reading the p-value in this platform -The p-value is the probability of observing a difference at least as extreme as the one measured, **assuming the null hypothesis (no real difference) is true**. It is NOT: +Mixpanel uses Welch's t-test (z-test for large samples) at α = 0.05 / 95% confidence by default. 
The confidence level is set on `settings.confidenceLevel`; if it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). -- ❌ The probability that the treatment works. -- ❌ The probability the result will replicate. -- ❌ A measure of effect size — a tiny lift can be highly significant on a huge sample. -- ❌ Proof of "no effect" when above threshold (see "Inconclusive results"). +The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread. -Mixpanel uses Welch's t-test (z-test for large samples). Default α = 0.05 at 95% confidence. The confidence level is set on `settings.confidenceLevel`. If it differs from 0.95, call it out in the verdict (`0.9` inflates false positives; `0.99` is conservative). +For the general meaning of a p-value (the probability under the null), trust the model's baseline knowledge — don't invent thresholds in either direction. --- @@ -50,7 +44,6 @@ Mixpanel uses Welch's t-test (z-test for large samples). Default α = 0.05 at 95 lift = (treatment_mean - control_mean) / control_mean ``` -- `liftConfidence` is the **confidence level used** (e.g. 0.95). It is NOT the confidence-interval width. - **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct. - If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect." @@ -125,7 +118,7 @@ Different metric types behave differently; cite the relevant nuance in your verd ## Variance-reduction & outlier settings that change interpretation - **CUPED** (`settings.cuped.enabled == true`): mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. 
Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). -- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration (see `health-check-interpretation.md` §Misconfig). +- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). --- diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md similarity index 94% rename from plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md rename to plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md index fcf9cfd..e0c43d2 100644 --- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/segment-breakdown-interpretation.md +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md @@ -1,6 +1,6 @@ # Segment-Breakdown Interpretation -Open this when the user has per-segment results in hand and wants to read them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. 
+Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. > **Platform support status.** Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts. @@ -49,7 +49,7 @@ Each segment value needs its own meaningful per-variant sample for the per-segme | Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. | | Two opposite-direction effects in different segments that roughly cancel overall | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses. | -When you spot Simpson's paradox, route the user to [health-check-interpretation.md](health-check-interpretation.md) §SRM — it's usually the cause, not a real reversal. +When you spot Simpson's paradox, route the user to the **SRM** section of [health-check-interpretation.md](health-check-interpretation.md) — bucketing is usually the cause, not a real reversal. 
--- diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-of-interest-selection.md similarity index 95% rename from plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-of-interest-selection.md rename to plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-of-interest-selection.md index ea9f22b..b0c8f58 100644 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/segment-of-interest-selection.md +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-of-interest-selection.md @@ -1,6 +1,6 @@ # Segment-of-Interest Selection -Open this when the user wants to break results down by user segments — _"slice this by platform"_, _"which segments should I look at?"_, _"are new users responding differently?"_. The goal is to pick 3–5 segments that are **likely to reveal a real effect difference**, before slicing every available dimension and ending up p-hacking. +Pick 3–5 segments **likely to reveal a real effect difference** before slicing every available dimension and ending up p-hacking. The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them. 
diff --git a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/session-replay-analysis.md similarity index 96% rename from plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md rename to plugins/mixpanel-mcp/skills/interpret-experiment/references/session-replay-analysis.md index b758b8e..59ad25e 100644 --- a/plugins/mixpanel-mcp-eu/skills/experiment-results/references/session-replay-analysis.md +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/session-replay-analysis.md @@ -1,6 +1,6 @@ # Session-Replay Analysis Guidance -Open this when the user wants to use session replays to explain a quantitative experiment result — _"why is conversion down in treatment?"_, _"what are users actually doing in the treatment?"_, _"can replays explain the regression?"_. The goal is to turn a number into a behavior story. +Turn a quantitative experiment result into a behavior story using session replays. > **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. 
diff --git a/plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md similarity index 94% rename from plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md rename to plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md index 142089c..a4e69d4 100644 --- a/plugins/mixpanel-mcp-in/skills/experiment-results/references/why-no-statsig.md +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md @@ -1,8 +1,8 @@ # Why Hasn't This Reached Statistical Significance Yet? -Open this when the user asks why a primary metric is still inconclusive — _"why isn't this stat-sig yet?"_, _"should I wait or stop?"_, _"is this just underpowered?"_. The goal is to help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null**, _without_ recomputing the platform's verdicts. +Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts. -The actual stop / extend math (sample size, power, MDE) is owned by the **setup-side skill** — point the user there for the formulas. This skill explains _which_ lever to pull, not how to recompute one. +The actual stop / extend math (sample size, power, MDE) is owned by the `experiment-setup` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one. --- From ec34b1297e11903014d861be330f1bb571cf8284 Mon Sep 17 00:00:00 2001 From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Date: Tue, 9 Jun 2026 18:48:06 +0000 Subject: [PATCH 06/11] interpret-experiment: phase-1 review fixes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses Phase 1 of the hardcore /review-skill pass. 
- Drop stale "Step 1 of the Decision Tree" cross-references in SKILL.md Glossary, why-no-statsig.md, and segment-of-interest-selection.md. The new spine numbers the trustworthiness gate as step 2, but the name "trustworthiness gate" is what's stable — use the name.
- Drop the embedded ~30s retry interval in health-check-interpretation.md §5. Retry policy belongs to the tool layer; "retry once, then surface" is enough for the skill.
- Hedge five unsourced defaults (Welch's t-test choice, 95% winsorization default, ~350 per-variant exposure floor cited in three places, Bonferroni correction on multi-variant SRM). Each one becomes "the platform's configured/default X — verify in product" instead of a flat assertion.

Sync via make sync-skills FORCE=1; make check-skills-sync passes.

Assisted by Claude
---
 .../skills/interpret-experiment/SKILL.md | 2 +-
 .../references/health-check-interpretation.md | 10 +++++-----
 .../references/per-metric-interpretation.md | 2 +-
 .../references/segment-breakdown-interpretation.md | 8 ++++----
 .../references/segment-of-interest-selection.md | 4 ++--
 .../interpret-experiment/references/why-no-statsig.md | 4 ++--
 .../skills/interpret-experiment/SKILL.md | 2 +-
 .../references/health-check-interpretation.md | 10 +++++-----
 .../references/per-metric-interpretation.md | 2 +-
 .../references/segment-breakdown-interpretation.md | 8 ++++----
 .../references/segment-of-interest-selection.md | 4 ++--
 .../interpret-experiment/references/why-no-statsig.md | 4 ++--
 .../mixpanel-mcp/skills/interpret-experiment/SKILL.md | 2 +-
 .../references/health-check-interpretation.md | 10 +++++-----
 .../references/per-metric-interpretation.md | 2 +-
 .../references/segment-breakdown-interpretation.md | 8 ++++----
 .../references/segment-of-interest-selection.md | 4 ++--
 .../interpret-experiment/references/why-no-statsig.md | 4 ++--
 18 files changed, 45 insertions(+), 45 deletions(-)

diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md index c2d7591..c205f29 100644 --- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md @@ -28,7 +28,7 @@ Concepts the rest of this skill uses without redefining. - **CUPED.** Variance reduction using pre-exposure baseline. Cuts required sample 30–70% when it applies. Inert on new-user-only cohorts. - **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95. - **MDE (Minimum Detectable Effect).** The smallest lift the experiment was sized to detect. Set during experiment setup. -- **Trustworthiness gate.** The pre-flight check in Step 1 of the Decision Tree: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference. +- **Trustworthiness gate.** The pre-flight check that runs before any results interpretation: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference. --- diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md index e9082fa..a0658e2 100644 --- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md @@ -45,7 +45,7 @@ Users were assigned to variants in proportions that disagree with the configured 1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented? 2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history. 
-3. For multi-variant tests, the platform's SRM threshold is Bonferroni-corrected — the effective per-variant threshold may be tighter than the headline. Trust the bucket flag, not raw p-value math. +3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math. 4. Verify SDK version and bucketing logic. Query `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly. 5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it. 6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting. @@ -115,8 +115,8 @@ A frequentist test that ends before reaching its configured target has an **infl ### Investigation checklist -1. Retry the experiment-details request — transient backend load may resolve. Wait ~30s between retries. -2. If repeated failures: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. +1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy. +2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. 3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision. 4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation. @@ -150,9 +150,9 @@ These don't always invalidate results, but they change how to _read_ them. 
Surfa **Condition**: `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` is very low (e.g. < ~80) or unusually high (e.g. > ~99). -**Interpretation**: outlier capping is far from the platform's default of 95. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. +**Interpretation**: outlier capping is far from the configured platform default (typically 95 — verify in product). A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. -**Action**: ask the user to confirm the percentile was intentional; recommend resetting to 95 unless they have a specific reason. +**Action**: ask the user to confirm the percentile was intentional; recommend resetting to the platform default unless they have a specific reason. ### SRM check disabled diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md index 3f272ad..d8877fb 100644 --- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md @@ -30,7 +30,7 @@ Apply the polarity recipe from the spine — see the **Components** section of ` ## Reading the p-value in this platform -Mixpanel uses Welch's t-test (z-test for large samples) at α = 0.05 / 95% confidence by default. The confidence level is set on `settings.confidenceLevel`; if it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). +Mixpanel runs a frequentist comparison at the configured `settings.confidenceLevel` — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). 
The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread. diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md index e0c43d2..f5623e1 100644 --- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md @@ -32,10 +32,10 @@ Surprisingly easy to forget when you're scanning a wide table — re-apply polar ## Sample-size floor per segment -Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. As a rule of thumb, the same ~350-per-variant floor used for overall trustworthiness applies per segment. +Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. The platform surfaces an "insufficient exposures" flag at the overall level — trust that signal over a hand-rolled threshold, and apply the same logic per segment. -- Segments below the floor → mark "insufficient sample, treat as directional only." -- A "significant" lift on a 50-user-per-variant segment is almost always noise. Say so. +- Segments the platform would flag insufficient if scoped to alone → mark "insufficient sample, treat as directional only." +- A "significant" lift on a tiny per-variant segment (e.g. tens of users) is almost always noise. Say so. - If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice. --- @@ -58,7 +58,7 @@ When you spot Simpson's paradox, route the user to the **SRM** section of [healt Don't recommend a segment-scoped ship unless **all** of these hold: 1. 
The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it). -2. The segment's per-variant sample clears the ~350 floor by a comfortable margin. +2. The segment's per-variant sample clears whatever exposure floor the platform applies to the overall experiment, by a comfortable margin. 3. The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment. 4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product. 5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply. diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-of-interest-selection.md index b0c8f58..4db49ac 100644 --- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-of-interest-selection.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-of-interest-selection.md @@ -64,7 +64,7 @@ If overall SRM is borderline (or failing in one variant only), per-segment SRM c - Bot-suspicious countries (`bot_traffic` cause from health-check). - A specific app version range that shipped a flag-evaluation change. -This is diagnostic segmentation, not interpretation segmentation. Use it when Step 1 of the Decision Tree has already flagged trouble. +This is diagnostic segmentation, not interpretation segmentation. Use it when the **trustworthiness gate** has already flagged trouble. ### 5. Segments the platform de facto requires @@ -82,7 +82,7 @@ Don't include all three blindly — pick the one(s) most likely to vary given th For each segment you want to break down on: -1. 
**Does each segment value have ~350+ exposed users per variant?** Below that floor, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment. +1. **Does each segment value have enough exposed users per variant to clear the platform's overall sufficiency threshold?** Below that, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment. 2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis. 3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. "Pro tier" boundaries changed mid-test) invalidate the comparison. 4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes — slicing on them is selection-bias, not stratification. diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md index a4e69d4..7cc432a 100644 --- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md @@ -13,7 +13,7 @@ Inconclusive can mean two very different things: 1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about. 2. **The result isn't trustworthy at all** — SRM failing, broken data, peeked frequentist, etc. — and "inconclusive" is the wrong frame entirely. -Before answering "why no statsig?", run the trustworthiness gate (Step 1 of the Decision Tree). If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power. 
+Before answering "why no statsig?", run the **trustworthiness gate**. If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power. Also check: @@ -63,7 +63,7 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu - Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue. - Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later. -- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs its own ~350+ sample for the per-comparison stats to be reliable. Adding arms costs power per-comparison. +- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs to clear the platform's per-variant exposure floor in its own right. Adding arms costs power per-comparison. Never change traffic allocation mid-Frequentist test — it invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment. diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md index c2d7591..c205f29 100644 --- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md @@ -28,7 +28,7 @@ Concepts the rest of this skill uses without redefining. - **CUPED.** Variance reduction using pre-exposure baseline. Cuts required sample 30–70% when it applies. Inert on new-user-only cohorts. - **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95. 
- **MDE (Minimum Detectable Effect).** The smallest lift the experiment was sized to detect. Set during experiment setup. -- **Trustworthiness gate.** The pre-flight check in Step 1 of the Decision Tree: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference. +- **Trustworthiness gate.** The pre-flight check that runs before any results interpretation: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference. --- diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md index e9082fa..a0658e2 100644 --- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md @@ -45,7 +45,7 @@ Users were assigned to variants in proportions that disagree with the configured 1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented? 2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history. -3. For multi-variant tests, the platform's SRM threshold is Bonferroni-corrected — the effective per-variant threshold may be tighter than the headline. Trust the bucket flag, not raw p-value math. +3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math. 4. Verify SDK version and bucketing logic. 
Query `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly. 5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it. 6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting. @@ -115,8 +115,8 @@ A frequentist test that ends before reaching its configured target has an **infl ### Investigation checklist -1. Retry the experiment-details request — transient backend load may resolve. Wait ~30s between retries. -2. If repeated failures: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. +1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy. +2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. 3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision. 4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation. @@ -150,9 +150,9 @@ These don't always invalidate results, but they change how to _read_ them. Surfa **Condition**: `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` is very low (e.g. < ~80) or unusually high (e.g. > ~99). -**Interpretation**: outlier capping is far from the platform's default of 95. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. +**Interpretation**: outlier capping is far from the configured platform default (typically 95 — verify in product). 
A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. -**Action**: ask the user to confirm the percentile was intentional; recommend resetting to 95 unless they have a specific reason. +**Action**: ask the user to confirm the percentile was intentional; recommend resetting to the platform default unless they have a specific reason. ### SRM check disabled diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md index 3f272ad..d8877fb 100644 --- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md @@ -30,7 +30,7 @@ Apply the polarity recipe from the spine — see the **Components** section of ` ## Reading the p-value in this platform -Mixpanel uses Welch's t-test (z-test for large samples) at α = 0.05 / 95% confidence by default. The confidence level is set on `settings.confidenceLevel`; if it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). +Mixpanel runs a frequentist comparison at the configured `settings.confidenceLevel` — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread. 
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md index e0c43d2..f5623e1 100644 --- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md @@ -32,10 +32,10 @@ Surprisingly easy to forget when you're scanning a wide table — re-apply polar ## Sample-size floor per segment -Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. As a rule of thumb, the same ~350-per-variant floor used for overall trustworthiness applies per segment. +Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. The platform surfaces an "insufficient exposures" flag at the overall level — trust that signal over a hand-rolled threshold, and apply the same logic per segment. -- Segments below the floor → mark "insufficient sample, treat as directional only." -- A "significant" lift on a 50-user-per-variant segment is almost always noise. Say so. +- Segments the platform would flag insufficient if scoped to alone → mark "insufficient sample, treat as directional only." +- A "significant" lift on a tiny per-variant segment (e.g. tens of users) is almost always noise. Say so. - If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice. --- @@ -58,7 +58,7 @@ When you spot Simpson's paradox, route the user to the **SRM** section of [healt Don't recommend a segment-scoped ship unless **all** of these hold: 1. The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it). -2. The segment's per-variant sample clears the ~350 floor by a comfortable margin. +2. 
The segment's per-variant sample clears whatever exposure floor the platform applies to the overall experiment, by a comfortable margin. 3. The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment. 4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product. 5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply. diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-of-interest-selection.md index b0c8f58..4db49ac 100644 --- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-of-interest-selection.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-of-interest-selection.md @@ -64,7 +64,7 @@ If overall SRM is borderline (or failing in one variant only), per-segment SRM c - Bot-suspicious countries (`bot_traffic` cause from health-check). - A specific app version range that shipped a flag-evaluation change. -This is diagnostic segmentation, not interpretation segmentation. Use it when Step 1 of the Decision Tree has already flagged trouble. +This is diagnostic segmentation, not interpretation segmentation. Use it when the **trustworthiness gate** has already flagged trouble. ### 5. Segments the platform de facto requires @@ -82,7 +82,7 @@ Don't include all three blindly — pick the one(s) most likely to vary given th For each segment you want to break down on: -1. **Does each segment value have ~350+ exposed users per variant?** Below that floor, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment. +1. 
**Does each segment value have enough exposed users per variant to clear the platform's overall sufficiency threshold?** Below that, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment. 2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis. 3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. "Pro tier" boundaries changed mid-test) invalidate the comparison. 4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes — slicing on them is selection-bias, not stratification. diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md index a4e69d4..7cc432a 100644 --- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md @@ -13,7 +13,7 @@ Inconclusive can mean two very different things: 1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about. 2. **The result isn't trustworthy at all** — SRM failing, broken data, peeked frequentist, etc. — and "inconclusive" is the wrong frame entirely. -Before answering "why no statsig?", run the trustworthiness gate (Step 1 of the Decision Tree). If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power. +Before answering "why no statsig?", run the **trustworthiness gate**. 
If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power. Also check: @@ -63,7 +63,7 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu - Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue. - Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later. -- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs its own ~350+ sample for the per-comparison stats to be reliable. Adding arms costs power per-comparison. +- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs to clear the platform's per-variant exposure floor in its own right. Adding arms costs power per-comparison. Never change traffic allocation mid-Frequentist test — it invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment. diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md index c2d7591..c205f29 100644 --- a/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md @@ -28,7 +28,7 @@ Concepts the rest of this skill uses without redefining. - **CUPED.** Variance reduction using pre-exposure baseline. Cuts required sample 30–70% when it applies. Inert on new-user-only cohorts. - **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95. 
- **MDE (Minimum Detectable Effect).** The smallest lift the experiment was sized to detect. Set during experiment setup. -- **Trustworthiness gate.** The pre-flight check in Step 1 of the Decision Tree: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference. +- **Trustworthiness gate.** The pre-flight check that runs before any results interpretation: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference. --- diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md index e9082fa..a0658e2 100644 --- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md @@ -45,7 +45,7 @@ Users were assigned to variants in proportions that disagree with the configured 1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented? 2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history. -3. For multi-variant tests, the platform's SRM threshold is Bonferroni-corrected — the effective per-variant threshold may be tighter than the headline. Trust the bucket flag, not raw p-value math. +3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math. 4. Verify SDK version and bucketing logic. 
Query `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly. 5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it. 6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting. @@ -115,8 +115,8 @@ A frequentist test that ends before reaching its configured target has an **infl ### Investigation checklist -1. Retry the experiment-details request — transient backend load may resolve. Wait ~30s between retries. -2. If repeated failures: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. +1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy. +2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. 3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision. 4. If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation. @@ -150,9 +150,9 @@ These don't always invalidate results, but they change how to _read_ them. Surfa **Condition**: `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` is very low (e.g. < ~80) or unusually high (e.g. > ~99). -**Interpretation**: outlier capping is far from the platform's default of 95. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. +**Interpretation**: outlier capping is far from the configured platform default (typically 95 — verify in product). 
A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. -**Action**: ask the user to confirm the percentile was intentional; recommend resetting to 95 unless they have a specific reason. +**Action**: ask the user to confirm the percentile was intentional; recommend resetting to the platform default unless they have a specific reason. ### SRM check disabled diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md index 3f272ad..d8877fb 100644 --- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md @@ -30,7 +30,7 @@ Apply the polarity recipe from the spine — see the **Components** section of ` ## Reading the p-value in this platform -Mixpanel uses Welch's t-test (z-test for large samples) at α = 0.05 / 95% confidence by default. The confidence level is set on `settings.confidenceLevel`; if it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). +Mixpanel runs a frequentist comparison at the configured `settings.confidenceLevel` — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread. 
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md index e0c43d2..f5623e1 100644 --- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md @@ -32,10 +32,10 @@ Surprisingly easy to forget when you're scanning a wide table — re-apply polar ## Sample-size floor per segment -Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. As a rule of thumb, the same ~350-per-variant floor used for overall trustworthiness applies per segment. +Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. The platform surfaces an "insufficient exposures" flag at the overall level — trust that signal over a hand-rolled threshold, and apply the same logic per segment. -- Segments below the floor → mark "insufficient sample, treat as directional only." -- A "significant" lift on a 50-user-per-variant segment is almost always noise. Say so. +- Segments the platform would flag insufficient if scoped to alone → mark "insufficient sample, treat as directional only." +- A "significant" lift on a tiny per-variant segment (e.g. tens of users) is almost always noise. Say so. - If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice. --- @@ -58,7 +58,7 @@ When you spot Simpson's paradox, route the user to the **SRM** section of [healt Don't recommend a segment-scoped ship unless **all** of these hold: 1. The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it). -2. The segment's per-variant sample clears the ~350 floor by a comfortable margin. +2. 
The segment's per-variant sample clears whatever exposure floor the platform applies to the overall experiment, by a comfortable margin. 3. The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment. 4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product. 5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply. diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-of-interest-selection.md index b0c8f58..4db49ac 100644 --- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-of-interest-selection.md +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-of-interest-selection.md @@ -64,7 +64,7 @@ If overall SRM is borderline (or failing in one variant only), per-segment SRM c - Bot-suspicious countries (`bot_traffic` cause from health-check). - A specific app version range that shipped a flag-evaluation change. -This is diagnostic segmentation, not interpretation segmentation. Use it when Step 1 of the Decision Tree has already flagged trouble. +This is diagnostic segmentation, not interpretation segmentation. Use it when the **trustworthiness gate** has already flagged trouble. ### 5. Segments the platform de facto requires @@ -82,7 +82,7 @@ Don't include all three blindly — pick the one(s) most likely to vary given th For each segment you want to break down on: -1. **Does each segment value have ~350+ exposed users per variant?** Below that floor, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment. +1. 
**Does each segment value have enough exposed users per variant to clear the platform's overall sufficiency threshold?** Below that, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment. 2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis. 3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. "Pro tier" boundaries changed mid-test) invalidate the comparison. 4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes — slicing on them is selection-bias, not stratification. diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md index a4e69d4..7cc432a 100644 --- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md @@ -13,7 +13,7 @@ Inconclusive can mean two very different things: 1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about. 2. **The result isn't trustworthy at all** — SRM failing, broken data, peeked frequentist, etc. — and "inconclusive" is the wrong frame entirely. -Before answering "why no statsig?", run the trustworthiness gate (Step 1 of the Decision Tree). If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power. +Before answering "why no statsig?", run the **trustworthiness gate**. 
If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power. Also check: @@ -63,7 +63,7 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu - Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue. - Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later. -- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs its own ~350+ sample for the per-comparison stats to be reliable. Adding arms costs power per-comparison. +- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs to clear the platform's per-variant exposure floor in its own right. Adding arms costs power per-comparison. Never change traffic allocation mid-Frequentist test — it invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment. From 3d1a0913849bcf382c8296d11fcbb65bab532b7d Mon Sep 17 00:00:00 2001 From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Date: Tue, 9 Jun 2026 18:55:57 +0000 Subject: [PATCH 07/11] interpret-experiment: phase-2 review fixes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses Phase 2 of the hardcore /review-skill pass. Removes the field-path schema leaks gslopez originally flagged on PR #23, which had survived the phase-1 cleanup. - Rewrite every section header in health-check-interpretation.md (sections 1-6 + the seven §7 misconfig sub-sections) from "Verdict to consume: " to plain-language intent. 
Same for the Condition/Interpretation/Action scaffolding in §7 — collapsed to "When: " + a free paragraph, dropping the labels the reader doesn't need. - Rewrite every "What to look at" bullet in why-no-statsig.md (reasons 1-5) from field-path triplets to intent. Same for the "First, rule out a broken result" checks and the EXTEND action. - Remove the remaining settings.* / live_* / results_cache.* references in per-metric-interpretation.md (baseline-fetch, variance/outlier discussion, multiple-comparisons section, testing-model section) and SKILL.md's polarity recipe (multiple-testing correction). - Remove field-path leaks from lifecycle-handoff.md (decision-record) and session-replay-analysis.md (example user-facing quote). - Add a one-sentence disambiguation guard at the top of SKILL.md Step 1: if the user hasn't named a specific experiment, ask before fetching. - Expand "SRM" on first mention in the description and replace "hasn't reached statistical significance" with "isn't showing a clear winner yet" for non-expert legibility. Glossary inside SKILL.md still does the heavy definition. After this commit, `grep -rE 'live_|results_cache|exposures_cache|settings\.<…>' plugins/mixpanel-mcp/skills/interpret-experiment/` returns zero hits. Sync via make sync-skills FORCE=1; make check-skills-sync passes. 
Assisted by Claude --- .../skills/interpret-experiment/SKILL.md | 6 +- .../references/health-check-interpretation.md | 80 ++++++++----------- .../references/lifecycle-handoff.md | 2 +- .../references/per-metric-interpretation.md | 18 ++--- .../references/session-replay-analysis.md | 2 +- .../references/why-no-statsig.md | 24 +++--- .../skills/interpret-experiment/SKILL.md | 6 +- .../references/health-check-interpretation.md | 80 ++++++++----------- .../references/lifecycle-handoff.md | 2 +- .../references/per-metric-interpretation.md | 18 ++--- .../references/session-replay-analysis.md | 2 +- .../references/why-no-statsig.md | 24 +++--- .../skills/interpret-experiment/SKILL.md | 6 +- .../references/health-check-interpretation.md | 80 ++++++++----------- .../references/lifecycle-handoff.md | 2 +- .../references/per-metric-interpretation.md | 18 ++--- .../references/session-replay-analysis.md | 2 +- .../references/why-no-statsig.md | 24 +++--- 18 files changed, 180 insertions(+), 216 deletions(-) diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md index c205f29..18b15f7 100644 --- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md @@ -1,6 +1,6 @@ --- name: interpret-experiment -description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill. 
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill. license: Apache-2.0 --- @@ -48,7 +48,7 @@ Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"): A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first (use `settings.controlKey`). -The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**. +The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**. ## Data-source fallback @@ -74,6 +74,8 @@ Top-down: what to do, in order. ## 1. Fetch the experiment +If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs. + Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments. Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret. 
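Reviewer note on the polarity recipe touched in the SKILL.md hunk above: the rule that a positive lift on a `direction: "down"` metric is a regression can be sketched as a small classifier. This is a minimal illustration, not the platform's implementation. The function name `classify_lift` and the labels `"win"`, `"regression"`, and `"flat"` are hypothetical. Only `lift`, `direction`, and the `"up"` default come from the skill text. The sketch also assumes the mirror case (a negative lift on a `"down"` metric counts as a win) and deliberately ignores statistical significance, which the platform's verdict already covers.

```python
from typing import Literal

Direction = Literal["up", "down"]


def classify_lift(lift: float, direction: Direction = "up") -> str:
    """Classify a lift against the metric's desired direction.

    Hypothetical helper: returns "win", "regression", or "flat".
    It looks only at the sign of the lift, never at significance.
    """
    if direction not in ("up", "down"):
        raise ValueError(f"direction must be 'up' or 'down', got {direction!r}")
    if lift == 0:
        return "flat"
    moved_up = lift > 0
    wants_up = direction == "up"
    # A win is any movement that matches the direction the metric should go.
    return "win" if moved_up == wants_up else "regression"


if __name__ == "__main__":
    print(classify_lift(0.05))            # positive lift, metric should go up
    print(classify_lift(0.05, "down"))    # positive lift, metric should go down
    print(classify_lift(-0.05, "down"))   # negative lift, metric should go down
```

The control row is assumed to be filtered out before this helper is called, as the recipe requires.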
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md index a0658e2..1edc9fa 100644 --- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md @@ -18,11 +18,11 @@ These two principles drive the recommendations below. Lead with them when explai ## 1. SRM (Sample Ratio Mismatch) -**Verdict to consume**: `live_srm_analysis` (or `exposures_cache.$srm_analysis`). The platform tags failing SRMs already; do not compute chi-square yourself. +**What the platform tells you**: the SRM verdict the experiment-details response carries (live, or cached when live isn't available). The platform tags failing SRMs already — consume the verdict, do not compute chi-square yourself. ### What it means -Users were assigned to variants in proportions that disagree with the configured `settings.srm.targetAllocations`. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness. +Users were assigned to variants in proportions that disagree with the configured target allocation. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness. ### Likely causes, ordered most → least likely @@ -31,23 +31,23 @@ Users were assigned to variants in proportions that disagree with the configured 1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees. 2. **biased_assignment** — The assignment criterion correlates with the variant — e.g. 
assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window. 3. **bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation. -4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the `$experiment_started` event fires exactly once per user per variant assignment. +4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the exposure event fires exactly once per user per variant assignment. 5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period. ### Recommended actions - **pause_and_investigate** — Pause the experiment before drawing any conclusions. SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable. - **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric. -- **investigate_exposure_logging** — Inspect `$experiment_started` event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs. +- **investigate_exposure_logging** — Compare exposure event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs. - **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split. 
### Investigation checklist -1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented? +1. Compare the actual per-variant exposure ratio to the configured target allocation — which variant is over/under-represented? 2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history. 3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math. -4. Verify SDK version and bucketing logic. Query `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly. -5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it. +4. Verify SDK version and bucketing logic. Query the exposure event grouped by variant to confirm exposure events are flowing correctly. +5. Check for bot/QA traffic — bots often skew toward control. If QA traffic isn't being excluded, recommend enabling that filter. 6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting. 7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.** @@ -55,7 +55,7 @@ Users were assigned to variants in proportions that disagree with the configured ## 2. Retro A/A (pre-experiment bias) failure -**Verdict to consume**: the analysis the platform attached when `settings.preExperimentBias` is enabled. +**What the platform tells you**: the pre-experiment-bias analysis the platform attaches when that check is enabled in the experiment's settings. ### What it means @@ -76,14 +76,14 @@ The same statistical comparison run on the **pre-exposure** period revealed that ## 3. 
Insufficient exposures -**Verdict to consume**: `live_exposures` per variant, plus any platform-attached "insufficient" flag. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue. +**What the platform tells you**: per-variant exposure counts plus an "insufficient" flag when the count is too low to trust. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue. ### Investigation checklist -1. Check `live_exposures` totals — which variant is undersampled? +1. Check per-variant exposure totals — which variant is undersampled? 2. Inspect feature-flag rollout — was rollout dialed back? 3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?). -4. If the experiment is still ACTIVE: extend duration via an experiment update with a new `endAfterDays`. +4. If the experiment is still ACTIVE: extend duration via an experiment update with a new end target. 5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math. If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question. @@ -92,7 +92,7 @@ If the user wants to talk about _why_ a primary metric is still inconclusive eve ## 4. Frequentist peeking -**Verdict to consume**: `settings.testingModel == "frequentist"`, plus `end_date` vs `start_date + endAfterDays` (or `sampleSize` vs `live_exposures.$overall`, depending on `settings.endCondition`). +**What to check**: the experiment's testing model and whether it ended before reaching its configured end condition (sample size or duration, whichever was configured). 
### What it means @@ -100,10 +100,10 @@ A frequentist test that ends before reaching its configured target has an **infl ### Investigation checklist -1. Confirm `settings.testingModel == "frequentist"`. -2. Compare `end_date` against `start_date + endAfterDays` (or whether `sampleSize` was reached, whichever is the configured `endCondition`). +1. Confirm the testing model is frequentist (sequential tests don't have this problem). +2. Compare the actual end date against the planned end (date- or sample-based, whichever the experiment was configured with). 3. If the conclusion was premature: results have inflated false-positive rate. Recommend a re-run. -4. If the user wants to keep current results: caveat strongly. Recommend `testingModel: "sequential"` for the next experiment so they can stop early without penalty. +4. If the user wants to keep current results: caveat strongly. Recommend a sequential testing model for the next experiment so they can stop early without penalty. (Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.) @@ -111,26 +111,26 @@ A frequentist test that ends before reaching its configured target has an **infl ## 5. Live computation timeout / broken data -**Verdict to consume**: `live_results_errors` non-null with `live_*` fields null. +**What the platform tells you**: a non-null error block on the live results, with the live data path empty. ### Investigation checklist 1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy. 2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. 3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision. -4. 
If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation. +4. If the cache is recent (within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or empty, the user must resolve the backend issue before any meaningful interpretation. --- ## 6. Experiment ran < 3 days -**Verdict to compute (this one is local)**: `end_date - start_date`. +**What to compute (this one is local)**: the elapsed time between the experiment's start and end. Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. **Refuse to interpret.** Tell the user explicitly: > _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_ -If `endCondition: "sample_size"` with a tiny target (e.g. 10k) was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window. +If the experiment was sample-size-bounded and a tiny target was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window. --- @@ -140,59 +140,45 @@ These don't always invalidate results, but they change how to _read_ them. Surfa ### Multiple-testing correction off with several primaries -**Condition**: `settings.multipleTestingCorrection` is `"off"` or `null` AND there are 2+ primary metrics across 1+ non-control variants. +**When**: multiple-testing correction is off AND there are 2+ primary metrics across 1+ non-control variants. -**Interpretation**: any single significant primary may be a false positive. 
Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). - -**Action**: look at all primary results in aggregate. If most point the same direction, there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk** — the user can enable Benjamini-Hochberg or Bonferroni and re-analyze. +Any single significant primary may be a false positive. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). Look at all primary results in aggregate: if most point the same direction, there is likely a real effect; if only one or two of many are significant, the result is **inconclusive due to false-positive risk** — recommend the user enable Benjamini-Hochberg or Bonferroni and re-analyze. ### Extreme winsorization percentile -**Condition**: `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` is very low (e.g. < ~80) or unusually high (e.g. > ~99). - -**Interpretation**: outlier capping is far from the configured platform default (typically 95 — verify in product). A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. +**When**: Winsorization is enabled with a percentile far from the platform's default (typically 95). -**Action**: ask the user to confirm the percentile was intentional; recommend resetting to the platform default unless they have a specific reason. +Outlier capping is far from the platform default. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. Ask the user to confirm the percentile was intentional; recommend resetting to the default unless they have a specific reason. ### SRM check disabled -**Condition**: `settings.srm == null` OR `settings.srm.enabled == false`. 
+**When**: the experiment's SRM check is off. -**Interpretation**: the SRM check didn't run. **Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug. - -**Action**: only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios). When you do flag, recommend re-enabling SRM and re-analyzing. +**Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios); when you do flag, recommend re-enabling SRM and re-analyzing. ### CUPED on new-users-only cohort -**Condition**: `settings.cuped.enabled == true` AND the experiment cohort is "new users only". - -**Interpretation**: CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen. +**When**: CUPED is enabled AND the experiment cohort is "new users only". -**Action**: mention as informational; no remediation needed for this experiment. For future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply. +CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen. Mention as informational; for future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply. ### Non-default confidence level -**Condition**: `settings.confidenceLevel != 0.95`. +**When**: the experiment is configured for a confidence level other than the platform default (typically 0.95). -**Interpretation**: `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. 
- -**Action**: call out explicitly in the verdict. Combine with metric count to estimate the family-wise error rate. +`0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out explicitly in the verdict and combine with metric count to estimate the family-wise error rate. ### Broken or placeholder metric entries -**Condition**: `metrics[]` contains entries with `name == ""`. - -**Interpretation**: likely a broken or placeholder metric reference. +**When**: the experiment includes metric entries with empty names. -**Action**: flag and skip during analysis. +Likely a broken or placeholder metric reference. Flag and skip during analysis. ### Primary metric with no computed result -**Condition**: a primary metric appears in `metrics[]` but is **missing from both** `live_metrics` and `results_cache.metrics`. - -**Interpretation**: no result was computed for that primary. **This is "no measurement," not "no effect."** +**When**: a metric is listed as primary on the experiment but has no result (neither live nor cached). -**Action**: surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary. +No result was computed for that primary. **This is "no measurement," not "no effect."** Surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary. 
--- diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md index 4d8189d..3a9e24c 100644 --- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/lifecycle-handoff.md @@ -36,4 +36,4 @@ A multi-variant test where only one treatment is significantly different from co ## After concluding -The decision record (`results_cache.message`, `results_cache.variant`, and `status` transitioning to `concluded` / `success` / `fail`) becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer. +The decision record — the rationale message, the shipped variant, and the experiment's terminal status — becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer. diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md index d8877fb..576ef9f 100644 --- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md @@ -30,7 +30,7 @@ Apply the polarity recipe from the spine — see the **Components** section of ` ## Reading the p-value in this platform -Mixpanel runs a frequentist comparison at the configured `settings.confidenceLevel` — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). +Mixpanel runs a frequentist comparison at the experiment's configured confidence level — typically 0.95 (verify in product if results look off). 
If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread. @@ -70,16 +70,16 @@ Pick the phrase that matches the four-question pattern. These are the words to u Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful: -1. Baseline from the control variant: `live_metrics[metricId][controlKey].value` (or the `summary.no` row where `variant == controlKey`). +1. Baseline from the control variant's metric value (the experiment-details response carries it on the per-variant row). 2. Lift from the winning row. -3. Absolute lift: `baseline_value × lift`. Examples: +3. Absolute lift: `baseline × lift`. Examples: - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate. - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`. 4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week." -### Fallback when `value` / `sampleSize` are null +### Fallback when the baseline value or sample size is missing -Common — happens whenever live computation timed out or `results_cache.metrics` was nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.** +Common — happens whenever live computation timed out or the cached results were nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.** Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. 
Match the metric's aggregation: @@ -117,8 +117,8 @@ Different metric types behave differently; cite the relevant nuance in your verd ## Variance-reduction & outlier settings that change interpretation -- **CUPED** (`settings.cuped.enabled == true`): mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). -- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). +- **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). +- **Winsorization enabled**: extreme values capped at the configured percentile, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A percentile much lower than the platform default (typically 95) is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). --- @@ -130,7 +130,7 @@ Different metric types behave differently; cite the relevant nuance in your verd | **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude. | | **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. 
**Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. | -If `settings.multipleTestingCorrection` is `"off"` AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled. +If multiple-testing correction is off AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled. --- @@ -153,7 +153,7 @@ For the full "why hasn't this hit statsig yet" walk-through, see [why-no-statsig ## Frequentist vs Sequential — what affects per-metric reading -Check `settings.testingModel`: +Check the experiment's testing model: - `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration. - `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended. 
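Reviewer note on the per-metric reference above: the practical-significance arithmetic (absolute lift = baseline × lift) and the multiple-testing caveat can be sanity-checked with a few lines of Python. This is only a sketch. The function names `absolute_lift` and `family_wise_error_rate` are hypothetical helpers, not platform fields or tool calls, and the family-wise formula assumes independent comparisons, which correlated product metrics usually are not.

```python
def absolute_lift(baseline: float, relative_lift: float) -> float:
    """Absolute change implied by a relative lift on the control baseline."""
    return baseline * relative_lift


def family_wise_error_rate(alpha: float, comparisons: int) -> float:
    """Chance of at least one false positive across uncorrected comparisons.

    Assumption: comparisons are independent. Treat the result as a rough
    guide, not an exact rate.
    """
    return 1.0 - (1.0 - alpha) ** comparisons


# Conversion-rate example from the text: 2% baseline, 4% relative lift.
conversion_delta = absolute_lift(0.02, 0.04)
conversion_delta_pp = conversion_delta * 100  # expressed in percentage points

# Count-metric example from the text: 12.4 events/user/week, -5% relative lift.
events_delta = absolute_lift(12.4, -0.05)

# 15 uncorrected primaries x 1 non-control variant at alpha = 0.05.
fwer = family_wise_error_rate(0.05, 15)

print(f"{conversion_delta_pp:+.2f} pp, {events_delta:+.2f} events/user/week, FWER ~{fwer:.0%}")
```

Run as written, this prints `+0.08 pp, -0.62 events/user/week, FWER ~54%`, which matches the worked examples in the practical-significance steps and the ~54% figure quoted in the health-check reference. The skill itself should keep consuming the platform's verdicts; a check like this is only useful for confirming the numbers the docs cite.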
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/session-replay-analysis.md index 59ad25e..7282bb4 100644 --- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/session-replay-analysis.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/session-replay-analysis.md @@ -83,7 +83,7 @@ If treatment users _arrive_ at a screen more often but _complete_ at a lower per Replay analysis is qualitative. Be honest about that. -- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in `live_metrics`."_ +- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in the experiment's results."_ - ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict. Tie observations back to specific quantitative results from the experiment-details response. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior. diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md index 7cc432a..dbda2af 100644 --- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md @@ -17,9 +17,9 @@ Before answering "why no statsig?", run the **trustworthiness gate**. If anythin Also check: -- `lift is None` on the primary → no measurement, not "no effect." 
-- The primary is in `metrics[]` but missing from `live_metrics` and `results_cache.metrics` → "no measurement." -- `live_results_errors` is non-null → results are stale or partial; resolve before drawing power conclusions. +- The primary's lift is missing or null → no measurement, not "no effect." +- The primary is listed on the experiment but has no computed result (live or cached) → "no measurement," not "no effect." +- The live results carry an error block → results are stale or partial; resolve the backend issue before drawing power conclusions. --- @@ -29,7 +29,7 @@ Walk through these in order. The first one that explains the picture is usually ### 1. Not enough sample yet (not enough exposures) -**What to look at**: `live_exposures` per variant vs `settings.sampleSize`; or `end_date - start_date` vs `start_date + settings.endAfterDays`; plus `settings.testingModel`. +**What to check**: per-variant exposure counts against the configured end target (sample size or duration, whichever the experiment was configured with), and which testing model the experiment is using. - **Sequential** + target not reached → genuinely too early. Recommend **WAIT**. - **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe. @@ -39,7 +39,7 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu ### 2. Observed effect is smaller than the MDE -**What to look at**: the lift on the primary in `live_metrics[primary][treatment].lift`, plus the MDE the user planned for (typically captured in the experiment's `description` or recovered via the setup-side skill's power math). +**What to check**: the lift on the primary metric, plus the MDE the user planned for (typically captured in the experiment's hypothesis/description, or recovered via the setup-side skill's power math). 
- Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1. - Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options: @@ -49,9 +49,9 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu ### 3. Variance is too high (metric is too noisy) -**What to look at**: distribution type of the metric, plus `settings.cuped.enabled` and `settings.winsorization.enabled`. +**What to check**: the metric's distribution type, plus whether CUPED and Winsorization are enabled. -- **Gaussian** metric (revenue, time-on-page) with no winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization (default percentile 95) on the next run. +- **Gaussian** metric (revenue, time-on-page) with no Winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization on the next run. - **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume. - **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the absolute scale of detectable effects. Lifts near 50% rates are easiest; lifts near 0%/100% need much more sample. - **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%. @@ -59,7 +59,7 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu ### 4. Traffic split is starving the variant -**What to look at**: `settings.srm.targetAllocations` and `live_exposures` per variant. +**What to check**: the configured traffic split against the actual per-variant exposure counts. 
- Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue. - Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later. @@ -69,11 +69,11 @@ Never change traffic allocation mid-Frequentist test — it invalidates the SRM ### 5. Exposure config is filtering more users than the user expects -**What to look at**: the exposure tracking method (`$experiment_started` event volume), any audience filters on the backing feature flag, and `settings.excludeQA`. +**What to check**: exposure event volume, any audience filters on the backing feature flag, and whether QA traffic is being excluded. -- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query `$experiment_started` to confirm how many users actually got exposed. +- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query the exposure event to confirm how many users actually got exposed. - The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event. -- `settings.excludeQA` was off and you suspect internal traffic is dominating one variant → enable it on the next run (results then are cleaner but also smaller). +- QA traffic isn't being excluded and you suspect internal traffic is dominating one variant → enable the QA exclusion on the next run (results then are cleaner but also smaller). **Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. 
they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md). @@ -93,7 +93,7 @@ Once you know which reason fits, the recommendation almost picks itself. | Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. | | Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. | -When recommending EXTEND on an active experiment, the action is an experiment update with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the `experiment-setup` skill for the power math. +When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `experiment-setup` skill for the power math. --- diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md index c205f29..18b15f7 100644 --- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md @@ -1,6 +1,6 @@ --- name: interpret-experiment -description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. 
Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill. +description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill. license: Apache-2.0 --- @@ -48,7 +48,7 @@ Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"): A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first (use `settings.controlKey`). -The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**. +The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**. ## Data-source fallback @@ -74,6 +74,8 @@ Top-down: what to do, in order. ## 1. Fetch the experiment +If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs. + Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments. Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret. 
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md index a0658e2..1edc9fa 100644 --- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md @@ -18,11 +18,11 @@ These two principles drive the recommendations below. Lead with them when explai ## 1. SRM (Sample Ratio Mismatch) -**Verdict to consume**: `live_srm_analysis` (or `exposures_cache.$srm_analysis`). The platform tags failing SRMs already; do not compute chi-square yourself. +**What the platform tells you**: the SRM verdict the experiment-details response carries (live, or cached when live isn't available). The platform tags failing SRMs already — consume the verdict, do not compute chi-square yourself. ### What it means -Users were assigned to variants in proportions that disagree with the configured `settings.srm.targetAllocations`. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness. +Users were assigned to variants in proportions that disagree with the configured target allocation. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness. ### Likely causes, ordered most → least likely @@ -31,23 +31,23 @@ Users were assigned to variants in proportions that disagree with the configured 1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees. 2. **biased_assignment** — The assignment criterion correlates with the variant — e.g. 
assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window. 3. **bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation. -4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the `$experiment_started` event fires exactly once per user per variant assignment. +4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the exposure event fires exactly once per user per variant assignment. 5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period. ### Recommended actions - **pause_and_investigate** — Pause the experiment before drawing any conclusions. SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable. - **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric. -- **investigate_exposure_logging** — Inspect `$experiment_started` event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs. +- **investigate_exposure_logging** — Compare exposure event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs. - **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split. 
### Investigation checklist -1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented? +1. Compare the actual per-variant exposure ratio to the configured target allocation — which variant is over/under-represented? 2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history. 3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math. -4. Verify SDK version and bucketing logic. Query `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly. -5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it. +4. Verify SDK version and bucketing logic. Query the exposure event grouped by variant to confirm exposure events are flowing correctly. +5. Check for bot/QA traffic — bots often skew toward control. If QA traffic isn't being excluded, recommend enabling that filter. 6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting. 7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.** @@ -55,7 +55,7 @@ Users were assigned to variants in proportions that disagree with the configured ## 2. Retro A/A (pre-experiment bias) failure -**Verdict to consume**: the analysis the platform attached when `settings.preExperimentBias` is enabled. +**What the platform tells you**: the pre-experiment-bias analysis the platform attaches when that check is enabled in the experiment's settings. ### What it means @@ -76,14 +76,14 @@ The same statistical comparison run on the **pre-exposure** period revealed that ## 3. 
Insufficient exposures -**Verdict to consume**: `live_exposures` per variant, plus any platform-attached "insufficient" flag. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue. +**What the platform tells you**: per-variant exposure counts plus an "insufficient" flag when the count is too low to trust. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue. ### Investigation checklist -1. Check `live_exposures` totals — which variant is undersampled? +1. Check per-variant exposure totals — which variant is undersampled? 2. Inspect feature-flag rollout — was rollout dialed back? 3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?). -4. If the experiment is still ACTIVE: extend duration via an experiment update with a new `endAfterDays`. +4. If the experiment is still ACTIVE: extend duration via an experiment update with a new end target. 5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math. If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question. @@ -92,7 +92,7 @@ If the user wants to talk about _why_ a primary metric is still inconclusive eve ## 4. Frequentist peeking -**Verdict to consume**: `settings.testingModel == "frequentist"`, plus `end_date` vs `start_date + endAfterDays` (or `sampleSize` vs `live_exposures.$overall`, depending on `settings.endCondition`). +**What to check**: the experiment's testing model and whether it ended before reaching its configured end condition (sample size or duration, whichever was configured). 
### What it means @@ -100,10 +100,10 @@ A frequentist test that ends before reaching its configured target has an **infl ### Investigation checklist -1. Confirm `settings.testingModel == "frequentist"`. -2. Compare `end_date` against `start_date + endAfterDays` (or whether `sampleSize` was reached, whichever is the configured `endCondition`). +1. Confirm the testing model is frequentist (sequential tests don't have this problem). +2. Compare the actual end date against the planned end (date- or sample-based, whichever the experiment was configured with). 3. If the conclusion was premature: results have inflated false-positive rate. Recommend a re-run. -4. If the user wants to keep current results: caveat strongly. Recommend `testingModel: "sequential"` for the next experiment so they can stop early without penalty. +4. If the user wants to keep current results: caveat strongly. Recommend a sequential testing model for the next experiment so they can stop early without penalty. (Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.) @@ -111,26 +111,26 @@ A frequentist test that ends before reaching its configured target has an **infl ## 5. Live computation timeout / broken data -**Verdict to consume**: `live_results_errors` non-null with `live_*` fields null. +**What the platform tells you**: a non-null error block on the live results, with the live data path empty. ### Investigation checklist 1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy. 2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. 3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision. -4. 
If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation. +4. If the cache is recent (within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or empty, the user must resolve the backend issue before any meaningful interpretation. --- ## 6. Experiment ran < 3 days -**Verdict to compute (this one is local)**: `end_date - start_date`. +**What to compute (this one is local)**: the elapsed time between the experiment's start and end. Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. **Refuse to interpret.** Tell the user explicitly: > _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_ -If `endCondition: "sample_size"` with a tiny target (e.g. 10k) was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window. +If the experiment was sample-size-bounded and a tiny target was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window. --- @@ -140,59 +140,45 @@ These don't always invalidate results, but they change how to _read_ them. Surfa ### Multiple-testing correction off with several primaries -**Condition**: `settings.multipleTestingCorrection` is `"off"` or `null` AND there are 2+ primary metrics across 1+ non-control variants. +**When**: multiple-testing correction is off AND there are 2+ primary metrics across 1+ non-control variants. -**Interpretation**: any single significant primary may be a false positive. 
Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). - -**Action**: look at all primary results in aggregate. If most point the same direction, there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk** — the user can enable Benjamini-Hochberg or Bonferroni and re-analyze. +Any single significant primary may be a false positive. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). Look at all primary results in aggregate: if most point the same direction, there is likely a real effect; if only one or two of many are significant, the result is **inconclusive due to false-positive risk** — recommend the user enable Benjamini-Hochberg or Bonferroni and re-analyze. ### Extreme winsorization percentile -**Condition**: `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` is very low (e.g. < ~80) or unusually high (e.g. > ~99). - -**Interpretation**: outlier capping is far from the configured platform default (typically 95 — verify in product). A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. +**When**: Winsorization is enabled with a percentile far from the platform's default (typically 95). -**Action**: ask the user to confirm the percentile was intentional; recommend resetting to the platform default unless they have a specific reason. +Outlier capping is far from the platform default. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. Ask the user to confirm the percentile was intentional; recommend resetting to the default unless they have a specific reason. ### SRM check disabled -**Condition**: `settings.srm == null` OR `settings.srm.enabled == false`. 
+**When**: the experiment's SRM check is off. -**Interpretation**: the SRM check didn't run. **Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug. - -**Action**: only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios). When you do flag, recommend re-enabling SRM and re-analyzing. +**Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios); when you do flag, recommend re-enabling SRM and re-analyzing. ### CUPED on new-users-only cohort -**Condition**: `settings.cuped.enabled == true` AND the experiment cohort is "new users only". - -**Interpretation**: CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen. +**When**: CUPED is enabled AND the experiment cohort is "new users only". -**Action**: mention as informational; no remediation needed for this experiment. For future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply. +CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen. Mention as informational; for future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply. ### Non-default confidence level -**Condition**: `settings.confidenceLevel != 0.95`. +**When**: the experiment is configured for a confidence level other than the platform default (typically 0.95). -**Interpretation**: `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. 
- -**Action**: call out explicitly in the verdict. Combine with metric count to estimate the family-wise error rate. +`0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out explicitly in the verdict and combine with metric count to estimate the family-wise error rate. ### Broken or placeholder metric entries -**Condition**: `metrics[]` contains entries with `name == ""`. - -**Interpretation**: likely a broken or placeholder metric reference. +**When**: the experiment includes metric entries with empty names. -**Action**: flag and skip during analysis. +Likely a broken or placeholder metric reference. Flag and skip during analysis. ### Primary metric with no computed result -**Condition**: a primary metric appears in `metrics[]` but is **missing from both** `live_metrics` and `results_cache.metrics`. - -**Interpretation**: no result was computed for that primary. **This is "no measurement," not "no effect."** +**When**: a metric is listed as primary on the experiment but has no result (neither live nor cached). -**Action**: surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary. +No result was computed for that primary. **This is "no measurement," not "no effect."** Surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary. 
--- diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md index 4d8189d..3a9e24c 100644 --- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/lifecycle-handoff.md @@ -36,4 +36,4 @@ A multi-variant test where only one treatment is significantly different from co ## After concluding -The decision record (`results_cache.message`, `results_cache.variant`, and `status` transitioning to `concluded` / `success` / `fail`) becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer. +The decision record — the rationale message, the shipped variant, and the experiment's terminal status — becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer. diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md index d8877fb..576ef9f 100644 --- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md @@ -30,7 +30,7 @@ Apply the polarity recipe from the spine — see the **Components** section of ` ## Reading the p-value in this platform -Mixpanel runs a frequentist comparison at the configured `settings.confidenceLevel` — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). +Mixpanel runs a frequentist comparison at the experiment's configured confidence level — typically 0.95 (verify in product if results look off). 
If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread. @@ -70,16 +70,16 @@ Pick the phrase that matches the four-question pattern. These are the words to u Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful: -1. Baseline from the control variant: `live_metrics[metricId][controlKey].value` (or the `summary.no` row where `variant == controlKey`). +1. Baseline from the control variant's metric value (the experiment-details response carries it on the per-variant row). 2. Lift from the winning row. -3. Absolute lift: `baseline_value × lift`. Examples: +3. Absolute lift: `baseline × lift`. Examples: - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate. - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`. 4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week." -### Fallback when `value` / `sampleSize` are null +### Fallback when the baseline value or sample size is missing -Common — happens whenever live computation timed out or `results_cache.metrics` was nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.** +Common — happens whenever live computation timed out or the cached results were nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.** Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. 
Match the metric's aggregation: @@ -117,8 +117,8 @@ Different metric types behave differently; cite the relevant nuance in your verd ## Variance-reduction & outlier settings that change interpretation -- **CUPED** (`settings.cuped.enabled == true`): mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). -- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). +- **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). +- **Winsorization enabled**: extreme values capped at the configured percentile, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A percentile much lower than the platform default (typically 95) is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). --- @@ -130,7 +130,7 @@ Different metric types behave differently; cite the relevant nuance in your verd | **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude. | | **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. 
**Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. | -If `settings.multipleTestingCorrection` is `"off"` AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled. +If multiple-testing correction is off AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled. --- @@ -153,7 +153,7 @@ For the full "why hasn't this hit statsig yet" walk-through, see [why-no-statsig ## Frequentist vs Sequential — what affects per-metric reading -Check `settings.testingModel`: +Check the experiment's testing model: - `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration. - `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended. 
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/session-replay-analysis.md index 59ad25e..7282bb4 100644 --- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/session-replay-analysis.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/session-replay-analysis.md @@ -83,7 +83,7 @@ If treatment users _arrive_ at a screen more often but _complete_ at a lower per Replay analysis is qualitative. Be honest about that. -- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in `live_metrics`."_ +- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in the experiment's results."_ - ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict. Tie observations back to specific quantitative results from the experiment-details response. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior. diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md index 7cc432a..dbda2af 100644 --- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md @@ -17,9 +17,9 @@ Before answering "why no statsig?", run the **trustworthiness gate**. If anythin Also check: -- `lift is None` on the primary → no measurement, not "no effect." 
-- The primary is in `metrics[]` but missing from `live_metrics` and `results_cache.metrics` → "no measurement." -- `live_results_errors` is non-null → results are stale or partial; resolve before drawing power conclusions. +- The primary's lift is missing or null → no measurement, not "no effect." +- The primary is listed on the experiment but has no computed result (live or cached) → "no measurement," not "no effect." +- The live results carry an error block → results are stale or partial; resolve the backend issue before drawing power conclusions. --- @@ -29,7 +29,7 @@ Walk through these in order. The first one that explains the picture is usually ### 1. Not enough sample yet (not enough exposures) -**What to look at**: `live_exposures` per variant vs `settings.sampleSize`; or `end_date - start_date` vs `start_date + settings.endAfterDays`; plus `settings.testingModel`. +**What to check**: per-variant exposure counts against the configured end target (sample size or duration, whichever the experiment was configured with), and which testing model the experiment is using. - **Sequential** + target not reached → genuinely too early. Recommend **WAIT**. - **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe. @@ -39,7 +39,7 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu ### 2. Observed effect is smaller than the MDE -**What to look at**: the lift on the primary in `live_metrics[primary][treatment].lift`, plus the MDE the user planned for (typically captured in the experiment's `description` or recovered via the setup-side skill's power math). +**What to check**: the lift on the primary metric, plus the MDE the user planned for (typically captured in the experiment's hypothesis/description, or recovered via the setup-side skill's power math). 
- Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1. - Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options: @@ -49,9 +49,9 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu ### 3. Variance is too high (metric is too noisy) -**What to look at**: distribution type of the metric, plus `settings.cuped.enabled` and `settings.winsorization.enabled`. +**What to check**: the metric's distribution type, plus whether CUPED and Winsorization are enabled. -- **Gaussian** metric (revenue, time-on-page) with no winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization (default percentile 95) on the next run. +- **Gaussian** metric (revenue, time-on-page) with no Winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization on the next run. - **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume. - **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the absolute scale of detectable effects. Lifts near 50% rates are easiest; lifts near 0%/100% need much more sample. - **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%. @@ -59,7 +59,7 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu ### 4. Traffic split is starving the variant -**What to look at**: `settings.srm.targetAllocations` and `live_exposures` per variant. +**What to check**: the configured traffic split against the actual per-variant exposure counts. 
- Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue. - Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later. @@ -69,11 +69,11 @@ Never change traffic allocation mid-Frequentist test — it invalidates the SRM ### 5. Exposure config is filtering more users than the user expects -**What to look at**: the exposure tracking method (`$experiment_started` event volume), any audience filters on the backing feature flag, and `settings.excludeQA`. +**What to check**: exposure event volume, any audience filters on the backing feature flag, and whether QA traffic is being excluded. -- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query `$experiment_started` to confirm how many users actually got exposed. +- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query the exposure event to confirm how many users actually got exposed. - The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event. -- `settings.excludeQA` was off and you suspect internal traffic is dominating one variant → enable it on the next run (results then are cleaner but also smaller). +- QA traffic isn't being excluded and you suspect internal traffic is dominating one variant → enable the QA exclusion on the next run (results then are cleaner but also smaller). **Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. 
they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md). @@ -93,7 +93,7 @@ Once you know which reason fits, the recommendation almost picks itself. | Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. | | Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. | -When recommending EXTEND on an active experiment, the action is an experiment update with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the `experiment-setup` skill for the power math. +When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `experiment-setup` skill for the power math. --- diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md index c205f29..18b15f7 100644 --- a/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md @@ -1,6 +1,6 @@ --- name: interpret-experiment -description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read, interpret, or make a ship/iterate/kill/wait call on an experiment, asks why an experiment hasn't reached statistical significance, asks what an SRM or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. 
Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill. +description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill. license: Apache-2.0 --- @@ -48,7 +48,7 @@ Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"): A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first (use `settings.controlKey`). -The platform auto-applies multiple-testing correction when `settings.multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` — **don't re-correct**. +The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**. ## Data-source fallback @@ -74,6 +74,8 @@ Top-down: what to do, in order. ## 1. Fetch the experiment +If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs. + Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments. Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret. 
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md index a0658e2..1edc9fa 100644 --- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md @@ -18,11 +18,11 @@ These two principles drive the recommendations below. Lead with them when explai ## 1. SRM (Sample Ratio Mismatch) -**Verdict to consume**: `live_srm_analysis` (or `exposures_cache.$srm_analysis`). The platform tags failing SRMs already; do not compute chi-square yourself. +**What the platform tells you**: the SRM verdict the experiment-details response carries (live, or cached when live isn't available). The platform tags failing SRMs already — consume the verdict, do not compute chi-square yourself. ### What it means -Users were assigned to variants in proportions that disagree with the configured `settings.srm.targetAllocations`. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness. +Users were assigned to variants in proportions that disagree with the configured target allocation. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness. ### Likely causes, ordered most → least likely @@ -31,23 +31,23 @@ Users were assigned to variants in proportions that disagree with the configured 1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees. 2. **biased_assignment** — The assignment criterion correlates with the variant — e.g. 
assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window. 3. **bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation. -4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the `$experiment_started` event fires exactly once per user per variant assignment. +4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the exposure event fires exactly once per user per variant assignment. 5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period. ### Recommended actions - **pause_and_investigate** — Pause the experiment before drawing any conclusions. SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable. - **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric. -- **investigate_exposure_logging** — Inspect `$experiment_started` event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs. +- **investigate_exposure_logging** — Compare exposure event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs. - **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split. 
### Investigation checklist -1. Compare `live_exposures` ratio to `settings.srm.targetAllocations` — which variant is over/under-represented? +1. Compare the actual per-variant exposure ratio to the configured target allocation — which variant is over/under-represented? 2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history. 3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math. -4. Verify SDK version and bucketing logic. Query `$experiment_started` events grouped by variant to confirm exposure events are flowing correctly. -5. Check for bot/QA traffic — bots often skew toward control. If `settings.excludeQA` is unset or false, recommend enabling it. +4. Verify SDK version and bucketing logic. Query the exposure event grouped by variant to confirm exposure events are flowing correctly. +5. Check for bot/QA traffic — bots often skew toward control. If QA traffic isn't being excluded, recommend enabling that filter. 6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting. 7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.** @@ -55,7 +55,7 @@ Users were assigned to variants in proportions that disagree with the configured ## 2. Retro A/A (pre-experiment bias) failure -**Verdict to consume**: the analysis the platform attached when `settings.preExperimentBias` is enabled. +**What the platform tells you**: the pre-experiment-bias analysis the platform attaches when that check is enabled in the experiment's settings. ### What it means @@ -76,14 +76,14 @@ The same statistical comparison run on the **pre-exposure** period revealed that ## 3. 
Insufficient exposures -**Verdict to consume**: `live_exposures` per variant, plus any platform-attached "insufficient" flag. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue. +**What the platform tells you**: per-variant exposure counts plus an "insufficient" flag when the count is too low to trust. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue. ### Investigation checklist -1. Check `live_exposures` totals — which variant is undersampled? +1. Check per-variant exposure totals — which variant is undersampled? 2. Inspect feature-flag rollout — was rollout dialed back? 3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?). -4. If the experiment is still ACTIVE: extend duration via an experiment update with a new `endAfterDays`. +4. If the experiment is still ACTIVE: extend duration via an experiment update with a new end target. 5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math. If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question. @@ -92,7 +92,7 @@ If the user wants to talk about _why_ a primary metric is still inconclusive eve ## 4. Frequentist peeking -**Verdict to consume**: `settings.testingModel == "frequentist"`, plus `end_date` vs `start_date + endAfterDays` (or `sampleSize` vs `live_exposures.$overall`, depending on `settings.endCondition`). +**What to check**: the experiment's testing model and whether it ended before reaching its configured end condition (sample size or duration, whichever was configured). 
### What it means @@ -100,10 +100,10 @@ A frequentist test that ends before reaching its configured target has an **infl ### Investigation checklist -1. Confirm `settings.testingModel == "frequentist"`. -2. Compare `end_date` against `start_date + endAfterDays` (or whether `sampleSize` was reached, whichever is the configured `endCondition`). +1. Confirm the testing model is frequentist (sequential tests don't have this problem). +2. Compare the actual end date against the planned end (date- or sample-based, whichever the experiment was configured with). 3. If the conclusion was premature: results have inflated false-positive rate. Recommend a re-run. -4. If the user wants to keep current results: caveat strongly. Recommend `testingModel: "sequential"` for the next experiment so they can stop early without penalty. +4. If the user wants to keep current results: caveat strongly. Recommend a sequential testing model for the next experiment so they can stop early without penalty. (Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.) @@ -111,26 +111,26 @@ A frequentist test that ends before reaching its configured target has an **infl ## 5. Live computation timeout / broken data -**Verdict to consume**: `live_results_errors` non-null with `live_*` fields null. +**What the platform tells you**: a non-null error block on the live results, with the live data path empty. ### Investigation checklist 1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy. 2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. 3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision. -4. 
If `results_cache` is recent (`$last_computed` within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or null, the user must resolve the backend issue before any meaningful interpretation. +4. If the cache is recent (within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or empty, the user must resolve the backend issue before any meaningful interpretation. --- ## 6. Experiment ran < 3 days -**Verdict to compute (this one is local)**: `end_date - start_date`. +**What to compute (this one is local)**: the elapsed time between the experiment's start and end. Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. **Refuse to interpret.** Tell the user explicitly: > _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_ -If `endCondition: "sample_size"` with a tiny target (e.g. 10k) was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window. +If the experiment was sample-size-bounded and a tiny target was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window. --- @@ -140,59 +140,45 @@ These don't always invalidate results, but they change how to _read_ them. Surfa ### Multiple-testing correction off with several primaries -**Condition**: `settings.multipleTestingCorrection` is `"off"` or `null` AND there are 2+ primary metrics across 1+ non-control variants. +**When**: multiple-testing correction is off AND there are 2+ primary metrics across 1+ non-control variants. -**Interpretation**: any single significant primary may be a false positive. 
Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). - -**Action**: look at all primary results in aggregate. If most point the same direction, there is likely a real effect. If only one or two of many are significant, the result is **inconclusive due to false-positive risk** — the user can enable Benjamini-Hochberg or Bonferroni and re-analyze. +Any single significant primary may be a false positive. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). Look at all primary results in aggregate: if most point the same direction, there is likely a real effect; if only one or two of many are significant, the result is **inconclusive due to false-positive risk** — recommend the user enable Benjamini-Hochberg or Bonferroni and re-analyze. ### Extreme winsorization percentile -**Condition**: `settings.winsorization.enabled == true` AND `settings.winsorization.percentile` is very low (e.g. < ~80) or unusually high (e.g. > ~99). - -**Interpretation**: outlier capping is far from the configured platform default (typically 95 — verify in product). A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. +**When**: Winsorization is enabled with a percentile far from the platform's default (typically 95). -**Action**: ask the user to confirm the percentile was intentional; recommend resetting to the platform default unless they have a specific reason. +Outlier capping is far from the platform default. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. Ask the user to confirm the percentile was intentional; recommend resetting to the default unless they have a specific reason. ### SRM check disabled -**Condition**: `settings.srm == null` OR `settings.srm.enabled == false`. 
+**When**: the experiment's SRM check is off. -**Interpretation**: the SRM check didn't run. **Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug. - -**Action**: only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios). When you do flag, recommend re-enabling SRM and re-analyzing. +**Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios); when you do flag, recommend re-enabling SRM and re-analyzing. ### CUPED on new-users-only cohort -**Condition**: `settings.cuped.enabled == true` AND the experiment cohort is "new users only". - -**Interpretation**: CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen. +**When**: CUPED is enabled AND the experiment cohort is "new users only". -**Action**: mention as informational; no remediation needed for this experiment. For future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply. +CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen. Mention as informational; for future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply. ### Non-default confidence level -**Condition**: `settings.confidenceLevel != 0.95`. +**When**: the experiment is configured for a confidence level other than the platform default (typically 0.95). -**Interpretation**: `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. 
- -**Action**: call out explicitly in the verdict. Combine with metric count to estimate the family-wise error rate. +`0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out explicitly in the verdict and combine with metric count to estimate the family-wise error rate. ### Broken or placeholder metric entries -**Condition**: `metrics[]` contains entries with `name == ""`. - -**Interpretation**: likely a broken or placeholder metric reference. +**When**: the experiment includes metric entries with empty names. -**Action**: flag and skip during analysis. +Likely a broken or placeholder metric reference. Flag and skip during analysis. ### Primary metric with no computed result -**Condition**: a primary metric appears in `metrics[]` but is **missing from both** `live_metrics` and `results_cache.metrics`. - -**Interpretation**: no result was computed for that primary. **This is "no measurement," not "no effect."** +**When**: a metric is listed as primary on the experiment but has no result (neither live nor cached). -**Action**: surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary. +No result was computed for that primary. **This is "no measurement," not "no effect."** Surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary. 
--- diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md index 4d8189d..3a9e24c 100644 --- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/lifecycle-handoff.md @@ -36,4 +36,4 @@ A multi-variant test where only one treatment is significantly different from co ## After concluding -The decision record (`results_cache.message`, `results_cache.variant`, and `status` transitioning to `concluded` / `success` / `fail`) becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer. +The decision record — the rationale message, the shipped variant, and the experiment's terminal status — becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer. diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md index d8877fb..576ef9f 100644 --- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md @@ -30,7 +30,7 @@ Apply the polarity recipe from the spine — see the **Components** section of ` ## Reading the p-value in this platform -Mixpanel runs a frequentist comparison at the configured `settings.confidenceLevel` — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). +Mixpanel runs a frequentist comparison at the experiment's configured confidence level — typically 0.95 (verify in product if results look off). 
If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread. @@ -70,16 +70,16 @@ Pick the phrase that matches the four-question pattern. These are the words to u Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful: -1. Baseline from the control variant: `live_metrics[metricId][controlKey].value` (or the `summary.no` row where `variant == controlKey`). +1. Baseline from the control variant's metric value (the experiment-details response carries it on the per-variant row). 2. Lift from the winning row. -3. Absolute lift: `baseline_value × lift`. Examples: +3. Absolute lift: `baseline × lift`. Examples: - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate. - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`. 4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week." -### Fallback when `value` / `sampleSize` are null +### Fallback when the baseline value or sample size is missing -Common — happens whenever live computation timed out or `results_cache.metrics` was nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.** +Common — happens whenever live computation timed out or the cached results were nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.** Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. 
Match the metric's aggregation: @@ -117,8 +117,8 @@ Different metric types behave differently; cite the relevant nuance in your verd ## Variance-reduction & outlier settings that change interpretation -- **CUPED** (`settings.cuped.enabled == true`): mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). -- **Winsorization** (`settings.winsorization.enabled == true`): extreme values capped at the configured percentiles, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much lower than the default 95 is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). +- **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). +- **Winsorization enabled**: extreme values capped at the configured percentile, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A percentile much lower than the platform default (typically 95) is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). --- @@ -130,7 +130,7 @@ Different metric types behave differently; cite the relevant nuance in your verd | **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude. | | **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. 
**Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. | -If `settings.multipleTestingCorrection` is `"off"` AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled. +If multiple-testing correction is off AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled. --- @@ -153,7 +153,7 @@ For the full "why hasn't this hit statsig yet" walk-through, see [why-no-statsig ## Frequentist vs Sequential — what affects per-metric reading -Check `settings.testingModel`: +Check the experiment's testing model: - `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration. - `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended. 
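The arithmetic quoted in the hunks above (the ~54% family-wise figure for 15 uncorrected primaries, and the `baseline × lift` examples) can be checked with a minimal sketch. The function names `family_wise_error_rate` and `absolute_lift` are hypothetical helpers introduced for illustration, not part of the skill or of any Mixpanel API, and the family-wise formula assumes the comparisons are independent, which real metrics on the same users usually are not.

```python
def family_wise_error_rate(primaries: int, treatment_variants: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across all primary comparisons.

    Assumption: comparisons are independent and no multiple-testing
    correction is applied, so the rate is 1 - (1 - alpha) ** comparisons.
    """
    comparisons = primaries * treatment_variants
    return 1.0 - (1.0 - alpha) ** comparisons


def absolute_lift(baseline: float, relative_lift: float) -> float:
    """Convert a relative lift into the metric's own units (baseline x lift)."""
    return baseline * relative_lift


# 15 primaries x 1 non-control variant at alpha = 0.05
print(f"{family_wise_error_rate(15, 1):.2f}")

# Conversion-rate example: 2% baseline with a 4% relative lift, shown in percentage points
print(f"{absolute_lift(0.02, 0.04) * 100:.2f} percentage points")

# Count-metric example: 12.4 events/user/week with a -5% relative lift
print(f"{absolute_lift(12.4, -0.05):.2f} events/user/week")
```

Under the independence assumption the first call reproduces the roughly 54% figure. With correlated metrics the true family-wise rate is typically lower, so treat the result as an upper-bound style estimate rather than a platform-computed value.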
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/session-replay-analysis.md index 59ad25e..7282bb4 100644 --- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/session-replay-analysis.md +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/session-replay-analysis.md @@ -83,7 +83,7 @@ If treatment users _arrive_ at a screen more often but _complete_ at a lower per Replay analysis is qualitative. Be honest about that. -- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in `live_metrics`."_ +- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in the experiment's results."_ - ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict. Tie observations back to specific quantitative results from the experiment-details response. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior. diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md index 7cc432a..dbda2af 100644 --- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md @@ -17,9 +17,9 @@ Before answering "why no statsig?", run the **trustworthiness gate**. If anythin Also check: -- `lift is None` on the primary → no measurement, not "no effect." 
-- The primary is in `metrics[]` but missing from `live_metrics` and `results_cache.metrics` → "no measurement." -- `live_results_errors` is non-null → results are stale or partial; resolve before drawing power conclusions. +- The primary's lift is missing or null → no measurement, not "no effect." +- The primary is listed on the experiment but has no computed result (live or cached) → "no measurement," not "no effect." +- The live results carry an error block → results are stale or partial; resolve the backend issue before drawing power conclusions. --- @@ -29,7 +29,7 @@ Walk through these in order. The first one that explains the picture is usually ### 1. Not enough sample yet (not enough exposures) -**What to look at**: `live_exposures` per variant vs `settings.sampleSize`; or `end_date - start_date` vs `start_date + settings.endAfterDays`; plus `settings.testingModel`. +**What to check**: per-variant exposure counts against the configured end target (sample size or duration, whichever the experiment was configured with), and which testing model the experiment is using. - **Sequential** + target not reached → genuinely too early. Recommend **WAIT**. - **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe. @@ -39,7 +39,7 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu ### 2. Observed effect is smaller than the MDE -**What to look at**: the lift on the primary in `live_metrics[primary][treatment].lift`, plus the MDE the user planned for (typically captured in the experiment's `description` or recovered via the setup-side skill's power math). +**What to check**: the lift on the primary metric, plus the MDE the user planned for (typically captured in the experiment's hypothesis/description, or recovered via the setup-side skill's power math). 
- Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1. - Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options: @@ -49,9 +49,9 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu ### 3. Variance is too high (metric is too noisy) -**What to look at**: distribution type of the metric, plus `settings.cuped.enabled` and `settings.winsorization.enabled`. +**What to check**: the metric's distribution type, plus whether CUPED and Winsorization are enabled. -- **Gaussian** metric (revenue, time-on-page) with no winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization (default percentile 95) on the next run. +- **Gaussian** metric (revenue, time-on-page) with no Winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization on the next run. - **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume. - **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the absolute scale of detectable effects. Lifts near 50% rates are easiest; lifts near 0%/100% need much more sample. - **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%. @@ -59,7 +59,7 @@ If exposures are falling short of plan because traffic dropped: surface that. Qu ### 4. Traffic split is starving the variant -**What to look at**: `settings.srm.targetAllocations` and `live_exposures` per variant. +**What to check**: the configured traffic split against the actual per-variant exposure counts. 
- Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue. - Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later. @@ -69,11 +69,11 @@ Never change traffic allocation mid-Frequentist test — it invalidates the SRM ### 5. Exposure config is filtering more users than the user expects -**What to look at**: the exposure tracking method (`$experiment_started` event volume), any audience filters on the backing feature flag, and `settings.excludeQA`. +**What to check**: exposure event volume, any audience filters on the backing feature flag, and whether QA traffic is being excluded. -- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query `$experiment_started` to confirm how many users actually got exposed. +- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query the exposure event to confirm how many users actually got exposed. - The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event. -- `settings.excludeQA` was off and you suspect internal traffic is dominating one variant → enable it on the next run (results then are cleaner but also smaller). +- QA traffic isn't being excluded and you suspect internal traffic is dominating one variant → enable the QA exclusion on the next run (results then are cleaner but also smaller). **Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. 
they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md). @@ -93,7 +93,7 @@ Once you know which reason fits, the recommendation almost picks itself. | Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. | | Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. | -When recommending EXTEND on an active experiment, the action is an experiment update with an increased `endAfterDays` (or `sampleSize`, depending on `endCondition`). Don't fabricate the target number — derive it from the platform's existing config, or send the user to the `experiment-setup` skill for the power math. +When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `experiment-setup` skill for the power math. --- From 55bc4ba2e95d1ddd8b16d747f994a29e3c0882f3 Mon Sep 17 00:00:00 2001 From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Date: Tue, 9 Jun 2026 18:58:58 +0000 Subject: [PATCH 08/11] interpret-experiment: phase-3 editorial cleanup MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Final pass from the hardcore /review-skill audit. - per-metric-interpretation.md: collapse "Significance = NO does NOT mean 'no effect'" (was a 14-line duplicate of why-no-statsig.md's options list) into 3 lines with a back-reference. Same for "Frequentist vs Sequential — what affects per-metric reading" (was 8 lines duplicating health-check-interpretation.md §4) → 4 lines with a back-reference. 
- health-check-interpretation.md §7 Misconfigurations: drop the When/Interpretation/Action triple-label scaffolding. Each of the 7 sub- sections is now a single bold "condition" sentence opening a single paragraph of consequence + action. Same information, ~25 lines lighter. Skill total: 988 → 950 lines (-38). health-check-interpretation.md: 206 → 178 (-14%). per-metric-interpretation.md: 181 → 169. Sync via make sync-skills FORCE=1; make check-skills-sync passes. Assisted by Claude --- .../references/health-check-interpretation.md | 28 +++++-------------- .../references/per-metric-interpretation.md | 22 ++++----------- .../references/health-check-interpretation.md | 28 +++++-------------- .../references/per-metric-interpretation.md | 22 ++++----------- .../references/health-check-interpretation.md | 28 +++++-------------- .../references/per-metric-interpretation.md | 22 ++++----------- 6 files changed, 36 insertions(+), 114 deletions(-) diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md index 1edc9fa..8875ca2 100644 --- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md @@ -140,45 +140,31 @@ These don't always invalidate results, but they change how to _read_ them. Surfa ### Multiple-testing correction off with several primaries -**When**: multiple-testing correction is off AND there are 2+ primary metrics across 1+ non-control variants. - -Any single significant primary may be a false positive. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). 
Look at all primary results in aggregate: if most point the same direction, there is likely a real effect; if only one or two of many are significant, the result is **inconclusive due to false-positive risk** — recommend the user enable Benjamini-Hochberg or Bonferroni and re-analyze. +**Correction off AND 2+ primaries × 1+ non-control variants.** Any single significant primary may be a false positive — family-wise error rate scales multiplicatively (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). Look at primaries in aggregate: if most point the same direction, the effect is likely real; if only one or two of many are significant, recommend enabling Benjamini-Hochberg or Bonferroni and re-analyzing. ### Extreme winsorization percentile -**When**: Winsorization is enabled with a percentile far from the platform's default (typically 95). - -Outlier capping is far from the platform default. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. Ask the user to confirm the percentile was intentional; recommend resetting to the default unless they have a specific reason. +**Winsorization enabled with a percentile far from the platform default (typically 95).** A percentile near 50 caps almost all data — almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason. ### SRM check disabled -**When**: the experiment's SRM check is off. - -**Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios); when you do flag, recommend re-enabling SRM and re-analyzing. +**SRM check is off.** Often deliberate — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself or treat the absence as a bug. 
Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) and then recommend re-enabling SRM and re-analyzing. ### CUPED on new-users-only cohort -**When**: CUPED is enabled AND the experiment cohort is "new users only". - -CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen. Mention as informational; for future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply. +**CUPED enabled AND the cohort is "new users only".** CUPED needs pre-exposure data, so it had no effect here — but **results are still valid**, variance reduction just didn't happen. Mention as informational. For future experiments on this surface, suggest extending the cohort to include returning users so CUPED can apply. ### Non-default confidence level -**When**: the experiment is configured for a confidence level other than the platform default (typically 0.95). - -`0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out explicitly in the verdict and combine with metric count to estimate the family-wise error rate. +**Confidence level differs from the platform default (typically 0.95).** `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out in the verdict and combine with metric count to estimate the family-wise error rate. ### Broken or placeholder metric entries -**When**: the experiment includes metric entries with empty names. - -Likely a broken or placeholder metric reference. Flag and skip during analysis. +**Metric entries with empty names.** Likely broken or placeholder references. Flag and skip during analysis. ### Primary metric with no computed result -**When**: a metric is listed as primary on the experiment but has no result (neither live nor cached). - -No result was computed for that primary. 
**This is "no measurement," not "no effect."** Surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary. +**A metric is listed as primary but has no result (live or cached).** This is **"no measurement," not "no effect."** Surface prominently; recommend re-syncing results before any conclusion that depends on this primary. --- diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md index 576ef9f..7907e90 100644 --- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md @@ -134,31 +134,19 @@ If multiple-testing correction is off AND there are 2+ primaries × 1+ non-contr --- -## "Significance = NO" does NOT mean "no effect" +## When a primary metric is inconclusive -A row in `summary.no` means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level. **Important when the user is about to call something a null result.** +A "not significant" verdict means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level — **not that there is no effect.** Important when the user is about to call something a null result. -Options to suggest when a primary metric lands in `summary.no`: - -1. **Extend duration** (if the experiment is still ACTIVE). -2. **Increase traffic allocation** (if there's headroom — never mid-Frequentist-test, which invalidates SRM). -3. **Use Sequential testing model** for the next experiment if continuous monitoring fits. -4. **Enable CUPED** if the metric correlates with pre-exposure behavior. -5. **Narrow the hypothesis** — test a stronger version, or scope to a more responsive segment. -6. 
**Accept the null** — if the experiment was well-powered for the MDE that matters, "no effect" is a real finding. - -For the full "why hasn't this hit statsig yet" walk-through, see [why-no-statsig.md](why-no-statsig.md). +For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see [why-no-statsig.md](why-no-statsig.md). --- ## Frequentist vs Sequential — what affects per-metric reading -Check the experiment's testing model: - -- `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration. -- `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended. +Concluding a Frequentist experiment before it reaches its configured target is a peeking event — per-metric significance verdicts become unreliable. Sequential experiments are designed for continuous monitoring and don't have this problem. -Concluding a Frequentist experiment before it reaches its target is a peeking event. Flag it in the verdict. +For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of [health-check-interpretation.md](health-check-interpretation.md). --- diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md index 1edc9fa..8875ca2 100644 --- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md @@ -140,45 +140,31 @@ These don't always invalidate results, but they change how to _read_ them. 
Surfa ### Multiple-testing correction off with several primaries -**When**: multiple-testing correction is off AND there are 2+ primary metrics across 1+ non-control variants. - -Any single significant primary may be a false positive. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). Look at all primary results in aggregate: if most point the same direction, there is likely a real effect; if only one or two of many are significant, the result is **inconclusive due to false-positive risk** — recommend the user enable Benjamini-Hochberg or Bonferroni and re-analyze. +**Correction off AND 2+ primaries × 1+ non-control variants.** Any single significant primary may be a false positive — family-wise error rate scales multiplicatively (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). Look at primaries in aggregate: if most point the same direction, the effect is likely real; if only one or two of many are significant, recommend enabling Benjamini-Hochberg or Bonferroni and re-analyzing. ### Extreme winsorization percentile -**When**: Winsorization is enabled with a percentile far from the platform's default (typically 95). - -Outlier capping is far from the platform default. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. Ask the user to confirm the percentile was intentional; recommend resetting to the default unless they have a specific reason. +**Winsorization enabled with a percentile far from the platform default (typically 95).** A percentile near 50 caps almost all data — almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason. ### SRM check disabled -**When**: the experiment's SRM check is off. - -**Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. 
Do not compute SRM yourself, and do not treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios); when you do flag, recommend re-enabling SRM and re-analyzing. +**SRM check is off.** Often deliberate — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) and then recommend re-enabling SRM and re-analyzing. ### CUPED on new-users-only cohort -**When**: CUPED is enabled AND the experiment cohort is "new users only". - -CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen. Mention as informational; for future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply. +**CUPED enabled AND the cohort is "new users only".** CUPED needs pre-exposure data, so it had no effect here — but **results are still valid**, variance reduction just didn't happen. Mention as informational. For future experiments on this surface, suggest extending the cohort to include returning users so CUPED can apply. ### Non-default confidence level -**When**: the experiment is configured for a confidence level other than the platform default (typically 0.95). - -`0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out explicitly in the verdict and combine with metric count to estimate the family-wise error rate. +**Confidence level differs from the platform default (typically 0.95).** `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out in the verdict and combine with metric count to estimate the family-wise error rate. 
### Broken or placeholder metric entries -**When**: the experiment includes metric entries with empty names. - -Likely a broken or placeholder metric reference. Flag and skip during analysis. +**Metric entries with empty names.** Likely broken or placeholder references. Flag and skip during analysis. ### Primary metric with no computed result -**When**: a metric is listed as primary on the experiment but has no result (neither live nor cached). - -No result was computed for that primary. **This is "no measurement," not "no effect."** Surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary. +**A metric is listed as primary but has no result (live or cached).** This is **"no measurement," not "no effect."** Surface prominently; recommend re-syncing results before any conclusion that depends on this primary. --- diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md index 576ef9f..7907e90 100644 --- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md @@ -134,31 +134,19 @@ If multiple-testing correction is off AND there are 2+ primaries × 1+ non-contr --- -## "Significance = NO" does NOT mean "no effect" +## When a primary metric is inconclusive -A row in `summary.no` means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level. **Important when the user is about to call something a null result.** +A "not significant" verdict means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level — **not that there is no effect.** Important when the user is about to call something a null result. 
-Options to suggest when a primary metric lands in `summary.no`: - -1. **Extend duration** (if the experiment is still ACTIVE). -2. **Increase traffic allocation** (if there's headroom — never mid-Frequentist-test, which invalidates SRM). -3. **Use Sequential testing model** for the next experiment if continuous monitoring fits. -4. **Enable CUPED** if the metric correlates with pre-exposure behavior. -5. **Narrow the hypothesis** — test a stronger version, or scope to a more responsive segment. -6. **Accept the null** — if the experiment was well-powered for the MDE that matters, "no effect" is a real finding. - -For the full "why hasn't this hit statsig yet" walk-through, see [why-no-statsig.md](why-no-statsig.md). +For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see [why-no-statsig.md](why-no-statsig.md). --- ## Frequentist vs Sequential — what affects per-metric reading -Check the experiment's testing model: - -- `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration. -- `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended. +Concluding a Frequentist experiment before it reaches its configured target is a peeking event — per-metric significance verdicts become unreliable. Sequential experiments are designed for continuous monitoring and don't have this problem. -Concluding a Frequentist experiment before it reaches its target is a peeking event. Flag it in the verdict. +For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of [health-check-interpretation.md](health-check-interpretation.md). 
--- diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md index 1edc9fa..8875ca2 100644 --- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md @@ -140,45 +140,31 @@ These don't always invalidate results, but they change how to _read_ them. Surfa ### Multiple-testing correction off with several primaries -**When**: multiple-testing correction is off AND there are 2+ primary metrics across 1+ non-control variants. - -Any single significant primary may be a false positive. Family-wise error rate scales multiplicatively with metric count (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). Look at all primary results in aggregate: if most point the same direction, there is likely a real effect; if only one or two of many are significant, the result is **inconclusive due to false-positive risk** — recommend the user enable Benjamini-Hochberg or Bonferroni and re-analyze. +**Correction off AND 2+ primaries × 1+ non-control variants.** Any single significant primary may be a false positive — family-wise error rate scales multiplicatively (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). Look at primaries in aggregate: if most point the same direction, the effect is likely real; if only one or two of many are significant, recommend enabling Benjamini-Hochberg or Bonferroni and re-analyzing. ### Extreme winsorization percentile -**When**: Winsorization is enabled with a percentile far from the platform's default (typically 95). - -Outlier capping is far from the platform default. A percentile near 50 caps almost all data and almost certainly indicates a misconfiguration. 
Ask the user to confirm the percentile was intentional; recommend resetting to the default unless they have a specific reason. +**Winsorization enabled with a percentile far from the platform default (typically 95).** A percentile near 50 caps almost all data — almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason. ### SRM check disabled -**When**: the experiment's SRM check is off. - -**Often deliberate** — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself, and do not treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios); when you do flag, recommend re-enabling SRM and re-analyzing. +**SRM check is off.** Often deliberate — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) and then recommend re-enabling SRM and re-analyzing. ### CUPED on new-users-only cohort -**When**: CUPED is enabled AND the experiment cohort is "new users only". - -CUPED requires pre-exposure data, which new-user experiments lack — so CUPED simply had no effect. **This does NOT invalidate results.** Variance reduction just didn't happen. Mention as informational; for future experiments on the same surface, consider extending the cohort to include returning users so CUPED can apply. +**CUPED enabled AND the cohort is "new users only".** CUPED needs pre-exposure data, so it had no effect here — but **results are still valid**, variance reduction just didn't happen. Mention as informational. For future experiments on this surface, suggest extending the cohort to include returning users so CUPED can apply. 
### Non-default confidence level -**When**: the experiment is configured for a confidence level other than the platform default (typically 0.95). - -`0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out explicitly in the verdict and combine with metric count to estimate the family-wise error rate. +**Confidence level differs from the platform default (typically 0.95).** `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out in the verdict and combine with metric count to estimate the family-wise error rate. ### Broken or placeholder metric entries -**When**: the experiment includes metric entries with empty names. - -Likely a broken or placeholder metric reference. Flag and skip during analysis. +**Metric entries with empty names.** Likely broken or placeholder references. Flag and skip during analysis. ### Primary metric with no computed result -**When**: a metric is listed as primary on the experiment but has no result (neither live nor cached). - -No result was computed for that primary. **This is "no measurement," not "no effect."** Surface prominently; recommend the user re-sync results before drawing any conclusion that depends on this primary. +**A metric is listed as primary but has no result (live or cached).** This is **"no measurement," not "no effect."** Surface prominently; recommend re-syncing results before any conclusion that depends on this primary. 
--- diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md index 576ef9f..7907e90 100644 --- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md @@ -134,31 +134,19 @@ If multiple-testing correction is off AND there are 2+ primaries × 1+ non-contr --- -## "Significance = NO" does NOT mean "no effect" +## When a primary metric is inconclusive -A row in `summary.no` means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level. **Important when the user is about to call something a null result.** +A "not significant" verdict means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level — **not that there is no effect.** Important when the user is about to call something a null result. -Options to suggest when a primary metric lands in `summary.no`: - -1. **Extend duration** (if the experiment is still ACTIVE). -2. **Increase traffic allocation** (if there's headroom — never mid-Frequentist-test, which invalidates SRM). -3. **Use Sequential testing model** for the next experiment if continuous monitoring fits. -4. **Enable CUPED** if the metric correlates with pre-exposure behavior. -5. **Narrow the hypothesis** — test a stronger version, or scope to a more responsive segment. -6. **Accept the null** — if the experiment was well-powered for the MDE that matters, "no effect" is a real finding. - -For the full "why hasn't this hit statsig yet" walk-through, see [why-no-statsig.md](why-no-statsig.md). +For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see [why-no-statsig.md](why-no-statsig.md). 
--- ## Frequentist vs Sequential — what affects per-metric reading -Check the experiment's testing model: - -- `"frequentist"` — pre-defined sample size or duration. **Peeking inflates the false-positive rate.** If the user concluded before reaching the configured target, every per-metric significance verdict is suspect. Note: frequentist + `endCondition: "days"` is supported intentionally — do not flag the combination itself as a misconfiguration. -- `"sequential"` — designed for continuous monitoring. Stopping early when significance is reached is safe and intended. +Concluding a Frequentist experiment before it reaches its configured target is a peeking event — per-metric significance verdicts become unreliable. Sequential experiments are designed for continuous monitoring and don't have this problem. -Concluding a Frequentist experiment before it reaches its target is a peeking event. Flag it in the verdict. +For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of [health-check-interpretation.md](health-check-interpretation.md). --- From 5de808df788bc90352d0c066d636efcc1654b1c9 Mon Sep 17 00:00:00 2001 From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Date: Tue, 9 Jun 2026 19:35:53 +0000 Subject: [PATCH 09/11] interpret-experiment: phase-4 polish MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Final micro-pass from the third hardcore /review-skill audit. Four surgical edits; ~6 lines of net change. - SKILL.md polarity recipe: drop the lingering `settings.controlKey` reference ("use settings.controlKey" → "the platform marks which variant is control"). Same fix in per-metric-interpretation.md's tier table for the surviving `multipleTestingCorrection` reference. - why-no-statsig.md output shape: drop the "which fields told you" phrasing, which undid phase-2 right at the moment of summary. The example numbers stay; the field-citation framing goes. 
- SKILL.md step 1: add one sentence to the disambiguation guard naming the identifier-matching convention (ID first, then case-insensitive name). - health-check-interpretation.md and per-metric-interpretation.md: drop the duplicate "Never recompute thresholds" preamble paragraph — the rule lives in SKILL.md and is loaded with the spine. The references no longer need to restate it. `grep -rE 'live_|results_cache|exposures_cache|settings\.<…>|multipleTestingCorrection' plugins/mixpanel-mcp/skills/interpret-experiment/` returns zero hits. Sync via make sync-skills FORCE=1; make check-skills-sync passes. Assisted by Claude --- plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md | 4 ++-- .../references/health-check-interpretation.md | 2 -- .../references/per-metric-interpretation.md | 4 +--- .../skills/interpret-experiment/references/why-no-statsig.md | 2 +- plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md | 4 ++-- .../references/health-check-interpretation.md | 2 -- .../references/per-metric-interpretation.md | 4 +--- .../skills/interpret-experiment/references/why-no-statsig.md | 2 +- plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md | 4 ++-- .../references/health-check-interpretation.md | 2 -- .../references/per-metric-interpretation.md | 4 +--- .../skills/interpret-experiment/references/why-no-statsig.md | 2 +- 12 files changed, 12 insertions(+), 24 deletions(-) diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md index 18b15f7..396114c 100644 --- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md @@ -46,7 +46,7 @@ Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"): - `direction == "up"` → **positive** if `lift > 0`, else **negative**. - `direction == "down"` → **positive** if `lift < 0`, else **negative**. 
-A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first (use `settings.controlKey`). +A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control. The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**. @@ -74,7 +74,7 @@ Top-down: what to do, in order. ## 1. Fetch the experiment -If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs. +If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs. Accept the experiment by name or by ID; try ID match first, then case-insensitive name match. Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments. diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md index 8875ca2..1467468 100644 --- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/health-check-interpretation.md @@ -2,8 +2,6 @@ Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action. -**Never recompute thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers. 
- --- ## Kohavi framing — always cite when a health check fails diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md index 7907e90..e46381c 100644 --- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/per-metric-interpretation.md @@ -2,8 +2,6 @@ Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_ -**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from the experiment-details response. Then translate. - --- ## The mental model @@ -126,7 +124,7 @@ Different metric types behave differently; cite the relevant nuance in your verd | Tier | How it influences the verdict | | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| **Primary** | **Decisional.** The platform auto-applies correction when `multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` (across primaries × variants). | +| **Primary** | **Decisional.** The platform auto-applies correction when the experiment is configured for Bonferroni or Benjamini-Hochberg (across primaries × variants). | | **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude. | | **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. 
| diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md index dbda2af..6b3d932 100644 --- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md +++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md @@ -110,6 +110,6 @@ When recommending EXTEND on an active experiment, the action is to update the ex ## Output shape 1. **The reason** (one of the five above), in one sentence. -2. **The evidence from the experiment-details response** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.). +2. **The evidence** — concrete numbers from the experiment (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%"). 3. **Recommendation** from the table above, with the specific experiment update or follow-up action. 4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment. diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md index 18b15f7..396114c 100644 --- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md @@ -46,7 +46,7 @@ Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"): - `direction == "up"` → **positive** if `lift > 0`, else **negative**. - `direction == "down"` → **positive** if `lift < 0`, else **negative**. -A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first (use `settings.controlKey`). +A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control. 
The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**. @@ -74,7 +74,7 @@ Top-down: what to do, in order. ## 1. Fetch the experiment -If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs. +If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs. Accept the experiment by name or by ID; try ID match first, then case-insensitive name match. Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments. diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md index 8875ca2..1467468 100644 --- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/health-check-interpretation.md @@ -2,8 +2,6 @@ Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action. -**Never recompute thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers. 
-
 ---
 
 ## Kohavi framing — always cite when a health check fails
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md
index 7907e90..e46381c 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/per-metric-interpretation.md
@@ -2,8 +2,6 @@
 
 Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_
 
-**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from the experiment-details response. Then translate.
-
 ---
 
 ## The mental model
@@ -126,7 +124,7 @@ Different metric types behave differently; cite the relevant nuance in your verd
 
 | Tier | How it influences the verdict |
 | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| **Primary** | **Decisional.** The platform auto-applies correction when `multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` (across primaries × variants). |
+| **Primary** | **Decisional.** The platform auto-applies correction when the experiment is configured for Bonferroni or Benjamini-Hochberg (across primaries × variants). |
 | **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude. |
 | **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. |
 
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
index dbda2af..6b3d932 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
@@ -110,6 +110,6 @@ When recommending EXTEND on an active experiment, the action is to update the ex
 ## Output shape
 
 1. **The reason** (one of the five above), in one sentence.
-2. **The evidence from the experiment-details response** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.).
+2. **The evidence** — concrete numbers from the experiment (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%").
 3. **Recommendation** from the table above, with the specific experiment update or follow-up action.
 4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment.
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
index 18b15f7..396114c 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
@@ -46,7 +46,7 @@ Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"):
 - `direction == "up"` → **positive** if `lift > 0`, else **negative**.
 - `direction == "down"` → **positive** if `lift < 0`, else **negative**.
 
-A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first (use `settings.controlKey`).
+A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control.
 
 The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**.
 
@@ -74,7 +74,7 @@ Top-down: what to do, in order.
 
 ## 1. Fetch the experiment
 
-If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs.
+If the user hasn't named a specific experiment, ask which one before fetching. Don't guess from context — interpreting the wrong experiment burns more time than the clarifying question costs. Accept the experiment by name or by ID; try ID match first, then case-insensitive name match.
 
 Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments.
 
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md
index 8875ca2..1467468 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/health-check-interpretation.md
@@ -2,8 +2,6 @@
 
 Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action.
 
-**Never recompute thresholds.** Read the verdict fields described below; if a field is absent, say so — do not synthesize a verdict from raw numbers.
-
 ---
 
 ## Kohavi framing — always cite when a health check fails
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md
index 7907e90..e46381c 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/per-metric-interpretation.md
@@ -2,8 +2,6 @@
 
 Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_
 
-**Consume, don't recompute.** Read `lift`, `liftConfidence`, `value`, `sampleSize`, and the bucket-derived `significance` ("YES_POSITIVE" / "YES_NEGATIVE" / "NO") from the experiment-details response. Then translate.
-
 ---
 
 ## The mental model
@@ -126,7 +124,7 @@ Different metric types behave differently; cite the relevant nuance in your verd
 
 | Tier | How it influences the verdict |
 | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| **Primary** | **Decisional.** The platform auto-applies correction when `multipleTestingCorrection` is `"bonferroni"` or `"benjamini-hochberg"` (across primaries × variants). |
+| **Primary** | **Decisional.** The platform auto-applies correction when the experiment is configured for Bonferroni or Benjamini-Hochberg (across primaries × variants). |
 | **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude. |
 | **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. |
 
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
index dbda2af..6b3d932 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
@@ -110,6 +110,6 @@ When recommending EXTEND on an active experiment, the action is to update the ex
 ## Output shape
 
 1. **The reason** (one of the five above), in one sentence.
-2. **The evidence from the experiment-details response** — which fields told you (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%," etc.).
+2. **The evidence** — concrete numbers from the experiment (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%").
 3. **Recommendation** from the table above, with the specific experiment update or follow-up action.
 4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment.

From 6ad6fe72921f4e7001cc707528faf5f4b67b1614 Mon Sep 17 00:00:00 2001
From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com>
Date: Tue, 9 Jun 2026 20:49:15 +0000
Subject: [PATCH 10/11] =?UTF-8?q?interpret-experiment:=20rename=20cross-re?=
 =?UTF-8?q?fs=20experiment-setup=20=E2=86=92=20design-experiment?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The setup-side skill was renamed from `experiment-setup` to `design-experiment` on its own PR (mixpanel/ai-plugins#24) to follow the verb-noun convention. Update this skill's cross-references to match.

Sites updated:
- SKILL.md description's negative-trigger sentence
- references/why-no-statsig.md (two mentions)

Sync via make sync-skills FORCE=1; make check-skills-sync passes.

Assisted by Claude
---
 plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md | 2 +-
 .../skills/interpret-experiment/references/why-no-statsig.md | 4 ++--
 plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md | 2 +-
 .../skills/interpret-experiment/references/why-no-statsig.md | 4 ++--
 plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md | 2 +-
 .../skills/interpret-experiment/references/why-no-statsig.md | 4 ++--
 6 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
index 396114c..c370fc0 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: interpret-experiment
-description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `design-experiment` skill.
 license: Apache-2.0
 ---
 
diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md
index 6b3d932..37ec069 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/why-no-statsig.md
@@ -2,7 +2,7 @@
 
 Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts.
 
-The actual stop / extend math (sample size, power, MDE) is owned by the `experiment-setup` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one.
+The actual stop / extend math (sample size, power, MDE) is owned by the `design-experiment` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one.
 
 ---
 
@@ -93,7 +93,7 @@ Once you know which reason fits, the recommendation almost picks itself.
 | Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. |
 | Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. |
 
-When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `experiment-setup` skill for the power math.
+When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `design-experiment` skill for the power math.
 
 ---
 
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md
index 396114c..c370fc0 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: interpret-experiment
-description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `design-experiment` skill.
 license: Apache-2.0
 ---
 
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
index 6b3d932..37ec069 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/why-no-statsig.md
@@ -2,7 +2,7 @@
 
 Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts.
 
-The actual stop / extend math (sample size, power, MDE) is owned by the `experiment-setup` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one.
+The actual stop / extend math (sample size, power, MDE) is owned by the `design-experiment` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one.
 
 ---
 
@@ -93,7 +93,7 @@ Once you know which reason fits, the recommendation almost picks itself.
 | Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. |
 | Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. |
 
-When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `experiment-setup` skill for the power math.
+When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `design-experiment` skill for the power math.
 
 ---
 
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
index 396114c..c370fc0 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: interpret-experiment
-description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `experiment-setup` skill.
+description: Interprets a Mixpanel experiment's results and health checks. Use when the user asks to read results, decide whether to ship / iterate / kill / keep waiting, asks why an experiment isn't showing a clear winner yet, asks what a Sample Ratio Mismatch (SRM) or pre-experiment-bias verdict means, or wants to break results down by segment. Consumes the already-computed verdicts the platform returns — never recomputes thresholds. Do NOT use for experiment setup questions (sizing, metric selection, hypothesis framing, advanced-feature config) — those belong to the `design-experiment` skill.
 license: Apache-2.0
 ---
 
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
index 6b3d932..37ec069 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/why-no-statsig.md
@@ -2,7 +2,7 @@
 
 Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts.
 
-The actual stop / extend math (sample size, power, MDE) is owned by the `experiment-setup` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one.
+The actual stop / extend math (sample size, power, MDE) is owned by the `design-experiment` skill — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one.
 
 ---
 
@@ -93,7 +93,7 @@ Once you know which reason fits, the recommendation almost picks itself.
 | Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. |
 | Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. |
 
-When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `experiment-setup` skill for the power math.
+When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or send the user to the `design-experiment` skill for the power math.
 
 ---
 

From 67dcb35d2d804b80904fa36d393b30cf15429e7a Mon Sep 17 00:00:00 2001
From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com>
Date: Tue, 9 Jun 2026 21:53:00 +0000
Subject: [PATCH 11/11] Move platform-support disclaimer below the content in segment-breakdown-interpretation

The disclaimer about per-segment platform support was the second paragraph, separating the file's purpose from its content with five lines of caveats. Moved to a "Platform support status" section at the end of the file so the reader hits the mental model immediately.

Synced to mixpanel-mcp-eu and mixpanel-mcp-in.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../references/segment-breakdown-interpretation.md | 8 ++++++--
 .../references/segment-breakdown-interpretation.md | 8 ++++++--
 .../references/segment-breakdown-interpretation.md | 8 ++++++--
 3 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md
index f5623e1..98c7bbc 100644
--- a/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md
+++ b/plugins/mixpanel-mcp-eu/skills/interpret-experiment/references/segment-breakdown-interpretation.md
@@ -2,8 +2,6 @@
 
 Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
 
-> **Platform support status.** Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.
-
 ---
 
 ## The mental model
@@ -93,3 +91,9 @@ This is the everyday case of mixed effects.
 2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered).
 3. **What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's."
 4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating).
+
+---
+
+## Platform support status
+
+Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the rules above. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.
diff --git a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md
index f5623e1..98c7bbc 100644
--- a/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md
+++ b/plugins/mixpanel-mcp-in/skills/interpret-experiment/references/segment-breakdown-interpretation.md
@@ -2,8 +2,6 @@
 
 Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
 
-> **Platform support status.** Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.
-
 ---
 
 ## The mental model
@@ -93,3 +91,9 @@ This is the everyday case of mixed effects.
 2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered).
 3. **What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's."
 4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating).
+
+---
+
+## Platform support status
+
+Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the rules above. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.
diff --git a/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md
index f5623e1..98c7bbc 100644
--- a/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md
+++ b/plugins/mixpanel-mcp/skills/interpret-experiment/references/segment-breakdown-interpretation.md
@@ -2,8 +2,6 @@
 
 Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place.
 
-> **Platform support status.** Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the same rules below. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.
-
 ---
 
 ## The mental model
@@ -93,3 +91,9 @@ This is the everyday case of mixed effects.
 2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered).
 3. **What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's."
 4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating).
+
+---
+
+## Platform support status
+
+Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the rules above. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts.