From 80db3c1f0ccbc7271ed3467b8dcda423268934a5 Mon Sep 17 00:00:00 2001 From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Date: Tue, 9 Jun 2026 23:53:30 +0000 Subject: [PATCH 01/11] New skill: manage-experiment (umbrella over design + interpret) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Combines the soon-to-land `design-experiment` (PR #24) and `interpret-experiment` (PR #23) skills into a single `manage-experiment` umbrella, modeled on `manage-lexicon`. One front door for any experiment-related phrase; subcommands for the two phases. ## Why Today an agent has to disambiguate between two skills on every experiment turn, and borderline phrases ("audit my experiment", "check on experiment X") route inconsistently. Folding both into one umbrella with `design` and `interpret` subcommands: - Gives a single trigger surface for the agent to route to. - Lets ambiguous phrases use phase-derived disambiguation (DRAFT → design; ACTIVE/CONCLUDED → interpret). - Removes the cross-skill reference coupling that currently requires coordinated edits across two PRs whenever either skill renames. - Reserves room to add `launch` and `monitor` commands later without another rename. ## Shape ``` manage-experiment/ ├── SKILL.md (umbrella: shared glossary, command menu, project + experiment resolution) ├── commands/ │ ├── design.md (former design-experiment SKILL.md body) │ └── interpret.md (former interpret-experiment SKILL.md body) └── references/ (15 references, flat; no name collisions) ``` The two command files preserve their original prose verbatim except for: - Removing the now-redundant frontmatter and opening "You are helping…" paragraph. - Lifting umbrella-able glossary terms (Variant, Primary/Guardrail/ Secondary, Direction, Lift, MDE, CUPED, Winsorization, Multiple- testing correction) up to the umbrella; keeping phase-specific terms (Hypothesis, Power, Underpowered, Sequential vs Frequentist for design; Polarity, Significance, SRM, Retro A/A, Twyman's Law, Trustworthiness gate for interpret) in their command files. - Adjusting reference paths to `../references/`. The 15 reference files are unchanged except for two intra-skill links in `why-no-statsig.md` that previously pointed at the `design-experiment` skill — now point directly at `sizing.md` (where the formulas live). The 3 references to `manage-feature-flags` (in SKILL.md, design.md, and routing-xp-vs-ff.md) are correct cross-skill pointers and unchanged. ## Sequencing This PR is opened **before** PR #23 and PR #24 merge. Once those land, this PR will need a rebase against master that deletes the now-existing `design-experiment/` and `interpret-experiment/` directories. Per the plan, doing the umbrella as a separate PR keeps the review focused on the routing pattern rather than tangled into either skill's prose review. The `manage-feature-flags` skill (PR #25) cross-references both old skill names; once both PRs land, those cross-refs will need a sweep to point at `manage-experiment`. README index updated: one `manage-experiment` row replaces the two prospective `design-experiment` and `interpret-experiment` rows. Synced to mixpanel-mcp-eu and mixpanel-mcp-in. Co-Authored-By: Claude Opus 4.7 (1M context) --- README.md | 21 ++- .../skills/manage-experiment/SKILL.md | 118 ++++++++++++ .../manage-experiment/commands/design.md | 161 ++++++++++++++++ .../manage-experiment/commands/interpret.md | 112 +++++++++++ .../references/advanced-features.md | 103 ++++++++++ .../references/health-check-interpretation.md | 176 ++++++++++++++++++ .../references/hypothesis-framing.md | 101 ++++++++++ .../references/lifecycle-handoff.md | 39 ++++ .../references/metric-selection.md | 75 ++++++++ .../references/per-metric-interpretation.md | 167 +++++++++++++++++ .../manage-experiment/references/pitfalls.md | 93 +++++++++ .../references/prior-experiments.md | 81 ++++++++ .../references/routing-xp-vs-ff.md | 85 +++++++++ .../segment-breakdown-interpretation.md | 99 ++++++++++ .../segment-of-interest-selection.md | 116 ++++++++++++ .../references/session-replay-analysis.md | 109 +++++++++++ .../manage-experiment/references/sizing.md | 109 +++++++++++ .../references/statistical-model.md | 102 ++++++++++ .../references/why-no-statsig.md | 115 ++++++++++++ .../skills/manage-experiment/SKILL.md | 118 ++++++++++++ .../manage-experiment/commands/design.md | 161 ++++++++++++++++ .../manage-experiment/commands/interpret.md | 112 +++++++++++ .../references/advanced-features.md | 103 ++++++++++ .../references/health-check-interpretation.md | 176 ++++++++++++++++++ .../references/hypothesis-framing.md | 101 ++++++++++ .../references/lifecycle-handoff.md | 39 ++++ .../references/metric-selection.md | 75 ++++++++ .../references/per-metric-interpretation.md | 167 +++++++++++++++++ .../manage-experiment/references/pitfalls.md | 93 +++++++++ .../references/prior-experiments.md | 81 ++++++++ .../references/routing-xp-vs-ff.md | 85 +++++++++ .../segment-breakdown-interpretation.md | 99 ++++++++++ .../segment-of-interest-selection.md | 116 ++++++++++++ .../references/session-replay-analysis.md | 109 +++++++++++ .../manage-experiment/references/sizing.md | 109 +++++++++++ .../references/statistical-model.md | 102 ++++++++++ .../references/why-no-statsig.md | 115 ++++++++++++ .../skills/manage-experiment/SKILL.md | 118 ++++++++++++ .../manage-experiment/commands/design.md | 161 ++++++++++++++++ .../manage-experiment/commands/interpret.md | 112 +++++++++++ .../references/advanced-features.md | 103 ++++++++++ .../references/health-check-interpretation.md | 176 ++++++++++++++++++ .../references/hypothesis-framing.md | 101 ++++++++++ .../references/lifecycle-handoff.md | 39 ++++ .../references/metric-selection.md | 75 ++++++++ .../references/per-metric-interpretation.md | 167 +++++++++++++++++ .../manage-experiment/references/pitfalls.md | 93 +++++++++ .../references/prior-experiments.md | 81 ++++++++ .../references/routing-xp-vs-ff.md | 85 +++++++++ .../segment-breakdown-interpretation.md | 99 ++++++++++ .../segment-of-interest-selection.md | 116 ++++++++++++ .../references/session-replay-analysis.md | 109 +++++++++++ .../manage-experiment/references/sizing.md | 109 +++++++++++ .../references/statistical-model.md | 102 ++++++++++ .../references/why-no-statsig.md | 115 ++++++++++++ 55 files changed, 5895 insertions(+), 9 deletions(-) create mode 100644 plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md create mode 100644 plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md create mode 100644 plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/interpret.md create mode 100644 plugins/mixpanel-mcp-eu/skills/manage-experiment/references/advanced-features.md create mode 100644 plugins/mixpanel-mcp-eu/skills/manage-experiment/references/health-check-interpretation.md create mode 100644 plugins/mixpanel-mcp-eu/skills/manage-experiment/references/hypothesis-framing.md create mode 100644 plugins/mixpanel-mcp-eu/skills/manage-experiment/references/lifecycle-handoff.md create mode 100644 plugins/mixpanel-mcp-eu/skills/manage-experiment/references/metric-selection.md create mode 100644 plugins/mixpanel-mcp-eu/skills/manage-experiment/references/per-metric-interpretation.md create mode 100644 plugins/mixpanel-mcp-eu/skills/manage-experiment/references/pitfalls.md create mode 100644 plugins/mixpanel-mcp-eu/skills/manage-experiment/references/prior-experiments.md create mode 100644 plugins/mixpanel-mcp-eu/skills/manage-experiment/references/routing-xp-vs-ff.md create mode 100644 plugins/mixpanel-mcp-eu/skills/manage-experiment/references/segment-breakdown-interpretation.md create mode 100644 plugins/mixpanel-mcp-eu/skills/manage-experiment/references/segment-of-interest-selection.md create mode 100644 plugins/mixpanel-mcp-eu/skills/manage-experiment/references/session-replay-analysis.md create mode 100644 plugins/mixpanel-mcp-eu/skills/manage-experiment/references/sizing.md create mode 100644 plugins/mixpanel-mcp-eu/skills/manage-experiment/references/statistical-model.md create mode 100644 plugins/mixpanel-mcp-eu/skills/manage-experiment/references/why-no-statsig.md create mode 100644 plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md create mode 100644 plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md create mode 100644 plugins/mixpanel-mcp-in/skills/manage-experiment/commands/interpret.md create mode 100644 plugins/mixpanel-mcp-in/skills/manage-experiment/references/advanced-features.md create mode 100644 plugins/mixpanel-mcp-in/skills/manage-experiment/references/health-check-interpretation.md create mode 100644 plugins/mixpanel-mcp-in/skills/manage-experiment/references/hypothesis-framing.md create mode 100644 plugins/mixpanel-mcp-in/skills/manage-experiment/references/lifecycle-handoff.md create mode 100644 plugins/mixpanel-mcp-in/skills/manage-experiment/references/metric-selection.md create mode 100644 plugins/mixpanel-mcp-in/skills/manage-experiment/references/per-metric-interpretation.md create mode 100644 plugins/mixpanel-mcp-in/skills/manage-experiment/references/pitfalls.md create mode 100644 plugins/mixpanel-mcp-in/skills/manage-experiment/references/prior-experiments.md create mode 100644 plugins/mixpanel-mcp-in/skills/manage-experiment/references/routing-xp-vs-ff.md create mode 100644 plugins/mixpanel-mcp-in/skills/manage-experiment/references/segment-breakdown-interpretation.md create mode 100644 plugins/mixpanel-mcp-in/skills/manage-experiment/references/segment-of-interest-selection.md create mode 100644 plugins/mixpanel-mcp-in/skills/manage-experiment/references/session-replay-analysis.md create mode 100644 plugins/mixpanel-mcp-in/skills/manage-experiment/references/sizing.md create mode 100644 plugins/mixpanel-mcp-in/skills/manage-experiment/references/statistical-model.md create mode 100644 plugins/mixpanel-mcp-in/skills/manage-experiment/references/why-no-statsig.md create mode 100644 plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md create mode 100644 plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md create mode 100644 plugins/mixpanel-mcp/skills/manage-experiment/commands/interpret.md create mode 100644 plugins/mixpanel-mcp/skills/manage-experiment/references/advanced-features.md create mode 100644 plugins/mixpanel-mcp/skills/manage-experiment/references/health-check-interpretation.md create mode 100644 plugins/mixpanel-mcp/skills/manage-experiment/references/hypothesis-framing.md create mode 100644 plugins/mixpanel-mcp/skills/manage-experiment/references/lifecycle-handoff.md create mode 100644 plugins/mixpanel-mcp/skills/manage-experiment/references/metric-selection.md create mode 100644 plugins/mixpanel-mcp/skills/manage-experiment/references/per-metric-interpretation.md create mode 100644 plugins/mixpanel-mcp/skills/manage-experiment/references/pitfalls.md create mode 100644 plugins/mixpanel-mcp/skills/manage-experiment/references/prior-experiments.md create mode 100644 plugins/mixpanel-mcp/skills/manage-experiment/references/routing-xp-vs-ff.md create mode 100644 plugins/mixpanel-mcp/skills/manage-experiment/references/segment-breakdown-interpretation.md create mode 100644 plugins/mixpanel-mcp/skills/manage-experiment/references/segment-of-interest-selection.md create mode 100644 plugins/mixpanel-mcp/skills/manage-experiment/references/session-replay-analysis.md create mode 100644 plugins/mixpanel-mcp/skills/manage-experiment/references/sizing.md create mode 100644 plugins/mixpanel-mcp/skills/manage-experiment/references/statistical-model.md create mode 100644 plugins/mixpanel-mcp/skills/manage-experiment/references/why-no-statsig.md diff --git a/README.md b/README.md index 17a8229..4db4025 100644 --- a/README.md +++ b/README.md @@ -4,17 +4,18 @@ Plugins that give AI agents Mixpanel expertise. Built on the [Agent Skills](http ## Skills -| Skill | Description | -|---|---| -| [`create-dashboard`](plugins/mixpanel-mcp/skills/create-dashboard/) | Creates a well-designed Mixpanel dashboard with validated data, text cards, and narrative layout. | -| [`deep-research`](plugins/mixpanel-mcp/skills/deep-research/) | Conducts a structured metric investigation in Mixpanel. Use when a user asks *why* a metric changed, what's driving a trend, or requests a deep dive or root cause analysis. | -| [`manage-lexicon`](plugins/mixpanel-mcp/skills/manage-lexicon/) | Audits, scores, enriches, and cleans up Lexicon metadata (events and properties) for a Mixpanel project. Supports scoring health, bulk-filling descriptions/tags, resetting metadata, triaging data quality issues, and managing tags. | -| [`tracking-implementation`](plugins/mixpanel-mcp/skills/tracking-implementation/) | Guides an agent through Mixpanel analytics implementation. Supports Quick Start, Full Implementation, Add Tracking, and Audit modes. | +| Skill | Description | +| --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| [`create-dashboard`](plugins/mixpanel-mcp/skills/create-dashboard/) | Creates a well-designed Mixpanel dashboard with validated data, text cards, and narrative layout. | +| [`deep-research`](plugins/mixpanel-mcp/skills/deep-research/) | Conducts a structured metric investigation in Mixpanel. Use when a user asks _why_ a metric changed, what's driving a trend, or requests a deep dive or root cause analysis. | +| [`manage-experiment`](plugins/mixpanel-mcp/skills/manage-experiment/) | Coaches an agent through any phase of a Mixpanel experiment — designing before launch (hypothesis, metrics, sizing, statistical model, advanced features, pre-launch checks) and interpreting after launch (read results, ship / iterate / kill / wait, health checks, segment breakdowns, session replays). | +| [`manage-lexicon`](plugins/mixpanel-mcp/skills/manage-lexicon/) | Audits, scores, enriches, and cleans up Lexicon metadata (events and properties) for a Mixpanel project. Supports scoring health, bulk-filling descriptions/tags, resetting metadata, triaging data quality issues, and managing tags. | +| [`tracking-implementation`](plugins/mixpanel-mcp/skills/tracking-implementation/) | Guides an agent through Mixpanel analytics implementation. Supports Quick Start, Full Implementation, Add Tracking, and Audit modes. | ### Internal skills -| Skill | Description | -|---|---| +| Skill | Description | +| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | [`review-skill`](.claude/skills/review-skill/) | Reviews a skill against a weighted quality rubric (8 dimensions, 27 checks) and produces a score with actionable issues. Run `/review-skill ` before requesting a code review. | ## Getting Started @@ -30,21 +31,23 @@ claude plugin marketplace add mixpanel/ai-plugins 2. Install the plugin for your region: **US** + ```bash claude plugin install mixpanel-mcp ``` **EU** + ```bash claude plugin install mixpanel-mcp-eu ``` **India** + ```bash claude plugin install mixpanel-mcp-in ``` - ### Cursor Install the plugin from the Cursor marketplace, or have a team admin import this GitHub repository as a team marketplace (Dashboard → Settings → Plugins → Import). diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md new file mode 100644 index 0000000..f2e1be1 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md @@ -0,0 +1,118 @@ +--- +name: manage-experiment +description: > + Coach the user through any phase of a Mixpanel experiment — design before + launch (hypothesis framing, metric selection, sizing, statistical model + choice, advanced statistical features like CUPED / Winsorization / + Bonferroni / Benjamini-Hochberg, pre-launch pitfall checks) and interpret + after launch (read results, decide ship / iterate / kill / wait, interpret + health checks like SRM and Retro A/A, break results down by segment, use + session replays to explain a result). Use when the user mentions + experiment, A/B test, ship/kill decision, MDE, minimum detectable effect, + sample ratio mismatch, CUPED, sizing, statistical significance, lift, or + any phrasing like "set up an experiment", "design an A/B test", "how did + experiment X do", "should we ship", "why isn't this significant yet", + "should this be sequential or fixed-horizon", "what's my MDE", "is this + experiment configured correctly", "audit my experiment". Do NOT use for + plain feature-flag rollouts with no measurement criterion — that belongs + to the `manage-feature-flags` skill. +license: Apache-2.0 +--- + +# Manage Experiment + +This skill manages a Mixpanel experiment across its lifecycle — **designing** before launch and **interpreting** after launch. Two commands sit under the umbrella; pick by experiment phase: design when the experiment doesn't exist yet, interpret once exposures are flowing. + +The skill runs as a single interactive session per experiment. The two commands compose naturally — designing produces a configuration that interpreting later consumes — but they're rarely invoked in the same session (the gap is days to weeks). + +--- + +# Components + +The pieces the skill is built from. The Steps section below tells you how to use them. + +## Canonical commands + +Each command lives in its own file under `commands/` and is loaded on demand. Match commands explicitly (user names them) or implicitly (message matches a trigger phrase below). + +| Command | File | Match if message contains any of | +| ----------- | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- | +| `design` | `commands/design.md` | design, set up, configure, plan, sanity-check, pre-launch, MDE, sizing, hypothesis, sequential vs frequentist, CUPED, Winsorization | +| `interpret` | `commands/interpret.md` | read results, ship, iterate, kill, wait, statsig, SRM, sample ratio mismatch, retro A/A, lift, polarity, segment breakdown, session replays | + +If a message could route to either (e.g. "audit my experiment", "check on experiment X"), use the **phase-derived** rule: experiment in `DRAFT` → `design`; experiment in `ACTIVE` or `CONCLUDED` → `interpret`. If the experiment state is unknown, ask the user. + +## Command menu + +Shown when no command was detected or inferred. + +``` +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ + Manage Experiment — [Project Name] ([project_id]) +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ + 1. Design — Hypothesis, metrics, sizing, model, pre-launch checks + 2. Interpret — Read results, ship / iterate / kill / wait + 3. Exit +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ +``` + +## Shared glossary + +Terms both commands use without redefining. Phase-specific terms (hypothesis, polarity, SRM, etc.) live in their command files. + +- **Variant.** One arm of the experiment. The variant treated as the baseline is the **control**; the others are **treatments**. The platform marks which key is the control. +- **Primary / Guardrail / Secondary metric.** + - **Primary** — drives the ship decision. Cap at 3; the platform applies multiple-testing correction across primaries when configured. + - **Guardrail** — must not regress; a guardrail regression vetoes a ship even when primaries win. + - **Secondary** — exploratory / diagnostic only, never decisional, no correction applied. +- **Direction.** Whether bigger is better for a metric (`up`) or smaller is better (`down`). Cancel / error / latency / abandon / refund metrics need `down` set explicitly — leaving the default silently flips polarity at interpretation. +- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict. +- **MDE (Minimum Detectable Effect).** The smallest lift the experiment is sized to detect. Set during design, enforced at interpretation. +- **CUPED.** Variance-reduction technique using pre-exposure baseline. Cuts required sample 30–70% when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. +- **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95 (verify in product). Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. +- **Multiple-testing correction.** Adjusts per-test significance threshold when several primaries or several non-control variants are tested together. Default Benjamini-Hochberg; Bonferroni for strict family-wise control. + +## Behaviour rules + +1. **Irreversible actions require explicit confirmation.** Creating an experiment (in `design`) and concluding one (in `interpret`) are both irreversible. Show the proposed action, wait for the user to confirm. +2. **If a command can't complete, explain why.** Tell the user what failed and what they can try. Don't fail silently. +3. **Experiment switching.** If the user wants to operate on a different experiment mid-session, ask which one and reset experiment-scoped context. +4. **Project switching.** If the user wants to operate on a different project mid-session, suggest starting a new conversation first. If they insist, resolve the new project and continue with that `project_id`. + +--- + +# Steps + +Follow these steps in order. + +## 1. Set project + +Resolve which Mixpanel project the user wants to operate on. + +- **User named a project (name or ID):** list all projects in the workspace. Match by ID first, then by case-insensitive name. If one match → `✅ [Project Name] ([project_id])`, proceed. +- **Multiple name matches:** show the matches in a numbered list, ask the user to pick. +- **No match:** tell the user what wasn't found, offer to `list` (which re-fetches the project list and shows the table). +- **User named nothing:** ask which project. `list` → fetch projects → show table. + +If the project listing fails with tool-not-found, tell the user to connect the Mixpanel MCP and stop. + +## 2. Set experiment (if one is named) + +If the user named an experiment, resolve it now — try ID first, then case-insensitive name match. Multiple matches → numbered picker. No match → tell the user what wasn't found. + +If the user is starting a new experiment from scratch (no existing experiment to name), skip this step — `design` will handle setup. + +## 3. Pick the command + +- **Explicit:** user names a phase (`/design`, "set up a new experiment", "interpret experiment X") → use that command. +- **Implicit:** message matches one canonical trigger phrase from the Components table → use that command. +- **Phase-derived:** an experiment exists in context and its state determines the command — `DRAFT` → `design`; `ACTIVE` or `CONCLUDED` → `interpret`. +- **Ambiguous or none:** show the Command menu, take the user's choice. + +## 4. Load and execute the command + +If the command file is not already in context, read `commands/[command].md`. Follow the instructions in that file. Reuse the project and experiment context resolved in steps 1–2 — never re-ask. + +## 5. Complete + +Print `✅ Done.` Return to step 3 if the user wants to chain another command (e.g. design → launch externally → interpret in a later session). diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md new file mode 100644 index 0000000..b5f7046 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md @@ -0,0 +1,161 @@ +# Command: design + +Design a Mixpanel experiment before launch. A well-designed experiment starts from the hypothesis and works backward: the hypothesis dictates the metrics that test it, the metrics dictate the sample size, the sample size + traffic dictate duration and testing model. **Don't create the experiment until the user explicitly confirms the configuration** — once it's live, mid-flight config changes invalidate the test. + +The umbrella `SKILL.md` defines the shared glossary (Variant, Primary/Guardrail/Secondary metric, Direction, Lift, MDE, CUPED, Winsorization, Multiple-testing correction). Phase-specific terms below. + +--- + +## Glossary (design-specific) + +- **Hypothesis.** A falsifiable, directional claim with a stated mechanism, bounded in time. Shape: _"If ``, then `` will `` by ≥``, because ``."_ Every other decision flows from this. +- **Power.** The probability the experiment detects a true effect of size MDE. Default 80%. +- **Underpowered.** Achievable MDE on available traffic exceeds the user's expected lift. Most likely outcome is "inconclusive"; reachable significance is biased upward (winner's curse). +- **Sequential vs Frequentist testing.** Sequential makes peeking safe (boundary-based stopping); Frequentist requires a fixed sample committed up front. Most users should default to Sequential. + +--- + +## Components (design-specific) + +### Sizing formulas + +Required sample per variant (two-sample, two-sided, 95% confidence, 80% power): + +``` +n = 16 × σ² / d² +``` + +Inverted for traffic-bound teams — the smallest effect detectable on available traffic (Kohavi's inversion): + +``` +MDE = 4σ / √n +``` + +The `16` is `(z_{α/2} + z_β)² × 2` rounded. Variance `σ²` depends on metric type: Bernoulli `p(1−p)`; Poisson `≈ mean`; Gaussian computed from data. The full derivation, worked examples, lookup table, and the five remediations for underpowered experiments live in [../references/sizing.md](../references/sizing.md). + +### The >5% guardrail hard-gate + +A **5% relative regression on any guardrail blocks ship**, even when the primary wins. Guardrails are the trustworthiness backstop; without this rule, a winning primary with a quietly regressing guardrail ships and rolls back two weeks later. Below 5% lives in the noise band of most guardrails; above 5% means the team has traded measurable damage for headline lift. If the user wants to ship past a regressing guardrail, force the conversation — disable the guardrail explicitly and document why. Don't let them silently override. Full rationale in [../references/pitfalls.md](../references/pitfalls.md). + +### Pre-launch pitfall catalogue + +Before creating the experiment, run the deterministic pre-launch checks against the configuration. Surface results in triage order: **blockers** (an experiment that can't reach statistical power), **warnings** (configuration smells that degrade trustworthiness), then **fyi**. The two blockers today are: insufficient duration for the configured MDE on available traffic; and a cohort too small to supply enough eligible users. The full catalogue, severities, and rationale live in [../references/pitfalls.md](../references/pitfalls.md). + +--- + +## Steps + +Top-down: what to do, in order. + +### 1. Route and check for prior work + +**Route Experiment vs Feature Flag first.** Wants causal evidence (lift, ship/no-ship from data) → experiment. Wants progressive rollout, kill switch, or per-segment gating with no measurement criterion → feature flag (route to the `manage-feature-flags` skill). If ambiguous, ask once: _"Are you measuring whether this change moves a metric (experiment), or rolling it out gradually with no measurement criterion (feature flag)?"_ Deeper disambiguation in [../references/routing-xp-vs-ff.md](../references/routing-xp-vs-ff.md). + +**Check for prior experiments on the same feature.** Search the project by keywords drawn from the feature name. If a prior-experiments lookup isn't available, say so explicitly — don't fabricate "no priors found." Surface anything you find: a same-feature ship suggests "don't re-run, iterate on a new hypothesis"; a prior kill is a strong prior the user has to argue past; an earlier iteration gives you reliable baseline and variance numbers that sharpen the new MDE. Fold-in playbook in [../references/prior-experiments.md](../references/prior-experiments.md). + +### 2. Write the hypothesis + +A good hypothesis is a **falsifiable, directional claim with a stated mechanism, bounded in time**: + +> **If** ``, **then** `` will `` by ≥``, **because** ``. + +If the user is vague, hold them to five commitments: the change, the primary metric, the direction, the MDE, the mechanism. The "because" forces them to check whether the metric they picked is actually downstream of the change — the most common source of "experiment didn't work" post-mortems. The rubric, common misalignment patterns, and worked good/bad examples are in [../references/hypothesis-framing.md](../references/hypothesis-framing.md). + +### 3. Pick metrics that test the hypothesis + +The hypothesis names a specific outcome. The primary metric must measure that outcome — same population, same denominator, same timeframe. + +- **Primaries** (1–3 max) come from the hypothesis's outcome clause. Each additional primary inflates the family-wise false-positive rate. +- **Guardrails** (strongly recommended) cover the most likely failure mode of the change — see the guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). +- **Secondaries** are diagnostic only. + +Every primary and guardrail needs an explicit `direction`. Watch for the **lagging-indicator trap** (30-day retention as primary on a 2-week experiment) and the **changed-denominator trap** (metric defined only over treatment-exposed users — lift is artificially infinite). Full sanity checklist and standard guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). + +### 4. Size the experiment with real data + +Pull baseline rate, variance, and daily traffic from Mixpanel. Don't guess. + +Use the formulas in **Components**. Then compare the required sample to what the available traffic delivers inside an acceptable window (typically 2–4 weeks). If the achievable MDE exceeds the user's expected lift, the experiment is **underpowered** — surface immediately. Don't wave it through; offer the remediations from the sizing reference (accept a larger MDE → increase allocation → enable CUPED → pick a higher-volume primary → don't run). + +Sample-size floor: keep per-variant target above the platform's reliability floor (verify in product — historically ~350–400). Below the floor, the central limit theorem breaks down and the SRM check gets noisy. Full worked examples, baseline-by-rate lookup table, and the duration / seasonality rules in [../references/sizing.md](../references/sizing.md). + +### 5. Pick testing model + end condition + +Four choices, each with a default that's right for most users: + +- **Testing model** — default Sequential (peeking is safe by design); Frequentist only for small-lift hunts on well-sized tests. +- **End condition** — sample-based for variable traffic; date-based for strong weekly seasonality. +- **Confidence level** — default 0.95 (verify in product); 0.99 for irreversible high-stakes ships; 0.90 only when speed beats rigour. +- **Multiple-testing correction** — enable when there are ≥2 primaries OR ≥2 non-control variants; default Benjamini-Hochberg, Bonferroni for strict family-wise control. + +Decision tree, the peeking-trap explanation, worked compounding-FPR numbers, and the four valid model × end-condition combinations are in [../references/statistical-model.md](../references/statistical-model.md). + +### 6. Decide on advanced features + +- **CUPED** — enable when the primary metric correlates with pre-exposure behaviour AND all experiment users existed before start AND 2–4 weeks of stable pre-exposure history is available. Do not enable on new-user-only experiments, one-time-event metrics, or brand-new metrics. +- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. Push back if the configured percentile is below ~80. + +When/why each is right and the common misconfigurations are in [../references/advanced-features.md](../references/advanced-features.md). + +### 7. Run the pre-launch pitfall check + +Run the catalogue from **Components** against the proposed configuration. Surface only what fires; order blockers → warnings → fyi. Blockers should stop launch (the experiment cannot reach statistical power as configured). Warnings should be explained — name the trade-off, don't just nag. Full catalogue in [../references/pitfalls.md](../references/pitfalls.md). + +### 8. Confirm with the user, then create + +Creating the experiment is the irreversible step. Present a compact summary and **wait for explicit confirmation** before invoking the creation action: + +``` +*Experiment Setup Summary* + +• *Hypothesis:* If , then will by ≥, because . +• *Primary metrics:* (direction: up/down), … +• *Guardrails:* (direction: …), … +• *Variants:* control 50% / treatment 50% (or as configured) +• *Statistical model:* sequential | frequentist +• *End condition:* sample-based (per-arm ) | date-based ( days) +• *Confidence level:* 0.95 +• *Multiple testing correction:* benjamini-hochberg | bonferroni | off +• *Advanced features:* CUPED on/off · Winsorization on/off (percentile

) +• *Expected duration on current traffic:* days +• *Achievable MDE on current traffic:* % relative + +*Pitfall check:* +✅ Insufficient duration — adequate +✅ Cohort too small — adequate +⚠️ Missing guardrails — no guardrail metrics configured; >5% hard-gate cannot protect this ship +``` + +Use the exact catalogue labels from [../references/pitfalls.md](../references/pitfalls.md) so the agent's pitfall messages stay consistent across sessions. + +After creating, link the new experiment back to any prior experiment surfaced in step 1 — record the prior's ID, hypothesis, and outcome in the new experiment's description. That 30-second annotation pays back tenfold at interpretation time. + +If the user hasn't named a specific feature or surface, ask before fetching baselines or designing — designing the wrong experiment burns more time than the clarifying question costs. Accept the feature by name or by ID; try ID match first, then case-insensitive name match. + +--- + +## Going deeper + +| User asks about… | Open | +| ----------------------------------------------------------------------------- | -------------------------------------------------------------------------- | +| "Is this an experiment or just a feature flag?" | [../references/routing-xp-vs-ff.md](../references/routing-xp-vs-ff.md) | +| "Help me write the hypothesis" / "Is this hypothesis good?" | [../references/hypothesis-framing.md](../references/hypothesis-framing.md) | +| "Which metrics should I pick?" / "Primary vs guardrail vs secondary?" | [../references/metric-selection.md](../references/metric-selection.md) | +| "What sample size do I need?" / "What MDE can I detect?" / "How long to run?" | [../references/sizing.md](../references/sizing.md) | +| "Sequential vs frequentist?" / "Confidence level?" / "Correction method?" | [../references/statistical-model.md](../references/statistical-model.md) | +| "Should I enable CUPED / Winsorization?" | [../references/advanced-features.md](../references/advanced-features.md) | +| "Was anything similar tested before?" | [../references/prior-experiments.md](../references/prior-experiments.md) | +| "What can go wrong before launch?" / "Run the pre-launch check" | [../references/pitfalls.md](../references/pitfalls.md) | + +--- + +## Output style + +- Lead with the hypothesis. Every other decision flows from it. +- Use concrete numbers from real data ("baseline 4.2%, σ² = 0.040, required n ≈ 6,400/arm"), not vague guidance. +- Quote the user's MDE and metric names back so they catch typos. +- When underpowered, say so plainly and list remediations in order of cost. +- Don't moralise about peeking — switch them to sequential. +- Guardrail regressions are hard gates, not "slight concerns." + +When the experiment is created and live, hand off to the `interpret` command in this same skill once exposures are flowing. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/interpret.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/interpret.md new file mode 100644 index 0000000..77d78ff --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/interpret.md @@ -0,0 +1,112 @@ +# Command: interpret + +Interpret a Mixpanel experiment's results and health checks. This command consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a verdict field is missing, say so — do not synthesize one from raw values. + +The umbrella `SKILL.md` defines the shared glossary (Variant, Primary/Guardrail/Secondary metric, Direction, Lift, MDE, CUPED, Winsorization, Multiple-testing correction). Phase-specific terms below. + +--- + +## Glossary (interpret-specific) + +- **Polarity.** Whether a movement is _good for the business_. Combines sign of lift with the metric's `direction` ("up" = bigger is better; "down" = smaller is better). See the **Polarity recipe** in Components. +- **Significance.** The platform's per-row classification: `YES_POSITIVE`, `YES_NEGATIVE`, or `NO`. Read from the response — do not recompute. +- **SRM (Sample Ratio Mismatch).** Variants received traffic in proportions that disagree with the configured split. **Kohavi's #1 trustworthiness check** — when SRM fails, downstream lift, p-values, and CIs cannot be trusted. +- **Retro A/A (pre-experiment bias).** Re-runs the comparison on the pre-exposure period. A failure means cohorts already differed before treatment started. +- **Twyman's Law.** "Any unusually clean or unusually large result is more likely a bug than a discovery." Apply on lifts > ~30% — usually a changed-denominator artifact. +- **Trustworthiness gate.** The pre-flight check that runs before any results interpretation: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference. + +--- + +## Components (interpret-specific) + +### Polarity recipe (load-bearing — apply on every metric row) + +The platform's summary buckets (`positive` / `negative` / `no`) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion. + +Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"): + +- `lift is None` or `lift == 0` → **neutral** (no measurement / no effect respectively). +- `direction == "up"` → **positive** if `lift > 0`, else **negative**. +- `direction == "down"` → **positive** if `lift < 0`, else **negative**. + +A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control. + +The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**. + +### Data-source fallback + +Experiment-details has two parallel data paths — live (preferred) and cached. Always prefer live; if live computation failed, fall back to cache with a staleness caveat; if **both** are empty, say "no result was computed" and recommend a re-sync. **Never** silently treat missing data as "no effect." + +### Verdict table + +| Situation | Recommendation | +| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Conclude the experiment via its decide lifecycle action, naming the winning variant and a rationale message. **Confirm with the user first — concluding is irreversible.** | +| Trust ✓, primary polarity positive, guardrail polarity negative | **ITERATE.** Investigate the regression; do not auto-ship. | +| Trust ✓, primary polarity neutral after target sample reached | **KILL or ITERATE.** Use the inconclusive-results playbook in [../references/why-no-statsig.md](../references/why-no-statsig.md). | +| Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [../references/why-no-statsig.md](../references/why-no-statsig.md)). | +| Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [../references/health-check-interpretation.md](../references/health-check-interpretation.md). | + +For multi-variant tests, special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), and the exact decide-call shape, see [../references/lifecycle-handoff.md](../references/lifecycle-handoff.md). + +--- + +## Steps + +Top-down: what to do, in order. + +### 1. Fetch the experiment + +The umbrella's step 2 should have resolved the experiment already. If not (the user named it mid-command), accept by name or ID; try ID match first, then case-insensitive name match. + +Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments. + +Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret. + +### 2. Run the trustworthiness gate (the Decision Tree) + +Run steps 2a–2e in order. **Stop at the first failure** — do not proceed if a step flags a problem. The platform attaches verdict fields for each check; consume those verdicts rather than recomputing. + +#### 2a. Trustworthiness + +SRM ok? Retro A/A clean? Exposures sufficient? Minimum duration met (~3 days)? No misconfiguration? If any fail → STOP and open [../references/health-check-interpretation.md](../references/health-check-interpretation.md). The Misconfigurations section in that reference covers the warning-level signals (multiple-testing off, extreme winsorization, CUPED on new-users-only, etc.). + +#### 2b. Statistical significance + +Apply the **polarity recipe** from Components to each non-control variant × primary metric. If nothing is significant on primaries → see [../references/why-no-statsig.md](../references/why-no-statsig.md). For translating a single metric's lift / CI / p-value into a phrase, see [../references/per-metric-interpretation.md](../references/per-metric-interpretation.md). + +#### 2c. Guardrail check + +Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a row in `summary.positive` for a `direction: "down"` guardrail is still a regression. + +#### 2d. Practical significance + +Convert lift into absolute terms — multiply by the control baseline. Statistically significant ≠ ships. The per-metric reference covers the baseline-fetch fallback when `value` or `sampleSize` is missing, and the **Twyman's Law** check for any lift > ~30%. + +#### 2e. Verdict + +Look up the situation in the **Verdict table** in Components. If the recommendation is SHIP or KILL, surface the proposed decide-action parameters and **wait for explicit user confirmation** before executing — concluding an experiment is irreversible. + +### 3. Going deeper (open references on demand) + +| User asks about… | Open | +| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ | +| SRM failing, Retro A/A failing, exposures insufficient, or any trustworthiness fail | [../references/health-check-interpretation.md](../references/health-check-interpretation.md) | +| "Translate this lift / CI / p-value into English" | [../references/per-metric-interpretation.md](../references/per-metric-interpretation.md) | +| "Why hasn't this hit statsig yet? Should we wait or stop?" | [../references/why-no-statsig.md](../references/why-no-statsig.md) | +| "Which segments should I break this down on?" | [../references/segment-of-interest-selection.md](../references/segment-of-interest-selection.md) | +| "What does this segment-by-segment result mean?" | [../references/segment-breakdown-interpretation.md](../references/segment-breakdown-interpretation.md) | +| "Can session replays help explain this result?" | [../references/session-replay-analysis.md](../references/session-replay-analysis.md) | +| "How do I actually conclude this experiment? Multi-variant ship?" | [../references/lifecycle-handoff.md](../references/lifecycle-handoff.md) | + +### 4. Output + +Default to this shape unless the user asks for something else: + +1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`. +2. **Why**, walking through the trustworthiness-gate steps that mattered (skip steps that were clearly fine). +3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, each polarity-corrected. Include absolute-impact translation for any win. +4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, stale-cache caveat, etc. +5. **Suggested next action** — for SHIP / KILL, the proposed decide-action parameters **gated on user confirmation**; for ITERATE / WAIT, the investigation to run next. + +If experiment details are unavailable or return errors, say so — do not invent a verdict. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/advanced-features.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/advanced-features.md new file mode 100644 index 0000000..3f6f7dd --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/advanced-features.md @@ -0,0 +1,103 @@ +# Advanced features + +Three optional features most experiments don't touch — and that, used in the right spot, dramatically improve power or trustworthiness. Each one has a clear set of conditions where it helps and a clear set of conditions where enabling it is wrong. + +## CUPED — variance reduction + +**What it does.** CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance on metrics that correlate with users' pre-experiment behaviour. Lower variance → smaller required sample size → faster experiments. Typical reductions are 30–70%, which translates directly into 30–70% smaller required sample. + +**How to enable.** Turn CUPED on for the experiment and pick a pre-exposure window length (see presets below). + +### When to enable + +- The primary metric correlates with users' pre-exposure behaviour on the same metric. Strong correlations: revenue, engagement (events per user), retention, time-on-platform. Weak correlations: anything one-time or onboarding-specific. +- **All experiment users existed before the experiment start** — i.e., not a new-user-only cohort. CUPED needs a pre-exposure observation period; new users don't have one. +- A 2–4 week pre-exposure window is available with stable behaviour. If the metric was launched 5 days ago, CUPED has nothing to read. + +### When NOT to enable + +- New-user-only experiments. No pre-exposure data exists. CUPED gives zero variance reduction and adds noise. +- Brand-new metrics without historical data. +- Metrics where pre-exposure behaviour is not predictive of post-exposure (e.g., one-time onboarding events: the user either did or didn't complete onboarding once; pre-exposure has nothing to say about it). +- Pre-exposure window short enough that the behaviour you'd "control for" is itself a transient spike (e.g., metric just had a viral moment last week). + +### Pre-exposure window presets + +- **2 weeks** — fast-moving metrics with no strong weekly seasonality. +- **4 weeks** — most metrics with weekly seasonality (default sweet spot). +- **60 days** — deeply seasonal metrics like spend. +- **90 days** — long-cycle metrics (renewal-driven revenue, etc.). + +### What changes downstream + +- Required sample size shrinks by the variance-reduction factor. A 50% variance reduction on a primary that needed 60k per arm shrinks the target to ~30k per arm. +- The point estimate of the lift is unchanged. CUPED is a variance-reduction technique, not a bias correction; the headline lift is the same, the confidence interval is narrower. +- The post-launch interpretation step needs to know CUPED was on, because the standard error formula differs. The platform persists the setting on the experiment; the interpretation step reads it automatically. + +## Winsorization — outlier handling + +**What it does.** Caps extreme values at a percentile boundary (default 95th — verify in product). This squeezes the long tail of heavy-tailed distributions so a handful of outliers can't dominate the per-arm mean. + +**How to enable.** Turn Winsorization on for the experiment and pick a percentile. + +### When to enable + +- Revenue or spend metrics with whales (one customer spends 100× the median; that customer assigned to treatment is enough to swing the headline). +- Time-on-page or session-duration metrics with users who fall asleep on the page (one session at 8 hours dwarfs 10,000 sessions at 30 seconds). +- Any Gaussian-distributed metric with a heavy right tail (count metrics, event volume per user, page view counts). + +### When NOT to enable + +- Bernoulli (conversion) metrics. Capping a 0/1 outcome is meaningless; the 95th percentile of a 0/1 distribution is also 0 or 1. +- Metrics where the tail behaviour **is** the hypothesis. If the test is "did this change move whale spending?", Winsorization throws away exactly the signal you're testing for. +- Metrics already winsorized upstream (in the metric definition / data pipeline) — double-winsorization adds nothing. + +### Percentile guidance + +The platform default is typically 95 (cap top/bottom 5%) — verify in product. This is almost always right. Push back if the user sets a percentile below ~80 — that's more than 20% of values being capped, which throws away too much signal. Confirm intent before launching. + +For very heavy tails (extreme whale distributions), 99th percentile is sometimes appropriate, but that's the corner case. The platform default is the default for a reason. + +### What changes downstream + +- Variance on the affected metric drops, often substantially. Required sample size shrinks accordingly. +- The point estimate of the mean shifts toward the centre of the distribution. This is the desired behaviour; the whole point is to stop a few outliers from anchoring the estimate. +- The post-launch interpretation step reports the winsorized mean and standard error. If the team also wants to know what the un-winsorized mean did (the "did whales react?" question), they'd need a separate secondary metric without Winsorization. + +## Multiple testing correction — Bonferroni vs Benjamini-Hochberg + +Covered in detail in [statistical-model.md](statistical-model.md). The short version: + +- Enable when there are ≥2 primaries OR ≥2 non-control variants. +- Default to Benjamini-Hochberg. More powerful with correlated primaries. +- Use Bonferroni when family-wise error control is required (regulatory, etc.) or when the primaries are independent. +- Turn off only with a single primary and a single non-control variant. + +## Decision flowchart + +``` +Primary metric is Bernoulli (conversion rate)? +├── Yes → Winsorization OFF. +│ Does it correlate with pre-exposure behaviour of existing users? +│ ├── Yes → CUPED ON (if 2–4 week pre-exposure window available, no new-user cohort) +│ └── No → CUPED OFF +└── No (continuous / count / retention) + Heavy-tailed distribution with outliers (revenue, time-on-page, session length)? + ├── Yes → Winsorization ON (platform default percentile, typically 95) + └── No → Winsorization OFF + Does it correlate with pre-exposure behaviour of existing users? + ├── Yes → CUPED ON (if 2–4 week pre-exposure window available, no new-user cohort) + └── No → CUPED OFF + +Primary count ≥ 2 OR non-control variants ≥ 2? +├── Yes → Multiple testing correction ON (Benjamini-Hochberg default; Bonferroni for strict family-wise control) +└── No → Multiple testing correction OFF +``` + +## Common misconfigurations + +- ⛔ **CUPED on a new-user-only experiment.** No pre-exposure data; the feature does nothing. Worse, the user thinks they're being protected and ships an underpowered test. +- ⛔ **Winsorization on a conversion metric.** Capping 0/1 values is meaningless. The setting either no-ops or, if a buggy implementation interprets it literally, makes the metric worse. +- ⛔ **Winsorization at a percentile below ~80.** Cuts more than 20% of data. Almost always a typo for 95 or 90. Confirm intent. +- ⛔ **Multiple testing correction OFF on a 5-primary test.** Family-wise FPR balloons to ~22.6%. One in five "wins" is noise. +- ⛔ **CUPED enabled "to be safe" on a metric where pre-exposure doesn't predict post-exposure.** Best case: no effect. Common case: the variance estimate gets noisier because the regression adjustment is fitting to noise. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/health-check-interpretation.md new file mode 100644 index 0000000..1467468 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/health-check-interpretation.md @@ -0,0 +1,176 @@ +# Health-Check Interpretation + +Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action. + +--- + +## Kohavi framing — always cite when a health check fails + +> **Sample Ratio Mismatch is the #1 trustworthiness check (Kohavi).** When SRM is failing, do not trust the experiment's lift, p-values, or confidence intervals — the randomization assumption is broken, so the measured effect cannot be attributed to the treatment. +> +> **Twyman's Law**: any unusually clean or unusually large result is more likely a bug than a discovery. A spectacular lift on a failing-SRM experiment is not evidence of a great treatment; it's evidence the bucketing is broken. + +These two principles drive the recommendations below. Lead with them when explaining a failing check to the user. + +--- + +## 1. SRM (Sample Ratio Mismatch) + +**What the platform tells you**: the SRM verdict the experiment-details response carries (live, or cached when live isn't available). The platform tags failing SRMs already — consume the verdict, do not compute chi-square yourself. + +### What it means + +Users were assigned to variants in proportions that disagree with the configured target allocation. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness. + +### Likely causes, ordered most → least likely + +(Surface in this order — investigate the most probable first.) + +1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees. +2. **biased_assignment** — The assignment criterion correlates with the variant — e.g. assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window. +3. **bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation. +4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the exposure event fires exactly once per user per variant assignment. +5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period. + +### Recommended actions + +- **pause_and_investigate** — Pause the experiment before drawing any conclusions. SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable. +- **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric. +- **investigate_exposure_logging** — Compare exposure event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs. +- **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split. + +### Investigation checklist + +1. Compare the actual per-variant exposure ratio to the configured target allocation — which variant is over/under-represented? +2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history. +3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math. +4. Verify SDK version and bucketing logic. Query the exposure event grouped by variant to confirm exposure events are flowing correctly. +5. Check for bot/QA traffic — bots often skew toward control. If QA traffic isn't being excluded, recommend enabling that filter. +6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting. +7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.** + +--- + +## 2. Retro A/A (pre-experiment bias) failure + +**What the platform tells you**: the pre-experiment-bias analysis the platform attaches when that check is enabled in the experiment's settings. + +### What it means + +The same statistical comparison run on the **pre-exposure** period revealed that variant cohorts already differed _before_ the treatment started. Any "lift" measured during the experiment may just be reflecting that pre-existing gap, not the change. + +- Pre-experiment bias on a **primary** metric is a **stop-and-investigate** signal. +- Pre-experiment bias on a **secondary** metric is informational only. + +### Investigation checklist + +1. Identify which metric × variant pair triggered the failure (after the platform's correction). +2. Check whether bucketing was deterministic — non-deterministic assignment in the pre-period means users were assigned to different variants than they would have been in production. +3. Look for cohort skew: did one variant disproportionately receive heavy users? Query the metric pre-experiment grouped by variant to confirm. +4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort. +5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing. + +--- + +## 3. Insufficient exposures + +**What the platform tells you**: per-variant exposure counts plus an "insufficient" flag when the count is too low to trust. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue. + +### Investigation checklist + +1. Check per-variant exposure totals — which variant is undersampled? +2. Inspect feature-flag rollout — was rollout dialed back? +3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?). +4. If the experiment is still ACTIVE: extend duration via an experiment update with a new end target. +5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math. + +If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question. + +--- + +## 4. Frequentist peeking + +**What to check**: the experiment's testing model and whether it ended before reaching its configured end condition (sample size or duration, whichever was configured). + +### What it means + +A frequentist test that ends before reaching its configured target has an **inflated false-positive rate**. The math assumes a fixed sample size; peeking before that point and stopping on a favorable look is exactly what "p-hacking" looks like in production. + +### Investigation checklist + +1. Confirm the testing model is frequentist (sequential tests don't have this problem). +2. Compare the actual end date against the planned end (date- or sample-based, whichever the experiment was configured with). +3. If the conclusion was premature: results have inflated false-positive rate. Recommend a re-run. +4. If the user wants to keep current results: caveat strongly. Recommend a sequential testing model for the next experiment so they can stop early without penalty. + +(Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.) + +--- + +## 5. Live computation timeout / broken data + +**What the platform tells you**: a non-null error block on the live results, with the live data path empty. + +### Investigation checklist + +1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy. +2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. +3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision. +4. If the cache is recent (within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or empty, the user must resolve the backend issue before any meaningful interpretation. + +--- + +## 6. Experiment ran < 3 days + +**What to compute (this one is local)**: the elapsed time between the experiment's start and end. + +Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. **Refuse to interpret.** Tell the user explicitly: + +> _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_ + +If the experiment was sample-size-bounded and a tiny target was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window. + +--- + +## 7. Misconfigurations + +These don't always invalidate results, but they change how to _read_ them. Surface them as warnings during the trustworthiness gate. + +### Multiple-testing correction off with several primaries + +**Correction off AND 2+ primaries × 1+ non-control variants.** Any single significant primary may be a false positive — family-wise error rate scales multiplicatively (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). Look at primaries in aggregate: if most point the same direction, the effect is likely real; if only one or two of many are significant, recommend enabling Benjamini-Hochberg or Bonferroni and re-analyzing. + +### Extreme winsorization percentile + +**Winsorization enabled with a percentile far from the platform default (typically 95).** A percentile near 50 caps almost all data — almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason. + +### SRM check disabled + +**SRM check is off.** Often deliberate — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) and then recommend re-enabling SRM and re-analyzing. + +### CUPED on new-users-only cohort + +**CUPED enabled AND the cohort is "new users only".** CUPED needs pre-exposure data, so it had no effect here — but **results are still valid**, variance reduction just didn't happen. Mention as informational. For future experiments on this surface, suggest extending the cohort to include returning users so CUPED can apply. + +### Non-default confidence level + +**Confidence level differs from the platform default (typically 0.95).** `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out in the verdict and combine with metric count to estimate the family-wise error rate. + +### Broken or placeholder metric entries + +**Metric entries with empty names.** Likely broken or placeholder references. Flag and skip during analysis. + +### Primary metric with no computed result + +**A metric is listed as primary but has no result (live or cached).** This is **"no measurement," not "no effect."** Surface prominently; recommend re-syncing results before any conclusion that depends on this primary. + +--- + +## Output shape when a health check fails + +1. **What failed**, in one sentence (use the verdict the platform attached — do not re-derive). +2. **What that means for trust** — cite the Kohavi framing (SRM is #1) or Twyman's Law where it fits. +3. **Likely causes**, ordered most → least probable. +4. **Recommended action** from the small set above. +5. **Investigation checklist** the user can run. +6. **What NOT to do** — usually, "do not act on the current lift / p-value numbers." diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/hypothesis-framing.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/hypothesis-framing.md new file mode 100644 index 0000000..8c115af --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/hypothesis-framing.md @@ -0,0 +1,101 @@ +# Hypothesis framing + +All four properties of a good hypothesis — falsifiable, directional, mechanistic, bounded in time — matter. Drop any one and the design downstream silently degrades. + +## The shape + +> **If** ``, **then** `` will ``, **because** ``. + +| Property | Test | Failure mode | +| ------------------- | ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| **Falsifiable** | Could the data say "no"? | "Improving UX" can't be falsified. "Increasing weekly retention by ≥2pp" can. | +| **Directional** | Is the predicted change up or down? | "Affecting cart size" leaves the polarity ambiguous; the system defaults to `direction: "up"` and the interpretation step misreads regressions as wins. | +| **Mechanistic** | What's the proposed causal chain? | "Because users will see X and decide Y" is a mechanism. "We think it'll work" is not. Without a mechanism, the team can't tell when the metric they picked is actually downstream of the change. | +| **Bounded in time** | Does the predicted effect occur within a measurable window? | Day-30 LTV claims need a ≥30-day experiment. A 2-week test on a 30-day metric guarantees an inconclusive result on the real effect plus a high chance of reaching false significance from noise. | + +## When the user gives you a one-liner + +Ask them to commit to five things, in order. Don't proceed until you have all five. + +1. **The change** — what's different in treatment. A specific UI string, a routing change, a price, a copy variant. Vague ("the new onboarding") is not enough; "the new onboarding which moves the free-item offer to step 1" is. +2. **The primary outcome metric** — one specific event or rate, not a domain. "Engagement" is not a metric; "weekly active users with ≥1 report created" is. +3. **The expected direction** — up or down. (Goes straight into the metric's `direction` field.) +4. **The minimum effect size that would justify shipping** — this becomes the MDE. If the user can't name one, ask: "If the lift turned out to be 0.5%, would you ship?" Their answer reveals the MDE. +5. **The mechanism** — why you expect this to work. The mechanism is what binds the metric to the change. A change to onboarding screens shouldn't be measured by Day-30 retention if no one has gotten to Day 30 yet — the mechanism would say so explicitly. + +## Mechanism → metric class + +The mechanism predicts the _kind_ of metric that should move. Use this mapping as a sanity check: + +| Mechanism flavour | Likely primary-metric class | Anti-pattern | +| -------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- | -------------------------------------------------- | +| Reduces friction at a specific step | Step conversion rate (funnel-typed) | Headline retention metric | +| Surfaces a new option / increases discoverability | Click-through or first-use rate on the surfaced option (conversion) | Total events per user | +| Reorders information / changes salience | Time-to-task, completion rate on the salient step | Account-level revenue | +| Changes the cost of an action (price, paywall, friction) | Conversion-to-paid, refund rate, cancel rate (with `direction: "down"`) | DAU | +| Adds a new content / recommendation system | CTR on recommendations, downstream conversion | Aggregate engagement | +| Long-term retention play (referrals, loyalty) | Day-7 or Week-1 retention as leading proxy; lagging Day-30 stays a post-launch monitor, not a primary | Day-30 retention as primary on a 2-week experiment | + +When the user's mechanism and proposed metric live on different rows of this table, push back — that's the **hypothesis ↔ metric mismatch** pitfall. + +## Hypothesis ↔ metric alignment + +A hypothesis names a specific outcome. The primary metric must measure that outcome — **same population, same denominator, same timeframe**. Common misalignments: + +- Hypothesis predicts a **rate** change; primary metric is a **count** → switch to a rate metric, or use an exposure-rebalanced total. +- Hypothesis predicts effect on **paid users**; primary metric includes free users → add a cohort filter or scope the metric. +- Hypothesis predicts effect **within session**; primary metric is **per-user across sessions** → either narrow the metric or broaden the hypothesis. +- Hypothesis predicts effect **only on a new flow**; primary metric counts events that exist only in treatment → changed-denominator. The lift is artificially infinite. Pick a metric that exists for both arms. + +## When to push back + +Push back hard when: + +- The hypothesis is non-falsifiable. Until it can be tested with a yes/no answer from data, there's nothing to set up. +- The hypothesis is non-directional. The system's `direction: "up"` default is wrong for cancel / error / latency / abandon metrics; leaving it default silently flips polarity at interpretation time. +- The mechanism doesn't predict the proposed metric. Most "experiment didn't work because we measured the wrong thing" post-mortems trace back to here. +- The proposed primary is strongly lagging on the planned duration (retention as primary on a 2-week test). Suggest a leading proxy. + +When you push back, do it once with concrete language ("you said 'improve engagement' — which event do you want to move?"). If the user genuinely wants to leave the hypothesis vague, you can proceed, but log the vagueness in `description` so the post-launch step knows the test was exploratory rather than decisional. + +## Worked examples + +### ✅ Good + +> If we surface a free-item offer during onboarding step 2, then signup→activation conversion will increase by ≥3pp (currently 18%), because reducing first-action friction lowers cold-start dropout for new accounts. + +- Falsifiable: data can say "no, lift was <3pp." +- Directional: up. +- Mechanistic: first-action friction → cold-start dropout. +- Time-bounded: signup→activation is a within-session metric; readable inside any reasonable test duration. +- Mechanism predicts a conversion-class primary; signup→activation conversion fits. + +### ✅ Good (lagging hypothesis, leading proxy primary) + +> If we ship the new referral flow, then Day-30 retention will increase by ≥1.5pp, because referred users have stronger network effects. We will measure Day-7 retention as the experiment primary (historical correlation r=0.78 with Day-30) and keep Day-30 as a post-launch monitor. + +- Bounded-in-time problem is acknowledged and solved with a leading proxy. The lagging metric remains a post-launch check, not a ship gate. + +### ❌ Vague + +> Test the new onboarding. + +- No change description (which change? full redesign or one screen?). +- No outcome. +- No direction. +- No MDE. +- No mechanism. + +Coach: pull each of the five commitments out of the user before going further. + +### ❌ Non-falsifiable + +> The new dashboard will improve the user experience. + +- "Improve user experience" can't be tested. Ask: "Which specific behaviour changes if user experience is better? Engagement events per session? Time to first chart? Dashboards saved per user?" + +### ❌ Mechanism doesn't predict the metric + +> If we change the colour of the CTA button, then 30-day retention will increase by ≥2pp, because users will perceive the product as more polished. + +- Mechanism is plausible at best, but Day-30 retention is far downstream of a button-colour change. Even if the colour change does help, a 2-week experiment won't measure it. Either pick a leading proxy (click-through on the CTA) or shelf the test until you have a more credible mechanism for retention. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/lifecycle-handoff.md new file mode 100644 index 0000000..3a9e24c --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/lifecycle-handoff.md @@ -0,0 +1,39 @@ +# Lifecycle Hand-off + +How to conclude an experiment once the verdict is settled. This reference is **interpretation guidance** — the per-field schema of the decide action lives in the experiment-update tool description. + +--- + +## Confirm before concluding — always + +Concluding an experiment is **irreversible**. Before invoking the decide action, surface the proposed parameters to the user (winning variant, success/fail, rationale message) and wait for explicit confirmation. A SHIP verdict is a recommendation, not an authorization. + +## The three pieces every decide call needs + +A decide call expresses three things: + +1. **Did the experiment succeed?** A win for one of the treatments, or a deliberate stop. +2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special constants below. +3. **Why?** A rationale message — what metrics were evaluated, the polarity reading, the tradeoffs accepted. The platform requires this on every decide call; treat it as a one-paragraph decision record, not a placeholder. + +## Special variant choices for success + +When you have a winning result but no single variant to ship: + +- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. (The platform exposes this as the constant `__no_variant_shipped__`.) +- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The platform exposes this as `__defer_variant_decision__` and shows `SUCCESS_DEFERRED` in the UI.) + +When the verdict is KILL — no winner — record success as false. No variant key is needed in that case. + +## Multi-variant experiments + +For a 3+ arm test, the decide action still names a single winning variant. If two treatments are roughly tied: + +- If both clear the practical-significance bar and shipping either is acceptable, pick on simplicity (smaller diff from control, lower implementation cost). +- If the team genuinely cannot pick, use the defer constant above — better than fabricating a winner. + +A multi-variant test where only one treatment is significantly different from control is a clean SHIP for that variant; the inconclusive arms are simply not the winner. + +## After concluding + +The decision record — the rationale message, the shipped variant, and the experiment's terminal status — becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/metric-selection.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/metric-selection.md new file mode 100644 index 0000000..7088e3c --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/metric-selection.md @@ -0,0 +1,75 @@ +# Metric selection + +Each metric serves exactly one of three roles. The hypothesis tells you which. + +## Primary metrics (1–3 max) + +The metrics whose movement decides ship / no-ship. They come straight from the hypothesis's "outcome will ``" clause. + +- **Cap at 3.** Each additional primary inflates the family-wise false-positive rate. With multiple-testing correction enabled (which is the right default at 2+ primaries), more primaries → tighter per-metric threshold → harder to detect any individual effect. Beyond 3 the math punishes you regardless of how well the test is run. +- **Explicit direction.** Every primary needs `up` or `down`. The platform's default is `up`, which is wrong for cancel / error / latency / abandon / refund metrics. Setting it explicitly at setup time is the only way to keep the polarity correct through interpretation. +- **Leading, not lagging.** A primary must be able to actually move within the planned experiment window. Match the metric's response window to the experiment's duration: + - Onboarding-screen change → activation in the first session, not Week-4 retention. + - Checkout button A/B → checkout conversion, not 30-day LTV. + - Pricing-page tweak → click-through and trial start, not annualised revenue. + - When the only metric the team cares about is lagging, use a **leading proxy** with a known historical correlation to the lagging metric. The lagging metric stays a post-launch monitor, not a ship gate. +- **Prefer rates over counts** when the hypothesis is about behaviour change. "Conversion rate" is interpretable; "total conversions" conflates per-user behaviour with cohort size. + +If the user proposes a primary, sanity-check: + +- _Is this metric downstream of the change?_ (A pricing change cannot move "tutorial completion".) +- _Does the metric exist for both control and treatment users?_ If the change creates new events that don't exist in control, lift is artificially infinite (changed-denominator). +- _Is the metric's response window shorter than the experiment's duration?_ If not, the metric is lagging — pick a leading proxy. +- _Does the metric have enough volume to detect the expected lift?_ (Cross-reference `references/sizing.md`.) + +## Guardrail metrics (0+, strongly recommended) + +Metrics that **must not regress**, even if primaries win. The trustworthiness backstop on a ship decision: a 5% relative regression on any guardrail blocks ship even if the primary wins. This is the **>5% guardrail hard-gate**, and it's the most important single rule in the pitfall catalogue. + +Standard guardrails by domain — pick at least one from the row that matches the change: + +| Change targets… | Guardrail candidates | +| ------------------------------------ | ------------------------------------------------------- | +| Performance / UI / new client code | Page load time, API latency, error rate, crash rate | +| Engagement / activation / onboarding | Weekly active users, session count, Day-7 retention | +| Revenue / monetisation / pricing | ARPU, conversion-to-paid, refund rate, cancel rate | +| Trust / safety / moderation | Complaint rate, unsubscribe rate, support-ticket volume | +| Time-to-task / search / IA | Task abandonment rate, time-to-completion | + +For every guardrail, **set direction explicitly**. A guardrail named "errors" left at the default `up` will silently let regressions slip through interpretation as "wins." + +Same lagging-indicator rule applies: a guardrail that takes 30 days to react can't protect a 2-week experiment. If the user names retention or LTV as a guardrail on a short experiment, recommend a leading proxy (Day-1 or Day-7 retention) and demote the lagging metric to a post-launch monitor. + +## Secondary metrics (0+, diagnostic only) + +Metrics for understanding **why** the primary moved, not for the ship decision. Examples: funnel-step completions, feature sub-use rates, time-on-screen, exploratory cohort breakdowns. + +**Secondary metrics are not decisional.** Even if the user names a secondary in their hypothesis text, they cannot ship/kill on its result. If a metric matters for the decision, it must be primary or guardrail. + +> **Setup misconfiguration to flag.** If the user's hypothesis text names a metric that they then classify as secondary, ask: +> _"You mentioned `` in your hypothesis. Should this be a primary metric? Secondary metrics don't influence ship/no-ship decisions, so if it matters for the outcome, promote it."_ + +This is the **Hypothesis ↔ metric mismatch** pitfall — see [pitfalls.md](pitfalls.md). + +## Sanity checklist + +Run this before locking the metric set: + +- [ ] Each primary directly measures the hypothesis's predicted outcome. +- [ ] Each primary has an explicit direction (not the platform default). +- [ ] At least one guardrail covers the most likely failure mode of the change (perf for UI changes, retention for monetisation changes, etc.). +- [ ] Each guardrail has an explicit direction. +- [ ] No metric whose denominator is created by the treatment itself (changed-denominator). +- [ ] No primary or guardrail is a strong lagging indicator on the planned experiment duration (use leading proxies; demote lagging metrics to post-launch monitors). +- [ ] Total primary count ≤ 3. +- [ ] If primary count ≥ 2 OR non-control variants ≥ 2, multiple-testing correction is on (Benjamini-Hochberg default, Bonferroni for strict family-wise control). +- [ ] For each primary, baseline rate has been pulled from real data (not guessed). + +## Anti-patterns + +- ⛔ **No guardrails to "avoid noise."** Guardrails are the regression detection, not noise. Without them, a winning primary with a quietly regressing latency or refund-rate is a ship — and then a rollback two weeks later. +- ⛔ **Five primaries because "they're all important."** Past 3, the false-positive risk dominates. Pick the 1–3 the hypothesis actually predicts; demote the rest to secondaries. +- ⛔ **Primary = "total signups," metric = behaviour change.** A behaviour-change hypothesis needs a rate metric; total signups conflates per-user behaviour with the size of the cohort that entered the experiment. +- ⛔ **Guardrail left at default direction `up` on an error / cancel / latency metric.** Silently inverts the regression check. +- ⛔ **30-day retention as primary on a 2-week experiment.** Either the lagging metric can't move (no signal) or it moves on noise (false significance). Use a leading proxy. +- ⛔ **Primary metric only exists in treatment.** Changed denominator. Lift is meaningless. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/per-metric-interpretation.md new file mode 100644 index 0000000..e46381c --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/per-metric-interpretation.md @@ -0,0 +1,167 @@ +# Per-Metric Interpretation + +Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_ + +--- + +## The mental model + +Each row in `summary.positive` / `summary.negative` / `summary.no` answers four questions: + +1. **Did the lift go up or down?** — the `summary` bucket name (sign-of-lift, not polarity). +2. **Was the change distinguishable from noise?** — the `significance` field (or the bucket name itself: rows in `summary.positive` / `summary.negative` are significant, rows in `summary.no` are not). +3. **Was the change in the goal direction?** — apply the polarity recipe with `metric.direction`. +4. **Was the change big enough to matter?** — multiply `lift` by the control baseline `value` to get absolute impact, then judge against business context. + +A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any one of those and you're shipping the wrong thing. + +--- + +## Polarity recipe + +Apply the polarity recipe from the spine — see the **Components** section of `SKILL.md`. Treat the bucket name in `summary.positive` / `summary.negative` as sign-of-lift only; the business verdict comes from combining it with `metric.direction`. Examples worth remembering: + +- A row in `summary.positive` with `direction: "down"` is a **regression**. +- A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1% interstitials_shown` lift means less interruption). + +--- + +## Reading the p-value in this platform + +Mixpanel runs a frequentist comparison at the experiment's configured confidence level — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). + +The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread. + +For the general meaning of a p-value (the probability under the null), trust the model's baseline knowledge — don't invent thresholds in either direction. + +--- + +## Reading the lift correctly + +``` +lift = (treatment_mean - control_mean) / control_mean +``` + +- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct. +- If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect." + +--- + +## Verdict phrasing — a small palette + +Pick the phrase that matches the four-question pattern. These are the words to use with users; they map onto the platform's already-computed numbers, so the agent never has to invent thresholds. + +| Pattern (sig × polarity × magnitude) | Plain-language verdict | +| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Significant, polarity positive, magnitude large vs baseline | "**Clear win** — `` moved `` in the goal direction, which is meaningful at this baseline." (apply Twyman's Law if lift > ~30%) | +| Significant, polarity positive, magnitude small vs baseline | "**Statistically significant but practically small** — `` on a `` baseline is ``; confirm with the user whether that clears the business bar." | +| Significant, polarity negative | "**Regression** — `` moved `` against its goal direction. This is a reason not to ship even if other primaries won." | +| Not significant, lift in goal direction, well-powered | "**Likely no effect at the detectable size.** The experiment had enough power to detect ``; the observed lift is below that threshold." | +| Not significant, lift in goal direction, underpowered | "**Inconclusive — too underpowered to call.** Route to the why-no-statsig playbook to decide between wait / extend / restart." | +| Not significant, lift in wrong direction | "**No detectable harm**, but no win either." | +| `lift is None` | "**No measurement** — this variant's row failed to compute. Surface the failure and re-sync." | +| Lift > ~30% on any metric | Prefix with "**Twyman's Law check:** that lift is unusually large; verify the denominator hasn't changed before celebrating." | + +--- + +## Magnitude — make it absolute + +Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful: + +1. Baseline from the control variant's metric value (the experiment-details response carries it on the per-variant row). +2. Lift from the winning row. +3. Absolute lift: `baseline × lift`. Examples: + - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate. + - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`. +4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week." + +### Fallback when the baseline value or sample size is missing + +Common — happens whenever live computation timed out or the cached results were nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.** + +Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation: + +- `unique` (Bernoulli) → conversion **rate** as the baseline. +- `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. Multiplying lift by a raw total double-counts cohort size. + +--- + +## Twyman's Law in practice — changed-denominator lifts + +Before celebrating any lift > ~30%, ask: **did the treatment change who is _exposed_ to this metric, not just how they behave?** + +If the treatment causes more users to _see_ a screen, more events naturally fire — the metric grows because the denominator changed, not because per-user behavior changed. + +- A "Free item" promotion drives more users to checkout → "Checkout Screen Viewed" lifts +1000% mechanically. The interesting question is **conversion rate on the screen**, not raw views. +- A new banner makes a feature discoverable → "Feature Page Viewed" lifts dramatically. **Per-discover-er behavior** may be unchanged. + +When you see a > 30% lift, name the risk explicitly: + +> _"This metric measures exposure to the screen/event. The treatment likely caused more users to be exposed; that explains most of the lift mechanically. The interesting question is what those users did once they got there."_ + +--- + +## Metric distribution types + +Different metric types behave differently; cite the relevant nuance in your verdict. + +| Metric type | Distribution | Interpretation nuance | +| -------------------------------- | ------------ | --------------------------------------------------------------------------------------------------------- | +| Unique users / conversion rate | Bernoulli | Variance = `p(1−p)`. Lift on rates near 50% is most powered; rates near 0% or 100% need much more sample. | +| Event counts / sessions per user | Poisson | Variance = mean. Highly sensitive to power users; consider whether one heavy user can swing results. | +| Revenue / numeric properties | Gaussian | Long tails (whales) inflate variance. Strongly consider Winsorization. | + +--- + +## Variance-reduction & outlier settings that change interpretation + +- **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). +- **Winsorization enabled**: extreme values capped at the configured percentile, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A percentile much lower than the platform default (typically 95) is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). + +--- + +## Multiple comparisons & metric tiers — what's decisional and what isn't + +| Tier | How it influences the verdict | +| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Primary** | **Decisional.** The platform auto-applies correction when the experiment is configured for Bonferroni or Benjamini-Hochberg (across primaries × variants). | +| **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude. | +| **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. | + +If multiple-testing correction is off AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled. + +--- + +## When a primary metric is inconclusive + +A "not significant" verdict means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level — **not that there is no effect.** Important when the user is about to call something a null result. + +For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see [why-no-statsig.md](why-no-statsig.md). + +--- + +## Frequentist vs Sequential — what affects per-metric reading + +Concluding a Frequentist experiment before it reaches its configured target is a peeking event — per-metric significance verdicts become unreliable. Sequential experiments are designed for continuous monitoring and don't have this problem. + +For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of [health-check-interpretation.md](health-check-interpretation.md). + +--- + +## Triggered analysis & dilution + +If the change only affects a subset of users (e.g. only triggers when a specific button is shown), the **effect on triggered users** is much larger than the **effect on the full exposed population**. + +- Triggered analysis zooms in on users who actually saw the change. +- Dilution math: `population_lift = triggered_lift × (triggered_users / total_exposed)`. + +The platform doesn't auto-compute triggered analysis. If the change is gated by a condition, ask the user about the trigger rate and walk through the math before declaring the population-level lift "small." + +--- + +## Novelty and primacy + +- **Novelty** — lift is large early, then decays as users habituate. +- **Primacy** — lift is small or negative early, then grows as users learn the new behavior. + +To detect either, look at the line-chart view of the metric (date-segmented). A monotonic decay from day 1 → day 14 is classic novelty; the steady-state lift is what matters for shipping. Call this out when interpreting any experiment shorter than ~2 weeks. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/pitfalls.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/pitfalls.md new file mode 100644 index 0000000..5ce40a1 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/pitfalls.md @@ -0,0 +1,93 @@ +# Pre-launch pitfalls + +Catalogue of the deterministic checks to run before the user creates an experiment. Detection logic lives in the platform's pre-launch validation capability; this document owns the prose — the _why_ behind each check — so the agent can explain the violation in human terms rather than just nagging. + +## Triage order + +Surface pitfalls in this order: + +1. **Blockers first.** An experiment that triggers a blocker should not launch as-is. Two today: **insufficient duration** for the configured MDE on available traffic, and **cohort too small** to supply enough eligible users. Both mean the experiment literally cannot reach statistical power. +2. **Warnings next.** Configuration smells that would degrade interpretability or trustworthiness. Most pitfalls fall here. +3. **FYIs last.** Soft nudges; not blocking even if the user ignores them. + +Within a severity tier, surface in this order (most actionable first): data-trust risks (pre-experiment bias, variance inflation) → configuration nudges (guardrails, hypothesis alignment). + +## The >5% guardrail hard-gate + +The single most important rule in the catalogue. **A 5% relative regression on any guardrail blocks ship even if the primary wins.** + +### Why 5% + +The threshold is calibrated to be tight enough to catch real degradations of user experience, revenue, or performance, and loose enough that day-to-day noise on a moderately-volatile guardrail doesn't trip it on every test. + +- Below 5%: typically within the noise band of most guardrails on a 2-week test. Tightening below 5% would generate too many false alarms. +- Above 5%: the team has implicitly traded measurable user/revenue/performance damage for headline-metric lift. That's not a ship — that's a re-design. + +### Why "hard gate" + +Guardrails are not "things to also look at." They are the **trustworthiness backstop**. A winning primary with a regressing guardrail means the change _exchanged_ something the team agreed must not regress for the headline-metric lift. If guardrails are negotiable, they aren't guardrails. + +### Why explain it to the user + +The most common reaction to a guardrail regression is "but the primary won, can't we just ship?" The agent's job is to make the trade-off explicit: + +> "Primary metric `` won by +2.3pp, but guardrail `` regressed by 7.4%. The 5% threshold exists because guardrails are the trustworthiness backstop — a winning primary with a regressing guardrail means you've traded `` for ``, which is a design choice that needs explicit sign-off, not a ship decision." + +If the team genuinely wants to make that trade, they can disable the guardrail before launch and document the decision in the experiment's description. Don't let them silently override; force the conversation. + +--- + +## The catalogue + +### Insufficient duration for the configured MDE — blocker + +**Expected exposures over the planned window cover less than 50% of the required per-arm sample.** The experiment cannot reach statistical power for this MDE no matter how clean the rest of the config is. The most likely outcome is "inconclusive," and a non-trivial fraction of those inconclusive results will be noise crossing the significance threshold rather than a real effect (the winner's-curse problem). Extend planned duration to cover the required sample, OR relax the MDE (only ship if the lift is bigger), OR pick a higher-volume primary metric, OR enable CUPED if pre-exposure data is available (cuts required sample 30–70%). + +### Cohort too small — blocker + +**Eligible cohort size is smaller than (number of arms × per-arm target).** Same root cause as the duration blocker, different lever. Even with infinite time, the experiment will run out of eligible users. Either expand the cohort to comfortably exceed (number of arms × per-arm target) eligible users (relax filters, broaden segment, extend eligibility window), or lower the per-arm target to what the cohort can supply (and accept the larger achievable MDE). + +### Pre-experiment bias likely — warning + +**Retro A/A is enabled, at least one continuous-ish metric (continuous, retention, or funnel) is configured, AND CUPED is off.** Pre-experiment bias is likely on metrics with seasonality or power-user skew. Without CUPED to absorb the baseline difference, post-experiment lifts inherit it — the team sees "treatment up 2%" when the real treatment effect is 0% and the baseline difference is +2%. Enable CUPED with a 2–4 week pre-exposure window; it specifically regresses out the pre-exposure baseline difference. + +### High variance, no Winsorization — warning + +**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the platform default percentile (typically 95). Push back if the user sets a percentile below ~80 — more than 20% of values capped is almost always a misconfiguration. + +### Multiple primaries, no correction — warning + +**≥2 primary metrics configured AND multiple-testing correction is off.** Family-wise false-positive rate compounds with each additional primary: at 3 primaries ~14.3%, at 5 ~22.6% — more than one in five "wins" is noise. Enable multiple-testing correction. Default to Benjamini-Hochberg (more powerful with correlated metrics); use Bonferroni for strict family-wise error control. + +### Marginally underpowered duration — warning + +**Expected exposures cover 50–100% of the required per-arm sample.** The experiment might reach significance on a true effect; it might not. Either way, the lift estimate at conclusion will be wider than expected. Extend duration to reach 100%+ of the required sample, or accept the higher Type-II error rate. Less urgent than the insufficient-duration blocker. + +### Missing guardrails — warning + +**Zero guardrail metrics configured.** Without guardrails, there's no >5% hard-gate to block a ship on a regression. The team is implicitly trusting that the primary captures every relevant impact — rarely true. Add at least one guardrail covering the most likely failure mode of the change: + +- UI change → page-load time or error rate. +- Monetisation / pricing → cancel rate or refund rate. +- Engagement change → Day-7 retention or session count. +- Performance change → error rate or crash rate. + +### Hypothesis ↔ metric mismatch — warning + +**The hypothesis mentions a canonical metric noun (conversion, retention, revenue, signup, engagement, click, purchase) but no primary's name appears to measure that outcome.** Soft signal — the heuristic is coarse, but it catches the common case where the user wrote "X will increase conversion" and then set the primary to "session count." Phrase as a question, not a verdict: _"Your hypothesis mentions ``, but no primary metric name suggests it measures that. Should `` be replaced or supplemented with a metric that more directly tests the hypothesis?"_ + +### Primary lacks a leading indicator — warning + +**Primaries include a retention-type metric AND no leading-indicator secondary (conversion or funnel type) is configured.** A retention primary is valid but reads slowly — there may not be enough signal to interpret results before the experiment concludes. Add a leading-indicator secondary measured within the experiment runtime; the retention primary stays as the ship decision, the secondary just gives early visibility. + +--- + +## Detection vs prose + +The detection math lives in the platform's pre-launch validation capability. The prose lives here. The platform reports which check fired; the agent renders the human-readable message. When the platform's thresholds change (e.g., the 50% / 100% bounds on the underpowered checks), the recommendation language in this document needs to track — the agent will quote stale numbers otherwise. + +## What's not in the catalogue (yet) + +- **Cross-test contamination** — when the same users are eligible for multiple concurrent experiments on the same surface. Hard to detect statically; usually surfaces as anomalous variance at interpretation time. +- **Novelty effect detection** — early days of the experiment show inflated treatment effect, then settle. Not a pre-launch check; lives in the post-launch interpretation skill. +- **Seasonality misalignment** — running a 2-week experiment that doesn't align to weekly cycles. Today this is detected indirectly via the duration check; a future explicit seasonality-alignment check is a reasonable add. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/prior-experiments.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/prior-experiments.md new file mode 100644 index 0000000..60d083c --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/prior-experiments.md @@ -0,0 +1,81 @@ +# Prior experiments + +The first thing to do when a user proposes an experiment on a feature is look up prior experiments on that feature. Skipping this leads to redundant tests, contradictory ship decisions, and wasted traffic. + +## The lookup + +Search the project's prior experiments by keywords drawn from the feature name and surface area. Cast the net wide on the first call — single-keyword searches catch related experiments the user may have forgotten about. + +If no prior-experiments lookup is available in the current environment, tell the user explicitly that you couldn't check and proceed. Don't fabricate "no prior tests found" — that's worse than admitting the blind spot. + +## What to do with what you find + +### Same feature already tested and shipped + +Reference the prior result before recommending a new test. The right answer is often "don't re-run; iterate on a new hypothesis." + +> "There's a prior experiment from [date] on the same feature with a similar hypothesis: it shipped at +X% on metric Y. Re-running won't tell us anything new. What's different about the change you're proposing? Is the new hypothesis about a different sub-population, a different metric, or a different mechanism?" + +If the user does want to re-run (e.g., the population has shifted significantly, the underlying product has changed, or the prior test was clearly underpowered), proceed — but design the new test to specifically address what's different from the prior. + +### Same feature tested and killed + +Treat this as a strong prior. Ask why the user thinks the new variant will work where the prior didn't. + +> "Prior experiment [date] on the same surface killed at [-X% / inconclusive]. What's different about your change that should produce a different outcome? If the prior failed because of [mechanism], does your change address that?" + +If the user can articulate a different mechanism, run the new test. If they can't, the most likely outcome is a repeat of the prior result — discourage the test or downgrade its priority. + +### Earlier iteration of the same hypothesis + +Use the prior result to inform the new design — specifically, **baseline rates and variance estimates**. Prior data is much more reliable than guessing. + +- Pull the prior's control-arm baseline rate; use it as the baseline for the new sizing calculation. +- Pull the prior's observed variance; use it instead of estimating from scratch. +- Pull the prior's exposure rate (exposures per day per variant); use it to set a realistic duration estimate. + +This often shrinks the required sample size or shortens the planned duration. Both are wins worth surfacing. + +### Recently concluded with similar metrics + +Pull the realised exposure rate. The "expected exposures per day" the user has in mind is usually higher than what actually shows up in a real experiment on the same surface — eligibility filters, opt-outs, and bot exclusion all bite. Use the prior's actual rate, not the theoretical one. + +### Multiple prior experiments on adjacent surfaces + +Look for **patterns**, not single data points. If three prior tests on the same funnel stage all moved in the same direction by similar magnitudes, that's the realistic prior for what the new test will do. If the prior tests are noisy or contradictory, treat the new test's expected lift with more uncertainty and consider running it longer. + +## Folding prior results into the new design + +Concretely, when you have a prior result that's relevant, the setup workflow changes as follows: + +| Step | Without prior | With prior | +| -------------------------- | ----------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------- | +| Step 1 — hypothesis | Coach from scratch | Anchor on the prior's hypothesis; ask what's different | +| Step 2 — metric selection | Suggest standard primaries/guardrails | Use the prior's metric set as the default; modify only with reason | +| Step 3 — sizing | Query baseline + variance over the prior window | Use the prior's observed baseline and variance | +| Step 4 — statistical model | Default to sequential / benjamini-hochberg | If the prior used a specific model and the team is comparing across tests, keep the same model for comparability | +| Pitfall check | Run the standard catalogue | Cross-reference: did the prior have an SRM problem? A guardrail regression that should be set up as primary this time? | + +## When prior tests warn you away from testing at all + +Sometimes the prior data tells you the right answer is **don't run the experiment**: + +- The metric the user wants to move has been tested 4 times on this surface in the last year, all with inconclusive or null results, all adequately powered. The hypothesis-space is likely exhausted; suggest a different mechanism or a different surface. +- The baseline rate is so low that even the prior, well-powered tests couldn't detect anything below a 30% relative lift. The new test would inherit the same constraint. Either pick a higher-volume proxy metric or accept that the change has to be very large to be detectable. +- Recent guardrail regressions on the same surface suggest the surface is unstable; running more experiments without first fixing the trust issue is wasted traffic. + +Surface these findings as recommendations, not blockers. The user might have context the prior data doesn't capture. + +## What to record about the new design's relationship to prior tests + +In the experiment's description, link to the prior experiment(s) and note how the new design differs. This becomes critical at interpretation time — the post-launch step uses the prior context to calibrate its read of the new result. + +A useful template: + +``` +Prior: tested on , result: . +This experiment differs by: . +Inherited from prior: baseline rate (X%), σ², exposure rate (N/day/variant). +``` + +This is a 30-second annotation that pays back tenfold at analysis time. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/routing-xp-vs-ff.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/routing-xp-vs-ff.md new file mode 100644 index 0000000..d077dba --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/routing-xp-vs-ff.md @@ -0,0 +1,85 @@ +# XP vs FF: routing intent + +Before any setup work, decide whether the user actually wants an **experiment** (XP) or just a **feature flag** (FF). The decision is binary, but the language users use is blurry — "let's A/B test this" sometimes means "let's run a controlled experiment with a hypothesis and a stopping rule," and sometimes means "I want to ship it to 10% of users and see if anything breaks." + +## The discriminator + +| If the user wants… | Then it's a… | +| --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ | +| Causal evidence — "does this change move metric X by enough to justify shipping?" | **Experiment** (XP). | +| Progressive rollout — "ship to 10%, then 50%, then 100% if nothing breaks." | **Feature flag** (FF). | +| Kill-switch — "I want to be able to turn this off instantly if it goes sideways." | **Feature flag** (FF). | +| Per-segment gating — "only show this to enterprise customers." | **Feature flag** (FF). | +| Targeted access — "give beta access to these 50 design partners." | **Feature flag** (FF). | +| Both — "ship to 10%, but also tell me if it moves checkout conversion." | **Experiment** with a phased rollout, or **FF + a separate experiment** later. | + +The clean way to think about it: a feature flag is a **delivery mechanism**. An experiment is a **decision mechanism** built on top of one. Every experiment uses a feature flag under the hood (Mixpanel auto-creates one when an experiment is created); not every feature flag use case needs an experiment. + +## Disambiguation prompt + +When you can't tell from the user's wording, ask once, plainly: + +> "Are you trying to **measure** whether this change moves a metric (experiment), or are you rolling it out gradually / behind a flag with **no measurement criterion** (feature flag)? An experiment commits to a hypothesis, metrics, and a stopping rule; a feature flag is purely a delivery mechanism." + +Listen for these signals in the answer: + +- "I want to see if it improves X" / "if checkout conversion goes up" → experiment. +- "I want to make sure it doesn't break X" → could be either. Probe: "Is 'doesn't break' a measurable threshold, like a guardrail, or is it 'I'll watch dashboards and roll back if it's obviously bad'?" +- "I want enterprise to get it first" / "I want to roll out by region" → feature flag. +- "I just want a kill switch" → feature flag. +- "I want to ship it and prove ROI later" → ask whether the proof needs to be causal. If yes, that's an experiment, and it should be set up _before_ shipping, not after. (Post-hoc ROI claims from a flag rollout are not credible.) + +## Common ambiguous cases + +### "Ship to 10% as an experiment" + +Often this means "phased rollout, monitor metrics, ramp if nothing regresses." That's a feature flag with manual ramp logic, not an experiment. + +Ask: "Do you have a primary metric you're committing to before launch, with an MDE that decides whether to ship to 100%?" If yes, run as an experiment. If no, ship as a flag. + +### "I want to test the new pricing on enterprise customers" + +If "test" means "see how they react and decide whether to roll out," and the audience is small (a few enterprise customers), that's a **rollout**, not an experiment. Enterprise samples are usually too small to power an experiment, and the per-account variance is too high for a meaningful aggregate. + +Run as a flag, gather qualitative feedback, and decide based on the conversations — not on a p-value computed from N=4. + +### "Hold out a control while we ship to 100%" + +This is the classic "holdout experiment." Legitimate use case, but it has to be set up as an experiment up front (with a primary metric and a duration), not retroactively. After-the-fact holdout analysis suffers from selection bias and is not credible. + +If the user has already shipped to 100% and wants to "analyse the effect," there is no experiment to set up. Tell them so, and suggest a forward-looking test on the next change to the same surface. + +### "Just give me an A/B test, the simplest one" + +Probably an experiment. But "simplest" usually means "skip hypothesis, skip MDE, skip guardrails," which kills the test's interpretability. Coach the user through Step 1 (hypothesis) and Step 2 (metrics) of the main workflow — the cost is 10 minutes; the value is having a result you can actually act on. + +### "I want a feature flag but with stats" + +Now you're back to an experiment. Run the full setup workflow. + +## What changes once you've routed + +### If experiment + +Continue with the four-step setup workflow in the main `SKILL.md`. The output of this skill is a configured experiment ready to launch. + +### If feature flag + +This skill stops. Hand off to the `manage-feature-flags` skill: + +- They configure variants, targeting, and rollout percentages directly. +- No hypothesis, no MDE, no stopping rule needed. +- Mixpanel doesn't compute lift or significance on a flag — they're on their own for observation. + +Make sure the user understands the trade-off explicitly: "Choosing flag means you give up the ship/no-ship decision criterion. If later you want to claim the change worked, that claim won't have the same evidentiary weight as a properly-designed experiment." + +## Don't run an experiment when + +There are cases where an experiment is technically possible but the wrong move: + +- **Sample is too small.** Enterprise rollouts to ~10 accounts cannot power a real test. Ship as a flag and use qualitative feedback. +- **Treatment is risky/irreversible.** A real billing change with potential refunds shouldn't run as a 50/50 split — phase as a flag with conservative rollout and direct monitoring. +- **No baseline data.** Brand-new metric, brand-new feature, no historical observation. Run a 1–2 week passive observation period first, then design the experiment from real numbers. +- **Hypothesis is "let's see what happens."** No directional commitment means the test will be interpreted post-hoc, which is the same as not running an experiment. + +Suggest the alternative explicitly so the user doesn't feel rejected — "this isn't an experiment-shaped problem; here's what to do instead." diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/segment-breakdown-interpretation.md new file mode 100644 index 0000000..98c7bbc --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/segment-breakdown-interpretation.md @@ -0,0 +1,99 @@ +# Segment-Breakdown Interpretation + +Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. + +--- + +## The mental model + +A segment breakdown asks: _did the treatment affect different user segments differently?_ It has three possible outcomes per segment: + +1. **The segment moved in the same direction as the overall effect**, with similar magnitude → reinforces the overall verdict; nothing new. +2. **The segment moved much more or less than overall**, but in the same direction → heterogeneity; the effect is concentrated in a subset. +3. **The segment moved in the _opposite_ direction** to overall → Simpson's paradox or a real reversal — this is where segment analysis earns its keep. + +Reading a segment breakdown well means recognizing which of those three you're looking at and not mistaking noise for any of them. + +--- + +## Per-segment polarity recipe — apply per row + +The same recipe from the per-metric reference applies _inside_ each segment. Don't take a shortcut. + +- For each segment × metric × non-control variant, look at the row's `lift` and bucket (positive/negative/no). +- Translate sign-of-lift into business polarity using `metric.direction`. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary. +- Filter out the control row in each segment. + +Surprisingly easy to forget when you're scanning a wide table — re-apply polarity per row. + +--- + +## Sample-size floor per segment + +Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. The platform surfaces an "insufficient exposures" flag at the overall level — trust that signal over a hand-rolled threshold, and apply the same logic per segment. + +- Segments the platform would flag insufficient if scoped to alone → mark "insufficient sample, treat as directional only." +- A "significant" lift on a tiny per-variant segment (e.g. tens of users) is almost always noise. Say so. +- If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice. + +--- + +## Heterogeneity vs Simpson's paradox vs noise + +| What you see | Interpretation | +| --------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Most segments lift positive, one or two negative, all with overlapping CIs | **Noise.** Not heterogeneity. Don't ship a segment-specific story. | +| One segment lifts much more than the rest, with a tight CI and a clear mechanism | **Real heterogeneity.** The change is concentrated in that segment. Consider shipping only to that segment, or revising the hypothesis. | +| Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. | +| Two opposite-direction effects in different segments that roughly cancel overall | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses. | + +When you spot Simpson's paradox, route the user to the **SRM** section of [health-check-interpretation.md](health-check-interpretation.md) — bucketing is usually the cause, not a real reversal. + +--- + +## What a "ship only to segment X" recommendation requires + +Don't recommend a segment-scoped ship unless **all** of these hold: + +1. The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it). +2. The segment's per-variant sample clears whatever exposure floor the platform applies to the overall experiment, by a comfortable margin. +3. The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment. +4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product. +5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply. + +Otherwise, the segment-only ship is a post-hoc story dressed up as a decision. Recommend confirming with a follow-up experiment scoped to that segment. + +--- + +## When a segment loses but overall wins + +This is the everyday case of mixed effects. + +- If the losing segment is small and its absolute hit is acceptable, ship to all — but call out the loser in the rationale. +- If the losing segment is large or has a guardrail regression, recommend iterate, not ship. +- If the losing segment is a regulated / strategic cohort (paying tier, top customers, EU), default to iterate — guardrails on the cohort, not just overall. + +--- + +## What NOT to do + +- ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition. +- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment query fallback — they're not corrected unless the platform did it. +- ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal. +- ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism. +- ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence). + +--- + +## Output shape + +1. **One-sentence segment-level summary** — homogeneous, heterogeneous, or Simpson's-suspicious. +2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered). +3. **What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's." +4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating). + +--- + +## Platform support status + +Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the rules above. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/segment-of-interest-selection.md new file mode 100644 index 0000000..4db49ac --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/segment-of-interest-selection.md @@ -0,0 +1,116 @@ +# Segment-of-Interest Selection + +Pick 3–5 segments **likely to reveal a real effect difference** before slicing every available dimension and ending up p-hacking. + +The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them. + +--- + +## Why this matters: the fishing-expedition problem + +If you slice an experiment by every available property (10 platforms × 20 countries × 5 plan tiers × …), you will find "significant" segment-level effects by chance alone. The family-wise false positive rate explodes the same way it does for too many primary metrics — except there's usually no platform-level correction across segments. **Pre-committing to a small set of segments, ordered by hypothesis-driven probability, is the discipline that makes segment analysis credible.** + +Aim for 3–5 segments, max. If the user wants more, ask which ones are connected to the hypothesis and which are exploration. Mark the exploration set as "hypothesis-generating, not decisional." + +--- + +## The decision tree for picking segments + +Walk through these in order. The first match is the most defensible pick. + +### 1. Segments the hypothesis explicitly names + +If the experiment's `hypothesis` (or `description`) text mentions "new users", "mobile", "Pro tier", "EU customers" — those segments are pre-committed by the experiment design. Always include them. + +Look at: + +- `experiment.hypothesis` +- `experiment.description` +- The setup-side conversation, if present + +These are not exploratory; they're the variables the team committed to test. + +### 2. Segments where the mechanism is expected to matter + +The hypothesis names _what_ the change is and (ideally) _why_ it should work. The "why" tells you which user attributes plausibly moderate the effect: + +| Hypothesis mechanism | Segments likely to moderate the effect | +| ------------------------------------------------- | -------------------------------------------------- | +| "Reduces first-time friction in onboarding" | New vs returning; signup source; locale | +| "Improves discoverability of feature X" | Users who previously used X vs not; tenure | +| "Speeds up a slow flow" | Platform (mobile slower than web); connection type | +| "Lowers payment friction" | Plan tier; payment-method type; geography | +| "Replaces a confusing UI element" | New vs returning (returning users habituated) | +| "Surfaces a feature only relevant to power users" | Engagement-tier cohorts; tenure | +| "Localized copy / pricing change" | Country / language | + +If you can't articulate _why_ a segment should respond differently, it's not a hypothesis-driven slice. Demote it. + +### 3. Segments where the **denominator** plausibly differs + +Some properties don't change _behavior_ but change _who gets exposed_. Slicing on these helps catch changed-denominator artifacts before they're called a win. + +- Triggered vs untriggered cohorts (if the treatment only fires on certain pages). +- Platform / app version (the treatment may only ship on a subset of clients). +- Device class (mobile vs desktop) when the change is platform-specific. + +A 1000% lift in `Checkout Screen Viewed` overall usually disappears once you condition on "users who reached the checkout funnel" — that disappearance is the finding. + +### 4. Segments where SRM or baseline shift is suspected + +If overall SRM is borderline (or failing in one variant only), per-segment SRM can localize the bucketing bug to a specific platform / country / cohort. Examples: + +- iOS vs Android (often the SDK bucketing layer differs). +- Bot-suspicious countries (`bot_traffic` cause from health-check). +- A specific app version range that shipped a flag-evaluation change. + +This is diagnostic segmentation, not interpretation segmentation. Use it when the **trustworthiness gate** has already flagged trouble. + +### 5. Segments the platform de facto requires + +Some user dimensions are so foundational that any results report should mention them once: + +- **Platform** — web vs iOS vs Android. +- **New vs returning** — defined as first session within the experiment window vs before. +- **Geo region** — EU vs US vs APAC, when results meaningfully differ by regulatory or payment context. + +Don't include all three blindly — pick the one(s) most likely to vary given the change. + +--- + +## Sanity checks before committing to a slice + +For each segment you want to break down on: + +1. **Does each segment value have enough exposed users per variant to clear the platform's overall sufficiency threshold?** Below that, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment. +2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis. +3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. "Pro tier" boundaries changed mid-test) invalidate the comparison. +4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes — slicing on them is selection-bias, not stratification. + +--- + +## How many slices to commit to + +| Situation | Number of slices | +| ----------------------------------------------------------------- | ------------------------------- | +| Hypothesis-driven, well-powered, decisional | 3–5 segments, named upfront | +| Exploratory ("anything weird?"), flagged as hypothesis-generating | Up to ~10, with explicit caveat | +| Diagnostic (chasing a failing SRM or strange overall result) | Whatever helps localize the bug | + +If the user wants to "just look at everything", push back: pick the top 3–5 with reasoning, then offer a separate exploratory pass that won't be used for the ship decision. + +--- + +## The pre-commit ritual + +Before running the breakdowns, tell the user something like: + +> _"Based on the hypothesis (``), I'd slice by `` and `` because ``. I'm intentionally not slicing `` because they don't connect to the proposed mechanism — looking at every dimension makes false positives almost guaranteed. We can do an exploratory pass after, separately from the ship decision. Sound right?"_ + +Pre-commitment is what separates "segmentation analysis" from "fishing." + +--- + +## Then read the results + +Once the segment breakdown is in hand, switch to [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/session-replay-analysis.md new file mode 100644 index 0000000..7282bb4 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/session-replay-analysis.md @@ -0,0 +1,109 @@ +# Session-Replay Analysis Guidance + +Turn a quantitative experiment result into a behavior story using session replays. + +> **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. + +--- + +## When replays help, when they don't + +| Question | Replays help? | +| ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | +| "Why is conversion lower in treatment?" | Yes — behavior diff is observable. | +| "Why is `Checkout Screen Viewed` 10× higher in treatment?" (changed-denominator suspect) | Yes — replays show whether users are _bouncing_ or _converting_ after they get there. | +| "Why is `time_on_page` higher in treatment?" | Yes — distinguishes engaged reading vs confused dwell. | +| "Is the treatment shipping a regression on iOS only?" | Sometimes — better answered first by segment breakdown. | +| "Why is SRM failing?" | No — replays don't show bucketing. Go to health checks. | +| "What's the lift?" | No — replays are qualitative; they explain _why_, not what. | +| "Why hasn't this hit statsig yet?" | No — that's a sample/power question, not a behavior question. | + +A useful heuristic: replays answer _behavioral_ questions. If the question isn't behavioral, replays will burn time without adding signal. + +--- + +## Cohort selection: which replays to compare + +You're looking for **paired contrast**, not a random sample. Pick the cohort that maximizes signal for the specific question. + +| Question | Cohort A (replays to pull) | Cohort B (replays to pull) | +| -------------------------------------------------------------------- | ---------------------------------------------------------- | ----------------------------------------------------------- | +| Why is primary metric down in treatment? | Treatment users who **failed** the primary action | Control users who **succeeded** at the primary action | +| Why is a guardrail regression appearing? | Treatment users who **triggered** the guardrail negatively | Control users who did NOT trigger it | +| Why does treatment have a huge lift in `Screen Viewed` (denom shift) | Treatment users who reached the screen | Same users, looking at whether they completed the next step | +| Why is engagement higher / lower in a specific segment? | Treatment users in that segment | Control users in the same segment | +| What does the new UI look like in practice? | Any treatment users who saw the change | Any control users to confirm the baseline UI | + +**Aim for ~5 replays per cohort.** Fewer and you're anecdote-shopping; many more and you'll just confirm what the first 5 already showed. If the first 5 are inconclusive or contradictory, pull 5 more before changing tactics. + +Filter by recency — replays from the most recent days of the experiment best reflect steady-state behavior (avoid novelty / primacy noise). + +--- + +## What to actually watch for + +Go in with a hypothesis from the quantitative result. Don't watch replays blank-eyed; you'll see "users using the app" and learn nothing. + +### Friction / failure patterns + +- **Hesitation** — long pause before clicking a key element (often signals confusion). +- **Misclicks** — clicking non-interactive elements, or rage-clicking a button that didn't work. +- **Form abandonment** — typing into a field, then leaving without submitting. +- **Back-button bounce** — landing on the page, then immediately backing out. +- **Scroll-and-leave** — scrolling without engaging, then exiting. + +If treatment has more of these than control, you have a behavior explanation for a primary loss or guardrail regression. + +### Layout / discoverability issues + +- **CTA below the fold** — users never scrolling to where the new button is. +- **Element overlap on mobile** — the treatment looks fine in desktop testing but breaks on small screens. +- **Hidden state** — a tooltip / modal that fires once and is then gone, so the user never sees the key affordance. + +These usually explain segment heterogeneity (loss concentrated in mobile, or in a specific viewport size). + +### Changed-denominator behavior + +If you're investigating a Twyman's-Law-sized lift, look for: + +- **Users landing on the new screen and immediately leaving** — explains the inflated `Viewed` event without explaining real conversion. +- **Users completing the rest of the funnel at a much lower rate per-arrival** — explains why the headline metric grew but downstream metrics didn't follow. + +If treatment users _arrive_ at a screen more often but _complete_ at a lower per-arrival rate, the "lift" is a denominator artifact and the per-converter behavior is the real story. + +### Variant-specific UI issues + +- **Treatment showed the wrong copy / wrong asset** — surprisingly common; treatment shipped, but to a subset of routes only. +- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to [health-check-interpretation.md](health-check-interpretation.md). +- **Treatment fired twice / persisted state across sessions** — implementation regression. + +--- + +## How to frame the findings + +Replay analysis is qualitative. Be honest about that. + +- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in the experiment's results."_ +- ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict. + +Tie observations back to specific quantitative results from the experiment-details response. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior. + +--- + +## What NOT to do + +- ❌ Use replays to override a clear quantitative verdict. If primaries say "ship" and replays look ugly, the ugliness might be edge cases — confirm with segment analysis first. +- ❌ Cherry-pick a single dramatic replay. n=1 is anecdote. +- ❌ Replace segment analysis with replays. Replays explain _behavior_; segments explain _who_. Different questions. +- ❌ Pull replays from broad cohorts ("all treatment users") — the contrast pair is what reveals signal. +- ❌ Spend more time on replays than on the headline interpretation. The decision tree comes first; replays are the explanation step after it. + +--- + +## Output shape + +1. **The quantitative result the replays are explaining** — link back to the specific metric and verdict. +2. **Cohorts watched** — what filters were applied to A and B, how many replays in each. +3. **Patterns observed**, with counts (e.g. "4 of 5 treatment replays showed X; 0 of 5 control replays did"). +4. **The explanation hypothesis** — careful to frame as hypothesis ("consistent with"), not as proof. +5. **Recommended next action** — usually one of: ship anyway (regression edge-case), iterate (fix the friction), kill (treatment is materially worse), or run a follow-up A/B with the fix. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/sizing.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/sizing.md new file mode 100644 index 0000000..7c41c9a --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/sizing.md @@ -0,0 +1,109 @@ +# Sizing the experiment + +You almost never know the right sample size by guessing. Pull the data first, then run the math. + +## The standard formula + +Required sample size per variant (two-sample, two-sided test at 95% confidence, 80% power): + +``` +n = 16 × σ² / d² +``` + +Where: + +- `σ²` = variance of the metric (depends on metric type — see below). +- `d` = MDE in the same units as the metric. + +The `16` is `(z_{α/2} + z_{β})² × 2` rounded to a workable constant — `(1.96 + 0.84)² × 2 = 15.68 ≈ 16`. Good enough for setup-phase reasoning; for ship-decision rigour use the precise formula in `references/statistical-model.md`. + +## Variance by metric type + +- **Bernoulli (conversion rate).** `σ² = p(1−p)` where `p` is the baseline conversion rate. Variance peaks at `p = 0.5` (variance 0.25) and shrinks toward 0 at `p = 0` or `p = 1`. Lifts are easier to detect on rates near 50%, harder near the extremes. +- **Poisson (event counts per user).** `σ² ≈ mean count per user`. High-count metrics need proportionally more sample. +- **Gaussian (revenue, time-on-page, etc.).** Compute `σ²` from historical data directly. Long-tailed distributions have high variance — Winsorization (`references/advanced-features.md`) cuts this. + +## Worked example + +Detecting a 5% **relative** lift on a 10% baseline conversion rate at 80% power, 95% confidence: + +``` +p = 0.10 +σ² = 0.10 × 0.90 = 0.09 +absolute MDE = 0.10 × 0.05 = 0.005 +n = 16 × 0.09 / 0.005² = 16 × 0.09 / 0.000025 = 57,600 per variant +``` + +That's ~57,600 per variant for a 5% relative lift — humbling, and surprising to most teams. Most "we'll just run it for two weeks" plans don't survive contact with this number. + +## Kohavi's inverted formula + +For most online experiments, traffic is the constraint, not patience. Pick a duration (2–4 weeks captures weekly cycles), use all available traffic in that window, then compute the **achievable MDE**: + +``` +MDE = 4σ / √n +``` + +This tells the user: "given your traffic, the smallest effect you can reliably detect is X." If that achievable MDE is larger than the lift the user actually expects, the experiment is **underpowered**. Flag immediately. + +Underpowered experiments suffer from **winner's curse**: if you do reach significance, the lift estimate is exaggerated, because only the high-variance positive realisations crossed the threshold. The post-launch result then fails to replicate, and the team learns "experiments are unreliable" rather than "this experiment was underpowered." + +## Estimating the inputs from real data + +For each primary metric, before sizing, you need three numbers: + +1. **Baseline rate** — query the metric over the prior 2–4 weeks (the longer of: one full business cycle, or four weeks). Record `mean` and `variance`. Use the same event definition, segment filters, and unit-of-analysis you'll use in the experiment — a baseline computed differently from how the metric is configured in the experiment is worse than no baseline at all. +2. **Daily traffic** — query the exposure event (or whatever event qualifies users for the experiment) over the same window, grouped by day. Average to get expected exposures per day per variant. +3. **MDE the user wants** — ask explicitly. _"What's the smallest lift that would be worth shipping?"_ If they don't know, propose a 5–10% relative lift and confirm. + +From those three: + +``` +required_sample_per_variant = 16 × σ² / (baseline × MDE_relative)² +required_days = required_sample_per_variant × n_variants / daily_traffic_per_variant +``` + +If `required_days > 28` (four weeks), the experiment is **underpowered for the requested MDE on available traffic**. Tell the user. Don't wave it through. + +## Five remediations when the experiment is underpowered + +Offer these in order of cost — cheap first. + +1. **Accept a larger MDE.** Only commit to ship if the effect is bigger. This costs nothing but redraws the success criterion; confirm the user is OK with shipping only on a larger lift. +2. **Increase traffic allocation to the experiment.** If other tests don't need the traffic, give this one more. +3. **Use CUPED to reduce variance** (if pre-exposure data is available). 30–70% variance reduction translates directly into 30–70% smaller required sample. See `references/advanced-features.md`. +4. **Pick a higher-volume primary metric** (if the hypothesis allows). Often there's a leading proxy with more volume than the lagging metric the team originally chose. +5. **Don't run the experiment.** Invest the engineering elsewhere. Sometimes the right answer. + +## Sample-size floor + +Independent of the math: keep per-variant sample size above the platform's reliability floor (verify in product — historically ~350–400). Below this, the statistical machinery itself becomes unreliable — CLT breaks down, the SRM check gets noisy. The platform's default per-variant target is fine for most tests; ~1,000 is the practical floor; the platform floor is the absolute floor. + +If the math says `n = 50` per variant, the test is either trivially easy (the lift is huge) or the variance estimate is wrong. Sanity-check before launching at the floor. + +## Lookup table (Bernoulli, 95% conf, 80% power) + +For a Bernoulli (conversion-rate) primary metric at 95% confidence, 80% power, two-sided test, MDE expressed as a **relative** lift on the baseline: + +| Baseline rate | MDE = 5% relative | MDE = 10% relative | MDE = 20% relative | +| ------------- | ----------------- | ------------------ | ------------------ | +| 1% | ~633k / variant | ~158k / variant | ~40k / variant | +| 5% | ~122k / variant | ~31k / variant | ~7.6k / variant | +| 10% | ~58k / variant | ~14k / variant | ~3.6k / variant | +| 25% | ~19k / variant | ~4.8k / variant | ~1.2k / variant | +| 50% | ~6.4k / variant | ~1.6k / variant | ~400 / variant | + +Use this for quick sanity-checking. Always confirm with a query against actual baseline data — these are illustrative. + +## Sample-size growth with variants + +For a multi-arm test (N non-control variants), the per-variant target grows with the number of pairwise comparisons being made (each treatment vs control). With multiple-testing correction enabled (which is the right default at 2+ variants), the per-test α tightens, which inflates required sample size further. + +Rule of thumb: a 3-variant test (control + 2 treatments) needs about 1.3× the per-arm sample of a 2-variant test for the same MDE; a 4-variant test needs about 1.5×. Exact multipliers depend on the correction method — see `references/advanced-features.md`. + +## Duration considerations + +- **Minimum 1 week** — anything shorter misses weekly seasonality and conflates the day-of-week mix between control and treatment if traffic differs across days. +- **Minimum 3 days for read-out** — even with sequential testing and big effects, results under 3 days are typically un-interpretable (cohort hasn't stabilised, day-of-week effects dominate, novelty effect not separated from treatment effect). +- **Multiples of the seasonal cycle.** If the primary metric has strong weekly seasonality, set `endCondition: "days"` and choose 7, 14, 21, or 28 days so each variant sees the same mix of high- and low-traffic periods. +- **Cap at ~6 weeks** for most tests — beyond this, novelty effects wear off, the user population drifts, and other experiments running in the same window create cross-test contamination. If the math says you need 8+ weeks, you're underpowered — pick a remediation from the list above. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/statistical-model.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/statistical-model.md new file mode 100644 index 0000000..771a208 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/statistical-model.md @@ -0,0 +1,102 @@ +# Statistical model + +Once required sample size and acceptable duration are known, two configuration choices are left: the **testing model** (sequential vs frequentist) and the **end condition** (sample-based vs date-based). Two adjacent choices change how the tests are interpreted: **confidence level** and **multiple-testing correction**. + +## Testing model: sequential vs frequentist + +**Default to sequential** for most users. Peeking is the most common Mixpanel customer mistake, and sequential testing makes early-look safe by design. + +### Pick sequential when + +- The user expects a **large lift** and wants to confirm or reject the hypothesis quickly. Sequential lets you stop the moment significance is reached — often days or weeks before a frequentist target. +- The user wants to check results before the experiment ends and act on them (early-stop on a clear winner). +- The expected effect size is uncertain (could be huge, could be tiny). Sequential adapts; frequentist needs you to commit to one MDE up front. +- The team will look at intermediate results regardless. Sequential prevents peeking from inflating false positives. +- The user is comfortable with slightly more complex stopping rules ("stop when the test-statistic crosses the boundary," not "stop when n reaches N"). + +### Pick frequentist when + +- The user is hunting for a **very small lift** (e.g. 1–2% relative on a high-volume metric). Frequentist's fixed-sample design is statistically more efficient at the margin and avoids the early-stop boundary inflation that costs power on tiny effects. +- The team is comfortable waiting for the full sample before checking results — no peeking. +- The team prefers wider industry familiarity ("we used a t-test"). +- The user wants the simplest reportable statistics (a single p-value and confidence interval at the end). +- The team has internal training / tooling that assumes frequentist. + +### The "I want to peek with frequentist" trap + +The most common request is "I want frequentist, but I also want to look at the results during the test." This inflates the false-positive rate enormously — naive peeking on a frequentist test at 5 evenly-spaced check-ins pushes the family-wise α from 5% to ~14%. + +Switch them to sequential. Sequential's whole point is making peeking safe. + +If the user insists on frequentist + peeking (some teams do, for tooling reasons), document the decision in the experiment's description so the interpretation step later knows the reported p-values overstate confidence. + +## End condition: sample-based vs date-based + +### Pick sample-based when + +- The team has a target MDE and wants the experiment to stop the moment the required sample is reached. Adaptive duration. +- Daily traffic is highly variable. Sample-size-based ends absorb the variability; date-based ends don't. +- There's no strong seasonality in the primary metric that would bias a mid-cycle stop. + +### Pick date-based when + +- The primary metric has **strong weekly (or other periodic) seasonality**. Pin the duration to a multiple of the seasonal cycle so each variant sees the same mix of high- and low-traffic periods. + - A common pattern: customers with strong weekday/weekend behaviour shifts run all experiments in 1-week increments (or 2 weeks for a stricter check) to fully capture each cycle. + - A sample-based end can fire mid-cycle and produce biased results in this case. +- The team has a fixed business window (e.g. "we want to ship by end of quarter"). +- The team has historically struggled with experiments running indefinitely. +- The hypothesis specifically requires a calendar window (e.g. a holiday-season test). + +### Combinations + +All four combinations are valid. The one customers most often miss is **frequentist + date-based** — some teams prefer time-based experiments for operational reasons even when running frequentist tests. Don't flag this as a misconfiguration. + +The one that's actually wrong is **frequentist + sample-based + peeking** — that's the "peeking trap" above. Surface it; switch them to sequential. + +## Confidence level + +Default 0.95 (α = 0.05; verify in product). Change only with intent. + +- **0.99** — for high-stakes irreversible ships (e.g. billing changes, deletion-flow changes, anything regulatory). Higher false-negative cost; accept it. Document the reason in the experiment's description. +- **0.90** — for low-stakes exploratory tests where speed matters more than rigour. Acknowledge the inflated false-positive rate to the user explicitly: at α = 0.10, one in ten "wins" is noise. + +Any change away from the default belongs in the description. The post-launch interpretation step uses this setting to read the result correctly; without it, a "win" at 0.90 looks the same as a "win" at 0.95. + +## Multiple testing correction + +Enable when there are ≥2 primary metrics OR ≥2 non-control variants. Without correction, the family-wise false-positive rate compounds: + +| Primaries | Non-control variants | Family-wise FPR at per-test α = 0.05 | +| --------: | -------------------: | -----------------------------------: | +| 1 | 1 | 5.0% | +| 2 | 1 | ~9.75% | +| 3 | 1 | ~14.3% | +| 5 | 1 | ~22.6% | +| 5 | 2 | ~40.1% | +| 5 | 3 | ~53.7% | + +Derived from the standard `1 − (1 − α)^k` compounding for `k = primaries × non-control variants` independent tests at per-test α = 0.05. + +The takeaway: by the time you're testing 5 primaries on a 3-arm experiment, more than half of the "wins" are noise. + +Two methods are available: + +- **Bonferroni** — divides α by the number of tests (primaries × non-control variants). Simple and conservative. Guarantees the family-wise error rate stays below α, but can be overly strict when many primary metrics are correlated, hurting power. +- **Benjamini-Hochberg** — controls the **false discovery rate** (FDR) instead of the family-wise error rate. Ranks all primary-metric p-values and applies progressively looser thresholds. More powerful than Bonferroni when there are many primary metrics, especially when some have real effects. Preferred when the user has 3+ primaries or correlated metrics. + +**Default to Benjamini-Hochberg** for most experiments — less conservative, better suited to typical designs with correlated metrics. Use Bonferroni when: + +- The user needs strict family-wise error control (regulatory, high-stakes decisions where any single false positive is unacceptable). +- The primary metrics are independent (no shared drivers / overlapping populations), in which case Bonferroni's conservatism is not a real cost. +- The team explicitly asks for the simplest method to defend in a review. + +Turn correction off **only** when there's a single primary and a single non-control variant. + +## Power vs significance trade-off + +When the user pushes you on the confidence level: + +- Raising α from 0.05 to 0.10 increases power (smaller required sample for the same MDE) but doubles the rate of false-positive "wins." +- Lowering α from 0.05 to 0.01 cuts the false-positive rate fivefold but requires roughly 1.5× the sample for the same MDE. + +If the user wants more power without raising α, the right move is **smaller MDE → bigger required sample**, not loosening significance. If sample is the binding constraint, reach for CUPED (`references/advanced-features.md`) or a higher-volume proxy metric. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/why-no-statsig.md new file mode 100644 index 0000000..22ca2fe --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/why-no-statsig.md @@ -0,0 +1,115 @@ +# Why Hasn't This Reached Statistical Significance Yet? + +Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts. + +The actual stop / extend math (sample size, power, MDE) lives in [sizing.md](sizing.md) — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one. + +--- + +## First, rule out a broken result + +Inconclusive can mean two very different things: + +1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about. +2. **The result isn't trustworthy at all** — SRM failing, broken data, peeked frequentist, etc. — and "inconclusive" is the wrong frame entirely. + +Before answering "why no statsig?", run the **trustworthiness gate**. If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power. + +Also check: + +- The primary's lift is missing or null → no measurement, not "no effect." +- The primary is listed on the experiment but has no computed result (live or cached) → "no measurement," not "no effect." +- The live results carry an error block → results are stale or partial; resolve the backend issue before drawing power conclusions. + +--- + +## The five real reasons an experiment hasn't hit statsig + +Walk through these in order. The first one that explains the picture is usually right. + +### 1. Not enough sample yet (not enough exposures) + +**What to check**: per-variant exposure counts against the configured end target (sample size or duration, whichever the experiment was configured with), and which testing model the experiment is using. + +- **Sequential** + target not reached → genuinely too early. Recommend **WAIT**. +- **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe. +- Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5. + +If exposures are falling short of plan because traffic dropped: surface that. Querying the exposure event with a date breakdown shows whether something changed mid-experiment. + +### 2. Observed effect is smaller than the MDE + +**What to check**: the lift on the primary metric, plus the MDE the user planned for (typically captured in the experiment's hypothesis/description, or recovered via the setup-side skill's power math). + +- Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1. +- Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options: + - **Accept the null** — at this size, the change isn't moving the metric. Document and move on. + - **Resize and rerun** — if a smaller effect would still be ship-worthy, re-run with a larger sample (lower MDE). +- Observed lift much **larger** than planned MDE but still not significant → unusual; likely high variance (see reason 3) or insufficient exposures (reason 1). + +### 3. Variance is too high (metric is too noisy) + +**What to check**: the metric's distribution type, plus whether CUPED and Winsorization are enabled. + +- **Gaussian** metric (revenue, time-on-page) with no Winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization on the next run. +- **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume. +- **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the absolute scale of detectable effects. Lifts near 50% rates are easiest; lifts near 0%/100% need much more sample. +- **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%. +- **CUPED enabled on a new-user-only cohort** → CUPED has no effect (no pre-exposure data exists). Not a misconfiguration to "fix," but variance reduction simply didn't happen. + +### 4. Traffic split is starving the variant + +**What to check**: the configured traffic split against the actual per-variant exposure counts. + +- Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue. +- Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later. +- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs to clear the platform's per-variant exposure floor in its own right. Adding arms costs power per-comparison. + +Never change traffic allocation mid-Frequentist test — it invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment. + +### 5. Exposure config is filtering more users than the user expects + +**What to check**: exposure event volume, any audience filters on the backing feature flag, and whether QA traffic is being excluded. + +- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query the exposure event to confirm how many users actually got exposed. +- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event. +- QA traffic isn't being excluded and you suspect internal traffic is dominating one variant → enable the QA exclusion on the next run (results then are cleaner but also smaller). + +**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md). + +--- + +## Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL? + +Once you know which reason fits, the recommendation almost picks itself. + +| Reason | Recommendation | +| -------------------------------------- | ------------------------------------------------------------------------------------------------------------ | +| Not enough sample yet, still ACTIVE | **WAIT.** Show projected end date based on observed traffic. | +| Not enough sample yet, concluded early | **EXTEND** (Frequentist: relaunch with longer planned duration; Sequential: resume if possible). | +| Effect << MDE | **ACCEPT NULL** if the planned MDE is the smallest ship-worthy effect; otherwise **BOOST POWER** and re-run. | +| Variance too high | **BOOST POWER**: enable CUPED, enable Winsorization, switch to a less noisy metric proxy. | +| Variant starved by traffic split | **EXTEND** (if remaining time is enough) or restart with rebalanced split. | +| Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. | +| Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. | + +When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or use the power math in [sizing.md](sizing.md). + +--- + +## What NOT to suggest + +- ❌ **Stop early on a favorable peek** in a Frequentist test — that's exactly the false-positive inflation problem. +- ❌ **Switch testing model mid-experiment** — restart, don't morph. +- ❌ **Add more primary metrics** to "fish" for a win — multiplies the family-wise FPR. If a single primary is inconclusive, more primaries make the picture worse, not better. +- ❌ **Re-run identical hypothesis on the same audience right after concluding "no effect"** — without a power change, you'll get the same answer. +- ❌ **Claim "no effect"** from an underpowered inconclusive result — the right framing is "the experiment wasn't sized to detect the effect we observed." + +--- + +## Output shape + +1. **The reason** (one of the five above), in one sentence. +2. **The evidence** — concrete numbers from the experiment (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%"). +3. **Recommendation** from the table above, with the specific experiment update or follow-up action. +4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md new file mode 100644 index 0000000..f2e1be1 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md @@ -0,0 +1,118 @@ +--- +name: manage-experiment +description: > + Coach the user through any phase of a Mixpanel experiment — design before + launch (hypothesis framing, metric selection, sizing, statistical model + choice, advanced statistical features like CUPED / Winsorization / + Bonferroni / Benjamini-Hochberg, pre-launch pitfall checks) and interpret + after launch (read results, decide ship / iterate / kill / wait, interpret + health checks like SRM and Retro A/A, break results down by segment, use + session replays to explain a result). Use when the user mentions + experiment, A/B test, ship/kill decision, MDE, minimum detectable effect, + sample ratio mismatch, CUPED, sizing, statistical significance, lift, or + any phrasing like "set up an experiment", "design an A/B test", "how did + experiment X do", "should we ship", "why isn't this significant yet", + "should this be sequential or fixed-horizon", "what's my MDE", "is this + experiment configured correctly", "audit my experiment". Do NOT use for + plain feature-flag rollouts with no measurement criterion — that belongs + to the `manage-feature-flags` skill. +license: Apache-2.0 +--- + +# Manage Experiment + +This skill manages a Mixpanel experiment across its lifecycle — **designing** before launch and **interpreting** after launch. Two commands sit under the umbrella; pick by experiment phase: design when the experiment doesn't exist yet, interpret once exposures are flowing. + +The skill runs as a single interactive session per experiment. The two commands compose naturally — designing produces a configuration that interpreting later consumes — but they're rarely invoked in the same session (the gap is days to weeks). + +--- + +# Components + +The pieces the skill is built from. The Steps section below tells you how to use them. + +## Canonical commands + +Each command lives in its own file under `commands/` and is loaded on demand. Match commands explicitly (user names them) or implicitly (message matches a trigger phrase below). + +| Command | File | Match if message contains any of | +| ----------- | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- | +| `design` | `commands/design.md` | design, set up, configure, plan, sanity-check, pre-launch, MDE, sizing, hypothesis, sequential vs frequentist, CUPED, Winsorization | +| `interpret` | `commands/interpret.md` | read results, ship, iterate, kill, wait, statsig, SRM, sample ratio mismatch, retro A/A, lift, polarity, segment breakdown, session replays | + +If a message could route to either (e.g. "audit my experiment", "check on experiment X"), use the **phase-derived** rule: experiment in `DRAFT` → `design`; experiment in `ACTIVE` or `CONCLUDED` → `interpret`. If the experiment state is unknown, ask the user. + +## Command menu + +Shown when no command was detected or inferred. + +``` +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ + Manage Experiment — [Project Name] ([project_id]) +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ + 1. Design — Hypothesis, metrics, sizing, model, pre-launch checks + 2. Interpret — Read results, ship / iterate / kill / wait + 3. Exit +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ +``` + +## Shared glossary + +Terms both commands use without redefining. Phase-specific terms (hypothesis, polarity, SRM, etc.) live in their command files. + +- **Variant.** One arm of the experiment. The variant treated as the baseline is the **control**; the others are **treatments**. The platform marks which key is the control. +- **Primary / Guardrail / Secondary metric.** + - **Primary** — drives the ship decision. Cap at 3; the platform applies multiple-testing correction across primaries when configured. + - **Guardrail** — must not regress; a guardrail regression vetoes a ship even when primaries win. + - **Secondary** — exploratory / diagnostic only, never decisional, no correction applied. +- **Direction.** Whether bigger is better for a metric (`up`) or smaller is better (`down`). Cancel / error / latency / abandon / refund metrics need `down` set explicitly — leaving the default silently flips polarity at interpretation. +- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict. +- **MDE (Minimum Detectable Effect).** The smallest lift the experiment is sized to detect. Set during design, enforced at interpretation. +- **CUPED.** Variance-reduction technique using pre-exposure baseline. Cuts required sample 30–70% when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. +- **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95 (verify in product). Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. +- **Multiple-testing correction.** Adjusts per-test significance threshold when several primaries or several non-control variants are tested together. Default Benjamini-Hochberg; Bonferroni for strict family-wise control. + +## Behaviour rules + +1. **Irreversible actions require explicit confirmation.** Creating an experiment (in `design`) and concluding one (in `interpret`) are both irreversible. Show the proposed action, wait for the user to confirm. +2. **If a command can't complete, explain why.** Tell the user what failed and what they can try. Don't fail silently. +3. **Experiment switching.** If the user wants to operate on a different experiment mid-session, ask which one and reset experiment-scoped context. +4. **Project switching.** If the user wants to operate on a different project mid-session, suggest starting a new conversation first. If they insist, resolve the new project and continue with that `project_id`. + +--- + +# Steps + +Follow these steps in order. + +## 1. Set project + +Resolve which Mixpanel project the user wants to operate on. + +- **User named a project (name or ID):** list all projects in the workspace. Match by ID first, then by case-insensitive name. If one match → `✅ [Project Name] ([project_id])`, proceed. +- **Multiple name matches:** show the matches in a numbered list, ask the user to pick. +- **No match:** tell the user what wasn't found, offer to `list` (which re-fetches the project list and shows the table). +- **User named nothing:** ask which project. `list` → fetch projects → show table. + +If the project listing fails with tool-not-found, tell the user to connect the Mixpanel MCP and stop. + +## 2. Set experiment (if one is named) + +If the user named an experiment, resolve it now — try ID first, then case-insensitive name match. Multiple matches → numbered picker. No match → tell the user what wasn't found. + +If the user is starting a new experiment from scratch (no existing experiment to name), skip this step — `design` will handle setup. + +## 3. Pick the command + +- **Explicit:** user names a phase (`/design`, "set up a new experiment", "interpret experiment X") → use that command. +- **Implicit:** message matches one canonical trigger phrase from the Components table → use that command. +- **Phase-derived:** an experiment exists in context and its state determines the command — `DRAFT` → `design`; `ACTIVE` or `CONCLUDED` → `interpret`. +- **Ambiguous or none:** show the Command menu, take the user's choice. + +## 4. Load and execute the command + +If the command file is not already in context, read `commands/[command].md`. Follow the instructions in that file. Reuse the project and experiment context resolved in steps 1–2 — never re-ask. + +## 5. Complete + +Print `✅ Done.` Return to step 3 if the user wants to chain another command (e.g. design → launch externally → interpret in a later session). diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md new file mode 100644 index 0000000..b5f7046 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md @@ -0,0 +1,161 @@ +# Command: design + +Design a Mixpanel experiment before launch. A well-designed experiment starts from the hypothesis and works backward: the hypothesis dictates the metrics that test it, the metrics dictate the sample size, the sample size + traffic dictate duration and testing model. **Don't create the experiment until the user explicitly confirms the configuration** — once it's live, mid-flight config changes invalidate the test. + +The umbrella `SKILL.md` defines the shared glossary (Variant, Primary/Guardrail/Secondary metric, Direction, Lift, MDE, CUPED, Winsorization, Multiple-testing correction). Phase-specific terms below. + +--- + +## Glossary (design-specific) + +- **Hypothesis.** A falsifiable, directional claim with a stated mechanism, bounded in time. Shape: _"If ``, then `` will `` by ≥``, because ``."_ Every other decision flows from this. +- **Power.** The probability the experiment detects a true effect of size MDE. Default 80%. +- **Underpowered.** Achievable MDE on available traffic exceeds the user's expected lift. Most likely outcome is "inconclusive"; reachable significance is biased upward (winner's curse). +- **Sequential vs Frequentist testing.** Sequential makes peeking safe (boundary-based stopping); Frequentist requires a fixed sample committed up front. Most users should default to Sequential. + +--- + +## Components (design-specific) + +### Sizing formulas + +Required sample per variant (two-sample, two-sided, 95% confidence, 80% power): + +``` +n = 16 × σ² / d² +``` + +Inverted for traffic-bound teams — the smallest effect detectable on available traffic (Kohavi's inversion): + +``` +MDE = 4σ / √n +``` + +The `16` is `(z_{α/2} + z_β)² × 2` rounded. Variance `σ²` depends on metric type: Bernoulli `p(1−p)`; Poisson `≈ mean`; Gaussian computed from data. The full derivation, worked examples, lookup table, and the five remediations for underpowered experiments live in [../references/sizing.md](../references/sizing.md). + +### The >5% guardrail hard-gate + +A **5% relative regression on any guardrail blocks ship**, even when the primary wins. Guardrails are the trustworthiness backstop; without this rule, a winning primary with a quietly regressing guardrail ships and rolls back two weeks later. Below 5% lives in the noise band of most guardrails; above 5% means the team has traded measurable damage for headline lift. If the user wants to ship past a regressing guardrail, force the conversation — disable the guardrail explicitly and document why. Don't let them silently override. Full rationale in [../references/pitfalls.md](../references/pitfalls.md). + +### Pre-launch pitfall catalogue + +Before creating the experiment, run the deterministic pre-launch checks against the configuration. Surface results in triage order: **blockers** (an experiment that can't reach statistical power), **warnings** (configuration smells that degrade trustworthiness), then **fyi**. The two blockers today are: insufficient duration for the configured MDE on available traffic; and a cohort too small to supply enough eligible users. The full catalogue, severities, and rationale live in [../references/pitfalls.md](../references/pitfalls.md). + +--- + +## Steps + +Top-down: what to do, in order. + +### 1. Route and check for prior work + +**Route Experiment vs Feature Flag first.** Wants causal evidence (lift, ship/no-ship from data) → experiment. Wants progressive rollout, kill switch, or per-segment gating with no measurement criterion → feature flag (route to the `manage-feature-flags` skill). If ambiguous, ask once: _"Are you measuring whether this change moves a metric (experiment), or rolling it out gradually with no measurement criterion (feature flag)?"_ Deeper disambiguation in [../references/routing-xp-vs-ff.md](../references/routing-xp-vs-ff.md). + +**Check for prior experiments on the same feature.** Search the project by keywords drawn from the feature name. If a prior-experiments lookup isn't available, say so explicitly — don't fabricate "no priors found." Surface anything you find: a same-feature ship suggests "don't re-run, iterate on a new hypothesis"; a prior kill is a strong prior the user has to argue past; an earlier iteration gives you reliable baseline and variance numbers that sharpen the new MDE. Fold-in playbook in [../references/prior-experiments.md](../references/prior-experiments.md). + +### 2. Write the hypothesis + +A good hypothesis is a **falsifiable, directional claim with a stated mechanism, bounded in time**: + +> **If** ``, **then** `` will `` by ≥``, **because** ``. + +If the user is vague, hold them to five commitments: the change, the primary metric, the direction, the MDE, the mechanism. The "because" forces them to check whether the metric they picked is actually downstream of the change — the most common source of "experiment didn't work" post-mortems. The rubric, common misalignment patterns, and worked good/bad examples are in [../references/hypothesis-framing.md](../references/hypothesis-framing.md). + +### 3. Pick metrics that test the hypothesis + +The hypothesis names a specific outcome. The primary metric must measure that outcome — same population, same denominator, same timeframe. + +- **Primaries** (1–3 max) come from the hypothesis's outcome clause. Each additional primary inflates the family-wise false-positive rate. +- **Guardrails** (strongly recommended) cover the most likely failure mode of the change — see the guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). +- **Secondaries** are diagnostic only. + +Every primary and guardrail needs an explicit `direction`. Watch for the **lagging-indicator trap** (30-day retention as primary on a 2-week experiment) and the **changed-denominator trap** (metric defined only over treatment-exposed users — lift is artificially infinite). Full sanity checklist and standard guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). + +### 4. Size the experiment with real data + +Pull baseline rate, variance, and daily traffic from Mixpanel. Don't guess. + +Use the formulas in **Components**. Then compare the required sample to what the available traffic delivers inside an acceptable window (typically 2–4 weeks). If the achievable MDE exceeds the user's expected lift, the experiment is **underpowered** — surface immediately. Don't wave it through; offer the remediations from the sizing reference (accept a larger MDE → increase allocation → enable CUPED → pick a higher-volume primary → don't run). + +Sample-size floor: keep per-variant target above the platform's reliability floor (verify in product — historically ~350–400). Below the floor, the central limit theorem breaks down and the SRM check gets noisy. Full worked examples, baseline-by-rate lookup table, and the duration / seasonality rules in [../references/sizing.md](../references/sizing.md). + +### 5. Pick testing model + end condition + +Four choices, each with a default that's right for most users: + +- **Testing model** — default Sequential (peeking is safe by design); Frequentist only for small-lift hunts on well-sized tests. +- **End condition** — sample-based for variable traffic; date-based for strong weekly seasonality. +- **Confidence level** — default 0.95 (verify in product); 0.99 for irreversible high-stakes ships; 0.90 only when speed beats rigour. +- **Multiple-testing correction** — enable when there are ≥2 primaries OR ≥2 non-control variants; default Benjamini-Hochberg, Bonferroni for strict family-wise control. + +Decision tree, the peeking-trap explanation, worked compounding-FPR numbers, and the four valid model × end-condition combinations are in [../references/statistical-model.md](../references/statistical-model.md). + +### 6. Decide on advanced features + +- **CUPED** — enable when the primary metric correlates with pre-exposure behaviour AND all experiment users existed before start AND 2–4 weeks of stable pre-exposure history is available. Do not enable on new-user-only experiments, one-time-event metrics, or brand-new metrics. +- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. Push back if the configured percentile is below ~80. + +When/why each is right and the common misconfigurations are in [../references/advanced-features.md](../references/advanced-features.md). + +### 7. Run the pre-launch pitfall check + +Run the catalogue from **Components** against the proposed configuration. Surface only what fires; order blockers → warnings → fyi. Blockers should stop launch (the experiment cannot reach statistical power as configured). Warnings should be explained — name the trade-off, don't just nag. Full catalogue in [../references/pitfalls.md](../references/pitfalls.md). + +### 8. Confirm with the user, then create + +Creating the experiment is the irreversible step. Present a compact summary and **wait for explicit confirmation** before invoking the creation action: + +``` +*Experiment Setup Summary* + +• *Hypothesis:* If , then will by ≥, because . +• *Primary metrics:* (direction: up/down), … +• *Guardrails:* (direction: …), … +• *Variants:* control 50% / treatment 50% (or as configured) +• *Statistical model:* sequential | frequentist +• *End condition:* sample-based (per-arm ) | date-based ( days) +• *Confidence level:* 0.95 +• *Multiple testing correction:* benjamini-hochberg | bonferroni | off +• *Advanced features:* CUPED on/off · Winsorization on/off (percentile

) +• *Expected duration on current traffic:* days +• *Achievable MDE on current traffic:* % relative + +*Pitfall check:* +✅ Insufficient duration — adequate +✅ Cohort too small — adequate +⚠️ Missing guardrails — no guardrail metrics configured; >5% hard-gate cannot protect this ship +``` + +Use the exact catalogue labels from [../references/pitfalls.md](../references/pitfalls.md) so the agent's pitfall messages stay consistent across sessions. + +After creating, link the new experiment back to any prior experiment surfaced in step 1 — record the prior's ID, hypothesis, and outcome in the new experiment's description. That 30-second annotation pays back tenfold at interpretation time. + +If the user hasn't named a specific feature or surface, ask before fetching baselines or designing — designing the wrong experiment burns more time than the clarifying question costs. Accept the feature by name or by ID; try ID match first, then case-insensitive name match. + +--- + +## Going deeper + +| User asks about… | Open | +| ----------------------------------------------------------------------------- | -------------------------------------------------------------------------- | +| "Is this an experiment or just a feature flag?" | [../references/routing-xp-vs-ff.md](../references/routing-xp-vs-ff.md) | +| "Help me write the hypothesis" / "Is this hypothesis good?" | [../references/hypothesis-framing.md](../references/hypothesis-framing.md) | +| "Which metrics should I pick?" / "Primary vs guardrail vs secondary?" | [../references/metric-selection.md](../references/metric-selection.md) | +| "What sample size do I need?" / "What MDE can I detect?" / "How long to run?" | [../references/sizing.md](../references/sizing.md) | +| "Sequential vs frequentist?" / "Confidence level?" / "Correction method?" | [../references/statistical-model.md](../references/statistical-model.md) | +| "Should I enable CUPED / Winsorization?" | [../references/advanced-features.md](../references/advanced-features.md) | +| "Was anything similar tested before?" | [../references/prior-experiments.md](../references/prior-experiments.md) | +| "What can go wrong before launch?" / "Run the pre-launch check" | [../references/pitfalls.md](../references/pitfalls.md) | + +--- + +## Output style + +- Lead with the hypothesis. Every other decision flows from it. +- Use concrete numbers from real data ("baseline 4.2%, σ² = 0.040, required n ≈ 6,400/arm"), not vague guidance. +- Quote the user's MDE and metric names back so they catch typos. +- When underpowered, say so plainly and list remediations in order of cost. +- Don't moralise about peeking — switch them to sequential. +- Guardrail regressions are hard gates, not "slight concerns." + +When the experiment is created and live, hand off to the `interpret` command in this same skill once exposures are flowing. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/interpret.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/interpret.md new file mode 100644 index 0000000..77d78ff --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/interpret.md @@ -0,0 +1,112 @@ +# Command: interpret + +Interpret a Mixpanel experiment's results and health checks. This command consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a verdict field is missing, say so — do not synthesize one from raw values. + +The umbrella `SKILL.md` defines the shared glossary (Variant, Primary/Guardrail/Secondary metric, Direction, Lift, MDE, CUPED, Winsorization, Multiple-testing correction). Phase-specific terms below. + +--- + +## Glossary (interpret-specific) + +- **Polarity.** Whether a movement is _good for the business_. Combines sign of lift with the metric's `direction` ("up" = bigger is better; "down" = smaller is better). See the **Polarity recipe** in Components. +- **Significance.** The platform's per-row classification: `YES_POSITIVE`, `YES_NEGATIVE`, or `NO`. Read from the response — do not recompute. +- **SRM (Sample Ratio Mismatch).** Variants received traffic in proportions that disagree with the configured split. **Kohavi's #1 trustworthiness check** — when SRM fails, downstream lift, p-values, and CIs cannot be trusted. +- **Retro A/A (pre-experiment bias).** Re-runs the comparison on the pre-exposure period. A failure means cohorts already differed before treatment started. +- **Twyman's Law.** "Any unusually clean or unusually large result is more likely a bug than a discovery." Apply on lifts > ~30% — usually a changed-denominator artifact. +- **Trustworthiness gate.** The pre-flight check that runs before any results interpretation: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference. + +--- + +## Components (interpret-specific) + +### Polarity recipe (load-bearing — apply on every metric row) + +The platform's summary buckets (`positive` / `negative` / `no`) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion. + +Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"): + +- `lift is None` or `lift == 0` → **neutral** (no measurement / no effect respectively). +- `direction == "up"` → **positive** if `lift > 0`, else **negative**. +- `direction == "down"` → **positive** if `lift < 0`, else **negative**. + +A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control. + +The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**. + +### Data-source fallback + +Experiment-details has two parallel data paths — live (preferred) and cached. Always prefer live; if live computation failed, fall back to cache with a staleness caveat; if **both** are empty, say "no result was computed" and recommend a re-sync. **Never** silently treat missing data as "no effect." + +### Verdict table + +| Situation | Recommendation | +| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Conclude the experiment via its decide lifecycle action, naming the winning variant and a rationale message. **Confirm with the user first — concluding is irreversible.** | +| Trust ✓, primary polarity positive, guardrail polarity negative | **ITERATE.** Investigate the regression; do not auto-ship. | +| Trust ✓, primary polarity neutral after target sample reached | **KILL or ITERATE.** Use the inconclusive-results playbook in [../references/why-no-statsig.md](../references/why-no-statsig.md). | +| Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [../references/why-no-statsig.md](../references/why-no-statsig.md)). | +| Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [../references/health-check-interpretation.md](../references/health-check-interpretation.md). | + +For multi-variant tests, special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), and the exact decide-call shape, see [../references/lifecycle-handoff.md](../references/lifecycle-handoff.md). + +--- + +## Steps + +Top-down: what to do, in order. + +### 1. Fetch the experiment + +The umbrella's step 2 should have resolved the experiment already. If not (the user named it mid-command), accept by name or ID; try ID match first, then case-insensitive name match. + +Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments. + +Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret. + +### 2. Run the trustworthiness gate (the Decision Tree) + +Run steps 2a–2e in order. **Stop at the first failure** — do not proceed if a step flags a problem. The platform attaches verdict fields for each check; consume those verdicts rather than recomputing. + +#### 2a. Trustworthiness + +SRM ok? Retro A/A clean? Exposures sufficient? Minimum duration met (~3 days)? No misconfiguration? If any fail → STOP and open [../references/health-check-interpretation.md](../references/health-check-interpretation.md). The Misconfigurations section in that reference covers the warning-level signals (multiple-testing off, extreme winsorization, CUPED on new-users-only, etc.). + +#### 2b. Statistical significance + +Apply the **polarity recipe** from Components to each non-control variant × primary metric. If nothing is significant on primaries → see [../references/why-no-statsig.md](../references/why-no-statsig.md). For translating a single metric's lift / CI / p-value into a phrase, see [../references/per-metric-interpretation.md](../references/per-metric-interpretation.md). + +#### 2c. Guardrail check + +Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a row in `summary.positive` for a `direction: "down"` guardrail is still a regression. + +#### 2d. Practical significance + +Convert lift into absolute terms — multiply by the control baseline. Statistically significant ≠ ships. The per-metric reference covers the baseline-fetch fallback when `value` or `sampleSize` is missing, and the **Twyman's Law** check for any lift > ~30%. + +#### 2e. Verdict + +Look up the situation in the **Verdict table** in Components. If the recommendation is SHIP or KILL, surface the proposed decide-action parameters and **wait for explicit user confirmation** before executing — concluding an experiment is irreversible. + +### 3. Going deeper (open references on demand) + +| User asks about… | Open | +| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ | +| SRM failing, Retro A/A failing, exposures insufficient, or any trustworthiness fail | [../references/health-check-interpretation.md](../references/health-check-interpretation.md) | +| "Translate this lift / CI / p-value into English" | [../references/per-metric-interpretation.md](../references/per-metric-interpretation.md) | +| "Why hasn't this hit statsig yet? Should we wait or stop?" | [../references/why-no-statsig.md](../references/why-no-statsig.md) | +| "Which segments should I break this down on?" | [../references/segment-of-interest-selection.md](../references/segment-of-interest-selection.md) | +| "What does this segment-by-segment result mean?" | [../references/segment-breakdown-interpretation.md](../references/segment-breakdown-interpretation.md) | +| "Can session replays help explain this result?" | [../references/session-replay-analysis.md](../references/session-replay-analysis.md) | +| "How do I actually conclude this experiment? Multi-variant ship?" | [../references/lifecycle-handoff.md](../references/lifecycle-handoff.md) | + +### 4. Output + +Default to this shape unless the user asks for something else: + +1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`. +2. **Why**, walking through the trustworthiness-gate steps that mattered (skip steps that were clearly fine). +3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, each polarity-corrected. Include absolute-impact translation for any win. +4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, stale-cache caveat, etc. +5. **Suggested next action** — for SHIP / KILL, the proposed decide-action parameters **gated on user confirmation**; for ITERATE / WAIT, the investigation to run next. + +If experiment details are unavailable or return errors, say so — do not invent a verdict. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/advanced-features.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/advanced-features.md new file mode 100644 index 0000000..3f6f7dd --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/advanced-features.md @@ -0,0 +1,103 @@ +# Advanced features + +Three optional features most experiments don't touch — and that, used in the right spot, dramatically improve power or trustworthiness. Each one has a clear set of conditions where it helps and a clear set of conditions where enabling it is wrong. + +## CUPED — variance reduction + +**What it does.** CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance on metrics that correlate with users' pre-experiment behaviour. Lower variance → smaller required sample size → faster experiments. Typical reductions are 30–70%, which translates directly into 30–70% smaller required sample. + +**How to enable.** Turn CUPED on for the experiment and pick a pre-exposure window length (see presets below). + +### When to enable + +- The primary metric correlates with users' pre-exposure behaviour on the same metric. Strong correlations: revenue, engagement (events per user), retention, time-on-platform. Weak correlations: anything one-time or onboarding-specific. +- **All experiment users existed before the experiment start** — i.e., not a new-user-only cohort. CUPED needs a pre-exposure observation period; new users don't have one. +- A 2–4 week pre-exposure window is available with stable behaviour. If the metric was launched 5 days ago, CUPED has nothing to read. + +### When NOT to enable + +- New-user-only experiments. No pre-exposure data exists. CUPED gives zero variance reduction and adds noise. +- Brand-new metrics without historical data. +- Metrics where pre-exposure behaviour is not predictive of post-exposure (e.g., one-time onboarding events: the user either did or didn't complete onboarding once; pre-exposure has nothing to say about it). +- Pre-exposure window short enough that the behaviour you'd "control for" is itself a transient spike (e.g., metric just had a viral moment last week). + +### Pre-exposure window presets + +- **2 weeks** — fast-moving metrics with no strong weekly seasonality. +- **4 weeks** — most metrics with weekly seasonality (default sweet spot). +- **60 days** — deeply seasonal metrics like spend. +- **90 days** — long-cycle metrics (renewal-driven revenue, etc.). + +### What changes downstream + +- Required sample size shrinks by the variance-reduction factor. A 50% variance reduction on a primary that needed 60k per arm shrinks the target to ~30k per arm. +- The point estimate of the lift is unchanged. CUPED is a variance-reduction technique, not a bias correction; the headline lift is the same, the confidence interval is narrower. +- The post-launch interpretation step needs to know CUPED was on, because the standard error formula differs. The platform persists the setting on the experiment; the interpretation step reads it automatically. + +## Winsorization — outlier handling + +**What it does.** Caps extreme values at a percentile boundary (default 95th — verify in product). This squeezes the long tail of heavy-tailed distributions so a handful of outliers can't dominate the per-arm mean. + +**How to enable.** Turn Winsorization on for the experiment and pick a percentile. + +### When to enable + +- Revenue or spend metrics with whales (one customer spends 100× the median; that customer assigned to treatment is enough to swing the headline). +- Time-on-page or session-duration metrics with users who fall asleep on the page (one session at 8 hours dwarfs 10,000 sessions at 30 seconds). +- Any Gaussian-distributed metric with a heavy right tail (count metrics, event volume per user, page view counts). + +### When NOT to enable + +- Bernoulli (conversion) metrics. Capping a 0/1 outcome is meaningless; the 95th percentile of a 0/1 distribution is also 0 or 1. +- Metrics where the tail behaviour **is** the hypothesis. If the test is "did this change move whale spending?", Winsorization throws away exactly the signal you're testing for. +- Metrics already winsorized upstream (in the metric definition / data pipeline) — double-winsorization adds nothing. + +### Percentile guidance + +The platform default is typically 95 (cap top/bottom 5%) — verify in product. This is almost always right. Push back if the user sets a percentile below ~80 — that's more than 20% of values being capped, which throws away too much signal. Confirm intent before launching. + +For very heavy tails (extreme whale distributions), 99th percentile is sometimes appropriate, but that's the corner case. The platform default is the default for a reason. + +### What changes downstream + +- Variance on the affected metric drops, often substantially. Required sample size shrinks accordingly. +- The point estimate of the mean shifts toward the centre of the distribution. This is the desired behaviour; the whole point is to stop a few outliers from anchoring the estimate. +- The post-launch interpretation step reports the winsorized mean and standard error. If the team also wants to know what the un-winsorized mean did (the "did whales react?" question), they'd need a separate secondary metric without Winsorization. + +## Multiple testing correction — Bonferroni vs Benjamini-Hochberg + +Covered in detail in [statistical-model.md](statistical-model.md). The short version: + +- Enable when there are ≥2 primaries OR ≥2 non-control variants. +- Default to Benjamini-Hochberg. More powerful with correlated primaries. +- Use Bonferroni when family-wise error control is required (regulatory, etc.) or when the primaries are independent. +- Turn off only with a single primary and a single non-control variant. + +## Decision flowchart + +``` +Primary metric is Bernoulli (conversion rate)? +├── Yes → Winsorization OFF. +│ Does it correlate with pre-exposure behaviour of existing users? +│ ├── Yes → CUPED ON (if 2–4 week pre-exposure window available, no new-user cohort) +│ └── No → CUPED OFF +└── No (continuous / count / retention) + Heavy-tailed distribution with outliers (revenue, time-on-page, session length)? + ├── Yes → Winsorization ON (platform default percentile, typically 95) + └── No → Winsorization OFF + Does it correlate with pre-exposure behaviour of existing users? + ├── Yes → CUPED ON (if 2–4 week pre-exposure window available, no new-user cohort) + └── No → CUPED OFF + +Primary count ≥ 2 OR non-control variants ≥ 2? +├── Yes → Multiple testing correction ON (Benjamini-Hochberg default; Bonferroni for strict family-wise control) +└── No → Multiple testing correction OFF +``` + +## Common misconfigurations + +- ⛔ **CUPED on a new-user-only experiment.** No pre-exposure data; the feature does nothing. Worse, the user thinks they're being protected and ships an underpowered test. +- ⛔ **Winsorization on a conversion metric.** Capping 0/1 values is meaningless. The setting either no-ops or, if a buggy implementation interprets it literally, makes the metric worse. +- ⛔ **Winsorization at a percentile below ~80.** Cuts more than 20% of data. Almost always a typo for 95 or 90. Confirm intent. +- ⛔ **Multiple testing correction OFF on a 5-primary test.** Family-wise FPR balloons to ~22.6%. One in five "wins" is noise. +- ⛔ **CUPED enabled "to be safe" on a metric where pre-exposure doesn't predict post-exposure.** Best case: no effect. Common case: the variance estimate gets noisier because the regression adjustment is fitting to noise. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/health-check-interpretation.md new file mode 100644 index 0000000..1467468 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/health-check-interpretation.md @@ -0,0 +1,176 @@ +# Health-Check Interpretation + +Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action. + +--- + +## Kohavi framing — always cite when a health check fails + +> **Sample Ratio Mismatch is the #1 trustworthiness check (Kohavi).** When SRM is failing, do not trust the experiment's lift, p-values, or confidence intervals — the randomization assumption is broken, so the measured effect cannot be attributed to the treatment. +> +> **Twyman's Law**: any unusually clean or unusually large result is more likely a bug than a discovery. A spectacular lift on a failing-SRM experiment is not evidence of a great treatment; it's evidence the bucketing is broken. + +These two principles drive the recommendations below. Lead with them when explaining a failing check to the user. + +--- + +## 1. SRM (Sample Ratio Mismatch) + +**What the platform tells you**: the SRM verdict the experiment-details response carries (live, or cached when live isn't available). The platform tags failing SRMs already — consume the verdict, do not compute chi-square yourself. + +### What it means + +Users were assigned to variants in proportions that disagree with the configured target allocation. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness. + +### Likely causes, ordered most → least likely + +(Surface in this order — investigate the most probable first.) + +1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees. +2. **biased_assignment** — The assignment criterion correlates with the variant — e.g. assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window. +3. **bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation. +4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the exposure event fires exactly once per user per variant assignment. +5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period. + +### Recommended actions + +- **pause_and_investigate** — Pause the experiment before drawing any conclusions. SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable. +- **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric. +- **investigate_exposure_logging** — Compare exposure event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs. +- **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split. + +### Investigation checklist + +1. Compare the actual per-variant exposure ratio to the configured target allocation — which variant is over/under-represented? +2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history. +3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math. +4. Verify SDK version and bucketing logic. Query the exposure event grouped by variant to confirm exposure events are flowing correctly. +5. Check for bot/QA traffic — bots often skew toward control. If QA traffic isn't being excluded, recommend enabling that filter. +6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting. +7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.** + +--- + +## 2. Retro A/A (pre-experiment bias) failure + +**What the platform tells you**: the pre-experiment-bias analysis the platform attaches when that check is enabled in the experiment's settings. + +### What it means + +The same statistical comparison run on the **pre-exposure** period revealed that variant cohorts already differed _before_ the treatment started. Any "lift" measured during the experiment may just be reflecting that pre-existing gap, not the change. + +- Pre-experiment bias on a **primary** metric is a **stop-and-investigate** signal. +- Pre-experiment bias on a **secondary** metric is informational only. + +### Investigation checklist + +1. Identify which metric × variant pair triggered the failure (after the platform's correction). +2. Check whether bucketing was deterministic — non-deterministic assignment in the pre-period means users were assigned to different variants than they would have been in production. +3. Look for cohort skew: did one variant disproportionately receive heavy users? Query the metric pre-experiment grouped by variant to confirm. +4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort. +5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing. + +--- + +## 3. Insufficient exposures + +**What the platform tells you**: per-variant exposure counts plus an "insufficient" flag when the count is too low to trust. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue. + +### Investigation checklist + +1. Check per-variant exposure totals — which variant is undersampled? +2. Inspect feature-flag rollout — was rollout dialed back? +3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?). +4. If the experiment is still ACTIVE: extend duration via an experiment update with a new end target. +5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math. + +If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question. + +--- + +## 4. Frequentist peeking + +**What to check**: the experiment's testing model and whether it ended before reaching its configured end condition (sample size or duration, whichever was configured). + +### What it means + +A frequentist test that ends before reaching its configured target has an **inflated false-positive rate**. The math assumes a fixed sample size; peeking before that point and stopping on a favorable look is exactly what "p-hacking" looks like in production. + +### Investigation checklist + +1. Confirm the testing model is frequentist (sequential tests don't have this problem). +2. Compare the actual end date against the planned end (date- or sample-based, whichever the experiment was configured with). +3. If the conclusion was premature: results have inflated false-positive rate. Recommend a re-run. +4. If the user wants to keep current results: caveat strongly. Recommend a sequential testing model for the next experiment so they can stop early without penalty. + +(Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.) + +--- + +## 5. Live computation timeout / broken data + +**What the platform tells you**: a non-null error block on the live results, with the live data path empty. + +### Investigation checklist + +1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy. +2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. +3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision. +4. If the cache is recent (within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or empty, the user must resolve the backend issue before any meaningful interpretation. + +--- + +## 6. Experiment ran < 3 days + +**What to compute (this one is local)**: the elapsed time between the experiment's start and end. + +Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. **Refuse to interpret.** Tell the user explicitly: + +> _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_ + +If the experiment was sample-size-bounded and a tiny target was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window. + +--- + +## 7. Misconfigurations + +These don't always invalidate results, but they change how to _read_ them. Surface them as warnings during the trustworthiness gate. + +### Multiple-testing correction off with several primaries + +**Correction off AND 2+ primaries × 1+ non-control variants.** Any single significant primary may be a false positive — family-wise error rate scales multiplicatively (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). Look at primaries in aggregate: if most point the same direction, the effect is likely real; if only one or two of many are significant, recommend enabling Benjamini-Hochberg or Bonferroni and re-analyzing. + +### Extreme winsorization percentile + +**Winsorization enabled with a percentile far from the platform default (typically 95).** A percentile near 50 caps almost all data — almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason. + +### SRM check disabled + +**SRM check is off.** Often deliberate — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) and then recommend re-enabling SRM and re-analyzing. + +### CUPED on new-users-only cohort + +**CUPED enabled AND the cohort is "new users only".** CUPED needs pre-exposure data, so it had no effect here — but **results are still valid**, variance reduction just didn't happen. Mention as informational. For future experiments on this surface, suggest extending the cohort to include returning users so CUPED can apply. + +### Non-default confidence level + +**Confidence level differs from the platform default (typically 0.95).** `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out in the verdict and combine with metric count to estimate the family-wise error rate. + +### Broken or placeholder metric entries + +**Metric entries with empty names.** Likely broken or placeholder references. Flag and skip during analysis. + +### Primary metric with no computed result + +**A metric is listed as primary but has no result (live or cached).** This is **"no measurement," not "no effect."** Surface prominently; recommend re-syncing results before any conclusion that depends on this primary. + +--- + +## Output shape when a health check fails + +1. **What failed**, in one sentence (use the verdict the platform attached — do not re-derive). +2. **What that means for trust** — cite the Kohavi framing (SRM is #1) or Twyman's Law where it fits. +3. **Likely causes**, ordered most → least probable. +4. **Recommended action** from the small set above. +5. **Investigation checklist** the user can run. +6. **What NOT to do** — usually, "do not act on the current lift / p-value numbers." diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/hypothesis-framing.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/hypothesis-framing.md new file mode 100644 index 0000000..8c115af --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/hypothesis-framing.md @@ -0,0 +1,101 @@ +# Hypothesis framing + +All four properties of a good hypothesis — falsifiable, directional, mechanistic, bounded in time — matter. Drop any one and the design downstream silently degrades. + +## The shape + +> **If** ``, **then** `` will ``, **because** ``. + +| Property | Test | Failure mode | +| ------------------- | ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| **Falsifiable** | Could the data say "no"? | "Improving UX" can't be falsified. "Increasing weekly retention by ≥2pp" can. | +| **Directional** | Is the predicted change up or down? | "Affecting cart size" leaves the polarity ambiguous; the system defaults to `direction: "up"` and the interpretation step misreads regressions as wins. | +| **Mechanistic** | What's the proposed causal chain? | "Because users will see X and decide Y" is a mechanism. "We think it'll work" is not. Without a mechanism, the team can't tell when the metric they picked is actually downstream of the change. | +| **Bounded in time** | Does the predicted effect occur within a measurable window? | Day-30 LTV claims need a ≥30-day experiment. A 2-week test on a 30-day metric guarantees an inconclusive result on the real effect plus a high chance of reaching false significance from noise. | + +## When the user gives you a one-liner + +Ask them to commit to five things, in order. Don't proceed until you have all five. + +1. **The change** — what's different in treatment. A specific UI string, a routing change, a price, a copy variant. Vague ("the new onboarding") is not enough; "the new onboarding which moves the free-item offer to step 1" is. +2. **The primary outcome metric** — one specific event or rate, not a domain. "Engagement" is not a metric; "weekly active users with ≥1 report created" is. +3. **The expected direction** — up or down. (Goes straight into the metric's `direction` field.) +4. **The minimum effect size that would justify shipping** — this becomes the MDE. If the user can't name one, ask: "If the lift turned out to be 0.5%, would you ship?" Their answer reveals the MDE. +5. **The mechanism** — why you expect this to work. The mechanism is what binds the metric to the change. A change to onboarding screens shouldn't be measured by Day-30 retention if no one has gotten to Day 30 yet — the mechanism would say so explicitly. + +## Mechanism → metric class + +The mechanism predicts the _kind_ of metric that should move. Use this mapping as a sanity check: + +| Mechanism flavour | Likely primary-metric class | Anti-pattern | +| -------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- | -------------------------------------------------- | +| Reduces friction at a specific step | Step conversion rate (funnel-typed) | Headline retention metric | +| Surfaces a new option / increases discoverability | Click-through or first-use rate on the surfaced option (conversion) | Total events per user | +| Reorders information / changes salience | Time-to-task, completion rate on the salient step | Account-level revenue | +| Changes the cost of an action (price, paywall, friction) | Conversion-to-paid, refund rate, cancel rate (with `direction: "down"`) | DAU | +| Adds a new content / recommendation system | CTR on recommendations, downstream conversion | Aggregate engagement | +| Long-term retention play (referrals, loyalty) | Day-7 or Week-1 retention as leading proxy; lagging Day-30 stays a post-launch monitor, not a primary | Day-30 retention as primary on a 2-week experiment | + +When the user's mechanism and proposed metric live on different rows of this table, push back — that's the **hypothesis ↔ metric mismatch** pitfall. + +## Hypothesis ↔ metric alignment + +A hypothesis names a specific outcome. The primary metric must measure that outcome — **same population, same denominator, same timeframe**. Common misalignments: + +- Hypothesis predicts a **rate** change; primary metric is a **count** → switch to a rate metric, or use an exposure-rebalanced total. +- Hypothesis predicts effect on **paid users**; primary metric includes free users → add a cohort filter or scope the metric. +- Hypothesis predicts effect **within session**; primary metric is **per-user across sessions** → either narrow the metric or broaden the hypothesis. +- Hypothesis predicts effect **only on a new flow**; primary metric counts events that exist only in treatment → changed-denominator. The lift is artificially infinite. Pick a metric that exists for both arms. + +## When to push back + +Push back hard when: + +- The hypothesis is non-falsifiable. Until it can be tested with a yes/no answer from data, there's nothing to set up. +- The hypothesis is non-directional. The system's `direction: "up"` default is wrong for cancel / error / latency / abandon metrics; leaving it default silently flips polarity at interpretation time. +- The mechanism doesn't predict the proposed metric. Most "experiment didn't work because we measured the wrong thing" post-mortems trace back to here. +- The proposed primary is strongly lagging on the planned duration (retention as primary on a 2-week test). Suggest a leading proxy. + +When you push back, do it once with concrete language ("you said 'improve engagement' — which event do you want to move?"). If the user genuinely wants to leave the hypothesis vague, you can proceed, but log the vagueness in `description` so the post-launch step knows the test was exploratory rather than decisional. + +## Worked examples + +### ✅ Good + +> If we surface a free-item offer during onboarding step 2, then signup→activation conversion will increase by ≥3pp (currently 18%), because reducing first-action friction lowers cold-start dropout for new accounts. + +- Falsifiable: data can say "no, lift was <3pp." +- Directional: up. +- Mechanistic: first-action friction → cold-start dropout. +- Time-bounded: signup→activation is a within-session metric; readable inside any reasonable test duration. +- Mechanism predicts a conversion-class primary; signup→activation conversion fits. + +### ✅ Good (lagging hypothesis, leading proxy primary) + +> If we ship the new referral flow, then Day-30 retention will increase by ≥1.5pp, because referred users have stronger network effects. We will measure Day-7 retention as the experiment primary (historical correlation r=0.78 with Day-30) and keep Day-30 as a post-launch monitor. + +- Bounded-in-time problem is acknowledged and solved with a leading proxy. The lagging metric remains a post-launch check, not a ship gate. + +### ❌ Vague + +> Test the new onboarding. + +- No change description (which change? full redesign or one screen?). +- No outcome. +- No direction. +- No MDE. +- No mechanism. + +Coach: pull each of the five commitments out of the user before going further. + +### ❌ Non-falsifiable + +> The new dashboard will improve the user experience. + +- "Improve user experience" can't be tested. Ask: "Which specific behaviour changes if user experience is better? Engagement events per session? Time to first chart? Dashboards saved per user?" + +### ❌ Mechanism doesn't predict the metric + +> If we change the colour of the CTA button, then 30-day retention will increase by ≥2pp, because users will perceive the product as more polished. + +- Mechanism is plausible at best, but Day-30 retention is far downstream of a button-colour change. Even if the colour change does help, a 2-week experiment won't measure it. Either pick a leading proxy (click-through on the CTA) or shelf the test until you have a more credible mechanism for retention. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/lifecycle-handoff.md new file mode 100644 index 0000000..3a9e24c --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/lifecycle-handoff.md @@ -0,0 +1,39 @@ +# Lifecycle Hand-off + +How to conclude an experiment once the verdict is settled. This reference is **interpretation guidance** — the per-field schema of the decide action lives in the experiment-update tool description. + +--- + +## Confirm before concluding — always + +Concluding an experiment is **irreversible**. Before invoking the decide action, surface the proposed parameters to the user (winning variant, success/fail, rationale message) and wait for explicit confirmation. A SHIP verdict is a recommendation, not an authorization. + +## The three pieces every decide call needs + +A decide call expresses three things: + +1. **Did the experiment succeed?** A win for one of the treatments, or a deliberate stop. +2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special constants below. +3. **Why?** A rationale message — what metrics were evaluated, the polarity reading, the tradeoffs accepted. The platform requires this on every decide call; treat it as a one-paragraph decision record, not a placeholder. + +## Special variant choices for success + +When you have a winning result but no single variant to ship: + +- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. (The platform exposes this as the constant `__no_variant_shipped__`.) +- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The platform exposes this as `__defer_variant_decision__` and shows `SUCCESS_DEFERRED` in the UI.) + +When the verdict is KILL — no winner — record success as false. No variant key is needed in that case. + +## Multi-variant experiments + +For a 3+ arm test, the decide action still names a single winning variant. If two treatments are roughly tied: + +- If both clear the practical-significance bar and shipping either is acceptable, pick on simplicity (smaller diff from control, lower implementation cost). +- If the team genuinely cannot pick, use the defer constant above — better than fabricating a winner. + +A multi-variant test where only one treatment is significantly different from control is a clean SHIP for that variant; the inconclusive arms are simply not the winner. + +## After concluding + +The decision record — the rationale message, the shipped variant, and the experiment's terminal status — becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/metric-selection.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/metric-selection.md new file mode 100644 index 0000000..7088e3c --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/metric-selection.md @@ -0,0 +1,75 @@ +# Metric selection + +Each metric serves exactly one of three roles. The hypothesis tells you which. + +## Primary metrics (1–3 max) + +The metrics whose movement decides ship / no-ship. They come straight from the hypothesis's "outcome will ``" clause. + +- **Cap at 3.** Each additional primary inflates the family-wise false-positive rate. With multiple-testing correction enabled (which is the right default at 2+ primaries), more primaries → tighter per-metric threshold → harder to detect any individual effect. Beyond 3 the math punishes you regardless of how well the test is run. +- **Explicit direction.** Every primary needs `up` or `down`. The platform's default is `up`, which is wrong for cancel / error / latency / abandon / refund metrics. Setting it explicitly at setup time is the only way to keep the polarity correct through interpretation. +- **Leading, not lagging.** A primary must be able to actually move within the planned experiment window. Match the metric's response window to the experiment's duration: + - Onboarding-screen change → activation in the first session, not Week-4 retention. + - Checkout button A/B → checkout conversion, not 30-day LTV. + - Pricing-page tweak → click-through and trial start, not annualised revenue. + - When the only metric the team cares about is lagging, use a **leading proxy** with a known historical correlation to the lagging metric. The lagging metric stays a post-launch monitor, not a ship gate. +- **Prefer rates over counts** when the hypothesis is about behaviour change. "Conversion rate" is interpretable; "total conversions" conflates per-user behaviour with cohort size. + +If the user proposes a primary, sanity-check: + +- _Is this metric downstream of the change?_ (A pricing change cannot move "tutorial completion".) +- _Does the metric exist for both control and treatment users?_ If the change creates new events that don't exist in control, lift is artificially infinite (changed-denominator). +- _Is the metric's response window shorter than the experiment's duration?_ If not, the metric is lagging — pick a leading proxy. +- _Does the metric have enough volume to detect the expected lift?_ (Cross-reference `references/sizing.md`.) + +## Guardrail metrics (0+, strongly recommended) + +Metrics that **must not regress**, even if primaries win. The trustworthiness backstop on a ship decision: a 5% relative regression on any guardrail blocks ship even if the primary wins. This is the **>5% guardrail hard-gate**, and it's the most important single rule in the pitfall catalogue. + +Standard guardrails by domain — pick at least one from the row that matches the change: + +| Change targets… | Guardrail candidates | +| ------------------------------------ | ------------------------------------------------------- | +| Performance / UI / new client code | Page load time, API latency, error rate, crash rate | +| Engagement / activation / onboarding | Weekly active users, session count, Day-7 retention | +| Revenue / monetisation / pricing | ARPU, conversion-to-paid, refund rate, cancel rate | +| Trust / safety / moderation | Complaint rate, unsubscribe rate, support-ticket volume | +| Time-to-task / search / IA | Task abandonment rate, time-to-completion | + +For every guardrail, **set direction explicitly**. A guardrail named "errors" left at the default `up` will silently let regressions slip through interpretation as "wins." + +Same lagging-indicator rule applies: a guardrail that takes 30 days to react can't protect a 2-week experiment. If the user names retention or LTV as a guardrail on a short experiment, recommend a leading proxy (Day-1 or Day-7 retention) and demote the lagging metric to a post-launch monitor. + +## Secondary metrics (0+, diagnostic only) + +Metrics for understanding **why** the primary moved, not for the ship decision. Examples: funnel-step completions, feature sub-use rates, time-on-screen, exploratory cohort breakdowns. + +**Secondary metrics are not decisional.** Even if the user names a secondary in their hypothesis text, they cannot ship/kill on its result. If a metric matters for the decision, it must be primary or guardrail. + +> **Setup misconfiguration to flag.** If the user's hypothesis text names a metric that they then classify as secondary, ask: +> _"You mentioned `` in your hypothesis. Should this be a primary metric? Secondary metrics don't influence ship/no-ship decisions, so if it matters for the outcome, promote it."_ + +This is the **Hypothesis ↔ metric mismatch** pitfall — see [pitfalls.md](pitfalls.md). + +## Sanity checklist + +Run this before locking the metric set: + +- [ ] Each primary directly measures the hypothesis's predicted outcome. +- [ ] Each primary has an explicit direction (not the platform default). +- [ ] At least one guardrail covers the most likely failure mode of the change (perf for UI changes, retention for monetisation changes, etc.). +- [ ] Each guardrail has an explicit direction. +- [ ] No metric whose denominator is created by the treatment itself (changed-denominator). +- [ ] No primary or guardrail is a strong lagging indicator on the planned experiment duration (use leading proxies; demote lagging metrics to post-launch monitors). +- [ ] Total primary count ≤ 3. +- [ ] If primary count ≥ 2 OR non-control variants ≥ 2, multiple-testing correction is on (Benjamini-Hochberg default, Bonferroni for strict family-wise control). +- [ ] For each primary, baseline rate has been pulled from real data (not guessed). + +## Anti-patterns + +- ⛔ **No guardrails to "avoid noise."** Guardrails are the regression detection, not noise. Without them, a winning primary with a quietly regressing latency or refund-rate is a ship — and then a rollback two weeks later. +- ⛔ **Five primaries because "they're all important."** Past 3, the false-positive risk dominates. Pick the 1–3 the hypothesis actually predicts; demote the rest to secondaries. +- ⛔ **Primary = "total signups," metric = behaviour change.** A behaviour-change hypothesis needs a rate metric; total signups conflates per-user behaviour with the size of the cohort that entered the experiment. +- ⛔ **Guardrail left at default direction `up` on an error / cancel / latency metric.** Silently inverts the regression check. +- ⛔ **30-day retention as primary on a 2-week experiment.** Either the lagging metric can't move (no signal) or it moves on noise (false significance). Use a leading proxy. +- ⛔ **Primary metric only exists in treatment.** Changed denominator. Lift is meaningless. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/per-metric-interpretation.md new file mode 100644 index 0000000..e46381c --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/per-metric-interpretation.md @@ -0,0 +1,167 @@ +# Per-Metric Interpretation + +Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_ + +--- + +## The mental model + +Each row in `summary.positive` / `summary.negative` / `summary.no` answers four questions: + +1. **Did the lift go up or down?** — the `summary` bucket name (sign-of-lift, not polarity). +2. **Was the change distinguishable from noise?** — the `significance` field (or the bucket name itself: rows in `summary.positive` / `summary.negative` are significant, rows in `summary.no` are not). +3. **Was the change in the goal direction?** — apply the polarity recipe with `metric.direction`. +4. **Was the change big enough to matter?** — multiply `lift` by the control baseline `value` to get absolute impact, then judge against business context. + +A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any one of those and you're shipping the wrong thing. + +--- + +## Polarity recipe + +Apply the polarity recipe from the spine — see the **Components** section of `SKILL.md`. Treat the bucket name in `summary.positive` / `summary.negative` as sign-of-lift only; the business verdict comes from combining it with `metric.direction`. Examples worth remembering: + +- A row in `summary.positive` with `direction: "down"` is a **regression**. +- A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1% interstitials_shown` lift means less interruption). + +--- + +## Reading the p-value in this platform + +Mixpanel runs a frequentist comparison at the experiment's configured confidence level — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). + +The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread. + +For the general meaning of a p-value (the probability under the null), trust the model's baseline knowledge — don't invent thresholds in either direction. + +--- + +## Reading the lift correctly + +``` +lift = (treatment_mean - control_mean) / control_mean +``` + +- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct. +- If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect." + +--- + +## Verdict phrasing — a small palette + +Pick the phrase that matches the four-question pattern. These are the words to use with users; they map onto the platform's already-computed numbers, so the agent never has to invent thresholds. + +| Pattern (sig × polarity × magnitude) | Plain-language verdict | +| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Significant, polarity positive, magnitude large vs baseline | "**Clear win** — `` moved `` in the goal direction, which is meaningful at this baseline." (apply Twyman's Law if lift > ~30%) | +| Significant, polarity positive, magnitude small vs baseline | "**Statistically significant but practically small** — `` on a `` baseline is ``; confirm with the user whether that clears the business bar." | +| Significant, polarity negative | "**Regression** — `` moved `` against its goal direction. This is a reason not to ship even if other primaries won." | +| Not significant, lift in goal direction, well-powered | "**Likely no effect at the detectable size.** The experiment had enough power to detect ``; the observed lift is below that threshold." | +| Not significant, lift in goal direction, underpowered | "**Inconclusive — too underpowered to call.** Route to the why-no-statsig playbook to decide between wait / extend / restart." | +| Not significant, lift in wrong direction | "**No detectable harm**, but no win either." | +| `lift is None` | "**No measurement** — this variant's row failed to compute. Surface the failure and re-sync." | +| Lift > ~30% on any metric | Prefix with "**Twyman's Law check:** that lift is unusually large; verify the denominator hasn't changed before celebrating." | + +--- + +## Magnitude — make it absolute + +Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful: + +1. Baseline from the control variant's metric value (the experiment-details response carries it on the per-variant row). +2. Lift from the winning row. +3. Absolute lift: `baseline × lift`. Examples: + - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate. + - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`. +4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week." + +### Fallback when the baseline value or sample size is missing + +Common — happens whenever live computation timed out or the cached results were nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.** + +Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation: + +- `unique` (Bernoulli) → conversion **rate** as the baseline. +- `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. Multiplying lift by a raw total double-counts cohort size. + +--- + +## Twyman's Law in practice — changed-denominator lifts + +Before celebrating any lift > ~30%, ask: **did the treatment change who is _exposed_ to this metric, not just how they behave?** + +If the treatment causes more users to _see_ a screen, more events naturally fire — the metric grows because the denominator changed, not because per-user behavior changed. + +- A "Free item" promotion drives more users to checkout → "Checkout Screen Viewed" lifts +1000% mechanically. The interesting question is **conversion rate on the screen**, not raw views. +- A new banner makes a feature discoverable → "Feature Page Viewed" lifts dramatically. **Per-discover-er behavior** may be unchanged. + +When you see a > 30% lift, name the risk explicitly: + +> _"This metric measures exposure to the screen/event. The treatment likely caused more users to be exposed; that explains most of the lift mechanically. The interesting question is what those users did once they got there."_ + +--- + +## Metric distribution types + +Different metric types behave differently; cite the relevant nuance in your verdict. + +| Metric type | Distribution | Interpretation nuance | +| -------------------------------- | ------------ | --------------------------------------------------------------------------------------------------------- | +| Unique users / conversion rate | Bernoulli | Variance = `p(1−p)`. Lift on rates near 50% is most powered; rates near 0% or 100% need much more sample. | +| Event counts / sessions per user | Poisson | Variance = mean. Highly sensitive to power users; consider whether one heavy user can swing results. | +| Revenue / numeric properties | Gaussian | Long tails (whales) inflate variance. Strongly consider Winsorization. | + +--- + +## Variance-reduction & outlier settings that change interpretation + +- **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). +- **Winsorization enabled**: extreme values capped at the configured percentile, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A percentile much lower than the platform default (typically 95) is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). + +--- + +## Multiple comparisons & metric tiers — what's decisional and what isn't + +| Tier | How it influences the verdict | +| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Primary** | **Decisional.** The platform auto-applies correction when the experiment is configured for Bonferroni or Benjamini-Hochberg (across primaries × variants). | +| **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude. | +| **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. | + +If multiple-testing correction is off AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled. + +--- + +## When a primary metric is inconclusive + +A "not significant" verdict means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level — **not that there is no effect.** Important when the user is about to call something a null result. + +For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see [why-no-statsig.md](why-no-statsig.md). + +--- + +## Frequentist vs Sequential — what affects per-metric reading + +Concluding a Frequentist experiment before it reaches its configured target is a peeking event — per-metric significance verdicts become unreliable. Sequential experiments are designed for continuous monitoring and don't have this problem. + +For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of [health-check-interpretation.md](health-check-interpretation.md). + +--- + +## Triggered analysis & dilution + +If the change only affects a subset of users (e.g. only triggers when a specific button is shown), the **effect on triggered users** is much larger than the **effect on the full exposed population**. + +- Triggered analysis zooms in on users who actually saw the change. +- Dilution math: `population_lift = triggered_lift × (triggered_users / total_exposed)`. + +The platform doesn't auto-compute triggered analysis. If the change is gated by a condition, ask the user about the trigger rate and walk through the math before declaring the population-level lift "small." + +--- + +## Novelty and primacy + +- **Novelty** — lift is large early, then decays as users habituate. +- **Primacy** — lift is small or negative early, then grows as users learn the new behavior. + +To detect either, look at the line-chart view of the metric (date-segmented). A monotonic decay from day 1 → day 14 is classic novelty; the steady-state lift is what matters for shipping. Call this out when interpreting any experiment shorter than ~2 weeks. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/pitfalls.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/pitfalls.md new file mode 100644 index 0000000..5ce40a1 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/pitfalls.md @@ -0,0 +1,93 @@ +# Pre-launch pitfalls + +Catalogue of the deterministic checks to run before the user creates an experiment. Detection logic lives in the platform's pre-launch validation capability; this document owns the prose — the _why_ behind each check — so the agent can explain the violation in human terms rather than just nagging. + +## Triage order + +Surface pitfalls in this order: + +1. **Blockers first.** An experiment that triggers a blocker should not launch as-is. Two today: **insufficient duration** for the configured MDE on available traffic, and **cohort too small** to supply enough eligible users. Both mean the experiment literally cannot reach statistical power. +2. **Warnings next.** Configuration smells that would degrade interpretability or trustworthiness. Most pitfalls fall here. +3. **FYIs last.** Soft nudges; not blocking even if the user ignores them. + +Within a severity tier, surface in this order (most actionable first): data-trust risks (pre-experiment bias, variance inflation) → configuration nudges (guardrails, hypothesis alignment). + +## The >5% guardrail hard-gate + +The single most important rule in the catalogue. **A 5% relative regression on any guardrail blocks ship even if the primary wins.** + +### Why 5% + +The threshold is calibrated to be tight enough to catch real degradations of user experience, revenue, or performance, and loose enough that day-to-day noise on a moderately-volatile guardrail doesn't trip it on every test. + +- Below 5%: typically within the noise band of most guardrails on a 2-week test. Tightening below 5% would generate too many false alarms. +- Above 5%: the team has implicitly traded measurable user/revenue/performance damage for headline-metric lift. That's not a ship — that's a re-design. + +### Why "hard gate" + +Guardrails are not "things to also look at." They are the **trustworthiness backstop**. A winning primary with a regressing guardrail means the change _exchanged_ something the team agreed must not regress for the headline-metric lift. If guardrails are negotiable, they aren't guardrails. + +### Why explain it to the user + +The most common reaction to a guardrail regression is "but the primary won, can't we just ship?" The agent's job is to make the trade-off explicit: + +> "Primary metric `` won by +2.3pp, but guardrail `` regressed by 7.4%. The 5% threshold exists because guardrails are the trustworthiness backstop — a winning primary with a regressing guardrail means you've traded `` for ``, which is a design choice that needs explicit sign-off, not a ship decision." + +If the team genuinely wants to make that trade, they can disable the guardrail before launch and document the decision in the experiment's description. Don't let them silently override; force the conversation. + +--- + +## The catalogue + +### Insufficient duration for the configured MDE — blocker + +**Expected exposures over the planned window cover less than 50% of the required per-arm sample.** The experiment cannot reach statistical power for this MDE no matter how clean the rest of the config is. The most likely outcome is "inconclusive," and a non-trivial fraction of those inconclusive results will be noise crossing the significance threshold rather than a real effect (the winner's-curse problem). Extend planned duration to cover the required sample, OR relax the MDE (only ship if the lift is bigger), OR pick a higher-volume primary metric, OR enable CUPED if pre-exposure data is available (cuts required sample 30–70%). + +### Cohort too small — blocker + +**Eligible cohort size is smaller than (number of arms × per-arm target).** Same root cause as the duration blocker, different lever. Even with infinite time, the experiment will run out of eligible users. Either expand the cohort to comfortably exceed (number of arms × per-arm target) eligible users (relax filters, broaden segment, extend eligibility window), or lower the per-arm target to what the cohort can supply (and accept the larger achievable MDE). + +### Pre-experiment bias likely — warning + +**Retro A/A is enabled, at least one continuous-ish metric (continuous, retention, or funnel) is configured, AND CUPED is off.** Pre-experiment bias is likely on metrics with seasonality or power-user skew. Without CUPED to absorb the baseline difference, post-experiment lifts inherit it — the team sees "treatment up 2%" when the real treatment effect is 0% and the baseline difference is +2%. Enable CUPED with a 2–4 week pre-exposure window; it specifically regresses out the pre-exposure baseline difference. + +### High variance, no Winsorization — warning + +**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the platform default percentile (typically 95). Push back if the user sets a percentile below ~80 — more than 20% of values capped is almost always a misconfiguration. + +### Multiple primaries, no correction — warning + +**≥2 primary metrics configured AND multiple-testing correction is off.** Family-wise false-positive rate compounds with each additional primary: at 3 primaries ~14.3%, at 5 ~22.6% — more than one in five "wins" is noise. Enable multiple-testing correction. Default to Benjamini-Hochberg (more powerful with correlated metrics); use Bonferroni for strict family-wise error control. + +### Marginally underpowered duration — warning + +**Expected exposures cover 50–100% of the required per-arm sample.** The experiment might reach significance on a true effect; it might not. Either way, the lift estimate at conclusion will be wider than expected. Extend duration to reach 100%+ of the required sample, or accept the higher Type-II error rate. Less urgent than the insufficient-duration blocker. + +### Missing guardrails — warning + +**Zero guardrail metrics configured.** Without guardrails, there's no >5% hard-gate to block a ship on a regression. The team is implicitly trusting that the primary captures every relevant impact — rarely true. Add at least one guardrail covering the most likely failure mode of the change: + +- UI change → page-load time or error rate. +- Monetisation / pricing → cancel rate or refund rate. +- Engagement change → Day-7 retention or session count. +- Performance change → error rate or crash rate. + +### Hypothesis ↔ metric mismatch — warning + +**The hypothesis mentions a canonical metric noun (conversion, retention, revenue, signup, engagement, click, purchase) but no primary's name appears to measure that outcome.** Soft signal — the heuristic is coarse, but it catches the common case where the user wrote "X will increase conversion" and then set the primary to "session count." Phrase as a question, not a verdict: _"Your hypothesis mentions ``, but no primary metric name suggests it measures that. Should `` be replaced or supplemented with a metric that more directly tests the hypothesis?"_ + +### Primary lacks a leading indicator — warning + +**Primaries include a retention-type metric AND no leading-indicator secondary (conversion or funnel type) is configured.** A retention primary is valid but reads slowly — there may not be enough signal to interpret results before the experiment concludes. Add a leading-indicator secondary measured within the experiment runtime; the retention primary stays as the ship decision, the secondary just gives early visibility. + +--- + +## Detection vs prose + +The detection math lives in the platform's pre-launch validation capability. The prose lives here. The platform reports which check fired; the agent renders the human-readable message. When the platform's thresholds change (e.g., the 50% / 100% bounds on the underpowered checks), the recommendation language in this document needs to track — the agent will quote stale numbers otherwise. + +## What's not in the catalogue (yet) + +- **Cross-test contamination** — when the same users are eligible for multiple concurrent experiments on the same surface. Hard to detect statically; usually surfaces as anomalous variance at interpretation time. +- **Novelty effect detection** — early days of the experiment show inflated treatment effect, then settle. Not a pre-launch check; lives in the post-launch interpretation skill. +- **Seasonality misalignment** — running a 2-week experiment that doesn't align to weekly cycles. Today this is detected indirectly via the duration check; a future explicit seasonality-alignment check is a reasonable add. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/prior-experiments.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/prior-experiments.md new file mode 100644 index 0000000..60d083c --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/prior-experiments.md @@ -0,0 +1,81 @@ +# Prior experiments + +The first thing to do when a user proposes an experiment on a feature is look up prior experiments on that feature. Skipping this leads to redundant tests, contradictory ship decisions, and wasted traffic. + +## The lookup + +Search the project's prior experiments by keywords drawn from the feature name and surface area. Cast the net wide on the first call — single-keyword searches catch related experiments the user may have forgotten about. + +If no prior-experiments lookup is available in the current environment, tell the user explicitly that you couldn't check and proceed. Don't fabricate "no prior tests found" — that's worse than admitting the blind spot. + +## What to do with what you find + +### Same feature already tested and shipped + +Reference the prior result before recommending a new test. The right answer is often "don't re-run; iterate on a new hypothesis." + +> "There's a prior experiment from [date] on the same feature with a similar hypothesis: it shipped at +X% on metric Y. Re-running won't tell us anything new. What's different about the change you're proposing? Is the new hypothesis about a different sub-population, a different metric, or a different mechanism?" + +If the user does want to re-run (e.g., the population has shifted significantly, the underlying product has changed, or the prior test was clearly underpowered), proceed — but design the new test to specifically address what's different from the prior. + +### Same feature tested and killed + +Treat this as a strong prior. Ask why the user thinks the new variant will work where the prior didn't. + +> "Prior experiment [date] on the same surface killed at [-X% / inconclusive]. What's different about your change that should produce a different outcome? If the prior failed because of [mechanism], does your change address that?" + +If the user can articulate a different mechanism, run the new test. If they can't, the most likely outcome is a repeat of the prior result — discourage the test or downgrade its priority. + +### Earlier iteration of the same hypothesis + +Use the prior result to inform the new design — specifically, **baseline rates and variance estimates**. Prior data is much more reliable than guessing. + +- Pull the prior's control-arm baseline rate; use it as the baseline for the new sizing calculation. +- Pull the prior's observed variance; use it instead of estimating from scratch. +- Pull the prior's exposure rate (exposures per day per variant); use it to set a realistic duration estimate. + +This often shrinks the required sample size or shortens the planned duration. Both are wins worth surfacing. + +### Recently concluded with similar metrics + +Pull the realised exposure rate. The "expected exposures per day" the user has in mind is usually higher than what actually shows up in a real experiment on the same surface — eligibility filters, opt-outs, and bot exclusion all bite. Use the prior's actual rate, not the theoretical one. + +### Multiple prior experiments on adjacent surfaces + +Look for **patterns**, not single data points. If three prior tests on the same funnel stage all moved in the same direction by similar magnitudes, that's the realistic prior for what the new test will do. If the prior tests are noisy or contradictory, treat the new test's expected lift with more uncertainty and consider running it longer. + +## Folding prior results into the new design + +Concretely, when you have a prior result that's relevant, the setup workflow changes as follows: + +| Step | Without prior | With prior | +| -------------------------- | ----------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------- | +| Step 1 — hypothesis | Coach from scratch | Anchor on the prior's hypothesis; ask what's different | +| Step 2 — metric selection | Suggest standard primaries/guardrails | Use the prior's metric set as the default; modify only with reason | +| Step 3 — sizing | Query baseline + variance over the prior window | Use the prior's observed baseline and variance | +| Step 4 — statistical model | Default to sequential / benjamini-hochberg | If the prior used a specific model and the team is comparing across tests, keep the same model for comparability | +| Pitfall check | Run the standard catalogue | Cross-reference: did the prior have an SRM problem? A guardrail regression that should be set up as primary this time? | + +## When prior tests warn you away from testing at all + +Sometimes the prior data tells you the right answer is **don't run the experiment**: + +- The metric the user wants to move has been tested 4 times on this surface in the last year, all with inconclusive or null results, all adequately powered. The hypothesis-space is likely exhausted; suggest a different mechanism or a different surface. +- The baseline rate is so low that even the prior, well-powered tests couldn't detect anything below a 30% relative lift. The new test would inherit the same constraint. Either pick a higher-volume proxy metric or accept that the change has to be very large to be detectable. +- Recent guardrail regressions on the same surface suggest the surface is unstable; running more experiments without first fixing the trust issue is wasted traffic. + +Surface these findings as recommendations, not blockers. The user might have context the prior data doesn't capture. + +## What to record about the new design's relationship to prior tests + +In the experiment's description, link to the prior experiment(s) and note how the new design differs. This becomes critical at interpretation time — the post-launch step uses the prior context to calibrate its read of the new result. + +A useful template: + +``` +Prior: tested on , result: . +This experiment differs by: . +Inherited from prior: baseline rate (X%), σ², exposure rate (N/day/variant). +``` + +This is a 30-second annotation that pays back tenfold at analysis time. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/routing-xp-vs-ff.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/routing-xp-vs-ff.md new file mode 100644 index 0000000..d077dba --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/routing-xp-vs-ff.md @@ -0,0 +1,85 @@ +# XP vs FF: routing intent + +Before any setup work, decide whether the user actually wants an **experiment** (XP) or just a **feature flag** (FF). The decision is binary, but the language users use is blurry — "let's A/B test this" sometimes means "let's run a controlled experiment with a hypothesis and a stopping rule," and sometimes means "I want to ship it to 10% of users and see if anything breaks." + +## The discriminator + +| If the user wants… | Then it's a… | +| --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ | +| Causal evidence — "does this change move metric X by enough to justify shipping?" | **Experiment** (XP). | +| Progressive rollout — "ship to 10%, then 50%, then 100% if nothing breaks." | **Feature flag** (FF). | +| Kill-switch — "I want to be able to turn this off instantly if it goes sideways." | **Feature flag** (FF). | +| Per-segment gating — "only show this to enterprise customers." | **Feature flag** (FF). | +| Targeted access — "give beta access to these 50 design partners." | **Feature flag** (FF). | +| Both — "ship to 10%, but also tell me if it moves checkout conversion." | **Experiment** with a phased rollout, or **FF + a separate experiment** later. | + +The clean way to think about it: a feature flag is a **delivery mechanism**. An experiment is a **decision mechanism** built on top of one. Every experiment uses a feature flag under the hood (Mixpanel auto-creates one when an experiment is created); not every feature flag use case needs an experiment. + +## Disambiguation prompt + +When you can't tell from the user's wording, ask once, plainly: + +> "Are you trying to **measure** whether this change moves a metric (experiment), or are you rolling it out gradually / behind a flag with **no measurement criterion** (feature flag)? An experiment commits to a hypothesis, metrics, and a stopping rule; a feature flag is purely a delivery mechanism." + +Listen for these signals in the answer: + +- "I want to see if it improves X" / "if checkout conversion goes up" → experiment. +- "I want to make sure it doesn't break X" → could be either. Probe: "Is 'doesn't break' a measurable threshold, like a guardrail, or is it 'I'll watch dashboards and roll back if it's obviously bad'?" +- "I want enterprise to get it first" / "I want to roll out by region" → feature flag. +- "I just want a kill switch" → feature flag. +- "I want to ship it and prove ROI later" → ask whether the proof needs to be causal. If yes, that's an experiment, and it should be set up _before_ shipping, not after. (Post-hoc ROI claims from a flag rollout are not credible.) + +## Common ambiguous cases + +### "Ship to 10% as an experiment" + +Often this means "phased rollout, monitor metrics, ramp if nothing regresses." That's a feature flag with manual ramp logic, not an experiment. + +Ask: "Do you have a primary metric you're committing to before launch, with an MDE that decides whether to ship to 100%?" If yes, run as an experiment. If no, ship as a flag. + +### "I want to test the new pricing on enterprise customers" + +If "test" means "see how they react and decide whether to roll out," and the audience is small (a few enterprise customers), that's a **rollout**, not an experiment. Enterprise samples are usually too small to power an experiment, and the per-account variance is too high for a meaningful aggregate. + +Run as a flag, gather qualitative feedback, and decide based on the conversations — not on a p-value computed from N=4. + +### "Hold out a control while we ship to 100%" + +This is the classic "holdout experiment." Legitimate use case, but it has to be set up as an experiment up front (with a primary metric and a duration), not retroactively. After-the-fact holdout analysis suffers from selection bias and is not credible. + +If the user has already shipped to 100% and wants to "analyse the effect," there is no experiment to set up. Tell them so, and suggest a forward-looking test on the next change to the same surface. + +### "Just give me an A/B test, the simplest one" + +Probably an experiment. But "simplest" usually means "skip hypothesis, skip MDE, skip guardrails," which kills the test's interpretability. Coach the user through Step 1 (hypothesis) and Step 2 (metrics) of the main workflow — the cost is 10 minutes; the value is having a result you can actually act on. + +### "I want a feature flag but with stats" + +Now you're back to an experiment. Run the full setup workflow. + +## What changes once you've routed + +### If experiment + +Continue with the four-step setup workflow in the main `SKILL.md`. The output of this skill is a configured experiment ready to launch. + +### If feature flag + +This skill stops. Hand off to the `manage-feature-flags` skill: + +- They configure variants, targeting, and rollout percentages directly. +- No hypothesis, no MDE, no stopping rule needed. +- Mixpanel doesn't compute lift or significance on a flag — they're on their own for observation. + +Make sure the user understands the trade-off explicitly: "Choosing flag means you give up the ship/no-ship decision criterion. If later you want to claim the change worked, that claim won't have the same evidentiary weight as a properly-designed experiment." + +## Don't run an experiment when + +There are cases where an experiment is technically possible but the wrong move: + +- **Sample is too small.** Enterprise rollouts to ~10 accounts cannot power a real test. Ship as a flag and use qualitative feedback. +- **Treatment is risky/irreversible.** A real billing change with potential refunds shouldn't run as a 50/50 split — phase as a flag with conservative rollout and direct monitoring. +- **No baseline data.** Brand-new metric, brand-new feature, no historical observation. Run a 1–2 week passive observation period first, then design the experiment from real numbers. +- **Hypothesis is "let's see what happens."** No directional commitment means the test will be interpreted post-hoc, which is the same as not running an experiment. + +Suggest the alternative explicitly so the user doesn't feel rejected — "this isn't an experiment-shaped problem; here's what to do instead." diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/segment-breakdown-interpretation.md new file mode 100644 index 0000000..98c7bbc --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/segment-breakdown-interpretation.md @@ -0,0 +1,99 @@ +# Segment-Breakdown Interpretation + +Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. + +--- + +## The mental model + +A segment breakdown asks: _did the treatment affect different user segments differently?_ It has three possible outcomes per segment: + +1. **The segment moved in the same direction as the overall effect**, with similar magnitude → reinforces the overall verdict; nothing new. +2. **The segment moved much more or less than overall**, but in the same direction → heterogeneity; the effect is concentrated in a subset. +3. **The segment moved in the _opposite_ direction** to overall → Simpson's paradox or a real reversal — this is where segment analysis earns its keep. + +Reading a segment breakdown well means recognizing which of those three you're looking at and not mistaking noise for any of them. + +--- + +## Per-segment polarity recipe — apply per row + +The same recipe from the per-metric reference applies _inside_ each segment. Don't take a shortcut. + +- For each segment × metric × non-control variant, look at the row's `lift` and bucket (positive/negative/no). +- Translate sign-of-lift into business polarity using `metric.direction`. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary. +- Filter out the control row in each segment. + +Surprisingly easy to forget when you're scanning a wide table — re-apply polarity per row. + +--- + +## Sample-size floor per segment + +Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. The platform surfaces an "insufficient exposures" flag at the overall level — trust that signal over a hand-rolled threshold, and apply the same logic per segment. + +- Segments the platform would flag insufficient if scoped to alone → mark "insufficient sample, treat as directional only." +- A "significant" lift on a tiny per-variant segment (e.g. tens of users) is almost always noise. Say so. +- If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice. + +--- + +## Heterogeneity vs Simpson's paradox vs noise + +| What you see | Interpretation | +| --------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Most segments lift positive, one or two negative, all with overlapping CIs | **Noise.** Not heterogeneity. Don't ship a segment-specific story. | +| One segment lifts much more than the rest, with a tight CI and a clear mechanism | **Real heterogeneity.** The change is concentrated in that segment. Consider shipping only to that segment, or revising the hypothesis. | +| Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. | +| Two opposite-direction effects in different segments that roughly cancel overall | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses. | + +When you spot Simpson's paradox, route the user to the **SRM** section of [health-check-interpretation.md](health-check-interpretation.md) — bucketing is usually the cause, not a real reversal. + +--- + +## What a "ship only to segment X" recommendation requires + +Don't recommend a segment-scoped ship unless **all** of these hold: + +1. The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it). +2. The segment's per-variant sample clears whatever exposure floor the platform applies to the overall experiment, by a comfortable margin. +3. The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment. +4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product. +5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply. + +Otherwise, the segment-only ship is a post-hoc story dressed up as a decision. Recommend confirming with a follow-up experiment scoped to that segment. + +--- + +## When a segment loses but overall wins + +This is the everyday case of mixed effects. + +- If the losing segment is small and its absolute hit is acceptable, ship to all — but call out the loser in the rationale. +- If the losing segment is large or has a guardrail regression, recommend iterate, not ship. +- If the losing segment is a regulated / strategic cohort (paying tier, top customers, EU), default to iterate — guardrails on the cohort, not just overall. + +--- + +## What NOT to do + +- ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition. +- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment query fallback — they're not corrected unless the platform did it. +- ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal. +- ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism. +- ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence). + +--- + +## Output shape + +1. **One-sentence segment-level summary** — homogeneous, heterogeneous, or Simpson's-suspicious. +2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered). +3. **What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's." +4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating). + +--- + +## Platform support status + +Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the rules above. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/segment-of-interest-selection.md new file mode 100644 index 0000000..4db49ac --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/segment-of-interest-selection.md @@ -0,0 +1,116 @@ +# Segment-of-Interest Selection + +Pick 3–5 segments **likely to reveal a real effect difference** before slicing every available dimension and ending up p-hacking. + +The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them. + +--- + +## Why this matters: the fishing-expedition problem + +If you slice an experiment by every available property (10 platforms × 20 countries × 5 plan tiers × …), you will find "significant" segment-level effects by chance alone. The family-wise false positive rate explodes the same way it does for too many primary metrics — except there's usually no platform-level correction across segments. **Pre-committing to a small set of segments, ordered by hypothesis-driven probability, is the discipline that makes segment analysis credible.** + +Aim for 3–5 segments, max. If the user wants more, ask which ones are connected to the hypothesis and which are exploration. Mark the exploration set as "hypothesis-generating, not decisional." + +--- + +## The decision tree for picking segments + +Walk through these in order. The first match is the most defensible pick. + +### 1. Segments the hypothesis explicitly names + +If the experiment's `hypothesis` (or `description`) text mentions "new users", "mobile", "Pro tier", "EU customers" — those segments are pre-committed by the experiment design. Always include them. + +Look at: + +- `experiment.hypothesis` +- `experiment.description` +- The setup-side conversation, if present + +These are not exploratory; they're the variables the team committed to test. + +### 2. Segments where the mechanism is expected to matter + +The hypothesis names _what_ the change is and (ideally) _why_ it should work. The "why" tells you which user attributes plausibly moderate the effect: + +| Hypothesis mechanism | Segments likely to moderate the effect | +| ------------------------------------------------- | -------------------------------------------------- | +| "Reduces first-time friction in onboarding" | New vs returning; signup source; locale | +| "Improves discoverability of feature X" | Users who previously used X vs not; tenure | +| "Speeds up a slow flow" | Platform (mobile slower than web); connection type | +| "Lowers payment friction" | Plan tier; payment-method type; geography | +| "Replaces a confusing UI element" | New vs returning (returning users habituated) | +| "Surfaces a feature only relevant to power users" | Engagement-tier cohorts; tenure | +| "Localized copy / pricing change" | Country / language | + +If you can't articulate _why_ a segment should respond differently, it's not a hypothesis-driven slice. Demote it. + +### 3. Segments where the **denominator** plausibly differs + +Some properties don't change _behavior_ but change _who gets exposed_. Slicing on these helps catch changed-denominator artifacts before they're called a win. + +- Triggered vs untriggered cohorts (if the treatment only fires on certain pages). +- Platform / app version (the treatment may only ship on a subset of clients). +- Device class (mobile vs desktop) when the change is platform-specific. + +A 1000% lift in `Checkout Screen Viewed` overall usually disappears once you condition on "users who reached the checkout funnel" — that disappearance is the finding. + +### 4. Segments where SRM or baseline shift is suspected + +If overall SRM is borderline (or failing in one variant only), per-segment SRM can localize the bucketing bug to a specific platform / country / cohort. Examples: + +- iOS vs Android (often the SDK bucketing layer differs). +- Bot-suspicious countries (`bot_traffic` cause from health-check). +- A specific app version range that shipped a flag-evaluation change. + +This is diagnostic segmentation, not interpretation segmentation. Use it when the **trustworthiness gate** has already flagged trouble. + +### 5. Segments the platform de facto requires + +Some user dimensions are so foundational that any results report should mention them once: + +- **Platform** — web vs iOS vs Android. +- **New vs returning** — defined as first session within the experiment window vs before. +- **Geo region** — EU vs US vs APAC, when results meaningfully differ by regulatory or payment context. + +Don't include all three blindly — pick the one(s) most likely to vary given the change. + +--- + +## Sanity checks before committing to a slice + +For each segment you want to break down on: + +1. **Does each segment value have enough exposed users per variant to clear the platform's overall sufficiency threshold?** Below that, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment. +2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis. +3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. "Pro tier" boundaries changed mid-test) invalidate the comparison. +4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes — slicing on them is selection-bias, not stratification. + +--- + +## How many slices to commit to + +| Situation | Number of slices | +| ----------------------------------------------------------------- | ------------------------------- | +| Hypothesis-driven, well-powered, decisional | 3–5 segments, named upfront | +| Exploratory ("anything weird?"), flagged as hypothesis-generating | Up to ~10, with explicit caveat | +| Diagnostic (chasing a failing SRM or strange overall result) | Whatever helps localize the bug | + +If the user wants to "just look at everything", push back: pick the top 3–5 with reasoning, then offer a separate exploratory pass that won't be used for the ship decision. + +--- + +## The pre-commit ritual + +Before running the breakdowns, tell the user something like: + +> _"Based on the hypothesis (``), I'd slice by `` and `` because ``. I'm intentionally not slicing `` because they don't connect to the proposed mechanism — looking at every dimension makes false positives almost guaranteed. We can do an exploratory pass after, separately from the ship decision. Sound right?"_ + +Pre-commitment is what separates "segmentation analysis" from "fishing." + +--- + +## Then read the results + +Once the segment breakdown is in hand, switch to [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/session-replay-analysis.md new file mode 100644 index 0000000..7282bb4 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/session-replay-analysis.md @@ -0,0 +1,109 @@ +# Session-Replay Analysis Guidance + +Turn a quantitative experiment result into a behavior story using session replays. + +> **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. + +--- + +## When replays help, when they don't + +| Question | Replays help? | +| ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | +| "Why is conversion lower in treatment?" | Yes — behavior diff is observable. | +| "Why is `Checkout Screen Viewed` 10× higher in treatment?" (changed-denominator suspect) | Yes — replays show whether users are _bouncing_ or _converting_ after they get there. | +| "Why is `time_on_page` higher in treatment?" | Yes — distinguishes engaged reading vs confused dwell. | +| "Is the treatment shipping a regression on iOS only?" | Sometimes — better answered first by segment breakdown. | +| "Why is SRM failing?" | No — replays don't show bucketing. Go to health checks. | +| "What's the lift?" | No — replays are qualitative; they explain _why_, not what. | +| "Why hasn't this hit statsig yet?" | No — that's a sample/power question, not a behavior question. | + +A useful heuristic: replays answer _behavioral_ questions. If the question isn't behavioral, replays will burn time without adding signal. + +--- + +## Cohort selection: which replays to compare + +You're looking for **paired contrast**, not a random sample. Pick the cohort that maximizes signal for the specific question. + +| Question | Cohort A (replays to pull) | Cohort B (replays to pull) | +| -------------------------------------------------------------------- | ---------------------------------------------------------- | ----------------------------------------------------------- | +| Why is primary metric down in treatment? | Treatment users who **failed** the primary action | Control users who **succeeded** at the primary action | +| Why is a guardrail regression appearing? | Treatment users who **triggered** the guardrail negatively | Control users who did NOT trigger it | +| Why does treatment have a huge lift in `Screen Viewed` (denom shift) | Treatment users who reached the screen | Same users, looking at whether they completed the next step | +| Why is engagement higher / lower in a specific segment? | Treatment users in that segment | Control users in the same segment | +| What does the new UI look like in practice? | Any treatment users who saw the change | Any control users to confirm the baseline UI | + +**Aim for ~5 replays per cohort.** Fewer and you're anecdote-shopping; many more and you'll just confirm what the first 5 already showed. If the first 5 are inconclusive or contradictory, pull 5 more before changing tactics. + +Filter by recency — replays from the most recent days of the experiment best reflect steady-state behavior (avoid novelty / primacy noise). + +--- + +## What to actually watch for + +Go in with a hypothesis from the quantitative result. Don't watch replays blank-eyed; you'll see "users using the app" and learn nothing. + +### Friction / failure patterns + +- **Hesitation** — long pause before clicking a key element (often signals confusion). +- **Misclicks** — clicking non-interactive elements, or rage-clicking a button that didn't work. +- **Form abandonment** — typing into a field, then leaving without submitting. +- **Back-button bounce** — landing on the page, then immediately backing out. +- **Scroll-and-leave** — scrolling without engaging, then exiting. + +If treatment has more of these than control, you have a behavior explanation for a primary loss or guardrail regression. + +### Layout / discoverability issues + +- **CTA below the fold** — users never scrolling to where the new button is. +- **Element overlap on mobile** — the treatment looks fine in desktop testing but breaks on small screens. +- **Hidden state** — a tooltip / modal that fires once and is then gone, so the user never sees the key affordance. + +These usually explain segment heterogeneity (loss concentrated in mobile, or in a specific viewport size). + +### Changed-denominator behavior + +If you're investigating a Twyman's-Law-sized lift, look for: + +- **Users landing on the new screen and immediately leaving** — explains the inflated `Viewed` event without explaining real conversion. +- **Users completing the rest of the funnel at a much lower rate per-arrival** — explains why the headline metric grew but downstream metrics didn't follow. + +If treatment users _arrive_ at a screen more often but _complete_ at a lower per-arrival rate, the "lift" is a denominator artifact and the per-converter behavior is the real story. + +### Variant-specific UI issues + +- **Treatment showed the wrong copy / wrong asset** — surprisingly common; treatment shipped, but to a subset of routes only. +- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to [health-check-interpretation.md](health-check-interpretation.md). +- **Treatment fired twice / persisted state across sessions** — implementation regression. + +--- + +## How to frame the findings + +Replay analysis is qualitative. Be honest about that. + +- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in the experiment's results."_ +- ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict. + +Tie observations back to specific quantitative results from the experiment-details response. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior. + +--- + +## What NOT to do + +- ❌ Use replays to override a clear quantitative verdict. If primaries say "ship" and replays look ugly, the ugliness might be edge cases — confirm with segment analysis first. +- ❌ Cherry-pick a single dramatic replay. n=1 is anecdote. +- ❌ Replace segment analysis with replays. Replays explain _behavior_; segments explain _who_. Different questions. +- ❌ Pull replays from broad cohorts ("all treatment users") — the contrast pair is what reveals signal. +- ❌ Spend more time on replays than on the headline interpretation. The decision tree comes first; replays are the explanation step after it. + +--- + +## Output shape + +1. **The quantitative result the replays are explaining** — link back to the specific metric and verdict. +2. **Cohorts watched** — what filters were applied to A and B, how many replays in each. +3. **Patterns observed**, with counts (e.g. "4 of 5 treatment replays showed X; 0 of 5 control replays did"). +4. **The explanation hypothesis** — careful to frame as hypothesis ("consistent with"), not as proof. +5. **Recommended next action** — usually one of: ship anyway (regression edge-case), iterate (fix the friction), kill (treatment is materially worse), or run a follow-up A/B with the fix. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/sizing.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/sizing.md new file mode 100644 index 0000000..7c41c9a --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/sizing.md @@ -0,0 +1,109 @@ +# Sizing the experiment + +You almost never know the right sample size by guessing. Pull the data first, then run the math. + +## The standard formula + +Required sample size per variant (two-sample, two-sided test at 95% confidence, 80% power): + +``` +n = 16 × σ² / d² +``` + +Where: + +- `σ²` = variance of the metric (depends on metric type — see below). +- `d` = MDE in the same units as the metric. + +The `16` is `(z_{α/2} + z_{β})² × 2` rounded to a workable constant — `(1.96 + 0.84)² × 2 = 15.68 ≈ 16`. Good enough for setup-phase reasoning; for ship-decision rigour use the precise formula in `references/statistical-model.md`. + +## Variance by metric type + +- **Bernoulli (conversion rate).** `σ² = p(1−p)` where `p` is the baseline conversion rate. Variance peaks at `p = 0.5` (variance 0.25) and shrinks toward 0 at `p = 0` or `p = 1`. Lifts are easier to detect on rates near 50%, harder near the extremes. +- **Poisson (event counts per user).** `σ² ≈ mean count per user`. High-count metrics need proportionally more sample. +- **Gaussian (revenue, time-on-page, etc.).** Compute `σ²` from historical data directly. Long-tailed distributions have high variance — Winsorization (`references/advanced-features.md`) cuts this. + +## Worked example + +Detecting a 5% **relative** lift on a 10% baseline conversion rate at 80% power, 95% confidence: + +``` +p = 0.10 +σ² = 0.10 × 0.90 = 0.09 +absolute MDE = 0.10 × 0.05 = 0.005 +n = 16 × 0.09 / 0.005² = 16 × 0.09 / 0.000025 = 57,600 per variant +``` + +That's ~57,600 per variant for a 5% relative lift — humbling, and surprising to most teams. Most "we'll just run it for two weeks" plans don't survive contact with this number. + +## Kohavi's inverted formula + +For most online experiments, traffic is the constraint, not patience. Pick a duration (2–4 weeks captures weekly cycles), use all available traffic in that window, then compute the **achievable MDE**: + +``` +MDE = 4σ / √n +``` + +This tells the user: "given your traffic, the smallest effect you can reliably detect is X." If that achievable MDE is larger than the lift the user actually expects, the experiment is **underpowered**. Flag immediately. + +Underpowered experiments suffer from **winner's curse**: if you do reach significance, the lift estimate is exaggerated, because only the high-variance positive realisations crossed the threshold. The post-launch result then fails to replicate, and the team learns "experiments are unreliable" rather than "this experiment was underpowered." + +## Estimating the inputs from real data + +For each primary metric, before sizing, you need three numbers: + +1. **Baseline rate** — query the metric over the prior 2–4 weeks (the longer of: one full business cycle, or four weeks). Record `mean` and `variance`. Use the same event definition, segment filters, and unit-of-analysis you'll use in the experiment — a baseline computed differently from how the metric is configured in the experiment is worse than no baseline at all. +2. **Daily traffic** — query the exposure event (or whatever event qualifies users for the experiment) over the same window, grouped by day. Average to get expected exposures per day per variant. +3. **MDE the user wants** — ask explicitly. _"What's the smallest lift that would be worth shipping?"_ If they don't know, propose a 5–10% relative lift and confirm. + +From those three: + +``` +required_sample_per_variant = 16 × σ² / (baseline × MDE_relative)² +required_days = required_sample_per_variant × n_variants / daily_traffic_per_variant +``` + +If `required_days > 28` (four weeks), the experiment is **underpowered for the requested MDE on available traffic**. Tell the user. Don't wave it through. + +## Five remediations when the experiment is underpowered + +Offer these in order of cost — cheap first. + +1. **Accept a larger MDE.** Only commit to ship if the effect is bigger. This costs nothing but redraws the success criterion; confirm the user is OK with shipping only on a larger lift. +2. **Increase traffic allocation to the experiment.** If other tests don't need the traffic, give this one more. +3. **Use CUPED to reduce variance** (if pre-exposure data is available). 30–70% variance reduction translates directly into 30–70% smaller required sample. See `references/advanced-features.md`. +4. **Pick a higher-volume primary metric** (if the hypothesis allows). Often there's a leading proxy with more volume than the lagging metric the team originally chose. +5. **Don't run the experiment.** Invest the engineering elsewhere. Sometimes the right answer. + +## Sample-size floor + +Independent of the math: keep per-variant sample size above the platform's reliability floor (verify in product — historically ~350–400). Below this, the statistical machinery itself becomes unreliable — CLT breaks down, the SRM check gets noisy. The platform's default per-variant target is fine for most tests; ~1,000 is the practical floor; the platform floor is the absolute floor. + +If the math says `n = 50` per variant, the test is either trivially easy (the lift is huge) or the variance estimate is wrong. Sanity-check before launching at the floor. + +## Lookup table (Bernoulli, 95% conf, 80% power) + +For a Bernoulli (conversion-rate) primary metric at 95% confidence, 80% power, two-sided test, MDE expressed as a **relative** lift on the baseline: + +| Baseline rate | MDE = 5% relative | MDE = 10% relative | MDE = 20% relative | +| ------------- | ----------------- | ------------------ | ------------------ | +| 1% | ~633k / variant | ~158k / variant | ~40k / variant | +| 5% | ~122k / variant | ~31k / variant | ~7.6k / variant | +| 10% | ~58k / variant | ~14k / variant | ~3.6k / variant | +| 25% | ~19k / variant | ~4.8k / variant | ~1.2k / variant | +| 50% | ~6.4k / variant | ~1.6k / variant | ~400 / variant | + +Use this for quick sanity-checking. Always confirm with a query against actual baseline data — these are illustrative. + +## Sample-size growth with variants + +For a multi-arm test (N non-control variants), the per-variant target grows with the number of pairwise comparisons being made (each treatment vs control). With multiple-testing correction enabled (which is the right default at 2+ variants), the per-test α tightens, which inflates required sample size further. + +Rule of thumb: a 3-variant test (control + 2 treatments) needs about 1.3× the per-arm sample of a 2-variant test for the same MDE; a 4-variant test needs about 1.5×. Exact multipliers depend on the correction method — see `references/advanced-features.md`. + +## Duration considerations + +- **Minimum 1 week** — anything shorter misses weekly seasonality and conflates the day-of-week mix between control and treatment if traffic differs across days. +- **Minimum 3 days for read-out** — even with sequential testing and big effects, results under 3 days are typically un-interpretable (cohort hasn't stabilised, day-of-week effects dominate, novelty effect not separated from treatment effect). +- **Multiples of the seasonal cycle.** If the primary metric has strong weekly seasonality, set `endCondition: "days"` and choose 7, 14, 21, or 28 days so each variant sees the same mix of high- and low-traffic periods. +- **Cap at ~6 weeks** for most tests — beyond this, novelty effects wear off, the user population drifts, and other experiments running in the same window create cross-test contamination. If the math says you need 8+ weeks, you're underpowered — pick a remediation from the list above. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/statistical-model.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/statistical-model.md new file mode 100644 index 0000000..771a208 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/statistical-model.md @@ -0,0 +1,102 @@ +# Statistical model + +Once required sample size and acceptable duration are known, two configuration choices are left: the **testing model** (sequential vs frequentist) and the **end condition** (sample-based vs date-based). Two adjacent choices change how the tests are interpreted: **confidence level** and **multiple-testing correction**. + +## Testing model: sequential vs frequentist + +**Default to sequential** for most users. Peeking is the most common Mixpanel customer mistake, and sequential testing makes early-look safe by design. + +### Pick sequential when + +- The user expects a **large lift** and wants to confirm or reject the hypothesis quickly. Sequential lets you stop the moment significance is reached — often days or weeks before a frequentist target. +- The user wants to check results before the experiment ends and act on them (early-stop on a clear winner). +- The expected effect size is uncertain (could be huge, could be tiny). Sequential adapts; frequentist needs you to commit to one MDE up front. +- The team will look at intermediate results regardless. Sequential prevents peeking from inflating false positives. +- The user is comfortable with slightly more complex stopping rules ("stop when the test-statistic crosses the boundary," not "stop when n reaches N"). + +### Pick frequentist when + +- The user is hunting for a **very small lift** (e.g. 1–2% relative on a high-volume metric). Frequentist's fixed-sample design is statistically more efficient at the margin and avoids the early-stop boundary inflation that costs power on tiny effects. +- The team is comfortable waiting for the full sample before checking results — no peeking. +- The team prefers wider industry familiarity ("we used a t-test"). +- The user wants the simplest reportable statistics (a single p-value and confidence interval at the end). +- The team has internal training / tooling that assumes frequentist. + +### The "I want to peek with frequentist" trap + +The most common request is "I want frequentist, but I also want to look at the results during the test." This inflates the false-positive rate enormously — naive peeking on a frequentist test at 5 evenly-spaced check-ins pushes the family-wise α from 5% to ~14%. + +Switch them to sequential. Sequential's whole point is making peeking safe. + +If the user insists on frequentist + peeking (some teams do, for tooling reasons), document the decision in the experiment's description so the interpretation step later knows the reported p-values overstate confidence. + +## End condition: sample-based vs date-based + +### Pick sample-based when + +- The team has a target MDE and wants the experiment to stop the moment the required sample is reached. Adaptive duration. +- Daily traffic is highly variable. Sample-size-based ends absorb the variability; date-based ends don't. +- There's no strong seasonality in the primary metric that would bias a mid-cycle stop. + +### Pick date-based when + +- The primary metric has **strong weekly (or other periodic) seasonality**. Pin the duration to a multiple of the seasonal cycle so each variant sees the same mix of high- and low-traffic periods. + - A common pattern: customers with strong weekday/weekend behaviour shifts run all experiments in 1-week increments (or 2 weeks for a stricter check) to fully capture each cycle. + - A sample-based end can fire mid-cycle and produce biased results in this case. +- The team has a fixed business window (e.g. "we want to ship by end of quarter"). +- The team has historically struggled with experiments running indefinitely. +- The hypothesis specifically requires a calendar window (e.g. a holiday-season test). + +### Combinations + +All four combinations are valid. The one customers most often miss is **frequentist + date-based** — some teams prefer time-based experiments for operational reasons even when running frequentist tests. Don't flag this as a misconfiguration. + +The one that's actually wrong is **frequentist + sample-based + peeking** — that's the "peeking trap" above. Surface it; switch them to sequential. + +## Confidence level + +Default 0.95 (α = 0.05; verify in product). Change only with intent. + +- **0.99** — for high-stakes irreversible ships (e.g. billing changes, deletion-flow changes, anything regulatory). Higher false-negative cost; accept it. Document the reason in the experiment's description. +- **0.90** — for low-stakes exploratory tests where speed matters more than rigour. Acknowledge the inflated false-positive rate to the user explicitly: at α = 0.10, one in ten "wins" is noise. + +Any change away from the default belongs in the description. The post-launch interpretation step uses this setting to read the result correctly; without it, a "win" at 0.90 looks the same as a "win" at 0.95. + +## Multiple testing correction + +Enable when there are ≥2 primary metrics OR ≥2 non-control variants. Without correction, the family-wise false-positive rate compounds: + +| Primaries | Non-control variants | Family-wise FPR at per-test α = 0.05 | +| --------: | -------------------: | -----------------------------------: | +| 1 | 1 | 5.0% | +| 2 | 1 | ~9.75% | +| 3 | 1 | ~14.3% | +| 5 | 1 | ~22.6% | +| 5 | 2 | ~40.1% | +| 5 | 3 | ~53.7% | + +Derived from the standard `1 − (1 − α)^k` compounding for `k = primaries × non-control variants` independent tests at per-test α = 0.05. + +The takeaway: by the time you're testing 5 primaries on a 3-arm experiment, more than half of the "wins" are noise. + +Two methods are available: + +- **Bonferroni** — divides α by the number of tests (primaries × non-control variants). Simple and conservative. Guarantees the family-wise error rate stays below α, but can be overly strict when many primary metrics are correlated, hurting power. +- **Benjamini-Hochberg** — controls the **false discovery rate** (FDR) instead of the family-wise error rate. Ranks all primary-metric p-values and applies progressively looser thresholds. More powerful than Bonferroni when there are many primary metrics, especially when some have real effects. Preferred when the user has 3+ primaries or correlated metrics. + +**Default to Benjamini-Hochberg** for most experiments — less conservative, better suited to typical designs with correlated metrics. Use Bonferroni when: + +- The user needs strict family-wise error control (regulatory, high-stakes decisions where any single false positive is unacceptable). +- The primary metrics are independent (no shared drivers / overlapping populations), in which case Bonferroni's conservatism is not a real cost. +- The team explicitly asks for the simplest method to defend in a review. + +Turn correction off **only** when there's a single primary and a single non-control variant. + +## Power vs significance trade-off + +When the user pushes you on the confidence level: + +- Raising α from 0.05 to 0.10 increases power (smaller required sample for the same MDE) but doubles the rate of false-positive "wins." +- Lowering α from 0.05 to 0.01 cuts the false-positive rate fivefold but requires roughly 1.5× the sample for the same MDE. + +If the user wants more power without raising α, the right move is **smaller MDE → bigger required sample**, not loosening significance. If sample is the binding constraint, reach for CUPED (`references/advanced-features.md`) or a higher-volume proxy metric. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/why-no-statsig.md new file mode 100644 index 0000000..22ca2fe --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/why-no-statsig.md @@ -0,0 +1,115 @@ +# Why Hasn't This Reached Statistical Significance Yet? + +Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts. + +The actual stop / extend math (sample size, power, MDE) lives in [sizing.md](sizing.md) — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one. + +--- + +## First, rule out a broken result + +Inconclusive can mean two very different things: + +1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about. +2. **The result isn't trustworthy at all** — SRM failing, broken data, peeked frequentist, etc. — and "inconclusive" is the wrong frame entirely. + +Before answering "why no statsig?", run the **trustworthiness gate**. If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power. + +Also check: + +- The primary's lift is missing or null → no measurement, not "no effect." +- The primary is listed on the experiment but has no computed result (live or cached) → "no measurement," not "no effect." +- The live results carry an error block → results are stale or partial; resolve the backend issue before drawing power conclusions. + +--- + +## The five real reasons an experiment hasn't hit statsig + +Walk through these in order. The first one that explains the picture is usually right. + +### 1. Not enough sample yet (not enough exposures) + +**What to check**: per-variant exposure counts against the configured end target (sample size or duration, whichever the experiment was configured with), and which testing model the experiment is using. + +- **Sequential** + target not reached → genuinely too early. Recommend **WAIT**. +- **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe. +- Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5. + +If exposures are falling short of plan because traffic dropped: surface that. Querying the exposure event with a date breakdown shows whether something changed mid-experiment. + +### 2. Observed effect is smaller than the MDE + +**What to check**: the lift on the primary metric, plus the MDE the user planned for (typically captured in the experiment's hypothesis/description, or recovered via the setup-side skill's power math). + +- Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1. +- Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options: + - **Accept the null** — at this size, the change isn't moving the metric. Document and move on. + - **Resize and rerun** — if a smaller effect would still be ship-worthy, re-run with a larger sample (lower MDE). +- Observed lift much **larger** than planned MDE but still not significant → unusual; likely high variance (see reason 3) or insufficient exposures (reason 1). + +### 3. Variance is too high (metric is too noisy) + +**What to check**: the metric's distribution type, plus whether CUPED and Winsorization are enabled. + +- **Gaussian** metric (revenue, time-on-page) with no Winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization on the next run. +- **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume. +- **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the absolute scale of detectable effects. Lifts near 50% rates are easiest; lifts near 0%/100% need much more sample. +- **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%. +- **CUPED enabled on a new-user-only cohort** → CUPED has no effect (no pre-exposure data exists). Not a misconfiguration to "fix," but variance reduction simply didn't happen. + +### 4. Traffic split is starving the variant + +**What to check**: the configured traffic split against the actual per-variant exposure counts. + +- Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue. +- Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later. +- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs to clear the platform's per-variant exposure floor in its own right. Adding arms costs power per-comparison. + +Never change traffic allocation mid-Frequentist test — it invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment. + +### 5. Exposure config is filtering more users than the user expects + +**What to check**: exposure event volume, any audience filters on the backing feature flag, and whether QA traffic is being excluded. + +- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query the exposure event to confirm how many users actually got exposed. +- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event. +- QA traffic isn't being excluded and you suspect internal traffic is dominating one variant → enable the QA exclusion on the next run (results then are cleaner but also smaller). + +**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md). + +--- + +## Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL? + +Once you know which reason fits, the recommendation almost picks itself. + +| Reason | Recommendation | +| -------------------------------------- | ------------------------------------------------------------------------------------------------------------ | +| Not enough sample yet, still ACTIVE | **WAIT.** Show projected end date based on observed traffic. | +| Not enough sample yet, concluded early | **EXTEND** (Frequentist: relaunch with longer planned duration; Sequential: resume if possible). | +| Effect << MDE | **ACCEPT NULL** if the planned MDE is the smallest ship-worthy effect; otherwise **BOOST POWER** and re-run. | +| Variance too high | **BOOST POWER**: enable CUPED, enable Winsorization, switch to a less noisy metric proxy. | +| Variant starved by traffic split | **EXTEND** (if remaining time is enough) or restart with rebalanced split. | +| Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. | +| Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. | + +When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or use the power math in [sizing.md](sizing.md). + +--- + +## What NOT to suggest + +- ❌ **Stop early on a favorable peek** in a Frequentist test — that's exactly the false-positive inflation problem. +- ❌ **Switch testing model mid-experiment** — restart, don't morph. +- ❌ **Add more primary metrics** to "fish" for a win — multiplies the family-wise FPR. If a single primary is inconclusive, more primaries make the picture worse, not better. +- ❌ **Re-run identical hypothesis on the same audience right after concluding "no effect"** — without a power change, you'll get the same answer. +- ❌ **Claim "no effect"** from an underpowered inconclusive result — the right framing is "the experiment wasn't sized to detect the effect we observed." + +--- + +## Output shape + +1. **The reason** (one of the five above), in one sentence. +2. **The evidence** — concrete numbers from the experiment (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%"). +3. **Recommendation** from the table above, with the specific experiment update or follow-up action. +4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md new file mode 100644 index 0000000..f2e1be1 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md @@ -0,0 +1,118 @@ +--- +name: manage-experiment +description: > + Coach the user through any phase of a Mixpanel experiment — design before + launch (hypothesis framing, metric selection, sizing, statistical model + choice, advanced statistical features like CUPED / Winsorization / + Bonferroni / Benjamini-Hochberg, pre-launch pitfall checks) and interpret + after launch (read results, decide ship / iterate / kill / wait, interpret + health checks like SRM and Retro A/A, break results down by segment, use + session replays to explain a result). Use when the user mentions + experiment, A/B test, ship/kill decision, MDE, minimum detectable effect, + sample ratio mismatch, CUPED, sizing, statistical significance, lift, or + any phrasing like "set up an experiment", "design an A/B test", "how did + experiment X do", "should we ship", "why isn't this significant yet", + "should this be sequential or fixed-horizon", "what's my MDE", "is this + experiment configured correctly", "audit my experiment". Do NOT use for + plain feature-flag rollouts with no measurement criterion — that belongs + to the `manage-feature-flags` skill. +license: Apache-2.0 +--- + +# Manage Experiment + +This skill manages a Mixpanel experiment across its lifecycle — **designing** before launch and **interpreting** after launch. Two commands sit under the umbrella; pick by experiment phase: design when the experiment doesn't exist yet, interpret once exposures are flowing. + +The skill runs as a single interactive session per experiment. The two commands compose naturally — designing produces a configuration that interpreting later consumes — but they're rarely invoked in the same session (the gap is days to weeks). + +--- + +# Components + +The pieces the skill is built from. The Steps section below tells you how to use them. + +## Canonical commands + +Each command lives in its own file under `commands/` and is loaded on demand. Match commands explicitly (user names them) or implicitly (message matches a trigger phrase below). + +| Command | File | Match if message contains any of | +| ----------- | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- | +| `design` | `commands/design.md` | design, set up, configure, plan, sanity-check, pre-launch, MDE, sizing, hypothesis, sequential vs frequentist, CUPED, Winsorization | +| `interpret` | `commands/interpret.md` | read results, ship, iterate, kill, wait, statsig, SRM, sample ratio mismatch, retro A/A, lift, polarity, segment breakdown, session replays | + +If a message could route to either (e.g. "audit my experiment", "check on experiment X"), use the **phase-derived** rule: experiment in `DRAFT` → `design`; experiment in `ACTIVE` or `CONCLUDED` → `interpret`. If the experiment state is unknown, ask the user. + +## Command menu + +Shown when no command was detected or inferred. + +``` +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ + Manage Experiment — [Project Name] ([project_id]) +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ + 1. Design — Hypothesis, metrics, sizing, model, pre-launch checks + 2. Interpret — Read results, ship / iterate / kill / wait + 3. Exit +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ +``` + +## Shared glossary + +Terms both commands use without redefining. Phase-specific terms (hypothesis, polarity, SRM, etc.) live in their command files. + +- **Variant.** One arm of the experiment. The variant treated as the baseline is the **control**; the others are **treatments**. The platform marks which key is the control. +- **Primary / Guardrail / Secondary metric.** + - **Primary** — drives the ship decision. Cap at 3; the platform applies multiple-testing correction across primaries when configured. + - **Guardrail** — must not regress; a guardrail regression vetoes a ship even when primaries win. + - **Secondary** — exploratory / diagnostic only, never decisional, no correction applied. +- **Direction.** Whether bigger is better for a metric (`up`) or smaller is better (`down`). Cancel / error / latency / abandon / refund metrics need `down` set explicitly — leaving the default silently flips polarity at interpretation. +- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict. +- **MDE (Minimum Detectable Effect).** The smallest lift the experiment is sized to detect. Set during design, enforced at interpretation. +- **CUPED.** Variance-reduction technique using pre-exposure baseline. Cuts required sample 30–70% when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. +- **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95 (verify in product). Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. +- **Multiple-testing correction.** Adjusts per-test significance threshold when several primaries or several non-control variants are tested together. Default Benjamini-Hochberg; Bonferroni for strict family-wise control. + +## Behaviour rules + +1. **Irreversible actions require explicit confirmation.** Creating an experiment (in `design`) and concluding one (in `interpret`) are both irreversible. Show the proposed action, wait for the user to confirm. +2. **If a command can't complete, explain why.** Tell the user what failed and what they can try. Don't fail silently. +3. **Experiment switching.** If the user wants to operate on a different experiment mid-session, ask which one and reset experiment-scoped context. +4. **Project switching.** If the user wants to operate on a different project mid-session, suggest starting a new conversation first. If they insist, resolve the new project and continue with that `project_id`. + +--- + +# Steps + +Follow these steps in order. + +## 1. Set project + +Resolve which Mixpanel project the user wants to operate on. + +- **User named a project (name or ID):** list all projects in the workspace. Match by ID first, then by case-insensitive name. If one match → `✅ [Project Name] ([project_id])`, proceed. +- **Multiple name matches:** show the matches in a numbered list, ask the user to pick. +- **No match:** tell the user what wasn't found, offer to `list` (which re-fetches the project list and shows the table). +- **User named nothing:** ask which project. `list` → fetch projects → show table. + +If the project listing fails with tool-not-found, tell the user to connect the Mixpanel MCP and stop. + +## 2. Set experiment (if one is named) + +If the user named an experiment, resolve it now — try ID first, then case-insensitive name match. Multiple matches → numbered picker. No match → tell the user what wasn't found. + +If the user is starting a new experiment from scratch (no existing experiment to name), skip this step — `design` will handle setup. + +## 3. Pick the command + +- **Explicit:** user names a phase (`/design`, "set up a new experiment", "interpret experiment X") → use that command. +- **Implicit:** message matches one canonical trigger phrase from the Components table → use that command. +- **Phase-derived:** an experiment exists in context and its state determines the command — `DRAFT` → `design`; `ACTIVE` or `CONCLUDED` → `interpret`. +- **Ambiguous or none:** show the Command menu, take the user's choice. + +## 4. Load and execute the command + +If the command file is not already in context, read `commands/[command].md`. Follow the instructions in that file. Reuse the project and experiment context resolved in steps 1–2 — never re-ask. + +## 5. Complete + +Print `✅ Done.` Return to step 3 if the user wants to chain another command (e.g. design → launch externally → interpret in a later session). diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md new file mode 100644 index 0000000..b5f7046 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md @@ -0,0 +1,161 @@ +# Command: design + +Design a Mixpanel experiment before launch. A well-designed experiment starts from the hypothesis and works backward: the hypothesis dictates the metrics that test it, the metrics dictate the sample size, the sample size + traffic dictate duration and testing model. **Don't create the experiment until the user explicitly confirms the configuration** — once it's live, mid-flight config changes invalidate the test. + +The umbrella `SKILL.md` defines the shared glossary (Variant, Primary/Guardrail/Secondary metric, Direction, Lift, MDE, CUPED, Winsorization, Multiple-testing correction). Phase-specific terms below. + +--- + +## Glossary (design-specific) + +- **Hypothesis.** A falsifiable, directional claim with a stated mechanism, bounded in time. Shape: _"If ``, then `` will `` by ≥``, because ``."_ Every other decision flows from this. +- **Power.** The probability the experiment detects a true effect of size MDE. Default 80%. +- **Underpowered.** Achievable MDE on available traffic exceeds the user's expected lift. Most likely outcome is "inconclusive"; reachable significance is biased upward (winner's curse). +- **Sequential vs Frequentist testing.** Sequential makes peeking safe (boundary-based stopping); Frequentist requires a fixed sample committed up front. Most users should default to Sequential. + +--- + +## Components (design-specific) + +### Sizing formulas + +Required sample per variant (two-sample, two-sided, 95% confidence, 80% power): + +``` +n = 16 × σ² / d² +``` + +Inverted for traffic-bound teams — the smallest effect detectable on available traffic (Kohavi's inversion): + +``` +MDE = 4σ / √n +``` + +The `16` is `(z_{α/2} + z_β)² × 2` rounded. Variance `σ²` depends on metric type: Bernoulli `p(1−p)`; Poisson `≈ mean`; Gaussian computed from data. The full derivation, worked examples, lookup table, and the five remediations for underpowered experiments live in [../references/sizing.md](../references/sizing.md). + +### The >5% guardrail hard-gate + +A **5% relative regression on any guardrail blocks ship**, even when the primary wins. Guardrails are the trustworthiness backstop; without this rule, a winning primary with a quietly regressing guardrail ships and rolls back two weeks later. Below 5% lives in the noise band of most guardrails; above 5% means the team has traded measurable damage for headline lift. If the user wants to ship past a regressing guardrail, force the conversation — disable the guardrail explicitly and document why. Don't let them silently override. Full rationale in [../references/pitfalls.md](../references/pitfalls.md). + +### Pre-launch pitfall catalogue + +Before creating the experiment, run the deterministic pre-launch checks against the configuration. Surface results in triage order: **blockers** (an experiment that can't reach statistical power), **warnings** (configuration smells that degrade trustworthiness), then **fyi**. The two blockers today are: insufficient duration for the configured MDE on available traffic; and a cohort too small to supply enough eligible users. The full catalogue, severities, and rationale live in [../references/pitfalls.md](../references/pitfalls.md). + +--- + +## Steps + +Top-down: what to do, in order. + +### 1. Route and check for prior work + +**Route Experiment vs Feature Flag first.** Wants causal evidence (lift, ship/no-ship from data) → experiment. Wants progressive rollout, kill switch, or per-segment gating with no measurement criterion → feature flag (route to the `manage-feature-flags` skill). If ambiguous, ask once: _"Are you measuring whether this change moves a metric (experiment), or rolling it out gradually with no measurement criterion (feature flag)?"_ Deeper disambiguation in [../references/routing-xp-vs-ff.md](../references/routing-xp-vs-ff.md). + +**Check for prior experiments on the same feature.** Search the project by keywords drawn from the feature name. If a prior-experiments lookup isn't available, say so explicitly — don't fabricate "no priors found." Surface anything you find: a same-feature ship suggests "don't re-run, iterate on a new hypothesis"; a prior kill is a strong prior the user has to argue past; an earlier iteration gives you reliable baseline and variance numbers that sharpen the new MDE. Fold-in playbook in [../references/prior-experiments.md](../references/prior-experiments.md). + +### 2. Write the hypothesis + +A good hypothesis is a **falsifiable, directional claim with a stated mechanism, bounded in time**: + +> **If** ``, **then** `` will `` by ≥``, **because** ``. + +If the user is vague, hold them to five commitments: the change, the primary metric, the direction, the MDE, the mechanism. The "because" forces them to check whether the metric they picked is actually downstream of the change — the most common source of "experiment didn't work" post-mortems. The rubric, common misalignment patterns, and worked good/bad examples are in [../references/hypothesis-framing.md](../references/hypothesis-framing.md). + +### 3. Pick metrics that test the hypothesis + +The hypothesis names a specific outcome. The primary metric must measure that outcome — same population, same denominator, same timeframe. + +- **Primaries** (1–3 max) come from the hypothesis's outcome clause. Each additional primary inflates the family-wise false-positive rate. +- **Guardrails** (strongly recommended) cover the most likely failure mode of the change — see the guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). +- **Secondaries** are diagnostic only. + +Every primary and guardrail needs an explicit `direction`. Watch for the **lagging-indicator trap** (30-day retention as primary on a 2-week experiment) and the **changed-denominator trap** (metric defined only over treatment-exposed users — lift is artificially infinite). Full sanity checklist and standard guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). + +### 4. Size the experiment with real data + +Pull baseline rate, variance, and daily traffic from Mixpanel. Don't guess. + +Use the formulas in **Components**. Then compare the required sample to what the available traffic delivers inside an acceptable window (typically 2–4 weeks). If the achievable MDE exceeds the user's expected lift, the experiment is **underpowered** — surface immediately. Don't wave it through; offer the remediations from the sizing reference (accept a larger MDE → increase allocation → enable CUPED → pick a higher-volume primary → don't run). + +Sample-size floor: keep per-variant target above the platform's reliability floor (verify in product — historically ~350–400). Below the floor, the central limit theorem breaks down and the SRM check gets noisy. Full worked examples, baseline-by-rate lookup table, and the duration / seasonality rules in [../references/sizing.md](../references/sizing.md). + +### 5. Pick testing model + end condition + +Four choices, each with a default that's right for most users: + +- **Testing model** — default Sequential (peeking is safe by design); Frequentist only for small-lift hunts on well-sized tests. +- **End condition** — sample-based for variable traffic; date-based for strong weekly seasonality. +- **Confidence level** — default 0.95 (verify in product); 0.99 for irreversible high-stakes ships; 0.90 only when speed beats rigour. +- **Multiple-testing correction** — enable when there are ≥2 primaries OR ≥2 non-control variants; default Benjamini-Hochberg, Bonferroni for strict family-wise control. + +Decision tree, the peeking-trap explanation, worked compounding-FPR numbers, and the four valid model × end-condition combinations are in [../references/statistical-model.md](../references/statistical-model.md). + +### 6. Decide on advanced features + +- **CUPED** — enable when the primary metric correlates with pre-exposure behaviour AND all experiment users existed before start AND 2–4 weeks of stable pre-exposure history is available. Do not enable on new-user-only experiments, one-time-event metrics, or brand-new metrics. +- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. Push back if the configured percentile is below ~80. + +When/why each is right and the common misconfigurations are in [../references/advanced-features.md](../references/advanced-features.md). + +### 7. Run the pre-launch pitfall check + +Run the catalogue from **Components** against the proposed configuration. Surface only what fires; order blockers → warnings → fyi. Blockers should stop launch (the experiment cannot reach statistical power as configured). Warnings should be explained — name the trade-off, don't just nag. Full catalogue in [../references/pitfalls.md](../references/pitfalls.md). + +### 8. Confirm with the user, then create + +Creating the experiment is the irreversible step. Present a compact summary and **wait for explicit confirmation** before invoking the creation action: + +``` +*Experiment Setup Summary* + +• *Hypothesis:* If , then will by ≥, because . +• *Primary metrics:* (direction: up/down), … +• *Guardrails:* (direction: …), … +• *Variants:* control 50% / treatment 50% (or as configured) +• *Statistical model:* sequential | frequentist +• *End condition:* sample-based (per-arm ) | date-based ( days) +• *Confidence level:* 0.95 +• *Multiple testing correction:* benjamini-hochberg | bonferroni | off +• *Advanced features:* CUPED on/off · Winsorization on/off (percentile

) +• *Expected duration on current traffic:* days +• *Achievable MDE on current traffic:* % relative + +*Pitfall check:* +✅ Insufficient duration — adequate +✅ Cohort too small — adequate +⚠️ Missing guardrails — no guardrail metrics configured; >5% hard-gate cannot protect this ship +``` + +Use the exact catalogue labels from [../references/pitfalls.md](../references/pitfalls.md) so the agent's pitfall messages stay consistent across sessions. + +After creating, link the new experiment back to any prior experiment surfaced in step 1 — record the prior's ID, hypothesis, and outcome in the new experiment's description. That 30-second annotation pays back tenfold at interpretation time. + +If the user hasn't named a specific feature or surface, ask before fetching baselines or designing — designing the wrong experiment burns more time than the clarifying question costs. Accept the feature by name or by ID; try ID match first, then case-insensitive name match. + +--- + +## Going deeper + +| User asks about… | Open | +| ----------------------------------------------------------------------------- | -------------------------------------------------------------------------- | +| "Is this an experiment or just a feature flag?" | [../references/routing-xp-vs-ff.md](../references/routing-xp-vs-ff.md) | +| "Help me write the hypothesis" / "Is this hypothesis good?" | [../references/hypothesis-framing.md](../references/hypothesis-framing.md) | +| "Which metrics should I pick?" / "Primary vs guardrail vs secondary?" | [../references/metric-selection.md](../references/metric-selection.md) | +| "What sample size do I need?" / "What MDE can I detect?" / "How long to run?" | [../references/sizing.md](../references/sizing.md) | +| "Sequential vs frequentist?" / "Confidence level?" / "Correction method?" | [../references/statistical-model.md](../references/statistical-model.md) | +| "Should I enable CUPED / Winsorization?" | [../references/advanced-features.md](../references/advanced-features.md) | +| "Was anything similar tested before?" | [../references/prior-experiments.md](../references/prior-experiments.md) | +| "What can go wrong before launch?" / "Run the pre-launch check" | [../references/pitfalls.md](../references/pitfalls.md) | + +--- + +## Output style + +- Lead with the hypothesis. Every other decision flows from it. +- Use concrete numbers from real data ("baseline 4.2%, σ² = 0.040, required n ≈ 6,400/arm"), not vague guidance. +- Quote the user's MDE and metric names back so they catch typos. +- When underpowered, say so plainly and list remediations in order of cost. +- Don't moralise about peeking — switch them to sequential. +- Guardrail regressions are hard gates, not "slight concerns." + +When the experiment is created and live, hand off to the `interpret` command in this same skill once exposures are flowing. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/commands/interpret.md b/plugins/mixpanel-mcp/skills/manage-experiment/commands/interpret.md new file mode 100644 index 0000000..77d78ff --- /dev/null +++ b/plugins/mixpanel-mcp/skills/manage-experiment/commands/interpret.md @@ -0,0 +1,112 @@ +# Command: interpret + +Interpret a Mixpanel experiment's results and health checks. This command consumes the verdicts the platform already returns. **Never recompute thresholds** (SRM, significance, sufficient-exposures, etc.). If a verdict field is missing, say so — do not synthesize one from raw values. + +The umbrella `SKILL.md` defines the shared glossary (Variant, Primary/Guardrail/Secondary metric, Direction, Lift, MDE, CUPED, Winsorization, Multiple-testing correction). Phase-specific terms below. + +--- + +## Glossary (interpret-specific) + +- **Polarity.** Whether a movement is _good for the business_. Combines sign of lift with the metric's `direction` ("up" = bigger is better; "down" = smaller is better). See the **Polarity recipe** in Components. +- **Significance.** The platform's per-row classification: `YES_POSITIVE`, `YES_NEGATIVE`, or `NO`. Read from the response — do not recompute. +- **SRM (Sample Ratio Mismatch).** Variants received traffic in proportions that disagree with the configured split. **Kohavi's #1 trustworthiness check** — when SRM fails, downstream lift, p-values, and CIs cannot be trusted. +- **Retro A/A (pre-experiment bias).** Re-runs the comparison on the pre-exposure period. A failure means cohorts already differed before treatment started. +- **Twyman's Law.** "Any unusually clean or unusually large result is more likely a bug than a discovery." Apply on lifts > ~30% — usually a changed-denominator artifact. +- **Trustworthiness gate.** The pre-flight check that runs before any results interpretation: SRM ok, Retro A/A clean, exposures sufficient, ≥3-day window, no misconfig. Failing any of these means **do not interpret results yet** — route to the health-check reference. + +--- + +## Components (interpret-specific) + +### Polarity recipe (load-bearing — apply on every metric row) + +The platform's summary buckets (`positive` / `negative` / `no`) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion. + +Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"): + +- `lift is None` or `lift == 0` → **neutral** (no measurement / no effect respectively). +- `direction == "up"` → **positive** if `lift > 0`, else **negative**. +- `direction == "down"` → **positive** if `lift < 0`, else **negative**. + +A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control. + +The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**. + +### Data-source fallback + +Experiment-details has two parallel data paths — live (preferred) and cached. Always prefer live; if live computation failed, fall back to cache with a staleness caveat; if **both** are empty, say "no result was computed" and recommend a re-sync. **Never** silently treat missing data as "no effect." + +### Verdict table + +| Situation | Recommendation | +| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| Trust ✓, primary polarity positive, guardrails ✓, magnitude meaningful | **SHIP.** Conclude the experiment via its decide lifecycle action, naming the winning variant and a rationale message. **Confirm with the user first — concluding is irreversible.** | +| Trust ✓, primary polarity positive, guardrail polarity negative | **ITERATE.** Investigate the regression; do not auto-ship. | +| Trust ✓, primary polarity neutral after target sample reached | **KILL or ITERATE.** Use the inconclusive-results playbook in [../references/why-no-statsig.md](../references/why-no-statsig.md). | +| Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [../references/why-no-statsig.md](../references/why-no-statsig.md)). | +| Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [../references/health-check-interpretation.md](../references/health-check-interpretation.md). | + +For multi-variant tests, special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), and the exact decide-call shape, see [../references/lifecycle-handoff.md](../references/lifecycle-handoff.md). + +--- + +## Steps + +Top-down: what to do, in order. + +### 1. Fetch the experiment + +The umbrella's step 2 should have resolved the experiment already. If not (the user named it mid-command), accept by name or ID; try ID match first, then case-insensitive name match. + +Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments. + +Apply the **data-source fallback** rule from Components. If the live path fails and the cache is also empty, stop here and tell the user — there is nothing to interpret. + +### 2. Run the trustworthiness gate (the Decision Tree) + +Run steps 2a–2e in order. **Stop at the first failure** — do not proceed if a step flags a problem. The platform attaches verdict fields for each check; consume those verdicts rather than recomputing. + +#### 2a. Trustworthiness + +SRM ok? Retro A/A clean? Exposures sufficient? Minimum duration met (~3 days)? No misconfiguration? If any fail → STOP and open [../references/health-check-interpretation.md](../references/health-check-interpretation.md). The Misconfigurations section in that reference covers the warning-level signals (multiple-testing off, extreme winsorization, CUPED on new-users-only, etc.). + +#### 2b. Statistical significance + +Apply the **polarity recipe** from Components to each non-control variant × primary metric. If nothing is significant on primaries → see [../references/why-no-statsig.md](../references/why-no-statsig.md). For translating a single metric's lift / CI / p-value into a phrase, see [../references/per-metric-interpretation.md](../references/per-metric-interpretation.md). + +#### 2c. Guardrail check + +Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a row in `summary.positive` for a `direction: "down"` guardrail is still a regression. + +#### 2d. Practical significance + +Convert lift into absolute terms — multiply by the control baseline. Statistically significant ≠ ships. The per-metric reference covers the baseline-fetch fallback when `value` or `sampleSize` is missing, and the **Twyman's Law** check for any lift > ~30%. + +#### 2e. Verdict + +Look up the situation in the **Verdict table** in Components. If the recommendation is SHIP or KILL, surface the proposed decide-action parameters and **wait for explicit user confirmation** before executing — concluding an experiment is irreversible. + +### 3. Going deeper (open references on demand) + +| User asks about… | Open | +| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ | +| SRM failing, Retro A/A failing, exposures insufficient, or any trustworthiness fail | [../references/health-check-interpretation.md](../references/health-check-interpretation.md) | +| "Translate this lift / CI / p-value into English" | [../references/per-metric-interpretation.md](../references/per-metric-interpretation.md) | +| "Why hasn't this hit statsig yet? Should we wait or stop?" | [../references/why-no-statsig.md](../references/why-no-statsig.md) | +| "Which segments should I break this down on?" | [../references/segment-of-interest-selection.md](../references/segment-of-interest-selection.md) | +| "What does this segment-by-segment result mean?" | [../references/segment-breakdown-interpretation.md](../references/segment-breakdown-interpretation.md) | +| "Can session replays help explain this result?" | [../references/session-replay-analysis.md](../references/session-replay-analysis.md) | +| "How do I actually conclude this experiment? Multi-variant ship?" | [../references/lifecycle-handoff.md](../references/lifecycle-handoff.md) | + +### 4. Output + +Default to this shape unless the user asks for something else: + +1. **Verdict** in one sentence — `SHIP`, `ITERATE`, `KILL`, `WAIT`, or `DO NOT DECIDE`. +2. **Why**, walking through the trustworthiness-gate steps that mattered (skip steps that were clearly fine). +3. **Per-metric breakdown** — winning primaries, losing primaries, guardrail status, each polarity-corrected. Include absolute-impact translation for any win. +4. **Caveats / what we don't know** — non-default confidence level, missing baselines, segments not yet checked, stale-cache caveat, etc. +5. **Suggested next action** — for SHIP / KILL, the proposed decide-action parameters **gated on user confirmation**; for ITERATE / WAIT, the investigation to run next. + +If experiment details are unavailable or return errors, say so — do not invent a verdict. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/advanced-features.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/advanced-features.md new file mode 100644 index 0000000..3f6f7dd --- /dev/null +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/advanced-features.md @@ -0,0 +1,103 @@ +# Advanced features + +Three optional features most experiments don't touch — and that, used in the right spot, dramatically improve power or trustworthiness. Each one has a clear set of conditions where it helps and a clear set of conditions where enabling it is wrong. + +## CUPED — variance reduction + +**What it does.** CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance on metrics that correlate with users' pre-experiment behaviour. Lower variance → smaller required sample size → faster experiments. Typical reductions are 30–70%, which translates directly into 30–70% smaller required sample. + +**How to enable.** Turn CUPED on for the experiment and pick a pre-exposure window length (see presets below). + +### When to enable + +- The primary metric correlates with users' pre-exposure behaviour on the same metric. Strong correlations: revenue, engagement (events per user), retention, time-on-platform. Weak correlations: anything one-time or onboarding-specific. +- **All experiment users existed before the experiment start** — i.e., not a new-user-only cohort. CUPED needs a pre-exposure observation period; new users don't have one. +- A 2–4 week pre-exposure window is available with stable behaviour. If the metric was launched 5 days ago, CUPED has nothing to read. + +### When NOT to enable + +- New-user-only experiments. No pre-exposure data exists. CUPED gives zero variance reduction and adds noise. +- Brand-new metrics without historical data. +- Metrics where pre-exposure behaviour is not predictive of post-exposure (e.g., one-time onboarding events: the user either did or didn't complete onboarding once; pre-exposure has nothing to say about it). +- Pre-exposure window short enough that the behaviour you'd "control for" is itself a transient spike (e.g., metric just had a viral moment last week). + +### Pre-exposure window presets + +- **2 weeks** — fast-moving metrics with no strong weekly seasonality. +- **4 weeks** — most metrics with weekly seasonality (default sweet spot). +- **60 days** — deeply seasonal metrics like spend. +- **90 days** — long-cycle metrics (renewal-driven revenue, etc.). + +### What changes downstream + +- Required sample size shrinks by the variance-reduction factor. A 50% variance reduction on a primary that needed 60k per arm shrinks the target to ~30k per arm. +- The point estimate of the lift is unchanged. CUPED is a variance-reduction technique, not a bias correction; the headline lift is the same, the confidence interval is narrower. +- The post-launch interpretation step needs to know CUPED was on, because the standard error formula differs. The platform persists the setting on the experiment; the interpretation step reads it automatically. + +## Winsorization — outlier handling + +**What it does.** Caps extreme values at a percentile boundary (default 95th — verify in product). This squeezes the long tail of heavy-tailed distributions so a handful of outliers can't dominate the per-arm mean. + +**How to enable.** Turn Winsorization on for the experiment and pick a percentile. + +### When to enable + +- Revenue or spend metrics with whales (one customer spends 100× the median; that customer assigned to treatment is enough to swing the headline). +- Time-on-page or session-duration metrics with users who fall asleep on the page (one session at 8 hours dwarfs 10,000 sessions at 30 seconds). +- Any Gaussian-distributed metric with a heavy right tail (count metrics, event volume per user, page view counts). + +### When NOT to enable + +- Bernoulli (conversion) metrics. Capping a 0/1 outcome is meaningless; the 95th percentile of a 0/1 distribution is also 0 or 1. +- Metrics where the tail behaviour **is** the hypothesis. If the test is "did this change move whale spending?", Winsorization throws away exactly the signal you're testing for. +- Metrics already winsorized upstream (in the metric definition / data pipeline) — double-winsorization adds nothing. + +### Percentile guidance + +The platform default is typically 95 (cap top/bottom 5%) — verify in product. This is almost always right. Push back if the user sets a percentile below ~80 — that's more than 20% of values being capped, which throws away too much signal. Confirm intent before launching. + +For very heavy tails (extreme whale distributions), 99th percentile is sometimes appropriate, but that's the corner case. The platform default is the default for a reason. + +### What changes downstream + +- Variance on the affected metric drops, often substantially. Required sample size shrinks accordingly. +- The point estimate of the mean shifts toward the centre of the distribution. This is the desired behaviour; the whole point is to stop a few outliers from anchoring the estimate. +- The post-launch interpretation step reports the winsorized mean and standard error. If the team also wants to know what the un-winsorized mean did (the "did whales react?" question), they'd need a separate secondary metric without Winsorization. + +## Multiple testing correction — Bonferroni vs Benjamini-Hochberg + +Covered in detail in [statistical-model.md](statistical-model.md). The short version: + +- Enable when there are ≥2 primaries OR ≥2 non-control variants. +- Default to Benjamini-Hochberg. More powerful with correlated primaries. +- Use Bonferroni when family-wise error control is required (regulatory, etc.) or when the primaries are independent. +- Turn off only with a single primary and a single non-control variant. + +## Decision flowchart + +``` +Primary metric is Bernoulli (conversion rate)? +├── Yes → Winsorization OFF. +│ Does it correlate with pre-exposure behaviour of existing users? +│ ├── Yes → CUPED ON (if 2–4 week pre-exposure window available, no new-user cohort) +│ └── No → CUPED OFF +└── No (continuous / count / retention) + Heavy-tailed distribution with outliers (revenue, time-on-page, session length)? + ├── Yes → Winsorization ON (platform default percentile, typically 95) + └── No → Winsorization OFF + Does it correlate with pre-exposure behaviour of existing users? + ├── Yes → CUPED ON (if 2–4 week pre-exposure window available, no new-user cohort) + └── No → CUPED OFF + +Primary count ≥ 2 OR non-control variants ≥ 2? +├── Yes → Multiple testing correction ON (Benjamini-Hochberg default; Bonferroni for strict family-wise control) +└── No → Multiple testing correction OFF +``` + +## Common misconfigurations + +- ⛔ **CUPED on a new-user-only experiment.** No pre-exposure data; the feature does nothing. Worse, the user thinks they're being protected and ships an underpowered test. +- ⛔ **Winsorization on a conversion metric.** Capping 0/1 values is meaningless. The setting either no-ops or, if a buggy implementation interprets it literally, makes the metric worse. +- ⛔ **Winsorization at a percentile below ~80.** Cuts more than 20% of data. Almost always a typo for 95 or 90. Confirm intent. +- ⛔ **Multiple testing correction OFF on a 5-primary test.** Family-wise FPR balloons to ~22.6%. One in five "wins" is noise. +- ⛔ **CUPED enabled "to be safe" on a metric where pre-exposure doesn't predict post-exposure.** Best case: no effect. Common case: the variance estimate gets noisier because the regression adjustment is fitting to noise. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/health-check-interpretation.md new file mode 100644 index 0000000..1467468 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/health-check-interpretation.md @@ -0,0 +1,176 @@ +# Health-Check Interpretation + +Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action. + +--- + +## Kohavi framing — always cite when a health check fails + +> **Sample Ratio Mismatch is the #1 trustworthiness check (Kohavi).** When SRM is failing, do not trust the experiment's lift, p-values, or confidence intervals — the randomization assumption is broken, so the measured effect cannot be attributed to the treatment. +> +> **Twyman's Law**: any unusually clean or unusually large result is more likely a bug than a discovery. A spectacular lift on a failing-SRM experiment is not evidence of a great treatment; it's evidence the bucketing is broken. + +These two principles drive the recommendations below. Lead with them when explaining a failing check to the user. + +--- + +## 1. SRM (Sample Ratio Mismatch) + +**What the platform tells you**: the SRM verdict the experiment-details response carries (live, or cached when live isn't available). The platform tags failing SRMs already — consume the verdict, do not compute chi-square yourself. + +### What it means + +Users were assigned to variants in proportions that disagree with the configured target allocation. The disagreement is too large to be chance. Bucketing — the experimental machinery itself — is broken. Every downstream number (lift, p-value, CI) inherits that brokenness. + +### Likely causes, ordered most → least likely + +(Surface in this order — investigate the most probable first.) + +1. **bucketing_bug** — A bug in the variant-assignment code is sending more traffic to one variant than the configured split. Check the SDK or server-side bucketing logic that decides which variant each user sees. +2. **biased_assignment** — The assignment criterion correlates with the variant — e.g. assigning by user-id parity when user-ids aren't uniformly distributed, or bucketing on a property that drifts over the experiment window. +3. **bot_traffic** — Bot or crawler traffic is being exposed to one variant more than the other. Bots often hit only the default/control variant or follow patterns that skew allocation. +4. **exposure_tracking_bug** — Exposures are being logged for one variant but dropped or duplicated for another. Verify the exposure event fires exactly once per user per variant assignment. +5. **ramp_up_timing** — If the experiment was ramped (e.g. 10% → 50% → 100%) and the SRM alert fired during a ramp, the deviation may be a transient effect of the ramp schedule rather than a real bucketing problem. Re-check after a stable allocation period. + +### Recommended actions + +- **pause_and_investigate** — Pause the experiment before drawing any conclusions. SRM violates the experiment's core randomization assumption — any lift or regression measured against a mis-allocated split is unreliable. +- **restart_with_bot_filtering** — Restart with bot filtering enabled in your exposure tracking. Bot traffic is the most common SRM cause when the deviation is small and asymmetric. +- **investigate_exposure_logging** — Compare exposure event volume per variant against your feature-flag evaluation logs. A gap between flag evaluations and logged exposures is the classic signature of exposure-tracking bugs. +- **continue** — Only when the SRM is _not_ failing and the observed allocation is consistent with the configured split. + +### Investigation checklist + +1. Compare the actual per-variant exposure ratio to the configured target allocation — which variant is over/under-represented? +2. If feature-flag-based: check whether a property filter on the flag was added or changed mid-experiment. Inspect the flag's rollout rules and history. +3. For multi-variant tests, the platform may apply a per-comparison correction to the SRM threshold — the effective per-variant threshold may be tighter than the headline. Trust the platform's bucket flag, not raw p-value math. +4. Verify SDK version and bucketing logic. Query the exposure event grouped by variant to confirm exposure events are flowing correctly. +5. Check for bot/QA traffic — bots often skew toward control. If QA traffic isn't being excluded, recommend enabling that filter. +6. If exposures are very small (e.g. under ~1k total): SRM is unreliable on tiny samples. Wait for more data before acting. +7. If still failing: stop the experiment, fix bucketing, restart with fresh allocation. **Do NOT just re-conclude with the broken data.** + +--- + +## 2. Retro A/A (pre-experiment bias) failure + +**What the platform tells you**: the pre-experiment-bias analysis the platform attaches when that check is enabled in the experiment's settings. + +### What it means + +The same statistical comparison run on the **pre-exposure** period revealed that variant cohorts already differed _before_ the treatment started. Any "lift" measured during the experiment may just be reflecting that pre-existing gap, not the change. + +- Pre-experiment bias on a **primary** metric is a **stop-and-investigate** signal. +- Pre-experiment bias on a **secondary** metric is informational only. + +### Investigation checklist + +1. Identify which metric × variant pair triggered the failure (after the platform's correction). +2. Check whether bucketing was deterministic — non-deterministic assignment in the pre-period means users were assigned to different variants than they would have been in production. +3. Look for cohort skew: did one variant disproportionately receive heavy users? Query the metric pre-experiment grouped by variant to confirm. +4. Check for a recent product change that went out before the experiment — pre-period bias can reflect non-experimental treatment that disproportionately affected one cohort. +5. If isolated to a single metric × variant: consider dropping that metric from the analysis, or restart with new bucketing. + +--- + +## 3. Insufficient exposures + +**What the platform tells you**: per-variant exposure counts plus an "insufficient" flag when the count is too low to trust. Do not invent a per-variant threshold; route the user to extend or relaunch the experiment when the platform has flagged the issue. + +### Investigation checklist + +1. Check per-variant exposure totals — which variant is undersampled? +2. Inspect feature-flag rollout — was rollout dialed back? +3. Query the exposure event with a date breakdown to see if traffic dropped recently (seasonal? incident?). +4. If the experiment is still ACTIVE: extend duration via an experiment update with a new end target. +5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math. + +If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question. + +--- + +## 4. Frequentist peeking + +**What to check**: the experiment's testing model and whether it ended before reaching its configured end condition (sample size or duration, whichever was configured). + +### What it means + +A frequentist test that ends before reaching its configured target has an **inflated false-positive rate**. The math assumes a fixed sample size; peeking before that point and stopping on a favorable look is exactly what "p-hacking" looks like in production. + +### Investigation checklist + +1. Confirm the testing model is frequentist (sequential tests don't have this problem). +2. Compare the actual end date against the planned end (date- or sample-based, whichever the experiment was configured with). +3. If the conclusion was premature: results have inflated false-positive rate. Recommend a re-run. +4. If the user wants to keep current results: caveat strongly. Recommend a sequential testing model for the next experiment so they can stop early without penalty. + +(Sequential tests are designed for continuous monitoring — stopping early on significance is safe and intended for those, not a peeking violation.) + +--- + +## 5. Live computation timeout / broken data + +**What the platform tells you**: a non-null error block on the live results, with the live data path empty. + +### Investigation checklist + +1. Retry the experiment-details request once. If it fails again, surface the error and stop retrying — the tool layer owns retry policy. +2. On repeated failure: count metrics × variants × date range. Many metrics on a multi-variant experiment over a long window can exceed the query budget. +3. Recommend reducing scope: drop unused secondary metrics, narrow the date range, or temporarily archive metrics that aren't part of the decision. +4. If the cache is recent (within hours), surface those results with a "stale data" caveat and the timestamp. If the cache is days old or empty, the user must resolve the backend issue before any meaningful interpretation. + +--- + +## 6. Experiment ran < 3 days + +**What to compute (this one is local)**: the elapsed time between the experiment's start and end. + +Day-of-week, novelty, and cohort-skew effects dominate windows shorter than ~3 days regardless of sample size. **Refuse to interpret.** Tell the user explicitly: + +> _"This experiment ran less than 3 days. Day-of-week effects, novelty, and cohort skew dominate a window this short, so the results cannot be reliably interpreted — even if they look 'significant.' Recommend extending or relaunching with a longer planned duration."_ + +If the experiment was sample-size-bounded and a tiny target was reached in hours, increase the target and rerun. Reaching sample size quickly is not the same as a valid experiment window. + +--- + +## 7. Misconfigurations + +These don't always invalidate results, but they change how to _read_ them. Surface them as warnings during the trustworthiness gate. + +### Multiple-testing correction off with several primaries + +**Correction off AND 2+ primaries × 1+ non-control variants.** Any single significant primary may be a false positive — family-wise error rate scales multiplicatively (e.g. 15 primaries × 1 variant at α=0.05 → ~54% expected family-wise false positive rate). Look at primaries in aggregate: if most point the same direction, the effect is likely real; if only one or two of many are significant, recommend enabling Benjamini-Hochberg or Bonferroni and re-analyzing. + +### Extreme winsorization percentile + +**Winsorization enabled with a percentile far from the platform default (typically 95).** A percentile near 50 caps almost all data — almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason. + +### SRM check disabled + +**SRM check is off.** Often deliberate — e.g. when a feature-flag rollout intentionally splits traffic unevenly. Do not compute SRM yourself or treat the absence as a bug. Only flag if results otherwise look suspicious (Twyman-sized lifts, implausible exposure ratios) and then recommend re-enabling SRM and re-analyzing. + +### CUPED on new-users-only cohort + +**CUPED enabled AND the cohort is "new users only".** CUPED needs pre-exposure data, so it had no effect here — but **results are still valid**, variance reduction just didn't happen. Mention as informational. For future experiments on this surface, suggest extending the cohort to include returning users so CUPED can apply. + +### Non-default confidence level + +**Confidence level differs from the platform default (typically 0.95).** `0.9` (α = 0.10) inflates false positives; `0.99` (α = 0.01) is conservative. Call out in the verdict and combine with metric count to estimate the family-wise error rate. + +### Broken or placeholder metric entries + +**Metric entries with empty names.** Likely broken or placeholder references. Flag and skip during analysis. + +### Primary metric with no computed result + +**A metric is listed as primary but has no result (live or cached).** This is **"no measurement," not "no effect."** Surface prominently; recommend re-syncing results before any conclusion that depends on this primary. + +--- + +## Output shape when a health check fails + +1. **What failed**, in one sentence (use the verdict the platform attached — do not re-derive). +2. **What that means for trust** — cite the Kohavi framing (SRM is #1) or Twyman's Law where it fits. +3. **Likely causes**, ordered most → least probable. +4. **Recommended action** from the small set above. +5. **Investigation checklist** the user can run. +6. **What NOT to do** — usually, "do not act on the current lift / p-value numbers." diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/hypothesis-framing.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/hypothesis-framing.md new file mode 100644 index 0000000..8c115af --- /dev/null +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/hypothesis-framing.md @@ -0,0 +1,101 @@ +# Hypothesis framing + +All four properties of a good hypothesis — falsifiable, directional, mechanistic, bounded in time — matter. Drop any one and the design downstream silently degrades. + +## The shape + +> **If** ``, **then** `` will ``, **because** ``. + +| Property | Test | Failure mode | +| ------------------- | ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| **Falsifiable** | Could the data say "no"? | "Improving UX" can't be falsified. "Increasing weekly retention by ≥2pp" can. | +| **Directional** | Is the predicted change up or down? | "Affecting cart size" leaves the polarity ambiguous; the system defaults to `direction: "up"` and the interpretation step misreads regressions as wins. | +| **Mechanistic** | What's the proposed causal chain? | "Because users will see X and decide Y" is a mechanism. "We think it'll work" is not. Without a mechanism, the team can't tell when the metric they picked is actually downstream of the change. | +| **Bounded in time** | Does the predicted effect occur within a measurable window? | Day-30 LTV claims need a ≥30-day experiment. A 2-week test on a 30-day metric guarantees an inconclusive result on the real effect plus a high chance of reaching false significance from noise. | + +## When the user gives you a one-liner + +Ask them to commit to five things, in order. Don't proceed until you have all five. + +1. **The change** — what's different in treatment. A specific UI string, a routing change, a price, a copy variant. Vague ("the new onboarding") is not enough; "the new onboarding which moves the free-item offer to step 1" is. +2. **The primary outcome metric** — one specific event or rate, not a domain. "Engagement" is not a metric; "weekly active users with ≥1 report created" is. +3. **The expected direction** — up or down. (Goes straight into the metric's `direction` field.) +4. **The minimum effect size that would justify shipping** — this becomes the MDE. If the user can't name one, ask: "If the lift turned out to be 0.5%, would you ship?" Their answer reveals the MDE. +5. **The mechanism** — why you expect this to work. The mechanism is what binds the metric to the change. A change to onboarding screens shouldn't be measured by Day-30 retention if no one has gotten to Day 30 yet — the mechanism would say so explicitly. + +## Mechanism → metric class + +The mechanism predicts the _kind_ of metric that should move. Use this mapping as a sanity check: + +| Mechanism flavour | Likely primary-metric class | Anti-pattern | +| -------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- | -------------------------------------------------- | +| Reduces friction at a specific step | Step conversion rate (funnel-typed) | Headline retention metric | +| Surfaces a new option / increases discoverability | Click-through or first-use rate on the surfaced option (conversion) | Total events per user | +| Reorders information / changes salience | Time-to-task, completion rate on the salient step | Account-level revenue | +| Changes the cost of an action (price, paywall, friction) | Conversion-to-paid, refund rate, cancel rate (with `direction: "down"`) | DAU | +| Adds a new content / recommendation system | CTR on recommendations, downstream conversion | Aggregate engagement | +| Long-term retention play (referrals, loyalty) | Day-7 or Week-1 retention as leading proxy; lagging Day-30 stays a post-launch monitor, not a primary | Day-30 retention as primary on a 2-week experiment | + +When the user's mechanism and proposed metric live on different rows of this table, push back — that's the **hypothesis ↔ metric mismatch** pitfall. + +## Hypothesis ↔ metric alignment + +A hypothesis names a specific outcome. The primary metric must measure that outcome — **same population, same denominator, same timeframe**. Common misalignments: + +- Hypothesis predicts a **rate** change; primary metric is a **count** → switch to a rate metric, or use an exposure-rebalanced total. +- Hypothesis predicts effect on **paid users**; primary metric includes free users → add a cohort filter or scope the metric. +- Hypothesis predicts effect **within session**; primary metric is **per-user across sessions** → either narrow the metric or broaden the hypothesis. +- Hypothesis predicts effect **only on a new flow**; primary metric counts events that exist only in treatment → changed-denominator. The lift is artificially infinite. Pick a metric that exists for both arms. + +## When to push back + +Push back hard when: + +- The hypothesis is non-falsifiable. Until it can be tested with a yes/no answer from data, there's nothing to set up. +- The hypothesis is non-directional. The system's `direction: "up"` default is wrong for cancel / error / latency / abandon metrics; leaving it default silently flips polarity at interpretation time. +- The mechanism doesn't predict the proposed metric. Most "experiment didn't work because we measured the wrong thing" post-mortems trace back to here. +- The proposed primary is strongly lagging on the planned duration (retention as primary on a 2-week test). Suggest a leading proxy. + +When you push back, do it once with concrete language ("you said 'improve engagement' — which event do you want to move?"). If the user genuinely wants to leave the hypothesis vague, you can proceed, but log the vagueness in `description` so the post-launch step knows the test was exploratory rather than decisional. + +## Worked examples + +### ✅ Good + +> If we surface a free-item offer during onboarding step 2, then signup→activation conversion will increase by ≥3pp (currently 18%), because reducing first-action friction lowers cold-start dropout for new accounts. + +- Falsifiable: data can say "no, lift was <3pp." +- Directional: up. +- Mechanistic: first-action friction → cold-start dropout. +- Time-bounded: signup→activation is a within-session metric; readable inside any reasonable test duration. +- Mechanism predicts a conversion-class primary; signup→activation conversion fits. + +### ✅ Good (lagging hypothesis, leading proxy primary) + +> If we ship the new referral flow, then Day-30 retention will increase by ≥1.5pp, because referred users have stronger network effects. We will measure Day-7 retention as the experiment primary (historical correlation r=0.78 with Day-30) and keep Day-30 as a post-launch monitor. + +- Bounded-in-time problem is acknowledged and solved with a leading proxy. The lagging metric remains a post-launch check, not a ship gate. + +### ❌ Vague + +> Test the new onboarding. + +- No change description (which change? full redesign or one screen?). +- No outcome. +- No direction. +- No MDE. +- No mechanism. + +Coach: pull each of the five commitments out of the user before going further. + +### ❌ Non-falsifiable + +> The new dashboard will improve the user experience. + +- "Improve user experience" can't be tested. Ask: "Which specific behaviour changes if user experience is better? Engagement events per session? Time to first chart? Dashboards saved per user?" + +### ❌ Mechanism doesn't predict the metric + +> If we change the colour of the CTA button, then 30-day retention will increase by ≥2pp, because users will perceive the product as more polished. + +- Mechanism is plausible at best, but Day-30 retention is far downstream of a button-colour change. Even if the colour change does help, a 2-week experiment won't measure it. Either pick a leading proxy (click-through on the CTA) or shelf the test until you have a more credible mechanism for retention. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/lifecycle-handoff.md new file mode 100644 index 0000000..3a9e24c --- /dev/null +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/lifecycle-handoff.md @@ -0,0 +1,39 @@ +# Lifecycle Hand-off + +How to conclude an experiment once the verdict is settled. This reference is **interpretation guidance** — the per-field schema of the decide action lives in the experiment-update tool description. + +--- + +## Confirm before concluding — always + +Concluding an experiment is **irreversible**. Before invoking the decide action, surface the proposed parameters to the user (winning variant, success/fail, rationale message) and wait for explicit confirmation. A SHIP verdict is a recommendation, not an authorization. + +## The three pieces every decide call needs + +A decide call expresses three things: + +1. **Did the experiment succeed?** A win for one of the treatments, or a deliberate stop. +2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special constants below. +3. **Why?** A rationale message — what metrics were evaluated, the polarity reading, the tradeoffs accepted. The platform requires this on every decide call; treat it as a one-paragraph decision record, not a placeholder. + +## Special variant choices for success + +When you have a winning result but no single variant to ship: + +- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. (The platform exposes this as the constant `__no_variant_shipped__`.) +- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The platform exposes this as `__defer_variant_decision__` and shows `SUCCESS_DEFERRED` in the UI.) + +When the verdict is KILL — no winner — record success as false. No variant key is needed in that case. + +## Multi-variant experiments + +For a 3+ arm test, the decide action still names a single winning variant. If two treatments are roughly tied: + +- If both clear the practical-significance bar and shipping either is acceptable, pick on simplicity (smaller diff from control, lower implementation cost). +- If the team genuinely cannot pick, use the defer constant above — better than fabricating a winner. + +A multi-variant test where only one treatment is significantly different from control is a clean SHIP for that variant; the inconclusive arms are simply not the winner. + +## After concluding + +The decision record — the rationale message, the shipped variant, and the experiment's terminal status — becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/metric-selection.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/metric-selection.md new file mode 100644 index 0000000..7088e3c --- /dev/null +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/metric-selection.md @@ -0,0 +1,75 @@ +# Metric selection + +Each metric serves exactly one of three roles. The hypothesis tells you which. + +## Primary metrics (1–3 max) + +The metrics whose movement decides ship / no-ship. They come straight from the hypothesis's "outcome will ``" clause. + +- **Cap at 3.** Each additional primary inflates the family-wise false-positive rate. With multiple-testing correction enabled (which is the right default at 2+ primaries), more primaries → tighter per-metric threshold → harder to detect any individual effect. Beyond 3 the math punishes you regardless of how well the test is run. +- **Explicit direction.** Every primary needs `up` or `down`. The platform's default is `up`, which is wrong for cancel / error / latency / abandon / refund metrics. Setting it explicitly at setup time is the only way to keep the polarity correct through interpretation. +- **Leading, not lagging.** A primary must be able to actually move within the planned experiment window. Match the metric's response window to the experiment's duration: + - Onboarding-screen change → activation in the first session, not Week-4 retention. + - Checkout button A/B → checkout conversion, not 30-day LTV. + - Pricing-page tweak → click-through and trial start, not annualised revenue. + - When the only metric the team cares about is lagging, use a **leading proxy** with a known historical correlation to the lagging metric. The lagging metric stays a post-launch monitor, not a ship gate. +- **Prefer rates over counts** when the hypothesis is about behaviour change. "Conversion rate" is interpretable; "total conversions" conflates per-user behaviour with cohort size. + +If the user proposes a primary, sanity-check: + +- _Is this metric downstream of the change?_ (A pricing change cannot move "tutorial completion".) +- _Does the metric exist for both control and treatment users?_ If the change creates new events that don't exist in control, lift is artificially infinite (changed-denominator). +- _Is the metric's response window shorter than the experiment's duration?_ If not, the metric is lagging — pick a leading proxy. +- _Does the metric have enough volume to detect the expected lift?_ (Cross-reference `references/sizing.md`.) + +## Guardrail metrics (0+, strongly recommended) + +Metrics that **must not regress**, even if primaries win. The trustworthiness backstop on a ship decision: a 5% relative regression on any guardrail blocks ship even if the primary wins. This is the **>5% guardrail hard-gate**, and it's the most important single rule in the pitfall catalogue. + +Standard guardrails by domain — pick at least one from the row that matches the change: + +| Change targets… | Guardrail candidates | +| ------------------------------------ | ------------------------------------------------------- | +| Performance / UI / new client code | Page load time, API latency, error rate, crash rate | +| Engagement / activation / onboarding | Weekly active users, session count, Day-7 retention | +| Revenue / monetisation / pricing | ARPU, conversion-to-paid, refund rate, cancel rate | +| Trust / safety / moderation | Complaint rate, unsubscribe rate, support-ticket volume | +| Time-to-task / search / IA | Task abandonment rate, time-to-completion | + +For every guardrail, **set direction explicitly**. A guardrail named "errors" left at the default `up` will silently let regressions slip through interpretation as "wins." + +Same lagging-indicator rule applies: a guardrail that takes 30 days to react can't protect a 2-week experiment. If the user names retention or LTV as a guardrail on a short experiment, recommend a leading proxy (Day-1 or Day-7 retention) and demote the lagging metric to a post-launch monitor. + +## Secondary metrics (0+, diagnostic only) + +Metrics for understanding **why** the primary moved, not for the ship decision. Examples: funnel-step completions, feature sub-use rates, time-on-screen, exploratory cohort breakdowns. + +**Secondary metrics are not decisional.** Even if the user names a secondary in their hypothesis text, they cannot ship/kill on its result. If a metric matters for the decision, it must be primary or guardrail. + +> **Setup misconfiguration to flag.** If the user's hypothesis text names a metric that they then classify as secondary, ask: +> _"You mentioned `` in your hypothesis. Should this be a primary metric? Secondary metrics don't influence ship/no-ship decisions, so if it matters for the outcome, promote it."_ + +This is the **Hypothesis ↔ metric mismatch** pitfall — see [pitfalls.md](pitfalls.md). + +## Sanity checklist + +Run this before locking the metric set: + +- [ ] Each primary directly measures the hypothesis's predicted outcome. +- [ ] Each primary has an explicit direction (not the platform default). +- [ ] At least one guardrail covers the most likely failure mode of the change (perf for UI changes, retention for monetisation changes, etc.). +- [ ] Each guardrail has an explicit direction. +- [ ] No metric whose denominator is created by the treatment itself (changed-denominator). +- [ ] No primary or guardrail is a strong lagging indicator on the planned experiment duration (use leading proxies; demote lagging metrics to post-launch monitors). +- [ ] Total primary count ≤ 3. +- [ ] If primary count ≥ 2 OR non-control variants ≥ 2, multiple-testing correction is on (Benjamini-Hochberg default, Bonferroni for strict family-wise control). +- [ ] For each primary, baseline rate has been pulled from real data (not guessed). + +## Anti-patterns + +- ⛔ **No guardrails to "avoid noise."** Guardrails are the regression detection, not noise. Without them, a winning primary with a quietly regressing latency or refund-rate is a ship — and then a rollback two weeks later. +- ⛔ **Five primaries because "they're all important."** Past 3, the false-positive risk dominates. Pick the 1–3 the hypothesis actually predicts; demote the rest to secondaries. +- ⛔ **Primary = "total signups," metric = behaviour change.** A behaviour-change hypothesis needs a rate metric; total signups conflates per-user behaviour with the size of the cohort that entered the experiment. +- ⛔ **Guardrail left at default direction `up` on an error / cancel / latency metric.** Silently inverts the regression check. +- ⛔ **30-day retention as primary on a 2-week experiment.** Either the lagging metric can't move (no signal) or it moves on noise (false significance). Use a leading proxy. +- ⛔ **Primary metric only exists in treatment.** Changed denominator. Lift is meaningless. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/per-metric-interpretation.md new file mode 100644 index 0000000..e46381c --- /dev/null +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/per-metric-interpretation.md @@ -0,0 +1,167 @@ +# Per-Metric Interpretation + +Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_ + +--- + +## The mental model + +Each row in `summary.positive` / `summary.negative` / `summary.no` answers four questions: + +1. **Did the lift go up or down?** — the `summary` bucket name (sign-of-lift, not polarity). +2. **Was the change distinguishable from noise?** — the `significance` field (or the bucket name itself: rows in `summary.positive` / `summary.negative` are significant, rows in `summary.no` are not). +3. **Was the change in the goal direction?** — apply the polarity recipe with `metric.direction`. +4. **Was the change big enough to matter?** — multiply `lift` by the control baseline `value` to get absolute impact, then judge against business context. + +A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any one of those and you're shipping the wrong thing. + +--- + +## Polarity recipe + +Apply the polarity recipe from the spine — see the **Components** section of `SKILL.md`. Treat the bucket name in `summary.positive` / `summary.negative` as sign-of-lift only; the business verdict comes from combining it with `metric.direction`. Examples worth remembering: + +- A row in `summary.positive` with `direction: "down"` is a **regression**. +- A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1% interstitials_shown` lift means less interruption). + +--- + +## Reading the p-value in this platform + +Mixpanel runs a frequentist comparison at the experiment's configured confidence level — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). + +The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread. + +For the general meaning of a p-value (the probability under the null), trust the model's baseline knowledge — don't invent thresholds in either direction. + +--- + +## Reading the lift correctly + +``` +lift = (treatment_mean - control_mean) / control_mean +``` + +- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct. +- If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect." + +--- + +## Verdict phrasing — a small palette + +Pick the phrase that matches the four-question pattern. These are the words to use with users; they map onto the platform's already-computed numbers, so the agent never has to invent thresholds. + +| Pattern (sig × polarity × magnitude) | Plain-language verdict | +| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Significant, polarity positive, magnitude large vs baseline | "**Clear win** — `` moved `` in the goal direction, which is meaningful at this baseline." (apply Twyman's Law if lift > ~30%) | +| Significant, polarity positive, magnitude small vs baseline | "**Statistically significant but practically small** — `` on a `` baseline is ``; confirm with the user whether that clears the business bar." | +| Significant, polarity negative | "**Regression** — `` moved `` against its goal direction. This is a reason not to ship even if other primaries won." | +| Not significant, lift in goal direction, well-powered | "**Likely no effect at the detectable size.** The experiment had enough power to detect ``; the observed lift is below that threshold." | +| Not significant, lift in goal direction, underpowered | "**Inconclusive — too underpowered to call.** Route to the why-no-statsig playbook to decide between wait / extend / restart." | +| Not significant, lift in wrong direction | "**No detectable harm**, but no win either." | +| `lift is None` | "**No measurement** — this variant's row failed to compute. Surface the failure and re-sync." | +| Lift > ~30% on any metric | Prefix with "**Twyman's Law check:** that lift is unusually large; verify the denominator hasn't changed before celebrating." | + +--- + +## Magnitude — make it absolute + +Statistical significance ≠ business impact. Always convert a win into absolute terms before declaring it meaningful: + +1. Baseline from the control variant's metric value (the experiment-details response carries it on the per-variant row). +2. Lift from the winning row. +3. Absolute lift: `baseline × lift`. Examples: + - `baseline = 0.02`, `lift = 0.04` → `+0.0008` → **+0.08 percentage points** of conversion rate. + - `baseline = 12.4 events/user/week`, `lift = -0.05` → `-0.62 events/user/week`. +4. Project to population per period: ask the user for traffic estimates if not in context. "A 5% lift on a 20% baseline metric serving 1M users/week" sounds very different from "a 5% lift on a 0.1% baseline metric serving 1k users/week." + +### Fallback when the baseline value or sample size is missing + +Common — happens whenever live computation timed out or the cached results were nulled. Don't silently skip practical significance; **a broken-data summary with only the lift number is exactly when users over-trust the percentage.** + +Run a query on the metric, scoped to the control variant over the experiment's date range, to fetch the baseline. Match the metric's aggregation: + +- `unique` (Bernoulli) → conversion **rate** as the baseline. +- `total` (Poisson / sum) → per-exposure **average** (raw total ÷ exposures), not the raw total. Multiplying lift by a raw total double-counts cohort size. + +--- + +## Twyman's Law in practice — changed-denominator lifts + +Before celebrating any lift > ~30%, ask: **did the treatment change who is _exposed_ to this metric, not just how they behave?** + +If the treatment causes more users to _see_ a screen, more events naturally fire — the metric grows because the denominator changed, not because per-user behavior changed. + +- A "Free item" promotion drives more users to checkout → "Checkout Screen Viewed" lifts +1000% mechanically. The interesting question is **conversion rate on the screen**, not raw views. +- A new banner makes a feature discoverable → "Feature Page Viewed" lifts dramatically. **Per-discover-er behavior** may be unchanged. + +When you see a > 30% lift, name the risk explicitly: + +> _"This metric measures exposure to the screen/event. The treatment likely caused more users to be exposed; that explains most of the lift mechanically. The interesting question is what those users did once they got there."_ + +--- + +## Metric distribution types + +Different metric types behave differently; cite the relevant nuance in your verdict. + +| Metric type | Distribution | Interpretation nuance | +| -------------------------------- | ------------ | --------------------------------------------------------------------------------------------------------- | +| Unique users / conversion rate | Bernoulli | Variance = `p(1−p)`. Lift on rates near 50% is most powered; rates near 0% or 100% need much more sample. | +| Event counts / sessions per user | Poisson | Variance = mean. Highly sensitive to power users; consider whether one heavy user can swing results. | +| Revenue / numeric properties | Gaussian | Long tails (whales) inflate variance. Strongly consider Winsorization. | + +--- + +## Variance-reduction & outlier settings that change interpretation + +- **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). +- **Winsorization enabled**: extreme values capped at the configured percentile, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A percentile much lower than the platform default (typically 95) is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). + +--- + +## Multiple comparisons & metric tiers — what's decisional and what isn't + +| Tier | How it influences the verdict | +| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Primary** | **Decisional.** The platform auto-applies correction when the experiment is configured for Bonferroni or Benjamini-Hochberg (across primaries × variants). | +| **Guardrail** | **Vetoes** a ship if polarity is negative with meaningful magnitude. | +| **Secondary** | **Exploratory only.** NOT Bonferroni-corrected. **Never base a ship decision on secondary metrics**, even if the hypothesis text references them. Treat any "significance" here as a hypothesis to test next. | + +If multiple-testing correction is off AND there are 2+ primaries × 1+ non-control variants: don't auto-discount a single significant primary, but look at the aggregate. If most primaries point the same direction, there's likely a real effect. If only one or two of many are significant, it's inconclusive until correction is enabled. + +--- + +## When a primary metric is inconclusive + +A "not significant" verdict means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level — **not that there is no effect.** Important when the user is about to call something a null result. + +For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see [why-no-statsig.md](why-no-statsig.md). + +--- + +## Frequentist vs Sequential — what affects per-metric reading + +Concluding a Frequentist experiment before it reaches its configured target is a peeking event — per-metric significance verdicts become unreliable. Sequential experiments are designed for continuous monitoring and don't have this problem. + +For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of [health-check-interpretation.md](health-check-interpretation.md). + +--- + +## Triggered analysis & dilution + +If the change only affects a subset of users (e.g. only triggers when a specific button is shown), the **effect on triggered users** is much larger than the **effect on the full exposed population**. + +- Triggered analysis zooms in on users who actually saw the change. +- Dilution math: `population_lift = triggered_lift × (triggered_users / total_exposed)`. + +The platform doesn't auto-compute triggered analysis. If the change is gated by a condition, ask the user about the trigger rate and walk through the math before declaring the population-level lift "small." + +--- + +## Novelty and primacy + +- **Novelty** — lift is large early, then decays as users habituate. +- **Primacy** — lift is small or negative early, then grows as users learn the new behavior. + +To detect either, look at the line-chart view of the metric (date-segmented). A monotonic decay from day 1 → day 14 is classic novelty; the steady-state lift is what matters for shipping. Call this out when interpreting any experiment shorter than ~2 weeks. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/pitfalls.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/pitfalls.md new file mode 100644 index 0000000..5ce40a1 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/pitfalls.md @@ -0,0 +1,93 @@ +# Pre-launch pitfalls + +Catalogue of the deterministic checks to run before the user creates an experiment. Detection logic lives in the platform's pre-launch validation capability; this document owns the prose — the _why_ behind each check — so the agent can explain the violation in human terms rather than just nagging. + +## Triage order + +Surface pitfalls in this order: + +1. **Blockers first.** An experiment that triggers a blocker should not launch as-is. Two today: **insufficient duration** for the configured MDE on available traffic, and **cohort too small** to supply enough eligible users. Both mean the experiment literally cannot reach statistical power. +2. **Warnings next.** Configuration smells that would degrade interpretability or trustworthiness. Most pitfalls fall here. +3. **FYIs last.** Soft nudges; not blocking even if the user ignores them. + +Within a severity tier, surface in this order (most actionable first): data-trust risks (pre-experiment bias, variance inflation) → configuration nudges (guardrails, hypothesis alignment). + +## The >5% guardrail hard-gate + +The single most important rule in the catalogue. **A 5% relative regression on any guardrail blocks ship even if the primary wins.** + +### Why 5% + +The threshold is calibrated to be tight enough to catch real degradations of user experience, revenue, or performance, and loose enough that day-to-day noise on a moderately-volatile guardrail doesn't trip it on every test. + +- Below 5%: typically within the noise band of most guardrails on a 2-week test. Tightening below 5% would generate too many false alarms. +- Above 5%: the team has implicitly traded measurable user/revenue/performance damage for headline-metric lift. That's not a ship — that's a re-design. + +### Why "hard gate" + +Guardrails are not "things to also look at." They are the **trustworthiness backstop**. A winning primary with a regressing guardrail means the change _exchanged_ something the team agreed must not regress for the headline-metric lift. If guardrails are negotiable, they aren't guardrails. + +### Why explain it to the user + +The most common reaction to a guardrail regression is "but the primary won, can't we just ship?" The agent's job is to make the trade-off explicit: + +> "Primary metric `` won by +2.3pp, but guardrail `` regressed by 7.4%. The 5% threshold exists because guardrails are the trustworthiness backstop — a winning primary with a regressing guardrail means you've traded `` for ``, which is a design choice that needs explicit sign-off, not a ship decision." + +If the team genuinely wants to make that trade, they can disable the guardrail before launch and document the decision in the experiment's description. Don't let them silently override; force the conversation. + +--- + +## The catalogue + +### Insufficient duration for the configured MDE — blocker + +**Expected exposures over the planned window cover less than 50% of the required per-arm sample.** The experiment cannot reach statistical power for this MDE no matter how clean the rest of the config is. The most likely outcome is "inconclusive," and a non-trivial fraction of those inconclusive results will be noise crossing the significance threshold rather than a real effect (the winner's-curse problem). Extend planned duration to cover the required sample, OR relax the MDE (only ship if the lift is bigger), OR pick a higher-volume primary metric, OR enable CUPED if pre-exposure data is available (cuts required sample 30–70%). + +### Cohort too small — blocker + +**Eligible cohort size is smaller than (number of arms × per-arm target).** Same root cause as the duration blocker, different lever. Even with infinite time, the experiment will run out of eligible users. Either expand the cohort to comfortably exceed (number of arms × per-arm target) eligible users (relax filters, broaden segment, extend eligibility window), or lower the per-arm target to what the cohort can supply (and accept the larger achievable MDE). + +### Pre-experiment bias likely — warning + +**Retro A/A is enabled, at least one continuous-ish metric (continuous, retention, or funnel) is configured, AND CUPED is off.** Pre-experiment bias is likely on metrics with seasonality or power-user skew. Without CUPED to absorb the baseline difference, post-experiment lifts inherit it — the team sees "treatment up 2%" when the real treatment effect is 0% and the baseline difference is +2%. Enable CUPED with a 2–4 week pre-exposure window; it specifically regresses out the pre-exposure baseline difference. + +### High variance, no Winsorization — warning + +**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the platform default percentile (typically 95). Push back if the user sets a percentile below ~80 — more than 20% of values capped is almost always a misconfiguration. + +### Multiple primaries, no correction — warning + +**≥2 primary metrics configured AND multiple-testing correction is off.** Family-wise false-positive rate compounds with each additional primary: at 3 primaries ~14.3%, at 5 ~22.6% — more than one in five "wins" is noise. Enable multiple-testing correction. Default to Benjamini-Hochberg (more powerful with correlated metrics); use Bonferroni for strict family-wise error control. + +### Marginally underpowered duration — warning + +**Expected exposures cover 50–100% of the required per-arm sample.** The experiment might reach significance on a true effect; it might not. Either way, the lift estimate at conclusion will be wider than expected. Extend duration to reach 100%+ of the required sample, or accept the higher Type-II error rate. Less urgent than the insufficient-duration blocker. + +### Missing guardrails — warning + +**Zero guardrail metrics configured.** Without guardrails, there's no >5% hard-gate to block a ship on a regression. The team is implicitly trusting that the primary captures every relevant impact — rarely true. Add at least one guardrail covering the most likely failure mode of the change: + +- UI change → page-load time or error rate. +- Monetisation / pricing → cancel rate or refund rate. +- Engagement change → Day-7 retention or session count. +- Performance change → error rate or crash rate. + +### Hypothesis ↔ metric mismatch — warning + +**The hypothesis mentions a canonical metric noun (conversion, retention, revenue, signup, engagement, click, purchase) but no primary's name appears to measure that outcome.** Soft signal — the heuristic is coarse, but it catches the common case where the user wrote "X will increase conversion" and then set the primary to "session count." Phrase as a question, not a verdict: _"Your hypothesis mentions ``, but no primary metric name suggests it measures that. Should `` be replaced or supplemented with a metric that more directly tests the hypothesis?"_ + +### Primary lacks a leading indicator — warning + +**Primaries include a retention-type metric AND no leading-indicator secondary (conversion or funnel type) is configured.** A retention primary is valid but reads slowly — there may not be enough signal to interpret results before the experiment concludes. Add a leading-indicator secondary measured within the experiment runtime; the retention primary stays as the ship decision, the secondary just gives early visibility. + +--- + +## Detection vs prose + +The detection math lives in the platform's pre-launch validation capability. The prose lives here. The platform reports which check fired; the agent renders the human-readable message. When the platform's thresholds change (e.g., the 50% / 100% bounds on the underpowered checks), the recommendation language in this document needs to track — the agent will quote stale numbers otherwise. + +## What's not in the catalogue (yet) + +- **Cross-test contamination** — when the same users are eligible for multiple concurrent experiments on the same surface. Hard to detect statically; usually surfaces as anomalous variance at interpretation time. +- **Novelty effect detection** — early days of the experiment show inflated treatment effect, then settle. Not a pre-launch check; lives in the post-launch interpretation skill. +- **Seasonality misalignment** — running a 2-week experiment that doesn't align to weekly cycles. Today this is detected indirectly via the duration check; a future explicit seasonality-alignment check is a reasonable add. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/prior-experiments.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/prior-experiments.md new file mode 100644 index 0000000..60d083c --- /dev/null +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/prior-experiments.md @@ -0,0 +1,81 @@ +# Prior experiments + +The first thing to do when a user proposes an experiment on a feature is look up prior experiments on that feature. Skipping this leads to redundant tests, contradictory ship decisions, and wasted traffic. + +## The lookup + +Search the project's prior experiments by keywords drawn from the feature name and surface area. Cast the net wide on the first call — single-keyword searches catch related experiments the user may have forgotten about. + +If no prior-experiments lookup is available in the current environment, tell the user explicitly that you couldn't check and proceed. Don't fabricate "no prior tests found" — that's worse than admitting the blind spot. + +## What to do with what you find + +### Same feature already tested and shipped + +Reference the prior result before recommending a new test. The right answer is often "don't re-run; iterate on a new hypothesis." + +> "There's a prior experiment from [date] on the same feature with a similar hypothesis: it shipped at +X% on metric Y. Re-running won't tell us anything new. What's different about the change you're proposing? Is the new hypothesis about a different sub-population, a different metric, or a different mechanism?" + +If the user does want to re-run (e.g., the population has shifted significantly, the underlying product has changed, or the prior test was clearly underpowered), proceed — but design the new test to specifically address what's different from the prior. + +### Same feature tested and killed + +Treat this as a strong prior. Ask why the user thinks the new variant will work where the prior didn't. + +> "Prior experiment [date] on the same surface killed at [-X% / inconclusive]. What's different about your change that should produce a different outcome? If the prior failed because of [mechanism], does your change address that?" + +If the user can articulate a different mechanism, run the new test. If they can't, the most likely outcome is a repeat of the prior result — discourage the test or downgrade its priority. + +### Earlier iteration of the same hypothesis + +Use the prior result to inform the new design — specifically, **baseline rates and variance estimates**. Prior data is much more reliable than guessing. + +- Pull the prior's control-arm baseline rate; use it as the baseline for the new sizing calculation. +- Pull the prior's observed variance; use it instead of estimating from scratch. +- Pull the prior's exposure rate (exposures per day per variant); use it to set a realistic duration estimate. + +This often shrinks the required sample size or shortens the planned duration. Both are wins worth surfacing. + +### Recently concluded with similar metrics + +Pull the realised exposure rate. The "expected exposures per day" the user has in mind is usually higher than what actually shows up in a real experiment on the same surface — eligibility filters, opt-outs, and bot exclusion all bite. Use the prior's actual rate, not the theoretical one. + +### Multiple prior experiments on adjacent surfaces + +Look for **patterns**, not single data points. If three prior tests on the same funnel stage all moved in the same direction by similar magnitudes, that's the realistic prior for what the new test will do. If the prior tests are noisy or contradictory, treat the new test's expected lift with more uncertainty and consider running it longer. + +## Folding prior results into the new design + +Concretely, when you have a prior result that's relevant, the setup workflow changes as follows: + +| Step | Without prior | With prior | +| -------------------------- | ----------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------- | +| Step 1 — hypothesis | Coach from scratch | Anchor on the prior's hypothesis; ask what's different | +| Step 2 — metric selection | Suggest standard primaries/guardrails | Use the prior's metric set as the default; modify only with reason | +| Step 3 — sizing | Query baseline + variance over the prior window | Use the prior's observed baseline and variance | +| Step 4 — statistical model | Default to sequential / benjamini-hochberg | If the prior used a specific model and the team is comparing across tests, keep the same model for comparability | +| Pitfall check | Run the standard catalogue | Cross-reference: did the prior have an SRM problem? A guardrail regression that should be set up as primary this time? | + +## When prior tests warn you away from testing at all + +Sometimes the prior data tells you the right answer is **don't run the experiment**: + +- The metric the user wants to move has been tested 4 times on this surface in the last year, all with inconclusive or null results, all adequately powered. The hypothesis-space is likely exhausted; suggest a different mechanism or a different surface. +- The baseline rate is so low that even the prior, well-powered tests couldn't detect anything below a 30% relative lift. The new test would inherit the same constraint. Either pick a higher-volume proxy metric or accept that the change has to be very large to be detectable. +- Recent guardrail regressions on the same surface suggest the surface is unstable; running more experiments without first fixing the trust issue is wasted traffic. + +Surface these findings as recommendations, not blockers. The user might have context the prior data doesn't capture. + +## What to record about the new design's relationship to prior tests + +In the experiment's description, link to the prior experiment(s) and note how the new design differs. This becomes critical at interpretation time — the post-launch step uses the prior context to calibrate its read of the new result. + +A useful template: + +``` +Prior: tested on , result: . +This experiment differs by: . +Inherited from prior: baseline rate (X%), σ², exposure rate (N/day/variant). +``` + +This is a 30-second annotation that pays back tenfold at analysis time. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/routing-xp-vs-ff.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/routing-xp-vs-ff.md new file mode 100644 index 0000000..d077dba --- /dev/null +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/routing-xp-vs-ff.md @@ -0,0 +1,85 @@ +# XP vs FF: routing intent + +Before any setup work, decide whether the user actually wants an **experiment** (XP) or just a **feature flag** (FF). The decision is binary, but the language users use is blurry — "let's A/B test this" sometimes means "let's run a controlled experiment with a hypothesis and a stopping rule," and sometimes means "I want to ship it to 10% of users and see if anything breaks." + +## The discriminator + +| If the user wants… | Then it's a… | +| --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ | +| Causal evidence — "does this change move metric X by enough to justify shipping?" | **Experiment** (XP). | +| Progressive rollout — "ship to 10%, then 50%, then 100% if nothing breaks." | **Feature flag** (FF). | +| Kill-switch — "I want to be able to turn this off instantly if it goes sideways." | **Feature flag** (FF). | +| Per-segment gating — "only show this to enterprise customers." | **Feature flag** (FF). | +| Targeted access — "give beta access to these 50 design partners." | **Feature flag** (FF). | +| Both — "ship to 10%, but also tell me if it moves checkout conversion." | **Experiment** with a phased rollout, or **FF + a separate experiment** later. | + +The clean way to think about it: a feature flag is a **delivery mechanism**. An experiment is a **decision mechanism** built on top of one. Every experiment uses a feature flag under the hood (Mixpanel auto-creates one when an experiment is created); not every feature flag use case needs an experiment. + +## Disambiguation prompt + +When you can't tell from the user's wording, ask once, plainly: + +> "Are you trying to **measure** whether this change moves a metric (experiment), or are you rolling it out gradually / behind a flag with **no measurement criterion** (feature flag)? An experiment commits to a hypothesis, metrics, and a stopping rule; a feature flag is purely a delivery mechanism." + +Listen for these signals in the answer: + +- "I want to see if it improves X" / "if checkout conversion goes up" → experiment. +- "I want to make sure it doesn't break X" → could be either. Probe: "Is 'doesn't break' a measurable threshold, like a guardrail, or is it 'I'll watch dashboards and roll back if it's obviously bad'?" +- "I want enterprise to get it first" / "I want to roll out by region" → feature flag. +- "I just want a kill switch" → feature flag. +- "I want to ship it and prove ROI later" → ask whether the proof needs to be causal. If yes, that's an experiment, and it should be set up _before_ shipping, not after. (Post-hoc ROI claims from a flag rollout are not credible.) + +## Common ambiguous cases + +### "Ship to 10% as an experiment" + +Often this means "phased rollout, monitor metrics, ramp if nothing regresses." That's a feature flag with manual ramp logic, not an experiment. + +Ask: "Do you have a primary metric you're committing to before launch, with an MDE that decides whether to ship to 100%?" If yes, run as an experiment. If no, ship as a flag. + +### "I want to test the new pricing on enterprise customers" + +If "test" means "see how they react and decide whether to roll out," and the audience is small (a few enterprise customers), that's a **rollout**, not an experiment. Enterprise samples are usually too small to power an experiment, and the per-account variance is too high for a meaningful aggregate. + +Run as a flag, gather qualitative feedback, and decide based on the conversations — not on a p-value computed from N=4. + +### "Hold out a control while we ship to 100%" + +This is the classic "holdout experiment." Legitimate use case, but it has to be set up as an experiment up front (with a primary metric and a duration), not retroactively. After-the-fact holdout analysis suffers from selection bias and is not credible. + +If the user has already shipped to 100% and wants to "analyse the effect," there is no experiment to set up. Tell them so, and suggest a forward-looking test on the next change to the same surface. + +### "Just give me an A/B test, the simplest one" + +Probably an experiment. But "simplest" usually means "skip hypothesis, skip MDE, skip guardrails," which kills the test's interpretability. Coach the user through Step 1 (hypothesis) and Step 2 (metrics) of the main workflow — the cost is 10 minutes; the value is having a result you can actually act on. + +### "I want a feature flag but with stats" + +Now you're back to an experiment. Run the full setup workflow. + +## What changes once you've routed + +### If experiment + +Continue with the four-step setup workflow in the main `SKILL.md`. The output of this skill is a configured experiment ready to launch. + +### If feature flag + +This skill stops. Hand off to the `manage-feature-flags` skill: + +- They configure variants, targeting, and rollout percentages directly. +- No hypothesis, no MDE, no stopping rule needed. +- Mixpanel doesn't compute lift or significance on a flag — they're on their own for observation. + +Make sure the user understands the trade-off explicitly: "Choosing flag means you give up the ship/no-ship decision criterion. If later you want to claim the change worked, that claim won't have the same evidentiary weight as a properly-designed experiment." + +## Don't run an experiment when + +There are cases where an experiment is technically possible but the wrong move: + +- **Sample is too small.** Enterprise rollouts to ~10 accounts cannot power a real test. Ship as a flag and use qualitative feedback. +- **Treatment is risky/irreversible.** A real billing change with potential refunds shouldn't run as a 50/50 split — phase as a flag with conservative rollout and direct monitoring. +- **No baseline data.** Brand-new metric, brand-new feature, no historical observation. Run a 1–2 week passive observation period first, then design the experiment from real numbers. +- **Hypothesis is "let's see what happens."** No directional commitment means the test will be interpreted post-hoc, which is the same as not running an experiment. + +Suggest the alternative explicitly so the user doesn't feel rejected — "this isn't an experiment-shaped problem; here's what to do instead." diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/segment-breakdown-interpretation.md new file mode 100644 index 0000000..98c7bbc --- /dev/null +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/segment-breakdown-interpretation.md @@ -0,0 +1,99 @@ +# Segment-Breakdown Interpretation + +Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. + +--- + +## The mental model + +A segment breakdown asks: _did the treatment affect different user segments differently?_ It has three possible outcomes per segment: + +1. **The segment moved in the same direction as the overall effect**, with similar magnitude → reinforces the overall verdict; nothing new. +2. **The segment moved much more or less than overall**, but in the same direction → heterogeneity; the effect is concentrated in a subset. +3. **The segment moved in the _opposite_ direction** to overall → Simpson's paradox or a real reversal — this is where segment analysis earns its keep. + +Reading a segment breakdown well means recognizing which of those three you're looking at and not mistaking noise for any of them. + +--- + +## Per-segment polarity recipe — apply per row + +The same recipe from the per-metric reference applies _inside_ each segment. Don't take a shortcut. + +- For each segment × metric × non-control variant, look at the row's `lift` and bucket (positive/negative/no). +- Translate sign-of-lift into business polarity using `metric.direction`. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary. +- Filter out the control row in each segment. + +Surprisingly easy to forget when you're scanning a wide table — re-apply polarity per row. + +--- + +## Sample-size floor per segment + +Each segment value needs its own meaningful per-variant sample for the per-segment stats to be reliable. The platform surfaces an "insufficient exposures" flag at the overall level — trust that signal over a hand-rolled threshold, and apply the same logic per segment. + +- Segments the platform would flag insufficient if scoped to alone → mark "insufficient sample, treat as directional only." +- A "significant" lift on a tiny per-variant segment (e.g. tens of users) is almost always noise. Say so. +- If many small segments matter to the user, pool them (e.g. all small countries into "RoW") and re-slice. + +--- + +## Heterogeneity vs Simpson's paradox vs noise + +| What you see | Interpretation | +| --------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Most segments lift positive, one or two negative, all with overlapping CIs | **Noise.** Not heterogeneity. Don't ship a segment-specific story. | +| One segment lifts much more than the rest, with a tight CI and a clear mechanism | **Real heterogeneity.** The change is concentrated in that segment. Consider shipping only to that segment, or revising the hypothesis. | +| Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. | +| Two opposite-direction effects in different segments that roughly cancel overall | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses. | + +When you spot Simpson's paradox, route the user to the **SRM** section of [health-check-interpretation.md](health-check-interpretation.md) — bucketing is usually the cause, not a real reversal. + +--- + +## What a "ship only to segment X" recommendation requires + +Don't recommend a segment-scoped ship unless **all** of these hold: + +1. The segment was named in the hypothesis upfront (pre-committed), OR the mechanism makes the heterogeneity obvious in hindsight (and you can articulate it). +2. The segment's per-variant sample clears whatever exposure floor the platform applies to the overall experiment, by a comfortable margin. +3. The segment's overall result (polarity-corrected) is a win on the primary metric with no guardrail regressions in that segment. +4. Guardrail behavior in the **other** segments is acceptable — shipping to one cohort doesn't quietly regress the rest of the product. +5. Multiple-testing correction is enabled, OR the segment was named upfront so multiple-testing doesn't apply. + +Otherwise, the segment-only ship is a post-hoc story dressed up as a decision. Recommend confirming with a follow-up experiment scoped to that segment. + +--- + +## When a segment loses but overall wins + +This is the everyday case of mixed effects. + +- If the losing segment is small and its absolute hit is acceptable, ship to all — but call out the loser in the rationale. +- If the losing segment is large or has a guardrail regression, recommend iterate, not ship. +- If the losing segment is a regulated / strategic cohort (paying tier, top customers, EU), default to iterate — guardrails on the cohort, not just overall. + +--- + +## What NOT to do + +- ❌ Slice by every dimension after the fact and report the most significant segment as the result — that's the canonical fishing expedition. +- ❌ Apply overall multiple-testing correction logic to segment-level rows from a per-segment query fallback — they're not corrected unless the platform did it. +- ❌ Confuse Simpson's paradox with a real reversal — check SRM per segment before claiming a true reversal. +- ❌ Recommend ship-to-segment based on a segment that wasn't pre-committed in the hypothesis or doesn't have a clean mechanism. +- ❌ Quote a per-segment lift number without the sample-size context (a 40% lift on 60 users isn't a number, it's a sentence). + +--- + +## Output shape + +1. **One-sentence segment-level summary** — homogeneous, heterogeneous, or Simpson's-suspicious. +2. **Per-segment table** — segment, exposed-per-variant, polarity-corrected verdict (win / loss / no effect / underpowered). +3. **What the segment view changes about the overall verdict** — usually one of: nothing, narrow to subset, iterate due to one cohort, or "investigate Simpson's." +4. **Caveats** — which segments are below the sample floor, which weren't pre-committed (and so are hypothesis-generating). + +--- + +## Platform support status + +Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the rules above. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/segment-of-interest-selection.md new file mode 100644 index 0000000..4db49ac --- /dev/null +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/segment-of-interest-selection.md @@ -0,0 +1,116 @@ +# Segment-of-Interest Selection + +Pick 3–5 segments **likely to reveal a real effect difference** before slicing every available dimension and ending up p-hacking. + +The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them. + +--- + +## Why this matters: the fishing-expedition problem + +If you slice an experiment by every available property (10 platforms × 20 countries × 5 plan tiers × …), you will find "significant" segment-level effects by chance alone. The family-wise false positive rate explodes the same way it does for too many primary metrics — except there's usually no platform-level correction across segments. **Pre-committing to a small set of segments, ordered by hypothesis-driven probability, is the discipline that makes segment analysis credible.** + +Aim for 3–5 segments, max. If the user wants more, ask which ones are connected to the hypothesis and which are exploration. Mark the exploration set as "hypothesis-generating, not decisional." + +--- + +## The decision tree for picking segments + +Walk through these in order. The first match is the most defensible pick. + +### 1. Segments the hypothesis explicitly names + +If the experiment's `hypothesis` (or `description`) text mentions "new users", "mobile", "Pro tier", "EU customers" — those segments are pre-committed by the experiment design. Always include them. + +Look at: + +- `experiment.hypothesis` +- `experiment.description` +- The setup-side conversation, if present + +These are not exploratory; they're the variables the team committed to test. + +### 2. Segments where the mechanism is expected to matter + +The hypothesis names _what_ the change is and (ideally) _why_ it should work. The "why" tells you which user attributes plausibly moderate the effect: + +| Hypothesis mechanism | Segments likely to moderate the effect | +| ------------------------------------------------- | -------------------------------------------------- | +| "Reduces first-time friction in onboarding" | New vs returning; signup source; locale | +| "Improves discoverability of feature X" | Users who previously used X vs not; tenure | +| "Speeds up a slow flow" | Platform (mobile slower than web); connection type | +| "Lowers payment friction" | Plan tier; payment-method type; geography | +| "Replaces a confusing UI element" | New vs returning (returning users habituated) | +| "Surfaces a feature only relevant to power users" | Engagement-tier cohorts; tenure | +| "Localized copy / pricing change" | Country / language | + +If you can't articulate _why_ a segment should respond differently, it's not a hypothesis-driven slice. Demote it. + +### 3. Segments where the **denominator** plausibly differs + +Some properties don't change _behavior_ but change _who gets exposed_. Slicing on these helps catch changed-denominator artifacts before they're called a win. + +- Triggered vs untriggered cohorts (if the treatment only fires on certain pages). +- Platform / app version (the treatment may only ship on a subset of clients). +- Device class (mobile vs desktop) when the change is platform-specific. + +A 1000% lift in `Checkout Screen Viewed` overall usually disappears once you condition on "users who reached the checkout funnel" — that disappearance is the finding. + +### 4. Segments where SRM or baseline shift is suspected + +If overall SRM is borderline (or failing in one variant only), per-segment SRM can localize the bucketing bug to a specific platform / country / cohort. Examples: + +- iOS vs Android (often the SDK bucketing layer differs). +- Bot-suspicious countries (`bot_traffic` cause from health-check). +- A specific app version range that shipped a flag-evaluation change. + +This is diagnostic segmentation, not interpretation segmentation. Use it when the **trustworthiness gate** has already flagged trouble. + +### 5. Segments the platform de facto requires + +Some user dimensions are so foundational that any results report should mention them once: + +- **Platform** — web vs iOS vs Android. +- **New vs returning** — defined as first session within the experiment window vs before. +- **Geo region** — EU vs US vs APAC, when results meaningfully differ by regulatory or payment context. + +Don't include all three blindly — pick the one(s) most likely to vary given the change. + +--- + +## Sanity checks before committing to a slice + +For each segment you want to break down on: + +1. **Does each segment value have enough exposed users per variant to clear the platform's overall sufficiency threshold?** Below that, the per-segment stats are unreliable. If not, suggest pooling small segments or extending the experiment. +2. **Is the segmenting property captured for both control and treatment users?** (It almost always is, but verify.) A property only set when the treatment fires is not a valid segmenting axis. +3. **Is the segment defined the same way in pre- and during-experiment data?** Drifting definitions (e.g. "Pro tier" boundaries changed mid-test) invalidate the comparison. +4. **Is the segment determined _before_ exposure?** Segments derived from in-experiment behavior are post-treatment effects, not user attributes — slicing on them is selection-bias, not stratification. + +--- + +## How many slices to commit to + +| Situation | Number of slices | +| ----------------------------------------------------------------- | ------------------------------- | +| Hypothesis-driven, well-powered, decisional | 3–5 segments, named upfront | +| Exploratory ("anything weird?"), flagged as hypothesis-generating | Up to ~10, with explicit caveat | +| Diagnostic (chasing a failing SRM or strange overall result) | Whatever helps localize the bug | + +If the user wants to "just look at everything", push back: pick the top 3–5 with reasoning, then offer a separate exploratory pass that won't be used for the ship decision. + +--- + +## The pre-commit ritual + +Before running the breakdowns, tell the user something like: + +> _"Based on the hypothesis (``), I'd slice by `` and `` because ``. I'm intentionally not slicing `` because they don't connect to the proposed mechanism — looking at every dimension makes false positives almost guaranteed. We can do an exploratory pass after, separately from the ship decision. Sound right?"_ + +Pre-commitment is what separates "segmentation analysis" from "fishing." + +--- + +## Then read the results + +Once the segment breakdown is in hand, switch to [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/session-replay-analysis.md new file mode 100644 index 0000000..7282bb4 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/session-replay-analysis.md @@ -0,0 +1,109 @@ +# Session-Replay Analysis Guidance + +Turn a quantitative experiment result into a behavior story using session replays. + +> **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. + +--- + +## When replays help, when they don't + +| Question | Replays help? | +| ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | +| "Why is conversion lower in treatment?" | Yes — behavior diff is observable. | +| "Why is `Checkout Screen Viewed` 10× higher in treatment?" (changed-denominator suspect) | Yes — replays show whether users are _bouncing_ or _converting_ after they get there. | +| "Why is `time_on_page` higher in treatment?" | Yes — distinguishes engaged reading vs confused dwell. | +| "Is the treatment shipping a regression on iOS only?" | Sometimes — better answered first by segment breakdown. | +| "Why is SRM failing?" | No — replays don't show bucketing. Go to health checks. | +| "What's the lift?" | No — replays are qualitative; they explain _why_, not what. | +| "Why hasn't this hit statsig yet?" | No — that's a sample/power question, not a behavior question. | + +A useful heuristic: replays answer _behavioral_ questions. If the question isn't behavioral, replays will burn time without adding signal. + +--- + +## Cohort selection: which replays to compare + +You're looking for **paired contrast**, not a random sample. Pick the cohort that maximizes signal for the specific question. + +| Question | Cohort A (replays to pull) | Cohort B (replays to pull) | +| -------------------------------------------------------------------- | ---------------------------------------------------------- | ----------------------------------------------------------- | +| Why is primary metric down in treatment? | Treatment users who **failed** the primary action | Control users who **succeeded** at the primary action | +| Why is a guardrail regression appearing? | Treatment users who **triggered** the guardrail negatively | Control users who did NOT trigger it | +| Why does treatment have a huge lift in `Screen Viewed` (denom shift) | Treatment users who reached the screen | Same users, looking at whether they completed the next step | +| Why is engagement higher / lower in a specific segment? | Treatment users in that segment | Control users in the same segment | +| What does the new UI look like in practice? | Any treatment users who saw the change | Any control users to confirm the baseline UI | + +**Aim for ~5 replays per cohort.** Fewer and you're anecdote-shopping; many more and you'll just confirm what the first 5 already showed. If the first 5 are inconclusive or contradictory, pull 5 more before changing tactics. + +Filter by recency — replays from the most recent days of the experiment best reflect steady-state behavior (avoid novelty / primacy noise). + +--- + +## What to actually watch for + +Go in with a hypothesis from the quantitative result. Don't watch replays blank-eyed; you'll see "users using the app" and learn nothing. + +### Friction / failure patterns + +- **Hesitation** — long pause before clicking a key element (often signals confusion). +- **Misclicks** — clicking non-interactive elements, or rage-clicking a button that didn't work. +- **Form abandonment** — typing into a field, then leaving without submitting. +- **Back-button bounce** — landing on the page, then immediately backing out. +- **Scroll-and-leave** — scrolling without engaging, then exiting. + +If treatment has more of these than control, you have a behavior explanation for a primary loss or guardrail regression. + +### Layout / discoverability issues + +- **CTA below the fold** — users never scrolling to where the new button is. +- **Element overlap on mobile** — the treatment looks fine in desktop testing but breaks on small screens. +- **Hidden state** — a tooltip / modal that fires once and is then gone, so the user never sees the key affordance. + +These usually explain segment heterogeneity (loss concentrated in mobile, or in a specific viewport size). + +### Changed-denominator behavior + +If you're investigating a Twyman's-Law-sized lift, look for: + +- **Users landing on the new screen and immediately leaving** — explains the inflated `Viewed` event without explaining real conversion. +- **Users completing the rest of the funnel at a much lower rate per-arrival** — explains why the headline metric grew but downstream metrics didn't follow. + +If treatment users _arrive_ at a screen more often but _complete_ at a lower per-arrival rate, the "lift" is a denominator artifact and the per-converter behavior is the real story. + +### Variant-specific UI issues + +- **Treatment showed the wrong copy / wrong asset** — surprisingly common; treatment shipped, but to a subset of routes only. +- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to [health-check-interpretation.md](health-check-interpretation.md). +- **Treatment fired twice / persisted state across sessions** — implementation regression. + +--- + +## How to frame the findings + +Replay analysis is qualitative. Be honest about that. + +- ✅ _"In 4 of 5 treatment replays, users hesitated >5 seconds at the new modal then closed it without acting. In 5 of 5 control replays, users clicked through within 2 seconds. This is consistent with the conversion drop in the experiment's results."_ +- ❌ _"Treatment is causing confusion."_ — too strong; n=5 is a hypothesis, not a verdict. + +Tie observations back to specific quantitative results from the experiment-details response. If the replay story contradicts the numbers, **trust the numbers first** and treat the replays as either a wrong cohort sample or a richer-than-expected behavior. + +--- + +## What NOT to do + +- ❌ Use replays to override a clear quantitative verdict. If primaries say "ship" and replays look ugly, the ugliness might be edge cases — confirm with segment analysis first. +- ❌ Cherry-pick a single dramatic replay. n=1 is anecdote. +- ❌ Replace segment analysis with replays. Replays explain _behavior_; segments explain _who_. Different questions. +- ❌ Pull replays from broad cohorts ("all treatment users") — the contrast pair is what reveals signal. +- ❌ Spend more time on replays than on the headline interpretation. The decision tree comes first; replays are the explanation step after it. + +--- + +## Output shape + +1. **The quantitative result the replays are explaining** — link back to the specific metric and verdict. +2. **Cohorts watched** — what filters were applied to A and B, how many replays in each. +3. **Patterns observed**, with counts (e.g. "4 of 5 treatment replays showed X; 0 of 5 control replays did"). +4. **The explanation hypothesis** — careful to frame as hypothesis ("consistent with"), not as proof. +5. **Recommended next action** — usually one of: ship anyway (regression edge-case), iterate (fix the friction), kill (treatment is materially worse), or run a follow-up A/B with the fix. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/sizing.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/sizing.md new file mode 100644 index 0000000..7c41c9a --- /dev/null +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/sizing.md @@ -0,0 +1,109 @@ +# Sizing the experiment + +You almost never know the right sample size by guessing. Pull the data first, then run the math. + +## The standard formula + +Required sample size per variant (two-sample, two-sided test at 95% confidence, 80% power): + +``` +n = 16 × σ² / d² +``` + +Where: + +- `σ²` = variance of the metric (depends on metric type — see below). +- `d` = MDE in the same units as the metric. + +The `16` is `(z_{α/2} + z_{β})² × 2` rounded to a workable constant — `(1.96 + 0.84)² × 2 = 15.68 ≈ 16`. Good enough for setup-phase reasoning; for ship-decision rigour use the precise formula in `references/statistical-model.md`. + +## Variance by metric type + +- **Bernoulli (conversion rate).** `σ² = p(1−p)` where `p` is the baseline conversion rate. Variance peaks at `p = 0.5` (variance 0.25) and shrinks toward 0 at `p = 0` or `p = 1`. Lifts are easier to detect on rates near 50%, harder near the extremes. +- **Poisson (event counts per user).** `σ² ≈ mean count per user`. High-count metrics need proportionally more sample. +- **Gaussian (revenue, time-on-page, etc.).** Compute `σ²` from historical data directly. Long-tailed distributions have high variance — Winsorization (`references/advanced-features.md`) cuts this. + +## Worked example + +Detecting a 5% **relative** lift on a 10% baseline conversion rate at 80% power, 95% confidence: + +``` +p = 0.10 +σ² = 0.10 × 0.90 = 0.09 +absolute MDE = 0.10 × 0.05 = 0.005 +n = 16 × 0.09 / 0.005² = 16 × 0.09 / 0.000025 = 57,600 per variant +``` + +That's ~57,600 per variant for a 5% relative lift — humbling, and surprising to most teams. Most "we'll just run it for two weeks" plans don't survive contact with this number. + +## Kohavi's inverted formula + +For most online experiments, traffic is the constraint, not patience. Pick a duration (2–4 weeks captures weekly cycles), use all available traffic in that window, then compute the **achievable MDE**: + +``` +MDE = 4σ / √n +``` + +This tells the user: "given your traffic, the smallest effect you can reliably detect is X." If that achievable MDE is larger than the lift the user actually expects, the experiment is **underpowered**. Flag immediately. + +Underpowered experiments suffer from **winner's curse**: if you do reach significance, the lift estimate is exaggerated, because only the high-variance positive realisations crossed the threshold. The post-launch result then fails to replicate, and the team learns "experiments are unreliable" rather than "this experiment was underpowered." + +## Estimating the inputs from real data + +For each primary metric, before sizing, you need three numbers: + +1. **Baseline rate** — query the metric over the prior 2–4 weeks (the longer of: one full business cycle, or four weeks). Record `mean` and `variance`. Use the same event definition, segment filters, and unit-of-analysis you'll use in the experiment — a baseline computed differently from how the metric is configured in the experiment is worse than no baseline at all. +2. **Daily traffic** — query the exposure event (or whatever event qualifies users for the experiment) over the same window, grouped by day. Average to get expected exposures per day per variant. +3. **MDE the user wants** — ask explicitly. _"What's the smallest lift that would be worth shipping?"_ If they don't know, propose a 5–10% relative lift and confirm. + +From those three: + +``` +required_sample_per_variant = 16 × σ² / (baseline × MDE_relative)² +required_days = required_sample_per_variant × n_variants / daily_traffic_per_variant +``` + +If `required_days > 28` (four weeks), the experiment is **underpowered for the requested MDE on available traffic**. Tell the user. Don't wave it through. + +## Five remediations when the experiment is underpowered + +Offer these in order of cost — cheap first. + +1. **Accept a larger MDE.** Only commit to ship if the effect is bigger. This costs nothing but redraws the success criterion; confirm the user is OK with shipping only on a larger lift. +2. **Increase traffic allocation to the experiment.** If other tests don't need the traffic, give this one more. +3. **Use CUPED to reduce variance** (if pre-exposure data is available). 30–70% variance reduction translates directly into 30–70% smaller required sample. See `references/advanced-features.md`. +4. **Pick a higher-volume primary metric** (if the hypothesis allows). Often there's a leading proxy with more volume than the lagging metric the team originally chose. +5. **Don't run the experiment.** Invest the engineering elsewhere. Sometimes the right answer. + +## Sample-size floor + +Independent of the math: keep per-variant sample size above the platform's reliability floor (verify in product — historically ~350–400). Below this, the statistical machinery itself becomes unreliable — CLT breaks down, the SRM check gets noisy. The platform's default per-variant target is fine for most tests; ~1,000 is the practical floor; the platform floor is the absolute floor. + +If the math says `n = 50` per variant, the test is either trivially easy (the lift is huge) or the variance estimate is wrong. Sanity-check before launching at the floor. + +## Lookup table (Bernoulli, 95% conf, 80% power) + +For a Bernoulli (conversion-rate) primary metric at 95% confidence, 80% power, two-sided test, MDE expressed as a **relative** lift on the baseline: + +| Baseline rate | MDE = 5% relative | MDE = 10% relative | MDE = 20% relative | +| ------------- | ----------------- | ------------------ | ------------------ | +| 1% | ~633k / variant | ~158k / variant | ~40k / variant | +| 5% | ~122k / variant | ~31k / variant | ~7.6k / variant | +| 10% | ~58k / variant | ~14k / variant | ~3.6k / variant | +| 25% | ~19k / variant | ~4.8k / variant | ~1.2k / variant | +| 50% | ~6.4k / variant | ~1.6k / variant | ~400 / variant | + +Use this for quick sanity-checking. Always confirm with a query against actual baseline data — these are illustrative. + +## Sample-size growth with variants + +For a multi-arm test (N non-control variants), the per-variant target grows with the number of pairwise comparisons being made (each treatment vs control). With multiple-testing correction enabled (which is the right default at 2+ variants), the per-test α tightens, which inflates required sample size further. + +Rule of thumb: a 3-variant test (control + 2 treatments) needs about 1.3× the per-arm sample of a 2-variant test for the same MDE; a 4-variant test needs about 1.5×. Exact multipliers depend on the correction method — see `references/advanced-features.md`. + +## Duration considerations + +- **Minimum 1 week** — anything shorter misses weekly seasonality and conflates the day-of-week mix between control and treatment if traffic differs across days. +- **Minimum 3 days for read-out** — even with sequential testing and big effects, results under 3 days are typically un-interpretable (cohort hasn't stabilised, day-of-week effects dominate, novelty effect not separated from treatment effect). +- **Multiples of the seasonal cycle.** If the primary metric has strong weekly seasonality, set `endCondition: "days"` and choose 7, 14, 21, or 28 days so each variant sees the same mix of high- and low-traffic periods. +- **Cap at ~6 weeks** for most tests — beyond this, novelty effects wear off, the user population drifts, and other experiments running in the same window create cross-test contamination. If the math says you need 8+ weeks, you're underpowered — pick a remediation from the list above. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/statistical-model.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/statistical-model.md new file mode 100644 index 0000000..771a208 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/statistical-model.md @@ -0,0 +1,102 @@ +# Statistical model + +Once required sample size and acceptable duration are known, two configuration choices are left: the **testing model** (sequential vs frequentist) and the **end condition** (sample-based vs date-based). Two adjacent choices change how the tests are interpreted: **confidence level** and **multiple-testing correction**. + +## Testing model: sequential vs frequentist + +**Default to sequential** for most users. Peeking is the most common Mixpanel customer mistake, and sequential testing makes early-look safe by design. + +### Pick sequential when + +- The user expects a **large lift** and wants to confirm or reject the hypothesis quickly. Sequential lets you stop the moment significance is reached — often days or weeks before a frequentist target. +- The user wants to check results before the experiment ends and act on them (early-stop on a clear winner). +- The expected effect size is uncertain (could be huge, could be tiny). Sequential adapts; frequentist needs you to commit to one MDE up front. +- The team will look at intermediate results regardless. Sequential prevents peeking from inflating false positives. +- The user is comfortable with slightly more complex stopping rules ("stop when the test-statistic crosses the boundary," not "stop when n reaches N"). + +### Pick frequentist when + +- The user is hunting for a **very small lift** (e.g. 1–2% relative on a high-volume metric). Frequentist's fixed-sample design is statistically more efficient at the margin and avoids the early-stop boundary inflation that costs power on tiny effects. +- The team is comfortable waiting for the full sample before checking results — no peeking. +- The team prefers wider industry familiarity ("we used a t-test"). +- The user wants the simplest reportable statistics (a single p-value and confidence interval at the end). +- The team has internal training / tooling that assumes frequentist. + +### The "I want to peek with frequentist" trap + +The most common request is "I want frequentist, but I also want to look at the results during the test." This inflates the false-positive rate enormously — naive peeking on a frequentist test at 5 evenly-spaced check-ins pushes the family-wise α from 5% to ~14%. + +Switch them to sequential. Sequential's whole point is making peeking safe. + +If the user insists on frequentist + peeking (some teams do, for tooling reasons), document the decision in the experiment's description so the interpretation step later knows the reported p-values overstate confidence. + +## End condition: sample-based vs date-based + +### Pick sample-based when + +- The team has a target MDE and wants the experiment to stop the moment the required sample is reached. Adaptive duration. +- Daily traffic is highly variable. Sample-size-based ends absorb the variability; date-based ends don't. +- There's no strong seasonality in the primary metric that would bias a mid-cycle stop. + +### Pick date-based when + +- The primary metric has **strong weekly (or other periodic) seasonality**. Pin the duration to a multiple of the seasonal cycle so each variant sees the same mix of high- and low-traffic periods. + - A common pattern: customers with strong weekday/weekend behaviour shifts run all experiments in 1-week increments (or 2 weeks for a stricter check) to fully capture each cycle. + - A sample-based end can fire mid-cycle and produce biased results in this case. +- The team has a fixed business window (e.g. "we want to ship by end of quarter"). +- The team has historically struggled with experiments running indefinitely. +- The hypothesis specifically requires a calendar window (e.g. a holiday-season test). + +### Combinations + +All four combinations are valid. The one customers most often miss is **frequentist + date-based** — some teams prefer time-based experiments for operational reasons even when running frequentist tests. Don't flag this as a misconfiguration. + +The one that's actually wrong is **frequentist + sample-based + peeking** — that's the "peeking trap" above. Surface it; switch them to sequential. + +## Confidence level + +Default 0.95 (α = 0.05; verify in product). Change only with intent. + +- **0.99** — for high-stakes irreversible ships (e.g. billing changes, deletion-flow changes, anything regulatory). Higher false-negative cost; accept it. Document the reason in the experiment's description. +- **0.90** — for low-stakes exploratory tests where speed matters more than rigour. Acknowledge the inflated false-positive rate to the user explicitly: at α = 0.10, one in ten "wins" is noise. + +Any change away from the default belongs in the description. The post-launch interpretation step uses this setting to read the result correctly; without it, a "win" at 0.90 looks the same as a "win" at 0.95. + +## Multiple testing correction + +Enable when there are ≥2 primary metrics OR ≥2 non-control variants. Without correction, the family-wise false-positive rate compounds: + +| Primaries | Non-control variants | Family-wise FPR at per-test α = 0.05 | +| --------: | -------------------: | -----------------------------------: | +| 1 | 1 | 5.0% | +| 2 | 1 | ~9.75% | +| 3 | 1 | ~14.3% | +| 5 | 1 | ~22.6% | +| 5 | 2 | ~40.1% | +| 5 | 3 | ~53.7% | + +Derived from the standard `1 − (1 − α)^k` compounding for `k = primaries × non-control variants` independent tests at per-test α = 0.05. + +The takeaway: by the time you're testing 5 primaries on a 3-arm experiment, more than half of the "wins" are noise. + +Two methods are available: + +- **Bonferroni** — divides α by the number of tests (primaries × non-control variants). Simple and conservative. Guarantees the family-wise error rate stays below α, but can be overly strict when many primary metrics are correlated, hurting power. +- **Benjamini-Hochberg** — controls the **false discovery rate** (FDR) instead of the family-wise error rate. Ranks all primary-metric p-values and applies progressively looser thresholds. More powerful than Bonferroni when there are many primary metrics, especially when some have real effects. Preferred when the user has 3+ primaries or correlated metrics. + +**Default to Benjamini-Hochberg** for most experiments — less conservative, better suited to typical designs with correlated metrics. Use Bonferroni when: + +- The user needs strict family-wise error control (regulatory, high-stakes decisions where any single false positive is unacceptable). +- The primary metrics are independent (no shared drivers / overlapping populations), in which case Bonferroni's conservatism is not a real cost. +- The team explicitly asks for the simplest method to defend in a review. + +Turn correction off **only** when there's a single primary and a single non-control variant. + +## Power vs significance trade-off + +When the user pushes you on the confidence level: + +- Raising α from 0.05 to 0.10 increases power (smaller required sample for the same MDE) but doubles the rate of false-positive "wins." +- Lowering α from 0.05 to 0.01 cuts the false-positive rate fivefold but requires roughly 1.5× the sample for the same MDE. + +If the user wants more power without raising α, the right move is **smaller MDE → bigger required sample**, not loosening significance. If sample is the binding constraint, reach for CUPED (`references/advanced-features.md`) or a higher-volume proxy metric. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/why-no-statsig.md new file mode 100644 index 0000000..22ca2fe --- /dev/null +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/why-no-statsig.md @@ -0,0 +1,115 @@ +# Why Hasn't This Reached Statistical Significance Yet? + +Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts. + +The actual stop / extend math (sample size, power, MDE) lives in [sizing.md](sizing.md) — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one. + +--- + +## First, rule out a broken result + +Inconclusive can mean two very different things: + +1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about. +2. **The result isn't trustworthy at all** — SRM failing, broken data, peeked frequentist, etc. — and "inconclusive" is the wrong frame entirely. + +Before answering "why no statsig?", run the **trustworthiness gate**. If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power. + +Also check: + +- The primary's lift is missing or null → no measurement, not "no effect." +- The primary is listed on the experiment but has no computed result (live or cached) → "no measurement," not "no effect." +- The live results carry an error block → results are stale or partial; resolve the backend issue before drawing power conclusions. + +--- + +## The five real reasons an experiment hasn't hit statsig + +Walk through these in order. The first one that explains the picture is usually right. + +### 1. Not enough sample yet (not enough exposures) + +**What to check**: per-variant exposure counts against the configured end target (sample size or duration, whichever the experiment was configured with), and which testing model the experiment is using. + +- **Sequential** + target not reached → genuinely too early. Recommend **WAIT**. +- **Frequentist** + target not reached → also too early; do NOT peek-and-call. Recommend **WAIT** to the configured end, or restart as sequential next time so peeking is safe. +- Target _was_ reached and still no significance → not a sample-size problem; move to reasons 2–5. + +If exposures are falling short of plan because traffic dropped: surface that. Querying the exposure event with a date breakdown shows whether something changed mid-experiment. + +### 2. Observed effect is smaller than the MDE + +**What to check**: the lift on the primary metric, plus the MDE the user planned for (typically captured in the experiment's hypothesis/description, or recovered via the setup-side skill's power math). + +- Observed lift ≈ planned MDE → experiment is correctly sized for the effect; if not significant yet, see reason 1. +- Observed lift **much smaller** than planned MDE → the effect (if any) is below what this experiment was sized to detect. Two real options: + - **Accept the null** — at this size, the change isn't moving the metric. Document and move on. + - **Resize and rerun** — if a smaller effect would still be ship-worthy, re-run with a larger sample (lower MDE). +- Observed lift much **larger** than planned MDE but still not significant → unusual; likely high variance (see reason 3) or insufficient exposures (reason 1). + +### 3. Variance is too high (metric is too noisy) + +**What to check**: the metric's distribution type, plus whether CUPED and Winsorization are enabled. + +- **Gaussian** metric (revenue, time-on-page) with no Winsorization → whales inflate variance, widen CIs, and crush power. Recommend enabling Winsorization on the next run. +- **Poisson** metric (event counts per user) → one heavy user can swing results. Same Winsorization recommendation; also consider switching to a rate metric if the hypothesis is about behavior, not volume. +- **Bernoulli** metric near 0% or 100% → variance shrinks at the extremes, but so does the absolute scale of detectable effects. Lifts near 50% rates are easiest; lifts near 0%/100% need much more sample. +- **CUPED not enabled** AND the metric correlates with pre-exposure behavior AND users existed before the experiment → enabling CUPED on a re-run typically cuts required sample 30–70%. +- **CUPED enabled on a new-user-only cohort** → CUPED has no effect (no pre-exposure data exists). Not a misconfiguration to "fix," but variance reduction simply didn't happen. + +### 4. Traffic split is starving the variant + +**What to check**: the configured traffic split against the actual per-variant exposure counts. + +- Even split (50/50) when one variant is the bottleneck → balanced is optimal for power, so this is usually not the issue. +- Skewed split (e.g. 90/10) → the smaller variant is undersampled; power is bottlenecked by the small side. If the skew was for risk reasons, that's a deliberate trade-off; flag that the smaller variant will reach significance much later. +- Multi-variant test (3+ arms) → each treatment-vs-control comparison gets a fraction of total traffic. Each non-control variant needs to clear the platform's per-variant exposure floor in its own right. Adding arms costs power per-comparison. + +Never change traffic allocation mid-Frequentist test — it invalidates the SRM baseline and the power calculation. If allocation needs to change, restart the experiment. + +### 5. Exposure config is filtering more users than the user expects + +**What to check**: exposure event volume, any audience filters on the backing feature flag, and whether QA traffic is being excluded. + +- A property filter or audience filter on the feature flag is excluding most users → exposures lag the user's mental "available traffic." Inspect the flag's rollout rules; query the exposure event to confirm how many users actually got exposed. +- The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event. +- QA traffic isn't being excluded and you suspect internal traffic is dominating one variant → enable the QA exclusion on the next run (results then are cleaner but also smaller). + +**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md). + +--- + +## Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL? + +Once you know which reason fits, the recommendation almost picks itself. + +| Reason | Recommendation | +| -------------------------------------- | ------------------------------------------------------------------------------------------------------------ | +| Not enough sample yet, still ACTIVE | **WAIT.** Show projected end date based on observed traffic. | +| Not enough sample yet, concluded early | **EXTEND** (Frequentist: relaunch with longer planned duration; Sequential: resume if possible). | +| Effect << MDE | **ACCEPT NULL** if the planned MDE is the smallest ship-worthy effect; otherwise **BOOST POWER** and re-run. | +| Variance too high | **BOOST POWER**: enable CUPED, enable Winsorization, switch to a less noisy metric proxy. | +| Variant starved by traffic split | **EXTEND** (if remaining time is enough) or restart with rebalanced split. | +| Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. | +| Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. | + +When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or use the power math in [sizing.md](sizing.md). + +--- + +## What NOT to suggest + +- ❌ **Stop early on a favorable peek** in a Frequentist test — that's exactly the false-positive inflation problem. +- ❌ **Switch testing model mid-experiment** — restart, don't morph. +- ❌ **Add more primary metrics** to "fish" for a win — multiplies the family-wise FPR. If a single primary is inconclusive, more primaries make the picture worse, not better. +- ❌ **Re-run identical hypothesis on the same audience right after concluding "no effect"** — without a power change, you'll get the same answer. +- ❌ **Claim "no effect"** from an underpowered inconclusive result — the right framing is "the experiment wasn't sized to detect the effect we observed." + +--- + +## Output shape + +1. **The reason** (one of the five above), in one sentence. +2. **The evidence** — concrete numbers from the experiment (e.g. "exposures only at 4.2k of the 10k target," "observed lift 0.8% vs planned MDE 5%"). +3. **Recommendation** from the table above, with the specific experiment update or follow-up action. +4. **What to NOT do**, briefly — the wrong-way temptation specific to this experiment. From 70991ec649dfb6e56acf8058678970e675245cfc Mon Sep 17 00:00:00 2001 From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Date: Wed, 10 Jun 2026 00:39:19 +0000 Subject: [PATCH 02/11] Address review minors: add reference index, clarify routing fall-through, dedupe match logic MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Applies the 3 issues surfaced by the /review-skill pass on manage-experiment: - D4 (Centralised logic): the umbrella SKILL.md had no map of what references/ contains. Add a "Reference files" table to SKILL.md listing each file, which command uses it, and its purpose. Mirrors the shape used by manage-feature-flags. - D7 (Qualified recommendations): the phase-derived rule in step 3 ("DRAFT → design, ACTIVE/CONCLUDED → interpret") read as if it always applied, but it only does when an experiment is in context. Restate step 3 as ordered rules with explicit fall-through, and add a row for ambiguous verbs like "audit" / "check" / "review". - D6 (Human-friendly identifiers): the "try ID first, then case-insensitive name match" rule was stated in three places (umbrella step 2, design step 8, interpret step 1). Drop the duplicates from the command files — interpret hands back to the umbrella, design retains only the "ask if not named" guidance (the lookup mechanics don't apply to free-text feature names). Synced to mixpanel-mcp-eu and mixpanel-mcp-in. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../skills/manage-experiment/SKILL.md | 33 ++++++++++++++++--- .../manage-experiment/commands/design.md | 2 +- .../manage-experiment/commands/interpret.md | 2 +- .../skills/manage-experiment/SKILL.md | 33 ++++++++++++++++--- .../manage-experiment/commands/design.md | 2 +- .../manage-experiment/commands/interpret.md | 2 +- .../skills/manage-experiment/SKILL.md | 33 ++++++++++++++++--- .../manage-experiment/commands/design.md | 2 +- .../manage-experiment/commands/interpret.md | 2 +- 9 files changed, 93 insertions(+), 18 deletions(-) diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md index f2e1be1..0b8044c 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md @@ -72,6 +72,28 @@ Terms both commands use without redefining. Phase-specific terms (hypothesis, po - **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95 (verify in product). Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. - **Multiple-testing correction.** Adjusts per-test significance threshold when several primaries or several non-control variants are tested together. Default Benjamini-Hochberg; Bonferroni for strict family-wise control. +## Reference files + +Each command file links into these on demand. The map is here so the skill has a single index of what `references/` contains. + +| File | Used by | Purpose | +| ------------------------------------------------------------------------------------------------ | ------------------ | ----------------------------------------------------------------------------------------------------------------------------- | +| [references/routing-xp-vs-ff.md](references/routing-xp-vs-ff.md) | design | Experiment vs Feature Flag disambiguation — when each is the right tool and the hand-off rules. | +| [references/hypothesis-framing.md](references/hypothesis-framing.md) | design | The four properties of a good hypothesis, rubric, common misalignment patterns, worked good/bad examples. | +| [references/metric-selection.md](references/metric-selection.md) | design | Picking primaries, guardrails, and secondaries. Guardrails-by-domain table. Lagging-indicator and changed-denominator traps. | +| [references/sizing.md](references/sizing.md) | design + interpret | Sample-size and MDE formulas, Kohavi's inversion, baseline-by-rate lookup, the five remediations for underpowered tests. | +| [references/statistical-model.md](references/statistical-model.md) | design | Sequential vs frequentist, end-condition choice, confidence level, multiple-testing correction. Peeking-trap math. | +| [references/advanced-features.md](references/advanced-features.md) | design | When CUPED and Winsorization help, when each is wrong, and the common misconfigurations. | +| [references/prior-experiments.md](references/prior-experiments.md) | design | How to look up and fold-in prior experiments on the same feature. | +| [references/pitfalls.md](references/pitfalls.md) | design | The pre-launch pitfall catalogue: blockers (stop launch), warnings (explain trade-off), fyi. | +| [references/health-check-interpretation.md](references/health-check-interpretation.md) | interpret | Reading SRM, Retro A/A, exposures-sufficient, and misconfiguration verdicts. The trustworthiness gate's remediation playbook. | +| [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | interpret | Translating a single metric's lift / CI / p-value into a plain-language verdict, with the Twyman's Law guard. | +| [references/why-no-statsig.md](references/why-no-statsig.md) | interpret | Wait / extend / boost power / narrow / accept-null decision tree when nothing's significant. | +| [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | interpret | How to pick the 3–5 segments worth breaking results down on, before slicing every dimension. | +| [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | interpret | Reading per-segment results: heterogeneity vs Simpson's paradox vs noise; the "ship to segment X" requirements. | +| [references/session-replay-analysis.md](references/session-replay-analysis.md) | interpret | Turning a quantitative experiment result into a behavior story using session replays. | +| [references/lifecycle-handoff.md](references/lifecycle-handoff.md) | interpret | The decide-action call shape, multi-variant ship semantics, special variant constants. | + ## Behaviour rules 1. **Irreversible actions require explicit confirmation.** Creating an experiment (in `design`) and concluding one (in `interpret`) are both irreversible. Show the proposed action, wait for the user to confirm. @@ -104,10 +126,13 @@ If the user is starting a new experiment from scratch (no existing experiment to ## 3. Pick the command -- **Explicit:** user names a phase (`/design`, "set up a new experiment", "interpret experiment X") → use that command. -- **Implicit:** message matches one canonical trigger phrase from the Components table → use that command. -- **Phase-derived:** an experiment exists in context and its state determines the command — `DRAFT` → `design`; `ACTIVE` or `CONCLUDED` → `interpret`. -- **Ambiguous or none:** show the Command menu, take the user's choice. +Apply these rules in order; the first match wins. + +1. **Explicit:** user names a phase (`/design`, "set up a new experiment", "interpret experiment X") → use that command. +2. **Implicit:** message matches one canonical trigger phrase from the Components table → use that command. +3. **Phase-derived (only when an experiment was resolved in step 2):** `DRAFT` → `design`; `ACTIVE` or `CONCLUDED` → `interpret`. +4. **Ambiguous verbs** ("audit", "check", "review") — apply phase-derived routing if an experiment is in context, otherwise treat as ambiguous and fall through to the menu. +5. **Otherwise:** show the Command menu, take the user's choice. ## 4. Load and execute the command diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md index b5f7046..5c70403 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md @@ -130,7 +130,7 @@ Use the exact catalogue labels from [../references/pitfalls.md](../references/pi After creating, link the new experiment back to any prior experiment surfaced in step 1 — record the prior's ID, hypothesis, and outcome in the new experiment's description. That 30-second annotation pays back tenfold at interpretation time. -If the user hasn't named a specific feature or surface, ask before fetching baselines or designing — designing the wrong experiment burns more time than the clarifying question costs. Accept the feature by name or by ID; try ID match first, then case-insensitive name match. +If the user hasn't named a specific feature or surface, ask before fetching baselines or designing — designing the wrong experiment burns more time than the clarifying question costs. --- diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/interpret.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/interpret.md index 77d78ff..5bbfebb 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/interpret.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/interpret.md @@ -57,7 +57,7 @@ Top-down: what to do, in order. ### 1. Fetch the experiment -The umbrella's step 2 should have resolved the experiment already. If not (the user named it mid-command), accept by name or ID; try ID match first, then case-insensitive name match. +The umbrella resolves the experiment in its step 2. If the user named one mid-command, hand back to the umbrella's experiment-resolution step rather than restating it here. Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md index f2e1be1..0b8044c 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md @@ -72,6 +72,28 @@ Terms both commands use without redefining. Phase-specific terms (hypothesis, po - **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95 (verify in product). Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. - **Multiple-testing correction.** Adjusts per-test significance threshold when several primaries or several non-control variants are tested together. Default Benjamini-Hochberg; Bonferroni for strict family-wise control. +## Reference files + +Each command file links into these on demand. The map is here so the skill has a single index of what `references/` contains. + +| File | Used by | Purpose | +| ------------------------------------------------------------------------------------------------ | ------------------ | ----------------------------------------------------------------------------------------------------------------------------- | +| [references/routing-xp-vs-ff.md](references/routing-xp-vs-ff.md) | design | Experiment vs Feature Flag disambiguation — when each is the right tool and the hand-off rules. | +| [references/hypothesis-framing.md](references/hypothesis-framing.md) | design | The four properties of a good hypothesis, rubric, common misalignment patterns, worked good/bad examples. | +| [references/metric-selection.md](references/metric-selection.md) | design | Picking primaries, guardrails, and secondaries. Guardrails-by-domain table. Lagging-indicator and changed-denominator traps. | +| [references/sizing.md](references/sizing.md) | design + interpret | Sample-size and MDE formulas, Kohavi's inversion, baseline-by-rate lookup, the five remediations for underpowered tests. | +| [references/statistical-model.md](references/statistical-model.md) | design | Sequential vs frequentist, end-condition choice, confidence level, multiple-testing correction. Peeking-trap math. | +| [references/advanced-features.md](references/advanced-features.md) | design | When CUPED and Winsorization help, when each is wrong, and the common misconfigurations. | +| [references/prior-experiments.md](references/prior-experiments.md) | design | How to look up and fold-in prior experiments on the same feature. | +| [references/pitfalls.md](references/pitfalls.md) | design | The pre-launch pitfall catalogue: blockers (stop launch), warnings (explain trade-off), fyi. | +| [references/health-check-interpretation.md](references/health-check-interpretation.md) | interpret | Reading SRM, Retro A/A, exposures-sufficient, and misconfiguration verdicts. The trustworthiness gate's remediation playbook. | +| [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | interpret | Translating a single metric's lift / CI / p-value into a plain-language verdict, with the Twyman's Law guard. | +| [references/why-no-statsig.md](references/why-no-statsig.md) | interpret | Wait / extend / boost power / narrow / accept-null decision tree when nothing's significant. | +| [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | interpret | How to pick the 3–5 segments worth breaking results down on, before slicing every dimension. | +| [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | interpret | Reading per-segment results: heterogeneity vs Simpson's paradox vs noise; the "ship to segment X" requirements. | +| [references/session-replay-analysis.md](references/session-replay-analysis.md) | interpret | Turning a quantitative experiment result into a behavior story using session replays. | +| [references/lifecycle-handoff.md](references/lifecycle-handoff.md) | interpret | The decide-action call shape, multi-variant ship semantics, special variant constants. | + ## Behaviour rules 1. **Irreversible actions require explicit confirmation.** Creating an experiment (in `design`) and concluding one (in `interpret`) are both irreversible. Show the proposed action, wait for the user to confirm. @@ -104,10 +126,13 @@ If the user is starting a new experiment from scratch (no existing experiment to ## 3. Pick the command -- **Explicit:** user names a phase (`/design`, "set up a new experiment", "interpret experiment X") → use that command. -- **Implicit:** message matches one canonical trigger phrase from the Components table → use that command. -- **Phase-derived:** an experiment exists in context and its state determines the command — `DRAFT` → `design`; `ACTIVE` or `CONCLUDED` → `interpret`. -- **Ambiguous or none:** show the Command menu, take the user's choice. +Apply these rules in order; the first match wins. + +1. **Explicit:** user names a phase (`/design`, "set up a new experiment", "interpret experiment X") → use that command. +2. **Implicit:** message matches one canonical trigger phrase from the Components table → use that command. +3. **Phase-derived (only when an experiment was resolved in step 2):** `DRAFT` → `design`; `ACTIVE` or `CONCLUDED` → `interpret`. +4. **Ambiguous verbs** ("audit", "check", "review") — apply phase-derived routing if an experiment is in context, otherwise treat as ambiguous and fall through to the menu. +5. **Otherwise:** show the Command menu, take the user's choice. ## 4. Load and execute the command diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md index b5f7046..5c70403 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md @@ -130,7 +130,7 @@ Use the exact catalogue labels from [../references/pitfalls.md](../references/pi After creating, link the new experiment back to any prior experiment surfaced in step 1 — record the prior's ID, hypothesis, and outcome in the new experiment's description. That 30-second annotation pays back tenfold at interpretation time. -If the user hasn't named a specific feature or surface, ask before fetching baselines or designing — designing the wrong experiment burns more time than the clarifying question costs. Accept the feature by name or by ID; try ID match first, then case-insensitive name match. +If the user hasn't named a specific feature or surface, ask before fetching baselines or designing — designing the wrong experiment burns more time than the clarifying question costs. --- diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/interpret.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/interpret.md index 77d78ff..5bbfebb 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/interpret.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/interpret.md @@ -57,7 +57,7 @@ Top-down: what to do, in order. ### 1. Fetch the experiment -The umbrella's step 2 should have resolved the experiment already. If not (the user named it mid-command), accept by name or ID; try ID match first, then case-insensitive name match. +The umbrella resolves the experiment in its step 2. If the user named one mid-command, hand back to the umbrella's experiment-resolution step rather than restating it here. Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md index f2e1be1..0b8044c 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md @@ -72,6 +72,28 @@ Terms both commands use without redefining. Phase-specific terms (hypothesis, po - **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95 (verify in product). Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. - **Multiple-testing correction.** Adjusts per-test significance threshold when several primaries or several non-control variants are tested together. Default Benjamini-Hochberg; Bonferroni for strict family-wise control. +## Reference files + +Each command file links into these on demand. The map is here so the skill has a single index of what `references/` contains. + +| File | Used by | Purpose | +| ------------------------------------------------------------------------------------------------ | ------------------ | ----------------------------------------------------------------------------------------------------------------------------- | +| [references/routing-xp-vs-ff.md](references/routing-xp-vs-ff.md) | design | Experiment vs Feature Flag disambiguation — when each is the right tool and the hand-off rules. | +| [references/hypothesis-framing.md](references/hypothesis-framing.md) | design | The four properties of a good hypothesis, rubric, common misalignment patterns, worked good/bad examples. | +| [references/metric-selection.md](references/metric-selection.md) | design | Picking primaries, guardrails, and secondaries. Guardrails-by-domain table. Lagging-indicator and changed-denominator traps. | +| [references/sizing.md](references/sizing.md) | design + interpret | Sample-size and MDE formulas, Kohavi's inversion, baseline-by-rate lookup, the five remediations for underpowered tests. | +| [references/statistical-model.md](references/statistical-model.md) | design | Sequential vs frequentist, end-condition choice, confidence level, multiple-testing correction. Peeking-trap math. | +| [references/advanced-features.md](references/advanced-features.md) | design | When CUPED and Winsorization help, when each is wrong, and the common misconfigurations. | +| [references/prior-experiments.md](references/prior-experiments.md) | design | How to look up and fold-in prior experiments on the same feature. | +| [references/pitfalls.md](references/pitfalls.md) | design | The pre-launch pitfall catalogue: blockers (stop launch), warnings (explain trade-off), fyi. | +| [references/health-check-interpretation.md](references/health-check-interpretation.md) | interpret | Reading SRM, Retro A/A, exposures-sufficient, and misconfiguration verdicts. The trustworthiness gate's remediation playbook. | +| [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | interpret | Translating a single metric's lift / CI / p-value into a plain-language verdict, with the Twyman's Law guard. | +| [references/why-no-statsig.md](references/why-no-statsig.md) | interpret | Wait / extend / boost power / narrow / accept-null decision tree when nothing's significant. | +| [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | interpret | How to pick the 3–5 segments worth breaking results down on, before slicing every dimension. | +| [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | interpret | Reading per-segment results: heterogeneity vs Simpson's paradox vs noise; the "ship to segment X" requirements. | +| [references/session-replay-analysis.md](references/session-replay-analysis.md) | interpret | Turning a quantitative experiment result into a behavior story using session replays. | +| [references/lifecycle-handoff.md](references/lifecycle-handoff.md) | interpret | The decide-action call shape, multi-variant ship semantics, special variant constants. | + ## Behaviour rules 1. **Irreversible actions require explicit confirmation.** Creating an experiment (in `design`) and concluding one (in `interpret`) are both irreversible. Show the proposed action, wait for the user to confirm. @@ -104,10 +126,13 @@ If the user is starting a new experiment from scratch (no existing experiment to ## 3. Pick the command -- **Explicit:** user names a phase (`/design`, "set up a new experiment", "interpret experiment X") → use that command. -- **Implicit:** message matches one canonical trigger phrase from the Components table → use that command. -- **Phase-derived:** an experiment exists in context and its state determines the command — `DRAFT` → `design`; `ACTIVE` or `CONCLUDED` → `interpret`. -- **Ambiguous or none:** show the Command menu, take the user's choice. +Apply these rules in order; the first match wins. + +1. **Explicit:** user names a phase (`/design`, "set up a new experiment", "interpret experiment X") → use that command. +2. **Implicit:** message matches one canonical trigger phrase from the Components table → use that command. +3. **Phase-derived (only when an experiment was resolved in step 2):** `DRAFT` → `design`; `ACTIVE` or `CONCLUDED` → `interpret`. +4. **Ambiguous verbs** ("audit", "check", "review") — apply phase-derived routing if an experiment is in context, otherwise treat as ambiguous and fall through to the menu. +5. **Otherwise:** show the Command menu, take the user's choice. ## 4. Load and execute the command diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md index b5f7046..5c70403 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md @@ -130,7 +130,7 @@ Use the exact catalogue labels from [../references/pitfalls.md](../references/pi After creating, link the new experiment back to any prior experiment surfaced in step 1 — record the prior's ID, hypothesis, and outcome in the new experiment's description. That 30-second annotation pays back tenfold at interpretation time. -If the user hasn't named a specific feature or surface, ask before fetching baselines or designing — designing the wrong experiment burns more time than the clarifying question costs. Accept the feature by name or by ID; try ID match first, then case-insensitive name match. +If the user hasn't named a specific feature or surface, ask before fetching baselines or designing — designing the wrong experiment burns more time than the clarifying question costs. --- diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/commands/interpret.md b/plugins/mixpanel-mcp/skills/manage-experiment/commands/interpret.md index 77d78ff..5bbfebb 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/commands/interpret.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/commands/interpret.md @@ -57,7 +57,7 @@ Top-down: what to do, in order. ### 1. Fetch the experiment -The umbrella's step 2 should have resolved the experiment already. If not (the user named it mid-command), accept by name or ID; try ID match first, then case-insensitive name match. +The umbrella resolves the experiment in its step 2. If the user named one mid-command, hand back to the umbrella's experiment-resolution step rather than restating it here. Request the experiment details with exposure and metric data included. The agent's tool layer maps that intent to the right parameters; don't hand-write API arguments. From b0c9b065b3d264ed66fc5ecfc08b8c83bd1d06f1 Mon Sep 17 00:00:00 2001 From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Date: Wed, 10 Jun 2026 01:10:28 +0000 Subject: [PATCH 03/11] Fix Winsorization percentile convention: agent passes tail-width (5), not cap-point (95) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit PR #30 review caught an agent-actionable accuracy bug. The skill docs described Winsorization as "default 95 (cap top/bottom 5%)" — natural phrasing for a human, but the actual `Update-Experiment` tool schema defines `WinsorizationSettingsInput.percentile` as the **tail width to cap on each side**, with default `5` and `exclusiveMaximum: 50`. An agent reading the docs would pass `95`, get rejected by validation, or misinterpret as inverted Winsorization. Updated 6 files to match the schema's convention: - SKILL.md shared glossary - commands/design.md step 6 (advanced features) - references/advanced-features.md "What it does", "How to enable", "Percentile guidance" (default flipped from "95th" to `percentile=5`, "push back below ~80" flipped to "push back above ~20") - references/health-check-interpretation.md "Extreme winsorization percentile" misconfiguration check - references/per-metric-interpretation.md "Variance-reduction & outlier settings that change interpretation" - references/pitfalls.md "High variance, no Winsorization" warning The numeric guidance ("push back if more than 20% of each tail is capped") is preserved; only the convention name flipped to match what the agent actually sends through the tool. Synced to mixpanel-mcp-eu and mixpanel-mcp-in. Co-Authored-By: Claude Opus 4.7 (1M context) --- plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md | 2 +- .../skills/manage-experiment/commands/design.md | 2 +- .../manage-experiment/references/advanced-features.md | 8 ++++---- .../references/health-check-interpretation.md | 2 +- .../references/per-metric-interpretation.md | 2 +- .../skills/manage-experiment/references/pitfalls.md | 2 +- plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md | 2 +- .../skills/manage-experiment/commands/design.md | 2 +- .../manage-experiment/references/advanced-features.md | 8 ++++---- .../references/health-check-interpretation.md | 2 +- .../references/per-metric-interpretation.md | 2 +- .../skills/manage-experiment/references/pitfalls.md | 2 +- plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md | 2 +- .../skills/manage-experiment/commands/design.md | 2 +- .../manage-experiment/references/advanced-features.md | 8 ++++---- .../references/health-check-interpretation.md | 2 +- .../references/per-metric-interpretation.md | 2 +- .../skills/manage-experiment/references/pitfalls.md | 2 +- 18 files changed, 27 insertions(+), 27 deletions(-) diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md index 0b8044c..aa1a854 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md @@ -69,7 +69,7 @@ Terms both commands use without redefining. Phase-specific terms (hypothesis, po - **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict. - **MDE (Minimum Detectable Effect).** The smallest lift the experiment is sized to detect. Set during design, enforced at interpretation. - **CUPED.** Variance-reduction technique using pre-exposure baseline. Cuts required sample 30–70% when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. -- **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95 (verify in product). Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. +- **Winsorization.** Outlier capping applied pooled across variants. The `percentile` field is the **tail width** to cap on each side (default `5` caps below the 5th and above the 95th — i.e. the 5% tails). The schema rejects `percentile` ≥ 50. Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. - **Multiple-testing correction.** Adjusts per-test significance threshold when several primaries or several non-control variants are tested together. Default Benjamini-Hochberg; Bonferroni for strict family-wise control. ## Reference files diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md index 5c70403..d295a61 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md @@ -93,7 +93,7 @@ Decision tree, the peeking-trap explanation, worked compounding-FPR numbers, and ### 6. Decide on advanced features - **CUPED** — enable when the primary metric correlates with pre-exposure behaviour AND all experiment users existed before start AND 2–4 weeks of stable pre-exposure history is available. Do not enable on new-user-only experiments, one-time-event metrics, or brand-new metrics. -- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. Push back if the configured percentile is below ~80. +- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. The `percentile` field is the tail width to cap (default `5` = 5% tails); push back if the user sets a percentile above ~20 — more than 20% of values capped on each side throws away too much signal. When/why each is right and the common misconfigurations are in [../references/advanced-features.md](../references/advanced-features.md). diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/advanced-features.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/advanced-features.md index 3f6f7dd..96fcd08 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/advanced-features.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/advanced-features.md @@ -36,9 +36,9 @@ Three optional features most experiments don't touch — and that, used in the r ## Winsorization — outlier handling -**What it does.** Caps extreme values at a percentile boundary (default 95th — verify in product). This squeezes the long tail of heavy-tailed distributions so a handful of outliers can't dominate the per-arm mean. +**What it does.** Caps extreme values at both tails of the distribution. The `percentile` field on the settings is the **tail width** to cap on each side: the default `5` caps below the 5th and above the 95th (i.e. the 5% tails). This squeezes the long tail of heavy-tailed distributions so a handful of outliers can't dominate the per-arm mean. -**How to enable.** Turn Winsorization on for the experiment and pick a percentile. +**How to enable.** Turn Winsorization on for the experiment and pick a `percentile`. The schema rejects `percentile` ≥ 50. ### When to enable @@ -54,9 +54,9 @@ Three optional features most experiments don't touch — and that, used in the r ### Percentile guidance -The platform default is typically 95 (cap top/bottom 5%) — verify in product. This is almost always right. Push back if the user sets a percentile below ~80 — that's more than 20% of values being capped, which throws away too much signal. Confirm intent before launching. +The default is `percentile=5` (cap each 5% tail). This is almost always right. Push back if the user sets a `percentile` above ~20 — that's more than 20% of values capped on each side, which throws away too much signal. Confirm intent before launching. -For very heavy tails (extreme whale distributions), 99th percentile is sometimes appropriate, but that's the corner case. The platform default is the default for a reason. +For very heavy tails (extreme whale distributions), `percentile=1` (cap each 1% tail) is sometimes appropriate, but that's the corner case. The default is the default for a reason. ### What changes downstream diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/health-check-interpretation.md index 1467468..82a09b7 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/health-check-interpretation.md @@ -142,7 +142,7 @@ These don't always invalidate results, but they change how to _read_ them. Surfa ### Extreme winsorization percentile -**Winsorization enabled with a percentile far from the platform default (typically 95).** A percentile near 50 caps almost all data — almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason. +**Winsorization enabled with a `percentile` far from the default (`5`, meaning 5% tails capped on each side).** A `percentile` approaching the schema cap of 50 caps almost all data — almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason. ### SRM check disabled diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/per-metric-interpretation.md index e46381c..e863069 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/per-metric-interpretation.md @@ -116,7 +116,7 @@ Different metric types behave differently; cite the relevant nuance in your verd ## Variance-reduction & outlier settings that change interpretation - **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). -- **Winsorization enabled**: extreme values capped at the configured percentile, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A percentile much lower than the platform default (typically 95) is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). +- **Winsorization enabled**: extreme values capped at both tails, pooled across variants. The `percentile` field is the tail width (default `5` = 5% tails). Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much higher than the default — capping more than ~20% of each tail — is a misconfiguration; see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). --- diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/pitfalls.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/pitfalls.md index 5ce40a1..c20eb30 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/pitfalls.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/pitfalls.md @@ -53,7 +53,7 @@ If the team genuinely wants to make that trade, they can disable the guardrail b ### High variance, no Winsorization — warning -**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the platform default percentile (typically 95). Push back if the user sets a percentile below ~80 — more than 20% of values capped is almost always a misconfiguration. +**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the default `percentile=5` (cap each 5% tail). Push back if the user sets a `percentile` above ~20 — more than 20% of values capped on each side is almost always a misconfiguration. ### Multiple primaries, no correction — warning diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md index 0b8044c..aa1a854 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md @@ -69,7 +69,7 @@ Terms both commands use without redefining. Phase-specific terms (hypothesis, po - **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict. - **MDE (Minimum Detectable Effect).** The smallest lift the experiment is sized to detect. Set during design, enforced at interpretation. - **CUPED.** Variance-reduction technique using pre-exposure baseline. Cuts required sample 30–70% when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. -- **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95 (verify in product). Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. +- **Winsorization.** Outlier capping applied pooled across variants. The `percentile` field is the **tail width** to cap on each side (default `5` caps below the 5th and above the 95th — i.e. the 5% tails). The schema rejects `percentile` ≥ 50. Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. - **Multiple-testing correction.** Adjusts per-test significance threshold when several primaries or several non-control variants are tested together. Default Benjamini-Hochberg; Bonferroni for strict family-wise control. ## Reference files diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md index 5c70403..d295a61 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md @@ -93,7 +93,7 @@ Decision tree, the peeking-trap explanation, worked compounding-FPR numbers, and ### 6. Decide on advanced features - **CUPED** — enable when the primary metric correlates with pre-exposure behaviour AND all experiment users existed before start AND 2–4 weeks of stable pre-exposure history is available. Do not enable on new-user-only experiments, one-time-event metrics, or brand-new metrics. -- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. Push back if the configured percentile is below ~80. +- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. The `percentile` field is the tail width to cap (default `5` = 5% tails); push back if the user sets a percentile above ~20 — more than 20% of values capped on each side throws away too much signal. When/why each is right and the common misconfigurations are in [../references/advanced-features.md](../references/advanced-features.md). diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/advanced-features.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/advanced-features.md index 3f6f7dd..96fcd08 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/advanced-features.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/advanced-features.md @@ -36,9 +36,9 @@ Three optional features most experiments don't touch — and that, used in the r ## Winsorization — outlier handling -**What it does.** Caps extreme values at a percentile boundary (default 95th — verify in product). This squeezes the long tail of heavy-tailed distributions so a handful of outliers can't dominate the per-arm mean. +**What it does.** Caps extreme values at both tails of the distribution. The `percentile` field on the settings is the **tail width** to cap on each side: the default `5` caps below the 5th and above the 95th (i.e. the 5% tails). This squeezes the long tail of heavy-tailed distributions so a handful of outliers can't dominate the per-arm mean. -**How to enable.** Turn Winsorization on for the experiment and pick a percentile. +**How to enable.** Turn Winsorization on for the experiment and pick a `percentile`. The schema rejects `percentile` ≥ 50. ### When to enable @@ -54,9 +54,9 @@ Three optional features most experiments don't touch — and that, used in the r ### Percentile guidance -The platform default is typically 95 (cap top/bottom 5%) — verify in product. This is almost always right. Push back if the user sets a percentile below ~80 — that's more than 20% of values being capped, which throws away too much signal. Confirm intent before launching. +The default is `percentile=5` (cap each 5% tail). This is almost always right. Push back if the user sets a `percentile` above ~20 — that's more than 20% of values capped on each side, which throws away too much signal. Confirm intent before launching. -For very heavy tails (extreme whale distributions), 99th percentile is sometimes appropriate, but that's the corner case. The platform default is the default for a reason. +For very heavy tails (extreme whale distributions), `percentile=1` (cap each 1% tail) is sometimes appropriate, but that's the corner case. The default is the default for a reason. ### What changes downstream diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/health-check-interpretation.md index 1467468..82a09b7 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/health-check-interpretation.md @@ -142,7 +142,7 @@ These don't always invalidate results, but they change how to _read_ them. Surfa ### Extreme winsorization percentile -**Winsorization enabled with a percentile far from the platform default (typically 95).** A percentile near 50 caps almost all data — almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason. +**Winsorization enabled with a `percentile` far from the default (`5`, meaning 5% tails capped on each side).** A `percentile` approaching the schema cap of 50 caps almost all data — almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason. ### SRM check disabled diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/per-metric-interpretation.md index e46381c..e863069 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/per-metric-interpretation.md @@ -116,7 +116,7 @@ Different metric types behave differently; cite the relevant nuance in your verd ## Variance-reduction & outlier settings that change interpretation - **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). -- **Winsorization enabled**: extreme values capped at the configured percentile, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A percentile much lower than the platform default (typically 95) is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). +- **Winsorization enabled**: extreme values capped at both tails, pooled across variants. The `percentile` field is the tail width (default `5` = 5% tails). Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much higher than the default — capping more than ~20% of each tail — is a misconfiguration; see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). --- diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/pitfalls.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/pitfalls.md index 5ce40a1..c20eb30 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/pitfalls.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/pitfalls.md @@ -53,7 +53,7 @@ If the team genuinely wants to make that trade, they can disable the guardrail b ### High variance, no Winsorization — warning -**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the platform default percentile (typically 95). Push back if the user sets a percentile below ~80 — more than 20% of values capped is almost always a misconfiguration. +**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the default `percentile=5` (cap each 5% tail). Push back if the user sets a `percentile` above ~20 — more than 20% of values capped on each side is almost always a misconfiguration. ### Multiple primaries, no correction — warning diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md index 0b8044c..aa1a854 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md @@ -69,7 +69,7 @@ Terms both commands use without redefining. Phase-specific terms (hypothesis, po - **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict. - **MDE (Minimum Detectable Effect).** The smallest lift the experiment is sized to detect. Set during design, enforced at interpretation. - **CUPED.** Variance-reduction technique using pre-exposure baseline. Cuts required sample 30–70% when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. -- **Winsorization.** Outlier capping at a configured percentile, applied pooled across variants. Default 95 (verify in product). Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. +- **Winsorization.** Outlier capping applied pooled across variants. The `percentile` field is the **tail width** to cap on each side (default `5` caps below the 5th and above the 95th — i.e. the 5% tails). The schema rejects `percentile` ≥ 50. Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. - **Multiple-testing correction.** Adjusts per-test significance threshold when several primaries or several non-control variants are tested together. Default Benjamini-Hochberg; Bonferroni for strict family-wise control. ## Reference files diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md index 5c70403..d295a61 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md @@ -93,7 +93,7 @@ Decision tree, the peeking-trap explanation, worked compounding-FPR numbers, and ### 6. Decide on advanced features - **CUPED** — enable when the primary metric correlates with pre-exposure behaviour AND all experiment users existed before start AND 2–4 weeks of stable pre-exposure history is available. Do not enable on new-user-only experiments, one-time-event metrics, or brand-new metrics. -- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. Push back if the configured percentile is below ~80. +- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. The `percentile` field is the tail width to cap (default `5` = 5% tails); push back if the user sets a percentile above ~20 — more than 20% of values capped on each side throws away too much signal. When/why each is right and the common misconfigurations are in [../references/advanced-features.md](../references/advanced-features.md). diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/advanced-features.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/advanced-features.md index 3f6f7dd..96fcd08 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/advanced-features.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/advanced-features.md @@ -36,9 +36,9 @@ Three optional features most experiments don't touch — and that, used in the r ## Winsorization — outlier handling -**What it does.** Caps extreme values at a percentile boundary (default 95th — verify in product). This squeezes the long tail of heavy-tailed distributions so a handful of outliers can't dominate the per-arm mean. +**What it does.** Caps extreme values at both tails of the distribution. The `percentile` field on the settings is the **tail width** to cap on each side: the default `5` caps below the 5th and above the 95th (i.e. the 5% tails). This squeezes the long tail of heavy-tailed distributions so a handful of outliers can't dominate the per-arm mean. -**How to enable.** Turn Winsorization on for the experiment and pick a percentile. +**How to enable.** Turn Winsorization on for the experiment and pick a `percentile`. The schema rejects `percentile` ≥ 50. ### When to enable @@ -54,9 +54,9 @@ Three optional features most experiments don't touch — and that, used in the r ### Percentile guidance -The platform default is typically 95 (cap top/bottom 5%) — verify in product. This is almost always right. Push back if the user sets a percentile below ~80 — that's more than 20% of values being capped, which throws away too much signal. Confirm intent before launching. +The default is `percentile=5` (cap each 5% tail). This is almost always right. Push back if the user sets a `percentile` above ~20 — that's more than 20% of values capped on each side, which throws away too much signal. Confirm intent before launching. -For very heavy tails (extreme whale distributions), 99th percentile is sometimes appropriate, but that's the corner case. The platform default is the default for a reason. +For very heavy tails (extreme whale distributions), `percentile=1` (cap each 1% tail) is sometimes appropriate, but that's the corner case. The default is the default for a reason. ### What changes downstream diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/health-check-interpretation.md index 1467468..82a09b7 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/health-check-interpretation.md @@ -142,7 +142,7 @@ These don't always invalidate results, but they change how to _read_ them. Surfa ### Extreme winsorization percentile -**Winsorization enabled with a percentile far from the platform default (typically 95).** A percentile near 50 caps almost all data — almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason. +**Winsorization enabled with a `percentile` far from the default (`5`, meaning 5% tails capped on each side).** A `percentile` approaching the schema cap of 50 caps almost all data — almost certainly a misconfiguration. Confirm with the user; recommend resetting to the default unless they have a specific reason. ### SRM check disabled diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/per-metric-interpretation.md index e46381c..e863069 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/per-metric-interpretation.md @@ -116,7 +116,7 @@ Different metric types behave differently; cite the relevant nuance in your verd ## Variance-reduction & outlier settings that change interpretation - **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). -- **Winsorization enabled**: extreme values capped at the configured percentile, pooled across variants. Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A percentile much lower than the platform default (typically 95) is a misconfiguration — see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). +- **Winsorization enabled**: extreme values capped at both tails, pooled across variants. The `percentile` field is the tail width (default `5` = 5% tails). Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much higher than the default — capping more than ~20% of each tail — is a misconfiguration; see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). --- diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/pitfalls.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/pitfalls.md index 5ce40a1..c20eb30 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/pitfalls.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/pitfalls.md @@ -53,7 +53,7 @@ If the team genuinely wants to make that trade, they can disable the guardrail b ### High variance, no Winsorization — warning -**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the platform default percentile (typically 95). Push back if the user sets a percentile below ~80 — more than 20% of values capped is almost always a misconfiguration. +**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the default `percentile=5` (cap each 5% tail). Push back if the user sets a `percentile` above ~20 — more than 20% of values capped on each side is almost always a misconfiguration. ### Multiple primaries, no correction — warning From 1bc3dcd591d2a469730a8c3e865c46744489aad2 Mon Sep 17 00:00:00 2001 From: Elliot Feinberg Date: Wed, 10 Jun 2026 11:12:22 -0700 Subject: [PATCH 04/11] Add launch + monitor commands to manage-experiment (#31) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Add launch and monitor commands to manage-experiment Extends the manage-experiment umbrella (PR #30) from 2 commands (design, interpret) to 4 commands covering the full experiment lifecycle. Stacks on top of PR #30 so the umbrella structure and this extension can be reviewed separately. ## New commands ### commands/launch.md (~125 lines) Owns the irreversible DRAFT → ACTIVE transition. Runs the final launch-readiness check (lifted from design step 7 — that step now catches design-time issues only) and performs the launch with explicit CONFIRM. Includes the post-launch handoff: recommend a tracking dashboard + a calendar reminder for monitor. ### commands/monitor.md (~115 lines) Mid-flight safety checks. Answers "is it safe to keep this running?" which is distinct from interpret's "did it work?" Covers: - The peek-safety table: what's safe to look at mid-flight (SRM, sample pace, guardrails, Sequential primaries) and what isn't (Frequentist primaries — the peeking trap). - The terminate-early decision rules: trustworthiness failure, guardrail regression beyond tolerance, Sequential boundary crossed. - An explicit don't-peek-on-Frequentist pushback for users who ask to see primary results mid-flight. ## Knock-on changes - umbrella SKILL.md: 4-row commands table, 5-option menu, phase-derived routing extended to DRAFT-design vs DRAFT-launch and ACTIVE-monitor vs ACTIVE-interpret. Reference table "Used by" columns updated (pitfalls → design+launch, health-check → monitor+interpret, sizing → design+monitor+interpret). - commands/design.md step 7: reframed as "design-time sanity check" rather than "the pre-launch check". The full launch-readiness gate now lives in launch.md. Step 8 saves a DRAFT (reversible) rather than creating-and-launching. Added step 9: hand off to launch. - design.md opening line: updated to say "stops at DRAFT" instead of "don't create until confirmed". Synced to mixpanel-mcp-eu and mixpanel-mcp-in. Co-Authored-By: Claude Opus 4.7 (1M context) * Address review issues: lift cross-command policies to umbrella, unify emoji, hedge cross-skill refs Applies the 5 issues surfaced by the /review-skill pass on the 4-command manage-experiment: ## Major (D4) The 5% guardrail hard-gate and the peek-safety table were duplicated across design / launch / monitor — three places to keep load-bearing thresholds in sync. - Add a "Cross-command policies" section to the umbrella SKILL.md that owns the guardrail hard-gate threshold and the peek-safety table. Also moves the emoji convention there so all commands share one vocabulary. - design.md "The >5% guardrail hard-gate" → "Guardrails (the hard-gate enforced downstream)" — keeps the design-time guidance, defers the threshold to the umbrella. - design.md step 5: "peeking is safe by design" → "peek-safety table in the umbrella's Cross-command policies covers why". - launch.md readiness checklist: ">5% hard-gate from the design command" → "regression hard-gate (see umbrella Cross-command policies)". - monitor.md Components: drop the entire "What's safe to peek at" table, reference the umbrella. Glossary defers "peeking trap" to the umbrella. Terminate-early rules cite the umbrella threshold. ## Minor (D3) design.md output-style section had a duplicate hand-off line restating what step 9 already covers — removed. ## Minor (D7) launch.md step 5 dashboard recommendation was unqualified — added "recommend running create-dashboard in a follow-up session" so the agent doesn't try to invoke it inline and interrupt the launch flow. ## Suggestion (D3) launch.md irreversibility section mentioned "pause/resume via the underlying feature flag" without naming the skill — added explicit "handled by the manage-feature-flags skill" cross-ref. ## Emoji convention (D6) Declared once in the umbrella's Cross-command policies. Existing emoji usage across design/launch/monitor already matches; interpret has no symbolic checklist output (its output is prose verdict + per- metric breakdown), which is correct. ## Knock-on numbers Skill core: 692 → 720 lines (umbrella +38 with the new policies section, monitor -10 from removed table, others flat). Synced to mixpanel-mcp-eu and mixpanel-mcp-in. Co-Authored-By: Claude Opus 4.7 (1M context) * PR #31 polish: define peeking trap in umbrella, add calendar-reminder example /review-document feedback on PR #31: - D6: monitor.md deferred the "peeking trap" definition to the umbrella's Cross-command policies, but the umbrella's peek-safety section only showed consequences and never defined the term. Added a one-line definition before the table. - D8: launch.md step 5 calendar-reminder recommendation was abstract. Added concrete phrasing the agent can hand to the user so it doesn't improvise the wording inconsistently across sessions. Synced to eu/in. Co-Authored-By: Claude Opus 4.7 (1M context) --------- Co-authored-by: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) --- .../skills/manage-experiment/SKILL.md | 156 +++++++++++++----- .../manage-experiment/commands/design.md | 32 ++-- .../manage-experiment/commands/launch.md | 126 ++++++++++++++ .../manage-experiment/commands/monitor.md | 106 ++++++++++++ .../skills/manage-experiment/SKILL.md | 156 +++++++++++++----- .../manage-experiment/commands/design.md | 32 ++-- .../manage-experiment/commands/launch.md | 126 ++++++++++++++ .../manage-experiment/commands/monitor.md | 106 ++++++++++++ .../skills/manage-experiment/SKILL.md | 156 +++++++++++++----- .../manage-experiment/commands/design.md | 32 ++-- .../manage-experiment/commands/launch.md | 126 ++++++++++++++ .../manage-experiment/commands/monitor.md | 106 ++++++++++++ 12 files changed, 1089 insertions(+), 171 deletions(-) create mode 100644 plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/launch.md create mode 100644 plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/monitor.md create mode 100644 plugins/mixpanel-mcp-in/skills/manage-experiment/commands/launch.md create mode 100644 plugins/mixpanel-mcp-in/skills/manage-experiment/commands/monitor.md create mode 100644 plugins/mixpanel-mcp/skills/manage-experiment/commands/launch.md create mode 100644 plugins/mixpanel-mcp/skills/manage-experiment/commands/monitor.md diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md index aa1a854..c977181 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md @@ -1,29 +1,38 @@ --- name: manage-experiment description: > - Coach the user through any phase of a Mixpanel experiment — design before - launch (hypothesis framing, metric selection, sizing, statistical model - choice, advanced statistical features like CUPED / Winsorization / - Bonferroni / Benjamini-Hochberg, pre-launch pitfall checks) and interpret - after launch (read results, decide ship / iterate / kill / wait, interpret - health checks like SRM and Retro A/A, break results down by segment, use - session replays to explain a result). Use when the user mentions + Coach the user through any phase of a Mixpanel experiment — design (hypothesis + framing, metric selection, sizing, statistical model choice, advanced features + like CUPED / Winsorization / Bonferroni / Benjamini-Hochberg), launch (final + pre-launch readiness check and the irreversible launch action), monitor + (mid-flight safety: SRM, sample pace, guardrail-only peek, the don't-peek-on- + Frequentist rule, terminate-early decisions), and interpret after the + experiment is mature (read results, decide ship / iterate / kill / wait, + interpret health checks like SRM and Retro A/A, break results down by segment, + use session replays to explain a result). Use when the user mentions experiment, A/B test, ship/kill decision, MDE, minimum detectable effect, - sample ratio mismatch, CUPED, sizing, statistical significance, lift, or - any phrasing like "set up an experiment", "design an A/B test", "how did - experiment X do", "should we ship", "why isn't this significant yet", - "should this be sequential or fixed-horizon", "what's my MDE", "is this - experiment configured correctly", "audit my experiment". Do NOT use for - plain feature-flag rollouts with no measurement criterion — that belongs - to the `manage-feature-flags` skill. + sample ratio mismatch, CUPED, sizing, statistical significance, lift, or any + phrasing like "set up an experiment", "design an A/B test", "launch this + experiment", "is it safe to keep this running", "is my experiment SRM-ing", + "should I peek", "how did experiment X do", "should we ship", "why isn't this + significant yet", "should this be sequential or fixed-horizon", "what's my + MDE", "is this experiment configured correctly", "audit my experiment". Do + NOT use for plain feature-flag rollouts with no measurement criterion — that + belongs to the `manage-feature-flags` skill. license: Apache-2.0 --- # Manage Experiment -This skill manages a Mixpanel experiment across its lifecycle — **designing** before launch and **interpreting** after launch. Two commands sit under the umbrella; pick by experiment phase: design when the experiment doesn't exist yet, interpret once exposures are flowing. +This skill manages a Mixpanel experiment across its full lifecycle — **design**, **launch**, **monitor**, **interpret**. Four commands sit under the umbrella; pick by experiment phase. -The skill runs as a single interactive session per experiment. The two commands compose naturally — designing produces a configuration that interpreting later consumes — but they're rarely invoked in the same session (the gap is days to weeks). +The four commands map cleanly to experiment states: + +- `DRAFT` (experiment doesn't exist or hasn't launched) → `design` or `launch`. +- `ACTIVE` (mid-flight, exposures accumulating) → `monitor`. +- `ACTIVE` (reached planned end) or `CONCLUDED` → `interpret`. + +The skill runs as a single interactive session per experiment. Commands compose naturally across phases — designing produces a configuration that launching commits, monitoring watches for safety issues mid-flight, interpreting consumes the matured result — but they're rarely invoked in the same session (the lifecycle spans days to weeks). --- @@ -37,10 +46,19 @@ Each command lives in its own file under `commands/` and is loaded on demand. Ma | Command | File | Match if message contains any of | | ----------- | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- | -| `design` | `commands/design.md` | design, set up, configure, plan, sanity-check, pre-launch, MDE, sizing, hypothesis, sequential vs frequentist, CUPED, Winsorization | +| `design` | `commands/design.md` | design, set up, configure, plan, sanity-check, hypothesis, MDE, sizing, sequential vs frequentist, CUPED, Winsorization | +| `launch` | `commands/launch.md` | launch, go live, start the experiment, ready to ship the experiment, pre-launch check, launch readiness | +| `monitor` | `commands/monitor.md` | monitor, mid-flight, is it safe, should I peek, SRM mid-flight, sample pace, guardrail wobble, terminate early | | `interpret` | `commands/interpret.md` | read results, ship, iterate, kill, wait, statsig, SRM, sample ratio mismatch, retro A/A, lift, polarity, segment breakdown, session replays | -If a message could route to either (e.g. "audit my experiment", "check on experiment X"), use the **phase-derived** rule: experiment in `DRAFT` → `design`; experiment in `ACTIVE` or `CONCLUDED` → `interpret`. If the experiment state is unknown, ask the user. +If a message could route to more than one, use the **phase-derived** rule based on experiment state: + +- `DRAFT`, configuration incomplete → `design`. +- `DRAFT`, configuration complete and user is ready to go → `launch`. +- `ACTIVE`, mid-flight (planned end not reached) → `monitor`. +- `ACTIVE`, reached planned end, or `CONCLUDED` → `interpret`. + +If the experiment state is unknown or doesn't disambiguate (e.g. `DRAFT` could be either design or launch), ask the user. ## Command menu @@ -50,15 +68,17 @@ Shown when no command was detected or inferred. ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Manage Experiment — [Project Name] ([project_id]) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ - 1. Design — Hypothesis, metrics, sizing, model, pre-launch checks - 2. Interpret — Read results, ship / iterate / kill / wait - 3. Exit + 1. Design — Hypothesis, metrics, sizing, statistical model, advanced features + 2. Launch — Pre-launch readiness check, then launch (irreversible) + 3. Monitor — Mid-flight safety: SRM, sample pace, guardrails, peeking discipline + 4. Interpret — Read results, ship / iterate / kill / wait + 5. Exit ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ``` ## Shared glossary -Terms both commands use without redefining. Phase-specific terms (hypothesis, polarity, SRM, etc.) live in their command files. +Terms all four commands use without redefining. Phase-specific terms (hypothesis, polarity, SRM, peeking trap, etc.) live in their command files. - **Variant.** One arm of the experiment. The variant treated as the baseline is the **control**; the others are **treatments**. The platform marks which key is the control. - **Primary / Guardrail / Secondary metric.** @@ -76,27 +96,67 @@ Terms both commands use without redefining. Phase-specific terms (hypothesis, po Each command file links into these on demand. The map is here so the skill has a single index of what `references/` contains. -| File | Used by | Purpose | -| ------------------------------------------------------------------------------------------------ | ------------------ | ----------------------------------------------------------------------------------------------------------------------------- | -| [references/routing-xp-vs-ff.md](references/routing-xp-vs-ff.md) | design | Experiment vs Feature Flag disambiguation — when each is the right tool and the hand-off rules. | -| [references/hypothesis-framing.md](references/hypothesis-framing.md) | design | The four properties of a good hypothesis, rubric, common misalignment patterns, worked good/bad examples. | -| [references/metric-selection.md](references/metric-selection.md) | design | Picking primaries, guardrails, and secondaries. Guardrails-by-domain table. Lagging-indicator and changed-denominator traps. | -| [references/sizing.md](references/sizing.md) | design + interpret | Sample-size and MDE formulas, Kohavi's inversion, baseline-by-rate lookup, the five remediations for underpowered tests. | -| [references/statistical-model.md](references/statistical-model.md) | design | Sequential vs frequentist, end-condition choice, confidence level, multiple-testing correction. Peeking-trap math. | -| [references/advanced-features.md](references/advanced-features.md) | design | When CUPED and Winsorization help, when each is wrong, and the common misconfigurations. | -| [references/prior-experiments.md](references/prior-experiments.md) | design | How to look up and fold-in prior experiments on the same feature. | -| [references/pitfalls.md](references/pitfalls.md) | design | The pre-launch pitfall catalogue: blockers (stop launch), warnings (explain trade-off), fyi. | -| [references/health-check-interpretation.md](references/health-check-interpretation.md) | interpret | Reading SRM, Retro A/A, exposures-sufficient, and misconfiguration verdicts. The trustworthiness gate's remediation playbook. | -| [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | interpret | Translating a single metric's lift / CI / p-value into a plain-language verdict, with the Twyman's Law guard. | -| [references/why-no-statsig.md](references/why-no-statsig.md) | interpret | Wait / extend / boost power / narrow / accept-null decision tree when nothing's significant. | -| [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | interpret | How to pick the 3–5 segments worth breaking results down on, before slicing every dimension. | -| [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | interpret | Reading per-segment results: heterogeneity vs Simpson's paradox vs noise; the "ship to segment X" requirements. | -| [references/session-replay-analysis.md](references/session-replay-analysis.md) | interpret | Turning a quantitative experiment result into a behavior story using session replays. | -| [references/lifecycle-handoff.md](references/lifecycle-handoff.md) | interpret | The decide-action call shape, multi-variant ship semantics, special variant constants. | +| File | Used by | Purpose | +| ------------------------------------------------------------------------------------------------ | ---------------------------- | ----------------------------------------------------------------------------------------------------------------------------- | +| [references/routing-xp-vs-ff.md](references/routing-xp-vs-ff.md) | design | Experiment vs Feature Flag disambiguation — when each is the right tool and the hand-off rules. | +| [references/hypothesis-framing.md](references/hypothesis-framing.md) | design | The four properties of a good hypothesis, rubric, common misalignment patterns, worked good/bad examples. | +| [references/metric-selection.md](references/metric-selection.md) | design | Picking primaries, guardrails, and secondaries. Guardrails-by-domain table. Lagging-indicator and changed-denominator traps. | +| [references/sizing.md](references/sizing.md) | design + monitor + interpret | Sample-size and MDE formulas, Kohavi's inversion, baseline-by-rate lookup, the five remediations for underpowered tests. | +| [references/statistical-model.md](references/statistical-model.md) | design | Sequential vs frequentist, end-condition choice, confidence level, multiple-testing correction. Peeking-trap math. | +| [references/advanced-features.md](references/advanced-features.md) | design | When CUPED and Winsorization help, when each is wrong, and the common misconfigurations. | +| [references/prior-experiments.md](references/prior-experiments.md) | design | How to look up and fold-in prior experiments on the same feature. | +| [references/pitfalls.md](references/pitfalls.md) | design + launch | The pre-launch pitfall catalogue: blockers (stop launch), warnings (explain trade-off), fyi. | +| [references/health-check-interpretation.md](references/health-check-interpretation.md) | monitor + interpret | Reading SRM, Retro A/A, exposures-sufficient, and misconfiguration verdicts. The trustworthiness gate's remediation playbook. | +| [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | interpret | Translating a single metric's lift / CI / p-value into a plain-language verdict, with the Twyman's Law guard. | +| [references/why-no-statsig.md](references/why-no-statsig.md) | interpret | Wait / extend / boost power / narrow / accept-null decision tree when nothing's significant. | +| [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | interpret | How to pick the 3–5 segments worth breaking results down on, before slicing every dimension. | +| [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | interpret | Reading per-segment results: heterogeneity vs Simpson's paradox vs noise; the "ship to segment X" requirements. | +| [references/session-replay-analysis.md](references/session-replay-analysis.md) | interpret | Turning a quantitative experiment result into a behavior story using session replays. | +| [references/lifecycle-handoff.md](references/lifecycle-handoff.md) | interpret | The decide-action call shape, multi-variant ship semantics, special variant constants. | + +## Cross-command policies + +Rules that apply across more than one command. Defined here once so all commands reference the same threshold and don't drift. + +### Guardrail hard-gate (5% relative regression) + +A **5% relative regression on any guardrail blocks ship**, even when the primary wins. The threshold lives here, not in any one command: + +- `design` warns if no guardrails are configured (the gate has nothing to enforce). +- `launch` blocks-or-warns based on guardrail configuration in the readiness check. +- `monitor` uses the threshold to decide when a mid-flight guardrail regression justifies termination. +- `interpret` uses the threshold for the ITERATE vs SHIP decision. + +If a team agrees on a different threshold (3% for high-volume billing, 10% for early experiments), change it here and the commands inherit it. + +### Peek-safety table + +The **peeking trap**: stopping early on a favorable Frequentist peek inflates the false-positive rate because each look at the data is another chance to cross the significance threshold by chance. Sequential testing is built to make peeking safe (the stopping boundaries account for repeated looks); Frequentist testing is not. + +The table below is what's safe to look at mid-flight, and what isn't. Used by `monitor` directly; referenced from `design` (when picking sequential vs frequentist) and `interpret` (when deciding whether a mid-flight peek invalidates a verdict). + +| Signal | Safe to peek mid-flight? | Why | +| --------------------------------- | ------------------------ | ---------------------------------------------------------------------------------------------------------------------- | +| SRM verdict | Yes | Bucketing health is independent of effect size. Detecting SRM early lets you stop before more exposure data is wasted. | +| Sample pace | Yes | A pacing problem is operational, not statistical. Detecting it early gives time to remediate. | +| Guardrail polarity | Yes (with care) | A guardrail regression mid-flight is a real safety signal. Stopping for a guardrail regression is not p-hacking. | +| Primary metric lift (Sequential) | Yes | Sequential testing makes peeking part of the design. The platform's stopping boundaries account for it. | +| Primary metric lift (Frequentist) | **No** | Stopping early on a favorable Frequentist peek is the canonical peeking trap. The false-positive rate inflates fast. | + +The rule users get wrong most often: thinking they can "just check" the primary mid-flight on a Frequentist test "without acting on it." If the look influences any decision — even the decision to wait — it's a peek. + +### Output emoji conventions + +All four commands use the same visual vocabulary so multi-command sessions read consistently: + +- ✅ — pass / ok / nothing to flag +- ⚠️ — warning / attention needed (proceed if user accepts) +- 🛑 — blocker / fail / stop (do not proceed) +- ℹ️ — fyi / informational ## Behaviour rules -1. **Irreversible actions require explicit confirmation.** Creating an experiment (in `design`) and concluding one (in `interpret`) are both irreversible. Show the proposed action, wait for the user to confirm. +1. **Irreversible actions require explicit confirmation.** Creating an experiment (in `design`), launching one (in `launch`), terminating one mid-flight (in `monitor`), and concluding one (in `interpret`) are all irreversible. Show the proposed action, wait for the user to confirm with literal `CONFIRM` for the destructive ones. 2. **If a command can't complete, explain why.** Tell the user what failed and what they can try. Don't fail silently. 3. **Experiment switching.** If the user wants to operate on a different experiment mid-session, ask which one and reset experiment-scoped context. 4. **Project switching.** If the user wants to operate on a different project mid-session, suggest starting a new conversation first. If they insist, resolve the new project and continue with that `project_id`. @@ -128,10 +188,15 @@ If the user is starting a new experiment from scratch (no existing experiment to Apply these rules in order; the first match wins. -1. **Explicit:** user names a phase (`/design`, "set up a new experiment", "interpret experiment X") → use that command. +1. **Explicit:** user names a phase (`/design`, "launch this experiment", "monitor experiment X", "interpret the results") → use that command. 2. **Implicit:** message matches one canonical trigger phrase from the Components table → use that command. -3. **Phase-derived (only when an experiment was resolved in step 2):** `DRAFT` → `design`; `ACTIVE` or `CONCLUDED` → `interpret`. -4. **Ambiguous verbs** ("audit", "check", "review") — apply phase-derived routing if an experiment is in context, otherwise treat as ambiguous and fall through to the menu. +3. **Phase-derived (only when an experiment was resolved in step 2):** + - `DRAFT` + configuration incomplete or user is iterating on it → `design`. + - `DRAFT` + configuration complete and user is ready → `launch`. + - `ACTIVE` and planned end not reached → `monitor`. + - `ACTIVE` and planned end reached, or `CONCLUDED` → `interpret`. + If `DRAFT` doesn't disambiguate between design and launch, ask: "Is the configuration final, or are you still iterating on it?" +4. **Ambiguous verbs** ("audit", "check", "review") — apply phase-derived routing if an experiment is in context, otherwise fall through to the menu. 5. **Otherwise:** show the Command menu, take the user's choice. ## 4. Load and execute the command @@ -140,4 +205,7 @@ If the command file is not already in context, read `commands/[command].md`. Fol ## 5. Complete -Print `✅ Done.` Return to step 3 if the user wants to chain another command (e.g. design → launch externally → interpret in a later session). +Print `✅ Done.` Return to step 3 if the user wants to chain another command. Typical chains: + +- Same session: `design` → `launch` (the user finalized the design and is ready to go live). +- Across sessions: `launch` → (wait 24h+) → `monitor` → (wait to planned end) → `interpret`. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md index d295a61..adafd0f 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md @@ -1,6 +1,6 @@ # Command: design -Design a Mixpanel experiment before launch. A well-designed experiment starts from the hypothesis and works backward: the hypothesis dictates the metrics that test it, the metrics dictate the sample size, the sample size + traffic dictate duration and testing model. **Don't create the experiment until the user explicitly confirms the configuration** — once it's live, mid-flight config changes invalidate the test. +Design a Mixpanel experiment before launch. A well-designed experiment starts from the hypothesis and works backward: the hypothesis dictates the metrics that test it, the metrics dictate the sample size, the sample size + traffic dictate duration and testing model. This command stops at `DRAFT` — the irreversible launch happens in the separate `launch` command. **Don't save the draft until the user explicitly confirms the configuration.** The umbrella `SKILL.md` defines the shared glossary (Variant, Primary/Guardrail/Secondary metric, Direction, Lift, MDE, CUPED, Winsorization, Multiple-testing correction). Phase-specific terms below. @@ -33,9 +33,11 @@ MDE = 4σ / √n The `16` is `(z_{α/2} + z_β)² × 2` rounded. Variance `σ²` depends on metric type: Bernoulli `p(1−p)`; Poisson `≈ mean`; Gaussian computed from data. The full derivation, worked examples, lookup table, and the five remediations for underpowered experiments live in [../references/sizing.md](../references/sizing.md). -### The >5% guardrail hard-gate +### Guardrails (the hard-gate enforced downstream) -A **5% relative regression on any guardrail blocks ship**, even when the primary wins. Guardrails are the trustworthiness backstop; without this rule, a winning primary with a quietly regressing guardrail ships and rolls back two weeks later. Below 5% lives in the noise band of most guardrails; above 5% means the team has traded measurable damage for headline lift. If the user wants to ship past a regressing guardrail, force the conversation — disable the guardrail explicitly and document why. Don't let them silently override. Full rationale in [../references/pitfalls.md](../references/pitfalls.md). +Guardrails are the trustworthiness backstop. Without them, a winning primary with a quietly regressing guardrail ships and rolls back two weeks later. The umbrella owns the regression threshold — see [Cross-command policies in SKILL.md](../SKILL.md#cross-command-policies). This command's job is making sure guardrails exist; the threshold is enforced by `launch`, `monitor`, and `interpret`. + +If the user wants to ship past a regressing guardrail, force the conversation — disable the guardrail explicitly and document why. Don't let them silently override. Full rationale in [../references/pitfalls.md](../references/pitfalls.md). ### Pre-launch pitfall catalogue @@ -83,7 +85,7 @@ Sample-size floor: keep per-variant target above the platform's reliability floo Four choices, each with a default that's right for most users: -- **Testing model** — default Sequential (peeking is safe by design); Frequentist only for small-lift hunts on well-sized tests. +- **Testing model** — default Sequential (peek-safety table in the umbrella's Cross-command policies covers why); Frequentist only for small-lift hunts on well-sized tests. - **End condition** — sample-based for variable traffic; date-based for strong weekly seasonality. - **Confidence level** — default 0.95 (verify in product); 0.99 for irreversible high-stakes ships; 0.90 only when speed beats rigour. - **Multiple-testing correction** — enable when there are ≥2 primaries OR ≥2 non-control variants; default Benjamini-Hochberg, Bonferroni for strict family-wise control. @@ -97,13 +99,15 @@ Decision tree, the peeking-trap explanation, worked compounding-FPR numbers, and When/why each is right and the common misconfigurations are in [../references/advanced-features.md](../references/advanced-features.md). -### 7. Run the pre-launch pitfall check +### 7. Sanity-check the design before saving + +Run the catalogue from [../references/pitfalls.md](../references/pitfalls.md) against the proposed configuration so the user catches design-time problems before they save a `DRAFT`. Surface only what fires; order blockers → warnings → fyi. -Run the catalogue from **Components** against the proposed configuration. Surface only what fires; order blockers → warnings → fyi. Blockers should stop launch (the experiment cannot reach statistical power as configured). Warnings should be explained — name the trade-off, don't just nag. Full catalogue in [../references/pitfalls.md](../references/pitfalls.md). +The full readiness check runs again in the `launch` command before the experiment goes live — this step in `design` is for catching issues now while the configuration is easy to change, not for gating draft creation. -### 8. Confirm with the user, then create +### 8. Confirm and save as DRAFT -Creating the experiment is the irreversible step. Present a compact summary and **wait for explicit confirmation** before invoking the creation action: +Saving the design as a `DRAFT` is reversible (the user can keep iterating, or delete the draft). It is **not** the launch — the experiment doesn't go live until the `launch` command runs. Surface the configuration summary and **wait for explicit confirmation** before creating the draft: ``` *Experiment Setup Summary* @@ -120,15 +124,19 @@ Creating the experiment is the irreversible step. Present a compact summary and • *Expected duration on current traffic:* days • *Achievable MDE on current traffic:* % relative -*Pitfall check:* +*Design-time pitfall check:* ✅ Insufficient duration — adequate ✅ Cohort too small — adequate ⚠️ Missing guardrails — no guardrail metrics configured; >5% hard-gate cannot protect this ship ``` -Use the exact catalogue labels from [../references/pitfalls.md](../references/pitfalls.md) so the agent's pitfall messages stay consistent across sessions. +Use the exact catalogue labels from [../references/pitfalls.md](../references/pitfalls.md) so the agent's pitfall messages stay consistent across the design and launch commands. + +After saving the draft, link it back to any prior experiment surfaced in step 1 — record the prior's ID, hypothesis, and outcome in the new experiment's description. That 30-second annotation pays back tenfold at interpretation time. -After creating, link the new experiment back to any prior experiment surfaced in step 1 — record the prior's ID, hypothesis, and outcome in the new experiment's description. That 30-second annotation pays back tenfold at interpretation time. +### 9. Hand off to launch + +`design` stops at `DRAFT`. When the user is ready to go live, route to the `launch` command in this skill, which runs the final readiness check and performs the irreversible launch. If the user hasn't named a specific feature or surface, ask before fetching baselines or designing — designing the wrong experiment burns more time than the clarifying question costs. @@ -157,5 +165,3 @@ If the user hasn't named a specific feature or surface, ask before fetching base - When underpowered, say so plainly and list remediations in order of cost. - Don't moralise about peeking — switch them to sequential. - Guardrail regressions are hard gates, not "slight concerns." - -When the experiment is created and live, hand off to the `interpret` command in this same skill once exposures are flowing. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/launch.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/launch.md new file mode 100644 index 0000000..4f8aecb --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/launch.md @@ -0,0 +1,126 @@ +# Command: launch + +Launch a designed Mixpanel experiment. This is the irreversible transition from `DRAFT` to `ACTIVE` — once exposures start, variants are locked, the statistical model is fixed, and mid-flight configuration changes invalidate the test. This command exists to give that transition a deliberate seam. + +The umbrella `SKILL.md` defines the shared glossary. Phase-specific terms below. + +--- + +## Glossary (launch-specific) + +- **Pre-launch pitfall check.** The deterministic configuration validation that runs against the designed experiment before launch. Categorized as blockers (stop launch), warnings (explain trade-off, proceed if user accepts), fyi. +- **Allocation lock.** The moment the variant percentages stop being editable. After launch, the only safe per-variant change is the post-conclude ship action. +- **Cohort lock.** The moment the targeting cohort stops being editable. Changing the cohort mid-flight changes _who_ is being measured, which silently invalidates the comparison. + +--- + +## Components (launch-specific) + +### The irreversibility rule + +A launched experiment cannot be "un-launched" without losing the exposure data accumulated to that point. The only operations the post-launch state supports are: + +- **Monitor** mid-flight (the `monitor` command in this skill). +- **Conclude** at the end (the `interpret` command's decide action). +- **Pause / resume** via the underlying feature flag (handled by the `manage-feature-flags` skill) — rarely the right move; usually masks a design problem. + +There is no "edit the variants" or "change the statistical model" operation post-launch that preserves the result's validity. Surface this constraint to the user before launching — if there's any ambiguity about whether the configuration is final, send them back to `design`. + +### Launch readiness checklist + +Run this against the experiment about to launch. Surface only what fires; order blockers → warnings → fyi. + +| Severity | Check | +| -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Blocker | Pre-launch pitfall catalogue (insufficient duration, cohort too small) reports a blocker — see [../references/pitfalls.md](../references/pitfalls.md). | +| Blocker | The experiment has no primary metric. | +| Blocker | The configured allocation doesn't sum to 100% across variants. | +| Warning | The pre-launch pitfall catalogue reports a warning. | +| Warning | No guardrail metrics configured. Without guardrails, the regression hard-gate (see umbrella Cross-command policies) cannot protect the ship decision. | +| Warning | A primary metric has `direction` unset (defaults to `up`); cancel / error / latency / abandon / refund metrics need `down` set explicitly. | +| FYI | The experiment isn't linked back to a prior experiment on the same feature, even though prior experiments exist. Recommend adding the link before launch. | + +The pitfall catalogue itself lives in [../references/pitfalls.md](../references/pitfalls.md) — don't duplicate the rules here; run them and report results. + +--- + +## Steps + +Top-down: what to do, in order. + +### 1. Confirm the experiment is ready + +The umbrella resolves the experiment in its step 2. Verify it's in `DRAFT` state. If it's already `ACTIVE` or `CONCLUDED`, this command is the wrong one — route to `monitor` or `interpret`. + +### 2. Run the launch readiness checklist + +Apply the catalogue from Components against the current experiment configuration. Surface results in this order: + +``` +*Launch Readiness — [Experiment Name]* + +🛑 Blockers (must fix before launch) + • [blocker description] +⚠️ Warnings (recommend addressing) + • [warning description] +ℹ️ FYI + • [fyi description] +``` + +If any blockers fire, **stop**. Tell the user what to fix and route them back to `design` to update the configuration. Don't offer to launch past a blocker. + +If warnings fire, name each trade-off explicitly. Don't just list them — explain what risk the user is accepting by launching anyway. + +If only FYIs fire (or nothing fires), proceed to step 3. + +### 3. Present the launch confirmation + +Surface the launch summary and **wait for explicit confirmation** before invoking the launch action. The summary should match what the user saw at the end of `design`, with any post-design edits reflected: + +``` +*Launch Summary — [Experiment Name]* + +• *Hypothesis:* If , then will by ≥, because . +• *Primary metrics:* (direction), … +• *Guardrails:* (direction), … +• *Variants:* control % / treatment % (or as configured) +• *Statistical model:* sequential | frequentist +• *End condition:* sample-based (per-arm ) | date-based ( days) +• *Confidence level:* +• *Multiple testing correction:* benjamini-hochberg | bonferroni | off +• *Advanced features:* CUPED on/off · Winsorization on/off (percentile

) +• *Expected duration on current traffic:* days + +*After launch:* + • Variants are locked. + • Statistical model is locked. + • Cohort targeting is locked. + • The only safe operations are monitor (mid-flight) and conclude (at the end). + +Reply CONFIRM to launch. Anything else cancels. +``` + +The literal `CONFIRM` requirement matches the irreversibility-confirmation discipline in `manage-lexicon` and `manage-feature-flags`. + +### 4. Launch + +On `CONFIRM`, invoke the launch action. If the launch fails, surface the platform's error verbatim — don't paraphrase. The user needs to know whether the failure is a transient platform issue (retry) or a configuration issue (back to `design`). + +### 5. Hand off to monitor + +After a successful launch, recommend the user check back in 24h via the `monitor` command. Surface two things they should set up as follow-ups (don't interrupt the launch flow to do them inline): + +1. **A tracking dashboard** for the primary and guardrail metrics — gives the user a single place to watch the experiment without re-opening the skill every time. Recommend running `create-dashboard` in a follow-up session. +2. **A calendar reminder** for the canary check (24h) and the mid-flight check (~halfway through the planned duration). Concrete phrasing the agent can use: _"Worth scheduling two check-ins: 24h from now for the canary, and at the midpoint of your N-day window for the mid-flight read. Both run through the `monitor` command."_ + +Print `✅ Launched.` and return control to the umbrella. + +--- + +## Output style + +- Lead with the readiness verdict — pass, warnings, or blockers — before showing the summary. +- For blockers, name the specific configuration field the user needs to change. +- For warnings, name the trade-off, not just the rule. +- Don't moralise about launching with warnings — surface them, get explicit acceptance, proceed. +- Don't launch without `CONFIRM`. Treat any other response (including "yes", "ok", "sure") as cancel-and-clarify. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/monitor.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/monitor.md new file mode 100644 index 0000000..f27fdd0 --- /dev/null +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/monitor.md @@ -0,0 +1,106 @@ +# Command: monitor + +Mid-flight safety checks on a running experiment. This command answers **"is it safe to keep this experiment running?"** — distinct from `interpret`, which answers **"did the experiment work?"** Monitor is for the middle of the experiment, before there's enough signal to interpret. Peek only at what's safe to peek at; surface anything that warrants pause or termination. + +The umbrella `SKILL.md` defines the shared glossary. Phase-specific terms below. + +--- + +## Glossary (monitor-specific) + +- **Sample pace.** The ratio of actual exposures accumulated to expected exposures at this point in the experiment's planned duration. A pace below 0.7 (≥30% slower than projected) suggests the experiment is underpowered relative to its design, or that something is wrong with exposure tracking. +- **Mid-flight SRM.** A Sample Ratio Mismatch detected during the experiment, before exposures are mature. Distinct from the SRM check at interpretation time — mid-flight SRM is a bucketing-bug early-warning, not a verdict on the result. + +The **peeking trap** and the **peek-safety table** (what's safe to look at mid-flight, what isn't) live in the umbrella's [Cross-command policies](../SKILL.md#cross-command-policies) — this command applies them, doesn't re-derive them. + +--- + +## Components (monitor-specific) + +For the **peek-safety table** (what's safe to look at mid-flight, what isn't), see the umbrella's [Cross-command policies](../SKILL.md#cross-command-policies). For the **guardrail hard-gate threshold**, same place. + +### Terminate-early decision rules + +Three situations that justify ending a running experiment before its planned end: + +1. **Trustworthiness failure.** SRM fails mid-flight, or a misconfiguration is discovered that invalidates the design. Terminate, fix, restart. The accumulated exposures are not salvageable. +2. **Guardrail regression beyond the hard-gate threshold** (defined in the umbrella). The guardrail regresses by more than the threshold, with a tight CI. Continuing exposes more users to a measurable harm. Terminate and route to `interpret` for the ship/iterate verdict. +3. **Sequential stopping boundary crossed (Sequential tests only).** The platform's sequential boundary fires. This is the by-design early stop — terminate and route to `interpret`. + +What does **not** justify early termination: + +- "It's been a week and the lift looks good" (peeking trap on Frequentist — see the umbrella's peek-safety table). +- "The team is tired of waiting" (sunk cost mid-flight is real but not a statistical reason). +- "Looks like it'll be inconclusive" (futility analysis exists but requires the right test design; don't improvise). +- A guardrail wobbling within its noise band. + +--- + +## Steps + +Top-down: what to do, in order. + +### 1. Confirm the experiment is in scope for monitor + +The umbrella resolves the experiment in its step 2. Verify it's in `ACTIVE` state. If it's `DRAFT`, route to `design` or `launch`. If it's `CONCLUDED`, route to `interpret`. + +### 2. Read the safe signals + +Fetch the experiment with exposure data included. Report: + +- **Current state:** `ACTIVE` since [date], [N] days into a planned [N-or-date] window. +- **Sample pace:** actual vs expected exposures at this point. Flag if pace < 0.7. +- **Mid-flight SRM verdict** (from the platform). Flag if `FAIL`. +- **Guardrail summary** (polarity-corrected for each guardrail). Flag any with significant regressions. +- **Statistical model:** Sequential vs Frequentist. If Sequential, report whether the stopping boundary has been crossed. + +### 3. Apply the don't-peek rule for primaries + +If the experiment is **Frequentist** and the user asks about primary-metric results, push back politely: + +> "This experiment is configured as Frequentist, which means peeking at the primary mid-flight inflates the false-positive rate even if you're just curious. Mid-flight, the safe signals are SRM, sample pace, and guardrails — happy to walk through those. The primary verdict is meaningful once the planned [N-day | N-sample] target is reached." + +If the experiment is **Sequential**, peeking is by design. Report primary status and whether the sequential boundary fired. + +### 4. Decide: keep running, pause, or terminate + +Apply the **terminate-early decision rules** from Components. + +- **Trustworthiness failure or guardrail regression beyond tolerance** → recommend terminate, name the specific signal that fired, route to `interpret` for the formal ship/kill/iterate call. +- **Sequential stopping boundary crossed** → recommend terminate, route to `interpret`. +- **Pace problem (slow sample accrual)** → recommend extending the planned duration if possible, or accepting a coarser MDE. Don't terminate for pace alone; pace means the design is wrong, not the experiment. +- **Everything within tolerance** → recommend continue, give the user a checkback recommendation (next milestone: mid-flight, 80% sample, or planned end). + +For the SRM-failure remediation playbook, see [../references/health-check-interpretation.md](../references/health-check-interpretation.md). For the underpowered-experiment remediation playbook, see [../references/sizing.md](../references/sizing.md). + +### 5. Output + +Default to this shape: + +``` +*Mid-flight Status — [Experiment Name]* + +*Safe to keep running:* YES | NO (with reason) | YES, but with caveats + +*Signals:* + • SRM: ✅ PASS | 🛑 FAIL + • Sample pace: ✅ on track ([N]% of expected) | ⚠️ slow ([N]% of expected) + • Guardrails: ✅ flat | ⚠️ regressed [X]% + • Sequential boundary (if applicable): not crossed | CROSSED + +*Recommendation:* continue | terminate (reason) | extend duration + +*Next checkback:* [milestone — date or sample count] +``` + +If the user requested a primary-metric peek on a Frequentist test, lead with the don't-peek explanation before any other signals. + +--- + +## Output style + +- Lead with the verdict — safe to continue or not. +- Name the specific signal driving any "terminate" recommendation, never just "looks bad." +- Don't surface primary-metric lift on Frequentist tests, even if asked. Explain why, then offer the safe signals. +- Don't moralise about peeking — explain the math once, then route the user to safe signals. +- Treat "the team wants to ship now" as a separate conversation from "is the experiment ready to ship" — don't conflate. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md index aa1a854..c977181 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md @@ -1,29 +1,38 @@ --- name: manage-experiment description: > - Coach the user through any phase of a Mixpanel experiment — design before - launch (hypothesis framing, metric selection, sizing, statistical model - choice, advanced statistical features like CUPED / Winsorization / - Bonferroni / Benjamini-Hochberg, pre-launch pitfall checks) and interpret - after launch (read results, decide ship / iterate / kill / wait, interpret - health checks like SRM and Retro A/A, break results down by segment, use - session replays to explain a result). Use when the user mentions + Coach the user through any phase of a Mixpanel experiment — design (hypothesis + framing, metric selection, sizing, statistical model choice, advanced features + like CUPED / Winsorization / Bonferroni / Benjamini-Hochberg), launch (final + pre-launch readiness check and the irreversible launch action), monitor + (mid-flight safety: SRM, sample pace, guardrail-only peek, the don't-peek-on- + Frequentist rule, terminate-early decisions), and interpret after the + experiment is mature (read results, decide ship / iterate / kill / wait, + interpret health checks like SRM and Retro A/A, break results down by segment, + use session replays to explain a result). Use when the user mentions experiment, A/B test, ship/kill decision, MDE, minimum detectable effect, - sample ratio mismatch, CUPED, sizing, statistical significance, lift, or - any phrasing like "set up an experiment", "design an A/B test", "how did - experiment X do", "should we ship", "why isn't this significant yet", - "should this be sequential or fixed-horizon", "what's my MDE", "is this - experiment configured correctly", "audit my experiment". Do NOT use for - plain feature-flag rollouts with no measurement criterion — that belongs - to the `manage-feature-flags` skill. + sample ratio mismatch, CUPED, sizing, statistical significance, lift, or any + phrasing like "set up an experiment", "design an A/B test", "launch this + experiment", "is it safe to keep this running", "is my experiment SRM-ing", + "should I peek", "how did experiment X do", "should we ship", "why isn't this + significant yet", "should this be sequential or fixed-horizon", "what's my + MDE", "is this experiment configured correctly", "audit my experiment". Do + NOT use for plain feature-flag rollouts with no measurement criterion — that + belongs to the `manage-feature-flags` skill. license: Apache-2.0 --- # Manage Experiment -This skill manages a Mixpanel experiment across its lifecycle — **designing** before launch and **interpreting** after launch. Two commands sit under the umbrella; pick by experiment phase: design when the experiment doesn't exist yet, interpret once exposures are flowing. +This skill manages a Mixpanel experiment across its full lifecycle — **design**, **launch**, **monitor**, **interpret**. Four commands sit under the umbrella; pick by experiment phase. -The skill runs as a single interactive session per experiment. The two commands compose naturally — designing produces a configuration that interpreting later consumes — but they're rarely invoked in the same session (the gap is days to weeks). +The four commands map cleanly to experiment states: + +- `DRAFT` (experiment doesn't exist or hasn't launched) → `design` or `launch`. +- `ACTIVE` (mid-flight, exposures accumulating) → `monitor`. +- `ACTIVE` (reached planned end) or `CONCLUDED` → `interpret`. + +The skill runs as a single interactive session per experiment. Commands compose naturally across phases — designing produces a configuration that launching commits, monitoring watches for safety issues mid-flight, interpreting consumes the matured result — but they're rarely invoked in the same session (the lifecycle spans days to weeks). --- @@ -37,10 +46,19 @@ Each command lives in its own file under `commands/` and is loaded on demand. Ma | Command | File | Match if message contains any of | | ----------- | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- | -| `design` | `commands/design.md` | design, set up, configure, plan, sanity-check, pre-launch, MDE, sizing, hypothesis, sequential vs frequentist, CUPED, Winsorization | +| `design` | `commands/design.md` | design, set up, configure, plan, sanity-check, hypothesis, MDE, sizing, sequential vs frequentist, CUPED, Winsorization | +| `launch` | `commands/launch.md` | launch, go live, start the experiment, ready to ship the experiment, pre-launch check, launch readiness | +| `monitor` | `commands/monitor.md` | monitor, mid-flight, is it safe, should I peek, SRM mid-flight, sample pace, guardrail wobble, terminate early | | `interpret` | `commands/interpret.md` | read results, ship, iterate, kill, wait, statsig, SRM, sample ratio mismatch, retro A/A, lift, polarity, segment breakdown, session replays | -If a message could route to either (e.g. "audit my experiment", "check on experiment X"), use the **phase-derived** rule: experiment in `DRAFT` → `design`; experiment in `ACTIVE` or `CONCLUDED` → `interpret`. If the experiment state is unknown, ask the user. +If a message could route to more than one, use the **phase-derived** rule based on experiment state: + +- `DRAFT`, configuration incomplete → `design`. +- `DRAFT`, configuration complete and user is ready to go → `launch`. +- `ACTIVE`, mid-flight (planned end not reached) → `monitor`. +- `ACTIVE`, reached planned end, or `CONCLUDED` → `interpret`. + +If the experiment state is unknown or doesn't disambiguate (e.g. `DRAFT` could be either design or launch), ask the user. ## Command menu @@ -50,15 +68,17 @@ Shown when no command was detected or inferred. ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Manage Experiment — [Project Name] ([project_id]) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ - 1. Design — Hypothesis, metrics, sizing, model, pre-launch checks - 2. Interpret — Read results, ship / iterate / kill / wait - 3. Exit + 1. Design — Hypothesis, metrics, sizing, statistical model, advanced features + 2. Launch — Pre-launch readiness check, then launch (irreversible) + 3. Monitor — Mid-flight safety: SRM, sample pace, guardrails, peeking discipline + 4. Interpret — Read results, ship / iterate / kill / wait + 5. Exit ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ``` ## Shared glossary -Terms both commands use without redefining. Phase-specific terms (hypothesis, polarity, SRM, etc.) live in their command files. +Terms all four commands use without redefining. Phase-specific terms (hypothesis, polarity, SRM, peeking trap, etc.) live in their command files. - **Variant.** One arm of the experiment. The variant treated as the baseline is the **control**; the others are **treatments**. The platform marks which key is the control. - **Primary / Guardrail / Secondary metric.** @@ -76,27 +96,67 @@ Terms both commands use without redefining. Phase-specific terms (hypothesis, po Each command file links into these on demand. The map is here so the skill has a single index of what `references/` contains. -| File | Used by | Purpose | -| ------------------------------------------------------------------------------------------------ | ------------------ | ----------------------------------------------------------------------------------------------------------------------------- | -| [references/routing-xp-vs-ff.md](references/routing-xp-vs-ff.md) | design | Experiment vs Feature Flag disambiguation — when each is the right tool and the hand-off rules. | -| [references/hypothesis-framing.md](references/hypothesis-framing.md) | design | The four properties of a good hypothesis, rubric, common misalignment patterns, worked good/bad examples. | -| [references/metric-selection.md](references/metric-selection.md) | design | Picking primaries, guardrails, and secondaries. Guardrails-by-domain table. Lagging-indicator and changed-denominator traps. | -| [references/sizing.md](references/sizing.md) | design + interpret | Sample-size and MDE formulas, Kohavi's inversion, baseline-by-rate lookup, the five remediations for underpowered tests. | -| [references/statistical-model.md](references/statistical-model.md) | design | Sequential vs frequentist, end-condition choice, confidence level, multiple-testing correction. Peeking-trap math. | -| [references/advanced-features.md](references/advanced-features.md) | design | When CUPED and Winsorization help, when each is wrong, and the common misconfigurations. | -| [references/prior-experiments.md](references/prior-experiments.md) | design | How to look up and fold-in prior experiments on the same feature. | -| [references/pitfalls.md](references/pitfalls.md) | design | The pre-launch pitfall catalogue: blockers (stop launch), warnings (explain trade-off), fyi. | -| [references/health-check-interpretation.md](references/health-check-interpretation.md) | interpret | Reading SRM, Retro A/A, exposures-sufficient, and misconfiguration verdicts. The trustworthiness gate's remediation playbook. | -| [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | interpret | Translating a single metric's lift / CI / p-value into a plain-language verdict, with the Twyman's Law guard. | -| [references/why-no-statsig.md](references/why-no-statsig.md) | interpret | Wait / extend / boost power / narrow / accept-null decision tree when nothing's significant. | -| [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | interpret | How to pick the 3–5 segments worth breaking results down on, before slicing every dimension. | -| [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | interpret | Reading per-segment results: heterogeneity vs Simpson's paradox vs noise; the "ship to segment X" requirements. | -| [references/session-replay-analysis.md](references/session-replay-analysis.md) | interpret | Turning a quantitative experiment result into a behavior story using session replays. | -| [references/lifecycle-handoff.md](references/lifecycle-handoff.md) | interpret | The decide-action call shape, multi-variant ship semantics, special variant constants. | +| File | Used by | Purpose | +| ------------------------------------------------------------------------------------------------ | ---------------------------- | ----------------------------------------------------------------------------------------------------------------------------- | +| [references/routing-xp-vs-ff.md](references/routing-xp-vs-ff.md) | design | Experiment vs Feature Flag disambiguation — when each is the right tool and the hand-off rules. | +| [references/hypothesis-framing.md](references/hypothesis-framing.md) | design | The four properties of a good hypothesis, rubric, common misalignment patterns, worked good/bad examples. | +| [references/metric-selection.md](references/metric-selection.md) | design | Picking primaries, guardrails, and secondaries. Guardrails-by-domain table. Lagging-indicator and changed-denominator traps. | +| [references/sizing.md](references/sizing.md) | design + monitor + interpret | Sample-size and MDE formulas, Kohavi's inversion, baseline-by-rate lookup, the five remediations for underpowered tests. | +| [references/statistical-model.md](references/statistical-model.md) | design | Sequential vs frequentist, end-condition choice, confidence level, multiple-testing correction. Peeking-trap math. | +| [references/advanced-features.md](references/advanced-features.md) | design | When CUPED and Winsorization help, when each is wrong, and the common misconfigurations. | +| [references/prior-experiments.md](references/prior-experiments.md) | design | How to look up and fold-in prior experiments on the same feature. | +| [references/pitfalls.md](references/pitfalls.md) | design + launch | The pre-launch pitfall catalogue: blockers (stop launch), warnings (explain trade-off), fyi. | +| [references/health-check-interpretation.md](references/health-check-interpretation.md) | monitor + interpret | Reading SRM, Retro A/A, exposures-sufficient, and misconfiguration verdicts. The trustworthiness gate's remediation playbook. | +| [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | interpret | Translating a single metric's lift / CI / p-value into a plain-language verdict, with the Twyman's Law guard. | +| [references/why-no-statsig.md](references/why-no-statsig.md) | interpret | Wait / extend / boost power / narrow / accept-null decision tree when nothing's significant. | +| [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | interpret | How to pick the 3–5 segments worth breaking results down on, before slicing every dimension. | +| [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | interpret | Reading per-segment results: heterogeneity vs Simpson's paradox vs noise; the "ship to segment X" requirements. | +| [references/session-replay-analysis.md](references/session-replay-analysis.md) | interpret | Turning a quantitative experiment result into a behavior story using session replays. | +| [references/lifecycle-handoff.md](references/lifecycle-handoff.md) | interpret | The decide-action call shape, multi-variant ship semantics, special variant constants. | + +## Cross-command policies + +Rules that apply across more than one command. Defined here once so all commands reference the same threshold and don't drift. + +### Guardrail hard-gate (5% relative regression) + +A **5% relative regression on any guardrail blocks ship**, even when the primary wins. The threshold lives here, not in any one command: + +- `design` warns if no guardrails are configured (the gate has nothing to enforce). +- `launch` blocks-or-warns based on guardrail configuration in the readiness check. +- `monitor` uses the threshold to decide when a mid-flight guardrail regression justifies termination. +- `interpret` uses the threshold for the ITERATE vs SHIP decision. + +If a team agrees on a different threshold (3% for high-volume billing, 10% for early experiments), change it here and the commands inherit it. + +### Peek-safety table + +The **peeking trap**: stopping early on a favorable Frequentist peek inflates the false-positive rate because each look at the data is another chance to cross the significance threshold by chance. Sequential testing is built to make peeking safe (the stopping boundaries account for repeated looks); Frequentist testing is not. + +The table below is what's safe to look at mid-flight, and what isn't. Used by `monitor` directly; referenced from `design` (when picking sequential vs frequentist) and `interpret` (when deciding whether a mid-flight peek invalidates a verdict). + +| Signal | Safe to peek mid-flight? | Why | +| --------------------------------- | ------------------------ | ---------------------------------------------------------------------------------------------------------------------- | +| SRM verdict | Yes | Bucketing health is independent of effect size. Detecting SRM early lets you stop before more exposure data is wasted. | +| Sample pace | Yes | A pacing problem is operational, not statistical. Detecting it early gives time to remediate. | +| Guardrail polarity | Yes (with care) | A guardrail regression mid-flight is a real safety signal. Stopping for a guardrail regression is not p-hacking. | +| Primary metric lift (Sequential) | Yes | Sequential testing makes peeking part of the design. The platform's stopping boundaries account for it. | +| Primary metric lift (Frequentist) | **No** | Stopping early on a favorable Frequentist peek is the canonical peeking trap. The false-positive rate inflates fast. | + +The rule users get wrong most often: thinking they can "just check" the primary mid-flight on a Frequentist test "without acting on it." If the look influences any decision — even the decision to wait — it's a peek. + +### Output emoji conventions + +All four commands use the same visual vocabulary so multi-command sessions read consistently: + +- ✅ — pass / ok / nothing to flag +- ⚠️ — warning / attention needed (proceed if user accepts) +- 🛑 — blocker / fail / stop (do not proceed) +- ℹ️ — fyi / informational ## Behaviour rules -1. **Irreversible actions require explicit confirmation.** Creating an experiment (in `design`) and concluding one (in `interpret`) are both irreversible. Show the proposed action, wait for the user to confirm. +1. **Irreversible actions require explicit confirmation.** Creating an experiment (in `design`), launching one (in `launch`), terminating one mid-flight (in `monitor`), and concluding one (in `interpret`) are all irreversible. Show the proposed action, wait for the user to confirm with literal `CONFIRM` for the destructive ones. 2. **If a command can't complete, explain why.** Tell the user what failed and what they can try. Don't fail silently. 3. **Experiment switching.** If the user wants to operate on a different experiment mid-session, ask which one and reset experiment-scoped context. 4. **Project switching.** If the user wants to operate on a different project mid-session, suggest starting a new conversation first. If they insist, resolve the new project and continue with that `project_id`. @@ -128,10 +188,15 @@ If the user is starting a new experiment from scratch (no existing experiment to Apply these rules in order; the first match wins. -1. **Explicit:** user names a phase (`/design`, "set up a new experiment", "interpret experiment X") → use that command. +1. **Explicit:** user names a phase (`/design`, "launch this experiment", "monitor experiment X", "interpret the results") → use that command. 2. **Implicit:** message matches one canonical trigger phrase from the Components table → use that command. -3. **Phase-derived (only when an experiment was resolved in step 2):** `DRAFT` → `design`; `ACTIVE` or `CONCLUDED` → `interpret`. -4. **Ambiguous verbs** ("audit", "check", "review") — apply phase-derived routing if an experiment is in context, otherwise treat as ambiguous and fall through to the menu. +3. **Phase-derived (only when an experiment was resolved in step 2):** + - `DRAFT` + configuration incomplete or user is iterating on it → `design`. + - `DRAFT` + configuration complete and user is ready → `launch`. + - `ACTIVE` and planned end not reached → `monitor`. + - `ACTIVE` and planned end reached, or `CONCLUDED` → `interpret`. + If `DRAFT` doesn't disambiguate between design and launch, ask: "Is the configuration final, or are you still iterating on it?" +4. **Ambiguous verbs** ("audit", "check", "review") — apply phase-derived routing if an experiment is in context, otherwise fall through to the menu. 5. **Otherwise:** show the Command menu, take the user's choice. ## 4. Load and execute the command @@ -140,4 +205,7 @@ If the command file is not already in context, read `commands/[command].md`. Fol ## 5. Complete -Print `✅ Done.` Return to step 3 if the user wants to chain another command (e.g. design → launch externally → interpret in a later session). +Print `✅ Done.` Return to step 3 if the user wants to chain another command. Typical chains: + +- Same session: `design` → `launch` (the user finalized the design and is ready to go live). +- Across sessions: `launch` → (wait 24h+) → `monitor` → (wait to planned end) → `interpret`. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md index d295a61..adafd0f 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md @@ -1,6 +1,6 @@ # Command: design -Design a Mixpanel experiment before launch. A well-designed experiment starts from the hypothesis and works backward: the hypothesis dictates the metrics that test it, the metrics dictate the sample size, the sample size + traffic dictate duration and testing model. **Don't create the experiment until the user explicitly confirms the configuration** — once it's live, mid-flight config changes invalidate the test. +Design a Mixpanel experiment before launch. A well-designed experiment starts from the hypothesis and works backward: the hypothesis dictates the metrics that test it, the metrics dictate the sample size, the sample size + traffic dictate duration and testing model. This command stops at `DRAFT` — the irreversible launch happens in the separate `launch` command. **Don't save the draft until the user explicitly confirms the configuration.** The umbrella `SKILL.md` defines the shared glossary (Variant, Primary/Guardrail/Secondary metric, Direction, Lift, MDE, CUPED, Winsorization, Multiple-testing correction). Phase-specific terms below. @@ -33,9 +33,11 @@ MDE = 4σ / √n The `16` is `(z_{α/2} + z_β)² × 2` rounded. Variance `σ²` depends on metric type: Bernoulli `p(1−p)`; Poisson `≈ mean`; Gaussian computed from data. The full derivation, worked examples, lookup table, and the five remediations for underpowered experiments live in [../references/sizing.md](../references/sizing.md). -### The >5% guardrail hard-gate +### Guardrails (the hard-gate enforced downstream) -A **5% relative regression on any guardrail blocks ship**, even when the primary wins. Guardrails are the trustworthiness backstop; without this rule, a winning primary with a quietly regressing guardrail ships and rolls back two weeks later. Below 5% lives in the noise band of most guardrails; above 5% means the team has traded measurable damage for headline lift. If the user wants to ship past a regressing guardrail, force the conversation — disable the guardrail explicitly and document why. Don't let them silently override. Full rationale in [../references/pitfalls.md](../references/pitfalls.md). +Guardrails are the trustworthiness backstop. Without them, a winning primary with a quietly regressing guardrail ships and rolls back two weeks later. The umbrella owns the regression threshold — see [Cross-command policies in SKILL.md](../SKILL.md#cross-command-policies). This command's job is making sure guardrails exist; the threshold is enforced by `launch`, `monitor`, and `interpret`. + +If the user wants to ship past a regressing guardrail, force the conversation — disable the guardrail explicitly and document why. Don't let them silently override. Full rationale in [../references/pitfalls.md](../references/pitfalls.md). ### Pre-launch pitfall catalogue @@ -83,7 +85,7 @@ Sample-size floor: keep per-variant target above the platform's reliability floo Four choices, each with a default that's right for most users: -- **Testing model** — default Sequential (peeking is safe by design); Frequentist only for small-lift hunts on well-sized tests. +- **Testing model** — default Sequential (peek-safety table in the umbrella's Cross-command policies covers why); Frequentist only for small-lift hunts on well-sized tests. - **End condition** — sample-based for variable traffic; date-based for strong weekly seasonality. - **Confidence level** — default 0.95 (verify in product); 0.99 for irreversible high-stakes ships; 0.90 only when speed beats rigour. - **Multiple-testing correction** — enable when there are ≥2 primaries OR ≥2 non-control variants; default Benjamini-Hochberg, Bonferroni for strict family-wise control. @@ -97,13 +99,15 @@ Decision tree, the peeking-trap explanation, worked compounding-FPR numbers, and When/why each is right and the common misconfigurations are in [../references/advanced-features.md](../references/advanced-features.md). -### 7. Run the pre-launch pitfall check +### 7. Sanity-check the design before saving + +Run the catalogue from [../references/pitfalls.md](../references/pitfalls.md) against the proposed configuration so the user catches design-time problems before they save a `DRAFT`. Surface only what fires; order blockers → warnings → fyi. -Run the catalogue from **Components** against the proposed configuration. Surface only what fires; order blockers → warnings → fyi. Blockers should stop launch (the experiment cannot reach statistical power as configured). Warnings should be explained — name the trade-off, don't just nag. Full catalogue in [../references/pitfalls.md](../references/pitfalls.md). +The full readiness check runs again in the `launch` command before the experiment goes live — this step in `design` is for catching issues now while the configuration is easy to change, not for gating draft creation. -### 8. Confirm with the user, then create +### 8. Confirm and save as DRAFT -Creating the experiment is the irreversible step. Present a compact summary and **wait for explicit confirmation** before invoking the creation action: +Saving the design as a `DRAFT` is reversible (the user can keep iterating, or delete the draft). It is **not** the launch — the experiment doesn't go live until the `launch` command runs. Surface the configuration summary and **wait for explicit confirmation** before creating the draft: ``` *Experiment Setup Summary* @@ -120,15 +124,19 @@ Creating the experiment is the irreversible step. Present a compact summary and • *Expected duration on current traffic:* days • *Achievable MDE on current traffic:* % relative -*Pitfall check:* +*Design-time pitfall check:* ✅ Insufficient duration — adequate ✅ Cohort too small — adequate ⚠️ Missing guardrails — no guardrail metrics configured; >5% hard-gate cannot protect this ship ``` -Use the exact catalogue labels from [../references/pitfalls.md](../references/pitfalls.md) so the agent's pitfall messages stay consistent across sessions. +Use the exact catalogue labels from [../references/pitfalls.md](../references/pitfalls.md) so the agent's pitfall messages stay consistent across the design and launch commands. + +After saving the draft, link it back to any prior experiment surfaced in step 1 — record the prior's ID, hypothesis, and outcome in the new experiment's description. That 30-second annotation pays back tenfold at interpretation time. -After creating, link the new experiment back to any prior experiment surfaced in step 1 — record the prior's ID, hypothesis, and outcome in the new experiment's description. That 30-second annotation pays back tenfold at interpretation time. +### 9. Hand off to launch + +`design` stops at `DRAFT`. When the user is ready to go live, route to the `launch` command in this skill, which runs the final readiness check and performs the irreversible launch. If the user hasn't named a specific feature or surface, ask before fetching baselines or designing — designing the wrong experiment burns more time than the clarifying question costs. @@ -157,5 +165,3 @@ If the user hasn't named a specific feature or surface, ask before fetching base - When underpowered, say so plainly and list remediations in order of cost. - Don't moralise about peeking — switch them to sequential. - Guardrail regressions are hard gates, not "slight concerns." - -When the experiment is created and live, hand off to the `interpret` command in this same skill once exposures are flowing. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/launch.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/launch.md new file mode 100644 index 0000000..4f8aecb --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/launch.md @@ -0,0 +1,126 @@ +# Command: launch + +Launch a designed Mixpanel experiment. This is the irreversible transition from `DRAFT` to `ACTIVE` — once exposures start, variants are locked, the statistical model is fixed, and mid-flight configuration changes invalidate the test. This command exists to give that transition a deliberate seam. + +The umbrella `SKILL.md` defines the shared glossary. Phase-specific terms below. + +--- + +## Glossary (launch-specific) + +- **Pre-launch pitfall check.** The deterministic configuration validation that runs against the designed experiment before launch. Categorized as blockers (stop launch), warnings (explain trade-off, proceed if user accepts), fyi. +- **Allocation lock.** The moment the variant percentages stop being editable. After launch, the only safe per-variant change is the post-conclude ship action. +- **Cohort lock.** The moment the targeting cohort stops being editable. Changing the cohort mid-flight changes _who_ is being measured, which silently invalidates the comparison. + +--- + +## Components (launch-specific) + +### The irreversibility rule + +A launched experiment cannot be "un-launched" without losing the exposure data accumulated to that point. The only operations the post-launch state supports are: + +- **Monitor** mid-flight (the `monitor` command in this skill). +- **Conclude** at the end (the `interpret` command's decide action). +- **Pause / resume** via the underlying feature flag (handled by the `manage-feature-flags` skill) — rarely the right move; usually masks a design problem. + +There is no "edit the variants" or "change the statistical model" operation post-launch that preserves the result's validity. Surface this constraint to the user before launching — if there's any ambiguity about whether the configuration is final, send them back to `design`. + +### Launch readiness checklist + +Run this against the experiment about to launch. Surface only what fires; order blockers → warnings → fyi. + +| Severity | Check | +| -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Blocker | Pre-launch pitfall catalogue (insufficient duration, cohort too small) reports a blocker — see [../references/pitfalls.md](../references/pitfalls.md). | +| Blocker | The experiment has no primary metric. | +| Blocker | The configured allocation doesn't sum to 100% across variants. | +| Warning | The pre-launch pitfall catalogue reports a warning. | +| Warning | No guardrail metrics configured. Without guardrails, the regression hard-gate (see umbrella Cross-command policies) cannot protect the ship decision. | +| Warning | A primary metric has `direction` unset (defaults to `up`); cancel / error / latency / abandon / refund metrics need `down` set explicitly. | +| FYI | The experiment isn't linked back to a prior experiment on the same feature, even though prior experiments exist. Recommend adding the link before launch. | + +The pitfall catalogue itself lives in [../references/pitfalls.md](../references/pitfalls.md) — don't duplicate the rules here; run them and report results. + +--- + +## Steps + +Top-down: what to do, in order. + +### 1. Confirm the experiment is ready + +The umbrella resolves the experiment in its step 2. Verify it's in `DRAFT` state. If it's already `ACTIVE` or `CONCLUDED`, this command is the wrong one — route to `monitor` or `interpret`. + +### 2. Run the launch readiness checklist + +Apply the catalogue from Components against the current experiment configuration. Surface results in this order: + +``` +*Launch Readiness — [Experiment Name]* + +🛑 Blockers (must fix before launch) + • [blocker description] +⚠️ Warnings (recommend addressing) + • [warning description] +ℹ️ FYI + • [fyi description] +``` + +If any blockers fire, **stop**. Tell the user what to fix and route them back to `design` to update the configuration. Don't offer to launch past a blocker. + +If warnings fire, name each trade-off explicitly. Don't just list them — explain what risk the user is accepting by launching anyway. + +If only FYIs fire (or nothing fires), proceed to step 3. + +### 3. Present the launch confirmation + +Surface the launch summary and **wait for explicit confirmation** before invoking the launch action. The summary should match what the user saw at the end of `design`, with any post-design edits reflected: + +``` +*Launch Summary — [Experiment Name]* + +• *Hypothesis:* If , then will by ≥, because . +• *Primary metrics:* (direction), … +• *Guardrails:* (direction), … +• *Variants:* control % / treatment % (or as configured) +• *Statistical model:* sequential | frequentist +• *End condition:* sample-based (per-arm ) | date-based ( days) +• *Confidence level:* +• *Multiple testing correction:* benjamini-hochberg | bonferroni | off +• *Advanced features:* CUPED on/off · Winsorization on/off (percentile

) +• *Expected duration on current traffic:* days + +*After launch:* + • Variants are locked. + • Statistical model is locked. + • Cohort targeting is locked. + • The only safe operations are monitor (mid-flight) and conclude (at the end). + +Reply CONFIRM to launch. Anything else cancels. +``` + +The literal `CONFIRM` requirement matches the irreversibility-confirmation discipline in `manage-lexicon` and `manage-feature-flags`. + +### 4. Launch + +On `CONFIRM`, invoke the launch action. If the launch fails, surface the platform's error verbatim — don't paraphrase. The user needs to know whether the failure is a transient platform issue (retry) or a configuration issue (back to `design`). + +### 5. Hand off to monitor + +After a successful launch, recommend the user check back in 24h via the `monitor` command. Surface two things they should set up as follow-ups (don't interrupt the launch flow to do them inline): + +1. **A tracking dashboard** for the primary and guardrail metrics — gives the user a single place to watch the experiment without re-opening the skill every time. Recommend running `create-dashboard` in a follow-up session. +2. **A calendar reminder** for the canary check (24h) and the mid-flight check (~halfway through the planned duration). Concrete phrasing the agent can use: _"Worth scheduling two check-ins: 24h from now for the canary, and at the midpoint of your N-day window for the mid-flight read. Both run through the `monitor` command."_ + +Print `✅ Launched.` and return control to the umbrella. + +--- + +## Output style + +- Lead with the readiness verdict — pass, warnings, or blockers — before showing the summary. +- For blockers, name the specific configuration field the user needs to change. +- For warnings, name the trade-off, not just the rule. +- Don't moralise about launching with warnings — surface them, get explicit acceptance, proceed. +- Don't launch without `CONFIRM`. Treat any other response (including "yes", "ok", "sure") as cancel-and-clarify. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/monitor.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/monitor.md new file mode 100644 index 0000000..f27fdd0 --- /dev/null +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/monitor.md @@ -0,0 +1,106 @@ +# Command: monitor + +Mid-flight safety checks on a running experiment. This command answers **"is it safe to keep this experiment running?"** — distinct from `interpret`, which answers **"did the experiment work?"** Monitor is for the middle of the experiment, before there's enough signal to interpret. Peek only at what's safe to peek at; surface anything that warrants pause or termination. + +The umbrella `SKILL.md` defines the shared glossary. Phase-specific terms below. + +--- + +## Glossary (monitor-specific) + +- **Sample pace.** The ratio of actual exposures accumulated to expected exposures at this point in the experiment's planned duration. A pace below 0.7 (≥30% slower than projected) suggests the experiment is underpowered relative to its design, or that something is wrong with exposure tracking. +- **Mid-flight SRM.** A Sample Ratio Mismatch detected during the experiment, before exposures are mature. Distinct from the SRM check at interpretation time — mid-flight SRM is a bucketing-bug early-warning, not a verdict on the result. + +The **peeking trap** and the **peek-safety table** (what's safe to look at mid-flight, what isn't) live in the umbrella's [Cross-command policies](../SKILL.md#cross-command-policies) — this command applies them, doesn't re-derive them. + +--- + +## Components (monitor-specific) + +For the **peek-safety table** (what's safe to look at mid-flight, what isn't), see the umbrella's [Cross-command policies](../SKILL.md#cross-command-policies). For the **guardrail hard-gate threshold**, same place. + +### Terminate-early decision rules + +Three situations that justify ending a running experiment before its planned end: + +1. **Trustworthiness failure.** SRM fails mid-flight, or a misconfiguration is discovered that invalidates the design. Terminate, fix, restart. The accumulated exposures are not salvageable. +2. **Guardrail regression beyond the hard-gate threshold** (defined in the umbrella). The guardrail regresses by more than the threshold, with a tight CI. Continuing exposes more users to a measurable harm. Terminate and route to `interpret` for the ship/iterate verdict. +3. **Sequential stopping boundary crossed (Sequential tests only).** The platform's sequential boundary fires. This is the by-design early stop — terminate and route to `interpret`. + +What does **not** justify early termination: + +- "It's been a week and the lift looks good" (peeking trap on Frequentist — see the umbrella's peek-safety table). +- "The team is tired of waiting" (sunk cost mid-flight is real but not a statistical reason). +- "Looks like it'll be inconclusive" (futility analysis exists but requires the right test design; don't improvise). +- A guardrail wobbling within its noise band. + +--- + +## Steps + +Top-down: what to do, in order. + +### 1. Confirm the experiment is in scope for monitor + +The umbrella resolves the experiment in its step 2. Verify it's in `ACTIVE` state. If it's `DRAFT`, route to `design` or `launch`. If it's `CONCLUDED`, route to `interpret`. + +### 2. Read the safe signals + +Fetch the experiment with exposure data included. Report: + +- **Current state:** `ACTIVE` since [date], [N] days into a planned [N-or-date] window. +- **Sample pace:** actual vs expected exposures at this point. Flag if pace < 0.7. +- **Mid-flight SRM verdict** (from the platform). Flag if `FAIL`. +- **Guardrail summary** (polarity-corrected for each guardrail). Flag any with significant regressions. +- **Statistical model:** Sequential vs Frequentist. If Sequential, report whether the stopping boundary has been crossed. + +### 3. Apply the don't-peek rule for primaries + +If the experiment is **Frequentist** and the user asks about primary-metric results, push back politely: + +> "This experiment is configured as Frequentist, which means peeking at the primary mid-flight inflates the false-positive rate even if you're just curious. Mid-flight, the safe signals are SRM, sample pace, and guardrails — happy to walk through those. The primary verdict is meaningful once the planned [N-day | N-sample] target is reached." + +If the experiment is **Sequential**, peeking is by design. Report primary status and whether the sequential boundary fired. + +### 4. Decide: keep running, pause, or terminate + +Apply the **terminate-early decision rules** from Components. + +- **Trustworthiness failure or guardrail regression beyond tolerance** → recommend terminate, name the specific signal that fired, route to `interpret` for the formal ship/kill/iterate call. +- **Sequential stopping boundary crossed** → recommend terminate, route to `interpret`. +- **Pace problem (slow sample accrual)** → recommend extending the planned duration if possible, or accepting a coarser MDE. Don't terminate for pace alone; pace means the design is wrong, not the experiment. +- **Everything within tolerance** → recommend continue, give the user a checkback recommendation (next milestone: mid-flight, 80% sample, or planned end). + +For the SRM-failure remediation playbook, see [../references/health-check-interpretation.md](../references/health-check-interpretation.md). For the underpowered-experiment remediation playbook, see [../references/sizing.md](../references/sizing.md). + +### 5. Output + +Default to this shape: + +``` +*Mid-flight Status — [Experiment Name]* + +*Safe to keep running:* YES | NO (with reason) | YES, but with caveats + +*Signals:* + • SRM: ✅ PASS | 🛑 FAIL + • Sample pace: ✅ on track ([N]% of expected) | ⚠️ slow ([N]% of expected) + • Guardrails: ✅ flat | ⚠️ regressed [X]% + • Sequential boundary (if applicable): not crossed | CROSSED + +*Recommendation:* continue | terminate (reason) | extend duration + +*Next checkback:* [milestone — date or sample count] +``` + +If the user requested a primary-metric peek on a Frequentist test, lead with the don't-peek explanation before any other signals. + +--- + +## Output style + +- Lead with the verdict — safe to continue or not. +- Name the specific signal driving any "terminate" recommendation, never just "looks bad." +- Don't surface primary-metric lift on Frequentist tests, even if asked. Explain why, then offer the safe signals. +- Don't moralise about peeking — explain the math once, then route the user to safe signals. +- Treat "the team wants to ship now" as a separate conversation from "is the experiment ready to ship" — don't conflate. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md index aa1a854..c977181 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md @@ -1,29 +1,38 @@ --- name: manage-experiment description: > - Coach the user through any phase of a Mixpanel experiment — design before - launch (hypothesis framing, metric selection, sizing, statistical model - choice, advanced statistical features like CUPED / Winsorization / - Bonferroni / Benjamini-Hochberg, pre-launch pitfall checks) and interpret - after launch (read results, decide ship / iterate / kill / wait, interpret - health checks like SRM and Retro A/A, break results down by segment, use - session replays to explain a result). Use when the user mentions + Coach the user through any phase of a Mixpanel experiment — design (hypothesis + framing, metric selection, sizing, statistical model choice, advanced features + like CUPED / Winsorization / Bonferroni / Benjamini-Hochberg), launch (final + pre-launch readiness check and the irreversible launch action), monitor + (mid-flight safety: SRM, sample pace, guardrail-only peek, the don't-peek-on- + Frequentist rule, terminate-early decisions), and interpret after the + experiment is mature (read results, decide ship / iterate / kill / wait, + interpret health checks like SRM and Retro A/A, break results down by segment, + use session replays to explain a result). Use when the user mentions experiment, A/B test, ship/kill decision, MDE, minimum detectable effect, - sample ratio mismatch, CUPED, sizing, statistical significance, lift, or - any phrasing like "set up an experiment", "design an A/B test", "how did - experiment X do", "should we ship", "why isn't this significant yet", - "should this be sequential or fixed-horizon", "what's my MDE", "is this - experiment configured correctly", "audit my experiment". Do NOT use for - plain feature-flag rollouts with no measurement criterion — that belongs - to the `manage-feature-flags` skill. + sample ratio mismatch, CUPED, sizing, statistical significance, lift, or any + phrasing like "set up an experiment", "design an A/B test", "launch this + experiment", "is it safe to keep this running", "is my experiment SRM-ing", + "should I peek", "how did experiment X do", "should we ship", "why isn't this + significant yet", "should this be sequential or fixed-horizon", "what's my + MDE", "is this experiment configured correctly", "audit my experiment". Do + NOT use for plain feature-flag rollouts with no measurement criterion — that + belongs to the `manage-feature-flags` skill. license: Apache-2.0 --- # Manage Experiment -This skill manages a Mixpanel experiment across its lifecycle — **designing** before launch and **interpreting** after launch. Two commands sit under the umbrella; pick by experiment phase: design when the experiment doesn't exist yet, interpret once exposures are flowing. +This skill manages a Mixpanel experiment across its full lifecycle — **design**, **launch**, **monitor**, **interpret**. Four commands sit under the umbrella; pick by experiment phase. -The skill runs as a single interactive session per experiment. The two commands compose naturally — designing produces a configuration that interpreting later consumes — but they're rarely invoked in the same session (the gap is days to weeks). +The four commands map cleanly to experiment states: + +- `DRAFT` (experiment doesn't exist or hasn't launched) → `design` or `launch`. +- `ACTIVE` (mid-flight, exposures accumulating) → `monitor`. +- `ACTIVE` (reached planned end) or `CONCLUDED` → `interpret`. + +The skill runs as a single interactive session per experiment. Commands compose naturally across phases — designing produces a configuration that launching commits, monitoring watches for safety issues mid-flight, interpreting consumes the matured result — but they're rarely invoked in the same session (the lifecycle spans days to weeks). --- @@ -37,10 +46,19 @@ Each command lives in its own file under `commands/` and is loaded on demand. Ma | Command | File | Match if message contains any of | | ----------- | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- | -| `design` | `commands/design.md` | design, set up, configure, plan, sanity-check, pre-launch, MDE, sizing, hypothesis, sequential vs frequentist, CUPED, Winsorization | +| `design` | `commands/design.md` | design, set up, configure, plan, sanity-check, hypothesis, MDE, sizing, sequential vs frequentist, CUPED, Winsorization | +| `launch` | `commands/launch.md` | launch, go live, start the experiment, ready to ship the experiment, pre-launch check, launch readiness | +| `monitor` | `commands/monitor.md` | monitor, mid-flight, is it safe, should I peek, SRM mid-flight, sample pace, guardrail wobble, terminate early | | `interpret` | `commands/interpret.md` | read results, ship, iterate, kill, wait, statsig, SRM, sample ratio mismatch, retro A/A, lift, polarity, segment breakdown, session replays | -If a message could route to either (e.g. "audit my experiment", "check on experiment X"), use the **phase-derived** rule: experiment in `DRAFT` → `design`; experiment in `ACTIVE` or `CONCLUDED` → `interpret`. If the experiment state is unknown, ask the user. +If a message could route to more than one, use the **phase-derived** rule based on experiment state: + +- `DRAFT`, configuration incomplete → `design`. +- `DRAFT`, configuration complete and user is ready to go → `launch`. +- `ACTIVE`, mid-flight (planned end not reached) → `monitor`. +- `ACTIVE`, reached planned end, or `CONCLUDED` → `interpret`. + +If the experiment state is unknown or doesn't disambiguate (e.g. `DRAFT` could be either design or launch), ask the user. ## Command menu @@ -50,15 +68,17 @@ Shown when no command was detected or inferred. ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Manage Experiment — [Project Name] ([project_id]) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ - 1. Design — Hypothesis, metrics, sizing, model, pre-launch checks - 2. Interpret — Read results, ship / iterate / kill / wait - 3. Exit + 1. Design — Hypothesis, metrics, sizing, statistical model, advanced features + 2. Launch — Pre-launch readiness check, then launch (irreversible) + 3. Monitor — Mid-flight safety: SRM, sample pace, guardrails, peeking discipline + 4. Interpret — Read results, ship / iterate / kill / wait + 5. Exit ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ``` ## Shared glossary -Terms both commands use without redefining. Phase-specific terms (hypothesis, polarity, SRM, etc.) live in their command files. +Terms all four commands use without redefining. Phase-specific terms (hypothesis, polarity, SRM, peeking trap, etc.) live in their command files. - **Variant.** One arm of the experiment. The variant treated as the baseline is the **control**; the others are **treatments**. The platform marks which key is the control. - **Primary / Guardrail / Secondary metric.** @@ -76,27 +96,67 @@ Terms both commands use without redefining. Phase-specific terms (hypothesis, po Each command file links into these on demand. The map is here so the skill has a single index of what `references/` contains. -| File | Used by | Purpose | -| ------------------------------------------------------------------------------------------------ | ------------------ | ----------------------------------------------------------------------------------------------------------------------------- | -| [references/routing-xp-vs-ff.md](references/routing-xp-vs-ff.md) | design | Experiment vs Feature Flag disambiguation — when each is the right tool and the hand-off rules. | -| [references/hypothesis-framing.md](references/hypothesis-framing.md) | design | The four properties of a good hypothesis, rubric, common misalignment patterns, worked good/bad examples. | -| [references/metric-selection.md](references/metric-selection.md) | design | Picking primaries, guardrails, and secondaries. Guardrails-by-domain table. Lagging-indicator and changed-denominator traps. | -| [references/sizing.md](references/sizing.md) | design + interpret | Sample-size and MDE formulas, Kohavi's inversion, baseline-by-rate lookup, the five remediations for underpowered tests. | -| [references/statistical-model.md](references/statistical-model.md) | design | Sequential vs frequentist, end-condition choice, confidence level, multiple-testing correction. Peeking-trap math. | -| [references/advanced-features.md](references/advanced-features.md) | design | When CUPED and Winsorization help, when each is wrong, and the common misconfigurations. | -| [references/prior-experiments.md](references/prior-experiments.md) | design | How to look up and fold-in prior experiments on the same feature. | -| [references/pitfalls.md](references/pitfalls.md) | design | The pre-launch pitfall catalogue: blockers (stop launch), warnings (explain trade-off), fyi. | -| [references/health-check-interpretation.md](references/health-check-interpretation.md) | interpret | Reading SRM, Retro A/A, exposures-sufficient, and misconfiguration verdicts. The trustworthiness gate's remediation playbook. | -| [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | interpret | Translating a single metric's lift / CI / p-value into a plain-language verdict, with the Twyman's Law guard. | -| [references/why-no-statsig.md](references/why-no-statsig.md) | interpret | Wait / extend / boost power / narrow / accept-null decision tree when nothing's significant. | -| [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | interpret | How to pick the 3–5 segments worth breaking results down on, before slicing every dimension. | -| [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | interpret | Reading per-segment results: heterogeneity vs Simpson's paradox vs noise; the "ship to segment X" requirements. | -| [references/session-replay-analysis.md](references/session-replay-analysis.md) | interpret | Turning a quantitative experiment result into a behavior story using session replays. | -| [references/lifecycle-handoff.md](references/lifecycle-handoff.md) | interpret | The decide-action call shape, multi-variant ship semantics, special variant constants. | +| File | Used by | Purpose | +| ------------------------------------------------------------------------------------------------ | ---------------------------- | ----------------------------------------------------------------------------------------------------------------------------- | +| [references/routing-xp-vs-ff.md](references/routing-xp-vs-ff.md) | design | Experiment vs Feature Flag disambiguation — when each is the right tool and the hand-off rules. | +| [references/hypothesis-framing.md](references/hypothesis-framing.md) | design | The four properties of a good hypothesis, rubric, common misalignment patterns, worked good/bad examples. | +| [references/metric-selection.md](references/metric-selection.md) | design | Picking primaries, guardrails, and secondaries. Guardrails-by-domain table. Lagging-indicator and changed-denominator traps. | +| [references/sizing.md](references/sizing.md) | design + monitor + interpret | Sample-size and MDE formulas, Kohavi's inversion, baseline-by-rate lookup, the five remediations for underpowered tests. | +| [references/statistical-model.md](references/statistical-model.md) | design | Sequential vs frequentist, end-condition choice, confidence level, multiple-testing correction. Peeking-trap math. | +| [references/advanced-features.md](references/advanced-features.md) | design | When CUPED and Winsorization help, when each is wrong, and the common misconfigurations. | +| [references/prior-experiments.md](references/prior-experiments.md) | design | How to look up and fold-in prior experiments on the same feature. | +| [references/pitfalls.md](references/pitfalls.md) | design + launch | The pre-launch pitfall catalogue: blockers (stop launch), warnings (explain trade-off), fyi. | +| [references/health-check-interpretation.md](references/health-check-interpretation.md) | monitor + interpret | Reading SRM, Retro A/A, exposures-sufficient, and misconfiguration verdicts. The trustworthiness gate's remediation playbook. | +| [references/per-metric-interpretation.md](references/per-metric-interpretation.md) | interpret | Translating a single metric's lift / CI / p-value into a plain-language verdict, with the Twyman's Law guard. | +| [references/why-no-statsig.md](references/why-no-statsig.md) | interpret | Wait / extend / boost power / narrow / accept-null decision tree when nothing's significant. | +| [references/segment-of-interest-selection.md](references/segment-of-interest-selection.md) | interpret | How to pick the 3–5 segments worth breaking results down on, before slicing every dimension. | +| [references/segment-breakdown-interpretation.md](references/segment-breakdown-interpretation.md) | interpret | Reading per-segment results: heterogeneity vs Simpson's paradox vs noise; the "ship to segment X" requirements. | +| [references/session-replay-analysis.md](references/session-replay-analysis.md) | interpret | Turning a quantitative experiment result into a behavior story using session replays. | +| [references/lifecycle-handoff.md](references/lifecycle-handoff.md) | interpret | The decide-action call shape, multi-variant ship semantics, special variant constants. | + +## Cross-command policies + +Rules that apply across more than one command. Defined here once so all commands reference the same threshold and don't drift. + +### Guardrail hard-gate (5% relative regression) + +A **5% relative regression on any guardrail blocks ship**, even when the primary wins. The threshold lives here, not in any one command: + +- `design` warns if no guardrails are configured (the gate has nothing to enforce). +- `launch` blocks-or-warns based on guardrail configuration in the readiness check. +- `monitor` uses the threshold to decide when a mid-flight guardrail regression justifies termination. +- `interpret` uses the threshold for the ITERATE vs SHIP decision. + +If a team agrees on a different threshold (3% for high-volume billing, 10% for early experiments), change it here and the commands inherit it. + +### Peek-safety table + +The **peeking trap**: stopping early on a favorable Frequentist peek inflates the false-positive rate because each look at the data is another chance to cross the significance threshold by chance. Sequential testing is built to make peeking safe (the stopping boundaries account for repeated looks); Frequentist testing is not. + +The table below is what's safe to look at mid-flight, and what isn't. Used by `monitor` directly; referenced from `design` (when picking sequential vs frequentist) and `interpret` (when deciding whether a mid-flight peek invalidates a verdict). + +| Signal | Safe to peek mid-flight? | Why | +| --------------------------------- | ------------------------ | ---------------------------------------------------------------------------------------------------------------------- | +| SRM verdict | Yes | Bucketing health is independent of effect size. Detecting SRM early lets you stop before more exposure data is wasted. | +| Sample pace | Yes | A pacing problem is operational, not statistical. Detecting it early gives time to remediate. | +| Guardrail polarity | Yes (with care) | A guardrail regression mid-flight is a real safety signal. Stopping for a guardrail regression is not p-hacking. | +| Primary metric lift (Sequential) | Yes | Sequential testing makes peeking part of the design. The platform's stopping boundaries account for it. | +| Primary metric lift (Frequentist) | **No** | Stopping early on a favorable Frequentist peek is the canonical peeking trap. The false-positive rate inflates fast. | + +The rule users get wrong most often: thinking they can "just check" the primary mid-flight on a Frequentist test "without acting on it." If the look influences any decision — even the decision to wait — it's a peek. + +### Output emoji conventions + +All four commands use the same visual vocabulary so multi-command sessions read consistently: + +- ✅ — pass / ok / nothing to flag +- ⚠️ — warning / attention needed (proceed if user accepts) +- 🛑 — blocker / fail / stop (do not proceed) +- ℹ️ — fyi / informational ## Behaviour rules -1. **Irreversible actions require explicit confirmation.** Creating an experiment (in `design`) and concluding one (in `interpret`) are both irreversible. Show the proposed action, wait for the user to confirm. +1. **Irreversible actions require explicit confirmation.** Creating an experiment (in `design`), launching one (in `launch`), terminating one mid-flight (in `monitor`), and concluding one (in `interpret`) are all irreversible. Show the proposed action, wait for the user to confirm with literal `CONFIRM` for the destructive ones. 2. **If a command can't complete, explain why.** Tell the user what failed and what they can try. Don't fail silently. 3. **Experiment switching.** If the user wants to operate on a different experiment mid-session, ask which one and reset experiment-scoped context. 4. **Project switching.** If the user wants to operate on a different project mid-session, suggest starting a new conversation first. If they insist, resolve the new project and continue with that `project_id`. @@ -128,10 +188,15 @@ If the user is starting a new experiment from scratch (no existing experiment to Apply these rules in order; the first match wins. -1. **Explicit:** user names a phase (`/design`, "set up a new experiment", "interpret experiment X") → use that command. +1. **Explicit:** user names a phase (`/design`, "launch this experiment", "monitor experiment X", "interpret the results") → use that command. 2. **Implicit:** message matches one canonical trigger phrase from the Components table → use that command. -3. **Phase-derived (only when an experiment was resolved in step 2):** `DRAFT` → `design`; `ACTIVE` or `CONCLUDED` → `interpret`. -4. **Ambiguous verbs** ("audit", "check", "review") — apply phase-derived routing if an experiment is in context, otherwise treat as ambiguous and fall through to the menu. +3. **Phase-derived (only when an experiment was resolved in step 2):** + - `DRAFT` + configuration incomplete or user is iterating on it → `design`. + - `DRAFT` + configuration complete and user is ready → `launch`. + - `ACTIVE` and planned end not reached → `monitor`. + - `ACTIVE` and planned end reached, or `CONCLUDED` → `interpret`. + If `DRAFT` doesn't disambiguate between design and launch, ask: "Is the configuration final, or are you still iterating on it?" +4. **Ambiguous verbs** ("audit", "check", "review") — apply phase-derived routing if an experiment is in context, otherwise fall through to the menu. 5. **Otherwise:** show the Command menu, take the user's choice. ## 4. Load and execute the command @@ -140,4 +205,7 @@ If the command file is not already in context, read `commands/[command].md`. Fol ## 5. Complete -Print `✅ Done.` Return to step 3 if the user wants to chain another command (e.g. design → launch externally → interpret in a later session). +Print `✅ Done.` Return to step 3 if the user wants to chain another command. Typical chains: + +- Same session: `design` → `launch` (the user finalized the design and is ready to go live). +- Across sessions: `launch` → (wait 24h+) → `monitor` → (wait to planned end) → `interpret`. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md index d295a61..adafd0f 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md @@ -1,6 +1,6 @@ # Command: design -Design a Mixpanel experiment before launch. A well-designed experiment starts from the hypothesis and works backward: the hypothesis dictates the metrics that test it, the metrics dictate the sample size, the sample size + traffic dictate duration and testing model. **Don't create the experiment until the user explicitly confirms the configuration** — once it's live, mid-flight config changes invalidate the test. +Design a Mixpanel experiment before launch. A well-designed experiment starts from the hypothesis and works backward: the hypothesis dictates the metrics that test it, the metrics dictate the sample size, the sample size + traffic dictate duration and testing model. This command stops at `DRAFT` — the irreversible launch happens in the separate `launch` command. **Don't save the draft until the user explicitly confirms the configuration.** The umbrella `SKILL.md` defines the shared glossary (Variant, Primary/Guardrail/Secondary metric, Direction, Lift, MDE, CUPED, Winsorization, Multiple-testing correction). Phase-specific terms below. @@ -33,9 +33,11 @@ MDE = 4σ / √n The `16` is `(z_{α/2} + z_β)² × 2` rounded. Variance `σ²` depends on metric type: Bernoulli `p(1−p)`; Poisson `≈ mean`; Gaussian computed from data. The full derivation, worked examples, lookup table, and the five remediations for underpowered experiments live in [../references/sizing.md](../references/sizing.md). -### The >5% guardrail hard-gate +### Guardrails (the hard-gate enforced downstream) -A **5% relative regression on any guardrail blocks ship**, even when the primary wins. Guardrails are the trustworthiness backstop; without this rule, a winning primary with a quietly regressing guardrail ships and rolls back two weeks later. Below 5% lives in the noise band of most guardrails; above 5% means the team has traded measurable damage for headline lift. If the user wants to ship past a regressing guardrail, force the conversation — disable the guardrail explicitly and document why. Don't let them silently override. Full rationale in [../references/pitfalls.md](../references/pitfalls.md). +Guardrails are the trustworthiness backstop. Without them, a winning primary with a quietly regressing guardrail ships and rolls back two weeks later. The umbrella owns the regression threshold — see [Cross-command policies in SKILL.md](../SKILL.md#cross-command-policies). This command's job is making sure guardrails exist; the threshold is enforced by `launch`, `monitor`, and `interpret`. + +If the user wants to ship past a regressing guardrail, force the conversation — disable the guardrail explicitly and document why. Don't let them silently override. Full rationale in [../references/pitfalls.md](../references/pitfalls.md). ### Pre-launch pitfall catalogue @@ -83,7 +85,7 @@ Sample-size floor: keep per-variant target above the platform's reliability floo Four choices, each with a default that's right for most users: -- **Testing model** — default Sequential (peeking is safe by design); Frequentist only for small-lift hunts on well-sized tests. +- **Testing model** — default Sequential (peek-safety table in the umbrella's Cross-command policies covers why); Frequentist only for small-lift hunts on well-sized tests. - **End condition** — sample-based for variable traffic; date-based for strong weekly seasonality. - **Confidence level** — default 0.95 (verify in product); 0.99 for irreversible high-stakes ships; 0.90 only when speed beats rigour. - **Multiple-testing correction** — enable when there are ≥2 primaries OR ≥2 non-control variants; default Benjamini-Hochberg, Bonferroni for strict family-wise control. @@ -97,13 +99,15 @@ Decision tree, the peeking-trap explanation, worked compounding-FPR numbers, and When/why each is right and the common misconfigurations are in [../references/advanced-features.md](../references/advanced-features.md). -### 7. Run the pre-launch pitfall check +### 7. Sanity-check the design before saving + +Run the catalogue from [../references/pitfalls.md](../references/pitfalls.md) against the proposed configuration so the user catches design-time problems before they save a `DRAFT`. Surface only what fires; order blockers → warnings → fyi. -Run the catalogue from **Components** against the proposed configuration. Surface only what fires; order blockers → warnings → fyi. Blockers should stop launch (the experiment cannot reach statistical power as configured). Warnings should be explained — name the trade-off, don't just nag. Full catalogue in [../references/pitfalls.md](../references/pitfalls.md). +The full readiness check runs again in the `launch` command before the experiment goes live — this step in `design` is for catching issues now while the configuration is easy to change, not for gating draft creation. -### 8. Confirm with the user, then create +### 8. Confirm and save as DRAFT -Creating the experiment is the irreversible step. Present a compact summary and **wait for explicit confirmation** before invoking the creation action: +Saving the design as a `DRAFT` is reversible (the user can keep iterating, or delete the draft). It is **not** the launch — the experiment doesn't go live until the `launch` command runs. Surface the configuration summary and **wait for explicit confirmation** before creating the draft: ``` *Experiment Setup Summary* @@ -120,15 +124,19 @@ Creating the experiment is the irreversible step. Present a compact summary and • *Expected duration on current traffic:* days • *Achievable MDE on current traffic:* % relative -*Pitfall check:* +*Design-time pitfall check:* ✅ Insufficient duration — adequate ✅ Cohort too small — adequate ⚠️ Missing guardrails — no guardrail metrics configured; >5% hard-gate cannot protect this ship ``` -Use the exact catalogue labels from [../references/pitfalls.md](../references/pitfalls.md) so the agent's pitfall messages stay consistent across sessions. +Use the exact catalogue labels from [../references/pitfalls.md](../references/pitfalls.md) so the agent's pitfall messages stay consistent across the design and launch commands. + +After saving the draft, link it back to any prior experiment surfaced in step 1 — record the prior's ID, hypothesis, and outcome in the new experiment's description. That 30-second annotation pays back tenfold at interpretation time. -After creating, link the new experiment back to any prior experiment surfaced in step 1 — record the prior's ID, hypothesis, and outcome in the new experiment's description. That 30-second annotation pays back tenfold at interpretation time. +### 9. Hand off to launch + +`design` stops at `DRAFT`. When the user is ready to go live, route to the `launch` command in this skill, which runs the final readiness check and performs the irreversible launch. If the user hasn't named a specific feature or surface, ask before fetching baselines or designing — designing the wrong experiment burns more time than the clarifying question costs. @@ -157,5 +165,3 @@ If the user hasn't named a specific feature or surface, ask before fetching base - When underpowered, say so plainly and list remediations in order of cost. - Don't moralise about peeking — switch them to sequential. - Guardrail regressions are hard gates, not "slight concerns." - -When the experiment is created and live, hand off to the `interpret` command in this same skill once exposures are flowing. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/commands/launch.md b/plugins/mixpanel-mcp/skills/manage-experiment/commands/launch.md new file mode 100644 index 0000000..4f8aecb --- /dev/null +++ b/plugins/mixpanel-mcp/skills/manage-experiment/commands/launch.md @@ -0,0 +1,126 @@ +# Command: launch + +Launch a designed Mixpanel experiment. This is the irreversible transition from `DRAFT` to `ACTIVE` — once exposures start, variants are locked, the statistical model is fixed, and mid-flight configuration changes invalidate the test. This command exists to give that transition a deliberate seam. + +The umbrella `SKILL.md` defines the shared glossary. Phase-specific terms below. + +--- + +## Glossary (launch-specific) + +- **Pre-launch pitfall check.** The deterministic configuration validation that runs against the designed experiment before launch. Categorized as blockers (stop launch), warnings (explain trade-off, proceed if user accepts), fyi. +- **Allocation lock.** The moment the variant percentages stop being editable. After launch, the only safe per-variant change is the post-conclude ship action. +- **Cohort lock.** The moment the targeting cohort stops being editable. Changing the cohort mid-flight changes _who_ is being measured, which silently invalidates the comparison. + +--- + +## Components (launch-specific) + +### The irreversibility rule + +A launched experiment cannot be "un-launched" without losing the exposure data accumulated to that point. The only operations the post-launch state supports are: + +- **Monitor** mid-flight (the `monitor` command in this skill). +- **Conclude** at the end (the `interpret` command's decide action). +- **Pause / resume** via the underlying feature flag (handled by the `manage-feature-flags` skill) — rarely the right move; usually masks a design problem. + +There is no "edit the variants" or "change the statistical model" operation post-launch that preserves the result's validity. Surface this constraint to the user before launching — if there's any ambiguity about whether the configuration is final, send them back to `design`. + +### Launch readiness checklist + +Run this against the experiment about to launch. Surface only what fires; order blockers → warnings → fyi. + +| Severity | Check | +| -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Blocker | Pre-launch pitfall catalogue (insufficient duration, cohort too small) reports a blocker — see [../references/pitfalls.md](../references/pitfalls.md). | +| Blocker | The experiment has no primary metric. | +| Blocker | The configured allocation doesn't sum to 100% across variants. | +| Warning | The pre-launch pitfall catalogue reports a warning. | +| Warning | No guardrail metrics configured. Without guardrails, the regression hard-gate (see umbrella Cross-command policies) cannot protect the ship decision. | +| Warning | A primary metric has `direction` unset (defaults to `up`); cancel / error / latency / abandon / refund metrics need `down` set explicitly. | +| FYI | The experiment isn't linked back to a prior experiment on the same feature, even though prior experiments exist. Recommend adding the link before launch. | + +The pitfall catalogue itself lives in [../references/pitfalls.md](../references/pitfalls.md) — don't duplicate the rules here; run them and report results. + +--- + +## Steps + +Top-down: what to do, in order. + +### 1. Confirm the experiment is ready + +The umbrella resolves the experiment in its step 2. Verify it's in `DRAFT` state. If it's already `ACTIVE` or `CONCLUDED`, this command is the wrong one — route to `monitor` or `interpret`. + +### 2. Run the launch readiness checklist + +Apply the catalogue from Components against the current experiment configuration. Surface results in this order: + +``` +*Launch Readiness — [Experiment Name]* + +🛑 Blockers (must fix before launch) + • [blocker description] +⚠️ Warnings (recommend addressing) + • [warning description] +ℹ️ FYI + • [fyi description] +``` + +If any blockers fire, **stop**. Tell the user what to fix and route them back to `design` to update the configuration. Don't offer to launch past a blocker. + +If warnings fire, name each trade-off explicitly. Don't just list them — explain what risk the user is accepting by launching anyway. + +If only FYIs fire (or nothing fires), proceed to step 3. + +### 3. Present the launch confirmation + +Surface the launch summary and **wait for explicit confirmation** before invoking the launch action. The summary should match what the user saw at the end of `design`, with any post-design edits reflected: + +``` +*Launch Summary — [Experiment Name]* + +• *Hypothesis:* If , then will by ≥, because . +• *Primary metrics:* (direction), … +• *Guardrails:* (direction), … +• *Variants:* control % / treatment % (or as configured) +• *Statistical model:* sequential | frequentist +• *End condition:* sample-based (per-arm ) | date-based ( days) +• *Confidence level:* +• *Multiple testing correction:* benjamini-hochberg | bonferroni | off +• *Advanced features:* CUPED on/off · Winsorization on/off (percentile

) +• *Expected duration on current traffic:* days + +*After launch:* + • Variants are locked. + • Statistical model is locked. + • Cohort targeting is locked. + • The only safe operations are monitor (mid-flight) and conclude (at the end). + +Reply CONFIRM to launch. Anything else cancels. +``` + +The literal `CONFIRM` requirement matches the irreversibility-confirmation discipline in `manage-lexicon` and `manage-feature-flags`. + +### 4. Launch + +On `CONFIRM`, invoke the launch action. If the launch fails, surface the platform's error verbatim — don't paraphrase. The user needs to know whether the failure is a transient platform issue (retry) or a configuration issue (back to `design`). + +### 5. Hand off to monitor + +After a successful launch, recommend the user check back in 24h via the `monitor` command. Surface two things they should set up as follow-ups (don't interrupt the launch flow to do them inline): + +1. **A tracking dashboard** for the primary and guardrail metrics — gives the user a single place to watch the experiment without re-opening the skill every time. Recommend running `create-dashboard` in a follow-up session. +2. **A calendar reminder** for the canary check (24h) and the mid-flight check (~halfway through the planned duration). Concrete phrasing the agent can use: _"Worth scheduling two check-ins: 24h from now for the canary, and at the midpoint of your N-day window for the mid-flight read. Both run through the `monitor` command."_ + +Print `✅ Launched.` and return control to the umbrella. + +--- + +## Output style + +- Lead with the readiness verdict — pass, warnings, or blockers — before showing the summary. +- For blockers, name the specific configuration field the user needs to change. +- For warnings, name the trade-off, not just the rule. +- Don't moralise about launching with warnings — surface them, get explicit acceptance, proceed. +- Don't launch without `CONFIRM`. Treat any other response (including "yes", "ok", "sure") as cancel-and-clarify. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/commands/monitor.md b/plugins/mixpanel-mcp/skills/manage-experiment/commands/monitor.md new file mode 100644 index 0000000..f27fdd0 --- /dev/null +++ b/plugins/mixpanel-mcp/skills/manage-experiment/commands/monitor.md @@ -0,0 +1,106 @@ +# Command: monitor + +Mid-flight safety checks on a running experiment. This command answers **"is it safe to keep this experiment running?"** — distinct from `interpret`, which answers **"did the experiment work?"** Monitor is for the middle of the experiment, before there's enough signal to interpret. Peek only at what's safe to peek at; surface anything that warrants pause or termination. + +The umbrella `SKILL.md` defines the shared glossary. Phase-specific terms below. + +--- + +## Glossary (monitor-specific) + +- **Sample pace.** The ratio of actual exposures accumulated to expected exposures at this point in the experiment's planned duration. A pace below 0.7 (≥30% slower than projected) suggests the experiment is underpowered relative to its design, or that something is wrong with exposure tracking. +- **Mid-flight SRM.** A Sample Ratio Mismatch detected during the experiment, before exposures are mature. Distinct from the SRM check at interpretation time — mid-flight SRM is a bucketing-bug early-warning, not a verdict on the result. + +The **peeking trap** and the **peek-safety table** (what's safe to look at mid-flight, what isn't) live in the umbrella's [Cross-command policies](../SKILL.md#cross-command-policies) — this command applies them, doesn't re-derive them. + +--- + +## Components (monitor-specific) + +For the **peek-safety table** (what's safe to look at mid-flight, what isn't), see the umbrella's [Cross-command policies](../SKILL.md#cross-command-policies). For the **guardrail hard-gate threshold**, same place. + +### Terminate-early decision rules + +Three situations that justify ending a running experiment before its planned end: + +1. **Trustworthiness failure.** SRM fails mid-flight, or a misconfiguration is discovered that invalidates the design. Terminate, fix, restart. The accumulated exposures are not salvageable. +2. **Guardrail regression beyond the hard-gate threshold** (defined in the umbrella). The guardrail regresses by more than the threshold, with a tight CI. Continuing exposes more users to a measurable harm. Terminate and route to `interpret` for the ship/iterate verdict. +3. **Sequential stopping boundary crossed (Sequential tests only).** The platform's sequential boundary fires. This is the by-design early stop — terminate and route to `interpret`. + +What does **not** justify early termination: + +- "It's been a week and the lift looks good" (peeking trap on Frequentist — see the umbrella's peek-safety table). +- "The team is tired of waiting" (sunk cost mid-flight is real but not a statistical reason). +- "Looks like it'll be inconclusive" (futility analysis exists but requires the right test design; don't improvise). +- A guardrail wobbling within its noise band. + +--- + +## Steps + +Top-down: what to do, in order. + +### 1. Confirm the experiment is in scope for monitor + +The umbrella resolves the experiment in its step 2. Verify it's in `ACTIVE` state. If it's `DRAFT`, route to `design` or `launch`. If it's `CONCLUDED`, route to `interpret`. + +### 2. Read the safe signals + +Fetch the experiment with exposure data included. Report: + +- **Current state:** `ACTIVE` since [date], [N] days into a planned [N-or-date] window. +- **Sample pace:** actual vs expected exposures at this point. Flag if pace < 0.7. +- **Mid-flight SRM verdict** (from the platform). Flag if `FAIL`. +- **Guardrail summary** (polarity-corrected for each guardrail). Flag any with significant regressions. +- **Statistical model:** Sequential vs Frequentist. If Sequential, report whether the stopping boundary has been crossed. + +### 3. Apply the don't-peek rule for primaries + +If the experiment is **Frequentist** and the user asks about primary-metric results, push back politely: + +> "This experiment is configured as Frequentist, which means peeking at the primary mid-flight inflates the false-positive rate even if you're just curious. Mid-flight, the safe signals are SRM, sample pace, and guardrails — happy to walk through those. The primary verdict is meaningful once the planned [N-day | N-sample] target is reached." + +If the experiment is **Sequential**, peeking is by design. Report primary status and whether the sequential boundary fired. + +### 4. Decide: keep running, pause, or terminate + +Apply the **terminate-early decision rules** from Components. + +- **Trustworthiness failure or guardrail regression beyond tolerance** → recommend terminate, name the specific signal that fired, route to `interpret` for the formal ship/kill/iterate call. +- **Sequential stopping boundary crossed** → recommend terminate, route to `interpret`. +- **Pace problem (slow sample accrual)** → recommend extending the planned duration if possible, or accepting a coarser MDE. Don't terminate for pace alone; pace means the design is wrong, not the experiment. +- **Everything within tolerance** → recommend continue, give the user a checkback recommendation (next milestone: mid-flight, 80% sample, or planned end). + +For the SRM-failure remediation playbook, see [../references/health-check-interpretation.md](../references/health-check-interpretation.md). For the underpowered-experiment remediation playbook, see [../references/sizing.md](../references/sizing.md). + +### 5. Output + +Default to this shape: + +``` +*Mid-flight Status — [Experiment Name]* + +*Safe to keep running:* YES | NO (with reason) | YES, but with caveats + +*Signals:* + • SRM: ✅ PASS | 🛑 FAIL + • Sample pace: ✅ on track ([N]% of expected) | ⚠️ slow ([N]% of expected) + • Guardrails: ✅ flat | ⚠️ regressed [X]% + • Sequential boundary (if applicable): not crossed | CROSSED + +*Recommendation:* continue | terminate (reason) | extend duration + +*Next checkback:* [milestone — date or sample count] +``` + +If the user requested a primary-metric peek on a Frequentist test, lead with the don't-peek explanation before any other signals. + +--- + +## Output style + +- Lead with the verdict — safe to continue or not. +- Name the specific signal driving any "terminate" recommendation, never just "looks bad." +- Don't surface primary-metric lift on Frequentist tests, even if asked. Explain why, then offer the safe signals. +- Don't moralise about peeking — explain the math once, then route the user to safe signals. +- Treat "the team wants to ship now" as a separate conversation from "is the experiment ready to ship" — don't conflate. From 0f18c48200bfa61c586fd932236e73aa0238d4e9 Mon Sep 17 00:00:00 2001 From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Date: Thu, 11 Jun 2026 20:12:44 +0000 Subject: [PATCH 05/11] manage-experiment: port statsig diagnosis depth + align to skill rubric Port the three pieces from the analytics monorepo's statsig-diagnosis guidance (tal-multi-546) that PR #30's why-no-statsig.md lacked, then reconcile them with this repo's skill quality rubric. Port: - sizing.md: relative, unit-safe achievable-MDE formulas (MDE_relative, n_required) with the sigma=sqrt(sigma^2) caution, framed as explaining the NO verdict rather than overriding it. - why-no-statsig.md: a "pull the diagnostic inputs first" step and a priority-ordered "name the single most-likely blocker" table so the answer is one blocker with numbers, not a generic option list. Rubric alignment (review-skill: 60% capped -> ~98%): - Reframe the diagnostic-inputs step to intent (no named tool call or response field paths) so it stays engine-agnostic, lifting the cap. - Fix advanced-features.md winsorization percentile contradiction (tail-width 5 convention vs stale 95/90 boundary references). - Fix dead/stale cross-references in per-metric-interpretation.md and routing-xp-vs-ff.md to point at sections/commands that exist. - Add a Contents block to every 100+ line reference. - Collapse the state->command mapping duplicated 3x in SKILL.md (now 200). Synced byte-identical across mixpanel-mcp / -eu / -in. Co-Authored-By: Claude Opus 4.8 --- .../skills/manage-experiment/SKILL.md | 15 +------ .../references/advanced-features.md | 12 ++++- .../references/health-check-interpretation.md | 12 +++++ .../references/hypothesis-framing.md | 9 ++++ .../references/per-metric-interpretation.md | 21 ++++++++- .../references/routing-xp-vs-ff.md | 4 +- .../segment-of-interest-selection.md | 9 ++++ .../references/session-replay-analysis.md | 9 ++++ .../manage-experiment/references/sizing.md | 34 ++++++++++++++ .../references/statistical-model.md | 8 ++++ .../references/why-no-statsig.md | 45 +++++++++++++++++++ .../skills/manage-experiment/SKILL.md | 15 +------ .../references/advanced-features.md | 12 ++++- .../references/health-check-interpretation.md | 12 +++++ .../references/hypothesis-framing.md | 9 ++++ .../references/per-metric-interpretation.md | 21 ++++++++- .../references/routing-xp-vs-ff.md | 4 +- .../segment-of-interest-selection.md | 9 ++++ .../references/session-replay-analysis.md | 9 ++++ .../manage-experiment/references/sizing.md | 34 ++++++++++++++ .../references/statistical-model.md | 8 ++++ .../references/why-no-statsig.md | 45 +++++++++++++++++++ .../skills/manage-experiment/SKILL.md | 15 +------ .../references/advanced-features.md | 12 ++++- .../references/health-check-interpretation.md | 12 +++++ .../references/hypothesis-framing.md | 9 ++++ .../references/per-metric-interpretation.md | 21 ++++++++- .../references/routing-xp-vs-ff.md | 4 +- .../segment-of-interest-selection.md | 9 ++++ .../references/session-replay-analysis.md | 9 ++++ .../manage-experiment/references/sizing.md | 34 ++++++++++++++ .../references/statistical-model.md | 8 ++++ .../references/why-no-statsig.md | 45 +++++++++++++++++++ 33 files changed, 477 insertions(+), 57 deletions(-) diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md index c977181..7bed770 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md @@ -24,13 +24,7 @@ license: Apache-2.0 # Manage Experiment -This skill manages a Mixpanel experiment across its full lifecycle — **design**, **launch**, **monitor**, **interpret**. Four commands sit under the umbrella; pick by experiment phase. - -The four commands map cleanly to experiment states: - -- `DRAFT` (experiment doesn't exist or hasn't launched) → `design` or `launch`. -- `ACTIVE` (mid-flight, exposures accumulating) → `monitor`. -- `ACTIVE` (reached planned end) or `CONCLUDED` → `interpret`. +This skill manages a Mixpanel experiment across its full lifecycle — **design**, **launch**, **monitor**, **interpret**. Four commands sit under the umbrella, picked by experiment phase (the state→command mapping lives in the **Canonical commands** section below). The skill runs as a single interactive session per experiment. Commands compose naturally across phases — designing produces a configuration that launching commits, monitoring watches for safety issues mid-flight, interpreting consumes the matured result — but they're rarely invoked in the same session (the lifecycle spans days to weeks). @@ -190,12 +184,7 @@ Apply these rules in order; the first match wins. 1. **Explicit:** user names a phase (`/design`, "launch this experiment", "monitor experiment X", "interpret the results") → use that command. 2. **Implicit:** message matches one canonical trigger phrase from the Components table → use that command. -3. **Phase-derived (only when an experiment was resolved in step 2):** - - `DRAFT` + configuration incomplete or user is iterating on it → `design`. - - `DRAFT` + configuration complete and user is ready → `launch`. - - `ACTIVE` and planned end not reached → `monitor`. - - `ACTIVE` and planned end reached, or `CONCLUDED` → `interpret`. - If `DRAFT` doesn't disambiguate between design and launch, ask: "Is the configuration final, or are you still iterating on it?" +3. **Phase-derived (only when an experiment was resolved in step 2):** apply the state→command mapping from the **Canonical commands** section. If `DRAFT` doesn't disambiguate between design and launch, ask: "Is the configuration final, or are you still iterating on it?" 4. **Ambiguous verbs** ("audit", "check", "review") — apply phase-derived routing if an experiment is in context, otherwise fall through to the menu. 5. **Otherwise:** show the Command menu, take the user's choice. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/advanced-features.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/advanced-features.md index 96fcd08..ffcf1d3 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/advanced-features.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/advanced-features.md @@ -2,6 +2,14 @@ Three optional features most experiments don't touch — and that, used in the right spot, dramatically improve power or trustworthiness. Each one has a clear set of conditions where it helps and a clear set of conditions where enabling it is wrong. +## Contents + +- CUPED — variance reduction +- Winsorization — outlier handling +- Multiple testing correction — Bonferroni vs Benjamini-Hochberg +- Decision flowchart +- Common misconfigurations + ## CUPED — variance reduction **What it does.** CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance on metrics that correlate with users' pre-experiment behaviour. Lower variance → smaller required sample size → faster experiments. Typical reductions are 30–70%, which translates directly into 30–70% smaller required sample. @@ -83,7 +91,7 @@ Primary metric is Bernoulli (conversion rate)? │ └── No → CUPED OFF └── No (continuous / count / retention) Heavy-tailed distribution with outliers (revenue, time-on-page, session length)? - ├── Yes → Winsorization ON (platform default percentile, typically 95) + ├── Yes → Winsorization ON (default `percentile=5`, i.e. cap each 5% tail) └── No → Winsorization OFF Does it correlate with pre-exposure behaviour of existing users? ├── Yes → CUPED ON (if 2–4 week pre-exposure window available, no new-user cohort) @@ -98,6 +106,6 @@ Primary count ≥ 2 OR non-control variants ≥ 2? - ⛔ **CUPED on a new-user-only experiment.** No pre-exposure data; the feature does nothing. Worse, the user thinks they're being protected and ships an underpowered test. - ⛔ **Winsorization on a conversion metric.** Capping 0/1 values is meaningless. The setting either no-ops or, if a buggy implementation interprets it literally, makes the metric worse. -- ⛔ **Winsorization at a percentile below ~80.** Cuts more than 20% of data. Almost always a typo for 95 or 90. Confirm intent. +- ⛔ **Winsorization at a `percentile` above ~20.** Caps more than 20% of each tail — throws away too much signal. Almost always a misconfiguration. Confirm intent. - ⛔ **Multiple testing correction OFF on a 5-primary test.** Family-wise FPR balloons to ~22.6%. One in five "wins" is noise. - ⛔ **CUPED enabled "to be safe" on a metric where pre-exposure doesn't predict post-exposure.** Best case: no effect. Common case: the variance estimate gets noisier because the regression adjustment is fitting to noise. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/health-check-interpretation.md index 82a09b7..818b163 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/health-check-interpretation.md @@ -2,6 +2,18 @@ Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action. +## Contents + +- Kohavi framing — always cite when a health check fails +- 1. SRM (Sample Ratio Mismatch) +- 2. Retro A/A (pre-experiment bias) failure +- 3. Insufficient exposures +- 4. Frequentist peeking +- 5. Live computation timeout / broken data +- 6. Experiment ran < 3 days +- 7. Misconfigurations +- Output shape when a health check fails + --- ## Kohavi framing — always cite when a health check fails diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/hypothesis-framing.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/hypothesis-framing.md index 8c115af..40cf56e 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/hypothesis-framing.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/hypothesis-framing.md @@ -2,6 +2,15 @@ All four properties of a good hypothesis — falsifiable, directional, mechanistic, bounded in time — matter. Drop any one and the design downstream silently degrades. +## Contents + +- The shape +- When the user gives you a one-liner +- Mechanism → metric class +- Hypothesis ↔ metric alignment +- When to push back +- Worked examples + ## The shape > **If** ``, **then** `` will ``, **because** ``. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/per-metric-interpretation.md index e863069..320cb8e 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/per-metric-interpretation.md @@ -1,6 +1,23 @@ # Per-Metric Interpretation -Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_ +Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single result row actually mean?"_ + +## Contents + +- The mental model +- Polarity recipe +- Reading the p-value in this platform +- Reading the lift correctly +- Verdict phrasing — a small palette +- Magnitude — make it absolute +- Twyman's Law in practice — changed-denominator lifts +- Metric distribution types +- Variance-reduction & outlier settings that change interpretation +- Multiple comparisons & metric tiers — what's decisional and what isn't +- When a primary metric is inconclusive +- Frequentist vs Sequential — what affects per-metric reading +- Triggered analysis & dilution +- Novelty and primacy --- @@ -19,7 +36,7 @@ A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any ## Polarity recipe -Apply the polarity recipe from the spine — see the **Components** section of `SKILL.md`. Treat the bucket name in `summary.positive` / `summary.negative` as sign-of-lift only; the business verdict comes from combining it with `metric.direction`. Examples worth remembering: +Treat the bucket name (the positive / negative / no grouping) as sign-of-lift only; the business verdict comes from combining that sign with the metric's **Direction** (defined in the Shared glossary in `SKILL.md`). A positive-sign movement on a `down`-direction metric is a regression, not a win. Examples worth remembering: - A row in `summary.positive` with `direction: "down"` is a **regression**. - A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1% interstitials_shown` lift means less interruption). diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/routing-xp-vs-ff.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/routing-xp-vs-ff.md index d077dba..7e13471 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/routing-xp-vs-ff.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/routing-xp-vs-ff.md @@ -51,7 +51,7 @@ If the user has already shipped to 100% and wants to "analyse the effect," there ### "Just give me an A/B test, the simplest one" -Probably an experiment. But "simplest" usually means "skip hypothesis, skip MDE, skip guardrails," which kills the test's interpretability. Coach the user through Step 1 (hypothesis) and Step 2 (metrics) of the main workflow — the cost is 10 minutes; the value is having a result you can actually act on. +Probably an experiment. But "simplest" usually means "skip hypothesis, skip MDE, skip guardrails," which kills the test's interpretability. Coach the user through the hypothesis and metric-selection steps of the `design` command — the cost is 10 minutes; the value is having a result you can actually act on. ### "I want a feature flag but with stats" @@ -61,7 +61,7 @@ Now you're back to an experiment. Run the full setup workflow. ### If experiment -Continue with the four-step setup workflow in the main `SKILL.md`. The output of this skill is a configured experiment ready to launch. +Continue with the `design` command's setup workflow. The output is a configured experiment ready to launch. ### If feature flag diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/segment-of-interest-selection.md index 4db49ac..d290517 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/segment-of-interest-selection.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/segment-of-interest-selection.md @@ -4,6 +4,15 @@ Pick 3–5 segments **likely to reveal a real effect difference** before slicing The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them. +## Contents + +- Why this matters: the fishing-expedition problem +- The decision tree for picking segments +- Sanity checks before committing to a slice +- How many slices to commit to +- The pre-commit ritual +- Then read the results + --- ## Why this matters: the fishing-expedition problem diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/session-replay-analysis.md index 7282bb4..11d5158 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/session-replay-analysis.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/session-replay-analysis.md @@ -4,6 +4,15 @@ Turn a quantitative experiment result into a behavior story using session replay > **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. +## Contents + +- When replays help, when they don't +- Cohort selection: which replays to compare +- What to actually watch for +- How to frame the findings +- What NOT to do +- Output shape + --- ## When replays help, when they don't diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/sizing.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/sizing.md index 7c41c9a..8a057d9 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/sizing.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/sizing.md @@ -2,6 +2,20 @@ You almost never know the right sample size by guessing. Pull the data first, then run the math. +## Contents + +- The standard formula +- Variance by metric type +- Worked example +- Kohavi's inverted formula +- Achievable MDE for a running experiment (diagnosis form) +- Estimating the inputs from real data +- Five remediations when the experiment is underpowered +- Sample-size floor +- Lookup table (Bernoulli, 95% conf, 80% power) +- Sample-size growth with variants +- Duration considerations + ## The standard formula Required sample size per variant (two-sample, two-sided test at 95% confidence, 80% power): @@ -48,6 +62,26 @@ This tells the user: "given your traffic, the smallest effect you can reliably d Underpowered experiments suffer from **winner's curse**: if you do reach significance, the lift estimate is exaggerated, because only the high-variance positive realisations crossed the threshold. The post-launch result then fails to replicate, and the team learns "experiments are unreliable" rather than "this experiment was underpowered." +## Achievable MDE for a running experiment (diagnosis form) + +When diagnosing a live experiment that hasn't hit significance (see [why-no-statsig.md](why-no-statsig.md)), you want the achievable MDE as a **relative** lift so it compares directly against the reported `lift`. Two unit traps make this wrong more often than not: + +- `MDE = 4σ / √n` above is **absolute** (metric units). Divide by the baseline to get a relative fraction: + + ``` + MDE_relative = 4σ / (baseline × √n) + ``` + +- Use `σ`, the **standard deviation** — `σ = √σ²`. Plugging the variance `σ²` in directly overstates the MDE by a factor of `√σ²`. For a Bernoulli metric, `σ = √(p(1−p))`, not `p(1−p)`. + +Now compare apples to apples: both `MDE_relative` and the platform's `lift` are relative fractions. If `|lift| < MDE_relative`, the observed effect is below the detection floor at the current sample — it may be real, but the experiment was sized for a larger one. Invert the required-sample formula to quote the multiplier ("~3× more exposures"): + +``` +n_required = 16 × σ² / (baseline × MDE_target)² +``` + +These formulas explain _why_ the platform returned `significance = NO`; they do **not** override it. Never recompute or restate the platform's significance verdict from them. + ## Estimating the inputs from real data For each primary metric, before sizing, you need three numbers: diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/statistical-model.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/statistical-model.md index 771a208..1d832e0 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/statistical-model.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/statistical-model.md @@ -2,6 +2,14 @@ Once required sample size and acceptable duration are known, two configuration choices are left: the **testing model** (sequential vs frequentist) and the **end condition** (sample-based vs date-based). Two adjacent choices change how the tests are interpreted: **confidence level** and **multiple-testing correction**. +## Contents + +- Testing model: sequential vs frequentist +- End condition: sample-based vs date-based +- Confidence level +- Multiple testing correction +- Power vs significance trade-off + ## Testing model: sequential vs frequentist **Default to sequential** for most users. Peeking is the most common Mixpanel customer mistake, and sequential testing makes early-look safe by design. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/why-no-statsig.md index 22ca2fe..4995f62 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/why-no-statsig.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/why-no-statsig.md @@ -4,6 +4,16 @@ Help the user decide between **wait**, **extend**, **boost power**, **narrow the The actual stop / extend math (sample size, power, MDE) lives in [sizing.md](sizing.md) — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one. +## Contents + +- First, rule out a broken result +- Pull the diagnostic inputs first +- The five real reasons an experiment hasn't hit statsig +- When several reasons fire: name the single most-likely blocker +- Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL? +- What NOT to suggest +- Output shape + --- ## First, rule out a broken result @@ -23,6 +33,22 @@ Also check: --- +## Pull the diagnostic inputs first + +Before walking the reasons, fetch the experiment with its exposures and metric results included. Prefer the live results; if live computation failed, fall back to the cached results and flag the staleness — and never treat missing data as "no effect." Gather: + +- **Per-variant exposure counts** — the smallest arm is the binding constraint, not the total. +- **Control baseline** for each inconclusive primary (its rate or value). If it's missing, query the metric scoped to the control variant over the experiment's dates. +- **Observed lift** per primary — relative, `(treatment − control) / control`. Apply the polarity recipe before reading the sign. +- **Variance proxy by metric type** — Bernoulli `p(1−p)`, Poisson `mean`, Gaussian from the per-arm value and sample size (see the variance-by-metric-type table in [sizing.md](sizing.md)). +- **Configured MDE and end target** (sample-size or duration, whichever the experiment uses). If no MDE was set, ask the user for "the smallest lift worth shipping" — that's the operative MDE. +- **Configured traffic split vs the actual exposure ratio** — a skewed split bottlenecks the test on the smaller arm even when SRM didn't fail. +- **Overall exposure volume vs plan** (target per arm × arm count) — far below plan means exposures aren't flowing as configured (reason 5). + +For the closed-form power math (`n_required`, achievable `MDE_relative`), see [sizing.md](sizing.md) — those numbers explain the `NO` verdict; they don't override it. + +--- + ## The five real reasons an experiment hasn't hit statsig Walk through these in order. The first one that explains the picture is usually right. @@ -79,6 +105,24 @@ Never change traffic allocation mid-Frequentist test — it invalidates the SRM --- +## When several reasons fire: name the single most-likely blocker + +Most experiments trip one or two of the five reasons, not all five. Don't hand back the whole list — collapse to the single most-likely blocker using this precedence (highest firing rule wins) and tell the user _that one_, with numbers. + +| Most-likely blocker (priority order) | What to tell the user | +| ------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Exposures aren't flowing** (reason 5) | "You're capturing `` of an expected `` exposures — the pipeline isn't delivering most eligible users. Fix the flag / SDK emission before judging significance." Confirm with a query on the exposure event. | +| **Traffic split is silently skewed** (reason 4, no SRM failure) | "Your smallest arm has only `` exposures (vs ``), so the test is sample-bound by that arm. Rebalance the rollout (pause + restart, not mid-experiment) before adding time." | +| **Effect is real but below the detection floor** (reason 2 AND sample at/above target) | "You have enough traffic for the MDE this was sized for, but the observed lift (~`%`) is below the achievable MDE (~`%`). Detecting `%` needs ~``/arm — about ``× your current sample — or lower the MDE / accept the smaller effect." | +| **Underpowered for the configured MDE** (reason 1) | "You're sized for a `%` lift but only have ``/`` of the required per-arm sample. Extend ~`` days, raise allocation, or enable CUPED / Winsorization to claw back power." | +| **Variance inflated, reduction off, _and_ another blocker fires** (reason 3 + any of 1/2/4/5) | "Variance reduction is leaving power on the table — enabling Winsorization (default 5/95) or CUPED would tighten the CI before adding exposures." Stack this on the primary blocker; don't offer it alone. | +| **Variance reduction is the only lever** (reason 3 only; 1/2/4/5 clear) | "The experiment is well-sized and well-allocated, but variance reduction is off — enabling Winsorization (default 5/95) or CUPED could resolve the borderline result without more data. Try that before accepting the null." | +| **None of the above** — well-sized, well-allocated, exposures flowing, reduction on/NA, lift ≈ 0 | "Well-powered for the MDE that matters and the effect is genuinely near zero — this is a real null. Accept the null and ship the decision, or iterate on a stronger hypothesis." Quote the achievable-MDE numbers so the conclusion is trusted. | + +**Always quote the numbers** — "12k of 58k required per arm," "smallest arm 4.1k vs 8.2k expected," "~3× more exposures." Vague advice ("collect more data") is the failure mode this playbook exists to prevent. + +--- + ## Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL? Once you know which reason fits, the recommendation almost picks itself. @@ -104,6 +148,7 @@ When recommending EXTEND on an active experiment, the action is to update the ex - ❌ **Add more primary metrics** to "fish" for a win — multiplies the family-wise FPR. If a single primary is inconclusive, more primaries make the picture worse, not better. - ❌ **Re-run identical hypothesis on the same audience right after concluding "no effect"** — without a power change, you'll get the same answer. - ❌ **Claim "no effect"** from an underpowered inconclusive result — the right framing is "the experiment wasn't sized to detect the effect we observed." +- ❌ **Hand back the generic option list** instead of naming the single most-likely blocker with numbers — that's the failure mode this playbook exists to prevent. --- diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md index c977181..7bed770 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md @@ -24,13 +24,7 @@ license: Apache-2.0 # Manage Experiment -This skill manages a Mixpanel experiment across its full lifecycle — **design**, **launch**, **monitor**, **interpret**. Four commands sit under the umbrella; pick by experiment phase. - -The four commands map cleanly to experiment states: - -- `DRAFT` (experiment doesn't exist or hasn't launched) → `design` or `launch`. -- `ACTIVE` (mid-flight, exposures accumulating) → `monitor`. -- `ACTIVE` (reached planned end) or `CONCLUDED` → `interpret`. +This skill manages a Mixpanel experiment across its full lifecycle — **design**, **launch**, **monitor**, **interpret**. Four commands sit under the umbrella, picked by experiment phase (the state→command mapping lives in the **Canonical commands** section below). The skill runs as a single interactive session per experiment. Commands compose naturally across phases — designing produces a configuration that launching commits, monitoring watches for safety issues mid-flight, interpreting consumes the matured result — but they're rarely invoked in the same session (the lifecycle spans days to weeks). @@ -190,12 +184,7 @@ Apply these rules in order; the first match wins. 1. **Explicit:** user names a phase (`/design`, "launch this experiment", "monitor experiment X", "interpret the results") → use that command. 2. **Implicit:** message matches one canonical trigger phrase from the Components table → use that command. -3. **Phase-derived (only when an experiment was resolved in step 2):** - - `DRAFT` + configuration incomplete or user is iterating on it → `design`. - - `DRAFT` + configuration complete and user is ready → `launch`. - - `ACTIVE` and planned end not reached → `monitor`. - - `ACTIVE` and planned end reached, or `CONCLUDED` → `interpret`. - If `DRAFT` doesn't disambiguate between design and launch, ask: "Is the configuration final, or are you still iterating on it?" +3. **Phase-derived (only when an experiment was resolved in step 2):** apply the state→command mapping from the **Canonical commands** section. If `DRAFT` doesn't disambiguate between design and launch, ask: "Is the configuration final, or are you still iterating on it?" 4. **Ambiguous verbs** ("audit", "check", "review") — apply phase-derived routing if an experiment is in context, otherwise fall through to the menu. 5. **Otherwise:** show the Command menu, take the user's choice. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/advanced-features.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/advanced-features.md index 96fcd08..ffcf1d3 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/advanced-features.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/advanced-features.md @@ -2,6 +2,14 @@ Three optional features most experiments don't touch — and that, used in the right spot, dramatically improve power or trustworthiness. Each one has a clear set of conditions where it helps and a clear set of conditions where enabling it is wrong. +## Contents + +- CUPED — variance reduction +- Winsorization — outlier handling +- Multiple testing correction — Bonferroni vs Benjamini-Hochberg +- Decision flowchart +- Common misconfigurations + ## CUPED — variance reduction **What it does.** CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance on metrics that correlate with users' pre-experiment behaviour. Lower variance → smaller required sample size → faster experiments. Typical reductions are 30–70%, which translates directly into 30–70% smaller required sample. @@ -83,7 +91,7 @@ Primary metric is Bernoulli (conversion rate)? │ └── No → CUPED OFF └── No (continuous / count / retention) Heavy-tailed distribution with outliers (revenue, time-on-page, session length)? - ├── Yes → Winsorization ON (platform default percentile, typically 95) + ├── Yes → Winsorization ON (default `percentile=5`, i.e. cap each 5% tail) └── No → Winsorization OFF Does it correlate with pre-exposure behaviour of existing users? ├── Yes → CUPED ON (if 2–4 week pre-exposure window available, no new-user cohort) @@ -98,6 +106,6 @@ Primary count ≥ 2 OR non-control variants ≥ 2? - ⛔ **CUPED on a new-user-only experiment.** No pre-exposure data; the feature does nothing. Worse, the user thinks they're being protected and ships an underpowered test. - ⛔ **Winsorization on a conversion metric.** Capping 0/1 values is meaningless. The setting either no-ops or, if a buggy implementation interprets it literally, makes the metric worse. -- ⛔ **Winsorization at a percentile below ~80.** Cuts more than 20% of data. Almost always a typo for 95 or 90. Confirm intent. +- ⛔ **Winsorization at a `percentile` above ~20.** Caps more than 20% of each tail — throws away too much signal. Almost always a misconfiguration. Confirm intent. - ⛔ **Multiple testing correction OFF on a 5-primary test.** Family-wise FPR balloons to ~22.6%. One in five "wins" is noise. - ⛔ **CUPED enabled "to be safe" on a metric where pre-exposure doesn't predict post-exposure.** Best case: no effect. Common case: the variance estimate gets noisier because the regression adjustment is fitting to noise. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/health-check-interpretation.md index 82a09b7..818b163 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/health-check-interpretation.md @@ -2,6 +2,18 @@ Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action. +## Contents + +- Kohavi framing — always cite when a health check fails +- 1. SRM (Sample Ratio Mismatch) +- 2. Retro A/A (pre-experiment bias) failure +- 3. Insufficient exposures +- 4. Frequentist peeking +- 5. Live computation timeout / broken data +- 6. Experiment ran < 3 days +- 7. Misconfigurations +- Output shape when a health check fails + --- ## Kohavi framing — always cite when a health check fails diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/hypothesis-framing.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/hypothesis-framing.md index 8c115af..40cf56e 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/hypothesis-framing.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/hypothesis-framing.md @@ -2,6 +2,15 @@ All four properties of a good hypothesis — falsifiable, directional, mechanistic, bounded in time — matter. Drop any one and the design downstream silently degrades. +## Contents + +- The shape +- When the user gives you a one-liner +- Mechanism → metric class +- Hypothesis ↔ metric alignment +- When to push back +- Worked examples + ## The shape > **If** ``, **then** `` will ``, **because** ``. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/per-metric-interpretation.md index e863069..320cb8e 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/per-metric-interpretation.md @@ -1,6 +1,23 @@ # Per-Metric Interpretation -Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_ +Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single result row actually mean?"_ + +## Contents + +- The mental model +- Polarity recipe +- Reading the p-value in this platform +- Reading the lift correctly +- Verdict phrasing — a small palette +- Magnitude — make it absolute +- Twyman's Law in practice — changed-denominator lifts +- Metric distribution types +- Variance-reduction & outlier settings that change interpretation +- Multiple comparisons & metric tiers — what's decisional and what isn't +- When a primary metric is inconclusive +- Frequentist vs Sequential — what affects per-metric reading +- Triggered analysis & dilution +- Novelty and primacy --- @@ -19,7 +36,7 @@ A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any ## Polarity recipe -Apply the polarity recipe from the spine — see the **Components** section of `SKILL.md`. Treat the bucket name in `summary.positive` / `summary.negative` as sign-of-lift only; the business verdict comes from combining it with `metric.direction`. Examples worth remembering: +Treat the bucket name (the positive / negative / no grouping) as sign-of-lift only; the business verdict comes from combining that sign with the metric's **Direction** (defined in the Shared glossary in `SKILL.md`). A positive-sign movement on a `down`-direction metric is a regression, not a win. Examples worth remembering: - A row in `summary.positive` with `direction: "down"` is a **regression**. - A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1% interstitials_shown` lift means less interruption). diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/routing-xp-vs-ff.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/routing-xp-vs-ff.md index d077dba..7e13471 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/routing-xp-vs-ff.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/routing-xp-vs-ff.md @@ -51,7 +51,7 @@ If the user has already shipped to 100% and wants to "analyse the effect," there ### "Just give me an A/B test, the simplest one" -Probably an experiment. But "simplest" usually means "skip hypothesis, skip MDE, skip guardrails," which kills the test's interpretability. Coach the user through Step 1 (hypothesis) and Step 2 (metrics) of the main workflow — the cost is 10 minutes; the value is having a result you can actually act on. +Probably an experiment. But "simplest" usually means "skip hypothesis, skip MDE, skip guardrails," which kills the test's interpretability. Coach the user through the hypothesis and metric-selection steps of the `design` command — the cost is 10 minutes; the value is having a result you can actually act on. ### "I want a feature flag but with stats" @@ -61,7 +61,7 @@ Now you're back to an experiment. Run the full setup workflow. ### If experiment -Continue with the four-step setup workflow in the main `SKILL.md`. The output of this skill is a configured experiment ready to launch. +Continue with the `design` command's setup workflow. The output is a configured experiment ready to launch. ### If feature flag diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/segment-of-interest-selection.md index 4db49ac..d290517 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/segment-of-interest-selection.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/segment-of-interest-selection.md @@ -4,6 +4,15 @@ Pick 3–5 segments **likely to reveal a real effect difference** before slicing The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them. +## Contents + +- Why this matters: the fishing-expedition problem +- The decision tree for picking segments +- Sanity checks before committing to a slice +- How many slices to commit to +- The pre-commit ritual +- Then read the results + --- ## Why this matters: the fishing-expedition problem diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/session-replay-analysis.md index 7282bb4..11d5158 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/session-replay-analysis.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/session-replay-analysis.md @@ -4,6 +4,15 @@ Turn a quantitative experiment result into a behavior story using session replay > **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. +## Contents + +- When replays help, when they don't +- Cohort selection: which replays to compare +- What to actually watch for +- How to frame the findings +- What NOT to do +- Output shape + --- ## When replays help, when they don't diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/sizing.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/sizing.md index 7c41c9a..8a057d9 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/sizing.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/sizing.md @@ -2,6 +2,20 @@ You almost never know the right sample size by guessing. Pull the data first, then run the math. +## Contents + +- The standard formula +- Variance by metric type +- Worked example +- Kohavi's inverted formula +- Achievable MDE for a running experiment (diagnosis form) +- Estimating the inputs from real data +- Five remediations when the experiment is underpowered +- Sample-size floor +- Lookup table (Bernoulli, 95% conf, 80% power) +- Sample-size growth with variants +- Duration considerations + ## The standard formula Required sample size per variant (two-sample, two-sided test at 95% confidence, 80% power): @@ -48,6 +62,26 @@ This tells the user: "given your traffic, the smallest effect you can reliably d Underpowered experiments suffer from **winner's curse**: if you do reach significance, the lift estimate is exaggerated, because only the high-variance positive realisations crossed the threshold. The post-launch result then fails to replicate, and the team learns "experiments are unreliable" rather than "this experiment was underpowered." +## Achievable MDE for a running experiment (diagnosis form) + +When diagnosing a live experiment that hasn't hit significance (see [why-no-statsig.md](why-no-statsig.md)), you want the achievable MDE as a **relative** lift so it compares directly against the reported `lift`. Two unit traps make this wrong more often than not: + +- `MDE = 4σ / √n` above is **absolute** (metric units). Divide by the baseline to get a relative fraction: + + ``` + MDE_relative = 4σ / (baseline × √n) + ``` + +- Use `σ`, the **standard deviation** — `σ = √σ²`. Plugging the variance `σ²` in directly overstates the MDE by a factor of `√σ²`. For a Bernoulli metric, `σ = √(p(1−p))`, not `p(1−p)`. + +Now compare apples to apples: both `MDE_relative` and the platform's `lift` are relative fractions. If `|lift| < MDE_relative`, the observed effect is below the detection floor at the current sample — it may be real, but the experiment was sized for a larger one. Invert the required-sample formula to quote the multiplier ("~3× more exposures"): + +``` +n_required = 16 × σ² / (baseline × MDE_target)² +``` + +These formulas explain _why_ the platform returned `significance = NO`; they do **not** override it. Never recompute or restate the platform's significance verdict from them. + ## Estimating the inputs from real data For each primary metric, before sizing, you need three numbers: diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/statistical-model.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/statistical-model.md index 771a208..1d832e0 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/statistical-model.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/statistical-model.md @@ -2,6 +2,14 @@ Once required sample size and acceptable duration are known, two configuration choices are left: the **testing model** (sequential vs frequentist) and the **end condition** (sample-based vs date-based). Two adjacent choices change how the tests are interpreted: **confidence level** and **multiple-testing correction**. +## Contents + +- Testing model: sequential vs frequentist +- End condition: sample-based vs date-based +- Confidence level +- Multiple testing correction +- Power vs significance trade-off + ## Testing model: sequential vs frequentist **Default to sequential** for most users. Peeking is the most common Mixpanel customer mistake, and sequential testing makes early-look safe by design. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/why-no-statsig.md index 22ca2fe..4995f62 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/why-no-statsig.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/why-no-statsig.md @@ -4,6 +4,16 @@ Help the user decide between **wait**, **extend**, **boost power**, **narrow the The actual stop / extend math (sample size, power, MDE) lives in [sizing.md](sizing.md) — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one. +## Contents + +- First, rule out a broken result +- Pull the diagnostic inputs first +- The five real reasons an experiment hasn't hit statsig +- When several reasons fire: name the single most-likely blocker +- Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL? +- What NOT to suggest +- Output shape + --- ## First, rule out a broken result @@ -23,6 +33,22 @@ Also check: --- +## Pull the diagnostic inputs first + +Before walking the reasons, fetch the experiment with its exposures and metric results included. Prefer the live results; if live computation failed, fall back to the cached results and flag the staleness — and never treat missing data as "no effect." Gather: + +- **Per-variant exposure counts** — the smallest arm is the binding constraint, not the total. +- **Control baseline** for each inconclusive primary (its rate or value). If it's missing, query the metric scoped to the control variant over the experiment's dates. +- **Observed lift** per primary — relative, `(treatment − control) / control`. Apply the polarity recipe before reading the sign. +- **Variance proxy by metric type** — Bernoulli `p(1−p)`, Poisson `mean`, Gaussian from the per-arm value and sample size (see the variance-by-metric-type table in [sizing.md](sizing.md)). +- **Configured MDE and end target** (sample-size or duration, whichever the experiment uses). If no MDE was set, ask the user for "the smallest lift worth shipping" — that's the operative MDE. +- **Configured traffic split vs the actual exposure ratio** — a skewed split bottlenecks the test on the smaller arm even when SRM didn't fail. +- **Overall exposure volume vs plan** (target per arm × arm count) — far below plan means exposures aren't flowing as configured (reason 5). + +For the closed-form power math (`n_required`, achievable `MDE_relative`), see [sizing.md](sizing.md) — those numbers explain the `NO` verdict; they don't override it. + +--- + ## The five real reasons an experiment hasn't hit statsig Walk through these in order. The first one that explains the picture is usually right. @@ -79,6 +105,24 @@ Never change traffic allocation mid-Frequentist test — it invalidates the SRM --- +## When several reasons fire: name the single most-likely blocker + +Most experiments trip one or two of the five reasons, not all five. Don't hand back the whole list — collapse to the single most-likely blocker using this precedence (highest firing rule wins) and tell the user _that one_, with numbers. + +| Most-likely blocker (priority order) | What to tell the user | +| ------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Exposures aren't flowing** (reason 5) | "You're capturing `` of an expected `` exposures — the pipeline isn't delivering most eligible users. Fix the flag / SDK emission before judging significance." Confirm with a query on the exposure event. | +| **Traffic split is silently skewed** (reason 4, no SRM failure) | "Your smallest arm has only `` exposures (vs ``), so the test is sample-bound by that arm. Rebalance the rollout (pause + restart, not mid-experiment) before adding time." | +| **Effect is real but below the detection floor** (reason 2 AND sample at/above target) | "You have enough traffic for the MDE this was sized for, but the observed lift (~`%`) is below the achievable MDE (~`%`). Detecting `%` needs ~``/arm — about ``× your current sample — or lower the MDE / accept the smaller effect." | +| **Underpowered for the configured MDE** (reason 1) | "You're sized for a `%` lift but only have ``/`` of the required per-arm sample. Extend ~`` days, raise allocation, or enable CUPED / Winsorization to claw back power." | +| **Variance inflated, reduction off, _and_ another blocker fires** (reason 3 + any of 1/2/4/5) | "Variance reduction is leaving power on the table — enabling Winsorization (default 5/95) or CUPED would tighten the CI before adding exposures." Stack this on the primary blocker; don't offer it alone. | +| **Variance reduction is the only lever** (reason 3 only; 1/2/4/5 clear) | "The experiment is well-sized and well-allocated, but variance reduction is off — enabling Winsorization (default 5/95) or CUPED could resolve the borderline result without more data. Try that before accepting the null." | +| **None of the above** — well-sized, well-allocated, exposures flowing, reduction on/NA, lift ≈ 0 | "Well-powered for the MDE that matters and the effect is genuinely near zero — this is a real null. Accept the null and ship the decision, or iterate on a stronger hypothesis." Quote the achievable-MDE numbers so the conclusion is trusted. | + +**Always quote the numbers** — "12k of 58k required per arm," "smallest arm 4.1k vs 8.2k expected," "~3× more exposures." Vague advice ("collect more data") is the failure mode this playbook exists to prevent. + +--- + ## Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL? Once you know which reason fits, the recommendation almost picks itself. @@ -104,6 +148,7 @@ When recommending EXTEND on an active experiment, the action is to update the ex - ❌ **Add more primary metrics** to "fish" for a win — multiplies the family-wise FPR. If a single primary is inconclusive, more primaries make the picture worse, not better. - ❌ **Re-run identical hypothesis on the same audience right after concluding "no effect"** — without a power change, you'll get the same answer. - ❌ **Claim "no effect"** from an underpowered inconclusive result — the right framing is "the experiment wasn't sized to detect the effect we observed." +- ❌ **Hand back the generic option list** instead of naming the single most-likely blocker with numbers — that's the failure mode this playbook exists to prevent. --- diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md index c977181..7bed770 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md @@ -24,13 +24,7 @@ license: Apache-2.0 # Manage Experiment -This skill manages a Mixpanel experiment across its full lifecycle — **design**, **launch**, **monitor**, **interpret**. Four commands sit under the umbrella; pick by experiment phase. - -The four commands map cleanly to experiment states: - -- `DRAFT` (experiment doesn't exist or hasn't launched) → `design` or `launch`. -- `ACTIVE` (mid-flight, exposures accumulating) → `monitor`. -- `ACTIVE` (reached planned end) or `CONCLUDED` → `interpret`. +This skill manages a Mixpanel experiment across its full lifecycle — **design**, **launch**, **monitor**, **interpret**. Four commands sit under the umbrella, picked by experiment phase (the state→command mapping lives in the **Canonical commands** section below). The skill runs as a single interactive session per experiment. Commands compose naturally across phases — designing produces a configuration that launching commits, monitoring watches for safety issues mid-flight, interpreting consumes the matured result — but they're rarely invoked in the same session (the lifecycle spans days to weeks). @@ -190,12 +184,7 @@ Apply these rules in order; the first match wins. 1. **Explicit:** user names a phase (`/design`, "launch this experiment", "monitor experiment X", "interpret the results") → use that command. 2. **Implicit:** message matches one canonical trigger phrase from the Components table → use that command. -3. **Phase-derived (only when an experiment was resolved in step 2):** - - `DRAFT` + configuration incomplete or user is iterating on it → `design`. - - `DRAFT` + configuration complete and user is ready → `launch`. - - `ACTIVE` and planned end not reached → `monitor`. - - `ACTIVE` and planned end reached, or `CONCLUDED` → `interpret`. - If `DRAFT` doesn't disambiguate between design and launch, ask: "Is the configuration final, or are you still iterating on it?" +3. **Phase-derived (only when an experiment was resolved in step 2):** apply the state→command mapping from the **Canonical commands** section. If `DRAFT` doesn't disambiguate between design and launch, ask: "Is the configuration final, or are you still iterating on it?" 4. **Ambiguous verbs** ("audit", "check", "review") — apply phase-derived routing if an experiment is in context, otherwise fall through to the menu. 5. **Otherwise:** show the Command menu, take the user's choice. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/advanced-features.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/advanced-features.md index 96fcd08..ffcf1d3 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/advanced-features.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/advanced-features.md @@ -2,6 +2,14 @@ Three optional features most experiments don't touch — and that, used in the right spot, dramatically improve power or trustworthiness. Each one has a clear set of conditions where it helps and a clear set of conditions where enabling it is wrong. +## Contents + +- CUPED — variance reduction +- Winsorization — outlier handling +- Multiple testing correction — Bonferroni vs Benjamini-Hochberg +- Decision flowchart +- Common misconfigurations + ## CUPED — variance reduction **What it does.** CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance on metrics that correlate with users' pre-experiment behaviour. Lower variance → smaller required sample size → faster experiments. Typical reductions are 30–70%, which translates directly into 30–70% smaller required sample. @@ -83,7 +91,7 @@ Primary metric is Bernoulli (conversion rate)? │ └── No → CUPED OFF └── No (continuous / count / retention) Heavy-tailed distribution with outliers (revenue, time-on-page, session length)? - ├── Yes → Winsorization ON (platform default percentile, typically 95) + ├── Yes → Winsorization ON (default `percentile=5`, i.e. cap each 5% tail) └── No → Winsorization OFF Does it correlate with pre-exposure behaviour of existing users? ├── Yes → CUPED ON (if 2–4 week pre-exposure window available, no new-user cohort) @@ -98,6 +106,6 @@ Primary count ≥ 2 OR non-control variants ≥ 2? - ⛔ **CUPED on a new-user-only experiment.** No pre-exposure data; the feature does nothing. Worse, the user thinks they're being protected and ships an underpowered test. - ⛔ **Winsorization on a conversion metric.** Capping 0/1 values is meaningless. The setting either no-ops or, if a buggy implementation interprets it literally, makes the metric worse. -- ⛔ **Winsorization at a percentile below ~80.** Cuts more than 20% of data. Almost always a typo for 95 or 90. Confirm intent. +- ⛔ **Winsorization at a `percentile` above ~20.** Caps more than 20% of each tail — throws away too much signal. Almost always a misconfiguration. Confirm intent. - ⛔ **Multiple testing correction OFF on a 5-primary test.** Family-wise FPR balloons to ~22.6%. One in five "wins" is noise. - ⛔ **CUPED enabled "to be safe" on a metric where pre-exposure doesn't predict post-exposure.** Best case: no effect. Common case: the variance estimate gets noisier because the regression adjustment is fitting to noise. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/health-check-interpretation.md index 82a09b7..818b163 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/health-check-interpretation.md @@ -2,6 +2,18 @@ Turn the platform's already-computed health verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action. +## Contents + +- Kohavi framing — always cite when a health check fails +- 1. SRM (Sample Ratio Mismatch) +- 2. Retro A/A (pre-experiment bias) failure +- 3. Insufficient exposures +- 4. Frequentist peeking +- 5. Live computation timeout / broken data +- 6. Experiment ran < 3 days +- 7. Misconfigurations +- Output shape when a health check fails + --- ## Kohavi framing — always cite when a health check fails diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/hypothesis-framing.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/hypothesis-framing.md index 8c115af..40cf56e 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/hypothesis-framing.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/hypothesis-framing.md @@ -2,6 +2,15 @@ All four properties of a good hypothesis — falsifiable, directional, mechanistic, bounded in time — matter. Drop any one and the design downstream silently degrades. +## Contents + +- The shape +- When the user gives you a one-liner +- Mechanism → metric class +- Hypothesis ↔ metric alignment +- When to push back +- Worked examples + ## The shape > **If** ``, **then** `` will ``, **because** ``. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/per-metric-interpretation.md index e863069..320cb8e 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/per-metric-interpretation.md @@ -1,6 +1,23 @@ # Per-Metric Interpretation -Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single row of `summary` actually mean?"_ +Translate a metric's lift, confidence interval, and p-value into a plain-language verdict — i.e. _"what does this single result row actually mean?"_ + +## Contents + +- The mental model +- Polarity recipe +- Reading the p-value in this platform +- Reading the lift correctly +- Verdict phrasing — a small palette +- Magnitude — make it absolute +- Twyman's Law in practice — changed-denominator lifts +- Metric distribution types +- Variance-reduction & outlier settings that change interpretation +- Multiple comparisons & metric tiers — what's decisional and what isn't +- When a primary metric is inconclusive +- Frequentist vs Sequential — what affects per-metric reading +- Triggered analysis & dilution +- Novelty and primacy --- @@ -19,7 +36,7 @@ A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any ## Polarity recipe -Apply the polarity recipe from the spine — see the **Components** section of `SKILL.md`. Treat the bucket name in `summary.positive` / `summary.negative` as sign-of-lift only; the business verdict comes from combining it with `metric.direction`. Examples worth remembering: +Treat the bucket name (the positive / negative / no grouping) as sign-of-lift only; the business verdict comes from combining that sign with the metric's **Direction** (defined in the Shared glossary in `SKILL.md`). A positive-sign movement on a `down`-direction metric is a regression, not a win. Examples worth remembering: - A row in `summary.positive` with `direction: "down"` is a **regression**. - A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1% interstitials_shown` lift means less interruption). diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/routing-xp-vs-ff.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/routing-xp-vs-ff.md index d077dba..7e13471 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/routing-xp-vs-ff.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/routing-xp-vs-ff.md @@ -51,7 +51,7 @@ If the user has already shipped to 100% and wants to "analyse the effect," there ### "Just give me an A/B test, the simplest one" -Probably an experiment. But "simplest" usually means "skip hypothesis, skip MDE, skip guardrails," which kills the test's interpretability. Coach the user through Step 1 (hypothesis) and Step 2 (metrics) of the main workflow — the cost is 10 minutes; the value is having a result you can actually act on. +Probably an experiment. But "simplest" usually means "skip hypothesis, skip MDE, skip guardrails," which kills the test's interpretability. Coach the user through the hypothesis and metric-selection steps of the `design` command — the cost is 10 minutes; the value is having a result you can actually act on. ### "I want a feature flag but with stats" @@ -61,7 +61,7 @@ Now you're back to an experiment. Run the full setup workflow. ### If experiment -Continue with the four-step setup workflow in the main `SKILL.md`. The output of this skill is a configured experiment ready to launch. +Continue with the `design` command's setup workflow. The output is a configured experiment ready to launch. ### If feature flag diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/segment-of-interest-selection.md index 4db49ac..d290517 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/segment-of-interest-selection.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/segment-of-interest-selection.md @@ -4,6 +4,15 @@ Pick 3–5 segments **likely to reveal a real effect difference** before slicing The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them. +## Contents + +- Why this matters: the fishing-expedition problem +- The decision tree for picking segments +- Sanity checks before committing to a slice +- How many slices to commit to +- The pre-commit ritual +- Then read the results + --- ## Why this matters: the fishing-expedition problem diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/session-replay-analysis.md index 7282bb4..11d5158 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/session-replay-analysis.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/session-replay-analysis.md @@ -4,6 +4,15 @@ Turn a quantitative experiment result into a behavior story using session replay > **Scope boundary.** This skill provides the _interpretation_ guidance for replay analysis. Actually fetching replay IDs for control vs treatment cohorts is a separate platform capability. If replay fetching isn't available in the current environment, say so to the user and recommend the manual flow: pull replays via the experiment's "View replays" UI for each variant, then bring the IDs back to discuss. +## Contents + +- When replays help, when they don't +- Cohort selection: which replays to compare +- What to actually watch for +- How to frame the findings +- What NOT to do +- Output shape + --- ## When replays help, when they don't diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/sizing.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/sizing.md index 7c41c9a..8a057d9 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/sizing.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/sizing.md @@ -2,6 +2,20 @@ You almost never know the right sample size by guessing. Pull the data first, then run the math. +## Contents + +- The standard formula +- Variance by metric type +- Worked example +- Kohavi's inverted formula +- Achievable MDE for a running experiment (diagnosis form) +- Estimating the inputs from real data +- Five remediations when the experiment is underpowered +- Sample-size floor +- Lookup table (Bernoulli, 95% conf, 80% power) +- Sample-size growth with variants +- Duration considerations + ## The standard formula Required sample size per variant (two-sample, two-sided test at 95% confidence, 80% power): @@ -48,6 +62,26 @@ This tells the user: "given your traffic, the smallest effect you can reliably d Underpowered experiments suffer from **winner's curse**: if you do reach significance, the lift estimate is exaggerated, because only the high-variance positive realisations crossed the threshold. The post-launch result then fails to replicate, and the team learns "experiments are unreliable" rather than "this experiment was underpowered." +## Achievable MDE for a running experiment (diagnosis form) + +When diagnosing a live experiment that hasn't hit significance (see [why-no-statsig.md](why-no-statsig.md)), you want the achievable MDE as a **relative** lift so it compares directly against the reported `lift`. Two unit traps make this wrong more often than not: + +- `MDE = 4σ / √n` above is **absolute** (metric units). Divide by the baseline to get a relative fraction: + + ``` + MDE_relative = 4σ / (baseline × √n) + ``` + +- Use `σ`, the **standard deviation** — `σ = √σ²`. Plugging the variance `σ²` in directly overstates the MDE by a factor of `√σ²`. For a Bernoulli metric, `σ = √(p(1−p))`, not `p(1−p)`. + +Now compare apples to apples: both `MDE_relative` and the platform's `lift` are relative fractions. If `|lift| < MDE_relative`, the observed effect is below the detection floor at the current sample — it may be real, but the experiment was sized for a larger one. Invert the required-sample formula to quote the multiplier ("~3× more exposures"): + +``` +n_required = 16 × σ² / (baseline × MDE_target)² +``` + +These formulas explain _why_ the platform returned `significance = NO`; they do **not** override it. Never recompute or restate the platform's significance verdict from them. + ## Estimating the inputs from real data For each primary metric, before sizing, you need three numbers: diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/statistical-model.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/statistical-model.md index 771a208..1d832e0 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/statistical-model.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/statistical-model.md @@ -2,6 +2,14 @@ Once required sample size and acceptable duration are known, two configuration choices are left: the **testing model** (sequential vs frequentist) and the **end condition** (sample-based vs date-based). Two adjacent choices change how the tests are interpreted: **confidence level** and **multiple-testing correction**. +## Contents + +- Testing model: sequential vs frequentist +- End condition: sample-based vs date-based +- Confidence level +- Multiple testing correction +- Power vs significance trade-off + ## Testing model: sequential vs frequentist **Default to sequential** for most users. Peeking is the most common Mixpanel customer mistake, and sequential testing makes early-look safe by design. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/why-no-statsig.md index 22ca2fe..4995f62 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/why-no-statsig.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/why-no-statsig.md @@ -4,6 +4,16 @@ Help the user decide between **wait**, **extend**, **boost power**, **narrow the The actual stop / extend math (sample size, power, MDE) lives in [sizing.md](sizing.md) — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one. +## Contents + +- First, rule out a broken result +- Pull the diagnostic inputs first +- The five real reasons an experiment hasn't hit statsig +- When several reasons fire: name the single most-likely blocker +- Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL? +- What NOT to suggest +- Output shape + --- ## First, rule out a broken result @@ -23,6 +33,22 @@ Also check: --- +## Pull the diagnostic inputs first + +Before walking the reasons, fetch the experiment with its exposures and metric results included. Prefer the live results; if live computation failed, fall back to the cached results and flag the staleness — and never treat missing data as "no effect." Gather: + +- **Per-variant exposure counts** — the smallest arm is the binding constraint, not the total. +- **Control baseline** for each inconclusive primary (its rate or value). If it's missing, query the metric scoped to the control variant over the experiment's dates. +- **Observed lift** per primary — relative, `(treatment − control) / control`. Apply the polarity recipe before reading the sign. +- **Variance proxy by metric type** — Bernoulli `p(1−p)`, Poisson `mean`, Gaussian from the per-arm value and sample size (see the variance-by-metric-type table in [sizing.md](sizing.md)). +- **Configured MDE and end target** (sample-size or duration, whichever the experiment uses). If no MDE was set, ask the user for "the smallest lift worth shipping" — that's the operative MDE. +- **Configured traffic split vs the actual exposure ratio** — a skewed split bottlenecks the test on the smaller arm even when SRM didn't fail. +- **Overall exposure volume vs plan** (target per arm × arm count) — far below plan means exposures aren't flowing as configured (reason 5). + +For the closed-form power math (`n_required`, achievable `MDE_relative`), see [sizing.md](sizing.md) — those numbers explain the `NO` verdict; they don't override it. + +--- + ## The five real reasons an experiment hasn't hit statsig Walk through these in order. The first one that explains the picture is usually right. @@ -79,6 +105,24 @@ Never change traffic allocation mid-Frequentist test — it invalidates the SRM --- +## When several reasons fire: name the single most-likely blocker + +Most experiments trip one or two of the five reasons, not all five. Don't hand back the whole list — collapse to the single most-likely blocker using this precedence (highest firing rule wins) and tell the user _that one_, with numbers. + +| Most-likely blocker (priority order) | What to tell the user | +| ------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Exposures aren't flowing** (reason 5) | "You're capturing `` of an expected `` exposures — the pipeline isn't delivering most eligible users. Fix the flag / SDK emission before judging significance." Confirm with a query on the exposure event. | +| **Traffic split is silently skewed** (reason 4, no SRM failure) | "Your smallest arm has only `` exposures (vs ``), so the test is sample-bound by that arm. Rebalance the rollout (pause + restart, not mid-experiment) before adding time." | +| **Effect is real but below the detection floor** (reason 2 AND sample at/above target) | "You have enough traffic for the MDE this was sized for, but the observed lift (~`%`) is below the achievable MDE (~`%`). Detecting `%` needs ~``/arm — about ``× your current sample — or lower the MDE / accept the smaller effect." | +| **Underpowered for the configured MDE** (reason 1) | "You're sized for a `%` lift but only have ``/`` of the required per-arm sample. Extend ~`` days, raise allocation, or enable CUPED / Winsorization to claw back power." | +| **Variance inflated, reduction off, _and_ another blocker fires** (reason 3 + any of 1/2/4/5) | "Variance reduction is leaving power on the table — enabling Winsorization (default 5/95) or CUPED would tighten the CI before adding exposures." Stack this on the primary blocker; don't offer it alone. | +| **Variance reduction is the only lever** (reason 3 only; 1/2/4/5 clear) | "The experiment is well-sized and well-allocated, but variance reduction is off — enabling Winsorization (default 5/95) or CUPED could resolve the borderline result without more data. Try that before accepting the null." | +| **None of the above** — well-sized, well-allocated, exposures flowing, reduction on/NA, lift ≈ 0 | "Well-powered for the MDE that matters and the effect is genuinely near zero — this is a real null. Accept the null and ship the decision, or iterate on a stronger hypothesis." Quote the achievable-MDE numbers so the conclusion is trusted. | + +**Always quote the numbers** — "12k of 58k required per arm," "smallest arm 4.1k vs 8.2k expected," "~3× more exposures." Vague advice ("collect more data") is the failure mode this playbook exists to prevent. + +--- + ## Decision: WAIT, EXTEND, BOOST POWER, NARROW, or ACCEPT NULL? Once you know which reason fits, the recommendation almost picks itself. @@ -104,6 +148,7 @@ When recommending EXTEND on an active experiment, the action is to update the ex - ❌ **Add more primary metrics** to "fish" for a win — multiplies the family-wise FPR. If a single primary is inconclusive, more primaries make the picture worse, not better. - ❌ **Re-run identical hypothesis on the same audience right after concluding "no effect"** — without a power change, you'll get the same answer. - ❌ **Claim "no effect"** from an underpowered inconclusive result — the right framing is "the experiment wasn't sized to detect the effect we observed." +- ❌ **Hand back the generic option list** instead of naming the single most-likely blocker with numbers — that's the failure mode this playbook exists to prevent. --- From 46c6126f539645d392cdcb2d73e54adfc94f25c4 Mon Sep 17 00:00:00 2001 From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Date: Thu, 11 Jun 2026 21:52:34 +0000 Subject: [PATCH 06/11] manage-experiment: align prior-experiments lookup with overlap-ranking tool MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The Search-Prior-Experiments engine tool (analytics #95682) ranks stored experiments by overlap on the draft's metric set, flag key, and hypothesis rather than free-text keyword matching. Update the design command and the prior-experiments reference so they steer the agent to hand the tool the draft it's building instead of constructing single-keyword queries. Engine-agnostic wording — describes the signals, not tool names or field paths. Co-Authored-By: Claude Opus 4.8 --- .../mixpanel-mcp-eu/skills/manage-experiment/commands/design.md | 2 +- .../skills/manage-experiment/references/prior-experiments.md | 2 +- .../mixpanel-mcp-in/skills/manage-experiment/commands/design.md | 2 +- .../skills/manage-experiment/references/prior-experiments.md | 2 +- .../mixpanel-mcp/skills/manage-experiment/commands/design.md | 2 +- .../skills/manage-experiment/references/prior-experiments.md | 2 +- 6 files changed, 6 insertions(+), 6 deletions(-) diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md index adafd0f..b052290 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md @@ -53,7 +53,7 @@ Top-down: what to do, in order. **Route Experiment vs Feature Flag first.** Wants causal evidence (lift, ship/no-ship from data) → experiment. Wants progressive rollout, kill switch, or per-segment gating with no measurement criterion → feature flag (route to the `manage-feature-flags` skill). If ambiguous, ask once: _"Are you measuring whether this change moves a metric (experiment), or rolling it out gradually with no measurement criterion (feature flag)?"_ Deeper disambiguation in [../references/routing-xp-vs-ff.md](../references/routing-xp-vs-ff.md). -**Check for prior experiments on the same feature.** Search the project by keywords drawn from the feature name. If a prior-experiments lookup isn't available, say so explicitly — don't fabricate "no priors found." Surface anything you find: a same-feature ship suggests "don't re-run, iterate on a new hypothesis"; a prior kill is a strong prior the user has to argue past; an earlier iteration gives you reliable baseline and variance numbers that sharpen the new MDE. Fold-in playbook in [../references/prior-experiments.md](../references/prior-experiments.md). +**Check for prior experiments on the same feature.** Search the project using the draft's flag key, planned metrics, and hypothesis — the lookup ranks prior experiments by overlap on those signals, so pass it the draft rather than a lone keyword. If a prior-experiments lookup isn't available, say so explicitly — don't fabricate "no priors found." Surface anything you find: a same-feature ship suggests "don't re-run, iterate on a new hypothesis"; a prior kill is a strong prior the user has to argue past; an earlier iteration gives you reliable baseline and variance numbers that sharpen the new MDE. Fold-in playbook in [../references/prior-experiments.md](../references/prior-experiments.md). ### 2. Write the hypothesis diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/prior-experiments.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/prior-experiments.md index 60d083c..476a26c 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/prior-experiments.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/prior-experiments.md @@ -4,7 +4,7 @@ The first thing to do when a user proposes an experiment on a feature is look up ## The lookup -Search the project's prior experiments by keywords drawn from the feature name and surface area. Cast the net wide on the first call — single-keyword searches catch related experiments the user may have forgotten about. +Search the project's prior experiments using the draft you're building — its flag key, its planned metrics, and its hypothesis. The lookup ranks stored experiments by overlap on those signals (shared metrics weigh most, then flag key, then hypothesis wording), so hand it as much of the draft as you have rather than a single keyword. Cast the net wide on the first call — keep the similarity threshold low so adjacent experiments the user may have forgotten about still surface. If no prior-experiments lookup is available in the current environment, tell the user explicitly that you couldn't check and proceed. Don't fabricate "no prior tests found" — that's worse than admitting the blind spot. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md index adafd0f..b052290 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md @@ -53,7 +53,7 @@ Top-down: what to do, in order. **Route Experiment vs Feature Flag first.** Wants causal evidence (lift, ship/no-ship from data) → experiment. Wants progressive rollout, kill switch, or per-segment gating with no measurement criterion → feature flag (route to the `manage-feature-flags` skill). If ambiguous, ask once: _"Are you measuring whether this change moves a metric (experiment), or rolling it out gradually with no measurement criterion (feature flag)?"_ Deeper disambiguation in [../references/routing-xp-vs-ff.md](../references/routing-xp-vs-ff.md). -**Check for prior experiments on the same feature.** Search the project by keywords drawn from the feature name. If a prior-experiments lookup isn't available, say so explicitly — don't fabricate "no priors found." Surface anything you find: a same-feature ship suggests "don't re-run, iterate on a new hypothesis"; a prior kill is a strong prior the user has to argue past; an earlier iteration gives you reliable baseline and variance numbers that sharpen the new MDE. Fold-in playbook in [../references/prior-experiments.md](../references/prior-experiments.md). +**Check for prior experiments on the same feature.** Search the project using the draft's flag key, planned metrics, and hypothesis — the lookup ranks prior experiments by overlap on those signals, so pass it the draft rather than a lone keyword. If a prior-experiments lookup isn't available, say so explicitly — don't fabricate "no priors found." Surface anything you find: a same-feature ship suggests "don't re-run, iterate on a new hypothesis"; a prior kill is a strong prior the user has to argue past; an earlier iteration gives you reliable baseline and variance numbers that sharpen the new MDE. Fold-in playbook in [../references/prior-experiments.md](../references/prior-experiments.md). ### 2. Write the hypothesis diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/prior-experiments.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/prior-experiments.md index 60d083c..476a26c 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/prior-experiments.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/prior-experiments.md @@ -4,7 +4,7 @@ The first thing to do when a user proposes an experiment on a feature is look up ## The lookup -Search the project's prior experiments by keywords drawn from the feature name and surface area. Cast the net wide on the first call — single-keyword searches catch related experiments the user may have forgotten about. +Search the project's prior experiments using the draft you're building — its flag key, its planned metrics, and its hypothesis. The lookup ranks stored experiments by overlap on those signals (shared metrics weigh most, then flag key, then hypothesis wording), so hand it as much of the draft as you have rather than a single keyword. Cast the net wide on the first call — keep the similarity threshold low so adjacent experiments the user may have forgotten about still surface. If no prior-experiments lookup is available in the current environment, tell the user explicitly that you couldn't check and proceed. Don't fabricate "no prior tests found" — that's worse than admitting the blind spot. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md index adafd0f..b052290 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md @@ -53,7 +53,7 @@ Top-down: what to do, in order. **Route Experiment vs Feature Flag first.** Wants causal evidence (lift, ship/no-ship from data) → experiment. Wants progressive rollout, kill switch, or per-segment gating with no measurement criterion → feature flag (route to the `manage-feature-flags` skill). If ambiguous, ask once: _"Are you measuring whether this change moves a metric (experiment), or rolling it out gradually with no measurement criterion (feature flag)?"_ Deeper disambiguation in [../references/routing-xp-vs-ff.md](../references/routing-xp-vs-ff.md). -**Check for prior experiments on the same feature.** Search the project by keywords drawn from the feature name. If a prior-experiments lookup isn't available, say so explicitly — don't fabricate "no priors found." Surface anything you find: a same-feature ship suggests "don't re-run, iterate on a new hypothesis"; a prior kill is a strong prior the user has to argue past; an earlier iteration gives you reliable baseline and variance numbers that sharpen the new MDE. Fold-in playbook in [../references/prior-experiments.md](../references/prior-experiments.md). +**Check for prior experiments on the same feature.** Search the project using the draft's flag key, planned metrics, and hypothesis — the lookup ranks prior experiments by overlap on those signals, so pass it the draft rather than a lone keyword. If a prior-experiments lookup isn't available, say so explicitly — don't fabricate "no priors found." Surface anything you find: a same-feature ship suggests "don't re-run, iterate on a new hypothesis"; a prior kill is a strong prior the user has to argue past; an earlier iteration gives you reliable baseline and variance numbers that sharpen the new MDE. Fold-in playbook in [../references/prior-experiments.md](../references/prior-experiments.md). ### 2. Write the hypothesis diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/prior-experiments.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/prior-experiments.md index 60d083c..476a26c 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/prior-experiments.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/prior-experiments.md @@ -4,7 +4,7 @@ The first thing to do when a user proposes an experiment on a feature is look up ## The lookup -Search the project's prior experiments by keywords drawn from the feature name and surface area. Cast the net wide on the first call — single-keyword searches catch related experiments the user may have forgotten about. +Search the project's prior experiments using the draft you're building — its flag key, its planned metrics, and its hypothesis. The lookup ranks stored experiments by overlap on those signals (shared metrics weigh most, then flag key, then hypothesis wording), so hand it as much of the draft as you have rather than a single keyword. Cast the net wide on the first call — keep the similarity threshold low so adjacent experiments the user may have forgotten about still surface. If no prior-experiments lookup is available in the current environment, tell the user explicitly that you couldn't check and proceed. Don't fabricate "no prior tests found" — that's worse than admitting the blind spot. From 0b80668a589be10fc1ea2278a9dd013927c771d8 Mon Sep 17 00:00:00 2001 From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Date: Thu, 11 Jun 2026 22:04:05 +0000 Subject: [PATCH 07/11] manage-experiment: trim description under the 1024-char spec limit The description was 1326 chars (over the Agent Skills 1024 limit) because it enumerated ~15 verbatim trigger phrasings. Trim to a representative handful and fold Bonferroni/Benjamini-Hochberg into "multiple-testing correction"; the Canonical commands table in the body already carries the full keyword set. Now 921 chars. Synced across all three plugin variants. Co-Authored-By: Claude Opus 4.8 --- .../skills/manage-experiment/SKILL.md | 28 ++++++++----------- .../skills/manage-experiment/SKILL.md | 28 ++++++++----------- .../skills/manage-experiment/SKILL.md | 28 ++++++++----------- 3 files changed, 33 insertions(+), 51 deletions(-) diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md index 7bed770..cac3775 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md @@ -2,23 +2,17 @@ name: manage-experiment description: > Coach the user through any phase of a Mixpanel experiment — design (hypothesis - framing, metric selection, sizing, statistical model choice, advanced features - like CUPED / Winsorization / Bonferroni / Benjamini-Hochberg), launch (final - pre-launch readiness check and the irreversible launch action), monitor - (mid-flight safety: SRM, sample pace, guardrail-only peek, the don't-peek-on- - Frequentist rule, terminate-early decisions), and interpret after the - experiment is mature (read results, decide ship / iterate / kill / wait, - interpret health checks like SRM and Retro A/A, break results down by segment, - use session replays to explain a result). Use when the user mentions - experiment, A/B test, ship/kill decision, MDE, minimum detectable effect, - sample ratio mismatch, CUPED, sizing, statistical significance, lift, or any - phrasing like "set up an experiment", "design an A/B test", "launch this - experiment", "is it safe to keep this running", "is my experiment SRM-ing", - "should I peek", "how did experiment X do", "should we ship", "why isn't this - significant yet", "should this be sequential or fixed-horizon", "what's my - MDE", "is this experiment configured correctly", "audit my experiment". Do - NOT use for plain feature-flag rollouts with no measurement criterion — that - belongs to the `manage-feature-flags` skill. + framing, metric selection, sizing, statistical-model choice, advanced features + like CUPED / Winsorization / multiple-testing correction), launch (pre-launch + readiness check and the irreversible launch), monitor (mid-flight safety: SRM, + sample pace, guardrail peeks, terminate-early calls), and interpret (read + results, decide ship / iterate / kill / wait, read health checks like SRM and + Retro A/A, break down by segment, use session replays). Use when the user + mentions an experiment or A/B test, a ship/kill decision, MDE, sample ratio + mismatch, CUPED, or statistical significance, or asks things like "set up an + experiment", "is my experiment SRM-ing", "should we ship", "what's my MDE", or + "audit my experiment". Do NOT use for plain feature-flag rollouts with no + measurement criterion — that belongs to the `manage-feature-flags` skill. license: Apache-2.0 --- diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md index 7bed770..cac3775 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md @@ -2,23 +2,17 @@ name: manage-experiment description: > Coach the user through any phase of a Mixpanel experiment — design (hypothesis - framing, metric selection, sizing, statistical model choice, advanced features - like CUPED / Winsorization / Bonferroni / Benjamini-Hochberg), launch (final - pre-launch readiness check and the irreversible launch action), monitor - (mid-flight safety: SRM, sample pace, guardrail-only peek, the don't-peek-on- - Frequentist rule, terminate-early decisions), and interpret after the - experiment is mature (read results, decide ship / iterate / kill / wait, - interpret health checks like SRM and Retro A/A, break results down by segment, - use session replays to explain a result). Use when the user mentions - experiment, A/B test, ship/kill decision, MDE, minimum detectable effect, - sample ratio mismatch, CUPED, sizing, statistical significance, lift, or any - phrasing like "set up an experiment", "design an A/B test", "launch this - experiment", "is it safe to keep this running", "is my experiment SRM-ing", - "should I peek", "how did experiment X do", "should we ship", "why isn't this - significant yet", "should this be sequential or fixed-horizon", "what's my - MDE", "is this experiment configured correctly", "audit my experiment". Do - NOT use for plain feature-flag rollouts with no measurement criterion — that - belongs to the `manage-feature-flags` skill. + framing, metric selection, sizing, statistical-model choice, advanced features + like CUPED / Winsorization / multiple-testing correction), launch (pre-launch + readiness check and the irreversible launch), monitor (mid-flight safety: SRM, + sample pace, guardrail peeks, terminate-early calls), and interpret (read + results, decide ship / iterate / kill / wait, read health checks like SRM and + Retro A/A, break down by segment, use session replays). Use when the user + mentions an experiment or A/B test, a ship/kill decision, MDE, sample ratio + mismatch, CUPED, or statistical significance, or asks things like "set up an + experiment", "is my experiment SRM-ing", "should we ship", "what's my MDE", or + "audit my experiment". Do NOT use for plain feature-flag rollouts with no + measurement criterion — that belongs to the `manage-feature-flags` skill. license: Apache-2.0 --- diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md index 7bed770..cac3775 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md @@ -2,23 +2,17 @@ name: manage-experiment description: > Coach the user through any phase of a Mixpanel experiment — design (hypothesis - framing, metric selection, sizing, statistical model choice, advanced features - like CUPED / Winsorization / Bonferroni / Benjamini-Hochberg), launch (final - pre-launch readiness check and the irreversible launch action), monitor - (mid-flight safety: SRM, sample pace, guardrail-only peek, the don't-peek-on- - Frequentist rule, terminate-early decisions), and interpret after the - experiment is mature (read results, decide ship / iterate / kill / wait, - interpret health checks like SRM and Retro A/A, break results down by segment, - use session replays to explain a result). Use when the user mentions - experiment, A/B test, ship/kill decision, MDE, minimum detectable effect, - sample ratio mismatch, CUPED, sizing, statistical significance, lift, or any - phrasing like "set up an experiment", "design an A/B test", "launch this - experiment", "is it safe to keep this running", "is my experiment SRM-ing", - "should I peek", "how did experiment X do", "should we ship", "why isn't this - significant yet", "should this be sequential or fixed-horizon", "what's my - MDE", "is this experiment configured correctly", "audit my experiment". Do - NOT use for plain feature-flag rollouts with no measurement criterion — that - belongs to the `manage-feature-flags` skill. + framing, metric selection, sizing, statistical-model choice, advanced features + like CUPED / Winsorization / multiple-testing correction), launch (pre-launch + readiness check and the irreversible launch), monitor (mid-flight safety: SRM, + sample pace, guardrail peeks, terminate-early calls), and interpret (read + results, decide ship / iterate / kill / wait, read health checks like SRM and + Retro A/A, break down by segment, use session replays). Use when the user + mentions an experiment or A/B test, a ship/kill decision, MDE, sample ratio + mismatch, CUPED, or statistical significance, or asks things like "set up an + experiment", "is my experiment SRM-ing", "should we ship", "what's my MDE", or + "audit my experiment". Do NOT use for plain feature-flag rollouts with no + measurement criterion — that belongs to the `manage-feature-flags` skill. license: Apache-2.0 --- From e212dcc8028be09323e2989d1d0fa2b3d53156cd Mon Sep 17 00:00:00 2001 From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Date: Tue, 16 Jun 2026 01:08:06 +0000 Subject: [PATCH 08/11] manage-experiment: address skill-review findings across SKILL + references Fixes from the review-skill rubric pass and the per-reference review: - DRY/portability: de-link all reference->reference chains so references stay one level deep; strip tool-docs leakage (significance enum, summary.* buckets, liftConfidence, decide constants) to intent-level. - Centralise the Winsorization >~20% push-back rule in the SKILL.md glossary; centralise the polarity recipe in the interpret command; both now referenced rather than restated. - Source defaults: qualify the Benjamini-Hochberg platform default, CUPED 30-70% range, and the `up` direction default with "verify". - Accuracy fixes in references: correct the family-wise FPR claim (3-arm -> 4-arm), drop in-file Winsorization duplication, fix the "constants"->"choices" dangling referent in lifecycle-handoff, soften absolutist hypothesis-framing phrasing, de-changelog the segment-breakdown platform-status note. Applied identically across the mixpanel-mcp, -eu, and -in variants. Co-Authored-By: Claude Opus 4.8 --- .../skills/manage-experiment/SKILL.md | 6 ++-- .../manage-experiment/commands/design.md | 4 +-- .../manage-experiment/commands/interpret.md | 20 +++++++------ .../references/advanced-features.md | 4 +-- .../references/health-check-interpretation.md | 2 +- .../references/hypothesis-framing.md | 2 +- .../references/lifecycle-handoff.md | 8 +++--- .../references/metric-selection.md | 8 +++--- .../references/per-metric-interpretation.md | 28 +++++++++---------- .../manage-experiment/references/pitfalls.md | 2 +- .../references/prior-experiments.md | 2 +- .../references/routing-xp-vs-ff.md | 2 +- .../segment-breakdown-interpretation.md | 12 +++----- .../segment-of-interest-selection.md | 4 +-- .../references/session-replay-analysis.md | 2 +- .../manage-experiment/references/sizing.md | 12 ++++---- .../references/statistical-model.md | 4 +-- .../references/why-no-statsig.md | 12 ++++---- .../skills/manage-experiment/SKILL.md | 6 ++-- .../manage-experiment/commands/design.md | 4 +-- .../manage-experiment/commands/interpret.md | 20 +++++++------ .../references/advanced-features.md | 4 +-- .../references/health-check-interpretation.md | 2 +- .../references/hypothesis-framing.md | 2 +- .../references/lifecycle-handoff.md | 8 +++--- .../references/metric-selection.md | 8 +++--- .../references/per-metric-interpretation.md | 28 +++++++++---------- .../manage-experiment/references/pitfalls.md | 2 +- .../references/prior-experiments.md | 2 +- .../references/routing-xp-vs-ff.md | 2 +- .../segment-breakdown-interpretation.md | 12 +++----- .../segment-of-interest-selection.md | 4 +-- .../references/session-replay-analysis.md | 2 +- .../manage-experiment/references/sizing.md | 12 ++++---- .../references/statistical-model.md | 4 +-- .../references/why-no-statsig.md | 12 ++++---- .../skills/manage-experiment/SKILL.md | 6 ++-- .../manage-experiment/commands/design.md | 4 +-- .../manage-experiment/commands/interpret.md | 20 +++++++------ .../references/advanced-features.md | 4 +-- .../references/health-check-interpretation.md | 2 +- .../references/hypothesis-framing.md | 2 +- .../references/lifecycle-handoff.md | 8 +++--- .../references/metric-selection.md | 8 +++--- .../references/per-metric-interpretation.md | 28 +++++++++---------- .../manage-experiment/references/pitfalls.md | 2 +- .../references/prior-experiments.md | 2 +- .../references/routing-xp-vs-ff.md | 2 +- .../segment-breakdown-interpretation.md | 12 +++----- .../segment-of-interest-selection.md | 4 +-- .../references/session-replay-analysis.md | 2 +- .../manage-experiment/references/sizing.md | 12 ++++---- .../references/statistical-model.md | 4 +-- .../references/why-no-statsig.md | 12 ++++---- 54 files changed, 198 insertions(+), 204 deletions(-) diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md index cac3775..ae87c0c 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md @@ -76,9 +76,9 @@ Terms all four commands use without redefining. Phase-specific terms (hypothesis - **Direction.** Whether bigger is better for a metric (`up`) or smaller is better (`down`). Cancel / error / latency / abandon / refund metrics need `down` set explicitly — leaving the default silently flips polarity at interpretation. - **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict. - **MDE (Minimum Detectable Effect).** The smallest lift the experiment is sized to detect. Set during design, enforced at interpretation. -- **CUPED.** Variance-reduction technique using pre-exposure baseline. Cuts required sample 30–70% when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. -- **Winsorization.** Outlier capping applied pooled across variants. The `percentile` field is the **tail width** to cap on each side (default `5` caps below the 5th and above the 95th — i.e. the 5% tails). The schema rejects `percentile` ≥ 50. Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. -- **Multiple-testing correction.** Adjusts per-test significance threshold when several primaries or several non-control variants are tested together. Default Benjamini-Hochberg; Bonferroni for strict family-wise control. +- **CUPED.** Variance-reduction technique using pre-exposure baseline. Cuts required sample 30–70% (typical, empirical range) when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. +- **Winsorization.** Outlier capping applied pooled across variants. The `percentile` field is the **tail width** to cap on each side (default `5` caps below the 5th and above the 95th — i.e. the 5% tails). The schema rejects `percentile` ≥ 50. **Push back on tail widths above ~20%** — capping more than a fifth of each side discards too much signal; this is the canonical push-back rule the commands inherit. Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. +- **Multiple-testing correction.** Adjusts per-test significance threshold when several primaries or several non-control variants are tested together. Default Benjamini-Hochberg (platform default — verify current); Bonferroni for strict family-wise control. ## Reference files diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md index b052290..2d50ee0 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md @@ -77,7 +77,7 @@ Every primary and guardrail needs an explicit `direction`. Watch for the **laggi Pull baseline rate, variance, and daily traffic from Mixpanel. Don't guess. -Use the formulas in **Components**. Then compare the required sample to what the available traffic delivers inside an acceptable window (typically 2–4 weeks). If the achievable MDE exceeds the user's expected lift, the experiment is **underpowered** — surface immediately. Don't wave it through; offer the remediations from the sizing reference (accept a larger MDE → increase allocation → enable CUPED → pick a higher-volume primary → don't run). +Use the formulas in **Components**. Then compare the required sample to what the available traffic delivers inside an acceptable window (typically 2–4 weeks). If the achievable MDE exceeds the user's expected lift, the experiment is **underpowered** — surface immediately. Don't wave it through; offer the remediations from the sizing reference, in cost order (cheapest first). Sample-size floor: keep per-variant target above the platform's reliability floor (verify in product — historically ~350–400). Below the floor, the central limit theorem breaks down and the SRM check gets noisy. Full worked examples, baseline-by-rate lookup table, and the duration / seasonality rules in [../references/sizing.md](../references/sizing.md). @@ -95,7 +95,7 @@ Decision tree, the peeking-trap explanation, worked compounding-FPR numbers, and ### 6. Decide on advanced features - **CUPED** — enable when the primary metric correlates with pre-exposure behaviour AND all experiment users existed before start AND 2–4 weeks of stable pre-exposure history is available. Do not enable on new-user-only experiments, one-time-event metrics, or brand-new metrics. -- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. The `percentile` field is the tail width to cap (default `5` = 5% tails); push back if the user sets a percentile above ~20 — more than 20% of values capped on each side throws away too much signal. +- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. The tail-width setting defaults to 5 (5% tails); apply the umbrella glossary's Winsorization push-back rule (don't cap tails above ~20%). When/why each is right and the common misconfigurations are in [../references/advanced-features.md](../references/advanced-features.md). diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/interpret.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/interpret.md index 5bbfebb..806729d 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/interpret.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/interpret.md @@ -9,7 +9,7 @@ The umbrella `SKILL.md` defines the shared glossary (Variant, Primary/Guardrail/ ## Glossary (interpret-specific) - **Polarity.** Whether a movement is _good for the business_. Combines sign of lift with the metric's `direction` ("up" = bigger is better; "down" = smaller is better). See the **Polarity recipe** in Components. -- **Significance.** The platform's per-row classification: `YES_POSITIVE`, `YES_NEGATIVE`, or `NO`. Read from the response — do not recompute. +- **Significance.** The platform's per-row classification: significant-positive, significant-negative, or not-significant. Read it from the result — do not recompute. - **SRM (Sample Ratio Mismatch).** Variants received traffic in proportions that disagree with the configured split. **Kohavi's #1 trustworthiness check** — when SRM fails, downstream lift, p-values, and CIs cannot be trusted. - **Retro A/A (pre-experiment bias).** Re-runs the comparison on the pre-exposure period. A failure means cohorts already differed before treatment started. - **Twyman's Law.** "Any unusually clean or unusually large result is more likely a bug than a discovery." Apply on lifts > ~30% — usually a changed-denominator artifact. @@ -21,15 +21,17 @@ The umbrella `SKILL.md` defines the shared glossary (Variant, Primary/Guardrail/ ### Polarity recipe (load-bearing — apply on every metric row) -The platform's summary buckets (`positive` / `negative` / `no`) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion. +This is the **canonical polarity recipe** for the skill — the interpret references point back here instead of restating it. -Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"): +The platform's result buckets (positive / negative / no-effect) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion. -- `lift is None` or `lift == 0` → **neutral** (no measurement / no effect respectively). -- `direction == "up"` → **positive** if `lift > 0`, else **negative**. -- `direction == "down"` → **positive** if `lift < 0`, else **negative**. +Given a row's lift and the metric's direction ("up" = bigger is better, "down" = smaller is better; defaults to "up"): -A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control. +- Lift missing or exactly zero → **neutral** (no measurement / no effect respectively). +- Direction "up" → **positive** if lift > 0, else **negative**. +- Direction "down" → **positive** if lift < 0, else **negative**. + +A positive-bucket row on a "down" metric is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control. The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**. @@ -47,7 +49,7 @@ Experiment-details has two parallel data paths — live (preferred) and cached. | Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [../references/why-no-statsig.md](../references/why-no-statsig.md)). | | Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [../references/health-check-interpretation.md](../references/health-check-interpretation.md). | -For multi-variant tests, special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), and the exact decide-call shape, see [../references/lifecycle-handoff.md](../references/lifecycle-handoff.md). +For multi-variant tests, the special success-without-a-single-variant choices (ship-without-a-variant, defer-the-decision), and the exact decide-call shape, see [../references/lifecycle-handoff.md](../references/lifecycle-handoff.md). --- @@ -77,7 +79,7 @@ Apply the **polarity recipe** from Components to each non-control variant × pri #### 2c. Guardrail check -Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a row in `summary.positive` for a `direction: "down"` guardrail is still a regression. +Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a positive-bucket row for a "down" guardrail is still a regression. #### 2d. Practical significance diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/advanced-features.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/advanced-features.md index ffcf1d3..797eac6 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/advanced-features.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/advanced-features.md @@ -74,7 +74,7 @@ For very heavy tails (extreme whale distributions), `percentile=1` (cap each 1% ## Multiple testing correction — Bonferroni vs Benjamini-Hochberg -Covered in detail in [statistical-model.md](statistical-model.md). The short version: +Covered in detail in the statistical-model reference. The short version: - Enable when there are ≥2 primaries OR ≥2 non-control variants. - Default to Benjamini-Hochberg. More powerful with correlated primaries. @@ -106,6 +106,6 @@ Primary count ≥ 2 OR non-control variants ≥ 2? - ⛔ **CUPED on a new-user-only experiment.** No pre-exposure data; the feature does nothing. Worse, the user thinks they're being protected and ships an underpowered test. - ⛔ **Winsorization on a conversion metric.** Capping 0/1 values is meaningless. The setting either no-ops or, if a buggy implementation interprets it literally, makes the metric worse. -- ⛔ **Winsorization at a `percentile` above ~20.** Caps more than 20% of each tail — throws away too much signal. Almost always a misconfiguration. Confirm intent. +- ⛔ **Winsorization tail width above ~20%.** Almost always a misconfiguration — see Percentile guidance above. Confirm intent. - ⛔ **Multiple testing correction OFF on a 5-primary test.** Family-wise FPR balloons to ~22.6%. One in five "wins" is noise. - ⛔ **CUPED enabled "to be safe" on a metric where pre-exposure doesn't predict post-exposure.** Best case: no effect. Common case: the variance estimate gets noisier because the regression adjustment is fitting to noise. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/health-check-interpretation.md index 818b163..b5f82a7 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/health-check-interpretation.md @@ -96,7 +96,7 @@ The same statistical comparison run on the **pre-exposure** period revealed that 4. If the experiment is still ACTIVE: extend duration via an experiment update with a new end target. 5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math. -If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question. +If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to the why-no-statsig playbook (the interpret command links it) — different question. --- diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/hypothesis-framing.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/hypothesis-framing.md index 40cf56e..8f3476e 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/hypothesis-framing.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/hypothesis-framing.md @@ -20,7 +20,7 @@ All four properties of a good hypothesis — falsifiable, directional, mechanist | **Falsifiable** | Could the data say "no"? | "Improving UX" can't be falsified. "Increasing weekly retention by ≥2pp" can. | | **Directional** | Is the predicted change up or down? | "Affecting cart size" leaves the polarity ambiguous; the system defaults to `direction: "up"` and the interpretation step misreads regressions as wins. | | **Mechanistic** | What's the proposed causal chain? | "Because users will see X and decide Y" is a mechanism. "We think it'll work" is not. Without a mechanism, the team can't tell when the metric they picked is actually downstream of the change. | -| **Bounded in time** | Does the predicted effect occur within a measurable window? | Day-30 LTV claims need a ≥30-day experiment. A 2-week test on a 30-day metric guarantees an inconclusive result on the real effect plus a high chance of reaching false significance from noise. | +| **Bounded in time** | Does the predicted effect occur within a measurable window? | Day-30 LTV claims need a ≥30-day experiment. A 2-week test on a 30-day metric can't measure the real effect (the metric isn't mature yet) and invites a noise-driven false read. | ## When the user gives you a one-liner diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/lifecycle-handoff.md index 3a9e24c..bf548fa 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/lifecycle-handoff.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/lifecycle-handoff.md @@ -13,15 +13,15 @@ Concluding an experiment is **irreversible**. Before invoking the decide action, A decide call expresses three things: 1. **Did the experiment succeed?** A win for one of the treatments, or a deliberate stop. -2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special constants below. +2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special choices below. 3. **Why?** A rationale message — what metrics were evaluated, the polarity reading, the tradeoffs accepted. The platform requires this on every decide call; treat it as a one-paragraph decision record, not a placeholder. ## Special variant choices for success When you have a winning result but no single variant to ship: -- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. (The platform exposes this as the constant `__no_variant_shipped__`.) -- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The platform exposes this as `__defer_variant_decision__` and shows `SUCCESS_DEFERRED` in the UI.) +- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. (The decide action exposes a dedicated "ship without a variant" choice; the tool layer supplies the exact value.) +- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The decide action exposes a "defer the variant decision" choice; the UI shows the experiment as deferred.) When the verdict is KILL — no winner — record success as false. No variant key is needed in that case. @@ -30,7 +30,7 @@ When the verdict is KILL — no winner — record success as false. No variant k For a 3+ arm test, the decide action still names a single winning variant. If two treatments are roughly tied: - If both clear the practical-significance bar and shipping either is acceptable, pick on simplicity (smaller diff from control, lower implementation cost). -- If the team genuinely cannot pick, use the defer constant above — better than fabricating a winner. +- If the team genuinely cannot pick, use the defer-the-decision choice above — better than fabricating a winner. A multi-variant test where only one treatment is significantly different from control is a clean SHIP for that variant; the inconclusive arms are simply not the winner. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/metric-selection.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/metric-selection.md index 7088e3c..1c79f6d 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/metric-selection.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/metric-selection.md @@ -7,7 +7,7 @@ Each metric serves exactly one of three roles. The hypothesis tells you which. The metrics whose movement decides ship / no-ship. They come straight from the hypothesis's "outcome will ``" clause. - **Cap at 3.** Each additional primary inflates the family-wise false-positive rate. With multiple-testing correction enabled (which is the right default at 2+ primaries), more primaries → tighter per-metric threshold → harder to detect any individual effect. Beyond 3 the math punishes you regardless of how well the test is run. -- **Explicit direction.** Every primary needs `up` or `down`. The platform's default is `up`, which is wrong for cancel / error / latency / abandon / refund metrics. Setting it explicitly at setup time is the only way to keep the polarity correct through interpretation. +- **Explicit direction.** Every primary needs `up` or `down`. The platform's default is `up` (verify in product), which is wrong for cancel / error / latency / abandon / refund metrics. Setting it explicitly at setup time is the only way to keep the polarity correct through interpretation. - **Leading, not lagging.** A primary must be able to actually move within the planned experiment window. Match the metric's response window to the experiment's duration: - Onboarding-screen change → activation in the first session, not Week-4 retention. - Checkout button A/B → checkout conversion, not 30-day LTV. @@ -20,11 +20,11 @@ If the user proposes a primary, sanity-check: - _Is this metric downstream of the change?_ (A pricing change cannot move "tutorial completion".) - _Does the metric exist for both control and treatment users?_ If the change creates new events that don't exist in control, lift is artificially infinite (changed-denominator). - _Is the metric's response window shorter than the experiment's duration?_ If not, the metric is lagging — pick a leading proxy. -- _Does the metric have enough volume to detect the expected lift?_ (Cross-reference `references/sizing.md`.) +- _Does the metric have enough volume to detect the expected lift?_ (Volume drives the sizing math.) ## Guardrail metrics (0+, strongly recommended) -Metrics that **must not regress**, even if primaries win. The trustworthiness backstop on a ship decision: a 5% relative regression on any guardrail blocks ship even if the primary wins. This is the **>5% guardrail hard-gate**, and it's the most important single rule in the pitfall catalogue. +Metrics that **must not regress**, even if primaries win. The trustworthiness backstop on a ship decision: a 5% relative regression on any guardrail blocks ship even if the primary wins. This is the **>5% guardrail hard-gate** — the umbrella owns the threshold (rationale in the pitfall catalogue), and it's the most important single rule there. Standard guardrails by domain — pick at least one from the row that matches the change: @@ -49,7 +49,7 @@ Metrics for understanding **why** the primary moved, not for the ship decision. > **Setup misconfiguration to flag.** If the user's hypothesis text names a metric that they then classify as secondary, ask: > _"You mentioned `` in your hypothesis. Should this be a primary metric? Secondary metrics don't influence ship/no-ship decisions, so if it matters for the outcome, promote it."_ -This is the **Hypothesis ↔ metric mismatch** pitfall — see [pitfalls.md](pitfalls.md). +This is the **Hypothesis ↔ metric mismatch** pitfall in the pre-launch pitfall catalogue. ## Sanity checklist diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/per-metric-interpretation.md index 320cb8e..e88bc5d 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/per-metric-interpretation.md @@ -23,12 +23,12 @@ Translate a metric's lift, confidence interval, and p-value into a plain-languag ## The mental model -Each row in `summary.positive` / `summary.negative` / `summary.no` answers four questions: +Each row (positive / negative / no-effect bucket) answers four questions: -1. **Did the lift go up or down?** — the `summary` bucket name (sign-of-lift, not polarity). -2. **Was the change distinguishable from noise?** — the `significance` field (or the bucket name itself: rows in `summary.positive` / `summary.negative` are significant, rows in `summary.no` are not). -3. **Was the change in the goal direction?** — apply the polarity recipe with `metric.direction`. -4. **Was the change big enough to matter?** — multiply `lift` by the control baseline `value` to get absolute impact, then judge against business context. +1. **Did the lift go up or down?** — the bucket name (sign-of-lift, not polarity). +2. **Was the change distinguishable from noise?** — the significance classification (or the bucket name itself: positive / negative buckets are significant, the no-effect bucket is not). +3. **Was the change in the goal direction?** — apply the polarity recipe with the metric's direction. +4. **Was the change big enough to matter?** — multiply lift by the control baseline value to get absolute impact, then judge against business context. A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any one of those and you're shipping the wrong thing. @@ -36,10 +36,10 @@ A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any ## Polarity recipe -Treat the bucket name (the positive / negative / no grouping) as sign-of-lift only; the business verdict comes from combining that sign with the metric's **Direction** (defined in the Shared glossary in `SKILL.md`). A positive-sign movement on a `down`-direction metric is a regression, not a win. Examples worth remembering: +Apply the **canonical polarity recipe** (defined in the interpret command's Components): the bucket name is sign-of-lift only; the business verdict comes from combining that sign with the metric's **Direction** (Shared glossary in `SKILL.md`). Re-apply it on every row here — a positive-sign movement on a "down" metric is a regression, not a win. Examples worth remembering: -- A row in `summary.positive` with `direction: "down"` is a **regression**. -- A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1% interstitials_shown` lift means less interruption). +- A positive-bucket row on a "down" metric is a **regression**. +- A negative-bucket row on a "down" metric is a **win** (e.g. a -1% interstitials_shown lift means less interruption). --- @@ -47,7 +47,7 @@ Treat the bucket name (the positive / negative / no grouping) as sign-of-lift on Mixpanel runs a frequentist comparison at the experiment's configured confidence level — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). -The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread. +The platform-specific trap worth flagging: the confidence figure shown on each result row is the **confidence level used** (e.g. 0.95), **not the CI width**. Easy to misread. For the general meaning of a p-value (the probability under the null), trust the model's baseline knowledge — don't invent thresholds in either direction. @@ -59,8 +59,8 @@ For the general meaning of a p-value (the probability under the null), trust the lift = (treatment_mean - control_mean) / control_mean ``` -- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct. -- If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect." +- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the reported lift is correct. +- If a row's lift is missing, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect." --- @@ -133,7 +133,7 @@ Different metric types behave differently; cite the relevant nuance in your verd ## Variance-reduction & outlier settings that change interpretation - **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). -- **Winsorization enabled**: extreme values capped at both tails, pooled across variants. The `percentile` field is the tail width (default `5` = 5% tails). Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much higher than the default — capping more than ~20% of each tail — is a misconfiguration; see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). +- **Winsorization enabled**: extreme values capped at both tails, pooled across variants. The tail-width setting defaults to 5 (5% tails). Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A much higher tail width — capping more than ~20% of each side — is a misconfiguration; see the Misconfigurations notes in the health-check interpretation reference. --- @@ -153,7 +153,7 @@ If multiple-testing correction is off AND there are 2+ primaries × 1+ non-contr A "not significant" verdict means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level — **not that there is no effect.** Important when the user is about to call something a null result. -For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see [why-no-statsig.md](why-no-statsig.md). +For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see the why-no-statsig playbook. --- @@ -161,7 +161,7 @@ For the full walk-through on what to do about it (wait, extend, boost power, nar Concluding a Frequentist experiment before it reaches its configured target is a peeking event — per-metric significance verdicts become unreliable. Sequential experiments are designed for continuous monitoring and don't have this problem. -For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of [health-check-interpretation.md](health-check-interpretation.md). +For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of the health-check interpretation reference. --- diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/pitfalls.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/pitfalls.md index c20eb30..2d6b25a 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/pitfalls.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/pitfalls.md @@ -53,7 +53,7 @@ If the team genuinely wants to make that trade, they can disable the guardrail b ### High variance, no Winsorization — warning -**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the default `percentile=5` (cap each 5% tail). Push back if the user sets a `percentile` above ~20 — more than 20% of values capped on each side is almost always a misconfiguration. +**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the default tail width (5% tails). Push back on tail widths above ~20% — see the umbrella glossary's Winsorization push-back rule. ### Multiple primaries, no correction — warning diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/prior-experiments.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/prior-experiments.md index 476a26c..c5fbb42 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/prior-experiments.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/prior-experiments.md @@ -53,7 +53,7 @@ Concretely, when you have a prior result that's relevant, the setup workflow cha | Step 1 — hypothesis | Coach from scratch | Anchor on the prior's hypothesis; ask what's different | | Step 2 — metric selection | Suggest standard primaries/guardrails | Use the prior's metric set as the default; modify only with reason | | Step 3 — sizing | Query baseline + variance over the prior window | Use the prior's observed baseline and variance | -| Step 4 — statistical model | Default to sequential / benjamini-hochberg | If the prior used a specific model and the team is comparing across tests, keep the same model for comparability | +| Step 4 — statistical model | Default to sequential / Benjamini-Hochberg (verify current) | If the prior used a specific model and the team is comparing across tests, keep the same model for comparability | | Pitfall check | Run the standard catalogue | Cross-reference: did the prior have an SRM problem? A guardrail regression that should be set up as primary this time? | ## When prior tests warn you away from testing at all diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/routing-xp-vs-ff.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/routing-xp-vs-ff.md index 7e13471..1f685dd 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/routing-xp-vs-ff.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/routing-xp-vs-ff.md @@ -13,7 +13,7 @@ Before any setup work, decide whether the user actually wants an **experiment** | Targeted access — "give beta access to these 50 design partners." | **Feature flag** (FF). | | Both — "ship to 10%, but also tell me if it moves checkout conversion." | **Experiment** with a phased rollout, or **FF + a separate experiment** later. | -The clean way to think about it: a feature flag is a **delivery mechanism**. An experiment is a **decision mechanism** built on top of one. Every experiment uses a feature flag under the hood (Mixpanel auto-creates one when an experiment is created); not every feature flag use case needs an experiment. +The clean way to think about it: a feature flag is a **delivery mechanism**. An experiment is a **decision mechanism** built on top of one. An experiment using the feature-flag collection method auto-creates a flag under the hood (verify current); not every feature flag use case needs an experiment. ## Disambiguation prompt diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/segment-breakdown-interpretation.md index 98c7bbc..c079df3 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/segment-breakdown-interpretation.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/segment-breakdown-interpretation.md @@ -1,6 +1,6 @@ # Segment-Breakdown Interpretation -Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. +Read per-segment results once you have them. The companion segment-of-interest selection reference covers how to pick the segments in the first place. --- @@ -18,11 +18,7 @@ Reading a segment breakdown well means recognizing which of those three you're l ## Per-segment polarity recipe — apply per row -The same recipe from the per-metric reference applies _inside_ each segment. Don't take a shortcut. - -- For each segment × metric × non-control variant, look at the row's `lift` and bucket (positive/negative/no). -- Translate sign-of-lift into business polarity using `metric.direction`. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary. -- Filter out the control row in each segment. +The **canonical polarity recipe** (interpret command Components) applies _inside_ each segment too — don't take a shortcut. For each segment × metric × non-control variant, translate the row's sign-of-lift into business polarity using the metric's direction, and filter out the control row in each segment. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary. Surprisingly easy to forget when you're scanning a wide table — re-apply polarity per row. @@ -47,7 +43,7 @@ Each segment value needs its own meaningful per-variant sample for the per-segme | Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. | | Two opposite-direction effects in different segments that roughly cancel overall | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses. | -When you spot Simpson's paradox, route the user to the **SRM** section of [health-check-interpretation.md](health-check-interpretation.md) — bucketing is usually the cause, not a real reversal. +When you spot Simpson's paradox, route the user to the **SRM** section of the health-check interpretation reference — bucketing is usually the cause, not a real reversal. --- @@ -96,4 +92,4 @@ This is the everyday case of mixed effects. ## Platform support status -Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the rules above. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts. +Reading segment-level experiment results depends on the platform exposing per-segment metric rows. When the experiment-details response doesn't return per-segment rows, fall back to running per-segment queries against the experiment's metrics and exposures, then interpret the resulting numbers with the rules above. If the user wants per-segment interpretation and segmented data isn't available, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/segment-of-interest-selection.md index d290517..dc4c6cd 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/segment-of-interest-selection.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/segment-of-interest-selection.md @@ -2,7 +2,7 @@ Pick 3–5 segments **likely to reveal a real effect difference** before slicing every available dimension and ending up p-hacking. -The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them. +The companion segment-breakdown interpretation reference covers how to _read_ the per-segment results once you have them. ## Contents @@ -122,4 +122,4 @@ Pre-commitment is what separates "segmentation analysis" from "fishing." ## Then read the results -Once the segment breakdown is in hand, switch to [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there. +Once the segment breakdown is in hand, switch to the segment-breakdown interpretation reference (the interpret command links it). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/session-replay-analysis.md index 11d5158..b8be577 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/session-replay-analysis.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/session-replay-analysis.md @@ -83,7 +83,7 @@ If treatment users _arrive_ at a screen more often but _complete_ at a lower per ### Variant-specific UI issues - **Treatment showed the wrong copy / wrong asset** — surprisingly common; treatment shipped, but to a subset of routes only. -- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to [health-check-interpretation.md](health-check-interpretation.md). +- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to the health-check interpretation guidance. - **Treatment fired twice / persisted state across sessions** — implementation regression. --- diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/sizing.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/sizing.md index 8a057d9..c79fc3c 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/sizing.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/sizing.md @@ -29,13 +29,13 @@ Where: - `σ²` = variance of the metric (depends on metric type — see below). - `d` = MDE in the same units as the metric. -The `16` is `(z_{α/2} + z_{β})² × 2` rounded to a workable constant — `(1.96 + 0.84)² × 2 = 15.68 ≈ 16`. Good enough for setup-phase reasoning; for ship-decision rigour use the precise formula in `references/statistical-model.md`. +The `16` is `(z_{α/2} + z_{β})² × 2` rounded to a workable constant — `(1.96 + 0.84)² × 2 = 15.68 ≈ 16`. Good enough for setup-phase reasoning; for ship-decision rigour use the precise z-score formula rather than the rounded constant. ## Variance by metric type - **Bernoulli (conversion rate).** `σ² = p(1−p)` where `p` is the baseline conversion rate. Variance peaks at `p = 0.5` (variance 0.25) and shrinks toward 0 at `p = 0` or `p = 1`. Lifts are easier to detect on rates near 50%, harder near the extremes. - **Poisson (event counts per user).** `σ² ≈ mean count per user`. High-count metrics need proportionally more sample. -- **Gaussian (revenue, time-on-page, etc.).** Compute `σ²` from historical data directly. Long-tailed distributions have high variance — Winsorization (`references/advanced-features.md`) cuts this. +- **Gaussian (revenue, time-on-page, etc.).** Compute `σ²` from historical data directly. Long-tailed distributions have high variance — Winsorization cuts this. ## Worked example @@ -64,7 +64,7 @@ Underpowered experiments suffer from **winner's curse**: if you do reach signifi ## Achievable MDE for a running experiment (diagnosis form) -When diagnosing a live experiment that hasn't hit significance (see [why-no-statsig.md](why-no-statsig.md)), you want the achievable MDE as a **relative** lift so it compares directly against the reported `lift`. Two unit traps make this wrong more often than not: +When diagnosing a live experiment that hasn't hit significance (the why-no-statsig playbook covers that path), you want the achievable MDE as a **relative** lift so it compares directly against the reported lift. Two unit traps make this wrong more often than not: - `MDE = 4σ / √n` above is **absolute** (metric units). Divide by the baseline to get a relative fraction: @@ -105,7 +105,7 @@ Offer these in order of cost — cheap first. 1. **Accept a larger MDE.** Only commit to ship if the effect is bigger. This costs nothing but redraws the success criterion; confirm the user is OK with shipping only on a larger lift. 2. **Increase traffic allocation to the experiment.** If other tests don't need the traffic, give this one more. -3. **Use CUPED to reduce variance** (if pre-exposure data is available). 30–70% variance reduction translates directly into 30–70% smaller required sample. See `references/advanced-features.md`. +3. **Use CUPED to reduce variance** (if pre-exposure data is available). 30–70% variance reduction translates directly into 30–70% smaller required sample. 4. **Pick a higher-volume primary metric** (if the hypothesis allows). Often there's a leading proxy with more volume than the lagging metric the team originally chose. 5. **Don't run the experiment.** Invest the engineering elsewhere. Sometimes the right answer. @@ -133,11 +133,11 @@ Use this for quick sanity-checking. Always confirm with a query against actual b For a multi-arm test (N non-control variants), the per-variant target grows with the number of pairwise comparisons being made (each treatment vs control). With multiple-testing correction enabled (which is the right default at 2+ variants), the per-test α tightens, which inflates required sample size further. -Rule of thumb: a 3-variant test (control + 2 treatments) needs about 1.3× the per-arm sample of a 2-variant test for the same MDE; a 4-variant test needs about 1.5×. Exact multipliers depend on the correction method — see `references/advanced-features.md`. +Rule of thumb: a 3-variant test (control + 2 treatments) needs about 1.3× the per-arm sample of a 2-variant test for the same MDE; a 4-variant test needs about 1.5×. Exact multipliers depend on the correction method. ## Duration considerations - **Minimum 1 week** — anything shorter misses weekly seasonality and conflates the day-of-week mix between control and treatment if traffic differs across days. - **Minimum 3 days for read-out** — even with sequential testing and big effects, results under 3 days are typically un-interpretable (cohort hasn't stabilised, day-of-week effects dominate, novelty effect not separated from treatment effect). -- **Multiples of the seasonal cycle.** If the primary metric has strong weekly seasonality, set `endCondition: "days"` and choose 7, 14, 21, or 28 days so each variant sees the same mix of high- and low-traffic periods. +- **Multiples of the seasonal cycle.** If the primary metric has strong weekly seasonality, use a date-based end condition and choose 7, 14, 21, or 28 days so each variant sees the same mix of high- and low-traffic periods. - **Cap at ~6 weeks** for most tests — beyond this, novelty effects wear off, the user population drifts, and other experiments running in the same window create cross-test contamination. If the math says you need 8+ weeks, you're underpowered — pick a remediation from the list above. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/statistical-model.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/statistical-model.md index 1d832e0..6ebf0bc 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/statistical-model.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/statistical-model.md @@ -85,7 +85,7 @@ Enable when there are ≥2 primary metrics OR ≥2 non-control variants. Without Derived from the standard `1 − (1 − α)^k` compounding for `k = primaries × non-control variants` independent tests at per-test α = 0.05. -The takeaway: by the time you're testing 5 primaries on a 3-arm experiment, more than half of the "wins" are noise. +The takeaway: by the time you're testing 5 primaries on a 4-arm experiment (3 non-control variants), more than half of the "wins" are noise. Two methods are available: @@ -107,4 +107,4 @@ When the user pushes you on the confidence level: - Raising α from 0.05 to 0.10 increases power (smaller required sample for the same MDE) but doubles the rate of false-positive "wins." - Lowering α from 0.05 to 0.01 cuts the false-positive rate fivefold but requires roughly 1.5× the sample for the same MDE. -If the user wants more power without raising α, the right move is **smaller MDE → bigger required sample**, not loosening significance. If sample is the binding constraint, reach for CUPED (`references/advanced-features.md`) or a higher-volume proxy metric. +If the user wants more power without raising α, the right move is **smaller MDE → bigger required sample**, not loosening significance. If sample is the binding constraint, reach for CUPED or a higher-volume proxy metric. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/why-no-statsig.md index 4995f62..d30605c 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/why-no-statsig.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/why-no-statsig.md @@ -2,7 +2,7 @@ Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts. -The actual stop / extend math (sample size, power, MDE) lives in [sizing.md](sizing.md) — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one. +The actual stop / extend math (sample size, power, MDE) lives in the sizing reference — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one. ## Contents @@ -23,7 +23,7 @@ Inconclusive can mean two very different things: 1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about. 2. **The result isn't trustworthy at all** — SRM failing, broken data, peeked frequentist, etc. — and "inconclusive" is the wrong frame entirely. -Before answering "why no statsig?", run the **trustworthiness gate**. If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power. +Before answering "why no statsig?", run the **trustworthiness gate**. If anything fails, route to the health-check interpretation guidance — fixing the bucketing or the data is a prerequisite to talking about power. Also check: @@ -40,12 +40,12 @@ Before walking the reasons, fetch the experiment with its exposures and metric r - **Per-variant exposure counts** — the smallest arm is the binding constraint, not the total. - **Control baseline** for each inconclusive primary (its rate or value). If it's missing, query the metric scoped to the control variant over the experiment's dates. - **Observed lift** per primary — relative, `(treatment − control) / control`. Apply the polarity recipe before reading the sign. -- **Variance proxy by metric type** — Bernoulli `p(1−p)`, Poisson `mean`, Gaussian from the per-arm value and sample size (see the variance-by-metric-type table in [sizing.md](sizing.md)). +- **Variance proxy by metric type** — Bernoulli `p(1−p)`, Poisson `mean`, Gaussian from the per-arm value and sample size (see the variance-by-metric-type table in the sizing reference). - **Configured MDE and end target** (sample-size or duration, whichever the experiment uses). If no MDE was set, ask the user for "the smallest lift worth shipping" — that's the operative MDE. - **Configured traffic split vs the actual exposure ratio** — a skewed split bottlenecks the test on the smaller arm even when SRM didn't fail. - **Overall exposure volume vs plan** (target per arm × arm count) — far below plan means exposures aren't flowing as configured (reason 5). -For the closed-form power math (`n_required`, achievable `MDE_relative`), see [sizing.md](sizing.md) — those numbers explain the `NO` verdict; they don't override it. +For the closed-form power math (`n_required`, achievable `MDE_relative`), see the sizing reference — those numbers explain the not-significant verdict; they don't override it. --- @@ -101,7 +101,7 @@ Never change traffic allocation mid-Frequentist test — it invalidates the SRM - The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event. - QA traffic isn't being excluded and you suspect internal traffic is dominating one variant → enable the QA exclusion on the next run (results then are cleaner but also smaller). -**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md). +**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in the per-metric interpretation reference. --- @@ -137,7 +137,7 @@ Once you know which reason fits, the recommendation almost picks itself. | Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. | | Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. | -When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or use the power math in [sizing.md](sizing.md). +When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or use the power math in the sizing reference. --- diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md index cac3775..ae87c0c 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md @@ -76,9 +76,9 @@ Terms all four commands use without redefining. Phase-specific terms (hypothesis - **Direction.** Whether bigger is better for a metric (`up`) or smaller is better (`down`). Cancel / error / latency / abandon / refund metrics need `down` set explicitly — leaving the default silently flips polarity at interpretation. - **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict. - **MDE (Minimum Detectable Effect).** The smallest lift the experiment is sized to detect. Set during design, enforced at interpretation. -- **CUPED.** Variance-reduction technique using pre-exposure baseline. Cuts required sample 30–70% when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. -- **Winsorization.** Outlier capping applied pooled across variants. The `percentile` field is the **tail width** to cap on each side (default `5` caps below the 5th and above the 95th — i.e. the 5% tails). The schema rejects `percentile` ≥ 50. Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. -- **Multiple-testing correction.** Adjusts per-test significance threshold when several primaries or several non-control variants are tested together. Default Benjamini-Hochberg; Bonferroni for strict family-wise control. +- **CUPED.** Variance-reduction technique using pre-exposure baseline. Cuts required sample 30–70% (typical, empirical range) when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. +- **Winsorization.** Outlier capping applied pooled across variants. The `percentile` field is the **tail width** to cap on each side (default `5` caps below the 5th and above the 95th — i.e. the 5% tails). The schema rejects `percentile` ≥ 50. **Push back on tail widths above ~20%** — capping more than a fifth of each side discards too much signal; this is the canonical push-back rule the commands inherit. Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. +- **Multiple-testing correction.** Adjusts per-test significance threshold when several primaries or several non-control variants are tested together. Default Benjamini-Hochberg (platform default — verify current); Bonferroni for strict family-wise control. ## Reference files diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md index b052290..2d50ee0 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md @@ -77,7 +77,7 @@ Every primary and guardrail needs an explicit `direction`. Watch for the **laggi Pull baseline rate, variance, and daily traffic from Mixpanel. Don't guess. -Use the formulas in **Components**. Then compare the required sample to what the available traffic delivers inside an acceptable window (typically 2–4 weeks). If the achievable MDE exceeds the user's expected lift, the experiment is **underpowered** — surface immediately. Don't wave it through; offer the remediations from the sizing reference (accept a larger MDE → increase allocation → enable CUPED → pick a higher-volume primary → don't run). +Use the formulas in **Components**. Then compare the required sample to what the available traffic delivers inside an acceptable window (typically 2–4 weeks). If the achievable MDE exceeds the user's expected lift, the experiment is **underpowered** — surface immediately. Don't wave it through; offer the remediations from the sizing reference, in cost order (cheapest first). Sample-size floor: keep per-variant target above the platform's reliability floor (verify in product — historically ~350–400). Below the floor, the central limit theorem breaks down and the SRM check gets noisy. Full worked examples, baseline-by-rate lookup table, and the duration / seasonality rules in [../references/sizing.md](../references/sizing.md). @@ -95,7 +95,7 @@ Decision tree, the peeking-trap explanation, worked compounding-FPR numbers, and ### 6. Decide on advanced features - **CUPED** — enable when the primary metric correlates with pre-exposure behaviour AND all experiment users existed before start AND 2–4 weeks of stable pre-exposure history is available. Do not enable on new-user-only experiments, one-time-event metrics, or brand-new metrics. -- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. The `percentile` field is the tail width to cap (default `5` = 5% tails); push back if the user sets a percentile above ~20 — more than 20% of values capped on each side throws away too much signal. +- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. The tail-width setting defaults to 5 (5% tails); apply the umbrella glossary's Winsorization push-back rule (don't cap tails above ~20%). When/why each is right and the common misconfigurations are in [../references/advanced-features.md](../references/advanced-features.md). diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/interpret.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/interpret.md index 5bbfebb..806729d 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/interpret.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/interpret.md @@ -9,7 +9,7 @@ The umbrella `SKILL.md` defines the shared glossary (Variant, Primary/Guardrail/ ## Glossary (interpret-specific) - **Polarity.** Whether a movement is _good for the business_. Combines sign of lift with the metric's `direction` ("up" = bigger is better; "down" = smaller is better). See the **Polarity recipe** in Components. -- **Significance.** The platform's per-row classification: `YES_POSITIVE`, `YES_NEGATIVE`, or `NO`. Read from the response — do not recompute. +- **Significance.** The platform's per-row classification: significant-positive, significant-negative, or not-significant. Read it from the result — do not recompute. - **SRM (Sample Ratio Mismatch).** Variants received traffic in proportions that disagree with the configured split. **Kohavi's #1 trustworthiness check** — when SRM fails, downstream lift, p-values, and CIs cannot be trusted. - **Retro A/A (pre-experiment bias).** Re-runs the comparison on the pre-exposure period. A failure means cohorts already differed before treatment started. - **Twyman's Law.** "Any unusually clean or unusually large result is more likely a bug than a discovery." Apply on lifts > ~30% — usually a changed-denominator artifact. @@ -21,15 +21,17 @@ The umbrella `SKILL.md` defines the shared glossary (Variant, Primary/Guardrail/ ### Polarity recipe (load-bearing — apply on every metric row) -The platform's summary buckets (`positive` / `negative` / `no`) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion. +This is the **canonical polarity recipe** for the skill — the interpret references point back here instead of restating it. -Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"): +The platform's result buckets (positive / negative / no-effect) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion. -- `lift is None` or `lift == 0` → **neutral** (no measurement / no effect respectively). -- `direction == "up"` → **positive** if `lift > 0`, else **negative**. -- `direction == "down"` → **positive** if `lift < 0`, else **negative**. +Given a row's lift and the metric's direction ("up" = bigger is better, "down" = smaller is better; defaults to "up"): -A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control. +- Lift missing or exactly zero → **neutral** (no measurement / no effect respectively). +- Direction "up" → **positive** if lift > 0, else **negative**. +- Direction "down" → **positive** if lift < 0, else **negative**. + +A positive-bucket row on a "down" metric is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control. The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**. @@ -47,7 +49,7 @@ Experiment-details has two parallel data paths — live (preferred) and cached. | Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [../references/why-no-statsig.md](../references/why-no-statsig.md)). | | Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [../references/health-check-interpretation.md](../references/health-check-interpretation.md). | -For multi-variant tests, special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), and the exact decide-call shape, see [../references/lifecycle-handoff.md](../references/lifecycle-handoff.md). +For multi-variant tests, the special success-without-a-single-variant choices (ship-without-a-variant, defer-the-decision), and the exact decide-call shape, see [../references/lifecycle-handoff.md](../references/lifecycle-handoff.md). --- @@ -77,7 +79,7 @@ Apply the **polarity recipe** from Components to each non-control variant × pri #### 2c. Guardrail check -Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a row in `summary.positive` for a `direction: "down"` guardrail is still a regression. +Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a positive-bucket row for a "down" guardrail is still a regression. #### 2d. Practical significance diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/advanced-features.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/advanced-features.md index ffcf1d3..797eac6 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/advanced-features.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/advanced-features.md @@ -74,7 +74,7 @@ For very heavy tails (extreme whale distributions), `percentile=1` (cap each 1% ## Multiple testing correction — Bonferroni vs Benjamini-Hochberg -Covered in detail in [statistical-model.md](statistical-model.md). The short version: +Covered in detail in the statistical-model reference. The short version: - Enable when there are ≥2 primaries OR ≥2 non-control variants. - Default to Benjamini-Hochberg. More powerful with correlated primaries. @@ -106,6 +106,6 @@ Primary count ≥ 2 OR non-control variants ≥ 2? - ⛔ **CUPED on a new-user-only experiment.** No pre-exposure data; the feature does nothing. Worse, the user thinks they're being protected and ships an underpowered test. - ⛔ **Winsorization on a conversion metric.** Capping 0/1 values is meaningless. The setting either no-ops or, if a buggy implementation interprets it literally, makes the metric worse. -- ⛔ **Winsorization at a `percentile` above ~20.** Caps more than 20% of each tail — throws away too much signal. Almost always a misconfiguration. Confirm intent. +- ⛔ **Winsorization tail width above ~20%.** Almost always a misconfiguration — see Percentile guidance above. Confirm intent. - ⛔ **Multiple testing correction OFF on a 5-primary test.** Family-wise FPR balloons to ~22.6%. One in five "wins" is noise. - ⛔ **CUPED enabled "to be safe" on a metric where pre-exposure doesn't predict post-exposure.** Best case: no effect. Common case: the variance estimate gets noisier because the regression adjustment is fitting to noise. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/health-check-interpretation.md index 818b163..b5f82a7 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/health-check-interpretation.md @@ -96,7 +96,7 @@ The same statistical comparison run on the **pre-exposure** period revealed that 4. If the experiment is still ACTIVE: extend duration via an experiment update with a new end target. 5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math. -If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question. +If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to the why-no-statsig playbook (the interpret command links it) — different question. --- diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/hypothesis-framing.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/hypothesis-framing.md index 40cf56e..8f3476e 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/hypothesis-framing.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/hypothesis-framing.md @@ -20,7 +20,7 @@ All four properties of a good hypothesis — falsifiable, directional, mechanist | **Falsifiable** | Could the data say "no"? | "Improving UX" can't be falsified. "Increasing weekly retention by ≥2pp" can. | | **Directional** | Is the predicted change up or down? | "Affecting cart size" leaves the polarity ambiguous; the system defaults to `direction: "up"` and the interpretation step misreads regressions as wins. | | **Mechanistic** | What's the proposed causal chain? | "Because users will see X and decide Y" is a mechanism. "We think it'll work" is not. Without a mechanism, the team can't tell when the metric they picked is actually downstream of the change. | -| **Bounded in time** | Does the predicted effect occur within a measurable window? | Day-30 LTV claims need a ≥30-day experiment. A 2-week test on a 30-day metric guarantees an inconclusive result on the real effect plus a high chance of reaching false significance from noise. | +| **Bounded in time** | Does the predicted effect occur within a measurable window? | Day-30 LTV claims need a ≥30-day experiment. A 2-week test on a 30-day metric can't measure the real effect (the metric isn't mature yet) and invites a noise-driven false read. | ## When the user gives you a one-liner diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/lifecycle-handoff.md index 3a9e24c..bf548fa 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/lifecycle-handoff.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/lifecycle-handoff.md @@ -13,15 +13,15 @@ Concluding an experiment is **irreversible**. Before invoking the decide action, A decide call expresses three things: 1. **Did the experiment succeed?** A win for one of the treatments, or a deliberate stop. -2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special constants below. +2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special choices below. 3. **Why?** A rationale message — what metrics were evaluated, the polarity reading, the tradeoffs accepted. The platform requires this on every decide call; treat it as a one-paragraph decision record, not a placeholder. ## Special variant choices for success When you have a winning result but no single variant to ship: -- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. (The platform exposes this as the constant `__no_variant_shipped__`.) -- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The platform exposes this as `__defer_variant_decision__` and shows `SUCCESS_DEFERRED` in the UI.) +- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. (The decide action exposes a dedicated "ship without a variant" choice; the tool layer supplies the exact value.) +- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The decide action exposes a "defer the variant decision" choice; the UI shows the experiment as deferred.) When the verdict is KILL — no winner — record success as false. No variant key is needed in that case. @@ -30,7 +30,7 @@ When the verdict is KILL — no winner — record success as false. No variant k For a 3+ arm test, the decide action still names a single winning variant. If two treatments are roughly tied: - If both clear the practical-significance bar and shipping either is acceptable, pick on simplicity (smaller diff from control, lower implementation cost). -- If the team genuinely cannot pick, use the defer constant above — better than fabricating a winner. +- If the team genuinely cannot pick, use the defer-the-decision choice above — better than fabricating a winner. A multi-variant test where only one treatment is significantly different from control is a clean SHIP for that variant; the inconclusive arms are simply not the winner. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/metric-selection.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/metric-selection.md index 7088e3c..1c79f6d 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/metric-selection.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/metric-selection.md @@ -7,7 +7,7 @@ Each metric serves exactly one of three roles. The hypothesis tells you which. The metrics whose movement decides ship / no-ship. They come straight from the hypothesis's "outcome will ``" clause. - **Cap at 3.** Each additional primary inflates the family-wise false-positive rate. With multiple-testing correction enabled (which is the right default at 2+ primaries), more primaries → tighter per-metric threshold → harder to detect any individual effect. Beyond 3 the math punishes you regardless of how well the test is run. -- **Explicit direction.** Every primary needs `up` or `down`. The platform's default is `up`, which is wrong for cancel / error / latency / abandon / refund metrics. Setting it explicitly at setup time is the only way to keep the polarity correct through interpretation. +- **Explicit direction.** Every primary needs `up` or `down`. The platform's default is `up` (verify in product), which is wrong for cancel / error / latency / abandon / refund metrics. Setting it explicitly at setup time is the only way to keep the polarity correct through interpretation. - **Leading, not lagging.** A primary must be able to actually move within the planned experiment window. Match the metric's response window to the experiment's duration: - Onboarding-screen change → activation in the first session, not Week-4 retention. - Checkout button A/B → checkout conversion, not 30-day LTV. @@ -20,11 +20,11 @@ If the user proposes a primary, sanity-check: - _Is this metric downstream of the change?_ (A pricing change cannot move "tutorial completion".) - _Does the metric exist for both control and treatment users?_ If the change creates new events that don't exist in control, lift is artificially infinite (changed-denominator). - _Is the metric's response window shorter than the experiment's duration?_ If not, the metric is lagging — pick a leading proxy. -- _Does the metric have enough volume to detect the expected lift?_ (Cross-reference `references/sizing.md`.) +- _Does the metric have enough volume to detect the expected lift?_ (Volume drives the sizing math.) ## Guardrail metrics (0+, strongly recommended) -Metrics that **must not regress**, even if primaries win. The trustworthiness backstop on a ship decision: a 5% relative regression on any guardrail blocks ship even if the primary wins. This is the **>5% guardrail hard-gate**, and it's the most important single rule in the pitfall catalogue. +Metrics that **must not regress**, even if primaries win. The trustworthiness backstop on a ship decision: a 5% relative regression on any guardrail blocks ship even if the primary wins. This is the **>5% guardrail hard-gate** — the umbrella owns the threshold (rationale in the pitfall catalogue), and it's the most important single rule there. Standard guardrails by domain — pick at least one from the row that matches the change: @@ -49,7 +49,7 @@ Metrics for understanding **why** the primary moved, not for the ship decision. > **Setup misconfiguration to flag.** If the user's hypothesis text names a metric that they then classify as secondary, ask: > _"You mentioned `` in your hypothesis. Should this be a primary metric? Secondary metrics don't influence ship/no-ship decisions, so if it matters for the outcome, promote it."_ -This is the **Hypothesis ↔ metric mismatch** pitfall — see [pitfalls.md](pitfalls.md). +This is the **Hypothesis ↔ metric mismatch** pitfall in the pre-launch pitfall catalogue. ## Sanity checklist diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/per-metric-interpretation.md index 320cb8e..e88bc5d 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/per-metric-interpretation.md @@ -23,12 +23,12 @@ Translate a metric's lift, confidence interval, and p-value into a plain-languag ## The mental model -Each row in `summary.positive` / `summary.negative` / `summary.no` answers four questions: +Each row (positive / negative / no-effect bucket) answers four questions: -1. **Did the lift go up or down?** — the `summary` bucket name (sign-of-lift, not polarity). -2. **Was the change distinguishable from noise?** — the `significance` field (or the bucket name itself: rows in `summary.positive` / `summary.negative` are significant, rows in `summary.no` are not). -3. **Was the change in the goal direction?** — apply the polarity recipe with `metric.direction`. -4. **Was the change big enough to matter?** — multiply `lift` by the control baseline `value` to get absolute impact, then judge against business context. +1. **Did the lift go up or down?** — the bucket name (sign-of-lift, not polarity). +2. **Was the change distinguishable from noise?** — the significance classification (or the bucket name itself: positive / negative buckets are significant, the no-effect bucket is not). +3. **Was the change in the goal direction?** — apply the polarity recipe with the metric's direction. +4. **Was the change big enough to matter?** — multiply lift by the control baseline value to get absolute impact, then judge against business context. A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any one of those and you're shipping the wrong thing. @@ -36,10 +36,10 @@ A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any ## Polarity recipe -Treat the bucket name (the positive / negative / no grouping) as sign-of-lift only; the business verdict comes from combining that sign with the metric's **Direction** (defined in the Shared glossary in `SKILL.md`). A positive-sign movement on a `down`-direction metric is a regression, not a win. Examples worth remembering: +Apply the **canonical polarity recipe** (defined in the interpret command's Components): the bucket name is sign-of-lift only; the business verdict comes from combining that sign with the metric's **Direction** (Shared glossary in `SKILL.md`). Re-apply it on every row here — a positive-sign movement on a "down" metric is a regression, not a win. Examples worth remembering: -- A row in `summary.positive` with `direction: "down"` is a **regression**. -- A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1% interstitials_shown` lift means less interruption). +- A positive-bucket row on a "down" metric is a **regression**. +- A negative-bucket row on a "down" metric is a **win** (e.g. a -1% interstitials_shown lift means less interruption). --- @@ -47,7 +47,7 @@ Treat the bucket name (the positive / negative / no grouping) as sign-of-lift on Mixpanel runs a frequentist comparison at the experiment's configured confidence level — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). -The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread. +The platform-specific trap worth flagging: the confidence figure shown on each result row is the **confidence level used** (e.g. 0.95), **not the CI width**. Easy to misread. For the general meaning of a p-value (the probability under the null), trust the model's baseline knowledge — don't invent thresholds in either direction. @@ -59,8 +59,8 @@ For the general meaning of a p-value (the probability under the null), trust the lift = (treatment_mean - control_mean) / control_mean ``` -- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct. -- If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect." +- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the reported lift is correct. +- If a row's lift is missing, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect." --- @@ -133,7 +133,7 @@ Different metric types behave differently; cite the relevant nuance in your verd ## Variance-reduction & outlier settings that change interpretation - **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). -- **Winsorization enabled**: extreme values capped at both tails, pooled across variants. The `percentile` field is the tail width (default `5` = 5% tails). Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much higher than the default — capping more than ~20% of each tail — is a misconfiguration; see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). +- **Winsorization enabled**: extreme values capped at both tails, pooled across variants. The tail-width setting defaults to 5 (5% tails). Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A much higher tail width — capping more than ~20% of each side — is a misconfiguration; see the Misconfigurations notes in the health-check interpretation reference. --- @@ -153,7 +153,7 @@ If multiple-testing correction is off AND there are 2+ primaries × 1+ non-contr A "not significant" verdict means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level — **not that there is no effect.** Important when the user is about to call something a null result. -For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see [why-no-statsig.md](why-no-statsig.md). +For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see the why-no-statsig playbook. --- @@ -161,7 +161,7 @@ For the full walk-through on what to do about it (wait, extend, boost power, nar Concluding a Frequentist experiment before it reaches its configured target is a peeking event — per-metric significance verdicts become unreliable. Sequential experiments are designed for continuous monitoring and don't have this problem. -For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of [health-check-interpretation.md](health-check-interpretation.md). +For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of the health-check interpretation reference. --- diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/pitfalls.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/pitfalls.md index c20eb30..2d6b25a 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/pitfalls.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/pitfalls.md @@ -53,7 +53,7 @@ If the team genuinely wants to make that trade, they can disable the guardrail b ### High variance, no Winsorization — warning -**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the default `percentile=5` (cap each 5% tail). Push back if the user sets a `percentile` above ~20 — more than 20% of values capped on each side is almost always a misconfiguration. +**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the default tail width (5% tails). Push back on tail widths above ~20% — see the umbrella glossary's Winsorization push-back rule. ### Multiple primaries, no correction — warning diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/prior-experiments.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/prior-experiments.md index 476a26c..c5fbb42 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/prior-experiments.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/prior-experiments.md @@ -53,7 +53,7 @@ Concretely, when you have a prior result that's relevant, the setup workflow cha | Step 1 — hypothesis | Coach from scratch | Anchor on the prior's hypothesis; ask what's different | | Step 2 — metric selection | Suggest standard primaries/guardrails | Use the prior's metric set as the default; modify only with reason | | Step 3 — sizing | Query baseline + variance over the prior window | Use the prior's observed baseline and variance | -| Step 4 — statistical model | Default to sequential / benjamini-hochberg | If the prior used a specific model and the team is comparing across tests, keep the same model for comparability | +| Step 4 — statistical model | Default to sequential / Benjamini-Hochberg (verify current) | If the prior used a specific model and the team is comparing across tests, keep the same model for comparability | | Pitfall check | Run the standard catalogue | Cross-reference: did the prior have an SRM problem? A guardrail regression that should be set up as primary this time? | ## When prior tests warn you away from testing at all diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/routing-xp-vs-ff.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/routing-xp-vs-ff.md index 7e13471..1f685dd 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/routing-xp-vs-ff.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/routing-xp-vs-ff.md @@ -13,7 +13,7 @@ Before any setup work, decide whether the user actually wants an **experiment** | Targeted access — "give beta access to these 50 design partners." | **Feature flag** (FF). | | Both — "ship to 10%, but also tell me if it moves checkout conversion." | **Experiment** with a phased rollout, or **FF + a separate experiment** later. | -The clean way to think about it: a feature flag is a **delivery mechanism**. An experiment is a **decision mechanism** built on top of one. Every experiment uses a feature flag under the hood (Mixpanel auto-creates one when an experiment is created); not every feature flag use case needs an experiment. +The clean way to think about it: a feature flag is a **delivery mechanism**. An experiment is a **decision mechanism** built on top of one. An experiment using the feature-flag collection method auto-creates a flag under the hood (verify current); not every feature flag use case needs an experiment. ## Disambiguation prompt diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/segment-breakdown-interpretation.md index 98c7bbc..c079df3 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/segment-breakdown-interpretation.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/segment-breakdown-interpretation.md @@ -1,6 +1,6 @@ # Segment-Breakdown Interpretation -Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. +Read per-segment results once you have them. The companion segment-of-interest selection reference covers how to pick the segments in the first place. --- @@ -18,11 +18,7 @@ Reading a segment breakdown well means recognizing which of those three you're l ## Per-segment polarity recipe — apply per row -The same recipe from the per-metric reference applies _inside_ each segment. Don't take a shortcut. - -- For each segment × metric × non-control variant, look at the row's `lift` and bucket (positive/negative/no). -- Translate sign-of-lift into business polarity using `metric.direction`. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary. -- Filter out the control row in each segment. +The **canonical polarity recipe** (interpret command Components) applies _inside_ each segment too — don't take a shortcut. For each segment × metric × non-control variant, translate the row's sign-of-lift into business polarity using the metric's direction, and filter out the control row in each segment. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary. Surprisingly easy to forget when you're scanning a wide table — re-apply polarity per row. @@ -47,7 +43,7 @@ Each segment value needs its own meaningful per-variant sample for the per-segme | Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. | | Two opposite-direction effects in different segments that roughly cancel overall | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses. | -When you spot Simpson's paradox, route the user to the **SRM** section of [health-check-interpretation.md](health-check-interpretation.md) — bucketing is usually the cause, not a real reversal. +When you spot Simpson's paradox, route the user to the **SRM** section of the health-check interpretation reference — bucketing is usually the cause, not a real reversal. --- @@ -96,4 +92,4 @@ This is the everyday case of mixed effects. ## Platform support status -Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the rules above. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts. +Reading segment-level experiment results depends on the platform exposing per-segment metric rows. When the experiment-details response doesn't return per-segment rows, fall back to running per-segment queries against the experiment's metrics and exposures, then interpret the resulting numbers with the rules above. If the user wants per-segment interpretation and segmented data isn't available, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/segment-of-interest-selection.md index d290517..dc4c6cd 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/segment-of-interest-selection.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/segment-of-interest-selection.md @@ -2,7 +2,7 @@ Pick 3–5 segments **likely to reveal a real effect difference** before slicing every available dimension and ending up p-hacking. -The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them. +The companion segment-breakdown interpretation reference covers how to _read_ the per-segment results once you have them. ## Contents @@ -122,4 +122,4 @@ Pre-commitment is what separates "segmentation analysis" from "fishing." ## Then read the results -Once the segment breakdown is in hand, switch to [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there. +Once the segment breakdown is in hand, switch to the segment-breakdown interpretation reference (the interpret command links it). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/session-replay-analysis.md index 11d5158..b8be577 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/session-replay-analysis.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/session-replay-analysis.md @@ -83,7 +83,7 @@ If treatment users _arrive_ at a screen more often but _complete_ at a lower per ### Variant-specific UI issues - **Treatment showed the wrong copy / wrong asset** — surprisingly common; treatment shipped, but to a subset of routes only. -- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to [health-check-interpretation.md](health-check-interpretation.md). +- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to the health-check interpretation guidance. - **Treatment fired twice / persisted state across sessions** — implementation regression. --- diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/sizing.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/sizing.md index 8a057d9..c79fc3c 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/sizing.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/sizing.md @@ -29,13 +29,13 @@ Where: - `σ²` = variance of the metric (depends on metric type — see below). - `d` = MDE in the same units as the metric. -The `16` is `(z_{α/2} + z_{β})² × 2` rounded to a workable constant — `(1.96 + 0.84)² × 2 = 15.68 ≈ 16`. Good enough for setup-phase reasoning; for ship-decision rigour use the precise formula in `references/statistical-model.md`. +The `16` is `(z_{α/2} + z_{β})² × 2` rounded to a workable constant — `(1.96 + 0.84)² × 2 = 15.68 ≈ 16`. Good enough for setup-phase reasoning; for ship-decision rigour use the precise z-score formula rather than the rounded constant. ## Variance by metric type - **Bernoulli (conversion rate).** `σ² = p(1−p)` where `p` is the baseline conversion rate. Variance peaks at `p = 0.5` (variance 0.25) and shrinks toward 0 at `p = 0` or `p = 1`. Lifts are easier to detect on rates near 50%, harder near the extremes. - **Poisson (event counts per user).** `σ² ≈ mean count per user`. High-count metrics need proportionally more sample. -- **Gaussian (revenue, time-on-page, etc.).** Compute `σ²` from historical data directly. Long-tailed distributions have high variance — Winsorization (`references/advanced-features.md`) cuts this. +- **Gaussian (revenue, time-on-page, etc.).** Compute `σ²` from historical data directly. Long-tailed distributions have high variance — Winsorization cuts this. ## Worked example @@ -64,7 +64,7 @@ Underpowered experiments suffer from **winner's curse**: if you do reach signifi ## Achievable MDE for a running experiment (diagnosis form) -When diagnosing a live experiment that hasn't hit significance (see [why-no-statsig.md](why-no-statsig.md)), you want the achievable MDE as a **relative** lift so it compares directly against the reported `lift`. Two unit traps make this wrong more often than not: +When diagnosing a live experiment that hasn't hit significance (the why-no-statsig playbook covers that path), you want the achievable MDE as a **relative** lift so it compares directly against the reported lift. Two unit traps make this wrong more often than not: - `MDE = 4σ / √n` above is **absolute** (metric units). Divide by the baseline to get a relative fraction: @@ -105,7 +105,7 @@ Offer these in order of cost — cheap first. 1. **Accept a larger MDE.** Only commit to ship if the effect is bigger. This costs nothing but redraws the success criterion; confirm the user is OK with shipping only on a larger lift. 2. **Increase traffic allocation to the experiment.** If other tests don't need the traffic, give this one more. -3. **Use CUPED to reduce variance** (if pre-exposure data is available). 30–70% variance reduction translates directly into 30–70% smaller required sample. See `references/advanced-features.md`. +3. **Use CUPED to reduce variance** (if pre-exposure data is available). 30–70% variance reduction translates directly into 30–70% smaller required sample. 4. **Pick a higher-volume primary metric** (if the hypothesis allows). Often there's a leading proxy with more volume than the lagging metric the team originally chose. 5. **Don't run the experiment.** Invest the engineering elsewhere. Sometimes the right answer. @@ -133,11 +133,11 @@ Use this for quick sanity-checking. Always confirm with a query against actual b For a multi-arm test (N non-control variants), the per-variant target grows with the number of pairwise comparisons being made (each treatment vs control). With multiple-testing correction enabled (which is the right default at 2+ variants), the per-test α tightens, which inflates required sample size further. -Rule of thumb: a 3-variant test (control + 2 treatments) needs about 1.3× the per-arm sample of a 2-variant test for the same MDE; a 4-variant test needs about 1.5×. Exact multipliers depend on the correction method — see `references/advanced-features.md`. +Rule of thumb: a 3-variant test (control + 2 treatments) needs about 1.3× the per-arm sample of a 2-variant test for the same MDE; a 4-variant test needs about 1.5×. Exact multipliers depend on the correction method. ## Duration considerations - **Minimum 1 week** — anything shorter misses weekly seasonality and conflates the day-of-week mix between control and treatment if traffic differs across days. - **Minimum 3 days for read-out** — even with sequential testing and big effects, results under 3 days are typically un-interpretable (cohort hasn't stabilised, day-of-week effects dominate, novelty effect not separated from treatment effect). -- **Multiples of the seasonal cycle.** If the primary metric has strong weekly seasonality, set `endCondition: "days"` and choose 7, 14, 21, or 28 days so each variant sees the same mix of high- and low-traffic periods. +- **Multiples of the seasonal cycle.** If the primary metric has strong weekly seasonality, use a date-based end condition and choose 7, 14, 21, or 28 days so each variant sees the same mix of high- and low-traffic periods. - **Cap at ~6 weeks** for most tests — beyond this, novelty effects wear off, the user population drifts, and other experiments running in the same window create cross-test contamination. If the math says you need 8+ weeks, you're underpowered — pick a remediation from the list above. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/statistical-model.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/statistical-model.md index 1d832e0..6ebf0bc 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/statistical-model.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/statistical-model.md @@ -85,7 +85,7 @@ Enable when there are ≥2 primary metrics OR ≥2 non-control variants. Without Derived from the standard `1 − (1 − α)^k` compounding for `k = primaries × non-control variants` independent tests at per-test α = 0.05. -The takeaway: by the time you're testing 5 primaries on a 3-arm experiment, more than half of the "wins" are noise. +The takeaway: by the time you're testing 5 primaries on a 4-arm experiment (3 non-control variants), more than half of the "wins" are noise. Two methods are available: @@ -107,4 +107,4 @@ When the user pushes you on the confidence level: - Raising α from 0.05 to 0.10 increases power (smaller required sample for the same MDE) but doubles the rate of false-positive "wins." - Lowering α from 0.05 to 0.01 cuts the false-positive rate fivefold but requires roughly 1.5× the sample for the same MDE. -If the user wants more power without raising α, the right move is **smaller MDE → bigger required sample**, not loosening significance. If sample is the binding constraint, reach for CUPED (`references/advanced-features.md`) or a higher-volume proxy metric. +If the user wants more power without raising α, the right move is **smaller MDE → bigger required sample**, not loosening significance. If sample is the binding constraint, reach for CUPED or a higher-volume proxy metric. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/why-no-statsig.md index 4995f62..d30605c 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/why-no-statsig.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/why-no-statsig.md @@ -2,7 +2,7 @@ Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts. -The actual stop / extend math (sample size, power, MDE) lives in [sizing.md](sizing.md) — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one. +The actual stop / extend math (sample size, power, MDE) lives in the sizing reference — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one. ## Contents @@ -23,7 +23,7 @@ Inconclusive can mean two very different things: 1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about. 2. **The result isn't trustworthy at all** — SRM failing, broken data, peeked frequentist, etc. — and "inconclusive" is the wrong frame entirely. -Before answering "why no statsig?", run the **trustworthiness gate**. If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power. +Before answering "why no statsig?", run the **trustworthiness gate**. If anything fails, route to the health-check interpretation guidance — fixing the bucketing or the data is a prerequisite to talking about power. Also check: @@ -40,12 +40,12 @@ Before walking the reasons, fetch the experiment with its exposures and metric r - **Per-variant exposure counts** — the smallest arm is the binding constraint, not the total. - **Control baseline** for each inconclusive primary (its rate or value). If it's missing, query the metric scoped to the control variant over the experiment's dates. - **Observed lift** per primary — relative, `(treatment − control) / control`. Apply the polarity recipe before reading the sign. -- **Variance proxy by metric type** — Bernoulli `p(1−p)`, Poisson `mean`, Gaussian from the per-arm value and sample size (see the variance-by-metric-type table in [sizing.md](sizing.md)). +- **Variance proxy by metric type** — Bernoulli `p(1−p)`, Poisson `mean`, Gaussian from the per-arm value and sample size (see the variance-by-metric-type table in the sizing reference). - **Configured MDE and end target** (sample-size or duration, whichever the experiment uses). If no MDE was set, ask the user for "the smallest lift worth shipping" — that's the operative MDE. - **Configured traffic split vs the actual exposure ratio** — a skewed split bottlenecks the test on the smaller arm even when SRM didn't fail. - **Overall exposure volume vs plan** (target per arm × arm count) — far below plan means exposures aren't flowing as configured (reason 5). -For the closed-form power math (`n_required`, achievable `MDE_relative`), see [sizing.md](sizing.md) — those numbers explain the `NO` verdict; they don't override it. +For the closed-form power math (`n_required`, achievable `MDE_relative`), see the sizing reference — those numbers explain the not-significant verdict; they don't override it. --- @@ -101,7 +101,7 @@ Never change traffic allocation mid-Frequentist test — it invalidates the SRM - The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event. - QA traffic isn't being excluded and you suspect internal traffic is dominating one variant → enable the QA exclusion on the next run (results then are cleaner but also smaller). -**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md). +**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in the per-metric interpretation reference. --- @@ -137,7 +137,7 @@ Once you know which reason fits, the recommendation almost picks itself. | Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. | | Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. | -When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or use the power math in [sizing.md](sizing.md). +When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or use the power math in the sizing reference. --- diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md index cac3775..ae87c0c 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md @@ -76,9 +76,9 @@ Terms all four commands use without redefining. Phase-specific terms (hypothesis - **Direction.** Whether bigger is better for a metric (`up`) or smaller is better (`down`). Cancel / error / latency / abandon / refund metrics need `down` set explicitly — leaving the default silently flips polarity at interpretation. - **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict. - **MDE (Minimum Detectable Effect).** The smallest lift the experiment is sized to detect. Set during design, enforced at interpretation. -- **CUPED.** Variance-reduction technique using pre-exposure baseline. Cuts required sample 30–70% when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. -- **Winsorization.** Outlier capping applied pooled across variants. The `percentile` field is the **tail width** to cap on each side (default `5` caps below the 5th and above the 95th — i.e. the 5% tails). The schema rejects `percentile` ≥ 50. Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. -- **Multiple-testing correction.** Adjusts per-test significance threshold when several primaries or several non-control variants are tested together. Default Benjamini-Hochberg; Bonferroni for strict family-wise control. +- **CUPED.** Variance-reduction technique using pre-exposure baseline. Cuts required sample 30–70% (typical, empirical range) when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. +- **Winsorization.** Outlier capping applied pooled across variants. The `percentile` field is the **tail width** to cap on each side (default `5` caps below the 5th and above the 95th — i.e. the 5% tails). The schema rejects `percentile` ≥ 50. **Push back on tail widths above ~20%** — capping more than a fifth of each side discards too much signal; this is the canonical push-back rule the commands inherit. Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. +- **Multiple-testing correction.** Adjusts per-test significance threshold when several primaries or several non-control variants are tested together. Default Benjamini-Hochberg (platform default — verify current); Bonferroni for strict family-wise control. ## Reference files diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md index b052290..2d50ee0 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md @@ -77,7 +77,7 @@ Every primary and guardrail needs an explicit `direction`. Watch for the **laggi Pull baseline rate, variance, and daily traffic from Mixpanel. Don't guess. -Use the formulas in **Components**. Then compare the required sample to what the available traffic delivers inside an acceptable window (typically 2–4 weeks). If the achievable MDE exceeds the user's expected lift, the experiment is **underpowered** — surface immediately. Don't wave it through; offer the remediations from the sizing reference (accept a larger MDE → increase allocation → enable CUPED → pick a higher-volume primary → don't run). +Use the formulas in **Components**. Then compare the required sample to what the available traffic delivers inside an acceptable window (typically 2–4 weeks). If the achievable MDE exceeds the user's expected lift, the experiment is **underpowered** — surface immediately. Don't wave it through; offer the remediations from the sizing reference, in cost order (cheapest first). Sample-size floor: keep per-variant target above the platform's reliability floor (verify in product — historically ~350–400). Below the floor, the central limit theorem breaks down and the SRM check gets noisy. Full worked examples, baseline-by-rate lookup table, and the duration / seasonality rules in [../references/sizing.md](../references/sizing.md). @@ -95,7 +95,7 @@ Decision tree, the peeking-trap explanation, worked compounding-FPR numbers, and ### 6. Decide on advanced features - **CUPED** — enable when the primary metric correlates with pre-exposure behaviour AND all experiment users existed before start AND 2–4 weeks of stable pre-exposure history is available. Do not enable on new-user-only experiments, one-time-event metrics, or brand-new metrics. -- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. The `percentile` field is the tail width to cap (default `5` = 5% tails); push back if the user sets a percentile above ~20 — more than 20% of values capped on each side throws away too much signal. +- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. The tail-width setting defaults to 5 (5% tails); apply the umbrella glossary's Winsorization push-back rule (don't cap tails above ~20%). When/why each is right and the common misconfigurations are in [../references/advanced-features.md](../references/advanced-features.md). diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/commands/interpret.md b/plugins/mixpanel-mcp/skills/manage-experiment/commands/interpret.md index 5bbfebb..806729d 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/commands/interpret.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/commands/interpret.md @@ -9,7 +9,7 @@ The umbrella `SKILL.md` defines the shared glossary (Variant, Primary/Guardrail/ ## Glossary (interpret-specific) - **Polarity.** Whether a movement is _good for the business_. Combines sign of lift with the metric's `direction` ("up" = bigger is better; "down" = smaller is better). See the **Polarity recipe** in Components. -- **Significance.** The platform's per-row classification: `YES_POSITIVE`, `YES_NEGATIVE`, or `NO`. Read from the response — do not recompute. +- **Significance.** The platform's per-row classification: significant-positive, significant-negative, or not-significant. Read it from the result — do not recompute. - **SRM (Sample Ratio Mismatch).** Variants received traffic in proportions that disagree with the configured split. **Kohavi's #1 trustworthiness check** — when SRM fails, downstream lift, p-values, and CIs cannot be trusted. - **Retro A/A (pre-experiment bias).** Re-runs the comparison on the pre-exposure period. A failure means cohorts already differed before treatment started. - **Twyman's Law.** "Any unusually clean or unusually large result is more likely a bug than a discovery." Apply on lifts > ~30% — usually a changed-denominator artifact. @@ -21,15 +21,17 @@ The umbrella `SKILL.md` defines the shared glossary (Variant, Primary/Guardrail/ ### Polarity recipe (load-bearing — apply on every metric row) -The platform's summary buckets (`positive` / `negative` / `no`) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion. +This is the **canonical polarity recipe** for the skill — the interpret references point back here instead of restating it. -Given `lift` and the metric's `direction` ("up" or "down", defaults to "up"): +The platform's result buckets (positive / negative / no-effect) classify by **sign of lift**, NOT by business value. Translate each row through the recipe before drawing any conclusion. -- `lift is None` or `lift == 0` → **neutral** (no measurement / no effect respectively). -- `direction == "up"` → **positive** if `lift > 0`, else **negative**. -- `direction == "down"` → **positive** if `lift < 0`, else **negative**. +Given a row's lift and the metric's direction ("up" = bigger is better, "down" = smaller is better; defaults to "up"): -A row in `summary.positive` with `direction: "down"` is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control. +- Lift missing or exactly zero → **neutral** (no measurement / no effect respectively). +- Direction "up" → **positive** if lift > 0, else **negative**. +- Direction "down" → **positive** if lift < 0, else **negative**. + +A positive-bucket row on a "down" metric is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control. The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**. @@ -47,7 +49,7 @@ Experiment-details has two parallel data paths — live (preferred) and cached. | Trust ✓, target sample/duration not yet reached | **WAIT** (or extend, or restart with more power — see [../references/why-no-statsig.md](../references/why-no-statsig.md)). | | Trust ✗ | **DO NOT DECIDE.** Report the failure and recommend remediation from [../references/health-check-interpretation.md](../references/health-check-interpretation.md). | -For multi-variant tests, special variant constants (`__no_variant_shipped__`, `__defer_variant_decision__`), and the exact decide-call shape, see [../references/lifecycle-handoff.md](../references/lifecycle-handoff.md). +For multi-variant tests, the special success-without-a-single-variant choices (ship-without-a-variant, defer-the-decision), and the exact decide-call shape, see [../references/lifecycle-handoff.md](../references/lifecycle-handoff.md). --- @@ -77,7 +79,7 @@ Apply the **polarity recipe** from Components to each non-control variant × pri #### 2c. Guardrail check -Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a row in `summary.positive` for a `direction: "down"` guardrail is still a regression. +Any guardrail significant in the wrong polarity? A guardrail regression → **ITERATE**, not ship. Guardrail polarity uses the same recipe — a positive-bucket row for a "down" guardrail is still a regression. #### 2d. Practical significance diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/advanced-features.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/advanced-features.md index ffcf1d3..797eac6 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/advanced-features.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/advanced-features.md @@ -74,7 +74,7 @@ For very heavy tails (extreme whale distributions), `percentile=1` (cap each 1% ## Multiple testing correction — Bonferroni vs Benjamini-Hochberg -Covered in detail in [statistical-model.md](statistical-model.md). The short version: +Covered in detail in the statistical-model reference. The short version: - Enable when there are ≥2 primaries OR ≥2 non-control variants. - Default to Benjamini-Hochberg. More powerful with correlated primaries. @@ -106,6 +106,6 @@ Primary count ≥ 2 OR non-control variants ≥ 2? - ⛔ **CUPED on a new-user-only experiment.** No pre-exposure data; the feature does nothing. Worse, the user thinks they're being protected and ships an underpowered test. - ⛔ **Winsorization on a conversion metric.** Capping 0/1 values is meaningless. The setting either no-ops or, if a buggy implementation interprets it literally, makes the metric worse. -- ⛔ **Winsorization at a `percentile` above ~20.** Caps more than 20% of each tail — throws away too much signal. Almost always a misconfiguration. Confirm intent. +- ⛔ **Winsorization tail width above ~20%.** Almost always a misconfiguration — see Percentile guidance above. Confirm intent. - ⛔ **Multiple testing correction OFF on a 5-primary test.** Family-wise FPR balloons to ~22.6%. One in five "wins" is noise. - ⛔ **CUPED enabled "to be safe" on a metric where pre-exposure doesn't predict post-exposure.** Best case: no effect. Common case: the variance estimate gets noisier because the regression adjustment is fitting to noise. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/health-check-interpretation.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/health-check-interpretation.md index 818b163..b5f82a7 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/health-check-interpretation.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/health-check-interpretation.md @@ -96,7 +96,7 @@ The same statistical comparison run on the **pre-exposure** period revealed that 4. If the experiment is still ACTIVE: extend duration via an experiment update with a new end target. 5. If the experiment concluded too early: relaunch with longer planned duration. The setup-side skill covers the power-analysis math. -If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to [why-no-statsig.md](why-no-statsig.md) — different question. +If the user wants to talk about _why_ a primary metric is still inconclusive even when exposures look adequate, route to the why-no-statsig playbook (the interpret command links it) — different question. --- diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/hypothesis-framing.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/hypothesis-framing.md index 40cf56e..8f3476e 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/hypothesis-framing.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/hypothesis-framing.md @@ -20,7 +20,7 @@ All four properties of a good hypothesis — falsifiable, directional, mechanist | **Falsifiable** | Could the data say "no"? | "Improving UX" can't be falsified. "Increasing weekly retention by ≥2pp" can. | | **Directional** | Is the predicted change up or down? | "Affecting cart size" leaves the polarity ambiguous; the system defaults to `direction: "up"` and the interpretation step misreads regressions as wins. | | **Mechanistic** | What's the proposed causal chain? | "Because users will see X and decide Y" is a mechanism. "We think it'll work" is not. Without a mechanism, the team can't tell when the metric they picked is actually downstream of the change. | -| **Bounded in time** | Does the predicted effect occur within a measurable window? | Day-30 LTV claims need a ≥30-day experiment. A 2-week test on a 30-day metric guarantees an inconclusive result on the real effect plus a high chance of reaching false significance from noise. | +| **Bounded in time** | Does the predicted effect occur within a measurable window? | Day-30 LTV claims need a ≥30-day experiment. A 2-week test on a 30-day metric can't measure the real effect (the metric isn't mature yet) and invites a noise-driven false read. | ## When the user gives you a one-liner diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/lifecycle-handoff.md index 3a9e24c..bf548fa 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/lifecycle-handoff.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/lifecycle-handoff.md @@ -13,15 +13,15 @@ Concluding an experiment is **irreversible**. Before invoking the decide action, A decide call expresses three things: 1. **Did the experiment succeed?** A win for one of the treatments, or a deliberate stop. -2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special constants below. +2. **Which variant ships?** Required when success is true. Either a real variant key, or one of the two special choices below. 3. **Why?** A rationale message — what metrics were evaluated, the polarity reading, the tradeoffs accepted. The platform requires this on every decide call; treat it as a one-paragraph decision record, not a placeholder. ## Special variant choices for success When you have a winning result but no single variant to ship: -- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. (The platform exposes this as the constant `__no_variant_shipped__`.) -- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The platform exposes this as `__defer_variant_decision__` and shows `SUCCESS_DEFERRED` in the UI.) +- **Ship the change without picking a variant.** Use when the experiment validated a direction but the team will ship outside the experiment's variant set. (The decide action exposes a dedicated "ship without a variant" choice; the tool layer supplies the exact value.) +- **Defer the variant decision.** Use when you want to lock in the success verdict but the variant choice needs more discussion. (The decide action exposes a "defer the variant decision" choice; the UI shows the experiment as deferred.) When the verdict is KILL — no winner — record success as false. No variant key is needed in that case. @@ -30,7 +30,7 @@ When the verdict is KILL — no winner — record success as false. No variant k For a 3+ arm test, the decide action still names a single winning variant. If two treatments are roughly tied: - If both clear the practical-significance bar and shipping either is acceptable, pick on simplicity (smaller diff from control, lower implementation cost). -- If the team genuinely cannot pick, use the defer constant above — better than fabricating a winner. +- If the team genuinely cannot pick, use the defer-the-decision choice above — better than fabricating a winner. A multi-variant test where only one treatment is significantly different from control is a clean SHIP for that variant; the inconclusive arms are simply not the winner. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/metric-selection.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/metric-selection.md index 7088e3c..1c79f6d 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/metric-selection.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/metric-selection.md @@ -7,7 +7,7 @@ Each metric serves exactly one of three roles. The hypothesis tells you which. The metrics whose movement decides ship / no-ship. They come straight from the hypothesis's "outcome will ``" clause. - **Cap at 3.** Each additional primary inflates the family-wise false-positive rate. With multiple-testing correction enabled (which is the right default at 2+ primaries), more primaries → tighter per-metric threshold → harder to detect any individual effect. Beyond 3 the math punishes you regardless of how well the test is run. -- **Explicit direction.** Every primary needs `up` or `down`. The platform's default is `up`, which is wrong for cancel / error / latency / abandon / refund metrics. Setting it explicitly at setup time is the only way to keep the polarity correct through interpretation. +- **Explicit direction.** Every primary needs `up` or `down`. The platform's default is `up` (verify in product), which is wrong for cancel / error / latency / abandon / refund metrics. Setting it explicitly at setup time is the only way to keep the polarity correct through interpretation. - **Leading, not lagging.** A primary must be able to actually move within the planned experiment window. Match the metric's response window to the experiment's duration: - Onboarding-screen change → activation in the first session, not Week-4 retention. - Checkout button A/B → checkout conversion, not 30-day LTV. @@ -20,11 +20,11 @@ If the user proposes a primary, sanity-check: - _Is this metric downstream of the change?_ (A pricing change cannot move "tutorial completion".) - _Does the metric exist for both control and treatment users?_ If the change creates new events that don't exist in control, lift is artificially infinite (changed-denominator). - _Is the metric's response window shorter than the experiment's duration?_ If not, the metric is lagging — pick a leading proxy. -- _Does the metric have enough volume to detect the expected lift?_ (Cross-reference `references/sizing.md`.) +- _Does the metric have enough volume to detect the expected lift?_ (Volume drives the sizing math.) ## Guardrail metrics (0+, strongly recommended) -Metrics that **must not regress**, even if primaries win. The trustworthiness backstop on a ship decision: a 5% relative regression on any guardrail blocks ship even if the primary wins. This is the **>5% guardrail hard-gate**, and it's the most important single rule in the pitfall catalogue. +Metrics that **must not regress**, even if primaries win. The trustworthiness backstop on a ship decision: a 5% relative regression on any guardrail blocks ship even if the primary wins. This is the **>5% guardrail hard-gate** — the umbrella owns the threshold (rationale in the pitfall catalogue), and it's the most important single rule there. Standard guardrails by domain — pick at least one from the row that matches the change: @@ -49,7 +49,7 @@ Metrics for understanding **why** the primary moved, not for the ship decision. > **Setup misconfiguration to flag.** If the user's hypothesis text names a metric that they then classify as secondary, ask: > _"You mentioned `` in your hypothesis. Should this be a primary metric? Secondary metrics don't influence ship/no-ship decisions, so if it matters for the outcome, promote it."_ -This is the **Hypothesis ↔ metric mismatch** pitfall — see [pitfalls.md](pitfalls.md). +This is the **Hypothesis ↔ metric mismatch** pitfall in the pre-launch pitfall catalogue. ## Sanity checklist diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/per-metric-interpretation.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/per-metric-interpretation.md index 320cb8e..e88bc5d 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/per-metric-interpretation.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/per-metric-interpretation.md @@ -23,12 +23,12 @@ Translate a metric's lift, confidence interval, and p-value into a plain-languag ## The mental model -Each row in `summary.positive` / `summary.negative` / `summary.no` answers four questions: +Each row (positive / negative / no-effect bucket) answers four questions: -1. **Did the lift go up or down?** — the `summary` bucket name (sign-of-lift, not polarity). -2. **Was the change distinguishable from noise?** — the `significance` field (or the bucket name itself: rows in `summary.positive` / `summary.negative` are significant, rows in `summary.no` are not). -3. **Was the change in the goal direction?** — apply the polarity recipe with `metric.direction`. -4. **Was the change big enough to matter?** — multiply `lift` by the control baseline `value` to get absolute impact, then judge against business context. +1. **Did the lift go up or down?** — the bucket name (sign-of-lift, not polarity). +2. **Was the change distinguishable from noise?** — the significance classification (or the bucket name itself: positive / negative buckets are significant, the no-effect bucket is not). +3. **Was the change in the goal direction?** — apply the polarity recipe with the metric's direction. +4. **Was the change big enough to matter?** — multiply lift by the control baseline value to get absolute impact, then judge against business context. A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any one of those and you're shipping the wrong thing. @@ -36,10 +36,10 @@ A "win" requires **yes to (2)** AND **yes to (3)** AND **yes to (4)**. Skip any ## Polarity recipe -Treat the bucket name (the positive / negative / no grouping) as sign-of-lift only; the business verdict comes from combining that sign with the metric's **Direction** (defined in the Shared glossary in `SKILL.md`). A positive-sign movement on a `down`-direction metric is a regression, not a win. Examples worth remembering: +Apply the **canonical polarity recipe** (defined in the interpret command's Components): the bucket name is sign-of-lift only; the business verdict comes from combining that sign with the metric's **Direction** (Shared glossary in `SKILL.md`). Re-apply it on every row here — a positive-sign movement on a "down" metric is a regression, not a win. Examples worth remembering: -- A row in `summary.positive` with `direction: "down"` is a **regression**. -- A row in `summary.negative` with `direction: "down"` is a **win** (e.g. a `-1% interstitials_shown` lift means less interruption). +- A positive-bucket row on a "down" metric is a **regression**. +- A negative-bucket row on a "down" metric is a **win** (e.g. a -1% interstitials_shown lift means less interruption). --- @@ -47,7 +47,7 @@ Treat the bucket name (the positive / negative / no grouping) as sign-of-lift on Mixpanel runs a frequentist comparison at the experiment's configured confidence level — typically 0.95 (verify in product if results look off). If it differs from 0.95, call it out (`0.9` inflates false positives; `0.99` is conservative). -The platform-specific trap worth flagging: `liftConfidence` on a result row is the **confidence level used** (e.g. `0.95`), **not the CI width**. Easy to misread. +The platform-specific trap worth flagging: the confidence figure shown on each result row is the **confidence level used** (e.g. 0.95), **not the CI width**. Easy to misread. For the general meaning of a p-value (the probability under the null), trust the model's baseline knowledge — don't invent thresholds in either direction. @@ -59,8 +59,8 @@ For the general meaning of a p-value (the probability under the null), trust the lift = (treatment_mean - control_mean) / control_mean ``` -- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the `lift` field is correct. -- If `lift is None` in a row, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect." +- **Total / sum metrics use exposure rebalancing.** If treatment has more exposed users than control, the raw sum will mechanically be higher. The platform computes lift per-exposure already; **don't manually divide raw totals when explaining results** — the reported lift is correct. +- If a row's lift is missing, **the calculation failed for that variant.** Surface the failure; do not interpret as "no effect." --- @@ -133,7 +133,7 @@ Different metric types behave differently; cite the relevant nuance in your verd ## Variance-reduction & outlier settings that change interpretation - **CUPED enabled**: mean is unchanged; variance reduced 30–70%; CIs narrower; power higher. Note: CUPED requires users to exist before the experiment — new-user-only experiments cannot use CUPED; if it's enabled there, it had no effect (mention as informational, not as a misconfiguration to fix). -- **Winsorization enabled**: extreme values capped at both tails, pooled across variants. The `percentile` field is the tail width (default `5` = 5% tails). Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A `percentile` much higher than the default — capping more than ~20% of each tail — is a misconfiguration; see the **Misconfigurations** section in [health-check-interpretation.md](health-check-interpretation.md). +- **Winsorization enabled**: extreme values capped at both tails, pooled across variants. The tail-width setting defaults to 5 (5% tails). Lifts reflect typical-user behavior, not whale behavior. Bernoulli (conversion) metrics ignore Winsorization. A much higher tail width — capping more than ~20% of each side — is a misconfiguration; see the Misconfigurations notes in the health-check interpretation reference. --- @@ -153,7 +153,7 @@ If multiple-testing correction is off AND there are 2+ primaries × 1+ non-contr A "not significant" verdict means the experiment didn't have enough signal to distinguish the effect from noise at the chosen confidence level — **not that there is no effect.** Important when the user is about to call something a null result. -For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see [why-no-statsig.md](why-no-statsig.md). +For the full walk-through on what to do about it (wait, extend, boost power, narrow, accept null), see the why-no-statsig playbook. --- @@ -161,7 +161,7 @@ For the full walk-through on what to do about it (wait, extend, boost power, nar Concluding a Frequentist experiment before it reaches its configured target is a peeking event — per-metric significance verdicts become unreliable. Sequential experiments are designed for continuous monitoring and don't have this problem. -For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of [health-check-interpretation.md](health-check-interpretation.md). +For the full diagnosis when peeking is suspected, see the **Frequentist peeking** section of the health-check interpretation reference. --- diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/pitfalls.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/pitfalls.md index c20eb30..2d6b25a 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/pitfalls.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/pitfalls.md @@ -53,7 +53,7 @@ If the team genuinely wants to make that trade, they can disable the guardrail b ### High variance, no Winsorization — warning -**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the default `percentile=5` (cap each 5% tail). Push back if the user sets a `percentile` above ~20 — more than 20% of values capped on each side is almost always a misconfiguration. +**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the default tail width (5% tails). Push back on tail widths above ~20% — see the umbrella glossary's Winsorization push-back rule. ### Multiple primaries, no correction — warning diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/prior-experiments.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/prior-experiments.md index 476a26c..c5fbb42 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/prior-experiments.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/prior-experiments.md @@ -53,7 +53,7 @@ Concretely, when you have a prior result that's relevant, the setup workflow cha | Step 1 — hypothesis | Coach from scratch | Anchor on the prior's hypothesis; ask what's different | | Step 2 — metric selection | Suggest standard primaries/guardrails | Use the prior's metric set as the default; modify only with reason | | Step 3 — sizing | Query baseline + variance over the prior window | Use the prior's observed baseline and variance | -| Step 4 — statistical model | Default to sequential / benjamini-hochberg | If the prior used a specific model and the team is comparing across tests, keep the same model for comparability | +| Step 4 — statistical model | Default to sequential / Benjamini-Hochberg (verify current) | If the prior used a specific model and the team is comparing across tests, keep the same model for comparability | | Pitfall check | Run the standard catalogue | Cross-reference: did the prior have an SRM problem? A guardrail regression that should be set up as primary this time? | ## When prior tests warn you away from testing at all diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/routing-xp-vs-ff.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/routing-xp-vs-ff.md index 7e13471..1f685dd 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/routing-xp-vs-ff.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/routing-xp-vs-ff.md @@ -13,7 +13,7 @@ Before any setup work, decide whether the user actually wants an **experiment** | Targeted access — "give beta access to these 50 design partners." | **Feature flag** (FF). | | Both — "ship to 10%, but also tell me if it moves checkout conversion." | **Experiment** with a phased rollout, or **FF + a separate experiment** later. | -The clean way to think about it: a feature flag is a **delivery mechanism**. An experiment is a **decision mechanism** built on top of one. Every experiment uses a feature flag under the hood (Mixpanel auto-creates one when an experiment is created); not every feature flag use case needs an experiment. +The clean way to think about it: a feature flag is a **delivery mechanism**. An experiment is a **decision mechanism** built on top of one. An experiment using the feature-flag collection method auto-creates a flag under the hood (verify current); not every feature flag use case needs an experiment. ## Disambiguation prompt diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/segment-breakdown-interpretation.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/segment-breakdown-interpretation.md index 98c7bbc..c079df3 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/segment-breakdown-interpretation.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/segment-breakdown-interpretation.md @@ -1,6 +1,6 @@ # Segment-Breakdown Interpretation -Read per-segment results once you have them. The companion reference [segment-of-interest-selection.md](segment-of-interest-selection.md) covers how to pick the segments in the first place. +Read per-segment results once you have them. The companion segment-of-interest selection reference covers how to pick the segments in the first place. --- @@ -18,11 +18,7 @@ Reading a segment breakdown well means recognizing which of those three you're l ## Per-segment polarity recipe — apply per row -The same recipe from the per-metric reference applies _inside_ each segment. Don't take a shortcut. - -- For each segment × metric × non-control variant, look at the row's `lift` and bucket (positive/negative/no). -- Translate sign-of-lift into business polarity using `metric.direction`. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary. -- Filter out the control row in each segment. +The **canonical polarity recipe** (interpret command Components) applies _inside_ each segment too — don't take a shortcut. For each segment × metric × non-control variant, translate the row's sign-of-lift into business polarity using the metric's direction, and filter out the control row in each segment. **The bucket name is sign-of-lift, never the business verdict** — same trap as the overall summary. Surprisingly easy to forget when you're scanning a wide table — re-apply polarity per row. @@ -47,7 +43,7 @@ Each segment value needs its own meaningful per-variant sample for the per-segme | Every segment shows treatment winning, but the overall metric shows control winning (or vice versa) | **Simpson's paradox.** The variant mix differs across segments. Run per-segment SRM checks — this often signals a bucketing bug rather than a real effect. | | Two opposite-direction effects in different segments that roughly cancel overall | **Mixed effects.** The headline says "no effect" but real winners and losers are hiding. The product question is whether the gains outweigh the losses. | -When you spot Simpson's paradox, route the user to the **SRM** section of [health-check-interpretation.md](health-check-interpretation.md) — bucketing is usually the cause, not a real reversal. +When you spot Simpson's paradox, route the user to the **SRM** section of the health-check interpretation reference — bucketing is usually the cause, not a real reversal. --- @@ -96,4 +92,4 @@ This is the everyday case of mixed effects. ## Platform support status -Reading segment-level experiment results depends on the platform exposing per-segment metric rows. While that's still in progress, this skill may need to fall back to running per-segment queries against the experiment's metrics and exposures, then interpreting the resulting numbers with the rules above. If the experiment-details response doesn't return segmented data and the user wants per-segment interpretation, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts. +Reading segment-level experiment results depends on the platform exposing per-segment metric rows. When the experiment-details response doesn't return per-segment rows, fall back to running per-segment queries against the experiment's metrics and exposures, then interpret the resulting numbers with the rules above. If the user wants per-segment interpretation and segmented data isn't available, say so explicitly and offer the per-segment query fallback — do not invent per-segment significance verdicts. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/segment-of-interest-selection.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/segment-of-interest-selection.md index d290517..dc4c6cd 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/segment-of-interest-selection.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/segment-of-interest-selection.md @@ -2,7 +2,7 @@ Pick 3–5 segments **likely to reveal a real effect difference** before slicing every available dimension and ending up p-hacking. -The companion reference [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md) covers how to _read_ the per-segment results once you have them. +The companion segment-breakdown interpretation reference covers how to _read_ the per-segment results once you have them. ## Contents @@ -122,4 +122,4 @@ Pre-commitment is what separates "segmentation analysis" from "fishing." ## Then read the results -Once the segment breakdown is in hand, switch to [segment-breakdown-interpretation.md](segment-breakdown-interpretation.md). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there. +Once the segment breakdown is in hand, switch to the segment-breakdown interpretation reference (the interpret command links it). The reading rules (Simpson's paradox, per-segment polarity, sample-size floor per segment) live there. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/session-replay-analysis.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/session-replay-analysis.md index 11d5158..b8be577 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/session-replay-analysis.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/session-replay-analysis.md @@ -83,7 +83,7 @@ If treatment users _arrive_ at a screen more often but _complete_ at a lower per ### Variant-specific UI issues - **Treatment showed the wrong copy / wrong asset** — surprisingly common; treatment shipped, but to a subset of routes only. -- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to [health-check-interpretation.md](health-check-interpretation.md). +- **Treatment didn't render at all** — users in the treatment cohort saw the control UI (exposure-tracking bug; bucketing bug). If you see this, route back to the health-check interpretation guidance. - **Treatment fired twice / persisted state across sessions** — implementation regression. --- diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/sizing.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/sizing.md index 8a057d9..c79fc3c 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/sizing.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/sizing.md @@ -29,13 +29,13 @@ Where: - `σ²` = variance of the metric (depends on metric type — see below). - `d` = MDE in the same units as the metric. -The `16` is `(z_{α/2} + z_{β})² × 2` rounded to a workable constant — `(1.96 + 0.84)² × 2 = 15.68 ≈ 16`. Good enough for setup-phase reasoning; for ship-decision rigour use the precise formula in `references/statistical-model.md`. +The `16` is `(z_{α/2} + z_{β})² × 2` rounded to a workable constant — `(1.96 + 0.84)² × 2 = 15.68 ≈ 16`. Good enough for setup-phase reasoning; for ship-decision rigour use the precise z-score formula rather than the rounded constant. ## Variance by metric type - **Bernoulli (conversion rate).** `σ² = p(1−p)` where `p` is the baseline conversion rate. Variance peaks at `p = 0.5` (variance 0.25) and shrinks toward 0 at `p = 0` or `p = 1`. Lifts are easier to detect on rates near 50%, harder near the extremes. - **Poisson (event counts per user).** `σ² ≈ mean count per user`. High-count metrics need proportionally more sample. -- **Gaussian (revenue, time-on-page, etc.).** Compute `σ²` from historical data directly. Long-tailed distributions have high variance — Winsorization (`references/advanced-features.md`) cuts this. +- **Gaussian (revenue, time-on-page, etc.).** Compute `σ²` from historical data directly. Long-tailed distributions have high variance — Winsorization cuts this. ## Worked example @@ -64,7 +64,7 @@ Underpowered experiments suffer from **winner's curse**: if you do reach signifi ## Achievable MDE for a running experiment (diagnosis form) -When diagnosing a live experiment that hasn't hit significance (see [why-no-statsig.md](why-no-statsig.md)), you want the achievable MDE as a **relative** lift so it compares directly against the reported `lift`. Two unit traps make this wrong more often than not: +When diagnosing a live experiment that hasn't hit significance (the why-no-statsig playbook covers that path), you want the achievable MDE as a **relative** lift so it compares directly against the reported lift. Two unit traps make this wrong more often than not: - `MDE = 4σ / √n` above is **absolute** (metric units). Divide by the baseline to get a relative fraction: @@ -105,7 +105,7 @@ Offer these in order of cost — cheap first. 1. **Accept a larger MDE.** Only commit to ship if the effect is bigger. This costs nothing but redraws the success criterion; confirm the user is OK with shipping only on a larger lift. 2. **Increase traffic allocation to the experiment.** If other tests don't need the traffic, give this one more. -3. **Use CUPED to reduce variance** (if pre-exposure data is available). 30–70% variance reduction translates directly into 30–70% smaller required sample. See `references/advanced-features.md`. +3. **Use CUPED to reduce variance** (if pre-exposure data is available). 30–70% variance reduction translates directly into 30–70% smaller required sample. 4. **Pick a higher-volume primary metric** (if the hypothesis allows). Often there's a leading proxy with more volume than the lagging metric the team originally chose. 5. **Don't run the experiment.** Invest the engineering elsewhere. Sometimes the right answer. @@ -133,11 +133,11 @@ Use this for quick sanity-checking. Always confirm with a query against actual b For a multi-arm test (N non-control variants), the per-variant target grows with the number of pairwise comparisons being made (each treatment vs control). With multiple-testing correction enabled (which is the right default at 2+ variants), the per-test α tightens, which inflates required sample size further. -Rule of thumb: a 3-variant test (control + 2 treatments) needs about 1.3× the per-arm sample of a 2-variant test for the same MDE; a 4-variant test needs about 1.5×. Exact multipliers depend on the correction method — see `references/advanced-features.md`. +Rule of thumb: a 3-variant test (control + 2 treatments) needs about 1.3× the per-arm sample of a 2-variant test for the same MDE; a 4-variant test needs about 1.5×. Exact multipliers depend on the correction method. ## Duration considerations - **Minimum 1 week** — anything shorter misses weekly seasonality and conflates the day-of-week mix between control and treatment if traffic differs across days. - **Minimum 3 days for read-out** — even with sequential testing and big effects, results under 3 days are typically un-interpretable (cohort hasn't stabilised, day-of-week effects dominate, novelty effect not separated from treatment effect). -- **Multiples of the seasonal cycle.** If the primary metric has strong weekly seasonality, set `endCondition: "days"` and choose 7, 14, 21, or 28 days so each variant sees the same mix of high- and low-traffic periods. +- **Multiples of the seasonal cycle.** If the primary metric has strong weekly seasonality, use a date-based end condition and choose 7, 14, 21, or 28 days so each variant sees the same mix of high- and low-traffic periods. - **Cap at ~6 weeks** for most tests — beyond this, novelty effects wear off, the user population drifts, and other experiments running in the same window create cross-test contamination. If the math says you need 8+ weeks, you're underpowered — pick a remediation from the list above. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/statistical-model.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/statistical-model.md index 1d832e0..6ebf0bc 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/statistical-model.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/statistical-model.md @@ -85,7 +85,7 @@ Enable when there are ≥2 primary metrics OR ≥2 non-control variants. Without Derived from the standard `1 − (1 − α)^k` compounding for `k = primaries × non-control variants` independent tests at per-test α = 0.05. -The takeaway: by the time you're testing 5 primaries on a 3-arm experiment, more than half of the "wins" are noise. +The takeaway: by the time you're testing 5 primaries on a 4-arm experiment (3 non-control variants), more than half of the "wins" are noise. Two methods are available: @@ -107,4 +107,4 @@ When the user pushes you on the confidence level: - Raising α from 0.05 to 0.10 increases power (smaller required sample for the same MDE) but doubles the rate of false-positive "wins." - Lowering α from 0.05 to 0.01 cuts the false-positive rate fivefold but requires roughly 1.5× the sample for the same MDE. -If the user wants more power without raising α, the right move is **smaller MDE → bigger required sample**, not loosening significance. If sample is the binding constraint, reach for CUPED (`references/advanced-features.md`) or a higher-volume proxy metric. +If the user wants more power without raising α, the right move is **smaller MDE → bigger required sample**, not loosening significance. If sample is the binding constraint, reach for CUPED or a higher-volume proxy metric. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/why-no-statsig.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/why-no-statsig.md index 4995f62..d30605c 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/why-no-statsig.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/why-no-statsig.md @@ -2,7 +2,7 @@ Help the user decide between **wait**, **extend**, **boost power**, **narrow the hypothesis**, or **accept the null** — _without_ recomputing the platform's verdicts. -The actual stop / extend math (sample size, power, MDE) lives in [sizing.md](sizing.md) — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one. +The actual stop / extend math (sample size, power, MDE) lives in the sizing reference — point the user there for the formulas. This reference explains _which_ lever to pull, not how to recompute one. ## Contents @@ -23,7 +23,7 @@ Inconclusive can mean two very different things: 1. **The experiment is genuinely too small to detect the effect** — this is what the rest of this document is about. 2. **The result isn't trustworthy at all** — SRM failing, broken data, peeked frequentist, etc. — and "inconclusive" is the wrong frame entirely. -Before answering "why no statsig?", run the **trustworthiness gate**. If anything fails, route to [health-check-interpretation.md](health-check-interpretation.md) — fixing the bucketing or the data is a prerequisite to talking about power. +Before answering "why no statsig?", run the **trustworthiness gate**. If anything fails, route to the health-check interpretation guidance — fixing the bucketing or the data is a prerequisite to talking about power. Also check: @@ -40,12 +40,12 @@ Before walking the reasons, fetch the experiment with its exposures and metric r - **Per-variant exposure counts** — the smallest arm is the binding constraint, not the total. - **Control baseline** for each inconclusive primary (its rate or value). If it's missing, query the metric scoped to the control variant over the experiment's dates. - **Observed lift** per primary — relative, `(treatment − control) / control`. Apply the polarity recipe before reading the sign. -- **Variance proxy by metric type** — Bernoulli `p(1−p)`, Poisson `mean`, Gaussian from the per-arm value and sample size (see the variance-by-metric-type table in [sizing.md](sizing.md)). +- **Variance proxy by metric type** — Bernoulli `p(1−p)`, Poisson `mean`, Gaussian from the per-arm value and sample size (see the variance-by-metric-type table in the sizing reference). - **Configured MDE and end target** (sample-size or duration, whichever the experiment uses). If no MDE was set, ask the user for "the smallest lift worth shipping" — that's the operative MDE. - **Configured traffic split vs the actual exposure ratio** — a skewed split bottlenecks the test on the smaller arm even when SRM didn't fail. - **Overall exposure volume vs plan** (target per arm × arm count) — far below plan means exposures aren't flowing as configured (reason 5). -For the closed-form power math (`n_required`, achievable `MDE_relative`), see [sizing.md](sizing.md) — those numbers explain the `NO` verdict; they don't override it. +For the closed-form power math (`n_required`, achievable `MDE_relative`), see the sizing reference — those numbers explain the not-significant verdict; they don't override it. --- @@ -101,7 +101,7 @@ Never change traffic allocation mid-Frequentist test — it invalidates the SRM - The exposure event isn't firing where the user thinks it does (e.g. only on a deep-funnel page) → effective exposed cohort is much smaller than top-of-funnel traffic. Confirm with a query on the exposure event. - QA traffic isn't being excluded and you suspect internal traffic is dominating one variant → enable the QA exclusion on the next run (results then are cleaner but also smaller). -**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in [per-metric-interpretation.md](per-metric-interpretation.md). +**Triggered / dilution math** matters here too. If only a fraction of "exposed" users actually saw the change (e.g. they didn't reach the screen where the treatment differs), the population-level lift is diluted. See the triggered-analysis notes in the per-metric interpretation reference. --- @@ -137,7 +137,7 @@ Once you know which reason fits, the recommendation almost picks itself. | Exposure config is filtering | **NARROW the hypothesis** to the triggered cohort, or **EXTEND** to grow the triggered sample. | | Experiment finished, well-powered | **ACCEPT NULL.** "No effect" is a real finding when the experiment was sized for the MDE that matters. | -When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or use the power math in [sizing.md](sizing.md). +When recommending EXTEND on an active experiment, the action is to update the experiment's end target (duration or sample size, whichever it was configured for). Don't fabricate the target number — derive it from the experiment's existing config, or use the power math in the sizing reference. --- From c32867e78985b3a210ad13658026f5a3272fce64 Mon Sep 17 00:00:00 2001 From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Date: Tue, 16 Jun 2026 03:52:17 +0000 Subject: [PATCH 09/11] manage-experiment: address CRUD QA findings Live CRUD QA against the Mixpanel MCP surfaced gaps where the skill instructs behavior the tools handle differently than assumed: - Settings full-replace footgun (High): the experiment-update tool's `settings` payload is a full replace, so a one-field edit silently nulls `srm`, `excludeQA`, etc. Added a read-merge-write Behaviour rule to SKILL.md, a pointer in design step 8, and a launch-readiness warning row for dropped `srm` / `excludeQA`. - Metric `direction` not settable via inline metric creation (Medium): metrics default to `up`. Added a caveat in design step 3 to set `down` in the UI for down-polarity metrics and verify before launch. - Archive doesn't cascade to the backing flag (Low): noted in lifecycle-handoff that the auto-created flag is left behind. - Generic experiment-listing errors (Low): Behaviour rule 2 now covers the experiments-not-enabled failure case. Synced across mixpanel-mcp / -eu / -in. Co-Authored-By: Claude Opus 4.8 (1M context) --- plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md | 7 ++++--- .../skills/manage-experiment/commands/design.md | 6 +++++- .../skills/manage-experiment/commands/launch.md | 1 + .../manage-experiment/references/lifecycle-handoff.md | 2 ++ plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md | 7 ++++--- .../skills/manage-experiment/commands/design.md | 6 +++++- .../skills/manage-experiment/commands/launch.md | 1 + .../manage-experiment/references/lifecycle-handoff.md | 2 ++ plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md | 7 ++++--- .../skills/manage-experiment/commands/design.md | 6 +++++- .../skills/manage-experiment/commands/launch.md | 1 + .../manage-experiment/references/lifecycle-handoff.md | 2 ++ 12 files changed, 36 insertions(+), 12 deletions(-) diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md index ae87c0c..aacb935 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md @@ -145,9 +145,10 @@ All four commands use the same visual vocabulary so multi-command sessions read ## Behaviour rules 1. **Irreversible actions require explicit confirmation.** Creating an experiment (in `design`), launching one (in `launch`), terminating one mid-flight (in `monitor`), and concluding one (in `interpret`) are all irreversible. Show the proposed action, wait for the user to confirm with literal `CONFIRM` for the destructive ones. -2. **If a command can't complete, explain why.** Tell the user what failed and what they can try. Don't fail silently. -3. **Experiment switching.** If the user wants to operate on a different experiment mid-session, ask which one and reset experiment-scoped context. -4. **Project switching.** If the user wants to operate on a different project mid-session, suggest starting a new conversation first. If they insist, resolve the new project and continue with that `project_id`. +2. **If a command can't complete, explain why.** Tell the user what failed and what they can try. Don't fail silently. This includes a failed experiment lookup (e.g. the listing tool errors on a project where experiments aren't enabled) — surface the error and stop; don't proceed as if no experiments exist. +3. **Editing settings is a full replace — read-merge-write.** The experiment-update tool's `settings` payload **replaces the entire settings object**; any field you omit is reset, silently dropping `srm`, `excludeQA`, `cuped`, `winsorization`, `preExperimentBias`, and `controlKey`. Before any settings edit, fetch the current experiment and re-send the **complete** settings object with your one change applied — never a partial `settings`. After the edit, re-verify `srm.enabled` and `excludeQA` survived. Losing `srm` (Kohavi's #1 trustworthiness check) to a one-field edit is the failure mode this rule exists to prevent. +4. **Experiment switching.** If the user wants to operate on a different experiment mid-session, ask which one and reset experiment-scoped context. +5. **Project switching.** If the user wants to operate on a different project mid-session, suggest starting a new conversation first. If they insist, resolve the new project and continue with that `project_id`. --- diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md index 2d50ee0..14f9b13 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md @@ -71,7 +71,9 @@ The hypothesis names a specific outcome. The primary metric must measure that ou - **Guardrails** (strongly recommended) cover the most likely failure mode of the change — see the guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). - **Secondaries** are diagnostic only. -Every primary and guardrail needs an explicit `direction`. Watch for the **lagging-indicator trap** (30-day retention as primary on a 2-week experiment) and the **changed-denominator trap** (metric defined only over treatment-exposed users — lift is artificially infinite). Full sanity checklist and standard guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). +Every primary and guardrail needs an explicit `direction`. **Caveat — the experiment-create tool's inline metric input does not expose `direction`; every metric it creates defaults to `up`.** For any down-polarity metric (cancel / error / latency / abandon / refund / removed), set `direction` to `down` in the Mixpanel UI after the draft is created, and confirm it before launch — an unset `down` silently flips the polarity verdict at interpretation. The `launch` readiness check re-flags any primary still left at the `up` default. + +Watch for the **lagging-indicator trap** (30-day retention as primary on a 2-week experiment) and the **changed-denominator trap** (metric defined only over treatment-exposed users — lift is artificially infinite). Full sanity checklist and standard guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). ### 4. Size the experiment with real data @@ -132,6 +134,8 @@ Saving the design as a `DRAFT` is reversible (the user can keep iterating, or de Use the exact catalogue labels from [../references/pitfalls.md](../references/pitfalls.md) so the agent's pitfall messages stay consistent across the design and launch commands. +If the user iterates on an already-saved draft, apply the **read-merge-write** rule for settings (umbrella Behaviour rules) — re-send the full settings object on every edit so a one-field change doesn't silently drop `srm` / `excludeQA`. + After saving the draft, link it back to any prior experiment surfaced in step 1 — record the prior's ID, hypothesis, and outcome in the new experiment's description. That 30-second annotation pays back tenfold at interpretation time. ### 9. Hand off to launch diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/launch.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/launch.md index 4f8aecb..57e02a3 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/launch.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/launch.md @@ -38,6 +38,7 @@ Run this against the experiment about to launch. Surface only what fires; order | Warning | The pre-launch pitfall catalogue reports a warning. | | Warning | No guardrail metrics configured. Without guardrails, the regression hard-gate (see umbrella Cross-command policies) cannot protect the ship decision. | | Warning | A primary metric has `direction` unset (defaults to `up`); cancel / error / latency / abandon / refund metrics need `down` set explicitly. | +| Warning | `srm.enabled` is false or `excludeQA` is unset — both are easily lost to a partial settings edit (see the umbrella's read-merge-write rule); re-confirm before the allocation locks. | | FYI | The experiment isn't linked back to a prior experiment on the same feature, even though prior experiments exist. Recommend adding the link before launch. | The pitfall catalogue itself lives in [../references/pitfalls.md](../references/pitfalls.md) — don't duplicate the rules here; run them and report results. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/lifecycle-handoff.md index bf548fa..d56a927 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/lifecycle-handoff.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/lifecycle-handoff.md @@ -37,3 +37,5 @@ A multi-variant test where only one treatment is significantly different from co ## After concluding The decision record — the rationale message, the shipped variant, and the experiment's terminal status — becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer. + +Concluding (or archiving) the experiment does **not** clean up the backing feature flag that `create` auto-provisioned — it's left disabled, not archived. If the team wants the flag gone too, archive it separately via the `manage-feature-flags` skill; don't assume the experiment's terminal state removed it. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md index ae87c0c..aacb935 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md @@ -145,9 +145,10 @@ All four commands use the same visual vocabulary so multi-command sessions read ## Behaviour rules 1. **Irreversible actions require explicit confirmation.** Creating an experiment (in `design`), launching one (in `launch`), terminating one mid-flight (in `monitor`), and concluding one (in `interpret`) are all irreversible. Show the proposed action, wait for the user to confirm with literal `CONFIRM` for the destructive ones. -2. **If a command can't complete, explain why.** Tell the user what failed and what they can try. Don't fail silently. -3. **Experiment switching.** If the user wants to operate on a different experiment mid-session, ask which one and reset experiment-scoped context. -4. **Project switching.** If the user wants to operate on a different project mid-session, suggest starting a new conversation first. If they insist, resolve the new project and continue with that `project_id`. +2. **If a command can't complete, explain why.** Tell the user what failed and what they can try. Don't fail silently. This includes a failed experiment lookup (e.g. the listing tool errors on a project where experiments aren't enabled) — surface the error and stop; don't proceed as if no experiments exist. +3. **Editing settings is a full replace — read-merge-write.** The experiment-update tool's `settings` payload **replaces the entire settings object**; any field you omit is reset, silently dropping `srm`, `excludeQA`, `cuped`, `winsorization`, `preExperimentBias`, and `controlKey`. Before any settings edit, fetch the current experiment and re-send the **complete** settings object with your one change applied — never a partial `settings`. After the edit, re-verify `srm.enabled` and `excludeQA` survived. Losing `srm` (Kohavi's #1 trustworthiness check) to a one-field edit is the failure mode this rule exists to prevent. +4. **Experiment switching.** If the user wants to operate on a different experiment mid-session, ask which one and reset experiment-scoped context. +5. **Project switching.** If the user wants to operate on a different project mid-session, suggest starting a new conversation first. If they insist, resolve the new project and continue with that `project_id`. --- diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md index 2d50ee0..14f9b13 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md @@ -71,7 +71,9 @@ The hypothesis names a specific outcome. The primary metric must measure that ou - **Guardrails** (strongly recommended) cover the most likely failure mode of the change — see the guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). - **Secondaries** are diagnostic only. -Every primary and guardrail needs an explicit `direction`. Watch for the **lagging-indicator trap** (30-day retention as primary on a 2-week experiment) and the **changed-denominator trap** (metric defined only over treatment-exposed users — lift is artificially infinite). Full sanity checklist and standard guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). +Every primary and guardrail needs an explicit `direction`. **Caveat — the experiment-create tool's inline metric input does not expose `direction`; every metric it creates defaults to `up`.** For any down-polarity metric (cancel / error / latency / abandon / refund / removed), set `direction` to `down` in the Mixpanel UI after the draft is created, and confirm it before launch — an unset `down` silently flips the polarity verdict at interpretation. The `launch` readiness check re-flags any primary still left at the `up` default. + +Watch for the **lagging-indicator trap** (30-day retention as primary on a 2-week experiment) and the **changed-denominator trap** (metric defined only over treatment-exposed users — lift is artificially infinite). Full sanity checklist and standard guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). ### 4. Size the experiment with real data @@ -132,6 +134,8 @@ Saving the design as a `DRAFT` is reversible (the user can keep iterating, or de Use the exact catalogue labels from [../references/pitfalls.md](../references/pitfalls.md) so the agent's pitfall messages stay consistent across the design and launch commands. +If the user iterates on an already-saved draft, apply the **read-merge-write** rule for settings (umbrella Behaviour rules) — re-send the full settings object on every edit so a one-field change doesn't silently drop `srm` / `excludeQA`. + After saving the draft, link it back to any prior experiment surfaced in step 1 — record the prior's ID, hypothesis, and outcome in the new experiment's description. That 30-second annotation pays back tenfold at interpretation time. ### 9. Hand off to launch diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/launch.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/launch.md index 4f8aecb..57e02a3 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/launch.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/launch.md @@ -38,6 +38,7 @@ Run this against the experiment about to launch. Surface only what fires; order | Warning | The pre-launch pitfall catalogue reports a warning. | | Warning | No guardrail metrics configured. Without guardrails, the regression hard-gate (see umbrella Cross-command policies) cannot protect the ship decision. | | Warning | A primary metric has `direction` unset (defaults to `up`); cancel / error / latency / abandon / refund metrics need `down` set explicitly. | +| Warning | `srm.enabled` is false or `excludeQA` is unset — both are easily lost to a partial settings edit (see the umbrella's read-merge-write rule); re-confirm before the allocation locks. | | FYI | The experiment isn't linked back to a prior experiment on the same feature, even though prior experiments exist. Recommend adding the link before launch. | The pitfall catalogue itself lives in [../references/pitfalls.md](../references/pitfalls.md) — don't duplicate the rules here; run them and report results. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/lifecycle-handoff.md index bf548fa..d56a927 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/lifecycle-handoff.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/lifecycle-handoff.md @@ -37,3 +37,5 @@ A multi-variant test where only one treatment is significantly different from co ## After concluding The decision record — the rationale message, the shipped variant, and the experiment's terminal status — becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer. + +Concluding (or archiving) the experiment does **not** clean up the backing feature flag that `create` auto-provisioned — it's left disabled, not archived. If the team wants the flag gone too, archive it separately via the `manage-feature-flags` skill; don't assume the experiment's terminal state removed it. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md index ae87c0c..aacb935 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md @@ -145,9 +145,10 @@ All four commands use the same visual vocabulary so multi-command sessions read ## Behaviour rules 1. **Irreversible actions require explicit confirmation.** Creating an experiment (in `design`), launching one (in `launch`), terminating one mid-flight (in `monitor`), and concluding one (in `interpret`) are all irreversible. Show the proposed action, wait for the user to confirm with literal `CONFIRM` for the destructive ones. -2. **If a command can't complete, explain why.** Tell the user what failed and what they can try. Don't fail silently. -3. **Experiment switching.** If the user wants to operate on a different experiment mid-session, ask which one and reset experiment-scoped context. -4. **Project switching.** If the user wants to operate on a different project mid-session, suggest starting a new conversation first. If they insist, resolve the new project and continue with that `project_id`. +2. **If a command can't complete, explain why.** Tell the user what failed and what they can try. Don't fail silently. This includes a failed experiment lookup (e.g. the listing tool errors on a project where experiments aren't enabled) — surface the error and stop; don't proceed as if no experiments exist. +3. **Editing settings is a full replace — read-merge-write.** The experiment-update tool's `settings` payload **replaces the entire settings object**; any field you omit is reset, silently dropping `srm`, `excludeQA`, `cuped`, `winsorization`, `preExperimentBias`, and `controlKey`. Before any settings edit, fetch the current experiment and re-send the **complete** settings object with your one change applied — never a partial `settings`. After the edit, re-verify `srm.enabled` and `excludeQA` survived. Losing `srm` (Kohavi's #1 trustworthiness check) to a one-field edit is the failure mode this rule exists to prevent. +4. **Experiment switching.** If the user wants to operate on a different experiment mid-session, ask which one and reset experiment-scoped context. +5. **Project switching.** If the user wants to operate on a different project mid-session, suggest starting a new conversation first. If they insist, resolve the new project and continue with that `project_id`. --- diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md index 2d50ee0..14f9b13 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md @@ -71,7 +71,9 @@ The hypothesis names a specific outcome. The primary metric must measure that ou - **Guardrails** (strongly recommended) cover the most likely failure mode of the change — see the guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). - **Secondaries** are diagnostic only. -Every primary and guardrail needs an explicit `direction`. Watch for the **lagging-indicator trap** (30-day retention as primary on a 2-week experiment) and the **changed-denominator trap** (metric defined only over treatment-exposed users — lift is artificially infinite). Full sanity checklist and standard guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). +Every primary and guardrail needs an explicit `direction`. **Caveat — the experiment-create tool's inline metric input does not expose `direction`; every metric it creates defaults to `up`.** For any down-polarity metric (cancel / error / latency / abandon / refund / removed), set `direction` to `down` in the Mixpanel UI after the draft is created, and confirm it before launch — an unset `down` silently flips the polarity verdict at interpretation. The `launch` readiness check re-flags any primary still left at the `up` default. + +Watch for the **lagging-indicator trap** (30-day retention as primary on a 2-week experiment) and the **changed-denominator trap** (metric defined only over treatment-exposed users — lift is artificially infinite). Full sanity checklist and standard guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). ### 4. Size the experiment with real data @@ -132,6 +134,8 @@ Saving the design as a `DRAFT` is reversible (the user can keep iterating, or de Use the exact catalogue labels from [../references/pitfalls.md](../references/pitfalls.md) so the agent's pitfall messages stay consistent across the design and launch commands. +If the user iterates on an already-saved draft, apply the **read-merge-write** rule for settings (umbrella Behaviour rules) — re-send the full settings object on every edit so a one-field change doesn't silently drop `srm` / `excludeQA`. + After saving the draft, link it back to any prior experiment surfaced in step 1 — record the prior's ID, hypothesis, and outcome in the new experiment's description. That 30-second annotation pays back tenfold at interpretation time. ### 9. Hand off to launch diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/commands/launch.md b/plugins/mixpanel-mcp/skills/manage-experiment/commands/launch.md index 4f8aecb..57e02a3 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/commands/launch.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/commands/launch.md @@ -38,6 +38,7 @@ Run this against the experiment about to launch. Surface only what fires; order | Warning | The pre-launch pitfall catalogue reports a warning. | | Warning | No guardrail metrics configured. Without guardrails, the regression hard-gate (see umbrella Cross-command policies) cannot protect the ship decision. | | Warning | A primary metric has `direction` unset (defaults to `up`); cancel / error / latency / abandon / refund metrics need `down` set explicitly. | +| Warning | `srm.enabled` is false or `excludeQA` is unset — both are easily lost to a partial settings edit (see the umbrella's read-merge-write rule); re-confirm before the allocation locks. | | FYI | The experiment isn't linked back to a prior experiment on the same feature, even though prior experiments exist. Recommend adding the link before launch. | The pitfall catalogue itself lives in [../references/pitfalls.md](../references/pitfalls.md) — don't duplicate the rules here; run them and report results. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/lifecycle-handoff.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/lifecycle-handoff.md index bf548fa..d56a927 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/lifecycle-handoff.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/lifecycle-handoff.md @@ -37,3 +37,5 @@ A multi-variant test where only one treatment is significantly different from co ## After concluding The decision record — the rationale message, the shipped variant, and the experiment's terminal status — becomes the durable artifact. If a follow-up question comes in about why this experiment was shipped, that record is the answer. + +Concluding (or archiving) the experiment does **not** clean up the backing feature flag that `create` auto-provisioned — it's left disabled, not archived. If the team wants the flag gone too, archive it separately via the `manage-feature-flags` skill; don't assume the experiment's terminal state removed it. From 59b642508117fd0fe07da6e553ad7ba0e53c01af Mon Sep 17 00:00:00 2001 From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Date: Tue, 16 Jun 2026 04:13:52 +0000 Subject: [PATCH 10/11] manage-experiment: address review feedback (routing dedupe + glossary concision) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Resolves @zoeabrams-eng's two inline review threads on SKILL.md: - Routing duplication: "Pick the command" (step 3) no longer restates the explicit/implicit/phase-derived rules already in "Canonical commands" — it now references the table and mapping and applies them. - Glossary verbosity: trimmed the Winsorization entry to a one-liner plus a pointer to advanced-features.md (which already holds the full push-back rule), and tightened the Direction / CUPED / Lift / multiple-testing entries. Repointed the two remaining "umbrella glossary's push-back rule" references (design step 6, pitfalls.md) so the rule's canonical home is advanced-features.md; kept pitfalls' mention inline to avoid a reference->reference link chain. Synced across mixpanel-mcp / -eu / -in. Co-Authored-By: Claude Opus 4.8 (1M context) --- .../skills/manage-experiment/SKILL.md | 21 +++++++++---------- .../manage-experiment/commands/design.md | 2 +- .../manage-experiment/references/pitfalls.md | 2 +- .../skills/manage-experiment/SKILL.md | 21 +++++++++---------- .../manage-experiment/commands/design.md | 2 +- .../manage-experiment/references/pitfalls.md | 2 +- .../skills/manage-experiment/SKILL.md | 21 +++++++++---------- .../manage-experiment/commands/design.md | 2 +- .../manage-experiment/references/pitfalls.md | 2 +- 9 files changed, 36 insertions(+), 39 deletions(-) diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md index aacb935..83e3407 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md @@ -73,12 +73,12 @@ Terms all four commands use without redefining. Phase-specific terms (hypothesis - **Primary** — drives the ship decision. Cap at 3; the platform applies multiple-testing correction across primaries when configured. - **Guardrail** — must not regress; a guardrail regression vetoes a ship even when primaries win. - **Secondary** — exploratory / diagnostic only, never decisional, no correction applied. -- **Direction.** Whether bigger is better for a metric (`up`) or smaller is better (`down`). Cancel / error / latency / abandon / refund metrics need `down` set explicitly — leaving the default silently flips polarity at interpretation. -- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict. +- **Direction.** Whether bigger is better (`up`) or smaller is better (`down`). Set `down` explicitly for cancel / error / latency / abandon / refund metrics — the default `up` silently flips polarity at interpretation. +- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign is mechanical (up/down), not by itself a verdict. - **MDE (Minimum Detectable Effect).** The smallest lift the experiment is sized to detect. Set during design, enforced at interpretation. -- **CUPED.** Variance-reduction technique using pre-exposure baseline. Cuts required sample 30–70% (typical, empirical range) when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. -- **Winsorization.** Outlier capping applied pooled across variants. The `percentile` field is the **tail width** to cap on each side (default `5` caps below the 5th and above the 95th — i.e. the 5% tails). The schema rejects `percentile` ≥ 50. **Push back on tail widths above ~20%** — capping more than a fifth of each side discards too much signal; this is the canonical push-back rule the commands inherit. Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. -- **Multiple-testing correction.** Adjusts per-test significance threshold when several primaries or several non-control variants are tested together. Default Benjamini-Hochberg (platform default — verify current); Bonferroni for strict family-wise control. +- **CUPED.** Variance reduction using a pre-exposure baseline; cuts required sample ~30–70% when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. +- **Winsorization.** Outlier capping (pooled across variants) for heavy-tailed continuous metrics; meaningless on Bernoulli. The `percentile` field is the tail width per side (default `5` = 5% tails). The push-back rule (don't cap tails above ~20%) and full guidance live in [references/advanced-features.md](references/advanced-features.md). +- **Multiple-testing correction.** Tightens the per-test threshold when several primaries or non-control variants are tested together. Default Benjamini-Hochberg (verify current); Bonferroni for strict family-wise control. ## Reference files @@ -175,13 +175,12 @@ If the user is starting a new experiment from scratch (no existing experiment to ## 3. Pick the command -Apply these rules in order; the first match wins. +Apply in order, first match wins. The trigger phrases and the state→command mapping both live in the **Canonical commands** section — don't restate them here, just apply them. -1. **Explicit:** user names a phase (`/design`, "launch this experiment", "monitor experiment X", "interpret the results") → use that command. -2. **Implicit:** message matches one canonical trigger phrase from the Components table → use that command. -3. **Phase-derived (only when an experiment was resolved in step 2):** apply the state→command mapping from the **Canonical commands** section. If `DRAFT` doesn't disambiguate between design and launch, ask: "Is the configuration final, or are you still iterating on it?" -4. **Ambiguous verbs** ("audit", "check", "review") — apply phase-derived routing if an experiment is in context, otherwise fall through to the menu. -5. **Otherwise:** show the Command menu, take the user's choice. +1. **Explicit or implicit match** → the matched command (Canonical commands table). +2. **Phase-derived** (an experiment was resolved in step 2 and rules 1–2 didn't decide) → apply the state→command mapping. If `DRAFT` doesn't disambiguate design vs launch, ask: "Is the configuration final, or are you still iterating on it?" +3. **Ambiguous verbs** ("audit", "check", "review") → phase-derived routing if an experiment is in context; otherwise the menu. +4. **Otherwise** → show the Command menu, take the user's choice. ## 4. Load and execute the command diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md index 14f9b13..c22650c 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md @@ -97,7 +97,7 @@ Decision tree, the peeking-trap explanation, worked compounding-FPR numbers, and ### 6. Decide on advanced features - **CUPED** — enable when the primary metric correlates with pre-exposure behaviour AND all experiment users existed before start AND 2–4 weeks of stable pre-exposure history is available. Do not enable on new-user-only experiments, one-time-event metrics, or brand-new metrics. -- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. The tail-width setting defaults to 5 (5% tails); apply the umbrella glossary's Winsorization push-back rule (don't cap tails above ~20%). +- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. The tail-width setting defaults to 5 (5% tails); apply the Winsorization push-back rule (don't cap tails above ~20%). When/why each is right and the common misconfigurations are in [../references/advanced-features.md](../references/advanced-features.md). diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/pitfalls.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/pitfalls.md index 2d6b25a..4804be1 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/pitfalls.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/pitfalls.md @@ -53,7 +53,7 @@ If the team genuinely wants to make that trade, they can disable the guardrail b ### High variance, no Winsorization — warning -**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the default tail width (5% tails). Push back on tail widths above ~20% — see the umbrella glossary's Winsorization push-back rule. +**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the default tail width (5% tails). Push back on tail widths above ~20% — capping more than a fifth of each side discards too much signal. ### Multiple primaries, no correction — warning diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md index aacb935..83e3407 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md @@ -73,12 +73,12 @@ Terms all four commands use without redefining. Phase-specific terms (hypothesis - **Primary** — drives the ship decision. Cap at 3; the platform applies multiple-testing correction across primaries when configured. - **Guardrail** — must not regress; a guardrail regression vetoes a ship even when primaries win. - **Secondary** — exploratory / diagnostic only, never decisional, no correction applied. -- **Direction.** Whether bigger is better for a metric (`up`) or smaller is better (`down`). Cancel / error / latency / abandon / refund metrics need `down` set explicitly — leaving the default silently flips polarity at interpretation. -- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict. +- **Direction.** Whether bigger is better (`up`) or smaller is better (`down`). Set `down` explicitly for cancel / error / latency / abandon / refund metrics — the default `up` silently flips polarity at interpretation. +- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign is mechanical (up/down), not by itself a verdict. - **MDE (Minimum Detectable Effect).** The smallest lift the experiment is sized to detect. Set during design, enforced at interpretation. -- **CUPED.** Variance-reduction technique using pre-exposure baseline. Cuts required sample 30–70% (typical, empirical range) when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. -- **Winsorization.** Outlier capping applied pooled across variants. The `percentile` field is the **tail width** to cap on each side (default `5` caps below the 5th and above the 95th — i.e. the 5% tails). The schema rejects `percentile` ≥ 50. **Push back on tail widths above ~20%** — capping more than a fifth of each side discards too much signal; this is the canonical push-back rule the commands inherit. Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. -- **Multiple-testing correction.** Adjusts per-test significance threshold when several primaries or several non-control variants are tested together. Default Benjamini-Hochberg (platform default — verify current); Bonferroni for strict family-wise control. +- **CUPED.** Variance reduction using a pre-exposure baseline; cuts required sample ~30–70% when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. +- **Winsorization.** Outlier capping (pooled across variants) for heavy-tailed continuous metrics; meaningless on Bernoulli. The `percentile` field is the tail width per side (default `5` = 5% tails). The push-back rule (don't cap tails above ~20%) and full guidance live in [references/advanced-features.md](references/advanced-features.md). +- **Multiple-testing correction.** Tightens the per-test threshold when several primaries or non-control variants are tested together. Default Benjamini-Hochberg (verify current); Bonferroni for strict family-wise control. ## Reference files @@ -175,13 +175,12 @@ If the user is starting a new experiment from scratch (no existing experiment to ## 3. Pick the command -Apply these rules in order; the first match wins. +Apply in order, first match wins. The trigger phrases and the state→command mapping both live in the **Canonical commands** section — don't restate them here, just apply them. -1. **Explicit:** user names a phase (`/design`, "launch this experiment", "monitor experiment X", "interpret the results") → use that command. -2. **Implicit:** message matches one canonical trigger phrase from the Components table → use that command. -3. **Phase-derived (only when an experiment was resolved in step 2):** apply the state→command mapping from the **Canonical commands** section. If `DRAFT` doesn't disambiguate between design and launch, ask: "Is the configuration final, or are you still iterating on it?" -4. **Ambiguous verbs** ("audit", "check", "review") — apply phase-derived routing if an experiment is in context, otherwise fall through to the menu. -5. **Otherwise:** show the Command menu, take the user's choice. +1. **Explicit or implicit match** → the matched command (Canonical commands table). +2. **Phase-derived** (an experiment was resolved in step 2 and rules 1–2 didn't decide) → apply the state→command mapping. If `DRAFT` doesn't disambiguate design vs launch, ask: "Is the configuration final, or are you still iterating on it?" +3. **Ambiguous verbs** ("audit", "check", "review") → phase-derived routing if an experiment is in context; otherwise the menu. +4. **Otherwise** → show the Command menu, take the user's choice. ## 4. Load and execute the command diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md index 14f9b13..c22650c 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md @@ -97,7 +97,7 @@ Decision tree, the peeking-trap explanation, worked compounding-FPR numbers, and ### 6. Decide on advanced features - **CUPED** — enable when the primary metric correlates with pre-exposure behaviour AND all experiment users existed before start AND 2–4 weeks of stable pre-exposure history is available. Do not enable on new-user-only experiments, one-time-event metrics, or brand-new metrics. -- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. The tail-width setting defaults to 5 (5% tails); apply the umbrella glossary's Winsorization push-back rule (don't cap tails above ~20%). +- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. The tail-width setting defaults to 5 (5% tails); apply the Winsorization push-back rule (don't cap tails above ~20%). When/why each is right and the common misconfigurations are in [../references/advanced-features.md](../references/advanced-features.md). diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/pitfalls.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/pitfalls.md index 2d6b25a..4804be1 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/pitfalls.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/pitfalls.md @@ -53,7 +53,7 @@ If the team genuinely wants to make that trade, they can disable the guardrail b ### High variance, no Winsorization — warning -**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the default tail width (5% tails). Push back on tail widths above ~20% — see the umbrella glossary's Winsorization push-back rule. +**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the default tail width (5% tails). Push back on tail widths above ~20% — capping more than a fifth of each side discards too much signal. ### Multiple primaries, no correction — warning diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md index aacb935..83e3407 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md @@ -73,12 +73,12 @@ Terms all four commands use without redefining. Phase-specific terms (hypothesis - **Primary** — drives the ship decision. Cap at 3; the platform applies multiple-testing correction across primaries when configured. - **Guardrail** — must not regress; a guardrail regression vetoes a ship even when primaries win. - **Secondary** — exploratory / diagnostic only, never decisional, no correction applied. -- **Direction.** Whether bigger is better for a metric (`up`) or smaller is better (`down`). Cancel / error / latency / abandon / refund metrics need `down` set explicitly — leaving the default silently flips polarity at interpretation. -- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict. +- **Direction.** Whether bigger is better (`up`) or smaller is better (`down`). Set `down` explicitly for cancel / error / latency / abandon / refund metrics — the default `up` silently flips polarity at interpretation. +- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign is mechanical (up/down), not by itself a verdict. - **MDE (Minimum Detectable Effect).** The smallest lift the experiment is sized to detect. Set during design, enforced at interpretation. -- **CUPED.** Variance-reduction technique using pre-exposure baseline. Cuts required sample 30–70% (typical, empirical range) when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. -- **Winsorization.** Outlier capping applied pooled across variants. The `percentile` field is the **tail width** to cap on each side (default `5` caps below the 5th and above the 95th — i.e. the 5% tails). The schema rejects `percentile` ≥ 50. **Push back on tail widths above ~20%** — capping more than a fifth of each side discards too much signal; this is the canonical push-back rule the commands inherit. Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics. -- **Multiple-testing correction.** Adjusts per-test significance threshold when several primaries or several non-control variants are tested together. Default Benjamini-Hochberg (platform default — verify current); Bonferroni for strict family-wise control. +- **CUPED.** Variance reduction using a pre-exposure baseline; cuts required sample ~30–70% when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. +- **Winsorization.** Outlier capping (pooled across variants) for heavy-tailed continuous metrics; meaningless on Bernoulli. The `percentile` field is the tail width per side (default `5` = 5% tails). The push-back rule (don't cap tails above ~20%) and full guidance live in [references/advanced-features.md](references/advanced-features.md). +- **Multiple-testing correction.** Tightens the per-test threshold when several primaries or non-control variants are tested together. Default Benjamini-Hochberg (verify current); Bonferroni for strict family-wise control. ## Reference files @@ -175,13 +175,12 @@ If the user is starting a new experiment from scratch (no existing experiment to ## 3. Pick the command -Apply these rules in order; the first match wins. +Apply in order, first match wins. The trigger phrases and the state→command mapping both live in the **Canonical commands** section — don't restate them here, just apply them. -1. **Explicit:** user names a phase (`/design`, "launch this experiment", "monitor experiment X", "interpret the results") → use that command. -2. **Implicit:** message matches one canonical trigger phrase from the Components table → use that command. -3. **Phase-derived (only when an experiment was resolved in step 2):** apply the state→command mapping from the **Canonical commands** section. If `DRAFT` doesn't disambiguate between design and launch, ask: "Is the configuration final, or are you still iterating on it?" -4. **Ambiguous verbs** ("audit", "check", "review") — apply phase-derived routing if an experiment is in context, otherwise fall through to the menu. -5. **Otherwise:** show the Command menu, take the user's choice. +1. **Explicit or implicit match** → the matched command (Canonical commands table). +2. **Phase-derived** (an experiment was resolved in step 2 and rules 1–2 didn't decide) → apply the state→command mapping. If `DRAFT` doesn't disambiguate design vs launch, ask: "Is the configuration final, or are you still iterating on it?" +3. **Ambiguous verbs** ("audit", "check", "review") → phase-derived routing if an experiment is in context; otherwise the menu. +4. **Otherwise** → show the Command menu, take the user's choice. ## 4. Load and execute the command diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md index 14f9b13..c22650c 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md @@ -97,7 +97,7 @@ Decision tree, the peeking-trap explanation, worked compounding-FPR numbers, and ### 6. Decide on advanced features - **CUPED** — enable when the primary metric correlates with pre-exposure behaviour AND all experiment users existed before start AND 2–4 weeks of stable pre-exposure history is available. Do not enable on new-user-only experiments, one-time-event metrics, or brand-new metrics. -- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. The tail-width setting defaults to 5 (5% tails); apply the umbrella glossary's Winsorization push-back rule (don't cap tails above ~20%). +- **Winsorization** — enable for heavy-tailed continuous metrics (revenue, time-on-page, session duration). Do not enable on Bernoulli (conversion) metrics. The tail-width setting defaults to 5 (5% tails); apply the Winsorization push-back rule (don't cap tails above ~20%). When/why each is right and the common misconfigurations are in [../references/advanced-features.md](../references/advanced-features.md). diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/pitfalls.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/pitfalls.md index 2d6b25a..4804be1 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/pitfalls.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/pitfalls.md @@ -53,7 +53,7 @@ If the team genuinely wants to make that trade, they can disable the guardrail b ### High variance, no Winsorization — warning -**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the default tail width (5% tails). Push back on tail widths above ~20% — see the umbrella glossary's Winsorization push-back rule. +**At least one continuous-ish metric is configured AND Winsorization is off.** Outliers will inflate variance and widen confidence intervals; a handful of power users can dominate the per-arm mean. Enable Winsorization at the default tail width (5% tails). Push back on tail widths above ~20% — capping more than a fifth of each side discards too much signal. ### Multiple primaries, no correction — warning From e57732f141e8806e2ffefe1c5d3ef6a375f0c95e Mon Sep 17 00:00:00 2001 From: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Date: Tue, 16 Jun 2026 21:23:00 +0000 Subject: [PATCH 11/11] manage-experiment: note metric direction (polarity) is editable post-setup MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Metric direction/polarity is a property of the saved metric and can now be corrected after setup via the metric-update tool, without recreating the metric or experiment. Update the skill to reflect this: - design: setting `down` no longer requires the Mixpanel UI — the metric-update tool works on the draft - glossary, metric-selection, launch readiness, interpret: note the in-place fix; full statement centralized in the Direction glossary entry, others trimmed to brief pointers - use the skill's descriptive "metric-update tool" voice instead of the literal Update-Metric tool ID Co-Authored-By: Claude Opus 4.8 (1M context) --- .../skills/manage-experiment/SKILL.md | 2 +- .../manage-experiment/commands/design.md | 2 +- .../manage-experiment/commands/interpret.md | 2 ++ .../manage-experiment/commands/launch.md | 20 +++++++++---------- .../references/metric-selection.md | 4 ++-- .../skills/manage-experiment/SKILL.md | 2 +- .../manage-experiment/commands/design.md | 2 +- .../manage-experiment/commands/interpret.md | 2 ++ .../manage-experiment/commands/launch.md | 20 +++++++++---------- .../references/metric-selection.md | 4 ++-- .../skills/manage-experiment/SKILL.md | 2 +- .../manage-experiment/commands/design.md | 2 +- .../manage-experiment/commands/interpret.md | 2 ++ .../manage-experiment/commands/launch.md | 20 +++++++++---------- .../references/metric-selection.md | 4 ++-- 15 files changed, 48 insertions(+), 42 deletions(-) diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md index 83e3407..cc34f6a 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/SKILL.md @@ -73,7 +73,7 @@ Terms all four commands use without redefining. Phase-specific terms (hypothesis - **Primary** — drives the ship decision. Cap at 3; the platform applies multiple-testing correction across primaries when configured. - **Guardrail** — must not regress; a guardrail regression vetoes a ship even when primaries win. - **Secondary** — exploratory / diagnostic only, never decisional, no correction applied. -- **Direction.** Whether bigger is better (`up`) or smaller is better (`down`). Set `down` explicitly for cancel / error / latency / abandon / refund metrics — the default `up` silently flips polarity at interpretation. +- **Direction.** Whether bigger is better (`up`) or smaller is better (`down`). Set `down` explicitly for cancel / error / latency / abandon / refund metrics — the default `up` silently flips polarity at interpretation. Direction is a property of the saved metric, so a wrong polarity can be corrected after setup with the metric-update tool — no need to recreate the metric or the experiment. - **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign is mechanical (up/down), not by itself a verdict. - **MDE (Minimum Detectable Effect).** The smallest lift the experiment is sized to detect. Set during design, enforced at interpretation. - **CUPED.** Variance reduction using a pre-exposure baseline; cuts required sample ~30–70% when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md index c22650c..31f452f 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/design.md @@ -71,7 +71,7 @@ The hypothesis names a specific outcome. The primary metric must measure that ou - **Guardrails** (strongly recommended) cover the most likely failure mode of the change — see the guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). - **Secondaries** are diagnostic only. -Every primary and guardrail needs an explicit `direction`. **Caveat — the experiment-create tool's inline metric input does not expose `direction`; every metric it creates defaults to `up`.** For any down-polarity metric (cancel / error / latency / abandon / refund / removed), set `direction` to `down` in the Mixpanel UI after the draft is created, and confirm it before launch — an unset `down` silently flips the polarity verdict at interpretation. The `launch` readiness check re-flags any primary still left at the `up` default. +Every primary and guardrail needs an explicit `direction`. **Caveat — the experiment-create tool's inline metric input does not expose `direction`; every metric it creates defaults to `up`.** For any down-polarity metric (cancel / error / latency / abandon / refund / removed), set `direction` to `down` after the draft is created — via the metric-update tool or the Mixpanel UI — and confirm it before launch; an unset `down` silently flips the polarity verdict at interpretation. The `launch` readiness check re-flags any primary still left at the `up` default. Watch for the **lagging-indicator trap** (30-day retention as primary on a 2-week experiment) and the **changed-denominator trap** (metric defined only over treatment-exposed users — lift is artificially infinite). Full sanity checklist and standard guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/interpret.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/interpret.md index 806729d..6d01b1d 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/interpret.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/interpret.md @@ -33,6 +33,8 @@ Given a row's lift and the metric's direction ("up" = bigger is better, "down" = A positive-bucket row on a "down" metric is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control. +If a metric's direction is plainly wrong at read time (e.g. an error / cancel / latency metric left at the default `up`), the polarity reads inverted. The metric-update tool can correct it in place — fix it and re-read rather than mentally inverting the row. + The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**. ### Data-source fallback diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/launch.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/launch.md index 57e02a3..1c35859 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/launch.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/commands/launch.md @@ -30,16 +30,16 @@ There is no "edit the variants" or "change the statistical model" operation post Run this against the experiment about to launch. Surface only what fires; order blockers → warnings → fyi. -| Severity | Check | -| -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | -| Blocker | Pre-launch pitfall catalogue (insufficient duration, cohort too small) reports a blocker — see [../references/pitfalls.md](../references/pitfalls.md). | -| Blocker | The experiment has no primary metric. | -| Blocker | The configured allocation doesn't sum to 100% across variants. | -| Warning | The pre-launch pitfall catalogue reports a warning. | -| Warning | No guardrail metrics configured. Without guardrails, the regression hard-gate (see umbrella Cross-command policies) cannot protect the ship decision. | -| Warning | A primary metric has `direction` unset (defaults to `up`); cancel / error / latency / abandon / refund metrics need `down` set explicitly. | -| Warning | `srm.enabled` is false or `excludeQA` is unset — both are easily lost to a partial settings edit (see the umbrella's read-merge-write rule); re-confirm before the allocation locks. | -| FYI | The experiment isn't linked back to a prior experiment on the same feature, even though prior experiments exist. Recommend adding the link before launch. | +| Severity | Check | +| -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| Blocker | Pre-launch pitfall catalogue (insufficient duration, cohort too small) reports a blocker — see [../references/pitfalls.md](../references/pitfalls.md). | +| Blocker | The experiment has no primary metric. | +| Blocker | The configured allocation doesn't sum to 100% across variants. | +| Warning | The pre-launch pitfall catalogue reports a warning. | +| Warning | No guardrail metrics configured. Without guardrails, the regression hard-gate (see umbrella Cross-command policies) cannot protect the ship decision. | +| Warning | A primary metric has `direction` unset (defaults to `up`); cancel / error / latency / abandon / refund metrics need `down` set explicitly. Fixable in place via the metric-update tool — no recreate needed. | +| Warning | `srm.enabled` is false or `excludeQA` is unset — both are easily lost to a partial settings edit (see the umbrella's read-merge-write rule); re-confirm before the allocation locks. | +| FYI | The experiment isn't linked back to a prior experiment on the same feature, even though prior experiments exist. Recommend adding the link before launch. | The pitfall catalogue itself lives in [../references/pitfalls.md](../references/pitfalls.md) — don't duplicate the rules here; run them and report results. diff --git a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/metric-selection.md b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/metric-selection.md index 1c79f6d..d320d52 100644 --- a/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/metric-selection.md +++ b/plugins/mixpanel-mcp-eu/skills/manage-experiment/references/metric-selection.md @@ -7,7 +7,7 @@ Each metric serves exactly one of three roles. The hypothesis tells you which. The metrics whose movement decides ship / no-ship. They come straight from the hypothesis's "outcome will ``" clause. - **Cap at 3.** Each additional primary inflates the family-wise false-positive rate. With multiple-testing correction enabled (which is the right default at 2+ primaries), more primaries → tighter per-metric threshold → harder to detect any individual effect. Beyond 3 the math punishes you regardless of how well the test is run. -- **Explicit direction.** Every primary needs `up` or `down`. The platform's default is `up` (verify in product), which is wrong for cancel / error / latency / abandon / refund metrics. Setting it explicitly at setup time is the only way to keep the polarity correct through interpretation. +- **Explicit direction.** Every primary needs `up` or `down`. The platform's default is `up` (verify in product), which is wrong for cancel / error / latency / abandon / refund metrics. Set it explicitly at setup time so the polarity stays correct through interpretation; if it's wrong, the metric-update tool can fix it in place later (direction lives on the saved metric). - **Leading, not lagging.** A primary must be able to actually move within the planned experiment window. Match the metric's response window to the experiment's duration: - Onboarding-screen change → activation in the first session, not Week-4 retention. - Checkout button A/B → checkout conversion, not 30-day LTV. @@ -36,7 +36,7 @@ Standard guardrails by domain — pick at least one from the row that matches th | Trust / safety / moderation | Complaint rate, unsubscribe rate, support-ticket volume | | Time-to-task / search / IA | Task abandonment rate, time-to-completion | -For every guardrail, **set direction explicitly**. A guardrail named "errors" left at the default `up` will silently let regressions slip through interpretation as "wins." +For every guardrail, **set direction explicitly**. A guardrail named "errors" left at the default `up` will silently let regressions slip through interpretation as "wins." A wrong direction is fixable later via the metric-update tool. Same lagging-indicator rule applies: a guardrail that takes 30 days to react can't protect a 2-week experiment. If the user names retention or LTV as a guardrail on a short experiment, recommend a leading proxy (Day-1 or Day-7 retention) and demote the lagging metric to a post-launch monitor. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md index 83e3407..cc34f6a 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/SKILL.md @@ -73,7 +73,7 @@ Terms all four commands use without redefining. Phase-specific terms (hypothesis - **Primary** — drives the ship decision. Cap at 3; the platform applies multiple-testing correction across primaries when configured. - **Guardrail** — must not regress; a guardrail regression vetoes a ship even when primaries win. - **Secondary** — exploratory / diagnostic only, never decisional, no correction applied. -- **Direction.** Whether bigger is better (`up`) or smaller is better (`down`). Set `down` explicitly for cancel / error / latency / abandon / refund metrics — the default `up` silently flips polarity at interpretation. +- **Direction.** Whether bigger is better (`up`) or smaller is better (`down`). Set `down` explicitly for cancel / error / latency / abandon / refund metrics — the default `up` silently flips polarity at interpretation. Direction is a property of the saved metric, so a wrong polarity can be corrected after setup with the metric-update tool — no need to recreate the metric or the experiment. - **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign is mechanical (up/down), not by itself a verdict. - **MDE (Minimum Detectable Effect).** The smallest lift the experiment is sized to detect. Set during design, enforced at interpretation. - **CUPED.** Variance reduction using a pre-exposure baseline; cuts required sample ~30–70% when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md index c22650c..31f452f 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/design.md @@ -71,7 +71,7 @@ The hypothesis names a specific outcome. The primary metric must measure that ou - **Guardrails** (strongly recommended) cover the most likely failure mode of the change — see the guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). - **Secondaries** are diagnostic only. -Every primary and guardrail needs an explicit `direction`. **Caveat — the experiment-create tool's inline metric input does not expose `direction`; every metric it creates defaults to `up`.** For any down-polarity metric (cancel / error / latency / abandon / refund / removed), set `direction` to `down` in the Mixpanel UI after the draft is created, and confirm it before launch — an unset `down` silently flips the polarity verdict at interpretation. The `launch` readiness check re-flags any primary still left at the `up` default. +Every primary and guardrail needs an explicit `direction`. **Caveat — the experiment-create tool's inline metric input does not expose `direction`; every metric it creates defaults to `up`.** For any down-polarity metric (cancel / error / latency / abandon / refund / removed), set `direction` to `down` after the draft is created — via the metric-update tool or the Mixpanel UI — and confirm it before launch; an unset `down` silently flips the polarity verdict at interpretation. The `launch` readiness check re-flags any primary still left at the `up` default. Watch for the **lagging-indicator trap** (30-day retention as primary on a 2-week experiment) and the **changed-denominator trap** (metric defined only over treatment-exposed users — lift is artificially infinite). Full sanity checklist and standard guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/interpret.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/interpret.md index 806729d..6d01b1d 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/interpret.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/interpret.md @@ -33,6 +33,8 @@ Given a row's lift and the metric's direction ("up" = bigger is better, "down" = A positive-bucket row on a "down" metric is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control. +If a metric's direction is plainly wrong at read time (e.g. an error / cancel / latency metric left at the default `up`), the polarity reads inverted. The metric-update tool can correct it in place — fix it and re-read rather than mentally inverting the row. + The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**. ### Data-source fallback diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/launch.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/launch.md index 57e02a3..1c35859 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/launch.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/commands/launch.md @@ -30,16 +30,16 @@ There is no "edit the variants" or "change the statistical model" operation post Run this against the experiment about to launch. Surface only what fires; order blockers → warnings → fyi. -| Severity | Check | -| -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | -| Blocker | Pre-launch pitfall catalogue (insufficient duration, cohort too small) reports a blocker — see [../references/pitfalls.md](../references/pitfalls.md). | -| Blocker | The experiment has no primary metric. | -| Blocker | The configured allocation doesn't sum to 100% across variants. | -| Warning | The pre-launch pitfall catalogue reports a warning. | -| Warning | No guardrail metrics configured. Without guardrails, the regression hard-gate (see umbrella Cross-command policies) cannot protect the ship decision. | -| Warning | A primary metric has `direction` unset (defaults to `up`); cancel / error / latency / abandon / refund metrics need `down` set explicitly. | -| Warning | `srm.enabled` is false or `excludeQA` is unset — both are easily lost to a partial settings edit (see the umbrella's read-merge-write rule); re-confirm before the allocation locks. | -| FYI | The experiment isn't linked back to a prior experiment on the same feature, even though prior experiments exist. Recommend adding the link before launch. | +| Severity | Check | +| -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| Blocker | Pre-launch pitfall catalogue (insufficient duration, cohort too small) reports a blocker — see [../references/pitfalls.md](../references/pitfalls.md). | +| Blocker | The experiment has no primary metric. | +| Blocker | The configured allocation doesn't sum to 100% across variants. | +| Warning | The pre-launch pitfall catalogue reports a warning. | +| Warning | No guardrail metrics configured. Without guardrails, the regression hard-gate (see umbrella Cross-command policies) cannot protect the ship decision. | +| Warning | A primary metric has `direction` unset (defaults to `up`); cancel / error / latency / abandon / refund metrics need `down` set explicitly. Fixable in place via the metric-update tool — no recreate needed. | +| Warning | `srm.enabled` is false or `excludeQA` is unset — both are easily lost to a partial settings edit (see the umbrella's read-merge-write rule); re-confirm before the allocation locks. | +| FYI | The experiment isn't linked back to a prior experiment on the same feature, even though prior experiments exist. Recommend adding the link before launch. | The pitfall catalogue itself lives in [../references/pitfalls.md](../references/pitfalls.md) — don't duplicate the rules here; run them and report results. diff --git a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/metric-selection.md b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/metric-selection.md index 1c79f6d..d320d52 100644 --- a/plugins/mixpanel-mcp-in/skills/manage-experiment/references/metric-selection.md +++ b/plugins/mixpanel-mcp-in/skills/manage-experiment/references/metric-selection.md @@ -7,7 +7,7 @@ Each metric serves exactly one of three roles. The hypothesis tells you which. The metrics whose movement decides ship / no-ship. They come straight from the hypothesis's "outcome will ``" clause. - **Cap at 3.** Each additional primary inflates the family-wise false-positive rate. With multiple-testing correction enabled (which is the right default at 2+ primaries), more primaries → tighter per-metric threshold → harder to detect any individual effect. Beyond 3 the math punishes you regardless of how well the test is run. -- **Explicit direction.** Every primary needs `up` or `down`. The platform's default is `up` (verify in product), which is wrong for cancel / error / latency / abandon / refund metrics. Setting it explicitly at setup time is the only way to keep the polarity correct through interpretation. +- **Explicit direction.** Every primary needs `up` or `down`. The platform's default is `up` (verify in product), which is wrong for cancel / error / latency / abandon / refund metrics. Set it explicitly at setup time so the polarity stays correct through interpretation; if it's wrong, the metric-update tool can fix it in place later (direction lives on the saved metric). - **Leading, not lagging.** A primary must be able to actually move within the planned experiment window. Match the metric's response window to the experiment's duration: - Onboarding-screen change → activation in the first session, not Week-4 retention. - Checkout button A/B → checkout conversion, not 30-day LTV. @@ -36,7 +36,7 @@ Standard guardrails by domain — pick at least one from the row that matches th | Trust / safety / moderation | Complaint rate, unsubscribe rate, support-ticket volume | | Time-to-task / search / IA | Task abandonment rate, time-to-completion | -For every guardrail, **set direction explicitly**. A guardrail named "errors" left at the default `up` will silently let regressions slip through interpretation as "wins." +For every guardrail, **set direction explicitly**. A guardrail named "errors" left at the default `up` will silently let regressions slip through interpretation as "wins." A wrong direction is fixable later via the metric-update tool. Same lagging-indicator rule applies: a guardrail that takes 30 days to react can't protect a 2-week experiment. If the user names retention or LTV as a guardrail on a short experiment, recommend a leading proxy (Day-1 or Day-7 retention) and demote the lagging metric to a post-launch monitor. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md b/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md index 83e3407..cc34f6a 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/SKILL.md @@ -73,7 +73,7 @@ Terms all four commands use without redefining. Phase-specific terms (hypothesis - **Primary** — drives the ship decision. Cap at 3; the platform applies multiple-testing correction across primaries when configured. - **Guardrail** — must not regress; a guardrail regression vetoes a ship even when primaries win. - **Secondary** — exploratory / diagnostic only, never decisional, no correction applied. -- **Direction.** Whether bigger is better (`up`) or smaller is better (`down`). Set `down` explicitly for cancel / error / latency / abandon / refund metrics — the default `up` silently flips polarity at interpretation. +- **Direction.** Whether bigger is better (`up`) or smaller is better (`down`). Set `down` explicitly for cancel / error / latency / abandon / refund metrics — the default `up` silently flips polarity at interpretation. Direction is a property of the saved metric, so a wrong polarity can be corrected after setup with the metric-update tool — no need to recreate the metric or the experiment. - **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign is mechanical (up/down), not by itself a verdict. - **MDE (Minimum Detectable Effect).** The smallest lift the experiment is sized to detect. Set during design, enforced at interpretation. - **CUPED.** Variance reduction using a pre-exposure baseline; cuts required sample ~30–70% when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md b/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md index c22650c..31f452f 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/commands/design.md @@ -71,7 +71,7 @@ The hypothesis names a specific outcome. The primary metric must measure that ou - **Guardrails** (strongly recommended) cover the most likely failure mode of the change — see the guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). - **Secondaries** are diagnostic only. -Every primary and guardrail needs an explicit `direction`. **Caveat — the experiment-create tool's inline metric input does not expose `direction`; every metric it creates defaults to `up`.** For any down-polarity metric (cancel / error / latency / abandon / refund / removed), set `direction` to `down` in the Mixpanel UI after the draft is created, and confirm it before launch — an unset `down` silently flips the polarity verdict at interpretation. The `launch` readiness check re-flags any primary still left at the `up` default. +Every primary and guardrail needs an explicit `direction`. **Caveat — the experiment-create tool's inline metric input does not expose `direction`; every metric it creates defaults to `up`.** For any down-polarity metric (cancel / error / latency / abandon / refund / removed), set `direction` to `down` after the draft is created — via the metric-update tool or the Mixpanel UI — and confirm it before launch; an unset `down` silently flips the polarity verdict at interpretation. The `launch` readiness check re-flags any primary still left at the `up` default. Watch for the **lagging-indicator trap** (30-day retention as primary on a 2-week experiment) and the **changed-denominator trap** (metric defined only over treatment-exposed users — lift is artificially infinite). Full sanity checklist and standard guardrails-by-domain table in [../references/metric-selection.md](../references/metric-selection.md). diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/commands/interpret.md b/plugins/mixpanel-mcp/skills/manage-experiment/commands/interpret.md index 806729d..6d01b1d 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/commands/interpret.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/commands/interpret.md @@ -33,6 +33,8 @@ Given a row's lift and the metric's direction ("up" = bigger is better, "down" = A positive-bucket row on a "down" metric is a **regression**, not a win. Always filter out the control row first — the platform marks which variant is control. +If a metric's direction is plainly wrong at read time (e.g. an error / cancel / latency metric left at the default `up`), the polarity reads inverted. The metric-update tool can correct it in place — fix it and re-read rather than mentally inverting the row. + The platform auto-applies multiple-testing correction when the experiment is configured for Bonferroni or Benjamini-Hochberg — **don't re-correct**. ### Data-source fallback diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/commands/launch.md b/plugins/mixpanel-mcp/skills/manage-experiment/commands/launch.md index 57e02a3..1c35859 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/commands/launch.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/commands/launch.md @@ -30,16 +30,16 @@ There is no "edit the variants" or "change the statistical model" operation post Run this against the experiment about to launch. Surface only what fires; order blockers → warnings → fyi. -| Severity | Check | -| -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | -| Blocker | Pre-launch pitfall catalogue (insufficient duration, cohort too small) reports a blocker — see [../references/pitfalls.md](../references/pitfalls.md). | -| Blocker | The experiment has no primary metric. | -| Blocker | The configured allocation doesn't sum to 100% across variants. | -| Warning | The pre-launch pitfall catalogue reports a warning. | -| Warning | No guardrail metrics configured. Without guardrails, the regression hard-gate (see umbrella Cross-command policies) cannot protect the ship decision. | -| Warning | A primary metric has `direction` unset (defaults to `up`); cancel / error / latency / abandon / refund metrics need `down` set explicitly. | -| Warning | `srm.enabled` is false or `excludeQA` is unset — both are easily lost to a partial settings edit (see the umbrella's read-merge-write rule); re-confirm before the allocation locks. | -| FYI | The experiment isn't linked back to a prior experiment on the same feature, even though prior experiments exist. Recommend adding the link before launch. | +| Severity | Check | +| -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| Blocker | Pre-launch pitfall catalogue (insufficient duration, cohort too small) reports a blocker — see [../references/pitfalls.md](../references/pitfalls.md). | +| Blocker | The experiment has no primary metric. | +| Blocker | The configured allocation doesn't sum to 100% across variants. | +| Warning | The pre-launch pitfall catalogue reports a warning. | +| Warning | No guardrail metrics configured. Without guardrails, the regression hard-gate (see umbrella Cross-command policies) cannot protect the ship decision. | +| Warning | A primary metric has `direction` unset (defaults to `up`); cancel / error / latency / abandon / refund metrics need `down` set explicitly. Fixable in place via the metric-update tool — no recreate needed. | +| Warning | `srm.enabled` is false or `excludeQA` is unset — both are easily lost to a partial settings edit (see the umbrella's read-merge-write rule); re-confirm before the allocation locks. | +| FYI | The experiment isn't linked back to a prior experiment on the same feature, even though prior experiments exist. Recommend adding the link before launch. | The pitfall catalogue itself lives in [../references/pitfalls.md](../references/pitfalls.md) — don't duplicate the rules here; run them and report results. diff --git a/plugins/mixpanel-mcp/skills/manage-experiment/references/metric-selection.md b/plugins/mixpanel-mcp/skills/manage-experiment/references/metric-selection.md index 1c79f6d..d320d52 100644 --- a/plugins/mixpanel-mcp/skills/manage-experiment/references/metric-selection.md +++ b/plugins/mixpanel-mcp/skills/manage-experiment/references/metric-selection.md @@ -7,7 +7,7 @@ Each metric serves exactly one of three roles. The hypothesis tells you which. The metrics whose movement decides ship / no-ship. They come straight from the hypothesis's "outcome will ``" clause. - **Cap at 3.** Each additional primary inflates the family-wise false-positive rate. With multiple-testing correction enabled (which is the right default at 2+ primaries), more primaries → tighter per-metric threshold → harder to detect any individual effect. Beyond 3 the math punishes you regardless of how well the test is run. -- **Explicit direction.** Every primary needs `up` or `down`. The platform's default is `up` (verify in product), which is wrong for cancel / error / latency / abandon / refund metrics. Setting it explicitly at setup time is the only way to keep the polarity correct through interpretation. +- **Explicit direction.** Every primary needs `up` or `down`. The platform's default is `up` (verify in product), which is wrong for cancel / error / latency / abandon / refund metrics. Set it explicitly at setup time so the polarity stays correct through interpretation; if it's wrong, the metric-update tool can fix it in place later (direction lives on the saved metric). - **Leading, not lagging.** A primary must be able to actually move within the planned experiment window. Match the metric's response window to the experiment's duration: - Onboarding-screen change → activation in the first session, not Week-4 retention. - Checkout button A/B → checkout conversion, not 30-day LTV. @@ -36,7 +36,7 @@ Standard guardrails by domain — pick at least one from the row that matches th | Trust / safety / moderation | Complaint rate, unsubscribe rate, support-ticket volume | | Time-to-task / search / IA | Task abandonment rate, time-to-completion | -For every guardrail, **set direction explicitly**. A guardrail named "errors" left at the default `up` will silently let regressions slip through interpretation as "wins." +For every guardrail, **set direction explicitly**. A guardrail named "errors" left at the default `up` will silently let regressions slip through interpretation as "wins." A wrong direction is fixable later via the metric-update tool. Same lagging-indicator rule applies: a guardrail that takes 30 days to react can't protect a 2-week experiment. If the user names retention or LTV as a guardrail on a short experiment, recommend a leading proxy (Day-1 or Day-7 retention) and demote the lagging metric to a post-launch monitor.