New skill: manage-experiment (umbrella over design + interpret) by elliotrfeinberg · Pull Request #30 · mixpanel/ai-plugins

elliotrfeinberg · 2026-06-09T23:53:48Z

Summary

Combines design-experiment (PR #24) and interpret-experiment (PR #23) into a single manage-experiment umbrella skill with subcommands, modeled on manage-lexicon. One front door for any experiment-related phrase; phase-derived disambiguation handles borderline phrasings like "audit my experiment" by checking experiment state (DRAFT → design; ACTIVE/CONCLUDED → interpret).

Why an umbrella

Single trigger surface — agent doesn't disambiguate between two skills on every experiment turn.
Cross-references between design and interpret become intra-skill (stable), not cross-skill (coupled releases).
Reserves room to add launch and monitor commands later without another rename.

What's in here

manage-experiment/
├── SKILL.md              umbrella: shared glossary, command menu, project + experiment resolution
├── commands/
│   ├── design.md         former design-experiment SKILL.md body
│   ├── launch.md         pre-launch readiness check + irreversible launch
│   ├── monitor.md        mid-flight safety: SRM, pace, guardrails, peeking discipline
│   └── interpret.md      former interpret-experiment SKILL.md body
└── references/           15 reference files, flat (no name collisions)

Command files preserve their original prose verbatim except for:

Removing the now-redundant frontmatter and "You are helping…" opener.
Lifting shared glossary terms (Variant, Primary/Guardrail/Secondary, Direction, Lift, MDE, CUPED, Winsorization, Multiple-testing correction) up to the umbrella.
Adjusting reference paths to ../references/.

Reference files unchanged except for two intra-skill link updates in why-no-statsig.md (was pointing at design-experiment skill; now points at sizing.md).

Sequencing

This PR is opened before PR #23, #24, and #25 merge. The plan:

Land PR Add experiment-results skill #23 (interpret-experiment).
Land PR Add experiment-setup skill #24 (design-experiment).
Land PR Add feature-flags skill #25 (manage-feature-flags).
Rebase this PR against master — the rebase will need to delete the now-existing design-experiment/ and interpret-experiment/ directories.
Sweep PR Add feature-flags skill #25's cross-references (design-experiment / interpret-experiment → manage-experiment) in a follow-up commit or separate small PR.

Opening as a draft so reviewers can pre-evaluate the umbrella pattern before that sequencing plays out.

Test plan

Live CRUD QA — executed 2026-06-16 against the Sandbox project (4032515) via the Mixpanel MCP

Ran the full experiment lifecycle the skill's four commands drive, end-to-end against live tools. All passed:

CREATE (design) — Create-Experiment produced a DRAFT and auto-provisioned a 50/50 control/treatment backing feature flag.
READ (interpret/monitor) — List-Experiments surfaced it; Get-Experiment round-tripped every setting; Get-Experiment with compute_exposures/compute_metrics returned the live results shape and degraded gracefully on a no-traffic project (live_results_errors: ["…No exposure data…"], no crash).
UPDATE — config edit — Update-Experiment renamed, bumped sampleSize, and added a secondary metric.
UPDATE — lifecycle transitions — launch (DRAFT→ACTIVE) and conclude (ACTIVE→CONCLUDED) both succeeded with correct start_date/end_date stamping.
DELETE (archive) — Update-Experiment action=archive soft-deleted it (excluded from the default list); cleaned up the orphaned backing flag via Update-Feature-Flag.
Supporting tools — Get-Experiment-Setup-Guidance and Run-Experiment-Pre-Launch-Checks work in-flow (the latter correctly flagged an underpowered blocker: 8k configured vs 57.6k/arm required for a 5% MDE at a 10% baseline).

QA surfaced findings worth addressing — see the QA-findings comment on this PR. Headline: Update-Experiment.settings is a full replace, not a merge — a partial settings edit silently nulled srm.enabled and excludeQA, and the skill never warns the agent to re-send the full settings object.

Skill QA — validated

make sync-skills — mixpanel-mcp / -eu / -in byte-identical (make check-skills-sync → "Skills are in sync")
/review-skill manage-experiment ≥ 90% — 94% (full run on e212dcc; the routing/glossary refinements since address review feedback and don't add content)
All intra-skill links resolve — all 15 references/ link targets exist, no broken ../references/... paths, and no reference→reference link chains (the "one level deep" rule holds)
Trigger surface: all four phrasings route correctly ("set up an A/B test" → design; "how did experiment X do", "why isn't this significant yet", "should we ship" → interpret) — desk-checked against the Canonical-commands triggers
Phase-derived routing — desk-checked the full state→command mapping: DRAFT → design (or launch when ready), ACTIVE mid-flight → monitor, ACTIVE past planned end / CONCLUDED → interpret. (These states were also exercised live in the CRUD QA above.)

Skill QA — pending

Cross-skill links to manage-feature-flags — the by-name references are well-formed, but the sibling skill isn't on this branch yet (lands with PR Add feature-flags skill #25); validate after the rebase

🤖 Generated with Claude Code

Combines the soon-to-land `design-experiment` (PR #24) and `interpret-experiment` (PR #23) skills into a single `manage-experiment` umbrella, modeled on `manage-lexicon`. One front door for any experiment-related phrase; subcommands for the two phases. ## Why Today an agent has to disambiguate between two skills on every experiment turn, and borderline phrases ("audit my experiment", "check on experiment X") route inconsistently. Folding both into one umbrella with `design` and `interpret` subcommands: - Gives a single trigger surface for the agent to route to. - Lets ambiguous phrases use phase-derived disambiguation (DRAFT → design; ACTIVE/CONCLUDED → interpret). - Removes the cross-skill reference coupling that currently requires coordinated edits across two PRs whenever either skill renames. - Reserves room to add `launch` and `monitor` commands later without another rename. ## Shape ``` manage-experiment/ ├── SKILL.md (umbrella: shared glossary, command menu, project + experiment resolution) ├── commands/ │ ├── design.md (former design-experiment SKILL.md body) │ └── interpret.md (former interpret-experiment SKILL.md body) └── references/ (15 references, flat; no name collisions) ``` The two command files preserve their original prose verbatim except for: - Removing the now-redundant frontmatter and opening "You are helping…" paragraph. - Lifting umbrella-able glossary terms (Variant, Primary/Guardrail/ Secondary, Direction, Lift, MDE, CUPED, Winsorization, Multiple- testing correction) up to the umbrella; keeping phase-specific terms (Hypothesis, Power, Underpowered, Sequential vs Frequentist for design; Polarity, Significance, SRM, Retro A/A, Twyman's Law, Trustworthiness gate for interpret) in their command files. - Adjusting reference paths to `../references/`. The 15 reference files are unchanged except for two intra-skill links in `why-no-statsig.md` that previously pointed at the `design-experiment` skill — now point directly at `sizing.md` (where the formulas live). The 3 references to `manage-feature-flags` (in SKILL.md, design.md, and routing-xp-vs-ff.md) are correct cross-skill pointers and unchanged. ## Sequencing This PR is opened **before** PR #23 and PR #24 merge. Once those land, this PR will need a rebase against master that deletes the now-existing `design-experiment/` and `interpret-experiment/` directories. Per the plan, doing the umbrella as a separate PR keeps the review focused on the routing pattern rather than tangled into either skill's prose review. The `manage-feature-flags` skill (PR #25) cross-references both old skill names; once both PRs land, those cross-refs will need a sweep to point at `manage-experiment`. README index updated: one `manage-experiment` row replaces the two prospective `design-experiment` and `interpret-experiment` rows. Synced to mixpanel-mcp-eu and mixpanel-mcp-in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ugh, dedupe match logic Applies the 3 issues surfaced by the /review-skill pass on manage-experiment: - D4 (Centralised logic): the umbrella SKILL.md had no map of what references/ contains. Add a "Reference files" table to SKILL.md listing each file, which command uses it, and its purpose. Mirrors the shape used by manage-feature-flags. - D7 (Qualified recommendations): the phase-derived rule in step 3 ("DRAFT → design, ACTIVE/CONCLUDED → interpret") read as if it always applied, but it only does when an experiment is in context. Restate step 3 as ordered rules with explicit fall-through, and add a row for ambiguous verbs like "audit" / "check" / "review". - D6 (Human-friendly identifiers): the "try ID first, then case-insensitive name match" rule was stated in three places (umbrella step 2, design step 8, interpret step 1). Drop the duplicates from the command files — interpret hands back to the umbrella, design retains only the "ask if not named" guidance (the lookup mechanics don't apply to free-text feature names). Synced to mixpanel-mcp-eu and mixpanel-mcp-in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… not cap-point (95) PR #30 review caught an agent-actionable accuracy bug. The skill docs described Winsorization as "default 95 (cap top/bottom 5%)" — natural phrasing for a human, but the actual `Update-Experiment` tool schema defines `WinsorizationSettingsInput.percentile` as the **tail width to cap on each side**, with default `5` and `exclusiveMaximum: 50`. An agent reading the docs would pass `95`, get rejected by validation, or misinterpret as inverted Winsorization. Updated 6 files to match the schema's convention: - SKILL.md shared glossary - commands/design.md step 6 (advanced features) - references/advanced-features.md "What it does", "How to enable", "Percentile guidance" (default flipped from "95th" to `percentile=5`, "push back below ~80" flipped to "push back above ~20") - references/health-check-interpretation.md "Extreme winsorization percentile" misconfiguration check - references/per-metric-interpretation.md "Variance-reduction & outlier settings that change interpretation" - references/pitfalls.md "High variance, no Winsorization" warning The numeric guidance ("push back if more than 20% of each tail is capped") is preserved; only the convention name flipped to match what the agent actually sends through the tool. Synced to mixpanel-mcp-eu and mixpanel-mcp-in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Extends the manage-experiment umbrella (PR #30) from 2 commands (design, interpret) to 4 commands covering the full experiment lifecycle. Stacks on top of PR #30 so the umbrella structure and this extension can be reviewed separately. ## New commands ### commands/launch.md (~125 lines) Owns the irreversible DRAFT → ACTIVE transition. Runs the final launch-readiness check (lifted from design step 7 — that step now catches design-time issues only) and performs the launch with explicit CONFIRM. Includes the post-launch handoff: recommend a tracking dashboard + a calendar reminder for monitor. ### commands/monitor.md (~115 lines) Mid-flight safety checks. Answers "is it safe to keep this running?" which is distinct from interpret's "did it work?" Covers: - The peek-safety table: what's safe to look at mid-flight (SRM, sample pace, guardrails, Sequential primaries) and what isn't (Frequentist primaries — the peeking trap). - The terminate-early decision rules: trustworthiness failure, guardrail regression beyond tolerance, Sequential boundary crossed. - An explicit don't-peek-on-Frequentist pushback for users who ask to see primary results mid-flight. ## Knock-on changes - umbrella SKILL.md: 4-row commands table, 5-option menu, phase-derived routing extended to DRAFT-design vs DRAFT-launch and ACTIVE-monitor vs ACTIVE-interpret. Reference table "Used by" columns updated (pitfalls → design+launch, health-check → monitor+interpret, sizing → design+monitor+interpret). - commands/design.md step 7: reframed as "design-time sanity check" rather than "the pre-launch check". The full launch-readiness gate now lives in launch.md. Step 8 saves a DRAFT (reversible) rather than creating-and-launching. Added step 9: hand off to launch. - design.md opening line: updated to say "stops at DRAFT" instead of "don't create until confirmed". Synced to mixpanel-mcp-eu and mixpanel-mcp-in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add launch and monitor commands to manage-experiment Extends the manage-experiment umbrella (PR #30) from 2 commands (design, interpret) to 4 commands covering the full experiment lifecycle. Stacks on top of PR #30 so the umbrella structure and this extension can be reviewed separately. ## New commands ### commands/launch.md (~125 lines) Owns the irreversible DRAFT → ACTIVE transition. Runs the final launch-readiness check (lifted from design step 7 — that step now catches design-time issues only) and performs the launch with explicit CONFIRM. Includes the post-launch handoff: recommend a tracking dashboard + a calendar reminder for monitor. ### commands/monitor.md (~115 lines) Mid-flight safety checks. Answers "is it safe to keep this running?" which is distinct from interpret's "did it work?" Covers: - The peek-safety table: what's safe to look at mid-flight (SRM, sample pace, guardrails, Sequential primaries) and what isn't (Frequentist primaries — the peeking trap). - The terminate-early decision rules: trustworthiness failure, guardrail regression beyond tolerance, Sequential boundary crossed. - An explicit don't-peek-on-Frequentist pushback for users who ask to see primary results mid-flight. ## Knock-on changes - umbrella SKILL.md: 4-row commands table, 5-option menu, phase-derived routing extended to DRAFT-design vs DRAFT-launch and ACTIVE-monitor vs ACTIVE-interpret. Reference table "Used by" columns updated (pitfalls → design+launch, health-check → monitor+interpret, sizing → design+monitor+interpret). - commands/design.md step 7: reframed as "design-time sanity check" rather than "the pre-launch check". The full launch-readiness gate now lives in launch.md. Step 8 saves a DRAFT (reversible) rather than creating-and-launching. Added step 9: hand off to launch. - design.md opening line: updated to say "stops at DRAFT" instead of "don't create until confirmed". Synced to mixpanel-mcp-eu and mixpanel-mcp-in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Address review issues: lift cross-command policies to umbrella, unify emoji, hedge cross-skill refs Applies the 5 issues surfaced by the /review-skill pass on the 4-command manage-experiment: ## Major (D4) The 5% guardrail hard-gate and the peek-safety table were duplicated across design / launch / monitor — three places to keep load-bearing thresholds in sync. - Add a "Cross-command policies" section to the umbrella SKILL.md that owns the guardrail hard-gate threshold and the peek-safety table. Also moves the emoji convention there so all commands share one vocabulary. - design.md "The >5% guardrail hard-gate" → "Guardrails (the hard-gate enforced downstream)" — keeps the design-time guidance, defers the threshold to the umbrella. - design.md step 5: "peeking is safe by design" → "peek-safety table in the umbrella's Cross-command policies covers why". - launch.md readiness checklist: ">5% hard-gate from the design command" → "regression hard-gate (see umbrella Cross-command policies)". - monitor.md Components: drop the entire "What's safe to peek at" table, reference the umbrella. Glossary defers "peeking trap" to the umbrella. Terminate-early rules cite the umbrella threshold. ## Minor (D3) design.md output-style section had a duplicate hand-off line restating what step 9 already covers — removed. ## Minor (D7) launch.md step 5 dashboard recommendation was unqualified — added "recommend running create-dashboard in a follow-up session" so the agent doesn't try to invoke it inline and interrupt the launch flow. ## Suggestion (D3) launch.md irreversibility section mentioned "pause/resume via the underlying feature flag" without naming the skill — added explicit "handled by the manage-feature-flags skill" cross-ref. ## Emoji convention (D6) Declared once in the umbrella's Cross-command policies. Existing emoji usage across design/launch/monitor already matches; interpret has no symbolic checklist output (its output is prose verdict + per- metric breakdown), which is correct. ## Knock-on numbers Skill core: 692 → 720 lines (umbrella +38 with the new policies section, monitor -10 from removed table, others flat). Synced to mixpanel-mcp-eu and mixpanel-mcp-in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * PR #31 polish: define peeking trap in umbrella, add calendar-reminder example /review-document feedback on PR #31: - D6: monitor.md deferred the "peeking trap" definition to the umbrella's Cross-command policies, but the umbrella's peek-safety section only showed consequences and never defined the term. Added a one-line definition before the table. - D8: launch.md step 5 calendar-reminder recommendation was abstract. Added concrete phrasing the agent can hand to the user so it doesn't improvise the wording inconsistently across sessions. Synced to eu/in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Port the three pieces from the analytics monorepo's statsig-diagnosis guidance (tal-multi-546) that PR #30's why-no-statsig.md lacked, then reconcile them with this repo's skill quality rubric. Port: - sizing.md: relative, unit-safe achievable-MDE formulas (MDE_relative, n_required) with the sigma=sqrt(sigma^2) caution, framed as explaining the NO verdict rather than overriding it. - why-no-statsig.md: a "pull the diagnostic inputs first" step and a priority-ordered "name the single most-likely blocker" table so the answer is one blocker with numbers, not a generic option list. Rubric alignment (review-skill: 60% capped -> ~98%): - Reframe the diagnostic-inputs step to intent (no named tool call or response field paths) so it stays engine-agnostic, lifting the cap. - Fix advanced-features.md winsorization percentile contradiction (tail-width 5 convention vs stale 95/90 boundary references). - Fix dead/stale cross-references in per-metric-interpretation.md and routing-xp-vs-ff.md to point at sections/commands that exist. - Add a Contents block to every 100+ line reference. - Collapse the state->command mapping duplicated 3x in SKILL.md (now 200). Synced byte-identical across mixpanel-mcp / -eu / -in. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…g tool The Search-Prior-Experiments engine tool (analytics #95682) ranks stored experiments by overlap on the draft's metric set, flag key, and hypothesis rather than free-text keyword matching. Update the design command and the prior-experiments reference so they steer the agent to hand the tool the draft it's building instead of constructing single-keyword queries. Engine-agnostic wording — describes the signals, not tool names or field paths. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

linear-code · 2026-06-11T21:52:43Z

MULTI-546

The description was 1326 chars (over the Agent Skills 1024 limit) because it enumerated ~15 verbatim trigger phrasings. Trim to a representative handful and fold Bonferroni/Benjamini-Hochberg into "multiple-testing correction"; the Canonical commands table in the body already carries the full keyword set. Now 921 chars. Synced across all three plugin variants. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

zoeabrams-eng · 2026-06-15T14:48:04Z

I'm taking over the review from @gonzalo. @elliotrfeinberg did you do the steps listed in the test plan? it says /review-skill manage-experiment ≥ 90% in the test plan and I'm getting 70% when I run review-skill?

…ences Fixes from the review-skill rubric pass and the per-reference review: - DRY/portability: de-link all reference->reference chains so references stay one level deep; strip tool-docs leakage (significance enum, summary.* buckets, liftConfidence, decide constants) to intent-level. - Centralise the Winsorization >~20% push-back rule in the SKILL.md glossary; centralise the polarity recipe in the interpret command; both now referenced rather than restated. - Source defaults: qualify the Benjamini-Hochberg platform default, CUPED 30-70% range, and the `up` direction default with "verify". - Accuracy fixes in references: correct the family-wise FPR claim (3-arm -> 4-arm), drop in-file Winsorization duplication, fix the "constants"->"choices" dangling referent in lifecycle-handoff, soften absolutist hypothesis-framing phrasing, de-changelog the segment-breakdown platform-status note. Applied identically across the mixpanel-mcp, -eu, and -in variants. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

elliotrfeinberg · 2026-06-16T01:12:07Z

@zoeabrams-eng thanks for taking this over.

You're right that a cold run was landing under the bar — and the bigger issue is that review-skill is model-judged, so the score drifts run-to-run (we got 85% on our first pass; you got 70%). Rather than wave that off as variance, we re-ran it, pulled the concrete findings, and fixed them.

Latest review-skill manage-experiment after the fixes (pushed in e212dcc): 94%.

#	Dimension	Score	Weight
D1	Spec Compliance	100%	10%
D2	Description & Triggers	100%	15%
D3	Structure & Readability	86%	15%
D4	DRY & Portability	88%	20%
D5	Safety & Data Integrity	100%	10%
D6	UX & Communication	100%	10%
D7	Domain Expertise	100%	10%
D8	Content Quality	88%	10%
	Overall	94%

The two Major FAILs that were dragging the first run down:

Tool-docs leakage — the references restated MCP response/argument shapes (significance enum, summary.* buckets, liftConfidence, the __no_variant_shipped__/__defer_variant_decision__ decide constants). Rewritten to intent-level.
References not one level deep — reference files linked onward to other reference files. De-linked; cross-references now route through the command "Going deeper" tables.

Also folded in: centralised the Winsorization push-back rule and the polarity recipe (were duplicated), sourced the Benjamini-Hochberg / CUPED / direction defaults with "verify" qualifiers, and fixed a real arithmetic slip in statistical-model.md (a family-wise-error claim said "3-arm" where the math is a 4-arm experiment). We also ran a per-file pass over all 15 references/ docs (avg ~96%; the math-heavy ones verified cell-by-cell).

All changes are applied byte-identically across the mixpanel-mcp, -eu, and -in variants.

One honest caveat: the one check still not at PASS is "references are one level deep" — our reference files still point at each other in prose ("see the sizing reference"). We removed the clickable cross-links; the prose pointers remain because they genuinely aid navigation and the rubric's hard rule targets link-chains. Happy to strip those entirely if you'd prefer a clean pass on that check.

If you still see a low score after pulling e212dcc, paste your dimension breakdown and we'll reconcile — run-to-run drift on the judged checks (D3/D4/D8) is the most likely culprit.

Assisted by Claude

zoeabrams-eng · 2026-06-16T02:00:15Z

Great! Will take another look now. And, could you update the test plan with the items completed and their latest results? The unchecked boxes made it a hard for me to tell what is already validated.

elliotrfeinberg · 2026-06-16T02:02:59Z

Great! Will take another look now. And, could you update the test plan with the items completed and their latest results? The unchecked boxes made it a hard for me to tell what is already validated.

Will rerun the test plan, haven't run it since I make the latest round of reference changes

zoeabrams-eng · 2026-06-16T02:08:33Z

+
+1. **Explicit:** user names a phase (`/design`, "launch this experiment", "monitor experiment X", "interpret the results") → use that command.


These are repetitive with "Canonical Commands". Recommend only describing the explicit and implicit routing in one place.

Done in 59b6425. "Pick the command" no longer restates the explicit/implicit/phase-derived rules — it now points at the Canonical commands table and state→command mapping and just applies them in order. The trigger phrases and the mapping live in one place (Canonical commands); the procedure lives in one place (this step).

Assisted by Claude

elliotrfeinberg · 2026-06-16T02:20:13Z

QA: experiment CRUD via `manage-experiment` (live run)

We exercised the full experiment lifecycle the skill drives, end-to-end against the Sandbox project (4032515) through the Mixpanel MCP. The CRUD path itself is solid — create → list/get → edit → launch → conclude → archive all worked, and the interpret read path degrades gracefully on a no-traffic project. Findings below, by severity.

🔴 High — `Update-Experiment.settings` is a full replace, not a merge

Editing one field (we changed sampleSize) without re-sending the rest of settings silently nulled srm.enabled and excludeQA (also controlKey). Confirmed on a follow-up Get-Experiment, not just response trimming.

This is dangerous in the skill's own flow: design step 8 invites the user to "keep iterating" on the draft, and launch summarizes settings before going live. An agent that makes a one-field edit between design and launch will silently drop SRM — Kohavi's #1 trustworthiness check, which the interpret command then relies on. The skill never tells the agent to read-merge-write settings.

Suggested fix: add a rule to design/launch (or the umbrella behaviour rules): when updating an existing experiment's settings, fetch current settings first and re-send the full object — never send a partial settings payload. Re-verify srm/excludeQA in the launch readiness summary.

🟠 Medium — metric `direction` can't be set through the MCP

The inline metric input on Create-Experiment/Update-Experiment exposes only metricType / eventName / aggregationMethod — no direction. Every metric we created defaulted to direction: "up", including a Product Removed guardrail where "up" is the regression. Yet the shared glossary, design step 3, the launch readiness check, and the server-side setup guidance all stress setting direction explicitly for down-polarity metrics — and the interpret polarity recipe is load-bearing on it.

The skill instructs behavior the tool can't perform via inline creation. Suggested fix: have design note that direction for down-metrics must be set on a saved metric / in the UI (and flag the tool gap upstream), so the polarity bug the skill warns about isn't baked in at creation.

🟡 Low

Winsorization percentile — skill is right, server guidance is wrong. The skill (umbrella glossary + design) correctly treats percentile as tail-width (default 5, schema rejects ≥50, push back >20%), matching the Create-Experiment schema. The server-side Get-Experiment-Setup-Guidance still says "default 95, never < 80," which is impossible against that schema. No PR change needed — worth fixing the server guidance so the two stop contradicting.
Archive doesn't cascade to the backing flag. Archiving the experiment left its auto-created feature flag behind (disabled, not archived) — we archived it manually. A future delete/cleanup path could handle the flag too.
List-Experiments on experiments-disabled projects returns a generic "Unexpected error" (seen on the E-Commerce and AI demo projects). The umbrella's step-1 error handling only covers the tool-not-found case; a generic list failure would currently fall through unhandled.

Net: ship-worthy CRUD, but the settings-merge gap (High) is worth a skill-side guardrail before this lands.

Assisted by Claude

zoeabrams-eng · 2026-06-16T02:49:27Z

+- **Lift.** `(treatment_mean − control_mean) / control_mean`. The sign of lift is mechanical (up/down); it is not by itself a verdict.
+- **MDE (Minimum Detectable Effect).** The smallest lift the experiment is sized to detect. Set during design, enforced at interpretation.
+- **CUPED.** Variance-reduction technique using pre-exposure baseline. Cuts required sample 30–70% (typical, empirical range) when the metric correlates with pre-exposure behaviour. Inert on new-user-only cohorts.
+- **Winsorization.** Outlier capping applied pooled across variants. The `percentile` field is the **tail width** to cap on each side (default `5` caps below the 5th and above the 95th — i.e. the 5% tails). The schema rejects `percentile` ≥ 50. **Push back on tail widths above ~20%** — capping more than a fifth of each side discards too much signal; this is the canonical push-back rule the commands inherit. Cuts variance on heavy-tailed continuous metrics; meaningless on Bernoulli metrics.


The content after the first sentence seems like a better fit for a reference file.

And generally other definitions in the glossary could be more concise.

Done in 59b6425. The Winsorization entry is now one line (definition + tail-width default) with the push-back rule and full guidance pointed to advanced-features.md, which already held them. Also tightened Direction, CUPED, Lift, and the multiple-testing entries for concision. The full mechanics that were inline now live only in the reference.

Assisted by Claude

Live CRUD QA against the Mixpanel MCP surfaced gaps where the skill instructs behavior the tools handle differently than assumed: - Settings full-replace footgun (High): the experiment-update tool's `settings` payload is a full replace, so a one-field edit silently nulls `srm`, `excludeQA`, etc. Added a read-merge-write Behaviour rule to SKILL.md, a pointer in design step 8, and a launch-readiness warning row for dropped `srm` / `excludeQA`. - Metric `direction` not settable via inline metric creation (Medium): metrics default to `up`. Added a caveat in design step 3 to set `down` in the UI for down-polarity metrics and verify before launch. - Archive doesn't cascade to the backing flag (Low): noted in lifecycle-handoff that the auto-created flag is left behind. - Generic experiment-listing errors (Low): Behaviour rule 2 now covers the experiments-not-enabled failure case. Synced across mixpanel-mcp / -eu / -in. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

elliotrfeinberg · 2026-06-16T03:52:39Z

Addressed the skill-side findings in c32867e:

High (settings full-replace): added a read-merge-write Behaviour rule to SKILL.md, a pointer in design step 8, and a launch-readiness warning row that re-checks srm / excludeQA before the allocation locks.
Medium (direction not settable inline): design step 3 now tells the agent to set down in the UI for down-polarity metrics and verify before launch; the existing launch warning re-flags any primary left at the up default.
Low: lifecycle-handoff notes the backing flag isn't cleaned up by concluding/archiving; Behaviour rule 2 now covers experiment-listing errors on projects where experiments aren't enabled.

The winsorization percentile contradiction is left untouched here — the skill is already correct against the schema; that fix belongs in the server-side Get-Experiment-Setup-Guidance. Synced across mixpanel-mcp / -eu / -in.

Assisted by Claude

@zoeabrams-eng

… concision) Resolves @zoeabrams-eng's two inline review threads on SKILL.md: - Routing duplication: "Pick the command" (step 3) no longer restates the explicit/implicit/phase-derived rules already in "Canonical commands" — it now references the table and mapping and applies them. - Glossary verbosity: trimmed the Winsorization entry to a one-liner plus a pointer to advanced-features.md (which already holds the full push-back rule), and tightened the Direction / CUPED / Lift / multiple-testing entries. Repointed the two remaining "umbrella glossary's push-back rule" references (design step 6, pitfalls.md) so the rule's canonical home is advanced-features.md; kept pitfalls' mention inline to avoid a reference->reference link chain. Synced across mixpanel-mcp / -eu / -in. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

elliotrfeinberg · 2026-06-16T04:26:21Z

Ran the two pending routing checks (desk-checks against the Canonical-commands triggers + state→command mapping; the skill isn't installable from a branch, so this is the manual check the test plan called for). Both pass — test plan updated.

Trigger surface — all four route correctly:

Phrase	Routes to
"set up an A/B test"	`design` (matches `set up`)
"how did experiment X do"	`interpret` (read-results)
"why isn't this significant yet"	`interpret` (statsig / why-no-statsig)
"should we ship"	`interpret` (matches `ship`)

Phase-derived — the full mapping is consistent (and we exercised these states live in the CRUD QA): DRAFT → design (or launch when the config is final), ACTIVE mid-flight → monitor, ACTIVE past planned end / CONCLUDED → interpret.

One thing the check caught: the old test-plan line read "ACTIVE → interpret," which was a leftover from when this was a two-command skill. The four-command version correctly splits ACTIVE into monitor (mid-flight) vs interpret (past end) — fixed the wording in the description.

Only remaining item is the cross-skill manage-feature-flags link check, which is genuinely gated on PR #25 landing and this branch rebasing — the references are well-formed by-name mentions, but the sibling skill isn't on this branch yet.

Assisted by Claude

zoeabrams-eng

LGTM

elliotrfeinberg · 2026-06-16T04:54:28Z

Follow-up on the metric-polarity thread: we dug into the MCP server to see whether polarity could be set through the metric tools.

What we found:

The up/down polarity from the QA finding is direction on the experiment metric (persisted at definition.display.direction). The read path already surfaces it; only the write path (ExperimentMetricInput) drops it.
A saved metric's goals is a separate target-tracking feature (target value + checkpoints), and its own direction field is deprecated — so it isn't experiment polarity.

We landed a server PR adding the saved-metric goals/target support to the metric tools (Create/Update/Get-Metric): mixpanel/analytics#96199. The experiment-metric direction write gap is noted there as a separate change if we want it.

No action needed on this skill PR — flagging for traceability since the finding originated here.

Assisted by Claude

…setup Metric direction/polarity is a property of the saved metric and can now be corrected after setup via the metric-update tool, without recreating the metric or experiment. Update the skill to reflect this: - design: setting `down` no longer requires the Mixpanel UI — the metric-update tool works on the draft - glossary, metric-selection, launch readiness, interpret: note the in-place fix; full statement centralized in the Direction glossary entry, others trimmed to brief pointers - use the skill's descriptive "metric-update tool" voice instead of the literal Update-Metric tool ID Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Resolve README.md skills table conflict by keeping both the new manage-experiment and manage-feature-flags rows. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

elliotrfeinberg · 2026-06-17T18:03:41Z

@zoeabrams-eng sorry, can I get another review, I merged in the manage-feature-flag PR and have to merge in the main branch here. Apparently this repo requires re-reviews for any changes after an approval

elliotrfeinberg and others added 2 commits June 9, 2026 23:53

elliotrfeinberg marked this pull request as ready for review June 10, 2026 00:42

elliotrfeinberg mentioned this pull request Jun 10, 2026

Add launch + monitor commands to manage-experiment #31

Merged

8 tasks

elliotrfeinberg requested a review from gslopez June 10, 2026 01:07

This was referenced Jun 10, 2026

Add experiment-results skill #23

Closed

Add experiment-setup skill #24

Closed

elliotrfeinberg and others added 3 commits June 10, 2026 11:12

zoeabrams-eng requested review from zoeabrams-eng and removed request for gslopez June 12, 2026 20:17

zoeabrams-eng reviewed Jun 16, 2026

View reviewed changes

zoeabrams-eng self-requested a review June 16, 2026 04:29

zoeabrams-eng previously approved these changes Jun 16, 2026

View reviewed changes

elliotrfeinberg dismissed zoeabrams-eng’s stale review via e57732f June 16, 2026 21:23

Merge branch 'main' into manage-experiment-umbrella

752db1a

Resolve README.md skills table conflict by keeping both the new manage-experiment and manage-feature-flags rows. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

elliotrfeinberg enabled auto-merge (squash) June 17, 2026 18:13

zoeabrams-eng approved these changes Jun 17, 2026

View reviewed changes

elliotrfeinberg merged commit 2561fb5 into main Jun 17, 2026
1 check passed

elliotrfeinberg deleted the manage-experiment-umbrella branch June 17, 2026 18:48


		1. Explicit: user names a phase (`/design`, "launch this experiment", "monitor experiment X", "interpret the results") → use that command.

Conversation

elliotrfeinberg commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why an umbrella

What's in here

Sequencing

Test plan

Live CRUD QA — executed 2026-06-16 against the Sandbox project (4032515) via the Mixpanel MCP

Skill QA — validated

Skill QA — pending

Uh oh!

linear-code Bot commented Jun 11, 2026

Uh oh!

zoeabrams-eng commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

elliotrfeinberg commented Jun 16, 2026

Uh oh!

zoeabrams-eng commented Jun 16, 2026

Uh oh!

elliotrfeinberg commented Jun 16, 2026

Uh oh!

zoeabrams-eng Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

elliotrfeinberg Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

elliotrfeinberg commented Jun 16, 2026

QA: experiment CRUD via manage-experiment (live run)

🔴 High — Update-Experiment.settings is a full replace, not a merge

🟠 Medium — metric direction can't be set through the MCP

🟡 Low

Uh oh!

zoeabrams-eng Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

zoeabrams-eng Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

elliotrfeinberg Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

elliotrfeinberg commented Jun 16, 2026

Uh oh!

elliotrfeinberg commented Jun 16, 2026

Uh oh!

zoeabrams-eng left a comment

Choose a reason for hiding this comment

Uh oh!

elliotrfeinberg commented Jun 16, 2026

Uh oh!

elliotrfeinberg commented Jun 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

elliotrfeinberg commented Jun 9, 2026 •

edited

Loading

zoeabrams-eng commented Jun 15, 2026 •

edited

Loading

QA: experiment CRUD via `manage-experiment` (live run)

🔴 High — `Update-Experiment.settings` is a full replace, not a merge

🟠 Medium — metric `direction` can't be set through the MCP