Add experiment-setup skill#24
Closed
elliotrfeinberg wants to merge 8 commits into
Closed
Conversation
Authors a single skill covering setup-phase expertise for Mixpanel experiments: hypothesis framing, metric roles, statistical model, sizing, advanced features (CUPED / Winsorization / Bonferroni), XP-vs-FF routing, prior-experiment reuse, and pre-launch pitfalls. Replaces several per-capability tools per MULTI-581. Structured for progressive disclosure: the spine lives in SKILL.md, with deep-dive references for each topic and eval fixtures scaffolded from PRD customer scenarios (Pelando, Confetti, Polarsteps — real quotes still need to be filled in). Synced to mixpanel-mcp-eu and mixpanel-mcp-in via make sync-skills. Linear: https://linear.app/mixpanel/issue/MULTI-581 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Replace explicit tool name references (search_prior_experiments, validate_experiment, run_pre_launch_checks, Run-Query) with agent-agnostic phrasing per the convention from #22. Skills describe capabilities ("query the metric", "search prior experiments", "the platform's pre-launch validation") rather than specific tool calls. - Drop evals/ directory — this repo doesn't run evals. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The spine is always loaded; references are lazy. Move spine content that duplicated reference material out: - Drop hypothesis examples + 5-item commitment list (in hypothesis-framing.md) - Drop sizing worked example arithmetic (in sizing.md) - Collapse Step 4 confidence-level + multiple-testing detail to one line each + pointer to statistical-model.md - Collapse the 3-paragraph "Advanced features" detail to two lines + pointer to advanced-features.md - Drop the enumerated warnings list and edge-cases section (covered in pitfalls.md / sizing.md / metric-selection.md) - Drop the redundant "References" appendix at the bottom (every reference is already linked inline from its step) SKILL.md: 243 → 151 lines. Hypothesis template, metric roles, sizing formula, statistical-model defaults, pitfall blockers, and the output summary template all preserved. All 8 references still linked from the spine. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kaan-barmore-genc-mixpanel
previously approved these changes
Jun 5, 2026
kaan-barmore-genc-mixpanel
left a comment
There was a problem hiding this comment.
Fine, although same concern I had on #23 with files being duplicated across regions.
Surfaces sibling-skill boundaries (experiment-results for post-launch, feature-flags for plain rollouts) at routing time. The exclusions were buried in the body where the routing pass never reaches them. Sync mixpanel-mcp-eu and mixpanel-mcp-in. Assisted by Claude
Proactively applies the same themes @gslopez raised on PR #23 (`experiment-results` → `interpret-experiment`) to this skill. - Rename: `experiment-setup` → `design-experiment` (verb-noun, matches `create-dashboard` / `manage-lexicon` / `interpret-experiment`). - Restructure SKILL.md into intro / Glossary / Components / Steps shape. Glossary front-loads Variant, Primary/Guardrail/Secondary, Direction, MDE, Power, Underpowered, CUPED, Winsorization, multiple-testing correction, sequential vs frequentist. Components defines sizing formulas, the >5% guardrail hard-gate, and the pre-launch pitfall catalogue once. - Strip field-path leaks everywhere: `settings.testingModel`, `settings.endCondition`, `settings.confidenceLevel`, `settings.multipleTestingCorrection`, `settings.cuped.*`, `settings.winsorization.*`, `len(primary_metrics)`, and the internal pitfall `kind` enums (`underpowered_duration_insufficient`, `cohort_too_small`, etc.) are now plain language. - Strip the `pitfall_prose.py` cross-reference from pitfalls.md — that file lives in the analytics repo and isn't accessible to the skill. - Hedge unsourced defaults: 95th percentile Winsorization, 10k per-variant default, ~350-400 floor, 0.95 confidence level all now read "verify in product" / "platform default". - Drop "Open this when…" / "All four properties matter" preambles from hypothesis-framing, metric-selection, prior-experiments. - Add irreversible-action confirmation: SKILL.md step 8 explicitly waits for user confirmation before creating the experiment. - Add disambiguation guard + human-friendly identifier guidance. - Fix `manage-feature-flags` cross-reference in routing-xp-vs-ff.md to point at the actual `feature-flags` skill name. Cross-PR note: the `interpret-experiment` references in PR #23 link to this skill as `experiment-setup` in three places. Will need a follow-up sed after both PRs merge. `grep -rE 'settings\.|len\(.*\)|pitfall_prose' design-experiment/` returns zero hits. Sync via make sync-skills FORCE=1; make check-skills-sync passes. Assisted by Claude
Applies the second-pass review findings on the freshly-restructured skill.
- Drop the one surviving field-path leak (`num_arms × target_sample_size`)
from the cohort-too-small pitfall recommendation; plain language now.
- Defuse the description acronyms — CUPED / Winsorization / Bonferroni each
gets an expansion on first use ("variance reduction via CUPED", "outlier
capping via Winsorization", "false-positive correction via Bonferroni /
Benjamini-Hochberg"). Same fix for "MDE" → "minimum detectable effect"
and "sequential or fixed" → "sequential or fixed-horizon" in the trigger
phrasings.
- Add the formula source to the family-wise FPR table in statistical-model.md:
"Derived from 1 − (1 − α)^k for k = primaries × non-control variants
independent tests at per-test α = 0.05." Lets a skeptical reader recompute.
- Compress SKILL.md Step 5 from 11 lines to a 4-bullet dispatch (testing
model / end condition / confidence / correction), each with the default
inline. Decision tree, peeking trap, worked numbers live in the reference,
which is what references are for.
- Compress pitfalls.md catalogue: 8 entries × 3-paragraph scaffolding
(Trigger / Why it matters / Recommendation) → 8 entries × bold-condition
sentence + free paragraph. Same compression as the phase-3 pass on
interpret-experiment. pitfalls.md: 131 → 93 lines (-29%).
- Drop the guardrails-by-domain parenthetical from SKILL.md Step 3; point
at the table in metric-selection.md instead. Same content, no duplication.
- Replace the `<pitfall name>` placeholder in the Step 8 confirmation card
with a real example ("Missing guardrails — no guardrail metrics
configured…") and a one-line guidance to use the catalogue's exact labels
for consistency across sessions.
Skill total: 967 → 932 lines (-35).
Sync via make sync-skills FORCE=1; make check-skills-sync passes.
Assisted by Claude
elliotrfeinberg
added a commit
that referenced
this pull request
Jun 9, 2026
…eriment The setup-side skill was renamed from `experiment-setup` to `design-experiment` on its own PR (#24) to follow the verb-noun convention. Update this skill's cross-references to match. Sites updated: - SKILL.md description's negative-trigger sentence - references/why-no-statsig.md (two mentions) Sync via make sync-skills FORCE=1; make check-skills-sync passes. Assisted by Claude
elliotrfeinberg
added a commit
that referenced
this pull request
Jun 9, 2026
…Components/Steps Apply the same themes of feedback from PR #23 (interpret-experiment) and PR #24 (design-experiment) to the feature-flags skill: - Rename `feature-flags` → `manage-feature-flags` to match the verb-noun convention of `create-dashboard`, `manage-lexicon`, `interpret-experiment`, `design-experiment`. - Restructure SKILL.md from a 3-spine shape (Routing/Lifecycle/Hygiene) into the canonical Glossary / Components / Steps shape used by manage-lexicon and design-experiment. Steps run in order; vocabulary is defined up front. - Strip field-path leaks throughout — references no longer document `flagType: "feature_gate"`, `rolloutPercentage`, `status: "enabled"`, `ruleset.X` field paths. Those belong to tool docs, not the skill. - Fix stale cross-references: `experiment-results` → `interpret-experiment` and `experiment-setup` → `design-experiment` to match the sibling PRs. - Add irreversible-action confirmation discipline for archive: identify the flag by human-friendly name (not UUID), state the consequence, confirm SDK call sites are gone, get explicit yes — for bulk archives too. - Hedge defaults and compress preambles to match the editorial standard set by PR #24. - Drop the setup field reference and update field reference tables (those belong to tool docs). README index updated. Synced to mixpanel-mcp-eu and mixpanel-mcp-in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The companion PR (#25) renamed the skill `feature-flags` → `manage-feature-flags` to match the verb-noun convention of `create-dashboard`, `manage-lexicon`, `interpret-experiment`, and `design-experiment`. This PR's design-experiment skill had three stale references to the old name: - SKILL.md description (negative-trigger clause) - SKILL.md step 1 routing prose - SKILL.md "Related skills" list - references/routing-xp-vs-ff.md hand-off prose Synced to mixpanel-mcp-eu and mixpanel-mcp-in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
12 tasks
Contributor
Author
|
Closing in favor of the umbrella experiment skill: #30 |
elliotrfeinberg
added a commit
that referenced
this pull request
Jun 17, 2026
* Add feature-flags skill
Authors a single home for feature-flag expertise: picking the right
flag-shaped tool (Feature Gate vs Dynamic Config vs Experiment),
naming and keying, staged rollouts, the kill switch, exposure
debugging, archive/restore semantics, and SDK call patterns.
Replaces the Get-Feature-Flag-Guidance MCP resource + passthrough
tool from MULTI-24.
Structured for progressive disclosure: the spine in SKILL.md covers
the 3-way routing decision, the 5-step lifecycle, and the
List-Feature-Flags-first hygiene rule. Deep-dive references cover
setup, staged rollout, hygiene/cleanup, SDK/exposure debugging, the
status state machine + Update-Feature-Flag call shapes, and
experiment-linked flag handling. Three eval fixtures guard the most
common failure modes: routing ambiguity, disabled-flag zero
exposures, and stale-flag triage.
Content ports from the canonical
ai/copilot/agent/tools/feature-flags/{setup,lifecycle}-guidance.md
in mixpanel/analytics. Loads via the progressive-discovery skill
pattern rather than the readFileSync .md pattern MULTI-24 used.
Synced to mixpanel-mcp-eu and mixpanel-mcp-in via make sync-skills.
Linear: https://linear.app/mixpanel/issue/MULTI-583
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Remove MCP tool name references and delete eval fixtures
- Replace explicit MCP tool names (Create-Feature-Flag, Update-Feature-Flag,
Get-Feature-Flag, List-Feature-Flags, Update-Experiment) with
agent-agnostic phrasing per the convention established in #22.
Skills describe actions ("update the flag's status", "list the project's
flags") rather than specific MCP tool calls.
- Drop evals/ directory — this repo doesn't run evals.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Trim SKILL.md by deferring duplicated content to references
The spine is always loaded; references are lazy. Trim spine content
that duplicated reference material:
- Collapse the 5 lifecycle steps from multi-paragraph each to one
numbered list with single-sentence-each-plus-pointer (details still
in staged-rollout.md / lifecycle-and-state-machine.md /
hygiene-and-cleanup.md).
- Tighten Spine 3 (prior-work check) to one paragraph; the three
cases are still enumerated in hygiene-and-cleanup.md.
- Drop the 11-line "Common pitfalls" cheat sheet at the bottom —
every pitfall is already covered in the relevant reference, and
the Quick lookups table routes the agent there.
SKILL.md: 161 → 86 lines. Three-spine structure, routing table,
quick-lookups router, and the archive≠kill-switch reminder all
preserved. All 6 references still linked from the spine.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Address PR review feedback on feature-flags skill
- Replace $feature_flag_called with $experiment_started as the
exposure event name throughout SKILL.md and sdk-and-exposure.md.
- Fix newly-enabled flag rollout default: starts at 0%, not 1.0;
requires a manual bump.
- Soften "server-side changes can go faster" to require sufficient
monitoring before accelerating ramps.
- Add short-lived feature-gate flags to the healthy-project list
in hygiene-and-cleanup.md.
- Require a codebase reference check before disabling or archiving
any flag, to catch flags the SDK still reads in production.
- Rewrite SDK call shape section to describe the two read entry
points (value-only, full variant) without language-specific
code snippets; defer per-language signatures to docs.mixpanel.com.
- Clarify fallback semantics include the not-in-rollout case.
- Drop the serving_method field from the setup reference and the
diagnostic checklist since it is not togglable in the UX.
* Add negative-trigger guidance to feature-flags description; sync variants
The exclusions for experiment-results and experiment-setup were buried
in the body; the routing pass only sees the description. Moving them up
prevents the skill from being picked for "should we ship?" or "what MDE
can I detect?" questions.
Also: variants were drifted from mixpanel-mcp in the PR baseline
(SKILL.md, hygiene-and-cleanup.md, routing-and-setup.md,
sdk-and-exposure.md). Resync to make check-skills-sync pass in CI.
Assisted by Claude
* Rename feature-flags → manage-feature-flags; restructure to Glossary/Components/Steps
Apply the same themes of feedback from PR #23 (interpret-experiment) and
PR #24 (design-experiment) to the feature-flags skill:
- Rename `feature-flags` → `manage-feature-flags` to match the verb-noun
convention of `create-dashboard`, `manage-lexicon`, `interpret-experiment`,
`design-experiment`.
- Restructure SKILL.md from a 3-spine shape (Routing/Lifecycle/Hygiene) into
the canonical Glossary / Components / Steps shape used by manage-lexicon
and design-experiment. Steps run in order; vocabulary is defined up front.
- Strip field-path leaks throughout — references no longer document
`flagType: "feature_gate"`, `rolloutPercentage`, `status: "enabled"`,
`ruleset.X` field paths. Those belong to tool docs, not the skill.
- Fix stale cross-references: `experiment-results` → `interpret-experiment`
and `experiment-setup` → `design-experiment` to match the sibling PRs.
- Add irreversible-action confirmation discipline for archive: identify the
flag by human-friendly name (not UUID), state the consequence, confirm
SDK call sites are gone, get explicit yes — for bulk archives too.
- Hedge defaults and compress preambles to match the editorial standard set
by PR #24.
- Drop the setup field reference and update field reference tables (those
belong to tool docs).
README index updated. Synced to mixpanel-mcp-eu and mixpanel-mcp-in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Address review minors: hedge cadence default, add flag-lookup discipline, trim filler
Apply the 4 issues surfaced by the hardcore /review-skill pass on
manage-feature-flags:
- D7 (Defaults are sourced): The `1% → 10% → 50% → 100%` cadence was stated
as if Mixpanel imposed it. Reframe as a recommended default that works for
most teams, with an explicit "calibrate to product risk and the monitoring
you actually have" hedge. Applied in SKILL.md step 5 and staged-rollout.md.
- D6 (Human-friendly identifiers): Add a "Identifying flags the user mentions"
note to SKILL.md telling the agent to match by key first, then case-insensitive
name, and to ask for disambiguation rather than UUID.
- D8 (Concise): Trim the "Why incremental rollout" preamble in
staged-rollout.md (the agent already knows the concept). Trim the
"healthy project has..." rhetorical scaffolding in hygiene-and-cleanup.md.
- D8 (Concise): Tighten the auto-key generation subsection in
routing-and-setup.md from 4 details to the 2 that actually surprise users.
Total: 614 → 604 lines. Synced to mixpanel-mcp-eu and mixpanel-mcp-in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Clarify the opening of two reference files
Two PR25 reference files opened with mechanics instead of stating what
the file is. Add a one-line "this is X" framing before the first
substantive paragraph so the reader knows what they'll get:
- lifecycle-and-state-machine.md: "This is the flag state machine plus
the rules for which update calls preserve which fields."
- experiment-linked-flags.md: "This is the playbook for flags governed
by an experiment — how to spot one, which edits the experiment will
overwrite, and when to hand off."
Synced to mixpanel-mcp-eu and mixpanel-mcp-in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* PR #25 polish + stale-ref sweep
/review-document feedback on PR #25:
- D1: staged-rollout.md cadence table read as authoritative. Added a
one-line "wait times are conservative defaults; calibrate to your
monitoring latency and product risk" hedge below the table.
- D3: hygiene-and-cleanup.md "Before any new flag" section duplicated
SKILL.md step 2. Replaced with a stub that points at the canonical
version, since hygiene-and-cleanup primarily addresses *existing*
flag cleanup.
- D7: lifecycle-and-state-machine.md ASCII state diagram had
inconsistent label widths. Normalized the transition labels.
Cross-PR stale-ref sweep:
- 9 references to `interpret-experiment` / `design-experiment` across
SKILL.md (description, when-to-use, components, routing table, mis-
routing rules) and references/experiment-linked-flags.md (4 spots).
- All updated to point at the umbrella skill `manage-experiment`,
which is what outside callers should know about. The dual
design-vs-interpret distinction is internal to that skill's
subcommands.
Synced to mixpanel-mcp-eu and mixpanel-mcp-in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* manage-feature-flags: address skill + reference review findings
Skill review:
- Trim description from 1087 to 958 chars (was over the 1024 spec limit)
- De-duplicate the terminal-states enumeration: SKILL.md step 8 now a
spine + link; full triage lives only in hygiene-and-cleanup.md
- Convert lateral reference->reference links to plain prose so each
reference is one hop from SKILL.md
- Hedge unsourced product/implementation claims (split tolerance,
server-side enforcement, update-routing behavior, experiment
transition table)
Reference reviews:
- Fix broken external docs link (/docs/feature-flags 404 -> /docs/featureflags)
- Collapse the threefold operational/shipped triage in hygiene-and-cleanup.md
- Trim mid-stage ship/hold/roll-back overlap with SKILL.md in staged-rollout.md
- Clarify that an experiment-managed conclude preserves bucketing,
unlike a manual flag disable
- Add a concrete Dynamic Config variant-value example
Synced across mixpanel-mcp{,-eu,-in}; make check-skills-sync passes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* manage-feature-flags: add exposure ingestion-lag caveat to no-exposures checklist
Address review feedback that an LLM may not know how long to wait before
treating a zero exposure count as a real failure. Note that exposure
events are not instant, so the agent should allow a few minutes after
enabling (and after real traffic hits the code path) before working the
diagnostic checklist.
Synced across mixpanel-mcp{,-eu,-in}; make check-skills-sync passes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* manage-feature-flags: correct exposure ingestion-lag window to up to 24h
Per Mixpanel docs, events usually surface in seconds-to-minutes but the
general ingestion delay can reach ~24h with SDK batching, server-side
buffering, or third-party pipelines. Point to Live View to confirm
real-time arrival rather than waiting on the report.
Synced across mixpanel-mcp{,-eu,-in}; make check-skills-sync passes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
elliotrfeinberg
added a commit
that referenced
this pull request
Jun 17, 2026
* New skill: manage-experiment (umbrella over design + interpret) Combines the soon-to-land `design-experiment` (PR #24) and `interpret-experiment` (PR #23) skills into a single `manage-experiment` umbrella, modeled on `manage-lexicon`. One front door for any experiment-related phrase; subcommands for the two phases. ## Why Today an agent has to disambiguate between two skills on every experiment turn, and borderline phrases ("audit my experiment", "check on experiment X") route inconsistently. Folding both into one umbrella with `design` and `interpret` subcommands: - Gives a single trigger surface for the agent to route to. - Lets ambiguous phrases use phase-derived disambiguation (DRAFT → design; ACTIVE/CONCLUDED → interpret). - Removes the cross-skill reference coupling that currently requires coordinated edits across two PRs whenever either skill renames. - Reserves room to add `launch` and `monitor` commands later without another rename. ## Shape ``` manage-experiment/ ├── SKILL.md (umbrella: shared glossary, command menu, project + experiment resolution) ├── commands/ │ ├── design.md (former design-experiment SKILL.md body) │ └── interpret.md (former interpret-experiment SKILL.md body) └── references/ (15 references, flat; no name collisions) ``` The two command files preserve their original prose verbatim except for: - Removing the now-redundant frontmatter and opening "You are helping…" paragraph. - Lifting umbrella-able glossary terms (Variant, Primary/Guardrail/ Secondary, Direction, Lift, MDE, CUPED, Winsorization, Multiple- testing correction) up to the umbrella; keeping phase-specific terms (Hypothesis, Power, Underpowered, Sequential vs Frequentist for design; Polarity, Significance, SRM, Retro A/A, Twyman's Law, Trustworthiness gate for interpret) in their command files. - Adjusting reference paths to `../references/`. The 15 reference files are unchanged except for two intra-skill links in `why-no-statsig.md` that previously pointed at the `design-experiment` skill — now point directly at `sizing.md` (where the formulas live). The 3 references to `manage-feature-flags` (in SKILL.md, design.md, and routing-xp-vs-ff.md) are correct cross-skill pointers and unchanged. ## Sequencing This PR is opened **before** PR #23 and PR #24 merge. Once those land, this PR will need a rebase against master that deletes the now-existing `design-experiment/` and `interpret-experiment/` directories. Per the plan, doing the umbrella as a separate PR keeps the review focused on the routing pattern rather than tangled into either skill's prose review. The `manage-feature-flags` skill (PR #25) cross-references both old skill names; once both PRs land, those cross-refs will need a sweep to point at `manage-experiment`. README index updated: one `manage-experiment` row replaces the two prospective `design-experiment` and `interpret-experiment` rows. Synced to mixpanel-mcp-eu and mixpanel-mcp-in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Address review minors: add reference index, clarify routing fall-through, dedupe match logic Applies the 3 issues surfaced by the /review-skill pass on manage-experiment: - D4 (Centralised logic): the umbrella SKILL.md had no map of what references/ contains. Add a "Reference files" table to SKILL.md listing each file, which command uses it, and its purpose. Mirrors the shape used by manage-feature-flags. - D7 (Qualified recommendations): the phase-derived rule in step 3 ("DRAFT → design, ACTIVE/CONCLUDED → interpret") read as if it always applied, but it only does when an experiment is in context. Restate step 3 as ordered rules with explicit fall-through, and add a row for ambiguous verbs like "audit" / "check" / "review". - D6 (Human-friendly identifiers): the "try ID first, then case-insensitive name match" rule was stated in three places (umbrella step 2, design step 8, interpret step 1). Drop the duplicates from the command files — interpret hands back to the umbrella, design retains only the "ask if not named" guidance (the lookup mechanics don't apply to free-text feature names). Synced to mixpanel-mcp-eu and mixpanel-mcp-in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix Winsorization percentile convention: agent passes tail-width (5), not cap-point (95) PR #30 review caught an agent-actionable accuracy bug. The skill docs described Winsorization as "default 95 (cap top/bottom 5%)" — natural phrasing for a human, but the actual `Update-Experiment` tool schema defines `WinsorizationSettingsInput.percentile` as the **tail width to cap on each side**, with default `5` and `exclusiveMaximum: 50`. An agent reading the docs would pass `95`, get rejected by validation, or misinterpret as inverted Winsorization. Updated 6 files to match the schema's convention: - SKILL.md shared glossary - commands/design.md step 6 (advanced features) - references/advanced-features.md "What it does", "How to enable", "Percentile guidance" (default flipped from "95th" to `percentile=5`, "push back below ~80" flipped to "push back above ~20") - references/health-check-interpretation.md "Extreme winsorization percentile" misconfiguration check - references/per-metric-interpretation.md "Variance-reduction & outlier settings that change interpretation" - references/pitfalls.md "High variance, no Winsorization" warning The numeric guidance ("push back if more than 20% of each tail is capped") is preserved; only the convention name flipped to match what the agent actually sends through the tool. Synced to mixpanel-mcp-eu and mixpanel-mcp-in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Add launch + monitor commands to manage-experiment (#31) * Add launch and monitor commands to manage-experiment Extends the manage-experiment umbrella (PR #30) from 2 commands (design, interpret) to 4 commands covering the full experiment lifecycle. Stacks on top of PR #30 so the umbrella structure and this extension can be reviewed separately. ## New commands ### commands/launch.md (~125 lines) Owns the irreversible DRAFT → ACTIVE transition. Runs the final launch-readiness check (lifted from design step 7 — that step now catches design-time issues only) and performs the launch with explicit CONFIRM. Includes the post-launch handoff: recommend a tracking dashboard + a calendar reminder for monitor. ### commands/monitor.md (~115 lines) Mid-flight safety checks. Answers "is it safe to keep this running?" which is distinct from interpret's "did it work?" Covers: - The peek-safety table: what's safe to look at mid-flight (SRM, sample pace, guardrails, Sequential primaries) and what isn't (Frequentist primaries — the peeking trap). - The terminate-early decision rules: trustworthiness failure, guardrail regression beyond tolerance, Sequential boundary crossed. - An explicit don't-peek-on-Frequentist pushback for users who ask to see primary results mid-flight. ## Knock-on changes - umbrella SKILL.md: 4-row commands table, 5-option menu, phase-derived routing extended to DRAFT-design vs DRAFT-launch and ACTIVE-monitor vs ACTIVE-interpret. Reference table "Used by" columns updated (pitfalls → design+launch, health-check → monitor+interpret, sizing → design+monitor+interpret). - commands/design.md step 7: reframed as "design-time sanity check" rather than "the pre-launch check". The full launch-readiness gate now lives in launch.md. Step 8 saves a DRAFT (reversible) rather than creating-and-launching. Added step 9: hand off to launch. - design.md opening line: updated to say "stops at DRAFT" instead of "don't create until confirmed". Synced to mixpanel-mcp-eu and mixpanel-mcp-in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Address review issues: lift cross-command policies to umbrella, unify emoji, hedge cross-skill refs Applies the 5 issues surfaced by the /review-skill pass on the 4-command manage-experiment: ## Major (D4) The 5% guardrail hard-gate and the peek-safety table were duplicated across design / launch / monitor — three places to keep load-bearing thresholds in sync. - Add a "Cross-command policies" section to the umbrella SKILL.md that owns the guardrail hard-gate threshold and the peek-safety table. Also moves the emoji convention there so all commands share one vocabulary. - design.md "The >5% guardrail hard-gate" → "Guardrails (the hard-gate enforced downstream)" — keeps the design-time guidance, defers the threshold to the umbrella. - design.md step 5: "peeking is safe by design" → "peek-safety table in the umbrella's Cross-command policies covers why". - launch.md readiness checklist: ">5% hard-gate from the design command" → "regression hard-gate (see umbrella Cross-command policies)". - monitor.md Components: drop the entire "What's safe to peek at" table, reference the umbrella. Glossary defers "peeking trap" to the umbrella. Terminate-early rules cite the umbrella threshold. ## Minor (D3) design.md output-style section had a duplicate hand-off line restating what step 9 already covers — removed. ## Minor (D7) launch.md step 5 dashboard recommendation was unqualified — added "recommend running create-dashboard in a follow-up session" so the agent doesn't try to invoke it inline and interrupt the launch flow. ## Suggestion (D3) launch.md irreversibility section mentioned "pause/resume via the underlying feature flag" without naming the skill — added explicit "handled by the manage-feature-flags skill" cross-ref. ## Emoji convention (D6) Declared once in the umbrella's Cross-command policies. Existing emoji usage across design/launch/monitor already matches; interpret has no symbolic checklist output (its output is prose verdict + per- metric breakdown), which is correct. ## Knock-on numbers Skill core: 692 → 720 lines (umbrella +38 with the new policies section, monitor -10 from removed table, others flat). Synced to mixpanel-mcp-eu and mixpanel-mcp-in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * PR #31 polish: define peeking trap in umbrella, add calendar-reminder example /review-document feedback on PR #31: - D6: monitor.md deferred the "peeking trap" definition to the umbrella's Cross-command policies, but the umbrella's peek-safety section only showed consequences and never defined the term. Added a one-line definition before the table. - D8: launch.md step 5 calendar-reminder recommendation was abstract. Added concrete phrasing the agent can hand to the user so it doesn't improvise the wording inconsistently across sessions. Synced to eu/in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * manage-experiment: port statsig diagnosis depth + align to skill rubric Port the three pieces from the analytics monorepo's statsig-diagnosis guidance (tal-multi-546) that PR #30's why-no-statsig.md lacked, then reconcile them with this repo's skill quality rubric. Port: - sizing.md: relative, unit-safe achievable-MDE formulas (MDE_relative, n_required) with the sigma=sqrt(sigma^2) caution, framed as explaining the NO verdict rather than overriding it. - why-no-statsig.md: a "pull the diagnostic inputs first" step and a priority-ordered "name the single most-likely blocker" table so the answer is one blocker with numbers, not a generic option list. Rubric alignment (review-skill: 60% capped -> ~98%): - Reframe the diagnostic-inputs step to intent (no named tool call or response field paths) so it stays engine-agnostic, lifting the cap. - Fix advanced-features.md winsorization percentile contradiction (tail-width 5 convention vs stale 95/90 boundary references). - Fix dead/stale cross-references in per-metric-interpretation.md and routing-xp-vs-ff.md to point at sections/commands that exist. - Add a Contents block to every 100+ line reference. - Collapse the state->command mapping duplicated 3x in SKILL.md (now 200). Synced byte-identical across mixpanel-mcp / -eu / -in. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * manage-experiment: align prior-experiments lookup with overlap-ranking tool The Search-Prior-Experiments engine tool (analytics #95682) ranks stored experiments by overlap on the draft's metric set, flag key, and hypothesis rather than free-text keyword matching. Update the design command and the prior-experiments reference so they steer the agent to hand the tool the draft it's building instead of constructing single-keyword queries. Engine-agnostic wording — describes the signals, not tool names or field paths. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * manage-experiment: trim description under the 1024-char spec limit The description was 1326 chars (over the Agent Skills 1024 limit) because it enumerated ~15 verbatim trigger phrasings. Trim to a representative handful and fold Bonferroni/Benjamini-Hochberg into "multiple-testing correction"; the Canonical commands table in the body already carries the full keyword set. Now 921 chars. Synced across all three plugin variants. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * manage-experiment: address skill-review findings across SKILL + references Fixes from the review-skill rubric pass and the per-reference review: - DRY/portability: de-link all reference->reference chains so references stay one level deep; strip tool-docs leakage (significance enum, summary.* buckets, liftConfidence, decide constants) to intent-level. - Centralise the Winsorization >~20% push-back rule in the SKILL.md glossary; centralise the polarity recipe in the interpret command; both now referenced rather than restated. - Source defaults: qualify the Benjamini-Hochberg platform default, CUPED 30-70% range, and the `up` direction default with "verify". - Accuracy fixes in references: correct the family-wise FPR claim (3-arm -> 4-arm), drop in-file Winsorization duplication, fix the "constants"->"choices" dangling referent in lifecycle-handoff, soften absolutist hypothesis-framing phrasing, de-changelog the segment-breakdown platform-status note. Applied identically across the mixpanel-mcp, -eu, and -in variants. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * manage-experiment: address CRUD QA findings Live CRUD QA against the Mixpanel MCP surfaced gaps where the skill instructs behavior the tools handle differently than assumed: - Settings full-replace footgun (High): the experiment-update tool's `settings` payload is a full replace, so a one-field edit silently nulls `srm`, `excludeQA`, etc. Added a read-merge-write Behaviour rule to SKILL.md, a pointer in design step 8, and a launch-readiness warning row for dropped `srm` / `excludeQA`. - Metric `direction` not settable via inline metric creation (Medium): metrics default to `up`. Added a caveat in design step 3 to set `down` in the UI for down-polarity metrics and verify before launch. - Archive doesn't cascade to the backing flag (Low): noted in lifecycle-handoff that the auto-created flag is left behind. - Generic experiment-listing errors (Low): Behaviour rule 2 now covers the experiments-not-enabled failure case. Synced across mixpanel-mcp / -eu / -in. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * manage-experiment: address review feedback (routing dedupe + glossary concision) Resolves @zoeabrams-eng's two inline review threads on SKILL.md: - Routing duplication: "Pick the command" (step 3) no longer restates the explicit/implicit/phase-derived rules already in "Canonical commands" — it now references the table and mapping and applies them. - Glossary verbosity: trimmed the Winsorization entry to a one-liner plus a pointer to advanced-features.md (which already holds the full push-back rule), and tightened the Direction / CUPED / Lift / multiple-testing entries. Repointed the two remaining "umbrella glossary's push-back rule" references (design step 6, pitfalls.md) so the rule's canonical home is advanced-features.md; kept pitfalls' mention inline to avoid a reference->reference link chain. Synced across mixpanel-mcp / -eu / -in. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * manage-experiment: note metric direction (polarity) is editable post-setup Metric direction/polarity is a property of the saved metric and can now be corrected after setup via the metric-update tool, without recreating the metric or experiment. Update the skill to reflect this: - design: setting `down` no longer requires the Mixpanel UI — the metric-update tool works on the draft - glossary, metric-selection, launch readiness, interpret: note the in-place fix; full statement centralized in the Direction glossary entry, others trimmed to brief pointers - use the skill's descriptive "metric-update tool" voice instead of the literal Update-Metric tool ID Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a new
experiment-setupskill atplugins/mixpanel-mcp/skills/experiment-setup/, ported from the staging branch in mixpanel/analytics#95516.What the skill does
A single skill covering setup-phase expertise for Mixpanel experiments — hypothesis framing, metric roles, statistical model, sizing, advanced features (CUPED / Winsorization / Bonferroni), XP-vs-FF routing, prior-experiment reuse, and pre-launch pitfalls. Replaces the per-capability tools listed in MULTI-581 (
get_experiment_setup_guidance,check_statistical_config,recommend_advanced_features,parse_hypothesis,run_pre_launch_checks,explain_health_check).Structured for progressive disclosure:
SKILL.md— skill entry point. Routes to references on demand.references/hypothesis-framing.md— coaches the "Changing X will increase Y because Z" structure.references/metric-selection.md— primary / guardrail / secondary roles, lagging-indicator trap.references/sizing.md— MDE, power, baseline rate, traffic — Kohavi's inverted formula.references/statistical-model.md— sequential vs fixed-horizon, confidence level.references/advanced-features.md— CUPED, Winsorization, Bonferroni / Benjamini-Hochberg.references/routing-xp-vs-ff.md— when the user wants an experiment vs a plain feature flag.references/prior-experiments.md— how to fold prior results into a new design.references/pitfalls.md— the deterministic check catalogue and the >5% guardrail hard-gate rationale.evals/— three fixtures (Pelando, Confetti, Polarsteps) seeded from PRD scenarios.Content ports from
ai/engine/tools/experiments/_guidance/setup.mdand_shared/pitfall_prose.pyinmixpanel/analytics.Sync
mixpanel-mcp-euandmixpanel-mcp-inare synced viamake sync-skills FORCE=1.make check-skills-syncpasses locally. The top-levelREADME.mdskills table includes the new row.Open follow-ups (from the staging README)
SKILL.mdandreferences/prior-experiments.mdreference tool names from MULTI-588 / MULTI-589 (search_prior_experiments,validate_experiment). If those names change before merge, update both.Test plan
SKILL.mdend-to-end; confirm trigger phrasings don't over-trigger onanalyze-experimentrequests.ai/engine/tools/experiments/_guidance/setup.md.evals/README and three fixture files are usable as a starting scaffold.references/pitfalls.mdmatches product's customer-facing copy.🤖 Generated with Claude Code
Assisted by Claude