Skip to content

Add experiment-results skill#23

Closed
elliotrfeinberg wants to merge 12 commits into
mainfrom
experiment-results-skill
Closed

Add experiment-results skill#23
elliotrfeinberg wants to merge 12 commits into
mainfrom
experiment-results-skill

Conversation

@elliotrfeinberg

Copy link
Copy Markdown
Contributor

Summary

Adds a new experiment-results skill at plugins/mixpanel-mcp/skills/experiment-results/, ported from the staging branch in mixpanel/analytics#95517.

What the skill does

A single home for results- and health-phase interpretation expertise. The agent loads the skill and reads the verdicts that Get-Experiment returns — it never recomputes thresholds (SRM p-value, significance, sufficient-exposures, etc.). Replaces the interpretation portion of MULTI-472, MULTI-475, MULTI-476, MULTI-477, MULTI-478, MULTI-546.

Structured for progressive disclosure:

  • SKILL.md — 5-step decision tree (Trustworthiness → Significance → Guardrails → Practical → Verdict), polarity recipe, verdict table, misconfig pointers.
  • references/health-check-interpretation.md — SRM / Retro A/A / peeking / insufficient-exposures / broken-data / <3-day-window root causes, ordered most → least likely, with Kohavi framing.
  • references/per-metric-interpretation.md — translating lift / CI / p-value into a plain-language verdict; polarity, magnitude, Twyman's Law, distribution-type nuances, CUPED/Winsorization effects.
  • references/why-no-statsig.md — five real reasons an experiment hasn't hit statsig and what to do about each.
  • references/segment-of-interest-selection.md — how to pick 3–5 segments worth slicing on, hypothesis-driven priorities.
  • references/segment-breakdown-interpretation.md — reading per-segment results, Simpson's paradox vs heterogeneity vs noise.
  • references/session-replay-analysis.md — qualitative analysis via paired control/treatment replay cohorts.
  • references/get-experiment-fields.md — field map for Get-Experiment's response.
  • evals/ — three fixtures seeded from PRD customer quotes (Pelando "+2 others", Confetti "8 metrics for new visitors", Polarsteps "no documented workaround").

The content is grounded in the canonical playbook at ai/engine/tools/experiments/_guidance/results_interpretation.md in mixpanel/analytics.

Sync

mixpanel-mcp-eu and mixpanel-mcp-in are kept in sync by the originating commit. make check-skills-sync passes locally.

Test plan

  • Review SKILL.md for the decision-tree spine and verdict table
  • Review the 6 references/*.md files for accuracy against the canonical _guidance playbook
  • Review the 3 evals/*.yaml fixtures
  • Confirm make check-skills-sync passes in CI

🤖 Generated with Claude Code

Assisted by Claude

mixpanel-claude-code-agent Bot and others added 2 commits June 4, 2026 23:31
Authors a single home for all results- and health-phase expertise: the
agent loads this skill and reads the verdicts that Get-Experiment
returns, rather than recomputing thresholds. Replaces the
interpretation portion of several superseded per-capability tools.

Skill is structured for progressive disclosure: the spine (5-step
decision tree, polarity recipe, ship/iterate/kill/wait verdict) lives
in SKILL.md, and deep-dive references cover health-check causes,
per-metric phrasing, why-no-statsig, segment-of-interest selection,
segment-breakdown reading, session-replay analysis, and the
Get-Experiment field map. Eval fixtures seeded from PRD customer
quotes (Pelando "+2 others", Confetti "8 metrics for new visitors",
Polarsteps "no documented workaround").

Synced to mixpanel-mcp-eu and mixpanel-mcp-in via make sync-skills.

Linear: https://linear.app/mixpanel/issue/MULTI-582

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Replace explicit MCP tool names (Get-Experiment, Update-Experiment,
  Run-Query, Get-Feature-Flag, Get-Experiment-Setup-Guidance) with
  agent-agnostic phrasing per the convention from #22. Skills describe
  actions ("request experiment details", "query the metric", "update
  the experiment") rather than specific tool calls.
- Rename references/get-experiment-fields.md → experiment-fields.md so
  the filename doesn't echo a specific MCP tool name.
- Drop evals/ directory — this repo doesn't run evals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elliotrfeinberg elliotrfeinberg marked this pull request as ready for review June 5, 2026 01:20
The spine is always loaded; references are lazy. Move spine content
that duplicated reference material out:
- Drop the 47-line ASCII decision tree (numbered list reads equivalently)
- Replace the Step 1 trustworthiness field table with a one-line gate
  (full table lives in experiment-fields.md + health-check-interpretation.md)
- Compress Step 4 baseline-lookup detail to a pointer (full procedure
  in per-metric-interpretation.md)
- Move multi-variant + decide-call shape + special variant constants
  to experiment-fields.md §Lifecycle hand-off (where they already are)
- Drop the 16-line common-pitfalls cheat sheet (each pitfall is
  covered in the relevant reference)

SKILL.md: 236 → 110 lines. Decision-tree spine and verdict table
preserved; polarity recipe stays inline since it's load-bearing for
every step. All references still linked from the spine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have to have the same files duplicated across regions? Is there a way to do shared ones? For Experiments, the region doesn't really matter.

Surfaces the setup-skill boundary at routing time. The exclusion
existed in the body ("Do not trigger for experiment setup questions")
but the agent never reached it during skill selection.

Sync mixpanel-mcp-eu and mixpanel-mcp-in.

Assisted by Claude
Comment thread plugins/mixpanel-mcp/skills/experiment-results/SKILL.md Outdated

## When to use this skill

Trigger when the user asks anything about reading an experiment's results or its health. Common phrasings:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be in the description? In theory, if Claude is reading here is because it already decided to load the skill


## How to read experiment-details output

Always request experiment details with `compute_exposures=true, compute_metrics=true`. The response has two parallel data paths — live and cached. **Always prefer live, fall back to cache, surface errors.**

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: try to use human language, then when using MCP, or agent interface or headless, it should understand how to do it by understanding the schemas

| Per-metric per-variant stats | `live_metrics[metricId][variant]` | `results_cache.metrics[metricId][variant]` |
| Bucketed summary | recompute from `live_metrics` | `results_cache.summary` |

If `live_results_errors` is non-null, use the cache, caveat that data is stale, and surface the error — the underlying failure may need fixing before any decision. If **both** live and cache are empty for a metric, say "no result was computed" and recommend a re-sync. **Never** silently treat as "no effect."

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, iiuc, mcp returns either

{live_<something>: {...data}}

or

{
  live_results_errors: <error>,
  results_cache: {..data}
}

right?

Some suggestions

  1. MCP: I'd expect the output to be only one with a plain schema, and maybe a source: cache or source: live. Another option would be the tool to have a parameter tool_name(source: live) or tool_name(source: cache)?
  2. The schema format/meaning should be part of MCP. What belongs to the skill is how to react. I can see something like
# Get the experiment results
Make sure you already have the experiment data the user wants to interpret in your context. If not, ask the user any information needed and get it.

Ideally you'll use live results, but if they are cached, display this dialog to the user and ask for explicit confirmation

<code block>
This experiment doesn't have live data because ....

</code block>



Run in order. **Stop at the first failure** — do not proceed if a step flags a problem.

1. **Trustworthiness gate** — SRM ok? Exposures sufficient? Retro A/A clean? Minimum duration met (~3 days)? No misconfig? If any fail → STOP and open [references/health-check-interpretation.md](references/health-check-interpretation.md).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it may be because i don't have enough context, but i have no idea what you're saying here. In these cases sometimes to have a Glossary at the beginning of the skill with common things makes the skill easier to maintain.

# Glossary
This skill runs some specific analysis for ...... Here is a quick list of the concepts we'll apply
1. Polarity: formula that xxxxx
2. Variant
3. Primary

4. **Practical significance** — convert lift into absolute terms (`baseline_value × lift`). Statistically significant ≠ ships.
5. **Verdict** — see table below.

### Polarity recipe (load-bearing — keep in mind for every metric)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh here is the polarity recipe. Then, I suggest you change the structure of the skill to

<intro>

# Components
- Define all the formulas, templates and pieces you will use later

# Steps

<Top down section with what you want the skill to do>

see plugins/mixpanel-mcp-eu/skills/manage-lexicon/SKILL.md
and this comment #17 (comment)

@@ -0,0 +1,158 @@
# Experiment-Details Field Map

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this file shouldn't be necessary right? the MCP should have a good json structure a tool documentation that explains what each thing is

@@ -0,0 +1,158 @@
# Health-Check Interpretation

Open this when Step 1 of the Decision Tree flags a failure (SRM, Retro A/A, insufficient exposures, peeking, broken-data, < 3-day window, or any misconfiguration). The goal is to turn the platform's already-computed verdict into a plain-language explanation, an ordered list of likely causes, and a recommended next action.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are reading this file, it means it was already open right? I don't think the LLM will see this until it decided to read the reference.

the same comments I gave you for the main skill apply here, and siblings

Addresses @gslopez's review on PR #23.

- Rename skill from `experiment-results` to `interpret-experiment` (verb-noun,
  matches `create-dashboard` / `manage-lexicon` / `deep-research`).
- Restructure SKILL.md into intro / Glossary / Components / Steps shape (mirrors
  `manage-lexicon`). Glossary defines Variant, Primary/Guardrail/Secondary,
  Lift, Polarity, Significance, SRM, Retro A/A, Twyman, CUPED, Winsorization,
  MDE, Trustworthiness gate so later steps can use the terms without redefining.
- Drop API-parameter phrasing (`compute_exposures=true`, decide-call shape).
  Replace with intent — the tool layer maps it to the right call.
- Drop the duplicated negative-trigger paragraph from the SKILL body (already
  in `description:` frontmatter where the loader sees it).
- Define the polarity recipe once in Components; replace the verbatim duplicate
  in per-metric-interpretation.md with a back-reference.
- Replace the live/cache field-path table with a single fallback rule in
  Components; the field schema is the tool's job, not the skill's.
- Add explicit "confirm with the user before concluding — irreversible" to the
  verdict table and to the Steps output shape.
- Delete `experiment-fields.md` (was duplicating tool-response schema docs).
  Promote its domain content into `lifecycle-handoff.md` (decide-call rationale,
  multi-variant choice, special variant constants).
- Break the overloaded misconfig bullets in health-check §7 into seven
  Condition / Interpretation / Action sub-sections; rename §7 to
  "Misconfigurations".
- Replace `§7` / `§SRM` / `§Misconfig` numeric/abbrev cross-refs with section
  titles so they don't rot when sections reorder.
- Trim generic p-value content in per-metric-interpretation.md to the
  Mixpanel-specific traps (Welch's t, `liftConfidence` is the confidence level
  not the CI width).
- Drop "Open this when…" preambles from every reference — the LLM is already
  reading the file by the time it opens.
- Sync eu/in plugin copies via `make sync-skills FORCE=1`.

Assisted by Claude
Addresses Phase 1 of the hardcore /review-skill pass.

- Drop stale "Step 1 of the Decision Tree" cross-references in SKILL.md
  Glossary, why-no-statsig.md, and segment-of-interest-selection.md. The new
  spine numbers the trustworthiness gate as step 2, but the name "trustworthiness
  gate" is what's stable — use the name.
- Drop the embedded ~30s retry interval in health-check-interpretation.md §5.
  Retry policy belongs to the tool layer; "retry once, then surface" is enough
  for the skill.
- Hedge five unsourced defaults (Welch's t-test choice, 95% winsorization
  default, ~350 per-variant exposure floor cited in three places, Bonferroni
  correction on multi-variant SRM). Each one becomes "the platform's
  configured/default X — verify in product" instead of a flat assertion.

Sync via make sync-skills FORCE=1; make check-skills-sync passes.

Assisted by Claude
Addresses Phase 2 of the hardcore /review-skill pass. Removes the field-path
schema leaks gslopez originally flagged on PR #23, which had survived the
phase-1 cleanup.

- Rewrite every section header in health-check-interpretation.md (sections 1-6
  + the seven §7 misconfig sub-sections) from "Verdict to consume: <field
  path>" to plain-language intent. Same for the Condition/Interpretation/Action
  scaffolding in §7 — collapsed to "When: <plain condition>" + a free
  paragraph, dropping the labels the reader doesn't need.
- Rewrite every "What to look at" bullet in why-no-statsig.md (reasons 1-5)
  from field-path triplets to intent. Same for the "First, rule out a broken
  result" checks and the EXTEND action.
- Remove the remaining settings.* / live_* / results_cache.* references in
  per-metric-interpretation.md (baseline-fetch, variance/outlier discussion,
  multiple-comparisons section, testing-model section) and SKILL.md's polarity
  recipe (multiple-testing correction).
- Remove field-path leaks from lifecycle-handoff.md (decision-record) and
  session-replay-analysis.md (example user-facing quote).
- Add a one-sentence disambiguation guard at the top of SKILL.md Step 1: if
  the user hasn't named a specific experiment, ask before fetching.
- Expand "SRM" on first mention in the description and replace "hasn't
  reached statistical significance" with "isn't showing a clear winner yet"
  for non-expert legibility. Glossary inside SKILL.md still does the heavy
  definition.

After this commit, `grep -rE 'live_|results_cache|exposures_cache|settings\.<…>'
plugins/mixpanel-mcp/skills/interpret-experiment/` returns zero hits.

Sync via make sync-skills FORCE=1; make check-skills-sync passes.

Assisted by Claude
Final pass from the hardcore /review-skill audit.

- per-metric-interpretation.md: collapse "Significance = NO does NOT mean
  'no effect'" (was a 14-line duplicate of why-no-statsig.md's options list)
  into 3 lines with a back-reference. Same for "Frequentist vs Sequential —
  what affects per-metric reading" (was 8 lines duplicating
  health-check-interpretation.md §4) → 4 lines with a back-reference.
- health-check-interpretation.md §7 Misconfigurations: drop the
  When/Interpretation/Action triple-label scaffolding. Each of the 7 sub-
  sections is now a single bold "condition" sentence opening a single
  paragraph of consequence + action. Same information, ~25 lines lighter.

Skill total: 988 → 950 lines (-38). health-check-interpretation.md:
206 → 178 (-14%). per-metric-interpretation.md: 181 → 169.

Sync via make sync-skills FORCE=1; make check-skills-sync passes.

Assisted by Claude
Final micro-pass from the third hardcore /review-skill audit. Four surgical
edits; ~6 lines of net change.

- SKILL.md polarity recipe: drop the lingering `settings.controlKey`
  reference ("use settings.controlKey" → "the platform marks which variant
  is control"). Same fix in per-metric-interpretation.md's tier table for
  the surviving `multipleTestingCorrection` reference.
- why-no-statsig.md output shape: drop the "which fields told you" phrasing,
  which undid phase-2 right at the moment of summary. The example numbers
  stay; the field-citation framing goes.
- SKILL.md step 1: add one sentence to the disambiguation guard naming the
  identifier-matching convention (ID first, then case-insensitive name).
- health-check-interpretation.md and per-metric-interpretation.md: drop the
  duplicate "Never recompute thresholds" preamble paragraph — the rule lives
  in SKILL.md and is loaded with the spine. The references no longer need to
  restate it.

`grep -rE 'live_|results_cache|exposures_cache|settings\.<…>|multipleTestingCorrection'
plugins/mixpanel-mcp/skills/interpret-experiment/` returns zero hits.

Sync via make sync-skills FORCE=1; make check-skills-sync passes.

Assisted by Claude
elliotrfeinberg added a commit that referenced this pull request Jun 9, 2026
Proactively applies the same themes @gslopez raised on PR #23
(`experiment-results` → `interpret-experiment`) to this skill.

- Rename: `experiment-setup` → `design-experiment` (verb-noun, matches
  `create-dashboard` / `manage-lexicon` / `interpret-experiment`).
- Restructure SKILL.md into intro / Glossary / Components / Steps shape.
  Glossary front-loads Variant, Primary/Guardrail/Secondary, Direction, MDE,
  Power, Underpowered, CUPED, Winsorization, multiple-testing correction,
  sequential vs frequentist. Components defines sizing formulas, the >5%
  guardrail hard-gate, and the pre-launch pitfall catalogue once.
- Strip field-path leaks everywhere: `settings.testingModel`,
  `settings.endCondition`, `settings.confidenceLevel`,
  `settings.multipleTestingCorrection`, `settings.cuped.*`,
  `settings.winsorization.*`, `len(primary_metrics)`, and the internal pitfall
  `kind` enums (`underpowered_duration_insufficient`, `cohort_too_small`,
  etc.) are now plain language.
- Strip the `pitfall_prose.py` cross-reference from pitfalls.md — that
  file lives in the analytics repo and isn't accessible to the skill.
- Hedge unsourced defaults: 95th percentile Winsorization, 10k per-variant
  default, ~350-400 floor, 0.95 confidence level all now read "verify in
  product" / "platform default".
- Drop "Open this when…" / "All four properties matter" preambles from
  hypothesis-framing, metric-selection, prior-experiments.
- Add irreversible-action confirmation: SKILL.md step 8 explicitly waits
  for user confirmation before creating the experiment.
- Add disambiguation guard + human-friendly identifier guidance.
- Fix `manage-feature-flags` cross-reference in routing-xp-vs-ff.md to
  point at the actual `feature-flags` skill name.

Cross-PR note: the `interpret-experiment` references in PR #23 link to
this skill as `experiment-setup` in three places. Will need a follow-up
sed after both PRs merge.

`grep -rE 'settings\.|len\(.*\)|pitfall_prose' design-experiment/` returns
zero hits.

Sync via make sync-skills FORCE=1; make check-skills-sync passes.

Assisted by Claude
…eriment

The setup-side skill was renamed from `experiment-setup` to
`design-experiment` on its own PR (#24) to follow the
verb-noun convention. Update this skill's cross-references to match.

Sites updated:
- SKILL.md description's negative-trigger sentence
- references/why-no-statsig.md (two mentions)

Sync via make sync-skills FORCE=1; make check-skills-sync passes.

Assisted by Claude
elliotrfeinberg added a commit that referenced this pull request Jun 9, 2026
…Components/Steps

Apply the same themes of feedback from PR #23 (interpret-experiment) and
PR #24 (design-experiment) to the feature-flags skill:

- Rename `feature-flags` → `manage-feature-flags` to match the verb-noun
  convention of `create-dashboard`, `manage-lexicon`, `interpret-experiment`,
  `design-experiment`.
- Restructure SKILL.md from a 3-spine shape (Routing/Lifecycle/Hygiene) into
  the canonical Glossary / Components / Steps shape used by manage-lexicon
  and design-experiment. Steps run in order; vocabulary is defined up front.
- Strip field-path leaks throughout — references no longer document
  `flagType: "feature_gate"`, `rolloutPercentage`, `status: "enabled"`,
  `ruleset.X` field paths. Those belong to tool docs, not the skill.
- Fix stale cross-references: `experiment-results` → `interpret-experiment`
  and `experiment-setup` → `design-experiment` to match the sibling PRs.
- Add irreversible-action confirmation discipline for archive: identify the
  flag by human-friendly name (not UUID), state the consequence, confirm
  SDK call sites are gone, get explicit yes — for bulk archives too.
- Hedge defaults and compress preambles to match the editorial standard set
  by PR #24.
- Drop the setup field reference and update field reference tables (those
  belong to tool docs).

README index updated. Synced to mixpanel-mcp-eu and mixpanel-mcp-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…wn-interpretation

The disclaimer about per-segment platform support was the second
paragraph, separating the file's purpose from its content with five
lines of caveats. Moved to a "Platform support status" section at the
end of the file so the reader hits the mental model immediately.

Synced to mixpanel-mcp-eu and mixpanel-mcp-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elliotrfeinberg

Copy link
Copy Markdown
Contributor Author

Closing in favor of the umbrella experiment skill: #30

elliotrfeinberg added a commit that referenced this pull request Jun 17, 2026
* Add feature-flags skill

Authors a single home for feature-flag expertise: picking the right
flag-shaped tool (Feature Gate vs Dynamic Config vs Experiment),
naming and keying, staged rollouts, the kill switch, exposure
debugging, archive/restore semantics, and SDK call patterns.
Replaces the Get-Feature-Flag-Guidance MCP resource + passthrough
tool from MULTI-24.

Structured for progressive disclosure: the spine in SKILL.md covers
the 3-way routing decision, the 5-step lifecycle, and the
List-Feature-Flags-first hygiene rule. Deep-dive references cover
setup, staged rollout, hygiene/cleanup, SDK/exposure debugging, the
status state machine + Update-Feature-Flag call shapes, and
experiment-linked flag handling. Three eval fixtures guard the most
common failure modes: routing ambiguity, disabled-flag zero
exposures, and stale-flag triage.

Content ports from the canonical
ai/copilot/agent/tools/feature-flags/{setup,lifecycle}-guidance.md
in mixpanel/analytics. Loads via the progressive-discovery skill
pattern rather than the readFileSync .md pattern MULTI-24 used.

Synced to mixpanel-mcp-eu and mixpanel-mcp-in via make sync-skills.

Linear: https://linear.app/mixpanel/issue/MULTI-583

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Remove MCP tool name references and delete eval fixtures

- Replace explicit MCP tool names (Create-Feature-Flag, Update-Feature-Flag,
  Get-Feature-Flag, List-Feature-Flags, Update-Experiment) with
  agent-agnostic phrasing per the convention established in #22.
  Skills describe actions ("update the flag's status", "list the project's
  flags") rather than specific MCP tool calls.
- Drop evals/ directory — this repo doesn't run evals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Trim SKILL.md by deferring duplicated content to references

The spine is always loaded; references are lazy. Trim spine content
that duplicated reference material:
- Collapse the 5 lifecycle steps from multi-paragraph each to one
  numbered list with single-sentence-each-plus-pointer (details still
  in staged-rollout.md / lifecycle-and-state-machine.md /
  hygiene-and-cleanup.md).
- Tighten Spine 3 (prior-work check) to one paragraph; the three
  cases are still enumerated in hygiene-and-cleanup.md.
- Drop the 11-line "Common pitfalls" cheat sheet at the bottom —
  every pitfall is already covered in the relevant reference, and
  the Quick lookups table routes the agent there.

SKILL.md: 161 → 86 lines. Three-spine structure, routing table,
quick-lookups router, and the archive≠kill-switch reminder all
preserved. All 6 references still linked from the spine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address PR review feedback on feature-flags skill

- Replace $feature_flag_called with $experiment_started as the
  exposure event name throughout SKILL.md and sdk-and-exposure.md.
- Fix newly-enabled flag rollout default: starts at 0%, not 1.0;
  requires a manual bump.
- Soften "server-side changes can go faster" to require sufficient
  monitoring before accelerating ramps.
- Add short-lived feature-gate flags to the healthy-project list
  in hygiene-and-cleanup.md.
- Require a codebase reference check before disabling or archiving
  any flag, to catch flags the SDK still reads in production.
- Rewrite SDK call shape section to describe the two read entry
  points (value-only, full variant) without language-specific
  code snippets; defer per-language signatures to docs.mixpanel.com.
- Clarify fallback semantics include the not-in-rollout case.
- Drop the serving_method field from the setup reference and the
  diagnostic checklist since it is not togglable in the UX.

* Add negative-trigger guidance to feature-flags description; sync variants

The exclusions for experiment-results and experiment-setup were buried
in the body; the routing pass only sees the description. Moving them up
prevents the skill from being picked for "should we ship?" or "what MDE
can I detect?" questions.

Also: variants were drifted from mixpanel-mcp in the PR baseline
(SKILL.md, hygiene-and-cleanup.md, routing-and-setup.md,
sdk-and-exposure.md). Resync to make check-skills-sync pass in CI.

Assisted by Claude

* Rename feature-flags → manage-feature-flags; restructure to Glossary/Components/Steps

Apply the same themes of feedback from PR #23 (interpret-experiment) and
PR #24 (design-experiment) to the feature-flags skill:

- Rename `feature-flags` → `manage-feature-flags` to match the verb-noun
  convention of `create-dashboard`, `manage-lexicon`, `interpret-experiment`,
  `design-experiment`.
- Restructure SKILL.md from a 3-spine shape (Routing/Lifecycle/Hygiene) into
  the canonical Glossary / Components / Steps shape used by manage-lexicon
  and design-experiment. Steps run in order; vocabulary is defined up front.
- Strip field-path leaks throughout — references no longer document
  `flagType: "feature_gate"`, `rolloutPercentage`, `status: "enabled"`,
  `ruleset.X` field paths. Those belong to tool docs, not the skill.
- Fix stale cross-references: `experiment-results` → `interpret-experiment`
  and `experiment-setup` → `design-experiment` to match the sibling PRs.
- Add irreversible-action confirmation discipline for archive: identify the
  flag by human-friendly name (not UUID), state the consequence, confirm
  SDK call sites are gone, get explicit yes — for bulk archives too.
- Hedge defaults and compress preambles to match the editorial standard set
  by PR #24.
- Drop the setup field reference and update field reference tables (those
  belong to tool docs).

README index updated. Synced to mixpanel-mcp-eu and mixpanel-mcp-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address review minors: hedge cadence default, add flag-lookup discipline, trim filler

Apply the 4 issues surfaced by the hardcore /review-skill pass on
manage-feature-flags:

- D7 (Defaults are sourced): The `1% → 10% → 50% → 100%` cadence was stated
  as if Mixpanel imposed it. Reframe as a recommended default that works for
  most teams, with an explicit "calibrate to product risk and the monitoring
  you actually have" hedge. Applied in SKILL.md step 5 and staged-rollout.md.
- D6 (Human-friendly identifiers): Add a "Identifying flags the user mentions"
  note to SKILL.md telling the agent to match by key first, then case-insensitive
  name, and to ask for disambiguation rather than UUID.
- D8 (Concise): Trim the "Why incremental rollout" preamble in
  staged-rollout.md (the agent already knows the concept). Trim the
  "healthy project has..." rhetorical scaffolding in hygiene-and-cleanup.md.
- D8 (Concise): Tighten the auto-key generation subsection in
  routing-and-setup.md from 4 details to the 2 that actually surprise users.

Total: 614 → 604 lines. Synced to mixpanel-mcp-eu and mixpanel-mcp-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Clarify the opening of two reference files

Two PR25 reference files opened with mechanics instead of stating what
the file is. Add a one-line "this is X" framing before the first
substantive paragraph so the reader knows what they'll get:

- lifecycle-and-state-machine.md: "This is the flag state machine plus
  the rules for which update calls preserve which fields."
- experiment-linked-flags.md: "This is the playbook for flags governed
  by an experiment — how to spot one, which edits the experiment will
  overwrite, and when to hand off."

Synced to mixpanel-mcp-eu and mixpanel-mcp-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* PR #25 polish + stale-ref sweep

/review-document feedback on PR #25:

- D1: staged-rollout.md cadence table read as authoritative. Added a
  one-line "wait times are conservative defaults; calibrate to your
  monitoring latency and product risk" hedge below the table.
- D3: hygiene-and-cleanup.md "Before any new flag" section duplicated
  SKILL.md step 2. Replaced with a stub that points at the canonical
  version, since hygiene-and-cleanup primarily addresses *existing*
  flag cleanup.
- D7: lifecycle-and-state-machine.md ASCII state diagram had
  inconsistent label widths. Normalized the transition labels.

Cross-PR stale-ref sweep:

- 9 references to `interpret-experiment` / `design-experiment` across
  SKILL.md (description, when-to-use, components, routing table, mis-
  routing rules) and references/experiment-linked-flags.md (4 spots).
- All updated to point at the umbrella skill `manage-experiment`,
  which is what outside callers should know about. The dual
  design-vs-interpret distinction is internal to that skill's
  subcommands.

Synced to mixpanel-mcp-eu and mixpanel-mcp-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* manage-feature-flags: address skill + reference review findings

Skill review:
- Trim description from 1087 to 958 chars (was over the 1024 spec limit)
- De-duplicate the terminal-states enumeration: SKILL.md step 8 now a
  spine + link; full triage lives only in hygiene-and-cleanup.md
- Convert lateral reference->reference links to plain prose so each
  reference is one hop from SKILL.md
- Hedge unsourced product/implementation claims (split tolerance,
  server-side enforcement, update-routing behavior, experiment
  transition table)

Reference reviews:
- Fix broken external docs link (/docs/feature-flags 404 -> /docs/featureflags)
- Collapse the threefold operational/shipped triage in hygiene-and-cleanup.md
- Trim mid-stage ship/hold/roll-back overlap with SKILL.md in staged-rollout.md
- Clarify that an experiment-managed conclude preserves bucketing,
  unlike a manual flag disable
- Add a concrete Dynamic Config variant-value example

Synced across mixpanel-mcp{,-eu,-in}; make check-skills-sync passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* manage-feature-flags: add exposure ingestion-lag caveat to no-exposures checklist

Address review feedback that an LLM may not know how long to wait before
treating a zero exposure count as a real failure. Note that exposure
events are not instant, so the agent should allow a few minutes after
enabling (and after real traffic hits the code path) before working the
diagnostic checklist.

Synced across mixpanel-mcp{,-eu,-in}; make check-skills-sync passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* manage-feature-flags: correct exposure ingestion-lag window to up to 24h

Per Mixpanel docs, events usually surface in seconds-to-minutes but the
general ingestion delay can reach ~24h with SDK batching, server-side
buffering, or third-party pipelines. Point to Live View to confirm
real-time arrival rather than waiting on the report.

Synced across mixpanel-mcp{,-eu,-in}; make check-skills-sync passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
elliotrfeinberg added a commit that referenced this pull request Jun 17, 2026
* New skill: manage-experiment (umbrella over design + interpret)

Combines the soon-to-land `design-experiment` (PR #24) and
`interpret-experiment` (PR #23) skills into a single `manage-experiment`
umbrella, modeled on `manage-lexicon`. One front door for any
experiment-related phrase; subcommands for the two phases.

## Why

Today an agent has to disambiguate between two skills on every
experiment turn, and borderline phrases ("audit my experiment",
"check on experiment X") route inconsistently. Folding both into one
umbrella with `design` and `interpret` subcommands:

- Gives a single trigger surface for the agent to route to.
- Lets ambiguous phrases use phase-derived disambiguation (DRAFT →
  design; ACTIVE/CONCLUDED → interpret).
- Removes the cross-skill reference coupling that currently requires
  coordinated edits across two PRs whenever either skill renames.
- Reserves room to add `launch` and `monitor` commands later without
  another rename.

## Shape

```
manage-experiment/
├── SKILL.md              (umbrella: shared glossary, command menu, project + experiment resolution)
├── commands/
│   ├── design.md         (former design-experiment SKILL.md body)
│   └── interpret.md      (former interpret-experiment SKILL.md body)
└── references/           (15 references, flat; no name collisions)
```

The two command files preserve their original prose verbatim except for:
- Removing the now-redundant frontmatter and opening "You are helping…"
  paragraph.
- Lifting umbrella-able glossary terms (Variant, Primary/Guardrail/
  Secondary, Direction, Lift, MDE, CUPED, Winsorization, Multiple-
  testing correction) up to the umbrella; keeping phase-specific terms
  (Hypothesis, Power, Underpowered, Sequential vs Frequentist for
  design; Polarity, Significance, SRM, Retro A/A, Twyman's Law,
  Trustworthiness gate for interpret) in their command files.
- Adjusting reference paths to `../references/`.

The 15 reference files are unchanged except for two intra-skill links
in `why-no-statsig.md` that previously pointed at the `design-experiment`
skill — now point directly at `sizing.md` (where the formulas live).

The 3 references to `manage-feature-flags` (in SKILL.md, design.md, and
routing-xp-vs-ff.md) are correct cross-skill pointers and unchanged.

## Sequencing

This PR is opened **before** PR #23 and PR #24 merge. Once those land,
this PR will need a rebase against master that deletes the now-existing
`design-experiment/` and `interpret-experiment/` directories. Per the
plan, doing the umbrella as a separate PR keeps the review focused on
the routing pattern rather than tangled into either skill's prose review.

The `manage-feature-flags` skill (PR #25) cross-references both old
skill names; once both PRs land, those cross-refs will need a sweep
to point at `manage-experiment`.

README index updated: one `manage-experiment` row replaces the two
prospective `design-experiment` and `interpret-experiment` rows.

Synced to mixpanel-mcp-eu and mixpanel-mcp-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address review minors: add reference index, clarify routing fall-through, dedupe match logic

Applies the 3 issues surfaced by the /review-skill pass on manage-experiment:

- D4 (Centralised logic): the umbrella SKILL.md had no map of what
  references/ contains. Add a "Reference files" table to SKILL.md
  listing each file, which command uses it, and its purpose. Mirrors
  the shape used by manage-feature-flags.
- D7 (Qualified recommendations): the phase-derived rule in step 3
  ("DRAFT → design, ACTIVE/CONCLUDED → interpret") read as if it
  always applied, but it only does when an experiment is in context.
  Restate step 3 as ordered rules with explicit fall-through, and
  add a row for ambiguous verbs like "audit" / "check" / "review".
- D6 (Human-friendly identifiers): the "try ID first, then
  case-insensitive name match" rule was stated in three places (umbrella
  step 2, design step 8, interpret step 1). Drop the duplicates from
  the command files — interpret hands back to the umbrella, design
  retains only the "ask if not named" guidance (the lookup mechanics
  don't apply to free-text feature names).

Synced to mixpanel-mcp-eu and mixpanel-mcp-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix Winsorization percentile convention: agent passes tail-width (5), not cap-point (95)

PR #30 review caught an agent-actionable accuracy bug. The skill docs
described Winsorization as "default 95 (cap top/bottom 5%)" — natural
phrasing for a human, but the actual `Update-Experiment` tool schema
defines `WinsorizationSettingsInput.percentile` as the **tail width to
cap on each side**, with default `5` and `exclusiveMaximum: 50`. An
agent reading the docs would pass `95`, get rejected by validation, or
misinterpret as inverted Winsorization.

Updated 6 files to match the schema's convention:

- SKILL.md shared glossary
- commands/design.md step 6 (advanced features)
- references/advanced-features.md "What it does", "How to enable",
  "Percentile guidance" (default flipped from "95th" to `percentile=5`,
  "push back below ~80" flipped to "push back above ~20")
- references/health-check-interpretation.md "Extreme winsorization
  percentile" misconfiguration check
- references/per-metric-interpretation.md "Variance-reduction & outlier
  settings that change interpretation"
- references/pitfalls.md "High variance, no Winsorization" warning

The numeric guidance ("push back if more than 20% of each tail is
capped") is preserved; only the convention name flipped to match what
the agent actually sends through the tool.

Synced to mixpanel-mcp-eu and mixpanel-mcp-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add launch + monitor commands to manage-experiment (#31)

* Add launch and monitor commands to manage-experiment

Extends the manage-experiment umbrella (PR #30) from 2 commands
(design, interpret) to 4 commands covering the full experiment
lifecycle. Stacks on top of PR #30 so the umbrella structure and
this extension can be reviewed separately.

## New commands

### commands/launch.md (~125 lines)

Owns the irreversible DRAFT → ACTIVE transition. Runs the final
launch-readiness check (lifted from design step 7 — that step now
catches design-time issues only) and performs the launch with
explicit CONFIRM. Includes the post-launch handoff: recommend a
tracking dashboard + a calendar reminder for monitor.

### commands/monitor.md (~115 lines)

Mid-flight safety checks. Answers "is it safe to keep this running?"
which is distinct from interpret's "did it work?" Covers:
- The peek-safety table: what's safe to look at mid-flight (SRM,
  sample pace, guardrails, Sequential primaries) and what isn't
  (Frequentist primaries — the peeking trap).
- The terminate-early decision rules: trustworthiness failure,
  guardrail regression beyond tolerance, Sequential boundary crossed.
- An explicit don't-peek-on-Frequentist pushback for users who ask
  to see primary results mid-flight.

## Knock-on changes

- umbrella SKILL.md: 4-row commands table, 5-option menu, phase-derived
  routing extended to DRAFT-design vs DRAFT-launch and ACTIVE-monitor
  vs ACTIVE-interpret. Reference table "Used by" columns updated
  (pitfalls → design+launch, health-check → monitor+interpret,
  sizing → design+monitor+interpret).
- commands/design.md step 7: reframed as "design-time sanity check"
  rather than "the pre-launch check". The full launch-readiness gate
  now lives in launch.md. Step 8 saves a DRAFT (reversible) rather
  than creating-and-launching. Added step 9: hand off to launch.
- design.md opening line: updated to say "stops at DRAFT" instead
  of "don't create until confirmed".

Synced to mixpanel-mcp-eu and mixpanel-mcp-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address review issues: lift cross-command policies to umbrella, unify emoji, hedge cross-skill refs

Applies the 5 issues surfaced by the /review-skill pass on the 4-command
manage-experiment:

## Major (D4)

The 5% guardrail hard-gate and the peek-safety table were duplicated
across design / launch / monitor — three places to keep load-bearing
thresholds in sync.

- Add a "Cross-command policies" section to the umbrella SKILL.md that
  owns the guardrail hard-gate threshold and the peek-safety table.
  Also moves the emoji convention there so all commands share one
  vocabulary.
- design.md "The >5% guardrail hard-gate" → "Guardrails (the hard-gate
  enforced downstream)" — keeps the design-time guidance, defers the
  threshold to the umbrella.
- design.md step 5: "peeking is safe by design" → "peek-safety table
  in the umbrella's Cross-command policies covers why".
- launch.md readiness checklist: ">5% hard-gate from the design command"
  → "regression hard-gate (see umbrella Cross-command policies)".
- monitor.md Components: drop the entire "What's safe to peek at"
  table, reference the umbrella. Glossary defers "peeking trap" to
  the umbrella. Terminate-early rules cite the umbrella threshold.

## Minor (D3)

design.md output-style section had a duplicate hand-off line
restating what step 9 already covers — removed.

## Minor (D7)

launch.md step 5 dashboard recommendation was unqualified — added
"recommend running create-dashboard in a follow-up session" so the
agent doesn't try to invoke it inline and interrupt the launch flow.

## Suggestion (D3)

launch.md irreversibility section mentioned "pause/resume via the
underlying feature flag" without naming the skill — added explicit
"handled by the manage-feature-flags skill" cross-ref.

## Emoji convention (D6)

Declared once in the umbrella's Cross-command policies. Existing
emoji usage across design/launch/monitor already matches; interpret
has no symbolic checklist output (its output is prose verdict + per-
metric breakdown), which is correct.

## Knock-on numbers

Skill core: 692 → 720 lines (umbrella +38 with the new policies
section, monitor -10 from removed table, others flat).

Synced to mixpanel-mcp-eu and mixpanel-mcp-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* PR #31 polish: define peeking trap in umbrella, add calendar-reminder example

/review-document feedback on PR #31:

- D6: monitor.md deferred the "peeking trap" definition to the umbrella's
  Cross-command policies, but the umbrella's peek-safety section only
  showed consequences and never defined the term. Added a one-line
  definition before the table.
- D8: launch.md step 5 calendar-reminder recommendation was abstract.
  Added concrete phrasing the agent can hand to the user so it doesn't
  improvise the wording inconsistently across sessions.

Synced to eu/in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* manage-experiment: port statsig diagnosis depth + align to skill rubric

Port the three pieces from the analytics monorepo's statsig-diagnosis
guidance (tal-multi-546) that PR #30's why-no-statsig.md lacked, then
reconcile them with this repo's skill quality rubric.

Port:
- sizing.md: relative, unit-safe achievable-MDE formulas (MDE_relative,
  n_required) with the sigma=sqrt(sigma^2) caution, framed as explaining
  the NO verdict rather than overriding it.
- why-no-statsig.md: a "pull the diagnostic inputs first" step and a
  priority-ordered "name the single most-likely blocker" table so the
  answer is one blocker with numbers, not a generic option list.

Rubric alignment (review-skill: 60% capped -> ~98%):
- Reframe the diagnostic-inputs step to intent (no named tool call or
  response field paths) so it stays engine-agnostic, lifting the cap.
- Fix advanced-features.md winsorization percentile contradiction
  (tail-width 5 convention vs stale 95/90 boundary references).
- Fix dead/stale cross-references in per-metric-interpretation.md and
  routing-xp-vs-ff.md to point at sections/commands that exist.
- Add a Contents block to every 100+ line reference.
- Collapse the state->command mapping duplicated 3x in SKILL.md (now 200).

Synced byte-identical across mixpanel-mcp / -eu / -in.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* manage-experiment: align prior-experiments lookup with overlap-ranking tool

The Search-Prior-Experiments engine tool (analytics #95682) ranks stored
experiments by overlap on the draft's metric set, flag key, and hypothesis
rather than free-text keyword matching. Update the design command and the
prior-experiments reference so they steer the agent to hand the tool the
draft it's building instead of constructing single-keyword queries.
Engine-agnostic wording — describes the signals, not tool names or field
paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* manage-experiment: trim description under the 1024-char spec limit

The description was 1326 chars (over the Agent Skills 1024 limit) because it
enumerated ~15 verbatim trigger phrasings. Trim to a representative handful
and fold Bonferroni/Benjamini-Hochberg into "multiple-testing correction";
the Canonical commands table in the body already carries the full keyword
set. Now 921 chars. Synced across all three plugin variants.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* manage-experiment: address skill-review findings across SKILL + references

Fixes from the review-skill rubric pass and the per-reference review:

- DRY/portability: de-link all reference->reference chains so references
  stay one level deep; strip tool-docs leakage (significance enum,
  summary.* buckets, liftConfidence, decide constants) to intent-level.
- Centralise the Winsorization >~20% push-back rule in the SKILL.md
  glossary; centralise the polarity recipe in the interpret command;
  both now referenced rather than restated.
- Source defaults: qualify the Benjamini-Hochberg platform default,
  CUPED 30-70% range, and the `up` direction default with "verify".
- Accuracy fixes in references: correct the family-wise FPR claim
  (3-arm -> 4-arm), drop in-file Winsorization duplication, fix the
  "constants"->"choices" dangling referent in lifecycle-handoff,
  soften absolutist hypothesis-framing phrasing, de-changelog the
  segment-breakdown platform-status note.

Applied identically across the mixpanel-mcp, -eu, and -in variants.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* manage-experiment: address CRUD QA findings

Live CRUD QA against the Mixpanel MCP surfaced gaps where the skill
instructs behavior the tools handle differently than assumed:

- Settings full-replace footgun (High): the experiment-update tool's
  `settings` payload is a full replace, so a one-field edit silently
  nulls `srm`, `excludeQA`, etc. Added a read-merge-write Behaviour
  rule to SKILL.md, a pointer in design step 8, and a launch-readiness
  warning row for dropped `srm` / `excludeQA`.
- Metric `direction` not settable via inline metric creation (Medium):
  metrics default to `up`. Added a caveat in design step 3 to set
  `down` in the UI for down-polarity metrics and verify before launch.
- Archive doesn't cascade to the backing flag (Low): noted in
  lifecycle-handoff that the auto-created flag is left behind.
- Generic experiment-listing errors (Low): Behaviour rule 2 now covers
  the experiments-not-enabled failure case.

Synced across mixpanel-mcp / -eu / -in.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* manage-experiment: address review feedback (routing dedupe + glossary concision)

Resolves @zoeabrams-eng's two inline review threads on SKILL.md:

- Routing duplication: "Pick the command" (step 3) no longer restates
  the explicit/implicit/phase-derived rules already in "Canonical
  commands" — it now references the table and mapping and applies them.
- Glossary verbosity: trimmed the Winsorization entry to a one-liner
  plus a pointer to advanced-features.md (which already holds the full
  push-back rule), and tightened the Direction / CUPED / Lift /
  multiple-testing entries.

Repointed the two remaining "umbrella glossary's push-back rule"
references (design step 6, pitfalls.md) so the rule's canonical home is
advanced-features.md; kept pitfalls' mention inline to avoid a
reference->reference link chain.

Synced across mixpanel-mcp / -eu / -in.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* manage-experiment: note metric direction (polarity) is editable post-setup

Metric direction/polarity is a property of the saved metric and can now be
corrected after setup via the metric-update tool, without recreating the
metric or experiment. Update the skill to reflect this:

- design: setting `down` no longer requires the Mixpanel UI — the
  metric-update tool works on the draft
- glossary, metric-selection, launch readiness, interpret: note the
  in-place fix; full statement centralized in the Direction glossary entry,
  others trimmed to brief pointers
- use the skill's descriptive "metric-update tool" voice instead of the
  literal Update-Metric tool ID

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Elliot Feinberg <5232369+elliotrfeinberg@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants