From cc61a50f10758301c210eefdc8b08c383b996928 Mon Sep 17 00:00:00 2001 From: bertilhatt Date: Sat, 21 Mar 2026 19:41:01 +0100 Subject: [PATCH 1/2] docs: address documentation gaps from three-year client support audit MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds callouts, warnings, and new sections across 25 documentation pages to address gaps identified during a systematic three-year audit of ~75 client Slack channels. Changes include: - Diagnostic query approximation callout (excludes CUPED++, winsorization, mixed-assignment) - Run Log as source of truth when Diagnostics is insufficient - Backfill triage: Eppo pipeline failed (no backfill) vs upstream data fixed (full refresh) - Entry point replaces assignment timestamp; filter-only alternative via Assignment SQL - SSC does not account for CUPED++ (qualify run-time estimates) - Self-service warehouse table cleanup (Admin → Pipeline Update Schedules) - SA migration checklist for warehouse connection pages - ConfigurationStore environment isolation warning (iOS/Android) - Holdout allocation band fragmentation and date immutability - Flag archival irreversibility and key reuse - JSON variation type coercion (int → double) - Layer parameter lock on active layers - Winsorization truncation (not discard) and Diagnostic exclusion - Assignment logging required fields and experiment field inclusion - Global Lift unique entity metric non-additivity caveat - Interaction effects limited to Simple metrics - Sequential test three-trigger readiness model - CDN stale-if-error 429 gap - SSO SP-initiated only constraint Made-with: Cursor --- docs/administration/authentication.md | 4 + .../connecting-dwh/bigquery.md | 11 ++ .../connecting-dwh/redshift.md | 11 ++ .../connecting-dwh/snowflake.md | 11 ++ docs/data-management/data-pipeline.md | 22 ++- .../filter-assignments-by-entry-point.md | 8 ++ .../configuration/index.md | 2 +- docs/experiment-analysis/diagnostics.md | 27 ++-- 
.../reading-results/global-lift.md | 8 +- .../reading-results/progress-bar.md | 14 +- .../concepts/flag-variations.md | 4 + .../concepts/holdout-config.md | 10 ++ .../concepts/mutual_exclusion.md | 4 + docs/feature-flagging/concepts/targeting.md | 3 + docs/feature-flagging/index.md | 4 + .../interaction-effects-asymmetry.md | 130 ++++++++++++++++++ .../running-well-powered-experiments.md | 14 ++ docs/guides/debugging-metrics.md | 16 +++ docs/sdks/architecture/deployment-modes.md | 10 ++ .../client-sdks/android/initialization.mdx | 4 + docs/sdks/client-sdks/ios/initialization.mdx | 4 + docs/sdks/event-logging/assignment-logging.md | 11 ++ docs/statistics/confidence-intervals/index.md | 4 + .../sample-size-calculator/usage.md | 6 + 24 files changed, 327 insertions(+), 15 deletions(-) create mode 100644 docs/guides/advanced-experimentation/interaction-effects-asymmetry.md diff --git a/docs/administration/authentication.md b/docs/administration/authentication.md index eb54f62e..ccf77da8 100644 --- a/docs/administration/authentication.md +++ b/docs/administration/authentication.md @@ -18,3 +18,7 @@ Eppo supports the following enterprise authentication options: - OpenID Connect Follow the guides linked above or reach out to your Eppo team if you would like one of these options configured for your users. + +:::info SSO login flow +Eppo supports **SP-initiated** (Service Provider-initiated) SSO login only. Users must start the login flow from the Eppo login page (`eppo.cloud`), not from the identity provider's app dashboard. IdP-initiated login (clicking the Eppo tile in Okta, Azure AD, etc.) can result in a login loop and is not supported. 
+::: diff --git a/docs/data-management/connecting-dwh/bigquery.md b/docs/data-management/connecting-dwh/bigquery.md index dabd5ce7..035be121 100644 --- a/docs/data-management/connecting-dwh/bigquery.md +++ b/docs/data-management/connecting-dwh/bigquery.md @@ -76,3 +76,14 @@ Now that you have a proper Service Account created for Eppo with adequate privil ### Updating Credentials Credentials can be updated at any time within the Admin panel of the app. + +### Rotating service accounts + +When switching to a new service account: + +1. **Grant the new service account the same IAM roles** as the existing one — at minimum, BigQuery Data Viewer on the datasets referenced in your definitions, plus BigQuery Data Editor on the `eppo_output` dataset. +2. **Upload the new service account key** in the Eppo Admin panel and click "Test Connection" to verify. +3. **Trigger a test refresh** on one experiment to confirm the pipeline runs end-to-end. +4. **Revoke the old service account key** only after verifying the new one works. + +If you see permission errors after switching, the most common cause is missing IAM grants on the new service account. diff --git a/docs/data-management/connecting-dwh/redshift.md b/docs/data-management/connecting-dwh/redshift.md index c0c1b4bd..d6e7c5b8 100644 --- a/docs/data-management/connecting-dwh/redshift.md +++ b/docs/data-management/connecting-dwh/redshift.md @@ -169,3 +169,14 @@ Now that you have a proper User created for Eppo with adequate privileges, you c ### Updating Credentials Credentials can be updated at any time within the Admin panel of the app. + +### Rotating service accounts + +When switching to a new service account or database user: + +1. **Grant the new user the same permissions** as the existing one — read access to all schemas and tables referenced in your definitions, plus write access to the `eppo_output` schema. +2. **Update the connection** in the Eppo Admin panel and click "Test Connection" to verify. +3. 
**Trigger a test refresh** on one experiment to confirm the pipeline runs end-to-end. +4. **Revoke old credentials** only after verifying the new account works. + +If you see `Object does not exist or not authorized` errors after switching, the most common cause is missing grants on the new user. diff --git a/docs/data-management/connecting-dwh/snowflake.md b/docs/data-management/connecting-dwh/snowflake.md index 12900934..70f42ebf 100644 --- a/docs/data-management/connecting-dwh/snowflake.md +++ b/docs/data-management/connecting-dwh/snowflake.md @@ -148,3 +148,14 @@ MIIFJDBWBg... ### Updating Credentials Credentials can be updated at any time within the Admin panel of the app. + +### Rotating service accounts + +When switching to a new service account (e.g., rotating credentials or migrating to a different account): + +1. **Grant the new service account the same permissions** as the existing one. At minimum, the new account needs read access to all schemas and tables referenced in your Fact SQL and Assignment SQL definitions, plus write access to the `EPPO_OUTPUT` schema (or equivalent). +2. **Update the connection** in the Eppo Admin panel with the new credentials and click "Test Connection" to verify. +3. **Trigger a test refresh** on one experiment to confirm the pipeline runs successfully end-to-end with the new account. +4. **Revoke the old credentials** only after verifying the new account works correctly. + +If you see `Object does not exist or not authorized` errors after switching, the most common cause is missing grants on the new service account. Mirror all grants from the old account before removing it. diff --git a/docs/data-management/data-pipeline.md b/docs/data-management/data-pipeline.md index 269812ae..92511083 100644 --- a/docs/data-management/data-pipeline.md +++ b/docs/data-management/data-pipeline.md @@ -24,12 +24,24 @@ Note that the y axis shows the compute time accrued by that task type. 
That is, ### Incremental refreshes -Eppo's scheduled jobs will run an incremental refresh that only scans recent data. By default, this lookback window will include data starting 48 hours before the last successful run (to change this time window, reach out to your Eppo contact or email support@geteppo.com). New metrics and metrics/facts with an updated definition will automatically be backfilled from the start of the experiment. Further, if a job fails on a given day, the next scheduled job will automatically go back and re-run metrics for that day. +Eppo's scheduled jobs will run an incremental refresh that only scans recent data. By default, this lookback window will include data starting 24 hours before the last successful run (to change this time window, reach out to your Eppo contact or email support@geteppo.com). New metrics and metrics/facts with an updated definition will automatically be backfilled from the start of the experiment. Further, if a job fails on a given day, the next scheduled job will automatically go back and re-run metrics for that day. You can also trigger a refresh in the UI by clicking "refresh experiment results" on the metric scorecard. This will compute new metrics from scratch and update existing metrics based on the incremental logic described above. If you'd instead like to force a full refresh and recompute all metrics from the start of the experiment, click "update now" under "results last updated". ![Data pipeline chart](/img/data-management/pipeline/refresh.png) +### When do I need a full refresh or backfill? + +Not every data issue requires a full backfill. Use this decision tree to determine the right action: + +- **Eppo's pipeline failed (e.g., warehouse timeout, permission error) but your underlying data is correct:** You generally do **not** need a backfill. The incremental lookback window (default 24 hours) will automatically re-process the missed period on the next successful run. 
Verify the next scheduled run completes successfully. + +- **Your upstream data was wrong and has now been corrected (e.g., a broken ETL was fixed, late-arriving data has landed):** You likely **do** need a full refresh to recompute metrics from the affected date. Trigger a full refresh from the experiment's results page ("update now" under "results last updated"), or use the API: `POST /api/v1/experiment-results/update/{experiment_id}`. This endpoint accepts a `lookback_date` query parameter (ISO 8601 format, e.g. `?lookback_date=2025-06-01T00:00:00Z`) to recompute results starting from a specific date instead of reprocessing the entire experiment. You can also pass `full_refresh=true` to force a non-incremental refresh. + +- **You changed a metric definition or Fact SQL:** New and updated metric definitions are automatically backfilled from the start of the experiment on the next pipeline run (scheduled, triggered via the API, or triggered manually from the UI). No manual action is needed. + +- **You're unsure whether data has been corrected upstream:** Before triggering a full refresh, confirm with your data team that the source tables now contain the correct data for the affected period. A full refresh against still-broken data will not help. + ### Pipeline steps @@ -91,6 +103,14 @@ As we’ve detailed, Eppo doesn’t export individual data from your warehouse. If you have any question about our privacy practices, please reach out. +### Intermediate tables and views + +Eppo creates intermediate tables and views in a dedicated schema (typically `EPPO_OUTPUT`) in your warehouse. Over time — especially in long-running workspaces with many experiments — these can accumulate into thousands of objects. This is expected behavior and does not affect experiment results. + +To manage this, Eppo provides an **automatic warehouse table cleanup** setting. Navigate to **Admin → Pipeline Update Schedules** and enable **"Automatically clean up old warehouse tables"**.
You configure a retention period (e.g., 90 days) — Eppo will then drop any `EPPO_OUTPUT` tables that haven't been updated within that window. The cleanup runs on the 1st of every month. By default, tables used by Explore charts and the Sample Size Calculator are preserved; you can opt in to cleaning those up as well with separate toggles. + +Do not manually drop tables from the `EPPO_OUTPUT` schema — active experiments may depend on them. Use the built-in cleanup automation instead, which only removes tables outside the retention window. + ## Clustered Analysis Pipeline Clustered analysis experiments have a few more steps in the data pipeline to aggregate from the subentity level to the cluster level. See diagram below with additional steps highlighted. diff --git a/docs/experiment-analysis/configuration/filter-assignments-by-entry-point.md b/docs/experiment-analysis/configuration/filter-assignments-by-entry-point.md index 67affba6..122601c5 100644 --- a/docs/experiment-analysis/configuration/filter-assignments-by-entry-point.md +++ b/docs/experiment-analysis/configuration/filter-assignments-by-entry-point.md @@ -6,6 +6,14 @@ For some experiments, subjects are assigned to a variant in one place, but are n Eppo provides the ability to filter an assignment source by an [Entry Point](/statistics/sample-size-calculator/setup#creating-entry-points) (also known as a qualifying event) when configuring an experiment. This ensures that only the subjects assigned to that entry point are analyzed in the experiment, based on the logged events for that entry point. All decisions (inclusion into the experiment, time-framed metrics) are based on the timestamp of the entry point. +:::caution Entry points change when exposure starts, not just who is included +When you add an entry point filter, the **entry point timestamp replaces the assignment timestamp** as the start of each subject's analysis window. 
This means metric events are measured relative to when the subject triggered the entry point, not when they were originally assigned. + +If you only want to filter which subjects are included (without shifting the analysis window), use an Assignment SQL filter or a targeting rule instead, or make sure the entry point and the assignment timestamps match. + +The entry point entity must match the assignment entity for the join to work correctly. +::: + First you’ll need both an assignment source and an entry point source configured. Then, when setting up an experiment, check the box marked “Filter assignments by entry points” in the **Logging & Experiment Key** section: ![Choose assignment SQL](/img/building-experiments/select-assignment-source.png) diff --git a/docs/experiment-analysis/configuration/index.md b/docs/experiment-analysis/configuration/index.md index 3002f85f..08ddbb72 100644 --- a/docs/experiment-analysis/configuration/index.md +++ b/docs/experiment-analysis/configuration/index.md @@ -20,7 +20,7 @@ On the side panel, you'll be prompted to enter some information about the experi 2. The [Entity](/data-management/definitions/entities) on which the experiment was randomized (user, device, workspace, etc.) 3. Which [Assignment Source](/data-management/definitions/assignment-sql) has assignment logs for the experiment 4. An optional [entry point](/statistics/sample-size-calculator/setup#what-is-an-entry-point) on which to filter experiment assignments. This will limit the experiment analysis to subjects (e.g., users) that hit the specified entry point. You can read more about filtering experiment assignments [here](/experiment-analysis/configuration/filter-assignments-by-entry-point). -5. The experiment key of interest. The drop-down will show flags created in Eppo as well as other experiment keys in the selected Assignment Source. If your experiment key does not show up in the drop-down you can also enter it manually. +5. The experiment key of interest. 
The drop-down will show flags created in Eppo as well as other experiment keys in the selected Assignment Source. If your experiment key does not show up in the drop-down you can also enter it manually. Note that experiment keys are **not required to be unique** across experiments — the same key can appear in multiple assignment sources or be reused over time. If you use the API to create or query experiments programmatically, ensure you account for this by also specifying the assignment source or date range to disambiguate. 6. For experiments randomized with Eppo's feature flags, you'll also specify the [Allocation](/feature-flagging/#allocations) you want to analyze (one flag can be used to run multiple experiments) 7. A hypothesis for the experiment. You can also add this later when creating an experiment [report](/experiment-analysis/reporting/experiment-reports) diff --git a/docs/experiment-analysis/diagnostics.md b/docs/experiment-analysis/diagnostics.md index edf4acae..813cebb3 100644 --- a/docs/experiment-analysis/diagnostics.md +++ b/docs/experiment-analysis/diagnostics.md @@ -30,15 +30,7 @@ Validity of experimental results crucially relies on proper randomization of sub ### Traffic imbalance diagnostic -The traffic imbalance diagnostic runs a test to see whether the randomization works as expected and the number of subjects assigned to each variation is as expected. This indicates that there is likely an issue with the randomization of subjects (e.g. a bug in the randomization code), which can invalidate the results of an experiment. - -We run this traffic imbalance test by running a [Pearson’s chi-squared test](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test) with $\alpha = 0.001$ on active variations, using the assignment weights for each variant (default is equal split across variations), which we convert to probabilities. This is also known as the sample ratio mismatch test (SRM). 
We run the test at the more conservative $\alpha = 0.001$ level because this test is not sequentially valid; the more conservative significance level helps us avoid false positives. - -Issues with the traffic allocations can come from many sources; here are some common ones we have seen: - -- There is an issue with the logging assignments (note this could be introduced through latency) -- Traffic allocations are updated in the middle of an experiments; in general, try to avoid changing the traffic allocations during an experiment -- Assignments for one variant (e.g. the control cell) started before assignments to other variants +Eppo runs a test to check whether the number of subjects assigned to each variation matches the expected split (sample ratio mismatch, or SRM). When it doesn’t, there is likely an issue with randomization or assignment logging, which can invalidate experiment results. For a detailed explanation of the test, common causes, and a step-by-step troubleshooting flow, see [Sample Ratio Mismatch](/statistics/sample-ratio-mismatch). ![Example diagnostic for a traffic imbalance in the assignment data](/img/experiments/diagnostics/diagnostics_imbalance.png) @@ -72,9 +64,22 @@ Data quality diagnostics check that experiment data matches what we would expect ### Pre-experiment data diagnostic -Eppo detects when pre-experiment metric averages differ significantly across variations for one or more metrics. Eppo will highlight the top metrics where we see this differentiation. -This issue is most often driven by the non-random assignment of users into variations within the experiment. Consult with your engineering team to diagnose potential issues with your randomization tool. +Eppo detects when pre-experiment metric averages differ significantly across variations for one or more metrics. Eppo will highlight the top metrics where we see this differentiation. 
When the gap is too large to be plausibly due to chance, we flag it so you can investigate before trusting experiment results. + +Possible reasons include: incorrectly specified experiment dates; iterating on a feature (e.g. same split after a buggy build) so Treatment had different pre-experiment exposure than Control; gradual roll-out with the experiment start set to full roll-out; or any [traffic imbalance](#traffic-imbalance-diagnostic) cause (assignment logging, latency, one variant starting before others). For a detailed list of causes and a step-by-step diagnostic process, see [CUPED and significance](/guides/advanced-experimentation/cuped_and_significance#common-causes-for-pre-experiment-imbalance). + + :::info The pre-experiment data diagnostic is only run when CUPED is enabled. This setting can be changed in the Admin panel across all experiments, or on a per-experiment basis in the Configure tab under Statistical Analysis Plan. ::: + +## Understanding diagnostic queries + +Each diagnostic check includes a SQL query that you can copy and run directly in your warehouse to investigate further. However, there is an important caveat: + +:::caution Diagnostic queries are approximations +The SQL queries shown in the diagnostic sidebar are **simplified approximations** of the full experiment pipeline. They do not apply [CUPED++](/statistics/cuped) variance reduction, [winsorization](/statistics/confidence-intervals/#estimating-lift), or mixed-assignment filtering. As a result, running these queries in your warehouse may produce numbers that differ from what Eppo displays on the experiment results page. + +This is expected and does not indicate a bug. The diagnostic queries are designed to help you verify that data exists and joins correctly — not to reproduce the final experiment statistics. 
+::: diff --git a/docs/experiment-analysis/reading-results/global-lift.md b/docs/experiment-analysis/reading-results/global-lift.md index f7269a30..00e8b87d 100644 --- a/docs/experiment-analysis/reading-results/global-lift.md +++ b/docs/experiment-analysis/reading-results/global-lift.md @@ -16,4 +16,10 @@ If your rollout plan would include additional users from the same audience that ::: -Global lift and coverage are currently only available for **sum** and **count** aggregation types. For details on how Global Lift is calculated, see [the Statistics section](/statistics/global-lift). +Global lift and coverage are available for **sum**, **count**, and **unique entity** (count distinct) aggregation types. + +:::info Unique entity metrics and non-additivity +Unique entity (count distinct) metrics are **non-additive**: a user who converts in both the experiment population and the ineligible population is counted once in the global total, not twice, which can invalidate the extrapolation used in the Global Lift calculation. If you are comfortable treating this overlap as negligible, Eppo support can enable Global Lift for these metrics for you. +::: + +For details on how Global Lift is calculated, see [the Statistics section](/statistics/global-lift). diff --git a/docs/experiment-analysis/reading-results/progress-bar.md b/docs/experiment-analysis/reading-results/progress-bar.md index 674122eb..49bf04e2 100644 --- a/docs/experiment-analysis/reading-results/progress-bar.md +++ b/docs/experiment-analysis/reading-results/progress-bar.md @@ -36,6 +36,10 @@ Furthermore, when hovering over a progress bar, additional information about the **Note:** We compute the days remaining using a linear interpolation. This interpolation does not take into account that gathering data usually slows down during an experiment, and so the estimate may be optimistic, especially in the early days of an experiment.
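The linear interpolation described in the note above can be sketched as follows (illustrative only; this is not Eppo's actual implementation):

```python
def estimated_days_remaining(progress, days_elapsed):
    """Linearly extrapolate total runtime from current progress (0.0-1.0)."""
    if progress <= 0:
        return float("inf")  # no progress yet; cannot extrapolate
    estimated_total_days = days_elapsed / progress
    return max(estimated_total_days - days_elapsed, 0.0)

# 40% progress after 6 days extrapolates to 15 days total, so 9 remain.
print(estimated_days_remaining(0.4, 6))  # 9.0
```

Because data accrual typically slows over time, the true remaining time is usually longer than this linear estimate, which is exactly the optimism the note warns about.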
+:::info Progress bar reflects CUPED++ but the Sample Size Calculator does not +The progress bar updates using the actual precision of your running experiment, which includes variance reduction from [CUPED++](/statistics/cuped) if enabled. The [Sample Size Calculator](/statistics/sample-size-calculator/usage), however, does not factor in CUPED++. This means the Sample Size Calculator may predict a longer runtime than is actually needed — experiments with CUPED++ enabled often reach 100% progress significantly earlier than pre-experiment estimates suggest. +::: + ## How to use the progress bar Traditionally, when using a fixed sample test, we decide up front how long the experiment ought to run and cannot interpret results until we have finished gathering all data. However, the sequential testing approach we use allows for flexibility in deciding when to stop an experiment. Here's some advice on how to get the most out of the progress bar. @@ -62,7 +66,15 @@ If, at the end of the experiment, the progress bar has not filled up, it might i When using either the sequential confidence intervals, or Bayesian methodology, the above still applies. But with both of these there is another option: both of these methods[^bayesian-peeking] are always-valid and hence you can confidently stop an experiment any time. -Whenever we detect that a primary metric of one of the variants is statistically significant (the confidence/credible interval does not contain 0%), we mark the experiment is **early stopping eligible\*** and hence **ready for review**. Of course, you might still want to run the experiment for longer, e.g. to obtain more data on secondary metrics. 
In the following example, the precision target is set to 2%, which has not been reached yet, but the experiment is still eligible for early stopping as we see a statistically significant lift and are using sequential analysis: +Whenever we detect that a primary metric of one of the variants is statistically significant (the confidence/credible interval does not contain 0%), we mark the experiment as **early stopping eligible\*** and hence **ready for review**. Of course, you might still want to run the experiment for longer, e.g. to obtain more data on secondary metrics. + +Specifically, an experiment becomes "ready for review" when **all three** of the following conditions are met: + +1. The confidence interval for a primary metric excludes zero (statistical significance). +2. The minimum sample size requirement is met (if configured). +3. The minimum experiment runtime is met (if configured). + +If any of these conditions is not yet satisfied, the experiment will remain in the "running" state even if one of the other conditions has been met. In the following example, the precision target is set to 2%, which has not been reached yet, but the experiment is still eligible for early stopping as we see a statistically significant lift and are using sequential analysis: ![Progress bar popover for early stopping](/img/interpreting-experiments/progress-bar-early-stopping.png) diff --git a/docs/feature-flagging/concepts/flag-variations.md b/docs/feature-flagging/concepts/flag-variations.md index d9eb6309..2c122e63 100644 --- a/docs/feature-flagging/concepts/flag-variations.md +++ b/docs/feature-flagging/concepts/flag-variations.md @@ -48,3 +48,7 @@ You can write an empty array as `{}` if there is no property value present for a JSON object and array flags have a default size limit of 250KB, which can be increased if your team needs more. Eppo will show an error message in the UI if a variation value does not validate as proper JSON. 
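When an application reads a JSON variation, a defensive parse guards against values that do not validate as proper JSON. A minimal Python sketch; `parse_variation` and its fallback handling are illustrative, not part of any Eppo SDK:

```python
import json

def parse_variation(raw_value, fallback):
    """Parse a JSON variation string, returning a fallback if invalid."""
    try:
        return json.loads(raw_value)
    except (TypeError, ValueError):
        # Covers None, non-string input, and malformed JSON.
        return fallback

config = parse_variation('{"count": 5, "label": "beta"}', fallback={})
print(config["label"])  # beta

broken = parse_variation('{not valid json', fallback={"count": 0})
print(broken["count"])  # 0
```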
+ +:::caution Integer values in JSON may be returned as floating-point numbers +Due to how JSON is serialized and deserialized, integer values in JSON variations (e.g., `{"count": 5}`) may be returned as floating-point numbers (e.g., `5.0`) by some SDK language runtimes. If your application uses strict type checking or integer comparison, this can cause silent breakage. To avoid this, either use string values and parse them on the client side, or ensure your application handles both integer and floating-point representations. +::: diff --git a/docs/feature-flagging/concepts/holdout-config.md b/docs/feature-flagging/concepts/holdout-config.md index 9475253f..1954f960 100644 --- a/docs/feature-flagging/concepts/holdout-config.md +++ b/docs/feature-flagging/concepts/holdout-config.md @@ -102,3 +102,13 @@ class MyAssignmentLogger(AssignmentLogger): > **Note:** Some SDK implementations may nest the holdout information within an `extraLogging` field. If you don't see `holdoutKey` and `holdoutVariation` at the top level of the event, check for them in `event.extraLogging.holdoutKey` and `event.extraLogging.holdoutVariation`. +## Limitations + +### Archived holdouts and allocation space + +When a holdout is archived, its hash-space band is not immediately freed. Eppo keeps a buffer of the two most recently archived holdouts to avoid reassigning subjects to a new holdout too quickly. Older archived holdout bands are gradually reclaimed when new holdouts are created. If you run into allocation space issues after archiving many holdouts, contact Eppo support. + +### Start date cannot be changed after creation + +The start date of a holdout is immutable once created. The end date can be extended or shortened after creation. Plan your holdout start date carefully, as changing it requires creating a new holdout. 
+ diff --git a/docs/feature-flagging/concepts/mutual_exclusion.md b/docs/feature-flagging/concepts/mutual_exclusion.md index ba41584c..7e84b016 100644 --- a/docs/feature-flagging/concepts/mutual_exclusion.md +++ b/docs/feature-flagging/concepts/mutual_exclusion.md @@ -23,6 +23,10 @@ Navigate to the Configuration section and click the "Create" button and select " ### Parameters Parameters are attributes that are changed in different variations you plan to test within the layer. They are also configured in code and can accept the values you provide. +:::caution Parameters cannot be added after an experiment starts +Once an experiment is running within a layer, the layer's parameter list is locked. You cannot add new parameters to the layer until all experiments in it have concluded. Plan your parameter set before launching any experiments in the layer. +::: + ![Parameter example](/img/feature-flagging/parameter-example.jpg) For example, if you want to test the color of a button on the page, you might create a parameter called `button_color` and set it with a default color value. When you create variations, you'll be able to specify a different value for color and test that variation in an experiment. diff --git a/docs/feature-flagging/concepts/targeting.md b/docs/feature-flagging/concepts/targeting.md index d1992065..4157cb46 100644 --- a/docs/feature-flagging/concepts/targeting.md +++ b/docs/feature-flagging/concepts/targeting.md @@ -32,6 +32,9 @@ You can target individual subjects by matching with the property `id`. The list of IDs when using `is one of` or `not one of` is limited to 50 values. +## Known limitations + +- **No array-type attribute support.** Targeting rules evaluate scalar values only. If a subject has an array-valued attribute (e.g., a list of tags), the rule will not match individual elements. Pass the relevant value as a scalar instead. 
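The scalar-only behavior can be illustrated with a simplified stand-in for rule evaluation (illustrative only; not Eppo's actual evaluator):

```python
# A simplified "is one of" targeting check: the attribute value is
# compared as a single scalar against the rule's list of values.
def is_one_of(attribute_value, allowed_values):
    return attribute_value in allowed_values

allowed = ["beta", "canary"]

# An array-valued attribute is compared as a whole and never matches.
print(is_one_of(["beta", "stable"], allowed))  # False

# Passing the relevant element as a scalar matches as expected.
print(is_one_of("beta", allowed))  # True
```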
## Special case: Semantic Versioning diff --git a/docs/feature-flagging/index.md b/docs/feature-flagging/index.md index ffca11cc..dc90e384 100644 --- a/docs/feature-flagging/index.md +++ b/docs/feature-flagging/index.md @@ -21,6 +21,10 @@ The following are the central feature flagging concepts in Eppo: - [Audiences](/feature-flagging/concepts/audiences) - [Mutual exclusion](/feature-flagging/concepts/mutual_exclusion) +:::caution Flag archival is irreversible +Archiving a feature flag is a **permanent** action — archived flags cannot be unarchived. Before archiving, ensure no running experiments or rollouts depend on the flag. If you need to temporarily disable a flag, turn off all allocations instead. +::: + ## Use cases Feature flags are applicable for a number of use cases: diff --git a/docs/guides/advanced-experimentation/interaction-effects-asymmetry.md b/docs/guides/advanced-experimentation/interaction-effects-asymmetry.md new file mode 100644 index 00000000..a0c0781b --- /dev/null +++ b/docs/guides/advanced-experimentation/interaction-effects-asymmetry.md @@ -0,0 +1,130 @@ +--- +sidebar_position: 9 +--- + +# Why interaction effects are not symmetrical + +When two experiments run concurrently and you check whether they interact, you are asking: *"Does being in Experiment B's treatment group change the lift I observe in Experiment A?"* + +But you could ask the reverse: *"Does being in Experiment A's treatment group change the lift I observe in Experiment B?"* + +These are different questions, and they will generally give different answers — even though they describe the same group of users and the same underlying data. This guide explains why, and what it means in practice. + +## Why the answers differ + +Interaction effects in Eppo are expressed as a change in **relative lift** (a percentage change). 
Relative lift is always computed against a baseline, and the two questions above use different baselines: + +| Question | What is being measured | Baseline used | +|---|---|---| +| From Experiment A's page | Does B's arm change A's relative lift? | A-control users, split by B arm | +| From Experiment B's page | Does A's arm change B's relative lift? | B-control users, split by A arm | + +Because the baselines are different, the measured interaction magnitude will differ — sometimes by a factor of two or more. + +## Example 1: Homepage redesign and newsletter CTA + +Two experiments are running simultaneously: + +- **Experiment A**: A complete homepage redesign, expected to roughly double daily sign-ups (+100%) +- **Experiment B**: A small "Subscribe to newsletter" call-to-action, expected to add a modest +5% to sign-ups + +Daily sign-ups across the four groups: + +| | B: Control (no CTA) | B: Treatment (CTA shown) | +|---|---|---| +| **A: Control (old homepage)** | 100 | 105 | +| **A: Treatment (new homepage)** | 200 | 215 | + +### From Experiment A's perspective + +*Does the newsletter CTA change how much lift the homepage redesign produces?* + +- Lift of A with no CTA: (200 − 100) / 100 = **+100%** +- Lift of A with CTA: (215 − 105) / 105 = **+104.8%** +- **Interaction: +4.8 percentage points** + +### From Experiment B's perspective + +*Does the homepage redesign change how much lift the newsletter CTA produces?* + +- Lift of B on old homepage: (105 − 100) / 100 = **+5%** +- Lift of B on new homepage: (215 − 200) / 200 = **+7.5%** +- **Interaction: +2.5 percentage points** + +The underlying phenomenon is identical — both experiments amplify each other beyond what pure additivity would predict. But the measured interaction is nearly **twice as large** when viewed from the big experiment's perspective (+4.8 pp vs +2.5 pp). 
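The asymmetry in Example 1 is easy to verify by hand. This sketch recomputes both perspectives from the same four cell counts (plain Python, no Eppo-specific code):

```python
# Daily sign-ups for the four arms of Example 1.
signups = {
    ("A_control", "B_control"): 100,    # old homepage, no CTA
    ("A_control", "B_treatment"): 105,  # old homepage, CTA shown
    ("A_treatment", "B_control"): 200,  # new homepage, no CTA
    ("A_treatment", "B_treatment"): 215,
}

def lift(treated, baseline):
    """Relative lift of a treated cell over its baseline cell."""
    return (treated - baseline) / baseline

# From A's page: how much does A's lift move across B's arms?
lift_a_without_cta = lift(signups[("A_treatment", "B_control")],
                          signups[("A_control", "B_control")])     # +100%
lift_a_with_cta = lift(signups[("A_treatment", "B_treatment")],
                       signups[("A_control", "B_treatment")])      # +104.8%
interaction_from_a = lift_a_with_cta - lift_a_without_cta          # +4.8 pp

# From B's page: how much does B's lift move across A's arms?
lift_b_old_home = lift(signups[("A_control", "B_treatment")],
                       signups[("A_control", "B_control")])        # +5%
lift_b_new_home = lift(signups[("A_treatment", "B_treatment")],
                       signups[("A_treatment", "B_control")])      # +7.5%
interaction_from_b = lift_b_new_home - lift_b_old_home             # +2.5 pp
```

Same table, same users, two different interaction magnitudes (roughly +4.8 pp versus +2.5 pp), because each view divides by a different baseline.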
+
+## Example 2: Recommendation engine and card design
+
+- **Experiment A**: A new ML-powered recommendation engine, expected to triple clicks on product cards (+200%)
+- **Experiment B**: Larger product card thumbnails, expected to add +10% to clicks
+
+Daily product card clicks:
+
+| | B: Control (small thumbnails) | B: Treatment (large thumbnails) |
+|---|---|---|
+| **A: Control (old engine)** | 1,000 | 1,100 |
+| **A: Treatment (new engine)** | 3,000 | 3,400 |
+
+### From Experiment A's perspective
+
+- Lift of A with small thumbnails: (3,000 − 1,000) / 1,000 = **+200%**
+- Lift of A with large thumbnails: (3,400 − 1,100) / 1,100 = **+209%**
+- **Interaction: +9 percentage points**
+
+### From Experiment B's perspective
+
+- Lift of B with old engine: (1,100 − 1,000) / 1,000 = **+10%**
+- Lift of B with new engine: (3,400 − 3,000) / 3,000 = **+13.3%**
+- **Interaction: +3.3 percentage points**
+
+Here the same interaction appears nearly **three times larger** from the big experiment's perspective (+9 pp vs +3.3 pp). From B's perspective, the extra thumbnail clicks are measured against the new engine's much higher baseline, so the same underlying amplification translates into a smaller shift in relative lift.
+ +## Example 3: Checkout flow and free shipping threshold + +- **Experiment A**: A streamlined checkout flow, expected to increase purchase conversions by +40% +- **Experiment B**: Lowering the free shipping threshold from $50 to $25, expected to add +15% to conversions + +Purchase conversion rate (purchases per session): + +| | B: Control ($50 threshold) | B: Treatment ($25 threshold) | +|---|---|---| +| **A: Control (old checkout)** | 5.0% | 5.75% | +| **A: Treatment (new checkout)** | 7.0% | 8.25% | + +### From Experiment A's perspective + +*Does the free shipping threshold change how much the checkout improvement helps?* + +- Lift of A with $50 threshold: (7.0% − 5.0%) / 5.0% = **+40%** +- Lift of A with $25 threshold: (8.25% − 5.75%) / 5.75% = **+43.5%** +- **Interaction: +3.5 percentage points** + +### From Experiment B's perspective + +*Does the streamlined checkout change how much lowering the shipping threshold helps?* + +- Lift of B with old checkout: (5.75% − 5.0%) / 5.0% = **+15%** +- Lift of B with new checkout: (8.25% − 7.0%) / 7.0% = **+17.9%** +- **Interaction: +2.9 percentage points** + +A moderate difference here (+3.5 pp vs +2.9 pp), but still worth noting — especially because statistical significance thresholds are sensitive to effect size. + +## What this means in practice + +**The same interaction looks bigger when viewed from the experiment with the larger effect.** + +When the large experiment shifts the baseline significantly, the smaller experiment's relative lift changes less in percentage terms even if the absolute difference is the same. A few implications to keep in mind: + +1. **Significance can differ between perspectives.** An interaction effect might cross the significance threshold when viewed from one experiment's results page but not the other. Both conclusions are correct — they answer different questions about different lifts. + +2. 
**Neither view is "wrong".** Experiment A's interaction result tells you about the robustness of A's lift across B's arms. Experiment B's interaction result tells you about the robustness of B's lift across A's arms. Both are valid. + +3. **"No interaction" is also asymmetric.** Just because Experiment A shows no significant interaction with B does not mean Experiment B will show no significant interaction with A. If the decision is high-stakes, check both directions. + +4. **When shipping both, think in absolute terms.** If you plan to roll out both experiments to 100% of users, neither relative-lift perspective captures the full picture. Work with your data team to measure the combined outcome directly and compare it against the sum of the individual effects. + +:::info Interaction effects are only available for Simple metrics +Interaction effect analysis is currently limited to **Simple** metric types (sum, count, unique entities, threshold). Ratio metrics and funnel metrics are not supported. If you need to check for interactions on a ratio metric, consider adding the numerator and denominator as separate Simple metrics to your experiment. +::: + +For background on how Eppo detects and surfaces interaction effects, see [Interaction Detection](/statistics/interaction-detection). diff --git a/docs/guides/advanced-experimentation/running-well-powered-experiments.md b/docs/guides/advanced-experimentation/running-well-powered-experiments.md index cac3c745..c7e8cd50 100644 --- a/docs/guides/advanced-experimentation/running-well-powered-experiments.md +++ b/docs/guides/advanced-experimentation/running-well-powered-experiments.md @@ -81,6 +81,20 @@ However, particularly the t-test is susceptible to peeking. If this is a problem We want to stay away from the fully sequential paradigm when we struggle to find enough power in the first place. We cannot afford the cost in width of the confidence intervals for the added flexibility. 
Furthermore, it is unlikely we would be able to stop the experiment early anyway. +## Safely ramping traffic + +When launching a new experiment, it's common to start with a small percentage of traffic and gradually increase it. There are two ways to do this in Eppo, and it's important to understand the difference: + +### Adjusting traffic exposure (safe) + +Changing the **traffic exposure** percentage (the share of eligible subjects enrolled in the experiment) is the safe way to ramp up. Subjects not enrolled simply receive the default experience and are excluded from analysis. Increasing exposure from, say, 10% to 50% adds new subjects to the experiment without affecting the assignments of subjects already enrolled. Their variant assignments remain stable because the hash-based bucketing is deterministic. + +### Changing variant weights (invalidates the experiment) + +Changing the **variant split** (e.g., moving from 50/50 to 80/20 between treatment and control) mid-experiment is a different operation and should be avoided. It can cause subjects to switch between variants, which introduces mixed assignments and undermines the validity of your results. Eppo removes mixed-assignment subjects from the analysis automatically, but this reduces your effective sample size and can bias the remaining population. + +If you need to shift variant weights after the experiment has started, the recommended approach is to end the current experiment and start a new one with the desired split. + ## Conclusion In certain situations, we really need to make the most out of a limited sample size. In this case, remember that it is all about optimizing the signal-to-noise ratio. First and foremost, we should make sure we choose our metrics carefully. With winsorization, CUPED++, and a choice of statistical methodology, Eppo helps you make the most out of our data. 
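The ramping guidance above rests on deterministic, two-level bucketing. This toy sketch (the hashing scheme is illustrative, not Eppo's actual implementation) shows why raising exposure preserves existing assignments while changing variant weights does not:

```python
import hashlib

# Illustrative two-level bucketing (NOT Eppo's real scheme): one hash
# decides enrollment, an independent hash decides the variant.

def bucket(subject_id, salt, total=10_000):
    digest = hashlib.md5(f"{salt}:{subject_id}".encode()).hexdigest()
    return int(digest, 16) % total

def assign(subject_id, exposure, control_weight):
    # Enrollment uses its own hash, so raising `exposure` only adds subjects.
    if bucket(subject_id, "traffic") >= exposure * 10_000:
        return "not enrolled"
    # Variant choice uses a separate hash, so it is stable under exposure changes.
    return "control" if bucket(subject_id, "variant") < control_weight * 10_000 else "treatment"

enrolled = [s for s in range(1_000) if assign(s, 0.10, 0.5) != "not enrolled"]

# Safe: ramping exposure 10% -> 50% keeps every already-enrolled subject's variant.
assert all(assign(s, 0.10, 0.5) == assign(s, 0.50, 0.5) for s in enrolled)

# Unsafe: moving the split 50/50 -> 80/20 flips some enrolled subjects (mixed assignments).
switched = [s for s in enrolled if assign(s, 0.10, 0.5) != assign(s, 0.10, 0.8)]
```

With two independent hashes, widening the enrollment threshold only pulls in new subjects, whereas moving the variant threshold reassigns everyone whose variant bucket falls between the old and new split.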
\ No newline at end of file diff --git a/docs/guides/debugging-metrics.md b/docs/guides/debugging-metrics.md index aefef5fa..ea57ef36 100644 --- a/docs/guides/debugging-metrics.md +++ b/docs/guides/debugging-metrics.md @@ -72,3 +72,19 @@ Clicking on the diagnostic will open a sidebar with detailed information and tro The diagnostics window will contain event volume information if it is available, and SQL snippets for the metric source and the diagnostic query can be copied and executed against your warehouse to help investigate. The diagnostics query represents the code which was executed against the warehouse to test if the diagnostic should pass or fail. It uses common table expressions (CTEs) to break the problem into chunks which can be tweaked to help you investigate the problem. + +## When Diagnostics isn't enough: Run Log + +If the Diagnostics tab and Fact SQL inspection don't reveal the issue, the **Run Log** is the next place to look. The Run Log shows every pipeline execution for an experiment, including individual task statuses, durations, and errors. + +To access the Run Log, navigate to your experiment and click on the **Run Log** tab. + +The Run Log is especially useful for: + +- **Diagnosing "Update failed" errors** when Diagnostics shows all checks as passing — the Run Log helps you see the query that failed in the context of its pipeline. +- **Identifying intermittent failures** such as warehouse timeouts, permission errors, or transient connectivity issues that don't surface in Diagnostics. +- **Verifying that a fix worked** by confirming subsequent runs completed successfully after a configuration change. + +:::tip +Some pipeline steps may show as `WAITING_FOR_UPSTREAM` or `UPSTREAM_FAILED` if a prerequisite step has not yet completed or has failed. If you see tasks stuck in these states, check the upstream tasks in the same run for errors. 
+::: diff --git a/docs/sdks/architecture/deployment-modes.md b/docs/sdks/architecture/deployment-modes.md index 3cd2a594..65456bc3 100644 --- a/docs/sdks/architecture/deployment-modes.md +++ b/docs/sdks/architecture/deployment-modes.md @@ -290,3 +290,13 @@ Instead, initialize once at the start of the application lifecycle and use the ` ### Initializing from CDN in a serverless function Similarly, make sure to follow [the recommendations above](#local-flag-evaluation-initialized-with-pre-fetched-configurations) for serverless architectures. If you instead make a request to Eppo's CDN each time the serverless function is called, you'll introduce unnecessary latency and risk breaching Eppo's CDN limits. + +## CDN resilience and caching behavior + +Eppo's CDN serves flag configurations with `stale-while-revalidate` caching headers, which means SDKs typically receive a cached response even if the origin is temporarily slow. However, there is an important edge case: + +:::info 429 (rate limit) responses are not covered by stale-if-error +The CDN's `stale-if-error` directive covers 5xx server errors — if the origin returns a 500, the CDN will serve a stale cached copy. However, **429 (Too Many Requests) responses are not covered** by `stale-if-error`. If your application sends too many requests to the CDN (e.g., by re-initializing the SDK on every function invocation), the CDN will return a 429 directly to your application and the SDK will fail to load configuration. + +To avoid this, ensure you follow the initialization patterns above and do not make excessive CDN requests. +::: diff --git a/docs/sdks/client-sdks/android/initialization.mdx b/docs/sdks/client-sdks/android/initialization.mdx index 5fab2b6c..d79fb805 100644 --- a/docs/sdks/client-sdks/android/initialization.mdx +++ b/docs/sdks/client-sdks/android/initialization.mdx @@ -66,6 +66,10 @@ When true, prevents the SDK from making HTTP requests to fetch configurations. 
The SDK can cache previously loaded configurations for use in future sessions. This makes the SDK initialize faster and provides fallback values when network requests fail. +:::caution Use separate configuration stores per environment +If your app initializes the SDK in multiple environments (e.g., staging and production) on the same device, each environment must use its own SDK key and a separate cache location. A single shared cache will be overwritten each time the SDK fetches from a different environment, causing flags to silently flip between staging and production values. +::: + #### How Configuration Caching Works The SDK automatically caches configurations on the device. You can also implement your own custom cache by providing a custom storage implementation. diff --git a/docs/sdks/client-sdks/ios/initialization.mdx b/docs/sdks/client-sdks/ios/initialization.mdx index f82fbe2b..f2ea96fb 100644 --- a/docs/sdks/client-sdks/ios/initialization.mdx +++ b/docs/sdks/client-sdks/ios/initialization.mdx @@ -83,6 +83,10 @@ Task { The last downloaded configuration is cached on disk so they can provide the same experience until the network becomes available and a more recent configuration can be loaded. This is useful because when the SDK is initialized, a mobile device might not have network access. +:::caution Use separate configuration stores per environment +If your app initializes the SDK in multiple environments (e.g., staging and production) on the same device, each environment must use its own SDK key and configuration store. A single shared configuration cache will be overwritten each time the SDK fetches from a different environment, causing flags to silently flip between staging and production values. 
+::: + #### Initialization with Automatic Configuration Updates (Polling) diff --git a/docs/sdks/event-logging/assignment-logging.md b/docs/sdks/event-logging/assignment-logging.md index 87fc53bc..790b2cfe 100644 --- a/docs/sdks/event-logging/assignment-logging.md +++ b/docs/sdks/event-logging/assignment-logging.md @@ -19,6 +19,17 @@ The object passed into the assignment logger function contains the following fie | `featureFlag` (string) | An Eppo feature flag key | "recommendation-algo" | | `allocation` (string) | An Eppo allocation key | "allocation-17" | +:::caution Required fields — omitting these causes silent data loss +The `experiment`, `subject`, `variation`, and `timestamp` fields are all required for Eppo to correctly join assignments to your experiment analysis. In particular: + +- **`experiment`**: If this field is missing or empty, the assignment row will not match any experiment in Eppo and will be silently ignored. This is the most common cause of "no assignment data" when the SDK is otherwise working correctly. +- **`subject`**: Must match the entity identifier used in your Fact SQL. A mismatch (e.g., logging a cookie ID but defining assignments on a user ID) produces zero metric joins. +- **`variation`**: Required to determine which variant the subject was assigned to. +- **`timestamp`**: Required to scope the assignment to the experiment's analysis window. + +The `featureFlag` and `allocation` fields are optional but recommended for debugging. When using Eppo's feature flags, both values are included in the `experiment` field automatically. +::: + Eppo expects that the logger function will take this object and write data back to your warehouse in a format that roughly matches the table below. The specific column names do not matter, but these columns are needed to later [define assignments](/data-management/definitions/assignment-sql.md) in your warehouse. 
| experiment | subject | variation | timestamp | subject_attributes | diff --git a/docs/statistics/confidence-intervals/index.md b/docs/statistics/confidence-intervals/index.md index f840f878..9ba98a6b 100644 --- a/docs/statistics/confidence-intervals/index.md +++ b/docs/statistics/confidence-intervals/index.md @@ -62,6 +62,10 @@ For any metric where you have elected to use to handle outliers, the metric totals displayed in the tooltip when hovering over the lift are the _winsorized_ totals, not raw values. +Eppo uses **truncation-based winsorization** (also known as "clipping"): values beyond the chosen percentile threshold are capped to that threshold value, not removed. This preserves the sample size while reducing the influence of extreme outliers. + +Note that the [diagnostic queries](/experiment-analysis/diagnostics#understanding-diagnostic-queries) shown in the Diagnostics tab do **not** apply winsorization, so running them in your warehouse will produce different totals than the experiment results page. + ::: The end result of these more sophisticated methods is that we show diff --git a/docs/statistics/sample-size-calculator/usage.md b/docs/statistics/sample-size-calculator/usage.md index b3c08abc..57df63f2 100644 --- a/docs/statistics/sample-size-calculator/usage.md +++ b/docs/statistics/sample-size-calculator/usage.md @@ -104,6 +104,12 @@ The sequential version of the Minimum Detectable Effect is similar, but scaled b If metrics are [winsorized](/guides/advanced-experimentation/running-well-powered-experiments/#handling-outliers-using-winsorization) the sample size computation takes that into account. ::: +:::info CUPED++ is not included in sample size estimates +The Sample Size Calculator does not account for the variance reduction provided by [CUPED++](/statistics/cuped). 
In practice, CUPED++ can substantially reduce confidence interval widths (often by 30–50%), which means experiments may reach the target precision significantly sooner than the calculator predicts. + +Use the Sample Size Calculator as a **conservative upper bound** on required runtime. If your experiment has CUPED++ enabled, you can expect to reach the desired precision earlier than the table suggests. The [Progress Bar](/experiment-analysis/reading-results/progress-bar) on a running experiment reflects actual precision including CUPED++ adjustments. +::: + :::note The sample size calculator also [checks the validity of the normal approximation used](/statistics/confidence-intervals/#estimating-lift) for the MDE calculation. ::: \ No newline at end of file From ae771d9d11cecc9a3a4436c730cc53d49041c002 Mon Sep 17 00:00:00 2001 From: bertilhatt Date: Sat, 21 Mar 2026 20:12:17 +0100 Subject: [PATCH 2/2] fix: correct lookback window, ready-for-review logic, key reuse, cleanup nav MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Lookback window: 24h → 2 days (matches DEFAULT_INC_LOOKBACK_PERIOD_DAYS) - Ready-for-review: three independent paths, not all-three-required; document min-requirements gate and traffic-imbalance blocker - Flag archival: note that keys can be reused after archival - Warehouse cleanup: Admin → Settings → Experiment Schedule Settings Made-with: Cursor --- docs/data-management/data-pipeline.md | 6 +++--- .../reading-results/progress-bar.md | 12 ++++++------ docs/feature-flagging/index.md | 2 ++ 3 files changed, 11 insertions(+), 9 deletions(-) diff --git a/docs/data-management/data-pipeline.md b/docs/data-management/data-pipeline.md index 92511083..8f29d26b 100644 --- a/docs/data-management/data-pipeline.md +++ b/docs/data-management/data-pipeline.md @@ -24,7 +24,7 @@ Note that the y axis shows the compute time accrued by that task type. 
That is, ### Incremental refreshes -Eppo's scheduled jobs will run an incremental refresh that only scans recent data. By default, this lookback window will include data starting 24 hours before the last successful run (to change this time window, reach out to your Eppo contact or email support@geteppo.com). New metrics and metrics/facts with an updated definition will automatically be backfilled from the start of the experiment. Further, if a job fails on a given day, the next scheduled job will automatically go back and re-run metrics for that day. +Eppo's scheduled jobs will run an incremental refresh that only scans recent data. By default, this lookback window covers the **2 days** before the last successful run, snapped to the start of day (to change this time window, reach out to your Eppo contact or email support@geteppo.com). New metrics and metrics/facts with an updated definition will automatically be backfilled from the start of the experiment. Further, if a job fails on a given day, the next scheduled job will automatically go back and re-run metrics for that day. You can also trigger a refresh in the UI by clicking "refresh experiment results" on the metric scorecard. This will compute new metrics from scratch and update existing metrics based on the incremental logic described above. If you'd instead like to force a full refresh and recompute all metrics from the start of the experiment, click "update now" under "results last updated". @@ -34,7 +34,7 @@ You can also trigger a refresh in the UI by clicking "refresh experiment results Not every data issue requires a full backfill. Use this decision tree to determine the right action: -- **Eppo's pipeline failed (e.g., warehouse timeout, permission error) but your underlying data is correct:** You generally do **not** need a backfill. The incremental lookback window (default 24 hours) will automatically re-process the missed period on the next successful run. 
Verify the next scheduled run completes successfully. +- **Eppo's pipeline failed (e.g., warehouse timeout, permission error) but your underlying data is correct:** You generally do **not** need a backfill. The incremental lookback window (default 2 days) will automatically re-process the missed period on the next successful run. Verify the next scheduled run completes successfully. - **Your upstream data was wrong and has now been corrected (e.g., a broken ETL was fixed, late-arriving data has landed):** You likely **do** need a full refresh to recompute metrics from the affected date. Trigger a full refresh from the experiment's results page ("update now" under "results last updated"), or use the API: `POST /api/v1/experiment-results/update/{experiment_id}`. Both endpoints accept a `lookback_date` query parameter (ISO 8601 format, e.g. `?lookback_date=2025-06-01T00:00:00Z`) to recompute results starting from a specific date instead of reprocessing the entire experiment. You can also pass `full_refresh=true` to force a non-incremental refresh. @@ -107,7 +107,7 @@ If you have any question about our privacy practices, please reach out. Eppo creates intermediate tables and views in a dedicated schema (typically `EPPO_OUTPUT`) in your warehouse. Over time — especially in long-running workspaces with many experiments — these can accumulate into thousands of objects. This is expected behavior and does not affect experiment results. -To manage this, Eppo provides an **automatic warehouse table cleanup** setting. Navigate to **Admin → Pipeline Update Schedules** and enable **"Automatically clean up old warehouse tables"**. You configure a retention period (e.g., 90 days) — Eppo will then drop any `EPPO_OUTPUT` tables that haven't been updated within that window. The cleanup runs on the 1st of every month. By default, tables used by Explore charts and the Sample Size Calculator are preserved; you can opt in to cleaning those up as well with separate toggles. 
+To manage this, Eppo provides an **automatic warehouse table cleanup** setting. Navigate to **Admin → Settings → Experiment Schedule Settings** and enable **"Automatically clean up old warehouse tables"**. You configure a retention period (e.g., 90 days) — Eppo will then drop any `EPPO_OUTPUT` tables that haven't been updated within that window. The cleanup runs on the 1st of every month. By default, tables used by Explore charts and the Sample Size Calculator are preserved; you can opt in to cleaning those up as well with separate toggles. Do not manually drop tables from the `EPPO_OUTPUT` schema — active experiments may depend on them. Use the built-in cleanup automation instead, which only removes tables outside the retention window. diff --git a/docs/experiment-analysis/reading-results/progress-bar.md b/docs/experiment-analysis/reading-results/progress-bar.md index 49bf04e2..4ba1ed82 100644 --- a/docs/experiment-analysis/reading-results/progress-bar.md +++ b/docs/experiment-analysis/reading-results/progress-bar.md @@ -66,15 +66,15 @@ If, at the end of the experiment, the progress bar has not filled up, it might i When using either the sequential confidence intervals, or Bayesian methodology, the above still applies. But with both of these there is another option: both of these methods[^bayesian-peeking] are always-valid and hence you can confidently stop an experiment any time. -Whenever we detect that a primary metric of one of the variants is statistically significant (the confidence/credible interval does not contain 0%), we mark the experiment as **early stopping eligible\*** and hence **ready for review**. Of course, you might still want to run the experiment for longer, e.g. to obtain more data on secondary metrics. +An experiment becomes "ready for review" through any **one** of these independent paths: -Specifically, an experiment becomes "ready for review" when **all three** of the following conditions are met: +1. 
**End date reached** — the experiment's end date has passed and the pipeline has completed (status moves to wrap-up). +2. **Progress reaches 100%** — the precision target has been met (Sequential / Bayesian / Hybrid methods only). +3. **Early stopping eligible** — the confidence/credible interval for a primary metric excludes zero, i.e., a statistically significant result is detected (Sequential / Bayesian / Hybrid methods only). -1. The confidence interval for a primary metric excludes zero (statistical significance). -2. The minimum sample size requirement is met (if configured). -3. The minimum experiment runtime is met (if configured). +If **minimum requirements** are configured (minimum sample size and/or minimum experiment runtime), they act as a gate: none of the paths above will trigger "ready for review" until those minimums are satisfied. Additionally, if a **traffic imbalance** (SRM) is detected, the experiment will not be marked ready for review regardless of the other conditions. -If any of these conditions is not yet satisfied, the experiment will remain in the "running" state even if one of the other conditions has been met. 
In the following example, the precision target is set to 2%, which has not been reached yet, but the experiment is still eligible for early stopping as we see a statistically significant lift and are using sequential analysis: +In the following example, the precision target is set to 2%, which has not been reached yet, but the experiment is still eligible for early stopping as we see a statistically significant lift and are using sequential analysis: ![Progress bar popover for early stopping](/img/interpreting-experiments/progress-bar-early-stopping.png) diff --git a/docs/feature-flagging/index.md b/docs/feature-flagging/index.md index dc90e384..7322f3c2 100644 --- a/docs/feature-flagging/index.md +++ b/docs/feature-flagging/index.md @@ -23,6 +23,8 @@ The following are the central feature flagging concepts in Eppo: :::caution Flag archival is irreversible Archiving a feature flag is a **permanent** action — archived flags cannot be unarchived. Before archiving, ensure no running experiments or rollouts depend on the flag. If you need to temporarily disable a flag, turn off all allocations instead. + +The flag key of an archived flag can be reused when creating a new flag. However, reusing a key that existing SDK clients may still reference can cause unexpected behavior (the new flag's configuration will be served for the same key). Prefer choosing a new key unless you are certain no deployments reference the old one. ::: ## Use cases