Skip to content

feat(tonic-xds): implement gRFC A50 outlier detection success-rate algorithm#2673

Open
LYZJU2019 wants to merge 1 commit into
grpc:masterfrom
LYZJU2019:lyzju2019/a50-outlier-detector-success-rate
Open

feat(tonic-xds): implement gRFC A50 outlier detection success-rate algorithm#2673
LYZJU2019 wants to merge 1 commit into
grpc:masterfrom
LYZJU2019:lyzju2019/a50-outlier-detector-success-rate

Conversation

@LYZJU2019
Copy link
Copy Markdown
Contributor

@LYZJU2019 LYZJU2019 commented Jun 4, 2026

Motivation

#2619 landed the failure-percentage algorithm and the shared sweep plumbing. This PR fills in the second gRFC A50 ejection algorithm: success-rate (cross-host mean/stdev).

Solution

In the housekeeping sweep, the success-rate algorithm now runs before the failure-percentage algorithm (A50 lists them in that order, and Envoy's implementation follows the same order; the two algorithms are otherwise independent and gated separately).

Tests

Seven new unit tests covering:

  • success_rate_ejects_outlier_below_threshold — 4× 100% + 1× 0%, mean=80, stdev=40, threshold@1900 = 4 → outlier ejected.
  • success_rate_uniform_population_does_not_eject — stdev=0 → threshold=mean, nothing strictly below.
  • success_rate_minimum_hosts_gates_ejection — qualifying<minimum_hosts → algorithm skipped.
  • success_rate_request_volume_filters_low_traffic — low-volume outlier excluded from population and candidates.
  • success_rate_enforcement_zero_never_ejectsenforcing_success_rate = 0 short-circuits the roll.
  • success_rate_stdev_factor_zero_ejects_below_mean — factor=0 collapses threshold to the mean.
  • success_rate_max_ejection_percent_caps_concurrent_ejections — cap (with floor) holds.
  • success_rate_and_failure_percentage_compose — both algorithms configured: success-rate ejects, failure-percentage skips already-ejected host (no double-count).

All 33 tests in client::loadbalance::outlier_detection pass, including the 26 from #2619.

…gorithm

grpc#2619 landed the failure-percentage algorithm and the shared sweep
plumbing. This change fills in the second A50 ejection algorithm:
success-rate (cross-host mean/stdev).

Algorithm (run before failure-percentage in the same sweep, gated by
SuccessRateConfig being present on the cluster config):

  1. For each host with total >= request_volume, compute its success
     rate as a percentage in 0.0..=100.0.
  2. If fewer than minimum_hosts qualify, skip the algorithm.
  3. Compute mean and population stdev of the success rates.
  4. threshold = mean - stdev * stdev_factor / 1000
  5. For each qualifying host whose success rate is strictly below the
     threshold, attempt ejection — subject to max_ejection_percent (with
     A50's at-least-one floor) and the enforcing_success_rate roll.
     Hosts already ejected (e.g., by a previous algorithm in this sweep)
     are skipped, and ejections feed into ejected_count so the
     subsequent failure-percentage pass respects the cap.

Success-rate runs first because A50 lists it first and Envoy's
implementation runs the algorithms in the same order; the two
algorithms are otherwise independent.

Seven new unit tests cover the outlier-below-threshold happy path,
uniform-population no-eject (stdev = 0), minimum_hosts gating,
request_volume filtering, enforcement = 0 no-op, stdev_factor = 0
collapsing to the mean, max_ejection_percent interaction, and
combined success-rate + failure-percentage composition.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant