feat(tonic-xds): implement gRFC A50 outlier detection success-rate algorithm #2673
Open
LYZJU2019 wants to merge 1 commit into
…gorithm

grpc#2619 landed the failure-percentage algorithm and the shared sweep plumbing. This change fills in the second A50 ejection algorithm: success-rate (cross-host mean/stdev).

Algorithm (run before failure-percentage in the same sweep, gated by SuccessRateConfig being present on the cluster config):

1. For each host with total >= request_volume, compute its success rate as a percentage in 0.0..=100.0.
2. If fewer than minimum_hosts qualify, skip the algorithm.
3. Compute mean and population stdev of the success rates.
4. threshold = mean - stdev * stdev_factor / 1000
5. For each qualifying host whose success rate is strictly below the threshold, attempt ejection, subject to max_ejection_percent (with A50's at-least-one floor) and the enforcing_success_rate roll.

Hosts already ejected (e.g., by a previous algorithm in this sweep) are skipped, and ejections feed into ejected_count so the subsequent failure-percentage pass respects the cap.

Success-rate runs first because A50 lists it first and Envoy's implementation runs the algorithms in the same order; the two algorithms are otherwise independent.

Seven new unit tests cover the outlier-below-threshold happy path, uniform-population no-eject (stdev = 0), minimum_hosts gating, request_volume filtering, enforcement = 0 no-op, stdev_factor = 0 collapsing to the mean, max_ejection_percent interaction, and combined success-rate + failure-percentage composition.
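Steps 1 through 4 of the algorithm, plus the strict-inequality check from step 5, can be sketched as below. This is a minimal illustration, not the code in this PR: `HostStats`, `success_rate_candidates`, and the field types on `SuccessRateConfig` are assumptions made for the example. The `max_ejection_percent` cap and the `enforcing_success_rate` roll are deliberately left out.

```rust
/// Hypothetical per-host counters for one detection interval.
#[derive(Debug, Clone, Copy)]
struct HostStats {
    success: u64,
    total: u64,
}

/// Hypothetical subset of the success-rate config; field types are assumed.
struct SuccessRateConfig {
    stdev_factor: u32,
    minimum_hosts: usize,
    request_volume: u64,
}

/// Returns the indices of hosts whose success rate is strictly below
/// `mean - stdev * stdev_factor / 1000`, or an empty list when too few
/// hosts qualify.
fn success_rate_candidates(hosts: &[HostStats], cfg: &SuccessRateConfig) -> Vec<usize> {
    // Step 1: keep hosts with enough traffic and compute each success
    // rate as a percentage in 0.0..=100.0. The `total > 0` guard only
    // avoids a division by zero when `request_volume` is 0.
    let rates: Vec<(usize, f64)> = hosts
        .iter()
        .enumerate()
        .filter(|(_, h)| h.total > 0 && h.total >= cfg.request_volume)
        .map(|(i, h)| (i, 100.0 * h.success as f64 / h.total as f64))
        .collect();

    // Step 2: skip the algorithm when fewer than `minimum_hosts` qualify.
    if rates.is_empty() || rates.len() < cfg.minimum_hosts {
        return Vec::new();
    }

    // Step 3: mean and population standard deviation (divide by n, not n - 1).
    let n = rates.len() as f64;
    let mean = rates.iter().map(|(_, r)| r).sum::<f64>() / n;
    let variance = rates.iter().map(|(_, r)| (r - mean).powi(2)).sum::<f64>() / n;
    let stdev = variance.sqrt();

    // Step 4: the ejection threshold.
    let threshold = mean - stdev * f64::from(cfg.stdev_factor) / 1000.0;

    // Step 5 (selection only): strictly below the threshold.
    rates
        .into_iter()
        .filter(|(_, r)| *r < threshold)
        .map(|(i, _)| i)
        .collect()
}
```

Because the comparison is strict, a uniform population (stdev = 0, threshold = mean) yields no candidates.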
Motivation
#2619 landed the failure-percentage algorithm and the shared sweep plumbing. This PR fills in the second gRFC A50 ejection algorithm: success-rate (cross-host mean/stdev).
Solution
In the housekeeping sweep, the success-rate algorithm now runs before the failure-percentage algorithm (A50 lists them in that order, and Envoy's implementation follows the same order; the two algorithms are otherwise independent and gated separately).
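The ordering and the shared ejection cap could look roughly like the sketch below. All names here (`Host`, `max_ejections`, `eject_pass`, `sweep`) are hypothetical and do not come from the PR. The rounding in `max_ejections` is one possible reading of the at-least-one floor, and the enforcement rolls are omitted.

```rust
/// Hypothetical per-host state tracked by the sweep.
struct Host {
    ejected: bool,
}

/// Upper bound on concurrently ejected hosts. The `max(1)` floor is an
/// assumption about how the at-least-one rule is applied.
fn max_ejections(host_count: usize, max_ejection_percent: u32) -> usize {
    (host_count * max_ejection_percent as usize / 100).max(1)
}

/// Ejects candidates in order until the shared cap is reached.
/// Hosts that are already ejected are skipped and not counted twice.
fn eject_pass(hosts: &mut [Host], candidates: &[usize], ejected_count: &mut usize, cap: usize) {
    for &i in candidates {
        if *ejected_count >= cap {
            break;
        }
        if hosts[i].ejected {
            continue;
        }
        hosts[i].ejected = true;
        *ejected_count += 1;
    }
}

/// Runs the success-rate pass first, then the failure-percentage pass.
/// Each pass is gated on its own `Option`, and both share `ejected_count`.
fn sweep(
    hosts: &mut [Host],
    success_rate_candidates: Option<&[usize]>,
    failure_percentage_candidates: Option<&[usize]>,
    max_ejection_percent: u32,
) -> usize {
    let cap = max_ejections(hosts.len(), max_ejection_percent);
    let mut ejected_count = hosts.iter().filter(|h| h.ejected).count();

    if let Some(candidates) = success_rate_candidates {
        eject_pass(hosts, candidates, &mut ejected_count, cap);
    }
    if let Some(candidates) = failure_percentage_candidates {
        eject_pass(hosts, candidates, &mut ejected_count, cap);
    }
    ejected_count
}
```

Sharing one counter across both passes is what lets the second pass respect ejections made by the first.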
Tests
Seven new unit tests covering:
- `success_rate_ejects_outlier_below_threshold`: 4× 100% + 1× 0%, mean=80, stdev=40, threshold@1900 = 4 → outlier ejected.
- `success_rate_uniform_population_does_not_eject`: stdev=0 → threshold=mean, nothing strictly below.
- `success_rate_minimum_hosts_gates_ejection`: qualifying < minimum_hosts → algorithm skipped.
- `success_rate_request_volume_filters_low_traffic`: low-volume outlier excluded from population and candidates.
- `success_rate_enforcement_zero_never_ejects`: `enforcing_success_rate = 0` short-circuits the roll.
- `success_rate_stdev_factor_zero_ejects_below_mean`: factor=0 collapses threshold to the mean.
- `success_rate_max_ejection_percent_caps_concurrent_ejections`: cap (with floor) holds.
- `success_rate_and_failure_percentage_compose`: both algorithms configured: success-rate ejects, failure-percentage skips already-ejected host (no double-count).

All 33 tests in `client::loadbalance::outlier_detection` pass, including the 26 from #2619.
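The numbers quoted for `success_rate_ejects_outlier_below_threshold` can be checked by hand, using the population standard deviation and a `stdev_factor` of 1900:

```latex
\begin{aligned}
\mu &= \frac{4 \cdot 100 + 1 \cdot 0}{5} = 80 \\
\sigma &= \sqrt{\frac{4\,(100 - 80)^2 + (0 - 80)^2}{5}} = \sqrt{1600} = 40 \\
\text{threshold} &= \mu - \sigma \cdot \frac{\texttt{stdev\_factor}}{1000}
  = 80 - 40 \cdot \frac{1900}{1000} = 4
\end{aligned}
```

The 0% host falls strictly below 4 and is a candidate for ejection; the four 100% hosts do not.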