fix(spanner): add endpoint overload cooldown for location-aware routing#14434
fix(spanner): add endpoint overload cooldown for location-aware routing#14434
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces an endpoint overload cooldown mechanism to the Spanner client to handle RESOURCE_EXHAUSTED errors by temporarily excluding affected replicas from routing. It adds an endpointOverloadCooldownTracker that manages exponential backoff with jitter and integrates it into the locationAwareSpannerClient. Feedback was provided regarding a logic error in the backoff calculation where jitter is bypassed once the maximum cooldown duration is reached.
| defaultEndpointOverloadResetAfter = 10 * time.Minute | ||
| ) | ||
|
|
||
| type endpointOverloadCooldownState struct { |
There was a problem hiding this comment.
The fact that this struct is kept separate from the actual endpoint, means that it can survive endpoint cache evictions. That again means that it can survive for example a server restart. I think that in an ideal world we would want the cooldown state to have been cleared in such a case. Given the relatively short max overload cooldown window, this is probably not an issue that will cause much problems in reality (unless we later decide to increase the max cooldown window).
There was a problem hiding this comment.
server restart will allocate a new IP to the server hence though the concern is valid the new IP will be refreshed in client via stale_ip->transient_failure->skipped_tablet_uid-> new cache_update
|
|
||
| // endpointOverloadCooldownTracker keeps routed endpoints out of selection for a | ||
| // short period after RESOURCE_EXHAUSTED so the router can try another replica. | ||
| type endpointOverloadCooldownTracker struct { |
There was a problem hiding this comment.
I am not 100% convinced that we need both a cooldown tracker and a separate set of logic for endpoint exclusion. Now we have:
- Endpoint exclusion: This is bound to a specific request, and ensures that one specific request is retried on a different endpoint.
- Endpoint cooldown: This is global, and ensures that requests are not routed to the endpoint.
Because cooldown uses random jitter, there is no absolute guarantee that if we only used the cooldown feature to direct retries to a different endpoint, it would actually be retried on a different endpoint. If the random cooldown period is so short that it ends before the retry is executed, the retry would still end up on the same endpoint. But that feels more like a theoretical argument, because the endpoint is also directly opened for any other requests.
So bottom line is: Are we really sure that we want both cooldown and exclusion? Or could we replace exclusion with cooldown? This whole feature is already quite complex, and adding yet another layer on top of the layers that we already have feels like a bit too much.
There was a problem hiding this comment.
Having both is giving better throughput in benchmark numbers, will add TODO for later to investigate if we can remove one
Summary
Add endpoint-scoped overload cooldown for location-aware routing when a routed replica returns
RESOURCE_EXHAUSTED.Instead of only avoiding the failed endpoint within the current retry flow, new requests now temporarily skip that
endpoint as well. Routing prefers the next eligible replica, and falls back to the default host only when no routed
replica remains available.
The cooldown policy is purely time-based. Successful requests do not clear endpoint overload state, since a success at reduced rate does not indicate the endpoint can sustain its former routed load.
What Changed
RESOURCE_EXHAUSTED.5s60sRESOURCE_EXHAUSTED:10mBehavior
When a routed endpoint returns
RESOURCE_EXHAUSTED:RESOURCE_EXHAUSTED