fix(spanner): add endpoint overload cooldown for location-aware routing by rahul2393 · Pull Request #14434 · googleapis/google-cloud-go

rahul2393 · 2026-04-14T17:09:54Z

Summary

Add endpoint-scoped overload cooldown for location-aware routing when a routed replica returns RESOURCE_EXHAUSTED.

Instead of only avoiding the failed endpoint within the current retry flow, new requests now temporarily skip that
endpoint as well. Routing prefers the next eligible replica, and falls back to the default host only when no routed
replica remains available.

The cooldown policy is purely time-based. Successful requests do not clear endpoint overload state, since a success at reduced rate does not indicate the endpoint can sustain its former routed load.

What Changed

- Added an address-keyed overload cooldown tracker for routed endpoints.
- Applied cooldown-based endpoint exclusion during route selection alongside the existing logical-request exclusion.
- Marked routed endpoints as cooling down on RESOURCE_EXHAUSTED.
- Updated cooldown policy to:
  - initial cooldown: 5s
  - exponential backoff
  - max cooldown: 60s
  - reset window with no RESOURCE_EXHAUSTED: 10m
  - full jitter on every cooldown interval, including the first failure
- Kept the existing same-call unary retry behavior:
  - first retry avoids the just-failed routed endpoint
  - request-scoped exclusion behavior remains unchanged
- Added focused full-stack tests for:
  - rerouting to the next replica while a routed endpoint is in cooldown
  - fallback to the default host when all routed replicas are cooling down
  - endpoint re-admission after cooldown expiry
  - streaming cooldown behavior
- Added direct tracker tests for:
  - full-jitter cooldown computation
  - time-based reset of overload penalty after a quiet window
  - ensuring expired cooldown does not immediately erase failure state

Behavior

When a routed endpoint returns RESOURCE_EXHAUSTED:

- the current request still retries away from that endpoint when applicable
- future requests that would have selected that endpoint skip it while cooldown is active
- routing tries the next eligible routed replica first
- if no routed replica is eligible, the request goes to the default host
- after cooldown expires, the endpoint becomes eligible for selection again
- successful requests do not clear overload state
- the endpoint's overload penalty only resets after a longer quiet period with no RESOURCE_EXHAUSTED
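
The selection order described above could look roughly like this. It is a sketch under assumptions: `pickEndpoint` and its parameters are hypothetical names, replicas are assumed to arrive already ordered by routing preference, and the real client's selection logic is likely more involved.

```go
package main

// pickEndpoint returns the first routed replica that is neither excluded for
// this logical request nor in overload cooldown. If none qualifies, it falls
// back to the default host. All names here are illustrative.
func pickEndpoint(
	replicas []string, // routed replicas in preference order (assumed)
	defaultHost string,
	excluded map[string]bool, // request-scoped exclusion set; may be nil
	coolingDown func(addr string) bool, // global, address-keyed cooldown check
) string {
	for _, addr := range replicas {
		if excluded[addr] || coolingDown(addr) {
			continue
		}
		return addr
	}
	return defaultHost
}
```

Keeping the two checks as separate inputs mirrors the split in the description: exclusion belongs to a single logical request, while cooldown is shared across all requests.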

gemini-code-assist

Code Review

This pull request introduces an endpoint overload cooldown mechanism to the Spanner client to handle RESOURCE_EXHAUSTED errors by temporarily excluding affected replicas from routing. It adds an endpointOverloadCooldownTracker that manages exponential backoff with jitter and integrates it into the locationAwareSpannerClient. Feedback was provided regarding a logic error in the backoff calculation where jitter is bypassed once the maximum cooldown duration is reached.

olavloite · 2026-04-15T15:51:07Z

+	defaultEndpointOverloadResetAfter      = 10 * time.Minute
+)
+
+type endpointOverloadCooldownState struct {


Because this struct is kept separate from the actual endpoint, it can survive endpoint cache evictions. That in turn means it can survive, for example, a server restart. In an ideal world we would want the cooldown state to be cleared in such a case. Given the relatively short max overload cooldown window, this is unlikely to cause many problems in practice (unless we later decide to increase the max cooldown window).

rahul2393 · 2026-04-16

A server restart allocates a new IP to the server, so although the concern is valid, the client picks up the new IP via stale_ip -> transient_failure -> skipped_tablet_uid -> new cache_update.

olavloite · 2026-04-15T15:56:23Z

+
+// endpointOverloadCooldownTracker keeps routed endpoints out of selection for a
+// short period after RESOURCE_EXHAUSTED so the router can try another replica.
+type endpointOverloadCooldownTracker struct {


I am not 100% convinced that we need both a cooldown tracker and a separate set of logic for endpoint exclusion. Now we have:

Endpoint exclusion: This is bound to a specific request, and ensures that one specific request is retried on a different endpoint.

Endpoint cooldown: This is global, and ensures that requests are not routed to the endpoint.

Because cooldown uses random jitter, there is no absolute guarantee that a retry would land on a different endpoint if we relied on cooldown alone to redirect retries. If the random cooldown period is so short that it ends before the retry is executed, the retry would still end up on the same endpoint. But that feels more like a theoretical argument, because at that point the endpoint is also immediately open again to any other request.

So the bottom line is: Are we really sure that we want both cooldown and exclusion? Or could we replace exclusion with cooldown? This whole feature is already quite complex, and adding yet another layer on top of the layers that we already have feels like a bit too much.

rahul2393 · 2026-04-16

Having both gives better throughput in the benchmark numbers. I will add a TODO to investigate later whether we can remove one.

fix(spanner): add endpoint overload cooldown for location-aware routing

0b001ef

rahul2393 requested review from a team as code owners April 14, 2026 17:09

product-auto-label bot added the "api: spanner" label (Issues related to the Spanner API.) Apr 14, 2026

gemini-code-assist bot reviewed Apr 14, 2026

Comment thread spanner/endpoint_overload_cooldown.go

rahul2393 added 3 commits April 14, 2026 23:06

update to time based penalty

bc35c63

address comments

08edc5a

fix flaky tests

6789468

olavloite reviewed Apr 15, 2026

rahul2393 added 6 commits April 15, 2026 22:42

skip blocking wait on request path

5e0f11b

RE retry fixes

31fb0fc

address comments

27f9248

add exclude cleanup

62d0adb

fix tests

9f5dc01

perf: experiment

1c8ac08

rahul2393 requested a review from olavloite April 16, 2026 12:54

rahul2393 added 3 commits April 16, 2026 19:54

perf: cache optimizations

ede6b14

relax serverDelay constrain for location aware client

07eadd3

retry on another replica without delay

e5b92d4

rahul2393 wants to merge 13 commits into main from endpoint-cooldown-re


2 participants
