fix(spanner): add endpoint overload cooldown for location-aware routing#14434

Open
rahul2393 wants to merge 13 commits into main from endpoint-cooldown-re

@rahul2393 rahul2393 commented Apr 14, 2026

Summary

Add endpoint-scoped overload cooldown for location-aware routing when a routed replica returns RESOURCE_EXHAUSTED.

Instead of only avoiding the failed endpoint within the current retry flow, new requests now temporarily skip that
endpoint as well. Routing prefers the next eligible replica, and falls back to the default host only when no routed
replica remains available.

The cooldown policy is purely time-based. Successful requests do not clear endpoint overload state, since a success at reduced rate does not indicate the endpoint can sustain its former routed load.

What Changed

  • Added an address-keyed overload cooldown tracker for routed endpoints.
  • Applied cooldown-based endpoint exclusion during route selection alongside the existing logical-request exclusion.
  • Marked routed endpoints as cooling down on RESOURCE_EXHAUSTED.
  • Updated cooldown policy to:
    • initial cooldown: 5s
    • exponential backoff
    • max cooldown: 60s
    • reset window with no RESOURCE_EXHAUSTED: 10m
    • full jitter on every cooldown interval, including the first failure
  • Kept the existing same-call unary retry behavior:
    • first retry avoids the just-failed routed endpoint
    • request-scoped exclusion behavior remains unchanged
  • Added focused full-stack tests for:
    • rerouting to the next replica while a routed endpoint is in cooldown
    • fallback to the default host when all routed replicas are cooling down
    • endpoint re-admission after cooldown expiry
    • streaming cooldown behavior
  • Added direct tracker tests for:
    • full-jitter cooldown computation
    • time-based reset of overload penalty after a quiet window
    • ensuring expired cooldown does not immediately erase failure state

Behavior

When a routed endpoint returns RESOURCE_EXHAUSTED:

  • the current request still retries away from that endpoint when applicable
  • future requests that would have selected that endpoint skip it while cooldown is active
  • routing tries the next eligible routed replica first
  • if no routed replica is eligible, the request goes to the default host
  • after cooldown expires, the endpoint becomes eligible for selection again
  • successful requests do not clear overload state
  • the endpoint's overload penalty only resets after a longer quiet period with no RESOURCE_EXHAUSTED
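
The selection order described above could look something like the sketch below. It assumes replicas arrive already ordered by routing preference; `pickEndpoint` and its parameters are hypothetical names for illustration and are not taken from the PR.

```go
package main

import "fmt"

// pickEndpoint returns the first routed replica that is neither excluded for
// the current logical request nor in overload cooldown. If none qualifies,
// it falls back to defaultHost.
func pickEndpoint(
	replicas []string, // routed replicas in preference order (assumed)
	excluded map[string]bool, // request-scoped exclusion; may be nil
	coolingDown func(addr string) bool, // global, time-based cooldown check
	defaultHost string,
) string {
	for _, addr := range replicas {
		if excluded[addr] || coolingDown(addr) {
			continue
		}
		return addr
	}
	return defaultHost
}

func main() {
	cooling := map[string]bool{"replica-1": true}
	isCooling := func(addr string) bool { return cooling[addr] }
	replicas := []string{"replica-1", "replica-2"}

	// replica-1 is cooling down, so the next eligible replica is chosen.
	fmt.Println(pickEndpoint(replicas, nil, isCooling, "default-host"))

	// With replica-2 excluded for this request as well, only the default host remains.
	fmt.Println(pickEndpoint(replicas, map[string]bool{"replica-2": true}, isCooling, "default-host"))
}
```

Keeping the two checks as separate inputs mirrors the description: request-scoped exclusion and endpoint-scoped cooldown are applied side by side during route selection.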

@rahul2393 rahul2393 requested review from a team as code owners April 14, 2026 17:09
@product-auto-label product-auto-label bot added the api: spanner Issues related to the Spanner API. label Apr 14, 2026

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an endpoint overload cooldown mechanism to the Spanner client to handle RESOURCE_EXHAUSTED errors by temporarily excluding affected replicas from routing. It adds an endpointOverloadCooldownTracker that manages exponential backoff with jitter and integrates it into the locationAwareSpannerClient. Feedback was provided regarding a logic error in the backoff calculation where jitter is bypassed once the maximum cooldown duration is reached.

Comment thread spanner/endpoint_overload_cooldown.go
Comment thread spanner/endpoint_overload_cooldown.go Outdated
Comment thread spanner/endpoint_overload_cooldown.go
Comment thread spanner/location_aware_client.go Outdated
defaultEndpointOverloadResetAfter = 10 * time.Minute
)

type endpointOverloadCooldownState struct {
Contributor

Because this struct is kept separate from the actual endpoint, it can survive endpoint cache evictions, which in turn means it can survive, for example, a server restart. Ideally, the cooldown state would be cleared in such a case. Given the relatively short maximum overload cooldown window, this is unlikely to cause many problems in practice (unless we later decide to increase the maximum cooldown window).

Contributor Author

A server restart allocates a new IP to the server. So although the concern is valid, the client will pick up the new IP via stale_ip -> transient_failure -> skipped_tablet_uid -> new cache_update.


// endpointOverloadCooldownTracker keeps routed endpoints out of selection for a
// short period after RESOURCE_EXHAUSTED so the router can try another replica.
type endpointOverloadCooldownTracker struct {
Contributor

I am not 100% convinced that we need both a cooldown tracker and a separate set of logic for endpoint exclusion. Now we have:

  1. Endpoint exclusion: This is bound to a specific request, and ensures that one specific request is retried on a different endpoint.
  2. Endpoint cooldown: This is global, and ensures that requests are not routed to the endpoint.

Because cooldown uses random jitter, relying on cooldown alone would not strictly guarantee that a retry lands on a different endpoint. If the random cooldown period is so short that it ends before the retry is executed, the retry would still end up on the same endpoint. But that feels like a mostly theoretical argument, because at that point the endpoint is also open again to all other requests.

So the bottom line is: are we really sure that we want both cooldown and exclusion? Or could we replace exclusion with cooldown? This whole feature is already quite complex, and adding yet another layer on top of the layers that we already have feels like a bit too much.

Contributor Author

@rahul2393 rahul2393 Apr 16, 2026

Having both gives better throughput in the benchmark numbers. I will add a TODO to investigate later whether we can remove one.

Comment thread spanner/endpoint_overload_cooldown.go Outdated
Comment thread spanner/endpoint_overload_cooldown.go Outdated
Comment thread spanner/endpoint_overload_cooldown.go Outdated
@rahul2393 rahul2393 requested a review from olavloite April 16, 2026 12:54