Handle task queue dispatch rate limiting in WCI autoscaling#64
Handle task queue dispatch rate limiting in WCI autoscaling#64smuneebahmad wants to merge 4 commits into
Conversation
| TaskQueueName: th.taskQueueName, | ||
| TaskQueueType: th.taskQueueType, | ||
| IsSyncMatch: event.IsSyncMatch, | ||
| IsSyncMatch: noSyncMatchBatchCount == 0, // backward compatibility: true when no genuine not-matched events so old WCI workers don't scale up |
There was a problem hiding this comment.
is this really the same? I am a bit confused why the change is needed as part of the PR
| // configNoSyncRateLimitedSuppressQuietMsKey: in ProcessMetricsPoll, suppress scale-up for this many ms | ||
| // after the last observed rate-limited signal. 0 = disabled. | ||
| // Default is 2× the poll interval. | ||
| configNoSyncRateLimitedSuppressQuietMsKey = "rate_limited_suppress_quiet_ms" | ||
| configNoSyncRateLimitedSuppressQuietMsDefault = 2 * configNoSyncMetricsPollIntervalMsDefault | ||
|
|
||
| stateLastRateLimitedTimestampKey = "last_rate_limited_time_ms" |
There was a problem hiding this comment.
what is this needed for?
There was a problem hiding this comment.
Added more explanation around how these values are used to adjust ProcessMetricsPoll to handle TQ dispatch rate limiting.
There was a problem hiding this comment.
The explanation helps, but now I am unclear whether you are leveraging an implementation detail here. Did you confirm that these rate limit events will be created reliably?
67a83bb to
231c21d
Compare
231c21d to
4859762
Compare
58f5b1e to
27e03e5
Compare
| // Rate-limited events (worker is available but there's no task handoff due to rate-limiting) are a | ||
| // distinct outcome tracked separately in RateLimitedSignalsSinceLast. |
There was a problem hiding this comment.
invert the sentence structure and move it down to be the comment above RateLimitedSignalsSinceLast?
| updatedState[stateRateLimitedCountSinceLastPollKey] = prevCount + int64(event.RateLimitedSignalsSinceLast) | ||
| } | ||
|
|
||
| return &TaskAddResponse{Actions: actions, Status: updatedState, ThrottledCount: throttledCount, RateLimitedCount: rateLimitedCount}, nil |
There was a problem hiding this comment.
why do we return the input data RateLimitedCount here? do you expect the algo to change the number?
| // configNoSyncRateLimitedSuppressQuietMsKey: in ProcessMetricsPoll, suppress scale-up for this many ms | ||
| // after the last observed rate-limited signal. 0 = disabled. | ||
| // Default is 2× the poll interval. | ||
| configNoSyncRateLimitedSuppressQuietMsKey = "rate_limited_suppress_quiet_ms" | ||
| configNoSyncRateLimitedSuppressQuietMsDefault = 2 * configNoSyncMetricsPollIntervalMsDefault | ||
|
|
||
| stateLastRateLimitedTimestampKey = "last_rate_limited_time_ms" |
There was a problem hiding this comment.
The explanation helps, but now I am unclear whether you are leveraging an implementation detail here. Did you confirm that these rate limit events will be created reliably?
What was changed
Extend WCI to correctly handle task queue dispatch rate limiting for Lambda compute providers. Rate-limited events based on task queue dispatch rate, were previously incorrectly treated as "no worker available" and triggered worker invocations.
This change classifies rate-limited events separately using the new
SyncMatchOutcomesignal from PR #10045, preventing spurious Lambda invocations when rate limiting is the bottleneck.The rate_based algorithm (GCP Cloud Run) will be tracked as a follow-up change.
Note: The current changes validate rate limiting during task submission (sync match path), and do not currently handle rate limiting during backlog drain (async dispatch path)
Why?
Adding more workers when the rate limiting is the bottleneck does nothing. Before this change, WCI couldn't distinguish "no worker available" from "worker available but rate-limited", so it scaled up in both cases unnecessarily.
Checklist
Closes: https://temporalio.atlassian.net/browse/COM-125
How was this tested: