Skip to content

Handle task queue dispatch rate limiting in WCI autoscaling#64

Open
smuneebahmad wants to merge 4 commits into
mainfrom
muneeb/tq-dispatch-rl
Open

Handle task queue dispatch rate limiting in WCI autoscaling#64
smuneebahmad wants to merge 4 commits into
mainfrom
muneeb/tq-dispatch-rl

Conversation

@smuneebahmad

@smuneebahmad smuneebahmad commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

What was changed

Extend WCI to correctly handle task queue dispatch rate limiting for Lambda compute providers. Rate-limited events based on task queue dispatch rate, were previously incorrectly treated as "no worker available" and triggered worker invocations.
This change classifies rate-limited events separately using the new SyncMatchOutcome signal from PR #10045, preventing spurious Lambda invocations when rate limiting is the bottleneck.

The rate_based algorithm (GCP Cloud Run) will be tracked as a follow-up change.

Note: The current changes validate rate limiting during task submission (sync match path), and do not currently handle rate limiting during backlog drain (async dispatch path)

Why?

Adding more workers when the rate limiting is the bottleneck does nothing. Before this change, WCI couldn't distinguish "no worker available" from "worker available but rate-limited", so it scaled up in both cases unnecessarily.

Checklist

  1. Closes: https://temporalio.atlassian.net/browse/COM-125

  2. How was this tested:

  • Unit tests cover signal classification, scale-up suppression, and metrics poll TTL-based suppression
    go test ./wci/client/... ./wci/workflow/scaling_algorithm/... -v -count=1 -timeout=60s
    
  • E2E validated on a test cell with a 2 RPS dispatch cap — confirmed no spurious invocations and backlog-poll suppression firing as expected

@smuneebahmad smuneebahmad marked this pull request as ready for review June 25, 2026 21:46
@smuneebahmad smuneebahmad requested a review from a team as a code owner June 25, 2026 21:46
@smuneebahmad smuneebahmad requested a review from 02strich June 25, 2026 22:07
Comment thread wci/client/hook.go
TaskQueueName: th.taskQueueName,
TaskQueueType: th.taskQueueType,
IsSyncMatch: event.IsSyncMatch,
IsSyncMatch: noSyncMatchBatchCount == 0, // backward compatibility: true when no genuine not-matched events so old WCI workers don't scale up

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this really the same? I am a bit confused why the change is needed as part of the PR

Comment thread wci/client/hook.go Outdated
Comment thread wci/workflow/iface/workflow.go Outdated
Comment on lines +48 to +54
// configNoSyncRateLimitedSuppressQuietMsKey: in ProcessMetricsPoll, suppress scale-up for this many ms
// after the last observed rate-limited signal. 0 = disabled.
// Default is 2× the poll interval.
configNoSyncRateLimitedSuppressQuietMsKey = "rate_limited_suppress_quiet_ms"
configNoSyncRateLimitedSuppressQuietMsDefault = 2 * configNoSyncMetricsPollIntervalMsDefault

stateLastRateLimitedTimestampKey = "last_rate_limited_time_ms"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this needed for?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added more explanation around how these values are used to adjust ProcessMetricsPoll to handle TQ dispatch rate limiting.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The explanation helps, but now I am unclear whether you are leveraging an implementation detail here. Did you confirm that these rate limit events will be created reliably?

Comment thread wci/workflow/scaling_algorithm/no_sync_match.go
Comment thread wci/workflow/scaling_algorithm/no_sync_match.go Outdated
@smuneebahmad smuneebahmad force-pushed the muneeb/tq-dispatch-rl branch from 67a83bb to 231c21d Compare June 26, 2026 00:08
@smuneebahmad smuneebahmad force-pushed the muneeb/tq-dispatch-rl branch from 231c21d to 4859762 Compare June 26, 2026 00:11
@smuneebahmad smuneebahmad force-pushed the muneeb/tq-dispatch-rl branch from 58f5b1e to 27e03e5 Compare June 26, 2026 00:50
Comment on lines +157 to +158
// Rate-limited events (worker is available but there's no task handoff due to rate-limiting) are a
// distinct outcome tracked separately in RateLimitedSignalsSinceLast.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

invert the sentence structure and move it down to be the comment above RateLimitedSignalsSinceLast?

updatedState[stateRateLimitedCountSinceLastPollKey] = prevCount + int64(event.RateLimitedSignalsSinceLast)
}

return &TaskAddResponse{Actions: actions, Status: updatedState, ThrottledCount: throttledCount, RateLimitedCount: rateLimitedCount}, nil

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we return the input data RateLimitedCount here? do you expect the algo to change the number?

Comment on lines +48 to +54
// configNoSyncRateLimitedSuppressQuietMsKey: in ProcessMetricsPoll, suppress scale-up for this many ms
// after the last observed rate-limited signal. 0 = disabled.
// Default is 2× the poll interval.
configNoSyncRateLimitedSuppressQuietMsKey = "rate_limited_suppress_quiet_ms"
configNoSyncRateLimitedSuppressQuietMsDefault = 2 * configNoSyncMetricsPollIntervalMsDefault

stateLastRateLimitedTimestampKey = "last_rate_limited_time_ms"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The explanation helps, but now I am unclear whether you are leveraging an implementation detail here. Did you confirm that these rate limit events will be created reliably?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants