[Bounty $10k] fix(scheduler): add exponential backoff with jitter to queue retries #3931

Open
Karry2019web wants to merge 1 commit into orchestration-agent:main from Karry2019web:bounty-3929-queue-retry-jitter

Conversation

@Karry2019web

Summary

Fixes #3929

Adds exponential backoff with jitter to TaskScheduler.fail() so that transient queue failures are retried with increasing delays instead of immediately. This prevents thundering-herd scenarios where multiple retrying tasks overwhelm the queue simultaneously.

Changes

  • _compute_retry_delay(retry_count): New method that returns min(base_retry_delay * 2^retry_count, max_retry_delay) * (1 + uniform(-jitter_factor, jitter_factor))
  • fail(): Now calls schedule(delay=...) instead of enqueue() for retries, inserting a jittered delay
  • Constructor params: base_retry_delay (1.0s), max_retry_delay (120.0s), jitter_factor (0.5), all configurable
  • _scheduled dict: Stores retry wakeup times; dequeue() picks them up when expired
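The delay formula in the list above can be sketched as a standalone function. This is a minimal illustration, not the code in this PR: the free-function form and the `rng` parameter are assumptions added so the example runs on its own, and the parameter values are the ones listed above.

```python
import random


def compute_retry_delay(retry_count, base_retry_delay=1.0,
                        max_retry_delay=120.0, jitter_factor=0.5,
                        rng=random):
    """Illustrative version of the formula described in this PR.

    `rng` is a hypothetical parameter (anything with a `uniform` method)
    so the jitter can be seeded in tests.
    """
    # Exponential growth, capped before jitter is applied.
    capped = min(base_retry_delay * (2 ** retry_count), max_retry_delay)
    # Scale by a random factor in [1 - jitter_factor, 1 + jitter_factor].
    return capped * (1.0 + rng.uniform(-jitter_factor, jitter_factor))


if __name__ == "__main__":
    seeded = random.Random(0)
    for attempt in range(9):
        print(attempt, round(compute_retry_delay(attempt, rng=seeded), 2))
```

Read literally, the formula applies jitter after the cap, so a delay can exceed max_retry_delay by up to jitter_factor (up to 180 s with the values listed above). If max_retry_delay is meant to be a hard ceiling, the cap would have to be applied after the jitter instead.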

Testing

  • test_fail_respects_max_retries: Verifies that retries stop once max retries are exhausted
  • test_retry_delay_uses_schedule_not_enqueue: Confirms failed tasks go to _scheduled with future timestamps
  • test_retry_delay_grows_exponentially: Verifies that the delay increases between successive retries
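The diff itself is not reproduced in this conversation, so the following is only a toy sketch of the behaviour these tests describe. `ToyScheduler`, its `clock` and `rng` parameters, the `max_retries` default, and the `dead` list are all hypothetical names invented for the illustration; only `fail()`, `schedule()`, `enqueue()`, `dequeue()`, `_scheduled`, and `_compute_retry_delay()` are named in the description.

```python
import random
import time
from collections import deque


class ToyScheduler:
    """Hypothetical stand-in for TaskScheduler; not the PR's code."""

    def __init__(self, max_retries=3, base_retry_delay=1.0,
                 max_retry_delay=120.0, jitter_factor=0.5,
                 clock=None, rng=None):
        self.max_retries = max_retries
        self.base_retry_delay = base_retry_delay
        self.max_retry_delay = max_retry_delay
        self.jitter_factor = jitter_factor
        self._clock = clock or time.monotonic  # injectable for tests
        self._rng = rng or random.Random()
        self._queue = deque()
        self._scheduled = {}  # task_id -> wakeup time
        self._retries = {}    # task_id -> retries used so far
        self.dead = []        # tasks that exhausted their retries

    def _compute_retry_delay(self, retry_count):
        capped = min(self.base_retry_delay * 2 ** retry_count,
                     self.max_retry_delay)
        jitter = self._rng.uniform(-self.jitter_factor, self.jitter_factor)
        return capped * (1 + jitter)

    def enqueue(self, task_id):
        self._queue.append(task_id)

    def schedule(self, task_id, delay):
        self._scheduled[task_id] = self._clock() + delay

    def fail(self, task_id):
        count = self._retries.get(task_id, 0)
        if count >= self.max_retries:
            self.dead.append(task_id)
            return False
        self._retries[task_id] = count + 1
        # Delayed retry via schedule() rather than an immediate enqueue().
        self.schedule(task_id, delay=self._compute_retry_delay(count))
        return True

    def dequeue(self):
        now = self._clock()
        # Move every scheduled task whose wakeup time has passed.
        ready = [t for t, wake in self._scheduled.items() if wake <= now]
        for task_id in ready:
            del self._scheduled[task_id]
            self._queue.append(task_id)
        return self._queue.popleft() if self._queue else None
```

With `jitter_factor=0.0` and a fake clock, successive failures of one task in this sketch are scheduled 1 s, 2 s, and 4 s out, and a fourth failure is rejected once `max_retries=3` is used up.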

Fixes orchestration-agent#3929

When a task fails in the queue runtime, the TaskScheduler now delays
retries with exponential backoff plus random jitter instead of
re-enqueuing immediately. This prevents thundering-herd problems
where multiple workers retry in lockstep and overwhelm downstream
services.

Changes:
- Add _compute_retry_delay() with exponential backoff + jitter
- fail() now schedules retries via schedule() with a computed delay
- Add base_retry_delay, max_retry_delay, jitter_factor config params
- Add tests for delay scheduling, exponential growth, max retries

Development

Successfully merging this pull request may close these issues.

[ Bounty $10k ] [ Runtime ] Retry transient queue failures with jitter — queue runtime
