forked from keksiqc/ctrld-sync
Daily Perf Improver - Add jitter to retry backoff for improved API reliability #295
Merged: abhimehro merged 9 commits into `main` from `perf/add-jitter-to-retry-backoff-336fbb90ca0980f2` on Feb 19, 2026
Changes from all commits (9 commits):
- 9086d7f perf: Add jitter to exponential backoff retry logic (github-actions[bot])
- 56ed181 Update benchmark_retry_jitter.py (abhimehro)
- 4576580 Update tests/test_retry_jitter.py (abhimehro)
- 2d33756 Update .github/copilot/instructions/api-retry-strategy.md (abhimehro)
- ae3fce3 Update tests/test_retry_jitter.py (abhimehro)
- 77f2c9b Update .github/copilot/instructions/api-retry-strategy.md (abhimehro)
- deae490 Update benchmark_retry_jitter.py (abhimehro)
- 1fc87f0 Merge branch 'main' into perf/add-jitter-to-retry-backoff-336fbb90ca0… (abhimehro)
- 32d7718 Merge branch 'main' into perf/add-jitter-to-retry-backoff-336fbb90ca0… (abhimehro)
File: `.github/copilot/instructions/api-retry-strategy.md` (new file, 60 lines)

# API Retry Strategy Guide
## Performance Context

The Control D API has strict rate limits. The sync tool retries failed requests with exponential backoff to handle transient failures (network issues, temporary server errors) while respecting API constraints.
## Current Implementation

**Location:** `main.py::_retry_request()` (line ~845)

**Key characteristics:**
- Max retries: 10 attempts (configurable via `MAX_RETRIES`)
- Base delay: 1 second (configurable via `RETRY_DELAY`)
- Exponential backoff: `delay * (2 ** attempt)` → 1s, 2s, 4s, 8s, 16s, ...
- Smart error handling: don't retry 4xx errors except 429 (rate limit)
- Security-aware: sanitizes error messages in logs
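The characteristics above can be sketched as a single retry loop. This is an illustrative reconstruction, not the code in `main.py`: the names `retry_request` and `HTTPError` are hypothetical, and the real `_retry_request()` also handles log sanitization and other details.

```python
import random
import time


MAX_RETRIES = 10   # mirrors the documented default
RETRY_DELAY = 1.0  # mirrors the documented base delay (seconds)


class HTTPError(Exception):
    """Hypothetical stand-in for an HTTP error with a status code."""
    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def retry_request(send, max_retries=MAX_RETRIES, base_delay=RETRY_DELAY, sleep=time.sleep):
    """Call send() with jittered exponential backoff, per the policy above."""
    for attempt in range(max_retries):
        try:
            return send()
        except HTTPError as err:
            # Fail fast on 4xx client errors, except 429 (rate limit).
            if 400 <= err.status_code < 500 and err.status_code != 429:
                raise
            # Out of attempts: surface the last error.
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with +/-50% jitter.
            wait = base_delay * (2 ** attempt) * (0.5 + random.random())
            sleep(wait)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.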
## Jitter Pattern (Recommended)

**Why jitter matters:**
When multiple requests fail simultaneously (e.g., during an API outage), synchronized retries create a "thundering herd": all clients retry at exactly the same time, overwhelming the recovering server. Jitter randomizes retry timing to spread the load.

**Implementation formula:**
```python
import random

wait_time = (delay * (2 ** attempt)) * (0.5 + random.random())
```

This adds ±50% randomness: a 4s backoff becomes a 2-6s range.
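The ±50% bound can be checked numerically. This snippet samples the formula for the third retry (4s base backoff) and confirms every jittered wait lands in the stated 2-6s window:

```python
import random

delay, attempt = 1.0, 2  # third retry: base backoff of 4s
samples = [
    (delay * (2 ** attempt)) * (0.5 + random.random())  # jittered wait
    for _ in range(10_000)
]
# random.random() is in [0.0, 1.0), so every sample is in [2.0, 6.0)
print(f"min={min(samples):.2f}s max={max(samples):.2f}s")
```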
**Maintainer rationale (from discussion #219):**
> "API rate limits are non-negotiable. Serial processing exists because I got burned by 429s and zombie states in production. Any retry improvement needs rock-solid rate limit awareness."
## Testing Approach

**Unit tests:**
- Verify jitter stays within bounds (0.5x to 1.5x base delay)
- Confirm 4xx errors (except 429) still don't retry
- Check max retries still respected
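A unit test for the second bullet might look like this sketch, assuming a small predicate that mirrors the documented policy (`should_retry` is hypothetical, not a function in `main.py`):

```python
def should_retry(status_code: int) -> bool:
    """Retry 429 and 5xx; fail fast on all other 4xx client errors."""
    if status_code == 429:            # rate limit: always retryable
        return True
    if 400 <= status_code < 500:      # other client errors: never retry
        return False
    return status_code >= 500         # server errors: retryable


def test_retry_policy():
    assert should_retry(429)
    assert should_retry(500) and should_retry(503)
    assert not should_retry(400) and not should_retry(404)
```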
**Integration tests:**
- Simulate transient failures (mock server returning 500s)
- Measure retry timing distribution (should show variance)
- Confirm eventual success after transient errors
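The timing-distribution check can be approximated without a mock server. This sketch (a hypothetical helper, not part of the test suite) samples first-retry delays for many simulated clients and asserts that jitter produces spread:

```python
import random
import statistics


def first_retry_delays(num_clients: int, base_delay: float = 1.0) -> list:
    """Jittered first-retry delay per client: base_delay * [0.5, 1.5)."""
    return [base_delay * (0.5 + random.random()) for _ in range(num_clients)]


delays = first_retry_delays(100)
print(f"spread: {max(delays) - min(delays):.2f}s, "
      f"stdev: {statistics.pstdev(delays):.2f}s")
```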
**Performance validation:**
No performance degradation is expected: jitter adds microseconds of `random()` overhead per retry, negligible compared to network I/O.
## Common Pitfalls

1. **Don't add jitter to the initial request** - only to retries. The first attempt should be immediate.
2. **Don't exceed max backoff** - cap total wait time to prevent indefinite delays.
3. **Be cautious with 429 responses** - current behavior still uses jittered exponential backoff; once `Retry-After` handling is implemented, jitter should be disabled in favor of the header-specified delay.
4. **Don't break existing behavior** - ensure 4xx non-retryable errors still fail fast.
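Pitfall 2 can be handled by capping the exponential term before applying jitter. A minimal sketch, where `MAX_BACKOFF` is an assumed constant rather than one defined in `main.py`:

```python
import random

MAX_BACKOFF = 60.0  # assumed cap in seconds; not an existing constant


def capped_jittered_backoff(base_delay: float, attempt: int) -> float:
    """Cap the exponential term first, then apply +/-50% jitter."""
    uncapped = base_delay * (2 ** attempt)
    return min(uncapped, MAX_BACKOFF) * (0.5 + random.random())
```

Capping before jittering keeps the jittered wait bounded to `[0.5 * MAX_BACKOFF, 1.5 * MAX_BACKOFF)` even for large attempt numbers.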
## Future Improvements

- **Rate limit header parsing:** Implement proper `Retry-After` handling for 429 responses (and bypass jitter when a valid `Retry-After` value is present)
- **Circuit breaker:** Stop retrying after consecutive failures to prevent cascading failures
- **Per-endpoint tracking:** Different backoff strategies for read vs. write operations
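The circuit-breaker item could start from something as small as this sketch (purely illustrative; no such class exists in the codebase yet):

```python
class CircuitBreaker:
    """Trips open after `threshold` consecutive failures; callers skip requests while open."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.threshold

    def record_success(self) -> None:
        # Any success resets the failure streak and closes the breaker.
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1
```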
File: `benchmark_retry_jitter.py` (new file, 133 lines)

```python
#!/usr/bin/env python3
"""
Synthetic benchmark to demonstrate retry jitter behavior.

Run this script to see how jitter randomizes retry delays compared to
deterministic exponential backoff.

Usage: python3 benchmark_retry_jitter.py
"""

import random
from typing import List


def simulate_retries_without_jitter(max_retries: int, base_delay: float) -> List[float]:
    """Simulate retry delays WITHOUT jitter (old behavior)."""
    delays = []
    for attempt in range(max_retries - 1):
        wait_time = base_delay * (2 ** attempt)
        delays.append(wait_time)
    return delays


def simulate_retries_with_jitter(max_retries: int, base_delay: float) -> List[float]:
    """Simulate retry delays WITH jitter (new behavior)."""
    delays = []
    for attempt in range(max_retries - 1):
        base_wait = base_delay * (2 ** attempt)
        jitter_factor = 0.5 + random.random()  # [0.5, 1.5)
        wait_time = base_wait * jitter_factor
        delays.append(wait_time)
    return delays


def main():
    print("=" * 60)
    print("Retry Jitter Performance Demonstration")
    print("=" * 60)
    print()

    max_retries = 5
    base_delay = 1.0

    print(f"Configuration: max_retries={max_retries}, base_delay={base_delay}s")
    print()

    # Without jitter (deterministic)
    print("WITHOUT JITTER (old behavior):")
    print("All clients retry at exactly the same time (thundering herd)")
    print()
    without_jitter = simulate_retries_without_jitter(max_retries, base_delay)
    for i, delay in enumerate(without_jitter):
        print(f"  Attempt {i+1}: {delay:6.2f}s")
    print(f"  Total: {sum(without_jitter):6.2f}s")
    print()

    # With jitter (randomized)
    print("WITH JITTER (new behavior):")
    print("Retries spread across time window, reducing server load spikes")
    print()

    # Run 3 simulations to show variance
    for run in range(3):
        print(f"  Run {run+1}:")
        with_jitter = simulate_retries_with_jitter(max_retries, base_delay)
        for i, delay in enumerate(with_jitter):
            base = base_delay * (2 ** i)
            print(f"    Attempt {i+1}: {delay:6.2f}s "
                  f"(base: {base:4.1f}s, range: [{base*0.5:.1f}s, {base*1.5:.1f}s])")
        print(f"    Total: {sum(with_jitter):6.2f}s")
        print()

    # Statistical analysis
    print("IMPACT ANALYSIS:")
    print()

    # Simulate thundering herd scenario
    num_clients = 100
    print(f"Scenario: {num_clients} clients all fail at the same time")
    print()

    print("WITHOUT JITTER:")
    print(f"  At t=1s: ALL {num_clients} clients retry simultaneously → server overload")
    print(f"  At t=2s: ALL {num_clients} clients retry simultaneously → server overload")
    print(f"  At t=4s: ALL {num_clients} clients retry simultaneously → server overload")
    print()

    print("WITH JITTER:")
    # Simulate retry distribution
    retry_times = []
    for _ in range(num_clients):
        first_retry = base_delay * (0.5 + random.random())
        retry_times.append(first_retry)

    retry_times.sort()
    min_time = min(retry_times)
    max_time = max(retry_times)
    avg_time = sum(retry_times) / len(retry_times)

    print(f"  First retry window: {min_time:.2f}s to {max_time:.2f}s "
          f"(spread: {max_time - min_time:.2f}s)")
    print(f"  Average first retry: {avg_time:.2f}s")
    print(f"  Retries distributed over time → reduced peak load on server")
    print()

    # Calculate approximate load reduction based on bucketed concurrency
    print("THEORETICAL LOAD REDUCTION:")
    window_size = max_time - min_time
    if window_size > 0:
        # Use small time buckets (e.g., 100ms) to approximate peak concurrent retries
        bucket_size = 0.1  # seconds
        num_buckets = max(1, int(window_size / bucket_size) + 1)
        buckets = [0] * num_buckets

        # Count how many retries fall into each time bucket
        for t in retry_times:
            # Normalize to start of window and compute bucket index
            idx = int((t - min_time) / bucket_size)
            if idx >= num_buckets:
                # Clamp to last bucket to handle any floating-point edge cases
                idx = num_buckets - 1
            buckets[idx] += 1

        peak_with_jitter = max(buckets)
        peak_without_jitter = num_clients  # all clients retry together without jitter
        reduction = (1 - (peak_with_jitter / peak_without_jitter)) * 100

        print(f"  Approximate peak concurrent retries with jitter: "
              f"{peak_with_jitter} (per {bucket_size:.1f}s)")
        print(f"  Peak concurrent retries reduced by approximately {reduction:.0f}%")
    else:
        # In the extremely unlikely case that all retries occur at the same instant
        print("  All retries occurred at the same time; no observable spreading in this run.")
    print()

    print("✅ Jitter prevents thundering herd and improves system reliability")


if __name__ == "__main__":
    main()
```
Check notice (Code scanning / Bandit, severity: Note): "Standard pseudo-random generators are not suitable for security/cryptographic purposes." Flagged on the `random.random()` call in `benchmark_retry_jitter.py`; retry jitter timing is not security-sensitive.