
🔄 Fix: Implement exponential backoff in edge node [#5]#8

Open
Jerrycyborg wants to merge 1 commit into main from fix/issue-5-exponential-backoff

Conversation

@Jerrycyborg
Owner

Summary

Implements exponential backoff with jitter for edge node error handling (Issue #5), preventing thundering herd problems when multiple nodes encounter errors simultaneously.

Changes

  1. Added constants:

    • MAX_BACKOFF_MS (default: 30s, configurable via EDGEMESH_MAX_BACKOFF_MS)
  2. Exponential backoff state tracking:

    • backoffMs: Current backoff duration (starts at pollMs)
    • consecutiveErrors: Counter for error tracking
  3. Error handling logic:

    • Doubles backoff on each consecutive error (exponential growth)
    • Adds 0-1000ms random jitter to prevent synchronized retries
    • Caps max backoff at MAX_BACKOFF_MS to prevent infinite delays
    • Resets backoff to pollMs on successful task execution
  4. Improved observability:

    • Logs backoff duration: backing off for Xms
    • Tracks consecutive error count in log messages
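The constants and state from the change list above can be sketched as follows. This is illustrative, not the actual diff: the placement inside the edge node module and the env-var parsing shape are assumptions, only the names (`MAX_BACKOFF_MS`, `EDGEMESH_MAX_BACKOFF_MS`, `pollMs`, `backoffMs`, `consecutiveErrors`) come from the PR description.

```javascript
// Max backoff: 30s default, overridable via EDGEMESH_MAX_BACKOFF_MS.
// Number(undefined) is NaN, so an unset env var falls through to the default.
const DEFAULT_MAX_BACKOFF_MS = 30_000;
const MAX_BACKOFF_MS =
  Number(process.env.EDGEMESH_MAX_BACKOFF_MS) || DEFAULT_MAX_BACKOFF_MS;

// Per-node backoff state.
const pollMs = 1500;        // base polling interval (1.5s)
let backoffMs = pollMs;     // current backoff duration, starts at pollMs
let consecutiveErrors = 0;  // consecutive loop errors seen so far
```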

Before

catch (err) {
  console.error(`[edge-node:${nodeId}] loop error`, err);
  await sleep(pollMs); // Fixed 1.5s delay
}

After

catch (err) {
  consecutiveErrors++;
  const jitter = Math.random() * 1000;
  const sleepTime = Math.min(backoffMs + jitter, MAX_BACKOFF_MS);
  
  console.log(`backing off for ${Math.round(sleepTime)}ms`);
  await sleep(sleepTime);
  
  backoffMs = Math.min(backoffMs * 2, MAX_BACKOFF_MS);
}
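The diff excerpt shows only the catch branch; the reset described in the change list ("Resets backoff to pollMs on successful task execution") would sit on the success path. A minimal sketch of that reset, with the surrounding values stubbed for illustration (`onTaskSuccess` is a hypothetical name, not from the diff):

```javascript
const pollMs = 1500;
let backoffMs = 24_000;     // pretend the node is mid-backoff
let consecutiveErrors = 4;

// Hypothetical hook called after a task executes without error.
function onTaskSuccess() {
  // Reset backoff on successful task execution
  backoffMs = pollMs;
  consecutiveErrors = 0;
}

onTaskSuccess();
```

After a single successful execution the node is back to its normal 1.5s polling cadence, which is what gives the "quick recovery" claimed in the Benefits list.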

Backoff Progression Example

Error #   Backoff (without jitter)   With jitter range
0         1.5s                       1.5-2.5s
1         3s                         3-4s
2         6s                         6-7s
3         12s                        12-13s
4         24s                        24-25s
5+        30s (max)                  30-31s
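The "without jitter" column follows directly from the doubling-and-cap rule in the catch block above, and can be reproduced with a short loop (values per the PR description):

```javascript
const pollMs = 1500;
const MAX_BACKOFF_MS = 30_000;

let backoffMs = pollMs;
const progression = [];
for (let error = 0; error <= 5; error++) {
  progression.push(backoffMs);                      // delay slept on this error
  backoffMs = Math.min(backoffMs * 2, MAX_BACKOFF_MS); // double, capped at max
}
console.log(progression); // [ 1500, 3000, 6000, 12000, 24000, 30000 ]
```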

Benefits

  • ✅ Prevents control plane overload during outages
  • ✅ Distributed jitter reduces synchronized thundering herd
  • ✅ Quick recovery when errors resolve (resets to 1.5s)
  • ✅ Bounded worst-case delay (30s cap)
  • ✅ Configurable via environment variable

Testing

  • Existing tests should pass
  • Manual test: Stop control plane and observe edge node logs
  • Should see increasing backoff times with jitter
  • Restart control plane and verify quick recovery

Closes

#5

- Add MAX_BACKOFF_MS constant (default 30s, configurable via env)
- Track backoff state (backoffMs, consecutiveErrors)
- Double backoff on each consecutive error (exponential)
- Add 0-1000ms jitter to prevent thundering herd
- Reset backoff to pollMs on successful task execution
- Log backoff duration and attempt count for observability

This prevents overwhelming the control plane when nodes encounter
persistent errors, while allowing quick recovery when errors resolve.

Closes #5

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fb90e6b62a


Comment on lines +177 to +179
// Reset backoff on successful task execution
backoffMs = pollMs;
consecutiveErrors = 0;


P2: Reset backoff after any successful poll

The new backoff state is only reset after a task is fully executed, so successful claimTask() polls that return no task do not clear backoffMs/consecutiveErrors. After a temporary outage, if the node spends time idling with empty queues, the next isolated error is still treated like a long error streak and can sleep up to MAX_BACKOFF_MS, which delays recovery and reduces polling responsiveness even though many successful polls occurred in between.
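The bot's suggested fix amounts to clearing the backoff state after any successful `claimTask()` poll, whether or not a task was returned. A sketch under that assumption (the stubbed `claimTask` is synchronous here for brevity; in the edge node it is async, and `executeTask` is left as a comment):

```javascript
const pollMs = 1500;
let backoffMs = 30_000;     // node just came out of an outage at max backoff
let consecutiveErrors = 8;

function claimTask() { return null; }  // stub: poll succeeds, queue is empty

function pollOnce() {
  const task = claimTask();
  // The poll itself succeeded, so recovery starts now rather than waiting
  // for the next task to actually execute.
  backoffMs = pollMs;
  consecutiveErrors = 0;
  if (task) { /* await executeTask(task) in the real loop */ }
}

pollOnce();
```

With this placement, an isolated error after a long idle stretch sleeps only `pollMs` plus jitter instead of up to `MAX_BACKOFF_MS`.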




Development

Successfully merging this pull request may close these issues.

🔴 CRITICAL: Implement exponential backoff in edge node error handling
