backend(webhookService): retry processor has no circuit breaker, a degraded subscriber endpoint blocks all retries #655

@ogazboiz

Description

Join our community: https://t.me/+DOylgFv1jyJlNzM0

The webhook retry processor in backend/src/services/webhookService.ts fetches up to 100 pending retries and processes them, with no circuit breaker. If one subscriber endpoint is slow (e.g., 30s response time), the processor serializes on that endpoint, and pending retries for every other subscriber are delayed behind it.

The RPC timeout added in #617 covers only Stellar calls. Webhook delivery to subscriber URLs still has no response timeout of its own.

Expected Behavior

  1. Add a per-delivery timeout (e.g., 10 seconds) for each webhook HTTP call
  2. Track consecutive failures per subscriber URL and implement circuit breaking: after N failures, stop retrying that URL for a cooldown period
  3. Process retries concurrently across different subscriber URLs (group by URL and parallelize groups)
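
For item 2, a minimal sketch of per-URL failure tracking could look like the following. The class name, method names, and the default threshold and cooldown are all hypothetical, and the state is held in memory only, so it would reset on restart and would not be shared across instances:

```typescript
// Hypothetical sketch: none of these names exist in webhookService.ts today.
interface BreakerState {
  consecutiveFailures: number;
  openUntil: number; // epoch ms; 0 means the circuit has never opened
}

class SubscriberCircuitBreaker {
  private readonly states = new Map<string, BreakerState>();
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  // Defaults are placeholder values, not tuned numbers.
  constructor(failureThreshold = 5, cooldownMs = 60_000) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  /** True while deliveries to this URL should be skipped. */
  isOpen(url: string, now: number = Date.now()): boolean {
    const state = this.states.get(url);
    return state !== undefined && now < state.openUntil;
  }

  /** A successful delivery clears the failure count for the URL. */
  recordSuccess(url: string): void {
    this.states.delete(url);
  }

  /** Count a failure and open the circuit once the threshold is reached. */
  recordFailure(url: string, now: number = Date.now()): void {
    const state = this.states.get(url) ?? { consecutiveFailures: 0, openUntil: 0 };
    state.consecutiveFailures += 1;
    if (state.consecutiveFailures >= this.failureThreshold) {
      state.openUntil = now + this.cooldownMs;
    }
    this.states.set(url, state);
  }
}
```

Once the cooldown expires, the next delivery attempt is allowed through. If it fails, the count is still at or above the threshold, so the circuit reopens immediately; if it succeeds, the URL starts from a clean state.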

Suggested Fix

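const MAX_DELIVERY_TIMEOUT_MS = 10_000;
// AbortSignal.timeout (Node 18+) cancels the in-flight request when the deadline passes.
// A Promise.race against setTimeout would only stop waiting; the fetch would keep running.
const response = await fetch(url, {
  method: 'POST',
  body: payload,
  signal: AbortSignal.timeout(MAX_DELIVERY_TIMEOUT_MS),
});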
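
For item 3, grouping by URL and running the groups in parallel might be sketched roughly as below. The `PendingRetry` shape and both function names are assumptions for illustration; the real retry record in webhookService.ts may carry different fields. The `deliver` callback is where the timed-out `fetch` and any circuit breaker check would go:

```typescript
// Hypothetical shape of a pending retry; adjust to the actual record.
interface PendingRetry {
  id: string;
  url: string;
  payload: string;
}

type Deliver = (retry: PendingRetry) => Promise<void>;

/** Group retries by subscriber URL, preserving order within each group. */
function groupByUrl(retries: PendingRetry[]): Map<string, PendingRetry[]> {
  const groups = new Map<string, PendingRetry[]>();
  for (const retry of retries) {
    const group = groups.get(retry.url);
    if (group) {
      group.push(retry);
    } else {
      groups.set(retry.url, [retry]);
    }
  }
  return groups;
}

/**
 * Deliver each URL's retries one after another, but run the URL groups
 * concurrently, so a slow endpoint only delays its own retries.
 * Returns the ids of retries whose delivery threw.
 */
async function processRetries(
  retries: PendingRetry[],
  deliver: Deliver,
): Promise<string[]> {
  const failedIds: string[] = [];
  const groups = Array.from(groupByUrl(retries).values());
  await Promise.all(
    groups.map(async (group) => {
      for (const retry of group) {
        try {
          await deliver(retry);
        } catch {
          // Record the failure and keep going; rescheduling is left to the caller.
          failedIds.push(retry.id);
        }
      }
    }),
  );
  return failedIds;
}
```

With a batch capped at 100 retries, the number of concurrent groups is bounded by the batch size. If that is still too many outbound connections, a concurrency limit over the groups would be a reasonable follow-up.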

Impact

Medium. A single slow webhook subscriber can back up the entire retry queue, delaying notifications for all other subscribers.
