Skip to content

feat: re-register repeatable jobs on Redis reconnect#37

Merged
genisd merged 1 commit into
gynzyfrom
feat/re-register-crons-on-redis-restart
Jun 29, 2026
Merged

feat: re-register repeatable jobs on Redis reconnect#37
genisd merged 1 commit into
gynzyfrom
feat/re-register-crons-on-redis-restart

Conversation

@genisd

@genisd genisd commented Jun 25, 2026

Copy link
Copy Markdown
Member

Problem

Repeatable BullMQ jobs — the poll ticks for pr-watcher, repo-cleanup (health-check + stall-check), ticket-sync, workflow-trigger-checker, token-validation, and reconcile-resync — were registered only at API boot (inside each startXxxWorker()).

A Redis restart/flush (e.g. a Redis pod OOMKill) wipes the repeat scheduler keys. Nothing re-adds them, so the pollers silently stop until the API itself is restarted. Notably this halts the schedule-trigger poll loop, so timed Tasks stop firing.

(The user-defined schedules themselves live in Postgres workflow_triggers.nextFireAt and are unaffected — only the BullMQ tick that reads them is lost.)

Change

  • services/repeatable-jobs.tsREPEATABLE_JOBS, a single source of truth for every repeat job (same OPTIO_* env vars + defaults as before), and ensureRepeatJobs(): idempotent queue.add(..., {repeat}) per entry, in-flight guard, never throws.
  • services/repeat-job-monitor.ts — a dedicated ioredis connection. The first ready (boot) is ignored; every later ready (reconnect) debounces 2s then calls ensureRepeatJobs(). Wired into index.ts startup and graceful shutdown.
  • Workers — dropped the inline queue.add(...repeat...) registrations so boot and reconnect share the same definitions (no drift). index.ts now does cleanRepeatJobsensureRepeatJobs() → start workers → start monitor.

Scope note

Recovery triggers on reconnect only (the actual OOMKill/restart failure mode — the TCP socket drops and ioredis re-fires ready). A logical FLUSHALL/eviction that leaves the socket up emits no ready and is intentionally out of scope.

Tests

  • repeatable-jobs.test.ts — registers exactly the REPEATABLE_JOBS entries with correct repeat options, closes each queue, in-flight guard shares one pass across concurrent calls.
  • repeat-job-monitor.test.ts — first ready ignored, reconnect re-registers after debounce, burst of reconnects debounced into one, stop quits the connection.

Verification

  • pnpm turbo typecheck
  • pnpm turbo test ✓ (2022 passing, incl. 7 new)
  • prettier --check ✓ / eslint 0 errors

Repeatable BullMQ jobs (poll ticks for pr-watcher, repo-cleanup,
ticket-sync, workflow-trigger-checker, token-validation,
reconcile-resync) were registered only at API boot. A Redis
restart/flush wipes the repeat schedulers, so the pollers silently
stop until the API is restarted.

Add a single source of truth (REPEATABLE_JOBS) plus idempotent
ensureRepeatJobs(), and a dedicated monitor connection that
re-registers them on every reconnect (debounced). Workers no longer
self-register inline, so boot and reconnect share the same definitions.
@genisd genisd merged commit e50b34b into gynzy Jun 29, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants