
feat(executions): add System Error status for SQS and infrastructure failures #1599

Draft
joelorzet wants to merge 1 commit into staging from feat/keep-853-mark-sqs-failures-system-error



Summary

Adds a distinct system_error execution status, persisted wherever a run fails for a platform/infrastructure reason (error_type = system) rather than a user/workflow error. System failures, including SQS issues, are now visible and filterable apart from user errors, end to end.

This builds on the existing system-vs-user error model (KEEP-693 / KEEP-545 / TECH-6544). It does not replace error_type/error_category/error_code; it adds a status that mirrors them so operators can see and filter system failures directly.

Both SQS failure directions are covered, along with mid-process and dispatch failures

  • Producer enqueue fails (schedule/block/event SendMessage throws): the pre-created phantom row resolves to system_error (CS-0001/BS-0001/ES-0001/N-0002).
  • Consumer never receives (message lost / dispatcher down): the reaper ages the phantom to system_error (P-0005, reclassified from workflow_engine to infrastructure).
  • Consumer fails mid-process: a new backstop marks the in-flight run system_error (E-0004) and deletes the message immediately, instead of leaving it for the reaper (up to 30 min later). A typed RequeueSignal preserves the deliberate concurrency re-queue so it is never marked or duplicated.
  • Dispatch / k8s / runtime failures: map to system_error via P-0002/P-0004 and the classifier on the completion path.
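The producer path in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ExecutionStore`, `enqueueExecution`, the `SendMessage` type, and the `pending` name for the phantom row's initial status are all hypothetical, and the error code is passed in because the description does not spell out which code belongs to which producer.

```typescript
// Hypothetical types: the real schema and queue client are not shown in this PR description.
type ExecutionStatus = "pending" | "running" | "success" | "error" | "system_error";

interface ExecutionRow {
  id: string;
  status: ExecutionStatus;
  errorType?: "system" | "user";
  errorCode?: string;
}

// In-memory stand-in for the workflow_executions table (assumption, for illustration only).
class ExecutionStore {
  private rows = new Map<string, ExecutionRow>();

  create(id: string): ExecutionRow {
    const row: ExecutionRow = { id, status: "pending" };
    this.rows.set(id, row);
    return row;
  }

  update(id: string, patch: Partial<ExecutionRow>): ExecutionRow {
    const current = this.rows.get(id);
    if (!current) throw new Error(`unknown execution ${id}`);
    const next: ExecutionRow = { ...current, ...patch };
    this.rows.set(id, next);
    return next;
  }
}

// Stand-in for the queue client's send call; it rejects when the enqueue fails.
type SendMessage = (body: string) => Promise<void>;

// Pre-create the phantom row, then try to enqueue. If the send throws,
// resolve the phantom to system_error instead of leaving it to age out.
async function enqueueExecution(
  store: ExecutionStore,
  send: SendMessage,
  id: string,
  errorCode: string,
): Promise<ExecutionRow> {
  const phantom = store.create(id);
  try {
    await send(JSON.stringify({ executionId: id }));
    return phantom;
  } catch {
    return store.update(id, { status: "system_error", errorType: "system", errorCode });
  }
}
```

In this sketch a failed send is never rethrown, so the caller sees the resolved row rather than an exception; whether the real producers also propagate the error is not stated in the description.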

How it stays consistent

A single rule, statusForErrorType(errorType), drives every write site. Every error-count query and terminal-state check matches both error and system_error, so no metric or wait path silently drops system errors:

  • getSystemErrorsByCategoryFromDb (the platform gauge the infra alert reads), the managed-org per-workflow gauge, the per-status execution gauge (gains a status=system_error series), and the duration histogram
  • analytics error buckets, x402 isTerminalStatus, the self-heal CAS, getCustomerRunErrorMessage, and the failure digest
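As a rough sketch of the rule described above: only the name `statusForErrorType` and the two status strings come from this description. The `"user"` value for the non-system error type and the helpers `isFailureStatus` and `countFailures` are assumptions added for illustration.

```typescript
// Assumed error_type values; the description only confirms "system".
type ErrorType = "system" | "user";
type FailureStatus = "error" | "system_error";

// Single mapping used at every write site: system failures get their own status,
// everything else keeps the existing "error" status.
function statusForErrorType(errorType: ErrorType): FailureStatus {
  return errorType === "system" ? "system_error" : "error";
}

// Hypothetical helper: any read path that counts or waits on failures must
// accept both statuses, otherwise system errors silently drop out.
function isFailureStatus(status: string): boolean {
  return status === "error" || status === "system_error";
}

// Hypothetical example of an error-count query expressed over in-memory statuses.
function countFailures(statuses: string[]): number {
  return statuses.filter(isFailureStatus).length;
}
```

Keeping the mapping in one function means a new write site cannot pick the wrong status by accident, and keeping the "is this a failure" check in one predicate makes it harder for a query to match only `error`.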

UI

Distinct amber "System Error" badge and filter in the runs table, and a dedicated label/icon/dot in workflow runs.

Design note for review

The consumer backstop deletes the SQS message after marking system_error. This prevents poison-message redelivery loops and duplicate execution rows. Trade-off: a transient consumer error no longer auto-retries via SQS redelivery; it surfaces immediately as a system error and the run remains re-triggerable. The concurrency back-pressure path is unaffected (it re-queues as before). Worth a careful look during review.
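To make the trade-off concrete, here is a minimal sketch of how such a backstop might be shaped. `RequeueSignal` and the `E-0004` code are named in this description; `QueueMessage`, `ConsumerDeps`, `handleMessage`, and the injected hooks are hypothetical, and the actual re-queue mechanism is not described here, so it is modelled as an opaque `requeue` callback.

```typescript
// Typed signal thrown by the concurrency back-pressure path (name from the PR;
// this class body is an assumption).
class RequeueSignal extends Error {
  constructor(message = "concurrency limit reached; re-queue") {
    super(message);
    this.name = "RequeueSignal";
    // Keep instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, RequeueSignal.prototype);
  }
}

// Hypothetical message shape and injected dependencies.
interface QueueMessage {
  receiptHandle: string;
  executionId: string;
}

interface ConsumerDeps {
  process(msg: QueueMessage): Promise<void>;
  markSystemError(executionId: string, errorCode: string): Promise<void>;
  deleteMessage(receiptHandle: string): Promise<void>;
  requeue(msg: QueueMessage): Promise<void>;
}

type Outcome = "processed" | "requeued" | "system_error";

async function handleMessage(msg: QueueMessage, deps: ConsumerDeps): Promise<Outcome> {
  try {
    await deps.process(msg);
    await deps.deleteMessage(msg.receiptHandle);
    return "processed";
  } catch (err) {
    if (err instanceof RequeueSignal) {
      // Deliberate back-pressure: re-queue as before, never mark the run.
      await deps.requeue(msg);
      return "requeued";
    }
    // Backstop: mark first, then delete, so the message cannot be redelivered
    // into a poison loop or produce a duplicate execution row.
    await deps.markSystemError(msg.executionId, "E-0004");
    await deps.deleteMessage(msg.receiptHandle);
    return "system_error";
  }
}
```

The ordering in the catch branch is the part worth reviewing: marking before deleting means a crash between the two calls leaves a redeliverable message rather than an unmarked run, at the cost of giving up SQS redelivery as a retry mechanism for transient errors.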

No DB migration: workflow_executions.status is a plain text column with no check constraint.

Validation

  • pnpm type-check: clean
  • pnpm check: changed files clean
  • Unit suite: 6451 passed, 1 skipped (incl. a new execution-status test)
  • DB e2e: reaper-codes, executor-status-backstop pass; scheduler and event-tracker package suites pass
  • Updated expectations: reaper, both phantom suites, and two infra e2e assertions (SIGTERM and RPC failures now classify as system)

…failures

Introduce a distinct system_error execution status, persisted wherever a
run fails for a platform/infrastructure reason (error_type=system) rather
than a user/workflow error. This makes SQS and infra failures visible and
filterable apart from user errors, end to end.

Both SQS failure directions are covered:
- Producer enqueue failure (schedule/block/event SendMessage throws)
  resolves the phantom to system_error.
- Consumer never receives (message lost / dispatcher down) is aged by the
  reaper to system_error (P-0005, reclassified to infrastructure).
- New consumer backstop marks an in-flight run system_error (E-0004) and
  deletes the message immediately instead of waiting for the reaper; a
  typed RequeueSignal preserves the deliberate concurrency re-queue.
- Dispatch/k8s/runtime failures map to system_error via P-0002/P-0004 and
  the classifier in the completion path.

A single rule (statusForErrorType) drives every write site. All error-count
queries and terminal-state checks match both error and system_error so no
metric or wait path drops system errors -- including the platform
system-errors-by-category gauge the infra alert reads. The per-status
Prometheus gauge gains a status=system_error series. The runs table and
workflow runs surface a distinct System Error badge, filter, and label.