feat(executions): add System Error status for SQS and infrastructure failures #1599
Draft
joelorzet wants to merge 1 commit into
…failures

Introduce a distinct system_error execution status, persisted wherever a run fails for a platform/infrastructure reason (error_type=system) rather than a user/workflow error. This makes SQS and infra failures visible and filterable apart from user errors, end to end.

Both SQS failure directions are covered:
- Producer enqueue failure (schedule/block/event SendMessage throws) resolves the phantom to system_error.
- Consumer never receives (message lost / dispatcher down) is aged by the reaper to system_error (P-0005, reclassified to infrastructure).
- New consumer backstop marks an in-flight run system_error (E-0004) and deletes the message immediately instead of waiting for the reaper; a typed RequeueSignal preserves the deliberate concurrency re-queue.
- Dispatch/k8s/runtime failures map to system_error via P-0002/P-0004 and the classifier in the completion path.

A single rule (statusForErrorType) drives every write site. All error-count queries and terminal-state checks match both error and system_error so no metric or wait path drops system errors -- including the platform system-errors-by-category gauge the infra alert reads. The per-status Prometheus gauge gains a status=system_error series. The runs table and workflow runs surface a distinct System Error badge, filter, and label.
Summary
Adds a distinct `system_error` execution status, persisted wherever a run fails for a platform/infrastructure reason (`error_type = system`) rather than a user/workflow error. System failures, including SQS issues, are now visible and filterable apart from user errors, end to end.

This builds on the existing system-vs-user error model (KEEP-693 / KEEP-545 / TECH-6544). It does not replace `error_type` / `error_category` / `error_code`; it adds a status that mirrors them so operators get a status they can see and filter.

Both SQS failure directions are covered
- Producer enqueue failure (schedule/block/event `SendMessage` throws): the pre-created phantom row resolves to `system_error` (CS-0001/BS-0001/ES-0001/N-0002).
- Consumer never receives (message lost / dispatcher down): the reaper ages the run to `system_error` (P-0005, reclassified from `workflow_engine` to `infrastructure`).
- New consumer backstop: marks an in-flight run `system_error` (E-0004) and deletes the message immediately, instead of leaving it for the reaper (up to 30 min later). A typed `RequeueSignal` preserves the deliberate concurrency re-queue so it is never marked or duplicated.
- Dispatch/k8s/runtime failures: map to `system_error` via `P-0002`/`P-0004` and the classifier on the completion path.

How it stays consistent
A single rule, `statusForErrorType(errorType)`, drives every write site. Every error-count query and terminal-state check matches both `error` and `system_error`, so no metric or wait path silently drops system errors:
- `getSystemErrorsByCategoryFromDb` (the platform gauge the infra alert reads), the managed-org per-workflow gauge, the per-status execution gauge (gains a `status=system_error` series), and the duration histogram
- `isTerminalStatus`, the self-heal CAS, `getCustomerRunErrorMessage`, and the failure digest

UI
Distinct amber "System Error" badge and filter in the runs table, and a dedicated label/icon/dot in workflow runs.
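
As a rough illustration of the single rule described under "How it stays consistent", the sketch below shows one way such a mapping and terminal-state check could look. Only the names `statusForErrorType` and `isTerminalStatus` come from this PR; the type names, the `"user"` error type value, the non-failure statuses, and the function bodies are assumptions made for the example, not the code under review.

```typescript
// Hypothetical types: the PR names the functions but does not show these definitions.
type ErrorType = "system" | "user";
type FailureStatus = "error" | "system_error";

// Single rule for every write site: a system-typed failure persists as
// system_error, and anything else falls back to the existing error status.
function statusForErrorType(errorType: ErrorType | null | undefined): FailureStatus {
  return errorType === "system" ? "system_error" : "error";
}

// Both failure statuses live in one list so that count queries and
// terminal-state checks cannot drift apart.
const FAILURE_STATUSES: string[] = ["error", "system_error"];

// Assumed terminal set: "success" and "cancelled" are placeholders, not
// values confirmed by the PR.
const TERMINAL_STATUSES: string[] = ["success", "cancelled"].concat(FAILURE_STATUSES);

function isTerminalStatus(status: string): boolean {
  return TERMINAL_STATUSES.indexOf(status) !== -1;
}
```

Keeping the failure statuses in one shared list is the point of the sketch: a query or wait path that consults it picks up `system_error` without a separate change.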
Design note for review
The consumer backstop deletes the SQS message after marking `system_error`. This prevents poison-message redelivery loops and duplicate execution rows. Trade-off: a transient consumer error no longer auto-retries via SQS redelivery; it surfaces immediately as a system error and the run remains re-triggerable. The concurrency back-pressure path is unaffected (it re-queues as before). Worth a careful look during review.

No DB migration: `workflow_executions.status` is a plain text column with no check constraint.

Validation
- `pnpm type-check`: clean
- `pnpm check`: changed files clean
- … `execution-status` test)
- … `reaper-codes`, `executor-status-backstop` pass; scheduler and event-tracker package suites pass
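
For reviewers weighing the design note, here is a minimal sketch of the decision the consumer backstop has to make when a handler throws. It assumes the backstop distinguishes the deliberate re-queue from every other failure by type. `RequeueSignal`, `system_error`, and `E-0004` are named in this PR; the class body, `BackstopAction`, and `decideBackstopAction` are hypothetical and only illustrate the branching.

```typescript
// Hypothetical marker type. The PR names RequeueSignal but does not show its
// definition. A plain class (not an Error subclass) is used here so that
// instanceof behaves the same on any compile target.
class RequeueSignal {
  readonly reason: string;
  constructor(reason: string = "concurrency limit reached") {
    this.reason = reason;
  }
}

// Hypothetical description of what the consumer should do next.
type BackstopAction =
  | { kind: "requeue" }
  | { kind: "mark_and_delete"; status: "system_error"; errorCode: "E-0004" };

// A RequeueSignal is back-pressure, not a failure, so it is re-queued and
// never marked. Anything else is treated as an infrastructure failure: the
// run is marked system_error and the message is then deleted, so SQS cannot
// redeliver it.
function decideBackstopAction(thrown: unknown): BackstopAction {
  if (thrown instanceof RequeueSignal) {
    return { kind: "requeue" };
  }
  return { kind: "mark_and_delete", status: "system_error", errorCode: "E-0004" };
}
```

Under this sketch a transient error such as a dropped database connection takes the `mark_and_delete` branch, which is the trade-off the design note calls out.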