Skip to content

fix(orchestrator): stop force-marking publish failures retryable#281

Merged
behinddwalls merged 1 commit into
mainfrom
fixerrs
Jun 26, 2026
Merged

fix(orchestrator): stop force-marking publish failures retryable#281
behinddwalls merged 1 commit into
mainfrom
fixerrs

Conversation

@sbalabanov

Copy link
Copy Markdown
Contributor

Summary

The validate, buildsignal, and merge orchestrator controllers wrapped every queue-publish failure with errs.NewRetryableError, retrying unconditionally regardless of the actual cause. That defeats the classifier framework: only known-transient infra errors should retry — otherwise a genuinely non-retryable failure replays forever instead of dead-lettering.

These three sites return the raw error and let the consumer's ErrorProcessor classify it, the same as every other error path.

Changes

  • Controllersvalidate.go, buildsignal.go, merge.go: return the raw publish error (fmt.Errorf("...: %w", err)) instead of errs.NewRetryableError(...). Dropped the now-unused errs imports in buildsignal/merge (gazelle removed the matching BUILD deps).
  • Tests — removed the IsRetryable assertions from the publish-failure tests; classification is the consumer's contract, not the controller's. The tests now assert that an error surfaces. Renamed TestProcess_PublishFailureIsRetryableTestProcess_PublishFailureReturnsError.
  • DocsCLAUDE.md and platform/errs/README.md: clarify that classifiers own retryability, controllers override only with knowledge a classifier lacks (never just because replaying is convenient), and IsRetryable is meaningful only after the ErrorProcessor pass — i.e. detected in the consumer/transport layer, not in controllers.

Testing

  • go test ./submitqueue/orchestrator/controller/{validate,buildsignal,merge}/... ./platform/errs/... — pass
  • make gazelle — BUILD files synced
  • gofmt — clean

🤖 Generated with Claude Code

The validate, buildsignal, and merge controllers wrapped every queue
publish failure with errs.NewRetryableError, unconditionally retrying
regardless of the actual cause. That defeats the classifier framework:
only known-transient infra errors should retry, otherwise genuinely
non-retryable failures replay forever instead of dead-lettering.

Return the raw error from these publish paths and let the consumer's
ErrorProcessor classify it. Drop the now-unused errs imports (gazelle
removes the corresponding BUILD deps). Controller tests no longer assert
IsRetryable on the publish path, since classification is the consumer's
contract, not the controller's.

Docs: clarify in CLAUDE.md and platform/errs/README.md that classifiers
own retryability, controllers override only with knowledge a classifier
lacks (never just because replaying is convenient), and IsRetryable is
detected only after the ErrorProcessor pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@sbalabanov sbalabanov requested review from a team and behinddwalls as code owners June 26, 2026 22:44
@CLAassistant

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@behinddwalls behinddwalls added this pull request to the merge queue Jun 26, 2026
Merged via the queue into main with commit 1d36819 Jun 26, 2026
14 of 15 checks passed
@behinddwalls behinddwalls deployed to stack-rebase June 26, 2026 22:55 — with GitHub Actions Active
@behinddwalls behinddwalls deleted the fixerrs branch June 26, 2026 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants