fix(orchestrator): stop force-marking publish failures retryable#281
Merged
Conversation
The validate, buildsignal, and merge controllers wrapped every queue publish failure with errs.NewRetryableError, unconditionally retrying regardless of the actual cause. That defeats the classifier framework: only known-transient infra errors should retry, otherwise genuinely non-retryable failures replay forever instead of dead-lettering. Return the raw error from these publish paths and let the consumer's ErrorProcessor classify it. Drop the now-unused errs imports (gazelle removes the corresponding BUILD deps). Controller tests no longer assert IsRetryable on the publish path, since classification is the consumer's contract, not the controller's. Docs: clarify in CLAUDE.md and platform/errs/README.md that classifiers own retryability, controllers override only with knowledge a classifier lacks (never just because replaying is convenient), and IsRetryable is detected only after the ErrorProcessor pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
|
5 tasks
behinddwalls
approved these changes
Jun 26, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The
validate,buildsignal, andmergeorchestrator controllers wrapped every queue-publish failure witherrs.NewRetryableError, retrying unconditionally regardless of the actual cause. That defeats the classifier framework: only known-transient infra errors should retry — otherwise a genuinely non-retryable failure replays forever instead of dead-lettering.These three sites return the raw error and let the consumer's
ErrorProcessorclassify it, the same as every other error path.Changes
validate.go,buildsignal.go,merge.go: return the raw publish error (fmt.Errorf("...: %w", err)) instead oferrs.NewRetryableError(...). Dropped the now-unusederrsimports inbuildsignal/merge(gazelle removed the matching BUILD deps).IsRetryableassertions from the publish-failure tests; classification is the consumer's contract, not the controller's. The tests now assert that an error surfaces. RenamedTestProcess_PublishFailureIsRetryable→TestProcess_PublishFailureReturnsError.CLAUDE.mdandplatform/errs/README.md: clarify that classifiers own retryability, controllers override only with knowledge a classifier lacks (never just because replaying is convenient), andIsRetryableis meaningful only after theErrorProcessorpass — i.e. detected in the consumer/transport layer, not in controllers.Testing
go test ./submitqueue/orchestrator/controller/{validate,buildsignal,merge}/... ./platform/errs/...— passmake gazelle— BUILD files syncedgofmt— clean🤖 Generated with Claude Code