NATS consumer retry and replicas #9975
Merged
ReubenBond merged 3 commits into dotnet:main on Apr 14, 2026
@dotnet-policy-service agree company="Microsoft"
Pull request overview
This PR improves resiliency of the Orleans NATS JetStream streaming provider in multi-node NATS clusters by (1) enabling consumer self-healing after transient JetStream errors and (2) allowing stream replica count configuration so rolling restarts don’t permanently break consumption.
Changes:
- Add lazy re-initialization of `NatsStreamConsumer` when `_consumer` is null during polling.
- Introduce `NatsOptions.NumReplicas` with validation and apply it when creating JetStream streams (plus attempt stream update on config mismatch).
- Add tests covering `NumReplicas` defaults/validation and a JetStream config assertion.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| test/Extensions/Orleans.Streaming.NATS.Tests/NatsOptionsTests.cs | Adds unit tests for options validation and a JetStream-facing test related to replicas. |
| src/Orleans.Streaming.NATS/Providers/NatsStreamConsumer.cs | Adds retry-on-poll initialization logic and lowers severity of “not initialized” logging. |
| src/Orleans.Streaming.NATS/Providers/NatsConnectionManager.cs | Plumbs NumReplicas into stream creation and attempts UpdateStreamAsync on config mismatch. |
| src/Orleans.Streaming.NATS/NatsOptions.cs | Adds NumReplicas option and validates it is at least 1. |
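
The create-then-update behavior described for NatsConnectionManager.cs can be modeled roughly as follows. This is a minimal Python sketch and not the Orleans C# code: `FakeJetStream`, `JetStreamApiError`, `ensure_stream`, and the stream name `orders` are hypothetical stand-ins invented for illustration. Only the error code 10058 and the create/update sequence come from this PR.

```python
from dataclasses import dataclass

# NATS error code cited in this PR: stream exists with a different config.
STREAM_CONFIG_MISMATCH = 10058


class JetStreamApiError(Exception):
    """Hypothetical error type carrying a JetStream API error code."""

    def __init__(self, err_code, message=""):
        super().__init__(message)
        self.err_code = err_code


@dataclass(frozen=True)
class StreamConfig:
    name: str
    num_replicas: int = 1  # mirrors the NumReplicas default of 1


class FakeJetStream:
    """In-memory stand-in for a JetStream context (assumption, not a real client)."""

    def __init__(self):
        self.streams = {}

    def create_stream(self, config):
        existing = self.streams.get(config.name)
        if existing is not None and existing != config:
            raise JetStreamApiError(STREAM_CONFIG_MISMATCH, "config mismatch")
        self.streams[config.name] = config

    def update_stream(self, config):
        if config.name not in self.streams:
            raise KeyError(config.name)
        self.streams[config.name] = config


def ensure_stream(js, config):
    """Create the stream; on a config mismatch, fall back to an in-place update."""
    try:
        js.create_stream(config)
        return "created"
    except JetStreamApiError as err:
        if err.err_code != STREAM_CONFIG_MISMATCH:
            raise  # any other failure is not handled here
        js.update_stream(config)
        return "updated"
```

In this model, calling `ensure_stream` with `num_replicas=3` against a stream that was created with one replica takes the update path instead of failing, which is the replica upgrade scenario the PR targets.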
@NSTA2, Copilot has some minor feedback (above)
2d6ed2c to 1b34934
- NatsStreamConsumer.GetMessages() now lazily retries initialization when _consumer is null, making the consumer self-healing after transient JetStream failures (leader election, timeout, network blip). Log level changed from Error to Warning for transient retry attempts.
- Added NumReplicas property to NatsOptions (default 1, backward compatible). NatsConnectionManager now passes NumReplicas to StreamConfig.
- NatsConnectionManager handles NATS error code 10058 (stream exists with different config) by attempting an in-place UpdateStreamAsync, enabling replica count upgrades without manual stream deletion.
- Added NumReplicas validation (>= 1) to NatsStreamOptionsValidator.
- Unit tests for NatsStreamOptionsValidator: invalid NumReplicas (0, -1, -100) throws OrleansConfigurationException; valid values (1, 3, 5) pass.
- Unit tests for existing StreamName validation (null, whitespace).
- Default value assertion: NumReplicas defaults to 1.
- Integration test: verifies NumReplicas=1 is applied to JetStream StreamConfig.
- R3 testing noted as requiring a multi-node NATS cluster (CI-level concern).

…ldStreamConfig, improve test coverage
- NatsStreamConsumer: Add _consecutiveInitFailures counter. Log on first attempt then every 100th (~10s at 100ms cadence) to avoid log flooding during prolonged outages. Log at Info level when recovery succeeds.
- NatsOptions: Relax XML doc to state NATS server enforces odd-number and cluster-size constraints, rather than claiming client-side validation.
- NatsConnectionManager: Extract BuildStreamConfig() private helper to eliminate duplicated StreamConfig construction between Create and Update.
- NatsOptionsTests: Rewrite integration test to exercise the actual provider path (NatsOptions -> NatsConnectionManager -> JetStream stream) instead of hardcoding StreamConfig directly.
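
The lazy retry with throttled logging described in the commit above can be sketched as a small model. This is illustrative Python, not the C# in NatsStreamConsumer.cs: `LazyConsumer`, `FlakyBroker`, and `LOG_EVERY` are hypothetical names, and the fake broker simply fails a fixed number of times before succeeding. The counter, the "first then every 100th" cadence, and the Info-level recovery log follow the commit message.

```python
import logging

# Log the first failure, then every 100th (about 10 s at a 100 ms poll cadence).
LOG_EVERY = 100


class FlakyBroker:
    """Hypothetical stand-in: fails the first `failures` connects, then works."""

    def __init__(self, failures):
        self.remaining = failures

    def connect(self):
        if self.remaining > 0:
            self.remaining -= 1
            raise TimeoutError("JetStream leader election in progress")
        return self

    def fetch(self, max_count):
        return ["msg"] * max_count


class LazyConsumer:
    """Retries initialization on each poll instead of failing permanently."""

    def __init__(self, connect, logger=None):
        self._connect = connect  # callable returning a consumer; may raise
        self._consumer = None
        self._consecutive_init_failures = 0
        self._log = logger or logging.getLogger("nats-consumer-sketch")

    def _try_initialize(self):
        try:
            self._consumer = self._connect()
        except Exception as exc:
            self._consecutive_init_failures += 1
            n = self._consecutive_init_failures
            if n == 1 or n % LOG_EVERY == 0:
                self._log.warning("consumer init failed (attempt %d): %s", n, exc)
            return False
        if self._consecutive_init_failures:
            self._log.info(
                "consumer recovered after %d failed attempts",
                self._consecutive_init_failures,
            )
        self._consecutive_init_failures = 0
        return True

    def get_messages(self, max_count):
        # Piggyback on the poll: retry init if needed, else return empty.
        if self._consumer is None and not self._try_initialize():
            return []
        return self._consumer.fetch(max_count)
```

Each failed poll returns an empty batch rather than raising, so the caller's existing poll loop doubles as the retry loop and no separate timer is needed.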
1b34934 to 821459e
ReubenBond approved these changes Apr 14, 2026
NATS streaming consumer permanent failure after transient JetStream errors
Problem
When running Orleans against a multi-node NATS JetStream cluster, a rolling restart of NATS nodes (routine maintenance, crash recovery, scaling) causes permanent consumer failure with no recovery path short of restarting the Orleans silo.
Two issues combine to produce this:
1. No retry on consumer initialization
`NatsStreamConsumer.Initialize()` is called exactly once during startup. If it fails due to a transient error (a timeout during JetStream leader election, a network blip, temporary unavailability), the internal `_consumer` field stays `null` permanently. Every subsequent `GetMessages()` poll (~100ms) logs at `Error` level:
…
and returns empty, indefinitely, with no self-healing path.
2. Hardcoded R1 streams
`NatsConnectionManager.Initialize()` creates the JetStream stream without setting `NumReplicas`, defaulting to R1 (single replica). R1 streams have exactly one leader; any node restart makes the stream temporarily unavailable during leader election, which is the trigger for issue 1 above.

Combined effect: After a NATS rolling update, Orleans consumers enter a permanent error loop on every poll cycle, producing a flood of error logs and zero message delivery until the entire Orleans pod is restarted.
Root Cause
Changes
| File | Change |
|---|---|
| NatsStreamConsumer.cs | When `_consumer` is null in `GetMessages()`, attempt lazy re-initialization before returning empty. Piggybacks on the existing Orleans pulling agent poll cadence. Log level changed from `Error` to `Warning`: transient retries during rolling updates are expected, not permanent failures. |
| NatsOptions.cs | Added `NumReplicas` property (default `1`, backward compatible). Added validation in `NatsStreamOptionsValidator` ensuring `NumReplicas >= 1`. |
| NatsConnectionManager.cs | Passes `NumReplicas` to `StreamConfig` when creating JetStream streams. Handles NATS error code 10058 (stream exists with different config) by attempting an in-place `UpdateStreamAsync`, enabling replica count upgrades without manual stream deletion. |
| NatsOptionsTests.cs (new) | Unit tests for validation (invalid `NumReplicas`, missing/empty `StreamName`), default value assertion, and an integration test verifying `NumReplicas` flows through to JetStream stream config. |

Usage
Backward Compatibility
- `NumReplicas` defaults to `1`, so existing deployments are unaffected.
- The retry in `GetMessages()` is purely additive: previously broken consumers now self-heal.

Testing
- Integration test verifies `NumReplicas` is applied to JetStream stream config.
- `NatsAdapterTests.SendAndReceiveFromNats` failures are unrelated (cache cursor requests `SeqNum=0` but JetStream sequences start at 1).