
NATS consumer retry and replicas #9975

Merged
ReubenBond merged 3 commits into dotnet:main from NSTA2:fix/nats-consumer-retry-and-replicas
Apr 14, 2026


NSTA2 commented Mar 26, 2026

NATS streaming consumer permanent failure after transient JetStream errors

Problem

When running Orleans against a multi-node NATS JetStream cluster, a rolling restart of NATS nodes (routine maintenance, crash recovery, scaling) causes permanent consumer failure with no recovery path short of restarting the Orleans silo.

Two issues combine to produce this:

1. No retry on consumer initialization

NatsStreamConsumer.Initialize() is called exactly once during startup. If it fails due to a transient error — timeout during JetStream leader election, network blip, temporary unavailability — the internal _consumer field stays null permanently. Every subsequent GetMessages() poll (~100ms) logs at Error level:

Internal NATS Consumer is not initialized. Provider: {Provider} | Stream: {Stream} | Partition: {Partition}.

…and returns empty, indefinitely, with no self-healing path.

2. Hardcoded R1 streams

NatsConnectionManager.Initialize() creates the JetStream stream without setting NumReplicas, so the stream defaults to R1 (single replica). R1 streams have exactly one leader; any node restart can make the stream temporarily unavailable during leader election, which is the trigger for issue 1 above.

Combined effect: After a NATS rolling update, Orleans consumers enter a permanent error loop on every poll cycle, producing a flood of error logs and zero message delivery until the entire Orleans pod is restarted.
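To make the replica counts concrete: JetStream replicated streams rely on Raft-style majority consensus, so a stream with n replicas stays available only while a majority of them is reachable. The following is a minimal sketch of that arithmetic; `ReplicaMath` and its methods are illustrative names, not part of Orleans or the NATS client.

```csharp
using System;

// Illustrative helper: majority-quorum arithmetic for a replicated stream.
public static class ReplicaMath
{
    // Number of replicas that must be reachable for the stream to make progress.
    public static int Quorum(int replicas)
    {
        if (replicas < 1) throw new ArgumentOutOfRangeException(nameof(replicas));
        return replicas / 2 + 1;
    }

    // Number of replicas that can be down while a majority is still reachable.
    public static int TolerableFailures(int replicas) => replicas - Quorum(replicas);
}
```

Under this model an R1 stream tolerates no node loss, while R3 tolerates one and R5 tolerates two. A rolling restart that takes down one node at a time therefore always interrupts an R1 stream hosted on the restarted node, but should leave an R3 stream with a majority.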

Root Cause

NATS node restart
  → R1 stream leader temporarily unavailable
    → NatsStreamConsumer.Initialize() throws (timeout / leader election)
      → _consumer stays null permanently
        → Every GetMessages() poll logs Error + returns empty
          → No recovery without full silo restart

Changes

| File | Change |
| --- | --- |
| NatsStreamConsumer.cs | When _consumer is null in GetMessages(), attempt lazy re-initialization before returning empty. Piggybacks on the existing Orleans pulling agent poll cadence. Log level changed from Error to Warning, since transient retries during rolling updates are expected, not permanent failures. |
| NatsOptions.cs | Added NumReplicas property (default 1, backward compatible). Added validation in NatsStreamOptionsValidator ensuring NumReplicas >= 1. |
| NatsConnectionManager.cs | Passes NumReplicas to StreamConfig when creating JetStream streams. Handles NATS error code 10058 (stream exists with different config) by attempting an in-place UpdateStreamAsync, enabling replica count upgrades without manual stream deletion. |
| NatsOptionsTests.cs (new) | Unit tests for validator (invalid/valid NumReplicas, missing/empty StreamName), default value assertion, and an integration test verifying NumReplicas flows through to JetStream stream config. |
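The lazy re-initialization described for NatsStreamConsumer.cs can be sketched in isolation. This is not the PR's code: `LazyRetryConsumer<TConsumer>` and the `createConsumer` delegate are hypothetical stand-ins for the consumer class and its Initialize() call, used only to show how a null field can be retried on each poll.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;

// Hypothetical stand-in: TConsumer plays the role of the JetStream consumer
// handle, and createConsumer plays the role of Initialize().
public sealed class LazyRetryConsumer<TConsumer> where TConsumer : class
{
    private readonly Func<Task<TConsumer>> _createConsumer;
    private TConsumer? _consumer;

    public LazyRetryConsumer(Func<Task<TConsumer>> createConsumer)
        => _createConsumer = createConsumer;

    // Called once per poll. Returns the consumer, or null if it is still unavailable.
    public async Task<TConsumer?> GetOrInitializeAsync()
    {
        if (_consumer is not null) return _consumer;
        try
        {
            _consumer = await _createConsumer();
        }
        catch (Exception)
        {
            // Transient failure: leave _consumer null so the next poll retries.
        }
        return _consumer;
    }
}
```

Because the retry runs inside the existing poll, no extra timer or background task is needed; the caller simply returns an empty batch whenever the result is null.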

Usage

siloBuilder.AddNatsStreams("my-provider", options =>
{
    options.StreamName = "my-stream";
    options.NumReplicas = 3; // R3 for production HA clusters (≥ 3 NATS nodes)
});
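The validation rules this PR describes (StreamName must be set, NumReplicas must be at least 1) amount to a check like the one below. It is a dependency-free sketch: `SketchNatsOptions` and `SketchNatsOptionsValidator` are hypothetical stand-ins for NatsOptions and NatsStreamOptionsValidator, and InvalidOperationException replaces the OrleansConfigurationException that the real validator throws.

```csharp
#nullable enable
using System;

// Stand-in for the real options class; only the two properties named in this PR.
public sealed class SketchNatsOptions
{
    public string? StreamName { get; set; }
    public int NumReplicas { get; set; } = 1; // default keeps existing R1 behavior
}

public static class SketchNatsOptionsValidator
{
    // Throws when the options are invalid. InvalidOperationException is used
    // here only to keep the sketch free of Orleans dependencies.
    public static void Validate(SketchNatsOptions options)
    {
        if (string.IsNullOrWhiteSpace(options.StreamName))
            throw new InvalidOperationException("StreamName must be set.");
        if (options.NumReplicas < 1)
            throw new InvalidOperationException(
                $"NumReplicas must be >= 1, got {options.NumReplicas}.");
    }
}
```

Upper bounds are deliberately not checked here: per the later commit notes, odd-number and cluster-size constraints are left to the NATS server.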

Backward Compatibility

  • NumReplicas defaults to 1 — existing deployments are unaffected.
  • The retry in GetMessages() is purely additive — previously broken consumers now self-heal.
  • Error 10058 handling is additive — previously this was an unhandled exception that crashed the adapter factory.
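The error 10058 handling follows a create-then-update pattern, which can be sketched without the NATS client. Everything here is an assumption made for illustration: `SketchJetStreamException`, `StreamProvisioner`, and the `create`/`update` delegates stand in for the real exception type, NatsConnectionManager, and the stream create/UpdateStreamAsync calls.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical exception carrying the JetStream API error code.
public sealed class SketchJetStreamException : Exception
{
    public int ErrorCode { get; }

    public SketchJetStreamException(int errorCode)
        : base($"JetStream error {errorCode}") => ErrorCode = errorCode;
}

public static class StreamProvisioner
{
    // Error code named in this PR: the stream exists with a different configuration.
    public const int StreamConfigMismatch = 10058;

    // Tries to create the stream; on a config mismatch, updates it in place.
    // Returns "created" or "updated" so callers can log which path was taken.
    public static async Task<string> EnsureStreamAsync(Func<Task> create, Func<Task> update)
    {
        try
        {
            await create();
            return "created";
        }
        catch (SketchJetStreamException ex) when (ex.ErrorCode == StreamConfigMismatch)
        {
            await update();
            return "updated";
        }
    }
}
```

The exception filter keeps every other error code propagating unchanged, so only the config-mismatch case takes the update path.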

Testing

  • All 50 existing NATS integration tests pass (stream, subscription multiplicity, client stream).
  • 10 new tests added: validator unit tests (no NATS required) + integration test verifying NumReplicas is applied to JetStream stream config.
  • The 2 pre-existing NatsAdapterTests.SendAndReceiveFromNats failures are unrelated (cache cursor requests SeqNum=0 but JetStream sequences start at 1).
  • R3 integration testing requires a multi-node NATS cluster and should be validated in CI with a 3-node docker-compose setup.


NSTA2 commented Mar 26, 2026

@dotnet-policy-service agree company="Microsoft"


Copilot AI left a comment


Pull request overview

This PR improves resiliency of the Orleans NATS JetStream streaming provider in multi-node NATS clusters by (1) enabling consumer self-healing after transient JetStream errors and (2) allowing stream replica count configuration so rolling restarts don’t permanently break consumption.

Changes:

  • Add lazy re-initialization of NatsStreamConsumer when _consumer is null during polling.
  • Introduce NatsOptions.NumReplicas with validation and apply it when creating JetStream streams (plus attempt stream update on config mismatch).
  • Add tests covering NumReplicas defaults/validation and a JetStream config assertion.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| test/Extensions/Orleans.Streaming.NATS.Tests/NatsOptionsTests.cs | Adds unit tests for options validation and a JetStream-facing test related to replicas. |
| src/Orleans.Streaming.NATS/Providers/NatsStreamConsumer.cs | Adds retry-on-poll initialization logic and lowers severity of “not initialized” logging. |
| src/Orleans.Streaming.NATS/Providers/NatsConnectionManager.cs | Plumbs NumReplicas into stream creation and attempts UpdateStreamAsync on config mismatch. |
| src/Orleans.Streaming.NATS/NatsOptions.cs | Adds NumReplicas option and validates it is at least 1. |

Comment thread src/Orleans.Streaming.NATS/Providers/NatsStreamConsumer.cs Outdated
Comment thread src/Orleans.Streaming.NATS/NatsOptions.cs
Comment thread src/Orleans.Streaming.NATS/Providers/NatsConnectionManager.cs Outdated
Comment thread test/Extensions/Orleans.Streaming.NATS.Tests/NatsOptionsTests.cs Outdated
ReubenBond changed the title Nats consumer retry and replicas NATS consumer retry and replicas Apr 2, 2026

ReubenBond commented

@NSTA2, Copilot has some minor feedback (above)

NSTA2 force-pushed the fix/nats-consumer-retry-and-replicas branch from 2d6ed2c to 1b34934 April 13, 2026 08:33
NSTA2 added 3 commits April 14, 2026 06:11
- NatsStreamConsumer.GetMessages() now lazily retries initialization when
  _consumer is null, making the consumer self-healing after transient
  JetStream failures (leader election, timeout, network blip).
  Log level changed from Error to Warning for transient retry attempts.

- Added NumReplicas property to NatsOptions (default 1, backward compatible).
  NatsConnectionManager now passes NumReplicas to StreamConfig.

- NatsConnectionManager handles NATS error code 10058 (stream exists with
  different config) by attempting an in-place UpdateStreamAsync, enabling
  replica count upgrades without manual stream deletion.

- Added NumReplicas validation (>= 1) to NatsStreamOptionsValidator.
- Unit tests for NatsStreamOptionsValidator: invalid NumReplicas (0, -1, -100)
  throws OrleansConfigurationException; valid values (1, 3, 5) pass.
- Unit tests for existing StreamName validation (null, whitespace).
- Default value assertion: NumReplicas defaults to 1.
- Integration test: verifies NumReplicas=1 is applied to JetStream StreamConfig.
- R3 testing noted as requiring a multi-node NATS cluster (CI-level concern).
…ldStreamConfig, improve test coverage

- NatsStreamConsumer: Add _consecutiveInitFailures counter. Log on first
  attempt then every 100th (~10s at 100ms cadence) to avoid log flooding
  during prolonged outages. Log at Info level when recovery succeeds.

- NatsOptions: Relax XML doc to state NATS server enforces odd-number and
  cluster-size constraints, rather than claiming client-side validation.

- NatsConnectionManager: Extract BuildStreamConfig() private helper to
  eliminate duplicated StreamConfig construction between Create and Update.

- NatsOptionsTests: Rewrite integration test to exercise the actual provider
  path (NatsOptions -> NatsConnectionManager -> JetStream stream) instead of
  hardcoding StreamConfig directly.
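The log throttling described for _consecutiveInitFailures can be sketched as a small counter. `InitFailureThrottle` is a hypothetical name, and the exact cadence is an assumption: this version reports the 1st failure and then every failure whose count is a multiple of 100, which may differ in detail from the merged code.

```csharp
using System;

// Illustrative throttle: report the first failure, then every Nth one.
public sealed class InitFailureThrottle
{
    private readonly int _logEvery;
    private int _consecutiveFailures;

    public InitFailureThrottle(int logEvery = 100)
    {
        if (logEvery < 1) throw new ArgumentOutOfRangeException(nameof(logEvery));
        _logEvery = logEvery;
    }

    // Records a failed attempt and returns true when this one should be logged.
    public bool RecordFailure()
    {
        _consecutiveFailures++;
        return _consecutiveFailures == 1 || _consecutiveFailures % _logEvery == 0;
    }

    // Records a success; returns true if it ends a run of failures,
    // which is the point where a recovery message is worth logging.
    public bool RecordSuccess()
    {
        bool recovered = _consecutiveFailures > 0;
        _consecutiveFailures = 0;
        return recovered;
    }
}
```

At a 100 ms poll cadence, logging every 100th failure works out to roughly one Warning every 10 seconds during a prolonged outage, instead of ten per second.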
ReubenBond force-pushed the fix/nats-consumer-retry-and-replicas branch from 1b34934 to 821459e April 14, 2026 13:11
ReubenBond enabled auto-merge April 14, 2026 13:49
ReubenBond added this pull request to the merge queue Apr 14, 2026
Merged via the queue into dotnet:main with commit ef62463 Apr 14, 2026
114 of 115 checks passed
3 participants