DRIVERS-3446 - Add prose tests to verify correct retry behavior when a mix of overload and non-overload errors are encountered by NoahStapp · Pull Request #1924 · mongodb/specifications

NoahStapp · 2026-04-14T16:20:27Z

Please complete the following before merging:

Is the relevant DRIVERS ticket in the PR title?

Update changelog.
Test changes in at least one language driver. Python.
Test these changes against all server versions and topologies (including standalone, replica set, and sharded
clusters).

Ruby PoC: mongodb/mongo-ruby-driver#3021

…a mix of overload and non-overload errors are encountered DRIVERS-3427: Remove token bucket disclaimer from backpressure speci… (mongodb#1923) DRIVERS-3436 - Refine `withTransaction` timeout error wrapping semantics and label propagation in spec and prose tests (mongodb#1920) Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Valentin Kovalenko <valentin.male.kovalenko@gmail.com> DRIVERS-3325 allow keyAltName in encryptedFields spec test (mongodb#1853)

stIncMale

I won't be able to come back to this PR review, which means, my comments/suggestions should not block the PR. But I hope both the author and the other reviewers will take them into account.

stIncMale · 2026-04-15T02:42:14Z

        two runs.

-#### Test 3: Overload Errors are Retried a Maximum of MAX_RETRIES times
+#### Test 2: Overload Errors are Retried a Maximum of MAX_RETRIES times


We should not change the test numbers. They were, I believe, on purpose left out unchanged when one of the tests was previously removed - drivers may refer to the prose tests either by their numbers, or by linking to a specific test.

Changing the number or the test name breaks those references. It is especially bad when a test is referred to by its number - the old test 3 is now 2, but there is still a test with the number 3, and if a driver refers to the 3rd test, it is not obvious that the reference is broken.

The same applies to the test below.

If so, we should probably add a stub, something like "Test 2: REMOVED" to avoid confusion in the future.

stIncMale · 2026-04-15T02:48:53Z


 - When a failed attempt is retried, backoff MUST be applied if and only if the error is an overload error.
- If an overload error is encountered:
+- If an overload error is encountered at any point during an operation's retry loop:


It's not operations but commands that the read/write retry policies, or the new overload retry policy prescribes the drivers to retry. Previously, the effort was made to make sure that client-backpressure.md correctly refers to either command or operation, given that they are not the same thing.

Let's replace "operation" with "command" here, because that's the correct term here and below in "Why override existing maximum number of retry attempt defaults for retryable reads and writes if an overload error is received at any point during an operation's retry loop?".

stIncMale · 2026-04-15T02:53:24Z


 - When a failed attempt is retried, backoff MUST be applied if and only if the error is an overload error.
- If an overload error is encountered:
+- If an overload error is encountered at any point during an operation's retry loop:


What is the "at any point during an operation's retry loop" part intended to clarify? Nowhere else in client-backpressure.md the "at any point of the command retry loop" qualification is made, because it is implied and seems obvious. What is specific about this item that requires the qualification?

Hopefully, another reviewer may see this question as useful, get to the answer, and then act on it however it seems appropriate.

It was intended to clarify that a single overload error at any point will set the maximum number of retries for all errors encountered by the command. With the clarification added below, this addition can be removed.

stIncMale · 2026-04-15T03:08:13Z

+- If an overload error is encountered at any point during an operation's retry loop:
    - Regardless of whether CSOT is enabled or not, the maximum number of retries for any retry policy becomes
-        `MAX_RETRIES`.
+        `MAX_RETRIES`. This includes retryable non-overload errors following or preceding the encountered overload error.


MAX_RETRIES limits attempts ("the execution attempt number (starting with 0)"). attempt is currently documented in the "Overload retry policy" section as

includes retries for errors that are not overload errors (this might include attempts under other retry policies, see Interactions with Other Retry Policies).

It appears that the addition made here tries to duplicate the above information, but the way it is done does not seem like correct English to me (though I am not a native speaker): the subject in the new sentence is "this", but I fail to find anything in the previous sentence that could act as the thing referred to by "this". Maybe the proposed sentence could be reformulated as

Note that MAX_RETRIES limits attempts, and attempts includes retries for errors that are not overload errors.

Another idea:

When reviewing the PR that introduced the requirements under the "If an overload error is encountered" condition, I suggested to formulate the two them by explicitly saying something like "extends the retry attempt limit..." and "forbids retry attempts after...". The current wording was chosen because it was much shorter than what I envisioned, and was still correct. However, given that it turned out to be practically difficult to understand, we may consider reformulating these requirements using the aforementioned wordings (if we do that, great care should be taken to formulate them correctly, which, historically, is difficult to do).

Updated the wording for clarity here.

comandeo-mongo

I believe that the existing wording is good enough, we can leave the text as it is.

Two new tests add the necessary clarification and help to catch incorrect implementations.

comandeo-mongo · 2026-04-15T09:43:28Z


 6. Assert that the total number of started commands is `maxAdaptiveRetries` + 1 (2).
+
+#### Test 4: Backoff is not applied for non-overload retryable errors


The suggested test covers a non-backpressure path; we can test a more complex situation - Non-overload error inside the overload loop. No backoff should be applied to the non-overload error:

Let client be a MongoClient.

Let coll be a collection.

Configure the following failpoint:

{ configureFailPoint: 'failCommand', mode: { times: 1 }, data: { failCommands: ['find'], errorCode: 91, errorLabels: ['RetryableError', 'SystemOverloadedError'] } }

Using the CommandFailedEvent from the command monitoring API, configure a callback that, when the first find error is observed, configures the following failpoint:

{ configureFailPoint: 'failCommand', mode: 'alwaysOn', data: { failCommands: ['find'], errorCode: 91, errorLabels: ['RetryableError'] } }

Perform a findOne operation with coll. Expect the operation to fail.

Assert that backoff was applied only once — for the initial overload error — and not for the subsequent non-overload retryable errors. Drivers MAY implement this assertion by overriding BASE_BACKOFF to a large value (e.g., 5 seconds) before running the test and verifying that the total elapsed time of the findOne operation is less than BASE_BACKOFF plus a small margin for network overhead. With correct behavior, only one backoff is applied (bounded by BASE_BACKOFF), so the elapsed time stays under that bound. If backoff were incorrectly applied for non-overload errors as well, the elapsed time would significantly exceed BASE_BACKOFF.

Good suggestion. This test is also more consistent with our existing retryable reads tests for backpressure, I'll replace the original with your suggestion and move it there.

Explicitly stated that requested changes were non-blocking.

papafe

Overall it looks good, just two minor comments.

papafe · 2026-04-15T14:56:22Z

+        mode: "off"
+    }
+    ```
+


Should we have an equivalent test 5 also for retryable writes?

Yes, good catch.

papafe · 2026-04-15T14:57:48Z

+    Configure the second fail point command only if the failed event is for the first error configured in step 2.
+
+4. Attempt an `insertOne` operation on any record for any database and collection. Expect the `insertOne` to fail with a
+    server error. Assert that `MAX_ADAPTIVE_RETRIES + 1` attempts were made.


Should we call this MAX_RETRIES to be consistent wit the spec files?

Since this test is in retryable-writes, the intent was to specify MAX_ADAPTIVE_RETRIES as we do in the retryable-reads tests to remove ambiguity.

I forgot to specify that this comment should be for all occurrences of this parameter in both retryable reads and writes tests. I think we just introduced it, didn't we?

papafe

LGTM!

NoahStapp added 3 commits April 14, 2026 12:19

Changelogs

f30aa07

Remove non-existant test 2

746c57e

NoahStapp requested review from a team as code owners April 14, 2026 16:20

NoahStapp requested review from comandeo-mongo, isabelatkinson, katcharov and papafe and removed request for a team April 14, 2026 16:20

katcharov requested review from stIncMale and removed request for katcharov April 14, 2026 17:11

NoahStapp removed the request for review from isabelatkinson April 14, 2026 17:44

Add new prose test

51d25dc

katcharov removed the request for review from stIncMale April 14, 2026 22:49

stIncMale previously requested changes Apr 15, 2026

View reviewed changes

comandeo-mongo reviewed Apr 15, 2026

View reviewed changes

VK + DR review

a009580

NoahStapp requested review from comandeo-mongo and stIncMale April 15, 2026 13:15

comandeo-mongo approved these changes Apr 15, 2026

View reviewed changes

NoahStapp removed the request for review from stIncMale April 15, 2026 14:49

papafe requested changes Apr 15, 2026

View reviewed changes

FP review

c5a4295

NoahStapp requested a review from papafe April 15, 2026 15:19

FP review

4cdf13a

papafe approved these changes Apr 15, 2026

View reviewed changes

NoahStapp merged commit 7039e69 into mongodb:master Apr 15, 2026
6 checks passed


		6. Assert that the total number of started commands is `maxAdaptiveRetries` + 1 (2).

		#### Test 4: Backoff is not applied for non-overload retryable errors

Conversation

NoahStapp commented Apr 14, 2026 • edited by comandeo-mongo Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stIncMale left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stIncMale Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

comandeo-mongo Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stIncMale Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stIncMale Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

comandeo-mongo left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

papafe left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

papafe left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

NoahStapp commented Apr 14, 2026 •

edited by comandeo-mongo

Loading

stIncMale left a comment •

edited

Loading

stIncMale Apr 15, 2026 •

edited

Loading

comandeo-mongo Apr 15, 2026 •

edited

Loading

stIncMale Apr 15, 2026 •

edited

Loading

stIncMale Apr 15, 2026 •

edited

Loading