Skip to content

DRIVERS-3446 - Add prose tests to verify correct retry behavior when a mix of overload and non-overload errors are encountered#1924

Merged
NoahStapp merged 7 commits intomongodb:masterfrom
NoahStapp:DRIVERS-3446
Apr 15, 2026
Merged

DRIVERS-3446 - Add prose tests to verify correct retry behavior when a mix of overload and non-overload errors are encountered#1924
NoahStapp merged 7 commits intomongodb:masterfrom
NoahStapp:DRIVERS-3446

Conversation

@NoahStapp
Copy link
Copy Markdown
Contributor

@NoahStapp NoahStapp commented Apr 14, 2026

DRIVERS-3446

Please complete the following before merging:

  • Is the relevant DRIVERS ticket in the PR title?
  • Update changelog.
  • Test changes in at least one language driver. Python.
  • Test these changes against all server versions and topologies (including standalone, replica set, and sharded
    clusters).

Ruby PoC: mongodb/mongo-ruby-driver#3021

…a mix of overload and non-overload errors are encountered

DRIVERS-3427: Remove token bucket disclaimer from backpressure speci… (mongodb#1923)

DRIVERS-3436 - Refine `withTransaction` timeout error wrapping semantics and label propagation in spec and prose tests (mongodb#1920)

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Valentin Kovalenko <valentin.male.kovalenko@gmail.com>

DRIVERS-3325 allow keyAltName in encryptedFields spec test (mongodb#1853)
@NoahStapp NoahStapp requested review from a team as code owners April 14, 2026 16:20
@NoahStapp NoahStapp requested review from comandeo-mongo, isabelatkinson, katcharov and papafe and removed request for a team April 14, 2026 16:20
@katcharov katcharov requested review from stIncMale and removed request for katcharov April 14, 2026 17:11
@NoahStapp NoahStapp removed the request for review from isabelatkinson April 14, 2026 17:44
@katcharov katcharov removed the request for review from stIncMale April 14, 2026 22:49
Copy link
Copy Markdown
Member

@stIncMale stIncMale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't be able to come back to this PR review, which means, my comments/suggestions should not block the PR. But I hope both the author and the other reviewers will take them into account.

two runs.

#### Test 3: Overload Errors are Retried a Maximum of MAX_RETRIES times
#### Test 2: Overload Errors are Retried a Maximum of MAX_RETRIES times
Copy link
Copy Markdown
Member

@stIncMale stIncMale Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not change the test numbers. They were, I believe, on purpose left out unchanged when one of the tests was previously removed - drivers may refer to the prose tests either by their numbers, or by linking to a specific test.

Changing the number or the test name breaks those references. It is especially bad when a test is referred to by its number - the old test 3 is now 2, but there is still a test with the number 3, and if a driver refers to the 3rd test, it is not obvious that the reference is broken.

The same applies to the test below.

Copy link
Copy Markdown
Contributor

@comandeo-mongo comandeo-mongo Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, we should probably add a stub, something like "Test 2: REMOVED" to avoid confusion in the future.


- When a failed attempt is retried, backoff MUST be applied if and only if the error is an overload error.
- If an overload error is encountered:
- If an overload error is encountered at any point during an operation's retry loop:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not operations but commands that the read/write retry policies, or the new overload retry policy prescribes the drivers to retry. Previously, the effort was made to make sure that client-backpressure.md correctly refers to either command or operation, given that they are not the same thing.

Let's replace "operation" with "command" here, because that's the correct term here and below in "Why override existing maximum number of retry attempt defaults for retryable reads and writes if an overload error is received at any point during an operation's retry loop?".


- When a failed attempt is retried, backoff MUST be applied if and only if the error is an overload error.
- If an overload error is encountered:
- If an overload error is encountered at any point during an operation's retry loop:
Copy link
Copy Markdown
Member

@stIncMale stIncMale Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the "at any point during an operation's retry loop" part intended to clarify? Nowhere else in client-backpressure.md the "at any point of the command retry loop" qualification is made, because it is implied and seems obvious. What is specific about this item that requires the qualification?

Hopefully, another reviewer may see this question as useful, get to the answer, and then act on it however it seems appropriate.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was intended to clarify that a single overload error at any point will set the maximum number of retries for all errors encountered by the command. With the clarification added below, this addition can be removed.

- If an overload error is encountered at any point during an operation's retry loop:
- Regardless of whether CSOT is enabled or not, the maximum number of retries for any retry policy becomes
`MAX_RETRIES`.
`MAX_RETRIES`. This includes retryable non-overload errors following or preceding the encountered overload error.
Copy link
Copy Markdown
Member

@stIncMale stIncMale Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAX_RETRIES limits attempts ("the execution attempt number (starting with 0)"). attempt is currently documented in the "Overload retry policy" section as

includes retries for errors that are not overload errors (this might include attempts under other retry policies, see Interactions with Other Retry Policies).

It appears that the addition made here tries to duplicate the above information, but the way it is done does not seem like correct English to me (though I am not a native speaker): the subject in the new sentence is "this", but I fail to find anything in the previous sentence that could act as the thing referred to by "this". Maybe the proposed sentence could be reformulated as

Note that MAX_RETRIES limits attempts, and attempts includes retries for errors that are not overload errors.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another idea:

When reviewing the PR that introduced the requirements under the "If an overload error is encountered" condition, I suggested to formulate the two them by explicitly saying something like "extends the retry attempt limit..." and "forbids retry attempts after...". The current wording was chosen because it was much shorter than what I envisioned, and was still correct. However, given that it turned out to be practically difficult to understand, we may consider reformulating these requirements using the aforementioned wordings (if we do that, great care should be taken to formulate them correctly, which, historically, is difficult to do).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the wording for clarity here.

Copy link
Copy Markdown
Contributor

@comandeo-mongo comandeo-mongo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that the existing wording is good enough, we can leave the text as it is.

Two new tests add the necessary clarification and help to catch incorrect implementations.


6. Assert that the total number of started commands is `maxAdaptiveRetries` + 1 (2).

#### Test 4: Backoff is not applied for non-overload retryable errors
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suggested test covers a non-backpressure path; we can test a more complex situation - Non-overload error inside the overload loop. No backoff should be applied to the non-overload error:

  1. Let client be a MongoClient.

  2. Let coll be a collection.

  3. Configure the following failpoint:

      {
          configureFailPoint: 'failCommand',
          mode: { times: 1 },
          data: {
              failCommands: ['find'],
              errorCode: 91,
              errorLabels: ['RetryableError', 'SystemOverloadedError']
          }
      }
  1. Using the CommandFailedEvent from the command monitoring API, configure a callback that, when the first find error is observed, configures the following failpoint:
      {
          configureFailPoint: 'failCommand',
          mode: 'alwaysOn',
          data: {
              failCommands: ['find'],
              errorCode: 91,
              errorLabels: ['RetryableError']
          }
      }
  1. Perform a findOne operation with coll. Expect the operation to fail.

  2. Assert that backoff was applied only once — for the initial overload error — and not for the subsequent non-overload retryable errors. Drivers MAY implement this assertion by overriding BASE_BACKOFF to a large value (e.g., 5 seconds) before running the test and verifying that the total elapsed time of the findOne operation is less than BASE_BACKOFF plus a small margin for network overhead. With correct behavior, only one backoff is applied (bounded by BASE_BACKOFF), so the elapsed time stays under that bound. If backoff were incorrectly applied for non-overload errors as well, the elapsed time would significantly exceed BASE_BACKOFF.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. This test is also more consistent with our existing retryable reads tests for backpressure, I'll replace the original with your suggestion and move it there.

@NoahStapp NoahStapp dismissed stIncMale’s stale review April 15, 2026 14:49

Explicitly stated that requested changes were non-blocking.

@NoahStapp NoahStapp removed the request for review from stIncMale April 15, 2026 14:49
Copy link
Copy Markdown
Contributor

@papafe papafe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks good, just two minor comments.

mode: "off"
}
```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have an equivalent test 5 also for retryable writes?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good catch.

Comment thread source/retryable-writes/tests/README.md Outdated
Configure the second fail point command only if the failed event is for the first error configured in step 2.

4. Attempt an `insertOne` operation on any record for any database and collection. Expect the `insertOne` to fail with a
server error. Assert that `MAX_ADAPTIVE_RETRIES + 1` attempts were made.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call this MAX_RETRIES to be consistent wit the spec files?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this test is in retryable-writes, the intent was to specify MAX_ADAPTIVE_RETRIES as we do in the retryable-reads tests to remove ambiguity.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to specify that this comment should be for all occurrences of this parameter in both retryable reads and writes tests. I think we just introduced it, didn't we?

@NoahStapp NoahStapp requested a review from papafe April 15, 2026 15:19
Copy link
Copy Markdown
Contributor

@papafe papafe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@NoahStapp NoahStapp merged commit 7039e69 into mongodb:master Apr 15, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants