DRIVERS-3446 - Add prose tests to verify correct retry behavior when a mix of overload and non-overload errors are encountered#1924
Conversation
…a mix of overload and non-overload errors are encountered DRIVERS-3427: Remove token bucket disclaimer from backpressure speci… (mongodb#1923) DRIVERS-3436 - Refine `withTransaction` timeout error wrapping semantics and label propagation in spec and prose tests (mongodb#1920) Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Valentin Kovalenko <valentin.male.kovalenko@gmail.com> DRIVERS-3325 allow keyAltName in encryptedFields spec test (mongodb#1853)
| two runs. | ||
|
|
||
| #### Test 3: Overload Errors are Retried a Maximum of MAX_RETRIES times | ||
| #### Test 2: Overload Errors are Retried a Maximum of MAX_RETRIES times |
There was a problem hiding this comment.
We should not change the test numbers. They were, I believe, on purpose left out unchanged when one of the tests was previously removed - drivers may refer to the prose tests either by their numbers, or by linking to a specific test.
Changing the number or the test name breaks those references. It is especially bad when a test is referred to by its number - the old test 3 is now 2, but there is still a test with the number 3, and if a driver refers to the 3rd test, it is not obvious that the reference is broken.
The same applies to the test below.
There was a problem hiding this comment.
If so, we should probably add a stub, something like "Test 2: REMOVED" to avoid confusion in the future.
|
|
||
| - When a failed attempt is retried, backoff MUST be applied if and only if the error is an overload error. | ||
| - If an overload error is encountered: | ||
| - If an overload error is encountered at any point during an operation's retry loop: |
There was a problem hiding this comment.
It's not operations but commands that the read/write retry policies, or the new overload retry policy prescribes the drivers to retry. Previously, the effort was made to make sure that client-backpressure.md correctly refers to either command or operation, given that they are not the same thing.
Let's replace "operation" with "command" here, because that's the correct term here and below in "Why override existing maximum number of retry attempt defaults for retryable reads and writes if an overload error is received at any point during an operation's retry loop?".
|
|
||
| - When a failed attempt is retried, backoff MUST be applied if and only if the error is an overload error. | ||
| - If an overload error is encountered: | ||
| - If an overload error is encountered at any point during an operation's retry loop: |
There was a problem hiding this comment.
What is the "at any point during an operation's retry loop" part intended to clarify? Nowhere else in client-backpressure.md the "at any point of the command retry loop" qualification is made, because it is implied and seems obvious. What is specific about this item that requires the qualification?
Hopefully, another reviewer may see this question as useful, get to the answer, and then act on it however it seems appropriate.
There was a problem hiding this comment.
It was intended to clarify that a single overload error at any point will set the maximum number of retries for all errors encountered by the command. With the clarification added below, this addition can be removed.
| - If an overload error is encountered at any point during an operation's retry loop: | ||
| - Regardless of whether CSOT is enabled or not, the maximum number of retries for any retry policy becomes | ||
| `MAX_RETRIES`. | ||
| `MAX_RETRIES`. This includes retryable non-overload errors following or preceding the encountered overload error. |
There was a problem hiding this comment.
MAX_RETRIES limits attempts ("the execution attempt number (starting with 0)"). attempt is currently documented in the "Overload retry policy" section as
includes retries for errors that are not overload errors (this might include attempts under other retry policies, see Interactions with Other Retry Policies).
It appears that the addition made here tries to duplicate the above information, but the way it is done does not seem like correct English to me (though I am not a native speaker): the subject in the new sentence is "this", but I fail to find anything in the previous sentence that could act as the thing referred to by "this". Maybe the proposed sentence could be reformulated as
Note that
MAX_RETRIESlimitsattempts, andattemptsincludes retries for errors that are not overload errors.
There was a problem hiding this comment.
Another idea:
When reviewing the PR that introduced the requirements under the "If an overload error is encountered" condition, I suggested to formulate the two them by explicitly saying something like "extends the retry attempt limit..." and "forbids retry attempts after...". The current wording was chosen because it was much shorter than what I envisioned, and was still correct. However, given that it turned out to be practically difficult to understand, we may consider reformulating these requirements using the aforementioned wordings (if we do that, great care should be taken to formulate them correctly, which, historically, is difficult to do).
There was a problem hiding this comment.
Updated the wording for clarity here.
comandeo-mongo
left a comment
There was a problem hiding this comment.
I believe that the existing wording is good enough, we can leave the text as it is.
Two new tests add the necessary clarification and help to catch incorrect implementations.
|
|
||
| 6. Assert that the total number of started commands is `maxAdaptiveRetries` + 1 (2). | ||
|
|
||
| #### Test 4: Backoff is not applied for non-overload retryable errors |
There was a problem hiding this comment.
The suggested test covers a non-backpressure path; we can test a more complex situation - Non-overload error inside the overload loop. No backoff should be applied to the non-overload error:
-
Let client be a MongoClient.
-
Let coll be a collection.
-
Configure the following failpoint:
{
configureFailPoint: 'failCommand',
mode: { times: 1 },
data: {
failCommands: ['find'],
errorCode: 91,
errorLabels: ['RetryableError', 'SystemOverloadedError']
}
}- Using the CommandFailedEvent from the command monitoring API, configure a callback that, when the first find error is observed, configures the following failpoint:
{
configureFailPoint: 'failCommand',
mode: 'alwaysOn',
data: {
failCommands: ['find'],
errorCode: 91,
errorLabels: ['RetryableError']
}
}-
Perform a findOne operation with coll. Expect the operation to fail.
-
Assert that backoff was applied only once — for the initial overload error — and not for the subsequent non-overload retryable errors. Drivers MAY implement this assertion by overriding BASE_BACKOFF to a large value (e.g., 5 seconds) before running the test and verifying that the total elapsed time of the findOne operation is less than BASE_BACKOFF plus a small margin for network overhead. With correct behavior, only one backoff is applied (bounded by BASE_BACKOFF), so the elapsed time stays under that bound. If backoff were incorrectly applied for non-overload errors as well, the elapsed time would significantly exceed BASE_BACKOFF.
There was a problem hiding this comment.
Good suggestion. This test is also more consistent with our existing retryable reads tests for backpressure, I'll replace the original with your suggestion and move it there.
Explicitly stated that requested changes were non-blocking.
papafe
left a comment
There was a problem hiding this comment.
Overall it looks good, just two minor comments.
| mode: "off" | ||
| } | ||
| ``` | ||
|
|
There was a problem hiding this comment.
Should we have an equivalent test 5 also for retryable writes?
| Configure the second fail point command only if the failed event is for the first error configured in step 2. | ||
|
|
||
| 4. Attempt an `insertOne` operation on any record for any database and collection. Expect the `insertOne` to fail with a | ||
| server error. Assert that `MAX_ADAPTIVE_RETRIES + 1` attempts were made. |
There was a problem hiding this comment.
Should we call this MAX_RETRIES to be consistent wit the spec files?
There was a problem hiding this comment.
Since this test is in retryable-writes, the intent was to specify MAX_ADAPTIVE_RETRIES as we do in the retryable-reads tests to remove ambiguity.
There was a problem hiding this comment.
I forgot to specify that this comment should be for all occurrences of this parameter in both retryable reads and writes tests. I think we just introduced it, didn't we?
DRIVERS-3446
Please complete the following before merging:
clusters).
Ruby PoC: mongodb/mongo-ruby-driver#3021