Delay the timing of setting reconnectionPending to false to avoid double attempt at reconnecting #328
Merged
Conversation
…ble attempt at reconnecting
Contributor
BewareMyPower
left a comment
Good catch. It's similar to my fix for the Java client: apache/pulsar#20595
I just left two comments at the moment and will review it again when I'm back to work next week.
bool setFailed(Result result) const { return state_->complete(result, {}); }

bool setSuccess() const { return state_->complete({}, {}); }
Contributor
I think we can replace this method with a setValue({}) call, or by passing a static variable to setValue.
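To make the suggestion concrete, here is a minimal sketch of a promise in which success is reported through `setValue({})` rather than a dedicated `setSuccess()`. `MiniState`, `MiniPromise` and `DemoResult` are hypothetical stand-ins written for this illustration; they are not the client's actual `Promise` implementation, and the assumption that a value-initialized result means "OK" is mine.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical result codes; 0 plays the role of "OK" in this sketch.
enum DemoResult { DemoOk = 0, DemoError = 1 };

// Hypothetical, simplified stand-in for the shared state behind a promise.
template <typename Result, typename Type>
class MiniState {
   public:
    using Listener = std::function<void(Result, const Type&)>;

    // Completes at most once and returns false on any later call.
    bool complete(Result result, const Type& value) {
        std::vector<Listener> listeners;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (done_) return false;
            done_ = true;
            result_ = result;
            value_ = value;
            listeners.swap(listeners_);
        }
        // Listeners run outside the lock.
        for (auto& listener : listeners) listener(result, value);
        return true;
    }

    // Runs the listener on completion, or immediately if already completed.
    void addListener(Listener listener) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (!done_) {
                listeners_.push_back(std::move(listener));
                return;
            }
        }
        listener(result_, value_);
    }

   private:
    std::mutex mutex_;
    bool done_ = false;
    Result result_{};
    Type value_{};
    std::vector<Listener> listeners_;
};

template <typename Result, typename Type>
class MiniPromise {
   public:
    // Success path: a value-initialized Result stands for "OK" (assumption).
    bool setValue(const Type& value) const { return state_->complete(Result{}, value); }
    bool setFailed(Result result) const { return state_->complete(result, Type{}); }
    void addListener(typename MiniState<Result, Type>::Listener listener) const {
        state_->addListener(std::move(listener));
    }

   private:
    std::shared_ptr<MiniState<Result, Type>> state_ =
        std::make_shared<MiniState<Result, Type>>();
};
```

With this shape, a caller that only cares about the result code can write `promise.setValue({})`, so a separate `setSuccess()` adds no behavior of its own.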
connectionOpened(cnx).addListener([this, self](Result result, bool) {
    // Do not use bool, only Result.
    reconnectionPending_ = false;
    if (result == ResultRetryable) {
Contributor
Use isResultRetryable() from ResultUtil.h here. Though I admit using two Result enums as retryable enums seems weird.
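As a rough illustration of the difference, the sketch below routes the retry decision through a single predicate instead of comparing against one enum value. Everything here is hypothetical: `SketchResult`, `isSketchResultRetryable` and `nextStepAfterReconnect` are names invented for the example, and which two values the real `isResultRetryable()` treats as retryable is an assumption, not something taken from `ResultUtil.h`.

```cpp
#include <cassert>

// Hypothetical result codes, standing in for the client's Result enum.
enum SketchResult { SketchOk, SketchRetryable, SketchDisconnected, SketchTimeout };

// Hypothetical predicate: two result codes count as retryable (assumed set).
inline bool isSketchResultRetryable(SketchResult result) {
    return result == SketchRetryable || result == SketchDisconnected;
}

// What the handler does once a reconnection attempt has finished.
enum class NextStep { Done, ScheduleReconnection, Fail };

inline NextStep nextStepAfterReconnect(SketchResult result) {
    if (result == SketchOk) return NextStep::Done;
    // A bare `result == SketchRetryable` check would miss SketchDisconnected.
    return isSketchResultRetryable(result) ? NextStep::ScheduleReconnection
                                           : NextStep::Fail;
}
```

Keeping the retryable set in one helper means the listener does not need to change if that set is later extended.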
Contributor
Author
@BewareMyPower
BewareMyPower
approved these changes
Oct 16, 2023
BewareMyPower
pushed a commit
to streamnative/pulsar-client-cpp
that referenced
this pull request
Oct 20, 2023
…ble attempt at reconnecting (apache#328)

Related Issue: apache#235

### Motivation

A potential double scheduling of reconnection due to a broker shutdown was observed.
The reconnect can be scheduled with either of the following codes
[https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L1209](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L1209)
or
[https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ClientConnection.cc#L1350](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ClientConnection.cc#L1350)
-> [https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L121](https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L121)

If a second reconnection request is received during the first reconnection attempt, it triggers additional reconnection attempts. If the second reconnection is successful, the consumer is removed from `cnx`:
[https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L285](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L285)
-> [https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L63](https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L63)
--> [https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L217](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L217)

The problem is that the consumer will no longer be able to manage events coming from the broker.

To cope with this issue, a new flag `reconnectionPending_` has been introduced via apache#310. However, while the above change reduces the likelihood of the problem occurring, it doesn't eliminate the problem entirely.
In fact, the double reconnects have been observed even after apache#310 (I tried with apache@b35ae1a):

```
# Consumer is connected to broker1, but broker1 shutdown closes Consumer and reconnection is scheduled.
...
2023-09-26 15:42:05.736 INFO [140591970158336] ConsumerImpl:1207 | Broker notification of Closed consumer: 5
2023-09-26 15:42:05.736 INFO [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.1 s
...
# Consumer attempts to connect to broker1, but fails, and a reconnection is scheduled again.
...
2023-09-26 15:42:05.836 INFO [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
2023-09-26 15:42:05.837 WARN [140591970158336] ClientConnection:1741 | [<host(client)>:55304 -> <host(broker1)>:<prot>] Received error response from server: Retryable (Namespace is being unloaded, cannot add topic persistent://shustsud-test2/test/partitioned-topic-partition-5) -- req_id: 16
2023-09-26 15:42:05.837 WARN [140591970158336] ConsumerImpl:317 | [0x7fde18046510, dummy_24, 5] Failed to reconnect consumer: Retryable
2023-09-26 15:42:05.837 INFO [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.194 s
...
# During the connection attempt, the connection to broker1 is closed and further reconnection is scheduled.
# After that, two subscribe requests are sent to broker2.
2023-09-26 15:42:06.034 INFO [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
...
2023-09-26 15:42:06.515 ERROR [140591970158336] ClientConnection:1330 | [<host(client)>:55304 -> <host(broker1)>:<prot>] Connection closed with ConnectError
2023-09-26 15:42:06.515 INFO [140591970158336] ConnectionPool:122 | Remove connection for pulsar+ssl://<host(broker1)>:<prot>
2023-09-26 15:42:06.515 INFO [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.392 s
...
2023-09-26 15:42:06.907 INFO [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
...
2023-09-26 15:42:06.912 INFO [140591970158336] ConsumerImpl:282 | [0x7fde18046510, dummy_24, 5] Created consumer on broker [<host(client)>:54582 -> <host(broker2)>:<prot>]
...
2023-09-26 15:42:07.103 INFO [140591970158336] ConsumerImpl:282 | [0x7fde18046510, dummy_24, 5] Created consumer on broker [<host(client)>:54582 -> <host(broker2)>:<prot>]
...
```

To completely eliminate the possibility of the double reconnects, I suggest adjusting the timing of when reconnectionPending_ is set to false. Ideally, this should be done after the handleCreateConsumer method or the handleCreateProducer method has been completed.

### Modifications

The timing for setting `reconnectionPending_` to false has been changed.

(cherry picked from commit ebae92e)
Related Issue: #235
Motivation
A potential double scheduling of reconnection due to a broker shutdown was observed.
The reconnect can be scheduled from either of the following code paths:
https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L1209
or
https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ClientConnection.cc#L1350
-> https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L121
If a second reconnection request is received during the first reconnection attempt, it triggers additional reconnection attempts. If the second reconnection is successful, the consumer is removed from cnx:
https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L285
-> https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L63
--> https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L217
The problem is that the consumer will no longer be able to manage events coming from the broker.
To cope with this issue, a new flag reconnectionPending_ has been introduced via #310. However, while the above change reduces the likelihood of the problem occurring, it doesn't eliminate the problem entirely.
In fact, double reconnects have been observed even after #310 (I tried with b35ae1a), as the log excerpt in the referenced commit message above shows.
To completely eliminate the possibility of the double reconnects, I suggest adjusting the timing of when reconnectionPending_ is set to false. Ideally, this should be done after the handleCreateConsumer method or the handleCreateProducer method has been completed.
Modifications
The timing for setting reconnectionPending_ to false has been changed.
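The effect of the timing change can be sketched with a small single-threaded model. This is only an illustration under stated assumptions: `DemoHandler`, `grabCnx`, `completeHandshakes` and the `delayReset` switch are hypothetical names, the create-consumer or create-producer handshake is modelled as a queued callback, and the "early" variant assumes the previous code cleared the flag as soon as a connection was obtained.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical model of a handler guarded by a reconnection-pending flag.
class DemoHandler {
   public:
    // delayReset == false: clear the flag once a connection is obtained.
    // delayReset == true:  clear it only after the handshake has completed.
    explicit DemoHandler(bool delayReset) : delayReset_(delayReset) {}

    // Called whenever something requests a reconnection.
    void grabCnx() {
        bool expected = false;
        if (!reconnectionPending_.compare_exchange_strong(expected, true)) {
            return;  // An attempt is already in flight; ignore this request.
        }
        ++attemptsStarted_;
        if (!delayReset_) {
            // Early reset: a second request can slip in before the
            // handshake below has finished.
            reconnectionPending_ = false;
        }
        // The handshake finishes later; model it as a queued callback.
        pendingHandshakes_.push_back([this] {
            if (delayReset_) reconnectionPending_ = false;
        });
    }

    // Runs all queued handshake completions in order.
    void completeHandshakes() {
        while (!pendingHandshakes_.empty()) {
            auto done = std::move(pendingHandshakes_.front());
            pendingHandshakes_.pop_front();
            done();
        }
    }

    int attemptsStarted() const { return attemptsStarted_; }

   private:
    const bool delayReset_;
    std::atomic<bool> reconnectionPending_{false};
    int attemptsStarted_ = 0;
    std::deque<std::function<void()>> pendingHandshakes_;
};
```

In this model, two back-to-back `grabCnx()` calls start two attempts when the flag is cleared early, but only one when the reset is delayed until `completeHandshakes()` runs; a later request after completion is accepted again.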