fix: avoid infinite retry in blue-green migration test#1502
Conversation
3c7c9e2 to
bc05b7a
Compare
There was a problem hiding this comment.
Pull request overview
This PR updates the blue/green migration integration test to avoid hanging indefinitely during broker migration/unload scenarios by adding bounded retry behavior, terminal-error exits, and stage-level timeouts/error propagation.
Changes:
- Add terminal error handling for producer send retries (stop retrying on
ErrTopicTerminated/ErrProducerClosed). - Bound producer and consumer retry loops to a 30s window per message.
- Add an error channel and a
waitWithErrorhelper to fail stages early and avoid waiting forever onWaitGroups.
Comments suppressed due to low confidence (2)
pulsar/blue_green_migration_test.go:193
- The pre-unload synchronization looks off by one: with
istarting at 0,if i == messageCountBeforeUnload { wgSendAndReceiveMessages.Done() }signals after processingmessageCountBeforeUnload+1messages. If the intent is to wait until exactlymessageCountBeforeUnloadmessages have been sent, this should trigger ati == messageCountBeforeUnload-1(or increment a counter and compare to the target).
if i == messageCountBeforeUnload {
wgSendAndReceiveMessages.Done()
}
}
pulsar/blue_green_migration_test.go:228
- Same off-by-one issue on the consumer side:
iis 0-based, soif i == messageCountBeforeUnload { ... }fires aftermessageCountBeforeUnload+1messages have been received/acked. If you want the unload to start after exactlymessageCountBeforeUnloadmessages, adjust the condition accordingly (e.g.,messageCountBeforeUnload-1).
if i == messageCountBeforeUnload {
wgSendAndReceiveMessages.Done()
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bc05b7a to
6c34646
Compare
| // Signals both producer and consumer have processed `messageCountBeforeUnload` messages | ||
| wgSendAndReceiveMessages := sync.WaitGroup{} | ||
| wgSendAndReceiveMessages.Add(2) | ||
| errCh := make(chan error, 1) |
There was a problem hiding this comment.
Continue the previous discussion here.
Since this channel could only be sent once because every select { case errCh<- xxx, default } block is followed by a return.
select {
case <-doneCh:
return true
case err := <-errCh:
req.NoError(err, stage)
return falseHowever, it's possible that both doneCh and errCh have a message, in this case, any of them can be chosen randomly.
Therefore, I suggest just storing errors into a slice and check the errors in the case <-doneCh case.
There was a problem hiding this comment.
There are two goroutines, the 1st could send the error to errCh, the doneCh receives a message after two goroutines are done. i.e. if the 1st goroutine completes before the 2nd one, there could be a race.
But if doneCh receives a message, the error is guaranteed to be sent to errCh if there is an error. Therefore, changing the error into an atomic.Value should make sense
There was a problem hiding this comment.
Race issue has been fixed.
Motivation
The extensible-load-manager CI job timed out in TestBlueGreenMigrationTestSuite/TestTopicMigration/proxyConnection after 5 minutes.
From the stack and logs, the test was stuck waiting on WaitGroup while producer/consumer goroutines were still looping in retry paths. During migration, producer can enter terminal states (for example TopicTerminated or ProducerClosed), but the test retry loops had no terminal-exit logic, causing effectively unbounded retries and suite timeout.
Modifications
These changes make the test deterministic under migration failures and prevent hanging until global test timeout.