fix: avoid infinite retry in blue-green migration test by nodece · Pull Request #1502 · apache/pulsar-client-go

nodece · 2026-05-20T07:50:16Z

Motivation

The extensible-load-manager CI job timed out in TestBlueGreenMigrationTestSuite/TestTopicMigration/proxyConnection after 5 minutes.

The stack trace and logs show the test stuck waiting on a WaitGroup while the producer and consumer goroutines were still looping in their retry paths. During migration, the producer can enter a terminal state (for example TopicTerminated or ProducerClosed), but the test's retry loops had no terminal-exit logic, so they retried effectively without bound until the suite timed out.

Modifications

Add terminal error handling in the producer send retry loop:
- if the error is ErrTopicTerminated or ErrProducerClosed, fail fast instead of retrying forever.
Add bounded retry windows for both producer and consumer loops (30 seconds per message stage).
Add an error channel and stage-level wait helper around WaitGroup waits to fail early on goroutine errors.
Add stage timeout protection while waiting for:
- pre-unload send/receive synchronization
- producer/consumer goroutine completion
Keep per-iteration context cancellation immediate (cancel after each send/receive attempt).

These changes make the test deterministic under migration failures and prevent it from hanging until the global test timeout.

Copilot

Pull request overview

This PR updates the blue/green migration integration test to avoid hanging indefinitely during broker migration/unload scenarios by adding bounded retry behavior, terminal-error exits, and stage-level timeouts/error propagation.

Changes:

Add terminal error handling for producer send retries (stop retrying on ErrTopicTerminated / ErrProducerClosed).
Bound producer and consumer retry loops to a 30s window per message.
Add an error channel and a waitWithError helper to fail stages early and avoid waiting forever on WaitGroups.

Comments suppressed due to low confidence (2)

pulsar/blue_green_migration_test.go:193

The pre-unload synchronization looks off by one: with i starting at 0, if i == messageCountBeforeUnload { wgSendAndReceiveMessages.Done() } signals after processing messageCountBeforeUnload+1 messages. If the intent is to wait until exactly messageCountBeforeUnload messages have been sent, this should trigger at i == messageCountBeforeUnload-1 (or increment a counter and compare to the target).


			if i == messageCountBeforeUnload {
				wgSendAndReceiveMessages.Done()
			}
		}

pulsar/blue_green_migration_test.go:228

Same off-by-one issue on the consumer side: i is 0-based, so if i == messageCountBeforeUnload { ... } fires after messageCountBeforeUnload+1 messages have been received/acked. If you want the unload to start after exactly messageCountBeforeUnload messages, adjust the condition accordingly (e.g., messageCountBeforeUnload-1).


			if i == messageCountBeforeUnload {
				wgSendAndReceiveMessages.Done()
			}
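The off-by-one concern in the two comments above can be checked with a small standalone loop. This is only an illustration: `processedWhenSignalled` is a hypothetical helper, and the value 10 for `messageCountBeforeUnload` is an assumption, not the value used in the test.

```go
package main

import "fmt"

// processedWhenSignalled reports how many messages a 0-based loop has handled
// at the moment the condition i == trigger first holds.
func processedWhenSignalled(total, trigger int) int {
	for i := 0; i < total; i++ {
		// Message i is sent (or received and acked) here.
		if i == trigger {
			return i + 1
		}
	}
	return -1 // the condition never fired
}

func main() {
	const messageCountBeforeUnload = 10 // assumed value for illustration

	// Comparing against the count itself fires one message late.
	fmt.Println(processedWhenSignalled(100, messageCountBeforeUnload)) // 11

	// Comparing against count-1 fires after exactly that many messages.
	fmt.Println(processedWhenSignalled(100, messageCountBeforeUnload-1)) // 10
}
```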


Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

BewareMyPower · 2026-06-01T09:37:03Z

 	// Signals both producer and consumer have processed `messageCountBeforeUnload` messages
 	wgSendAndReceiveMessages := sync.WaitGroup{}
 	wgSendAndReceiveMessages.Add(2)
+	errCh := make(chan error, 1)


Continue the previous discussion here.

This channel can only be sent to once, because every select { case errCh <- xxx: default: } block is followed by a return.

select {
case <-doneCh:
	return true
case err := <-errCh:
	req.NoError(err, stage)
	return false

However, doneCh and errCh can both be ready at the same time, in which case select picks one of them at random.

Therefore, I suggest storing the errors in a slice and checking them in the case <-doneCh branch.

There are two goroutines. The first can send an error to errCh, while doneCh receives a message only after both goroutines are done. That is, if the first goroutine completes before the second, there could be a race.

But once doneCh receives a message, any error is guaranteed to have already been sent to errCh. Therefore, storing the error in an atomic.Value instead should make sense.

The race issue has been fixed.

nodece force-pushed the fix-blue-green-ci-timeout branch from 3c7c9e2 to bc05b7a Compare May 20, 2026 07:54

RobertIndie requested a review from Copilot May 27, 2026 09:13

Copilot started reviewing on behalf of RobertIndie May 27, 2026 09:13 View session

Copilot AI reviewed May 27, 2026

BewareMyPower assigned nodece Jun 1, 2026

BewareMyPower added this to the 0.20.0 milestone Jun 1, 2026

BewareMyPower reviewed Jun 1, 2026

Comment thread pulsar/blue_green_migration_test.go

nodece and others added 3 commits June 1, 2026 15:57

fix: avoid context leak in blue-green migration test

c23efc4

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: avoid infinite retry in blue-green migration test

e6a540d

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Address comment

6c34646

nodece force-pushed the fix-blue-green-ci-timeout branch from bc05b7a to 6c34646 Compare June 1, 2026 08:01

BewareMyPower requested changes Jun 1, 2026

BewareMyPower reviewed Jun 1, 2026

Comment thread pulsar/blue_green_migration_test.go Outdated

RobertIndie modified the milestones: 0.20.0, 0.21.0 Jun 1, 2026

Address comment

750f4de

nodece wants to merge 4 commits into apache:master from nodece:fix-blue-green-ci-timeout

4 participants
