Skip to content

fix: avoid infinite retry in blue-green migration test#1502

Open
nodece wants to merge 4 commits into
apache:masterfrom
nodece:fix-blue-green-ci-timeout
Open

fix: avoid infinite retry in blue-green migration test#1502
nodece wants to merge 4 commits into
apache:masterfrom
nodece:fix-blue-green-ci-timeout

Conversation

@nodece
Copy link
Copy Markdown
Member

@nodece nodece commented May 20, 2026

Motivation

The extensible-load-manager CI job timed out in TestBlueGreenMigrationTestSuite/TestTopicMigration/proxyConnection after 5 minutes.

From the stack and logs, the test was stuck waiting on WaitGroup while producer/consumer goroutines were still looping in retry paths. During migration, producer can enter terminal states (for example TopicTerminated or ProducerClosed), but the test retry loops had no terminal-exit logic, causing effectively unbounded retries and suite timeout.

Modifications

  • Add terminal error handling in producer send retry loop:
    • if error is ErrTopicTerminated or ErrProducerClosed, fail fast instead of retrying forever.
  • Add bounded retry windows for both producer and consumer loops (30 seconds per message stage).
  • Add an error channel and stage-level wait helper around WaitGroup waits to fail early on goroutine errors.
  • Add stage timeout protection while waiting for:
    • pre-unload send/receive synchronization
    • producer/consumer goroutine completion
  • Keep per-iteration context cancellation immediate (cancel after each send/receive attempt).

These changes make the test deterministic under migration failures and prevent hanging until global test timeout.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the blue/green migration integration test to avoid hanging indefinitely during broker migration/unload scenarios by adding bounded retry behavior, terminal-error exits, and stage-level timeouts/error propagation.

Changes:

  • Add terminal error handling for producer send retries (stop retrying on ErrTopicTerminated / ErrProducerClosed).
  • Bound producer and consumer retry loops to a 30s window per message.
  • Add an error channel and a waitWithError helper to fail stages early and avoid waiting forever on WaitGroups.
Comments suppressed due to low confidence (2)

pulsar/blue_green_migration_test.go:193

  • The pre-unload synchronization looks off by one: with i starting at 0, if i == messageCountBeforeUnload { wgSendAndReceiveMessages.Done() } signals after processing messageCountBeforeUnload+1 messages. If the intent is to wait until exactly messageCountBeforeUnload messages have been sent, this should trigger at i == messageCountBeforeUnload-1 (or increment a counter and compare to the target).

			if i == messageCountBeforeUnload {
				wgSendAndReceiveMessages.Done()
			}
		}

pulsar/blue_green_migration_test.go:228

  • Same off-by-one issue on the consumer side: i is 0-based, so if i == messageCountBeforeUnload { ... } fires after messageCountBeforeUnload+1 messages have been received/acked. If you want the unload to start after exactly messageCountBeforeUnload messages, adjust the condition accordingly (e.g., messageCountBeforeUnload-1).

			if i == messageCountBeforeUnload {
				wgSendAndReceiveMessages.Done()
			}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@BewareMyPower BewareMyPower added this to the 0.20.0 milestone Jun 1, 2026
Comment thread pulsar/blue_green_migration_test.go
Comment thread pulsar/blue_green_migration_test.go
Comment thread pulsar/blue_green_migration_test.go
Comment thread pulsar/blue_green_migration_test.go
nodece and others added 3 commits June 1, 2026 15:57
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@nodece nodece force-pushed the fix-blue-green-ci-timeout branch from bc05b7a to 6c34646 Compare June 1, 2026 08:01
// Signals both producer and consumer have processed `messageCountBeforeUnload` messages
wgSendAndReceiveMessages := sync.WaitGroup{}
wgSendAndReceiveMessages.Add(2)
errCh := make(chan error, 1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continue the previous discussion here.

Since this channel could only be sent once because every select { case errCh<- xxx, default } block is followed by a return.

		select {
		case <-doneCh:
			return true
		case err := <-errCh:
			req.NoError(err, stage)
			return false

However, it's possible that both doneCh and errCh have a message, in this case, any of them can be chosen randomly.

Therefore, I suggest just storing errors into a slice and check the errors in the case <-doneCh case.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two goroutines, the 1st could send the error to errCh, the doneCh receives a message after two goroutines are done. i.e. if the 1st goroutine completes before the 2nd one, there could be a race.

But if doneCh receives a message, the error is guaranteed to be sent to errCh if there is an error. Therefore, changing the error into an atomic.Value should make sense

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Race issue has been fixed.

Comment thread pulsar/blue_green_migration_test.go Outdated
@RobertIndie RobertIndie modified the milestones: 0.20.0, 0.21.0 Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants