Skip to content

[fix][test] Make LocalBookkeeperEnsemble robust to port race and partial-setup cleanup#25690

Open
merlimat wants to merge 3 commits intoapache:masterfrom
merlimat:fix/flaky-shadow-topic-real-bk-test
Open

[fix][test] Make LocalBookkeeperEnsemble robust to port race and partial-setup cleanup#25690
merlimat wants to merge 3 commits intoapache:masterfrom
merlimat:fix/flaky-shadow-topic-real-bk-test

Conversation

@merlimat
Copy link
Copy Markdown
Contributor

@merlimat merlimat commented May 5, 2026

Motivation

ShadowTopicRealBkTest.setup flakes intermittently with BindException: Address already in use:

java.io.IOException: java.net.BindException: Address already in use
    at org.apache.pulsar.zookeeper.LocalBookkeeperEnsemble.runZookeeper(LocalBookkeeperEnsemble.java:220)
    at org.apache.pulsar.zookeeper.LocalBookkeeperEnsemble.start(LocalBookkeeperEnsemble.java:424)
    at org.apache.pulsar.zookeeper.LocalBookkeeperEnsemble.start(LocalBookkeeperEnsemble.java:434)
    at org.apache.pulsar.broker.service.persistent.ShadowTopicRealBkTest.setup(ShadowTopicRealBkTest.java:52)

PortManager.nextLockedFreePort() only locks the port inside the JVM's static set — another process can grab the OS-level port between allocation and the ZK bind.

When that happens, cleanup() also fails with NPE because LocalBookkeeperEnsemble.stop() dereferences bookieComponents (and other fields) that the aborted setup left null:

java.lang.NullPointerException: Cannot read the array length because "<local1>" is null
    at org.apache.pulsar.zookeeper.LocalBookkeeperEnsemble.stop(LocalBookkeeperEnsemble.java:513)

Failing CI: https://github.com/apache/pulsar/actions/runs/25387331732/job/74455050956

Modifications

  • ShadowTopicRealBkTest: pass 0 for the ZK port so the kernel picks a free port at bind time. This eliminates the JVM-vs-OS race entirely. The actual bound port is read back via bk.getZookeeperPort() after start.
  • LocalBookkeeperEnsemble.stop: null-guard bookieComponents, zkc, zks, serverFactory, and streamStorage, and wrap each shutdown in try/catch so cleanup never NPEs after a partial setup (the port-bind failure was one trigger, but any setup failure could leave fields uninitialized).
  • ShadowTopicRealBkTest: tolerate a partially-initialized PulsarService in cleanup.

Verifying this change

  • ShadowTopicRealBkTest passes locally with the change.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API: no
  • The schema: no
  • The default values of configurations: no
  • The threading model: no
  • The binary protocol: no
  • The REST endpoints: no
  • The admin CLI options: no
  • The metrics: no
  • Anything that affects deployment: no

…ial-setup cleanup

`ShadowTopicRealBkTest.setup` occasionally fails with `BindException: Address
already in use`: `PortManager.nextLockedFreePort()` only locks the port at the
JVM level, so another process can grab the port between allocation and the ZK
bind. When that happens, the subsequent `cleanup()` call also fails with NPE
because `LocalBookkeeperEnsemble.stop()` dereferences `bookieComponents` (and
other fields) that are still null from the aborted setup.

* `LocalBookkeeperEnsemble.runZookeeper` now retries on `BindException` up to 5
  attempts and falls back to a kernel-assigned port (port 0) on the last try.
  Callers must read the actual bound port via `getZookeeperPort()` after start.
* `LocalBookkeeperEnsemble.stop` null-guards every field it touches so cleanup
  never NPEs when setup failed mid-way.
* `ShadowTopicRealBkTest` now reads the bound ZK port back from the ensemble
  after start, so the metadata store URL stays correct even when the retry
  fallback kicks in. The cleanup path also tolerates a partially-initialized
  PulsarService.
merlimat added 2 commits May 5, 2026 16:43
Drop the BindException retry loop in LocalBookkeeperEnsemble.runZookeeper
and instead pass 0 for the ZK port from the test. The kernel guarantees a
free port at bind time, so there is no race window. The actual bound port
is read back via bk.getZookeeperPort().

The cleanup null-safety fix in LocalBookkeeperEnsemble.stop() is kept
because any partial-init failure (not just port bind) would otherwise
NPE the cleanup.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants