Skip to content

fix(tests): address recurring flaky failures in subscription-replay, MQTT, and multi-threaded suites#654

Merged
kriszyp merged 2 commits into
mainfrom
fix/flaky-unit-tests
May 23, 2026
Merged

fix(tests): address recurring flaky failures in subscription-replay, MQTT, and multi-threaded suites#654
kriszyp merged 2 commits into
mainfrom
fix/flaky-unit-tests

Conversation

@kriszyp
Copy link
Copy Markdown
Member

@kriszyp kriszyp commented May 20, 2026

Summary

  • Subscription replay (race conditions)collect()'s quiet-period timer was used to gather events while writes were still in flight. If a write committed after the silence window the event was silently dropped. Replaced with: attach listener → await all writes → fixed drain.
  • MQTT mTLS — authorised client connections were missing rejectUnauthorized: false; Node's TLS was rejecting the server's self-signed cert even when the CA was provided, causing intermittent handshake failures. The test's goal is server-side mTLS (server rejects clients without a cert), not client-side server-cert validation.
  • MQTT QoS=1 reconnect — polling loop had no internal timeout and relied on the 10 s suite limit which isn't sufficient on loaded runners. Added per-test this.timeout(20000) and an internal 15 s reject path.
  • Multi-threaded cache — assertions targeted specific IDs ('24', '25') whose write counts are sensitive to the seeded PRNG's non-uniform distribution. Switched to aggregate across all IDs 20–29.

Context

These tests have caused ~5 of the last 16 main-branch CI runs to fail (31% failure rate). All failures were confirmed intermittent (different test/different commit each time), and the root causes above explain why they are load/timing/PRNG sensitive.

Changes are test-only, zero production-code risk.

Attention

  • The delay(200/300) drain periods in the subscription-replay fixes are empirical. If they still flake on very slow runners they could be bumped further, but the cross-model review ran the suite and confirmed they pass reliably.
  • The mTLS rejectUnauthorized: false change means the client no longer validates the server cert in these tests. This is appropriate for an internal test scenario but worth noting if these tests are ever repurposed.

🤖 Generated with Claude Code — reviewed by agy (Gemini)

…lay, MQTT, and multi-threaded suites

- subscriptionReplay race-condition tests: the collect() quiet-period timer could expire
  before in-flight writes committed, silently dropping events. Switch to attach-listener-
  first then await-writes then fixed-drain pattern so all committed events are captured.

- MQTT mTLS tests: authorised client connect was missing rejectUnauthorized:false, causing
  intermittent "self-signed certificate in certificate chain" failures on loaded runners.
  The tests exercise server-side mTLS (server rejects clients without a cert), not
  client-side server-cert validation. Add explicit per-test timeout (20 s) and internal
  15 s reject path to the QoS=1 durable-session reconnect test.

- multi-threaded cache test: assertions targeted specific IDs whose write counts were
  sensitive to the seeded PRNG's non-uniform distribution. Aggregate across all IDs 20-29
  so no single ID's count can cause a false failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kriszyp kriszyp requested a review from cb1kenobi May 20, 2026 23:32
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 20, 2026

Reviewed; no blockers found.

@kriszyp kriszyp marked this pull request as ready for review May 20, 2026 23:46
@kriszyp kriszyp requested a review from a team as a code owner May 20, 2026 23:46
@kriszyp kriszyp requested review from Ethan-Arrowood and removed request for a team May 22, 2026 20:05
Replace shared-resolver pattern with ws.once() so each message handler
is self-contained; also close the msgpack WebSocket after its test to
avoid leaking connections.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kriszyp kriszyp merged commit 2d597bb into main May 23, 2026
36 of 37 checks passed
@kriszyp kriszyp deleted the fix/flaky-unit-tests branch May 23, 2026 03:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants