test(dynamodb): reproducer for Alternator seed-failover (SCT-370) #36

Open
CodeLieutenant wants to merge 1 commit into master from test/alternator-seed-failover

CodeLieutenant commented May 20, 2026

Summary

Adds AlternatorSeedFailoverFunctionalTest, a self-contained in-process reproducer for the production failure observed in upgrade_test.test_generic_cluster_upgrade (SCT-370).

During a rolling upgrade, the YCSB stress run exited with status 1 after the Alternator seed node went down: 62,624 UPDATE operations failed with CLIENT_ERROR (connection refused to 10.142.0.148:8080) even though only one cluster node was actually down. The rest of the cluster stayed healthy throughout; only the YCSB client side was affected.

How the reproducer works

  • Boots a single HttpServer on 127.0.0.1 hosting two logical Alternator endpoints, distinguished by the request Host header (localhost = seed, 127.0.0.1 = peer). Both hostnames resolve to the loopback address, so no DNS resolver tricks are needed. This is the same approach as the existing AlternatorLoadBalancingFunctionalTest.
  • Warm-up drives traffic through both endpoints and asserts native load balancing is round-robining across them.
  • The seed is then "killed": its handler silently closes the exchange without writing a response, which the AWS SDK observes as a transport-level failure (the closest in-process equivalent of the production connection refused).
  • Drives 200 update operations and asserts they all succeed via the healthy peer.
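
The Host-header routing and the "kill" switch described above can be sketched roughly as follows. This is a minimal illustration built on the JDK's `com.sun.net.httpserver` API, not the code from this PR: the class name `HostRoutedEndpoints`, the `classify` helper, the counters, and the empty `{}` response body are all assumptions made for the example.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: one loopback server, two logical endpoints chosen by Host header.
class HostRoutedEndpoints {
    enum Endpoint { SEED, PEER }

    final AtomicBoolean seedAlive = new AtomicBoolean(true);
    final AtomicInteger seedHits = new AtomicInteger();
    final AtomicInteger peerHits = new AtomicInteger();

    // Maps a Host header such as "localhost:8080" or "127.0.0.1:8080" to an endpoint.
    // Assumption for this sketch: "localhost" is the seed, anything else is the peer.
    static Endpoint classify(String hostHeader) {
        String host = hostHeader == null ? "" : hostHeader;
        int colon = host.lastIndexOf(':');
        if (colon >= 0) {
            host = host.substring(0, colon); // strip the port
        }
        return host.equalsIgnoreCase("localhost") ? Endpoint.SEED : Endpoint.PEER;
    }

    void handle(HttpExchange exchange) throws IOException {
        Endpoint target = classify(exchange.getRequestHeaders().getFirst("Host"));
        if (target == Endpoint.SEED) {
            seedHits.incrementAndGet();
            if (!seedAlive.get()) {
                // "Killed" seed: close without writing any response, so the
                // client sees a transport-level failure instead of an HTTP status.
                exchange.close();
                return;
            }
        } else {
            peerHits.incrementAndGet();
        }
        byte[] body = "{}".getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(200, body.length);
        exchange.getResponseBody().write(body);
        exchange.close();
    }

    // Binds to 127.0.0.1 on an ephemeral port; the caller stops the server when done.
    HttpServer start() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", this::handle);
        server.start();
        return server;
    }
}
```

In a test built this way, warm-up would check that both `seedHits` and `peerHits` are non-zero, then call `seedAlive.set(false)` and compare the counters again after the 200 updates.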

Root cause (in the load-balancing library, not YCSB)

  • The /localnodes poller only contacts a single host per refresh; a refresh that picked the dead seed gave up immediately even though the peer could have answered.
  • mergeWithInitialNodes() re-injected the original seeds into the live-node list on every refresh, so the dead seed was permanently kept in nextAsURI()'s round-robin rotation.
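
To see why these two behaviours together produce a roughly 50% failure rate, here is a simplified model. It is a sketch under assumptions, not the library's source: `SeedFailoverModel`, `LocalNodes`, `refreshBuggy`, `refreshFixed`, and `dispatchedTo` are hypothetical names, and `refreshFixed` shows one plausible shape of a fix rather than the actual change in scylladb/alternator-client-java#89.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical model of the node-list refresh; not the library's real code.
class SeedFailoverModel {
    // Simulated /localnodes call: empty when the queried host is unreachable.
    interface LocalNodes {
        Optional<List<String>> query(String host);
    }

    // Pre-fix behaviour as described: only one host is asked per refresh,
    // and the initial seeds are merged back into whatever list results.
    static List<String> refreshBuggy(List<String> seeds, List<String> live,
                                     String picked, LocalNodes api) {
        List<String> answer = api.query(picked).orElse(live); // gives up, keeps old list
        Set<String> merged = new LinkedHashSet<>(seeds);      // seeds re-injected
        merged.addAll(answer);
        return new ArrayList<>(merged);
    }

    // One possible fix: try each known host until one answers, and trust
    // that answer without merging the seeds back in.
    static List<String> refreshFixed(List<String> live, LocalNodes api) {
        for (String host : live) {
            Optional<List<String>> answer = api.query(host);
            if (answer.isPresent()) {
                return new ArrayList<>(answer.get());
            }
        }
        return live; // nobody answered; keep the previous list
    }

    // Counts how many of `requests` round-robin dispatches land on `deadHost`.
    static int dispatchedTo(String deadHost, List<String> nodes, int requests) {
        int hits = 0;
        for (int i = 0; i < requests; i++) {
            if (nodes.get(i % nodes.size()).equals(deadHost)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> seeds = List.of("seed");
        List<String> live = List.of("seed", "peer");
        // The seed is down; the peer reports only itself as alive.
        LocalNodes api = host -> host.equals("seed")
                ? Optional.<List<String>>empty()
                : Optional.of(List.of("peer"));

        List<String> buggy = refreshBuggy(seeds, live, "peer", api);
        List<String> fixed = refreshFixed(live, api);
        System.out.println("buggy: " + dispatchedTo("seed", buggy, 200)); // prints "buggy: 100"
        System.out.println("fixed: " + dispatchedTo("seed", fixed, 200)); // prints "fixed: 0"
    }
}
```

Note that in this model the buggy refresh keeps the dead seed in rotation even when the poller happens to pick the healthy peer, because the merge step undoes the peer's answer.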

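See:

  • Reproducer for: https://scylladb.atlassian.net/browse/SCT-370
  • Underlying library bug: scylladb/alternator-client-java#88
  • Library fix: scylladb/alternator-client-java#89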

Status

| State | Result |
| --- | --- |
| com.scylladb.alternator:load-balancing:2.0.4 (current dependency) | 200 updates → ~100 succeed, ~100 fail with Connection reset (reproduces SCT-370). |
| With the fix from scylladb/alternator-client-java#89 applied locally | 200 / 200 succeed, 0 requests dispatched to the dead seed. |

The dynamodb module sets <skipTests>true</skipTests> by default, so this test does not run in normal CI builds and the failing-against-2.0.4 state does not break the master pipeline. The intent is for the test to start passing automatically once the dependency is bumped to a release that contains the library fix.

Test plan

  • mvn -pl dynamodb test -Dtest=AlternatorSeedFailoverFunctionalTest -DskipTests=false against patched 2.0.5-SNAPSHOT → passes (200/200).
  • Same command against 2.0.4 → fails as expected, reproducing the production failure signature.
  • No changes to dynamodb/pom.xml: the test exists alongside the unfixed dependency without affecting CI.

🤖 Generated with Claude Code

Adds AlternatorSeedFailoverFunctionalTest, a self-contained reproducer
for the production failure observed in upgrade_test.test_generic_cluster_upgrade
(SCT-370) where the YCSB stress run exited with status 1 after the
Alternator seed node went down during a rolling upgrade. 62,624 UPDATEs
returned CLIENT_ERROR (connection refused) even though only one node
was actually down.

The test hosts two logical Alternator endpoints (distinguished by Host
header) on a single in-process HttpServer. After warm-up confirms both
endpoints have received traffic via native load balancing, the "seed"
endpoint is killed: its handler silently closes the exchange without
responding, which the AWS SDK observes as a transport-level failure
(the closest in-process equivalent of the production
"connection refused to 10.142.0.148:8080"). The test then drives 200
update operations and asserts they all succeed via the healthy peer.

Root cause is in com.scylladb.alternator:load-balancing, not in YCSB:
the load balancer kept round-robining onto the dead seed because the
/localnodes poller never tried alternative hosts after the seed failed,
and mergeWithInitialNodes() re-injected the seed into the live list on
every refresh. See:

- Reproducer for: https://scylladb.atlassian.net/browse/SCT-370
- Underlying library bug: scylladb/alternator-client-java#88
- Library fix: scylladb/alternator-client-java#89

With the unreleased fix applied locally the test passes 200/200; against
the released 2.0.4 library it fails as expected (~50% of post-kill
updates dispatched to the dead seed). The dynamodb module already has
<skipTests>true</skipTests>, so CI is unaffected.