test(dynamodb): reproducer for Alternator seed-failover (SCT-370) by CodeLieutenant · Pull Request #36 · scylladb/YCSB

CodeLieutenant · 2026-05-20T14:48:29Z

Summary

Adds AlternatorSeedFailoverFunctionalTest, a self-contained in-process reproducer for the production failure observed in upgrade_test.test_generic_cluster_upgrade (SCT-370).

During a rolling upgrade, the YCSB stress run exited with status 1 after the Alternator seed node went down: 62,624 UPDATE operations returned CLIENT_ERROR (connection refused to 10.142.0.148:8080) even though only one cluster node was actually down. The rest of the cluster was healthy throughout, and only the YCSB client side was affected.

How the reproducer works

Boots a single HttpServer on 127.0.0.1 hosting two logical Alternator endpoints, distinguished by the request Host header (localhost = seed, 127.0.0.1 = peer). Both hostnames naturally resolve to the loopback address, so no DNS resolver tricks are needed — same approach as the existing AlternatorLoadBalancingFunctionalTest.
Warm-up drives traffic through both endpoints and asserts native load balancing is round-robining across them.
The seed is then "killed": its handler silently closes the exchange without writing a response, which the AWS SDK observes as a transport-level failure (the closest in-process equivalent of the production connection refused).
Drives 200 update operations and asserts they all succeed via the healthy peer.
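
The Host-header routing and the "silent close" kill described above can be sketched with the JDK's built-in `com.sun.net.httpserver.HttpServer`. This is a minimal illustration, not the test's actual code: the class name `HostRoutedEndpoints`, the methods `killSeed()` and `call()`, and the `X-Endpoint` response header are hypothetical, and a raw socket stands in for the AWS SDK client so that the Host header can be set explicitly.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch: two logical endpoints hosted by one loopback HttpServer. */
final class HostRoutedEndpoints implements AutoCloseable {
    private final HttpServer server;
    private final AtomicBoolean seedAlive = new AtomicBoolean(true);

    HostRoutedEndpoints() {
        HttpServer created;
        try {
            // Port 0 asks the OS for any free port on the loopback interface.
            created = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        server = created;
        server.createContext("/", this::handle);
        server.start();
    }

    int port() { return server.getAddress().getPort(); }

    /** From now on the seed endpoint drops every request without answering. */
    void killSeed() { seedAlive.set(false); }

    private void handle(HttpExchange exchange) throws IOException {
        // "localhost" in the Host header selects the seed, anything else the peer.
        String host = exchange.getRequestHeaders().getFirst("Host");
        boolean seed = host != null && host.startsWith("localhost");
        if (seed && !seedAlive.get()) {
            exchange.close(); // no status line, no body: a transport-level failure
            return;
        }
        exchange.getResponseHeaders().set("X-Endpoint", seed ? "seed" : "peer");
        exchange.sendResponseHeaders(200, -1); // -1 means no response body
        exchange.close();
    }

    /** Sends one GET with the given Host header; returns the endpoint name, or null on failure. */
    String call(String hostHeader) {
        try (Socket socket = new Socket("127.0.0.1", port())) {
            socket.setSoTimeout(5_000);
            String request = "GET / HTTP/1.1\r\nHost: " + hostHeader + ":" + port()
                    + "\r\nConnection: close\r\n\r\n";
            OutputStream out = socket.getOutputStream();
            out.write(request.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
            String line;
            while ((line = in.readLine()) != null && !line.isEmpty()) {
                // The server may change header-name casing, so match case-insensitively.
                if (line.regionMatches(true, 0, "X-Endpoint:", 0, 11)) {
                    return line.substring(11).trim();
                }
            }
            return null; // connection closed before any response header arrived
        } catch (IOException e) {
            return null; // for example a connection reset
        }
    }

    @Override
    public void close() { server.stop(0); }

    public static void main(String[] args) {
        try (HostRoutedEndpoints endpoints = new HostRoutedEndpoints()) {
            System.out.println("seed before kill: " + endpoints.call("localhost"));
            endpoints.killSeed();
            System.out.println("seed after kill:  " + endpoints.call("localhost"));
            System.out.println("peer after kill:  " + endpoints.call("127.0.0.1"));
        }
    }
}
```

Closing the exchange before any response is written makes the server drop the connection, which the caller sees as an abrupt end of stream or a reset rather than an HTTP error status. That is the property the reproducer relies on to imitate a node that is down.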

Root cause (in the load-balancing library, not YCSB)

The /localnodes poller contacted only a single host per refresh, so a refresh that picked the dead seed gave up immediately even though the peer could have answered.
mergeWithInitialNodes() re-injected the original seeds into the live-node list on every refresh, so the dead seed was permanently kept in nextAsURI()'s round-robin rotation.
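
To make the interaction of the two defects concrete, here is a toy model of a round-robin node list. It is not the library's code: `SeedFailoverModel`, `refreshBuggy`, `refreshFixed`, and the refresh interval of 10 requests are assumptions made for illustration, and the "fixed" variant only reflects the behaviour implied by the description above (try other hosts, do not re-add seeds), not necessarily what scylladb/alternator-client-java#89 implements.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model (not the library's code) of seed re-injection in a round-robin node list. */
final class SeedFailoverModel {
    static final String SEED = "seed";
    static final String PEER = "peer";

    /** Simulated /localnodes call: a dead host cannot answer; a live host reports live nodes only. */
    static List<String> localNodes(String contacted, Set<String> dead) {
        if (dead.contains(contacted)) {
            return null;
        }
        List<String> live = new ArrayList<>(List.of(SEED, PEER));
        live.removeAll(dead);
        return live;
    }

    /** Behaviour as described in the root cause: ask one host only, then merge the seeds back in. */
    static List<String> refreshBuggy(List<String> current, List<String> seeds, Set<String> dead, int tick) {
        String contacted = current.get(tick % current.size());
        List<String> answer = localNodes(contacted, dead);
        if (answer == null) {
            return current; // gave up without trying another host
        }
        Set<String> merged = new LinkedHashSet<>(seeds); // seeds are re-injected on every refresh
        merged.addAll(answer);
        return new ArrayList<>(merged);
    }

    /** Assumed fix: try every known host in turn and trust the first answer as-is. */
    static List<String> refreshFixed(List<String> current, Set<String> dead) {
        for (String host : current) {
            List<String> answer = localNodes(host, dead);
            if (answer != null && !answer.isEmpty()) {
                return answer;
            }
        }
        return current; // nobody answered, keep the old list
    }

    /** Dispatches requests round-robin, refreshing every 10 requests; counts requests sent to dead nodes. */
    static int failedRequests(boolean fixed, int requests) {
        List<String> seeds = List.of(SEED);
        Set<String> dead = Set.of(SEED); // the seed is already down when the run starts
        List<String> nodes = new ArrayList<>(List.of(SEED, PEER));
        int failures = 0;
        for (int i = 0; i < requests; i++) {
            if (i % 10 == 0) {
                nodes = fixed ? refreshFixed(nodes, dead) : refreshBuggy(nodes, seeds, dead, i / 10);
            }
            if (dead.contains(nodes.get(i % nodes.size()))) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("buggy: " + failedRequests(false, 200) + " failures");
        System.out.println("fixed: " + failedRequests(true, 200) + " failures");
    }
}
```

In this model the buggy refresh never removes the seed: a refresh that contacts the seed keeps the old list, and a refresh that contacts the peer gets the seed merged back in. As a result, 100 of 200 requests go to the dead node, while the fixed variant sends none there. The real library's timing and node selection differ, so the numbers only illustrate the mechanism.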

See:

Library issue: scylladb/alternator-client-java#88
Library fix: scylladb/alternator-client-java#89

Status

| State | Result |
| --- | --- |
| `com.scylladb.alternator:load-balancing:2.0.4` (current dependency) | 200 updates → ~100 succeed, ~100 fail with `Connection reset` (reproduces SCT-370). |
| With the fix from scylladb/alternator-client-java#89 applied locally | 200 / 200 succeed, 0 requests dispatched to the dead seed. |

The dynamodb module sets <skipTests>true</skipTests> by default, so this test does not run in normal CI builds and the failing-against-2.0.4 state does not break the master pipeline. The intent is for the test to start passing automatically once the dependency is bumped to a release that contains the library fix.

Test plan

`mvn -pl dynamodb test -Dtest=AlternatorSeedFailoverFunctionalTest -DskipTests=false` against patched 2.0.5-SNAPSHOT → passes (200/200).
Same command against 2.0.4 → fails as expected, reproducing the production failure signature.
No changes to dynamodb/pom.xml — the test exists alongside the unfixed dependency without affecting CI.

🤖 Generated with Claude Code

Adds AlternatorSeedFailoverFunctionalTest, a self-contained reproducer for the production failure observed in upgrade_test.test_generic_cluster_upgrade (SCT-370) where the YCSB stress run exited with status 1 after the Alternator seed node went down during a rolling upgrade. 62,624 UPDATEs returned CLIENT_ERROR (connection refused) even though only one node was actually down.

The test hosts two logical Alternator endpoints (distinguished by Host header) on a single in-process HttpServer. After warm-up confirms both endpoints have received traffic via native load balancing, the "seed" endpoint is killed: its handler silently closes the exchange without responding, which the AWS SDK observes as a transport-level failure (the closest in-process equivalent of the production "connection refused to 10.142.0.148:8080"). The test then drives 200 update operations and asserts they all succeed via the healthy peer.

Root cause is in com.scylladb.alternator:load-balancing, not in YCSB: the load balancer kept round-robining onto the dead seed because the /localnodes poller never tried alternative hosts after the seed failed, and mergeWithInitialNodes() re-injected the seed into the live list on every refresh.

See:
- Reproducer for: https://scylladb.atlassian.net/browse/SCT-370
- Underlying library bug: scylladb/alternator-client-java#88
- Library fix: scylladb/alternator-client-java#89

With the unreleased fix applied locally the test passes 200/200; against the released 2.0.4 library it fails as expected (~50% of post-kill updates dispatched to the dead seed). The dynamodb module already has <skipTests>true</skipTests>, so CI is unaffected.

CodeLieutenant mentioned this pull request May 25, 2026

fix(alternator): switch default to DNS routing instead of native load balancing scylladb/scylla-cluster-tests#14758 (Merged, 2 tasks)

CodeLieutenant wants to merge 1 commit into master from test/alternator-seed-failover
