Verify replica replication catch-up during rolling updates #25

Merged
kurtmc merged 1 commit into main from feature/fix-rolling-update on Feb 22, 2026

@kurtmc kurtmc commented Feb 22, 2026

Add ReplicationInfo struct and ParseInfoReplication() function to parse
INFO REPLICATION output from Valkey nodes, extracting role,
master_link_status, master_sync_in_progress, master_repl_offset, and
slave_repl_offset fields. Handles the valkey-go client's "txt:" prefix
and \r\n line endings.
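
The parsing described above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the Go field names and types of `ReplicationInfo`, the error wrapping, and the choice to skip unknown keys are all assumptions. Only the function name, the five field keys, the "txt:" prefix, and the \r\n handling come from the description.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ReplicationInfo mirrors the fields this PR extracts from INFO REPLICATION.
// The Go field names and types are assumptions, not the merged definition.
type ReplicationInfo struct {
	Role                 string
	MasterLinkStatus     string
	MasterSyncInProgress bool
	MasterReplOffset     int64
	SlaveReplOffset      int64
}

// ParseInfoReplication reads "key:value" lines, tolerating a leading "txt:"
// prefix and \r\n line endings. Keys it does not know are ignored.
func ParseInfoReplication(raw string) (ReplicationInfo, error) {
	var info ReplicationInfo
	raw = strings.TrimPrefix(raw, "txt:")
	for _, line := range strings.Split(raw, "\n") {
		line = strings.TrimSpace(line) // also removes a trailing \r
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blanks and section headers such as "# Replication"
		}
		key, value, found := strings.Cut(line, ":")
		if !found {
			continue
		}
		var err error
		switch key {
		case "role":
			info.Role = value
		case "master_link_status":
			info.MasterLinkStatus = value
		case "master_sync_in_progress":
			info.MasterSyncInProgress = value == "1"
		case "master_repl_offset":
			info.MasterReplOffset, err = strconv.ParseInt(value, 10, 64)
		case "slave_repl_offset":
			info.SlaveReplOffset, err = strconv.ParseInt(value, 10, 64)
		}
		if err != nil {
			return info, fmt.Errorf("parse %s=%q: %w", key, value, err)
		}
	}
	return info, nil
}
```

Trimming whitespace per line handles the carriage returns without a separate pass, and stripping "txt:" once up front keeps the loop unaware of the client's framing.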

Enhance isValkeyClusterHealthy() to check that all replicas have caught
up with their masters before allowing the next pod deletion during
rolling updates. This prevents data loss when the next pod to be deleted
is the master of a shard whose replica hasn't finished syncing.

For each replica, checks:

  1. master_link_status is "up"
  2. master_sync_in_progress is 0
  3. slave_repl_offset is within 1024 bytes of master_repl_offset

If any replica fails these checks, the rolling update is deferred and
requeued after 30 seconds.

Includes 7 test cases for ParseInfoReplication covering: master node,
fully caught-up replica, replica with lag, full sync in progress, link
down without sync, txt: prefix handling, and carriage return handling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@kurtmc kurtmc merged commit 8a28bb7 into main Feb 22, 2026
4 of 5 checks passed
@kurtmc kurtmc deleted the feature/fix-rolling-update branch February 22, 2026 07:26