release: v2.5.1 — HA cluster, failover watcher, dependency upgrades#229

Merged
caikpigosso merged 2 commits into main from feat/ha-cluster-v2.5.0 on Mar 24, 2026
@caikpigosso
Collaborator

Summary

  • HA Cluster with Raft consensus (v2.5.0): Hybrid architecture combining Raft for metadata consensus and TCP streaming for vector data replication
  • Automatic failover watcher (v2.5.1): Bridges openraft elections to HaManager role transitions — automatic master/replica failover in Kubernetes
  • Dependency upgrades: bincode 1.3→2.0.1, tonic 0.12→0.14.5, prost 0.13→0.14.3, and 8 more Dependabot PRs resolved
  • Bug fixes: Snapshot spam on empty replicas, exponential backoff on reconnect, TCP connection timeout
  • CI: Debian 13 base image, GHCR publishing, docker/metadata-action v6

Test plan

  • cargo test — 1,890 tests passed, 0 failures
  • cargo clippy — 0 warnings
  • cargo fmt --check — formatted
  • cargo build --release — builds successfully
  • Deploy to K8s staging cluster and verify failover behavior
  • Verify container image builds from CI pipeline

🤖 Generated with Claude Code

caikpigosso and others added 2 commits March 22, 2026 20:57
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Features:
- Add Raft leadership watcher (src/cluster/raft_watcher.rs) that bridges
  openraft consensus elections to HaManager role transitions, enabling
  automatic master/replica failover in Kubernetes clusters
- Watcher uses zero-cost tokio::sync::watch subscription (no polling)
- Leader address resolved from Raft state machine with K8s env var support
  (HOSTNAME, VECTORIZER_SERVICE_NAME, POD_IP)
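
The two decisions described above, mapping an election result to a local role and deriving an address from Kubernetes environment variables, can be sketched roughly as follows. This is not the code in src/cluster/raft_watcher.rs: the `Role` enum, both function names, the precedence of the variables, the `host.service:port` format, and the choice to keep the current role when no leader is known are all assumptions made for illustration. In the real watcher the leader id would presumably arrive through the `tokio::sync::watch` subscription rather than as a plain argument.

```rust
// Hypothetical sketch: names, precedence, and address format are assumptions,
// not the contents of src/cluster/raft_watcher.rs.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Master,
    Replica,
}

/// Map the Raft leader id to the role this node should take.
/// `None` means no leader is known yet, so the caller keeps its current role.
pub fn role_for(current_leader: Option<u64>, self_id: u64) -> Option<Role> {
    current_leader.map(|leader| {
        if leader == self_id {
            Role::Master
        } else {
            Role::Replica
        }
    })
}

/// Build an address from Kubernetes-style variables.
/// Assumed precedence: HOSTNAME + VECTORIZER_SERVICE_NAME (stable DNS name
/// behind a headless service), then POD_IP, then nothing.
/// `lookup` is injected so the logic can be exercised without touching the
/// process environment.
pub fn advertised_addr<F>(lookup: F, port: u16) -> Option<String>
where
    F: Fn(&str) -> Option<String>,
{
    // Treat empty values the same as unset ones.
    let non_empty = |key: &str| lookup(key).filter(|v| !v.is_empty());
    match (non_empty("HOSTNAME"), non_empty("VECTORIZER_SERVICE_NAME")) {
        (Some(host), Some(svc)) => Some(format!("{host}.{svc}:{port}")),
        _ => non_empty("POD_IP").map(|ip| format!("{ip}:{port}")),
    }
}

/// Convenience wrapper that reads the real process environment.
pub fn advertised_addr_from_env(port: u16) -> Option<String> {
    advertised_addr(|key| std::env::var(key).ok(), port)
}
```

Passing the lookup as a closure keeps the resolution logic a pure function, which makes the fallback order easy to unit-test without mutating global environment state.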

Bug fixes:
- Fix snapshot spam on empty replicas: early-exit when .vecdb doesn't exist,
  downgrade "no data" errors to debug level in auto_save
- Add exponential backoff (5s→60s cap) to replica reconnect, reducing log
  noise from ~12/min to ~1/min when master is unreachable
- Add 5s TCP connection timeout to replica, replacing OS-level timeout
  (30-120s) for faster failure detection in Kubernetes
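
As a rough illustration of the reconnect changes above, the sketch below computes a capped backoff delay and opens a TCP connection with a bounded timeout. Only the 5s starting delay, the 60s cap, and the 5s connect timeout come from the commit message. The doubling factor, the function names, and the use of blocking `std::net` (the project may well use an async equivalent) are assumptions.

```rust
use std::io;
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

// Values taken from the commit message.
const INITIAL_DELAY: Duration = Duration::from_secs(5);
const MAX_DELAY: Duration = Duration::from_secs(60);
const CONNECT_TIMEOUT: Duration = Duration::from_secs(5);

/// Delay before reconnect attempt `attempt` (0-based).
/// Assumes the delay doubles each time: 5s, 10s, 20s, 40s, then 60s forever.
pub fn reconnect_delay(attempt: u32) -> Duration {
    // 2^attempt, clamped so a very large attempt count cannot overflow.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    INITIAL_DELAY.saturating_mul(factor).min(MAX_DELAY)
}

/// Connect to the master with an explicit timeout instead of waiting for the
/// operating system's default, which can be much longer.
pub fn connect_to_master(addr: &SocketAddr) -> io::Result<TcpStream> {
    TcpStream::connect_timeout(addr, CONNECT_TIMEOUT)
}
```

Once the cap is reached, an unreachable master produces about one attempt per minute, compared with about twelve per minute at a fixed 5s interval, which matches the log-noise figures quoted above.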

Dependency upgrades (all Dependabot PRs resolved):
- bincode 1.3→2.0.1 with codec wrapper module for wire-format compatibility
- tonic 0.12→0.14.5, prost 0.13→0.14.3 (coordinated upgrade, eliminates
  duplicate crate versions), tonic-prost/tonic-prost-build added
- uuid 1.19→1.22, nix 0.30→0.31, hyper-util 0.1.19→0.1.20
- tar, openssl, futures, tempfile, rustls (patch bumps via cargo update)
- docker/metadata-action v5→v6 in CI workflow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@caikpigosso caikpigosso merged commit 08cdc44 into main Mar 24, 2026
16 of 18 checks passed