feat(cluster): HA cluster with Raft consensus and TCP replication (v2.5.0)#218

Merged
caikpigosso merged 6 commits into main from feat/ha-cluster-v2.5.0 on Mar 21, 2026

@caikpigosso
Collaborator

Summary

Production-grade High Availability cluster implementation for Vectorizer v2.5.0.

  • Raft Consensus (openraft 0.10) for metadata operations and leader election
  • TCP Replication for streaming vector data from leader to followers
  • Write-Redirect Middleware: followers return HTTP 307, redirecting writes to the leader
  • Dashboard Cluster Page: real-time visualization of nodes, roles, and replication status
  • Docker Compose HA: ready-to-use 3-node cluster (docker-compose.ha.yml)
  • Kubernetes Support: Helm headless service, DNS discovery, shared JWT auth
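
To illustrate the DNS discovery item: a Kubernetes headless service resolves to one address record per ready pod, so a node can find its peers with an ordinary hostname lookup. The sketch below uses only the Rust standard library; the function name `discover_peers` and any service hostname passed to it are hypothetical and are not taken from this PR.

```rust
use std::io;
use std::net::{SocketAddr, ToSocketAddrs};

/// Hypothetical helper: resolve a service hostname to the socket addresses
/// of all peers behind it. With a headless service, each returned address
/// corresponds to one pod.
pub fn discover_peers(service_host: &str, port: u16) -> io::Result<Vec<SocketAddr>> {
    // `(host, port)` implements `ToSocketAddrs`; an IP literal is parsed
    // directly, anything else goes through the system resolver.
    let mut peers: Vec<SocketAddr> = (service_host, port).to_socket_addrs()?.collect();
    // Sort and deduplicate so every node sees the peer list in the same order.
    peers.sort();
    peers.dedup();
    Ok(peers)
}
```

A caller would typically re-run the lookup periodically, since the set of pods behind the service changes as nodes join or leave.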

Architecture

              ┌───────────────┐
              │ Load Balancer │
              └───────┬───────┘
                      │
        ┌─────────────┼─────────────┐
        │             │             │
   ┌────▼─────┐  ┌────▼─────┐  ┌────▼─────┐
   │  Leader  │  │ Follower │  │ Follower │
   │ (write)  │  │  (read)  │  │  (read)  │
   └────┬─────┘  └────▲─────┘  └────▲─────┘
        └─────────────┴─────────────┘
               TCP Replication

What's new

| Feature | Description |
| --- | --- |
| Raft consensus | Leader election, metadata consensus via openraft |
| TCP replication | Full/partial sync with WAL-backed durability |
| Write concern | Optional synchronous replication (WAIT command) |
| HTTP 307 redirect | Followers redirect writes to leader |
| Epoch conflict resolution | Higher epoch wins shard assignment conflicts |
| Shard data migration | Actual data transfer during rebalance |
| Collection quorum | Majority-based collection creation |
| DNS discovery | K8s headless service auto-discovery |
| Dashboard | Cluster status page with real-time refresh |
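
As a small worked example of the "Collection quorum" row: a majority of an N-node cluster is N/2 + 1 nodes (integer division), so a 3-node cluster needs 2 acknowledgements before a collection counts as created. The function names below are hypothetical, and how this PR actually counts acknowledgements is not shown here.

```rust
/// Smallest number of nodes that forms a strict majority of the cluster.
pub fn majority(cluster_size: usize) -> usize {
    cluster_size / 2 + 1
}

/// True when enough nodes have acknowledged the operation.
/// An empty cluster never reaches quorum (an assumption of this sketch).
pub fn quorum_reached(acks: usize, cluster_size: usize) -> bool {
    cluster_size > 0 && acks >= majority(cluster_size)
}
```

With 3 nodes the cluster tolerates one unavailable node; with 4 nodes the majority is 3, so it still tolerates only one.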

Test plan

  • 1066 unit/integration tests passing
  • Docker HA cluster tested (3 nodes)
  • Failover simulation: leader dies → replicas survive → leader recovers
  • Write redirect confirmed: replicas return 307
  • Replication confirmed: data syncs across all nodes
  • cargo fmt, clippy, codespell clean
  • No Portuguese text in code or docs

🤖 Generated with Claude Code

caikpigosso and others added 6 commits March 20, 2026 23:48
…cation (v2.5.0)

Production-grade high availability for Kubernetes and Docker deployments.

Raft Consensus (openraft 0.10):
- RaftManager with StateMachine, LogStore, and gRPC-backed Network
- ClusterCommand for metadata consensus (collections, shards, membership)
- Leader election with configurable timeout (1-3s)
- Raft RPCs in proto/cluster.proto (Vote, AppendEntries, Snapshot)
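
To make the metadata-consensus idea concrete, here is a minimal sketch of a replicated state machine: every node applies the same committed commands in the same order and therefore ends up with the same metadata. Only the name `ClusterCommand` comes from this commit; the variants, fields, and the `ClusterMetadata` type are assumptions for illustration, and the sketch does not use the openraft API.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical metadata commands carried in the Raft log.
#[derive(Debug, Clone, PartialEq)]
pub enum ClusterCommand {
    CreateCollection { name: String },
    DeleteCollection { name: String },
    AssignShard { shard_id: u32, node_id: u64 },
    AddNode { node_id: u64 },
    RemoveNode { node_id: u64 },
}

/// Metadata that each node rebuilds from the committed log.
#[derive(Debug, Default, PartialEq)]
pub struct ClusterMetadata {
    pub collections: BTreeSet<String>,
    pub shard_owners: BTreeMap<u32, u64>,
    pub members: BTreeSet<u64>,
}

impl ClusterMetadata {
    /// Apply one committed command. The function is deterministic, so
    /// identical logs always produce identical state.
    pub fn apply(&mut self, cmd: &ClusterCommand) {
        match cmd {
            ClusterCommand::CreateCollection { name } => {
                self.collections.insert(name.clone());
            }
            ClusterCommand::DeleteCollection { name } => {
                self.collections.remove(name);
            }
            ClusterCommand::AssignShard { shard_id, node_id } => {
                self.shard_owners.insert(*shard_id, *node_id);
            }
            ClusterCommand::AddNode { node_id } => {
                self.members.insert(*node_id);
            }
            ClusterCommand::RemoveNode { node_id } => {
                self.members.remove(node_id);
                // Drop shard assignments that pointed at the removed node.
                self.shard_owners.retain(|_, owner| *owner != *node_id);
            }
        }
    }
}
```

Because `apply` has no side effects beyond the in-memory state, replaying the log after a restart or a snapshot install is enough to recover the metadata.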

TCP Replication:
- Master-replica streaming with full/partial sync and auto-reconnect
- DurableReplicationLog with WAL-backed persistence
- WriteConcern (WAIT command) for synchronous replication
- Replica ACK processing with offset tracking
- Replication config parsed from YAML with DNS hostname resolution
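
The write-concern and ACK-tracking items can be sketched as follows: the leader remembers the highest offset each replica has acknowledged, and a synchronous write is considered done once enough replicas have reached the write's offset. The `AckTracker` type and its methods are hypothetical names for illustration; the real `WriteConcern` implementation is not shown in this PR description.

```rust
use std::collections::HashMap;

/// Hypothetical tracker for the highest offset acknowledged by each replica.
#[derive(Debug, Default)]
pub struct AckTracker {
    acked: HashMap<String, u64>,
}

impl AckTracker {
    /// Record an ACK from a replica. Stale or reordered ACKs are ignored,
    /// so a replica's offset only ever moves forward.
    pub fn record_ack(&mut self, replica_id: &str, offset: u64) {
        let entry = self.acked.entry(replica_id.to_string()).or_insert(0);
        *entry = (*entry).max(offset);
    }

    /// Number of replicas that have acknowledged `offset` or later.
    pub fn replicas_at_or_past(&self, offset: u64) -> usize {
        self.acked.values().filter(|&&acked| acked >= offset).count()
    }

    /// True once at least `required_acks` replicas have caught up to the write.
    pub fn is_satisfied(&self, write_offset: u64, required_acks: usize) -> bool {
        self.replicas_at_or_past(write_offset) >= required_acks
    }
}
```

A WAIT-style command would poll or await `is_satisfied` until it returns true or a timeout expires; the timeout handling is omitted here.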

HA Lifecycle:
- HaManager handles Leader/Follower transitions dynamically
- LeaderRouter tracks current leader for request routing
- Write-redirect middleware: followers return HTTP 307 to leader
- Reads served locally on any node
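
The routing rule described above (reads served locally, follower writes redirected with HTTP 307) can be sketched as a pure decision function. Status 307 is used because it requires the client to repeat the request with the same method and body. The type names, the choice of which HTTP methods count as writes, and the `Unavailable` case for an unknown leader are assumptions of this sketch, not details confirmed by the PR.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Role {
    Leader,
    Follower,
}

#[derive(Debug, PartialEq)]
pub enum Routing {
    /// Handle the request on this node.
    ServeLocally,
    /// Tell the client to retry against the leader.
    Redirect { status: u16, location: String },
    /// No leader is currently known (for example, during an election).
    Unavailable,
}

/// Assumption: mutating HTTP methods are treated as writes.
pub fn is_write(method: &str) -> bool {
    matches!(
        method.to_ascii_uppercase().as_str(),
        "POST" | "PUT" | "PATCH" | "DELETE"
    )
}

/// Decide how a node with the given role should handle a request.
pub fn route(role: Role, method: &str, path_and_query: &str, leader_url: Option<&str>) -> Routing {
    if role == Role::Leader || !is_write(method) {
        return Routing::ServeLocally;
    }
    match leader_url {
        Some(base) => Routing::Redirect {
            status: 307,
            // Avoid a double slash when the base URL ends with '/'.
            location: format!("{}{}", base.trim_end_matches('/'), path_and_query),
        },
        None => Routing::Unavailable,
    }
}
```

Keeping the decision separate from the HTTP framework makes it easy to unit-test; a real middleware would translate `Routing` into a response with a `Location` header.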

Cluster Resilience:
- Epoch-based conflict resolution for shard assignments
- ShardMigrator for data transfer during rebalance
- CollectionSynchronizer with quorum-based creation and background repair
- DNS discovery for Kubernetes headless services
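
Epoch-based conflict resolution can be illustrated with a small comparison function: when two nodes disagree about who owns a shard, the assignment with the higher epoch wins. The `ShardAssignment` fields are hypothetical, and the tie-break on the lower node ID is an assumption added here so that both sides reach the same answer; the PR does not say how equal epochs are handled.

```rust
use std::cmp::Ordering;

/// Hypothetical view of a shard assignment as seen by one node.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ShardAssignment {
    pub shard_id: u32,
    pub node_id: u64,
    pub epoch: u64,
}

/// Pick the winning assignment for the same shard.
/// Higher epoch wins; on equal epochs the lower node ID wins
/// (an assumption of this sketch, chosen only for determinism).
pub fn resolve(local: ShardAssignment, remote: ShardAssignment) -> ShardAssignment {
    debug_assert_eq!(local.shard_id, remote.shard_id);
    match remote.epoch.cmp(&local.epoch) {
        Ordering::Greater => remote,
        Ordering::Less => local,
        Ordering::Equal => {
            if remote.node_id < local.node_id {
                remote
            } else {
                local
            }
        }
    }
}
```

The result does not depend on argument order, so two nodes exchanging their views converge on the same owner.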

Dashboard:
- New ClusterPage with nodes table, leader info, replication status
- Auto-refresh every 5 seconds with sidebar navigation

Infrastructure:
- docker-compose.ha.yml for 3-node HA cluster
- Helm headless service template for K8s
- Test scripts for cluster simulation and failover

Documentation:
- Updated README with HA features
- Updated CLUSTER.md with HA configuration guide
- Kubernetes deployment instructions

Version: 2.4.2 -> 2.5.0
Tests: 1066 passed, 0 failed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- docker-compose.ha.yml: use ${VECTORIZER_ADMIN_PASSWORD} and ${VECTORIZER_JWT_SECRET} from .env
- CLUSTER.md: remove hardcoded passwords, show .env setup instructions
- CLUSTER.md: single connection URL instead of per-node ports
- .gitignore: exclude cluster config files and .env (may contain secrets)
@caikpigosso caikpigosso merged commit c8b065e into main Mar 21, 2026
7 checks passed