This document describes the Raft consensus implementation in DQueue.
We implement the Raft consensus algorithm from scratch to provide:
- Leader election: Automatic election when leader fails
- Log replication: Consistent state across all nodes
- Safety: No data loss while a majority of nodes are alive
Our implementation follows the Raft paper with practical simplifications for a real-world task queue.
- Randomized election timeouts (150-300ms default)
- RequestVote RPC with log comparison
- Majority vote required to become leader
- Single vote per term (persisted before responding)
- AppendEntries RPC for replication and heartbeats
- Heartbeat interval: 50ms default
- Log matching property enforced
- Conflict detection and resolution
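The log matching check and conflict hint can be sketched like this. It is an illustration, not the actual DQueue code: field and function names are assumed, and it uses the common convention of a 1-based log with a dummy entry at index 0:

```go
package main

import "fmt"

type LogEntry struct {
	Term    uint64
	Command string
}

// checkPrevEntry sketches the follower-side log matching check: accept new
// entries only if our log contains prevLogIndex with prevLogTerm. On a term
// mismatch, conflictIndex points at the first entry of the conflicting term
// so the leader can back up past the whole term in one round trip.
func checkPrevEntry(log []LogEntry, prevLogIndex, prevLogTerm uint64) (ok bool, conflictIndex uint64) {
	if prevLogIndex >= uint64(len(log)) {
		return false, uint64(len(log)) // our log is too short: retry from our end
	}
	if log[prevLogIndex].Term != prevLogTerm {
		first := prevLogIndex
		for first > 1 && log[first-1].Term == log[prevLogIndex].Term {
			first--
		}
		return false, first
	}
	return true, 0
}

func main() {
	log := []LogEntry{{0, ""}, {1, "a"}, {2, "b"}, {2, "c"}}
	fmt.Println(checkPrevEntry(log, 3, 2)) // true 0: prevLogTerm matches
	fmt.Println(checkPrevEntry(log, 3, 3)) // false 2: term 2 starts at index 2
}
```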
- Only commit entries from current term
- Leader Completeness: elected leaders have all committed entries
- State Machine Safety: same index → same command
Persisted before responding to any RPC:
- currentTerm
- votedFor
- Log entries
- Snapshots of state machine state
- InstallSnapshot RPC for slow followers
- Log entries before snapshot index deleted
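Truncating the log after a snapshot can be sketched as below. This is an assumed illustration (names and the explicit `Index` field are ours); the key point is that the snapshot's last index and term must be retained separately so AppendEntries consistency checks still work at the boundary:

```go
package main

import "fmt"

type LogEntry struct {
	Index   uint64
	Term    uint64
	Command string
}

// compactLog drops entries at or below snapshotIndex: their effects are
// already captured by the snapshot, so only later entries need to be kept.
func compactLog(log []LogEntry, snapshotIndex uint64) []LogEntry {
	kept := []LogEntry{}
	for _, e := range log {
		if e.Index > snapshotIndex {
			kept = append(kept, e)
		}
	}
	return kept
}

func main() {
	log := []LogEntry{{1, 1, "a"}, {2, 1, "b"}, {3, 2, "c"}}
	fmt.Println(len(compactLog(log, 2))) // 1: only the entry at index 3 survives
}
```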
The following Raft extensions are not implemented in this version:
Membership changes use the "single-server" approach: add/remove one node at a time with a configuration change entry. Joint consensus for atomic multi-node changes is not implemented.
Why: Adds significant complexity. Single-server changes are sufficient for most use cases and simpler to reason about.
The pre-vote extension prevents disruption from partitioned nodes. We don't implement it.
Why: The standard Raft algorithm handles partitions correctly. Pre-vote is an optimization that adds complexity without changing correctness.
All reads currently go to the leader. We don't implement linearizable reads via read-index or lease-based reads.
Why: For a task queue, reading task status doesn't require linearizability. If strict read consistency is needed, a quorum read can be implemented.
We don't batch multiple client requests into single log entries or pipeline AppendEntries.
Why: Simplicity. Can be added as an optimization if throughput becomes a bottleneck.
Non-voting members that receive log entries but don't participate in elections.
Why: Not needed for our cluster sizes. Useful for large clusters or geo-replication.
type RaftNode struct {
// Persistent state (saved to disk before responding to RPCs)
currentTerm uint64
votedFor string
log []LogEntry
// Volatile state
commitIndex uint64
lastApplied uint64
state State // FOLLOWER, CANDIDATE, LEADER
// Leader-only volatile state
nextIndex map[string]uint64 // Next log index to send to each peer
matchIndex map[string]uint64 // Highest known replicated index per peer
}

// Invoked by candidates to gather votes
Request:
term uint64 // Candidate's term
candidateId string // Candidate requesting vote
lastLogIndex uint64 // Index of candidate's last log entry
lastLogTerm uint64 // Term of candidate's last log entry
Response:
term uint64 // currentTerm, for candidate to update itself
voteGranted bool // True if candidate received vote

Grant vote if:
- Request term ≥ currentTerm
- Haven't voted this term, OR voted for this candidate
- Candidate's log is at least as up-to-date as ours
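The last condition, "at least as up-to-date", can be sketched as the comparison from the Raft paper. This is a minimal sketch with assumed parameter names, not the actual implementation:

```go
package main

import "fmt"

// logUpToDate implements the Raft up-to-date rule: the candidate's log wins
// on a higher last term, or on an equal last term with a log at least as long.
func logUpToDate(candLastIndex, candLastTerm, ourLastIndex, ourLastTerm uint64) bool {
	if candLastTerm != ourLastTerm {
		return candLastTerm > ourLastTerm
	}
	return candLastIndex >= ourLastIndex
}

func main() {
	fmt.Println(logUpToDate(5, 3, 9, 2)) // true: higher last term wins despite shorter log
	fmt.Println(logUpToDate(5, 3, 6, 3)) // false: same last term, shorter log
}
```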
// Invoked by leader for log replication and heartbeats
Request:
term uint64 // Leader's term
leaderId string // So followers can redirect clients
prevLogIndex uint64 // Index of log entry before new ones
prevLogTerm uint64 // Term of prevLogIndex entry
entries []LogEntry // Log entries to store (empty for heartbeat)
leaderCommit uint64 // Leader's commitIndex
Response:
term uint64 // currentTerm, for leader to update itself
success bool // True if contained matching prevLogIndex/Term
conflictIndex uint64 // Optimization: hint for faster backtracking

Leader advances commitIndex when:
- Entry is replicated to majority of nodes
- Entry is from current term (Raft safety requirement)
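Both commit rules can be sketched together. This is an assumed illustration (names are ours), using the standard trick of taking the median of `matchIndex` to find the highest majority-replicated index; it assumes `matchIndex` includes an entry for the leader's own last log index:

```go
package main

import (
	"fmt"
	"sort"
)

// advanceCommit returns the new commitIndex: the highest index replicated on
// a majority, but only if the entry at that index is from the current term
// (the Raft safety requirement). logTerm reports the term of a log entry.
func advanceCommit(matchIndex []uint64, commitIndex, currentTerm uint64,
	logTerm func(uint64) uint64) uint64 {
	sorted := append([]uint64(nil), matchIndex...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	// After a descending sort, the middle element is the highest index
	// that a majority of nodes have replicated.
	majority := sorted[len(sorted)/2]
	if majority > commitIndex && logTerm(majority) == currentTerm {
		return majority
	}
	return commitIndex
}

func main() {
	terms := map[uint64]uint64{5: 2, 7: 2, 9: 2}
	logTerm := func(i uint64) uint64 { return terms[i] }
	// Indices 9, 7, 5 are on three of five nodes, so 5 is majority-replicated.
	fmt.Println(advanceCommit([]uint64{9, 7, 5, 3, 2}, 4, 2, logTerm)) // 5
}
```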
// Background goroutine that applies committed entries
for {
    for lastApplied < commitIndex {
        entry := log[lastApplied+1]
        stateMachine.Apply(entry.command)
        lastApplied++
    }
    applyCond.Wait() // block until commitIndex advances (avoids busy-spinning)
}

Tests cover:
- Election with various failure scenarios
- Log replication and conflict resolution
- Snapshot installation
- Multi-node cluster operations
- Leader failure and re-election
- Network partition scenarios
The scripts/chaos-test.sh script:
- Continuously enqueues tasks
- Kills the leader every N seconds
- Verifies no data loss
Results from chaos testing:
- Leader failover: ~2-3 seconds
- Zero data loss for committed entries
- At most transient failures during leader transitions
- In Search of an Understandable Consensus Algorithm - Original Raft paper
- Raft Visualization - Interactive visualization
- Students' Guide to Raft - Common pitfalls