Raft Implementation

This document describes the Raft consensus implementation in DQueue.

Overview

We implement the Raft consensus algorithm from scratch to provide:

  • Leader election: Automatic election of a new leader when the current leader fails
  • Log replication: Consistent state across all nodes
  • Safety: No loss of committed data as long as a majority of nodes is alive

Our implementation follows the Raft paper with practical simplifications for a real-world task queue.

What We Implement

Leader Election (§5.2)

  • Randomized election timeouts (150-300ms default)
  • RequestVote RPC with log comparison
  • Majority vote required to become leader
  • Single vote per term (persisted before responding)
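
As a rough illustration of the randomized timeout, the value can be drawn uniformly from the configured range each time the timer is reset. The constants and function name below are illustrative rather than the actual DQueue identifiers, and math/rand and time are assumed to be imported:

const (
    electionTimeoutMin = 150 * time.Millisecond
    electionTimeoutMax = 300 * time.Millisecond
)

// randomElectionTimeout returns a timeout drawn uniformly from
// [electionTimeoutMin, electionTimeoutMax) so that followers rarely time
// out at the same moment and split the vote.
func randomElectionTimeout() time.Duration {
    return electionTimeoutMin +
        time.Duration(rand.Int63n(int64(electionTimeoutMax-electionTimeoutMin)))
}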

Log Replication (§5.3)

  • AppendEntries RPC for replication and heartbeats
  • Heartbeat interval: 50ms default
  • Log matching property enforced
  • Conflict detection and resolution
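
A minimal sketch of the heartbeat side of this, assuming a LEADER state constant as in the RaftNode struct shown later and a hypothetical broadcastAppendEntries helper (locking is omitted for brevity):

const heartbeatInterval = 50 * time.Millisecond

// runHeartbeats sends AppendEntries to every follower on a fixed interval
// while this node is leader; an empty entries slice acts as a pure
// heartbeat and still carries the leader's commitIndex.
func (rn *RaftNode) runHeartbeats() {
    ticker := time.NewTicker(heartbeatInterval)
    defer ticker.Stop()
    for range ticker.C {
        if rn.state != LEADER {
            return // stepped down; a new leader owns the heartbeats now
        }
        rn.broadcastAppendEntries() // hypothetical helper that fans out AppendEntries RPCs
    }
}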

Safety (§5.4)

  • Only commit entries from current term
  • Leader Completeness: elected leaders have all committed entries
  • State Machine Safety: same index → same command

Persistence (§5.4)

Persisted before responding to any RPC:

  • currentTerm
  • votedFor
  • Log entries
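
A sketch of what "persist before responding" can look like in code. The gob encoding and the storage.Save call are assumptions for illustration, not the actual DQueue storage API (bytes and encoding/gob assumed imported):

// persist writes the durable Raft state to stable storage. It must complete
// before the node answers any RPC that changed currentTerm, votedFor, or
// appended to the log.
func (rn *RaftNode) persist() error {
    var buf bytes.Buffer
    enc := gob.NewEncoder(&buf)
    if err := enc.Encode(rn.currentTerm); err != nil {
        return err
    }
    if err := enc.Encode(rn.votedFor); err != nil {
        return err
    }
    if err := enc.Encode(rn.log); err != nil {
        return err
    }
    return rn.storage.Save(buf.Bytes()) // storage is a placeholder for the stable-storage layer
}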

Log Compaction (§7)

  • Snapshots of state machine state
  • InstallSnapshot RPC for slow followers
  • Log entries covered by the snapshot (up to and including its index) are deleted
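
The truncation step might look roughly like this; the firstIndex field (the absolute Raft index of rn.log[0]) is an assumed bookkeeping field, not necessarily how DQueue tracks the offset:

// compactLog discards entries up to and including snapshotIndex once a
// snapshot covering that prefix is durably stored. Entries after the
// snapshot index are kept so slightly lagging followers can still be
// caught up with ordinary AppendEntries.
func (rn *RaftNode) compactLog(snapshotIndex uint64) {
    keepFrom := snapshotIndex - rn.firstIndex + 1
    rn.log = append([]LogEntry(nil), rn.log[keepFrom:]...) // copy so the old backing array can be released
    rn.firstIndex = snapshotIndex + 1
}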

Non-Goals (Deferred)

The following Raft extensions are not implemented in this version:

Joint Consensus (§6)

Membership changes use the "single-server" approach: add/remove one node at a time with a configuration change entry. Joint consensus for atomic multi-node changes is not implemented.

Why: Adds significant complexity. Single-server changes are sufficient for most use cases and simpler to reason about.

Pre-Vote Extension

The pre-vote extension prevents disruption from partitioned nodes. We don't implement it.

Why: The standard Raft algorithm handles partitions correctly. Pre-vote is an optimization that adds complexity without changing correctness.

Read-Index / Linearizable Reads

All reads currently go to the leader. We don't implement linearizable reads via read-index or lease-based reads.

Why: For a task queue, reading task status doesn't require linearizability. If strict read consistency is needed, a quorum read can be implemented.

Batching and Pipelining

We don't batch multiple client requests into single log entries or pipeline AppendEntries.

Why: Simplicity. Can be added as an optimization if throughput becomes a bottleneck.

Learners

Learners are non-voting members that receive log entries but don't participate in elections. We don't implement them.

Why: Not needed for our cluster sizes. Useful for large clusters or geo-replication.

Implementation Details

State Machine

type RaftNode struct {
    // Persistent state (saved to disk before responding to RPCs)
    currentTerm uint64
    votedFor    string
    log         []LogEntry

    // Volatile state
    commitIndex uint64
    lastApplied uint64
    state       State  // FOLLOWER, CANDIDATE, LEADER

    // Leader-only volatile state
    nextIndex   map[string]uint64  // Next log index to send to each peer
    matchIndex  map[string]uint64  // Highest known replicated index per peer
}
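
The LogEntry type referenced above is not spelled out in this document; a minimal version (field names and types are illustrative) would carry the term, the absolute index, and the opaque command applied to the task-queue state machine:

type LogEntry struct {
    Term    uint64 // Term in which the leader created this entry
    Index   uint64 // Absolute position in the replicated log (1-based)
    Command []byte // Opaque state-machine command, e.g. an encoded enqueue or ack operation
}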

RequestVote RPC

// Invoked by candidates to gather votes

Request:
  term         uint64  // Candidate's term
  candidateId  string  // Candidate requesting vote
  lastLogIndex uint64  // Index of candidate's last log entry
  lastLogTerm  uint64  // Term of candidate's last log entry

Response:
  term        uint64  // currentTerm, for candidate to update itself
  voteGranted bool    // True if candidate received vote

Grant vote if:

  1. Request term ≥ currentTerm
  2. Haven't voted this term, OR voted for this candidate
  3. Candidate's log is at least as up-to-date as ours
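
Condition 3 is the usual up-to-date comparison from §5.4.1: the higher last term wins, and ties go to the longer log. A hedged sketch of the decision, where RequestVoteRequest/RequestVoteResponse mirror the field lists above and lastLogIndexAndTerm/persist are hypothetical helpers:

// logUpToDate reports whether the candidate's log is at least as up-to-date
// as ours: a higher last term wins; ties go to the longer log.
func (rn *RaftNode) logUpToDate(lastLogIndex, lastLogTerm uint64) bool {
    ourIndex, ourTerm := rn.lastLogIndexAndTerm() // hypothetical helper
    if lastLogTerm != ourTerm {
        return lastLogTerm > ourTerm
    }
    return lastLogIndex >= ourIndex
}

// handleRequestVote applies the three conditions above; it assumes the caller
// has already stepped down and cleared votedFor if req.term > rn.currentTerm.
func (rn *RaftNode) handleRequestVote(req RequestVoteRequest) RequestVoteResponse {
    grant := req.term >= rn.currentTerm &&
        (rn.votedFor == "" || rn.votedFor == req.candidateId) &&
        rn.logUpToDate(req.lastLogIndex, req.lastLogTerm)
    if grant {
        rn.votedFor = req.candidateId
        rn.persist() // single vote per term: record it before replying
    }
    return RequestVoteResponse{term: rn.currentTerm, voteGranted: grant}
}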

AppendEntries RPC

// Invoked by leader for log replication and heartbeats

Request:
  term         uint64      // Leader's term
  leaderId     string      // So followers can redirect clients
  prevLogIndex uint64      // Index of log entry before new ones
  prevLogTerm  uint64      // Term of prevLogIndex entry
  entries      []LogEntry  // Log entries to store (empty for heartbeat)
  leaderCommit uint64      // Leader's commitIndex

Response:
  term          uint64  // currentTerm, for leader to update itself
  success       bool    // True if the follower had an entry matching prevLogIndex and prevLogTerm
  conflictIndex uint64  // Optimization: hint for faster backtracking
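
The success and conflictIndex fields correspond to the follower-side consistency check: reject the request if the log has no entry at prevLogIndex with prevLogTerm, and report a hint so the leader can move nextIndex back over a whole conflicting term in one round trip. A sketch, with lastLogIndex and termAt as hypothetical helpers over the 1-based log:

// checkPrevEntry performs the log-matching consistency check and, on failure,
// returns a conflictIndex hint for fast nextIndex backtracking.
func (rn *RaftNode) checkPrevEntry(prevLogIndex, prevLogTerm uint64) (ok bool, conflictIndex uint64) {
    if prevLogIndex > rn.lastLogIndex() {
        // Our log is too short: ask the leader to resend starting at our end.
        return false, rn.lastLogIndex() + 1
    }
    if term := rn.termAt(prevLogIndex); term != prevLogTerm {
        // Conflicting term: report the first index of that term so the whole
        // run of conflicting entries is skipped in one step.
        idx := prevLogIndex
        for idx > 1 && rn.termAt(idx-1) == term {
            idx--
        }
        return false, idx
    }
    return true, 0
}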

Commit Index Advancement

Leader advances commitIndex when:

  1. Entry is replicated to majority of nodes
  2. Entry is from current term (Raft safety requirement)
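
Together, the two conditions look roughly like this on the leader (a sketch; lastLogIndex and termAt are hypothetical helpers):

// maybeAdvanceCommitIndex moves commitIndex to the highest index N that is
// replicated on a majority of the cluster and was written in the leader's
// current term.
func (rn *RaftNode) maybeAdvanceCommitIndex() {
    clusterSize := len(rn.matchIndex) + 1 // peers plus the leader itself
    for n := rn.lastLogIndex(); n > rn.commitIndex; n-- {
        if rn.termAt(n) != rn.currentTerm {
            break // never count replicas for entries from earlier terms
        }
        count := 1 // the leader always has its own entry
        for _, idx := range rn.matchIndex {
            if idx >= n {
                count++
            }
        }
        if count > clusterSize/2 {
            rn.commitIndex = n
            break // highest committable index found; lower ones are implied
        }
    }
}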

Apply Loop

// Background goroutine that applies committed entries

for {
    if lastApplied < commitIndex {
        // Raft log indices are 1-based, so the next unapplied entry is at lastApplied+1
        entry := log[lastApplied + 1]
        stateMachine.Apply(entry.command)
        lastApplied++
    }
    // In practice the loop blocks on a signal when lastApplied == commitIndex
    // rather than busy-spinning
}

Testing

Unit Tests

  • Election with various failure scenarios
  • Log replication and conflict resolution
  • Snapshot installation

Integration Tests

  • Multi-node cluster operations
  • Leader failure and re-election
  • Network partition scenarios

Chaos Testing

The scripts/chaos-test.sh script:

  1. Continuously enqueues tasks
  2. Kills the leader every N seconds
  3. Verifies no data loss

Results from chaos testing:

  • Leader failover: ~2-3 seconds
  • Zero data loss for committed entries
  • Only transient request failures during leader transitions

References