Raft Implementation

This document describes the Raft consensus implementation in DQueue.

Overview

We implement the Raft consensus algorithm from scratch to provide:

  • Leader election: Automatic election of a new leader when the current leader fails
  • Log replication: Consistent state across all nodes
  • Safety: No loss of committed data as long as a majority of nodes is alive

Our implementation follows the Raft paper with practical simplifications for a real-world task queue.

What We Implement

Leader Election (§5.2)

  • Randomized election timeouts (150-300ms default)
  • RequestVote RPC with log comparison
  • Majority vote required to become leader
  • Single vote per term (persisted before responding)
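
As a rough illustration of the randomized timeout, the value can be drawn uniformly from the configured range each time the timer is reset. The constants and function name below are illustrative rather than the actual DQueue identifiers, and math/rand and time are assumed to be imported:

const (
    electionTimeoutMin = 150 * time.Millisecond
    electionTimeoutMax = 300 * time.Millisecond
)

// randomElectionTimeout returns a timeout drawn uniformly from
// [electionTimeoutMin, electionTimeoutMax) so that followers rarely time
// out at the same moment and split the vote.
func randomElectionTimeout() time.Duration {
    return electionTimeoutMin +
        time.Duration(rand.Int63n(int64(electionTimeoutMax-electionTimeoutMin)))
}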

Log Replication (§5.3)

  • AppendEntries RPC for replication and heartbeats
  • Heartbeat interval: 50ms default
  • Log matching property enforced
  • Conflict detection and resolution
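
A minimal sketch of the heartbeat side of this, assuming a LEADER state constant as in the RaftNode struct shown later and a hypothetical broadcastAppendEntries helper (locking is omitted for brevity):

const heartbeatInterval = 50 * time.Millisecond

// runHeartbeats sends AppendEntries to every follower on a fixed interval
// while this node is leader; an empty entries slice acts as a pure
// heartbeat and still carries the leader's commitIndex.
func (rn *RaftNode) runHeartbeats() {
    ticker := time.NewTicker(heartbeatInterval)
    defer ticker.Stop()
    for range ticker.C {
        if rn.state != LEADER {
            return // stepped down; a new leader owns the heartbeats now
        }
        rn.broadcastAppendEntries() // hypothetical helper that fans out AppendEntries RPCs
    }
}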

Safety (§5.4)

  • Only commit entries from current term
  • Leader Completeness: elected leaders have all committed entries
  • State Machine Safety: same index → same command

Persistence (§5.4)

Persisted before responding to any RPC:

  • currentTerm
  • votedFor
  • Log entries
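
A sketch of what "persist before responding" can look like in code. The gob encoding and the storage.Save call are assumptions for illustration, not the actual DQueue storage API (bytes and encoding/gob assumed imported):

// persist writes the durable Raft state to stable storage. It must complete
// before the node answers any RPC that changed currentTerm, votedFor, or
// appended to the log.
func (rn *RaftNode) persist() error {
    var buf bytes.Buffer
    enc := gob.NewEncoder(&buf)
    if err := enc.Encode(rn.currentTerm); err != nil {
        return err
    }
    if err := enc.Encode(rn.votedFor); err != nil {
        return err
    }
    if err := enc.Encode(rn.log); err != nil {
        return err
    }
    return rn.storage.Save(buf.Bytes()) // storage is a placeholder for the stable-storage layer
}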

Log Compaction (§7)

  • Snapshots of state machine state
  • InstallSnapshot RPC for slow followers
  • Log entries covered by the snapshot (up to and including its index) are deleted
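
The truncation step might look roughly like this; the firstIndex field (the absolute Raft index of rn.log[0]) is an assumed bookkeeping field, not necessarily how DQueue tracks the offset:

// compactLog discards entries up to and including snapshotIndex once a
// snapshot covering that prefix is durably stored. Entries after the
// snapshot index are kept so slightly lagging followers can still be
// caught up with ordinary AppendEntries.
func (rn *RaftNode) compactLog(snapshotIndex uint64) {
    keepFrom := snapshotIndex - rn.firstIndex + 1
    rn.log = append([]LogEntry(nil), rn.log[keepFrom:]...) // copy so the old backing array can be released
    rn.firstIndex = snapshotIndex + 1
}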

Non-Goals (Deferred)

The following Raft extensions are not implemented in this version:

Joint Consensus (§6)

Membership changes use the "single-server" approach: add/remove one node at a time with a configuration change entry. Joint consensus for atomic multi-node changes is not implemented.

Why: Adds significant complexity. Single-server changes are sufficient for most use cases and simpler to reason about.

Pre-Vote Extension

The pre-vote extension prevents disruption from partitioned nodes. We don't implement it.

Why: The standard Raft algorithm handles partitions correctly. Pre-vote is an optimization that adds complexity without changing correctness.

Read-Index / Linearizable Reads

All reads currently go to the leader. We don't implement linearizable reads via read-index or lease-based reads.

Why: For a task queue, reading task status doesn't require linearizability. If strict read consistency is needed, a quorum read can be implemented.

Batching and Pipelining

We don't batch multiple client requests into single log entries or pipeline AppendEntries.

Why: Simplicity. Can be added as an optimization if throughput becomes a bottleneck.

Learners

Learners are non-voting members that receive log entries but don't participate in elections. We don't implement them.

Why: Not needed for our cluster sizes. Useful for large clusters or geo-replication.

Implementation Details

State Machine

type RaftNode struct {
    // Persistent state (saved to disk before responding to RPCs)
    currentTerm uint64
    votedFor    string
    log         []LogEntry

    // Volatile state
    commitIndex uint64
    lastApplied uint64
    state       State  // FOLLOWER, CANDIDATE, LEADER

    // Leader-only volatile state
    nextIndex   map[string]uint64  // Next log index to send to each peer
    matchIndex  map[string]uint64  // Highest known replicated index per peer
}
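
The LogEntry type referenced above is not spelled out in this document; a minimal version (field names and types are illustrative) would carry the term, the absolute index, and the opaque command applied to the task-queue state machine:

type LogEntry struct {
    Term    uint64 // Term in which the leader created this entry
    Index   uint64 // Absolute position in the replicated log (1-based)
    Command []byte // Opaque state-machine command, e.g. an encoded enqueue or ack operation
}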

RequestVote RPC

// Invoked by candidates to gather votes

Request:
  term         uint64  // Candidate's term
  candidateId  string  // Candidate requesting vote
  lastLogIndex uint64  // Index of candidate's last log entry
  lastLogTerm  uint64  // Term of candidate's last log entry

Response:
  term        uint64  // currentTerm, for candidate to update itself
  voteGranted bool    // True if candidate received vote

Grant vote if:

  1. Request term ≥ currentTerm
  2. Haven't voted this term, OR voted for this candidate
  3. Candidate's log is at least as up-to-date as ours
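
Condition 3 is the usual up-to-date comparison from §5.4.1: the higher last term wins, and ties go to the longer log. A hedged sketch of the decision, where RequestVoteRequest/RequestVoteResponse mirror the field lists above and lastLogIndexAndTerm/persist are hypothetical helpers:

// logUpToDate reports whether the candidate's log is at least as up-to-date
// as ours: a higher last term wins; ties go to the longer log.
func (rn *RaftNode) logUpToDate(lastLogIndex, lastLogTerm uint64) bool {
    ourIndex, ourTerm := rn.lastLogIndexAndTerm() // hypothetical helper
    if lastLogTerm != ourTerm {
        return lastLogTerm > ourTerm
    }
    return lastLogIndex >= ourIndex
}

// handleRequestVote applies the three conditions above; it assumes the caller
// has already stepped down and cleared votedFor if req.term > rn.currentTerm.
func (rn *RaftNode) handleRequestVote(req RequestVoteRequest) RequestVoteResponse {
    grant := req.term >= rn.currentTerm &&
        (rn.votedFor == "" || rn.votedFor == req.candidateId) &&
        rn.logUpToDate(req.lastLogIndex, req.lastLogTerm)
    if grant {
        rn.votedFor = req.candidateId
        rn.persist() // single vote per term: record it before replying
    }
    return RequestVoteResponse{term: rn.currentTerm, voteGranted: grant}
}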

AppendEntries RPC

// Invoked by leader for log replication and heartbeats

Request:
  term         uint64      // Leader's term
  leaderId     string      // So followers can redirect clients
  prevLogIndex uint64      // Index of log entry before new ones
  prevLogTerm  uint64      // Term of prevLogIndex entry
  entries      []LogEntry  // Log entries to store (empty for heartbeat)
  leaderCommit uint64      // Leader's commitIndex

Response:
  term          uint64  // currentTerm, for leader to update itself
  success       bool    // True if the follower had an entry matching prevLogIndex and prevLogTerm
  conflictIndex uint64  // Optimization: hint for faster backtracking
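
The success and conflictIndex fields correspond to the follower-side consistency check: reject the request if the log has no entry at prevLogIndex with prevLogTerm, and report a hint so the leader can move nextIndex back over a whole conflicting term in one round trip. A sketch, with lastLogIndex and termAt as hypothetical helpers over the 1-based log:

// checkPrevEntry performs the log-matching consistency check and, on failure,
// returns a conflictIndex hint for fast nextIndex backtracking.
func (rn *RaftNode) checkPrevEntry(prevLogIndex, prevLogTerm uint64) (ok bool, conflictIndex uint64) {
    if prevLogIndex > rn.lastLogIndex() {
        // Our log is too short: ask the leader to resend starting at our end.
        return false, rn.lastLogIndex() + 1
    }
    if term := rn.termAt(prevLogIndex); term != prevLogTerm {
        // Conflicting term: report the first index of that term so the whole
        // run of conflicting entries is skipped in one step.
        idx := prevLogIndex
        for idx > 1 && rn.termAt(idx-1) == term {
            idx--
        }
        return false, idx
    }
    return true, 0
}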

Commit Index Advancement

Leader advances commitIndex when:

  1. Entry is replicated to majority of nodes
  2. Entry is from current term (Raft safety requirement)
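
Together, the two conditions look roughly like this on the leader (a sketch; lastLogIndex and termAt are hypothetical helpers):

// maybeAdvanceCommitIndex moves commitIndex to the highest index N that is
// replicated on a majority of the cluster and was written in the leader's
// current term.
func (rn *RaftNode) maybeAdvanceCommitIndex() {
    clusterSize := len(rn.matchIndex) + 1 // peers plus the leader itself
    for n := rn.lastLogIndex(); n > rn.commitIndex; n-- {
        if rn.termAt(n) != rn.currentTerm {
            break // never count replicas for entries from earlier terms
        }
        count := 1 // the leader always has its own entry
        for _, idx := range rn.matchIndex {
            if idx >= n {
                count++
            }
        }
        if count > clusterSize/2 {
            rn.commitIndex = n
            break // highest committable index found; lower ones are implied
        }
    }
}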

Apply Loop

// Background goroutine that applies committed entries

for {
    if lastApplied < commitIndex {
        // Raft log indices are 1-based, so the next unapplied entry is at lastApplied+1
        entry := log[lastApplied + 1]
        stateMachine.Apply(entry.command)
        lastApplied++
    }
    // In practice the loop blocks on a signal when lastApplied == commitIndex
    // rather than busy-spinning
}

Testing

Unit Tests

  • Election with various failure scenarios
  • Log replication and conflict resolution
  • Snapshot installation

Integration Tests

  • Multi-node cluster operations
  • Leader failure and re-election
  • Network partition scenarios

Chaos Testing

The scripts/chaos-test.sh script:

  1. Continuously enqueues tasks
  2. Kills the leader every N seconds
  3. Verifies no data loss

Results from chaos testing:

  • Leader failover: ~2-3 seconds
  • Zero data loss for committed entries
  • Only transient request failures during leader transitions

References