Distributed Systems

The Raft Consensus Algorithm

A comprehensive Python implementation of the Raft distributed consensus protocol, designed for understandability without sacrificing correctness.

What is Raft?

Raft is a consensus algorithm designed to be easy to understand while providing strong consistency guarantees for distributed systems.

● Typical cluster size: 3-7 nodes
● Fault tolerance: F failed nodes out of 2F+1 (a majority of F+1 must stay up)
● Message complexity: O(n) messages per operation
● Strong consistency: committed entries are durable and totally ordered

Leader Election

When a cluster starts or the leader fails, nodes hold an election. Candidates request votes from peers, and a node becomes leader upon receiving a majority. Elections use randomized timeouts to prevent split votes.
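The randomized timeout can be sketched in a few lines of Python. The 150-300 ms bounds below are the range suggested in the Raft paper and are purely illustrative, not PyRaft's actual configuration values:

```python
import random

# Illustrative bounds from the Raft paper's suggested range (150-300 ms);
# PyRaft's real configuration may differ.
ELECTION_TIMEOUT_MIN = 0.150
ELECTION_TIMEOUT_MAX = 0.300

def random_election_timeout() -> float:
    """Draw a fresh timeout for each election round. Because every node
    draws independently, one candidate usually times out first, wins the
    vote, and split votes become rare."""
    return random.uniform(ELECTION_TIMEOUT_MIN, ELECTION_TIMEOUT_MAX)
```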

Log Replication

The leader accepts client requests, appends them to its log, and replicates entries to followers. Each entry includes a term number and command. Followers acknowledge receipt, enabling the leader to track consensus.

Commitment

Once an entry is replicated on a majority of servers, it's considered committed. The leader tracks the highest committed index and includes it in AppendEntries messages so followers can apply committed entries.
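One common way to compute the committed index is to take the majority point of the replication progress across the cluster. This is a sketch, assuming a per-follower `match_index` list and the `LogEntry` type from the API section; the current-term check is the safety rule from Section 5.4.2 of the Raft paper:

```python
from collections import namedtuple

LogEntry = namedtuple('LogEntry', ['term', 'item'])

def compute_commit_index(match_index, log, current_term, num_servers):
    """Return the highest index replicated on a majority, or -1.
    match_index holds the highest replicated index for each follower;
    the leader's own log always contains every entry."""
    indices = sorted(match_index + [len(log) - 1], reverse=True)
    majority = indices[num_servers // 2]   # value present on a majority
    # Safety rule: only entries from the leader's current term may be
    # committed by counting replicas (older entries commit indirectly).
    if majority >= 0 and log[majority].term == current_term:
        return majority
    return -1
```

For example, in a 5-node cluster whose four followers report `match_index` values `[2, 2, 1, 0]` and whose leader log ends in term 2, the function returns 2: index 2 is stored on three of five servers.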

State Machine Application

Committed entries are applied to the state machine in log order. This ensures all servers execute the same commands in the same order, maintaining consistency across the distributed system.
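The apply step reduces to a loop over the gap between `last_applied` and `commit_index`. The field names here follow the Raft paper, and the `apply` callback is a stand-in for whatever state machine is plugged in:

```python
from collections import namedtuple

LogEntry = namedtuple('LogEntry', ['term', 'item'])

def apply_committed(log, commit_index, last_applied, apply):
    """Apply every committed-but-unapplied entry, strictly in log order,
    and return the new last_applied index. Running this same loop on
    every server yields identical state machines."""
    while last_applied < commit_index:
        last_applied += 1
        apply(log[last_applied].item)
    return last_applied
```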

🔬

Applied Research: Distributed Parameter Server

See how Raft consensus enables fault-tolerant distributed ML with 40%+ throughput gains in federated learning.

Architecture Overview

PyRaft implements a clean separation of concerns with distinct layers for networking, consensus logic, and state management.

🌐

RaftNet Layer

Handles all network communication between nodes using TCP sockets. Provides abstract send/receive operations with separate queues for each peer. Supports both real TCP connections and mock networks for testing.

📝

RaftLog Module

Implements the replicated transaction log with the append_entries operation. Enforces log matching properties, handles idempotent operations, and maintains strict ordering guarantees.

🔄

RaftServer State Machine

Core consensus logic implementing the three-state machine: Follower, Candidate, and Leader. Handles all Raft RPCs including RequestVote, AppendEntries, and their responses.

🎮

RaftControl Runtime

Bridges the Raft algorithm with the execution environment. Manages timers, event queues, message dispatch, and state machine application. Enables both threaded execution and simulation.

Server State Transitions

Each server in a Raft cluster is always in exactly one of three states:

Follower → Candidate → Leader

● Follower: passive state, responds to RPCs
● Candidate: requests votes from peers
● Leader: handles all client requests

Transition Rules

โ— Follower โ†’ Candidate: Election timeout without hearing from leader
โ— Candidate โ†’ Leader: Receives votes from majority of servers
โ— Any โ†’ Follower: Discovers higher term from any message

Consensus Mechanism

Understanding how Raft achieves distributed consensus through leader election and log replication.

📨

AppendEntries

Leader sends log entries to followers. Also serves as heartbeat when entries list is empty. Contains term, prev_index, prev_term, entries, and commit_index.

✅

AppendEntriesResponse

Followers respond with success/failure and their match_index. Failed responses trigger leader to retry with earlier entries to find consistency point.
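The leader-side retry logic can be sketched as follows, assuming per-follower `next_index`/`match_index` maps as in the Raft paper (PyRaft's actual bookkeeping may differ):

```python
def handle_append_response(next_index, match_index, follower,
                           response_ok, follower_match):
    """On success, record the follower's progress; on failure, back up
    next_index by one so the next AppendEntries starts one entry earlier,
    walking backwards until the consistency point is found."""
    if response_ok:
        match_index[follower] = follower_match
        next_index[follower] = follower_match + 1
    else:
        next_index[follower] = max(0, next_index[follower] - 1)
```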

🗳️

RequestVote

Candidates request votes during elections. Contains term, last_log_index, and last_log_term to verify candidate's log is up-to-date.
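The up-to-date comparison is small enough to show in full; it follows Section 5.4.1 of the Raft paper (compare last terms first, then log lengths):

```python
def log_is_up_to_date(cand_last_term, cand_last_index,
                      my_last_term, my_last_index):
    """A voter grants its vote only if the candidate's log is at least as
    up-to-date as its own: the higher last term wins, and on a tie the
    longer log wins."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index
```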

Replicated Log Example
Leader log across terms 1-3:

Index:   [0]   [1]   [2]   [3]   [4]   [5]   [6]   [7]
Command: x←3   y←1   y←9   x←2   x←0   y←7   x←5   x←4
Term:    T1    T1    T1    T2    T3    T3    T3    T3

commit_index = 5 (entries 0-5 are committed on a majority)
raftlog.py โ€” Log Entry Append Operation
def append_entries(log, prev_index, prev_term, entries) -> bool:
    """
    Append entries into the Raft log. Return True on success, False otherwise.
    Implements the log matching property described in Section 5.3 of the
    Raft paper.
    """
    # The log is never allowed to have holes
    if prev_index >= len(log):
        return False

    # Log matching property: the term at prev_index must match
    if prev_index >= 0 and log[prev_index].term != prev_term:
        return False

    # Append is valid. Truncate only on a real conflict; a stale or
    # duplicate AppendEntries must never erase entries that follow it,
    # which keeps the operation idempotent.
    for offset, entry in enumerate(entries):
        index = prev_index + 1 + offset
        if index < len(log):
            if log[index].term != entry.term:
                del log[index:]        # drop the conflicting suffix
                log.append(entry)
        else:
            log.append(entry)
    return True

API Documentation

Complete reference for all PyRaft classes, methods, and message types.

CLASS RaftServer

Core Raft server implementing the consensus algorithm state machine. Manages log, current term, commit index, and handles all state transitions between Follower, Candidate, and Leader.

state Current server state (FollowerState, CandidateState, LeaderState)
log List of LogEntry objects representing the transaction log
current_term Latest term server has seen (persisted)
commit_index Index of highest log entry known to be committed

MSG AppendEntries

Invoked by leader to replicate log entries. Also used as heartbeat when entries is empty. Contains all information needed for followers to verify and append entries.

term Leader's current term
prev_index Index of log entry immediately preceding new ones
prev_term Term of prev_index entry
entries List of LogEntry objects to append (may be empty for heartbeat)
commit_index Leader's commit_index

MSG RequestVote

Invoked by candidates to gather votes during leader election. Voters only grant vote if candidate's log is at least as up-to-date as their own.

term Candidate's term
last_log_index Index of candidate's last log entry
last_log_term Term of candidate's last log entry

CLASS RaftControl

Runtime controller bridging Raft algorithm with execution environment. Manages event queues, timers, message dispatch, and provides interface for client operations.

start() Launch threaded execution environment
client_append_entry(item) Submit entry to leader, returns Future
is_leader() Check if this node is current leader
process_event(evt) Process single event (for simulation)

CLASS TCPRaftNet

TCP-based network layer for real distributed deployment. Manages separate outgoing queues per peer, handles connection failures gracefully, and provides non-blocking send operations.

send(msg) Non-blocking send to destination (from msg.dest)
receive(blocking=True) Receive next message from any peer
start() Launch acceptor and sender threads

CLASS LogEntry

Named tuple representing a single entry in the Raft log. Each entry contains the term when it was created and the actual command/item to be applied to the state machine.

term Term number when entry was created
item The actual command/data (any object)
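A minimal definition matching this description might look like the following; PyRaft's actual declaration may differ in detail:

```python
from collections import namedtuple

# term: term number when the entry was created
# item: the command/data applied to the state machine (any object)
LogEntry = namedtuple('LogEntry', ['term', 'item'])

entry = LogEntry(term=2, item=('set', 'x', 42))
```

Named tuples give value equality for free, which is what makes whole-log comparisons such as the assertion in the simulation example straightforward.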

Simulation Framework

PyRaft includes a powerful simulation framework for testing consensus scenarios without real network overhead.

🧪

MockNetwork

Simulated network layer using in-memory queues. Messages can be inspected, delayed, or dropped to test failure scenarios. Perfect for unit testing Raft logic in isolation.

โฑ๏ธ

Deterministic Stepping

The simulation framework allows stepping through message exchanges one at a time. Call step() to process all pending messages, or step_all() to run until quiescence.

📊

Scenario Testing

Pre-built test scenarios including Figure 7 (diverse log states) and Figure 8 (commitment safety). Validates correct log replication and commit behavior.

🔍

State Inspection

Full access to server state during simulation. Verify log contents, commit indices, term numbers, and state transitions. Assert invariants at any point.

raftsim.py โ€” Running a Simulation
import raftnet
import raftcontrol

class RaftSimulation:
    def __init__(self, num_nodes):
        self.nodes = []
        for n in range(num_nodes):
            net = raftnet.MockNetwork(n, [x for x in range(num_nodes) if x != n])
            self.nodes.append(raftcontrol.RaftControl(net))

    def step(self):
        """Process all pending messages across all nodes"""
        messages = []
        for node in self.nodes:
            messages.extend(node.outgoing)
            node.outgoing.clear()
        
        # Dispatch messages to destination nodes
        for msg in messages:
            self.nodes[msg.dest].process_event(
                raftcontrol.MessageEvent(msg)
            )
        return bool(messages)

    def step_all(self):
        """Run until no more messages pending"""
        while self.step():
            pass

# Example: Test log replication
sim = RaftSimulation(5)
sim.nodes[0].become_leader()
sim.nodes[0].client_append_entry('x=1')
sim.step_all()

# Verify all logs match
assert all(n.server.log == sim.nodes[0].server.log for n in sim.nodes)

Key-Value Store Integration

PyRaft includes a distributed key-value store as a reference implementation.

📥

SET Operation

Client sends SET request → Leader appends to log → Replicates to followers → Commits on majority → Applies to state machine → Returns success
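Since `client_append_entry` is documented to return a Future, the client-side SET path reduces to a short sketch like this (the `('set', key, value)` command encoding is an assumption, not PyRaft's actual wire format):

```python
def kv_set(control, key, value, timeout=5.0):
    """Submit a SET through the Raft log and block until the entry is
    committed and applied, then return the state machine's reply."""
    future = control.client_append_entry(('set', key, value))
    return future.result(timeout=timeout)
```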

📤

GET Operation

Read requests can be served immediately from committed state. For linearizable reads, leader must verify it still holds leadership via heartbeat round.
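A sketch of that read path, where `confirm_leadership()` stands in for the heartbeat round; it is a hypothetical helper, since only `is_leader()` appears in the documented API:

```python
class NotLeader(Exception):
    """Raised so the client can retry against the current leader."""

def linearizable_get(control, store, key):
    """Serve a read only after proving leadership is still held; a deposed
    leader answering from local state could return stale data."""
    if not control.is_leader():
        raise NotLeader('not the leader')
    if not control.confirm_leadership():  # hypothetical heartbeat round
        raise NotLeader('leadership lost')
    return store.get(key)
```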

🗑️

DELETE Operation

Follows same consensus path as SET. Delete command is logged and replicated, ensuring all nodes eventually remove the key from their local state.

Related Research

Academic and industry research building on Raft consensus for distributed systems.