A comprehensive Python implementation of the Raft distributed consensus protocol, designed for understandability without sacrificing correctness.
Raft is a consensus algorithm designed to be easy to understand while providing strong consistency guarantees for distributed systems.
When a cluster starts or the leader fails, nodes hold an election. Candidates request votes from peers, and a node becomes leader upon receiving a majority. Elections use randomized timeouts to prevent split votes.
The leader accepts client requests, appends them to its log, and replicates entries to followers. Each entry includes a term number and command. Followers acknowledge receipt, enabling the leader to track consensus.
Once an entry is replicated on a majority of servers, it's considered committed. The leader tracks the highest committed index and includes it in AppendEntries messages so followers can apply committed entries.
Committed entries are applied to the state machine in log order. This ensures all servers execute the same commands in the same order, maintaining consistency across the distributed system.
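The majority rule above can be sketched as a small helper. This is an illustrative function, not PyRaft's actual API: a leader tracking the highest replicated index per follower (a `match_index` map, a common name from the Raft paper) can commit the median of those indices. The sketch omits Raft's extra restriction that a leader only directly commits entries from its own term (the Figure 8 safety case).

```python
def compute_commit_index(match_index: dict, own_last_index: int) -> int:
    """Return the highest log index replicated on a majority of servers.

    match_index maps follower id -> highest index known replicated there;
    the leader's own log counts as one replica.
    Note: real Raft additionally requires the entry at the returned index
    to be from the leader's current term before committing it.
    """
    indices = sorted(list(match_index.values()) + [own_last_index], reverse=True)
    majority = len(indices) // 2 + 1
    # The (majority-1)-th entry of the descending list is replicated on
    # at least `majority` servers.
    return indices[majority - 1]

# Five-server cluster: leader at index 7, followers at 7, 5, 3, 2.
# Index 5 is on three of five servers, so it is committed.
print(compute_commit_index({1: 7, 2: 5, 3: 3, 4: 2}, 7))  # -> 5
```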
PyRaft implements a clean separation of concerns with distinct layers for networking, consensus logic, and state management.
Handles all network communication between nodes using TCP sockets. Provides abstract send/receive operations with separate queues for each peer. Supports both real TCP connections and mock networks for testing.
Implements the replicated transaction log with the append_entries operation. Enforces log matching properties, handles idempotent operations, and maintains strict ordering guarantees.
Core consensus logic implementing the three-state machine: Follower, Candidate, and Leader. Handles all Raft RPCs including RequestVote, AppendEntries, and their responses.
Bridges the Raft algorithm with the execution environment. Manages timers, event queues, message dispatch, and state machine application. Enables both threaded execution and simulation.
Each server in a Raft cluster is always in exactly one of three states: Follower, Candidate, or Leader.
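As a minimal sketch of that state machine (PyRaft itself models the states as separate classes, per the API reference below; the enum here is only illustrative):

```python
from enum import Enum, auto

class RaftState(Enum):
    # Illustrative enum; PyRaft represents these as
    # FollowerState, CandidateState, and LeaderState classes.
    FOLLOWER = auto()
    CANDIDATE = auto()
    LEADER = auto()

# Legal transitions in Raft:
#   Follower  -> Candidate  (election timeout fires)
#   Candidate -> Leader     (wins a majority of votes)
#   Candidate -> Follower   (sees a current leader or a newer term)
#   Leader    -> Follower   (sees a newer term)
```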
Understanding how Raft achieves distributed consensus through leader election and log replication.
```python
def append_entries(log, prev_index, prev_term, entries) -> bool:
    '''
    Append entries into the Raft log. Return True if it worked,
    False otherwise. Implements the log matching properties as
    described in Section 5.3.
    '''
    # The log is never allowed to have holes
    if prev_index >= len(log):
        return False
    # Check log matching property - term must match
    if prev_index >= 0 and log[prev_index].term != prev_term:
        return False
    # Append is valid - replace entries from prev_index+1 onwards
    log[prev_index+1:] = entries
    return True
```
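The log-matching check can be exercised directly. The snippet below repeats the function so it runs standalone, and assumes `LogEntry` is the (term, item) named tuple described in the API reference:

```python
from collections import namedtuple

LogEntry = namedtuple('LogEntry', ['term', 'item'])

def append_entries(log, prev_index, prev_term, entries):
    # Same logic as above, repeated so this example is self-contained.
    if prev_index >= len(log):
        return False
    if prev_index >= 0 and log[prev_index].term != prev_term:
        return False
    log[prev_index + 1:] = entries
    return True

log = []
assert append_entries(log, -1, -1, [LogEntry(1, 'x=1')])    # first entry
assert append_entries(log, 0, 1, [LogEntry(1, 'y=2')])      # prev term matches
assert not append_entries(log, 5, 1, [LogEntry(1, 'z=3')])  # would create a hole
assert not append_entries(log, 0, 2, [LogEntry(2, 'z=3')])  # prev term mismatch
assert append_entries(log, 1, 1, [])                        # empty = heartbeat, idempotent
assert [e.item for e in log] == ['x=1', 'y=2']
```

Note that a successful append with empty `entries` leaves the log unchanged, which is exactly what makes heartbeats safe to repeat.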
Complete reference for all PyRaft classes, methods, and message types.
Core Raft server implementing the consensus algorithm state machine. Manages log, current term, commit index, and handles all state transitions between Follower, Candidate, and Leader.
state
Current server state (FollowerState, CandidateState, LeaderState)
log
List of LogEntry objects representing the transaction log
current_term
Latest term server has seen (persisted)
commit_index
Index of highest log entry known to be committed
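Of the fields above, only some must survive a crash. A rough sketch following Figure 2 of the Raft paper (this reflects the paper's persistence rules, not necessarily PyRaft's storage layout; `voted_for` is an additional persisted field from the paper not listed above):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RaftPersistentState:
    """State that must be durable per the Raft paper: current_term,
    voted_for, and the log. commit_index and the Follower/Candidate/
    Leader role are volatile and rebuilt after a restart."""
    current_term: int = 0
    voted_for: Optional[int] = None
    log: list = field(default_factory=list)
```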
Invoked by leader to replicate log entries. Also used as heartbeat when entries is empty. Contains all information needed for followers to verify and append entries.
term
Leader's current term
prev_index
Index of log entry immediately preceding new ones
prev_term
Term of prev_index entry
entries
List of LogEntry objects to append (may be empty for heartbeat)
commit_index
Leader's commit_index
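The field table above maps naturally onto an immutable message type. A sketch (field names follow the table; any source/dest routing fields PyRaft attaches are omitted here):

```python
from typing import NamedTuple

class AppendEntries(NamedTuple):
    # Illustrative message shape; see the field descriptions above.
    term: int          # leader's current term
    prev_index: int    # index of entry immediately preceding new ones
    prev_term: int     # term of the prev_index entry
    entries: list      # LogEntry objects to append
    commit_index: int  # leader's commit index

# An empty entries list doubles as a heartbeat:
hb = AppendEntries(term=3, prev_index=9, prev_term=3, entries=[], commit_index=8)
```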
Invoked by candidates to gather votes during leader election. Voters only grant vote if candidate's log is at least as up-to-date as their own.
term
Candidate's term
last_log_index
Index of candidate's last log entry
last_log_term
Term of candidate's last log entry
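The "at least as up-to-date" rule that voters apply to these two fields can be sketched as follows (Raft Section 5.4.1: compare the last entries' terms first, then log lengths on a tie; the function name is illustrative):

```python
def candidate_log_up_to_date(last_log_term, last_log_index,
                             my_last_term, my_last_index):
    """True if the candidate's log is at least as up-to-date as mine."""
    if last_log_term != my_last_term:
        return last_log_term > my_last_term
    return last_log_index >= my_last_index

# A voter whose own log ends at (term=2, index=5):
assert candidate_log_up_to_date(3, 1, 2, 5)      # higher last term wins
assert not candidate_log_up_to_date(2, 4, 2, 5)  # same term, shorter log loses
assert candidate_log_up_to_date(2, 5, 2, 5)      # identical logs are acceptable
```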
Runtime controller bridging Raft algorithm with execution environment. Manages event queues, timers, message dispatch, and provides interface for client operations.
start()
Launch threaded execution environment
client_append_entry(item)
Submit entry to leader, returns Future
is_leader()
Check if this node is current leader
process_event(evt)
Process single event (for simulation)
TCP-based network layer for real distributed deployment. Manages separate outgoing queues per peer, handles connection failures gracefully, and provides non-blocking send operations.
send(msg)
Non-blocking send; the destination is taken from msg.dest
receive(blocking=True)
Receive next message from any peer
start()
Launch acceptor and sender threads
Named tuple representing a single entry in the Raft log. Each entry contains the term when it was created and the actual command/item to be applied to the state machine.
term
Term number when entry was created
item
The actual command/data (any object)
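Since LogEntry is described as a named tuple of these two fields, it can be sketched directly:

```python
from collections import namedtuple

# A single Raft log entry: the term it was created in, plus the command.
LogEntry = namedtuple('LogEntry', ['term', 'item'])

entry = LogEntry(term=2, item='set x 42')
```

Using a tuple keeps entries immutable and trivially comparable, which makes log-equality assertions in tests (like the simulation example below) cheap and exact.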
PyRaft includes a powerful simulation framework for testing consensus scenarios without real network overhead.
Simulated network layer using in-memory queues. Messages can be inspected, delayed, or dropped to test failure scenarios. Perfect for unit testing Raft logic in isolation.
The simulation framework allows stepping through message exchanges one at a time. Call step() to process all pending messages, or step_all() to run until quiescence.
Pre-built test scenarios including Figure 7 (diverse log states) and Figure 8 (commitment safety). Validates correct log replication and commit behavior.
Full access to server state during simulation. Verify log contents, commit indices, term numbers, and state transitions. Assert invariants at any point.
```python
class RaftSimulation:
    def __init__(self, num_nodes):
        self.nodes = []
        for n in range(num_nodes):
            net = raftnet.MockNetwork(n, [x for x in range(num_nodes) if x != n])
            self.nodes.append(raftcontrol.RaftControl(net))

    def step(self):
        """Process all pending messages across all nodes"""
        messages = []
        for node in self.nodes:
            messages.extend(node.outgoing)
            node.outgoing.clear()
        # Dispatch messages to destination nodes
        for msg in messages:
            self.nodes[msg.dest].process_event(
                raftcontrol.MessageEvent(msg)
            )
        return bool(messages)

    def step_all(self):
        """Run until no more messages pending"""
        while self.step():
            pass

# Example: Test log replication
sim = RaftSimulation(5)
sim.nodes[0].become_leader()
sim.nodes[0].client_append_entry('x=1')
sim.step_all()

# Verify all logs match
assert all(n.server.log == sim.nodes[0].server.log for n in sim.nodes)
```
PyRaft includes a distributed key-value store as a reference implementation.
Client sends SET request → Leader appends to log → Replicates to followers → Commits on majority → Applies to state machine → Returns success
Read requests can be served immediately from committed state. For linearizable reads, the leader must first confirm it still holds leadership with a round of heartbeats.
Follows same consensus path as SET. Delete command is logged and replicated, ensuring all nodes eventually remove the key from their local state.
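The apply step shared by SET and DELETE can be sketched as below. The command format (`set k v` / `delete k`) and function name are assumptions for illustration, not the store's actual wire format; the key point is that committed entries are applied in log order and re-application is harmless:

```python
def apply_to_kv(store: dict, command: str) -> dict:
    """Hypothetical state-machine apply step for the key-value store."""
    op, *args = command.split()
    if op == 'set':
        key, value = args
        store[key] = value
    elif op == 'delete':
        store.pop(args[0], None)  # idempotent: deleting a missing key is a no-op
    return store

store = {}
for cmd in ['set x 1', 'set y 2', 'delete x']:
    apply_to_kv(store, cmd)
# Every node applying this same sequence ends with {'y': '2'}.
```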
Academic and industry research building on Raft consensus for distributed systems.
Multi-agent systems, enterprise LLM infrastructure, and agent communication protocols from CS²B Technologies.
GPU-NVMe integration, distributed KV-cache offloading for LLM inference, and CXL memory architectures.
Ultra Ethernet, RDMA, NVMe-oF integration for AI/HPC workloads and data center fabrics.