M. Ahmed

Understanding Consensus & Implementing Raft

Raft is a consensus algorithm for state machine replication, proposed by Diego Ongaro and John Ousterhout in 2014. It is much simpler than Paxos and is explicitly designed to be understandable for practitioners.

What is consensus

Consensus lets two or more nodes agree on a decision (or a series of decisions) despite failures. The Two Generals Problem is a classic illustration of why reaching such agreement over an unreliable network is hard.

How does it relate to modern distributed systems

In order to understand consensus, we need to divide distributed systems into two categories:

  1. Stateless Systems:
  • A bunch of web servers that serve user requests (REST APIs)
  • Some sort of compute-intensive system (e.g. an LLM inference endpoint)
  2. Stateful Systems:
  • A database management server that uses disk (e.g. Postgres, MySQL)
  • An object-storage service (e.g. AWS S3)

Now, scaling stateless services is quite easy in practice. Here’s one way to do it:

  1. Find a metric such as the number of requests per server or CPU utilization.
  2. Use that metric as a signal to add or remove web servers as needed.
  3. Put the group of servers under a load balancer.

This might all seem a little too reductive but that’s how almost all modern stateless services are scaled to some extent.

The issue arises when you have persistent state that needs to be written to disk. The easiest way to scale such services is to scale vertically as much as possible (i.e. use a server with more CPU & memory).

There are of course many tradeoffs with this approach. A few are mentioned below:

  1. Server costs grow disproportionately with size, so a single big machine gets expensive quickly.
  2. Internet-scale services have users across the globe. The latency between a database in California and a user in Singapore is around 170–200ms, which can add up when you have many queries.
  3. Servers die all the time, and with a single machine any failure means downtime. Availability is a critical metric nowadays and a given for most consumer-facing services.

Full-system Replication

The solution is obvious: we need multiple servers. But every server has its own state. The simplest solution is full-system replication over the network, which basically replicates everything at the OS level. This is both inefficient and fragile: copying every low-level operation is unnecessary, and the OS and hardware of the two servers need to be compatible.

So that’s probably not the best idea for most use cases.

State-machine replication

A better approach is to use the concept of state machines. A state machine is a system that produces the same outputs for the same sequence of inputs. Another way of putting this is that state machines are deterministic with respect to their inputs.

If we create a state machine that takes in a sequence of commands and reaches a deterministic state, we can copy the same commands over to a different server and expect the same result. Postgres does this by storing a Write-Ahead Log (WAL): every incoming command is first recorded in the append-only log, and applying the same log to two different Postgres servers leaves both in the same end state.
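To make this concrete, here is a minimal Go sketch of a deterministic key-value state machine driven by an append-only command log. The names (Command, KVStore, Apply) are purely illustrative, not taken from Postgres or any other system:

```go
package main

import "fmt"

// Command is a single entry in the append-only log.
type Command struct {
	Op    string // "set" or "delete"
	Key   string
	Value string
}

// KVStore is a deterministic state machine: applying the same
// sequence of commands always produces the same final state.
type KVStore struct {
	data map[string]string
}

func NewKVStore() *KVStore {
	return &KVStore{data: make(map[string]string)}
}

// Apply applies one command to the state machine.
func (s *KVStore) Apply(c Command) {
	switch c.Op {
	case "set":
		s.data[c.Key] = c.Value
	case "delete":
		delete(s.data, c.Key)
	}
}

func main() {
	log := []Command{
		{Op: "set", Key: "a", Value: "1"},
		{Op: "set", Key: "b", Value: "2"},
		{Op: "delete", Key: "a"},
	}

	// Replaying the same log on two independent replicas
	// leaves both in an identical state.
	replica1, replica2 := NewKVStore(), NewKVStore()
	for _, c := range log {
		replica1.Apply(c)
		replica2.Apply(c)
	}
	fmt.Println(replica1.data, replica2.data) // map[b:2] map[b:2]
}
```

Because Apply is deterministic, any replica that receives the same log ends up with the same map, which is exactly the property replication relies on.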

Raft

That’s where Raft comes in. It allows us to replicate a log (consistently — more on that later) so that all servers can apply it to their state machine and get the same end result.

Now before we start discussing Raft we need to discuss the system model assumed here. Raft assumes crash-recovery node behaviour (nodes may crash and later rejoin using state persisted to stable storage) and a partially synchronous, fair-loss network model.

It’s also a CP system:

  • Consistent: entries are committed to the log only once a majority quorum has acknowledged them. With 2F+1 nodes, the system can tolerate the failure of F nodes and continue operating. E.g. in a 5-node cluster, F = 2 such that 2(2) + 1 = 5, so 3 nodes (a majority) are enough to keep committing log entries.
  • Partition-Tolerant: The system keeps functioning under a network partition as long as it retains a majority. As long as 3 out of 5 nodes can reach each other, the system doesn't care about the partition and continues normal operation.

This also implies that the minority partition will NOT be able to commit log entries under a network partition. This is derived from the CAP Theorem.
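As a quick sanity check on the arithmetic, the quorum size for a cluster of n = 2F+1 nodes is just a one-liner (an illustrative helper, not part of any library):

```go
// majority returns the minimum quorum size for a cluster of n nodes,
// where n = 2F+1 and up to F nodes may fail.
func majority(n int) int {
	return n/2 + 1
}
```

majority(3) is 2 and majority(5) is 3, matching the example above.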

How it works

The Raft paper divides the algorithm into three parts to make it easier to understand:

  1. Leader Election
  2. Log Replication
  3. Safety

Leader Election

The way Raft works is that a leader is responsible for all write operations from the client's end for a particular term. A term is a contiguous period of logical time in which log entries are replicated and committed. There can be at most 1 leader per term (a term whose election ends in a split vote has none).

Raft uses RPCs (remote procedure calls) to communicate between nodes. Two RPCs are particularly important (both sketched as Go structs below):

  1. RequestVote: Sent by a candidate to gather votes during an election for a new term.
  2. AppendEntries: Called by a leader to replicate log entries to followers for a particular term.
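As a rough sketch, the two RPCs can be modelled as plain Go structs. The fields below follow the ones described in the Raft paper; the exact names and the wire format are illustrative:

```go
// LogEntry is a single client command together with the term in which
// the leader received it.
type LogEntry struct {
	Term    int
	Command []byte
}

// RequestVote: sent by a candidate to gather votes.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateID  int
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

type RequestVoteReply struct {
	Term        int // receiver's current term, so the candidate can update itself
	VoteGranted bool
}

// AppendEntries: sent by the leader to replicate entries; with an empty
// Entries slice it doubles as a heartbeat.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderID     int
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of the entry at PrevLogIndex
	Entries      []LogEntry // empty for heartbeats
	LeaderCommit int        // leader's commit index
}

type AppendEntriesReply struct {
	Term    int
	Success bool
}
```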

A node can have one of 3 roles (modelled as a small enum in the sketch below):

  • Follower (Every node starts here on boot/reboot)
  • Candidate (When asking for votes to become a leader)
  • Leader (The node responsible for replicating log entries)
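In code, the roles map naturally onto a small enum, and the per-node state used by the sketches further down can hang off a single struct. The field names are illustrative and cover only what those snippets need:

```go
import "sync"

// Role is the node's current role in the protocol.
type Role int

const (
	Follower  Role = iota // every node starts (and restarts) here
	Candidate             // asking for votes to become leader
	Leader                // responsible for replicating log entries
)

// Node holds just the state the following sketches use.
type Node struct {
	mu    sync.Mutex
	id    int
	peers []int // IDs of the other nodes in the cluster

	role        Role
	currentTerm int
	votedFor    int // candidate voted for in currentTerm, -1 if none
	log         []LogEntry

	votesReceived int // candidate state: votes granted this term
	commitIndex   int // highest log index known to be committed

	sentLength  map[int]int // leader state: entries sent to each follower
	ackedLength map[int]int // leader state: entries acknowledged by each follower
}
```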

In order to trigger an election, every node runs a randomized election timeout. If a follower hears nothing from a leader before the timeout goes off, it starts an election (sketched below) by:

  • Changing its role to Candidate
  • Incrementing its term and voting for itself
  • Sending the RequestVote RPC to every other node
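A minimal sketch of that sequence, continuing the types above. sendRequestVote stands in for whatever transport the implementation uses (e.g. net/rpc or gRPC), and handleVoteReply is shown a little further down; a real implementation would also persist currentTerm and votedFor before sending anything:

```go
// startElection runs when a follower's election timer fires.
func (n *Node) startElection() {
	n.mu.Lock()
	n.role = Candidate
	n.currentTerm++   // move to a new term
	n.votedFor = n.id // vote for itself
	n.votesReceived = 1
	lastIndex := len(n.log) - 1
	lastTerm := 0
	if lastIndex >= 0 {
		lastTerm = n.log[lastIndex].Term
	}
	args := RequestVoteArgs{
		Term:         n.currentTerm,
		CandidateID:  n.id,
		LastLogIndex: lastIndex,
		LastLogTerm:  lastTerm,
	}
	n.mu.Unlock()

	// Ask every other node for its vote in parallel.
	for _, peer := range n.peers {
		go func(peer int) {
			var reply RequestVoteReply
			if n.sendRequestVote(peer, &args, &reply) {
				n.handleVoteReply(&args, &reply)
			}
		}(peer)
	}
}
```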

Once a node receives a RequestVote RPC (handler sketched below):

  • If the candidate's term is smaller than its own, it denies the vote.
  • If the terms are equal but it has already voted for a different candidate in this term, it denies the vote.
  • Otherwise, if the candidate's log is at least as up to date as its own, it grants the vote and resets its election timer.
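A sketch of those rules on the receiving side, continuing the same types; resetElectionTimer is an assumed helper that restarts the node's randomized election timeout:

```go
// HandleRequestVote decides whether to grant a candidate our vote.
func (n *Node) HandleRequestVote(args *RequestVoteArgs, reply *RequestVoteReply) {
	n.mu.Lock()
	defer n.mu.Unlock()

	// Seeing a higher term always makes us adopt it and become a follower.
	if args.Term > n.currentTerm {
		n.currentTerm = args.Term
		n.role = Follower
		n.votedFor = -1
	}

	reply.Term = n.currentTerm
	reply.VoteGranted = false

	// Rule 1: stale term -> deny.
	if args.Term < n.currentTerm {
		return
	}
	// Rule 2: already voted for someone else this term -> deny.
	if n.votedFor != -1 && n.votedFor != args.CandidateID {
		return
	}
	// Rule 3: the candidate's log must be at least as up to date as ours.
	lastIndex := len(n.log) - 1
	lastTerm := 0
	if lastIndex >= 0 {
		lastTerm = n.log[lastIndex].Term
	}
	upToDate := args.LastLogTerm > lastTerm ||
		(args.LastLogTerm == lastTerm && args.LastLogIndex >= lastIndex)
	if !upToDate {
		return
	}

	n.votedFor = args.CandidateID
	reply.VoteGranted = true
	n.resetElectionTimer()
}
```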

On receiving the responses, the candidate:

  • If the term in a reply is greater than its own, it steps down (changes role to Follower, updates its term and abandons the election).
  • If the vote is granted, it increments the number of votes received for the term.
  • Once the votes received reach a majority of the cluster, the Candidate changes its role to Leader and starts replicating the log by sending the AppendEntries RPC to every node (see the sketch below).
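Continuing the sketch, the candidate tallies replies roughly like this; startHeartbeats stands in for the leader's periodic AppendEntries loop:

```go
// handleVoteReply counts granted votes and promotes the candidate to
// leader once a majority of the cluster has voted for it.
func (n *Node) handleVoteReply(args *RequestVoteArgs, reply *RequestVoteReply) {
	n.mu.Lock()
	defer n.mu.Unlock()

	// A reply from a newer term means we lost: step down.
	if reply.Term > n.currentTerm {
		n.currentTerm = reply.Term
		n.role = Follower
		n.votedFor = -1
		return
	}
	// Ignore stale replies from an older election.
	if n.role != Candidate || args.Term != n.currentTerm {
		return
	}
	if reply.VoteGranted {
		n.votesReceived++
		clusterSize := len(n.peers) + 1
		if n.votesReceived >= clusterSize/2+1 { // majority reached
			n.role = Leader
			n.startHeartbeats() // begin sending AppendEntries to every follower
		}
	}
}
```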

Log Replication

In order to replicate the log, the leader sends the AppendEntries RPC to every node along with a list of log entries. The leader tracks, per follower, which entries have already been sent and only sends newer entries.

The leader also sends the AppendEntries RPC at a fixed interval, even when there are no new entries, so that it doubles as a heartbeat.

On receiving the AppendEntries RPC, the receiver (handler sketched after this list):

  • Rejects the RPC if the leader's term is older than its own; otherwise it adopts the leader's term if needed and records the leader's id.
  • Resets its election timer so that no election is triggered as long as AppendEntries keep arriving at a regular interval. Note that the AppendEntries interval, or "heartbeat interval", should be 3–6x smaller than the minimum election timeout to account for network/queuing delays between nodes and avoid unnecessary elections.
  • Checks that its log is consistent with the leader's view: the entry just before the new ones (at the last index the leader believes this follower already has) must exist locally with the expected term. If it doesn't match, the receiver rejects the RPC.
  • If the check passes, the receiver appends the new entries to its log (overwriting any conflicting ones) and returns success.
  • The RPC also carries the leader's commit index, and the receiver applies log entries to its state machine up to that index.
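A sketch of the follower-side handler, continuing the same types; applyCommitted is an assumed helper that feeds newly committed entries to the local state machine:

```go
// HandleAppendEntries applies the checks described above.
func (n *Node) HandleAppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	n.mu.Lock()
	defer n.mu.Unlock()

	reply.Term = n.currentTerm
	reply.Success = false

	// Reject RPCs from a stale leader.
	if args.Term < n.currentTerm {
		return
	}
	// A current or newer leader exists: adopt its term, stay a follower,
	// and reset the election timer so no election is triggered.
	n.currentTerm = args.Term
	n.role = Follower
	n.resetElectionTimer()

	// Consistency check: our log must contain an entry at PrevLogIndex
	// with the term the leader expects.
	if args.PrevLogIndex >= 0 {
		if args.PrevLogIndex >= len(n.log) ||
			n.log[args.PrevLogIndex].Term != args.PrevLogTerm {
			return
		}
	}

	// Append the new entries. This simplified version truncates everything
	// after PrevLogIndex; a production implementation only truncates on an
	// actual conflict.
	n.log = append(n.log[:args.PrevLogIndex+1], args.Entries...)

	// Advance the commit index and apply newly committed entries.
	if args.LeaderCommit > n.commitIndex {
		last := len(n.log) - 1
		if args.LeaderCommit < last {
			n.commitIndex = args.LeaderCommit
		} else {
			n.commitIndex = last
		}
		n.applyCommitted()
	}
	reply.Success = true
}
```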

Once the leader receives a reply:

  • On success, it records how much of the log that follower has acknowledged; on failure, it decrements the "sent" length for that follower and retries the RPC, walking backwards until it gets a successful response (this is how inconsistent entries on a follower get overwritten).
  • It also scans the acknowledged log state of all nodes, finds the highest index replicated on a majority, and sets it as the commit index (the same index used to apply entries to the state machine in the AppendEntries RPC), which is sent with the next AppendEntries RPC. A sketch of both steps follows.
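A sketch of that leader-side bookkeeping, using the "sent"/"acknowledged" lengths described above (the Raft paper calls the equivalent bookkeeping nextIndex and matchIndex); replicateTo and applyCommitted are assumed helpers:

```go
// handleAppendReply runs on the leader when a follower responds.
func (n *Node) handleAppendReply(follower int, args *AppendEntriesArgs, reply *AppendEntriesReply) {
	n.mu.Lock()
	defer n.mu.Unlock()

	// A newer term means we are no longer the leader.
	if reply.Term > n.currentTerm {
		n.currentTerm = reply.Term
		n.role = Follower
		n.votedFor = -1
		return
	}
	if n.role != Leader || args.Term != n.currentTerm {
		return
	}

	if reply.Success {
		acked := args.PrevLogIndex + 1 + len(args.Entries)
		n.sentLength[follower] = acked
		n.ackedLength[follower] = acked
		n.advanceCommitIndex()
	} else {
		// Back up by one entry and retry until the logs match.
		if n.sentLength[follower] > 0 {
			n.sentLength[follower]--
		}
		n.replicateTo(follower) // re-send AppendEntries starting further back
	}
}

// advanceCommitIndex commits the highest index stored on a majority of
// nodes, provided that entry is from the leader's current term (a Raft
// safety requirement).
func (n *Node) advanceCommitIndex() {
	for idx := len(n.log) - 1; idx > n.commitIndex; idx-- {
		if n.log[idx].Term != n.currentTerm {
			break
		}
		count := 1 // the leader itself stores the entry
		for _, acked := range n.ackedLength {
			if acked > idx {
				count++
			}
		}
		if count >= (len(n.peers)+1)/2+1 { // majority
			n.commitIndex = idx
			n.applyCommitted()
			break
		}
	}
}
```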

A few optimizations that can be done:

  1. Rather than decrementing the sent length one entry at a time (which is slow for a follower that starts with an empty or badly diverged log), the follower can return a hint such as the conflicting entry's term and the first index it stores for that term, letting the leader skip back whole terms at once.
  2. Logs grow without bound in a long-running application, so the state machine should periodically be snapshotted and the log compacted up to the snapshot point.

The full toy implementation of Raft in Go can be found at: GitHub. HashiCorp's implementation is much more mature and used in practice. Here's a link to the Raft site, which includes the paper, videos, illustrations, talks and more.