
Cluster Replication With Paxos for the Java Virtual Machine

An earlier version of this essay was published on WordPress in 2014 as Cluster Replication With Paxos. That was the year the Raft paper was published.

If you carefully read the 2001 paper Paxos Made Simple, you will find that the algorithm is straightforward and mechanically simple. It reaches consensus with the mathematical minimum number of message exchanges. This may be a surprising result to any reader who has looked at ZooKeeper Atomic Broadcast (ZAB) and Raft. You do not need an external leader election service like the one used in the Spinnaker or PaxStore papers.

I am not claiming any innovation or invention in this post; it is simply my attempt to demystify the subject. I assert that it is possible to implement the protocol described in the 2001 paper Paxos Made Simple to make a safe and efficient cluster replication algorithm. Please look at that paper for definitions of the messages and terminology used in this post.

I am also not claiming that writing a distributed system is easy. It is not. You must deal with many remote clients attempting to interact with the cluster while messages are lost and nodes crash. It is very challenging to reason about all the permutations of messages and failures in any distributed system. Yet that is true no matter what core algorithm you use to gain consistency. The Paxos algorithm is, however, mechanically simple. We can very carefully assert that we do not violate the invariants of the algorithm in our code. This will be detailed below.

A common confusion is not comprehending that Paxos is Multi-Paxos. The 2001 paper Paxos Made Simple by Leslie Lamport very clearly states on page 10:

A newly chosen leader executes phase 1 for infinitely many instances of the consensus algorithm. Using the same proposal number for all instances, it can do this by sending a single reasonably short message to the other servers.

That means you send the prepare message once and then stream the accept messages afterwards. The original 1998 paper also described the Multi-Decree Parliament as the Paxos protocol, not the Single-Decree Synod part of that paper.

Another common confusion is not understanding that failover safety is designed into the Paxos algorithm, as proposal numbers must be unique to each leader. This is stated on page 8 as:

Different proposers choose their numbers from disjoint sets of numbers, so two different proposers never issue a proposal with the same number.

Remarkably, even papers published in late 2013 miss this core feature of the algorithm and choose an unsafe global log index number as the Paxos number, which violates the algorithm and requires them to use a slow custom leader election mechanism to gain safety.

There is no mystery about how to achieve this. It is trivially achieved by encoding a "node unique number" into the least significant bits of the ballot number used in the prepare and accept messages.

Yet another common confusion is that Paxos uses a distinguished leader node. On page 6:

The algorithm chooses a leader, which plays the roles of the distinguished proposer and the distinguished learner.

The list of common confusions seems endless. It does not help that the first paper, eventually published in 1998, was a colossal pedagogical blunder. Yet something more troubling is that counterfactual information has eclipsed the factual information in plain sight. This tragedy has propelled me to spend a not-insignificant amount of time trying to set the record straight.

Problem Statement

Paxos is an algorithm for fixing many values across a cluster. This is known as the consensus problem:

Assume a collection of client processes which may propose a value. A consensus algorithm ensures that only one value among all proposed values is chosen.

If that seems abstract, consider a sports ticket-selling website where the last ticket to the Super Bowl is on sale. Thousands of web browsers (clients) will offer (propose) the payment details (value) of different football fans to be applied (fixed) to some unique ticket number (slot).

The algorithm lets a mathematical majority of nodes agree on many values concurrently. Chaining values together into a meaningful protocol to provide a reliable service is left as an exercise to the reader.

It helps to have a simple example in mind. For this discussion, assume we want to replicate a file-backed map as a trivial key-value datastore across three different servers. In practice, any client-to-server network traffic can be replicated using the approach detailed in this post.

One thing that may confuse you is that your application may think of the "value" as something held in the replicated map. Paxos uses "value" to mean the command you are trying to fix as being consistent across the cluster. This is stated on page 8:

A simple way to implement a distributed system is as a collection of clients that issue commands to a central server. The server can be described as a deterministic state machine that performs client commands in some sequence. The state machine has a current state; it performs a step by taking as input a command and producing an output and a new state.

When replicating a file-backed map as a trivial key-value datastore across three different servers, the commands may be binary encodings of a put(k,v) or remove(k) remote procedure call. Please understand that the paper does not tell you to write your code as a state machine; it says any server can be described as a state machine for the mathematical proof.

In our map example above, the operations do not commute; they must be applied in the same order at every node in the cluster, otherwise the follower maps will not match the leader's. If we label each command with a sequential index, we can enforce the ordering.
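To make the example concrete, here is a minimal sketch, not the library's API, of how put(k,v) and remove(k) commands might be encoded as byte arrays and applied to the replicated map. The class and tag names are hypothetical; the point is only that every node decodes and applies the same commands in the same slot order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Map;

// A minimal sketch, not the library's API: put/remove commands for the
// replicated map encoded as byte arrays, applied in slot order at every node.
final class KvCommands {

    // Encode put(k,v) as the tag 'P' followed by the key and value.
    static byte[] put(String key, String value) throws IOException {
        var bytes = new ByteArrayOutputStream();
        try (var out = new DataOutputStream(bytes)) {
            out.writeByte('P');
            out.writeUTF(key);
            out.writeUTF(value);
        }
        return bytes.toByteArray();
    }

    // Encode remove(k) as the tag 'R' followed by the key.
    static byte[] remove(String key) throws IOException {
        var bytes = new ByteArrayOutputStream();
        try (var out = new DataOutputStream(bytes)) {
            out.writeByte('R');
            out.writeUTF(key);
        }
        return bytes.toByteArray();
    }

    // Every node applies fixed commands in the same slot order so that the
    // replicated maps stay identical.
    static void apply(Map<String, String> map, byte[] command) throws IOException {
        try (var in = new DataInputStream(new ByteArrayInputStream(command))) {
            switch ((char) in.readByte()) {
                case 'P' -> map.put(in.readUTF(), in.readUTF());
                case 'R' -> map.remove(in.readUTF());
                default -> throw new IllegalArgumentException("unknown command tag");
            }
        }
    }
}
```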

Multi-Paxos Numbers For Replication

The description of multi-Paxos on Wikipedia (as of late October 2014) includes a counter:

To achieve multi-Paxos, the instance number I is included along with each value.

The paper Paxos Made Simple clarifies that it is the absolute counter of the number of instances of the algorithm. The leader sends accept(I,N,V) where I is the instance counter, N is the proposal number unique to a leader, and V is the value being proposed by the leader for that instance. I am going to clarify the definition of the counter to be more specific:

Let S be the logical log index, or "slot", included with each value; the values at the slots must be made consistent and gap-free across all servers.

Each S is logically a slot in commit history into which leaders propose commands, which must be fixed across all servers and applied in order at each server. The Spinnaker paper describing transaction log replication with Paxos also uses this definition.

Clients are not shown the effect of the command at each slot until a majority of nodes in the cluster acknowledge it. When a majority of servers have accepted the value, the Paxos algorithm ensures that the value at the slot will not change. The log index allows consensus to be performed in parallel on different slots using the same N. A leader may stream accept(S,N,V) messages in log order using S,S+1,S+2,..,S+i.
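As an illustration of that streaming, here is a sketch using hypothetical record names, holding N as a plain long for now (a structured ballot number is sketched in the next section):

```java
import java.util.List;
import java.util.stream.IntStream;

// Illustrative record names, not the library's API. A leader streams accept
// messages for consecutive slots under the same proposal number N.
record Accept(long slot, long n, byte[] value) {}

final class AcceptStreaming {
    // Build accepts for values at slots S, S+1, .., S+i, all under one N.
    static List<Accept> stream(long firstSlot, long n, List<byte[]> values) {
        return IntStream.range(0, values.size())
                .mapToObj(i -> new Accept(firstSlot + i, n, values.get(i)))
                .toList();
    }
}
```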

It is a common misconception that the original Paxos papers don't use a stable leader. In Paxos Made Simple, on page 6 in the section entitled The Implementation, Lamport wrote:

The algorithm chooses a leader, which plays the roles of the distinguished proposer and the distinguished learner.

This is achieved using the "Phase 1" messaging of prepare and the response promise. A separate leadership election is not required. Simple heartbeats and timeouts may be used as described below.

Due to leadership changes, a particular value may be proposed into a given slot S by one leader using N and then fixed by another using a different N'. Leader failovers can also cause alternative leaders to propose different values into the same slot. Paxos requires that N is unique to a node so clashes can be resolved by always preferring the higher number. We can make unique numbers by encoding the node identifier in the lower bits of the number. A simple counter can be used for the higher bits.

In this essay, we will use a decimal notation such as N=45.3, where the node number is 3, which is held in the lower decimal part 0.3, and the counter is the whole part 45. In practice, we can use binary and any number of bytes.
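Here is a minimal sketch of that encoding with illustrative field widths (the node identifier in the low eight bits and the counter above it); it is not the library's wire format:

```java
// Illustrative only, not the library's wire format: a ballot number N with
// the node identifier in the low eight bits and a counter in the bits above.
// The essay's decimal notation N=45.3 becomes counter=45, nodeId=3.
record BallotNumber(int counter, byte nodeId) implements Comparable<BallotNumber> {

    // Pack the counter into the high bits and the node id into the low bits so
    // that comparing packed numbers compares counters first, then node ids.
    long packed() {
        return ((long) counter << 8) | (nodeId & 0xFFL);
    }

    @Override
    public int compareTo(BallotNumber that) {
        return Long.compare(this.packed(), that.packed());
    }
}
```

Because no two nodes share a node identifier, no two nodes can ever issue the same packed number, which is exactly the disjoint-sets property quoted above.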

If, during a network partition, multiple nodes attempt to lead and each gets a different value accepted into the same slot by only a minority, then no value has a majority and the round fails. Each node increments the term counter and tries again to fix the failed slot. This will not happen under stable leadership; it will only happen when nodes attempt to take over the leadership.

Each leader will use a new higher N' at the next round to attempt to fix a value into that slot with a fresh accept(S,N',V) message. This means that for any given {N,S} pair, only one unique proposed V exists.

Choosing A Value: Fixing It And Then Learning It Is Fixed

We must consider when a value is fixed, how that is learnt, and when it is applied in log order at each node.

With Paxos, a value cannot change when a majority of nodes in the cluster has accepted it. Any node in the cluster can discover the value fixed at any given slot by asking each node what value they hold at each log slot.

Whatever value is held by a majority of nodes is the fixed value. To reduce messages we can have the leader listen to the accept response messages of followers. It can then send a short fixed(S,N) message when it learns a value has been accepted by a majority of nodes. Yet, we do not need to send it in a separate network packet.

An alternative to the fixed message is simply to have all nodes exchange messages to learn which values are fixed at each slot, or to use true multicast networking. The algorithm ensures that once a value is fixed it is always fixed, so any mechanism may be used to learn which values have been fixed. Below we will introduce a catchup message to allow nodes to request retransmission of values that were sent in lost accept messages. This is a simple "learning" mechanism.

Each leader only proposes a unique V for any {S,N} pair. If it does not get a majority of positive responses, it must increment the counter within the higher bits of N for any new messages. N is unique to each node because the node identifier is encoded into the lower bits. If a follower sees a fixed(S,N) and it has already seen the corresponding accept(S,N,V), it has now learned exactly which V is fixed at slot S.

To reduce network roundtrips, we can piggyback the short fixed message at the beginning of the next leader message which is typically the subsequent accept message. The repository uses one byte for the message type, one for the node identifier, four for the counter, and eight for the slot number. This means that a fixed message is just fourteen bytes placed at the beginning of the next accept message sent from the leader.
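As a sketch of that piggybacking, using the field widths stated above but not the repository's actual serialisation code:

```java
import java.nio.ByteBuffer;

// A sketch of packing a fixed(S,N) header in front of the next accept message
// in the same network buffer. Field widths follow the essay: 1 byte message
// type + 1 byte node id + 4 bytes counter + 8 bytes slot = 14 bytes.
final class FixedHeader {
    static final byte FIXED_TYPE = 2; // illustrative message type tag

    static void writeFixed(ByteBuffer buffer, byte nodeId, int counter, long slot) {
        buffer.put(FIXED_TYPE);  // 1 byte: message type
        buffer.put(nodeId);      // 1 byte: node identifier part of N
        buffer.putInt(counter);  // 4 bytes: counter part of N
        buffer.putLong(slot);    // 8 bytes: slot S now known to be fixed
        // ...the serialised accept for the next slot would follow in this buffer
    }
}
```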

When each follower learns that all prior slots have been fixed, it can "up-call" the value V to the host application. This is an application-specific callback that can do whatever the host application desires. The point is that every node will up-call values in the same order.

Please remember that the values are actually remote procedure call commands that are applied to the host application state. In our example of a k-v store, these are put(k,v) or remove(k) operations. Trex simply uses byte arrays. The library's users can put whatever they want into the byte arrays (e.g., JSON, protobuf, Avro).
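Here is a minimal sketch, assuming a hypothetical host callback, of the up-call discipline: fixed values may be learned out of order, but they are only up-called to the host application in contiguous slot order.

```java
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.Consumer;

// A sketch, not the library's API: fixed values may be learned out of order,
// but they are only up-called to the host application in contiguous slot order.
final class UpCallBuffer {
    private final SortedMap<Long, byte[]> pending = new TreeMap<>();
    private final Consumer<byte[]> hostUpCall; // e.g. applies put/remove to the map
    private long highestApplied; // last slot already up-called

    UpCallBuffer(long highestApplied, Consumer<byte[]> hostUpCall) {
        this.highestApplied = highestApplied;
        this.hostUpCall = hostUpCall;
    }

    // Called whenever this node learns that value is fixed at slot.
    void learned(long slot, byte[] value) {
        if (slot > highestApplied) {
            pending.put(slot, value);
        }
        // Up-call only the contiguous prefix so every node applies the same order.
        while (pending.containsKey(highestApplied + 1)) {
            hostUpCall.accept(pending.remove(highestApplied + 1));
            highestApplied++;
        }
    }
}
```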

When there is no message loss, followers will see accept and fixed messages packed together into network packets. The leader can stream accept messages such that there can be multiple outstanding values with the highest fixed slot lagging behind. The challenge is when the follower sees gaps in the S numbers of either accept messages or fixed messages due to lost or reordered messages.

Interestingly, a leader may die before it sees a majority of accept responses, so a value may be fixed, and cannot change, yet no node is aware of it. This scenario is covered in the next section.

Leader Election

The paper clearly states on page 6:

The algorithm chooses a leader, which plays the roles of the distinguished proposer and the distinguished learner.

It does not imply that mechanisms other than the prepare and accept messages are used for this purpose. On the contrary, it explicitly defines the leader election mechanism to be the prepare and accept messages. The paper goes on to say on page 7:

.. a reliable algorithm for electing a proposer must use either randomness or realtime — for example, by using timeouts. However, safety is ensured regardless of the success or failure of the election.

The whole novelty of the Paxos algorithm was that it does not require real-time, by which Lamport means perfectly synchronised clocks. So he is saying you need either randomness or "real-time timeouts", and he is suggesting that you try random timeouts.

The leader can send heartbeat messages. When a follower times out on heartbeats, it can increment the counter in N to create a new unique number and issue a prepare for the slot just above the highest it knows to have been fixed. Node number 2 will encode its identifier in the lower bits and increment the counter in the higher bits. If the counter was previously 3, it will issue prepare(S,N=4.2), and the election is simply whether it receives a majority of positive responses. A node can and should vote for itself, so in a three-node cluster a node only needs to exchange one message to become elected. This is optimal. We must always have a crash-proof disk flush on each node before each message is sent out (or else non-volatile memory or some other crash-proof mechanism).

The positive responses to prepare messages are known as promises. These are leader election votes. In a three-node cluster, a node first votes for itself, and it then only needs one other positive response to have been elected. It can then choose a value and issue accept(S,N=4.2,V=32). To get this to work, we simply need the other nodes not to interfere aggressively, which is arranged using heartbeats and randomised timeouts.

If message exchanges are of the order of milliseconds, we can use a randomised timeout with a lower bound of something like thirty milliseconds and an upper bound of around a hundred milliseconds. What is optimal is, of course, entirely dependent on the network and disk properties of a given cluster. Getting it wrong can lead to a leader duel, which is a form of livelock. A couple of rounds failing by random chance extends your worst case.
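A sketch of such a randomised timeout, using the illustrative thirty to one hundred millisecond bounds mentioned above (the class and constant names are hypothetical):

```java
import java.util.concurrent.ThreadLocalRandom;

// A sketch of a randomised election timeout. A follower re-arms this after
// every heartbeat it receives; if it fires, the follower issues a prepare.
final class ElectionTimeout {
    private static final long MIN_MS = 30;  // lower bound, tune per cluster
    private static final long MAX_MS = 100; // upper bound, tune per cluster

    // Randomisation makes it unlikely that two followers time out together
    // and duel for the leadership.
    static long nextTimeoutMillis() {
        return ThreadLocalRandom.current().nextLong(MIN_MS, MAX_MS + 1);
    }
}
```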

Of course, if people want more predictable failover, they can implement more complex leader election mechanisms. Any such mechanism itself needs either a real-time clock or randomness, and it may exchange many more messages to avoid livelock. The simplicity of randomised timeouts is very elegant to me. Exactly what is elegant to you depends on your application. Exemplary Paxos libraries should be compatible with external failure-detection or leader election libraries.

In addition to the built-in leader elections, we can add a vote of no confidence. In a real parliament, candidate leaders canvass support and exit the race when they have insufficient support. Elected leaders may also step down once they have lost support. We can use negative acknowledgement messages, known as nack messages, for both prepare and accept. Leaders can step down when they learn of a higher fixed slot in a nack. A leader can also abdicate if it receives a majority of nacks. This is known as "backing down" in this implementation.

We can extend the protocol to allow the standard PrepareResponse and AcceptResponse messages to be either a positive acknowledgement "ack" or a negative acknowledgement "nack." The paper covers the positive acknowledgement case. The negative acknowledgement is legal so long as it does not violate the algorithm's invariants. This means that a "nack" cannot change a node's promise or its commit log. The node that receives a majority nack response can abdicate (aka "back down") and request retransmission of lost messages.
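As a sketch of that extension, reusing the BallotNumber record sketched earlier (the names are illustrative, not the library's): a response carries a vote flag plus the responder's highest fixed slot, so a stale sender can learn when to back down.

```java
// Illustrative only: a prepare response that is either an ack (a promise) or
// a nack. A nack never changes the responder's promise or journal; it only
// tells the sender that it is behind or outvoted.
record PrepareResponse(
        long slot,
        BallotNumber number,     // the N being responded to
        boolean vote,            // true = ack, false = nack
        long highestFixedSlot) { // lets a stale sender learn it should back down
}
```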

The Leader Take-Over Protocol

What happens for slots above the maximum committed log index slot where a dead leader issued accept messages but no commit messages were seen? The new leader uses both prepare and accept messages to fix values into these slots. Importantly, it must collaborate with any prior leader by fixing the value sent by the past leader that used the highest number.

It issues a prepare(N,S) for all slots S, S+1, S+2, .., S+i not so far learnt to have been fixed. Each other node responds with a promise message containing the highest uncommitted {N,V} pair it holds at each slot, promise(S,N,V).

A node should never issue a promise for a slot S that it knows was previously fixed, no matter how high the ballot number used in the prepare message.

When the new candidate leader receives a majority of promises, it selects the value with the highest N at each slot S that it finds in the majority response. It then attempts to fix this by sending a fresh accept message for that V under its new higher N.

If the new leader gets a positive majority accept response, it knows the value at the slot is now fixed. The Spinnaker paper refers to this "slot recovery" phase as the leader takeover phase.

If the new leader is not aware of any uncommitted slots, it can send a prepare for the slot just above the last one it committed. The new leader may have missed some accept messages from the previous leader, so it may be unaware of the full range of slots it needs to fix. At least one node in the majority knows the highest slot index proposed by the last leader. The ack messages holding the promises can state the highest accepted log index at each node. This allows the new leader to learn the full range of slots it needs to fix. It can then send out additional prepare messages as required to correctly fix all slots.

Probing and filling in the previous leader's uncommitted slots is a form of crash recovery. The Paxos Made Simple paper says that if the new leader finds a slot with no accept message due to lost messages, it should fix a no-op value into that slot. This does not affect the correctness as the new leader can choose any value; it just speeds up recovery. There is the question of what N number to use in the prepare messages sent during the leader takeover. The new leader should choose a number higher than the maximum number it last promised or last accepted.
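Here is a minimal sketch of that selection rule, reusing the BallotNumber record from earlier and otherwise hypothetical types: for each slot, take the value accepted under the highest N reported by the majority, or a no-op if no node reported anything for the slot.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// A sketch of the takeover selection rule using hypothetical types. For a
// given slot, each promise may carry the highest {N, V} the responder has
// accepted for that slot, or nothing at all.
final class Takeover {
    record AcceptedValue(BallotNumber number, byte[] value) {}

    static final byte[] NOOP = new byte[0]; // value to fix when no node reported one

    // promisesForSlot holds one optional accepted value per responder in the majority.
    static byte[] chooseValue(List<Optional<AcceptedValue>> promisesForSlot) {
        return promisesForSlot.stream()
                .flatMap(Optional::stream)
                // Prefer the value accepted under the highest ballot number.
                .max(Comparator.comparing(AcceptedValue::number))
                .map(AcceptedValue::value)
                .orElse(NOOP); // no prior accept seen: fix a no-op to speed recovery
    }
}
```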

It is important to understand that the same value can be accepted under different term numbers. Consider a three-node cluster where the leader with node identifier 1, at term N=1.1, chooses value V=32 at slot S=10 just as the network goes crazy while it attempts to send accept(S=10,N=1.1,V=32). This can lead to some interesting scenarios, depending on which messages get through and which nodes are crashed.

If accept(S=10,N=1.1,V=32) never arrives, then after a timeout another node will run the recovery protocol. This might be node 2, which increments the counter and adds its node identifier to the lower bits to give N=2.2. Imagine the old leader remains crashed for hours. The new leader 2 selects V=128 for slot S=10 and sends accept(S=10,N=2.2,V=128). This reaches node 3, and now slot S=10 is forever fixed at V=128. Eventually, node 1 rejoins the cluster having incremented its term to N=3.1. It will never accept the new leader's message accept(S=10,N=2.2,V=128). Yet the value at the slot is fixed at V=128. Node 1 must learn that the value was fixed at V=128 from the new leader. The new leader should realise that node 1 cannot accept messages under the current leader term of N=2.2, which it can observe via negative response messages. The new leader can increment its counter until it obtains a higher N=4.2. It can then make an instantaneous promise to itself for the next slot and issue the next accept message using N=4.2.

Alternatively, consider if the accept message accept(S=10,N=1.1,V=32) only arrived at node 3. The new leader node 2 will learn of the value V=32 via the response to the prepare that it sends for slot S=10. It will create a new accept(S=10,N=2.2,V=32), self-accept it, and then transmit it. Node 3 must accept the new message. It already had V=32 at slot S=10, yet it must give a positive acknowledgement back to node 2 for the new leader to learn that the value has been fixed. This means that while each {N,S} pair maps to a unique V, the opposite is not true: the same V may be sent for the same slot under different numbers. Intuitively, what is happening here is that the cluster gossips until every node has exchanged enough messages to understand which value is fixed.

Retransmission

A leader can stream accept messages without awaiting a reply for each slot. If we stream five messages and three are dropped, a leader might learn that the first and fifth slots are fixed. What is in the other three slots? That depends on whether any of the messages got through. Imagine the leader then crashes and comes back as a follower after another node has been leading for a while. It will learn from a fixed message about a higher slot being fixed. It needs to request retransmission to learn what happened to the lower slots.

To cover this case, we can add a Catchup message. This message states the last slot known to be fixed and may be sent to any node in the cluster. The CatchupResponse sends back the fixed values above the slot in the request. This is simply an efficient way to learn the fixed values. The node receiving the CatchupResponse must not update its promise, as that would be a protocol violation. It should journal the values, up-call the slots it had not previously fixed, and then journal that those slots are fixed.
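A sketch of that learning path, reusing the UpCallBuffer sketched earlier and otherwise hypothetical types; note that the requester never touches its promise (journalling is elided here):

```java
import java.util.Comparator;
import java.util.List;

// A sketch of the catchup exchange with hypothetical types. Catchup is a pure
// learning mechanism: it must never change anyone's promise.
final class Catchup {
    record CatchupRequest(long highestFixedSlot) {}
    record FixedValue(long slot, byte[] value) {}
    record CatchupResponse(List<FixedValue> fixedAbove) {}

    // Responder: return everything it knows to be fixed above the requester's slot.
    static CatchupResponse respond(CatchupRequest request, List<FixedValue> fixedJournal) {
        return new CatchupResponse(fixedJournal.stream()
                .filter(f -> f.slot() > request.highestFixedSlot())
                .toList());
    }

    // Requester: up-call newly learned slots in order (journalling elided). The
    // promise is deliberately left untouched; changing it here would be a violation.
    static void onResponse(CatchupResponse response, UpCallBuffer upCalls) {
        response.fixedAbove().stream()
                .sorted(Comparator.comparingLong(FixedValue::slot))
                .forEach(f -> upCalls.learned(f.slot(), f.value()));
    }
}
```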

Safety Assertions

How can we be sure that any library implements the Paxos algorithm correctly? This library does the following (a minimal sketch of the first four assertions is shown after the list):

  1. Asserts that we do not update the promise outside of processing a prepare or accept message.
  2. Asserts that we do not increment the fixed slot outside of processing specific "learning" messages.
  3. Asserts that promises only increment.
  4. Asserts that the fixed slot index only increments.
  5. Documents clearly that a journal must flush the disk when the sync method is called.
  6. Documents clearly that the journal must write the promise and progress state last.

It then has randomised network partition simulations that exercise the state space. You are still not guaranteed that a client-server system built on this library will be bug-free. It does mean you can carefully review the protocol, the messaging, the state management, and the assertions that the invariants are not violated, to gain a reasonable degree of confidence that the library is good enough for your specific purposes.

© Simon Massey, 2014 - 2024, CC BY-SA