\section{Algorithm}\label{section:algorithm}
In this section, we present the extended Raft algorithm, a variant of the Raft algorithm designed for clusters composed of regular servers and at most one witness. The witness functions strictly as a follower: it never transitions to candidate or leader, and it maintains only a minimal set of metadata.
The algorithm minimizes data traffic to the witness and reduces how often the witness must be contacted, while preserving all key properties of the Raft algorithm. In particular, the extended Raft algorithm ensures that each of the following properties is consistently upheld:
\begin{itemize}
\item Election Safety
\item Leader Append-only
\item Log Matching
\item Leader Completeness
\item State Machine Safety
\end{itemize}
In the rest of this document, unless otherwise specified, the term ``server'' denotes either a regular server or a witness. Consequently, the server set $Server$ that constitutes a cluster is either $RegularServer \cup \{witness\}$ or simply $RegularServer$, where $RegularServer$ is the set of all regular servers.
\subsection{Concepts and Definitions}\label{subsection:definition}
\begin{definition}
A leader's \textbf{replication set} is a subset of the cluster server set, with its cardinality matching that of the regular server set.
\end{definition}
Let $ReplicationSets$ represent the set of all replication sets within a cluster. We have:
\begin{displaymath}
ReplicationSets \defeq
\begin{cases}
\{Server\}, & witness \notin Server \\
\{Server \setminus \{x\}: x \in Server\}, & witness \in Server
\end{cases}
\end{displaymath}
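For example, in a cluster with regular servers $s_1$, $s_2$, $s_3$ and a witness $w$, we have $Server = \{s_1, s_2, s_3, w\}$ and
\begin{displaymath}
ReplicationSets = \{\{s_1,s_2,s_3\},\ \{s_1,s_2,w\},\ \{s_1,s_3,w\},\ \{s_2,s_3,w\}\},
\end{displaymath}
each member having cardinality $|RegularServer| = 3$.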
A leader maintains its replication set based on its view of the servers and modifies it under specific conditions. A leader may alter its replication set multiple times within a single term. To uniquely identify each replication set, we introduce the notion of a ``subterm'' as detailed below.
\begin{definition}
A term is segmented into \textsc{subterms}, each of which begins with a replication set.
\end{definition}
Subterms are sequentially numbered using consecutive integers, beginning from $0$ for each term. The leader retains its current subterm and increments it when the replication set is altered. Consequently, a leader's replication set remains static throughout its subterm.
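For example, if a leader adjusts its replication set twice during term $3$, that term consists of subterms $0$, $1$, and $2$, and the pair $(3, 1)$ uniquely identifies the replication set the leader used during its second subterm.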
\subsection{States}\label{subsection:states}
Figure \ref{fig:algo-state} illustrates the additional state variables introduced by the extended Raft algorithm to accommodate the witness.
\begin{figure}[htbp]
\begin{framed}
\input{algo/state}
\end{framed}
\caption{State}
\label{fig:algo-state}
\end{figure}
Each leader maintains three additional volatile variables: $replicationSet$, $currentSubterm$, and $witnessSubterm$. The $replicationSet$ variable represents the leader's current replication set; it is initialized to $RegularServer$ and modified by the $AdjustReplicationSet$ action. The $currentSubterm$ variable represents the leader's latest subterm; it is initialized to $0$ and incremented whenever the leader's replication set changes. The $witnessSubterm$ variable denotes the latest subterm during which the leader replicated a log entry to the witness.
In addition to $currentTerm$ and $votedFor$, the witness also maintains $witnessReplicationSet$, $witnessLastLogTerm$, and $witnessLastLogSubterm$. These represent the most recent replication set, log entry term, and log entry subterm that the leader has sent to the witness. We discuss these further in Section \ref{subsection:log-replication}.
Beyond the additional states in the leader and witness, each log entry is also associated with the leader's current subterm number when it is appended to the leader's log (Figure \ref{fig:client-request}). This subterm number is replicated and persisted alongside the log entry on all regular servers. We now denote a log entry as $\entry{index,term,subterm}$, which is uniquely identified by $index$ and $term$, and associated with $subterm$.
\begin{figure}
\begin{framed}
\input{algo/ClientRequest}
\end{framed}
\caption{Client request}
\label{fig:client-request}
\end{figure}
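To make the additional state concrete, the following Go sketch mirrors the variables above; it is an illustration only, and the type and field names are our own assumptions rather than part of the specification.
\begin{verbatim}
// Illustrative Go sketch of the additional state; type and field
// names are assumptions, not part of the specification.
package extraft

type ServerID string

// LeaderState holds the volatile variables a leader maintains in
// addition to the standard Raft leader state.
type LeaderState struct {
    ReplicationSet map[ServerID]bool // initialized to RegularServer
    CurrentSubterm int               // incremented on set change
    WitnessSubterm int               // last subterm acked by witness
}

// WitnessState holds the minimal metadata the witness persists;
// the witness never stores the log itself.
type WitnessState struct {
    CurrentTerm           int
    VotedFor              ServerID
    WitnessReplicationSet map[ServerID]bool
    WitnessLastLogTerm    int
    WitnessLastLogSubterm int
}

// Entry is a log entry extended with the subterm in which it was
// appended to the leader's log.
type Entry struct {
    Index   int
    Term    int
    Subterm int
    Command []byte
}
\end{verbatim}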
\subsection{Log Replication} \label{subsection:log-replication}
The leader replicates its log to regular servers exactly as in the Raft algorithm, and regular servers handle the received log entries in the same way as well.
However, for the witness, log replication is performed differently, as shown in Figure \ref{fig:append-entries-to-witness}. In the extended Raft algorithm, the leader sends an $AppendEntriesToWitnessRequest$ to the witness, which includes the current replication set and the metadata of the log entry that satisfies the following conditions (a Go sketch follows the list):
\begin{enumerate}
\item\label{cond:in-subterm} The log entry's term and subterm are equal to $currentTerm$ and $currentSubterm$, respectively.
\item The leader's current replication set includes the witness.
\item The leader has received acknowledgments for the log entry from at least a subquorum (one server away from forming a quorum) in the leader's current replication set.
\item\label{cond:first-in-subterm} The leader has not received any acknowledgment from the witness during the current subterm.
\end{enumerate}
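To make these conditions concrete, here is a minimal Go sketch, continuing the types from the state sketch above; the function and parameter names ($ackedBy$, $clusterSize$, and so on) are our own assumptions, not part of the specification.
\begin{verbatim}
// subquorumAcked reports whether at least quorum-1 members of the
// replication set have acknowledged the entry; the leader counts
// itself among the acknowledgers. quorum is a cluster majority.
func subquorumAcked(replicationSet, ackedBy map[ServerID]bool,
    clusterSize int) bool {
    acks := 0
    for id := range replicationSet {
        if ackedBy[id] {
            acks++
        }
    }
    quorum := clusterSize/2 + 1
    return acks >= quorum-1
}

// shouldReplicateToWitness checks conditions (1)-(4) for entry e.
func (l *LeaderState) shouldReplicateToWitness(e Entry,
    currentTerm int, witness ServerID,
    ackedBy map[ServerID]bool, clusterSize int) bool {
    return e.Term == currentTerm &&
        e.Subterm == l.CurrentSubterm && // (1) current term/subterm
        l.ReplicationSet[witness] && // (2) witness in the set
        subquorumAcked(l.ReplicationSet, ackedBy, clusterSize) && // (3)
        l.WitnessSubterm < l.CurrentSubterm // (4) no witness ack yet
}
\end{verbatim}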
\begin{figure}[htbp]
\begin{framed}
\input{algo/AppendEntriesToWitness}
\end{framed}
\caption{AppendEntriesToWitness}
\label{fig:append-entries-to-witness}
\end{figure}
The $AppendEntriesToWitness$ action can be considered a special implementation of the $AppendEntries$ action that batches the log prefix up to a specific log entry. This log entry belongs to the current subterm and has received subquorum acknowledgments from the current replication set. In addition to the message fields that also exist in $AppendEntriesRequest$, the dispatched $AppendEntriesToWitnessRequest$ message also contains the current replication set and the metadata of the last batched entry.
Upon receiving the request, the witness persists the replication set along with the term and subterm of the log entry, and then responds to the leader with an $AppendEntriesResponse$ message (Figure \ref{fig:handle-append-entries-to-witness}).
\begin{figure}
\begin{framed}
\input{algo/HandleAppendEntriesToWitnessRequest}
\end{framed}
\caption{Handle AppendEntriesToWitness}
\label{fig:handle-append-entries-to-witness}
\end{figure}
Note that the $mlog$ and $mentries$ fields in the message are used solely for the proof; they do not exist in an actual implementation, and the witness does not persist the full log prefix in its durable storage. Likewise, the assignment of $m.mentries$ to $log$ in $HandleAppendEntriesToWitnessRequest$ exists only for proof purposes.
When the leader receives an acknowledgment from the witness, it also updates its $witnessSubterm$ to the current subterm if the acknowledged log entry belongs to the current subterm. Then, according to condition \ref{cond:first-in-subterm}, the leader sends no further $AppendEntriesToWitnessRequest$ messages in the current subterm. Instead, the leader treats subsequent log entries as already acknowledged by the witness if they fulfill all conditions except condition \ref{cond:first-in-subterm}. This is referred to as \textbf{Shortcut Replication}: the leader sends an $AppendEntriesResponse$ to itself on behalf of the witness, as sketched below.
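A minimal Go sketch of Shortcut Replication, continuing the earlier listings (all names remain our own assumptions):
\begin{verbatim}
// onWitnessAck records that the witness has acknowledged an entry.
// If the entry belongs to the current subterm, condition (4) is now
// spent for this subterm.
func (l *LeaderState) onWitnessAck(entrySubterm int) {
    if entrySubterm == l.CurrentSubterm {
        l.WitnessSubterm = l.CurrentSubterm
    }
}

// witnessAckImplied implements Shortcut Replication: once the
// witness has acknowledged one entry in the current subterm, any
// later entry that fulfills conditions (1)-(3) counts as
// acknowledged by the witness, and the leader answers itself on
// the witness's behalf instead of contacting the witness again.
func (l *LeaderState) witnessAckImplied(e Entry, currentTerm int,
    witness ServerID, ackedBy map[ServerID]bool,
    clusterSize int) bool {
    return e.Term == currentTerm &&
        e.Subterm == l.CurrentSubterm &&
        l.ReplicationSet[witness] &&
        subquorumAcked(l.ReplicationSet, ackedBy, clusterSize) &&
        l.WitnessSubterm == l.CurrentSubterm
}
\end{verbatim}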
Note that the log entry in $AppendEntriesToWitness$ must belong to the current term and subterm. Therefore, when the leader advances its subterm (as a consequence of a replication set adjustment), it must append a new empty log entry to its log; this ensures the leader is not impeded by condition \ref{cond:in-subterm} when committing log entries.
The leader immediately commits a log entry during its term once the entry is acknowledged by a quorum, in the same way as in Raft. An acknowledgment comes either from a regular server or from the witness. Since the leader sends an $AppendEntriesToWitnessRequest$ to the witness only after receiving subquorum acknowledgments from regular servers, any log entry acknowledged by the witness is committed immediately.
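For example, in a cluster of four regular servers and one witness, a quorum is three servers. Once two regular servers (the leader included) have acknowledged an entry of the current subterm, the leader contacts the witness; the witness's acknowledgment is the third, so the entry commits immediately.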
\subsection{Replication Set Adjustment} \label{subsection:replication-set-adjustment}
The leader maintains the replication set and modifies it whenever necessary. This process is referred to as \textbf{Replication Set Adjustment}. Figure \ref{fig:adjust-replication-set} illustrates a straightforward method for adjusting the replication set. Although practical implementations may adjust a replication set in multiple ways, this simple method suffices in the algorithm to minimize atomic regions without loss of generality.
\begin{figure}
\begin{framed}
\input{algo/AdjustReplicationSet}
\end{framed}
\caption{Replication set adjustment}
\label{fig:adjust-replication-set}
\end{figure}
To commit, the leader's replication set must contain at least a subquorum of healthy regular servers whenever it includes the witness. Therefore, in a practical implementation, replication set adjustment should be performed such that the resulting replication set includes as many healthy regular servers as possible. To achieve this, the leader can track the status of each peer regular server by monitoring their responses. If the leader receives no response from a regular peer server within an election timeout, it assumes that server is unreachable and initiates a replication set adjustment. Let $Reachable$ and $Unreachable$ denote the sets of regular servers that are, from the leader's perspective, reachable and unreachable, respectively. A practical replication set adjustment can be implemented as below (a Go sketch follows the list):
\begin{enumerate}
\item When a regular server becomes a leader, its replication set is initialized to $RegularServer$, the set of all regular servers in the leader's configuration.
\item If all servers in $RegularServer$ are reachable from the leader, the leader changes its replication set to $RegularServer$, i.e.,
\begin{displaymath}
(\forall s \in RegularServer : s \in Reachable) \implies ReplicationSet' = RegularServer
\end{displaymath}
\item The leader swaps one unreachable regular server inside its replication set with a reachable regular server or the witness outside it, i.e.,
\begin{displaymath}
\begin{aligned}
& \exists x \in ReplicationSet, y \in Server \setminus ReplicationSet :
\!\begin{aligned}[t]
& \land x \in Unreachable \\
& \land
\!\begin{aligned}[t]
& \lor y \in Reachable \\
& \lor y = witness
\end{aligned} \\
\end{aligned} \\
& \qquad \implies ReplicationSet' = Server \setminus \{x\}
\end{aligned}
\end{displaymath}
\item The leader increments its $currentSubterm$ and appends a new empty log entry whenever its replication set changes.
\end{enumerate}
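The following Go sketch, continuing the earlier listings, shows one possible implementation of this policy; the reachability inputs and helper names are assumptions rather than part of the algorithm.
\begin{verbatim}
// adjustReplicationSet applies rules 2-4 above. regular is the
// configured set of regular servers, reachable the leader's
// current view of them (the leader itself is always reachable).
func (l *LeaderState) adjustReplicationSet(
    regular map[ServerID]bool, witness ServerID,
    reachable map[ServerID]bool) {
    allReachable := true
    for id := range regular {
        if !reachable[id] {
            allReachable = false
        }
    }
    changed := false
    if allReachable && l.ReplicationSet[witness] {
        // Rule 2: fall back to the all-regular replication set.
        l.ReplicationSet = make(map[ServerID]bool, len(regular))
        for id := range regular {
            l.ReplicationSet[id] = true
        }
        changed = true
    } else if !allReachable {
        // Rule 3: swap one unreachable member for a reachable
        // regular server outside the set, or for the witness.
        for x := range l.ReplicationSet {
            if x == witness || reachable[x] {
                continue
            }
            for y := range regular {
                if !l.ReplicationSet[y] && reachable[y] {
                    delete(l.ReplicationSet, x)
                    l.ReplicationSet[y] = true
                    changed = true
                    break
                }
            }
            if !changed && !l.ReplicationSet[witness] {
                delete(l.ReplicationSet, x)
                l.ReplicationSet[witness] = true
                changed = true
            }
            break
        }
    }
    if changed {
        // Rule 4: start a new subterm; the caller must also append
        // a new empty log entry (elided here).
        l.CurrentSubterm++
    }
}
\end{verbatim}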
\subsection{Leader Election}\label{subsection:leader-election}
When a regular server becomes a candidate, it requests votes from all peer servers, including the witness. The extended Raft algorithm follows exactly the same rules as Raft for requesting and granting votes between a candidate and a regular server. However, the candidate does not request a vote from the witness until it has received subquorum votes from regular servers. Figure \ref{fig:request-witness-vote} describes the new $RequestWitnessVote$ action, by which a candidate requests a vote from the witness.
\begin{figure}
\begin{framed}
\input{algo/RequestWitnessVote}
\end{framed}
\caption{RequestWitnessVote}
\label{fig:request-witness-vote}
\end{figure}
Considering that the witness does not persist the full log prefix, the extended Raft algorithm employs a new set of rules for the witness in leader elections. Figure \ref{fig:handle-request-witness-vote} describes the new action for the witness to handle a vote request from a candidate. The $RequestWitnessVoteRequest$ message includes the $term$ and $subterm$ of the last log entry in the candidate, as well as the votes the candidate has already received in the current election.
\begin{figure}
\begin{framed}
\input{algo/HandleRequestWitnessVoteRequest}
\end{framed}
\caption{Handle RequestWitnessVote}
\label{fig:handle-request-witness-vote}
\end{figure}
The witness votes for a candidate if it has not cast a vote in the current term and any of the following conditions holds (a Go sketch follows the list):
\begin{itemize}
\item The candidate's last log entry has a larger term, i.e., $m.mlastLogTerm > witnessLastLogTerm$.
\item The candidate's last log entry has the same term but a larger subterm, i.e., $m.mlastLogTerm = witnessLastLogTerm \land m.mlastLogSubterm > witnessLastLogSubterm$.
\item The candidate's last log entry has the same term and subterm as the witness's, and all the subquorum votes it has received are from regular servers in the witness's replication set, i.e.,
\begin{displaymath}
\begin{aligned}
& \land\ m.mlastLogTerm = witnessLastLogTerm \\
& \land\ m.mlastLogSubterm = witnessLastLogSubterm \\
& \land\ m.mvotesGranted \subseteq witnessReplicationSet
\end{aligned}
\end{displaymath}
\end{itemize}
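The following Go sketch, continuing the earlier listings, captures this voting rule; the $WitnessVoteRequest$ type and its fields are assumptions modeled on the $RequestWitnessVoteRequest$ message described above.
\begin{verbatim}
// WitnessVoteRequest carries the candidate's last log term and
// subterm and the votes it has already been granted by regular
// servers in the current election.
type WitnessVoteRequest struct {
    Term           int
    CandidateID    ServerID
    LastLogTerm    int
    LastLogSubterm int
    VotesGranted   map[ServerID]bool
}

// grantVote decides whether the witness votes for the candidate.
// It assumes VotedFor is reset whenever CurrentTerm advances.
func (w *WitnessState) grantVote(m WitnessVoteRequest) bool {
    if m.Term < w.CurrentTerm ||
        (w.VotedFor != "" && w.VotedFor != m.CandidateID) {
        return false // stale term, or already voted this term
    }
    switch {
    case m.LastLogTerm > w.WitnessLastLogTerm:
        return true // candidate's last entry has a newer term
    case m.LastLogTerm == w.WitnessLastLogTerm &&
        m.LastLogSubterm > w.WitnessLastLogSubterm:
        return true // same term, newer subterm
    case m.LastLogTerm == w.WitnessLastLogTerm &&
        m.LastLogSubterm == w.WitnessLastLogSubterm:
        // Same term and subterm: every vote the candidate holds
        // must come from the witness's replication set.
        for id := range m.VotesGranted {
            if !w.WitnessReplicationSet[id] {
                return false
            }
        }
        return true
    }
    return false
}
\end{verbatim}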
\subsection{Membership Reconfiguration}
In the Raft algorithm, cluster configuration changes can be performed with either a simple or a complex approach: the simple approach changes one server at a time, while the complex approach allows arbitrary configuration changes at once.
The key to the safety of configuration changes in Raft is preserving overlap between any majority of the old cluster and any majority of the new cluster. This overlap prevents the cluster from splitting into two independent majorities that could elect two leaders in the same term, since otherwise an overlapping server would have to vote for two candidates in one term, which contradicts the specification. The same argument applies to the extended Raft algorithm, because a new leader also needs a quorum of voters to succeed in an election. Therefore, it is safe to apply the same membership reconfiguration methods from the Raft algorithm to the extended Raft algorithm, either with a sequence of single-server membership changes or with an intermediate joint configuration that overlaps both the old and new membership configurations.