Lecture Given by Lindsey Kuper on April 1st, 2020 via YouTube
Leslie Lamport gives the rather comical definition that:
"A distributed system is one in which I can't get my work done because a computer I've never heard of has crashed"
Although he was joking, this definition captures a very important aspect of a distributed system: it is a system characterised by some type of failure.
Martin Kleppmann's definition of a distributed system is somewhat more serious (he's the author of a book called Designing Data-Intensive Applications).
A distributed system runs on several nodes (computers) and is characterised by partial failure
However, other definitions of a distributed system include ideas such as:
- Systems where work is distributed fairly between nodes, or
- Where multiple computers "behave as one"
These last definitions, however, are rather too optimistic because they do not account for real-life difficulties - one of which is mentioned in Martin Kleppmann's definition.
Partial failure is where some component of a computation fails, but that failure is not necessarily fatal. For instance, one machine in a cluster of 1000 could fail without there being any significant impact on the overall "system". In other words, the presence of this type of partial failure is non-fatal to the operation of your system as a whole.
The problem with HPC is that it treats partial failure as total failure. Consequently, HPC must rely on techniques such as check-pointing. This is where the progress of the calculation is saved at regular intervals so that in the event of a failure, the computation can be continued from the last check point without needing to start over from the beginning.
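To make the idea concrete, here is a minimal sketch of checkpointing (not from the lecture; the file name, the loop, and the save interval are invented for illustration), written in Python:

```python
import json
import os

CHECKPOINT_FILE = "progress.json"   # hypothetical checkpoint location

def load_checkpoint():
    """Resume from the last saved step, or start from scratch."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {"step": 0, "total": 0}

def save_checkpoint(state):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump(state, f)

state = load_checkpoint()
for step in range(state["step"], 1_000_000):
    state["total"] += step            # stand-in for the real computation
    state["step"] = step + 1
    if step % 10_000 == 0:            # save progress at regular intervals
        save_checkpoint(state)
save_checkpoint(state)
```

If the process dies part-way through, restarting it picks up from the last saved step rather than from the beginning.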
However, in computing in general, and in cloud computing specifically, it is expected that various parts of the system will fail. Consequently, handling this type of behaviour is a fundamental part of the software design.
Had Ben Franklin been a programmer, he would probably have said:
"In this world, nothing can be said to be certain, except death, taxes and partial failure"
Failure can occur in many ways. For instance:
- Network partition
- Hardware failure
- Software failure
In a minimal cluster of two computers `M1` and `M2`, `M1` wants to ask `M2` for the value of a particular variable:
- `M1` sends a message to `M2` saying "What's the value of `x`?"
- `M2` should then respond with another message saying "`x=5`"
But in this very simple scenario, the number of possibilities for failure is still very high:
- `M1`'s message might get lost due to network failure
- `M1`'s message is delivered very slowly due to unexpectedly high network latency
- `M2` could be down
- `M2` might crash immediately after reporting to `M1` that it is alive, but before receiving `M1`'s message
- `M2` might crash as a result of trying to respond to `M1`'s message
- `M2` might refuse to respond to the question due to some type of authentication or authorisation failure
- `M2` responds correctly, but the message never gets back to `M1`
- Some random external event such as a cosmic ray flips a bit in the message thus corrupting it (Byzantine error)
And on and on...
Although the underlying causes are very different, as far as `M1` is concerned, they all look the same.
All `M1` can tell is that it never got an answer to its question.
In fact, it is impossible to determine the cause of such a failure without first having global knowledge of the entire system.
It is often assumed that `M1` must wait for some predefined timeout period, after which it should assume failure if it does not receive an answer.
But the problem is that network latency is indeterminate; therefore, waiting for an arbitrary timeout period might help to trap some errors, but in general, it is not a good strategy for determining failure.
For example, instead of asking "What is the value of `x`?", `M1` could ask `M2` to "Add 1 to `x`".
Q: How will `M1` know this request was successfully processed? Should it simply wait for a predetermined timeout period and if it hears nothing back, assume everything's fine?
A: No, `M1` can only discover the success or failure of its request when it receives some sort of acknowledgement back from `M2`.
If `M1` does not receive a response within its timeout period, what should it conclude?
We can see that without any additional information, it is not correct for `M1` to assume either success or failure.
All `M1` can say with any certainty is that from its point of view, the state of variable `x` in `M2` is indeterminate.
If the maximum network delay for message transmission is `D` seconds and the maximum time a machine spends processing a request is `R` seconds, then the upper bound of the message handling timeout should be `2D + R` (two network journeys (send and receive) plus the remote machine's processing time).
This would rule out the uncertainty for timeouts in a slow network, but still leave us completely unable to reason about any other types of failure.
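As a rough sketch of how `M1` might apply this bound in practice (none of this is from the lecture; the host, port, wire format, and the values of `D` and `R` are assumptions made up for illustration), a Python client could set its socket timeout to `2D + R` and still be unable to tell *why* a reply never arrived:

```python
import socket

D = 0.1   # assumed maximum one-way network delay, in seconds
R = 0.5   # assumed maximum time M2 spends processing a request, in seconds
TIMEOUT = 2 * D + R   # upper bound: send + receive journeys plus processing time

def ask_for_x(host="m2.example", port=9000):
    """Ask M2 for the value of x; the address and wire format are hypothetical."""
    try:
        with socket.create_connection((host, port), timeout=TIMEOUT) as sock:
            sock.settimeout(TIMEOUT)
            sock.sendall(b"GET x\n")
            reply = sock.recv(1024)          # e.g. b"x=5\n"
            return reply.decode().strip()
    except (socket.timeout, OSError):
        # We cannot tell which failure happened (lost request, slow network,
        # crashed M2, lost reply, ...); all M1 knows is that no answer arrived.
        return None
```

Whatever the underlying cause, the only observable outcome on `M1`'s side is "no answer within the timeout", so the state of `x` on `M2` remains indeterminate.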
In distributed systems, we must deal not only with the problem of "partial failure", but also with the problem of "unbounded latency" (a definition given by Peter Alvaro, one of Lindsey Kuper's colleagues).
Q: Given they're so hard to debug, why would you want to use a distributed system?
A: Well, largely because we have no choice. All manner of external factors can force us to distribute either a computation or storage (or both) across multiple machines…
- Too much data to fit on a single machine
- Need a faster response time, so we have to throw more processing power at the problem
- Scalability (need to handle more data, more users, or simply need more CPUs)
- Fault Tolerance (redundancy)
Computers use clocks to identify specific points in time (both past and present). For example:
- This class starts at 09:00
- This item in the cache expires tomorrow at 4:26 PM
- This log event happened yesterday at 11:31:34.352
Computers also use clocks for reasoning about time intervals:
- This class is 65 minutes long
- This access code will expire in 30 seconds' time
- This user spent 4 minutes 38 seconds on our website
Computers have two types of clock, time-of-day clocks and monotonic clocks.
- Time-of-day clocks are typically synchronised across different machines using NTP
- Time-of-day clocks are bad for measuring intervals between events taking place on different machines:
- The clocks on the different machines may not be synchronised correctly and may disagree on the exact size of a time interval
- There are several cases where the time-of-day clock can jump:
- Daylight saving time just started
- A leap second happens
- NTP resets the machine's clock to an earlier value
- Time-of-day clocks are fine for timestamping events that only require a low degree of accuracy, but they are quite inadequate in situations where a high degree of accuracy is needed
There is a general principle in timekeeping: the more accurately you need to measure time, the harder that task becomes. If you now try to get two or more computers to agree on what the time is, then the difficulty is only compounded.
A value is said to be "monotonic" if it only ever changes in one direction. So, a monotonic clock is one that will never jump backwards; its counter is guaranteed only to get bigger.
The value of a monotonic clock is:
- simply a counter that only ever gets bigger (E.G. microseconds since system boot)
- useless as a timestamp, because it has no meaning outside the machine on which it exists. Therefore, it makes no sense to compare monotonic clock values between different machines.
- ideally suited for measuring time intervals on a single machine (E.G. how long did it take to compress that image)
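As a small illustration of the difference (a sketch assuming Python's standard `time` module; the workload is just a stand-in), the two kinds of clock are read through different calls:

```python
import time

# Time-of-day clock: meaningful as a timestamp, but it can jump
# (NTP adjustment, leap second, manual change), so it is risky for intervals.
started_at = time.time()           # seconds since Jan 1st, 1970 (the Unix epoch)

# Monotonic clock: meaningless as a timestamp (its zero point is arbitrary,
# e.g. system boot), but guaranteed never to go backwards, so it is the
# right tool for measuring how long something took on this one machine.
t0 = time.monotonic()
do_some_work = lambda: sum(range(1_000_000))   # stand-in for real work
do_some_work()
elapsed = time.monotonic() - t0

print(f"wall-clock timestamp: {started_at:.0f}")
print(f"work took {elapsed:.6f} seconds")
```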
Time-of-day and monotonic clocks are both physical clocks. This means they are concerned with quantifying time intervals between now and some fixed reference point in the past. Common reference points are Jan 1st, 1970 (for most date calculations) and when the machine was started.
Q: So how do we mark points in time that are valid across multiple machines?
A: If we attempt to use any sort of physical clock, then it's very tricky.
If we think we can solve this problem by repeatedly asking a physical clock "What's the time?", then in fact, we are asking the wrong question because we have misunderstood the problem.
Machines in a distributed system don't need to know what the time is in any absolute sense, all they need to be able to do is answer the question "Which event happened first?". This is why distributed systems use a very different notion of what a clock is; they use something called a Logical Clock (a.k.a. a Lamport Clock).
Counter-intuitively, a logical clock measures neither the time of day nor the elapsed interval between two events; instead, it measures nothing more than the ordering of events.
At first, this concept is not easy to grasp because it goes against our firmly established notion of what a clock is (or ought to be). But in order to answer the all-important question "Which event happened first?" we don't need to reference the time of day, we just need some sort of counter that clicks up every time an event occurs. This type of "clock" is very important for several reasons:
- Communication is unreliable
  This means that the length of time taken for message transmission is unpredictable (a phenomenon known as "unbounded latency")
- Unpredictable changes of state
  E.G. the last time you asked machine `M2` what the value of `x` was, it said `x=5`; but in the meantime, some other machine has changed `x` to `6` and hasn't told you…
In these situations, it is vital to know the order in which events occurred.
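As a hedged sketch of what such a counter can look like (the class and method names are invented; this is not code from the lecture), a Lamport clock only needs to tick on local events and to jump forward whenever a message with a larger timestamp arrives:

```python
class LamportClock:
    """A minimal logical (Lamport) clock: it orders events, it does not tell the time."""

    def __init__(self):
        self.counter = 0

    def local_event(self):
        # Any event on this machine ticks the clock forward.
        self.counter += 1
        return self.counter

    def send_event(self):
        # Tick, then stamp the outgoing message with the new value.
        self.counter += 1
        return self.counter

    def receive_event(self, message_timestamp):
        # Jump ahead of whatever the sender had seen, then tick.
        self.counter = max(self.counter, message_timestamp) + 1
        return self.counter
```

With this scheme, if one event happens before another, the first is guaranteed to carry the smaller timestamp; the converse does not hold, which matches the caveat about causality discussed next.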
Let's say we have two database events `A` and `B`, and we know that `A` happened before `B`.
We denote this relationship by the syntax `A -> B`, which is pronounced "`A` happens before `B`".
But what does this tell us about causality?
All we can really say is that event `A` might have been the cause of event `B`, but we cannot be certain about this.
All we can say with absolute certainty is that since `B` happened after `A`, it is impossible for `B` to have been the cause of `A`.
Questions about causality are very important in distributed systems because they help us solve problems such as debugging.
If `M1` thinks `x=1`, but `M2` thinks `x=5`, then by knowing when the value of `x` changed in each machine, we can rule out those events that did not contribute to the problem.
This greatly helps narrow down the cause of the problem.
This determinism is also useful when designing systems in which users see a sequence of events.
You know that if event `A` happened before event `B`, then you can ensure that your system will never display event `B` to the user before they have first seen event `A`.