In distributed systems, we solve problems that are hard to solve using a single computer, using multiple systems working together to solve a problem. This makes distributed systems hard to wrap your head around as there are a lot of factors that come into play.
Two Generals Problem
Let’s consider the two generals problem:
Two generals (with their armies) are trying to conquer a city. They must attack at the same time for the attack to be successful. The issue is that they can only communicate through messengers.
How can we make sure that they attack at the same time?
Let’s say G1 sends a messenger to G2 with a time of attack.
Scenario 1: Initial message gets lost
┌─────────┐ ┌─────────┐
│ General │ ──── "Attack" ──── │ General │
│ 1 │ at 3 PM │ 2 │
└─────────┘ ✗ └─────────┘
│ │
│ │
Waits... Waits...
(No attack) (No attack)
Scenario 2: Acknowledgement gets lost
┌─────────┐ ┌─────────┐
│ General │ ──── "Attack" ───→ │ General │
│ 1 │ at 3 PM │ 2 │
└─────────┘ └─────────┘
│ ✗ ←─── ACK ──────────│
│ │
Uncertain Will attack
(May not attack) (Expects G1)
- Scenario 1: Messenger gets lost on the way.
- Scenario 2: Messenger reaches G2 but the messenger from G2 carrying the acknowledgement gets lost on the way back.
Both of these scenarios are the same from G1’s perspective but from G2’s perspective in scenario 2 it’s supposed to attack but G1 does not know.
This simple example shows that it’s practically impossible to have 2 systems agree on something with certainty without making assumptions about the messenger, the general and the time taken by the messenger (The time of attack could already be in the past by the time a slow messenger reaches).
Byzantine Generals Problem
Now let’s consider a slightly more complex problem.
In a byzantine generals problem, we have 2 or more generals with some of them being traitors (they send different messages to different servers or relay the message from the other generals incorrectly).
To keep things simple, let’s assume that the messengers are reliable in this case.
Our goal is to have the correct generals agree on a plan. The theorem says that for an f
number of byzantine (lying or corrupt) nodes, we need 3f+1
total number of nodes to have the correct nodes agree on a decision.
So for 1 byzantine node we’ll need a total of 4 nodes and so on.
Byzantine Generals Problem (f=1, requires 3f+1=4 nodes)
┌─────────┐ ┌─────────┐
│General 1│◄─────────►│General 2│
│(Honest) │ │(Honest) │
└─────────┘ └─────────┘
│ │
│ ┌─────────┐ │
└─────►│General B│◄───┘
│(Traitor)│
└─────────┘
│
▼
┌─────────┐
│General 3│
│(Honest) │
└─────────┘
Byzantine node sends conflicting messages:
• To G2: "Attack"
• To G3: "Retreat"
• G1 says: "Attack"
Result: Honest nodes can still reach consensus
because majority (3 out of 4) are honest.
System Models
These two problems demonstrate the need for a backdrop / system model when we’re working with consensus problems (multiple nodes trying to agree on something).
A systems model for distributed system has 3 dimensions. i.e. it makes 3 assumptions about the system the distributed algorithm is run on:
- Network Behavior:
- Reliable: Every network packet is delivered (Could be ordered wrong)
- Fair-loss: Every network packet is eventually delivered.
- Arbitrary: Network packets might be subject to malicious actors who might drop or corrupt the packet.
- Node Behavior:
- Crash-stop/Fail-stop: Nodes that crash never come back up again.
- Crash-recovery/Fail-recovery: Nodes that crash might be rebooted but will lose their volatile memory (RAM).
- Byzantine: Nodes might deviate from the algorithm without telling other nodes and actively try to sabotage the system.
- Timing:
- Synchronous: Every message takes a known amount of time.
- Asynchronous: No timing assumptions can be made about message delivery.
- Partially Synchronous: System is asynchronous for a finite time (unknown to system) and then synchronous.
Network behavior can be moved between different states by using techniques such as applying TLS to Arbitrary Network makes it Fair-loss which in turn can be turned into Reliable by retrying and de-duping the same requests.
These properties help us set a backdrop before building a distributed algorithm. For example, in a system, where we’re expecting reliable network, we do not need to worry about the networking issues. Although, these assumptions should be made carefully and should reflect the real-world situations to be effective.