The first law of distribution is “Don’t Distribute“. If everything can be done on a single machine then nothing beats that. But, lets cut to reality.
Today’s computing and information systems are inherently distributed. Starting from large data centers to cloud storage to your mobile phone, everything is a part of the distributed infrastructure. Distribution, just like distributed development teams is here to stay because of
- Geography – We don’t live in the same location
- Parallelism – speed up computation
- Reliability – replication, and redundancy to avoid loss
- Availability – replication leads to faster access, fault tolerance, and low latency
Even though distributed systems lead to multiple benefits, there are a large number of issues that need to be tackled. Even though every single node of a distributed system will only fail once every few years, with millions of nodes, you can expect a failure every minute. Did not see that coming!
The basis of this failure is that the participants or their communication medium may experience failures.
- Processors are prone to failures because, each participants processor would work at a different speed, may crash, may rejoin after the crash
- The network might result in message loss, delays, lost, reordered, duplicated, there may be a network partition with n/2 nodes forming a separate network.
What are the issues that plague a distributed system?
I) A simple example, One client and One server
A simple scenario with just a single client and a server are fraught with challenges such as Message Loss. The client has no way to know if the message reached the server. Even if it reached the server, was it corrupted or in the original order. For the client to know that the message reached, the server can send out an ACK but what if the ACK is lost?
II) The problem becomes more involved when there are more nodes involved.
Here, Client 1 and Client 2 want to change the value of some variable, say score, on the server which is initially 0(zero)
Client 1 would like to do score=score+10
Client 2 would like to do score=score*5
Depending on Message Delay and the network between the client and server, either the message from Client 1 reaches the server first or second. Hence the value of score is either
score = (0+10)*5 = 50 or score = (0*5)+10 = 10
III) The complexity increases multifold when there are more servers involved and we are looking at State Replication. State replication is a fundamental property for distributed systems.
A set of nodes achieves state replication if all nodes execute a (potentially infinite) sequence of commands c 1 , c 2 , c 3 , . . . , in the same order.
State Replication is synonymous with Blockchain!
In order to achieve State Replication, there are a few ways
Two-Phase Protocol – One way is to have the client get a lock on all the servers. It asks for a lock. If all servers return a lock that they are only tuned in to this client then the client sends the command and releases the lock. Else, it waits for all locks to be available and keeps trying. Here, in the first phase the transaction is prepared, and then in the second, either it is committed or rolled-back.
The pitfall here is that all the servers must be responsive since the client has to get a lock on all of them before issuing the command. Also, if one of the servers crashes then we have an issue since we would be able to get a lock on that!
What if we do not need all the servers to be responsive, what if we were able to execute our command from the client on the basis of getting a lock on a majority (not all) the servers?
This lays the foundation for the Paxos protocol which is used to solve consensus in a network of unreliable processors. We discuss Paxos in the next post.