The first law of distribution is “Don’t Distribute“. If everything can be done on a single machine then nothing beats that. But, lets cut to reality.

Today’s computing and information systems are inherently distributed. Starting from large data centers to cloud storage to your mobile phone, everything is a part of the distributed infrastructure. Distribution, just like distributed development teams is here to stay because of

  • Geography – We don’t live in the same location
  • Parallelism – speed up computation
  • Reliability – replication, and redundancy to avoid loss
  • Availability – replication leads to faster access, fault tolerance, and low latency

Even though distributed systems lead to multiple benefits, there are a large number of issues that need to be tackled. Even though every single node of a distributed system will only fail once every few years, with millions of nodes, you can expect a failure every minute. Did not see that coming!

The basis of this failure is that the participants or their communication medium may experience failures.

  • Processors are prone to failures because, each participants processor would work at a different speed, may crash, may rejoin after the crash
  • The network might result in message loss, delays, lost, reordered, duplicated, there may be a network partition with n/2 nodes forming a separate network.

What are the issues that plague a distributed system?

I) A simple example, One client and One server

Screenshot from 2018-03-01 20-49-31

A simple scenario with just a single client and a server are fraught with challenges such as Message Loss. The client has no way to know if the message reached the server. Even if it reached the server, was it corrupted or in the original order. For the client to know that the message reached, the server can send out an ACK but what if the ACK is lost?

II) The problem becomes more involved when there are more nodes involved.

Screenshot from 2018-03-01 20-53-12

Here, Client 1 and Client 2 want to change the value of some variable, say score, on the server which is initially 0(zero)

Client 1 would like to do score=score+10

Client 2 would like to do score=score*5

Depending on Message Delay and the network between the client and server, either the message from Client 1 reaches the server first or second. Hence the value of score is either

score = (0+10)*5 = 50 or score = (0*5)+10 = 10

III) The complexity increases multifold when there are more servers involved and we are looking at State Replication. State replication is a fundamental property for distributed systems.

Screenshot from 2018-03-01 23-37-29

A set of nodes achieves state replication if all nodes execute a (potentially infinite) sequence of commands c 1 , c 2 , c 3 , . . . , in the same order.

State Replication is synonymous with Blockchain!

In order to achieve State Replication, there are a few ways

Two-Phase Protocol – One way is to have the client get a lock on all the servers. It asks for a lock. If all servers return a lock that they are only tuned in to this client then the client sends the command and releases the lock. Else, it waits for all locks to be available and keeps trying. Here, in the first phase the transaction is prepared, and then in the second, either it is committed or rolled-back.

Screenshot from 2018-03-01 23-47-14

The pitfall here is that all the servers must be responsive since the client has to get a lock on all of them before issuing the command. Also, if one of the servers crashes then we have an issue since we would be able to get a lock on that!

What if we do not need all the servers to be responsive, what if we were able to execute our command from the client on the basis of getting a lock on a majority (not all) the servers?

This lays the foundation for the Paxos protocol which is used to solve consensus in a network of unreliable processors. We discuss Paxos in the next post.


 

knoldus-advt-sticker

About the Author: Vikas Hazrati

Vikas is the Founding Partner @ Knoldus which is a group of software industry veterans who have joined hands to add value to the art of software development. Knoldus does niche Reactive and Big Data product development on Scala, Spark and Functional Java. Knoldus has a strong focus on software craftsmanship which ensures high-quality software development. It partners with the best in the industry like Lightbend (Scala Ecosystem), Databricks (Spark Ecosystem), Confluent (Kafka) and Datastax (Cassandra). To know more, send a mail to hello@knoldus.com or visit www.knoldus.com

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s