Akka Cluster in use (Part 8): What is a Split Brain?

In our previous blog post, Manually Healing an Akka Cluster, we saw that careless handling of failures in an Akka Cluster can lead to disastrous situations like Split Brain. In this blog post we will learn what a Split Brain is, what causes it, and what its consequences are.

What is Split Brain?

Split Brain is a destructive condition of a cluster. It occurs when an Akka Cluster gets divided into 2 or more separate clusters. For example, the nodes in the above diagram were initially part of a single cluster. But due to some failure and improper management, they got split into 2 clusters: an orange cluster which contains Nodes 1 & 3, and a green cluster which contains Node 2. Each cluster operates fully independently, without knowing that the other cluster exists.

Reason behind Split Brain

If we recall our previous blog post, simple failures like a node crash, OOM errors, etc. can be handled pretty easily. We just have to mark the crashed node(s) as Down. Since the node(s) have crashed, or I should say terminated, it is safe to mark them as Down. They will be removed from the cluster without any hassle.

However, if a node has not crashed and is still marked as Down, then it may form its own independent cluster. For example, in the above diagram, Node 2 was marked as Unreachable due to a network partition or a delay in gossip. Then, instead of verifying why it was Unreachable, the DevOps team manually marked Node 2 as Down. This resulted in 2 different clusters: Node 2 formed a cluster of its own, and Nodes 1 & 3 formed a cluster of their own.
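One way to avoid the mistake described above is to check whether the Unreachable node's process is still answering before downing it. The following is only a sketch under assumptions: the helper names `node_responds` and `guarded_down` are hypothetical, and it assumes the local sample cluster used later in this post, where each node exposes Akka Management on localhost (8558 for node 2551, 8559 for node 2552). The down request itself is the same curl call used later in this post.

```shell
# Hypothetical helpers, not part of Akka or of the sample project.

# node_responds <target_mgmt_port>
# Succeeds if the node's management endpoint returns any HTTP response
# within 2 seconds, which means the process is still alive.
node_responds() {
  curl -s -o /dev/null --max-time 2 "http://localhost:$1/cluster/members"
}

# guarded_down <observer_mgmt_port> <target_mgmt_port> <target_address>
# Asks the observer node to mark the target as Down, but only if the
# target no longer answers on its own management port.
guarded_down() {
  if node_responds "$2"; then
    echo "refusing to down $3: its process is still answering" >&2
    return 1
  fi
  curl -w '\n' -X PUT -H 'Content-Type: multipart/form-data' -F operation=down \
    "http://localhost:$1/cluster/members/$3"
}
```

Note that the check only catches the clear case where the node is demonstrably alive. A node that is cut off by a network partition may not answer either, so silence alone does not prove that it has crashed.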

Causing Data Inconsistency

Split Brain can have a significant impact on data consistency. How?

We all know that Cluster Sharding and Cluster Singleton are used to enforce data consistency in an Akka Cluster. They rely on exactly one instance of a given actor running in the cluster, which gives the user the illusion of a single thread in a distributed environment. However, in case of Split Brain, a cluster is divided into 2 or more different clusters. Since each cluster will have its own copy of the actors (shards), it can lead to data inconsistency.

For example, in the above diagram, Node 1 is hosting Shard A, so all the requests related to Shard A should be routed to Node 1. However, Node 2 has not terminated, and because of the network partition it does not know that Node 1 exists, so it hosts its own copy of Shard A. Hence Node 2 might also process some requests related to Shard A. This can lead to data inconsistency in the application.

So, when data consistency is important, we should be very careful about marking Unreachable nodes as Down.

Experiencing Split Brain

Now that we know what a Split Brain is, let’s experience it on a sample cluster. First, form a cluster by following the steps mentioned in the blog: Setup a Local Akka Cluster. Once the cluster is formed, add $100 to a sample account on each instance ($200 total).

Next, to create a network partition scenario, let us pause the first node of the cluster:

$ ./pauseNode.sh 2551

Now, let’s check the cluster membership on the second node (http://localhost:8559/cluster/members) to verify that the first node has become unreachable.
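The same check can be done from the command line. This is a minimal sketch, assuming the management ports used in this post (8558 for the first node, 8559 for the second); the function names `members_url` and `cluster_members` are hypothetical helpers introduced here for illustration.

```shell
# members_url <mgmt_port>
# Builds the membership URL for the node behind the given management port.
members_url() {
  printf 'http://localhost:%s/cluster/members\n' "$1"
}

# cluster_members <mgmt_port>
# Fetches the cluster membership as seen by that node (JSON).
cluster_members() {
  curl -s "$(members_url "$1")"
}
```

Running `cluster_members 8559` prints the second node's view of the cluster. While the first node is paused, its address should show up in the `unreachable` part of the response.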

Now to simulate a Split Brain scenario, we just have to inform the second node that the first is Down (without terminating or crashing it) using the following curl request:

curl -w '\n' -X PUT -H 'Content-Type: multipart/form-data' -F operation=down http://localhost:8559/cluster/members/Credibility@127.0.0.1:2551

To verify whether the first node is marked as Down or not, have a look at the cluster membership on the second node (http://localhost:8559/cluster/members). It should appear to be a healthy cluster of one node.

Since the second node has formed a healthy cluster of its own, let’s un-pause the first node:

$ ./resumeNode.sh 2551

You will see that the second node has formed a healthy cluster, but is unaware of the existence of the first node. The first node, on the other hand, appears to be unhealthy because it can’t talk to the second node. This can be verified from the cluster membership of both nodes:

First Node (http://localhost:8558/cluster/members)

Second Node (http://localhost:8559/cluster/members)

Now, to form a healthy cluster for the first node, we have to inform it that the second node is Down (without terminating it) using the following curl request:

curl -w '\n' -X PUT -H 'Content-Type: multipart/form-data' -F operation=down http://localhost:8558/cluster/members/Credibility@127.0.0.1:2552

Now if you look at the cluster membership of the first node, you will see that it is also reporting a healthy cluster. However, neither of these single-node clusters knows about the existence of the other. We have finally achieved a Split Brain. In this state both clusters will try to operate normally, unaware that the other exists.

Let’s see what impact the Split Brain has on our application. First, we need to verify the balance of our sample account on both nodes.

Now let’s debit $200 from our sample account via the first node:

This should leave us with $0 in our sample account. That means that any further requests to debit money should be rejected.

However, now try to debit $200 from our sample account via the second node:

Notice how $200 was debited again from the sample account? This happens because of Split Brain. Our two nodes are no longer part of the same cluster and are now making decisions independently of each other. This means they can no longer promise data consistency.
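The arithmetic behind this double debit can be sketched with a toy model. This is not Akka code: the `debit` function and the variable names are hypothetical, and the model simply assumes that after the split each single-node cluster works on its own copy of the account, starting from the $200 deposited earlier.

```shell
# debit <current_balance> <amount>
# Prints the new balance, or fails if the debit would overdraw the account.
debit() {
  [ "$2" -le "$1" ] || return 1
  echo $(( $1 - $2 ))
}

# Before the split, a single entity holds the $200 deposited earlier.
balance=200

# After the split, each one-node cluster continues from its own copy.
node1_balance=$balance
node2_balance=$balance

node1_balance=$(debit "$node1_balance" 200)   # accepted by the first cluster
node2_balance=$(debit "$node2_balance" 200)   # also accepted by the second cluster

total_debited=$(( 2 * balance - node1_balance - node2_balance ))
echo "debited \$$total_debited from an account that only ever held \$$balance"
```

Each copy rejects overdrafts correctly on its own, yet together they pay out $400 from an account that only held $200. The overdraft check is only as good as the guarantee that there is a single copy to check against.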

Conclusion

So, now we know what causes a Split Brain and what its consequences are. In a future post we will learn how to use a Split Brain Resolver to handle the Split Brain problem in an automated fashion. So stay tuned 🙂