In our previous blog post, Understanding Failures in Akka Cluster, we came to know how an Akka Cluster reacts in case of a failure and what are the reasons behind the failures. Now, whenever a failure will occur in an Akka Cluster, we would need a way to heal the cluster, so that we can restore it back to its normal working condition. Hence, in this blog post, we are going to talk about, how to manually heal an Akka Cluster.
Healing a Simple Failure
In case of a simple failure like resource starvation or sudden crash of a node, healing the cluster is fairly simple. We just have to restart the Unreachable member (Node 2 in the above diagram) on the same host and port. This will make the cluster recognize the Unreachable member as a new member who is trying to Join the cluster. The old reference of the Unreachable member will be removed by the cluster and the new member will Join the cluster. Once the Convergence is achieved, the new member will be marked as Up in the cluster.
In fact, there are a few tools that do all these steps for us automatically like Kubernetes, DC/OS, etc. Because they can detect node failure and can restart them automatically.
Healing a Complex Failure
In the above section, we have seen how to heal a cluster in the case of a simple failure. But there are failures that are complex in nature. Like network partition, which requires additional steps, then just restarting the Unreachable node(s), to heal.
Why? Because when a network partition occurs, the nodes are actually operating fine. It’s just that they are not able to communicate with each other. Hence, we first need to decide which side of the network partition has to be cleaned up. Then we need to shut down the node(s) on the side which we are going to clean (Node 2 side in the above diagram). Once that is done, we can mark the shutdown node(s) as Down. And at last, we can add new node(s) to the cluster.
However, one thing to note here is that the orchestrations tools (like Kubernetes, DC/OS) cannot handle such failures because they are not aware of the network partitions. For them, the nodes are running fine. Hence they will not take any action.
Shutting Down a Node
From the above sections, one thing is obvious that we need to mark the Unreachable node(s) as Down. How we can do that?
Simple, we can use Akka Management to do that. For which we have to send a HTTP PUT request. Example,
curl -w '\n' -X PUT -H 'Content-Type: multipart/form-data' -F operation=down http://localhost:8558/cluster/members/Credibility@127.0.0.1:2552
This will initiate the process of marking the Unreachable node(s) in the cluster as Down. If you want more details on Akka Management, then you read about it here – Managing a Cluster.
Point to Remember
When we are marking an Unreachable node in a cluster as Down, then we need to make sure that the Unreachable node is Terminated first. Because if we don’t do so then the Unreachable node(s) may continue to operate and are they will become unaware of the fact they have been removed from the cluster. Also, this leads to a dangerous condition known as Split Brain, depicted in the diagram above, which will further lead to data consistency.
How? Suppose Node 1 is hosting Shard A, then all the requests related to Shard A should be redirected to Node 1. However, since Node 2 is hosting a copy of Shard A, and is not terminated, due to network partition it won’t know about Node 1’s existence. Hence Node 2 might also process some requests related to Shard A. This can lead to inconsistency in data if the application needs consistent data.
So, now we know how to manually heal an Akka Cluster in case of different kinds of failure. And what points we should take care of when we are healing the cluster. In our future post(s) we will explore more about the Split Brain problem. So stay tuned 🙂