Ideally, we would never want our Akka Cluster(s) to fail; we would like to keep them running in perfect condition. But in the real world, failures are unavoidable and come in many forms, such as network failures, application failures, and more. So let's understand how an Akka Cluster reacts to a failure and what the reasons behind failures are.
Changes in Node States when a Failure occurs

If we take a look at the Different States of a Node section in our previous blog post, Let's stay in touch via Gossip, we won't find the Unreachable and Down states there. That is because under normal working conditions, a node in an Akka Cluster is in one of the following states: Joining / Up / Leaving / Exiting / Removed. A node reaches the Unreachable state only when it gets disconnected from the cluster for some reason.
If the node recovers and reconnects to the cluster, it is marked as Reachable again. But if it remains disconnected for a (relatively) long period of time, it is marked as Down, after which it is Removed from the cluster.
However, one thing to note here is that once a node has been marked as Down, it cannot rejoin the cluster; it has to be Removed and then added to the cluster again.
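To watch these state transitions in practice, we can subscribe to the cluster's membership events. Here is a minimal sketch of a listener actor that logs the transitions described above (the class name ClusterStateListener is our own choice):

```scala
import akka.actor.{Actor, ActorLogging, Props}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._

// Logs node state transitions as they are gossiped through the cluster.
class ClusterStateListener extends Actor with ActorLogging {
  private val cluster = Cluster(context.system)

  // Subscribe to member lifecycle and reachability events on start.
  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[MemberEvent], classOf[ReachabilityEvent])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive: Receive = {
    case MemberUp(member)          => log.info("Node is Up: {}", member.address)
    case UnreachableMember(member) => log.info("Node marked Unreachable: {}", member.address)
    case ReachableMember(member)   => log.info("Node is Reachable again: {}", member.address)
    case MemberRemoved(member, previousStatus) =>
      log.info("Node Removed: {} (previous status: {})", member.address, previousStatus)
    case _: MemberEvent => // Joining / Leaving / Exiting transitions, ignored here
  }
}

object ClusterStateListener {
  def props: Props = Props(new ClusterStateListener)
}
```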
How are Failures Detected?

As we have already learned, the nodes in an Akka Cluster stay connected via Gossip and monitor each other's health via heartbeats. The same heartbeats are also used to detect failures.
One important point to remember here is that in an Akka Cluster, a single missed heartbeat does not immediately mark a node as failed. Every node keeps a history of heartbeats and, based on that history, decides whether another node is Reachable or not.
However, if one node decides that another node is Unreachable, it has to gossip that information to the other nodes too. Only once all the nodes have seen that the node is Unreachable (i.e., Convergence is reached) is it marked as Unreachable cluster-wide. Until then, all the nodes will keep trying to connect to the Unreachable node.
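Internally, Akka implements this history-based check with a phi accrual failure detector, and its sensitivity is configurable. Below is a minimal sketch of tuning it through configuration; the values here are illustrative, not recommendations:

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// The failure detector keeps a sliding window of heartbeat arrival times and
// computes a suspicion level (phi) from it. These settings trade detection
// speed against false positives.
val config = ConfigFactory.parseString(
  """
    |akka.cluster.failure-detector {
    |  heartbeat-interval = 1s          # how often heartbeats are sent
    |  threshold = 12.0                 # higher = more tolerant, slower detection
    |  acceptable-heartbeat-pause = 5s  # pauses (e.g. GC) tolerated before suspicion rises
    |}
  """.stripMargin).withFallback(ConfigFactory.load())

val system = ActorSystem("ClusterSystem", config)
```

Raising the threshold and acceptable pause makes the detector more forgiving of long GC pauses at the cost of slower detection of real failures.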
What are the Reasons for a Failure?
We all know that there can be hundreds of reasons for a failure, but only a few of them occur frequently. So let's cover the frequent ones:
- Application Errors: Fatal exceptions occurring in the application(s) running on the Akka Cluster are the topmost reason for failures.
- Resource Starvation: The second most frequent reason for failure is resource starvation, such as memory leaks, long GC pauses, etc.
- Slow Network: A slow network can also delay heartbeat responses, which can further lead to a network partition.
However, the most important thing is to know the root cause of any failure before making corrections to the Akka Cluster.
Consequences of a Failure

What are the consequences of a failure? Well, from the above description, one point is clear: in case of a failure, a node becomes Unreachable, which means that Convergence cannot be achieved.
Since Convergence is not possible, a Leader cannot be selected, which leads to new nodes failing to join the cluster. Hence, the Unreachable node(s) have to either become Reachable again or be marked as Down.
However, the applications already running on the cluster will remain unaffected by the failure and will continue to operate.
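We can also observe these consequences programmatically by inspecting the cluster's current state. A minimal sketch, assuming an already-running ActorSystem named system:

```scala
import akka.actor.ActorSystem
import akka.cluster.Cluster

val system = ActorSystem("ClusterSystem")
val cluster = Cluster(system)

// CurrentClusterState is the locally-known snapshot spread via gossip:
// `members` lists the known nodes with their states, and `unreachable`
// lists the nodes currently flagged as Unreachable.
cluster.state.unreachable.foreach { member =>
  println(s"Unreachable: ${member.address} (status: ${member.status})")
}

// Without Convergence, the Leader cannot act on these members.
println(s"Current leader: ${cluster.state.leader}")
```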
What happens to New Node(s) in case of a Failure?

In the previous section, we learned that due to a failure, a node can become Unreachable, which in turn prevents new nodes from joining the cluster. But what exactly happens to the new node? Nothing drastic: it gets marked as WeaklyUp.
Although WeaklyUp nodes can run the applications, they cannot host Cluster Shards/Singletons. Hence, they cannot be used for applications where data consistency is important.
However, once Convergence is reached, the WeaklyUp node will be marked as Up and can then host Cluster Shards/Singletons.
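The WeaklyUp behaviour is governed by a configuration flag, and the eventual promotion to Up can be observed through a dedicated membership event. A minimal sketch (the class name WeaklyUpListener is our own):

```scala
import akka.actor.{Actor, ActorLogging}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._

// WeaklyUp is enabled by default in recent Akka versions; it can be
// disabled with: akka.cluster.allow-weakly-up-members = off
class WeaklyUpListener extends Actor with ActorLogging {
  private val cluster = Cluster(context.system)

  override def preStart(): Unit =
    cluster.subscribe(self, classOf[MemberWeaklyUp], classOf[MemberUp])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive: Receive = {
    case MemberWeaklyUp(member) =>
      log.info("Joined without Convergence, WeaklyUp: {}", member.address)
    case MemberUp(member) =>
      log.info("Promoted to Up after Convergence: {}", member.address)
  }
}
```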
Conclusion
So, now we know about the failures that can happen in an Akka Cluster and the way node(s) behave in case of a failure. In our future posts, we will learn how to fix these failures manually. So stay tuned 🙂