Akka Cluster Formation Fundamentals

Table of contents

Reading Time: 3 minutes

Akka Cluster Formation

Every actor has an address in Akka. The actor could be present locally or could be remote. Remote Actors require communication over the network.

Each Actor system in a cluster is called a member or node. Node is addressed by a combination of hostname, port, and UUID (Regenerated when Actor System restarted). An actor can join the cluster with this combination to have Akka Cluster Formation.

A seed node is required to join a cluster. Any member in the cluster can be a seed node. Seed nodes are no longer used as soon as a node has fully joined and accepted.

Akka Management

Akka management is essential to maintain and monitor a running cluster. It can use to:

Remove a member from the cluster.
Check the health of members in the cluster.

Start an Akka management : AkkaManagement(actorSystem).start()

Cluster Communication

The state in the cluster is communicated via gossip protocol. This includes information about:

The status of a member in a cluster i.e. Joining, Up, etc.
If each member has seen this version of the cluster state (true/false).

Every member will see the same state in the stable cluster which is called Convergence.

The leader will be the first eligible node in the sorting set. It is responsible for marking a Joining node to Up.

Cluster Failure

A node can be disconnected due to some reason, which is then marked as Unreachable.
If the disconnection persists, the node is marked as Down.
The downed node is permanently Removed from the cluster, and cannot rejoin.

A new member cannot rejoin until and unless an Unreachable member(s) is Reached again or being marked as Down.

Convergence is not possible if a member is Unreachable. A joining member would be marked as WeaklyUp and can not host cluster shards. A joining member is marked as Up when Convergence is reached.

It is important to terminate the Unreachable node. If not terminated properly, it can continue to operate. It can form its own independent cluster, unaware that the node has been removed from the cluster.

Improper Downing of a node can lead to Split Brain Problem.

Split Brain Problem

The most common cause of split-brain is Auto-Downing and can be enabled with:

akka.cluster.auto-down-unreachable-after = 5 seconds

Auto Downing doesn’t let the Unreachable node to terminate which in many cases leads to Split Brain. This causes data inconsistency and data corruption.

The Lightbend Split Brain Resolver is the solution that is a part of the Lightbend ecosystem. It includes some customizable strategies to terminate a member in order to avoid a split-brain problem.

Please go through the blog written by Piyush Rana for a tour of various strategies that Lightbend Split Brain Resolver provides. Till then Stay Tuned!!