Akka Split Brain Resolver – Problem and its Solution



When operating an Akka cluster, the developer will surely come across network partitions (split-brain scenarios) and machine crashes. There are multiple strategies provided by Lightbend to handle such erratic behavior and, after a deeper explanation of the problem we are facing, I will try to present them along with their pros and cons, using the Split Brain Resolver in Akka, which is a part of the Reactive Platform.

The Problem Statement

The basic problem is as follows – a node cannot differentiate between a complete crash and a temporary network failure that could get resolved on its own.

– In our current scenario the cluster tracks the state and health of the nodes it contains using a “heartbeat”. This is a mechanism where each node sends a simple message at a given interval to every other node that basically says “I’m OK”. The system makes sure that all the messages have been received using a “sink” node which collects the signals, orders them in a FIFO queue and eventually diagnoses whether there are any problems with the ordering of the signals.

– The problem is that if the heartbeat of a node stops, the network and the other nodes cannot identify why the node stopped responding. Thus we can’t be sure whether the node may come back in the future (some transient network problem) or has permanently stopped working (for example the JVM or hardware crashed), will never recover, and should be discarded.

– A network partition (a.k.a. split brain) is a situation where a network failure causes the cluster to split into parts that can no longer communicate with each other. Each part needs to decide what to do next on its own, based on the latest membership data of the cluster.

– We can distinguish three main failure scenarios/categories:
– Network partitions (split-brain scenarios)
– Crashes (JVM crash, hardware crash, etc.)
– Unresponsive processes (CPU starvation, garbage-collector pauses, etc.)

– The Akka cluster has a failure detector that will notice network partitions and machine crashes (but it cannot distinguish between the two). It uses periodic heartbeat messages to check whether other nodes are available and healthy. These observations by the failure detector are referred to as a node being unreachable; the node may become reachable again if the failure detector observes that it can communicate with it once more.

– The failure detector in itself is not enough for making the right decision in all situations. The naive approach is to remove an unreachable node from the cluster membership after a timeout. This works great for crashes and short transient network partitions, but not for long network partitions. Both sides of the network partition will see the other side as unreachable and after a while remove it from its cluster membership. Since this happens on both sides the result is that two separate disconnected clusters have been created. This approach is provided by the opt-in (off by default) auto-down feature in the OSS version of Akka Cluster.

– If you use the timeout-based auto-down feature in combination with Cluster Singleton or Cluster Sharding, two singleton instances or two sharded entities with the same identifier would be running, one in each cluster. For example, when used together with Akka Persistence, that could result in two instances of a persistent actor with the same persistenceId running and writing concurrently to the same stream of persistent events, which will have fatal consequences when replaying those events.

– The default setting in Akka Cluster is to not remove unreachable nodes automatically, and the recommendation is that the decision of what to do should be taken by a human operator or an external monitoring system. This is a valid solution, but not very convenient if you do not have such staff or an external system.

Possible Solutions – Or Should We Say, Possible Workarounds

Using one of the following strategies –

1) Static Quorum
2) Keep Majority
3) Keep Oldest
4) Keep Referee

To be frank – there is no silver bullet for the problem described above. Each cluster needs to be analyzed individually for the optimal strategy to be chosen. The most widely used strategies are the four listed above.

I will try to describe each one in detail in the next sections along with its pros and cons.

1. Static Quorum

How does it work?
This strategy will down the unreachable nodes if the number of remaining (healthy) nodes is greater than or equal to a predefined constant called the quorum size – i.e. this value defines the minimal number of nodes the cluster must possess to be operational.
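For example, with quorum-size = 3 in a five-node cluster, a 3-2 split leaves the three-node part with a quorum, so it downs the other two nodes and keeps running, while the two-node part (having only two healthy nodes, fewer than the quorum of three) downs itself.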

When to use it?
It is best to use it when we have a fixed number of nodes in a cluster or when we can define a fixed number of nodes with a certain role.

Important notes
One important rule regarding Static Quorum: you mustn’t add more nodes to the cluster than quorum-size * 2 - 1, as this could lead to a situation during a split where both parts think they have enough nodes to function and try to down each other.
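To illustrate the rule: with quorum-size = 3 the cluster should contain at most 3 * 2 - 1 = 5 nodes. With six nodes, a symmetric 3-3 split would give both parts a quorum, and each would try to down the other.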

2. Keep Majority

How does it work?
This strategy will down the unreachable nodes if the current node is part of the majority based on the last membership information; otherwise it will down the reachable part (the one it is a part of). If the parts are of equal size, the one containing the node with the lowest address will be kept.
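For example, if a nine-node cluster splits 5-4, the five-node part holds the majority, so it downs the other four nodes and keeps running, while the four-node part downs itself. In a 5-5 split of a ten-node cluster, the part containing the node with the lowest address survives.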
When to use it?

When the number of nodes in the cluster can change over time (it’s dynamic) and therefore static strategies like static-quorum won’t work.

Important notes
A small chance exists that both parts have different membership information and thus reach different decisions. This can lead to both sides downing each other, as each of them thinks it is the majority based on its own membership information.
If there are more than two partitions and none of them has a majority, then each one of them will shut down, terminating the whole cluster (see the example after these notes).
If more than half of the nodes crash at the same time, the remaining ones will terminate themselves based on the outdated membership information, thus terminating the whole cluster.
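For instance, if a nine-node cluster splits into three partitions of three nodes each, no partition holds a majority (at least five of nine), so all three parts shut down and the whole cluster terminates.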

3. Keep Oldest

How does it work?
This strategy will down the part that does not contain the oldest node. The oldest member is important because the active Cluster Singleton instance runs on it.
When to use it?

When we use a Cluster Singleton and don’t want to shut down the node where it runs.

Important notes
This strategy only cares about where the oldest member resides, so if e.g. we have a 2-98 split and the oldest member is among the 2 nodes, then the other 98 will get shut down.
There is a risk similar to the one described in the Keep Majority notes – that both sides have different information about the location of the oldest node. This can result in both parts mutually trying to shut down one another, as each one thinks it possesses the oldest member.

4. Keep Referee

How does it work?
This strategy will down the part that does not contain the given referee node. The referee node is an arbitrary member of the cluster we run the strategy on; it is up to us to specify which node is suitable for this role. If the number of remaining nodes is less than a predefined constant called down-all-if-less-than-nodes, then they will get shut down. If the referee node is removed, then all nodes will get shut down.
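For example, with the referee running on node A and down-all-if-less-than-nodes = 3 (the value 3 is just an illustration), after a split only the part containing node A can stay up, and only if it still counts at least three nodes; otherwise it shuts down as well.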
When to use it?

When a single node is critical for the system to keep running (the one we mark as the referee node).

Important notes
This strategy creates a single point of failure by design (the referee node).
It will never result in two separate clusters after a split occurs (as there can only be one referee node, thus one cluster alive).

Strategies in Akka

Here I will try to quickly show how to enable and configure the strategies that were explained in the previous section.

Enabling
You can enable a strategy in Akka with the configuration property akka.cluster.split-brain-resolver.active-strategy. All the strategies are inactive until the cluster membership and unreachable nodes information has been stable for a certain amount of time.

akka.cluster.split-brain-resolver {
  # One of the following: static-quorum, keep-majority, keep-oldest, keep-referee
  active-strategy = off

  # how long the membership and unreachability information must be stable
  # before the strategy is executed
  stable-after = 20s
}

After a part of the split decides to shut down, it will issue a shut-down command to all its reachable nodes. But this will not automatically close the ActorSystem and exit the JVM; we need to do this manually in the registerOnMemberRemoved callback:

import akka.cluster.Cluster
import scala.concurrent.duration._

Cluster(system).registerOnMemberRemoved {
  // exit the JVM when the ActorSystem has terminated, with a 10s failsafe
  system.registerOnTermination(System.exit(-1))
  system.scheduler.scheduleOnce(10.seconds)(System.exit(-1))(system.dispatcher)
  system.shutdown()
}

Roles
Each node can have a configured role which can be taken into account when executing a strategy. This is useful when some nodes are more valuable than others and should be the last to terminate when needed.
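As a minimal sketch of how this fits together (the role name "backend" and the quorum size below are only illustrative), a node declares its roles in its own configuration, and a strategy is then restricted to count only nodes carrying that role:

# on each node that should carry the role
akka.cluster.roles = ["backend"]

# make the strategy count only nodes with the "backend" role
akka.cluster.split-brain-resolver.static-quorum {
  quorum-size = 3
  role = "backend"
}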

Static Quorum in Akka
Enabling:
akka.cluster.split-brain-resolver.active-strategy=static-quorum

Configuration:
akka.cluster.split-brain-resolver.static-quorum {
  # minimum number of nodes the cluster must have to stay operational
  quorum-size = undefined

  # if defined, only nodes with this role are counted
  role = ""
}

Keep Majority in Akka
Enabling:
akka.cluster.split-brain-resolver.active-strategy=keep-majority

Configuration:
akka.cluster.split-brain-resolver.keep-majority {
  # if defined, the majority is calculated only among nodes with this role
  role = ""
}

Keep Oldest in Akka
Enabling:
akka.cluster.split-brain-resolver.active-strategy=keep-oldest

Configuration:
akka.cluster.split-brain-resolver.keep-oldest {
  # enable downing of the oldest node when it is partitioned from all other nodes
  down-if-alone = on

  # if defined, the oldest node is selected only among nodes with this role
  role = ""
}

Keep Referee in Akka
Enabling:
akka.cluster.split-brain-resolver.active-strategy=keep-referee

Configuration:
akka.cluster.split-brain-resolver.keep-referee {
  # referee address in the form "akka.tcp://system@hostname:port"
  address = ""

  # if fewer nodes than this remain, they will all be downed
  down-all-if-less-than-nodes = 1
}
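As a filled-in sketch (the system name, host and port below are only placeholders for your own deployment):

akka.cluster.split-brain-resolver.keep-referee {
  address = "akka.tcp://ClusterSystem@host1:2552"
  down-all-if-less-than-nodes = 3
}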

In the next blog we will see a running example implementing the Akka Split Brain Resolver.

