Rebalancing in Akka Cluster Sharding

Table of contents

Reading Time: 4 minutes

In this blog we will be discussing about one of the important feature of Akka Cluster Sharding which is Rebalancing. Before moving forward make sure you have some basic knowledge on Akka Cluster Sharding, if not then please read Introduction to Akka Cluster Sharding and Implementing Akka Cluster Sharding.

Before directly diving into this amazing feature which akka sharding provides, lets first understand the need of it.

Failover of nodes

When an application actually fails. Consider, if we have a failure of one of our nodes in our cluster. What happens is that the node must be hosting some number of shards and those shards are unique to that node. So when a failure occurs we need to do something about it. Now remember, each entity exists in one shard, each shard exists in one shard region, and each shard region exists on one node. The question then becomes, what happens if a node fails.

In our example image here we have three shard regions and each of those has some shards with some number of entities inside those shards. So in this shard region this node has failed. One thing that we are sure is that whenever a node fails all of these entities are going to become unavailable. Lets see what happens when a node fails.

Rebalancing

Sharding has many different features, one of the built in ones which is important to know is rebalancing which is automatic. One thing to note about rebalancing is that it not only occurs in the time of failure, it occurs when the size of the cluster changes.

The size of the cluster changes due to failure, deployment, scaling.
Now size of the cluster can change due to many other factors such as deployment of new nodes or scaling up/scaling down the cluster as per requirement.

This is when rebalancing come into picture.

The shard coordinator initiates the rebalancing process.
So whenever the size of the cluster changes, the Shard coordinator is going to take over and move the shard to a different region. And then the message is stashed and routed transparently to the new shard. Basically what it’s going to do is it will redistribute the shards among all available nodes. The goal then is to keep an even distribution of entities.

So in our diagram here. We have three nodes. And you can see that each of them is hosting two shards. When a rebalance occurs, in that case we’re going to take one shard from the node, and we will rebalance them to the other nodes. That means that now each node will be hosting three shards, which is still relatively balance. But on the other hand again thinking about this, if each of these nodes had been hosting only one shard, then there would be no way to rebalance that evenly, what you’d end up with is one node would host two shards and the other node would host one and that’s why that shard ID is quite critical.

Rebalancing can occur only on a healthy cluster.
Now one more point to remember is that rebalancing can occur only on healthy clusters. This is necessary to ensure that entities are not accidentally turns out to be duplicate. Well the first thing that we need to do is we need to make sure that the unreachable nodes are properly removed.

How rebalancing works

Once the rebalancing starts , the messages send to an entity on the moving shard will be in buffer. It gets stored in some local storage and going to buffer those messages, until the rebalance has completed.

Once the rebalance has completed. All the nodes will be informed that this shard has now moved. And now we can clean our local buffer and we can start forwarding those messages on again. But sometimes the delivery of these messages might fail and the messages can be lost. So for that we needed some kind of retry mechanism.

least-shard-allocation-strategy {
	rebalance-threshold = 1
	max-simultaneous-rebalance = 3
}

Shard Allocation Strategy controls the shard allocation and rebalancing.

LeastShardAllocationStartegy is the default implementation.
The way this works is that we configure a rebalance threshold, and a maximum simultaneous rebalance. These have default values. So if we provide nothing, then it’ll just use its default values.

rebalance-threshold must exceed before it will rebalance.
that means that there has to be a difference of more than one shard in order for it to decide to rebalance So in other words, in a two node cluster if one node had three shards and another node had four shards, it wouldn’t rebalance, but if one node had three and another node had five, then that would exceed this value and it would start to rebalance.

max-simultaneous-rebalance it will rebalance only this number of shards at a time.
Second is the maximum simultaneous rebalance, it will only rebalance up to three shards at a time. That means that if we have a scaling event which causes it to start rebalancing, it’s going to do it slowly. So that you don’t exceed that max rebalance. We can use the default values and if we want we can customise these as per our need.

So this is all about Rebalancing in this blog.

Hope it helps.