When we think of sharding or partitioning it’s typically related to databases. Databases uses sharding to improve resilience and elasticity. The Akka Toolkit provides Cluster Sharding as a way to introduce sharding in your application. Instead of distributing database records across a cluster, we’re distributing actors across nodes in the cluster, and it enables running at most one instance of a given actor at any point in time, acting as a consistency boundary.
Common use cases for cluster sharding are when you have many stateful actors that together consume more resources for example memory that can run safely on one machine and allows us to interact with them using a logical identifier without having to know the physical location in the cluster, which can change over time.
Akka Cluster Sharding consists of four main components
- Shard Region
- Shard Coordinator
Let us discuss each of these in detail.
The basic unit in akka cluster sharding is an actor called an entity and entities represent a specific data type. Entities has a unique identifier or Entity ID. For example, a unique device or a user. So if you have 3000 devices, you could chart them as 3000 device actors, each with a unique ID. Only one instance per ID runs in the cluster at any time. Even when we bounce to another node. There is only one entity per entity ID in the cluster. Messages are addressed to the entity ID and processed by the entity. This allows the entity to act as a single source of truth, acting as a consistency boundary for the data that it manages.
Entities are distributed in shards. Each shard manages a number of entities and creates entity actors on demand. And each shard has a unique ID mapping entities to a shard ID is how we control the distribution. And this distribution of entities is defined by a user function. Usually it’s a hashing function of the entity ID. By default the shard ID is the absolute value of the hash code of the entity ID modular, the total number of shards, which again is user define.
Shards gets distribute into different shard regions. Each shard region contains a number of shards. For a type of entity, there is usually one shard region per JVM. A shard region will look up the location of the shard for the entity the first time when it doesn’t already know its location, and then forwards the messages to the appropriate node region, and the entity. There are two types of regions:
- A non proxy and
- A proxy mode.
The non proxy mode is the one that will share a region that will actually create the entities, and the proxy mode is the mode where no entities will create. It’s simply the proxy, which will forward and send the messages onto the appropriate shard region.
The shard coordinator is responsible to manage shards, it’s a cluster singleton. And this will monitor all cluster nodes, and their status. It’s responsible for ensuring that the system knows where to send messages addressed to a specific entity. And it decides which shard gets to live in which region, which is to stay on which node. A customizable shard allocation strategy decides where to allocate a shard. Cluster sharding comes with the default version.
If the shard coordinator itself goes down, as a result, it starts up another node and replaces its state. And this is backed either by akka distributed data, which is custom crdt by default, or you can use akka persistence plugin.
Cluster sharding is useful when you need to distribute actors across several nodes in the cluster and want to be able to interact with them using their logical identifier, but without having to care about their physical location in the cluster, which might also change over time. Sharding has many different features and we’ll come to know about all of them in my upcoming blogs. So stay tuned!