Best Practices for Cassandra Data Modeling

People new to NoSQL databases tend to treat them like relational databases, but the two are quite different. For people from a relational background, CQL looks similar to SQL, but the way to model data with it is different. Picking the right data model is the hardest part of using Cassandra. I will explain the key points to keep in mind when designing a schema in Cassandra. By following them, you can avoid redesigning your schemas again and again.

DON’Ts

Before explaining what should be done, let’s talk about the goals that we should not be concerned with when designing a Cassandra data model:

1) Minimize the number of writes

We should not worry about the number of writes to Cassandra. Writes are much more efficient than reads, so we should write the data in whatever way improves the efficiency of read queries, even if that means extra writes.

2) Minimize Data Duplication

Data duplication is necessary in a distributed database like Cassandra, and disk space is cheap nowadays. To improve Cassandra reads we need to duplicate data across tables so that each query finds everything it needs in one place. Separately, Cassandra replicates data across nodes so that it stays available in case of failures.

DOs

Now let’s jump to the important part: the things that we need to keep a check on.

1) Spread Data Evenly Around the Cluster

Data should be spread around the cluster evenly so that every node has roughly the same amount of data. Data distribution is based on the partition key, which is the first part of the primary key. A hash is calculated for each partition key, and that hash value decides which node in the cluster stores the data. So we should choose a good primary key.

2) Minimize the Number of Partitions Read

Partitions are groups of rows that share the same partition key. When we perform a read query, the coordinator node requests every partition that contains the requested data, and different partitions may live on different nodes. So if the data for a query is spread over many partitions, the response is delayed by the overhead of requesting each one. This doesn’t mean that we should not use partitions: a large data set has to be partitioned. The goal is for each query to read as few partitions as possible.

To minimize partition reads we need to model our data according to the queries that we use. Minimizing partition reads involves:

a) Model data according to the queries

We should always design the schema around the queries that we will issue to Cassandra. If all the data for a query is in one table, the read will be faster.

b) Create a table where you can satisfy your query by reading (roughly) one partition

This means we should have one table per query pattern. Different tables should satisfy different needs. It is OK to duplicate data among different tables, but our focus should be on serving each read request from one table in order to optimize the read.
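As a rough sketch of one table per query pattern, the two tables below store the same data twice, once for lookups by id and once for lookups by email. The table names users_by_id and users_by_email, their columns, and the sample values are made up for illustration, and the statements assume a keyspace has already been selected with USE.

```cql
-- Hypothetical tables: the same user data stored twice, once per query pattern.
CREATE TABLE IF NOT EXISTS users_by_id (
    user_id int PRIMARY KEY,
    email text,
    name text
);

CREATE TABLE IF NOT EXISTS users_by_email (
    email text PRIMARY KEY,
    user_id int,
    name text
);

-- Each write goes to both tables so that they stay in step.
BEGIN BATCH
    INSERT INTO users_by_id (user_id, email, name) VALUES (1, 'asha@example.com', 'Asha');
    INSERT INTO users_by_email (email, user_id, name) VALUES ('asha@example.com', 1, 'Asha');
APPLY BATCH;

-- Each read is served by one table and one partition.
SELECT email, name FROM users_by_id WHERE user_id = 1;
SELECT user_id, name FROM users_by_email WHERE email = 'asha@example.com';
```

The cost is one extra write per row, which fits the earlier point that writes are cheap compared to reads.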

Let’s take an example to understand it better.

Assume we want to create an employee table in Cassandra. Our fields will be employee id, employee name, designation, salary, location, etc. Now identify the queries that we will frequently run to fetch the data. Possible cases are:

1) To get the details of an employee for a particular employee id

The schema looks like this:

CREATE TABLE employee (
    employee_id int PRIMARY KEY,  -- partition key: one partition per employee
    employee_name text,
    designation text,
    salary int,
    location text
);

Let’s match this against the rules:

Spread data evenly around the cluster – Yes, as each employee has a different partition.

Minimize the number of partitions read – Yes, only one partition is read to get the data.
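To see both rules at work, the statements below are a small sketch against the employee table keyed by employee_id alone. The inserted values are invented for illustration.

```cql
-- Assumes the employee table with employee_id as its primary key; the values are made up.
INSERT INTO employee (employee_id, employee_name, designation, salary, location)
VALUES (101, 'Asha', 'Engineer', 50000, 'Pune');

-- The WHERE clause names the full partition key, so one partition is read.
SELECT employee_name, designation, salary, location
FROM employee
WHERE employee_id = 101;

-- token() returns the hash of the partition key that Cassandra uses for placement.
SELECT employee_id, token(employee_id) FROM employee;
```

The last query scans the whole table, so it is only meant for inspecting how the keys hash on a small data set.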

2) To get the details of all the employees for a particular designation

Now the requirement has changed: we need to get the employee details on the basis of designation. The schema will look like this:

CREATE TABLE employee (
    employee_id int,
    employee_name text,
    designation text,
    salary int,
    location text,
    -- designation is the partition key, employee_id the clustering key
    PRIMARY KEY (designation, employee_id)
);

In the above schema, we have a composite primary key consisting of designation, which is the partition key, and employee_id, which is the clustering key.

This looks good, but let’s again match it against our rules:

Spread data evenly around the cluster – Our schema may violate this rule. If, say, a large number of records fall under one designation, all of that data will be bound to one partition, and the data will not be evenly distributed.

Minimize the number of partitions read – Yes, only one partition is read to get the data.
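The queries this schema is built for could look like the sketch below. The designation value 'Engineer' and the id bound are placeholders.

```cql
-- Assumes the employee table keyed by (designation, employee_id).
-- All rows for one designation live in a single partition.
SELECT employee_id, employee_name, salary, location
FROM employee
WHERE designation = 'Engineer';

-- The clustering key can narrow the read inside that partition.
SELECT employee_id, employee_name
FROM employee
WHERE designation = 'Engineer' AND employee_id > 100;
```

Both reads touch one partition, but that partition grows with every employee who shares the designation, which is the skew described above.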

3) To get the details of all employees with a particular designation living in a particular location

If a large number of records fall in a single partition, the data cannot be spread evenly around the cluster. We can resolve this issue by designing the model in this way:

CREATE TABLE employee (
    employee_id int,
    employee_name text,
    designation text,
    salary int,
    location text,
    -- (designation, location) together form the partition key
    PRIMARY KEY ((designation, location), employee_id)
);

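Now the data will be spread more evenly across the cluster, as the partition key also takes into account the location of each employee.

This schema satisfies both of our rules, as long as each query supplies both designation and location, since together they form the partition key.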
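A read against this schema could be sketched as follows. The values 'Engineer' and 'Pune' are placeholders.

```cql
-- Assumes the employee table keyed by ((designation, location), employee_id).
-- Both partition key columns are supplied, so one partition is read.
SELECT employee_id, employee_name, salary
FROM employee
WHERE designation = 'Engineer' AND location = 'Pune';

-- Supplying only location does not identify a partition. Depending on the
-- Cassandra version this is rejected outright or needs ALLOW FILTERING,
-- which scans across partitions:
-- SELECT employee_id, employee_name FROM employee WHERE location = 'Pune';
```

If a location-only lookup is a frequent query, a separate table partitioned by location would follow the one table per query pattern rule.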

To sum up, Cassandra and an RDBMS are different, and we need to think differently when we design a Cassandra data model. Following the rules above helps us design a data model that is fast and efficient.

Thanks for reading this blog till the end.

Reference:

DataStax