Cassandra Data Modeling

Reading Time: 3 minutes

The goal of this blog is to explain the basic rules you should keep in mind when designing your schema for Cassandra. If you follow these rules, you’ll get pretty good performance out of the box.

Let’s first discuss keys in Cassandra:

  • Primary Key – In its simplest form, made up of a single column.

CREATE TABLE blogs (
key text PRIMARY KEY,
data text
);

  • Composite Key – Made up of two or more columns.

CREATE TABLE blogs (
key_one text,
key_two text,
data text,
PRIMARY KEY (key_one, key_two)
);

  • Partition Key – The “first part” of the composite key is called the PARTITION KEY (in this example, key_one is the partition key).
  • Clustering Key – The “second part” of the composite key is the CLUSTERING KEY (in this example, key_two is the clustering key).

Behind these names …

  • The Partition Key is responsible for data distribution across your nodes.
  • The Clustering Key is responsible for data sorting within the partition.
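To make the two roles concrete, here is a small sketch against the composite-key blogs table defined earlier. The inserted values are made up for illustration.

```cql
-- Two rows with the same partition key value ('cassandra').
INSERT INTO blogs (key_one, key_two, data) VALUES ('cassandra', 'modeling', 'post A');
INSERT INTO blogs (key_one, key_two, data) VALUES ('cassandra', 'basics', 'post B');

-- Both rows live in the same partition because they share key_one.
-- Inside that partition they are stored sorted by key_two,
-- so 'basics' is returned before 'modeling'.
SELECT key_two, data FROM blogs WHERE key_one = 'cassandra';

-- Because the partition key is restricted, the clustering order can be reversed.
SELECT key_two, data FROM blogs WHERE key_one = 'cassandra' ORDER BY key_two DESC;
```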

Non-GOALS

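When we come from an RDBMS background, the one thing we have in mind is avoiding redundancy; that is, data should be in normalized form. In Cassandra, the following are not goals:

  • Minimize Data Duplication – Denormalization and duplication of data are normal in Cassandra. Repeating the same data in several tables is what lets each query be served from a single table, and disk space is usually a cheaper resource than read latency.
  • Minimize the Number of Writes – Cassandra is optimized for high write throughput, and almost all writes are equally efficient: a write is appended to the commit log and stored in an in-memory structure (the Memtable), which is flushed to disk later. If extra writes make your reads more efficient, that is usually a good trade-off.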

Basic GOALS

Two basic goals in Cassandra which we should keep in mind:

  • Spread data evenly around the cluster – You want every node in the cluster to have roughly the same amount of data. Rows are spread around the cluster based on a hash of the partition key, which is the first element of the PRIMARY KEY. So, the key to spreading data evenly is this: pick a good primary key, and in particular a partition key with many distinct, evenly used values.
  • Minimize the number of partitions read – Partitions are groups of rows that share the same partition key. When you issue a read query, you want to read rows from as few partitions as possible. Why is this important? Each partition may reside on a different node. The coordinator will generally need to issue separate commands to separate nodes for each partition you request. This adds a lot of overhead and increases the variation in latency. Furthermore, even on a single node, it’s more expensive to read from multiple partitions than from a single one due to the way rows are stored.
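Both goals can be observed from CQL. The sketch below reuses the composite-key blogs table from earlier; the key values 'spark' and 'kafka' are hypothetical.

```cql
-- token() shows the hash value computed from the partition key.
-- Rows with the same key_one always produce the same token.
-- (A full-table read like this is only sensible on a small table.)
SELECT token(key_one), key_one, key_two FROM blogs;

-- Reads a single partition: one replica set can answer the query.
SELECT * FROM blogs WHERE key_one = 'cassandra';

-- Reads three partitions: each one may live on a different node,
-- so the coordinator may have to contact several nodes.
SELECT * FROM blogs WHERE key_one IN ('cassandra', 'spark', 'kafka');
```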

Model Around Your Queries

The way to minimize partition reads is to model your data to fit your queries. Model around your queries. Here’s how you do that:

Step 1: Determine What Queries to Support

Try to determine exactly what queries you need to support. This can include a lot of considerations that you may not think of at first. For example, you may need to think about:

  • Grouping by an attribute
  • Ordering by an attribute
  • Filtering based on some set of conditions
  • Enforcing uniqueness in the result set

Changes to just one of these query requirements will frequently warrant a data model change for maximum efficiency.
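As a rough illustration of how such requirements shape a schema, assume the query is "the latest posts by a given author". The table posts_by_author and its columns are hypothetical and not part of the original examples.

```cql
-- Filtering by author      -> author becomes the partition key.
-- Ordering newest first    -> published_at becomes a descending clustering column.
-- Uniqueness               -> the primary key (author, published_at) identifies a row,
--                             so a second insert with the same pair overwrites the first.
CREATE TABLE posts_by_author (
    author text,
    published_at timestamp,
    title text,
    PRIMARY KEY (author, published_at)
) WITH CLUSTERING ORDER BY (published_at DESC);

-- The ten most recent posts of one author, read from a single partition.
SELECT title, published_at FROM posts_by_author WHERE author = 'charmy' LIMIT 10;
```

If the requirement changed to "oldest first" or "posts by tag", the clustering order or the partition key would have to change with it.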

Step 2: Try to create a table where you can satisfy your query by reading one partition

In practice, this generally means you will use roughly one table per query pattern. If you need to support multiple query patterns, you usually need more than one table.
That is, if you need different types of answers, you usually need different tables. This is how you optimize for reads.

Remember, data duplication is okay. Many of your tables may repeat the same data.
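A minimal sketch of the one-table-per-query idea, assuming two lookups are needed: by username and by email. The table names, columns, and values are hypothetical.

```cql
-- Query pattern 1: look up a user by username.
CREATE TABLE users_by_username (
    username text PRIMARY KEY,
    email text,
    age int
);

-- Query pattern 2: look up a user by email. Same data, different partition key.
CREATE TABLE users_by_email (
    email text PRIMARY KEY,
    username text,
    age int
);

-- The application writes the same user to both tables.
-- A logged batch is one way to keep the two copies from drifting apart.
BEGIN BATCH
    INSERT INTO users_by_username (username, email, age) VALUES ('jdoe', 'jdoe@example.com', 30);
    INSERT INTO users_by_email (email, username, age) VALUES ('jdoe@example.com', 'jdoe', 30);
APPLY BATCH;

-- Each query now reads exactly one partition.
SELECT * FROM users_by_username WHERE username = 'jdoe';
SELECT * FROM users_by_email WHERE email = 'jdoe@example.com';
```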

Applying the Rules: Examples

Example 1: Here is the table MusicPlaylist.

CREATE TABLE MusicPlaylist (
SongId int,
SongName text,
Year int,
Singer text,
PRIMARY KEY (SongId, SongName)
);

In the above example, table MusicPlaylist,

  • SongId is the partition key, and
  • SongName is the clustering column

Data within a partition is clustered by SongName, but each distinct SongId creates its own partition, so a partition will typically hold just one song. A query that is not driven by SongId, such as fetching all the songs of a given year, cannot be answered from a single partition. Data retrieval for such queries will be slow with this data model due to the bad primary key.
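Two queries against this first MusicPlaylist table show where the model helps and where it hurts. The literal values are made up, and the second query assumes a recent Cassandra version that accepts ALLOW FILTERING on a regular column.

```cql
-- Efficient: SongId is the partition key, so this reads one partition.
SELECT * FROM MusicPlaylist WHERE SongId = 1;

-- Inefficient: Year is not part of the primary key, so Cassandra has to
-- scan partitions across the cluster and filter the rows afterwards.
SELECT * FROM MusicPlaylist WHERE Year = 2015 ALLOW FILTERING;
```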

Example 2: Here is another table MusicPlaylist.

CREATE TABLE MusicPlaylist (
SongId int,
SongName text,
Year int,
Singer text,
PRIMARY KEY ((SongId, Year), SongName)
);

In the above example, table MusicPlaylist,

  • SongId and Year together form the partition key, and
  • SongName is the clustering column.

Data within a partition is clustered by SongName. In this table, a new partition is created for each distinct (SongId, Year) combination, so songs from the same year are still spread across many partitions. A query that supplies both SongId and Year reads exactly one partition, and data retrieval is fast for that query pattern.
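If the query to support is "all the songs of a given year", one option is to make Year alone the partition key, since a composite partition key of (SongId, Year) gives every combination its own partition. The sketch below is an assumption about that use case; the table name songs_by_year is hypothetical.

```cql
-- Year is the partition key, so every song of a year lands in one partition.
-- SongName sorts the rows; SongId is added as a second clustering column
-- so that two songs with the same name in the same year do not overwrite each other.
CREATE TABLE songs_by_year (
    Year int,
    SongName text,
    SongId int,
    Singer text,
    PRIMARY KEY ((Year), SongName, SongId)
);

-- Reads a single partition and returns the songs sorted by name.
SELECT SongName, Singer FROM songs_by_year WHERE Year = 2015;
```

The trade-off is that one partition per year can grow large and makes for few, uneven partitions, which works against the goal of spreading data evenly. Whether that is acceptable depends on the data volume.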

Conclusion

Data modeling in Cassandra is different from data modeling in an RDBMS. Cassandra data modeling has some rules, and following them is what gives you a good data model. Remember that there are many ways to model. The best way depends on your use case and query patterns.

Written by 

Charmy is a Software Consultant having experience of more than 1.5 years. She is familiar with Object Oriented Programming Paradigms and has familiarity with Technical languages such as Scala, Lagom, Java, Apache Solr, Apache Spark, Apache Kafka, Apigee. She is always eager to learn new concepts in order to expand her horizon. Her hobbies include playing guitar and Sketching.
