Data modeling in Cassandra


Role of Partitioning & Clustering Keys in Cassandra

Primary and Clustering Keys should be one of the very first things you learn about when modeling Cassandra data.  With this post I will cover what the different types of Primary Keys are, how they can be used, what their purpose is, and how they affect your queries.

Primary key

Primary Keys are defined when you create your table.  The most basic primary key is a single column.  A single column is great for when you know the value that you will be searching for.  The following table has a single column, comp_id, as the primary key,

CREATE TABLE company_Information (
comp_id text,
name text ,
city text,
state_province text,
country_code text,
PRIMARY KEY (comp_id )
);

A single column Primary Key is also called a Partition Key.  When Cassandra is deciding where in the cluster to store this particular piece of data, it will hash the partition key. The value of that hash dictates where the data will reside and which replicas will be responsible for it.

Partition Key

The Partition Key is responsible for the distribution of data among the nodes.  suppose there are 4 nodes A , B ,C ,D and  let’s assume hash values are between 0-100  and also assume that 0-25 , 25-50 , 50-75 and 75-100 are the hash values for the nodes A,B,C,D .   When we insert the first row into the company_Information table, the value of comp_id will be hashed.  Let’s also assume that the first record will have a hash of 34.  That will fall into the values that Node 2’s partition is assigned.

Compound Key

  • A multi-column primary key is called a Compound Key.
  • Primary keys can also be more than one column.


CREATE TABLE company_Information (
comp_id text,
name text ,
city text,
state_province text,
country_code text,
PRIMARY KEY (country_code , city , name , comp_id )
);

This example has four columns in the Primary Key clause. An interesting characteristic of Compound Keys is that only the first column is considered the Partition Key. There rest of the columns in the Primary Key clause are Clustering Keys.

Order By

You can change the default shorting order from ascending to descending by “order By” . There is an additional WITH clause that you need to add to the CREATE TABLE to make this possible.

CREATE TABLE company_Information (
comp_id text,
name text ,
city text,
state_province text,
country_code text,
PRIMARY KEY (country_code , city , name , comp_id )
) WITH CLUSTERING ORDER BY (city DESC, name ASC , comp_id DESC);

Now we’ve changed the ordering of the Clustering Keys to sort city in descending order . Did you notice that I did not specify what the sort is for country_code? Since it’s the partition key, there is nothing to sort as hashed values won’t be close to each other in the cluster.

Clustering Keys

Each additional column that is added to the Primary Key clause is called a Clustering Key. A clustering key is responsible for sorting data within the partition. In our example company_Information table, country_code is the partition key with city, name & comp_id acting as the clustering keys. By default, the clustering key columns are sorted in ascending order.

Composite Key

A Composite Key is when you have a multi-column Partition Key.  The above example only used country_code for partitioning.  This means that all records with a country_code value of “INDIA” are in the same partition.Avoiding wide rows is the perfect reason to move to a Composite Key.  Let’s change the Partition Key to include the comp_id & city columns.  We do this by nesting parenthesis around the columns that are to be a Composite Key, as follows:


CREATE TABLE company_Information (
comp_id text,
name text ,
city text,
state_province text,
country_code text,
PRIMARY KEY ((country_code , city , comp_id) , name )
);

What this does is it changes the hash value from being calculated off of only country_code. Now it will be calculated off of the combination of country_code, city & comp_id. Each combination of the three columns have their own hash value and will be stored in completely different partition in the cluster

This entry was posted in Cassandra, Scala and tagged , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s