Cassandra Data Modeling – Primary , Clustering , Partition , Compound Keys


In this post we are going to discuss more about different keys available in Cassandra . Primary key concept in Cassandra is different from Relational databases. Therefore it is worth spending time to understand this concept.

Lets take an example and create a student table which had a student_id as a primary key column.
1) primary key 
create table person (student_id int primary key, fname text, lname text, dateofbirth timestamp, email text, phone text );

In Cassandra a table can have number of rows. Each row is referenced by a primary key also called row key. There are number of columns in a row but the number of columns can vary in different rows.
For example one row in a table can have three columns where as the other row in the same table can have ten columns. It is also important to note that in Cassandra both column names and values have binary types. Which means column names can have binary values such as string, timestamp or an integer etc. This is different from SQL databases where each row in an SQL table has fixed number of columns and column names can only be text.

We saw that the student_id was used as a row key to refer to person data.

2)Compound primary key:
As the name suggests, compound primary key is comprised of one or more columns which are referenced in the primary key. One component of compound primary key is called partition key where as the other component is called clustering key. Following are different variations of primary keys. Please note that C1, C2, C3,… and so on represent columns in the table.

C1: primary key has only one partition key and no cluster key.
(C1, C2): column C1 is a partition key and column C2 is a cluster key.
(C1,C2,C3,…): column C1 is a partition key and columns C2, C3 and so on make cluster key.
(C1, (C2, C3,…)): It is same as 3 i.e column C1 is a partition key and columns C2,C3,… make cluster key.
((C1, C2,…), (C3,C4,…)): columns C1, C2 make partition key and columns C3,C4,… make cluster key.

It is important to note that when compound key is C1,C2,C3 then the first key C1 becomes partition key and rest of the keys become part of the cluster key. In order to make composite partition keys we have to specify keys in parenthesis such as: ( ( C1,C2) , C3, C4).  In this case C1 & C2 are part of partition keys and C3, C4 are part of cluster key.

3) Partition key
The purpose of partition key is to identify the partition or node in the cluster which stores that row. When data is read or write from the cluster a function called Partitioner is used to compute the hash value of the partition key. This hash value is used to determine the node/partition which contains that row. For example rows whose partition key values range from 1000 to 1234 may reside in node A and rows with partition key values range from 1235 to 2000 may reside in node B as shown in figure 1. If a row contains partition key whose hash value is 1233 then it will be stored in node A.

 

cluster

4) Clustering key
The purpose of clustering key is to store row data in sorted order. The sorting of data is based on columns which are included in the clustering key. This arrangement makes it efficient to retrieve data using clustering key.

5) Example : –
To make these concepts clear we will consider example of school system.

>Create a keyspace with replication strategy ‘SimpleStrategy’ and replication_factor 1.
create keyspace Students_Details with replication = {‘class’ : ‘SimpleStrategy’, ‘replication_factor’:1};

>Now switch to students_details keyspace:
cqlsh> use students_details ;

>command to check the no of tables present in keyspace
cqlsh:students_details> desc TABLES;

>We will create a table student which contains general information about any student. Type the following create statement into cqlsh.
create table student (stuid int, avg_marks float, description text, primary key (stuid));

> Type the following insert statements to enter some data into this table.
insert into student (stuid, avg_marks, description) values (1,25.5,’student 1′);
insert into student (stuid, avg_marks, description) values (2,35.5,’student 2′);

>to view the details just inserted –
cqlsh:students_details> select * from student;

stuid | avg_marks | description
——-+———–+————-
1 |      25.5 |   student 1
2 |      35.5 |   student 2

> We can see how Cassandra has stored this data under the hood by using cassandra-cli tool. Run cassandra-cli tool in a separate terminal window and type the following command on that terminal.( NOTE * Cassandra-CLI utility (deprecated)
Important: The CLI utility is deprecated and will be removed in Cassandra 3.0. For ease of use and performance, switch from Thrift and CLI to CQL and cqlsh.)

So if you using cassandra verison above 3.0 then use the below commands .

using the EXPAND Command in cqlsh , we can view the details info for the queries .
>EXPAND with no arguments shows the current value of expand setting.

cqlsh:students_details> EXPAND
Expanded output is currently disabled. Use EXPAND ON to enable.

>Enabling the expand command
cqlsh:students_details> EXPAND ON
Now Expanded output is enabled

>Now view the details inserted above- ( the studid will be present in red color in cqlsh, representing the primary key/row key)
>cqlsh:students_details> select * from student;

@ Row 1
————-+———–
stuid       | 1
avg_marks   | 25.5
description | student 1

@ Row 2
————-+———–
stuid       | 2
avg_marks   | 35.5
description | student 2

(2 rows)

We can see from the above output that the stuid has become the row key and it identifies individual rows.
cqlsh:students_details> select token(stuid) from student;

@ Row 1
———————+———————-
system.token(stuid) | -4069959284402364209

@ Row 2
———————+———————-
system.token(stuid) | -3248873570005575792

Also you can see that their are two tokens .

We can use columns in the primary key to filter data in the select statement. Type the following command in the cqlsh window:
select * from student where stuid = 1;

Now we will create another table called marks which records marks of each student for every day(say every day new exams and marks are recorded). Type the following command on cqlsh:

create table marks(stuid int,exam_date timestamp,marks float,exam_name text , primary key (stuid,exam_date));

> This statement creates marks table with primary key ( stuid , exam_date ). As primary key has two components therefore the first component is considered as partition key and the second component becomes the cluster key. Add some data into the table:

1.
insert into marks(stuid ,exam_date ,marks ,exam_name) values (1,’2016-11-10′,76 ,’examA’);
2.
insert into marks(stuid ,exam_date ,marks ,exam_name) values (1,’2016-11-11′,90 ,’examB’);
3.
insert into marks(stuid ,exam_date ,marks ,exam_name) values (1,’2016-11-12′,68 ,’examC’);

cqlsh:students_details> select * from marks;

@ Row 1
———–+————————–
stuid     | 1
exam_date | 2016-11-09 18:30:00+0000
exam_name | examA
marks     | 76

@ Row 2
———–+————————–
stuid     | 1
exam_date | 2016-11-10 18:30:00+0000
exam_name | examB
marks     | 90

@ Row 3
———–+————————–
stuid     | 1
exam_date | 2016-11-11 18:30:00+0000
exam_name | examC
marks     | 68

Now lets see the partition concept been applied :

cqlsh:students_details> select token(stuid) from marks;

@ Row 1
———————+———————-
system.token(stuid) | -4069959284402364209

@ Row 2
———————+———————-
system.token(stuid) | -4069959284402364209

@ Row 3
———————+———————-
system.token(stuid) | -4069959284402364209

We can see all the three rows have same partition token , hence Cassandra stores only one row for each partition key. All the data associated to that partition key is stored as columns in the datastore. The data which we have stored through three different insert statements have the same stuid value i.e. 1 therefore all the data is saved in that row as columns , i.e under one partition .

If you remember we discussed before that the second component of a primary key is called clustering key. The role of clustering key is to group related items together. All the data which is inserted against same clustering key is grouped together.

In this case all the columns such as: exam_name and marks will be grouped by value in exam_date i.e 2016-11-11 18:30:00+0000 , by default in Ascending order .

I hope these examples would have helped you to clarify few concepts of data modeling in Cassandra. Please feel free to leave any comments related to this post.


KNOLDUS-advt-sticker

This entry was posted in Best Practices, big data, Cassandra, database, NoSql, Scala and tagged , , , , , , . Bookmark the permalink.

4 Responses to Cassandra Data Modeling – Primary , Clustering , Partition , Compound Keys

  1. Sivani says:

    Awesome tutorial. My search ended here to understand the data model of cassandra. Thanks a lot.

  2. ahmadanwar31 says:

    very helpful, and opened my mind, thank you

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s