Apache Cassandra is a distributed database that scales horizontally and is effective at managing large amounts of structured data. It provides high availability with no single point of failure. Cassandra is a wide-column store (a partitioned row store) and is often used for time series data.
Primary keys in Cassandra
Cassandra is a primary key database: data is persisted and organised around the cluster based on the primary key. The first part of the primary key, the partition key, is hashed, and the resulting hash value decides which node stores the row.
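To make the idea concrete, here is a minimal Python sketch of hashing a partition key to pick a node. It is a toy model only: the function name `owner_node` is made up for illustration, and MD5 with a modulo stands in for Cassandra's real partitioner (Murmur3 by default) and its token ranges.

```python
import hashlib

def owner_node(partition_key: tuple, node_count: int) -> int:
    """Map a partition key to a node index with a stable hash.

    Toy stand-in for a token ring: real Cassandra assigns token ranges
    to nodes instead of taking a simple modulo.
    """
    encoded = "|".join(str(part) for part in partition_key).encode("utf-8")
    digest = hashlib.md5(encoded).digest()
    token = int.from_bytes(digest[:8], "big")  # use the first 8 bytes as the token
    return token % node_count

# The same key always hashes to the same node, so any node can locate the row.
for key in [(1,), (2,), (3,), (1,)]:
    print(key, "-> node", owner_node(key, 5))
```

The key point is that the placement is computed from the key itself, so no lookup table of "which row lives where" is needed.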
Why use composite partition keys?
- A composite partition key, i.e. a partition key made of more than one column, breaks the data into buckets.
- The data is still grouped, but in smaller buckets, which can be especially effective for time series data.
- This helps when a Cassandra cluster experiences congestion from repeatedly writing data to the same node.
How to reduce the congestion?
- Break incoming data into year:month:day:hour buckets, using four columns as the partition key. Each distinct combination of values becomes its own partition, so the data is spread across buckets and the load on any single node can decrease.
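The bucketing above can be sketched in a few lines of Python. This is an illustration under assumptions, not Cassandra code: `hour_bucket` is a hypothetical helper and the timestamps are made-up sample data.

```python
from collections import Counter
from datetime import datetime, timedelta

def hour_bucket(ts: datetime) -> tuple:
    """Derive the four partition key columns (year, month, day, hour) from a timestamp."""
    return (ts.year, ts.month, ts.day, ts.hour)

# Made-up sample: one event every 10 minutes for 6 hours.
start = datetime(2024, 1, 1, 0, 0)
events = [start + timedelta(minutes=10 * i) for i in range(36)]

# Count how many rows land in each bucket (partition).
rows_per_partition = Counter(hour_bucket(ts) for ts in events)
for bucket, count in sorted(rows_per_partition.items()):
    print(bucket, count)
```

Instead of one ever-growing partition, the rows are split into one partition per hour. Note that writes arriving within the same hour still go to the same partition, so the bucket size should match the write rate.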
Note: To retrieve data from the table, the query must specify values for all the columns in the partition key. If a WHERE clause filters only on a column that is not part of the partition key, Cassandra by default rejects the query with an error (unless the column has a secondary index or ALLOW FILTERING is specified).
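As a rough illustration of this rule, the Python sketch below checks whether a set of WHERE restrictions covers every partition key column. The column names and the helper `missing_partition_columns` are hypothetical, and the check deliberately ignores clustering columns, indexes and ALLOW FILTERING.

```python
# Hypothetical partition key columns for a time-bucketed table.
PARTITION_KEY = ("year", "month", "day", "hour")

def missing_partition_columns(where: dict) -> list:
    """Return the partition key columns that the WHERE clause leaves unrestricted."""
    return [col for col in PARTITION_KEY if col not in where]

full = {"year": 2024, "month": 1, "day": 15, "hour": 9}
partial = {"year": 2024, "month": 1}

print(missing_partition_columns(full))     # nothing missing: the query can be routed
print(missing_partition_columns(partial))  # day and hour missing: the partition is unknown
```

Without all four values the hash of the partition key cannot be computed, which is why such a query cannot be routed to a single partition.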
An example of a composite primary key:
event_year and event_name together form the composite partition key
// dummykeyspace is the keyspace. Databases are referred to as keyspaces in Cassandra.
cqlsh> USE dummykeyspace;
CREATE TABLE rank_by_year_and_name (
    event_year int,
    event_name text,
    candidate_name text,
    rank int,
    PRIMARY KEY ((event_year, event_name), rank)
);

The primary key also contains an additional column, rank. It is a clustering column, which orders the rows within each partition.
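One way to picture how this table is laid out is a dictionary of partitions, each holding rows keyed by rank. The Python sketch below is only a mental model with made-up sample rows; `upsert` and `read_partition` are hypothetical helpers, not part of any Cassandra driver.

```python
from collections import defaultdict

# (event_year, event_name) -> {rank: candidate_name}
table = defaultdict(dict)

def upsert(event_year: int, event_name: str, rank: int, candidate_name: str) -> None:
    """Write a row; a row with the same full primary key is overwritten."""
    table[(event_year, event_name)][rank] = candidate_name

def read_partition(event_year: int, event_name: str) -> list:
    """Return the rows of one partition, ordered by the clustering column rank."""
    rows = table.get((event_year, event_name), {})
    return [(rank, rows[rank]) for rank in sorted(rows)]

# Made-up sample data.
upsert(2023, "City Marathon", 2, "Asha")
upsert(2023, "City Marathon", 1, "Ben")
upsert(2022, "City Marathon", 1, "Chen")
upsert(2023, "City Marathon", 1, "Dana")  # same primary key as Ben's row: replaces it

print(read_partition(2023, "City Marathon"))
```

Rows with the same event_year and event_name live together in one partition, and within it they come back sorted by rank.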
Secondary indexes in Cassandra
A primary index is global, whereas a secondary index is local.
Let's say we have a five-node Cassandra cluster and a user table with user_id as the primary index and a secondary index on user_emails. If you query for a user by their ID, i.e. the primary index, the node that receives the query hashes the ID to work out which node holds the user's record, i.e. one query, one read from disk.
However, if you query for a user by their email, i.e. the secondary index, each node has to search its own local records of users, i.e. one query, five reads from disk. This can increase latency and reduce overall read efficiency, in some cases to the point of timing out on API calls.
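The difference between the two lookups can be simulated with a small Python model. Everything here is an assumption made for illustration: the `Node` class, the MD5-and-modulo placement and the sample users are invented, and a counter stands in for disk reads.

```python
import hashlib

NODE_COUNT = 5

def owner_node(user_id: int) -> int:
    """Toy placement: a stable hash of the user ID modulo the node count."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODE_COUNT

class Node:
    def __init__(self) -> None:
        self.users = {}        # user_id -> email
        self.email_index = {}  # email -> set of user_ids stored on THIS node only
        self.reads = 0         # stand-in for disk reads

nodes = [Node() for _ in range(NODE_COUNT)]

def add_user(user_id: int, email: str) -> None:
    node = nodes[owner_node(user_id)]
    node.users[user_id] = email
    node.email_index.setdefault(email, set()).add(user_id)

def find_by_id(user_id: int):
    """Primary lookup: the hash tells us the single node to read."""
    node = nodes[owner_node(user_id)]
    node.reads += 1
    return node.users.get(user_id)

def find_by_email(email: str) -> list:
    """Secondary index lookup: every node must check its local index."""
    matches = []
    for node in nodes:
        node.reads += 1
        matches.extend(node.email_index.get(email, ()))
    return sorted(matches)

def total_reads() -> int:
    return sum(node.reads for node in nodes)

for uid in range(1, 21):
    add_user(uid, f"user{uid}@example.com")

find_by_id(7)
print("reads after lookup by ID:", total_reads())
find_by_email("user7@example.com")
print("reads after lookup by email:", total_reads())
```

The lookup by ID touches one node, while the lookup by email touches all five, and that fan-out grows with the size of the cluster.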
Stay tuned for the next blog on the impact of cardinality in Cassandra!