Is Apache Cassandra really the Database you need?


Welcome back, everyone. I have been working with Cassandra for quite some time now, and to be honest, it is a pretty cool database. Its decentralized nature and its ability to handle a very large volume of writes are really commendable.
But as we know, nothing is perfect, and neither is Cassandra. You cannot have a perfect package: if you want one brilliant feature, you may have to compromise on others. In today’s blog, we will go through some of the benefits of selecting Cassandra as your database, as well as the drawbacks you might face if you choose it for your application.
I have also written some earlier blogs that you can go through for reference if you want to know what Cassandra is, how to set it up, and how it performs its reads and writes.

The only question is whether or not we should pick Cassandra over the other databases that are available. So let’s start with a quick look at when to use Cassandra. This should give a clear picture to anyone who is still deciding whether to give Cassandra a try.

When To Choose Cassandra

We all know that Cassandra is a NoSQL database. It is a good fit for problems where you need a very write-heavy system and want a responsive reporting system on top of the stored data.

Below I have summed up some of the strong points that make Cassandra a well-deserved candidate in the database race:

Flexible schema.
Highly scalable and highly available with no single point of failure.
Very high write throughput and good read throughput.
SQL-like query language (CQL) and support for search through secondary indexes.
Tunable consistency and support for replication.
NoSQL column family implementation.

Now that we have the plus points of Cassandra in mind, we just need to figure out when to use it. If your application has the following characteristics, then Cassandra is a strong candidate:

The main focus is on writes, i.e. the number of writes is exceptionally high compared to the number of reads.
Updates to existing data are minimal.
There is a requirement to integrate with Big Data, Hadoop, Hive, Spark etc.
There is a need for real-time data analytics and report generation.
The application is distributed.
There is no need for joins or aggregates.
You want your application to work globally.

These are some of the advantages and use cases for going with Cassandra. If the requirements of your application match these criteria, then Cassandra is one of the best choices you can make.

Well, with all that said, let’s come to the features and scenarios that might lead us to reconsider choosing Cassandra over other databases.

When Not to Consider Cassandra

Now we have a rough idea of what Cassandra is actually for. But life is all about balance, right?
So along with the advantages, Cassandra has some cons as well.
We will discuss them in this part.

Some of the Cons of Cassandra Database

No Support for ACID Properties :
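Cassandra does not provide full ACID transactions or relational features such as foreign keys. It offers atomicity and isolation only at the partition level, plus lightweight transactions for compare-and-set operations. If you have a strong requirement for ACID properties, Cassandra would not be a fit in that case.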
No support for Aggregates :
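Cassandra supports only basic aggregate functions in CQL (COUNT, SUM, AVG, MIN, MAX), and they are efficient only within a single partition. If you need to do a lot of aggregation, consider another database.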
Latency :
Issuing too many requests, or reading large amounts of data in a single request, slows down the operation and results in latency issues.
Joins can be an Issue :
Cassandra has no join or subquery support. You may be able to work around this in application code, but that can hurt performance and add overhead.
Data Duplication :
Here data is modeled around queries rather than around its structure, so the same data is often stored multiple times.
Slow Reads :
Reads are slower than writes. Cassandra was optimized from the beginning for fast writes; reads were less of a concern at first, but that quickly changed as more use cases were considered.
JVM Memory management can be an issue :
Cassandra runs on the Java Virtual Machine (JVM), so its memory is managed by the JVM's garbage collector rather than by Cassandra itself. When nodes hold a huge amount of data, garbage collection pauses can hurt performance.
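
To make the data duplication point more concrete, here is a minimal sketch of query-driven modeling. It does not talk to Cassandra at all: the two dictionaries are hypothetical in-memory stand-ins for two tables, and the names `orders_by_customer`, `orders_by_product`, and `record_order` are made up for illustration. The idea is only to show why the same order ends up stored twice when each query gets its own table.

```python
from collections import defaultdict

# Hypothetical stand-ins for two Cassandra tables.
# Each dictionary maps a partition key to the rows stored in that partition.
orders_by_customer = defaultdict(list)  # answers "orders placed by customer X"
orders_by_product = defaultdict(list)   # answers "orders containing product Y"


def record_order(order_id, customer, product, amount):
    """Write the same order once per query pattern (this is the duplication)."""
    row = {
        "order_id": order_id,
        "customer": customer,
        "product": product,
        "amount": amount,
    }
    # Separate copies, just as two tables would hold separate rows.
    orders_by_customer[customer].append(dict(row))
    orders_by_product[product].append(dict(row))


record_order(1, "alice", "keyboard", 40)
record_order(2, "bob", "keyboard", 40)
record_order(3, "alice", "mouse", 15)

# Each question is answered by reading one partition, with no join involved.
alice_orders = [r["order_id"] for r in orders_by_customer["alice"]]
keyboard_buyers = [r["customer"] for r in orders_by_product["keyboard"]]
```

The price of the join-free reads is that every write touches both structures, and keeping the copies in sync becomes the application's job.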

TradeOffs in Cassandra

Using lower consistency levels yields higher availability and better latency at the price of weaker consistency.
Using higher consistency levels yields lower availability and higher request latency with the benefit of stronger consistency.
Another tradeoff to consider is how Cassandra deals with data safety in the face of hardware failures. In case of a disk failure or datacenter hardware damage, the safety of your data depends mainly on the replication factor and the consistency level used for the write.
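
The tradeoffs above can be put into rough numbers. The sketch below is plain arithmetic, not driver code: the function names are hypothetical, and it assumes a single datacenter and only the ONE, TWO, THREE, QUORUM, and ALL consistency levels. It uses the usual rule of thumb that a read is guaranteed to overlap with the latest write when the replicas contacted for reads plus the replicas contacted for writes exceed the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Size of a majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Replicas that must acknowledge a request at the given consistency level."""
    required = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }[level]
    if required > replication_factor:
        raise ValueError(f"{level} needs more replicas than RF={replication_factor}")
    return required


def reads_see_latest_write(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when the read and write replica sets must overlap (R + W > RF)."""
    writes = replicas_required(write_level, replication_factor)
    reads = replicas_required(read_level, replication_factor)
    return reads + writes > replication_factor


def tolerated_replica_failures(level: str, replication_factor: int) -> int:
    """Replicas that can be down while requests at this level still succeed."""
    return replication_factor - replicas_required(level, replication_factor)


# Example: compare the levels for a replication factor of 3.
for level in ("ONE", "QUORUM", "ALL"):
    print(level, replicas_required(level, 3), tolerated_replica_failures(level, 3))
```

With a replication factor of 3, QUORUM for both reads and writes gives overlapping replica sets while still tolerating one replica being down, whereas ALL tolerates none and ONE for both gives no overlap guarantee. Real clusters have more going on than this arithmetic captures, so treat it as a way to reason about the tradeoff rather than a full model.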

THERE IS NO SUCH THING AS A SILVER BULLET

Every database server ever designed was built to meet specific design criteria. Those design criteria define the use cases where the database will fit well and the use cases where it will not.

When evaluating distributed data systems, you have to consider the CAP theorem: of consistency, availability, and partition tolerance, a system can guarantee at most two at the same time. It is also very important to recognize how the data and workload will be distributed. Without understanding the design criteria, implementation, and distribution plan, any attempt to use a distributed database like Cassandra might fail.

Below we have tried to summarize when Cassandra would or would not be an optimal choice for you, so that you know when to explore the other available options.

SUMMARY :

Cassandra would be an optimal choice in the following cases : 

- Use if you need to work on a huge amount of data.
- Use if you have a requirement for fast writes.
- Use if you have little need for secondary indexes.
- Use if there is no need for joins or aggregates.
- Use if there is a requirement to integrate with Big Data, Hadoop, Hive, and Spark.
- Use if there is a need for a distributed application.

Cassandra won't be an optimal choice in the following cases:

- Do Not Use if you are not storing large volumes of data across a cluster of machines.
- Do Not Use if you have a strong requirement for ACID properties.
- Do Not Use if you rely heavily on aggregate functions.
- Do Not Use if you are not partitioning your data across servers.
- Do Not Use if your application has more read requests than writes.
- Do Not Use if you require strong consistency.

To sum it up, Cassandra is an available, partition-tolerant system that supports eventual consistency. But it might not always be an optimal choice when it comes to choosing a database. It totally depends on your use case and also on what features you prefer. You might have to reconsider the tradeoffs as well.
Now that you have an idea of both the positive and the negative points of Cassandra, I hope your questions or doubts about choosing it have been cleared up. Now it’s up to you to make a choice.

Hope this helps. Happy Coding. 🙂