While transitioning from a Relational to a NoSQL Database, architects expect none or a minimal effect on performance with the scaling up of the size of data. Dealing with a huge amount of data may be the USP of a Database, but still, we need to design things in order to make them run well at scale. In this blog, I’d try to explain what an Index is and how it impacts the performance of a Database.
So no prize for guessing, what comes to our aid is known as an index. In short, an index is like the index page at the end of many books where you can simply look up a topic by skimming through the contents alphabetically and then finding out the page listed adjacent to it. It makes the discovery of data faster.
One of the biggest players in the industry, Couchbase has proven really worth my time in the projects I have worked with. It’s easy to integrate, has great documentation and support, and it scales well too. Moreover, with so many features, I’m yet to find a use-case where this is a misfit for a database.
For the basics of working with Couchbase, please follow some of our blogs. It is a Document-based NoSQL database. In this blog, I will be using the Couchbase database to give out the examples.
Need for Indexing a Database
Data we’re working on is now in terabytes or even petabytes. If your Couchbase bucket has 1 million documents (which on a Production database isn’t very surprising), a query like
select * from bucket_name where name = "Ayush Prashar";
The querying would go through each of the documents trying to find every document with that value. Now, this doesn’t seem like a very scalable option, does it? So there does come a requirement to minimize the number of documents to be scanned for a query aka Indexing.
It is simply an index of the document keys of all documents in the bucket. Couchbase database ensures uniqueness constraint over the document key. So whenever we have a query without any filter or no other index may prove to be useful, we rely on Primary Index.
To create a Primary Index in Couchbase, use the following command:
CREATE PRIMARY INDEX `
If we run this over a Couchbase node, let’s see what impact it has on the querying.
Here pay attention to the execution time.
Now looking at the performance and principle behind a Primary Index, it doesn’t really fit any business use-case. A product may have some sort of preference when it comes to the query of data. An employee management app might want to query on departments and designations for some feature, a banking app may query on account types and ledger balance. Since they become so frequently queried, it may be a good idea to create an index for it too. Couchbase provides us with the ability to do just that.
This index can be any field in the document or even be a combination of multiple fields. For the query to take benefit of this index, it has to include the exact fields in the query field.
Pay attention to the execution time of the query. This is a drastic reduction when compared to the use of a plain primary index. Now, this begs the question that do we even need a Primary Index if it is literally so slow on performance? Everything has some sort of use case in a certain scenario, but here and in a general production-based scenario, a primary index not really a very useful asset. Secondary indices could be tapered to the required queries. As mentioned it could work on multiple fields of a document too!
Pay attention to the execution time of the query and the execution plan. It can be seen that it makes use of the index that we created above.
Footnote for Spring Developers: Spring Data for Couchbase also works with the secondary index. We can annotate a model class with
@N1qlSecondaryIndexed(indexName = “foo”)
This adds an attribute “_class” to the document which has the value of the class name(full package name) and also creates an index in Couchbase against this field. Now every query provided by Spring data gets automatically appended with `and _class=”foo”`. So even if we may not add an index separately, we can get one using this annotation. However, it’s very highly recommended that we avoid this type of index and use a well thought about index for our use case in the production environment.
If you pay attention to the last executed query, nearly 8.4% of the time was spent in fetching the data. This time is spent getting data information which is not present in the index itself. To make our queries run faster, we can always taper our projections to include only relevant data and make it a part of our index. You may be surprised by the amount of time it saves.
Clearly, we have skipped the fetch phase of the query plan and you can see the effect it has in the execution time of the query.
In all, these few tips may be helpful for you in order to design your database indexes and to improve the performance of the database, especially when dealing with terabytes and petabytes of data. For more information about such use-cases, stay tuned to Knoldus Blogs.