SQL on Apache Druid – Part II: Which SQL workload will be faster on Druid?

Reading Time: 3 minutes

In the previous blog, we introduced What is Apache Druid and how we can run SQL query on Druid. In this post, we will talk about the Druid design in details and will discuss which kind of SQL workload will be best executed on Apache Druid.

Is Druid really Designed to run SQL queries?

Apache Druid provides columnar storage format in contrast to traditional RDBMS which stores row as one unit. You must be intrigued? To see what are the differences in the database designs in terms of row-based storage and columnar storage in detail, you may look into this blog. However, in a nutshell, The column-oriented databases outperform some of the query loads. So to justify the heading of the paragraph, Yes! you can execute SQL queries on Apache Druid to get good results in some applications.

Different kind of SQL workloads?  OLTP vs OLAP

OLTP: Online Transaction processing workloads are more of a transaction specific which affects (insert/update/delete) on the basis of some condition. e.g.

SELECT * FROM PERSON WHERE ID=1101 
INSERT INTO PRICE VALUES ( …. ) WHERE ITEM_ID=
UPDATE PERSON SET INCOME=INCOME-TAX WHERE ID=1101;

These types of operation affect a specific row/rows based on some condition. On the other hand:

OLAP: Online analytics processing and affects a very large number of rows but limited columns. The amount of data OLAP queries work on are huge. The goal is to scan/join or aggregate some columns for some analysis. e.g.

SELECT SUM(TAXES) FROM PERSON 
SELECT AVG(AGE) FROM PERSON WHERE COUNTRY IN ('INDIA')
SELECT ID FROM PERSON WHERE TAX > 10000;

Organization of Data stored in Druid

Druid stores its indexes in segments which are further partitioned by time interval. Segments are core underlying data structure which allows compression for storage optimization. Since similarity in data for the same column, compression becomes easy and efficient. Internally for storage, there are three column types:

Druid_storage

 The Timestamps and Metrics column types are kind of similar and are compressed integer or floating point values. However, Storage of Dimension columns are different because of the underlying data it stores and query it supports e.g. Dimension columns provide aggregates and filter operations on the column. To go into further details of the internal data structure and to know how dimension columns are stored internally, you can visit the official doc.

Why OLAP queries give better results on Druid?

As you can see the example, The OLAP queries access a very limited set of columns, Druid uses column-oriented storage, meaning it only needs to load the exact columns needed for a particular query. This gives a huge speed boost to queries that only hit a few columns. In addition, each column is stored optimized for its particular data type, which supports fast scans and aggregations.

When to Use Druid for SQL

  • Most of your queries are aggregation and reporting queries (“group by” queries).
  • When querying involves searching and scanning.
  • For Approximating aggregates
  • Needle in a Haystack searches
  • Wher SLA for query latency is between 100 ms to a few seconds.
  • When tables are fairly distributed such that each query hits a big distributed table.

When Not to Use Druid for SQL

  • When update affects a single row basis on some primary key
  • When you need to join two big distributed tables.
  • When a very low latency update is required.

Conclusion!

I hope we are now wise enough to decide which application area we can leverage Apache Druid for running SQL workloads, It will be good to actually run those queries the see the results!! In the upcoming blog in the series, We will be performing some benchmarks to see what difference do we get for running a different kind of queries with the help of an example dataset in Apache Druid. Stay tuned!!

References: 

Knoldus-Scala-Spark-Services

Written by 

Manish Mishra is Lead Software Consultant, with experience of more than 7 years. His primary development technology was Java. He fell for Scala language and found it innovative and interesting language and fun to code with. He has also co-authored a journal paper titled: Economy Driven Real Time Deadline Based Scheduling. His interests include: learning cloud computing products and technologies, algorithm designing. He finds Books and literature as favorite companions in solitude. He likes stories, Spiritual Fictions and Time Traveling fictions as his favorites.

Discover more from Knoldus Blogs

Subscribe now to keep reading and get access to the full archive.

Continue reading