In the previous blog, we introduced What is Apache Druid and how we can run SQL query on Druid. In this post, we will talk about the Druid design in details and will discuss which kind of SQL workload will be best executed on Apache Druid.
Is Druid really Designed to run SQL queries?
Apache Druid provides columnar storage format in contrast to traditional RDBMS which stores row as one unit. You must be intrigued? To see what are the differences in the database designs in terms of row-based storage and columnar storage in detail, you may look into this blog. However, in a nutshell, The column-oriented databases outperform some of the query loads. So to justify the heading of the paragraph, Yes! you can execute SQL queries on Apache Druid to get good results in some applications.
Different kind of SQL workloads? OLTP vs OLAP
OLTP: Online Transaction processing workloads are more of a transaction specific which affects (insert/update/delete) on the basis of some condition. e.g.
SELECT * FROM PERSON WHERE ID=1101
INSERT INTO PRICE VALUES ( …. ) WHERE ITEM_ID=
UPDATE PERSON SET INCOME=INCOME-TAX WHERE ID=1101;
These types of operation affect a specific row/rows based on some condition. On the other hand:
OLAP: Online analytics processing and affects a very large number of rows but limited columns. The amount of data OLAP queries work on are huge. The goal is to scan/join or aggregate some columns for some analysis. e.g.
SELECT SUM(TAXES) FROM PERSON
SELECT AVG(AGE) FROM PERSON WHERE COUNTRY IN ('INDIA')
SELECT ID FROM PERSON WHERE TAX > 10000;
Organization of Data stored in Druid
Druid stores its indexes in segments which are further partitioned by time interval. Segments are core underlying data structure which allows compression for storage optimization. Since similarity in data for the same column, compression becomes easy and efficient. Internally for storage, there are three column types:
The Timestamps and Metrics column types are kind of similar and are compressed integer or floating point values. However, Storage of Dimension columns are different because of the underlying data it stores and query it supports e.g. Dimension columns provide aggregates and filter operations on the column. To go into further details of the internal data structure and to know how dimension columns are stored internally, you can visit the official doc.
Why OLAP queries give better results on Druid?
As you can see the example, The OLAP queries access a very limited set of columns, Druid uses column-oriented storage, meaning it only needs to load the exact columns needed for a particular query. This gives a huge speed boost to queries that only hit a few columns. In addition, each column is stored optimized for its particular data type, which supports fast scans and aggregations.
When to Use Druid for SQL
- Most of your queries are aggregation and reporting queries (“group by” queries).
- When querying involves searching and scanning.
- For Approximating aggregates
- Needle in a Haystack searches
- Wher SLA for query latency is between 100 ms to a few seconds.
- When tables are fairly distributed such that each query hits a big distributed table.
When Not to Use Druid for SQL
- When update affects a single row basis on some primary key
- When you need to join two big distributed tables.
- When a very low latency update is required.
I hope we are now wise enough to decide which application area we can leverage Apache Druid for running SQL workloads, It will be good to actually run those queries the see the results!! In the upcoming blog in the series, We will be performing some benchmarks to see what difference do we get for running a different kind of queries with the help of an example dataset in Apache Druid. Stay tuned!!