In this post, we are going to learn about an ultra-fast, powerful, open-source tool for real-time data analysis: Apache Druid. We will discuss why Druid is so fast, the scenarios in which it is suitable, how it stores data internally, and its architecture.
What is Apache Druid?
Apache Druid is a distributed, high-performance columnar store for real-time analytics on large datasets. Druid's core design combines ideas from OLAP analytics, time-series databases, and search systems into a single operational analytics engine. Druid is most suitable for data with high-cardinality columns, or for queries with heavy aggregations or group-bys.
- Column-oriented distributed datastore, enabling fast scans and aggregations. Druid fetches only the columns required to answer a query.
- Batch and real-time ingestion, so data can be ingested either in real time or in batches.
- Bitmap indexing for fast filtering and searching.
- Data partitioning by timestamp to support fast time-based slicing and dicing.
- Indexing, data partitioning, query caching and data compression combine to deliver low-latency, sub-second queries, which makes interactive real-time dashboards very responsive.
- Massively parallel processing. Internally, Druid runs each query against one big distributed table, optionally combined with one or more smaller lookup tables.
- Scalable to petabytes of data.
- Highly available. Every Druid node, whether a broker, coordinator, historical or indexing node, relies on ZooKeeper to keep the cluster fault tolerant.
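To make the bitmap-indexing point above concrete, here is a minimal sketch in Python. This illustrates only the concept; Druid's real implementation uses compressed bitmap formats such as Roaring, and the row layout here is hypothetical.

```python
# Toy bitmap index: map each distinct column value to the set of row ids
# that contain it. Filtering then becomes a lookup plus set operations,
# with no full scan of the data.
from collections import defaultdict

def build_bitmap_index(rows, column):
    """Build a value -> row-id bitmap (modeled as a set) for one column."""
    index = defaultdict(set)
    for row_id, row in enumerate(rows):
        index[row[column]].add(row_id)
    return index

rows = [
    {"city": "Delhi", "clicks": 10},
    {"city": "Pune",  "clicks": 7},
    {"city": "Delhi", "clicks": 3},
]
idx = build_bitmap_index(rows, "city")
delhi_rows = idx["Delhi"]  # rows matching city = 'Delhi': {0, 2}
# AND/OR filters compose as set intersections/unions over these bitmaps.
```

In a real segment these bitmaps are compressed, so combining filters over high-cardinality columns stays cheap.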
A Druid cluster consists of different types of nodes, and each node type serves a special purpose. The node types operate independently of each other with minimal interaction, so the impact of any single failure is small. To get a clear picture of the Druid architecture, refer to the diagram and let's look at the purpose of each node type.
Historical Nodes: Responsible for loading and serving the immutable blocks of segments created by the indexing nodes. A historical node knows everything about a segment: where it is located in deep storage, and how to decompress and process it. Once a segment is loaded, the node announces it to ZooKeeper, and from then on the segment is queryable.
MiddleManager Nodes: Responsible for indexing data. These nodes maintain an in-memory index buffer for incoming data; each persisted index is immutable and is loaded in off-heap memory. They are also known as real-time nodes or indexing nodes. To avoid heap overflow, they persist their in-memory index to disk either periodically or when a maximum row limit is reached. Each indexing node also schedules a background task that collects its persisted indexes and merges them into an immutable block called a segment. During handoff, the indexing node uploads the segment to permanent deep storage, typically S3 or HDFS, so no data is lost at any point in the process.
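The buffer, persist, and merge-and-handoff lifecycle described above can be sketched roughly as follows. All names here are illustrative, not Druid APIs, and real MiddleManagers work with columnar segment files rather than Python tuples.

```python
# Rough sketch of the indexing-node lifecycle: buffer events in memory,
# persist immutable indexes when a limit is hit, then merge them into a
# segment for handoff to deep storage.
MAX_ROWS = 2  # persist threshold (Druid's maxRowsInMemory plays this role)

class IndexingNode:
    def __init__(self):
        self.buffer = []      # mutable in-memory index
        self.persisted = []   # immutable persisted indexes

    def ingest(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= MAX_ROWS:
            self.persist()

    def persist(self):
        # Each persisted index is immutable (a tuple here).
        self.persisted.append(tuple(self.buffer))
        self.buffer = []

    def merge_and_handoff(self):
        # Background task: merge persisted indexes into one segment.
        # A real node would now upload this segment to S3/HDFS.
        segment = tuple(e for idx in self.persisted for e in idx)
        self.persisted = []
        return segment

node = IndexingNode()
for e in ["e1", "e2", "e3"]:
    node.ingest(e)
segment = node.merge_and_handoff()  # ("e1", "e2"); "e3" is still buffered
```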
Broker Nodes: Broker nodes are responsible for routing queries to indexing nodes or historical nodes, depending on the segment metadata published in ZooKeeper, and for merging the results from both kinds of nodes before returning them to the caller. Broker nodes also maintain an LRU cache of results from historical and indexing nodes to speed up frequent queries.
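The broker-side result cache can be pictured as a simple LRU keyed by the query. This is a toy sketch of the caching idea only; Druid's actual cache operates at the segment level and is configurable.

```python
# Toy LRU result cache, illustrating the broker's caching behavior:
# frequent queries are answered from the cache, and the least recently
# used entry is evicted when capacity is exceeded.
from collections import OrderedDict

class LRUResultCache:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, query):
        if query not in self.data:
            return None
        self.data.move_to_end(query)  # mark as recently used
        return self.data[query]

    def put(self, query, result):
        self.data[query] = result
        self.data.move_to_end(query)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUResultCache(capacity=2)
cache.put("q1", [1, 2])
cache.put("q2", [3])
cache.get("q1")       # touch q1, so q2 becomes least recently used
cache.put("q3", [4])  # evicts q2
```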
Coordinator Nodes: Coordinator nodes are responsible for data management and distribution across historical nodes. They run periodically and sync the state of the cluster via ZooKeeper by comparing the desired state with the current state. Every instruction to a historical node passes through a coordinator, so coordinators are also in charge of deciding which segments historical nodes should load or drop. For example, when a user queries month-old data, the coordinator, based on the rules defined, decides which segments the historical nodes should load and which should be dropped.
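The rule-based load/drop decision can be sketched like this. The retention rule and function names are hypothetical; real Druid coordinators evaluate configurable load and drop rules over segment intervals.

```python
# Illustrative coordinator decision: keep segments newer than a
# retention window loaded on historicals, drop older ones.
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # hypothetical rule: keep the last 30 days

def coordinator_decision(segment_start, now):
    """Decide whether historical nodes should load or drop a segment."""
    return "load" if now - segment_start <= RETENTION else "drop"

now = datetime(2024, 2, 1, tzinfo=timezone.utc)
recent = coordinator_decision(datetime(2024, 1, 20, tzinfo=timezone.utc), now)
old = coordinator_decision(datetime(2023, 11, 1, tzinfo=timezone.utc), now)
# recent == "load", old == "drop"
```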
That was a brief overview of the architecture. Beyond this, Druid has a rich structure for its storage format and query engine, which we will discuss in further blogs. Until then, stay tuned.
In a nutshell, given Druid's architecture and its storage/query capabilities, it is best suited for:
- Data with high insert rates but few updates.
- Queries with heavy aggregations or group-bys.
- Real-time streaming data with high-cardinality columns.
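As an example of the aggregation-heavy, time-sliced queries Druid is built for, here is how one might build a Druid SQL group-by for the broker's HTTP API. The datasource and column names (`clickstream`, `city`, `clicks`) are hypothetical; `/druid/v2/sql` is Druid's standard SQL endpoint.

```python
# Build a Druid SQL payload: sum clicks per city over the last day.
# The datasource "clickstream" and its columns are assumed for illustration.
import json

query = {
    "query": """
        SELECT city, SUM(clicks) AS total_clicks
        FROM clickstream
        WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
        GROUP BY city
        ORDER BY total_clicks DESC
    """
}
payload = json.dumps(query)
# In a real cluster this would be POSTed to the broker, e.g.:
#   curl -XPOST -H 'Content-Type: application/json' \
#        http://<broker-host>:8082/druid/v2/sql -d @query.json
```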
And it is not suitable for:
- Workloads with a high rate of updates.
- Big joins on large fact tables.
Both of these limitations can be worked around: updates can be applied by background jobs, and big joins can be avoided by using indexes or smaller partitions.
Stay tuned for the upcoming Druid blogs. 🙂