Quickstart with Druid


Hi Druids,

In this blog we will download Druid, set up a single-machine cluster using the tutorial configuration, load data from Kafka, and query that data.

How Druid manages data:

Druid data is stored in “Datasources”, which are similar to tables in a traditional RDBMS. Each Datasource is partitioned by time and, optionally, further partitioned by other attributes. Each time range is called a “chunk” (for example, a single day, if your Datasource is partitioned by day). Within a chunk, data is partitioned into one or more “segments”. Each segment is a single file, typically comprising up to a few million rows of data.
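Time partitioning is controlled by the granularitySpec of an ingestion spec. As a minimal sketch (the values are illustrative, not taken from the tutorial files), a day-partitioned Datasource would carry something like:

"granularitySpec" : {
  "type" : "uniform",
  "segmentGranularity" : "DAY",
  "queryGranularity" : "NONE"
}

Here segmentGranularity decides the size of each time chunk, while queryGranularity decides how finely timestamps are truncated inside a segment.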

How Druid queries data:

Druid uses three different techniques to maximize query performance (the sample query after this list shows where each one comes into play):

  • Pruning which segments are accessed for each query.
  • Within each segment, using indexes to identify which rows must be accessed.
  • Within each segment, only reading the specific rows and columns that are relevant to a particular query.
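These techniques map to concrete fields of a native query: the intervals limit which time chunks (and hence segments) are even considered, a filter on an indexed dimension narrows down the rows within each segment, and the dimensions and aggregations decide which columns are read. A hedged sketch against the wikipedia Datasource used later in this post (the channel value is an assumption about the sample data):

{
  "queryType" : "timeseries",
  "dataSource" : "wikipedia",
  "intervals" : ["2015-09-12/2015-09-13"],
  "granularity" : "all",
  "filter" : { "type" : "selector", "dimension" : "channel", "value" : "#en.wikipedia" },
  "aggregations" : [ { "type" : "count", "name" : "edits" } ]
}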

Download and set up Druid

First, download the Apache Druid 0.13.0-incubating release.

Execute the following commands to set up Druid:

tar -xzf apache-druid-0.13.0-incubating-bin.tar.gz
cd apache-druid-0.13.0-incubating
curl https://archive.apache.org/dist/zookeeper/zookeeper-3.4.11/zookeeper-3.4.11.tar.gz -o zookeeper-3.4.11.tar.gz
tar -xzf zookeeper-3.4.11.tar.gz
mv zookeeper-3.4.11 zk

Start Druid with the following command:

bin/supervise -c quickstart/tutorial/conf/tutorial-cluster.conf
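Once the services come up, you can sanity-check them. A small sketch, assuming the tutorial's default ports (Coordinator on 8081, Broker on 8082); each Druid process exposes a /status endpoint that reports its version and loaded modules:

curl http://localhost:8081/status
curl http://localhost:8082/status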

Execute the following commands to set up Kafka. Note that bin/kafka-server-start.sh keeps running in the foreground, so run the topic-creation command from the same Kafka directory in a second terminal:

curl -O https://archive.apache.org/dist/kafka/0.10.2.0/kafka_2.11-0.10.2.0.tgz
tar -xzf kafka_2.11-0.10.2.0.tgz
cd kafka_2.11-0.10.2.0
./bin/kafka-server-start.sh config/server.properties
./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wikipedia
Then, from the Druid directory, submit a supervisor spec to the Overlord so that Druid's Kafka indexing service starts ingesting messages from the newly created wikipedia topic:

curl -XPOST -H'Content-Type: application/json' -d @quickstart/tutorial/wikipedia-kafka-supervisor.json http://localhost:8090/druid/indexer/v1/supervisor
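For reference, the supervisor spec posted above is roughly of the following shape (an abridged sketch; see quickstart/tutorial/wikipedia-kafka-supervisor.json for the full dataSchema and tuningConfig):

{
  "type" : "kafka",
  "dataSchema" : {
    "dataSource" : "wikipedia",
    ...
  },
  "ioConfig" : {
    "topic" : "wikipedia",
    "consumerProperties" : {
      "bootstrap.servers" : "localhost:9092"
    }
  }
}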

Start a Kafka producer and load some data

In the Druid directory, decompress the sample events:

cd quickstart/tutorial
gunzip -k wikiticker-2015-09-12-sampled.json.gz

Then, from the Kafka directory, pipe the file into the wikipedia topic (alternatively, start the producer with just the --broker-list and --topic options and type valid JSON events into it by hand):

./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia < {PATH_TO_DRUID}/quickstart/tutorial/wikiticker-2015-09-12-sampled.json
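To confirm that Druid is actually consuming the topic, you can ask the Overlord about the supervisor. A sketch, assuming the tutorial's default Overlord port 8090; the response should include the Kafka offsets the indexing tasks have read so far:

curl http://localhost:8090/druid/indexer/v1/supervisor/wikipedia/status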

Query data from Druid

Let’s retrieve the 10 Wikipedia pages with the most page edits on 2015-09-12, first with a Druid native query and then with SQL (Druid also supports a dialect of SQL for querying).

i) Druid’s native query format, expressed in JSON

curl -X POST 'http://localhost:8082/druid/v2?pretty' \
  -H 'Content-Type: application/json' \
  -d '{
    "queryType" : "topN",
    "dataSource" : "wikipedia",
    "intervals" : ["2015-09-12/2015-09-13"],
    "granularity" : "all",
    "dimension" : "page",
    "metric" : "count",
    "threshold" : 10,
    "aggregations" : [
      {
        "type" : "count",
        "name" : "count"
      }
    ]
  }'
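The topN response is a JSON array with one entry per time bucket; with "granularity" : "all" there is a single entry, roughly of this shape (the values below are placeholders, not real results):

[ {
  "timestamp" : "<earliest timestamp in the interval>",
  "result" : [
    { "page" : "<page with the most edits>", "count" : <edit count> },
    { "page" : "<page with the second most edits>", "count" : <edit count> }
  ]
} ]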

ii) Druid’s query expressed in SQL

Start the dsql client and run the query (dsql submits a statement once it sees the terminating semicolon):

bin/dsql 
SELECT page, COUNT(*) AS Edits
FROM wikipedia
WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00'
GROUP BY page
ORDER BY Edits DESC
LIMIT 10;
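The same query can also be sent over HTTP to the Broker's SQL endpoint at /druid/v2/sql. A minimal sketch, assuming the tutorial's Broker port 8082 (the /tmp/top-pages-query.json file name is just an example):

cat > /tmp/top-pages-query.json <<'EOF'
{
  "query" : "SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE \"__time\" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10"
}
EOF
curl -X POST -H 'Content-Type: application/json' -d @/tmp/top-pages-query.json http://localhost:8082/druid/v2/sql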

Conclusion:

Druid is a distributed, column-oriented, real-time analytical data store. Druid is designed to power high performance applications and is optimized for low query latencies.
Druid’s data ingestion latency is heavily dependent on the complexity of the data set being ingested. The data complexity is determined by the number of dimensions in each event, the number of metrics in each event, and the types of aggregations we want to perform on those metrics.
Also, Druid query performance can vary significantly depending on the query being used. For example, sorting the values of a high-cardinality dimension based on a given metric is much more expensive than a simple count over a time range.
