In this blog we will download Druid, set up a cluster on a single machine using the tutorial configuration, load data from Kafka, and query that data.
How Druid manages data:
Druid data is stored in "Datasources", which are similar to tables in a traditional RDBMS. Each Datasource is partitioned by time and, optionally, further partitioned by other attributes. Each time range is called a "chunk" (for example, a single day, if your Datasource is partitioned by day). Within a chunk, data is partitioned into one or more "segments". Each segment is a single file, typically comprising up to a few million rows of data.
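For example, a granularitySpec like the following (a trimmed sketch of the kind that appears in the tutorial ingestion specs; check the files shipped with your version for the exact fields) partitions a datasource into one day-sized chunk, which Druid then splits into segments:

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "NONE",
  "intervals": ["2015-09-12/2015-09-13"]
}

Here segmentGranularity controls the size of the time chunks, so one chunk (and at least one segment) is created per day of data.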
How Druid queries data:
Druid uses three different techniques to maximize query performance (the example query after this list illustrates all three):
Pruning which segments are accessed for each query.
Within each segment, using indexes to identify which rows must be accessed.
Within each segment, only reading the specific rows and columns that are relevant to a particular query.
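To make these techniques concrete, here is a minimal native query (a sketch; it assumes the wikipedia datasource we load later in this post). The intervals clause lets Druid prune away every chunk and segment outside 2015-09-12; within the surviving segments, the channel filter is answered from bitmap indexes; and only the columns the query actually references are read:

{
  "queryType": "timeseries",
  "dataSource": "wikipedia",
  "intervals": ["2015-09-12/2015-09-13"],
  "granularity": "hour",
  "filter": { "type": "selector", "dimension": "channel", "value": "#en.wikipedia" },
  "aggregations": [{ "type": "count", "name": "edits" }]
}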
Download and setup Druid:
Download apache-druid-0.13.0-incubating-bin.tar.gz from the Apache archive, then extract it and change into the package directory:
tar -xzf apache-druid-0.13.0-incubating-bin.tar.gz
cd apache-druid-0.13.0-incubating
Druid depends on Apache ZooKeeper for coordination, so download it into the Druid package root:
curl https://archive.apache.org/dist/zookeeper/zookeeper-3.4.11/zookeeper-3.4.11.tar.gz -o zookeeper-3.4.11.tar.gz
tar -xzf zookeeper-3.4.11.tar.gz
mv zookeeper-3.4.11 zk
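With ZooKeeper unpacked into zk, the 0.13.0 tutorial configuration can start ZooKeeper and all the Druid services under a single supervisor; assuming the stock tutorial layout, run this from the Druid package root and leave it running in its own terminal:
bin/supervise -c quickstart/tutorial/conf/tutorial-cluster.conf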
Download and start Kafka (use a new terminal, since the broker keeps running in the foreground):
curl -O https://archive.apache.org/dist/kafka/0.10.2.0/kafka_2.11-0.10.2.0.tgz
tar -xzf kafka_2.11-0.10.2.0.tgz
cd kafka_2.11-0.10.2.0
./bin/kafka-server-start.sh config/server.properties
Create a Kafka topic named wikipedia:
./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wikipedia
Submit a supervisor spec to the Druid Overlord so that Druid's Kafka indexing service ingests messages from our newly created wikipedia topic (run this from the Druid package root):
curl -XPOST -H'Content-Type: application/json' -d @quickstart/tutorial/wikipedia-kafka-supervisor.json http://localhost:8090/druid/indexer/v1/supervisor
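For reference, the supervisor spec is what ties the Kafka topic to a Druid datasource. A heavily trimmed sketch of quickstart/tutorial/wikipedia-kafka-supervisor.json (check the file shipped with your version for the full dimension list and tuning options):

{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "wikipedia",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "time", "format": "auto" },
        "dimensionsSpec": { "dimensions": ["channel", "page", "user"] }
      }
    },
    "granularitySpec": { "type": "uniform", "segmentGranularity": "DAY", "queryGranularity": "NONE" }
  },
  "ioConfig": {
    "topic": "wikipedia",
    "consumerProperties": { "bootstrap.servers": "localhost:9092" },
    "taskCount": 1,
    "replicas": 1
  }
}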
Start a Kafka producer and load some data:
In the Druid package root, decompress the sample data:
cd quickstart/tutorial
gunzip -k wikiticker-2015-09-12-sampled.json.gz
From the Kafka directory, pipe the file into a console producer for the wikipedia topic:
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia < {PATH_TO_DRUID}/quickstart/tutorial/wikiticker-2015-09-12-sampled.json
(Alternatively, just start the producer with --broker-list and --topic and type valid JSON records by hand.)
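Each line of the sample file is a single JSON event. A record looks roughly like this (fields trimmed, values illustrative):

{"time": "2015-09-12T00:47:00.496Z", "channel": "#en.wikipedia", "page": "Village pump (technical)", "user": "GELongstreet", "added": 36, "deleted": 0, "delta": 36}

The time field is what Druid uses for time partitioning, while fields such as channel, page, and user become dimensions.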
Query data from Druid:
Let's retrieve the 10 Wikipedia pages with the most page edits on 2015-09-12, using both Druid SQL and a native query (Druid supports a dialect of SQL in addition to its native JSON query language).
Execute the following command to start the dsql client, then run the query at the prompt:
bin/dsql
SELECT page, COUNT(*) AS Edits
FROM wikipedia
WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00'
GROUP BY page
ORDER BY Edits DESC
LIMIT 10;
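The same top-10 question can be asked in Druid's native JSON query language. Here is a sketch of the equivalent topN query; save it as a file (the name wikipedia-top-pages.json is chosen here for illustration) and POST it to the broker, which listens on port 8082 in the tutorial configuration:

{
  "queryType": "topN",
  "dataSource": "wikipedia",
  "intervals": ["2015-09-12/2015-09-13"],
  "granularity": "all",
  "dimension": "page",
  "metric": "Edits",
  "threshold": 10,
  "aggregations": [{ "type": "count", "name": "Edits" }]
}

curl -X POST -H'Content-Type: application/json' -d @wikipedia-top-pages.json 'http://localhost:8082/druid/v2?pretty'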
Conclusion:
Druid is a distributed, column-oriented, real-time analytical data store. Druid is designed to power high-performance applications and is optimized for low query latencies. Druid's data ingestion latency is heavily dependent on the complexity of the data set being ingested. The data complexity is determined by the number of dimensions in each event, the number of metrics in each event, and the types of aggregations we want to perform on those metrics. Also, Druid query performance can vary significantly depending on the query being used. For example, sorting the values of a high-cardinality dimension based on a given metric is much more expensive than a simple count over a time range.