Integrating Presto With CarbonData


Presto is a well-known open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. It was developed at Facebook to analyse petabytes of data and was later open sourced. Presto does not provide any storage of its own but can be used with a variety of data sources such as Hive, Cassandra, relational databases and even some proprietary data stores.

In this blog we are going to discuss how we can use Presto to query data from another upcoming open source solution, CarbonData. CarbonData is a fully indexed, columnar, Hadoop-native data store for processing heavy analytical workloads and detailed queries on big data. It allows faster interactive queries by using advanced columnar storage, indexing, compression and encoding techniques to improve computing efficiency. Presto with CarbonData helps in speeding up queries by an order of magnitude over petabytes of data.

To install Presto, download the tarball for the latest version from here and untar it in a directory of your choice. The tarball contains a single top-level directory, in this case presto-server-0.187, which we will call the installation directory. All the configuration files for Presto live in the etc folder inside the installation directory (create it if it does not exist). Configure the Presto server as described here, according to your server settings. After installing and configuring the Presto server, you can start it with the command below from the installation directory. This command runs Presto as a daemon.

bin/launcher start

Alternatively, if you want to run it in the foreground, use the command below. Personally I prefer this one, as I can see all the log messages and errors on the screen.

bin/launcher run
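Before either launcher command will work, the etc folder needs at least a node.properties, a jvm.config and a config.properties. As a rough sketch of the configuration step mentioned above, the function below writes a minimal single-node setup. The function name, the data directory, the node id and the memory sizes are all assumptions for illustration, not recommended values, so adjust them to your server.

```shell
# Sketch: write a minimal single-node Presto configuration into <install-dir>/etc.
# Every value here is an illustrative assumption; tune it for your machine.
write_presto_config() {
  install_dir="$1"                    # Presto installation directory
  data_dir="${2:-/var/presto/data}"   # data/log directory (assumed default path)
  mkdir -p "$install_dir/etc/catalog"

  # Node identity: environment name, unique node id, data directory.
  cat > "$install_dir/etc/node.properties" <<EOF
node.environment=test
node.id=presto-node-1
node.data-dir=$data_dir
EOF

  # JVM options, one per line.
  cat > "$install_dir/etc/jvm.config" <<EOF
-server
-Xmx4G
-XX:+UseG1GC
EOF

  # A single process acting as both coordinator and worker, on port 8080.
  cat > "$install_dir/etc/config.properties" <<EOF
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=2GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080
EOF
}
```

For example, write_presto_config /opt/presto-server-0.187 would create the three files under /opt/presto-server-0.187/etc, where the installation path is again only an example.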

The above steps get Presto running, but now we need to integrate it with CarbonData. To do that, first clone the CarbonData repository using the command below:

git clone https://github.com/apache/carbondata.git

Then do a complete build by running the command below inside the carbondata folder:

mvn -Pspark-2.1 -Phadoop-2.7.2 -DskipTests clean package

When the build is complete, you will see the following folder inside the carbondata directory:

integration/presto/target/carbondata-presto-1.2.0-SNAPSHOT
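The version in that folder name changes from build to build, so it can be handy to look the folder up rather than hard-code it. The helper below is only a sketch: the function name is made up, and it assumes the build output always follows the integration/presto/target/carbondata-presto-<version> pattern shown above.

```shell
# Sketch: print the connector build directory inside a CarbonData checkout.
# Assumes the layout integration/presto/target/carbondata-presto-<version>/.
find_carbon_presto_dir() {
  for dir in "$1"/integration/presto/target/carbondata-presto-*/; do
    if [ -d "$dir" ]; then
      printf '%s\n' "${dir%/}"   # strip the trailing slash left by the glob
      return 0
    fi
  done
  echo "no carbondata-presto-* directory under $1; did the build finish?" >&2
  return 1
}
```

Calling find_carbon_presto_dir with the path of your carbondata checkout prints the first matching folder, or fails with a message if the build has not produced one.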

Now we need to make changes on the Presto side so that the Presto engine can connect to CarbonData.

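Step 1: Create a carbondata.properties file inside the etc/catalog/ folder in the Presto installation directory. Presto names the catalog after this file, so the file name must match the catalog name passed to the Presto CLI later (carbondata). The file has only two properties: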

connector.name=carbondata

carbondata-store=hdfs://localhost:54311/opt/example

connector.name specifies the connector that Presto should load for this catalog, here the CarbonData connector.

carbondata-store specifies the Carbondata store location.
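Step 1 can also be scripted. The sketch below writes the two properties into etc/catalog/carbondata.properties; Presto takes the catalog name from the file name, which is why the file is called carbondata.properties to match the --catalog carbondata option used with the CLI. The function name and its arguments are assumptions for illustration.

```shell
# Sketch: write etc/catalog/carbondata.properties for a Presto installation.
write_carbon_catalog() {
  install_dir="$1"   # Presto installation directory
  store="$2"         # CarbonData store location, e.g. an HDFS path
  mkdir -p "$install_dir/etc/catalog"
  printf 'connector.name=carbondata\ncarbondata-store=%s\n' "$store" \
    > "$install_dir/etc/catalog/carbondata.properties"
}
```

With the store location from this post, the call would be write_carbon_catalog <presto-installation-directory> hdfs://localhost:54311/opt/example.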

Step 2: Go to the plugin folder inside the Presto installation directory and create a folder named after the connector.name property. In this case it is carbondata, as shown in Step 1.

cd plugin

mkdir carbondata

Step 3: Copy all the JARs from integration/presto/target/carbondata-presto-1.2.0-SNAPSHOT to the carbondata folder created in Step 2.

 

cp <carbon-data-installation-directory>/integration/presto/target/carbondata-presto-1.2.0-SNAPSHOT/* <presto-installation-directory>/plugin/carbondata
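Steps 2 and 3 can be combined into one small helper. This is a sketch under the same assumptions as the steps above (the plugin folder is called carbondata and the build output sits in a single directory); the function name is hypothetical.

```shell
# Sketch: create plugin/carbondata and copy the connector JARs into it.
install_carbon_plugin() {
  build_dir="$1"     # .../integration/presto/target/carbondata-presto-<version>
  install_dir="$2"   # Presto installation directory
  plugin_dir="$install_dir/plugin/carbondata"

  if [ ! -d "$build_dir" ]; then
    echo "build output not found: $build_dir" >&2
    return 1
  fi
  mkdir -p "$plugin_dir"
  cp "$build_dir"/* "$plugin_dir"/
  # Print how many JARs ended up in the plugin folder.
  ls "$plugin_dir" | grep -c '\.jar$'
}
```

Presto loads plugins at startup, so if the server is already running it needs a restart after the copy.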

 

Restart the Presto server so that it loads the new plugin and catalog, and you are all set to execute queries on CarbonData using Presto. To run the queries you can use the Presto CLI, which provides a terminal-based interactive shell. The CLI is a self-executing JAR file, which means it acts like a normal UNIX executable. You can download the Presto CLI from here. After downloading, rename the JAR to presto and make it executable with chmod +x.

The following command runs the Presto CLI:

./presto --server localhost:8080 --catalog carbondata --schema default

 

Once the Presto CLI has started, you can run any queries you want against CarbonData through Presto.
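If you would rather run a single statement than open the interactive shell, the CLI also accepts an --execute option. The wrapper below is a small sketch around the same server, catalog and schema used above; the function name and the PRESTO_CLI variable are made up for this example, and the variable is assumed to point at the executable downloaded earlier.

```shell
# Path to the Presto CLI executable (assumed to be ./presto, as in the command above).
PRESTO_CLI="${PRESTO_CLI:-./presto}"

# Sketch: run one SQL statement against the carbondata catalog and exit.
carbon_query() {
  "$PRESTO_CLI" --server localhost:8080 --catalog carbondata --schema default \
    --execute "$1"
}
```

For instance, carbon_query "SHOW TABLES" lists the tables in the default schema, which is a quick way to confirm that Presto can see your CarbonData store.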

 

 

Written by 

Bhavya is CTO at Knoldus Inc. with 16+ years of experience. He is a Java & Scala expert and experienced in managing large customers. He is currently focused on Big Data and the Reactive Stack. Technology and process improvements have been a forte of Bhavya, and he has worked on a varied technology stack including COBOL, Mainframe, Java, Scala, Data Warehouse, Oracle, PL/SQL, Salesforce, JMS - ActiveMQ etc. His hobbies include reading and playing badminton.
