Presto is a well-known open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. It was developed by Facebook to analyse petabytes of data and was later open sourced. Presto does not provide any storage of its own, but it can be used with a variety of data sources such as Hive, Cassandra, relational databases and even some proprietary data stores.
In this blog we are going to discuss how we can use Presto to query data from another upcoming open source solution, CarbonData. CarbonData is a fully indexed, columnar, Hadoop-native data store for processing heavy analytical workloads and detailed queries on big data. It enables faster interactive queries by using advanced columnar storage, indexing, compression and encoding techniques to improve computing efficiency. Presto with CarbonData helps speed up queries by an order of magnitude over petabytes of data.
To install Presto, download the tarball for the latest version from here and untar it in the directory of your choice. The tarball contains a single top-level directory, in this case presto-server-0.187, which we will call the installation directory. All the configuration files for Presto live in the etc folder inside the installation directory. Configure the Presto server as defined here according to your server settings. After installing and configuring the Presto server, you can run it as a daemon using the command below from the installation directory.
bin/launcher start
Alternatively, if you want to run it in the foreground, you can use the command below. Personally I prefer this one, as I can see all the log messages and errors on the screen.
bin/launcher run
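For the configuration step mentioned above, here is a minimal sketch of what a single-node setup might write into the etc folder. The property names follow the Presto deployment documentation, but the function name write_presto_config, the memory sizes, the placeholder node ID and the data directory are all assumptions for illustration and should be adapted to your machine.

```shell
# Hypothetical helper: writes a minimal single-node Presto configuration.
#   $1 = Presto installation directory
#   $2 = data directory for logs and node data (assumed path, pick your own)
write_presto_config() {
  etc="$1/etc"
  mkdir -p "$etc"

  # Node-level settings; the node.id below is only a placeholder value.
  printf '%s\n' \
    'node.environment=production' \
    'node.id=ffffffff-ffff-ffff-ffff-ffffffffffff' \
    "node.data-dir=$2" > "$etc/node.properties"

  # JVM options; the heap size is illustrative, not a recommendation.
  printf '%s\n' '-server' '-Xmx4G' '-XX:+UseG1GC' > "$etc/jvm.config"

  # One node acting as both coordinator and worker, listening on port 8080.
  printf '%s\n' \
    'coordinator=true' \
    'node-scheduler.include-coordinator=true' \
    'http-server.http.port=8080' \
    'query.max-memory=2GB' \
    'query.max-memory-per-node=1GB' \
    'discovery-server.enabled=true' \
    'discovery.uri=http://localhost:8080' > "$etc/config.properties"
}

# Example (paths are assumptions):
#   write_presto_config /opt/presto-server-0.187 /var/presto/data
```

Port 8080 is chosen so that it matches the localhost:8080 server address passed to the Presto CLI later in this post.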
The above steps get Presto running, but now we need to integrate it with CarbonData. To do so, first clone the CarbonData repository using the command below.
git clone https://github.com/apache/carbondata.git
Then do a complete build by running the command below inside the carbondata folder.
mvn -Pspark-2.1 -Phadoop-2.7.2 -DskipTests clean package
When the build is complete, you will see the following folder created inside the carbondata directory.
integration/presto/target/carbondata-presto-1.2.0-SNAPSHOT
Now we need to make changes on the Presto side so that the Presto engine can connect to CarbonData.
Step 1: Create a carbondata.properties file inside the etc/catalog/ folder in the Presto installation directory. Presto takes the catalog name from the name of this file, so it defines the carbondata catalog that we pass to the CLI later. The properties file has only two properties:
connector.name=carbondata
carbondata-store=<path-to-carbondata-store>
connector.name specifies the connector that Presto should use for this catalog.
carbondata-store specifies the CarbonData store location.
Step 2: Go to the plugin folder inside the Presto installation directory and create a folder with the name given in the connector.name property. In this case it is carbondata, as shown in Step 1.
Step 3: Copy all the JARs from integration/presto/target/carbondata-presto-1.2.0-SNAPSHOT to the carbondata folder created in Step 2.
cp <carbon-data-installation-directory>/integration/presto/target/carbondata-presto-1.2.0-SNAPSHOT/* <presto-installation-directory>/plugin/carbondata
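Steps 1 to 3 can also be sketched as one small shell function. This is only an illustration: the function name install_carbondata_connector and its three arguments are assumptions, not part of Presto or CarbonData. The catalog file is named carbondata.properties here because Presto derives the catalog name from the file name, and the CLI command later in this post uses --catalog carbondata.

```shell
# Hypothetical helper covering Steps 1-3.
#   $1 = Presto installation directory
#   $2 = CarbonData source directory (after the Maven build)
#   $3 = CarbonData store location
install_carbondata_connector() {
  presto_home=$1
  carbon_home=$2
  store=$3
  jars="$carbon_home/integration/presto/target/carbondata-presto-1.2.0-SNAPSHOT"

  # Stop early if the build output is not where we expect it.
  [ -d "$jars" ] || { echo "missing build output: $jars" >&2; return 1; }

  # Step 1: catalog properties file with the two properties.
  mkdir -p "$presto_home/etc/catalog"
  printf 'connector.name=carbondata\ncarbondata-store=%s\n' "$store" \
    > "$presto_home/etc/catalog/carbondata.properties"

  # Step 2: plugin folder named after connector.name.
  mkdir -p "$presto_home/plugin/carbondata"

  # Step 3: copy the connector JARs into the plugin folder.
  cp "$jars"/* "$presto_home/plugin/carbondata/"
}

# Example (all three paths are assumptions):
#   install_carbondata_connector /opt/presto-server-0.187 ~/carbondata /data/carbon-store
```

Restart the Presto server after running it so that the new plugin and catalog are picked up.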
Now you are all set to execute queries on CarbonData using Presto. To execute the queries you can use the Presto CLI, which provides a terminal-based interactive shell for running queries. The CLI is a self-executing JAR file, which means it acts like a normal UNIX executable. You can download the Presto CLI from here.
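Because the CLI is a self-executing JAR, preparing it usually comes down to giving the downloaded file a short name and marking it executable. A minimal sketch follows; the helper name prepare_presto_cli is hypothetical, and the JAR file name in the example is an assumption based on the server version used in this post.

```shell
# Hypothetical helper: copy the downloaded CLI JAR to "presto" and make it executable.
#   $1 = path to the downloaded CLI JAR
#   $2 = directory in which the "presto" executable should be placed
prepare_presto_cli() {
  cp "$1" "$2/presto" && chmod +x "$2/presto"
}

# Example (JAR name assumed to match the server version):
#   prepare_presto_cli presto-cli-0.187-executable.jar .
```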
The following command runs the Presto CLI.
./presto --server localhost:8080 --catalog carbondata --schema default
Once the Presto CLI has started, you can run any queries you want on CarbonData using Presto.
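For quick checks you may not want an interactive session at all. The Presto CLI also accepts an --execute option that runs a statement and exits, so a one-off query can be wrapped as in the sketch below. The function name run_carbondata_query and the PRESTO_CLI variable are assumptions for illustration; the server address, catalog and schema are the ones used above.

```shell
# Hypothetical wrapper: run a single statement against the carbondata catalog.
# PRESTO_CLI may point at the CLI executable; it defaults to ./presto.
#   $1 = SQL statement to run
run_carbondata_query() {
  "${PRESTO_CLI:-./presto}" \
    --server localhost:8080 \
    --catalog carbondata \
    --schema default \
    --execute "$1"
}

# Example: list the tables visible in the default schema.
#   run_carbondata_query "SHOW TABLES;"
```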