Apache Pig: Installation and Connecting to a Hadoop Cluster


Apache Pig is a scripting platform for analyzing large datasets. It provides a high-level scripting language, Pig Latin, that works with Apache Hadoop and lets users write complex transformations as simple scripts. Apache Pig interacts directly with the data in the Hadoop cluster.

Apache Pig translates a Pig script into MapReduce jobs, which run on Hadoop YARN and access the datasets stored in HDFS (Hadoop Distributed File System).

To use Apache Pig with Hadoop, we first have to create a Hadoop cluster and then run Pig on it. To create the Hadoop cluster, you can follow this blog. While following it, when you create hdfs-site.xml, please include the following properties:

<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>

Now that Hadoop is set up, it is time to install Pig. We can download the latest version of Pig at the time of writing (pig-0.16.0) from here. After downloading, we extract Pig from the tar file, create a new Pig folder in /usr/lib/, and move the extracted Pig folder into it.

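# Run these from the directory that contains the downloaded tar file.
# Writing to /usr/lib usually requires root; drop sudo if you are already root.
$ tar -xvf pig-0.16.0.tar.gz
$ sudo mkdir -p /usr/lib/pig/
$ sudo mv pig-0.16.0 /usr/lib/pig/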

After moving the Pig folder into /usr/lib/pig, update the .bashrc file:

export PIG_HOME="/usr/lib/pig/pig-0.16.0"
export PIG_CONF_DIR="$PIG_HOME/conf"
export PIG_CLASSPATH="$PIG_CONF_DIR"
export PATH="$PIG_HOME/bin:$PATH"

We have almost completed the process. Restart the console or reload the .bashrc file (for example with source ~/.bashrc), then check whether Pig is installed correctly:

$ pig -version
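
If pig -version reports that the command cannot be found, the usual cause is a PIG_HOME that does not point at the extracted folder. The following is a minimal sketch of a check for that; the function name check_pig_home is hypothetical and not part of Pig.

```shell
# Hypothetical helper: verify that PIG_HOME (or the directory passed as $1)
# contains an executable bin/pig launcher.
check_pig_home() {
    pig_home="${1:-$PIG_HOME}"
    if [ -x "$pig_home/bin/pig" ]; then
        echo "ok: $pig_home/bin/pig is executable"
    else
        echo "missing: $pig_home/bin/pig"
        return 1
    fi
}
```

Calling check_pig_home with no argument checks the PIG_HOME exported in .bashrc; passing a path checks that directory instead.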

We have now completed the installation of Hadoop and Pig.

We can start Pig with any of these commands:

$ pig -x local
$ pig -x mapreduce
or
$ pig

Here I want to discuss the two modes for starting Apache Pig: local and mapreduce.

When we run Pig in local mode, it runs on your local machine and uses the local file system. When we run Pig in mapreduce mode, it runs on the Hadoop cluster and uses HDFS.

By default Pig starts in mapreduce mode, so plain pig is equivalent to pig -x mapreduce. To run it in local mode we have to specify that explicitly.
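
A quick way to see local mode in action is a small smoke test that needs no cluster at all. The sketch below is only illustrative: the scratch directory, the file names and the sample row are made up, and the Pig run is skipped when pig is not on the PATH.

```shell
# Hypothetical scratch directory and sample data, for illustration only.
work_dir="$(mktemp -d)"
printf '1,Asha,Rao,5550001,Pune\n' > "$work_dir/student_data.txt"

# In local mode the LOAD path refers to the local file system, not HDFS.
cat > "$work_dir/local_test.pig" <<EOF
students = LOAD '$work_dir/student_data.txt' USING PigStorage(',')
  AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
DUMP students;
EOF

# Run the script in local mode only if Pig is actually installed.
if command -v pig >/dev/null 2>&1; then
    pig -x local "$work_dir/local_test.pig"
else
    echo "pig not found on PATH; skipping run"
fi
```

The same script run with -x mapreduce would look for the path in HDFS instead, which is the practical difference between the two modes.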

Next, we create a folder in the Hadoop cluster where we will keep all the scripts and text files.

$ hdfs dfs -mkdir hdfs://localhost:54310/pig_Data

Now we put the data file in the Hadoop cluster.

$ hdfs dfs -put /home/anurag/student_data.txt hdfs://localhost:54310/pig_Data/

Now we start the Pig environment with the command:

$ pig -x mapreduce

Now we are ready to run Pig commands on the Hadoop cluster.

1. We LOAD the file from the Hadoop cluster:

students = LOAD 'hdfs://localhost:54310/pig_Data/student_data.txt'
  USING PigStorage(',') as ( id:int, firstname:chararray,
  lastname:chararray, phone:chararray, city:chararray );
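
The schema above expects five comma-separated fields per row. Before uploading a file, it can help to check locally that every row matches. This is a minimal sketch, assuming no field contains a comma of its own; the function name check_student_rows is hypothetical.

```shell
# Hypothetical helper: print the line number of every row that does not
# have exactly 5 comma-separated fields (id, firstname, lastname, phone, city).
check_student_rows() {
    awk -F',' 'NF != 5 { print NR }' "$1"
}
```

For example, check_student_rows /home/anurag/student_data.txt prints nothing when every row has five fields.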

 

2. We can STORE data directly on the cluster in a new directory. The STORE
command always creates a new directory; it does not write into an existing
one.

STORE students INTO 'hdfs://localhost:54310/pig_Output/' USING PigStorage(',');
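
Because STORE does not write into an existing directory, re-running the same statement fails until the old output is removed. The following is a sketch of a cleanup step to run from the shell before re-running; the function name clean_output_dir and the HDFS_CMD variable are hypothetical, the latter added only so the helper can be dry-run without a cluster.

```shell
# Hypothetical helper: delete the STORE target only if it already exists.
# HDFS_CMD is an assumption for dry runs; it defaults to the real hdfs command.
clean_output_dir() {
    out_dir="$1"
    hdfs_cmd="${HDFS_CMD:-hdfs}"
    # "dfs -test -d" exits with 0 when the path exists and is a directory.
    if $hdfs_cmd dfs -test -d "$out_dir"; then
        $hdfs_cmd dfs -rm -r "$out_dir"
    fi
}
```

For example, clean_output_dir hdfs://localhost:54310/pig_Output/ removes the directory created by an earlier run, and does nothing if it is not there.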

We can view the data from both ends:

  1. In the Hadoop cluster we can see the data with:
    hdfs dfs -cat hdfs://localhost:54310/pig_Output/part-m-00000
    
  2. In the Pig environment we can see the data with:
    DUMP students;
    

We can run scripts directly from the Pig (Grunt) prompt with the run command. Here is a Word Count example: we LOAD a file from HDFS and perform a word count on it. So let's start:

  1. We assume that we have a text file on HDFS named sample_data.txt.
  2. Now we put our script wordcount_script.pig on the cluster. The script is written in Pig Latin:
    lines = LOAD 'hdfs://localhost:54310/pig_Data/sample_data.txt' AS (line:chararray);
    words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
    grouped = GROUP words BY word;
    wordcount = FOREACH grouped GENERATE group, COUNT(words);
    DUMP wordcount;
    
  3. Now use the run command to run the script:
    run wordcount_script.pig;
    
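
To sanity-check the script's output on a small file, the same counting can be approximated locally. The sketch below is not Pig: the function name local_wordcount is hypothetical, and it splits on whitespace only, whereas Pig's TOKENIZE also treats a few other characters (such as commas and double quotes) as separators, so results can differ for punctuated text.

```shell
# Hypothetical local approximation of the Pig word count:
# split each line on whitespace, count each word, print "word<TAB>count" sorted by word.
local_wordcount() {
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
         END { for (w in count) printf "%s\t%d\n", w, count[w] }' "$@" | sort
}
```

It reads a local copy of the file (or standard input), for example local_wordcount sample_data.txt. The Pig script itself can also be run without opening the prompt by passing it on the command line: pig -x mapreduce wordcount_script.pig.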

So far we have seen how to set up a Hadoop cluster, connect Apache Pig to it, and run some basic commands and a script.

I hope this helps you get started with Apache Pig on a Hadoop cluster.

Thanks  🙂

