Apache Pig is a scripting platform for analyzing large datasets. Pig is a high-level scripting language that works with Apache Hadoop: it lets you write complex transformations as simple scripts in Pig Latin, and it interacts directly with data stored in the Hadoop cluster.
Apache Pig translates Pig Latin scripts into MapReduce jobs, which execute through Hadoop YARN to access datasets stored in HDFS (Hadoop Distributed File System).
To use Apache Pig, we first need a Hadoop cluster, and then we run Pig on top of it. For creating the Hadoop cluster, you can follow this blog. While following that blog, when you create hdfs-site.xml, change it as follows:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
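For reference, a complete hdfs-site.xml with these two settings might look like the sketch below; any other properties your setup needs would sit alongside them inside the same configuration element.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Single-node setup: keep only one copy of each block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- Disable HDFS permission checks so Pig can write freely in this tutorial setup -->
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
```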
Once Hadoop is set up, it is time to install Pig. We can download the latest version of Pig (pig-0.16.0) from here. After downloading, we extract the tar file, create a new pig folder in /usr/lib/, and move the extracted directory into it.
$ tar -xvf pig-0.16.0.tar.gz
$ sudo mkdir /usr/lib/pig/
$ sudo mv pig-0.16.0 /usr/lib/pig/
After copying the Pig folder into /usr/lib/pig, update the .bashrc file:
export PIG_HOME="/usr/lib/pig/pig-0.16.0"
export PIG_CONF_DIR="$PIG_HOME/conf"
export PIG_CLASSPATH="$PIG_CONF_DIR"
export PATH="$PIG_HOME/bin:$PATH"
The process is now almost complete. Restart the console or reload .bashrc, then check whether Pig was installed correctly:
$ pig -version
With that, the installation of Hadoop and Pig is complete.
We can start Pig with these commands:
$ pig -x local
$ pig -x mapreduce
or simply
$ pig
Here I want to discuss these two modes for starting Apache Pig: local and mapreduce.
When we run Pig in local mode, it runs on the local machine using the local file system; when we run Pig in mapreduce mode, it runs on the Hadoop cluster and reads from HDFS.
Pig starts in mapreduce mode by default, so if we do not specify a mode, mapreduce is used; to run in local mode we must request it explicitly with -x local.
Next, we create a folder in the Hadoop cluster where we will keep all the script and text files.
$ hdfs dfs -mkdir hdfs://localhost:54310/pig_Data
Now we put the data file into the Hadoop cluster:
$ hdfs dfs -put /home/anurag/student_data.txt hdfs://localhost:54310/pig_Data/
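To confirm the upload succeeded, we can list the directory (the address localhost:54310 matches the HDFS URI used throughout this walkthrough):

```
$ hdfs dfs -ls hdfs://localhost:54310/pig_Data/
```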
Now we start the Pig environment with the command:
$ pig -x mapreduce
Now we are ready to run Pig commands on the Hadoop cluster.
1. We will LOAD the file from the Hadoop cluster:
students = LOAD 'hdfs://localhost:54310/pig_Data/student_data.txt'
    USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
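After the LOAD, it can be useful to confirm that Pig picked up the schema from the AS clause before running anything heavier. A quick check in the Grunt shell (the exact output formatting may vary slightly between Pig versions):

```pig
-- Show the schema Pig inferred for the students relation
DESCRIBE students;
-- students: {id: int, firstname: chararray, lastname: chararray, phone: chararray, city: chararray}

-- Preview the loaded tuples (runs a MapReduce job in mapreduce mode)
DUMP students;
```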
2. We can STORE data directly on the cluster in a new directory. The STORE command always creates a new output directory; it will not write into an existing one.
STORE students INTO 'hdfs://localhost:54310/pig_Output/' USING PigStorage(',');
We can see the data from both ends:
- In the Hadoop cluster we can see the data with:
hdfs dfs -cat hdfs://localhost:54310/pig_Output/part-m-00000
- In the Pig environment we can see the data with:
DUMP students;
We can also run scripts directly from the command prompt with the run command. Here is an example of a Word Count: we will LOAD a file from HDFS and count the occurrences of each word in it. So let's start:
- We assume that we have a text file on HDFS named sample_data.txt.
- Now we put our script, wordcount_script.pig, on the cluster. The script is written in Pig Latin:
lines = LOAD 'hdfs://localhost:54310/pig_Data/sample_data.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
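To see what the script computes without launching a MapReduce job, here is a rough shell equivalent run on a local copy of the data. The sample file contents below are made up purely for illustration; the real sample_data.txt lives on HDFS.

```shell
# Hypothetical sample input (two lines, three distinct words)
printf 'hello world\nhello pig\n' > /tmp/sample_data.txt

# Split on whitespace (roughly what TOKENIZE + FLATTEN do),
# then group and count (what GROUP BY + COUNT do)
tr -s ' ' '\n' < /tmp/sample_data.txt | sort | uniq -c
```

For this input the pipeline prints a count of 2 for hello and 1 each for pig and world, which is the same result the Pig script would DUMP as (word, count) tuples.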
- Now use the run command to execute the script.
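For example, if a copy of the script is saved locally (the path below is an example; adjust it to wherever you keep the script), it can be run from inside the Grunt shell, or submitted directly from the system shell. Depending on your Pig version, run may also accept an HDFS path.

```
grunt> run /home/anurag/wordcount_script.pig
```

or, from the system shell:

```
$ pig -x mapreduce /home/anurag/wordcount_script.pig
```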
So far we have seen how to set up a Hadoop cluster, connect Apache Pig to it, and run some basic commands and a script.
I hope this helps you get started with Apache Pig on a Hadoop cluster.