Apache Pig is a scripting platform for analyzing large datasets. Pig is a high-level scripting language that works with Apache Hadoop. It enables developers to write complex transformations as simple scripts with the help of Pig Latin. Apache Pig interacts directly with the data in the Hadoop cluster.
Apache Pig transforms Pig scripts into MapReduce jobs, which run on Hadoop YARN to access the datasets stored in HDFS (Hadoop Distributed File System).
When we want to use Apache Pig, we first have to create a Hadoop cluster and then run Pig on it. For creating the Hadoop cluster, you can follow this blog. While following that blog, please make the corresponding changes when you create hdfs-site.xml.
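The exact changes depend on your setup and are not reproduced here. As a reference only, a minimal hdfs-site.xml for a single-node cluster often looks like the sketch below; the replication value is an assumption, not the original blog's setting:

    <configuration>
      <property>
        <!-- Assumed: replication factor 1 for a single-node setup -->
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>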
Now that you are done with Hadoop, it is time to install Pig. We can download the latest version of Pig (pig-0.16.0) from here. After downloading, we extract Pig from the tar file, create a new pig folder in /usr/lib/, and copy Pig into it.
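A minimal sketch of these steps, assuming the tarball was downloaded to the current directory:

    tar -xzf pig-0.16.0.tar.gz              # extract the downloaded archive
    sudo mkdir -p /usr/lib/pig              # create the target folder
    sudo cp -r pig-0.16.0/* /usr/lib/pig/   # copy Pig into it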
After copying the Pig folder into /usr/lib/pig, update the .bashrc file.
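The typical entries to add to .bashrc look like this:

    export PIG_HOME=/usr/lib/pig
    export PATH=$PATH:$PIG_HOME/bin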
Now we have almost completed the process. Restart the console or reload .bashrc, and then check whether Pig is installed correctly.
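For example:

    source ~/.bashrc   # reload the environment
    pig -version       # should print the installed Pig version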
Now we have completed the installation of Hadoop and Pig.
We can start Pig with one of these commands:
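These are the two standard launch commands:

    pig -x local       # run Pig against the local file system
    pig -x mapreduce   # run Pig against the Hadoop cluster (HDFS)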
Here I want to discuss these two ways of starting Apache Pig: local mode and MapReduce mode. When we run Pig in local mode, it runs on your local machine against the local file system, while in MapReduce mode it runs on the Hadoop cluster with HDFS. By default Pig starts in MapReduce mode; if we want local mode, we have to specify it explicitly.
Next, we create a folder in the Hadoop cluster where we will keep all the script and text files.
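A sketch of this step; the folder name /pig_data is an assumption:

    hdfs dfs -mkdir -p /pig_data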
Now we put the data file into the Hadoop cluster.
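For example, to upload the sample_data.txt file used later in this post:

    hdfs dfs -put sample_data.txt /pig_data/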
Now we start the Pig environment with the following command:
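Starting Pig in MapReduce mode drops us into the Grunt shell:

    pig -x mapreduce
    grunt>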
Now we are ready to run Pig commands on the Hadoop cluster.
1. We will LOAD a file from the Hadoop cluster:
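A minimal sketch; the path and single-field schema are assumptions based on the files above:

    grunt> data = LOAD '/pig_data/sample_data.txt' AS (line:chararray);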
2. We can STORE data directly on the cluster in a new directory. The STORE command always creates a new directory; it does not write into an existing one.
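A sketch, with /pig_output as an assumed (and not yet existing) output directory:

    grunt> STORE data INTO '/pig_output' USING PigStorage(',');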
We can see the data from both ends:
- In the Hadoop cluster, we can read the stored files directly.
- In the Pig environment, we can print a relation to the console.
Both commands are sketched below.
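Assuming the /pig_output directory from the STORE example above (the exact part-file name depends on the job):

    hdfs dfs -cat /pig_output/part-m-00000   # view the stored files on HDFS
    grunt> DUMP data;                        # print the relation inside Pig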
We can run scripts directly from the Grunt shell with the run command. Here is an example of a word count. In this example, we will LOAD a file from HDFS and perform a word count on that file. So let's start:
- We assume that we have a text file on HDFS with the name sample_data.txt.
- Now we put our script wordcount_script.pig on the cluster. We write the script in Pig Latin.
- Now we use the run command to execute the script. Both the script and the command are sketched below.
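A minimal word-count script in Pig Latin; the relation and field names (lines, word, total) are illustrative:

    -- wordcount_script.pig
    lines   = LOAD '/pig_data/sample_data.txt' AS (line:chararray);
    -- split each line into individual words
    words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    -- group identical words and count each group
    grouped = GROUP words BY word;
    counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
    DUMP counts;

Then, from the Grunt shell, assuming the script is accessible from where Pig was started:

    grunt> run wordcount_script.pig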
So far we have seen the set-up process for a Hadoop cluster, connected Apache Pig with it, and run some basic commands and a script.
I hope it will help you to start using Apache Pig with a Hadoop cluster.
Thanks 🙂