In this blog post we will install and configure HDFS and YARN with minimal configuration to create a cluster on a single local machine. After that we will submit a job to the YARN cluster using spark-shell. So let's start.
Before installing Hadoop on your standalone machine, the prerequisites are:
- Java 7
To install Hadoop on a standalone machine, we first create a dedicated user for it as follows. It's not mandatory, but it is recommended.
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
The steps above create an hduser user and a hadoop group on your machine.
The second step is to configure SSH on your local machine, because Hadoop requires SSH access to manage its nodes. To let hduser log in to localhost without a password, run the following commands.
$ su - hduser
$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Now you can check your SSH setup by connecting to localhost with the following command.
$ ssh localhost
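If you prefer a check that does not open an interactive session, the following is a minimal sketch. The function name can_ssh_without_password is made up for this example. It assumes you have already run `ssh localhost` once by hand so that the host key is accepted, because with BatchMode ssh fails instead of prompting.

```shell
# can_ssh_without_password HOST: succeed if key-based login to HOST works
# without any prompt. BatchMode=yes makes ssh fail instead of asking for
# a password, and ConnectTimeout keeps the check from waiting too long.
can_ssh_without_password() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true 2>/dev/null
}

if can_ssh_without_password localhost; then
  echo "Passwordless SSH to localhost works."
else
  echo "Passwordless SSH to localhost failed; check ~/.ssh/authorized_keys and its permissions."
fi
```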
With the steps above done, we can install Hadoop and create the cluster on our local machine with the following commands.
$ cd /usr/local
$ wget https://archive.apache.org/dist/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz
$ tar -xvf hadoop-2.6.0.tar.gz
We need to set the HADOOP_HOME and JAVA_HOME environment variables in the .bashrc file.
export HADOOP_HOME=/usr/local/hadoop-2.6.0
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
export PATH=$PATH:$HADOOP_HOME/bin
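A misspelled variable name in .bashrc is easy to miss, so it can help to confirm that both variables are really set in a new shell. This is only a sketch; require_var is a helper name invented for this example.

```shell
# require_var NAME: succeed only if the variable NAME is set and non-empty.
require_var() {
  eval "value=\${$1:-}"
  if [ -z "$value" ]; then
    echo "$1 is not set" >&2
    return 1
  fi
  echo "$1=$value"
}

# Check the two variables this post relies on.
for name in HADOOP_HOME JAVA_HOME; do
  require_var "$name" || echo "Add 'export $name=...' to ~/.bashrc and run: source ~/.bashrc"
done
```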
Now we need to create a directory that HDFS will use to store its data.
$ sudo mkdir /tmp/hadoop_data
$ sudo chown hduser:hadoop /tmp/hadoop_data
$ sudo chmod 777 /tmp/hadoop_data
To create a cluster we need to edit the Hadoop configuration files, which are located in /usr/local/hadoop-2.6.0/etc/hadoop.
First, we need to set the JAVA_HOME variable in hadoop-env.sh.
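As a sketch, the line in hadoop-env.sh could look like the following. It assumes the same OpenJDK 7 path that was used in .bashrc; adjust it if Java is installed elsewhere on your machine.

```shell
# In /usr/local/hadoop-2.6.0/etc/hadoop/hadoop-env.sh
# Assumption: same JDK location as in .bashrc; change the path to match your system.
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
```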
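Next, configure the directory in which Hadoop saves its data and the URI of the file system.
Add the following configuration to core-site.xml between <configuration></configuration>. (fs.default.name is the older name of this setting; Hadoop 2.x also accepts fs.defaultFS.)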
<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop_data</value>
  <description>directory for hadoop data</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description> data to be put on this URI</description>
</property>
Next we define the MapReduce job tracker host and port by adding the following configuration to mapred-site.xml between <configuration></configuration>. If mapred-site.xml does not exist yet, copy it from mapred-site.xml.template in the same directory.
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>... </description>
</property>
Finally, we configure the replication factor of HDFS in hdfs-site.xml, again between <configuration></configuration>.
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
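To show how a property sits between the <configuration> tags in a finished file, here is a minimal sketch that writes a complete hdfs-site.xml. It writes to a scratch directory (an assumption made for this example) so that nothing under /usr/local is touched until you have looked at the result.

```shell
# Scratch directory for this sketch; the real files live in
# /usr/local/hadoop-2.6.0/etc/hadoop.
conf_dir=$(mktemp -d)

# Write a complete hdfs-site.xml with the single property from this post.
cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

# Print the file so it can be reviewed.
cat "$conf_dir/hdfs-site.xml"
```

Once the file looks right, it can be copied into the Hadoop configuration directory. core-site.xml and mapred-site.xml follow the same layout with their own properties.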
You are now done with the installation and configuration of Hadoop. Format the NameNode and start the cluster as follows.
$ /usr/local/hadoop-2.6.0/bin/hadoop namenode -format
$ /usr/local/hadoop-2.6.0/sbin/start-dfs.sh
$ /usr/local/hadoop-2.6.0/sbin/start-yarn.sh
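To check that the NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager started successfully, run the jps command. (With YARN in Hadoop 2.x, the ResourceManager and NodeManager take the place of the job tracker and task tracker.) Each of these daemons should appear in its output.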
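Reading the jps output by eye works, but the check can also be scripted. The sketch below assumes the five daemon names of a single-node Hadoop 2.x setup with YARN; missing_daemons is a helper name invented for this example.

```shell
# Daemons expected on a single-node Hadoop 2.x cluster with YARN (assumption).
expected="NameNode DataNode SecondaryNameNode ResourceManager NodeManager"

# missing_daemons JPS_OUTPUT: print every expected daemon that does not
# appear in the given jps output. jps prints one "<pid> <name>" pair per line,
# so the name is compared against the second field as a whole word.
missing_daemons() {
  for daemon in $expected; do
    printf '%s\n' "$1" | awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' \
      || echo "$daemon"
  done
}

if command -v jps >/dev/null 2>&1; then
  missing=$(missing_daemons "$(jps)")
  if [ -z "$missing" ]; then
    echo "All expected daemons are running."
  else
    echo "Not running:" $missing
  fi
else
  echo "jps not found; check that the JDK bin directory is on PATH."
fi
```

If a daemon is reported as not running, its log file under the logs directory of the Hadoop installation is usually the first place to look.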
To see the web UIs of the NameNode and the YARN cluster, go to the following links:
NameNode: http://localhost:50070/ (the default port in Hadoop 2.x)
Yarn Cluster: http://localhost:8088/
Now that your Hadoop cluster is up, we are going to start spark-shell on YARN.
The Hadoop cluster is running on localhost and spark-shell has to start in client mode, so first SSH to localhost and then run the following command from your Spark installation directory.
$ ./bin/spark-shell --master yarn-client
If it starts successfully, you will see spark-shell listed as an application in the cluster UI linked above. If it throws an exception, verify all your environment variables and permissions.
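One environment variable worth checking first is HADOOP_CONF_DIR (or YARN_CONF_DIR), which Spark uses to find the Hadoop client configuration when running on YARN. The sketch below assumes the configuration directory used earlier in this post and that the current directory is your Spark installation.

```shell
# Assumption: Hadoop's configuration lives in the directory edited earlier.
export HADOOP_CONF_DIR=/usr/local/hadoop-2.6.0/etc/hadoop

# Run from the Spark installation directory (assumed to be the current directory).
if [ -x ./bin/spark-shell ]; then
  ./bin/spark-shell --master yarn-client
else
  echo "spark-shell not found; cd into your Spark installation directory first."
fi
```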