Spark-shell on yarn resource manager: Basic steps to create hadoop cluster and run spark on it


In this blog we will install and configure HDFS and YARN with a minimal configuration to create a cluster on a local machine. After that we will submit a job to the YARN cluster with the help of spark-shell. So let's start.

Before installing Hadoop on your standalone machine, make sure the following prerequisites are in place:

  • Java 7
  • ssh
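
A quick way to confirm that both prerequisites are on the PATH is sketched below. The helper name check_prereq is made up for this example and is not part of Hadoop.

```shell
# check_prereq is a hypothetical helper name, not a Hadoop tool.
# It reports whether a command can be found on the PATH.
check_prereq() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

# The two prerequisites listed above.
for tool in java ssh; do
  check_prereq "$tool"
done

# Show the installed Java version, if any (the JVM prints it on stderr).
if command -v java >/dev/null 2>&1; then
  java -version 2>&1 | head -n 1
fi
```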

To install Hadoop on a standalone machine, we first create a dedicated user for it as follows. It's not mandatory, but it is recommended.

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser

The above steps create the hduser user and the hadoop group on your machine.

The second step is to configure ssh on your local machine, because Hadoop requires ssh access to manage its nodes. To let hduser log in to localhost without a password, run the following commands.

$ su - hduser
$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Now you can check your ssh setup by connecting to localhost with the following command.

$ ssh localhost
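
If ssh localhost still asks for a password, one common cause is that sshd, with its default StrictModes setting, ignores key files whose permissions are too open. The sketch below tightens them and then tests a non-interactive login. Both function names are made up for this example, and the directory argument defaults to $HOME/.ssh.

```shell
# tighten_ssh_perms is a hypothetical helper, not an OpenSSH command.
# $1 = ssh directory (defaults to $HOME/.ssh).
tighten_ssh_perms() {
  ssh_dir="${1:-$HOME/.ssh}"
  if [ ! -d "$ssh_dir" ]; then
    echo "no such directory: $ssh_dir" >&2
    return 1
  fi
  chmod 700 "$ssh_dir"
  if [ -f "$ssh_dir/authorized_keys" ]; then
    chmod 600 "$ssh_dir/authorized_keys"
  fi
}

# can_ssh_without_password is also a hypothetical helper.
# BatchMode makes ssh fail instead of prompting for a password.
# $1 = host to test (defaults to localhost).
can_ssh_without_password() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "${1:-localhost}" true 2>/dev/null
}
```

Run as hduser, a typical use would be: tighten_ssh_perms && can_ssh_without_password && echo "passwordless ssh works". The batch-mode test assumes you have already accepted the localhost host key once by connecting interactively.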

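With the above steps done, we can install Hadoop and create a cluster on our local machine with the following commands. /usr/local is usually writable only by root, so you may need to run them with sudo or as a user with write access there.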

$ cd /usr/local
$ wget https://archive.apache.org/dist/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz
$ tar -xvf hadoop-2.6.0.tar.gz

We need to set the HADOOP_HOME and JAVA_HOME environment variables in the .bashrc file.

export HADOOP_HOME=/usr/local/hadoop-2.6.0
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
export PATH=$PATH:$HADOOP_HOME/bin
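
After reloading .bashrc (for example with source ~/.bashrc), it is worth confirming that HADOOP_HOME points at the unpacked directory. A minimal sketch follows. check_hadoop_home is a made-up helper name, and it assumes the Hadoop 2.x layout with a bin/hadoop launcher.

```shell
# check_hadoop_home is a hypothetical helper name.
# It succeeds only if $1 contains an executable bin/hadoop launcher.
check_hadoop_home() {
  if [ -z "$1" ]; then
    echo "HADOOP_HOME is not set"
    return 1
  fi
  if [ ! -x "$1/bin/hadoop" ]; then
    echo "no executable at $1/bin/hadoop"
    return 1
  fi
  echo "HADOOP_HOME looks usable: $1"
}
```

Call it as check_hadoop_home "$HADOOP_HOME". If it succeeds, running hadoop version from any directory should print the installed release.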

Now we need to create a directory that is used to store data on HDFS.

$ sudo mkdir /tmp/hadoop_data
$ sudo chown hduser:hadoop /tmp/hadoop_data
$ sudo chmod 777 /tmp/hadoop_data

To create a cluster we need to edit the Hadoop configuration files, which are located in /usr/local/hadoop-2.6.0/etc/hadoop.

First we need to set the JAVA_HOME variable in hadoop-env.sh.

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64

Next, configure the directory in which Hadoop saves its data, and the file system URI.
Add the following configuration to core-site.xml between
<configuration></configuration>.

<property>
 <name>hadoop.tmp.dir</name>
 <value>/tmp/hadoop_data</value>
 <description>directory for hadoop data</description>
</property>
<property>
 <name>fs.default.name</name>
 <value>hdfs://localhost:54310</value>
 <description> data to be put on this URI</description>
</property>

Next we define the MapReduce job tracker host and port by adding the following configuration to mapred-site.xml between <configuration></configuration>. If your download only contains mapred-site.xml.template, copy it to mapred-site.xml first.

<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>...
</description>
</property>

Last, we configure the replication factor of HDFS in hdfs-site.xml, again between <configuration></configuration>.

<property>
<name>dfs.replication</name>
<value>1</value>
</property>
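
To double-check what ended up in the three files, the values can be read back from the shell. The sketch below is a small awk-based reader, not a real XML parser. get_prop is a made-up name, and it assumes each <name> and <value> element sits on its own line, as in the snippets above.

```shell
# get_prop is a hypothetical helper, not part of Hadoop.
# $1 = configuration file, $2 = property name.
# Prints the value of the first matching property, or nothing if absent.
# Assumes <name> and <value> are on separate lines.
get_prop() {
  awk -v want="$2" '
    /<name>/  { gsub(/.*<name>|<\/name>.*/, ""); name = $0 }
    /<value>/ { gsub(/.*<value>|<\/value>.*/, "")
                if (name == want) { print; exit } }
  ' "$1"
}
```

For example, get_prop /usr/local/hadoop-2.6.0/etc/hadoop/hdfs-site.xml dfs.replication should print 1 with the configuration above. Hadoop's own hdfs getconf -confKey dfs.replication reports the value the cluster actually resolves, which is the more authoritative check once HADOOP_HOME is set up.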

You are now done with the installation and configuration of Hadoop. Format the namenode and start the cluster as follows.

$ /usr/local/hadoop-2.6.0/bin/hadoop namenode -format
$ /usr/local/hadoop-2.6.0/sbin/start-dfs.sh
$ /usr/local/hadoop-2.6.0/sbin/start-yarn.sh

To check that your NameNode, DataNode, SecondaryNameNode, ResourceManager and
NodeManager started successfully, run the jps command. You should see output like the following:

30465 DataNode
30311 NameNode
30807 ResourceManager
30939 NodeManager
2987 Jps
30652 SecondaryNameNode
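
Rather than checking the list by eye, the jps output can be compared with the five daemons shown above. This is a minimal sketch. missing_daemons is a made-up name, and the expected list is taken from the sample output in this post.

```shell
# missing_daemons is a hypothetical helper, not a Hadoop tool.
# Reads `jps` output on stdin and prints each expected daemon that is absent.
missing_daemons() {
  jps_out=$(cat)
  for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # Compare the second column exactly, so that "NameNode" does not
    # match the "SecondaryNameNode" line.
    printf '%s\n' "$jps_out" |
      awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
      echo "$daemon"
  done
}
```

Use it as jps | missing_daemons. Empty output means all five daemons are running. If a name is printed, that daemon's log file under the logs directory of the Hadoop installation is the place to look.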

To see the web UI of the namenode and the YARN cluster, go to the following links.

Namenode:  http://localhost:50070/

Yarn Cluster:  http://localhost:8088/

Now your Hadoop cluster is up. Next we are going to start spark-shell on YARN with the following commands.

$ ssh localhost

We need to ssh to localhost because the Hadoop cluster is now running on localhost and we have to start spark-shell in client mode. Run the next command from your Spark installation directory.

$ ./bin/spark-shell --master yarn-client

If it starts successfully, you can see your spark-shell as an application in the YARN cluster UI mentioned above. If it throws an exception, verify all your environment variables and permissions.
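
One environment variable worth checking first is HADOOP_CONF_DIR (or YARN_CONF_DIR). Spark needs it to point at the directory holding the Hadoop client configuration in order to find the YARN ResourceManager. The sketch below verifies that. check_yarn_conf is a made-up name, and the file list is an assumption based on the files that the Hadoop 2.6.0 download places in etc/hadoop.

```shell
# check_yarn_conf is a hypothetical helper, not part of Spark or Hadoop.
# $1 = directory that should hold the Hadoop client configuration.
check_yarn_conf() {
  if [ -z "$1" ]; then
    echo "HADOOP_CONF_DIR is not set"
    return 1
  fi
  for conf_file in core-site.xml yarn-site.xml; do
    if [ ! -f "$1/$conf_file" ]; then
      echo "missing $1/$conf_file"
      return 1
    fi
  done
  echo "ok"
}
```

With the layout used in this post, that would mean running export HADOOP_CONF_DIR=/usr/local/hadoop-2.6.0/etc/hadoop and then check_yarn_conf "$HADOOP_CONF_DIR" before launching spark-shell. Once the shell is up, yarn application -list from another terminal should show it as a running application.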

 

About sandeep

I am working as a software consultant at Knoldus Software LLP. I work on Scala, Play, Spark, Hive, HDFS, Hadoop and many other big data technologies.