Install/Configure Hadoop HDFS, YARN Cluster and Integrate Spark with It

In our current scenario, we have a 4-node cluster where one is the master node (HDFS NameNode and YARN ResourceManager) and the other three are slave nodes (HDFS DataNode and YARN NodeManager).

In this cluster, we have implemented Kerberos, which makes it more secure.

The Kerberos services are already running on a separate server, which is treated as the KDC server.

On all of the nodes, we have to do the Kerberos client configuration, which I have already covered in my previous blog. Please go through the Kerberos authentication link below for more info.

kerberos authentication
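
Once the client configuration is in place, a quick way to confirm that a node can authenticate against the KDC is to request a ticket with a keytab; the principal and keytab path below simply mirror the ones used later in this post and may differ in your setup.

# Request a ticket using a keytab (principal/keytab names are examples from this post's setup)
kinit -kt /home/ubuntu/keytabs/hdfs-master.keytab hdfs/master@HADOOP.COM

# List cached tickets to confirm authentication succeeded
klist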

To start the installation of Hadoop HDFS and YARN, follow the steps below:

Prerequisites:
  • All nodes should have an IP address as mentioned below:
    • Master : 10.0.0.70
    • Slave 1 : 10.0.0.105
    • Slave 2 : 10.0.0.85
    • Slave 3 : 10.0.0.122
  • Passwordless SSH should be set up from the master node to all the slave nodes in order to avoid password prompts.
  • Each node should be able to communicate with every other node.
  • OpenJDK 1.8 should be installed on all four nodes.
  • The jsvc package should be installed on all four nodes.
  • Host file entries need to be made on all nodes so they can reach each other by name, acting as local DNS.
  • I am assuming the Kerberos packages are already installed and configured on all four nodes.

Let's begin the configuration:

Add the master node's SSH public key to each worker node's authorized_keys file, found at ~/.ssh/authorized_keys.

After adding the key, the master node will be able to log in to all of the worker nodes without being prompted for a password. One way to set this up is shown below.
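
A possible sequence from the master node (it assumes the worker hostnames introduced later in this post and that password authentication is still enabled for the initial copy):

# Generate a key pair on the master node (accept the default path, empty passphrase)
ssh-keygen -t rsa

# Copy the public key to each worker node
ssh-copy-id root@worker1
ssh-copy-id root@worker2
ssh-copy-id root@worker3

# Verify passwordless login
ssh root@worker1 hostname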

Install JDK 1.8 on all four nodes:

sudo apt-get install openjdk-8-jdk -y

Install jsvc on all four nodes as well:

sudo apt-get install jsvc -y
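
A quick check that both packages are in place (exact output varies by distribution):

java -version     # should report OpenJDK 1.8
which jsvc        # should print the path of the jsvc binary, e.g. /usr/bin/jsvc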

Make the host file entries mentioned below on all four nodes.

Open the /etc/hosts file:

vim /etc/hosts

Add the entries below for each node, and also add one entry for the Kerberos KDC (10.0.0.33 EDGE.HADOOP.COM):

10.0.0.70 master
10.0.0.105 worker1
10.0.0.85 worker2
10.0.0.122 worker3

10.0.0.33 EDGE.HADOOP.COM
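
To confirm the entries work, you can check name resolution and reachability from the master node (a simple sanity check):

ping -c 1 worker1
ping -c 1 worker2
ping -c 1 worker3
ping -c 1 EDGE.HADOOP.COM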

I am performing all of the operations here as the root user.

On the master node:

Download Hadoop 3.0.0 from the official Apache archive, then extract it and rename it to the hadoop directory:

cd /
wget https://archive.apache.org/dist/hadoop/core/hadoop-3.0.0/hadoop-3.0.0.tar.gz
tar -xzf hadoop-3.0.0.tar.gz
mv -v hadoop-3.0.0 hadoop

Now we have to add environment variables on the master node.

For this, add all of the variables below to the .bashrc of the root user:

vim ~/.bashrc

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/hadoop
export PATH=$HADOOP_HOME/bin:$PATH
export HADOOP_CONF_DIR=/hadoop/etc/hadoop
export HADOOP_USER_NAME=root
export LD_LIBRARY_PATH=/hadoop/lib/native:$LD_LIBRARY_PATH
export HDFS_SECONDARYNAMENODE_USER=root

To apply the changes immediately, run source ~/.bashrc.
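
A quick sanity check that the variables took effect (expected values shown as comments):

echo $HADOOP_HOME        # /hadoop
echo $HADOOP_CONF_DIR    # /hadoop/etc/hadoop
hadoop version           # should report Hadoop 3.0.0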

Go to the Hadoop configuration directory:

cd $HADOOP_CONF_DIR

HDFS configurations:

Open core-site.xml and add these configuration parameters:

vim core-site.xml

<configuration>
         <property>
                 <name>fs.default.name</name>
                 <value>hdfs://master:9000</value>
          </property>
         <property> 
                 <name>hadoop.security.authentication</name>
                 <value>kerberos</value> <!-- A value of "simple" would disable security. -->
         </property>
         <property>
                 <name>hadoop.security.authorization</name>
                 <value>true</value>
         </property>
 </configuration>

Now open hdfs-site.xml and add the configuration below:

vim hdfs-site.xml 

<configuration>
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>/data/name-node</value>
        </property>
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>/data/data-node</value>
        </property>
        <property>
                <name>dfs.replication</name>
                <value>1</value>
        </property>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/hadoop-hdfs/cache/ubuntu</value>
        </property>
         <property>
                <name>dfs.namenode.rpc-bind-host</name>
                <value>0.0.0.0</value>
        </property>
          <property>
                <name>dfs.namenode.servicerpc-bind-host</name>
                <value>0.0.0.0</value>
        </property>
        <property>
                <name>dfs.client.use.datanode.hostname</name>
                <value>true</value>
                <description>Whether clients should use datanode hostnames when connecting to datanodes.</description>
        </property>
        <property>
                <name>dfs.datanode.use.datanode.hostname</name>
                <value>true</value>
                <description>Whether datanodes should use datanode hostnames when connecting to other datanodes.</description>
        </property>
          <!-- General HDFS security config -->
        <property>
                <name>dfs.block.access.token.enable</name>
                <value>true</value>
        </property>
        <!-- NameNode security config -->
        <property>
                  <name>dfs.namenode.keytab.file</name>
                  <value>/home/ubuntu/keytabs/hdfs-master.keytab</value> <!-- path to the HDFS keytab -->
        </property>
        <property>
                <name>dfs.namenode.kerberos.principal</name>
                  <value>hdfs/_HOST@HADOOP.COM</value>
        </property>
        <property>
                <name>dfs.namenode.kerberos.internal.spnego.principal</name>
                <value>HTTP/_HOST@HADOOP.COM</value>
        </property>
        <!-- Secondary NameNode security config -->
        <property>
                <name>dfs.secondary.namenode.keytab.file</name>
                <value>/home/ubuntu/keytabs/hdfs-master.keytab</value> <!-- path to the HDFS keytab -->
        </property>
        <property>
                <name>dfs.secondary.namenode.kerberos.principal</name>
                <value>hdfs/_HOST@HADOOP.COM</value>
        </property>
        <property>
                <name>dfs.secondary.namenode.kerberos.internal.spnego.principal</name>
                <value>HTTP/_HOST@HADOOP.COM</value>
        </property>
        <!-- DataNode security config -->
        <property>
                  <name>dfs.datanode.data.dir.perm</name>
                  <value>700</value>
        </property>
        <property>
                  <name>dfs.datanode.address</name>
                  <value>0.0.0.0:1004</value>
        </property>
        <property>
                  <name>dfs.datanode.http.address</name>
                  <value>0.0.0.0:1006</value>
        </property>
        <property>
                  <name>dfs.datanode.keytab.file</name>
                  <value>/home/ubuntu/keytabs/worker.keytab</value> <!-- path to the HDFS keytab -->
        </property>
        <property>
                <name>dfs.datanode.kerberos.principal</name>
                <value>root@HADOOP.COM</value>
        </property>
        <!-- Web Authentication config -->
        <property>
                <name>dfs.web.authentication.kerberos.principal</name>
                <value>HTTP/_HOST@HADOOP.COM</value>
        </property>
</configuration>
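
The local paths referenced above (dfs.namenode.name.dir and hadoop.tmp.dir) are assumed to exist on the master node; a minimal sketch of creating them:

# Paths taken from the hdfs-site.xml above
mkdir -p /data/name-node /hadoop-hdfs/cache/ubuntu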

Open the workers file and add the DNS names of all worker nodes:

vim workers

worker1
worker2
worker3

Open the hadoop-env.sh file and set the values below:

vim hadoop-env.sh

export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"
export HADOOP_HOME="/hadoop"
export HADOOP_OS_TYPE=${HADOOP_OS_TYPE:-$(uname -s)}
export HDFS_DATANODE_SECURE_USER=root
export HDFS_NAMENODE_SECURE_USER=root

YARN configurations:

Create a yarn user on all of the nodes:

useradd yarn

Add the parameters below to container-executor.cfg:

vim container-executor.cfg

yarn.nodemanager.linux-container-executor.group=yarn
allowed.system.users=root,ubuntu,knoldus,Administrator,yarn
feature.tc.enabled=0

Then give the permission below to the container-executor.cfg file:

chmod 644 container-executor.cfg

Open the yarn-site.xml configuration file and add the configurations below:

vim yarn-site.xml

<configuration>
<!-- Site specific YARN configuration properties -->
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>master</value>
        </property>
        <property>
                <name>yarn.resourcemanager.address</name>
                <value>master:8032</value>
        </property>
        <property>
                <name>yarn.resourcemanager.scheduler.address</name>
                <value>master:8030</value>
        </property>
        <property>
                <name>yarn.resourcemanager.resource-tracker.address</name>
                <value>master:8031</value>
        </property>
        <property>
        <name>yarn.log-aggregation-enable</name>
                <value>true</value>
        </property>
        <property>
                <name>yarn.dispatcher.exit-on-error</name>
                <value>true</value>
        </property>
        <property>
                <description>List of directories to store localized files in.
                </description>
                <name>yarn.nodemanager.local-dirs</name>
                 <value>/hadoop-yarn/cache/ubuntu/nm-local-dir</value>
        </property>
        <property>
                <description>Where to store container logs.</description>
                <name>yarn.nodemanager.log-dirs</name>
                <value>/var/log/hadoop-yarn/containers</value>
        </property>
        <property>
                <description>Where to aggregate logs to.</description>
                <name>yarn.nodemanager.remote-app-log-dir</name>
                <value>/var/log/hadoop-yarn/apps</value>
        </property>
        <property>
                <description>Classpath for typical applications.</description>
                <name>yarn.application.classpath</name>
                <value>
                        $HADOOP_CONF_DIR,
                        /hadoop/*,/hadoop/lib/*,
                        $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
                        $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
                        $HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
               </value>
        </property>
        <property>
                <name>yarn.scheduler.minimum-allocation-mb</name>
                <value>1536</value>
        </property>
        <property>
                <name>yarn.scheduler.maximum-allocation-mb</name>
                <value>4608</value>
        </property>
        <property>
                <name>yarn.nodemanager.resource.memory-mb</name>
                <value>4608</value>
        </property>
        <property>
                <name>yarn.app.mapreduce.am.resource.mb</name>
                <value>3072</value>
        </property>
        <property>
                <name>yarn.app.mapreduce.am.command-opts</name>
                <value>-Xmx2457m</value>
        </property>
        <property>
                <name>yarn.webapp.ui2.enable</name>
                <value>true</value>
        </property>
 <!-- resource manager secure configuration info -->
        <property>
                 <name>yarn.resourcemanager.principal</name>
                 <value>yarn/_HOST@HADOOP.COM</value>
        </property>


        <property>
                 <name>yarn.resourcemanager.keytab</name>
                 <value>/home/ubuntu/keytabs/yarn-worker.keytab</value>
        </property>
<!-- remember the principal for the node manager is the principal for the host this yarn-site.xml file is on -->
<!-- these (next four) need only be set on node manager nodes -->
        <property>
                 <name>yarn.nodemanager.principal</name>
                 <value>yarn/_HOST@HADOOP.COM</value>
        </property>
        <property>
                 <name>yarn.nodemanager.keytab</name>
                 <value>/home/ubuntu/keytabs/yarn-worker.keytab</value>
        </property>
        <property>
                 <name>yarn.nodemanager.container-executor.class</name>
                 <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
        </property>
        <property>
                 <name>yarn.nodemanager.linux-container-executor.group</name>
                 <value>yarn</value>
        </property>
</configuration>

Now give the required permissions to the container-executor binary. Since copying the Hadoop directory to the workers over scp in the next step will not preserve the root:yarn ownership and the setuid/setgid bits, repeat these commands on each worker after the copy:

cd /hadoop/bin/
chown root:yarn container-executor
chmod 050 container-executor
chmod u+s container-executor
chmod g+s container-executor
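
After these commands the binary should be owned by root:yarn with the setuid and setgid bits set; you can confirm it like this:

ls -l /hadoop/bin/container-executor
# expected mode/ownership: ---Sr-s--- 1 root yarn ... container-executor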

Copy the Hadoop directory, with all the configuration files (core-site.xml, hdfs-site.xml, and yarn-site.xml), to all three worker nodes:

scp -r /hadoop worker1:/
scp -r /hadoop worker2:/
scp -r /hadoop worker3:/
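
The worker nodes also need the local directories referenced in hdfs-site.xml and yarn-site.xml; a small sketch that creates them over SSH (paths taken from the configuration above):

for node in worker1 worker2 worker3; do
  ssh $node "mkdir -p /data/data-node /hadoop-hdfs/cache/ubuntu \
    /hadoop-yarn/cache/ubuntu/nm-local-dir /var/log/hadoop-yarn/containers"
done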

Add the fields below to the environment variables of the master node:

vim ~/.bashrc

export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

The HDFS and YARN configuration is now done.

Spark configuration and integration with YARN

Download the Spark binary from the link below, then extract it and rename it to the spark directory:

wget https://mirrors.estointernet.in/apache/spark/spark-2.4.6/spark-2.4.6-bin-without-hadoop-scala-2.12.tgz

tar -xvf spark-2.4.6-bin-without-hadoop-scala-2.12.tgz
mv -v spark-2.4.6-bin-without-hadoop-scala-2.12 spark

Set the environment variables for Spark:

vim ~/.bashrc

export SPARK_HOME=/spark
export PATH=$SPARK_HOME/bin:$PATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin

Now rename the Spark defaults template:

mv $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf

Edit $SPARK_HOME/conf/spark-defaults.conf and add the parameters below:

# Example:
 spark.master                     yarn
# spark.eventLog.enabled           true
# spark.eventLog.dir               hdfs://namenode:8021/directory
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              512m
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.yarn.am.memory    512m
spark.executor.memory           512m

spark.driver.memoryOverhead 512
spark.executor.memoryOverhead 512

Rename and open the slaves file under $SPARK_HOME/conf and add the worker nodes listed below:

mv -v slaves.template slaves

vim slaves

worker1
worker2
worker3

Similarly, rename the Spark environment template:

mv spark-env.sh.template spark-env.sh

Then add the parameter below to spark-env.sh:

export SPARK_DIST_CLASSPATH=$(hadoop --config $HADOOP_CONF_DIR classpath)

Now the Spark and YARN integration is done.

It's time to start the HDFS and YARN services. Before starting the services, we first need to format the NameNode:

hdfs namenode -format

Now start the HDFS services:

cd /hadoop/sbin
./start-dfs.sh

This will start the NameNode on the master node as well as a DataNode on all of the worker nodes.

We can check and verify by running jps on all of the nodes; on the master node the output should look like this:

jps

15126 Jps
18823 NameNode
19160 SecondaryNameNode
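
Since the cluster is Kerberized, you need a valid ticket before querying HDFS; with one in place you can also check the cluster state from the master node (the principal and keytab names follow the ones used above and may differ in your setup):

kinit -kt /home/ubuntu/keytabs/hdfs-master.keytab hdfs/master@HADOOP.COM
hdfs dfsadmin -report     # should list all three DataNodes as live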

Start the YARN services:

cd /hadoop/sbin/
./start-yarn.sh

This will start the ResourceManager service on the master node and the NodeManager service on all of the worker nodes.
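
You can confirm that all NodeManagers have registered with the ResourceManager from the master node:

yarn node -list     # should show worker1, worker2 and worker3 in RUNNING state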

To access the HDFS web UI, use the link below:

http://10.0.0.70:9870

To access the YARN web UI, use the link below:

http://10.0.0.70:8088
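
To verify the whole chain end to end, you can submit one of the example jobs bundled with Spark to YARN; a minimal sketch, assuming a valid Kerberos ticket is in the cache and that the examples jar shipped with your Spark build is present under $SPARK_HOME/examples/jars:

spark-submit \
  --master yarn \
  --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 100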

Note: create the Kerberos principals and keytab files as per your requirements, because these are referenced in the HDFS and YARN configuration files.

Conclusion:

Here, we have created a four-node HDFS and YARN cluster with one master node and three worker nodes, and Spark jobs will run on top of YARN. Kerberos is also implemented for secure communication.

References:

https://www.linode.com/docs/databases/hadoop/install-configure-run-spark-on-top-of-hadoop-yarn-cluster/
