This blog pertains to Apache SPARK and YARN (Yet Another Resource Negotiator), where we will understand how Spark runs on YARN with HDFS. So let’s get started.
First, let’s see what Apache Spark is. The official definition of Apache Spark says that “Apache Spark™ is a unified analytics engine for large-scale data processing.” It is an in-memory computation processing engine where the data is kept in random access memory (RAM) instead of some slow disk drives and is processed in parallel. YARN is cluster management technology and HDFS stands for Hadoop Distributed File System.
Now, let’s start and try to understand the actual topic “How Spark runs on YARN with HDFS as storage layer”. We will look into the steps involved in submitting a job to a cluster.
Let’s say a client came and submitted a job using “spark-submit” from the client machine to the master machine.
In the master machine resides the NameNode (daemon for HDFS) and the ResourceManager (daemon for YARN). Both of them are Java processes.
Note: The NameNode and ResourceManager can reside in the same machine or different machine depending upon the configuration of the cluster.
Now, when the client submits a job, it goes to the master machine. It will talk to the NameNode and the NameNode will do various checks like –
- It will check, whether the client has appropriate permission to read the Input path.
- Whether the client has appropriate permission to write onto the Output path.
- Whether the Input and Output path is valid or not.
- And many more.
Once it verifies that everything is in place, it will assign a Job ID to the Job and then allocate the Job ID into a Job Queue.
So, in Job Queue there can be multiple jobs waiting to get processed.
As soon as a job is assigned to the Job Queue, it’s corresponding information about the Job like Input/Output Path, the location of the Jar, etc. are written into the temp location of HDFS.
Let’s talk a little about the temp location of HDFS. This is the location in each data node where the intermediate data goes in. The location of this path is set in the file named “core-site.xml” under location “hadoop-dir/etc/hadoop”.
That is, every detail of each job will be stored in the temp location. After this, the Job is finally “Accepted”.
In the next step, whenever the turn of a Job comes for execution from the Job Queue, the Resource Manager will randomly select a DataNode (worker node) and start a Java process called Application Master in the DataNode.
Note: For each Job, there will be an Application Master.
Now, on behalf of the Resource Manager, the Application Master will go to the temp location and the details of the Job will be checked/collected from the temp location. Subsequently, the Application Master communicates with the NameNode which further takes the call to figure out where the files (blocks) are located in the cluster, how much resources (number of CPUs, number of nodes, memory required) will the job need. So, the NameNode will do its computation and figure out those things.
Once all the evaluations are done, the Application Master sends all the resource request information to the Resource Manager.
Now, the Resource Manager will look into the request and will send the resource allocation request of the job to the DataNodes.
Now, let’s assume a scenario, the resource request which the Resource Manager has received from the Application Master is of just 2 Cores and 2 GB memory. And the data nodes in the cluster have a configuration of 4 Cores and 16 GB RAM.
In this case, the Resource Manager will send the resource allocation request to one of the DataNode requesting it to allocate 2 Cores and 2 GB memory (i.e. a portion of RAM and Core) to the Job.
So, the Resource Manager sends the request of 2 Cores and 2 GB memory packed together as a Container. These containers are known as Executors.
The resource allocation requests are handled by the NodeManager of each individual worker node and are responsible for the resource allocation of the job.
Finally, the code/Task will start executing in the Executor.
In Spark, there are two modes to submit a job: i) Client mode (ii) Cluster mode.
Client mode: In the client mode, we have Spark installed in our local client machine, so the Driver program (which is the entry point to a Spark program) resides in the client machine i.e. we will have the SparkSession or SparkContext in the client machine.
Whenever we place any request like “spark-submit” to submit any job, the request goes to Resource Manager then the Resource Manager opens up the Application Master in any of the Worker nodes.
Note: I am skipping the detailed intermediate steps explained above here.
The Application Master launches the Executors (i.e. Containers in terms of Hadoop) and the jobs will be executed.
After the Executors are launched they start communicating directly with the Driver program i.e. SparkSession or SparkContext and the output will be directly returned to the client.
The drawback of Spark Client mode w.r.t YARN is that: The client machine needs to be available at all times whenever any job is running. You cannot submit your job and then turn off your laptop and leave from office until your job is finished. 😛
In this case, it won’t be able to give the output as the connection between Driver and Executors will be broken.
Cluster Mode: The only difference in this mode is that Spark is installed in the cluster, not in the local machine. Whenever we place any request like “spark-submit” to submit any job, the request goes to Resource Manager then the Resource Manager opens up the Application Master in any of the Worker nodes.
Now, the Application Master will launch the Driver Program (which will be having the SparkSession/SparkContext) in the Worker node.
That means, in cluster mode the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. Whereas in client mode, the driver runs in the client machine, and the application master is only used for requesting resources from YARN.
Integrate Spark with YARN
To communicate with the YARN Resource Manager, Spark needs to be aware of your Hadoop configuration. This is done via the HADOOP_CONF_DIR environment variable. The SPARK_HOME variable is not mandatory but is useful when submitting Spark jobs from the command line.
- Edit the “bashrc” file and add the following lines:
export HADOOP_CONF_DIR=/<path of hadoop dir>/etc/hadoop
export YARN_CONF_DIR=/<path of hadoop dir>/etc/hadoop
export SPARK_HOME=/<path of spark dir>
export LD_LIBRARY_PATH=/<path of hadoop dir>/lib/native:$LD_LIBRARY_PATH
- Restart your session by logging out and logging in again.
- Rename the spark default template config file:
mv $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
- Edit $SPARK_HOME/conf/spark-defaults.conf and set spark.master to yarn:
Copy all jars of Spark from $SPARK_HOME/jars to hdfs so that it can be shared among all the worker nodes:
hdfs dfs -put *.jar /user/spark/share/lib
Add/modify the following parameters in spark-default.conf: