Apache Spark Cluster Internals: How Spark Jobs Are Computed by the Spark Cluster

In this blog we explain how a Spark cluster computes jobs. A Spark job is a collection of stages, and each stage is a collection of tasks. Before we dive deep, let us first look at the Spark cluster architecture.

[Figure: Spark cluster overview]

In the above cluster we can see the driver program, which is the main program of our Spark application. The driver program runs on the master node of the Spark cluster.

The cluster manager is responsible for allocating resources for the given job.

The worker nodes host executors, in which the tasks run and the data is cached.

This is the basic architecture of an Apache Spark cluster.
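As a rough sketch of how these pieces connect, the driver is simply the process that creates the SparkContext and points it at a cluster manager. The application name and master URL below are placeholders, not values from this post:

from pyspark import SparkConf, SparkContext

# Hypothetical master URL; replace it with your own cluster manager's
# address (standalone master, YARN, Mesos, etc.).
conf = (SparkConf()
        .setAppName("avg-word-length")
        .setMaster("spark://master-host:7077"))

# Creating the SparkContext in the driver registers the application with
# the cluster manager, which then allocates executors on the worker nodes.
sc = SparkContext(conf=conf)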

Now we discuss the different RDD types created by transformations:

  • HadoopRDD
  • FilteredRDD
  • ShuffledRDD

HadoopRDD: When Spark creates an RDD from a Hadoop InputFormat, it makes a HadoopRDD and maps its partitions to the Hadoop block size, which is 64 MB by default.

scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

FilteredRDD: If we apply a transformation that does not shuffle the RDD and maps partitions one to one, Spark creates a narrow RDD such as a FilteredRDD; for example, filter() produces a FilteredRDD (and map() similarly produces a MappedRDD).

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09

ShuffledRDD: If we apply a transformation that shuffles the RDD, Spark creates a ShuffledRDD, which marks a change of stage; transformations such as join(), reduceByKey(), and groupByKey() do this.
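As a minimal PySpark sketch (the data and variable names here are made up for illustration), a wide transformation such as reduceByKey() or groupByKey() forces a shuffle and therefore a new stage; in the Scala API the result is typically a ShuffledRDD, while PySpark wraps the same shuffle behind its own RDD classes:

# Tiny illustrative pair RDD.
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

# Wide transformation: values for the same key may live in different
# partitions, so Spark has to shuffle them across the cluster.
counts = pairs.reduceByKey(lambda x, y: x + y)

print(counts.collect())  # e.g. [('a', 2), ('b', 1)] -- order may vary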

Now we jump into the average-word-length example and see how Spark turns this job into stages and tasks, and which RDDs it creates along the way.

[Figure: the average word length job]

Let us consider the above diagram and walk through, line by line, how the cluster executes it.

avgleng = sc.textFile(file)

The above line makes a HadoopRDD. It is defined in the driver program, i.e. on the master node, and is distributed across all the workers when it is actually computed.
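As a small check (the input path is a placeholder), you can ask the RDD how many partitions it was split into; by default this follows the input splits, i.e. the HDFS block size:

# Hypothetical input path; any text file on HDFS or the local filesystem works.
avgleng = sc.textFile("hdfs:///data/words.txt")

# One partition per input split (by default, one per HDFS block).
print(avgleng.getNumPartitions())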

.flatMap(lambda line: line.split(" "))

The above line applies a transformation that makes a narrow (FilteredRDD-like) RDD, because it maps partitions one to one. It will be computed on the worker nodes, since the master node assigns tasks to the workers with the help of the DAG (Directed Acyclic Graph); so the master node simply adds this transformation to the DAG.

.map(lambda word: (word, len(word)))

The above transformation also makes a narrow (FilteredRDD-like) RDD because it keeps a one-to-one mapping between partitions, so the master node again assigns the transformation to the workers by adding it to the DAG.

.groupByKey()

The above transformation groups all the values that share the same key, so it shuffles the RDD's partitions and Spark creates a ShuffledRDD for it. Once a shuffle transformation appears, the Spark cluster starts a new stage of the job.
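One way to see this stage boundary yourself (the variable names and input path below are illustrative) is to print the RDD lineage; the shuffle shows up as a ShuffledRDD entry and an indentation change in toDebugString():

pairs = (sc.textFile("hdfs:///data/words.txt")        # hypothetical path
           .flatMap(lambda line: line.split(" "))
           .map(lambda word: (word, len(word))))

grouped = pairs.groupByKey()

# Everything below the shuffle in the printed lineage belongs to the first
# stage; everything above it belongs to the new stage.
debug = grouped.toDebugString()
print(debug.decode() if isinstance(debug, bytes) else debug)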

.map(lambda kv: (kv[0], sum(kv[1]) / len(kv[1])))

The above transformation maps partitions one to one, so it makes a narrow (FilteredRDD-like) RDD.

And at last, the action:

.count()

The workers start executing all the transformations, following the DAG, only when the action is encountered. The result is then sent back to the master node (the driver program).
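Putting the whole job together as one runnable PySpark sketch (the input path is a placeholder for the file variable used above):

avgleng = (sc.textFile("hdfs:///data/words.txt")            # HadoopRDD, placeholder path
             .flatMap(lambda line: line.split(" "))          # narrow, one-to-one
             .map(lambda word: (word, len(word)))            # narrow, one-to-one
             # Shuffle: groupByKey creates a ShuffledRDD and starts a new stage.
             .groupByKey()
             # New stage: average the lengths collected for each word.
             .map(lambda kv: (kv[0], sum(kv[1]) / len(kv[1])))
             # Action: triggers execution of the whole DAG and returns a number
             # (here, the count of distinct words) to the driver.
             .count())

print(avgleng)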
