Hello everyone! In my previous blog, I explained the difference between RDD, DataFrame, and Dataset; you can find that blog here.
In this blog, I will try to explain how Spark works internally and what the components of execution are: Jobs, Stages, and Tasks.
As we all know, Spark gives us two kinds of operations for solving any problem:
- Transformation
- Action
When we apply a transformation to an RDD, it gives us a new RDD, but it does not start executing that transformation. Execution happens only when an action is performed on the resulting RDD, which produces the final result.
So once you perform an action on an RDD, the SparkContext submits the resulting job to the driver.
The driver creates the DAG (Directed Acyclic Graph), or execution plan (job), for your program. Once the DAG is created, the driver divides it into a number of stages. These stages are then divided into smaller tasks, and all the tasks are handed to the executors for execution.
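To see this laziness in practice, here is a minimal sketch (assuming you are in spark-shell, where `sc` is the SparkContext, and `data.txt` is just a placeholder file name):

```scala
// Transformations only build up a lineage; nothing is read or computed yet.
val lines = sc.textFile("data.txt")        // transformation: no job runs here
val words = lines.flatMap(_.split(" "))    // transformation: still no job
val pairs = words.map(word => (word, 1))   // transformation: only the plan grows

// The action below is what actually triggers a Spark job and reads the file.
val numPairs = pairs.count()
```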
The Spark driver is responsible for converting a user program into units of physical execution called tasks. At a high level, all Spark programs follow the same structure:
They create RDDs from some input, derive new RDDs from those using transformations, and perform actions to collect or save data. A Spark program implicitly creates a logical Directed Acyclic Graph (DAG) of operations.
When the driver runs, it converts this logical graph into a physical execution plan.
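You can actually inspect this logical graph yourself: every RDD has a `toDebugString` method that prints its lineage without running a job. A small sketch (the file name is a placeholder):

```scala
// Build a small pipeline of narrow transformations and print its lineage.
// toDebugString does not submit a job; it only shows the recorded logical plan.
val cleaned = sc.textFile("nums.txt")   // placeholder path
  .filter(_.nonEmpty)
  .map(_.trim)

println(cleaned.toDebugString)
```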
So let’s take the example of a word count for better understanding:

```scala
val rdd = sc.textFile("address of your file")
rdd.flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _).collect
```
Here you can see that collect is an action: it gathers all the data and returns the final result. As explained above, when I perform the collect action, the Spark driver creates a DAG.
In the image above, you can see that one job was created and executed successfully.
Now let’s have a look at the DAG and its stages.
Here you can see that Spark created the DAG for the program written above and divided it into two stages.
In this DAG you can see a clear picture of the program: first the text file is read, then transformations like flatMap and map are applied, and finally reduceByKey is executed.
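Every action submits its own job, by the way: if you call a second action on the same RDD, a second job shows up in the Jobs tab. A quick sketch, reusing the same word count (paths and variable names are just for illustration):

```scala
val counts = sc.textFile("data.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.collect()   // first action  -> one job in the Jobs tab
counts.count()     // second action -> a second job; the pipeline is recomputed unless you cache
```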
But why did Spark divide this program into exactly two stages, not more and not fewer? It comes down to shuffling: whenever you perform a transformation that requires Spark to shuffle data between partitions (a wide transformation such as reduceByKey), Spark cuts the DAG and starts a new stage at that point. Transformations that do not need a shuffle (narrow transformations such as map and flatMap) are pipelined together into a single stage.
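You do not even need the Spark UI to spot this boundary: `toDebugString` marks a shuffle with an extra indentation level in the printed lineage (the exact output format depends on your Spark version). A rough sketch:

```scala
// Narrow transformations only: everything is pipelined into a single stage.
val narrowOnly = sc.textFile("data.txt").flatMap(_.split(" ")).map(word => (word, 1))

// reduceByKey has to shuffle data between partitions, so Spark cuts the DAG here
// and starts a second stage after the shuffle.
val counts = narrowOnly.reduceByKey(_ + _)

println(narrowOnly.toDebugString)  // a single indentation level: one stage
println(counts.toDebugString)      // a ShuffledRDD appears at a new level: two stages
```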
Now let’s have a look at how many tasks Spark has created.
As I mentioned earlier, the Spark driver divides the stages of the DAG into tasks. Here you can see that each stage is divided into two tasks.
But why did Spark create only two tasks per stage? It depends on the number of partitions.
In this program we have only two partitions, so each stage is divided into two tasks, and a single task runs on a single partition.
The number of tasks for this job = (number of stages) * (number of partitions) = 2 * 2 = 4. More generally, each stage launches one task per partition of the RDD it operates on, so stages with different partition counts contribute different numbers of tasks.
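You can check (and influence) the partition count yourself. In the sketch below, the second argument to textFile is a minimum number of partitions, which is a hint rather than a hard guarantee:

```scala
// One task runs per partition in each stage.
val rdd = sc.textFile("address of your file", 2)   // ask for at least 2 partitions
println(rdd.getNumPartitions)                      // typically prints 2 for a small file

// With 2 stages (before and after the shuffle) and 2 partitions in each,
// this word count runs 2 * 2 = 4 tasks in total.
rdd.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect
```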
I hope you now have a clearer picture of how Spark works internally.