Once you have downloaded Spark, started the Spark shell, and run a few short code examples, the next step is to understand what’s happening behind your sample code. For that, you should be familiar with some of the critical concepts of a Spark application.
Some important terms are:
Application: A user program built on Spark using its APIs. It consists of a driver program and executors on the cluster.
SparkSession: An object that provides a point of entry to interact with underlying Spark functionality and allows programming Spark with its APIs. In an interactive Spark shell, the Spark driver instantiates a SparkSession for you, whereas in a Spark application you create the SparkSession object yourself.
Job: A parallel computation consisting of multiple tasks that is spawned in response to a Spark action such as save() or collect().
Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other.
Task: A single unit of work or execution that will be sent to a Spark executor.
Spark Application and SparkSession
SparkSession was introduced in Spark 2.0 as the unified entry point of a Spark application. It provides a way to interact with Spark’s various functionality using fewer constructs. The new class org.apache.spark.sql.SparkSession combines the different contexts that were used before the 2.0 release (SQLContext, HiveContext, etc.). Hence, SparkSession can be used in place of SQLContext, HiveContext, and the other contexts defined prior to 2.0.
To create a SparkSession in Scala or Python, use the builder pattern: call builder() (in Python, builder is an attribute rather than a method) and then call getOrCreate(). This creates a new SparkSession, or returns the existing one if a SparkSession already exists.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("SparkByExamples.com")
  .getOrCreate()
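As a rough sketch of the getOrCreate() behaviour described above, the snippet below asks for a session twice and checks whether the second call hands back the same object. It assumes Spark 2.0 or later is on the classpath and runs in local mode; the object name SessionDemo and the app name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object SessionDemo {
  // Builds a local SparkSession, or reuses the one that is already running.
  def session(): SparkSession =
    SparkSession.builder()
      .master("local[*]")          // use all local cores
      .appName("session-demo")     // hypothetical app name
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val first  = session()
    val second = session()

    // Reference equality: is the second call the very same session object?
    println(first eq second)

    // The pre-2.0 entry points stay reachable through the session.
    val sc = first.sparkContext            // SparkContext, for the RDD API
    println(sc.appName)
    first.sql("SELECT 1 AS one").show()    // SQL without a separate SQLContext

    first.stop()
  }
}
```

Because the session wraps the SparkContext and the SQL entry point, application code usually only needs to pass one object around.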
During interactive sessions with Spark shells, the Spark driver converts your Spark application into one or more jobs. It then transforms each job into a DAG (Directed Acyclic Graph). This, in essence, is Spark’s execution plan, where each node within a DAG could be a single or multiple Spark stages.
Spark stages are created on the basis of which operations can be performed serially or in parallel. Not every Spark operation can happen within a single stage, so a job may need to be divided into multiple stages, typically at the points where data has to be shuffled between executors.
Types of Spark Stages
- ShuffleMapStage in Spark
- ResultStage in Spark
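To make the job and stage split a little more concrete, here is a small sketch in which one action triggers one job, and a reduceByKey() shuffle divides that job into a ShuffleMapStage followed by a ResultStage. It assumes Spark is on the classpath and runs in local mode; StageDemo, the app name, and the sample words are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object StageDemo {
  // Counts words with an RDD pipeline that contains exactly one shuffle.
  def run(): Map[String, Int] = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("stage-demo")       // hypothetical app name
      .getOrCreate()

    val words = spark.sparkContext
      .parallelize(Seq("a", "b", "a", "c", "b", "a"), 2)

    // map() stays in the first stage; reduceByKey() needs a shuffle,
    // so everything after it belongs to a second stage.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // Prints the RDD lineage; the indentation marks the shuffle boundary.
    println(counts.toDebugString)

    // collect() is the action that actually submits the job.
    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    println(run())
    SparkSession.builder().getOrCreate().stop()
  }
}
```

While the program runs, the Spark UI (by default on port 4040 of the driver) should list the job with its two stages.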
In Spark, a task is the smallest individual unit of execution, and it corresponds to one partition of an RDD. Each stage is composed of Spark tasks, which are distributed across the Spark executors. Each task maps to a single core and works on a single partition of data.
As such, an executor with 16 cores can run up to 16 tasks at a time (with the default of one core per task), each working on its own partition, while any remaining tasks wait for a core to free up. This makes the execution of Spark’s tasks highly parallel.
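The one-task-per-partition relationship can be observed with a short sketch like the one below. It assumes Spark on the classpath and a local master with four cores, so the eight tasks of the stage would run roughly four at a time; TaskDemo and the app name are made-up names.

```scala
import org.apache.spark.sql.SparkSession

object TaskDemo {
  // Returns the number of elements that each task (one per partition) processed.
  def partitionSizes(): Array[Int] = {
    val spark = SparkSession.builder()
      .master("local[4]")          // four cores, so at most four tasks at once
      .appName("task-demo")        // hypothetical app name
      .getOrCreate()

    // Ask for 8 partitions explicitly; the stage will then have 8 tasks.
    val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 8)
    println(s"partitions = ${rdd.getNumPartitions}")

    // mapPartitions runs its function once per partition, i.e. once per task.
    rdd.mapPartitions(it => Iterator(it.size)).collect()
  }

  def main(args: Array[String]): Unit = {
    println(partitionSizes().mkString(", "))
    SparkSession.builder().getOrCreate().stop()
  }
}
```

Changing numSlices changes the number of tasks in the stage, while the number of cores only limits how many of them run at the same moment.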
There are two types of transformations:
- Narrow Transformations
- Wide Transformations
Narrow transformations are transformations that do not require shuffling, so they can be executed in a single stage.
Examples: map() and filter()
Wide transformations are transformations that require shuffling data across partitions. Therefore, separate stages need to be created so that data can be exchanged between partitions.
Examples: groupByKey() and reduceByKey()
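One way to see the difference between the two kinds of transformations is to inspect the dependencies that an RDD records on its parent. The sketch below assumes Spark on the classpath and a local master; DependencyDemo and the app name are illustrative, and the check relies on the RDD API's dependencies field together with the NarrowDependency and ShuffleDependency classes.

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency}
import org.apache.spark.sql.SparkSession

object DependencyDemo {
  // Returns (is the map/filter chain narrow?, does reduceByKey need a shuffle?).
  def kinds(): (Boolean, Boolean) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("dependency-demo")  // hypothetical app name
      .getOrCreate()

    val nums = spark.sparkContext.parallelize(1 to 10, 2)

    // Narrow: each output partition depends on exactly one input partition.
    val narrow = nums.map(_ * 2).filter(_ % 4 == 0)

    // Wide: values with the same key must be brought together across partitions.
    val wide = nums.map(n => (n % 2, n)).reduceByKey(_ + _)

    val narrowIsNarrow =
      narrow.dependencies.forall(_.isInstanceOf[NarrowDependency[_]])
    val wideIsShuffle =
      wide.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

    (narrowIsNarrow, wideIsShuffle)
  }

  def main(args: Array[String]): Unit = {
    println(kinds())
    SparkSession.builder().getOrCreate().stop()
  }
}
```

A shuffle dependency is where the scheduler cuts the job into separate stages, which is why a wide transformation cannot share a stage with the transformations that follow it.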
In this blog, we’ve learned about Spark application concepts and the terminology used in a Spark application. We have also seen the two types of transformations in Spark, i.e., narrow and wide transformations.
Hope you enjoyed the blog. Thanks for reading!