We all know that Apache Spark is a fast cluster computing engine designed for computation on large volumes of data. Spark's ability to do this stems from a design philosophy that centers on four key characteristics:
- Speed
- Ease of use
- Modularity
- Extensibility
So let us dive into this design philosophy and see how Apache Spark operates.
Spark's Components & Architecture
Apache Spark applies a master-slave architecture to every application. At a high level, a Spark application consists of a driver program that is responsible for orchestrating parallel operations on the Spark cluster. The driver accesses the distributed components, namely the cluster manager and the Spark executors, through a SparkSession.
It is the Spark driver that is responsible for instantiating the SparkSession. Beyond this, the driver has several other roles:
- It communicates with the cluster manager and requests resources, such as CPU and memory, for the Spark executors.
- It transforms all the Spark operations into DAG computations, schedules them, and distributes their execution as tasks across the Spark executors.
- Once resources are allocated, the driver communicates directly with the executors.
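The roles above can be seen in a minimal sketch. Transformations such as `filter` are lazy; only the action (`count`) makes the driver build the DAG, schedule stages, and ship tasks to executors. The app name and `local[*]` master are placeholder assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // The driver instantiates the SparkSession
    val spark = SparkSession.builder
      .appName("DriverSketch")
      .master("local[*]")
      .getOrCreate()

    val data = spark.sparkContext.parallelize(1 to 1000)
    // Transformations are lazy: nothing executes here
    val evens = data.filter(_ % 2 == 0)
    // The action triggers DAG construction, stage scheduling,
    // and distribution of tasks to the executors
    println(evens.count()) // 500
    spark.stop()
  }
}
```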
With Apache Spark 2.0, the SparkSession became the channel for all Spark operations and data, a single unified entry point that subsumes earlier entry points such as SparkContext, SQLContext, HiveContext, SparkConf, and StreamingContext, making working with Spark simpler and easier.
```scala
import org.apache.spark.sql.SparkSession

// Build a SparkSession
val aSparkSession = SparkSession
  .builder
  .appName("This is Spark Blog")
  .config("spark.sql.shuffle.partitions", 6)
  .getOrCreate()

// Use the SparkSession to issue a SQL query
val aSqlQuery = aSparkSession.sql("SELECT song, lyric FROM table_name")
```
The cluster manager is responsible for managing and allocating resources for the cluster of nodes on which Spark applications run.
Spark supports four types of cluster managers:
- The built-in standalone cluster manager
- Apache Hadoop YARN
- Apache Mesos
- Kubernetes
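The choice of cluster manager is expressed through the master URL given when the application is built or submitted. The host names and ports below are placeholder assumptions, not real endpoints:

```scala
import org.apache.spark.sql.SparkSession

// Each cluster manager has its own master-URL scheme:
//   spark://host:7077          -> built-in standalone manager
//   yarn                       -> Hadoop YARN (reads HADOOP_CONF_DIR)
//   mesos://host:5050          -> Apache Mesos
//   k8s://https://host:6443    -> Kubernetes
val spark = SparkSession.builder
  .appName("ClusterManagerSketch")
  .master("spark://host:7077") // placeholder standalone master
  .getOrCreate()
```

In practice the master is more often supplied externally via `spark-submit --master`, so the application code stays cluster-agnostic.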
A Spark executor runs on each worker node; it communicates with the driver program and is responsible for executing tasks on the workers. In most deployment modes, a single executor runs on each node.
Distributed Data and Partitions
Spark breaks up the actual physical data into chunks called partitions and distributes them, either in HDFS or in cloud storage. Spark treats each partition as part of a high-level logical data abstraction, such as a DataFrame, in memory.
This distributed scheme of breaking data into chunks allows Spark executors to process the data that is close to them, minimizing network bandwidth, and ensures each executor core gets its own data partition to work on.
For example, this code distributes the data across the cluster into 20 partitions:

```scala
val rdd = spark.sparkContext.parallelize(0 until 20, 20)
println(rdd.getNumPartitions) // 20
```
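The same control exists on the DataFrame side: `repartition()` reshuffles the data into the requested number of partitions. This is a sketch assuming a local SparkSession with a placeholder app name:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("PartitionSketch") // placeholder name
  .master("local[*]")
  .getOrCreate()

// Reshuffle 1000 rows into 20 partitions
val df = spark.range(0, 1000).toDF("id").repartition(20)
println(df.rdd.getNumPartitions) // 20
```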
Well, this was a brief explanation of the Apache Spark architecture, from which I hope you have all gained something. Thanks for making it to the end of this blog; if you liked it, please take a look at my other blogs.