Tale of Apache Spark


Data is being produced at an enormous rate today, and it will only be generated faster in the future. Around 90% of all the data in the world was produced in the last two years alone, and it is estimated that by 2020 the world’s total data will reach 45 ZB. The data generated each day would be so large that, if we stored it on 50 GB Blu-ray discs and stacked the discs one on top of another, the stack would be as tall as four Eiffel Towers placed one above the other. This humongous amount of structured, semi-structured, and unstructured data cannot be stored and processed using traditional relational SQL databases.

So, to process such data, we use Apache Spark as our processing framework for handling and working with Big Data.
Apache Spark is an open-source parallel processing framework for processing Big Data across clusters of computers.

Now many of you must be thinking, “Why Spark if we already have Hadoop to handle Big Data?”

To answer this question, we need to understand the concepts of batch and real-time processing. Hadoop works on batch processing, where computation happens on blocks of data that have already been stored. Spark, in addition to batch processing, also supports real-time processing. The USP of Spark is that it can process data in real time and is about 100 times faster than Hadoop in batch processing of large data sets.

Hadoop vs Spark processing

As can be seen, Hadoop follows the batch process: data is collected and stored over a period of time and then processed. Spark, on the other hand, also supports real-time processing.

In this part of the blog, our main focus will be the Spark architecture, i.e. how Spark works internally.

Before discussing the architecture, let’s first discuss Spark, its features, and what resides in it.

Why Spark?

Apache Spark is an open-source cluster computing framework for real-time data processing. The main feature of Apache Spark is its in-memory cluster computing that increases the processing speed of an application. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming.

Features

  1. Speed – Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. It achieves this speed in part through controlled partitioning of data across the cluster.
  2. Caching – A simple programming layer provides powerful caching and disk-persistence capabilities (a brief sketch follows after this list).
  3. Deployment – Spark can be deployed through Mesos, Hadoop via YARN, or Spark’s own standalone cluster manager.
  4. Multiple Language Support – Spark provides APIs in Java, Scala, Python, and R.
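
As a small, hedged sketch of the caching feature mentioned above (assuming an existing SparkContext named sc and a hypothetical log file data.txt):

import org.apache.spark.storage.StorageLevel

// build an RDD and keep the filtered result in memory after its first computation
val lines  = sc.textFile("data.txt")
val errors = lines.filter(_.contains("ERROR"))
errors.cache()

// subsequent actions reuse the cached partitions instead of re-reading the file
println(errors.count())

// persist can also spill partitions to disk when they do not fit in memory
lines.persist(StorageLevel.MEMORY_AND_DISK)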

Components

Spark Components

1) SPARK CORE

Spark Core is the base engine for large-scale parallel and distributed data processing. It is responsible for:

  • memory management and fault recovery
  • scheduling, distributing and monitoring jobs on a cluster
  • interacting with storage systems

Spark introduces the concept of an RDD (Resilient Distributed Dataset), an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel. An RDD can contain any type of object and is created by loading an external dataset or by distributing a collection from the driver program.

// dependency needed to write a Spark application (Maven/sbt coordinates)
groupId = org.apache.spark
artifactId = spark-core_2.12
version = 2.4.4

where,
Resilient – capable of rebuilding data on failure
Distributed – data is distributed among the various nodes of the cluster
Dataset – a collection of partitioned data with values
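
As a quick illustration of both creation styles mentioned above (a minimal sketch, assuming an existing SparkContext named sc and a hypothetical input path):

// creating an RDD by distributing a local collection from the driver program
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// creating an RDD by loading an external dataset (hypothetical HDFS path)
val lines = sc.textFile("hdfs:///data/input.txt")

// each RDD is split into partitions that are spread across the nodes of the cluster
println(numbers.getNumPartitions)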

RDDs support two types of operations:

  • Transformations are operations (such as map, filter, join, union, and so on) that are performed on an RDD and that yield a new RDD containing the result.
  • Actions are operations (such as reduce, count, first, and so on) that return a value after running a computation on an RDD.
    (A short sketch follows below; we will look at Transformations & Actions in more detail in our next blog.)
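
The following is a minimal sketch of the difference, assuming an existing SparkContext named sc:

// transformations lazily describe a new RDD; nothing runs yet
val numbers = sc.parallelize(1 to 10)
val doubled = numbers.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)

// actions trigger the actual computation and return a value to the driver
println(evens.count())        // 5
println(evens.reduce(_ + _))  // 4 + 8 + 12 + 16 + 20 = 60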

2) SPARK SQL

Spark SQL is a Spark component that supports querying data either via SQL or via the Hive Query Language. It originated as the Apache Hive port to run on top of Spark (in place of MapReduce) and is now integrated with the Spark stack. In addition to providing support for various data sources, it makes it possible to weave SQL queries with code transformations, which results in a very powerful tool. In brief, Spark SQL is:
– the Spark module for structured data processing
It offers two ways to manipulate data:
a) the DataFrame/Dataset API
b) SQL queries
(We will discuss all of these in detail in our next blog; a short sketch follows below.)
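
A minimal sketch of both approaches, assuming Spark 2.x with the spark-sql module on the classpath and a hypothetical people.json file containing name and age fields:

import org.apache.spark.sql.SparkSession

// the entry point for Spark SQL in Spark 2.x
val spark = SparkSession.builder()
  .appName("SparkSQLExample")
  .master("local[*]")
  .getOrCreate()

// a) DataFrame / Dataset API
val people = spark.read.json("people.json")
people.filter(people("age") > 21).show()

// b) SQL query over the same data
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()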

3) SPARK STREAMING

Spark Streaming is the component of Spark which is used to process real-time streaming data. Thus, it is a useful addition to the core Spark API. It enables high-throughput and fault-tolerant stream processing of live data streams.
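
A minimal sketch of a streaming word count, assuming the spark-streaming module is on the classpath and text is being sent to localhost:9999 (for example with nc -lk 9999):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// a streaming context that processes the incoming stream in 5-second batches
val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))

// count words arriving on a TCP socket and print each batch's counts
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()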

4) MLlib

MLlib is Spark’s machine learning library. It provides distributed implementations of common algorithms such as clustering, logistic regression, linear regression, and many more, all built to run on Spark.
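
As a small, hedged sketch of using MLlib (assuming an existing SparkContext named sc, the spark-mllib module on the classpath, and made-up sample points):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// a tiny, made-up dataset of 2-D points
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
  Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)
))

// cluster the points into 2 groups using at most 20 iterations
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)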

5) GraphX

GraphX is a library for manipulating graphs and performing graph-parallel operations.
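
A minimal sketch of building and querying a graph, assuming an existing SparkContext named sc, the spark-graphx module on the classpath, and made-up vertices and edges:

import org.apache.spark.graphx.{Edge, Graph}

// vertices are (id, attribute) pairs; edges carry a relationship attribute
val users   = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Charlie")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))

// build the property graph and run a simple graph-parallel operation
val graph = Graph(users, follows)
graph.outDegrees.collect().foreach(println)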

Apache Spark Architecture

Spark Architecture

Apache Spark has a well-defined layered architecture in which all the Spark components and layers are loosely coupled. This architecture is further integrated with various extensions and libraries. The Apache Spark architecture is based on two main abstractions:
1) Resilient Distributed Dataset (RDD)
2) Directed Acyclic Graph (DAG)

Apache Spark follows a master/slave architecture with a cluster manager. A Spark cluster has a single master and any number of slaves/workers. The Driver Program and the Spark Context reside on the master daemon, and the Executors reside on the slave daemons.

The Driver Program is the program that contains the main method; it is the starting point of the application and is what actually drives it.

Inside the driver program, the first thing you do is create a Spark Context. The Spark Context is the main entry point for Spark functionality. It is similar to a database connection: any command you execute against your database goes through the database connection, and likewise anything you do in Spark goes through the Spark Context.

// Initialising Spark with a SparkContext
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc   = new SparkContext(conf)

Now, the Spark Context works with the cluster manager to manage various jobs. The driver program and the Spark Context take care of job execution within the cluster. A job is split into multiple tasks, which are distributed over the worker nodes. Whenever an RDD is created in the Spark Context, it can be distributed across various nodes and cached there.

Executors: an executor is a process launched on a slave node that holds blocks of data in its memory and runs the tasks (the code) assigned to it. Every Spark application has its own executor processes.

Workflow of Spark

Spark Workflow

Step 1: When a client submits Spark user application code, the driver implicitly converts the code containing transformations and actions into a logical directed acyclic graph (DAG), on which it performs some optimizations.

Step 2: It then converts the logical DAG into a physical execution plan with a set of stages. After creating the physical execution plan, it creates small physical execution units, referred to as tasks, under each stage. These tasks are then bundled and sent to the Spark cluster.

Step 3: The driver program then talks to the cluster manager and negotiates for resources. The cluster manager launches executors on the worker nodes on behalf of the driver. At this point the driver sends tasks to the executors based on data placement. Before the executors begin execution, they register themselves with the driver program so that the driver has a complete view of all the executors that are executing tasks.

Step 4: During the course of task execution, the driver program monitors the set of executors that run them. The driver node also schedules future tasks based on data placement.
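
One hedged way to peek at the logical DAG described in Step 1 is to print an RDD's lineage with toDebugString (assuming an existing SparkContext named sc and a hypothetical input file):

// the same kind of transformation chain the driver turns into a DAG
val counts = sc.textFile("input.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

// prints the lineage that the DAG scheduler converts into stages and tasks
println(counts.toDebugString)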

Example of a Simple WordCount Problem using Scala

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkWordCount {
  def main(args: Array[String]) {
    // create Spark context with Spark configuration
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))

    // read in the text file and split each line into words
    val tokenized = sc.textFile(args(0)).flatMap(_.split(" "))

    // count the occurrences of each word
    val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)

    // display the first ten word counts on the driver
    wordCounts.take(10).foreach(println)

    sc.stop()
  }
}
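
To try this out, the object above can be packaged into a jar and launched with spark-submit, for example: spark-submit --class SparkWordCount --master local[*] wordcount.jar input.txt (here wordcount.jar and input.txt are just placeholder names for the packaged application and its input file).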

This is how Apache Spark works.

In our next blog, we will be discussing Spark Transformations & Actions.

If you like this blog, please show your appreciation by hitting the like button and sharing it. Also, drop any comments about the post and improvements if needed. Till then, HAPPY LEARNING.


Written by 

Divyansh Jain is a Software Consultant with 1 year of experience. He has a deep understanding of Big Data technologies, Hadoop, Spark, and Tableau, as well as web development. He is an amazing team player with self-learning skills and a self-motivated professional. He has also worked as a freelance web developer. He loves to play with and explore real-time problems and Big Data. In his leisure time, he prefers LAN gaming and watching movies.
