Welcome to another important and interesting big data topic: Apache Spark.
What is Apache Spark?
Spark has been called a “general-purpose distributed data processing engine” for big data and machine learning. It lets you process large data sets faster by splitting the work into chunks and distributing those chunks across computational resources.
Why would you want to use Spark?
Spark has some big pros:
- High-speed querying, analysis, and transformation of large data sets
- Great for iterative algorithms
- Easy-to-use APIs that make a big difference in ease of development, readability, and maintenance
- Super fast, especially for interactive queries. (100x faster than classic Hadoop Hive queries without refactoring the code!)
- Supports multiple languages (Java, Python, Scala, R)
- Helps make complex data pipelines coherent and easy.
Let’s look into the features that set Spark apart from other big data frameworks.
Choose Spark for the following features:
Speed
Apache Spark offers high data processing speed: roughly 100x faster in memory and about 10x faster on disk than Hadoop MapReduce. This is possible largely because Spark reduces the number of read/write operations to disk.
Dynamic in Nature
Spark provides over 80 high-level operators, which make it easy to build parallel applications.
Reusability
Spark code can be reused for batch processing, for joining streams against historical data, or for running ad-hoc queries on stream state.
In-Memory Processing
In-memory computation makes data processing faster and more efficient, improving the overall performance of the system.
Because the data is cached, it does not need to be fetched from disk every time, which saves time.
Fault Tolerance in Spark
Fault tolerance in Apache Spark is the ability to keep operating and to recover lost data after a failure occurs. Apache Spark provides fault tolerance through its core abstraction, the RDD (Resilient Distributed Dataset).
Real-Time Stream Processing
Data streaming is a way of continuously collecting data in real time from multiple sources in the form of data streams. Spark Streaming lets us build scalable, high-throughput, and fault-tolerant streaming applications over live data streams.
Lazy Evaluation
Transformations in Spark are lazy: calling a transformation on an RDD does not execute it immediately. The work runs only when we call an action on the data.
Spark Built on Hadoop
The following diagram shows three ways Spark can be deployed with Hadoop components:
- Standalone: In a standalone deployment, Spark sits on top of HDFS (Hadoop Distributed File System), with space explicitly allocated for HDFS. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
- Hadoop YARN: Spark runs on YARN without any pre-installation or root access, which makes it easy to integrate Spark into the Hadoop ecosystem or stack.
- Spark in MapReduce (SIMR): Used to launch Spark jobs in addition to the standalone deployment. With SIMR, users can start using Spark and its shell without any administrative access.
Apache Spark Components
- Spark Core: Responsible for basic I/O functionality, scheduling and monitoring jobs on Spark clusters, task dispatching, networking with different storage systems, fault recovery, and efficient memory management.
- Spark SQL: Leverages declarative queries and optimized storage by running SQL-like queries on Spark data held in RDDs and other external sources.
- Spark Streaming: Allows developers to perform batch processing and streaming of data in the same application with ease.
- MLlib: Eases the development and deployment of scalable machine learning pipelines.
- GraphX: Lets a data scientist work with graph and non-graph sources, with flexibility and resilience in graph construction and transformation.
Apache Spark Architecture
Spark consists of a driver program and a group of executors on a cluster. The driver process runs the main program of the Spark application and creates the SparkContext, which coordinates the execution of jobs. The executors run on worker nodes and are responsible for executing the tasks the driver assigns to them.
The connection between the driver and the worker nodes is managed by the cluster manager, whose main job is to allocate resources to the Spark application. Every Spark application needs an entry point through which it communicates with data sources and performs operations such as reading and writing data.
Spark 1.x had three entry points: SparkContext, SQLContext, and HiveContext. Since Spark 2.x, there is a new entry point called SparkSession, which combines the functionality of all three.
It’s time to meet the Apache Spark UI
Apache Spark comes with spark-shell, a command used to interact with Spark from the command line. Apache Spark provides spark-shell for Scala, pyspark for Python, and sparkr for R.
Within the scope of this blog, we will focus on spark-shell for Scala.
Pre-requisites: Before you proceed, make sure you have Apache Spark installed.
1. Launch Spark Shell (spark-shell) Command
Go to the Apache Spark installation directory from the command line and type ./bin/spark-shell.
This yields the output below.
By default, spark-shell creates a Spark context which internally creates a Web UI with URL http://localhost:4040.
The Future of Apache Spark
1. Domains/services where Spark plays a key role:
2. Growth statistics
Here are a few statistics on the growth of Apache Spark. These were recently shared at AI Summit 2020 by Zheng Kai (Tiejie), Senior Technical Expert at Alibaba.
In this blog, we learned what Spark is and why we need it, discussed its features and its integration with Hadoop, walked through its architecture, and interacted with Spark through the web UI. Looking at Spark’s future growth, I am sure the interest around Spark will continue.
I believe we learned a lot. Isn’t it interesting?
We can continue this blog with further topics such as RDD, DataFrame, Dataset, and DAG.
Stay tuned for more blogs.