Spark Streaming: Adding Spark to Streaming

Table of contents

Reading Time: 4 minutes

In today’s world we have a lot of data. And this data will only grow more and more in future. According to a study, in 2020, the data produced is abound 44 zettabytes each day. And by 2025, approximately 463 exabytes would be created every 24 hours worldwide. Do you ever imagine how one can store or process this much data ?A solution to this is Apache Spark and in this blog I’m going to discuss about Spark Streaming here.

What is Spark Streaming?

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. It was added to Apache Spark in 2013. We can get data from many sources such as Kafka, Flume etc. and process it using functions such as map, reduce etc. After processing we can push data to filesystem, databases and even to live dashboards.

In Spark Streaming we work on near real time data. It divides the received input stream into batches. The Spark Engine processes the batches and generate final output in batches.

Spark DStream

DStream (also known as discretized stream) is an abstraction of Spark Streaming. It represents a continuous stream of data. You can create DStreams in two ways :

By taking an input stream from sources such as Kafka, Flume etc.
By applying functions on input DStream that will produce another DStream.

Internally, a DStream is a sequence of RDD and so we can also say that it is a continuous stream of RDD . RDD (Resilient Distributed Dataset) is the fundamental data structure of Apache Spark which are an immutable collection of objects which computes on the different node of the cluster. Every RDD in DStream contains data from the certain interval. Also if you will apply any operation on a DStream, it applies to all the underlying RDDs.

Spark Streaming Sources and Receivers

Spark streaming provides 2 categories of Spark Streaming Sources. You can create an input DStream using these sources.

The categories are following :

Basic sources: Sources directly available in the StreamingContext API. For example: file systems, and socket connections.
Advanced sources: Sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes. These require linking by adding extra dependencies.

Also, every input stream(except file stream) has a receiver object. The work of receiver object is to receive the data from input stream and store it in Spark’s memory.

There are two kinds of receiver(based on some sources allow acknowledgement and some not):

Reliable Receiver : A reliable receiver correctly sends acknowledgment to a reliable source when the data has been received and stored in Spark with replication.
Unreliable Receiver: An unreliable receiver does not send acknowledgment to a source.

You can create multiple input DStreams to receive multiple streams of data in parallel. This will create multiple receivers according to the number of streams.

Things to remember

The stream should have enough cores(locally threads) to work so that it can process the data and run the receivers as well.
Never use local or local[1] as master url while running Spark Streaming Application on local. It means that there is only one thread to work with and it will be used for running the receiver. Therefore, while running on local always use local[n] where n > the number of recievers.

Operations on DStream

There are 2 types of operation that can be applied on DStream:

Transformation Operation

Transformations are the functions that you can use with a DStream to perform some modification on the data of DStream.

There are two types of Transformation:

Stateless Transformation: In this there is no dependency of processing of current batch on the output of the processed data of previous batch. It includes common functions such as :
- map()
- flatMap()
- filter()
- count()
- reduce()
Stateful Transformations: In this, it uses data or intermediate results from previous batches and computes the result of the current batch. It includes function such as :
- windowed operation: It allows you to apply transformation over a sliding window of data. You have to specify 2 parameters window length and sliding interval.
- updateStateByKey() : It allows you to maintain an arbitrary state while continuously updating it with new information. In this, you have to define a state and an update function as well.

Output Operation

After performing some transformation we can use output operations. Output operation allows the data to be pushed to an external system. By applying an output operation, actual execution of transformation of DStream starts. Some of the available operations are the following :

print()
saveAsTextFile()
foreachRDD()

Advantages of Spark Streaming

It recovers very fast from failures and stragglers.
Spark Streaming provides better resource usage and data load balancing.
Streaming data can be combined with static datasets as well as interactive queries.
We can also integrate it with advanced processing libraries, such as SQL, machine learning, graph processing.

You can look for an example of how to use spark-dstream here. I hope this blog was helpful.