Real-Time Processing of Data Using Kafka and Spark

Before starting, you should know a little about Kafka, Spark, and what real-time processing is, so let's begin with a brief introduction to each.

Real-Time Processing – Processing data as soon as it arrives, rather than storing it first and processing it later (or processing data that is already stored somewhere else).

Kafka – Kafka is a distributed messaging system built for high-throughput data transfer from one end to another. It uses the concepts of producers and consumers: a producer sends data to topics, which live on the Kafka cluster, and a consumer subscribes to those topics to read the data. You can read more about Kafka here.
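To make the producer, topic, and consumer roles concrete, here is a toy in-memory model in plain Scala. It is only a sketch of the idea and not Kafka's API: the names ToyTopic, send, and poll are hypothetical, and a real cluster adds partitions, replication, and configurable offset handling.

```scala
import scala.collection.mutable

// Toy in-memory model of a topic; NOT Kafka's API, just the idea behind it.
object ToyTopicDemo {
  // A topic is an append-only log; each consumer group remembers its own read position.
  final class ToyTopic(val name: String) {
    private val log = mutable.ArrayBuffer.empty[String]
    private val offsets = mutable.Map.empty[String, Int].withDefaultValue(0)

    // Producer side: append a message to the end of the log.
    def send(message: String): Unit = log += message

    // Consumer side: return everything this group has not read yet, then advance its offset.
    def poll(groupId: String): List[String] = {
      val unread = log.drop(offsets(groupId)).toList
      offsets(groupId) = log.length
      unread
    }
  }

  def main(args: Array[String]): Unit = {
    val topic = new ToyTopic("mytopic")
    topic.send("hello kafka")
    topic.send("hello spark")
    println(topic.poll("groupA")) // groupA reads both messages
    topic.send("one more")
    println(topic.poll("groupA")) // groupA reads only the new message
    println(topic.poll("groupB")) // in this toy model a new group starts at offset 0
  }
}
```

The point of the sketch is that producers and consumers never talk to each other directly; they only share the topic, and each consumer group tracks how far it has read.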

Spark – Spark is an open-source processing engine built around speed, ease of use, and analytics. If you have large amounts of data that require low-latency processing, Spark is the way to go. You can read more about Spark here.

Spark Streaming – Spark Streaming is an extension of the core Spark API that enables processing of live data streams. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
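One way to picture a DStream is as a sequence of small batches, one per batch interval. The following plain-Scala sketch illustrates that grouping on an ordinary collection. The Event type and the toBatches function are hypothetical names introduced for this illustration; Spark's real DStream is distributed and does considerably more.

```scala
// Plain-Scala illustration of micro-batching; not Spark's implementation.
object MicroBatchDemo {
  // A hypothetical event: arrival time in milliseconds plus a text payload.
  final case class Event(timestampMs: Long, payload: String)

  // Group events into consecutive fixed-width batches, ordered by batch start time.
  // Intervals that received no events are simply absent from the result.
  def toBatches(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => (e.timestampMs / batchIntervalMs) * batchIntervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (batchStart, es) => (batchStart, es.sortBy(_.timestampMs).map(_.payload)) }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event(1000L, "a"), Event(9000L, "b"), // fall into the batch starting at 0 ms
      Event(12000L, "c"),                   // batch starting at 10000 ms
      Event(25000L, "d")                    // batch starting at 20000 ms
    )
    toBatches(events, batchIntervalMs = 10000L).foreach { case (start, payloads) =>
      println(s"batch starting at $start ms: ${payloads.mkString(", ")}")
    }
  }
}
```

Each batch is then processed as a unit, which is why the batch interval chosen later in this post directly affects how quickly results appear.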

A Quick Example –  

For ingesting data from Kafka, you will have to add the corresponding dependencies.

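// Keep all Spark artifacts on the same version; spark-streaming is declared explicitly.
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.1"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.1.1"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-8_2.11" % "2.1.1"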

First, we import the names of the Spark Streaming classes and some implicit conversions from StreamingContext into our environment in order to add useful methods to other classes we need (like DStream). StreamingContext is the main entry point for all streaming functionality. We create a local StreamingContext with two execution threads and a batch interval of 10 seconds.

import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
val conf = new SparkConf().setMaster("local[2]").setAppName("simpleApp")
val ssc = new StreamingContext(conf, Seconds(10))

Next, we specify the topic name and the number of consumer threads assigned to it, for example:

val topic = Map("mytopic" -> 2)

Here the topic name is mytopic and the number of threads assigned to it is 2.

The KafkaUtils API connects a Kafka cluster to Spark Streaming. Its most important method is createStream, whose signature is shown below.

def createStream(
    ssc: StreamingContext,
    zkQuorum: String,
    groupId: String,
    topics: Map[String, Int],
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
): ReceiverInputDStream[(String, String)]

This method creates an input stream that pulls messages from Kafka brokers.

ssc − StreamingContext object.

zkQuorum − ZooKeeper quorum (hostname:port,hostname:port,...).

groupId − The group id for this consumer.

topics − A map of topic names to the number of partitions to consume; each partition is consumed in its own thread.

storageLevel − Storage level to use for storing the received objects.

Now we create a DStream using KafkaUtils:

val lines = KafkaUtils.createStream(ssc, "localhost:2181", "defaultId", topic).map(_._2)

Here lines is a DStream, and we can apply any operation to it. For example:

lines.flatMap(x => x.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
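To see what this pipeline computes for a single batch, the same steps can be run on an ordinary Scala collection. This is a sketch only: wordCounts is a hypothetical helper, and groupBy plus sum stands in for reduceByKey, which is not available on plain collections.

```scala
object WordCountDemo {
  // Same shape as the DStream pipeline: split lines into words, pair each word with 1,
  // then sum the 1s per word (groupBy + sum plays the role of reduceByKey here).
  def wordCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .map((_, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    // Pretend these two lines arrived from Kafka within one batch interval.
    val batch = Seq("hello kafka", "hello spark streaming")
    wordCounts(batch).toSeq.sortBy(_._1).foreach(println)
  }
}
```

For the two sample lines, the result maps hello to 2 and each of kafka, spark, and streaming to 1. In the streaming job, a result of this kind is printed once per batch interval.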

Note that when these lines are executed, Spark Streaming only sets up the computation it will perform once it is started; no real processing has begun yet. To start processing after all the transformations have been set up, we finally call ssc.start(), followed by ssc.awaitTermination() to wait for the computation to terminate.


You can find the whole example here.




Written by 

Shubham is a Software Consultant with more than 1.5 years of experience. He is familiar with object-oriented programming paradigms and is always eager to learn new and advanced concepts. Aside from programming, his hobbies include playing badminton, watching movies, and writing blogs. He has experience working in C, C++, Core Java, Advanced Java, HTML, CSS, JS, and Ajax.
