RealTimeProcessing of Data using kafka and Spark

Before Starting it you should know about kafka, spark and what is Real time processing of let’s do some brief introduction about it.

Real Time Processing – Processing the Data that appears to take place instead of storing the data and then processing it or processing the data that stored somewhere else.

Kafka – Kafka is the maximum throughput of data from one end to another . it uses a concept of producer and consumer for producing and consuming the data. producer sends the data into topics that’s put on Kafka cluster and consumer subscribes the data from these topics. you can read about more Kafka here

Spark – spark is an open source processing engine built around speed, ease of use, and analytic. If you have large amounts of data that requires low latency processing that then Spark is the way to go. you can read about spark here

Spark Streaming – Spark Streaming is an extension of the core Spark API that enables processing of live data streams. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.

A Quick Example –  

For ingesting data from Kafka, you will have to add the corresponding dependencies.

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.1"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-8_2.11" % "2.1.0"

First, we import the names of the Spark Streaming classes and some implicit conversions from StreamingContext into our environment in order to add useful methods to other classes we need (like DStream). StreamingContext is the main entry point for all streaming functionality. We create a local StreamingContext with two execution threads and a batch interval of 10 seconds.

import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
val conf = new SparkConf().setMaster("local[2]").setAppName("simpleApp")
val ssc = new StreamingContext(conf, Seconds(10))

Now we will give topic name and assign the number of thread to it for example –

val topic = Map("mytopic"->2)

so here topic name – mytopic and number of threads which I assign here is 2.

KafkaUtils API is used to connect the Kafka cluster to Spark streaming. This API has the significant method createStream signature defined as below.

def createStream(
ssc: StreamingContext,
zkQuorum: String,
groupId: String,
topics: Map[String, Int],
storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
): ReceiverInputDStream[(String, String)]

this method used to Create an input stream that pulls messages from Kafka Brokers.

ssc − StreamingContext object.

zkQuorum − Zookeeper quorum.

groupId − The group id for this consumer.

topics − return a map of topics to consume.

storageLevel − Storage level to use for storing the received objects.

Now we will create Dstream using KafkaUtils

val lines = KafkaUtils.createStream(ssc , "localhost:2181","defaultId",topic).map(_._2)

Here we got Dstream that is lines  and perform any Action on it. for example-

lines.flatMap(x=>x.split(" ")).map((_,1)).reduceByKey(_ + _).print

Note that when these lines are executed, Spark Streaming only sets up the computation it will perform when it is started, and no real processing has started yet. To start the processing after all the transformations have been setup, we finally call


you can find whole example Here




This entry was posted in Scala and tagged , , , , . Bookmark the permalink.

One Response to RealTimeProcessing of Data using kafka and Spark

  1. Ashish says:

    Nice article brother … Thank you.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s