They said Spark Streaming simply means Discretized Stream


I work at Knoldus Software LLP, where Apache Spark practically runs in people's blood, meaning there are people here who are really good at it. If you visit our blog and search for Spark-related posts, you will find enough content to answer most of your Spark questions, from introductions to solutions for specific problems.

So, taking inspiration from my colleagues and having learned the basics of Apache Spark from their blogs, I am now trying to find out "What is Spark Streaming?".

As the documentation says, Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. It receives live input data streams from various sources like Kafka, Flume, Kinesis, or TCP sockets and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.

Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.

Internally, a DStream is represented as a continuous series of RDDs (i.e. small immutable batches of data), which is what makes a DStream resilient. We can apply transformations such as map and flatMap to the input data stream through the DStream. We only have to supply our business logic to the DStream, and Spark applies it to each batch to produce the required result.
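To make the idea of "a series of small batches" concrete, here is a minimal sketch in plain Scala with no Spark dependency. The names `DiscretizedSketch`, `Record`, `discretize`, and `flatMapBatches` are made up for illustration, and an ordinary `Seq` stands in for an RDD. A real DStream is distributed and fault tolerant, which this sketch is not.

```scala
object DiscretizedSketch {
  // Hypothetical record type: an arrival time in milliseconds plus a line of text.
  final case class Record(timestampMs: Long, line: String)

  // Group records into batches by arrival time; each batch covers one interval.
  // Batches come back in time order, like the RDDs inside a DStream.
  // Simplification: an interval with no records yields no batch here.
  def discretize(records: Seq[Record], batchIntervalMs: Long): Seq[Seq[String]] =
    records
      .groupBy(r => r.timestampMs / batchIntervalMs)
      .toSeq
      .sortBy { case (batchIndex, _) => batchIndex }
      .map { case (_, batch) => batch.sortBy(_.timestampMs).map(_.line) }

  // A transformation on the "stream" is the same function applied to every batch.
  def flatMapBatches(batches: Seq[Seq[String]])(f: String => Seq[String]): Seq[Seq[String]] =
    batches.map(_.flatMap(f))
}
```

With a batch interval of 1000 ms, records arriving at 100 ms and 900 ms land in the first batch and a record arriving at 1200 ms lands in the second; calling `flatMapBatches` with a split function then turns every batch of lines into a batch of words.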

DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams.

After understanding these basic terms, I followed the documentation and tried the quick hands-on example. I was able to fetch a stream of data from a TCP socket and apply the word-count logic to it, but as soon as I tried to consume a stream of data from a simple text file, I got a little stuck.

The documentation says that for simple text files there is an easier method, streamingContext.textFileStream(dataDirectory), where dataDirectory is the directory path. Spark Streaming will monitor the directory and process any files created in it (files written in nested directories are not supported).

However, the files must be on a file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.). The approach I took to create a stream of data from a simple text file was to install Hadoop on my local system, create a directory in HDFS, and then copy a text file into that directory. This means there is a series of steps to follow before starting the Spark job.

These steps are:

Step 1: Install Hadoop by following the steps in the Hadoop documentation.

Step 2: Navigate to the folder where Hadoop is installed on your system and run these commands.

    1. Format the filesystem:
        $ bin/hdfs namenode -format
      
    2. Start NameNode daemon and DataNode daemon:
        $ sbin/start-dfs.sh
      

      The Hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).

    3. Browse the web interface for the NameNode; by default, it is available at:
    4. Make the HDFS directories.
        $ bin/hdfs dfs -mkdir /user

Step 3: Create the stream of data in your code by passing the path of that directory:

sparkStreamingContext.textFileStream("hdfs://localhost:9000/user")

Step 4: Start the Spark job so that Spark Streaming starts monitoring the directory and processes any files created in it.

Step 5: After starting the Spark job, copy a text file from your local file system into that directory, which is now being monitored by the Spark job.

sudo ./bin/hdfs dfs -put /home/knoldus/resource/test.txt /user

Now you will be able to create a stream of data from a simple text file, and you can apply any number of transformations to that stream.

Here is the full example:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SimpleTextFile extends App {

  // Local context with two worker threads and a batch interval of 1 second
  val conf = new SparkConf().setMaster("local[2]").setAppName("DemoSparkForWordCount")
  val ssc = new StreamingContext(conf, Seconds(1))

  // Monitor the HDFS directory and read each newly created file as lines of text
  val lines = ssc.textFileStream("hdfs://localhost:9000/user")

  // Split each line into words
  val words = lines.flatMap(_.split(" "))

  // Count each word in each batch
  val pairs = words.map(word => (word, 1))
  val wordCounts = pairs.reduceByKey(_ + _)

  // Print the first ten elements of each RDD generated in this DStream to the console
  wordCounts.print()

  ssc.start()            // Start the computation
  ssc.awaitTermination() // Wait for the computation to terminate
}

In this example,

1). I create a local StreamingContext (i.e. ssc) with two worker threads (local[2]) and a batch interval of 1 second. The documentation recommends more than one thread for receiver-based sources such as sockets to prevent a starvation scenario; file streams do not run a receiver, so this setting is not strictly required here.

2). Then I create a DStream that consumes data from the directory path "hdfs://localhost:9000/user".

3). Then I split each line into words, pair each word with a count of 1, and sum the counts per word within each batch.

4). Finally, I print the first ten output values of each batch.
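To see what step 3 computes for a single batch, here is a small sketch in plain Scala collections, without Spark. The names `WordCountSketch` and `countWords` are hypothetical, and `groupBy` followed by `reduce` is only a stand-in for what `reduceByKey(_ + _)` does on one RDD.

```scala
object WordCountSketch {
  // Count words in one batch of lines, mirroring flatMap -> map -> reduceByKey.
  def countWords(batch: Seq[String]): Map[String, Int] =
    batch
      .flatMap(_.split(" ").toSeq)        // split each line into words
      .map(word => (word, 1))             // pair each word with a count of 1
      .groupBy { case (word, _) => word } // gather the pairs that share a word
      .map { case (word, pairs) => word -> pairs.map(_._2).reduce(_ + _) }
}
```

For the batch `Seq("to be or", "not to be")`, this yields counts of 2 for "to" and "be" and 1 for "or" and "not". In the streaming job the same computation runs again for every batch, over whatever lines arrived during that interval.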

Conclusion:

This blog is about my first experience with Apache Spark, one of the most awesome frameworks around, and it focuses on Spark Streaming. I hope it helps you if you get stuck while following the Spark Streaming documentation.

References:

1). Spark Streaming documentation

2). Knoldus blogs.

