Structured Streaming: How it works?

Table of contents

Reading Time: 2 minutes

In our previous blog post – Structured Streaming: What is it? we got to know that Structured Streaming is a fast, scalable, fault-tolerant, end-to-end, exactly-once stream processing API that helps users in building streaming applications.

Now it’s time to learn – How it works? So, in this blog post, we will look at the working of a structured stream via an example.

So, let’s take a look at the example:

	import org.apache.spark.sql.SparkSession

	object StructuredNetworkWordCount extends App {

	val spark = SparkSession
	.builder
	.appName("StructuredNetworkWordCount")
	.master("local")
	.config("spark.sql.shuffle.partitions", 8)
	.getOrCreate()

	import spark.implicits._

	// Create DataFrame representing the stream of input lines from connection to localhost:9999
	val lines = spark.readStream
	.format("socket")
	.option("host", "localhost")
	.option("port", 9999)
	.load()

	// Split the lines into words
	val words = lines.as[String].flatMap(_.split(" "))

	// Generate running word count
	val wordCounts = words.groupBy("value").count()

	// Start running the query that prints the running counts to the console
	val query = wordCounts.writeStream
	.outputMode("complete")
	.format("console")
	.start()

	query.awaitTermination()

	}

view raw StructuredNetworkWordCount.scala hosted with ❤ by GitHub

Above is an example of a structured stream which has Socket as the source & Console as the sink. It has 3 major sections:

Source – The first part is the source, which is represented by lines variable. It is nothing but a DataFrame created from a streaming source.
Operations – Since, we get the same DataFrame API, calculating Word-Count has the same code as it is for a batch data processing.
Sink – The last part is the Sink which is represented by query variable. It is nothing but a streaming sink where results are sent.

Now, when we have seen the code and its anatomy in words it’s time to take a look at it’s working via an illustration:

In the above illustration, the query variable represents a DataFrame with the unlimited number of rows. Every push to the Socket will append new rows or update old rows in the DataFrame, which eventually sends the data to the sink, i.e., Console.

So, we have seen how a Structured Streaming works, i.e., it acts like a table with an unlimited number of rows. I hope you liked this blog. Please feel free to suggest or comment.

We will be back with more blogs on Structured Streaming. Till then stay tuned 🙂