Structured Streaming: How it works?

Table of contents
Reading Time: 2 minutes

In our previous blog post – Structured Streaming: What is it? we got to know that Structured Streaming is a fast, scalable, fault-tolerant, end-to-end, exactly-once stream processing API that helps users in building streaming applications.

Now it’s time to learn  – How it works? So, in this blog post, we will look at the working of a structured stream via an example.

So, let’s take a look at the example:

import org.apache.spark.sql.SparkSession
object StructuredNetworkWordCount extends App {
val spark = SparkSession
.builder
.appName("StructuredNetworkWordCount")
.master("local")
.config("spark.sql.shuffle.partitions", 8)
.getOrCreate()
import spark.implicits._
// Create DataFrame representing the stream of input lines from connection to localhost:9999
val lines = spark.readStream
.format("socket")
.option("host", "localhost")
.option("port", 9999)
.load()
// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))
// Generate running word count
val wordCounts = words.groupBy("value").count()
// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream
.outputMode("complete")
.format("console")
.start()
query.awaitTermination()
}

Above is an example of a structured stream which has Socket as the source & Console as the sink. It has 3 major sections:

  1. Source – The first part is the source, which is represented by lines variable. It is nothing but a DataFrame created from a streaming source.
  2. Operations – Since, we get the same DataFrame API, calculating Word-Count has the same code as it is for a batch data processing.
  3. Sink – The last part is the Sink which is represented by query variable. It is nothing but a streaming sink where results are sent.

Now, when we have seen the code and its anatomy in words it’s time to take a look at it’s working via an illustration:

Structured Streaming

In the above illustration, the query variable represents a DataFrame with the unlimited number of rows. Every push to the Socket will append new rows or update old rows in the DataFrame, which eventually sends the data to the sink, i.e., Console.

So, we have seen how a Structured Streaming works, i.e., it acts like a table with an unlimited number of rows. I hope you liked this blog. Please feel free to suggest or comment.

We will be back with more blogs on Structured Streaming. Till then stay tuned 🙂


knoldus-advt-sticker

Written by 

Himanshu Gupta is a software architect having more than 9 years of experience. He is always keen to learn new technologies. He not only likes programming languages but Data Analytics too. He has sound knowledge of "Machine Learning" and "Pattern Recognition". He believes that best result comes when everyone works as a team. He likes listening to Coding ,music, watch movies, and read science fiction books in his free time.

1 thought on “Structured Streaming: How it works?1 min read

Comments are closed.

Discover more from Knoldus Blogs

Subscribe now to keep reading and get access to the full archive.

Continue reading