“Structured Streaming” — we hear this term in the Apache Spark ecosystem quite a lot nowadays, as it is being preached as the next big thing in the scalable big data world. We all know that Structured Streaming means a stream carrying structured data, but very few of us know what exactly it is and where we can use it.
So, in this blog post we will get to know Spark Structured Streaming with the help of a simple example. But before we begin with the example, let's get to know it first.
Structured Streaming is a scalable and fault-tolerant stream processing engine built upon the strong foundation of Spark SQL. It leverages Spark SQL’s powerful APIs to provide a seamless query interface which allows us to express our streaming computation in the same way we would express a SQL query over our batch data. Also, it optimizes the execution of our streaming computation to provide low-latency and continually updated answers.
Now that we have defined Spark Structured Streaming, let's look at an example of it.
In this example we will compute the famous Word-Count, but with a time-based window on it. To compute Word-Count in a particular time window, we will tag each line received from the network with a timestamp that will help us determine the window it falls into. So, let's start coding:
First, we have to import the necessary classes and create a local SparkSession, the starting point of all functionality in Spark.
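A minimal sketch of this setup (assuming Spark 2.x with the spark-sql dependency on the classpath; the app name is just an illustrative choice):

```scala
import java.sql.Timestamp

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Create a local SparkSession, the entry point to all functionality in Spark
val spark = SparkSession
  .builder()
  .appName("StructuredNetworkWordCountWindowed")
  .master("local[*]")
  .getOrCreate()

// Needed for implicit conversions like converting a DataFrame to a Dataset
import spark.implicits._
```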
Now, we have to create a streaming DataFrame that represents text data received from a server listening on localhost:9999, and transform the DataFrame to compute word-count over a window of 10 seconds which slides every 5 seconds. Also, we have to tag each line we receive from the network with a timestamp.
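This can be sketched as follows (assuming the SparkSession from the previous step is in scope; the socket source's includeTimestamp option tags each line with the time it was received):

```scala
// Streaming DataFrame representing lines of text received from
// localhost:9999, with each line tagged with its arrival timestamp
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true)
  .load()

// Split each line into words, keeping the timestamp of each word
val words = lines.as[(String, Timestamp)]
  .flatMap { case (line, timestamp) =>
    line.split(" ").map(word => (word, timestamp))
  }
  .toDF("word", "timestamp")

// Group by a 10-second window (sliding every 5 seconds) and the word,
// and count the occurrences of each word within each window
val windowedCounts = words.groupBy(
  window($"timestamp", "10 seconds", "5 seconds"),
  $"word"
).count()
```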
The lines DataFrame represents an unbounded table containing the streaming text data, where each line of the stream is a row in the table. Next, we convert the DataFrame to a Dataset of (String, Timestamp) using .as[(String, Timestamp)], so that we can apply the flatMap operation to split each line into multiple words and tag each word with its timestamp. Finally, we define the windowedCounts DataFrame by grouping the words by both the word itself and the window derived from its timestamp, and counting the occurrences in each group. Note that windowedCounts is a streaming DataFrame which represents the running windowed word counts of the stream.
We now have to set up the query on the streaming data, i.e., specify a sink for it, so that we can actually start receiving data; without a sink a stream cannot work. For this, we set it up to print the complete set of counts (specified by outputMode("complete")) to the console every time they are updated, and then start the streaming computation using start().
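A sketch of the query setup (assuming the windowedCounts DataFrame from above; truncate is disabled here only so the full window column is visible in the console):

```scala
// Print the complete set of windowed counts to the console
// every time they are updated
val query = windowedCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("truncate", "false")
  .start()

// Block until the streaming query is stopped
query.awaitTermination()
```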
Now, we are ready to run our example. But before that, we have to run Netcat as a data server to send some data.
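In a terminal, start Netcat listening on the port our stream reads from; anything typed into this terminal will be sent to the application:

```shell
# Run Netcat as a simple data server on port 9999
nc -lk 9999
```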
Now, when we run the example in a different terminal and type a few lines into the Netcat terminal, we get the word counts on the console.
Here we can see that the word-count is computed over a window of 10 seconds which slides every 5 seconds.
It's that simple, isn't it? We just created a streaming application, albeit for a very naive use case (Word Count 😛 ), but at least we got an idea of Spark Structured Streaming.
However, this is not the end; it's just the beginning. We will come back with more posts on Spark Structured Streaming where you will get to know it better. So, stay tuned 🙂