In our previous blogs, we learned two major points about Structured Streaming:
- It is a fast, scalable, fault-tolerant, end-to-end, exactly-once stream processing API that helps users in building streaming applications.
- It treats the live data stream as a table that is continuously appended to or updated. This lets us express a streaming computation as a standard batch-like query, as if on a static table, while Spark runs it as an incremental query on the unbounded input table.
In this blog post, we will talk about the philosophy, that is, the programming model, of Structured Streaming. So, let’s get started with the example that we saw in the previous blog post.
The query in the above example generates a “Result Table” (under the hood). Each time new data arrives (words sent to the socket), rows in the “Result Table” are appended or updated. Whenever the Result Table changes, the changed result row(s) are sent to an external sink (the console in the above example). To understand the workflow better, let’s take a look at the diagram below:
Here we can see that when new data is pushed to the source, Spark runs the “incremental” query, which combines the previous running counts with the new data to compute the updated counts. The “Input Table” here is the lines DataFrame, which acts as the streaming input for the wordCounts DataFrame.
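To make the incremental query concrete, here is a minimal sketch in plain Python. It is not Spark code: the function name `incremental_update` and the sample lines are assumptions made up for illustration. It only mimics the idea of merging a new batch into the previous running counts instead of recounting everything.

```python
from collections import Counter

def incremental_update(running_counts, new_lines):
    """Merge the word counts of a newly arrived batch into the running counts.

    Hypothetical helper for illustration only; Spark does this internally.
    """
    # Count only the words in the new batch
    batch_counts = Counter(word for line in new_lines for word in line.split())
    # Counter addition returns a new Counter holding the combined counts
    return running_counts + batch_counts

counts = Counter()
counts = incremental_update(counts, ["cat dog", "dog dog"])
print(dict(counts))  # {'cat': 1, 'dog': 3}

counts = incremental_update(counts, ["owl cat"])
print(dict(counts))  # {'cat': 2, 'dog': 3, 'owl': 1}
```

Note that the second call never looks at the first batch again. It only needs the previous counts and the new lines.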
Now, the only unknown thing in the above diagram is “Complete Mode”. It is one of the 3 output modes available in Structured Streaming. Since output modes are an important part of Structured Streaming, let’s read about them in detail:
- Complete Mode – In this mode, the entire updated Result Table is written to the sink after every trigger.
- Append Mode – In this mode, only the new rows appended to the Result Table since the last trigger are written to the sink. It applies only to queries where existing rows in the Result Table are not expected to change.
- Update Mode – In this mode, only the rows that were updated in the Result Table since the last trigger are written to the sink. This is what distinguishes it from Complete Mode, which writes the whole table every time. If the query doesn’t contain any aggregations, it is equivalent to Append Mode.
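The difference between the three modes can be sketched in plain Python by comparing the Result Table before and after a trigger. This is a simplified simulation, not the Spark API: `rows_for_sink` and the sample tables are hypothetical, and "append" is reduced to "keys that did not exist before". In Spark itself, Append Mode on an aggregation also requires a watermark and emits a row only once it is final, which this sketch ignores.

```python
def rows_for_sink(previous, current, mode):
    """Return the rows one trigger would send to the sink (simplified).

    previous/current map a key to its value in the Result Table
    before and after the trigger. Hypothetical helper, not a Spark call.
    """
    if mode == "complete":
        # The whole Result Table is written every time
        return dict(current)
    if mode == "update":
        # Only rows that are new or whose value changed since the last trigger
        return {k: v for k, v in current.items() if previous.get(k) != v}
    if mode == "append":
        # Only rows that were not in the Result Table before (simplification)
        return {k: v for k, v in current.items() if k not in previous}
    raise ValueError(f"unknown output mode: {mode}")

before = {"cat": 1, "dog": 3}
after = {"cat": 2, "dog": 3, "owl": 1}

print(rows_for_sink(before, after, "complete"))  # {'cat': 2, 'dog': 3, 'owl': 1}
print(rows_for_sink(before, after, "update"))    # {'cat': 2, 'owl': 1}
print(rows_for_sink(before, after, "append"))    # {'owl': 1}
```

The unchanged `dog` row appears only in the complete output, which is why Update Mode usually sends far less data to the sink for large Result Tables.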
Finally, there is one more thing to note: Structured Streaming does not materialize the entire Input Table. It reads the latest available data from the streaming data source, processes it incrementally to update the result, and then discards the source data. It keeps only the minimal intermediate state required to update the result, e.g., the intermediate counts in the above example.
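As a rough illustration of how little needs to be kept, the plain-Python sketch below processes many batches while holding on to nothing but the per-word counts. The class name `RunningWordCount` and the input lines are assumptions for this example, and Spark's real state store is more involved than a dictionary.

```python
class RunningWordCount:
    """Keeps only per-word counts; processed lines are never stored."""

    def __init__(self):
        # The only state that survives between batches
        self.state = {}

    def process(self, batch):
        for line in batch:
            for word in line.split():
                self.state[word] = self.state.get(word, 0) + 1
        # Nothing references the batch after this point, so it can be discarded

query = RunningWordCount()
for _ in range(1000):
    query.process(["cat dog", "dog owl"])

print(len(query.state))    # 3
print(query.state["dog"])  # 2000
```

After 2000 input lines the state still has just three entries, one per distinct word, because its size depends on the number of keys and not on the amount of data seen.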
This model is significantly different from many other stream processing engines, which require the user to maintain running aggregations themselves and therefore to reason about fault tolerance and data consistency (at-least-once, at-most-once, or exactly-once). In this model, Spark is responsible for updating the “Result Table” when new data is available, thus relieving users from reasoning about it.
So, that’s the whole philosophy of Structured Streaming. I hope you liked it. If you have any comments or suggestions, please leave them below.
We will be back with more blogs on Structured Streaming. Till then stay tuned 🙂