With the advent of streaming frameworks like Spark Streaming, Flink, and Storm, developers stopped worrying about the low-level concerns of a streaming application, such as fault tolerance (i.e., zero data loss) and real-time processing of data, and started focusing only on solving business challenges. The reason is that these frameworks provide inbuilt support for all of them. For example:
In Spark Streaming, recovery from failures becomes easy by simply setting a checkpoint directory, as in the code snippet below.

```scala
val sparkConf = new SparkConf().setAppName("RecoverableNetworkWordCount")
// Create the context with a 1 second batch size
val ssc = new StreamingContext(sparkConf, Seconds(1))
// Set the checkpoint directory (placeholder path) to enable recovery from failures
ssc.checkpoint("/path/to/checkpoint/dir")
```
And in Flink, we just have to enable checkpointing on the execution environment.
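As a minimal sketch, enabling checkpointing in Flink looks like this (the 1000 ms interval is an illustrative value, not a recommendation):

```scala
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Snapshot the application state periodically; interval is illustrative
val checkpointIntervalMs = 1000L
env.enableCheckpointing(checkpointIntervalMs)
```

With checkpointing enabled, Flink periodically snapshots the state of the streaming job so it can restart from the last successful checkpoint after a failure.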
Everything was working fine in the streaming data world, but then came the Structured Data era, where data was in tabular form (stored in large Data Warehouses) and was processed using simple SQL queries. For example, in Spark SQL or the Flink Table API, reading data became as simple as a “SELECT *”, like the snippet below:
```scala
spark.sql("SELECT * FROM employee").show()
// +---+------+
// |age|  name|
// +---+------+
// | 30|  Andy|
// | 19|Justin|
// +---+------+
```
This helped a wider range of people, i.e., those who do not know how to code, like Data Scientists and Business Analysts, but are familiar with SQL. Both Spark SQL and the Flink Table API became instant hits in the big data industry.
However, this success was limited to batch data only, i.e., files, tables, etc. The streaming world was left totally untouched by it. Everyone wanted the capability to run SQL queries on streaming data as well, so that they could draw insights from their data in real time.
This compelled big data industry experts to develop APIs that can process streaming data in structured/semi-structured form. As a result, a number of frameworks were developed that can process streaming data using SQL queries. For example:
- Spark Structured Streaming
- KSQL (Kafka-SQL)
- Flink Table, and many more
They all have their own pros and cons, but in this blog post we will talk only about Spark Structured Streaming. According to Spark’s official documentation:
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
It means that we can express our streaming computation the same way we would express a batch computation on static data. Since Structured Streaming is built on the Spark SQL engine, it comes with a lot of built-in advantages, like:
- Incremental and continuous updates of the final result (table) are taken care of by the API itself.
- Dataset/DataFrame API can be used/re-used in any language (Scala, Java, Python or R) to express streaming aggregations, event-time windows, stream-to-batch joins, etc.
- Computations are optimized as the same Spark SQL engine is used.
- And the application guarantees end-to-end exactly-once fault tolerance through checkpointing and WALs (Write-Ahead Logs).
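To make these points concrete, here is a minimal sketch of a Structured Streaming word count in the spirit of Spark's getting-started example (the socket source and its host/port are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("StructuredNetworkWordCount")
  .getOrCreate()
import spark.implicits._

// Read lines streamed from a socket (host/port are illustrative)
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Express the computation exactly as we would on a static DataFrame
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()

// The engine incrementally updates the result table as new lines arrive
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```

Note how the aggregation is written with the same DataFrame/Dataset API used for batch jobs; the engine, not the user, takes care of updating the result incrementally as data streams in.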
So, long story short, Structured Streaming is a fast, scalable, fault-tolerant, end-to-end exactly-once stream processing API that helps users build streaming applications without having to reason about the underlying streaming machinery.
We will explore more about Structured Streaming in our future blogs. Till then stay tuned 🙂
Please feel free to suggest and comment.