Structured Streaming: What is it?


With the advent of streaming frameworks like Spark Streaming, Flink, Storm, etc., developers stopped worrying about concerns common to every streaming application, such as fault tolerance (i.e., zero data loss) and real-time processing of data, and started focusing only on solving business challenges. The reason is that these frameworks provide inbuilt support for all of them. For example:

In Spark Streaming, recovery from failure(s) becomes easy just by setting a checkpoint directory path, as shown in the snippet below.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("RecoverableNetworkWordCount")
// Create the context with a 1 second batch size
val ssc = new StreamingContext(sparkConf, Seconds(1))
// Persist metadata/state to this directory so the job can recover after a failure
ssc.checkpoint("/path/to/checkpoint")

And in Flink, we just have to enable checkpointing on the execution environment, as shown in the snippet below.

import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000); // create a checkpoint every 5 seconds
env.getConfig().setGlobalJobParameters(parameterTool); // make parameters (a ParameterTool) available in the web interface
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);


Everything was working fine in the streaming data world, but then came the structured data era, where data was stored in tabular form (in large data warehouses) and processed using simple SQL queries. In Spark SQL or the Flink Table API, for example, reading data became as simple as a "SELECT *", as shown below:

spark.sql("SELECT * FROM employee").show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+

Table table = tableEnv.sqlQuery("SELECT * FROM employee");
DataSet<Employee> result = tableEnv.toDataSet(table, Employee.class);
result.print();

This helped a wider range of people, i.e., those who do not know how to code, like data scientists and business analysts, but who do know SQL. Both Spark SQL and the Flink Table API became instant hits in the big data industry.

However, this success was limited to batch data only, i.e., files, tables, etc. The streaming world was left totally untouched by it. Everyone wanted the capability of running their SQL queries on streaming data as well, so that they could draw insights from their data in real time.

This compelled big data industry experts to develop APIs that can process streaming data present in structured/semi-structured form. As a result, several frameworks were developed that can process streaming data using SQL queries. For example:

  • Spark Structured Streaming
  • KSQL (Kafka-SQL)
  • Flink Table, and many more

They all have their own pros and cons, but in this blog post we will talk only about Spark Structured Streaming. According to Spark’s official documentation:

Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.

It means that we can express our streaming computation the same way we would express a batch computation on static data. Since Structured Streaming is built on the Spark SQL engine, it comes with a lot of advantages built in (see the sketch after the list below), like:

  1. Incremental and continuous updates of the final result (table) are taken care of by the API itself.
  2. The Dataset/DataFrame API can be used/re-used in any supported language (Scala, Java, Python, or R) to express streaming aggregations, event-time windows, stream-to-batch joins, etc.
  3. Computations are optimized, as the same Spark SQL engine is used.
  4. And the application guarantees end-to-end exactly-once fault tolerance through checkpointing and WALs (Write-Ahead Logs).
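
To make this concrete, here is a minimal sketch of a streaming word count written exactly like a batch DataFrame query. The socket source on localhost:9999 and the names used (lines, wordCounts, query) are illustrative assumptions, not something from the original post.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
import spark.implicits._

// Read lines from a socket as an unbounded, streaming DataFrame
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Exactly the same Dataset/DataFrame operations we would use on static data
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// The engine updates the result table incrementally as new data arrives
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()

Notice that the groupBy/count is the same query we would write over a static DataFrame; the engine takes care of keeping the running counts up to date as new lines arrive.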

So, long story short, Structured Streaming is a fast, scalable, fault-tolerant, end-to-end exactly-once stream processing API that helps users build streaming applications without having to reason about the streaming machinery underneath.
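
And because it sits on top of the Spark SQL engine, a streaming DataFrame can even be registered as a temporary view and queried with plain SQL. Continuing the sketch above (the view name "events" is again just an illustrative assumption):

// Register the streaming DataFrame from the sketch above as a temporary view
lines.createOrReplaceTempView("events")

// The SQL query returns another streaming DataFrame that is updated incrementally
val counts = spark.sql("SELECT value, COUNT(*) AS cnt FROM events GROUP BY value")

counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()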

We will explore more about Structured Streaming in our future blogs. Till then stay tuned 🙂

Please feel free to suggest and comment.


Written by Himanshu Gupta

Himanshu Gupta is a software architect with more than 9 years of experience. He is always keen to learn new technologies. He likes not only programming languages but data analytics too. He has sound knowledge of machine learning and pattern recognition. He believes that the best results come when everyone works as a team. In his free time, he likes coding, listening to music, watching movies, and reading science fiction books.
