Structured Streaming: What is it?


With the advent of streaming frameworks like Spark Streaming, Flink, Storm, etc., developers stopped worrying about concerns common to every streaming application, such as fault tolerance (i.e., zero data loss) and real-time processing of data, and started focusing only on solving business challenges. The reason is that these frameworks provide inbuilt support for all of them. For example:

In Spark Streaming, recovery from failure(s) becomes easy just by setting a checkpoint directory path, as shown in the snippet below.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("RecoverableNetworkWordCount")
// Create the context with a 1 second batch size
val ssc = new StreamingContext(sparkConf, Seconds(1))
// Persist metadata/state to this directory so the job can recover after a failure
ssc.checkpoint("/path/to/checkpoint")

And in Flink, we just have to enable checkpointing on the execution environment, as shown in the snippet below.

import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000); // create a checkpoint every 5 seconds
env.getConfig().setGlobalJobParameters(parameterTool); // make parameters (a ParameterTool) available in the web interface
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);


Everything was working fine in the streaming data world, but then came the structured data era, where data was stored in tabular form (in large data warehouses) and processed using simple SQL queries. In Spark SQL or the Flink Table API, for example, reading data became as simple as a "SELECT *", as shown below:

spark.sql("SELECT * FROM employee").show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+

Table table = tableEnv.sqlQuery("SELECT * FROM employee");
DataSet<Employee> result = tableEnv.toDataSet(table, Employee.class);
result.print();

This helped a wider range of people, i.e., those who do not know how to code, like data scientists and business analysts, but who do know SQL. Both Spark SQL and the Flink Table API became instant hits in the big data industry.

However, this success was limited to batch data only, i.e., files, tables, etc. The streaming world was left totally untouched by it. Everyone wanted the capability of running their SQL queries on streaming data as well, so that they could draw insights from their data in real time.

This compelled big data industry experts to develop APIs that can process streaming data present in structured/semi-structured form. As a result, several frameworks were developed that can process streaming data using SQL queries. For example:

  • Spark Structured Streaming
  • KSQL (Kafka-SQL)
  • Flink Table, and many more

They all have their own pros and cons, but in this blog post we will talk only about Spark Structured Streaming. According to Spark’s official documentation:

Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.

It means that we can express our streaming computation the same way we would express a batch computation on static data. Since Structured Streaming is built on the Spark SQL engine, it comes with a lot of advantages built in (see the sketch after the list below), like:

  1. Incremental and continuous updates of the final result (table) are taken care of by the API itself.
  2. The Dataset/DataFrame API can be used/re-used in any supported language (Scala, Java, Python, or R) to express streaming aggregations, event-time windows, stream-to-batch joins, etc.
  3. Computations are optimized, as the same Spark SQL engine is used.
  4. And the application guarantees end-to-end exactly-once fault tolerance through checkpointing and WALs (Write-Ahead Logs).
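
To make this concrete, here is a minimal sketch of a streaming word count written exactly like a batch DataFrame query. The socket source on localhost:9999 and the names used (lines, wordCounts, query) are illustrative assumptions, not something from the original post.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
import spark.implicits._

// Read lines from a socket as an unbounded, streaming DataFrame
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Exactly the same Dataset/DataFrame operations we would use on static data
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// The engine updates the result table incrementally as new data arrives
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()

Notice that the groupBy/count is the same query we would write over a static DataFrame; the engine takes care of keeping the running counts up to date as new lines arrive.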

So, long story short, Structured Streaming is a fast, scalable, fault-tolerant, end-to-end exactly-once stream processing API that helps users build streaming applications without having to reason about the streaming machinery underneath.
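
And because it sits on top of the Spark SQL engine, a streaming DataFrame can even be registered as a temporary view and queried with plain SQL. Continuing the sketch above (the view name "events" is again just an illustrative assumption):

// Register the streaming DataFrame from the sketch above as a temporary view
lines.createOrReplaceTempView("events")

// The SQL query returns another streaming DataFrame that is updated incrementally
val counts = spark.sql("SELECT value, COUNT(*) AS cnt FROM events GROUP BY value")

counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()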

We will explore more about Structured Streaming in our future blogs. Till then stay tuned 🙂

Please feel free to suggest and comment.


Written by Himanshu Gupta

Himanshu Gupta is a software architect with more than 9 years of experience. He is always keen to learn new technologies. He likes not only programming languages but data analytics too. He has sound knowledge of machine learning and pattern recognition. He believes that the best results come when everyone works as a team. In his free time, he likes coding, listening to music, watching movies, and reading science fiction books.
