Spark Structured Streaming (Part 2) – The Internals


Welcome back, folks, to this blog series on Spark Structured Streaming. This blog is a continuation of the earlier post “Introduction to Structured Streaming“, so I’ll pick up exactly from the point where I left off in the previous blog.

Structure of a Streaming Query

When we call the start() API, Spark internally translates our code into a Logical Plan (an abstract representation of what the code does). It then converts the Logical Plan into an Optimized Logical Plan, where Spark figures out which optimizations can be applied and how to generate the underlying low-level RDD code in its most optimal form from the high-level APIs we have used. Finally, from the Optimized Logical Plan it generates a continuous series of incremental Execution Plans (Physical Plans), one for each micro-batch.

For an in-depth explanation of Logical and Physical Plans, you can check my blog – “Understanding Logical and Physical Plan in Spark”.

Image source: Databricks
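These plans are not a black box: once the query has started, they can be printed from the StreamingQuery handle. A minimal sketch, assuming query holds the value returned by the start() call shown at the end of this post:

// Prints the physical plan of the most recently executed micro-batch.
query.explain()

// Extended mode also prints the logical plans behind that micro-batch.
query.explain(extended = true)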

So, as new data arrives, Spark breaks it into micro-batches (based on the Processing Trigger), processes each batch and writes the result out to the Parquet file.
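The Processing Trigger mentioned above controls how often a new micro-batch is started. A quick sketch of two common choices (parsedDf is a placeholder for a streaming DataFrame):

import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

// Start a new micro-batch every minute (the trigger used at the end of this post).
val everyMinute = parsedDf.writeStream.trigger(Trigger.ProcessingTime(1.minute))

// Or: process all the data available right now as one batch, then stop.
val oneShot = parsedDf.writeStream.trigger(Trigger.Once())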

It is Spark’s job to figure out whether the query we have written should be executed on batch data or on streaming data. Since, in this case, we are reading data from a Kafka topic, Spark will automatically figure out how to run the query incrementally on the streaming data.
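In other words, the same transformation code can run on a static DataFrame and on a streaming one; only the read side differs. A minimal sketch (paths and the topic name are placeholders):

import org.apache.spark.sql.functions.col

// Batch: a static Parquet dataset, processed once.
val batchDf  = spark.read.parquet("/data/events")
val batchOut = batchDf.filter(col("value").isNotNull)

// Streaming: the exact same transformation, planned incrementally by Spark.
val streamDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("subscribe", "topicName")
  .load()
val streamOut = streamDf.filter(col("value").isNotNull)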

Fault-tolerance

With checkpointing we get complete fault-tolerance guarantees, ensuring recovery from failure.

How?

Before processing each micro-batch, Spark writes the Kafka offset range that the batch will cover into a WAL (Write-Ahead Log) file. That way, in case of any failure, it can re-process exactly the same data, giving us end-to-end exactly-once guarantees. These WAL files, which store the offset information, are saved to fault-tolerant storage such as HDFS, an S3 bucket or Azure Blob Storage, in JSON format for forward compatibility.
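As a rough illustration (the checkpoint path below is whatever you passed to checkpointLocation), the offset log can be peeked at directly; each micro-batch leaves one small JSON file under offsets/:

// Each micro-batch writes a file such as <checkpointLocation>/offsets/0, /1, ...
// Reading them as plain text shows the JSON with the Kafka offsets per partition.
spark.read.text("/path/to/checkpointLocation/offsets/*").show(truncate = false)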

We can even resume processing after a failure while making limited changes to our streaming transformations (e.g. adding a new filter clause to remove corrupted data), and the WAL/checkpoint files will keep working seamlessly.
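For example, assuming parsedDf is the parsed streaming DataFrame built by the readStream part of the query in the Conclusion below, the job can be stopped, a filter added, and the query restarted against the same checkpoint directory so that it resumes from the recorded offsets (paths and column names here are placeholders):

import spark.implicits._

parsedDf
  .filter($"data".isNotNull)                          // newly added clause to drop corrupted records
  .writeStream
  .format("parquet")
  .option("path", "/data/out")
  .option("checkpointLocation", "/data/checkpoint")   // unchanged, so processing resumes from the stored offsets
  .start()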

Semantics

Structured Streaming can provide exactly-once delivery semantics when the WAL/checkpoint files are combined with the right kind of sources and sinks: the source must be replayable (like Kafka or Kinesis) and the sink must handle reprocessing idempotently (like the file/Parquet sink). Otherwise, it guarantees at-least-once delivery semantics.
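When the sink itself is not idempotent, one documented pattern (a sketch, not from the original pipeline; parsedDf and the paths are placeholders) is foreachBatch, using the batchId so that a replayed micro-batch overwrites rather than duplicates its output:

import org.apache.spark.sql.DataFrame

// A replayed micro-batch arrives with the same batchId, so it overwrites the
// same directory instead of producing duplicate output.
val writeBatch = (batchDf: DataFrame, batchId: Long) =>
  batchDf.write.mode("overwrite").parquet(s"/data/out/batch_id=$batchId")

parsedDf.writeStream
  .foreachBatch(writeBatch)
  .option("checkpointLocation", "/data/checkpoint")
  .start()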

Conclusion

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import org.apache.spark.sql.types.StringType
import scala.concurrent.duration._
import spark.implicits._

spark.readStream
      .format("kafka")                                  // source: Kafka topic
      .option("kafka.bootstrap.servers", "...")
      .option("subscribe", "topicName")
      .option("startingOffsets", "latest")
      .load()
      .select($"value".cast(StringType))                // binary Kafka value -> JSON string
      .select(from_json($"value", schema).as("data"))   // parse with the schema defined in Part 1
      .writeStream
      .format("parquet")                                // sink: Parquet files
      .option("path", "...")
      .trigger(Trigger.ProcessingTime(1.minutes))       // one micro-batch per minute
      .outputMode(OutputMode.Append)
      .option("checkpointLocation", "...")              // WAL / checkpoint directory
      .start()

With these few lines of code (explained in Part 1), we have our ETL pipeline reading data from Kafka and writing it to Parquet. Thus our raw, unstructured, binary JSON data becomes available in a structured, columnar Parquet format within a few seconds.
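Once a few micro-batches have landed, the output can be read back like any other batch dataset and the parsed JSON struct flattened (the path is whatever was passed to option("path", ...)):

// Read the columnar output and flatten the parsed "data" struct into top-level columns.
val structured = spark.read.parquet("/data/out")
structured.select("data.*").show()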

That is all for this blog. It was short, but I hope you now understand the internals of Structured Streaming.

The next blog of this series covers Stateful Streaming in Structured Streaming.

Hope you enjoyed this blog and it helped you!! Stay connected for more future blogs. Thank you!! 🙂

Written by 

Sarfaraz Hussain is a Big Data fan working as a Senior Software Consultant (Big Data) with 2+ years of experience. He works with technologies like Spark, Scala, Java, Hive & Sqoop and has completed his Master of Engineering with a specialization in Big Data & Analytics. He loves to teach, is a huge fitness freak, and loves to hit the gym when he's not coding.
