Understanding the Apache Spark Streaming

Reading Time: 2 minutes

The Apache Streaming module is a stream processing-based module within Apache Spark. It uses the Spark cluster to offer the ability to scale to a high degree. Being based on Spark, it is also highly fault-tolerant, having the ability to rerun failed tasks by checkpointing the data stream that is being processed.

Four Major Aspects of Spark Streaming

  • Fast recovery from failures and stragglers
  • Better load balancing and resource usage
  • Combining of streaming data with static datasets and interactive queries
  • Native integration with advanced processing libraries (SQL, machine learning, graph processing)

Need for Apache Spark Streaming

We have the stream processing systems designed with a continuous operator model to process the data, which works as follows:

  • From data sources, we receive the streaming data (e.g. live logs, system telemetry data, IoT device data, etc.), and then it goes into some data ingestion systems like Apache Kafka, Amazon Kinesis, etc.
  • The data is then processed in parallel on a cluster.
  • Results go to downstream systems like HBase, Cassandra, Kafka, etc.
spark streaming

Now we will cover some important topics used in spark streaming and will get a brief idea about them and their uses :

  • Error Recovery
  • Checkpointing

Error Recovery

Your application’s error management should be robust and self-sufficient. What we mean by this is that if an exception is non-critical, then manage the exception, perhaps log it, and continue processing.
For instance, when a task reaches the maximum number of failures (specified by spark.task.maxFailure ),
it will terminate processing.

Checkpointing in Spark Streaming

On batch processing, fault tolerance is present. This means the job does not lose its state if a node crash. Instead of that, the lost task goes on to other workers. Intermediate results are goes to persistent storage (which of course has to be fault-tolerant as well which is the case for HDFS, GPFS, or Cloud Object Storage)

Now we want to achieve the same guarantees in streaming as well since it might be crucial that we did not lose the data stream which we are processing.

It is possible to set up an HDFS-based checkpoint directory to store Apache Spark-based streaming information. And the process of writing received records at checkpoint intervals to HDFS is checkpointing.

Conclusion

In this blog, we’ve learned about Apache Spark Streaming and some other topics like error recovery, checkpointing, and the need for streaming in apache spark.

Hope you enjoyed the blog. Thanks for reading

References

https://databricks.com/glossary/what-is-spark-streaming#:~:text=Apache%20Spark%20Streaming%20is%20a,both%20batch%20and%20streaming%20workloads.

knoldus