Apache Spark Streaming Checkpointing

Reading Time: 2 minutes

Introduction

The need of spark streaming application is that it should be running 24/7. Hence, it must be resilient to failures unrelated to application logic such as system failure, JVM crashes etc. The recovery should also be speedy in case of any loss of data. Spark streaming achieves this by the help of checkpointing. With the help of this, input DStreams can restore before failure streaming state and continue stream processing. There are two types of data we checkpoint in Spark :

  1. Metadata Checkpointing : – Metadata means data about the data. Metadata checkpointing is used to recover the streaming application driver node from failure. It includes configurations used to create the application, DStream operations and incomplete batches.
  2. Data Checkpointing :- In stateful streaming, generated RDDs are saved to the storage such as HDFS. It is the case in which upcoming RDD for some transformations is dependent on previous RDDs. In this case, the length of dependency chain keeps on increasing with time. Thus to avoid increase in recovery time, the intermediate RDDs are peroidically checkpointed to some reliable storage.

Types of Checkpointing in Apache Spark

There are 2 types of Apache Spark checkpointing:

  1. Reliable Checkpointing :- Checkpointing in which actual RDD is saved to reliable distributed storage i.e. HDFS. We need to call the SparkContext.setCheckpointDir(directory: String) method to set the checkpointing directory. The directory must be a HDFS directory while running the job in cluster mode. The reason is that driver might try to reconstruct the checkpointed RDD from its own local path. It is not correct because checkpointing files are actually on the executor machines.
  2. Local Checkpointing :- Local checkpoint allows you to truncate RDD Lineage graph while skipping the expensive step of replicating the data to a reliable distributed file system. It is useful in cases where RDDs needs to be truncated peroidically. Eg. GraphX. In this we persist RDD to local storage in the executor.

Difference between Caching and Spark Streaming Checkpointing

Caching computes and materializes an RDD in memory while keeping track of its lineage. It allows you to make space and compute cost trade offs, and specify the behaviour of RDD when it runs out of memory. Since caching remembers a RDD’s lineage, spark has to recompute in case of any failure. Apart from this, cached RDD only lives in the context of running application.

Checkpointing saves a RDD to reliable storage system such as HDFS, S3 while forgetting the RDD lineage completely. Truncating dependencies become very important when lineage starts getting long. Along with this, other applications can also use the checkpointed RDD.

Conclusion

With the help of checkpointing method, one can easily achieve fault tolerance. Also, data can be saved forever and can be used consistently in future. There are also various use cases like where computation takes a lot of time, checkpointing is the solution.

Thanks for the reading it.

knoldus

Written by 

Amarjeet is a Software Consultant at Knoldus Software LLP. He has an overall experience of 2 years and 10 months.He has completed his Bachelor of Technology in Computer Science from National Institute of Technology, Hamirpur, Himachal Pradesh. He likes reading books, travelling and trekking.