Cache and Persist in Apache Spark DataFrame

Background

Spark computations are faster than MapReduce jobs, but if we haven't designed our jobs to reuse computations, performance degrades when we process billions or trillions of records. Hence, we may need to look at the stages and apply optimization techniques to improve performance.

The cache() and persist() methods provide an optimization mechanism for storing the intermediate computation of a Spark DataFrame so that it can be reused in subsequent actions.

Spark persist()

When we persist a dataset, each node stores its partitions of the data in memory and reuses them in other actions on that dataset. Spark's persisted data on nodes is fault-tolerant: if we lose any partition of a dataset, Spark automatically recomputes it using the original transformations.

DataFrame persist() syntax and example

The Spark persist() method is used to store a DataFrame or Dataset at one of the available storage levels.

Syntax

1) persist() : Dataset.this.type – persists the Dataset at the default storage level (MEMORY_AND_DISK).
2) persist(newLevel : org.apache.spark.storage.StorageLevel) : Dataset.this.type – persists the Dataset at the given storage level.

Example
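A minimal, self-contained sketch of both variants; the object name, sample data, and column names here are illustrative, not from any real dataset:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.upper
import org.apache.spark.storage.StorageLevel

object PersistExample extends App {
  val spark = SparkSession.builder()
    .appName("PersistExample")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  // A small DataFrame built in place so the example is self-contained.
  val df = Seq((1, "scala"), (2, "spark"), (3, "kafka")).toDF("id", "tech")

  // 1) persist() with no arguments stores the DataFrame at MEMORY_AND_DISK.
  val defaultPersisted = df.persist()
  defaultPersisted.count() // the first action materializes and stores the partitions
  defaultPersisted.show()  // later actions reuse the stored data instead of recomputing

  // 2) persist(newLevel) stores a DataFrame at an explicit storage level.
  val upperDf = df.withColumn("tech", upper($"tech")).persist(StorageLevel.MEMORY_ONLY)
  upperDf.count()

  println(defaultPersisted.storageLevel) // StorageLevel(disk, memory, deserialized, 1 replicas)
  println(upperDf.storageLevel)          // StorageLevel(memory, deserialized, 1 replicas)

  spark.stop()
}

Note that persist() is lazy: nothing is stored until the first action runs on the DataFrame.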

cache()

Spark's cache() in the Dataset class internally calls persist() with the default storage level (MEMORY_AND_DISK).

cache() syntax and example

Syntax

cache() : Dataset.this.type

Example
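A minimal sketch along the same lines; the names and sample data are again illustrative:

import org.apache.spark.sql.SparkSession

object CacheExample extends App {
  val spark = SparkSession.builder()
    .appName("CacheExample")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  val df = Seq((1, "cache"), (2, "persist")).toDF("id", "method")

  // cache() is shorthand for persist() with the default MEMORY_AND_DISK level.
  val cachedDf = df.cache()

  cachedDf.count() // the first action stores the partitions
  cachedDf.show()  // reuses the cached data instead of recomputing

  println(cachedDf.storageLevel) // StorageLevel(disk, memory, deserialized, 1 replicas)

  spark.stop()
}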

Conclusion

This blog gives information about cache() and persist() in Spark. They are used as optimization techniques to save the interim computation results of DataFrames or Datasets so that subsequent actions can reuse them.

Written by 

Rakhi Pareek is a Software Consultant at Knoldus. She believes in continuous learning with new technologies. Her current practice area is Scala. She loves to maintain a diary where she jots down her thoughts daily. Her hobby is doodle art.