Spark computations are generally faster than MapReduce jobs. However, if our jobs are not designed to reuse computations, performance degrades as the data grows to billions or trillions of records. Looking at the stages of a job and applying optimization techniques is therefore one way to improve performance.
The cache() and persist() methods provide an optimization mechanism for storing the intermediate computation of a Spark DataFrame, so that it can be reused in subsequent actions.
Spark persist()
When we persist a dataset, each node stores its partitions of the data in memory and reuses them in other actions on that dataset. Spark's persisted data is fault-tolerant: if any partition of a dataset is lost, it is automatically recomputed using the original transformations that created it.
DataFrame persist syntax and example
The Spark persist() method stores a DataFrame or Dataset at one of the available storage levels.
Syntax
1) persist() : Dataset.this.type – saves the data at the default MEMORY_AND_DISK storage level.
2) persist(newLevel : org.apache.spark.storage.StorageLevel)
Example
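The following is a minimal sketch of both persist() variants, not a definitive implementation. It assumes a local SparkSession; the sample data, the column names, and the PersistExample and sampleOrders names are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

object PersistExample {
  // Hypothetical sample data, used only for illustration.
  def sampleOrders(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq((1, "books", 12.50), (2, "games", 59.99), (3, "books", 7.25))
      .toDF("id", "category", "amount")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PersistExample")
      .master("local[*]") // local mode is an assumption of this sketch
      .getOrCreate()

    val orders = sampleOrders(spark)

    // 1) persist() with no arguments uses the default MEMORY_AND_DISK level.
    val books = orders.filter(col("category") === "books").persist()
    books.count() // the first action computes the filter and stores the result
    books.show()  // later actions can read the stored partitions
    println(books.storageLevel) // reports the level assigned to this DataFrame

    // 2) persist(newLevel) stores the data at an explicitly chosen level.
    val large = orders.filter(col("amount") > 10.0).persist(StorageLevel.MEMORY_ONLY)
    large.count()

    // Release the stored data once it is no longer needed.
    books.unpersist()
    large.unpersist()
    spark.stop()
  }
}
```

Because persist() is lazy, nothing is stored until the first action (count() here) runs. The second variant is applied to a separate DataFrame so that each storage level is assigned exactly once.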




Spark cache()
The cache() method in the Dataset class internally calls persist(), so the data is stored at the default MEMORY_AND_DISK storage level.
cache syntax and example
Syntax
cache() : Dataset.this.type
Example
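A minimal sketch of cache() might look like the following. It assumes a local SparkSession; the CacheExample and evenNumbers names and the generated input are hypothetical and used only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object CacheExample {
  // Hypothetical input: the even numbers between 1 and 1000.
  def evenNumbers(spark: SparkSession): DataFrame =
    spark.range(1, 1001).toDF("n").filter(col("n") % 2 === 0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CacheExample")
      .master("local[*]") // local mode is an assumption of this sketch
      .getOrCreate()

    // cache() only marks the DataFrame for storage; nothing is computed yet.
    val evens = evenNumbers(spark).cache()

    evens.count() // the first action runs the filter and stores the partitions
    evens.show(5) // later actions can read the stored partitions

    // Reports the storage level that cache() assigned.
    println(evens.storageLevel)

    // Release the stored data once it is no longer needed.
    evens.unpersist()
    spark.stop()
  }
}
```

Since cache() takes no arguments, persist(newLevel) is the method to use when a storage level other than the default is needed.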



Conclusion
This blog covered cache() and persist() in Spark. Both are optimization techniques that save the interim computation results of DataFrames or Datasets so that later actions can reuse them.