In this blog, we will try to understand the concept of Persistence in Apache Spark in layman’s terms, with scenario-based examples.
Note: The scenarios are simplified to make the concepts easier to understand.
Note: Cache memory is not shared between Executors. Each Executor caches the partitions it computes in its own memory.
What does it mean by persisting/caching an RDD?
Spark RDD persistence is an optimization technique that saves the result of an RDD evaluation in cache memory. Saving this intermediate result lets us reuse it later instead of recomputing it, which reduces the computation overhead.
When we persist an RDD, each node stores the partitions of it that it computes in memory and reuses them in other actions on that RDD (or RDD derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.
You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in cache memory on the nodes. Spark’s cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
Let’s say I have this transformation –
RDD3 => RDD2 => RDD1 => Text File
RDD4 => RDD3
RDD5 => RDD3
RDD3 is created from RDD2, and RDD2 is created from RDD1. RDD4 and RDD5 are both derived from RDD3. Each time an action runs on RDD4 or RDD5, RDD3 has to be computed, which means RDD2 and RDD1 are recomputed again and again.
Here, running one action on RDD4 and one on RDD5 computes the entire chain of transformations twice.
But we can persist RDD3 in the cache memory of the Worker node so that each time we use it, RDD2 and RDD1 need not be recomputed.
RDD3.cache()
RDD4.collect() // The first action which involves RDD3 will store it in cache memory
RDD5.collect()
Here, to compute RDD5, Spark reads RDD3 from the cache memory and generates the result. Hence RDD2 and RDD1 are not recomputed for RDD5.
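To make the recomputation visible, here is a minimal sketch in plain Python. It is a toy model, not Spark itself: the ToyRDD class, the compute_calls counter and the sample data are all made up for illustration. It only counts how many times each step of the chain runs, with and without caching RDD3.

```python
from collections import Counter

# Counts how many times each toy RDD is actually computed.
compute_calls = Counter()


class ToyRDD:
    """Hypothetical stand-in for an RDD: a function applied to its parent's data."""

    def __init__(self, name, fn, parent=None):
        self.name = name
        self.fn = fn
        self.parent = parent
        self.use_cache = False
        self._cached = None

    def cache(self):
        # Lazy, like in Spark: this only marks the RDD, nothing is computed yet.
        self.use_cache = True
        return self

    def compute(self):
        # Reuse the stored result if this RDD was marked and already computed.
        if self.use_cache and self._cached is not None:
            return self._cached
        source = self.parent.compute() if self.parent else None
        compute_calls[self.name] += 1
        result = self.fn(source)
        if self.use_cache:
            self._cached = result
        return result

    def collect(self):
        # Stand-in for an action: triggers the computation.
        return self.compute()


def run(cache_rdd3):
    """Build the chain from the text, run two actions, return the compute counts."""
    compute_calls.clear()
    rdd1 = ToyRDD("RDD1", lambda _: [1, 2, 3, 4])  # plays the role of the text file
    rdd2 = ToyRDD("RDD2", lambda xs: [x * 10 for x in xs], rdd1)
    rdd3 = ToyRDD("RDD3", lambda xs: [x + 1 for x in xs], rdd2)
    rdd4 = ToyRDD("RDD4", lambda xs: [x for x in xs if x > 20], rdd3)
    rdd5 = ToyRDD("RDD5", lambda xs: [sum(xs)], rdd3)
    if cache_rdd3:
        rdd3.cache()
    rdd4.collect()  # first action involving RDD3
    rdd5.collect()  # second action involving RDD3
    return dict(compute_calls)


print("without cache:", run(cache_rdd3=False))
print("with cache:   ", run(cache_rdd3=True))
```

Without caching, RDD1, RDD2 and RDD3 are each computed twice. With RDD3 cached, each of them is computed only once, and RDD5 is built from the stored result.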
Note: rdd.cache() is the same as rdd.persist() with the default storage level.
– MEMORY_ONLY (default) – same as cache
rdd.persist(StorageLevel.MEMORY_ONLY) or rdd.persist()
– MEMORY_AND_DISK – Stores partitions on disk which do not fit in memory (This is also called Spilling)
– DISK_ONLY – Stores all partitions on the disk
– MEMORY_ONLY (default) – This is the simplest level and the recommended starting point. It stores the partitions of the RDD in cache memory; partitions that do not fit are not cached and are recomputed when they are needed.
– DISK_ONLY – Let’s say I have an RDD (named RDD1) whose computation is very expensive (time-consuming, for example created by applying an ML algorithm). The RDD is also huge, and the cache memory available on the Worker node is too small to hold it. In this case, we can save the RDD to disk.
You may wonder: what is the point of storing it on disk?
Definitely, storing the RDD on disk means I/O, which is time-consuming. So we need to check which takes longer: the I/O or the recomputation of the RDD. If the I/O takes less time than recomputing the RDD, it is better to store the RDD on disk.
So, whenever RDD1 is required again in a subsequent transformation, Spark does an I/O operation and brings it into the Executor’s memory.
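The comparison above is simple arithmetic, and a small sketch can make it concrete. The function name and all the numbers below are hypothetical, and real disk throughput and recomputation time would have to be measured on your own cluster.

```python
def disk_read_is_cheaper(rdd_size_mb, disk_read_mb_per_s, recompute_seconds):
    """Return (decision, estimated_read_seconds) for a rough I/O vs recompute check.

    All inputs are estimates supplied by the caller; this is not a Spark API.
    """
    read_seconds = rdd_size_mb / disk_read_mb_per_s
    return read_seconds < recompute_seconds, read_seconds


# Hypothetical scenario: a 20 GB RDD, a disk that reads 200 MB/s,
# and a recomputation that takes 10 minutes.
use_disk, read_s = disk_read_is_cheaper(
    rdd_size_mb=20_000, disk_read_mb_per_s=200, recompute_seconds=600
)
print(f"estimated read time: {read_s:.0f}s, persist to disk: {use_disk}")
```

With these made-up numbers, reading from disk takes about 100 seconds against 600 seconds of recomputation, so persisting to disk is the better choice. If the recomputation took only 30 seconds, recomputing would win.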
– MEMORY_AND_DISK – Let’s say I have 3 RDDs (none of them cached) in the Executor’s memory and there is no memory left. Meanwhile, another RDD (say RDD4) comes in. Spark removes the Least Recently Used (LRU) RDD from the Executor’s memory to make space for the new RDD (in this case RDD4).
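Now, let’s assume that the 3 RDDs are cached with MEMORY_ONLY and the cache memory of the Worker Node is full when RDD4 arrives. Spark still evicts cached partitions in LRU order, and with MEMORY_ONLY a partition that is evicted, or that never fit, is simply dropped and has to be recomputed from its lineage the next time it is needed.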
But if we use the MEMORY_AND_DISK persistence level for RDD4, then RDD4 is stored on disk when there is not enough space in the cache memory.
Also, if a huge RDD is cached and there is not enough cache memory, the partitions that do not fit in the cache memory are spilled to disk when we use MEMORY_AND_DISK.
Again, the challenge here is the I/O operations.
Note: Data persisted to disk is stored in Spark’s local scratch directory (spark.local.dir, which defaults to /tmp).
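The difference between the three levels can be sketched as a tiny placement function in plain Python. This is a toy model, not Spark’s block manager: the function name and the idea of counting memory in "slots" (one slot per partition) are assumptions made for illustration.

```python
from collections import Counter


def place_partitions(num_partitions, memory_slots, level):
    """Decide where each partition of one RDD ends up in this toy model.

    memory_slots is a hypothetical capacity: how many partitions fit in cache memory.
    Returns a dict mapping partition index to "memory", "disk" or "recompute".
    """
    placement = {}
    for p in range(num_partitions):
        if level != "DISK_ONLY" and p < memory_slots:
            placement[p] = "memory"      # fits in cache memory
        elif level in ("MEMORY_AND_DISK", "DISK_ONLY"):
            placement[p] = "disk"        # spilled (or written directly) to disk
        else:
            placement[p] = "recompute"   # MEMORY_ONLY: not cached, rebuilt when needed
    return placement


# Hypothetical RDD with 5 partitions on a node that can hold 3 of them in memory.
for level in ("MEMORY_ONLY", "MEMORY_AND_DISK", "DISK_ONLY"):
    summary = Counter(place_partitions(5, 3, level).values())
    print(level, dict(summary))
```

In this sketch MEMORY_ONLY leaves two partitions to be recomputed, MEMORY_AND_DISK spills those two to disk, and DISK_ONLY writes all five to disk.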
Serialization – We can choose to serialize the data stored in cache memory.
MEMORY_ONLY_SER and MEMORY_AND_DISK_SER
Persisting the RDD in a serialized (binary) form reduces the size of the RDD, which leaves room for more RDDs to be persisted in the cache memory. So these two storage levels are space-efficient.
But they are less time-efficient, because we pay the cost of deserializing the data every time it is read.
So it is the developer’s choice whether performance or storage matters more. The performance impact is usually small, but it is there.
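The space versus time trade-off can be felt with a small plain-Python analogy. This sketch uses Python’s pickle module as a stand-in serializer, which is an assumption made for illustration; Spark on the JVM uses Java or Kryo serialization, and the sizes you see there will differ.

```python
import pickle
import sys

# Hypothetical "partition": ten thousand small records held as Python objects.
records = list(range(10_000))

# Rough size of the deserialized form: the list itself plus every element object.
approx_object_bytes = sys.getsizeof(records) + sum(sys.getsizeof(x) for x in records)

# Serialized (binary) form of the same data.
blob = pickle.dumps(records, protocol=pickle.HIGHEST_PROTOCOL)

# Reading the data back costs a deserialization step every time.
restored = pickle.loads(blob)

print("objects in memory (approx. bytes):", approx_object_bytes)
print("serialized bytes:                 ", len(blob))
print("round trip is lossless:           ", restored == records)
```

The serialized blob is much smaller than the live objects, but every read has to go through pickle.loads first. That extra step is the time cost described above.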
Replication – Stores each partition on two nodes.
MEMORY_ONLY_2, MEMORY_AND_DISK_2 and DISK_ONLY_2
These options store a replicated copy of each partition of the RDD on another Worker Node as well (in cache memory or on disk, depending on the level).
If a worker node goes down, the replica on the other node is used to serve the partition, so the lost partition does not have to be recomputed.
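A small plain-Python sketch shows why a second copy helps. It is a toy model only: the node names, the placement dictionaries and the helper function are hypothetical and do not reflect how Spark picks replica locations.

```python
def surviving_partitions(placement, failed_node):
    """Return the partitions that still have at least one copy after a node fails.

    placement maps a partition index to the list of nodes holding a copy of it.
    """
    return {
        partition
        for partition, nodes in placement.items()
        if any(node != failed_node for node in nodes)
    }


# Hypothetical layouts: one copy per partition vs. two copies per partition.
single_copy = {0: ["node-A"], 1: ["node-B"]}
replicated = {0: ["node-A", "node-B"], 1: ["node-B", "node-C"]}

print("single copy, node-A fails:", surviving_partitions(single_copy, "node-A"))
print("replicated,  node-A fails:", surviving_partitions(replicated, "node-A"))
```

With a single copy, partition 0 is lost when node-A fails and would have to be recomputed. With two copies, both partitions are still available on the remaining nodes.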
Unpersisting the RDD
rdd.unpersist()
– To stop persisting the RDD and remove it from memory or disk
– To change an RDD’s persistence level: unpersist it first, then persist it again with the new level
So, this is how we can work with cache memory in Apache Spark.
This is all from this blog, hope you enjoyed it and it helped you!! Stay connected for more future blogs. Thank you!!