We all know that Apache Spark is a big data processing engine built around in-memory computation. When we are dealing with extensive data, even small savings in memory and compute per job can add up to thousands of dollars per month at scale.
Hence it becomes essential for every developer and ops person to learn Spark best practices and optimization techniques.
In this blog, I will cover a few techniques you can use in a production environment to save those thousands of dollars.
Use DataFrame/DataSet over RDD
An RDD serializes and deserializes its data whenever it distributes the data across the cluster, such as during repartition and shuffle, and serialization and deserialization are expensive operations in Spark.
On the other hand, a DataFrame stores data in a compact binary format in off-heap storage (Project Tungsten), so there is no need to serialize and deserialize the data when it is distributed across the cluster. As a result, we see a big performance improvement with DataFrame over RDD.
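As a quick illustration, here is a minimal sketch meant for spark-shell (where spark and sc are already defined); the sample data and column names are made up. The same aggregation is written once against an RDD and once against a DataFrame, and the DataFrame version lets Spark keep rows in its binary format and optimize the plan:

```scala
// Sample sales data; values are made up for illustration.
val salesRdd = sc.parallelize(Seq(("books", 10.0), ("games", 25.0), ("books", 5.0)))

// RDD version: each record is a plain JVM object that must be serialized
// and deserialized whenever the shuffle moves it between executors.
val rddTotals = salesRdd.reduceByKey(_ + _)
rddTotals.collect().foreach(println)

// DataFrame version: rows are kept in Spark's compact binary (Tungsten) format,
// and the aggregation goes through the Catalyst optimizer.
val salesDf = salesRdd.toDF("category", "amount")
val dfTotals = salesDf.groupBy("category").sum("amount")
dfTotals.show()
```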
Use Coalesce over Repartition
Whenever you need to reduce the number of partitions, prefer coalesce over repartition, because coalesce merges existing partitions and therefore moves the minimum amount of data. Repartition, on the other hand, recreates all the partitions with a full shuffle, so data movement is high. To increase the number of partitions, however, we have to use repartition.
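Here is a minimal sketch for spark-shell; the DataFrame and the partition counts are made up for illustration:

```scala
// Start with an artificially over-partitioned DataFrame.
val df = spark.range(0, 1000000).repartition(200)

// coalesce merges existing partitions, so there is no full shuffle.
val reduced = df.coalesce(10)
println(reduced.rdd.getNumPartitions)   // 10

// repartition performs a full shuffle; it is required when increasing partitions.
val increased = reduced.repartition(50)
println(increased.rdd.getNumPartitions) // 50
```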
Use Serialized data formats
Generally, whether it is a streaming or batch job, Spark writes computed results to some output file, and another Spark job consumes that file, does some computation on it, and again writes its results to an output file.
In such a scenario, using a serialized, columnar file format such as Parquet gives us a significant advantage over CSV and JSON file formats: the schema travels with the data, the files are compressed, and downstream jobs can read only the columns they need.
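A minimal sketch for spark-shell; the paths and column names below are made up:

```scala
// Some computed results to persist; the data is synthetic.
val results = spark.range(0, 1000000).withColumnRenamed("id", "order_id")

// Parquet keeps the schema and stores columns compressed, so the downstream
// job can prune columns and push filters down to the files.
results.write.mode("overwrite").parquet("/tmp/results_parquet")

// The same data as CSV is larger and must be fully parsed on every read.
results.write.mode("overwrite").option("header", "true").csv("/tmp/results_csv")

// Downstream job: reading the Parquet output scans only what it needs.
val downstream = spark.read.parquet("/tmp/results_parquet").filter("order_id > 500000")
downstream.count()
```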
Avoid User-Defined Functions
Whenever a built-in Spark function can do the job, use it instead of a UDF (user-defined function). A UDF is a black box for Spark, so Spark cannot apply any optimization to it, and we lose all the optimization features offered by the Spark DataFrame/DataSet API.
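As a small example for spark-shell (the data and column names are made up), here is the same transformation written first as a UDF and then with the built-in upper function:

```scala
import org.apache.spark.sql.functions.{col, udf, upper}

val users = Seq("alice", "bob").toDF("name")

// UDF version: a black box that the Catalyst optimizer cannot look into.
val upperUdf = udf((s: String) => s.toUpperCase)
val withUdf = users.withColumn("name_upper", upperUdf(col("name")))

// Built-in version: the same logic with Spark's own upper() function,
// so it stays inside the optimized execution plan.
val withBuiltin = users.withColumn("name_upper", upper(col("name")))
withBuiltin.explain()
```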
Caching data in memory
Whenever we run a sequence of transformations and need to reuse an intermediate DataFrame again and again for further computations, Spark gives us the ability to keep that specific DataFrame in memory in the form of a cache.
Job1: df1 > df2 > df3 > df4 > df5
Job2: df1 > df2 > df3 > df6 > df7
Here there are two jobs, so two actions will be triggered, and both will compute df1 > df2 > df3. That is duplicate work and consumes resources, so to avoid it we can cache df3 while running Job1.
Then Job2 does not have to compute df1 > df2 > df3 again; it takes df3 directly from the cache, proceeds from there, and saves our resources.
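A minimal sketch for spark-shell that mirrors the df1 > df2 > df3 chain above; the DataFrames and column names are made up:

```scala
import org.apache.spark.sql.functions.col

val df1 = spark.range(0, 1000000).toDF("id")
val df2 = df1.withColumn("doubled", col("id") * 2)
val df3 = df2.filter(col("doubled") % 3 === 0)

// Cache df3 so downstream jobs reuse it instead of recomputing the chain.
df3.cache()

// Job1: the first action materializes df3 and stores it in memory.
val df4 = df3.groupBy((col("id") % 10).as("bucket")).count()
df4.show()

// Job2: this action reads df3 from the cache, skipping df1 > df2 > df3.
val df6 = df3.select(col("doubled") + 1)
df6.show()
```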
Conclusion
These points cover only some of the high-level decisions you can take to improve your Spark job performance. In a future blog, we will cover more points at the code level that will tune your Spark jobs further.
References
- https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
- https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html