Apache Spark: whenever we hear these two words, the first thing that comes to mind is the RDD, i.e., Resilient Distributed Dataset. It has been more than 5 years since Apache Spark came into existence, and since its arrival a lot has changed in the big data industry. The major change was the dethroning of Hadoop MapReduce. Spark effectively replaced MapReduce, and this happened because of Spark’s easy-to-use API, lower operational cost due to efficient use of resources, compatibility with a lot of existing technologies like YARN/Mesos, fault tolerance, and security.
For these reasons, a lot of organizations have migrated their Big Data applications to Spark, and the first thing they do is learn how to use RDDs. That makes sense, as the RDD is the building block of Spark and the whole idea of Spark is based on it. It is also a good replacement for MapReduce. So, whoever wants to learn Spark should know about RDDs.
But when it comes to building an enterprise-grade application based on Spark, the RDD isn’t a good choice. Why? You will find out when you read the reasons given below. But if the RDD is not a good choice, what should we use instead? The obvious answer is DataFrames/Datasets.
Now, let’s come to the reasons for not using RDDs:
1. Outdated
Yes! You read that right, RDDs are outdated. The reason is that, as Spark matured, it started adding features that were more desirable to industries like Data Warehousing, Big Data Analytics, and Data Science.
Now, in order to fulfill the needs of these industries, Spark had to come up with a solution that could work like a silver bullet and fit all sorts of industries.
To do that, it introduced the DataFrame/Dataset, a distributed collection of data with the benefits of Spark SQL’s optimized execution engine. We’ll get to what Spark SQL’s optimized execution engine is later on, but for now we know that Spark has come up with two new types of data structures which have more benefits than the RDD.
2. Hard to Use
The next reason not to use RDDs is the API they provide. Most operations, like counting and grouping, are pretty straightforward, as built-in functions exist for them. But when it comes to operations like aggregation or finding an average, it becomes really hard to code using RDDs.
For example, say we have a text file and we want to find the average frequency of all the words in it, i.e., the mean number of occurrences per distinct word.
First, let’s code it using RDDs:
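A minimal sketch of what such an RDD solution could look like. It assumes that "average frequency" means total word occurrences divided by the number of distinct words. The object name, the in-memory sample lines (standing in for a real text file read with textFile), and the local master are illustrative assumptions.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RddAverageWordFrequency {

  // Average number of occurrences per distinct word.
  // Assumes the input contains at least one word (reduce fails on an empty RDD).
  def averageWordFrequency(lines: RDD[String]): Double = {
    // Step 1: classic word count, yielding (word, occurrences) pairs.
    val wordCounts: RDD[(String, Int)] = lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Step 2: compute the mean by hand, carrying
    // (sum of occurrences, number of distinct words) through a reduce.
    val (total, distinctWords) = wordCounts
      .map { case (_, count) => (count.toLong, 1L) }
      .reduce { case ((sum1, n1), (sum2, n2)) => (sum1 + sum2, n1 + n2) }

    total.toDouble / distinctWords
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-average-word-frequency")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    // Hypothetical sample input; a real job would use spark.sparkContext.textFile(path).
    val lines = spark.sparkContext.parallelize(Seq(
      "spark makes rdd", "rdd is resilient", "spark is fast"))

    println(s"Average word frequency: ${averageWordFrequency(lines)}")
    spark.stop()
  }
}
```

Note how the intent (an average) is hidden behind tuple bookkeeping: the reader has to work out what the pairs in the final reduce stand for.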
Now, let’s try to solve the same problem using DataFrames:
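A minimal sketch of a DataFrame solution to the same problem, again taking "average frequency" to mean the mean number of occurrences per distinct word. The object name, the column name line, the in-memory sample rows (standing in for a real text file), and the local master are illustrative assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col, explode, split}

object DataFrameAverageWordFrequency {

  // Expects a DataFrame with a single string column named "line".
  def averageWordFrequency(lines: DataFrame): Double = {
    lines
      .select(explode(split(col("line"), "\\s+")).as("word")) // one row per word
      .filter(col("word") =!= "")                             // drop empty tokens
      .groupBy("word")
      .count()                                                // occurrences per word
      .agg(avg("count"))                                      // mean of those counts
      .first()
      .getDouble(0)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-average-word-frequency")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample input; a real job could use spark.read.text(path)
    // and rename its "value" column to "line".
    val lines = Seq("spark makes rdd", "rdd is resilient", "spark is fast").toDF("line")

    println(s"Average word frequency: ${averageWordFrequency(lines)}")
    spark.stop()
  }
}
```

The chain reads almost like the SQL it corresponds to: split into words, group by word, count, then average the counts.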
The difference between the two solutions is clear. The first will definitely take you some time to work out what the developer is trying to do. The second one is pretty straightforward, and anyone who knows at least SQL will understand it in one go.
So, we saw that RDDs can sometimes be tough to use when the problem at hand is a tricky one like the one given above.
3. Slow Speed
Last, but not least, a reason not to use RDDs is their performance, which can be a major issue for some applications. Since this is an important reason, we will take a closer look at it.
For example, say we have 100M numbers to count. 100M doesn’t seem like a big number when we talk about big data, but the important thing to notice here is the difference in speed between DataFrame/Dataset and RDD. Now, let’s see the example:
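A minimal sketch of such a comparison, assuming the Dataset ds is built with spark.range. The object name, the timed helper, and the local master are illustrative assumptions. Absolute timings depend on the machine, the Spark version, and JVM warm-up, so only the relative gap is meaningful.

```scala
import org.apache.spark.sql.SparkSession

object CountComparison {

  // Runs the body once and returns its result with the elapsed wall-clock milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("count-comparison")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    // Dataset holding the numbers 0 until 100,000,000.
    val ds = spark.range(100000000L)

    // Count through the Dataset API.
    val (dsCount, dsMillis) = timed(ds.count())
    // Convert to an RDD first, then count.
    val (rddCount, rddMillis) = timed(ds.rdd.count())

    println(s"ds.count     = $dsCount in $dsMillis ms")
    println(s"ds.rdd.count = $rddCount in $rddMillis ms")
    spark.stop()
  }
}
```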
When I ran ds.count, it gave me the result, which is of course 100000000, in about 0.2s (on a 4 Core/8 GB machine). Also, the DAG created is as follows:
Whereas when I ran ds.rdd.count, which first converts the Dataset into an RDD and then runs a count on it, it gave me the result in about 4s (on the same machine). Also, the DAG it creates is different:
Now, looking at the results and DAGs given above, two questions will definitely arise in your mind:
- First, why does ds.rdd.count create only one stage whereas ds.count creates 2 stages?
- Second, even though ds.rdd.count has only one stage to execute, why is it slower than ds.count?
The answers to these questions are as follows:
- Both counts are effectively two-step operations. The difference is that in the case of ds.count, the final aggregation is performed by one of the executors, while ds.rdd.count aggregates the final result on the driver; therefore this step is not reflected in the DAG.
- ds.rdd.count has to initialize (and later garbage collect) 100 million Row objects, which is a costly operation and accounts for the majority of the time difference between the 2 operations.
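One way to look at these two points yourself is to print what Spark plans to run. The sketch below assumes, as appears to be the case in Spark 2.x and 3.x, that ds.count is planned like the global aggregation ds.groupBy().count(), so the physical plan of that query is used as a stand-in. The object name and the local master are illustrative assumptions, and the exact plan text varies between Spark versions.

```scala
import org.apache.spark.sql.SparkSession

object InspectCountPlans {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("inspect-count-plans")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    val ds = spark.range(100000000L)

    // Assumption: ds.count behaves like this global aggregation, so its
    // physical plan is a reasonable proxy for what ds.count executes.
    // explain() only prints the plan; it does not run the job.
    ds.groupBy().count().explain()

    // Lineage of the RDD that ds.rdd.count runs over.
    println(ds.rdd.toDebugString)

    spark.stop()
  }
}
```

In the printed physical plan, look for a partial aggregation, an exchange (shuffle), and a final aggregation, which would correspond to the two stages seen for ds.count. The RDD lineage has no such exchange, which fits the single stage seen for ds.rdd.count.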
So, the crux of these reasons is: avoid RDDs wherever you can and use DataFrames/Datasets instead. That doesn’t mean we should not learn RDDs at all. After all, they are the building blocks of Spark and one cannot ignore them while learning Spark. It is in our applications that we should avoid using them.
I hope you found this post interesting, and now that you have a few reasons not to use RDDs, you will be able to convince others not to use them anymore, like I tried to in this post 😉