In our previous blog post, Using Spark DataFrames for Word Count, we saw how easy it has become to code in Spark using DataFrames, and how they make programming in Spark more about the logic of a problem than its technical details.
So, let's continue our quest to simplify coding in Spark with DataFrames, this time via sorting. Sorting has always been an inseparable part of analytics: whether in e-commerce or applied sciences, it is a critical task. Spark itself gained fame from the Daytona GraySort challenge, in which it set a new record.
Earlier, until DataFrames were introduced in Spark 1.3.0, sortByKey() was the only way to sort data in Spark, and it could only sort a dataset by its key. What would one do if the sorting had to be done by value? A probable solution is to swap the key-value pairs and then apply sortByKey(), like this:
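Here is a minimal sketch of that approach (the swap-and-sort pattern is from the post; the exact code is illustrative and assumes an existing SparkContext `sc`):

```scala
// Count words, then swap to (count, word) pairs so sortByKey() can
// sort by frequency, and finally take the top 5.
val wordCounts = sc.textFile("data.txt")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

val top5 = wordCounts
  .map { case (word, count) => (count, word) } // swap key and value
  .sortByKey(false)                            // false = descending order
  .map { case (count, word) => (word, count) } // swap back
  .take(5)

top5.foreach(println)
```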
In the above code snippet, we want to find the 5 most frequent words in the “data.txt” file, hence we pass “false” to sortByKey() to sort in descending order. From the code itself it is clear how cumbersome it is to perform this kind of sorting with RDDs.
Now, let's see what magic Spark DataFrames have done to simplify sorting, taking the same example.
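A minimal sketch of the DataFrame version (again assuming the SparkContext `sc`; the SQLContext setup and the column name are illustrative):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.desc

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Build a single-column DataFrame of words from the same file.
val wordsDF = sc.textFile("data.txt")
  .flatMap(line => line.split(" "))
  .toDF("word")

// Group, count, and sort on the "count" column directly; no swapping needed.
wordsDF.groupBy("word")
  .count()
  .sort(desc("count"))
  .show(5)
```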
As we can see, there is no need to swap values as we did with the RDD. Since the data is organised in columns, we can sort it just by mentioning the name of the column on which the sorting needs to be done. Also, DataFrames are no longer restricted to key-value pairs.
Of course, there are a few shortcomings, like the imports that are necessary to work with DataFrames and their functions. But the overall experience of coding in Spark with DataFrames was fun.
But I think you are telling only half the story. It is fine for a small dataset, but what about a massive dataset? How will a global sort happen then?
OK, got it. It adds a kind of range partitioner class and then does the sort within each range.
Yes!!
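To make the answer in the comments concrete, here is a minimal sketch of that mechanism (sortByKey() does build a RangePartitioner internally, though writing it out explicitly, as below, is only illustrative):

```scala
import org.apache.spark.RangePartitioner

// sortByKey() samples the data to choose partition boundaries, so that all
// keys in partition i come before all keys in partition i + 1, then sorts
// within each partition. Done by hand, that is roughly:
val pairs = sc.textFile("data.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

val partitioner = new RangePartitioner(4, pairs) // 4 ranges, picked by sampling
val globallySorted = pairs.repartitionAndSortWithinPartitions(partitioner)
```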