Simplifying Sorting with Spark DataFrames

Table of contents
Reading Time: 2 minutes

In our previous blog post, Using Spark DataFrames for Word Count, we saw how easy it has become to code in Spark using DataFrames. Also, it has made programming in Spark much more logical rather than technical.

So, lets continue our quest for simplifying coding in Spark with DataFrames via Sorting. We all know that Sorting has always been an inseparable part of Analytics. Whether it is E-Commerce or Applied Sciences, sorting has always been a critical task for them. Even Spark gained its fame from Daytona Gray Sort challenge, in which Spark set a new record.

Earlier, sortByKey() was the only way to sort data in Spark, until DataFrames were introduced in Spark 1.3.0. That too, was limited to sort a dataset by its key only. What would one do if sorting was to be done by value ? A probable solution for this question is to swap the Key-Value pairs and then apply sortByKey(), like this

In above code snippet, we want to find the 5 most frequent words written in “data.txt” file, hence we provided “false” to sortByKey(). From the code itself it is understandable that how cumbersome it is perform sorting with RDDs.

Now, lets see what magic Spark DataFrames has done to simplify sorting by taking the same example.

As we can see that their is no need of swapping values as we were doing in RDD. Since, data is organised in column format, we can perform sorting by just mentioning the name of the column on which sorting needs to be done. Also, DataFrames are not restricted to Key-Value pairs anymore.

Of course, there are few shortcomings like the imports that are necessary to work with DataFrames and its functions. But, the overall experience of coding in Spark with DataFrames was fun.

Written by 

Himanshu Gupta is a software architect having more than 9 years of experience. He is always keen to learn new technologies. He not only likes programming languages but Data Analytics too. He has sound knowledge of "Machine Learning" and "Pattern Recognition". He believes that best result comes when everyone works as a team. He likes listening to Coding ,music, watch movies, and read science fiction books in his free time.

8 thoughts on “Simplifying Sorting with Spark DataFrames2 min read

  1. Bur i think u r telling half the story. It is fine for a small data set but what about a massive dataset then how global sort will happen ?

Comments are closed.