Simplifying Sorting with Spark DataFrames


In our previous blog post, Using Spark DataFrames for Word Count, we saw how easy it has become to code in Spark using DataFrames. They have also made programming in Spark more about the logic of a job and less about its technical details.

So, let's continue our quest to simplify coding in Spark with DataFrames, this time via sorting. Sorting has always been an inseparable part of analytics. Whether in e-commerce or the applied sciences, it is a critical task. Spark itself gained much of its fame from the Daytona GraySort challenge, in which it set a new record.

Until DataFrames were introduced in Spark 1.3.0, sortByKey() was the usual way to sort data in Spark, and it sorts a dataset by its key only. What would one do if sorting had to be done by value? A common solution is to swap the key-value pairs and then apply sortByKey(), like this:

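// Count how often each word occurs in data.txt
val lines = sc.textFile("data.txt")
val rdd = lines
           .flatMap(_.split(" "))
           .map((_, 1))
           .reduceByKey(_ + _)
// Swap (word, count) to (count, word) so that sortByKey() sorts by count; false = descending
val sortedRDD = rdd.map(_.swap).sortByKey(false)
val data = sortedRDD.take(5)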

In the above code snippet, we want to find the 5 most frequent words in the “data.txt” file, hence we pass “false” to sortByKey() to sort in descending order. The code itself shows how cumbersome it is to sort by value with RDDs.
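
As a point of comparison, the RDD API also has a sortBy method that takes a function selecting the sort key, so the swap can be avoided even without DataFrames. The following is a minimal sketch, assuming a spark-shell session where sc is already defined; the in-memory word list is made up for illustration and stands in for the contents of data.txt.

```scala
// Assumes spark-shell, where `sc` (SparkContext) is predefined.
// Hypothetical sample input standing in for the words of data.txt.
val words = sc.parallelize(Seq("spark", "sort", "spark", "data", "sort", "spark"))

// Build (word, count) pairs, as in the snippet above.
val counts = words.map((_, 1)).reduceByKey(_ + _)

// sortBy picks the sort key with a function, so no swap is required.
val top2 = counts.sortBy(_._2, ascending = false).take(2)
// top2: Array((spark,3), (sort,2))
```

The result keeps the (word, count) shape, whereas the swap approach returns (count, word) pairs.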

Now, let's see what magic Spark DataFrames have done to simplify sorting, using the same example.

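import sqlContext.implicits._              // enables toDF() on RDDs
import org.apache.spark.sql.functions._    // provides desc()

// One row per line of text, in a column named "line"
val lines = sc.textFile("data.txt").toDF("line")
// Split each line into words, one row per word, in a new column "word"
val df = lines.explode("line","word")((line: String) => line.split(" "))
// Count occurrences per word and sort by the "count" column, highest first
val sortedDF = df.groupBy("word").count().sort(desc("count"))
val data = sortedDF.take(5)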

As we can see, there is no need to swap values as we did with the RDD. Since the data is organised in columns, we can sort by just naming the column on which sorting needs to be done. Also, DataFrames are not restricted to key-value pairs.
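
Because sorting is expressed in terms of column names, it extends naturally to more than one column. Here is a small sketch, assuming a spark-shell session where sqlContext is already defined; the sample rows are invented for illustration. It sorts by count in descending order and breaks ties alphabetically by word.

```scala
// Assumes spark-shell, where `sqlContext` is predefined.
import sqlContext.implicits._
import org.apache.spark.sql.functions._

// Hypothetical word counts; two words share the same count.
val counts = Seq(("spark", 3), ("data", 3), ("sort", 1)).toDF("word", "count")

// Highest count first; ties are ordered alphabetically by word.
val sorted = counts.sort(desc("count"), asc("word"))
val sortedWords = sorted.collect().map(_.getString(0))
// sortedWords: Array(data, spark, sort)
```

Doing the same with sortByKey() would need a composite key and a custom ordering.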

Of course, there are a few shortcomings, like the imports that are necessary to work with DataFrames and their functions. But the overall experience of coding in Spark with DataFrames was fun.
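
One way to trim the imports, sketched here under the assumption of a spark-shell session with sqlContext predefined, is to refer to the column through the DataFrame itself and call desc on it. That removes the need for the functions import, although sqlContext.implicits._ is still used here to build the DataFrame. The sample rows are made up for illustration.

```scala
// Assumes spark-shell, where `sqlContext` is predefined.
import sqlContext.implicits._   // still needed for toDF()

// Hypothetical word counts.
val wordCounts = Seq(("spark", 3), ("sort", 1), ("data", 2)).toDF("word", "count")

// wordCounts("count") returns a Column; its desc method gives a descending
// sort order, so org.apache.spark.sql.functions._ is not imported.
val byCount = wordCounts.sort(wordCounts("count").desc)
val orderedWords = byCount.collect().map(_.getString(0))
// orderedWords: Array(spark, data, sort)
```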

8 Responses to Simplifying Sorting with Spark DataFrames

  3. sasanka ghosh says:

    But I think you are telling half the story. It is fine for a small dataset, but what about a massive dataset? How will a global sort happen then?

  4. sasanka ghosh says:

    OK, got it. It adds a kind of range partitioning and then sorts within each range.
