Using Spark DataFrames for Word Count


The DataFrame API was introduced in Spark 1.3.0, in March 2015. Its goal was to make distributed processing of “Big Data” more intuitive, by organizing a distributed collection of data (known as an RDD) into named columns. This enabled both engineers and data scientists to use Apache Spark for distributed processing of “Big Data” with ease.

The DataFrame API also came with many under-the-hood optimizations, like the Spark SQL Catalyst optimizer, and more recently, in Spark 1.5.0, the Tungsten execution engine was enabled for it. This has made Spark DataFrames more efficient and faster than ever.

In this blog post, we will learn “How to use Spark DataFrames for a simple Word Count?”

The first step is to create a SparkContext and a SQLContext, on which DataFrames depend.

 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql.SQLContext
 val sc = new SparkContext(new SparkConf().setAppName("word-count").setMaster("local"))
 val sqlContext = new SQLContext(sc)

Now, we can load up the file for which we have to find the Word Count. But before that, we need to import the implicits from “sqlContext”, which are necessary to convert an RDD to a DataFrame.

 import sqlContext.implicits._

After this, we can find the Word Count of the file using the following code snippet.

 val linesDF = sc.textFile("file.txt").toDF("line")  // one row per line of the file
 val wordsDF = linesDF.explode("line", "word")((line: String) => line.split(" "))  // one row per word
 val wordCountDF = wordsDF.groupBy("word").count()  // occurrences of each distinct word
 wordCountDF.show()
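
Since DataFrame queries run through the Catalyst optimizer mentioned earlier, we can also peek at the plan Spark generates for this query. A minimal sketch:

 // Prints the logical and physical plans produced by Catalyst.
 wordCountDF.explain(true)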

One thing to notice in the Word Count snippet is that this “count()” function is not the same as the “count()” of an RDD. Here, it counts the occurrences of each grouped word, not all words in the whole DataFrame.
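
For instance, here is a minimal sketch (reusing the “wordsDF” from above) that contrasts the two:

 // Per-group count: one row per distinct word, with its frequency.
 wordsDF.groupBy("word").count().show()

 // Total count: the number of rows (i.e., all words) in the whole DataFrame.
 println(wordsDF.count())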

From the above code, we can see how intuitive the DataFrame API of Spark is. We no longer have to chain the “map”, “flatMap” & “reduceByKey” methods to get the Word Count, or have a sound knowledge of Spark RDDs before we start coding in Spark.
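
For comparison, a rough sketch of the classic RDD version of Word Count (reusing the same “sc” as above) would look like this:

 // The traditional RDD way: split lines into words, pair each word
 // with 1, then sum the pairs by key.
 val wordCountRDD = sc.textFile("file.txt")
   .flatMap(line => line.split(" "))
   .map(word => (word, 1))
   .reduceByKey(_ + _)
 wordCountRDD.collect().foreach(println)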

Hope you enjoyed reading this blog post. You can download the code for this example from here.
