Using Spark DataFrames for Word Count

Table of contents

Reading Time: 2 minutes

As we all know that, DataFrame API was introduced in Spark 1.3.0, in March 2015. Its goal was to make distributed processing of “Big Data” more intuitive, by organizing distributed collection of data (known as RDD) into named columns. This enabled both, Engineers & Data Scientists, to use Apache Spark for distributed processing of “Big Data”, with ease.

Also, DataFrame API came with many under the hood optimizations like Spark SQL Catalyst optimizer and recently, in Spark 1.5.0 it got Tungsten enabled in it. This has made Spark DataFrames efficient and faster than ever.

In the following blog post, we will learn “How to use Spark DataFrames for a simple Word Count ?”

The first step is to create a Spark Context & SQL Context on which DataFrames depend.

val sc = new SparkContext(new SparkConf().setAppName("word-count").setMaster("local"))
val sqlContext = new SQLContext(sc)

Now, we can load up a file for which we have to find Word Count. But before that we need to import implicits from “sqlContext” which are necessary to convert a RDD to a DataFrame.

import sqlContext.implicits._

After this, we can find the Word Count of the file using following code snippet.

val linesDF = sc.textFile("file.txt").toDF("line")
val wordsDF = linesDF.explode("line","word")((line: String) => line.split(" "))
val wordCountDF = wordsDF.groupBy("word").count()
wordCountDF.show()

In above code snippet, we need to notice that “count()” function is not same as “count()” of a RDDs. Here, it counts the occurrence of each grouped word, not all words in whole dataframe.

From above code, we can infer that how intuitive is DataFrame API of Spark. Now, we don’t have to use “map”, “flatMap” & “reduceByKey” methods to get the Word Count. Or, need to have sound knowledge of Spark RDD before start coding in Spark.

Hope, you enjoyed reading this blog post. You can download the code for this example from here.