Using Spark DataFrames for Word Count

As we all know that, DataFrame API was introduced in Spark 1.3.0, in March 2015. Its goal was to make distributed processing of “Big Data” more intuitive, by organizing distributed collection of data (known as RDD) into named columns. This enabled both, Engineers & Data Scientists, to use Apache Spark for distributed processing of “Big Data”, with ease.

Also, DataFrame API came with many under the hood optimizations like Spark SQL Catalyst optimizer and recently, in Spark 1.5.0 it got Tungsten enabled in it. This has made Spark DataFrames efficient and faster than ever.

In the following blog post, we will learn “How to use Spark DataFrames for a simple Word Count ?”

The first step is to create a Spark Context & SQL Context on which DataFrames depend.

Now, we can load up a file for which we have to find Word Count. But before that we need to import implicits from “sqlContext” which are necessary to convert a RDD to a DataFrame.

After this, we can find the Word Count of the file using following code snippet.

In above code snippet, we need to notice that “count()” function is not same as “count()” of a RDDs. Here, it counts the occurrence of each grouped word, not all words in whole dataframe.

From above code, we can infer that how intuitive is DataFrame API of Spark. Now, we don’t have to use “map”, “flatMap” & “reduceByKey” methods to get the Word Count. Or, need to have sound knowledge of Spark RDD before start coding in Spark.

Hope, you enjoyed reading this blog post. You can download the code for this example from here.

Written by 

Himanshu Gupta is a lead consultant having more than 4 years of experience. He is always keen to learn new technologies. He not only likes programming languages but Data Analytics too. He has sound knowledge of "Machine Learning" and "Pattern Recognition".He believes that best result comes when everyone works as a team. He likes listening to Coding ,music, watch movies, and read science fiction books in his free time.

6 thoughts on “Using Spark DataFrames for Word Count

Leave a Reply

%d bloggers like this: