As we all know that, DataFrame API was introduced in Spark 1.3.0, in March 2015. Its goal was to make distributed processing of “Big Data” more intuitive, by organizing distributed collection of data (known as RDD) into named columns. This enabled both, Engineers & Data Scientists, to use Apache Spark for distributed processing of “Big Data”, with ease.
Also, DataFrame API came with many under the hood optimizations like Spark SQL Catalyst optimizer and recently, in Spark 1.5.0 it got Tungsten enabled in it. This has made Spark DataFrames efficient and faster than ever.
In the following blog post, we will learn “How to use Spark DataFrames for a simple Word Count ?”
The first step is to create a Spark Context & SQL Context on which DataFrames depend.
Now, we can load up a file for which we have to find Word Count. But before that we need to import implicits from “sqlContext” which are necessary to convert a RDD to a DataFrame.
After this, we can find the Word Count of the file using following code snippet.
In above code snippet, we need to notice that “count()” function is not same as “count()” of a RDDs. Here, it counts the occurrence of each grouped word, not all words in whole dataframe.
From above code, we can infer that how intuitive is DataFrame API of Spark. Now, we don’t have to use “map”, “flatMap” & “reduceByKey” methods to get the Word Count. Or, need to have sound knowledge of Spark RDD before start coding in Spark.
Hope, you enjoyed reading this blog post. You can download the code for this example from here.
7 thoughts on “Using Spark DataFrames for Word Count2 min read”
Reblogged this on himanshu2014.
Reblogged this on sandeepknol.
I am regular reader, how are you everybody?This piece of writing posted at this site is truly good.
how do we find the word count in each line separately in a line
If you want the tokens to be for each line in a line you can you se the pre implemented tokenizer:
https://spark.apache.org/docs/2.2.0/ml-features.html (tokenizer part)
Comments are closed.