Simple WordCount Program In Spark 2.0

In this blog we will write a very basic word count program in Spark 2.0 using IntelliJ and sbt, so let's get started. If you are not familiar with Spark 2.0, you can learn about it from here.

Start IntelliJ and create a new project. First, add the dependencies for Spark 2.0 to your build.sbt:

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.0.0"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.1"
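For reference, a complete minimal build.sbt for this project might look like the sketch below; the project name and Scala version are assumptions and not taken from the original post:

name := "word-count-example"

version := "1.0"

scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.0.0"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.1"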

Let the dependencies resolve. Now add a text file to your resources folder, on which we will apply the word count logic, and add an object to your main source file named Word_Count_Example.

Now perform the following steps:

  • Create a SparkSession from the org.apache.spark.sql.SparkSession API and specify your master and app name.
  • Using the sparkSession.read.text method, read the file wordCount.txt. The return value of this method is a Dataset; in case you don't know what a Dataset is, you can learn about it from this link.
  • Split this Dataset of type String on whitespace and build a map that contains the occurrence count of each word in the Dataset.
  • Create an implicit class PrettyPrintMap for pretty-printing the result to the console.
  • The complete code is given below.
import org.apache.spark.sql.{Dataset, SparkSession}


object Word_Count_Example extends App {

  // Create a SparkSession that runs locally, named after this example
  val sparkSession = SparkSession.builder
    .master("local")
    .appName("Word_Count_Example")
    .getOrCreate()

  // Resolve the project's working directory so that the resource file can be located
  def getCurrentDirectory = new java.io.File(".").getCanonicalPath

  import sparkSession.implicits._

  try {
    // Read the text file as a Dataset[String], one element per line
    val data: Dataset[String] = sparkSession.read.text(getCurrentDirectory + "/src/main/resources/wordCount.txt").as[String]
    // Split each line on whitespace and count the occurrences of every word
    val wordsMap = data.flatMap(value => value.split("\\s+"))
      .collect().toList.groupBy(identity).mapValues(_.size)
    println(wordsMap.prettyPrint)
  } catch {
    case exception: Exception =>
      exception.printStackTrace()
  }

  // Implicit class that renders a Map as one "key -> value" pair per line
  implicit class PrettyPrintMap[K, V](val map: Map[K, V]) {
    def prettyPrint: PrettyPrintMap[K, V] = this

    override def toString: String = {
      val valuesString = toStringLines.mkString("\n")

      "Map (\n" + valuesString + "\n)"
    }
    def toStringLines = {
      map
        .flatMap{ case (k, v) => keyValueToString(k, v)}
        .map(indentLine)
    }

    def keyValueToString(key: K, value: V): Iterable[String] = {
      value match {
        case v: Map[_, _] => Iterable(key + " -> Map (") ++ v.prettyPrint.toStringLines ++ Iterable(")")
        case x => Iterable(key + " -> " + x.toString)
      }
    }

    def indentLine(line: String): String = {
      "\t" + line
    }

  }

}
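As a purely hypothetical example of the output, if wordCount.txt contained the two lines "hello spark" and "hello world", the program would print something like the following (the order of entries depends on the grouping):

Map (
    hello -> 2
    spark -> 1
    world -> 1
)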

In any case, you can clone the complete code from GitHub.

2 thoughts on "Simple WordCount Program In Spark 2.0"

  1. Hey, thanks for the post. I have a simple question: when using sparkSession.read.text(getCurrentDirectory + "/src/main/resources/wordCount.txt").as[String] together with value.split("\\s+").collect().toList.groupBy(identity).mapValues(_.size), is there a way to use a groupByKey function (like there is on an RDD, but not on the SparkSession object)?

  2. Hey, thanks. Simply use val df = spark.read.json("examples/src/main/resources/people.json"); it gives you a DataFrame, and then you can apply groupByKey the same as in Spark 1.6.
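To make the reply above concrete, here is a minimal sketch of how the word count could stay distributed by using Spark's Dataset.groupByKey instead of collecting to the driver; it reuses the sparkSession, getCurrentDirectory, and wordCount.txt from the post above:

import sparkSession.implicits._

// Read the file, split each line on whitespace, then group by the word itself
// and count how many times every word occurs, without collecting to the driver
val wordCounts = sparkSession.read
  .text(getCurrentDirectory + "/src/main/resources/wordCount.txt")
  .as[String]
  .flatMap(_.split("\\s+"))
  .groupByKey(identity)
  .count()

wordCounts.show()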

