Simple WordCount Programme In Spark 2.0

In this blog we will write a very basic word count programme in spark 2.0 using intellij and sbt so lets get started if you are not familiar with spark 2.0 you can learn it from here

start your intellij and create a new project first add the dependency for spark 2.0 in your build.sbt from here

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.0.0"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.1"

let the dependencies to be resolved now add a text file in your resources folder on which we will apply word count logic, add a object in your main file and named it as word_count_example

now you have to perform the given steps

  • create a spark session from org.apache.spark.sql.sparksession api and specify your master and app name
  • using the sparksession.read.txt method read from the file wordcount.txt the return value of this method is a dataset in case you dont know what is data set you can learn from this link
  • split this dataset of type string by white space and create a map which contain the occurence of each word in that data set
  • created a class prettyPrintMap for pretty printing the result to consol
  • given below is the complete code
import java.io.StringWriter

import org.apache.spark.sql.{Dataset, SparkSession}


object Word_Count_Example extends App {

  val sparkSession = SparkSession.builder.
    master("local")
    .appName("Word_Count_Example")
    .getOrCreate()

  val stringWriter = new StringWriter()

  def getCurrentDirectory = new java.io.File(".").getCanonicalPath


  import sparkSession.implicits._

  try {
    val data: Dataset[String] = sparkSession.read.text(getCurrentDirectory + "/src/main/resources/wordCount.txt").as[String]
    val wordsMap = data.flatMap(value => value.split("\\s+")).
collect().toList.groupBy(identity).mapValues(_.size)
    println(wordsMap.prettyPrint)
  }

  catch {
    case exception: Exception => 
println(exception.printStackTrace())
  }

  implicit class PrettyPrintMap[K, V](val map: Map[K, V]) {
    def prettyPrint: PrettyPrintMap[K, V] = this

    override def toString: String = {
      val valuesString = toStringLines.mkString("\n")

      "Map (\n" + valuesString + "\n)"
    }
    def toStringLines = {
      map
        .flatMap{ case (k, v) => keyValueToString(k, v)}
        .map(indentLine)
    }

    def keyValueToString(key: K, value: V): Iterable[String] = {
      value match {
        case v: Map[_, _] => Iterable(key + " -> Map (") ++ v.prettyPrint.toStringLines ++ Iterable(")")
        case x => Iterable(key + " -> " + x.toString)
      }
    }

    def indentLine(line: String): String = {
      "\t" + line
    }

  }

}

in any case you can clone the code from github

2 thoughts on “Simple WordCount Programme In Spark 2.0

  1. hey thanks for the post. I have a simple question. Any idea why when using the — “sparkSession.read.text(getCurrentDirectory + “/src/main/resources/wordCount.txt”).as[String] ” value.split(“\\s+”)).collect().toList.groupBy(identity).mapValues(_.size)” —- is there a way to use “groupByKey” function(like there is in “rdd” but not in sparkSession object..

  2. hey thnks,simply use val df = spark.read.json(“examples/src/main/resources/people.json”) it gives you a dataframe then simply apply groupbykey same as in spark 1.6

Leave a Reply

%d bloggers like this: