Simple WordCount Program in Spark 2.0


In this blog we will write a very basic word count program in Spark 2.0 using IntelliJ and sbt, so let's get started. If you are not familiar with Spark 2.0, you can learn about it from here.

Start IntelliJ and create a new project. First, add the dependencies for Spark 2.0 to your build.sbt:

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.0.0"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.1"
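
For reference, a minimal build.sbt might look like the sketch below (the project name and Scala version 2.11.8 are assumptions, not taken from the original setup):

name := "word-count-example"

version := "1.0"

scalaVersion := "2.11.8"  // assumed; the spark-core artifact above is built for Scala 2.11

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.11" % "2.0.0",
  "org.apache.spark" %% "spark-sql" % "2.0.1"
)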

Let the dependencies resolve. Now add a text file to your resources folder, on which we will apply the word count logic, then add an object to your main source file and name it Word_Count_Example.
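
For example, wordCount.txt could contain a few lines of plain text like the hypothetical sample below (any text file will do):

spark is fast
spark is simple
word count in spark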

Now perform the following steps:

  • Create a SparkSession from the org.apache.spark.sql.SparkSession API and specify your master and app name.
  • Read the file wordCount.txt using the sparkSession.read.text method. The return value of this method is a Dataset; if you don't know what a Dataset is, you can learn about it from this link.
  • Split this Dataset of type String on whitespace and build a map that contains the number of occurrences of each word in the Dataset.
  • Create an implicit class PrettyPrintMap for pretty printing the result to the console.
  • The complete code is given below.

import org.apache.spark.sql.{Dataset, SparkSession}


object Word_Count_Example extends App {

  // Create a SparkSession and specify the master and application name
  val sparkSession = SparkSession.builder
    .master("local")
    .appName("Word_Count_Example")
    .getOrCreate()

  // Resolve the project root so the input file can be located with a relative path
  def getCurrentDirectory = new java.io.File(".").getCanonicalPath


  import sparkSession.implicits._

  try {
    // Read the text file as a Dataset[String]; each element is one line of the file
    val data: Dataset[String] = sparkSession.read
      .text(getCurrentDirectory + "/src/main/resources/wordCount.txt")
      .as[String]

    // Split each line on whitespace and count the occurrences of each word
    val wordsMap = data.flatMap(value => value.split("\\s+"))
      .collect().toList.groupBy(identity).mapValues(_.size)

    println(wordsMap.prettyPrint)
  } catch {
    case exception: Exception => exception.printStackTrace()
  } finally {
    sparkSession.stop()
  }

  // Implicit wrapper that pretty prints a Map, one key/value pair per line
  implicit class PrettyPrintMap[K, V](val map: Map[K, V]) {
    def prettyPrint: PrettyPrintMap[K, V] = this

    override def toString: String = {
      val valuesString = toStringLines.mkString("\n")

      "Map (\n" + valuesString + "\n)"
    }
    def toStringLines = {
      map
        .flatMap{ case (k, v) => keyValueToString(k, v)}
        .map(indentLine)
    }

    def keyValueToString(key: K, value: V): Iterable[String] = {
      value match {
        case v: Map[_, _] => Iterable(key + " -> Map (") ++ v.prettyPrint.toStringLines ++ Iterable(")")
        case x => Iterable(key + " -> " + x.toString)
      }
    }

    def indentLine(line: String): String = {
      "\t" + line
    }

  }

}
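
With an input file like the hypothetical sample above, the pretty-printed output on the console would look roughly like this (the exact word order may differ, since Scala maps are unordered):

Map (
	spark -> 3
	is -> 2
	fast -> 1
	simple -> 1
	word -> 1
	count -> 1
	in -> 1
)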

In any case, you can clone the complete code from GitHub.


2 Responses to Simple WordCount Program in Spark 2.0

  1. Mudit says:

    Hey, thanks for the post. I have a simple question: after reading with sparkSession.read.text(getCurrentDirectory + "/src/main/resources/wordCount.txt").as[String] and counting with value.split("\\s+")).collect().toList.groupBy(identity).mapValues(_.size), is there a way to use a groupByKey function (like there is on an RDD, but not on the SparkSession object)?

  2. Anubhavtarar says:

    Hey, thanks. Simply use val df = spark.read.json("examples/src/main/resources/people.json"); it gives you a DataFrame, and then you can apply groupByKey the same way as in Spark 1.6.
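
For this word count specifically, the grouping can also stay on the Dataset itself via groupByKey instead of collecting to a local list first. The sketch below is only an illustration under the same assumed wordCount.txt path; it is not code from the original post:

import org.apache.spark.sql.SparkSession

object Word_Count_GroupByKey extends App {

  val spark = SparkSession.builder
    .master("local")
    .appName("Word_Count_GroupByKey")
    .getOrCreate()

  import spark.implicits._

  // Read the lines, split them into words, then group and count on the Dataset itself
  val counts = spark.read
    .text("src/main/resources/wordCount.txt")
    .as[String]
    .flatMap(_.split("\\s+"))
    .groupByKey(identity) // KeyValueGroupedDataset[String, String]
    .count()              // Dataset[(String, Long)]

  counts.collect().foreach { case (word, n) => println(word + " -> " + n) }

  spark.stop()
}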
