MapReduce vs Spark

What is MapReduce in big data:

MapReduce is a programming model for processing large data sets in parallel across a cluster of computers, and it is a key technology for handling big data. The model consists of two key functions: Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Reduce takes the output of Map as its input and aggregates the tuples into a smaller set of tuples. Together, these two functions allow large amounts of data to be processed efficiently by dividing the work into smaller, more manageable chunks.
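
To make the shape of these two functions concrete before the full example below, here is a minimal sketch using plain Scala collections rather than a cluster. The object name MapReduceSketch and the sample records are made up purely for illustration; a real MapReduce framework distributes and shuffles the data for you, but the Map, group, and Reduce steps have the same structure:

// A minimal local sketch of the Map and Reduce phases on plain Scala
// collections (no cluster involved); the records here are made up.
object MapReduceSketch extends App {
  // Hypothetical input: one "product,amount" record per line.
  val lines = List("apples,3", "oranges,5", "apples,2", "pears,4")

  // Map: turn each input record into a (key, value) tuple.
  val mapped: List[(String, Int)] = lines.map { line =>
    val Array(product, amount) = line.split(",")
    (product, amount.toInt)
  }

  // Shuffle/group: collect all values that share the same key
  // (a real framework does this step for you between the two phases).
  val grouped: Map[String, List[Int]] =
    mapped.groupBy(_._1).map { case (product, pairs) => (product, pairs.map(_._2)) }

  // Reduce: aggregate each key's values into a smaller set of tuples.
  val totals: Map[String, Int] = grouped.map { case (product, amounts) => (product, amounts.sum) }

  totals.foreach(println) // (apples,5), (oranges,5), (pears,4) in some order
}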

Is there any point in learning MapReduce, then?

Definitely. Learning MapReduce is worth it if you’re interested in big data processing or work in data-intensive fields. MapReduce is a fundamental concept that gives you a basic understanding of how to process and analyze large data sets in a distributed environment. Its principles still play a crucial role in modern big data processing frameworks such as Apache Hadoop and Apache Spark, so understanding MapReduce provides a solid foundation for learning these technologies. Also, many organizations still use MapReduce to process large data sets, making it a valuable skill to have in the job market.

Example:

Let’s understand this with a simple example:

Imagine we have a large dataset of words and we want to count the frequency of each word. Here’s how we could do it in MapReduce:

Map:

  • The map function takes each line of the input dataset and splits it into words.
  • For each word, the map function outputs a tuple (word, 1) indicating that the word has been found once.

Reduce:

  • The reduce function takes all the tuples with the same word and adds up the values (counts) for each word.
  • The reduce function outputs a tuple (word, count) for each unique word in the input dataset.
Here’s how this word count looks as a Hadoop MapReduce job written in Scala:

import java.util.StringTokenizer

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

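// Mapper: splits each input line into tokens and emits a (word, 1) pair per token.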
class TokenizerMapper extends Mapper[Object, Text, Text, IntWritable] {
  val one = new IntWritable(1)
  val word = new Text()

  override def map(key: Object, value: Text, context: Mapper[Object, Text, Text, IntWritable]#Context): Unit = {
    val itr = new StringTokenizer(value.toString)
    while (itr.hasMoreTokens) {
      word.set(itr.nextToken)
      context.write(word, one)
    }
  }
}

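// Reducer (also used as the combiner): sums the counts emitted for each word.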
class IntSumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  val result = new IntWritable

  override def reduce(key: Text, values: java.lang.Iterable[IntWritable], context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val valuesIter = values.iterator
    while (valuesIter.hasNext) {
      sum += valuesIter.next.get
    }
    result.set(sum)
    context.write(key, result)
  }
}

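// Driver: configures the MapReduce job and submits it to the cluster.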
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration
    val job = Job.getInstance(conf, "word count")
    job.setJarByClass(this.getClass)
    job.setMapperClass(classOf[TokenizerMapper])
    job.setCombinerClass(classOf[IntSumReducer])
    job.setReducerClass(classOf[IntSumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}

This code defines a MapReduce job that splits each line of the input into words using the TokenizerMapper class, maps each word to a tuple (word, 1), and then reduces the tuples to count the frequency of each word using the IntSumReducer class. Because addition is associative and commutative, the same IntSumReducer is also set as the combiner, so partial sums are computed on the map side before the shuffle. The job is configured through a Job object, the input and output paths are specified with FileInputFormat and FileOutputFormat, and the job is executed by calling waitForCompletion.

And here’s how you could perform the same operation in Apache Spark:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    val textFile = sc.textFile("<input_file>.txt")
    val counts = textFile.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.foreach(println)
    sc.stop()
  }
}

This code sets up a SparkConf and SparkContext, reads in the input data using textFile, splits each line into words using flatMap, maps each word to a tuple (word, 1) using map, and reduces the tuples to count the frequency of each word using reduceByKey. The result is then printed using foreach.
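
As a side note, newer Spark versions (2.x and later) usually go through SparkSession rather than creating a SparkContext directly. Below is a hedged sketch of the same word count written that way, saving the result with saveAsTextFile instead of printing it; the object name WordCountSession and the <input_file>/<output_dir> paths are placeholders, not part of the original example:

import org.apache.spark.sql.SparkSession

object WordCountSession {
  def main(args: Array[String]): Unit = {
    // SparkSession is the unified entry point introduced in Spark 2.x.
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("<input_file>.txt")   // read the input as an RDD of lines
      .flatMap(_.split(" "))                       // split each line into words
      .map(word => (word, 1))                      // emit a (word, 1) tuple per word
      .reduceByKey(_ + _)                          // sum the counts for each word

    counts.saveAsTextFile("<output_dir>")          // write the (word, count) pairs to storage
    spark.stop()
  }
}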

Conclusion:

MapReduce is a programming paradigm for processing large datasets in a distributed environment. The MapReduce process consists of two main phases: the map phase and the reduce phase. In the map phase, data is transformed into intermediate key-value pairs; in the reduce phase, the intermediate results are aggregated to produce the final output. Spark is a popular alternative to MapReduce that provides a high-level API and in-memory processing, which can make big data processing faster and easier. Whether to choose MapReduce or Spark depends on the specific needs of the task and the resources available.
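
To make the in-memory point concrete, here is a small, hypothetical sketch in which one cached RDD feeds two actions, so the second action reuses the in-memory data instead of re-reading the input. The object name CachingSketch and the file path are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch of Spark's in-memory reuse: one cached RDD serves two actions.
object CachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CachingSketch"))

    // cache() asks Spark to keep this RDD in memory after it is first computed.
    val words = sc.textFile("<input_file>.txt").flatMap(_.split(" ")).cache()

    val totalWords    = words.count()             // first action: reads the input and populates the cache
    val distinctWords = words.distinct().count()  // second action: starts from the cached RDD

    println(s"total=$totalWords distinct=$distinctWords")
    sc.stop()
  }
}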

Written by 

Rituraj Khare is a Software Consultant at Knoldus Software LLP. An avid Scala programmer and Big Data engineer, he has experience with technologies such as Scala, Spark, Kafka, Python, unit testing, Git, Jenkins, and Grafana.