Both technologies offer compelling features, but with the growing need for real-time analytics, the two are in close competition with each other.
What are MapReduce and Spark?
MapReduce is a programming model for processing huge amounts of data in a parallel, distributed fashion. The model involves two tasks, Map and Reduce: a map function processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function merges all intermediate values associated with the same intermediate key. MapReduce is used by Hadoop.
Example: Consider the problem of counting the number of occurrences of each word in a large collection of documents.
The map function emits each word plus an associated count of occurrences (just '1' in this simple example). The reduce function sums together all counts emitted for a particular word.
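The word-count example above can be sketched in plain Python. This is a single-machine illustration of the model, not Hadoop's actual API: `map_fn`, `reduce_fn`, and the in-memory shuffle are stand-ins for the distributed map, shuffle, and reduce phases.

```python
from collections import defaultdict

def map_fn(document):
    # Map phase: emit (word, 1) for every word occurrence.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce phase: sum all counts emitted for a particular word.
    return (word, sum(counts))

def map_reduce(documents):
    # Shuffle step: group intermediate values by their intermediate key.
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_fn(doc):
            grouped[word].append(count)
    return dict(reduce_fn(w, c) for w, c in grouped.items())

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(map_reduce(docs))  # {'the': 3, 'quick': 1, ..., 'fox': 2, ...}
```

In the real framework the grouped intermediate pairs would be partitioned across many reducer nodes rather than collected into one dictionary.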
Spark is a rapidly growing open-source, scalable, massively parallel, in-memory execution environment for running analytics applications across a cluster of compute nodes. Speed is one of the hallmarks of Apache Spark. Spark achieves its speed by using an in-memory layer that sits above multiple data stores, where data can be loaded into memory and analyzed in parallel across a cluster. Much like MapReduce, Spark distributes data across a cluster and processes that data in parallel. The difference is that, unlike MapReduce, which shuffles files around on disk, Spark works in memory, making it much faster at processing data than MapReduce.
Spark consists of a number of components:
Spark Core: The foundation of Spark that provides distributed task dispatching, scheduling and basic I/O
Spark Streaming: Analysis of real-time streaming data
Spark Machine Learning Library (MLlib): A library of prebuilt analytics algorithms that can run in parallel across a Spark cluster on data loaded into memory
Spark SQL + DataFrames: Spark SQL enables querying structured data from inside Java-, Python-, R- and Scala-based Spark analytics applications using either SQL or the DataFrames distributed data collection
GraphX: A graph analysis engine and set of graph analytics algorithms running on Spark
SparkR: The R programming language on Spark for executing custom analytics
What makes MapReduce fall behind in the race?
One of the main limitations of MapReduce is that it persists the full dataset to HDFS after running each job. This is very expensive because it incurs disk I/O equal to three times the size of the dataset (due to replication) and a similar amount of network I/O. Spark takes a more holistic view of a pipeline of operations: when the output of one operation needs to be fed into another, Spark passes the data directly without writing it to persistent storage, as you can see from the above figure.
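The difference can be sketched in plain Python. Here the temporary JSON file stands in for an HDFS write between MapReduce jobs (which would additionally be replicated three times), while the chained expression stands in for Spark passing data between operations directly in memory. The function names are illustrative, not any framework's API.

```python
import json
import os
import tempfile

data = list(range(10))

# MapReduce-style: persist the full intermediate dataset after each stage.
def run_stage_with_persistence(records, fn):
    out = [fn(r) for r in records]
    fd, path = tempfile.mkstemp()
    os.close(fd)
    with open(path, "w") as f:   # stand-in for writing to HDFS (x3 replication)
        json.dump(out, f)
    with open(path) as f:        # the next job must re-read the data from disk
        result = json.load(f)
    os.remove(path)
    return result

squared = run_stage_with_persistence(data, lambda x: x * x)
shifted = run_stage_with_persistence(squared, lambda x: x + 1)

# Spark-style: pipeline the operations, passing data directly in memory.
pipelined = [x * x + 1 for x in data]

assert shifted == pipelined  # same result, without the round-trips to disk
```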
How does Spark have an edge over MapReduce?
The main innovation of Spark was to introduce an in-memory caching abstraction. This makes Spark ideal for workloads where multiple operations access the same input data: users can instruct Spark to cache input datasets in memory, so they don't need to be read from disk for each operation. Spark also launches tasks much faster. MapReduce starts a new Java Virtual Machine (JVM) for each task, while Spark keeps an executor JVM running on each node, so launching a task is just a matter of making a Remote Procedure Call, which is extremely fast. In addition, Spark builds a Directed Acyclic Graph (DAG) of operations, which lets it optimize and fuse computation into fewer stages than the rigid two-stage MapReduce model. The core processing abstraction of Spark is the RDD, or Resilient Distributed Dataset: an immutable, distributed collection of objects, split into logical partitions so the data can be computed on different nodes.
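The key RDD ideas, lazy transformations recorded as a lineage, fused execution, and in-memory caching, can be sketched with a toy single-machine class. This is not Spark's actual API; `ToyRDD`, `load`, and the methods below are illustrative stand-ins, and the `reads` counter simulates how often the input is read from disk.

```python
class ToyRDD:
    """A tiny, single-machine sketch of the RDD idea: an immutable,
    lazily evaluated dataset that can be cached in memory."""

    def __init__(self, load, transforms=()):
        self.load = load              # callable that simulates reading input from disk
        self.transforms = transforms  # lazy chain of functions (the DAG lineage)
        self._want_cache = False
        self._cached = None

    def map(self, fn):
        # Transformations are lazy: they return a new RDD with an extended
        # lineage and compute nothing yet.
        return ToyRDD(self.load, self.transforms + (fn,))

    def cache(self):
        # Mark this dataset to be kept in memory after first computation.
        self._want_cache = True
        return self

    def collect(self):
        # An action triggers computation; the fused transforms run in one pass.
        if self._cached is not None:
            return self._cached
        data = self.load()
        for fn in self.transforms:
            data = [fn(x) for x in data]
        if self._want_cache:
            self._cached = data
        return data

reads = {"count": 0}

def load():
    reads["count"] += 1  # each call stands in for a read from disk
    return [1, 2, 3]

rdd = ToyRDD(load).map(lambda x: x * 10).cache()
rdd.collect()
rdd.collect()
assert reads["count"] == 1  # cached: the input was read from "disk" only once
```

Without the `cache()` call, every `collect()` would re-run `load`, which mirrors MapReduce re-reading its input from HDFS for each job.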