Deep Dive into MapReduce: Part 1

Reading Time: 5 minutes

Prerequisite: Basic concepts of Hadoop and the Hadoop Distributed File System (HDFS).

Map-Reduce is a programming model and a software framework used for processing enormous amounts of data. A Map-Reduce program works in two stages, namely Map and Reduce. Map tasks deal with the splitting and mapping of the data, while Reduce tasks handle the shuffling and reducing of the data.

  • Map-Reduce is a programming model
    • Neither platform- nor language-specific
    • Record-oriented data processing
    • Operates on key and value pairs
    • Facilitates distributing a task across multiple Nodes
  • Provides Parallel Job processing framework
    • Wherever possible, each node processes the data stored on that node
  • Consists of two phases / functions:
    • Map
    • Reduce
  • Written in Java
  • Every MapReduce program must specify a Mapper and, optionally, a Reducer (a minimal word-count sketch follows this list)
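
As a minimal sketch of what such a program looks like, here is the classic word-count example written against the org.apache.hadoop.mapreduce API (class and variable names are illustrative, not taken from this post):

  // Word count: the Mapper emits (word, 1) pairs, the Reducer sums them per word.
  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class WordCount {

    // Mapper: called once per input record; emits an intermediate (word, 1) pair per word.
    public static class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
          if (!token.isEmpty()) {
            word.set(token);
            context.write(word, ONE);              // intermediate (k2, v2)
          }
        }
      }
    }

    // Reducer: receives each word together with the list of all its counts and sums them.
    public static class SumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
          sum += count.get();
        }
        context.write(word, new IntWritable(sum)); // final (k3, v3)
      }
    }
  }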

Features

  1. Scalability
    Apache Hadoop is a highly scalable framework because it can store and distribute huge data sets across plenty of servers. These servers are inexpensive and can operate in parallel.
  2. Flexibility
    MapReduce programming enables companies to access new sources of data. It enables companies to operate on different types of data. It allows enterprises to access structured as well as unstructured data, and derive significant value by gaining insights from the multiple sources of data.
  3. Security and Authentication
    The MapReduce programming model uses the HBase and HDFS security features, which allow only authenticated users to operate on the data. Thus, it prevents unauthorized access to system data and enhances system security.
  4. Cost-effective solution
    Hadoop’s scalable architecture with the MapReduce programming framework allows the storage and processing of large data sets in a very affordable manner.
  5. Fast
    Hadoop uses a distributed storage method, the Hadoop Distributed File System (HDFS), which essentially implements a mapping system for locating data in a cluster.
  6. Simple model of programming
    Amongst the various features of Hadoop MapReduce, one of the most important is that it is based on a simple programming model. This allows programmers to develop MapReduce programs that can handle tasks easily and efficiently.
  7. Parallel Programming
    One of the major aspects of MapReduce programming is its parallel processing. It divides a job into tasks that can be executed in parallel, so multiple processors work on the divided tasks at once and the entire program runs in less time.
  8. Availability and resilience
    Whenever data is sent to an individual node, the same set of data is also forwarded to other nodes in the cluster. So, if any particular node suffers a failure, there are always other copies present on other nodes that can still be accessed whenever needed. This assures high availability of data.
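
This replication is actually provided by HDFS, on which MapReduce relies, and the replication factor is configurable. A small illustrative sketch (the property name dfs.replication and the factor of 3 are standard HDFS settings; the file path is invented):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      conf.setInt("dfs.replication", 3);  // keep 3 copies of every block (the HDFS default)

      // Explicitly set the replication factor of one existing file.
      FileSystem fs = FileSystem.get(conf);
      fs.setReplication(new Path("/data/input/sample.txt"), (short) 3);
    }
  }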

Map-Reduce Key and Important Features

  • MapReduce abstracts all the ‘housekeeping’ away from the developer
    • Developer can concentrate simply on writing the Map and Reduce functions
  • Close integration with HDFS
  • Automatic partitioning of a job into sub-tasks
  • Automatic parallelisation and distribution
  • Fault-tolerance & Auto retry on failures
  • Status and monitoring tools
  • Provides pluggable APIs and configuration mechanisms for writing applications (see the driver sketch after this list)
    • Map and Reduce functions
    • Input formats and splits
    • Number of tasks, data types, etc…
  • Linear Scalability
  • Locality of task execution
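
As an illustration of these pluggable pieces, a driver class might wire a job together as follows. This is a sketch only; it assumes the hypothetical WordCount.TokenizerMapper and WordCount.SumReducer classes from the earlier example:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

  public class WordCountDriver {
    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count");
      job.setJarByClass(WordCountDriver.class);

      // Pluggable Map and Reduce functions.
      job.setMapperClass(WordCount.TokenizerMapper.class);
      job.setReducerClass(WordCount.SumReducer.class);

      // Pluggable input/output formats; the input paths determine the splits.
      job.setInputFormatClass(TextInputFormat.class);
      job.setOutputFormatClass(TextOutputFormat.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));

      // Pluggable data types and number of Reduce tasks.
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      job.setNumReduceTasks(2);

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }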

Map-Reduce Architecture with Hadoop: Key Concepts

  • A job is a ‘full program’
    • A complete execution of Mappers and Reducers over a dataset
  • JobTracker
    • Resides on NameNode /Master
    • Accepts MR jobs
    • Assigns tasks to slaves
    • Monitors tasks
    • Handles failures
  • TaskTracker
    • Resides on DataNode / Slave
    • Runs Map / Reduce tasks
    • Manages intermediate output
  • Task
    • A task is the execution of a single Mapper or Reducer over a slice of data
    • Reports progress
  • Mapper
    • Processes a single input split from HDFS, often a single HDFS block
    • Runs on a node which holds its portion of the data locally, to avoid network traffic
    • Multiple Mappers run in parallel, each processing a portion of the input data
    • Works on an individual record at a time
    • Each record has a key and a value
    • Intermediate data is written by the Mapper to local disk
  • Shuffle & Sort
    • Once the Map tasks have finished, data is then transferred across the network to the Reducers
    • All the values associated with the same intermediate key are transferred to the same Reducer
    • Developer can specify the number of Reducers
  • Reducer
    • The Reducer is passed each key and a list of all its values
    • Keys are passed in sorted order
    • Aggregates results from the Mappers
    • Intermediate keys produced by the Mapper are the keys on which the aggregation will be based
    • Usually emits a single key/value pair for each input key
    • Output from the Reducers is written to HDFS (a short worked trace of this data flow follows)
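
To make this flow concrete, here is a small worked word-count trace (the two input lines are invented for illustration):

  Input records (byte offset, line):
    (0,  "to be or not")
    (13, "to be")

  Mapper output, written to local disk (intermediate key/value pairs):
    (to,1) (be,1) (or,1) (not,1)    from the first record
    (to,1) (be,1)                   from the second record

  Shuffle & Sort (all values for the same key reach the same Reducer, keys in sorted order):
    be  -> [1, 1]
    not -> [1]
    or  -> [1]
    to  -> [1, 1]

  Reducer output, written to HDFS:
    (be,2) (not,1) (or,1) (to,2)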

How it works?

  • Map function
    • Takes a key/value pair and generates a set of intermediate key/value pairs
    • map(k1,v1)->list(k2,v2)

  • Reduce function
    • Takes an intermediate key and the list of values associated with that key, and merges them into a (usually smaller) set of output values
    • reduce(k2,list(v2))->list(k3,v3)
  • A client submits a job to the JobTracker (a brief client-side sketch follows this list)
    • JobTracker assigns a job ID
    • Client calculates the input splits for the job
    • Client adds job code and configuration to HDFS
  • The JobTracker creates a Map task for each input split
    • TaskTrackers send periodic “heartbeats” to the JobTracker
    • These heartbeats also signal readiness to run tasks
    • JobTracker then assigns tasks to these TaskTrackers
  • The TaskTracker then forks a new JVM to run the task
    • This isolates the TaskTracker from bugs or faulty code
    • A single instance of task execution is called a task attempt
    • Status info periodically sent back to JobTracker
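
As a brief client-side sketch of this flow, the lines below could replace the final waitForCompletion call in the hypothetical driver shown earlier; most of the steps above are triggered inside submit():

  // submit() obtains a new job ID, computes the input splits, copies the job
  // jar and configuration to HDFS, and hands the job to the cluster.
  job.submit();
  System.out.println("Submitted job: " + job.getJobID());

  // Poll and print status reported back through the periodic heartbeats.
  boolean success = job.monitorAndPrintJob();
  System.exit(success ? 0 : 1);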

For more info, visit: Apache Hadoop

And for more interesting blogs, visit: Knoldus Blog
