Prerequisite: Basic concepts of Hadoop and the Hadoop Distributed File System (HDFS).
Map-Reduce is a programming model and a software framework used for processing enormous amounts of data. A Map-Reduce program works in two phases, namely Map and Reduce. Map tasks deal with the splitting and mapping of the data, while Reduce tasks shuffle and reduce the data.
Map-Reduce is a programming model
Neither platform- nor language-specific
Record-oriented data processing
Operates on key and value pairs
Facilitates distributing a task across multiple Nodes
Provides Parallel Job processing framework
Wherever possible, each node processes data stored on that node
Consists of two phases / functions:
Map
Reduce
Written in Java
Every Map/Reduce program must specify a Mapper and optionally a Reducer
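As a small illustration of how both phases operate purely on key/value pairs, consider a word count over one line of text (an illustrative example, not taken from the notes above):
    Map input:     (0, "the cat sat on the mat")   -- key = byte offset of the line, value = the line itself
    Map output:    ("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1)
    Reduce input:  ("the", [1, 1])                  -- after shuffle and sort, all values for one key are grouped
    Reduce output: ("the", 2)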
Features
Scalability: Apache Hadoop is a highly scalable framework because it can store and distribute very large data sets across many servers. These servers are inexpensive and can operate in parallel.
Flexibility: MapReduce programming enables companies to access new sources of data and to operate on different types of data. It allows enterprises to work with structured as well as unstructured data and to derive significant value by gaining insights from multiple data sources.
Security and Authentication: The MapReduce programming model uses the HBase and HDFS security platforms, which allow only authenticated users to operate on the data. This protects system data from unauthorized access and enhances system security.
Cost-effective solution: Hadoop's scalable architecture combined with the MapReduce programming framework allows large data sets to be stored and processed very affordably.
Fast: Hadoop uses a distributed storage method called the Hadoop Distributed File System (HDFS), which essentially implements a mapping system for locating data in a cluster.
Simple model of programming: Among the various features of Hadoop MapReduce, one of the most important is its simple programming model. This allows programmers to develop MapReduce programs that handle tasks easily and efficiently.
Parallel Programming: A major aspect of MapReduce is its parallel processing. It divides tasks in a manner that allows them to be executed in parallel, so multiple processors can work on the divided tasks and the entire program runs in less time.
Availability and resilient nature: Whenever data is sent to an individual node, the same data is also replicated to other nodes in the cluster. So if a particular node fails, copies on other nodes can still be accessed whenever needed. This assures high availability of data.
Map-Reduce Key and Important Features
MapReduce abstracts all the 'housekeeping' away from the developer
Developer can concentrate simply on writing the Map and Reduce functions
Close integration with HDFS
Automatic partitioning of the job into sub-tasks
Automatic parallelisation and distribution
Fault-tolerance & Auto retry on failures
Status and monitoring tools
Provides pluggable APIs and configuration mechanisms for writing applications
Map and Reduce functions
Input formats and splits
Number of tasks, data types, etc…
Linear Scalability
Locality of task execution
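As a sketch of how these pluggable pieces are wired together, the snippet below uses the standard Hadoop Job API to set the Map and Reduce functions, the input format, the number of reduce tasks, the output data types, and the input/output paths. The class names WordCountMapper and WordCountReducer refer to the word-count sketches given later in these notes, and the paths "/in" and "/out" are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative job configuration: each call plugs in one of the pieces
// listed above (Map/Reduce functions, input format, task count, data types).
public class JobConfigSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(JobConfigSketch.class);

    // Map and Reduce functions (illustrative classes, sketched later in these notes).
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);

    // Input format controls how the input is split and turned into records.
    job.setInputFormatClass(TextInputFormat.class);

    // Number of tasks: the reduce count is set explicitly; the number of
    // map tasks follows from the number of input splits.
    job.setNumReduceTasks(4);

    // Data types of the output key/value pairs.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path("/in"));     // placeholder path
    FileOutputFormat.setOutputPath(job, new Path("/out"));  // placeholder path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}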
Map-Reduce Architecture with Hadoop: Key Concepts
A job is a 'full program'
A complete execution of Mappers and Reducers over a dataset
JobTracker
Resides on the NameNode / Master
Accepts MR jobs
Assigns tasks to slaves
Monitors tasks
Handles failures
TaskTracker
Resides on DataNode / Slave
Runs Map and Reduce tasks
Manages intermediate output
Task
A task is the execution of a single Mapper or Reducer over a slice of data
Reports progress
Mapper
Processes a single input split from HDFS, often a single HDFS block
Runs on nodes which hold their portion of the data locally, to avoid network traffic
Multiple Mappers run in parallel, each processing a portion of the input data
Works on an individual record at a time
Each record has a key and a value
Intermediate data is written by the Mapper to local disk
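To make this concrete, here is a sketch of a word-count Mapper written against the standard Hadoop Java API; the class name WordCountMapper is illustrative. It receives one record at a time as a key/value pair: the key is the byte offset of the line within the input split, and the value is the line itself.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative word-count Mapper: input key = byte offset of the line,
// input value = the line of text; output key = a word, output value = 1.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Process a single record at a time: split the line into words
    // and emit an intermediate (word, 1) pair for each one.
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);   // intermediate output, written to local disk
    }
  }
}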
Shuffle & Sort
Once the Map tasks have finished, data is then transferred across the network to the Reducers
All the values associated with the same intermediate key are transferred to the same Reducer
Developer can specify the number of Reducers
Reducer
Reducer is passed each key and a list of all its values
Keys are passed in sorted order
Aggregates results from the Mappers
Intermediate keys produced by the Mapper are the keys on which the aggregation will be based
Usually emits a single key/value pair for each input key
Output from the Reducers is written to HDFS
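Continuing the word-count sketch, a matching Reducer (again using the standard Hadoop Java API; the class name is illustrative) receives each word together with the list of all the 1s emitted for it, in sorted key order, and writes one aggregated pair per key to HDFS.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative word-count Reducer: receives (word, [1, 1, ...]) after the
// shuffle and sort, and emits a single (word, total) pair per key.
public class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();         // aggregate all values for this key
    }
    total.set(sum);
    context.write(key, total);    // final output, written to HDFS
  }
}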
How it works?
Map function
Takes a key/value pair and generates a set of intermediate key/value pairs
map(k1, v1) -> list(k2, v2)
Reduce function
Takes an intermediate key and the list of all values associated with that key, and generates a set of output key/value pairs
reduce(k2, list(v2)) -> list(k3, v3)
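To make these signatures concrete, here is a small, self-contained Java sketch that simulates the data flow locally for a word count. This is plain Java for illustration only, not Hadoop API code, and all names are illustrative: k1 is a line number, v1 a line of text, k2 a word, v2 the count 1, and (k3, v3) the word with its total.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.AbstractMap.SimpleEntry;

// Local simulation of the model: map(k1, v1) -> list(k2, v2), then group by k2
// (the shuffle & sort), then reduce(k2, list(v2)) -> list(k3, v3).
public class MapReduceSimulation {

  // map(k1, v1): k1 = line number, v1 = line of text; emits (word, 1) pairs.
  static List<SimpleEntry<String, Integer>> map(int lineNo, String line) {
    List<SimpleEntry<String, Integer>> out = new ArrayList<>();
    for (String word : line.split("\\s+")) {
      if (!word.isEmpty()) {
        out.add(new SimpleEntry<>(word, 1));
      }
    }
    return out;
  }

  // reduce(k2, list(v2)): k2 = word, values = all 1s emitted for it; emits (word, total).
  static SimpleEntry<String, Integer> reduce(String word, List<Integer> values) {
    int sum = 0;
    for (int v : values) {
      sum += v;
    }
    return new SimpleEntry<>(word, sum);
  }

  public static void main(String[] args) {
    String[] lines = {"the cat sat on the mat", "the dog sat"};

    // Shuffle & sort: group every intermediate (k2, v2) pair by key, in sorted key order.
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (int i = 0; i < lines.length; i++) {
      for (SimpleEntry<String, Integer> pair : map(i, lines[i])) {
        grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
      }
    }

    // Reduce phase: one call per distinct key.
    for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
      SimpleEntry<String, Integer> result = reduce(entry.getKey(), entry.getValue());
      System.out.println(result.getKey() + "\t" + result.getValue());
    }
  }
}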
A client submits a job to the JobTracker
JobTracker assigns a job ID
Client calculates the input splits for the job
Client adds job code and configuration to HDFS
The JobTracker creates a Map task for each input split
TaskTrackers send periodic "heartbeats" to the JobTracker
These heartbeats also signal readiness to run tasks
JobTracker then assigns tasks to these TaskTrackers
The TaskTracker then forks a new JVM to run the task
This isolates the TaskTracker from bugs or faulty code
A single instance of task execution is called a task attempt
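The sketch below shows the client side of this flow using the standard Hadoop Job API: the job is configured, submitted, assigned an ID, and then polled for progress until it completes. It assumes the WordCountMapper and WordCountReducer classes sketched earlier; the paths and names are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Client-side view of job submission and monitoring (illustrative names/paths).
public class SubmitSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(SubmitSketch.class);            // job code is shipped with the job
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/in"));    // splits are computed from this input
    FileOutputFormat.setOutputPath(job, new Path("/out"));

    job.submit();                                      // client submits; the framework assigns a job ID
    System.out.println("Submitted " + job.getJobID());

    while (!job.isComplete()) {                        // poll the task progress reported back to the client
      System.out.printf("map %.0f%%  reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(5000);
    }
    System.exit(job.isSuccessful() ? 0 : 1);
  }
}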