Deep Dive into MapReduce: Part 1

Reading Time: 5 minutes

Prerequisite: Basic concepts of Hadoop and the Hadoop Distributed File System (HDFS).

Map-Reduce is a programming model and a software framework used for processing enormous amounts of data. A Map-Reduce program works in two stages, namely Map and Reduce. Map tasks deal with the splitting and mapping of the data, while Reduce tasks handle the shuffling and reducing of the data.

  • Map-Reduce is a programming model
    • Neither platform- nor language-specific
    • Record-oriented data processing
    • Operates on key and value pairs
    • Facilitates distributing a task across multiple Nodes
  • Provides Parallel Job processing framework
    • Wherever possible, each node processes the data stored on that node
  • Consists of two phases / functions:
    • Map
    • Reduce
  • Written in Java
  • Every MapReduce program must specify a Mapper and, optionally, a Reducer (a minimal word-count sketch follows this list)
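
As a minimal sketch of what such a program looks like, here is the classic word-count example written against the org.apache.hadoop.mapreduce API (class and variable names are illustrative, not taken from this post):

  // Word count: the Mapper emits (word, 1) pairs, the Reducer sums them per word.
  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class WordCount {

    // Mapper: called once per input record; emits an intermediate (word, 1) pair per word.
    public static class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
          if (!token.isEmpty()) {
            word.set(token);
            context.write(word, ONE);              // intermediate (k2, v2)
          }
        }
      }
    }

    // Reducer: receives each word together with the list of all its counts and sums them.
    public static class SumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
          sum += count.get();
        }
        context.write(word, new IntWritable(sum)); // final (k3, v3)
      }
    }
  }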

Features

  1. Scalability
    Apache Hadoop is a highly scalable framework because it can store and distribute huge data sets across plenty of servers. These servers are inexpensive and can operate in parallel.
  2. Flexibility
    MapReduce programming enables companies to access new sources of data. It enables companies to operate on different types of data. It allows enterprises to access structured as well as unstructured data, and derive significant value by gaining insights from the multiple sources of data.
  3. Security and Authentication
    The MapReduce programming model uses the HBase and HDFS security features, which allow only authenticated users to operate on the data. Thus, it prevents unauthorized access to system data and enhances system security.
  4. Cost-effective solution
    Hadoop’s scalable architecture with the MapReduce programming framework allows the storage and processing of large data sets in a very affordable manner.
  5. Fast
    Hadoop uses a distributed storage method, the Hadoop Distributed File System (HDFS), which essentially implements a mapping system for locating data in a cluster.
  6. Simple model of programming
    Amongst the various features of Hadoop MapReduce, one of the most important is that it is based on a simple programming model. This allows programmers to develop MapReduce programs that can handle tasks easily and efficiently.
  7. Parallel Programming
    One of the major aspects of MapReduce programming is its parallel processing. It divides a job into tasks that can be executed in parallel, so multiple processors work on the divided tasks at once and the entire program runs in less time.
  8. Availability and resilience
    Whenever data is sent to an individual node, the same set of data is also forwarded to other nodes in the cluster. So, if any particular node suffers a failure, there are always other copies present on other nodes that can still be accessed whenever needed. This assures high availability of data.
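
This replication is actually provided by HDFS, on which MapReduce relies, and the replication factor is configurable. A small illustrative sketch (the property name dfs.replication and the factor of 3 are standard HDFS settings; the file path is invented):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      conf.setInt("dfs.replication", 3);  // keep 3 copies of every block (the HDFS default)

      // Explicitly set the replication factor of one existing file.
      FileSystem fs = FileSystem.get(conf);
      fs.setReplication(new Path("/data/input/sample.txt"), (short) 3);
    }
  }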

Map-Reduce Key and Important Features

  • MapReduce abstracts all the ‘housekeeping’ away from the developer
    • Developer can concentrate simply on writing the Map and Reduce functions
  • Close integration with HDFS
  • Automatic partitioning of a job into sub-tasks
  • Automatic parallelisation and distribution
  • Fault-tolerance & Auto retry on failures
  • Status and monitoring tools
  • Provides pluggable APIs and configuration mechanisms for writing applications (see the driver sketch after this list)
    • Map and Reduce functions
    • Input formats and splits
    • Number of tasks, data types, etc…
  • Linear Scalability
  • Locality of task execution
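
As an illustration of these pluggable pieces, a driver class might wire a job together as follows. This is a sketch only; it assumes the hypothetical WordCount.TokenizerMapper and WordCount.SumReducer classes from the earlier example:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

  public class WordCountDriver {
    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count");
      job.setJarByClass(WordCountDriver.class);

      // Pluggable Map and Reduce functions.
      job.setMapperClass(WordCount.TokenizerMapper.class);
      job.setReducerClass(WordCount.SumReducer.class);

      // Pluggable input/output formats; the input paths determine the splits.
      job.setInputFormatClass(TextInputFormat.class);
      job.setOutputFormatClass(TextOutputFormat.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));

      // Pluggable data types and number of Reduce tasks.
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      job.setNumReduceTasks(2);

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }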

Map-Reduce Architecture with Hadoop: Key Concepts

  • A job is a ‘full program’
    • A complete execution of Mappers and Reducers over a dataset
  • JobTracker
    • Resides on NameNode /Master
    • Accepts MR jobs
    • Assigns tasks to slaves
    • Monitors tasks
    • Handles failures
  • TaskTracker
    • Resides on DataNode / Slave
    • Runs Map / Reduce tasks
    • Manages intermediate output
  • Task
    • A task is the execution of a single Mapper or Reducer over a slice of data
    • Reports progress
  • Mapper
    • Processes a single input split from HDFS, often a single HDFS block
    • Runs on a node which holds its portion of the data locally, to avoid network traffic
    • Multiple Mappers run in parallel, each processing a portion of the input data
    • Works on an individual record at a time
    • Each record has a key and a value
    • Intermediate data is written by the Mapper to local disk
  • Shuffle & Sort
    • Once the Map tasks have finished, data is then transferred across the network to the Reducers
    • All the values associated with the same intermediate key are transferred to the same Reducer
    • Developer can specify the number of Reducers
  • Reducer
    • The Reducer is passed each key and a list of all its values
    • Keys are passed in sorted order
    • Aggregates results from the Mappers
    • Intermediate keys produced by the Mapper are the keys on which the aggregation will be based
    • Usually emits a single key/value pair for each input key
    • Output from the Reducers is written to HDFS (a short worked trace of this data flow follows)
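
To make this flow concrete, here is a small worked word-count trace (the two input lines are invented for illustration):

  Input records (byte offset, line):
    (0,  "to be or not")
    (13, "to be")

  Mapper output, written to local disk (intermediate key/value pairs):
    (to,1) (be,1) (or,1) (not,1)    from the first record
    (to,1) (be,1)                   from the second record

  Shuffle & Sort (all values for the same key reach the same Reducer, keys in sorted order):
    be  -> [1, 1]
    not -> [1]
    or  -> [1]
    to  -> [1, 1]

  Reducer output, written to HDFS:
    (be,2) (not,1) (or,1) (to,2)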

How it works?

  • Map function
    • Takes a key/value pair and generates a set of intermediate key/value pairs
    • map(k1,v1)->list(k2,v2)

  • Reduce function
    • Takes an intermediate key and the list of values associated with that key, and merges them into a (usually smaller) set of output values
    • reduce(k2,list(v2))->list(k3,v3)
  • A client submits a job to the JobTracker (a brief client-side sketch follows this list)
    • JobTracker assigns a job ID
    • Client calculates the input splits for the job
    • Client adds job code and configuration to HDFS
  • The JobTracker creates a Map task for each input split
    • TaskTrackers send periodic “heartbeats” to the JobTracker
    • These heartbeats also signal readiness to run tasks
    • JobTracker then assigns tasks to these TaskTrackers
  • The TaskTracker then forks a new JVM to run the task
    • This isolates the TaskTracker from bugs or faulty code
    • A single instance of task execution is called a task attempt
    • Status info periodically sent back to JobTracker
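
As a brief client-side sketch of this flow, the lines below could replace the final waitForCompletion call in the hypothetical driver shown earlier; most of the steps above are triggered inside submit():

  // submit() obtains a new job ID, computes the input splits, copies the job
  // jar and configuration to HDFS, and hands the job to the cluster.
  job.submit();
  System.out.println("Submitted job: " + job.getJobID());

  // Poll and print status reported back through the periodic heartbeats.
  boolean success = job.monitorAndPrintJob();
  System.exit(success ? 0 : 1);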

For more info, visit: Apache Hadoop

And for more interesting blogs, visit: Knoldus Blog
