Introduction to Hadoop


In this blog, I am going to give an introduction to Hadoop.

“Bigdata is not about data! The value in Bigdata [is in] the analytics. ”

-Harvard Prof. Gary King

This is where Hadoop comes into the picture.

Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.


Hadoop was created by computer scientists Doug Cutting and Mike Cafarella in 2006 to support distribution for the Nutch search engine. It was inspired by Google's papers on MapReduce and the Google File System (GFS).


The problem with an RDBMS is that it cannot process semi-structured and unstructured data (text, videos, audio, Facebook posts, clickstream data, etc.). It is designed to work with structured data (banking transactions, location information, etc.). The two systems also differ in how they process data.

An RDBMS architecture built on the ER model struggles to deliver fast results on very large data sets because it scales vertically: you add more CPU or storage to a single server. It also becomes unavailable if the main server goes down. Hadoop, on the other hand, handles large volumes of both structured and unstructured data in different formats such as XML, JSON, and text, with high fault tolerance. Because it scales horizontally across clusters of many servers, Hadoop can return results from big, unstructured data sets faster, and its architecture is based on open source software.

You can download it from here :


1) It is open source

2) Power of Java

3) Part of the Apache group

4) Supported by big web giants

Key Technologies:

1) Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications that run on Hadoop. These MapReduce programs are capable of processing enormous amounts of data in parallel on large clusters of computation nodes.

2) HDFS (inspired by GFS): HDFS takes care of the storage part of Hadoop applications. MapReduce applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them across compute nodes in the cluster. This distribution enables reliable and extremely rapid computations.
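To make the MapReduce model from item 1 more concrete, here is a minimal single-process sketch of a word count in Python. It only imitates the map, shuffle, and reduce steps in memory; it does not use Hadoop's API, and the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster, different input splits would be mapped on different
    # nodes in parallel; here everything runs in one process for illustration.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

The point of the split into three functions is that map and reduce work on independent pieces of data, which is what lets a framework like Hadoop spread them over many machines.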


Figure: HDFS architecture

Each DataNode periodically sends a Heartbeat message to the NameNode. If a DataNode fails, the NameNode detects this from the absence of its Heartbeat messages and stops forwarding new I/O requests to that DataNode.
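The heartbeat idea can be sketched in a few lines of Python. This is a toy model, not Hadoop's actual implementation: the class name NameNodeMonitor, the 30-second timeout, and the explicit "now" timestamps are all assumptions chosen to keep the example deterministic.

```python
class NameNodeMonitor:
    """Toy model of heartbeat tracking (hypothetical, not Hadoop's real code)."""

    def __init__(self, timeout_seconds):
        # Assumed timeout: a node with no heartbeat for this long is treated as dead.
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = {}  # DataNode id -> time of its last heartbeat

    def receive_heartbeat(self, datanode_id, now):
        """Record that a DataNode reported in at time `now` (in seconds)."""
        self.last_heartbeat[datanode_id] = now

    def live_datanodes(self, now):
        """DataNodes that may still receive new I/O requests."""
        return sorted(
            node for node, seen in self.last_heartbeat.items()
            if now - seen <= self.timeout_seconds
        )

    def dead_datanodes(self, now):
        """DataNodes whose heartbeats have been absent for too long."""
        return sorted(
            node for node, seen in self.last_heartbeat.items()
            if now - seen > self.timeout_seconds
        )


if __name__ == "__main__":
    monitor = NameNodeMonitor(timeout_seconds=30)
    monitor.receive_heartbeat("dn1", now=0)
    monitor.receive_heartbeat("dn2", now=0)
    monitor.receive_heartbeat("dn1", now=25)  # dn2 stays silent
    print(monitor.live_datanodes(now=40), monitor.dead_datanodes(now=40))
```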

What is Replication Factor?

The replication factor is a property set in the HDFS configuration file (dfs.replication in hdfs-site.xml) that controls the default replication factor for the entire cluster. Each block stored in HDFS exists as n copies in total, so there are n – 1 additional replicas distributed across the cluster. For example, if the replication factor is set to 3 (the default value in HDFS), there is one original block and two replicas.
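As a rough worked example of what the replication factor means for storage, the sketch below computes how many blocks and block copies a file produces. The function name replication_summary is made up for this post, and the 128 MB block size is an assumption (it is the usual default in recent Hadoop versions, but your cluster may be configured differently).

```python
import math


def replication_summary(file_size_mb, block_size_mb=128, replication_factor=3):
    """Estimate block and storage counts for one file stored in HDFS.

    block_size_mb=128 is an assumed default; check your own configuration.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,
        # n copies in total means n - 1 replicas besides the original.
        "extra_replicas_per_block": replication_factor - 1,
        "total_block_copies": blocks * replication_factor,
        # The last block is only as large as the data it holds, so raw usage
        # is simply the file size multiplied by the replication factor.
        "raw_storage_mb": file_size_mb * replication_factor,
    }


if __name__ == "__main__":
    print(replication_summary(1000))
```

So under these assumptions a 1000 MB file is split into 8 blocks, and the cluster stores 24 block copies using about 3000 MB of raw disk space.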

NameNode :

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself. The NameNode holds the metadata for HDFS, such as namespace information and block information. When in use, all this information is kept in main memory, but it is also stored on disk for persistence.

DataNode :

A DataNode stores data in the Hadoop file system (HDFS). A functional file system has more than one DataNode, with data replicated across them. On startup, a DataNode connects to the NameNode, retrying until that service comes up. It then responds to requests from the NameNode for file system operations.

SecondaryNameNode :

Many of us are confused by its name: it gives the impression that it is a backup for the NameNode, but in reality it is not.


The image above shows how the NameNode stores information on disk.
The two different files are:

1. fsimage – the snapshot of the file system taken when the NameNode started.
2. Edit logs – the sequence of changes made to the file system after the NameNode started.

In production clusters, NameNode restarts are rare, which means the edit logs can grow very large. This causes the following problems:
1. The edit log becomes very large and is challenging to manage.
2. A NameNode restart takes a long time because a lot of changes have to be merged.
3. In case of a crash, we will lose a huge amount of metadata, since the fsimage is very old.

To overcome these issues, we need a mechanism that keeps the edit log at a manageable size and keeps the fsimage up to date, so that the load on the NameNode is reduced.

The Secondary NameNode helps to overcome the above issues by taking over from the NameNode the responsibility of merging the edit logs with the fsimage.


The figure above shows how the Secondary NameNode works.

1. It gets the edit logs from the NameNode at regular intervals and applies them to the fsimage.
2. Once it has a new fsimage, it copies it back to the NameNode.
3. The NameNode uses this fsimage for the next restart, which reduces the startup time.
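The merge steps above can be sketched in Python as follows. This is only an illustration of the idea: here the fsimage is modeled as a plain dictionary from path to block list, and the edit log as a list of (operation, path, value) tuples. These data shapes and the function names apply_edits and checkpoint are assumptions for this post, not Hadoop's real on-disk formats or API.

```python
def apply_edits(fsimage, edit_log):
    """Return a new namespace snapshot with every logged change applied in order."""
    merged = dict(fsimage)  # copy, so the old snapshot is left untouched
    for operation, path, value in edit_log:
        if operation == "create":
            merged[path] = value
        elif operation == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return merged


def checkpoint(fsimage, edit_log):
    """Toy checkpoint: merge the edits into the fsimage and start a fresh log."""
    new_fsimage = apply_edits(fsimage, edit_log)
    return new_fsimage, []  # the merged edits no longer need to be replayed


if __name__ == "__main__":
    old_image = {"/a.txt": ["blk_1"]}
    edits = [("create", "/b.txt", ["blk_2"]), ("delete", "/a.txt", None)]
    new_image, new_log = checkpoint(old_image, edits)
    print(new_image, new_log)
```

After a checkpoint, a restart only has to load the fresh snapshot and replay a short log, which is why the startup time goes down.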

ResourceManager & NodeManager :

The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.

Figure: YARN architecture
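To give a feel for this division of work, here is a deliberately simplified Python sketch in which nodes report their free memory and a central manager hands out containers. The class ToyResourceManager and its first-fit allocation rule are inventions for this post; YARN's real schedulers are far more sophisticated and are not modeled here.

```python
class ToyResourceManager:
    """Hypothetical, simplified stand-in for a central resource arbiter."""

    def __init__(self):
        self.free_memory_mb = {}  # node name -> free memory last reported

    def node_report(self, node, free_memory_mb):
        """A per-machine agent reports how much memory is free on its node."""
        self.free_memory_mb[node] = free_memory_mb

    def allocate(self, memory_mb):
        """Grant a container on the first node (by name) with enough free memory.

        Returns the chosen node name, or None if no node can fit the request.
        """
        for node in sorted(self.free_memory_mb):
            if self.free_memory_mb[node] >= memory_mb:
                self.free_memory_mb[node] -= memory_mb
                return node
        return None


if __name__ == "__main__":
    rm = ToyResourceManager()
    rm.node_report("node1", 2048)
    rm.node_report("node2", 4096)
    print(rm.allocate(3072), rm.allocate(2048), rm.allocate(4096))
```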

I hope this blog is helpful for you.

For Hadoop cluster setup, refer to the blog:




