Introduction to Apache Hadoop: The Need


In this blog we will cover the Hadoop fundamentals. After reading it, you will understand why we need Apache Hadoop. So let's start with the problem.

What's the Problem :- The problem is simple: while the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up.

One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes.

Over 20 years later, 1-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.

Proposed Solution :- This is a long time to read all data on a single drive—and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.
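The arithmetic above can be sketched directly, using the figures quoted in the text:

```python
# Back-of-the-envelope: time to read a full drive, serially vs. in parallel.
# Figures are the ones quoted above (a 1990 drive and a modern 1 TB drive).

def read_time_seconds(capacity_mb, transfer_mb_per_s, drives=1):
    """Time to read capacity_mb split evenly across `drives` drives in parallel."""
    return (capacity_mb / drives) / transfer_mb_per_s

# 1990: 1,370 MB at 4.4 MB/s -> around five minutes
t_1990 = read_time_seconds(1370, 4.4)

# Today: 1 TB at 100 MB/s -> more than two and a half hours
t_single = read_time_seconds(1_000_000, 100)

# 100 drives, each holding one hundredth of the data -> under two minutes
t_parallel = read_time_seconds(1_000_000, 100, drives=100)

print(f"1990 drive:        {t_1990 / 60:.1f} minutes")    # ~5.2 minutes
print(f"1 TB, 1 drive:     {t_single / 3600:.1f} hours")  # ~2.8 hours
print(f"1 TB, 100 drives:  {t_parallel / 60:.1f} minutes")# ~1.7 minutes
```

The parallel figure is the key point: dividing the data across 100 drives divides the read time by 100.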

This, in a nutshell, is what Hadoop provides. Now, let's learn more about Apache Hadoop.

Hadoop was created by Doug Cutting and has its origins in Apache Nutch, an open-source web search engine, itself a part of the Lucene project. Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

What is Hadoop :- Hadoop is a reliable, scalable platform for storage and analysis. It runs on commodity hardware and is open source.

Why can’t we use databases with lots of disks to do large-scale analysis? Why is Hadoop needed? :-

The answer lies in a trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth.
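A rough model makes the seek-versus-transfer trade-off concrete. The 10 ms seek time and 100 MB/s transfer rate below are assumed, illustrative figures, not measurements:

```python
# Illustrative model: random record reads pay a seek per record,
# while a streaming scan pays (roughly) one seek and then pure bandwidth.
SEEK_S = 0.010          # assumed: one disk seek takes 10 ms
TRANSFER_MB_S = 100.0   # assumed: 100 MB/s sequential transfer rate

def random_read_seconds(n_records, record_kb):
    """Read n_records scattered across the disk: one seek per record, then transfer."""
    return n_records * (SEEK_S + (record_kb / 1024) / TRANSFER_MB_S)

def streaming_read_seconds(total_mb):
    """Read the whole dataset sequentially: a single seek, then stream it."""
    return SEEK_S + total_mb / TRANSFER_MB_S

# One million 1 KB records (~1 GB in total):
print(random_read_seconds(1_000_000, 1))   # seek-dominated: ~10,000 s (hours)
print(streaming_read_seconds(1024))        # bandwidth-bound: ~10 s
```

This is why rewriting a large fraction of a dataset by streaming through all of it (as MapReduce does) can beat updating it record by record through seeks.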

For updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.

Use-Case Scenarios for MapReduce and RDBMS :-

  • MapReduce is a good fit for problems that need to analyze the whole dataset
    in a batch fashion, particularly for ad hoc analysis.
  • An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data.
  • MapReduce suits applications where the data is written once and read many times.
  • An RDBMS is good for datasets that are continually updated.
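The MapReduce model discussed above can be sketched in a few lines: map emits (key, value) pairs, a sort/shuffle step groups them by key, and reduce folds each group into a result. This is an in-process illustration only; real Hadoop distributes these phases across a cluster.

```python
# Minimal in-process MapReduce sketch: word count.
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every record.
    for record in records:
        for word in record.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/sort by key, then reduce: sum each group's values.
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (key, sum(v for _, v in group))

lines = ["hadoop stores data", "hadoop processes data"]
print(dict(reduce_phase(map_phase(lines))))
# {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

Note the central role of sorting: grouping by key is exactly the Sort/Merge step that lets MapReduce rebuild an entire dataset efficiently in one batch pass.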

To understand Hadoop further, we need to know a few terminologies. They are as follows :-

Structured data is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema.

Semi-structured data is looser: though there may be a schema, it is often ignored, so it may be used only as a guide to the structure of the data.

Unstructured data does not have any particular internal structure: for example, plain
text or image data.

An RDBMS works on structured data, i.e. data that has its own schema, while Hadoop also works well on unstructured or semi-structured data because it is designed to interpret the data at processing time. This is called schema-on-read. It provides flexibility and avoids the costly data-loading phase of an RDBMS, since in Hadoop loading is just a file copy.
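Schema-on-read can be sketched as follows. The raw line is stored as-is (loading is just a file copy), and a schema is applied only when the data is read; the field names here are assumptions chosen for illustration:

```python
# Schema-on-read sketch: store raw text untouched, interpret it at read time.
raw_line = "1001,Alice,Engineering"  # stored as plain text; no schema enforced at load

def read_with_schema(line):
    """Apply an (assumed) id/name/dept schema at processing time."""
    emp_id, name, dept = line.split(",")
    return {"id": int(emp_id), "name": name, "dept": dept}

print(read_with_schema(raw_line))
# {'id': 1001, 'name': 'Alice', 'dept': 'Engineering'}
```

With schema-on-write (an RDBMS), the same parsing and validation would happen up front at load time; schema-on-read defers it until the data is actually processed.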

The Era of Time :- Initially, the record for sorting 1 terabyte of data stood at 297 seconds. In April 2008, Hadoop broke the world record to become the fastest system to sort 1 terabyte of data, taking 209 seconds on a 910-node cluster. In November 2008, Google reported that its MapReduce implementation sorted 1 terabyte in 68 seconds. In April 2009, it was announced that a team at Yahoo! had used Hadoop to sort 1 terabyte in 62 seconds. In 2014, Databricks used a 207-node Spark cluster to sort 100 terabytes of data in 1,406 seconds, a rate of 4.27 terabytes per minute.

REFERENCES :- Hadoop: The Definitive Guide, 4th Edition


Written by 

Rachel Jones is a Solutions Lead at Knoldus Inc. with more than 22 years of experience. Rachel likes to delve deeper into the fields of AI (Artificial Intelligence) and deep learning. She loves challenges and motivating people, and also loves to read novels by Dan Brown. Rachel has problem-solving, management, and leadership skills; moreover, she is familiar with programming languages such as Java, Scala, C++ & HTML.