Big Data Analytics: An Introduction

Reading Time: 5 minutes


Data can help businesses better understand their customers and improve their advertising campaigns. It can also help them personalise their content and improve their bottom lines. The advantages of data are many, but you can’t access these benefits without the proper data analytics tools and processes. While raw data has a lot of potential, you need data analytics to unlock the power to grow your business.


Data are individual facts, statistics, or items of information, often numeric, that are collected through observation. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects.



In short, data are a collection of raw facts and figures.


A database is a collection of information that is organised so that it can be easily accessed, managed, and updated. Computer databases typically contain aggregations of data records or files containing information.


A database management system (DBMS) is a computerised data-keeping system. Users of the system are given facilities to perform several kinds of operations, either manipulation of the data or management of the database structure itself.


A relational database organises data into tables that can be linked, or related, based on data common to each. This capability enables you to retrieve an entirely new table from data in one or more tables with a single query.
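As a sketch of the idea, the example below joins two tables with a single query using Python’s built-in sqlite3 module; the customers and orders tables, and all values in them, are invented for illustration.

```python
import sqlite3

# Two related tables sharing a common column (customer_id).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1, 250.0), (2, 1, 99.5), (3, 2, 40.0)])

# One query joins the tables on their common column and returns an
# entirely new table: total order amount per customer.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Asha', 349.5), ('Ravi', 40.0)]
```

The join itself is what makes the database “relational”: neither table alone contains the per-customer totals, but one query over both produces them.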


A distributed database system allows local users to manage and access the data in their local databases, while also providing some form of global data management that gives global users a global view of the data.


Data integration is the process of collecting and managing data from varied sources to provide meaningful business insights.


Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important; it’s what organizations do with the data that matters. It can be analyzed for insights that lead to better decisions and strategic business moves.


  • The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, 2.5 exabytes (2.5×10¹⁸ bytes) of data were being generated every day.
  • An IDC report predicted that the global data volume would grow from 4.4 zettabytes in 2013 to 44 zettabytes by 2020.
  • By 2025, IDC predicts, there will be 163 zettabytes of data.
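To put these figures in perspective, here is the rough arithmetic behind them, using only the rates quoted above and standard SI unit definitions:

```python
# Unit definitions (standard SI prefixes, in bytes).
EXABYTE = 10**18
ZETTABYTE = 10**21

# The daily 2.5-exabyte rate is the one quoted in the text (as of 2012).
daily_bytes = 2.5 * EXABYTE
yearly_zettabytes = daily_bytes * 365 / ZETTABYTE
print(yearly_zettabytes)   # roughly 0.9 zettabytes per year at that rate

# IDC's 4.4 ZB -> 44 ZB forecast amounts to a tenfold increase.
growth_factor = 44 / 4.4
print(growth_factor)
```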


  • Various formats, types, and structures
  • Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc.
  • Static data vs. streaming data  
  • A single application can be generating/collecting many types of data  


The “5Vs” of Big Data, which are also termed the characteristics of Big Data, are as follows:
  • Volume 
    The volume of data refers to the size of the data sets that an organisation has gathered for analysis and processing. With today’s technology, these data sets frequently reach the larger units of bytes, such as terabytes and petabytes.
  • Variety
    Variety refers to all the structured and unstructured data that can be generated either by humans or by machines. The most commonly added data are structured: texts, tweets, pictures, and videos. However, unstructured data like emails, voicemails, hand-written text, ECG readings, audio recordings, etc., are also important elements under Variety. Variety is all about the ability to classify incoming data into various categories.
  • Velocity
    Velocity refers to the speed with which data are being generated. Staying with our social media example: every day, 900 million photos are uploaded to Facebook, 500 million tweets are posted on Twitter, 0.4 million hours of video are uploaded to YouTube, and 3.5 billion searches are performed on Google. This is like a nuclear data explosion. Big data helps a company to contain this explosion, accept the incoming flow of data, and at the same time process it quickly so that it does not create bottlenecks.
  • Variability
    Variability refers to data whose meaning is constantly changing. Organisations often need to develop sophisticated programs to understand the context of such data and decode its exact meaning.
  • Veracity
    Data veracity, in general, is how accurate or truthful a data set may be. In the context of big data, however, it takes on a bit more meaning. When it comes to the accuracy of data, it’s not just the quality of the data itself but how trustworthy the data source, type, and processing are.


  • NoSQL

NoSQL is a better choice for businesses whose data workloads are more geared toward the rapid processing and analyzing of vast amounts of varied and unstructured data.
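To make the schema-flexibility point concrete, here is a minimal, dependency-free sketch of the document-store style many NoSQL systems use. The documents and field names are invented, and a plain dict stands in for a real database:

```python
# A plain dict standing in for a document collection.
collection = {}

def insert(doc_id, doc):
    collection[doc_id] = doc

# Two "user" documents with different fields - no fixed table schema
# is required, unlike in a relational database.
insert("u1", {"name": "Asha", "email": "asha@example.com"})
insert("u2", {"name": "Ravi", "tags": ["premium"], "last_login": "2022-01-05"})

# Query: find the documents that have a given field at all.
premium = [d["name"] for d in collection.values() if "tags" in d]
print(premium)  # ['Ravi']
```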


  • MapReduce

MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
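The model can be sketched in a few lines of single-process Python. In a real cluster the map and reduce steps run in parallel across many machines, but the word-count example below (with invented input lines) shows the same three phases:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine each key's values into one result.
    return key, sum(values)

lines = ["big data big insights", "big cluster"]
pairs = chain.from_iterable(map_phase(l) for l in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 1, 'insights': 1, 'cluster': 1}
```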


  • Hadoop

Hadoop stores and processes the data in a distributed manner across a cluster of commodity hardware.


  • HBase

HBase is a data model similar to Google’s Bigtable that is designed to provide quick random access to high volumes of structured or unstructured data.

  • YARN

YARN, which stands for Yet Another Resource Negotiator, is an Apache Hadoop technology. YARN is a large-scale, distributed operating system for big data applications.


  • ZooKeeper

ZooKeeper is an open-source Apache project that provides a centralized service for configuration information, naming, synchronization, and group services over large clusters in distributed systems.


  • Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.


  • Oozie

Apache Oozie is a Java web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.

  • Pig

Pig is a high-level platform or tool used to process large datasets. It provides a high level of abstraction for processing over MapReduce.

  • Hive

Hive is an easy-to-use software application that lets one analyse large-scale data through the batch processing technique. 


  • Customer segmentation, managing customer relationships, and offering more customer-centric products
  • Managing stocks and predicting product demand
  • Optimising resource utilisation and reducing costs
  • Fraud detection
  • Identifying and removing performance bottlenecks proactively
  • Predicting equipment failures
  • Launching new services
  • Mitigating risks
  • Identifying the causes of failures and problems in real time
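As a toy sketch of just one item from this list, fraud detection, the example below flags transactions that sit far from a customer’s usual spend. The history values and the three-standard-deviation threshold are arbitrary choices made for illustration:

```python
import statistics

# Invented transaction history for one customer.
history = [42.0, 38.5, 55.0, 47.0, 40.0, 51.5]
mean = statistics.mean(history)
stdev = statistics.stdev(history)

def is_suspicious(amount, threshold=3.0):
    # Flag amounts more than `threshold` standard deviations from the mean.
    return abs(amount - mean) > threshold * stdev

print(is_suspicious(49.0))    # False - within the usual range
print(is_suspicious(900.0))   # True  - flagged for review
```

Real systems combine many such signals (location, device, merchant, timing) in learned models, but the idea of scoring each event against past behaviour is the same.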


The availability of Big Data, low-cost commodity hardware, and new information management and analytic software has produced a unique moment in the history of data analysis. The convergence of these trends means that, for the first time in history, we have the capabilities required to analyse astonishing data sets quickly and cost-effectively. These capabilities are neither theoretical nor trivial: they represent a genuine leap forward and a clear opportunity to realise enormous gains in efficiency, productivity, revenue, and profitability. To dig deeper, you can read this blog.

Written by 

Lokesh Kumar is an intern in the AI/ML studio at Knoldus. He is passionate about Artificial Intelligence and Machine Learning, with knowledge of C, C++, Python, Data Analytics, and much more. He is recognised as a good team player, a dedicated and responsible professional, and a technology enthusiast. He is a quick learner, curious to learn new technologies.
