Hadoop

HDFS Erasure Coding in Hadoop 3.0

HDFS Erasure Coding (EC) in Hadoop 3.0 is the solution to a problem we had in earlier versions of Hadoop: the 3x replication factor, which is the simplest way to protect our data even when a DataNode fails, but needs far too much extra storage. With EC, the storage overhead drops to 50% from the earlier 200%, because of Continue Reading
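To make the overhead figures concrete: Hadoop 3 ships a built-in RS-6-3-1024k Reed-Solomon policy (6 data blocks plus 3 parity blocks, so 9/6 = 1.5x storage, i.e. 50% overhead, versus 200% for 3x replication). Below is a minimal Scala sketch, not the post's own code, that applies this policy through the DistributedFileSystem client API; the /data/cold directory is a hypothetical example, and a running HDFS cluster is assumed.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hdfs.DistributedFileSystem

object EcPolicyDemo extends App {
  // Assumes fs.defaultFS points at an HDFS cluster (e.g. via core-site.xml on
  // the classpath); the cast below fails for a local file system.
  val conf = new Configuration()
  val fs   = FileSystem.get(conf).asInstanceOf[DistributedFileSystem]

  val dir = new Path("/data/cold") // hypothetical directory for cold data
  fs.mkdirs(dir)

  // RS-6-3-1024k: 6 data + 3 parity cells => 1.5x storage, i.e. 50% overhead
  fs.setErasureCodingPolicy(dir, "RS-6-3-1024k")
  println(fs.getErasureCodingPolicy(dir)) // confirm the policy took effect

  fs.close()
}
```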

Apache Hadoop vs Apache Spark

The term Big Data has already created a lot of hype in the business world. Hadoop and Spark are both Big Data frameworks; they provide some of the most popular tools used to carry out common Big Data-related tasks. In this blog, we will cover the differences between Spark and Hadoop MapReduce. Introduction: Spark – it is an open-source big data Continue Reading

Understanding HDFS Federation

In this blog, we will discuss HDFS Federation, compare the standard Hadoop architecture with the federated architecture, and talk about the various issues solved by HDFS Federation. So let us first see why it is gaining so much popularity. To address this question, we must know the problems in the existing architecture of Hadoop which led to the creation of HDFS Federation: 1) Availability: If we have Continue Reading

Resolving the Failure Issue of NameNode

In the previous blog “Smattering of HDFS”, we learnt that “The NameNode is a Single Point of Failure for the HDFS Cluster”. Each cluster had a single NameNode, and if that machine became unavailable, the whole cluster would become unavailable until the NameNode was restarted or brought up on a different machine. Now in this blog, we will learn about resolving the failure issue of Continue Reading

Working with the Hadoop FileSystem API

Reading data from and writing data to the Hadoop Distributed File System (HDFS) can be done in a number of ways. Let us start by understanding how this can be done using the FileSystem API: first creating and writing to a file in HDFS, followed by an application that reads a file from HDFS and writes it back to the local file system. To start Continue Reading
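As a taste of what the post builds up to, here is a minimal, self-contained Scala sketch of that write-then-read flow (the path is a hypothetical example, and stdout stands in for the local file system; this is illustrative, not the post's exact code):

```scala
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

object HdfsReadWrite extends App {
  // Picks up core-site.xml / hdfs-site.xml from the classpath
  val conf = new Configuration()
  val fs   = FileSystem.get(conf)

  val file = new Path("/tmp/greeting.txt") // hypothetical HDFS path

  // Write: create() returns an FSDataOutputStream (true = overwrite)
  val out = fs.create(file, true)
  out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8))
  out.close()

  // Read: open() returns an FSDataInputStream; stream it to stdout here,
  // standing in for writing it back to the local file system
  val in = fs.open(file)
  IOUtils.copyBytes(in, System.out, 4096, false)
  in.close()

  fs.close()
}
```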

Tableau: Getting into Tableau Public

Big Data visualization and Business Intelligence have become easy with Tableau: millions and billions of records can be analyzed in just one go, and whether your data format is Excel, CSV, text, or a database, Tableau makes it easy for you. So you have finally made up your mind to generate visualizations using Tableau and want to know what heights Tableau can reach in visualization? You are Continue Reading

Business Intelligence-Data Visualization: Tableau

Spark, Big Data, NoSQL, and Hadoop are some of the most widely used, chart-topping technologies that we frequently work with at Knoldus. Whenever these terms are used, one thing that comes into the picture is huge data: millions or billions of records. Managing such an amount of data (and managing here means storing it, rectifying it, normalizing it, cleaning it, and much more) is really Continue Reading

Setting Up a Multi-Node Hadoop Cluster Just Got Easy!

In this blog, we are going to embark on the journey of setting up a multi-node Hadoop cluster in a distributed environment. So let's not waste any time, and let's get started. Here are the steps you need to perform. Prerequisites: 1. Download and install Hadoop 2.7.3 for the local machine (single-node setup) from http://hadoop.apache.org/releases.html, using Java jdk1.8.0_111. 2. Download Apache Spark from http://spark.apache.org/downloads.html and choose the Spark release Continue Reading

BigData Specifications – Part 1: Configuring a MySQL Metastore in Apache Hive

Apache Hive is used as a data warehouse over Hadoop to provide users a way to load, analyze, and query data from various sources. Data is stored in databases or in file systems like HDFS (Hadoop Distributed File System). Hive can use Spark SQL or HiveQL for the implementation of queries. Hive uses its metastore, which contains the following information: the IDs of tables, the IDs Continue Reading
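For context, pointing the metastore at MySQL comes down to a handful of hive-site.xml properties. The sketch below is a minimal example; the JDBC URL, database name, user, and password are placeholders to replace for your own setup:

```xml
<!-- hive-site.xml: minimal sketch; host, database, user and password are placeholders -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>
```

The MySQL JDBC driver jar also has to be on Hive's classpath (typically dropped into Hive's lib directory) for the connection to work.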

Hadoop Word Count Program in Scala

You must have seen the Hadoop word count program in Java, Python, or C/C++, but probably not in Scala. So let's learn how to build the word count program in Scala. Submitting a job written in Scala to Hadoop is not that easy, because Hadoop runs on Java and so does not understand the functional aspects of Scala. For writing the word count program in Continue Reading
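For a flavour of what such a program looks like, here is a minimal sketch of the classic Hadoop word count ported to Scala, written against the org.apache.hadoop.mapreduce API (class and object names are my own, not necessarily the post's; input and output paths come from the command line):

```scala
import java.lang.{Iterable => JIterable}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

import scala.collection.JavaConverters._

// Mapper: emit (word, 1) for every token in the input line
class TokenizerMapper extends Mapper[Object, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: Object, value: Text,
                   context: Mapper[Object, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one)
    }
}

// Reducer: sum the counts collected for each word
class IntSumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: JIterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
    context.write(key, new IntWritable(values.asScala.map(_.get).sum))
}

// Driver: wire the job together; args(0) is the input path, args(1) the output path
object WordCount extends App {
  val job = Job.getInstance(new Configuration(), "word count")
  job.setJarByClass(this.getClass)
  job.setMapperClass(classOf[TokenizerMapper])
  job.setCombinerClass(classOf[IntSumReducer])
  job.setReducerClass(classOf[IntSumReducer])
  job.setOutputKeyClass(classOf[Text])
  job.setOutputValueClass(classOf[IntWritable])
  FileInputFormat.addInputPath(job, new Path(args(0)))
  FileOutputFormat.setOutputPath(job, new Path(args(1)))
  System.exit(if (job.waitForCompletion(true)) 0 else 1)
}
```

Packaged into a jar (with the Scala library on the classpath), a sketch like this would be submitted in the usual way, e.g. hadoop jar wordcount.jar WordCount /input /output.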

Introduction to Apache Hadoop: The Need

In this blog we will read about Hadoop fundamentals. After reading this blog we will be able to understand why we need Apache Hadoop, so let's start with the problem. What's the problem? The problem is simple: the storage capacities of hard drives have increased massively over the years, but access speeds (the rate at which data can be read from drives) have not kept Continue Reading

Introduction to Hadoop MapReduce

In this blog we will be reading about Hadoop MapReduce. As we all know, to achieve faster processing we need to process the data in parallel. That is what Hadoop MapReduce provides us. MapReduce: MapReduce is a programming model for data processing. MapReduce programs are inherently parallel, thus putting very large-scale data analysis into the hands of anyone with enough machines at their disposal. MapReduce works Continue Reading
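To see the programming model in miniature, here is a toy, in-memory analogy using plain Scala collections (my own illustration, not Hadoop code): map emits (key, value) pairs, the framework groups them by key (the shuffle), and reduce folds each group to a result. Hadoop performs the same sequence, only distributed across many machines and much larger inputs.

```scala
// A toy, in-memory analogy of the MapReduce model using Scala collections
object MapReduceModel extends App {
  val lines = Seq("to be or not to be", "to do is to be")

  val mapped   = lines.flatMap(_.split(" ")).map(word => (word, 1))             // map: emit (key, value)
  val shuffled = mapped.groupBy(_._1)                                           // shuffle: group by key
  val reduced  = shuffled.map { case (w, pairs) => (w, pairs.map(_._2).sum) }   // reduce: fold each group

  reduced.toSeq.sortBy(-_._2).foreach(println) // (to,4), (be,3), then the singletons
}
```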

Apache PIG: Installation and Connecting to a Hadoop Cluster

Apache PIG is a scripting platform for analyzing large datasets. PIG is a high-level scripting language that works with Apache Hadoop: it enables workers to write complex transformations as simple scripts with the help of Pig Latin. Apache PIG interacts directly with the data in the Hadoop cluster and transforms Pig scripts into MapReduce jobs so they can execute with the Continue Reading
