Hadoop

Tale of Apache Spark

Reading Time: 6 minutes Data is being produced at an enormous scale in today’s world, and it will be generated even more rapidly in the future. 90% of the world’s data was produced in the last two years alone, and it is estimated that in 2020 the world’s total data would reach 45 ZB and the data generated each day would be enough that if we try to store it Continue Reading

Big Data Evolution: Migrating on-premise database to Hadoop

Reading Time: 4 minutes We are now generating massive volumes of data at an accelerated rate. To meet business needs, address changing market dynamics as well as improve decision-making, sophisticated analysis of this data from disparate sources is required. The challenge is how to capture, store and model these massive pools of data effectively in relational databases. Big data is not a fad. We are just at the beginning Continue Reading

CuriosityX: RDDs – The backbone of Apache Spark

Reading Time: 5 minutes In our last blog, we looked at using Spark Streaming to transform and transport data between Kafka topics. After reading it, many readers asked us for a brief description of the RDDs in Spark that we used. So, this blog is dedicated entirely to RDDs in Spark. Let’s start with the very basic question that comes to mind Continue Reading

HDFS Erasure Coding in Hadoop 3.0

Reading Time: 4 minutes HDFS Erasure Coding (EC) in Hadoop 3.0 addresses a problem in earlier versions of Hadoop: the 3x replication factor, which is the simplest way to protect data when a DataNode fails but needs too much extra storage. With EC, the storage overhead is reduced to 50% from the earlier 200% because of Continue Reading
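
The 200% and 50% figures in the excerpt above can be checked with simple arithmetic. The sketch below is only an illustration: the helper `storage_overhead` is a hypothetical name, and the Reed-Solomon layout of 6 data cells plus 3 parity cells is an assumption, since the truncated excerpt does not name the coding policy.

```python
def storage_overhead(data_units: int, redundant_units: int) -> float:
    """Return extra storage as a percentage of the original data size."""
    return 100.0 * redundant_units / data_units


# 3x replication: every block is stored with two additional copies.
replication_overhead = storage_overhead(data_units=1, redundant_units=2)

# Assumed Reed-Solomon (6, 3) layout: 6 data cells protected by 3 parity cells.
ec_overhead = storage_overhead(data_units=6, redundant_units=3)

print(f"3x replication overhead: {replication_overhead:.0f}%")
print(f"RS(6,3) erasure coding overhead: {ec_overhead:.0f}%")
```

Under these assumptions, replication stores two redundant units per data unit, while erasure coding stores three parity units per six data units.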

Difference between Apache Hadoop and Apache Spark Mapreduce

Reading Time: 4 minutes The term Big Data has already created a lot of hype in the business world. Hadoop and Spark are both Big Data frameworks – they provide some of the most popular tools used to carry out common Big Data-related tasks. In this blog, we will cover the difference between Apache Hadoop and Apache Spark MapReduce. Introduction Spark – It is an open source Continue Reading

Understanding HDFS Federation

Reading Time: 3 minutes In this blog, we will discuss Hadoop federation, compare the Hadoop architecture with the Hadoop federated architecture, and talk about the various issues solved by HDFS federation. So let us first see why it is gaining so much popularity. To address this question, we must know the problems in the existing architecture of Hadoop which led to the creation of Hadoop federation: 1) Availability: If we have Continue Reading

Resolving the Failure Issue of NameNode

Reading Time: 2 minutes In the previous blog “Smattering of HDFS”, we learnt that “The NameNode is a Single Point of Failure for the HDFS Cluster”. Each cluster had a single NameNode, and if that machine became unavailable, the whole cluster would become unavailable until the NameNode was restarted or brought up on a different machine. Now in this blog, we will learn about resolving the failure issue of Continue Reading

Working with Hadoop Filesystem Api

Reading Time: 2 minutes Reading data from and writing data to the Hadoop Distributed File System (HDFS) can be done in a number of ways. Let us see how this can be done using the FileSystem API to create and write to a file in HDFS, followed by an application to read a file from HDFS and write it back to the local file system. To start Continue Reading

Tableau: Getting into Tableau Public

Reading Time: 2 minutes Big Data visualization and Business Intelligence have become easy with Tableau: millions or billions of records can be analyzed in just one go, and whether your data format is Excel, CSV, text or a database, Tableau makes it easy for you. So you have finally made up your mind to generate visualizations using Tableau and want to know how far Tableau can go with visualizations? You are Continue Reading

Business Intelligence-Data Visualization: Tableau

Reading Time: 3 minutes Spark, Big Data, NoSQL and Hadoop are some of the most used, top-of-the-charts technologies that we frequently work with at Knoldus. When these terms are used, one thing comes into the picture: ‘Huge Data, millions/billions of records’. Knoldus developers use these terms frequently. Managing such an amount of data (managing here means storing data, rectifying it, normalizing it, cleaning it and much more) is really Continue Reading

Setting Up Multi-Node Hadoop Cluster, just got easy!

Reading Time: 3 minutes In this blog, we are going to embark on the journey of setting up a Hadoop Multi-Node cluster in a distributed environment. So let’s not waste any time and get started. Here are the steps you need to perform. Prerequisite: 1. Download & install Hadoop for local machine (Single Node Setup) http://hadoop.apache.org/releases.html – 2.7.3 use java : jdk1.8.0_111 2. Download Apache Spark from : http://spark.apache.org/downloads.html choose spark release Continue Reading

BigData Specifications – Part 1 : Configuring MySql Metastore in Apache Hive

Reading Time: 2 minutes Apache Hive is used as a data warehouse over Hadoop to give users a way to load, analyze and query data from various sources. Data is stored in databases or file systems like HDFS (Hadoop Distributed File System). Hive can use Spark SQL or HiveQL for the implementation of queries. Hive uses its metastore, which contains the following information: IDs of tables, IDs Continue Reading

Hadoop Word Count Program in Scala

Reading Time: 2 minutes You must have seen the Hadoop word count program in Java, Python or C/C++, but probably not in Scala. So, let’s learn how to build a Hadoop Word Count Program in Scala. Submitting a job written in Scala to Hadoop is not that easy, because Hadoop runs on Java, so it does not understand the functional aspect of Scala. For writing Word Count Program Continue Reading
