CuriosityX: RDDs – The backbone of Apache Spark

In our last blog, we looked at using Spark Streaming to transform and transport data between Kafka topics. After reading it, many readers asked us for a brief description of the RDDs we used in Spark. So this blog is dedicated entirely to RDDs in Spark. Let's start with the very basic question that comes to mind ... Continue Reading

HDFS Erasure Coding in Hadoop 3.0

HDFS Erasure Coding (EC) in Hadoop 3.0 addresses a problem in earlier versions of Hadoop: the 3x replication factor, which is the simplest way to protect data against DataNode failure but requires too much extra storage. With EC, the storage overhead drops to 50%, down from 200% with 3x replication, because of ... Continue Reading
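The 200% and 50% figures can be checked with a little arithmetic. The sketch below is only an illustration: the class and method names are hypothetical, and it assumes an erasure-coding layout of 6 data blocks plus 3 parity blocks, which is one layout that produces the 50% overhead quoted in the excerpt.

```java
public class StorageOverhead {

    // Extra storage, as a percentage of the original data size,
    // when every block is stored `replicas` times in total.
    static double replicationOverheadPercent(int replicas) {
        return (replicas - 1) * 100.0;
    }

    // Extra storage, as a percentage of the original data size,
    // when each group of `dataUnits` data blocks is protected by
    // `parityUnits` parity blocks.
    static double erasureCodingOverheadPercent(int dataUnits, int parityUnits) {
        return parityUnits * 100.0 / dataUnits;
    }

    public static void main(String[] args) {
        // 3x replication: two extra copies of every block.
        System.out.println("3x replication: " + replicationOverheadPercent(3) + "%");
        // Assumed layout: 6 data blocks + 3 parity blocks.
        System.out.println("EC 6+3: " + erasureCodingOverheadPercent(6, 3) + "%");
    }
}
```

With three copies of every block, two of them are pure overhead, hence 200%. Under the assumed 6+3 layout, only 3 extra blocks are stored for every 6 blocks of data, hence 50%.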

Simple Java Program to Append to a File in HDFS

In this blog, I will present a Java program to append to a file in HDFS. I will be using Maven as the build tool. To start, we need to add the Maven dependencies in pom.xml. Next, we need to import the following classes:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.*;

Resolving the Failure Issue of NameNode

In the previous blog, “Smattering of HDFS”, we learnt that “The NameNode is a Single Point of Failure for the HDFS Cluster”. Each cluster had a single NameNode, and if that machine became unavailable, the whole cluster would be unavailable until the NameNode was restarted or brought up on a different machine. In this blog, we will learn about resolving the failure issue of ... Continue Reading

Introduction to Hadoop!

Here I am going to write a blog on Hadoop! “Bigdata is not about data! The value in Bigdata [is in] the analytics.” - Harvard Prof. Gary King. This is where Hadoop comes in! Hadoop is an open-source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache ... Continue Reading

Smattering of HDFS

Introduction to HDFS: Hadoop is an open-source framework that allows us to store and process big data in a distributed environment across clusters of computers. HDFS has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant, as it provides high-performance access to data across Hadoop clusters. Like other Hadoop-related technologies, HDFS has become a key tool for managing ... Continue Reading
