HDFS

Apache Spark

Deep Dive into Apache Spark Transformations and Actions

Reading Time: 4 minutes In our previous blog on Apache Spark, we discussed a little about what Transformations & Actions are. Now we will go deeper into the topic and understand what they actually are and how they play a vital role when working with Apache Spark. What is a Spark RDD? Spark introduces the concept of an RDD (Resilient Distributed Dataset), an immutable, fault-tolerant, distributed collection of objects Continue Reading
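
To make the distinction concrete, here is a minimal, hypothetical sketch using Spark's Java API: the app name, master URL, and sample data are placeholders for illustration only, not taken from the post. It shows that transformations are recorded lazily while actions trigger the computation.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class TransformationsAndActions {
    public static void main(String[] args) {
        // Local master and sample data are placeholders for illustration only.
        SparkConf conf = new SparkConf().setAppName("TransformationsAndActions").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // Transformations are lazy: map and filter only record the lineage, nothing runs yet.
        JavaRDD<Integer> doubled = numbers.map(n -> n * 2);
        JavaRDD<Integer> evens = doubled.filter(n -> n % 4 == 0);

        // Actions are eager: count and collect trigger the actual computation.
        long count = evens.count();
        List<Integer> values = evens.collect();

        System.out.println("count = " + count + ", values = " + values);
        sc.close();
    }
}
```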

Spark – Actions and Transformations

Reading Time: 4 minutes Hey guys, welcome to this series of Spark blogs. This being the first blog in the series, we will try to keep things as crisp as possible, so let's get started. I recently got to start learning Spark and, believe me, it has made me inquisitive about it. For a brief introduction to Spark, I would say that it is a pretty Continue Reading

Big Data Evolution: Migrating on-premise database to Hadoop

Reading Time: 4 minutes We are now generating massive volumes of data at an accelerated rate. To meet business needs, address changing market dynamics as well as improve decision-making, sophisticated analysis of this data from disparate sources is required. The challenge is how to capture, store and model these massive pools of data effectively in relational databases. Big data is not a fad. We are just at the beginning Continue Reading

Using Vertica with Spark-Kafka: Write using Structured Streaming

Reading Time: 3 minutes In two previous blogs, we explored Vertica and how it can be connected to Apache Spark. The first blog in this mini series was about reading data from Vertica using Spark and saving that data into Kafka. The next blog explained the reverse flow, i.e. reading data from Kafka and writing data to Vertica, but in a batch mode, i.e. reading data from Kafka Continue Reading
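
As a rough illustration of that streaming flow, the sketch below reads a Kafka topic with Structured Streaming and pushes each micro-batch out through a generic JDBC sink. The post itself may use the dedicated Vertica Spark connector instead; the broker, topic, table, and connection details here are placeholder assumptions.

```java
import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaToVerticaStreaming {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("KafkaToVerticaStreaming")
                .master("local[*]")
                .getOrCreate();

        // Continuous read from a Kafka topic as a streaming DataFrame (placeholder broker/topic).
        Dataset<Row> stream = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "source-topic")
                .load()
                .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");

        // Write each micro-batch over JDBC; connection settings are placeholders.
        StreamingQuery query = stream.writeStream()
                .foreachBatch(new VoidFunction2<Dataset<Row>, Long>() {
                    @Override
                    public void call(Dataset<Row> batch, Long batchId) {
                        batch.write()
                                .format("jdbc")
                                .option("url", "jdbc:vertica://localhost:5433/db")
                                .option("dbtable", "public.target_table")
                                .option("user", "dbadmin")
                                .option("password", "password")
                                .mode("append")
                                .save();
                    }
                })
                .option("checkpointLocation", "/tmp/kafka-to-vertica-checkpoint")
                .start();

        query.awaitTermination();
    }
}
```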

Using Vertica with Spark-Kafka: Writing

Reading Time: 4 minutes In the previous blog of this series, we took a glance at the basic definitions of Spark and Vertica. We also did a code overview for reading data from Vertica using Spark as a DataFrame and saving the data into Kafka. In this blog, we will be doing the reverse flow, i.e. reading the data from Kafka as a DataFrame and writing that DataFrame into Continue Reading
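
The batch variant of that flow might look roughly like the sketch below: a one-off read of a Kafka topic into a DataFrame, written out through a generic JDBC sink. Topic names and connection settings are assumptions; the actual post may rely on the Vertica Spark connector instead.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KafkaToVerticaBatch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("KafkaToVerticaBatch")
                .master("local[*]")
                .getOrCreate();

        // One-off (batch) read of the records currently in the Kafka topic (placeholder broker/topic).
        Dataset<Row> records = spark.read()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "source-topic")
                .load()
                .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");

        // Write the DataFrame to Vertica through a generic JDBC sink (placeholder connection details).
        records.write()
                .format("jdbc")
                .option("url", "jdbc:vertica://localhost:5433/db")
                .option("dbtable", "public.target_table")
                .option("user", "dbadmin")
                .option("password", "password")
                .mode("append")
                .save();

        spark.stop();
    }
}
```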

Kafka with Spark

Tuning a Spark Application

Reading Time: 4 minutes Having trouble optimizing your Spark application? If yes, then this blog will guide you on how you can optimize it and which parameters should be tuned so that your Spark application gives its best performance. Spark applications can hit bottlenecks on resources such as CPU, memory, and network. We need to tune our memory usage, our data structures, and how RDDs need to Continue Reading
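
By way of illustration, a handful of commonly tuned settings can be grouped in a SparkConf like the hypothetical sketch below. The specific values are assumptions only; the right numbers depend entirely on your cluster size and workload.

```java
import org.apache.spark.SparkConf;

public class TunedSparkConf {
    public static SparkConf build() {
        // Illustrative values only; tune them against your own cluster and job profile.
        return new SparkConf()
                .setAppName("TunedApplication")
                // Executor resources: balance cores and memory per executor.
                .set("spark.executor.instances", "4")
                .set("spark.executor.cores", "4")
                .set("spark.executor.memory", "8g")
                // Kryo serialization is usually faster and more compact than Java serialization.
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                // Shuffle parallelism: roughly 2-3 tasks per available core is a common starting point.
                .set("spark.sql.shuffle.partitions", "200");
    }
}
```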

HDFS: A Conceptual View

Reading Time: 5 minutes There has been a significant boom in distributed computing over the past few years. Various components communicate with each other over a network in spite of being deployed on different physical machines. A distributed file system (DFS) is a file system with data stored on a server. The data is accessed and processed as if it were stored on the local client machine. The DFS makes it convenient to share information Continue Reading

CuriosityX: RDDs – The backbone of Apache Spark

Reading Time: 5 minutes In our last blog, we tried to understand how to use Spark Streaming to transform and transport data between Kafka topics. After reading it, many readers asked us to give a brief description of the RDDs in Spark that we used. So, this blog is entirely dedicated to RDDs in Spark. Let's start with the very basic question that comes to our mind Continue Reading

HDFS Erasure Coding in Hadoop 3.0

Reading Time: 4 minutes HDFS Erasure Coding (EC) in Hadoop 3.0 is the solution to a problem we had in earlier versions of Hadoop: the 3x replication factor, which is the simplest way to protect our data even on DataNode failure but needs too much extra storage. Now, with EC, the storage overhead is magically reduced to 50% from the earlier 200% because of Continue Reading
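
For instance, a Reed-Solomon 6+3 policy stores 6 data blocks plus 3 parity blocks, i.e. roughly 50% overhead instead of 200%. The sketch below is a hypothetical example of applying such a policy to a directory via Hadoop 3's DistributedFileSystem API; the NameNode URI, directory path, and policy name are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

import java.io.IOException;
import java.net.URI;

public class EnableErasureCoding {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // NameNode URI, directory, and policy name are placeholders.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // Files written under this directory are erasure coded (6 data + 3 parity)
            // instead of being replicated 3x.
            dfs.setErasureCodingPolicy(new Path("/data/cold"), "RS-6-3-1024k");
        }
        fs.close();
    }
}
```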

Simple Java Program to Append to a File in HDFS

Reading Time: 2 minutes In this blog, I will present a Java program to append to a file in HDFS. I will be using Maven as the build tool. To start with, we first need to add the Maven dependencies in pom.xml. Then we need to import the following classes: import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FSDataOutputStream; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import java.io.*;
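
A minimal, self-contained sketch built around those classes might look like the following; the NameNode URI, file path, and the appended text are assumptions for illustration and not necessarily what the full post uses.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;

public class HdfsAppend {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // NameNode URI and file path are placeholders; the file must already exist in HDFS.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        Path file = new Path("/user/demo/append-demo.txt");

        // Open the existing file in append mode and write a new line to the end of it.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("appended line\n");
        }
        fs.close();
    }
}
```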

Resolving the Failure Issue of NameNode

Reading Time: 2 minutes In the previous blog, “Smattering of HDFS”, we learnt that “The NameNode is a Single Point of Failure for the HDFS Cluster”. Each cluster had a single NameNode, and if that machine became unavailable, the whole cluster would become unavailable until the NameNode was restarted or brought up on a different machine. Now in this blog, we will learn about resolving the failure issue of Continue Reading

Introduction To Hadoop!

Reading Time: 4 minutes Here I am going to write a blog on Hadoop! “Bigdata is not about data! The value in Bigdata [is in] the analytics.” – Harvard Prof. Gary King. And so Hadoop came into the picture! Hadoop is an open-source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache Continue Reading

Smattering of HDFS

Reading Time: 2 minutes INTRODUCTION TO HDFS: Hadoop is an open-source framework that allows us to store and process big data in a distributed environment across clusters of computers. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant, as it provides high-performance access to data across Hadoop clusters. Like other Hadoop-related technologies, HDFS has become a key tool for managing Continue Reading
