Hadoop

MarkLogic & Hadoop: for easier technology solutions

Reading Time: 4 minutes Introduction The MarkLogic Connector for Apache Hadoop is a powerful tool that allows you to use MapReduce with the MarkLogic Platform to move large volumes of data into your Hadoop cluster. With this integration, you can leverage existing technology and processes for ETL. In addition, this connector enables you to take advantage of many advanced features available only in MarkLogic. Continue Reading

Working with Big Data and Hadoop in PDI

Reading Time: 4 minutes Big Data in Pentaho The term big data applies to very large, complex, and dynamic datasets that need to be stored and managed over a long time. To derive benefits from big data you need the ability to access, process, and analyze data as it is being created. The size and structure of big data make it very inefficient to maintain and process it using Continue Reading

Pentaho – Hadoop Cluster connection

Reading Time: 2 minutes Prerequisite: Basic overview of Pentaho. Using Pentaho, you can solve big-data analytics problems easily, without writing a single line of code, and generate the required results/output for analysis. It can easily establish connections with other Big Data platforms such as Google Dataproc, Hortonworks Data Platform (HDP), Amazon Elastic MapReduce (EMR), etc. It can also be integrated with Hadoop services like HDFS, Continue Reading

Deep Dive into Hadoop MapReduce: Part 2

Reading Time: 8 minutes Prerequisite: Hadoop basics and an understanding of the Deep Dive into MapReduce: Part 1 blog. MapReduce Tutorial: Introduction In this MapReduce tutorial blog, I am going to introduce you to MapReduce, which is one of the core building blocks of processing in the Hadoop framework. Before moving ahead, I would suggest you get familiar with the HDFS concepts that I covered in my previous HDFS tutorial blog. Continue Reading

Deep Dive into MapReduce: Part 1

Reading Time: 5 minutes Prerequisite: Basic concepts of Hadoop and the Hadoop Distributed File System. MapReduce is a programming model and a software framework used for processing enormous amounts of data. A MapReduce program works in two stages, namely Map and Reduce. Map tasks deal with the splitting and mapping of data, while Reduce tasks shuffle and reduce the data. MapReduce is a programming model that is neither platform- nor Continue Reading
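
To make the two stages concrete, here is a minimal word-count sketch against the Hadoop MapReduce API (the class names are illustrative, not taken from the post): the Map stage emits a (word, 1) pair per token, and after the shuffle groups the pairs by word, the Reduce stage sums the counts.

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}

// Map stage: split each input line into words and emit (word, 1).
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      ctx.write(word, one)
    }
}

// Reduce stage: the shuffle has grouped the pairs by word; sum the counts.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    values.forEach(v => sum += v.get)
    ctx.write(key, new IntWritable(sum))
  }
}
```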

Tale of Apache Spark

Reading Time: 6 minutes Data is being produced extensively in today’s world, and it is going to be generated even more rapidly in the future. 90% of all the data in the world was produced in the last two years alone, and it is estimated that by 2020 the world’s total data will reach 45 ZB, with the data generated each day being enough that if we try to store it Continue Reading

Big Data Evolution: Migrating on-premise database to Hadoop

Reading Time: 4 minutes We are now generating massive volumes of data at an accelerated rate. To meet business needs, address changing market dynamics, and improve decision-making, sophisticated analysis of this data from disparate sources is required. The challenge is how to capture, store, and model these massive pools of data effectively in relational databases. Big data is not a fad. We are just at the beginning Continue Reading
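
One common first step for such a migration, not covered in the excerpt itself but worth sketching, is Apache Sqoop, which runs MapReduce jobs to pull relational tables into HDFS. The connection string, credentials, and paths below are hypothetical:

```sh
# Import one table from an on-premise MySQL database into HDFS
# (hypothetical host, database, table, and target directory).
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /user/etl/orders \
  --num-mappers 4
```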

CuriosityX: RDDs – The backbone of Apache Spark

Reading Time: 5 minutes In our last blog, we tried to understand how to use Spark Streaming to transform and transport data between Kafka topics. After reading it, many readers asked us for a brief description of the RDDs in Spark that we had used. So, this blog is totally dedicated to RDDs in Spark. Let’s start with the very basic question that comes to our mind Continue Reading
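
As a taste of what the post covers, here is a minimal, self-contained RDD sketch (the app name is arbitrary): transformations such as filter and map are lazy, and nothing executes until an action such as collect is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

    // Create an RDD from an in-memory collection.
    val numbers = sc.parallelize(1 to 10)

    // Transformations are lazy: nothing has executed yet.
    val evenSquares = numbers.filter(_ % 2 == 0).map(n => n * n)

    // collect() is an action; it triggers the actual computation.
    println(evenSquares.collect().mkString(", ")) // 4, 16, 36, 64, 100

    sc.stop()
  }
}
```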

HDFS Erasure Coding in Hadoop 3.0

Reading Time: 4 minutes HDFS Erasure Coding (EC) in Hadoop 3.0 solves a problem found in earlier versions of Hadoop: the 3x replication factor. Replication is the simplest way to protect our data even when a DataNode fails, but it needs too much extra storage. Now, with EC, the storage overhead is reduced to 50% from the earlier 200% because of Continue Reading
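
The overhead arithmetic works out as follows, assuming the Reed-Solomon RS(6,3) policy that ships with Hadoop 3: storing 6 blocks under 3x replication consumes 18 blocks of raw storage (12 extra, i.e. 200% overhead), while RS(6,3) adds only 3 parity blocks to the same 6 data blocks (9 in total, i.e. 50% overhead). Policies are applied per directory; the path below is hypothetical:

```sh
# List the erasure coding policies known to the cluster.
hdfs ec -listPolicies

# Apply the default Reed-Solomon 6+3 policy to a directory.
hdfs ec -setPolicy -path /data -policy RS-6-3-1024k
```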

Difference between Apache Hadoop and Apache Spark MapReduce

Reading Time: 4 minutes The term Big Data has already created a lot of hype in the business world. Hadoop and Spark are both Big Data frameworks; they provide some of the most popular tools used to carry out common Big Data-related tasks. In this blog, we will cover the difference between Apache Hadoop and Apache Spark MapReduce. Introduction Spark – It is an open source Continue Reading

Understanding HDFS Federation

Reading Time: 3 minutes In this blog, we will discuss HDFS Federation, compare the standard Hadoop architecture with the federated architecture, and talk about the various issues solved by HDFS Federation. So let us first see why it is gaining so much popularity. To address this question, we must know the problems in the existing architecture of Hadoop which led to the creation of HDFS Federation: 1) Availability: If we have Continue Reading
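
For reference, federation gives the cluster several independent namespaces, each served by its own NameNode. A minimal hdfs-site.xml sketch (the namespace names and hosts are hypothetical):

```xml
<!-- Two federated namespaces, each with its own NameNode. -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>namenode2.example.com:8020</value>
</property>
```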

Resolving the Failure Issue of NameNode

Reading Time: 2 minutes In the previous blog, “Smattering of HDFS”, we learnt that “The NameNode is a Single Point of Failure for the HDFS Cluster”. Each cluster had a single NameNode, and if that machine became unavailable, the whole cluster would become unavailable until the NameNode was restarted or brought up on a different machine. Now, in this blog, we will learn about resolving the failure issue of Continue Reading
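
One traditional mitigation, sketched here on the assumption that the full post discusses metadata backup (the paths are hypothetical), is to have the NameNode write its metadata to several directories, one of them a remote NFS mount, so the namespace can be rebuilt on another machine:

```xml
<!-- hdfs-site.xml: persist the namespace image and edit log to a local
     disk and to an NFS mount, so a standby machine can take over. -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/namenode,/mnt/nfs/namenode</value>
</property>
```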

Working with Hadoop Filesystem Api

Reading Time: 2 minutes Reading data from and writing data to the Hadoop Distributed File System (HDFS) can be done in a number of ways. Let us start by understanding how this can be done using the FileSystem API: first to create and write to a file in HDFS, followed by an application that reads a file from HDFS and writes it back to the local file system. To start Continue Reading
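
A minimal sketch of the round trip described above, using org.apache.hadoop.fs.FileSystem (the path and file contents are hypothetical; the Configuration picks up core-site.xml from the classpath):

```scala
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

object HdfsReadWrite {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration() // reads core-site.xml / hdfs-site.xml
    val fs   = FileSystem.get(conf)

    // Create a file in HDFS and write a few bytes to it.
    val path = new Path("/tmp/hello.txt")
    val out  = fs.create(path, true) // true = overwrite if it exists
    out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8))
    out.close()

    // Read the file back and copy its bytes to the local stdout.
    val in = fs.open(path)
    IOUtils.copyBytes(in, System.out, 4096, false) // don't close System.out
    in.close()
    fs.close()
  }
}
```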