Big Data

Amazon EMR

Reading Time: 3 minutes Businesses worldwide are discovering the power of new big data processing and analytics frameworks like Apache Hadoop and Apache Spark, but they are also discovering some of the challenges of operating these technologies in on-premises data lake environments. They may also have concerns about the future of their current distribution vendor. Common problems of on-premises big data environments include a lack of agility, excessive costs, Continue Reading

Apache Spark: Tricks to Increase Job Performance

Reading Time: 2 minutes Apache Spark is quickly being adopted in the real world, and companies like Uber are using it in production. Spark is gaining popularity because it also lets you develop streaming applications and do machine learning, which helps companies get better results in production along with proper analysis. Although companies are using Spark in Continue Reading

Spark: ACID Transaction with Delta Lake

Reading Time: 3 minutes Spark doesn’t provide some of the most essential features of a reliable data processing system, such as atomic APIs and ACID transactions, as discussed in the blog Spark: ACID compliant or not. Spark addresses this problem by working with Delta Lake. Delta Lake acts as an intermediary service between Apache Spark and the storage system. Instead of directly interacting with the storage layer, Continue Reading

Time Travel: Data versioning in Delta Lake

Reading Time: 3 minutes In today’s Big Data world, we process large amounts of data continuously and store the resulting data in a data lake. This keeps changing the state of the data lake. But sometimes we would like to access a historical version of our data. This requires versioning of data. This kind of data management simplifies our data pipeline by making it easy for professionals or organizations to Continue Reading

Data Lake – Build it in Phases

Reading Time: 3 minutes Data Lake – How to build a data lake and the phases involved in building it.

Apache Spark: Read Data from S3 Bucket

Reading Time: < 1 minute Amazon S3. Accessing an S3 bucket through Spark: edit the spark-defaults.conf file. You need to add three lines containing your S3 access key, secret key, and file system.

Apache Spark

Deep Dive into Apache Spark Transformations and Action

Reading Time: 4 minutes In our previous blog on Apache Spark, we briefly discussed what Transformations and Actions are. Now we will go deeper into the topic and understand what they actually are and how they play a vital role in working with Apache Spark. What is Spark RDD? Spark introduces the concept of an RDD (Resilient Distributed Dataset), an immutable, fault-tolerant, distributed collection of objects Continue Reading

Diving deeper into Delta Lake

Reading Time: 6 minutes Delta Lake is an open-source storage layer that brings reliability to data lakes. It has numerous reliability features including ACID transactions, scalable metadata handling, and unified streaming and batch data processing.

Delta Lake To the Rescue

Reading Time: 4 minutes Welcome back. In our previous blogs, we tried to get some insights about Spark RDDs and also explored some new things in Spark 2.4. You can go through those blogs here: "RDDs – The backbone of Apache Spark" and "Spark 2.4: Adding a little more Spark to your code". In this blog, we will be discussing something called Delta Lake. But first, let’s Continue Reading

Big Data Evolution: Migrating on-premise database to Hadoop

Reading Time: 4 minutes We are now generating massive volumes of data at an accelerated rate. To meet business needs, address changing market dynamics as well as improve decision-making, sophisticated analysis of this data from disparate sources is required. The challenge is how to capture, store and model these massive pools of data effectively in relational databases. Big data is not a fad. We are just at the beginning Continue Reading

Do you really need Spark? Think Again!

Reading Time: 5 minutes With the massive growth in big data technologies today, it is becoming very important to use the right tool for every process. The process can be anything: data ingestion, data processing, data retrieval, data storage, etc. Today we are going to focus on one of those popular big data technologies, Apache Spark. Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark Continue Reading

Commit Log: A commitment that Cassandra provides.

Reading Time: 5 minutes Welcome back, everyone. I have been working on Cassandra for quite some time now but never actually got to explore its working in depth. We know that its decentralized nature, as well as its ability to handle such a large volume of writes, makes it really commendable. But how does it manage to be efficient? How is it able to achieve what it is so Continue Reading