Apache Spark: Delta Lake as a Solution - Part I


Today, everyone is talking about Delta Lake. But why? Whether or not you have already looked for the answer, don't worry: Part 1 of this series discusses it by targeting the following questions:

What features are missing from Apache Spark?
What kinds of issues does this cause when running a Data Lake?

Answering these questions will resolve most of your doubts about Delta Lake. So, let's dive into each of them.

1) What features are missing from Apache Spark?

Most Data Engineers work with Apache Spark. One of the most important features missing from Spark is ACID compliance, where ACID stands for Atomicity, Consistency, Isolation, and Durability.
Each property is defined below:

– Atomicity states that a write through the Spark DataFrame writer should either write the full data to the data source or write nothing at all. Spark doesn't guarantee this, which is a big problem. Think of it from the perspective of a job failure: if the job fails partway through, the user can lose the entire data set. That can be a black day for the user, because every value is crucial for better analysis or prediction, whether the user is a Data Engineer, a Data Analyst, or a Data Scientist.

– Consistency ensures that the data is always in a valid state. As discussed under Atomicity, a failed job can lead to data loss, which breaks data consistency. Even when the job doesn't fail, consistency is still lost, because there is a window of time between the delete and the write operations during which the data is missing or incomplete.

– Isolation means that when a transaction is in progress and not yet committed, it must remain isolated from any other transaction. In other words, writing to a data set shouldn't impact another concurrent read/write on the same data set. Apache Spark does not have a strict notion of a commit: it implements task-level commits and, finally, a job-level commit. But this implementation is broken by the lack of atomicity in write operations. Hence, Spark does not offer the isolation property either.

– Durability guarantees that committed transactions survive permanently. However, when Spark doesn't implement the commit correctly, all the durability features offered by the storage go for a toss.
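The atomicity and consistency points above can be illustrated with a toy model. The sketch below does not use Spark at all; it is a plain-Python simulation that assumes an overwrite is carried out as "delete the old files, then write the new ones" and that the job fails partway through the write. The names naive_overwrite, read_all, and fail_after are hypothetical and exist only for this illustration.

```python
import os
import tempfile


def naive_overwrite(path, rows, fail_after=None):
    """Toy overwrite: delete existing files first, then write new ones.

    Hypothetical helper, not a Spark API. `fail_after` simulates a job
    failure once that many part files have been written.
    """
    # Step 1: delete the old data. After this, there is no way back.
    for name in os.listdir(path):
        os.remove(os.path.join(path, name))

    # Step 2: write one part file per row.
    for i, row in enumerate(rows):
        if fail_after is not None and i >= fail_after:
            raise RuntimeError("simulated job failure")
        with open(os.path.join(path, f"part-{i:05d}.txt"), "w") as f:
            f.write(row)


def read_all(path):
    """Return the contents of every file in `path`, in file-name order."""
    out = []
    for name in sorted(os.listdir(path)):
        with open(os.path.join(path, name)) as f:
            out.append(f.read())
    return out


table = tempfile.mkdtemp()
naive_overwrite(table, ["a", "b", "c"])  # the first load succeeds
before = read_all(table)

try:
    # The second load fails after writing 1 of its 3 files.
    naive_overwrite(table, ["x", "y", "z"], fail_after=1)
except RuntimeError:
    pass

after = read_all(table)
print(before)  # ['a', 'b', 'c']
print(after)   # ['x']
```

After the failed run, the old rows are gone and only part of the new data exists, so the table holds neither the full old state nor the full new state.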
So, it can be concluded that Spark is not ACID compliant. But that is not all there is to know about Spark. So far, we have only shown that Apache Spark is not ACID compliant. What kinds of issues does this cause when running a Data Lake?

The era of the Data Lake: Good vs. Bad

Before continuing, let's look at the good and the bad sides of the Data Lake era.

GOOD	BAD
Massive Scale	Inconsistent Data
Inexpensive Storage	Lack of Schema
Open Formats (Parquet, ORC)	Poor Performance
ML & Real-Time Streaming	Unreliable for Analytics

So, these are some of the good and bad things about the Data Lake.

2) What kinds of issues does this cause when running a Data Lake?

a) Lack of Schema Enforcement

Without schema enforcement, writes with mismatched columns or data types are accepted, which creates inconsistent and low-quality data.
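As a rough illustration of what missing schema enforcement looks like, the sketch below models a table as a directory of JSON-lines files in plain Python (no Spark involved). The helpers append_batch and observed_schemas, and the column names, are assumptions made up for this example.

```python
import json
import os
import tempfile


def append_batch(path, records):
    """Append a batch as a new JSON-lines file. No schema check is done."""
    name = f"part-{len(os.listdir(path)):05d}.json"
    with open(os.path.join(path, name), "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")


def observed_schemas(path):
    """Collect the distinct (column, type) signatures found in the data."""
    schemas = set()
    for name in sorted(os.listdir(path)):
        with open(os.path.join(path, name)) as f:
            for line in f:
                rec = json.loads(line)
                signature = tuple(
                    sorted((k, type(v).__name__) for k, v in rec.items())
                )
                schemas.add(signature)
    return schemas


table = tempfile.mkdtemp()
append_batch(table, [{"id": 1, "amount": 9.5}])   # the intended schema
append_batch(table, [{"id": "2", "amt": "ten"}])  # wrong types, wrong column

schemas = observed_schemas(table)
print(len(schemas))  # 2
```

Both batches are accepted, so every reader now has to cope with two incompatible shapes of the same table.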

b) Failed Production Jobs

Failed jobs leave data in a corrupt state that requires tedious recovery.
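To see why the recovery is tedious, consider a toy append job, simulated here in plain Python rather than Spark. It assumes part files are written straight into the table directory, and that a failed job is simply re-run. The function append_job and the job names are hypothetical.

```python
import os
import tempfile


def append_job(path, job_id, rows, fail_after=None):
    """Toy append: write one part file per row directly into the table.

    Hypothetical helper. `fail_after` simulates a crash once that many
    files have been written.
    """
    for i, row in enumerate(rows):
        if fail_after is not None and i >= fail_after:
            raise RuntimeError(f"{job_id} failed")
        with open(os.path.join(path, f"part-{job_id}-{i:05d}.txt"), "w") as f:
            f.write(row)


table = tempfile.mkdtemp()
append_job(table, "job1", ["a", "b"])  # 2 rows, succeeds

try:
    append_job(table, "job2", ["c", "d", "e"], fail_after=2)  # crashes
except RuntimeError:
    pass

# Files left behind by the failed job are still visible to every reader.
leftovers = sorted(f for f in os.listdir(table) if f.startswith("part-job2-"))

# A naive retry writes all three rows again, next to the leftovers.
append_job(table, "job2retry", ["c", "d", "e"])
row_count = len(os.listdir(table))

print(leftovers)  # ['part-job2-00000.txt', 'part-job2-00001.txt']
print(row_count)  # 7
```

The table should hold 5 rows but holds 7, and someone has to work out which files belong to the failed run before the data can be trusted again.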

c) Lack of Consistency

This makes it impossible to reliably mix appends and reads, or batch and streaming jobs.

d) Small File Problem

A batch job that is triggered, say, every hour creates a large number of small files over time, which eventually hurts job performance.
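A quick back-of-the-envelope calculation shows how fast the files pile up. The numbers below are assumptions chosen for illustration: each hourly run writes 200 output files (which happens to match the default value of spark.sql.shuffle.partitions) and 100 MB of new data. The function small_file_stats is a hypothetical helper.

```python
def small_file_stats(hours, files_per_run, mb_per_run):
    """Return (total file count, average file size in MB) for hourly runs."""
    total_files = hours * files_per_run
    avg_mb = mb_per_run / files_per_run
    return total_files, avg_mb


# Assumed workload: 200 files and 100 MB written by each hourly run.
files_per_day, avg_mb = small_file_stats(24, 200, 100)
files_per_month, _ = small_file_stats(24 * 30, 200, 100)

print(files_per_day, avg_mb)  # 4800 0.5
print(files_per_month)        # 144000
```

Under these assumptions the table gains 4,800 half-megabyte files per day, and every later read has to list and open all of them.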
Delta Lake provides a solution to all of the above problems.

But how?

That will be covered in Part 2.
To learn more about Delta Lake, don't forget to read Delta Lake - Part 2, which covers how to use Delta Lake.

If you liked this blog, please show your appreciation by liking and sharing it. Also, feel free to leave comments about the post and suggest improvements. Till then, HAPPY LEARNING.