Welcome back. In our previous blogs, we tried to get some insights about Spark RDDs and also explored some of the new things in Spark 2.4. You can go through those blogs here:
In this blog, we will be discussing something called Delta Lake.
But first, let’s try and understand what data lakes are.
Traditionally, data has resided in silos across the organization and the ecosystem in which it operates (external data). That's a challenge: you can't combine the right data to succeed in a big data project if that data is scattered in and out of the cloud. This is where the idea, and reality, of (big) data lakes comes from.
A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.
Although data lakes serve as a central ingestion point for the plethora of data that organizations seek to gather and mine, they still have various limitations and challenges. In theory, a data lake sounds like a good idea: one big repository to store all the data your organization needs to process, unifying myriad data sources. In practice, most data lakes are a mess in one way or another.
Some of these limitations include:
- Reading and writing into data lakes is not reliable: Data engineers often run into unsafe writes, where readers see garbage data while a write is in progress, which requires workarounds to ensure readers always see consistent data.
- The data quality in data lakes is low: Dumping unstructured data into a data lake is easy, but it comes at the cost of data quality. As a consequence, analytics projects that strive to mine this data often fail.
- Poor performance with increasing amounts of data: As the amount of data dumped into a data lake grows, so does the number of files and directories. Big data jobs and query engines that process the data then spend a significant amount of time on metadata operations.
- Updating records in data lakes is hard: Engineers need to build complicated pipelines to read entire partitions or tables, modify the data, and write it back. This makes the pipelines inefficient and hard to maintain.
Because of these challenges, many big data projects fail to deliver on their vision or sometimes just fail altogether. We need a solution that enables data practitioners to make use of their existing data lakes, but while ensuring data quality. This is where Delta Lake comes into the picture.
Databricks says part of the reason data lakes are this unreliable is the lack of transactional support, and they have just open-sourced Delta Lake, a solution to address this.
Going by the definition of Delta Lake:
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.
To unpack that definition a bit: Delta Lake is a storage layer that brings reliability to data lakes built on HDFS and cloud storage by providing ACID transactions, using optimistic concurrency control between writes and snapshot isolation for consistent reads during writes.
One thing to note here is that Delta Lake can operate in the cloud, in on-premise servers, or on devices like laptops. It can handle both batch and streaming sources of data.
Some of the key features include:
- ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data.
- Scalable metadata handling: Leverages Spark’s distributed processing power to handle all the metadata for petabyte-scale tables with billions of files at ease.
- Streaming and batch unification: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, interactive queries all just work out of the box.
- Schema enforcement: Automatically handles schema variations to prevent the insertion of bad records during ingestion.
- Time travel (data versioning): Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
- Full DML Support: Delta Lake supports standard DML including UPDATE, DELETE and MERGE INTO providing developers more controls to manage their big datasets.
- Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
The Delta Lake project is available to download at delta.io. For further information on Delta Lake, you can refer to the official documentation by Databricks here. Also, as it is an open-source project, contributions from the community are welcome.
In this blog, we tried to explain what Delta Lake is and how it compares with existing data lakes. We also saw that Delta Lake addresses the problems and limitations of data lakes in order to simplify how you build them. Being an open-source project has also made it possible for developers to get hands-on with it.
Also, with the recent release of Python API support, Delta Lake has become accessible to an even wider audience. That was it for now; we will try to discuss more on Delta Lake in the future.
Hope this helps. Stay tuned for more. 🙂