Organizations today have a lot of data: structured customer records, files in S3, unstructured streams from fleets of sensors. The promise of a data lake is that you can collect all of this data, dump it into the lake, and then derive insights from it, building powerful tools such as recommendation engines and fraud-detection algorithms. The problem is that the data collected could be garbage, what gets dumped into the data lake could be garbage, and garbage in eventually means garbage insights out. So, in this blog, we will discuss the Databricks Delta architecture and how Delta removes the drawbacks of a data lake.
Data Lake Drawbacks
- No atomicity – there is no all-or-nothing guarantee, so a failed job can leave corrupt, partially written data behind.
- No quality enforcement – nothing validates incoming data, which leads to inconsistent and unusable records.
- No consistency/isolation – it is impossible to read or append reliably while an update is in progress.
Delta Lake allows us to improve data quality incrementally until it is ready for consumption; data flows through Delta Lake like water through a river.
- Delta Lake brings full ACID transactions to Apache Spark: a job either completes entirely or has no effect at all.
- Delta Lake is open source (originally open-sourced by Databricks). You can store large amounts of data without worrying about locking.
- Delta Lake is deeply powered by Apache Spark, which means existing Spark jobs (batch or streaming) can be converted to Delta without rewriting them from scratch.
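The all-or-nothing behavior described above can be illustrated with a small, self-contained Python sketch (no Spark or Delta required; the file layout and record format here are made up for illustration): data is staged in a temporary file and published with a single atomic rename, so readers never observe a half-written file.

```python
import json
import os
import tempfile

def atomic_write_records(path, records):
    """Write records as JSON lines, all-or-nothing.

    Data is staged in a temporary file in the same directory and
    published with os.replace(), which is an atomic rename. If
    serialization fails midway, the original file is left untouched.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")
        os.replace(tmp_path, path)  # atomic publish: readers see old or new, never partial
    except Exception:
        os.remove(tmp_path)  # discard the partial staging file
        raise
```

A naive `open(path, "w")` loop, by contrast, leaves a truncated file behind if the job dies halfway, which is exactly the corrupt state a plain data lake can end up in. Delta achieves the same guarantee at table scale through its transaction log.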
For a quick introduction to Delta Lake, refer to the blog.
Delta Lake Architecture
Now, let us discuss the Delta architecture and the tables that make it up.
- Bronze tables:
  - A dumping ground for raw data, since data arrives from many sources and may be dirty.
  - Often with long retention (years).
  - Avoids error-prone parsing at ingestion time.
- Silver tables:
  - Consists of intermediate data with some cleanup applied, much like water that is cleaned as it flows through the twists and turns of a river.
  - Queryable, which makes debugging easy.
- Gold tables:
  - Consists of clean data that is ready for consumption.
  - Data can be read with Spark or Presto.
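The bronze → silver → gold flow above can be sketched without Spark as three plain-Python stages (the record fields and the per-user aggregation are invented examples, not part of Delta): bronze keeps every raw line as-is, silver parses and drops unusable records, and gold aggregates the cleaned data for consumption.

```python
import json

def bronze(raw_lines):
    # Bronze: dump everything as-is, even dirty records, for long retention.
    return list(raw_lines)

def silver(bronze_records):
    # Silver: parse and clean. Records that fail parsing are dropped here,
    # not at ingestion time, so the raw originals survive in bronze.
    cleaned = []
    for line in bronze_records:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # dirty record: skip it, bronze still holds the original
        if "user" in record and "amount" in record:
            cleaned.append(record)
    return cleaned

def gold(silver_records):
    # Gold: a consumption-ready aggregate (total amount per user).
    totals = {}
    for record in silver_records:
        totals[record["user"]] = totals.get(record["user"], 0) + record["amount"]
    return totals
```

In a real Delta pipeline each stage would be a Delta table written with Spark (for example `df.write.format("delta")`), but the layering logic is the same: quality improves at each hop while nothing is lost upstream.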
Delta Lake offers features that unify data science, data engineering, and production workflows, which makes it ideal for the machine learning life cycle. It offers time travel, allowing data changes to be rolled back or reproduced when needed. Schema enforcement ensures that data is processed in the expected format, while schema evolution prevents existing models from breaking when the schema changes. By delivering the benefits of multiple storage systems in one, Delta removes complexity and enables simpler data architectures that let organizations focus on extracting value from their data.
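Time travel and schema enforcement can be illustrated with a minimal in-memory sketch (the `VersionedTable` class and its method names are invented for this illustration, not the Delta Lake API): every append commits a new version only if the rows match the declared schema, and every older version stays readable.

```python
class VersionedTable:
    """Toy table with schema enforcement and version history."""

    def __init__(self, schema):
        self.schema = set(schema)   # required column names
        self.versions = [[]]        # version 0 is the empty table

    def append(self, rows):
        # Schema enforcement: reject the whole batch if any row deviates.
        for row in rows:
            if set(row) != self.schema:
                raise ValueError(f"schema mismatch: {sorted(row)}")
        # Commit a new immutable version; old versions stay readable.
        self.versions.append(self.versions[-1] + list(rows))
        return len(self.versions) - 1   # new version number

    def read(self, version_as_of=None):
        # Time travel: read any historical version, latest by default.
        if version_as_of is None:
            version_as_of = len(self.versions) - 1
        return self.versions[version_as_of]
```

In Delta Lake itself, the transaction log plays the role of the version list, and time travel is exposed as `spark.read.format("delta").option("versionAsOf", n)`.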
- Machine Learning with Delta Lake
- Databricks Delta Guide PDF
- Making Apache Spark better with Delta Lake webinar