In this blog, I am going to explain Delta Lake. Before taking a dive into the lake, we should know its depth and, most importantly, why we are diving into it at all. So let’s start with what it is all about.
We often view technological advances as an inevitable march of forward progress. Yet new technologies often bring challenges, some predictable, some unforeseen, that greatly diminish their benefits. Companies have rapidly adopted data lakes, viewing them as a complement to, or in some cases a replacement for, data warehouses.
In the modern era of AI, ML, and big data, a data warehouse is not very useful for making decisions in real time because of its long processing times, and it has other disadvantages as well. This is where data lakes come into the picture.
Why are data lakes evil?
Data lakes have generated a large amount of publicity as the new storage technology for our big data era. Because something new is always better, right?
All this hype around data lakes has ignored their inherent drawbacks and limitations. I’m not here to start a debate by saying that no one should ever use data lakes. But I am saying that companies should enter into a data lake investment with their eyes wide open; otherwise it might lead to some serious complications.
So, the advantages we get from a data warehouse are gone with a data lake. In a data lake, all the data is simply dumped and not optimized at all. Because of this, the following problems arise:
- Inconsistent data
- Unreliable for analytics
- Lack of schema
- Poor performance
That’s why data lakes are to be viewed with caution. Data lakes are evil because you get far too much of too little quality: “You don’t need that much complexity or data; you’re going to throw away 95% of what you collect.”
So, it’s time to rethink the data binge. Data lakes create complexity, cost, and confusion, but often provide very little value. Many companies think “store now, find value later” but never move on to the “find value” phase, because finding that value turns out to be too difficult.
What is delta lake?
So, to rescue us from the nightmares of data lakes, there comes a new technology from Databricks: Delta Lake. Data professionals across a wide range of industries are looking to help their organizations innovate for competitive advantage by making the most of their data. Good-quality, reliable data is the foundation of success for such analytics and machine learning initiatives. Delta Lake is quite a new concept, and many companies are now migrating their storage systems to it.
In simple words, Delta Lake is an open-source storage layer that brings reliability to data lakes. It has numerous reliability features including ACID transactions, scalable metadata handling, and unified streaming and batch data processing.
- Delta Lake is a powerful storage layer that harnesses the power of Spark and DBFS (Databricks File System)
- The core abstraction of Databricks Delta is an optimized Spark table that stores data as Parquet files in DBFS
- You can read and write data stored in the Delta format using the Spark SQL batch and streaming APIs
Basically, it is a unified management system that delivers
- The scale of the data lake
- The reliability and performance of a data warehouse
- The low latency of streaming
How Delta Lake solves the issues of data lakes and data warehouses
- Delta Lake, along with Structured Streaming, makes it possible to analyze streaming and historical data together at data warehouse speeds.
- Upserts are supported on Delta Lake tables so changes are simpler to manage.
- You can easily include machine learning scoring and advanced analytics into ETL and queries.
- Compute and storage are decoupled resulting in a completely scalable solution.
- Data indexing and caching (10-100x speedups, borrowed from traditional data warehouse techniques)
This might sound a little complicated, because to fully understand it we have to know about its architecture, which I am going to cover in my next blog. In this blog, I am focusing entirely on the usage of Delta Lake.
Enough theory; now let’s come to the implementation part, i.e., how to use it.
To use Delta Lake interactively within the Spark shell, you need a local installation of Apache Spark. Depending on whether you want to use Python or Scala, you can set up either PySpark or the Spark shell, respectively. In this blog, I am going to focus only on Scala, so all the examples you’ll see will be in Scala.
Just run spark-shell with the Delta Lake package, as shown:
bin/spark-shell --packages io.delta:delta-core_2.11:0.3.0
And that’s it, you are ready to use delta lake within your local spark-shell.
You can include Delta Lake in your SBT project by adding the following line to your build.sbt file:
libraryDependencies += "io.delta" %% "delta-core" % "0.3.0"
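For context, a minimal complete build.sbt might look like the sketch below. The project name and the Spark version are assumptions on my part (delta-core 0.3.0 is built for Scala 2.11, and pairs with the Spark 2.4.x line); adjust them to your setup.

```scala
// build.sbt — minimal sketch; project name and Spark version are assumptions
name := "delta-lake-demo"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // "provided" because spark-submit / spark-shell supplies Spark at runtime
  "org.apache.spark" %% "spark-sql"  % "2.4.3" % "provided",
  "io.delta"         %% "delta-core" % "0.3.0"
)
```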
Now that’s cool, isn’t it? To use such a beautiful and sophisticated piece of tech, we just have to give a library dependency, and everything is set.
Creating a table
To create a Delta Lake table, write a DataFrame out in the delta format. You can use existing Spark SQL code and change the format from parquet, CSV, JSON, and so on, to delta.
val data = sparkSession.range(0, 5)
data.write.format("delta").save("path where you want to save it")
Here data is a Spark Dataset.
Reading a table
You read data in your Delta Lake table by specifying the path to the files:
val df = sparkSession.read.format("delta").load("path where the table is saved")
df.show()
Here df is a Spark DataFrame.
Querying an older snapshot (time travel)
Another cool feature of Delta Lake is that it allows you to time travel, i.e., to query an older snapshot of a Delta Lake table. Time travel has many use cases, including:
- Re-creating analyses, reports, or outputs (e.g., the outputs of a machine learning model), which can be useful for debugging or auditing
- Writing complex temporal queries
- Fixing mistakes in our data
- Providing snapshot isolation for a set of queries for fast-changing tables.
There are several ways to query an older version of a Delta Lake table; one of them is using DataFrameReader options, as shown below.
- DataFrameReader options
DataFrameReader options allow you to create a DataFrame from a Delta Lake table that is pinned to a specific version of the table.
val df1 = spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/delta/events")
For timestamp_string, only date or timestamp strings are accepted, for example "2019-01-01" and "2019-01-01T00:00:00.000Z".
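Besides timestamps, a snapshot can also be pinned by its version number using the versionAsOf option. A minimal sketch, assuming a table at /delta/events (version 0 is the table’s first commit):

```scala
// Query an older snapshot of the table by version number.
// The path and version are assumptions for illustration.
val dfV0 = spark.read
  .format("delta")
  .option("versionAsOf", 0)
  .load("/delta/events")
dfV0.show()
```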
Write to a table
There are three ways to write to a Delta Lake table:
- Append using DataFrames
- Overwrite using DataFrames
- Automatic schema update
Append using DataFrames
Using append mode, we can atomically add new data to an existing Delta Lake table.
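A minimal sketch of an append, assuming df is a DataFrame whose schema matches the existing table and /delta/events is the table path:

```scala
// Atomically append df's rows to the existing Delta Lake table.
df.write
  .format("delta")
  .mode("append")
  .save("/delta/events")
```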
Overwrite using DataFrames
To atomically replace all of the data in a table, you can use overwrite mode:
Also, we can selectively overwrite only the data that matches predicates over partition columns. The following command atomically replaces the month of January with the data in df:
df.write
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", "date >= '2017-01-01' AND date <= '2017-01-31'")
  .save("/delta/events")
This sample code writes out the data in df, validates that it all falls within the specified partitions, and performs an atomic replacement.
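The third item in the list above, automatic schema update, deserves a quick sketch as well. When the incoming DataFrame contains new columns, you can ask Delta Lake to merge them into the table schema via the mergeSchema option; the DataFrame name and path below are assumptions:

```scala
// Append a DataFrame that has extra columns, letting Delta Lake
// add the new columns to the table schema automatically.
dfWithNewColumns.write
  .format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .save("/delta/events")
```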
As I said earlier, a table in Delta Lake is both a batch table and a streaming source and sink. So far I have shown how to use a Delta Lake table for batch processing; now let’s see how to read from and write to a Delta Lake table using Structured Streaming.
How to do streaming reads and writes on a Delta table?
Delta Lake is deeply integrated with Spark Structured Streaming through readStream and writeStream. Delta Lake overcomes many of the limitations typically associated with streaming systems and files, including:
- Maintaining “exactly once” processing with more than one stream and concurrent batch jobs
- Efficiently discovering which files are new while using files as a source of the stream
How to use a delta table as a stream source?
When you load a Delta Lake table as a stream source and use it in a streaming query, the query processes all of the data present in the table as well as any new data that arrives after the stream is started.
spark.readStream.format("delta").load("path to the table")
How to use a delta table as a sink?
We can also write data into a Delta Lake table using Structured Streaming. The transaction log enables Delta Lake to guarantee exactly-once processing, even when there are other streams or batch queries running concurrently against the table.
There are two modes available
- Append mode
- Complete mode
By default, streams run in append mode, which adds new records to the table:
events.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "specify the path")
  .start("path to the output file")
For complete mode, just replace "append" with "complete".
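Complete mode rewrites the whole table on every trigger, which is the natural fit for aggregations. As a sketch, here is a streaming query that maintains a per-customer event count in a Delta table; the column name and paths are assumptions:

```scala
// Continuously maintain a count of events per customer, rewriting
// the whole output Delta table on each trigger (complete mode).
spark.readStream
  .format("delta")
  .load("/delta/events")
  .groupBy("customerId")
  .count()
  .writeStream
  .format("delta")
  .outputMode("complete")
  .option("checkpointLocation", "/delta/eventsByCustomer/_checkpoints")
  .start("/delta/eventsByCustomer")
```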
Pretty amazing, right?
So, for anyone who has a basic knowledge of Spark and Structured Streaming, it will not be difficult at all to use Delta Lake.
Well, there are a lot of other things to cover, but this blog is already getting too long, I guess.
So, to summarize: in this blog I explained what Delta Lake is, why it comes into the picture when we already have data lakes as a storage technology, and how to use Delta Lake with Scala.
I have created a simple project using SBT to show how to store the data. It takes streaming data coming from Apache Kafka, processes it with Spark, and finally stores the result in Delta Lake. The link is provided below.
If you want to explore further, go through the documentation on Databricks Delta.
Stay tuned for more