Knoldus Blogs

Top 5 Reasons to Convert Your Cloud Data Lake to a Delta Lake

March 30, 2021 | Categories: Database, Tech Blogs | Tags: data lake, delta lake

Reading Time: 6 minutes
There are various resources that give advice on how to (and how not to) partition your data, how to calculate the ideal file size, how to handle evolving schemas, how to build compaction routines, how to recover from failed ETL jobs, how to stream raw data into the data lake, etc. We have been working with customers throughout this time to encapsulate all of the ...

Spark SQL in Delta Lake 0.7.0

September 3, 2020 (updated September 12, 2020) | Categories: Apache Spark, Big Data and Fast Data, Java, SQL | Tags: Analytics, Big Data, delta lake, query, Spark, sql

Reading Time: 3 minutes
Nowadays Delta Lake is a buzzword in the Big Data world, especially among Spark developers, because it addresses many of the issues found in the Big Data domain. Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It is evolving day by day and adds new features with every release.

Delta Lake: Schema Enforcement & Evolution

May 31, 2020 (updated June 1, 2020) | Categories: Apache Spark, Big Data and Fast Data, Java, Spark | Tags: Apache Spark, Big Data, Big Data Analytics, delta lake, schema

Reading Time: 4 minutes
Nowadays data is constantly evolving and changing. As business problems and requirements evolve, the shape or structure of the data changes too. When that happens, we want to be in control of how the data or schema changes. But how can we achieve this? Delta Lake offers ways to control how the schema changes. With Delta Lake, users have ...

Apache Spark: Delta Lake as a Solution – Part II

May 5, 2020 (updated October 12, 2020) | Categories: Analytics, Apache Spark, Big Data and Fast Data, Database, ML, AI and Data Engineering, Spark, Studio-Scala, Tech Blogs | Tags: Big Data, Big Data Analytics, data analysis, delta lake

Reading Time: 3 minutes
We have already covered the features missing from Apache Spark, and the issues they cause when running Delta Lake, in Part I. Today we will talk about what Delta Lake is and how it solves the problems discussed in Delta Lake as a Solution: Part I. As we all know, Spark is just a processing engine; it doesn’t ...

Apache Spark: Delta Lake as a Solution – Part I

May 4, 2020 | Categories: Analytics, Apache Spark, Big Data and Fast Data, Database, ML, AI and Data Engineering, NoSql, SQL, Streaming, Studio-Scala, Tech Blogs | Tags: Big Data, Big Data Analytics, data analysis, delta lake

Reading Time: 3 minutes
Today, everyone is talking about Delta Lake. Why? Have you ever tried to find the answer to this question? Whether or not you have, don’t worry: here in Part I we will discuss exactly that and also target the following questions: What features are missing from Apache Spark? What kind of issues does that cause when running a data lake? Answering the above questions will definitely ...

Spark: ACID Transaction with Delta Lake

February 5, 2020 | Categories: Apache Spark, Big Data and Fast Data, Java, NoSql, Spark, Studio-Scala | Tags: ACID, Apache Spark, Big Data, DataFrame, datasets, delta lake, transaction

Reading Time: 3 minutes
Spark doesn’t provide some of the most essential features of a reliable data processing system, such as atomic APIs and ACID transactions, as discussed in the blog Spark: ACID compliant or not. Spark addresses this problem by working with Delta Lake. Delta Lake acts as an intermediary service between Apache Spark and the storage system. Instead of directly interacting with the storage layer, ...

Time Travel: Data versioning in Delta Lake

February 2, 2020 | Categories: Analytics, Apache Spark, Big Data and Fast Data, Java, Spark, Studio-Scala | Tags: Apache Spark, Big Data, Big Data Analytics, BigData, data lake, Data Management, data science, delta lake, Spark, Time Travel

Reading Time: 3 minutes
In today’s Big Data world, we process large amounts of data continuously and store the resulting data in a data lake. This keeps changing the state of the data lake. But sometimes we would like to access a historical version of our data. This requires versioning of data. This kind of data management simplifies our data pipeline by making it easy for professionals or organizations to ...

Diving deeper into Delta Lake

October 14, 2019 | Categories: Apache Kafka, Apache Spark, Big Data and Fast Data, github, Spark, Streaming, Studio-Scala, Tech Blogs | Tags: Apache Spark, Big Data, delta lake, kafka, Kafka Streams, scala, Spark Streaming

Reading Time: 6 minutes
Delta Lake is an open-source storage layer that brings reliability to data lakes. It has numerous reliability features including ACID transactions, scalable metadata handling, and unified streaming and batch data processing.

Delta Lake To the Rescue

October 7, 2019 | Categories: Apache Spark, Big Data and Fast Data, Java, python, Spark, Streaming, Studio-Scala | Tags: Apache Spark, batch processing, Big Data, delta lake, Stream Processing

Reading Time: 4 minutes
Welcome back. In our previous blogs, we tried to gain some insight into Spark RDDs and explored some new things in Spark 2.4. You can go through those blogs here: "RDDs – The backbone of Apache Spark" and "Spark 2.4: Adding a little more Spark to your code". In this blog, we will be discussing something called a Delta Lake. But first, let’s ...