Spark: ACID Transaction with Delta Lake

Reading Time: 3 minutes

Spark doesn’t provide some of the most essential features of a reliable data processing system such as Atomic APIs and ACID transactions as discussed in the blog Spark: ACID compliant or not. Spark welcomes a solution to the problem by working with Delta Lake. Delta Lake plays an intermediary service between Apache Spark and the storage system. Instead of directly interacting with the storage layer, our programs talk to the delta lake for reading and writing the data. Thus, delta lake takes the responsibility of complying to ACID properties.

spark with delta lake architecture

DeltaLog is the crux of Delta Lake which ensures atomicity, consistency, isolation, and durability of user-initiated transactions. DeltaLog is an ordered record of transactions. Every transaction performed since the inception of Delta Lake Table, has an entry in the DeltaLog (also known as the Delta Lake transaction log). It acts as a single source of truth, giving users access to the last version of a DeltaTable’s state. It provides serializability, the strongest level of isolation level. Let’s see how DeltaLog ensures ACID Transactions.

Atomicity and Consistency

Delta Lake breaks down every operation performed by a user into atomic commits, themselves composed of actions. Successful completion of all actions of a commit ensures that DeltaLog records that commit. In case of any failed job, the commit is not recorded in the DeltaLog.

//Job-1
Dataset<Long> data = spark.range(100, 200);
data.write().format("delta").mode("overwrite").save(FILE_PATH);

The delta lake saves data in parquet format using the save() method of DataFrameWriter API. Executing Job-1 creates a record of 100 integers in a parquet file and one commit. Going through the PATH_FILE we can find one parquet file consisting of records and one commit JSON file in _delta_log directory containing commit info. Additional changes to the table generate subsequent JSON files in ascending numerical order.

DeltaLog

Now, Let’s create another Job to do the same task but raise an exception in the middle.

//Job-2
        try {
            spark.range(50,150).map((MapFunction<Long, Integer>) num ->
                    {
                        if (num > 50) {
                            throw new RuntimeException("Atomicity failed");
                        }
                        return Math.toIntExact(num);
                    }, Encoders.INT()
            )write().format("delta").mode("overwrite").option("overwriteSchema", "true").save(FILE_PATH);
        } catch (Exception e) {
            LOGGER.info("failed while OverWriteData");
            throw new RuntimeException("Oops! Atomicity failed");
            LOGGER.info("records created after Job-2 " +   sparkSession.read().format("delta").load(FILE_PATH).count())
        }      

Executing Job-2 gives an exception as expected. But is the record created by Job-1 impacted in any way? No, we still get the count of 100 items in the record created by Job-1. The failed job didn’t create any commit logs and the previous commit of Job-1 still resides consisting of information of its record in a parquet file.

Voila! Our data is not corrupt and in valid state. This ensures Atomicity and Consistency. DeltaLake’s strong schema checking also guarantees Consistency.

Isolation & Durability

Delta Lake takes care of concurrent read-write access by managing Concurrency of commits. This is done using optimistic concurrency control. This means that:

  • when a commit execution starts, the thread snapshots the current DeltaLog.
  • once the commit actions complete, the thread checks if the DeltaLog is updated by another one in the meantime.
    • if not, it records the commit in the DeltaLog.
    • else, it updates its DeltaTable view and attempts again to register the commit after a step of reprocessing, if needed.

This ensures the Isolation property.

All of the transactions made on Delta Lake tables are stored directly to disk. This process satisfies the ACID property of durability, meaning it will persist even in the event of system failure.

You can find source code here

I hope, it will be helpful for you.

Happy Blogging!!

3 thoughts on “Spark: ACID Transaction with Delta Lake4 min read

Leave a Reply

%d bloggers like this: