Fundamentals of Delta Lake


You might be hearing a lot about Delta Lake these days. That is because it introduces features that were not previously available in Apache Spark.

Why Delta Lake?

Plain Spark writes are not atomic, so data consistency is not guaranteed. And when the metadata itself becomes big data, it is difficult to manage. If you have worked on a lambda architecture, you know how painful it is to implement the same aggregations separately for the hot layer and the cold layer. Sometimes an unwanted write to a table or location overwrites existing data, and rolling back to the previous state of the data is a tedious task.

What is Delta Lake?

Delta Lake is a project that was developed by Databricks and is now open sourced under the Linux Foundation. It is an open source storage layer that sits on top of your existing data lake and brings reliability to it. Delta Lake provides ACID transactions, scalable metadata handling, and unified streaming and batch data processing, and it is fully compatible with the Apache Spark APIs.

How does it work?

Delta Lake provides a storage layer on top of your existing cloud data lake, acting as a middle layer between the Spark runtime and the cloud storage. When you store data as a Delta table, by default the data is stored as Parquet files in your cloud storage.

Delta Lake generates a delta log entry for each committed transaction. The log is made up of delta files, stored as JSON, which record the operations that occurred, details about the latest snapshot of the table, and statistics about the data. Delta files are sequentially numbered JSON files and together make up the log of all changes that have occurred to a table.

Now let’s code and see how it works.

I am reading data from a CSV file that contains CustomerID and Country columns. Let me show you how it looks.
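Something along the following lines produces that dataframe. This is a minimal sketch: the file path is an assumption, and it presumes a Spark session with Delta Lake available (for example a Databricks notebook, or a cluster with the delta-spark package configured).

from pyspark.sql import SparkSession

# Reuse the active session if one exists (e.g. in a Databricks notebook).
spark = SparkSession.builder.appName("delta-lake-demo").getOrCreate()

# Assumed source file with CustomerID and Country columns.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/customers.csv"))

df.show(5)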

Now once we have a dataframe, it is very easy to create a Delta table. All you have to do is specify the format as delta and provide a location to store the table.
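A sketch of that write, partitioning by Country to match the folder layout described below; the target path is an assumption.

# Write the dataframe as a Delta table, partitioned by Country.
delta_path = "/delta/customer-data"

(df.write
   .format("delta")
   .partitionBy("Country")
   .save(delta_path))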

Now let’s see what we have in this directory.
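One way to peek at the location (dbutils.fs.ls is specific to Databricks; elsewhere you could use os.listdir or your cloud storage CLI):

# List the contents of the table location.
for entry in dbutils.fs.ls(delta_path):
    print(entry.name)

# Roughly what you should see: one folder per Country partition plus _delta_log, e.g.
#   Country=Australia/
#   Country=France/
#   Country=India/
#   _delta_log/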

Since we have partitioned the data by country, we have three subfolders under the customer-data folder we specified. These folders contain the actual data in Parquet format. It is important to note that we also have a _delta_log folder, where all the log information for this table is stored.

This is the first write to the table, so we have 00000000000000000000.json. The file name is auto-incrementing: it increases for each action we perform that changes the content of the table. I have downloaded the file to show its contents, and this is how it looks.
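If you don't want to download the file, you can also inspect it in place. Each line of a delta log file is a single JSON action, so reading it with spark.read.json gives one column per action type:

# Inspect the first commit file; expect commitInfo, protocol, metaData and add columns.
log_file = delta_path + "/_delta_log/00000000000000000000.json"
spark.read.json(log_file).printSchema()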

It has commitInfo, protocol, metaData and add JSON actions. commitInfo is self-explanatory. protocol has minReaderVersion and minWriterVersion, which are the minimum Delta Lake reader and writer versions required to read and write this table.

metaData contains an id, which is a GUID; format has a provider field, which is the name of the encoding used for the files; schema is the schema of the actual data; and partitionColumns is an array of the partition columns.

The add action contains an entry for each file being added, along with its partition values. path holds the actual path of the data file, and the action also contains statistics about the data and other fields.

Now let's remove some data from the table and see what happens to its contents. Before that, let me show you the data in this Delta table.
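Reading the Delta table back is just another Spark read with the delta format:

# Show the current contents of the Delta table.
customers = spark.read.format("delta").load(delta_path)
customers.show()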

Now we will remove the France data and see what happens to the contents of the table location.
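A sketch of the delete using the DeltaTable API from the delta-spark package; the condition assumes the Country column holds plain country names:

from delta.tables import DeltaTable

# Delete all rows for France; this adds a new commit to _delta_log
# instead of rewriting the whole directory.
delta_table = DeltaTable.forPath(spark, delta_path)
delta_table.delete("Country = 'France'")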

You will see that the files for France are still there; they have not been physically removed. Don't get confused: delete removes the data from the latest version of the Delta table, but it does not remove it from physical storage until the old versions are explicitly vacuumed.
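To actually reclaim the storage you would run vacuum. Note that by default vacuum only deletes files older than the retention period (seven days), so a file removed just now will not be deleted unless you explicitly lower that threshold:

# Remove files that are no longer referenced by the current table version
# and are older than the default retention period.
delta_table.vacuum()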

But let's see what has been added to _delta_log. A new file, 00000000000000000001.json, has been created.

If we look at the contents of 00000000000000000001.json, we see a remove action that records which file paths have to be removed.

Now let's update the table and see the logs again.
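A hedged sketch of such an update; the condition and the new value here are purely illustrative and not from the original walkthrough:

# Update a single customer's country; the filter value 12345 is made up.
delta_table.update(
    condition="CustomerID = 12345",
    set={"Country": "'Australia'"}
)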

Now we have both India and Australia, and you can look at the contents of the new JSON file yourself to see how it changes.
