Home Data Engineering Data DIY Querying Delta Lake with Scala, Java, and Python

Querying Delta Lake with Scala, Java, and Python

Audio version of the article

 

Today, we’re happy to announce that you can natively query your Delta Lake with Scala and Java (via the Delta Standalone Reader) and Python (via the Delta Rust API). Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark™ APIs. The project has been deployed at thousands of organizations and processes more exabytes of data each week, becoming an indispensable pillar in data and AI architectures. More than 75% of the data scanned on the Databricks Platform is on Delta Lake!

In addition to Apache Spark, Delta Lake has integrations with Amazon Redshift, Redshift Spectrum, Athena, Presto, Hive, and more; you can find more information in the Delta Lake Integrations. For this blog post, we will discuss the most recent release of the Delta Standalone Reader and the Delta Rust API that allows you to query your Delta Lake with Scala, Java, and Python without Apache Spark.

Delta Standalone Reader

The Delta Standalone Reader (DSR) is a JVM library that allows you to read Delta Lake tables without the need to use Apache Spark; i.e. it can be used by any application that cannot run Spark. The motivation behind creating DSR is to enable better integrations with other systems such as Presto, Athena, Redshift Spectrum, Snowflake, and Apache Hive. For Apache Hive, we rewrote it using DSR to get rid of the embedded Spark in the new release.

To use DSR using sbt include delta-standalone as well as hadoop-client and parquet-hadoop.

libraryDependencies ++= Seq( "io.delta" %% "delta-standalone" % "0.2.0", "org.apache.hadoop" % "hadoop-client" % "2.7.2", "org.apache.parquet" % "parquet-hadoop" % "1.10.1")

Using DSR to query your Delta Lake table

Below are some examples of how to query your Delta Lake table in Java.

Reading the Metadata

After importing the necessary libraries, you can determine the table version and associated metadata (number of files, size, etc.) as noted below.

import io.delta.standalone.DeltaLog;
import io.delta.standalone.Snapshot;
import io.delta.standalone.data.CloseableIterator;
import io.delta.standalone.data.RowRecord;

import org.apache.hadoop.conf.Configuration;

DeltaLog log = DeltaLog.forTable(new Configuration(), "[DELTA LOG LOCATION]");

// Returns the current snapshot
log.snapshot();

// Returns the version 1 snapshot
log.getSnapshotForVersionAsOf(1);

// Returns the snapshot version
log.snapshot.getVersion();

// Returns the number of data files
log.snapshot.getAllFiles().size();

Reading the Delta Table

To query the table, open a snapshot and then iterate through the table as noted below.

// Create a closeable iterator
CloseableIterator iter = snapshot.open();

RowRecord row = null;
int numRows = 0;

// Schema of Delta table is {long, long, string}
while (iter.hasNext()) {
row = iter.next();
      numRows++;

      Long c1 = row.isNullAt("c1") ? null : row.getLong("c1");
      Long c2 = row.isNullAt("c2") ? null : row.getLong("c2");
      String c3 = row.getString("c3");
      System.out.println(c1 + " " + c2 + " " + c3);
}

// Sample output
175 0 foo-1
176 1 foo-0
177 2 foo-1
178 3 foo-0
179 4 foo-1

Requirements

DSR has the following requirements:

  • JDK 8 or above
  • Scala 2.11 or Scala 2.12
  • Dependencies on parquet-hadoop and hadoop-client 

For more information, please refer to the Java API docs or Delta Standalone Reader wiki.

Delta Rust API

delta.rs is an experimental interface to Delta Lake for Rust. This library provides low-level access to Delta tables and is intended to be used with data processing frameworks like datafusionballistarust-dataframe, and vega. It can also act as the basis for native bindings in other languages such as Python, Ruby, or Golang.

QP Hou and R. Tyler Croy at Scribd use Delta Lake to enable the world’s largest digital library are the initial creators of this API. The Delta Rust API has quickly gained traction in the community with a special callout of community-driven Azure support within weeks after the initial release.

Reading the Metadata (Cargo)

You can use the API or CLI to inspect the files of your Delta Lake table as well as provide the metadata information; below are sample commands using the CLI via cargo. Once the 0.2.0 release of delta.rs has been published, `cargo install deltalake` will provide the delta-inspect binary.

To inspect the files, check out the source and use delta-inspect files:

❯ cargo run --bin delta-inspect files ./tests/data/delta-0.2.0

part-00000-cb6b150b-30b8-4662-ad28-ff32ddab96d2-c000.snappy.parquet part-00000-7c2deba3-1994-4fb8-bc07-d46c948aa415-c000.snappy.parquet part-00001-c373a5bd-85f0-4758-815e-7eb62007a15c-c000.snappy.parquet

To inspect the metadata, use delta-inspect info:

❯ cargo run --bin delta-inspect info ./tests/data/delta-0.2.0 DeltaTable(./tests/data/delta-0.2.0) version: 3 metadata: GUID=22ef18ba-191c-4c36-a606-3dad5cdf3830, name=None, description=None, partitionColumns=[], configuration={} min_version: read=1, write=2 files count: 3

Reading the Metadata (Python)

You can also use the delta.rs to query Delta Lake using Python via the delta.rs Python bindings.

To obtain the Delta Lake version and files, use the .version() and .files() methods respectively.

from deltalake import DeltaTable dt = DeltaTable("../rust/tests/data/delta-0.2.0")

# Get the Delta Lake Table version dt.version()

# Example Output 3

# List the Delta Lake table files dt.files()

# Example Output ['part-00000-cb6b150b-30b8-4662-ad28-ff32ddab96d2-c000.snappy.parquet', 'part-00000-7c2deba3-1994-4fb8-bc07-d46c948aa415-c000.snappy.parquet', 'part-00001-c373a5bd-85f0-4758-815e-7eb62007a15c-c000.snappy.parquet']

Reading the Delta Table (Python)

To read a Delta table using the delta.rs Python bindings, you will need to convert the Delta table into a PyArrow Table and Pandas Dataframe.

# Import Delta Table
from deltalake import DeltaTable

# Read the Delta Table using the Rust API
dt = DeltaTable("../rust/tests/data/simple_table")

# Create a Pandas Dataframe by initially converting the Delta Lake
# table into a PyArrow table
df = dt.to_pyarrow_table().to_pandas()
# Query the Pandas table
Df

# Example output
0	5
1	7
2	9

You can also use Time Travel and load a previous version of the Delta table by specifying the version number by using the load_version method.

# Load version 2 of the table
dt.load_version(2)

Notes

Currently, you can also query your Delta Lake table through delta.rs using Python and Ruby, but the underlying Rust APIs should be straightforward to integrate into Golang or other languages too.. Refer to delta.rs for more information. There’s lots of opportunity to contribute to Delta.rs, so be sure to check out the open issues! https://github.com/delta-io/delta.rs/issues

This article has been published from the source link without modifications to the text. Only the headline has been changed.

Source link

 

- Advertisment -

Most Popular

- Advertisment -