Databricks is a great platform when it comes to data management and governance, mostly thanks to the Unity Catalog. But Spark as an engine for processing the data is just okay-ish, especially when the data is not really big. Newer engines like polars, datafusion or duckdb are better suited for this and provide interesting options.
Sure, you can run anything you want in Databricks notebooks and workflows by just installing the library. But the interesting part is accessing the data stored in the Databricks Unity Catalog.
There are effectively two ways to query the data with duckdb inside Databricks:
- the obvious one: read the data from the Unity Catalog table with Spark, convert it to pandas/arrow, and use duckdb from there
- the direct one: read the Delta files directly from the underlying storage using duckdb, without Spark in between
For the impatient: you can find the code for both ways in this gist on GitHub. This also includes all the imports, which I mostly omitted in the examples below for the sake of brevity. Also, you need to install duckdb, e.g. by %pip install duckdb in a notebook cell.
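A typical install cell could look like this (a sketch; the Python restart is the usual Databricks pattern so the freshly installed library is picked up):

```python
%pip install duckdb
dbutils.library.restartPython()
```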
Read with pyspark, process with duckdb
We use standard pyspark to read the data and convert it to arrow format afterwards. Then we can query it using duckdb.
```python
import duckdb

# read the data with spark
spark_df = spark.read.table("samples.nyctaxi.trips")

# create an arrow table from the spark dataframe that can be queried using duckdb.
# the name of the variable (here `trips`) is the name of the table in duckdb.
# duckdb discovers this automatically.
trips = spark_df.toArrow()

duckdb.sql("SELECT COUNT(*) FROM trips")
```
toArrow() is a new method added in Spark 4. Spark 4 is not released yet, but Databricks regularly adds new (unreleased) features from the open-source version to their runtime.
The method works like the toPandas() method, but instead of returning a pandas DataFrame, it returns an Arrow Table. The same limitations apply: as with toPandas(), all data is loaded into the memory of the driver node, so this will not work for datasets larger than memory. Arrow is a bit more memory-efficient than pandas, so it might cope with somewhat larger datasets, and the conversion itself is also faster. You should therefore prefer Arrow over pandas, although duckdb can also query pandas DataFrames.
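For comparison, a minimal sketch of the pandas route (the same driver-memory limitation applies):

```python
# duckdb can also query a local pandas DataFrame by its variable name
trips_pdf = spark_df.toPandas()
duckdb.sql("SELECT COUNT(*) FROM trips_pdf")
```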
When you want to convert the duckdb result back to a Spark dataframe, e.g. for using the built-in Databricks visualization, you can pass an Arrow Table to SparkSession.createDataFrame(), starting with Databricks Runtime 16:
```python
result = duckdb.sql(sql).arrow()
display(
    spark.createDataFrame(result)
)
```
In older runtimes you could use Pandas instead of arrow as an intermediate format.
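A minimal sketch of that variant, assuming the duckdb result fits into driver memory:

```python
# .df() returns the duckdb result as a pandas DataFrame
result_pdf = duckdb.sql(sql).df()
display(spark.createDataFrame(result_pdf))
```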
Read delta files directly with duckdb
While the above way works, it has some limitations. Most notably:
- It loads all data into the memory of the driver node
- Reading with Spark and converting to arrow adds a performance penalty
duckdb is actually pretty good at reading “larger than memory” data directly from cloud blob storage. So it would be nice to use these capabilities.
To do so, we use the Databricks Unity Catalog feature of temporary table credentials. This is an API that provides the URL to the Delta files in the blob storage and a temporary token, valid for one hour, to read this data (in Azure it's a SAS token).
In order to use this API:
- you must authenticate against the API
- the principal you authenticate with must have EXTERNAL_USE_SCHEMA permissions (a hypothetical grant is sketched below)
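If the permission is missing, a metastore admin or the schema owner can grant it. A hypothetical example (schema and principal names are placeholders):

```python
# hypothetical grant; replace schema and principal with your own
spark.sql(
    "GRANT EXTERNAL USE SCHEMA ON SCHEMA my_catalog.my_schema TO `duckdb-readers`"
)
```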
```python
# imports (omitted in the other snippets for brevity)
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import TableOperation

w = WorkspaceClient(
    host=spark.conf.get("spark.databricks.workspaceUrl")
)

def get_temporary_credentials_for_table(table: str):
    table_id = w.tables.get(table).table_id
    return w.temporary_table_credentials.generate_temporary_table_credentials(
        table_id=table_id, operation=TableOperation.READ
    )

cred = get_temporary_credentials_for_table("samples.nyctaxi.trips")
```
The credential API does not accept a fully qualified table name, but only a table ID, which we have to retrieve first. Be aware that you might need to upgrade the databricks-sdk on your cluster to the latest version, as the temporary credentials API is quite new.
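The returned credential object contains, among other things, the storage URL of the table and the temporary Azure SAS token, which we will use below:

```python
# the two fields used in the rest of this post
print(cred.url)                                  # abfss:// URL of the delta files
print(cred.azure_user_delegation_sas.sas_token)  # SAS token, valid for about an hour
```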
After we have retrieved the credential, we can store it in duckdb.
Note: setting azure_transport_option_type to curl is needed, as otherwise duckdb struggles to handle certificates correctly on Databricks.
```python
import re

storage_account_name = re.search('@(.*).dfs.', cred.url).group(1)
sql = f"""
SET azure_transport_option_type = 'curl';
CREATE OR REPLACE SECRET (
    TYPE AZURE,
    CONNECTION_STRING 'AccountName={storage_account_name};SharedAccessSignature={cred.azure_user_delegation_sas.sas_token}'
);
"""
duckdb.sql(sql)
```
And finally query the files using duckdb:
1sql = f""" 2SELECT * FROM delta_scan("{cred.url}") 3""" 4duckdb.sql(sql).show()
It is also possible to use duckdb outside of Databricks to process data stored in the Unity Catalog. You would use the same approach as above (get the credentials via the Databricks API, store them as a secret in duckdb, run the query). But as you call the API from outside of Databricks, the metastore must be enabled for external data access. And of course, the blob storage must be reachable network-wise from wherever you want to run the query.
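A sketch of the client setup outside of Databricks, assuming authentication with a personal access token (any auth method supported by the databricks-sdk would work; URL and token source are placeholders):

```python
import os
from databricks.sdk import WorkspaceClient

# placeholder workspace URL and token source; adjust to your environment
w = WorkspaceClient(
    host="https://<your-workspace>.azuredatabricks.net",
    token=os.environ["DATABRICKS_TOKEN"],
)

# the same get_temporary_credentials_for_table helper from above can then be used
cred = get_temporary_credentials_for_table("samples.nyctaxi.trips")
```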
Closing remarks
The temporary table credentials API is a really nice addition to the Unity Catalog. In my opinion it is absolutely needed for Databricks, so that other engines can be used as well. Sticking to Spark as the only processing option would otherwise rule out some modern, efficient processing techniques. Maybe we will get an even smoother integration of duckdb and the Databricks-managed Unity Catalog, something along the lines of duck/uc_catalog, which is a PoC for the open-source Unity Catalog with some Azure Databricks support already built in.