Databricks is a great platform when it comes to data management and governance, mostly thanks to the Unity Catalog. But Spark as an engine for processing the data is just okay-ish, especially when the data is not really big. Newer engines like polars, datafusion or duckdb are better suited for this and provide interesting options.
Sure, you can run anything you want in Databricks notebooks and workflows by just installing the library. But the interesting part is accessing the data stored in the Databricks Unity Catalog.
There are effectively two ways to query the data with duckdb inside Databricks:
- the obvious one: read the data from the Unity Catalog table with Spark, convert it to pandas/arrow, and use duckdb from there
- the direct one: read the Delta files directly from the underlying storage using duckdb, without Spark in between
For the impatient: you can find the code for both ways in this gist on GitHub. This also includes all the imports, which I mostly omitted in the examples below for the sake of brevity. Also, you need to install duckdb, e.g. by %pip install duckdb in a notebook cell.
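A typical install cell could look like this (a sketch; the Python restart is the usual Databricks pattern so the freshly installed library is picked up):

```python
%pip install duckdb
dbutils.library.restartPython()
```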
Read with pyspark, process with duckdb
We use standard pyspark to read the data and convert it to arrow format afterwards. Then we can query it using duckdb.
```python
import duckdb

# read the data with spark
spark_df = spark.read.table("samples.nyctaxi.trips")

# create an arrow table from the spark dataframe that can be queried using duckdb.
# the name of the variable (here `trips`) is the name of the table in duckdb.
# duckdb discovers this automatically.
trips = spark_df.toArrow()

duckdb.sql("SELECT COUNT(*) FROM trips")
```
toArrow() is a new method added in Spark 4. Spark 4 is not released yet, but Databricks regularly adds new (unreleased) features from the open-source version to their runtime.
The method works like the toPandas() method, but instead of returning a pandas DataFrame, it returns an Arrow Table. The same limitations apply: as with toPandas(), all data is loaded into the memory of the driver node, so this will not work for datasets larger than memory. Arrow is a bit more memory-efficient than pandas, so it might cope with somewhat larger datasets, and the conversion itself is also faster. You should therefore prefer Arrow over pandas, although duckdb can also query pandas DataFrames.
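For comparison, a minimal sketch of the pandas route (the same driver-memory limitation applies):

```python
# duckdb can also query a local pandas DataFrame by its variable name
trips_pdf = spark_df.toPandas()
duckdb.sql("SELECT COUNT(*) FROM trips_pdf")
```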
When you want to convert the duckdb result back to a Spark dataframe, e.g. for using the built-in Databricks visualization, you can pass an Arrow Table to SparkSession.createDataFrame(), starting with Databricks Runtime 16:
```python
result = duckdb.sql(sql).arrow()
display(
    spark.createDataFrame(result)
)
```
In older runtimes you could use Pandas instead of arrow as an intermediate format.
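A minimal sketch of that variant, assuming the duckdb result fits into driver memory:

```python
# .df() returns the duckdb result as a pandas DataFrame
result_pdf = duckdb.sql(sql).df()
display(spark.createDataFrame(result_pdf))
```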
Read delta files directly with duckdb
While the above way works, it has some limitations. Most notably:
- It loads all data into the memory of the driver node
- Reading with Spark and converting to arrow adds a performance penalty
duckdb is actually pretty good at reading “larger than memory” data directly from cloud blob storage. So it would be nice to use these capabilities.
To do so, we use the Databricks Unity Catalog feature of temporary table credentials. This is an API that provides the URL to the Delta files in the blob storage and a temporary token, valid for one hour, to read this data (in Azure it's a SAS token).
In order to use this API:
- you must authenticate against the API
- the principal you authenticate with must have EXTERNAL_USE_SCHEMA permissions (a hypothetical grant is sketched below)
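If the permission is missing, a metastore admin or the schema owner can grant it. A hypothetical example (schema and principal names are placeholders):

```python
# hypothetical grant; replace schema and principal with your own
spark.sql(
    "GRANT EXTERNAL USE SCHEMA ON SCHEMA my_catalog.my_schema TO `duckdb-readers`"
)
```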
```python
# imports (omitted in the other snippets for brevity)
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import TableOperation

w = WorkspaceClient(
    host=spark.conf.get("spark.databricks.workspaceUrl")
)

def get_temporary_credentials_for_table(table: str):
    table_id = w.tables.get(table).table_id
    return w.temporary_table_credentials.generate_temporary_table_credentials(
        table_id=table_id, operation=TableOperation.READ
    )

cred = get_temporary_credentials_for_table("samples.nyctaxi.trips")
```
The credential API does not accept a fully qualified table name, but only a table ID, which we have to retrieve first. Be aware that you might need to upgrade the databricks-sdk on your cluster to the latest version, as the temporary credentials API is quite new.
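The returned credential object contains, among other things, the storage URL of the table and the temporary Azure SAS token, which we will use below:

```python
# the two fields used in the rest of this post
print(cred.url)                                  # abfss:// URL of the delta files
print(cred.azure_user_delegation_sas.sas_token)  # SAS token, valid for about an hour
```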
After we have retrieved the credential, we can store it in duckdb.
Note: setting azure_transport_option_type to curl is needed, as otherwise duckdb struggles to handle certificates correctly on Databricks.
```python
import re

storage_account_name = re.search('@(.*).dfs.', cred.url).group(1)
sql = f"""
SET azure_transport_option_type = 'curl';
CREATE OR REPLACE SECRET (
    TYPE AZURE,
    CONNECTION_STRING 'AccountName={storage_account_name};SharedAccessSignature={cred.azure_user_delegation_sas.sas_token}'
);
"""
duckdb.sql(sql)
```
And finally query the files using duckdb:
1sql = f""" 2SELECT * FROM delta_scan("{cred.url}") 3""" 4duckdb.sql(sql).show()
It is also possible to use duckdb outside of Databricks to process data stored in the Unity Catalog. You would use the same approach as above (get the credentials via the Databricks API, store them as a secret in duckdb, run the query). But as you call the API from outside of Databricks, the metastore must be enabled for external data access. And of course, the blob storage must be reachable network-wise from wherever you want to run the query.
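A sketch of the client setup outside of Databricks, assuming authentication with a personal access token (any auth method supported by the databricks-sdk would work; URL and token source are placeholders):

```python
import os
from databricks.sdk import WorkspaceClient

# placeholder workspace URL and token source; adjust to your environment
w = WorkspaceClient(
    host="https://<your-workspace>.azuredatabricks.net",
    token=os.environ["DATABRICKS_TOKEN"],
)

# the same get_temporary_credentials_for_table helper from above can then be used
cred = get_temporary_credentials_for_table("samples.nyctaxi.trips")
```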
Closing remarks
The temporary table credentials API is a really nice addition to the Unity Catalog. In my opinion it is absolutely needed for Databricks, so that other engines can be used as well. Sticking to Spark as the only processing option would otherwise rule out some modern, efficient processing techniques. Maybe we will get an even smoother integration of duckdb and the Databricks-managed Unity Catalog, something along the lines of duck/uc_catalog, which is a PoC for the open-source Unity Catalog with some Azure Databricks support already built in.