※This is an English translation of the original Japanese article:https://dev.classmethod.jp/articles/ducklake-database-based-approach-lakehouse/
Hi, this is Sagara.
A new Lakehouse format, "DuckLake," where metadata management is handled by the database, was announced on DuckDB's official blog.
https://duckdb.org/2025/05/27/ducklake
In this article, I'll briefly introduce what DuckLake is and summarize my experience trying it out locally.
What is DuckLake?
First, DuckLake is an open-source Lakehouse format.
In response to the complexity of file-based metadata management in recent formats like Iceberg and Delta Lake, DuckLake takes the approach of moving the entire metadata management layer, including what Iceberg calls the catalog layer, into an SQL database.
https://duckdb.org/2025/05/27/ducklake
The blog post at the link above mentions the following four benefits of DuckLake:
- Simplicity
  - To run DuckLake on a laptop, you just need to install DuckDB and use the DuckLake extension (in this setup, a local DuckDB database file handles catalog and metadata management).
  - No Avro or JSON files; everything can be controlled with SQL.
- Scalability
  - An architecture that separates storage, compute, and metadata management (see the sketch after this list for what pointing the metadata layer at a separate database looks like).
- Speed
  - Unlike traditional Open Table Formats, reading and updating metadata requires no file I/O.
  - Reduces the number of files written for small changes and can handle concurrent modifications.
- Features
  - Operations are SQL-based, enabling ACID-compliant transactions, adding/deleting columns, and changing data types.
  - Data and delete files written to storage by DuckLake are compatible with Iceberg, allowing for metadata-only migration.
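As a concrete illustration of that separation, the metadata catalog can live in a separate database (for example PostgreSQL) while the data files sit on object storage. The snippet below is only a rough sketch: the connection string, bucket path, and alias are placeholders, and the exact ATTACH options should be checked against the DuckLake documentation for your DuckDB version.
-- Rough sketch: metadata catalog in PostgreSQL, Parquet data files on S3
-- (connection details and bucket are placeholders; verify the options against the DuckLake docs)
INSTALL ducklake;
INSTALL postgres;
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=localhost' AS my_lake
  (DATA_PATH 's3://my-bucket/ducklake-data/');
USE my_lake;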
At the same time as the format, a DuckDB extension implementing DuckLake was released, letting DuckDB act as a compute node for DuckLake (available from DuckDB v1.3.0).
The official website and GitHub repository can be accessed from the following links:
https://ducklake.select/
https://github.com/duckdb/ducklake
Trying it Out
I'll briefly try out DuckLake in a local environment.
I'll follow the official documentation provided below.
https://ducklake.select/docs/stable/duckdb/introduction
Test Environment
- Windows 11
- DuckDB v1.3.0
- Launched via an .exe file, following this blog post (link to Japanese article).
Installation
First, install DuckLake.
INSTALL ducklake;
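Depending on your DuckDB build, the extension may be autoloaded when you ATTACH a ducklake: path; if it is not, you can load it explicitly:
LOAD ducklake;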
Create a New DuckLake Database
Execute the following command to create a new DuckLake database. If executed without specifying a path, the DuckDB database file will be created in the location where the .exe file was launched.
ATTACH 'ducklake:my_ducklake.ducklake' AS my_ducklake;
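By default the data files end up in a folder next to the metadata file, as we'll see below. If you want them somewhere else, ATTACH also accepts a DATA_PATH option; a sketch with a placeholder directory, used instead of the plain ATTACH above:
-- Alternative to the ATTACH above: store the Parquet data files under a chosen directory (placeholder path)
ATTACH 'ducklake:my_ducklake.ducklake' AS my_ducklake (DATA_PATH 'my_ducklake_data/');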
Create a New Table
Execute the following command to create a table based on a publicly available CSV file.
USE my_ducklake;
CREATE TABLE nl_train_stations AS
FROM 'https://blobs.duckdb.org/nl_stations.csv';
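As a quick sanity check that the CSV was loaded, ordinary DuckDB SQL works as usual against the new table:
-- Row count and schema of the newly created DuckLake table
SELECT count(*) FROM nl_train_stations;
DESCRIBE nl_train_stations;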
At this stage, a folder is added in the same directory where the DuckDB database file was created. Looking inside the folder, you can see that Parquet files have been created.
You can list the generated files and view the contents of the Parquet files with the following queries.
FROM glob('my_ducklake.ducklake.files/*');
FROM 'my_ducklake.ducklake.files/*.parquet' LIMIT 10;
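Since the catalog itself is just a DuckDB database file, the metadata can also be inspected with plain SQL. A sketch, assuming you open the .ducklake file as a regular DuckDB database in a separate session (the ducklake_* table names come from the DuckLake specification; check the output of SHOW ALL TABLES for the exact names in your version):
-- In a separate DuckDB session: the metadata catalog is an ordinary DuckDB database file
ATTACH 'my_ducklake.ducklake' AS meta;
SHOW ALL TABLES;                    -- lists the ducklake_* catalog tables
FROM meta.main.ducklake_snapshot;   -- assumption: snapshot table name as defined in the DuckLake spec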
Executing an UPDATE Operation on the Table
Next, let's examine the behavior when an UPDATE operation is performed on the table.
Executing the following query will run an UPDATE statement that changes a station name.
UPDATE nl_train_stations SET name_long='Johan Cruijff ArenA' WHERE code = 'ASB';
SELECT name_long FROM nl_train_stations WHERE code = 'ASB';
After the UPDATE, new Parquet files are generated in the folder holding this DuckLake's data. You can check the list of files with the following query.
FROM glob('my_ducklake.ducklake.files/*');
Two new Parquet files have been added: because an UPDATE was performed, one Parquet file holds the new records and another, marked with a -delete suffix, records the old (deleted) rows.
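Since a delete file is itself just Parquet, you can read it directly to see which rows were invalidated. A sketch using a glob on the -delete suffix mentioned above:
-- Peek inside the delete file(s): plain Parquet recording the positions of deleted rows
FROM 'my_ducklake.ducklake.files/*-delete*.parquet';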
Like other Open Table Formats, DuckLake records Snapshots to retain historical versions. In DuckLake, these Snapshots are recorded in a table, and you can check their contents with the following query.
FROM my_ducklake.snapshots();
Furthermore, you can use Snapshots to query the table at a specific point in time. In the following queries, VERSION => 1 shows the data before the UPDATE, and VERSION => 2 shows the data after the UPDATE.
SELECT name_long FROM nl_train_stations AT (VERSION => 1) WHERE code = 'ASB';
SELECT name_long FROM nl_train_stations AT (VERSION => 2) WHERE code = 'ASB';
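Besides version numbers, the DuckLake documentation also describes time travel by timestamp. A sketch with a placeholder timestamp; in practice you would take a value from the timestamp column returned by snapshots():
-- Sketch: query the table as of a point in time (placeholder timestamp; check the exact syntax against the DuckLake docs)
SELECT name_long
FROM nl_train_stations AT (TIMESTAMP => TIMESTAMP '2025-05-27 12:00:00')
WHERE code = 'ASB';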
Finally
Although this was a brief overview, I've introduced DuckLake and tried it out following the official documentation.
The simple architecture of "Parquet files on storage, with all metadata including Snapshots managed by the database" is excellent, as it allows everything to be inspected via SQL, just as advertised!
I'm excited to see how DuckLake will be integrated not only into MotherDuck but also into various other products in the future!