Original Japanese article: AWS Lambda×DuckDB×PyIcebergによるETLの実装
Introduction
I'm Aki, an AWS Community Builder (@jitepengin).
In a previous article, I demonstrated how to implement lightweight ETL using AWS Lambda and DuckDB. This time, I aim to implement a Lambda-based ETL that supports the recently popular Apache Iceberg format.
While AWS Glue or EMR can easily handle this, they may not be cost-effective for lightweight datasets. In such cases, an AWS Lambda-based ETL, as demonstrated here, proves effective.
Note: When I used Glue with my personal account to manipulate Iceberg tables, it cost nearly 20,000 yen. Therefore, this approach is also effective for personal use.
Architecture used in this article
The architecture is straightforward: Lambda is triggered by an S3 file upload and writes to an S3 bucket in Iceberg format.
Key libraries used in Lambda include:
- DuckDB — A fast in-memory SQL database engine.
- PyArrow — Efficient data transformation and transfer via the Arrow format.
- PyIceberg — Enables working with Iceberg tables using AWS Glue Catalog.
About the Libraries
DuckDB
DuckDB is an embedded OLAP database engine designed for analytical queries.
It’s lightweight, runs in memory, and works efficiently even in restricted environments like Lambda.
It’s particularly well-suited for batch analytics and ETL workloads.
https://duckdb.org/
PyArrow
PyArrow is the Python binding for Apache Arrow, a high-performance columnar memory format optimized for analytics and distributed computing.
It allows fast, zero-copy data transfer between systems and languages.
https://arrow.apache.org/
PyIceberg
PyIceberg is the Python implementation of Apache Iceberg, a table format designed for large, cloud-native data lakes.
It enables Python applications to read, write, and manage Iceberg tables seamlessly.
https://py.iceberg.apache.org/
Integrating with Lambda
Integration with Lambda is achieved using Lambda layers. By executing the following command, you can download the necessary libraries and package them into a ZIP file:
```shell
mkdir python
pip install -t python --platform manylinux2014_x86_64 --only-binary=:all: "pyiceberg[glue,duckdb]"
```
Adding the `glue` and `duckdb` extras installs Glue catalog support, DuckDB, and PyArrow in one step. Note that the result may exceed Lambda's 250 MB (unzipped) layer size limit, so I removed boto3, which the Lambda Python runtime already provides, to stay within it.
Alternatively, containerizing the Lambda function might be a more straightforward approach.
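A hypothetical Dockerfile for that container-based approach might look like the following (the handler file name is an assumption). Container images sidestep the 250 MB layer limit, since Lambda allows images up to 10 GB:

```dockerfile
FROM public.ecr.aws/lambda/python:3.12

# Install the libraries directly into the image; no layer size limit applies
RUN pip install "pyiceberg[glue,duckdb]"

# Copy the handler code (file name is a placeholder for your own module)
COPY lambda_function.py ${LAMBDA_TASK_ROOT}

CMD ["lambda_function.lambda_handler"]
```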
Sample Code
The sample code adds a `WHERE VendorID = 1` filter to the ETL process.
This demonstrates that you can implement SQL-like processing directly within Lambda using DuckDB.
```python
import duckdb
import pyarrow as pa
from pyiceberg.catalog.glue import GlueCatalog


def lambda_handler(event, context):
    try:
        # Connect to an in-memory DuckDB database and set a writable home directory
        duckdb_connection = duckdb.connect(database=':memory:')
        duckdb_connection.execute("SET home_directory='/tmp'")

        # Install and load the HTTPFS extension for S3 access
        duckdb_connection.execute("INSTALL httpfs;")
        duckdb_connection.execute("LOAD httpfs;")

        # Build the S3 path of the uploaded file from the event
        s3_bucket = event['Records'][0]['s3']['bucket']['name']
        s3_object_key = event['Records'][0]['s3']['object']['key']
        s3_input_path = f"s3://{s3_bucket}/{s3_object_key}"

        # Read the Parquet file, filter it in SQL, and fetch the result as Arrow
        query = f"""
            SELECT * FROM read_parquet('{s3_input_path}') WHERE VendorID = 1
        """
        result_arrow_table = duckdb_connection.execute(query).fetch_arrow_table()

        # Set up the Glue Catalog to access the Iceberg table
        catalog = GlueCatalog(region_name="ap-northeast-1", database="icebergdb", name="my_catalog")

        # Load the Iceberg table
        namespace = "icebergdb"
        table_name = "yellow_tripdata"
        iceberg_table = catalog.load_table(f"{namespace}.{table_name}")

        # Append the filtered data to the Iceberg table
        iceberg_table.append(result_arrow_table)
    except Exception as e:
        print(f"Error: {e}")
        raise  # re-raise so Lambda records the invocation as failed
```
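One caveat with the handler above: S3 event notifications URL-encode the object key (spaces arrive as `+`, other characters as `%XX` escapes), so the key should be decoded before building the `s3://` path. A stdlib-only sketch, using a hypothetical event record:

```python
from urllib.parse import unquote_plus

# Minimal stand-in for one record of an S3 event notification
record = {"s3": {"bucket": {"name": "my-bucket"},
                 "object": {"key": "raw/yellow_tripdata+2024.parquet"}}}

bucket = record["s3"]["bucket"]["name"]
# Decode the URL-encoded key ('+' becomes a space, '%XX' escapes are resolved)
key = unquote_plus(record["s3"]["object"]["key"])

print(f"s3://{bucket}/{key}")  # s3://my-bucket/raw/yellow_tripdata 2024.parquet
```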
Results
The sample data used in this demo is the widely used NYC taxi trip record dataset:
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Before
After
We successfully registered data into an Iceberg table (an open table format, OTF) using this approach.
Advantages and Disadvantages
Advantages
Easy Setup: Simple integration with Lambda layers.
Cost-Effective Iceberg Operations: Operate Iceberg without incurring costs associated with Glue or EMR.
Simple Development: Utilize functions provided by DuckDB for straightforward development.
SQL-Like Processing: Perform operations using PostgreSQL-compatible SQL syntax.
Efficient Processing: In-memory processing enables efficient data handling within Lambda.
Real-Time ETL: S3 triggers allow immediate execution of ETL processes upon file upload.
Disadvantages
Memory Limitations: Lambda's maximum memory size is 10,240 MB; larger datasets may exceed this limit.
Execution Time Limitations: Lambda's maximum execution time is 15 minutes; large datasets may call for alternatives such as Glue, EMR, or container-based services.
Further Expansion
This implementation simply retrieves data and registers it. For more complex processing, consider incorporating additional filters, validation checks, or transformations within the Lambda function.
Write-Audit-Publish (WAP) Pattern
Since Iceberg supports features like schema evolution and time-travel, you can implement the Write-Audit-Publish (WAP) pattern to help ensure data quality before committing data to production tables. Using Glue Data Quality can be convenient for this purpose.
DuckDB Iceberg Extension
DuckDB also offers an Iceberg extension that can be useful for querying and transforming Iceberg tables locally.
Note: Direct writing to Iceberg tables via the DuckDB extension is currently not supported, but it can still help with pre-processing or local validation.
Conclusion
In this article, we implemented an ETL process using AWS Lambda for Iceberg tables.
While Glue or EMR are common choices for Iceberg on AWS, they may not be cost-effective for small datasets. In such cases, the combination shown here is a cost-effective alternative.
Using Lambda allows real-time ETL triggered by S3 events, enabling a lightweight and simple data processing workflow.
Of course, every service has its place, so always consider requirements and constraints when choosing your ETL platform.
Iceberg tables offer benefits such as schema evolution and time-travel, making them increasingly relevant in modern data architectures.
We hope this article helps those exploring lightweight ETL and real-time data processing for Iceberg tables.
For readers who want to see the original Japanese article or request translations of other articles, please visit my Zenn page: AWS Lambda×DuckDB×PyIcebergによるETLの実装 (Japanese)
Feel free to request English translations of other articles as well!