Lightweight ETL with AWS Glue Python Shell, DuckDB, and PyIceberg

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

I previously posted an article about implementing ETL with AWS Lambda × DuckDB × PyIceberg:
https://zenn.dev/penginpenguin/articles/77d4a9b1e90e3a
https://dev.to/aws-builders/lightweight-etl-with-aws-lambda-duckdb-and-pyiceberg-1l5p

In this post I’ll implement an ETL that writes data in Apache Iceberg format using AWS Glue Python Shell, which—like Lambda—is often chosen for lightweight ETL workloads.

When Glue (Spark) or EMR is too costly, and Lambda’s 15-minute timeout is a concern, AWS Glue Python Shell can be an effective middle ground.


What is AWS Glue Python Shell?

AWS Glue Python Shell is a Glue execution environment that runs Python scripts without Apache Spark. Compared with Lambda’s 15-minute limit, it supports much longer batch jobs—the default maximum is 48 hours—so it’s suitable for longer-running ETL tasks. It’s serverless, so you don’t manage infrastructure and you pay for what you use.

Compared to Glue Spark or EMR, Glue Python Shell is often cheaper for lightweight ETL and data transformations. One downside versus Lambda, however, is that it has fewer native trigger options: Glue Python Shell does not directly support S3 object-created or EventBridge triggers in the way Lambda does. Typical approaches are to invoke the Glue Python Shell job from a Lambda function or through Glue Workflows. Despite these constraints, used well it can significantly expand your processing options.


Architecture used in this article

Here’s the architecture covered in this post: when a file is uploaded to S3, a Lambda function is triggered. The Lambda only starts a Glue Python Shell job and passes the S3 path; the actual ETL runs inside the Glue Python Shell and writes Iceberg-format data back to S3. This lets you process workloads that exceed Lambda’s 15-minute limit.

(Architecture diagram: file upload to S3 → Lambda trigger → Glue Python Shell ETL → Iceberg table on S3)

Key libraries used:

  • DuckDB — A fast in-memory SQL database engine.
  • PyArrow — Efficient data transformation and transfer via the Arrow format.
  • PyIceberg — Enables working with Iceberg tables using AWS Glue Catalog.

About the Libraries

DuckDB

DuckDB is an embedded OLAP database engine designed for analytical queries.

It’s lightweight, runs in memory, and works efficiently even in restricted environments like Lambda.

It’s particularly well-suited for batch analytics and ETL workloads.

https://duckdb.org/
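
As a minimal sketch (the file name and column are made up for illustration), DuckDB can query a Parquet file with plain SQL and hand the result straight to PyArrow:

import duckdb

# Open an in-memory DuckDB database: no server and no files to manage.
con = duckdb.connect(":memory:")

# Query a Parquet file directly with SQL; path and column are placeholders.
result = con.execute(
    "SELECT VendorID, count(*) AS trips FROM read_parquet('sample.parquet') GROUP BY VendorID"
)

# Convert the result to a PyArrow table for downstream processing.
arrow_table = result.arrow()
print(arrow_table.num_rows)

con.close()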

PyArrow

PyArrow is the Python binding for Apache Arrow, a high-performance columnar memory format optimized for analytics and distributed computing.

It allows fast, zero-copy data transfer between systems and languages.

https://arrow.apache.org/
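
For illustration only, here is a tiny PyArrow snippet that builds a columnar table in memory (the column names are arbitrary examples):

import pyarrow as pa

# Build a small columnar table in memory; columns are arbitrary examples.
table = pa.table({
    "VendorID": pa.array([1, 1, 2], type=pa.int32()),
    "fare_amount": pa.array([7.5, 12.0, 9.25], type=pa.float64()),
})

print(table.schema)
print(table.num_rows)  # 3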

PyIceberg

PyIceberg is the Python implementation of Apache Iceberg, a table format designed for large, cloud-native data lakes.

It enables Python applications to read, write, and manage Iceberg tables seamlessly.

https://py.iceberg.apache.org/
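
As a rough sketch (catalog name, warehouse path, and table identifier are placeholders), PyIceberg can load an Iceberg table registered in the Glue Data Catalog and read it back as Arrow:

from pyiceberg.catalog import load_catalog

# Placeholder settings: adjust the warehouse, region, and table identifier.
catalog = load_catalog(
    "glue_catalog",
    **{
        "type": "glue",
        "warehouse": "s3://your-bucket/your-warehouse/",
        "region": "ap-northeast-1",
    },
)

table = catalog.load_table("icebergdb.yellow_tripdata")

# Read the current snapshot into a PyArrow table (fine for small tables).
arrow_table = table.scan().to_arrow()
print(arrow_table.num_rows)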


Additional Python modules path

Set the Additional Python modules path to include:

pyiceberg[glue]==0.9.1,pydantic==2.5.0,pydantic-core==2.14.1,annotated-types==0.6.0,duckdb==0.9.2,pyarrow==14.0.1,typing-extensions==4.8.0

Be careful here: dependency management can be tricky, and the pinned versions need to stay compatible with one another.
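
For reference, here is a rough boto3 sketch of how such a job could be defined with that module list passed through --additional-python-modules; the job name matches the one used later in this article, while the role and script location are placeholders (the same settings can be made in the Glue console):

import boto3

glue = boto3.client("glue")

# Placeholder role and script location; adjust to your environment.
glue.create_job(
    Name="Iceberg shell",
    Role="arn:aws:iam::123456789012:role/your-glue-job-role",
    Command={
        "Name": "pythonshell",  # Python Shell job (no Spark)
        "ScriptLocation": "s3://your-bucket/scripts/iceberg_shell.py",
        "PythonVersion": "3.9",
    },
    MaxCapacity=0.0625,  # 1/16 DPU; 1.0 DPU is also available
    Timeout=2880,        # minutes (48 hours, the default maximum)
    DefaultArguments={
        "--additional-python-modules": (
            "pyiceberg[glue]==0.9.1,pydantic==2.5.0,pydantic-core==2.14.1,"
            "annotated-types==0.6.0,duckdb==0.9.2,pyarrow==14.0.1,typing-extensions==4.8.0"
        ),
    },
)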


Sample Code

Lambda trigger function

A simple Lambda that receives an S3 event and starts the Glue job:

import boto3
import urllib.parse

def lambda_handler(event, context):
    glue = boto3.client("glue")

    # Extract the bucket and object key from the S3 event notification.
    # Keys in S3 events are URL-encoded, so decode them before building the path.
    s3_bucket = event['Records'][0]['s3']['bucket']['name']
    s3_object_key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

    s3_input_path = f"s3://{s3_bucket}/{s3_object_key}"

    # Start the Glue Python Shell job and hand over the S3 path as a job argument.
    response = glue.start_job_run(
        JobName="Iceberg shell",
        Arguments={
            "--s3_input": s3_input_path
        }
    )
    print(f"Glue Job started: {response['JobRunId']}")
    return response
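
For reference, the handler above expects an S3-style event shaped roughly like this (bucket and key are placeholders), which you could use when testing the function:

# Hypothetical test event with placeholder bucket and key.
test_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "your-input-bucket"},
                "object": {"key": "input/sample.parquet"},
            }
        }
    ]
}

lambda_handler(test_event, None)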

AWS Glue Python Shell ETL

The Glue Python Shell job is started from Lambda with the S3 input path; the job uses DuckDB to read the file and PyIceberg to write into an Iceberg table. In this example the DuckDB query filters VendorID = 1 to demonstrate SQL-style cleansing/filters.

This example uses GlueCatalog access. Depending on your requirements, you can also use the Glue Iceberg REST catalog instead; see the link below for the differences between the two approaches.

Reference about catalog access modes:
https://zenn.dev/penginpenguin/articles/e44880aaa2d5e3
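
As a rough sketch of the REST-catalog variant, assuming the Glue Iceberg REST endpoint with SigV4 signing (the account ID and region are placeholders; verify the exact property names against the PyIceberg and AWS documentation):

from pyiceberg.catalog import load_catalog

# Assumed configuration for the Glue Iceberg REST endpoint; adjust region and account ID.
catalog = load_catalog(
    "glue_rest",
    **{
        "type": "rest",
        "uri": "https://glue.ap-northeast-1.amazonaws.com/iceberg",
        "warehouse": "123456789012",  # Glue Data Catalog ID (your AWS account ID)
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": "ap-northeast-1",
    },
)

table = catalog.load_table("icebergdb.yellow_tripdata")

The full GlueCatalog-based job script used in this article follows.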

import boto3
import sys
import os
from awsglue.utils import getResolvedOptions


def get_job_parameters():
    """Get job parameters passed from the Glue job arguments."""
    try:
        required_args = ['s3_input']

        args = getResolvedOptions(sys.argv, required_args)

        s3_file_path = args['s3_input']
        print(f"object: {s3_file_path}")

        return s3_file_path

    except Exception as e:
        print(f"parameters error: {e}")
        raise

def setup_duckdb_environment():
    """Properly set up the DuckDB environment on the local filesystem (/tmp)."""
    try:
        home_dir = '/tmp'
        duckdb_dir = '/tmp/.duckdb'
        extensions_dir = '/tmp/.duckdb/extensions'

        os.environ['HOME'] = home_dir
        os.environ['DUCKDB_HOME'] = duckdb_dir
        os.environ['DUCKDB_CONFIG_PATH'] = duckdb_dir
        os.environ['DUCKDB_EXTENSION_DIRECTORY'] = extensions_dir

        os.makedirs(duckdb_dir, exist_ok=True)
        os.makedirs(extensions_dir, exist_ok=True)
        os.chmod(duckdb_dir, 0o755)
        os.chmod(extensions_dir, 0o755)

        print(f"DuckDB environment setup completed: {duckdb_dir}")
        return True

    except Exception as e:
        print(f"DuckDB environment setup error: {e}")
        return False

def read_parquet_with_duckdb(s3_input):
    """Read Parquet file from S3 using DuckDB and return a PyArrow table."""
    import duckdb

    con = duckdb.connect(':memory:')

    try:
        con.execute("SET extension_directory='/tmp/.duckdb/extensions';")
        con.execute("INSTALL httpfs;")
        con.execute("LOAD httpfs;")

        session = boto3.Session()
        credentials = session.get_credentials()

        con.execute(f"SET s3_region='ap-northeast-1';")
        con.execute(f"SET s3_access_key_id='{credentials.access_key}';")
        con.execute(f"SET s3_secret_access_key='{credentials.secret_key}';")

        if credentials.token:
            con.execute(f"SET s3_session_token='{credentials.token}';")

        print(f"Reading data from S3: {s3_input}")
        sql = f"SELECT * FROM read_parquet('{s3_input}') WHERE VendorID = 1"
        res = con.execute(sql)

        return res.arrow()

    except Exception as e:
        print(f"DuckDB error: {e}")
        raise
    finally:
        con.close()

def create_and_write_iceberg_table_fixed(arrow_table):
    """Create (or load) an Iceberg table and append an Arrow table to it."""
    try:
        print("Fixed Iceberg table creation and writing started...")

        from pyiceberg.catalog import load_catalog
        from pyiceberg.schema import Schema
        from pyiceberg.types import NestedField, StringType, IntegerType, DoubleType, TimestampType

        catalog_config = {
            "type": "glue",
            "warehouse": "s3://your-bucket/your-warehouse/", # Adjust to your environment.
            "region": "ap-northeast-1"
        }

        catalog = load_catalog("glue_catalog", **catalog_config)

        table_identifier = "icebergdb.yellow_tripdata"

        table = catalog.load_table(table_identifier)
        print(f"Target data to write: {arrow_table.num_rows:,} rows")

        # Write Arrow table to Iceberg table
        table.append(arrow_table)

        return True

    except Exception as e:
        print(f"Fixed Iceberg table creation/writing error: {e}")
        import traceback
        traceback.print_exc()
        return False

def main():
    print("Fixed Iceberg table creation/writing test...")

    # Set up DuckDB environment
    if not setup_duckdb_environment():
        print("Failed to set up DuckDB environment")
        return

    try:
        import pyiceberg

        # Read data
        s3_input = get_job_parameters()

        arrow_tbl = read_parquet_with_duckdb(s3_input)
        print(f"Data read success: {arrow_tbl.shape}")

        # Create and write to fixed Iceberg table
        if create_and_write_iceberg_table_fixed(arrow_tbl):
            print("\nFixed Iceberg table creation/writing fully successful!")
        else:
            print("Fixed Iceberg table creation/writing failed")

    except Exception as e:
        print(f"Main error: {e}")
        import traceback
        traceback.print_exc()

if __name__ == "__main__":
    main()

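After starting a run, you can also poll its status with the same boto3 Glue client; here is a small hypothetical helper (the state list follows the Glue job-run states):

import time
import boto3

glue = boto3.client("glue")

def wait_for_job(job_name, run_id, poll_seconds=30):
    """Hypothetical helper: poll a Glue job run until it reaches a terminal state."""
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)
        state = run["JobRun"]["JobRunState"]
        print(f"Job state: {state}")
        if state not in ("STARTING", "RUNNING", "STOPPING", "WAITING"):
            return state
        time.sleep(poll_seconds)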

Results

Sample data used in this demo is the commonly used NYC taxi dataset:
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

(Before / After screenshots)

We successfully registered data into an Iceberg table (an open table format, OTF) using this approach.

Reference: Lambda implementation (for comparison)

In a previous article I implemented the same processing with these libraries directly inside Lambda. Because Lambda has a 15-minute runtime limit, large or complex workloads may time out or hit memory limits. Migrating the same library stack to Glue Python Shell can avoid those constraints.

import duckdb
import pyarrow as pa
from pyiceberg.catalog.glue import GlueCatalog  

def lambda_handler(event, context):
    try:
        # Connect to DuckDB and set the home directory
        duckdb_connection = duckdb.connect(database=':memory:')
        duckdb_connection.execute("SET home_directory='/tmp'") 

        # Install and load the httpfs extension
        duckdb_connection.execute("INSTALL httpfs;")
        duckdb_connection.execute("LOAD httpfs;")

        # Load data from S3 using DuckDB
        s3_bucket = event['Records'][0]['s3']['bucket']['name']
        s3_object_key = event['Records'][0]['s3']['object']['key']

        s3_input_path = f"s3://{s3_bucket}/{s3_object_key}"

        print(f"s3_input_path: {s3_input_path}")

        query = f"""
            SELECT * FROM read_parquet('{s3_input_path}') WHERE VendorID = 1
        """
        # Execute SQL and retrieve results as a PyArrow Table
        result_arrow_table = duckdb_connection.execute(query).fetch_arrow_table()

        print(f"Number of rows retrieved: {result_arrow_table.num_rows}")
        print(f"Data schema: {result_arrow_table.schema}")

        # Configure Glue Catalog (to access Iceberg table)
        catalog = GlueCatalog(region_name="ap-northeast-1", database="icebergdb", name="my_catalog")  # Adjust to your environment.

        # Load the table
        namespace = "icebergdb"  # Adjust to your environment.
        table_name = "yellow_tripdata"  # Adjust to your environment.
        iceberg_table = catalog.load_table(f"{namespace}.{table_name}")

        # Append data to the Iceberg table in bulk
        iceberg_table.append(result_arrow_table) 

        print("Data has been appended to S3 in Iceberg format.")

    except Exception as e:
        print(f"An error occurred: {e}")


Pros & Cons of the Glue Python Shell approach

Pros

  • Can run far longer than Lambda, so it can handle large files and long-running processing.
  • Serverless and elastic, with no infrastructure to manage.
  • Can reuse Python libraries like DuckDB and PyIceberg as-is.
  • Writing to Iceberg gives you snapshots, time travel, and other table-management features.

Cons

  • Startup is slower than Lambda due to loading Python libraries.
  • Glue Python Shell only supports Python (no other languages).
  • No Spark-based distributed processing, so it's unsuitable for very large-scale or very high-throughput workloads.
  • You need to manage dependency versions carefully.

Cost comparison vs Lambda

I compared costs for a scenario bounded by Lambda's 15-minute maximum runtime (Lambda memory was set to 2048 MB in the example). Depending on your workload and runtime, Glue Python Shell can sometimes be cheaper than Lambda.

(Screenshot: Lambda cost estimate)

(Screenshot: AWS Glue Python Shell cost estimate)
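
As a back-of-the-envelope sketch only (the unit prices below are assumptions; check current pricing for your region), the comparison for a single 15-minute run looks roughly like this:

# Assumed list prices; verify against current AWS pricing for your region.
LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667  # USD
GLUE_PRICE_PER_DPU_HOUR = 0.44             # USD

# Lambda: 2048 MB running for the full 15-minute maximum.
lambda_cost = (2048 / 1024) * (15 * 60) * LAMBDA_PRICE_PER_GB_SECOND
print(f"Lambda, 2 GB x 15 min: ${lambda_cost:.4f}")  # roughly $0.03 per run

# Glue Python Shell: the same 15 minutes at 1 DPU or 1/16 DPU.
glue_cost_1dpu = 1.0 * (15 / 60) * GLUE_PRICE_PER_DPU_HOUR
glue_cost_small = 0.0625 * (15 / 60) * GLUE_PRICE_PER_DPU_HOUR
print(f"Glue Python Shell, 1 DPU: ${glue_cost_1dpu:.4f}")      # roughly $0.11 per run
print(f"Glue Python Shell, 1/16 DPU: ${glue_cost_small:.4f}")  # roughly $0.007 per run

Whether Glue Python Shell ends up cheaper therefore depends heavily on the DPU setting and on how long the job actually runs.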

Conclusion

This article showed how to use AWS Glue Python Shell together with DuckDB and PyIceberg to implement ETL that writes data in Apache Iceberg format on S3. Glue Python Shell sits between Lambda and Glue Spark/EMR as a serverless execution option that supports longer runtimes without the management overhead of Spark clusters.

By combining DuckDB and PyIceberg you can build a flexible, lower-cost data processing pipeline—especially when Lambda’s runtime is insufficient but Glue Spark / EMR would be overkill.

Iceberg supports schema evolution, time travel, and other powerful table features; it is becoming a mainstream choice among open table formats (OTFs), so I expect its adoption to keep growing.

I hope this article helps anyone considering lightweight ETL or near-real-time ingestion into Iceberg tables.
