Introduction
I'm Aki, an AWS Community Builder (@jitepengin).
I previously posted an article about implementing ETL with AWS Lambda × DuckDB × PyIceberg:
https://zenn.dev/penginpenguin/articles/77d4a9b1e90e3a
https://dev.to/aws-builders/lightweight-etl-with-aws-lambda-duckdb-and-pyiceberg-1l5p
In this post I’ll implement an ETL that writes data in Apache Iceberg format using AWS Glue Python Shell, which—like Lambda—is often chosen for lightweight ETL workloads.
When Glue (Spark) or EMR is too costly, and Lambda’s 15-minute timeout is a concern, AWS Glue Python Shell can be an effective middle ground.
What is AWS Glue Python Shell?
AWS Glue Python Shell is a Glue execution environment that runs Python scripts without Apache Spark. Compared with Lambda’s 15-minute limit, it supports much longer batch jobs—the default maximum is 48 hours—so it’s suitable for longer-running ETL tasks. It’s serverless, so you don’t manage infrastructure and you pay for what you use.
Compared to Glue Spark or EMR, Glue Python Shell is often cheaper for lightweight ETL and data transformations. However, one downside versus Lambda is fewer native trigger options: Glue Python Shell does not directly support S3 object-created or EventBridge triggers in the same way Lambda does. Typical approaches are to invoke Glue Python Shell from a Lambda function or through Glue Workflows. Despite these constraints, when used well it can significantly expand your processing options.
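For example, if a job does not need to be event-driven, a time-based Glue trigger can start it on a schedule with no Lambda in between. Here is a minimal sketch using boto3 (the trigger name, cron expression, and job name are placeholders, not part of this article's setup):

import boto3

glue = boto3.client("glue")

# Hypothetical schedule: run the Python Shell job every day at 18:00 UTC.
glue.create_trigger(
    Name="iceberg-shell-daily",
    Type="SCHEDULED",
    Schedule="cron(0 18 * * ? *)",
    Actions=[{"JobName": "Iceberg shell"}],
    StartOnCreation=True,
)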
Architecture used in this article
Here’s the architecture covered in this post: when a file is uploaded to S3, a Lambda function is triggered. The Lambda only starts a Glue Python Shell job and passes the S3 path; the actual ETL runs inside the Glue Python Shell and writes Iceberg-format data back to S3. This lets you process workloads that exceed Lambda’s 15-minute limit.
Key libraries used:
- DuckDB — A fast in-memory SQL database engine.
- PyArrow — Efficient data transformation and transfer via the Arrow format.
- PyIceberg — Enables working with Iceberg tables using AWS Glue Catalog.
About the Libraries
DuckDB
DuckDB is an embedded OLAP database engine designed for analytical queries.
It’s lightweight, runs in memory, and works efficiently even in restricted environments like Lambda.
It’s particularly well-suited for batch analytics and ETL workloads.
https://duckdb.org/
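As a quick illustration, DuckDB can query a Parquet file with plain SQL and hand the result straight to Arrow. A minimal local sketch (the file name is a placeholder):

import duckdb

con = duckdb.connect()  # in-memory database

# Query a Parquet file directly with SQL and fetch the result as a PyArrow table.
arrow_table = con.execute(
    "SELECT VendorID, COUNT(*) AS trips FROM read_parquet('sample.parquet') GROUP BY VendorID"
).arrow()
print(arrow_table)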
PyArrow
PyArrow is the Python binding for Apache Arrow, a high-performance columnar memory format optimized for analytics and distributed computing.
It allows fast, zero-copy data transfer between systems and languages.
https://arrow.apache.org/
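For instance, a columnar table with an explicit schema can be built in a couple of lines; this is the representation DuckDB hands over and PyIceberg appends in the pipeline below (a minimal sketch with made-up values):

import pyarrow as pa

# Small columnar table; both DuckDB results and PyIceberg appends use this representation.
table = pa.table({"VendorID": [1, 1, 2], "fare_amount": [7.5, 12.0, 9.25]})
print(table.schema)
print(table.num_rows)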
PyIceberg
PyIceberg is the Python implementation of Apache Iceberg, a table format designed for large, cloud-native data lakes.
It enables Python applications to read, write, and manage Iceberg tables seamlessly.
https://py.iceberg.apache.org/
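As a small illustration of the read side (the write side appears in the job below), a table registered in the Glue Data Catalog can be loaded and scanned back into Arrow. The catalog properties and table identifier are placeholders; adjust them to your environment:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "glue_catalog",
    **{"type": "glue", "warehouse": "s3://your-bucket/your-warehouse/", "region": "ap-northeast-1"},
)
table = catalog.load_table("icebergdb.yellow_tripdata")
arrow_table = table.scan(limit=10).to_arrow()  # read a few rows back as a PyArrow table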
Additional Python modules path
Set the Additional Python modules path to include:
pyiceberg[glue]==0.9.1,pydantic==2.5.0,pydantic-core==2.14.1,annotated-types==0.6.0,duckdb==0.9.2,pyarrow==14.0.1,typing-extensions==4.8.0
Be mindful—dependency management can be tricky here.
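If you define the job with boto3 instead of the console, the same list goes into the --additional-python-modules default argument. A minimal sketch (the role ARN, script location, and job settings are placeholders; check which Python versions are available for Python Shell in your account):

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="Iceberg shell",
    Role="arn:aws:iam::123456789012:role/your-glue-role",  # placeholder
    Command={
        "Name": "pythonshell",
        "ScriptLocation": "s3://your-bucket/scripts/iceberg_shell.py",  # placeholder
        "PythonVersion": "3.9",
    },
    DefaultArguments={
        "--additional-python-modules": (
            "pyiceberg[glue]==0.9.1,pydantic==2.5.0,pydantic-core==2.14.1,"
            "annotated-types==0.6.0,duckdb==0.9.2,pyarrow==14.0.1,typing-extensions==4.8.0"
        ),
    },
    MaxCapacity=1.0,  # Python Shell jobs run on 0.0625 or 1 DPU
)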
Sample Code
Lambda trigger function
A simple Lambda that receives an S3 event and starts the Glue job:
import boto3


def lambda_handler(event, context):
    glue = boto3.client("glue")

    s3_bucket = event['Records'][0]['s3']['bucket']['name']
    s3_object_key = event['Records'][0]['s3']['object']['key']
    s3_input_path = f"s3://{s3_bucket}/{s3_object_key}"

    response = glue.start_job_run(
        JobName="Iceberg shell",
        Arguments={
            "--s3_input": s3_input_path
        }
    )
    print(f"Glue Job started: {response['JobRunId']}")
    return response
AWS Glue Python Shell ETL
The Glue Python Shell job receives the S3 input path from Lambda, uses DuckDB to read the file, and uses PyIceberg to append the result to an Iceberg table. In this example the DuckDB query filters on VendorID = 1 to demonstrate SQL-style cleansing and filtering.
This example accesses the table through the Glue Data Catalog (GlueCatalog). Depending on your requirements, you can also use the AWS Glue Iceberg REST catalog; see the link below for the differences.
Reference about catalog access modes:
https://zenn.dev/penginpenguin/articles/e44880aaa2d5e3
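For reference, here is a minimal sketch of what the REST-catalog variant can look like using PyIceberg's SigV4 support. The endpoint, warehouse (your AWS account ID), and property names below are assumptions on my part; verify them against the linked article and the PyIceberg documentation:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "glue_rest",
    **{
        "type": "rest",
        "uri": "https://glue.ap-northeast-1.amazonaws.com/iceberg",
        "warehouse": "123456789012",  # AWS account ID (catalog ID); placeholder
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": "ap-northeast-1",
    },
)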
import boto3
import sys
import os

from awsglue.utils import getResolvedOptions


def get_job_parameters():
    """Get job parameters passed from the Glue job arguments."""
    try:
        required_args = ['s3_input']
        args = getResolvedOptions(sys.argv, required_args)

        s3_file_path = args['s3_input']
        print(f"object: {s3_file_path}")
        return s3_file_path
    except Exception as e:
        print(f"parameters error: {e}")
        raise


def setup_duckdb_environment():
    """Properly set up the DuckDB environment on the local filesystem (/tmp)."""
    try:
        home_dir = '/tmp'
        duckdb_dir = '/tmp/.duckdb'
        extensions_dir = '/tmp/.duckdb/extensions'

        os.environ['HOME'] = home_dir
        os.environ['DUCKDB_HOME'] = duckdb_dir
        os.environ['DUCKDB_CONFIG_PATH'] = duckdb_dir
        os.environ['DUCKDB_EXTENSION_DIRECTORY'] = extensions_dir

        os.makedirs(duckdb_dir, exist_ok=True)
        os.makedirs(extensions_dir, exist_ok=True)
        os.chmod(duckdb_dir, 0o755)
        os.chmod(extensions_dir, 0o755)

        print(f"DuckDB environment setup completed: {duckdb_dir}")
        return True
    except Exception as e:
        print(f"DuckDB environment setup error: {e}")
        return False


def read_parquet_with_duckdb(s3_input):
    """Read Parquet file from S3 using DuckDB and return a PyArrow table."""
    import duckdb

    con = duckdb.connect(':memory:')
    try:
        con.execute("SET extension_directory='/tmp/.duckdb/extensions';")
        con.execute("INSTALL httpfs;")
        con.execute("LOAD httpfs;")

        session = boto3.Session()
        credentials = session.get_credentials()
        con.execute("SET s3_region='ap-northeast-1';")
        con.execute(f"SET s3_access_key_id='{credentials.access_key}';")
        con.execute(f"SET s3_secret_access_key='{credentials.secret_key}';")
        if credentials.token:
            con.execute(f"SET s3_session_token='{credentials.token}';")

        print(f"Reading data from S3: {s3_input}")
        sql = f"SELECT * FROM read_parquet('{s3_input}') WHERE VendorID = 1"
        res = con.execute(sql)
        return res.arrow()
    except Exception as e:
        print(f"DuckDB error: {e}")
        raise
    finally:
        con.close()


def create_and_write_iceberg_table_fixed(arrow_table):
    """Create (or load) an Iceberg table and append an Arrow table to it."""
    try:
        print("Fixed Iceberg table creation and writing started...")

        from pyiceberg.catalog import load_catalog
        from pyiceberg.schema import Schema
        from pyiceberg.types import NestedField, StringType, IntegerType, DoubleType, TimestampType

        catalog_config = {
            "type": "glue",
            "warehouse": "s3://your-bucket/your-warehouse/",  # Adjust to your environment.
            "region": "ap-northeast-1"
        }
        catalog = load_catalog("glue_catalog", **catalog_config)

        table_identifier = "icebergdb.yellow_tripdata"
        table = catalog.load_table(table_identifier)

        print(f"Target data to write: {arrow_table.num_rows:,} rows")

        # Write Arrow table to Iceberg table
        table.append(arrow_table)

        return True
    except Exception as e:
        print(f"Fixed Iceberg table creation/writing error: {e}")
        import traceback
        traceback.print_exc()
        return False


def main():
    print("Fixed Iceberg table creation/writing test...")

    # Set up DuckDB environment
    if not setup_duckdb_environment():
        print("Failed to set up DuckDB environment")
        return

    try:
        import pyiceberg

        # Read data
        s3_input = get_job_parameters()
        arrow_tbl = read_parquet_with_duckdb(s3_input)
        print(f"Data read success: {arrow_tbl.shape}")

        # Create and write to fixed Iceberg table
        if create_and_write_iceberg_table_fixed(arrow_tbl):
            print("\nFixed Iceberg table creation/writing fully successful!")
        else:
            print("Fixed Iceberg table creation/writing failed")
    except Exception as e:
        print(f"Main error: {e}")
        import traceback
        traceback.print_exc()


if __name__ == "__main__":
    main()
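Note that load_table above assumes the Iceberg table already exists in the Glue Data Catalog. If it does not, one option is to create it from the Arrow schema on first run; here is a minimal sketch under that assumption (the ensure_iceberg_table helper is mine, not part of the job above):

from pyiceberg.catalog import load_catalog


def ensure_iceberg_table(arrow_table):
    """Create icebergdb.yellow_tripdata from the Arrow schema if it does not exist yet."""
    catalog = load_catalog(
        "glue_catalog",
        **{"type": "glue", "warehouse": "s3://your-bucket/your-warehouse/", "region": "ap-northeast-1"},
    )
    if not catalog.table_exists("icebergdb.yellow_tripdata"):
        # PyIceberg derives the Iceberg schema from the PyArrow schema.
        catalog.create_table("icebergdb.yellow_tripdata", schema=arrow_table.schema)

You could call ensure_iceberg_table(arrow_tbl) in main() before create_and_write_iceberg_table_fixed.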
Results
Sample data used in this demo is the commonly used NYC taxi dataset:
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Before
After
We successfully registered data into an Iceberg table (an open table format, OTF) using this approach.
Reference: Lambda implementation (for comparison)
In a previous article I implemented the same libraries directly inside Lambda. Because Lambda has a 15-minute runtime limit, large or complex workloads may time out or hit memory limits. Migrating the same library stack to Glue Python Shell can avoid those constraints.
import duckdb
import pyarrow as pa
from pyiceberg.catalog.glue import GlueCatalog


def lambda_handler(event, context):
    try:
        # Connect to DuckDB and set the home directory
        duckdb_connection = duckdb.connect(database=':memory:')
        duckdb_connection.execute("SET home_directory='/tmp'")

        # Install and load the httpfs extension
        duckdb_connection.execute("INSTALL httpfs;")
        duckdb_connection.execute("LOAD httpfs;")

        # Load data from S3 using DuckDB
        s3_bucket = event['Records'][0]['s3']['bucket']['name']
        s3_object_key = event['Records'][0]['s3']['object']['key']
        s3_input_path = f"s3://{s3_bucket}/{s3_object_key}"
        print(f"s3_input_path: {s3_input_path}")

        query = f"""
            SELECT * FROM read_parquet('{s3_input_path}') WHERE VendorID = 1
        """

        # Execute SQL and retrieve results as a PyArrow Table
        result_arrow_table = duckdb_connection.execute(query).fetch_arrow_table()
        print(f"Number of rows retrieved: {result_arrow_table.num_rows}")
        print(f"Data schema: {result_arrow_table.schema}")

        # Configure Glue Catalog (to access Iceberg table)
        catalog = GlueCatalog(region_name="ap-northeast-1", database="icebergdb", name="my_catalog")  # Adjust to your environment.

        # Load the table
        namespace = "icebergdb"  # Adjust to your environment.
        table_name = "yellow_tripdata"  # Adjust to your environment.
        iceberg_table = catalog.load_table(f"{namespace}.{table_name}")

        # Append data to the Iceberg table in bulk
        iceberg_table.append(result_arrow_table)

        print("Data has been appended to S3 in Iceberg format.")
    except Exception as e:
        print(f"An error occurred: {e}")
Pros & Cons of the Glue Python Shell approach
Pros
- Can run for much longer than Lambda—can handle large files and long processing.
- Serverless and elastic—no infra management.
- Can reuse Python libraries like DuckDB and PyIceberg as-is.
- Writing to Iceberg gives you snapshots, time travel, and other table-management features.
Cons
- Startup is slower than Lambda, since the job environment is provisioned and the additional Python modules are installed when the job starts.
- Glue Python Shell only supports Python (no other languages).
- No Spark-based distributed processing—unsuitable for very large-scale or very high throughput workloads.
- You need to manage dependency versions carefully.
Cost comparison vs Lambda
I compared costs under a scenario limited to Lambda's 15-minute maximum. (Memory for Lambda was set to 2048 MB in the example.) Depending on your workload and runtime, Glue Python Shell can sometimes be cheaper than Lambda.
Lambda
AWS Glue Python Shell
Conclusion
This article showed how to use AWS Glue Python Shell together with DuckDB and PyIceberg to implement ETL that writes data in Apache Iceberg format on S3. Glue Python Shell sits between Lambda and Glue Spark/EMR as a serverless execution option that supports longer runtimes without the management overhead of Spark clusters.
By combining DuckDB and PyIceberg you can build a flexible, lower-cost data processing pipeline—especially when Lambda’s runtime is insufficient but Glue Spark / EMR would be overkill.
Iceberg supports schema evolution, time travel, and other powerful table-management features; it is becoming a mainstream choice among open table formats, so I expect its adoption to keep growing.
I hope this article helps anyone considering lightweight ETL or near-real-time ingestion into Iceberg tables.