Articles by Tag #dataengineering

The Ultimate Linux Command Cheat Sheet for Data Engineers and Analysts

Introduction: As a data engineer or analyst, your day-to-day responsibilities likely...

May 21

When Small Parquet Files Become a Big Problem (and How I Ended Up Writing a Compactor in PyArrow)

It all began with a fairly normal data pipeline. Events were coming in through Kafka, landing in AWS...

May 9

🌍 Automating Africa’s Energy Data Collection Using Python, Playwright (+Why Playwright?), and MongoDB (2000–2024)

⚡ Introduction: In today’s data-driven world, access to reliable and structured energy data...

Nov 4

Why Parquet Is Everywhere - And What Makes It Actually Fast?

Hey folks 👋, As I kept building more data pipelines, I noticed one file format showing up...

Nov 15

RIP Amazon Data Firehose Change Data Capture

When a couple of weeks before re:Invent 2024, AWS announced in a blog post titled Replicate changes...

Oct 2

Writes, 3 ways: Postgres, Apache Kafka® and Apache Iceberg™

As a part of my new job at...

Nov 6

Building a 75,000-Product Image Feature Dataset for the Amazon ML Challenge 2025

Hey everyone! Hope you’re all doing great today. So I’ve got something pretty exciting to share with...

Oct 17

From smog to streams: how data engineering helps us breathe easier.

Building a Real-Time Air Quality Data Pipeline for Mombasa & Nairobi The...

Oct 20

Data Quality at Scale: Why Your Pipeline Needs More Than Green Checkmarks

Originally published on Medium:...

Nov 24

Simulating An Event-Driven Python Shopping App with Kafka on AWS For Real-Time Processing.

Business Use Application/Relevance: Applications like banking applications, streaming (like...

Oct 25

How I Built a MongoDB Archiving System for Crawled Data

The Problem: Data Chaos...

Oct 3

Interoperating Open Table Formats on AWS Using Apache XTable (Delta → Iceberg)

Original Japanese article: Apache XTableを使ったAWS上でのOpen Table Format相互運用(Delta→Iceberg) ...

Nov 14

Designing Data-Intensive Applications — Chapter 1: Reliable, Scalable, and Maintainable Applications

This post is part of a series summarizing key ideas from Designing Data-Intensive Applications by...

Oct 29

Auto-Detecting CSV Schemas for Lightning-Fast ClickHouse Ingestion with Parquet

As a data engineer, one of the most repetitive tasks I face is ingesting data from CSV files. The...

Nov 7

Real-Time Crypto Data Pipeline

Introduction: Ever wondered how trading platforms display live crypto prices? In this...

Oct 27

🧠 ClickHouse LEFT JOINs: Why join_use_nulls Matters

🧠 Understanding join_use_nulls in ClickHouse ClickHouse is famous for being blazing fast —...

Oct 30

Another Data Nerd Guide to re:Invent 2025

Well, it's that time of year again. In less than two months we'll be in amazing and weird Las...

Oct 14

Real-Time Streaming Platform with Pulsar, Flink & ClickHouse

Real-Time Streaming Platform: Building Enterprise-Grade Data Infrastructure with Pulsar,...

Oct 26

How I Streamed Live Binance L2 Order Book Data on AWS for ~$10/Month

A fully automated Binance Level 2 order book streaming system on AWS Free Tier - 260K+ snapshots/day for ~$15/month with 6-person team access

Nov 7

Fixing Type Hints for Callable Objects with Custom Signatures in Dagster

So... it's been an interesting week. After my last contribution to Scikit-learn (which was honestly...

Oct 28

Building a Real-Time Data Platform with Kubernetes (Kind) - A Complete Local Setup Guide

Ever wondered how to build a production-grade real-time data pipeline that can handle millions of...

Oct 26

The Offline Data Engineer: Building Resilient API Pipelines that Work on an Airplane

Development loops for API integrations are usually painful. We’ve all been there: You are building a...

Nov 21

Building an Enterprise Patching Dashboard with AWS - A Complete Guide

Learn how to build a centralized patching and inventory management solution using AWS...

Nov 26

Real-Time Crypto Data Pipeline

Real-Time Crypto Data Pipeline: From Binance API to Cassandra with CDC and...

Oct 25

Polyglot Data Engineering: Python + Go in the Same Pipeline

Hey Devs 👋, If you're exploring modern data engineering stacks or curious about mixing languages in...

Sep 19

End-to-End Data Workflow: Kestra, Redshift, and dbt Integration

Imagine that at the end of every month, you are required to download data from a particular source,...

Oct 29

Apache Doris 4.0: One Engine for Analytics, Full-Text Search, and Vector Search

We're excited to announce the official release of Apache Doris 4.0: a major milestone release that...

Oct 24

Building a Streaming Data Pipeline with Kafka and Spark: Real-Time Analytics Implementation Guide

Oct 10

Realtime Data Streaming Platform: Building a Unified Monitoring Stack

When you're running a real-time streaming platform processing 1 million messages per second, you...

Oct 26

A Deep Dive into Apache Spark Architecture

Introduction: The digital world constantly generates enormous volumes of data — from social media...

Oct 27