Interesting links - August 2025

Not got time for all this? I’ve marked 🔥 for my top reads of the month :)

Data Engineering

🔥 Ben Rogojan (a.k.a. SeattleDataGuy) has a great list of 5 Things in Data Engineering That Still Hold True After 10 Years (guess what: data modelling matters, if you start with crap data you’ll end with crap data, and so on…).
Veronika Durgin shares some good tips for building resilient data pipelines.
Some good pointers for why you might want to modernise your data platform, and how to pick your stack if you do so.
🔥 Aleksandr Klein has a thoughtful post about The Mythic Journey of Data Quality Maturity
A useful post from Fiore Mario Vitale showing the use of OpenLineage to troubleshoot data pipelines in Debezium and Flink.

Data in Action

Building your own data ingestion framework may be a siren song for many, but Cloudflare operate at the kind of scale where it’s perhaps worth it. Read about Jetflow here.
Nubank have published a series of interesting blog posts about their use of stream processing, including with Kafka and Flink. There’s also a meetup recording (in Spanish) that looks like it has lots more details.
Details from UK bank Monzo on their Go-based fraud prevention platform.
🔥 Excellent blog post from Anton Borisov at Fresha detailing why and how they adopted StarRocks after finding that Snowflake "wasn’t cost-effective — or fast enough — for chatty, near-real-time product and operational analytics."
Guidewire have published a couple of interesting blog posts looking at their data platform design, testing, and optimisation.

Apache Kafka

Trendyol’s Ahmet Tortumlu walks through the process they follow when replacing KRaft controller nodes.
A look at how you could use Kafka to implement the international baggage tracking system ("IATA R753").
🔥 The r/apachekafka subreddit recently hit 17k members. It’s a funny place, with a mix of shills, trolls, n00bs who won’t even help themselves—and some lovely community conversations that remind me why I continue to enjoy being part of it :) Here are a handful of threads that caught my eye if you want to sample the fare:
Another month, another Kafka UI—this time a TUI (Text User Interface) with a name I’ll drink to: ktea.
A practical guide on different techniques to use when using Kafka for integration.
A well-written two-part account of Nike’s journey from TSV (shudder) to Protobuf for their data pipelines.
After talks about Northguard and Xinfra in April, LinkedIn’s Stream Processing meetup continued with its impressive content in July hosting three talks:

Open Table Formats & Catalogs

Let’s be honest, it’s mostly just Apache Iceberg…😅

🔥 I spent some time looking into Flink vs Kafka Connect vs Tableflow for getting data into Iceberg, and wrote up some of the comparison points: Kafka to Iceberg - Exploring the Options
Aiven published a whitepaper with details of their plans for writing to Iceberg directly from Kafka
David Reger published a detailed writeup of Tableflow (Confluent’s tool for getting data from Kafka to Iceberg).
A while back I wrote about the Write-Audit-Publish pattern, and so enjoyed reading these two blog posts from Turóczy Attila about branching and tagging in Apache Iceberg.
🔥 Details of taking practical advantage of more Iceberg features including time travel and schema evolution are covered in this article about building reproducible ML systems with Iceberg and SparkSQL
A nice hands-on example of using the Kafka Connect Iceberg sink with Nessie and MinIO.
The columnar format Vortex has been donated to the Linux Foundation. Earlier this year there was a PoC to show how it can speed up Iceberg queries, with an interesting talk at Iceberg Summit on the same.
A comprehensive introduction summarised from a talk at this year’s Flink Forward Asia conference about "the other" open table format that people often forget—Apache Paimon. TIL it supports integration with Iceberg.

Stream Processing

Apache Flink 2.1.0 has been released
Ming Hung Tsai wrote a three-part series showing how you could use Kafka Streams to implement a ticket reservation system (also discussed in the Reddit thread linked above).
🔥 Sometimes the old ones are the best—and this article from Tyler Akidau nine years ago is still just as important to read today if you’re thinking about stream processing: Streaming 102: The world beyond batch—The what, where, when, and how of unbounded data processing.
LinkedIn’s Jiangjie Qin, a PMC member for both Apache Flink and Apache Kafka, spoke at QCon SF about Stream and Batch Processing Convergence in Flink
Should you use a hammer to tighten a screw? No. Should you try and express all your stream processing needs in SQL? Also no.
FLIP-541 is a proposal to make PyFlink more Pythonic, and looks to have wide support in the community.
Databricks announced the public preview of a real-time mode for Spark Structured Streaming. It will be donated to the Apache Spark project but is currently only available on Databricks.

RDBMS + CDC

Debezium 3.2.1.Final and 3.3.0.Alpha1 have been released
Yingjun Wu has written a good explanation about why RisingWave use the embedded Debezium Engine in their product—and why they didn’t rewrite it in Rust to match the rest of their product code.
A practical guide from Nick Tobey on choosing a database schema for polymorphic data.
🔥 Richard van der Hoff writes about something to strike fear into any DBA’s heart: corruption in the database. In this case, a corrupted Postgres index.
More Postgres goodness, this time from Gunnar Morling: Postgres Replication Slots: Confirmed Flush LSN vs. Restart LSN

General Data Stuff

🔥🔥 Hot off the press is another banger from Jack Vanlightly, this time looking at A Conceptual Model for Storage Unification. If you’re interested in things like writing Kafka data to Iceberg, this is a vital foundation for understanding the design considerations and trade-offs.
How Klaviyo use Ray for their scalable data processing, training, and optimization
Prompted by a talk that Tesla gave about ingesting metrics into ClickHouse, Javier Santana at TinyBird set out to reproduce the feat using a 50-node ClickHouse cluster. In a sense these exercises are somewhat BSD and clickbait-y, but I do like the clear steps and detail that he showed in the blog post :).
🔥 If anyone is going to need to build their own time-series database (TSDB), Datadog is going to be one of the top contenders. In this blog post they write about how they built it using Rust and the benefits they saw (60x ingest, 5x query). Also interesting is the history of their previous TSDB platforms.
FastLanes describes itself as a Next-Gen Big Data File Format, intended as a replacement for columnar formats such as the somewhat-ubiquitous Parquet. Beyond several conference papers it’s unclear if there’s any adoption of the format in the wild yet.

And finally…

Nothing to do with data, but stuff that I’ve found interesting or has made me smile.

🔥 Brad Stulberg’s article Motivation is Overrated: Here’s What Works Instead is down to earth and well worth a read.
I’m not going to even pretend to understand the first thing about these organic simulation algorithms, but gosh, don’t they make pretty pictures!
I Tried Every Todo App and Ended Up With a .txt File — This one hit a bit close to home…
A healthy dose of nostalgia from MacPaint Art From The Mid-80s Still Looks Great Today (although cards on the table, I was on the BBC Micro/Acorn Archimedes side of things 😅)
It may seem odd to compile a list of "Why I want to leave" the day that you start a new job, but this article makes a compelling case for starting, and maintaining, such a list.

If you like these kinds of links you might like to read about How I Try To Keep Up With The Data Tech World (A List of Data Blogs)
I’m linking out to Freedium versions of Medium posts, because Medium seems to be pay-walling a bunch of otherwise-freely accessible content.

Robin Moffatt @rmoff