Interesting links - August 2025
Robin Moffatt

Publish Date: Dec 17 '25

Not got time for all this? I’ve marked 🔥 for my top reads of the month :)

Data Engineering

Data in Action

  • Building your own data ingestion framework may be a siren song for many, but Cloudflare operate at the kind of scale where it’s perhaps worth it. Read about Jetflow here.

  • Nubank have published a series of interesting blog posts about their use of stream processing, including with Kafka and Flink. There’s also a meetup recording (in Spanish) that looks like it has lots more details.

  • Details from UK bank Monzo on their Go-based fraud prevention platform.

  • 🔥 Excellent blog post from Anton Borisov at Fresha detailing why and how they adopted StarRocks after finding that Snowflake "wasn’t cost-effective — or fast enough — for chatty, near-real-time product and operational analytics."

  • Guidewire have published a couple of interesting blog posts looking at their data platform design, testing, and optimisation.

Apache Kafka

  • Trendyol’s Ahmet Tortumlu walks through the process they follow when replacing KRaft controller nodes.

  • A look at how you could use Kafka to implement the international baggage tracking system ("IATA R753").

  • 🔥 The r/apachekafka subreddit recently hit 17k members. It’s a funny place, with a mix of shills, trolls, n00bs who won’t even help themselves—and some lovely community conversations that remind me why I continue to enjoy being part of it :) Here are a handful of threads that caught my eye if you want to sample the fare:

  • Another month, another Kafka UI—this time a TUI (Text User Interface) with a name I’ll drink to: ktea.

  • A practical guide on different techniques to use when using Kafka for integration.

  • A well-written two-part account of Nike’s journey from TSV (shudder) to Protobuf for their data pipelines.

  • After talks about Northguard and Xinfra in April, LinkedIn’s Stream Processing meetup continued with its impressive content in July hosting three talks:

Open Table Formats & Catalogs

Let’s be honest, it’s mostly just Apache Iceberg…😅

Stream Processing

  • Apache Flink 2.1.0 has been released.

  • Ming Hung Tsai wrote a three-part series showing how you could use Kafka Streams to implement a ticket reservation system (also discussed in the Reddit thread linked above).

  • 🔥 Sometimes the old ones are the best—and this article from Tyler Akidau nine years ago is still just as important to read today if you’re thinking about stream processing: Streaming 102: The world beyond batch—The what, where, when, and how of unbounded data processing.

  • LinkedIn’s Jiangjie Qin, a PMC member for both Apache Flink and Apache Kafka, spoke at QCon SF about Stream and Batch Processing Convergence in Flink.

  • Should you use a hammer to tighten a screw? No. Should you try to express all your stream processing needs in SQL? Also no.

  • FLIP-541 is a proposal to make PyFlink more Pythonic, and looks to have wide support in the community.

  • Databricks announced the public preview of a real-time mode for Spark Structured Streaming. It will be donated to the Apache Spark project but is currently only available on Databricks.

RDBMS + CDC

General Data Stuff

  • 🔥🔥 Hot off the press is another banger from Jack Vanlightly, this time looking at A Conceptual Model for Storage Unification. If you’re interested in things like writing Kafka data to Iceberg, this is a vital foundation for understanding the design considerations and trade-offs.

  • How Klaviyo use Ray for their scalable data processing, training, and optimization.

  • Prompted by a talk that Tesla gave about ingesting metrics into ClickHouse, Javier Santana at Tinybird set out to reproduce the feat using a 50-node ClickHouse cluster. In a sense these exercises are somewhat BSD and clickbait-y, but I do like the clear steps and detail that he showed in the blog post :).

  • 🔥 If anyone is going to need to build their own time-series database (TSDB), Datadog is going to be one of the top contenders. In this blog post they write about how they built it using Rust and the benefits they saw (60x ingest, 5x query). Also interesting is the history of their previous TSDB platforms.

  • FastLanes describes itself as a Next-Gen Big Data File Format, intended as a replacement for columnar formats such as the somewhat-ubiquitous Parquet. Beyond several conference papers, it’s unclear whether the format has seen any adoption in the wild yet.

And finally…

Nothing to do with data, but stuff that I’ve found interesting or has made me smile.

