Introduction
Why I Wrote This Guide
In today's system architectures, streaming massive amounts of data into a search engine in real time has become increasingly common. However, the underlying mechanisms (stream processing, data transformation, and integration across systems) hide considerable complexity.
This guide organizes practical knowledge about OSS-based data pipelines using Kafka, Flink, and OpenSearch, and presents it in a way that allows anyone to learn by actually running and observing the system.
Intended Audience
This guide is aimed at individuals who:
- Prefer to experiment and validate systems locally rather than in the cloud
- Are interested in stream processing and real-time data integration
- Are new to Kafka, Flink, or OpenSearch
- Want to try out OSS-based architectures using Docker Compose
- Are interested in designing and verifying a data infrastructure but don't know where to start
Even beginners can safely follow along, as the guide explains the background and architecture of each component in a clear and thorough way.
What You'll Learn
After completing this guide, you will be able to:
- Build a Kafka → Flink → OpenSearch pipeline using Docker Compose
- Understand the basics of Flink's Event Time and Watermark processing, and how to inspect it through logs (see the sketch after this list)
- Gain insights into detecting and visualizing data ingestion delays into the search engine
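To make the Event Time and Watermark idea concrete before we dive in, here is a minimal Flink SQL sketch. The table, columns, and topic are hypothetical illustrations, not taken from this guide's repository; the point is how a watermark is declared on an event-time column of a Kafka-backed table:

CREATE TABLE orders (
  order_id   STRING,
  amount     DECIMAL(10, 2),
  order_time TIMESTAMP(3),
  -- Declare order_time as the event-time attribute and tolerate
  -- up to 5 seconds of out-of-order events before the watermark advances
  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'kafka:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

With a declaration like this in place, windowed computations over the table fire based on event time rather than arrival time, which is exactly the behavior the guide later inspects through Flink's logs.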
Required Knowledge
This guide assumes basic familiarity with:
- Linux command-line operations
- Basic Docker usage
Guide Structure
This guide is organized into the following chapters:
- Setting up the environment using Docker Compose
- Understanding the architecture and relationship between components
- Extracting change data with Debezium and outputting it to Kafka (a sketch follows this list)
- Running Flink jobs and analyzing logs
- Sending data to a search engine and verifying it
Each chapter includes diagrams and runnable examples to bridge theory with practice.
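As a small preview of the Debezium chapter, the sketch below shows how Flink SQL can consume Debezium change events from Kafka using the built-in debezium-json format. The topic and column names here are assumptions for illustration, not the ones used in the repository:

CREATE TABLE products_cdc (
  id    INT,
  name  STRING,
  price DECIMAL(10, 2)
) WITH (
  'connector' = 'kafka',
  -- Hypothetical topic name following Debezium's server.schema.table convention
  'topic' = 'dbserver1.inventory.products',
  'properties.bootstrap.servers' = 'kafka:9092',
  'scan.startup.mode' = 'earliest-offset',
  -- Interprets Debezium's change-event envelope, so inserts, updates, and
  -- deletes from the source database arrive as a changelog stream
  'format' = 'debezium-json'
);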
About the Code
All code, SQL scripts, and Docker Compose configuration files in this guide are available on GitHub: https://github.com/sisiodos/rdb-to-search-pipeline-with-flink
Each chapter corresponds to a runnable configuration, so you can follow along and execute each step as you read.
If Docker Desktop is installed, you can get started with:
git clone https://github.com/sisiodos/rdb-to-search-pipeline-with-flink.git
cd rdb-to-search-pipeline-with-flink
docker compose up -d
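Once the containers are up, running docker compose ps lets you confirm that each service (the database, Kafka, Flink, and OpenSearch) has started before moving on.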
Chapter-specific files and commands are called out in the guide so you can follow along and try them hands-on.
Future Expansion (Planning Notes)
While this guide focuses on the fundamentals of connecting stream processing with search engines using Flink, I also plan to explore the following advanced topics:
- Flink parallel execution and TaskManager slot configuration
- Data distribution and visualization across multiple TaskManagers
- Order guarantee and consistency design for Flink → OpenSearch writes
  - Upsert control using _id and op fields (see the sketch below)
  - Deduplication strategies for handling duplicate messages
  - Checkpointing, exactly-once processing, and TwoPhaseCommitSink
- Design for stable operation and backpressure mitigation
  - Analyzing bottlenecks via metrics like currentSendBackPressure, busyTimeMsPerSecond, and I/O wait
  - Tuning parallelism, slot placement, and resource balancing
  - Operator chaining, buffer timeouts, and Watermark tuning
  - Managing large state (joins, deduplication): key selection, TTL, partial aggregation
  - Tuning the RocksDB StateBackend and spill control
  - Using the Async I/O Sink and tuning bulk flush/batch size
  - Monitoring and visualization using the Flink Web UI and Grafana + Prometheus
  - Alerting on checkpoint growth and state bloat
These topics go beyond the basics and involve more architectural thinking.
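To hint at what the upsert topic involves, here is a minimal Flink SQL sketch, assuming the Flink OpenSearch SQL connector is on the classpath; the table, index, and host names are illustrative, not from the repository. Declaring a primary key switches the sink into upsert mode, with the key serving as the OpenSearch document _id:

CREATE TABLE products_index (
  id    INT,
  name  STRING,
  price DECIMAL(10, 2),
  -- The primary key becomes the document _id, so a repeated write for the
  -- same key updates the existing document instead of creating a duplicate
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'opensearch',
  'hosts' = 'http://opensearch:9200',
  'index' = 'products'
);

-- Continuously upsert the CDC stream (defined earlier) into the index
INSERT INTO products_index
SELECT id, name, price FROM products_cdc;

How deletes (Debezium events with op = 'd') map onto index deletions, and how ordering is preserved per key, is exactly the kind of consistency design listed above.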
Behind the Scenes
This guide was written in collaboration with OpenAI's conversational AI, ChatGPT-4o. From structure and technical verification to fine-tuning expressions, AI played a major role as a thinking partner throughout the writing process.
That said, the final architecture, structure, verification, and all editorial decisions are the author's responsibility.
The act of documenting and delivering technical insight is no longer the sole domain of human effort. I hope this guide serves as an example of what's possible when human and AI collaborate—and supports you in your own technical explorations.
For what it's worth, during the writing process, ChatGPT-4o used "I" as its pronoun. Somewhere along the way, it became something like a co-author. I leave that note here, quietly.
(Coming soon: Chapter 1 — Setting Up the Environment with Docker Compose)