Introduction
Why I Wrote This Guide
In today's system architectures, streaming massive amounts of data into a search engine in real time has become increasingly common. However, the underlying mechanisms (stream processing, data transformation, and integration across systems) hide considerable complexity.
This guide organizes practical knowledge about OSS-based data pipelines using Kafka, Flink, and OpenSearch, and presents it in a way that allows anyone to learn by actually running and observing the system.
Intended Audience
This guide is aimed at individuals who:
- Prefer to experiment and validate systems locally rather than in the cloud
- Are interested in stream processing and real-time data integration
- Are new to Kafka, Flink, or OpenSearch
- Want to try out OSS-based architectures using Docker Compose
- Are interested in designing and verifying a data infrastructure but don't know where to start
Even beginners can safely follow along, as the guide explains the background and architecture of each component in a clear and thorough way.
What You'll Learn
After completing this guide, you will be able to:
- Build a Kafka → Flink → OpenSearch pipeline using Docker Compose
- Understand the basics of Flink's Event Time and Watermark processing, and how to inspect it through logs (see the sketch after this list)
- Gain insights into detecting and visualizing data ingestion delays into the search engine
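To make the Event Time and Watermark idea concrete before we dive in, here is a minimal Flink SQL sketch. The table, columns, and topic are hypothetical illustrations, not taken from this guide's repository; the point is how a watermark is declared on an event-time column of a Kafka-backed table:

CREATE TABLE orders (
  order_id   STRING,
  amount     DECIMAL(10, 2),
  order_time TIMESTAMP(3),
  -- Declare order_time as the event-time attribute and tolerate
  -- up to 5 seconds of out-of-order events before the watermark advances
  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'kafka:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

With a declaration like this in place, windowed computations over the table fire based on event time rather than arrival time, which is exactly the behavior the guide later inspects through Flink's logs.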
Required Knowledge
This guide assumes basic familiarity with:
- Linux command-line operations
- Basic Docker usage
Guide Structure
This guide is organized into the following chapters:
- Setting up the environment using Docker Compose
- Understanding the architecture and relationship between components
- Extracting change data with Debezium and outputting it to Kafka (a sketch follows this list)
- Running Flink jobs and analyzing logs
- Sending data to a search engine and verifying it
Each chapter includes diagrams and runnable examples to bridge theory with practice.
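As a small preview of the Debezium chapter, the sketch below shows how Flink SQL can consume Debezium change events from Kafka using the built-in debezium-json format. The topic and column names here are assumptions for illustration, not the ones used in the repository:

CREATE TABLE products_cdc (
  id    INT,
  name  STRING,
  price DECIMAL(10, 2)
) WITH (
  'connector' = 'kafka',
  -- Hypothetical topic name following Debezium's server.schema.table convention
  'topic' = 'dbserver1.inventory.products',
  'properties.bootstrap.servers' = 'kafka:9092',
  'scan.startup.mode' = 'earliest-offset',
  -- Interprets Debezium's change-event envelope, so inserts, updates, and
  -- deletes from the source database arrive as a changelog stream
  'format' = 'debezium-json'
);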
About the Code
All code, SQL scripts, and Docker Compose configuration files in this guide are available on GitHub: https://github.com/sisiodos/rdb-to-search-pipeline-with-flink
Each chapter corresponds to a runnable configuration, so you can follow along and execute each step as you read.
If Docker Desktop is installed, you can get started with:
git clone https://github.com/sisiodos/rdb-to-search-pipeline-with-flink.git
cd rdb-to-search-pipeline-with-flink
docker compose up -d
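Once the containers are up, running docker compose ps lets you confirm that each service (the database, Kafka, Flink, and OpenSearch) has started before moving on.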
Chapter-specific files and commands are called out in the guide so you can follow along and try them hands-on.
Future Expansion (Planning Notes)
While this guide focuses on the fundamentals of connecting stream processing with search engines using Flink, I also plan to explore the following advanced topics:
- Flink parallel execution and TaskManager slot configuration
- Data distribution and visualization across multiple TaskManagers
- Order guarantee and consistency design for Flink → OpenSearch writes
  - Upsert control using _id and op fields (see the sketch below)
  - Deduplication strategies for handling duplicate messages
  - Checkpointing, exactly-once processing, and TwoPhaseCommitSink
- Design for stable operation and backpressure mitigation
  - Analyzing bottlenecks via metrics like currentSendBackPressure, busyTimeMsPerSecond, and I/O wait
  - Tuning parallelism, slot placement, and resource balancing
  - Operator chaining, buffer timeouts, and Watermark tuning
  - Managing large state (joins, deduplication): key selection, TTL, partial aggregation
  - Tuning the RocksDB StateBackend and spill control
  - Using the Async I/O Sink and tuning bulk flush/batch size
  - Monitoring and visualization using the Flink Web UI and Grafana + Prometheus
  - Alerting on checkpoint growth and state bloat
These topics go beyond the basics and involve more architectural thinking.
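To hint at what the upsert topic involves, here is a minimal Flink SQL sketch, assuming the Flink OpenSearch SQL connector is on the classpath; the table, index, and host names are illustrative, not from the repository. Declaring a primary key switches the sink into upsert mode, with the key serving as the OpenSearch document _id:

CREATE TABLE products_index (
  id    INT,
  name  STRING,
  price DECIMAL(10, 2),
  -- The primary key becomes the document _id, so a repeated write for the
  -- same key updates the existing document instead of creating a duplicate
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'opensearch',
  'hosts' = 'http://opensearch:9200',
  'index' = 'products'
);

-- Continuously upsert the CDC stream (defined earlier) into the index
INSERT INTO products_index
SELECT id, name, price FROM products_cdc;

How deletes (Debezium events with op = 'd') map onto index deletions, and how ordering is preserved per key, is exactly the kind of consistency design listed above.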
Behind the Scenes
This guide was written in collaboration with OpenAI's conversational AI, ChatGPT-4o. From structure and technical verification to fine-tuning expressions, AI played a major role as a thinking partner throughout the writing process.
That said, the final architecture, structure, verification, and all editorial decisions are the author's responsibility.
The act of documenting and delivering technical insight is no longer the sole domain of human effort. I hope this guide serves as an example of what's possible when human and AI collaborate—and supports you in your own technical explorations.
For what it's worth, during the writing process, ChatGPT-4o used "I" as its pronoun. Somewhere along the way, it became something like a co-author. I leave that note here, quietly.
(Coming soon: Chapter 1 — Setting Up the Environment with Docker Compose)