Connecting RDBs and Search Engines — Chapter 1

Chapter 1: Setting Up the Environment with Docker Compose

Purpose of This Chapter

The goal of this chapter is to help you build and understand a data pipeline that streams change data from PostgreSQL through stream processing and into OpenSearch, all within your local environment.

About the Code (Repeat)

The docker-compose.yaml file and PostgreSQL initialization scripts used in this chapter are available in the following repository:

👉 https://github.com/sisiodos/rdb-to-search-pipeline-with-flink

If you haven't cloned the repository yet, you can set it up with:

git clone https://github.com/sisiodos/rdb-to-search-pipeline-with-flink.git
cd rdb-to-search-pipeline-with-flink
docker compose up -d

The docker/ directory contains the configuration files, and the postgres/ directory contains SQL scripts for initializing sample data.
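
If you want to confirm that those initialization scripts actually ran, a quick check like the one below can help. This is only a sketch: the service name postgres, the postgres user, and the default database are assumptions on my part, so adjust them to whatever docker-compose.yaml and the scripts in postgres/ actually define.

# List the tables created by the init scripts (service, user, and database names are assumptions)
docker compose exec postgres psql -U postgres -c '\dt'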

Architecture Overview

Here’s a brief explanation of each technology used in this pipeline, so even readers unfamiliar with these tools can follow along:

  • PostgreSQL: An open-source relational database. It serves as the source of data changes in this setup.
  • Debezium: A CDC (Change Data Capture) tool that detects changes in the database and publishes them to Kafka in real time.
  • Kafka: A high-throughput messaging system that temporarily stores the change data from Debezium and relays it to Flink.
  • Flink: A distributed stream processing engine. It processes the change data from Kafka and sends it to OpenSearch.
  • OpenSearch: A search engine where the processed data from Flink is stored and made searchable.

This pipeline demonstrates the end-to-end flow of capturing changes from PostgreSQL and delivering them to OpenSearch via Kafka and Flink.
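
Once the containers are up, a rough way to peek at both ends of this flow is to query OpenSearch and Kafka directly. The ports, service names, and the assumption that OpenSearch runs with its security plugin disabled are guesses based on common demo defaults, so check docker-compose.yaml before relying on them:

# OpenSearch REST API on its default port (assumes the security plugin is disabled;
# otherwise you may need HTTPS and credentials)
curl -s http://localhost:9200

# List Kafka topics, including the ones Debezium creates; the exact script name and
# path depend on the Kafka image used in the compose file
docker compose exec kafka bin/kafka-topics.sh --bootstrap-server localhost:9092 --list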

Building the Environment with Docker Compose

Note: This guide assumes a local environment (Mac / Windows / Linux) with Docker Desktop installed. If you're unfamiliar with Docker CLI or Compose, refer to the official Docker documentation.

Docker Compose is a tool for defining and managing multi-container Docker applications. With a single docker-compose.yaml file, you can describe multiple services, including images, ports, environment variables, and dependencies.

In this project, the docker-compose.yaml file defines the following open-source services (a quick way to confirm this from the command line follows the list):

  • PostgreSQL (sample database)
  • Debezium Connect (captures changes from PostgreSQL and sends to Kafka)
  • Kafka + ZooKeeper (messaging infrastructure)
  • Flink (JobManager and TaskManager)
  • OpenSearch (search engine)
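
If you'd like to confirm this without reading the YAML, Compose can print the service names it finds in the file. The example output below is only illustrative; the authoritative list is whatever the command prints for this repository:

# Print the service names defined in docker-compose.yaml
docker compose config --services

# Example output (illustrative only; actual names come from the repository's compose file):
# postgres
# zookeeper
# kafka
# connect
# flink-jobmanager
# flink-taskmanager
# opensearch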

Starting the Services

To start all services:

docker compose up -d
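
After startup, it is worth confirming that every container reached a running state before moving on; for example:

# Show the status of all services defined in the compose file
docker compose ps

# Follow the combined logs while the stack warms up (Ctrl+C stops following)
docker compose logs -f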

Stopping the Services

To stop all services:

docker compose down
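
Note that docker compose down removes the containers and networks (data in named volumes is kept). If you only want to halt the stack and keep the containers around for a faster restart, stopping them is a lighter alternative:

# Stop the containers without removing them; docker compose start brings them back
docker compose stop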

Managing Individual Services

You can control individual services (e.g., pause, restart, check logs) like this:

docker compose restart kafka            # Restart Kafka service only
docker compose stop connect             # Stop Connect service only
docker compose start connect            # Start Connect service only
docker compose logs flink-taskmanager   # View logs from Flink TaskManager
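
When debugging a single component, following its log stream is often handier than dumping everything at once, for example:

# Follow the most recent Flink TaskManager log lines as they arrive
docker compose logs -f --tail=100 flink-taskmanager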

Pause and Resume

To temporarily suspend or resume all services:

docker compose pause
docker compose unpause

This is useful when you want to temporarily reduce CPU or I/O usage, or to freeze processing while you inspect logs.

Remove Persistent Volumes (Full Cleanup)

To completely clean up the environment, including persistent volumes (e.g., PostgreSQL data), use:

docker compose down --volumes

This removes the containers, networks, and the named volumes declared in the compose file, resetting the environment to its initial state. See the official Docker documentation for more details.
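
Because removing the volumes wipes the PostgreSQL data directory, the initialization scripts in postgres/ run again on the next startup (assuming they are mounted into the container's init directory, as is typical), which makes this a convenient way to get a completely fresh pipeline:

# Tear everything down, including volumes, then start a fresh environment
docker compose down --volumes
docker compose up -d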

With this, you now have the foundational environment for building a local data pipeline. In the next chapter, we will explore the structure and relationships between each component.

Related Links

(Coming soon: Chapter 2 — Understanding the Architecture and Component Relationships)
