Connecting RDBs and Search Engines — Chapter 1

Chapter 1: Setting Up the Environment with Docker Compose

Purpose of This Chapter

The goal of this chapter is to help you build and understand a data pipeline that streams change data from PostgreSQL through stream processing and into OpenSearch, all within your local environment.

About the Code (Repeat)

The docker-compose.yaml file and PostgreSQL initialization scripts used in this chapter are available in the following repository:

👉 https://github.com/sisiodos/rdb-to-search-pipeline-with-flink

If you haven't cloned the repository yet, you can set it up with:

git clone https://github.com/sisiodos/rdb-to-search-pipeline-with-flink.git
cd rdb-to-search-pipeline-with-flink
docker compose up -d

The docker/ directory contains the configuration files, and the postgres/ directory contains SQL scripts for initializing sample data.
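
If you want to confirm that those initialization scripts actually ran, a quick check like the one below can help. This is only a sketch: the service name postgres, the postgres user, and the default database are assumptions on my part, so adjust them to whatever docker-compose.yaml and the scripts in postgres/ actually define.

# List the tables created by the init scripts (service, user, and database names are assumptions)
docker compose exec postgres psql -U postgres -c '\dt'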

Architecture Overview

Here’s a brief explanation of each technology used in this pipeline, so even readers unfamiliar with these tools can follow along:

  • PostgreSQL: An open-source relational database. It serves as the source of data changes in this setup.
  • Debezium: A CDC (Change Data Capture) tool that detects changes in the database and publishes them to Kafka in real time.
  • Kafka: A high-throughput messaging system that temporarily stores the change data from Debezium and relays it to Flink.
  • Flink: A distributed stream processing engine. It processes the change data from Kafka and sends it to OpenSearch.
  • OpenSearch: A search engine where the processed data from Flink is stored and made searchable.

This pipeline demonstrates the end-to-end flow of capturing changes from PostgreSQL and delivering them to OpenSearch via Kafka and Flink.
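
Once the containers are up, a rough way to peek at both ends of this flow is to query OpenSearch and Kafka directly. The ports, service names, and the assumption that OpenSearch runs with its security plugin disabled are guesses based on common demo defaults, so check docker-compose.yaml before relying on them:

# OpenSearch REST API on its default port (assumes the security plugin is disabled;
# otherwise you may need HTTPS and credentials)
curl -s http://localhost:9200

# List Kafka topics, including the ones Debezium creates; the exact script name and
# path depend on the Kafka image used in the compose file
docker compose exec kafka bin/kafka-topics.sh --bootstrap-server localhost:9092 --list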

Building the Environment with Docker Compose

Note: This guide assumes a local environment (Mac / Windows / Linux) with Docker Desktop installed. If you're unfamiliar with Docker CLI or Compose, refer to the official Docker documentation.

Docker Compose is a tool for defining and managing multi-container Docker applications. With a single docker-compose.yaml file, you can describe multiple services, including images, ports, environment variables, and dependencies.

In this project, the docker-compose.yaml file defines the following open-source services (a quick way to confirm this from the command line follows the list):

  • PostgreSQL (sample database)
  • Debezium Connect (captures changes from PostgreSQL and sends to Kafka)
  • Kafka + ZooKeeper (messaging infrastructure)
  • Flink (JobManager and TaskManager)
  • OpenSearch (search engine)
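
If you'd like to confirm this without reading the YAML, Compose can print the service names it finds in the file. The example output below is only illustrative; the authoritative list is whatever the command prints for this repository:

# Print the service names defined in docker-compose.yaml
docker compose config --services

# Example output (illustrative only; actual names come from the repository's compose file):
# postgres
# zookeeper
# kafka
# connect
# flink-jobmanager
# flink-taskmanager
# opensearch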

Starting the Services

To start all services:

docker compose up -d
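
After startup, it is worth confirming that every container reached a running state before moving on; for example:

# Show the status of all services defined in the compose file
docker compose ps

# Follow the combined logs while the stack warms up (Ctrl+C stops following)
docker compose logs -f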

Stopping the Services

To stop all services:

docker compose down
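
Note that docker compose down removes the containers and networks (data in named volumes is kept). If you only want to halt the stack and keep the containers around for a faster restart, stopping them is a lighter alternative:

# Stop the containers without removing them; docker compose start brings them back
docker compose stop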

Managing Individual Services

You can control individual services (e.g., pause, restart, check logs) like this:

docker compose restart kafka            # Restart Kafka service only
docker compose stop connect             # Stop Connect service only
docker compose start connect            # Start Connect service only
docker compose logs flink-taskmanager   # View logs from Flink TaskManager
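
When debugging a single component, following its log stream is often handier than dumping everything at once, for example:

# Follow the most recent Flink TaskManager log lines as they arrive
docker compose logs -f --tail=100 flink-taskmanager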

Pause and Resume

To temporarily suspend or resume all services:

docker compose pause
docker compose unpause

This is useful when you want to temporarily reduce CPU or I/O usage, or to freeze processing while you inspect logs.

Remove Persistent Volumes (Full Cleanup)

To completely clean up the environment, including persistent volumes (e.g., PostgreSQL data), use:

docker compose down --volumes

This removes the containers, networks, and the named volumes declared in the compose file, resetting the environment to its initial state. See the official Docker documentation for more details.
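
Because removing the volumes wipes the PostgreSQL data directory, the initialization scripts in postgres/ run again on the next startup (assuming they are mounted into the container's init directory, as is typical), which makes this a convenient way to get a completely fresh pipeline:

# Tear everything down, including volumes, then start a fresh environment
docker compose down --volumes
docker compose up -d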

With this, you now have the foundational environment for building a local data pipeline. In the next chapter, we will explore the structure and relationships between each component.

Related Links

(Coming soon: Chapter 2 — Understanding the Architecture and Component Relationships)
