⚡ Kafka ClickHouse: Real-Time Data Pipeline for Beginners
Mohamed Hussain S (@mohhddhassan) · Published Jul 2

Hey Devs 👋,

I'm Mohamed Hussain S, currently working as an Associate Data Engineer Intern.
After building a batch pipeline with Airflow and Postgres, I wanted to step into the real-time data world — so I created this lightweight Kafka → ClickHouse pipeline.

If you’re curious how streaming data pipelines actually work (beyond just theory), this one’s for you 🎯


🚀 What This Project Does

✅ Generates mock user data (name, email, age)
✅ Sends each message to a Kafka topic called user-signups
✅ A ClickHouse Kafka engine table listens for those messages
✅ A materialized view pushes clean data into a persistent table
✅ All of this runs in Docker for easy setup and teardown

It’s super lightweight and totally beginner-friendly — perfect for learning how Kafka and ClickHouse can work together.
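
Here's a minimal sketch of what the producer side can look like, assuming the kafka-python package and a broker on localhost:9092. The user-signups topic is from this project; the file name, mock fields, and send rate are my illustrative choices, not necessarily what the repo uses:

```python
# producer_sketch.py (illustrative name): mock-signup producer.
# Assumes the kafka-python package and a broker on localhost:9092.
import json
import random
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # serialize each dict to JSON bytes before it hits the topic
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

NAMES = ["Alice", "Bob", "Carol", "Dave"]  # illustrative mock data

while True:
    user = {
        "name": random.choice(NAMES),
        "email": f"user{random.randint(1, 9999)}@example.com",
        "age": random.randint(18, 65),
    }
    producer.send("user-signups", user)  # the project's topic
    producer.flush()                     # don't buffer; send right away
    print(f"sent: {user}")
    time.sleep(1)                        # one mock signup per second
```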


🧰 Tech Stack

  • Python — Kafka producer to simulate user signups
  • Kafka — distributed streaming platform
  • ClickHouse — OLAP database with native Kafka support
  • Docker — to spin up Kafka, Zookeeper, and ClickHouse
  • SQL — to define engine tables and views in ClickHouse

🗂️ Project Structure

kafka-clickhouse-pipeline/
├── producer/             # Python Kafka producer
├── clickhouse-setup.sql  # SQL to set up ClickHouse tables
├── docker-compose.yml    # All services defined here
├── screenshots/          # CLI outputs, topic messages, etc.
└── README.md             # Everything documented here

⚙️ How It Works

  1. Run docker-compose up — spins up Kafka, Zookeeper & ClickHouse
  2. Run the SQL file (sketched just below) to create:
     • Kafka engine table
     • Materialized view
     • Target users table
  3. Start the Python producer — sends mock user data to Kafka
  4. ClickHouse listens to the topic and stores data via the materialized view
  5. Boom — your real-time pipeline is up and running!
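
For step 2, the repo keeps the actual DDL in clickhouse-setup.sql. As a rough sketch of the three objects, here's an equivalent setup driven from Python with the clickhouse-driver package. The table, view, and column names are my guesses for illustration, not necessarily the repo's:

```python
# setup_clickhouse.py (illustrative name): the same three objects that
# clickhouse-setup.sql creates, executed via the clickhouse-driver package.
# Table/column names are guesses for illustration.
from clickhouse_driver import Client

client = Client(host="localhost")  # ClickHouse native protocol (port 9000)

# 1) Kafka engine table: subscribes to the topic, stores nothing itself.
client.execute("""
    CREATE TABLE IF NOT EXISTS user_signups_queue (
        name  String,
        email String,
        age   UInt8
    ) ENGINE = Kafka
    SETTINGS kafka_broker_list = 'kafka:9092',  -- broker as seen from inside the Docker network (guess)
             kafka_topic_list  = 'user-signups',
             kafka_group_name  = 'clickhouse-consumer',
             kafka_format      = 'JSONEachRow'
""")

# 2) Target table: where rows are actually persisted.
client.execute("""
    CREATE TABLE IF NOT EXISTS users (
        name  String,
        email String,
        age   UInt8
    ) ENGINE = MergeTree
    ORDER BY email
""")

# 3) Materialized view: pushes every consumed batch into `users`.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS users_mv TO users AS
    SELECT name, email, age FROM user_signups_queue
""")
```

One design note: the Kafka engine table persists nothing by itself. The materialized view is the piece that continuously copies each consumed batch into the MergeTree table, which is why step 4 "just works".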

🧪 Example Output

A single message sent to Kafka looks like this:

{"name": "Alice", "email": "alice@example.com", "age": 24}

And the users table in ClickHouse will store it like this:

| name  | email             | age |
|-------|-------------------|-----|
| Alice | alice@example.com | 24  |
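
If you'd rather check from Python than the ClickHouse client, a quick read-back could look like this (again assuming clickhouse-driver and the guessed users schema from the sketch above):

```python
# check_users.py (illustrative name): read back a few rows.
# Assumes clickhouse-driver and the guessed `users` schema above.
from clickhouse_driver import Client

client = Client(host="localhost")

rows = client.execute("SELECT name, email, age FROM users LIMIT 5")
for name, email, age in rows:
    print(f"{name:<10} {email:<28} {age}")
```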

Check the screenshots/ folder in the repo to see the whole thing in action 📸


🧠 Key Learnings

✅ How Kafka producers work with Python
✅ Setting up Kafka topics and brokers in Docker
✅ How ClickHouse can natively consume Kafka messages
✅ How materialized views automate transformation & insert
✅ Containerized orchestration made simple with Docker


💡 What’s Next?

🔁 Add a proper Kafka consumer (Python-based) as an alternative to the ClickHouse-side ingestion (starter sketch below)
🔍 Add logging, retries, and dead-letter queue logic
📈 Simulate more complex streaming use cases like page visits
📊 Plug in Grafana for real-time metrics from ClickHouse
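
For that first item, a standalone Python consumer could start from a minimal sketch like this, assuming kafka-python (the group id is an illustrative choice):

```python
# consumer_sketch.py (illustrative name): a standalone Python consumer,
# as an alternative path to the ClickHouse Kafka engine.
# Assumes the kafka-python package; the group id is an illustrative choice.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user-signups",
    bootstrap_servers="localhost:9092",
    group_id="python-consumer",    # its own group, so ClickHouse still gets every message too
    auto_offset_reset="earliest",  # on first run, start from the oldest message
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    user = message.value
    print(f"consumed: {user['name']} <{user['email']}> age={user['age']}")
```

Because it uses its own consumer group, it receives every message independently, so the ClickHouse Kafka engine keeps working in parallel.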


📌 Why You Should Try This

If you're exploring real-time data engineering:

  • Start with Kafka and Python — it’s intuitive and powerful
  • ClickHouse’s Kafka engine + materialized view combo = 💯
  • Docker lets you test and learn without messing up your local setup

This small project helped me understand the data flow in real-time systems — not just conceptually, but hands-on.


🔗 Repo

👉 GitHub Repo:


🙋‍♂️ About Me

Mohamed Hussain S
Associate Data Engineer Intern
LinkedIn | GitHub


⚙️ Building in public — one stream at a time.

