Hey Devs 👋,
I'm Mohamed Hussain S, currently working as an Associate Data Engineer Intern.
After building a batch pipeline with Airflow and Postgres, I wanted to step into the real-time data world — so I created this lightweight Kafka → ClickHouse pipeline.
If you’re curious how streaming data pipelines actually work (beyond just theory), this one’s for you 🎯
🚀 What This Project Does
✅ Generates mock user data (name, email, age)
✅ Sends each message to a Kafka topic called `user-signups`
✅ A ClickHouse Kafka engine table listens for those messages
✅ A materialized view pushes clean data into a persistent table
✅ All of this runs in Docker for easy setup and teardown
It’s super lightweight and totally beginner-friendly — perfect for learning how Kafka and ClickHouse can work together.
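The producer logic is only a handful of lines. Here's a minimal sketch of the idea, assuming the kafka-python client and a broker reachable on localhost:9092 (the actual script in producer/ may differ):

```python
import json
import random
import time

from kafka import KafkaProducer  # pip install kafka-python

NAMES = ["Alice", "Bob", "Charlie", "Diana"]

# Serialize each dict to a JSON-encoded byte string before sending
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    name = random.choice(NAMES)
    user = {
        "name": name,
        "email": f"{name.lower()}@example.com",
        "age": random.randint(18, 65),
    }
    producer.send("user-signups", user)  # topic name from this post
    print(f"sent: {user}")
    time.sleep(1)  # one mock signup per second
```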
🧰 Tech Stack
- Python — Kafka producer to simulate user signups
- Kafka — distributed streaming platform
- ClickHouse — OLAP database with native Kafka support
- Docker — to spin up Kafka, Zookeeper, and ClickHouse
- SQL — to define engine tables and views in ClickHouse
🗂️ Project Structure
kafka-clickhouse-pipeline/
├── producer/ # Python Kafka producer
├── clickhouse-setup.sql # SQL to set up ClickHouse tables
├── docker-compose.yml # All services defined here
├── screenshots/ # CLI outputs, topic messages, etc.
└── README.md # Everything documented here
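For context, the Compose file for a stack like this usually looks roughly like the sketch below. It's not the repo's exact file; image tags, ports, and environment variables are assumptions (the dual-listener Kafka config is the common pattern so that both the host machine and the ClickHouse container can reach the broker):

```yaml
version: "3.8"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0   # version assumed
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.4.0       # version assumed
    depends_on: [zookeeper]
    ports:
      - "9092:9092"                          # host access for the producer
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Internal listener for containers (kafka:29092), external for the host
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1   # single-broker setup

  clickhouse:
    image: clickhouse/clickhouse-server:23.8 # version assumed
    ports:
      - "8123:8123"   # HTTP interface
      - "9000:9000"   # native client
```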
⚙️ How It Works
- Run `docker-compose up` — spins up Kafka, Zookeeper & ClickHouse
- Run the SQL file to create (sketched after this list):
  - Kafka engine table
  - Materialized view
  - Target `users` table
- Start the Python producer — sends mock user data to Kafka
- ClickHouse listens to the topic and stores data via materialized view
- Boom — your real-time pipeline is up and running!
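Here's roughly what that SQL file could contain. Column types, table names, and the broker address are assumptions on my part, but the three-piece pattern (Kafka engine table → materialized view → MergeTree target) is ClickHouse's standard way to ingest from Kafka:

```sql
-- 1. Kafka engine table: a streaming "inbox" that reads the topic.
--    Names, types, and broker address are assumptions.
CREATE TABLE user_signups_queue
(
    name  String,
    email String,
    age   UInt8
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:29092',
         kafka_topic_list  = 'user-signups',
         kafka_group_name  = 'clickhouse-consumer',
         kafka_format      = 'JSONEachRow';

-- 2. Persistent target table where the data actually lives.
CREATE TABLE users
(
    name  String,
    email String,
    age   UInt8
)
ENGINE = MergeTree
ORDER BY name;

-- 3. Materialized view: fires on every batch read from the queue
--    and inserts it into the persistent table.
CREATE MATERIALIZED VIEW users_mv TO users AS
SELECT name, email, age
FROM user_signups_queue;
```

The materialized view is the glue here: a Kafka engine table on its own only streams messages (each one can be read once), so the view is what turns the stream into durable rows.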
🧪 Example Output
A single message sent to Kafka looks like this:
{"name": "Alice", "email": "alice@example.com", "age": 24}
And the `users` table in ClickHouse will store it like this:

name | email | age
---|---|---
Alice | alice@example.com | 24
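To verify beyond screenshots, you can also query the persistent table directly while the producer is running (table and column names follow the sketch above; adjust them to your setup):

```sql
-- Row count should keep growing as the producer sends messages
SELECT count() FROM users;

SELECT name, email, age
FROM users
LIMIT 10;
```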
Check the `screenshots/` folder in the repo to see the whole thing in action 📸
🧠 Key Learnings
✅ How Kafka producers work with Python
✅ Setting up Kafka topics and brokers in Docker
✅ How ClickHouse can natively consume Kafka messages
✅ How materialized views automate transformation & insert
✅ Containerized orchestration made simple with Docker
💡 What’s Next?
🔁 Add a proper Kafka consumer (Python-based) as an alternative to ClickHouse's native ingestion
🔍 Add logging, retries, and dead-letter queue logic
📈 Simulate more complex streaming use cases like page visits
📊 Plug in Grafana for real-time metrics from ClickHouse
📌 Why You Should Try This
If you're exploring real-time data engineering:
- Start with Kafka and Python — it’s intuitive and powerful
- ClickHouse’s Kafka engine + materialized view combo = 💯
- Docker lets you test and learn without messing up your local setup
This small project helped me understand the data flow in real-time systems — not just conceptually, but hands-on.
🔗 Repo
🙋‍♂️ About Me
Mohamed Hussain S
Associate Data Engineer Intern
LinkedIn | GitHub
⚙️ Building in public — one stream at a time.