Hey Devs 👋,
I'm Mohamed Hussain S, currently working as an Associate Data Engineer Intern.
After building a batch pipeline with Airflow and Postgres, I wanted to step into the real-time data world — so I created this lightweight Kafka → ClickHouse pipeline.
If you’re curious how streaming data pipelines actually work (beyond just theory), this one’s for you 🎯
🚀 What This Project Does
✅ Generates mock user data (name, email, age)
✅ Sends each message to a Kafka topic called `user-signups`
✅ A ClickHouse Kafka engine table listens for those messages
✅ A materialized view pushes clean data into a persistent table
✅ All of this runs in Docker for easy setup and teardown
It’s super lightweight and totally beginner-friendly — perfect for learning how Kafka and ClickHouse can work together.
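The producer logic is only a handful of lines. Here's a minimal sketch of the idea, assuming the kafka-python client and a broker reachable on localhost:9092 (the actual script in producer/ may differ):

```python
import json
import random
import time

from kafka import KafkaProducer  # pip install kafka-python

NAMES = ["Alice", "Bob", "Charlie", "Diana"]

# Serialize each dict to a JSON-encoded byte string before sending
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    name = random.choice(NAMES)
    user = {
        "name": name,
        "email": f"{name.lower()}@example.com",
        "age": random.randint(18, 65),
    }
    producer.send("user-signups", user)  # topic name from this post
    print(f"sent: {user}")
    time.sleep(1)  # one mock signup per second
```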
🧰 Tech Stack
- Python — Kafka producer to simulate user signups
- Kafka — distributed streaming platform
- ClickHouse — OLAP database with native Kafka support
- Docker — to spin up Kafka, Zookeeper, and ClickHouse
- SQL — to define engine tables and views in ClickHouse
🗂️ Project Structure
kafka-clickhouse-pipeline/
├── producer/ # Python Kafka producer
├── clickhouse-setup.sql # SQL to set up ClickHouse tables
├── docker-compose.yml # All services defined here
├── screenshots/ # CLI outputs, topic messages, etc.
└── README.md # Everything documented here
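For context, the Compose file for a stack like this usually looks roughly like the sketch below. It's not the repo's exact file; image tags, ports, and environment variables are assumptions (the dual-listener Kafka config is the common pattern so that both the host machine and the ClickHouse container can reach the broker):

```yaml
version: "3.8"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0   # version assumed
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.4.0       # version assumed
    depends_on: [zookeeper]
    ports:
      - "9092:9092"                          # host access for the producer
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Internal listener for containers (kafka:29092), external for the host
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1   # single-broker setup

  clickhouse:
    image: clickhouse/clickhouse-server:23.8 # version assumed
    ports:
      - "8123:8123"   # HTTP interface
      - "9000:9000"   # native client
```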
⚙️ How It Works
- Run `docker-compose up` — spins up Kafka, Zookeeper & ClickHouse
- Run the SQL file to create (sketched after this list):
  - Kafka engine table
  - Materialized view
  - Target `users` table
- Start the Python producer — sends mock user data to Kafka
- ClickHouse listens to the topic and stores data via materialized view
- Boom — your real-time pipeline is up and running!
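Here's roughly what that SQL file could contain. Column types, table names, and the broker address are assumptions on my part, but the three-piece pattern (Kafka engine table → materialized view → MergeTree target) is ClickHouse's standard way to ingest from Kafka:

```sql
-- 1. Kafka engine table: a streaming "inbox" that reads the topic.
--    Names, types, and broker address are assumptions.
CREATE TABLE user_signups_queue
(
    name  String,
    email String,
    age   UInt8
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:29092',
         kafka_topic_list  = 'user-signups',
         kafka_group_name  = 'clickhouse-consumer',
         kafka_format      = 'JSONEachRow';

-- 2. Persistent target table where the data actually lives.
CREATE TABLE users
(
    name  String,
    email String,
    age   UInt8
)
ENGINE = MergeTree
ORDER BY name;

-- 3. Materialized view: fires on every batch read from the queue
--    and inserts it into the persistent table.
CREATE MATERIALIZED VIEW users_mv TO users AS
SELECT name, email, age
FROM user_signups_queue;
```

The materialized view is the glue here: a Kafka engine table on its own only streams messages (each one can be read once), so the view is what turns the stream into durable rows.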
🧪 Example Output
A single message sent to Kafka looks like this:
{"name": "Alice", "email": "alice@example.com", "age": 24}
And the `users` table in ClickHouse will store it like this:

name | email | age
---|---|---
Alice | alice@example.com | 24
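To verify beyond screenshots, you can also query the persistent table directly while the producer is running (table and column names follow the sketch above; adjust them to your setup):

```sql
-- Row count should keep growing as the producer sends messages
SELECT count() FROM users;

SELECT name, email, age
FROM users
LIMIT 10;
```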
Check the `screenshots/` folder in the repo to see the whole thing in action 📸
🧠 Key Learnings
✅ How Kafka producers work with Python
✅ Setting up Kafka topics and brokers in Docker
✅ How ClickHouse can natively consume Kafka messages
✅ How materialized views automate transformation & insert
✅ Containerized orchestration made simple with Docker
💡 What’s Next?
🔁 Add a proper Kafka consumer (Python-based) as an alternative to ClickHouse's native ingestion
🔍 Add logging, retries, and dead-letter queue logic
📈 Simulate more complex streaming use cases like page visits
📊 Plug in Grafana for real-time metrics from ClickHouse
📌 Why You Should Try This
If you're exploring real-time data engineering:
- Start with Kafka and Python — it’s intuitive and powerful
- ClickHouse’s Kafka engine + materialized view combo = 💯
- Docker lets you test and learn without messing up your local setup
This small project helped me understand the data flow in real-time systems — not just conceptually, but hands-on.
🔗 Repo
🙋‍♂️ About Me
Mohamed Hussain S
Associate Data Engineer Intern
LinkedIn | GitHub
⚙️ Building in public — one stream at a time.