Hey Devs 👋,
If you’re diving into data engineering and want to explore monitoring internal database metrics, this post is for you.
As an Associate Data Engineer Intern, I’ve been learning by building — and this time, I wanted to peek under the hood of ClickHouse, a blazing-fast OLAP database.
So I built a mini-project to automate the extraction of system-level metrics from ClickHouse every hour using Airflow, Docker, and Python.
Here’s what the project does, how it works, and what I learned 👇
📊 What This Project Does
This mini-pipeline automates:
✅ Connecting to a running ClickHouse instance
✅ Querying the system.metrics table for real-time internal metrics
✅ Using Airflow to schedule this task hourly
✅ Appending the results to a daily CSV file
✅ Running everything inside Docker containers
It’s a great way to practice how monitoring, scheduling, and data capture all come together in real-world setups.
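Before wiring up any scheduling, it helps to run the core query by hand once. Here’s a minimal sketch using clickhouse-connect (the host, port, and the LIMIT are just assumptions for a local, no-auth ClickHouse):
# Minimal sketch: peek at system.metrics by hand.
# Assumes a local ClickHouse on the default HTTP port 8123 with no auth.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)
result = client.query("SELECT metric, value, description FROM system.metrics LIMIT 5")
for metric, value, description in result.result_rows:
    print(f"{metric} = {value}  ({description})")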
🧰 The Tech Stack
- Python — to connect to ClickHouse using clickhouse-connect
- Airflow — to orchestrate hourly metric pulls
- ClickHouse — the OLAP database we’re extracting metrics from
- Docker — to run ClickHouse + Airflow locally
- CSV files — to store hourly metric snapshots
⚙️ How It Works
- Airflow DAG runs every hour
- DAG triggers a Python script that connects to ClickHouse
- Script runs a query on the system.metrics table
- Results are appended to a CSV file for that day (e.g., metrics-2025-06-21.csv)
- Airflow handles logging and retries in case anything breaks
You’ll end up with a growing CSV file full of hourly metrics — a super simple, readable log of system behavior.
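Here’s a rough sketch of the extraction step itself. Names like OUTPUT_DIR and extract_metrics are illustrative (not the exact code in the repo), and the host "clickhouse" assumes the Docker Compose service name:
# Sketch of the hourly extraction task (illustrative names, not the exact repo code).
import csv
import os
import clickhouse_connect

OUTPUT_DIR = "output"  # hypothetical path for the daily CSV files

def extract_metrics(execution_time: str) -> None:
    """Query system.metrics and append one row per metric to today's CSV."""
    client = clickhouse_connect.get_client(host="clickhouse", port=8123)
    rows = client.query("SELECT metric, value FROM system.metrics").result_rows

    # One file per day, e.g. metrics-2025-06-21.csv
    date_part = execution_time[:10]  # "YYYY-MM-DD" from the Airflow timestamp
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    path = os.path.join(OUTPUT_DIR, f"metrics-{date_part}.csv")
    write_header = not os.path.exists(path)

    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["timestamp", "metric_name", "value"])
        for metric, value in rows:
            writer.writerow([execution_time, metric, value])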
🗂️ Project Structure
extract_clickhouse_metrics/
└── airflow-docker/
├── dags/ # Airflow DAG (scheduling logic)
├── scripts/ # Python script to connect + extract
├── output/ # Daily CSV metric logs
├── docker-compose.yaml # Runs Airflow + ClickHouse together
├── requirements.txt
    └── logs/                 # Airflow logs
Full repo here:
👉 GitHub: mohhddhassan/extract_clickhouse_metrics
🧠 Key Learnings
✅ How to use clickhouse-connect to query ClickHouse from Python
✅ Passing dynamic execution time via Airflow context (see the DAG sketch below)
✅ How to append to a CSV with a proper structure (timestamped)
✅ Setting up Airflow + ClickHouse in a Dockerized workflow
✅ Building habits around logging and modular DAG design
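To make that concrete, here’s a rough sketch of the DAG side: an hourly schedule plus the logical run time passed into the task via templating. The dag_id, task_id, and import path are illustrative, not the exact repo code:
# Sketch of an hourly DAG (illustrative; not the exact DAG from the repo).
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

from scripts.extract import extract_metrics  # hypothetical module path

with DAG(
    dag_id="extract_clickhouse_metrics",
    start_date=datetime(2025, 6, 1),
    schedule_interval="@hourly",  # newer Airflow versions use schedule="@hourly"
    catchup=False,
) as dag:
    PythonOperator(
        task_id="extract_system_metrics",
        python_callable=extract_metrics,
        # {{ ts }} is rendered by Airflow into the run's logical timestamp
        op_kwargs={"execution_time": "{{ ts }}"},
    )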
🔍 Sample Metric Snapshot (CSV Output)
Here’s what a single row from the output CSV looks like:
timestamp,metric_name,value
2025-06-21 14:00:00,Query,120
2025-06-21 14:00:00,Merge,3
...
Each run adds a fresh row for each metric in system.metrics, timestamped by Airflow’s execution context.
🔧 What’s Next?
🚀 Store the metrics in ClickHouse or PostgreSQL, not just CSV
📦 Push daily CSV files to S3 or Google Cloud Storage
📊 Use Grafana or Streamlit to visualize trends
🔍 Extract from other tables like system.events or system.asynchronous_metrics (quick example below)
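Swapping the source table is basically a one-line change to the query; for example, system.events exposes an event column instead of metric (same client object as in the earlier sketch):
# Same client, different system table: system.events uses "event" instead of "metric".
rows = client.query("SELECT event, value FROM system.events").result_rows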
📌 Why This Matters
Learning data engineering isn’t just about moving business data.
Understanding the health of your systems (like DBs, pipelines, infra) is just as important. This project gave me insight into how ClickHouse tracks internal activity — and how to automate its capture for future analysis.
It’s simple, but powerful.
🙋‍♂️ About Me
Mohamed Hussain S
Associate Data Engineer Intern
LinkedIn | GitHub
⏱️ Learning in public — one cron job at a time.