Hi everyone! My name is Alex and I'm a Backend Engineer.
This article is my attempt to better understand OpenTelemetry. I wanted to share my experience setting up a local observability stack — in case it helps others on the same path.
Table of Contents
- Introduction
- Observability and OpenTelemetry
- The Tools of the Trade
- Getting Started
- Final Thoughts
- Troubleshooting
- Resources
Introduction
Curiosity is often sparked in the simplest moments. While browsing technical content, I came across a video by Marcel Dempers explaining how to collect logs using OpenTelemetry. The walkthrough was clear and approachable, and it got me thinking — how hard would it be to recreate something like that locally?
This guide is the result of that question. Whether we're between projects or just eager to expand our skills, it offers a step-by-step exploration of setting up a local observability stack using Kubernetes, OpenTelemetry, ClickHouse, and Grafana. While it's not a production-ready deployment, it's a great way to gain hands-on experience with the foundational tools of modern observability.
Observability and OpenTelemetry — What Are They?
Observability is the practice of understanding what is happening inside a system based on the data it produces.
Think of observability like building with LEGO blocks. You have many options — various telemetry collectors, storage engines, visualization tools, and cloud platforms — and it's up to you to choose how to piece them together.
At the heart of both observability and OpenTelemetry are three shared pillars:
- Traces: Provide end-to-end visibility by following how a request flows across services.
- Metrics: Quantify the state and performance of our services.
- Logs: Record discrete events and messages during application execution.
Modern observability involves correlating all three to detect, troubleshoot, and resolve issues effectively. OpenTelemetry brings them together under one unified model, referring to them as signals — the raw observability data emitted by systems — and provides the tooling to collect those signals and route them to your chosen storage and visualization layers.
In this guide, we focus on the OpenTelemetry framework with ClickHouse and Grafana, but alternatives exist. Platforms like Datadog, SigNoz, OpenObserve, and HyperDX can serve similar purposes, each with different trade-offs.
If you're interested in a deeper dive into these concepts, the ClickHouse team has published a detailed article covering OpenTelemetry internals, architectural models, and log/trace handling: Engineering Observability with OpenTelemetry and ClickHouse. It's highly recommended for gaining a solid conceptual foundation.
The Tools of the Trade
💡 Note 1.
All examples in this guide assume you're running macOS. If you're using Linux or Windows, paths and some commands may need to be adjusted accordingly.
💡 Note 2.
As mentioned earlier, there's no single right way to set up observability — it depends heavily on your architecture, data volume, team preferences, and business needs. In this guide, we'll use ClickHouse as the main storage for logs and traces. This is just one of many valid approaches, shaped by practical constraints and design taste.
To simulate a real-world observability setup, we'll use the following tools. All of them are easy to install and run locally on macOS:
Docker for Mac: Required to run Kind clusters. Make sure it's running before you start. You can download it from Docker's official site.
Homebrew: A package manager for macOS that simplifies installing CLI tools. Installation guide is available here.
kubectl: The Kubernetes CLI for managing clusters. It might already be installed as part of Docker Desktop for Mac. If it's not available for some reason, you can install it manually using Homebrew:
brew install kubectl
Helm: A tool for managing Kubernetes charts. Install it with Homebrew:
brew install helm
K9s: A terminal UI for managing and observing Kubernetes clusters. It simplifies navigating resources, checking pod logs, and debugging directly from the terminal. You can install it via Homebrew:
brew install k9s
Node.js: Required for creating a sample NestJS application that generates logs and traces. Be sure to have Node.js installed before continuing. You can install it via Homebrew:
brew install node
or from the official site.
Kind: A lightweight tool for running local Kubernetes clusters using Docker. It allows fast prototyping and testing of Kubernetes-based infrastructure without the need for cloud resources. You can find installation instructions in the official quick start guide.
Prometheus: An open-source systems monitoring and alerting toolkit. It collects time-series metrics from configured targets at given intervals and stores them efficiently.
📌 In this guide, we use Prometheus indirectly — it's installed as part of the monitoring stack, and its metrics power some of the default Grafana dashboards. However, we won't be sending any custom metrics to it directly.
Grafana: A data visualization platform used to create dashboards and alerts. In our setup, it connects to Prometheus for metrics and ClickHouse for logs and traces.
ClickHouse: A column-oriented database designed for high-performance analytics. It excels at storing structured logs and trace data at scale, making it well-suited for observability use cases.
OpenTelemetry Collector: A vendor-neutral component that receives, processes, and exports telemetry data. It supports various data formats and allows routing logs, metrics, and traces to multiple destinations like ClickHouse and Prometheus.
Now that the core concepts are clear, it's time to roll up our sleeves and get hands-on with some code.
Getting Started
Before we begin, let's prepare a working directory for our observability setup. This folder will contain configuration files and the sample application, and will serve as the base for volume mounts in Kind.
Create it like this:
mkdir -p $PWD/test-observability/otelcol-storage
cd $PWD/test-observability
Now let's start building the environment step by step.
Step 1: Launching a Local Kubernetes Cluster
To simulate a production-like environment on our local machine, we use Kind (Kubernetes IN Docker). It lets us spin up a single-node Kubernetes cluster using Docker containers.
Now create a Kind config file kind-config.yaml that might look like this:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: ./otelcol-storage
        containerPath: /var/lib/otelcol
    extraPortMappings:
      - containerPort: 30080
        hostPort: 30080
        protocol: TCP
      - containerPort: 30090
        hostPort: 30090
        protocol: TCP
      - containerPort: 31000
        hostPort: 31000
        protocol: TCP
The port mappings in the Kind config explicitly expose service ports on our local machine. This makes it easy to access Grafana, Prometheus dashboards, and ClickHouse through the browser at localhost.
The volume mount is included here as a placeholder — we'll explain its purpose in detail later when setting up the OpenTelemetry Collector. In short, it lets the collector persist file-based state across pod restarts.
Now run this command in your terminal:
kind create cluster --name observability --config kind-config.yaml
Creating the cluster may take a minute or two depending on your machine's performance and whether the required Docker image is already cached. During this process, Kind sets up a Kubernetes control plane node, installs a container network interface (CNI), and provisions storage.
You should see output similar to this:
➜ test-observability kind create cluster --name observability --config kind-config.yaml
Creating cluster "observability" ...
✓ Ensuring node image (kindest/node:v1.32.2) 🖼
✓ Preparing nodes 📦
✓ Writing configuration 📜
✓ Starting control-plane 🕹️
✓ Installing CNI 🔌
✓ Installing StorageClass 💾
Verify that all nodes are up and running and your local cluster is alive:
kubectl get nodes
kubectl get pods -A
kubectl cluster-info --context kind-observability
Now that our cluster is up and running, it's time to install the monitoring stack — Prometheus and Grafana.
Step 2: Deploying Prometheus and Grafana
We'll use the kube-prometheus-stack chart from the prometheus-community Helm repository, which bundles both Prometheus and Grafana in a pre-integrated stack, making setup easier for local testing.
To make the Helm charts for Prometheus and Grafana available locally, add the Helm repository first:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Next, create a new namespace to isolate the monitoring components from the rest of the cluster:
kubectl create namespace monitoring
... and install Prometheus and Grafana:
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--set prometheus.prometheusSpec.retention=15d \
--set grafana.service.type=NodePort \
--set grafana.service.nodePort=30080 \
--set prometheus.service.type=NodePort \
--set prometheus.service.nodePort=30090
Wait until all pods are up and running.
kubectl get pods -n monitoring
kubectl get svc -n monitoring
Grafana is now accessible at http://localhost:30080 (default login: admin/prom-operator). You can always find the admin password by executing:
kubectl --namespace monitoring get secrets prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d; echo
Now that both Prometheus and Grafana are up and running, let's import our first dashboard to visualize data from the Kubernetes node.
- Open Grafana (http://localhost:30080), and from the left menu, go to Dashboards → New → Import.
- This opens the "Import Dashboard" view. Click on the link to grafana.com/dashboards.
- In the new browser tab, search for "Node Exporter Full" and open this dashboard: Node Exporter Full – ID 1860.
- Click Copy ID to clipboard.
- Go back to Grafana, paste the ID (1860) into the "Import via grafana.com" field, and click Load.
- Choose Prometheus as the data source and click Import.
Voilà! The dashboard is now live and displaying system metrics from your Kubernetes node, which should appear as observability-control-plane.
💡 Note: You can also access Prometheus directly by visiting http://localhost:30090/status in your browser. We're able to reach this from the host machine because port 30090 was explicitly mapped in the Kind configuration.
Next up, let's install ClickHouse — our database for storing logs and traces.
Step 3: Installing ClickHouse
First, add the ClickHouse Helm chart source:
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
Now install ClickHouse into the same namespace used by Prometheus and Grafana:
helm install clickhouse bitnami/clickhouse \
--namespace monitoring \
--set architecture=standalone \
--set replicaCount=1 \
--set shards=1 \
--set auth.username=admin \
--set auth.password=clickhouse123 \
--set service.type=NodePort \
--set service.nodePorts.http=31000 \
--set persistence.enabled=false \
--set keeper.enabled=false \
--set metrics.enabled=true \
--set resources.requests.cpu=500m \
--set resources.requests.memory=512Mi \
--set resources.limits.cpu=4 \
--set resources.limits.memory=4Gi
Before we proceed, it's worth highlighting a couple of custom settings we used in the installation above:
- We explicitly provided a username and password for ClickHouse (admin/clickhouse123) to make it easy to log in using the CLI.
- We overrode the default CPU and memory requests/limits. In particular, the default memory limit of 750Mi often isn't enough for ClickHouse, especially when handling real query loads. In practice, insufficient memory will eventually lead to query failures or service instability.
Use the following commands to check that the ClickHouse pod and service are available:
kubectl get pods -n monitoring -l app.kubernetes.io/name=clickhouse
kubectl get svc -n monitoring -l app.kubernetes.io/name=clickhouse
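If you later want to sanity-check which memory-related settings ClickHouse is actually running with, you can do it from a clickhouse-client session (we open one in the next step). This is optional, and the exact setting names can vary between ClickHouse versions:
-- Optional: list memory-related settings (names may differ by version).
SELECT name, value, changed
FROM system.settings
WHERE name ILIKE '%memory%'
ORDER BY name
LIMIT 20;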
Step 4: ClickHouse Database Setup
Before we move on and deploy the OpenTelemetry Collector, let's pause for a moment to focus on the database setup.
The ClickHouse team officially supports and contributes to the OpenTelemetry exporter for ClickHouse, and it provides convenient defaults for automatically creating tables to store logs and traces.
However, there are two important caveats:
- The exporter will not create the database itself. That step is entirely up to you. The exporter can create tables, but only within an existing database.
- ClickHouse engineers explicitly recommend avoiding automatic table creation in production environments. In real-world setups, you'll often have more than one collector running — each potentially trying to create the same schema. More importantly, designing a universal schema that fits every workload is practically impossible. Instead, consider the default schema as a starting point — then evolve it using materialized views tailored to your access patterns and business needs. You can find more insights in their excellent technical blog post: ClickHouse and OpenTelemetry.
So, to create a new database, access the ClickHouse console inside the pod:
kubectl exec -n monitoring -it svc/clickhouse -- clickhouse-client \
--user=admin --password=clickhouse123
Inside the ClickHouse SQL console, run:
CREATE DATABASE IF NOT EXISTS observability;
At this point, we've got a fully functional Kubernetes cluster running Prometheus, Grafana, and ClickHouse. Now it's time to dive into the most exciting part — deploying and configuring the OpenTelemetry Collector.
Step 5: Deploying the OpenTelemetry Collector
Let's recap what the OpenTelemetry Collector is. It's a vendor-agnostic service that receives, processes, and exports telemetry data (logs, metrics, and traces) from your applications. It acts as a pipeline that standardizes and routes observability signals to multiple destinations.
There are multiple ways to deploy the collector. In production environments, you might consider using the OpenTelemetry Operator for better lifecycle management and integration with Kubernetes. For local development, however, a DaemonSet deployment is often simpler. It ensures that one instance of the collector runs on every node (we have only one node in our setup), which is especially useful for collecting container logs directly from the node filesystem.
However, unlike a typical Deployment-plus-Service setup, a DaemonSet isn't automatically fronted by a Service. To route traffic to the collector pods (e.g., from your instrumented apps using OTLP), you'll need a headless service that exposes them without load-balancing:
apiVersion: v1
kind: Service
metadata:
  name: otel-collector-opentelemetry-collector-agent
  namespace: observability
  labels:
    app.kubernetes.io/name: opentelemetry-collector
    app.kubernetes.io/instance: otel-collector
spec:
  clusterIP: None # Headless service
  selector:
    app.kubernetes.io/name: opentelemetry-collector
    app.kubernetes.io/instance: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
For now, save this manifest as otel-collector-headless-service.yaml in your working directory. We'll apply it shortly to expose the collector to applications inside the cluster — allowing them to send telemetry via OTLP (gRPC).
Next, add the open-telemetry Helm repository:
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
Then create a custom configuration file and save it as otel-collector-values.yaml:
image:
  repository: otel/opentelemetry-collector-contrib
  tag: latest
  pullPolicy: IfNotPresent

mode: daemonset

securityContext:
  runAsUser: 0
  runAsGroup: 0
  fsGroup: 0

config:
  receivers:
    otlp:
      protocols:
        grpc: {}
        http: {}
    filelog:
      include: [/var/log/containers/*app-*.log]
      start_at: end
      include_file_path: true
      include_file_name: false
      storage: file_storage
      operators:
        - id: container-parser
          type: container
          add_metadata_from_filepath: false
        - type: json_parser
          parse_to: body
          if: body matches "^{.*}$"
          # on_error: drop_quiet
          timestamp:
            parse_from: body.time
            layout_type: epoch
            layout: ms
          severity:
            parse_from: body.level
            overwrite_text: true
        - type: copy
          from: body.service_name
          to: resource["service.name"]
        - type: trace_parser
          trace_id:
            parse_from: body.trace_id
          span_id:
            parse_from: body.span_id
          trace_flags:
            parse_from: body.trace_flags
  extensions:
    health_check: {}
    pprof: {}
    zpages: {}
    file_storage:
      directory: /var/lib/otelcol/.data/storage/
      create_directory: true
  processors:
    memory_limiter:
      check_interval: 5s
      limit_mib: 512
    batch:
      timeout: 5s
      send_batch_size: 5000
  exporters:
    clickhouse:
      endpoint: tcp://clickhouse.monitoring.svc.cluster.local:9000?compress=lz4&async_insert=1
      database: observability
      username: admin
      password: clickhouse123
      logs_table_name: otel_logs
      traces_table_name: otel_traces
      create_schema: true
      ttl: 8760h
    debug:
      verbosity: detailed
  service:
    extensions: [health_check, pprof, zpages, file_storage]
    pipelines:
      logs:
        receivers: [filelog]
        processors: [memory_limiter, batch]
        exporters: [debug, clickhouse]
      traces:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [clickhouse]

extraVolumes:
  - name: varlog
    hostPath:
      path: /var/log
  - name: varlibdockercontainers
    hostPath:
      path: /var/lib/docker/containers
  - name: otelcolstorage
    hostPath:
      path: /var/lib/otelcol

extraVolumeMounts:
  - name: varlog
    mountPath: /var/log
  - name: varlibdockercontainers
    mountPath: /var/lib/docker/containers
  - name: otelcolstorage
    mountPath: /var/lib/otelcol
Let's create a new namespace for the collector and call it observability:
kubectl create namespace observability
Then install the collector:
helm install otel-collector open-telemetry/opentelemetry-collector \
--namespace observability \
-f otel-collector-values.yaml
To verify that the collector is running and ready to receive telemetry:
kubectl get pods -n observability
kubectl describe svc otel-collector-opentelemetry-collector-agent -n observability
Now we can check the collector logs to confirm that it's running properly:
2025-05-17T15:09:47.440Z info otlpreceiver@v0.126.0/otlp.go:116 Starting GRPC server {"resource": {}, "otelcol.component.id": "otlp", "otelcol.component.kind": "receiver", "endpoint": "10.244.0.13:4317"}
2025-05-17T15:09:47.443Z info otlpreceiver@v0.126.0/otlp.go:173 Starting HTTP server {"resource": {}, "otelcol.component.id": "otlp", "otelcol.component.kind": "receiver", "endpoint": "10.244.0.13:4318"}
2025-05-17T15:09:47.446Z info prometheusreceiver@v0.126.0/metrics_receiver.go:154 Starting discovery manager {"resource": {}, "otelcol.component.id": "prometheus", "otelcol.component.kind": "receiver", "otelcol.signal": "metrics"}
Great! Our collector is up and running, listening for incoming HTTP and gRPC connections.
Next, let's verify that our ClickHouse database has the expected telemetry tables.
Open a ClickHouse SQL shell:
kubectl exec -n monitoring -it svc/clickhouse -- clickhouse-client --user=admin --password=clickhouse123
Then, in the ClickHouse client, run:
SHOW DATABASES;
Expected output:
┌─name────────────────────┐
│ INFORMATION_SCHEMA │
│ default │
│ information_schema │
│ observability │
│ system │
└─────────────────────────┘
Switch to the observability database:
USE observability;
Check the available tables:
SHOW TABLES;
You should see something like:
┌─name────────────────────────────┐
│ otel_logs │
│ otel_traces │
│ otel_traces_trace_id_ts │
│ otel_traces_trace_id_ts_mv │
└─────────────────────────────────┘
That means everything is connected properly and working as expected!
Now let's take a look at the structure of the tables.
To inspect the otel_logs table structure, run:
SHOW CREATE TABLE otel_logs;
Example output:
CREATE TABLE observability.otel_logs
(
`Timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
`TimestampTime` DateTime DEFAULT toDateTime(Timestamp),
`TraceId` String CODEC(ZSTD(1)),
`SpanId` String CODEC(ZSTD(1)),
`TraceFlags` UInt8,
`SeverityText` LowCardinality(String) CODEC(ZSTD(1)),
`SeverityNumber` UInt8,
`ServiceName` LowCardinality(String) CODEC(ZSTD(1)),
`Body` String CODEC(ZSTD(1)),
`ResourceSchemaUrl` LowCardinality(String) CODEC(ZSTD(1)),
`ResourceAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
`ScopeSchemaUrl` LowCardinality(String) CODEC(ZSTD(1)),
`ScopeName` String CODEC(ZSTD(1)),
`ScopeVersion` LowCardinality(String) CODEC(ZSTD(1)),
`ScopeAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
`LogAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
INDEX idx_res_attr_key mapKeys(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_res_attr_value mapValues(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_scope_attr_key mapKeys(ScopeAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_scope_attr_value mapValues(ScopeAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_log_attr_key mapKeys(LogAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_log_attr_value mapValues(LogAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_body Body TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 8
)
ENGINE = MergeTree
PARTITION BY toDate(TimestampTime)
PRIMARY KEY (ServiceName, TimestampTime)
ORDER BY (ServiceName, TimestampTime, Timestamp)
TTL TimestampTime + toIntervalDay(365)
SETTINGS index_granularity = 8192, ttl_only_drop_parts = 1
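As the ClickHouse team suggests, treat this default schema as a starting point rather than something to edit in place. As a minimal sketch of that idea (the view name and aggregation below are hypothetical, not something the exporter creates), you could derive a small rollup with a materialized view instead of changing otel_logs itself:
-- Hypothetical rollup: hourly error counts per service, fed from otel_logs.
CREATE MATERIALIZED VIEW observability.otel_logs_errors_hourly_mv
ENGINE = SummingMergeTree
ORDER BY (ServiceName, Hour)
AS
SELECT
    ServiceName,
    toStartOfHour(TimestampTime) AS Hour,
    count() AS ErrorCount
FROM observability.otel_logs
WHERE SeverityText = 'ERROR'
GROUP BY ServiceName, Hour;
Dashboards can then read from the small rollup, while the raw otel_logs table keeps serving ad-hoc queries and trace correlation.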
Now let's inspect the otel_traces table:
SHOW CREATE TABLE otel_traces;
Example output:
CREATE TABLE observability.otel_traces
(
`Timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
`TraceId` String CODEC(ZSTD(1)),
`SpanId` String CODEC(ZSTD(1)),
`ParentSpanId` String CODEC(ZSTD(1)),
`TraceState` String CODEC(ZSTD(1)),
`SpanName` LowCardinality(String) CODEC(ZSTD(1)),
`SpanKind` LowCardinality(String) CODEC(ZSTD(1)),
`ServiceName` LowCardinality(String) CODEC(ZSTD(1)),
`ResourceAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
`ScopeName` String CODEC(ZSTD(1)),
`ScopeVersion` String CODEC(ZSTD(1)),
`SpanAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
`Duration` UInt64 CODEC(ZSTD(1)),
`StatusCode` LowCardinality(String) CODEC(ZSTD(1)),
`StatusMessage` String CODEC(ZSTD(1)),
`Events.Timestamp` Array(DateTime64(9)) CODEC(ZSTD(1)),
`Events.Name` Array(LowCardinality(String)) CODEC(ZSTD(1)),
`Events.Attributes` Array(Map(LowCardinality(String), String)) CODEC(ZSTD(1)),
`Links.TraceId` Array(String) CODEC(ZSTD(1)),
`Links.SpanId` Array(String) CODEC(ZSTD(1)),
`Links.TraceState` Array(String) CODEC(ZSTD(1)),
`Links.Attributes` Array(Map(LowCardinality(String), String)) CODEC(ZSTD(1)),
INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
INDEX idx_res_attr_key mapKeys(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_res_attr_value mapValues(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_span_attr_key mapKeys(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_span_attr_value mapValues(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_duration Duration TYPE minmax GRANULARITY 1
)
ENGINE = MergeTree
PARTITION BY toDate(Timestamp)
ORDER BY (ServiceName, SpanName, toDateTime(Timestamp))
TTL toDateTime(Timestamp) + toIntervalDay(365)
SETTINGS index_granularity = 8192, ttl_only_drop_parts = 1
Looks promising!
Finalizing Collector Networking Setup
Now let's launch the headless service to allow other applications to communicate with the OpenTelemetry Collector:
kubectl apply -f otel-collector-headless-service.yaml
Check that all components are running:
kubectl get all -n observability
You should see output similar to:
pod/otel-collector-opentelemetry-collector-agent-mfh7d 1/1 Running
service/otel-collector-opentelemetry-collector-agent 1/1 Running
daemonset.apps/otel-collector-opentelemetry-collector-agent 1/1 Running
Our collector is now active as a DaemonSet, and the headless service allows applications to send OTLP traffic.
To verify connectivity from within the cluster, open a Grafana pod shell:
kubectl -n monitoring exec -it prometheus-grafana-77bcfb9bdb-8pfjg -- bash
Then test the connection:
nc -vz otel-collector-opentelemetry-collector-agent.observability.svc.cluster.local 4317
Expected output:
otel-collector-opentelemetry-collector-agent.observability.svc.cluster.local (10.244.0.13:4317) open
That confirms the service is reachable from within the cluster. We're ready to send and visualize telemetry.
Understanding the OpenTelemetry Collector Configuration
But before we continue, it's worth pausing and examining what the OpenTelemetry Collector configuration file actually consists of. If you haven't yet, I highly recommend watching this video by Marcel Dempers, which explains the basics in a very clear and digestible way.
In short, the configuration file can be logically divided into several key sections:
1. receivers
Receivers are how data gets into the Collector. Each receiver is responsible for accepting telemetry data in a specific format or from a specific source — e.g., OTLP over HTTP/gRPC, Prometheus metrics scraping, or reading logs from files.
A receiver is how data gets into the Collector. Receivers "listen" for data being sent to them or collect it from a target.
— OpenTelemetry Documentation
2. extensions
Extensions add optional but useful features to the Collector. They are not part of the signal pipeline but provide capabilities like health checks, profiling endpoints, zPages for live debugging, or persistent file storage.
Extensions are optional components that provide additional capabilities such as health checks, file storage, or zPages.
— OpenTelemetry Documentation
3. processors
Processors transform telemetry data between receiving and exporting. They can batch, filter, enhance, or drop data. Common processors include batching for improved throughput and memory limiting to protect from resource overuse.
Processors are used to modify data before it is exported. They can be chained and applied to logs, metrics, or traces.
— OpenTelemetry Documentation
4. exporters
Exporters are how telemetry data leaves the Collector. Each exporter sends data to a backend system like ClickHouse, Prometheus, Jaeger, etc. Exporters are the final stage in a telemetry pipeline.
Exporters are how data is sent to other systems or storage backends.
— OpenTelemetry Documentation
5. service
This section brings everything together. It defines the telemetry pipelines (logs, traces, metrics), specifying which receivers, processors, and exporters to use. It also activates the extensions.
The service section defines how the Collector runs: which pipelines to start and which extensions to enable.
— OpenTelemetry Documentation
Example: How We Use file_storage
In our configuration, the file_storage extension is used to enable state persistence — particularly useful for the filelog receiver, which maintains internal state about file offsets.
We configure it as follows:
extensions:
  file_storage:
    directory: /var/lib/otelcol/.data/storage/
    create_directory: true
This tells the collector to store internal metadata (e.g., file read positions) in the specified directory. Combined with the persistent volume mount in the kind-config.yaml config file:
- hostPath: ./otelcol-storage
  containerPath: /var/lib/otelcol
…this ensures that the state survives restarts and pod rescheduling. Without it, the collector may reprocess logs from the beginning — causing duplicates — or, worse, miss new entries written during the restart (this behavior is also influenced by the start_at setting).
The filelog receiver references this extension using:
storage: file_storage
This closes the loop and ensures robust, restart-safe log ingestion.
Why We Use the filelog Receiver
In our current setup, we're relying on the filelog receiver to collect logs directly from the node's filesystem. While OpenTelemetry SDKs for most languages support sending logs and metrics programmatically to the collector, this feature isn't always production-ready. For example, at the time of writing, the Node.js SDK documentation explicitly marks log export support as "in development".
That's why we're sticking to the old, reliable approach: scraping logs directly from disk. It's robust, language-agnostic, and doesn't require modifying application code — a practical choice when SDKs lag behind or instrumentation needs to stay lightweight.
It's also worth noting that not all containerized applications produce logs in the same format. Depending on the base image, logging library, or runtime, your logs might be plain text, JSON, or even multi-line stack traces.
Fortunately, OpenTelemetry's filelog receiver supports a wide range of parsing operators to help with that. You can mix and match operators like json_parser, regex_parser, trace_parser, multiline, move, and more to handle even the most bizarre log formats.
In our case, we use the purpose-built container log parser operator — designed specifically for Kubernetes environments. It's optimized for parsing container logs where metadata is appended outside of the actual log payload.
Log Parsing Flow in Our Configuration
In our setup, we're collecting logs from container files that match the pattern /var/log/containers/*app-*.log. These are symbolic links pointing to actual log files managed by the container runtime — and depending on which runtime is used (Docker, containerd, or CRI-O), the format of those logs can differ quite a bit.
Here's a quick comparison of the most common formats:
| Runtime | Log Format Example |
| --- | --- |
| Docker | JSON: {"log":"msg","stream":"stdout","time":"2024-01-01T12:00:00.000000000Z"} |
| containerd | Plaintext (CRI format), essentially the same layout as CRI-O below |
| CRI-O | Plaintext: 2024-01-01T12:00:00.000000000Z stdout F {"time":..., "level":"info"} |
These differences matter — if we want to parse logs properly, we need to apply the right sequence of operators to extract structured data from the raw lines.
In our case, the node's runtime writes CRI-style plaintext lines whose payload is a JSON log body. Here's how we process them using the filelog receiver:
filelog:
  include: [/var/log/containers/*app-*.log]
  start_at: end
  include_file_path: true
  include_file_name: false
  storage: file_storage
  operators:
    - id: container-parser
      type: container
      add_metadata_from_filepath: false
    - type: json_parser
      parse_to: body
      if: body matches "^{.*}$"
      timestamp:
        parse_from: body.time
        layout_type: epoch
        layout: ms
      severity:
        parse_from: body.level
        overwrite_text: true
    - type: copy
      from: body.service_name
      to: resource["service.name"]
    - type: trace_parser
      trace_id:
        parse_from: body.trace_id
      span_id:
        parse_from: body.span_id
      trace_flags:
        parse_from: body.trace_flags
Let's break that down:
- container — strips away the runtime's metadata prefix (timestamp, stream, flag) and gives us just the log body.
- json_parser — kicks in if the body looks like JSON. It extracts the timestamp and severity from specific fields like body.time and body.level.
- copy — maps service_name from the body to a proper OpenTelemetry resource attribute.
- trace_parser — extracts the trace context (trace_id, span_id, trace_flags) so the log can be linked to a trace.
A Note on the OpenTelemetry Log Data Model
All of this parsing leads to one goal: transforming raw logs into structured entries that comply with the OpenTelemetry Log Data Model.
This model defines how logs should be structured — with clear separation between:
- Body: the actual log message
- Attributes: key-value pairs for context
- Trace context: TraceId, SpanId, and TraceFlags
- Resource metadata: like service name, environment, or deployment info
By following this model, we ensure that our logs are ready for advanced querying, correlation with traces, and rich visualizations in tools like ClickHouse and Grafana.
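To make the mapping concrete, here's an illustrative query against the default exporter schema we inspected earlier — it will return rows once the sample app from the next step starts shipping logs:
-- One column per part of the log data model.
SELECT
    Body AS body,                                        -- the log message
    LogAttributes AS attributes,                         -- key-value context
    TraceId, SpanId, TraceFlags,                         -- trace context
    ResourceAttributes['service.name'] AS service_name   -- resource metadata
FROM observability.otel_logs
ORDER BY Timestamp DESC
LIMIT 5;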
I also recommend reading Attribute Registry Specification to gain a better understanding of standard attributes and how they correlate with this model.
Step 6: Sample Application Deployment
Alright — the hardest part is behind us. Let's recap what we've accomplished so far.
We have a local Kubernetes cluster up and running with Prometheus, Grafana, and ClickHouse installed. The OpenTelemetry Collector has been successfully deployed as a DaemonSet, and we've created a headless service to expose it. We also verified that ClickHouse automatically created the required tables for logs and traces.
Everything is now in place to deploy a sample application that emits logs and traces — so we can validate the full end-to-end pipeline.
Setting Up the Sample Service
Now, we are going to create a simple web application using NestJS. Don't worry if you've never used NestJS before — we'll write everything in plain TypeScript, and the code is easy to follow. The idea is quite simple: we will deploy our application to our local cluster as two independent services, each with two replicas. There are two endpoints inside the application: ping and call. The first endpoint simply returns pong and simulates some activity for 50 milliseconds, creating a test-span in the process. The second endpoint, call, is more complicated: it calls its sibling's ping (app-a/call → app-b/ping, and vice versa), which does the same thing. This will allow us to see how traces are built for two independent applications.
If Nest CLI isn't installed globally yet:
npm install -g @nestjs/cli
We'll assume you're still inside the test-observability
directory.
Generate a new project:
nest new my-observability-app
cd my-observability-app
Choose defaults when prompted (you can go with either npm or yarn).
Installing Dependencies
The base NestJS app doesn't include observability tools, so let's install everything needed for logging, tracing, and OTLP exports:
npm install \
nestjs-pino pino pino-http pino-pretty pino-opentelemetry-transport \
@opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-grpc \
@opentelemetry/exporter-metrics-otlp-grpc \
@opentelemetry/instrumentation-http \
@opentelemetry/instrumentation-express \
@opentelemetry/instrumentation-pino \
@opentelemetry/instrumentation-redis \
@opentelemetry/instrumentation-ioredis \
@opentelemetry/instrumentation-mysql \
sonyflake
Wiring Up Tracing and Logging
src/tracing.ts — this is where our observability begins. We use @opentelemetry/sdk-node to connect to our collector via gRPC and begin sending traces. We won't dive into SDK configuration here — it's a deep topic that deserves its own article. The main thing to pay attention to is the getNodeAutoInstrumentations() function (github). It wires up the instrumentation packages we installed, so HTTP calls, Express handlers, and Pino logs automatically produce spans and carry trace context. I've also added comments showing how to enable debug logging in @opentelemetry/sdk-node.
💡 One more important clarification: for everything to work, this file must be loaded before the main application starts (that's what the node -r flag does later). We will come back to this when we launch the application through Docker.
src/tracing.ts:
import { credentials } from '@grpc/grpc-js';
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
// import { diag, DiagConsoleLogger, DiagLogLevel } from '@opentelemetry/api';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
    credentials: credentials.createInsecure(),
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

// diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);

sdk.start();
Next is our main.ts file, where the application is bootstrapped. I've made some slight modifications to it to ensure that the Pino logger is used properly.
src/main.ts:
import { NestFactory } from '@nestjs/core';
import { Logger } from 'nestjs-pino';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule, { bufferLogs: true });
  app.useLogger(app.get(Logger));
  await app.listen(3000);
}
bootstrap();
The next two are our controller, which has endpoints for ping and call, and a service that simulates some work.
src/app.controller.ts:
import { Controller, Get } from '@nestjs/common';
import { InjectPinoLogger, PinoLogger } from 'nestjs-pino';
import { AppService } from './app.service';

@Controller()
export class AppController {
  private readonly targetHost: string;

  constructor(
    @InjectPinoLogger(AppService.name)
    private readonly logger: PinoLogger,
    private readonly appService: AppService,
  ) {
    const self = process.env.OTEL_SERVICE_NAME;
    if (self === 'app-a') {
      this.targetHost = 'http://app-b:3000';
    } else if (self === 'app-b') {
      this.targetHost = 'http://app-a:3000';
    } else {
      this.targetHost = 'http://0.0.0.0:3000';
    }
  }

  @Get('ping')
  async ping(): Promise<string> {
    await this.getHello();
    return 'pong';
  }

  @Get('call')
  async call(): Promise<string> {
    if (!this.targetHost) {
      return 'Unknown app role, cannot determine target';
    }
    try {
      const res = await fetch(`${this.targetHost}/ping`);
      const text = await res.text();
      return `Response from ${this.targetHost}: ${text}`;
    } catch (err) {
      this.logger.error(err);
      return `Failed to call ${this.targetHost}: ${err}`;
    }
  }

  @Get()
  getHello(): Promise<string> {
    return this.appService.getHello();
  }
}
src/app.service.ts:
import { Injectable } from '@nestjs/common';
import { trace } from '@opentelemetry/api';
import { PinoLogger, InjectPinoLogger } from 'nestjs-pino';

@Injectable()
export class AppService {
  constructor(
    @InjectPinoLogger(AppService.name)
    private readonly logger: PinoLogger,
  ) {}

  async getHello(): Promise<string> {
    const tracer = trace.getTracer('manual-test');
    await tracer.startActiveSpan('test-span', async (span) => {
      await new Promise((res) => setTimeout(res, 50));
      span.end();
    });
    this.logger.info('Handling getHello request...');
    this.doSomething();
    this.logger.info('Finished getHello request.');
    return 'Hello World!';
  }

  doSomething(): void {
    this.logger.info('Doing something internal...');
  }
}
Now let's move on to the module setup. AppModule is our main, or root, module. In NestJS, modules form the basis of business logic by uniting services and controllers, and are a recommended and convenient way to organise code. Modules can import other modules. We will therefore import the Pino module into our root module, which will allow us to use the Pino logger in our application. We're going to take a closer look at what's actually going on here.
src/app.module.ts:
import { Module } from '@nestjs/common';
import { LoggerModule } from 'nestjs-pino';
import { Sonyflake } from 'sonyflake';
import { IncomingMessage } from 'http';
import { AppController } from './app.controller';
import { AppService } from './app.service';

const isProd = process.env.NODE_ENV === 'production';

const sonyflake = new Sonyflake({
  machineId: 2,
  epoch: Date.UTC(2020, 4, 18, 0, 0, 0),
});

@Module({
  imports: [
    LoggerModule.forRoot({
      pinoHttp: {
        base: {
          service_name: process.env.OTEL_SERVICE_NAME || 'app-a',
        },
        customLevels: {
          trace: 1,
          debug: 5,
          info: 9,
          warn: 13,
          error: 17,
          fatal: 21,
        },
        useOnlyCustomLevels: true,
        genReqId: (req: IncomingMessage) => {
          const id = sonyflake.nextId();
          req.id = id;
          return id;
        },
        level: process.env.LOG_LEVEL || 'info',
        ...(isProd
          ? {}
          : {
              transport: {
                target: 'pino-pretty',
                options: {
                  colorize: true,
                },
              },
            }),
      },
    }),
  ],
  controllers: [AppController],
  providers: [AppService],
})
export class AppModule {}
Let's go through the settings for the logger.
This
  base: {
    service_name: process.env.OTEL_SERVICE_NAME || 'app-a',
  },
... will add a service_name field to our logs.
This
  const sonyflake = new Sonyflake({
    machineId: 2, // in range 2^16
    epoch: Date.UTC(2020, 4, 18, 0, 0, 0), // custom epoch timestamp
  });
...
  genReqId: (req: IncomingMessage) => {
    const id = sonyflake.nextId();
    req.id = id;
    return id;
  },
... associates an incoming request with a unique id that we generate with sonyflake.
And, finally, this
  customLevels: {
    trace: 1,
    debug: 5,
    info: 9,
    warn: 13,
    error: 17,
    fatal: 21,
  },
  useOnlyCustomLevels: true,
... defines custom levels. Pino's default numeric levels (e.g., info = 30, warn = 40) don't align with OpenTelemetry's severity numbers (info = 9, error = 17, etc.). Defining custom levels ensures logs are correctly parsed and severity is accurately extracted in the OpenTelemetry Collector pipeline.
To start our application, run:
npm run build
NODE_ENV=production node -r ./dist/tracing dist/main
In the terminal you should see output showing that the service has started and that our ping and call routes have been mapped.
{"level":9,"time":1747552531452,"service_name":"app-a","context":"NestFactory","msg":"Starting Nest application..."}
{"level":9,"time":1747552531452,"service_name":"app-a","context":"InstanceLoader","msg":"LoggerModule dependencies initialized"}
{"level":9,"time":1747552531452,"service_name":"app-a","context":"InstanceLoader","msg":"AppModule dependencies initialized"}
{"level":9,"time":1747552531452,"service_name":"app-a","context":"RoutesResolver","msg":"AppController {/}:"}
{"level":9,"time":1747552531452,"service_name":"app-a","context":"RouterExplorer","msg":"Mapped {/ping, GET} route"}
{"level":9,"time":1747552531452,"service_name":"app-a","context":"RouterExplorer","msg":"Mapped {/call, GET} route"}
{"level":9,"time":1747552531452,"service_name":"app-a","context":"RouterExplorer","msg":"Mapped {/, GET} route"}
{"level":9,"time":1747552531452,"service_name":"app-a","context":"NestApplication","msg":"Nest application successfully started"}
OpenTelemetry Instrumentation Debugging
Before we continue, let's look at how to troubleshoot when, for example, logs or traces don't appear in the database.
To do this, we simply need to uncomment the commented lines in src/tracing.ts and then restart our service.
import { diag, DiagConsoleLogger, DiagLogLevel } from '@opentelemetry/api';
...
diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);
Then run it again:
npm run build
NODE_ENV=production node -r ./dist/tracing dist/main
Even though our service has successfully started, you will see something like this in the logs:
{"stack":"AggregateError [ECONNREFUSED]: \n at internalConnectMultiple (node:net:1121:18)\n at afterConnectMultiple (node:net:1688:7)\n at TCPConnectWrap.callbackTrampoline (node:internal/async_hooks:130:17)","errors":"Error: connect ECONNREFUSED ::1:4318,Error: connect ECONNREFUSED 127.0.0.1:4318","code":"ECONNREFUSED","name":"AggregateError"}
{"stack":"Error: 14 UNAVAILABLE: No connection established. Last error: Error: connect ECONNREFUSED 127.0.0.1:4317"}
This clearly indicates that the SDK was unable to establish HTTP and gRPC connections to the collector — running locally, it falls back to localhost:4317/4318, where nothing is listening. This debugging mechanism, together with the debug exporter in the OpenTelemetry Collector itself, gives us a minimal set of tools for troubleshooting.
Deploying to Kind Cluster
To simplify deployment, we'll add a helper script at the root of the app.
Deployment script (deploy-app.sh):
#!/bin/bash
set -e
VERSION_FILE="VERSION"
if [[ -f "$VERSION_FILE" ]]; then
VERSION=$(cat "$VERSION_FILE")
else
VERSION=1
fi
if ! [[ "$VERSION" =~ ^[0-9]+$ ]]; then
echo "❌ Invalid version number in VERSION file"
exit 1
fi
TAG="$VERSION"
IMAGE_NAME="my-observability-app"
KIND_CLUSTER_NAME="observability"
HELM_RELEASE_NAME="my-observability-app"
HELM_CHART_PATH="./helm/my-observability-app"
echo "🛠 Building Docker image: $IMAGE_NAME:$TAG"
docker build -t "$IMAGE_NAME:$TAG" .
echo "🐳 Loading image into Kind cluster: $KIND_CLUSTER_NAME"
kind load docker-image "$IMAGE_NAME:$TAG" --name "$KIND_CLUSTER_NAME"
echo "🚀 Installing/upgrading Helm release: $HELM_RELEASE_NAME"
helm upgrade --install "$HELM_RELEASE_NAME" "$HELM_CHART_PATH" \
--set image.repository="$IMAGE_NAME" \
--set image.tag="$TAG" \
--set image.pullPolicy=IfNotPresent
NEXT_VERSION=$((VERSION + 1))
echo "$NEXT_VERSION" > "$VERSION_FILE"
echo "✅ Done. Deployed image with tag: $TAG"
echo "📄 Updated VERSION for next release to: $NEXT_VERSION"
This script does the following:
- Builds the app and Docker image
- Loads it into the Kind cluster
- Deploys two services via Helm: app-a and app-b, each with 2 replicas
And our Helm chart structure will look like:
helm/my-observability-app/
├── Chart.yaml
├── values.yaml
└── templates/
├── app-a-deployment.yaml
├── app-a-service.yaml
├── app-b-deployment.yaml
└── app-b-service.yaml
helm/my-observability-app/Chart.yaml:
apiVersion: v2
name: my-observability-app
description: Deploys a single app twice as app-a and app-b
version: 0.1.0
appVersion: '1.0'
helm/my-observability-app/values.yaml:
replicaCount: 2

image:
  repository: my-observability-app
  tag: latest
  pullPolicy: IfNotPresent

service:
  port: 3000

otel:
  endpoint: 'otel-collector-opentelemetry-collector-agent.observability.svc:4317'
helm/my-observability-app/templates/app-a-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-a
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: app-a
  template:
    metadata:
      labels:
        app: app-a
    spec:
      containers:
        - name: app-a
          image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          env:
            - name: NODE_ENV
              value: production
            - name: OTEL_SERVICE_NAME
              value: app-a
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: {{ .Values.otel.endpoint }}
            - name: OTEL_EXPORTER_OTLP_PROTOCOL
              value: grpc
          ports:
            - containerPort: {{ .Values.service.port }}
helm/my-observability-app/templates/app-a-service.yaml:
apiVersion: v1
kind: Service
metadata:
  name: app-a
spec:
  selector:
    app: app-a
  ports:
    - protocol: TCP
      port: 3000
      targetPort: 3000
app-b-deployment.yaml and app-b-service.yaml are identical except for names and labels changed to app-b.
And don't forget to add the Dockerfile and .dockerignore files.
Dockerfile:
FROM node:22-alpine as base
WORKDIR /app
COPY package*.json ./
RUN npm ci
FROM base as builder
COPY . .
RUN npm run build
FROM base as prod
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/package*.json ./
RUN npm ci --omit=dev
CMD ["node", "-r", "./dist/tracing.js", "./dist/main.js"]
.dockerignore:
# Node dependencies
node_modules
npm-debug.log
yarn.lock
# Build output
dist
*.ts
# OS / Editor / IDE
.DS_Store
.env
*.env
.vscode
.idea
*.swp
# Git
.git
.gitignore
VERSION
deploy-app.sh
Now we're ready to deploy our services to Kubernetes. Simply run the deploy-app.sh script and observe the example output:
🛠 Building Docker image: my-observability-app:1
🐳 Loading image into Kind cluster: observability
🚀 Installing/upgrading Helm release: my-observability-app
✅ Done. Deployed image with tag: 1
📄 Updated VERSION for next release to: 2
Testing the Pipeline
Port-forward the app:
kubectl port-forward service/app-a 3000:3000
In another terminal run:
curl http://localhost:3000/call
Expected output:
Response from http://app-b:3000: pong%
To see logs from app-a and app-b, run:
kubectl logs -l app=app-a
kubectl logs -l app=app-b
Now inspect the OpenTelemetry Collector logs and ClickHouse tables to confirm that logs and traces are being stored.
OpenTelemetry Collector Logs
You should see log records with full trace context, custom severity, and service name:
kubectl logs otel-collector-opentelemetry-collector-agent-fzwqv -n observability -f
LogRecord #0
ObservedTimestamp: 2025-05-17 18:47:52.880985305 +0000 UTC
Timestamp: 2025-05-17 18:47:52.783 +0000 UTC
SeverityText: INFO
SeverityNumber: Info(9)
Body: Map({"level":9,"msg":"request completed","req":{"headers":{"accept":"*/*","host":"localhost:3000","user-agent":"curl/8.7.1"},"id":"2646566778519420930","method":"GET","params":{"path":["call"]},"query":{},"remoteAddress":"::ffff:127.0.0.1","remotePort":49448,"url":"/call"},"res":{"headers":{"content-length":"37","content-type":"text/html; charset=utf-8","etag":"W/\"25-YIs9s+nPVAD6eNe/gEyORquumI4\"","x-powered-by":"Express"},"statusCode":200},"responseTime":74,"service_name":"app-a","span_id":"0190a252bfc22dab","time":1747507672783,"trace_flags":"01","trace_id":"18211f4c6b1dd7990bc8a6f113b774e6"})
Attributes:
-> log.file.path: Str(/var/log/containers/app-a-79fdcb5d84-5s5rl_default_app-a-d456b9167a9578c65b5f6c223461594b1e1599ee220b7c633694f62018e31258.log)
-> log.iostream: Str(stdout)
-> logtag: Str(F)
Trace ID: 18211f4c6b1dd7990bc8a6f113b774e6
Span ID: 0190a252bfc22dab
Flags: 1
{"resource": {}, "otelcol.component.id": "debug", "otelcol.component.kind": "exporter", "otelcol.signal": "logs"}
2025-05-17T18:47:56.571Z info Metrics {"resource": {}, "otelcol.component.id": "debug", "otelcol.component.kind": "exporter", "otelcol.signal": "metrics", "resource metrics": 1, "metrics": 36, "data points": 45}
2025-05-17T18:47:56.577Z info ResourceMetrics #0
Resource SchemaURL:
Resource attributes:
-> service.name: Str(otelcol-contrib)
-> server.address: Str(10.244.0.14)
-> service.instance.id: Str(86f2eb87-dbc0-4516-95b9-d3a254fb10e4)
-> server.port: Str(8888)
-> url.scheme: Str(http)
-> service.version: Str(0.126.0)
ClickHouse (otel_logs)
select * from otel_logs order by Timestamp desc limit 1 format vertical;
Timestamp: 2025-05-17 18:47:54.966000000
TimestampTime: 2025-05-17 18:47:54
TraceId: efb66e731e09d4abe7be282c704adf13
SpanId: 72514ac7a4067633
TraceFlags: 1
SeverityText: INFO
SeverityNumber: 9
ServiceName: app-a
Body: {"level":9,"msg":"request completed","req":{"headers":{"accept":"*/*","host":"localhost:3000","user-agent":"curl/8.7.1"},"id":"2646566815362187266","method":"GET","params":{"path":["call"]},"query":{},"remoteAddress":"::ffff:127.0.0.1","remotePort":49468,"url":"/call"},"res":{"headers":{"content-length":"37","content-type":"text/html; charset=utf-8","etag":"W/\"25-YIs9s+nPVAD6eNe/gEyORquumI4\"","x-powered-by":"Express"},"statusCode":200},"responseTime":62,"service_name":"app-a","span_id":"72514ac7a4067633","time":1747507674966,"trace_flags":"01","trace_id":"efb66e731e09d4abe7be282c704adf13"}
ResourceSchemaUrl:
ResourceAttributes: {'service.name':'app-a'}
ScopeSchemaUrl:
ScopeName:
ScopeVersion:
ScopeAttributes: {}
LogAttributes: {'log.file.path':'/var/log/containers/app-a-79fdcb5d84-5s5rl_default_app-a-d456b9167a9578c65b5f6c223461594b1e1599ee220b7c633694f62018e31258.log','log.iostream':'stdout','logtag':'F'}
ClickHouse (otel_traces)
select * from otel_traces order by Timestamp desc limit 1 format vertical;
Timestamp: 2025-05-17 18:47:54.911000000
TraceId: efb66e731e09d4abe7be282c704adf13
SpanId: 289622f5583790bb
ParentSpanId: 7eed83778c3497d7
TraceState:
SpanName: request handler - /ping
SpanKind: Internal
ServiceName: app-b
ResourceAttributes: {'host.arch':'arm64','host.name':'app-b-7d9f6b6bf7-z56nq','process.command':'/app/dist/main.js','process.command_args':'["/usr/local/bin/node","-r","./dist/tracing.js","/app/dist/main.js"]','process.executable.name':'node','process.executable.path':'/usr/local/bin/node','process.owner':'root','process.pid':'1','process.runtime.description':'Node.js','process.runtime.name':'nodejs','process.runtime.version':'22.15.1','service.name':'app-b','telemetry.sdk.language':'nodejs','telemetry.sdk.name':'opentelemetry','telemetry.sdk.version':'2.0.1'}
ScopeName: @opentelemetry/instrumentation-express
ScopeVersion: 0.50.0
SpanAttributes: {'express.name':'/ping','express.type':'request_handler','http.route':'/ping'}
Duration: 54657000
StatusCode: Unset
...
If both return recent entries and no errors in the collector — congratulations, you now have end-to-end observability working in your stack!
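As one more quick sanity check, you can count what each service has shipped so far — the exact numbers will of course differ in your run:
-- Log and span counts per service.
SELECT ServiceName, count() AS log_rows
FROM observability.otel_logs
GROUP BY ServiceName;

SELECT ServiceName, count() AS span_rows
FROM observability.otel_traces
GROUP BY ServiceName;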
Step 7: Explore Traces in Grafana
All right — it's finally time to explore and customize trace visualization in Grafana.
To do this, we need to add ClickHouse as a new data source in Grafana.
- Open Grafana at http://localhost:30080 (default login: admin/prom-operator).
- Go to Connections → Add new connection.
- In the search bar, type ClickHouse and select the plugin named ClickHouse.
- Click Install and wait until the plugin is installed.
- Then press "Add new data source".
You can leave the name as the default (grafana-clickhouse-datasource) and toggle the "Default" switch to on.
Set the following connection details:
- Server address: clickhouse.monitoring.svc.cluster.local
- Port number: 9000
- Protocol: Native
- Skip TLS Verify: On
- Credentials:
  - Username: admin
  - Password: clickhouse123
Click "Save & test" — you should see the message: Data source is working
.
Reload the page; once it reloads, scroll down to the Additional settings section.
Configure Logs and Traces
In the Logs configuration section:
- Default log database: observability
- Default log table: otel_logs
- Toggle "Use OTel columns" to on

In the Traces configuration section:
- Default trace database: observability
- Default trace table: otel_traces
- Toggle "Use OTel columns" for traces as well
Press "Save & test" again.
Import Dashboards
Now scroll to the top of the data source page and open the Dashboards tab.
Click "Import" for each available dashboard. You should now have a few new dashboards available.
From the left-hand menu in Grafana, go to Dashboards, then type ClickHouse in the search bar. You should see your newly imported dashboards.
Go to:
Home → Dashboards → ClickHouse OTel Dashboard
This is where we'll be observing traces shortly.
Generate Some Data
Let's send some traffic through our services.
First, in one terminal, port-forward app-a:
kubectl port-forward service/app-a 3000:3000
Then, in another terminal, run:
ab -n 200 -c 10 http://localhost:3000/call
This uses ApacheBench (ab) to generate 200 requests with a concurrency of 10.
Once it finishes, refresh the ClickHouse OTel Dashboard. You should see metrics like request count and latency.
Scroll down to the Traces section and click on one of the recent traces.
You'll see the full trace journey, including spans across both services. Scroll further down to the Logs section to view logs correlated with that particular trace.
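If you'd like to double-check the same correlation from the ClickHouse side, a quick way (shown here for the most recent trace — substitute any TraceId you see in Grafana) is:
-- Spans of the most recent trace; Duration is stored in nanoseconds.
SELECT ServiceName, SpanName, Duration / 1000000 AS duration_ms
FROM observability.otel_traces
WHERE TraceId = (
    SELECT TraceId FROM observability.otel_traces ORDER BY Timestamp DESC LIMIT 1
)
ORDER BY Timestamp;

-- ...and the log lines carrying the same TraceId.
SELECT Timestamp, ServiceName, SeverityText, Body
FROM observability.otel_logs
WHERE TraceId = (
    SELECT TraceId FROM observability.otel_traces ORDER BY Timestamp DESC LIMIT 1
)
ORDER BY Timestamp;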
💡 Bonus Tip:
Check out the ClickHouse - Data Analysis dashboard to see how much data is currently stored in your observability tables.
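If you'd rather pull those numbers straight from the ClickHouse client, a rough equivalent using the system.parts table looks like this:
-- On-disk size and row counts of the observability tables.
SELECT
    table,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk,
    sum(rows) AS total_rows
FROM system.parts
WHERE database = 'observability' AND active
GROUP BY table
ORDER BY sum(bytes_on_disk) DESC;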
Fixing Log Expansion in Grafana Panel
You might notice that when you expand a log entry in the dashboard, the details are not very useful — you only see the log level. Let's fix that so you can access structured and meaningful fields.
1. Edit the Panel
- In the panel menu, click the three vertical dots (︙) in the top-right corner of the panel.
- Select Edit. This will open the Edit Logs Visualization view.
2. Update the Log Query
- Scroll down to the Queries section.
- Set:
  - Editor type to SQL Editor
  - Query type to Logs
- Replace the default query:
SELECT Timestamp as "timestamp", Body as "body", SeverityText as "level" FROM "default"."otel_logs" LIMIT 1000
with the enhanced one:
SELECT
Timestamp AS timestamp,
Body AS body,
TraceId AS trace_id,
ServiceName AS service,
JSONExtractString(JSONExtractRaw(Body, 'req'), 'id') AS request_id,
JSONExtractString(Body, 'msg') AS message,
JSONExtractInt(JSONExtractRaw(Body, 'res'), 'statusCode') AS status_code,
JSONExtractString(JSONExtractRaw(Body, 'req'), 'url') AS url,
JSONExtractInt(Body, 'responseTime') AS response_time,
SeverityText AS level
FROM observability.otel_logs
ORDER BY Timestamp DESC
LIMIT 1000
3. Run and Save
- Press the Run Query button.
- Scroll down to ensure no error messages appear.
- If everything looks good:
  - Click Save Dashboard (top-right).
  - Confirm by pressing Save.
  - You might see a warning: "This is a Plugin dashboard". Just click Save and overwrite.
4. Verify Enhanced Logs
- Navigate back to the Simple ClickHouse OTel Dashboard.
- Reload the page if needed.
- Scroll down to the Trace Logs panel.
- Click on any log entry — now you should see:
  - Structured fields (like trace_id, request_id, url, etc.)
  - Two new buttons: View traces and View logs.
These buttons give you direct navigation to a more convenient view for the current trace or its associated logs.
Final Thoughts
This guide turned out quite a bit longer than I initially expected — and even so, we've only scratched the surface. Each step above introduces just the basics of every component, and there's still so much more to explore in the world of observability.
(We haven't even touched on collecting metrics and setting up alerts!)
In production environments, you'll need to think about proper database schema design, secure credential management, persistent storage, automation for scaling and deployment — and many other real-world concerns.
Most importantly, you'll need to choose the right set of tools that align with your team's needs and your business context.
But regardless of scale, the core principle stays the same: start small, understand how your data flows, and evolve your stack as your system grows.
Good luck — and may no incident ever go unnoticed. 🚀
Troubleshooting
- Kind Cluster Creation Fails
# Check Docker is running
docker ps
# Clean up any existing clusters
kind delete clusters --all
# Verify system resources
docker info
- Prometheus/Grafana Pods Not Starting
# Check pod status
kubectl get pods -n monitoring
# Check pod logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana
# Check resource limits
kubectl describe pod -n monitoring -l app.kubernetes.io/name=prometheus
- ClickHouse Connection Issues
# Verify ClickHouse pod is running
kubectl get pods -n monitoring -l app.kubernetes.io/name=clickhouse
# Check ClickHouse logs
kubectl logs -n monitoring -l app.kubernetes.io/name=clickhouse
# Test connection from within cluster
kubectl exec -n monitoring -it svc/clickhouse -- clickhouse-client --user=admin --password=clickhouse123
- OpenTelemetry Collector Issues
# Check collector status
kubectl get pods -n observability -l app.kubernetes.io/name=opentelemetry-collector
# View collector logs
kubectl logs -n observability -l app.kubernetes.io/name=opentelemetry-collector
- Data Not Showing in Grafana
- Verify data sources are properly configured in Grafana
- Verify ClickHouse tables are being created and populated
- Check OpenTelemetry Collector logs for any export errors
Resources
- How to Collect Logs video by Marcel Dempers
- An Intro to OpenTelemetry by ClickHouse team
- ClickHouse Blog on OpenTelemetry
- Grafana Docs
- Prometheus Docs
- OpenTelemetry Docs
- OpenTelemetry Operator
- OpenTelemetry Logging
- OpenTelemetry Collector Container Log Parser
- OpenTelemetry Filelog Receiver
- OpenTelemetry Filelog Receiver Operators
- OpenTelemetry Attribute Registry Specification
- OpenTelemetry JavaScript SDK