Building a Local Observability Stack: A Journey with OpenTelemetry, ClickHouse, and Grafana

Publish Date: May 18

Hi everyone! My name is Alex and I'm a Backend Engineer.

This article is my attempt to better understand OpenTelemetry. I wanted to share my experience setting up a local observability stack — in case it helps others on the same path.


Introduction

Curiosity is often sparked in the simplest moments. While browsing technical content, I came across a video by Marcel Dempers explaining how to collect logs using OpenTelemetry. The walkthrough was clear and approachable, and it got me thinking — how hard would it be to recreate something like that locally?

This guide is the result of that question. Whether we're between projects or just eager to expand our skills, it offers a step-by-step exploration of setting up a local observability stack using Kubernetes, OpenTelemetry, ClickHouse, and Grafana. While it's not a production-ready deployment, it's a great way to gain hands-on experience with the foundational tools of modern observability.


Observability and OpenTelemetry — What Are They?

Observability is the practice of understanding what is happening inside a system based on the data it produces.

Think of observability like building with LEGO blocks. You have many options — various telemetry collectors, storage engines, visualization tools, and cloud platforms — and it's up to you to choose how to piece them together.

At the heart of both observability and OpenTelemetry are three shared pillars:

  • Traces: Provide end-to-end visibility by following how a request flows across services.
  • Metrics: Quantify the state and performance of our services.
  • Logs: Record discrete events and messages during application execution.

Modern observability involves correlating all three to detect, troubleshoot, and resolve issues effectively. OpenTelemetry brings them together under one unified model, referring to them as signals: the raw observability data emitted by systems. It also provides the tooling to collect those signals and route them to your chosen storage and visualization layers.

In this guide, we focus on using the OpenTelemetry framework with ClickHouse and Grafana, but alternatives exist. Managed platforms like Datadog and open-source platforms like SigNoz, OpenObserve, and HyperDX can serve similar purposes, each with different trade-offs.

Logs Pipeline

If you're interested in a deeper dive into these concepts, the ClickHouse team has published a detailed article covering OpenTelemetry internals, architectural models, and log/trace handling: Engineering Observability with OpenTelemetry and ClickHouse. It's highly recommended for gaining a solid conceptual foundation.


The Tools of the Trade

💡 Note 1.
All examples in this guide assume you're running macOS. If you're using Linux or Windows, paths and some commands may need to be adjusted accordingly.

💡 Note 2.
As mentioned earlier, there's no single right way to set up observability — it depends heavily on your architecture, data volume, team preferences, and business needs. In this guide, we'll use ClickHouse as the main storage for logs and traces. This is just one of many valid approaches, shaped by practical constraints and design taste.

To simulate a real-world observability setup, we'll use the following instruments. All of them are easy to install and run locally on macOS:

  • Docker for Mac: Required to run Kind clusters. Make sure it's running before you start. You can download it from Docker's official site.

  • Homebrew: A package manager for macOS that simplifies installing CLI tools. Installation guide is available here.

  • kubectl: The Kubernetes CLI for managing clusters. It might already be installed as part of Docker Desktop for Mac. If it's not available for some reason, you can install it manually using Homebrew: brew install kubectl.

  • Helm: A tool for managing Kubernetes charts. Install it with Homebrew: brew install helm.

  • K9s: A terminal UI for managing and observing Kubernetes clusters. It simplifies navigating resources, checking pod logs, and debugging directly from the terminal. You can install it via Homebrew: brew install k9s.

  • Node.js: Required for creating a sample NestJS application that generates logs and traces. Be sure to have Node.js installed before continuing. You can install it via Homebrew: brew install node or from the official site.

  • Kind: A lightweight tool for running local Kubernetes clusters using Docker. It allows fast prototyping and testing of Kubernetes-based infrastructure without the need for cloud resources. You can find installation instructions in the official quick start guide.

  • Prometheus: An open-source systems monitoring and alerting toolkit. It collects time-series metrics from configured targets at given intervals and stores them efficiently.

    📌 In this guide, we use Prometheus indirectly — it's installed as part of the monitoring stack, and its metrics power some of the default Grafana dashboards. However, we won't be sending any custom metrics to it directly.

  • Grafana: A data visualization platform used to create dashboards and alerts. In our setup, it connects to Prometheus for metrics and ClickHouse for logs and traces.

  • ClickHouse: A column-oriented database designed for high-performance analytics. It excels at storing structured logs and trace data at scale, making it well-suited for observability use cases.

  • OpenTelemetry Collector: A vendor-neutral component that receives, processes, and exports telemetry data. It supports various data formats and allows routing logs, metrics, and traces to multiple destinations like ClickHouse and Prometheus.

Now that the core concepts are clear, it's time to roll up our sleeves and get hands-on with some code.


Getting Started

Before we begin, let's prepare a working directory for our observability setup. This folder will contain the configuration files and the sample application, and it will serve as a base for volume mounts in Kind.

Create it like this:

mkdir -p $PWD/test-observability/otelcol-storage
cd $PWD/test-observability

Now let's start building the environment step by step.


Step 1: Launching a Local Kubernetes Cluster

To simulate a production-like environment on our local machine, we use Kind (Kubernetes IN Docker). It lets us spin up a single-node Kubernetes cluster using Docker containers.

Now we create a Kind config file, kind-config.yaml, that might look like this:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: ./otelcol-storage
        containerPath: /var/lib/otelcol
    extraPortMappings:
      - containerPort: 30080
        hostPort: 30080
        protocol: TCP
      - containerPort: 30090
        hostPort: 30090
        protocol: TCP
      - containerPort: 31000
        hostPort: 31000
        protocol: TCP

With the port mappings in the Kind config we explicitly expose service ports to our local machine. This makes it easy to access Grafana, the Prometheus dashboard, and ClickHouse through the browser at localhost.

The volume mount is included here as a placeholder — we'll explain its purpose in detail later when setting up the OpenTelemetry Collector. In short, it lets the collector persist its file-storage state (such as log file read offsets) across pod restarts.

Now run this command in your terminal:

kind create cluster --name observability --config kind-config.yaml

Creating the cluster may take a minute or two depending on your machine's performance and whether the required Docker image is already cached. During this process, Kind sets up a Kubernetes control plane node, installs a container network interface (CNI), and provisions storage.

You should see output similar to this:

➜  test-observability kind create cluster --name observability --config kind-config.yaml
Creating cluster "observability" ...
 ✓ Ensuring node image (kindest/node:v1.32.2) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾

Verify that all nodes are up and running and your local cluster is alive:

kubectl get nodes
kubectl get pods -A
kubectl cluster-info --context kind-observability

k9s cluster

Now that our cluster is up and running, it's time to install the monitoring stack — Prometheus and Grafana.


Step 2: Deploying Prometheus and Grafana

We'll use the prometheus-community Helm chart, which bundles both Prometheus and Grafana in a pre-integrated stack, making setup easier for local testing.

To make the Helm charts for Prometheus and Grafana available locally, add the Helm repository first:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Next, create a new namespace to isolate the monitoring components from the rest of the cluster:

kubectl create namespace monitoring

... and install Prometheus and Grafana

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.retention=15d \
  --set grafana.service.type=NodePort \
  --set grafana.service.nodePort=30080 \
  --set prometheus.service.type=NodePort \
  --set prometheus.service.nodePort=30090

Wait until all pods are up and running.

kubectl get pods -n monitoring
kubectl get svc -n monitoring

Grafana is now accessible at http://localhost:30080 (default login: admin/prom-operator). You can always retrieve the admin password by executing:

kubectl --namespace monitoring get secrets prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d; echo

Grafana login page

Now that both Prometheus and Grafana are up and running, let's import our first real dashboard to visualize data from the Kubernetes node.

  1. Open Grafana (http://localhost:30080), and from the left menu, go to Dashboards → New → Import.
  2. This opens the "Import Dashboard" view. Click on the link to grafana.com/dashboards.
  3. In the new browser tab, search for "Node Exporter Full" and open this dashboard: Node Exporter Full – ID 1860.
  4. Click Copy ID to clipboard.
  5. Go back to Grafana, paste the ID (1860) into the "Import via grafana.com" field, and click Load.
  6. Choose Prometheus as the data source and click Import.

Prometheus Dashboard

Voilà! The dashboard is now live and displaying system metrics from your Kubernetes node, which should appear as observability-control-plane.

💡 Note: You can also access Prometheus directly by visiting http://localhost:30090/status in your browser. We're able to access this from our host machine because port 30090 was explicitly mapped in the Kind configuration.

Next up, let's install ClickHouse — our database for storing logs and traces.


Step 3: Installing ClickHouse

First, add the ClickHouse Helm chart source:

helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

Now install ClickHouse into the same namespace used by Prometheus and Grafana:

helm install clickhouse bitnami/clickhouse \
  --namespace monitoring \
  --set architecture=standalone \
  --set replicaCount=1 \
  --set shards=1 \
  --set auth.username=admin \
  --set auth.password=clickhouse123 \
  --set service.type=NodePort \
  --set service.nodePorts.http=31000 \
  --set persistence.enabled=false \
  --set keeper.enabled=false \
  --set metrics.enabled=true \
  --set resources.requests.cpu=500m \
  --set resources.requests.memory=512Mi \
  --set resources.limits.cpu=4 \
  --set resources.limits.memory=4Gi

Before we proceed, it's worth highlighting a couple of custom settings we used in the installation above:

  • We explicitly provided a username and password for ClickHouse (admin/clickhouse123) to make it easy to log in using the CLI.
  • We overrode the default CPU and memory requests/limits. In particular, the default memory limit of 750Mi often isn't enough for ClickHouse, especially when handling real query loads. In practice, insufficient memory will eventually lead to query failures or service instability.
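If you later suspect memory pressure, a quick way to check is to query ClickHouse's own system tables from the clickhouse-client console (we'll open one in the next step). This is only a rough sketch; exact metric names can vary between ClickHouse versions:

-- Rough check of ClickHouse's resident memory from inside clickhouse-client.
SELECT metric, formatReadableSize(value) AS size
FROM system.asynchronous_metrics
WHERE metric IN ('MemoryResident', 'OSMemoryTotal')
ORDER BY metric;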

Use the following commands to check that the ClickHouse pod and service are available:

kubectl get pods -n monitoring -l app.kubernetes.io/name=clickhouse
kubectl get svc -n monitoring -l app.kubernetes.io/name=clickhouse

Step 4: ClickHouse Database Setup

Before we move on and deploy the OpenTelemetry Collector, let's pause for a moment to focus on the database setup.

The ClickHouse team officially supports and contributes to the OpenTelemetry exporter for ClickHouse, and it provides convenient defaults for automatically creating tables to store logs and traces.

However, there are two important caveats:

  1. The exporter will not create the database itself. That step is entirely up to you. The exporter can create tables, but only within an existing database.
  2. ClickHouse engineers explicitly recommend avoiding automatic table creation in production environments. In real-world setups, you'll often have more than one collector running — each potentially trying to create the same schema. More importantly, designing a universal schema that fits every workload is practically impossible. Instead, consider the default schema as a starting point — then evolve it using materialized views tailored to your access patterns and business needs. You can find more insights in their excellent technical blog post: ClickHouse and OpenTelemetry.
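To make that second point more concrete, here's a minimal sketch of such a materialized view. It assumes the default otel_logs schema the exporter will create later in this guide, and the view name and rollup are purely illustrative:

-- Hypothetical example: hourly error counts per service, derived from otel_logs.
CREATE MATERIALIZED VIEW observability.otel_logs_error_rate_mv
ENGINE = SummingMergeTree
ORDER BY (ServiceName, Hour)
AS
SELECT
    ServiceName,
    toStartOfHour(TimestampTime) AS Hour,
    count() AS ErrorCount
FROM observability.otel_logs
WHERE SeverityText = 'ERROR'
GROUP BY ServiceName, Hour;
-- When querying the view, aggregate again, e.g.:
--   SELECT ServiceName, Hour, sum(ErrorCount)
--   FROM observability.otel_logs_error_rate_mv GROUP BY ServiceName, Hour;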

So, to create a new database, access the ClickHouse console inside the pod:

kubectl exec -n monitoring -it svc/clickhouse -- clickhouse-client \
  --user=admin --password=clickhouse123

Inside the ClickHouse SQL console, run:

CREATE DATABASE IF NOT EXISTS observability;

At this point, we've got a fully functional Kubernetes cluster running Prometheus, Grafana, and ClickHouse. Now it's time to dive into the most exciting part — deploying and configuring the OpenTelemetry Collector.

Add ClickHouse


Step 5: Deploying the OpenTelemetry Collector

Let's recap what the OpenTelemetry Collector is. It's a vendor-agnostic service that receives, processes, and exports telemetry data (logs, metrics, and traces) from your applications. It acts as a pipeline that standardizes and routes observability signals to multiple destinations.

There are multiple ways to deploy the collector. In production environments, you might consider using the OpenTelemetry Operator for better lifecycle management and integration with Kubernetes. For local development, however, a DaemonSet deployment is often simpler. It ensures that one instance of the collector runs on every node (we have only one node in our setup), which is especially useful for collecting container logs directly from the node filesystem.

However, a DaemonSet on its own doesn't give applications a stable address to send telemetry to. To route traffic to the collector pods (e.g., from your instrumented apps using OTLP), we'll create a headless service that exposes them without load-balancing:

apiVersion: v1
kind: Service
metadata:
  name: otel-collector-opentelemetry-collector-agent
  namespace: observability
  labels:
    app.kubernetes.io/name: opentelemetry-collector
    app.kubernetes.io/instance: otel-collector
spec:
  clusterIP: None # Headless service
  selector:
    app.kubernetes.io/name: opentelemetry-collector
    app.kubernetes.io/instance: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317

For now, save the manifest above as otel-collector-headless-service.yaml in your working directory. We'll apply it shortly to expose the collector to applications inside the cluster — allowing them to send telemetry via OTLP (gRPC).

Next, add the open-telemetry Helm repository:

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

Then create a custom configuration file and save it as otel-collector-values.yaml:

image:
  repository: otel/opentelemetry-collector-contrib
  tag: latest
  pullPolicy: IfNotPresent

mode: daemonset

securityContext:
  runAsUser: 0
  runAsGroup: 0
  fsGroup: 0

config:
  receivers:
    otlp:
      protocols:
        grpc: {}
        http: {}
    filelog:
      include: [/var/log/containers/*app-*.log]
      start_at: end
      include_file_path: true
      include_file_name: false
      storage: file_storage
      operators:
        - id: container-parser
          type: container
          add_metadata_from_filepath: false
        - type: json_parser
          parse_to: body
          if: body matches "^{.*}$"
          # on_error: drop_quiet
          timestamp:
            parse_from: body.time
            layout_type: epoch
            layout: ms
          severity:
            parse_from: body.level
            overwrite_text: true
        - type: copy
          from: body.service_name
          to: resource["service.name"]
        - type: trace_parser
          trace_id:
            parse_from: body.trace_id
          span_id:
            parse_from: body.span_id
          trace_flags:
            parse_from: body.trace_flags

  extensions:
    health_check: {}
    pprof: {}
    zpages: {}
    file_storage:
      directory: /var/lib/otelcol/.data/storage/
      create_directory: true

  processors:
    memory_limiter:
      check_interval: 5s
      limit_mib: 512
    batch:
      timeout: 5s
      send_batch_size: 5000

  exporters:
    clickhouse:
      endpoint: tcp://clickhouse.monitoring.svc.cluster.local:9000?compress=lz4&async_insert=1
      database: observability
      username: admin
      password: clickhouse123
      logs_table_name: otel_logs
      traces_table_name: otel_traces
      create_schema: true
      ttl: 8760h
    debug:
      verbosity: detailed

  service:
    extensions: [health_check, pprof, zpages, file_storage]
    pipelines:
      logs:
        receivers: [filelog]
        processors: [memory_limiter, batch]
        exporters: [debug, clickhouse]
      traces:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [clickhouse]

extraVolumes:
  - name: varlog
    hostPath:
      path: /var/log
  - name: varlibdockercontainers
    hostPath:
      path: /var/lib/docker/containers
  - name: otelcolstorage
    hostPath:
      path: /var/lib/otelcol

extraVolumeMounts:
  - name: varlog
    mountPath: /var/log
  - name: varlibdockercontainers
    mountPath: /var/lib/docker/containers
  - name: otelcolstorage
    mountPath: /var/lib/otelcol

Let's create a new namespace for the collector and call it observability:

kubectl create namespace observability

Then install the collector:

helm install otel-collector open-telemetry/opentelemetry-collector \
  --namespace observability \
  -f otel-collector-values.yaml

To verify that the collector is running and ready to receive telemetry:

kubectl get pods -n observability
kubectl describe svc otel-collector-opentelemetry-collector-agent -n observability

Now we can check the collector logs to confirm that it's running properly:

2025-05-17T15:09:47.440Z info otlpreceiver@v0.126.0/otlp.go:116 Starting GRPC server {"resource": {}, "otelcol.component.id": "otlp", "otelcol.component.kind": "receiver", "endpoint": "10.244.0.13:4317"}
2025-05-17T15:09:47.443Z info otlpreceiver@v0.126.0/otlp.go:173 Starting HTTP server {"resource": {}, "otelcol.component.id": "otlp", "otelcol.component.kind": "receiver", "endpoint": "10.244.0.13:4318"}
2025-05-17T15:09:47.446Z info prometheusreceiver@v0.126.0/metrics_receiver.go:154 Starting discovery manager {"resource": {}, "otelcol.component.id": "prometheus", "otelcol.component.kind": "receiver", "otelcol.signal": "metrics"}

Great! Our collector is up and running, listening for incoming HTTP and gRPC connections.

Add Collector

Next, let's verify that our ClickHouse database has the expected telemetry tables.

Open a ClickHouse SQL shell:

kubectl exec -n monitoring -it svc/clickhouse -- clickhouse-client --user=admin --password=clickhouse123

Then, in the ClickHouse client, run:

SHOW DATABASES;

Expected output:

┌─name───────────────┐
│ INFORMATION_SCHEMA │
│ default            │
│ information_schema │
│ observability      │
│ system             │
└────────────────────┘

Switch to the observability database:

USE observability;

Check the available tables:

SHOW TABLES;

You should see something like:

┌─name───────────────────────┐
│ otel_logs                  │
│ otel_traces                │
│ otel_traces_trace_id_ts    │
│ otel_traces_trace_id_ts_mv │
└────────────────────────────┘

That means everything is connected properly and working as expected!

Now let's take a look at the structure of the tables.

To inspect the otel_logs table structure, run:

SHOW CREATE TABLE otel_logs;

Example output:

CREATE TABLE observability.otel_logs
(
    `Timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
    `TimestampTime` DateTime DEFAULT toDateTime(Timestamp),
    `TraceId` String CODEC(ZSTD(1)),
    `SpanId` String CODEC(ZSTD(1)),
    `TraceFlags` UInt8,
    `SeverityText` LowCardinality(String) CODEC(ZSTD(1)),
    `SeverityNumber` UInt8,
    `ServiceName` LowCardinality(String) CODEC(ZSTD(1)),
    `Body` String CODEC(ZSTD(1)),
    `ResourceSchemaUrl` LowCardinality(String) CODEC(ZSTD(1)),
    `ResourceAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    `ScopeSchemaUrl` LowCardinality(String) CODEC(ZSTD(1)),
    `ScopeName` String CODEC(ZSTD(1)),
    `ScopeVersion` LowCardinality(String) CODEC(ZSTD(1)),
    `ScopeAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    `LogAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
    INDEX idx_res_attr_key mapKeys(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_res_attr_value mapValues(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_scope_attr_key mapKeys(ScopeAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_scope_attr_value mapValues(ScopeAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_log_attr_key mapKeys(LogAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_log_attr_value mapValues(LogAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_body Body TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 8
)
ENGINE = MergeTree
PARTITION BY toDate(TimestampTime)
PRIMARY KEY (ServiceName, TimestampTime)
ORDER BY (ServiceName, TimestampTime, Timestamp)
TTL TimestampTime + toIntervalDay(365)
SETTINGS index_granularity = 8192, ttl_only_drop_parts = 1
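Notice the primary key (ServiceName, TimestampTime): queries that filter by service and a time window are cheap. Once logs start flowing (Step 6), you could pull the latest entries for one service with a query like this (a simple sketch against the default schema):

SELECT Timestamp, SeverityText, TraceId, Body
FROM observability.otel_logs
WHERE ServiceName = 'app-a'
  AND TimestampTime >= now() - INTERVAL 1 HOUR
ORDER BY Timestamp DESC
LIMIT 20;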

Now let's inspect the otel_traces table:

SHOW CREATE TABLE otel_traces;

Example output:

CREATE TABLE observability.otel_traces
(
    `Timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
    `TraceId` String CODEC(ZSTD(1)),
    `SpanId` String CODEC(ZSTD(1)),
    `ParentSpanId` String CODEC(ZSTD(1)),
    `TraceState` String CODEC(ZSTD(1)),
    `SpanName` LowCardinality(String) CODEC(ZSTD(1)),
    `SpanKind` LowCardinality(String) CODEC(ZSTD(1)),
    `ServiceName` LowCardinality(String) CODEC(ZSTD(1)),
    `ResourceAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    `ScopeName` String CODEC(ZSTD(1)),
    `ScopeVersion` String CODEC(ZSTD(1)),
    `SpanAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    `Duration` UInt64 CODEC(ZSTD(1)),
    `StatusCode` LowCardinality(String) CODEC(ZSTD(1)),
    `StatusMessage` String CODEC(ZSTD(1)),
    `Events.Timestamp` Array(DateTime64(9)) CODEC(ZSTD(1)),
    `Events.Name` Array(LowCardinality(String)) CODEC(ZSTD(1)),
    `Events.Attributes` Array(Map(LowCardinality(String), String)) CODEC(ZSTD(1)),
    `Links.TraceId` Array(String) CODEC(ZSTD(1)),
    `Links.SpanId` Array(String) CODEC(ZSTD(1)),
    `Links.TraceState` Array(String) CODEC(ZSTD(1)),
    `Links.Attributes` Array(Map(LowCardinality(String), String)) CODEC(ZSTD(1)),
    INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
    INDEX idx_res_attr_key mapKeys(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_res_attr_value mapValues(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_span_attr_key mapKeys(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_span_attr_value mapValues(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_duration Duration TYPE minmax GRANULARITY 1
)
ENGINE = MergeTree
PARTITION BY toDate(Timestamp)
ORDER BY (ServiceName, SpanName, toDateTime(Timestamp))
TTL toDateTime(Timestamp) + toIntervalDay(365)
SETTINGS index_granularity = 8192, ttl_only_drop_parts = 1
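Here the ordering key is (ServiceName, SpanName, toDateTime(Timestamp)), and Duration is stored in nanoseconds, which makes it straightforward to look for slow spans once traces arrive. A small example query against the default schema:

SELECT ServiceName, SpanName, TraceId,
       Duration / 1e6 AS duration_ms   -- Duration is stored in nanoseconds
FROM observability.otel_traces
WHERE Timestamp >= now() - INTERVAL 1 HOUR
ORDER BY Duration DESC
LIMIT 10;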

Looks promising!

Finalizing Collector Networking Setup

Now let's launch the headless service to allow other applications to communicate with the OpenTelemetry Collector:

kubectl apply -f otel-collector-headless-service.yaml

Check that all components are running:

kubectl get all -n observability

You should see output similar to:

pod/otel-collector-opentelemetry-collector-agent-mfh7d        1/1     Running
service/otel-collector-opentelemetry-collector-agent          1/1     Running
daemonset.apps/otel-collector-opentelemetry-collector-agent   1/1     Running

Our collector is now active as a DaemonSet, and the headless service allows applications to send OTLP traffic.

To verify connectivity from within the cluster, open a Grafana pod shell:

kubectl -n monitoring exec -it prometheus-grafana-77bcfb9bdb-8pfjg -- bash

Then test the connection:

nc -vz otel-collector-opentelemetry-collector-agent.observability.svc.cluster.local 4317

Expected output:

otel-collector-opentelemetry-collector-agent.observability.svc.cluster.local (10.244.0.13:4317) open

That confirms the service is reachable from within the cluster. We're ready to send and visualize telemetry.


Understanding the OpenTelemetry Collector Configuration

But before we continue, it's worth pausing and examining what the OpenTelemetry Collector configuration file actually consists of. If you haven't yet, I highly recommend watching this video by Marcel Dempers, which explains the basics in a very clear and digestible way.

In short, the configuration file can be logically divided into several key sections:

1. receivers

Receivers are how data gets into the Collector. Each receiver is responsible for accepting telemetry data in a specific format or from a specific source — e.g., OTLP over HTTP/gRPC, Prometheus metrics scraping, or reading logs from files.

A receiver is how data gets into the Collector. Receivers "listen" for data being sent to them or collect it from a target.

OpenTelemetry Documentation

2. extensions

Extensions add optional but useful features to the Collector. They are not part of the signal pipeline but provide capabilities like health checks, profiling endpoints, zPages for live debugging, or persistent file storage.

Extensions are optional components that provide additional capabilities such as health checks, file storage, or zPages.

OpenTelemetry Documentation

3. processors

Processors transform telemetry data between receiving and exporting. They can batch, filter, enhance, or drop data. Common processors include batching for improved throughput and memory limiting to protect from resource overuse.

Processors are used to modify data before it is exported. They can be chained and applied to logs, metrics, or traces.

OpenTelemetry Documentation

4. exporters

Exporters are how telemetry data leaves the Collector. Each exporter sends data to a backend system like ClickHouse, Prometheus, Jaeger, etc. Exporters are the final stage in a telemetry pipeline.

Exporters are how data is sent to other systems or storage backends.

OpenTelemetry Documentation

5. service

This section brings everything together. It defines the telemetry pipelines (logs, traces, metrics), specifying which receivers, processors, and exporters to use. It also activates the extensions.

The service section defines how the Collector runs: which pipelines to start and which extensions to enable.

OpenTelemetry Documentation


Example: How We Use file_storage

In our configuration, the file_storage extension is used to enable state persistence — particularly useful for operators like filelog, which maintain internal state about file offsets.

We configure it as follows:

extensions:
  file_storage:
    directory: /var/lib/otelcol/.data/storage/
    create_directory: true

This tells the collector to store internal metadata (e.g., file read positions) in the specified directory. Combined with a persistent volume mount in the kind-config.yaml config file:

- hostPath: ./otelcol-storage
  containerPath: /var/lib/otelcol

…this ensures that the state survives across restarts and pod rescheduling. Without it, the collector may reprocess logs from the beginning — causing duplicates, or worse, losing data if it misses new entries during the restart (this behavior is also influenced by the start_at setting).

The filelog receiver references this extension using:

storage: file_storage

This closes the loop and ensures robust, restart-safe log ingestion.


Why We Use the filelog Receiver

In our current setup, we're relying on the filelog receiver to collect logs directly from the node's filesystem. While OpenTelemetry SDKs for most languages support sending logs and metrics programmatically to the collector, this feature isn't always production-ready. For example, at the time of writing, Node.js SDK documentation explicitly marks log export support as "in development".

That's why we're sticking to the old, reliable approach: scraping logs directly from disk. It's robust, language-agnostic, and doesn't require modifying application code — a practical choice when SDKs lag behind or instrumentation needs to stay lightweight.

It's also worth noting that not all containerized applications produce logs in the same format. Depending on the base image, logging library, or runtime, your logs might be plain text, JSON, or even multi-line stack traces.

Fortunately, OpenTelemetry's filelog receiver supports a wide range of parsing operators to help with that. You can mix and match operators like json_parser, regex_parser, trace_parser, multiline, move, and more to handle even the most bizarre log formats.

In our case, we use the purpose-built container log parser operator — designed specifically for Kubernetes environments. It's optimized for parsing container logs where metadata is appended outside of the actual log payload.


Log Parsing Flow in Our Configuration

In our setup, we're collecting logs from container files that match the pattern /var/log/containers/*app-*.log. These are symbolic links pointing to actual log files managed by the container runtime — and depending on which runtime is used (like Docker, containerd, or CRI-O), the format of those logs can differ quite a bit.

Here's a quick comparison of the most common formats:

Runtime      Log format example
Docker       JSON: {"log":"msg","stream":"stdout","time":"2024-01-01T12:00:00.000000000Z"}
containerd   CRI format: 2024-01-01T12:00:00.000000000Z stdout F msg
CRI-O        CRI format: 2024-01-01T12:00:00.000000000Z stdout F {"time":..., "level":"info"}

These differences matter — because if we want to parse logs properly, we need to apply the right sequence of operators to extract structured data from the raw lines.

In our case, the Kind node runs containerd, so the logs are written in the CRI format and carry JSON log bodies. Here's how we process them using the filelog receiver:

filelog:
  include: [/var/log/containers/*app-*.log]
  start_at: end
  include_file_path: true
  include_file_name: false
  storage: file_storage
  operators:
    - id: container-parser
      type: container
      add_metadata_from_filepath: false
    - type: json_parser
      parse_to: body
      if: body matches "^{.*}$"
      timestamp:
        parse_from: body.time
        layout_type: epoch
        layout: ms
      severity:
        parse_from: body.level
        overwrite_text: true
    - type: copy
      from: body.service_name
      to: resource["service.name"]
    - type: trace_parser
      trace_id:
        parse_from: body.trace_id
      span_id:
        parse_from: body.span_id
      trace_flags:
        parse_from: body.trace_flags

Let's break that down:

  1. container — This one strips away the CRI metadata prefix (timestamp, stream, flag) and gives us just the log body.
  2. json_parser — It kicks in if the body looks like JSON. It extracts timestamp and severity from specific fields like body.time and body.level.
  3. copy — We map service_name from the body to a proper OpenTelemetry resource attribute.
  4. trace_parser — We extract trace context (trace_id, span_id, trace_flags) so the log can be linked to a trace.

A Note on the OpenTelemetry Log Data Model

All of this parsing leads to one goal: transforming raw logs into structured entries that comply with the OpenTelemetry Log Data Model.

This model defines how logs should be structured — with clear separation between:

  • Body: the actual log message
  • Attributes: key-value pairs for context
  • Trace context: TraceId, SpanId, and TraceFlags
  • Resource metadata: like service name, environment, or deployment info

By following this model, we ensure that our logs are ready for advanced querying, correlation with traces, and rich visualizations in tools like ClickHouse and Grafana.
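Because the exporter stores attributes as ClickHouse Map columns, you can also inspect which attribute keys your logs actually carry once data starts flowing. For example:

-- List the distinct log attribute keys seen over the last day.
SELECT DISTINCT arrayJoin(mapKeys(LogAttributes)) AS log_attribute_key
FROM observability.otel_logs
WHERE TimestampTime >= now() - INTERVAL 1 DAY;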

I also recommend reading Attribute Registry Specification to gain a better understanding of standard attributes and how they correlate with this model.


Step 6: Sample Application Deployment

Alright — the hardest part is behind us. Let's recap what we've accomplished so far.

We have a local Kubernetes cluster up and running with Prometheus, Grafana, and ClickHouse installed. The OpenTelemetry Collector has been successfully deployed as a DaemonSet, and we've created a headless service to expose it. We also verified that ClickHouse automatically created the required tables for logs and traces.

Everything is now in place to deploy a sample application that emits logs and traces — so we can validate the full end-to-end pipeline.

Setting Up the Sample Service

Now, we are going to create a simple web application using NestJS. Don't worry if you've never used NestJS before — we'll write everything in plain TypeScript, and the code is easy to follow. The idea is simple: we deploy the application to our local cluster as two independent services, each with two replicas. The application has two endpoints: ping and call. The first simply returns pong and simulates about 50 milliseconds of work, creating a test-span in the process. The second endpoint, call, is more interesting: it calls its sibling's ping (app-a's /call invokes app-b's /ping, and vice versa), which does the same thing. This lets us see how traces are built across two independent applications.

If Nest CLI isn't installed globally yet:

npm install -g @nestjs/cli

We'll assume you're still inside the test-observability directory.
Generate a new project:

nest new my-observability-app
cd my-observability-app

Choose defaults when prompted (you can go with either npm or yarn).

Installing Dependencies

The base NestJS app doesn't include observability tools, so let's install everything needed for logging, tracing, and OTLP exports:

npm install \
  nestjs-pino pino pino-http pino-pretty pino-opentelemetry-transport \
  @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-grpc \
  @opentelemetry/exporter-metrics-otlp-grpc \
  @opentelemetry/instrumentation-http \
  @opentelemetry/instrumentation-express \
  @opentelemetry/instrumentation-pino \
  @opentelemetry/instrumentation-redis \
  @opentelemetry/instrumentation-ioredis \
  @opentelemetry/instrumentation-mysql \
  sonyflake

Wiring Up Tracing and Logging

src/tracing.ts — this is where our observability begins. We use @opentelemetry/sdk-node to connect to our collector via gRPC and begin sending traces. We won't dive into SDK configuration here — it's a deep topic that deserves its own article. The main thing to pay attention to is the getNodeAutoInstrumentations() function (github). It enables the bundled auto-instrumentations (HTTP, Express, Pino, and more), so supported libraries automatically emit spans and propagate trace context without manual wiring. I've also added comments showing how to enable debug logging for @opentelemetry/sdk-node.

💡 One more important clarification: for everything to work, this file needs to be loaded before the main application starts. We'll come back to this when we launch the application through Docker.

src/tracing.ts:

import { credentials } from '@grpc/grpc-js';
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
// import { diag, DiagConsoleLogger, DiagLogLevel } from '@opentelemetry/api';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
    credentials: credentials.createInsecure(),
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

// diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);

sdk.start();

Next is our main.ts file, where the application is bootstrapped. I've made some slight modifications to it to ensure that the Pino logger is used properly.

src/main.ts:

import { NestFactory } from '@nestjs/core';
import { Logger } from 'nestjs-pino';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule, { bufferLogs: true });
  app.useLogger(app.get(Logger));
  await app.listen(3000);
}
bootstrap();

The next two are our controller, which has endpoints for ping and call, and a service that simulates some work.

src/app.controller.ts:

import { Controller, Get } from '@nestjs/common';
import { InjectPinoLogger, PinoLogger } from 'nestjs-pino';
import { AppService } from './app.service';

@Controller()
export class AppController {
  private readonly targetHost: string;

  constructor(
    @InjectPinoLogger(AppService.name)
    private readonly logger: PinoLogger,
    private readonly appService: AppService,
  ) {
    const self = process.env.OTEL_SERVICE_NAME;
    if (self === 'app-a') {
      this.targetHost = 'http://app-b:3000';
    } else if (self === 'app-b') {
      this.targetHost = 'http://app-a:3000';
    } else {
      this.targetHost = 'http://0.0.0.0:3000';
    }
  }

  @Get('ping')
  async ping(): Promise<string> {
    await this.getHello();
    return 'pong';
  }

  @Get('call')
  async call(): Promise<string> {
    if (!this.targetHost) {
      return 'Unknown app role, cannot determine target';
    }

    try {
      const res = await fetch(`${this.targetHost}/ping`);
      const text = await res.text();
      return `Response from ${this.targetHost}: ${text}`;
    } catch (err) {
      this.logger.error(err);
      return `Failed to call ${this.targetHost}: ${err}`;
    }
  }

  @Get()
  getHello(): Promise<string> {
    return this.appService.getHello();
  }
}

src/app.service.ts:

import { Injectable } from '@nestjs/common';
import { trace } from '@opentelemetry/api';
import { PinoLogger, InjectPinoLogger } from 'nestjs-pino';

@Injectable()
export class AppService {
  constructor(
    @InjectPinoLogger(AppService.name)
    private readonly logger: PinoLogger,
  ) {}

  async getHello(): Promise<string> {
    const tracer = trace.getTracer('manual-test');

    await tracer.startActiveSpan('test-span', async (span) => {
      await new Promise((res) => setTimeout(res, 50));
      span.end();
    });

    this.logger.info('Handling getHello request...');
    this.doSomething();
    this.logger.info('Finished getHello request.');

    return 'Hello World!';
  }

  doSomething(): void {
    this.logger.info('Doing something internal...');
  }
}

Now let's move on to module setup. AppModule is our main, or root, module. In NestJS, modules form the basis of business logic by uniting services and controllers, and they are a recommended, convenient way to organize code. Modules can import other modules, so we'll import the Pino module into our root module, which allows us to use the Pino logger throughout the application. Let's take a closer look at what's actually going on here.

src/app.module.ts:

import { Module } from '@nestjs/common';
import { LoggerModule } from 'nestjs-pino';
import { Sonyflake } from 'sonyflake';
import { IncomingMessage } from 'http';
import { AppController } from './app.controller';
import { AppService } from './app.service';

const isProd = process.env.NODE_ENV === 'production';

const sonyflake = new Sonyflake({
  machineId: 2,
  epoch: Date.UTC(2020, 4, 18, 0, 0, 0),
});

@Module({
  imports: [
    LoggerModule.forRoot({
      pinoHttp: {
        base: {
          service_name: process.env.OTEL_SERVICE_NAME || 'app-a',
        },
        customLevels: {
          trace: 1,
          debug: 5,
          info: 9,
          warn: 13,
          error: 17,
          fatal: 21,
        },
        useOnlyCustomLevels: true,
        genReqId: (req: IncomingMessage) => {
          const id = sonyflake.nextId();
          req.id = id;
          return id;
        },
        level: process.env.LOG_LEVEL || 'info',
        ...(isProd
          ? {}
          : {
              transport: {
                target: 'pino-pretty',
                options: {
                  colorize: true,
                },
              },
            }),
      },
    }),
  ],
  controllers: [AppController],
  providers: [AppService],
})
export class AppModule {}

Let's go through the settings for the logger.

This

base: {
  service_name: process.env.OTEL_SERVICE_NAME || 'app-a',
}

... will add a service_name field to our logs.

This

const sonyflake = new Sonyflake({
  machineId: 2, // in range 2^16
  epoch: Date.UTC(2020, 4, 18, 0, 0, 0), // timestamp
});

...

genReqId: (req: IncomingMessage) => {
  const id = sonyflake.nextId();
  req.id = id;
  return id;
}

... associates an incoming request with a unique id that we generate with sonyflake.

And, finally, this

customLevels: {
  trace: 1,
  debug: 5,
  info: 9,
  warn: 13,
  error: 17,
  fatal: 21,
},
useOnlyCustomLevels: true

... defines custom levels. Pino's default numeric levels (e.g., info = 30, warn = 40) don't align with OpenTelemetry's severity number conventions (info = 9, error = 17, etc.), so defining custom levels ensures logs are correctly parsed and severity is accurately extracted in the OpenTelemetry Collector pipeline.

To start our application run:

npm run build
NODE_ENV=production node -r ./dist/tracing dist/main

In the terminal you should see output showing that the service has started and that our ping and call routes have been mapped:

{"level":9,"time":1747552531452,"service_name":"app-a","context":"NestFactory","msg":"Starting Nest application..."}
{"level":9,"time":1747552531452,"service_name":"app-a","context":"InstanceLoader","msg":"LoggerModule dependencies initialized"}
{"level":9,"time":1747552531452,"service_name":"app-a","context":"InstanceLoader","msg":"AppModule dependencies initialized"}
{"level":9,"time":1747552531452,"service_name":"app-a","context":"RoutesResolver","msg":"AppController {/}:"}
{"level":9,"time":1747552531452,"service_name":"app-a","context":"RouterExplorer","msg":"Mapped {/ping, GET} route"}
{"level":9,"time":1747552531452,"service_name":"app-a","context":"RouterExplorer","msg":"Mapped {/call, GET} route"}
{"level":9,"time":1747552531452,"service_name":"app-a","context":"RouterExplorer","msg":"Mapped {/, GET} route"}
{"level":9,"time":1747552531452,"service_name":"app-a","context":"NestApplication","msg":"Nest application successfully started"}

OpenTelemetry Instrumentation Debugging

Before we continue, let's look at how to troubleshoot when logs or traces don't appear in the database.
To do this, we simply need to uncomment the commented lines in src/tracing.ts and then restart our service.

import { diag, DiagConsoleLogger, DiagLogLevel } from '@opentelemetry/api';

...

diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);

Then run it again:

npm run build
NODE_ENV=production node -r ./dist/tracing dist/main

Even though our service has successfully started, you will see something like this in the logs:

{"stack":"AggregateError [ECONNREFUSED]: \n    at internalConnectMultiple (node:net:1121:18)\n    at afterConnectMultiple (node:net:1688:7)\n    at TCPConnectWrap.callbackTrampoline (node:internal/async_hooks:130:17)","errors":"Error: connect ECONNREFUSED ::1:4318,Error: connect ECONNREFUSED 127.0.0.1:4318","code":"ECONNREFUSED","name":"AggregateError"}
{"stack":"Error: 14 UNAVAILABLE: No connection established. Last error: Error: connect ECONNREFUSED 127.0.0.1:4317"}

This clearly indicates that the SDK was unable to establish HTTP and gRPC connections to the collector. This debugging mechanism, together with the debug exporter in the OpenTelemetry Collector itself, gives us a minimal set of tools for troubleshooting.
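Another quick sanity check on the storage side is to confirm whether any rows are reaching ClickHouse at all:

-- Row counts quickly show whether the collector is exporting anything.
SELECT 'logs' AS signal, count() AS row_count FROM observability.otel_logs
UNION ALL
SELECT 'traces' AS signal, count() AS row_count FROM observability.otel_traces;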

Deploying to Kind Cluster

To simplify deployment, we'll add a helper script at the root of the app:

./deploy-app.sh

Deployment script (deploy-app.sh):

#!/bin/bash

set -e

VERSION_FILE="VERSION"
if [[ -f "$VERSION_FILE" ]]; then
  VERSION=$(cat "$VERSION_FILE")
else
  VERSION=1
fi

if ! [[ "$VERSION" =~ ^[0-9]+$ ]]; then
  echo "❌ Invalid version number in VERSION file"
  exit 1
fi

TAG="$VERSION"
IMAGE_NAME="my-observability-app"
KIND_CLUSTER_NAME="observability"
HELM_RELEASE_NAME="my-observability-app"
HELM_CHART_PATH="./helm/my-observability-app"

echo "🛠 Building Docker image: $IMAGE_NAME:$TAG"
docker build -t "$IMAGE_NAME:$TAG" .

echo "🐳 Loading image into Kind cluster: $KIND_CLUSTER_NAME"
kind load docker-image "$IMAGE_NAME:$TAG" --name "$KIND_CLUSTER_NAME"

echo "🚀 Installing/upgrading Helm release: $HELM_RELEASE_NAME"
helm upgrade --install "$HELM_RELEASE_NAME" "$HELM_CHART_PATH" \
  --set image.repository="$IMAGE_NAME" \
  --set image.tag="$TAG" \
  --set image.pullPolicy=IfNotPresent

NEXT_VERSION=$((VERSION + 1))
echo "$NEXT_VERSION" > "$VERSION_FILE"

echo "✅ Done. Deployed image with tag: $TAG"
echo "📄 Updated VERSION for next release to: $NEXT_VERSION"

This script does the following:

  • Builds the app and Docker image
  • Loads it into the Kind cluster
  • Deploys two services via Helm: app-a and app-b, each with 2 replicas

And our Helm chart structure will look like:

helm/my-observability-app/
├── Chart.yaml
├── values.yaml
└── templates/
    ├── app-a-deployment.yaml
    ├── app-a-service.yaml
    ├── app-b-deployment.yaml
    └── app-b-service.yaml

helm/my-observability-app/Chart.yaml:

apiVersion: v2
name: my-observability-app
description: Deploys a single app twice as app-a and app-b
version: 0.1.0
appVersion: '1.0'

helm/my-observability-app/values.yaml:

replicaCount: 2

image:
  repository: my-observability-app
  tag: latest
  pullPolicy: IfNotPresent

service:
  port: 3000

otel:
  endpoint: 'otel-collector-opentelemetry-collector-agent.observability.svc:4317'

helm/my-observability-app/templates/app-a-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-a
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: app-a
  template:
    metadata:
      labels:
        app: app-a
    spec:
      containers:
        - name: app-a
          image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          env:
            - name: NODE_ENV
              value: production
            - name: OTEL_SERVICE_NAME
              value: app-a
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: {{ .Values.otel.endpoint }}
            - name: OTEL_EXPORTER_OTLP_PROTOCOL
              value: grpc
          ports:
            - containerPort: {{ .Values.service.port }}

helm/my-observability-app/templates/app-a-service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: app-a
spec:
  selector:
    app: app-a
  ports:
    - protocol: TCP
      port: 3000
      targetPort: 3000

app-b-deployment.yaml and app-b-service.yaml are identical except for names and labels changed to app-b.

And don't forget to add the Dockerfile and .dockerignore files.

Dockerfile:

FROM node:22-alpine as base
WORKDIR /app
COPY package*.json ./
RUN npm ci

FROM base as builder
COPY . .
RUN npm run build

FROM base as prod
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/package*.json ./
RUN npm ci --omit=dev
CMD ["node", "-r", "./dist/tracing.js", "./dist/main.js"]

.dockerignore:

# Node dependencies
node_modules
npm-debug.log
yarn.lock

# Build output
dist
*.ts

# OS / Editor / IDE
.DS_Store
.env
*.env
.vscode
.idea
*.swp

# Git
.git
.gitignore

VERSION
deploy-app.sh

Now we're ready to deploy our services to Kubernetes. Simply run the deploy-app.sh script and observe the example output:

🛠 Building Docker image: my-observability-app:1
🐳 Loading image into Kind cluster: observability
🚀 Installing/upgrading Helm release: my-observability-app
✅ Done. Deployed image with tag: 1
📄 Updated VERSION for next release to: 2

Testing the Pipeline

Port-forward the app:

kubectl port-forward service/app-a 3000:3000

In another terminal run:

curl http://localhost:3000/call

Expected output:

Response from http://app-b:3000: pong%

To see logs from app-a and app-b run:

kubectl logs -l app=app-a
kubectl logs -l app=app-b

Now inspect the OpenTelemetry Collector logs and ClickHouse tables to confirm that logs and traces are being stored.

OpenTelemetry Collector Logs

You should see log records with full trace context, custom severity, and service name:

kubectl logs otel-collector-opentelemetry-collector-agent-fzwqv -n observability -f

LogRecord #0
ObservedTimestamp: 2025-05-17 18:47:52.880985305 +0000 UTC
Timestamp: 2025-05-17 18:47:52.783 +0000 UTC
SeverityText: INFO
SeverityNumber: Info(9)
Body: Map({"level":9,"msg":"request completed","req":{"headers":{"accept":"*/*","host":"localhost:3000","user-agent":"curl/8.7.1"},"id":"2646566778519420930","method":"GET","params":{"path":["call"]},"query":{},"remoteAddress":"::ffff:127.0.0.1","remotePort":49448,"url":"/call"},"res":{"headers":{"content-length":"37","content-type":"text/html; charset=utf-8","etag":"W/\"25-YIs9s+nPVAD6eNe/gEyORquumI4\"","x-powered-by":"Express"},"statusCode":200},"responseTime":74,"service_name":"app-a","span_id":"0190a252bfc22dab","time":1747507672783,"trace_flags":"01","trace_id":"18211f4c6b1dd7990bc8a6f113b774e6"})
Attributes:
     -> log.file.path: Str(/var/log/containers/app-a-79fdcb5d84-5s5rl_default_app-a-d456b9167a9578c65b5f6c223461594b1e1599ee220b7c633694f62018e31258.log)
     -> log.iostream: Str(stdout)
     -> logtag: Str(F)
Trace ID: 18211f4c6b1dd7990bc8a6f113b774e6
Span ID: 0190a252bfc22dab
Flags: 1
 {"resource": {}, "otelcol.component.id": "debug", "otelcol.component.kind": "exporter", "otelcol.signal": "logs"}
2025-05-17T18:47:56.571Z info Metrics {"resource": {}, "otelcol.component.id": "debug", "otelcol.component.kind": "exporter", "otelcol.signal": "metrics", "resource metrics": 1, "metrics": 36, "data points": 45}
2025-05-17T18:47:56.577Z info ResourceMetrics #0
Resource SchemaURL:
Resource attributes:
     -> service.name: Str(otelcol-contrib)
     -> server.address: Str(10.244.0.14)
     -> service.instance.id: Str(86f2eb87-dbc0-4516-95b9-d3a254fb10e4)
     -> server.port: Str(8888)
     -> url.scheme: Str(http)
     -> service.version: Str(0.126.0)
ClickHouse (otel_logs)
select * from otel_logs order by Timestamp desc limit 1 format vertical;

Timestamp:          2025-05-17 18:47:54.966000000
TimestampTime:      2025-05-17 18:47:54
TraceId:            efb66e731e09d4abe7be282c704adf13
SpanId:             72514ac7a4067633
TraceFlags:         1
SeverityText:       INFO
SeverityNumber:     9
ServiceName:        app-a
Body:               {"level":9,"msg":"request completed","req":{"headers":{"accept":"*/*","host":"localhost:3000","user-agent":"curl/8.7.1"},"id":"2646566815362187266","method":"GET","params":{"path":["call"]},"query":{},"remoteAddress":"::ffff:127.0.0.1","remotePort":49468,"url":"/call"},"res":{"headers":{"content-length":"37","content-type":"text/html; charset=utf-8","etag":"W/\"25-YIs9s+nPVAD6eNe/gEyORquumI4\"","x-powered-by":"Express"},"statusCode":200},"responseTime":62,"service_name":"app-a","span_id":"72514ac7a4067633","time":1747507674966,"trace_flags":"01","trace_id":"efb66e731e09d4abe7be282c704adf13"}
ResourceSchemaUrl:
ResourceAttributes: {'service.name':'app-a'}
ScopeSchemaUrl:
ScopeName:
ScopeVersion:
ScopeAttributes:    {}
LogAttributes:      {'log.file.path':'/var/log/containers/app-a-79fdcb5d84-5s5rl_default_app-a-d456b9167a9578c65b5f6c223461594b1e1599ee220b7c633694f62018e31258.log','log.iostream':'stdout','logtag':'F'}
ClickHouse (otel_traces)
select * from otel_traces order by Timestamp desc limit 1 format vertical;

Timestamp:          2025-05-17 18:47:54.911000000
TraceId:            efb66e731e09d4abe7be282c704adf13
SpanId:             289622f5583790bb
ParentSpanId:       7eed83778c3497d7
TraceState:
SpanName:           request handler - /ping
SpanKind:           Internal
ServiceName:        app-b
ResourceAttributes: {'host.arch':'arm64','host.name':'app-b-7d9f6b6bf7-z56nq','process.command':'/app/dist/main.js','process.command_args':'["/usr/local/bin/node","-r","./dist/tracing.js","/app/dist/main.js"]','process.executable.name':'node','process.executable.path':'/usr/local/bin/node','process.owner':'root','process.pid':'1','process.runtime.description':'Node.js','process.runtime.name':'nodejs','process.runtime.version':'22.15.1','service.name':'app-b','telemetry.sdk.language':'nodejs','telemetry.sdk.name':'opentelemetry','telemetry.sdk.version':'2.0.1'}
ScopeName:          @opentelemetry/instrumentation-express
ScopeVersion:       0.50.0
SpanAttributes:     {'express.name':'/ping','express.type':'request_handler','http.route':'/ping'}
Duration:           54657000
StatusCode:         Unset
...

If both queries return recent entries and the collector logs show no errors — congratulations, you now have end-to-end observability working in your stack!
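One extra sanity check is to tail the collector's own logs and confirm that the ClickHouse exporter isn't reporting failures. A minimal sketch, reusing the namespace and label selector from the Troubleshooting section at the end of this guide:

# Any ClickHouse export failures would show up in the collector's own logs
kubectl logs -n observability -l app.kubernetes.io/name=opentelemetry-collector --tail=200 \
  | grep -iE "error|failed" || echo "no recent export errors"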


Step 7: Explore Traces in Grafana

All right — it's finally time to explore and customize trace visualization in Grafana.

To do this, we need to add ClickHouse as a new data source in Grafana.

Add ClickHouse Connection

  1. Open Grafana at http://localhost:30080 (default login: admin/prom-operator).
  2. Go to Connections → Add new connection.
  3. In the search bar, type ClickHouse and select the plugin named ClickHouse.
  4. Click Install and wait until the plugin is installed.
  5. Then press "Add new data source".

You can leave the name as the default (grafana-clickhouse-datasource) and toggle the "Default" switch to on.

Set the following connection details:

  • Server address: clickhouse.monitoring.svc.cluster.local
  • Port number: 9000
  • Protocol: Native
  • Skip TLS Verify: On
  • Credentials:
    • Username: admin
    • Password: clickhouse123

Click "Save & test" — you should see the message: Data source is working.

Reload the page. Once it reloads, scroll down to the Additional settings section.

Add ClickHouse Data Source

Configure Logs and Traces

In the Logs configuration section:

  • Default log database: observability
  • Default log table: otel_logs
  • Toggle "Use OTel columns" to on

Add logs

In the Traces configuration section:

  • Default trace database: observability
  • Default trace table: otel_traces
  • Toggle "Use OTel columns" for traces as well

Add traces

Press "Save & test" again.

Import Dashboards

Now scroll to the top of the data source page and open the Dashboards tab.

Add Dashboards

Click "Import" for each available dashboard. You should now have a few new dashboards available.

From the left-hand menu in Grafana, go to Dashboards, then type ClickHouse in the search bar. You should see your newly imported dashboards.

ClickHouse Dashboards

Go to:
Home → Dashboards → ClickHouse OTel Dashboard
This is where we'll be observing traces shortly.

Generate Some Data

Let's send some traffic through our services.

First, in one terminal, port-forward app-a:

kubectl port-forward service/app-a 3000:3000

Then, in another terminal, run:

ab -n 200 -c 10 http://localhost:3000/call

This uses ApacheBench (ab), which ships with macOS, to generate 200 requests with a concurrency of 10.
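If ab isn't available on your system, a plain curl loop generates enough traffic as well. A small sketch hitting the same port-forwarded endpoint:

# 200 sequential requests against app-a (slower than ab, but good enough for a demo)
for i in $(seq 1 200); do curl -s http://localhost:3000/call > /dev/null; done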

Once it finishes, refresh the ClickHouse OTel Dashboard. You should see metrics like request count and latency.

ClickHouse data

Scroll down to the Traces section and click on one of the recent traces.

You'll see the full trace journey, including spans across both services. Scroll further down to the Logs section to view logs correlated with that particular trace.

Trace

💡 Bonus Tip:
Check out the ClickHouse - Data Analysis dashboard to see how much data is currently stored in your observability tables.
ClickHouse Storage 1
ClickHouse Storage 2

Fixing Log Expansion in Grafana Panel

You might notice that when you expand a log entry in the dashboard, the details are not very useful — you only see the log level. Let's fix that so you can access structured and meaningful fields.

No logs fields

1. Edit the Panel

  • In the panel menu, click the three vertical dots (︙) in the top-right corner of the panel.
  • Select Edit. This will open the Edit Logs Visualization view.

2. Update the Log Query

  • Scroll down to the Queries section.
  • Set:
    • Editor type to SQL Editor
    • Query type to Logs
  • Replace the default query:
SELECT Timestamp as "timestamp", Body as "body", SeverityText as "level" FROM "default"."otel_logs" LIMIT 1000

with the enhanced one:

SELECT
    Timestamp AS timestamp,
    Body AS body,
    TraceId AS trace_id,
    ServiceName AS service,
    JSONExtractString(JSONExtractRaw(Body, 'req'), 'id') AS request_id,
    JSONExtractString(Body, 'msg') AS message,
    JSONExtractInt(JSONExtractRaw(Body, 'res'), 'statusCode') AS status_code,
    JSONExtractString(JSONExtractRaw(Body, 'req'), 'url') AS url,
    JSONExtractInt(Body, 'responseTime') AS response_time,
    SeverityText AS level
FROM observability.otel_logs
ORDER BY Timestamp DESC
LIMIT 1000
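Before saving the panel, you can optionally run the same extraction directly against ClickHouse so that any JSON path typos surface immediately. A minimal sketch using clickhouse-client with the credentials from earlier:

# Should return the last few requests with message, status code, and URL extracted
kubectl exec -n monitoring svc/clickhouse -- \
  clickhouse-client --user=admin --password=clickhouse123 --query "
    SELECT
        JSONExtractString(Body, 'msg')                            AS message,
        JSONExtractInt(JSONExtractRaw(Body, 'res'), 'statusCode') AS status_code,
        JSONExtractString(JSONExtractRaw(Body, 'req'), 'url')     AS url
    FROM observability.otel_logs
    ORDER BY Timestamp DESC
    LIMIT 5"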

3. Run and Save

  • Press the Run Query button.
  • Scroll down to ensure no error messages appear.
  • If everything looks good:

    • Click Save Dashboard (top-right).
    • Confirm by pressing Save.
    • You might see a warning: "This is a Plugin dashboard". Just click Save and overwrite.

4. Verify Enhanced Logs

  • Navigate back to the Simple ClickHouse OTel Dashboard.
  • Reload the page if needed.
  • Scroll down to the Trace Logs panel.
  • Click on any log entry — now you should see:
    • Structured fields (like trace_id, request_id, url, etc.)
    • Two new buttons: View traces and View logs.

These buttons take you straight to a more convenient view of the current trace or its associated logs.

New buttons

Traces

Logs


Final Thoughts

This guide turned out quite a bit longer than I initially expected — and even so, we've only scratched the surface. Each step above introduces just the basics of every component, and there's still so much more to explore in the world of observability.
(We haven't even touched on collecting metrics and setting up alerts!)

In production environments, you'll need to think about proper database schema design, secure credential management, persistent storage, automation for scaling and deployment — and many other real-world concerns.
Most importantly, you'll need to choose the right set of tools that align with your team's needs and your business context.

But regardless of scale, the core principle stays the same: start small, understand how your data flows, and evolve your stack as your system grows.

Good luck — and may no incident ever go unnoticed. 🚀


Troubleshooting

  1. Kind Cluster Creation Fails
   # Check Docker is running
   docker ps

   # Clean up any existing clusters
   kind delete clusters --all

   # Verify system resources
   docker info
  2. Prometheus/Grafana Pods Not Starting
   # Check pod status
   kubectl get pods -n monitoring

   # Check pod logs
   kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus
   kubectl logs -n monitoring -l app.kubernetes.io/name=grafana

   # Check resource limits
   kubectl describe pod -n monitoring -l app.kubernetes.io/name=prometheus
  3. ClickHouse Connection Issues
   # Verify ClickHouse pod is running
   kubectl get pods -n monitoring -l app.kubernetes.io/name=clickhouse

   # Check ClickHouse logs
   kubectl logs -n monitoring -l app.kubernetes.io/name=clickhouse

   # Test connection from within cluster
   kubectl exec -n monitoring -it svc/clickhouse -- clickhouse-client --user=admin --password=clickhouse123
  4. OpenTelemetry Collector Issues
   # Check collector status
   kubectl get pods -n observability -l app.kubernetes.io/name=opentelemetry-collector

   # View collector logs
   kubectl logs -n observability -l app.kubernetes.io/name=opentelemetry-collector
  5. Data Not Showing in Grafana
    • Verify data sources are properly configured in Grafana
    • Verify ClickHouse tables are being created and populated
     • Check OpenTelemetry Collector logs for any export errors (see the commands sketched below)
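A couple of quick commands for the last two checks, assuming the namespaces and credentials used throughout this guide:

   # Row counts for the OTel tables; if they stay at 0 while traffic is flowing,
   # the problem is upstream of Grafana (collector or exporter configuration)
   kubectl exec -n monitoring svc/clickhouse -- \
     clickhouse-client --user=admin --password=clickhouse123 \
     --query "SELECT count() FROM observability.otel_logs"

   # Look for export errors in the collector logs
   kubectl logs -n observability -l app.kubernetes.io/name=opentelemetry-collector | grep -i error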

Resources
