Distributed Tracing Instrumentation with OpenTelemetry and Jaeger

Distributed tracing is a way to track a request as it moves through a system, especially in setups where multiple services talk to each other, like in microservices.

Imagine a user clicking "buy" on an e-commerce site. That action might hit a front-end service, a payment processor, an inventory checker, a database and a Redis cache.

If something goes wrong, figuring out where it failed can be a nightmare without a clear map. That’s where distributed tracing comes in. It’s like a GPS for your application, showing the path of a request across services, how long each step takes, and where things might break.

[Figure: request lifecycle in microservices]


Unlike logs, which are like diary entries of what happened, or metrics, which give you numbers like CPU usage, tracing gives you the full story of a request’s journey. It’s critical for spotting bottlenecks, debugging errors, and understanding how your system behaves under real world conditions.

[Figure: traces vs logs]

Logs are great for capturing detailed information about what your application is doing. But they are often scattered and not tied together. Traces, on the other hand, bring structure. You can think of them as stitched logs that belong to the same request. When you attach key log details as attributes or events inside spans, you end up with the same information, but now it is grouped by request and connected across services.
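
To make that concrete, here is a minimal Go sketch (not code from the repository) of what "stitching log details into a span" looks like with the OpenTelemetry API. The span name, attribute keys, and the processOrder function are made up for illustration:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// processOrder is a hypothetical handler: details you would normally log as
// free-floating lines are attached to the current span instead.
func processOrder(ctx context.Context, orderID string) {
	ctx, span := otel.Tracer("checkout").Start(ctx, "process-order")
	defer span.End()

	// A log field becomes a span attribute...
	span.SetAttributes(attribute.String("order.id", orderID))

	// ...and a log line becomes a span event, still tied to this request.
	span.AddEvent("payment.authorized",
		trace.WithAttributes(attribute.Int("payment.amount_cents", 4999)))

	_ = ctx // pass ctx on to downstream calls so their spans join the same trace
}

func main() {
	processOrder(context.Background(), "order-123")
}
```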


This article explains distributed tracing using my GitHub repository distributed_tracing.

The repo includes:

  • 📦 Instrumentation with OpenTelemetry & Jaeger
  • 🎯 Head-based sampling
  • 🧠 Tail-based sampling
  • ⚖️ Scaling collectors with a load balancer

We will walk through each step and explain what the code is doing.

⚠️ Disclaimer:
There are other important topics that we will not cover here, such as:

  • Custom instrumentation for capturing application-specific spans
  • How to define a good trace and what makes a trace useful
  • Correlating logs with traces so that logs are grouped around a single request
  • Aggregating data from your traces to derive metrics without exporting them separately

Before we dive into the repository, one note: to keep things simple, we will rely on automatic instrumentation. Specifically, this means instrumenting HTTP servers to capture incoming requests and HTTP clients to trace outgoing calls. In real systems, however, that is rarely enough; you often need to go beyond it.

Good traces require good data 🤖. That means making sure you are instrumenting all the key parts of your system. HTTP clients and servers, relational databases, cache layers, Elasticsearch, and any other critical services should be automatically instrumented where possible. Then you layer on custom instrumentation to fill the gaps and highlight the things that matter most in your business logic, as in the sketch below.
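
For example, a manual span around a dependency that has no auto-instrumentation might look like this. It is not from the repository; the inventory.lookup span name, the fakeCacheGet helper, and the attribute keys are hypothetical:

```go
package main

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

// lookupInventory wraps a dependency that has no auto-instrumentation in a
// manual span, so the gap still shows up in the trace.
func lookupInventory(ctx context.Context, sku string) error {
	ctx, span := otel.Tracer("inventory").Start(ctx, "inventory.lookup")
	defer span.End()

	span.SetAttributes(attribute.String("inventory.sku", sku))

	if err := fakeCacheGet(ctx, sku); err != nil {
		// Mark the span as failed so error traces are easy to find in Jaeger.
		span.RecordError(err)
		span.SetStatus(codes.Error, "cache lookup failed")
		return err
	}
	return nil
}

// fakeCacheGet stands in for a real cache client call.
func fakeCacheGet(_ context.Context, _ string) error {
	return errors.New("cache miss")
}

func main() {
	_ = lookupInventory(context.Background(), "sku-42")
}
```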


Repository Overview

This repository demonstrates distributed tracing using OpenTelemetry with Jaeger as the backend to collect and visualize traces.

It features three services:

  • X: written in Go
  • Y: written in Ruby
  • Z: written in Node.js

The services are connected in a chain: Service X calls Y, and Y calls Z. This creates a traceable path for a request as it moves across the system.

[Figure: service chain overview]

The goal is to trace the full lifecycle of a request as it flows from the entry point (service X) to the final service (Z). In the real world, this pattern is common in microservice-based applications where distributed tracing can help identify where time is spent or where failures occur.

To simulate real behavior and failures:

  • Service Z is configured to return a 500 Internal Server Error on every 10th request. This is done deliberately to help us observe how different sampling strategies (head-based vs tail-based) behave when errors are present in the trace.

  • The script hit_x_service.sh sends 10 HTTP GET requests to the /x endpoint of service X. This creates a consistent flow of traces that travel through all three services.


Architecture Overview

Here’s a high level diagram that shows how everything fits together under the hood. Each of the three services (X, Y, and Z) is instrumented using OpenTelemetry and exports trace data via OTLP over HTTP to the Jaeger Collector. The Jaeger Collector receives and processes the traces, forwarding them to the backend for storage and visualization in the Jaeger UI.

[Figure: Jaeger all-in-one receiving traces from the three microservices]


Instrumentation and Basic Tracing

To start, we have a docker-compose.yml file that sets up the environment. It runs the Jaeger all-in-one image and exposes the ports needed to receive and visualize trace data.

```yaml
version: '3'

services:
  jaeger:
    image: jaegertracing/all-in-one:1.71.0
    command:
      - "--collector.otlp.grpc.tls.enabled=false"
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC
      - "4318:4318"     # OTLP HTTP
```

Let’s break down what each port does:

  • Port 16686 is used to access the Jaeger UI.
  • Port 4317 allows the Jaeger Collector to receive trace data using the OpenTelemetry Protocol (OTLP) over gRPC.
  • Port 4318 does the same, but over HTTP instead of gRPC.

Ports 4317 and 4318 are both handled by the Jaeger Collector, which ingests trace data from services instrumented with OpenTelemetry. These services generate spans, and the collector receives, processes, and forwards them to the Jaeger backend for storage and visualization.

With this setup in place, you can start sending traces from your services to Jaeger using either OTLP over gRPC or HTTP. This flexibility makes it easier to integrate tracing into different environments and across various tech stacks.
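
For instance, a Go service that prefers gRPC could create its exporter against port 4317 instead. This is a minimal sketch, not code from the repo, assuming the Jaeger all-in-one container from the compose file is running locally:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
)

func main() {
	ctx := context.Background()

	// Same collector, different protocol: OTLP over gRPC on port 4317.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"),
		otlptracegrpc.WithInsecure(), // plain gRPC, matching the TLS-disabled compose setup
	)
	if err != nil {
		log.Fatalf("failed to create OTLP gRPC exporter: %v", err)
	}
	defer exp.Shutdown(ctx)

	// Hand exp to a TracerProvider exactly as with the HTTP exporter (see service X below).
}
```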


Sampling: Default Behavior

By default, OpenTelemetry samples 100% of traces. That means every span created in your service will be recorded and exported.

Unless you have a specific need to manage trace volume, such as in high-throughput production environments, you don't need to configure a custom sampler.

The default sampler is a combination of ParentBased and ALWAYS_ON. Here's what that means:

  • The root span of a trace is always sampled.
  • All child spans inherit the sampling decision of their parent.

This guarantees that once a trace is started, every span within it will be sampled and exported.
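
Written out explicitly in Go (the repo relies on the default, so you would not normally need this), that default is equivalent to something like:

```go
package main

import (
	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Equivalent of the default: sample every root span, and let child spans
	// inherit their parent's decision.
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.AlwaysSample())),
		// exporter and resource options omitted for brevity
	)
	otel.SetTracerProvider(tp)
}
```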

In the first step, the tracing logic added to all three services (X, Y, and Z) will use the default sampler, meaning no sampling limits are applied.

Here’s how 100% sampling is configured in each language used in this repository:

Go (Service X)

We start our Go app by invoking the initTracer function. This function sets up tracing for the HTTP server (incoming requests).

There are two key things to consider in this function:

  • Exporter configuration: spans are sent over OTLP/HTTP to the collector on port 4318, with TLS disabled for local development:

```go
exp, err := otlptracehttp.New(ctx,
    otlptracehttp.WithEndpoint("localhost:4318"),
    otlptracehttp.WithInsecure(), // disables TLS
)
```
  • Context propagation: the process of passing trace context (such as trace and span IDs) across service boundaries, which makes it possible to follow a request as it moves through a distributed system. It ensures the trace stays intact and connected, giving you full observability. Propagation is usually handled by instrumentation libraries, as we will see in the next snippet. However, if you ever need to propagate context manually, you can use the Propagators API.
```go
otel.SetTextMapPropagator(
    propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{}, propagation.Baggage{},
    ),
)
```
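
If you ever do need the manual route mentioned above, a sketch with the Propagators API could look like this (the helper names and URL are placeholders, not the repo's code):

```go
package main

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// injectTraceContext copies the trace context from ctx into the outgoing
// request's headers (this is what otelhttp's transport does for you).
func injectTraceContext(ctx context.Context, req *http.Request) {
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
}

// extractTraceContext reads the incoming traceparent/baggage headers and returns
// a context the server-side span should be started from.
func extractTraceContext(req *http.Request) context.Context {
	return otel.GetTextMapPropagator().Extract(req.Context(), propagation.HeaderCarrier(req.Header))
}

func main() {
	req, _ := http.NewRequest(http.MethodGet, "http://example.internal/y", nil) // placeholder URL
	injectTraceContext(context.Background(), req)
	_ = extractTraceContext(req)
}
```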

You can observe context propagation in action by inspecting the request headers passed between services.

For example, since the Go service (X) calls the Ruby service (Y), if you log the incoming request headers in the Ruby app, you will see something like:

[Figure: request headers logged by service Y, showing the propagated trace context]

As you can see, the HTTP_TRACEPARENT header is present (this is how Rack exposes the W3C traceparent header). It carries trace context across service boundaries and allows the spans created by each service to be linked into the same trace.

  • Finally, we trace outgoing HTTP requests made by the Go service using an instrumented http.Client. This is essential for tracing HTTP client calls from Go to downstream services:
```go
var client = http.Client{
    Transport: otelhttp.NewTransport(http.DefaultTransport),
}
```
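
To see how these pieces could fit together end to end, here is a hedged sketch of an initTracer-style setup plus the /x handler wiring. It is not the repository's exact code: the resource attributes, ports, downstream URL, and handler name are assumptions.

```go
package main

import (
	"context"
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// client traces outgoing HTTP calls, exactly as in the snippet above.
var client = http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}

func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exp, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint("localhost:4318"),
		otlptracehttp.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource(resource.NewSchemaless(attribute.String("service.name", "service-x"))),
	)
	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, propagation.Baggage{},
	))
	return tp, nil
}

func xHandler(w http.ResponseWriter, r *http.Request) {
	// Reusing the incoming request's context makes the outgoing call to Y a
	// child span of the server span created by otelhttp.NewHandler.
	req, _ := http.NewRequestWithContext(r.Context(), http.MethodGet, "http://localhost:4567/y", nil) // assumed URL for service Y
	resp, err := client.Do(req)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	w.WriteHeader(resp.StatusCode)
}

func main() {
	tp, err := initTracer(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer tp.Shutdown(context.Background())

	// otelhttp.NewHandler starts a server span for every incoming request to /x.
	http.Handle("/x", otelhttp.NewHandler(http.HandlerFunc(xHandler), "GET /x"))
	log.Fatal(http.ListenAndServe(":8080", nil)) // port is an assumption
}
```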

Ruby (Service Y)

In Ruby, we use Sinatra to serve web requests and Faraday as the HTTP client. Instrumenting Ruby with OpenTelemetry is much simpler and requires less code compared to Go.

Here’s what you need in the server.rb file:

```ruby
OpenTelemetry::SDK.configure do |c|
  c.service_name = 'service-y'
  c.use 'OpenTelemetry::Instrumentation::Sinatra'
  c.use 'OpenTelemetry::Instrumentation::Faraday'
end
```

Unlike Go, we don’t need to manually configure context propagation.

The OpenTelemetry Ruby SDK handles this automatically as long as you are using auto-instrumented libraries like Sinatra and Faraday. It will extract incoming context from request headers and inject it into outgoing HTTP requests without additional setup.


Node.js (Service Z)

In Node.js, we use Express as our web server. The instrumentation setup is located in a separate file, tracing.js, which is imported at the top of index.js.

The tracing.js file configures the OpenTelemetry setup for the service.

```js
// Dependencies from the OpenTelemetry JS SDK
const { NodeTracerProvider, SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { resourceFromAttributes } = require('@opentelemetry/resources');
const { ATTR_SERVICE_NAME } = require('@opentelemetry/semantic-conventions');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

const exporter = new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' });

const provider = new NodeTracerProvider({
  // resourceFromAttributes is a factory function, so it is called without `new`
  resource: resourceFromAttributes({
    [ATTR_SERVICE_NAME]: 'service-z',
  }),
  spanProcessors: [new SimpleSpanProcessor(exporter)],
});

// Register the provider globally (sets the context manager and propagator)
provider.register();

registerInstrumentations({
  tracerProvider: provider,
  instrumentations: [
    // Express instrumentation expects the HTTP layer to be instrumented
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});
```

🎉 Installation Complete, Time to Trace 🚀

  • Start all three services (X, Y, and Z)
  • Run the Jaeger backend using: docker-compose up

  • Now run the following script to send 10 requests to Service X: ./hit_x_service.sh

  • Open your browser and go to: http://localhost:16686

You should see 10 traces listed in the Jaeger UI and if you click into the last one, you'll notice it contains an error. That’s because service Z is configured to fail on every 10th request, just like we planned.

[Figure: Jaeger UI listing all 10 traces, no sampling applied]

When you click into the last trace, the one with the error, you can follow the full request lifecycle across all three services.

[Figure: full request lifecycle trace across all three services]

🧵 This trace shows:

  • The request starts in service-x
  • It propagates to service-y
  • Then it hits service-z and fails with a 500
  • Back in service-y, we log the error and correlate it with the trace

This makes it super easy to debug distributed systems and pinpoint which service is failing and why.


🚨 But Wait, There’s a Problem

Cool - at this point everything is working. You’ve got traces flowing, spans being recorded, and the Jaeger UI showing the request paths across your services.

But here’s the catch: in production, things look very different. Your system might generate millions of traces every day, and with that come a few serious challenges:

  • High cost for exporting and storing all spans - especially when using hosted platforms
  • Too much noise - making it hard to focus on what's important (for example health checks)
  • Hard to catch the interesting traces - the ones with high latency, errors, or performance bottlenecks

This is where sampling comes in. It helps you reduce the volume of trace data while keeping the insights that matter most.
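
As a small preview (in Go, and not tied to the repo's code), head-based sampling can be as simple as swapping the sampler; the 10% ratio here is just an example value:

```go
package main

import (
	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Keep roughly 10% of traces, decided up front at the root span; children
	// follow their parent's decision so sampled traces stay complete.
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10))),
	)
	otel.SetTracerProvider(tp)
}
```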

We’ll talk about sampling strategies, including head-based and tail-based sampling, in the next article.

Stay tuned 🔥
