Getting your feet wet with OpenTelemetry

A few months back I gave an introduction to OpenTelemetry for an engineering team. Most of it is general enough to be useful to anyone, so here it is — stripped of anything context-specific, for whoever searches for “what the hell is OTEL.”

OpenTelemetry (OTEL) is an open-source observability framework: a vendor-neutral standard for instrumenting applications to emit traces, metrics, and logs. It graduated from the CNCF in 2024 and is backed by every major cloud provider and observability vendor. It is the industry standard for distributed tracing.

The deck below covers the core concepts, the architecture, common pitfalls, and how to get started. Use the arrows to click through. Below the slides are links to go deeper, and a full written walkthrough for those who prefer reading.


1 / 14

Getting your feet wet with OpenTelemetry

A brief introduction to distributed tracing

2 / 14

When logs aren't enough

Traditional logging gets painful as your system grows:

  • Logs are scattered across services with no shared thread
  • No easy way to correlate what happened across a single request
  • Hard to tell timing, causality, or which service is actually slow
  • Grep-driven debugging across six services is not a strategy

Which service is slow? Why did this request fail? Which downstream call is the bottleneck?

3 / 14

What is OpenTelemetry?

  • An open-source observability framework for traces, metrics, and logs
  • Vendor-neutral — instrument once, send anywhere (Jaeger, Datadog, Grafana Tempo, AWS X-Ray)
  • CNCF graduated project, backed by AWS, Google, Microsoft, Datadog, Grafana
  • Born from the merger of OpenTracing and OpenCensus in 2019

The three pillars of observability: Traces, Metrics, Logs

4 / 14

Three core concepts

Trace

The complete journey of a single request through your system — from the moment it arrives to the moment it returns.

Span

A single operation within that journey: a database query, an HTTP call, a queue publish. Spans nest to form a tree.

Context Propagation

How trace information travels across service boundaries — via the W3C traceparent header. Every service adds its span and passes the context forward.

[Trace flow diagram showing spans across services]

5 / 14

Anatomy of a span

Each span carries:

  • Span ID + Parent Span ID — where it sits in the tree
  • Operation name — what it represents (e.g., GET /users/:id)
  • Timestamps — start, end, duration
  • Attributes — key-value pairs: HTTP method, DB query, queue topic, custom data
  • Events — timestamped log entries within the span
  • Status — OK, Error (with message)

[Span anatomy diagram]

6 / 14

Reading the waterfall view

Every tracing backend shows traces as a waterfall (also called a flame graph). How to read it:

  • Each horizontal bar = one span
  • Nested bars = parent-child relationships
  • Bar width = duration
  • Colors = different services

At a glance you see: what happened, in what order, how long each step took, and where time was actually spent. A DB call taking 800ms of a 900ms request is immediately obvious.

7 / 14

The architecture

Data flows from your service, through a collector, to a backend:

Your Service → OTEL SDK/Agent → OTEL Collector → Backend (Jaeger / Datadog / Tempo / X-Ray)

The Collector is the key piece:

  • Centralises configuration — services just point at the collector
  • Redacts sensitive data before it hits any backend
  • Fan-out to multiple backends simultaneously
  • Handles buffering, retries, backpressure

[OTEL architecture diagram]

8 / 14

Language support

OTEL has mature libraries for every major language. Most auto-instrument HTTP, databases, queues, and gRPC with zero code changes:

  • Java / Scala / JVM: opentelemetry-java-instrumentation — attach the agent, done
  • Python: opentelemetry-python-contrib — wraps Django, Flask, FastAPI, SQLAlchemy, Celery
  • Node.js: opentelemetry-js — Express, Fastify, NestJS, pg, Redis
  • Go: opentelemetry-go with contrib packages
  • .NET: opentelemetry-dotnet — ASP.NET Core, HttpClient, SQL Client

Check the OTEL Registry for your specific framework.

9 / 14

Getting started

Three steps:

  1. Set up a backend to receive traces (Jaeger is free and easy; Grafana Tempo if you're on the Grafana stack)
  2. Set up an OTEL Collector to sit in front of it
  3. Add the SDK or agent to your service, configure via environment variables:
OTEL_SERVICE_NAME=my-service
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1

That last one is 10% sampling — important in production (see next slide).

10 / 14

Common pitfalls

  1. Missing context propagation — if one service doesn't forward the traceparent header, the trace breaks there. Check every service boundary.
  2. 100% sampling in production — traces have real overhead. Start at 10%, go lower on high-traffic paths.
  3. Meaningless span names — span-1 and handler tell you nothing. Name spans after what they do.
  4. Sensitive data in baggage — baggage propagates to every downstream service and third-party API. Don't put PII there.
  5. Losing async context — async code needs explicit context passing; the SDK doesn't always capture it automatically.
  6. Not setting OTEL_SERVICE_NAME — all your spans show up under "unknown_service". Always set it.
  7. Ignoring collector health — a backed-up collector silently drops spans. Monitor it.

11 / 14

What you can answer with traces

  • Why is this request slow — and which specific operation is responsible?
  • Which service is the bottleneck — mine or the one I call?
  • Why did this request fail — and which downstream call triggered the error?
  • What does our actual service communication graph look like?
  • Is that cache actually reducing database load?
  • What is our p99 latency broken down by operation type?

The payoff: debug production issues in minutes, not hours. The waterfall view usually points at the problem immediately.

12 / 14

FAQ

What's the performance overhead?
Less than 3% CPU at 10% sampling. Most teams find it negligible. At 100% sampling on high-traffic paths it matters more.

Do I need to change my code?
Usually no. Auto-instrumentation handles HTTP, DB, queues, and gRPC. Custom spans are optional, for operations the SDK can't see.

How long should I keep traces?
3–14 days is typical. Storage is cheap, but unbounded retention gets expensive. Most debugging happens within 24–48 hours anyway.

Traces vs logs — when do I use which?
Logs are point-in-time events. Traces show timing and causality across a request. Use both — they complement each other.

Am I locked into a vendor?
No. Switching backends (Jaeger → Datadog → Tempo) is a config change, not a code change. That's the whole point.

13 / 14

Key takeaways

  • Traces tell stories. Logs tell facts. Both matter, but a trace gives you the full picture of a request.
  • OTEL is the standard. Not a vendor product, not a framework you'll need to replace in three years.
  • Auto-instrumentation gets you 80% of the value for free. Add the agent, point it at a collector, and you're most of the way there.
  • Always sample in production. 100% tracing is a footgun. Start at 10%.
  • Watch the context propagation gaps. They silently break traces across service boundaries. Test early.

14 / 14

Where to go next


Resources


The full walkthrough

The problem with logs alone

Picture a common scenario: a user reports a slow request. You check the logs. Service A says “request received, request completed.” Service B says “processing started, done.” Service C says “query executed.”

But which service was slow? How long did each step actually take? You don’t know.

Traditional logging breaks down in distributed systems. Logs scatter across services with no correlation between them. They record events but not durations. When thousands of requests run concurrently, there is no way to tell which log line belongs to which request.

Distributed tracing solves this by following a single request through all your services.

What is OpenTelemetry?

OpenTelemetry, or OTEL, is an open-source observability framework. It works with any backend — Jaeger, Datadog, Grafana Tempo, AWS X-Ray. It is a CNCF graduated project, at the same level as Kubernetes, born from the merger of OpenTracing and OpenCensus in 2019.

AWS, Google, Microsoft, Datadog, and Grafana all support it. As Martin Thwaites puts it: nobody gets fired for suggesting they move their telemetry to OpenTelemetry.

The key benefit is portability. Switch backends with a config change, not a code change.

OpenTelemetry handles three signals: traces, metrics, and logs. Traces follow a request’s journey. Metrics are measurements over time. Logs are event records. The three complement each other, but traces are the most powerful for debugging distributed systems.

Traces and spans

Two concepts sit at the core of tracing.

A trace is the complete journey of a single request through your system. You don’t create traces directly — a trace is simply a group of spans that share the same Trace ID.

A span is one operation within that journey — a database query, an HTTP call, a queue publish. Each span has a unique Span ID, a Trace ID, a Parent Span ID (linking it to whoever called it), timestamps, and attributes.

One way to think about it: spans are fancy logs. Or flipped around: logs are boring traces.
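
To make the relationship concrete, here is a stripped-down model of the two concepts. These are illustrative toy types, not the real OTEL SDK classes:

```python
from dataclasses import dataclass
from typing import List, Optional

# Toy model for illustration only -- not the actual OTEL SDK types.
@dataclass
class Span:
    trace_id: str                   # shared by every span in the same trace
    span_id: str                    # unique to this operation
    parent_span_id: Optional[str]   # links the span to its caller
    name: str
    start_ns: int
    end_ns: int

    @property
    def duration_ms(self) -> float:
        return (self.end_ns - self.start_ns) / 1_000_000

# A "trace" is nothing more than the set of spans sharing one Trace ID.
def spans_for_trace(spans: List[Span], trace_id: str) -> List[Span]:
    return [s for s in spans if s.trace_id == trace_id]

spans = [
    Span("abc", "s1", None, "GET /api/orders", 0, 900_000_000),
    Span("abc", "s2", "s1", "SELECT orders", 50_000_000, 850_000_000),
    Span("xyz", "s9", None, "GET /health", 0, 5_000_000),
]
print([s.name for s in spans_for_trace(spans, "abc")])
# ['GET /api/orders', 'SELECT orders']
```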

Context propagation

How does the trace ID travel between services? Through context propagation.

When Service A calls Service B, it passes a traceparent header. This is defined by the W3C Trace Context spec. The header carries the Trace ID, the parent Span ID, and sampling flags. Service B reads it, creates its own span as a child, and passes the header forward to any service it calls.
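
The header itself is compact: four dash-separated fields. A sketch of the layout, using the example trace ID from the W3C spec:

```python
# W3C `traceparent` layout: version-traceid-parentid-flags
#   "00" - 32 hex chars (Trace ID) - 16 hex chars (parent Span ID) - 2 hex flags
def parse_traceparent(header: str) -> dict:
    version, trace_id, parent_span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": int(flags, 16) & 0x01 == 1,  # bit 0 is the sampled flag
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"])   # 4bf92f3577b34da6a3ce929d0e0e4736
print(ctx["sampled"])    # True
```

In practice you never parse this by hand — the SDK's propagators do it — but knowing the layout makes "the trace breaks here" failures much easier to diagnose in a proxy log or a request dump.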

Watch out for Baggage. Baggage attaches arbitrary key-value data to a trace context and propagates to every downstream service — including third-party APIs. Never put sensitive data in baggage.

What’s inside a span

Each span contains:

  • Span ID and Parent Span ID — where it sits in the call tree
  • Operation name — what it represents, e.g. GET /api/orders
  • Timestamps — start time and duration
  • Attributes — searchable key-value metadata
  • Events — timestamped log entries attached to the span
  • Status — OK, ERROR, or UNSET

Attributes deliver most of the searchable value. HTTP calls automatically carry method, status code, URL, and route. Database calls carry the system, statement, and database name. Messaging carries the queue system and destination. Add custom attributes for business context — user ID, order ID, tenant name — and you can search across your entire system: show me all spans where HTTP status code is 500.
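
That kind of query can be pictured as a search over span attributes. A toy in-memory version (the span data here is made up, but real backends index attributes the same way conceptually):

```python
# Made-up span data illustrating attribute search.
spans = [
    {"name": "GET /api/orders",
     "attributes": {"http.method": "GET", "http.status_code": 200}},
    {"name": "POST /api/payments",
     "attributes": {"http.method": "POST", "http.status_code": 500, "user.id": "u-7"}},
    {"name": "SELECT orders",
     "attributes": {"db.system": "postgresql"}},
]

def find_spans(spans, criteria):
    """Return spans whose attributes match every key-value pair in criteria."""
    return [s for s in spans
            if all(s["attributes"].get(k) == v for k, v in criteria.items())]

failed = find_spans(spans, {"http.status_code": 500})
print([s["name"] for s in failed])   # ['POST /api/payments']
```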

Reading the waterfall

Every tracing backend visualises traces as a waterfall (sometimes called a flame graph). Each horizontal bar is one span. Nested bars show parent-child relationships. Bar width shows duration. Different colors represent different services.

You see immediately where time goes and what calls what. A database call taking 800ms of a 900ms request is the widest bar — obvious at a glance. An N+1 query shows up as a row of identical small bars repeating in a loop.
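
A toy renderer makes the idea concrete. Given span names, start offsets, durations, and nesting depth (all made-up numbers), the bottleneck is simply the widest run of #:

```python
def waterfall(spans, total_ms, width=40):
    """Render (name, start_ms, duration_ms, depth) tuples as ASCII bars."""
    lines = []
    for name, start, dur, depth in spans:
        indent = "  " * depth                          # nesting = parent-child
        offset = " " * int(start / total_ms * width)   # horizontal position
        bar = "#" * max(int(dur / total_ms * width), 1)  # width = duration
        lines.append((indent + name).ljust(26) + offset + bar)
    return "\n".join(lines)

spans = [
    ("GET /api/orders", 0, 900, 0),
    ("auth middleware", 5, 40, 1),
    ("SELECT orders", 60, 800, 1),   # the wide bar -- the obvious bottleneck
]
print(waterfall(spans, total_ms=900))
```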

The architecture

Three components work together:

OTEL SDK or Agent — instruments your application, creating spans automatically for HTTP, database, messaging, and gRPC calls.

OTEL Collector — aggregates, processes, and routes telemetry data. Your services send to the Collector; the Collector sends to your backend.

Backend — stores and visualises traces. Jaeger, Datadog, Grafana Tempo, AWS X-Ray, Honeycomb — your choice.

The Collector is the piece worth understanding. It centralises configuration so API keys live in one place. It redacts sensitive data before it leaves your network. It fans out to multiple backends simultaneously. And if your backend is slow, it buffers rather than dropping spans.
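
Those responsibilities map directly onto the Collector's config file. A sketch of what that looks like — the hostnames, the `otlp/jaeger` and `otlp/vendor` exporter names, and the redacted attribute key are all placeholders:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}                    # buffer and batch spans before export
  attributes/redact:
    actions:
      - key: user.email        # strip PII before it leaves your network
        action: delete

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317      # placeholder hostname
    tls:
      insecure: true
  otlp/vendor:
    endpoint: vendor-backend:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/redact, batch]
      exporters: [otlp/jaeger, otlp/vendor]   # fan-out to both backends
```

The pipeline wiring at the bottom is the part to internalise: spans flow receiver → processors → exporters, and fanning out to a second backend is one extra entry in the list.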

Language support

OTEL has mature libraries for every major language. Most auto-instrument HTTP, databases, queues, and gRPC with zero code changes:

  • Java, Scala, JVM — auto-instrumentation via a Java agent. Attach the agent at startup, done. HTTP clients, JDBC, Kafka, RabbitMQ, gRPC, Akka, Play all covered.
  • Python — wrap your app with opentelemetry-instrument. Covers Django, Flask, FastAPI, requests, SQLAlchemy, Celery.
  • JavaScript / TypeScript — Express, Fastify, NestJS, pg, Redis auto-instrumented.
  • Go — contrib packages for net/http, gRPC, database/sql.
  • .NET — ASP.NET Core, HttpClient, SQL Client.

Check the OTEL Registry for the full list and your specific framework.

Getting started

Three steps:

  1. Check if tracing infrastructure exists. Is there an OTEL Collector running? Is there a backend like Jaeger or Datadog to send to? If not, spin up Jaeger locally — it’s a single Docker container.

  2. Add the SDK or agent. For JVM languages this means adding a -javaagent flag. For Python, pip install the distro and exporter then wrap with opentelemetry-instrument. For others, follow the library’s quickstart.

  3. Configure via environment variables. These are standard across all languages:

OTEL_SERVICE_NAME=my-service
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1

That last variable sets 10% sampling — more on why that matters below.

Common pitfalls

Missing context propagation. If any service in the chain doesn’t forward the traceparent header, the trace breaks there. Make sure you’re using OTEL-instrumented HTTP clients, not raw ones that strip unknown headers.

100% sampling in production. Tracing has real overhead. There are two sampling strategies: head sampling decides at request start (efficient but has limited info), and tail sampling decides after the trace is complete (can selectively keep errors and slow requests). Start at 10% and tune from there.
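
Head sampling at a fixed ratio can be made deterministic on the trace ID itself, which is roughly how the parentbased_traceidratio sampler keeps the keep-or-drop decision consistent across every service in a trace. A sketch of the idea (not the exact spec algorithm):

```python
import random

def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic decision: compare the trace ID's lower 64 bits to a threshold."""
    lower64 = int(trace_id_hex[16:], 16)   # last 16 hex chars = lower 8 bytes
    return lower64 < ratio * (1 << 64)

# Every service evaluating the same trace ID reaches the same verdict,
# so traces are kept or dropped whole rather than in fragments.
random.seed(7)
ids = [f"{random.getrandbits(128):032x}" for _ in range(10_000)]
kept = sum(should_sample(t, 0.10) for t in ids)
print(kept)   # close to 1,000 of 10,000
```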

Meaningless span names. Don’t call your spans span-1 or doStuff. Use descriptive names like HTTP GET /api/orders or process-payment. Meaningless names make traces useless for debugging.

Sensitive data in attributes and baggage. Baggage propagates to every downstream service, including third-party APIs. Don’t put PII, SSNs, or secrets in baggage.

Losing async context. In threaded or async code, spans won’t automatically connect across thread boundaries. Use your language’s context propagation APIs explicitly.
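
The Python SDK tracks the current span with the stdlib contextvars mechanism, which is also why plain threads lose it: a new thread starts with an empty context. A minimal demonstration using contextvars alone, no OTEL required (the span value here is a stand-in string):

```python
import contextvars
import threading

# Stand-in for the SDK's "current span" slot.
current_span = contextvars.ContextVar("current_span", default=None)
results = {}

def work(label):
    results[label] = current_span.get()

current_span.set("span-abc123")

# Plain thread: starts with an empty context -- the span is lost.
t1 = threading.Thread(target=work, args=("plain",))
t1.start(); t1.join()

# Explicitly carried context: the span survives the thread boundary.
ctx = contextvars.copy_context()
t2 = threading.Thread(target=lambda: ctx.run(work, "carried"))
t2.start(); t2.join()

print(results["plain"])    # None -- a broken trace
print(results["carried"])  # span-abc123
```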

Not setting OTEL_SERVICE_NAME. Without it, all your spans appear as unknown_service. Always set it.

Ignoring collector health. A backed-up or crashed collector silently drops spans. Monitor the otelcol_exporter_send_failed_spans metric.

What you can answer with traces

Once tracing is running, these questions have answers:

  • Why is this specific request slow — and which operation is responsible?
  • Is my service slow, or is the slowness coming from something it calls?
  • Why did this request fail — which downstream call threw the error?
  • What does our actual service communication graph look like at runtime?
  • Is that Redis cache actually reducing database queries?
  • What is our p99 latency, broken down by operation type?

The payoff: production issues debugged in minutes, not hours. Pull up the trace, and the waterfall points straight at the problem.