A few months back I gave an introduction to OpenTelemetry for an engineering team. Most of it is general enough to be useful to anyone, so here it is — stripped of anything context-specific, for whoever searches for “what the hell is OTEL.”
OpenTelemetry (OTEL) is an open-source observability framework: a vendor-neutral standard for instrumenting applications to emit traces, metrics, and logs. It is a CNCF incubating project, backed by every major cloud provider and observability vendor, and the de facto industry standard for distributed tracing.
The deck below covers the core concepts, the architecture, common pitfalls, and how to get started. Use the arrows to click through. Below the slides are links to go deeper, and a full written walkthrough for those who prefer reading.
Resources
- What Is This OpenTelemetry Thing? — Martin Thwaites, GOTO 2024 — Start here. 46 minutes, 133k views. The clearest, most practical intro to OTEL I’ve come across. Strongly recommended.
- opentelemetry.io/docs — official docs; the concepts section is worth reading top to bottom
- OTEL Registry — instrumentation libraries by language and framework
- Jaeger — the easiest open-source backend to get started with
- Grafana Tempo — good choice if you’re already on the Grafana stack
- OpenTelemetry on GitHub — specification, SDKs, the collector
The full walkthrough
The problem with logs alone
Picture a common scenario: a user reports a slow request. You check the logs. Service A says “request received, request completed.” Service B says “processing started, done.” Service C says “query executed.”
But which service was slow? How long did each step actually take? You don’t know.
Traditional logging breaks down in distributed systems. Logs scatter across services with no correlation between them. They record events but not durations. When thousands of requests run concurrently, there is no way to tell which log line belongs to which request.
Distributed tracing solves this by following a single request through all your services.
What is OpenTelemetry?
OpenTelemetry, or OTEL, is an open-source observability framework. It works with any backend — Jaeger, Datadog, Grafana Tempo, AWS X-Ray. It is a CNCF incubating project, one of the foundation's most active projects after Kubernetes, born from the merger of OpenTracing and OpenCensus in 2019.
AWS, Google, Microsoft, Datadog, and Grafana all support it. As Martin Thwaites puts it: nobody gets fired for suggesting they move their telemetry to OpenTelemetry.
The key benefit is portability. Switch backends with a config change, not a code change.
OpenTelemetry handles three signals: traces, metrics, and logs. Traces follow a request’s journey. Metrics are measurements over time. Logs are event records. The three complement each other, but traces are the most powerful for debugging distributed systems.
Traces and spans
Two concepts sit at the core of tracing.
A trace is the complete journey of a single request through your system. You don’t create traces directly — a trace is simply a group of spans that share the same Trace ID.
A span is one operation within that journey — a database query, an HTTP call, a queue publish. Each span has a unique Span ID, a Trace ID, a Parent Span ID (linking it to whoever called it), timestamps, and attributes.
One way to think about it: spans are fancy logs. Or flipped around: logs are boring traces.
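To make that concrete, here is a minimal sketch of the structure a span carries, in plain Python with no OTEL SDK. The field names mirror the concepts above, but the class itself is purely illustrative:

```python
import time
import secrets
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """Illustrative only: the fields every span carries."""
    name: str
    trace_id: str                       # shared by every span in one trace
    parent_span_id: Optional[str]       # links the span to whoever called it
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    start_time: float = field(default_factory=time.time)
    end_time: Optional[float] = None
    attributes: dict = field(default_factory=dict)

# Service A starts the trace with a root span (no parent)...
root = Span(name="GET /api/orders",
            trace_id=secrets.token_hex(16),
            parent_span_id=None)

# ...and the database call inside it becomes a child span:
# same trace ID, parent pointing at the root.
child = Span(name="SELECT orders",
             trace_id=root.trace_id,
             parent_span_id=root.span_id)
```

A log line, by comparison, is roughly just `name` plus a timestamp: no IDs to group by, no parent, no duration. That is the whole "fancy logs" point.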
Context propagation
How does the trace ID travel between services? Through context propagation.
When Service A calls Service B, it passes a traceparent header. This is defined by the W3C Trace Context spec. The header carries the Trace ID, the parent Span ID, and sampling flags. Service B reads it, creates its own span as a child, and passes the header forward to any service it calls.
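Concretely, the header is one line with four dash-separated fields: a version, the 16-byte trace ID, the 8-byte parent span ID, and the flags. A small sketch of building and parsing one (the field layout follows the W3C Trace Context spec; the IDs are randomly generated for illustration):

```python
import re
import secrets

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    # version "00" - 32 hex chars of trace id - 16 hex chars of span id - flags
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str):
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        raise ValueError("malformed traceparent")
    trace_id, parent_span_id, flags = m.groups()
    return trace_id, parent_span_id, flags == "01"

# Service A sends:
header = make_traceparent(secrets.token_hex(16), secrets.token_hex(8))

# Service B reads it, keeps the trace ID, and forwards a new header
# with its own span as the parent for anything further downstream.
trace_id, parent_span_id, sampled = parse_traceparent(header)
my_span_id = secrets.token_hex(8)
outgoing = make_traceparent(trace_id, my_span_id, sampled)
```

In practice the OTEL SDK's instrumented HTTP clients do this for you; the sketch is just what "passes the header forward" means on the wire.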
Watch out for Baggage. Baggage attaches arbitrary key-value data to a trace context and propagates to every downstream service — including third-party APIs. Never put sensitive data in baggage.
What’s inside a span
Each span contains:
- Span ID and Parent Span ID — where it sits in the call tree
- Operation name — what it represents, e.g. GET /api/orders
- Timestamps — start time and duration
- Attributes — searchable key-value metadata
- Events — timestamped log entries attached to the span
- Status — OK, ERROR, or UNSET
Attributes deliver most of the searchable value. HTTP calls automatically carry method, status code, URL, and route. Database calls carry the system, statement, and database name. Messaging carries the queue system and destination. Add custom attributes for business context — user ID, order ID, tenant name — and you can search across your entire system: show me all spans where HTTP status code is 500.
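That cross-system search is, at its core, just attribute filtering. A toy sketch over hypothetical exported spans (the attribute keys follow OTEL semantic conventions; real backends index exactly these keys):

```python
# Hypothetical spans as a backend might store them.
spans = [
    {"name": "GET /api/orders", "attributes": {"http.status_code": 200, "user.id": "u1"}},
    {"name": "GET /api/orders", "attributes": {"http.status_code": 500, "user.id": "u1"}},
    {"name": "SELECT orders",   "attributes": {"db.system": "postgresql"}},
    {"name": "POST /api/pay",   "attributes": {"http.status_code": 500, "user.id": "u2"}},
]

def find(spans, wanted):
    """Return spans whose attributes contain every wanted key/value pair."""
    return [s for s in spans
            if all(s["attributes"].get(k) == v for k, v in wanted.items())]

# "Show me all spans where HTTP status code is 500" — across every service.
errors = find(spans, {"http.status_code": 500})

# Custom business attributes narrow it further: one user's failures.
u2_errors = find(spans, {"http.status_code": 500, "user.id": "u2"})
```

The custom attributes are what make the second query possible: without user.id on the span, "which requests failed for this user" has no answer.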
Reading the waterfall
Every tracing backend visualises traces as a waterfall (sometimes called a flame graph). Each horizontal bar is one span. Nested bars show parent-child relationships. Bar width shows duration. Different colors represent different services.
You see immediately where time goes and what calls what. A database call taking 800ms of a 900ms request is the widest bar — obvious at a glance. An N+1 query shows up as a row of identical small bars repeating in a loop.
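The same reading can be done programmatically: given the child spans of one trace, the widest bar and the repeated-bar pattern fall out of a few lines (toy span data, durations in milliseconds, and the threshold for "suspiciously repeated" is an arbitrary choice here):

```python
from collections import Counter

# Hypothetical child spans of one 900ms request: (name, duration_ms).
children = [
    ("SELECT order WHERE id = ?", 80),   # same query, once per item...
    ("SELECT order WHERE id = ?", 82),
    ("SELECT order WHERE id = ?", 79),
    ("SELECT order WHERE id = ?", 85),
    ("render response", 40),
]

# Widest bar: the single operation where most time went.
slowest = max(children, key=lambda span: span[1])

# N+1 smell: the same operation name repeating many times in one trace.
counts = Counter(name for name, _ in children)
repeated = {name: n for name, n in counts.items() if n > 3}
```

A waterfall view is doing exactly this, visually, for every trace you open.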
The architecture
Three components work together:
OTEL SDK or Agent — instruments your application, creating spans automatically for HTTP, database, messaging, and gRPC calls.
OTEL Collector — aggregates, processes, and routes telemetry data. Your services send to the Collector; the Collector sends to your backend.
Backend — stores and visualises traces. Jaeger, Datadog, Grafana Tempo, AWS X-Ray, Honeycomb — your choice.
The Collector is the piece worth understanding. It centralises configuration so API keys live in one place. It redacts sensitive data before it leaves your network. It fans out to multiple backends simultaneously. And if your backend is slow, it buffers rather than dropping spans.
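Those four jobs map directly onto a collector pipeline definition. A minimal illustrative config, where the endpoints, the vendor exporter, and the redacted attribute key are all placeholders:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}                  # buffer and batch; absorbs a slow backend
  attributes:
    actions:
      - key: user.email      # redact before anything leaves your network
        action: delete

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
  otlphttp/vendor:
    endpoint: https://api.example-vendor.com
    headers:
      api-key: ${env:VENDOR_API_KEY}   # API keys live here, in one place

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [otlp/jaeger, otlphttp/vendor]   # fan out to both at once
```

Services only ever know the collector's address; everything backend-specific stays in this one file.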
Language support
OTEL has mature libraries for every major language. Most auto-instrument HTTP, databases, queues, and gRPC with zero code changes:
- Java, Scala, JVM — auto-instrumentation via a Java agent. Attach the agent at startup, done. HTTP clients, JDBC, Kafka, RabbitMQ, gRPC, Akka, Play all covered.
- Python — wrap your app with opentelemetry-instrument. Covers Django, Flask, FastAPI, requests, SQLAlchemy, Celery.
- JavaScript / TypeScript — Express, Fastify, NestJS, pg, Redis auto-instrumented.
- Go — contrib packages for net/http, gRPC, database/sql.
- .NET — ASP.NET Core, HttpClient, SQL Client.
Check the OTEL Registry for the full list and your specific framework.
Getting started
Three steps:
1. Check if tracing infrastructure exists. Is there an OTEL Collector running? Is there a backend like Jaeger or Datadog to send to? If not, spin up Jaeger locally — it’s a single Docker container.
2. Add the SDK or agent. For JVM languages this means adding a -javaagent flag. For Python, pip install the distro and exporter, then wrap with opentelemetry-instrument. For others, follow the library’s quickstart.
3. Configure via environment variables. These are standard across all languages:
OTEL_SERVICE_NAME=my-service
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1
That last variable sets 10% sampling — more on why that matters below.
Common pitfalls
Missing context propagation. If any service in the chain doesn’t forward the traceparent header, the trace breaks there. Make sure you’re using OTEL-instrumented HTTP clients, not raw ones that strip unknown headers.
100% sampling in production. Tracing has real overhead. There are two sampling strategies: head sampling decides at request start (efficient but has limited info), and tail sampling decides after the trace is complete (can selectively keep errors and slow requests). Start at 10% and tune from there.
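Why a trace-ID-ratio sampler keeps traces intact: the keep/drop decision is derived deterministically from the trace ID, so every service that sees the same ID makes the same call and you never get half a trace. A simplified sketch of the idea (not the SDK's exact algorithm):

```python
import secrets

def keep_trace(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic head sampling: same trace ID -> same decision everywhere."""
    # Treat the 128-bit trace ID as a number and keep the lowest `ratio` slice.
    return int(trace_id_hex, 16) < ratio * (1 << 128)

# Over many random trace IDs, roughly `ratio` of them are kept.
ids = [secrets.token_hex(16) for _ in range(10_000)]
kept = sum(keep_trace(t, 0.1) for t in ids)
```

The "parentbased" wrapper in the sampler name adds one more rule on top: if an incoming traceparent header already says the trace is sampled, honor that instead of re-deciding.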
Meaningless span names. Don’t call your spans span-1 or doStuff. Use descriptive names like HTTP GET /api/orders or process-payment. Meaningless names make traces useless for debugging.
Sensitive data in attributes and baggage. Baggage propagates to every downstream service, including third-party APIs. Don’t put PII, SSNs, or secrets in baggage.
Losing async context. In threaded or async code, spans won’t automatically connect across thread boundaries. Use your language’s context propagation APIs explicitly.
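The failure mode is easy to demonstrate with the stdlib contextvars module, which is also what the OTEL Python SDK builds its context on (simplified here; the real API lives in opentelemetry.context):

```python
import contextvars
import threading

current_span = contextvars.ContextVar("current_span", default="<no active span>")
current_span.set("span-A")

seen = {}

def handler(label):
    seen[label] = current_span.get()

# A bare thread starts with a fresh, empty context: the active span is lost.
t1 = threading.Thread(target=handler, args=("naive",))
t1.start(); t1.join()

# Explicitly copying the caller's context across the boundary keeps it.
ctx = contextvars.copy_context()
t2 = threading.Thread(target=lambda: ctx.run(handler, "propagated"))
t2.start(); t2.join()
```

The same break happens with thread pools and fire-and-forget tasks; the fix is always the same: carry the context across the boundary explicitly.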
Not setting OTEL_SERVICE_NAME. Without it, all your spans appear as unknown_service. Always set it.
Ignoring collector health. A backed-up or crashed collector silently drops spans. Monitor the otelcol_exporter_send_failed_spans metric.
What you can answer with traces
Once tracing is running, these questions have answers:
- Why is this specific request slow — and which operation is responsible?
- Is my service slow, or is the slowness coming from something it calls?
- Why did this request fail — which downstream call threw the error?
- What does our actual service communication graph look like at runtime?
- Is that Redis cache actually reducing database queries?
- What is our p99 latency, broken down by operation type?
The payoff: production issues debugged in minutes, not hours. Pull up the trace, and the waterfall points straight at the problem.