
OpenTelemetry in 2026 is the default way to instrument a backend: one SDK, one wire protocol (OTLP), and any vendor on the receiving end. If you start a new service today, you instrument with OTel SDKs, ship through an OpenTelemetry Collector, and pick a backend (Honeycomb, Grafana Cloud, Datadog, or self-hosted Tempo + Prometheus + Loki) based on cost and DX, not lock-in.
That sentence is the whole playbook. The rest of this post is the working code, the Collector config you actually need, the sampling strategy that keeps your bill under control, and an honest cost comparison across the three backends most teams pick from.
Three things shifted in the last two years.
First, OpenTelemetry is now a graduated CNCF project, and the three core signals (traces, metrics, logs) are stable across every major SDK. Continuous profiling, the fourth signal, hit public alpha in March 2026, with GA tracking for late 2026. That makes OTel the first open standard to unify all four observability pillars under a single wire protocol.
Second, every serious vendor speaks OTLP natively. Datadog, New Relic, Honeycomb, Grafana Cloud, Splunk, Dynatrace, Azure Monitor, and Google Cloud Operations all accept OTLP without a translation shim. That kills the "rip out the agent to switch vendors" project that used to cost six months and a contractor.
Third, eBPF instrumentation matured. The OpenTelemetry eBPF Instrumentation project (OBI, the successor to Grafana Beyla) hit beta at KubeCon EU 2026. For Go, Node, Python, Ruby, .NET, and JVM workloads on Linux, you can get HTTP and gRPC traces with zero code changes, just by deploying a DaemonSet.
Most teams reach for a vendor SDK first. They install dd-trace, newrelic, or the Sentry tracing SDK, ship it, and call observability done. It works for a quarter. Then one of three things happens.
The bill triples after a viral launch. They want to switch from Datadog to Grafana Cloud to cut cost, and discover every service has vendor-specific instrumentation calls baked into business logic. The migration estimate comes back at 8 to 12 weeks of senior engineering time.
Or they want to add a second backend (e.g., keep Datadog for ops, add Honeycomb for product analytics). They cannot without dual-instrumenting and doubling the agent footprint.
Or they need to enrich every span with a tenant ID and route high-cardinality logs to a cheap bucket. The vendor SDK does not let them do that without a custom processor that they would also have to maintain.
Vendor SDKs are fine for prototypes. For anything you expect to keep running in 18 months, you want OTel under the hood and the vendor SDK only as a thin compatibility layer (or removed entirely).
Every major language has zero-code or near-zero-code instrumentation in 2026. Use it. You get HTTP servers, HTTP clients, database drivers, message queues, and most popular frameworks for free.
Node (Express, Fastify, Next.js API routes):
```bash
npm install @opentelemetry/api \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/sdk-node
```
```js
// otel.js (load with -r ./otel.js or as the first import)
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { resourceFromAttributes } = require('@opentelemetry/resources');
const { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: resourceFromAttributes({
    [ATTR_SERVICE_NAME]: 'checkout-api',
    [ATTR_SERVICE_VERSION]: process.env.GIT_SHA || 'dev',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```
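Then start the service with the SDK loaded before anything else (the `server.js` entrypoint is an assumption; use your own):

```bash
node -r ./otel.js server.js
```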
Python (Flask, FastAPI, Django):
```bash
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run any Python app with auto-instrumentation
opentelemetry-instrument \
  --service_name=checkout-api \
  --exporter_otlp_endpoint=http://localhost:4318 \
  python app.py
```
That is it. You get traces for requests, psycopg, sqlalchemy, redis, boto3, kafka, and 60+ other libraries without touching a single line of business code. The same approach works in a Next.js 15 app using Server Actions, where you want to trace each action end-to-end.
Auto-instrumentation gives you the skeleton. Add manual spans where you make a meaningful business decision: "did the checkout succeed?", "which payment method did the user pick?", "was the cache hit?".
```js
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('checkout');

async function chargeCard(order) {
  return tracer.startActiveSpan('charge_card', async (span) => {
    span.setAttribute('order.id', order.id);
    span.setAttribute('order.amount_cents', order.amountCents);
    span.setAttribute('payment.method', order.paymentMethod);
    try {
      const result = await stripe.charges.create({ ... });
      span.setAttribute('payment.outcome', result.status);
      return result;
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}
```
The trick is restraint. Five well-placed manual spans per service beat fifty noisy ones. Each manual span is also a query target on Honeycomb or a dashboard widget on Grafana, so treat them like a small ontology you maintain on purpose.
Direct-to-vendor exporters are tempting and almost always wrong at scale. Run an OpenTelemetry Collector. Two deployment patterns cover most cases.
Agent (DaemonSet on each node, or sidecar per pod): local, low-latency, batches data before shipping. Good for capturing pod-level resource attributes.
Gateway (a small fleet of Collector pods behind a Service): centralized place to apply tail sampling, redaction, routing, and rate-limiting before data hits the wire to your vendor.
In production you usually run both: agents on each node forward to a gateway pool, the gateway makes the expensive decisions and ships to backends.
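The agent side stays thin. Here is a minimal sketch of a node-local agent config that stamps pod-level attributes and forwards everything to the gateway pool (the gateway Service name and namespace are assumptions):

```yaml
# otel-agent.yaml (DaemonSet)
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }

processors:
  k8sattributes: {}   # stamp k8s.pod.name and friends while the agent still knows them
  batch: {}

exporters:
  otlp/gateway:
    endpoint: otel-gateway.observability.svc.cluster.local:4317  # assumed gateway Service
    tls: { insecure: true }  # in-cluster hop; add TLS if your mesh does not provide it

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlp/gateway]
```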
A minimal gateway config that routes the same data to two backends and tail-samples on errors:
```yaml
# otel-collector.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }

processors:
  batch:
    send_batch_size: 8192
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 15
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 5000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
  attributes/redact:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: user.email
        action: hash

exporters:
  otlp/honeycomb:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}
  otlphttp/grafana:
    endpoint: https://otlp-gateway-prod-us-central-0.grafana.net/otlp
    headers:
      Authorization: "Basic ${env:GRAFANA_OTLP_TOKEN}"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes/redact, tail_sampling, batch]
      exporters: [otlp/honeycomb, otlphttp/grafana]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/grafana]
```
That config also demonstrates the vendor-neutrality reality: one pipeline, two destinations. Swapping Datadog in or Honeycomb out is a 4-line change.
Sampling is where the bill is decided. Two real strategies work in 2026.
Head sampling (the SDK decides as each root span starts): cheap, predictable, kills cardinality early. Fine if you serve well under 1,000 traces per second and your traffic is uniform. Set `OTEL_TRACES_SAMPLER=parentbased_traceidratio` and `OTEL_TRACES_SAMPLER_ARG=0.1` for 10%.
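If you would rather wire the sampler in code than through env vars, here is a minimal sketch against the Node SDK from earlier (the sampler classes come from `@opentelemetry/sdk-trace-base`, which ships as a dependency of `sdk-node`):

```js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

const sdk = new NodeSDK({
  // ...same resource, exporter, and instrumentations as otel.js above...
  sampler: new ParentBasedSampler({
    // Roll the dice once per trace; child spans inherit the root's decision
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});
```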
Tail sampling (decide in the Collector after the full trace assembles): expensive but you can keep 100% of errors and latency outliers while sampling 5% of healthy baseline traffic. Mandatory above ~5,000 traces per second.
A useful default for a B2B SaaS: keep all 5xx responses, all spans over 1 second, all spans tagged tenant.tier=enterprise, plus 5% of everything else. That config shows up in the YAML above. Pair this with API rate limiting so a single misbehaving tenant cannot blow up your trace budget overnight.
In 2026, OTLP for logs is stable and every major vendor accepts it natively. Stop running a separate Fluent Bit pipeline for logs and an OTel pipeline for traces. One Collector, three signal types, one place to set retention and cost policy.
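Against the gateway YAML above, that is one more pipeline reusing the processors and exporters already defined (routing logs to Grafana here is illustrative; point them wherever retention is cheapest):

```yaml
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, attributes/redact, batch]
      exporters: [otlphttp/grafana]
```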
For metrics, the OTel SDK ships Prometheus-compatible exporters and OTLP exporters out of the box. Push to Grafana Cloud Mimir or scrape with a Prometheus that has the OTLP receive endpoint enabled. Either pattern works; OTLP push tends to be cheaper at scale because you skip the scrape interval polling.
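A minimal sketch of the push side in the Node SDK from earlier (endpoint and interval are illustrative; note that newer `sdk-node` versions take a `metricReaders` array instead of the singular option):

```js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-http');

const sdk = new NodeSDK({
  // ...trace config as before...
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://localhost:4318/v1/metrics', // your Collector's OTLP/HTTP endpoint
    }),
    exportIntervalMillis: 15000, // push every 15s instead of waiting to be scraped
  }),
});
```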
The profiling signal hit public alpha in March 2026 with GA targeted for late 2026. The eBPF profiler agent runs as a DaemonSet, captures CPU stack traces with under 1% overhead, and (this is the bit that matters) stamps each profile sample with the same trace_id and span_id as the request that triggered it.
That correlation is new. You can click from a slow span in Honeycomb directly into the flamegraph that shows which Go function in which goroutine ate the 800ms. That is a workflow no APM vendor offered before 2026 outside their walled garden.
Do not put alpha profiling into a regulated workload yet. Do put it on staging and one non-critical production service so the team builds muscle memory before GA.
Here is what "vendor-neutral" actually buys you in 2026, with all three vendors accepting OTLP natively.
| Backend | Strength | Weakness | List price per 100 GB of trace ingest |
|---|---|---|---|
| Datadog (OTLP-native) | Best out-of-box dashboards, mature alerting, huge integration catalog | Most expensive at scale; per-host pricing on top of ingest | $200-400 |
| Honeycomb | Best query model for high-cardinality debugging; BubbleUp is unmatched | Smaller integration catalog, no log product (yet) | $100-150 |
| Grafana Cloud (Tempo + Mimir + Loki) | Cheapest at high volume; you can self-host the same OSS stack later | Most operational complexity; dashboards take taste to build | $50-90 |
| Self-hosted (Tempo + Mimir + Loki on your own k8s) | Effectively zero per-GB cost above hardware | You are now running an observability platform team | Hardware + 0.5 SRE FTE |
The switch itself is a Collector config change, not a code change. We have seen teams move from Datadog to Grafana Cloud in an afternoon: add the Grafana OTLP exporter block, append it to each pipeline's exporters list, run both backends in parallel for a day to validate, then delete the Datadog exporter.
Compare that to the pre-OTel reality of ripping dd-trace calls out of every service. The afternoon is real.
Cardinality explosions on attributes. Every distinct value of every attribute is a tag dimension. Setting user.id as a span attribute on a 10,000-user product is fine on Honeycomb, expensive on Datadog, and catastrophic on Prometheus. Read your vendor's cardinality docs before you ship.
Forgetting context propagation across async boundaries. Auto-instrumentation handles HTTP and gRPC. Background jobs, Kafka consumers, and Lambda invocations need explicit context propagation. Symptom: orphan spans that look like they came from nowhere.
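A minimal sketch of explicit propagation across a Kafka hop (names like `process_order` are illustrative; the header handling assumes kafkajs-style Buffer values):

```js
const { context, propagation, trace } = require('@opentelemetry/api');

const tracer = trace.getTracer('orders-consumer');

// Producer side: inject the active trace context into message headers.
function withTraceHeaders(payload) {
  const headers = {};
  propagation.inject(context.active(), headers);
  return { value: JSON.stringify(payload), headers };
}

// Consumer side: rebuild the context from headers, then start the
// processing span as a child of the producer's span.
async function handleMessage(message) {
  // kafkajs delivers header values as Buffers; the default getter expects strings
  const carrier = Object.fromEntries(
    Object.entries(message.headers || {}).map(([k, v]) => [k, v?.toString()])
  );
  const parentCtx = propagation.extract(context.active(), carrier);
  return tracer.startActiveSpan('process_order', {}, parentCtx, async (span) => {
    try {
      // ... business logic ...
    } finally {
      span.end();
    }
  });
}
```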
Sampling at the SDK and the Collector simultaneously. If your SDK samples 10% and your Collector tail-samples 5%, you keep 0.5% of traffic, not 5%. Pick one place to make the decision. Usually the Collector.
Running the Collector under-resourced. Tail sampling holds full traces in memory until the decision window closes. A gateway pool processing 5,000 traces per second wants at least 4 GB of RAM per pod and a memory_limiter processor in front of everything else.
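On Kubernetes, that guidance translates to explicit requests and limits on the gateway pods (a sketch; the CPU figure is an assumption, size it to your trace rate):

```yaml
# gateway Deployment, container spec excerpt
resources:
  requests:
    memory: 4Gi
    cpu: "1"
  limits:
    memory: 4Gi   # hard cap so memory_limiter sheds load before the OOM killer does
```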
Skipping resource attributes. Without service.name, service.version, deployment.environment, and k8s.pod.name, your spans are anonymous. Most vendors will silently bucket them into a "default" service that nobody looks at.
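The cheapest fix is the standard resource env vars, which every SDK reads (values are illustrative; `k8s.pod.name` is better stamped by the agent's `k8sattributes` processor shown earlier):

```bash
export OTEL_SERVICE_NAME=checkout-api
export OTEL_RESOURCE_ATTRIBUTES="service.version=${GIT_SHA},deployment.environment=production"
```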
If you are two founders pre-revenue running one Vercel project and a Supabase database, you do not need the OTel Collector. Use the built-in Vercel observability and Supabase logs. Revisit this post the week you hire your third engineer or cross 100 paying customers, whichever comes first.
If you have one Rails monolith and a Heroku Postgres, the Heroku-native log drains plus a single Honeycomb exporter from the Rails OTel SDK are enough. Skip the Collector until you have a second service. The same logic applies when picking the boring stack your hires already know: observability complexity should track team size, not ambition.
OpenTelemetry rollouts are a classic senior-engineer week of work. A senior on Cadence ($1,500/week) typically lands an end-to-end OTel setup (SDK in 2 to 4 services, Collector deployed, one backend wired, baseline dashboards, redaction policies for PII) in 4 to 6 working days, then a follow-on week to tune sampling once real traffic hits the gateway.
Every engineer on Cadence is AI-native by baseline, vetted on Cursor and Claude Code fluency before they unlock bookings, which matters here because most of the OTel grind is YAML, exporter config, and reading three vendor docs in parallel. That is exactly the work where a fluent prompt-as-spec engineer ships in two days what would take a non-AI-native engineer a week.
If you want a sanity check on whether your current observability stack is the right call, you can audit your stack with the Ship-or-Skip tool before booking anyone.
Ready to wire OTel into a real service? Book a senior Cadence engineer for one week, get a 48-hour free trial, and have your traces, metrics, and logs flowing into the backend of your choice by Friday. Replace the engineer any week, no notice period. Start the 2-minute booking flow.
**How long does an OTel rollout take?** For a single service with auto-instrumentation and a managed Collector (or vendor-hosted equivalent), a working setup takes 1 to 2 days. A multi-service rollout with a self-hosted gateway Collector, tail sampling, and dashboards built on the new data takes 2 to 3 weeks for a senior engineer working part-time, or one focused week of full-time effort.
**Is the profiling signal production-ready?** Not yet. Public alpha as of March 2026, with GA tracking for late 2026. The eBPF profiler agent is stable enough for staging and non-critical production. Wait for GA before putting it in front of regulated workloads. Profile-to-trace correlation via trace_id is the killer feature once GA lands.
**Should we run our own Collector from day one?** Use a hosted Collector for the first 6 months, then run your own once you need tail sampling, multi-backend routing, or PII redaction the vendor does not support. Honeycomb's Refinery, Grafana Alloy, and the Datadog Agent all accept OTLP and are reasonable starting points.
**Can we switch backends without touching application code?** Yes, if your application code only depends on the OpenTelemetry API and SDK. Switch the exporter in your Collector config, point at the new backend, and your traces, metrics, and logs flow to the new vendor with no code changes. The dashboards, alerts, and SLOs you built on the old vendor do not migrate; budget 2 to 4 weeks for that side of the move.
**What does self-hosting actually cost?** Tempo, Mimir, and Loki on a 3-node Kubernetes cluster with 200GB/day of trace volume runs roughly $400 to $800/month in cloud compute and storage, plus 0.25 to 0.5 SRE FTE to operate. Below 50GB/day, a managed backend is almost always cheaper than the operational overhead.