May 8, 2026 · 11 min read · Cadence Editorial

How to set up a microservices monitoring stack

Photo by [Brett Sayles](https://www.pexels.com/@brett-sayles) on [Pexels](https://www.pexels.com/photo/black-hardwares-on-data-server-room-4597280/)

A 2026 microservices monitoring stack is OpenTelemetry SDKs in every service, an OpenTelemetry Collector aggregating the data, and a vendor of your choice (Datadog, Honeycomb, Grafana Cloud) or self-hosted Tempo, Loki, and Prometheus on the receiving end. Pick the vendor based on cost-per-service at your real scale, not feature lists. The instrumentation choice (OTel) is locked in; the backend is replaceable.

First: do you actually need a microservices monitoring stack?

Most teams reaching for this guide shouldn't be. If you're running 1 to 3 services, a single APM tool pointed at your monolith covers 90% of what you need: request traces, error rates, slow queries, host metrics. You don't need a collector, you don't need three pillars, you don't need a service mesh.

Microservices observability is a tax you pay for the freedom to deploy services independently. That tax shows up as a 5 to 15% line item against your infra spend at 50+ services, plus a meaningful chunk of your senior engineers' attention every week. Before you build the stack, read our breakdown of when to actually move from a monolith to microservices and confirm you're past the threshold.

If you're already past 5 services, deploying independently, and your incidents now span 2+ services on average, keep reading. The rest of this guide is for you.

The 2026 default stack: OTel Collector plus a vendor of choice

Three things changed between 2022 and 2026 that locked in the default architecture.

First, OpenTelemetry won. Every major vendor (Datadog, Honeycomb, Grafana Cloud, New Relic, Splunk, Dynatrace) now accepts OTLP, the OTel wire protocol. Auto-instrumentation libraries cover Node, Python, Java, Go, .NET, Ruby, and PHP. You wire in the SDK once and get traces, metrics, and logs for HTTP, gRPC, SQL, Redis, Kafka, and most cloud SDKs out of the box.
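"Wire in the SDK once" is genuinely small. Here's a minimal Python sketch, assuming the opentelemetry-sdk, OTLP exporter, and requests-instrumentation packages are installed; the service name is a placeholder:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# One-time setup at process start. The exporter defaults to an OTLP endpoint
# on localhost:4317; in production that's your Collector, not the vendor.
provider = TracerProvider(
    resource=Resource.create({"service.name": "checkout"})  # hypothetical name
)
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

# After this line, every call made with the `requests` library emits a
# client span automatically; other instrumentors work the same way.
RequestsInstrumentor().instrument()
```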

Second, the OTel Collector matured into the obvious aggregation layer. You run it as a sidecar, a DaemonSet, or a central deployment. It buffers, transforms, samples, and routes data to one or more backends. Crucially, it lets you switch backends without redeploying every service.

Third, the vendor pricing models split into two camps: per-host (Datadog, New Relic) and usage-based (Honeycomb, Grafana Cloud). At microservices scale, per-host pricing creates structural problems we'll cover below.

The shape of the stack: services emit OTel data via SDKs (auto + custom), a Collector receives it, applies sampling and transformations, and forwards it to your backend. If you ever change vendors, only the Collector config changes.
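The contract between a service and the Collector reduces to two standard OTel environment variables. You'd normally set them in the deployment manifest rather than in code, but as a sketch:

```python
import os

# Normally set in the deployment manifest; shown in code only for clarity.
# Both are standard OTel SDK environment variables.
os.environ["OTEL_SERVICE_NAME"] = "checkout"  # hypothetical service name
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "http://otel-collector:4317"

# The service only ever knows the Collector's address. Vendor API keys and
# routing live in the Collector's exporter config, so switching backends
# never touches or redeploys this code.
```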

Auto-instrumentation typically costs 2 to 5% CPU per service. That's the price of admission. If your service is so hot you can't spare 5% CPU, you have bigger problems than monitoring.

Distributed tracing: pick where your traces live

Tracing is the bit that makes microservices debuggable. A single user request might touch 8 services, and without traces you're left grepping logs across 8 separate services trying to reconstruct what happened.

Four real choices in 2026:

  • Grafana Tempo (self-hosted or Grafana Cloud). Cheap at scale, supports OTLP / Jaeger / Zipkin protocols, can do 100% sampling because storage is object-store-backed. Pairs naturally with Loki and Prometheus.
  • Jaeger (self-hosted). CNCF project, multiple storage backends (Cassandra, Elasticsearch, ClickHouse), solid sampling. UI is functional but not great for ad-hoc querying.
  • Honeycomb. Strongest tool for high-cardinality trace querying. You can group traces by user_id, request_id, or feature flag value without thinking about it. Their pricing has no per-host or per-span surprise.
  • Datadog APM. Ubiquitous, polished UI, deep integrations. Costs $31 per host per month plus indexed-span charges plus the custom-metric tax we'll cover below.

The right choice depends on what you do with traces. If your incident response is "find the slow span," any of them work. If you live in traces and need to slice by 12 dimensions to find a 0.1% error class, Honeycomb pays for itself.
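On the instrumentation side, high-cardinality slicing is nothing more than span attributes. A hedged sketch; the values and attribute names are illustrative, not a required schema:

```python
from opentelemetry import trace

# Hypothetical values standing in for real request context.
user_id = "u_8241"
new_checkout_enabled = True

# Attach high-cardinality fields to the active span. A backend like Honeycomb
# can then group or filter by any of them without a schema change.
span = trace.get_current_span()
span.set_attribute("app.user_id", user_id)
span.set_attribute("app.feature_flag.new_checkout", new_checkout_enabled)
```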

Service-level metrics: USE, RED, and the four golden signals

Three frameworks, each useful for different things, all fed by the same OTel SDKs.

RED (Rate, Errors, Duration) describes a service from the caller's perspective. Tom Wilkie codified it for microservices specifically. Use RED for every request-driven service. The three SLIs you derive from RED (request rate, error rate, p99 latency) become your default dashboard for that service.

USE (Utilization, Saturation, Errors) is Brendan Gregg's framework, and it describes a resource. Use it for CPU, disk, network interfaces, queues, and connection pools. If RED tells you the service is slow, USE tells you which resource is the bottleneck.

The four golden signals (latency, traffic, errors, saturation) come from Google's SRE book and are basically RED + saturation. Useful as a north-star checklist when you build dashboards.

Don't mix frameworks within a single service's dashboards. Pick RED for the API, pick USE for the queue worker, and document which dashboards live where. Mixing is what produces unreadable dashboards with 47 panels.
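In OTel terms, all of RED falls out of two instruments: a counter for rate and errors, a histogram for duration. A minimal sketch with the OTel metrics API; the metric names and the `handle` wrapper are illustrative, not the official semantic conventions:

```python
import time

from opentelemetry import metrics

meter = metrics.get_meter("checkout")  # hypothetical service name

# Rate and Errors come from one counter (split by status); Duration from one
# histogram recorded on every request.
request_count = meter.create_counter(
    "http.server.request.count", description="Completed requests by status"
)
request_duration = meter.create_histogram(
    "http.server.request.duration", unit="ms", description="Request latency"
)

def handle(request_fn):
    """Wrap a request handler so RED metrics are recorded on every call."""
    start = time.monotonic()
    status = "ok"
    try:
        return request_fn()
    except Exception:
        status = "error"
        raise
    finally:
        request_count.add(1, {"status": status})
        request_duration.record((time.monotonic() - start) * 1000, {"status": status})
```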

Log aggregation: Loki, Elastic, Datadog Logs, Grafana Cloud Logs

Logs are the cheapest signal to produce and the most expensive to store. Treat them accordingly.

| Tool | Pricing | Best for |
| --- | --- | --- |
| Loki | Self-host (cheap), or Grafana Cloud (50 GB free tier) | Teams that already use Prometheus, want label-based indexing, and accept slower full-text search |
| Elastic / OpenSearch | Self-host, expensive at scale | Full-text search and historical analytics, if you can carry the real operational burden |
| Datadog Logs | $0.10/GB ingested, $1.27 per million events indexed | Teams already on Datadog that need one pane of glass |
| Grafana Cloud Logs | 50 GB free, then usage-based | Teams comfortable with the Grafana stack |

The single decision that controls your log bill is what you index versus what you merely retain. Loki indexes labels (service, env, level), not log bodies, which is why it stays cheap. Elastic indexes everything, which is why its cost scales linearly with log volume. Datadog gives you an indexing knob for each log type. Set it intentionally.
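The index-versus-retain split shows up directly in how you emit logs. A hedged sketch assuming a label-indexing shipper such as Promtail in front of Loki; the field values are hypothetical:

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("checkout")  # hypothetical service name

# The shipper (Promtail for Loki, the Datadog agent, etc.) indexes only
# low-cardinality labels like service/env/level. High-cardinality detail
# stays in the unindexed JSON body: still searchable, far cheaper to keep.
logger.info(json.dumps({
    "level": "info",
    "msg": "order placed",
    "request_id": "req_91f3a",  # hypothetical values
    "user_id": "u_8241",
    "latency_ms": 184,
}))
```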

Service mesh observability: Istio vs Linkerd

If you already run a service mesh for mTLS, traffic policy, or canary deploys, you get free observability as a side effect. Every service-to-service call gets latency, success rate, and request-rate metrics emitted by the proxy, no instrumentation required.

Istio gives you the richest data and the most knobs. Envoy proxies are heavy (around 50 to 150 MB RAM per pod) and the mesh has its own learning curve. Worth it if you need its policy and security features.

Linkerd is lighter (the linkerd2-proxy is 10 to 30 MB), opinionated, and easier to operate. Less powerful, fewer features, fewer ways to shoot yourself in the foot.

Don't add a mesh purely for observability. OTel SDKs do the same job at a fraction of the operational cost. But if you have a mesh anyway, point its metric exporters at your collector and stop instrumenting service-to-service calls by hand.

SLOs, SLIs, and alert routing without burnout

The point of all this instrumentation is to defend a small number of service-level objectives. Without SLOs you produce dashboards no one looks at and alerts everyone mutes.

A workable SLO discipline:

  • Define 1 to 3 SLIs per user-facing service (p99 latency, success rate, freshness).
  • Set SLOs at 99.5% for internal services, 99.9% for user-facing APIs, 99.95% only when revenue depends on it. 100% is never the answer.
  • Replace threshold alerts with multi-window burn-rate alerts (Google SRE workbook, chapter 5). A burn-rate alert pages you when you're consuming your error budget too fast, not when a single metric crosses a magic number; a sketch of the math follows this list.
  • Route pages to PagerDuty or incident.io. Cap each on-call at 2 incidents per week as a target; if you're consistently above, your alerts are wrong.
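The burn-rate math fits in a few lines. A hedged Python sketch of the multi-window rule, using the workbook's 14.4x fast-burn threshold; the window sizes and threshold are the workbook's suggestions, not law:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """1.0 means consuming error budget exactly as fast as the SLO allows."""
    return error_rate / (1.0 - slo)

def should_page(err_long: float, err_short: float, slo: float = 0.999) -> bool:
    # Both a long window (e.g., 1 hour) and a short window (e.g., 5 minutes)
    # must burn fast, so a blip that has already recovered never pages.
    # 14.4x is the workbook's fast-burn threshold for a paging alert.
    return burn_rate(err_long, slo) > 14.4 and burn_rate(err_short, slo) > 14.4

# A 99.9% SLO leaves a 0.1% error budget. A sustained 2% error rate burns
# budget at 20x the allowed pace, so both windows trip and the page fires.
assert should_page(err_long=0.02, err_short=0.02)
```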

The single biggest mistake teams make: paging on symptoms instead of SLO violations. CPU spikes don't matter; missed user requests matter. If the condition behind an alert doesn't degrade user experience, it shouldn't wake anyone up.

Real cost math: 5, 50, and 500 services

This is the part of the conversation every other guide skips. Here's roughly what each stack costs at three real scales, assuming containerized workloads and moderate traffic.

| Stack | 5 services | 50 services | 500 services |
| --- | --- | --- | --- |
| OTel + Datadog | ~$230/mo | ~$2,300/mo (often 2x with logs + custom metrics) | $23,000+/mo (frequently $50k-$100k real-world) |
| OTel + Grafana Cloud | $0 (free tier) | ~$400-$800/mo | ~$5,000-$15,000/mo |
| OTel + Honeycomb | ~$200/mo | ~$1,500-$2,000/mo | ~$10,000-$25,000/mo |
| OTel + self-hosted (Tempo/Loki/Prom) | ~$50/mo infra | ~$200-$500/mo infra + 1 platform engineer | ~$2,000-$8,000/mo infra + 2 platform engineers |

Two non-obvious things drive the Datadog numbers. First, per-host pricing punishes containerized workloads where each Kubernetes node hosts dozens of services; you pay per node, not per service, which can either help or hurt depending on density. Second, Datadog bills every OTel metric as a "custom metric." At scale, custom metrics frequently make up 52% of the total bill. Vantage and Last9 have both published case studies of teams cutting Datadog bills by 40 to 90% just by dropping or downsampling custom metrics.
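The mechanics of that custom-metric explosion are worth seeing once. A back-of-envelope sketch; every count below is a hypothetical stand-in for your own numbers:

```python
# Datadog counts each unique metric-name + tag-value combination as one
# billable custom metric. The counts below are hypothetical stand-ins.
services = 50
metrics_per_service = 40           # OTel auto-instrumentation emits plenty
tag_combos_per_metric = 30         # endpoint x status x version, and so on

billable_custom_metrics = services * metrics_per_service * tag_combos_per_metric
print(billable_custom_metrics)     # 60,000, before anyone adds a dashboard
```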

Honeycomb and Grafana Cloud don't have the OTel custom-metric tax. That's the single biggest reason teams running OTel-first end up there.

The self-hosted column has hidden cost: a platform engineer's time. Budget 0.25 to 0.5 of a senior engineer's headcount per 50 services. If that's cheaper than your vendor bill, self-host. Below that scale it almost never pencils out.
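The break-even fits on one screen. Every figure below is an assumption to replace with your own numbers:

```python
# Hedged break-even sketch; all inputs are assumptions.
vendor_bill_monthly = 23_000            # the 500-service Datadog column above
selfhost_infra_monthly = 5_000          # midpoint of the self-hosted column
platform_engineers = 2
loaded_engineer_cost_monthly = 18_000   # assumed fully loaded senior cost

selfhost_total = selfhost_infra_monthly + platform_engineers * loaded_engineer_cost_monthly
print(selfhost_total, vendor_bill_monthly)  # 41000 vs 23000: at list price, stay
# Against the real-world $50k-$100k bills, the same math flips to self-host.
```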

Common pitfalls

  • Cardinality explosion. Don't put user_id or request_id in a Prometheus label. Each unique value creates a new time series, so a million users equals a million series. Put high-cardinality data in traces and logs, not metrics; see the sketch after this list.
  • 100% trace sampling at scale. Fine at 5 services, ruinous at 500. Use head-based sampling (decide at the entry point) or tail-based sampling (decide after the trace is complete) once you cross 1,000 traces per second.
  • Building a homegrown collector. People still try this. The OTel Collector exists, it's mature, it has 100+ receiver and exporter plugins. Use it.
  • Alert fatigue. If your team mutes pages, your SLOs are wrong, not your team.
  • Vendor lock-in. If you ship proprietary agents (the old Datadog agent, New Relic agent, Dynatrace OneAgent) instead of OTel SDKs, switching vendors becomes a six-month project. The OTel SDK is your insurance policy.
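Here's the cardinality pitfall from the first bullet in code form. A hedged sketch using prometheus_client and the OTel trace API; the metric and attribute names are illustrative:

```python
from opentelemetry import trace
from prometheus_client import Counter

# BAD: user_id as a metric label. Each unique value mints a new time series;
# a million users means a million series and an out-of-memory Prometheus.
bad_requests = Counter("http_requests_bad_total", "Requests", ["service", "user_id"])

# GOOD: bounded labels on the metric, high-cardinality detail on the trace.
requests = Counter("http_requests_total", "Requests", ["service", "status"])
requests.labels(service="checkout", status="200").inc()
trace.get_current_span().set_attribute("app.user_id", "u_8241")  # hypothetical value
```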

When you can skip this entirely

If you're a 2-founder team pre-revenue running a single service on Render or Fly.io, you don't need any of this. Sentry for errors, Vercel/Render's built-in metrics for traffic, and a single Datadog or Better Stack account for uptime gets you to product-market fit. Spend the saved engineering weeks on the product, and revisit the monitoring stack the day you cross 5 services or 100 paid customers, whichever comes first.

The same logic applies once you do scale: the right monitoring stack is the simplest one your team can defend, not the most sophisticated one you can buy. Most teams overspend on observability by 2 to 3x relative to the value they extract. Audit yours twice a year.

Who should set this up: your team or a contractor

Standing up an OTel-based monitoring stack is well-bounded work. A senior engineer who has done it before can ship the SDKs, the Collector configuration, three production dashboards, and the first round of SLO-backed alerts in 1 to 2 weeks. A team doing it for the first time, while running everything else, will take 4 to 8 weeks and produce something brittle.

This is the kind of scope where a booked engineer is a clean fit. Every engineer on Cadence is AI-native by default (Cursor, Claude Code, and Copilot used daily, vetted on prompt-as-spec discipline before they unlock bookings), and the senior tier at $1,500 per week is the right level for an observability rollout. Most teams ship a working OTel + collector + vendor + first-pass SLOs in 2 to 4 weeks at that tier. Cadence's pool of 12,800 vetted engineers includes plenty with production OpenTelemetry, Grafana, Datadog, and Honeycomb experience.

If you'd rather audit your current stack before changing anything, our ship-or-skip tool gives an honest read on whether your monitoring problem is actually a monitoring problem (often it's an architecture problem in disguise).

For deeper background on instrumentation choices, see our OpenTelemetry guide for 2026, our Datadog review for SaaS teams, and the Sentry vs Datadog breakdown if you're choosing between error tracking and full APM. Teams shipping their first production-ready stack should also read how to scale from MVP to production for the full week-one checklist.

Auditing your monitoring stack? Run it through Cadence's ship-or-skip tool for a 5-minute honest grade on what's worth keeping, what's over-instrumented, and what to cut. Free, no signup.

Steps

  1. Confirm you need it. Count your services. Below 5, stop here and use a single APM tool on your monolith.
  2. Pick your vendor first, on cost. Run the 5/50/500 math against your real service count. Pricing changes the architecture.
  3. Add the OpenTelemetry SDK to one service. Auto-instrumentation only. Verify traces and metrics arrive at your vendor.
  4. Deploy the OpenTelemetry Collector. Sidecar for small clusters, DaemonSet for Kubernetes, central deployment for low-volume environments. Point one service at it, then expand.
  5. Migrate the rest of the services. One or two per day. Watch CPU overhead; expect 2 to 5% per service.
  6. Define your first SLOs. One per user-facing service. Pick 99.5 to 99.9% based on revenue impact. Wire burn-rate alerts to PagerDuty or incident.io.
  7. Build three dashboards per service: RED (rate/errors/duration), USE (resource saturation), and a trace explorer. Delete every dashboard no one opens after 30 days.
  8. Audit cost monthly for the first quarter. Drop unused custom metrics. Adjust trace sampling (see the sketch after these steps). Re-run the math against your bill.
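For step 8's sampling adjustment, head-based sampling is a one-line change at the entry service. A sketch; the 10% ratio is an example, and tail-based sampling would live in the Collector instead:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based sampling: keep 10% of traces, decided at the entry service.
# ParentBased makes downstream services honor the entry point's decision,
# so sampled traces arrive complete rather than as fragments.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))
trace.set_tracer_provider(provider)
```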

FAQ

Do I need a separate monitoring stack for microservices?

Only if you have 5+ services or independent deploy cadences. For 1 to 3 services, a single APM tool covers it. The microservices monitoring stack is a tax for the freedom to deploy independently; don't pay it before you're using the freedom.

How long does it take to set up?

A senior engineer can ship OTel SDKs, a Collector, and a vendor backend in 1 to 2 weeks. Self-hosted Tempo, Loki, and Prometheus takes 2 to 4 weeks plus ongoing platform-engineering time. First-time teams doing it without help typically take 4 to 8 weeks.

Why is Datadog so expensive for OpenTelemetry users?

Datadog bills every OTel metric as a "custom metric." At scale, custom metrics frequently account for 52% of the total bill. Honeycomb and Grafana Cloud don't have this charge, which is why most OTel-first teams end up there.

Should I use a service mesh just for observability?

No. If you already need mTLS, traffic policy, or canary deploys, the free service-to-service metrics are a bonus. But the operational cost of running Istio or Linkerd purely for observability is higher than just adding OTel SDKs to each service.

What's the right SLO target?

99.5% for internal services, 99.9% for user-facing APIs, 99.95% only when revenue directly depends on it. 100% is never the right answer; it leaves no room for deploys, config changes, or controlled experiments.
