
A 2026 microservices monitoring stack is OpenTelemetry SDKs in every service, an OpenTelemetry Collector aggregating the data, and a vendor of your choice (Datadog, Honeycomb, Grafana Cloud) or self-hosted Tempo, Loki, and Prometheus on the receiving end. Pick the vendor based on cost-per-service at your real scale, not feature lists. The instrumentation choice (OTel) is locked in; the backend is replaceable.
Most teams reaching for this guide shouldn't be. If you're running 1 to 3 services, a single APM tool pointed at your monolith covers 90% of what you need: request traces, error rates, slow queries, host metrics. You don't need a collector, you don't need three pillars, you don't need a service mesh.
Microservices observability is a tax you pay for the freedom to deploy services independently. That tax shows up as a 5 to 15% line item against your infra spend at 50+ services, plus a meaningful chunk of your senior engineers' attention every week. Before you build the stack, read our breakdown of when to actually move from a monolith to microservices and confirm you're past the threshold.
If you're already past 5 services, deploying independently, and your incidents now span 2+ services on average, keep reading. The rest of this guide is for you.
Three things changed between 2022 and 2026 that locked in the default architecture.
First, OpenTelemetry won. Every major vendor (Datadog, Honeycomb, Grafana Cloud, New Relic, Splunk, Dynatrace) now accepts OTLP, the OTel wire protocol. Auto-instrumentation libraries cover Node, Python, Java, Go, .NET, Ruby, and PHP. You wire in the SDK once and get traces, metrics, and logs for HTTP, gRPC, SQL, Redis, Kafka, and most cloud SDKs out of the box.
Second, the OTel Collector matured into the obvious aggregation layer. You run it as a sidecar, a DaemonSet, or a central deployment. It buffers, transforms, samples, and routes data to one or more backends. Crucially, it lets you switch backends without redeploying every service.
Third, the vendor pricing models split into two camps: per-host (Datadog, New Relic) and usage-based (Honeycomb, Grafana Cloud). At microservices scale, per-host pricing creates structural problems we'll cover below.
The shape of the stack: services emit OTel data via SDKs (auto + custom), a Collector receives it, applies sampling and transformations, and forwards it to your backend. If you ever change vendors, only the Collector config changes.
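As a sketch of what that looks like in practice, a minimal Collector config has one OTLP receiver, a few processors, and a single exporter you swap when you change backends. The vendor endpoint, the api-key header name, and the 25% sampling rate below are placeholders, not recommendations:

```yaml
# OTel Collector sketch: receive OTLP from every service, limit memory,
# sample traces, batch, and forward everything to one backend.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  probabilistic_sampler:        # contrib processor; keep ~25% of traces
    sampling_percentage: 25
  batch:

exporters:
  otlphttp:
    endpoint: https://otlp.example-vendor.com   # placeholder vendor endpoint
    headers:
      api-key: ${env:VENDOR_API_KEY}            # header name varies by vendor

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, probabilistic_sampler, batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
```

Swapping Datadog for Honeycomb, or either for self-hosted Tempo, means changing the exporter block and nothing inside your services.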
Auto-instrumentation typically costs 2 to 5% CPU per service. That's the price of admission. If your service is so hot you can't spare 5% CPU, you have bigger problems than monitoring.
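Wiring the SDK in by hand is only a few lines on the service side. A Python sketch, assuming a Flask service and a Collector listening on localhost; the service name and endpoint are placeholders:

```python
# Minimal OTel tracing setup in Python -- a sketch, not a drop-in config.
from flask import Flask
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Identify this service in every span it emits.
resource = Resource.create({"service.name": "checkout-api"})

provider = TracerProvider(resource=resource)
# Batch spans and ship them to the local Collector over OTLP/gRPC.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # auto-instrument inbound HTTP requests

@app.route("/health")
def health():
    return "ok"
```

The zero-code alternative is the `opentelemetry-instrument` launcher, which patches the same libraries at process start without touching application code.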
Tracing is the bit that makes microservices debuggable. A single user request might touch 8 services, and without traces you're grepping logs across 8 dashboards trying to reconstruct what happened.
Four real choices in 2026: Datadog APM, Honeycomb, Grafana Cloud (hosted Tempo), or self-hosted Tempo behind your own Grafana.
The right choice depends on what you do with traces. If your incident response is "find the slow span," any of them work. If you live in traces and need to slice by 12 dimensions to find a 0.1% error class, Honeycomb pays for itself.
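Whichever backend you pick, slicing by many dimensions only works if those dimensions are on the spans. A sketch of adding custom attributes alongside the auto-instrumented ones; the attribute names and the Order fields are made up for illustration:

```python
from dataclasses import dataclass

from opentelemetry import trace

tracer = trace.get_tracer("checkout")

@dataclass
class Order:
    total_cents: int
    plan: str
    provider: str

def charge_card(order: Order) -> None:
    # Auto-instrumentation already covers the HTTP and DB spans around this call;
    # these attributes add the business dimensions you slice by during incidents.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.value_cents", order.total_cents)
        span.set_attribute("customer.plan", order.plan)          # e.g. "free" / "pro"
        span.set_attribute("payment.provider", order.provider)   # hypothetical names
        # ... call the payment provider here ...

charge_card(Order(total_cents=4_999, plan="pro", provider="stripe"))
```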
Three frameworks, each useful for different things, all fed by the same OTel SDKs.
RED (Rate, Errors, Duration) describes a service from the caller's perspective. Tom Wilkie codified it for microservices specifically. Use RED for every request-driven service. The three SLIs you derive from RED (request rate, error rate, p99 latency) become your default dashboard for that service.
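If you need RED metrics for something the auto-instrumentation doesn't cover, the trio maps onto two instruments from the same OTel SDK. A Python sketch with illustrative metric and attribute names:

```python
import time

from opentelemetry import metrics

meter = metrics.get_meter("checkout")

# RED: rate and errors come from one counter (split by status); duration from a histogram.
requests = meter.create_counter("app.requests", description="Completed requests")
latency = meter.create_histogram("app.request.duration", unit="ms",
                                 description="Request latency")

def handle(request_fn):
    start = time.monotonic()
    status = "ok"
    try:
        return request_fn()
    except Exception:
        status = "error"
        raise
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        requests.add(1, {"status": status})
        latency.record(elapsed_ms, {"status": status})
```

Rate is the counter's sum over time, errors are the same sum filtered to status="error", and p99 latency falls out of the histogram; those three series are the default dashboard described above.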
USE (Utilization, Saturation, Errors) is Brendan Gregg's framework for describing a resource. Use it for CPU, disk, network interfaces, queues, and connection pools. If RED tells you the service is slow, USE tells you which resource is the bottleneck.
The four golden signals (latency, traffic, errors, saturation) come from Google's SRE book and are basically RED + saturation. Useful as a north-star checklist when you build dashboards.
Don't mix frameworks per service. Pick RED for the API, pick USE for the queue worker, document which dashboards live where. The mixing is what produces unreadable dashboards with 47 panels.
Logs are the cheapest signal to produce and the most expensive to store. Treat them accordingly.
| Tool | Pricing | Best for |
|---|---|---|
| Loki | Self-host (cheap), or Grafana Cloud (50 GB free tier) | Teams that already use Prometheus, want label-based indexing, accept slower full-text search |
| Elastic / OpenSearch | Self-host, expensive at scale | Full-text search, historical analytics, but operational burden is real |
| Datadog Logs | ~$0.10/GB ingested, ~$1.27 per million events indexed | Teams already on Datadog, need one pane of glass |
| Grafana Cloud Logs | 50 GB free, then usage-based | Teams comfortable with the Grafana stack |
The single decision that controls your log bill is what you index versus what you only retain. Loki indexes labels (service, env, level), not log bodies, which is why it stays cheap. Elastic indexes everything, which is why its cost scales linearly with your log volume. Datadog gives you a knob for each log type. Set it intentionally.
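In practice the index-versus-retain split starts at the emit side: keep the fields you filter on as a few stable, low-cardinality keys and push everything else into the message body. A standard-library sketch; the service and env values are hardcoded placeholders:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line: three low-cardinality label fields
    plus a free-text body. Loki-style backends index only the labels."""
    def format(self, record):
        return json.dumps({
            "service": "checkout-api",           # label: indexed
            "env": "prod",                       # label: indexed
            "level": record.levelname.lower(),   # label: indexed
            "msg": record.getMessage(),          # body: retained, searched on demand
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger(__name__).info("card declined for order 84123")
```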
If you already run a service mesh for mTLS, traffic policy, or canary deploys, you get free observability as a side effect. Every service-to-service call gets latency, success rate, and request-rate metrics emitted by the proxy, no instrumentation required.
Istio gives you the richest data and the most knobs. Envoy proxies are heavy (around 50 to 150 MB RAM per pod) and the mesh has its own learning curve. Worth it if you need its policy and security features.
Linkerd is lighter (the linkerd2-proxy is 10 to 30 MB), opinionated, and easier to operate. Less powerful, fewer features, fewer ways to shoot yourself in the foot.
Don't add a mesh purely for observability. OTel SDKs do the same job at a fraction of the operational cost. But if you have a mesh anyway, point its metric exporters at your collector and stop instrumenting service-to-service calls by hand.
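If you do keep a mesh, pointing its metrics at the Collector can be one extra receiver added to the config from earlier. A sketch; the scrape target and port are placeholders (Linkerd's proxy serves metrics on its admin port, Envoy on its own), so check your mesh's docs for the real endpoint:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: mesh-proxies
          scrape_interval: 30s
          static_configs:
            - targets: ["linkerd-proxy.example.svc:4191"]   # placeholder target

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]   # mesh metrics join the existing pipeline
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
```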
The point of all this instrumentation is to defend a small number of service-level objectives. Without SLOs you produce dashboards no one looks at and alerts everyone mutes.
A workable SLO discipline:

- Derive 1 to 3 SLIs per request-driven service from its RED metrics: request rate, error rate, p99 latency.
- Set targets you can defend: 99.5% for internal services, 99.9% for user-facing APIs, 99.95% only where revenue directly depends on it.
- Alert on error-budget burn, and page only when users are affected.
The single biggest mistake teams make: paging on symptoms instead of SLO violations. CPU spikes don't matter; missed user requests matter. If the condition behind an alert doesn't degrade user experience, it shouldn't wake anyone up.
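As a concrete contrast, here's what an SLO-backed page looks like as a Prometheus alerting rule. This is a sketch: the metric names are hypothetical, and the 14.4x multiplier is the standard fast-burn threshold for a 99.9% SLO (roughly 2% of a 30-day error budget burned per hour):

```yaml
# Prometheus alerting-rule sketch: page on error-budget burn, not CPU.
groups:
  - name: checkout-slo
    rules:
      - alert: CheckoutErrorBudgetFastBurn
        expr: |
          sum(rate(app_requests_total{service="checkout", status="error"}[5m]))
            /
          sum(rate(app_requests_total{service="checkout"}[5m]))
            > 14.4 * (1 - 0.999)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Checkout is burning its 99.9% error budget 14x too fast"
```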
This is the part of the conversation every other guide skips. Here's roughly what each stack costs at three real scales, assuming containerized workloads and moderate traffic.
| Stack | 5 services | 50 services | 500 services |
|---|---|---|---|
| OTel + Datadog | ~$230/mo | ~$2,300/mo (often 2x with logs + custom metrics) | $23,000+/mo (frequently $50k-$100k real-world) |
| OTel + Grafana Cloud | $0 (free tier) | ~$400-$800/mo | ~$5,000-$15,000/mo |
| OTel + Honeycomb | ~$200/mo | ~$1,500-$2,000/mo | ~$10,000-$25,000/mo |
| OTel + self-hosted (Tempo/Loki/Prom) | ~$50/mo infra | ~$200-$500/mo infra + 1 platform engineer | ~$2,000-$8,000/mo infra + 2 platform engineers |
Two non-obvious things drive the Datadog numbers. First, per-host pricing maps awkwardly onto containerized workloads: you pay per Kubernetes node, not per service, so whether the model helps or hurts depends entirely on how densely you pack those nodes. Second, Datadog bills every OTel metric as a "custom metric." At scale, custom metrics alone can reach 52% of the total bill. Vantage and Last9 have both published case studies of teams cutting Datadog bills by 40 to 90% just by dropping or downsampling custom metrics.
Honeycomb and Grafana Cloud don't have the OTel custom-metric tax. That's the single biggest reason teams running OTel-first end up there.
The self-hosted column has a hidden cost: a platform engineer's time. Budget 0.25 to 0.5 of a senior engineer's headcount per 50 services. If that's cheaper than your vendor bill, self-host. Below that scale it almost never pencils out.
If you're a 2-founder team pre-revenue running a single service on Render or Fly.io, you don't need any of this. Sentry for errors, Vercel/Render's built-in metrics for traffic, and a single Datadog or Better Stack account for uptime gets you to product-market fit. Spend the saved engineering weeks on the product, and revisit the monitoring stack the day you cross 5 services or 100 paid customers, whichever comes first.
The same logic applies once you do scale: the right monitoring stack is the simplest one your team can defend, not the most sophisticated one you can buy. Most observability bills run 2 to 3x higher than the value the team actually extracts from them. Audit yours twice a year.
Standing up an OTel-based monitoring stack is well-bounded work. A senior engineer who has done it before can ship the SDKs, the Collector configuration, three production dashboards, and the first round of SLO-backed alerts in 1 to 2 weeks. A team doing it for the first time, while running everything else, will take 4 to 8 weeks and produce something brittle.
This is the kind of scope where a booked engineer is a clean fit. Every engineer on Cadence is AI-native by default (Cursor, Claude Code, and Copilot used daily, vetted on prompt-as-spec discipline before they unlock bookings), and the senior tier at $1,500 per week is the right level for an observability rollout. Most teams ship a working OTel + collector + vendor + first-pass SLOs in 2 to 4 weeks at that tier. Cadence's pool of 12,800 vetted engineers includes plenty with production OpenTelemetry, Grafana, Datadog, and Honeycomb experience.
If you'd rather audit your current stack before changing anything, our ship-or-skip tool gives an honest read on whether your monitoring problem is actually a monitoring problem (often it's an architecture problem in disguise).
For deeper background on instrumentation choices, see our OpenTelemetry guide for 2026, our Datadog review for SaaS teams, and the Sentry vs Datadog breakdown if you're choosing between error tracking and full APM. Teams shipping their first production-ready stack should also read how to scale from MVP to production for the full week-one checklist.
Auditing your monitoring stack? Run it through Cadence's ship-or-skip tool for a 5-minute honest grade on what's worth keeping, what's over-instrumented, and what to cut. Free, no signup.
Only if you have 5+ services or independent deploy cadences. For 1 to 3 services, a single APM tool covers it. The microservices monitoring stack is a tax for the freedom to deploy independently; don't pay it before you're using the freedom.
A senior engineer can ship OTel SDKs, a Collector, and a vendor backend in 1 to 2 weeks. Self-hosted Tempo, Loki, and Prometheus takes 2 to 4 weeks plus ongoing platform-engineering time. First-time teams doing it without help typically take 6 to 8 weeks.
Datadog bills every OTel metric as a "custom metric." At scale, custom metrics alone can reach 52% of the total bill. Honeycomb and Grafana Cloud don't have this charge, which is why most OTel-first teams end up there.
No. If you already need mTLS, traffic policy, or canary deploys, the free service-to-service metrics are a bonus. But the operational cost of running Istio or Linkerd purely for observability is higher than just adding OTel SDKs to each service.
99.5% for internal services, 99.9% for user-facing APIs, 99.95% only when revenue directly depends on it. 100% is never the right answer; it leaves no room for deploys, config changes, or controlled experiments.