May 14, 2026 · 11 min read · Cadence Editorial

How to set up monitoring for a SaaS app in 2026

Photo by [Brett Sayles](https://www.pexels.com/@brett-sayles) on [Pexels](https://www.pexels.com/photo/black-hardwares-on-data-server-room-4597280/)

To monitor a SaaS app in 2026, install Sentry for errors, Better Stack for uptime and on-call, PostHog for product analytics, and lean on your host's built-in metrics (Vercel, Render, Fly). Total cost: $0 to $200 per month until you cross roughly 50 paying customers. Wire alerts against the four golden signals (latency, traffic, errors, saturation). Graduate to Datadog or full OpenTelemetry only when paging volume or revenue justifies the bill.

That's the answer. The rest of this post is the playbook: which vendor does what, how to install each one, what an actual SLO looks like, how to wire alerts, and when to graduate to a heavier stack.

What "monitoring" actually means for a SaaS in 2026

Most founders use the word "monitoring" to mean five different things, and the confusion is what makes them either over-buy or under-buy. The five layers, in priority order:

  1. Errors. Exceptions, stack traces, the broken-button moments. Sentry, Bugsnag, Rollbar.
  2. Uptime. Is the homepage returning 200? Is /api/checkout reachable from Sydney? Better Stack, Pingdom, UptimeRobot.
  3. Product analytics. Did the user click the button? How many trials converted? PostHog, Mixpanel, Amplitude.
  4. Infrastructure metrics. CPU, memory, request count, database connections. Vercel Analytics, Render dashboard, Fly metrics, Datadog APM.
  5. Logs. Raw text events for forensic debugging. Better Stack Logs, Axiom, Logtail.

You need at least three of the five from day one: errors, uptime, and product analytics. The other two come along for free with whatever PaaS you're already on. Skipping any of the first three means you'll learn about your outages from a Slack DM that starts with "hey, is anything broken on your end?"

Why most early teams over-buy or under-buy

There are two failure modes, and both are expensive.

Over-buy: You read a Datadog blog post, get excited about distributed tracing, and provision Datadog APM at seed stage. Datadog APM starts at $15 to $23 per host per month, and "host" includes every Render instance, every Fly machine, every preview deploy. A typical seed-stage SaaS with 8 to 12 hosts pays $1,500 to $3,000 per month for capability it cannot yet use, because there are no microservices to trace.

Under-buy: You ship the product with no monitoring beyond the Vercel error log. Your first paying customer hits a bug at 2am Pacific. You find out 14 hours later when they email asking for a refund. The lifetime-value math on that customer is now negative, and you wasted the engineering hour you would have spent on a $25 Better Stack monitor.

The right answer in 2026 is the same as it was in 2022: start cheap, instrument the four golden signals, and let usage drive the upgrade. The tooling has gotten dramatically better at the low end. Sentry, PostHog, and Better Stack all have free tiers that comfortably cover a SaaS doing $0 to $20k MRR.

The minimum viable monitoring stack ($0 to $200 per month)

Here's the full stack a Cadence engineer would ship for a typical Next.js or Node SaaS in 2026:

| Layer | Starter pick | Free tier | Paid tier | When to upgrade |
| --- | --- | --- | --- | --- |
| Errors | Sentry | 5k errors / month | $26 / month | When errors > 5k/mo |
| Uptime + on-call | Better Stack | 10 monitors | $25 / month | Pay from day one |
| Product analytics | PostHog Cloud | 1M events / month | $0.00045 / event | When events > 1M/mo |
| Infra metrics | Vercel / Render built-ins | Included | Included | When you outgrow PaaS |
| Logs | Better Stack Logs | 30 GB / month | $0.30 / GB | When logs > 30 GB/mo |

Total cost at zero traffic: $25 per month (Better Stack paid plan; everything else on free tier). Total cost at 50 paying customers: $75 to $200 per month, depending on event volume. You will not need Datadog at this stage. You will not need full OpenTelemetry. You will need three vendor accounts and about 40 lines of integration code.

The four golden signals, applied to a real SaaS

Google's SRE team named the four golden signals back in 2016, and they still hold up. The trick in 2026 is mapping each one onto a real tool in the stack above, with concrete thresholds.

Latency

The time from request received to response sent. Measure P50 (median), P95 (slow user experience), and P99 (worst case). For a typical SaaS API in 2026, sane starting SLOs:

  • P50 < 200 ms
  • P95 < 500 ms
  • P99 < 1.5 s

These come from Vercel Analytics, Render's built-in metrics, or Sentry Performance. Don't measure from your own laptop. Measure from the edge, where your users actually live.
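If you want a P95 number before Sentry Performance or your host's analytics are wired up, a hand-rolled timer on a route handler is enough; pipe the structured log line into your log drain and chart the percentile there. A minimal sketch; the withTiming name and the log shape are ours, not part of any SDK:

// Hypothetical helper: wraps a route handler and logs duration per request.
export function withTiming(
  route: string,
  handler: (req: Request) => Promise<Response>
) {
  return async (req: Request): Promise<Response> => {
    const start = performance.now();
    try {
      return await handler(req);
    } finally {
      const ms = Math.round(performance.now() - start);
      console.log(JSON.stringify({ route, ms })); // one structured line per request
    }
  };
}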

Traffic

Requests per minute, broken down by route. The point isn't the absolute number; it's the slope. A 3x spike at 4am that doesn't match your usual pattern is either a bot, a customer's broken integration, or a viral tweet. Vercel and Render both expose this for free. PostHog gives you the product-event version.

Errors

Two flavors: HTTP 5xx (server errors) and HTTP 4xx (client errors). 5xx is your problem. 4xx is sometimes your problem (broken auth flow) and sometimes the user's. Sentry handles uncaught exceptions; your host's logs handle 5xx counts. Set the alert threshold at "more than 1% of requests failing for 5 minutes," not "any single error."
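To make that threshold concrete, here is the rule as code. Illustrative only; in the starter stack Better Stack or your host evaluates this for you, and every name below is ours:

// Sliding-window 5xx rate: alert when > 1% of requests fail for 5 minutes.
const WINDOW_MS = 5 * 60 * 1000;
const samples: { ts: number; is5xx: boolean }[] = [];

// Call from middleware with each response's status code.
export function record(status: number) {
  const now = Date.now();
  samples.push({ ts: now, is5xx: status >= 500 });
  // drop samples that have aged out of the window
  while (samples.length > 0 && samples[0].ts < now - WINDOW_MS) samples.shift();
}

export function shouldAlert(): boolean {
  if (samples.length < 100) return false; // too few requests to call it an outage
  const failing = samples.filter((s) => s.is5xx).length;
  return failing / samples.length > 0.01;
}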

Saturation

How full your most-constrained resource is. For a Next.js app on Vercel, this is rarely you (Vercel autoscales). For a Postgres-backed SaaS, this is almost always database connections. For a queue worker on Render, it's queue depth.

The single most useful saturation alert in 2026: "Postgres connection pool > 80% in use for 5 minutes." This catches connection leaks, runaway migrations, and the moment your app outgrows its database tier. Render and Supabase both expose pool metrics in their dashboards. Wire that into Better Stack.
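If your database client is node-postgres, the Pool already exposes the counters you need. A sketch, assuming your db module owns the Pool and max matches the connection limit of the tier you actually pay for:

import { Pool } from "pg";

const MAX_POOL = 20; // must match your database tier's connection limit
export const pool = new Pool({ max: MAX_POOL });

// 0.0 to 1.0; page when this sits above 0.8 for 5 minutes
export function poolSaturation(): number {
  const inUse = pool.totalCount - pool.idleCount; // built-in pg.Pool counters
  return inUse / MAX_POOL;
}

Return that number from the health endpoint in the next section and Better Stack can alert on it alongside uptime.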

Working code: Sentry, Better Stack, PostHog

Real working snippets for a Next.js 15 app on Vercel. Drop these in and you have errors, uptime, and product analytics live in under an hour.

Sentry init (sentry.client.config.ts):

import * as Sentry from "@sentry/nextjs";

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  tracesSampleRate: 0.1, // 10% of transactions; 1.0 will eat the free tier (see pitfalls)
  replaysSessionSampleRate: 0.0, // no ambient session replays
  replaysOnErrorSampleRate: 1.0, // but always capture a replay when an error fires
  environment: process.env.NODE_ENV,
});

Better Stack health endpoint (app/api/health/route.ts):

import { db } from "@/lib/db";

export async function GET() {
  try {
    await db.execute("SELECT 1"); // cheapest round trip that proves Postgres is reachable
    return Response.json({ status: "ok", ts: Date.now() });
  } catch (e) {
    // any non-200 is what the Better Stack monitor alerts on
    return Response.json({ status: "degraded" }, { status: 503 });
  }
}

Then in Better Stack: create an HTTP monitor pointing at https://yourapp.com/api/health, set "expected status: 200," and add yourself to the on-call escalation policy. Three clicks.

PostHog init (app/providers.tsx):

"use client";
import posthog from "posthog-js";
import { PostHogProvider } from "posthog-js/react";

if (typeof window !== "undefined") {
  posthog.init(process.env.NEXT_PUBLIC_POSTHOG_KEY!, {
    api_host: "https://us.i.posthog.com",
    capture_pageview: "history_change", // SPA-aware pageviews on client-side navigation
  });
}

export function Providers({ children }: { children: React.ReactNode }) {
  return <PostHogProvider client={posthog}>{children}</PostHogProvider>;
}
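Once the provider is mounted, capturing a product event is one line anywhere in the client tree. The event name and property below are illustrative:

"use client";
import { usePostHog } from "posthog-js/react";

export function UpgradeButton() {
  const posthog = usePostHog();
  return (
    <button onClick={() => posthog?.capture("upgrade_clicked", { plan: "pro" })}>
      Upgrade
    </button>
  );
}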

A real SLO definition (drop in slos.yaml for documentation; tools like Nobl9 or Grafana SLO consume the same shape):

slos:
  - name: api-availability
    target: 0.995
    window: 30d
    indicator: http_5xx_rate < 0.005
  - name: api-latency-p95
    target: 0.99
    window: 30d
    indicator: http_p95_ms < 500

That is the entire monitoring stack. Roughly 40 lines of code, three vendor accounts, one health endpoint. Many teams overcomplicate this because the Datadog blog makes it sound like you need 14 dashboards before you ship. You don't.

For teams that want a second opinion before locking in vendors, auditing your stack with Ship-or-Skip takes 5 minutes and tells you which layers you're actually missing versus which ones you can defer.

Steps

  1. Pick the vendors. Sentry for errors, Better Stack for uptime and on-call, PostHog for product analytics. Do not deliberate for more than 30 minutes; all three are easily swappable later.
  2. Install the SDKs. Sentry into the Next.js or Node app, PostHog into the browser bundle, Better Stack as an external HTTP monitor against /api/health. Total: about 40 lines of code.
  3. Set the SLOs. Write three: API availability (99.5%), API latency P95 (< 500 ms), and Postgres connection saturation (< 80%). Document them in slos.yaml in the repo. The number matters less than having one written down.
  4. Wire the alerts. Better Stack handles paging. Set thresholds at "1% error rate for 5 minutes" and "health endpoint down for 2 minutes." Route to Slack first, phone only for SEV-1.
  5. Set up the on-call rotation. If you are solo, you are on-call. Configure Better Stack quiet hours (no Slack noise after 11pm for non-paging alerts). When you hire your second engineer, split the week. Document the runbook for the top three alerts in a Notion page everyone can find at 3am.

When to graduate to Datadog or full OpenTelemetry

The starter stack carries you to roughly 50 paying customers or $20k MRR, whichever comes first. Past that, you'll start hitting one of three triggers:

  • Service count. You now have 3+ deployed services (web app, worker, scheduled jobs, ML inference). Correlated tracing across them stops being optional.
  • Compliance. SOC 2 or HIPAA audits want centralized log retention with audit trails. Better Stack handles this, and so do Datadog and Grafana Cloud.
  • Team size. When you have 4+ engineers, the cost of mis-routed pages and stale dashboards exceeds the Datadog bill.

When any two of those trigger, graduate. The two clean paths in 2026:

  1. Datadog APM + Datadog Logs. Pay $15 to $23 per host per month for APM, plus log ingestion. Expect $1,500 to $3,000 per month at 10 hosts and 100 GB of logs. Worth it for the unified UI and excellent integrations.
  2. Full OpenTelemetry + Grafana Cloud. Self-instrument everything with OTel, ship to Grafana Cloud (free tier holds for a while). More setup, lower long-term cost, vendor-portable; a minimal bootstrap is sketched after this list. This is where teams that already have a strong infrastructure engineer end up. The OpenTelemetry guide for 2026 walks through the migration. For multi-service teams considering this jump, our microservices monitoring playbook covers the trace-correlation patterns you'll need.
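To size the "more setup" claim: the minimal Node-side bootstrap looks roughly like this. A sketch using the official @opentelemetry packages; the endpoint env var is a placeholder for your Grafana Cloud OTLP URL:

import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "web",
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTLP_ENDPOINT, // your Grafana Cloud OTLP/HTTP endpoint
  }),
});

sdk.start(); // run before the app imports anything it should trace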

Either way, do not graduate before the triggers fire. Premature observability is one of the more expensive mistakes early SaaS teams make.

Common pitfalls

A few patterns we see repeatedly when we drop a senior engineer into a SaaS codebase to fix monitoring:

  • Alerting on every error. Set thresholds, not single-event triggers. Alert fatigue shows up in over 60% of SRE survey complaints, and a team that learns to ignore Slack noise will eventually ignore the real outage too.
  • No on-call rotation. "We'll just check Slack" is not a rotation. Use Better Stack's escalation policy, even if the only person on the policy is you.
  • Ignoring saturation. Latency and errors get attention; saturation gets discovered the day Postgres runs out of connections at 3pm. Wire that alert first.
  • Monitoring without SLOs. A dashboard without a target is just art. Write the three numbers down, even if you adjust them in month three.
  • Logging PII into Sentry. Configure beforeSend to scrub emails, tokens, and credit-card numbers (a minimal scrubber is sketched after this list). Doing this in month one is much cheaper than doing it during a SOC 2 audit. Pair this with a GDPR data-deletion playbook so logs respect the same retention rules as your primary database.
  • No SDK budget for tracesSampleRate. Sentry will happily eat your whole free tier in a week if you set tracesSampleRate to 1.0 on a high-traffic page. Start at 0.1 and tune up.
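For the PII pitfall, extend the Sentry init from earlier with a beforeSend hook. A minimal scrubber, assuming auth headers and emails in messages are the main leak paths; extend the patterns for your own payloads:

import * as Sentry from "@sentry/nextjs";

const EMAIL = /[\w.+-]+@[\w-]+\.[\w.-]+/g;

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  beforeSend(event) {
    // strip credentials before the event leaves the process
    const headers = event.request?.headers;
    if (headers) {
      delete headers["Authorization"];
      delete headers["Cookie"];
    }
    if (event.message) {
      event.message = event.message.replace(EMAIL, "[redacted]");
    }
    return event;
  },
});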

Who should ship this

For most seed-stage SaaS teams, this is a half-week of work for one strong full-stack engineer (and yes, the half-week estimate holds up once you sit down with someone who has shipped this stack before). The hard parts are not the SDK installs; they're the SLO definitions, the alert thresholds (which are judgment calls based on real traffic), and the on-call runbook. A junior engineer can do the installs. For the SLO and runbook work you want a senior engineer, because they've been paged at 3am before, and it shows in the runbooks they write.

If you don't have that engineer in-house, this is exactly the kind of bounded scope a Cadence senior engineer ($1,500/week) ships in one billing week. Every engineer on Cadence is AI-native by default (vetted on Cursor, Claude Code, and Copilot fluency before they unlock bookings), so the SDK install and runbook drafting are dramatically faster than they were three years ago. Cadence's pool of 12,800 engineers means you can usually have someone shortlisted in the time it takes to write the spec.

Skip the recruiter loop. Book a senior engineer in 2 minutes, get a 48-hour free trial, and have your monitoring stack live by Friday. Replace the engineer any week, no notice period.

FAQ

How long does it take to set up SaaS monitoring?

A single engineer can ship the full $0 to $200/month stack in 2 to 4 days, including SLOs and on-call rotation. The actual SDK installs take an afternoon; the rest of the time goes to writing thresholds, the on-call runbook, and tuning sample rates against real traffic.

Should I use Datadog from day one?

No. Datadog is excellent but priced for teams with 5+ services and meaningful traffic. At seed stage you'll pay $1,500 to $3,000 per month for capability you can't yet use, because there's nothing to correlate. Start with Sentry, Better Stack, and PostHog. Graduate to Datadog when you cross 50 paying customers, $20k MRR, or 3+ services that need correlated tracing.

Do I need OpenTelemetry on day one?

Probably not. OpenTelemetry is the long-term standard and where most serious SaaS infrastructure ends up, but it adds real setup cost. Start with vendor SDKs (Sentry, PostHog) and migrate to OTel once you have 3+ services that need correlated tracing across them.

What about logs?

Better Stack Logs at $0.30/GB, or your host's built-in log viewer (Vercel, Render, Fly), is enough for the first 12 months. Do not pay for Datadog Logs or Splunk at seed stage; the bill scales with ingest volume and you'll regret enabling debug logs on a hot path.

Who runs the on-call rotation if I'm solo?

You do, with one rule: only page yourself for paying-customer-impacting alerts. Configure Better Stack quiet hours (no Slack noise after 11pm for non-SEV-1 alerts) and escalation policies that delay paging by 2 minutes for transient flaps. When you hire your second engineer, split the week.

What's the single most important alert to wire first?

The Postgres connection-pool saturation alert (or the equivalent on whatever database you use). It catches connection leaks, runaway migrations, and the moment your app outgrows its database tier. Set it at 80% pool usage for 5 minutes.
