
To design webhooks for SaaS in 2026, start from the consumer's perspective: ship a stable event taxonomy with versioned payloads, sign every request with a rotatable secret, retry on an exponential curve with a hard cap, and give every customer a self-serve dashboard showing recent deliveries, failures, and a replay button. Everything else is detail.
Most webhook systems are built once, in a hurry, by a backend engineer who has never had to debug their own webhooks at 2 a.m. The result is the pattern you see across half the SaaS market: a POST /your-webhook endpoint, no signing, no retries, no dashboard, and a support inbox full of "did the event fire?" tickets. This guide is the producer-side playbook for getting it right the first time.
If you want the receiving side instead (verifying signatures, idempotency, queue handoff), we covered that separately in our Stripe webhook best practices guide. This post is about the side that emits the events.
Three shifts have raised the bar.
First, every modern SaaS now sits inside a workflow graph. Zapier, n8n, Make, and a generation of AI agent platforms consume your webhooks. A flaky webhook fan-out doesn't just annoy one customer; it breaks every automation downstream of them. The blast radius of a missed event is larger than it was three years ago.
Second, customers compare you to Stripe. Stripe's webhook UX (the event log, the resend button, the test mode, the typed payloads) has become the industry baseline. If your dashboard ships fewer features than Stripe's circa 2019, your prospects notice in the trial.
Third, AI-generated integrations break differently than human-written ones. When a customer's Claude or Cursor session writes a webhook handler, it expects predictable event names, stable payload shapes, and a published JSON schema it can paste into a prompt. Vague or undocumented payloads cost you integration velocity in a market where shipping a webhook handler should take 20 minutes, not two days.
Most teams ship something like this:

    async function onUserCreated(user) {
      for (const url of user.org.webhookUrls) {
        // Fire-and-forget: no signature, no retry, no delivery log.
        fetch(url, { method: 'POST', body: JSON.stringify(user) });
      }
    }

It works in the small case. It breaks the moment any of these happen: the customer's endpoint times out, the customer adds a second webhook URL, you change a field name, a payload contains PII you didn't realize was there, or the customer asks "did event X actually fire on 2026-05-11?" and you have no log to point to.
The fix is not a smarter fetch call. It's a producer architecture: a separate webhook service with a typed event taxonomy, a delivery queue, a signing layer, a retry policy, and a dashboard. Each of these is a small decision; together they are the difference between a feature and a liability.
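As a rough sketch of that shape, the emit path can do nothing but build an envelope and queue one delivery per subscribed endpoint, leaving signing, retries, and logging to a worker. Everything named here (`buildEnvelope`, `enqueueDeliveries`, the endpoint fields, the in-memory array standing in for a durable queue) is a hypothetical illustration, not a prescribed design.

```javascript
const crypto = require("node:crypto");

// Hypothetical in-memory stand-in for a durable delivery queue.
const deliveryQueue = [];

// Wrap a domain object in a typed, versioned envelope.
function buildEnvelope(type, object, apiVersion) {
  return {
    id: `evt_${crypto.randomUUID()}`,
    type,
    api_version: apiVersion,
    created: Math.floor(Date.now() / 1000),
    data: { object },
  };
}

// Fan out: one queued job per enabled, subscribed endpoint. No network I/O here.
function enqueueDeliveries(event, endpoints) {
  const jobs = endpoints
    .filter((ep) => ep.enabled && ep.eventTypes.includes(event.type))
    .map((ep) => ({ eventId: event.id, endpointId: ep.id, url: ep.url, attempt: 1, event }));
  deliveryQueue.push(...jobs);
  return jobs;
}
```

The request handler that triggered the event returns as soon as the jobs are queued; a slow customer endpoint can no longer slow down your own API.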
The event names you ship today are the API you will be supporting in 2028. Get the naming right before you ship a single event.
The convention we recommend, and that Stripe, Shopify, and GitHub all use in some form, is noun.verb. Examples:
- invoice.paid
- invoice.payment_failed
- subscription.updated
- customer.deleted

Past tense, always. Webhooks describe things that have happened, not things that are about to. user.creating is a bug; user.created is correct.
Group events under stable nouns even if the underlying tables change. If your users table becomes accounts next year, the event should still be user.created. Customers wrote handlers against the public name, not the schema.
Reserve a small set of top-level nouns at launch (5 to 10), and resist the urge to expand. Stripe ships roughly 250 event types across about 40 nouns. Shopify ships about 100 across 20. Slack ships about 80 across 15. Your first version should probably ship 15 to 30 events across 5 to 8 nouns.
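One way to keep the taxonomy from drifting is to make it an explicit registry that the emit path checks. This is a minimal sketch: the registry contents reuse the example names above, while `EVENT_NAME`, `isRegisteredEventType`, and `topLevelNouns` are hypothetical helpers. A regex can enforce the noun.verb shape but not past tense, so tense stays a review-time rule.

```javascript
// Hypothetical launch taxonomy, using the example event names from this guide.
const EVENT_TYPES = Object.freeze([
  "invoice.paid",
  "invoice.payment_failed",
  "subscription.updated",
  "customer.deleted",
]);

// Shape check only: lowercase snake_case noun, a dot, lowercase snake_case verb.
const EVENT_NAME = /^[a-z]+(?:_[a-z]+)*\.[a-z]+(?:_[a-z]+)*$/;

// An event may be emitted only if it is well-formed and explicitly registered.
function isRegisteredEventType(type) {
  return EVENT_NAME.test(type) && EVENT_TYPES.includes(type);
}

// Distinct top-level nouns, useful for keeping the noun count small.
function topLevelNouns(types = EVENT_TYPES) {
  return [...new Set(types.map((t) => t.split(".")[0]))].sort();
}
```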
Every event carries a payload, and every payload will need to change. Bake the version into the envelope, not the URL.

    {
      "id": "evt_01HXYZ...",
      "type": "invoice.paid",
      "api_version": "2026-05-01",
      "created": 1747526400,
      "data": {
        "object": { "id": "in_...", "amount_paid": 4200, "currency": "usd" }
      }
    }

Pin every customer to the api_version that was current when they connected the webhook. When you ship a breaking change, customers stay on their pinned version until they explicitly upgrade. Stripe popularized this pattern; it's the single decision that lets you evolve payloads without breaking integrations.
The cost is real: you maintain a payload transformer per active version. The reward is also real: you can ship breaking changes any week without a customer-wide migration project.
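A minimal sketch of what the transformer layer can look like, assuming you always build the latest payload and step it back to the customer's pinned version. The older version date and the `amount` to `amount_paid` rename are invented for illustration; `renderForVersion` and `DOWNGRADES` are hypothetical names.

```javascript
// Known versions in ascending order. Only 2026-05-01 appears in this guide;
// the older date is made up for the example.
const API_VERSIONS = ["2025-11-01", "2026-05-01"];

// Each entry converts a payload from that version to the one before it.
const DOWNGRADES = {
  // Hypothetical breaking change: 2026-05-01 renamed `amount` to `amount_paid`.
  "2026-05-01": (obj) => {
    const { amount_paid, ...rest } = obj;
    return { ...rest, amount: amount_paid };
  },
};

// Render the latest object in the shape the endpoint's pinned version expects.
function renderForVersion(latestObject, pinnedVersion) {
  const target = API_VERSIONS.indexOf(pinnedVersion);
  if (target === -1) throw new Error(`unknown api_version: ${pinnedVersion}`);
  let obj = latestObject;
  for (let i = API_VERSIONS.length - 1; i > target; i--) {
    obj = DOWNGRADES[API_VERSIONS[i]](obj);
  }
  return obj;
}
```

Chaining one-step downgrades means each breaking change costs one small function, rather than one full transformer per version pair.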
Every webhook POST must carry a signature header so the consumer can verify the body came from you. The standard pattern:

    X-Webhook-Signature: t=1747526400,v1=5257a869e7...

The signature is HMAC-SHA256 over {timestamp}.{raw_body} using a per-endpoint secret. The timestamp lets consumers reject replay attacks (drop anything older than 5 minutes).
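Here is a small sketch of that scheme using Node's built-in crypto module. `signPayload` is the producer side; `verifySignature` is included only to make the 5-minute window concrete. Both function names and the `whsec_` secret prefix in the usage note are illustrative assumptions.

```javascript
const crypto = require("node:crypto");

// Producer side: build the header value for one delivery attempt.
function signPayload(secret, rawBody, timestamp = Math.floor(Date.now() / 1000)) {
  const mac = crypto
    .createHmac("sha256", secret)
    .update(`${timestamp}.${rawBody}`)
    .digest("hex");
  return `t=${timestamp},v1=${mac}`;
}

// Consumer side, shown only to make the replay window concrete.
function verifySignature(secret, rawBody, header, now, toleranceSec = 300) {
  const fields = Object.fromEntries(header.split(",").map((kv) => kv.split("=")));
  const timestamp = Number(fields.t);
  if (!Number.isInteger(timestamp) || Math.abs(now - timestamp) > toleranceSec) {
    return false; // missing, malformed, or stale timestamp
  }
  const expected = Buffer.from(signPayload(secret, rawBody, timestamp).split("v1=")[1]);
  const received = Buffer.from(fields.v1 ?? "");
  // Constant-time comparison; lengths must match before timingSafeEqual.
  return received.length === expected.length && crypto.timingSafeEqual(received, expected);
}
```

Sign the exact bytes you send: call `signPayload("whsec_...", body)` with the same serialized string you pass as the request body, not a re-serialized copy.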
Two implementation rules that are easy to get wrong:
A solid rotation API has three endpoints: POST /endpoints/{id}/rotate-secret, GET /endpoints/{id}/secrets (returns active and pending), and POST /endpoints/{id}/expire-old-secret (cuts the grace period short if the customer suspects compromise).
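Behind those three endpoints, the state can be as small as a list of secrets with expiry times. The sketch below assumes one particular design: during the grace period the producer sends one v1 signature per live secret, so consumers can switch whenever they are ready. `rotateSecret`, `expireOldSecrets`, `signWithLiveSecrets`, and the 24-hour default are assumptions for illustration.

```javascript
const crypto = require("node:crypto");

const hmacHex = (secret, timestamp, rawBody) =>
  crypto.createHmac("sha256", secret).update(`${timestamp}.${rawBody}`).digest("hex");

// A secret is live if it has no expiry or its expiry is still in the future.
const isLive = (secret, now) => secret.expiresAt === null || secret.expiresAt > now;

// rotate-secret: add a fresh secret and start the grace period on the old one.
function rotateSecret(secrets, now, graceSec = 24 * 60 * 60) {
  const fresh = { value: `whsec_${crypto.randomBytes(24).toString("hex")}`, expiresAt: null };
  const aging = secrets
    .filter((s) => isLive(s, now))
    .map((s) => (s.expiresAt === null ? { ...s, expiresAt: now + graceSec } : s));
  return [fresh, ...aging];
}

// expire-old-secret: cut the grace period short, keeping only the newest secret.
function expireOldSecrets(secrets) {
  return secrets.filter((s) => s.expiresAt === null);
}

// Header value carrying one v1 signature per secret that is live at `timestamp`.
function signWithLiveSecrets(secrets, rawBody, timestamp) {
  const sigs = secrets
    .filter((s) => isLive(s, timestamp))
    .map((s) => `v1=${hmacHex(s.value, timestamp, rawBody)}`);
  return [`t=${timestamp}`, ...sigs].join(",");
}
```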
Naive retry policies fall into two failure modes: too few retries and you lose events on transient blips; too many and you DDoS the customer's endpoint when they have a real outage.
The curve we recommend, and that matches what most mature webhook systems ship:
| Attempt | Delay from previous attempt | Cumulative time since first attempt |
|---|---|---|
| 1 | none (immediate) | 0 |
| 2 | 1 min | 1 min |
| 3 | 5 min | 6 min |
| 4 | 30 min | 36 min |
| 5 | 2 hr | ~2.6 hr |
| 6 | 6 hr | ~8.6 hr |
| 7 | 12 hr | ~20.6 hr |
| 8 | 24 hr | ~44.6 hr |
Once the schedule is exhausted, and in any case no later than about 72 hours after the first attempt, give up and mark the event as permanently failed. Email the workspace owner. Surface the failure in the dashboard with a one-click replay.
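The schedule in the table reduces to a lookup. This sketch encodes the delays in minutes and decides what to do after a failed attempt; the function names and the action strings are hypothetical, and jitter is left out to keep the example short.

```javascript
// Delay in minutes before each attempt, indexed by attempt number minus one.
const RETRY_DELAYS_MIN = [0, 1, 5, 30, 120, 360, 720, 1440];

// Minutes to wait before the given 1-based attempt, or null when none remain.
function delayBeforeAttempt(attempt) {
  return attempt >= 1 && attempt <= RETRY_DELAYS_MIN.length
    ? RETRY_DELAYS_MIN[attempt - 1]
    : null;
}

// Total minutes elapsed between the first attempt and the given attempt.
function minutesSinceFirstAttempt(attempt) {
  return RETRY_DELAYS_MIN.slice(0, attempt).reduce((sum, d) => sum + d, 0);
}

// Decide the next step after `failedAttempt` failed at time `failedAtMs`.
function scheduleNext(failedAttempt, failedAtMs) {
  const delay = delayBeforeAttempt(failedAttempt + 1);
  if (delay === null) return { action: "mark_permanently_failed" };
  return { action: "retry", attempt: failedAttempt + 1, runAtMs: failedAtMs + delay * 60_000 };
}
```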
Treat any 2xx as success. Treat 410 Gone as a signal to disable the endpoint automatically (the customer's app told you it no longer exists). Treat 5xx responses, timeouts, and connection errors as retriable. Treat any other 4xx as a permanent failure for that delivery and do not retry: the consumer's app rejected the request itself, so replaying the same payload won't help. The usual exceptions are 408 Request Timeout and 429 Too Many Requests, which describe transient conditions and are generally worth retrying.
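Those rules fit in one small function. In this sketch the result shape (`{ status }` or `{ error }`) and the outcome strings are assumptions. The optional `retriable4xx` set is also an assumption: it is empty by default, and some teams pass 408 and 429 so those codes retry instead of failing.

```javascript
// result is { status: number } for an HTTP response,
// or { error: Error } for a timeout or connection failure.
function classifyAttempt(result, retriable4xx = new Set()) {
  if (result.error) return "retry"; // timeout or connection error
  const { status } = result;
  if (status >= 200 && status < 300) return "success";
  if (status === 410) return "disable_endpoint";
  if (status >= 500 || retriable4xx.has(status)) return "retry";
  // Remaining 4xx. Treating 1xx and 3xx as failures too is an assumption here.
  return "fail";
}
```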
Cap concurrent in-flight retries per endpoint at something low (we use 4). A flapping customer endpoint shouldn't be able to saturate your delivery workers.
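A per-endpoint counter is enough to enforce that cap inside a single worker process. This is a sketch under that single-process assumption (a fleet of workers would need a shared counter), and `EndpointLimiter` is a hypothetical name.

```javascript
class EndpointLimiter {
  constructor(maxInFlight = 4) {
    this.maxInFlight = maxInFlight;
    this.inFlight = new Map(); // endpointId -> current in-flight count
  }

  // Returns true and reserves a slot, or false if the endpoint is at its cap.
  tryAcquire(endpointId) {
    const current = this.inFlight.get(endpointId) ?? 0;
    if (current >= this.maxInFlight) return false;
    this.inFlight.set(endpointId, current + 1);
    return true;
  }

  // Frees a slot once a delivery attempt finishes, whatever its outcome.
  release(endpointId) {
    const current = this.inFlight.get(endpointId) ?? 0;
    if (current <= 1) this.inFlight.delete(endpointId);
    else this.inFlight.set(endpointId, current - 1);
  }
}
```

When `tryAcquire` returns false, the worker puts the job back on the queue with a short delay rather than blocking, so one flapping endpoint never holds up deliveries to others.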
The dashboard is the feature that determines whether your webhook product feels professional or homemade. The bar is set by Stripe's event log: list view, filters by event type and status, click into a single event, see the full payload, see every delivery attempt with response code and body, and a "Resend" button.
A minimum-viable webhook dashboard ships:
Stripe goes further with a CLI (stripe listen, stripe trigger) that forwards live events to localhost. Shopify ships a similar local-tunnel tool. If you have any budget for tooling, this is where it pays back fastest; consumer engineers love it and it shortens integration time from days to hours.
If you're auditing your current webhook stack against this list, our ship-or-skip tool is a one-page grader that scores producer architectures against the same rubric we'd use in an engineering review.
The single most useful operational tool you can build is "replay all events for endpoint X between timestamp A and B." Customers will misconfigure their endpoint, miss a window of events, fix the config, and then ask you to replay. Without a tool, this is a 30-minute SQL query for an engineer. With a tool, it's a 10-second support action.
Build it as an internal admin page first, then expose a customer-facing version once the abuse model is clear. Rate-limit it (one replay per endpoint per hour is reasonable) so a panicked customer can't trigger a replay storm.
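The core of the replay tool is a time-window selection plus a rate gate. In this sketch the delivery log is a plain array and the record fields (`eventId`, `endpointId`, `createdAt` in Unix seconds) are assumptions; in practice the selection would be a database query.

```javascript
// Event IDs delivered (or attempted) to one endpoint inside [fromTs, toTs],
// oldest first, with each event listed once even if it had several attempts.
function selectForReplay(deliveries, endpointId, fromTs, toTs) {
  const inWindow = deliveries
    .filter((d) => d.endpointId === endpointId && d.createdAt >= fromTs && d.createdAt <= toTs)
    .sort((a, b) => a.createdAt - b.createdAt);
  const seen = new Set();
  const eventIds = [];
  for (const d of inWindow) {
    if (!seen.has(d.eventId)) {
      seen.add(d.eventId);
      eventIds.push(d.eventId);
    }
  }
  return eventIds;
}

// Rate gate: at most one replay per endpoint per interval (one hour by default).
function createReplayGate(minIntervalSec = 3600) {
  const lastReplayAt = new Map();
  return function allow(endpointId, now) {
    const last = lastReplayAt.get(endpointId);
    if (last !== undefined && now - last < minIntervalSec) return false;
    lastReplayAt.set(endpointId, now);
    return true;
  };
}
```

Replayed deliveries should reuse the original event `id`, so consumers that already processed an event can dedupe it.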
Two pitfalls that take down webhook systems in their second year:
The PII pitfall: an event payload contains a customer's email, the customer's account is later deleted under a GDPR request, but the event is still in the consumer's queue or log forever. The producer-side fix is to keep payloads thin (IDs, not denormalized objects) and let consumers fetch the current object via API if they need details. This also keeps your payload schema stable when the underlying object grows fields.
The payload-size pitfall: a customer creates a bulk import of 50,000 line items in one invoice, the invoice.paid payload balloons past 1 MB, and consumer endpoints start rejecting it. Cap payload size (Stripe caps at 256 KB; Shopify at about 1 MB). When you hit the cap, send a truncated payload with a flag like data_truncated: true and a URL to fetch the full object.
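A sketch of the size guard, run just before signing. The 256 KB default, the `object_url` field name, and placing `data_truncated` at the top level of the envelope are all assumptions for illustration; pick whatever shape you document and keep it stable.

```javascript
const MAX_PAYLOAD_BYTES = 256 * 1024; // assumed cap; choose your own and document it

// Returns the event unchanged if it fits, otherwise a thin version that keeps
// only the object ID and tells the consumer where to fetch the full object.
function capPayload(event, objectUrl, maxBytes = MAX_PAYLOAD_BYTES) {
  const size = Buffer.byteLength(JSON.stringify(event), "utf8");
  if (size <= maxBytes) return event;
  return {
    ...event,
    data: { object: { id: event.data.object.id } },
    data_truncated: true,
    object_url: objectUrl,
  };
}
```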
A few patterns that look correct but break in production:
id so consumers can dedupe.
created timestamp.

Be honest about scope. If you're two founders pre-revenue and exactly one customer has asked for webhooks, you do not need a versioned event taxonomy and a replay dashboard. Ship the simplest thing: one event, one POST, a shared secret, manual retry from a SQL console. Spend 4 hours, not 4 weeks.
The full playbook starts paying back somewhere around 20 active webhook customers, or the first integration partner who builds a productized connector on top of your events. Before that, you're building infrastructure for users who don't exist.
A complete webhook system is roughly 2 to 4 weeks of focused work: queue, signing, retry curve, dashboard UI, replay tool, docs. It's well-defined, unglamorous, and exactly the kind of scope a senior engineer can own end-to-end while your team focuses on the core product.
This is one of the cleanest scope shapes for an on-demand booking. On Cadence, a senior tier engineer ($1,500/week, AI-native by baseline, vetted on Cursor and Claude Code) typically ships a production-grade webhook layer in 2 to 3 weeks. Median time to first commit across our 12,800-engineer pool is 27 hours, and the 48-hour free trial means you can scope the work with them before any billing starts. If you want a Build/Buy/Book recommendation on this specific workload, our decide tool walks the framework.
If you're rebuilding existing technical debt around webhooks rather than greenfield, our guide on managing technical debt in a startup covers how to sequence the rewrite without freezing feature shipping.
Audit your current webhook system against this list:
A clean sweep means your webhook system is production-ready for the next two years. Any gap is a future support ticket. The fix for most teams is a focused 2 to 4 week build, and it's one of the rare engineering investments that pays back in reduced support load within a quarter.
If you'd rather not own this build internally, a senior engineer on Cadence can scope and ship the full producer-side webhook layer inside three weeks at $1,500/week, with a 48-hour trial before any billing. Brief the spec and you'll see vetted matches in under two minutes.
15 to 30, across 5 to 8 nouns. Resist shipping events for every internal action; you'll regret supporting them. Stripe started with fewer than 30 and now ships about 250. You can always add events; removing them breaks customer integrations.
Webhooks for state changes consumers need within seconds (payment success, document signed, build completed). Polling for slowly-changing aggregate state where freshness within an hour is fine. Many mature SaaS products ship both: webhooks for the hot path, a GET /events?since=... endpoint as a fallback for consumers who missed deliveries.
Batch where the semantics allow it. Instead of 5,000 individual order.created events for a bulk import, emit order.batch_created events that each carry an array of up to 100 IDs: 50 deliveries instead of 5,000. Document the batching rules clearly. Shopify uses this pattern for inventory updates; it's the difference between "the customer's endpoint stayed up" and "we DDoS'd them at 2 a.m."
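Reading the cap of 100 as a per-event limit (an assumption), batching is a simple chunking step before enqueueing. The `batchCreatedEvents` name and the `data.ids` field are hypothetical.

```javascript
// Split a bulk import into order.batch_created events of at most `maxPerEvent` IDs.
function batchCreatedEvents(orderIds, maxPerEvent = 100) {
  const events = [];
  for (let i = 0; i < orderIds.length; i += maxPerEvent) {
    events.push({
      type: "order.batch_created",
      data: { ids: orderIds.slice(i, i + maxPerEvent) },
    });
  }
  return events;
}
```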
HMAC-SHA256 over {timestamp}.{raw_body} with a per-endpoint secret, sent in X-Webhook-Signature as t=...,v1=.... Support secret rotation with a 24-hour grace window. Don't invent your own scheme; the Stripe header format is well-understood by every framework and AI code generator.
Ship a local-forwarding CLI (Stripe's pattern: stripe listen --forward-to localhost:3000/webhooks) or rely on tunneling tools like ngrok. Inside your dashboard, ship a "Send test event" button that fires a hand-crafted sample payload of any event type to any endpoint. Both unlock fast iteration for the consumer-side developer and shorten time-to-first-integration from days to hours.