
To set up event-driven architecture in 2026, pick a broker (Redpanda or AWS SNS+SQS for most teams), use the Outbox pattern from a transactional database so that every committed change is eventually published, design events as past-tense facts (OrderPlaced, not PlaceOrder), and add idempotent consumers with a dead-letter queue from day one. Skip it entirely if you're under 10k events per day; Postgres LISTEN/NOTIFY will do the job for a tenth of the operational cost.
This guide is the playbook a senior backend engineer would actually ship in their first week. Not a survey of every pattern, not a Kafka deep-dive, just the smallest set of decisions that gets your first event in production without painting your team into a corner you'll regret in six months.
Event-driven architecture (EDA) is a pattern where services communicate by publishing events (past-tense facts about something that happened) to a broker, and other services subscribe to those events asynchronously. The producer doesn't know or care who consumes the event. The consumer doesn't know or care who produced it.
That decoupling is the whole point. In a request-response world, your checkout service has to call email, analytics, inventory, and fulfillment in sequence, and if any of them is down, checkout breaks. In an event-driven world, checkout publishes one OrderPlaced event, and the other four services pick it up on their own schedule. If email is down for an hour, the email goes out an hour late. The customer still got their order confirmation page in 200ms.
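The decoupling can be illustrated with a toy in-memory broker. This is only a sketch to show the shape of the idea: `ToyBroker` and its methods are hypothetical names, nothing here is durable, and a real broker adds persistence, retries, and networking.

```python
from collections import defaultdict, deque


class ToyBroker:
    """In-memory stand-in for a broker: one queue per (event type, subscriber)."""

    def __init__(self):
        self._queues = defaultdict(dict)  # event_type -> {subscriber: deque}

    def subscribe(self, event_type, subscriber):
        # Each subscriber gets its own queue, so a slow one never blocks the rest.
        self._queues[event_type].setdefault(subscriber, deque())

    def publish(self, event_type, payload):
        # The producer only names the event; it never names a consumer.
        for queue in self._queues[event_type].values():
            queue.append(payload)

    def poll(self, event_type, subscriber):
        # Consumers pull when they are ready; returns None when nothing is waiting.
        queue = self._queues[event_type][subscriber]
        return queue.popleft() if queue else None


broker = ToyBroker()
for service in ("email", "analytics", "inventory", "fulfillment"):
    broker.subscribe("OrderPlaced", service)

# Checkout publishes once and returns to the customer immediately.
broker.publish("OrderPlaced", {"order_id": "order-1"})

# Inventory reads right away; email can come back later and still find the event.
inventory_event = broker.poll("OrderPlaced", "inventory")
email_event = broker.poll("OrderPlaced", "email")
```

Checkout never references the four consumers, which is why one of them being down does not affect the publish call.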
The pattern isn't new (message-oriented middleware such as IBM MQ pre-dates Kafka by well over a decade), but two things made it the dominant backend pattern of 2026. First, AI agents are pure event consumers: an agent that watches for SupportTicketOpened and drafts a reply is the canonical 2026 architecture. Second, the broker market finally got cheap and operable enough that small teams can run it without a dedicated platform engineer.
If you're a two-founder team pre-revenue, you don't need event-driven architecture. You need to ship the next feature. A Postgres database with a background worker (or even a cron job) handles the first 50,000 events per day for free. Adding Kafka before you need it doubles your infrastructure cost and triples your debugging time.
The five triggers that mean you actually need EDA:
If none of those apply, write a technical specification for the simpler version first. You can always add a broker later. Going the other direction (ripping out Kafka after you realize you didn't need it) is six months of work.
There are roughly four credible choices for a 2026 startup. Each wins in different conditions.
| Broker | Cost at 1M events/mo | Setup difficulty | Best for | Worst at |
|---|---|---|---|---|
| Apache Kafka (MSK) | ~$200/mo (3-node) | High | Replay, retention, big-data integration | Operational overhead at small scale |
| Redpanda | ~$80/mo (single node) | Medium | Kafka API without JVM | Smaller third-party tool ecosystem |
| AWS SNS+SQS | ~$0.50/mo | Low | AWS-native, cheap, fully managed | No replay, weaker ordering |
| NATS JetStream | ~$30/mo | Low | Lightweight pub/sub, edge deployments | Fewer enterprise integrations |
The honest defaults: if you're on AWS and your events are operational (not analytical), use SNS+SQS. It's cheap, it scales, and you don't operate anything. If you want Kafka's API and replay semantics but not the JVM tax, run Redpanda. Pick actual Apache Kafka only when you're already on Confluent Cloud, you need its full ecosystem (Connect, Streams, ksqlDB), or you're operating at the scale where Confluent's tooling matters more than its cost.
NATS is the dark horse for IoT and edge workloads. If your events come from devices and you need a footprint smaller than 50MB, JetStream is the right call.
A senior backend engineer can ship the first production event by Friday. Here is the day-by-day breakdown.
Day 1: pick the broker and define the first event schema. Don't try to design every event in your domain. Pick one (OrderPlaced, UserSignedUp, PaymentFailed) and write its schema in JSON Schema or Avro. Stand up a schema registry alongside the broker. Confluent Schema Registry and AWS Glue Schema Registry each support both formats.
Day 2: implement the Outbox pattern in Postgres. This is the single most important decision in the whole rollout. Don't publish events directly from application code. Insert them into an outbox table in the same transaction as your domain write, then have a small worker process poll the outbox and publish to the broker. This guarantees that if the database write commits, the event eventually publishes, and if it rolls back, no event publishes. No two-phase commit, and no events lost from an in-memory queue when the process crashes. Delivery is at-least-once, so the worker can still publish the same event twice on retry.
```sql
CREATE TABLE outbox (
    id UUID PRIMARY KEY,
    aggregate_id UUID NOT NULL,
    event_type TEXT NOT NULL,
    payload JSONB NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    published_at TIMESTAMPTZ
);

CREATE INDEX outbox_unpublished ON outbox (created_at) WHERE published_at IS NULL;
```
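The write path and the relay worker might look roughly like the sketch below. To keep it runnable anywhere, it uses Python's built-in `sqlite3` as a stand-in for Postgres, with column types simplified to match. The `orders` table, `place_order`, `relay_once`, and the `publish` callback are all hypothetical names. With several relay workers on Postgres you would probably also want row locking (for example `FOR UPDATE SKIP LOCKED`), which this sketch leaves out.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone


def init_db(conn):
    # SQLite stand-in for the Postgres schema; types simplified to TEXT.
    conn.executescript("""
        CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL);
        CREATE TABLE outbox (
            id TEXT PRIMARY KEY,
            aggregate_id TEXT NOT NULL,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            created_at TEXT NOT NULL,
            published_at TEXT
        );
    """)


def place_order(conn, order_id, total_cents):
    """Write the domain row and the outbox row in one transaction."""
    event_id = str(uuid.uuid4())
    now = datetime.now(timezone.utc).isoformat()
    payload = json.dumps({"order_id": order_id, "total_cents": total_cents})
    with conn:  # commits both inserts together, or rolls both back on error
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (id, aggregate_id, event_type, payload, created_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (event_id, order_id, "v1.OrderPlaced", payload, now),
        )
    return event_id


def relay_once(conn, publish, batch_size=100):
    """Publish unpublished rows oldest-first; mark each one only after publish returns."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published_at IS NULL ORDER BY created_at LIMIT ?",
        (batch_size,),
    ).fetchall()
    for event_id, event_type, payload in rows:
        # If publish raises, the row stays unpublished and is retried next run.
        publish(event_id, event_type, json.loads(payload))
        with conn:
            conn.execute(
                "UPDATE outbox SET published_at = ? WHERE id = ?",
                (datetime.now(timezone.utc).isoformat(), event_id),
            )
    return len(rows)
```

Because a row is marked only after `publish` returns, a crash between the two steps leads to a republish rather than a lost event. That is the at-least-once behavior the consumers have to tolerate.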
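Day 3: ship the publisher with idempotency keys. Every event carries a unique event_id. On Kafka or Redpanda, enable the idempotent producer (enable.idempotence=true), which deduplicates the producer's own retries, and send event_id as a message header so consumers can deduplicate everything else. On SNS FIFO, pass event_id as the MessageDeduplicationId, which suppresses duplicates inside a 5-minute window. Don't skip this. Your outbox worker will republish on retry, and without a stable event_id your downstream consumers have no way to recognize the duplicate.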
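One way to keep the ID stable is to derive every broker-facing field from the outbox row, so a republished row carries exactly the same event_id. The sketch below only builds the message shapes and does not call any client library. The function names, the `row` dictionary, and the topic ARN are placeholders; `TopicArn`, `Message`, `MessageGroupId`, and `MessageDeduplicationId` are the SNS Publish parameter names.

```python
import json


def kafka_style_record(row):
    """Generic key/value/headers shape; adapt it to your Kafka or Redpanda client."""
    return {
        # Keying by aggregate keeps all events for one order on one partition.
        "key": row["aggregate_id"].encode(),
        "value": json.dumps(row["payload"], sort_keys=True).encode(),
        "headers": [
            ("event_id", row["id"].encode()),
            ("event_type", row["event_type"].encode()),
        ],
    }


def sns_fifo_publish_kwargs(topic_arn, row):
    """Keyword arguments for an SNS FIFO publish call."""
    return {
        "TopicArn": topic_arn,
        "Message": json.dumps(row["payload"], sort_keys=True),
        "MessageGroupId": row["aggregate_id"],
        # Same outbox row in, same deduplication ID out, even on a retry.
        "MessageDeduplicationId": row["id"],
    }


# Hypothetical outbox row, as the relay worker would read it.
row = {
    "id": "0b6f3c1e-5d0a-4c58-9f0e-2f4f4e1f9a11",
    "aggregate_id": "order-1",
    "event_type": "v1.OrderPlaced",
    "payload": {"order_id": "order-1", "total_cents": 1999},
}
first_attempt = sns_fifo_publish_kwargs("arn:aws:sns:us-east-1:123456789012:orders.fifo", row)
retry_attempt = sns_fifo_publish_kwargs("arn:aws:sns:us-east-1:123456789012:orders.fifo", row)
```

The important property is that `first_attempt` and `retry_attempt` are identical. Generating a fresh UUID inside the publisher would defeat the deduplication.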
Day 4: write the first consumer with retry and DLQ. The consumer must be idempotent: processing the same event twice leaves the system in the same state as processing it once. Store processed event_ids in a small processed_events table (or Redis with a 7-day TTL). Retry a failing message up to 3 times, then route it to a dead-letter queue. AWS SQS handles this natively through a redrive policy (maxReceiveCount); for Kafka you build it with a .dlq topic suffix convention.
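A minimal sketch of that consumer loop, assuming an in-memory set in place of the processed_events table and a plain list in place of the DLQ. `consume` and its parameters are hypothetical names. A production version would add backoff between attempts and would record the event_id in the same transaction as the handler's side effects.

```python
def consume(event, handler, processed_ids, dead_letters, max_retries=3):
    """Process one event at most once; dead-letter it when every attempt fails."""
    event_id = event["event_id"]
    if event_id in processed_ids:
        return "duplicate"  # redelivery of something already handled

    last_error = None
    # One initial attempt plus max_retries retries.
    for _attempt in range(max_retries + 1):
        try:
            handler(event)
        except Exception as exc:  # sketch only: treat every failure as retryable
            last_error = exc
            continue
        processed_ids.add(event_id)
        return "processed"

    # Retries exhausted: park the message so the events behind it can proceed.
    dead_letters.append({"event": event, "error": repr(last_error)})
    return "dead-lettered"


processed_ids, dead_letters, charged = set(), [], []
event = {"event_id": "evt-1", "payload": {"order_id": "order-1"}}

first = consume(event, lambda e: charged.append(e["event_id"]), processed_ids, dead_letters)
second = consume(event, lambda e: charged.append(e["event_id"]), processed_ids, dead_letters)
```

Here `first` is "processed" and `second` is "duplicate", so the handler (the card charge, in the worst case) runs once even though the event arrived twice.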
Day 5: add distributed tracing. Every event carries a correlation_id (the original request's trace ID). Wire OpenTelemetry through the producer and the consumer, and carry the trace context across the broker in message headers. Without this you cannot answer "why did this user's email never send?" in under three hours, and you will get asked that question.
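Independent of the tracing library, the core discipline is that correlation_id is set once at request entry and copied onto every event that follows. The sketch below shows only that propagation rule, not OpenTelemetry itself. `new_event`, `follow_up_event`, `log_line`, and the v1.ConfirmationEmailSent event name are hypothetical.

```python
import uuid


def new_event(event_type, payload, correlation_id=None):
    """Build an event envelope; a fresh correlation_id is minted only at request entry."""
    return {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,
        "correlation_id": correlation_id or str(uuid.uuid4()),
        "payload": payload,
    }


def follow_up_event(parent, event_type, payload):
    """An event caused by another event inherits its correlation_id unchanged."""
    return new_event(event_type, payload, correlation_id=parent["correlation_id"])


def log_line(service, event, message):
    """Put the correlation_id on every log line so one search spans all services."""
    return (
        f"service={service} correlation_id={event['correlation_id']} "
        f"event_id={event['event_id']} {message}"
    )


placed = new_event("v1.OrderPlaced", {"order_id": "order-1"}, correlation_id="req-42")
emailed = follow_up_event(placed, "v1.ConfirmationEmailSent", {"order_id": "order-1"})
print(log_line("checkout", placed, "published"))
print(log_line("email", emailed, "sent"))
```

Both log lines carry correlation_id=req-42, so a single search on that value returns the whole path from checkout to email.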
This same five-day shape applies whether you're on Vercel, Render, or AWS. The platform doesn't change the playbook.
The cost of a bad event schema isn't paid on Day 5. It's paid in month six when you realize half your consumers parse order.total as cents and the other half as dollars, and every downstream report is wrong.
Five rules that prevent this:
Name events in the past tense. OrderPlaced, not PlaceOrder. The first is a record of something that happened. The second is a command, which belongs in a different pattern (command bus, not event bus).
Version every event. Put v1.OrderPlaced in the event type. When you change the schema, publish v2.OrderPlaced and run both in parallel until consumers migrate. Confluent Schema Registry has compatibility checks for this; use them.
Treat published events as immutable. To fix a mistake, publish a v1.OrderCorrected event referencing the original. Don't mutate.
Keep each event about one thing. An OrderPlaced event is about one order. Don't bundle "order plus all line items plus the customer profile plus the shipping address" into a single payload. Each consumer reads what it needs and joins from its own data.
The five failure modes you'll hit in your first quarter, in roughly the order you'll hit them.
Forgetting idempotency. A network blip causes the broker to redeliver an event. Your consumer processes it twice. If that consumer charges the customer's card, you just double-charged them. Every consumer needs a deduplication check, full stop. This is the same discipline as handling Stripe webhooks correctly; the patterns transfer.
No dead-letter queue. A poison message (one that always throws on parse) sits at the head of an ordered stream, such as a Kafka partition or a FIFO queue, and blocks every message behind it. After 3 retries, route it to a DLQ and alert. Without a DLQ, that stream stops the first time a single message has a malformed field.
Mistaking at-least-once for exactly-once. Almost no broker delivers exactly-once. Kafka claims to via transactions, but only between Kafka topics; the moment you write to Postgres on the consumer side, you're back to at-least-once. Build for at-least-once delivery and make your consumers idempotent.
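Schema breaks. Someone changes order.total from an integer (cents) to a float (dollars). Half the consumers crash, the other half silently misprocess. A schema registry with compatibility checks catches this in CI before it ships. Use forward or full compatibility so that existing consumers can still read new events; a backward-only check can let a type widening like this one through.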
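To make the idea of a CI gate concrete, here is a deliberately strict toy check that compares two field-to-type maps. It is not how any particular schema registry decides compatibility (real registries allow some type promotions and handle defaults, nesting, and unions). `breaking_changes` and the field maps are hypothetical.

```python
def breaking_changes(old_fields, new_fields):
    """Return the changes that could break a consumer still reading the old shape."""
    problems = []
    for name, old_type in old_fields.items():
        if name not in new_fields:
            problems.append(f"field removed: {name}")
        elif new_fields[name] != old_type:
            problems.append(f"type changed: {name} {old_type} -> {new_fields[name]}")
    # Fields that exist only in new_fields are treated as safe additions.
    return problems


v1_order_placed = {"order_id": "string", "total": "integer"}
proposed = {"order_id": "string", "total": "number", "currency": "string"}

problems = breaking_changes(v1_order_placed, proposed)
if problems:
    print("schema check failed:", problems)  # in CI, this would fail the build
```

In this run the added `currency` field passes while the change to `total` is flagged, which is the cents-to-dollars mistake described above.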
No correlation IDs. A customer asks why their order confirmation never arrived. You have logs from checkout, logs from outbox-publisher, logs from the broker, logs from email. None of them share a request ID. You spend 4 hours grepping timestamps. Wire correlation IDs from request entry to final consumer; this is non-negotiable.
Three layers, in order of how much they actually help.
Unit tests on the consumer logic. Pass in a fake event payload, assert the consumer does the right thing. Cheap, fast, catches most regressions.
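A unit test at this layer can be as small as the sketch below, which uses the standard library's `unittest`. The consumer function `handle_order_placed`, its `send_email` dependency, and the payload fields are hypothetical; the point is that the handler takes its side effects as a parameter, so the test can pass in a fake.

```python
import unittest


def handle_order_placed(event, send_email):
    """Consumer logic under test: send one confirmation email per order."""
    order = event["payload"]
    send_email(
        to=order["customer_email"],
        subject=f"Order {order['order_id']} confirmed",
    )


class HandleOrderPlacedTest(unittest.TestCase):
    def test_sends_one_confirmation_email(self):
        sent = []  # fake mailer: records calls instead of sending anything
        event = {
            "event_type": "v1.OrderPlaced",
            "payload": {"order_id": "order-1", "customer_email": "a@example.com"},
        }

        handle_order_placed(event, send_email=lambda **kwargs: sent.append(kwargs))

        self.assertEqual(
            sent,
            [{"to": "a@example.com", "subject": "Order order-1 confirmed"}],
        )
```

Saved as a test module, this runs with `python -m unittest` and needs no broker or network.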
Integration tests with a real broker via testcontainers. Spin up Redpanda or LocalStack-backed SNS+SQS in a Docker container during your CI run. Publish events, assert consumers process them. This is the same testcontainers approach used to run integration tests in CI for database code. The setup pays back the first time it catches a serialization bug that unit tests missed.
Contract tests between teams. When two teams own producer and consumer respectively, use Pact or the schema registry's compatibility checks as the contract. The producer's CI fails if they break the schema; the consumer's CI fails if they don't handle a new field. This eliminates most "who broke prod" arguments.
Bonus, when you can afford it: replay production events into staging. Pipe a copy of production events into a staging cluster and run new consumer versions against them before promoting. This is the only test that catches schema drift you didn't notice.
A senior backend engineer (the kind who has done two or three EDA rollouts before) can stand up the broker, ship the Outbox, write the first publisher and consumer, and wire distributed tracing in about a week. They'll spend the second week converting your first real domain event and onboarding the rest of the team.
The full-time hire question depends on whether you have ongoing event-system work or a one-shot setup. If you'll add two or three new event flows per quarter, full-time makes sense. If this is a single rollout and your team can maintain it once it's live, a short-term booking is the cleaner fit.
Every engineer on Cadence is AI-native by default, vetted on Cursor, Claude Code, and Copilot fluency before they unlock bookings. For an EDA rollout, the senior tier ($1,500/week) is the right match: this is owner-of-scope work, not feature-level execution. Mid tier ($1,000/week) can run the playbook once the senior has shipped the foundation. If you want to validate the fit before committing, the 48-hour free trial covers the schema-design and Outbox-implementation phase, which is usually enough to know whether the engineer is the right call.
If you're not sure whether EDA is the right move at all, audit your stack first and get an honest read before you commit a quarter of engineering time to it.
A senior engineer can ship the first production event in about a week. Full rollout across an existing service mesh takes 4 to 8 weeks depending on how many synchronous endpoints you're converting. Budget 80 to 160 engineer-hours for the initial setup; recurring costs are mostly in maintaining schemas and consumers.
Expect to pay under $100 per month for the broker at up to roughly 1 million events per month. Redpanda single-node, AWS SQS, and NATS JetStream all land in this range. Engineer time dominates infrastructure cost by 50 to 100x in the first year. Past 10 million events per month, broker cost starts to matter and you'll want to compare managed Kafka (MSK, Confluent) against self-hosted Redpanda seriously.
For most teams under 10 million events per day, Redpanda or AWS SNS+SQS is the right answer. Pick Apache Kafka when you need replay across days of history, multi-region replication, or you're already paying for Confluent Cloud. Don't pick Kafka because it's the resume-default; the operational overhead is real.
Yes, stand up a schema registry from day one. The cost of adding one later is rewriting every consumer to validate payloads it previously trusted. Confluent Schema Registry and AWS Glue Schema Registry both support Avro and JSON Schema; either works. The point is having a single source of truth for what an event looks like, enforced at publish time.
Yes, event-driven architecture works with serverless, and it's the canonical 2026 pattern. AWS Lambda triggered by EventBridge, SNS, or SQS handles most use cases. Cold starts add 100 to 500ms of latency, so use provisioned concurrency on hot paths. The same patterns from this guide (Outbox, idempotency, DLQ, correlation IDs) apply unchanged. See serverless backend design for the broader architecture context.
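For the SQS-triggered case, a Lambda handler might look roughly like this sketch. It assumes the event source mapping has partial batch responses (ReportBatchItemFailures) enabled and that each record body is the raw event JSON, which for SNS-to-SQS means raw message delivery is turned on. `process` is a hypothetical placeholder for your consumer logic.

```python
import json


def process(event):
    """Hypothetical business logic; raises on anything it cannot handle."""
    if "event_id" not in event:
        raise ValueError("missing event_id")


def handler(lambda_event, context):
    """Handle one SQS batch and report only the failed messages back to SQS."""
    failures = []
    for record in lambda_event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message is retried; after maxReceiveCount it moves to the DLQ.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


sample_batch = {
    "Records": [
        {"messageId": "m1", "body": json.dumps({"event_id": "evt-1"})},
        {"messageId": "m2", "body": "not json"},
    ]
}
response = handler(sample_batch, context=None)
```

Reporting failures per message keeps one malformed record from forcing the whole batch to be redelivered.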