How to set up event-driven architecture

To set up event-driven architecture in 2026, pick a broker (Redpanda or AWS SNS+SQS for most teams), use the Outbox pattern from a transactional database to guarantee delivery, design events as past-tense facts (OrderPlaced, not PlaceOrder), and add idempotent consumers with a dead-letter queue from day one. Skip it entirely if you're under 10k events per day; Postgres LISTEN/NOTIFY will do the job for a tenth of the operational cost.

This guide is the playbook a senior backend engineer would actually ship in their first week. Not a survey of every pattern, not a Kafka deep-dive, just the smallest set of decisions that gets your first event in production without painting your team into a corner you'll regret in six months.

What event-driven architecture actually means in 2026

Event-driven architecture (EDA) is a pattern where services communicate by publishing events (past-tense facts about something that happened) to a broker, and other services subscribe to those events asynchronously. The producer doesn't know or care who consumes the event. The consumer doesn't know or care who produced it.

That decoupling is the whole point. In a request-response world, your checkout service has to call email, analytics, inventory, and fulfillment in sequence, and if any of them is down, checkout breaks. In an event-driven world, checkout publishes one OrderPlaced event, and the other four services pick it up on their own schedule. If email is down for an hour, the email goes out an hour late. The customer still got their order confirmation page in 200ms.
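
The fan-out described above can be sketched in a few lines. This is a toy, in-process illustration, not a broker: the InMemoryBus class and the two handlers are hypothetical names, and a real broker would deliver asynchronously and durably rather than in a loop.

```python
from collections import defaultdict
from typing import Callable


class InMemoryBus:
    """Hypothetical in-process stand-in for a broker (no durability, no async)."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> list[str]:
        """Deliver to every subscriber; return the names of handlers that failed."""
        failed = []
        for handler in self._subscribers[event_type]:
            try:
                handler(payload)
            except Exception:
                # A failing consumer affects neither the producer nor other consumers.
                failed.append(handler.__name__)
        return failed


reserved: list[str] = []


def send_confirmation_email(event: dict) -> None:
    raise ConnectionError("email provider is down")


def reserve_inventory(event: dict) -> None:
    reserved.append(event["order_id"])


bus = InMemoryBus()
bus.subscribe("OrderPlaced", send_confirmation_email)
bus.subscribe("OrderPlaced", reserve_inventory)

failed = bus.publish("OrderPlaced", {"order_id": "ord-1"})
print(failed)    # ['send_confirmation_email']
print(reserved)  # ['ord-1']
```

Checkout publishes once and returns; the email outage shows up only as a failed handler to retry later, while inventory is reserved as usual.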

The pattern isn't new (message queues and publish/subscribe middleware pre-date Kafka by decades), but two things made it the dominant backend pattern of 2026. First, AI agents are pure event consumers: an agent that watches for SupportTicketOpened and drafts a reply is the canonical 2026 architecture. Second, the broker market finally got cheap and operable enough that small teams can run it without a dedicated platform engineer.

Why most early-stage teams should skip EDA (and how to know you can't)

If you're a two-founder team pre-revenue, you don't need event-driven architecture. You need to ship the next feature. A Postgres database with a background worker (or even a cron job) handles the first 50,000 events per day for free. Adding Kafka before you need it doubles your infrastructure cost and triples your debugging time.

The five triggers that mean you actually need EDA:

Multi-team scaling. Three or more teams ship to the same service and you spend more time coordinating than building. Events let each team own a consumer independently.
Real-time analytics. Product needs sub-minute dashboards on user behavior. Polling Postgres doesn't scale; streaming events to a warehouse does.
Audit replay. Regulatory or financial requirements mean you need to reconstruct system state at any point in time. Event sourcing is the only sane answer.
AI agent workflows. You're building agents that react to system events (new ticket, failed payment, abandoned cart). Pub/sub is how they get fed.
Scale beyond a single database write. A single Postgres can absorb roughly 10,000 writes per second on commodity hardware. Past that, you fan out via events.
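
To put the two figures in this section side by side, here is a quick back-of-the-envelope calculation. It uses only the numbers quoted in this article (50,000 events per day and roughly 10,000 writes per second) and assumes traffic is spread evenly, which real traffic never is, so treat the result as an order-of-magnitude check.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_events_per_second(events_per_day: int) -> float:
    """Average rate, assuming events are spread evenly over the day."""
    return events_per_day / SECONDS_PER_DAY


rate = average_events_per_second(50_000)
headroom = 10_000 / rate  # the article's rough single-Postgres write ceiling

print(f"{rate:.2f} events/s on average")          # 0.58 events/s on average
print(f"{headroom:,.0f}x below the write ceiling")  # 17,280x below the write ceiling
```

Even with peaks a hundred times the average, a single database has a lot of headroom at this volume, which is why the throughput trigger is usually the last of the five to fire.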

If none of those apply, write a technical specification for the simpler version first. You can always add a broker later. Going the other direction (ripping out Kafka after you realize you didn't need it) is six months of work.

Picking your broker: Kafka vs Redpanda vs SNS+SQS vs NATS

There are roughly four credible choices for a 2026 startup. Each wins in different conditions.

Broker	Cost at 1M events/mo	Setup difficulty	Best for	Worst at
Apache Kafka (MSK)	~$200/mo (3-node)	High	Replay, retention, big-data integration	Operational overhead at small scale
Redpanda	~$80/mo (single node)	Medium	Kafka API without JVM	Smaller third-party tool ecosystem
AWS SNS+SQS	~$0.50/mo	Low	AWS-native, cheap, fully managed	No replay, weaker ordering
NATS JetStream	~$30/mo	Low	Lightweight pub/sub, edge deployments	Fewer enterprise integrations

The honest defaults: if you're on AWS and your events are operational (not analytical), use SNS+SQS. It's cheap, it scales, and you don't operate anything. If you want Kafka's API and replay semantics but not the JVM tax, run Redpanda. Pick actual Apache Kafka only when you're already on Confluent Cloud, you need its full ecosystem (Connect, Streams, ksqlDB), or you're operating at the scale where Confluent's tooling matters more than its cost.

NATS is the dark horse for IoT and edge workloads. If your events come from devices and you need a footprint smaller than 50MB, JetStream is the right call.

The first-week setup playbook

A senior backend engineer can ship the first production event by Friday. Here is the day-by-day breakdown.

Day 1: pick the broker and define the first event schema. Don't try to design every event in your domain. Pick one (OrderPlaced, UserSignedUp, PaymentFailed) and write its schema in JSON Schema or Avro. Stand up a schema registry alongside the broker. Confluent Schema Registry and AWS Glue Schema Registry each support both formats.
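
As a rough idea of what a Day 1 schema might look like, here is an OrderPlaced schema written as JSON Schema in a Python dict. Every field name is an assumption made for illustration, and validation_errors is a tiny hand-rolled stand-in for a real JSON Schema validator: it checks only required fields, unknown fields, and the two primitive types used here.

```python
# Hypothetical first event schema; field names are illustrative assumptions.
ORDER_PLACED_V1 = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "v1.OrderPlaced",
    "type": "object",
    "required": ["event_id", "order_id", "total_cents", "currency", "occurred_at"],
    "properties": {
        "event_id": {"type": "string"},
        "order_id": {"type": "string"},
        "total_cents": {"type": "integer"},  # integer cents, never float dollars
        "currency": {"type": "string"},
        "occurred_at": {"type": "string"},
    },
    "additionalProperties": False,
}

_PY_TYPES = {"string": str, "integer": int}


def validation_errors(schema: dict, payload: dict) -> list[str]:
    """Minimal stand-in for a JSON Schema validator (not a full implementation)."""
    errors = []
    for name in schema["required"]:
        if name not in payload:
            errors.append(f"missing required field: {name}")
    for name, value in payload.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unknown field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif not isinstance(value, _PY_TYPES[spec["type"]]) or isinstance(value, bool):
            errors.append(f"wrong type for {name}: expected {spec['type']}")
    return errors


good = {"event_id": "evt-1", "order_id": "ord-1", "total_cents": 1999,
        "currency": "USD", "occurred_at": "2026-01-05T10:00:00Z"}
bad = {"event_id": "evt-2", "order_id": "ord-2", "total_cents": 19.99, "currency": "USD"}

print(validation_errors(ORDER_PLACED_V1, good))  # []
print(validation_errors(ORDER_PLACED_V1, bad))
```

In a real rollout the schema would live in the registry and the producer would validate against it at publish time.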

Day 2: implement the Outbox pattern in Postgres. This is the single most important decision in the whole rollout. Don't publish events directly from application code. Insert them into an outbox table in the same transaction as your domain write, then have a small worker process poll the outbox and publish to the broker. This guarantees that if the database write succeeds, the event eventually publishes, and if it fails, no event publishes. No two-phase commit, and no events lost from an in-memory queue when a process crashes.

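-- Rows are inserted in the same transaction as the domain write.
CREATE TABLE outbox (
  id UUID PRIMARY KEY,                            -- unique per event
  aggregate_id UUID NOT NULL,                     -- the entity the event is about
  event_type TEXT NOT NULL,                       -- e.g. 'OrderPlaced'
  payload JSONB NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  published_at TIMESTAMPTZ                        -- NULL until the worker has published the row
);
-- Partial index keeps the worker's poll for unpublished rows cheap.
CREATE INDEX outbox_unpublished ON outbox (created_at) WHERE published_at IS NULL;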
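
The write path and the polling worker could look roughly like the sketch below. To keep it runnable anywhere it uses Python's built-in sqlite3 module as a stand-in for Postgres, with the column types simplified accordingly; the orders table, place_order, and publish_pending are hypothetical names, and the publish callback stands in for whatever broker client you use.

```python
import json
import sqlite3
import uuid

# Assumption: sqlite3 stands in for Postgres so the sketch runs without a server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL);
    CREATE TABLE outbox (
        id TEXT PRIMARY KEY,
        aggregate_id TEXT NOT NULL,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        published_at TEXT
    );
""")


def place_order(total_cents: int) -> str:
    """Write the order and its event in one transaction: both commit or neither does."""
    order_id = str(uuid.uuid4())
    with conn:  # commits on success, rolls back if either insert raises
        conn.execute("INSERT INTO orders (id, total_cents) VALUES (?, ?)",
                     (order_id, total_cents))
        conn.execute(
            "INSERT INTO outbox (id, aggregate_id, event_type, payload) VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), order_id, "OrderPlaced",
             json.dumps({"order_id": order_id, "total_cents": total_cents})),
        )
    return order_id


def publish_pending(publish) -> int:
    """One polling pass: publish unpublished rows oldest first, then mark them."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published_at IS NULL ORDER BY created_at, rowid"
    ).fetchall()
    for event_id, event_type, payload in rows:
        publish(event_id, event_type, json.loads(payload))  # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published_at = CURRENT_TIMESTAMP WHERE id = ?",
                         (event_id,))
    return len(rows)


sent = []
place_order(1999)
publish_pending(lambda event_id, event_type, payload: sent.append((event_type, payload["total_cents"])))
print(sent)  # [('OrderPlaced', 1999)]
```

If the worker crashes after publishing but before the UPDATE, the row is published again on the next pass. That is the at-least-once behaviour the rest of this playbook is designed around.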

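Day 3: ship the publisher with idempotency keys. Every event carries a unique event_id. On Kafka or Redpanda, enable the idempotent producer (enable.idempotence=true) so the client's own retries don't write duplicates, and send event_id in a message header so consumers can deduplicate on it. On SNS FIFO, pass it as the MessageDeduplicationId, which deduplicates within a five-minute window. Don't skip this. The outbox worker will republish whenever it crashes between publishing a row and marking it published, and without a stable event_id your downstream consumers cannot recognize the duplicate.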

Day 4: write the first consumer with retry and DLQ. The consumer must be idempotent: processing the same event twice produces the same result as processing it once. Store processed event_ids in a small processed_events table (or Redis with a 7-day TTL). If a message still fails after 3 retries, route it to a dead-letter queue. AWS SQS handles this natively through a redrive policy (maxReceiveCount); for Kafka you build it yourself, typically with a .dlq topic suffix convention.
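
A minimal sketch of that consumer loop, assuming an in-memory set in place of the processed_events table and a plain list in place of the dead-letter queue. The Consumer class and charge_card handler are hypothetical, and the sketch retries immediately where a production consumer would back off between attempts.

```python
from typing import Callable

MAX_RETRIES = 3  # retries after the first attempt, per the rule of thumb above


class Consumer:
    def __init__(self, handler: Callable[[dict], None]) -> None:
        self.handler = handler
        self.processed_ids: set[str] = set()  # stand-in for the processed_events table
        self.dead_letters: list[dict] = []    # stand-in for the DLQ

    def receive(self, event: dict) -> str:
        event_id = event["event_id"]
        if event_id in self.processed_ids:
            return "duplicate"  # already handled: skip the side effect entirely
        for _attempt in range(1 + MAX_RETRIES):
            try:
                self.handler(event)
            except Exception:
                continue  # try again (no backoff in this sketch)
            self.processed_ids.add(event_id)
            return "processed"
        self.dead_letters.append(event)  # retries exhausted: park it and move on
        return "dead-lettered"


charges: list[int] = []


def charge_card(event: dict) -> None:
    if "amount_cents" not in event:
        raise ValueError("malformed payload")
    charges.append(event["amount_cents"])


consumer = Consumer(charge_card)
ok = {"event_id": "evt-1", "amount_cents": 1999}
results = [consumer.receive(ok), consumer.receive(ok), consumer.receive({"event_id": "evt-2"})]
print(results)  # ['processed', 'duplicate', 'dead-lettered']
print(charges)  # [1999]
```

With a real database, recording the event_id and applying the side effect should happen in one transaction where possible, so a crash between the two cannot leave them out of sync.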

Day 5: add distributed tracing. Every event carries a correlation_id (the original request's trace ID). Wire OpenTelemetry into the producer and the consumer, and propagate the trace context through the broker in message headers. Without this you cannot answer "why did this user's email never send?" in under three hours, and you will get asked that question.
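
The idea of carrying one correlation_id through every hop can be sketched without any tracing library. The envelope fields, the new_event and log_line helpers, the ConfirmationEmailSent event, and the req-7f3a ID are all assumptions for illustration; OpenTelemetry has its own context-propagation API, which this sketch deliberately does not use.

```python
import uuid


def new_event(event_type: str, payload: dict, correlation_id: str) -> dict:
    """Hypothetical envelope: event_id is unique per event, correlation_id is shared."""
    return {
        "event_id": str(uuid.uuid4()),
        "correlation_id": correlation_id,
        "event_type": event_type,
        "payload": payload,
    }


def log_line(service: str, message: str, event: dict) -> str:
    """Every service logs the same correlation_id, so one search finds the whole path."""
    return f"correlation_id={event['correlation_id']} service={service} {message}"


request_trace_id = "req-7f3a"  # hypothetical ID assigned at request entry

order_placed = new_event("OrderPlaced", {"order_id": "ord-1"}, request_trace_id)
# A consumer that emits a follow-up event copies the ID instead of minting a new one.
email_sent = new_event("ConfirmationEmailSent", {"order_id": "ord-1"},
                       order_placed["correlation_id"])

logs = [
    log_line("checkout", "published OrderPlaced", order_placed),
    log_line("email", "sent confirmation", email_sent),
]
matching = [line for line in logs if "correlation_id=req-7f3a" in line]
print(len(matching))  # 2
```

The design choice that matters is that consumers copy the incoming correlation_id onto anything they emit, so the chain never breaks at a service boundary.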

This same five-day shape applies whether you're on Vercel, Render, or AWS. The platform doesn't change the playbook.

Event design rules that prevent six-month rewrites

The cost of a bad event schema isn't paid on Day 5. It's paid in month six when you realize half your consumers parse order.total as cents and the other half as dollars, and every downstream report is wrong.

Five rules that prevent this:

Name events as past-tense facts. OrderPlaced, not PlaceOrder. The first is a record of something that happened. The second is a command, which belongs in a different pattern (command bus, not event bus).
Version from day one. Use v1.OrderPlaced in the event type. When you change the schema, publish v2.OrderPlaced and run both in parallel until consumers migrate. Confluent Schema Registry has compatibility checks for this; use them.
Immutable payloads. Once published, an event is a historical fact. If you discover the data was wrong, publish a v1.OrderCorrected event referencing the original. Don't mutate.
No PII you can't delete. Email addresses and names eventually have to be deletable under GDPR. Either keep PII out of event payloads entirely (reference by user ID and look up at consumer time) or use a crypto-shredding pattern where you can throw away the decryption key per-user.
One aggregate per event. An OrderPlaced event is about one order. Don't bundle "order plus all line items plus the customer profile plus the shipping address" into a single payload. Each consumer reads what it needs and joins from its own data.
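
Several of these rules can be expressed directly in the type that models an event. The sketch below uses frozen dataclasses; the class and field names (including corrects_event_id) are assumptions, not a prescribed format.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)  # frozen: a published fact cannot be mutated
class OrderPlacedV1:
    event_id: str
    order_id: str
    customer_id: str   # a reference only; no email or name in the payload
    total_cents: int   # one unit, stated in the field name
    event_type: str = "v1.OrderPlaced"  # past tense, versioned


@dataclass(frozen=True)
class OrderCorrectedV1:
    event_id: str
    corrects_event_id: str  # points back at the original event
    order_id: str
    total_cents: int
    event_type: str = "v1.OrderCorrected"


original = OrderPlacedV1("evt-1", "ord-1", "cus-9", 1999)

try:
    original.total_cents = 2499  # attempting to rewrite history
    mutated = True
except FrozenInstanceError:
    mutated = False

# The fix for wrong data is a new event, not an edit of the old one.
correction = OrderCorrectedV1("evt-2", original.event_id, original.order_id, 2499)
print(mutated)                       # False
print(correction.corrects_event_id)  # evt-1
```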

Common pitfalls that break EDA in production

The five failure modes you'll hit in your first quarter, in roughly the order you'll hit them.

Forgetting idempotency. A network blip causes the broker to redeliver an event. Your consumer processes it twice. If that consumer charges the customer's card, you just double-charged them. Every consumer needs a deduplication check, full stop. This is the same discipline as handling Stripe webhooks correctly; the patterns transfer.

No dead-letter queue. A poison message (one that always throws on parse) sits at the head of the queue and blocks every message behind it in an ordered stream such as a Kafka partition or a FIFO queue. After 3 retries, route it to a DLQ and alert. Without a DLQ, that stream stops the first time a single message has a malformed field.

Mistaking at-least-once for exactly-once. Almost no broker delivers exactly-once. Kafka claims to via transactions, but only between Kafka topics; the moment you write to Postgres on the consumer side, you're back to at-least-once. Build for at-least-once delivery and make your consumers idempotent.

Schema breaks. Someone changes order.total from an integer (cents) to a float (dollars). Half the consumers crash, the other half silently misprocess. A schema registry with backward-compatibility checks catches this in CI before it ships.
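
To make the idea of a compatibility check concrete, here is a deliberately simplified one. It is not the algorithm any schema registry actually uses: it flags only removed fields, changed types, and newly required fields between two JSON-Schema-style dicts. The breaking_changes function and both schemas are hypothetical.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """Simplified check: report changes likely to break existing consumers."""
    problems = []
    old_props, new_props = old["properties"], new["properties"]
    for name, spec in old_props.items():
        if name not in new_props:
            problems.append(f"{name}: field removed")
        elif new_props[name]["type"] != spec["type"]:
            problems.append(
                f"{name}: type changed from {spec['type']} to {new_props[name]['type']}"
            )
    for name in new.get("required", []):
        if name not in old_props:
            problems.append(f"{name}: new required field")
    return problems


v1 = {"required": ["order_id", "total"],
      "properties": {"order_id": {"type": "string"}, "total": {"type": "integer"}}}

# Someone switches total from integer cents to float dollars.
v2_breaking = {"required": ["order_id", "total"],
               "properties": {"order_id": {"type": "string"}, "total": {"type": "number"}}}

# Adding an optional field is fine under this simplified rule set.
v2_safe = {"required": ["order_id", "total"],
           "properties": {"order_id": {"type": "string"}, "total": {"type": "integer"},
                          "coupon_code": {"type": "string"}}}

print(breaking_changes(v1, v2_breaking))  # ['total: type changed from integer to number']
print(breaking_changes(v1, v2_safe))      # []
```

Run as a CI step against the previously published schema, a non-empty result would fail the build before the change reaches consumers.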

No correlation IDs. A customer asks why their order confirmation never arrived. You have logs from checkout, logs from outbox-publisher, logs from the broker, logs from email. None of them share a request ID. You spend 4 hours grepping timestamps. Wire correlation IDs from request entry to final consumer; this is non-negotiable.

Testing event-driven systems without losing your mind

Three layers, in order of how much they actually help.

Unit tests on the consumer logic. Pass in a fake event payload, assert the consumer does the right thing. Cheap, fast, catches most regressions.
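
A unit test at this layer needs no broker at all. The sketch below uses the standard library's unittest; handle_order_placed and its payload fields are hypothetical stand-ins for your own consumer logic.

```python
import unittest


def handle_order_placed(event: dict, email_queue: list) -> None:
    """Hypothetical consumer logic under test: queue one confirmation email per order."""
    email_queue.append({
        "to_user": event["customer_id"],
        "template": "order_confirmation",
        "order_id": event["order_id"],
    })


class HandleOrderPlacedTest(unittest.TestCase):
    def test_queues_confirmation_email(self):
        emails = []
        handle_order_placed({"order_id": "ord-1", "customer_id": "cus-9"}, emails)
        self.assertEqual(emails, [{
            "to_user": "cus-9",
            "template": "order_confirmation",
            "order_id": "ord-1",
        }])

    def test_rejects_payload_without_order_id(self):
        emails = []
        with self.assertRaises(KeyError):
            handle_order_placed({"customer_id": "cus-9"}, emails)
        self.assertEqual(emails, [])  # nothing queued for a malformed event


suite = unittest.defaultTestLoader.loadTestsFromTestCase(HandleOrderPlacedTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping the handler a plain function of (event, dependencies) is what makes this cheap: the broker client only appears in the thin layer that calls it.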

Integration tests with a real broker via testcontainers. Spin up Redpanda or LocalStack-backed SNS+SQS in a Docker container during your CI run. Publish events, assert consumers process them. This is the same testcontainers approach used to run integration tests in CI for database code. The setup pays back the first time it catches a serialization bug that unit tests missed.

Contract tests between teams. When two teams own producer and consumer respectively, use Pact or the schema registry's compatibility checks as the contract. The producer's CI fails if they break the schema; the consumer's CI fails if they don't handle a new field. This eliminates 90% of "who broke prod" arguments.

Bonus, when you can afford it: replay production events into staging. Pipe a copy of production events into a staging cluster and run new consumer versions against them before promoting. This is the only test that catches schema drift you didn't notice.

When to bring in help versus build it in-house

A senior backend engineer (the kind who has done two or three EDA rollouts before) can stand up the broker, ship the Outbox, write the first publisher and consumer, and wire distributed tracing in about a week. They'll spend the second week converting your first real domain event and onboarding the rest of the team.

The full-time hire question depends on whether you have ongoing event-system work or a one-shot setup. If you'll add two or three new event flows per quarter, full-time makes sense. If this is a single rollout and your team can maintain it once it's live, a booking is the cleaner shape.

Every engineer on Cadence is AI-native by default, vetted on Cursor, Claude Code, and Copilot fluency before they unlock bookings. For an EDA rollout, the senior tier ($1,500/week) is the right match: this is owner-of-scope work, not feature-level execution. Mid tier ($1,000/week) can run the playbook once the senior has shipped the foundation. If you want to validate the fit before committing, the 48-hour free trial covers the schema-design and Outbox-implementation phase, which is usually enough to know whether the engineer is the right call.

If you're not sure whether EDA is the right move at all, audit your stack first and get an honest read before you commit a quarter of engineering time to it.

Try it: Book a senior engineer on Cadence to scope and ship the first event in your stack. 48-hour free trial, weekly billing, replace any week. The full first-week playbook above is the kind of scope a Cadence senior owns end-to-end.

FAQ

How long does it take to set up event-driven architecture?

A senior engineer can ship the first production event in about a week. Full rollout across an existing set of services takes 4 to 8 weeks depending on how many synchronous endpoints you're converting. Budget 80 to 160 engineer-hours for the initial setup; recurring costs are mostly in maintaining schemas and consumers.

What does event-driven architecture cost at startup scale?

Under $100 per month for the broker at up to roughly 1 million events per month. Redpanda single-node, AWS SQS, and NATS JetStream all land in this range. Engineer time dominates infrastructure cost by 50 to 100x in the first year. Past 10 million events per month, broker cost starts to matter and you'll want to compare managed Kafka (MSK, Confluent) against self-hosted Redpanda seriously.

Should I use Kafka or something simpler?

For most teams under 10 million events per day, Redpanda or AWS SNS+SQS is the right answer. Pick Apache Kafka when you need replay across days of history, multi-region replication, or you're already paying for Confluent Cloud. Don't pick Kafka because it's the resume-default; the operational overhead is real.

Do I need a schema registry from day one?

Yes. The cost of adding a schema registry later is rewriting every consumer to validate payloads it previously trusted. Confluent Schema Registry and AWS Glue Schema Registry both support Avro and JSON Schema; either works. The point is having a single source of truth for what an event looks like, enforced at publish time.

Can I use event-driven architecture with a serverless backend?

Yes, and it's the canonical 2026 pattern. AWS Lambda triggered by EventBridge, SNS, or SQS handles most use cases. Cold starts add 100 to 500ms of latency, so use provisioned concurrency on hot paths. The same patterns from this guide (Outbox, idempotency, DLQ, correlation IDs) apply unchanged. See serverless backend design for the broader architecture context.

Sadhana M S

TA Expert

Talent acquisition expert at withRemote. Writes on candidate-experience design, JD craft, and stage-gated interview loops.
