May 5, 2026 · 12 min read · Cadence Editorial

How to add rate limiting to your API

Photo by [Brett Sayles](https://www.pexels.com/@brett-sayles) on [Pexels](https://www.pexels.com/photo/server-racks-on-data-center-5480781/)


To add rate limiting to your API, put a token bucket in Redis (or Upstash if you are serverless), key it on the API key (falling back to IP), and return 429 Too Many Requests with the standard RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset headers. Layer it: a coarse limit at the CDN or edge for cheap abuse protection, a precise limit in your application for per-user fairness, and a hard ceiling at the database layer for runaway queries.

Most teams ship a single express-rate-limit middleware on one node and call it done. That works until you scale to two pods or a serverless function fans out across regions. Then the in-memory counter resets per instance, the limit becomes "100 requests per pod" instead of "100 requests per user," and a determined client can hammer you 10x by hitting different cold starts. The fix is not a bigger memory cache. It is a shared atomic counter (Redis with Lua, or a managed equivalent) and a deliberate choice of algorithm for your traffic shape.

This is the playbook we hand to engineers building production APIs. It covers the five algorithms worth knowing, working code for the two you should actually pick, where to place limiters in the request path, and how to design the 429 response so clients can self-heal.

Why rate limiting matters more in 2026

Three things changed in the last 24 months. AI agents now make up a meaningful fraction of API traffic, and they retry aggressively when they get a 500. Edge functions made it trivial to fan one user request out into 50 downstream calls without realizing it. And the cost asymmetry between a request to your API and a request to OpenAI on your behalf means a single unbounded loop can burn a month's inference budget in 90 seconds.

A 2025 Cloudflare report attributed roughly 30% of all API requests to automated clients. If your API is public or semi-public (mobile app, partner integration, AI agent SDK), unbounded endpoints are a financial liability, not just an availability one. Rate limiting is the cheapest insurance you can buy: a weekend of work caps your worst-case bill and makes your 99th-percentile latency predictable.

The default approach (and why it breaks)

Open the npm registry and search for "rate limit." The top result is express-rate-limit with about 6 million weekly downloads. The 30-second integration looks like this:

import rateLimit from 'express-rate-limit';

// 100 requests per 60-second window, counted in this process's memory
app.use(rateLimit({ windowMs: 60_000, max: 100 }));

That is fine for a hobby project. It breaks the moment any of the following is true:

  1. You run more than one instance. The default store is in-memory per process. Two pods means each user effectively gets 2 * max requests. Ten pods means 10x. Auto-scaling makes this worse, not better.
  2. You deploy to serverless. Vercel, AWS Lambda, and Cloudflare Workers spin up new isolates on demand. There is no persistent process to hold the counter. You are functionally not rate limited.
  3. You key on IP only. A single corporate NAT can hide thousands of users behind one IP. You will rate limit a whole office before you catch one abuser.
  4. You use a fixed-window counter. A client can send max requests in the last second of one window and max in the first second of the next, getting 2 * max in two seconds. This is the boundary-burst attack.

Each of these has a fix. Combined, they argue for a different architecture: shared state (Redis), smarter keys (composite identity), a better algorithm (token bucket or sliding window counter), and layered enforcement (edge plus app).

Layered defense: where to apply rate limits

A rate limit is most useful where the request is cheapest to drop. That argues for layering, not picking one spot.

| Layer | What it catches | Cost to evaluate | Granularity |
| --- | --- | --- | --- |
| CDN / edge (Cloudflare, Fastly, Vercel) | DDoS, scrapers, bot traffic | Free or near-free | IP, ASN, country |
| API gateway (Kong, Tyk, AWS API Gateway) | Per-API-key abuse, plan enforcement | Sub-millisecond | API key, route |
| Application middleware (your code) | Per-user fairness, business logic | 1-3 ms with Redis | User ID, tenant, action |
| Database / downstream (Postgres, OpenAI) | Runaway queries, expensive operations | Variable | Per-statement, per-token |

The pattern we recommend: a coarse CDN rule (e.g. 1,000 requests per minute per IP) catches obvious abuse before it touches your origin. The application layer enforces the precise per-user limit (60 requests per minute per authenticated user). The database layer uses statement timeouts and connection pool caps as a final safety net. If you are calling expensive third-party APIs (LLMs, payments), add a per-tenant token budget around those calls specifically.
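
To make that last layer concrete, here is a minimal sketch of a per-tenant budget around an expensive downstream call. It reuses the checkRate helper defined later in this post; the import path, the key format, and the callLLM helper are illustrative assumptions, not a prescribed API.

import { checkRate } from './rate-limit'; // wherever you export the token bucket helper below

declare function callLLM(prompt: string): Promise<string>; // hypothetical metered downstream call

async function generateForTenant(tenantId: string, prompt: string) {
  // Separate key namespace so LLM spend gets its own bucket,
  // independent of the general per-user API limit.
  const { allowed } = await checkRate(`tenant:${tenantId}:llm`);
  if (!allowed) {
    throw new Error('Tenant LLM budget exhausted; retry later');
  }
  return callLLM(prompt);
}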

For more on building resilient API contracts that compose with rate limiting, see our API design best practices guide.

The five algorithms, and which to actually pick

There are five canonical rate limiting algorithms. Two are worth implementing. The other three exist mostly so interview questions have wrong answers.

| Algorithm | Memory per key | Burst behavior | When to pick |
| --- | --- | --- | --- |
| Fixed window counter | 1 integer | Allows 2x at boundaries | Internal tools, occasional spike OK |
| Sliding window log | O(n) timestamps | No bursts, exact | Audit-grade APIs, low traffic |
| Sliding window counter | 2 integers | Smoothed boundaries | General-purpose default |
| Token bucket | 2 floats | Allows controlled bursts | Bursty real-world traffic |
| Leaky bucket | 2 floats | No bursts, strict shaping | Outbound throttling, queues |

Pick token bucket if your clients are mobile apps, AI agents, or anything else that batches work after idle periods. Token bucket lets a user burn 100 tokens at once if they have been idle, then refill at 10 per second. That matches how humans and bots actually use APIs.

Pick sliding window counter if you want one default that handles everything reasonably well with two integers per key. Skip fixed window unless boundary bursts are fine. Skip sliding window log unless you need audit-grade request history.
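
For reference, the sliding window counter estimate fits in a few lines. This in-memory sketch shows the math only; a production version needs the same atomic Redis treatment as the token bucket below.

// Sliding window counter: estimate the rolling-window count by weighting
// the previous fixed window by how much of it still overlaps the window.
type Win = { count: number; start: number };
const windows = new Map<string, { prev: Win; curr: Win }>();

function allow(key: string, limit: number, windowMs: number): boolean {
  const now = Date.now();
  const currStart = Math.floor(now / windowMs) * windowMs;

  let entry = windows.get(key);
  if (!entry || entry.curr.start !== currStart) {
    // Roll the windows: the old current becomes previous if it was adjacent.
    const prev =
      entry && entry.curr.start === currStart - windowMs
        ? entry.curr
        : { count: 0, start: currStart - windowMs };
    entry = { prev, curr: { count: 0, start: currStart } };
    windows.set(key, entry);
  }

  // Fraction of the previous window still inside the sliding window.
  const overlap = 1 - (now - currStart) / windowMs;
  const estimated = entry.prev.count * overlap + entry.curr.count;

  if (estimated >= limit) return false;
  entry.curr.count += 1;
  return true;
}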

Working code: Redis token bucket with Lua

The atomic check-and-decrement is the only hard part. If two requests arrive at the same instant and both read the bucket before either writes, both get through, and your limit is wrong. Lua scripts run atomically inside Redis, which is why every production rate limiter does the same thing.

-- token_bucket.lua
-- KEYS[1] = bucket key (e.g. "rl:user:123")
-- ARGV[1] = max tokens (capacity)
-- ARGV[2] = refill rate (tokens per second)
-- ARGV[3] = current timestamp (ms)
-- ARGV[4] = tokens requested (usually 1)

local capacity   = tonumber(ARGV[1])
local refill     = tonumber(ARGV[2])
local now        = tonumber(ARGV[3])
local cost       = tonumber(ARGV[4])

local data       = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens     = tonumber(data[1]) or capacity
local last       = tonumber(data[2]) or now

local elapsed    = math.max(0, now - last) / 1000
tokens           = math.min(capacity, tokens + elapsed * refill)

local allowed    = tokens >= cost
if allowed then
  tokens = tokens - cost
end

redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', now)
redis.call('PEXPIRE', KEYS[1], math.ceil(capacity / refill * 1000) * 2)

return { allowed and 1 or 0, tokens, capacity }

Calling it from Node:

import { createClient } from 'redis';
import { readFileSync } from 'fs';

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();
const script = readFileSync('./token_bucket.lua', 'utf8');
const sha = await redis.scriptLoad(script);

export async function checkRate(key: string) {
  const [allowed, remaining, limit] = (await redis.evalSha(sha, {
    keys: [`rl:${key}`],
    arguments: ['100', '10', String(Date.now()), '1'],
  })) as [number, number, number];

  return { allowed: allowed === 1, remaining: Math.floor(remaining), limit };
}

That is 40 lines of code, runs in roughly 1 millisecond against a co-located Redis, and works correctly across any number of application instances. It is the core of every production rate limiter we have shipped.
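
Wiring that into Express takes a few more lines. A sketch, assuming the checkRate helper above and an illustrative import path; it also sends the RateLimit-* headers on every response, which matters for the 429 design discussed below.

import type { Request, Response, NextFunction } from 'express';
import { checkRate } from './rate-limit'; // the helper above

export async function rateLimitMiddleware(req: Request, res: Response, next: NextFunction) {
  const { allowed, remaining, limit } = await checkRate(req.ip ?? 'unknown');

  // Headers go on every response, not just 429s, so clients can pace themselves.
  res.set('RateLimit-Limit', String(limit));
  res.set('RateLimit-Remaining', String(remaining));

  if (!allowed) {
    res.set('Retry-After', '1'); // coarse hint; this bucket refills 10 tokens/second
    return res.status(429).send('Too Many Requests');
  }
  next();
}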

Working code: Upstash Ratelimit for serverless

If you are on Vercel, Cloudflare Workers, or AWS Lambda, you do not want to manage a Redis cluster. Upstash gives you a serverless Redis with HTTP transport, and their @upstash/ratelimit package wraps the same algorithms above with a sane API.

import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.tokenBucket(10, '10 s', 100),
  analytics: true,
  prefix: 'api',
});

export async function POST(req: Request) {
  const key = req.headers.get('x-api-key') ?? getIp(req); // getIp: your IP-extraction helper (see the key-pattern section below)
  const { success, limit, remaining, reset } = await ratelimit.limit(key);

  const headers = {
    'RateLimit-Limit': String(limit),
    'RateLimit-Remaining': String(remaining),
    'RateLimit-Reset': String(Math.ceil((reset - Date.now()) / 1000)),
  };

  if (!success) {
    return new Response('Too Many Requests', { status: 429, headers });
  }
  // ... handle request
}

This is 15 lines and handles the cold-start problem out of the box. For Next.js apps using the App Router, you can drop this directly into a route handler or a Server Action wrapper.
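
For the Server Action case, a thin wrapper is enough. A sketch, assuming the ratelimit instance above; withRateLimit and its error shape are illustrative, not part of the Upstash API.

// Hypothetical wrapper: check the limit before any Server Action runs.
export function withRateLimit<Args extends unknown[], R>(
  action: (...args: Args) => Promise<R>,
  getKey: () => string,
) {
  return async (...args: Args): Promise<R> => {
    const { success } = await ratelimit.limit(getKey());
    if (!success) throw new Error('Rate limit exceeded, slow down');
    return action(...args);
  };
}

// Illustrative usage, assuming a sendMessageImpl action and a userId in scope:
// export const sendMessage = withRateLimit(sendMessageImpl, () => `user:${userId}`);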

Key-pattern strategy: never trust just one identifier

The key you rate limit on determines what you are actually protecting. Get this wrong and a legitimate user sees a 429 while an attacker walks through.

The right pattern is a composite key with fallbacks:

function getRateLimitKey(req: Request): string {
  const apiKey = req.headers.get('x-api-key');
  if (apiKey) return `api:${apiKey}`;

  const userId = getUserIdFromSession(req);
  if (userId) return `user:${userId}`;

  const ip = req.headers.get('cf-connecting-ip')
    ?? req.headers.get('x-forwarded-for')?.split(',')[0]
    ?? 'unknown';
  return `ip:${ip}`;
}

Order matters. API key first (most specific, hardest to forge). Authenticated user ID second (catches users without an API key). IP last (the noisy fallback that catches the unauthenticated traffic but cannot distinguish users behind a NAT). Apply different limits per key type: a paid API key might get 1,000 requests per minute, a free user 60, an anonymous IP 10.

A second pattern: limit per (user, action) tuple. A user uploading a file, sending a chat message, and resetting their password all hit /api but have wildly different cost profiles. Key as user:123:upload and user:123:chat separately so one expensive endpoint cannot starve the others.
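
A sketch of both patterns together, assuming the Upstash setup from earlier. The specific budgets and the limiterFor helper are illustrative assumptions, not a prescribed layout:

import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';

const redis = Redis.fromEnv();
const perMinute = (tokens: number) => Ratelimit.slidingWindow(tokens, '60 s');

// One limiter per key type, plus per-action budgets for expensive endpoints.
const limiters: Record<string, Ratelimit> = {
  'api': new Ratelimit({ redis, limiter: perMinute(1000), prefix: 'rl' }), // paid API keys
  'user': new Ratelimit({ redis, limiter: perMinute(60), prefix: 'rl' }),  // free users
  'ip': new Ratelimit({ redis, limiter: perMinute(10), prefix: 'rl' }),    // anonymous
  'user:upload': new Ratelimit({ redis, limiter: perMinute(5), prefix: 'rl' }),
};

function limiterFor(key: string): Ratelimit {
  // "user:123:upload" -> "user:upload"; "api:abc" -> "api"; unknown -> strictest
  const [type, , action] = key.split(':');
  return limiters[action ? `${type}:${action}` : type] ?? limiters['ip'];
}

// Usage: const { success } = await limiterFor(key).limit(key);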

Designing the 429 response so clients can self-heal

A 429 with no information is hostile. Good 429 responses tell the client three things: what the limit was, when it resets, and (optionally) how long to wait before retrying. The IETF's httpapi working group standardized the RateLimit header fields for this (the draft-ietf-httpapi-ratelimit-headers specification), and most modern frameworks now support them.

HTTP/1.1 429 Too Many Requests
Content-Type: application/problem+json
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 27
Retry-After: 27

{
  "type": "https://api.example.com/errors/rate-limit",
  "title": "Rate limit exceeded",
  "detail": "You have made 100 requests in the last 60 seconds. Try again in 27 seconds.",
  "limit": 100,
  "remaining": 0,
  "resetAt": "2026-05-05T18:01:27Z"
}

A few non-obvious rules. Always include Retry-After (in seconds) in addition to RateLimit-Reset; some HTTP clients respect one and not the other. Use application/problem+json (RFC 9457) for the body so generic error handlers can parse it. And if the limit applies specifically to authentication attempts, say so explicitly in the response body, because clients that cannot tell a throttle from a transient failure will silently retry forever and burn quota.

Send the RateLimit-* headers on every response, not just 429s. That lets well-behaved clients pace themselves before they hit the wall. Stripe and GitHub both do this and it cuts retry traffic noticeably.
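
On the client side, honoring those headers takes only a few lines. A sketch of a fetch wrapper that backs off on 429; the retry count and the one-second fallback are arbitrary choices:

async function fetchWithBackoff(url: string, init?: RequestInit, retries = 3): Promise<Response> {
  const res = await fetch(url, init);
  if (res.status !== 429 || retries === 0) return res;

  // Prefer Retry-After; fall back to RateLimit-Reset, then one second.
  const seconds = Number(
    res.headers.get('Retry-After') ?? res.headers.get('RateLimit-Reset') ?? '1',
  );
  await new Promise((resolve) => setTimeout(resolve, seconds * 1000));
  return fetchWithBackoff(url, init, retries - 1);
}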

Common pitfalls

  • Trusting X-Forwarded-For directly. Anyone can set this header. Use your CDN's signed header (Cf-Connecting-Ip for Cloudflare, X-Vercel-Forwarded-For for Vercel) or take only the leftmost value and verify the chain.
  • Rate limiting before parsing identity. Order: parse identity, then rate limit, so the key is per-user rather than per-IP and a request with bad credentials still gets its 401 instead of a confusing 429.
  • Forgetting the OPTIONS preflight. CORS preflights count as requests. Skip rate limiting on OPTIONS or use a much higher limit.
  • Counting failed requests against the limit. A user who sends 100 invalid requests should not lock themselves out. Skip 4xx responses or count them under a separate, looser key.
  • No bypass for internal callers. Whitelist monitoring, health checks, and webhook retries by an internal API key prefix or a header only your edge can set (see the sketch after this list).
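
Several of these collapse into request ordering. A sketch for a fetch-style handler (e.g. a Next.js route handler), reusing checkRate and getRateLimitKey from earlier; the internal header name and the route helper are illustrative conventions:

import { checkRate } from './rate-limit'; // token bucket helper from earlier
declare function route(req: Request): Promise<Response>; // your actual handler

export async function handle(req: Request): Promise<Response> {
  // CORS preflights never count against the limit.
  if (req.method === 'OPTIONS') return new Response(null, { status: 204 });

  // Bypass for internal callers: a header only your edge or infra can set.
  if (req.headers.get('x-internal-key') === process.env.INTERNAL_KEY) {
    return route(req);
  }

  // getRateLimitKey parses identity first, so the key is per-user when possible.
  const { allowed } = await checkRate(getRateLimitKey(req));
  if (!allowed) return new Response('Too Many Requests', { status: 429 });

  return route(req);
}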

When to skip rate limiting

Be honest about scope. If you are pre-launch with five users you know personally, rate limiting is cargo-cult work. The same is true for internal APIs behind a VPN where the threat model is "a coworker writes a bad loop" (the answer there is a code review, not Redis).

You should add rate limiting before any of the following:

  • A public API endpoint, including unauthenticated forms (signup, contact, password reset)
  • Anything that calls a paid downstream API on the user's behalf (LLMs, SMS, payments, geocoding)
  • A public REST or GraphQL surface for partners or third-party developers
  • Anywhere a successful response costs more than 10 milliseconds of compute or one external network call

If none of those apply, ship without it and revisit when the first abuse incident happens. That sounds reckless until you remember that adding a token bucket later is a one-day project, and shipping a feature that no one uses is a multi-week loss.

The Cadence connection

Most rate limiter rollouts are a 2-day senior engineer project: pick the algorithm, deploy Redis (or Upstash if you are serverless), wire the middleware, design the 429 response, ship it behind a feature flag, then tune limits against real traffic for a week. If your team is heads-down on product and does not have an engineer with production Redis experience to spare, this is the kind of scoped, well-defined project we built Cadence for.

A senior engineer on Cadence costs $1,500 per week. They take the booking spec on Monday, ship the limiter behind a flag by Wednesday, run load tests Thursday, and roll out to 100% by Friday. Every engineer on the platform is AI-native by default, vetted on Cursor, Claude Code, and Copilot fluency before they unlock bookings, so the actual implementation goes faster than the design conversation.

If you want a sanity check on your current API's stack before you commit to changes, you can also audit your stack with Ship or Skip and get an honest read on which infrastructure decisions are worth keeping.

Try it: book a senior engineer on Cadence to scope and ship rate limiting (and the supporting observability) inside one week, with a 48-hour free trial to confirm fit before you pay.

For deeper context on data layer choices that affect how rate limiting integrates, see our guide on choosing between SQL and NoSQL, and our tech stack guide for 2026 covers the boring defaults worth picking.

FAQ

Where should I put rate limiting in my stack: CDN, gateway, or app?

All three, with different jobs. The CDN catches obvious abuse cheaply (1,000 req/min per IP). The application enforces precise per-user fairness (60 req/min per authenticated user). The gateway handles plan enforcement if you sell tiered API access. They are not redundant; they each catch a different failure mode.

Should I use IP, API key, or user ID as the rate limit key?

Use a composite key with fallbacks: API key first (most specific), authenticated user ID second, IP last. Apply different limits per key type so a corporate NAT does not get throttled like one user, and so a free-tier user does not get the same generous quota as a paying customer.

What is the difference between token bucket and sliding window counter?

Token bucket allows controlled bursts (idle users accumulate tokens up to a cap). Sliding window counter smooths traffic and approximates a true sliding window with two integers per key. Token bucket fits bursty real traffic better; sliding window counter is the safer general-purpose default if you are unsure.

What headers should a 429 response include?

RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset (the IETF's standardized header fields), plus Retry-After for legacy clients. Send them on every response, not just 429s, so well-behaved clients can pace themselves.

How do I rate limit a serverless function?

Use a managed Redis with HTTP transport (Upstash) and the @upstash/ratelimit package. Local in-memory state does not work because each cold start gets a fresh process. The HTTP transport lets you call Redis from edge runtimes like Cloudflare Workers and Vercel Edge.

How long does it take to implement rate limiting properly?

A senior engineer with production Redis experience can ship a layered limiter (edge plus app), composite keying, standard 429 headers, and observability in 2 to 5 days. The longer tail is tuning limits against real traffic, which usually takes another week of monitoring.
