How to roll out feature flags safely

To roll out a feature flag safely, ramp on a fixed curve (1% → 5% → 25% → 50% → 100%), gate every step behind an error-rate check per cohort, and ship a kill switch that flips faster than your CI can deploy. Add a cleanup date to the flag at creation. Skip any of these four and you will eventually wake up at 2am.

The hard part is not turning a flag on. The hard part is turning it off cleanly, knowing which cohort broke, and not having 400 dead flags rotting in your codebase a year later. This playbook covers the ramp curve, the kill switch, sticky vs non-sticky bucketing, server vs client evaluation, the observability hooks that make a rollout debuggable, and the gate-decay process that keeps flag count under control.

Why feature flags became safety infrastructure in 2026

Five years ago, flags were a deploy-decoupling trick. In 2026, they are how you ship anything serious. The shift came from three things: trunk-based development became the norm, AI agents started writing 40 to 60% of production code on most teams, and observability tools got cheap enough to attribute errors to flag cohorts in real time.

When you can plot error rate per cohort live, a 5% ramp is no longer a guess; it is a controlled experiment. The flag service is now part of your safety stack, alongside CI and monitoring.

The default approach (and why it breaks)

The default rollout: a developer opens LaunchDarkly or Statsig, creates a boolean flag called new_checkout, points it at 100% of users on Friday afternoon, and waits to see what Sentry says.

This works when the change is small and the surface area is one endpoint. It breaks the moment the change touches a write path, interacts with a cached value, depends on another flag, or the cohort split is not random. The fix is a rollout playbook the team runs the same way every time, with dial movements and rollback path written down before the first user sees the code.

The ramp curve: 1 → 5 → 25 → 50 → 100

Pick a curve. Stick to it. Most teams converge on something close to this:

Step	Cohort size	Min soak time	What you watch	Auto-rollback trigger
1	1%	1 hour	Error rate, latency p95, custom event volume	Error rate > 2x baseline
2	5%	4 hours	Same + conversion / completion rate	Error rate > 1.5x or conversion drop > 10%
3	25%	24 hours	Same + downstream service load	Any SLO breach
4	50%	24 hours	Same + cost / queue depth	Any SLO breach
5	100%	Hold 7 days before cleanup	Full dashboard	Manual only

The 1% step catches code that throws on import. The 5% step catches code that breaks under realistic concurrency. The 25% step is the first where statistical-significance tests on conversion metrics start working. The 50% step catches capacity issues (database pools, queue depth, third-party rate limits) that only surface near full load.

Soak times are minimums; if your traffic is uneven, let each step run through a full daily cycle. Never skip a step because the previous one looked clean. Skipping is how you learn that your 5% cohort happened to exclude all your enterprise customers.
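One way to keep the curve honest is to encode the table as data and let a small gate function decide whether a step may advance. The sketch below is illustrative, not a reference implementation: the names RampStep, RAMP, and decide are made up for this example, and it assumes your monitoring already gives you an error rate expressed as a multiple of baseline, a fractional conversion drop, and an SLO-breach boolean.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RampStep:
    percent: int
    min_soak_hours: float
    error_ratio_limit: Optional[float] = None      # error rate as a multiple of baseline
    conversion_drop_limit: Optional[float] = None  # fractional drop, 0.10 means 10%
    rollback_on_slo_breach: bool = False
    manual_only: bool = False                      # no automatic decision at this step


# Values copied from the ramp table above.
RAMP: List[RampStep] = [
    RampStep(1, 1, error_ratio_limit=2.0),
    RampStep(5, 4, error_ratio_limit=1.5, conversion_drop_limit=0.10),
    RampStep(25, 24, rollback_on_slo_breach=True),
    RampStep(50, 24, rollback_on_slo_breach=True),
    RampStep(100, 7 * 24, manual_only=True),
]


def decide(step: RampStep, hours_soaked: float, error_ratio: float,
           conversion_drop: float = 0.0, slo_breached: bool = False) -> str:
    """Return 'rollback', 'hold', 'advance', or 'manual' for the current step."""
    if step.manual_only:
        return "manual"
    if step.error_ratio_limit is not None and error_ratio > step.error_ratio_limit:
        return "rollback"
    if step.conversion_drop_limit is not None and conversion_drop > step.conversion_drop_limit:
        return "rollback"
    if step.rollback_on_slo_breach and slo_breached:
        return "rollback"
    # Rollback checks run first, so a bad cohort is pulled before the soak ends.
    if hours_soaked < step.min_soak_hours:
        return "hold"
    return "advance"
```

Because rollback is checked before soak time, a step that looks bad after ten minutes rolls back immediately, while a step that looks clean still has to wait out its minimum soak.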

Kill-switch design: faster than your CI

A kill switch is a flag that turns the feature off in under 30 seconds, from any device, without a deploy. It is the single most important piece of feature-flag infrastructure, and most teams build it wrong.

Three rules:

The kill switch is a separate flag. Do not roll back by editing the ramp percentage. You want a binary off-switch that overrides every other rule. Name it kill_<feature> and document that flipping it is always safe.
It must propagate in under 30 seconds globally. This rules out polling-based SDKs with 5-minute intervals. Use a streaming SDK (LaunchDarkly's streaming endpoint, Statsig's WebSocket, or your own SSE channel). Test propagation time monthly.
It must work without dependencies. If your flag service is down, the kill switch should still kill. Cache the kill-state locally with a short TTL and default the feature to off when the cache is stale and the network is unreachable.

The clearest test of your kill switch is the 2am test: can your on-call engineer, on their phone, on hotel wifi, kill a feature in under a minute? If not, fix that before you ship anything else.
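The third rule is the one most often left unimplemented, so here is a minimal sketch of it. It assumes a hypothetical fetch callable that returns True when the feature is killed and raises an OSError subclass (ConnectionError, TimeoutError) when the flag service is unreachable. The class name KillSwitchCache and the 30-second TTL are choices made for this example, not any vendor's API, and the streaming propagation path is not shown.

```python
import time
from typing import Callable, Optional


class KillSwitchCache:
    """Local kill-state cache that fails closed (feature off) when stale and offline."""

    def __init__(self, fetch: Callable[[], bool], ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical: True means "feature is killed"
        self._ttl = ttl_seconds
        self._clock = clock
        self._killed: Optional[bool] = None
        self._fetched_at: Optional[float] = None

    def feature_enabled(self) -> bool:
        now = self._clock()
        fresh = self._fetched_at is not None and (now - self._fetched_at) < self._ttl
        if not fresh:
            try:
                self._killed = self._fetch()
                self._fetched_at = now
                fresh = True
            except OSError:
                # Flag service unreachable: keep going with no fresh value.
                pass
        if not fresh:
            # Stale cache and no network: default the feature to off.
            return False
        return not self._killed
```

The trade-off is deliberate: an outage of the flag service turns the feature off for everyone once the TTL expires. That is the right failure mode for a risky new path, and the wrong one for a flag that has quietly become permanent config.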

Dependency tagging: the flag graph

Once you have more than 20 flags, you have a graph. Flag A enables a new checkout. Flag B enables a new payment provider used by the new checkout. Flag C kills the old checkout. If you ramp A without B, you ship broken code. If you ramp C before A reaches 100%, the users still outside A's cohort are left with no working checkout at all.

Tag every flag with its dependencies at creation time. Most flag services support metadata fields; use them. A minimal schema:

flag: new_checkout
requires: [new_payment_provider >= 100%]
blocks: [old_checkout cleanup]
owner: payments-team
cleanup_by: 2026-08-15

Then write a CI check that fails if any flag's ramp percentage exceeds the lowest ramp percentage among its dependencies. This catches the most common production incident in flag-heavy teams: someone bumps the dependent flag without realizing the flag it depends on isn't ready. Treating flag rollouts like a small version of managing technical debt in a startup keeps the graph from rotting.
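A minimal sketch of that CI check, assuming the flag metadata has already been exported into a dictionary keyed by flag name with a percent and a requires list. That shape, and the function name find_ramp_violations, are assumptions for this example; adapt them to whatever your flag service's export actually looks like.

```python
from typing import Dict, List


def find_ramp_violations(flags: Dict[str, dict]) -> List[str]:
    """Return one message per flag ramped further than its dependencies allow."""
    problems: List[str] = []
    for name in sorted(flags):
        spec = flags[name]
        deps = spec.get("requires", [])
        unknown = [d for d in deps if d not in flags]
        if unknown:
            problems.append(f"{name}: unknown dependency {', '.join(unknown)}")
            continue
        if not deps:
            continue
        # A flag may never be ramped further than its least-ramped dependency.
        ceiling = min(flags[d]["percent"] for d in deps)
        if spec["percent"] > ceiling:
            problems.append(
                f"{name}: ramped to {spec['percent']}% "
                f"but dependencies allow at most {ceiling}%"
            )
    return problems
```

In CI, print the returned messages and exit non-zero when the list is not empty. Treating an unknown dependency as a failure is a design choice: a typo in a requires entry should break the build rather than silently disable the check.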

Sticky vs non-sticky bucketing

Sticky bucketing means once a user is in the treatment group, they stay there. Non-sticky means each evaluation is independent and a user might see the new feature on one page load and the old one on the next.

Use sticky when	Use non-sticky when
The feature touches UI a user will navigate back to	The feature is a backend optimization invisible to users
You're running an A/B test on conversion	You're rolling out a bug fix
The feature involves writes that change state	The feature is read-only and stateless
Users could screenshot or share what they see	Cohort assignment doesn't affect user trust

For 80% of feature rollouts, sticky is the right default. Non-sticky bucketing produces strange user experiences: a checkout button that moves between page loads, a pricing page that flickers between two tiers, a dashboard whose layout shuffles. Users notice and they file support tickets you cannot reproduce.

Implementation-wise, sticky bucketing means hashing a stable user ID (not session ID, not IP) into your cohort assignment. Most flag SDKs do this for you if you pass a userKey. The bug is almost always that the team forgets to pass it, and the SDK silently falls back to a per-request random number.
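If you ever need to do the hashing yourself, a sketch looks like the following. It uses SHA-256 from the standard library. Salting the hash with the flag key (so different flags get different cohorts) and the function names bucket and in_treatment are choices made for this example, not a description of how any particular SDK works.

```python
import hashlib


def bucket(flag_key: str, user_id: str) -> int:
    """Map a stable user ID to a bucket in the range 0..99, deterministically."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_treatment(flag_key: str, user_id: str, ramp_percent: int) -> bool:
    """True if this user falls inside the current ramp percentage."""
    if not user_id:
        # Fail loudly instead of silently falling back to a random assignment.
        raise ValueError("sticky bucketing needs a stable user ID")
    return bucket(flag_key, user_id) < ramp_percent
```

Because the bucket is fixed per user and the check is a simple "less than", raising the ramp from 5% to 25% only adds users to the treatment group; nobody who already saw the feature gets moved back out.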

Server vs client evaluation

Server-side evaluation runs the flag check inside your API. Pros: the value never leaks to the client, you can change it instantly without a client refresh, and you can use server-only context (user role, subscription tier) in the rule. Cons: every request runs a flag evaluation, which adds latency whenever the SDK has to call the flag service instead of evaluating from a rule set held in memory, and the flag service becomes a new dependency.

Client-side evaluation runs in the browser or mobile app. Pros: zero added latency, works offline, no extra cost. Cons: flag rules are visible in the JS bundle (security), the user must refresh to see a change, and you cannot keep secret features secret.

Most production systems use both. The flag service ships a bootstrap payload to the client on page load, and the server has its own evaluator for sensitive decisions. The tools that handle this well include LaunchDarkly, Statsig, Unleash, and Flagsmith; we cover the trade-offs in LaunchDarkly vs Statsig and the broader category in best feature flag services.

Rule of thumb: any flag that gates a write path or a paid feature must be evaluated server-side. Layout or copy flags can be client-side. Mixed flags need server-side authoritative evaluation, with the client reading the same flag for layout but never trusting it for authorization.
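The split can be sketched in a few lines. Everything here is hypothetical naming for illustration: bootstrap_payload builds what the server sends to the browser on page load, and authorize shows a write-path check that takes the client's claimed value only to make a point of ignoring it.

```python
from typing import Callable, Dict, Optional, Set


def bootstrap_payload(evaluated: Dict[str, bool], client_visible: Set[str]) -> Dict[str, bool]:
    """Flags shipped to the client on page load: only the ones marked client-visible."""
    return {name: value for name, value in evaluated.items() if name in client_visible}


def authorize(flag: str, user_id: str,
              server_eval: Callable[[str, str], bool],
              client_value: Optional[bool] = None) -> bool:
    """Decide a write or paid action from the server-side evaluation only.

    client_value is accepted to show that it is deliberately never consulted.
    """
    return server_eval(flag, user_id)
```

A mixed flag appears in both places: the client reads it from the bootstrap payload to pick a layout, and the server calls its own evaluator again before accepting the write.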

Observability hooks: error rate per cohort

A rollout without observability is a deploy with extra steps. The instrumentation:

Tag every log, metric, and trace with the flag cohort. Most tools (Datadog, Honeycomb, Grafana Cloud) let you attach a tag via middleware. Add it at the top of your request pipeline.
Build a "by cohort" dashboard before the rollout. Error rate, latency p50/p95/p99, throughput, and two or three business metrics, broken down by cohort. If you cannot build it in 10 minutes, your tagging is wrong.
Alert on cohort divergence, not absolute thresholds. Fire when the treatment cohort's error rate stays above 1.5x the control cohort's for 5 minutes. This catches changes that break the new path while the aggregate numbers still look healthy.
Capture cohort context in error reports. Sentry, Bugsnag, and Rollbar all support tagging. A 500 in a ramped feature should tell you instantly whether the user was in treatment or control. Pair this with end-to-end testing for SaaS, run against the treatment cohort in CI, to catch regressions before the ramp moves.
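The divergence rule is simple enough to write down directly. The sketch below assumes you can pull per-minute error rates for each cohort, oldest first, from your metrics store. The function name should_alert and its parameters are illustrative; in practice you would express the same condition in your alerting tool's own query language.

```python
from typing import Sequence


def should_alert(treatment_rates: Sequence[float], control_rates: Sequence[float],
                 ratio: float = 1.5, sustained_minutes: int = 5) -> bool:
    """True when treatment exceeded ratio x control in each of the last N minutes."""
    if sustained_minutes < 1:
        raise ValueError("sustained_minutes must be at least 1")
    if len(treatment_rates) != len(control_rates):
        raise ValueError("cohort series must cover the same minutes")
    if len(treatment_rates) < sustained_minutes:
        return False  # not enough data yet to call it sustained
    recent = zip(treatment_rates[-sustained_minutes:], control_rates[-sustained_minutes:])
    return all(t > ratio * c for t, c in recent)
```

Note that a treatment error rate of 0.2% against a control of 0.1% fires here even though both would sit far below a typical absolute threshold, which is the point of alerting on divergence.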

Cleanup dates and gate decay

Every flag should have an expiration date set at creation. The default we recommend is 90 days. If the flag is still in code on that date, it gets reviewed; if it cannot be removed, the date renews for another 30 days with a written reason.

Stale flags are a security and reliability liability. Each one is a dead branch of code that nobody tests and that any engineer (or AI agent writing code in your repo) might trigger by accident. A 2024 audit of one mid-size fintech found 312 of their 400 flags had not been touched in over a year; half had no owner.

A workable gate-decay process:

Tag every flag at creation with an owner, a feature, and a cleanup_by date.
Run a weekly cron that opens a GitHub issue for any flag past its cleanup date.
Assign the issue to the owning team automatically.
Block new flag creation by an owner who has more than 3 overdue cleanups.

Most teams skip the last step because it feels harsh. It is the only one that actually works.
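The bookkeeping behind that process fits in a short script. This is a sketch under assumptions: the Flag record mirrors the owner, feature, and cleanup_by tags from the first step, the function names are made up for this example, and opening the GitHub issue is left out because it depends on your tooling.

```python
from collections import Counter
from dataclasses import dataclass, replace
from datetime import date, timedelta
from typing import List, Set


@dataclass(frozen=True)
class Flag:
    name: str
    owner: str
    feature: str
    cleanup_by: date


def overdue(flags: List[Flag], today: date) -> List[Flag]:
    """Flags past their cleanup date; the weekly cron would open an issue for each."""
    return [f for f in flags if f.cleanup_by < today]


def blocked_owners(flags: List[Flag], today: date, max_overdue: int = 3) -> Set[str]:
    """Owners with more than max_overdue overdue cleanups may not create new flags."""
    counts = Counter(f.owner for f in overdue(flags, today))
    return {owner for owner, n in counts.items() if n > max_overdue}


def renew(flag: Flag, today: date, reason: str) -> Flag:
    """Extend a flag by 30 days, but only with a written reason."""
    if not reason.strip():
        raise ValueError("renewal requires a written reason")
    return replace(flag, cleanup_by=today + timedelta(days=30))
```

Wire blocked_owners into whatever creates flags (a CLI wrapper, a pull-request check) so the block is enforced by tooling rather than by someone remembering to look.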

Common pitfalls

Flag value cached in a CDN. If your edge layer caches a page whose content depends on the flag, every visitor gets whichever variant was cached first, and the effective ramp percentage will be wildly wrong. Vary by user ID or skip the cache for ramped paths.
Different values across services. Service A evaluates new_checkout as true, service B as false, request fails halfway through. Pass the evaluated value down the chain explicitly.
Rolling out during a deploy window. Two changes at once turn one debugging session into two. Ramp during quiet windows.
No baseline period. Always capture 24 hours of pre-rollout baseline; you cannot tell if error rate "doubled" without it.
Treating cleanup as optional. A repo with 200 stale flags is harder to reason about than 200 lines of well-commented legacy code.
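For the cross-service pitfall, one approach is to evaluate once at the edge and carry the result on every internal call. The sketch below uses a hypothetical X-Flag-Snapshot header and hypothetical helper names. It assumes the header is set and read only on trusted internal hops and is stripped from anything arriving from outside.

```python
import json
from typing import Callable, Dict, Mapping

FLAG_HEADER = "X-Flag-Snapshot"  # hypothetical header name, internal hops only


def outbound_headers(snapshot: Dict[str, bool]) -> Dict[str, str]:
    """Serialize the flag values evaluated at the edge for the downstream call."""
    return {FLAG_HEADER: json.dumps(snapshot, sort_keys=True)}


def flag_from_request(headers: Mapping[str, str], flag: str,
                      evaluate_locally: Callable[[str], bool]) -> bool:
    """Prefer the upstream-evaluated value; evaluate locally only if it is absent."""
    raw = headers.get(FLAG_HEADER)
    if raw is not None:
        snapshot = json.loads(raw)
        if flag in snapshot:
            return bool(snapshot[flag])
    return evaluate_locally(flag)
```

With this in place, service B sees the same new_checkout value service A acted on, even if B's own SDK has not yet received the updated rule.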

When you can skip this entirely

If you are two founders pre-revenue shipping to 20 beta users, you do not need a five-step ramp. You need a kill switch and a Slack channel. The full playbook earns its keep at roughly the point where you have more than 5,000 daily active users, more than two engineers, and at least one paying customer who would notice an outage.

Below that bar, a single boolean environment variable and a deploy is faster than a flag service and meets your safety needs. Above it, the playbook saves you a 2am incident roughly every quarter.

What to do next

Three concrete next steps. First, pick a flag service; see the trade-offs in LaunchDarkly vs Statsig. Second, write your ramp curve and kill-switch spec into a runbook. Third, instrument cohort tagging in your observability tool before your next rollout.

If wiring this up has been on the backlog for a quarter, a Cadence senior engineer ($1,500/week) typically gets the first three flags ramped, the kill switch tested, and the cohort dashboard live inside one week. Every Cadence engineer is AI-native, vetted on Cursor and Claude Code fluency before they unlock bookings, which matters here because the work is mostly SDK calls, observability tags, and CI checks (pattern-heavy work AI tools shorten by 3x). You can audit your stack with our Ship-or-Skip tool first to see whether the rollout pipeline is actually the bottleneck.

Want a second pair of eyes on the rollout plan before you ship? Book a senior Cadence engineer for a week. Two-day free trial, weekly billing, replace any week. The first commit usually lands within 27 hours.

FAQ

How long does a safe rollout take from 1% to 100%?

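Plan on at least 7 days. The minimum soak times in the ramp table add up to only 53 hours (1 + 4 + 24 + 24) before the 100% step, but those are floors, not targets. Most teams take 10 to 14 days for changes that touch revenue paths. Faster than 7 days is only safe for read-only, idempotent changes.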

LaunchDarkly, Statsig, Unleash, or build my own?

For most teams under 50 engineers, buy. LaunchDarkly is the safe default; Statsig is stronger if you also want experimentation; Unleash is the best open-source option if data residency matters. Building your own makes sense above $50k/month in flag-service spend.

Do I need a kill switch for every flag?

Yes for any flag that touches a write path, a paid feature, or user trust. Read-only UI flags can share a single global kill switch. The rule we use: if killing the feature requires a code deploy, you do not have a kill switch, you have wishful thinking.

Can AI-generated code be safely behind a flag?

Yes, and it should be. The same playbook applies, with one addition: tag the flag with the model and prompt used to generate the code, so you can correlate bugs back to a generation pattern. Teams that do this catch systematic AI-code issues (specific failure modes that show up across multiple flags) months earlier.

What's the right cleanup cadence for old flags?

Default 90 days from flag creation, with a hard review at that date. After review, either remove the flag, promote it to permanent config (with a different naming convention), or renew for 30 days with a written reason. Auto-block new flag creation by owners with more than 3 overdue cleanups.

How do I bucket users for sticky rollouts without a stable user ID?

You can't, reliably. The closest substitute is a long-lived first-party cookie set on first visit, hashed into your cohort assignment. Accept that you'll get drift (users clearing cookies, multi-device) and avoid sticky bucketing for features where that drift would be visible.

Madhuban Mukherjee

Graphic & Web Design Expert

Web design lead at withRemote. Writes on landing-page conversion craft, design systems, and the engineering-design handoff.