
To roll out a feature flag safely, ramp on a fixed curve (1% → 5% → 25% → 50% → 100%), gate every step behind an error-rate check per cohort, and ship a kill switch that flips faster than your CI can deploy. Add a cleanup date to the flag at creation. Skip any of these four and you will eventually wake up at 2am.
The hard part is not turning a flag on. The hard part is turning it off cleanly, knowing which cohort broke, and not having 400 dead flags rotting in your codebase a year later. This playbook covers the ramp curve, the kill switch, sticky vs non-sticky bucketing, server vs client evaluation, the observability hooks that make a rollout debuggable, and the gate-decay process that keeps flag count under control.
Five years ago, flags were a deploy-decoupling trick. In 2026, they are how you ship anything serious. The shift came from three things: trunk-based development became the norm, AI agents started writing 40 to 60% of production code on most teams, and observability tools got cheap enough to attribute errors to flag cohorts in real time.
When you can plot error rate per cohort live, a 5% ramp is no longer a guess; it is a controlled experiment. The flag service is now part of your safety stack, alongside CI and monitoring.
The default rollout: a developer opens LaunchDarkly or Statsig, creates a boolean flag called new_checkout, points it at 100% of users on Friday afternoon, and waits to see what Sentry says.
This works when the change is small and the surface area is one endpoint. It breaks the moment the change touches a write path, interacts with a cached value, depends on another flag, or lands on a cohort split that is not random. The fix is a rollout playbook the team runs the same way every time, with the dial movements and the rollback path written down before the first user sees the code.
Pick a curve. Stick to it. Most teams converge on something close to this:
| Step | Cohort size | Min soak time | What you watch | Auto-rollback trigger |
|---|---|---|---|---|
| 1 | 1% | 1 hour | Error rate, latency p95, custom event volume | Error rate > 2x baseline |
| 2 | 5% | 4 hours | Same + conversion / completion rate | Error rate > 1.5x baseline or conversion drop > 10% |
| 3 | 25% | 24 hours | Same + downstream service load | Any SLO breach |
| 4 | 50% | 24 hours | Same + cost / queue depth | Any SLO breach |
| 5 | 100% | Hold 7 days before cleanup | Full dashboard | Manual only |
The 1% step catches code that throws on import. The 5% step catches code that breaks under realistic concurrency. The 25% step is the first where statistical-significance tests on conversion metrics start working. The 50% step catches capacity issues (database pools, queue depth, third-party rate limits) that only surface near full load.
Soak times are minimums; if your traffic is uneven, let each step run through a full daily cycle. Never skip a step because the previous one looked clean. Skipping is how you learn that your 5% cohort happened to exclude all your enterprise customers.
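The ramp table can be encoded so that the curve and its rollback triggers live in code rather than in someone's memory. What follows is a minimal sketch, not a real flag-service API: `RampStep`, `CohortMetrics`, and `decide` are hypothetical names, and the metrics are assumed to arrive already aggregated per cohort from your observability tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RampStep:
    percent: int
    min_soak_hours: float
    max_error_ratio: Optional[float] = None      # error rate vs baseline, 2.0 means "2x"
    max_conversion_drop: Optional[float] = None  # fractional drop, 0.10 means 10%
    rollback_on_slo_breach: bool = False
    manual_only: bool = False                    # no automatic rollback at this step


# Mirrors the ramp table above, one entry per row.
RAMP = [
    RampStep(1, 1, max_error_ratio=2.0),
    RampStep(5, 4, max_error_ratio=1.5, max_conversion_drop=0.10),
    RampStep(25, 24, rollback_on_slo_breach=True),
    RampStep(50, 24, rollback_on_slo_breach=True),
    RampStep(100, 7 * 24, manual_only=True),
]


@dataclass(frozen=True)
class CohortMetrics:
    error_rate: float
    baseline_error_rate: float
    conversion_drop: float = 0.0
    slo_breached: bool = False


def should_rollback(step: RampStep, m: CohortMetrics) -> bool:
    """Apply the auto-rollback trigger for a single ramp step."""
    if step.manual_only:
        return False
    if step.rollback_on_slo_breach and m.slo_breached:
        return True
    if (step.max_error_ratio is not None
            and m.error_rate > step.max_error_ratio * m.baseline_error_rate):
        return True
    if (step.max_conversion_drop is not None
            and m.conversion_drop > step.max_conversion_drop):
        return True
    return False


def decide(step_index: int, hours_soaked: float, m: CohortMetrics) -> str:
    """Return 'rollback', 'hold', 'advance', or 'complete' for the current step."""
    step = RAMP[step_index]
    if should_rollback(step, m):
        return "rollback"
    if hours_soaked < step.min_soak_hours:
        return "hold"  # soak times are minimums, so never advance early
    return "complete" if step_index == len(RAMP) - 1 else "advance"
```

Because `decide` only ever moves one index forward, skipping a step is impossible by construction, which is the property the curve exists to protect.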
A kill switch is a flag that turns the feature off in under 30 seconds, from any device, without a deploy. It is the single most important piece of feature-flag infrastructure, and most teams build it wrong.
Three rules:
kill_<feature> and document that flipping it is always safe.
The clearest test of your kill switch is the 2am test: can your on-call engineer, on their phone, on hotel wifi, kill a feature in under a minute? If not, fix that before you ship anything else.
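One way to make flipping a `kill_<feature>` flag always safe is to let it win over every other rule, including the ramp. The sketch below assumes flags arrive as a plain name-to-boolean mapping; `kill_flag_name` and `feature_enabled` are hypothetical helpers, not part of any flag SDK.

```python
from typing import Mapping


def kill_flag_name(feature: str) -> str:
    """Naming convention from the text: kill_<feature>."""
    return f"kill_{feature}"


def feature_enabled(feature: str, flags: Mapping[str, bool], in_cohort: bool) -> bool:
    """Evaluate a feature with its kill switch taking absolute precedence.

    Assumption: `flags` is a snapshot of boolean flag values and `in_cohort`
    is the result of the ramp bucketing for this user.
    """
    # The kill switch is checked first, so no ramp percentage or targeting
    # rule can turn the feature back on while it is flipped.
    if flags.get(kill_flag_name(feature), False):
        return False
    # A missing feature flag defaults to off, the known-good path.
    return in_cohort and flags.get(feature, False)
```

Checking the kill flag before anything else keeps the switch independent of the ramp state, so the on-call engineer never has to reason about percentages at 2am.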
Once you have more than 20 flags, you have a graph. Flag A enables a new checkout. Flag B enables a new payment provider used by the new checkout. Flag C kills the old checkout. If you ramp A without B, you ship broken code. If you ramp C without A reaching 100%, you ship nothing.
Tag every flag with its dependencies at creation time. Most flag services support metadata fields; use them. A minimal schema:
flag: new_checkout
requires: [new_payment_provider >= 100%]
blocks: [old_checkout cleanup]
owner: payments-team
cleanup_by: 2026-08-15
Then write a CI check that fails if any flag's ramp percentage exceeds the minimum of its dependencies. This catches the most common production incident in flag-heavy teams: someone bumps the dependent flag without realizing the parent isn't ready. Treating flag rollouts like a small version of managing technical debt in a startup keeps the graph from rotting.
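The CI check described above can be sketched in a few lines. This version assumes the flag metadata has already been loaded into a dictionary with a `percent` and a `requires` list per flag; that shape, and the function names, are illustrative only. It applies the "minimum of its dependencies" rule from the paragraph and does not parse the `>= 100%` threshold syntax shown in the schema.

```python
from typing import Dict, List

# Hypothetical in-memory form of the flag metadata.
FLAGS: Dict[str, dict] = {
    "new_payment_provider": {"percent": 25, "requires": []},
    "new_checkout": {"percent": 50, "requires": ["new_payment_provider"]},
}


def find_violations(flags: Dict[str, dict]) -> List[str]:
    """List flags ramped further than the least-ramped flag they depend on."""
    violations = []
    for name, spec in flags.items():
        dep_percents = []
        for dep in spec.get("requires", []):
            if dep not in flags:
                violations.append(f"{name}: unknown dependency {dep}")
            else:
                dep_percents.append(flags[dep]["percent"])
        if dep_percents and spec["percent"] > min(dep_percents):
            violations.append(
                f"{name} is at {spec['percent']}% but its dependencies "
                f"are only at {min(dep_percents)}%"
            )
    return violations


def exit_code(flags: Dict[str, dict]) -> int:
    """Return 1 when any violation exists, so a CI job can fail on it."""
    problems = find_violations(flags)
    for line in problems:
        print(line)
    return 1 if problems else 0
```

In a CI job you would call `sys.exit(exit_code(FLAGS))` after loading the real metadata. With the sample data above, `new_checkout` at 50% is flagged because `new_payment_provider` is only at 25%.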
Sticky bucketing means once a user is in the treatment group, they stay there. Non-sticky means each evaluation is independent and a user might see the new feature on one page load and the old one on the next.
| Use sticky when | Use non-sticky when |
|---|---|
| The feature touches UI a user will navigate back to | The feature is a backend optimization invisible to users |
| You're running an A/B test on conversion | You're rolling out a bug fix |
| The feature involves writes that change state | The feature is read-only and stateless |
| Users could screenshot or share what they see | Cohort assignment doesn't affect user trust |
For 80% of feature rollouts, sticky is the right default. Non-sticky bucketing produces strange user experiences: a checkout button that moves between page loads, a pricing page that flickers between two tiers, a dashboard whose layout shuffles. Users notice and they file support tickets you cannot reproduce.
Implementation-wise, sticky bucketing means hashing a stable user ID (not session ID, not IP) into your cohort assignment. Most flag SDKs do this for you if you pass a userKey. The bug is almost always that the team forgets to pass it, and the SDK silently falls back to a per-request random number.
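The hashing step looks roughly like this. It is a sketch of the general technique, not how any particular SDK does it: `bucket` and `in_treatment` are hypothetical names, SHA-256 is one reasonable choice of hash, and raising on a missing key is a deliberate departure from the silent random fallback described above.

```python
import hashlib


def bucket(flag_key: str, user_key: str) -> int:
    """Map a (flag, user) pair to a stable bucket in [0, 10000)."""
    # Salting with the flag key keeps cohorts independent across flags,
    # so the same users are not always the first ones into every rollout.
    digest = hashlib.sha256(f"{flag_key}:{user_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def in_treatment(flag_key: str, user_key: str, percent: float) -> bool:
    """Return True if this user falls inside the current ramp percentage."""
    if not user_key:
        # Fail loudly instead of falling back to a per-request random number.
        raise ValueError("user_key is required for sticky bucketing")
    # Buckets are in hundredths of a percent, so 5% covers buckets 0..499.
    return bucket(flag_key, user_key) < percent * 100
```

Because the bucket depends only on the flag and the user, a user admitted at 5% is still in the treatment group at 25%, and nobody flips back and forth between page loads.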
Server-side evaluation runs the flag check inside your API. Pros: the value never leaks to the client, you can change it instantly without a client refresh, and you can use server-only context (user role, subscription tier) in the rule. Cons: the flag service becomes a new dependency on the request path, and evaluation adds latency whenever the SDK has to fetch rules remotely instead of evaluating against a locally cached copy.
Client-side evaluation runs in the browser or mobile app. Pros: zero added latency, works offline, no extra cost. Cons: flag rules are visible in the JS bundle (security), the user must refresh to see a change, and you cannot keep secret features secret.
Most production systems use both. The flag service ships a bootstrap payload to the client on page load, and the server has its own evaluator for sensitive decisions. The tools that handle this well include LaunchDarkly, Statsig, Unleash, and Flagsmith; we cover the trade-offs in LaunchDarkly vs Statsig and the broader category in best feature flag services.
Rule of thumb: any flag that gates a write path or a paid feature must be evaluated server-side. Layout or copy flags can be client-side. Mixed flags need server-side authoritative evaluation, with the client reading the same flag for layout but never trusting it for authorization.
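The rule of thumb is simple enough to encode as a lint on flag metadata. The sketch below assumes each flag is annotated with three traits at creation; `FlagTraits` and `evaluation_plan` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlagTraits:
    gates_write_path: bool = False
    gates_paid_feature: bool = False
    affects_layout_or_copy: bool = False


def evaluation_plan(traits: FlagTraits) -> dict:
    """Decide where a flag is authoritative and whether the client may read it."""
    sensitive = traits.gates_write_path or traits.gates_paid_feature
    if sensitive and traits.affects_layout_or_copy:
        # Mixed flag: the server decides, the client reads the value for
        # layout only and never uses it for authorization.
        return {"authoritative": "server", "client_may_read": True}
    if sensitive:
        return {"authoritative": "server", "client_may_read": False}
    # Layout or copy flags can be evaluated client-side.
    return {"authoritative": "client", "client_may_read": True}
```

For example, a flag that changes the checkout layout and also gates a write gets server-side authority with a client read, matching the mixed-flag case above.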
A rollout without observability is a deploy with extra steps. The instrumentation:
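Cohort tagging is the hook that makes per-cohort error rates possible. The sketch below assumes events are plain dictionaries with an optional `error` field and uses a `flag.<key>` attribute name for the tag; both conventions, and the function names, are assumptions rather than the API of any specific observability tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def tag_event(event: Mapping, flag_key: str, variant: str) -> dict:
    """Return a copy of the event carrying the flag variant the request saw."""
    tagged = dict(event)
    tagged[f"flag.{flag_key}"] = variant
    return tagged


def error_rate_by_cohort(events: Iterable[Mapping], flag_key: str) -> Dict[str, float]:
    """Compute the error rate for each variant of one flag."""
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for event in events:
        # Events that were never tagged get their own bucket so that
        # missing instrumentation shows up instead of hiding in a cohort.
        variant = event.get(f"flag.{flag_key}", "untagged")
        totals[variant] += 1
        if event.get("error"):
            errors[variant] += 1
    return {variant: errors[variant] / totals[variant] for variant in totals}
```

Comparing the `on` and `off` rates for the same time window is what turns a ramp step into the controlled experiment described earlier.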
Every flag should have an expiration date set at creation. The default we recommend is 90 days. If the flag is still in code on that date, it gets reviewed; if it cannot be removed, the date renews for another 30 days with a written reason.
Stale flags are a security and reliability liability. Each one is a dead branch of code that nobody tests and that any engineer (or AI agent writing code in your repo) might trigger by accident. A 2024 audit of one mid-size fintech found 312 of their 400 flags had not been touched in over a year; half had no owner.
A workable gate-decay process:
cleanup_by date.
Most teams skip the last bullet because it feels harsh. It is the only one that actually works.
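The deadline check behind a process like this can be automated. The sketch below assumes each flag record carries the `owner` and `cleanup_by` fields from the schema shown earlier, and it uses the threshold of more than 3 overdue cleanups that this playbook recommends for blocking new flag creation; the function names are hypothetical.

```python
from collections import Counter
from datetime import date
from typing import Iterable, List, Mapping, Set


def overdue_flags(flags: Iterable[Mapping], today: date) -> List[Mapping]:
    """Return flags whose cleanup_by date (ISO format) has already passed."""
    return [f for f in flags if date.fromisoformat(f["cleanup_by"]) < today]


def blocked_owners(flags: Iterable[Mapping], today: date, max_overdue: int = 3) -> Set[str]:
    """Return owners with more than max_overdue flags past their deadline."""
    counts = Counter(f["owner"] for f in overdue_flags(flags, today))
    return {owner for owner, n in counts.items() if n > max_overdue}
```

Running `overdue_flags` on a schedule produces the review list, and `blocked_owners` gives the flag-creation hook a set to check against.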
new_checkout as true, service B as false, request fails halfway through. Pass the evaluated value down the chain explicitly.
If you are two founders pre-revenue shipping to 20 beta users, you do not need a five-step ramp. You need a kill switch and a Slack channel. The full playbook earns its keep at roughly the point where you have more than 5,000 daily active users, more than two engineers, and at least one paying customer who would notice an outage.
Below that bar, a single boolean environment variable and a deploy is faster than a flag service and meets your safety needs. Above it, the playbook saves you a 2am incident roughly every quarter.
Three concrete next steps. First, pick a flag service; see the trade-offs in LaunchDarkly vs Statsig. Second, write your ramp curve and kill-switch spec into a runbook. Third, instrument cohort tagging in your observability tool before your next rollout.
Plan for a minimum of 7 days per rollout if you respect the soak times in the ramp table. Most teams take 10 to 14 days for changes that touch revenue paths. Faster than 7 days is only safe for read-only, idempotent changes.
On the build-versus-buy question, most teams under 50 engineers should buy a flag service. LaunchDarkly is the safe default; Statsig is stronger if you also want experimentation; Unleash is the best open-source option if data residency matters. Building your own makes sense above $50k/month in flag-service spend.
Give a dedicated kill switch to any flag that touches a write path, a paid feature, or user trust. Read-only UI flags can share a single global kill switch. The rule we use: if killing the feature requires a code deploy, you do not have a kill switch, you have wishful thinking.
AI-generated code can go behind a flag, and it should. The same playbook applies, with one addition: tag the flag with the model and prompt used to generate the code, so you can correlate bugs back to a generation pattern. Teams that do this catch systematic AI-code issues (specific failure modes that show up across multiple flags) months earlier.
Set the cleanup deadline to 90 days from flag creation by default, with a hard review at that date. After review, either remove the flag, promote it to permanent config (with a different naming convention), or renew for 30 days with a written reason. Auto-block new flag creation by owners with more than 3 overdue cleanups.
Without a stable user ID, you can't make bucketing reliably sticky. The closest substitute is a long-lived first-party cookie set on first visit, hashed into your cohort assignment. Accept that you'll get drift (users clearing cookies, multi-device) and avoid sticky bucketing for features where that drift would be visible.