
To estimate software development time accurately, get a 3-point estimate (best, likely, worst) from the engineer who will do the work, run it through PERT = (best + 4 × likely + worst) / 6, then multiply by 1.6 to convert raw coding hours into calendar time. In 2026, also split the work into routine code (AI compresses 2-4x) and novel architecture (AI compresses 1.1-1.3x) before you add the numbers up.
The rest is the structured version: why estimates fail, the formulas that fix most of it, and the 2026 shifts that change the math.
Estimates miss for three reasons that have nothing to do with the engineer being lazy.
The first is the planning fallacy. Kahneman and Tversky documented in 1979 that humans underestimate task duration even when they have prior data showing how long similar tasks took. The brain treats the new task as the easy version of the old one. It almost never is.
The second is Hofstadter's law: it always takes longer than you expect, even when you account for Hofstadter's law. The recursion is the joke and also the truth. Once you start padding your estimates, you start padding the padding, and you still come up short.
The third is the surprise integration. The Stripe webhook signature is wrong in the docs. The OAuth refresh token expires in 14 days, not 30. The third-party API rate-limits at 5 requests per second, not 50. None of these show up in the spec; all of them show up on day three. The same dynamic shows up in API versioning decisions, where one wrong assumption about a third-party contract turns a 1-day estimate into a 5-day refactor.
Most estimates miss by 20-100%, not 5-10%. If your team is consistently inside 20%, you are either estimating very small things or you have built the discipline this post is about.
The biggest shift in the last two years is that engineering work no longer has one productivity curve. It has two.
Routine code (CRUD endpoints, form validation, dependency upgrades, test coverage, integrations with documented APIs, the "I have written this six times" work) compresses 2-4x when an engineer is fluent in Cursor, Claude Code, and Copilot. A 40-hour stretch of routine work shrinks to 10-15 hours.
Novel architecture (sharded data models, event sourcing, real-time consensus, anything where the answer is not in the model's training data) compresses closer to 1.1-1.3x. The AI helps with typing, not with thinking. A 40-hour novel-architecture stretch shrinks to about 32 hours, and most of the savings come from the AI catching syntax mistakes you'd have hit in the linter anyway.
The implication for estimation is loud: estimate the two buckets separately, then add. If you average them, you will be 50% off in 2026, and you won't be able to tell which half was wrong.
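A minimal sketch of the two-bucket arithmetic in Python; the 40/40 split and the mid-band compression constants are illustrative, not data:

```python
# Split the work into the two curves before summing -- never average them.
routine_hours = 40  # CRUD, documented integrations, "written this six times" work
novel_hours = 40    # sharded data models, consensus, off-training-data design

ROUTINE_COMPRESSION = 3.0  # mid-band of the 2-4x range
NOVEL_COMPRESSION = 1.2    # mid-band of the 1.1-1.3x range

total = routine_hours / ROUTINE_COMPRESSION + novel_hours / NOVEL_COMPRESSION
print(round(total, 1))  # 46.7 hours -- the routine half shrank, the novel half barely moved
```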
For every task, ask the engineer for three numbers, not one: the best case, the most likely case, and the worst case.
Then compute the PERT estimate:
PERT = (best + 4 × likely + worst) / 6
The formula weights the likely case 4x but pulls the average toward the worst when the worst is far away. That asymmetry is the entire point.
Worked example. An engineer says "1 best, 3 likely, 10 worst" for a webhook integration. Single-point says 3 days. PERT says (1 + 12 + 10) / 6 = 3.83 days. Stack 30 of those across a quarter and the PERT version is five weeks more honest.
Use PERT for anything between 1 and 10 days. Below 1 day, the overhead exceeds the variance. Above 10 days, you should be decomposing instead.
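A minimal PERT helper in Python (the function name is mine; the demo reuses the webhook example above):

```python
def pert(best: float, likely: float, worst: float) -> float:
    """3-point PERT: weight the likely case 4x and let a distant
    worst case pull the average upward."""
    return (best + 4 * likely + worst) / 6

print(round(pert(1, 3, 10), 2))  # 3.83 days, vs. the single-point 3
```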
A salaried engineer reports 40 hours per week and ships closer to 25 productive coding hours.
Where the 15 hours go: standups, code review, Slack, deploy debugging, demo prep, on-call rotations, meeting tax, and the cognitive cost of context switching. None of it is wasted; all of it is real work; none of it is the work you estimated.
The working multiplier is 1.6x. Take any raw coding estimate and multiply it by 1.6 to get calendar time. A "10-hour task" is two days on the calendar, not 10 hours wedged into a Tuesday.
If you skip this step, every 2-week sprint becomes 3 weeks and nobody can explain why. The retro will blame Slack. The retro will be wrong: the estimation was wrong, and Slack is a constant.
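The conversion as a one-liner, assuming 8-hour working days (both constants are knobs, not facts about your team):

```python
SHRINKAGE = 1.6  # raw coding hours -> calendar hours; replace with your measured value

def calendar_days(raw_coding_hours: float, hours_per_day: float = 8.0) -> float:
    """Convert a raw coding estimate into calendar working days."""
    return raw_coding_hours * SHRINKAGE / hours_per_day

print(calendar_days(10))  # 2.0 -- the "10-hour task" from above
```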
Reference-class forecasting is Kahneman's most useful idea for engineering managers. Instead of estimating from first principles, find the closest project you've actually shipped and use its real duration.
The discipline: pull the three closest projects you've actually shipped, take their real calendar durations from the tracker, and use the median as your baseline.
Three reference projects beat one expert opinion every time. The reason is boring: the expert remembers the version where it went well. The repo remembers the version where it didn't.
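A minimal sketch of that lookup in Python (the project names and durations are hypothetical):

```python
from statistics import median

# Hypothetical shipping history: feature -> actual calendar days to ship.
shipped = {"billing-v1": 21, "oauth-login": 12, "usage-metering": 26}

def reference_class_estimate(comparables: list[str]) -> float:
    """Estimate from the real durations of the closest shipped
    relatives, not from first principles."""
    return median(shipped[name] for name in comparables)

print(reference_class_estimate(["billing-v1", "oauth-login", "usage-metering"]))  # 21
```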
If you're maintaining production systems where reliability matters, you'll also want to read our guide to handling database migrations safely in production, since migration windows are one of the most consistently underestimated parts of any data-layer feature.
Any single estimate over 2 days is hiding three smaller estimates that haven't been thought about.
"Build the billing flow, 2 weeks." Decomposed, that becomes:
Sum is 10.5 days, not 10. Through PERT and the 1.6x shrinkage factor, the calendar number is closer to 17 days. Three weeks, not two. The same decomposition discipline applies to ops work; if you're scoping observability, a microservices monitoring stack decomposes into 8-12 sub-tasks the moment you stop pretending OpenTelemetry "just works."
Decomposition feels like overhead. It is the cheap, boring step that catches the surprise integration before you commit to a date. Skip it once and you'll skip it forever; do it twice and you'll be the only team in the building who hits dates.
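A sketch of that arithmetic, with a hypothetical breakdown of the billing flow (the sub-tasks and their 3-point numbers are invented to sum to the 10.5 days above):

```python
# Hypothetical decomposition of "build the billing flow".
# Each sub-task carries (best, likely, worst) in days.
subtasks = {
    "Stripe integration": (2.0, 3.0, 6.0),
    "plan / price model": (0.5, 1.0, 2.0),
    "checkout UI":        (1.0, 2.0, 3.0),
    "webhook handlers":   (1.0, 1.5, 4.0),
    "invoice emails":     (0.5, 1.0, 1.5),
    "edge cases + tests": (0.5, 1.0, 3.0),
}

pert_sum = sum((b + 4 * l + w) / 6 for b, l, w in subtasks.values())
print(round(pert_sum, 1))        # 10.5 raw days
print(round(pert_sum * 1.6, 1))  # 16.8 -- call it 17 calendar days
```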
A 2026-native estimation step most guides still skip: paste your spec into Claude or Cursor and ask for an estimate.
The discipline: paste the full spec, ask for a 3-point estimate and a list of the unknowns, then put the AI's number next to the engineer's.
Where they diverge by more than 50%, the spec is incomplete. That's the signal to clarify, not to argue. The most useful output of this step is rarely the number; it's the "list the unknowns" reply, which surfaces the questions you should have asked the founder before you typed the spec.
AI estimates correlate roughly 0.7 with senior estimates on routine work and much weaker on novel architecture. They are a sanity check, not a replacement.
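The divergence check is one comparison; a minimal version below, with the 4-day and 9-day inputs invented for illustration:

```python
def divergence(engineer_days: float, ai_days: float) -> float:
    """Relative gap between the two estimates."""
    return abs(engineer_days - ai_days) / min(engineer_days, ai_days)

# Over 50% apart -> the spec is incomplete; clarify, don't argue.
if divergence(engineer_days=4, ai_days=9) > 0.5:
    print("spec has unknowns -- ask the founder before committing to a date")
```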
This is also the moment where doing code reviews effectively starts to matter, because the AI-generated portion of the work needs the same scrutiny as the human-written portion, and the time you save on writing you'll spend on reviewing.
After 4-6 sprints, every team has a personal multiplier. Maybe you ship at 1.4x your estimates. Maybe 2x. Whatever it is, it's stable, and it's the most useful number you own.
Track shipped vs estimated for each completed task. Not story points; calendar days. After six sprints, compute the ratio. Most teams' real multiplier sits between 1.5x and 2.2x.
Then apply it to new estimates as the last step, even when it stings. The CFO does not care that your raw estimate was 4 weeks; they care when the feature ships. The 1.8x multiplier converts the first number into the second.
Velocity averaging is the only estimation technique that gets more accurate over time without anyone learning anything new. The team gets better at the work; the multiplier captures the rest.
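A sketch of the bookkeeping, assuming a task log of (estimated, shipped) calendar days; the history values are invented, and the median is one robust choice for "the ratio":

```python
from statistics import median

# Hypothetical log of completed tasks: (estimated_days, shipped_days).
history = [(3, 5), (2, 4), (5, 8), (1, 2), (8, 13), (4, 7)]

multiplier = median(shipped / estimated for estimated, shipped in history)
print(round(multiplier, 2))  # 1.71 -- this team's personal multiplier

def calibrated(raw_estimate_days: float) -> float:
    """Apply the observed multiplier as the last step, even when it stings."""
    return raw_estimate_days * multiplier

print(round(calibrated(10), 1))  # 17.1 calendar days for a "10-day" estimate
```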
If you want a quick outside read on whether your current stack and rituals are slowing your estimates down, Cadence's Ship or Skip audit grades the parts of your engineering setup that compound into estimate misses (CI flakiness, review backlog, deploy friction, monitoring gaps).
| Estimation method | Best for | Accuracy band | AI-native adjustment in 2026 |
|---|---|---|---|
| Single-point gut | Tasks under 1 day | +/- 50-100% | None; not worth running through AI |
| 3-point PERT | 1-10 day tasks | +/- 20-30% | Halve the likely value for routine code |
| Reference-class | Whole features | +/- 15-25% | Adjust comparable by current AI tooling |
| Spec-as-prompt | Sanity check on PERT | +/- 25% routine; weak on novel | Native to AI workflow |
| Team-velocity multiplier | Long-run planning | +/- 10-15% after 6 sprints | Multiplier shrinks ~30% within 2 sprints of adopting Cursor |
These are the patterns that look like estimates and are actually something else:

- A single-point gut number on anything bigger than two days.
- Padding stacked on padding until nobody trusts the number.
- An estimate that never gets compared to the date the work actually shipped.
- A deadline handed down from above and relabeled an estimate.

If two or more of these are happening on your team, the answer is not better engineers. It's a different estimation ritual.
Estimation has an ROI curve, and not every task is on the right side of it.
Pure exploration (research spikes, novel ML, performance investigation) should get a time-box, not an estimate. "Spend 3 days on this and tell me what you found" is honest. "Estimate how long it'll take you to figure out a thing nobody has figured out" is not.
Tasks under 1 day aren't worth a 3-point estimate; the overhead exceeds the variance. Use a gut number and move on.
If the team is two founders pre-revenue and the feature is the next obvious thing, ship and re-plan after. Estimation is for coordination across people; if there are no other people, the rituals are theatre.
If you're a founder reading this and thinking "I need an engineer who already runs this discipline," that's the bet Cadence makes by default. Every engineer on the platform is AI-native, vetted on Cursor, Claude Code, and Copilot fluency before they unlock bookings, and senior tier ($1,500/week) has done enough sprints to walk in with a personal velocity multiplier instead of guessing yours.
The 48-hour free trial is the cheapest way to test an estimate's honesty. You hand over a 2-week spec; if the engineer can decompose it, PERT it, and surface the unknowns by hour 48, the estimation discipline is real. If they hand back a single-point gut number, you keep your money.
Across the 12,800-engineer pool, 67% of trials convert to active bookings, and the median time to first commit is 27 hours. Both numbers are downstream of the estimation discipline above.
Try it. If you want an honest grade on the parts of your engineering setup that distort estimates, run the free Ship or Skip audit. Five minutes, no signup, and you'll see exactly where your team's velocity multiplier is leaking.
Multiply the engineer's raw coding estimate by 1.6 to convert coding hours into calendar time, then run it through PERT for the variance. Most teams' real multiplier sits between 1.5x and 2.2x of the original estimate after 4-6 sprints of data, and once you have your number it stays roughly stable.
PERT = (best + 4 × likely + worst) / 6. The formula weights the likely case 4x and pulls the average toward the worst case when the worst is far from the likely. Use it for any task between 1 and 10 days; below 1 day the overhead exceeds the variance, above 10 days you should be decomposing into smaller pieces instead.
Yes for routine work (CRUD, integrations, refactors compress 2-4x with Cursor and Claude Code), no for novel architecture (compression is closer to 1.1-1.3x). Estimate the two buckets separately or you will be 50% off. The "spec-as-prompt" step (have the AI estimate from your written spec) is the single most useful new technique.
Three reasons: planning fallacy (humans underestimate even with prior data), Hofstadter's law (it takes longer than you expect even when you account for Hofstadter's law), and surprise integrations that don't appear in the spec. The fix is structured estimation (3-point PERT, decomposition, shrinkage factor, velocity multiplier), not better guessing.
Hours for tasks under 2 days, story points for anything larger and recurring. Hours give precision when variance is low; story points let the team build a velocity baseline that gets more accurate over time without anyone trying harder. Most teams over-rotate on story points and then can't translate them back into calendar dates the CEO cares about.