
Adding voice AI to a customer-facing product in 2026 typically costs $5,000 to $90,000 in engineering time, plus $0.08 to $0.31 per minute of runtime. The build range depends entirely on scope: simple TTS narration ships in 1-2 weeks, a support voice agent on Vapi or Retell takes 4-8 weeks, and a full telephony stack on Twilio runs 8-16 weeks. Per-minute infra is a separate line item from build, and at high call volumes it can overtake build cost inside year one.
Most "voice AI cost" guides skip the engineering side entirely and stop at per-minute vendor math. That's fine if you already have an engineer. If you're scoping a budget and a timeline, you need both numbers. This post gives you the build estimate in engineer-weeks, the runtime estimate at three call volumes, and the hidden costs that make founders blow past their first quote.
"Voice AI" is a category, not a feature. Before you can price it, you need to pick which of these four scopes you're shipping. The cost shape is wildly different across them.
1. TTS narration. A button that reads your content aloud, an audio version of an article, an AI-generated voiceover for a product walkthrough. No microphone input, no real-time round trip. You're sending text to ElevenLabs, Cartesia, or Deepgram Aura-2 and streaming back audio.
2. Voice agent (asynchronous). User leaves a message or sends a voice note; your app transcribes it, reasons over it, and responds with text or audio. Used in support inboxes and async workflows. No latency pressure.
3. Real-time conversation. User talks, the agent talks back, with sub-second turn-taking. Powers in-app voice copilots, AI receptionists, voice tutors. This is where Vapi, Retell, and ElevenLabs Agents live. Latency is the whole game.
4. Full telephony. Real-time conversation, but over the phone network. Twilio or Telnyx routes the call; your stack handles speech, language, and voice. Outbound dialers, inbound receptionists, AI sales agents. Adds carrier complexity, recording laws, DTMF handling, and call-quality engineering.
Each scope draws on the same four primitives: STT (speech-to-text), an LLM, TTS, and (for telephony) a SIP carrier. But the engineering and cost weights are not the same.
Here's the build estimate, mapped to actual weekly engineer rates. We use Cadence's locked tiers (junior $500/wk, mid $1,000/wk, senior $1,500/wk, lead $2,000/wk) because they're the cleanest reference point for budgeting weekly engineering work in 2026.
| Scope | Recommended tier | Engineer-weeks | Build cost |
|---|---|---|---|
| TTS narration | Mid | 1-2 weeks | $1,000 - $2,000 |
| Voice agent (async) | Mid + senior pairing | 3-6 weeks | $4,500 - $13,500 |
| Real-time conversation (managed) | Senior | 4-8 weeks | $6,000 - $12,000 |
| Real-time conversation (best-of-breed) | Senior + occasional lead | 6-12 weeks | $9,000 - $24,000 |
| Full telephony (Twilio/Telnyx + custom) | Lead-anchored | 8-16 weeks | $16,000 - $32,000 |
| Self-hosted (Whisper + Llama + XTTS) | Lead + senior | 12-20 weeks | $40,000 - $80,000+ |
A few notes on why the tier matters. TTS narration is a fetch() against an HTTP API; a mid-level engineer ships it in a sprint. Real-time conversation, in contrast, involves WebRTC streams, partial transcripts, barge-in handling, and tool calls fired mid-utterance. That's senior work. Full telephony adds SIP, recording compliance, and call-quality tuning, which is why we tag it lead.
If you're sizing a team rather than a single engineer: the typical real-time voice agent project ships with one senior anchor plus a mid-level engineer for integrations and admin UI. That pairing is what gets you the 6-12 week ship window without the senior burning out on busywork.
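The single-tier rows in the table are just weekly rate times engineer-weeks. A minimal sketch of that arithmetic, using the rates quoted above; the function name `build_cost` is ours, and it assumes one engineer at one tier for the whole build (mixed-tier rows need their own blend):

```python
# Weekly rates quoted in this post (USD per engineer-week).
WEEKLY_RATE = {"junior": 500, "mid": 1_000, "senior": 1_500, "lead": 2_000}


def build_cost(tier: str, weeks_low: int, weeks_high: int) -> tuple[int, int]:
    """Return the (low, high) build cost for one engineer at a single tier."""
    rate = WEEKLY_RATE[tier]
    return rate * weeks_low, rate * weeks_high


print(build_cost("mid", 1, 2))      # TTS narration -> (1000, 2000)
print(build_cost("senior", 4, 8))   # real-time, managed -> (6000, 12000)
print(build_cost("lead", 8, 16))    # full telephony -> (16000, 32000)
```

Swap in your own rates and week ranges to re-derive the table for a different team.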
For context, comparable scoping shows up in adjacent build-cost categories. The same shape (mid-level for the boring half, senior for the hard half) drives our estimate to add RAG to a SaaS app, which lands in a similar $8K-$100K range.
Build is one line item. Runtime is the other. Here's the all-in cost per minute, broken out by component, for the four vendors most teams shortlist in 2026.
| Component | ElevenLabs | Cartesia | Deepgram | OpenAI Realtime |
|---|---|---|---|---|
| STT | (uses Whisper, $0.006/min) | (uses Deepgram or Whisper) | Nova-3: $0.0043/min | Built into Realtime |
| LLM | BYO (GPT-4o, Claude, etc.) | BYO | BYO | Built-in (gpt-4o-realtime) |
| TTS | $0.08-$0.12/min (Agents) | $0.038/1K chars (~$0.05/min) | Aura-2: $0.030/1K chars (~$0.04/min) | Built-in |
| Total all-in | $0.10-$0.18/min | $0.07-$0.13/min | $0.06-$0.12/min | $0.06/min input + $0.24/min output (audio) |
A few takeaways from these numbers:
Scaled out, the gap matters. Here's what the same workload costs across volume tiers, using middle-of-the-band assumptions for each vendor.
| Vendor stack | 1k min/mo | 100k min/mo | 1M min/mo |
|---|---|---|---|
| Vapi (managed, all-in $0.20/min avg) | $200 | $20,000 | $200,000 |
| Retell ($0.07/min flat + LLM) | $100 | $10,000 | $100,000 |
| ElevenLabs Agents ($0.10/min) | $100 | $10,000 | $100,000 |
| Cartesia + Deepgram + Claude (~$0.10/min) | $100 | $10,000 | $100,000 |
| OpenAI Realtime (~$0.18/min blended) | $180 | $18,000 | $180,000 |
| Self-hosted (~$0.05/min once amortized) | $50 + GPU floor | $5,000 | $50,000 |
The crossover is real. Below 50,000 minutes per month, managed wins because engineering cost dominates. Above 200,000 minutes per month, best-of-breed pays for itself in three months. Self-hosting only makes sense if you also have a dedicated infra engineer who can babysit GPUs.
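To sanity-check the crossover for your own volume, here is a rough payback sketch. It assumes the only variables are the per-minute rates from the table and the extra engineering spend needed to move off a managed platform; migration risk, tuning time, and vendor minimums are not modelled, so real paybacks will tend to run longer than this. The function names are illustrative.

```python
def monthly_runtime(minutes: int, rate_per_min: float) -> float:
    """Monthly runtime bill in USD."""
    return minutes * rate_per_min


def payback_months(extra_build_cost: float, minutes: int,
                   rate_from: float, rate_to: float) -> float:
    """Months until the cheaper per-minute rate repays the extra build spend."""
    savings = monthly_runtime(minutes, rate_from) - monthly_runtime(minutes, rate_to)
    if savings <= 0:
        return float("inf")  # the move never pays for itself
    return extra_build_cost / savings


# Assumption: $24,000 of extra build (top of the best-of-breed range),
# moving from $0.20/min managed to ~$0.10/min best-of-breed.
print(round(payback_months(24_000, 50_000, 0.20, 0.10), 1))   # 4.8 months at 50k min/mo
print(round(payback_months(24_000, 200_000, 0.20, 0.10), 1))  # 1.2 months at 200k min/mo
```

At 50,000 minutes the savings are small enough that the engineering time is the bigger cost; at 200,000 minutes the same build repays itself several times faster.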
If you're still picking the platform layer, our guide to the best AI voice services walks through vendor selection in depth. This post is the engineering companion to that one.
The per-minute and engineer-week numbers above assume the happy path. Here's what actually inflates the budget.
Voice AI feels broken above 1.2 seconds of round-trip latency. Below 800ms it feels alive. Getting from 1.5s to 800ms takes 1-2 senior-engineer weeks of work: streaming partial transcripts, prefetching the LLM call, switching to a faster TTS voice, optimizing the WebSocket layer. None of this is in the vendor docs. Budget the time.
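One way to make the latency work concrete is to treat the round trip as a budget split across segments. The sketch below is a hypothetical helper, not something from any vendor SDK; the thresholds are the 800ms and 1.2s figures above, and the per-segment timings in the example are invented for illustration.

```python
ALIVE_MS = 800     # below this, the agent "feels alive"
BROKEN_MS = 1200   # above this, the agent "feels broken"


def classify_round_trip(segments_ms: dict[str, float]) -> tuple[float, str, str]:
    """Sum per-segment latencies and name the slowest segment.

    Assumes segments_ms is non-empty. Returns (total_ms, verdict, slowest).
    """
    total = sum(segments_ms.values())
    slowest = max(segments_ms, key=segments_ms.get)
    if total < ALIVE_MS:
        verdict = "alive"
    elif total > BROKEN_MS:
        verdict = "broken"
    else:
        verdict = "borderline"
    return total, verdict, slowest


# Illustrative numbers only: an untuned pipeline versus a tuned one.
before = {"stt": 300, "llm": 700, "tts": 400, "network": 100}
after = {"stt": 150, "llm": 350, "tts": 200, "network": 80}
print(classify_round_trip(before))  # (1500, 'broken', 'llm')
print(classify_round_trip(after))   # (780, 'alive', 'llm')
```

Measuring each segment separately tells you where the 1-2 weeks of tuning should go first.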
When a user starts talking while the agent is mid-sentence, the agent should stop, listen, and respond to the new utterance. Implementing this correctly (without false triggers from background noise, without cutting off the user) is 1-2 weeks of senior work. Vapi and Retell give you the primitives, but the tuning is yours.
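The false-trigger problem is easier to see in code. This is a deliberately simplified sketch: it assumes you already have a per-frame energy value for the incoming audio, and it only interrupts after several consecutive loud frames. The class name, threshold, and frame count are all hypothetical defaults; production systems typically use a trained voice-activity detector rather than raw energy.

```python
class BargeInDetector:
    """Signal an interruption only after sustained speech, not a single spike."""

    def __init__(self, energy_threshold: float = 0.02, min_speech_frames: int = 5):
        self.energy_threshold = energy_threshold
        self.min_speech_frames = min_speech_frames
        self._run = 0  # consecutive frames at or above the threshold

    def reset(self) -> None:
        self._run = 0

    def feed(self, frame_energy: float) -> bool:
        """Feed one frame's energy; return True once the agent should stop talking."""
        if frame_energy >= self.energy_threshold:
            self._run += 1
        else:
            self._run = 0  # a quiet frame breaks the streak
        return self._run >= self.min_speech_frames


detector = BargeInDetector(min_speech_frames=3)
door_slam = [0.5, 0.0, 0.5, 0.5, 0.0]   # short spikes, never 3 in a row
print(any(detector.feed(e) for e in door_slam))   # False

detector.reset()
user_speaking = [0.1, 0.1, 0.1, 0.1]
print(any(detector.feed(e) for e in user_speaking))  # True
```

Raising `min_speech_frames` cuts false triggers but delays the interruption, which is the trade-off the tuning time goes into.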
Real voice agents need to remember what was said three turns ago, fire tool calls mid-conversation (book the appointment, look up the order), and recover gracefully when a tool call fails. State design adds 1-3 weeks to the build, depending on how many tools you're wiring up.
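A minimal sketch of what that state design can look like, assuming an in-memory store per call. `ConversationState` and `call_tool` are illustrative names, and the retry-then-apologize policy is one possible choice, not a prescribed one.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ConversationState:
    turns: list[tuple[str, str]] = field(default_factory=list)   # (speaker, text)
    tool_results: dict[str, str] = field(default_factory=dict)

    def add_turn(self, speaker: str, text: str) -> None:
        self.turns.append((speaker, text))

    def recent(self, n: int = 3) -> list[tuple[str, str]]:
        """The last n turns, e.g. to rebuild the LLM prompt."""
        return self.turns[-n:]


def call_tool(state: ConversationState, name: str, tool: Callable[[], str],
              fallback_line: str, retries: int = 1) -> Optional[str]:
    """Run a tool with retries; on final failure, queue a spoken fallback."""
    for _ in range(retries + 1):
        try:
            result = tool()
        except Exception:
            continue  # try again, up to the retry limit
        state.tool_results[name] = result
        return result
    state.add_turn("agent", fallback_line)
    return None


def failing_lookup() -> str:
    raise TimeoutError("order service unavailable")  # simulated outage


state = ConversationState()
state.add_turn("user", "Where is my order?")
call_tool(state, "order_lookup", failing_lookup,
          "I couldn't reach our order system. Can I call you back?")
print(state.recent())
```

The point of the fallback line is that the caller always hears something sensible, even when the tool behind the agent is down.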
Voice is hard to test. You can't write a unit test for "did the agent sound natural." Teams that ship serious voice products build a voice eval harness: a corpus of recorded user inputs, automated transcription comparison, latency metrics, and human-graded subjective scores. That's 1-3 senior weeks of build, plus ongoing maintenance. Skip this and you'll regress every time you swap voices or update the LLM.
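For the automated transcription comparison, word error rate (WER) is a common metric, though the post does not name one, so treat this as one reasonable choice. The sketch below computes WER with a word-level edit distance and needs no external libraries; the sample sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # Convention chosen here: empty reference scores 0 only if hypothesis is empty too.
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


expected = "book an appointment for tuesday"
heard = "book an appointment for thursday"
print(word_error_rate(expected, heard))  # 0.2 (one wrong word out of five)
```

Run it over the recorded corpus before and after each voice or model swap, and fail the build if the average drifts past a tolerance you pick.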
Standard APM tools don't cover voice. You need to log every transcript, every LLM response, every TTS chunk, and every latency segment. Helicone, LangSmith, or a custom Postgres + S3 setup runs $50-$500/mo and takes a mid-level engineer about a week to wire up. Without it, debugging a bad call is impossible.
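If you go the custom route, the core of it is one structured record per turn. A small sketch, assuming JSON lines that you later ship to Postgres or S3; the field names are our own and not tied to Helicone or LangSmith.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TurnLog:
    call_id: str
    turn: int
    transcript: str      # what STT heard
    llm_response: str    # what the model said back
    stt_ms: float
    llm_ms: float
    tts_ms: float

    @property
    def total_ms(self) -> float:
        return self.stt_ms + self.llm_ms + self.tts_ms


def to_json_line(log: TurnLog) -> str:
    """Serialize one turn, including the derived total latency."""
    record = asdict(log)
    record["total_ms"] = log.total_ms
    return json.dumps(record, sort_keys=True)


log = TurnLog("call-123", 1, "where is my order", "Let me check that for you.",
              stt_ms=200, llm_ms=500, tts_ms=200)
print(to_json_line(log))
```

Keeping the latency segments on the same record as the transcript is what lets you answer "why did this call feel slow" from a single query.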
Add it all up and the hidden engineering tax is 30-50% on top of the base build estimate. The voice agent quoted at 6 weeks is realistically 9 weeks once you account for latency, interruption, evals, and observability.
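The uplift is simple enough to apply to any quote. A quick sketch, with the 30-50% range taken from the paragraph above and a hypothetical function name:

```python
def realistic_weeks(quoted_weeks: float, tax_low: float = 0.30,
                    tax_high: float = 0.50) -> tuple[float, float]:
    """Apply the hidden-engineering uplift to a quoted timeline."""
    low = round(quoted_weeks * (1 + tax_low), 1)
    high = round(quoted_weeks * (1 + tax_high), 1)
    return low, high


print(realistic_weeks(6))  # (7.8, 9.0)
```

The 9-week figure is the top of that range, which is the safer number to plan against.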
Build cost and runtime cost are two different shapes. Here's what year one actually looks like for three real product types.
Onboarding voice tour, 1,000 minutes/month. A button on your homepage that reads users through your product. TTS narration only.
Customer support voice agent, 100,000 minutes/month. Real-time conversation, handles tier-1 tickets, escalates to humans for complex cases. Best-of-breed stack.
Outbound sales dialer, 1,000,000 minutes/month. Full telephony stack on Twilio, custom routing, recording, compliance.
The pattern is clear. At low volume, build dominates. At high volume, runtime dominates by 10-50x. Scope your decision accordingly: if you're shipping a 1k-min/mo onboarding tour, do not over-engineer with a custom Twilio stack. If you're shipping a million-minute dialer, do not pay Vapi's $0.20/min managed rate. (For verticalized voice products in regulated spaces, the budget shape shifts again; see our healthcare app cost breakdown for HIPAA-adjacent context.)
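To plug in your own scenario, year one is build cost plus twelve months of runtime. The sketch below is illustrative only: the build figures are the top of each range in the build table, and the per-minute rates ($0.05 for TTS narration, $0.10 for the other two) are assumptions drawn from the vendor tables that exclude carrier fees for the dialer.

```python
def year_one(build_cost: float, minutes_per_month: int, rate_per_min: float) -> dict:
    """Year-one spend split into build and runtime, rounded to cents."""
    runtime = round(12 * minutes_per_month * rate_per_min, 2)
    return {
        "build": build_cost,
        "runtime": runtime,
        "total": round(build_cost + runtime, 2),
        "runtime_to_build": round(runtime / build_cost, 1),
    }


print(year_one(2_000, 1_000, 0.05))        # onboarding voice tour
print(year_one(24_000, 100_000, 0.10))     # customer support voice agent
print(year_one(32_000, 1_000_000, 0.10))   # outbound sales dialer
```

The `runtime_to_build` ratio is the number to watch: under 1 means engineering is your main cost, while a large multiple means the per-minute rate is where the money goes.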
Three rules, in order:
1. Pick the smallest scope that proves the value. If voice narration on a single page proves users want voice, ship that first. Don't scope the full real-time agent before you know users want it.
2. Use a managed platform for the first 30 days. ElevenLabs Agents, Vapi, or Retell ship in days. Get user feedback. Then decide if you need to move to best-of-breed.
3. Hire engineers who have shipped voice before. Voice has unusual failure modes (latency, interruption, audio quality) that don't show up in standard web work. Every engineer on Cadence is AI-native by default, vetted on Cursor / Claude Code / Copilot fluency before they unlock bookings, and the senior tier specifically covers real-time and audio scope. You can book a senior engineer for a 48-hour free trial and ship a voice agent prototype inside week one.
For context, the same "right-tier-for-the-job" pattern applies to other infra-heavy adds. We see it in Stripe payment integrations, where mid tier is enough for the basics but webhook reliability needs senior eyes. Voice is the same shape, just with audio instead of money.
Want a budget for your specific scope? Book a senior engineer on Cadence for the 48-hour free trial. Two days is enough to scope the work, pick the vendor stack, and ship a working voice prototype before you commit to a 12-week roadmap. Weekly billing, replace any week.
A simple TTS narration feature ships in 1-2 weeks with a mid-level engineer. A real-time voice agent on a managed platform like Vapi or Retell takes 4-8 weeks with a senior engineer. A full telephony stack on Twilio runs 8-16 weeks and needs lead-tier ownership for the SIP, recording, and compliance work.
Use ElevenLabs Agents or Cartesia + Deepgram with the lowest model tier and a single voice. Expect $0.08 to $0.13 per minute of runtime and a 1-2 week integration with one mid-level engineer. For pure TTS narration, the cost can drop below $50/month at typical content-site volumes.
For TTS narration, no. A mid-level engineer is enough. For real-time conversation with interruption handling, tool calls, and sub-second latency targets, yes. The senior tier exists for exactly this kind of scope, where edge cases dominate the work and a junior or mid-level engineer will spend three weeks debugging a problem a senior solves in one.
Latency tuning (1-2 weeks), interruption handling (1-2 weeks), eval infrastructure (1-3 weeks), voice observability (1 week + $50-$500/mo), and prompt iteration. Add 30-50% on top of the base build estimate. A voice agent quoted at 6 weeks is realistically 9 weeks once you account for the production-readiness work.
Use a managed platform under 50,000 minutes per month, where the per-minute markup is small relative to your engineering time. Move to best-of-breed (Deepgram + Claude + Cartesia + Twilio) above 200,000 minutes, where the savings pay for the senior engineering hours in a few months. Self-host only if you already have a dedicated infrastructure engineer.