
Adding transcription to a video or audio app costs $500 to $30,000 in engineer time, plus $0.0025 to $0.0077 per minute of audio in API fees. The exact number depends on whether you ship batch upload-and-transcribe (1 engineer-week), speaker diarization with timestamps (2-3 weeks), or live WebSocket streaming (4-6 weeks). Vendor choice (AssemblyAI, Deepgram Nova-3, OpenAI Whisper API, or self-hosted Whisper on a GPU) shifts your monthly bill by 5-10x at high volume but barely matters under 100k minutes per month.
Most pricing posts on this topic stop at the per-minute API rate. That's a small slice of your real bill. Here's the full stack you'll write checks for:
At <100k minutes/month, the engineer-weeks dominate. At >1M minutes/month, the per-minute rate dominates and self-hosting starts to pencil out. We'll cost-model both ends below.
Here are the real numbers as of mid-2026, pulled from each vendor's public pricing page and verified against scaled-customer pricing in their docs.
| Provider | Batch (per min) | Streaming (per min) | Diarization | Notes |
|---|---|---|---|---|
| AssemblyAI Universal-3 | $0.0035 - $0.0061 | $0.0075 | Included | Bundled intelligence (PII, summaries, sentiment) |
| Deepgram Nova-3 | $0.0043 | $0.0077 | +$0.002/min | Streaming-first, lowest p95 latency |
| OpenAI Whisper API | $0.006 | not supported | Not native | 25MB upload cap, batch only |
| Self-host Whisper large-v3 | ~$0.0008 | ~$0.0015 | OSS pyannote | Amortized on g5.xlarge ($1.006/hr) at 80% GPU load |
| Google Cloud Speech | $0.016 | $0.024 | +$0.006/min | Higher accuracy on some languages, expensive |
| AWS Transcribe | $0.024 | $0.024 | Included | Easiest if you're already on AWS, expensive otherwise |
The per-minute spread looks dramatic until you do the volume math. At 1,000 minutes/month (a small podcast app), the entire vendor decision is worth $4 to $24 per month. At 1 million minutes/month (a meeting recorder with 5,000 daily-active users), the same decision is worth $1,500 to $24,000 per month, and self-hosting becomes the cheapest option by a wide margin.
A word on Whisper self-hosting: you can run whisper-large-v3 or faster-whisper on a single A10G (g5.xlarge on AWS at ~$725/month on-demand, less with a 1-year reserved instance). At 80% GPU load that's ~$0.0008 per audio-minute amortized. The catch: you now own a GPU service. That's an extra Dockerfile, a queue, a model warm-up routine, autoscaling, and a 2 AM pager. Most apps under 500k minutes/month should not touch this.
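The amortization math is easy to sanity-check. A minimal sketch, assuming faster-whisper throughput of roughly 25x real time on an A10G (an assumption, not a benchmark; plug in your own measurement):

```python
# Back-of-envelope model for self-hosted Whisper cost. REAL_TIME_FACTOR is an
# assumption about faster-whisper on an A10G, not a vendor figure.

GPU_HOURLY_USD = 1.006      # g5.xlarge on-demand rate cited above
UTILIZATION = 0.80          # fraction of wall-clock time spent transcribing
REAL_TIME_FACTOR = 25       # assumed audio-minutes transcribed per GPU-minute

def self_host_cost_per_audio_min(hourly=GPU_HOURLY_USD, util=UTILIZATION,
                                 rtf=REAL_TIME_FACTOR):
    """Amortized dollars per audio-minute on one always-on GPU."""
    return (hourly / 60) / (util * rtf)

def break_even_minutes_per_month(api_rate_per_min,
                                 gpu_monthly=GPU_HOURLY_USD * 720):
    """Monthly audio volume where a flat GPU bill matches the API bill."""
    return gpu_monthly / api_rate_per_min

print(f"{self_host_cost_per_audio_min():.4f}")        # ~0.0008 $/audio-min
print(f"{break_even_minutes_per_month(0.006):,.0f}")  # ~120k min/mo vs Whisper API
```

The break-even lands near the 100k minutes/month mark in the table above; a lower real-time factor (larger model, longer files, cold starts) pushes it higher.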
The integration work is the part vendor blogs underprice. Here's what each transcription feature actually takes when an experienced engineer ships it.
Basic batch upload-and-transcribe (1 engineer-week). User uploads an MP3 or MP4. You pull the audio out (ffmpeg if it's video), POST to the vendor's batch API, poll or webhook for the result, store the transcript JSON in your database, render it as plain text in the UI. One week including auth, error handling, and a queue (BullMQ or SQS) so a 90-minute podcast doesn't time out a request thread.
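The poll-or-webhook step is the only fiddly part of that week; the rest is CRUD. A minimal sketch of the polling variant, with `get_job` standing in for your vendor's SDK call (a hypothetical name) and `sleep` injectable so the loop is testable:

```python
import time

def poll_transcript(get_job, job_id, interval_s=5.0, timeout_s=3600,
                    sleep=time.sleep):
    """Poll the vendor's batch API until the job finishes. The queue worker
    calls this, so a 90-minute file never blocks a request thread."""
    waited = 0.0
    while waited <= timeout_s:
        job = get_job(job_id)               # e.g. GET /v1/transcripts/{id}
        if job["status"] == "completed":
            return job["text"]              # persist the full JSON, render as text
        if job["status"] == "error":
            raise RuntimeError(job.get("error", "transcription failed"))
        sleep(interval_s)                   # a webhook callback removes this loop
        waited += interval_s
    raise TimeoutError(f"job {job_id} still pending after {timeout_s}s")
```

If the vendor supports webhooks, prefer them: you trade this loop for one authenticated POST handler and stop paying for idle polling requests.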
Speaker diarization + word-level timestamps (2-3 weeks). Now the transcript needs "Speaker A said this at 00:14, Speaker B said that at 00:17." This adds: a labeled-speaker UI, audio playback synced to the transcript line, a way for users to rename "Speaker A" to "Jane," and edge cases like overlapping speech. Most teams underestimate the UI work here by 2x.
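The data-shaping half of that UI work is small. This sketch collapses vendor word-level output into the speaker turns the transcript view renders; the field names are assumptions, so match them to your vendor's response schema:

```python
# Group consecutive same-speaker words into turns: "Speaker A, 00:14, text".
# Input shape (word, start_ms, speaker) is illustrative, not any vendor's exact schema.

def to_speaker_turns(words):
    """Merge consecutive words from the same speaker into one turn."""
    turns = []
    for w in words:
        if turns and turns[-1]["speaker"] == w["speaker"]:
            turns[-1]["text"] += " " + w["word"]
        else:
            turns.append({"speaker": w["speaker"],
                          "start_ms": w["start_ms"],
                          "text": w["word"]})
    return turns

def fmt_ts(ms):
    """Millisecond offset -> MM:SS label for the transcript line."""
    s = ms // 1000
    return f"{s // 60:02d}:{s % 60:02d}"
```

The hard 2x-underestimated part is everything around this function: playback sync, speaker renaming, and overlapping speech, which produces interleaved turns this simple merge will split aggressively.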
Live transcription with WebSocket streaming (4-6 weeks). A whole different category. You need a long-lived WebSocket from browser to vendor (or via your server as a relay), partial-result rendering with stable text + final commits, mic permissions UX, network reconnect logic, and infrastructure that keeps a process pinned per session. If you proxy the audio through your own server (which you usually want for auth and rate-limiting), you're now running a stateful WebSocket service in production. Add 1-2 weeks if you've never run one.
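The partial-result rendering reduces to a small state machine: final segments are committed and never change, while the latest partial overwrites the previous one. A sketch, assuming each streamed message carries an `is_final` flag (Deepgram's streaming responses include one; adapt for your vendor's message types):

```python
# Reducer for streaming transcript rendering: (committed_segments, live_partial).
# Message shape {"is_final": bool, "text": str} is an assumption.

def apply_message(state, msg):
    """Finals append to the committed list; partials overwrite the live tail."""
    committed, _ = state
    if msg["is_final"]:
        return (committed + [msg["text"]], "")
    return (committed, msg["text"])

def render(state):
    """What the caption UI shows: stable text plus the current partial."""
    committed, partial = state
    return " ".join(committed + ([partial] if partial else []))
```

Keeping this as a pure reducer also makes reconnect logic tractable: on reconnect you keep the committed list and simply drop the stale partial.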
Multilingual + on-the-fly translation (+2 weeks). Both AssemblyAI and Deepgram detect language automatically. Translation is a separate call (DeepL, GPT-4o-mini, or Google Translate). The engineering is mostly about UI: language picker, RTL support for Arabic/Hebrew, and choosing whether to display translation inline or as a toggle.
PII redaction + retention policies (+1 week). AssemblyAI redacts PII in-line for ~$0.08/hour. Deepgram has a redact parameter at +$0.002/min. Even with vendor-side redaction, you still need: a retention policy (delete originals after N days), audit logs, and a privacy-policy update.
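The retention policy is a nightly job plus one selection query. A sketch of the selection step, with an illustrative record shape (the `legal_hold` field is a hypothetical escape hatch you'll want before you need it):

```python
from datetime import datetime, timedelta, timezone

def expired_originals(records, retention_days=30, now=None):
    """Audio keys safe to delete: past the retention window, not on hold.
    The redacted transcript stays; only the original audio is purged."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [r["audio_key"] for r in records
            if r["uploaded_at"] < cutoff and not r.get("legal_hold", False)]
```

Log every deletion with the record ID and cutoff used; that log is most of what the audit-trail requirement asks for.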
For context, the median time-to-first-commit for a Cadence engineer (we run an on-demand engineering marketplace) is 27 hours after the booking is confirmed. That's not the time to ship the feature, that's just the first PR. The 1-week scope above assumes a focused engineer with no other obligations.
Here's what the all-in costs look like across hiring options. Numbers assume the "speaker diarization + timestamps" scope (2-3 weeks of work) plus 6 months of API at 50,000 minutes/month.
| Approach | Cost | Timeline | Pros | Cons |
|---|---|---|---|---|
| US full-time engineer | $14,000+/month loaded | 4-8 weeks to ship | Owns the integration long-term, deep context for v2 | Slow to recruit (4-12 weeks of hiring), monthly burn keeps running after ship |
| Dev agency (US/EU) | $15,000-$30,000 flat | 6-10 weeks | Predictable scope, project manager included | Slow start (SOW, legal), agency overhead, you don't own the institutional knowledge |
| Freelancer (Upwork) | $30-$120/hr, $2k-$10k total | 2-6 weeks | Cheap if you find a strong one | High variance on quality, no replacement if they disappear |
| Toptal | $100-$200/hr, $8k-$25k total | 1-2 weeks to start | Vetted bench, faster than agency | Hourly billing, monthly minimums on some contracts |
| Cadence | $500-$2,000/week | 48-hour trial, ship in week 1 | Every engineer is AI-native (Cursor, Claude Code, Copilot vetted via voice interview), weekly billing, replace any week, no notice period | Less suited to enterprise procurement workflows that require multi-week MSAs |
A few honest notes. Toptal wins if you need a specific seasoned voice-engineer profile and you're willing to pay 2-4x to get it. Agencies win if you need someone else to manage scope. Full-time hiring wins if transcription is a permanent pillar of your product. We also work with founders who use Cadence for the integration phase (1-6 weeks) and then convert that engineer to full-time after the trial, effectively a try-before-you-buy that starts within 48 hours.
The line items vendor blogs skip:
- **Search indexing.** Postgres tsvector is free and fine to ~1M docs. Beyond that: OpenSearch ($150-$500/mo on AWS) or Meilisearch ($30-$300/mo on Cloud). Semantic search via pgvector + text-embedding-3-small adds ~$0.00002 per 1k tokens, roughly $0.10 per 1,000 transcript-minutes embedded.
- **Self-hosted PII redaction.** presidio-analyzer is open source, but you own the false-positive tuning.

This work doesn't show up in pricing pages. It shows up in your sprint planning two months after you ship V1. We've seen the same shape in our internal data on integrations like the cost to add RAG to a SaaS app and the cost to integrate Stripe payments into your app, where the API line item is the smallest part of the real total.
Concrete numbers. Assumes batch transcription with diarization, no live streaming.
| Volume | AssemblyAI | Deepgram + diar | Whisper API | Self-host Whisper |
|---|---|---|---|---|
| 1,000 min/mo | $4 | $6 | $6 | $725 (you're paying for a GPU you don't need) |
| 100,000 min/mo | $350 | $620 | $600 | $725 (break-even point) |
| 1,000,000 min/mo | $3,500 | $6,200 | $6,000 | ~$1,500 (multiple GPUs or reserved capacity) |
A few takeaways: below ~10,000 minutes/month the vendor decision is noise, so choose on features and developer experience rather than per-minute rate; self-hosting breaks even around 100,000 minutes/month; and at 1 million minutes/month it wins by roughly 2-4x, provided you're willing to own the GPU service.
For founders who want a more general framework, the same shape (API rounding error early, infrastructure dominates late) shows up in the cost to migrate from Heroku to AWS trade-off.
Five things that work, ranked by ROI:

1. Ship batch first. Most "live transcription" requests are really batch use cases.
2. Under 100k minutes/month, pick AssemblyAI or Deepgram on features, not per-minute rate.
3. Defer streaming until a real-time requirement forces it; it's 4-6 weeks of work, not one.
4. Budget 20-40% on top of API fees for storage, search indexing, and redaction.
5. Don't self-host Whisper below ~500k minutes/month.
If you're a non-technical founder without an engineer on hand, the fastest path is to book a mid-tier engineer for 2 weeks ($2,000) to ship the batch flow, then let user feedback tell you whether you need diarization or live streaming. Booking on Cadence takes about 2 minutes, and a vetted, AI-native engineer is paired within 48 hours.
Three steps:

1. Scope the batch upload-and-transcribe flow to one engineer-week.
2. Ship it behind a queue and put it in front of users.
3. Use that feedback to pick the next feature: diarization, streaming, or multilingual.
The same scoping math applies to other "I need a feature shipped fast" categories. The cost to migrate from Firebase to Supabase and the cost to build a custom WordPress plugin follow the same engineer-week-times-rate model that this post does.
Batch upload-and-transcribe ships in 1 engineer-week if the engineer is focused. Adding speaker diarization with word-level timestamps takes 2-3 weeks. Live WebSocket streaming is 4-6 weeks because you're now running a stateful service. Multilingual support and translation add about 2 weeks on top of any of these.
AssemblyAI if you need bundled audio intelligence (diarization, PII redaction, summaries) without stitching three APIs together. Deepgram if you need live streaming with sub-300ms latency. Self-host Whisper only above ~500k minutes/month, when your API bill crosses what an A10G GPU costs ($725-$1,500/month) and you have an engineer who's comfortable owning a GPU service.
Use batch unless your product literally requires sub-second captions (live meetings, voice agents, accessibility tooling for live events). Batch is 50-70% cheaper per minute, 4x faster to build, and doesn't require an always-on stateful service in your infrastructure. Most "transcription" features (podcast tools, meeting recorders, voice memos) are batch use cases that founders accidentally over-engineer into streaming.
Transcript storage (cheap), search indexing (modest), PII redaction (low cost but high compliance overhead), and the always-on streaming service if you ship live (one more thing to page for at 2 AM). Together these typically add 20-40% to the per-minute API cost at scale. The much bigger hidden cost is the engineer-weeks to integrate, which is usually 5-50x the first year of API fees.
Yes. AssemblyAI and Deepgram both support 30+ languages with auto-detection at modest premiums (10-30% over English-only rates). Translation is a separate API call, typically DeepL ($0.005-$0.025 per 1k characters), GPT-4o-mini (~$0.0015 per 1k input tokens), or Google Translate ($20 per 1M characters). Engineering work is mostly UI: language picker, RTL support, and choosing inline-vs-toggle display.