OpenRouter vs Together AI vs Replicate

OpenRouter is the right pick if you want one API key and live access to 200+ closed and open models with smart routing. Together AI wins when you're running open-source LLMs at scale and care about latency, fine-tuning, and dedicated endpoints. Replicate is the call when your workload is multimodal (image, video, audio, voice) and you want a serverless GPU you don't have to operate. All three are good. They're built for different jobs.

The one-line verdict for each

If you only read one paragraph, read this one.

OpenRouter is a router and aggregator. You pay a small markup over the upstream provider price and get instant access to GPT-5, Claude Sonnet 4.6, Gemini 2.5, Llama 4, DeepSeek V3, Qwen, Mistral, and dozens more behind a single OpenAI-compatible endpoint. It's the lazy-correct choice for app builders who want optionality without juggling six SDKs and six billing accounts.

Together AI is an inference host with a serious infrastructure team. They run open-source models on their own GPU fleet, expose them via an OpenAI-compatible API, and let you fine-tune, deploy dedicated endpoints, and even spin up your own clusters. If your usage is enough to warrant a real conversation about cost-per-token, Together is usually where you land.

Replicate is a model-as-API marketplace with a serverless GPU underneath. It excels at the long tail of community models, especially image, video, audio, and voice (Flux, SDXL, Whisper, MusicGen, video models). You can also push your own model with cog and get a billable API in an afternoon. If you're shipping anything that isn't text, Replicate is the first place to look.

What each one actually is (and isn't)

These three get lumped together because they all hand you a hosted-inference URL, but the underlying businesses are different.

OpenRouter is a router, not a host

OpenRouter doesn't run the GPUs. They sit between your app and the upstream providers (OpenAI, Anthropic, Google, Together, Fireworks, DeepInfra, Groq, Mistral, etc.) and forward requests. The value is unification: one API key, one OpenAI-compatible schema, one invoice, automatic fallbacks if a provider 429s. You can also pin specific upstream providers per request, which matters if your data residency policy bans certain regions.

The trade-off is that they take a small cut on each call (usually around 5% over published list price for paid models, free models stay free). You don't get fine-tuning. You don't get dedicated capacity. You get breadth.

Together AI is a host with a research team

Together runs its own clusters and ships an OpenAI-compatible API for open models: Llama 4, DeepSeek V3, Qwen, Mistral, FLUX (some sizes), and a long list of community-tuned variants. They published the original FlashAttention work, they sponsor RedPajama, they ship Together Code Interpreter. The infrastructure isn't a thin wrapper around someone else's GPUs; it's the product.

You can fine-tune (LoRA and full), deploy a dedicated endpoint with reserved GPUs (good for latency-sensitive workloads), and even rent multi-node GPU clusters by the hour if you're training. That depth is what Replicate and OpenRouter don't have.

Replicate is a marketplace plus serverless GPU

Replicate's superpower is the long tail. Anyone can package a model with their open-source cog tool, push it, and get a paid API. As of early 2026 the catalog has thousands of community models across image generation (Flux, SDXL, Ideogram), upscaling, segmentation, video (Wan, LTX, Mochi), audio (Whisper, MusicGen, RVC voice), and the usual LLMs. Pricing is per-second of GPU time, which is great for spiky workloads and brutal for long-context LLM streaming.

The platform also handles the boring parts: webhooks for async completions, prediction history, versioned models, a decent web UI for non-engineers to try things. If your team includes a designer who needs to test a new image model on Tuesday morning, Replicate's playground is the path of least resistance.

Head-to-head comparison table

	OpenRouter	Together AI	Replicate
Best for	App builders who want every model behind one key	High-volume open-source LLM inference, fine-tuning	Multimodal (image, video, audio) and long-tail community models
Models offered	200+ (GPT, Claude, Gemini, Llama, DeepSeek, Qwen, Mistral, etc.)	~100 open-source LLMs, image, code, embedding	Thousands; deep in image / video / audio
API style	OpenAI-compatible	OpenAI-compatible	Custom REST + async webhooks (OpenAI-compatible for chat)
Pricing model	Per-token, passthrough + ~5% markup	Per-token, set rates per model	Per-second of GPU time
Free tier	Free models available (Llama, DeepSeek free variants)	$1 credit on signup	$0.10/mo equivalent on starter, pay-as-you-go
Llama 4 Maverick (~$/M tokens, early 2026)	~$0.25 in / $0.80 out (passthrough)	~$0.20 in / $0.60 out	Not the primary use case
Closed-model access (GPT-5, Claude)	Yes, the headline feature	No	No
Fine-tuning	No	Yes (LoRA + full)	Limited (some model authors expose it)
Dedicated endpoints	No	Yes	Yes (deployments)
Multimodal depth	Routes to providers that offer it	Image + code; not deep on video / audio	Deep on image, video, audio, voice
Cold starts	Inherits from upstream	Low (warm pools, dedicated)	Variable; serverless cold starts can be 5–60s on niche models
Streaming	Yes (SSE)	Yes (SSE)	Yes (SSE for chat, async for everything else)
Self-host option	No	No, but offers GPU cluster rental	Yes (`cog` runs anywhere)
Where they win	Optionality, fallback, one bill	Cost-per-token at scale, training	Long-tail and non-text workloads
Where they lose	No infra control, no fine-tune	No closed models	Cold starts, pricier per-token for chat

Prices are list rates as of early 2026 and shift constantly. Check the live pricing page before you commit to a model.

Pricing in practice (the part the docs hide)

The docs all show clean per-token or per-second numbers. Real bills get weird.

OpenRouter charges what the upstream charges, plus a small fee. For Llama 4 Maverick that's roughly $0.25 per million input tokens and $0.80 per million output tokens (passthrough rate, early 2026). For Claude Sonnet 4.6, expect Anthropic's list price with a few cents on top. The free models are genuinely free; rate limits are stricter and quality drifts when providers swap their backing.

Together AI publishes per-model rates. Llama 4 Maverick sits around $0.20 in / $0.60 out per million tokens. DeepSeek V3 is cheaper. Dedicated endpoints flip the model: you pay for the GPU-hour (roughly $1.80 to $4.80 per H100-hour depending on commitment), not per token, which becomes the cheaper option once you're sustained above a few million tokens per day.

Replicate bills GPU-seconds. An A100 is roughly $0.001400 per second, an H100 is ~$0.001525 per second, a T4 is ~$0.000225 per second (early 2026 list). For a Flux image generation that takes 4 seconds on an A100, you're spending ~$0.006 per image. For a 10-minute video generation on H100, you're spending ~$0.92. For chat, this gets expensive fast because you pay for the GPU even when the model is idle waiting for the next token.

The mental model: OpenRouter is per-token + small markup, Together is per-token at near-cost (or per-hour if you go dedicated), Replicate is per-GPU-second. Match the billing model to the workload shape.

Strongest features on each platform

OpenRouter wins on optionality

Every new frontier model lands on OpenRouter within days, usually hours. When GPT-5 dropped, OpenRouter had it routable behind the same key the same morning. The fallback system (define a list of acceptable models, OpenRouter walks the list on 429 or 5xx) saves outages that would otherwise force a deploy. The "auto" model picks the cheapest provider for a given model at request time.

The Bring Your Own Key feature also matters: if you have a discounted contract with Anthropic, you can route through OpenRouter using your own API key and only pay OpenRouter for the routing layer (a few cents per million). Most aggregators don't allow this.

Together AI wins on infra depth

Together's dedicated endpoints have measurably lower p99 latency than any serverless equivalent we've tested. For agentic workloads where a single user turn might call the model 10–20 times, latency compounds. Cutting the p99 from 800ms to 250ms on a dedicated Llama 4 endpoint can cut perceived response time by 3x.

The fine-tuning flow is also clean. LoRA fine-tunes on a 70B model start at a few dollars and finish in under an hour for small datasets. The result deploys to the same OpenAI-compatible endpoint your app already calls. If you're doing any structured generation (function calling, JSON mode, custom-grammar prompts), having your own fine-tune is a real lever.

Replicate wins on the long tail

When you need a model that isn't in the top 50 most-popular, Replicate is often the only commercial host. Want Wan 2.2 video generation? It's there. Want a community RVC voice clone? There. Want the latest Flux LoRA somebody posted yesterday? Almost always there within 48 hours.

The cog tool is also genuinely useful as a packaging format. We've seen teams use cog internally to deploy their own models on Kubernetes, completely separate from Replicate's hosted service. It's one of the few open-source contributions in this space that has reach beyond the original company.

The weaknesses you only notice in production

OpenRouter's weakness is that it's a middleman. If OpenRouter goes down, your app goes down even if the upstream provider is healthy. The 5% markup also adds up: at $50k/mo in inference, that's $2,500/mo to a routing layer. Most teams reach a point where they pin one or two production models directly and use OpenRouter for experimentation.

Together's weakness is the closed-model gap. No GPT-5. No Claude. No Gemini. If your product needs frontier reasoning for the hard prompts, you're running a second provider alongside Together regardless. The dedicated endpoint minimum commitments (typically a few hundred dollars per month per endpoint) also rule it out for hobby projects.

Replicate's weakness is cold starts and chat economics. A niche model that hasn't been called in a while can take 30–60 seconds to wake up. Replicate has worked on this (always-on deployments, faster cold starts on popular models), but it's still the architecture's biggest tax. For chat specifically, per-GPU-second billing is almost always more expensive than per-token billing on Together or via OpenRouter.

For deeper context on the chat-app side, our best AI agent platforms for developers breakdown covers the orchestration layer that sits on top of these inference providers.

Who should pick what

Pick OpenRouter if you're building a product that needs to support "use whatever model is best for this task" today and switch when a better one ships next month. Also pick it for prototyping: one signup, every model.

Pick Together AI if your dominant workload is open-source LLM inference, you have enough volume to justify a few thousand a month, or you want to fine-tune. Also if you have a research bent: Together's team publishes, their endpoints expose the things researchers want (logprobs, custom grammars).

Pick Replicate if you're shipping image, video, audio, or voice features, or you need a model that isn't in the top 50. Also if your team includes non-engineers who need to try models from a UI.

Use two or three. Plenty of production stacks run OpenRouter for closed-model access (GPT-5, Claude), Together for the bulk Llama traffic, and Replicate for the Flux image generation. The OpenAI-compatible API on the first two makes the dual-stack setup cheap to maintain.

How this fits the Cadence stack

Every engineer on Cadence is AI-native by default. That means picking the right inference provider for the workload (not "let's use OpenAI for everything") is part of the baseline; we vet it on the voice interview before an engineer unlocks bookings.

If you're a founder shipping an AI feature and you've been stuck between picking a router, a host, or a marketplace, a Senior engineer on Cadence ($1,500/week) will wire all three behind a clean adapter, set up cost tracking per provider, and tell you honestly which one wins for your specific traffic shape. The 48-hour free trial means you can test the work before paying. For pure integration work against well-documented APIs (Replicate's REST, Together's SDK), a Mid engineer at $1,000/week is usually enough.

If your decision is more strategic ("we want to add an image-gen feature, what stack?"), our best deployment platforms for startups ranking and best low-code admin panels overview help frame the surrounding decisions.

What to do next

List your 3 biggest AI workloads by monthly token (or GPU-second) volume.
Map each to its natural home: closed-model reasoning → OpenRouter, bulk open-source chat → Together, multimodal → Replicate.
Spin up a free or starter account on each you need, port one workload, and measure latency and cost-per-1k-requests over a week.
If the answer is two providers, build a thin adapter so swapping is a config change, not a refactor.

If you want help running the experiment without losing a sprint, book a Cadence engineer for a week. We'll wire the providers, run the cost comparison, and leave you with a one-page recommendation.

Want an honest grade on your current AI stack? Cadence's Ship or Skip tooling audit walks through your inference setup, billing surface, and observability in 20 minutes and tells you where you're overpaying. No sales call.

For complementary tooling decisions in the AI app stack, see our Drizzle ORM review on the database side and Resend review on the transactional email side.

FAQ

Is OpenRouter better than calling OpenAI directly?

For experimentation and multi-model apps, yes: one key, every model, automatic fallbacks. For a single-model production app at scale, calling the upstream directly saves the 5% markup and removes a single point of failure. Many teams use OpenRouter in dev and direct calls in prod.

Can Together AI run GPT-5 or Claude?

No. Together hosts open-source models on their own infrastructure. For closed frontier models (GPT, Claude, Gemini), you need OpenRouter, the upstream provider, or a cloud reseller like AWS Bedrock or Azure AI Foundry.

Why is Replicate expensive for chat?

Replicate bills per GPU-second, including time the GPU spends idle between tokens. For chat with bursty user-driven traffic, you're paying for hardware sitting around. Per-token billing on OpenRouter or Together is almost always cheaper for chat workloads. Replicate's economics shine on shorter, deterministic jobs (one image, one transcription, one video) where wall-clock time is short.

Which one is cheapest for Llama 4?

Together AI is typically the cheapest direct host for Llama 4 in early 2026 (~$0.20 in / $0.60 out per million tokens for Maverick). OpenRouter passes through near-identical rates with a small markup, but lets you fall back to Fireworks, DeepInfra, or others if Together is saturated. Replicate is not the right tool for Llama at scale.

Can I fine-tune a model on Replicate?

Some model authors expose fine-tuning (notably for image models like SDXL and Flux LoRAs), but it's not a first-class platform feature the way it is on Together. If fine-tuning is core to your roadmap, Together is the right pick. OpenRouter doesn't host fine-tunes at all.

Do any of these offer a self-hosted option?

Together rents you bare GPU clusters by the hour if you want to run your own stack, but they don't ship a self-hostable inference server. Replicate's cog packaging tool is open source and you can run cog-packaged models on your own Kubernetes, which is the closest thing to a self-hosted option among the three. OpenRouter is router-only and can't be self-hosted.

Japish Thind

Backend Developer

Backend developer at withRemote. Writes on API design, observability, and database trade-offs.

All posts