How to build agentic SaaS features

Agentic SaaS features are product surfaces where an LLM holds a multi-step loop on the user's behalf: scoped tools, a planner, user-confirmation rules, cost guardrails, and an action history the user can audit and undo. You build them by replacing one webhook handler (the "trigger → fixed code path" pattern) with a planner loop that picks from 4 to 8 scoped tools, asks before risky writes, and writes every action to an event log.

That is the short answer. The long answer covers what most teams get wrong: treating "agentic" as a model upgrade instead of a feature-design discipline. Below is the shape of an agentic feature that actually ships and survives 90 days in production.

What "agentic" actually means inside a SaaS product

A normal SaaS feature is a function: input event, fixed code, output. A webhook, a form submit, a cron job. The path is hard-coded.

An agentic feature is a loop. The model reads context, picks a tool, reads the result, picks the next tool, and continues until it stops or hits a guardrail. The control flow is dynamic.

The implication: you stop writing branch logic ("if email contains X then Y") and start writing tool specs ("a function called search_tickets(query, limit) that returns up to 50 tickets"). The loop assembles the workflow at runtime.
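A minimal sketch of that loop, assuming a planner that is injected as a plain function so the example runs without a model. In a real feature the planner would be an asynchronous call to a model SDK. The names `Tool`, `Planner`, and `runLoop` are hypothetical and do not come from any library.

```typescript
// Hypothetical names throughout: Tool, Planner, and runLoop are illustrative only.
type ToolCall = { tool: string; args: Record<string, unknown> };
type Step = { call: ToolCall; result: unknown };
type Tool = (args: Record<string, unknown>) => unknown;

// The planner stands in for the model: given the history so far,
// it returns the next tool call, or null when it decides to stop.
type Planner = (history: Step[]) => ToolCall | null;

function runLoop(
  planner: Planner,
  tools: Record<string, Tool>,
  maxSteps: number,
): { steps: Step[]; stoppedBy: "planner" | "guardrail" } {
  const steps: Step[] = [];
  while (steps.length < maxSteps) {
    const call = planner(steps);
    if (call === null) return { steps, stoppedBy: "planner" };
    const tool = tools[call.tool];
    // Unknown tool names are reported back as a result instead of being executed.
    const result = tool ? tool(call.args) : { error: `unknown tool: ${call.tool}` };
    steps.push({ call, result });
  }
  // The step cap was reached before the planner chose to stop.
  return { steps, stoppedBy: "guardrail" };
}

// Stub tools with canned results.
const tools: Record<string, Tool> = {
  search_tickets: (args) => [{ id: "T-1", title: `match for ${String(args["query"])}` }],
  add_comment: () => ({ ok: true }),
};

// A scripted planner in place of a model: search, comment, then stop.
const scripted: Planner = (history) => {
  if (history.length === 0) return { tool: "search_tickets", args: { query: "refund", limit: 50 } };
  if (history.length === 1) return { tool: "add_comment", args: { ticketId: "T-1", body: "Looking into it" } };
  return null;
};

const outcome = runLoop(scripted, tools, 8);
```

Note that there is no branch logic about tickets anywhere in `runLoop`. The workflow is whatever sequence of calls the planner produces, bounded by `maxSteps`.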

Examples in the wild:

Linear's auto-triage assigns labels, priority, and team via 5 tools (get_similar_issues, assign_label, set_priority, assign_team, add_comment).
Intercom's Fin holds a support conversation calling search_help_center, lookup_order, escalate_to_human. Each tool has explicit auth scope.
Hex's Magic answers a data question by writing SQL, running it, inspecting the result, iterating or rendering a chart. Undo is the cell history.

If your "agentic feature" does not have a tool list you can write down in 30 seconds, it is not agentic. It is a chat box with a prompt.

The 6 design choices every agentic feature has to make

You can ship a working agentic SaaS feature by answering 6 questions before you write a line of code. Skipping any of them is how teams end up with a demo that wows in the keynote and gets disabled by week 3.

1. Scoped tools, not raw API access

Give the model a small, intent-shaped tool list. Not "execute_sql". Instead: get_revenue_by_month(start, end), get_top_customers(limit), compare_periods(period_a, period_b).

Why: a tool with a narrow signature constrains the model's failure modes. execute_sql lets the agent drop tables. get_revenue_by_month cannot. The same logic applies to writes. Replace update_record(table, id, json) with mark_ticket_resolved(ticket_id) and assign_ticket(ticket_id, user_id). The blast radius of any mistake is bounded by the tool surface.

A good rule of thumb: 4 to 8 tools per feature. Fewer than 4 and you are doing branching in the prompt (badly). More than 8 and the model starts picking the wrong one. If you need more, split the feature into two agents that hand off.
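As an illustration of what "intent-shaped" means in code, here is a sketch of one narrow read tool and one narrow write tool. The in-memory data, the `YYYY-MM` month format, and the `T-123` ticket ID format are assumptions made for the example.

```typescript
// Hypothetical in-memory data standing in for a real revenue store.
const revenueByMonth: Record<string, number> = {
  "2025-01": 12000,
  "2025-02": 13500,
  "2025-03": 12800,
};

const MONTH = /^\d{4}-(0[1-9]|1[0-2])$/;

type MonthRevenue = { month: string; revenueUsd: number };

// Narrow read tool: the model chooses a month range and nothing else.
// It never writes SQL, so it cannot touch other tables.
function getRevenueByMonth(start: string, end: string): MonthRevenue[] {
  if (!MONTH.test(start) || !MONTH.test(end)) {
    throw new Error("start and end must look like YYYY-MM");
  }
  if (start > end) throw new Error("start must not be after end");
  return Object.entries(revenueByMonth)
    .filter(([month]) => month >= start && month <= end)
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([month, revenueUsd]) => ({ month, revenueUsd }));
}

// Narrow write tool: one intent, one argument, no table or column names.
const resolvedTickets = new Set<string>();

function markTicketResolved(ticketId: string): { ok: boolean } {
  if (!/^T-\d+$/.test(ticketId)) throw new Error("ticketId must look like T-123");
  resolvedTickets.add(ticketId);
  return { ok: true };
}
```

The worst a confused model can do with these two functions is ask for the wrong months or resolve the wrong ticket. Both mistakes are visible and bounded.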

2. Confirmation tiers: auto-approve safe, ask for risky

Every tool needs an auth_level: safe, confirm, or escalate.

Safe runs immediately. Reads, internal-only writes, anything reversible in one click. The user sees it in the action history but is not interrupted.
Confirm pops a one-tap approval. Anything that touches money, sends external comms, or changes a permission. The model proposes; the user approves.
Escalate kicks to a human. Refunds over a threshold. Deletes. Anything regulated.
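The three tiers above can be sketched as a lookup that runs before any tool executes. The registry contents are hypothetical (`delete_account` in particular is an invented example), and defaulting unregistered tools to the strictest tier is a design assumption, not something the tiers themselves require.

```typescript
type AuthLevel = "safe" | "confirm" | "escalate";
type Decision = "run" | "ask_user" | "hand_to_human";

// Hypothetical tool registry; each tool is tagged with exactly one tier.
const authLevels: Record<string, AuthLevel> = {
  search_help_center: "safe",
  assign_ticket: "safe",
  send_email: "confirm",
  delete_account: "escalate",
};

function gate(toolName: string): Decision {
  // Assumption: a tool missing from the registry gets the strictest tier
  // rather than running by default.
  const level: AuthLevel = authLevels[toolName] ?? "escalate";
  switch (level) {
    case "safe":
      return "run";
    case "confirm":
      return "ask_user";
    case "escalate":
      return "hand_to_human";
  }
}
```

The loop would call `gate` on every proposed tool call and only execute immediately when the answer is `"run"`.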

The mistake is making everything confirm. You will train users to one-tap-approve without reading, which is worse than safe because it manufactures false consent. Reserve confirm for genuinely consequential writes.

Stripe's agent-mode for refunds is a good reference: refunds under $100 are safe, $100 to $1,000 are confirm, over $1,000 are escalate. The thresholds are configurable per merchant.
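Amount-based tiering like the refund example can be sketched as a pure function with per-merchant thresholds. This is an illustration of the idea only and does not reflect any real payment provider's API. The type and function names are hypothetical, and treating exactly $100 and exactly $1,000 as confirm is one reading of the ranges above.

```typescript
type AuthLevel = "safe" | "confirm" | "escalate";

// Hypothetical per-merchant configuration.
type RefundThresholds = { confirmFromUsd: number; escalateAboveUsd: number };

const defaultThresholds: RefundThresholds = { confirmFromUsd: 100, escalateAboveUsd: 1000 };

function refundAuthLevel(
  amountUsd: number,
  thresholds: RefundThresholds = defaultThresholds,
): AuthLevel {
  if (!Number.isFinite(amountUsd) || amountUsd <= 0) {
    throw new Error("refund amount must be a positive number");
  }
  if (amountUsd > thresholds.escalateAboveUsd) return "escalate";
  if (amountUsd >= thresholds.confirmFromUsd) return "confirm";
  return "safe";
}

// A merchant with a lower risk tolerance can tighten both thresholds.
const cautious = refundAuthLevel(40, { confirmFromUsd: 25, escalateAboveUsd: 200 });
```

Keeping the thresholds in data rather than in the prompt means the model never decides its own permission level.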

3. Cost guardrails per session

LLM cost is a runaway risk most teams underprice. A naive support agent can spend $4 of Claude or GPT credits on a single ticket if it loops on a confused query. Multiply by 10,000 tickets per day and you have a P1.

The guardrail is a per-session budget: max tokens, max tool calls, max wall-clock seconds. When any limit trips, the loop halts and the agent hands the session to a human with the partial transcript.
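A minimal sketch of that budget check, assuming it is called before every planner step and that the current time is passed in (which keeps the function easy to test). All names here are hypothetical. A limit counts as tripped once usage reaches it.

```typescript
type Limits = { maxTokens: number; maxToolCalls: number; maxSeconds: number };
type Usage = { tokens: number; toolCalls: number; startedAtMs: number };
type Trip = "tokens" | "tool_calls" | "seconds";

// Returns the first limit that has been reached, or null if the loop may continue.
function trippedLimit(usage: Usage, limits: Limits, nowMs: number): Trip | null {
  if (usage.tokens >= limits.maxTokens) return "tokens";
  if (usage.toolCalls >= limits.maxToolCalls) return "tool_calls";
  if ((nowMs - usage.startedAtMs) / 1000 >= limits.maxSeconds) return "seconds";
  return null;
}

type NextAction =
  | { kind: "continue" }
  | { kind: "handoff"; reason: Trip; transcript: string[] };

// On a trip, the session is handed to a human together with the partial transcript.
function nextAction(
  usage: Usage,
  limits: Limits,
  nowMs: number,
  transcript: string[],
): NextAction {
  const reason = trippedLimit(usage, limits, nowMs);
  return reason === null
    ? { kind: "continue" }
    : { kind: "handoff", reason, transcript: [...transcript] };
}

// Example limits for a support conversation.
const supportLimits: Limits = { maxTokens: 20_000, maxToolCalls: 12, maxSeconds: 90 };
```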

Sensible defaults to start with:

Feature type	Max tokens	Max tool calls	Max seconds	Cost cap (Sonnet 4.5 at $3/MTok in, $15/MTok out)
Inbox triage (per email)	4,000	4	20	~$0.04
Customer support (per conversation)	20,000	12	90	~$0.20
Analyst (per question)	40,000	20	180	~$0.50
Code agent (per task)	200,000	50	600	~$2.50
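The cost-cap column depends on how a session's tokens split between input and output, which the table does not state. A small sketch using the prices from the table header shows the floor and ceiling for any token budget. The function names are hypothetical.

```typescript
// Prices from the table header, in USD per million tokens.
const INPUT_USD_PER_MTOK = 3;
const OUTPUT_USD_PER_MTOK = 15;

function sessionCostUsd(inputTokens: number, outputTokens: number): number {
  return (inputTokens * INPUT_USD_PER_MTOK + outputTokens * OUTPUT_USD_PER_MTOK) / 1_000_000;
}

// All-input is the cheapest possible session for a budget; all-output is the most expensive.
function costBoundsUsd(maxTokens: number): { floor: number; ceiling: number } {
  return {
    floor: sessionCostUsd(maxTokens, 0),
    ceiling: sessionCostUsd(0, maxTokens),
  };
}

// Support conversation budget of 20,000 tokens: about $0.06 to $0.30.
const supportBounds = costBoundsUsd(20_000);
```

The table's ~$0.20 for support sits between those bounds. When you set your own caps, plug in the input/output split you actually observe in production traces.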

Track cost-per-completed-task as a top-line metric next to success rate. If cost rises while success stays flat, the loop is thrashing and the prompt or tool set needs a tighter spec. The AI-assisted refactoring playbook covers similar token-discipline patterns.

4. Transparent action history

Every tool call gets logged to an immutable event stream the user can open. Tool name, arguments, result summary, timestamp, cost in cents.

This is not a "nice to have." It is the feature. Users will not trust agentic surfaces without it, and your support team cannot debug failures without it. Build the action history before you build the chat UI. Most production agents fail on trust, not on capability.

A working schema:

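type AgentAction = {
  sessionId: string; // groups every action from one agent session
  step: number; // position of this call within the session
  toolName: string;
  args: Record<string, unknown>; // arguments as passed to the tool
  result: { ok: boolean; summary: string; raw: unknown }; // short summary plus the raw tool output
  costCents: number; // cost of this step, in cents
  startedAt: Date;
  durationMs: number;
};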
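One way to hold those rows is an append-only log. The sketch below is an in-memory stand-in, assuming a real system would back it with a durable store. The `ActionLog` class is hypothetical. The type is repeated so the block is self-contained.

```typescript
type AgentAction = {
  sessionId: string;
  step: number;
  toolName: string;
  args: Record<string, unknown>;
  result: { ok: boolean; summary: string; raw: unknown };
  costCents: number;
  startedAt: Date;
  durationMs: number;
};

class ActionLog {
  private readonly entries: AgentAction[] = [];

  // Append-only: there is no update or delete method, and each stored row
  // is a shallow-frozen copy (nested args and result are not deep-frozen).
  append(action: AgentAction): void {
    this.entries.push(Object.freeze({ ...action }));
  }

  // Rows for one session, ordered by step, for rendering the timeline.
  forSession(sessionId: string): AgentAction[] {
    return this.entries
      .filter((entry) => entry.sessionId === sessionId)
      .sort((a, b) => a.step - b.step);
  }

  totalCostCents(sessionId: string): number {
    return this.forSession(sessionId).reduce((sum, entry) => sum + entry.costCents, 0);
  }
}

const log = new ActionLog();
log.append({
  sessionId: "s1", step: 1, toolName: "get_email_thread", args: { threadId: "th_1" },
  result: { ok: true, summary: "Read thread", raw: null },
  costCents: 1, startedAt: new Date(0), durationMs: 120,
});
log.append({
  sessionId: "s1", step: 2, toolName: "assign_label", args: { label: "billing" },
  result: { ok: true, summary: "Assigned label billing", raw: null },
  costCents: 1, startedAt: new Date(200), durationMs: 80,
});
```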

Render it as a collapsible timeline next to the chat. Linear, Vercel's v0, and Cursor all expose something like this. Users open it twice a week and trust the agent 10x more for the rest of the month.

5. Undo affordance for every state-changing tool

Every tool that writes state needs a corresponding undo. Not a "rollback button" in the UI; an actual inverse tool the loop or the user can invoke.

assign_ticket has unassign_ticket. apply_discount has revoke_discount. send_email cannot be undone, so it goes in the confirm tier.

When an action is undoable, the action-history row gets an "Undo" button that calls the inverse tool with the original arguments. When it is not undoable, the row gets a "View consequence" link (the sent email, the charged invoice). Users will tolerate a 95% accurate agent if they trust that the other 5% is recoverable. They will reject a 99% accurate agent if the remaining 1% is not.
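The pairing of tools and inverses can be sketched as a lookup table that decides which affordance each history row gets. The tool pairs come from the examples above. The `RowAffordance` type and the choice to treat unregistered tools as not undoable are assumptions.

```typescript
type Args = Record<string, unknown>;

// Map from a state-changing tool to its inverse; null marks a tool with no undo.
const inverseOf: Record<string, string | null> = {
  assign_ticket: "unassign_ticket",
  apply_discount: "revoke_discount",
  send_email: null,
};

type RowAffordance =
  | { kind: "undo"; tool: string; args: Args }
  | { kind: "view_consequence" };

function affordanceFor(toolName: string, originalArgs: Args): RowAffordance {
  const inverse = inverseOf[toolName];
  // The inverse tool is called with the original arguments.
  if (typeof inverse === "string") {
    return { kind: "undo", tool: inverse, args: originalArgs };
  }
  // Assumption: tools that are unknown or explicitly irreversible only get a link.
  return { kind: "view_consequence" };
}

const row = affordanceFor("assign_ticket", { ticketId: "T-7", userId: "u_101" });
```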

6. Eval suite tied to features, not models

The model will change every 3 months. Your evals have to be product-shaped, not model-shaped, so they survive the upgrade.

Write 30 to 100 "golden traces" per feature: real tickets, real emails, real questions, with the expected tool calls and the acceptable answer ranges. Run the suite on every prompt change and every model upgrade. Track pass rate, average cost, average latency. Promote a change to production only when pass rate holds and cost does not regress more than 15%.
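A sketch of the scoring and the promotion gate, under simplifying assumptions: a trace passes only when the tool-call sequence matches exactly, answer ranges and latency are left out, and a missing run counts as a failure. All type and function names are hypothetical.

```typescript
type GoldenTrace = { id: string; expectedTools: string[] };
type TraceRun = { id: string; actualTools: string[]; costUsd: number };
type SuiteStats = { passRate: number; avgCostUsd: number };

function sameSequence(a: string[], b: string[]): boolean {
  return a.length === b.length && a.every((tool, i) => tool === b[i]);
}

function scoreSuite(traces: GoldenTrace[], runs: TraceRun[]): SuiteStats {
  if (traces.length === 0) throw new Error("suite is empty");
  const byId = new Map(runs.map((run) => [run.id, run] as const));
  let passed = 0;
  let totalCost = 0;
  for (const trace of traces) {
    const run = byId.get(trace.id);
    if (!run) continue; // a missing run counts as a failure
    totalCost += run.costUsd;
    if (sameSequence(trace.expectedTools, run.actualTools)) passed += 1;
  }
  return { passRate: passed / traces.length, avgCostUsd: totalCost / traces.length };
}

// Promote only when pass rate holds and cost regresses by no more than 15%.
function canPromote(
  baseline: SuiteStats,
  candidate: SuiteStats,
  maxCostRegression = 0.15,
): boolean {
  return (
    candidate.passRate >= baseline.passRate &&
    candidate.avgCostUsd <= baseline.avgCostUsd * (1 + maxCostRegression)
  );
}
```

Because the traces name tools and not models, the same suite runs unchanged against a new model version.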

This is the discipline covered in building autonomous coding agents in 2026, which describes the same eval loop applied to coding tasks. The shape is identical for support, analytics, or triage.

The agent loop is replacing the webhook handler

Here is the architectural shift worth naming. For 15 years, the default SaaS pattern was: event arrives → webhook fires → fixed handler runs → side effects happen. Stripe charge succeeds → mark order paid → email receipt. Predictable, debuggable, narrow.

The agentic pattern looks like: event arrives → loop wakes up → model reads context → picks a tool → reads result → picks another tool → eventually stops. Less predictable, harder to debug, much wider.

The trade gives up determinism to gain flexibility. A webhook handler that does 4 things forever is replaced by a loop that can do 40 things conditionally. For features where the "right next step" depends on context the developer cannot predict at write-time (a customer support reply, a triage decision, a research question), the loop wins. For features where the right step is always the same (charge succeeded → mark paid), the webhook still wins. Do not replace what is not broken.

A comparison of agentic feature shapes

Different SaaS features need different loop shapes. Here are the four most common, with the trade-offs.

Feature shape	Tools	Confirmation	Cost per session	Best for
Triage agent (Linear auto-triage, inbox sorting)	4 to 6 read + small write	Mostly `safe`, occasional `confirm`	$0.02 to $0.05	High-volume classification with reversible writes
Conversational agent (Intercom Fin, support copilot)	6 to 10 mixed	Mix of `safe` and `confirm`; `escalate` for refunds	$0.10 to $0.30	Multi-turn user dialogue with bounded actions
Analyst agent (Hex Magic, Glean queries)	3 to 5 read-only	Mostly `safe` (read-only)	$0.20 to $0.50	One-shot questions against structured data
Builder agent (Cursor, v0, Replit Agent)	10 to 20 read + write	`safe` for sandbox, `confirm` for deploy	$1 to $5	Iterative artifact creation with eval-able outputs

The mistake teams make is using one shape for all features. A triage agent that asks for confirmation on every label change is unusable. A builder agent that auto-deploys without confirmation is a security incident. Pick the shape per feature, not per product.

Three concrete builds

AI inbox triage

An auto-sort toggle in settings. Every incoming email is read by an agent with 5 tools: get_email_thread, find_similar_past_tickets, assign_label, assign_priority, assign_team. All safe because labels and assignments are reversible.

The action history shows: "Read thread, found 3 similar past tickets, assigned label billing, priority medium. Cost: 2 cents." A "Disagree" button reverses the actions and adds the trace to the eval set. This is the highest-ROI agentic feature to ship first: reversible, narrow, measurable.

AI customer support

A support copilot inside the help-center widget. Tools: search_help_center, lookup_customer_account, lookup_order, apply_credit_under_10_usd, escalate_to_human. There is no tool for credits over $10, so anything larger forces an escalation.

The agent runs for up to 12 tool calls or 90 seconds. If the issue is unresolved, it escalates with the full transcript. Track deflection rate, CSAT on agent-resolved tickets, and false-resolution rate (user re-opens within 7 days). The last one matters most: a ticket the user re-opens was never resolved.

AI analyst

A revenue ops "ask anything about your pipeline" box. Tools: query_opportunities(filters), query_accounts(filters), compare_periods(a, b), render_chart(spec). All read-only, all safe.

The agent answers in 3 to 7 tool calls, renders a chart, shows the SQL. Cost-per-question lands around $0.30. Cheaper than a junior analyst by 1,000x, faster by 100x, wrong about 8% of the time. The transparent action history is how the user catches the 8%.

What to do next

If you are designing your first agentic feature this quarter, the highest-ROI move is to write the tool list before the prompt. Four to eight tools, signatures included, auth tier marked. Then write 30 golden traces. Then write the prompt. Most teams reverse this and burn a quarter on a working demo with no eval coverage.

If you need an engineer who has shipped this exact shape, every engineer on Cadence is AI-native, vetted on Cursor, Claude Code, and Copilot fluency before they unlock bookings; many have shipped production agent loops. A senior at $1,500/week or a lead at $2,000/week is the right tier for the planner-loop and eval-suite work. Mid at $1,000/week is right for the tool implementations and action-history UI once the architecture is set. If you are still deciding whether to build, buy, or book, run the Build/Buy/Book decision tool for a 3-minute recommendation against your actual feature spec.

The pattern is the same whether you are wiring Cursor agent mode into a production codebase or building a customer-facing copilot: small scoped tools, explicit confirmation tiers, cost caps, transparent history, undo paths, evals. Skip any one and the feature will not survive contact with real users.

If you want a 48-hour trial with an engineer who has shipped agentic features in production, the fastest path is booking a senior on Cadence. Two days at no cost, weekly billing after that, replace any week.

FAQ

What is the difference between an AI feature and an agentic feature?

An AI feature uses a model to transform an input (summarize this, classify that, autocomplete). The control flow is fixed. An agentic feature gives the model a tool list and a loop, so the model decides which tools to call and in what order. The control flow is dynamic, decided at runtime.

How do I price an agentic SaaS feature?

Track cost-per-completed-task (LLM tokens plus tool execution). A support agent that costs you $0.20 per resolved ticket, on a plan where the customer pays $5 per ticket in seat fees, runs at 4% COGS, which is healthy. A triage agent at $0.04 per email is essentially free. Price the feature on outcomes (resolved tickets, triaged inbox) and budget LLM spend at 5 to 15% of the per-outcome revenue.
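The arithmetic behind those figures, as a small sketch with hypothetical function names:

```typescript
// Share of per-outcome revenue spent on the agent (LLM tokens plus tool execution).
function llmShareOfRevenue(costPerOutcomeUsd: number, revenuePerOutcomeUsd: number): number {
  if (revenuePerOutcomeUsd <= 0) throw new Error("revenue per outcome must be positive");
  return costPerOutcomeUsd / revenuePerOutcomeUsd;
}

// The 5 to 15% budget band, expressed in dollars per outcome.
function llmBudgetRangeUsd(revenuePerOutcomeUsd: number): { low: number; high: number } {
  return { low: revenuePerOutcomeUsd * 0.05, high: revenuePerOutcomeUsd * 0.15 };
}

// $0.20 of agent cost against $5 of revenue per resolved ticket is a 4% share.
const supportShare = llmShareOfRevenue(0.2, 5);

// At $5 per ticket, the band is roughly $0.25 to $0.75 per resolved ticket.
const supportBudget = llmBudgetRangeUsd(5);
```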

How do I prevent prompt injection in tool calls?

Treat every tool argument as untrusted, even if the model produced it. Validate it like a user-submitted form: type, range, allowlist. Never let the model construct raw SQL, shell commands, or URLs to fetch. Use the scoped-tool pattern (small intent-shaped functions) so the worst a successful injection can do is one valid tool call with bad arguments, which your guardrails should already handle.
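A sketch of validating model-produced arguments for one tool, checking type, format, and allowlist before anything executes. The ID format and the allowlist contents are hypothetical.

```typescript
type AssignTicketArgs = { ticketId: string; userId: string };
type Validation =
  | { ok: true; value: AssignTicketArgs }
  | { ok: false; error: string };

const TICKET_ID = /^T-\d{1,9}$/; // hypothetical ticket ID format
const ASSIGNABLE_USERS = new Set(["u_101", "u_102"]); // hypothetical allowlist

// Arguments arrive as unknown because the model produced them.
function validateAssignTicket(args: unknown): Validation {
  if (typeof args !== "object" || args === null) {
    return { ok: false, error: "args must be an object" };
  }
  const record = args as Record<string, unknown>;
  const ticketId = record["ticketId"];
  const userId = record["userId"];
  if (typeof ticketId !== "string" || !TICKET_ID.test(ticketId)) {
    return { ok: false, error: "invalid ticketId" };
  }
  if (typeof userId !== "string" || !ASSIGNABLE_USERS.has(userId)) {
    return { ok: false, error: "userId not in allowlist" };
  }
  // Only the validated fields are passed on; extra keys are dropped.
  return { ok: true, value: { ticketId, userId } };
}
```

A rejected call can be returned to the loop as a tool error, so the model sees the failure and the action history records it.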

Should I use a framework like LangChain or build my own loop?

For a single feature with 4 to 8 tools, build your own loop. It is 200 lines of code and you will need to understand every line when it misbehaves. Reach for a framework only when you have 3+ agentic features sharing tools and eval infrastructure. The OpenAI and Anthropic SDKs both ship tool-calling primitives that get you 80% of the way without a framework.

How do I know my agentic feature is working?

Three metrics, tracked weekly: task success rate (did the user accept the action), cost per completed task (LLM plus tools), and undo rate (how often users reverse the agent). The feature is healthy when success rate is above 85%, cost is stable or dropping, and undo rate is below 5%. If any of those drifts, use the eval suite to find which tool or prompt to fix.
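Those thresholds can be sketched as a weekly check. The names are hypothetical, and "stable or dropping" is read strictly here as "not higher than last week". In practice you may want a small tolerance.

```typescript
type WeeklyMetrics = { successRate: number; costPerTaskUsd: number; undoRate: number };

// Returns a list of problems; an empty list means the feature looks healthy.
function healthIssues(current: WeeklyMetrics, previous: WeeklyMetrics): string[] {
  const issues: string[] = [];
  if (current.successRate <= 0.85) issues.push("success rate at or below 85%");
  if (current.costPerTaskUsd > previous.costPerTaskUsd) {
    issues.push("cost per task rose week over week");
  }
  if (current.undoRate >= 0.05) issues.push("undo rate at or above 5%");
  return issues;
}

const lastWeek: WeeklyMetrics = { successRate: 0.9, costPerTaskUsd: 0.2, undoRate: 0.03 };
const thisWeek: WeeklyMetrics = { successRate: 0.91, costPerTaskUsd: 0.19, undoRate: 0.02 };
const issues = healthIssues(thisWeek, lastWeek);
```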

What roles do I need on the team to ship this?

One senior or lead engineer for the planner-loop, tool design, and eval suite. One mid engineer for the tool implementations and action-history UI. One designer who understands confirmation flows. Three people for 6 to 10 weeks gets you a shipped, evaluated, monitored agentic feature in production. If you do not have those people on staff, the AI-native engineering ROI numbers for 2026 show why booking them by the week beats a 12-week hiring loop.

Deeksha Durgesh

Senior Automation Developer

Senior automation engineer at withRemote. Writes on CI/CD, test pyramids, and removing toil from engineering pipelines.
