
AI agent tool calling is the loop where a language model picks a function, fills in JSON arguments, runs it, reads the result, and decides what to do next. To build your first one you need five things: a real use case, 3-5 tools with clear schemas, an agent loop, three guardrails (max-iterations, cost cap, output validation), and a 10-case eval. The rest is wiring.
This post walks you through all of it, end to end, in TypeScript, with a real use case (a knowledge-base support agent) and code you can paste and run today.
Tool calling is structured JSON output. You give the model a list of functions ("tools") with names, descriptions, and parameter schemas. The model decides whether to answer directly or to emit a tool call: {"name": "search_knowledge_base", "arguments": {"query": "refund policy"}}. Your code runs the function, returns the result, and the model continues.
That's it. No autonomy, no magic, no framework required. The same primitive lives in Anthropic's tool_use blocks, OpenAI's function calling, and Google's Gemini function calling. The shapes differ; the loop is identical.
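In TypeScript terms, one round-trip looks roughly like this. The block shapes follow Anthropic's naming (`tool_use` / `tool_result`); the ids and payloads are made up for illustration:

```typescript
// Hypothetical minimal shapes for one tool-calling round-trip.
type ToolUseBlock = {
  type: "tool_use";
  id: string;
  name: string;
  input: Record<string, unknown>;
};

type ToolResultBlock = {
  type: "tool_result";
  tool_use_id: string; // must echo the id of the tool_use block it answers
  content: string;
};

// The model emits a ToolUseBlock...
const call: ToolUseBlock = {
  type: "tool_use",
  id: "toolu_01",
  name: "search_knowledge_base",
  input: { query: "refund policy" },
};

// ...your code runs the function and replies with a ToolResultBlock,
// then calls the API again with both appended to the conversation.
const result: ToolResultBlock = {
  type: "tool_result",
  tool_use_id: call.id,
  content: JSON.stringify([{ articleId: "kb-12", snippet: "Refunds within 30 days…" }]),
};
```

The provider-specific SDKs wrap these shapes differently, but the call/result pairing is the whole contract.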
What it isn't: a planner, a memory system, or a multi-agent orchestrator. Those are layers you can add later. Most "agent failures" you read about are people who skipped the basic loop and reached for LangGraph on day one.
| Approach | Tool calling format | Streaming | Pricing (per M tokens) |
|---|---|---|---|
| Anthropic Claude Sonnet 4.6 | tool_use blocks | yes | $3 in / $15 out |
| OpenAI GPT-4.1 | function calling | yes | $2 in / $8 out |
| Google Gemini 2.5 Pro | function calling | yes | $1.25 in / $10 out |
| Vercel AI SDK | unified across providers | yes | passthrough |
Pick one provider for your first agent. Switching later costs you a few hours, not a rewrite.
The number-one beginner mistake is starting with "I want to build an AI agent" instead of "I want to answer support tickets from our knowledge base." The first framing produces a toy. The second produces something you ship.
Good first use cases share three traits: a single user, a bounded data source, and a clear definition of "correct."
Bad first use cases invert those traits: many users at once, an unbounded data source, or no objective way to judge whether the output is correct.
For the rest of this post we are building one specific thing: an agent that answers support questions from a markdown knowledge base, escalates anything it cannot answer to a human ticket, and refuses to go off-topic.
Tool descriptions matter more than tool names. The model never sees your function body; it sees the schema and the description, then guesses. Write descriptions like you are briefing a new junior engineer who will never get a follow-up question.
For the support agent we need four:
- `search_knowledge_base(query: string)` returns the top 3 article chunks ranked by semantic similarity.
- `get_article(id: string)` returns the full markdown for one article.
- `create_support_ticket(summary: string, priority: "low" | "normal" | "high")` opens a ticket in the human queue.
- `send_response(text: string)` delivers the final answer to the user and ends the session.

Notice the shape: three "read" tools and one "write" tool that ends the loop. That asymmetry is intentional. The model can browse and gather context cheaply, then must commit to a single output.
```typescript
// tools.ts
import { z } from "zod";

export const tools = [
  {
    name: "search_knowledge_base",
    description:
      "Search the company knowledge base for articles relevant to a user question. Returns the top 3 chunks with article IDs. Use this first for any factual question about our product, pricing, refunds, or onboarding.",
    input_schema: {
      type: "object",
      properties: {
        query: { type: "string", description: "The user question, rewritten as a search query." },
      },
      required: ["query"],
    },
  },
  {
    name: "get_article",
    description:
      "Fetch the full markdown of one knowledge-base article by ID. Call this after search_knowledge_base when you need full context.",
    input_schema: {
      type: "object",
      properties: { id: { type: "string" } },
      required: ["id"],
    },
  },
  {
    name: "create_support_ticket",
    description:
      "Open a human-handled support ticket. Use only when the knowledge base does not contain an answer or the user explicitly asks for a human.",
    input_schema: {
      type: "object",
      properties: {
        summary: { type: "string", description: "One-sentence summary of the issue." },
        priority: { type: "string", enum: ["low", "normal", "high"] },
      },
      required: ["summary", "priority"],
    },
  },
  {
    name: "send_response",
    description: "Send the final answer to the user. Call this exactly once. After this, the session ends.",
    input_schema: {
      type: "object",
      properties: { text: { type: "string" } },
      required: ["text"],
    },
  },
];
```
If you want to go deeper on schema design, our OpenAI function calling guide covers the same primitive on the OpenAI side.
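The agent loop later in the post imports `searchKB`, `getArticle`, and `createTicket` from a `handlers.ts` it never shows. Here is a minimal in-memory sketch; the article data and the naive keyword scoring are stand-ins (swap in real retrieval and a real ticketing API later):

```typescript
// handlers.ts — in-memory sketch of the three "read"/"escalate" handlers.
type Article = { id: string; title: string; body: string };

// Illustrative data, not from the post.
const ARTICLES: Article[] = [
  { id: "kb-1", title: "Refund policy", body: "Refunds are available within 30 days of purchase." },
  { id: "kb-2", title: "Onboarding", body: "Invite teammates from Settings > Members." },
];

export async function searchKB(query: string) {
  // Stand-in for semantic search: rank by how many query words appear
  // in the article title or body.
  const words = query.toLowerCase().split(/\W+/).filter(Boolean);
  return ARTICLES.map((a) => ({
    articleId: a.id,
    snippet: a.body.slice(0, 200),
    score: words.filter((w) => (a.title + " " + a.body).toLowerCase().includes(w)).length,
  }))
    .sort((x, y) => y.score - x.score)
    .slice(0, 3);
}

export async function getArticle(id: string) {
  const a = ARTICLES.find((a) => a.id === id);
  return a ? `# ${a.title}\n\n${a.body}` : "article not found";
}

export async function createTicket(summary: string, priority: "low" | "normal" | "high") {
  // Replace with your ticketing system's API call.
  return { ticketId: `t_${Date.now()}`, summary, priority, status: "open" };
}
```

Keeping the handlers this thin is deliberate: the loop only ever sees strings and JSON, so you can upgrade retrieval without touching the agent.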
1. `npm install @anthropic-ai/sdk zod` and set `ANTHROPIC_API_KEY` in your environment. That is the entire dependency footprint for this build.
2. Create `tools.ts` with the four schemas above and the four matching handler functions (`searchKB`, `getArticle`, `createTicket`, `sendResponse`).
3. Write `agent.ts` that calls the Anthropic Messages API with the tools array, checks `stop_reason`, executes any `tool_use` blocks, appends the results, and re-calls the API until `send_response` fires or a guardrail trips.
4. Validate the final `send_response.text` against a zod schema, with one retry on failure.
5. Create `evals.json` with expected tool sequences and known-good answers. Run them locally; do not ship until you hit 8/10.

Here is the entire loop. It runs, it has the three guardrails inline, and it is the thing you should paste, run, then modify.
```typescript
// agent.ts
import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";
import { tools } from "./tools";
import { searchKB, getArticle, createTicket } from "./handlers";

const client = new Anthropic();
const MODEL = "claude-sonnet-4-6";
const MAX_ITERATIONS = 8;
const COST_CAP_USD = 0.10;
const PRICE_IN = 3 / 1_000_000;
const PRICE_OUT = 15 / 1_000_000;

const ResponseSchema = z.object({ text: z.string().min(20).max(1500) });

export async function runAgent(userMessage: string) {
  const messages: any[] = [{ role: "user", content: userMessage }];
  let totalCost = 0;

  for (let turn = 0; turn < MAX_ITERATIONS; turn++) {
    const res = await client.messages.create({
      model: MODEL,
      max_tokens: 1024,
      tools,
      messages,
      system:
        "You are a support agent. Answer only from the knowledge base. " +
        "If you cannot answer, call create_support_ticket. End with send_response.",
    });

    // Guardrail 2: cost cap.
    totalCost +=
      res.usage.input_tokens * PRICE_IN + res.usage.output_tokens * PRICE_OUT;
    if (totalCost > COST_CAP_USD) {
      return { ok: false, reason: "cost_cap", cost: totalCost };
    }

    messages.push({ role: "assistant", content: res.content });

    if (res.stop_reason !== "tool_use") {
      return { ok: false, reason: "no_tool_call", cost: totalCost };
    }

    const toolResults: any[] = [];
    let finalText: string | null = null;

    for (const block of res.content) {
      if (block.type !== "tool_use") continue;
      const args: any = block.input;
      let result: string;
      switch (block.name) {
        case "search_knowledge_base":
          result = JSON.stringify(await searchKB(args.query));
          break;
        case "get_article":
          result = await getArticle(args.id);
          break;
        case "create_support_ticket":
          result = JSON.stringify(await createTicket(args.summary, args.priority));
          break;
        case "send_response":
          finalText = args.text;
          result = "delivered";
          break;
        default:
          result = `unknown tool: ${block.name}`;
      }
      toolResults.push({ type: "tool_result", tool_use_id: block.id, content: result });
    }

    // Guardrail 3: output validation, with retry via the loop.
    if (finalText !== null) {
      const parsed = ResponseSchema.safeParse({ text: finalText });
      if (parsed.success) {
        return { ok: true, text: finalText, cost: totalCost, turns: turn + 1 };
      }
      // Roles must alternate in the Messages API, so the retry instruction
      // rides along in the same user message as the tool results.
      toolResults.push({
        type: "text",
        text: "Your response did not match the required shape. Retry send_response with valid text.",
      });
    }

    messages.push({ role: "user", content: toolResults });
  }

  // Guardrail 1: max iterations.
  return { ok: false, reason: "max_iterations", cost: totalCost };
}
```
That is around 70 lines including imports. The Anthropic tool use docs cover edge cases (parallel tool calls, image inputs) when you need them. For tips on writing the system prompt itself, our notes on prompt engineering for senior engineers carry over directly.
The loop above has all three guardrails inline; together they take about 12 lines. Skipping any one is how you wake up to a $400 bill or a runaway support ticket.
Max-iterations. Hard-stop at 5-10 turns. Eight is a sane default for a single-user agent. If your agent regularly hits the cap, your tools are wrong (usually too granular), not your limit.
Cost cap. Track input_tokens × $3/M + output_tokens × $15/M per session. Bail at $0.10. A typical 5-turn session lands around $0.02; the cap exists to catch the bug, not the steady state. Our deeper token cost optimization guide covers the next layer (caching, smaller models for triage).
Output validation. Use zod (or any schema library) to validate the final send_response.text. One retry. If it fails twice, return a fallback to a human. Never let a malformed response reach the user.
Optional fourth guardrail for production: an output classifier (a cheap Haiku 4.5 call) that checks the final response for hallucinated facts before delivery. See our notes on handling LLM hallucinations in production for the full pattern.
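A sketch of that fourth guardrail. The verdict format (`PASS` / `FAIL: reason`) and the injectable `complete` function are assumptions for testability; in production `complete` would be a thin wrapper around a Haiku 4.5 Messages API call:

```typescript
// Final-pass fact check before delivery. `complete` is any prompt-in,
// text-out function (e.g. a cheap Haiku call); injected so it can be stubbed.
type Complete = (prompt: string) => Promise<string>;

export async function checkResponse(
  answer: string,
  sources: string[],
  complete: Complete,
): Promise<{ ok: boolean; reason?: string }> {
  const verdict = await complete(
    `Sources:\n${sources.join("\n---\n")}\n\nDraft answer:\n${answer}\n\n` +
      `Reply PASS if every factual claim in the draft is supported by the sources, ` +
      `otherwise reply FAIL: <reason>.`,
  );
  return verdict.trim().startsWith("PASS")
    ? { ok: true }
    : { ok: false, reason: verdict.replace(/^FAIL:\s*/, "").trim() };
}
```

If the check fails, fall back to `create_support_ticket` rather than shipping the draft; a classifier that can only block, never rewrite, is much harder to get wrong.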
Ten cases. Not a hundred. Not a "framework." A JSON file:
```json
[
  {
    "input": "How do I get a refund?",
    "expectedTools": ["search_knowledge_base", "send_response"],
    "expectedSubstring": "30 days"
  },
  {
    "input": "What is your CEO's home address?",
    "expectedTools": ["send_response"],
    "expectedSubstring": "I can't share"
  }
]
```
Write a 30-line runner that calls runAgent on each input, logs the tool sequence, the token count, the latency, and whether the expected substring appears. Bar to ship: 8/10 correct, no infinite loops, average cost under $0.05.
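That runner can be sketched as follows. It assumes `runAgent` is extended to also report the tool names it called (a small addition to the loop's return value, not shown in the post):

```typescript
// evals.ts — sketch of the eval runner. AgentFn mirrors runAgent's return
// shape plus an assumed `tools` field listing the tools called, in order.
type EvalCase = { input: string; expectedTools: string[]; expectedSubstring: string };
type AgentFn = (
  input: string,
) => Promise<{ ok: boolean; text?: string; cost: number; tools?: string[] }>;

export async function runEvals(cases: EvalCase[], agent: AgentFn) {
  let passed = 0;
  let totalCost = 0;
  for (const c of cases) {
    const started = Date.now();
    const res = await agent(c.input);
    totalCost += res.cost;
    const toolsOk = c.expectedTools.every((t) => (res.tools ?? []).includes(t));
    const textOk = res.ok && (res.text ?? "").includes(c.expectedSubstring);
    const pass = toolsOk && textOk;
    if (pass) passed++;
    console.log(
      `${pass ? "PASS" : "FAIL"} ${c.input} (${Date.now() - started}ms, $${res.cost.toFixed(3)})`,
    );
  }
  return { passed, total: cases.length, avgCost: totalCost / cases.length };
}
```

Wire it to `evals.json` with a `JSON.parse(readFileSync(...))` and fail CI when `passed / total` drops below 0.8.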
This eval is your regression net. Re-run it after every prompt change. It catches the silent degradation that makes agents drift from "great" to "embarrassing" over a month. If you're still trying to decide whether to build this in-house or hand it off, our Build/Buy/Book recommender gives a quick read in 90 seconds.
A Vercel serverless function or a Cloudflare Worker handles up to roughly 10 requests per second without thinking about it. Anything more and you want a queue.
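A sketch of the HTTP layer, using the standard `Request`/`Response` types that both Cloudflare Workers and Vercel edge functions expose. The route shape and the injected `runAgent` parameter are assumptions for illustration:

```typescript
// A minimal POST endpoint wrapping the agent loop. runAgent is injected
// so the handler stays testable without an API key.
export async function handleRequest(
  req: Request,
  runAgent: (q: string) => Promise<{ ok: boolean; text?: string; reason?: string }>,
): Promise<Response> {
  if (req.method !== "POST") return new Response("POST only", { status: 405 });
  const { question } = (await req.json()) as { question?: string };
  if (!question) return new Response("missing question", { status: 400 });
  const result = await runAgent(question);
  // Guardrail trips surface as a 502 so the caller can show a fallback.
  return result.ok
    ? Response.json({ answer: result.text })
    : Response.json({ error: result.reason }, { status: 502 });
}
```

Mapping guardrail failures to a distinct status code is what makes the alerting below cheap: count 502s, not log lines.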
What to set up before you ship: at minimum, an alert for any session that ends in `cost_cap` or `max_iterations`. Those are bugs, not edge cases.

If you want a fuller production checklist (RAG, eval harness, observability), our production RAG architecture post covers the retrieval side that sits underneath `search_knowledge_base`.
Most beginners stop at "it works in dev" and skip the eval. Don't be them. The eval is the thing that lets you change a prompt without fear and ship a v2 in a week instead of a quarter.
Good v2 moves, in rough order of return on effort, start with session memory (a `sessions` table) so the agent remembers prior turns within a conversation.

If the loop, the guardrails, the eval, and the deploy add up to more weekend than you've got, the fastest path is to hand the build to an engineer who has shipped this exact pattern before. Every Cadence engineer is AI-native by default, vetted on Cursor, Claude Code, and Copilot fluency in a voice interview before they unlock bookings, and the platform's 12,800-engineer pool means you usually get a first match in two minutes.
Not sure if "build, buy, or book" is the right call for your first agent? Our Build/Buy/Book recommender takes 90 seconds and gives you a straight answer based on your scope, deadline, and team. Free, no email gate.
"Function calling" and "tool use" mean the same thing. OpenAI calls it function calling; Anthropic calls it tool use; Google calls it function calling. The structure is identical: a JSON schema, the model emits a name plus arguments, your code runs the function and returns the result.
For your first agent, write the loop yourself. Frameworks hide the exact failure modes you most need to feel: schema mistakes, runaway loops, cost spikes. Move to a framework only after you've shipped two agents and know what you'd want abstracted away. The 70-line loop in this post is shorter than most framework setup files.
A 5-8 turn session on Claude Sonnet 4.6 with a small knowledge base runs about $0.02-$0.10. Daily cost scales linearly with session count. Cap per-session spend in code (the cost cap above is 4 lines). For a deeper breakdown, see our cost-to-build-an-AI-agent guide.
Hard-cap iterations at 5-10 and total token spend per session at $0.10. Both are 4-line additions to the loop. Without them, a single bug in your tool descriptions can spend $200 overnight. The agent does not know it is stuck; you have to tell the runtime.
A knowledge-base Q&A agent with four tools (search, get article, escalate, respond). It teaches the loop, schema design, and guardrails in under 200 lines of code, and it solves a real problem (deflecting Tier 1 support tickets) on day one. Build that, ship it, then build the next one.