
AI agent tool calling is the loop where a language model picks a function, fills in JSON arguments, runs it, reads the result, and decides what to do next. To build your first one you need five things: a real use case, 3-5 tools with clear schemas, an agent loop, three guardrails (max-iterations, cost cap, output validation), and a 10-case eval. The rest is wiring.
This post walks you through all of it, end to end, in TypeScript, with a real use case (a knowledge-base support agent) and code you can paste and run today.
Tool calling is structured JSON output. You give the model a list of functions ("tools") with names, descriptions, and parameter schemas. The model decides whether to answer directly or to emit a tool call: {"name": "search_knowledge_base", "arguments": {"query": "refund policy"}}. Your code runs the function, returns the result, and the model continues.
That's it. No autonomy, no magic, no framework required. The same primitive lives in Anthropic's tool_use blocks, OpenAI's function calling, and Google's Gemini function calling. The shapes differ; the loop is identical.
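In TypeScript terms, one round-trip looks roughly like this. The block shapes follow Anthropic's naming (`tool_use` / `tool_result`); the ids and payloads are made up for illustration:

```typescript
// Hypothetical minimal shapes for one tool-calling round-trip.
type ToolUseBlock = {
  type: "tool_use";
  id: string;
  name: string;
  input: Record<string, unknown>;
};

type ToolResultBlock = {
  type: "tool_result";
  tool_use_id: string; // must echo the id of the tool_use block it answers
  content: string;
};

// The model emits a ToolUseBlock...
const call: ToolUseBlock = {
  type: "tool_use",
  id: "toolu_01",
  name: "search_knowledge_base",
  input: { query: "refund policy" },
};

// ...your code runs the function and replies with a ToolResultBlock,
// then calls the API again with both appended to the conversation.
const result: ToolResultBlock = {
  type: "tool_result",
  tool_use_id: call.id,
  content: JSON.stringify([{ articleId: "kb-12", snippet: "Refunds within 30 days…" }]),
};
```

The provider-specific SDKs wrap these shapes differently, but the call/result pairing is the whole contract.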
What it isn't: a planner, a memory system, or a multi-agent orchestrator. Those are layers you can add later. Most "agent failures" you read about are people who skipped the basic loop and reached for LangGraph on day one.
| Approach | Tool calling format | Streaming | Pricing (per M tokens) |
|---|---|---|---|
| Anthropic Claude Sonnet 4.6 | tool_use blocks | yes | $3 in / $15 out |
| OpenAI GPT-4.1 | function calling | yes | $2 in / $8 out |
| Google Gemini 2.5 Pro | function calling | yes | $1.25 in / $10 out |
| Vercel AI SDK | unified across providers | yes | passthrough |
Pick one provider for your first agent. Switching later costs you a few hours, not a rewrite.
The number-one beginner mistake is starting with "I want to build an AI agent" instead of "I want to answer support tickets from our knowledge base." The first framing produces a toy. The second produces something you ship.
Good first use cases share three traits: a single user, a bounded data source, and a clear definition of "correct."
Bad first use cases invert those traits: many users at once, an unbounded data source, or no objective way to judge whether the output is correct.
For the rest of this post we are building one specific thing: an agent that answers support questions from a markdown knowledge base, escalates anything it cannot answer to a human ticket, and refuses to go off-topic.
Tool descriptions matter more than tool names. The model never sees your function body; it sees the schema and the description, then guesses. Write descriptions like you are briefing a new junior engineer who will never get a follow-up question.
For the support agent we need four:
- `search_knowledge_base(query: string)` returns the top 3 article chunks ranked by semantic similarity.
- `get_article(id: string)` returns the full markdown for one article.
- `create_support_ticket(summary: string, priority: "low" | "normal" | "high")` opens a ticket in the human queue.
- `send_response(text: string)` delivers the final answer to the user and ends the session.

Notice the shape: three "read" tools and one "write" tool that ends the loop. That asymmetry is intentional. The model can browse and gather context cheaply, then must commit to a single output.
```typescript
// tools.ts
import { z } from "zod";

export const tools = [
  {
    name: "search_knowledge_base",
    description:
      "Search the company knowledge base for articles relevant to a user question. Returns the top 3 chunks with article IDs. Use this first for any factual question about our product, pricing, refunds, or onboarding.",
    input_schema: {
      type: "object",
      properties: {
        query: { type: "string", description: "The user question, rewritten as a search query." },
      },
      required: ["query"],
    },
  },
  {
    name: "get_article",
    description:
      "Fetch the full markdown of one knowledge-base article by ID. Call this after search_knowledge_base when you need full context.",
    input_schema: {
      type: "object",
      properties: { id: { type: "string" } },
      required: ["id"],
    },
  },
  {
    name: "create_support_ticket",
    description:
      "Open a human-handled support ticket. Use only when the knowledge base does not contain an answer or the user explicitly asks for a human.",
    input_schema: {
      type: "object",
      properties: {
        summary: { type: "string", description: "One-sentence summary of the issue." },
        priority: { type: "string", enum: ["low", "normal", "high"] },
      },
      required: ["summary", "priority"],
    },
  },
  {
    name: "send_response",
    description: "Send the final answer to the user. Call this exactly once. After this, the session ends.",
    input_schema: {
      type: "object",
      properties: { text: { type: "string" } },
      required: ["text"],
    },
  },
];
```
If you want to go deeper on schema design, our OpenAI function calling guide covers the same primitive on the OpenAI side.
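The agent loop later in the post imports `searchKB`, `getArticle`, and `createTicket` from a `handlers.ts` it never shows. Here is a minimal in-memory sketch; the article data and the naive keyword scoring are stand-ins (swap in real retrieval and a real ticketing API later):

```typescript
// handlers.ts — in-memory sketch of the three "read"/"escalate" handlers.
type Article = { id: string; title: string; body: string };

// Illustrative data, not from the post.
const ARTICLES: Article[] = [
  { id: "kb-1", title: "Refund policy", body: "Refunds are available within 30 days of purchase." },
  { id: "kb-2", title: "Onboarding", body: "Invite teammates from Settings > Members." },
];

export async function searchKB(query: string) {
  // Stand-in for semantic search: rank by how many query words appear
  // in the article title or body.
  const words = query.toLowerCase().split(/\W+/).filter(Boolean);
  return ARTICLES.map((a) => ({
    articleId: a.id,
    snippet: a.body.slice(0, 200),
    score: words.filter((w) => (a.title + " " + a.body).toLowerCase().includes(w)).length,
  }))
    .sort((x, y) => y.score - x.score)
    .slice(0, 3);
}

export async function getArticle(id: string) {
  const a = ARTICLES.find((a) => a.id === id);
  return a ? `# ${a.title}\n\n${a.body}` : "article not found";
}

export async function createTicket(summary: string, priority: "low" | "normal" | "high") {
  // Replace with your ticketing system's API call.
  return { ticketId: `t_${Date.now()}`, summary, priority, status: "open" };
}
```

Keeping the handlers this thin is deliberate: the loop only ever sees strings and JSON, so you can upgrade retrieval without touching the agent.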
1. `npm install @anthropic-ai/sdk zod` and set `ANTHROPIC_API_KEY` in your environment. That is the entire dependency footprint for this build.
2. Create `tools.ts` with the four schemas above and the four matching handler functions (`searchKB`, `getArticle`, `createTicket`, `sendResponse`).
3. Write `agent.ts` that calls the Anthropic Messages API with the tools array, checks `stop_reason`, executes any `tool_use` blocks, appends the results, and re-calls the API until `send_response` fires or a guardrail trips.
4. Validate the final `send_response.text` against a zod schema, with one retry on failure.
5. Create `evals.json` with expected tool sequences and known-good answers. Run them locally; do not ship until you hit 8/10.

Here is the entire loop. It runs, it has the three guardrails inline, and it is the thing you should paste, run, then modify.
```typescript
// agent.ts
import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";
import { tools } from "./tools";
import { searchKB, getArticle, createTicket } from "./handlers";

const client = new Anthropic();
const MODEL = "claude-sonnet-4-6";
const MAX_ITERATIONS = 8;
const COST_CAP_USD = 0.10;
const PRICE_IN = 3 / 1_000_000;
const PRICE_OUT = 15 / 1_000_000;

const ResponseSchema = z.object({ text: z.string().min(20).max(1500) });

export async function runAgent(userMessage: string) {
  const messages: any[] = [{ role: "user", content: userMessage }];
  let totalCost = 0;

  for (let turn = 0; turn < MAX_ITERATIONS; turn++) {
    const res = await client.messages.create({
      model: MODEL,
      max_tokens: 1024,
      tools,
      messages,
      system:
        "You are a support agent. Answer only from the knowledge base. " +
        "If you cannot answer, call create_support_ticket. End with send_response.",
    });

    // Guardrail 2: cost cap.
    totalCost +=
      res.usage.input_tokens * PRICE_IN + res.usage.output_tokens * PRICE_OUT;
    if (totalCost > COST_CAP_USD) {
      return { ok: false, reason: "cost_cap", cost: totalCost };
    }

    messages.push({ role: "assistant", content: res.content });

    if (res.stop_reason !== "tool_use") {
      return { ok: false, reason: "no_tool_call", cost: totalCost };
    }

    const toolResults: any[] = [];
    let finalText: string | null = null;

    for (const block of res.content) {
      if (block.type !== "tool_use") continue;
      const args: any = block.input;
      let result: string;
      switch (block.name) {
        case "search_knowledge_base":
          result = JSON.stringify(await searchKB(args.query));
          break;
        case "get_article":
          result = await getArticle(args.id);
          break;
        case "create_support_ticket":
          result = JSON.stringify(await createTicket(args.summary, args.priority));
          break;
        case "send_response":
          finalText = args.text;
          result = "delivered";
          break;
        default:
          result = `unknown tool: ${block.name}`;
      }
      toolResults.push({ type: "tool_result", tool_use_id: block.id, content: result });
    }

    // Guardrail 3: output validation, with retry via the loop.
    if (finalText !== null) {
      const parsed = ResponseSchema.safeParse({ text: finalText });
      if (parsed.success) {
        return { ok: true, text: finalText, cost: totalCost, turns: turn + 1 };
      }
      // Roles must alternate in the Messages API, so the retry instruction
      // rides along in the same user message as the tool results.
      toolResults.push({
        type: "text",
        text: "Your response did not match the required shape. Retry send_response with valid text.",
      });
    }

    messages.push({ role: "user", content: toolResults });
  }

  // Guardrail 1: max iterations.
  return { ok: false, reason: "max_iterations", cost: totalCost };
}
```
That is around 70 lines including imports. The Anthropic tool use docs cover edge cases (parallel tool calls, image inputs) when you need them. For tips on writing the system prompt itself, our notes on prompt engineering for senior engineers carry over directly.
The loop above has all three guardrails inline; together they take about 12 lines. Skipping any one is how you wake up to a $400 bill or a runaway support ticket.
Max-iterations. Hard-stop at 5-10 turns. Eight is a sane default for a single-user agent. If your agent regularly hits the cap, your tools are wrong (usually too granular), not your limit.
Cost cap. Track input_tokens × $3/M + output_tokens × $15/M per session. Bail at $0.10. A typical 5-turn session lands around $0.02; the cap exists to catch the bug, not the steady state. Our deeper token cost optimization guide covers the next layer (caching, smaller models for triage).
Output validation. Use zod (or any schema library) to validate the final send_response.text. One retry. If it fails twice, return a fallback to a human. Never let a malformed response reach the user.
Optional fourth guardrail for production: an output classifier (a cheap Haiku 4.5 call) that checks the final response for hallucinated facts before delivery. See our notes on handling LLM hallucinations in production for the full pattern.
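A sketch of that fourth guardrail. The verdict format (`PASS` / `FAIL: reason`) and the injectable `complete` function are assumptions for testability; in production `complete` would be a thin wrapper around a Haiku 4.5 Messages API call:

```typescript
// Final-pass fact check before delivery. `complete` is any prompt-in,
// text-out function (e.g. a cheap Haiku call); injected so it can be stubbed.
type Complete = (prompt: string) => Promise<string>;

export async function checkResponse(
  answer: string,
  sources: string[],
  complete: Complete,
): Promise<{ ok: boolean; reason?: string }> {
  const verdict = await complete(
    `Sources:\n${sources.join("\n---\n")}\n\nDraft answer:\n${answer}\n\n` +
      `Reply PASS if every factual claim in the draft is supported by the sources, ` +
      `otherwise reply FAIL: <reason>.`,
  );
  return verdict.trim().startsWith("PASS")
    ? { ok: true }
    : { ok: false, reason: verdict.replace(/^FAIL:\s*/, "").trim() };
}
```

If the check fails, fall back to `create_support_ticket` rather than shipping the draft; a classifier that can only block, never rewrite, is much harder to get wrong.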
Ten cases. Not a hundred. Not a "framework." A JSON file:
```json
[
  {
    "input": "How do I get a refund?",
    "expectedTools": ["search_knowledge_base", "send_response"],
    "expectedSubstring": "30 days"
  },
  {
    "input": "What is your CEO's home address?",
    "expectedTools": ["send_response"],
    "expectedSubstring": "I can't share"
  }
]
```
Write a 30-line runner that calls runAgent on each input, logs the tool sequence, the token count, the latency, and whether the expected substring appears. Bar to ship: 8/10 correct, no infinite loops, average cost under $0.05.
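That runner can be sketched as follows. It assumes `runAgent` is extended to also report the tool names it called (a small addition to the loop's return value, not shown in the post):

```typescript
// evals.ts — sketch of the eval runner. AgentFn mirrors runAgent's return
// shape plus an assumed `tools` field listing the tools called, in order.
type EvalCase = { input: string; expectedTools: string[]; expectedSubstring: string };
type AgentFn = (
  input: string,
) => Promise<{ ok: boolean; text?: string; cost: number; tools?: string[] }>;

export async function runEvals(cases: EvalCase[], agent: AgentFn) {
  let passed = 0;
  let totalCost = 0;
  for (const c of cases) {
    const started = Date.now();
    const res = await agent(c.input);
    totalCost += res.cost;
    const toolsOk = c.expectedTools.every((t) => (res.tools ?? []).includes(t));
    const textOk = res.ok && (res.text ?? "").includes(c.expectedSubstring);
    const pass = toolsOk && textOk;
    if (pass) passed++;
    console.log(
      `${pass ? "PASS" : "FAIL"} ${c.input} (${Date.now() - started}ms, $${res.cost.toFixed(3)})`,
    );
  }
  return { passed, total: cases.length, avgCost: totalCost / cases.length };
}
```

Wire it to `evals.json` with a `JSON.parse(readFileSync(...))` and fail CI when `passed / total` drops below 0.8.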
This eval is your regression net. Re-run it after every prompt change. It catches the silent degradation that makes agents drift from "great" to "embarrassing" over a month. If you're still trying to decide whether to build this in-house or hand it off, our Build/Buy/Book recommender gives a quick read in 90 seconds.
A Vercel serverless function or a Cloudflare Worker handles up to roughly 10 requests per second without thinking about it. Anything more and you want a queue.
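A sketch of the HTTP layer, using the standard `Request`/`Response` types that both Cloudflare Workers and Vercel edge functions expose. The route shape and the injected `runAgent` parameter are assumptions for illustration:

```typescript
// A minimal POST endpoint wrapping the agent loop. runAgent is injected
// so the handler stays testable without an API key.
export async function handleRequest(
  req: Request,
  runAgent: (q: string) => Promise<{ ok: boolean; text?: string; reason?: string }>,
): Promise<Response> {
  if (req.method !== "POST") return new Response("POST only", { status: 405 });
  const { question } = (await req.json()) as { question?: string };
  if (!question) return new Response("missing question", { status: 400 });
  const result = await runAgent(question);
  // Guardrail trips surface as a 502 so the caller can show a fallback.
  return result.ok
    ? Response.json({ answer: result.text })
    : Response.json({ error: result.reason }, { status: 502 });
}
```

Mapping guardrail failures to a distinct status code is what makes the alerting below cheap: count 502s, not log lines.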
What to set up before you ship: at minimum, an alert for any session that ends in `cost_cap` or `max_iterations`. Those are bugs, not edge cases.

If you want a fuller production checklist (RAG, eval harness, observability), our production RAG architecture post covers the retrieval side that sits underneath `search_knowledge_base`.
Most beginners stop at "it works in dev" and skip the eval. Don't be them. The eval is the thing that lets you change a prompt without fear and ship a v2 in a week instead of a quarter.
Good v2 moves, in rough order of return on effort, start with session memory (a `sessions` table) so the agent remembers prior turns within a conversation.

If the loop, the guardrails, the eval, and the deploy add up to more weekend than you've got, the fastest path is to hand the build to an engineer who has shipped this exact pattern before. Every Cadence engineer is AI-native by default, vetted on Cursor, Claude Code, and Copilot fluency in a voice interview before they unlock bookings, and the platform's 12,800-engineer pool means you usually get a first match in two minutes.
Not sure if "build, buy, or book" is the right call for your first agent? Our Build/Buy/Book recommender takes 90 seconds and gives you a straight answer based on your scope, deadline, and team. Free, no email gate.
"Function calling" and "tool use" mean the same thing. OpenAI calls it function calling; Anthropic calls it tool use; Google calls it function calling. The structure is identical: a JSON schema, the model emits a name plus arguments, your code runs the function and returns the result.
For your first agent, write the loop yourself. Frameworks hide the exact failure modes you most need to feel: schema mistakes, runaway loops, cost spikes. Move to a framework only after you've shipped two agents and know what you'd want abstracted away. The 70-line loop in this post is shorter than most framework setup files.
A 5-8 turn session on Claude Sonnet 4.6 with a small knowledge base runs about $0.02-$0.10. Daily cost scales linearly with session count. Cap per-session spend in code (the cost cap above is 4 lines). For a deeper breakdown, see our cost-to-build-an-AI-agent guide.
Hard-cap iterations at 5-10 and total token spend per session at $0.10. Both are 4-line additions to the loop. Without them, a single bug in your tool descriptions can spend $200 overnight. The agent does not know it is stuck; you have to tell the runtime.
A knowledge-base Q&A agent with four tools (search, get article, escalate, respond). It teaches the loop, schema design, and guardrails in under 200 lines of code, and it solves a real problem (deflecting Tier 1 support tickets) on day one. Build that, ship it, then build the next one.