May 14, 2026 · 10 min read · Cadence Editorial

How to use Claude tool use in production

Photo by [Al Nahian](https://www.pexels.com/@alnahian2003) on [Pexels](https://www.pexels.com/photo/computer-program-on-computer-screen-7325498/)


Claude tool use in production is a four-step loop on the Messages API: define your tools with a JSON input_schema, send a request that includes the tools array, watch for a response with stop_reason: "tool_use", then run the tool and reply with a tool_result block. You repeat until you see stop_reason: "end_turn". That is the whole pattern. Everything else is the engineering around it.

This guide is specifically about the Anthropic Messages API tool use primitive. It is not a guide to Claude Code (the CLI agent) or to OpenAI function calling. Both are useful, but they solve different problems. We cover what Claude tool use is, how to define tools, the agent loop, parallel tool calls, tool_choice, error handling, fine-grained streaming, and the production checklist that turns a working prototype into something you can put on a credit card.

What Claude tool use actually is

Tool use in the Messages API is structured output plus a contract. You declare what functions Claude can call. Claude reads the user's request and decides whether to answer directly or to emit a tool_use block with the function name and JSON arguments. Your code runs the function and posts the result back as a tool_result block. The model picks up where it left off.

There are two flavors. Client tools (functions you define, plus Anthropic-schema tools like bash and text_editor) run in your app. You handle execution. Server tools (web_search, code_execution, web_fetch, tool_search) run on Anthropic's infrastructure. You see the results inline.

Distinguish this from two related things. Claude Code is a CLI tool for engineers that wraps the same primitives plus a file-system harness; you don't embed it in your product. OpenAI's function calling guide covers the same idea on the OpenAI side, but the schemas, stop reasons, and streaming model are different enough that you can't copy-paste between them.

The stop_reason values you actually care about: tool_use (run the tool, loop), end_turn (Claude is done, return the answer), max_tokens (the response was truncated, decide whether to retry with a larger budget), and refusal (Claude declined, surface gracefully). Branching on stop_reason is the first thing your loop does.

Defining tools with input_schema

A tool definition has three required fields:

  • name matches the regex ^[a-zA-Z0-9_-]{1,64}$.
  • description is plain text, ideally 3-4 sentences explaining what the tool does, when to use it, what it returns, and any limits.
  • input_schema is a JSON Schema object describing the arguments.

Here is a TypeScript example using the official SDK:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const tools = [
  {
    name: "get_order_status",
    description:
      "Look up the current status of an order by order ID. Returns one of pending, shipped, delivered, cancelled, refunded, plus the most recent tracking event. Use this when a customer asks where their order is or whether it shipped. Does not return line items or pricing.",
    input_schema: {
      type: "object",
      properties: {
        order_id: {
          type: "string",
          description: "Order ID, format ORD- followed by 8 digits.",
        },
      },
      required: ["order_id"],
    },
  },
];

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  tools,
  messages: [{ role: "user", content: "Where is order ORD-12345678?" }],
});

The same in Python:

import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_order_status",
    "description": (
        "Look up the current status of an order by order ID. "
        "Returns one of pending, shipped, delivered, cancelled, refunded, "
        "plus the most recent tracking event."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Order ID, format ORD- followed by 8 digits.",
            }
        },
        "required": ["order_id"],
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Where is order ORD-12345678?"}],
)

Two upgrades to know. For complex nested arguments, add an input_examples array to your tool definition; Claude uses those as concrete patterns. For schemas you do not want Claude to violate (date formats, enums, nested IDs), set strict: true. Strict tool use validates the call before your code runs, which matters when a malformed call causes a database write you cannot undo. This is the same kind of prompt-as-spec discipline that senior engineers apply everywhere else: precise inputs, predictable outputs.
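As a sketch of what those two upgrades can look like in a tool definition (the input_examples and strict fields follow the description above; verify them against the current API reference before relying on them):

```python
# Tool definition sketch combining input_examples and strict mode as
# described above. Field names follow that description; check them
# against the current Anthropic API reference before shipping.
strict_tool = {
    "name": "get_order_status",
    "description": (
        "Look up the current status of an order by order ID. "
        "Returns one of pending, shipped, delivered, cancelled, refunded."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Order ID, format ORD- followed by 8 digits.",
                "pattern": "^ORD-[0-9]{8}$",
            }
        },
        "required": ["order_id"],
        "additionalProperties": False,
    },
    # Concrete argument patterns Claude can imitate for nested inputs.
    "input_examples": [{"order_id": "ORD-00421337"}],
    # Validate the call against the schema before your code runs.
    "strict": True,
}
```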

Two cost notes. Tool use adds a system prompt under the hood: 346 tokens for auto or none, 313 tokens for any or tool. The tools array itself is also tokenized. If your tool surface is large, cache it.
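Caching the tools array is a one-line change: set a cache_control breakpoint on the last tool definition, and everything up to and including that tool is cached across requests. A minimal sketch:

```python
import copy

# Prompt caching for a large tools array: mark the final tool with an
# ephemeral cache breakpoint so the whole tools prefix is cached
# across requests.
def with_cached_tools(tools):
    """Return a copy of the tools array with a cache_control
    breakpoint on the last tool definition."""
    cached = copy.deepcopy(tools)
    cached[-1]["cache_control"] = {"type": "ephemeral"}
    return cached
```

Pass the result as the tools argument on every request; the cache key is the prefix, so the array must stay byte-identical between calls.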

The agent loop, end to end

The loop is the part that bites. The model's response is one turn. If stop_reason is tool_use, you must:

  1. Append the assistant message to your history verbatim, including all tool_use blocks.
  2. Execute every tool_use block (in parallel if independent).
  3. Build a single user message whose content is one tool_result block per tool_use_id.
  4. Send the next request and repeat.

Here is a minimal agent loop in TypeScript with a max-turn cap:

async function runAgent(userInput: string, maxTurns = 10) {
  const messages: any[] = [{ role: "user", content: userInput }];

  for (let turn = 0; turn < maxTurns; turn++) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 1024,
      tools,
      messages,
    });

    messages.push({ role: "assistant", content: response.content });

    if (response.stop_reason === "end_turn") {
      return response.content.find((b: any) => b.type === "text")?.text ?? "";
    }

    if (response.stop_reason !== "tool_use") {
      throw new Error(`Unhandled stop_reason: ${response.stop_reason}`);
    }

    const toolUses = response.content.filter((b: any) => b.type === "tool_use");
    const results = await Promise.all(
      toolUses.map(async (tu: any) => {
        try {
          const output = await runTool(tu.name, tu.input);
          return {
            type: "tool_result",
            tool_use_id: tu.id,
            content: JSON.stringify(output),
          };
        } catch (err: any) {
          return {
            type: "tool_result",
            tool_use_id: tu.id,
            content: `Error: ${err.message}`,
            is_error: true,
          };
        }
      }),
    );

    messages.push({ role: "user", content: results });
  }

  throw new Error(`Agent exceeded ${maxTurns} turns`);
}

Three things this loop gets right. It treats end_turn and tool_use as the only happy-path stop reasons and screams on anything else. It runs tools concurrently with Promise.all. It caps turns; without that cap, a single bug in a tool can spend $40 of credits in a minute.

Parallel tool calls and tool_choice

Claude can emit multiple tool_use blocks in one response when the calls are independent. A weather agent asked "compare SF and Tokyo" might call get_weather twice in parallel. The contract on your side: every tool_use_id gets exactly one tool_result block in the next user message, in any order. Miss one and the next request fails.
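A cheap guard before you send the follow-up request: diff the tool_use_ids in the assistant turn against the tool_result blocks you built, and fail loudly if any are missing. A minimal sketch:

```python
# Check that a follow-up user message answers every tool_use block
# from the assistant turn: exactly one tool_result per tool_use_id.
def missing_tool_results(assistant_content, tool_results):
    """Return the set of tool_use_ids with no matching tool_result."""
    wanted = {b["id"] for b in assistant_content
              if b.get("type") == "tool_use"}
    answered = {r["tool_use_id"] for r in tool_results
                if r.get("type") == "tool_result"}
    return wanted - answered
```

Run this just before appending the results message; an empty set means the next request will not be rejected for a missing ID.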

The tool_choice parameter has four modes:

| Mode | Behavior | When to use |
| --- | --- | --- |
| auto (default) | Claude decides whether to call a tool | Open-ended assistants |
| any | Claude must call some tool, its choice | Routers where every input maps to a tool |
| tool | Claude must call a specific named tool | Structured extraction, mechanical next step |
| none | Claude cannot call tools | Pure-text follow-up turns |

any and tool work by prefilling the assistant message, so Claude skips the natural-language preface and goes straight to the tool call. That's exactly what you want for extraction. It is exactly what you do not want for a chat assistant where the user expects a sentence first.
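For extraction, the request body differs from the default only in the tool_choice field. A sketch of a request builder (assumes the tools array and client from earlier):

```python
# Forcing a specific tool for structured extraction: only tool_choice
# changes relative to the default request.
def extraction_request(tools, text, tool_name):
    """Build kwargs for client.messages.create that force one tool."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "tools": tools,
        # "tool" mode: Claude must call exactly this named tool.
        "tool_choice": {"type": "tool", "name": tool_name},
        "messages": [{"role": "user", "content": text}],
    }
```

Usage: `client.messages.create(**extraction_request(tools, raw_email, "get_order_status"))`; the response's first content block will be the forced tool_use.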

One gotcha: any and tool are not compatible with extended thinking. If you want Claude to think before picking a tool, you stay on auto and put the constraint in your system prompt.

Error handling, retries, and the production traps

Tool failures happen. Set is_error: true on the tool_result and put a short, model-readable message in content. Claude will see the error, often retry with different arguments, or surface it to the user.

The pattern that works in production:

  • Transient errors (5xx, network timeouts, 429 rate limits) stay inside your tool wrapper. Retry with exponential backoff. Only surface to Claude after N attempts, because Claude cannot fix a flaky downstream service.
  • Permanent errors (404 on order ID, validation failures) go straight back to Claude as is_error. Claude will usually re-prompt the user or pick a different tool.
  • Side-effecting tools (charge a card, send an email, write to a DB) need an idempotency key. Pass it as a tool argument or generate it from the tool_use_id. If the agent loop retries, you do not double-charge.
  • Circuit breaker. Track consecutive failures of the same tool. After 3, return a hard failure to Claude with instructions to stop trying that tool. Otherwise you can burn 10 turns on a dead service.
  • Cost cap. Add up usage.input_tokens and usage.output_tokens across the loop. Hard-stop above a per-session ceiling. This is the difference between a $0.05 conversation and a $4 one.
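The first three bullets can live in one per-tool wrapper. A minimal sketch, with illustrative TransientError/PermanentError classes standing in for your real exception taxonomy:

```python
import time

# Illustrative exception classes; map your real errors onto these.
class TransientError(Exception): pass
class PermanentError(Exception): pass

def run_tool_safely(fn, tool_use_id, args, max_attempts=3, base_delay=0.5):
    """Retry transient failures with exponential backoff; surface
    permanent failures to Claude as is_error tool_results."""
    for attempt in range(max_attempts):
        try:
            # Reuse the tool_use_id as an idempotency key so a retried
            # loop never double-executes a side effect.
            output = fn(**args, idempotency_key=tool_use_id)
            return {"type": "tool_result", "tool_use_id": tool_use_id,
                    "content": str(output)}
        except TransientError:
            if attempt == max_attempts - 1:
                return {"type": "tool_result", "tool_use_id": tool_use_id,
                        "content": "Error: upstream service unavailable",
                        "is_error": True}
            time.sleep(base_delay * 2 ** attempt)
        except PermanentError as err:
            # Claude can act on this: re-prompt the user, try another tool.
            return {"type": "tool_result", "tool_use_id": tool_use_id,
                    "content": f"Error: {err}", "is_error": True}
```

The circuit breaker and cost cap sit one level up, in the loop, because they track state across turns rather than within one call.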

Set ANTHROPIC_LOG=info in development to see the underlying request/response. In production, span every tool call with whatever tracer you already use (OpenTelemetry, Datadog, Sentry). The thing you want to debug six weeks from now is "which tool, with which args, on which turn, broke the loop." This is the same discipline you'd apply to handling LLM hallucinations in production: structured logs, replayable inputs, hard ceilings.

Fine-grained streaming and multi-turn state

Streaming gives you tool_use blocks token by token, which sounds incremental until you realize that the JSON arguments arrive as partial string fragments and only parse as valid JSON once the block completes. The Anthropic SDKs handle the parsing for you and emit content_block_start, content_block_delta (carrying input_json_delta payloads), and content_block_stop events.

Two production wins from streaming:

  1. Early cancel. If you can tell from the first 50 tokens of arguments that Claude picked the wrong tool, cancel the request and re-prompt. Saves output tokens.
  2. Show progress. Surface "Looking up order..." in the UI as soon as the tool_use block starts, not after the tool returns.
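Under the hood, the accumulation looks roughly like this (a sketch of what the SDK helpers do for you, assuming the event shapes described above):

```python
import json

# input_json_delta events carry partial_json string fragments; the
# tool arguments only parse once the block closes. This accumulator
# mirrors what the SDK stream helpers do for you.
class ToolInputAccumulator:
    def __init__(self):
        self.parts = []

    def add_delta(self, partial_json):
        """Append one partial_json fragment from a delta event."""
        self.parts.append(partial_json)

    def finish(self):
        """Parse the arguments; valid only at content_block_stop."""
        return json.loads("".join(self.parts))
```

The early-cancel trick works on self.parts before finish(): substring-match the accumulated fragments and abort the request if the wrong tool or argument shape is emerging.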

Multi-turn state is its own discipline. Persist the full message history (user, assistant, tool_result) keyed by session ID. Don't reconstruct from a summary; tool_use_ids must match. As the history fills, compact older turns by replacing tool_result content with a one-line summary, but keep the structure intact.
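A compaction pass along those lines can be sketched as a pure function over the history (the elision text is illustrative; keep roles, block types, and tool_use_ids untouched so the pairing rules still hold):

```python
# Replace older tool_result payloads with a one-line summary while
# keeping message roles, block types, and tool_use_ids intact.
def compact_history(messages, keep_last=4):
    """Elide tool_result content in all but the last keep_last messages."""
    compacted = []
    cutoff = len(messages) - keep_last
    for i, msg in enumerate(messages):
        if i >= cutoff or not isinstance(msg.get("content"), list):
            compacted.append(msg)
            continue
        blocks = []
        for block in msg["content"]:
            if block.get("type") == "tool_result":
                block = {**block,
                         "content": "[result elided: resolved earlier in session]"}
            blocks.append(block)
        compacted.append({**msg, "content": blocks})
    return compacted
```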

If your tools array is large or stable, use Sonnet 4.6 with prompt caching on the tools array. At $3 input and $15 output per million tokens, caching tool definitions cuts repeat-request input cost by roughly 90% on cache hits. For deeper agent design (when to call a tool, when to think first), read our piece on building your first AI agent with tool calling.

Shipping it: the engineering reality

Most tool-using prototypes ship as a 60-line script and stall there. The 60 lines call the API, run the tool once, and print the result. The next 600 lines are the loop, the retries, the idempotency, the tracing, the cost cap, the circuit breaker, the streaming UX, and the test harness that replays a recorded session against your tool stubs.

That second 600 lines is where most teams underestimate the work. It is not hard, but it is unglamorous, and it is the difference between a demo and a system that lets you sleep at night.

A pragmatic production checklist:

  • Loop with max_turns cap and stop_reason exhaustive switch.
  • Per-tool wrapper with timeout, retry policy, and idempotency key.
  • Circuit breaker on consecutive same-tool failures.
  • Cost cap per session, computed from usage after each turn.
  • Structured logs with session_id, turn, tool_name, tool_use_id, latency, error.
  • Streaming UI for tool_use block starts.
  • Prompt caching on the tools array if it exceeds 1k tokens.
  • A replay harness with recorded sessions and stubbed tools.

If you have an engineer who has shipped this before, the second 600 lines is a week. If you are figuring it out from the docs, it is a month. Every engineer on Cadence is AI-native by default, vetted on Claude, Cursor, and Copilot fluency through a voice interview before they unlock bookings, and that includes the agent-loop patterns above. Founders book by the week starting at $500 (junior) for cleanup work, $1,000 (mid) for end-to-end features, $1,500 (senior) for architecture and edge cases, $2,000 (lead) for systems design and fractional CTO work. The 48-hour free trial means you see the loop running on your codebase before you pay.

If you want a build-or-book gut-check first, run your agent feature through our Decide tool and get a Build / Buy / Book recommendation in two minutes.

Across the platform, that's a 12,800-engineer pool, 27-hour median time to first commit, and 67% trial-to-active conversion. Whichever path you pick, the loop is the work. Get it right and tool use becomes the boring, reliable part of your product.

If you are hiring for a tool-using agent right now and the in-house option is still six weeks away from a production loop, the fastest path is to skip the recruiter cycle. Book a senior engineer on Cadence in 2 minutes, get a 48-hour trial, and pay weekly with no notice period.

FAQ

How is Claude tool use different from Claude Code?

Claude Code is a CLI agent that uses Claude plus a file-system and shell harness to do engineering work at your terminal. Claude tool use is the Messages API primitive that lets Claude call functions you define inside your own application. You build with tool use; you use Claude Code.

Can Claude call multiple tools in parallel?

Yes. Claude can emit multiple tool_use blocks in a single response when the calls are independent. You execute them concurrently and return all results in one follow-up user message, with one tool_result block per tool_use_id. Miss any ID and the next request fails.

What is tool_choice and when should I force a tool?

tool_choice has four modes: auto (default, Claude decides), any (must call some tool), tool (must call a specific named tool), and none (no tools). Force a specific tool for structured extraction or mechanical next steps. Stay on auto for assistants where the user expects a natural-language reply.

How do I handle tool errors?

Return a tool_result with is_error: true and a short, model-readable message. For transient errors (5xx, timeouts, 429), retry inside your tool wrapper with backoff before returning to the model. Add a circuit breaker on consecutive same-tool failures and a cost cap per session.

Which Claude model should I use for tool-using agents?

Opus 4.7 for ambiguous, multi-tool agents that benefit from clarifying questions. Sonnet 4.6 for most production workloads at $3 input / $15 output per million tokens. Haiku 4.5 for high-volume routing and classification where the tools are simple and the inputs are well-formed.
