
OpenAI function calling lets the model decide which of your functions to invoke and what arguments to pass, returning structured JSON that your code executes. To use it correctly in 2026: define tools with JSON schema, set strict: true, validate arguments before you run anything, support parallel calls, cap your agent loop at a fixed number of turns, and return errors back to the model as data instead of raising exceptions.
Most public guides still teach the 2023 API. Things have shifted: tools replaced functions, strict mode is the default for serious code, the Responses API exists, and parallel calls now work alongside strict. This post is the playbook a senior engineer writes after shipping three production agents on GPT-5.5: paired Python and TypeScript code, real cost math, and a decision rule for when not to use function calling at all.
Function calling does not execute anything. The model reads your prompt, decides a tool would help, and returns JSON describing which function to invoke and with what arguments. Your code parses that JSON, runs the function, and feeds the result back. The model then uses the result to write a response or request another tool.
If all you want is typed JSON output (a parsed invoice, a structured product list, a normalized address), skip function calling. Use Structured Outputs via response_format. It is cheaper, has no loop, and avoids the tool-result handshake.
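For contrast, here is a minimal Structured Outputs sketch using the Chat Completions `json_schema` response format: one request, no loop, no tool-result handshake. The schema and field names (`parsed_invoice`, `vendor`, etc.) are illustrative, not from any real integration.

```python
# JSON schema for the structured result we want back -- illustrative fields only.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total_cents": {"type": "integer"},
        "due_date": {"type": "string", "description": "ISO 8601 date, e.g. '2026-03-01'"},
    },
    "required": ["vendor", "total_cents", "due_date"],
    "additionalProperties": False,
}

response_format = {
    "type": "json_schema",
    "json_schema": {"name": "parsed_invoice", "schema": invoice_schema, "strict": True},
}

def parse_invoice(raw_email: str) -> str:
    # One request, no agent loop: the model must reply with JSON matching the schema.
    from openai import OpenAI
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-5.5",
        messages=[{"role": "user", "content": f"Extract the invoice from: {raw_email}"}],
        response_format=response_format,
    )
    return resp.choices[0].message.content  # a schema-valid JSON string
```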
Reach for function calling when the model needs to take actions with side effects: querying a database, hitting an external API, writing a file, sending a Slack message. Anything where the next model turn depends on a real-world result.
A tool definition has five fields that matter: type, name, description, parameters, and strict. Get any of them wrong and the model will misroute, fabricate arguments, or refuse to call the tool at all.
```json
{
  "type": "function",
  "function": {
    "name": "get_invoice",
    "description": "Fetch a customer invoice by ID. Returns line items, total, and payment status.",
    "parameters": {
      "type": "object",
      "properties": {
        "invoice_id": {
          "type": "string",
          "description": "The invoice ID, e.g. 'inv_01HZAB12CD34'"
        }
      },
      "required": ["invoice_id"],
      "additionalProperties": false
    },
    "strict": true
  }
}
```
Three rules earn their weight every project:
- strict: true. Without it, the model can hallucinate arguments, skip required fields, or return invalid JSON. With it, OpenAI compiles your schema server-side and constrains decoding to match. The first request takes 1 to 2 extra seconds while the schema compiles; every later request hits the cache.
- additionalProperties: false. Strict mode requires it. It also stops the model from inventing helpful extra fields you never asked for.
- required. Strict mode wants every property listed in the required array. For genuinely optional fields, use type: ["string", "null"] and let the model pass null.
- Descriptions are the single biggest lever on accuracy. "location" will get you "the user's house" half the time. "City and 2-letter US state, e.g. Seattle, WA" will get you "Seattle, WA" reliably. Adding a concrete example bumps argument accuracy by roughly 30%.
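The optional-field rule looks like this in a strict-mode schema. A sketch with illustrative field names; the pattern (nullable type, still listed in required) is the part that matters:

```python
# Strict mode wants every property listed in "required", so an optional
# field is modeled as nullable rather than omitted from "required".
parameters = {
    "type": "object",
    "properties": {
        "invoice_id": {
            "type": "string",
            "description": "The invoice ID, e.g. 'inv_01HZAB12CD34'",
        },
        "currency": {
            # Optional under strict mode: still required, but the model may pass null.
            "type": ["string", "null"],
            "description": "3-letter ISO currency code, e.g. 'USD', or null for the default",
        },
    },
    "required": ["invoice_id", "currency"],
    "additionalProperties": False,
}
```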
Tool definitions are sent on every request and count as input tokens. Ten well-described tools run about 1,500 tokens per turn before the user even speaks. A 5-turn agent loop is 7,500 tokens of pure tool overhead per session. Keep descriptions tight and tool count under 20.
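The overhead math is simple enough to keep as a back-of-envelope helper. The 150 tokens-per-tool default below is a rough assumption for a well-described tool, not an API constant:

```python
def tool_overhead_tokens(n_tools: int, turns: int, tokens_per_tool: int = 150) -> int:
    """Rough input-token cost of resending tool definitions on every turn."""
    return n_tools * tokens_per_tool * turns

# Ten tools resent across a 5-turn agent loop:
print(tool_overhead_tokens(10, 5))  # 7500 tokens of pure tool overhead
```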
Here is the smallest correct loop in Python using Chat Completions, with parallel-tool support, argument validation, and an error-as-data pattern.
```python
import json
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_invoice",
            "description": "Fetch a customer invoice by ID.",
            "parameters": {
                "type": "object",
                "properties": {
                    "invoice_id": {"type": "string", "description": "Invoice ID, e.g. 'inv_01HZ...'"}
                },
                "required": ["invoice_id"],
                "additionalProperties": False,
            },
            "strict": True,
        },
    }
]

def get_invoice(invoice_id: str) -> dict:
    # your real DB call here
    return {"id": invoice_id, "total_cents": 12000, "status": "paid"}

TOOL_REGISTRY = {"get_invoice": get_invoice}

def run_agent(user_input: str, max_turns: int = 10):
    messages = [{"role": "user", "content": user_input}]
    for turn in range(max_turns):
        response = client.chat.completions.create(
            model="gpt-5.5",
            messages=messages,
            tools=tools,
        )
        msg = response.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            return msg.content
        for call in msg.tool_calls:
            name = call.function.name
            try:
                args = json.loads(call.function.arguments)
                result = TOOL_REGISTRY[name](**args)
                content = json.dumps(result)
            except Exception as e:
                content = json.dumps({"error": str(e), "tool": name})
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": content,
            })
    raise RuntimeError(f"Agent exceeded {max_turns} turns without finishing.")
```
The TypeScript version follows the same shape. Use the Zod helper for compile-time types on tool inputs.
```typescript
import OpenAI from "openai";
import { z } from "zod";
import { zodFunction } from "openai/helpers/zod";

const client = new OpenAI();

const GetInvoiceArgs = z.object({
  invoice_id: z.string().describe("Invoice ID, e.g. 'inv_01HZ...'"),
});

const tools = [
  zodFunction({
    name: "get_invoice",
    description: "Fetch a customer invoice by ID.",
    parameters: GetInvoiceArgs,
  }),
];

const registry = {
  get_invoice: async ({ invoice_id }: { invoice_id: string }) =>
    ({ id: invoice_id, total_cents: 12000, status: "paid" }),
};

export async function runAgent(userInput: string, maxTurns = 10) {
  const messages: any[] = [{ role: "user", content: userInput }];
  for (let turn = 0; turn < maxTurns; turn++) {
    const res = await client.chat.completions.create({ model: "gpt-5.5", messages, tools });
    const msg = res.choices[0].message;
    messages.push(msg);
    if (!msg.tool_calls?.length) return msg.content;
    const results = await Promise.all(
      msg.tool_calls.map(async (call) => {
        try {
          const args = GetInvoiceArgs.parse(JSON.parse(call.function.arguments));
          const result = await registry[call.function.name as "get_invoice"](args);
          return { id: call.id, content: JSON.stringify(result) };
        } catch (e: any) {
          return { id: call.id, content: JSON.stringify({ error: e.message }) };
        }
      }),
    );
    for (const r of results) {
      messages.push({ role: "tool", tool_call_id: r.id, content: r.content });
    }
  }
  throw new Error(`Agent exceeded ${maxTurns} turns.`);
}
```
Four production guarantees are baked in: strict-mode tools, JSON-parsing wrapped in try/catch, argument validation via Zod, and a hard turn cap. None are optional.
Recent models (gpt-4.1 onward, all gpt-5.x) can return multiple tool_calls in one assistant message. Ask "What is the weather and time in Tokyo, Paris, and SF?" and the model returns six tool calls at once. Run them concurrently.
In the Python version, swap the sequential loop for asyncio.gather. In TypeScript, Promise.all already does the work. A 4-call sequence that takes 8 seconds serially drops to about 2 seconds in parallel. That is the whole user-experience gap between "snappy" and "broken".
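The asyncio.gather swap might look like this. A sketch with stubbed async tools standing in for real network I/O; the registry and tool names are illustrative:

```python
import asyncio
import json

async def get_weather(city: str) -> dict:
    await asyncio.sleep(0.1)  # stands in for a real HTTP call
    return {"city": city, "temp_c": 18}

async def get_time(city: str) -> dict:
    await asyncio.sleep(0.1)
    return {"city": city, "time": "14:00"}

ASYNC_REGISTRY = {"get_weather": get_weather, "get_time": get_time}

async def run_tool_calls(tool_calls: list[dict]) -> list[dict]:
    # One task per tool call; total latency ~= the slowest call, not the sum.
    async def one(call: dict) -> dict:
        fn = ASYNC_REGISTRY[call["name"]]
        result = await fn(**json.loads(call["arguments"]))
        return {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)}
    # gather preserves input order, so results line up with the calls.
    return await asyncio.gather(*(one(c) for c in tool_calls))

calls = [
    {"id": "c1", "name": "get_weather", "arguments": '{"city": "Tokyo"}'},
    {"id": "c2", "name": "get_time", "arguments": '{"city": "Paris"}'},
]
results = asyncio.run(run_tool_calls(calls))
```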
Two failure modes to watch:
- When tool_b needs tool_a's result, the model usually serializes across turns, but not always. Set parallel_tool_calls: false for those requests.
- A loop with no cap is a billing event waiting to happen. Cap at 8 to 12 turns for production agents. Log every turn with tool names and durations; you will need that data the first time something loops badly. The habit of logging the full message array to a side store so you can replay deterministically is part of what we call the verification habit, covered in what we mean by 'AI-native engineer'.
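A minimal version of that per-turn log: one JSON line per turn to a side file. The file name and record shape are my own, not a prescribed format:

```python
import json
import time
from pathlib import Path

LOG_PATH = Path("agent_turns.jsonl")  # hypothetical side store

def log_turn(turn: int, tool_calls: list, messages: list, started_at: float) -> None:
    """Append one JSON line per agent turn so a bad loop can be replayed later."""
    record = {
        "turn": turn,
        "tools": [c["name"] for c in tool_calls],
        "duration_s": round(time.monotonic() - started_at, 3),
        "messages": messages,  # the full array is what makes replay deterministic
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")

t0 = time.monotonic()
log_turn(1, [{"name": "get_invoice"}], [{"role": "user", "content": "hi"}], t0)
```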
The single biggest mistake in production agents is letting tool errors propagate as exceptions. The model has no idea your database timed out. It sees its tool call vanish and retries the same broken call forever.
The fix is one line of discipline: wrap every tool execution in try/except and return the error as a JSON message.
```python
try:
    result = TOOL_REGISTRY[name](**args)
    content = json.dumps(result)
except Exception as e:
    content = json.dumps({"error": type(e).__name__, "message": str(e)[:500]})
```
Now the model sees {"error": "ConnectionTimeout", "message": "..."} and can retry, ask the user to clarify, or give up gracefully. This pattern eliminates roughly 80% of "agent stuck in a loop" tickets.
Three more habits worth adopting:
- Retry transient failures with exponential backoff: 2^attempt seconds of delay, capped around 30s. The official SDKs do this if you set max_retries.

The cost story the docs gloss over: three numbers to memorize.
| What | Cost impact |
|---|---|
| Each tool definition | 100 to 200 input tokens per request, every turn |
| Strict-mode schema compile | One-time 1 to 2 second latency on first use, then cached |
| Tool result message | Whatever you put in content, counted as input next turn |
Two tactical wins. First, enable OpenAI's prompt caching when your tool list is stable; first request pays full price, later requests get a 50% discount on cached input tokens. Second, do not pass 20 tools every request. Either split into role-specific sub-agents (4 to 6 tools each) or use tool-search on gpt-5.4 and above, which lets the model query a tool index instead of receiving the full list upfront.
Function calling typically adds 15 to 30% to per-request token cost vs a vanilla chat completion. If you are sizing an AI budget, the cost to integrate OpenAI API into your app post breaks the math down end to end.
Three surfaces, three different times to use them.
| Approach | When to use | Pros | Cons |
|---|---|---|---|
| Chat Completions + tools | Simple agents, you control state | Stateless, easy to debug, works on every OpenAI model | You manage the message array manually |
| Responses API + tools | Multi-turn agents on gpt-5 and above | Reasoning items preserved, server-managed state | Newer surface, fewer third-party libs |
| Structured Outputs only | Typed JSON, no side effects | Simpler, cheaper, no loop | Cannot trigger actions |
| MCP server | Same tools across Claude, Cursor, ChatGPT | One implementation, many clients | Extra infra to host the server |
The Responses API is OpenAI's bet on stateful agents. It accepts an input array, returns an output array, and preserves the reasoning items that gpt-5 and above generate while thinking. If you are using a reasoning model and do not pass those items back next turn, you lose context and quality drops noticeably. Responses API does this for you; Chat Completions does not.
Use Chat Completions for a stateless, debuggable surface that works on every model from gpt-4o to gpt-5.5. Use Responses API for real multi-step agents on a reasoning model where you want the platform to handle state.
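One Responses API turn, sketched. I'm assuming the function_call / function_call_output item types, and note that Responses tool definitions use a flat shape ({"type": "function", "name": ..., "parameters": ..., "strict": true}) rather than the nested Chat Completions shape:

```python
import json

TOOL_REGISTRY = {}  # filled in elsewhere, as in the Chat Completions example

def tool_result_item(call_id: str, result: dict) -> dict:
    # Responses API expects tool results as function_call_output items,
    # matched to the originating call by call_id (not role:"tool" messages).
    return {
        "type": "function_call_output",
        "call_id": call_id,
        "output": json.dumps(result),
    }

def run_turn(client, input_items: list, tools: list):
    # One turn: reasoning items come back in resp.output and should be passed
    # forward on the next turn alongside the tool results, or quality drops.
    resp = client.responses.create(model="gpt-5.5", input=input_items, tools=tools)
    followup = list(input_items) + list(resp.output)
    for item in resp.output:
        if item.type == "function_call":
            result = TOOL_REGISTRY[item.name](**json.loads(item.arguments))
            followup.append(tool_result_item(item.call_id, result))
    return resp, followup
```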
MCP (Model Context Protocol) sits a layer above. It is a standard way to expose tools as a server any client can talk to: Claude Desktop, Cursor, ChatGPT, your own app. If you want the same get_invoice tool from three different LLM clients, write it once as an MCP server instead of three separate function-call definitions.
Anthropic's tool use mirrors OpenAI almost field-for-field. Anthropic returns tool_use blocks instead of tool_calls, you reply with tool_result blocks instead of role:tool messages, and the field is input instead of arguments. The mental model is identical, which is why MCP bridges them cleanly.
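The field-for-field mapping, sketched as raw message shapes. Values are illustrative; the structural differences (JSON-string arguments vs. parsed input dict, role:"tool" vs. tool_result block) are the point:

```python
import json

# OpenAI: the assistant message carries tool_calls; you answer with role:"tool".
openai_tool_call = {
    "id": "call_abc",
    "type": "function",
    "function": {"name": "get_invoice", "arguments": '{"invoice_id": "inv_01HZ..."}'},
}
openai_tool_result = {
    "role": "tool",
    "tool_call_id": "call_abc",
    "content": '{"status": "paid"}',
}

# Anthropic: assistant content carries tool_use blocks; you answer with tool_result blocks.
anthropic_tool_use = {
    "type": "tool_use",
    "id": "toolu_abc",
    "name": "get_invoice",
    "input": {"invoice_id": "inv_01HZ..."},  # already a parsed dict, not a JSON string
}
anthropic_tool_result = {
    "role": "user",
    "content": [
        {"type": "tool_result", "tool_use_id": "toolu_abc", "content": '{"status": "paid"}'}
    ],
}
```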
If you are a founder or eng lead reviewing function-calling code that someone else wrote, here are the signals that separate a working agent from a ticking time bomb.
Green flags:
- strict: true on every tool, additionalProperties: false, every param required
- Parallel tool calls run through asyncio.gather or Promise.all, not a serial for-loop

Red flags are the inverses: tool errors raised as exceptions instead of returned as data, no cap on the agent loop, and schemas without strict mode.
This is exactly the kind of code review every Cadence engineer is vetted on. The platform's voice interview specifically scores Cursor / Claude / Copilot fluency, prompt-as-spec discipline, and verification habits, and tool-calling competence shows up across all three. Every engineer on Cadence is AI-native by default; there is no opt-in tier, because there is no version of shipping a 2026 backend without these reflexes. We wrote about how the interview works in our voice-interview hiring deep-dive.
Three concrete steps if you have an existing OpenAI integration:
- Add strict: true, additionalProperties: false, and the full required array to every tool definition, then test that nothing broke. Most code paths will not need changes; the model already obeys your schema, this just makes it guaranteed.

If you are starting from scratch and you want a senior engineer to architect this for you, the fastest path is to skip the recruiter loop entirely and book a senior on Cadence for a week. Senior tier is $1,500 for the week, with a 48-hour free trial, and every senior on the platform has shipped production agents on top of either OpenAI or Anthropic. You can also pull a Build, Buy, or Book recommendation from our /tools/decide tool if you want a sanity check before committing engineering time.
Function calling is one of those APIs where the difference between toy demos and shipped product is about 200 lines of error handling and 5 hours of careful schema design. If you want a Cadence senior to do that work with you, book a 48-hour trial and have a working agent in your repo by Friday.
Use Structured Outputs when you only need typed JSON back from the model with no side effects, like parsing an email into structured fields. Use function calling when the model needs to trigger real actions (database writes, API calls, file I/O) and incorporate the results into its next response. Function calling is a strict superset, but it costs more tokens and adds a loop, so do not reach for it unnecessarily.
Yes. OpenAI shipped support for strict-mode parallel calls in mid-2025, and every model from gpt-5 onward supports the combination natively. Each tool call in the parallel batch independently adheres to its own schema. Fine-tuned models still carry a small caveat: strict guarantees may be relaxed when calls run in parallel, so check the docs for your specific deployment.
Hard-cap the loop at 8 to 12 turns, log every iteration with tool names and timing, and break when the model returns a message with no tool_calls. Always assume the model can return zero, one, or many tool calls per turn, and write the loop accordingly. Add a per-tool timeout (5 to 30 seconds depending on the tool) so a single hung tool cannot freeze the whole agent.
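The per-tool timeout fits naturally around each call with asyncio.wait_for. A sketch; the 10-second default and the stub tool are illustrative:

```python
import asyncio
import json

async def call_with_timeout(fn, args: dict, timeout_s: float = 10.0) -> str:
    """Run one tool call; convert a hang into an error the model can read."""
    try:
        result = await asyncio.wait_for(fn(**args), timeout=timeout_s)
        return json.dumps(result)
    except asyncio.TimeoutError:
        # Error-as-data: the model sees the timeout instead of the agent freezing.
        return json.dumps({"error": "Timeout", "message": f"tool exceeded {timeout_s}s"})

async def slow_tool(**kwargs) -> dict:
    await asyncio.sleep(5)  # stands in for a hung downstream service
    return {"never": "reached"}

content = asyncio.run(call_with_timeout(slow_tool, {}, timeout_s=0.1))
print(content)
```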
Use Chat Completions if you need broad model support and stateless control, especially if you are still on gpt-4o or earlier. Use the Responses API on gpt-5 and above when you want reasoning items preserved across turns and server-managed state. The Responses API is the better default for any new agent you build today; Chat Completions remains the right pick for simple one-shot tool calls on older models.
Conceptually identical. You define a tool with a JSON schema, the model returns a tool_use block (Anthropic) or a tool_calls array (OpenAI), and you respond with a tool_result block (Anthropic) or a role: "tool" message (OpenAI). Field names differ (input vs arguments), but the agent-loop pattern is the same. MCP standardizes both so you can write one tool implementation and call it from either provider.