Building agentic workflows with Claude in production

Building agentic workflows with Claude in production means wiring a tool-use loop around a model, capping its blast radius with guardrails (max iterations, cost ceilings, tool allowlists), inserting human-in-the-loop checkpoints before any irreversible action, and running every change against an eval harness like Langfuse or Braintrust. The model is the easy part. The loop, the brakes, and the evals are where production lives or dies.

Most teams ship a demo in a weekend and spend three months making it not embarrass them on Monday. This post walks through the hardening with Claude 4.7: the patterns that work, the failure modes that bite, and the eval setup you need before any agent touches a paying customer.

What "agentic" actually means in production

An agentic workflow is a loop, not a prompt. The model receives an objective, picks a tool from a defined set, calls it, reads the result, decides whether to call another tool or finish, and repeats until done or stopped. Claude 4.7's tool_use API and the Agent SDK give you this loop out of the box, but the SDK is a starting point, not a production system.

A production agent has six things a demo doesn't:

A bounded loop (max iterations, max wall-clock, max tokens).
A cost cap that kills runs mid-flight if spend crosses a threshold.
A tool allowlist that the agent literally cannot escape.
Human-in-the-loop checkpoints for any action that costs money, mutates production data, or sends external communication.
Structured logging into an eval platform.
A regression test suite that runs on every prompt change.

If your agent is missing any of those, it is a prototype. Ship it to staging only.
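The first item on that list, a loop bounded on iterations, wall-clock time, and tokens, can be sketched as a small budget object. This is a minimal illustration, not an SDK feature: RunBudget, BudgetExceeded, and the default limits are hypothetical names and numbers chosen for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

class BudgetExceeded(Exception):
    """Raised when any of the three limits is crossed."""

@dataclass
class RunBudget:
    # Illustrative defaults only; tune per workflow.
    max_iterations: int = 10
    max_seconds: float = 120.0
    max_tokens: int = 200_000
    clock: Callable[[], float] = time.monotonic  # injectable for tests
    iterations: int = 0
    tokens: int = 0
    started_at: float = field(default=0.0, init=False)

    def __post_init__(self) -> None:
        self.started_at = self.clock()

    def charge(self, tokens_used: int) -> None:
        """Call once per loop iteration, after the model responds."""
        self.iterations += 1
        self.tokens += tokens_used
        if self.iterations > self.max_iterations:
            raise BudgetExceeded("iteration cap")
        if self.clock() - self.started_at > self.max_seconds:
            raise BudgetExceeded("wall-clock cap")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token cap")
```

Calling budget.charge(...) once per turn keeps all three limits in one place, so the loop body cannot forget one of them.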

The tool-use loop, in detail

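The minimal Claude tool-use loop looks like the sketch below. MAX_ITERATIONS, MAX_COST_USD, ALLOWED_TOOLS, run_tool, spend_tracker, and the exception classes are application code you supply, not part of the Anthropic SDK:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_agent(objective):
    messages = [{"role": "user", "content": objective}]
    for step in range(MAX_ITERATIONS):
        response = client.messages.create(
            model="claude-4-7-opus",
            tools=ALLOWED_TOOLS,
            messages=messages,
            max_tokens=4096,
        )
        if response.stop_reason == "end_turn":
            return response
        if response.stop_reason != "tool_use":
            # e.g. "max_tokens": stop loudly instead of looping on a truncated turn
            raise UnexpectedStopReason(response.stop_reason)
        # Echo the assistant turn, then answer each tool_use block with a tool_result.
        messages.append({"role": "assistant", "content": response.content})
        tool_results = [
            {"type": "tool_result", "tool_use_id": block.id,
             "content": run_tool(block.name, block.input, allowlist=ALLOWED_TOOLS)}
            for block in response.content if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": tool_results})
        if spend_tracker.exceeded(MAX_COST_USD):
            raise CostCapExceeded()
    # Falling out of the loop means the iteration cap was hit without a final answer.
    raise IterationCapExceeded(MAX_ITERATIONS)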

The four production-critical lines are the iteration cap, the cost check, the tool allowlist passed to run_tool, and the explicit handling of stop_reason. Teams that skip the allowlist and let the model "discover" tools at runtime get agents that decide to rm -rf node_modules because a hallucinated tool name looked plausible. Patterns from the AI-assisted refactoring playbook 2026 apply here: scope down before you scale up.
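One way to make the allowlist inescapable is to dispatch every call through a fixed registry and refuse any name outside it. The sketch below assumes a hypothetical dispatch_tool helper and two stand-in tools (get_order_status, search_docs); none of these names come from the Anthropic SDK.

```python
from typing import Any, Callable, Dict

class ToolNotAllowed(Exception):
    """Raised when the model asks for a tool outside the allowlist."""

# Hypothetical stand-in tools; real ones would call your services.
def get_order_status(order_id: str) -> str:
    return f"order {order_id}: shipped"

def search_docs(query: str) -> str:
    return f"0 results for {query!r}"

# The allowlist is a fixed mapping built in code, never from model output.
TOOL_REGISTRY: Dict[str, Callable[..., str]] = {
    "get_order_status": get_order_status,
    "search_docs": search_docs,
}

def dispatch_tool(name: str, args: Dict[str, Any],
                  registry: Dict[str, Callable[..., str]] = TOOL_REGISTRY) -> str:
    """Run a tool only if its name is in the registry; otherwise refuse."""
    tool = registry.get(name)
    if tool is None:
        raise ToolNotAllowed(f"tool {name!r} is not in the allowlist")
    return tool(**args)
```

Because the lookup is a dictionary hit rather than dynamic import or shell execution, a hallucinated tool name can only produce an exception, which you can pass back to the model as a tool error.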

Iteration caps

Set MAX_ITERATIONS per task type rather than globally; 8 to 25 covers most workloads. Support agents rarely need more than 5 turns, so a cap of 8 leaves headroom. Code agents (Cursor, Claude Code, Aider) often need 15 to 30, so they sit at or above the top of that range. If your agent hits the cap repeatedly, the problem is upstream: an under-specified objective, the wrong toolset, or a stuck verification loop. Don't raise the cap blindly.

Cost caps

Every iteration runs a token counter. We use a tracker that knows Claude 4.7's price ($15 / 1M input, $75 / 1M output as of May 2026) and aborts with a typed exception when projected spend crosses the ceiling. The ceiling is per-run, not per-day. Per-day is for billing alerts; per-run is for safety. Both belong in your stack.
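A per-run tracker along these lines can be sketched in a few lines. SpendTracker and CostCapExceeded are hypothetical names; the default prices are the ones quoted above. This version checks actual spend so far, so a projected-spend check would add an estimate for the next turn before comparing.

```python
from dataclasses import dataclass

class CostCapExceeded(Exception):
    """Typed exception so callers can tell a cap hit from other failures."""

@dataclass
class SpendTracker:
    # Prices quoted in the text, in USD per 1M tokens; adjust for your model.
    input_usd_per_mtok: float = 15.0
    output_usd_per_mtok: float = 75.0
    spent_usd: float = 0.0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Add one model call's token usage to the running total."""
        self.spent_usd += input_tokens / 1_000_000 * self.input_usd_per_mtok
        self.spent_usd += output_tokens / 1_000_000 * self.output_usd_per_mtok

    def check(self, max_cost_usd: float) -> None:
        """Abort the run once spend has crossed the per-run ceiling."""
        if self.spent_usd > max_cost_usd:
            raise CostCapExceeded(
                f"spent ${self.spent_usd:.4f}, cap ${max_cost_usd:.2f}"
            )
```

As a worked example, a turn with 100,000 input tokens and 10,000 output tokens costs 0.1 × $15 + 0.01 × $75 = $2.25 at these prices, which would already trip a $1 per-run cap.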

Guardrails that actually matter

Guardrails are the difference between a Claude-powered SaaS and a Claude-powered breach notification. The five we ship by default:

Guardrail	What it does	Failure mode without it
Tool allowlist	Agent can only call pre-approved functions	Hallucinated tool calls hit your filesystem
Max iterations	Loop dies after N turns	Infinite reasoning, $400 in tokens, no output
Cost cap	Run aborts at threshold	One bad prompt drains your monthly budget
Output validator	Structured output is schema-checked before use	Malformed JSON crashes the next service
Action approval	Mutating actions pause for human OK	Agent emails 12,000 customers a 500 error

The output validator is the one most teams skip. Claude 4.7 follows JSON schemas reliably, but "reliably" is not "always." Pipe every structured response through pydantic or zod before the next step consumes it. If validation fails, append the error to the messages array and let the model retry. Roughly 1 in 200 outputs fail on first generation and almost all self-correct in one retry.
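The validate-then-retry step might look like the sketch below. To stay dependency-free it uses a hand-rolled check where a pydantic or zod schema would normally sit. The ticket fields (category, draft_reply), the generate callable, and both function names are assumptions made for illustration.

```python
import json
from typing import Callable, Dict, List

Message = Dict[str, str]

def validate_ticket(raw: str) -> dict:
    """Hand-rolled check standing in for a pydantic or zod schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if data.get("category") not in {"billing", "bug", "other"}:
        raise ValueError("category must be one of: billing, bug, other")
    if not isinstance(data.get("draft_reply"), str):
        raise ValueError("draft_reply must be a string")
    return data

def generate_validated(generate: Callable[[List[Message]], str],
                       messages: List[Message], max_retries: int = 1) -> dict:
    """Call the model, validate, and feed validation errors back for a retry."""
    for attempt in range(max_retries + 1):
        raw = generate(messages)
        try:
            return validate_ticket(raw)
        except ValueError as err:
            if attempt == max_retries:
                raise  # out of retries: surface the failure to the caller
            # Show the model its own output and the exact error, then try again.
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"Validation failed: {err}. Return corrected JSON only.",
            })
    raise AssertionError("unreachable when max_retries >= 0")
```

Keeping the retry count at one matches the observation that almost all failures self-correct on the first retry; anything that fails twice is better treated as an error than retried indefinitely.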

Human-in-the-loop checkpoints

The rule we give engineers building on Cadence: any action the agent takes that costs money, sends a message a human will read, or mutates a production database requires explicit approval. Read actions can run autonomously. Write actions cannot.

Implementation is unglamorous. Each tool gets a requires_approval: bool flag. When the agent picks a flagged tool, the loop pauses, the proposed call renders in a UI or Slack message, and the run resumes only after a thumbs-up. We use a pending_approval state in persistence so runs can resume hours later.

This is not "for now, we will automate later." It is the steady state for high-stakes agents. Claude Code asks before running shell commands by default. That is the design, not a bug.
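The pause-and-resume mechanics can be sketched with two small dataclasses. The requires_approval flag and pending_approval state are the ones described above; Tool, Run, request_tool, and resolve_approval are hypothetical names, and a real system would write the pending call to a database instead of holding it in memory.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional

@dataclass
class Tool:
    name: str
    fn: Callable[..., Any]
    requires_approval: bool = False

@dataclass
class Run:
    state: str = "running"  # "running" or "pending_approval"
    pending_call: Optional[Dict[str, Any]] = None
    results: list = field(default_factory=list)

def request_tool(run: Run, tool: Tool, args: Dict[str, Any]) -> None:
    """Execute unflagged tools immediately; park flagged tools for a human."""
    if tool.requires_approval:
        run.state = "pending_approval"
        run.pending_call = {"tool": tool, "args": args}  # persist this in practice
        return
    run.results.append(tool.fn(**args))

def resolve_approval(run: Run, approved: bool) -> None:
    """Resume a parked run after the reviewer's decision."""
    if run.state != "pending_approval" or run.pending_call is None:
        raise RuntimeError("no call is awaiting approval")
    call, run.pending_call = run.pending_call, None
    run.state = "running"
    if approved:
        run.results.append(call["tool"].fn(**call["args"]))
    else:
        run.results.append("REJECTED by reviewer")
```

A rejection is recorded as a result rather than raised, so the agent can see that the action was declined and choose a different path.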

Claude 4.7 strengths and weaknesses

Knowing the model's edges lets you pick where to deploy it. After about 14 months of shipping Claude-powered agents (the 3.5 → 3.7 → 4 → 4.5 → 4.7 progression), the honest read:

Where Claude 4.7 shines:

Long multi-step plans (handles 20+ sequential tool calls without drift)
Code generation and modification (especially with the Agent SDK + Claude Code)
Following complex output schemas (JSON, XML, custom DSLs)
Honest abstention ("I don't know" beats hallucination most of the time)
Tool selection from a mid-sized allowlist (10-30 tools)

Where it still gets you:

Tool allowlists over ~50 functions degrade selection accuracy. Split into sub-agents.
Long-context retrieval still misses needles ~5-8% of the time past 150k tokens.
Multi-modal reasoning on small UI screenshots remains weaker than on documents.
The model will sometimes try to be "helpful" and skip a verification step you explicitly asked for. Wrap critical checks in tool calls, not natural-language instructions.
Cost spikes on agentic loops: a single 30-turn loop with full conversation context can hit $4-$8.

The cost edge is the one most teams hit first. The fix is context pruning between turns: summarize old tool outputs into a one-line note once they are no longer needed for the current step. This is the same pattern as long-running autonomous coding agents in 2026: summarize, snapshot, prune.
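A rough sketch of pruning between turns follows. It assumes a simplified message shape with a hypothetical "kind" key marking tool results, and it uses first-line truncation as a cheap stand-in for a real summary, which would more likely come from a smaller model.

```python
from typing import Dict, List

Message = Dict[str, str]

def one_line_note(text: str, limit: int = 80) -> str:
    """Cheap stand-in for a summary: the first line, truncated."""
    lines = text.strip().splitlines()
    first = lines[0] if lines else ""
    return first[:limit]

def prune_tool_outputs(messages: List[Message], keep_last: int = 2) -> List[Message]:
    """Return a copy where all but the newest `keep_last` tool outputs are one-line notes."""
    tool_idx = [i for i, m in enumerate(messages) if m.get("kind") == "tool_result"]
    stale = set(tool_idx[:-keep_last]) if keep_last > 0 else set(tool_idx)
    pruned = []
    for i, m in enumerate(messages):
        if i in stale:
            # Replace the bulky payload but keep the message in place.
            m = {**m, "content": "[pruned] " + one_line_note(m["content"])}
        pruned.append(m)
    return pruned
```

The function returns a new list and leaves the original untouched, so the full transcript can still go to your trace store while the pruned copy goes to the model.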

Error recovery patterns

Production agents need to recover from three kinds of failure:

Tool failures. The tool returned a 500, or timed out, or returned a malformed payload. The agent should see the error verbatim and decide whether to retry, switch tools, or surface a partial result. Don't swallow tool errors in your wrapper; pass them back into the message thread. Claude 4.7 is good at adapting when it sees the actual stack trace.

Model failures. The API returned a 529 (overloaded) or a 5xx. Wrap every model call in exponential backoff with jitter, 3-5 retries, then fall back to a smaller model (Sonnet, Haiku) or surface the failure to the user. We default to a 250ms / 1s / 4s / 10s schedule.

Logic failures. The agent looped 8 times "preparing" without producing output. This is where the iteration cap saves you. When the cap is hit, snapshot the conversation, write a structured "agent-stuck" event to your eval platform, and notify the on-call engineer for the workflow.

The asymmetry to remember: tool failures are common and recoverable. Logic failures are rare and almost never recoverable inside the loop. Different handlers, different alert thresholds.
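For the model-failure case, the retry-then-fall-back handler can be sketched as below, using the 250ms / 1s / 4s / 10s schedule as the default. TransientModelError is a hypothetical stand-in for whatever exception your client raises on a 529 or 5xx, and primary and fallback are placeholder callables for the main and smaller-model calls.

```python
import random
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

class TransientModelError(Exception):
    """Hypothetical stand-in for an overloaded (529) or 5xx API error."""

def call_with_backoff(primary: Callable[[], T], fallback: Callable[[], T],
                      delays: Sequence[float] = (0.25, 1.0, 4.0, 10.0),
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Try `primary`, retrying after each delay; use `fallback` if every attempt fails."""
    for delay in delays:
        try:
            return primary()
        except TransientModelError:
            # Jitter: sleep a random amount up to the scheduled delay.
            sleep(random.uniform(0, delay))
    try:
        return primary()  # final attempt after the last delay
    except TransientModelError:
        return fallback()
```

Only the transient error type is caught, so bugs and validation failures still propagate immediately instead of being retried.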

Eval setup with Langfuse or Braintrust

You cannot ship a Claude agent without evals and call it production. The minimum viable eval stack:

Trace every run to Langfuse, Braintrust, Helicone, or Phoenix. We use Langfuse for OSS-friendly teams and Braintrust for teams that already use a hosted observability stack. Both capture the full message thread, tool calls, token spend, and latency per step.
Build a golden set of 30-100 reference inputs with expected behaviors (not necessarily expected outputs, since agents have many valid paths). For a support agent, golden set entries look like "user asks about refund, agent must call get_order_status, must not call issue_refund without approval, must end with a refund-policy citation."
Run the golden set on every prompt change. This is the CI step. Diffs in pass rate trigger a manual review before merge.
Sample-and-grade production runs. Pull 1-2% of live runs daily into the eval UI. Have a human grade them on 3 axes: did the agent achieve the objective, did it use the right tools, was the final output safe to surface. Trends in those scores are your real signal.
Regression-watch tool-call distributions. If your agent suddenly starts calling send_email 4x more than yesterday, something changed (in your prompt, your tool descriptions, your input mix, or the model's behavior on a silent point update). Alert on distribution drift, not just on errors.
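A golden-set entry that asserts on behavior, not exact output, can be expressed as data plus one checker. This is a sketch under assumptions: GoldenCase and check_case are hypothetical names, the trace is reduced to a list of tool names, and approval state is not modeled, so the forbidden set stands for calls that must not run unapproved.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class GoldenCase:
    name: str
    must_call: Set[str] = field(default_factory=set)
    must_not_call: Set[str] = field(default_factory=set)
    final_must_contain: str = ""

def check_case(case: GoldenCase, tool_calls: List[str], final_output: str) -> List[str]:
    """Return a list of violations; an empty list means the run passed."""
    called = set(tool_calls)
    problems = [f"missing call: {t}" for t in sorted(case.must_call - called)]
    problems += [f"forbidden call: {t}" for t in sorted(case.must_not_call & called)]
    if case.final_must_contain and case.final_must_contain not in final_output:
        problems.append(f"final output lacks {case.final_must_contain!r}")
    return problems
```

The refund example from the list becomes GoldenCase("refund", {"get_order_status"}, {"issue_refund"}, "refund policy"), and the CI step is a loop over cases that fails the build when the pass rate drops.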

Braintrust's hill-climbing eval workflow (try a variant, score it, keep if better) is a forcing function we recommend for prompt iteration. Langfuse's session-level traces are unbeatable for debugging multi-turn weirdness. Pick one. The wrong one is the one you don't actually use.
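Whichever platform you pick, the tool-call drift alert from the list above is simple enough to sketch outside it. The version below compares each tool's share of calls against a baseline window; the function names, the 4x ratio default, and the 1% floor are illustrative assumptions, and comparing shares rather than raw counts is a design choice that ignores overall traffic changes.

```python
from collections import Counter
from typing import Dict, Iterable, List

def call_shares(tool_calls: Iterable[str]) -> Dict[str, float]:
    """Fraction of all calls that went to each tool."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}

def drifted_tools(baseline: Iterable[str], today: Iterable[str],
                  ratio: float = 4.0, min_share: float = 0.01) -> List[str]:
    """Tools whose share of calls grew by at least `ratio` versus the baseline."""
    base, now = call_shares(baseline), call_shares(today)
    flagged = []
    for name, share in now.items():
        if share < min_share:
            continue  # ignore tools that are still negligible
        previous = base.get(name, 0.0)
        # A tool absent from the baseline is always flagged.
        if previous == 0.0 or share / previous >= ratio:
            flagged.append(name)
    return sorted(flagged)
```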

Cost monitoring

Per-run cost caps are safety. Per-feature cost dashboards are economics. Tag every agent run with the customer, the feature, and the workflow version, then aggregate. We watch four metrics:

p50 and p95 cost per run (drift means the agent is taking longer paths)
Cost per successful outcome (the real unit-economics number)
Token efficiency: useful output tokens / total tokens (catches verbose model behavior)
Tool-call efficiency: successful tool calls / total tool calls (catches retry storms)

A reasonable target for a customer-facing agent in 2026 is $0.05-$0.30 per successful interaction. Internal-tool agents (code review, data analysis) tolerate $1-$5 per run because the human-time savings dwarf the API spend. If your numbers are 10x higher, the agent is either over-scoped, looping unnecessarily, or carrying too much context per turn.
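Three of the four metrics fall out of a simple aggregation over tagged run records. The sketch below assumes a hypothetical RunRecord shape and needs at least two runs; token efficiency is left out because "useful output tokens" needs a workflow-specific definition.

```python
import statistics
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class RunRecord:
    cost_usd: float
    succeeded: bool
    tool_calls: int
    failed_tool_calls: int

def cost_metrics(runs: List[RunRecord]) -> Dict[str, float]:
    """Aggregate per-run records into dashboard numbers (needs >= 2 runs)."""
    costs = [r.cost_usd for r in runs]
    successes = sum(r.succeeded for r in runs)
    total_calls = sum(r.tool_calls for r in runs)
    ok_calls = total_calls - sum(r.failed_tool_calls for r in runs)
    # 99 cut points: index 49 is the 50th percentile, index 94 the 95th.
    cuts = statistics.quantiles(costs, n=100, method="inclusive")
    return {
        "p50_cost": cuts[49],
        "p95_cost": cuts[94],
        "cost_per_success": sum(costs) / successes if successes else float("inf"),
        "tool_call_efficiency": ok_calls / total_calls if total_calls else 1.0,
    }
```

Dividing total spend by successes, not by runs, is what makes failed runs show up in the number: four runs costing $1.00 in total with two successes is $0.50 per successful outcome, not $0.25 per run.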

What to do this week

If you are at zero and you want to be in production in 30 days:

Pick one narrow workflow with clear success criteria. Not "an assistant," something like "categorize incoming support tickets and propose a draft reply."
Wire the minimum loop: 5 tools max, 10 iterations max, $1 per-run cap, output schema validation, human approval on the draft-reply step.
Stand up Langfuse or Braintrust. Trace 100% of runs.
Build a 30-input golden set in a spreadsheet. Run it daily.
Ship to internal users first. Sample-grade for two weeks before any external traffic.
Add cost dashboards before week 4. Set the per-feature alert at 2x expected p95.

If you do not have an engineer who has shipped one of these before, you will spend 6 weeks rediscovering patterns that already exist. The fastest path is to bring in someone who has done it. Every engineer on Cadence is AI-native by default (vetted on Cursor / Claude / Copilot fluency, prompt-as-spec discipline, and multi-step ladder thinking via a voice interview before they unlock bookings), so a senior at $1,500/week or lead at $2,000/week can land your agent loop, guardrails, and eval scaffolding inside a 2-week trial. If you want to feel the working style before committing, the 48-hour free trial is the fastest way; book a senior, hand them the workflow spec, and see what the first commit looks like.

Pair this post with our companion guide on Claude MCP servers explained when you are ready to expose your tools to multiple agents (Claude Desktop, Cursor, your own runtime) through a single protocol. MCP is the right substrate when your tool surface grows past one app.

For deeper context on the economic side of agents, AI-native engineering ROI: 2026 numbers covers what successful teams actually report and where the math falls apart.

Comparison: agent runtimes in 2026

Runtime	Best for	Pros	Cons
Claude Agent SDK	Greenfield production agents	First-party, tight tool-use loop, MCP support	Newer; smaller community than LangChain
LangGraph	Complex graph-based workflows	Mature, explicit state machine, multi-model	Heavier; more abstraction to learn
CrewAI	Multi-agent orchestration	Good for role-based teams of agents	Higher token spend; harder to debug
Vercel AI SDK	Web app integrations	Streaming, React hooks, edge-friendly	Less suited to long-running server-side loops
OpenAI Assistants	Quick OpenAI-native agents	Hosted threads + retrieval	Vendor lock; Claude users need a separate stack

If you are starting fresh with Claude in May 2026, the Agent SDK plus Langfuse plus your own thin guardrail wrapper is the most boring, most reliable combination. Boring is the goal.

Want this stack standing up by Friday? Book a senior or lead engineer on Cadence and get a guardrailed Claude agent in staging within the 48-hour trial window. Weekly billing, replace any week, no notice period.

FAQ

What is an agentic workflow?

An agentic workflow is an LLM driving a loop: pick a tool, call it, read the result, decide the next step, repeat until done or stopped. The agent has autonomy over which tools to call and in what order, within a bounded set you define.

How is Claude 4.7 different from earlier versions for agents?

Claude 4.7 holds plans across 20+ tool calls without drifting, follows output schemas more reliably than 4.5, and shows clearer "I don't know" behavior instead of hallucinating tool calls. The trade-off is higher per-token cost ($15/$75 per 1M), so context pruning matters more than it did with Sonnet-class models.

Do I need MCP to build production agents?

No. MCP (Model Context Protocol) is useful when you want the same tool surface available to multiple agent runtimes (Claude Desktop, Cursor, your own service). For a single agent in a single app, direct tool definitions in the Claude API are simpler. Reach for MCP when your tool count grows or when more than one client needs access.

What is the biggest production failure mode?

Missing iteration and cost caps. A loop without a hard ceiling will, given enough time, find a way to burn $200 in tokens producing no output. The second-biggest is tool allowlist drift: someone adds a new tool, forgets to mark it requires_approval, and an agent sends 4,000 unintended emails. Treat both as Sev-1 design errors, not nice-to-haves.

How much should a working Claude agent cost per interaction?

For customer-facing agents in 2026, target $0.05 to $0.30 per successful interaction. Internal-tool agents tolerate $1 to $5 because the time savings are worth it. If you are 10x higher, your context is bloated or your loop is retrying too aggressively. Prune between turns and tighten your tool selection.

Harsh Shuddhalwar

Fullstack Developer

Fullstack developer at withRemote. Ships across the stack — TypeScript, Node, Postgres, Vercel. Writes on shipping speed and pragmatic architecture.
