
Building agentic workflows with Claude in production means wiring a tool-use loop around a model, capping its blast radius with guardrails (max iterations, cost ceilings, tool allowlists), inserting human-in-the-loop checkpoints before any irreversible action, and running every change against an eval harness like Langfuse or Braintrust. The model is the easy part. The loop, the brakes, and the evals are where production lives or dies.
Most teams ship a demo in a weekend and spend three months making it not embarrass them on Monday. This post walks through the hardening with Claude 4.7: the patterns that work, the failure modes that bite, and the eval setup you need before any agent touches a paying customer.
An agentic workflow is a loop, not a prompt. The model receives an objective, picks a tool from a defined set, calls it, reads the result, decides whether to call another tool or finish, and repeats until done or stopped. Claude 4.7's tool_use API and the Agent SDK give you this loop out of the box, but the SDK is a starting point, not a production system.
A production agent has six things a demo doesn't:
If your agent is missing any of those, it is a prototype. Ship it to staging only.
The minimal Claude tool-use loop looks like this pseudocode:
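
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def run_agent(objective):
        messages = [{"role": "user", "content": objective}]
        for step in range(MAX_ITERATIONS):
            response = client.messages.create(
                model="claude-4-7-opus",
                tools=ALLOWED_TOOLS,
                messages=messages,
                max_tokens=4096,
            )
            if response.stop_reason == "end_turn":
                return response
            if response.stop_reason == "tool_use":
                # run_tool must return tool_result blocks, one per tool_use block
                tool_result = run_tool(response.content, allowlist=ALLOWED_TOOLS)
                messages.append({"role": "assistant", "content": response.content})
                messages.append({"role": "user", "content": tool_result})
            # spend_tracker is assumed to be updated from response.usage elsewhere
            if spend_tracker.exceeded(MAX_COST_USD):
                raise CostCapExceeded()
        # Hitting the cap is a failure, not a silent return of None
        raise IterationCapExceeded()
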
The four production-critical lines are the iteration cap, the cost check, the tool allowlist passed to run_tool, and the explicit handling of stop_reason. Teams that skip the allowlist and let the model "discover" tools at runtime get agents that decide to rm -rf node_modules because a hallucinated tool name looked plausible. Patterns from the AI-assisted refactoring playbook 2026 apply here: scope down before you scale up.
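To make the allowlist concrete, here is a minimal sketch of what a `run_tool` helper could look like. It is not the implementation behind the loop above: it assumes content blocks arrive as plain dicts (the SDK returns objects), that the allowlist is a set of tool names, and that `TOOL_REGISTRY` and its single tool are hypothetical. The `tool_result` block shape follows the Messages API.

```python
from typing import Any, Callable

# Hypothetical registry: tool name -> plain Python callable.
TOOL_REGISTRY: dict[str, Callable[..., Any]] = {
    "get_order_status": lambda order_id: {"order_id": order_id, "status": "shipped"},
}


def run_tool(content_blocks: list[dict], allowlist: set[str]) -> list[dict]:
    """Execute each tool_use block and return one tool_result block per call."""
    results = []
    for block in content_blocks:
        if block.get("type") != "tool_use":
            continue  # text blocks carry no tool call
        name = block["name"]
        if name not in allowlist or name not in TOOL_REGISTRY:
            # Refuse instead of guessing; the model sees the refusal and can adapt.
            results.append({
                "type": "tool_result",
                "tool_use_id": block["id"],
                "content": f"Tool '{name}' is not on the allowlist.",
                "is_error": True,
            })
            continue
        output = TOOL_REGISTRY[name](**block["input"])
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": str(output),
        })
    return results
```

A hallucinated tool name never reaches a callable here. It comes back to the model as an error result, which keeps the message thread valid and gives the model a chance to pick a real tool.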
Set MAX_ITERATIONS per task type rather than globally. A cap between 8 and 25 covers most workloads: support agents rarely need more than 5 turns, so 8 leaves headroom, while code agents (Cursor, Claude Code, Aider) often need 15 to 30 and belong at the top of that range or slightly above it. If your agent hits the cap repeatedly, the problem is upstream: an under-specified objective, the wrong toolset, or a stuck verification loop. Don't raise the cap blindly.
Every iteration runs a token counter. We use a tracker that knows Claude 4.7's price ($15 / 1M input, $75 / 1M output as of May 2026) and aborts with a typed exception when projected spend crosses the ceiling. The ceiling is per-run, not per-day. Per-day is for billing alerts; per-run is for safety. Both belong in your stack.
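A per-run tracker along these lines is one way to do it. This is a sketch, not our production tracker: the class and method names are hypothetical, and the prices are the ones quoted above. The token counts would come from `response.usage.input_tokens` and `response.usage.output_tokens` on each API response.

```python
from dataclasses import dataclass


class CostCapExceeded(Exception):
    """Raised when a single run's spend crosses its ceiling."""


@dataclass
class SpendTracker:
    # Prices per million tokens, as quoted in the text.
    input_usd_per_mtok: float = 15.0
    output_usd_per_mtok: float = 75.0
    spent_usd: float = 0.0

    def record(self, input_tokens: int, output_tokens: int) -> float:
        """Add one API call's usage and return the running total in USD."""
        self.spent_usd += (
            input_tokens * self.input_usd_per_mtok
            + output_tokens * self.output_usd_per_mtok
        ) / 1_000_000
        return self.spent_usd

    def exceeded(self, max_cost_usd: float) -> bool:
        return self.spent_usd >= max_cost_usd

    def check(self, max_cost_usd: float) -> None:
        """Abort the run with a typed exception once the ceiling is crossed."""
        if self.exceeded(max_cost_usd):
            raise CostCapExceeded(
                f"run spent ${self.spent_usd:.2f}, cap is ${max_cost_usd:.2f}"
            )
```

Create one tracker per run so the ceiling stays per-run. At these prices, a call with 100,000 input tokens and 10,000 output tokens costs $1.50 + $0.75 = $2.25.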
Guardrails are the difference between a Claude-powered SaaS and a Claude-powered breach notification. The five we ship by default:
| Guardrail | What it does | Failure mode without it |
|---|---|---|
| Tool allowlist | Agent can only call pre-approved functions | Hallucinated tool calls hit your filesystem |
| Max iterations | Loop dies after N turns | Infinite reasoning, $400 in tokens, no output |
| Cost cap | Run aborts at threshold | One bad prompt drains your monthly budget |
| Output validator | Structured output is schema-checked before use | Malformed JSON crashes the next service |
| Action approval | Mutating actions pause for human OK | Agent emails 12,000 customers a 500 error |
The output validator is the one most teams skip. Claude 4.7 follows JSON schemas reliably, but "reliably" is not "always." Pipe every structured response through pydantic or zod before the next step consumes it. If validation fails, append the error to the messages array and let the model retry. Roughly 1 in 200 outputs fail on first generation and almost all self-correct in one retry.
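The validate-and-retry step can be sketched as follows. To stay self-contained, this version uses a hand-rolled check with the standard library in place of pydantic or zod. The schema, `call_model`, and the retry wording are all hypothetical; `call_model` stands in for whatever function sends the messages to the model and returns its raw text.

```python
import json
from typing import Callable

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"order_id": str, "reason": str}


def validate(raw: str) -> dict:
    """Parse raw model output and check field presence and types."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected):
            raise ValueError(f"{name} must be {expected.__name__}")
    return data


def generate_validated(
    call_model: Callable[[list[dict]], str],
    messages: list[dict],
    max_retries: int = 1,
) -> dict:
    """Call the model, validate, and feed validation errors back on failure."""
    for _ in range(max_retries + 1):
        raw = call_model(messages)
        try:
            return validate(raw)
        except ValueError as err:
            # Append the failed output and the error so the model can self-correct.
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"Validation failed: {err}. Reply with corrected JSON only.",
            })
    raise ValueError("output still invalid after retries")
```

Keeping `max_retries` small is deliberate: if most failures self-correct in one retry, a second or third failure is more likely a prompt or schema problem than bad luck.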
The rule we give engineers building on Cadence: any action the agent takes that costs money, sends a message a human will read, or mutates a production database requires explicit approval. Read actions can run autonomously. Write actions cannot.
Implementation is unglamorous. Each tool gets a requires_approval: bool flag. When the agent picks a flagged tool, the loop pauses, the proposed call renders in a UI or Slack message, and the run resumes only after a thumbs-up. We use a pending_approval state in persistence so runs can resume hours later.
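A stripped-down version of that gate might look like the sketch below. The `Tool`, `Run`, `dispatch`, and `approve` names are hypothetical, and persistence, the UI, and the Slack message are left out; only the `requires_approval` flag and the `pending_approval` state come from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Optional


class RunState(str, Enum):
    RUNNING = "running"
    PENDING_APPROVAL = "pending_approval"


@dataclass
class Tool:
    name: str
    fn: Callable[..., Any]
    requires_approval: bool = False


@dataclass
class Run:
    state: RunState = RunState.RUNNING
    # Stored with the run so it can resume hours later.
    pending_call: Optional[dict] = None


def dispatch(run: Run, tool: Tool, args: dict) -> Any:
    """Run read-only tools immediately; park flagged tools until approved."""
    if tool.requires_approval:
        run.state = RunState.PENDING_APPROVAL
        run.pending_call = {"tool": tool.name, "args": args}
        return None  # nothing executes until a human approves
    return tool.fn(**args)


def approve(run: Run, tools: dict[str, Tool]) -> Any:
    """Execute the parked call after a human thumbs-up and resume the run."""
    if run.state is not RunState.PENDING_APPROVAL or run.pending_call is None:
        raise RuntimeError("no call awaiting approval")
    call = run.pending_call
    run.state, run.pending_call = RunState.RUNNING, None
    return tools[call["tool"]].fn(**call["args"])
```

Storing the proposed call as plain data, not as a closure, is what lets the run be serialized and picked up by a different process after the approval arrives.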
This is not "for now, we will automate later." It is the steady state for high-stakes agents. Claude Code asks before running shell commands by default. That is the design, not a bug.
Knowing the model's edges lets you pick where to deploy it. After about 14 months of shipping Claude-powered agents (the 3.5 → 3.7 → 4 → 4.5 → 4.7 progression), the honest read:
Where Claude 4.7 shines:
Where it still gets you:
The cost edge is the one most teams hit first. The fix is context pruning between turns: summarize old tool outputs into a one-line note once they are no longer needed for the current step. This is the same pattern as long-running autonomous coding agents in 2026: summarize, snapshot, prune.
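Pruning between turns can be sketched like this. It assumes messages are plain dicts with `tool_result` blocks in the Messages API shape, and it uses a hypothetical `one_line_note` helper that just keeps the first line of the output. In practice a model-written summary could replace that helper.

```python
def one_line_note(text: str, limit: int = 80) -> str:
    """Stand-in summarizer: first line of the output, truncated to `limit` chars."""
    stripped = text.strip()
    first = stripped.splitlines()[0] if stripped else ""
    return first if len(first) <= limit else first[: limit - 3] + "..."


def prune_tool_results(messages: list[dict], keep_last: int = 1) -> list[dict]:
    """Return a copy where all but the newest `keep_last` tool results are shortened."""
    positions = [
        (i, j)
        for i, msg in enumerate(messages)
        if isinstance(msg["content"], list)
        for j, block in enumerate(msg["content"])
        if block.get("type") == "tool_result"
    ]
    # Guard keep_last == 0: positions[:-0] would be empty, not the full list.
    to_prune = set(positions[:-keep_last] if keep_last else positions)

    pruned = []
    for i, msg in enumerate(messages):
        if not isinstance(msg["content"], list):
            pruned.append(msg)  # plain-text messages pass through unchanged
            continue
        blocks = [
            {**block, "content": "[pruned] " + one_line_note(str(block["content"]))}
            if (i, j) in to_prune
            else block
            for j, block in enumerate(msg["content"])
        ]
        pruned.append({**msg, "content": blocks})
    return pruned
```

The `tool_use_id` on each block is preserved, so every earlier tool call still has a matching result and the thread stays valid. The function returns a new list, which leaves the full transcript available for tracing.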
Production agents need to recover from three kinds of failure:
Tool failures. The tool returned a 500, or timed out, or returned a malformed payload. The agent should see the error verbatim and decide whether to retry, switch tools, or surface a partial result. Don't swallow tool errors in your wrapper; pass them back into the message thread. Claude 4.7 is good at adapting when it sees the actual stack trace.
Model failures. The API returned a 529 (overloaded) or a 5xx. Wrap every model call in exponential backoff with jitter, 3-5 retries, then fall back to a smaller model (Sonnet, Haiku) or surface the failure to the user. We default to a 250ms / 1s / 4s / 10s schedule.
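The retry-then-fall-back behavior can be sketched as below, using the schedule quoted above. `TransientModelError` is a hypothetical stand-in for whatever your client raises on a 529 or 5xx, `call` is any function that takes a model name, and the jitter range (up to half the base delay) is an assumption.

```python
import random
import time
from typing import Any, Callable, Sequence

BACKOFF_SCHEDULE_S = (0.25, 1.0, 4.0, 10.0)  # 250ms / 1s / 4s / 10s


class TransientModelError(Exception):
    """Hypothetical stand-in for a 529 (overloaded) or 5xx from the model API."""


def call_with_backoff(
    call: Callable[[str], Any],
    models: Sequence[str],
    schedule: Sequence[float] = BACKOFF_SCHEDULE_S,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Try each model in order: one attempt plus one retry per schedule entry."""
    last_error = None
    for model in models:
        # The trailing None marks the final attempt, after which we move on.
        for delay in (*schedule, None):
            try:
                return call(model)
            except TransientModelError as err:
                last_error = err
                if delay is None:
                    break  # retries exhausted for this model; try the next one
                sleep(delay + random.uniform(0, delay / 2))  # backoff plus jitter
    raise RuntimeError("all models failed") from last_error
```

Passing `sleep` as a parameter keeps the function testable without real waiting. The final `RuntimeError` is the point where the failure gets surfaced to the user.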
Logic failures. The agent looped 8 times "preparing" without producing output. This is where the iteration cap saves you. On hit, snapshot the conversation, write a structured "agent-stuck" event to your eval platform, and notify the on-call engineer for the workflow.
The asymmetry to remember: tool failures are common and recoverable. Logic failures are rare and almost never recoverable inside the loop. Different handlers, different alert thresholds.
You cannot ship a Claude agent without evals and call it production. The minimum viable eval stack:
Trace every run to Langfuse, Braintrust, Helicone, or Phoenix. We use Langfuse for OSS-friendly teams and Braintrust for teams that already use a hosted observability stack. Both capture the full message thread, tool calls, token spend, and latency per step.
Build a golden set of 30-100 reference inputs with expected behaviors (not necessarily expected outputs, since agents have many valid paths). For a support agent, golden set entries look like "user asks about refund, agent must call get_order_status, must not call issue_refund without approval, must end with a refund-policy citation."
Run the golden set on every prompt change. This is the CI step. Diffs in pass rate trigger a manual review before merge.
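A golden-set check that grades behaviors instead of exact outputs could be sketched like this. The `GoldenCase` fields and function names are hypothetical, and "must not call issue_refund without approval" is simplified here to "must not call issue_refund at all in an unapproved run".

```python
from dataclasses import dataclass, field


@dataclass
class GoldenCase:
    name: str
    must_call: set[str] = field(default_factory=set)
    must_not_call: set[str] = field(default_factory=set)
    final_must_contain: str = ""  # empty string means no check on the final text


def check_case(case: GoldenCase, tool_calls: list[str], final_output: str) -> list[str]:
    """Return a list of violations; an empty list means the case passed."""
    called = set(tool_calls)
    failures = [f"missing call: {t}" for t in sorted(case.must_call - called)]
    failures += [f"forbidden call: {t}" for t in sorted(case.must_not_call & called)]
    if case.final_must_contain.lower() not in final_output.lower():
        failures.append(f"final output lacks: {case.final_must_contain}")
    return failures


def pass_rate(results: list[list[str]]) -> float:
    """Fraction of cases with no violations; 0.0 for an empty result set."""
    if not results:
        return 0.0
    return sum(1 for failures in results if not failures) / len(results)
```

In CI, the pass rate on the branch would be compared against the pass rate on main, and the per-case violation lists give the reviewer something specific to read when the number drops.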
Sample-and-grade production runs. Pull 1-2% of live runs daily into the eval UI. Have a human grade them on 3 axes: did the agent achieve the objective, did it use the right tools, was the final output safe to surface. Trends in those scores are your real signal.
Regression-watch tool-call distributions. If your agent suddenly starts calling send_email 4x more than yesterday, something changed (in your prompt, your tool descriptions, your input mix, or the model's behavior on a silent point update). Alert on distribution drift, not just on errors.
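One simple way to alert on distribution drift is to compare per-run call rates day over day. The sketch below is an assumption-heavy illustration: the function names, the 4x threshold, and the `min_rate` floor (which avoids dividing by zero for tools that are new today) are all hypothetical choices.

```python
from collections import Counter


def call_rates(tool_calls: list[str], runs: int) -> dict[str, float]:
    """Average number of calls per run for each tool."""
    counts = Counter(tool_calls)
    return {tool: n / runs for tool, n in counts.items()}


def drifted_tools(
    baseline: dict[str, float],
    current: dict[str, float],
    ratio: float = 4.0,
    min_rate: float = 0.01,
) -> dict[str, float]:
    """Return tools whose call rate grew by at least `ratio` over the baseline."""
    alerts = {}
    for tool, rate in current.items():
        # Floor the baseline so a tool unseen yesterday does not divide by zero.
        base = max(baseline.get(tool, 0.0), min_rate)
        if rate / base >= ratio:
            alerts[tool] = rate / base
    return alerts
```

Normalizing by run count matters: a traffic spike raises raw call counts for every tool, while a behavior change raises the rate for one.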
Braintrust's hill-climbing eval workflow (try a variant, score it, keep if better) is a forcing function we recommend for prompt iteration. Langfuse's session-level traces are unbeatable for debugging multi-turn weirdness. Pick one. The wrong one is the one you don't actually use.
Per-run cost caps are safety. Per-feature cost dashboards are economics. Tag every agent run with the customer, the feature, and the workflow version, then aggregate. We watch four metrics:
A reasonable target for a customer-facing agent in 2026 is $0.05-$0.30 per successful interaction. Internal-tool agents (code review, data analysis) tolerate $1-$5 per run because the human-time savings dwarf the API spend. If your numbers are 10x higher, the agent is either over-scoped, looping unnecessarily, or carrying too much context per turn.
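Cost per successful interaction falls out of the run tags directly. The sketch below assumes each run is logged as a dict with hypothetical keys `feature`, `workflow_version`, `cost_usd`, and `succeeded`. Failed runs count toward spend but not toward successes, which is what makes the metric honest.

```python
from collections import defaultdict


def cost_per_success(runs: list[dict]) -> dict[tuple[str, str], float]:
    """Total spend divided by successful runs, per (feature, workflow_version)."""
    spend: dict[tuple[str, str], float] = defaultdict(float)
    wins: dict[tuple[str, str], int] = defaultdict(int)
    for run in runs:
        key = (run["feature"], run["workflow_version"])
        spend[key] += run["cost_usd"]  # failed runs still cost money
        if run["succeeded"]:
            wins[key] += 1
    # No successes at all means the cost per success is unbounded.
    return {
        key: (spend[key] / wins[key] if wins[key] else float("inf"))
        for key in spend
    }
```

For example, three runs costing $0.10, $0.20, and $0.30 with one failure give $0.60 / 2 = $0.30 per successful interaction, at the top of the customer-facing target range.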
If you are at zero and you want to be in production in 30 days:
If you do not have an engineer who has shipped one of these before, you will spend 6 weeks rediscovering patterns that already exist. The fastest path is to bring in someone who has done it. Every engineer on Cadence is AI-native by default (vetted on Cursor / Claude / Copilot fluency, prompt-as-spec discipline, and multi-step ladder thinking via a voice interview before they unlock bookings), so a senior at $1,500/week or lead at $2,000/week can land your agent loop, guardrails, and eval scaffolding inside a 2-week trial. If you want to feel the working style before committing, the 48-hour free trial is the fastest way; book a senior, hand them the workflow spec, and see what the first commit looks like.
Pair this post with our companion guide on Claude MCP servers explained when you are ready to expose your tools to multiple agents (Claude Desktop, Cursor, your own runtime) through a single protocol. MCP is the right substrate when your tool surface grows past one app.
For deeper context on the economic side of agents, AI-native engineering ROI: 2026 numbers covers what successful teams actually report and where the math falls apart.
| Runtime | Best for | Pros | Cons |
|---|---|---|---|
| Claude Agent SDK | Greenfield production agents | First-party, tight tool-use loop, MCP support | Newer; smaller community than LangChain |
| LangGraph | Complex graph-based workflows | Mature, explicit state machine, multi-model | Heavier; more abstraction to learn |
| CrewAI | Multi-agent orchestration | Good for role-based teams of agents | Higher token spend; harder to debug |
| Vercel AI SDK | Web app integrations | Streaming, React hooks, edge-friendly | Less suited to long-running server-side loops |
| OpenAI Assistants | Quick OpenAI-native agents | Hosted threads + retrieval | Vendor lock; Claude users need a separate stack |
If you are starting fresh with Claude in May 2026, the Agent SDK plus Langfuse plus your own thin guardrail wrapper is the most boring, most reliable combination. Boring is the goal.
An agentic workflow is an LLM driving a loop: pick a tool, call it, read the result, decide the next step, repeat until done or stopped. The agent has autonomy over which tools to call and in what order, within a bounded set you define.
Claude 4.7 holds plans across 20+ tool calls without drifting, follows output schemas more reliably than 4.5, and shows clearer "I don't know" behavior instead of hallucinating tool calls. The trade-off is higher per-token cost ($15 input / $75 output per 1M tokens), so context pruning matters more than it did with Sonnet-class models.
No. MCP (Model Context Protocol) is useful when you want the same tool surface available to multiple agent runtimes (Claude Desktop, Cursor, your own service). For a single agent in a single app, direct tool definitions in the Claude API are simpler. Reach for MCP when your tool count grows or when more than one client needs access.
Missing iteration and cost caps. A loop without a hard ceiling will, given enough time, find a way to burn $200 in tokens producing no output. The second-biggest is tool allowlist drift: someone adds a new tool, forgets to mark it requires_approval, and an agent sends 4,000 unintended emails. Treat both as Sev-1 design errors, not nice-to-haves.
For customer-facing agents in 2026, target $0.05 to $0.30 per successful interaction. Internal-tool agents tolerate $1 to $5 because the time savings are worth it. If you are 10x higher, your context is bloated or your loop is retrying too aggressively. Prune between turns and tighten your tool selection.
Fullstack developer at withRemote. Ships across the stack — TypeScript, Node, Postgres, Vercel. Writes on shipping speed and pragmatic architecture.