May 14, 2026 · 11 min read · Cadence Editorial

When NOT to use AI to write code

Photo by [Christina Morillo](https://www.pexels.com/@divinetechygirl) on [Pexels](https://www.pexels.com/photo/man-standing-infront-of-white-board-1181345/)

Don't use AI to write code when the answer isn't already in its training data: novel architecture, security-critical paths (cryptography, auth, signature verification, prompt injection design), heavy concurrency, deep Heisenbug debugging, regulated code requiring auditable provenance, and performance-critical systems work. AI is autocomplete, not architect. Senior engineers know when to put it down.

The rest of this post explains the six categories, with the failure modes named, the data behind them, and what to do instead.

AI as autocomplete versus AI as architect

Large language models work by averaging their training data. They are extraordinary at extrapolating inside the distribution and quietly dangerous outside it. That single property explains every category below.

When you use AI as autocomplete, you are asking it to finish a thought you already had: a CRUD endpoint, a Postgres migration, a typed React form. The model has seen ten thousand of those. The risk is small, the speedup is large.

When you use AI as architect, you are asking it to make a decision that compounds across the codebase. It will pattern-match your weird idea to the closest familiar one and quietly normalize it. You will not notice until production.

Veracode's 2025 GenAI Code Security Report tested over 100 LLMs across 80 coding tasks and found that 45% of AI-generated code contains design flaws or known vulnerabilities. That number does not mean AI is useless. It means the cost of misusing it is bigger than most teams admit. Where you draw the line, in code categories, is the senior engineer's call. Here is the working list.

| Code category | AI assist? | Why | What to do instead |
| --- | --- | --- | --- |
| CRUD, glue, well-documented integrations | Yes | Inside training distribution | Let AI draft, you review the diff |
| Refactors with green tests | Yes | Tests are the safety net | Use Cursor's multi-file refactor |
| Documentation from real artifacts | Yes | Mechanical translation | Point Claude at routes + OpenAPI |
| Novel architecture | No | AI averages, never invents | Whiteboard with a senior engineer |
| Cryptography and auth | No | 14% to 86% AI failure rates | Use vetted libraries; human owns the design |
| Concurrency, race conditions | Caution | Bugs hide between paths AI never sees together | Model with TLA+ or formal review |
| Deep Heisenbug debugging | No | AI suggests fixes that mask root cause | Hypothesize, instrument, prove, then patch |
| Regulated code (HIPAA, SOC 2, PCI) | Caution | Auditors want provenance | Keep human reasoning trail in commits |
| Kernel and performance-critical | No | AI optimizes for readability, not cache | Profile, micro-bench, hand-tune |

1. Novel architectural decisions

Greenfield architecture is the place AI fails most quietly. You ask Claude or Cursor to scaffold an event-sourced billing system or a custom consensus protocol, and you get back something that looks idiomatic. It is idiomatic because the model is averaging public examples, almost none of which match your constraints.

The symptom is always the same. The code reads cleanly, follows the right naming conventions, and carries type hints. Three weeks later, you discover the model picked an event store with no support for replaying past a schema version, because that is what 80% of GitHub examples use. You will spend a quarter ripping it out.

Real heuristic: if you are about to write a system whose closest reference implementation is a conference talk rather than a library, the AI does not have the priors to help you. It will help you write the boilerplate inside a design you already chose. It will not choose the design. For an honest treatment of where AI does belong in production AI work, see our notes on using Claude Sonnet 4.6 for production work; the same discipline applies in reverse to architecture.

2. Security-critical code (crypto, auth, signature verification, prompt injection)

This is the category where the failure rates are easiest to quote. The same Veracode study, broken out by CWE:

  • 14% failure on cryptographic implementations (CWE-327)
  • 20% failure on SQL injection (CWE-89)
  • 86% failure on cross-site scripting (CWE-80)
  • 88% failure on log injection (CWE-117)
  • Java code passes only 29% of the time; Python 62%, JavaScript 57%

A 14% crypto failure rate sounds small until you realize you only need it to fail once. The typical failures: MD5 or SHA1 instead of SHA-256, hardcoded encryption keys, predictable random number generators (random instead of secrets), missing constant-time comparisons in HMAC checks, AES-ECB instead of AES-GCM.

Auth is no better. AI-generated auth flows routinely ship without CSRF protection on state-changing routes, accept tokens past their expiry because the comparison is wrong, or skip the audience claim. Signature verification is the worst: the model knows the words "verify" and "signature" but does not know that you must reject on missing fields, not just on malformed ones. Length-extension attacks against naive hash-based MACs (a bare SHA-256 over secret plus payload where real HMAC belongs) come back into your codebase every time someone asks ChatGPT to "verify a webhook."

Prompt injection sits in the same bucket because mitigations are designed, not generated. The model cannot reliably defend itself, by definition, and it cannot reliably tell you which of its outputs are tainted. For practical defense patterns, see our writeup on handling hallucinations in production LLM apps, which covers the same trust-boundary mindset.

What to do instead: pick a vetted library (libsodium, Auth.js, Stripe's webhook helper), let AI draft the call site, but the threat model and the verification logic stay human-owned.

3. Heavy concurrency and race conditions

Concurrency bugs are uniquely hostile to LLM assistance because they live in the gaps between code paths the AI never reads at the same time. You ask Cursor to add a rate limiter; it gives you a check_count(); increment_count(); pattern that passes every test you can write, and at p99 production traffic, double-charges 0.2% of customers.

The model does not see the second goroutine. It does not know your Redis client is connection-pooled and that the read-modify-write needs a Lua script or a WATCH/MULTI/EXEC. It does not know your Postgres transaction needs SELECT FOR UPDATE because the row is hot. The default it generates is the one that works on a single thread on a developer laptop.

Senior engineers learn to spot the smell. Every time the AI writes a check followed by an act, you stop and ask: what happens if a second process runs between those two lines? Most of the time, the AI answer is wrong. The fix is a lock, an atomic operation, or a transactional boundary, and the AI will not reach for any of them unless you spell it out.
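The check-then-act smell and its fix can be sketched in a few lines of standard-library Python. This is a toy in-process limiter, not a production rate limiter; the point is that the check and the increment must happen inside one atomic boundary, here a lock.

```python
import threading

class RateLimiter:
    """Toy in-process limiter: the lock makes check-and-increment one atomic step."""
    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0
        self.lock = threading.Lock()

    def try_acquire(self) -> bool:
        # Without the lock, two threads could both pass the check below
        # before either increments: the classic check-then-act race.
        with self.lock:
            if self.count >= self.limit:
                return False
            self.count += 1
            return True

limiter = RateLimiter(limit=100)
results = []
threads = [threading.Thread(target=lambda: results.append(limiter.try_acquire()))
           for _ in range(500)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With the lock, exactly 100 of the 500 attempts succeed, never 101.
```

The same shape maps onto the distributed versions: the lock becomes a Redis Lua script or a Postgres `SELECT FOR UPDATE`, but the invariant (check and act in one atomic step) is identical.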

If you must use AI here, narrow the prompt brutally. "Write a Postgres advisory-lock-protected counter increment using pg_try_advisory_xact_lock keyed on user_id" is a prompt that works. "Add rate limiting" is a prompt that ships bugs. The same precision discipline shows up in our guide to prompt engineering for senior software engineers.

4. Deep debugging of weird Heisenbugs

A Heisenbug is the kind that disappears when you look at it: only happens under load, only in one region, only at midnight UTC. AI's instinct, asked to fix one, is to suggest a plausible patch around the symptom. A try/except around the failing call. A retry with backoff. A time.sleep(0.1) between two operations that should not need it. The bug goes quiet for six weeks. Then it comes back, worse, in a place no one remembers touching.

Senior debugging is hypothesis-driven. You form a theory. You instrument. You prove the theory before you patch. The AI shortcuts straight to the patch because that is what its training data rewards. Stack Overflow answers are mostly fixes. There is almost no public corpus of "I formed a hypothesis, ran tcpdump, found that our load balancer was dropping the SYN-ACK on long-lived connections, and rolled the kernel back."

The right use of AI here is at the evidence layer, not the fix layer. Ask Claude to summarize a 50MB log dump. Ask it to suggest five hypotheses ranked by likelihood given the symptoms. Then put it down and go run the experiments yourself. Our AI-powered debugging post covers the same split: AI for breadth, human for depth.

5. Regulated code requiring auditable provenance

HIPAA, SOC 2, PCI-DSS, and FedRAMP auditors are starting to ask a new question: who wrote this, and why? An AI-generated diff with no human reasoning trail is becoming a compliance liability, not just a code smell.

The EU AI Act, which entered staged enforcement in 2025 and 2026, requires provenance disclosure for safety-critical software under Annex IV. Several US states have followed with sector-specific rules in healthcare and finance. The practical implication: if the auditor asks why a particular access-control check exists and the answer is "Cursor wrote it," you have a problem.

The fix is process, not abstinence. Keep a clean reasoning trail in commit messages and PR descriptions. If AI drafted the code, the human PR description still has to explain why this approach, what alternatives were rejected, and how the test plan covers the requirement. Some teams now require a Co-Authored-By: AI trailer specifically so the audit log distinguishes drafted-by-AI from authored-by-AI commits.
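As a sketch of what that reasoning trail can look like in a commit message (the requirement, alternatives, and trailer wording here are illustrative conventions, not a standard):

```
Add audience-claim check to token verification

Why: tokens minted for the admin API were accepted by the billing API.
Alternatives rejected: gateway-level validation (bypassable by internal
callers). Test plan: integration test asserts cross-audience tokens fail.

Co-Authored-By: AI <assistant@example.com>
```

Git and most forges treat trailing `Key: Value` lines as trailers, so the audit tooling can grep for them without parsing prose.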

6. Performance-critical kernel and systems code

AI optimizes for surface readability because that is what its training data rewards. It will write a clean Python list comprehension where you needed a NumPy vector op. It will reach for a HashMap in a hot loop where you needed a flat array and a linear scan. It will pick a heap allocation where you needed a stack arena.
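The list-comprehension-versus-vector-op gap is easy to demonstrate. A minimal sketch, assuming NumPy is available: both lines compute the same sum of squares, but the generator version makes a million interpreter-level iterations while the vectorized form runs one C-level loop.

```python
import numpy as np

xs = np.arange(1_000_000, dtype=np.float64)

# What an assistant often drafts: clean, idiomatic, and slow at this scale.
squared_sum_python = sum(x * x for x in xs)

# The vectorized form does the same arithmetic in a single C-level loop.
squared_sum_numpy = float(np.dot(xs, xs))
```

The two results agree to floating-point tolerance; the difference is purely where the loop runs, which is exactly the kind of cost invisible in a diff review.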

A real anecdote from a team running an in-memory recommendation index: a Cursor-suggested switch from a sorted-vector lookup to a HashMap<u64, Vec<u32>> looked like an obvious win on the diff. In production it caused a 30x slowdown on a hot path because the cache miss rate quintupled and the allocator started thrashing. The fix took two days; the rollback took ten minutes.

Kernel modules, lock-free data structures, custom allocators, SIMD intrinsics, JIT codegen, and any code where you care about branch prediction or page faults: AI confidently writes wrong code in all of these. Use it for documentation and test scaffolding around the hand-written core. Never let it write the core.

Where AI absolutely does pay off

Lest this read as a hit piece, the inverse list is real and large:

  • Boilerplate and CRUD: REST endpoints, typed forms, migrations, integration with well-documented APIs (Stripe, Supabase, Vercel). AI nails these.
  • Refactors with green tests: Cursor's multi-file refactor is the killer feature for any 100k-line codebase. Tests are the safety net.
  • Documentation generation: point Claude at your route handlers and OpenAPI spec and you get usable docs in minutes. Our Claude documentation generation writeup covers the workflow.
  • Test scaffolding: AI writes the cases, you write the assertions.
  • Tool selection inside a known stack: "Drizzle or Prisma for Next.js + Postgres?" Useful. "Event sourcing for billing?" Bad question.

Rough heuristic: if a senior engineer would have written the code in 30 minutes, AI is a 3x speedup. If they would have spent two days thinking first, AI will save no time and may cost weeks.

How Cadence vets the instinct to put the AI down

Every engineer on Cadence is AI-native by default. The voice interview tests Cursor, Claude Code, and Copilot fluency, prompt-as-spec discipline, and verification habits before an engineer unlocks bookings. There is no non-AI-native option on the platform.

The interview also tests the inverse skill: when does the candidate put the AI down? We ask each engineer to walk through the last bug AI could not help with. Strong candidates can name the exact moment they closed Cursor and reached for tcpdump, perf, a debugger, or a whiteboard. Weak candidates cannot; they keep prompting.

If you are choosing the right tier for the work:

  • Junior, $500/week: cleanup, dependency hygiene, doc-writing, AI-assisted integrations with good docs. Stay inside the autocomplete zone.
  • Mid, $1,000/week: standard features, end-to-end shipping, refactors, test coverage. AI is a daily tool; the engineer reviews every diff.
  • Senior, $1,500/week: owns scope, calls when to put the AI down, designs the auth flow themselves, picks the concurrency primitive, debugs the Heisenbug.
  • Lead, $2,000/week: architectural decisions, complex systems design, fractional CTO. Almost everything in this post lives at this tier.

What to do next

Pick one category from the list above where your team is currently using AI past the point of safety. The most common offender is "cryptography or auth flow that someone shipped after a 20-minute Cursor session." Audit it this week. Replace the hand-rolled crypto with a vetted library. Add an integration test that proves the threat model.

If you are between hires and need a senior or lead-tier engineer to own one of those categories without spending six weeks recruiting, book a 48-hour free trial on Cadence; you ship code with the engineer for two days, decide if they are right, then move to weekly billing or replace them with no notice.

Get a Build/Buy/Book recommendation in two minutes. Our Decide tool takes a one-line description of the work and tells you whether to build it in-house, buy a SaaS, or book an engineer. It also flags the categories from this post where AI assistance is a liability rather than a speedup.

FAQ

Is AI-generated code less secure than human-written code?

Yes. Veracode's 2025 GenAI Code Security Report found that AI-generated code contains roughly 2.74x more vulnerabilities than human-written code, with cryptography failing 14% of the time, SQL injection 20%, and XSS 86%.

Can AI debug race conditions reliably?

No. Race conditions hide between code paths the AI never sees together, so the model generates plausible code that passes tests on a developer laptop and breaks at p99 in production. Use AI for hypothesis generation; use a human for the fix.

Should I let AI write my auth flow?

Only the boilerplate. Token issuance, session handling, signature verification, CSRF protection, and rate limiting need a human who can reason about the threat model. Pick a vetted library (Auth.js, libsodium, Stripe's webhook helper) and let AI draft the call site, not the verification logic.

Does using AI invalidate SOC 2 or HIPAA compliance?

Not automatically. But auditors increasingly ask for provenance. Keep a clean human reasoning trail in commit messages and PR descriptions for anything that touches PHI, PII, or payment data, and consider a Co-Authored-By: AI trailer to distinguish drafted-by-AI from authored-by-AI commits.

How do I know an engineer has the instinct to put the AI down?

Ask them to walk through the last bug AI could not help with. Strong AI-native engineers can name the exact moment they closed Cursor and reached for tcpdump, a debugger, or a whiteboard. Weak ones keep prompting.
