Everything I Got Wrong About AI Coding Agents

A practical breakdown of what actually goes wrong with AI coding agents in production: instruction overload, horizontal planning, and why humans still need to do the real thinking. Written from experience shipping real systems at scale.

28 Mar 2026 · 6 min read · By Ashish Kumar Verma

I've been using coding agents for about a year now. Like actually using them. Not demos, not toy projects. Production code that people pay for and that pages me at 3am if it breaks.

And I got a lot of things wrong.

I think it's worth talking about what went wrong because everyone's out here posting their wins. Nobody talks about the six months they spent shipping AI generated code that they had to rip out and rewrite.

So here's what I learned.

Shipping 50% more code doesn't mean 50% more progress

This was the first thing that hit me. I started using coding agents and my output went through the roof. PRs flying. Features shipping. It felt amazing.

Then I looked at what was actually happening over a month. About half of those PRs were just cleaning up slop from the previous week. The AI wrote something that looked right, passed basic tests, but had subtle issues that showed up later. So I'd fix those. Which created new issues. Which I'd fix next week.

I was running on a treadmill and calling it progress.

The magic words problem

I was using this Research, Plan, Implement workflow. Research the codebase, create a plan, then implement it. Sounds clean, right?

The planning step had this massive prompt with like 85 instructions. It was supposed to ask me questions, present design options, get my feedback, then write the plan. A proper back and forth.

But half the time it would just skip all that and dump out a 1000 line plan. No questions asked. No alignment. Just "here's what I'm going to build."

And I figured out that if you literally wrote "work back and forth with me starting with your open questions and outline before writing the plan" it would actually do the interactive thing. But if you didn't say those exact words it would just skip it.

I was sitting there telling people "no no you have to say the magic words." That's when I realized the tool was broken, not the user.

Your LLM has an instruction budget

This is when things clicked for me. There's research showing that frontier LLMs can reliably follow about 150 to 200 instructions. After that they start silently dropping some.

So if your system prompt has 85 instructions and then your claude.md adds another 40 and your MCP tools add another 30 and your skill files add another 50... you're at 200+ instructions and the model is basically rolling dice on which ones to follow.

That's why the planning steps were getting skipped. The model wasn't being lazy. It was overloaded.
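A quick way to see how fast the budget runs out is just to tally the instruction load per source. This is a sketch with illustrative counts (the source names and numbers here mirror the example above, not any real setup):

```python
# Rough sanity check of total instruction load across prompt sources.
# Counts are illustrative, matching the example in the text.

BUDGET = 150  # lower end of the ~150-200 range models follow reliably

sources = {
    "system prompt": 85,
    "claude.md": 40,
    "MCP tool definitions": 30,
    "skill files": 50,
}

total = sum(sources.values())
print(f"total instructions: {total}")
if total > BUDGET:
    over = total - BUDGET
    print(f"over budget by {over}: expect silently dropped instructions")
```

The point isn't the exact threshold. It's that the sources add up quietly, and nobody is watching the sum.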

Instruction Budget

[Interactive chart: discrete checkpoint view. Pick a scored prompt size from 20 to 280 instructions and see how the overall pass rate and different instruction types hold up. At the 40-instruction checkpoint: 95% mean pass rate over 52 runs (range 92-98%), with 38 instructions followed and 2 skipped. Formatting degrades last (98%), then hard constraints (96%), style (94%), and tool use (92%). Checkpoint note: still solid; minor omissions start showing up in style and tool constraints.]

Drag that slider past 150 and watch instructions start getting silently skipped. That's not a bug. That's just how these models work right now.

The context window has a dumb zone

Related to this. Everyone talks about context windows getting bigger. 200k tokens now. But bigger doesn't mean better.

Around 40% fill you start getting degraded results. By 60% the model is noticeably worse at following your instructions. And it's not just the amount of information. It's how many instructions are competing for attention.

Too many MCP tools? The whole context window is full of tool definitions the model doesn't need for this task. Too many files loaded? The actual instructions get drowned out.

Less context, better results. Counterintuitive but true.
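You can do this math on the back of an envelope. Here's a sketch with made-up token counts (the component names follow the breakdown used in the chart below; the zone thresholds are the rough 40%/60% marks from above):

```python
# Back-of-the-envelope context fill estimator. Token counts are illustrative.

WINDOW = 200_000  # context window size in tokens

components = {
    "system prompt": 3_500,
    "tools / MCP definitions": 2_500,
    "claude.md + skills": 1_000,
    "loaded files": 60_000,
    "conversation": 15_000,
}

used = sum(components.values())
fill = used / WINDOW

# Rough zones from the text: under 40% is fine, 40-60% degrades,
# past 60% the model is noticeably worse.
if fill < 0.40:
    zone = "clean"
elif fill < 0.60:
    zone = "degrading"
else:
    zone = "dumb zone"

print(f"{used:,} / {WINDOW:,} tokens ({fill:.0%}) -> {zone}")
```

Notice that loaded files dominate the total. Dumping a directory into context is usually what pushes you over the line, not the conversation itself.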

The Dumb Zone

[Interactive chart: stepped checkpoint view using discrete context checkpoints from 8k to 160k tokens, tracking task success, instruction recall, and retrieval accuracy. At the 16k / 200k checkpoint (8% fill, "clean" zone): 94% task success over 46 runs (range 89-97%), 96% instruction recall, 95% retrieval accuracy, 4% omission rate, 2% contradiction rate. Context at this checkpoint: system prompt 3.5k, tools/MCP 2.5k, claude.md + skills 1.0k, loaded files 6.0k, conversation 3.0k. Checkpoint note: still comfortable; you can load a few files without turning the model into soup. At this checkpoint, omission and contradiction rates matter more than raw context capacity on paper.]

Drag that past 64k tokens. Watch the quality fall off a cliff. That's not a theoretical problem. That's what's happening every time you load 15 files into context and wonder why the model forgot your instructions.

Don't read the plan. Read the code.

This is the one I'm most embarrassed about getting wrong.

For months I was telling people "just review the plan, the plan is the source of truth." I'd even have teammates PR the plans and code review them together.

But here's the thing. A 1000 line plan produces about 1000 lines of code. And the code doesn't always match the plan. The model finds things during implementation, makes different decisions, takes shortcuts.

So you review 1000 lines of plan. Then you review 1000 lines of code to see what actually changed. That's not leverage. That's double the work.

The new rule is simple. Skip the plan review. Read the actual code. That's what ships. That's what pages you at 3am.

And I know some people will say "but so and so ships 300k lines without reading any of it." Cool. Those are open source projects. They're impressive but the stakes are different. If you're shipping production SaaS code in a regulated industry, please read it. We have a profession to uphold. 2026 is the year of no more slop.

Horizontal plans will ruin your week

This is something I noticed after months of debugging AI generated code. Models love to build features horizontally.

What does that mean? They do ALL the database changes first. Then ALL the services. Then ALL the API endpoints. Then ALL the frontend. 1200 lines later you run it and something is broken.

Which layer has the bug? Could be any of them. Good luck.

Compare that to building vertically. You make a mock API endpoint, wire it to the frontend, check it works. Then you add the services layer, check it works. Then you do the database migration and integrate everything, check it works.

Same total code. But you have checkpoints. You catch bugs in 300 line increments instead of debugging 1200 lines at once.
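The vertical plan is easy to express as a checkpoint loop. This is a sketch, with phase names and line counts matching the example above rather than any real feature:

```python
# Sketch of a vertical plan: each phase is a thin end-to-end slice with its
# own checkpoint, instead of one layer-by-layer pass. Phases are illustrative.

vertical_phases = [
    ("mock API endpoint wired to frontend", 300),
    ("real services layer behind the endpoint", 300),
    ("database migration + integration", 300),
    ("edge cases and error handling", 300),
]

reviewed = 0
for name, lines in vertical_phases:
    reviewed += lines
    # Checkpoint: run the slice end to end before writing the next phase.
    print(f"checkpoint after '{name}': {lines} new lines, {reviewed} total")
```

Worst case, a bug hides in the 300 lines since the last checkpoint, not anywhere in 1200.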

Horizontal vs Vertical Plans

[Interactive chart: same feature, same ~1200 total lines across database, services, API, and frontend. The horizontal view shows four ~300-line layer phases (all database changes, all services, all API, all frontend) with zero checkpoints: 1200 lines later something is broken, and good luck finding which layer. Toggling to vertical shows the same code split into end-to-end slices with a testable checkpoint after each.]

Toggle between horizontal and vertical. Same feature, same code. But vertical gives you three checkpoints where you can test and fix before moving on. Horizontal gives you zero.

I now explicitly tell the agent to build vertically. It fights me on it sometimes because training data is full of horizontal approaches. But it's worth the fight.

So I split the whole thing up

The original workflow was three stages. Research, Plan, Implement. One giant planning prompt with 85 instructions trying to do everything.

I broke it into seven focused stages. Each one has under 40 instructions. Each one has a clear input and a clear output.

Questions first. Just generate the right research questions from the ticket. Don't think about implementation.

Research next. Fresh context window. Doesn't know what we're building. Just finds facts about how the codebase works. Objective. No opinions.

Then a design discussion. This is the big one. Instead of a 1000 line plan the agent brain dumps a 200 line design doc. All the patterns it found, all the options, all the open questions. And I do surgery on it. This is where I catch the model wanting to use the wrong architectural patterns, the wrong approach, the wrong abstractions. 200 lines is actually reviewable. 1000 is not.

Then a structure outline. High level phases, not exact code. Like C header files. Enough to see what's coming without reading the implementation.

Then the actual plan, which at this point is just a tactical doc for the agent. I've already aligned on everything. Just spot check it.

Then implement phase by phase. Each phase is testable.

Then package into a PR.
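The whole pipeline fits in a table. Here's a sketch of the stages with their instruction counts and outputs as described above (the counts are approximate):

```python
# The seven stages, each kept under the ~40-instruction budget.
# Counts and outputs mirror the workflow described above; approximate.

stages = [
    ("Clarify",     15, "research questions"),
    ("Research",    20, "facts doc"),
    ("Investigate", 25, "design discussion"),
    ("Structure",   20, "outline"),
    ("Plan",        35, "tactical plan"),
    ("Yield",       30, "code + tests"),
    ("PR",          15, "pull request"),
]

assert all(count < 40 for _, count, _ in stages)  # every stage fits the budget
for name, count, output in stages:
    print(f"{name:<12} {count:>2} instructions -> {output}")
```

Each stage gets a fresh, small context with one clear job, which is exactly what keeps it out of the instruction-budget problem from earlier.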

RPI → CRISPY

[Interactive chart: one mega-prompt with 85 instructions split into focused stages, each under 40. Click any stage to see what it does: C - Clarify (15 instructions → questions), R - Research (20 → facts doc), I - Investigate (25 → design discussion), S - Structure (20 → outline), P - Plan (35 → tactical plan), Y - Yield (30 → code + tests), PR (15 → pull request). Seven stages, max 35 instructions per stage, key review doc of 200 lines.]

Click between the before and after. Look at the max instructions per stage. 85 down to 35. And the key document you review went from 1000 lines to 200.

The leverage is not in the coding

This is probably the most important thing I've learned. Everyone obsesses over "AI writes code faster." And it does. But coding was never the bottleneck.

A two day feature breaks down like this. Maybe 4 hours of actual coding. 3 hours of alignment meetings. 2 hours of code review. 1.5 hours of rework. 1.5 hours of testing.

If AI makes the coding part take 20 minutes instead of 4 hours, you saved roughly 3.5 hours on a 12 hour feature. That's nice but it's not transformative.
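This is just Amdahl's law: speed up one slice of the work and the overall gain is capped by everything you didn't speed up. Plugging in the numbers from the breakdown above:

```python
# Amdahl's law applied to the feature timeline above: accelerating only the
# coding slice barely moves the overall number. Hours are from the text.

hours = {"alignment": 3.0, "coding": 4.0, "review": 2.0,
         "rework": 1.5, "testing": 1.5}
total = sum(hours.values())      # 12 hours

p = hours["coding"] / total      # fraction of the work that is coding
s = 12                           # 4 hours -> 20 minutes is a 12x speedup

overall = 1 / ((1 - p) + p / s)  # Amdahl's law
print(f"coding 12x faster -> whole feature only {overall:.2f}x faster")
```

Even with a 12x faster coding step, the feature as a whole gets only about 1.4x faster. The other eight hours don't care how fast the code was typed.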

The real leverage is using AI for the alignment phase. The design discussion. The research. When your team reviews a 200 line design doc before you write code, the code review becomes "yep that's what we agreed on." No surprises. No rework.

Where the time actually goes

Coding speed is only part of the story. Alignment, review, and rework usually dominate the real timeline.

[Interactive chart: a feature-complexity slider compares three timelines for the same medium-complexity feature. No AI: 11.5h total (alignment/design 2.0h, coding 5.0h, code review 1.5h, rework 1.5h, testing/QA 1.5h). AI, "just vibe code it": 10.3h total, with less time prompting and coding but more spent reading output, fixing hidden issues, and reworking later. AI, structured: 4.5h total, split across AI research, design review, AI coding, reading the code, code review, and testing/QA. Net: the raw agent is only about 1.1x faster than no AI and often messier; structured AI is about 2.6x faster, a sustainable speedup, saving roughly 5.8h of rework versus raw agenting.]

Drag the complexity slider. Even on simple features, notice how "just vibe code it" actually takes longer overall because of the rework. The sustainable speedup comes from better alignment, not faster typing.

Do not outsource the thinking

If there's one sentence I'd tattoo on the inside of my eyelids it's this. Do not outsource the thinking.

The agent should show you everything it's thinking. Every pattern it found. Every decision it wants to make. Every assumption it's making about your architecture. And you correct it BEFORE it writes 2000 lines of code.

The design discussion is brain surgery on the agent. You're forcing it to dump out its mental model so you can fix the parts that are wrong before they become 50 files of production code.

Every time I skipped this step because I was in a hurry, I paid for it later. Every. Single. Time.

What I actually do now

Here's the honest breakdown.

I use AI for everything. Research, planning, coding, testing. It probably writes 80% of my code by character count.

But I read all of it. I align with it on architecture before it writes anything. I build vertically with checkpoints. I keep my prompts focused and under the instruction budget. I treat the 200 line design doc as the most important artifact in the whole process.

And I ship at maybe 2 to 3x the speed I used to. Not 10x. 2 to 3x.

That might sound disappointing compared to the "10x developer" hype. But the difference is my code doesn't get ripped out next month. It passes code review on the first round. It doesn't have hidden security landmines. It's code I'm proud to have my name on.

10x faster doesn't matter if you're throwing it all away in 6 months.

Shoot for 2 to 3x. Read the code. Don't outsource the thinking. That's actually how you win with AI coding agents.
