Everything I Got Wrong About AI Coding Agents

A practical breakdown of what actually goes wrong with AI coding agents in production: instruction overload, horizontal planning, and why humans still need to do the real thinking. Written from experience shipping real systems at scale.

28 Mar 2026 · 6 min read · By Ashish Kumar Verma

I've been using coding agents for about a year now. Like actually using them. Not demos, not toy projects. Production code that people pay for and that pages me at 3am if it breaks.

And I got a lot of things wrong.

I think it's worth talking about what went wrong because everyone's out here posting their wins. Nobody talks about the six months they spent shipping AI generated code that they had to rip out and rewrite.

So here's what I learned.

Shipping 50% more code doesn't mean 50% more progress

This was the first thing that hit me. I started using coding agents and my output went through the roof. PRs flying. Features shipping. It felt amazing.

Then I looked at what was actually happening over a month. About half of those PRs were just cleaning up slop from the previous week. The AI wrote something that looked right, passed basic tests, but had subtle issues that showed up later. So I'd fix those. Which created new issues. Which I'd fix next week.

I was running on a treadmill and calling it progress.

The magic words problem

I was using this Research, Plan, Implement workflow. Research the codebase, create a plan, then implement it. Sounds clean, right?

The planning step had this massive prompt with like 85 instructions. It was supposed to ask me questions, present design options, get my feedback, then write the plan. A proper back and forth.

But half the time it would just skip all that and dump out a 1000 line plan. No questions asked. No alignment. Just "here's what I'm going to build."

And I figured out that if you literally wrote "work back and forth with me starting with your open questions and outline before writing the plan" it would actually do the interactive thing. But if you didn't say those exact words it would just skip it.

I was sitting there telling people "no no you have to say the magic words." That's when I realized the tool was broken, not the user.

Your LLM has an instruction budget

This is when things clicked for me. There's research showing that frontier LLMs can reliably follow about 150 to 200 instructions. After that they start silently dropping some.

So if your system prompt has 85 instructions and then your claude.md adds another 40 and your MCP tools add another 30 and your skill files add another 50... you're at 200+ instructions and the model is basically rolling dice on which ones to follow.

That's why the planning steps were getting skipped. The model wasn't being lazy. It was overloaded.
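A quick way to see how fast the budget runs out is just to tally the instruction load per source. This is a sketch with illustrative counts (the source names and numbers here mirror the example above, not any real setup):

```python
# Rough sanity check of total instruction load across prompt sources.
# Counts are illustrative, matching the example in the text.

BUDGET = 150  # lower end of the ~150-200 range models follow reliably

sources = {
    "system prompt": 85,
    "claude.md": 40,
    "MCP tool definitions": 30,
    "skill files": 50,
}

total = sum(sources.values())
print(f"total instructions: {total}")
if total > BUDGET:
    over = total - BUDGET
    print(f"over budget by {over}: expect silently dropped instructions")
```

The point isn't the exact threshold. It's that the sources add up quietly, and nobody is watching the sum.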

Instruction Budget

[Interactive chart: discrete checkpoint view. Pick a scored prompt size from 20 to 280 instructions and see how the overall pass rate and different instruction types hold up. At the 40-instruction checkpoint: 95% mean pass rate over 52 runs (range 92-98%), with 38 instructions followed and 2 skipped. Formatting degrades last (98%), then hard constraints (96%), style (94%), and tool use (92%). Checkpoint note: still solid; minor omissions start showing up in style and tool constraints.]

Drag that slider past 150 and watch instructions start getting silently skipped. That's not a bug. That's just how these models work right now.

The context window has a dumb zone

Related to this. Everyone talks about context windows getting bigger. 200k tokens now. But bigger doesn't mean better.

Around 40% fill you start getting degraded results. By 60% the model is noticeably worse at following your instructions. And it's not just the amount of information. It's how many instructions are competing for attention.

Too many MCP tools? The whole context window is full of tool definitions the model doesn't need for this task. Too many files loaded? The actual instructions get drowned out.

Less context, better results. Counterintuitive but true.
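You can do this math on the back of an envelope. Here's a sketch with made-up token counts (the component names follow the breakdown used in the chart below; the zone thresholds are the rough 40%/60% marks from above):

```python
# Back-of-the-envelope context fill estimator. Token counts are illustrative.

WINDOW = 200_000  # context window size in tokens

components = {
    "system prompt": 3_500,
    "tools / MCP definitions": 2_500,
    "claude.md + skills": 1_000,
    "loaded files": 60_000,
    "conversation": 15_000,
}

used = sum(components.values())
fill = used / WINDOW

# Rough zones from the text: under 40% is fine, 40-60% degrades,
# past 60% the model is noticeably worse.
if fill < 0.40:
    zone = "clean"
elif fill < 0.60:
    zone = "degrading"
else:
    zone = "dumb zone"

print(f"{used:,} / {WINDOW:,} tokens ({fill:.0%}) -> {zone}")
```

Notice that loaded files dominate the total. Dumping a directory into context is usually what pushes you over the line, not the conversation itself.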

The Dumb Zone

[Interactive chart: stepped checkpoint view using discrete context checkpoints from 8k to 160k tokens, tracking task success, instruction recall, and retrieval accuracy. At the 16k / 200k checkpoint (8% fill, "clean" zone): 94% task success over 46 runs (range 89-97%), 96% instruction recall, 95% retrieval accuracy, 4% omission rate, 2% contradiction rate. Context at this checkpoint: system prompt 3.5k, tools/MCP 2.5k, claude.md + skills 1.0k, loaded files 6.0k, conversation 3.0k. Checkpoint note: still comfortable; you can load a few files without turning the model into soup. At this checkpoint, omission and contradiction rates matter more than raw context capacity on paper.]

Drag that past 64k tokens. Watch the quality fall off a cliff. That's not a theoretical problem. That's what's happening every time you load 15 files into context and wonder why the model forgot your instructions.

Don't read the plan. Read the code.

This is the one I'm most embarrassed about getting wrong.

For months I was telling people "just review the plan, the plan is the source of truth." I'd even have teammates PR the plans and code review them together.

But here's the thing. A 1000 line plan produces about 1000 lines of code. And the code doesn't always match the plan. The model finds things during implementation, makes different decisions, takes shortcuts.

So you review 1000 lines of plan. Then you review 1000 lines of code to see what actually changed. That's not leverage. That's double the work.

The new rule is simple. Skip the plan review. Read the actual code. That's what ships. That's what pages you at 3am.

And I know some people will say "but so and so ships 300k lines without reading any of it." Cool. Those are open source projects. They're impressive but the stakes are different. If you're shipping production SaaS code in a regulated industry, please read it. We have a profession to uphold. 2026 is the year of no more slop.

Horizontal plans will ruin your week

This is something I noticed after months of debugging AI generated code. Models love to build features horizontally.

What does that mean? They do ALL the database changes first. Then ALL the services. Then ALL the API endpoints. Then ALL the frontend. 1200 lines later you run it and something is broken.

Which layer has the bug? Could be any of them. Good luck.

Compare that to building vertically. You make a mock API endpoint, wire it to the frontend, check it works. Then you add the services layer, check it works. Then you do the database migration and integrate everything, check it works.

Same total code. But you have checkpoints. You catch bugs in 300 line increments instead of debugging 1200 lines at once.
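The vertical plan is easy to express as a checkpoint loop. This is a sketch, with phase names and line counts matching the example above rather than any real feature:

```python
# Sketch of a vertical plan: each phase is a thin end-to-end slice with its
# own checkpoint, instead of one layer-by-layer pass. Phases are illustrative.

vertical_phases = [
    ("mock API endpoint wired to frontend", 300),
    ("real services layer behind the endpoint", 300),
    ("database migration + integration", 300),
    ("edge cases and error handling", 300),
]

reviewed = 0
for name, lines in vertical_phases:
    reviewed += lines
    # Checkpoint: run the slice end to end before writing the next phase.
    print(f"checkpoint after '{name}': {lines} new lines, {reviewed} total")
```

Worst case, a bug hides in the 300 lines since the last checkpoint, not anywhere in 1200.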

Horizontal vs Vertical Plans

[Interactive chart: same feature, same ~1200 total lines across database, services, API, and frontend. The horizontal view shows four ~300-line layer phases (all database changes, all services, all API, all frontend) with zero checkpoints: 1200 lines later something is broken, and good luck finding which layer. Toggling to vertical shows the same code split into end-to-end slices with a testable checkpoint after each.]

Toggle between horizontal and vertical. Same feature, same code. But vertical gives you three checkpoints where you can test and fix before moving on. Horizontal gives you zero.

I now explicitly tell the agent to build vertically. It fights me on it sometimes because training data is full of horizontal approaches. But it's worth the fight.

So I split the whole thing up

The original workflow was three stages. Research, Plan, Implement. One giant planning prompt with 85 instructions trying to do everything.

I broke it into seven focused stages. Each one has under 40 instructions. Each one has a clear input and a clear output.

Questions first. Just generate the right research questions from the ticket. Don't think about implementation.

Research next. Fresh context window. Doesn't know what we're building. Just finds facts about how the codebase works. Objective. No opinions.

Then a design discussion. This is the big one. Instead of a 1000 line plan the agent brain dumps a 200 line design doc. All the patterns it found, all the options, all the open questions. And I do surgery on it. This is where I catch the model wanting to use the wrong architectural patterns, the wrong approach, the wrong abstractions. 200 lines is actually reviewable. 1000 is not.

Then a structure outline. High level phases, not exact code. Like C header files. Enough to see what's coming without reading the implementation.

Then the actual plan, which at this point is just a tactical doc for the agent. I've already aligned on everything. Just spot check it.

Then implement phase by phase. Each phase is testable.

Then package into a PR.
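The whole pipeline fits in a table. Here's a sketch of the stages with their instruction counts and outputs as described above (the counts are approximate):

```python
# The seven stages, each kept under the ~40-instruction budget.
# Counts and outputs mirror the workflow described above; approximate.

stages = [
    ("Clarify",     15, "research questions"),
    ("Research",    20, "facts doc"),
    ("Investigate", 25, "design discussion"),
    ("Structure",   20, "outline"),
    ("Plan",        35, "tactical plan"),
    ("Yield",       30, "code + tests"),
    ("PR",          15, "pull request"),
]

assert all(count < 40 for _, count, _ in stages)  # every stage fits the budget
for name, count, output in stages:
    print(f"{name:<12} {count:>2} instructions -> {output}")
```

Each stage gets a fresh, small context with one clear job, which is exactly what keeps it out of the instruction-budget problem from earlier.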

RPI → CRISPY

[Interactive chart: one mega-prompt with 85 instructions split into focused stages, each under 40. Click any stage to see what it does: C - Clarify (15 instructions → questions), R - Research (20 → facts doc), I - Investigate (25 → design discussion), S - Structure (20 → outline), P - Plan (35 → tactical plan), Y - Yield (30 → code + tests), PR (15 → pull request). Seven stages, max 35 instructions per stage, key review doc of 200 lines.]

Click between the before and after. Look at the max instructions per stage. 85 down to 35. And the key document you review went from 1000 lines to 200.

The leverage is not in the coding

This is probably the most important thing I've learned. Everyone obsesses over "AI writes code faster." And it does. But coding was never the bottleneck.

A two day feature breaks down like this. Maybe 4 hours of actual coding. 3 hours of alignment meetings. 2 hours of code review. 1.5 hours of rework. 1.5 hours of testing.

If AI makes the coding part take 20 minutes instead of 4 hours, you saved roughly 3.5 hours on a 12 hour feature. That's nice but it's not transformative.
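This is just Amdahl's law: speed up one slice of the work and the overall gain is capped by everything you didn't speed up. Plugging in the numbers from the breakdown above:

```python
# Amdahl's law applied to the feature timeline above: accelerating only the
# coding slice barely moves the overall number. Hours are from the text.

hours = {"alignment": 3.0, "coding": 4.0, "review": 2.0,
         "rework": 1.5, "testing": 1.5}
total = sum(hours.values())      # 12 hours

p = hours["coding"] / total      # fraction of the work that is coding
s = 12                           # 4 hours -> 20 minutes is a 12x speedup

overall = 1 / ((1 - p) + p / s)  # Amdahl's law
print(f"coding 12x faster -> whole feature only {overall:.2f}x faster")
```

Even with a 12x faster coding step, the feature as a whole gets only about 1.4x faster. The other eight hours don't care how fast the code was typed.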

The real leverage is using AI for the alignment phase. The design discussion. The research. When your team reviews a 200 line design doc before you write code, the code review becomes "yep that's what we agreed on." No surprises. No rework.

Where the time actually goes

Coding speed is only part of the story. Alignment, review, and rework usually dominate the real timeline.

[Interactive chart: a feature-complexity slider compares three timelines for the same medium-complexity feature. No AI: 11.5h total (alignment/design 2.0h, coding 5.0h, code review 1.5h, rework 1.5h, testing/QA 1.5h). AI, "just vibe code it": 10.3h total, with less time prompting and coding but more spent reading output, fixing hidden issues, and reworking later. AI, structured: 4.5h total, split across AI research, design review, AI coding, reading the code, code review, and testing/QA. Net: the raw agent is only about 1.1x faster than no AI and often messier; structured AI is about 2.6x faster, a sustainable speedup, saving roughly 5.8h of rework versus raw agenting.]

Drag the complexity slider. Even on simple features, notice how "just vibe code it" actually takes longer overall because of the rework. The sustainable speedup comes from better alignment, not faster typing.

Do not outsource the thinking

If there's one sentence I'd tattoo on the inside of my eyelids it's this. Do not outsource the thinking.

The agent should show you everything it's thinking. Every pattern it found. Every decision it wants to make. Every assumption it's making about your architecture. And you correct it BEFORE it writes 2000 lines of code.

The design discussion is brain surgery on the agent. You're forcing it to dump out its mental model so you can fix the parts that are wrong before they become 50 files of production code.

Every time I skipped this step because I was in a hurry, I paid for it later. Every. Single. Time.

What I actually do now

Here's the honest breakdown.

I use AI for everything. Research, planning, coding, testing. It probably writes 80% of my code by character count.

But I read all of it. I align with it on architecture before it writes anything. I build vertically with checkpoints. I keep my prompts focused and under the instruction budget. I treat the 200 line design doc as the most important artifact in the whole process.

And I ship at maybe 2 to 3x the speed I used to. Not 10x. 2 to 3x.

That might sound disappointing compared to the "10x developer" hype. But the difference is my code doesn't get ripped out next month. It passes code review on the first round. It doesn't have hidden security landmines. It's code I'm proud to have my name on.

10x faster doesn't matter if you're throwing it all away in 6 months.

Shoot for 2 to 3x. Read the code. Don't outsource the thinking. That's actually how you win with AI coding agents.
