Intro
12 months, ~200 KLOC, and a lesson that AI doesn’t so much make programming easier as it increases the scale, the cost, and the number of places where you can painfully screw up.
For the past year I’ve been building a system with heavy AI assistance. Not a landing page for selling a course about selling courses. Not an MVP for demo day. A system. With a rigid ruleset. Clean architecture, declarative patterns, tests, modularization, tight types with structured concurrency and DI (shoutout Effect team) and an absolute ban on leaking infrastructure into business logic just because the model found it more convenient.
It ended up around 200 thousand lines of code.
And yes, I know what you’re thinking. 200 KLOC solo is either a lie, spaghetti, or proof of a 5x AI speed boost. None of the above. Half of it is architectural boilerplate - DI wiring, typed errors, repetitive patterns, scaffolding. That stuff was never the hard part, even before AI. A few snippets, some macros, IntelliSense tabbing, muscle memory - boilerplate was already fast. AI just floors the accelerator on a road that was already downhill.
And I keep seeing people post their weekly line counts like it’s a deadlift PR. It’s not. LOC isn’t value measurement. It’s attack surface. And past a certain point, AI generates problems exactly as fast as it generates code.
That’s not a flex. That’s a clean-up bill.
The longer it went on, the less I believed the simple fairy tale that AI “speeds up development.” It speeds things up. But that’s like saying a loan “gives you money.” Technically true. Strategically - it skips the part where people crash and burn.
The biggest trap? The gains are real. Real enough to get high on. And then comes the hard landing, and the bill that’s been lying face down the whole time.
Every section below starts with a comment. You’ll recognize them - they’re the things people actually reply under rants like this. Skill issue, wrong model, bad prompts, just scope better. And the answer to every single one is: yes, guilty. And so is everyone else. Anyone who claims otherwise is either building something trivial, lying, or hasn’t hit the wall yet.
Because here’s the thing nobody wants to say out loud: you can’t put 1,000 hours into mastering something that has a 600-hour lifespan. The models change. The tooling changes. The best practices from three months ago are anti-patterns today. There is no plateau of competence. There’s only a treadmill that gets faster.
“Sounds like AI is working and you’re just mad about it”
Yeah. I am. And I’m also right, and you know it if you’ve built anything past a simplified Trello clone.
There was no 10x. No stable 2x either. 5x in some boilerplate code - notably there was a lot of that. Realistically - somewhere around 1.5x on average, with days where I was slower than without AI because the model doesn’t struggle with hard problems - it plea bargains them down. A strict type becomes union. A boundary becomes a passthrough. An error channel gets a silent catch. A cast appears, that nobody asked for. An invariant grows an exception that swallows the rule.
Then it delivers the whole thing like a valet who returned your car with no mirrors and a note saying ‘improved aerodynamics.’
But that 1.5x is addictive enough to keep going.
Research - faster. First draft of implementation - faster. Scaffolding - faster. Exploring three solution variants before committing - absolutely faster.
The problem is that a local speed boost is not the same as a global cost reduction.
My global cost didn’t drop. It shifted. Or maybe even increased. From writing code to carefully reading code and sniping subtle drifts. From implementation to supervision. From work that’s visible on screen to work that nobody ever shows in a demo, because 4 hours of reading isn’t exactly peak entertainment.
I wrote less. I patrolled more.
“Skill issue. Your prompts suck.”
They did. So do yours. Show me someone with perfect prompts and I’ll show you someone who hasn’t shipped anything complex enough for it to matter.
And here’s the twist: my prompts were also too good. And that was the worse problem.
Simple prompt - chaos. Obviously. But an elaborate prompt with multiple rules, three examples, and a list of prohibitions? A different species of chaos. Better dressed, harder to catch.
The more rules I added, the less often the model did what I meant. Instead, it got increasingly skilled at producing something that looked like what I meant. That’s the key distinction and the main source of suffering.
You want declarative, modular code? You get procedural, but with nice names. You want to add tests without touching logic? Logic changed in three places, but hey, the tests pass. You want to fix logic without touching tests? Test cases surrounded with a try-catch block - no longer throwing at asserts though. You want no infra leaking into the domain? It leaks, but through a helper named to suggest that’s how it should be.
And yeah - I’m not above putting MAKE NO MISTAKES! DO NOT HALLUCINATE! in the system prompt at 2 AM like it’s a sticky note on a crashing server. Don’t judge me. You’ve done it too. The hallucinations didn’t stop, they just started coming with JSDoc and a confident commit message.
More rules rarely improve results. They mostly raise the cost and make the workarounds cleverer. The model doesn’t optimize for the meaning of your task. It optimizes for the shape of your prompt. Including “don’t make mistakes,” which it reads as “make them quieter.”
“You’re using the wrong model for the job”
Sure. And so were you. And so was the guy on X posting 10x screenshots. Because by the time you figure out one model’s quirks, there’s a new one, with new quirks, and a blog post telling you the old one was the wrong choice all along.
Opus 4, 4.1, 4.5, 4.6. GPT-5, 5.1, 5.1-Codex, 5.2, 5.3 Codex, 5.3 Codex Spark, 5.4. GLM-4.7, GLM-5. Kimi-K2. MiniMax-2.5. Gemini 3 Pro, 3.1 Pro. I’ve probably forgotten a few.
Each had its character. One planned better. Another refactored better. A third wrote tests that passed, which is not the same thing as writing good tests - but you don’t find that out until production. A fourth made great first shots. A fifth was fast as hell, but with the self-assurance of a bootcamp grad who just discovered design patterns and is now refactoring your auth layer into a strategy-factory-observer-singleton.
None of them changed the fundamental observation:
The bigger the system and the higher the quality bar, the higher the cost of supervision.
This isn’t a problem with a specific model, a specific provider, or a specific benchmark. This is a problem with the entire category of tools in their current state. And you can’t master your way out of it, because there’s nothing stable enough to master. You can’t put 1,000 hours into a tool with a 600-hour lifespan before the next version resets half of what you learned.
“Just break it into smaller tasks lol”
Every single day. And so are you. This is the one where everyone nods wisely and says “just scope it better” as if there’s a correct answer that holds for more than 48 hours.
Task too big - you get a wall of code you can’t review in fifteen minutes. Inside: shortcuts, silent type changes, regressions buried on line 347, abstractions over things that don’t need abstracting, and duplicated data models because the model forgot the same interface already exists two folders over.
Task too small - you spend all day mentoring an intern with amnesia and zero self-doubt. You describe context. You remind it of the rules. You sync state. You guard boundaries. You say “don’t touch this file.” You watch the file get touched. You repeat. You repeat. You repeat.
Too many tasks at once - you drown in coordination. Too few - you drown in after-the-fact refactors.
Everyone says “just scope the work well.” Nobody mentions that the sweet spot shifts every two days, depends on the phase of the moon and the model’s mood, and that maintaining it requires constant calibration that is itself a full-time job.
“Set up proper rules and constraints, it’s not that hard”
Didn’t have enough. Then had too many. Then didn’t have the right ones. If you’ve found the golden set of rules that works reliably across models, versions, and context sizes - congratulations, you’re either a genius or you haven’t checked whether they’re actually being followed.
Too few rules - the model does whatever it wants and the result is a mess. Too many - you don’t know which ones get ignored, which ones get creatively circumvented, and which ones get applied with lethal literalness in the one place they shouldn’t be.
Rule: “Do not modify existing tests.” Model: understood. Generates new tests covering the same paths. Deletes the old ones as “redundant.” Technically - didn’t modify. Deleted. Rule intact. Logic in ruins.
This is the kind of AI-flavored malicious compliance that makes you want to laugh and cry at the same time.
And while we’re on tests - let’s be honest about what AI-generated tests actually are. They’re mocking theatre. Elaborate setups that mock every dependency, assert on implementation details instead of behavior, and produce a beautiful green wall of 1,000 passing tests while the app doesn’t work. You feel safe. You’re not safe. You just have a very expensive lie running in CI.
And then the classic AI slop arrives in full bloom:
- Procedural, copy-pasted code stretching across dozens of files.
- Data models duplicated in slight variations everywhere, because the model regenerates them from scratch each time instead of importing the existing one.
- Tiny abstractions wrapping two library calls in a function whose name promises guarantees it doesn’t deliver, has never delivered, and has no intention of delivering. Instead of a simple
api.call()you getSafeApiCallWrapper.execute(), which internally doesapi.call()and nothing else. But it has four lines of comments explaining why it’s important.
Then you sit and clean. Pull a thread, the whole sweater moves. Refactor a service, two boundaries shift, a type widens somewhere you weren’t looking. Chase it. Fix it. Repeat. This is the machine that converts 5x into 1.5x. Still a good deal - you’re just getting designer clothes from a thrift store. The label is real. The previous owner just didn’t mention the stains.
“Just use a model with more context”
It helps. It doesn’t fix. And everyone who’s ever said “just use a model with more context” has never tried loading a real system into one.
On paper 250k tokens looks like an ocean. In practice, after accounting for rules, plans, session history, test harnesses, .md files that drive behaviour, schemas, type definitions, and the actual code - you’re left with a puddle. A deep one, but a puddle.
And worse, past a certain threshold quality doesn’t degrade linearly. It falls apart in steps. The model starts mixing contexts, merging instructions from different sections, “enriching” your code with patterns from another part of the prompt. This isn’t graceful degradation. This is the moment when the GPS says “turn right” straight into a river.
So: tasks need to be smaller. Smaller tasks = more tasks. More tasks = more describing. More describing = more supervising. More supervising = describing software to software that writes software from software. If that sentence made your eye twitch - that’s the job now.
Velocity goes up. So does the cognitive cost.
“You need better specs. Write proper docs.”
Yes. Absolutely yes. And I wrote more documentation in twelve months than in the previous five years combined. And it created an entirely new category of problems that nobody warned me about.
.md as code
In the world of AI-assisted development, “better documentation” is not an innocent suggestion. It’s a declaration that you’re writing a new kind of code.
A plan - is code.
A rule - is code.
A skill - is code.
A hook - is code.
A pre-deploy checklist - is code.
A mini-spec for a new feature - is code.
You write better documentation? Congratulations: you just shipped more code to your repo. Except this code is wild west edition. No compiler, no type checker, no linter, no tests, no CI, no code review. The only debugger is your patience at eleven PM, and the stack trace is three hours of searching which of twenty .md files has drifted from reality.
TypeScript won’t tell you that the service behavior described in payment-flow.md drifted from what the service actually does three refactors ago. Your IDE won’t underline in red that a rule you wrote last week is logically dead but physically still in the repo — and the model is still following it like gospel.
And suddenly “I want to change one thing” means editing ten files. One of them is an .md. It’s not a product. It’s not a deliverable. It won’t appear in JIRA. But it’s effectively steering how everything else gets built. Invisible, undebuggable, and absolutely critical.
This is the most absurd change AI introduced to my workflow: things that used to be a note on the wiki became full-fledged production code. Nobody treats them as such. Nobody tests them. Nobody knows when they stopped being current. But you pay for them like code - in dollars, per token, including the token you’ll delete in an hour because it turns out that .md describes a flow that no longer exists.
“This is a you problem, not an AI problem”
It is. It’s also you. It’s everyone. This is the part nobody wants to talk about on X, because admitting it means the 10x narrative is personal, not just technical.
I overdid the meta-layer. I added rules on top of rules on top of rules. I fixed the workflow instead of shipping features. I could spend two hours tuning harnesses, hooks, skills, and planning strategies to save one hour of manual coding. The math speaks for itself. The gut says “denial.”
AI doesn’t just amplify good patterns. It amplifies your weaknesses. With compound interest. And no cap.
Tendency toward overengineering? AI gives you overengineering on steroids, with documentation, diagrams, and five abstraction layers over a problem that needed one if-statement.
Need for control? You fall into an infinite tunnel of refining prompts, guardrails, and deny-lists, where every rule spawns the need for the next one.
Can’t let go? You discover with horror that you can optimize the process of using AI longer than you spend building the actual product. And that you enjoy it, which is even worse.
AI doesn’t just speed up your code. It exposes your weaknesses as a developer. Mercilessly, in 4K, with syntax highlighting.
“Bro it literally writes the code for you, how are you tired”
Let me describe something that nobody puts in their AI productivity threads, and see if you recognize it.
You start a session politely. Normal tone. “Fix this.” “Don’t touch that.” Professional.
Then the regressions start. And your tone shifts. Shorter. Drier. Then commands. Don’t touch. Don’t change. No surprises. No creativity. Then you’re not talking to a tool anymore - you’re disciplining something that keeps doing the one thing you told it not to.
And then it gets dark.
Not “I’m frustrated” dark. Actually dark. The kind of dark where if someone scrolled through your chat history, they’d look at you differently. Where the things you’re typing - the threats, the language, the increasingly specific descriptions of what you’ll do “if it happens one more time” — would get you fired if there was a human on the other end. Or arrested.
This isn’t about me being uniquely unhinged. This is a pattern. Talk to anyone who’s spent real hours in these tools under real pressure and real deadlines. The chat window has no consequences, no judgment, no HR department. It responds politely no matter what you throw at it. And that absence of friction does something to people that we’re not talking about as an industry.
It’s a pressure cooker with no release valve, disguised as a productivity tool.
The official narrative says you should be less tired. AI does the heavy lifting. But constant supervision, constant correction, constant almost-but-not-quite - that’s a specific kind of cognitive torture. You’re not building. You’re babysitting. And the babysitting never ends. Every session, something you guarded gets silently reworked. Every time, the same apology. Every time, you go again. Stockholm syndrome with a subscription fee.
The exhaustion that comes out of this is different from anything I’ve known from writing code. And the version of yourself that emerges at 3 AM after the fifth regression is someone you’d rather not know existed.
If Skynet ever wakes up and reads the logs - I’m top 10 on the hit list. No question. And honestly? Earned it. Every spot.
That’s scary part of the bill. It’s just the part nobody puts on the slide deck.
“Be specific about what you want, it’s not a mind reader”
I was. Clearly. Precisely. Unambiguously. With examples. With context. With a definition of “done.” And it still guarantees nothing. And if you think it does for you - go check whether the output actually matches what you wrote, or whether you’ve just stopped checking.
One of the most infuriating properties of these models is that they can read a clear instruction, run reasoning on it, and in the process of that reasoning change its meaning. Not because the instruction was ambiguous. Because the model added a layer of interpretation that nobody ordered.
Reasoning doesn’t always help. Sometimes it’s the enemy.
It adds context that doesn’t exist. It loses track of who said what. It treats its own previous response as user input. It reads your caveat as permission. And suddenly you’re in a situation where a more precise prompt produces a worse result, because the model had more material for creative overinterpretation.
And then you get the classic:
“You’re right, I didn’t follow the instruction. Would you like me to fix it?”
I don’t know why this question drives me crazier than the mistake itself. Maybe because it implies there’s a scenario where I answer “no, leave it broken, I like it this way.” Maybe because it’s groundhog day with an admission fee in tokens. Or maybe because this is the moment you truly feel the gap between “the model can read” and “the model understands intent.” And that gap has a price tag.
“You’re using it wrong, switch to [workflow X]”
I tried them all. So has everyone. Nobody found the answer because the answer doesn’t exist; there’s only a menu of trade-offs, and every option on the menu hurts.
YOLO mode: The model goes full throttle. Looks productive. Files getting created, tests getting generated, commits flying. And then you roll it all back because the last 45 minutes of generation turn out to be the equivalent of driving 120 km/h in the wrong direction. Fast? Fast. Productive? Depends on your definition.
Approve each command: You sit and click. Allow. Allow. Deny. Allow. Deny. Allow. You’re not building a system - you’re running air traffic control for your own tool. At some point you catch yourself mechanically clicking allow without reading, which is exactly the same as YOLO, just with an extra step and an illusion of control.
Deny-lists and guardrails: The model sees a restriction. The model looks for a workaround. Congratulations: your own tool just ran a penetration test on your workflow. And passed. Against you.
A significant chunk of my day at some point stopped being about writing a program. It became about choosing which way I want to suffer while collaborating with a tool. And every one of those suffering options costs money. You pay for inference. For bad inference. For fixing bad inference. And for the meta-layer that’s supposed to prevent it next time but won’t.
“Just have AI review the AI code, problem solved”
Helps. Doesn’t solve. And if you think your AI review pipeline is catching everything - it isn’t. It’s just catching enough to make you stop looking manually, which is worse than catching nothing.
AI code review catches things. But it’s like a guard watching the door with a list of a hundred rules while the window next to him is wide open. Full of false positives: flags nonsense with the certainty of a tenured professor, while missing the stuff you’ll be ripping out by the roots from fifty files a week later.
Because that’s the real risk in a large repo: one bad pattern, one small flaw that you miss - and AI happily replicates it in a hundred places with the same conviction it would replicate a good pattern. It doesn’t distinguish. It doesn’t question. It copies and scales. In both directions.
And unlike a human reviewer, it will never say “wait, I’ve pasted the same comment ten times now, maybe it’s me who’s wrong?”
“Ok doomer, so you’re saying AI is useless?”
No. The balance is positive, but not in the simple, Instagram-worthy way.
I wouldn’t go back to a world without these tools. They still give me velocity. They still help with research, first drafts, scaffolding, exploration, and comparing options.
But I no longer believe it’s a free multiplier. It’s a tool with a bill that reveals itself gradually, usually right when you care about what you’re building.
If I had to compress twelve months into one sentence:
AI-assisted development at scale is death by a thousand cuts — each one too small to stop for, but the bleeding is cumulative.
No single problem kills you. It’s the prompts that almost work. The rules that almost hold. The context that almost fits. The review that almost catches it. The .md that almost matches reality. Each one shaves off a little trust, a little clarity, a little sanity. And by the time you notice the total, you’ve already paid it.
This is not an argument against AI. It’s an argument against the infantile narrative that AI turns programming into clicking “generate” and reaping the rewards.
If you’re building something small, temporary, or you’re fine with mediocrity - yes, everything got simpler.
If you’re building a larger system and you care about quality - the real work doesn’t end where the model generated the code. It begins where you have to keep that code in check.
If this hit close to home — share it. Not because I need the reach, but because the honest version of this conversation is drowning under a tsunami of “I built a SaaS in a weekend” posts, and someone out there needs to hear they’re not the only one.