Part 1: Asking for a Friend
Last week I posted something on LinkedIn that got more traction than I expected. Pete Hodgson had shared a research paper about AI-generated code quality, and I replied with this:
"So, uh, if my friend just had Claude Code slam out 4,000 lines of mostly working Python, which he's been testing and having the AI fix up, but he hasn't really looked at the code much, am I, I mean, is he in trouble? Asking for a friend."
Pete, to his credit, gave a genuinely useful answer rather than just dunking on me: it depends on how much you expect to extend and modify the code. If it’s a throwaway tool, probably fine. If not, you probably want some human-in-the-loop steering — and HITL steering the design is way better than HITL reviewing thousands of lines of Python after the fact.
Good advice. Possibly advice I should have applied earlier.
The “friend” in question is me. Over the past few weeks I’ve had Claude Code rewrite a legacy Perl infrastructure management script in Python, about 4,000 lines or roughly 100KB of working code. I’ve been testing it, prompting Claude to make fixes, iterating. What I haven’t been doing much of is reading the actual Python.
The paper Pete shared is called SlopCodeBench, from researchers at the University of Wisconsin–Madison, Washington State University, and MIT. The research question: if you ask a coding agent to build something and keep extending it with new requirements, what happens to the code quality over time? The benchmark covers 20 problems and 93 checkpoints, tracking two quality signals:
- verbosity — measured by redundant or duplicated code
- structural erosion — complexity concentrating in a small number of very complex functions such as god classes, 800-line methods, and so on
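Both signals can be approximated cheaply. Here's a minimal sketch (my illustration, not the paper's actual methodology) that flags overly long functions as a crude proxy for structural erosion, using Python's `ast` module:

```python
import ast

def long_functions(source: str, max_lines: int = 50):
    """Flag functions whose span exceeds max_lines -- a crude
    proxy for complexity concentrating in a few huge functions."""
    tree = ast.parse(source)
    flagged = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            span = node.end_lineno - node.lineno + 1
            if span > max_lines:
                flagged.append((node.name, span))
    return flagged

# Synthetic example: one short function, one 61-line sprawl.
sample = "def tiny():\n    return 1\n\ndef sprawling():\n" + \
    "\n".join(f"    x{i} = {i}" for i in range(60))

print(long_functions(sample))  # -> [('sprawling', 61)]
```

Running something like this on each agent checkpoint would let you watch the erosion happen rather than discover it at 2,000 lines.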
The numbers are bleak. No agent solved any problem end-to-end across the 11 models tested. Structural erosion affected 80% of trajectories and verbosity increased in nearly 90%. Compared against human-authored open-source Python repositories, agent code was 2.2x more verbose — and while human code stayed flat over time, agent code deteriorated with each iteration. The key finding: code can pass the test suite and work well enough to deliver value, yet still become progressively harder to extend. Pass-rate benchmarks (SWE-bench and friends), the source of most of the headline "this model is better than that model" rhetoric, systematically undermeasure extension robustness.
The researchers also tested “prompt interventions” such as asking agents to plan carefully, refactor for quality, or focus only on new features. These helped initial code quality but didn’t stop the decay. Pete observed in the thread that the prompts weren’t particularly well-designed; “refactor to ensure high quality” is only marginally more useful than “make no mistakes.” Jonny LeRoy raised a more structured red/green/refactor harness as a possible improvement. Pete’s intuition is that it would help, but maintainability would still degrade far faster than for human-authored code.
The underlying problem isn’t that AI writes bad code. It’s that it writes plausible but sometimes unprincipled code — code that works today, passes the tests, but quietly becomes a liability. And it won’t tell you which you’ve got.
Part 2: The God Class in the Room
So, back to my friend.
The Python codebase Claude produced was largely working. Unit tests passed, and a few hours of real-world testing showed it was successfully managing a fairly complex piece of my infrastructure. But somewhere around 100KB of total code I noticed something: the main file had grown to about 50KB (2,000 lines) and Claude Code, when it needed to make edits, had started reaching for sed to find and modify code within that file. When I saw that, it was a serious alarm bell.
I’d been operating in “just get the AI to do it” mode. I had an old Perl script, I wanted a new Python script with better lifecycle management and new features, and surely the old code was better than any spec I could write? I was purely focused on getting it working and treated the Python as a black box. But rather than read through 4,000 lines of Python myself to figure out why the code had gotten out of hand, I pasted the god class and the SlopCodeBench paper into ChatGPT and asked it to analyze the code in light of the research.
ChatGPT’s diagnosis was pointed. Yes, there were problems. But beyond the symptoms the paper describes — verbosity, structural erosion — it identified something more fundamental: the code was missing a key system metaphor.
This is a concept from Extreme Programming that shows up in domain-driven design thinking too. A system metaphor is a shared, coherent central abstraction that the whole design speaks to. The right metaphor allows you to form a simple mental model that makes the system’s structure obvious, that gives you a vocabulary for talking about it, that lets you make good decisions about where new things should go. Ward Cunningham described it as “a story that everyone on the team can tell about how the system works.”
In my case, ChatGPT pointed out that much of the code was handling inline checks about the state of the core infrastructure component, but without modeling that component directly. This led to a lot of what looked like repeated special-case code. The fix was obvious in retrospect — introduce an explicit model for that infrastructure component and encapsulate the state checking and lifecycle handling, reducing and simplifying the surrounding code. The funny thing is that it took a different AI model, prompted with academic research about the failure mode, to surface the missing metaphor. Claude, left to its own devices iterating on the codebase it had built, hadn’t found its way there (although, to be fair, I expect a fresh Claude with the god class and the research paper would have come to many of the same conclusions).
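To make the shape of that refactoring concrete, here’s a minimal sketch — all names are hypothetical, not from my actual script. Instead of re-deriving the component’s state inline at every call site, one class owns the state and the lifecycle rules:

```python
from enum import Enum

class State(Enum):
    STOPPED = "stopped"
    STARTING = "starting"
    RUNNING = "running"

# Before (the pattern Claude kept producing): inline checks repeated everywhere.
#   if get_state(name) in ("starting", "running") and not is_draining(name): ...

class ManagedComponent:
    """An explicit model for the infrastructure component: state checks
    and lifecycle transitions live here instead of in scattered call sites."""

    def __init__(self, name: str, state: State = State.STOPPED):
        self.name = name
        self.state = state

    @property
    def is_active(self) -> bool:
        return self.state in (State.STARTING, State.RUNNING)

    def start(self) -> None:
        # Lifecycle rule encoded once: only a stopped component can start.
        if self.state is State.STOPPED:
            self.state = State.STARTING

comp = ManagedComponent("db-primary")
comp.start()
print(comp.is_active)  # -> True
```

The point isn’t the class itself; it’s that every special case that previously lived inline now has one obvious home.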
I asked Claude to fix it, at medium effort, using Sonnet. It didn’t, really; it made some surface adjustments and declared victory. I had to push repeatedly, asking about the structural change and pointing out that the god class was still 2,000 lines, before it did the right thing and fixed it. So even with real insight in hand, it took some dancing to get things fixed to my satisfaction.
Part 3: Variability with Mature Codebases
The Python experiment was greenfield — Claude wrote all of the Python, so its failure modes were entirely its own. But I also have a codebase that has been evolving for years under human authorship, where the AI is a guest rather than the original architect. It’s a C# codebase that’s about five years old, now sitting at around 60,000 lines. Claude’s behavior there is noticeably different, and the problems are subtler.
The variability is the first thing you notice. On low effort using Sonnet, the results are genuinely random. Some things come out well; others contain decisions that are quietly wrong and likely to cause problems in the future. Last week I found that Claude had serialized an enum into JSON as integers 0 or 1, because those happened to be the only two values at the time. It then consumed the 0 or 1 in a web app which tacitly assumed those were the only two possible values. It’s functionally fine and works today, but any developer looking at it would immediately want a string representation in the JSON using the enum’s actual name, which is how it ‘should’ be done. The integer serialization sat in the codebase for two weeks before I caught it.
The telling part came when I highlighted the relevant lines and asked, “isn’t there a better way of doing this?” On low effort, without any extended thinking, Claude immediately knew. It fixed the serialization correctly using a more standard and future-proof pattern. The knowledge was there all along, somewhere inside the LLM; it just hadn’t been applied. That’s a specific kind of failure mode: the AI not bringing its judgment to bear unless you directly prompt it to.
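The same trap exists in Python, where my rewrite lives. The standard library will happily serialize an enum’s integer value, baking today’s ordinals into the wire format; serializing the name survives reordering and new members. A sketch with a hypothetical enum, not the one from my codebase:

```python
import json
from enum import Enum

class Mode(Enum):
    MANUAL = 0
    AUTO = 1

record = {"mode": Mode.AUTO}

# Fragile: bakes today's integer values into the wire format.
fragile = json.dumps({"mode": record["mode"].value})   # {"mode": 1}

# Robust: serialize the name; renumbering or adding members won't break readers.
robust = json.dumps({"mode": record["mode"].name})     # {"mode": "AUTO"}

# Round-trip by name lookup.
parsed = Mode[json.loads(robust)["mode"]]
print(fragile, robust, parsed)
```

Nothing here is exotic, which is exactly the point: the AI knew this pattern and simply didn’t reach for it unprompted.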
On the flip side the successes are genuinely impressive. As a well-structured codebase matures it grows in capability and requires fewer lines of code to do something new. The internal APIs accumulate expressiveness and raise the overall level of the abstraction. Claude can tap into that to a surprising extent — I’ve had it deliver working implementations of large features in minutes that would have taken me days to code by hand, in a way that feels idiomatic to the existing code. That’s super valuable and I’ve moved several big features from “seems like a good idea” to actual implementation.
Claude will also happily balloon your 1,000-line class to 1,500 lines, and then 2,000. Whether to create a new class for something or pile it into an existing file appears somewhat random. The moment you say “I’m getting uncomfortable with how big this is getting, can we do something better?” it does the right thing: sensible decomposition, new classes, sometimes even unit tests for the new thing. It knew; it just didn’t volunteer it.
One response to this is to invest more seriously in your CLAUDE.md setup. These are configuration files that act as a persistent system prompt for the agent, encoding conventions, architectural preferences, and even workflow such as TDD. I haven’t gone deep on this myself, but it’s a reasonable partial answer to some of the problems I’m seeing. At the level of engagement I’m bringing to this — not naive, not expert, probably a reasonable proxy for a typical enterprise developer picking up these tools for the first time — the AI will do a lot of impressive things and also make a lot of small, slightly wrong decisions, and it won’t tell you about them.
Part 4: The World’s Most Instructive Vibe-Coded Codebase
I’m a big Claude Code fan. It’s been genuinely transformative for how I work, and Anthropic is my favorite large-scale AI lab, both for the products they’re creating and for their genuine concern about AI risks. With that said.
Earlier this year, Claude Code’s 500,000-line source code was leaked. The reaction in some quarters was that competitors would catch up with Anthropic now that their “secret sauce” was out in the wild. Others felt there were architectural insights to be mined from the code and that the hierarchical memory system and unreleased autonomous agent features were worth studying. Sabrina Ramonov’s technical deep-dive found good elements such as the async generator architecture, the bash security system, and the prompt cache boundary design. These are not the work of people who don’t know what they’re doing.
But the leaked code is a codebase of two halves, and the less flattering half is an almost perfect illustration of what the SlopCodeBench paper describes, at scale.
So that’s where all the five hundred thousand lines come from - fallback conditions and then more fallback conditions to compensate for the variable output of all the other fallback conditions. — @jonny@neuromatch.social
Everything in Claude Code is done in multiple different ways, with those different approaches jammed in wherever they’d fit. Whatever coherent architecture exists is constantly undercut by special-case conditions that change how any of it actually works. The specifics are instructive.
- Exception types whose names end with _I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS — the naming convention is literally a prompt, a reminder baked into the identifier to tell the model not to leak sensitive data. It apparently ignores it anyway.
- Client-side behaviors gated behind an environment variable called USER_TYPE=ant, apparently with no server-side verification, labeled in comments as “internal only” — security by honor system.
- A regex that detects “negative emotion” in user prompts, triggering a telemetry event. On a language model. A regex.
The image compression code is a particular lowlight. A single API call has twenty-two opportunities to recompress the same image across nine independent conditional code paths, with multiple layers of fallback logic that partially duplicate each other, inconsistent return types between branches, and a function called createCompressedImageResult that, in one branch, does nothing. This is not malicious. It’s the accumulated residue of iterative AI generation, each session solving the immediate problem without seeing the whole.
Then there’s the verification architecture. After each subagent runs, Claude Code spins up another agent to check whether the first one did what it was supposed to. The checking agent runs on a smaller, cheaper model than the agent being checked. And the mechanism for triggering the JSON schema validation step is not a function call — it’s a prompt. The system asks the model to call the validation tool, and if it doesn’t, there’s an entire error category for “agent finished without calling structured output tool,” handled by treating the job as cancelled. I’d use a facepalm emoji but then people would complain I had AI write this article.
Ramonov’s analysis also found a compaction bug generating 250,000 wasted API calls per day, documented in a dated internal comment three weeks before the leak, fixed eventually with a three-line change. This was the second time the same source map packaging bug leaked private assets into the public build; the first was on Claude Code’s launch day in February 2025.
Anthropic ran an entire ad campaign with the tagline “Claude Code is written with Claude Code.” Now that the leak lets us see the variable quality of the code, that tagline reads as instructive rather than as a glowing endorsement.
Anthropic’s engineers are not bad. The good parts of the codebase are genuinely good. But we have strong evidence of what happens to a large, fast-moving codebase when the humans in the loop aren’t reading the code closely enough. Velocity and feature delivery clearly took priority over stepping back and asking “what is this thing, actually, and how should it be structured?” The Register noted that quality complaints in the Claude Code GitHub repo have escalated sharply. April is already on pace to exceed March’s issue count, which was itself a 3.5x jump over the January–February baseline. The tool’s architectural inefficiencies are a significant problem: every redundant code path and bloated prompt structure ripples outward into the token consumption of every session it runs. At a moment when we’re genuinely debating whether we can produce enough electricity for the AI industry’s ambitions, that’s really darned important.
Both things are true: there is good architecture in Claude Code, and there is also an incomprehensible mess. That’s actually the point. You don’t get to know which is which without reading the code.
Part 5: More Agents, More Problems
If the previous section hasn’t put you off agent-assisted development entirely, you may be wondering whether the answer is simply more agents. Coordinate a swarm of them, assign specialized roles, add supervisory layers — surely that irons things out? The most prominent current example is Steve Yegge’s Gas Town, which he launched on New Year’s Day to considerable fanfare.
Yegge is not someone to dismiss. He’s an ex-Amazon, ex-Google engineer with three decades of experience. Wrote an excellent book (with Gene Kim, co-author of Accelerate) on vibe coding. Gas Town is technically ambitious: a Go-based orchestrator managing 20–30 parallel Claude Code instances in distinct roles, a Mayor that coordinates work, Polecats that handle tasks, a Witness that monitors progress, all built on a git-backed state system called Beads. The goal is to keep the pipeline fed so development velocity is no longer the bottleneck.
Yegge himself describes Gas Town as “definitely sloppy,” requiring “a lot of manual steering and course-correction.” The codebase is 100% vibe coded, three weeks old at launch, and he warns that you need to be at “Stage 7” of AI-assisted development — managing 10 or more agents by hand, regularly — before you can use it effectively. That’s… not a lot of qualified people.
The community reaction has been mixed. Maggie Appleton wrote one of the more clear-eyed analyses, and the user reports she collected are instructive:
“The Mayor is dumb as rocks, the Witness regularly forgets to look at stuff, the Deacon makes his own rules, the crew have the object permanence of a tank full of goldfish.” — astrra.space
Her conclusion — that Gas Town fits the shape of Yegge’s brain and no one else’s — might be true.
What’s genuinely interesting in Appleton’s piece is a point that cuts against the hype in a different direction. When you have a swarm of agents burning through implementation tasks, development time stops being the bottleneck and design becomes the limiting factor. Yegge notes that Gas Town churns through implementation plans so quickly that you have to do a lot of design and planning just to keep the engine fed. This is an argument that human taste and judgment become more important at scale, not less.
That tracks with Pete Hodgson’s read on SlopCodeBench: the AI clankers aren’t replacing software engineers any time soon. They’ll do the coding. The design — knowing what you’re building, why the abstractions are the way they are, when a class has gotten too big — still requires human judgment. Agent mobbing doesn’t solve that. At best it defers it; at worst it industrializes the problem, producing slop at a rate no single engineer could match.
For every enthusiastic Gas Town anecdote there’s a cautionary tale to be found online too. The write-only code problem isn’t solved by writing write-only code faster.
Part 6: How Much Attention Should You Actually Pay?
None of these drawbacks mean you should stop using these tools. The productivity gains are real. I’ve had Claude Code quickly deliver working implementations of features I’d been deferring for years, completely changing my calculus about what was worth building. The question isn’t whether to use AI coding assistance, it’s how to calibrate your engagement based on what you’re building.
I think it really depends on how much you’re expecting to modify and extend that code over time. If it’s a throwaway tool, or something that you’ll not need to extend, then great! Or, if you have a really nice external interface around it, and can just rewrite it when you need to change it, then great! But if not, then you probably want some HITL at some point, and HITL steering the design is way better than HITL reviewing 100KLOC of Python! — Pete Hodgson
A rough framework, based on my experience:
Throwaway analysis scripts. YOLO. Prompt until it works, sanity-check the outputs, move on. Pete Hodgson put it well: if you’re not expecting to extend it, and if you have a clean external interface around it such that you could just rewrite it when needed, you’re fine. The risk is proportional to your dependence on the internals.
Operational tooling you’ll actually maintain. This is where the trap is. The first version works, and because it works you don’t look at it closely, and then six months later you’ve got a 50KB god class and Claude is using sed to edit it. The time to think about architecture is before you’re 4,000 lines deep. Ask “what’s the right metaphor here?” early and often. What’s the central abstraction that the whole design, or the current piece of the design, should speak to? If you can’t answer that, neither can the AI, and you’ll end up with an incidental implementation rather than a deliberate one.
Durable, evolving codebases. Treat the AI like a very fast junior developer who writes prolifically and tests inconsistently. Review changes, don’t just push things blindly. Hit escape when you notice the agent going in the wrong direction. When something looks off, a light intervention — highlighting a few lines and asking “is there a better way of doing this?” — is often enough to nudge things in the right direction. The knowledge is usually there within the AI, it just needs to be prompted.
Worth noting: Claude does write tests, and regularly produces a failing test or two that it then has to fix. That’s not TDD in any formal sense, but it’s also not nothing and can provide some self-correction. I’ve found this to be a reasonable working pattern without having to enforce a strict red/green/refactor discipline.
The CLAUDE.md files are worth more investment than most developers give them. A well-crafted configuration that encodes your architectural and process preferences acts as a standing contract with the agent. You’re essentially writing a system prompt for your own development environment and the returns on that investment compound over time.
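As a sketch of what that contract can look like, here’s a fragment of the kind of thing mine is slowly accumulating — the specific rules are illustrative, not a recommended canon:

```markdown
# CLAUDE.md (fragment)

## Architecture
- Prefer extending existing abstractions over adding new special cases.
- Flag any class approaching ~500 lines; propose a decomposition before adding to it.
- Serialize enums by name, never by ordinal value.

## Workflow
- Write or update a failing test before implementing a fix.
- After each feature, list any duplication you introduced and where.
```

Notice that these encode exactly the failures described above — the god class, the integer enum — as standing rules rather than things I have to notice and prompt for each time.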
Closing: Read the Code
There is no write-only code. At some point, someone has to read it. The question is whether that someone is you, now, while it’s still tractable, or you, later, when the god class is three thousand lines and the system metaphor is buried under six months of iterations.
The genuinely hopeful signal in all of this is that AI can analyze AI code. When I pasted my 50KB god class and the SlopCodeBench paper into ChatGPT, it identified a missing system metaphor and suggested constructive ways for me to review critical parts of the code without reading several thousand lines of Python. These tools can catch their own failure modes, given the right prompt and a fresh perspective.
What that means in practice is that the programmer who slows down to review AI output, who asks “really?” when something looks off, who occasionally takes a step back and gets a second model’s opinion on what the first one built — that programmer isn’t doing it wrong. They’re doing exactly what the moment requires. Software engineering today is not agent swarms and it’s not vibe coding. It’s a developer and an AI, working together, with the developer in the loop.