AI Writes 1.7× More Bugs. Two Disciplines Decide Whether They Ship.

AI Made Code Cheap. It Also Made Correctness the Bottleneck.

In December 2025, CodeRabbit (a code-review vendor, so read the number with that in mind) published "State of AI vs Human Code Generation," an analysis of 470 real-world open-source pull requests. The headline: AI-co-authored PRs carried roughly 1.7× more issues than human-only ones. Underneath that average, the breakdown is sharper than the summary suggests. Logic and correctness issues were 75% more common. Security vulnerabilities ran up to ~2.7× higher. Readability problems showed up more than 3× as often. And excessive-I/O operations (the read-it-twice, N+1, fetch-the-world category) appeared ~8× as often.

Here's the part that matters more than the multiplier. These are not typos. A typo throws a parse error and never leaves your editor. The defects that actually ship are the silent ones: business-logic errors, unsafe control flow, an off-by-one in a boundary nobody wrote a test for. They compile. They pass CI. They survive code review because they look right — the AI is very good at producing code that looks right. They clear staging because staging never hits the edge that breaks them. They surface in production, under real load, against real data, in front of real users.

The vendor caveat cuts both ways: CodeRabbit sells AI code review, so a finding that AI code needs more review is convenient for them. But the direction matches what anyone shipping AI-assisted code has felt. Generation got cheap. Typing is no longer the bottleneck. Correctness is. And you control correctness in exactly two places: the gates that measure what your code actually does once it's rendered, and the structure of how you orchestrate the agents that wrote it. This post is about both, because both are the same thing — engineering discipline for AI code.

Why Your Tests Don't Catch Them

The instinct is "write more tests." It's the wrong instinct, and the reason is structural.

When you let an agent write a feature and its tests in the same pass, the tests inherit the code's blind spots. The model that misunderstood the empty-array case writes a test suite that doesn't probe the empty-array case, because in the model's picture of the problem, that case doesn't exist. The test passes. The bug ships. Tests written by the same intelligence that wrote the code are a consistency check, not a correctness check. They confirm the code does what the model thought it should do. They say nothing about whether what the model thought was right.
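A minimal sketch of that failure, using a hypothetical `average` helper that is not taken from any real codebase: the implementation and its same-pass tests share the assumption that the input is never empty, so the suite is green while the case nobody described quietly produces `NaN`.

```javascript
// Hypothetical helper, written with the unstated assumption that `values` is non-empty.
function average(values) {
  const sum = values.reduce((acc, v) => acc + v, 0);
  return sum / values.length; // 0 / 0 when `values` is empty
}

// The "same-pass" test suite: it checks only the cases the author had in mind.
function sameAuthorTestsPass() {
  return average([2, 4, 6]) === 4 && average([10]) === 10;
}

// The case nobody described: no exception and no failed test, just a NaN
// that flows into whatever consumes the result.
const emptyResult = average([]);
```

The suite is consistent with the code, and both are wrong about the same input.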

That's the deeper issue: tests assert intended behavior. Silent defects live in the unstated edge cases. The arXiv survey "A Survey of Bugs in AI-Generated Code" (2512.05239) catalogs exactly this — the dominant failure modes in generated code are semantic and logical, not syntactic. The bug isn't in the path you described; it's in the path you didn't think to describe, which is precisely the path the model also didn't think to handle.

A concrete one from my own work: a scroll-reveal animation where the content stayed permanently invisible. The implementation gated an element's opacity on an IntersectionObserver firing at a 0.15 threshold — reveal the content once 15% of it scrolls into view. Reasonable. Except the element was taller than the viewport, so 15% of it could never be on screen at once. The threshold was unsatisfiable. The content sat at opacity: 0 forever. Every unit test passed, because every test asserted the intended behavior: "when the observer fires, reveal." None asserted the unstated precondition: "the observer can fire for an element this tall." I wrote that up in full in a post on that exact failure mode — it's the cleanest example I have of a defect that lives entirely in the gap between intent and reality.
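The arithmetic behind that bug is small enough to check directly. The sketch below is an illustration, not the code from the original fix. `maxVisibleRatio` and `isThresholdSatisfiable` are hypothetical helpers, the pixel heights are made up, and it assumes the observer root is the viewport with no root margin and that only height limits visibility.

```javascript
// Largest fraction of an element that can be inside the viewport at once.
// Assumes the observer root is the viewport, with no rootMargin.
function maxVisibleRatio(elementHeight, viewportHeight) {
  if (elementHeight <= 0) return 1; // avoid dividing by zero; this sketch treats it as fully visible
  return Math.min(1, viewportHeight / elementHeight);
}

// A threshold is reachable only if the element can ever be that visible.
function isThresholdSatisfiable(threshold, elementHeight, viewportHeight) {
  return maxVisibleRatio(elementHeight, viewportHeight) >= threshold;
}

// 800px viewport, 6000px post body: at most about 13.3% is ever visible,
// so a 0.15 threshold can never be met.
const tallPost = isThresholdSatisfiable(0.15, 6000, 800);

// 1200px card in the same viewport: up to about 66.7% visible, so 0.15 is reachable.
const shortCard = isThresholdSatisfiable(0.15, 1200, 800);
```

Here `tallPost` is `false` and `shortCard` is `true`. A probe could feed real `getBoundingClientRect()` heights into the same check.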

You cannot test your way out of this with more of the same tests. You have to verify something the model never reasoned about: the actual rendered output.

Discipline 1 — Verify the Output, Not the Intent

The fix is to stop asking "does the code do what we meant?" and start asking "does the rendered reality match a measurable budget?" That's a different kind of check, and it catches a different class of bug.

Measure rendered pixels, not source intent. My most-used example is a contrast CI gate. A Playwright probe samples actual canvas pixels via gl.readPixels under each text element, computes the WCAG contrast ratio against the real background the GPU drew, and fails the PR if contrast regresses past a frozen budget: exactly like a bundle-size budget, but for accessibility. I cover the full setup in a separate writeup. The point for this post: the gate doesn't read the code's intent. It reads the photons. A one-line opacity tweak that no test and no reviewer flagged still gets caught, because the measurement is downstream of every blind spot in the source.
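For reference, the ratio such a gate computes is the WCAG 2.x contrast formula. This is a sketch of the math only, not the probe itself. It assumes the text and background colors have already been sampled as 8-bit sRGB triples (for example, out of the `gl.readPixels` buffer), and the `MIN_CONTRAST` budget and `passesBudget` helper are placeholders.

```javascript
// WCAG 2.x relative luminance of an 8-bit sRGB color [r, g, b].
function relativeLuminance([r, g, b]) {
  const linear = [r, g, b].map((channel) => {
    const c = channel / 255;
    return c <= 0.03928 ? c / 12.92 : ((c + 0.055) / 1.055) ** 2.4;
  });
  return 0.2126 * linear[0] + 0.7152 * linear[1] + 0.0722 * linear[2];
}

// Contrast ratio: (lighter + 0.05) / (darker + 0.05), ranging from 1 to 21.
function contrastRatio(foreground, background) {
  const a = relativeLuminance(foreground);
  const b = relativeLuminance(background);
  return (Math.max(a, b) + 0.05) / (Math.min(a, b) + 0.05);
}

// Placeholder budget: 4.5 is the WCAG AA minimum for normal-size text.
const MIN_CONTRAST = 4.5;

function passesBudget(foreground, background, budget = MIN_CONTRAST) {
  return contrastRatio(foreground, background) >= budget;
}
```

In a real gate the budget would be frozen per element rather than set globally, so a change that slips from a measured 9:1 to 5:1 still fails even though 5:1 clears AA.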

Probe the live DOM after you deploy. "Shipped" and "landed" are different claims. I've watched fixes that merged green and deployed clean never actually reach the DOM — wrong selector, a CSS variable that didn't resolve, the change merged to the wrong branch. So the discipline is: after deploy, probe production and confirm the fix is there. Did the selector resolve to a node? Did the value persist? It's a few lines:

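// After deploy: did the fix actually land in the live DOM?
// Run as an ES module; PROD_URL is assumed to come from the environment.
import assert from 'node:assert/strict';
import { chromium } from 'playwright';
const PROD_URL = process.env.PROD_URL;
const browser = await chromium.launch();
try {
  const page = await browser.newPage();
  await page.goto(PROD_URL, { waitUntil: 'networkidle' });
  const reading = await page.evaluate(() => {
    const el = document.querySelector('.post-content p');
    if (!el) return { found: false };          // selector never resolved → "shipped" was a lie
    return { found: true, opacity: getComputedStyle(el).opacity };
  });
  assert.ok(reading.found && Number(reading.opacity) > 0.99, JSON.stringify(reading));
} finally {
  await browser.close();
}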

The model can tell you it fixed something. The DOM cannot lie about it.

Try to refute the claim before you fix it. When an audit — automated or AI — hands me a "root cause," my first move is not to fix it. It's to code-read and probe and try to make the bug reproduce. More often than I'd have guessed, it doesn't. The auditor named a mechanism that isn't in the repo: a Server Action that's actually a client fetch, a missing onChange that's actually present, a hydration bug on a component that renders fine. Here's the shape of one I see constantly: the audit confidently reports "the form doesn't submit because the server action isn't awaited," and the suggested fix is to add an await. Two minutes of code-reading shows the call is already awaited — and a probe shows the submit fires fine; the real failure was a disabled button upstream, gated on a validation flag that never cleared. Had I trusted the narration, I'd have "fixed" working code and left the actual bug untouched. AI is fluent at narrating a plausible failure. Adversarial verification — actively trying to disprove the diagnosis — keeps you from spending a sprint fixing a bug that was never there.

Gate the class, not the instance. When a probe catches a real defect, fixing that instance is half the job. The other half is turning the catch into a permanent gate so the class can't recur. The contrast regression became a frozen budget the next change has to clear. The unsatisfiable-threshold bug became a visibility probe in the repo's verify suite that re-checks every post's body actually renders at full opacity. A caught bug that doesn't become a standing check is a bug you've agreed to catch again later.
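One way to picture "gate the class" is a frozen budget file plus a comparison step that every change has to clear. The sketch below assumes measurements arrive as a plain object keyed by selector. The selectors, the numbers, and the `findRegressions` helper are all hypothetical.

```javascript
// Frozen budget, committed to the repo: the minimum each measured value may be.
const frozenBudget = {
  '.post-content p': { minContrast: 7.0, minOpacity: 0.99 },
  '.hero h1': { minContrast: 4.5, minOpacity: 0.99 },
};

// Compare fresh measurements against the budget and list every violation.
// A selector with no measurement counts as a failure, not a pass.
function findRegressions(budget, measured) {
  const failures = [];
  for (const [selector, limits] of Object.entries(budget)) {
    const reading = measured[selector];
    if (!reading) {
      failures.push(`${selector}: no measurement (selector did not resolve?)`);
      continue;
    }
    if (reading.contrast < limits.minContrast) {
      failures.push(`${selector}: contrast ${reading.contrast} < ${limits.minContrast}`);
    }
    if (reading.opacity < limits.minOpacity) {
      failures.push(`${selector}: opacity ${reading.opacity} < ${limits.minOpacity}`);
    }
  }
  return failures;
}
```

In CI, a non-empty list fails the job. Treating a missing measurement as a failure matters: otherwise a renamed selector would silently switch the gate off.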

Discipline 2 — Orchestrate So Errors Contain, Not Amplify

The second place you control correctness is upstream of any single line: in how you fan work out across agents. Get the structure wrong and you don't just fail to catch errors — you manufacture them.

Google Research's "Towards a Science of Scaling Agent Systems" (arXiv 2512.08296) put numbers on this by evaluating 180 agent configurations. It is a benchmark study, not a guarantee about your repo — but the shape is hard to argue with. On genuinely parallelizable tasks — decomposable financial reasoning, where distinct agents can simultaneously analyze revenue, cost, and market data — centralized multi-agent coordination improved performance by 80.8%. On sequential tasks — where step N depends on step N−1 — every multi-agent variant degraded performance, by 39% to 70%. Same agents. Opposite result. The only thing that changed was whether the work was actually independent.

The mechanism is error propagation. Independent agents running unchecked amplified errors 17.2×: each one's mistake fed forward into the next with nothing in between. Route the same work through a central coordinator and that amplification dropped to 4.4×. The coordinator acts as a validation bottleneck: every handoff passes through one checkpoint that can catch a bad result before it poisons everything downstream. And the researchers could predict the optimal architecture for an unseen task 87% of the time, which means this isn't folklore; it's structural enough to model.

Two rules fall out of that, and they're the ones I actually run by:

Fan out only genuinely independent work. Three unrelated bug fixes in three unrelated modules? Parallel agents, great. A build that goes scaffold → wire → style → test, each step depending on the last? That is a sequential task wearing a parallel costume, and splitting it across agents is the 39%-to-70% degradation trap. Keep it single-agent with review checkpoints between steps. I lay out that brainstorm → spec → plan → one-subagent-per-task flow in detail in my post on master prompts; the discipline there is precisely not parallelizing the dependent stages.
Make the coordinator a verification layer, not a dispatcher. The cheapest way to buy that drop from 17.2× to 4.4× is read-only reviewers. The agent that writes the code should not be the agent that blesses it. A fresh reviewer also gets a clean context budget, which is its own discipline (context rot and agent memory are covered in a separate post). Concretely, that's a permission split:

// Implementer: can change the tree.
const implementer = { tools: ['read', 'edit', 'bash'] };

// Reviewer: can look, cannot touch. Structurally can't rubber-stamp
// its own edits, because it has no edits.
const reviewer = { tools: ['read', 'grep'] };

// Coordinator gates task N+1 on the reviewer's verdict for task N.
// runAgent(agent, task) is a stand-in for however your harness invokes an agent.
async function reviewGate(runAgent, taskN) {
  const verdict = await runAgent(reviewer, taskN);
  if (!verdict.approved) throw new Error('review gate failed');
}

A reviewer that can't edit can't quietly fix-and-pass its own work; it has to report, which means a human or a gate sees the verdict. That review gate between tasks is the orchestration-level version of Discipline 1: a checkpoint the error has to survive before it moves on.
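Both rules can be combined in one small coordinator. This is a sketch under assumptions, not a real agent framework: `runAgent` and `review` are hypothetical async callbacks standing in for whatever harness actually invokes the agents, and tasks are plain strings.

```javascript
// Independent tasks: safe to fan out, because no task reads another's output.
async function runIndependent(tasks, runAgent) {
  return Promise.all(tasks.map((task) => runAgent(task, null)));
}

// Dependent steps: run one at a time, with a read-only review gating each
// handoff so a bad result stops before it feeds the next step.
async function runSequential(steps, runAgent, review) {
  let previous = null;
  const completed = [];
  for (const step of steps) {
    const result = await runAgent(step, previous);
    const verdict = await review(step, result);
    if (!verdict.approved) {
      return { ok: false, failedStep: step, completed };
    }
    completed.push(step);
    previous = result;
  }
  return { ok: true, failedStep: null, completed };
}
```

With steps `['scaffold', 'wire', 'style']` and a reviewer that rejects `wire`, `runSequential` stops there and never starts `style`. The design choice is that the coordinator returns the failed step instead of retrying on its own, so a human or an outer gate sees the verdict.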

Synthesis: Cheap Generation, Verified Output, Contained Orchestration

Put the two together and the 1.7× tax stops being a tax.

Cheap generation is the gift — let the model write the code, fast, in volume. Output verification is the first guardrail — gates that measure rendered reality instead of trusting the model's account of it, because the tests it wrote share its blind spots. Contained orchestration is the second — a structure where independent work fans out and dependent work stays single-threaded behind review checkpoints, so a wrong result gets caught at 4.4× instead of compounding at 17.2×. Generation gives you speed. The two disciplines decide whether that speed ships features or ships the 1.7× of extra defects straight to your users.

The leverage is in where the error gets caught, because the cost of a defect is wildly asymmetric across that line. An error stopped at the coordinator's review gate is contained — one bad result, examined and discarded, the 4.4× world where the damage stops at one handoff. The same error let loose through independent agents with nothing checking the seams compounds — it feeds the next step, which feeds the next, the 17.2× world. And a defect that slips past a missing output gate is worse still: it costs nothing until it surfaces under real user load, then it costs everything at once. The bottleneck didn't move because catching bugs got harder; it moved because the judgment about where to put the gate is now the expensive part.

Both are the same idea seen from two altitudes. At the line level, you verify the output because intent is unreliable. At the system level, you structure the orchestration so a single bad output is contained instead of amplified. Neither is "write better prompts" and neither is "add more tests." They're engineering discipline applied to a new bottleneck.

Because that's what actually moved. AI didn't remove the hard part of building software — it relocated it. The bottleneck used to be typing the code. Now the code is nearly free and the bottleneck is judgment: knowing what to measure, knowing what to refuse to trust, knowing which work is independent and which only looks it. The teams that ship AI-assisted code well aren't the ones generating the most. They're the ones who verify the output and contain the structure — the ones who treat correctness as the thing they engineer, now that everything else got cheap.

Where This Leaves You

If you're shipping AI-assisted code, audit yourself against two questions. First: do you have a single gate that measures rendered output — pixels, DOM, behavior — rather than re-running the model's own tests? Second: when you fan work across agents, is the split along genuinely independent seams, with a read-only reviewer gating each dependent handoff? If either answer is no, that's where your share of the 1.7× is hiding.

I build production SaaS solo with exactly this workflow, and I'm always glad to compare notes on what gates earned their keep and which orchestration splits backfired.
