
Cheap to build, costly to keep

Mathilde Rigabert
9 min read

Over 40% of committed code is now AI-assisted (1). AI has broken the cost of writing code. It hasn't touched the cost of owning it.

Velocity metrics look great. We ship more, faster. Everything is green. That's worth questioning.

There are two prices

AI has collapsed the price of writing code to near zero. Features that used to take a week now ship in two days, a major lever for any scale-up.

But writing code was never the most expensive part. The real cost is maintenance and evolution.

The first large-scale empirical studies converge on the same finding: AI-generated code introduces 1.7x more issues than human code. Without guardrails, that compounds: maintenance costs reach 4x traditional levels by the second year (1). These numbers come from an Ox Security report relayed by InfoQ. Ox sells application security tooling, so they have skin in the game. But other studies point in the same direction (4)(6), and the pattern matches what I observe in the field.

Why? Because AI is additive by default. Take a common case: an endpoint returns a badly formatted date. An experienced developer would trace back to the parser and fix the format at the source. AI adds a .toISOString() in the controller, a sanitizeDate() wrapper in the service, and a test that validates the workaround. The bug is "fixed." Three layers of code added, zero lines removed. The root cause is still there.
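To make the contrast concrete, here is a minimal sketch of the same bug fixed both ways. The names (`parseOrder`, `sanitizeDate`, `getOrderHandler`) are hypothetical, not from any real codebase:

```javascript
// The buggy parser: emits a locale-dependent string instead of ISO 8601.
function parseOrder(raw) {
  return { id: raw.id, createdAt: new Date(raw.createdAt).toString() }; // bug lives here
}

// Additive "fix" (what AI tends to produce): a wrapper layered downstream.
// The symptom disappears; the parser still emits bad dates for every other caller.
function sanitizeDate(value) {
  return new Date(value).toISOString();
}
function getOrderHandler(raw) {
  const order = parseOrder(raw);
  return { ...order, createdAt: sanitizeDate(order.createdAt) };
}

// Root-cause fix: change one line in the parser and delete the wrapper.
function parseOrderFixed(raw) {
  return { id: raw.id, createdAt: new Date(raw.createdAt).toISOString() };
}
```

Both versions return a well-formatted date, which is exactly why the additive one survives review: the tests pass either way.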

A junior developer would make the same mistake. The difference is scale. AI produces this kind of palliative on every PR, in every module, without anyone systematically pushing back. What used to be a coaching moment becomes a systemic codebase problem.

GitClear's longitudinal study across millions of lines of code (6) quantifies the shift: the share of changed lines associated with refactoring dropped from 25% in 2021 to under 10% in 2024. In the same period, code duplication rose from 8.3% to 12.3%. AI amplifies the pattern: it generates new code rather than restructuring what exists. (On greenfield projects, high addition rates are expected. The signal matters on code in maintenance, over time.)

Another signal worth tracking: code churn, the percentage of code rewritten within two weeks of its creation. The same study (6) measured churn rising 84% between 2020 and 2024, from 3.1% to 5.7%. The period also covers post-COVID shifts and the Great Resignation, so AI adoption isn't the only factor. But the correlation is strong enough to warrant monitoring. A feature that ships in two days but gets rewritten the following sprint is not a velocity gain. It's a deferred cost wearing a speed label.
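The churn metric itself is simple to compute once you have line-level history. A minimal sketch, assuming you have already extracted records of when each line was authored and (if applicable) rewritten; the record shape is hypothetical, not GitClear's actual schema:

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// lines: [{ authoredAt: timestamp, rewrittenAt: timestamp | null }, ...]
// Returns the percentage of lines rewritten within `windowDays` of creation.
function churnRate(lines, windowDays = 14) {
  if (lines.length === 0) return 0;
  const churned = lines.filter(
    (l) => l.rewrittenAt !== null &&
           l.rewrittenAt - l.authoredAt <= windowDays * DAY_MS
  ).length;
  return (churned / lines.length) * 100;
}
```

The hard part is not the arithmetic but the extraction: attributing rewrites to their original lines requires walking `git log` with a blame-style pass, and tagging commits as AI-assisted requires a team convention.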

These are the indicators that separate teams where AI builds from teams where AI just adds.

The code works. Nobody understands why.

Addy Osmani named this phenomenon: comprehension debt (3). The code runs. Tests pass. Syntax is flawless. But no one on the team can explain how it works. The team merged code it didn't write, didn't truly read, and couldn't reproduce without AI.

This has nothing to do with classic technical debt. There's nothing to refactor. The code is clean. It's just that nobody carries it in their head.

When a production incident hits, resolution time increases. The team discovers the implementation in real time, under pressure. An empirical study of 304,000 commits confirms it (4): developers place excessive trust in AI-generated code and merge it without thorough validation. Issues frequently go unfixed.

This is where bus factor becomes a leading indicator. If AI writes code that only AI can explain, the bus factor of affected modules tends toward zero. Not because a single person holds the knowledge, but because no one does. In a post-mortem, this surfaces as "nobody knew this code existed," which is worse than "only one person knew."

One way to detect it: track the MTTR Drift (2), the deviation of Mean Time To Recovery from its pre-AI baseline:

MTTR Drift = (MTTR post-AI - MTTR pre-AI) / MTTR pre-AI

The proposed thresholds (2) are as follows. Between -10% and +10%, the team has internalized the generated code. Above +30%, it's a signal worth investigating. MTTR is a noisy indicator. A single major incident can skew a quarter, so measure it as a rolling median over at least three months. The drift can have other causes (turnover, infrastructure changes), but if it correlates with AI adoption and nothing else explains it, comprehension debt is a serious hypothesis.
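A sketch of the computation, using a median rather than a mean per the caveat above. Incident durations and the function names are illustrative, not from the paper:

```javascript
// Median of a numeric array; robust to a single outlier incident.
function median(values) {
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// MTTR Drift = (MTTR post-AI - MTTR pre-AI) / MTTR pre-AI,
// computed over medians of per-incident recovery times (e.g. in hours).
function mttrDrift(preAiDurations, postAiDurations) {
  const pre = median(preAiDurations);
  const post = median(postAiDurations);
  return (post - pre) / pre; // 0.3 means the +30% threshold is hit
}
```

In practice you would feed this rolling three-month windows of incident data and plot the drift over time, rather than comparing two fixed periods.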

If the drift is real, the intervention point is clear. Code review is the last moment where the team can take ownership of code it didn't write. Automating convention and pattern checks (via hooks or review bots) frees human attention for business logic and architecture decisions. A hook that blocks PRs over 400 lines unless the author provides a section-by-section breakdown helps keep reviews at a human scale.
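Such a gate is a few lines of logic. A hypothetical sketch, assuming PR metadata is available to the hook (the field names and the `## Breakdown` marker are assumptions, not a specific tool's API):

```javascript
const MAX_CHANGED_LINES = 400;

// pr: { additions: number, deletions: number, description: string }
// Blocks oversized PRs unless the description carries a
// section-by-section breakdown, keeping reviews at a human scale.
function reviewGate(pr) {
  const changed = pr.additions + pr.deletions;
  if (changed <= MAX_CHANGED_LINES) return { ok: true };
  const hasBreakdown = /## breakdown/i.test(pr.description);
  return hasBreakdown
    ? { ok: true }
    : {
        ok: false,
        reason: `${changed} changed lines exceeds ${MAX_CHANGED_LINES}; ` +
                `add a section-by-section breakdown or split the PR.`,
      };
}
```

The same check can run as a pre-push hook or a required CI status; the point is that the friction lands on the author, not the reviewer.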

Measuring real impact, not velocity

"The faster you go, the further ahead you need to look."
— Todd Gagne, The Barrels Paradox

Lines of code generated, number of PRs, completion speed: these are production metrics. They measure volume. They say nothing about the cost of what we produce.

The obvious starting point is DORA metrics before and after AI adoption. If deployment frequency goes up but change failure rate does too, we're not going faster. We're breaking more often. One study measured a +30% increase in change failure rate within 90 days of AI adoption (1). Part of that may reflect the learning curve of new tooling rather than a structural problem.
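The before/after comparison is straightforward to automate once deployments are logged. A minimal sketch over hypothetical deployment records (the `failed` flag would come from your incident or rollback tracking):

```javascript
// deployments: [{ failed: boolean }, ...]
function changeFailureRate(deployments) {
  if (deployments.length === 0) return 0;
  const failed = deployments.filter((d) => d.failed).length;
  return failed / deployments.length;
}

// Relative change in failure rate between two periods.
// A return value of 0.3 corresponds to the +30% increase cited above.
function cfrDelta(before, after) {
  const pre = changeFailureRate(before);
  if (pre === 0) return null; // no baseline to compare against
  return (changeFailureRate(after) - pre) / pre;
}
```

Run it over a window that excludes the first weeks of adoption, so the tooling learning curve doesn't dominate the signal.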

Three questions to ask: does the AI produce code we keep? Can the review pipeline absorb the volume? Does the architecture hold?

On the review pipeline specifically: review cycle time on AI-assisted PRs versus manual ones is a revealing metric. A study of over 8,000 AI-agent PRs (7) shows that 35% are never merged, either closed or left to rot. Merge rates vary from 42% to 82% depending on the tool. If a third of AI-generated PRs never land, the velocity gain measured at commit time is absorbed downstream. The bottleneck moves from writing to reviewing.

In his paper (2), Nadarajah applies this reasoning to two scenarios with the same tool and the same team, over a one-month cycle. Without guardrails: net impact of -4,200. The team spends its time cleaning up. With the right safety nets: net impact of +5,780. Same tool, opposite outcome.

The guardrails make the difference

The numbers tell part of the story. There's also a human cost that metrics don't capture. Seniors spend their days reviewing code they didn't write and understand less and less. Experienced developers lose touch with their own codebase. In the teams I work with, that's often the first warning sign: not a metric going off, but a tech lead saying "I don't recognize the code anymore."

The answer is to give AI the right context. With one of my clients, we formalized architecture conventions in tool-readable specs that the AI loads before generating code, set up pre-commit hooks to block known anti-patterns, and invested in building shared skills across the team so that everyone can evaluate what the AI produces. The hooks catch issues before they reach review; the shared understanding catches everything else.

It's too early to measure the impact. The setup is recent. But the logic holds: give AI the context to produce aligned code from the start, rather than fixing it after the fact. I'll detail the full pipeline (specs, hooks, task templates, review automation) in a follow-up article.

AI also reduces certain types of bugs, improves test coverage on boilerplate, and enables small teams to deliver what used to take months. The risks described here are real, but they exist alongside genuine gains.

What to watch

If you do one thing Monday morning: pull the insertion/deletion ratio on your three most active repos for the last quarter. If it exceeds 10:1, start asking which PRs contribute the most.
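The ratio falls out of `git log --numstat` (for example `git log --since="3 months ago" --numstat --format=`). A sketch of the parsing step, as a pure function over that output:

```javascript
// numstat lines look like: "<insertions>\t<deletions>\t<path>".
// Binary files show "-" in both columns and are skipped by the regex.
function insertDeleteRatio(numstatOutput) {
  let insertions = 0;
  let deletions = 0;
  for (const line of numstatOutput.split('\n')) {
    const m = line.match(/^(\d+)\t(\d+)\t/);
    if (m) {
      insertions += Number(m[1]);
      deletions += Number(m[2]);
    }
  }
  return deletions === 0 ? Infinity : insertions / deletions;
}
```

Piping real `git log` output into this (via a small wrapper script) gives you the quarter's ratio per repo in a few seconds.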

Signals to track over time:

  • Code churn within 14 days. Extractable from git, no vendor dependency. If AI code is rewritten more than human code, the speed is illusory. (Tagging AI vs. human commits requires a convention: commit message tags or Copilot metadata.)
  • Refactoring ratio on your repos. If it drops below 10% of changed lines, the codebase is bloating. The 10% threshold comes from GitClear's methodology (6). Your mileage may vary with different tooling.
  • MTTR Drift (rolling median, 3+ months). Above +30%, the team understands less of what it ships. Threshold from (2), not an industry standard.
  • Review cycle time, AI PRs vs. manual. If AI PRs take longer to merge, the bottleneck has moved. Normalize by PR size before comparing.
  • Change failure rate before and after AI. If it's going up after the adoption curve has stabilized, we're breaking faster than we're building.

Today's acceleration becomes tomorrow's bottleneck when nobody looks beyond the velocity dashboards. That's a leadership choice.


Sources

(1) AI-Generated Code Creates New Wave of Technical Debt — InfoQ / Ox Security (Nov 2025)
(2) The Velocity Mirage: The Agentic Impact Framework — Mag-Stellon Nadarajah (March 2026)
(3) Comprehension Debt: The Hidden Cost of AI-Generated Code — Addy Osmani (March 2026)
(4) Debt Behind the AI Boom: A Large-Scale Empirical Study of AI-Generated Code in the Wild — arXiv (March 2026)
(5) The Barrels Paradox: Why AI Makes Leadership More Human, Not Less — Todd Gagne (Feb 2025)
(6) AI Copilot Code Quality: 2025 Data Suggests 4x Growth in Code Clones — GitClear (2025)
(7) Why AI Agent-Involved Pull Requests Remain Unmerged — arXiv (Feb 2026)
