Developers have become so dependent on AI coding tools that a leading AI research lab could not find enough of them willing to work without AI for a controlled experiment, and the productivity gains they believe they are getting may be largely illusory, according to a convergence of recent research and corporate disclosures.
In February 2026, METR (Model Evaluation and Threat Research), a nonprofit that evaluates AI systems, attempted to update a landmark 2025 study measuring how much time open-source developers took to complete tasks with and without AI assistance. The update never happened. Developers refused to participate “because they do not wish to work without AI,” the researchers acknowledged, effectively making a controlled comparison impossible.
What the Data Actually Shows
METR’s original 2025 study produced a result that surprised even its authors: developers who believed AI was accelerating their work were, in measurable terms, slower. The speed at which AI generated code was more than offset by the time spent steering the model, waiting on completions, and — critically — hunting down and correcting errors. The net effect was negative.
Unable to replicate those conditions in 2026, METR pivoted to a survey of self-reported productivity, published in May. Technical employees perceived that AI made them roughly twice as valuable to their organizations. Self-reported productivity surveys, however, are among the weakest instruments in empirical research; perception and measurement routinely diverge, particularly when workers are enthusiastic adopters of the technology being evaluated.
Corporate budget data tells a more sobering story. Uber exhausted its entire 2026 AI budget within the first four months of the year, according to reporting by The Information. COO Andrew Macdonald said on a recent podcast that the expenditure had not produced a measurable increase in completed projects or overall productivity. Separately, Amazon shut down an internal token-tracking leaderboard called Kirorank after employees gamed it by running AI agents excessively and driving up costs, the Financial Times reported. Both cases illustrate the same dynamic: AI use does not automatically translate into output.
That dynamic has acquired a name in 2026: tokenmaxxing — using token consumption as a proxy for productivity. The practice, according to the same reporting, may already be in retreat as finance teams scrutinize the bills.
Code quality metrics compound the budget concern. CodeRabbit, which makes an AI-powered code-review tool, analyzed open-source pull requests and found that AI-generated code introduced 1.7 times more problems than human-written code. Entelligence AI founder and CEO Aiswarya Sankar has claimed that companies are spending 44 percent of their tokens on fixing bugs that AI itself generated. Both statistics come from vendors with a commercial interest in AI code review, a limitation that should temper how much weight they carry, but they are directionally consistent with independent academic findings.
Researchers from Singapore Management University published a report in April 2026 warning that “AI-generated code can introduce long-term maintenance costs into real software projects.” The SMU paper adds institutional weight to arguments that had previously circulated mainly in blog posts and developer forums. One such post, by programmer and author James Shore, went viral on Hacker News. “You write code twice as quick now?” Shore wrote. “Better hope you’ve halved your maintenance costs. Otherwise, you’re screwed. You’re trading a temporary speed boost for permanent indenture.”
Taken together, METR’s failed replication attempt and the corporate budget overruns point to a structural shift that the productivity debate has largely missed. The question is no longer whether AI speeds up individual code generation (it clearly does in narrow, task-level benchmarks) but whether the downstream maintenance burden accrues faster than the upstream speed gain. As an illustration, if developers generate code at 2× the rate and that code carries 1.7× as many problems, as CodeRabbit’s figure suggests, the maintenance load grows roughly 3.4× unless maintenance itself gets faster, and the net effect on engineering capacity could be flat or negative even as token spend climbs. This is a systems-level accounting problem, not a tool-evaluation problem, and it is unlikely to be solved by adding another AI layer on top.
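The accounting argument above can be made concrete with a deliberately simple model. The Python sketch below is an illustration under stated assumptions, not a method taken from METR, CodeRabbit, or the SMU researchers: it assumes a team with fixed hours, a known share of those hours spent on maintenance before AI, and maintenance work that AI does not speed up. The function name and its parameters are hypothetical.

```python
def relative_output(speedup: float, defect_ratio: float, maintenance_share: float) -> float:
    """Feature output with AI relative to the pre-AI baseline (1.0 = no change).

    Assumptions (illustrative only, not from the reporting):
    - the team has a fixed number of hours;
    - before AI, `maintenance_share` of those hours went to maintenance;
    - AI multiplies feature-writing speed by `speedup`;
    - each unit of AI-era output needs `defect_ratio` times the maintenance
      hours of a pre-AI unit, and maintenance itself is not sped up.
    """
    if not 0.0 <= maintenance_share < 1.0:
        raise ValueError("maintenance_share must be in [0, 1)")
    # Maintenance hours owed per unit of feature output before AI.
    k = maintenance_share / (1.0 - maintenance_share)
    # Steady state: feature hours f satisfy f + k * defect_ratio * speedup * f = 1,
    # so output = speedup * f. Dividing by the baseline output 1 / (1 + k) gives:
    return speedup * (1.0 + k) / (1.0 + k * defect_ratio * speedup)


if __name__ == "__main__":
    for share in (0.2, 0.4, 0.5):
        ratio = relative_output(speedup=2.0, defect_ratio=1.7, maintenance_share=share)
        print(f"maintenance share {share:.0%}: output x{ratio:.2f}")
```

Under these assumptions, a team that spent 20 percent of its time on maintenance still comes out ahead (about 1.35 times baseline output), a team at 40 percent is roughly flat (about 1.02), and a team at 50 percent ends up slower than before (about 0.91). The break-even point moves with every input, which is why the result depends on measurement rather than on perceived speed.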
The pattern echoes earlier debates in how AI is reshaping hiring and engineering workflows — adoption outpaces measurement, and organizations discover the real costs only after budgets have been committed.
The Strongest Counterargument
The most credible objection to this framing comes from within the AI coding industry itself, most explicitly from Cognition founder and CEO Scott Wu, whose company makes Devin, an autonomous AI coding agent. Wu’s position — shared by others in the agent space — is that the maintenance burden created by AI code generation is itself automatable: AI coding agents can fix AI-generated bugs as fast as they are produced, effectively closing the loop without additional human labor.
It is a coherent argument, but Wu himself immediately qualifies it. He rates Devin’s current capability at somewhere between a junior and a mid-level programmer, depending on the task. Delegating code review and maintenance to a system operating at junior-developer proficiency does not obviously reduce risk; it may redistribute it in ways that are harder to observe. The SMU researchers and Wu agree on one point: humans should retain ownership of high-level decisions — software architecture, security design, and system-level reasoning — because these are precisely the areas where current AI models perform least reliably. That consensus limits how much of the maintenance loop can be safely automated today, regardless of how the tooling improves.
Research into how machine learning models can introduce privacy and security vulnerabilities reinforces the case for keeping humans in the architectural loop, particularly when AI-generated code touches sensitive data paths.
The SMU team’s practical guidance points in the same direction: developers need to understand, at a granular level, which tasks AI handles reliably and which it does not — analogous to knowing a programming language’s edge cases. They also need quality-assurance pipelines explicitly designed for AI output, and they should review AI-generated code with the same scrutiny applied to a junior developer’s pull request. None of that is a rejection of AI tooling; it is a framework for using it without accumulating hidden technical debt.
The broader pattern — where enthusiasm for AI adoption precedes rigorous measurement — has been documented across domains, from NASA’s use of machine learning for wildfire prediction to academic efforts at applying deep learning to scientific classification problems. In each case, performance gains in narrow benchmarks coexist with unresolved questions about reliability at scale.
Where This Ends Up
The most probable near-term outcome is that enterprise engineering teams begin treating AI-generated code the way regulated industries treat third-party dependencies: with mandatory review gates, automated quality checks, and explicit accounting of maintenance liability before code is merged. Tokenmaxxing will fade as a performance metric as CFOs demand output-based justification for AI infrastructure spend, and the tooling market will consolidate around products that can demonstrate measurable defect-rate improvements rather than raw generation speed.
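A review gate of the kind described above could, in its simplest form, look something like the following Python sketch. Everything in it is hypothetical: the `PullRequest` fields, the `merge_blockers` function, and the one-approval default are invented for illustration and do not describe any particular company's process or any real code-hosting API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PullRequest:
    """Hypothetical record of a proposed change (fields are illustrative)."""
    ai_generated: bool                        # was the change produced mainly by an AI tool?
    human_approvals: int                      # number of human reviewers who approved
    checks_passed: bool                       # did automated quality checks succeed?
    maintenance_owner: Optional[str] = None   # who is accountable for upkeep


def merge_blockers(pr: PullRequest, required_approvals: int = 1) -> List[str]:
    """Return the reasons a change may not be merged; an empty list means it may."""
    blockers = []
    # Automated checks apply to every change, whoever wrote it.
    if not pr.checks_passed:
        blockers.append("automated quality checks failed")
    # AI-generated changes additionally need human review and a named owner.
    if pr.ai_generated:
        if pr.human_approvals < required_approvals:
            blockers.append("needs human review")
        if not pr.maintenance_owner:
            blockers.append("no maintenance owner recorded")
    return blockers
```

The design choice worth noting is that the gate returns reasons rather than a bare yes or no, so that the maintenance liability is written down at merge time instead of being discovered later.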
The second-most-likely scenario is that agentic systems — AI agents that autonomously write, test, and fix code — mature fast enough to genuinely close the maintenance gap before the debt becomes unmanageable. That outcome depends on whether models at the frontier can move from junior-developer-equivalent capability to reliable mid-senior-level reasoning on open-ended engineering tasks within the next 12 to 18 months. If the current generation of AI agents is indeed a practice run for more capable successors, that timeline is plausible — but it has not yet been demonstrated in production at scale, and the organizations running up budget overruns today are not waiting for it.











