Why the AI Productivity Wave Is Repeating an Old Management Mistake

There is a familiar pattern in tech. A new tool arrives. It looks powerful. Leaders want adoption. Finance wants ROI. Managers want dashboards.

Someone asks the obvious question: “How do we measure whether this is making people more productive?”

That question is reasonable, but then the mistake happens. Instead of measuring whether the company is building better software, solving customer problems faster, reducing toil, or improving engineering quality, the organization starts measuring the thing that is easiest to count.

In the old world, that meant commits, lines of code, pull requests, or tickets closed. In the AI world, it increasingly means tokens, AI-generated lines, AI-assisted pull requests, prompt volume, or agent usage. And this is where the industry risks repeating the same failure with newer vocabulary.

Tokens are the new commit count because they have the same seductive flaw: they are easy to measure and easy to misunderstand. A token count tells you that activity happened. It tells you that language moved through a model. It may help a company understand cost, adoption, or capacity. But it does not tell you whether the engineer understood the problem better, produced a safer change, reduced review burden, prevented an incident, or delivered something customers actually needed.

A team can burn millions of tokens going in circles, just as an engineer can make dozens of commits without moving the product forward. In both cases, the metric captures motion. It does not capture judgment. That is why token counts are useful as operational telemetry, but dangerous as productivity scores.

The uncomfortable truth is that the industry already learned this lesson with commits. Now it seems determined to learn it again, but with a much larger inference bill.

The Seduction of Measurability

Developer productivity has always been hard to measure because software engineering is not factory work.

A factory can count units produced. A warehouse can count packages moved. A call center can count calls handled. Even then, those metrics can be gamed or misused, but at least the “unit” is somewhat visible.

Software is different. Sometimes the most productive thing an engineer does all week is delete 2,000 lines of code. Sometimes it is preventing a bad architectural decision. Sometimes it is reviewing someone else’s design and catching a flaw before it becomes six months of operational pain. Sometimes it is writing the boring migration plan that makes future work safer. Sometimes it is sitting with a junior engineer and giving them enough context to unblock themselves for the next year.

Most of that does not show up cleanly in a commit graph. This is why activity metrics are so seductive. They give management something crisp to look at. They create the appearance of objectivity. They turn messy knowledge work into a scoreboard. The problem is that the scoreboard often measures the wrong game.

The SPACE framework, developed by researchers from Microsoft Research, GitHub, and the University of Victoria, explicitly argues that developer productivity cannot be reduced to one dimension. It defines productivity across satisfaction and well-being, performance, activity, communication and collaboration, and efficiency and flow. It also warns that activity metrics like pull requests, commits, and code reviews should not be used in isolation to reward or penalize developers. That warning matters even more in the AI era, because AI makes activity easier to inflate.

The Old Mistake: Commits Were Never Productivity

Counting commits felt reasonable at first. A commit is visible. It is timestamped. It is attributable. It appears to show that work happened.

But commits are arbitrary. One engineer may commit every small step. Another may squash a week of work into one clean commit. One task may require deep debugging and produce a tiny patch. Another may produce hundreds of generated files with minimal thought.

GitLab put it bluntly years ago: commits are “arbitrary changes captured in a single moment in time,” and their size or frequency does not correlate with the work needed to achieve the change. At best, commits indicate activity. They should not be compared across programmers as productivity measures.

Lines of code are even worse. A developer can add code, remove code, simplify code, generate code, or replace code with a library. The business does not care whether the answer took 10 lines or 1,000 lines. It cares whether the problem was solved, whether the system remains maintainable, and whether customers benefit.

The classic example is Bill Atkinson, who reportedly logged “-2000” lines of code after optimizing QuickDraw’s region calculation machinery by removing code. The point is obvious: less code can represent more engineering value. Researchers Ciera Jaspan and Caitlin Sadowski use that story in their chapter “No Single Metric Captures Productivity,” where they argue that no one measure is adequate and that searching for one can be counterproductive.

This is not just a philosophical concern. When individual productivity metrics feed into performance reviews, compensation, or job security, people rationally adapt. They split commits. They inflate estimates. They optimize for the dashboard. They avoid hard, ambiguous, low-visibility work. They stop doing the invisible work that keeps teams healthy.

That is Goodhart’s law in action: once a measure becomes a target, it stops being a good measure. The problem was never that commits, pull requests, or lines of code contain no information. The problem was treating them as judgment instead of telemetry.

The New Mistake: Tokens Look Objective, but They Are Even Further From Value

Now the AI wave is creating a new category of activity metrics. How many tokens did an engineer consume? How many prompts did they send? How often did they use Copilot, Cursor, Claude Code, ChatGPT, or an internal agent? How much code was AI-generated? How many AI-assisted pull requests were merged?

These numbers feel modern. They feel measurable. They make it look like an organization is becoming “AI-native.” But token count is not productivity. Token count can rise because an engineer is doing valuable work with AI. It can also rise because:

An agent is stuck in a loop.
A prompt includes too much irrelevant context.
The model generated bad code that had to be thrown away.
The engineer is repeatedly asking the AI to fix its own mistakes.
The team is using AI for work that would have been faster to do manually.
The organization created a culture where people use AI performatively because they believe usage itself is being watched.

That last point is not hypothetical. In May 2026, Business Insider reported that Amazon shut down an internal employee-created AI token leaderboard called “KiroRank” after it encouraged some employees to use AI in ways that did not necessarily solve problems. An Amazon senior vice president reportedly told staff not to use AI “just for the sake of using AI,” and the company said it tracks token usage for cost purposes but does not encourage “tokenmaxxing.”

That is the whole issue in miniature. Once token usage becomes status, people will produce token usage. Not value. Usage.

And once usage becomes tied, directly or indirectly, to performance, people will protect themselves by appearing maximally “AI-forward,” whether or not the work benefits. This is how a tool becomes a ritual.

The Cargo Cult of AI Transformation

There is a phrase from anthropology and organizational criticism that fits this perfectly: cargo cult. A cargo cult imitates the visible rituals of a system without understanding the underlying mechanisms that made the system work. In engineering organizations, the AI cargo cult looks like this:

Leadership sees successful companies using AI.
So they roll out AI tools.
Then they create dashboards.
Then they track usage.
Then they create internal leaderboards.
Then they pressure teams to become “AI-first.”
Then they measure who is using the tools most.
Then they assume the people using the most AI must be the most productive.

But the actual mechanisms of productivity are not tokens, prompts, or AI-generated diffs. The mechanisms are:

clear ownership
good architecture
fast feedback loops
high-quality tests
strong code review
low operational burden
usable documentation
psychological safety
reasonable scope
healthy on-call
and enough slack in the system for engineers to think

AI can amplify those mechanisms. It cannot replace them. If a team has bad tests, AI can generate more code that nobody can safely validate. If a team has overloaded reviewers, AI can create larger review queues. If a system is poorly understood, AI can produce confident changes that fit the surface pattern while violating hidden invariants.

If an organization rewards volume, AI can generate volume. And if leadership does not understand the difference between activity and outcome, AI simply gives the organization a more expensive way to fool itself. That is the cargo cult danger.

The company copies the symbols of modern engineering (agents, dashboards, AI usage metrics, productivity claims) but does not build the operating system that makes those tools valuable.

AI May Increase Local Speed While Hurting System Throughput

One of the most important things to understand about software organizations is that local productivity and system productivity are not the same thing.

An individual engineer may produce code faster. But if that code increases review burden, test flakiness, operational risk, incident load, or long-term maintenance cost, the system may become slower. This is where AI complicates the productivity story.

The DORA research program’s 2026 report on generative AI in software development found that a 25% increase in AI adoption was associated with a 1.5% decrease in delivery throughput and a 7.2% decrease in delivery stability. DORA’s explanation is not “AI is useless”; it is that AI can generate code faster, which may lead to larger batch sizes that are harder to review and more prone to instability.

That is a systems-thinking result. The bottleneck may not be writing code. The bottleneck may be understanding the code. Or reviewing it. Or testing it. Or safely deploying it. Or operating it at 3 a.m. If AI accelerates the easiest part of the pipeline while leaving the rest unchanged, it can flood the system.
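A toy model makes the bottleneck argument concrete. The sketch below assumes a four-stage pipeline in which each stage can handle a fixed number of changes per week. The stage names and capacities are invented for illustration and do not come from DORA or any other study.

```python
# Toy model: a delivery pipeline ships only as fast as its slowest stage.
# All stage names and capacities are hypothetical illustration values.

def weekly_throughput(capacity):
    """Changes shipped per week, capped by the slowest stage."""
    return min(capacity.values())

def review_backlog(capacity, weeks):
    """Changes written but still waiting for review after `weeks` weeks."""
    surplus = max(0, capacity["write"] - capacity["review"])
    return surplus * weeks

before = {"write": 10, "review": 8, "test": 9, "deploy": 12}
# Assume AI doubles only the writing stage; everything downstream is unchanged.
after = {**before, "write": 20}

print(weekly_throughput(before), weekly_throughput(after))  # 8 8
print(review_backlog(before, 4), review_backlog(after, 4))  # 8 48
```

In this simplified picture, doubling writing speed leaves shipped throughput where it was and multiplies the review queue by six after a month. Real pipelines are messier, but the direction of the effect is the one described above.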

That nuance gets destroyed by a token counter. A token counter cannot distinguish between “AI helped me write a good test suite in 20 minutes” and “AI hallucinated a migration plan for two hours and I had to throw it away.” It just sees tokens.
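As a small illustration of that blindness, the sketch below records two hypothetical sessions with identical token counts and opposite outcomes. The `Session` type, the `output_kept` label, and the numbers are all assumptions made for the example; in practice the outcome label would have to come from a human or from downstream signals, not from the token log.

```python
from dataclasses import dataclass

@dataclass
class Session:
    description: str
    tokens: int
    output_kept: bool  # hypothetical outcome label, recorded after the fact

sessions = [
    Session("wrote a good test suite in 20 minutes", 180_000, True),
    Session("discarded a hallucinated migration plan", 180_000, False),
]

def token_total(items):
    """What a token counter sees: volume only."""
    return sum(s.tokens for s in items)

def kept_share(items):
    """Fraction of tokens spent on work that survived."""
    total = token_total(items)
    kept = sum(s.tokens for s in items if s.output_kept)
    return kept / total if total else 0.0

print(token_total(sessions))  # 360000
print(kept_share(sessions))   # 0.5
```

Both sessions contribute the same 180,000 tokens to the total. Only the extra outcome field, which no token meter supplies, separates the useful session from the wasted one.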

This is why “more code faster” is not automatically good. A team does not need more code in the abstract. It needs more correct, maintainable, well-reviewed, well-tested, well-owned changes that solve real problems. Sometimes AI helps produce that. Sometimes it produces more material for humans to clean up.

The Hidden Work Does Not Disappear

One reason commit-count productivity failed is that it ignored invisible work.

AI does not change this. In fact, AI may increase the importance of invisible work. Someone has to review the generated code. Someone has to decide whether the model’s suggestion fits the architecture. Someone has to understand the security implications. Someone has to notice when generated tests test the mock instead of the behavior. Someone has to maintain the code six months later. Someone still has to carry ownership.

A 2026 study from Carnegie Mellon and BNY Mellon argued that measuring productivity with AI coding assistants requires a multifaceted approach. Their interviews identified factors such as self-sufficiency, frustration and cognitive load, task completion rate, ease of peer review, technical expertise, and ownership of work. The authors specifically emphasize that long-term factors like expertise and ownership are often missed by narrower metrics.

That matters because AI can create a dangerous illusion: the work looks done before the understanding exists. The code compiles, the diff looks plausible, the assistant explains it confidently, the pull request is opened, the dashboard celebrates activity… but the human system may have lost knowledge, not gained productivity.

Token Usage Is a Cost Metric, Not a Productivity Metric

There is a legitimate reason to track tokens. Companies should know what they are spending. They should understand which tools are being adopted. They should identify runaway costs, inefficient workflows, and places where agents are burning compute without results. That is operational hygiene.
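One way to picture that kind of hygiene is a check that flags expensive agent runs that left nothing behind. The sketch below is a minimal example under stated assumptions: the record fields, the run IDs, and the per-run budget are hypothetical, and the threshold is an illustration rather than a recommendation.

```python
# Hypothetical agent-run records; field names and values are assumptions.
runs = [
    {"run_id": "a1", "tokens": 40_000, "produced_merged_change": True},
    {"run_id": "b2", "tokens": 900_000, "produced_merged_change": False},
    {"run_id": "c3", "tokens": 650_000, "produced_merged_change": True},
]

TOKEN_BUDGET = 500_000  # illustrative per-run threshold

def runaway_runs(records, budget=TOKEN_BUDGET):
    """IDs of runs that exceeded the budget and produced no merged change."""
    return [
        r["run_id"]
        for r in records
        if r["tokens"] > budget and not r["produced_merged_change"]
    ]

print(runaway_runs(runs))  # ['b2']
```

The check points at workflows to investigate, not at people to rank. Run c3 also exceeds the budget, but it shipped something, so whether it was worth the spend is a cost question for a human to answer.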

But a cost metric is not a productivity metric. Cloud spend does not equal customer value. CPU usage does not equal business impact. Number of Slack messages does not equal collaboration. Calendar hours do not equal leadership. And tokens do not equal engineering output.

This distinction is already showing up in executive discussions. Business Insider reported that Uber’s COO said it was becoming harder to justify AI costs and that higher token usage had not yet translated into a proportional increase in useful consumer-facing features. That is exactly the right distinction.

The question is not: “Are we using a lot of AI?” The question is: “What changed because we used AI?” Did cycle time improve? Did review time decrease? Did escaped defects go down? Did incident load improve? Did engineers spend less time on toil? Did customers get useful features faster? Did onboarding become easier? Did documentation improve? Did teams reduce operational risk? Did the company get more valuable output per dollar of engineering investment? Those are harder questions, but they are the real questions.

What Companies Should Measure Instead

The answer is not to stop measuring. The answer is to stop pretending one metric can carry the truth. A good AI productivity framework should separate utilization, impact, quality, cost, and human sustainability.

Utilization answers: are people using the tool?
Impact answers: is the tool improving outcomes?
Quality answers: is the output safe and maintainable?
Cost answers: is the investment economically justified?
Human sustainability answers: is the system making engineers more effective or just more monitored?

The DX AI Measurement Framework, for example, separates AI measurement into utilization, impact, and cost, and recommends combining AI-specific metrics with broader engineering productivity measures rather than relying on a single measure.

A healthier measurement system might include:

Adoption and usage metrics: active users, frequency of AI use, token cost, tool coverage, workflow integration. These are useful for rollout and budgeting, but they should not be used as individual productivity scores.
Flow metrics: cycle time, lead time for changes, review latency, deployment frequency, batch size, blocked time, and time from idea to production. DORA’s delivery metrics focus on throughput and instability, including change lead time, deployment frequency, recovery time, change fail rate, and deployment rework rate.
Quality metrics: change failure rate, escaped defects, incident volume, rollback frequency, test coverage that actually maps to behavior, security findings, reliability regressions, and maintainability signals.
Review metrics: pull request size, review depth, number of review cycles, reviewer load, time spent validating AI-generated code, and whether AI-created diffs increase or reduce reviewer burden.
Human metrics: developer satisfaction, cognitive load, perceived ability to make progress, interruption load, on-call burden, documentation quality, onboarding time, and whether engineers feel they understand the systems they own.
AI-specific outcome metrics: time saved on specific workflows, reduction in repetitive toil, percentage of generated code that survives review without major rewrite, post-merge rework, and whether AI-assisted changes correlate with fewer or more incidents.

None of these metrics is perfect. That is the point. You need a constellation of measures with tension between them. If AI increases pull request throughput but also increases change failure rate, that is not a clear win. If AI increases code volume but decreases ownership, that is not a clear win. If AI saves time on boilerplate but creates more review burden, the bottleneck moved. If AI reduces cognitive load and helps engineers learn faster, that value may not show up in token counts at all.
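The idea of measures held in tension can be sketched as a comparison that refuses to declare a win when any paired metric regresses. The metric names, the before-and-after numbers, and the `verdict` wording below are hypothetical; a real version would need agreed definitions and enough data to separate signal from noise.

```python
# Hypothetical before/after snapshot for one team; every number is invented.
baseline = {"prs_merged_per_week": 20, "change_failure_rate": 0.10, "median_review_hours": 6.0}
with_ai = {"prs_merged_per_week": 28, "change_failure_rate": 0.16, "median_review_hours": 9.0}

# For each metric, whether a higher value is the good direction.
HIGHER_IS_BETTER = {
    "prs_merged_per_week": True,
    "change_failure_rate": False,
    "median_review_hours": False,
}

def classify(before, after):
    """Split the metrics that moved into improved and regressed lists."""
    improved, regressed = [], []
    for name, higher_is_better in HIGHER_IS_BETTER.items():
        delta = after[name] - before[name]
        if delta == 0:
            continue
        went_well = (delta > 0) == higher_is_better
        (improved if went_well else regressed).append(name)
    return improved, regressed

def verdict(before, after):
    """Call it a clear win only when nothing regressed."""
    improved, regressed = classify(before, after)
    if improved and regressed:
        return "mixed: investigate " + ", ".join(regressed)
    if improved:
        return "clear win"
    if regressed:
        return "clear loss"
    return "no change"

print(verdict(baseline, with_ai))
# mixed: investigate change_failure_rate, median_review_hours
```

Here merged pull requests rose by 40 percent, yet the verdict is mixed because failure rate and review time both got worse. A dashboard showing only the first number would have reported a success.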

The Better Question: What Friction Did AI Remove?

A good manager should not ask, “How many tokens did you use this week?”

A good manager should ask: What work became easier? What work became safer? What work became faster? What toil disappeared? What did we learn? What did customers receive sooner? What risks did we introduce? What guardrails do we need?

AI should be evaluated like an engineering tool, not a religious conversion.

When continuous integration works, we do not celebrate the number of CI minutes consumed. We care whether it catches problems early. When observability works, we do not celebrate the number of logs emitted. We care whether teams can understand production behavior. When code review works, we do not celebrate the number of comments. We care whether the change improves before it ships.

AI should be treated the same way. The value of AI is not that tokens were consumed. The value is that a human and a system became more capable.

Tokens Are Receipts, Not Results

The AI wave is real and the tools are useful. Some engineers will become dramatically faster at some tasks. Some teams will unlock real leverage. Some companies will build better products with fewer bottlenecks.

But the organizations that reduce AI productivity to token counts are not becoming modern. They are repeating an old mistake with a new dashboard. We already learned that commits do not equal productivity. We already learned that lines of code do not equal value. We already learned that activity metrics, when tied to performance, create gaming, fear, and distorted behavior.

Now we need to learn that tokens do not equal impact. Tokens are receipts. They tell you something was spent. They do not tell you whether anything worthwhile was bought. The companies that get this right will measure AI by the friction it removes, the quality it preserves, the outcomes it accelerates, and the humans it helps.

The companies that get it wrong will build leaderboards, celebrate usage, inflate costs, exhaust reviewers, and call it transformation. And then, a year later, they will ask why all those tokens did not turn into better software.