Tokenmaxxing - The Lines-of-Code Mistake in 2026 Clothing

                
Christian Lehnert • 2026-05-23 • ~8 min read

The Inversion

In April 2026, an employee at Meta built an internal dashboard called
Claudeonomics that ranked all eighty-five thousand of the company's
employees by AI token consumption. The top user averaged hundreds of
billions of tokens per month. The leaderboard awarded titles such as
Token Legend, Model Connoisseur, and Cache Wizard. Andrew Bosworth,
Meta's CTO, told Forbes his best engineer was spending the equivalent
of his salary on tokens and was "5x to 10x more productive" as a
result. "It's like, this is easy money. Keep doing it."

A few weeks later, the Financial Times reported that Amazon employees
had begun running AI tools on trivial tasks specifically to inflate
their counts on a similar internal leaderboard. The practice acquired
a name. Tokenmaxxing. Microsoft has been running its own leaderboard
since January. Salesforce ships engineers a Mac widget that updates
their personal token spend every fifteen minutes, with a minimum
target of one hundred dollars on Claude Code plus seventy on Cursor
per month. Jensen Huang told the All-In podcast that an engineer
earning five hundred thousand dollars who failed to consume at least
two hundred and fifty thousand dollars worth of tokens annually would
alarm him "deeply."

This is wrong. Not because it is gameable, though it is. Not because
it is wasteful, though it is. It is wrong because it rewards the
exact opposite of what competent engineering looks like, and the
engineers who understand this are being publicly ranked beneath the
engineers who do not.

What Good Engineering Does With Tokens

The technical case is sharper than the management case, because it
can be specified precisely. Competent engineering of LLM-based
systems reduces token consumption. Across every dimension where the
work is real, the output of good engineering is a system that does
more with less.

Concise prompt engineering cuts token usage by thirty to fifty
percent without sacrificing quality. This is not folklore. The cost
versus quality trade-off has been studied empirically. A senior
prompt engineer takes a verbose first draft and halves it while
improving consistency of output.

Retrieval-augmented generation done well sends the model the right
context, not the most context. Aggressive filtering and reranking
before the chunks reach the LLM produces order-of-magnitude cost
reductions relative to the naive approach. The quality goes up
because the model is no longer drowning in noise.
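A minimal sketch of the filter-then-rerank step, under loud assumptions: the keyword-overlap score stands in for a real reranking model, the whitespace word count stands in for a real tokenizer, and the names relevance, select_context, min_score, and token_budget are hypothetical, not taken from any library.

```python
import re

def tokenize(text):
    """Lowercase word set. A stand-in for a real relevance model's input."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def relevance(query, chunk):
    """Toy score: fraction of query terms that also appear in the chunk."""
    query_terms = tokenize(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokenize(chunk)) / len(query_terms)

def select_context(query, chunks, min_score=0.5, token_budget=40):
    """Drop weak chunks, rank the rest, and stop adding at the token budget."""
    scored = [(relevance(query, chunk), chunk) for chunk in chunks]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    selected, used = [], 0
    for _score, chunk in kept:
        cost = len(chunk.split())  # crude token estimate (assumption)
        if used + cost > token_budget:
            continue  # skip chunks that would blow the budget
        selected.append(chunk)
        used += cost
    return selected, used

# Hypothetical retrieved chunks for illustration only.
CHUNKS = [
    "Prompt caching reduces the cost of repeated context.",
    "The cafeteria menu changes on Fridays.",
    "Caching repeated context lowers cost for long prompts.",
]
QUERY = "cost of repeated context caching"

if __name__ == "__main__":
    context, tokens_used = select_context(QUERY, CHUNKS)
    print(tokens_used, context)
```

The point of the sketch is the shape, not the scoring function: the irrelevant chunk never reaches the model, and the budget caps what does.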

Context engineering as a discipline (offloading large tool outputs
into references, compacting conversations as they grow, isolating
sub-tasks into sub-agents, retrieving at runtime instead of ahead of
time) has produced documented ninety percent token cost reductions
in production systems. The agent that does the work better consumes
substantially fewer tokens to do it.
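One of these techniques, offloading large tool outputs into references, can be sketched in a few lines. This is an illustration under assumptions, not a description of any particular agent framework: ArtifactStore, offload, max_chars, and preview_chars are hypothetical names, and character counts stand in for token counts.

```python
import hashlib

class ArtifactStore:
    """Keeps large tool outputs outside the conversation (hypothetical helper)."""

    def __init__(self):
        self._items = {}

    def put(self, text):
        # A short content hash gives a stable handle the agent can cite later.
        key = "artifact-" + hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
        self._items[key] = text
        return key

    def get(self, key):
        return self._items[key]

def offload(tool_output, store, max_chars=200, preview_chars=60):
    """Return small outputs unchanged; replace large ones with a short stub."""
    if len(tool_output) <= max_chars:
        return tool_output
    key = store.put(tool_output)
    preview = tool_output[:preview_chars].replace("\n", " ")
    return f"[{key}: {len(tool_output)} chars, starts with: {preview!r}]"

if __name__ == "__main__":
    store = ArtifactStore()
    log = "request ok\n" * 1000  # pretend this is a large tool result
    stub = offload(log, store)
    print(len(log), "chars became", len(stub), "chars in context")
```

The full output stays retrievable by its handle, so the agent can fetch the part it needs at runtime instead of carrying all of it in every turn.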

Semantic caching in front of the model eliminates the inference call on
cache hits, with documented cost reductions of up to seventy-three
percent. Prompt caching at the provider layer reduces the cost of
repeated context dramatically. Right-sized model selection routes
most work to the cheaper model and reserves the frontier model for
the cases that need it.
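The semantic caching idea can be sketched as follows. The assumptions are significant: a bag-of-words vector stands in for a real embedding model, a linear scan stands in for a vector index, and SemanticCache, call_model, and threshold are hypothetical names chosen for this illustration.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words vector. A real system would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

class SemanticCache:
    """Answers from cache when a new prompt is close enough to an old one."""

    def __init__(self, call_model, threshold=0.8):
        self.call_model = call_model  # the expensive inference call
        self.threshold = threshold
        self.entries = []  # list of (vector, response) pairs
        self.hits = 0
        self.misses = 0

    def ask(self, prompt):
        vector = embed(prompt)
        for cached_vector, response in self.entries:
            if cosine(vector, cached_vector) >= self.threshold:
                self.hits += 1
                return response  # no inference call on a hit
        self.misses += 1
        response = self.call_model(prompt)
        self.entries.append((vector, response))
        return response

if __name__ == "__main__":
    cache = SemanticCache(lambda prompt: "answer to: " + prompt)
    cache.ask("How do I reset my password?")
    cache.ask("how do i reset my password")
    print(cache.hits, cache.misses)
```

The threshold is the design decision that matters. Set it too low and users get answers to questions they did not ask. Set it too high and the cache never hits.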

A team that knows what it is doing produces a system that consumes
fewer tokens, costs less to operate, and outputs better results than
a team that does not. The metric that measures token consumption
tells you the opposite of this. The team that has done the engineering
work appears, in the metric, less productive. The team that has not
done the work appears more productive. The metric inverts the
signal.
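The inversion is easy to show with numbers. The two teams below are invented for illustration and do not come from any reported data; TeamMonth and both ranking functions are hypothetical names. The sketch ranks the same two teams by raw token consumption and by tokens per completed task.

```python
from dataclasses import dataclass

@dataclass
class TeamMonth:
    name: str
    tokens: int
    tasks_completed: int

# Invented figures: one team has done the engineering work, one has not.
TEAMS = [
    TeamMonth("optimized", tokens=40_000_000, tasks_completed=200),
    TeamMonth("unoptimized", tokens=400_000_000, tasks_completed=180),
]

def rank_by_tokens(teams):
    """Leaderboard order: highest raw consumption first."""
    ordered = sorted(teams, key=lambda t: t.tokens, reverse=True)
    return [t.name for t in ordered]

def rank_by_tokens_per_task(teams):
    """Efficiency order: fewest tokens per completed task first."""
    ordered = sorted(teams, key=lambda t: t.tokens / t.tasks_completed)
    return [t.name for t in ordered]

if __name__ == "__main__":
    print("leaderboard:", rank_by_tokens(TEAMS))
    print("efficiency: ", rank_by_tokens_per_task(TEAMS))
```

With these invented figures the leaderboard puts the unoptimized team first, even though it completed fewer tasks at roughly eleven times the tokens per task.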

What the Metric Selects For

The behaviors that the leaderboards actually reward have been
documented in reporting. They are the precise opposite of the
practices above. Engineers ask Claude or Cursor to build random side
projects that they have no intention of shipping, specifically to
burn tokens. Engineers calibrate their spend to slightly above the
peer average to avoid being seen as "using too little AI"; a
Microsoft engineer interviewed by Pragmatic Engineer admitted this
explicitly. Engineers run multiple agents in parallel on overlapping
problems, producing duplicated context windows and inflated counts.
The Cleo CEO's reported thirty-six-thousand-dollar single-month
spend was produced this way. Always-on AI assistants burn tokens
continuously for ambient capability that the engineer may or may
not actually use. Verbose prompts replace concise ones,
producing higher token counts and frequently worse model output.

The collective effect is a workforce optimizing for the metric
and against the actual work simultaneously.

The Lines-of-Code Mistake Returns

Engineering managers have been making this category of mistake since
the 1970s. The canonical instance is lines of code. The canonical
critique is attributed to Bill Gates: measuring programming progress by lines of
code is like measuring aircraft building progress by weight. The
canonical formalization is Goodhart's Law: when a measure becomes a
target, it ceases to be a good measure.

The principle has been applied to engineering metrics for fifty
years. It is taught in every serious engineering management course.
It is in every book on the subject. It keeps surfacing because the
underlying epistemological problem is genuinely hard. Productivity
is not directly observable. It has to be inferred from proxies. The
proxies that are easy to count (lines of code, commits, pull
requests, story points, tickets, tokens) are not what the manager
actually cares about, but they are countable. The proxies that
would actually reflect productivity (working systems, satisfied
users, reduced operational burden, code that other engineers want
to build on) are slow, qualitative, and require taste to evaluate.
Counting is cheaper than judging. Counting wins.

Tokens are the same mistake at a higher cost basis. Lines of code
wasted attention. Tokenmaxxing wastes cash. The waste per unit of
incentive is higher than in any previous round.

The disappointing observation is that engineering leadership at
companies that have presumably read at least one book on metrics
has adopted this version of the mistake anyway. The 1970s lesson
did not transfer to 2026. Each generation is required to relearn
the lesson from scratch. The current generation is doing so
in real time, on the public balance sheets of the largest companies
in technology.

The Bosworth Defense Does Not Work

The strongest defense of the metric, offered by Bosworth and echoed
by Huang, is that the metric is rewarding genuinely productive
engineers who happen to use a lot of AI. If an engineer is willing
to spend their salary on tokens and the result is multi-fold
productivity, the company should be delighted to pay the bill and
the metric is correctly identifying the productive engineers.

There are two reasons this fails.

The "5x to 10x" claim is unsubstantiated. There is no methodology,
no controlled experiment, no comparison to a counterfactual. The claim
is asserted by an executive with a strategic interest in justifying
the spend and a public commitment to the AI-augmented engineering
narrative. The base rate of such estimates being accurate, across
the history of technology adoption claims, is not high.

More importantly: even if the claim is true for the specific
engineer Bosworth was describing, it does not transfer to the
metric. A single engineer who is genuinely 5x more productive by
spending heavily does not imply that token spend is a measurement
of productivity. The relationship may be present in the specific
case and absent in the general one. The metric extrapolates the
specific to the general, which is exactly the move Goodhart's Law
warns against. The most productive engineer at Meta might be the
highest token consumer. The highest token consumer at Meta is not
therefore the most productive engineer. The reasoning runs in only
one direction. The metric assumes it runs in both.

The Money Is Real

Meta's sixty trillion tokens in a month, at blended pricing of two
to four dollars per million tokens, implies a hundred-and-twenty
to two-hundred-and-forty million dollars in monthly spend across
the eighty-five-thousand employee base. The annual run rate is in
the one-and-a-half to three billion dollar range. This is
acquisition-scale, data-center-buildout-scale capital allocation.
It is being allocated, in part, through the proxy of a leaderboard.
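The arithmetic behind these figures is short enough to check directly. The inputs are the ones stated above (sixty trillion tokens, two to four dollars per million, eighty-five thousand employees); the per-employee breakdown is an added derivation from those same inputs, and the function name monthly_spend is hypothetical.

```python
TOKENS_PER_MONTH = 60 * 10**12   # sixty trillion tokens
EMPLOYEES = 85_000
PRICE_LOW = 2.0                  # dollars per million tokens
PRICE_HIGH = 4.0

def monthly_spend(tokens, price_per_million):
    """Dollar cost of a token volume at a blended per-million price."""
    return tokens / 1_000_000 * price_per_million

low = monthly_spend(TOKENS_PER_MONTH, PRICE_LOW)
high = monthly_spend(TOKENS_PER_MONTH, PRICE_HIGH)

if __name__ == "__main__":
    print(f"monthly:  ${low:,.0f} to ${high:,.0f}")
    print(f"annual:   ${low * 12:,.0f} to ${high * 12:,.0f}")
    print(f"per head: ${low / EMPLOYEES:,.0f} to ${high / EMPLOYEES:,.0f} per month")
```

The annual range works out to 1.44 to 2.88 billion dollars, which the text rounds to one-and-a-half to three billion. Spread evenly, that is roughly fourteen hundred to twenty-eight hundred dollars per employee per month.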

The Cleo CEO's thirty-six thousand dollars in a single month is
approximately three times the monthly cost of an engineer in many
markets. No additional engineer was hired with that money. The
spend came on top of the existing salary.

A substantial fraction of this spend is being driven by the metric
rather than by the underlying work. The ratio is unknown. The
existence of the ratio is not.

Closing

The engineers who will actually win the AI-augmented era are the
ones whose token-per-task ratio goes down quarter over quarter,
because they are getting better at the craft of using these tools.
The engineers who lose are the ones whose token consumption keeps
rising in service of an internal dashboard. The leaderboards of
2026 are systematically identifying the second group as the
winners. The market will correct this eventually, because the
finance side of any organization will eventually notice that
billions of dollars are being spent on a metric rather than on
output. The careers damaged in the meantime are the cost of the
correction.

The engineers in any tokenmaxxing organization who quietly continue
to optimize their token spend downward, build proper RAG and
caching, choose right-sized models, and write economical prompts,
are doing the actual job. They are also being publicly ranked
beneath the engineers who run agents on trivial tasks to inflate
their numbers.

The discomfort of this mismatch is the most useful thing a
thoughtful engineer can carry forward from this episode. The
discomfort is correct. The metric is wrong. The engineering work
that the metric is failing to measure is still the right work to
do.

Count the things you actually want. Resist the temptation to count
the things that are easy to count just because they are easy. Treat
any leaderboard that ranks engineers on a single dimension as
evidence of management that does not understand its own work.

The Token Legend title is a prize for losing the more important
game.

Tagged: #ai #llm #metrics

                