Everyone knows AI is getting cheaper — model prices have fallen dramatically over the past two years, and the narrative is firmly one of democratization. But a quieter, less flattering counter-story is unfolding inside corporate IT departments: uncontrolled AI usage has already handed at least one company an unexpected $500 million bill, and the tab is still climbing across the industry.
What Is It?
At the heart of this cost story is a concept called a token. When you type a question into a generative AI tool — say, asking it to summarise a contract or draft an email — the AI doesn’t read your words the way you do. Instead, it breaks your text into small chunks called tokens. Think of tokens as syllable-sized puzzle pieces: a word like “running” might be one token, while “unbelievable” might be split into two or three. Numbers, punctuation, and spaces each have their own tokens too.
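To make the sub-word idea concrete, here is a minimal sketch of a greedy longest-match tokeniser. The vocabulary is hypothetical and chosen only so that the two example words above split as described. Production tokenisers learn far larger vocabularies from data and use different algorithms, so real token counts will differ.

```python
# Toy vocabulary: hypothetical entries picked for this illustration only.
VOCAB = {"running": 0, "un": 1, "believ": 2, "able": 3, " ": 4}


def tokenize(text, vocab):
    """Greedy longest-match: at each position, take the longest known piece."""
    ids = []
    max_len = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


print(tokenize("running", VOCAB))       # one token
print(tokenize("unbelievable", VOCAB))  # three tokens
```

The output is a list of integer IDs, which is the form in which a model actually receives text and the unit that gets counted on the bill.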
Tokens matter because they are the unit by which AI companies measure and charge for usage. Send a short prompt, get a short answer — a handful of tokens. Ask an AI agent to research a topic, generate a report, and cross-reference a database — you could burn thousands or even millions of tokens in a single automated workflow. Google CEO Sundar Pichai has described tokens as “the fundamental units of data our models process, many representing a problem being solved.” Google itself processes roughly 3.2 quadrillion tokens a month — a number that gives a sense of the industrial scale involved.
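A quick back-of-the-envelope conversion puts the monthly figure in perspective. The sketch below uses the 3.2 quadrillion number quoted above; the 30-day month is an assumption made here for simplicity.

```python
# Convert a monthly token volume into a per-second rate.
# Assumption: a 30-day month (not stated in the article).
TOKENS_PER_MONTH = 3.2e15  # roughly 3.2 quadrillion, as quoted above
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

tokens_per_second = TOKENS_PER_MONTH / SECONDS_PER_MONTH
print(f"{tokens_per_second:,.0f} tokens per second")
```

Under that assumption the rate works out to roughly 1.2 billion tokens every second.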
Large language models (LLMs) — the type of AI system behind tools like ChatGPT, Gemini, and Claude — are especially token-hungry. An LLM is an AI trained on vast quantities of text to predict and generate language. Every step of reasoning, every line of output, every piece of context the model is asked to hold in memory: all of it costs tokens.
Why It Matters
For individual users, token costs are mostly invisible — absorbed into a flat monthly subscription. But for businesses deploying AI at scale, they can become a dominant line item on the IT budget almost overnight. As Uber discovered after burning through its annual AI budget in four months, the economics of AI at enterprise scale are brutal if left unmanaged.
This isn’t merely a CFO headache. Token costs shape which AI use cases are viable, which vendors win enterprise contracts, and ultimately how quickly AI can be deployed across an organization. A tool that looks cheap in a pilot — running a few hundred test queries — can look terrifying when rolled out to 50,000 employees or embedded inside automated agent workflows that run around the clock.
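The gap between a pilot and a full rollout is easy to see with simple arithmetic. The sketch below is illustrative only: the tokens per query, the price per million tokens, the queries per employee, and the 22 working days are all assumed figures, not numbers from any vendor or from the companies mentioned here.

```python
def usage_cost(queries, tokens_per_query, price_per_million_tokens):
    """Cost in dollars for a number of queries at a flat per-token price."""
    total_tokens = queries * tokens_per_query
    return total_tokens / 1_000_000 * price_per_million_tokens


# All figures below are hypothetical.
TOKENS_PER_QUERY = 2_000
PRICE_PER_MILLION = 5.0

pilot_queries = 300                    # "a few hundred test queries"
rollout_queries = 50_000 * 10 * 22     # 50,000 staff, 10 queries a day, 22 days

pilot = usage_cost(pilot_queries, TOKENS_PER_QUERY, PRICE_PER_MILLION)
rollout = usage_cost(rollout_queries, TOKENS_PER_QUERY, PRICE_PER_MILLION)

print(f"pilot:   ${pilot:,.2f}")
print(f"rollout: ${rollout:,.2f} per month")
```

With these assumptions the pilot costs a few dollars and the rollout costs six figures a month, before any automated agents are added.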
Gartner senior director analyst Deepak Seth puts it plainly: there is “sometimes overkill” with large language models. A sprawling model trained on the collected works of Shakespeare and Dickens is not always the right tool for answering a routine HR query. Using it as though it were is an expensive habit.
There is a structural parallel worth drawing here. The token crisis rhymes closely with the server sprawl problem that hit enterprises in the early 2000s, when cheap virtualization software and easy provisioning led IT teams to spin up servers indiscriminately, until the bills arrived. Dheeraj Pandey, CEO of DevRev, makes exactly this comparison, likening the current AI moment to the pre-virtualization era of cloud computing. The fix back then was consolidation and abstraction layers. The fix emerging now (caching, memory layers, smarter routing) follows the same engineering logic. The mistake enterprises made with cloud cost governance appears to be repeating itself, this time denominated in tokens rather than compute hours. The same pattern is hard to ignore in how AI infrastructure spending is reshaping broader cost structures.
How It Works
Step 1 — Your prompt enters the model
When you send a message to an AI system, it is tokenised, that is, converted from human-readable text into numerical representations the model can process. Think of it like an interpreter at a conference: every word you say must be translated before the person you are addressing can respond. The longer your speech, the more work the interpreter does, and the higher the bill.
Step 2 — The model reasons and generates
The LLM processes your tokens in what is called its context window, essentially its working memory. Everything the model needs to “know” in order to answer you must fit within this window. Larger, more complex tasks fill more of that window, which means more tokens and more compute. Advanced “reasoning” models, which think through problems step by step before answering, are particularly expensive: every intermediate reasoning step generates tokens of its own.
Step 3 — Output is returned (and charged)
The model generates a response, also measured in tokens. In most commercial APIs, input tokens (your prompt) and output tokens (the AI’s response) are charged separately, with output tokens typically costing more. Automated agents, software that uses AI to complete tasks with minimal human input, can chain dozens of these input-output cycles together, multiplying costs rapidly.
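The three steps above can be sketched as a small cost model. The prices are hypothetical placeholders, and the agent loop assumes that each step's output is appended to the context for the next step, which is a common pattern but not the only one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per million input tokens (hypothetical)
    output_per_million: float  # USD per million output tokens (hypothetical)


def call_cost(input_tokens, output_tokens, pricing):
    """Cost of one request: input and output are priced separately."""
    return (input_tokens * pricing.input_per_million
            + output_tokens * pricing.output_per_million) / 1_000_000


def agent_chain_cost(steps, prompt_tokens, output_tokens_per_step, pricing):
    """Cost of an agent that feeds each step's output back in as context."""
    total = 0.0
    context = prompt_tokens
    for _ in range(steps):
        total += call_cost(context, output_tokens_per_step, pricing)
        context += output_tokens_per_step  # earlier output becomes later input
    return total


pricing = Pricing(input_per_million=3.0, output_per_million=15.0)
single = call_cost(1_000, 500, pricing)
chain = agent_chain_cost(20, 1_000, 500, pricing)

print(f"single call:   ${single:.4f}")
print(f"20-step agent: ${chain:.4f} ({chain / single:.0f}x a single call)")
```

Under these assumptions a 20-step chain costs about 47 times as much as one call, not 20 times, because the growing context is re-sent as input at every step.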
Understanding this pipeline matters because it reveals where savings are possible: at the prompt level, the model selection level, and the infrastructure level. As token pricing has started to fall thanks to model competition, smart architecture decisions can compound those savings further.
What companies are actually doing to cut the bill
Switching to lighter models. Not every task needs the most powerful model available. Google’s Gemini 2.5 Flash, for instance, is positioned as a “frontier-capable” model at less than half the price of comparable flagship models. Routing routine queries to cheaper models while reserving heavier models for complex reasoning can yield significant savings. Amazon Q, a business AI assistant priced at $20 per month, has found fans among analysts like Hyperframe Research’s Steven Dickens for exactly this reason.
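A routing layer can be as simple as a rule that decides which model a prompt deserves. The sketch below is a deliberately naive heuristic: the model tiers, their prices, the keyword list, and the length threshold are all invented for illustration, and real routers often use classifiers or a small model to make this call.

```python
import re

# Hypothetical model tiers and prices (USD per million tokens).
MODELS = {
    "light": {"price_per_million": 0.50},
    "heavy": {"price_per_million": 10.00},
}
# Hypothetical hints that a prompt needs multi-step reasoning.
REASONING_HINTS = {"analyze", "compare", "plan", "prove"}


def choose_model(prompt, max_light_words=200):
    """Send long or reasoning-heavy prompts to the heavy model, the rest to light."""
    words = re.findall(r"[a-z]+", prompt.lower())
    needs_reasoning = not REASONING_HINTS.isdisjoint(words)
    is_long = len(words) > max_light_words
    return "heavy" if needs_reasoning or is_long else "light"


def blended_price(light_share):
    """Average price per million tokens when a share of traffic goes to light."""
    return (light_share * MODELS["light"]["price_per_million"]
            + (1 - light_share) * MODELS["heavy"]["price_per_million"])


print(choose_model("How many vacation days do I have left?"))
print(choose_model("Compare these two supplier contracts and plan the migration."))
print(f"blended price at 80% light: {blended_price(0.8):.2f}")
```

With these made-up prices, routing 80 percent of traffic to the light tier brings the blended price down from 10.00 to about 2.40 per million tokens.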
Adding memory and caching layers. DevRev is building a layer between AI agents and primary data sources — such as Salesforce or ERP records — that holds a knowledge graph of common answers. Agents query the cache first; only genuinely novel questions reach the LLM. This both reduces token load and cuts expensive GPU cycles in favour of cheaper CPU processing. NetBrain takes a similar approach: conventional computing maps a network’s topology, then only the relevant summary is handed to the AI for reasoning. “So you don’t have to spend all the tokens,” as NetBrain CTO Song Pang puts it.
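The cache-first idea can be shown with a much simpler stand-in than a knowledge graph. The sketch below uses a plain dictionary keyed on a normalised question; the class name, the `fake_llm` function, and the normalisation rule are all hypothetical and are not a description of how DevRev or NetBrain implement their layers.

```python
class CachedAnswerer:
    """Answer from a local cache when possible; call the model only on a miss."""

    def __init__(self, llm_call):
        self._llm_call = llm_call
        self._cache = {}
        self.llm_calls = 0  # how many questions actually reached the model

    @staticmethod
    def _key(question):
        # Normalise case and whitespace so trivial variants share one entry.
        return " ".join(question.lower().split())

    def ask(self, question):
        key = self._key(question)
        if key in self._cache:
            return self._cache[key]
        self.llm_calls += 1
        answer = self._llm_call(question)
        self._cache[key] = answer
        return answer


def fake_llm(question):
    """Stand-in for a real, token-billed model call."""
    return f"answer to: {question}"


answerer = CachedAnswerer(fake_llm)
answerer.ask("What is our refund policy?")
answerer.ask("what is our   refund policy?")  # served from the cache
answerer.ask("Who owns account 42?")
print(answerer.llm_calls)
```

Three questions come in, but only two reach the model. An exact-match cache like this misses paraphrases, which is one reason production systems reach for richer structures.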
Prompt engineering. ManpowerGroup found that improving prompt design cut the number of follow-up questions users needed to ask its internal labour-market tool from an average of ten to four. Fewer exchanges mean fewer tokens. Prompt efficiency is unglamorous work, but the ROI is immediate.
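The saving from fewer exchanges can be larger than the drop from ten to four suggests. The sketch below assumes that each exchange re-sends the whole conversation so far, which is typical of chat-style APIs but is an assumption here, and it uses a made-up figure of 300 tokens per exchange.

```python
def conversation_tokens(exchanges, tokens_per_exchange):
    """Total tokens processed when every exchange re-sends the full history.

    Exchange k carries k exchanges' worth of tokens: the history plus the new turn.
    """
    return sum(k * tokens_per_exchange for k in range(1, exchanges + 1))


TOKENS_PER_EXCHANGE = 300  # hypothetical average

before = conversation_tokens(10, TOKENS_PER_EXCHANGE)
after = conversation_tokens(4, TOKENS_PER_EXCHANGE)
saving = 1 - after / before

print(f"before: {before:,} tokens, after: {after:,} tokens")
print(f"token reduction: {saving:.0%}")
```

Under that assumption, cutting exchanges by 60 percent cuts tokens processed by about 82 percent, because the longest and most expensive late turns are the ones that disappear.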
Going local. Perhaps the most structurally significant development is the push toward on-device and on-premises AI. At GTC Taipei, Nvidia and Microsoft unveiled the RTX Spark, an agentic AI desktop PC capable of running 120-billion-parameter models locally on Windows. Microsoft CEO Satya Nadella described the goal as delivering “unmetered intelligence to every home and every desk.” If token consumption moves to local hardware, the marginal cost per query drops to effectively zero, though the upfront hardware investment is substantial.
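Whether local hardware pays off comes down to a break-even calculation. The sketch below is a rough model: the hardware price, the monthly token volume, the API price, and the running cost are all hypothetical, since no pricing for the RTX Spark is given here.

```python
def breakeven_months(hardware_cost, monthly_tokens, price_per_million,
                     monthly_running_cost=0.0):
    """Months until local hardware pays for itself, or None if it never does."""
    monthly_api_cost = monthly_tokens / 1_000_000 * price_per_million
    monthly_saving = monthly_api_cost - monthly_running_cost
    if monthly_saving <= 0:
        return None  # metered usage is cheaper than running the box
    return hardware_cost / monthly_saving


# Hypothetical heavy user: 200 million tokens a month at $5 per million.
heavy = breakeven_months(4_000, 200_000_000, 5.0, monthly_running_cost=50)
# Hypothetical light user: 5 million tokens a month at the same price.
light = breakeven_months(4_000, 5_000_000, 5.0, monthly_running_cost=50)

print(heavy)
print(light)
```

In this example the heavy user recovers the hardware cost in a little over four months, while the light user never does, so "unmetered" is only a bargain above a certain volume.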
Forward-deployed engineers. AWS’s Generative AI Innovation Center is fielding teams of forward-deployed engineers (FDEs) — specialists embedded in customer environments specifically to architect cost-efficient AI systems. Taimur Rashid, the centre’s managing director, notes that the goal isn’t to minimize token use at all costs, but to ensure the economics hold: “if you’re generating revenue, as long as the economics work out, then you’re at peace.”
The Strongest Counterargument
The most credible objection to the token-cost panic is this: token prices are falling fast enough that optimization may be unnecessary. This is the position implicitly held by many AI-first companies and venture investors who argue that model efficiency improvements — driven by competition between OpenAI, Anthropic, Google, and open-source projects — will commoditize inference costs within two to three years, rendering today’s optimization efforts obsolete. The data supports part of this view: token prices have already fallen dramatically, and newer model architectures are becoming significantly more efficient per unit of output.
However, this argument has a critical flaw: it assumes that usage remains constant as prices fall. In practice, cheaper tokens historically drive more usage, not less — a classic Jevons paradox in which efficiency gains are consumed by increased demand. As AI agents become more autonomous and workflows more complex, the volume of tokens being processed is growing faster than the per-token price is falling. The $500 million surprise bill wasn’t caused by expensive tokens — it was caused by unexpected volume. Falling prices don’t fix a volume problem; they may actually worsen it. The optimization imperative, then, is not about price hedging — it is about architectural discipline that no amount of price compression will substitute for.
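The arithmetic behind this is simple: spend is price multiplied by volume, so a falling price only lowers the bill if volume grows more slowly than price falls. The sketch below uses invented rates (prices halving each year, volume tripling each year) purely to illustrate the mechanism.

```python
def projected_spend(initial_spend, annual_price_change, annual_volume_growth, years):
    """Yearly spend when price and volume each change by a fixed annual rate."""
    factor = (1 + annual_price_change) * (1 + annual_volume_growth)
    return [initial_spend * factor ** year for year in range(years + 1)]


# Hypothetical rates: price falls 50% a year, volume grows 200% a year.
spend = projected_spend(1_000_000, -0.50, 2.00, 3)

for year, amount in enumerate(spend):
    print(f"year {year}: ${amount:,.0f}")
```

With these rates the bill grows by half every year even though each token costs half as much as the year before.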
The race between frontier models like Claude Opus 4 and GPT-5 will likely keep per-token prices under pressure, which is genuinely good news. But enterprises that treat falling prices as a reason to avoid architectural governance are repeating the cloud-sprawl mistake in a new medium.
Common Misconceptions
1. “Tokens are just words.” Close, but not quite. Tokens are sub-word units determined by the model’s tokeniser. Common short words may be a single token; longer or rarer words are often split. In some models, a single emoji counts as multiple tokens. This matters practically because a prompt that looks short to a human eye may be token-heavy for the model, especially if it contains technical jargon, foreign-language text, or complex formatting.
2. “The expensive part is always the AI model itself.” Not necessarily. Sending agents directly at enterprise systems like ServiceNow or Salesforce without a caching layer “will burn a lot more tokens — it’s also not precise,” says DevRev’s Pandey. Infrastructure decisions — how data is retrieved, how context is assembled, how results are stored — often drive a larger share of token consumption than the underlying model choice.
3. “Token-based pricing is permanent.” Gartner’s Seth believes the industry is moving toward outcome-based pricing, where customers pay for results — a completed task, a resolved ticket, a successful analysis — rather than the number of tokens consumed. Some vendors are already piloting this model. If it takes hold, the entire optimization conversation shifts from “use fewer tokens” to “achieve more per dollar,” which is a fundamentally different engineering problem.
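One way to compare the two pricing models is to convert token spend into a cost per successful outcome. The sketch below assumes hypothetical figures for tokens per attempt, token price, and success rate, and a made-up outcome price for comparison.

```python
def token_cost_per_outcome(tokens_per_attempt, price_per_million, success_rate):
    """Effective token cost of one successful result, counting failed attempts."""
    cost_per_attempt = tokens_per_attempt / 1_000_000 * price_per_million
    return cost_per_attempt / success_rate


# Hypothetical: 40,000 tokens per ticket attempt, $5 per million tokens,
# and half of all attempts actually resolve the ticket.
per_ticket = token_cost_per_outcome(40_000, 5.0, 0.5)
OUTCOME_PRICE = 0.30  # hypothetical flat price per resolved ticket

print(f"token-based cost per resolved ticket: ${per_ticket:.2f}")
print(f"outcome-based price per resolved ticket: ${OUTCOME_PRICE:.2f}")
```

In this example the token-based route costs more per resolved ticket than the flat outcome price, and the gap closes either by using fewer tokens per attempt or by raising the success rate.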
Where to Learn More
- Google AI Research — Google’s research hub covers LLM architecture, efficiency advances, and the engineering behind large-scale model deployment.
- Anthropic Research — Anthropic publishes detailed papers on model safety, scaling, and efficiency that illuminate the token economics debate from first principles.
- Gartner AI Research — Gartner’s AI practice tracks enterprise adoption, cost governance, and the shift toward outcome-based AI metrics — directly relevant to the token-cost conversation.
- Why AI Is About to Make Everything More Expensive — Blockgeni’s analysis of how AI infrastructure spending ripples into broader economic costs.
- Amazon Engineers Speak Out Against the $200B Data Center Push — A ground-level view of how AI infrastructure costs look from inside one of the world’s largest cloud providers.
The Prediction
Within 18 months, token-based pricing will begin a meaningful shift toward outcome-based billing at the enterprise tier — and the companies that built architectural discipline into their AI stacks now will capture disproportionate margin as a result. The signal to watch: if two or more major AI vendors announce outcome-pricing tiers by the end of 2026, the transition is real. If token pricing remains the dominant model by mid-2027, today’s optimization work will still have delivered compounding savings — but the strategic window for differentiation will have narrowed considerably.