Token Sprawl Is Real. Here's How to Cap It.

Your engineering team's Anthropic bill jumped 4x last month. Nobody shipped a new product. Nobody onboarded a major customer. Someone just discovered they can pipe entire log files into Claude Opus to find a typo, and word got around. This is the conversation happening in finance meetings right now, and the fix isn't "ban AI" — it's plumbing.
The tokenmaxxing phase — where every developer was encouraged to throw the biggest model at every task — was brief. We're now in the token rationing phase. CFOs are looking at usage dashboards that didn't exist 12 months ago and asking why a single engineer ran $1,800 of inference in a week to refactor a 200-line file. The companies handling this well aren't rationing access. They're routing it.
Why your AI spend exploded in 90 days
The pattern is consistent across the teams I audit: spend doesn't grow linearly with headcount or output. It grows with habit formation. Once someone gets comfortable pasting a full repo into a frontier model, they stop pasting small chunks into smaller models. The unit cost per task doubles, then doubles again, while the perceived value stays flat.
Three behaviors drive most of the bleed:
- Frontier-by-default. Every task — summarizing a Slack thread, writing a regex, naming a variable — gets routed to the most expensive model available. Anthropic's own pricing pages show a roughly 5x cost gap between Haiku-class and Opus-class models per million tokens, and that gap matters when you multiply by 50 engineers making 200 calls a day.
- No context discipline. Engineers paste 80k tokens of context to answer a question that needs 2k. Long-context windows trained people to stop curating inputs.
- No caching. The same system prompt — your coding standards, your architecture doc, your style guide — gets re-billed on every single request because nobody flipped on prompt caching.
According to a 2025 a16z analysis of enterprise AI spend, the median company underestimated its first-year LLM cost by roughly 2-3x, primarily because individual developer usage scaled faster than procurement modeled. The number isn't the point. The pattern is: bottom-up adoption + top-down pricing = surprise.
The four categories of token waste
Before you cut anything, sort your usage into these four buckets. The fix is different for each.
| Category | What it looks like | Typical % of spend | Fix |
|---|---|---|---|
| Model mismatch | Opus answering "format this JSON" | 30-50% | Routing |
| Context bloat | Pasting whole repos for one-file edits | 15-25% | Retrieval + scoping |
| Repeat context | Same system prompt billed 10k times/day | 10-20% | Prompt caching |
| Failed agents | Loops that retry 8 times before giving up | 5-15% | Budgets + guardrails |
Most teams attack this with a Slack announcement: "Please use Haiku for simple tasks." That fails within a week because individual engineers don't know what counts as simple, and the cost of guessing wrong (a worse answer) feels personal while the savings feel abstract. You need the routing to happen below the engineer, not above them.
Model routing: the highest-leverage fix
A router sits between your application (or your developers' tooling) and the model APIs. It looks at the request, decides which model is appropriate, and forwards it. Done right, it cuts spend 40-70% with no quality drop on the easy 80% of tasks, and you keep frontier models for the hard 20%.
Here's a minimal router that classifies by task complexity:
import anthropic
client = anthropic.Anthropic()
# Cheap classifier decides routing
def classify_complexity(prompt: str) -> str:
resp = client.messages.create(
model="claude-haiku-4-5",
max_tokens=10,
messages=[{
"role": "user",
"content": f"Classify this task as SIMPLE, MEDIUM, or HARD. "
f"Reply with one word only.\n\nTask: {prompt[:500]}"
}],
)
return resp.content[0].text.strip().upper()
MODEL_MAP = {
"SIMPLE": "claude-haiku-4-5",
"MEDIUM": "claude-sonnet-4-5",
"HARD": "claude-opus-4-5",
}
def route(prompt: str, **kwargs):
tier = classify_complexity(prompt)
model = MODEL_MAP.get(tier, "claude-sonnet-4-5")
return client.messages.create(
model=model,
messages=[{"role": "user", "content": prompt}],
**kwargs,
)
This is intentionally crude. A production router will use signals like: input token count, presence of code, whether the user is a paid customer, whether the request is part of an agent loop, and historical accuracy on similar tasks. Tools like OpenRouter, Martian, and Portkey solve the boring parts (failover, observability) so you can focus on the routing rules.
A few rules of thumb that have held up across my client projects:
- Default to the mid-tier model, not the frontier one. Use the cheap model as the classifier, not the worker.
- Route by intent, not by user. "All engineers use Sonnet" is a worse policy than "all single-file edits use Sonnet."
- Log the routing decision. You need to A/B the classifier itself when accuracy complaints come in.
Prompt caching is free money
If you're sending the same system prompt with every request — your coding standards, your tool definitions, your few-shot examples — you're paying full price for tokens the provider has already processed for you minutes ago. Anthropic's prompt caching reduces the cost of cached input tokens to roughly 10% of the standard rate, with cache writes priced at about 1.25x. The break-even is usually 2-3 hits.
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
system=[
{
"type": "text",
"text": LARGE_STATIC_SYSTEM_PROMPT, # 8000 tokens of standards
"cache_control": {"type": "ephemeral"}
}
],
messages=[{"role": "user", "content": user_question}],
)
What to cache:
- System prompts over ~1k tokens
- Tool definitions for agents
- Long reference documents queried repeatedly (style guides, schemas, API specs)
- Conversation history in long-running sessions
What not to cache: anything that changes per user, retrieval results, or short prompts where the cache write overhead eats the savings.
Right-sizing agents: the 5x problem
Agents are where token bills go nonlinear. A poorly-built agent doesn't cost 2x a single call — it costs 10-50x, because each step adds the full conversation history back into context. A 12-step agent with growing history can re-bill the same opening tokens twelve times.
Three controls that actually work:
1. Hard budget per task. Every agent invocation gets a token ceiling. When it's hit, the agent must return whatever it has or escalate to a human.
class TokenBudget:
def __init__(self, max_tokens: int):
self.max = max_tokens
self.used = 0
def charge(self, n: int):
self.used += n
if self.used > self.max:
raise BudgetExceeded(
f"Agent exceeded budget: {self.used}/{self.max}"
)
# In your agent loop
budget = TokenBudget(max_tokens=50_000)
for step in range(MAX_STEPS):
resp = client.messages.create(...)
budget.charge(resp.usage.input_tokens + resp.usage.output_tokens)
2. Context window pruning. After every N steps, summarize the conversation back down. Yes, you lose some fidelity. You also stop paying for tokens about a tool call you made eight steps ago.
3. Tool-call short-circuits. If the agent calls the same tool with the same arguments twice in a row, that's a loop. Kill it. Most agent frameworks now support this natively but very few teams configure it.
A practical FinOps stack for LLMs
The companies handling token sprawl well treat it like any other infra cost: meter it, attribute it, alert on it. Here's the minimum viable stack:
| Layer | Purpose | Open-source options | Commercial |
|---|---|---|---|
| Gateway | Single endpoint, key management, routing | LiteLLM, Helicone | Portkey, OpenRouter |
| Observability | Per-request logs, cost, latency | Langfuse, Phoenix | Datadog LLM, Helicone |
| Caching | Semantic + exact match | GPTCache, Redis | Provider-native (Anthropic, OpenAI) |
| Evals | Catch quality regressions when routing | Promptfoo, DeepEval | Braintrust, Patronus |
You don't need all of this on day one. The order that has worked for me: gateway first (so you have one chokepoint), observability second (so you know what's actually happening), caching third (the easy wins), evals fourth (so you can route more aggressively without fear).
A 2025 OpenAI cookbook recommendation puts it bluntly: "Optimize for the cheapest model that meets your quality bar, not the smartest model you have access to." That's the whole game.
Policies that don't make engineers hate you
Token rationing fails the moment it feels like surveillance. Here's what works in practice:
- Budget by team, not by person. A team with a $5k/month inference budget will self-police. A person with a $200/month budget will hide usage on personal keys.
- Surface cost in the IDE. If an engineer sees "this prompt will cost ~$0.40" before sending, behavior shifts immediately. Cursor, Continue, and Claude Code all support this now.
- Make the cheap path the default path. If using Haiku requires a config change and Opus is the default, you've lost. Flip it.
- Publish a "when to use what" doc. Three lines, not three pages: "Use Haiku for: extraction, formatting, classification. Sonnet for: code, drafts, reasoning. Opus for: anything where a wrong answer costs more than $50."
- Show the savings publicly. "Routing saved us $14k in May" goes in the engineering all-hands. People want to be on the team that ships the optimization.
What doesn't work: blanket bans, per-request approval workflows, weekly "AI cost" shame emails. Those just push usage to personal credit cards and Anthropic Console accounts you don't control.
How BizFlowAI approaches this
Most of our client engagements that start as "build us an agent" end up including a token spend audit, because the bill is often the reason they called. We typically find a 40-60% reduction available within two weeks: model routing on the obvious traffic, prompt caching on the long system prompts, and tightening the agent loops that were spinning. The work is unglamorous — instrumenting the gateway, classifying historical traffic, swapping models behind a feature flag — but it's the difference between an AI line item that scales with revenue and one that scales with team morale.
If your monthly AI bill has more than doubled in the last quarter and you can't draw a straight line from that increase to revenue or hours saved, that's the audit. We map every call to a category, route the easy 80% to right-sized models, set hard budgets on agent loops, and leave you with a dashboard that shows cost per task instead of cost per month. Book a discovery call and we'll walk through your actual usage before recommending anything.
What to do this week
If you read this far, here's the order I'd attack it in starting Monday:
- Pull the last 30 days of usage from your provider's console. Sort by user and by endpoint. The Pareto distribution will be brutal — usually 3-5 people account for 60%+ of spend.
- Enable prompt caching on your largest system prompts. This is the cheapest 20% win available.
- Put a gateway in front of everything. Even if you don't route yet, having one chokepoint means you can route tomorrow.
- Set per-team budgets, not per-person. Make them generous enough that nobody hits them in week one — you want data, not friction.
- Add cost-per-task to your engineering dashboards alongside latency and error rate. Make it visible.
The companies that handled cloud cost sprawl in 2014-2018 are the same ones handling token sprawl in 2026: the ones who built FinOps muscle early. You don't need to be Netflix about it. You do need a gateway, a routing rule, and someone who looks at the dashboard once a week.
The era of "just throw the biggest model at it" is ending not because the models got worse, but because the bills got real. The teams that adapt fastest will be the ones that treat inference like compute — measured, attributed, and right-sized — instead of like magic.
Work with BizFlowAI
If you'd rather have this built for you, that's what we do: production AI automation for solo founders and small teams — agents, integrations, and document pipelines that actually ship.
Book a free discovery call — 30 minutes, we map the highest-ROI automation in your workflow. No pitch deck, just engineering.
More guides like this on the BizFlowAI blog.
Frequently asked questions
How do I reduce my Anthropic Claude API bill without limiting developer access?
Implement model routing instead of access restrictions. Place a gateway between developers and the Claude API that classifies each request and routes simple tasks to Haiku, medium tasks to Sonnet, and only hard tasks to Opus. Combine this with prompt caching for static system prompts and token budgets for agent loops. Teams typically cut spend 40-70% on easy tasks while preserving quality on hard ones.
What is LLM model routing and how does it work?
LLM model routing is a layer between your application and model APIs that inspects each request and forwards it to the cheapest model that can handle it. A common pattern uses a cheap classifier (like Claude Haiku) to label tasks as SIMPLE, MEDIUM, or HARD, then maps each tier to an appropriately sized model. Production routers also use signals like input token count, code presence, and historical accuracy. Tools like LiteLLM, OpenRouter, and Portkey handle failover and observability.
When should I use Anthropic prompt caching?
Use prompt caching whenever you send the same content repeatedly within minutes: system prompts over ~1k tokens, tool definitions for agents, long reference documents like style guides or API specs, and conversation history in long sessions. Cached input tokens cost roughly 10% of the standard rate, while cache writes cost about 1.25x, so break-even is typically 2-3 hits. Avoid caching per-user content, retrieval results, or short prompts.
Why do AI agents cost so much more than single LLM calls?
Agents re-bill the conversation history on every step, so a 12-step agent can charge for the same opening tokens twelve times. Failed loops that retry repeatedly compound this further. Control costs with three mechanisms: a hard token budget per task that forces the agent to stop or escalate, periodic context window pruning that summarizes old steps, and short-circuits that kill the agent when it calls the same tool with identical arguments twice.
What does a minimum FinOps stack for LLMs look like?
Start with four layers in order: a gateway (LiteLLM or Portkey) to create a single chokepoint with key management and routing, observability (Langfuse, Helicone) for per-request cost and latency logs, caching (provider-native or GPTCache) for the easy wins, and evals (Promptfoo, Braintrust) to catch quality regressions when routing aggressively. You don't need all four on day one, but the gateway must come first so every request flows through one controllable point.