Use prompt caching aggressively
Anthropic's prompt caching lets you mark a prefix of your prompt as cacheable. Subsequent requests that share that prefix read the cached portion at a fraction of the normal input price (cache reads are billed at roughly a tenth of the base rate, while the initial cache write costs modestly more, so it pays off from the second hit onward). For applications that re-send the same system prompt, tool definitions, or document context across many requests, this is the single biggest lever: often a 5-10x reduction on the cached portion of input tokens.
The catch: the default cache TTL is short (5 minutes), so it works best for high-frequency workloads. Caching also matches on exact prefixes, so restructure prompts so the large, stable content sits at the top and the variable parts sit at the bottom.
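As a concrete sketch of the "stable content first" rule: in the Messages API you attach a `cache_control` marker to the last stable block, and everything up to that point becomes the cacheable prefix. The helper below just builds the keyword arguments you would pass to `client.messages.create()`; the model id is a placeholder, not a guaranteed current name.

```python
def build_cached_request(stable_context: str, user_question: str) -> dict:
    """Build Messages API kwargs with the large, stable context first
    and flagged cacheable, and the variable user content last."""
    return {
        "model": "claude-sonnet-latest",  # placeholder model id
        "max_tokens": 1024,
        # Stable prefix: system prompt + reference docs. The cache_control
        # marker tells Anthropic to cache everything up to this block.
        "system": [
            {
                "type": "text",
                "text": stable_context,
                "cache_control": {"type": "ephemeral"},  # 5-minute TTL
            }
        ],
        # Variable content goes after the cached prefix so it never
        # invalidates the cache.
        "messages": [{"role": "user", "content": user_question}],
    }
```

Every request built this way shares the same `system` prefix, so the second and later calls within the TTL window hit the cache.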
Pick the right model tier
Claude comes in tiers — Opus, Sonnet, Haiku — at very different price points. Many teams default to the most capable model for everything, when most of their requests would be answered just as well by a cheaper tier. A common pattern is routing: cheap model for classification, summarization, simple lookups; expensive model for reasoning-heavy tasks. A simple router can cut total spend by 60–80% with minimal quality impact.
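A minimal version of that router is just a lookup on task type. The task labels, model ids, and per-request costs below are all illustrative assumptions, not real pricing.

```python
# Placeholder model ids, not guaranteed current names.
CHEAP_MODEL = "claude-haiku-latest"
EXPENSIVE_MODEL = "claude-opus-latest"

# Task types we consider simple enough for the cheap tier (assumed labels).
CHEAP_TASKS = {"classify", "summarize", "extract", "lookup"}

def pick_model(task_type: str) -> str:
    """Route simple tasks to the cheap tier, everything else to the expensive one."""
    return CHEAP_MODEL if task_type in CHEAP_TASKS else EXPENSIVE_MODEL

def estimate_savings(task_counts: dict[str, int],
                     cheap_cost: float, expensive_cost: float) -> float:
    """Fraction of spend saved vs. sending every request to the expensive
    model, assuming equal token counts per request."""
    total = sum(task_counts.values())
    baseline = total * expensive_cost
    routed = sum(
        n * (cheap_cost if pick_model(t) == CHEAP_MODEL else expensive_cost)
        for t, n in task_counts.items()
    )
    return 1 - routed / baseline
```

With a workload that is 80% classification and a cheap tier at 1/15th the cost, this estimator lands in the 60-80% savings range the section describes.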
Constrain output length
Output tokens are billed at several times the input rate. If you do not set max_tokens tightly, the model is happy to ramble. For structured outputs, JSON responses, or short answers, set a strict cap. For chat-style applications, prompt the model explicitly for brevity. Output reduction often matters more than input reduction.
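Both levers, the hard cap and the brevity instruction, fit in the request itself. A sketch, again with a placeholder model id and an example system prompt of our own invention:

```python
def build_terse_request(question: str, max_out: int = 256) -> dict:
    """Messages API kwargs combining a hard output cap with an
    explicit brevity instruction."""
    return {
        "model": "claude-haiku-latest",  # placeholder model id
        "max_tokens": max_out,  # hard ceiling on billed output tokens
        # Prompt-level brevity: the cap truncates, the instruction shapes.
        "system": "Answer in at most three sentences. No preamble, no recap.",
        "messages": [{"role": "user", "content": question}],
    }
```

The two work together: `max_tokens` guarantees a worst-case bill, while the instruction keeps the answer from being truncated mid-sentence.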
Use the Batch API for non-realtime work
The Batch API gives you a 50% discount on requests that can wait up to 24 hours for a response. Anything that runs on a schedule — overnight reports, document summarization pipelines, evaluation runs, dataset labeling — should use it. The savings are automatic and the only cost is a small amount of orchestration code.
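The orchestration code really is small. A batch is a list of entries, each with a `custom_id` (so you can match results back to inputs) and the usual Messages API `params`. The helper below builds that payload for a document-summarization run; model id and prompt wording are illustrative.

```python
def build_batch_requests(docs: dict[str, str],
                         model: str = "claude-haiku-latest") -> list[dict]:
    """Build the requests payload for the Message Batches API:
    one entry per document, keyed by custom_id for result matching."""
    return [
        {
            "custom_id": doc_id,  # echoed back in the batch results
            "params": {
                "model": model,  # placeholder model id
                "max_tokens": 512,
                "messages": [
                    {"role": "user",
                     "content": f"Summarize this document:\n\n{text}"}
                ],
            },
        }
        for doc_id, text in docs.items()
    ]
```

You would pass this list to `client.messages.batches.create(requests=...)`, then poll for completion; results arrive within the 24-hour window at half price.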
If you are using Claude Code, take the shortcut
The above levers apply when you control your own API calls. If your spend is going through Claude Code rather than a custom application, you have far less control: caching and model selection are largely managed for you, and there is no way to batch requests. The most direct path is to cut the volume of context Claude Code sends in the first place.
That is what Headroom does — it intercepts every Claude Code prompt locally, strips out repetitive logs and boilerplate, and forwards a smaller payload. Token spend drops ~50% with no quality regression. Read our Claude Code cost guide for the full breakdown of which workflows benefit most, or the Why is Claude Code so expensive? page if you want to understand the underlying patterns first.