Use prompt caching aggressively
Anthropic's prompt caching lets you mark a prefix of your prompt as cacheable. Subsequent requests that share that prefix read the cached portion at a fraction of the normal input price (cache reads are billed at roughly a tenth of the base rate, while the initial cache write costs modestly more, so it pays off from the second hit onward). For applications that re-send the same system prompt, tool definitions, or document context across many requests, this is the single biggest lever: often a 5-10x reduction on the cached portion of input tokens.
The catch: the default cache TTL is short (5 minutes), so it works best for high-frequency workloads. Caching also matches on exact prefixes, so restructure prompts so the large, stable content sits at the top and the variable parts sit at the bottom.
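As a concrete sketch of the "stable content first" rule: in the Messages API you attach a `cache_control` marker to the last stable block, and everything up to that point becomes the cacheable prefix. The helper below just builds the keyword arguments you would pass to `client.messages.create()`; the model id is a placeholder, not a guaranteed current name.

```python
def build_cached_request(stable_context: str, user_question: str) -> dict:
    """Build Messages API kwargs with the large, stable context first
    and flagged cacheable, and the variable user content last."""
    return {
        "model": "claude-sonnet-latest",  # placeholder model id
        "max_tokens": 1024,
        # Stable prefix: system prompt + reference docs. The cache_control
        # marker tells Anthropic to cache everything up to this block.
        "system": [
            {
                "type": "text",
                "text": stable_context,
                "cache_control": {"type": "ephemeral"},  # 5-minute TTL
            }
        ],
        # Variable content goes after the cached prefix so it never
        # invalidates the cache.
        "messages": [{"role": "user", "content": user_question}],
    }
```

Every request built this way shares the same `system` prefix, so the second and later calls within the TTL window hit the cache.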
Pick the right model tier
Claude comes in tiers — Opus, Sonnet, Haiku — at very different price points. Many teams default to the most capable model for everything, when most of their requests would be answered just as well by a cheaper tier. A common pattern is routing: cheap model for classification, summarization, simple lookups; expensive model for reasoning-heavy tasks. A simple router can cut total spend by 60–80% with minimal quality impact.
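A minimal version of that router is just a lookup on task type. The task labels, model ids, and per-request costs below are all illustrative assumptions, not real pricing.

```python
# Placeholder model ids, not guaranteed current names.
CHEAP_MODEL = "claude-haiku-latest"
EXPENSIVE_MODEL = "claude-opus-latest"

# Task types we consider simple enough for the cheap tier (assumed labels).
CHEAP_TASKS = {"classify", "summarize", "extract", "lookup"}

def pick_model(task_type: str) -> str:
    """Route simple tasks to the cheap tier, everything else to the expensive one."""
    return CHEAP_MODEL if task_type in CHEAP_TASKS else EXPENSIVE_MODEL

def estimate_savings(task_counts: dict[str, int],
                     cheap_cost: float, expensive_cost: float) -> float:
    """Fraction of spend saved vs. sending every request to the expensive
    model, assuming equal token counts per request."""
    total = sum(task_counts.values())
    baseline = total * expensive_cost
    routed = sum(
        n * (cheap_cost if pick_model(t) == CHEAP_MODEL else expensive_cost)
        for t, n in task_counts.items()
    )
    return 1 - routed / baseline
```

With a workload that is 80% classification and a cheap tier at 1/15th the cost, this estimator lands in the 60-80% savings range the section describes.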
Constrain output length
Output tokens are billed at several times the input rate. If you do not set max_tokens tightly, the model is happy to ramble. For structured outputs, JSON responses, or short answers, set a strict cap. For chat-style applications, prompt the model explicitly for brevity. Output reduction often matters more than input reduction.
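Both levers, the hard cap and the brevity instruction, fit in the request itself. A sketch, again with a placeholder model id and an example system prompt of our own invention:

```python
def build_terse_request(question: str, max_out: int = 256) -> dict:
    """Messages API kwargs combining a hard output cap with an
    explicit brevity instruction."""
    return {
        "model": "claude-haiku-latest",  # placeholder model id
        "max_tokens": max_out,  # hard ceiling on billed output tokens
        # Prompt-level brevity: the cap truncates, the instruction shapes.
        "system": "Answer in at most three sentences. No preamble, no recap.",
        "messages": [{"role": "user", "content": question}],
    }
```

The two work together: `max_tokens` guarantees a worst-case bill, while the instruction keeps the answer from being truncated mid-sentence.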
Use the Batch API for non-realtime work
The Batch API gives you a 50% discount on requests that can wait up to 24 hours for a response. Anything that runs on a schedule — overnight reports, document summarization pipelines, evaluation runs, dataset labeling — should use it. The savings are automatic and the only cost is a small amount of orchestration code.
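The orchestration code really is small. A batch is a list of entries, each with a `custom_id` (so you can match results back to inputs) and the usual Messages API `params`. The helper below builds that payload for a document-summarization run; model id and prompt wording are illustrative.

```python
def build_batch_requests(docs: dict[str, str],
                         model: str = "claude-haiku-latest") -> list[dict]:
    """Build the requests payload for the Message Batches API:
    one entry per document, keyed by custom_id for result matching."""
    return [
        {
            "custom_id": doc_id,  # echoed back in the batch results
            "params": {
                "model": model,  # placeholder model id
                "max_tokens": 512,
                "messages": [
                    {"role": "user",
                     "content": f"Summarize this document:\n\n{text}"}
                ],
            },
        }
        for doc_id, text in docs.items()
    ]
```

You would pass this list to `client.messages.batches.create(requests=...)`, then poll for completion; results arrive within the 24-hour window at half price.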
If you are using Claude Code, take the shortcut
The above levers apply when you control your own API calls. If your spend is going through Claude Code rather than a custom application, you have far less control: caching and model selection are largely managed for you, and there is no way to batch requests. The most direct path is to cut the volume of context Claude Code sends in the first place.
That is what Headroom does — it intercepts every Claude Code prompt locally, strips out repetitive logs and boilerplate, and forwards a smaller payload. Token spend drops ~50% with no quality regression. Read our Claude Code cost guide for the full breakdown of which workflows benefit most, or the Why is Claude Code so expensive? page if you want to understand the underlying patterns first.