The one-sentence version: prompt caching lets you pay a fraction of the normal input price on tokens you send repeatedly. For workloads with a long stable prefix (system prompt + retrieved context + conversation history), that fraction is 10-25% of normal — so your input bill shrinks 70-90%.
The calculator below runs real numbers on your workload against each provider's current rates. Below that, a walkthrough of how each provider implements caching and when it's actually worth turning on.
· cache savings calculator
Example workload at Apr 2026 rates: 90% off reads, 25% write premium (5-min TTL). Assumes 30 days / month, a stable prefix above the 1,024-token minimum, and a realistic cache hit rate of 90-99% (steady traffic; the rate drops during quiet periods as the cache expires). Only the part that varies per call (user message, retrieved chunk) is uncached.
- Without caching: $720 / month (30,000 requests)
- With caching (explicit): $342 / month
- Savings: $378 / month (52.5% off)
Numbers are approximate — actual costs depend on exact cache hit rate, workload burstiness, and any commit or batch discounts. Pricing pulled from provider docs as of April 2026; providers change pricing, so verify at claude docs / openai docs before relying on these numbers for budgeting.
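The arithmetic behind numbers like these is simple enough to sketch. A minimal model of the explicit-caching cost formula, with illustrative token counts and a hypothetical $3/MTok input rate (these are assumptions for the example, not the calculator's actual inputs):

```python
def monthly_input_cost(requests, cached_tokens, fresh_tokens, price_per_mtok,
                       hit_rate=0.95, read_discount=0.90, write_premium=0.25,
                       use_cache=True):
    """Approximate monthly input cost under prefix caching.

    Cache hits pay (1 - read_discount) of the normal rate, misses pay
    (1 + write_premium), and the per-call fresh portion always pays full price.
    """
    fresh = requests * fresh_tokens * price_per_mtok / 1e6
    if not use_cache:
        return fresh + requests * cached_tokens * price_per_mtok / 1e6
    hits = requests * hit_rate * cached_tokens * (1 - read_discount)
    misses = requests * (1 - hit_rate) * cached_tokens * (1 + write_premium)
    return fresh + (hits + misses) * price_per_mtok / 1e6

# Hypothetical workload: 30,000 requests/month, 15K-token stable prefix,
# 1K-token fresh portion, $3/MTok, 95% hit rate.
before = monthly_input_cost(30_000, 15_000, 1_000, 3.0, use_cache=False)
after = monthly_input_cost(30_000, 15_000, 1_000, 3.0)
```

Under these assumed inputs the uncached cost is $1,440/month and the cached cost is about $303/month; the exact split shifts with the hit rate and the write premium, which is why burstiness matters so much.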
Why caching exists
When a model processes your prompt, it builds an internal "attention" state — a big matrix of how every token relates to every other token. For long prompts (10K+ tokens), building that state is the most expensive part of the request, often more than generating the output.
If you send the same prefix twice, the provider could in principle cache the attention state from the first call and reuse it for the second — skipping the rebuild entirely. The challenge was making this safe (caches scoped properly, never leaking across accounts) and efficient (cache lookup faster than recomputation). Anthropic launched it first in mid-2024, OpenAI followed in late 2024, Google in 2025. By 2026 it's standard across the frontier.
How each provider implements it
· Anthropic (Claude)
Explicit. You mark cacheable content with cache_control: { type: "ephemeral" } on a message block.
- Discount: cached reads cost 10% of the normal input rate (90% off).
- Write premium: 1.25× for the default 5-minute TTL, 2× for the 1-hour TTL.
- Minimum: 1,024 tokens (Opus, Sonnet) or 2,048 tokens (Haiku). Blocks below the minimum aren't cached.
- Scope: workspace-level (since Feb 2026) — previously org-level.
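In practice the marker is just a field on a content block. A sketch of a Messages API request body with the system prompt cached (model name and prompt text are placeholders):

```python
# Placeholder prompt; the real one must exceed the 1,024-token minimum
# or the block silently won't be cached.
LONG_SYSTEM_PROMPT = "You are a support agent for Acme Corp. ..."

request_body = {
    "model": "claude-sonnet-4-6",  # illustrative model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks everything up to and including this block cacheable.
            # Default 5-minute TTL; {"type": "ephemeral", "ttl": "1h"}
            # buys the 1-hour TTL at the 2x write premium.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "What's our refund policy?"}],
}
```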
· OpenAI
Automatic. If you reuse a prefix of 1,024+ tokens, you get the discount. No code changes.
- Discount: varies by model; typically 50% off cached input, rising to 90% on some newer tiers.
- Write premium: none.
- TTL: 5-10 minutes of inactivity, hard 1-hour cap.
- Granularity: cache hits occur in 128-token increments after the 1,024 floor is hit.
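Those two rules (the 1,024-token floor and 128-token steps) make the billable cached-token count easy to predict. A small sketch of the rounding as described above:

```python
def cached_token_count(prefix_tokens: int, floor: int = 1024, step: int = 128) -> int:
    """Tokens billed at the cached-input rate for a matched prefix:
    zero below the floor, then rounded down to a 128-token boundary."""
    if prefix_tokens < floor:
        return 0
    return floor + (prefix_tokens - floor) // step * step
```

A 1,500-token matched prefix yields 1,408 cached tokens (1,024 + 3 × 128); the 92-token remainder is billed at the normal input rate.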
· Google (Gemini)
Explicit, with storage fees. You create a cached content object via the Context Caching API, then reference it in subsequent requests.
- Discount: ~75% off cached input rate.
- Storage fee: $4.50 per 1M cached tokens per hour. Non-trivial — changes the math for anything you're not reading frequently.
- Minimum: 4,096 tokens.
- Model support: Gemini 2.5 Pro, 2.5 Flash, 3.1 Pro, 3 Flash.
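The storage fee means a Gemini cache has a break-even read rate below which it loses money. A back-of-envelope sketch, using the rates above (the $2.50/MTok input price in the example is an assumption):

```python
def breakeven_reads_per_hour(price_per_mtok: float,
                             discount: float = 0.75,
                             storage_per_mtok_hour: float = 4.50) -> float:
    """Reads per hour needed for read savings to cover the storage fee.
    The cached token count cancels out, so only the rates matter."""
    return storage_per_mtok_hour / (price_per_mtok * discount)
```

At an assumed $2.50/MTok input price, you need roughly 2.4 reads per hour per cache just to cover storage; caches you aren't reading that often should be deleted rather than left warm.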
When caching pays off
- Agents with a long system prompt. A tool-using agent typically has 5K-20K tokens of system prompt + tool definitions that are stable across every call. Caching this is a no-brainer — 70-90% savings with one line of code.
- RAG over a stable knowledge base. If every user query pulls from the same corpus and you send the corpus (or a stable summary) in the prompt, caching turns that cost into near-zero.
- Multi-turn conversations. Each turn sends the full conversation history plus the new message. Caching the history means each turn pays only for the new message.
- Large few-shot examples. If your prompt has a long list of demonstration examples that's the same across calls, cache them.
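For the multi-turn case on Claude, the usual pattern is to move a cache breakpoint forward each turn so everything up through turn N-1 is served from cache. A sketch (message shapes follow the Anthropic content-block format; error handling omitted):

```python
def with_history_breakpoint(history: list[dict], user_msg: str) -> list[dict]:
    """Mark the last history message cacheable so the next call reads the
    entire prior conversation from cache and pays full input price only
    for user_msg."""
    msgs = [dict(m) for m in history]
    if msgs:
        msgs[-1] = {
            "role": msgs[-1]["role"],
            "content": [{
                "type": "text",
                "text": msgs[-1]["content"],
                "cache_control": {"type": "ephemeral"},
            }],
        }
    msgs.append({"role": "user", "content": user_msg})
    return msgs
```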
When it doesn't
- Unique prompts per call. One-shot completions where no two calls share a prefix — caching just adds write costs for no reads.
- Low / bursty traffic. If your workload has gaps longer than the cache TTL, you're paying the write cost every cold start. For Claude, consider the 1-hour TTL; for OpenAI, it just means caching works less often.
- Short prompts. Under the 1,024-token minimum (Claude / OpenAI) or 4,096-token minimum (Gemini), caching doesn't apply.
- Prefix-unstable prompts. If you interpolate timestamps, request IDs, or user session tokens near the top of the prompt, every call is a cache miss. Move dynamic content to the end.
How to structure prompts for maximum caching
Order matters. Cache hits require an exact prefix match, so all variable content should live at the END of the prompt, never at the beginning or middle. The canonical structure:
[1] System prompt           ← stable, cache this
[2] Tool definitions        ← stable, cache this
[3] Few-shot examples       ← stable, cache this
[4] Retrieved context (RAG) ← stable per-session, cache per session
[5] Conversation history    ← grows, cache prefix up through turn N-1
[6] Current user message    ← the only fresh bit
The further down the varying content is, the more tokens you get for free. Common mistakes: putting today's date at the top ("Today is 2026-04-21"), echoing the user's name in the system prompt, or doing string-concat in a way that changes whitespace subtly between calls. All of those invalidate cache on everything after.
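A concrete version of the "date at the top" fix, in Python (prompt text and message shape are illustrative): the stable prefix is a module-level constant, and anything per-call, including today's date, rides along with the user message.

```python
from datetime import date

# Stable across every call; placeholder text.
SYSTEM_PROMPT = "You are a scheduling assistant for Acme Corp. ..."

def build_messages(history: list[dict], user_msg: str) -> list[dict]:
    """Stable content first, byte-identical across calls; dynamic values
    (the date) last, so they never break the cached prefix."""
    today = date.today().isoformat()
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        *history,
        {"role": "user", "content": f"{user_msg}\n\n(today's date: {today})"},
    ]
```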
FAQ
What is prompt caching?
Prompt caching lets an LLM provider store the computed attention state from a prefix of your prompt (typically a long system message or stable context) and reuse it across calls. When the same prefix comes in again, the model skips the expensive work of processing it and charges you a small fraction of the normal input price. Claude discounts cached reads by 90%, OpenAI by up to 90% automatically, and Gemini by around 75%.
How much can prompt caching save?
For workloads with a long stable prefix — like an agent with a big system prompt, RAG over a static knowledge base, or a customer-support bot with conversation history — caching routinely saves 50-80% of input costs. For workloads where each call is different (one-shot completions, varying prompts), caching saves little. Use the calculator above to run your own numbers.
What's the difference between Claude's and OpenAI's caching?
Claude caching is explicit — you mark which parts of the prompt are cacheable with cache_control markers. OpenAI caching is automatic — if you reuse the same prefix and it's 1024+ tokens, you get the discount with no code changes. Claude charges a 25% premium on cache writes (the first time a block is cached) and up to 2× for a 1-hour TTL. OpenAI has no write premium. Gemini requires explicit cache creation and also charges a storage fee ($4.50/M tokens/hour) on top of reduced read costs.
When is prompt caching NOT worth it?
Three scenarios where caching hurts more than it helps: (1) your prefixes don't repeat (every call has a different system prompt or context), so you pay write costs without getting reads; (2) your traffic is sparse, so the cache expires between calls and you're effectively paying write costs on every request; (3) the stable portion of your prompt is under the minimum cache size (1,024 tokens on Claude/OpenAI, 4,096 on Gemini). Below the minimum, caching doesn't apply.
How does the 5-minute TTL work?
By default, Claude cache entries live for 5 minutes after the last access. Each use refreshes the TTL. If your traffic is bursty or intermittent, cache entries expire between uses and you pay write costs again. For sustained traffic (say, 100+ requests/hour), the cache stays warm continuously and you effectively pay the write cost once. For very low-volume workloads (under ~10 requests/hour), Claude offers a 1-hour extended TTL at a 2× write premium — more expensive per write but worth it when the baseline is paying writes repeatedly.
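Whether the 2× write premium on the 1-hour TTL pays off is a quick computation. A sketch for evenly spaced traffic, assuming every request would be a cold miss under the 5-minute TTL (i.e., gaps longer than five minutes; multipliers follow the Claude rates above):

```python
def hourly_cache_spend(requests_per_hour: int, cached_mtok: float,
                       price_per_mtok: float, long_ttl: bool) -> float:
    """Hourly spend on the cached portion when request gaps exceed 5 min.
    5-min TTL: every request re-writes at 1.25x the input rate.
    1-hour TTL: one 2x write, then the rest read at 0.1x."""
    if not long_ttl:
        return requests_per_hour * cached_mtok * price_per_mtok * 1.25
    return cached_mtok * price_per_mtok * (2.0 + (requests_per_hour - 1) * 0.1)
```

At 6 requests/hour with a 10K-token prefix (0.01 MTok) and an assumed $3/MTok rate, the default TTL costs about $0.225/hour versus $0.075/hour with the extended TTL, so the 1-hour TTL wins as soon as you have a handful of requests per hour separated by gaps over five minutes.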
Do all models support prompt caching?
In the Claude family, all production models support it (Opus 4.7, Sonnet 4.6, Haiku 4.5). In OpenAI, GPT-5.4, GPT-5, GPT-5 mini, GPT-4o, GPT-4o mini, o4-mini, and most current models support it automatically. For Gemini, context caching is available on 2.5 Pro, 2.5 Flash, 3.1 Pro, and 3 Flash; older models don't support it. Check each provider's pricing page for your specific model.
How do I structure my prompts to maximize caching?
Put stable content first, varying content last. Cache hits require an EXACT prefix match, so even a single character difference at position 10 invalidates cache on everything after. The pattern that works: [long system prompt] [stable context / docs / few-shot examples] [user message]. If your code dynamically inserts today's date or a request ID at the top of the prompt, move it to the bottom — otherwise every call is a cache miss.
Does caching work with structured outputs and tool use?
Yes. The cached tokens include everything before the varying user input — system prompt, tool definitions, example tool calls, whatever else is stable. Because tool definitions are often long JSON schemas, caching them tends to be one of the highest-leverage applications: you write the tool schemas once per session, then read them cheaply on every call.
Are there any privacy implications?
Claude's caches are scoped to your workspace as of Feb 2026 (previously org-level, which caused isolation concerns between teams). OpenAI's cache is scoped to your API key and is never visible to other tenants. Gemini caches are explicit and you control creation and deletion. None of the major providers train on cached content. Check each provider's data-use policy for specifics, especially if you process regulated data (HIPAA, PCI, etc).
Related
- · LLM token counter — count your cached + uncached token volumes across 21 models.
- · Claude pricing in 2026, compared — full per-workload math including the Claude/GPT/Gemini tradeoffs.
- · Claude vs ChatGPT vs Gemini — an honest side-by-side of the three flagship LLMs.
- · MCP server primer — how to plug external tools into Claude Desktop and Cursor.