In this example, the large text is marked with the `cache_control` parameter. This enables reuse of this large text across multiple API calls without reprocessing it each time. Changing only the user message allows you to ask various questions about the book while utilizing the cached content, leading to faster responses and improved efficiency.
How prompt caching works
When you send a request with prompt caching enabled:
- The system checks if a prompt prefix, up to a specified cache breakpoint, is already cached from a recent query.
- If found, it uses the cached version, reducing processing time and costs.
- Otherwise, it processes the full prompt and caches the prefix once the response begins.
This is especially useful for:
- Prompts with many examples
- Large amounts of context or background information
- Repetitive tasks with consistent instructions
- Long multi-turn conversations
Prompt caching references the entire prompt: `tools`, `system`, and `messages` (in that order), up to and including the block designated with `cache_control`.

Pricing
Prompt caching introduces a new pricing structure. The table below shows the price per million tokens for each supported model:

Model | Base Input Tokens | 5m Cache Writes | 1h Cache Writes | Cache Hits & Refreshes | Output Tokens |
---|---|---|---|---|---|
Claude Opus 4.1 | $15 / MTok | $18.75 / MTok | $30 / MTok | $1.50 / MTok | $75 / MTok |
Claude Opus 4 | $15 / MTok | $18.75 / MTok | $30 / MTok | $1.50 / MTok | $75 / MTok |
Claude Sonnet 4 | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok |
Claude Sonnet 3.7 | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok |
Claude Sonnet 3.5 (deprecated) | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok |
Claude Haiku 3.5 | $0.80 / MTok | $1 / MTok | $1.60 / MTok | $0.08 / MTok | $4 / MTok |
Claude Opus 3 (deprecated) | $15 / MTok | $18.75 / MTok | $30 / MTok | $1.50 / MTok | $75 / MTok |
Claude Haiku 3 | $0.25 / MTok | $0.30 / MTok | $0.50 / MTok | $0.03 / MTok | $1.25 / MTok |
The table above reflects the following pricing multipliers:
- 5-minute cache write tokens are 1.25 times the base input tokens price
- 1-hour cache write tokens are 2 times the base input tokens price
- Cache read tokens are 0.1 times the base input tokens price
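As a sanity check, these multipliers reproduce any row of the pricing table from its base input price. A short sketch for Claude Sonnet 4 ($3 / MTok base input):

```python
# Derive the Claude Sonnet 4 cache prices (in $ per MTok) from the
# base input price and the multipliers listed above.
base_input = 3.00

five_min_write = base_input * 1.25  # $3.75 / MTok, matches the table
one_hour_write = base_input * 2     # $6.00 / MTok
cache_read = base_input * 0.1       # $0.30 / MTok
```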
How to implement prompt caching
Supported models
Prompt caching is currently supported on:
- Claude Opus 4.1
- Claude Opus 4
- Claude Sonnet 4
- Claude Sonnet 3.7
- Claude Sonnet 3.5 (deprecated)
- Claude Haiku 3.5
- Claude Haiku 3
- Claude Opus 3 (deprecated)
Structuring your prompt
Place static content (tool definitions, system instructions, context, examples) at the beginning of your prompt. Mark the end of the reusable content for caching using the `cache_control` parameter.
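As an illustration, a request structured this way might look like the following sketch (the model name and text are placeholders; with the anthropic Python SDK the dict would be passed to `client.messages.create(**request)`):

```python
# Static, reusable content goes first; the cache breakpoint marks the
# end of the cacheable prefix.
request = {
    "model": "claude-sonnet-4-20250514",  # any supported model
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a helpful analyst. <long static instructions...>",
            # Everything up to and including this block is cacheable.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # The variable part (the user's question) comes after the prefix.
    "messages": [
        {"role": "user", "content": "Summarize the key points."}
    ],
}
```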
Cache prefixes are created in the following order: `tools`, `system`, then `messages`. This order forms a hierarchy where each level builds upon the previous ones.
How automatic prefix checking works
You can use just one cache breakpoint at the end of your static content, and the system will automatically find the longest matching prefix. Here's how it works:
- When you add a `cache_control` breakpoint, the system automatically checks for cache hits at all previous content block boundaries (up to approximately 20 blocks before your explicit breakpoint)
- If any of these previous positions match cached content from earlier requests, the system uses the longest matching prefix
- This means you don't need multiple breakpoints just to enable caching - one at the end is sufficient
When to use multiple breakpoints
You can define up to 4 cache breakpoints if you want to:
- Cache different sections that change at different frequencies (e.g., tools rarely change, but context updates daily)
- Have more control over exactly what gets cached
- Ensure caching for content more than 20 blocks before your final breakpoint
Cache limitations
The minimum cacheable prompt length is:
- 1024 tokens for Claude Opus 4.1, Claude Opus 4, Claude Sonnet 4, Claude Sonnet 3.7, Claude Sonnet 3.5 (deprecated), and Claude Opus 3 (deprecated)
- 2048 tokens for Claude Haiku 3.5 and Claude Haiku 3

Shorter prompts cannot be cached, even if marked with `cache_control`. Any request to cache fewer than this number of tokens will be processed without caching. To see if a prompt was cached, check the response usage fields.
For concurrent requests, note that a cache entry only becomes available after the first response begins. If you need cache hits for parallel requests, wait for the first response before sending subsequent requests.
Currently, “ephemeral” is the only supported cache type, which by default has a 5-minute lifetime.
Understanding cache breakpoint costs
Cache breakpoints themselves don't add any cost. You are only charged for:
- Cache writes: When new content is written to the cache (25% more than base input tokens for the 5-minute TTL)
- Cache reads: When cached content is used (10% of the base input token price)
- Regular input tokens: For any uncached content

Adding more `cache_control` breakpoints doesn't increase your costs - you still pay the same amount based on what content is actually cached and read. The breakpoints simply give you control over what sections can be cached independently.
What can be cached
Most blocks in the request can be designated for caching with `cache_control`. This includes:
- Tools: Tool definitions in the `tools` array
- System messages: Content blocks in the `system` array
- Text messages: Content blocks in the `messages.content` array, for both user and assistant turns
- Images & Documents: Content blocks in the `messages.content` array, in user turns
- Tool use and tool results: Content blocks in the `messages.content` array, in both user and assistant turns

Each of these elements can be marked with `cache_control` to enable caching for that portion of the request.
What cannot be cached
While most request blocks can be cached, there are some exceptions:
- Thinking blocks cannot be cached directly with `cache_control`. However, thinking blocks CAN be cached alongside other content when they appear in previous assistant turns. When cached this way, they DO count as input tokens when read from cache.
- Sub-content blocks (like citations) themselves cannot be cached directly. Instead, cache the top-level block. In the case of citations, the top-level document content blocks that serve as the source material for citations can be cached. This allows you to use prompt caching with citations effectively by caching the documents that citations will reference.
- Empty text blocks cannot be cached.
What invalidates the cache
Modifications to cached content can invalidate some or all of the cache. As described in Structuring your prompt, the cache follows the hierarchy: `tools` → `system` → `messages`. Changes at each level invalidate that level and all subsequent levels.
The following table shows which parts of the cache are invalidated by different types of changes. ✘ indicates that the cache is invalidated, while ✓ indicates that the cache remains valid.
What changes | Tools cache | System cache | Messages cache | Impact |
---|---|---|---|---|
Tool definitions | ✘ | ✘ | ✘ | Modifying tool definitions (names, descriptions, parameters) invalidates the entire cache |
Web search toggle | ✓ | ✘ | ✘ | Enabling/disabling web search modifies the system prompt |
Citations toggle | ✓ | ✘ | ✘ | Enabling/disabling citations modifies the system prompt |
Tool choice | ✓ | ✓ | ✘ | Changes to tool_choice parameter only affect message blocks |
Images | ✓ | ✓ | ✘ | Adding/removing images anywhere in the prompt affects message blocks |
Thinking parameters | ✓ | ✓ | ✘ | Changes to extended thinking settings (enable/disable, budget) affect message blocks |
Non-tool results passed to extended thinking requests | ✓ | ✓ | ✘ | When non-tool results are passed in requests while extended thinking is enabled, all previously-cached thinking blocks are stripped from context, and any messages in context that follow those thinking blocks are removed from the cache. For more details, see Caching with thinking blocks. |
Tracking cache performance
Monitor cache performance using these API response fields, within `usage` in the response (or the `message_start` event if streaming):
- `cache_creation_input_tokens`: Number of tokens written to the cache when creating a new entry.
- `cache_read_input_tokens`: Number of tokens retrieved from the cache for this request.
- `input_tokens`: Number of input tokens which were not read from or used to create a cache.
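For example, a rough cache hit rate can be derived from these three fields. A sketch with made-up numbers, where `usage` stands in for the API's usage object:

```python
# Stand-in for the `usage` object returned by the API.
usage = {
    "input_tokens": 120,                # uncached input
    "cache_creation_input_tokens": 0,   # tokens written to the cache
    "cache_read_input_tokens": 9800,    # tokens served from the cache
}

# Fraction of all input tokens that were served from the cache.
total_input = (usage["input_tokens"]
               + usage["cache_creation_input_tokens"]
               + usage["cache_read_input_tokens"])
hit_rate = usage["cache_read_input_tokens"] / total_input
print(f"cache hit rate: {hit_rate:.1%}")
```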
Best practices for effective caching
To optimize prompt caching performance:
- Cache stable, reusable content like system instructions, background information, large contexts, or frequent tool definitions.
- Place cached content at the prompt’s beginning for best performance.
- Use cache breakpoints strategically to separate different cacheable prefix sections.
- Regularly analyze cache hit rates and adjust your strategy as needed.
Optimizing for different use cases
Tailor your prompt caching strategy to your scenario:
- Conversational agents: Reduce cost and latency for extended conversations, especially those with long instructions or uploaded documents.
- Coding assistants: Improve autocomplete and codebase Q&A by keeping relevant sections or a summarized version of the codebase in the prompt.
- Large document processing: Incorporate complete long-form material including images in your prompt without increasing response latency.
- Detailed instruction sets: Share extensive lists of instructions, procedures, and examples to fine-tune Claude’s responses. Developers often include an example or two in the prompt, but with prompt caching you can get even better performance by including 20+ diverse examples of high quality answers.
- Agentic tool use: Enhance performance for scenarios involving multiple tool calls and iterative code changes, where each step typically requires a new API call.
- Talk to books, papers, documentation, podcast transcripts, and other longform content: Bring any knowledge base alive by embedding the entire document(s) into the prompt, and letting users ask it questions.
Troubleshooting common issues
If experiencing unexpected behavior:
- Ensure cached sections are identical and marked with `cache_control` in the same locations across calls
- Check that calls are made within the cache lifetime (5 minutes by default)
- Verify that `tool_choice` and image usage remain consistent between calls
- Validate that you are caching at least the minimum number of tokens
- The system automatically checks for cache hits at previous content block boundaries (up to ~20 blocks before your breakpoint). For prompts with more than 20 content blocks, you may need additional `cache_control` parameters earlier in the prompt to ensure all content can be cached

Note that changes to `tool_choice` or the presence/absence of images anywhere in the prompt will invalidate the cache, requiring a new cache entry to be created. For more details on cache invalidation, see What invalidates the cache.

Caching with thinking blocks
When using extended thinking with prompt caching, thinking blocks have special behavior:

Automatic caching alongside other content: While thinking blocks cannot be explicitly marked with `cache_control`, they get cached as part of the request content when you make subsequent API calls with tool results. This commonly happens during tool use when you pass thinking blocks back to continue the conversation.
Input token counting: When thinking blocks are read from cache, they count as input tokens in your usage metrics. This is important for cost calculation and token budgeting.
Cache invalidation patterns:
- Cache remains valid when only tool results are provided as user messages
- Cache gets invalidated when non-tool-result user content is added, causing all previous thinking blocks to be stripped
- This caching behavior occurs even without explicit `cache_control` markers
Cache storage and sharing
- Organization Isolation: Caches are isolated between organizations. Different organizations never share caches, even if they use identical prompts.
- Exact Matching: Cache hits require 100% identical prompt segments, including all text and images up to and including the block marked with cache control.
- Output Token Generation: Prompt caching has no effect on output token generation. The response you receive will be identical to what you would get if prompt caching was not used.
1-hour cache duration
If you find that 5 minutes is too short, Anthropic also offers a 1-hour cache duration. To use the extended cache, include `ttl` in the `cache_control` definition.
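A sketch of what such a block might look like (the text content is a placeholder):

```python
# A system content block cached with the 1-hour TTL instead of the
# default 5-minute lifetime.
system_block = {
    "type": "text",
    "text": "<large static context...>",
    # "ttl" accepts "5m" (the default) or "1h"
    "cache_control": {"type": "ephemeral", "ttl": "1h"},
}
```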
The `cache_creation_input_tokens` field equals the sum of the values in the `cache_creation` object.
When to use the 1-hour cache
If you have prompts that are used at a regular cadence (i.e., system prompts that are used more frequently than every 5 minutes), continue to use the 5-minute cache, since this will continue to be refreshed at no additional charge. The 1-hour cache is best used in the following scenarios:
- When you have prompts that are likely used less frequently than every 5 minutes, but more frequently than every hour - for example, when an agentic side-agent will take longer than 5 minutes, or when storing a long chat conversation with a user whom you generally expect may not respond within the next 5 minutes.
- When latency is important and your follow up prompts may be sent beyond 5 minutes.
- When you want to improve your rate limit utilization, since cache hits are not deducted against your rate limit.
Mixing different TTLs
You can use both 1-hour and 5-minute cache controls in the same request, but with an important constraint: cache entries with a longer TTL must appear before shorter TTLs (i.e., a 1-hour cache entry must appear before any 5-minute cache entries).

When mixing TTLs, we determine three billing locations in your prompt:
- Position `A`: The token count at the highest cache hit (or 0 if no hits).
- Position `B`: The token count at the highest 1-hour `cache_control` block after `A` (or equals `A` if none exist).
- Position `C`: The token count at the last `cache_control` block.
If `B` and/or `C` are larger than `A`, they will necessarily be cache misses, because `A` is the highest cache hit. You'll be charged:
- Cache read tokens for `A`.
- 1-hour cache write tokens for `(B - A)`.
- 5-minute cache write tokens for `(C - B)`.
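To make the billing rule concrete, here is a worked example with made-up token positions, using the Claude Sonnet 4 base rate of $3 / MTok from the pricing table above:

```python
# A = highest cache hit, B = last 1-hour breakpoint after A,
# C = last breakpoint overall, all measured in tokens (A <= B <= C).
A, B, C = 10_000, 40_000, 45_000

cache_read_tokens = A        # billed at 0.1x base input
one_hour_writes = B - A      # billed at 2x base input
five_min_writes = C - B      # billed at 1.25x base input

base = 3.00 / 1_000_000      # $ per token (Claude Sonnet 4 base input)
cost = (cache_read_tokens * base * 0.1
        + one_hour_writes * base * 2
        + five_min_writes * base * 1.25)
# cost = $0.003 (reads) + $0.18 (1h writes) + $0.01875 (5m writes)
```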
Prompt caching examples
To help you get started with prompt caching, we've prepared a prompt caching cookbook with detailed examples and best practices. Below, we've included several code snippets that showcase various prompt caching patterns. These examples demonstrate how to implement caching in different scenarios, helping you understand the practical applications of this feature:
Large context caching example
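The original snippet for this example is not reproduced here; a minimal sketch of the request payload, matching the legal-document scenario described below (the document text, model name, and question are placeholders; with the anthropic SDK, pass it to `client.messages.create(**request)`), might look like:

```python
# Cache a large legal document as part of the system prompt so that
# follow-up questions reuse it instead of reprocessing it each time.
legal_document = "<full text of a large legal agreement...>"  # placeholder

request = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "system": [
        {"type": "text", "text": "You are an AI assistant analyzing legal documents."},
        {
            "type": "text",
            "text": legal_document,
            # Cache the entire system prefix, including the document.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    # Only this user message changes between calls.
    "messages": [
        {"role": "user", "content": "What are the key termination clauses?"}
    ],
}
```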
For the first request:
- `input_tokens`: Number of tokens in the user message only
- `cache_creation_input_tokens`: Number of tokens in the entire system message, including the legal document
- `cache_read_input_tokens`: 0 (no cache hit on first request)
For subsequent requests within the cache lifetime:
- `input_tokens`: Number of tokens in the user message only
- `cache_creation_input_tokens`: 0 (no new cache creation)
- `cache_read_input_tokens`: Number of tokens in the entire cached system message
Caching tool definitions
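The original snippet is not shown here; a sketch of the request payload described below (tool schemas are simplified placeholders) might be:

```python
# Cache a set of tool definitions by placing the breakpoint on the
# final tool; all tools before it are included in the cached prefix.
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a location.",
        "input_schema": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
    {
        "name": "get_time",
        "description": "Get the current time in a timezone.",
        "input_schema": {
            "type": "object",
            "properties": {"timezone": {"type": "string"}},
            "required": ["timezone"],
        },
        # Breakpoint on the last tool caches ALL tool definitions above.
        "cache_control": {"type": "ephemeral"},
    },
]

request = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "tools": tools,
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
}
```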
The `cache_control` parameter is placed on the final tool (`get_time`) to designate all of the tools as part of the static prefix. This means that all tool definitions, including `get_weather` and any other tools defined before `get_time`, will be cached as a single prefix. This approach is useful when you have a consistent set of tools that you want to reuse across multiple requests without re-processing them each time.

For the first request:
- `input_tokens`: Number of tokens in the user message
- `cache_creation_input_tokens`: Number of tokens in all tool definitions and system prompt
- `cache_read_input_tokens`: 0 (no cache hit on first request)
For subsequent requests within the cache lifetime:
- `input_tokens`: Number of tokens in the user message
- `cache_creation_input_tokens`: 0 (no new cache creation)
- `cache_read_input_tokens`: Number of tokens in all cached tool definitions and system prompt
Continuing a multi-turn conversation
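The original snippet is not shown here; a sketch of the incremental-caching pattern described below (model name and texts are placeholders) might be:

```python
# Incrementally cache a conversation: the system prompt carries one
# breakpoint, and the newest message carries another. Earlier turns are
# picked up by automatic prefix matching.
history = [
    {"role": "user", "content": "Hello, can you tell me about the solar system?"},
    {"role": "assistant", "content": "<previous answer...>"},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Tell me more about Mars.",
                # Breakpoint on the newest turn.
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
]

request = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "<long static instructions...>",
            # Re-cached on the next request if evicted.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": history,
}
```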
In this example, the final message is marked with `cache_control` so the conversation can be incrementally cached. The system will automatically look up and use the longest previously cached prefix for follow-up messages. That is, blocks that were previously marked with a `cache_control` block are later not marked with one, but they will still be considered a cache hit (and also a cache refresh!) if they are hit within 5 minutes.

In addition, note that the `cache_control` parameter is placed on the system message. This is to ensure that if it gets evicted from the cache (after not being used for more than 5 minutes), it will get added back to the cache on the next request.

This approach is useful for maintaining context in ongoing conversations without repeatedly processing the same information. When this is set up properly, you should see the following in the usage response of each request:
- `input_tokens`: Number of tokens in the new user message (will be minimal)
- `cache_creation_input_tokens`: Number of tokens in the new assistant and user turns
- `cache_read_input_tokens`: Number of tokens in the conversation up to the previous turn
Putting it all together: Multiple cache breakpoints
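The original snippet is not shown here; a sketch of a request using all four breakpoints described below (tool schema, texts, and model name are placeholders) might be:

```python
# Four independent cache breakpoints: tools, static instructions,
# RAG documents, and conversation history.
request = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "tools": [
        {
            "name": "search_kb",
            "description": "Search the knowledge base.",
            "input_schema": {"type": "object", "properties": {}},
            "cache_control": {"type": "ephemeral"},  # breakpoint 1: tools
        }
    ],
    "system": [
        {
            "type": "text",
            "text": "<static instructions...>",
            "cache_control": {"type": "ephemeral"},  # breakpoint 2: instructions
        },
        {
            "type": "text",
            "text": "<knowledge base documents for RAG...>",
            "cache_control": {"type": "ephemeral"},  # breakpoint 3: RAG context
        },
    ],
    "messages": [
        {"role": "user", "content": "First question about the documents."},
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": "<previous answer...>",
                    "cache_control": {"type": "ephemeral"},  # breakpoint 4: history
                }
            ],
        },
        {"role": "user", "content": "Follow-up question."},
    ],
}
```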
This example demonstrates all four cache breakpoints working together:
- Tools cache (cache breakpoint 1): The `cache_control` parameter on the last tool definition caches all tool definitions.
- Reusable instructions cache (cache breakpoint 2): The static instructions in the system prompt are cached separately. These instructions rarely change between requests.
- RAG context cache (cache breakpoint 3): The knowledge base documents are cached independently, allowing you to update the RAG documents without invalidating the tools or instructions cache.
- Conversation history cache (cache breakpoint 4): The assistant's response is marked with `cache_control` to enable incremental caching of the conversation as it progresses.
This approach provides maximum flexibility:
- If you only update the final user message, all four cache segments are reused
- If you update the RAG documents but keep the same tools and instructions, the first two cache segments are reused
- If you change the conversation but keep the same tools, instructions, and documents, the first three segments are reused
- Each cache breakpoint can be invalidated independently based on what changes in your application
For the first request:
- `input_tokens`: Tokens in the final user message
- `cache_creation_input_tokens`: Tokens in all cached segments (tools + instructions + RAG documents + conversation history)
- `cache_read_input_tokens`: 0 (no cache hits)
For subsequent requests with only a new user message:
- `input_tokens`: Tokens in the new user message only
- `cache_creation_input_tokens`: Any new tokens added to conversation history
- `cache_read_input_tokens`: All previously cached tokens (tools + instructions + RAG documents + previous conversation)
This pattern is especially useful for:
- RAG applications with large document contexts
- Agent systems that use multiple tools
- Long-running conversations that need to maintain context
- Applications that need to optimize different parts of the prompt independently
FAQ
Do I need multiple cache breakpoints or is one at the end sufficient?
For most use cases, a single cache breakpoint at the end of your static content is sufficient; the system automatically checks for cache hits at all previous content block boundaries and uses the longest matching prefix. You only need multiple breakpoints if:
- You have more than 20 content blocks before your desired cache point
- You want to cache sections that update at different frequencies independently
- You need explicit control over what gets cached for cost optimization
Do cache breakpoints add extra cost?
No. Cache breakpoints themselves are free. You only pay for:
- Writing content to cache (25% more than base input tokens for 5-minute TTL)
- Reading from cache (10% of base input token price)
- Regular input tokens for uncached content
What is the cache lifetime?
The cache has a 5-minute lifetime by default, refreshed each time the cached content is used. If 5 minutes is too short, a 1-hour cache duration is also available (see 1-hour cache duration above).
How many cache breakpoints can I use?
You can define up to 4 cache breakpoints (using `cache_control` parameters) in your prompt.
Is prompt caching available for all models?
No; see the Supported models list above for the models that currently support prompt caching.
How does prompt caching work with extended thinking?
Thinking blocks cannot be marked with `cache_control` directly, but they can be cached alongside other content in previous assistant turns; see Caching with thinking blocks above for details.
How do I enable prompt caching?
To enable prompt caching, include at least one `cache_control` breakpoint in your API request.
Can I use prompt caching with other API features?
Yes. Prompt caching works alongside other API features such as tool use and vision. Note, however, that changing whether images are present or modifying tool use settings will break the cache (see What invalidates the cache above).
How does prompt caching affect pricing?
Cache writes cost more than base input tokens (1.25x for the 5-minute TTL, 2x for the 1-hour TTL), while cache hits cost only 0.1x the base input price; see the Pricing table above.
Can I manually clear the cache?
No. Cached prefixes expire automatically after their TTL (5 minutes by default) of inactivity; there is no way to clear the cache manually.
How can I track the effectiveness of my caching strategy?
Monitor cache performance using the `cache_creation_input_tokens` and `cache_read_input_tokens` fields in the API response.
What can break the cache?
Changes to tool definitions, the system prompt, or earlier message content invalidate the cache at that level and all levels after it; see What invalidates the cache above for a full table.
How does prompt caching handle privacy and data separation?
- Cache keys are generated using a cryptographic hash of the prompts up to the cache control point. This means only requests with identical prompts can access a specific cache.
- Caches are organization-specific. Users within the same organization can access the same cache if they use identical prompts, but caches are not shared across different organizations, even for identical prompts.
- The caching mechanism is designed to maintain the integrity and privacy of each unique conversation or context.
- It's safe to use `cache_control` anywhere in your prompts. For cost efficiency, it's better to exclude highly variable parts (e.g., user's arbitrary input) from caching.
Can I use prompt caching with the Batches API?
Yes, though cache hits in the Batches API are provided on a best-effort basis, since batch requests may be processed concurrently. To increase the likelihood of cache hits:
- Gather a set of message requests that have a shared prefix.
- Send a batch request with just a single request that has this shared prefix and a 1-hour cache block. This will get written to the 1-hour cache.
- As soon as this is complete, submit the rest of the requests. You will have to monitor the job to know when it completes.
Why am I seeing the error `AttributeError: 'Beta' object has no attribute 'prompt_caching'` in Python?
This error typically appears after upgrading the SDK or when using outdated examples. Prompt caching is now generally available, so the beta namespace is no longer needed: use `client.messages.create(...)` instead of `client.beta.prompt_caching.messages.create(...)`.
Why am I seeing 'TypeError: Cannot read properties of undefined (reading 'messages')'?
Similarly, in TypeScript this error appears when using the outdated beta namespace. Prompt caching is generally available, so use `client.messages.create()` rather than `client.beta.promptCaching.messages.create()`.