In this example, the entire text of the book is cached using a cache_control block: marking the content block with the cache_control parameter enables reuse of this large text across multiple API calls without reprocessing it each time. Changing only the user message allows you to ask various questions about the book while utilizing the cached content, leading to faster responses and improved efficiency.
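A minimal sketch of such a request with the Anthropic Python SDK might look like the following; the file path, instructions, question, and model name are illustrative placeholders rather than values from the original example.

```python
import anthropic

client = anthropic.Anthropic()

# Hypothetical: load the large reference text (e.g., the full book) from disk.
book_text = open("book.txt").read()

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model name; use any supported model
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You answer questions about the book below."},
        {
            "type": "text",
            "text": book_text,
            # Mark the end of the reusable prefix so the large text is cached.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    # Only this part changes between calls; the cached prefix above is reused.
    messages=[{"role": "user", "content": "How does the protagonist change over the story?"}],
)
print(response.usage)
```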
Prompt caching caches the full prefix of your prompt, consisting of tools, system, and messages (in that order), up to and including the block designated with cache_control.

Pricing

The table below shows the price per million tokens (MTok) for each supported model:

Model | Base Input Tokens | 5m Cache Writes | 1h Cache Writes | Cache Hits & Refreshes | Output Tokens |
---|---|---|---|---|---|
Claude Opus 4.1 | $15 / MTok | $18.75 / MTok | $30 / MTok | $1.50 / MTok | $75 / MTok |
Claude Opus 4 | $15 / MTok | $18.75 / MTok | $30 / MTok | $1.50 / MTok | $75 / MTok |
Claude Sonnet 4 | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok |
Claude Sonnet 3.7 | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok |
Claude Sonnet 3.5 | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok |
Claude Haiku 3.5 | $0.80 / MTok | $1 / MTok | $1.60 / MTok | $0.08 / MTok | $4 / MTok |
Claude Opus 3 | $15 / MTok | $18.75 / MTok | $30 / MTok | $1.50 / MTok | $75 / MTok |
Claude Haiku 3 | $0.25 / MTok | $0.30 / MTok | $0.50 / MTok | $0.03 / MTok | $1.25 / MTok |
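To make these rates concrete, here is a rough cost comparison for a hypothetical 100,000-token prefix at the Claude Sonnet 4 rates above; the prefix size is invented for illustration.

```python
# Hypothetical worked example using the Claude Sonnet 4 rates from the table.
PREFIX_TOKENS = 100_000          # assumed size of the cached prefix
BASE = 3.00 / 1_000_000          # $ per base input token
WRITE_5M = 3.75 / 1_000_000      # $ per 5-minute cache write token
READ = 0.30 / 1_000_000          # $ per cache hit / refresh token

first_request = PREFIX_TOKENS * WRITE_5M    # prefix written to cache:  $0.375
cached_request = PREFIX_TOKENS * READ       # prefix read from cache:   $0.03
uncached_request = PREFIX_TOKENS * BASE     # same prefix, no caching:  $0.30

print(first_request, cached_request, uncached_request)
```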
To structure your prompt for caching, place static content (tool definitions, system instructions, context, examples) at the beginning of the prompt, and mark the end of the reusable content with the cache_control parameter.
Cache prefixes are created in the following order: tools, system, then messages. This order forms a hierarchy where each level builds upon the previous ones.
When you add a cache_control breakpoint, the system automatically checks for cache hits at all previous content block boundaries (up to approximately 20 blocks before your explicit breakpoint).

Prompts shorter than the model's minimum cacheable length cannot be cached, even if marked with cache_control; any requests to cache fewer than this number of tokens will be processed without caching. To see if a prompt was cached, see the response usage fields.
For concurrent requests, note that a cache entry only becomes available after the first response begins. If you need cache hits for parallel requests, wait for the first response before sending subsequent requests.
Currently, “ephemeral” is the only supported cache type, which by default has a 5-minute lifetime.
Adding more cache_control breakpoints doesn't increase your costs: you still pay the same amount based on what content is actually cached and read. The breakpoints simply give you control over what sections can be cached independently.
What can be cached

Most blocks in the request can be designated for caching with cache_control. This includes:

- Tools: tool definitions in the tools array
- System messages: content blocks in the system array
- Text messages: content blocks in the messages.content array, for both user and assistant turns
- Images and documents: content blocks in the messages.content array, in user turns
- Tool use and tool results: content blocks in the messages.content array, in both user and assistant turns

Each of these elements can be marked with cache_control to enable caching for that portion of the request.
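As a rough illustration of the request shapes this list refers to, each of these locations accepts a cache_control field; the tool name, instructions, and document text below are placeholders.

```python
# Sketch of request blocks that accept cache_control (values are illustrative).
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a location.",
        "input_schema": {"type": "object", "properties": {"location": {"type": "string"}}},
        # Caches this tool and everything before it in the tools array.
        "cache_control": {"type": "ephemeral"},
    },
]

system = [
    {
        "type": "text",
        "text": "Long, stable system instructions go here...",
        "cache_control": {"type": "ephemeral"},  # caches the system prefix
    },
]

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "A large reference document pasted into the conversation...",
                "cache_control": {"type": "ephemeral"},  # caches the conversation up to here
            },
        ],
    },
]
```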
Thinking blocks cannot be cached directly with cache_control. However, thinking blocks can be cached alongside other content when they appear in previous assistant turns. When cached this way, they do count as input tokens when read from cache.
What invalidates the cache

Modifications to cached content can invalidate some or all of the cache. The cache follows the hierarchy tools → system → messages; changes at each level invalidate that level and all subsequent levels.
The following table shows which parts of the cache are invalidated by different types of changes. ✘ indicates that the cache is invalidated, while ✓ indicates that the cache remains valid.
What changes | Tools cache | System cache | Messages cache | Impact |
---|---|---|---|---|
Tool definitions | ✘ | ✘ | ✘ | Modifying tool definitions (names, descriptions, parameters) invalidates the entire cache |
Web search toggle | ✓ | ✘ | ✘ | Enabling/disabling web search modifies the system prompt |
Citations toggle | ✓ | ✘ | ✘ | Enabling/disabling citations modifies the system prompt |
Tool choice | ✓ | ✓ | ✘ | Changes to the tool_choice parameter only affect message blocks |
Images | ✓ | ✓ | ✘ | Adding/removing images anywhere in the prompt affects message blocks |
Thinking parameters | ✓ | ✓ | ✘ | Changes to extended thinking settings (enable/disable, budget) affect message blocks |
Non-tool results passed to extended thinking requests | ✓ | ✓ | ✘ | When non-tool results are passed in requests while extended thinking is enabled, all previously-cached thinking blocks are stripped from context, and any messages in context that follow those thinking blocks are removed from the cache. For more details, see Caching with thinking blocks. |
Tracking cache performance

Monitor cache performance using these API response fields, within usage in the response (or the message_start event if streaming):

- cache_creation_input_tokens: Number of tokens written to the cache when creating a new entry.
- cache_read_input_tokens: Number of tokens retrieved from the cache for this request.
- input_tokens: Number of input tokens which were not read from or used to create a cache.
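For example, a small helper along these lines can summarize how much of each request was served from the cache; it assumes a response object returned by the Python SDK's messages.create call and guards the cache fields with getattr in case they are absent.

```python
def report_cache_usage(response) -> None:
    """Print a one-line summary of cache behavior for a Messages API response."""
    usage = response.usage
    cached = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    uncached = usage.input_tokens
    total = cached + written + uncached
    hit_rate = cached / total if total else 0.0
    print(f"cache read: {cached}, cache write: {written}, "
          f"uncached input: {uncached}, hit rate: {hit_rate:.1%}")
```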
When troubleshooting unexpected caching behavior, verify that tool_choice and image usage remain consistent between calls, and note that for prompts with more than roughly 20 content blocks you may need additional cache_control parameters earlier in the prompt to ensure all content can be cached. Changes to tool_choice or the presence/absence of images anywhere in the prompt will invalidate the cache, requiring a new cache entry to be created. For more details on cache invalidation, see What invalidates the cache.

Caching with thinking blocks

While thinking blocks cannot be directly cached with cache_control, they get cached as part of the request content when you make subsequent API calls with tool results. This commonly happens during tool use when you pass thinking blocks back to continue the conversation.
Input token counting: When thinking blocks are read from cache, they count as input tokens in your usage metrics. This is important for cost calculation and token budgeting.
Cache invalidation patterns:

- Changes to thinking parameters (enabling, disabling, or changing the budget) invalidate message cache breakpoints.
- Non-tool-result user content causes previous thinking blocks to be stripped from context, invalidating any cache entries that included them, regardless of explicit cache_control markers.

1-hour cache duration

If the default 5-minute lifetime is too short for your workload, you can use a 1-hour cache duration. To use the extended cache, add extended-cache-ttl-2025-04-11 as a beta header to your request, and then include ttl in the cache_control definition like this:
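A sketch of what that looks like with the Python SDK follows; the beta header value and the "5m"/"1h" ttl values come from the text above, while the model name and request contents are placeholders.

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # assumed model name
    max_tokens=1024,
    # Opt in to the extended cache TTL beta.
    extra_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"},
    system=[
        {
            "type": "text",
            "text": "Very large, stable context that is reused for at least an hour...",
            # "ttl" accepts "5m" (the default) or "1h".
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        },
    ],
    messages=[{"role": "user", "content": "First question about the context."}],
)
```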
When using the 1-hour cache, the response usage reports cache writes broken down by TTL in a cache_creation object; the cache_creation_input_tokens field equals the sum of the values in the cache_creation object.
Mixing different TTLs

When you mix 1-hour and 5-minute cache_control entries in the same request, billing is determined by three positions in your prompt:

- A: The token count at the highest cache hit (or 0 if no hits).
- B: The token count at the highest 1-hour cache_control block after A (or equal to A if none exist).
- C: The token count at the last cache_control block.

If B and/or C are larger than A, they will necessarily be cache misses, because A is the highest cache hit.

You'll be charged:

1. Cache read tokens for A.
2. 1-hour cache write tokens for (B - A).
3. 5-minute cache write tokens for (C - B).
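For instance, with made-up positions A = 50,000, B = 120,000, and C = 150,000 tokens and the Claude Sonnet 4 rates from the pricing table, the charge works out roughly as follows.

```python
# Hypothetical mixed-TTL billing calculation (token positions are invented).
A, B, C = 50_000, 120_000, 150_000           # cache hit, last 1h breakpoint, last breakpoint
READ, WRITE_1H, WRITE_5M = 0.30, 6.00, 3.75  # $ per MTok, Claude Sonnet 4 rates

cost = (A * READ + (B - A) * WRITE_1H + (C - B) * WRITE_5M) / 1_000_000
print(f"${cost:.4f}")  # $0.015 + $0.42 + $0.1125 = $0.5475
```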
Large context caching example

This example demonstrates caching a large document (such as a full legal agreement) in the system prompt so that repeated questions about it reuse the cache.

For the first request:

- input_tokens: Number of tokens in the user message only
- cache_creation_input_tokens: Number of tokens in the entire system message, including the legal document
- cache_read_input_tokens: 0 (no cache hit on first request)

For subsequent requests within the cache lifetime:

- input_tokens: Number of tokens in the user message only
- cache_creation_input_tokens: 0 (no new cache creation)
- cache_read_input_tokens: Number of tokens in the entire cached system message
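A sketch of the kind of request this describes; the file path, document, questions, and model name are placeholders.

```python
import anthropic

client = anthropic.Anthropic()
legal_document = open("agreement.txt").read()  # hypothetical large legal document

def ask(question: str):
    return client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model name
        max_tokens=1024,
        system=[
            {"type": "text", "text": "You are an expert at analyzing legal agreements."},
            {"type": "text", "text": legal_document,
             "cache_control": {"type": "ephemeral"}},  # cache the whole document
        ],
        messages=[{"role": "user", "content": question}],
    )

first = ask("What are the termination clauses?")      # writes the cache
second = ask("Summarize the liability provisions.")   # reads the cache
for r in (first, second):
    u = r.usage
    print(u.input_tokens, u.cache_creation_input_tokens, u.cache_read_input_tokens)
```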
Caching tool definitions

In this example, the cache_control parameter is placed on the final tool (get_time) to designate all of the tools as part of the static prefix. This means that all tool definitions, including get_weather and any other tools defined before get_time, will be cached as a single prefix. This approach is useful when you have a consistent set of tools that you want to reuse across multiple requests without re-processing them each time.

For the first request:

- input_tokens: Number of tokens in the user message
- cache_creation_input_tokens: Number of tokens in all tool definitions and system prompt
- cache_read_input_tokens: 0 (no cache hit on first request)

For subsequent requests within the cache lifetime:

- input_tokens: Number of tokens in the user message
- cache_creation_input_tokens: 0 (no new cache creation)
- cache_read_input_tokens: Number of tokens in all cached tool definitions and system prompt
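A sketch of such a tools array; the get_weather and get_time schemas, the user question, and the model name are filled in as placeholders.

```python
import anthropic

client = anthropic.Anthropic()

# Illustrative tool definitions; only the last tool carries cache_control,
# which caches every tool defined before it as part of the same prefix.
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a given location.",
        "input_schema": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
    {
        "name": "get_time",
        "description": "Get the current time in a given time zone.",
        "input_schema": {
            "type": "object",
            "properties": {"timezone": {"type": "string"}},
            "required": ["timezone"],
        },
        "cache_control": {"type": "ephemeral"},  # breakpoint covers the whole tools array
    },
]

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model name
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What time is it in London right now?"}],
)
print(response.usage)
```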
Continuing a multi-turn conversation

In this example, we mark the final block of the final message with cache_control so the conversation can be incrementally cached. The system will automatically look up and use the longest previously cached prefix for follow-up messages. That is, blocks that were previously marked with a cache_control block and are later not marked with it will still be considered a cache hit (and also a cache refresh!) if they are hit within 5 minutes.

In addition, note that the cache_control parameter is also placed on the system message. This ensures that if it gets evicted from the cache (after not being used for more than 5 minutes), it will be added back to the cache on the next request.

This approach is useful for maintaining context in ongoing conversations without repeatedly processing the same information. When this is set up properly, you should see the following in the usage response of each request:

- input_tokens: Number of tokens in the new user message (will be minimal)
- cache_creation_input_tokens: Number of tokens in the new assistant and user turns
- cache_read_input_tokens: Number of tokens in the conversation up to the previous turn
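A rough sketch of one turn of such a conversation loop; the system prompt, user messages, and model name are placeholders.

```python
import anthropic
import copy

client = anthropic.Anthropic()
history = []  # accumulated conversation turns, stored without cache_control markers

def send_turn(user_text: str):
    history.append({"role": "user",
                    "content": [{"type": "text", "text": user_text}]})
    # Copy the history and mark only the final block of the final message,
    # so each request caches the conversation up to its newest turn.
    request_messages = copy.deepcopy(history)
    request_messages[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model name
        max_tokens=1024,
        system=[{"type": "text",
                 "text": "You are a helpful travel assistant.",  # placeholder instructions
                 "cache_control": {"type": "ephemeral"}}],
        messages=request_messages,
    )
    history.append({"role": "assistant",
                    "content": [{"type": "text", "text": response.content[0].text}]})
    return response

print(send_turn("I'd like to visit Japan in the spring.").usage)
print(send_turn("Which cities should I prioritize?").usage)
```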
Putting it all together: Multiple cache breakpoints

This example combines all of the strategies above, using up to four cache breakpoints:

1. Tools: The cache_control parameter on the last tool definition caches all tool definitions.
2. Reusable instructions: The static instructions in the system prompt are cached separately.
3. RAG documents: The knowledge-base documents are cached independently, so they can be updated without invalidating the tools or instructions cache.
4. Conversation history: The final block of the conversation is marked with cache_control to enable incremental caching of the conversation as it progresses.

For the first request:

- input_tokens: Tokens in the final user message
- cache_creation_input_tokens: Tokens in all cached segments (tools + instructions + RAG documents + conversation history)
- cache_read_input_tokens: 0 (no cache hits)

For subsequent requests within the cache lifetime:

- input_tokens: Tokens in the new user message only
- cache_creation_input_tokens: Any new tokens added to conversation history
- cache_read_input_tokens: All previously cached tokens (tools + instructions + RAG documents + previous conversation)
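A condensed sketch of a request with all four breakpoints; the tool schema, instructions, documents, conversation text, and model name are placeholders.

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model name
    max_tokens=1024,
    tools=[
        # ... other tool definitions ...
        {
            "name": "search_knowledge_base",
            "description": "Search the internal knowledge base.",
            "input_schema": {"type": "object",
                             "properties": {"query": {"type": "string"}}},
            "cache_control": {"type": "ephemeral"},   # breakpoint 1: all tools
        },
    ],
    system=[
        {"type": "text",
         "text": "Long, rarely changing instructions...",
         "cache_control": {"type": "ephemeral"}},     # breakpoint 2: instructions
        {"type": "text",
         "text": "Retrieved RAG documents pasted here...",
         "cache_control": {"type": "ephemeral"}},     # breakpoint 3: RAG context
    ],
    messages=[
        {"role": "user", "content": [{"type": "text", "text": "Earlier question..."}]},
        {"role": "assistant", "content": [{"type": "text", "text": "Earlier answer..."}]},
        {"role": "user", "content": [
            {"type": "text",
             "text": "Latest question...",
             "cache_control": {"type": "ephemeral"}},  # breakpoint 4: conversation
        ]},
    ],
)
print(response.usage)
```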
FAQ

Do I need multiple cache breakpoints or is one at the end sufficient?

Do cache breakpoints add extra cost?
What is the cache lifetime?
How many cache breakpoints can I use?
You can define up to 4 cache breakpoints (using cache_control parameters) in your prompt.

Is prompt caching available for all models?
How does prompt caching work with extended thinking?
How do I enable prompt caching?
To enable prompt caching, include at least one cache_control breakpoint in your API request.

Can I use prompt caching with other API features?
How does prompt caching affect pricing?
Can I manually clear the cache?
How can I track the effectiveness of my caching strategy?
Monitor cache performance using the cache_creation_input_tokens and cache_read_input_tokens fields in the API response.

What can break the cache?
How does prompt caching handle privacy and data separation?
It is safe to use cache_control anywhere in your prompts. For cost efficiency, it's better to exclude highly variable parts (e.g., a user's arbitrary input) from caching.

Can I use prompt caching with the Batches API?
Why am I seeing the error `AttributeError: 'Beta' object has no attribute 'prompt_caching'` in Python?
Why am I seeing `TypeError: Cannot read properties of undefined (reading 'messages')`?