Extended thinking is supported in the following models:

- Claude Opus 4.1 (`claude-opus-4-1-20250805`)
- Claude Opus 4 (`claude-opus-4-20250514`)
- Claude Sonnet 4 (`claude-sonnet-4-20250514`)
- Claude Sonnet 3.7 (`claude-3-7-sonnet-20250219`)

When extended thinking is turned on, Claude creates `thinking` content blocks where it outputs its internal reasoning. Claude incorporates insights from this reasoning before crafting a final response.

The API response will include `thinking` content blocks, followed by `text` content blocks.
Here’s an example of the default response format:
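As a rough illustration (values abbreviated and hypothetical, shown here as plain Python data), the content array contains a thinking block with its signature, followed by a text block:

```python
# Illustrative shape of the default response content; values are hypothetical.
response_content = [
    {
        "type": "thinking",
        "thinking": "Let me work through this problem step by step...",
        "signature": "EuYBCkQYAiJA...",  # opaque signature used for verification
    },
    {
        "type": "text",
        "text": "Based on my reasoning above, the answer is...",
    },
]
```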
To turn on extended thinking, add a `thinking` object, with the `type` parameter set to `enabled` and `budget_tokens` set to a specified token budget for extended thinking.
The `budget_tokens` parameter determines the maximum number of tokens Claude is allowed to use for its internal reasoning process. In Claude 4 models, this limit applies to full thinking tokens, and not to the summarized output. Larger budgets can improve response quality by enabling more thorough analysis for complex problems, although Claude may not use the entire budget allocated, especially at ranges above 32k.

`budget_tokens` must be set to a value less than `max_tokens`. However, when using interleaved thinking with tools, you can exceed this limit as the token limit becomes your entire context window (200k tokens).
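A minimal request sketch, assuming the Python `anthropic` SDK; the model name, budget, and prompt are illustrative:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,  # must exceed budget_tokens (outside of interleaved thinking with tools)
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # cap on tokens Claude may spend on internal reasoning
    },
    messages=[
        {
            "role": "user",
            "content": "Are there an infinite number of primes p such that p mod 4 == 3?",
        }
    ],
)

# The content list holds thinking block(s) followed by text block(s).
for block in response.content:
    print(block.type)
```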
When streaming is enabled, you'll receive thinking content via `thinking_delta` events.

For more documentation on streaming via the Messages API, see Streaming Messages.

Here's how to handle streaming with thinking:
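A minimal streaming sketch, assuming the Python `anthropic` SDK's streaming helper; the prompt is illustrative:

```python
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": "What is 27 * 453?"}],
) as stream:
    for event in stream:
        if event.type == "content_block_start":
            print(f"\n[{event.content_block.type} block started]")
        elif event.type == "content_block_delta":
            if event.delta.type == "thinking_delta":
                print(event.delta.thinking, end="", flush=True)
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)
            # A signature_delta arrives just before the thinking block stops.
        elif event.type == "content_block_stop":
            print()
```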
Tool use with extended thinking only supports `tool_choice: {"type": "auto"}` (the default) or `tool_choice: {"type": "none"}`. Using `tool_choice: {"type": "any"}` or `tool_choice: {"type": "tool", "name": "..."}` will result in an error because these options force tool use, which is incompatible with extended thinking.
When providing tool results, you must pass `thinking` blocks back to the API for the last assistant message. Include the complete unmodified block to maintain reasoning continuity.
Example: Passing thinking blocks with tool results
When providing tool results, the `thinking` blocks must be passed back to the API, and you must include the complete unmodified block. This is critical for maintaining the model's reasoning flow and conversation integrity.
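A sketch of this pattern, assuming the Python `anthropic` SDK; the `get_weather` tool, the city, and the tool output are hypothetical:

```python
import anthropic

client = anthropic.Anthropic()

weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# First request: Claude thinks, then decides to call the tool.
first = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    tools=[weather_tool],
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
)

# Assumes Claude chose to call the tool in this turn.
tool_use = next(block for block in first.content if block.type == "tool_use")

# Second request: pass back the complete, unmodified assistant content
# (thinking + tool_use blocks) along with the tool result.
second = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    tools=[weather_tool],
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"},
        {"role": "assistant", "content": first.content},  # unmodified blocks
        {
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": tool_use.id,
                    "content": "18°C, partly cloudy",  # hypothetical tool output
                }
            ],
        },
    ],
)

print(second.content[-1].text)
```

Passing `first.content` back as-is keeps the `thinking` block and its signature intact alongside the `tool_use` block.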
While you can omit `thinking` blocks from prior `assistant` role turns, we suggest always passing back all thinking blocks to the API for any multi-turn conversation. When passing `thinking` blocks, the entire sequence of consecutive `thinking` blocks must match the outputs generated by the model during the original request; you cannot rearrange or modify the sequence of these blocks.
To enable interleaved thinking, add the beta header `interleaved-thinking-2025-05-14` to your API request.
Here are some important considerations for interleaved thinking:
- With interleaved thinking, `budget_tokens` can exceed the `max_tokens` parameter, as it represents the total budget across all thinking blocks within one assistant turn.
- Interleaved thinking is supported for Claude 4 models only, with the beta header `interleaved-thinking-2025-05-14`.
- Direct calls to the Anthropic API allow you to pass `interleaved-thinking-2025-05-14` in requests to any model, with no effect.
- On third-party platforms, if you pass `interleaved-thinking-2025-05-14` to any model aside from Claude Opus 4.1, Opus 4, or Sonnet 4, your request will fail.

Tool use without interleaved thinking

Tool use with interleaved thinking
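A sketch of the "with interleaved thinking" variant, assuming the Python `anthropic` SDK; the `calculator` tool and prompt are hypothetical, and the beta header is passed via `extra_headers`:

```python
import anthropic

client = anthropic.Anthropic()

calculator_tool = {
    "name": "calculator",
    "description": "Evaluate a simple arithmetic expression.",
    "input_schema": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    tools=[calculator_tool],
    # Beta header that enables thinking between tool calls.
    extra_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"},
    messages=[{"role": "user", "content": "What is (421 * 97) + 18?"}],
)

# With interleaved thinking, a single assistant turn can alternate
# thinking, tool_use, and text blocks.
for block in response.content:
    print(block.type)
```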
Here's an example demonstrating cache behavior with `cache_control` markers:

System prompt caching (preserved when thinking changes)

Messages caching (invalidated when thinking changes)

In the messages-caching case, the second request shows `cache_creation_input_tokens=1370` and `cache_read_input_tokens=0`, proving that message-based caching is invalidated when thinking parameters change.
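As a rough sketch of how the messages-caching case could be exercised (assuming the Python `anthropic` SDK; the placeholder document and the exact token counts in your own output will differ):

```python
import anthropic

client = anthropic.Anthropic()

# Placeholder content; in practice the cached prefix must exceed the
# minimum cacheable length for the model.
long_document = "Background material goes here. " * 200


def ask(budget_tokens: int):
    return client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": budget_tokens},
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": long_document,
                        # Cache breakpoint placed inside the messages array.
                        "cache_control": {"type": "ephemeral"},
                    },
                    {"type": "text", "text": "Summarize the key points."},
                ],
            }
        ],
    )


first = ask(budget_tokens=4000)
second = ask(budget_tokens=8000)  # changed thinking params invalidate message-based caching

for label, resp in (("first", first), ("second", second)):
    print(
        label,
        resp.usage.cache_creation_input_tokens,
        resp.usage.cache_read_input_tokens,
    )
```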
In previous Claude models, if the sum of prompt tokens and `max_tokens` exceeded the model's context window, the system would automatically adjust `max_tokens` to fit within the context limit. This meant you could set a large `max_tokens` value and the system would silently reduce it as needed.

With Claude 3.7 and 4 models, `max_tokens` (which includes your thinking budget when thinking is enabled) is enforced as a strict limit. The system will now return a validation error if prompt tokens + `max_tokens` exceeds the context window size.
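One way to guard against that error, sketched with the Python `anthropic` SDK's token counting endpoint (the 200k context window and the specific token values are assumptions for illustration):

```python
import anthropic

client = anthropic.Anthropic()

CONTEXT_WINDOW = 200_000  # assumed context window size for illustration
MAX_TOKENS = 32_000

messages = [
    {"role": "user", "content": "Walk me through a proof that there are infinitely many primes."}
]

# Count prompt tokens up front so prompt tokens + max_tokens stays
# within the context window; otherwise the request is rejected.
count = client.messages.count_tokens(
    model="claude-sonnet-4-20250514",
    messages=messages,
)

if count.input_tokens + MAX_TOKENS > CONTEXT_WINDOW:
    raise ValueError("prompt tokens + max_tokens would exceed the context window")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=MAX_TOKENS,
    thinking={"type": "enabled", "budget_tokens": 16_000},
    messages=messages,
)
```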
The current turn's thinking and text output tokens count toward your `max_tokens` limit for that turn.

Given the `max_tokens` behavior with extended thinking in Claude 3.7 and 4 models, you may need to adjust `max_tokens` values as your prompt length changes.

Thinking blocks include a `signature` field. This field is used to verify that thinking blocks were generated by Claude when passed back to the API.
When streaming responses, the signature is added via a `signature_delta` inside a `content_block_delta` event just before the `content_block_stop` event.

- `signature` values are significantly longer in Claude 4 than in previous models.
- The `signature` field is an opaque field and should not be interpreted or parsed - it exists solely for verification purposes.
- `signature` values are compatible across platforms (Anthropic APIs, Amazon Bedrock, and Vertex AI). Values generated on one platform will be compatible with another.

Occasionally, Claude's internal reasoning will be flagged by safety systems. When this occurs, some or all of the `thinking` block is encrypted and returned to you as a `redacted_thinking` block. `redacted_thinking` blocks are decrypted when passed back to the API, allowing Claude to continue its response without losing context.
When building customer-facing applications that use extended thinking, be aware that `redacted_thinking` blocks contain encrypted content that isn't human-readable. To test how your application handles redacted thinking, you can use this special test string as your prompt:

`ANTHROPIC_MAGIC_STRING_TRIGGER_REDACTED_THINKING_46C9A13E193C177646C7398A98432ECCCE4C1253D5E2D82641AC0E52CC2876CB`
When passing `thinking` and `redacted_thinking` blocks back to the API in a multi-turn conversation, you must include the complete unmodified block back to the API for the last assistant turn. This is critical for maintaining the model's reasoning flow. We suggest always passing back all thinking blocks to the API. For more details, see the Preserving thinking blocks section above.
Example: Working with redacted thinking blocks
Here's how to handle `redacted_thinking` blocks that may appear in responses when Claude's internal reasoning contains content flagged by safety systems:
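A minimal sketch, assuming the Python `anthropic` SDK; it uses the special test string from above to force redacted thinking, then filters block types before display:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[
        {
            "role": "user",
            # Special test string that triggers redacted thinking.
            "content": "ANTHROPIC_MAGIC_STRING_TRIGGER_REDACTED_THINKING_46C9A13E193C177646C7398A98432ECCCE4C1253D5E2D82641AC0E52CC2876CB",
        }
    ],
)

# redacted_thinking blocks carry encrypted data that isn't human-readable.
# Pass them back unmodified in multi-turn conversations, but don't render
# them to end users.
if any(block.type == "redacted_thinking" for block in response.content):
    print("Some of Claude's internal reasoning was encrypted for safety reasons.")

for block in response.content:
    if block.type == "thinking":
        print("Thinking:", block.thinking)
    elif block.type == "text":
        print("Answer:", block.text)
```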
| Feature | Claude Sonnet 3.7 | Claude 4 Models |
|---|---|---|
| Thinking Output | Returns full thinking output | Returns summarized thinking |
| Interleaved Thinking | Not supported | Supported with `interleaved-thinking-2025-05-14` beta header |
| Model | Base Input Tokens | Cache Writes | Cache Hits | Output Tokens |
|---|---|---|---|---|
| Claude Opus 4.1 | $15 / MTok | $18.75 / MTok | $1.50 / MTok | $75 / MTok |
| Claude Opus 4 | $15 / MTok | $18.75 / MTok | $1.50 / MTok | $75 / MTok |
| Claude Sonnet 4 | $3 / MTok | $3.75 / MTok | $0.30 / MTok | $15 / MTok |
| Claude Sonnet 3.7 | $3 / MTok | $3.75 / MTok | $0.30 / MTok | $15 / MTok |
Streaming is required when `max_tokens` is greater than 21,333. When streaming, be prepared to handle both thinking and text content blocks as they arrive.

Thinking isn't compatible with `temperature` or `top_k` modifications, or with forced tool use. You can set `top_p` to values between 1 and 0.95.