It’s always better to first engineer a prompt that works well without model or prompt constraints, and then try latency reduction strategies afterward. Trying to reduce latency prematurely might prevent you from discovering what top performance looks like.
How to measure latency
When discussing latency, you may come across several terms and measurements:
- Baseline latency: This is the time taken by the model to process the prompt and generate the response, without considering the input and output tokens per second. It provides a general idea of the model’s speed.
- Time to first token (TTFT): This metric measures the time it takes for the model to generate the first token of the response, from when the prompt was sent. It’s particularly relevant when you’re using streaming (more on that later) and want to provide a responsive experience to your users.
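Both numbers can be measured directly from a streaming request. The following is a minimal sketch using the Anthropic Python SDK, assuming the anthropic package is installed and ANTHROPIC_API_KEY is set in your environment; the model name and prompt are placeholders.

```python
import time
import anthropic

client = anthropic.Anthropic()

start = time.monotonic()
ttft = None

# Stream the response so we can record when the first text chunk arrives.
with client.messages.stream(
    model="claude-3-haiku-20240307",  # example model ID
    max_tokens=256,
    messages=[{"role": "user", "content": "Explain in two sentences why the sky is blue."}],
) as stream:
    for text in stream.text_stream:
        if ttft is None:
            # First generated text has arrived: this is the time to first token.
            ttft = time.monotonic() - start

# End-to-end time for the full response, a rough proxy for baseline latency.
total = time.monotonic() - start
print(f"Time to first token: {ttft:.2f}s; total request time: {total:.2f}s")
```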
How to reduce latency
1. Choose the right model
One of the most straightforward ways to reduce latency is to select the appropriate model for your use case. Anthropic offers a range of models with different capabilities and performance characteristics. Consider your specific requirements and choose the model that best fits your needs in terms of speed and output quality. For more details about model metrics, see our models overview page.
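As a rough illustration (not a benchmark), you can time the same prompt against a couple of candidate models and compare. This sketch assumes the Anthropic Python SDK; the model IDs are examples only, so check the models overview page for current options.

```python
import time
import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Is this review positive or negative? 'Great battery life, clunky software.'"}]

# Model IDs below are examples; swap in the models you are evaluating.
for model in ["claude-3-haiku-20240307", "claude-3-sonnet-20240229"]:
    start = time.monotonic()
    response = client.messages.create(model=model, max_tokens=20, messages=messages)
    elapsed = time.monotonic() - start
    print(f"{model}: {elapsed:.2f}s -> {response.content[0].text!r}")
```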
2. Optimize prompt and output length
Minimize the number of tokens in both your input prompt and the expected output, while still maintaining high performance. The fewer tokens the model has to process and generate, the faster the response will be. Here are some tips to help you optimize your prompts and outputs:
- Be clear but concise: Aim to convey your intent clearly and concisely in the prompt. Avoid unnecessary details or redundant information, while keeping in mind that Claude lacks context on your use case and may not make the intended leaps of logic if instructions are unclear.
- Ask for shorter responses: Ask Claude directly to be concise. The Claude 3 family of models has improved steerability over previous generations. If Claude is outputting unwanted length, ask Claude to curb its chattiness.
Because LLMs count tokens rather than words, asking for an exact word count or a word count limit is not as effective a strategy as asking for paragraph or sentence count limits.
- Set appropriate output limits: Use the `max_tokens` parameter to set a hard limit on the maximum length of the generated response, as shown in the sketch after this list. This prevents Claude from generating overly long outputs.
Note: When the response reaches `max_tokens` tokens, the response will be cut off, perhaps mid-sentence or mid-word, so this is a blunt technique that may require post-processing and is usually most appropriate for multiple choice or short answer responses where the answer comes right at the beginning.
- Experiment with temperature: The `temperature` parameter controls the randomness of the output. Lower values (e.g., 0.2) can sometimes lead to more focused and shorter responses, while higher values (e.g., 0.8) may result in more diverse but potentially longer outputs.
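Putting these tips together, a single request might combine a sentence-count instruction, a `max_tokens` cap, and a lower temperature. The sketch below assumes the Anthropic Python SDK; the model name, limit, and temperature are placeholder values to adapt to your use case.

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-haiku-20240307",            # example model ID
    system="Answer in at most two sentences.",  # steer length by sentence count, not word count
    max_tokens=100,                             # hard cap; output is truncated if it hits this limit
    temperature=0.2,                            # lower values tend toward shorter, more focused output
    messages=[{"role": "user", "content": "Why does a shorter output reduce latency?"}],
)

print(response.content[0].text)
```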