Useful tip to cut time-to-first-token on longer prompts in the API: pre-warm the prompt cache. Send your system prompt before the user prompt. Claude writes it to the cache, but skips generating any output. When the real user request lands, it'll hit a warm cache. https://t.co/6BdEzbamr2
Anthropic Launches Prompt Cache Pre-Warming to Eliminate Initial Claude API Latency
Anthropic introduced a "pre-warming" capability for the Claude API to load large prompts into the server-side cache before a user interaction begins. By setting
max_tokens: 0, the API processes the input and writes it to the cache without generating a response. This official method replaces the previous community workaround of using a single-token limit.- Cache write cost (5m)
- 1.25x base input price
- Cache write cost (1h)
- 2x base input price
- Cache read cost
- 0.1x base input price
- Minimum cache length
- 1,024 to 4,096 tokens
- Availability
- Claude API, AWS, Microsoft Foundry
Latency remains the primary friction point for Claude Code's autonomous workflows. While Claude's prompt caching dashboard reduced costs, the first request in a session still suffered a "cache miss" delay. Pre-warming removes this bottleneck, mirroring OpenAI's persistent connections to speed up autonomous loops.
To implement this, you must use explicit cache_control breakpoints on your static content. The pre-warm request is billed as a cache write but incurs no output token costs. Note that max_tokens: 0 is incompatible with streaming, extended thinking, or structured outputs. This feature is available across the Claude API, AWS, and Microsoft Foundry.
ClaudeDevs
@ClaudeDevs
266retweets4.2klikes
View on XStill wondering? A few quick answers below.
Prompt cache pre-warming is a technique to load large, static portions of a prompt into Anthropic's server-side memory before a user request arrives. By proactively caching system instructions or context documents, developers can eliminate the initial latency penalty of a cache miss, resulting in significantly faster response times for the first interaction in a session.
To pre-warm the cache, send an API request with the max_tokens parameter set to zero and include an explicit cache_control breakpoint on your static content. The API will process the prompt and write it to the cache without generating any output tokens. This official method ensures the cache is ready for subsequent user requests that share the same prefix.
Prompt caching introduces a tiered pricing model based on cache writes and reads. Writing to a standard five-minute cache costs 1.25 times the base input token price, while an extended one-hour cache costs double the base rate. However, reading from the cache is significantly cheaper, priced at only ten percent of the standard input token rate for supported models.
Prompt caching and the zero-token pre-warming method are supported on all active Claude models, including the Opus, Sonnet, and Haiku families. This includes the latest versions like Claude Opus 4.7 and Sonnet 4.6. The feature is available through the Claude API, the Claude Platform on AWS, and Microsoft Foundry, though specific platform availability for automatic caching may vary.
Pre-warming with a zero-token limit is incompatible with several API features that require output generation. You cannot use it with streaming, extended thinking, or structured output configurations. Additionally, it is not supported within the Message Batches API. Developers must use explicit cache breakpoints rather than automatic caching to ensure the cache is correctly keyed to the static content.




