Anthropic Launches Prompt Cache Pre-Warming to Eliminate Initial Claude API Latency

Anthropic

May 15, 2026 · Updated Jun 12, 2026

Anthropic introduced a pre-warming method for the Claude API that uses a zero-token limit to load prompts into the cache without generating output. This allows developers to eliminate the latency penalty on the first request of a session for high-context applications. By proactively caching system instructions or large documents, tools like coding agents can achieve near-instant response times.

Anthropic introduced a "pre-warming" capability for the Claude API to load large prompts into the server-side cache before a user interaction begins. By setting max_tokens: 0, the API processes the input and writes it to the cache without generating a response. This official method replaces the previous community workaround of using a single-token limit.

Cache write cost (5m): 1.25x base input price
Cache write cost (1h): 2x base input price
Cache read cost: 0.1x base input price
Minimum cache length: 1,024 to 4,096 tokens
Availability: Claude API, AWS, Microsoft Foundry

Latency remains the primary friction point for Claude Code's autonomous workflows. While Claude's prompt caching dashboard reduced costs, the first request in a session still suffered a "cache miss" delay. Pre-warming removes this bottleneck, mirroring OpenAI's persistent connections to speed up autonomous loops.

To implement this, you must use explicit cache_control breakpoints on your static content. The pre-warm request is billed as a cache write but incurs no output token costs. Note that max_tokens: 0 is incompatible with streaming, extended thinking, or structured outputs. This feature is available across the Claude API, AWS, and Microsoft Foundry.

View the full update on platform.claude.com

ClaudeDevs

@ClaudeDevsMay 14

Useful tip to cut time-to-first-token on longer prompts in the API: pre-warm the prompt cache. Send your system prompt before the user prompt. Claude writes it to the cache, but skips generating any output. When the real user request lands, it'll hit a warm cache. https://t.co/6BdEzbamr2

2664.2k

View on X

Still wondering? A few quick answers below.

Prompt cache pre-warming is a technique to load large, static portions of a prompt into Anthropic's server-side memory before a user request arrives. By proactively caching system instructions or context documents, developers can eliminate the initial latency penalty of a cache miss, resulting in significantly faster response times for the first interaction in a session.

To pre-warm the cache, send an API request with the max_tokens parameter set to zero and include an explicit cache_control breakpoint on your static content. The API will process the prompt and write it to the cache without generating any output tokens. This official method ensures the cache is ready for subsequent user requests that share the same prefix.

Prompt caching introduces a tiered pricing model based on cache writes and reads. Writing to a standard five-minute cache costs 1.25 times the base input token price, while an extended one-hour cache costs double the base rate. However, reading from the cache is significantly cheaper, priced at only ten percent of the standard input token rate for supported models.

Prompt caching and the zero-token pre-warming method are supported on all active Claude models, including the Opus, Sonnet, and Haiku families. This includes the latest versions like Claude Opus 4.7 and Sonnet 4.6. The feature is available through the Claude API, the Claude Platform on AWS, and Microsoft Foundry, though specific platform availability for automatic caching may vary.

Pre-warming with a zero-token limit is incompatible with several API features that require output generation. You cannot use it with streaming, extended thinking, or structured output configurations. Additionally, it is not supported within the Message Batches API. Developers must use explicit cache breakpoints rather than automatic caching to ensure the cache is correctly keyed to the static content.

Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →

See all AI news & updates from Anthropic →

Keep reading

Anthropic Launches Claude Prompt Caching Dashboard to Optimize API Costs

Anthropic introduced a dedicated dashboard in the Claude Developer Console to provide visibility into prompt caching performance. This allows developers to track cache hit rates and reduce both API expenses and latency for high-context workloads.

Anthropic Launches Native Memory for Claude Managed Agents to Enable Persistent Learning

ClaudeApr 24

Anthropic Launches Native Memory for Claude Managed Agents to Enable Persistent Learning

Anthropic introduced a native memory layer for Claude Managed Agents in public beta, allowing autonomous systems to retain knowledge across multiple sessions. By storing memories as manageable files, the update removes the need for custom state-management infrastructure while giving developers full control over what an agent remembers.

What is prompt cache pre-warming in the Claude API?

How do you pre-warm the Claude prompt cache?

What is the pricing for Anthropic's prompt caching?

Which Claude models support prompt caching and pre-warming?

What are the technical limitations of Claude cache pre-warming?

Keep reading

Anthropic Launches Claude Prompt Caching Dashboard to Optimize API Costs

Anthropic Launches Claude Prompt Caching Dashboard to Optimize API Costs

Anthropic Launches Native Memory for Claude Managed Agents to Enable Persistent Learning

Anthropic Launches Native Memory for Claude Managed Agents to Enable Persistent Learning

Keep reading

Anthropic Launches Claude Prompt Caching Dashboard to Optimize API Costs

Anthropic Launches Claude Prompt Caching Dashboard to Optimize API Costs

Anthropic Launches Native Memory for Claude Managed Agents to Enable Persistent Learning

Anthropic Launches Native Memory for Claude Managed Agents to Enable Persistent Learning