OpenRouter Launches Response Caching to Deliver Free and Instant Identical Requests

OpenRouter

May 2, 2026 · Updated May 10, 2026

OpenRouter introduced a beta response caching feature that stores the output of identical API requests at the edge. By skipping the model provider for repeated calls, developers can eliminate token costs and reduce latency from seconds to milliseconds.

OpenRouter, a unified API for accessing hundreds of language models, launched a beta response caching feature that stores the full output of identical requests. The system hashes the request body and parameters and, when the hash matches a stored entry, returns that response from the edge without contacting the provider, resulting in zero token charges and sub-second latency.

Cache hit cost: $0 (zero tokens)
Cache hit latency: 80-300ms
Cache TTL range: 1 second to 24 hours
Supported endpoints: Chat completions, responses, messages, embeddings
Availability: Beta, free for all users

The update targets agentic loops and automated testing, where identical prompts are often resent during retries. It follows OpenRouter's model alias system and complements OpenRouter Workspaces, continuing the company's effort to streamline how developers manage and optimize their inference infrastructure.

You can enable caching by adding the X-OpenRouter-Cache: true header to a request on a supported endpoint, such as chat completions or embeddings, or by toggling it within a preset. The feature is free during the beta and supports text, images, audio, and tool calls, with configurable expiration times from one second to 24 hours.
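As a rough illustration of the header-based opt-in, the sketch below builds a chat completions request with the X-OpenRouter-Cache header set. The helper names, the model ID, and the choice of Python's standard urllib are assumptions made for this example, not something the announcement specifies.

```python
import json
import urllib.request

# Public OpenRouter chat completions endpoint.
OPENROUTER_CHAT_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_cached_chat_request(api_key, model, messages):
    """Build (without sending) a chat completions request that opts in to caching."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        # Opt-in header as named in the article.
        "X-OpenRouter-Cache": "true",
    }
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        OPENROUTER_CHAT_URL, data=body, headers=headers, method="POST"
    )


def send_request(request):
    """Send the request and return the decoded JSON response (needs network access)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling build_cached_chat_request with a real API key and a placeholder model ID such as "openai/gpt-4o-mini", then passing the result to send_request twice, should in principle produce one provider call followed by one cache hit, provided the second call arrives before the entry expires.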

View the full update on openrouter.ai

OpenRouter

@OpenRouter · May 2

Introducing Response Caching: save tons of money and time on tests and agent retries. Blog post: https://t.co/1tasyIRssI Available for free. Learn more 👇 https://t.co/It2AXRhPAm

Still wondering? A few quick answers below.

What is OpenRouter response caching?

OpenRouter response caching is a beta feature that stores the full output of identical API requests at the network edge. When a user sends a request that matches a previously cached one, the system returns the stored response immediately. This eliminates the need to contact the model provider, saving both time and money.

How does OpenRouter response caching work?

When caching is enabled via a header or preset, OpenRouter creates a unique hash based on the request body, model, API key, and streaming mode. If a match is found in the cache, the response is delivered in roughly 80 to 300 milliseconds. Users can control the cache duration from one second up to 24 hours.
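To make the matching step concrete, here is a minimal sketch of how a cache key could be derived from the four inputs listed above. OpenRouter has not published its hashing scheme in this article, so the use of SHA-256, the canonical JSON encoding, and the function name are all assumptions chosen for illustration.

```python
import hashlib
import json


def cache_key(body, api_key):
    """Illustrative cache key: SHA-256 over a canonical JSON encoding of the inputs."""
    material = {
        "body": body,  # messages plus any sampling parameters
        "model": body.get("model"),
        "api_key": api_key,  # keeps entries separate per key
        "stream": bool(body.get("stream", False)),
    }
    # Sorting keys makes the encoding independent of dictionary ordering.
    canonical = json.dumps(material, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Under this scheme, two requests produce the same key only if every field matches, so changing a single sampling parameter, switching streaming on, or using a different API key results in a cache miss.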

How is response caching different from prompt caching?

Prompt caching reduces the cost of processing input text when multiple requests share a common prefix at the provider level. In contrast, OpenRouter response caching skips the provider entirely by returning the full generated output from an edge cache. This results in zero tokens billed for the entire request, including both the prompt and the completion.
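The difference is easiest to see as arithmetic on a single repeated request. The sketch below compares three cases under assumed numbers; the prices, the prefix discount, and the function name are hypothetical, and real prompt-caching discounts vary by provider.

```python
def repeat_request_cost(input_tokens, output_tokens, input_price, output_price,
                        mode="none", cached_prefix_tokens=0, prefix_discount=0.0):
    """Dollar cost of one repeated request; prices are dollars per million tokens."""
    if mode == "response_cache":
        # Response cache hit: the provider is skipped, so nothing is billed.
        return 0.0
    if mode == "prompt_cache":
        # Only the shared prefix of the input is discounted.
        cached = min(cached_prefix_tokens, input_tokens)
    elif mode == "none":
        cached = 0
    else:
        raise ValueError(f"unknown mode: {mode}")
    uncached = input_tokens - cached
    input_cost = (uncached + cached * (1 - prefix_discount)) * input_price
    output_cost = output_tokens * output_price  # output is always billed here
    return (input_cost + output_cost) / 1_000_000
```

With prompt caching, the completion and the uncached part of the prompt are still billed on every call, whereas a response cache hit returns before any tokens are processed.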

What is the pricing for OpenRouter response caching?

During the current beta period, OpenRouter response caching is available for free. While the first request that populates the cache is billed normally by the model provider, every subsequent identical request that hits the cache is billed at zero tokens. This applies to both the input prompt and the generated output tokens.
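A small simulation can show how many of a series of identical requests would actually be billed for a given cache duration. It assumes that an entry expires a fixed time after the request that populated it and that hits do not extend its lifetime; the article does not say how expiry is handled, so treat that behavior and the function name as assumptions.

```python
def count_billed_calls(request_times, ttl_seconds):
    """Count cache misses (billed calls) for identical requests sent at the given times."""
    if not 1 <= ttl_seconds <= 24 * 60 * 60:
        raise ValueError("TTL must be between 1 second and 24 hours")
    billed = 0
    expires_at = None
    for t in sorted(request_times):
        if expires_at is None or t >= expires_at:
            # Miss: the provider is called, billed normally, and the entry is stored.
            billed += 1
            expires_at = t + ttl_seconds
        # Otherwise the request is a hit and is billed at zero tokens.
    return billed
```

For a test suite that replays the same prompt many times within the cache duration, only the first call in each window reaches the provider, which is where the savings on retries come from.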

Which endpoints support OpenRouter response caching?

Response caching currently works across the chat completions, responses, messages, and embeddings endpoints. It supports text, images, audio, and tool calls, though very large multimodal payloads may be ineligible. Legacy completions, text-to-speech, speech-to-text, reranking, and video generation endpoints are not supported during the initial beta phase of the feature.

Every HeadsUpAI update is written based on its original source and reviewed before it's published.

Keep reading

OpenRouter Launches Latest Model Aliases to Automate Frontier Model Updates

OpenRouter introduced a new system of model aliases that automatically route API requests to the most recent version of major LLMs. By using tags like -latest, developers can ensure their applications always use the newest capabilities without manually updating model identifiers in their code.

Qwen Launches Caching for Qwen3.7-Max to Slash Agent Costs by 90 Percent

Qwen · May 25

Qwen introduced implicit and explicit context caching for its flagship Qwen3.7-Max model to reduce API latency and expenses. By allowing developers to pin massive system prompts and tool definitions, the update cuts the cost of repeated inputs by 90 percent.