Qwen Launches Caching for Qwen3.7-Max to Slash Agent Costs by 90 Percent

Qwen

May 25, 2026 · Updated Jun 13, 2026

Qwen introduced implicit and explicit context caching for its flagship Qwen3.7-Max model to reduce API latency and expenses. By allowing developers to pin massive system prompts and tool definitions, the update cuts the cost of repeated inputs by 90 percent.

Qwen, the AI lab behind Alibaba's foundation models (large AI models trained on broad data), launched context caching (storing data for faster retrieval) for Qwen3.7-Max. The update includes implicit caching for automatic optimization and explicit caching for manual control. This follows the recent Alibaba Qwen3.7-Max launch, providing the framework to sustain massive context windows.

Cache hit price: 0.1x base input
Cache creation price: 1.25x base input
Minimum cache size: 1,024 tokens
Cache TTL: 5 minutes
Max markers per request: 4

As the industry moves toward agentic AI, repeated processing overhead for long histories and tool definitions has become a major bottleneck. This caching mechanism answers Anthropic's prompt caching feature by offering a 90 percent discount on cached tokens, improving the unit economics of production-grade agents.

You can now implement explicit caching via API by adding markers to content exceeding 1,024 tokens. The feature is natively supported in tools like Claude Code and OpenCode. While cache creation costs 1.25x the base rate, the 0.1x hit rate price makes it ideal for multi-turn coding and batch processing.

View the full update on alibabacloud.com

Qwen

@Alibaba_QwenMay 25

✅Implicit caching is now live on Qwen3.7-Max — kicks in automatically, no setup needed. ⚡️Faster + cheaper out of the box. Need higher, more deterministic hit rates? Try explicit caching instead. 🙌 🔗Best practices 🔗 ：https://t.co/3hSs6zquBH

1011.4k

View on X

Still wondering? A few quick answers below.

Implicit caching is an automatic feature that requires no manual setup to optimize requests for faster and cheaper performance. Explicit caching is a manual method where developers use specific markers to pin content in the model memory. This provides deterministic cache hits for identical input content, which is ideal for stable context reuse in production agents.

Creating a new cache block incurs a twenty-five percent surcharge over the standard input price. However, every subsequent cache hit is billed at only ten percent of the standard input rate, representing a ninety percent discount. A single cache hit is typically enough for a developer to break even on the initial creation cost.

To trigger explicit caching, the cached content must be at least 1,024 tokens in length. Developers must add a cache control marker to the message content using an array format. The system supports up to four markers per request, and the cache remains active for five minutes, automatically renewing its lifespan after every successful hit.

Caching is currently live for the Qwen3.7-Max flagship model and is accessible through both OpenAI and Anthropic compatible API endpoints. Several agentic coding tools natively support this feature, including Claude Code, OpenCode, and OpenClaw. These tools automatically inject the necessary markers to optimize context management for system prompts and user messages.

Explicit caching is recommended when you need guaranteed, deterministic cache hits regardless of backend resource scheduling. It is particularly effective for managing long contexts in AI agents where system reminders or project knowledge bases remain stable while the surrounding conversation evolves. This ensures that key context segments stay cached even as the total message history grows.

Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →

See all AI news & updates from Qwen →

Keep reading

OpenRouter Adds Qwen3.7-Max for Long Horizon Agentic Coding and Office Tasks

OpenRouter integrated Alibaba's Qwen3.7-Max, a flagship model optimized for autonomous agent loops and multi-hour task execution. The update introduces explicit prompt caching for the Qwen series, allowing developers to maintain massive context windows at a 90 percent discount on subsequent requests.

Nous Research Adds Qwen 3.7 Max Support to Hermes Agent

Nous ResearchMay 27

Nous Research Adds Qwen 3.7 Max Support to Hermes Agent

Nous Research integrated Alibaba's Qwen 3.7 Max into its open-source Hermes Agent platform. This allows users to power autonomous multi-step workflows with reasoning models while benefiting from recent cost-saving context caching.

Qwen 3.5 Medium Models Match Flagship Performance With a Fraction of Parameters

QwenFeb 24

Qwen 3.5 Medium Models Match Flagship Performance With a Fraction of Parameters

Alibaba's Qwen team released four medium-sized models that match their previous flagship while activating a fraction of the parameters. The standout Qwen3.5-35B-A3B uses just 3 billion active parameters yet surpasses Qwen3-235B-A22B across reasoning, coding, and agentic benchmarks.

Vercel Integrates Qwen 3.7 Max to Power Autonomous Multi Step Agent Workflows

VercelMay 21

Vercel Integrates Qwen 3.7 Max to Power Autonomous Multi Step Agent Workflows

Vercel added Alibaba's Qwen 3.7 Max to its AI Gateway, enabling developers to access the agent-focused model without separate provider accounts. The model is optimized for long-horizon execution, allowing it to maintain reasoning across complex, multi-step tasks like multi-file engineering and office automation.

What is the difference between implicit and explicit caching on Qwen3.7-Max?

How much does Qwen context caching cost?

What are the technical requirements for using Qwen explicit caching?

Which AI tools and models support Qwen context caching?

When should developers use explicit caching instead of implicit caching?

Keep reading

OpenRouter Adds Qwen3.7-Max for Long Horizon Agentic Coding and Office Tasks

OpenRouter Adds Qwen3.7-Max for Long Horizon Agentic Coding and Office Tasks

Nous Research Adds Qwen 3.7 Max Support to Hermes Agent

Nous Research Adds Qwen 3.7 Max Support to Hermes Agent

Qwen 3.5 Medium Models Match Flagship Performance With a Fraction of Parameters

Qwen 3.5 Medium Models Match Flagship Performance With a Fraction of Parameters

Vercel Integrates Qwen 3.7 Max to Power Autonomous Multi Step Agent Workflows

Vercel Integrates Qwen 3.7 Max to Power Autonomous Multi Step Agent Workflows

Keep reading

OpenRouter Adds Qwen3.7-Max for Long Horizon Agentic Coding and Office Tasks

OpenRouter Adds Qwen3.7-Max for Long Horizon Agentic Coding and Office Tasks

Nous Research Adds Qwen 3.7 Max Support to Hermes Agent

Nous Research Adds Qwen 3.7 Max Support to Hermes Agent

Qwen 3.5 Medium Models Match Flagship Performance With a Fraction of Parameters

Qwen 3.5 Medium Models Match Flagship Performance With a Fraction of Parameters

Vercel Integrates Qwen 3.7 Max to Power Autonomous Multi Step Agent Workflows

Vercel Integrates Qwen 3.7 Max to Power Autonomous Multi Step Agent Workflows