✅Implicit caching is now live on Qwen3.7-Max — kicks in automatically, no setup needed. ⚡️Faster + cheaper out of the box. Need higher, more deterministic hit rates? Try explicit caching instead. 🙌 🔗Best practices 🔗 :https://t.co/3hSs6zquBH
Qwen Launches Caching for Qwen3.7-Max to Slash Agent Costs by 90 Percent
Qwen, the AI lab behind Alibaba's foundation models (large AI models trained on broad data), launched context caching (storing data for faster retrieval) for Qwen3.7-Max. The update includes implicit caching for automatic optimization and explicit caching for manual control. This follows the recent Alibaba Qwen3.7-Max launch, providing the framework to sustain massive context windows.
- Cache hit price
- 0.1x base input
- Cache creation price
- 1.25x base input
- Minimum cache size
- 1,024 tokens
- Cache TTL
- 5 minutes
- Max markers per request
- 4
As the industry moves toward agentic AI, repeated processing overhead for long histories and tool definitions has become a major bottleneck. This caching mechanism answers Anthropic's prompt caching feature by offering a 90 percent discount on cached tokens, improving the unit economics of production-grade agents.
You can now implement explicit caching via API by adding markers to content exceeding 1,024 tokens. The feature is natively supported in tools like Claude Code and OpenCode. While cache creation costs 1.25x the base rate, the 0.1x hit rate price makes it ideal for multi-turn coding and batch processing.
Qwen
@Alibaba_Qwen
49retweets678likes
View on XStill wondering? A few quick answers below.
Implicit caching is an automatic feature that requires no manual setup to optimize requests for faster and cheaper performance. Explicit caching is a manual method where developers use specific markers to pin content in the model memory. This provides deterministic cache hits for identical input content, which is ideal for stable context reuse in production agents.
Creating a new cache block incurs a twenty-five percent surcharge over the standard input price. However, every subsequent cache hit is billed at only ten percent of the standard input rate, representing a ninety percent discount. A single cache hit is typically enough for a developer to break even on the initial creation cost.
To trigger explicit caching, the cached content must be at least 1,024 tokens in length. Developers must add a cache control marker to the message content using an array format. The system supports up to four markers per request, and the cache remains active for five minutes, automatically renewing its lifespan after every successful hit.
Caching is currently live for the Qwen3.7-Max flagship model and is accessible through both OpenAI and Anthropic compatible API endpoints. Several agentic coding tools natively support this feature, including Claude Code, OpenCode, and OpenClaw. These tools automatically inject the necessary markers to optimize context management for system prompts and user messages.
Explicit caching is recommended when you need guaranteed, deterministic cache hits regardless of backend resource scheduling. It is particularly effective for managing long contexts in AI agents where system reminders or project knowledge bases remain stable while the surrounding conversation evolves. This ensures that key context segments stay cached even as the total message history grows.




