HeadsUpAI

Xiaomi MiMo Engineering Breakthrough Cuts Long Context KVCache Costs Sevenfold

Xiaomi MiMo released inference optimizations for the MiMo-V2.5 series to cut long-context costs. A dual-pool KVCache (the stored attention keys and values for earlier tokens) keeps local and global attention layers in separate pools. Sliding-window layers then store only their recent context, which reduces KVCache storage and compute to about one-seventh of what traditional full-attention models need.
KVCache storage reduction: 7x
Server-side cache hit rate: 93% to 95%
MTP initial token speedup: 2.3x
1-hour video decoding time: 23 seconds
RDMA read throughput: 170 GB/s

These optimizations target the memory bottlenecks of million-token context windows. By using SWA-aware prefix cache trees and a distributed L3 cache called GCache, Xiaomi maintains a server-side cache hit rate of 93% to 95%. This engineering supports the MiMo-V2.5 price reduction, making the series a competitor to Qwen for high-volume agentic tasks.

Users get 2.3x faster initial responses via Multi-Token Prediction. Multimodal performance also jumps, with parallel video decoding cutting one-hour video processing to 23 seconds. These updates are live for MiMo-V2.5 and MiMo-V2.5-Pro, with several optimizations being upstreamed to the SGLang community.

Xiaomi MiMo (@XiaomiMiMo) on X:

What’s new with MiMo-V2.5 series inference? We just published a blog on our full pipeline inference optimizations for MiMo-V2.5 series, including how we pushed hybrid SWA efficiency to the limit. Read the full blog here: https://t.co/lYBEcgaVhU

Still wondering? A few quick answers below.

The MiMo-V2.5 inference release is a full-pipeline engineering update designed to maximize the efficiency of the Hybrid Sliding Window Attention architecture. By refactoring KVCache management and introducing a distributed caching system, Xiaomi reduces the memory and compute overhead of processing long sequences by approximately 85% compared to traditional models, which matches the one-seventh KVCache footprint.

Hybrid Sliding Window Attention interleaves local and global attention layers. Because local layers only need to store data for a small window of tokens rather than the entire sequence, the KVCache storage requirement drops to 1/7th of standard models. This allows for higher concurrency and lower hardware costs per request.
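
The one-seventh figure can be sanity-checked with simple arithmetic. The sketch below counts cached token slots per layer for a full-attention stack and for a hybrid stack. The function name, the 42-layer depth, the 6-to-1 local-to-global ratio, and the 128-token window are all illustrative assumptions chosen to land near the reported 7x; none of them is taken from Xiaomi's published configuration.

```python
def cached_token_slots(seq_len, n_layers, global_every, window):
    """Count cached token slots for full attention vs. a hybrid SWA stack.

    Assumption: every `global_every`-th layer is global and caches the whole
    sequence; all other layers are local and cache at most `window` tokens.
    """
    n_global = n_layers // global_every
    n_local = n_layers - n_global
    full = n_layers * seq_len
    hybrid = n_global * seq_len + n_local * min(window, seq_len)
    return full, hybrid


# Hypothetical configuration, not MiMo-V2.5's real one.
full, hybrid = cached_token_slots(
    seq_len=1_000_000, n_layers=42, global_every=7, window=128
)
print(f"reduction: {full / hybrid:.2f}x")  # about 6.99x under these assumed values
```

For long sequences the local layers contribute almost nothing, so the ratio approaches the inverse of the fraction of global layers. For sequences shorter than the window, both layouts cache the same number of tokens and there is no saving.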

GCache is a high-performance distributed cache system that serves as the L3 storage tier for the inference engine. It uses RDMA networking and consistent hashing to store session data across GPU nodes, enabling high cache hit rates and reducing the need for expensive recomputation during multi-turn agent interactions.
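
Consistent hashing is what lets a cache like this add or lose a node without reshuffling every session. The following is a minimal, generic sketch of the technique, not GCache's actual implementation: the `HashRing` class, the node names, and the virtual-node count are hypothetical, and the source does not describe how GCache builds its ring.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node is placed on the ring at `vnodes` points.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key):
        # Walk clockwise to the first ring point after the key's hash,
        # wrapping around to the start of the ring if needed.
        idx = bisect.bisect_right(self._points, self._hash(key))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical node names and session key.
ring = HashRing(["gpu-node-a", "gpu-node-b", "gpu-node-c", "gpu-node-d"])
print(ring.node_for("session-1234"))
```

The useful property is that removing one node only remaps the keys that lived on it; keys owned by the remaining nodes stay where they were, so their cached session data remains reachable.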

The update introduces parallel video decoding and asynchronous multimodal embedding transfers. These optimizations allow the system to process large images and long videos without stalling the GPU, effectively doubling encoder throughput and significantly reducing the latency of vision and audio tasks.
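
One common way to parallelize video decoding is to split the frame range into contiguous segments and decode them concurrently. The sketch below shows only that splitting and ordering logic. It is an assumption about the general approach rather than Xiaomi's pipeline, and `decode_segment` is a hypothetical placeholder that returns frame indices instead of calling a real decoder.

```python
from concurrent.futures import ThreadPoolExecutor


def split_segments(total_frames, n_segments):
    """Split [0, total_frames) into contiguous, near-equal (start, end) ranges."""
    base, extra = divmod(total_frames, n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        length = base + (1 if i < extra else 0)
        segments.append((start, start + length))
        start += length
    return segments


def decode_segment(segment):
    """Hypothetical stand-in for a decoder: returns the segment's frame indices."""
    start, end = segment
    return list(range(start, end))


def decode_parallel(total_frames, n_workers=8):
    segments = split_segments(total_frames, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() yields results in segment order, so frames stay in sequence.
        parts = list(pool.map(decode_segment, segments))
    return [frame for part in parts for frame in part]


frames = decode_parallel(total_frames=3600)  # e.g. one hour sampled at 1 frame/s
print(len(frames))
```

Threads only help here if the real decoder releases Python's GIL or runs outside the interpreter; a pure-Python decoder would need processes instead.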
