Xiaomi MiMo Engineering Breakthrough Cuts Long Context KVCache Costs Sevenfold

MiMo

May 31, 2026 · Updated Jun 12, 2026

Xiaomi MiMo released a full-pipeline inference optimization for its MiMo-V2.5 series to get the most out of its hybrid attention architecture. The update cuts KVCache storage requirements sevenfold and reaches a server-side cache hit rate of 93% to 95% for long-context agentic workflows.

The MiMo-V2.5 inference optimizations are aimed at long-context costs. A dual-pool KVCache (the store of attention keys and values for tokens the model has already processed) keeps local and global attention layers in separate pools. Sliding window layers then store only their immediate context, which brings storage and compute requirements down to about one-seventh of those of traditional models.
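To make the dual-pool idea concrete, here is a minimal sketch of a per-sequence cache that keeps global-attention layers and sliding-window layers in separate pools. The class name, the layer layout and the window size are all hypothetical and chosen for illustration; this is not Xiaomi's implementation.

```python
from collections import deque


class DualPoolKVCache:
    """Toy per-sequence KV store with separate pools for global and sliding-window layers."""

    def __init__(self, global_layers, local_layers, window):
        # Global-attention layers keep one entry per token for the whole sequence.
        self.global_pool = {layer: [] for layer in global_layers}
        # Sliding-window layers keep only the most recent `window` entries;
        # deque(maxlen=...) drops the oldest entry automatically.
        self.local_pool = {layer: deque(maxlen=window) for layer in local_layers}

    def append(self, layer, kv):
        if layer in self.global_pool:
            self.global_pool[layer].append(kv)
        else:
            self.local_pool[layer].append(kv)

    def stored_entries(self):
        n_global = sum(len(v) for v in self.global_pool.values())
        n_local = sum(len(v) for v in self.local_pool.values())
        return n_global + n_local


# Hypothetical layout: 1 global layer, 3 sliding-window layers, window of 4 tokens.
cache = DualPoolKVCache(global_layers=[0], local_layers=[1, 2, 3], window=4)
for token in range(100):
    for layer in range(4):
        cache.append(layer, ("k", "v", token))

print(cache.stored_entries())  # 100 global + 3 * 4 local = 112, versus 400 for four full layers
```

The local pool stops growing once the window is full, so only the global pool scales with sequence length.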

KVCache storage reduction: 7x
Server-side cache hit rate: 93% to 95%
MTP initial token speedup: 2.3x
1-hour video decoding time: 23 seconds
RDMA read throughput: 170 GB/s

These optimizations target the memory bottlenecks of million-token context windows. With SWA-aware prefix cache trees (SWA is sliding window attention) and a distributed L3 cache called GCache, Xiaomi reports a server-side cache hit rate of 93% to 95%. The engineering work underpins the MiMo-V2.5 price reduction and positions the series against Qwen for high-volume agentic tasks.
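A prefix cache tree lets a new request reuse the KV entries of any earlier request that shares its opening tokens. The sketch below is a generic block-granular trie with hypothetical names and a hypothetical block size. It does not model the SWA-specific handling that the MiMo version adds, and shows only how a multi-turn agent session ends up with a high hit rate.

```python
class PrefixCacheTree:
    """Block-granular trie: each node stands for one cached block of tokens."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.root = {}

    def _blocks(self, tokens):
        # Only full blocks are cacheable; a trailing partial block is ignored.
        n = len(tokens) - len(tokens) % self.block_size
        return [tuple(tokens[i:i + self.block_size]) for i in range(0, n, self.block_size)]

    def insert(self, tokens):
        node = self.root
        for block in self._blocks(tokens):
            node = node.setdefault(block, {})

    def match(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for block in self._blocks(tokens):
            if block not in node:
                break
            node = node[block]
            matched += self.block_size
        return matched


tree = PrefixCacheTree(block_size=4)
turn_1 = list(range(16))                 # first request: 16 tokens
turn_2 = turn_1 + [100, 101, 102, 103]   # next turn reuses the whole history

tree.insert(turn_1)
hit = tree.match(turn_2)
print(hit, hit / len(turn_2))  # 16 cached tokens out of 20 -> 0.8
```

Each turn of an agent loop appends to the same history, so the share of tokens served from cache rises as the conversation gets longer.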

Users get 2.3x faster initial responses through Multi-Token Prediction (MTP). Multimodal performance improves as well: parallel video decoding cuts the decoding time for a one-hour video to 23 seconds. The updates are live for MiMo-V2.5 and MiMo-V2.5-Pro, and several optimizations are being upstreamed to the SGLang community.

View the full update on mimo.xiaomi.com

Xiaomi MiMo (@XiaomiMiMo) · May 30

What’s new with MiMo-V2.5 series inference? We just published a blog on our full pipeline inference optimizations for MiMo-V2.5 series, including how we pushed hybrid SWA efficiency to the limit. Read the full blog here: https://t.co/lYBEcgaVhU

Still wondering? A few quick answers below.

What is the MiMo-V2.5 inference optimization?

It is a full-pipeline engineering update designed to maximize the efficiency of the Hybrid Sliding Window Attention (Hybrid SWA) architecture. By refactoring KVCache management and introducing a distributed caching system, Xiaomi reduces the memory and compute overhead of processing long sequences by approximately 85% compared to traditional models.

How does Hybrid SWA reduce AI costs?

Hybrid Sliding Window Attention interleaves local and global attention layers. Because local layers store data only for a small window of tokens rather than the entire sequence, the KVCache storage requirement drops to one-seventh of that of standard models. This allows higher concurrency and lower hardware costs per request.
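The arithmetic behind a figure like this can be sketched as follows. The layer ratio (one global layer per six sliding-window layers) and the 1,024-token window are assumptions picked so that the limit works out to 7x; the source does not state MiMo-V2.5's actual configuration.

```python
def kv_entries(seq_len, global_layers, local_layers, window):
    """Per-sequence KV entries (token slots summed over layers) for a hybrid stack."""
    return global_layers * seq_len + local_layers * min(seq_len, window)


def reduction_factor(seq_len, global_layers, local_layers, window):
    """How many times smaller the hybrid cache is than an all-global stack."""
    full = (global_layers + local_layers) * seq_len
    return full / kv_entries(seq_len, global_layers, local_layers, window)


# Hypothetical stack: 1 global layer per 6 sliding-window layers, 1,024-token window.
for n in (1_024, 32_768, 1_000_000):
    factor = reduction_factor(n, global_layers=1, local_layers=6, window=1_024)
    print(n, round(factor, 2))
```

Under these assumptions there is no saving while the sequence still fits inside the window, and the factor approaches (global + local) / global = 7 as the sequence grows, which is why the benefit shows up mainly on long contexts.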

What is GCache in the MiMo ecosystem?

GCache is a high-performance distributed cache system that serves as the L3 storage tier for the inference engine. It uses RDMA networking and consistent hashing to store session data across GPU nodes, which enables high cache hit rates and reduces the need for expensive recomputation during multi-turn agent interactions.
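Consistent hashing is what lets a cache like this add or remove nodes without reshuffling every session. Below is a generic textbook sketch of a hash ring with virtual nodes. The node names, session keys and vnode count are hypothetical, and nothing here reflects GCache's actual code or its RDMA transport.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def node_for(self, key):
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]


# Hypothetical cluster and session IDs.
old_nodes = [f"gpu-node-{i}" for i in range(4)]
sessions = [f"session-{i}" for i in range(1000)]

before = HashRing(old_nodes)
after = HashRing(old_nodes + ["gpu-node-4"])

moved = [s for s in sessions if before.node_for(s) != after.node_for(s)]
print(len(moved), "of", len(sessions), "sessions changed owner")
```

Only sessions that land on the new node change owner, so the cached data on the other nodes stays valid after the cluster is resized.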

How does this update improve multimodal AI?

The update introduces parallel video decoding and asynchronous multimodal embedding transfers. These optimizations allow the system to process large images and long videos without stalling the GPU, effectively doubling encoder throughput and significantly reducing the latency of vision and audio tasks.
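One common way to parallelize video decoding is to cut the video into time segments and decode them concurrently. The sketch below shows only that pattern. `decode_segment` is a placeholder for a real decoder call, and the segment length and worker count are hypothetical. With a real decoder, threads help only if the decoding library releases the GIL; otherwise a process pool would be the closer fit.

```python
from concurrent.futures import ThreadPoolExecutor


def split_segments(duration_s, segment_s):
    """Cut [0, duration_s) into (start, end) ranges of at most segment_s seconds."""
    return [(s, min(s + segment_s, duration_s)) for s in range(0, duration_s, segment_s)]


def decode_segment(segment):
    """Stand-in for a real decoder call; returns one sampled frame timestamp per second."""
    start, end = segment
    return list(range(start, end))


def decode_parallel(duration_s, segment_s=600, workers=6):
    segments = split_segments(duration_s, segment_s)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so frames stay chronological.
        parts = pool.map(decode_segment, segments)
        return [frame for part in parts for frame in part]


frames = decode_parallel(3600)  # a one-hour video, in seconds
print(len(frames), frames[0], frames[-1])  # 3600 0 3599
```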

Keep reading

Xiaomi MiMo Slashes V2.5 API Pricing by 99 Percent

Xiaomi permanently reduced MiMo-V2.5 Series API costs by up to 99% and eliminated tiered pricing for long-context inputs. The update uses inference optimizations to provide 5–8x more tokens in subscription plans, making high-volume agentic workflows significantly more affordable.

Xiaomi Launches MiMo-V2.5 Series With 1M Context and Reasoning Tokens

OpenRouter · Apr 30

Xiaomi released the MiMo-V2.5 series on OpenRouter, featuring a 1 million token context window and native multimodal support for image and video tasks. The models are specifically architected for long-horizon agentic workflows and coding, offering reasoning-enabled thinking tokens to improve task stability. By delivering pro-level performance at roughly half the typical inference cost, these models lower the economic barrier for deploying autonomous agents at scale.

OpenCode Adds Xiaomi MiMo v2.5 Models to Go for Agentic Coding

OpenCode · Apr 24

OpenCode integrated Xiaomi's MiMo v2.5 and v2.5 Pro models into its Go platform, offering native multimodality and specialized coding intelligence. These agent-centric models provide a 1-million-token context window for complex engineering tasks at the same price point as previous versions.

Arena.ai Ranks Xiaomi MiMo-V2.5 as Top Open Source Coding Model

Arena · Apr 30

Arena.ai validated Xiaomi's MiMo-V2.5-Pro as a top-three open-weight model for frontend web development following its official open-source release under the MIT license. The model features a 1-million-token context window and native multimodality, offering a high-performance alternative for commercial agentic workflows.
