HeadsUpAI

MiniMax M3 cuts attention's share of decode time from about 30 to 5 percent

MiniMax shared technical details about MiniMax-M3, a multimodal model with a 1-million-token context window. The model uses MiniMax Sparse Attention (MSA), which keeps the original, uncompressed KV cache and performs block-level selection with a small top-K. According to MiniMax, this keeps the 1-million-token window tractable without compressing the KV cache, as approaches such as CSA and HCA do.
Context Window: 1,000,000 tokens
Attention Kernel Time: about 5% of per-decode wall-clock time
Input Modalities: Text, Image, Video
Architecture: MiniMax Sparse Attention

In MiniMax's previous generation, the attention kernel took roughly 30% of per-decode wall-clock time; with MiniMax Sparse Attention, the company says that share falls to about 5%. MiniMax-M3 is natively multimodal, accepting image and video inputs, and can operate a desktop computer. On vision-coding tasks it also evaluates its own work: it renders its output, inspects the result, judges it, and iterates.
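The two percentages are shares of a decode step, not speedups, so it helps to work out what they would imply. The sketch below does that arithmetic under one loud assumption that is not from MiniMax: the non-attention time per decode step is the same before and after. The function name `implied_speedups` is made up for illustration.

```python
from typing import Tuple


def implied_speedups(old_share: float, new_share: float) -> Tuple[float, float]:
    """Return (attention_speedup, overall_speedup) implied by a drop in the
    attention kernel's share of per-decode time.

    ASSUMPTION (not from the source): the non-attention time per decode
    step is identical before and after the change.
    """
    if not (0.0 < new_share <= old_share < 1.0):
        raise ValueError("expected 0 < new_share <= old_share < 1")
    old_total = 1.0                        # normalize the old step time to 1
    other = old_total * (1.0 - old_share)  # non-attention time, held fixed
    new_total = other / (1.0 - new_share)  # new step time with the same 'other'
    old_attn = old_total * old_share
    new_attn = new_total * new_share
    return old_attn / new_attn, old_total / new_total


attn_x, overall_x = implied_speedups(0.30, 0.05)
print(f"attention kernel: ~{attn_x:.1f}x faster, decode step: ~{overall_x:.2f}x faster")
# prints "attention kernel: ~8.1x faster, decode step: ~1.36x faster"
```

Under that assumption, a 30% to 5% shift corresponds to an attention kernel roughly eight times faster and a decode step about a third faster. Because the two figures come from different model generations, the real numbers may differ.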

Beyond coding, MiniMax reports junior-analyst-level performance on finance tasks and says the model handles long-horizon agentic workflows. Future releases will target harder long-horizon and multi-file tasks, and MiniMax plans to scale data and post-training (RL) compute toward pre-training scale. The model is available via the MiniMax platform and through inference providers including SiliconFlow and Together AI.

MiniMax (official), @MiniMax_AI, on X:

We wrapped a live session on M3 yesterday with the @togethercompute team & our researchers @zpysky1125 and @HaohaiSun

A few highlights 🧵

1. MSA (MiniMax Sparse Attention) is the star ⭐️. Unlike CSA/HCA, which compress the KV cache, MSA keeps the real, uncompressed KV and does block-level selection with a small top-K. That's how the 1M context window stays tractable.
2. The efficiency win is huge. In our previous generation, ~30% of per-decode wall-clock time went to the attention kernel. With MSA that now drops to ~5%. Big gains for long-context generation.
3. M3 isn't just a coding model. Natively multimodal (image + video in), ability to handle long-horizon agentic tasks, and even operate a desktop computer. People are already throwing game-dev + Minecraft-style builds at it (Unity included) and it's holding its own.
4. M3 can self-evaluate on vision-coding tasks: it builds a website or SVG, browses and inspects its own rendered output, judges it, and iterates - grading work visually.
5. We're also seeing junior-analyst-level performance on finance tasks; something we haven't even showcased publicly yet.
6. What's next: harder long-horizon / multi-file tasks in future releases, scaling data + post-training (RL) compute toward pre-training scale, and going deeper into finance, legal & bio.

Thanks to everyone who joined 🙏 Try M3 link in the comments👇


Still wondering? A few quick answers below.

MiniMax Sparse Attention (MSA) is the attention architecture that makes MiniMax-M3's 1-million-token context window tractable. Unlike methods that compress the KV cache to save memory, MSA keeps the original, uncompressed KV cache and selects a small top-K set of blocks to attend to. MiniMax says this cuts the attention kernel's share of per-decode wall-clock time from about 30% in its previous generation to about 5%.
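For readers who want to see what block-level selection over an uncompressed cache looks like, here is a minimal pure-Python sketch for a single query vector. It is not MiniMax's implementation: the source does not say how blocks are scored, so scoring each block by the dot product between the query and the block's mean key is an assumption, and `block_sparse_attention`, `block_size`, and `top_k` are illustrative names.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(query, keys, values, block_size, top_k):
    """Attend only to the top_k highest-scoring blocks of an uncompressed KV cache.

    Returns (output_vector, selected_block_indices).
    """
    if not keys or block_size < 1 or top_k < 1:
        raise ValueError("need a non-empty cache, block_size >= 1 and top_k >= 1")
    n, dim = len(keys), len(query)
    # Split token positions into contiguous blocks.
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    def block_score(block):
        # ASSUMPTION: score a block by query . mean(key); the real rule is not public here.
        mean_key = [sum(keys[i][d] for i in block) / len(block) for d in range(dim)]
        return dot(query, mean_key)

    ranked = sorted(range(len(blocks)), key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(ranked[:top_k])

    # Exact softmax attention over the original keys/values of the selected blocks.
    positions = [i for b in selected for i in blocks[b]]
    scale = 1.0 / math.sqrt(dim)
    logits = [dot(query, keys[i]) * scale for i in positions]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out_dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, positions)) / total
        for d in range(out_dim)
    ]
    return output, selected


# Toy cache: 8 tokens in 4 blocks of 2; blocks 1 and 3 align with the query.
keys = [[0, 1], [0, 1], [5, 0], [5, 0], [-1, 0], [-1, 0], [4, 0], [4, 0]]
values = [[float(i)] for i in range(8)]
output, selected = block_sparse_attention([1.0, 0.0], keys, values, block_size=2, top_k=2)
print(selected)  # prints [1, 3]
```

The point of the design is that the keys and values inside the chosen blocks are used as stored, so the attention over them is exact; only the choice of which blocks to read is approximate. Cost per decode step then scales with `top_k * block_size` rather than with the full context length.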

The model uses a self-evaluation loop to improve its coding outputs. When creating a website or SVG, MiniMax-M3 renders the code, browses the resulting output, and visually inspects it. It then judges the quality of its own rendered work and iterates on the code to fix errors or improve the visual result.
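The render, inspect, judge, and iterate cycle can be sketched as a plain loop. Everything named here is hypothetical: `generate`, `render`, and `judge` stand in for the model call, a browser or SVG renderer, and the visual grading step, and the stopping threshold is an assumption rather than anything MiniMax has described.

```python
from typing import Callable, Tuple


def refine_with_visual_feedback(
    generate: Callable[[str, str], str],
    render: Callable[[str], str],
    judge: Callable[[str], Tuple[float, str]],
    task: str,
    max_rounds: int = 3,
    good_enough: float = 0.9,
) -> Tuple[str, float, int]:
    """Generate code, render it, judge the rendering, and revise until it passes.

    All three callables are hypothetical stand-ins supplied by the caller.
    Returns (best_code, best_score, rounds_used).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback = ""
    best_code, best_score = "", float("-inf")
    for round_number in range(1, max_rounds + 1):
        code = generate(task, feedback)    # write or revise the code
        rendering = render(code)           # produce what a viewer would see
        score, feedback = judge(rendering) # grade the rendered result
        if score > best_score:
            best_code, best_score = code, score
        if score >= good_enough:
            break
    return best_code, best_score, round_number


# Toy stand-ins so the loop can run without a model or a browser.
def toy_generate(task, feedback):
    body = "<svg><circle/></svg>"
    if "title" in feedback:
        return body.replace("<svg>", "<svg><title>logo</title>")
    return body


def toy_render(code):
    return code  # a real renderer would return pixels or a screenshot


def toy_judge(rendering):
    if "<title>" in rendering:
        return 1.0, ""
    return 0.5, "add a title"


code, score, rounds = refine_with_visual_feedback(
    toy_generate, toy_render, toy_judge, "draw a logo"
)
print(rounds, score)  # prints "2 1.0"
```

Keeping the best-scoring attempt, rather than the last one, guards against a revision that makes the rendering worse.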

MiniMax-M3 is a multimodal model capable of processing text, images, and video. It can autonomously operate a desktop computer and handle long-horizon agentic tasks. While it excels at coding, it also shows junior-analyst-level performance in finance, with future expansions planned for legal and bio applications.

Every HeadsUpAI update is written based on its original source and reviewed before it's published.
