MiniMax M3 drops attention overhead from 30 to 5 percent

MiniMax

Jun 3, 2026 · Updated Jun 13, 2026

MiniMax shared technical highlights for its M3 model, whose MiniMax Sparse Attention (MSA) architecture keeps the KV cache uncompressed across its 1-million-token context window. MSA cuts the attention kernel's share of per-decode wall-clock time from about 30% in the previous generation to about 5%. M3 also adds vision-coding capabilities in which the model evaluates its own rendered UI.

MiniMax shared technical details on MiniMax-M3, a multimodal model with a 1-million-token context window. It uses MiniMax Sparse Attention (MSA), which performs block-level selection with a small top-K while keeping the original, uncompressed KV cache. According to MiniMax, this is what keeps the 1M context window tractable, in contrast to approaches such as CSA/HCA that compress the KV cache.
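The idea of block-level selection over an uncompressed KV cache can be illustrated with a toy example. The sketch below is not MiniMax's implementation: the source does not say how MSA scores blocks, so scoring each block by the query's dot product with the block's mean key is an assumption made purely for illustration, and the function name `block_sparse_attention` and its parameters are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def block_sparse_attention(query, keys, values, block_size, top_k):
    """Attend over only the top_k highest-scoring KV blocks.

    The KV cache (keys, values) is stored in full and never compressed;
    sparsity comes only from which blocks are read for this query.
    """
    n = len(keys)
    dim = len(query)
    # Split cache positions into contiguous blocks.
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    # ASSUMPTION: score a block by the query's dot product with its mean key.
    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block) for d in range(dim)]
        return dot(query, mean_key)

    ranked = sorted(range(len(blocks)), key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(ranked[:top_k])

    # Ordinary scaled dot-product attention, restricted to the selected blocks.
    positions = [i for b in selected for i in blocks[b]]
    scale = 1.0 / math.sqrt(dim)
    weights = softmax([dot(query, keys[i]) * scale for i in positions])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, positions))
        for d in range(len(values[0]))
    ]
    return output, selected


# Toy cache: 8 positions in 4 blocks of 2; only 2 blocks are read.
keys = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0],
        [-1.0, 0.0], [-1.0, 0.0], [0.5, 0.0], [0.5, 0.0]]
values = [[float(i), 1.0] for i in range(8)]
output, selected = block_sparse_attention([1.0, 0.0], keys, values, block_size=2, top_k=2)
print(selected)  # → [1, 3]
```

With a fixed `top_k`, the number of key/value rows read per decode step stays constant as the cache grows, which is the general reason block selection can keep long contexts tractable. The scoring pass in this toy version still touches every key; a practical kernel would presumably score blocks from cheaper per-block summaries.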

Context Window: 1,000,000 tokens
Attention Kernel Time: ~5% of per-decode wall-clock time (down from ~30% in the previous generation)
Input Modalities: Text, Image, Video
Architecture: MiniMax Sparse Attention

In MiniMax's previous generation, the attention kernel took roughly 30% of per-decode wall-clock time; with MiniMax Sparse Attention, that share drops to roughly 5%. MiniMax-M3 is natively multimodal, accepting image and video inputs, and can operate a desktop computer. On vision-coding tasks, it builds a website or SVG, inspects its own rendered output, judges it, and iterates.

Beyond coding, MiniMax reports junior-analyst-level performance on finance tasks, which it has not yet showcased publicly, and says the model handles long-horizon agentic workflows. Future releases will target harder long-horizon and multi-file tasks, scale data and post-training (RL) compute toward pre-training scale, and go deeper into finance, legal, and bio. The model is available via the MiniMax platform and through inference providers including Together AI, Fireworks AI, and OpenRouter.

View the full update on platform.minimax.io

MiniMax (official)

@MiniMax_AI · Jun 2

We wrapped a live session on M3 yesterday with the @togethercompute team & our researchers @zpysky1125 and @HaohaiSun. A few highlights 🧵

1. MSA (MiniMax Sparse Attention) is the star ⭐️. Unlike CSA/HCA, which compress the KV cache, MSA keeps the real, uncompressed KV and does block-level selection with a small top-K. That's how the 1M context window stays tractable.

2. The efficiency win is huge. In our previous generation, ~30% of per-decode wall-clock time went to the attention kernel. With MSA that now drops to ~5%. Big gains for long-context generation.

3. M3 isn't just a coding model. Natively multimodal (image + video in), ability to handle long-horizon agentic tasks, and even operate a desktop computer. People are already throwing game-dev + Minecraft-style builds at it (Unity included) and it's holding its own.

4. M3 can self-evaluate on vision-coding tasks: it builds a website or SVG, browses and inspects its own rendered output, judges it, and iterates - grading work visually.

5. We're also seeing junior-analyst-level performance on finance tasks; something we haven't even showcased publicly yet.

6. What's next: harder long-horizon / multi-file tasks in future releases, scaling data + post-training (RL) compute toward pre-training scale, and going deeper into finance, legal & bio.

Thanks to everyone who joined 🙏 Try M3 link in the comments👇


Still wondering? A few quick answers below.

What is MiniMax Sparse Attention?

MiniMax Sparse Attention (MSA) is the attention architecture that makes MiniMax-M3's 1-million-token context window tractable. Unlike methods that compress the KV cache to save memory, MSA keeps the original, uncompressed KV cache and uses block-level selection with a small top-K. MiniMax says this reduces the attention kernel's share of per-decode wall-clock time from about 30% in its previous generation to about 5%.
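The two percentages are shares of per-decode time, not speedups, so it can help to work out what they would imply. The calculation below is a back-of-the-envelope sketch under a strong assumption that the source does not state: that the non-attention part of each decode step takes the same time in both generations. The function name `implied_speedups` is hypothetical.

```python
def implied_speedups(old_share, new_share):
    """Return (overall_speedup, attention_speedup).

    ASSUMPTION: everything except the attention kernel takes the same
    time before and after, so only the attention time changes.
    """
    other = 1.0 - old_share                # non-attention time, in old-step units
    new_total = other / (1.0 - new_share)  # new step time, in old-step units
    new_attention = new_total * new_share
    return 1.0 / new_total, old_share / new_attention


overall, attention = implied_speedups(0.30, 0.05)
print(f"{overall:.2f}x per-decode step, {attention:.2f}x attention kernel")
# → 1.36x per-decode step, 8.14x attention kernel
```

Under that assumption, a drop from 30% to 5% would correspond to an attention kernel roughly eight times faster and a decode step roughly a third faster. Real figures depend on context length, hardware, and whatever else changed between generations.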

How does MiniMax-M3 handle vision-coding tasks?

MiniMax-M3 uses a self-evaluation loop on vision-coding tasks. When it builds a website or SVG, it renders the code, browses and visually inspects the output, judges the quality of the rendered result, and iterates on the code.
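The render, inspect, judge, iterate cycle described above has the shape of a simple control loop. The sketch below shows only that shape and is not how M3 works internally: `generate`, `render`, and `judge` are hypothetical callables standing in for the model, a renderer, and a visual grader, and `max_rounds` is an assumed stopping rule.

```python
from typing import Callable, Tuple


def refine_until_accepted(
    generate: Callable[[str], str],
    render: Callable[[str], str],
    judge: Callable[[str], Tuple[bool, str]],
    task: str,
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Generate code, render it, judge the rendering, and retry with feedback.

    Returns the last code produced and the number of rounds used.
    """
    prompt = task
    code = ""
    for round_number in range(1, max_rounds + 1):
        code = generate(prompt)
        rendered = render(code)
        accepted, feedback = judge(rendered)
        if accepted:
            return code, round_number
        # Feed the judgment back so the next attempt can address it.
        prompt = f"{task}\nPrevious attempt:\n{code}\nReviewer feedback: {feedback}"
    return code, max_rounds


# Toy stand-ins: the "model" only adds the circle after receiving feedback.
def toy_generate(prompt: str) -> str:
    return "<svg><circle/></svg>" if "feedback" in prompt else "<svg></svg>"


def toy_render(code: str) -> str:
    return code  # a real renderer would produce pixels or a screenshot


def toy_judge(rendered: str) -> Tuple[bool, str]:
    return "<circle" in rendered, "no circle element found"


code, rounds = refine_until_accepted(toy_generate, toy_render, toy_judge, "Draw a circle as SVG")
print(rounds, code)  # → 2 <svg><circle/></svg>
```

Capping the number of rounds keeps the loop from running indefinitely when the judge never accepts the output.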

What are the primary capabilities of MiniMax-M3?

MiniMax-M3 is a multimodal model that processes text, images, and video. It can operate a desktop computer and handle long-horizon agentic tasks. Beyond coding, MiniMax reports junior-analyst-level performance on finance tasks and plans to go deeper into finance, legal, and bio.


Keep reading

Together AI powers MiniMax M3 with 1M context and sparse attention

Together AI is now powering inference for MiniMax M3, a multimodal model featuring a 1-million-token context window. The model uses a new sparse attention architecture to process massive datasets with significantly lower computational overhead than previous-generation models.

Fireworks AI hosts MiniMax M3 with 15x faster long context decoding

Fireworks AI · Jun 4

Fireworks AI is now powering inference for MiniMax M3, a multimodal model featuring a novel sparse attention architecture. The partnership enables 15.6x faster decoding at 1-million-token context, making real-time agentic workflows viable at scale.

OpenRouter adds MiniMax-M3 with 1M context for multimodal agentic coding

OpenRouter · Jun 1

OpenRouter integrated MiniMax-M3, an open-weight multimodal model featuring a 1-million-token context window and specialized sparse attention. By reducing long-context compute costs by 95%, the model enables persistent agentic workflows across massive codebases and video files.

MiniMax M3 adds persistent memory for long-term agentic workflows

MiniMax · Jun 4

MiniMax announced a partnership with Mem0 to integrate a persistent memory layer into its new M3 multimodal model. By combining a 1-million-token context window with long-term data retention, the integration allows developers to build autonomous agents that remember user preferences across multiple sessions.

