HeadsUpAI

Fireworks AI hosts MiniMax M3 with 15x faster long context decoding

Fireworks AI is partnering with MiniMax to provide high-speed inference (running a trained model to generate outputs) for the new MiniMax M3. The model introduces MiniMax Sparse Attention (MSA), a novel architecture for 1-million-token context windows. This system achieves a 15.6x increase in decoding speed at full context.
Decoding Speedup
15.6x at 1M tokens
Attention Overhead
Reduced from 30% to 5%
Prefill Speedup
9.7x at 1M tokens

This update addresses the primary bottleneck in long-context AI: the computational cost of attention. By reducing attention kernel overhead from 30% to 5%, the model maintains uncompressed data without the typical performance penalty. It reaches frontier-grade performance on benchmarks like SWE-Bench Pro, scoring 59.0% for agentic coding tasks.

You can now access MiniMax M3 through Fireworks AI for applications requiring massive data retrieval. The model supports interleaved text, image, and video inputs for workflows like vision-based code evaluation. While weights are restricted, the model will be available to the community once released, following rollouts on SiliconFlow.

Fireworks AI
Fireworks AI
@FireworksAI_HQ
X

MiniMax M3 arrives with MiniMax Sparse Attention (MSA), 15.6x faster decoding at 1M tokens. We're partnering with @MiniMax_AI to power the inference behind this week's launch. Head to https://t.co/kZWnBSmlt0 to take it for a spin. Once the model weights are released, M3 will be available to the Fireworks community.

1retweets18likes
View on X

Still wondering? A few quick answers below.

MiniMax M3 is a multimodal foundation model designed for frontier-grade coding and agentic tasks. It features a 1-million-token context window and was trained on 100 trillion tokens of interleaved text, image, and video data. The model is built to handle production-grade engineering tasks that go beyond simple code generation.

MiniMax Sparse Attention (MSA) is a novel architecture that allows models to scale context windows to 1 million tokens without the exponential computational costs of standard attention. By reducing the attention kernel overhead to just 5% of decoding time, it enables much faster processing of massive datasets and long documents.

Through the partnership with Fireworks AI, MiniMax M3 achieves a 15.6x increase in decoding speed when processing 1 million tokens. It also features a 9.7x speedup during the prefill stage at full context, making it one of the most efficient models for real-time applications involving ultra-long context windows.

Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →

Share this update