Fireworks AI hosts MiniMax M3 with 15x faster long context decoding

Fireworks AI

Jun 4, 2026 · Updated Jun 12, 2026

Fireworks AI is now powering inference for MiniMax M3, a multimodal model featuring a novel sparse attention architecture. The partnership enables 15.6x faster decoding at 1-million-token context, making real-time agentic workflows viable at scale.

Fireworks AI is partnering with MiniMax to provide high-speed inference (running a trained model to generate outputs) for the newly launched MiniMax M3. The model introduces MiniMax Sparse Attention (MSA), a novel architecture for 1-million-token context windows, and achieves a 15.6x increase in decoding speed at full context.

Decoding Speedup: 15.6x at 1M tokens
Architecture: MiniMax Sparse Attention (MSA)
Context Window: 1,000,000 tokens
Inputs: Interleaved text, image, video
Availability: Fireworks AI (weights to community on release)

MSA lets the model scale to a 1-million-token context without the exponential computational cost of standard attention, removing the usual speed penalty on long-context work. The model accepts interleaved text, image, and video inputs, supporting multimodal workflows beyond plain text generation.

You can now access MiniMax M3 through Fireworks AI for applications requiring massive context. While the model weights are currently restricted, M3 will be available to the Fireworks community once they are released, following rollouts on inference providers like SiliconFlow.

View the full update on minimax.io

Fireworks AI

@FireworksAI_HQJun 3

MiniMax M3 arrives with MiniMax Sparse Attention (MSA), 15.6x faster decoding at 1M tokens. We're partnering with @MiniMax_AI to power the inference behind this week's launch. Head to https://t.co/kZWnBSmlt0 to take it for a spin. Once the model weights are released, M3 will be available to the Fireworks community.

118

View on X

Still wondering? A few quick answers below.

MiniMax M3 is a multimodal foundation model with a 1-million-token context window that accepts interleaved text, image, and video inputs. It is built on MiniMax Sparse Attention for efficient long-context processing, designed for production-grade agentic and engineering tasks.

MiniMax Sparse Attention (MSA) is a novel architecture that lets models scale context windows to 1 million tokens without the exponential computational cost of standard attention. This enables much faster processing of massive datasets and long documents.

Through the partnership with Fireworks AI, MiniMax M3 achieves a 15.6x increase in decoding speed when processing 1 million tokens, making it one of the most efficient options for real-time applications involving ultra-long context windows.

Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →

See all AI news & updates from Fireworks AI →

Keep reading

Together AI powers MiniMax M3 with 1M context and sparse attention

Together AI is now powering inference for MiniMax M3, a multimodal model featuring a 1-million-token context window. The model uses a new sparse attention architecture to process massive datasets with significantly lower computational overhead than previous-generation models.

MiniMax M3 drops attention overhead from 30 to 5 percent

MiniMaxJun 3

MiniMax M3 drops attention overhead from 30 to 5 percent

MiniMax revealed technical highlights for its M3 model, featuring a Sparse Attention architecture that maintains uncompressed data for its 1-million-token context window. The update reduces attention kernel overhead from 30% to 5% of per-decode wall-clock time and introduces vision-coding capabilities where the model self-evaluates its own rendered UI.

OpenRouter adds MiniMax-M3 with 1M context for multimodal agentic coding

OpenRouterJun 1

OpenRouter adds MiniMax-M3 with 1M context for multimodal agentic coding

OpenRouter integrated MiniMax-M3, an open-weight multimodal model featuring a 1-million-token context window and specialized sparse attention. By reducing long-context compute costs by 95%, the model enables persistent agentic workflows across massive codebases and video files.

Ollama Cloud Adds MiniMax M3 for Frontier Agentic Coding and 1M Context

OllamaJun 7

Ollama Cloud Adds MiniMax M3 for Frontier Agentic Coding and 1M Context

Ollama has made the MiniMax M3 model available on its Cloud, providing US-based access with zero data retention. This integration offers a frontier-level, open-weight model for agentic coding and multimodal tasks, featuring a 1-million-token context window. It expands access to advanced AI capabilities for complex, autonomous workflows.

What is MiniMax M3?

What is MiniMax Sparse Attention?

How fast is MiniMax M3 on Fireworks AI?

Keep reading

Together AI powers MiniMax M3 with 1M context and sparse attention

Together AI powers MiniMax M3 with 1M context and sparse attention

MiniMax M3 drops attention overhead from 30 to 5 percent

MiniMax M3 drops attention overhead from 30 to 5 percent

OpenRouter adds MiniMax-M3 with 1M context for multimodal agentic coding

OpenRouter adds MiniMax-M3 with 1M context for multimodal agentic coding

Ollama Cloud Adds MiniMax M3 for Frontier Agentic Coding and 1M Context

Ollama Cloud Adds MiniMax M3 for Frontier Agentic Coding and 1M Context

Keep reading

Together AI powers MiniMax M3 with 1M context and sparse attention

Together AI powers MiniMax M3 with 1M context and sparse attention

MiniMax M3 drops attention overhead from 30 to 5 percent

MiniMax M3 drops attention overhead from 30 to 5 percent

OpenRouter adds MiniMax-M3 with 1M context for multimodal agentic coding

OpenRouter adds MiniMax-M3 with 1M context for multimodal agentic coding

Ollama Cloud Adds MiniMax M3 for Frontier Agentic Coding and 1M Context

Ollama Cloud Adds MiniMax M3 for Frontier Agentic Coding and 1M Context