Together AI powers MiniMax M3 with 1M context and sparse attention

Together AI

Jun 3, 2026 · Updated Jun 13, 2026

Together AI is now powering inference for MiniMax M3, a multimodal model featuring a 1-million-token context window. The model uses a new sparse attention architecture to process massive datasets with significantly lower computational overhead than previous-generation models.

Together AI, a research-optimized platform for model inference, is now hosting inference (running a trained model to generate outputs) for MiniMax M3, a multimodal model with a 1-million-token context window. It uses MiniMax Sparse Attention (MSA) to process massive datasets without the exponential compute costs of full attention.

Context Window: 1,000,000 tokens
Architecture: MiniMax Sparse Attention
Coding Benchmark: 59.0% SWE-Bench Pro
Agent Benchmark: 74.2% MCP Atlas
Input Modalities: Text, Image, Video

MiniMax M3 matches full attention performance across multiple benchmarks while reducing per-token compute to 1/20th of previous generations at a 1-million-token context length. This architecture enables autonomous workflows like CUDA kernel optimization, building on the MiniMax M3 technical highlights. The model's native multimodality allows semantic spaces to merge deeply during training.

Access MiniMax M3 via the MiniMax Code app or the Together AI API, available alongside other providers like SiliconFlow. The model supports "thinking" modes for reasoning and "computer use" for desktop automation. Together AI provides the research-optimized infrastructure required to deploy and scale these models in production.

View the full update on minimax.io

Together AI

@togethercomputeJun 1

MiniMax M3 is live and Together AI is powering its inference 🚀 Tomorrow at 6pm PT we're going live on X Spaces with the teams behind the model and the infrastructure to give you a deep dive. https://t.co/wPayfOWmNg

1670

View on X

Still wondering? A few quick answers below.

MiniMax M3 is a natively multimodal frontier model designed for complex agentic tasks and long-context reasoning. It supports up to 1 million tokens and is optimized for coding, autonomous research, and desktop automation.

MSA uses a pre-filtering stage to partition data into blocks, avoiding the quadratic computational growth of traditional attention mechanisms. This reduces per-token compute to 1/20th of previous models, enabling faster processing of massive datasets.

The model is built for long-horizon tasks like independent paper reproduction, CUDA kernel optimization, and autonomous software engineering. Its native multimodality also enables "computer use" capabilities for automating cross-application workflows.

Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →

See all AI news & updates from Together AI →

Keep reading

Fireworks AI hosts MiniMax M3 with 15x faster long context decoding

Fireworks AI is now powering inference for MiniMax M3, a multimodal model featuring a novel sparse attention architecture. The partnership enables 15.6x faster decoding at 1-million-token context, making real-time agentic workflows viable at scale.

MiniMax M3 drops attention overhead from 30 to 5 percent

MiniMaxJun 3

MiniMax M3 drops attention overhead from 30 to 5 percent

MiniMax revealed technical highlights for its M3 model, featuring a Sparse Attention architecture that maintains uncompressed data for its 1-million-token context window. The update reduces attention kernel overhead from 30% to 5% of per-decode wall-clock time and introduces vision-coding capabilities where the model self-evaluates its own rendered UI.

OpenRouter adds MiniMax-M3 with 1M context for multimodal agentic coding

OpenRouterJun 1

OpenRouter adds MiniMax-M3 with 1M context for multimodal agentic coding

OpenRouter integrated MiniMax-M3, an open-weight multimodal model featuring a 1-million-token context window and specialized sparse attention. By reducing long-context compute costs by 95%, the model enables persistent agentic workflows across massive codebases and video files.

Ollama Cloud Adds MiniMax M3 for Frontier Agentic Coding and 1M Context

OllamaJun 7

Ollama Cloud Adds MiniMax M3 for Frontier Agentic Coding and 1M Context

Ollama has made the MiniMax M3 model available on its Cloud, providing US-based access with zero data retention. This integration offers a frontier-level, open-weight model for agentic coding and multimodal tasks, featuring a 1-million-token context window. It expands access to advanced AI capabilities for complex, autonomous workflows.

What is MiniMax M3?

How does MiniMax Sparse Attention (MSA) improve performance?

What are the primary use cases for MiniMax M3?

Keep reading

Fireworks AI hosts MiniMax M3 with 15x faster long context decoding

Fireworks AI hosts MiniMax M3 with 15x faster long context decoding

MiniMax M3 drops attention overhead from 30 to 5 percent

MiniMax M3 drops attention overhead from 30 to 5 percent

OpenRouter adds MiniMax-M3 with 1M context for multimodal agentic coding

OpenRouter adds MiniMax-M3 with 1M context for multimodal agentic coding

Ollama Cloud Adds MiniMax M3 for Frontier Agentic Coding and 1M Context

Ollama Cloud Adds MiniMax M3 for Frontier Agentic Coding and 1M Context

Keep reading

Fireworks AI hosts MiniMax M3 with 15x faster long context decoding

Fireworks AI hosts MiniMax M3 with 15x faster long context decoding

MiniMax M3 drops attention overhead from 30 to 5 percent

MiniMax M3 drops attention overhead from 30 to 5 percent

OpenRouter adds MiniMax-M3 with 1M context for multimodal agentic coding

OpenRouter adds MiniMax-M3 with 1M context for multimodal agentic coding

Ollama Cloud Adds MiniMax M3 for Frontier Agentic Coding and 1M Context

Ollama Cloud Adds MiniMax M3 for Frontier Agentic Coding and 1M Context