Together AI Optimizes MiniMax M3 Inference with New Systems Kernels

Together AI

Jun 15, 2026

Together AI built custom systems-level optimizations to serve MiniMax M3, a model with a 1-million-token context window, at production scale. The team wrote a KV-block-major sparse attention kernel, integrated MSA with a paged KV cache, and optimized decode index scoring. Combined with a Rust-based gateway that handles multimodal preprocessing before requests reach GPU workers, these changes improved throughput by 81–125%, depending on concurrency level.

View the full update on together.ai

Together AI

@togethercompute, 3d ago

M3’s architecture makes long-context inference more efficient. Serving it at production scale required systems work. Together’s kernel and inference teams built KV-block-major sparse attention, integrated MSA with paged KV cache, optimized decode index scoring, and moved multimodal preprocessing into a Rust gateway before requests reach GPU workers.

Keep reading

Together AI powers MiniMax M3 with 1M context and sparse attention

Together AI is now powering inference for MiniMax M3, a multimodal model featuring a 1-million-token context window. The model uses a new sparse attention architecture to process massive datasets with significantly lower computational overhead than previous-generation models.

MiniMax M3 drops attention overhead from 30 to 5 percent

MiniMax, Jun 3

MiniMax revealed technical highlights for its M3 model, featuring a sparse attention architecture that keeps the data in its 1-million-token context window uncompressed. The update reduces attention kernel overhead from 30% to 5% of per-decode wall-clock time and introduces vision-coding capabilities in which the model evaluates its own rendered UI.

Fireworks AI hosts MiniMax M3 with 15x faster long context decoding

Fireworks AI, Jun 4

Fireworks AI is now powering inference for MiniMax M3, a multimodal model featuring a novel sparse attention architecture. The partnership enables 15.6x faster decoding at 1-million-token context, making real-time agentic workflows viable at scale.

OpenRouter adds MiniMax-M3 with 1M context for multimodal agentic coding

OpenRouter, Jun 1

OpenRouter integrated MiniMax-M3, an open-weight multimodal model featuring a 1-million-token context window and specialized sparse attention. By reducing long-context compute costs by 95%, the model enables persistent agentic workflows across massive codebases and video files.