M3’s architecture makes long-context inference more efficient. Serving it at production scale required systems work. Together’s kernel and inference teams built KV-block-major sparse attention, integrated MSA with paged KV cache, optimized decode index scoring, and moved multimodal preprocessing into a Rust gateway before requests reach GPU workers.
Together AI Optimizes MiniMax M3 Inference with New Systems Kernels
Together AI implemented custom engineering optimizations to serve MiniMax M3 at production scale. The team built a KV-block-major sparse attention kernel, integrated paged attention for MSA, and optimized decode index scoring. These changes, alongside a Rust-based multimodal preprocessing gateway, delivered 81–125% throughput improvements across varying concurrency levels for the 1-million-token context model.
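For readers unfamiliar with the terms, here is a minimal Python sketch of the general idea behind block-sparse attention over a paged KV cache. It is not Together AI's kernel or MiniMax's MSA design. Everything in it is an assumption made for illustration: the `PagedKVCache` class, the `select_blocks` and `sparse_decode_attention` helpers, the block size of 4, and the choice to score each block by the query's dot product with the block's mean key.

```python
import math

BLOCK_SIZE = 4  # tokens per KV block (hypothetical; real systems use larger blocks)


class PagedKVCache:
    """Toy paged KV cache: a shared pool of fixed-size blocks plus a per-sequence block table."""

    def __init__(self, num_blocks):
        self.keys = [[] for _ in range(num_blocks)]    # physical block -> key vectors
        self.values = [[] for _ in range(num_blocks)]  # physical block -> value vectors
        self.free = list(range(num_blocks))            # unallocated physical block ids
        self.block_table = {}                          # sequence id -> ordered physical block ids

    def append(self, seq_id, key, value):
        table = self.block_table.setdefault(seq_id, [])
        # Allocate a new physical block only when the last one is full.
        if not table or len(self.keys[table[-1]]) == BLOCK_SIZE:
            table.append(self.free.pop())
        self.keys[table[-1]].append(key)
        self.values[table[-1]].append(value)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_blocks(cache, seq_id, query, top_k):
    """Score each block by the query's dot product with its mean key; keep the top_k blocks."""
    scored = []
    for block in cache.block_table[seq_id]:
        ks = cache.keys[block]
        mean_key = [sum(col) / len(ks) for col in zip(*ks)]
        scored.append((dot(query, mean_key), block))
    scored.sort(reverse=True)
    return [block for _, block in scored[:top_k]]


def sparse_decode_attention(cache, seq_id, query, top_k):
    """Softmax attention for one decode step, restricted to the selected blocks."""
    blocks = select_blocks(cache, seq_id, query, top_k)
    keys = [k for b in blocks for k in cache.keys[b]]
    values = [v for b in blocks for v in cache.values[b]]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, k) * scale for k in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]
```

In this toy version, each decode step scores one summary vector per block and then reads only the `top_k` selected blocks, so the work no longer grows with every cached token. The block table lets a sequence's cache live in non-contiguous memory. A production kernel would do the same bookkeeping on the GPU with very different data layouts.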





