Together AI Optimizes MiniMax M3 Inference with New Systems Kernels

Together AI

Together AI implemented custom engineering optimizations to serve MiniMax M3 at production scale. The team built a KV-block-major sparse attention kernel, integrated paged attention for MSA, and optimized decode index scoring. These changes, alongside a Rust-based multimodal preprocessing gateway, delivered 81–125% throughput improvements across varying concurrency levels for the 1-million-token context model.
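For readers unfamiliar with paged KV caches, the general idea can be sketched as a block table that maps each sequence's token positions onto fixed-size physical blocks allocated on demand. This is a minimal, generic illustration only: the class name `PagedKVCache`, its methods, and the block sizes are hypothetical and are not taken from Together AI's kernels or from M3's MSA integration.

```python
class PagedKVCache:
    """Toy block table (hypothetical, for illustration only).

    Maps each sequence's logical token positions to fixed-size physical
    blocks, so cache memory is allocated per block instead of per
    maximum context length.
    """

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:
            # The current block is full (or this is the first token):
            # take a new physical block from the free pool.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


# Usage: three tokens with block_size=2 occupy two physical blocks.
cache = PagedKVCache(num_blocks=4, block_size=2)
slots = [cache.append_token("request-a") for _ in range(3)]
```

In a real serving stack the attention kernel would read keys and values through such a table; here only the bookkeeping is shown.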

Together AI (@togethercompute) on X:

M3’s architecture makes long-context inference more efficient. Serving it at production scale required systems work. Together’s kernel and inference teams built KV-block-major sparse attention, integrated MSA with paged KV cache, optimized decode index scoring, and moved multimodal preprocessing into a Rust gateway before requests reach GPU workers.
