Kimi Reveals How It Scaled K2.5 at NVIDIA GTC 2026

Kimi

Mar 21, 2026 · Updated Jun 12, 2026

Kimi CEO Zhilin Yang detailed the training innovations behind Kimi K2.5 at NVIDIA GTC 2026. The session covers the Muon optimizer replacing Adam to double token learning efficiency, AI-native training, and a shift toward linear attention for longer-running agents.

Kimi (Moonshot AI), a Chinese AI lab building frontier language models, has released its NVIDIA GTC 2026 session on demand. CEO Zhilin Yang walked through the engineering decisions behind Kimi K2.5, including replacing the Adam optimizer with the Muon optimizer during massive-scale pre-training, a change the session credits with doubling token learning efficiency. The team also co-designed the model architecture and training infrastructure from Day 0, on NVIDIA Hopper and Blackwell hardware, to achieve training stability at scale.
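For readers unfamiliar with Muon, the core idea in public descriptions of the optimizer is to orthogonalize the momentum matrix with a Newton-Schulz iteration before applying it as a weight update. The sketch below is a minimal illustration under stated assumptions, not Kimi's implementation: it uses the textbook cubic Newton-Schulz iteration rather than the tuned polynomial used in published Muon code, plain (non-Nesterov) momentum, and hypothetical helper names (`orthogonalize`, `muon_step`) and hyperparameters chosen only for this example.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def orthogonalize(g, steps=20, eps=1e-12):
    """Approximate the nearest (semi-)orthogonal matrix to g.

    Assumption: the textbook cubic Newton-Schulz iteration
    X <- 1.5 * X - 0.5 * X X^T X, applied after scaling g so that
    all singular values lie in (0, 1].
    """
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, rb)]
             for rx, rb in zip(x, xxt_x)]
    return x

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One simplified Muon-style update for a 2-D weight matrix.

    lr and beta are illustrative values, not figures from the session.
    """
    # Accumulate momentum, then orthogonalize it to get the update direction.
    momentum = [[beta * m + g for m, g in zip(rm, rg)]
                for rm, rg in zip(momentum, grad)]
    update = orthogonalize(momentum)
    weight = [[w - lr * u for w, u in zip(rw, ru)]
              for rw, ru in zip(weight, update)]
    return weight, momentum
```

Because the update is orthogonalized, every direction in the momentum matrix contributes with roughly equal magnitude, whereas Adam rescales each parameter independently. Whether and how this accounts for the efficiency gain reported in the session is covered in the talk itself, not by this sketch.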

The session also covers a shift toward AI-native training, where the model actively participates in its own data synthesis, evaluation, and evolution. Separately, Yang presents the case for linear attention architectures as the foundation for longer-running AI agents, a direction that signals where Kimi's next generation of models is heading.
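To illustrate why linear attention is attractive for long-running workloads, here is a minimal sketch of generic causal linear attention in its recurrent form. It is an assumption-laden toy, not Kimi's architecture: it uses an identity feature map and no normalization or gating, and the function name `linear_attention` is hypothetical. The point it shows is that the running state has a fixed size no matter how many tokens have been processed.

```python
def linear_attention(queries, keys, values):
    """Causal linear attention in recurrent form (unnormalized).

    Each output equals sum over s <= t of (q_t . k_s) * v_s, computed by
    maintaining a d_k x d_v state matrix instead of a growing key/value cache.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # Fold the new token into the state: state += outer(k, v).
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        # Read out with the query: out = q^T state.
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs, state
```

With standard softmax attention, per-token cost and cache size grow with sequence length. In this recurrent form both stay constant, which is the general property that makes the approach appealing for agents that run over very long contexts.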

Watch the full on-demand session to follow the Muon optimizer breakdown, the Day 0 co-design methodology, and Kimi's linear attention roadmap.

View the full update on nvidia.com

Kimi.ai (@Kimi_Moonshot) · Mar 20

Zhilin's full GTC 2026 keynote is here. If you're curious about the "how" behind scaling Kimi’s latest models, this is the session you can't miss. :) https://t.co/rRgPzau6e5


