NVIDIA Megatron Core Adds Muon Support to Accelerate Frontier Model Training

NVIDIA

May 5, 2026

NVIDIA has integrated higher-order optimizers such as Muon into its Megatron Core framework to improve training efficiency for 30B-scale models. The update is aimed at labs that want to go beyond standard data-parallel techniques and get more out of Blackwell-class hardware when training the next generation of reasoning models.

NVIDIA updated Megatron Core, its open-source training framework, to support the emerging higher-order optimizer Muon alongside the research optimizers MOP and REKLS. These algorithms go beyond standard first-order optimization, which adjusts model weights to reduce error using only the gradient of the loss, in order to improve learning efficiency. NVIDIA describes the support as end-to-end.
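
To make the term concrete, here is a minimal sketch of first-order optimization: gradient descent with momentum on a toy one-dimensional loss. The loss function, learning rate, and momentum value are illustrative assumptions, and none of this is Megatron Core code.

```python
# Toy first-order optimizer: gradient descent with momentum on f(w) = (w - 3)^2.
# All names and hyperparameters are hypothetical and chosen only for illustration.

def grad(w: float) -> float:
    """Gradient of the toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def sgd_momentum(w: float, steps: int, lr: float = 0.1, beta: float = 0.9) -> float:
    """Run `steps` momentum updates and return the final weight."""
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad(w)  # accumulate gradient history
        w = w - lr * velocity                 # step against the accumulated gradient
    return w


final_w = sgd_momentum(w=0.0, steps=200)
print(f"final weight: {final_w:.4f}")  # approaches the minimizer w = 3
```

The update uses only the gradient and its running average. Optimizers such as Muon add a further transformation of that update before it is applied to the weights.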

The move addresses scaling challenges for 30B-scale models and sits alongside NVIDIA's FP8 training support for Qwen3 and the token learning efficiency demonstrated by Kimi K2. Where standard methods hit efficiency ceilings, higher-order optimizers like Muon increase token learning efficiency, that is, how much a model learns from each training token. The same focus on efficiency is reflected in the leaderboard ranking of NVIDIA's Nemotron 3 Super on enterprise benchmarks.

You can now use these optimizers directly within the Megatron Core workflow to reduce the compute required for frontier-scale training. The support is tuned for the Blackwell architecture, delivering throughput near parity with traditional methods while converging faster. The updated framework is available via the official NVIDIA developer portal.

View the full update on developer.nvidia.com

NVIDIA AI

@NVIDIAAI, May 4

Training Kimi K2 and Qwen3 30B-scale models efficiently requires more than standard data-parallel tricks. NVIDIA Megatron Core now provides end-to-end support for emerging higher-order optimizers like Muon, alongside research optimizers such as MOP and REKLS, to push training efficiency on GB300 GPUs and NVL72 systems. Full breakdown 👇 https://t.co/D7E55OnCiK

Still wondering? A few quick answers below.

What is the Muon optimizer in NVIDIA Megatron Core?

Muon is an emerging higher-order optimizer designed to make large language model training more efficient. Instead of applying the momentum update directly, as standard first-order methods do, Muon orthogonalizes the momentum update of each weight matrix, which improves how much the model learns from each token. NVIDIA now provides end-to-end support for Muon within the Megatron Core framework to help developers reach convergence faster.
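
The core idea behind a Muon-style update, orthogonalizing a momentum matrix with a Newton-Schulz iteration, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Megatron Core implementation: it uses the classical cubic Newton-Schulz iteration rather than the tuned polynomial found in public Muon implementations, it omits Nesterov momentum and per-shape scaling, and every function name and hyperparameter below is hypothetical.

```python
# Simplified Muon-style step in pure Python (hypothetical names, illustrative only).
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def orthogonalize(g: Matrix, steps: int = 30) -> Matrix:
    """Approximate the nearest semi-orthogonal matrix to g.

    Uses the classical cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X*X^T*X,
    which pushes every nonzero singular value of X toward 1.
    """
    norm = sum(v * v for row in g for v in row) ** 0.5
    # Normalize by the Frobenius norm so all singular values start at or below 1.
    x = [[v / (norm + 1e-12) for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxt_x)]
    return x


def muon_like_step(weight: Matrix, grad: Matrix, momentum: Matrix,
                   lr: float = 0.02, beta: float = 0.95) -> Tuple[Matrix, Matrix]:
    """One update: accumulate momentum, orthogonalize it, then apply it."""
    momentum = [[beta * m + g for m, g in zip(mr, gr)]
                for mr, gr in zip(momentum, grad)]
    update = orthogonalize(momentum)
    weight = [[w - lr * u for w, u in zip(wr, ur)]
              for wr, ur in zip(weight, update)]
    return weight, momentum


# Toy 2x3 gradient whose rows have very different magnitudes.
grad = [[3.0, 0.0, 4.0], [0.0, 2.0, 0.0]]
weight = [[0.0] * 3 for _ in range(2)]
momentum = [[0.0] * 3 for _ in range(2)]
weight, momentum = muon_like_step(weight, grad, momentum, lr=0.1)
print(weight)
```

In the toy gradient the first row is 2.5 times larger than the second, yet after orthogonalization both rows contribute an update of unit length. Equalizing the scale of update directions in this way is the intuition usually given for why this family of optimizers can extract more from each training step.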

Which hardware systems are optimized for these new optimizers?

The Megatron Core support for the new optimizers, including Muon, MOP, and REKLS, is tuned to push training efficiency on the NVIDIA Blackwell architecture, including GB300 GPUs and NVL72 systems. Integrating these optimizers into Megatron Core is intended to help NVIDIA's latest hardware handle the compute requirements of frontier-scale models more effectively.

How do higher-order optimizers differ from standard training methods?

Standard training often relies on first-order optimization combined with data-parallel scaling, which can hit efficiency limits as models grow. Higher-order optimizers like Muon, MOP, and REKLS apply more involved mathematics to each update rather than using the gradient alone, with the goal of improving learning efficiency. This lets models reach a given performance level with less compute, making the training of 30B-scale models more efficient.

Which AI models benefit from the Megatron Core optimizer update?

NVIDIA highlights the training of Kimi K2 and Qwen3 30B-scale models as primary use cases for these optimizers. Models of this class need advanced optimization to maintain high throughput and learning efficiency. The end-to-end support in Megatron Core lets researchers apply these algorithms to large-scale Mixture-of-Experts and dense transformer architectures.

Is the support for Muon and REKLS open source?

Yes. Support for these emerging optimizers is integrated into NVIDIA Megatron Core, a performant and scalable open-source training stack. Developers can access it through the Megatron-LM repository on GitHub or via the NVIDIA developer portal, so the broader AI research community can apply these training techniques on their own Blackwell-based infrastructure.
