NVIDIA Releases Nemotron-Labs-Diffusion for 6x Faster Parallel Token Generation

NVIDIA

May 20, 2026 · Updated May 28, 2026

NVIDIA released Nemotron-Labs-Diffusion, a family of open-weight models that unify standard autoregressive decoding with parallel diffusion-based generation. By switching attention patterns within a single model, these 3B to 14B parameter models achieve up to 4x higher throughput on modern hardware compared to traditional sequential generation.

The family is "tri-mode": each model supports standard autoregressive, diffusion, and self-speculation decoding. The 3B to 14B parameter models generate multiple tokens simultaneously by adjusting their attention pattern at inference time.

Model scales: 3B, 8B, and 14B parameters
Decoding modes: Autoregressive, Diffusion, and Self-Speculation
Throughput gain: up to 4x on NVIDIA GB200
Tokens per forward pass: 6x vs Qwen3-8B
Variants: Base, Instruct, and Vision-Language
License: NVIDIA Open Model License

Standard autoregressive decoding is memory-bound: each generated token requires reading the model weights from memory, so memory bandwidth, not compute, limits speed. This release shifts generation toward a compute-bound regime that makes better use of GPUs such as those built on the Blackwell architecture. By drafting tokens with diffusion and verifying them autoregressively, NVIDIA reports higher acceptance lengths than existing multi-token prediction methods.
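The draft-and-verify idea can be illustrated with a toy greedy acceptance rule. This is a minimal sketch under stated assumptions, not NVIDIA's implementation: the function name `accept_prefix` and the token lists are hypothetical, and a real system may use a probabilistic acceptance test rather than exact matching.

```python
from typing import List, Sequence


def accept_prefix(draft: Sequence[int], verified: Sequence[int]) -> List[int]:
    """Greedy verification for one draft block.

    `draft` holds tokens proposed in parallel (the diffusion pathway in the
    article). `verified` holds the tokens the autoregressive pathway would
    choose at the same positions, plus one extra token for the position
    after the draft. Both are assumed to come from one forward pass.
    """
    if len(verified) != len(draft) + 1:
        raise ValueError("verified must have exactly one more token than draft")

    accepted: List[int] = []
    for proposed, checked in zip(draft, verified):
        if proposed != checked:
            # First disagreement: keep the verifier's token and stop.
            accepted.append(checked)
            return accepted
        accepted.append(proposed)

    # Every draft token matched, so the extra verifier token is kept too.
    accepted.append(verified[-1])
    return accepted


if __name__ == "__main__":
    # Two draft tokens match, the third does not: three tokens from one pass.
    print(accept_prefix([5, 9, 2, 7], [5, 9, 4, 1, 3]))
```

The number of tokens returned per call is the acceptance length for that pass. A draft that agrees with the verifier more often yields more tokens per forward pass, which is the quantity the article compares against multi-token prediction methods.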

Base, instruct, and vision-language variants are available on Hugging Face under the NVIDIA Open Model License. They are compatible with the SGLang server and NVIDIA Dynamo. For developers building agents, these models maintain the accuracy of Nemotron 3 Super while delivering nearly 6x more tokens per forward pass.

View the full update on huggingface.co

NVIDIA AI

@NVIDIAAI · May 19

Most language models only generate one token at a time. We just released Nemotron-Labs-Diffusion, a family of diffusion language models that take a different approach, generating multiple tokens in parallel within a single model. Rather than committing to each token permanently, these models can revise as they go, resulting in faster inference that better utilizes modern GPUs. The full model family ranges from 3B to 14B, including vision-language variants. Available now: https://t.co/L1Tp2aQDLJ

Still wondering? A few quick answers below.

What is NVIDIA Nemotron-Labs-Diffusion?

Nemotron-Labs-Diffusion is a family of tri-mode language models that unifies three decoding methods in a single architecture: standard sequential generation, parallel diffusion-based generation, and a hybrid self-speculation mode. The models range from 3B to 14B parameters and come in base, instruct, and vision-language variants designed for efficient inference on modern hardware.

How does the tri-mode decoding in Nemotron-Labs-Diffusion work?

The model switches between three modes by changing its attention pattern. In autoregressive mode, it generates tokens one at a time. In diffusion mode, it generates multiple tokens in parallel. In self-speculation mode, it drafts tokens with its diffusion pathway and verifies them with its autoregressive pathway, which increases the number of tokens produced per forward pass.
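One way to picture a change of attention pattern is as a change of attention mask. The sketch below assumes a causal mask for autoregressive mode and a block-wise mask for diffusion mode, in which positions inside the block being generated can see each other and all earlier context. The article does not specify the exact masks, so `causal_mask`, `block_mask`, and the block layout are illustrative assumptions.

```python
from typing import List

Mask = List[List[bool]]  # mask[q][k] is True when query q may attend to key k


def causal_mask(n: int) -> Mask:
    """Autoregressive pattern: each position sees itself and earlier ones."""
    return [[k <= q for k in range(n)] for q in range(n)]


def block_mask(n: int, block_size: int) -> Mask:
    """Assumed diffusion-style pattern: full attention inside a block,
    causal attention across blocks."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [[k // block_size <= q // block_size for k in range(n)] for q in range(n)]


def show(mask: Mask) -> str:
    """Render a mask as rows of 'x' (allowed) and '.' (blocked)."""
    return "\n".join("".join("x" if allowed else "." for allowed in row) for row in mask)


if __name__ == "__main__":
    print(show(causal_mask(4)))
    print()
    print(show(block_mask(4, block_size=2)))
```

With a block size of 1 the block-wise mask reduces to the causal mask, which is one way a single set of weights could serve both sequential and parallel decoding under this assumption.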

How is Nemotron-Labs-Diffusion different from standard language models?

Most language models generate one token at a time, which often under-utilizes GPU resources. Nemotron-Labs-Diffusion uses a joint training objective that enables parallel token generation. As a result, the 8B model decodes six times more tokens per forward pass than standard models like Qwen3-8B, giving up to four times higher system throughput.

Is Nemotron-Labs-Diffusion open source and where can I find it?

NVIDIA has released the Nemotron-Labs-Diffusion family as open-weight models on Hugging Face under the NVIDIA Nemotron Open Model License, so developers can download and run them. The release also includes the training and inference pipeline through Megatron Bridge, which makes the models easier to integrate into existing AI development workflows and research projects.

What are the performance benefits of using Nemotron-Labs-Diffusion?

These models improve inference efficiency by moving generation from a memory-bound to a compute-bound regime. On NVIDIA GB200 hardware, they run up to 3.3 times faster than standard autoregressive models. By generating and verifying multiple tokens in a single pass, the system reduces the time and cost of complex tasks like coding, mathematical reasoning, and long-form content generation.
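The memory-bound argument can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes one forward pass takes roughly the time needed to read all weights once, so producing more tokens per pass raises the decode rate proportionally. The helper names and every number are hypothetical placeholders, not GB200 specifications or measured results.

```python
def pass_seconds(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Idealized time for one forward pass when reading weights dominates."""
    return weight_bytes / bandwidth_bytes_per_s


def tokens_per_second(tokens_per_pass: float, seconds_per_pass: float) -> float:
    """Decode rate for a single sequence."""
    return tokens_per_pass / seconds_per_pass


# Hypothetical placeholders, not hardware specifications.
WEIGHT_BYTES = 8e9 * 2  # 8B parameters stored in 2 bytes each
BANDWIDTH = 8e12        # assumed memory bandwidth in bytes per second

step = pass_seconds(WEIGHT_BYTES, BANDWIDTH)
sequential = tokens_per_second(1, step)  # one token per pass
parallel = tokens_per_second(6, step)    # six tokens per pass

print(f"pass: {step * 1e3:.1f} ms")
print(f"sequential: {sequential:.0f} tok/s, parallel: {parallel:.0f} tok/s")
```

With these placeholder numbers a pass takes about 2 ms, giving roughly 500 tokens per second sequentially and 3,000 at six tokens per pass. This idealized bound ignores the extra compute of wider passes and other overheads, so real gains are smaller than the tokens-per-pass ratio.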

Keep reading

NVIDIA Ships Nemotron 3 Ultra for 5x Faster, Cheaper AI Agents

NVIDIA has shipped Nemotron 3 Ultra, a 550B Mixture-of-Experts (MoE) open model designed for long-running AI agents. This model delivers 5x faster inference and up to 30% lower cost for complex agentic tasks compared to other open frontier models, aiming to make autonomous workflows more efficient and accessible.

NVIDIA Nemotron 3 Ultra Claims Top US Open Weights Intelligence Spot

Artificial Analysis · Jun 1

NVIDIA released Nemotron 3 Ultra, a 550B-parameter model that leads US open-weights benchmarks with an intelligence score of 48. The model delivers high-throughput performance exceeding 300 tokens per second, significantly outpacing similarly sized frontier models from China.

Ollama Adds NVIDIA Nemotron 3 Ultra for Faster, Cheaper AI Agents

Ollama · Jun 7

Ollama has made NVIDIA's Nemotron 3 Ultra model available on its cloud. This 550 billion parameter Mixture of Experts (MoE) model is designed for long-running AI agents, delivering 5x faster inference and up to 30% lower costs for complex agentic tasks.

LangChain Adds NVIDIA Nemotron 3 Ultra for Faster AI Agents

LangChain · Jun 7

LangChain announced immediate support for NVIDIA Nemotron 3 Ultra, an open frontier model designed for long-running AI agents. This integration makes the model's 5x faster inference and up to 30% lower cost for complex agentic tasks directly available to developers using the LangChain framework.
