HeadsUpAI

NVIDIA Releases Nemotron-Labs-Diffusion for 6x Faster Parallel Token Generation

NVIDIA released Nemotron-Labs-Diffusion, a family of "tri-mode" models that unify standard autoregressive, diffusion, and self-speculation decoding. These models, which range from 3B to 14B parameters, generate multiple tokens simultaneously by adjusting their attention pattern at inference time, that is, when the trained model is run to produce output.
Model scales: 3B, 8B, and 14B parameters
Decoding modes: Autoregressive, Diffusion, and Self-Speculation
Throughput gain: 4x higher on NVIDIA GB200
Tokens per forward pass: 6x vs Qwen3-8B
Variants: Base, Instruct, and Vision-Language
License: NVIDIA Open Model License

Standard autoregressive models are memory-bound during decoding: every generated token requires reading the model weights from memory, so speed is limited by memory bandwidth rather than by arithmetic. This release shifts generation toward a compute-bound regime, making better use of GPUs such as those built on the Blackwell architecture. By using diffusion to draft tokens and autoregressive logic to verify them, NVIDIA achieves higher acceptance lengths than existing multi-token prediction methods.
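The memory-bound limit can be illustrated with a rough back-of-the-envelope sketch. It assumes that each forward pass reads every weight once and that this read dominates the pass time. The function name and all input values (8B parameters, 2 bytes per parameter, 4 TB/s of memory bandwidth) are illustrative assumptions, not figures from NVIDIA or hardware specifications.

```python
def decode_tokens_per_second(params_billion, bytes_per_param,
                             bandwidth_tb_s, tokens_per_pass):
    """Upper bound on single-stream decode speed when weight reads dominate."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    # Time to stream all weights from memory once (one forward pass).
    seconds_per_pass = weight_bytes / (bandwidth_tb_s * 1e12)
    return tokens_per_pass / seconds_per_pass


# Hypothetical inputs: 8B parameters, 2 bytes each, 4 TB/s memory bandwidth.
one_token = decode_tokens_per_second(8, 2, 4.0, tokens_per_pass=1)
six_tokens = decode_tokens_per_second(8, 2, 4.0, tokens_per_pass=6)

print(f"{one_token:.0f} tokens/s at 1 token per pass")
print(f"{six_tokens:.0f} tokens/s at 6 tokens per pass")
```

Under these assumptions the bound scales linearly with tokens per pass, because the weights are read once no matter how many tokens the pass produces. In practice the gain is smaller: the extra tokens add compute and cache traffic, which is the sense in which generation moves toward a compute-bound regime.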

Base, instruct, and vision-language variants are available on Hugging Face under the NVIDIA Open Model License. They are compatible with the SGLang server and NVIDIA Dynamo. For developers building agents, these models maintain the accuracy of Nemotron 3 Super while delivering nearly 6x more tokens per forward pass.

NVIDIA AI (@NVIDIAAI) on X:

Most language models only generate one token at a time. We just released Nemotron-Labs-Diffusion, a family of diffusion language models that take a different approach, generating multiple tokens in parallel within a single model. Rather than committing to each token permanently, these models can revise as they go, resulting in faster inference that better utilizes modern GPUs. The full model family ranges from 3B to 14B, including vision-language variants. Available now: https://t.co/L1Tp2aQDLJ

Still wondering? A few quick answers below.

Nemotron-Labs-Diffusion is a family of tri-mode language models that unifies three different decoding methods into a single architecture. It supports standard sequential generation, parallel diffusion-based generation, and a hybrid self-speculation mode. These models range from 3B to 14B parameters and include base, instruct, and vision-language variants designed for high-efficiency inference on modern hardware.

The model switches between three modes by changing its attention pattern. In autoregressive mode, it generates tokens sequentially. In diffusion mode, it generates multiple tokens in parallel. In self-speculation mode, the model uses its diffusion pathway to draft tokens and its autoregressive pathway to verify them, increasing the tokens produced per forward pass.
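To make "changing its attention pattern" concrete, here is a minimal sketch of two attention masks. The source does not specify the masks NVIDIA uses. The block-wise pattern below (causal across blocks, bidirectional inside a block) is one pattern found in block diffusion language models and is an assumption here, as are the function names.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_mask(n, block_size):
    """Causal across blocks, bidirectional inside each block (assumed pattern)."""
    return [[(j // block_size) <= (i // block_size) for j in range(n)]
            for i in range(n)]


def show(mask):
    """Render a mask as text: 'x' means attention is allowed, '.' means blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


print(show(causal_mask(4)))   # autoregressive: each token sees only the past
print()
print(show(block_mask(4, 2)))  # parallel: tokens in a block see each other
```

With a block size of 1 the block mask reduces to the causal mask, which shows how a single set of weights could in principle serve both sequential and parallel decoding by swapping only the mask.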

Most language models are limited to generating one token at a time, which often under-utilizes GPU resources. Nemotron-Labs-Diffusion uses a joint training objective that enables parallel token generation. This approach allows the 8B model to decode six times more tokens per forward pass than standard models like Qwen3-8B, resulting in up to four times higher system throughput.

NVIDIA has released the Nemotron-Labs-Diffusion model family as open-weight models available on Hugging Face. The models are released under the NVIDIA Nemotron Open Model License, allowing developers to download and run them. The release also includes the training and inference pipeline through Megatron Bridge, so the models can be integrated into existing AI development workflows and research projects.

These models improve inference efficiency by moving from memory-bound to compute-bound generation. On NVIDIA GB200 hardware, they run up to 3.3 times faster than standard autoregressive models. By generating and verifying multiple tokens in a single pass, the system reduces the time and cost required for complex tasks like coding, mathematical reasoning, and long-form content generation.
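The draft-and-verify step can be sketched with the greedy acceptance rule used in generic speculative decoding. The source does not describe NVIDIA's actual acceptance rule, so the function below and its name are hypothetical. Here `verified[i]` stands for the token the verifying pathway predicts at position i, given the context plus the first i draft tokens.

```python
def accept_draft(draft, verified):
    """Return the tokens kept from one draft-and-verify pass (greedy rule)."""
    kept = []
    for proposed, checked in zip(draft, verified):
        if proposed != checked:
            # First disagreement: keep the verifier's token and stop.
            kept.append(checked)
            return kept
        kept.append(proposed)
    # Every draft token matched; the verifier may supply one extra token.
    if len(verified) > len(draft):
        kept.append(verified[len(draft)])
    return kept


draft = ["the", "cat", "sat", "on"]
verified = ["the", "cat", "lay", "on", "a"]
print(accept_draft(draft, verified))  # ['the', 'cat', 'lay']
```

Each pass always yields at least one token, so output quality follows the verifying pathway, while the number of tokens kept per pass (the acceptance length) determines the speedup.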
