NVIDIA Accelerates Reasoning Model Training With Speculative Decoding Rollouts

NVIDIA

May 2, 2026 · Updated May 10, 2026

NVIDIA Research integrated speculative decoding into the NeMo-RL training framework to ease the bottleneck of autoregressive rollout generation. By using a vLLM backend to accelerate response generation during reinforcement learning, the system delivers up to 1.8x higher rollout throughput without altering the model's output distribution.

NVIDIA Research introduced a system-integrated approach to speculative decoding (a small draft model proposes tokens that the larger target model then verifies) for reinforcement learning post-training. With this implemented in the NeMo-RL framework on a vLLM backend, researchers can accelerate the "rollout" phase, in which a model generates responses for evaluation.

Throughput gain (8B): 1.8x
Projected speedup (235B): 2.5x
Framework: NeMo-RL
Backend: vLLM
Acceleration type: Lossless

This update follows NVIDIA NeMo RL's FP8 precision support as the company targets the primary bottleneck in developing reasoning models: autoregressive rollout generation, which typically consumes most of the training time. The approach is lossless. It preserves the model's exact output distribution while delivering a 1.8x rollout throughput gain at the 8B scale.

You can apply these speedups in your own training pipelines via the open-source NeMo-RL repository. The system supports various speculation mechanisms and integrates with vLLM's DeepSeek V4 support to handle both synchronous and asynchronous pipelines. Simulations project up to a 2.5x end-to-end speedup at the 235B scale when speculative decoding is combined with asynchronous reinforcement learning.

View the full update on arxiv.org

NVIDIA AI (@NVIDIAAI) · May 1

RL post-training is hitting a rollout bottleneck. This new paper from #NVIDIAResearch shows how speculative decoding in NeMo-RL + @vllm_project can accelerate rollouts losslessly, with 1.8x higher throughput at 8B and projected 2.5x end-to-end speedup at 235B. Read the full paper: https://t.co/twR4LEQNmy


Still wondering? A few quick answers below.

What is RL post-training rollout acceleration?

RL post-training rollout acceleration means speeding up the generation of model responses during reinforcement learning. In this phase, the model must generate large numbers of responses that are then evaluated and used for weight updates. NVIDIA uses speculative decoding to make this token-by-token generation faster, easing a major bottleneck in training complex reasoning models.

How does speculative decoding work in NeMo-RL?

Speculative decoding in NeMo-RL uses a smaller draft model or specialized heads to propose several future tokens ahead of the target model. The larger target model then verifies these proposals in a single parallel step. By integrating this into the vLLM backend, the system generates rollouts faster while ensuring the final output exactly matches what the target model would have produced alone.
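The draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not NeMo-RL or vLLM code: `target_next` and `draft_next` are hypothetical toy stand-ins for the large and small models, decoding is greedy, and the per-position verification loop stands in for what a real system does in one batched forward pass of the target model.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def target_next(ctx: List[Token]) -> Token:
    """Hypothetical 'large' model: a fixed deterministic rule over the context."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    """Hypothetical 'small' model: matches the target except when len(ctx) % 4 == 0."""
    guess = target_next(ctx)
    return (guess + 1) % 50 if len(ctx) % 4 == 0 else guess


def greedy_decode(model: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Plain autoregressive decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_greedy_decode(
    target: NextToken, draft: NextToken, prompt: List[Token], n_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Return the generated sequence and the number of target verification steps."""
    out = list(prompt)
    goal = len(prompt) + n_new
    verify_steps = 0
    while len(out) < goal:
        # 1) The draft model proposes k tokens autoregressively (cheap).
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) The target checks the proposals; counted here as one verification step.
        verify_steps += 1
        accepted: List[Token] = []
        for i in range(k + 1):
            expected = target(out + accepted)
            # Always keep the target's own token: it is either the confirmed
            # draft token, the correction at the first mismatch, or the bonus
            # token after all k proposals were accepted.
            accepted.append(expected)
            if i == k or proposal[i] != expected:
                break
        out.extend(accepted)
    return out[:goal], verify_steps


prompt = [1, 2, 3]
baseline = greedy_decode(target_next, prompt, 20)
speculative, steps = speculative_greedy_decode(target_next, draft_next, prompt, 20)
print("identical output:", speculative == baseline)
print("target verification steps:", steps, "vs 20 sequential target calls")
```

Because every emitted token is the target's own choice, the result is identical to plain greedy decoding; the saving comes from needing fewer sequential target steps whenever the draft guesses correctly.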

Is NVIDIA's RL speculative decoding implementation open source?

Yes, the implementation is available through the NVIDIA NeMo-RL repository on GitHub. It uses a vLLM backend to support both synchronous and asynchronous reinforcement learning pipelines. Developers can apply these acceleration techniques to their own training workloads using various speculation mechanisms, such as Multi-Token Prediction heads or external draft models like Eagle3.

What performance gains does speculative decoding provide for RL?

NVIDIA Research demonstrated a 1.8x increase in rollout throughput for models at the 8B parameter scale. For larger frontier models, high-fidelity simulations project up to a 2.5x end-to-end training speedup at the 235B scale when combining speculative decoding with asynchronous reinforcement learning. These gains are achieved losslessly, meaning the quality and distribution of the model's outputs remain unchanged.
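Rollout throughput and end-to-end training speed are different quantities, and Amdahl's law gives a rough way to relate them. The sketch below is back-of-the-envelope arithmetic, not a result from the paper: it assumes a simple synchronous training step, and the rollout fractions are hypothetical values chosen for illustration. It does not model the asynchronous pipeline behind the 2.5x projection.

```python
def end_to_end_speedup(rollout_fraction: float, rollout_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the rollout phase gets faster.

    rollout_fraction: share of a training step spent generating rollouts (0..1).
    rollout_speedup:  factor by which rollout generation is accelerated (> 0).
    """
    if not 0.0 <= rollout_fraction <= 1.0:
        raise ValueError("rollout_fraction must be between 0 and 1")
    if rollout_speedup <= 0.0:
        raise ValueError("rollout_speedup must be positive")
    remaining = (1.0 - rollout_fraction) + rollout_fraction / rollout_speedup
    return 1.0 / remaining


# Hypothetical rollout shares; 1.8x is the rollout gain reported at the 8B scale.
for fraction in (0.5, 0.7, 0.9):
    speedup = end_to_end_speedup(fraction, 1.8)
    print(f"rollouts = {fraction:.0%} of step time -> {speedup:.2f}x end to end")
```

The larger the share of time spent on rollouts, the closer the end-to-end gain gets to the rollout gain itself, which is why accelerating generation matters most for long reasoning traces.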

How is this method different from other RL efficiency techniques?

Unlike many existing efficiency methods that improve throughput by changing the optimization regime or using lower-precision generation, this approach is a lossless acceleration primitive. It preserves the target model's exact output distribution rather than relying on off-policy execution or replay. Rollouts are therefore distributed exactly as they would be under standard autoregressive generation, while the time required to produce them drops significantly.
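The "lossless" claim can be checked on a toy example. The sketch below uses the standard speculative sampling acceptance rule from the research literature, which is assumed here and not confirmed as the exact rule used in NeMo-RL: a token x drawn from the draft distribution q is accepted with probability min(1, p(x)/q(x)), and on rejection a replacement is drawn from the normalized residual max(p - q, 0). The distributions `p` and `q` are made-up values, and exact fractions are used so the comparison is not blurred by floating-point error.

```python
from fractions import Fraction
from typing import Dict

Dist = Dict[str, Fraction]


def emitted_distribution(p: Dist, q: Dist) -> Dist:
    """Exact distribution of the single token emitted by one accept/reject step."""
    zero = Fraction(0)
    # P(draft proposes x and it is accepted) = q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accepted = {x: min(p[x], q.get(x, zero)) for x in p}
    reject_prob = 1 - sum(accepted.values())
    # On rejection, resample from the part of p that q under-covers.
    residual = {x: max(p[x] - q.get(x, zero), zero) for x in p}
    residual_total = sum(residual.values())
    out: Dist = {}
    for x in p:
        resampled = reject_prob * residual[x] / residual_total if residual_total else zero
        out[x] = accepted[x] + resampled
    return out


# Hypothetical target (p) and draft (q) next-token distributions.
p = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}
q = {"a": Fraction(1, 5), "b": Fraction(1, 2), "c": Fraction(3, 10)}

result = emitted_distribution(p, q)
acceptance_rate = sum(min(p[x], q[x]) for x in p)
print("matches target distribution:", result == p)
print("draft acceptance rate:", acceptance_rate)
```

Even though the draft distribution here differs noticeably from the target, the emitted token follows the target distribution exactly; a weaker draft only lowers the acceptance rate, and with it the speedup.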


Keep reading

NVIDIA NeMo RL Accelerates Reasoning Model Training With End to End FP8 Precision

NVIDIA NeMo RL now supports end-to-end FP8 precision for reinforcement learning, enabling faster iterations for reasoning-grade models. By using importance sampling to maintain accuracy parity with high-precision training, the update delivers up to a 1.48x speedup on models like Qwen3. This shift makes the compute-intensive process of building agentic reasoning capabilities significantly more efficient for developers.

Qwen Sets 580 TPS Record for Agentic Workloads on Blackwell GPUs

Qwen · May 27

Qwen achieved a record 580 tokens per second running its Qwen3.5-397B-A17B model on NVIDIA Blackwell GPUs using the TokenSpeed inference engine. The optimization targets agentic workloads, where multi-turn reasoning and tool-calling typically suffer from high latency. By combining a hybrid attention architecture with deep kernel fusion, the system maintains high throughput even as context scales to one million tokens.

LangChain Adds NVIDIA Nemotron 3 Ultra for Faster AI Agents

LangChain · Jun 7

LangChain announced immediate support for NVIDIA Nemotron 3 Ultra, an open frontier model designed for long-running AI agents. This integration makes the model's 5x faster inference and up to 30% lower cost for complex agentic tasks directly available to developers using the LangChain framework.

