RL post-training is hitting a rollout bottleneck. This new paper from #NVIDIAResearch shows how speculative decoding in NeMo-RL + @vllm_project can accelerate rollouts losslessly, with 1.8x higher throughput at 8B and projected 2.5x end-to-end speedup at 235B. Read the full paper: https://t.co/twR4LEQNmy
NVIDIA Accelerates Reasoning Model Training With Speculative Decoding Rollouts
NVIDIA Research introduced a system-integrated approach to speculative decoding for reinforcement learning post-training. In speculative decoding, a small draft model proposes several tokens and the larger target model verifies them. By implementing this within the NeMo-RL framework with a vLLM backend, the researchers accelerate the "rollout" phase, where a model generates responses for evaluation.
- Throughput gain (8B): 1.8x
- Projected speedup (235B): 2.5x
- Framework: NeMo-RL
- Backend: vLLM
- Acceleration type: Lossless
This update follows NVIDIA NeMo RL's FP8 precision support as the company targets the primary bottleneck in developing reasoning models. Generating autoregressive rollouts typically consumes most of the training time, and this approach speeds up that phase losslessly: it preserves the model's exact output distribution while delivering a 1.8x throughput gain at the 8B scale.
You can apply these speedups in your own training pipelines via the open-source NeMo-RL repository. The system supports various speculation mechanisms and integrates with vLLM's DeepSeek V4 support to handle both synchronous and asynchronous pipelines. Simulations project up to a 2.5x end-to-end speedup at the 235B scale.
NVIDIA AI
@NVIDIAAI
Still wondering? A few quick answers below.
RL post-training rollout acceleration is a method to speed up the generation of model responses during reinforcement learning. In this phase, models must generate massive amounts of data to be evaluated and used for weight updates. NVIDIA uses speculative decoding to make this token-by-token generation process faster, removing a major bottleneck in training complex reasoning models.
Speculative decoding in NeMo-RL works by using a smaller draft model or specialized heads to propose multiple future tokens. A larger target model then verifies these proposals in a single parallel step. By integrating this into the vLLM backend, the system generates rollouts faster while ensuring the final output follows exactly the distribution the target model would have produced alone.
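To make the draft-then-verify idea concrete, here is a minimal sketch of the standard speculative sampling acceptance rule for a single token over a toy vocabulary. It is an illustration only: the function names and the toy distributions are assumptions made up for this example, and the source does not state that NeMo-RL or vLLM implement verification in exactly this form. A real system verifies several drafted tokens per target forward pass.

```python
import random


def speculative_step(p_target, q_draft, rng):
    """Emit one token index using a draft proposal plus target verification.

    p_target and q_draft are probability lists over the same toy vocabulary
    (hypothetical inputs for this sketch).
    """
    vocab = list(range(len(p_target)))
    # The draft model proposes a token from its own distribution q.
    x = rng.choices(vocab, weights=q_draft)[0]
    # The target accepts the proposal with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x
    # On rejection, resample from the residual max(0, p - q), renormalized
    # implicitly by rng.choices.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return rng.choices(vocab, weights=residual)[0]


def output_distribution(p_target, q_draft):
    """Exact distribution of the token emitted by speculative_step."""
    accepted = [min(p, q) for p, q in zip(p_target, q_draft)]
    reject_mass = 1.0 - sum(accepted)
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    total = sum(residual)
    if total == 0.0:
        # Draft and target agree everywhere, so nothing is ever rejected.
        return accepted
    return [a + reject_mass * r / total for a, r in zip(accepted, residual)]


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # toy target distribution
    q = [0.2, 0.2, 0.6]  # toy draft distribution, deliberately mismatched
    print(output_distribution(p, q))
```

Even with a badly mismatched draft, the emitted token follows the target distribution; a poor draft only lowers the acceptance rate, and with it the speedup. That is the sense in which this kind of acceleration is lossless.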
The implementation is available through the NVIDIA NeMo-RL repository on GitHub. It uses a vLLM backend to support both synchronous and asynchronous reinforcement learning pipelines. Developers can access the code to apply these acceleration techniques to their own training workloads, with support for speculation mechanisms such as Multi-Token Prediction heads or external draft models like Eagle3.
NVIDIA Research demonstrated a 1.8x increase in rollout throughput for models at the 8B parameter scale. For larger frontier models, high-fidelity simulations project up to a 2.5x end-to-end training speedup at the 235B scale when combining speculative decoding with asynchronous reinforcement learning. These gains are achieved losslessly, meaning the quality and distribution of the model's outputs remain unchanged.
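A rollout throughput gain does not translate one-to-one into end-to-end training speedup, because only the rollout share of each training step gets faster. The sketch below uses a simple Amdahl's-law estimate to show the relationship. The rollout fractions are hypothetical values chosen for illustration, not figures from the paper, and this simple model does not capture the asynchronous overlap behind the 2.5x projection at 235B.

```python
def end_to_end_speedup(rollout_fraction, rollout_speedup):
    """Amdahl's-law estimate of the overall training-step speedup.

    rollout_fraction: share of step time spent generating rollouts (0 to 1),
                      an assumed input rather than a measured value.
    rollout_speedup:  throughput gain applied to the rollout phase only.
    """
    if not 0.0 <= rollout_fraction <= 1.0:
        raise ValueError("rollout_fraction must be between 0 and 1")
    if rollout_speedup <= 0.0:
        raise ValueError("rollout_speedup must be positive")
    # Non-rollout time is unchanged; rollout time shrinks by the speedup.
    return 1.0 / ((1.0 - rollout_fraction) + rollout_fraction / rollout_speedup)


if __name__ == "__main__":
    # Hypothetical rollout shares paired with the reported 1.8x rollout gain.
    for fraction in (0.5, 0.7, 0.9):
        estimate = end_to_end_speedup(fraction, 1.8)
        print(f"rollout share {fraction:.0%}: about {estimate:.2f}x end to end")
```

Under this model the end-to-end gain approaches the rollout gain only as rollouts come to dominate the step time, which is why the rollout bottleneck matters most for large reasoning models.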
Unlike many existing efficiency methods that improve throughput by changing the optimization regime or using lower-precision generation, this approach is a lossless acceleration primitive. It preserves the target model's exact output distribution rather than relying on off-policy execution or replay. This ensures that the training trajectory remains identical to standard autoregressive generation while significantly reducing the time required for rollouts.






