RL post-training is hitting a rollout bottleneck. This new paper from #NVIDIAResearch shows how speculative decoding in NeMo-RL + @vllm_project can accelerate rollouts losslessly, with 1.8x higher throughput at 8B and projected 2.5x end-to-end speedup at 235B. Read the full paper: https://t.co/twR4LEQNmy
NVIDIA Accelerates Reasoning Model Training With Speculative Decoding Rollouts
NVIDIA · Updated
NVIDIA Research integrated speculative decoding into the NeMo-RL training framework to reduce the bottleneck of autoregressive rollout generation. By accelerating response generation during reinforcement learning through a vLLM backend, the system delivers up to 1.8x higher rollout throughput at the 8B scale without altering the model's output distribution.
- Throughput gain (8B): 1.8x
- Projected speedup (235B): 2.5x
- Framework: NeMo-RL
- Backend: vLLM
- Acceleration type: Lossless
This update follows FP8 precision support in NeMo-RL, as the company targets the primary bottleneck in developing reasoning models: autoregressive rollout generation, which typically consumes most of the training time. The approach is lossless, meaning it preserves the model's exact output distribution while delivering a 1.8x rollout throughput gain at the 8B scale.
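To make "lossless" concrete, here is a minimal sketch of the standard speculative sampling accept/reject rule for a single token. It is an illustration only: the article says the system supports various speculation mechanisms and does not specify which rule each one uses. The function names and the toy three-token vocabulary are hypothetical. `p` stands for the target model's next-token distribution and `q` for the draft model's.

```python
import random


def residual_distribution(p, q):
    """Normalized max(0, p - q): the distribution to resample from after a rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:
        # p and q are identical, so a rejection cannot occur; fall back to p.
        return list(p)
    return [r / total for r in raw]


def speculative_sample(p, q, rng):
    """Return one token index distributed exactly as p, starting from a draft drawn from q."""
    vocab = range(len(p))
    drafted = rng.choices(vocab, weights=q)[0]
    # Accept the drafted token with probability min(1, p/q).
    accept_prob = min(1.0, p[drafted] / q[drafted])
    if rng.random() < accept_prob:
        return drafted
    # On rejection, resample from the part of p that q under-covers.
    return rng.choices(vocab, weights=residual_distribution(p, q))[0]


def output_distribution(p, q):
    """Distribution of speculative_sample, computed analytically as a check."""
    reject_mass = sum(
        qi * (1.0 - min(1.0, pi / qi)) for pi, qi in zip(p, q) if qi > 0
    )
    residual = residual_distribution(p, q)
    return [min(pi, qi) + reject_mass * ri for pi, qi, ri in zip(p, q, residual)]
```

Calling `output_distribution(p, q)` with any valid pair returns `p` up to floating-point error. That is the sense in which the draft model changes speed but not the output: a poor draft lowers the acceptance rate and leaves the sampled distribution unchanged.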
You can apply these speedups in your own training pipelines via the open-source NeMo-RL repository. The system supports various speculation mechanisms and integrates with vLLM's DeepSeek V4 support to handle both synchronous and asynchronous pipelines. Projections suggest a 2.5x end-to-end speedup at the 235B scale.
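A rollout throughput gain and an end-to-end training speedup are different quantities, because only the rollout phase is accelerated. The back-of-envelope sketch below relates the two with Amdahl's law. It is not the method behind the paper's projection, which the article does not describe. The function name and the 70% rollout share in the usage line are assumptions chosen for illustration.

```python
def end_to_end_speedup(rollout_fraction, rollout_speedup):
    """Amdahl's law: overall speedup when only the rollout phase gets faster.

    rollout_fraction: share of baseline step time spent generating rollouts (0..1).
    rollout_speedup:  factor by which rollout generation is accelerated (> 0).
    """
    if not 0.0 <= rollout_fraction <= 1.0:
        raise ValueError("rollout_fraction must be between 0 and 1")
    if rollout_speedup <= 0:
        raise ValueError("rollout_speedup must be positive")
    remaining = (1.0 - rollout_fraction) + rollout_fraction / rollout_speedup
    return 1.0 / remaining


# Hypothetical usage: rollouts take 70% of each step and run 1.8x faster.
print(round(end_to_end_speedup(0.7, 1.8), 2))
```

Under this model, the larger the share of step time spent on rollouts, the closer the end-to-end figure gets to the rollout speedup itself.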



