NVIDIA NeMo RL Accelerates Reasoning Model Training With End to End FP8 Precision

NVIDIA

NVIDIA NeMo RL now supports end-to-end FP8 precision for reinforcement learning, enabling faster iterations for reasoning-grade models. By using importance sampling to maintain accuracy parity with high-precision training, the update delivers up to a 1.48x speedup on models like Qwen3. This shift makes the compute-intensive process of building agentic reasoning capabilities significantly more efficient for developers.

NVIDIA has updated NeMo RL, its open-source library for reinforcement learning on language models, to support end-to-end FP8 precision. Low-precision arithmetic typically introduces numerical error during training; the new recipe uses importance sampling to match the accuracy of standard BF16 training while delivering up to a 1.48x speedup on Qwen3-8B-Base.

Speedup (Qwen3-8B-Base): 1.48x
Throughput increase (dense models): 15% to 25%
Overall speedup (linear + KV cache + attention): ~48%
Calibration overhead: 2% to 3% of total step time
Precision support: End-to-end FP8 (E4M3)
Supported models: Llama 3.1, Qwen3, Moonlight

Building on NVIDIA's focus on inference-time compute, developers can now use FP8 to iterate faster on agentic tool use and multi-step workflows without paying the full compute cost of high-precision training. The gain matters most for teams training custom reasoning models, where each RL iteration is expensive.

You can now enable FP8 for linear layers, KV cache, and attention within the NVIDIA NeMo framework. The system handles dynamic recalibration, updating quantization scales at every training step to maintain stability. These recipes are available as open-source configurations on GitHub, supporting models like Llama 3.1 and Qwen3 on Blackwell and Hopper GPUs.

NVIDIA AI (@NVIDIAAI) on X:

Improve agentic performance with accurate RL post-training on low-precision FP8. 🛠️ NVIDIA NeMo RL, an open-source library within NVIDIA NeMo, supports FP8 to speed up RL workloads by 1.48x on Qwen3-8B-Base—enabling faster iterations for agentic tool use and multi-step workflows. Read ➡️ https://t.co/EjY83vFdNA

Still wondering? A few quick answers below.

NVIDIA NeMo RL is an open-source library within the NVIDIA NeMo framework designed for the reinforcement learning phase of model development. It provides tools and recipes for algorithms like Group Relative Policy Optimization to help developers turn base language models into reasoning-grade agents that can handle complex multi-step tasks and tool use.

To prevent accuracy loss during low-precision training, NeMo RL uses an end-to-end FP8 recipe combined with importance sampling. This technique applies a per-token weight to the loss function, correcting distribution mismatches between the generation and training phases. This approach allows the system to match the validation accuracy of high-precision BF16 training while running significantly faster.
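The per-token correction can be sketched in a few lines. This is a minimal illustration under stated assumptions, not NeMo RL's implementation: the function names, the truncation cap `clip`, and the plain-list inputs are all invented for the example. Each token's loss is multiplied by the ratio of the probability the training policy assigns to that token to the probability the FP8 generation engine assigned when it sampled it.

```python
import math

def importance_weights(train_logprobs, gen_logprobs, clip=2.0):
    """Per-token weight p_train(token) / p_gen(token), capped at `clip`.

    `clip` is an assumed safeguard against very large ratios; the source
    does not say whether or how NeMo RL truncates the weights.
    """
    return [min(math.exp(t - g), clip)
            for t, g in zip(train_logprobs, gen_logprobs)]

def weighted_loss(token_losses, train_logprobs, gen_logprobs, clip=2.0):
    """Mean of per-token losses, each scaled by its importance weight."""
    weights = importance_weights(train_logprobs, gen_logprobs, clip)
    return sum(w * l for w, l in zip(weights, token_losses)) / len(token_losses)

# Identical log-probabilities give weights of 1.0, so the loss is unchanged.
same = weighted_loss([0.5, 1.5], [-1.0, -2.0], [-1.0, -2.0])

# A token the generator sampled more readily than the trainer would
# (higher generation probability) is down-weighted.
down = importance_weights([-2.0], [-1.0])
```

When the generation and training engines agree exactly, every weight is 1 and the loss reduces to the ordinary mean; the correction only acts where quantization has shifted the sampling distribution.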

Using FP8 precision in NeMo RL delivers a 1.48x speedup on models like Qwen3-8B-Base. For dense models like Llama 3.1 8B, it provides a 15% to 25% throughput increase. When developers enable FP8 for linear layers, KV cache, and attention simultaneously, the overall rollout performance can improve by approximately 48% compared to standard BF16 baselines.
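To put these figures on a common footing, the short calculation below converts a speedup factor into the share of time saved. It assumes that "speedup" means a throughput ratio against the BF16 baseline, which is an interpretation made here and is not spelled out in the source; the helper names are illustrative.

```python
def step_time_fraction(speedup):
    """Fraction of baseline time needed at a given throughput speedup."""
    return 1.0 / speedup

def time_saved_percent(speedup):
    """Percentage of baseline time saved at a given throughput speedup."""
    return (1.0 - step_time_fraction(speedup)) * 100.0

for label, speedup in [("1.48x (Qwen3-8B-Base)", 1.48),
                       ("+15% throughput", 1.15),
                       ("+25% throughput", 1.25)]:
    print(f"{label}: {time_saved_percent(speedup):.1f}% less time for the same work")
```

Under that reading, a 1.48x speedup and a 48% throughput gain are the same quantity, and both correspond to roughly a third less wall-clock time for the same amount of work.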

NeMo RL uses a dynamic recalibration process to manage FP8 for KV cache and attention. Because model weights change at every training step, the system recomputes the optimal quantization scales at the end of each iteration. These updated scales are then synchronized with the inference engine for the next rollout phase, ensuring minimal accuracy degradation during generation.
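The recalibrate-then-sync loop can be sketched as follows. This is a toy illustration, not NeMo RL's code: it assumes simple per-tensor scaling based on the largest observed magnitude (the source says only that "optimal" scales are recomputed), and `RolloutEngine`, `load_scales`, and the sample values are hypothetical. The one fixed fact it relies on is that the largest finite magnitude in FP8 E4M3 is 448.

```python
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def compute_scale(values, eps=1e-12):
    """Per-tensor scale mapping the largest observed magnitude to E4M3_MAX."""
    amax = max(abs(v) for v in values)
    return max(amax, eps) / E4M3_MAX

def recalibrate(observed):
    """Recompute one scale per named tensor from freshly observed values."""
    return {name: compute_scale(vals) for name, vals in observed.items()}

class RolloutEngine:
    """Hypothetical stand-in for the inference engine that consumes scales."""
    def __init__(self):
        self.scales = {}

    def load_scales(self, scales):
        self.scales = dict(scales)

engine = RolloutEngine()
# Made-up values observed after two successive weight updates.
steps = [
    {"kv_cache": [0.5, -224.0], "attention": [1.0, -3.5]},
    {"kv_cache": [0.5, -896.0], "attention": [2.0, -7.0]},
]
for observed in steps:
    # ... the policy update for this step would happen here ...
    engine.load_scales(recalibrate(observed))  # sync before the next rollout
```

Because the value range shifts as the weights change, a scale computed at one step can clip or waste precision at the next, which is why the scales are refreshed every iteration and not fixed once up front.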

The FP8 recipes and configurations are open source and available through the NVIDIA NeMo RL GitHub repository. Developers can access configuration files and example recipes for models like Llama 3.1 8B and Qwen3. The library is designed to work with training backends like Megatron Core and inference engines like vLLM.

Every HeadsUpAI update is written based on its original source and reviewed before it's published.