NVIDIA NeMo RL Accelerates Reasoning Model Training With End-to-End FP8 Precision

LLM
AI Research
AI Hardware
Performance
Qwen

NVIDIA updated NeMo RL, an open-source library for the reinforcement learning stage of model training, to support end-to-end FP8 precision. Low-precision math typically introduces numerical error during training, but the new recipe uses importance sampling to match the accuracy of standard BF16 training while delivering a 1.48x speedup on Qwen3 models.

Building on NVIDIA's focus on inference-time compute, developers can now use FP8 to iterate faster on agentic tool use and multi-step workflows without the cost of full high-precision training. This makes the compute-intensive process of building agentic reasoning capabilities significantly more efficient for teams training custom models.

You can now enable FP8 for linear layers, KV cache, and attention within the NVIDIA NeMo framework. The system handles dynamic recalibration, updating quantization scales at every training step to maintain stability. These recipes are available as open-source configurations on GitHub, supporting models like Llama 3.1 and Qwen3 on Blackwell and Hopper GPUs.
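To make the three toggles concrete, here is a minimal sketch of what such a recipe configuration might look like. The key names below are illustrative, not the actual NeMo RL schema; consult the open-source configurations on GitHub for the real field names.

```python
# Hypothetical FP8 recipe configuration (illustrative key names only).
fp8_recipe = {
    "precision": "fp8",              # quantize linear-layer GEMMs to FP8
    "kv_cache_dtype": "fp8",         # store the KV cache in FP8
    "attention_dtype": "fp8",        # run attention kernels in FP8
    "recalibrate_every_step": True,  # recompute quantization scales each step
}

def enabled_fp8_features(cfg):
    """Return which FP8 features a config switches on (illustrative helper)."""
    features = []
    if cfg.get("precision") == "fp8":
        features.append("linear")
    if cfg.get("kv_cache_dtype") == "fp8":
        features.append("kv_cache")
    if cfg.get("attention_dtype") == "fp8":
        features.append("attention")
    return features

print(enabled_fp8_features(fp8_recipe))  # ['linear', 'kv_cache', 'attention']
```

Enabling all three features together is what yields the full rollout speedup reported for Qwen3.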


Frequently asked questions

What is NVIDIA NeMo RL?
NVIDIA NeMo RL is an open-source library within the NVIDIA NeMo framework designed for the reinforcement learning phase of model development. It provides tools and recipes for algorithms like Group Relative Policy Optimization to help developers turn base language models into reasoning-grade agents that can handle complex multi-step tasks and tool use.
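The "group relative" part of GRPO can be sketched in a few lines: sample several completions per prompt, then normalize each completion's reward against the group's mean and standard deviation. Real GRPO also involves a clipped policy-gradient objective; this shows only the advantage computation, with an assumed small epsilon for numerical safety.

```python
# Sketch of the group-relative advantage at the heart of GRPO:
# normalize each sampled completion's reward against its group's statistics.
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize a group of rewards to roughly zero mean and unit variance."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Completions with above-average reward get a positive advantage,
# below-average ones a negative advantage.
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because advantages are computed relative to the group, no separate value network is needed to establish a baseline.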
How does NeMo RL maintain accuracy when using low-precision FP8?
To prevent accuracy loss during low-precision training, NeMo RL uses an end-to-end FP8 recipe combined with importance sampling. This technique applies a per-token weight to the loss function, correcting distribution mismatches between the generation and training phases. This approach allows the system to match the validation accuracy of high-precision BF16 training while running significantly faster.
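The per-token correction described above can be sketched as follows: each token's loss is reweighted by the ratio of the training policy's probability to the rollout (generation) policy's probability, computed from log-probabilities. The clipping threshold here is an assumption of this sketch, added to keep a few badly mismatched tokens from dominating the update; it is not a documented NeMo RL parameter.

```python
# Sketch of per-token importance-sampling weights for correcting the
# distribution mismatch between the FP8 rollout and the training pass.
import math

def is_weights(train_logprobs, rollout_logprobs, clip=2.0):
    """Per-token weights exp(logp_train - logp_rollout), clipped (assumed)."""
    return [min(math.exp(t - r), clip)
            for t, r in zip(train_logprobs, rollout_logprobs)]

def weighted_loss(token_losses, train_logprobs, rollout_logprobs):
    """Mean token loss, with each token reweighted by its importance weight."""
    w = is_weights(train_logprobs, rollout_logprobs)
    return sum(wi * li for wi, li in zip(w, token_losses)) / len(token_losses)
```

When generation and training probabilities agree, every weight is 1.0 and the loss reduces to the ordinary mean, so the correction only activates where the two phases disagree.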
What speed improvements does FP8 provide for reinforcement learning?
Using FP8 precision in NeMo RL delivers a 1.48x speedup on models like Qwen3-8B-Base. For dense models like Llama 3.1 8B, it provides a 15% to 25% throughput increase. When developers enable FP8 for linear layers, KV cache, and attention simultaneously, the overall rollout performance can improve by approximately 48% compared to standard BF16 baselines.
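The relationship between the two ways these numbers are quoted is simple arithmetic: a 1.48x throughput factor is the same as a ~48% throughput increase, which corresponds to each rollout taking roughly a third less wall-clock time.

```python
# Relating the quoted 1.48x speedup factor to percentage figures.
speedup = 1.48
throughput_gain_pct = (speedup - 1.0) * 100        # ~48% more work per second
time_reduction_pct = (1.0 - 1.0 / speedup) * 100   # ~32% less wall-clock time
print(round(throughput_gain_pct), round(time_reduction_pct))  # 48 32
```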
How does NeMo RL handle FP8 for KV cache and attention?
NeMo RL uses a dynamic recalibration process to manage FP8 for KV cache and attention. Because model weights change at every training step, the system recomputes the optimal quantization scales at the end of each iteration. These updated scales are then synchronized with the inference engine for the next rollout phase, ensuring minimal accuracy degradation during generation.
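Per-tensor scale recalibration can be illustrated with a small sketch, assuming the FP8 E4M3 format with a maximum representable magnitude of 448. The scale maps the tensor's observed absolute maximum onto the FP8 range; recomputing it after each step, as the article describes, keeps quantization aligned with the updated weights before the next rollout. This is a simplified model, not NeMo RL's actual implementation.

```python
# Sketch of per-tensor FP8 (E4M3) scale recalibration.
E4M3_MAX = 448.0  # largest finite E4M3 magnitude

def recalibrate_scale(tensor):
    """Return the dequantization scale amax / fp8_max for a tensor."""
    amax = max(abs(x) for x in tensor)
    return amax / E4M3_MAX if amax > 0 else 1.0

def quantize_dequantize(tensor, scale):
    """Clamp into the FP8 range and scale back; a real FP8 cast would
    also round the mantissa, which this sketch omits."""
    return [max(-E4M3_MAX, min(E4M3_MAX, x / scale)) * scale for x in tensor]
```

With a freshly recalibrated scale, every value fits inside the FP8 range, so the only remaining error in a real cast comes from mantissa rounding; a stale scale, by contrast, would clip any weights that grew past the old maximum.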
Is the NVIDIA NeMo RL FP8 recipe open source?
Yes, the FP8 recipes and configurations are open source and available through the NVIDIA NeMo RL GitHub repository. Developers can access specific configuration maps and example recipes for models like Llama 3.1 8B and Qwen3. The library is designed to work with NVIDIA hardware backends like Megatron Core and inference engines like vLLM.