NVIDIA NeMo RL Accelerates Reasoning Model Training With End-to-End FP8 Precision

LLM
AI Research
AI Hardware
Performance
Qwen

NVIDIA updated NeMo RL, an open-source library for the reinforcement learning stage of model training, to support end-to-end FP8 precision. Low-precision math typically introduces numerical error during training, but the new recipe uses importance sampling to match the accuracy of standard BF16 training while delivering a 1.48x speedup on Qwen3 models.

Building on NVIDIA's focus on inference-time compute, developers can now use FP8 to iterate faster on agentic tool use and multi-step workflows without the cost of full high-precision training. This makes the compute-intensive process of building agentic reasoning capabilities significantly more efficient for teams training custom models.

You can now enable FP8 for linear layers, KV cache, and attention within the NVIDIA NeMo framework. The system handles dynamic recalibration, updating quantization scales at every training step to maintain stability. These recipes are available as open-source configurations on GitHub, supporting models like Llama 3.1 and Qwen3 on Blackwell and Hopper GPUs.
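To make the three toggles concrete, here is a minimal sketch of what such a recipe configuration might look like. The key names below are illustrative, not the actual NeMo RL schema; consult the open-source configurations on GitHub for the real field names.

```python
# Hypothetical FP8 recipe configuration (illustrative key names only).
fp8_recipe = {
    "precision": "fp8",              # quantize linear-layer GEMMs to FP8
    "kv_cache_dtype": "fp8",         # store the KV cache in FP8
    "attention_dtype": "fp8",        # run attention kernels in FP8
    "recalibrate_every_step": True,  # recompute quantization scales each step
}

def enabled_fp8_features(cfg):
    """Return which FP8 features a config switches on (illustrative helper)."""
    features = []
    if cfg.get("precision") == "fp8":
        features.append("linear")
    if cfg.get("kv_cache_dtype") == "fp8":
        features.append("kv_cache")
    if cfg.get("attention_dtype") == "fp8":
        features.append("attention")
    return features

print(enabled_fp8_features(fp8_recipe))  # ['linear', 'kv_cache', 'attention']
```

Enabling all three features together is what yields the full rollout speedup reported for Qwen3.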


Frequently asked questions

What is NVIDIA NeMo RL?
NVIDIA NeMo RL is an open-source library within the NVIDIA NeMo framework designed for the reinforcement learning phase of model development. It provides tools and recipes for algorithms like Group Relative Policy Optimization to help developers turn base language models into reasoning-grade agents that can handle complex multi-step tasks and tool use.
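The "group relative" part of GRPO can be sketched in a few lines: sample several completions per prompt, then normalize each completion's reward against the group's mean and standard deviation. Real GRPO also involves a clipped policy-gradient objective; this shows only the advantage computation, with an assumed small epsilon for numerical safety.

```python
# Sketch of the group-relative advantage at the heart of GRPO:
# normalize each sampled completion's reward against its group's statistics.
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize a group of rewards to roughly zero mean and unit variance."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Completions with above-average reward get a positive advantage,
# below-average ones a negative advantage.
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because advantages are computed relative to the group, no separate value network is needed to establish a baseline.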
How does NeMo RL maintain accuracy when using low-precision FP8?
To prevent accuracy loss during low-precision training, NeMo RL uses an end-to-end FP8 recipe combined with importance sampling. This technique applies a per-token weight to the loss function, correcting distribution mismatches between the generation and training phases. This approach allows the system to match the validation accuracy of high-precision BF16 training while running significantly faster.
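The per-token correction described above can be sketched as follows: each token's loss is reweighted by the ratio of the training policy's probability to the rollout (generation) policy's probability, computed from log-probabilities. The clipping threshold here is an assumption of this sketch, added to keep a few badly mismatched tokens from dominating the update; it is not a documented NeMo RL parameter.

```python
# Sketch of per-token importance-sampling weights for correcting the
# distribution mismatch between the FP8 rollout and the training pass.
import math

def is_weights(train_logprobs, rollout_logprobs, clip=2.0):
    """Per-token weights exp(logp_train - logp_rollout), clipped (assumed)."""
    return [min(math.exp(t - r), clip)
            for t, r in zip(train_logprobs, rollout_logprobs)]

def weighted_loss(token_losses, train_logprobs, rollout_logprobs):
    """Mean token loss, with each token reweighted by its importance weight."""
    w = is_weights(train_logprobs, rollout_logprobs)
    return sum(wi * li for wi, li in zip(w, token_losses)) / len(token_losses)
```

When generation and training probabilities agree, every weight is 1.0 and the loss reduces to the ordinary mean, so the correction only activates where the two phases disagree.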
What speed improvements does FP8 provide for reinforcement learning?
Using FP8 precision in NeMo RL delivers a 1.48x speedup on models like Qwen3-8B-Base. For dense models like Llama 3.1 8B, it provides a 15% to 25% throughput increase. When developers enable FP8 for linear layers, KV cache, and attention simultaneously, the overall rollout performance can improve by approximately 48% compared to standard BF16 baselines.
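The relationship between the two ways these numbers are quoted is simple arithmetic: a 1.48x throughput factor is the same as a ~48% throughput increase, which corresponds to each rollout taking roughly a third less wall-clock time.

```python
# Relating the quoted 1.48x speedup factor to percentage figures.
speedup = 1.48
throughput_gain_pct = (speedup - 1.0) * 100        # ~48% more work per second
time_reduction_pct = (1.0 - 1.0 / speedup) * 100   # ~32% less wall-clock time
print(round(throughput_gain_pct), round(time_reduction_pct))  # 48 32
```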
How does NeMo RL handle FP8 for KV cache and attention?
NeMo RL uses a dynamic recalibration process to manage FP8 for KV cache and attention. Because model weights change at every training step, the system recomputes the optimal quantization scales at the end of each iteration. These updated scales are then synchronized with the inference engine for the next rollout phase, ensuring minimal accuracy degradation during generation.
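Per-tensor scale recalibration can be illustrated with a small sketch, assuming the FP8 E4M3 format with a maximum representable magnitude of 448. The scale maps the tensor's observed absolute maximum onto the FP8 range; recomputing it after each step, as the article describes, keeps quantization aligned with the updated weights before the next rollout. This is a simplified model, not NeMo RL's actual implementation.

```python
# Sketch of per-tensor FP8 (E4M3) scale recalibration.
E4M3_MAX = 448.0  # largest finite E4M3 magnitude

def recalibrate_scale(tensor):
    """Return the dequantization scale amax / fp8_max for a tensor."""
    amax = max(abs(x) for x in tensor)
    return amax / E4M3_MAX if amax > 0 else 1.0

def quantize_dequantize(tensor, scale):
    """Clamp into the FP8 range and scale back; a real FP8 cast would
    also round the mantissa, which this sketch omits."""
    return [max(-E4M3_MAX, min(E4M3_MAX, x / scale)) * scale for x in tensor]
```

With a freshly recalibrated scale, every value fits inside the FP8 range, so the only remaining error in a real cast comes from mantissa rounding; a stale scale, by contrast, would clip any weights that grew past the old maximum.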
Is the NVIDIA NeMo RL FP8 recipe open source?
Yes, the FP8 recipes and configurations are open source and available through the NVIDIA NeMo RL GitHub repository. Developers can access specific configuration maps and example recipes for models like Llama 3.1 8B and Qwen3. The library is designed to work with NVIDIA hardware backends like Megatron Core and inference engines like vLLM.