We trained Llama 3 8B and 405B with NVFP4 precision on the NVIDIA Blackwell platform. Here's what we found: 1.31–1.73× faster than FP8, with zero accuracy loss. https://t.co/rucDeruMGD
NVIDIA Blackwell Accelerates Llama 3 Training with NVFP4 Precision
NVIDIA trained Llama 3 8B and Llama 3.1 405B models on its Blackwell platform using NVFP4 precision, achieving a 1.31–1.73x speedup over FP8 with no loss in accuracy. The result shows how specialized hardware and low-precision number formats can make large language model training more efficient.
- Speedup over FP8: 1.31–1.73x
- Models Trained: Llama 3 8B, Llama 3.1 405B
- Hardware Used: NVIDIA Blackwell (GB200, GB300)
- Accuracy Impact: Zero loss
- Llama 3 8B / GB200 Throughput (NVFP4): 2017 TFLOP/s
- Llama 3.1 405B / GB300 Throughput (NVFP4): 3633 TFLOP/s
This result reflects NVIDIA's focus on optimizing training for large language models (LLMs). Training faster at the same accuracy directly reduces the compute and wall-clock time needed to develop a model.
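To put the reported range in terms of time, here is a minimal sketch that converts a throughput speedup into the fraction of training time saved. It assumes a fixed workload (same model, same token count, same number of GPUs) and that the speedup carries over fully to end-to-end wall-clock time, which the source does not state. The function name `time_saved_fraction` is hypothetical and used only for illustration.

```python
# Speedup range reported in the article (NVFP4 relative to FP8).
SPEEDUP_LOW = 1.31
SPEEDUP_HIGH = 1.73


def time_saved_fraction(speedup: float) -> float:
    """Fraction of wall-clock time saved for a fixed workload.

    Assumption: the whole run is sped up uniformly, so the new
    run time is the old run time divided by `speedup`.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    for s in (SPEEDUP_LOW, SPEEDUP_HIGH):
        saved = time_saved_fraction(s)
        print(f"{s:.2f}x speedup -> {saved:.1%} less training time")
```

Under these assumptions, the 1.31–1.73x range corresponds to roughly 24% to 42% less training time than the FP8 baseline. Real savings depend on how much of a run is spent outside the accelerated compute.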
The benchmarks, full recipe breakdown, and a MaxText example are available for developers. This allows teams to explore how NVFP4 precision on the Blackwell platform can lead to more efficient and cost-effective development cycles for large-scale AI models.





