NVIDIA Blackwell Accelerates Llama 3 Training with NVFP4 Precision

NVIDIA

NVIDIA trained Llama 3 8B and Llama 3.1 405B models on its Blackwell platform using NVFP4 precision, reporting a 1.31–1.73x speedup over FP8 precision with no loss in accuracy. The result shows how pairing specialized hardware with a lower-precision format can make large language model training more efficient.

Speedup over FP8: 1.31–1.73x
Models trained: Llama 3 8B, Llama 3.1 405B
Hardware used: NVIDIA Blackwell (GB200, GB300)
Accuracy impact: Zero loss
Llama 3 8B on GB200, throughput (NVFP4): 2017 TFLOP/s
Llama 3.1 405B on GB300, throughput (NVFP4): 3633 TFLOP/s

The result reflects NVIDIA's focus on optimizing training for large language models (LLMs). Faster training at the same accuracy means fewer GPU-hours and less wall-clock time for each training run.
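To put the reported range in perspective, the short Python sketch below converts a speedup factor into the fraction of training time saved. It assumes the speedup applies uniformly to the whole run and that the amount of work (tokens and steps) is unchanged; the function name `time_saved_fraction` is illustrative and does not come from NVIDIA's materials.

```python
def time_saved_fraction(speedup: float) -> float:
    """Fraction of wall-clock time saved for a fixed amount of work.

    Assumption: the speedup holds end to end, so new_time = old_time / speedup.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# The two ends of the range reported in the update.
for s in (1.31, 1.73):
    print(f"{s:.2f}x speedup -> {time_saved_fraction(s):.1%} less training time")
# 1.31x speedup -> 23.7% less training time
# 1.73x speedup -> 42.2% less training time
```

Under these assumptions, a run that took 100 hours in FP8 would take roughly 58 to 76 hours in NVFP4.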

The benchmarks, a full recipe breakdown, and a MaxText example are available to developers, so teams can evaluate whether NVFP4 precision on the Blackwell platform would shorten and lower the cost of their own large-scale training runs.

NVFP4 precision delivers up to 1.73x higher per-GPU throughput than an FP8 baseline across the Llama model configurations tested.
NVIDIA AI (@NVIDIAAI) on X:

We trained Llama 3 8B and 405B with NVFP4 precision on the NVIDIA Blackwell platform. Here's what we found: 1.31–1.73× faster than FP8, with zero accuracy loss. https://t.co/rucDeruMGD


Still wondering? A few quick answers below.

NVFP4 is NVIDIA's 4-bit floating-point precision format, used here to train Llama 3 models on the Blackwell platform. Training with it was 1.31–1.73x faster than with FP8, with no accuracy loss.
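For readers who want a feel for what a 4-bit, block-scaled format does to numbers, here is a rough Python sketch. It is an illustration under stated assumptions, not NVIDIA's training recipe. The update itself does not describe the format's layout, so the sketch assumes the E2M1 value grid, the block size of 16, and a per-block scale. It also keeps the scale as an ordinary Python float and uses simple nearest-value rounding. All function names are hypothetical.

```python
# Magnitudes a 4-bit E2M1 float can represent (sign is handled separately).
# Assumption: this grid and the block size are not taken from the update.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16


def quantize_block(block):
    """Return (scale, codes); each code is a signed value from E2M1_GRID."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Choose the scale so the largest magnitude maps to the top of the grid.
    scale = amax / E2M1_GRID[-1]
    codes = []
    for x in block:
        target = abs(x) / scale
        # Simplification: nearest grid value, ties go to the smaller value.
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Map grid codes back to approximate real values."""
    return [scale * c for c in codes]


def fake_quantize(values):
    """Quantize then dequantize, block by block, to expose rounding error."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out


if __name__ == "__main__":
    sample = [i / 10 for i in range(-16, 16)]
    approx = fake_quantize(sample)
    worst = max(abs(a - b) for a, b in zip(sample, approx))
    print(f"largest rounding error over {len(sample)} values: {worst:.4f}")
```

Giving each small block its own scale keeps the coarse 4-bit grid centered on the local range of values, which is the general idea behind recovering accuracy at such low precision.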

NVIDIA achieved a 1.31–1.73x speedup in training Llama 3 models using NVFP4 precision compared to using FP8 precision. This speedup was accomplished with no loss in the models' accuracy.

NVIDIA applied this training method to the Llama 3 8B and Llama 3.1 405B models. These models were trained on NVIDIA's Blackwell platform, including both GB200 and GB300 hardware configurations.

