NVIDIA Blackwell Accelerates Llama 3 Training with NVFP4 Precision

NVIDIA

Jun 9, 2026 · Updated Jun 20, 2026

NVIDIA trained the Llama 3 8B and Llama 3.1 405B models on its Blackwell platform using NVFP4 precision and reports a 1.31–1.73x speedup over FP8 precision, with no loss in accuracy for either model. The result shows how pairing specialized hardware with a lower-precision number format can make large language model training more efficient.

Speedup over FP8: 1.31–1.73x
Models Trained: Llama 3 8B, Llama 3.1 405B
Hardware Used: NVIDIA Blackwell (GB200, GB300)
Accuracy Impact: Zero loss
Llama 3 8B / GB200 Throughput (NVFP4): 2017 TFLOP/s
Llama 3.1 405B / GB300 Throughput (NVFP4): 3633 TFLOP/s

This advancement highlights NVIDIA's focus on optimizing the training process for large language models (LLMs). Faster training with maintained accuracy directly reduces the computational resources and time required for model development.
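As a rough worked example of what the reported range means for wall-clock time, the sketch below converts a throughput speedup into the fraction of training time saved. It assumes the total amount of work is unchanged and that the speedup holds for the whole run; the function name is hypothetical and not taken from NVIDIA's materials.

```python
def time_saved_fraction(speedup: float) -> float:
    """Fraction of wall-clock time saved when the same work runs `speedup` times faster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    # Running s times faster means the job takes 1/s of the original time.
    return 1.0 - 1.0 / speedup


# Endpoints of the reported NVFP4-over-FP8 range.
for s in (1.31, 1.73):
    print(f"{s:.2f}x speedup -> {time_saved_fraction(s):.1%} less training time")
# 1.31x speedup -> 23.7% less training time
# 1.73x speedup -> 42.2% less training time
```

Under these assumptions, a run that takes 100 hours in FP8 would take roughly 58 to 76 hours in NVFP4.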

The benchmarks, full recipe breakdown, and a MaxText example are available for developers. This allows teams to explore how NVFP4 precision on the Blackwell platform can lead to more efficient and cost-effective development cycles for large-scale AI models.

View the full update on developer.nvidia.com

NVIDIA AI (@NVIDIAAI) · Jun 8

We trained Llama 3 8B and 405B with NVFP4 precision on the NVIDIA Blackwell platform. Here's what we found: 1.31–1.73× faster than FP8, with zero accuracy loss. https://t.co/rucDeruMGD

Still wondering? A few quick answers below.

What is NVFP4 precision?

NVFP4 is NVIDIA's 4-bit floating-point format, which NVIDIA used to train Llama 3 models on its Blackwell platform. Training in NVFP4 ran 1.31–1.73x faster than in FP8, with no accuracy loss.

What performance improvement did NVIDIA achieve?

Training Llama 3 models in NVFP4 was 1.31–1.73x faster than training them in FP8, with no loss in the models' accuracy.

Which models were trained using this method?

NVIDIA applied NVFP4 training to the Llama 3 8B and Llama 3.1 405B models on its Blackwell platform, using both GB200 and GB300 hardware.
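To make the idea of a 4-bit training format more concrete, the sketch below simulates block-scaled FP4 quantization in plain Python. It is an illustration under stated assumptions, not NVIDIA's training recipe: it assumes values are stored as 4-bit E2M1 floats with one shared scale per 16-element block, keeps each scale as an ordinary Python float (a real format would store scales in reduced precision), and breaks rounding ties toward the smaller magnitude. All names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign is stored separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed number of elements sharing one scale factor


def quantize_block(block):
    """Return (scale, codes); each code is a signed value from E2M1_GRID."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Choose the scale so the largest magnitude in the block maps onto 6.0.
    scale = amax / E2M1_GRID[-1]
    codes = []
    for x in block:
        # Round the scaled magnitude to the nearest grid point
        # (ties go to the smaller magnitude; a simplification).
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def fake_quantize(values):
    """Quantize block by block, then dequantize back to floats."""
    out = []
    for i in range(0, len(values), BLOCK):
        scale, codes = quantize_block(values[i:i + BLOCK])
        out.extend(c * scale for c in codes)
    return out


if __name__ == "__main__":
    weights = [0.1 * i for i in range(-8, 8)]
    for w, q in zip(weights, fake_quantize(weights)):
        print(f"{w:+.2f} -> {q:+.4f}")
```

Because each block carries its own scale, the rounding error of any element is bounded by one sixth of that block's largest magnitude, which is part of why such coarse values can still track the original tensor reasonably well.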


Keep reading

NVIDIA NeMo RL Accelerates Reasoning Model Training With End to End FP8 Precision

NVIDIA NeMo RL now supports end-to-end FP8 precision for reinforcement learning, enabling faster iterations for reasoning-grade models. By using importance sampling to maintain accuracy parity with high-precision training, the update delivers up to a 1.48x speedup on models like Qwen3. This shift makes the compute-intensive process of building agentic reasoning capabilities significantly more efficient for developers.

Qwen Sets 580 TPS Record for Agentic Workloads on Blackwell GPUs

Qwen · May 27

Qwen achieved a record 580 tokens per second running its Qwen3.5-397B-A17B model on NVIDIA Blackwell GPUs using the TokenSpeed inference engine. The optimization targets agentic workloads, where multi-turn reasoning and tool-calling typically suffer from high latency. By combining a hybrid attention architecture with deep kernel fusion, the system maintains high throughput even as context scales to one million tokens.

Perplexity Benchmarks Blackwell Performance for High Throughput Large Model Inference

Perplexity · May 12

Perplexity published research showing that NVIDIA's GB200 Blackwell architecture nearly halves communication latency for large Mixture-of-Experts models compared to the previous generation. The findings suggest that Blackwell is a primary platform for reducing the cost and latency of serving frontier-scale AI search.

NVIDIA Nemotron 3 Ultra Claims Top US Open Weights Intelligence Spot

Artificial Analysis · Jun 1

NVIDIA released Nemotron 3 Ultra, a 550B-parameter model that leads US open-weights benchmarks with an intelligence score of 48. The model delivers high-throughput performance exceeding 300 tokens per second, significantly outpacing similarly sized frontier models from China.

