Perplexity Benchmarks Blackwell Performance for High Throughput Large Model Inference

Perplexity

May 12, 2026 · Updated Jun 8, 2026

Perplexity published research showing that NVIDIA's GB200 Blackwell architecture nearly halves all-reduce communication latency for large Mixture-of-Experts models compared with the previous Hopper generation. The findings suggest that Blackwell can meaningfully reduce the cost and latency of serving frontier-scale AI search.

Perplexity, an AI-powered answer engine providing real-time cited responses, published research on serving Qwen3 235B on NVIDIA GB200 NVL72 Blackwell racks. According to the research, Blackwell is a major upgrade over Hopper for high-throughput inference (running a trained model to generate outputs) on large Mixture-of-Experts (MoE) models (AI models that activate only a fraction of their parameters for each token).
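To make "activates only a fraction of its parameters" concrete, here is a minimal sketch of top-k expert routing for a single token. The expert count, the value of k, the router scores, and the function names are all illustrative assumptions; they are not Qwen3 235B's actual configuration or Perplexity's code.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of router scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts for one token and renormalize weights."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# Hypothetical router scores for 8 experts; only 2 experts run for this token.
scores = [0.1, 2.0, -1.0, 1.5, 0.3, -0.5, 0.0, 0.8]
chosen = route_top_k(scores, k=2)
print([expert for expert, _ in chosen])  # [1, 3]
```

When the experts live on different GPUs, each token's activations have to travel to whichever GPUs host its chosen experts, which is why interconnect latency matters so much for this model family.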

All-reduce latency (H200): 586.1 microseconds
All-reduce latency (GB200): 313.3 microseconds
MoE prefill latency (H200): 730.1 microseconds
MoE prefill latency (GB200): 438.5 microseconds
Model tested: Qwen3 235B
Hardware: NVIDIA GB200 NVL72 Blackwell
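The relative improvement implied by these figures can be checked with a few lines of arithmetic. The sketch below uses only the four latencies listed above; the function and variable names are illustrative and do not come from Perplexity's paper.

```python
# Latencies in microseconds, copied from the figures listed above.
LATENCIES_US = {
    "all-reduce": {"H200": 586.1, "GB200": 313.3},
    "MoE prefill": {"H200": 730.1, "GB200": 438.5},
}


def improvement(h200_us: float, gb200_us: float) -> tuple[float, float]:
    """Return (percent reduction, speedup factor) going from H200 to GB200."""
    reduction_pct = (h200_us - gb200_us) / h200_us * 100
    speedup = h200_us / gb200_us
    return reduction_pct, speedup


for name, lat in LATENCIES_US.items():
    pct, speedup = improvement(lat["H200"], lat["GB200"])
    print(f"{name}: {pct:.1f}% lower latency, {speedup:.2f}x faster")
```

On these numbers, all-reduce latency falls by about 46.5% (roughly 1.87x faster) and MoE prefill latency by about 39.9% (roughly 1.66x faster), which is the basis for the "nearly halves" description.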

The benchmarks indicate that Blackwell's rack-scale NVLink architecture addresses a primary bottleneck for massive MoE models: the latency of moving data between GPUs. By nearly halving all-reduce latency, Perplexity can deliver faster answers at lower cost. The work follows Perplexity's development of its ROSE inference engine, which is optimized for Blackwell hardware.

For organizations deploying trillion-parameter models, these results support the shift toward prefill/decode disaggregation and Blackwell-native quantization. While the research focuses on Perplexity's internal infrastructure, it signals a broader industry move toward Blackwell-optimized inference paths. Review the full technical paper for specific kernel optimizations.

Perplexity (@perplexity_ai) · May 12

We published new research on how we serve post-trained Qwen3 235B models on NVIDIA GB200 NVL72 Blackwell racks. GB200 is a major step up over Hopper for high-throughput inference on large MoE models, not just a training platform. https://t.co/yYZuPRXWzr

Still wondering? A few quick answers below.

What is the performance difference between NVIDIA Blackwell and Hopper for AI inference?

Perplexity's research shows that NVIDIA's GB200 Blackwell architecture outperforms the previous Hopper generation for large model inference. Specifically, all-reduce latency dropped from 586.1 microseconds on H200 to 313.3 microseconds on GB200. This reduction in communication overhead allows for higher throughput and faster response times when serving massive Mixture-of-Experts models at scale.

How does Perplexity optimize Qwen3 235B for NVIDIA Blackwell hardware?

Perplexity uses several techniques to maximize performance on Blackwell racks. These include prefill and decode disaggregation, which runs the prompt-processing stage and the token-generation stage of inference separately, and Blackwell-native quantization, which reduces model size while aiming to preserve accuracy. The company also developed custom kernels and used rack-scale NVLink to minimize the time GPUs spend sharing data.
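As a rough illustration of what prefill and decode disaggregation means, the toy sketch below splits one request into a prefill step that builds a cache from the whole prompt and a decode step that extends it token by token. It is not Perplexity's implementation: `KVCache`, `prefill_worker`, and `decode_worker` are hypothetical names, and the placeholder tokens stand in for real model forward passes.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Stand-in for the per-request attention cache produced by prefill."""
    tokens: list[str] = field(default_factory=list)


def prefill_worker(prompt: list[str]) -> KVCache:
    # Prefill: process the whole prompt in one pass (compute-heavy).
    return KVCache(tokens=list(prompt))


def decode_worker(cache: KVCache, max_new_tokens: int) -> list[str]:
    # Decode: generate one token at a time, extending the cache each step.
    generated = []
    for step in range(max_new_tokens):
        next_token = f"<tok{step}>"  # placeholder for a real model forward pass
        cache.tokens.append(next_token)
        generated.append(next_token)
    return generated


# In a disaggregated deployment the two workers would run on separate GPU
# pools, and the cache would be transferred between them over the interconnect.
cache = prefill_worker(["What", "is", "MoE", "?"])
output = decode_worker(cache, max_new_tokens=3)
print(output)             # ['<tok0>', '<tok1>', '<tok2>']
print(len(cache.tokens))  # 7
```

Separating the stages lets each pool be sized and tuned for its own workload, at the price of moving the cache between them, which is one reason a fast interconnect helps.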

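Why is all-reduce latency important for Mixture-of-Experts models?

In Mixture-of-Experts models like Qwen3 235B, only a subset of the model's parameters is activated for each token. Because the model is spread across many GPUs, serving it requires frequent communication between them, including collective operations such as all-reduce, which combines partial results from every GPU and returns the combined result to each one. High all-reduce latency creates a bottleneck that slows down the entire system. By cutting this latency nearly in half, Blackwell hardware enables much faster processing for these architectures.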
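For readers unfamiliar with the operation, the sketch below simulates the semantics of a sum all-reduce in plain Python: each simulated GPU starts with its own partial result and all of them end with the same elementwise total. It shows only what the operation computes. Production systems typically use a collective communication library over the GPU interconnect, and nothing here reflects the kernels described in Perplexity's paper.

```python
def all_reduce_sum(per_gpu: list[list[float]]) -> list[list[float]]:
    """Simulate a sum all-reduce: every GPU ends with the elementwise total."""
    total = [sum(values) for values in zip(*per_gpu)]
    # Each simulated GPU receives its own copy of the reduced result.
    return [list(total) for _ in per_gpu]


# Three simulated GPUs, each holding a partial result of the same shape.
partials = [
    [1.0, 2.0],
    [10.0, 20.0],
    [100.0, 200.0],
]
print(all_reduce_sum(partials))
# [[111.0, 222.0], [111.0, 222.0], [111.0, 222.0]]
```

Because no GPU can proceed until it has the combined result, the time spent in this step adds directly to the latency of every layer that uses it.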

What is the benefit of using GB200 NVL72 for AI search engines?

For an answer engine like Perplexity, the GB200 NVL72 provides a high-throughput platform that reduces the cost of serving frontier-grade models. The hardware's ability to sustain high token speeds during the decoding phase means users receive answers more quickly. This efficiency allows the platform to run larger, more capable models while maintaining the real-time responsiveness required for search.

Is the Perplexity research on Blackwell inference publicly available?

Yes, Perplexity has published the full technical paper detailing its findings and methodology for serving post-trained Qwen3 235B models on Blackwell racks. The research covers the benchmarks, the specific latency improvements observed over the Hopper generation, and the architectural optimizations used to achieve high-throughput performance. Developers and researchers can access the paper to understand these infrastructure improvements.
