HeadsUpAI

Perplexity Benchmarks Blackwell Performance for High Throughput Large Model Inference

Perplexity, an AI-powered answer engine providing real-time cited responses, published research on serving Qwen3 235B on NVIDIA GB200 NVL72 Blackwell racks. Blackwell is a major upgrade over Hopper for high-throughput inference (running a trained model to generate outputs) on large Mixture-of-Experts (MoE) models (AI models that activate only a fraction of their parameters).
All-reduce latency (H200): 586.1 microseconds
All-reduce latency (GB200): 313.3 microseconds
MoE prefill latency (H200): 730.1 microseconds
MoE prefill latency (GB200): 438.5 microseconds
Model tested: Qwen3 235B
Hardware: NVIDIA GB200 NVL72 Blackwell

The benchmarks indicate that Blackwell's rack-scale NVLink architecture addresses the primary bottleneck for massive MoE models: the latency of moving data between GPUs. With all-reduce latency nearly halved (586.1 to 313.3 microseconds), Perplexity can deliver faster answers at lower cost. The research follows Perplexity's development of its ROSE inference engine, which is optimized for Blackwell hardware.

For organizations deploying trillion-parameter models, these results validate the shift toward prefill/decode disaggregation and Blackwell-native quantization. While the research focuses on internal infrastructure, it signals a broader industry move toward Blackwell-optimized inference paths. Review the full technical paper for specific kernel optimizations.

Perplexity (@perplexity_ai) on X:

We published new research on how we serve post-trained Qwen3 235B models on NVIDIA GB200 NVL72 Blackwell racks. GB200 is a major step up over Hopper for high-throughput inference on large MoE models, not just a training platform. https://t.co/yYZuPRXWzr


Still wondering? A few quick answers below.

Perplexity's research shows that NVIDIA's GB200 Blackwell architecture outperforms the previous Hopper generation for large model inference. Specifically, all-reduce latency dropped from 586.1 microseconds on H200 to 313.3 microseconds on GB200. This reduction in communication overhead allows for higher throughput and faster response times when serving massive Mixture-of-Experts models at scale.
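The size of these improvements can be checked with simple arithmetic. The short Python sketch below uses only the four latency figures reported above; the helper name `latency_change` is ours, introduced for illustration.

```python
def latency_change(before_us, after_us):
    """Return (percent reduction, speedup factor) for two latencies in microseconds."""
    reduction_pct = (before_us - after_us) / before_us * 100
    speedup = before_us / after_us
    return reduction_pct, speedup


# Figures reported in the article (H200 -> GB200).
all_reduce = latency_change(586.1, 313.3)
moe_prefill = latency_change(730.1, 438.5)

print(f"all-reduce:  {all_reduce[0]:.1f}% lower, {all_reduce[1]:.2f}x faster")
print(f"MoE prefill: {moe_prefill[0]:.1f}% lower, {moe_prefill[1]:.2f}x faster")
# all-reduce:  46.5% lower, 1.87x faster
# MoE prefill: 39.9% lower, 1.66x faster
```

In other words, "nearly half" corresponds to a reduction of about 46.5% in all-reduce latency, while MoE prefill latency falls by roughly 40%.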

Perplexity uses several techniques to maximize performance on Blackwell racks. These include prefill/decode disaggregation, which runs the prompt-processing (prefill) and token-generation (decode) stages of inference separately, and Blackwell-native quantization, which reduces model size while aiming to preserve accuracy. The team also developed custom kernels and used rack-scale NVLink to minimize the time GPUs spend sharing data.
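To make the idea of prefill/decode disaggregation concrete, here is a toy Python sketch. It is not Perplexity's implementation: `KVCache`, `prefill`, `decode`, and the stand-in `toy_next_token` function are all hypothetical names, and a list of token IDs stands in for the real key/value cache. The point is only that the two stages are separate functions that hand off a cache, so in principle they could run on different workers.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Stand-in for the attention cache handed from prefill to decode."""
    tokens: list = field(default_factory=list)


def prefill(prompt_tokens):
    """Prefill stage: process the whole prompt in one pass and build the cache."""
    return KVCache(tokens=list(prompt_tokens))


def decode(cache, max_new_tokens, next_token):
    """Decode stage: generate one token at a time, extending the cache each step."""
    generated = []
    for _ in range(max_new_tokens):
        token = next_token(cache.tokens)
        cache.tokens.append(token)
        generated.append(token)
    return generated


def toy_next_token(context):
    """Hypothetical model stub: the next token is just the context length."""
    return len(context)


# In a disaggregated setup these two calls could run on different workers,
# with the cache transferred between them.
cache = prefill([10, 11, 12])
output = decode(cache, max_new_tokens=3, next_token=toy_next_token)
print(output)
```

Prefill handles many tokens at once, while decode produces one token per step, which is why separating them lets each stage be scheduled and scaled on its own.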

In Mixture-of-Experts models like Qwen3 235B, only a subset of the model's parameters is activated for each token. Because the experts are spread across many GPUs, serving the model requires frequent communication between GPUs, including collective operations such as all-reduce, in which every GPU ends up with the combined result of data held across all of them. High all-reduce latency creates a bottleneck that slows down the entire system. By cutting this latency nearly in half, Blackwell hardware enables much faster processing for these architectures.
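The two ideas in this answer can be sketched in a few lines of plain Python. This is a conceptual toy under our own assumptions, not the paper's code: the function names are hypothetical, lists stand in for GPU buffers, and production systems rely on communication libraries such as NCCL rather than anything like this. Strictly speaking, all-reduce is a collective in which every GPU ends up with the same combined result (here, an element-wise sum).

```python
def top_k_experts(router_scores, k):
    """Pick the k highest-scoring experts for one token (MoE routing)."""
    ranked = sorted(
        range(len(router_scores)),
        key=lambda i: router_scores[i],
        reverse=True,
    )
    return sorted(ranked[:k])


def all_reduce_sum(per_gpu_values):
    """Toy all-reduce: every simulated GPU receives the element-wise sum."""
    total = [sum(column) for column in zip(*per_gpu_values)]
    return [list(total) for _ in per_gpu_values]


# One token, four experts, two of them activated.
active = top_k_experts([0.10, 0.70, 0.05, 0.15], k=2)
print(active)  # [1, 3]

# Three simulated GPUs, each holding a partial result of length 2.
reduced = all_reduce_sum([[1, 2], [3, 4], [5, 6]])
print(reduced)  # [[9, 12], [9, 12], [9, 12]]
```

Every step of this kind requires all participating GPUs to exchange data before any of them can continue, which is why the latency of the collective matters so much at scale.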

For an answer engine like Perplexity, the GB200 NVL72 provides a high-throughput platform that reduces the cost of serving frontier-grade models. The hardware's ability to sustain high token speeds during the decoding phase means users receive answers more quickly. This efficiency allows the platform to run larger, more capable models while maintaining the real-time responsiveness required for search.

Yes, Perplexity has published the full technical paper detailing their findings and methodology for serving post-trained Qwen3 235B models on Blackwell racks. The research covers their benchmarks, the specific latency improvements observed over the Hopper generation, and the architectural optimizations used to achieve high-throughput performance. Developers and researchers can access the paper to understand these infrastructure improvements.
