Perplexity Launches ROSE Inference Engine to Optimize Blackwell GPU Performance

Perplexity

May 7, 2026 · Updated Jun 5, 2026

Perplexity developed a custom inference engine called ROSE and integrated CuTeDSL, a domain-specific language for building specialized GPU kernels on NVIDIA hardware. By moving down the stack, the company aims to bring its models up to peak performance on Hopper and Blackwell chips and reduce latency for trillion-parameter models.

Perplexity, an AI-powered answer engine, developed its own inference engine called ROSE (Runtime-Optimized Serving Engine). Inference is the process of running a trained model to generate outputs, and ROSE handles it for models ranging from embeddings to trillion-parameter LLMs. The engine integrates CuTeDSL, a domain-specific language that speeds up the creation of specialized GPU kernels.

Inference engine: ROSE
Kernel language: CuTeDSL
Target hardware: NVIDIA Hopper and Blackwell GPUs
Model capacity: Up to trillion-parameter LLMs
Research focus: Search, reasoning, agents, and systems

This shift toward custom infrastructure lets Perplexity bypass generic libraries and tune directly for NVIDIA's Hopper and Blackwell architectures. It mirrors a broader industry move toward purpose-built inference infrastructure, such as NVIDIA Dynamo 1.0, which is positioned as an inference operating system. By controlling the kernel layer, Perplexity aims to bring its models up to peak performance on the latest hardware.

Although this is an internal infrastructure update, it provides the technical foundation for Perplexity's search agent research and follows the release of Perplexity Finance Search for developers. Users may see lower latency across Perplexity's Pro and Max tiers as these optimizations roll out. The research team plans to continue advancing its mission through frontier systems research.

View the full update on research.perplexity.ai

Perplexity

@perplexity_ai · May 6

We’ve developed our own inference engine Runtime-Optimized Serving Engine (ROSE) to serve models ranging from embeddings to trillion-parameter LLMs. With CuTeDSL integrated into our inference engine, Perplexity can build the specialized GPU kernels faster to bring models up to peak performance on NVIDIA Hopper and Blackwell GPUs.


Still wondering? A few quick answers below.

What is the Perplexity ROSE inference engine?

ROSE, which stands for Runtime-Optimized Serving Engine, is a proprietary inference system developed by Perplexity to serve its models. It is designed to handle a wide range of model sizes, from small embedding models to trillion-parameter LLMs, so that each runs efficiently on modern hardware.

How does CuTeDSL improve Perplexity model performance?

CuTeDSL is a domain-specific language integrated into Perplexity's inference engine that lets engineers build specialized GPU kernels more quickly. Kernels are the low-level programs that run on the GPU and determine how computation and data movement are carried out on the chip. Optimizing them helps models reach peak performance on the latest NVIDIA hardware.

Which NVIDIA GPUs are supported by Perplexity's new inference engine?

Perplexity's new infrastructure targets NVIDIA's Hopper and Blackwell GPU architectures. By using CuTeDSL to create custom kernels, the company can tune its models for the processing capabilities of these chips, which are widely used for high-performance AI workloads.

What size models can the Perplexity ROSE engine handle?

The ROSE engine serves everything from small embedding models used for search retrieval to frontier-class large language models with a trillion parameters. This range lets Perplexity use a single, optimized serving layer for the different AI components that power its answer engine.

Why did Perplexity build its own inference engine?

Perplexity developed ROSE and integrated CuTeDSL to gain deeper control over hardware performance and reduce latency for users. By owning its serving stack, the company can build specialized GPU kernels faster and tune its models for the demands of real-time search and reasoning.
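To put the trillion-parameter figure in perspective, here is a rough back-of-the-envelope sketch of how much memory the weights of such a model would occupy at common numeric precisions. It is illustrative arithmetic only, not a description of how ROSE works: the function names, the precision table, and the 192 GB per-GPU memory figure are assumptions chosen for the example, and the estimate ignores KV cache, activations, and runtime overhead.

```python
import math

# Bytes needed to store one parameter at each precision (illustrative set).
BYTES_PER_PARAM = {"fp16/bf16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_memory_tb(num_params: float, bytes_per_param: float) -> float:
    """Weight storage in terabytes (1 TB = 1e12 bytes). Weights only."""
    return num_params * bytes_per_param / 1e12


def min_gpus(num_params: float, bytes_per_param: float, gpu_memory_gb: float) -> int:
    """Smallest number of GPUs whose combined memory could hold the weights.

    gpu_memory_gb is a hypothetical per-GPU capacity; real deployments need
    extra headroom for KV cache, activations, and communication buffers.
    """
    total_bytes = num_params * bytes_per_param
    return math.ceil(total_bytes / (gpu_memory_gb * 1e9))


if __name__ == "__main__":
    params = 1e12  # one trillion parameters
    for name, nbytes in BYTES_PER_PARAM.items():
        tb = weight_memory_tb(params, nbytes)
        gpus = min_gpus(params, nbytes, gpu_memory_gb=192)
        print(f"{name}: {tb:.1f} TB of weights, at least {gpus} GPUs at 192 GB each")
```

Even under these simplified assumptions, the weights alone span several GPUs, which is one reason serving models at this scale depends on efficient kernels and fast communication between chips.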


Keep reading

Perplexity Benchmarks Blackwell Performance for High Throughput Large Model Inference

Perplexity published research showing that NVIDIA's GB200 Blackwell architecture nearly halves communication latency for large Mixture-of-Experts models compared to the previous generation. The findings suggest that Blackwell is a primary platform for reducing the cost and latency of serving frontier-scale AI search.

Qwen Sets 580 TPS Record for Agentic Workloads on Blackwell GPUs

Qwen · May 27

Qwen achieved a record 580 tokens per second running its Qwen3.5-397B-A17B model on NVIDIA Blackwell GPUs using the TokenSpeed inference engine. The optimization targets agentic workloads, where multi-turn reasoning and tool-calling typically suffer from high latency. By combining a hybrid attention architecture with deep kernel fusion, the system maintains high throughput even as context scales to one million tokens.

LightSeek Foundation Launches TokenSpeed to Optimize Blackwell for Agentic AI

LightSeek Foundation · May 7

LightSeek Foundation released TokenSpeed, an open-source inference engine designed specifically for the long-context and high-throughput demands of AI coding agents. By optimizing kernels for NVIDIA Blackwell hardware, the system achieves higher performance than TensorRT-LLM on agentic benchmarks while maintaining the usability of vLLM.

NVIDIA Research Unveils GVR Algorithm for 1.88x Faster Blackwell Inference

NVIDIA · May 8

NVIDIA Research developed Guess-Verify-Refine, a hardware-aware algorithm that speeds up the selection of important data points during AI reasoning. By reusing patterns from previous steps, the system reduces latency for long-context models on Blackwell GPUs without sacrificing mathematical accuracy.

