HeadsUpAI

Perplexity Launches ROSE Inference Engine to Optimize Blackwell GPU Performance


Perplexity, an AI-powered answer engine, has developed its own engine for inference (the process of running a trained model to generate outputs), called ROSE. ROSE, short for Runtime-Optimized Serving Engine, serves models ranging from embeddings to trillion-parameter LLMs. It integrates CuTeDSL, a domain-specific language that speeds up the development of specialized GPU kernels.
Inference engine: ROSE
Kernel language: CuTeDSL
Target hardware: NVIDIA Hopper and Blackwell GPUs
Model capacity: Up to trillion-parameter LLMs
Research focus: Search, reasoning, agents, and systems

This shift toward custom infrastructure lets Perplexity bypass generic libraries and tune directly for the NVIDIA Hopper and Blackwell architectures. It mirrors a broader industry move toward dedicated inference infrastructure, such as NVIDIA Dynamo 1.0, which is positioned as an inference operating system. By owning the kernel layer, Perplexity aims to bring its models up to peak performance on the latest hardware.

While this is an internal update, it provides the technical foundation for Perplexity's search agent research and follows the release of Perplexity Finance Search for developers. Users on Perplexity's Pro and Max tiers may see lower latency as these optimizations roll out. The research team plans to continue advancing its mission through frontier systems research.

Perplexity (@perplexity_ai) on X:

We’ve developed our own inference engine Runtime-Optimized Serving Engine (ROSE) to serve models ranging from embeddings to trillion-parameter LLMs. With CuTeDSL integrated into our inference engine, Perplexity can build the specialized GPU kernels faster to bring models up to peak performance on NVIDIA Hopper and Blackwell GPUs.


Still wondering? A few quick answers below.

ROSE, which stands for Runtime-Optimized Serving Engine, is a proprietary inference engine developed by Perplexity to serve its models. It is designed to handle a wide range of model sizes, from small embedding models to trillion-parameter LLMs, and to run them efficiently on modern hardware.

CuTeDSL is a domain-specific language integrated into Perplexity's inference engine that lets engineers build specialized GPU kernels more quickly. Kernels are the low-level programs that run on the GPU and carry out a model's computations, and optimizing them helps models reach peak performance on the latest NVIDIA hardware.
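To make the idea of a kernel concrete, here is a minimal sketch in plain Python. It is a conceptual emulation only: it does not use CuTeDSL, does not run on a GPU, and is not Perplexity's code. The names `scale_add_kernel` and `launch` are hypothetical and chosen purely for illustration. The point is that a kernel describes the work of a single thread, and the hardware runs many such threads at once.

```python
# Conceptual emulation only: plain Python standing in for a GPU kernel.
# This is not CuTeDSL and not Perplexity's code; all names are hypothetical.

def scale_add_kernel(thread_idx, alpha, x, y, out):
    """Work done by one thread: compute a single output element."""
    if thread_idx < len(out):  # guard threads that fall past the end of the data
        out[thread_idx] = alpha * x[thread_idx] + y[thread_idx]


def launch(kernel, num_threads, *args):
    """Stand-in for a GPU launch; a real GPU would run these threads in parallel."""
    for thread_idx in range(num_threads):
        kernel(thread_idx, *args)


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)

# Launch more threads than elements to show why the bounds guard matters.
launch(scale_add_kernel, 8, 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0, 48.0]
```

On real hardware, how threads are grouped and how data is staged in memory are the kinds of choices a specialized kernel tunes for a given GPU architecture.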

Perplexity's new infrastructure is optimized for NVIDIA's Hopper and Blackwell GPU architectures. By using CuTeDSL to build custom kernels, the company can tune its models for the processing capabilities of these specific chips, which are widely used for high-performance AI workloads.

The ROSE engine is built to serve a wide range of model sizes, from small embedding models used for search retrieval to frontier-class large language models with a trillion parameters. This flexibility lets Perplexity use a single, optimized serving layer for the different AI components that power its answer engine.

Perplexity developed ROSE and integrated CuTeDSL into it to gain deeper control over hardware performance and reduce latency for users. By building its own stack, the company can create specialized GPU kernels faster and tune its models for the demands of real-time search and reasoning.
