We’ve developed our own inference engine Runtime-Optimized Serving Engine (ROSE) to serve models ranging from embeddings to trillion-parameter LLMs. With CuTeDSL integrated into our inference engine, Perplexity can build the specialized GPU kernels faster to bring models up to peak performance on NVIDIA Hopper and Blackwell GPUs.
Perplexity Launches ROSE Inference Engine to Optimize Blackwell GPU Performance
- Inference engine: ROSE
- Kernel language: CuTeDSL
- Target hardware: NVIDIA Hopper and Blackwell GPUs
- Model capacity: Up to trillion-parameter LLMs
- Research focus: Search, reasoning, agents, and systems
This shift toward custom infrastructure lets Perplexity bypass generic libraries and tune directly for the NVIDIA Hopper and Blackwell architectures. It mirrors an industry trend that includes the launch of NVIDIA Dynamo 1.0 as an inference operating system. By owning the kernel layer, Perplexity can push closer to peak performance on the latest hardware.
While this is an internal update, it provides the technical foundation for Perplexity’s search agent research and follows the release of Perplexity Finance Search for developers. You will likely see lower latency across Perplexity’s Pro and Max tiers as these optimizations roll out. The research team plans to continue advancing its mission through frontier systems research.