Perplexity Open-Sources Rebuilt Tokenizer That Cuts CPU Utilization Five- to Six-Fold

Perplexity

May 27, 2026 · Updated Jun 4, 2026

Perplexity open-sourced a rebuilt Unigram tokenizer that cuts CPU utilization by a factor of five to six compared with standard implementations. GPU inference usually gets the attention, but this release targets a less visible bottleneck: CPU-side tokenization for fast models such as rerankers.

Perplexity, an AI-powered answer engine, open-sourced a rebuilt Unigram tokenizer that cuts CPU utilization by a factor of up to six. The implementation targets the XLM-RoBERTa vocabulary and performs no heap allocations during encoding, eliminating a common source of processing stalls.

CPU utilization reduction: 5-6x
Latency (514 tokens): 63 µs
Speed vs Hugging Face: 5x faster
Speed vs SentencePiece (C++): 2x faster
Memory allocation: Zero heap allocations

This optimization addresses a growing bottleneck in retrieval-augmented generation (RAG) pipelines. GPU compute for small rerankers is fast, so CPU-side tokenization often accounts for a significant share of total latency. The release follows the launch of Perplexity's ROSE GPU inference engine, part of an effort to maximize hardware efficiency across the stack.

The source code is available in the pplx-garden repository on GitHub. The tokenizer is written in Rust and runs roughly five times faster than the Hugging Face tokenizers crate. It is designed for production environments where shaving double-digit milliseconds off reranker latency provides a measurable competitive advantage.

View the full update on research.perplexity.ai

Perplexity (@perplexity_ai) · May 27

We're open-sourcing the Unigram tokenizer we rebuilt to reduce CPU utilization by 5-6x. Small rerankers and embedders run in single-digit milliseconds on GPU, making CPU tokenization a meaningful share of total latency. https://t.co/QUnHeiho56 https://t.co/Oh29f1lo51


Still wondering? A few quick answers below.

What is the Perplexity Unigram tokenizer?

The Perplexity Unigram tokenizer is an open-source reimplementation of the Unigram algorithm, which converts text into the numerical token IDs that AI models consume. It was built from scratch to optimize CPU performance for the XLM-RoBERTa vocabulary, which is widely used for ranking, retrieval, and similarity tasks in modern AI stacks.
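To make the Unigram idea concrete, here is a minimal sketch of how a Unigram tokenizer typically picks a segmentation: each vocabulary piece carries a log-probability score, and a Viterbi pass selects the split with the highest total score. This is not Perplexity's code. The `Piece` struct, the `encode` function, and the linear scan over the vocabulary are assumptions made for illustration, and unknown-token handling is left out. The scratch buffers are supplied by the caller, so the function itself performs no heap allocation.

```rust
/// Hypothetical vocabulary entry: a piece, its token id, and its log-probability.
pub struct Piece<'a> {
    pub text: &'a str,
    pub id: u32,
    pub score: f32,
}

/// Viterbi segmentation over a toy vocabulary.
/// Writes the best-scoring token ids into `out` and returns how many were written.
/// Returns None if the text cannot be covered by the vocabulary or if a buffer
/// is too small. `best` and `back` are caller-owned scratch buffers that must
/// hold at least `text.len() + 1` entries.
pub fn encode(
    vocab: &[Piece<'_>],
    text: &str,
    best: &mut [f32],
    back: &mut [(usize, u32)],
    out: &mut [u32],
) -> Option<usize> {
    let bytes = text.as_bytes();
    let n = bytes.len();
    if best.len() <= n || back.len() <= n {
        return None;
    }
    best[..=n].fill(f32::NEG_INFINITY);
    best[0] = 0.0;

    // Forward pass: best[end] is the best score of any segmentation of bytes[..end].
    for start in 0..n {
        if best[start] == f32::NEG_INFINITY {
            continue; // no segmentation reaches this position
        }
        // Toy lookup: a linear scan. A production tokenizer would use a trie here.
        for p in vocab {
            let pb = p.text.as_bytes();
            if !pb.is_empty() && bytes[start..].starts_with(pb) {
                let end = start + pb.len();
                let score = best[start] + p.score;
                if score > best[end] {
                    best[end] = score;
                    back[end] = (start, p.id);
                }
            }
        }
    }
    if best[n] == f32::NEG_INFINITY {
        return None;
    }

    // Backward pass: count the tokens on the best path, then fill `out` in order.
    let mut count = 0;
    let mut pos = n;
    while pos > 0 {
        pos = back[pos].0;
        count += 1;
    }
    if count > out.len() {
        return None;
    }
    let mut pos = n;
    for slot in out[..count].iter_mut().rev() {
        *slot = back[pos].1;
        pos = back[pos].0;
    }
    Some(count)
}
```

With pieces "a" (-2.0), "b" (-2.0), "ab" (-3.0), "c" (-1.0), and "abc" (-6.0), the input "abc" is split as "ab" + "c", because that path scores -4.0 against -5.0 for three single letters and -6.0 for the whole-word piece.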

How does the Perplexity tokenizer achieve faster speeds?

The tokenizer gets its speed from several changes. It performs no heap allocations during encoding, and it replaces standard hash-map lookups with a double-array trie. It also uses bitmap and inline packing to fit critical data into a single cache line, and it uses 2 MB huge pages to reduce the overhead of page-table walks during execution.
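For readers unfamiliar with the data structure, the following is a minimal sketch of a generic double-array trie, not the layout used in pplx-garden. The type name `DoubleArrayTrie`, the first-fit builder, and the `common_prefixes` method are assumptions for illustration. The point it shows is that a lookup becomes two flat array reads per byte (`base` and `check`) instead of a hash computation, and that enumerating every vocabulary piece that prefixes the input needs no allocation.

```rust
use std::collections::{BTreeMap, VecDeque};

const FREE: u32 = u32::MAX; // marks an unused slot in `check`
const NONE: u32 = u32::MAX; // marks a state that does not end a vocabulary piece

/// A transition from state `s` on byte `c` lands at index `base[s] + c`
/// and is valid only if `check` at that index equals `s`.
pub struct DoubleArrayTrie {
    base: Vec<u32>,
    check: Vec<u32>,
    value: Vec<u32>,
}

impl DoubleArrayTrie {
    pub fn build(vocab: &[(&str, u32)]) -> Self {
        // Step 1: an ordinary pointer trie (build time only, allocations are fine).
        let mut children: Vec<BTreeMap<u8, usize>> = vec![BTreeMap::new()];
        let mut ids: Vec<u32> = vec![NONE];
        for &(piece, id) in vocab {
            let mut node = 0;
            for &b in piece.as_bytes() {
                node = match children[node].get(&b).copied() {
                    Some(next) => next,
                    None => {
                        children.push(BTreeMap::new());
                        ids.push(NONE);
                        let next = children.len() - 1;
                        children[node].insert(b, next);
                        next
                    }
                };
            }
            ids[node] = id;
        }
        // Step 2: pack it into flat arrays, choosing each base by first fit.
        // Bases start at 1, so slot 0 stays reserved for the root.
        let mut da = DoubleArrayTrie { base: vec![0], check: vec![FREE], value: vec![NONE] };
        let mut queue = VecDeque::from([(0usize, 0usize)]);
        while let Some((node, state)) = queue.pop_front() {
            let kids = &children[node];
            if kids.is_empty() {
                continue;
            }
            let mut b = 1usize;
            while kids.keys().any(|&c| da.check.get(b + c as usize).map_or(false, |&p| p != FREE)) {
                b += 1;
            }
            let max_byte = *kids.keys().next_back().unwrap();
            let needed = b + max_byte as usize + 1;
            if needed > da.check.len() {
                da.base.resize(needed, 0);
                da.check.resize(needed, FREE);
                da.value.resize(needed, NONE);
            }
            da.base[state] = b as u32;
            for (&c, &child) in kids {
                let t = b + c as usize;
                da.check[t] = state as u32;
                da.value[t] = ids[child];
                queue.push_back((child, t));
            }
        }
        da
    }

    #[inline]
    fn step(&self, state: u32, byte: u8) -> Option<u32> {
        let t = self.base[state as usize] as usize + byte as usize;
        if self.check.get(t) == Some(&state) { Some(t as u32) } else { None }
    }

    /// Calls `visit(length, id)` for every vocabulary piece that is a prefix of `text`.
    pub fn common_prefixes(&self, text: &[u8], mut visit: impl FnMut(usize, u32)) {
        let mut state = 0u32;
        for (i, &c) in text.iter().enumerate() {
            match self.step(state, c) {
                Some(next) => state = next,
                None => return,
            }
            let id = self.value[state as usize];
            if id != NONE {
                visit(i + 1, id);
            }
        }
    }
}
```

A Unigram encoder would call something like `common_prefixes` at each byte offset to enumerate candidate pieces. The first-fit builder above is deliberately simple and makes no attempt at the cache-line packing or huge-page placement the article describes.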

Is the Perplexity Unigram tokenizer open source?

Yes, Perplexity has open-sourced the Rust implementation of this tokenizer. Developers can access the source code through the pplx-garden repository on GitHub. The release is intended to help the community reduce CPU utilization in inference stacks, particularly for small models such as rerankers, where preprocessing often accounts for a significant share of total latency.

How much faster is the Perplexity tokenizer compared to Hugging Face?

In production benchmarks, the Perplexity tokenizer is roughly five times faster than the Hugging Face tokenizers crate. It also outperforms other major implementations, running twice as fast as the native C++ SentencePiece library and 1.5 times faster than the IREE C tokenizer. At an input length of 514 tokens, it completes encoding in 63 microseconds.
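As a rough back-of-the-envelope check, the figures above can be turned into per-token cost and implied baseline times. The sketch below assumes, without confirmation from the source, that each headline speedup applies at the same 514-token input length. The function names are hypothetical and the outputs are estimates derived from the article's numbers, not measurements.

```rust
/// Encode time reported in the article for a 514-token input, in microseconds.
pub const PPLX_ENCODE_US: f64 = 63.0;
/// Input length used for that measurement.
pub const TOKENS: f64 = 514.0;

/// Implied time for an implementation that the article says is `speedup` times slower.
/// Assumption: the ratio holds at this exact input length.
pub fn implied_baseline_us(speedup: f64) -> f64 {
    PPLX_ENCODE_US * speedup
}

/// Average encode cost per token, in nanoseconds.
pub fn ns_per_token() -> f64 {
    PPLX_ENCODE_US * 1_000.0 / TOKENS
}

/// Implied single-stream throughput, in tokens per second.
pub fn tokens_per_second() -> f64 {
    TOKENS / (PPLX_ENCODE_US * 1e-6)
}
```

Under those assumptions, the implied times at 514 tokens are about 315 µs for the Hugging Face tokenizers crate, 126 µs for SentencePiece, and 94.5 µs for the IREE C tokenizer, with Perplexity's tokenizer averaging roughly 123 ns per token.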

Why did Perplexity rebuild the Unigram tokenizer?

Perplexity rebuilt the tokenizer to address a hidden bottleneck: CPU-side tokenization was consuming a meaningful share of total request latency. GPU passes for small models such as rerankers finish in single-digit milliseconds, so the time standard tokenizers spend on the CPU is no longer negligible by comparison. The rebuilt tokenizer reduced CPU utilization in Perplexity's inference stack by a factor of five to six.

Every HeadsUpAI update is written based on its original source and reviewed before it's published.
