Perplexity Productionizes Query-Aware Compression to Slash Search Token Usage by 70 Percent

Q: What is Perplexity query-aware context compression?

Perplexity's query-aware context compression is a new system that extracts the text spans on a web page that are relevant to a user's query. Standard retrieval passes entire documents to an AI model; this system preserves only the vital evidence and discards irrelevant distractors such as ads, navigation menus, and metadata.

Q: How does Perplexity context compression improve search accuracy?

The system improves accuracy by reducing context rot, which occurs when irrelevant information impairs a model's ability to reason. Because it increases the density of vital information per snippet by 63 percent, the answer model can focus on precise evidence. The result is higher benchmark scores and more reliable grounding for the final generated response.

Q: Does Perplexity context compression increase search latency?

No, the system is designed to reduce overall latency. The compression model adds a small processing step, but it runs in under 20 milliseconds (p99). That cost is offset by speed gains during the reasoning phase, because the downstream answer model has up to 70 percent fewer tokens to process before generating a final response.
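
The trade-off described above can be sketched as simple arithmetic. This is a rough illustration, not Perplexity's measurement method: it assumes the answer model's input-processing time scales linearly with token count, and the function names and the throughput figure are hypothetical. Only the 70 percent reduction and the 20 millisecond overhead come from the reported numbers.

```python
def prefill_latency_ms(input_tokens, tokens_per_ms,
                       token_reduction=0.70, compression_ms=20.0):
    """Return (baseline_ms, compressed_ms) for input processing only.

    Assumption: processing time is linear in the number of input tokens.
    tokens_per_ms is a hypothetical throughput, not a published figure.
    """
    baseline = input_tokens / tokens_per_ms
    kept_tokens = input_tokens * (1.0 - token_reduction)
    compressed = compression_ms + kept_tokens / tokens_per_ms
    return baseline, compressed


def break_even_tokens(tokens_per_ms, token_reduction=0.70, compression_ms=20.0):
    """Input size above which the saved processing time exceeds the overhead.

    Solves input_tokens * token_reduction / tokens_per_ms = compression_ms.
    """
    return compression_ms * tokens_per_ms / token_reduction


if __name__ == "__main__":
    base, comp = prefill_latency_ms(input_tokens=20_000, tokens_per_ms=10.0)
    print(f"baseline: {base:.0f} ms, with compression: {comp:.0f} ms")
    print(f"break-even input: {break_even_tokens(10.0):.0f} tokens")
```

Under these assumptions the compression step pays for itself once the retrieved context exceeds a few hundred tokens, which is well below the size of a typical set of web pages.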

Q: Who can use the new Perplexity context compression models?

Perplexity has deployed these models across its entire production stack, meaning they are active for users of the standard search applications. Additionally, the technology is integrated into the Perplexity API Platform, allowing developers using the Agent API to benefit from improved context precision and reduced token costs in their own applications.

Q: How is Perplexity context compression different from summarization?

Generative summarization rewrites source text and can introduce hallucinations. Perplexity instead uses extractive compression, which identifies original spans of text and keeps them verbatim. Keeping the source wording preserves citation fidelity and traceability, making it easier for users to verify that the AI's answer is accurately grounded in the source material.
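
A minimal sketch of the extractive idea follows. It is not Perplexity's model: the learned encoder is replaced here by a plain word-overlap score, and the function names and the 0.5 threshold are hypothetical. What it does show is the property the answer relies on: every kept span is a verbatim slice of the source, with character offsets that can back a citation.

```python
import re


def split_spans(text):
    """Yield (start, end, sentence) tuples with offsets into the original text."""
    for m in re.finditer(r"\S[^.!?\n]*[.!?]?", text):
        yield m.start(), m.end(), m.group()


def words(s):
    """Lowercased alphanumeric tokens, used only for the stand-in scorer."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def compress(query, text, threshold=0.5):
    """Keep sentences whose overlap with the query meets the threshold.

    The overlap score is a stand-in for a learned relevance model.
    Nothing is rewritten: each result is text[start:end], unchanged.
    """
    query_words = words(query)
    kept = []
    for start, end, sentence in split_spans(text):
        overlap = len(query_words & words(sentence)) / max(len(query_words), 1)
        if overlap >= threshold:
            kept.append((start, end, sentence))
    return kept


if __name__ == "__main__":
    page = ("Subscribe to our newsletter today! "
            "The Eiffel Tower is 330 metres tall. "
            "Cookie settings and privacy policy.")
    for start, end, sentence in compress("How tall is the Eiffel Tower?", page):
        print(start, end, sentence)
```

Because the output is a list of offsets into the source rather than new prose, a reader or a downstream check can confirm each span against the original page.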

Perplexity

May 20, 2026 · Updated May 28, 2026

Perplexity launched a new context compression system that surgically extracts query-relevant text from web pages before passing it to its answer models. By culling ads, navigation, and metadata, the system reduces input tokens by up to 70 percent while increasing the density of vital information.

The system replaces traditional snippet generation with a distilled 17-layer pplx-diffusion model, a bidirectional encoder that identifies the sentences essential to a user's request.

Token reduction: Up to 70%
Vital content per snippet: 63% increase
Compression ratio: 50x (SimpleQA)
Inference latency: <20ms (p99)
Model architecture: Distilled 17-layer pplx-diffusion
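
The token-reduction and compression-ratio figures above use different units, and a short conversion may help when comparing them. The sketch below assumes the ratio means input tokens divided by output tokens, which the source does not define; the function names are hypothetical.

```python
def ratio_to_reduction(ratio):
    """Fraction of tokens removed, given a compression ratio (input / output)."""
    return 1.0 - 1.0 / ratio


def reduction_to_ratio(reduction):
    """Compression ratio (input / output), given the fraction of tokens removed."""
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    # Under the assumed definition, 50x corresponds to removing 98% of tokens.
    print(f"50x ratio -> {ratio_to_reduction(50):.0%} reduction")
    # A 70% reduction corresponds to a ratio of about 3.3x.
    print(f"70% reduction -> {reduction_to_ratio(0.70):.1f}x ratio")
```

Read this way, the 50x figure reported for SimpleQA is a much larger reduction than the 70 percent headline number, so the two are likely measured on different workloads or at different points in the pipeline.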

The update addresses context rot, where irrelevant noise in massive context windows degrades a model's reasoning. By removing distractors like ads and navigation text, Perplexity increases the proportion of vital evidence per snippet by 63 percent. This mirrors a broader industry shift toward precision, similar to OpenRouter's reranker API launch.

The compression engine is now live across Perplexity's consumer applications and the Perplexity Agent API's search tools. It operates in under 20 milliseconds, making it fast enough to sit in the real-time serving path. For developers, this translates to higher precision in retrieval-augmented generation (RAG), which grounds AI responses in external data, and significantly lower token consumption.

View the full update on research.perplexity.ai

Perplexity

@perplexity_ai · May 20

We've productionized query-aware compression for faster, cleaner, more-accurate search. Better context is better than more context. Our system cuts context tokens up to 70% while improving answer quality. https://t.co/gmVr3oZRl9

