We've productionized query-aware compression for faster, cleaner, more-accurate search. Better context is better than more context. Our system cuts context tokens up to 70% while improving answer quality. https://t.co/gmVr3oZRl9
Perplexity Productionizes Query-Aware Compression to Slash Search Token Usage by 70 Percent
Perplexity · Updated
Perplexity launched a new context compression system that surgically extracts query-relevant text from web pages before passing it to its answer models. By culling ads, navigation, and metadata, the system reduces input tokens by up to 70 percent while increasing the density of vital information.
The system relies on the pplx-diffusion model, a bidirectional encoder that identifies the sentences essential to a user's request.

- Token reduction: up to 70%
- Vital content per snippet: 63% increase
- Compression ratio: 50x (SimpleQA)
- Inference latency: <20 ms (p99)
- Model architecture: distilled 17-layer pplx-diffusion
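
The figures above mix two units: a percentage reduction and a compression ratio. The short sketch below converts between them. It assumes "compression ratio" means original tokens divided by compressed tokens, which the source does not define, and the function names are purely illustrative.

```python
def reduction_from_ratio(ratio: float) -> float:
    """Fraction of tokens removed, given ratio = original / compressed (assumed definition)."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return 1.0 - 1.0 / ratio


def ratio_from_reduction(reduction: float) -> float:
    """Compression ratio implied by removing `reduction` (0 <= reduction < 1) of the tokens."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# A 50x ratio removes 98% of the tokens it is measured on.
print(round(reduction_from_ratio(50), 4))    # 0.98
# A 70% reduction corresponds to roughly a 3.33x ratio.
print(round(ratio_from_reduction(0.70), 2))  # 3.33
```

Under that assumption, the 50x SimpleQA figure and the "up to 70%" figure cannot describe the same measurement. They presumably refer to different workloads or different stages of the pipeline, though the source does not say which.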
The update addresses context rot, where irrelevant noise in massive context windows degrades a model's reasoning. By removing distractors like ads and navigation text, Perplexity increases the proportion of vital evidence per snippet by 63 percent. This mirrors a broader industry shift toward precision, similar to OpenRouter's reranker API launch.
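
To make the idea of query-aware extraction concrete, here is a toy sketch that keeps only the sentences of a page that overlap with the query. It is not Perplexity's method: a simple word-overlap score stands in for the pplx-diffusion encoder, whose interface the source does not describe, and every name here (`compress`, `threshold`, the sample page) is hypothetical.

```python
import re

# Assumption: a tiny stopword list and word overlap replace a learned relevance model.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "is", "and", "for", "on", "what", "how"}


def tokenize(text: str) -> set[str]:
    """Lowercased word set with a few stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def split_sentences(page: str) -> list[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace, and on newlines."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", page)
    return [p.strip() for p in parts if p.strip()]


def compress(query: str, page: str, threshold: float = 0.5) -> str:
    """Keep sentences that cover at least `threshold` of the query's terms."""
    query_terms = tokenize(query)
    if not query_terms:
        return page  # nothing to match against, so leave the page untouched
    kept = []
    for sentence in split_sentences(page):
        coverage = len(query_terms & tokenize(sentence)) / len(query_terms)
        if coverage >= threshold:
            kept.append(sentence)
    return " ".join(kept)


page = (
    "Home | About | Subscribe\n"
    "The Eiffel Tower is 330 metres tall.\n"
    "Sign up for our newsletter today!\n"
    "It was completed in 1889."
)
compressed = compress("How tall is the Eiffel Tower?", page)
print(compressed)  # The Eiffel Tower is 330 metres tall.
```

The navigation bar and the newsletter prompt are dropped because they share no terms with the query. The last sentence is dropped too, even though it refers to the tower, which illustrates why a lexical heuristic is a poor substitute for a model that reads each sentence in context.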
The compression engine is now live across Perplexity's consumer applications and the Perplexity Agent API's search tools. It runs in under 20 milliseconds at the 99th percentile, fast enough to sit in the real-time serving path. For developers, this translates to higher precision in retrieval-augmented generation (RAG), which grounds AI responses in external data, and significantly lower token consumption.

