We've productionized query-aware compression for faster, cleaner, more-accurate search. Better context is better than more context. Our system cuts context tokens up to 70% while improving answer quality. https://t.co/gmVr3oZRl9
Perplexity Productionizes Query-Aware Compression to Slash Search Token Usage by 70 Percent
pplx-diffusion model, a bidirectional encoder that identifies sentences essential to a user's request.
- Token reduction: up to 70%
- Vital content per snippet: 63% increase
- Compression ratio: 50x (SimpleQA)
- Inference latency: <20 ms (p99)
- Model architecture: distilled 17-layer pplx-diffusion
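The figures above mix two units: a percentage token reduction and a compression ratio. As a rough aid for comparing them, the sketch below converts between the two. It assumes both are measured as input tokens before versus output tokens after compression, which the announcement does not spell out; the function names are illustrative and not part of any Perplexity API.

```python
def reduction_to_ratio(reduction: float) -> float:
    """Convert a fractional token reduction (0.70 = 70%) to a compression ratio."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def ratio_to_reduction(ratio: float) -> float:
    """Convert a compression ratio (50 = 50x) to a fractional token reduction."""
    if ratio < 1.0:
        raise ValueError("ratio must be >= 1")
    return 1.0 - 1.0 / ratio


# Assumption: both metrics compare token counts before and after compression.
print(f"{reduction_to_ratio(0.70):.2f}x")  # 70% fewer tokens as a ratio
print(f"{ratio_to_reduction(50):.0%}")     # 50x compression as a reduction
```

Under that assumption, a 70% reduction corresponds to roughly 3.3x compression, while 50x corresponds to a 98% reduction, so the SimpleQA figure would describe a much more aggressive case than the headline number.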
The update addresses context rot, where irrelevant noise in massive context windows degrades a model's reasoning. By removing distractors like ads and navigation text, Perplexity increases the proportion of vital evidence per snippet by 63 percent. This mirrors a broader industry shift toward precision, similar to OpenRouter's reranker API launch.
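To make the idea of query-aware compression concrete, here is a minimal sketch of sentence-level filtering. It is not Perplexity's method: plain word overlap with the query stands in for the learned bidirectional encoder, and every name here (`compress`, `threshold`, the stopword list) is hypothetical.

```python
import re

# Assumption: a tiny hand-picked stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "is", "and", "for", "on", "what", "how"}


def tokenize(text: str) -> set[str]:
    """Lowercase word set with a few stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def split_sentences(snippet: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", snippet) if s.strip()]


def compress(query: str, snippet: str, threshold: float = 0.5) -> str:
    """Keep only sentences that cover at least `threshold` of the query terms."""
    query_terms = tokenize(query)
    if not query_terms:
        return snippet  # nothing to score against, so leave the snippet unchanged
    kept = []
    for sentence in split_sentences(snippet):
        coverage = len(query_terms & tokenize(sentence)) / len(query_terms)
        if coverage >= threshold:
            kept.append(sentence)
    return " ".join(kept)


snippet = (
    "Subscribe to our newsletter for weekly deals! "
    "The Eiffel Tower is 330 metres tall. "
    "Home | About | Contact us today. "
    "The tower was completed in 1889."
)
print(compress("How tall is the Eiffel Tower?", snippet))
# prints: The Eiffel Tower is 330 metres tall.
```

The ad and navigation sentences share no terms with the query and are dropped, as is the on-topic but unneeded completion date. A production system would replace the overlap score with a model that judges relevance semantically, since word overlap misses paraphrases.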
The compression engine is now live across Perplexity's consumer applications and the Perplexity Agent API's search tools. It runs in under 20 milliseconds at p99, fast enough to sit in the real-time serving path. For developers, this means higher precision in retrieval-augmented generation (RAG, which grounds AI responses in external data) and lower token consumption, with context tokens cut by up to 70 percent.