Details
- Perplexity has deployed a production-grade, query-aware context compression system to improve search speed and answer quality.
- The system reduces context tokens by up to 70%, so the answer-generation model receives as little as 30% of the original context per query.
- Compression is query-aware: it focuses on content relevant to the user’s question, rather than blindly shortening documents.
- Non-essential elements such as ads, navigation UI, metadata, and other unhelpful content are stripped before the context is passed to the answer model.
- Perplexity reports that vital content per snippet increases by 63%, indicating denser, more information-rich inputs to the LLM.
- On the SimpleQA benchmark, the system achieves a 50x compression ratio while maintaining what Perplexity describes as frontier-level performance.
- The company emphasizes that, unlike some prior RAG compression methods, its approach preserves citations so answers can still be grounded in identifiable sources.
- A linked research blog details how the method is engineered to be fast enough to run inside real-time search orchestration.
- Perplexity frames this work as an evolution of existing RAG compression techniques, optimizing specifically for query-awareness, citation integrity, and production latency.
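To make the idea of query-aware, citation-preserving compression concrete, here is a minimal Python sketch. It is not Perplexity's method, which is not described in this summary. It assumes a simple keyword-overlap filter, and the `Snippet` type, the `compress` function, the `STOPWORDS` set, and the example URLs are all hypothetical names introduced for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Illustrative stopword list (an assumption, not from the source).
STOPWORDS = {"a", "an", "the", "is", "of", "to", "in", "how", "what"}


@dataclass
class Snippet:
    source_url: str  # citation that travels with the text
    text: str


def _terms(text: str) -> set[str]:
    """Lowercased content words, with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def compress(query: str, snippets: list[Snippet], min_overlap: int = 1) -> list[Snippet]:
    """Keep only sentences that share at least `min_overlap` terms with the query.

    Each surviving sentence stays attached to its source URL, so the
    compressed context can still be cited.
    """
    query_terms = _terms(query)
    kept = []
    for snippet in snippets:
        sentences = re.split(r"(?<=[.!?])\s+", snippet.text.strip())
        relevant = [s for s in sentences if len(_terms(s) & query_terms) >= min_overlap]
        if relevant:
            kept.append(Snippet(snippet.source_url, " ".join(relevant)))
    return kept


docs = [
    Snippet(
        "https://example.com/a",
        "Subscribe to our newsletter! The Eiffel Tower is 330 metres tall. Click here for deals.",
    ),
    Snippet("https://example.com/b", "Home | About | Contact."),
]
for s in compress("How tall is the Eiffel Tower?", docs):
    print(s.source_url, "->", s.text)
# https://example.com/a -> The Eiffel Tower is 330 metres tall.
```

A production system would presumably use a learned relevance model rather than term overlap. The point of the sketch is only the shape of the pipeline: filter by relevance to the query, and never separate kept text from its source.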
Impact
By making compression query-aware and citation-preserving, Perplexity is addressing a key bottleneck in large-context RAG systems: the trade-off between cost, latency, and grounding quality. This move positions Perplexity more competitively against general-purpose LLM APIs from OpenAI, Anthropic, and Google, particularly for web-scale, citation-heavy search and answer experiences where efficient context use is critical.
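The cost side of this trade-off is easier to compare when the two reported figures are put on the same scale. The short sketch below converts between a token-reduction percentage and a compression ratio. The function names are illustrative, and the summary does not say how the 70% figure and the 50x SimpleQA figure relate to each other, so the code only does the arithmetic.

```python
def ratio_from_reduction(reduction: float) -> float:
    """Compression ratio implied by removing `reduction` (0..1) of the tokens."""
    return 1.0 / (1.0 - reduction)


def reduction_from_ratio(ratio: float) -> float:
    """Fraction of tokens removed at a given compression ratio (e.g. 50 for 50x)."""
    return 1.0 - 1.0 / ratio


print(round(ratio_from_reduction(0.70), 2))  # 3.33 -> a 70% reduction is about 3.3x
print(round(reduction_from_ratio(50), 2))    # 0.98 -> 50x means 98% fewer tokens
```

In other words, the 50x figure reported for SimpleQA corresponds to a far more aggressive reduction than the "up to 70%" figure, so the two numbers should not be read as the same measurement.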
