Perplexity launches production query-aware context compression for faster, higher-precision search

Details

Perplexity has deployed a production-grade, query-aware context compression system to improve search speed and answer quality.
The system reduces context tokens by up to 70%, meaning far less text is sent to the answer-generation model per query.
Compression is query-aware: it focuses on content relevant to the user’s question, rather than blindly shortening documents.
Non-essential elements such as ads, navigation UI, metadata, and other unhelpful content are stripped out before the text is passed to the answer model.
Perplexity reports that vital content per snippet increases by 63%, indicating denser, more information-rich inputs to the LLM.
On the SimpleQA benchmark, the system achieves a 50x compression ratio while maintaining what Perplexity describes as frontier-level performance.
The company emphasizes that, unlike some prior retrieval-augmented generation (RAG) compression methods, its approach preserves citations so answers can still be grounded in identifiable sources.
A linked research blog post describes how the method is engineered to run fast enough to be orchestrated within real-time search workflows.
Perplexity frames this work as an evolution of existing RAG compression techniques, optimizing specifically for query-awareness, citation integrity, and production latency.
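
To make the idea of query-aware, citation-preserving compression concrete, here is a minimal Python sketch. It is not Perplexity's method, which the summary above does not specify. It assumes a simple stand-in technique: score each sentence by word overlap with the query, keep the best sentences within a token budget, and carry each sentence's source identifier along so citations survive. The names `Snippet`, `compress`, `count_tokens`, `reduction_from_ratio`, and the `STOPWORDS` list are all hypothetical, and word counts stand in for real model tokens.

```python
import re
from dataclasses import dataclass

# Hypothetical stopword list; a production system would likely use a
# learned relevance model rather than lexical overlap.
STOPWORDS = {"a", "an", "the", "is", "was", "of", "in", "to", "how", "what"}


@dataclass(frozen=True)
class Snippet:
    source_id: str  # citation key, e.g. a URL or document id
    text: str


def tokenize(text):
    """Lowercase word tokens; a rough proxy for model tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    """Naive sentence splitter on terminal punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_tokens(snippets):
    return sum(len(tokenize(s.text)) for s in snippets)


def compress(query, snippets, budget):
    """Keep query-relevant sentences within `budget` tokens.

    Each kept sentence stays paired with its source_id, so the answer
    model can still cite where the text came from.
    """
    query_terms = set(tokenize(query)) - STOPWORDS
    scored = []
    for doc_idx, snip in enumerate(snippets):
        for sent_idx, sentence in enumerate(split_sentences(snip.text)):
            tokens = tokenize(sentence)
            overlap = len(query_terms & set(tokens))
            if overlap > 0:  # drop sentences unrelated to the query
                scored.append(
                    (overlap, doc_idx, sent_idx, len(tokens),
                     snip.source_id, sentence)
                )
    # Most relevant first; ties fall back to original document order.
    scored.sort(key=lambda c: (-c[0], c[1], c[2]))

    kept, used = [], 0
    for _, doc_idx, sent_idx, n_tokens, source_id, sentence in scored:
        if used + n_tokens <= budget:
            kept.append((doc_idx, sent_idx, source_id, sentence))
            used += n_tokens
    kept.sort()  # restore reading order
    return [Snippet(source_id, sentence) for _, _, source_id, sentence in kept]


def reduction_from_ratio(ratio):
    """Fraction of tokens removed at a given compression ratio."""
    return 1 - 1 / ratio


if __name__ == "__main__":
    docs = [
        Snippet("doc-a", "Subscribe to our newsletter today. "
                         "The Eiffel Tower is 330 metres tall. "
                         "Click here for more offers."),
        Snippet("doc-b", "Home About Contact. The tower was completed in 1889."),
    ]
    compressed = compress("How tall is the Eiffel Tower?", docs, budget=8)
    for snip in compressed:
        print(f"[{snip.source_id}] {snip.text}")
    print("ratio:", count_tokens(docs) / count_tokens(compressed))
```

The helper `reduction_from_ratio` relates the two kinds of figures quoted above: a 50x compression ratio corresponds to removing 98% of tokens, while a 70% reduction corresponds to a ratio of roughly 3.3x.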

Impact

By making compression query-aware and citation-preserving, Perplexity is addressing a key bottleneck in large-context RAG systems: the trade-off between cost, latency, and grounding quality. This move positions Perplexity more competitively against general-purpose LLM APIs from OpenAI, Anthropic, and Google, particularly for web-scale, citation-heavy search and answer experiences where efficient context use is critical.
