Details
- Perplexity released new research detailing its post-training pipeline, which applies supervised fine-tuning (SFT) followed by on-policy reinforcement learning (RL) to adapt AI models for search-augmented answers.
- The pipeline improves search accuracy, citation quality, instruction following, and efficiency; applied to Alibaba's Qwen models, it matches or exceeds GPT models on factuality at lower cost.
- The first stage fine-tunes models to follow instructions, adhere to guardrails, and maintain consistent language.
- The second stage uses RL with a reward design combining correctness, user preference, and efficiency; preferences only factor in for correct answers, so RL never optimizes flawed responses (see the sketch after this list).
- This explains why the same base model delivers more accurate, better-cited, and more efficient answers inside Perplexity than its unmodified counterpart.
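
To make the reward gating concrete, here is a minimal Python sketch under stated assumptions: the weights, the length-based efficiency penalty, and all function and parameter names are illustrative, since the research describes the signals (correctness, preference, efficiency) and the gating behavior, not an exact formula.

```python
# Hypothetical sketch of a gated reward for RL post-training.
# Weights, signal names, and the efficiency penalty are illustrative
# assumptions, not Perplexity's published implementation.

def combined_reward(
    correctness: float,        # 1.0 if the answer is judged factually correct, else 0.0
    preference: float,         # user/rater preference score in [0, 1]
    num_tokens: int,           # length of the generated answer
    w_pref: float = 0.5,       # assumed weight on preference
    w_eff: float = 0.1,        # assumed weight on the efficiency penalty
    target_tokens: int = 512,  # assumed budget beyond which length is penalized
) -> float:
    """Correctness is the base signal; preference only contributes when the
    answer is correct, so the RL stage never optimizes flawed responses."""
    reward = correctness
    if correctness > 0:  # gate: preference counts only for correct answers
        reward += w_pref * preference
    overage = max(0, num_tokens - target_tokens)
    reward -= w_eff * (overage / target_tokens)  # efficiency: penalize overlong answers
    return reward
```

The gate is the key design choice: a fluent but incorrect answer earns no preference credit, so RL cannot learn to polish wrong responses into preferred ones.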
Impact
Perplexity's SFT + RL pipeline positions it to challenge OpenAI and Anthropic by achieving GPT-level factuality on Qwen base models at reduced cost, potentially accelerating adoption of cost-efficient search-augmented AI in enterprise research tools. Reported benchmarks, such as 93.9% SimpleQA accuracy for the related Deep Research feature against ChatGPT's 87.6% citation validity, narrow the gap with frontier models while emphasizing verifiable citations over raw generation. This operational edge could pressure rivals to prioritize post-training for retrieval tasks, shifting market dynamics toward hybrid search-LLM systems amid rising demand for trustworthy AI outputs in professional settings.
