Details
- Perplexity released new research detailing how it serves post-trained Qwen3 235B mixture-of-experts (MoE) models on NVIDIA GB200 NVL72 Blackwell racks for high-throughput inference.
- GB200 significantly outperforms prior Hopper-generation hardware (e.g., H200), excelling in both the prefill (compute-bound) and decode (latency- and memory-bandwidth-bound) phases of LLM inference; a rough intuition for the two regimes is sketched after this list.
- Key improvements include Blackwell Tensor Cores, higher memory bandwidth, faster NVLink interconnects, and NVLink SHARP in-network reductions for prefill; rack-scale NVLink enables decode-time parallelism unavailable on Hopper.
- Benchmarks: NVLink all-reduce latency drops from 586.1µs on H200 to 313.3µs on GB200; MoE prefill combine time falls from 730.1µs to 438.5µs; GB200 also sustains higher throughput at peak per-token speeds (speedup ratios worked out below).
- Optimizations feature prefill/decode disaggregation, Blackwell-native quantization, custom kernels, and rack-scale NVLink, yielding faster responses and lower serving costs.
- Full technical paper linked in thread for detailed methodology and results.
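
For intuition on why prefill tends to be compute-bound while decode is memory-bandwidth-bound, a rough roofline-style estimate can be sketched in a few lines of Python. All numbers here are illustrative assumptions for a generic large MoE deployment (e.g., ~22B active parameters, 8-bit weights, nominal Blackwell-class compute and bandwidth figures), not values taken from Perplexity's paper, and the model ignores KV-cache traffic.

```python
# Roofline-style intuition: prefill amortizes each weight read over many prompt
# tokens (high arithmetic intensity), decode reads the weights for one new token
# per sequence (low intensity). All constants below are illustrative assumptions.

ACTIVE_PARAMS = 22e9      # assumed parameters activated per token for a 235B MoE
BYTES_PER_PARAM = 1       # assumed 8-bit weights
PEAK_FLOPS = 2_000e12     # assumed per-GPU dense throughput (FLOPs/s)
MEM_BW = 8e12             # assumed per-GPU HBM bandwidth (bytes/s)

def arithmetic_intensity(batch_tokens: int) -> float:
    """FLOPs per byte of weight traffic when `batch_tokens` tokens share one weight read."""
    flops = 2 * ACTIVE_PARAMS * batch_tokens        # ~2 FLOPs per active param per token
    bytes_moved = ACTIVE_PARAMS * BYTES_PER_PARAM   # weights streamed once per pass
    return flops / bytes_moved

# FLOPs/byte at which compute and memory bandwidth break even on this (assumed) GPU.
machine_balance = PEAK_FLOPS / MEM_BW

for phase, tokens in [("prefill (2048-token prompt)", 2048), ("decode (1 token/step)", 1)]:
    ai = arithmetic_intensity(tokens)
    regime = "compute-bound" if ai > machine_balance else "memory-bandwidth-bound"
    print(f"{phase}: ~{ai:.0f} FLOPs/byte vs balance ~{machine_balance:.0f} -> {regime}")
```

Under these assumptions prefill lands far above the machine balance point and decode far below it, which is the usual motivation for disaggregating the two phases onto separately tuned pools of GPUs.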
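
The reported microbenchmark numbers translate to roughly 1.9× and 1.7× improvements; a quick check using only the figures quoted above:

```python
# Speedup ratios implied by the H200 -> GB200 latencies quoted in the summary.
benchmarks = {
    "NVLink all-reduce latency": (586.1, 313.3),  # H200 µs -> GB200 µs
    "MoE prefill combine time":  (730.1, 438.5),
}

for name, (h200_us, gb200_us) in benchmarks.items():
    speedup = h200_us / gb200_us          # e.g. 586.1 / 313.3 ≈ 1.87x
    reduction = 1 - gb200_us / h200_us    # fractional latency reduction
    print(f"{name}: {h200_us}µs -> {gb200_us}µs ({speedup:.2f}x faster, {reduction:.0%} lower)")
```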
Impact
Perplexity's benchmarks validate NVIDIA's GB200 NVL72 as a leading platform for large-scale LLM inference, extending its lead over Hopper while cutting per-token serving costs through rack-scale NVLink and software optimizations. This pressures rivals such as AMD's MI300X and custom inference platforms (e.g., xAI's Colossus, AWS Trainium), reinforcing NVIDIA's dominance in production AI serving. Lower latency and higher throughput accelerate real-time applications such as search and agents, widen enterprise access, and align with data-center trends toward Blackwell for cost-efficient scaling over prior generations.
