Details

  • Cursor rebuilt token generation for Mixture-of-Experts (MoE) models on NVIDIA Blackwell GPUs, introducing a technique called warp decode that reorganizes GPU parallelism around outputs rather than experts.
  • Throughput improves by up to 1.84x on Blackwell while accuracy also improves, with outputs 1.4x closer to the full FP32 reference than before.
  • Warp decode eliminates five traditional bookkeeping steps by assigning each GPU warp (a group of 32 parallel threads) exactly one output value; the warp streams weight data directly from memory and accumulates contributions from all eight routed experts without intermediate staging or buffers.
  • The entire MoE compute layer is compressed into two kernels, avoiding padding overhead, scatter-combine operations, and intermediate activation quantization that previously introduced rounding errors across model layers.
  • These inference improvements directly accelerate Composer model training and deployment cycles, enabling Cursor to ship improved model versions more frequently.
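The warp-per-output mapping described above can be illustrated with a small CPU sketch. This is a NumPy simulation of the idea only: function names, tensor shapes, and the top-k routing interface are illustrative assumptions, not Cursor's actual kernel code.

```python
import numpy as np

def warp_decode_sketch(x, expert_weights, topk_ids, topk_gates):
    """Hypothetical CPU analogue of warp decode: each 'warp' owns exactly
    one output value and accumulates contributions from every routed
    expert in a register, with no intermediate staging buffers or
    scatter-combine pass. Shapes are assumptions for illustration.

    x:              (d_in,)                 input activation for one token
    expert_weights: (n_experts, d_out, d_in) per-expert weight matrices
    topk_ids:       (k,)                    indices of the routed experts
    topk_gates:     (k,)                    router gate weights
    """
    d_out = expert_weights.shape[1]
    y = np.zeros(d_out, dtype=np.float32)
    for out_idx in range(d_out):        # one "warp" per output value
        acc = np.float32(0.0)           # register accumulator
        for e, g in zip(topk_ids, topk_gates):
            # stream the weight row for this (expert, output) pair and
            # fold it into the accumulator immediately
            acc += g * np.float32(expert_weights[e, out_idx] @ x)
        y[out_idx] = acc                # single write: no scatter, no combine
    return y
```

On a real GPU the inner dot product would be split across the warp's 32 lanes and reduced with warp shuffles, but the key property is the same: each output is produced once, in registers, from all routed experts.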

Impact

Warp decode represents a meaningful optimization for inference workloads on modern GPUs, combining throughput gains with measurable accuracy improvements—a rare achievement in kernel design. The 1.84x speedup and the elimination of intermediate quantization degradation position Cursor to iterate faster on Composer while reducing computational overhead. This approach may influence how other teams optimize MoE inference on the Blackwell architecture, particularly as MoE models become more prevalent in production. The dual benefit of speed and accuracy addresses two often-competing objectives, making this relevant to any organization running language model inference at scale.