Details

  • NVIDIA Research introduced Guess-Verify-Refine (GVR), a data-aware exact Top-K algorithm optimized for sparse-attention decoding in TensorRT-LLM on NVIDIA Blackwell GPUs.
  • GVR exploits temporal patterns in the previous decode step's Top-K indices to predict and refine the threshold that selects the next step's top-2048 tokens out of tens or hundreds of thousands of indexer scores.
  • The process proceeds in three named phases: Guess (secant interpolation for threshold estimation), Verify (candidate collection above the threshold using cached counts), and Refine (histogram-based exact selection in shared memory).
  • Key optimizations cut global memory traffic to approximately 3N/P + 2C/P passes, eliminating redundant scans and requiring only 2.2–4.7 refinement ("snap") iterations on average.
  • Designed for DeepSeek Sparse Attention (DSA), it addresses Top-K selection as a latency bottleneck in long-context LLM serving, delivering a 1.88x speedup in Top-K attention.
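The guess-verify-refine loop above can be illustrated with a simplified, single-threaded sketch. This is not NVIDIA's kernel: the real implementation runs these phases in parallel CUDA code with cached counts and a shared-memory histogram, and the function name, iteration budget, and NumPy fallback below are all assumptions made for illustration.

```python
import numpy as np

def gvr_topk(scores, k, prev_threshold=None):
    """Illustrative Guess-Verify-Refine Top-K (single-threaded sketch,
    not the TensorRT-LLM kernel)."""
    scores = np.asarray(scores, dtype=np.float32)

    # Guess: warm-start from the previous decode step's threshold
    # (temporal reuse), then run a few secant-style interpolations on
    # (threshold, count) pairs toward a threshold admitting ~k candidates.
    lo, hi = float(scores.min()), float(scores.max())
    t = prev_threshold if prev_threshold is not None else (lo + hi) / 2
    c = int((scores >= t).sum())
    t2 = hi if c > k else lo
    for _ in range(4):                      # small fixed iteration budget
        c2 = int((scores >= t2).sum())
        if c2 == c:                         # flat segment: stop interpolating
            break
        t_new = t + (k - c) * (t2 - t) / (c2 - c)
        t, c = t2, c2
        t2 = t_new
    threshold = min(t, t2)                  # take the looser of the last two

    # Verify: gather candidates above the threshold; if the guess was too
    # tight, fall back to the full index range so the result stays exact.
    cand = np.flatnonzero(scores >= threshold)
    if len(cand) < k:
        cand = np.arange(len(scores))

    # Refine: exact Top-K over the (small) candidate set, standing in for
    # the histogram-based shared-memory refinement of the real kernel.
    top = cand[np.argsort(-scores[cand])[:k]]
    return threshold, np.sort(top)
```

Because the Refine step does an exact selection over the verified candidates (with a full-range fallback), the result matches a brute-force Top-K; the guess only determines how small the candidate set is, which is where the memory savings come from.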

Impact

GVR advances NVIDIA's edge in efficient LLM inference on Blackwell, outpacing generic Top-K methods such as radix select by minimizing memory accesses and exploiting the temporal data reuse specific to sparse-attention decoding. This pressures competing accelerators such as AMD's MI300X and Intel's Gaudi 3, which lack equivalent Blackwell-optimized sparse decoding. By slashing decode latency in long-context scenarios, it accelerates adoption of models like DeepSeek, lowering costs for high-throughput serving and widening access to extended context windows amid rising demand for agentic AI.