HeadsUpAI

NVIDIA Research Unveils GVR Algorithm for 1.88x Faster Top-K Selection on Blackwell


NVIDIA Research introduced Guess-Verify-Refine (GVR), a hardware-aware algorithm designed to accelerate sparse-attention decoding on Blackwell GPUs. It targets the Top-K selection stage, which identifies the most relevant data points for each token and becomes a bottleneck as context windows expand. GVR achieves a 1.88x average speedup for this specific operator.
Top-K operator speedup: 1.88x
End-to-end latency improvement: 9.3%
Target hardware: NVIDIA Blackwell
Validated model: DeepSeek-V3.2
Software integration: TensorRT-LLM

As models like DeepSeek-V3.2 push toward massive context lengths, the cost of selecting attention indices can outweigh the attention computation itself. This update follows NVIDIA's Dynamo inference stack rebuild and DeepSeek-V4-Pro's Blackwell performance milestones. By exploiting temporal correlation (reusing selection patterns across decode steps), GVR delivers bit-exact results while running the Top-K operator 1.88x faster on average.

You can access GVR through the latest TensorRT-LLM integration, optimized for DeepSeek Sparse Attention workloads. The algorithm improves end-to-end latency by up to 9.3% in low-latency serving for tasks involving 100K tokens or more. While validated on Blackwell, the principle may extend to other sparse-attention decoders.

NVIDIA AI (@NVIDIAAI) on X:

What if every decode step gave the next one a head start? Meet Guess-Verify-Refine — a new hardware-aware sparse-attention algorithm from NVIDIA Research. Built for TensorRT LLM on Blackwell, it reuses temporal patterns across decode steps for: → 1.88x faster Top-K attention → 9.3% better end-to-end latency in low-latency serving Dive into the paper: https://t.co/quu7wX9sCh


Still wondering? A few quick answers below.

Guess-Verify-Refine is a hardware-aware algorithm developed by NVIDIA Research to optimize sparse-attention decoding. It specifically targets the Top-K selection stage, which identifies the most important data points for an AI model to process. By making this stage more efficient, the algorithm reduces latency bottlenecks that typically occur when serving large language models with long context windows.

The algorithm improves speed by exploiting temporal correlation: it reuses the attention pattern selected in the previous decoding step to predict the selection for the current one. It uses a three-step process: guessing the important data points based on previous results, verifying those candidates through a fast counting method, and refining the final selection within the GPU's shared memory so the result stays exact.
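
The article does not reproduce the actual GPU kernels, so the following is only a minimal NumPy sketch of the guess, verify, refine idea under stated assumptions. The function names (`exact_topk`, `gvr_topk`), the `budget_factor` parameter, and the fallback policy are all hypothetical; the real implementation runs on the GPU and is likely organized quite differently.

```python
import numpy as np


def exact_topk(scores, k):
    """Baseline: indices of the k largest scores (unordered)."""
    return np.argpartition(scores, -k)[-k:]


def gvr_topk(scores, prev_idx, k, budget_factor=4):
    """Illustrative guess-verify-refine Top-K (hypothetical, CPU-only sketch).

    scores        : 1-D array of relevance scores for the current decode step
    prev_idx      : Top-K indices chosen at the previous step, or None
    budget_factor : assumed cap on candidates, as a multiple of k
    """
    if prev_idx is None or len(prev_idx) != k:
        return exact_topk(scores, k)  # no usable guess yet

    # Guess: take the weakest current score among last step's winners
    # as a threshold for this step.
    threshold = scores[prev_idx].min()

    # Verify: a single counting pass tells us how many entries pass.
    mask = scores >= threshold
    count = int(mask.sum())
    if count < k or count > budget_factor * k:
        return exact_topk(scores, k)  # guess unusable, fall back

    # Refine: exact selection, but only over the small candidate set.
    candidates = np.flatnonzero(mask)
    local = np.argpartition(scores[candidates], -k)[-k:]
    return candidates[local]
```

In this sketch, the k previous winners all score at least the threshold, so at least k entries pass it and the candidate set always contains the true Top-K. The refined result therefore matches a full selection whenever the scores are distinct, and a poor guess only costs speed, not correctness.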

On NVIDIA Blackwell hardware, the Guess-Verify-Refine algorithm delivers a 1.88x average speedup for the Top-K operator compared to standard production kernels. In low-latency serving scenarios, this translates to an end-to-end latency improvement of up to 9.3%. These gains are particularly noticeable in long-context workloads, such as those involving 100,000 tokens or more, where selection overhead is highest.
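
As a rough back-of-envelope check, the two figures can be related with Amdahl's law. This sketch assumes that the 9.3% is a reduction in step latency and that both numbers describe the same workload, neither of which the source states (1.88x is an average, 9.3% a best case). The function names are hypothetical.

```python
def latency_reduction(op_speedup, fraction):
    """Fractional end-to-end latency reduction when an operator that takes
    `fraction` of the baseline time becomes `op_speedup` times faster."""
    return fraction * (1.0 - 1.0 / op_speedup)


def implied_operator_fraction(op_speedup, reduction):
    """Invert the relation above: the share of baseline time the operator
    must have occupied to produce the given end-to-end reduction."""
    return reduction / (1.0 - 1.0 / op_speedup)


# Under the stated assumptions, how large would Top-K have to be?
fraction = implied_operator_fraction(1.88, 0.093)
print(f"implied Top-K share of baseline latency: {fraction:.1%}")
```

Under those assumptions, Top-K selection would account for roughly a fifth of the baseline decode latency, which is consistent with the claim that selection overhead is significant at long context lengths.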

NVIDIA validated the Guess-Verify-Refine algorithm using DeepSeek-V3.2 workloads. The algorithm is specifically designed to work with the DeepSeek Sparse Attention indexer, which uses specialized mathematical structures to manage data. By integrating with these structures, GVR maintains bit-exact mathematical accuracy while significantly reducing the time required to process the model's complex attention mechanisms.

Developers can access the Guess-Verify-Refine algorithm through NVIDIA TensorRT-LLM, where it has been integrated into the DeepSeek Sparse Attention stack. It is currently optimized for the NVIDIA Blackwell architecture to take advantage of specific hardware features. While the initial implementation focuses on Blackwell and DeepSeek, the underlying principles of temporal correlation may eventually apply to other sparse-attention decoders.
