What if every decode step gave the next one a head start? Meet Guess-Verify-Refine, a new hardware-aware sparse-attention algorithm from NVIDIA Research. Built for TensorRT LLM on Blackwell, it reuses temporal patterns across decode steps for:
→ 1.88x faster Top-K attention
→ 9.3% better end-to-end latency in low-latency serving
Dive into the paper: https://t.co/quu7wX9sCh
NVIDIA Research Unveils GVR Algorithm for 1.88x Faster Top-K Attention on Blackwell
Guess-Verify-Refine (GVR) targets the Top-K selection stage (identifying the most relevant data points for each token), which becomes a bottleneck as context windows expand. It achieves a 1.88x speedup for this specific operator.
- Top-K operator speedup: 1.88x
- End-to-end latency improvement: 9.3%
- Target hardware: NVIDIA Blackwell
- Validated model: DeepSeek-V3.2
- Software integration: TensorRT-LLM
As models like DeepSeek-V3.2 push toward massive context lengths, the overhead of selecting and managing attention indices can outweigh the attention computation itself. This update follows NVIDIA's Dynamo inference stack rebuild and DeepSeek-V4-Pro's Blackwell performance milestones. By exploiting temporal correlation (the Top-K pattern changes little from one decode step to the next), GVR delivers bit-exact results with lower latency.
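To make the temporal-reuse idea concrete, here is a minimal Python sketch of one way a guess/verify/refine pattern can produce an exact Top-K. It is an illustration only, not NVIDIA's kernel or the method from the paper: the function names `exact_top_k` and `gvr_top_k`, the threshold-based verification, and the synthetic score drift are all assumptions made for this example.

```python
import random


def exact_top_k(scores, k):
    """Reference Top-K: indices of the k largest scores, ties broken by lower index."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


def gvr_top_k(scores, k, prev_indices):
    """Hypothetical guess/verify/refine Top-K (illustrative, not the paper's algorithm).

    Returns (indices, candidate_count), where candidate_count is the number of
    entries that survived the verify pass and had to be ranked.
    """
    # Guess: reuse the previous step's Top-K indices that are still valid positions.
    guess = {i for i in prev_indices if 0 <= i < len(scores)}
    if len(guess) < k:
        # No usable guess (for example, the first decode step): fall back to a full sort.
        return exact_top_k(scores, k), len(scores)

    # At least k entries score >= threshold, so the true k-th largest score is
    # also >= threshold and nothing below it can belong to the Top-K.
    threshold = min(scores[i] for i in guess)

    # Verify: one linear pass keeps only entries that could still be in the Top-K.
    candidates = [i for i, s in enumerate(scores) if s >= threshold]

    # Refine: if exactly k survive, the guess was already correct; otherwise
    # rank only the (usually small) candidate set.
    if len(candidates) == k:
        return candidates, k
    top = sorted(candidates, key=lambda i: (-scores[i], i))[:k]
    return sorted(top), len(candidates)


# Usage: simulate two consecutive decode steps with slowly drifting scores.
random.seed(0)
k = 16
scores = [random.random() for _ in range(1000)]
prev = exact_top_k(scores, k)
# Next step: every score drifts slightly and one new token is appended.
scores = [s + random.uniform(-0.01, 0.01) for s in scores] + [random.random()]
indices, candidate_count = gvr_top_k(scores, k, prev)
assert indices == exact_top_k(scores, k)  # same result as the full sort
```

The design point of the sketch is that the previous step's selection gives a lower bound on the k-th largest score, so most entries can be rejected with a single comparison and only the survivors need ranking, while the result stays identical to a full Top-K.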
You can access GVR through the latest TensorRT-LLM integration, optimized for DeepSeek Sparse Attention workloads. The algorithm improves end-to-end latency by up to 9.3% in low-latency serving for tasks involving 100K tokens or more. While validated on Blackwell, the principle may extend to other sparse-attention decoders.
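A 1.88x operator speedup translates into a single-digit end-to-end gain because Top-K selection is only one part of each decode step. The sketch below applies Amdahl's law to show the relationship. The operator fractions used are hypothetical inputs chosen for illustration, not measurements from the paper or from TensorRT-LLM, and the function name is an assumption.

```python
def end_to_end_latency_reduction(operator_fraction, operator_speedup):
    """Fractional drop in total latency when one operator gets faster (Amdahl's law).

    operator_fraction: share of baseline step time spent in the operator (0..1).
    operator_speedup:  how many times faster the operator becomes (> 0).
    """
    if not 0.0 <= operator_fraction <= 1.0:
        raise ValueError("operator_fraction must be between 0 and 1")
    if operator_speedup <= 0:
        raise ValueError("operator_speedup must be positive")
    # Unaffected work stays the same; the operator's share shrinks by the speedup.
    new_total = (1.0 - operator_fraction) + operator_fraction / operator_speedup
    return 1.0 - new_total


# Hypothetical shares of step time spent in Top-K selection (not measured values).
for fraction in (0.05, 0.10, 0.20):
    reduction = end_to_end_latency_reduction(fraction, 1.88)
    print(f"Top-K share {fraction:.0%}: end-to-end latency drops {reduction:.1%}")
```

Under this model the end-to-end benefit grows with the share of time spent in Top-K selection, which fits the article's point that the gain matters most at long context lengths.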