NVIDIA Research Unveils GVR Algorithm for 1.88x Faster Top-K Selection on Blackwell

NVIDIA

May 8, 2026 · Updated Jun 7, 2026

NVIDIA Research developed Guess-Verify-Refine, a hardware-aware algorithm that speeds up the selection of important data points during AI reasoning. By reusing patterns from previous steps, the system reduces latency for long-context models on Blackwell GPUs without sacrificing mathematical accuracy.

NVIDIA Research introduced Guess-Verify-Refine (GVR), a hardware-aware algorithm designed to accelerate sparse-attention decoding on Blackwell GPUs. The system targets the Top-K selection stage, which identifies the most relevant data points for each token and becomes a bottleneck as context windows expand. GVR achieves a 1.88x average speedup for this specific operator.
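For readers who want to see what the Top-K stage computes, here is a minimal reference sketch in plain Python. It is an illustration only, not NVIDIA's kernel: the function name `top_k_indices` and the toy scores are assumptions made for this example.

```python
import heapq
from typing import List, Sequence


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k largest scores, in ascending index order.

    Hypothetical reference helper: one relevance score per context position,
    such as a sparse-attention indexer might produce for the current token.
    """
    # heapq.nlargest scans every position, so the work grows with context length.
    best = heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])
    return sorted(best)


# Toy relevance scores for a four-position context (made-up values).
print(top_k_indices([0.1, 0.9, 0.4, 0.7], 2))  # prints [1, 3]
```

Because this selection is repeated for every generated token, its cost scales with the length of the context, which is why it becomes a bottleneck for long-context serving.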

Top-K operator speedup: 1.88x
End-to-end latency improvement: 9.3%
Target hardware: NVIDIA Blackwell
Validated model: DeepSeek-V3.2
Software integration: TensorRT-LLM

As models like DeepSeek-V3.2 push toward massive context lengths, the cost of selecting attention indices can outweigh the attention computation itself. This update follows NVIDIA's Dynamo inference stack rebuild and DeepSeek-V4-Pro's Blackwell performance milestones. By exploiting temporal correlation, meaning it reuses selection patterns across decode steps, GVR delivers bit-exact results with lower Top-K latency.
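Temporal correlation can be illustrated by measuring how much the Top-K sets of two consecutive decode steps overlap. The sketch below uses plain Python and made-up scores; `top_k_set` and `overlap_fraction` are hypothetical helpers, and the example says nothing about how much overlap real workloads show.

```python
from typing import Sequence, Set


def top_k_set(scores: Sequence[float], k: int) -> Set[int]:
    """Indices of the k largest scores (ties go to the lower index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(order[:k])


def overlap_fraction(prev_scores: Sequence[float],
                     next_scores: Sequence[float],
                     k: int) -> float:
    """Share of the previous step's Top-K indices that are selected again."""
    if k <= 0:
        raise ValueError("k must be positive")
    prev_top = top_k_set(prev_scores, k)
    next_top = top_k_set(next_scores, k)
    return len(prev_top & next_top) / k


# Made-up scores: the context grows by one position between the two steps.
prev_step = [0.9, 0.1, 0.8, 0.2]
next_step = [0.85, 0.15, 0.1, 0.7, 0.3]
print(overlap_fraction(prev_step, next_step, 2))  # prints 0.5
```

The higher this overlap, the more useful the previous step's selection is as a starting guess for the next one.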

You can access GVR through the latest TensorRT-LLM integration, optimized for DeepSeek Sparse Attention workloads. The algorithm improves end-to-end latency by up to 9.3% in low-latency serving for tasks involving 100K tokens or more. While validated on Blackwell, the principle may extend to other sparse-attention decoders.

View the full update on arxiv.org

NVIDIA AI

@NVIDIAAI · May 7

What if every decode step gave the next one a head start? Meet Guess-Verify-Refine, a new hardware-aware sparse-attention algorithm from NVIDIA Research. Built for TensorRT LLM on Blackwell, it reuses temporal patterns across decode steps for:

→ 1.88x faster Top-K attention
→ 9.3% better end-to-end latency in low-latency serving

Dive into the paper: https://t.co/quu7wX9sCh

Still wondering? A few quick answers below.

What is the NVIDIA Guess-Verify-Refine algorithm?

Guess-Verify-Refine is a hardware-aware algorithm developed by NVIDIA Research to optimize sparse-attention decoding. It specifically targets the Top-K selection stage, which identifies the most important data points for an AI model to process. By making this stage more efficient, the algorithm reduces latency bottlenecks that typically occur when serving large language models with long context windows.

How does the Guess-Verify-Refine algorithm improve inference speed?

The algorithm improves speed by exploiting temporal correlation, which means it reuses the attention patterns from previous decoding steps to predict the next one. It uses a three-step process: guessing the important data points based on previous results, verifying those candidates through a fast counting method, and refining the final selection within the GPU's shared memory for exact accuracy.
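The three steps can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not NVIDIA's kernel: it assumes the guess is the previous step's Top-K indices, that verification is a single counting pass against a threshold taken from that guess, and that refinement is an exact selection over the surviving candidates. The names `exact_top_k` and `gvr_top_k` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def exact_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Reference selection over the full context (ties go to the lower index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


def gvr_top_k(scores: Sequence[float], k: int,
              prev_idx: Sequence[int]) -> Tuple[List[int], int]:
    """Return (Top-K indices, number of candidates that needed exact selection)."""
    n = len(scores)
    guess = [i for i in set(prev_idx) if 0 <= i < n]
    if len(guess) < k:
        # No usable guess (for example, the first decode step): full selection.
        return exact_top_k(scores, k), n

    # Guess: the weakest score among the previous winners is a lower bound
    # on the current k-th largest score, because at least k scores reach it.
    threshold = min(scores[i] for i in guess)

    # Verify: one counting pass keeps every position that reaches the threshold.
    candidates = [i for i in range(n) if scores[i] >= threshold]

    # Refine: exact selection over the (hopefully small) candidate set.
    candidates.sort(key=lambda i: (-scores[i], i))
    return sorted(candidates[:k]), len(candidates)


if __name__ == "__main__":
    rng = random.Random(0)  # made-up scores, for illustration only
    step1 = [rng.random() for _ in range(2000)]
    prev = exact_top_k(step1, 32)
    # Next step: scores drift slightly and the context grows by one position.
    step2 = [s + rng.gauss(0.0, 0.01) for s in step1] + [rng.random()]
    selected, n_candidates = gvr_top_k(step2, 32, prev)
    print(selected == exact_top_k(step2, 32), n_candidates, len(step2))
```

In this sketch the result always matches the full selection, since every true Top-K entry must reach the threshold; the saving comes from the refine step sorting only the candidates rather than the whole context.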

What performance gains does GVR provide on NVIDIA Blackwell GPUs?

On NVIDIA Blackwell hardware, the Guess-Verify-Refine algorithm delivers a 1.88x average speedup for the Top-K operator compared to standard production kernels. In low-latency serving scenarios, this translates to an improvement of up to 9.3% in end-to-end latency. These gains are particularly noticeable in long-context workloads, such as those involving 100,000 tokens or more, where selection overhead is highest.
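The gap between the operator speedup and the end-to-end figure follows from how much of each decode step the Top-K stage occupies. As a back-of-the-envelope sketch, assuming a simple Amdahl's-law model and reading the 9.3% figure as a reduction in step latency (both are assumptions, and neither function comes from the paper), the relationship can be written as:

```python
def end_to_end_reduction(top_k_fraction: float, operator_speedup: float) -> float:
    """Fractional latency reduction when only the Top-K share gets faster."""
    if operator_speedup <= 0:
        raise ValueError("operator_speedup must be positive")
    return top_k_fraction * (1.0 - 1.0 / operator_speedup)


def implied_top_k_fraction(reduction: float, operator_speedup: float) -> float:
    """Invert the model: the Top-K share of latency implied by a reduction."""
    if operator_speedup <= 1.0:
        raise ValueError("operator_speedup must be greater than 1")
    return reduction / (1.0 - 1.0 / operator_speedup)


# Figures quoted in this article, combined purely for illustration.
share = implied_top_k_fraction(0.093, 1.88)
print(f"{share:.1%}")  # prints 19.9%
```

Under these assumptions, the quoted numbers would be consistent with Top-K selection taking roughly a fifth of the decode step before the optimization. This is an illustrative estimate only: the 1.88x figure is an average and the 9.3% figure is a best case, so the two were not necessarily measured on the same configuration.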

Is the Guess-Verify-Refine algorithm compatible with DeepSeek models?

Yes, NVIDIA validated the Guess-Verify-Refine algorithm using DeepSeek-V3.2 workloads. The algorithm is specifically designed to work with the DeepSeek Sparse Attention indexer, which uses specialized mathematical structures to manage data. By integrating with these structures, GVR maintains bit-exact mathematical accuracy while significantly reducing the time required to process the model's complex attention mechanisms.

How can developers use the Guess-Verify-Refine algorithm?

Developers can access the Guess-Verify-Refine algorithm through NVIDIA TensorRT-LLM, where it has been integrated into the DeepSeek Sparse Attention stack. It is currently optimized for the NVIDIA Blackwell architecture to take advantage of specific hardware features. While the initial implementation focuses on Blackwell and DeepSeek, the underlying principles of temporal correlation may eventually apply to other sparse-attention decoders.

Every HeadsUpAI update is written based on its original source and reviewed before it's published.

Keep reading

NVIDIA Accelerates Reasoning Model Training With Speculative Decoding Rollouts

NVIDIA Research integrated speculative decoding into the NeMo-RL training framework to remove the bottleneck of autoregressive rollout generation. By using a vLLM backend to accelerate response generation during reinforcement learning, the system delivers up to a 1.8x throughput increase without altering the model's output distribution.

Perplexity Benchmarks Blackwell Performance for High Throughput Large Model Inference

Perplexity · May 12

Perplexity published research showing that NVIDIA's GB200 Blackwell architecture nearly halves communication latency for large Mixture-of-Experts models compared to the previous generation. The findings suggest that Blackwell is a primary platform for reducing the cost and latency of serving frontier-scale AI search.

Qwen Sets 580 TPS Record for Agentic Workloads on Blackwell GPUs

Qwen · May 27

Qwen achieved a record 580 tokens per second running its Qwen3.5-397B-A17B model on NVIDIA Blackwell GPUs using the TokenSpeed inference engine. The optimization targets agentic workloads, where multi-turn reasoning and tool-calling typically suffer from high latency. By combining a hybrid attention architecture with deep kernel fusion, the system maintains high throughput even as context scales to one million tokens.

Cursor Autonomously Optimizes NVIDIA CUDA Kernels for 38 Percent Speedup

Cursor · Apr 15

Cursor partnered with NVIDIA to apply a multi-agent system to CUDA kernel optimization, achieving a 38 percent geomean speedup on Blackwell GPUs. This demonstrates that autonomous agents can solve complex hardware engineering tasks that previously required months of manual effort from human experts.
