NVIDIA Blackwell Ultra Powers DeepSeek V4 Pro at 150 Tokens Per Second

LLM
AI Agent
AI Hardware
Performance

NVIDIA has released performance benchmarks for DeepSeek-V4-Pro, the 1.6-trillion-parameter flagship model launched as a preview by DeepSeek. Running on Blackwell Ultra hardware, the model achieves over 150 tokens per second (TPS) per user. This throughput is delivered through the vLLM framework for inference (the process of running a trained model).

Inference speed is the primary bottleneck for agentic workflows, which require multiple sequential reasoning steps. Achieving 150 TPS on a model of this scale demonstrates that massive Mixture-of-Experts architectures can power responsive, autonomous systems, consistent with the Blackwell scaling results seen in recent industry benchmarks.

You can start building with these models today through the LMSYS and vLLM projects. Performance is expected to increase as NVIDIA integrates Dynamo 1.0 and NVFP4 (a 4-bit floating-point format). These optimizations will further reduce the compute overhead for the model's million-token context window.
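A rough sense of why a 4-bit format matters at this scale: weight memory scales linearly with bits per parameter. The sketch below uses the article's 1.6-trillion-parameter figure; the precision comparison points are nominal, and real deployments add overhead for quantization scales, activations, and the KV cache.

```python
def weight_memory_gb(params: float, bits_per_param: float) -> float:
    """Approximate weight-only memory footprint in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

PARAMS = 1.6e12  # 1.6 trillion parameters

bf16_gb = weight_memory_gb(PARAMS, 16)  # 3200 GB
fp8_gb = weight_memory_gb(PARAMS, 8)    # 1600 GB
nvfp4_gb = weight_memory_gb(PARAMS, 4)  # 800 GB
```

Halving bits per parameter halves the memory that must be streamed per decoded token, which is the main reason low-precision formats like NVFP4 raise achievable TPS on memory-bound decode workloads.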

Frequently asked questions

What is the performance of DeepSeek-V4-Pro on NVIDIA Blackwell Ultra?
DeepSeek-V4-Pro achieves over 150 tokens per second per user on NVIDIA Blackwell Ultra hardware. This high level of interactivity is available out of the box and is specifically optimized for agentic workflows, which require multiple steps of reasoning and tool use to complete complex tasks autonomously without constant human direction.
What is DeepSeek-V4-Pro?
DeepSeek-V4-Pro is a large-scale Mixture of Experts model featuring 1.6 trillion parameters and a massive one-million-token context window. It is designed for high-performance reasoning and agentic tasks, where the model must process vast amounts of information and execute multi-step actions to achieve goals independently rather than just responding to prompts.
How does NVIDIA plan to increase DeepSeek-V4-Pro inference speeds?
NVIDIA expects performance to improve beyond the initial 150 tokens per second by implementing several advanced technologies. These include Dynamo, an open-source inference operating system, and NVFP4, a specialized 4-bit floating-point data type. Advanced parallelization techniques through the vLLM project will also be used to further accelerate the model's total throughput.
Is DeepSeek-V4-Pro available for developers to use?
Yes, developers can begin building with DeepSeek-V4-Pro immediately. The model is supported through the vLLM inference framework and the LMSYS project. These platforms allow users to deploy the 1.6-trillion parameter model on compatible hardware to take advantage of its million-token context window and optimized performance for building autonomous AI agent applications.
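For developers starting from the vLLM route mentioned above, vLLM exposes an OpenAI-compatible HTTP server (launched with its `vllm serve` command). A minimal chat-completions request body might look like the sketch below; the model identifier is a placeholder assumption, not a confirmed published name.

```python
import json

# Minimal chat-completions payload for a vLLM OpenAI-compatible
# server. POST this to the server's /v1/chat/completions endpoint.
payload = {
    "model": "deepseek-ai/DeepSeek-V4-Pro",  # hypothetical model ID
    "messages": [
        {"role": "user", "content": "Plan the next step of this task."}
    ],
    "max_tokens": 256,
    "temperature": 0.2,
}

body = json.dumps(payload)
```

Because the server speaks the OpenAI wire format, existing agent frameworks and client SDKs can typically point at it by changing only the base URL and model name.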