NVIDIA Blackwell Ultra Powers DeepSeek V4 Pro at 150 Tokens Per Second

NVIDIANVIDIA

NVIDIA reported that DeepSeek-V4-Pro achieves over 150 tokens per second on Blackwell Ultra hardware. This performance level makes 1.6-trillion parameter models viable for real-time autonomous agents. Future software updates like Dynamo and NVFP4 are expected to push these speeds even higher.

NVIDIA released performance benchmarks for DeepSeek-V4-Pro, the 1.6-trillion parameter flagship model. The results follow the launch of the V4 preview by DeepSeek. Running on Blackwell Ultra hardware, the model achieves over 150 tokens per second (TPS) for user interactivity using the vLLM inference framework (the process of running a trained model).
Model
DeepSeek-V4-Pro
Parameters
1.6 trillion
Context window
1 million tokens
Hardware
NVIDIA Blackwell Ultra
Throughput
150+ tokens per second
Software support
vLLM, LMSYS

High-speed inference is the primary bottleneck for agentic workflows requiring multiple reasoning steps. Achieving 150 TPS on a model of this scale proves that massive Mixture-of-Experts architectures can power responsive, autonomous systems. This performance level is part of a broader shift toward day-zero support for frontier models in open inference frameworks.

You can start building with these models today through the LMSYS and vLLM projects. Performance will increase as NVIDIA extends its software stack with Dynamo 1.0 and NVFP4 (a 4-bit floating-point format). These optimizations will further reduce the compute overhead for the model's million-token context window.

NVIDIA AI
NVIDIA AI
@NVIDIAAI
X

✨ DeepSeek-V4 is here — a million-token context, 1.6T parameter powerhouse optimized for agentic workflows. Out of the box, on DeepSeek-V4-Pro, NVIDIA Blackwell Ultra delivers over 150 TPS/user interactivity for agentic workflows. And we’re just getting started. Expect these performance figures to climb higher as we implement Dynamo, NVFP4, and advanced parallelization techniques. Start building today with @lmsysorg and @vllm_project

15retweets230likes
View on X

Still wondering? A few quick answers below.

DeepSeek-V4-Pro achieves over 150 tokens per second per user on NVIDIA Blackwell Ultra hardware. This high level of interactivity is available out of the box and is specifically optimized for agentic workflows, which require multiple steps of reasoning and tool use to complete complex tasks autonomously without constant human direction.

DeepSeek-V4-Pro is a large-scale Mixture of Experts model featuring 1.6 trillion parameters and a massive one-million-token context window. It is designed for high-performance reasoning and agentic tasks, where the model must process vast amounts of information and execute multi-step actions to achieve goals independently rather than just responding to prompts.

NVIDIA expects performance to improve beyond the initial 150 tokens per second by implementing several advanced technologies. These include Dynamo, an open-source inference operating system, and NVFP4, a specialized 4-bit floating-point data type. Advanced parallelization techniques through the vLLM project will also be used to further accelerate the model's total throughput.

Yes, developers can begin building with DeepSeek-V4-Pro immediately. The model is supported through the vLLM inference framework and the LMSYS project. These platforms allow users to deploy the 1.6-trillion parameter model on compatible hardware to take advantage of its million-token context window and optimized performance for building autonomous AI agent applications.

Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →

Share this update