NVIDIA Blackwell Ultra Powers DeepSeek V4 Pro at 150 Tokens Per Second

LLM
AI Agent
AI Hardware
Performance

NVIDIA has released performance benchmarks for DeepSeek-V4-Pro, the 1.6-trillion-parameter flagship model launched as a preview by DeepSeek. Running on Blackwell Ultra hardware, the model achieves over 150 tokens per second (TPS) per user. This throughput is delivered through the vLLM framework for inference (the process of running a trained model).

Inference speed is the primary bottleneck for agentic workflows, which require multiple sequential reasoning steps. Achieving 150 TPS on a model of this scale demonstrates that massive Mixture-of-Experts architectures can power responsive, autonomous systems, consistent with the Blackwell scaling results seen in recent industry benchmarks.

You can start building with these models today through the LMSYS and vLLM projects. Performance is expected to increase as NVIDIA integrates Dynamo 1.0 and NVFP4 (a 4-bit floating-point format). These optimizations will further reduce the compute overhead for the model's million-token context window.
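A rough sense of why a 4-bit format matters at this scale: weight memory scales linearly with bits per parameter. The sketch below uses the article's 1.6-trillion-parameter figure; the precision comparison points are nominal, and real deployments add overhead for quantization scales, activations, and the KV cache.

```python
def weight_memory_gb(params: float, bits_per_param: float) -> float:
    """Approximate weight-only memory footprint in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

PARAMS = 1.6e12  # 1.6 trillion parameters

bf16_gb = weight_memory_gb(PARAMS, 16)  # 3200 GB
fp8_gb = weight_memory_gb(PARAMS, 8)    # 1600 GB
nvfp4_gb = weight_memory_gb(PARAMS, 4)  # 800 GB
```

Halving bits per parameter halves the memory that must be streamed per decoded token, which is the main reason low-precision formats like NVFP4 raise achievable TPS on memory-bound decode workloads.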

Frequently asked questions

What is the performance of DeepSeek-V4-Pro on NVIDIA Blackwell Ultra?
DeepSeek-V4-Pro achieves over 150 tokens per second per user on NVIDIA Blackwell Ultra hardware. This high level of interactivity is available out of the box and is specifically optimized for agentic workflows, which require multiple steps of reasoning and tool use to complete complex tasks autonomously without constant human direction.
What is DeepSeek-V4-Pro?
DeepSeek-V4-Pro is a large-scale Mixture of Experts model featuring 1.6 trillion parameters and a massive one-million-token context window. It is designed for high-performance reasoning and agentic tasks, where the model must process vast amounts of information and execute multi-step actions to achieve goals independently rather than just responding to prompts.
How does NVIDIA plan to increase DeepSeek-V4-Pro inference speeds?
NVIDIA expects performance to improve beyond the initial 150 tokens per second by implementing several advanced technologies. These include Dynamo, an open-source inference operating system, and NVFP4, a specialized 4-bit floating-point data type. Advanced parallelization techniques through the vLLM project will also be used to further accelerate the model's total throughput.
Is DeepSeek-V4-Pro available for developers to use?
Yes, developers can begin building with DeepSeek-V4-Pro immediately. The model is supported through the vLLM inference framework and the LMSYS project. These platforms allow users to deploy the 1.6-trillion parameter model on compatible hardware to take advantage of its million-token context window and optimized performance for building autonomous AI agent applications.
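For developers starting from the vLLM route mentioned above, vLLM exposes an OpenAI-compatible HTTP server (launched with its `vllm serve` command). A minimal chat-completions request body might look like the sketch below; the model identifier is a placeholder assumption, not a confirmed published name.

```python
import json

# Minimal chat-completions payload for a vLLM OpenAI-compatible
# server. POST this to the server's /v1/chat/completions endpoint.
payload = {
    "model": "deepseek-ai/DeepSeek-V4-Pro",  # hypothetical model ID
    "messages": [
        {"role": "user", "content": "Plan the next step of this task."}
    ],
    "max_tokens": 256,
    "temperature": 0.2,
}

body = json.dumps(payload)
```

Because the server speaks the OpenAI wire format, existing agent frameworks and client SDKs can typically point at it by changing only the base URL and model name.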