vLLM Enables Efficient 1M Token Inference with DeepSeek V4 Day-0 Support

Agentic Coding
LLM
AI Research
Performance

vLLM, an inference engine for serving large language models, added Day-0 support for the DeepSeek V4 family. This includes the V4 Pro and V4 Flash models, which are Mixture-of-Experts (MoE) architectures (models that activate only a fraction of their parameters per token) supporting context windows of up to 1 million tokens.

This update lands alongside the DeepSeek V4 launch and addresses the hardware barriers of million-token context windows. The implementation uses compressed sparse attention to reduce the KV cache (memory used to store conversation history) to 10% of what previous generations required. This mirrors the broader shift toward specialized attention kernels for long-context serving.
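To get a rough sense of scale, a dense-attention KV cache grows linearly with context length. The sketch below estimates it for a 1-million-token context and then applies the 10% figure above. Every model dimension here (layer count, KV heads, head size, dtype) is a hypothetical placeholder, not DeepSeek V4's actual configuration.

```python
# Back-of-the-envelope KV cache estimate. All model dimensions below are
# hypothetical placeholders, NOT DeepSeek V4's real configuration.
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    # Factor of 2 covers the separate key and value tensors per layer.
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem

# Hypothetical configuration: 60 layers, 8 KV heads, head dim 128, fp16.
full = kv_cache_bytes(tokens=1_000_000, layers=60, kv_heads=8, head_dim=128)
compressed = int(full * 0.10)  # the ~10% figure reported above

print(f"dense: {full / 1e9:.1f} GB, compressed: {compressed / 1e9:.1f} GB")
# → dense: 245.8 GB, compressed: 24.6 GB
```

Under these assumed dimensions, the 10% reduction is the difference between a multi-node deployment and a single accelerator's worth of cache.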

You can now deploy these models on vLLM to process massive datasets, like entire codebases, without prohibitive compute costs. The V4 Pro model offers frontier-level performance, while V4 Flash provides a faster alternative. Both are available as open-weight models for immediate self-hosting or integration into agentic workflows.

Read the full update →

Frequently asked questions

What is DeepSeek V4?
DeepSeek V4 is a new generation of Mixture-of-Experts language models designed for massive context tasks. It includes two main versions: the flagship V4 Pro and the faster V4 Flash. Both models natively support context windows of up to 1 million tokens, making them suitable for processing large documents or entire codebases in a single interaction.
How does DeepSeek V4 handle 1 million tokens efficiently?
DeepSeek V4 uses a new attention mechanism that combines Compressed Sparse Attention and Heavily Compressed Attention. This architecture significantly reduces hardware requirements compared to previous models. For a 1 million token context, the V4 Pro model requires only 10 percent of the KV cache and 27 percent of the inference compute used by the earlier DeepSeek V3.
What are the differences between DeepSeek V4 Pro and V4 Flash?
DeepSeek V4 Pro is the larger flagship model with 1.6 trillion total parameters, activating 49 billion per token. DeepSeek V4 Flash is a smaller, more efficient version with 284 billion total parameters and 13 billion active parameters. While both support 1 million tokens, the Flash version is optimized for lower latency and significantly reduced serving costs.
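The MoE sparsity implied by these parameter counts can be computed directly. The numbers below are taken from the counts above; the percentages are simple ratios of active to total parameters.

```python
# Fraction of weights active per token for each variant, using the
# parameter counts stated above (total vs. active).
pro_total, pro_active = 1.6e12, 49e9      # V4 Pro: 1.6T total, 49B active
flash_total, flash_active = 284e9, 13e9   # V4 Flash: 284B total, 13B active

pro_fraction = pro_active / pro_total
flash_fraction = flash_active / flash_total

print(f"Pro: {pro_fraction:.1%} active, Flash: {flash_fraction:.1%} active")
# → Pro: 3.1% active, Flash: 4.6% active
```

Both variants run only a few percent of their weights per token, which is what lets total parameter counts grow without a proportional increase in per-token compute.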
Is DeepSeek V4 open source?
DeepSeek V4 is released as an open-weight model family, meaning the trained parameters are publicly available for download and self-hosting. Developers can run these models on their own infrastructure using inference engines like vLLM. This allows organizations to maintain data privacy and control while utilizing frontier-level long-context capabilities without relying solely on external API providers.
How does vLLM support DeepSeek V4?
vLLM provides Day-0 support for DeepSeek V4 by implementing the model's unique long-context attention mechanism directly into its high-throughput inference engine. This integration allows users to serve V4 Pro and Flash models with optimized memory efficiency. Alongside the code release, vLLM published a technical walkthrough explaining the first-principles implementation of this new attention architecture.