DeepSeek V4 built on NVIDIA Blackwell GPUs
DeepSeek has officially launched its fourth-generation flagship AI models, DeepSeek-V4-Pro and DeepSeek-V4-Flash, designed for highly efficient inference with context windows of up to one million tokens. The release marks a strategic shift toward complex agentic workflows, long-document analysis, and advanced coding tasks that require processing massive amounts of data in a single pass.

DeepSeek-V4-Pro is the larger of the two models, with 1.6 trillion total parameters and 49 billion active parameters, and is optimized for advanced reasoning and long-context agents. DeepSeek-V4-Flash is a more compact model, with 284 billion total parameters and 13 billion active parameters, targeting high-speed efficiency for chat, routing, and summarization. Both models are released under the MIT license and support a maximum output length of 384,000 tokens via the DeepSeek API.

Architecturally, the V4 family uses a modified Mixture of Experts design with a new hybrid attention system that combines several attention mechanisms to cut computational cost. Compared with its predecessor, DeepSeek-V3, the new architecture achieves a 73% reduction in per-token inference floating-point operations and a 90% reduction in key-value cache memory requirements (a back-of-envelope sizing sketch appears at the end of this article). These improvements address critical bottlenecks in agentic systems, which must maintain extensive context: system instructions, tool outputs, memory traces, and multi-step reasoning data.

NVIDIA has highlighted the strong synergy between these models and its Blackwell GPU platform. Out-of-the-box tests on the NVIDIA GB200 NVL72 system showed DeepSeek-V4-Pro delivering over 150 tokens per second per user. The Blackwell architecture provides the scale and low-latency performance needed to deploy trillion-parameter models at the frontier of intelligence, and NVIDIA continues to optimize this stack through tools like Dynamo, NVFP4, and advanced CUDA kernels.

Developers can integrate DeepSeek-V4 through several channels. The models are available on NVIDIA's build.nvidia.com platform via GPU-accelerated endpoints, allowing quick prototyping before moving to self-hosted deployments. They can also be downloaded and deployed as NVIDIA NIM microservices, which expose familiar API patterns for building long-context applications.

For more complex deployments, the models are compatible with leading open-source serving frameworks. SGLang offers recipes for low-latency, high-throughput, and prefill-decode disaggregation scenarios on both Blackwell and Hopper GPUs. vLLM provides single-node and multi-node serving options, with configurations that scale across more than 100 GPUs. These frameworks support features such as tool calling and speculative decoding alongside the models' reasoning capabilities.

By treating infrastructure strategy as part of model selection, enterprises can deploy these high-performance open models at competitive token costs. DeepSeek-V4 is positioned as a versatile foundation for the next generation of AI agents, giving organizations the flexibility to test new capabilities through open-source channels like Hugging Face or to deploy enterprise-grade solutions through NVIDIA's ecosystem. The sketches below illustrate the KV-cache math and the main integration paths.
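To see why a 90% key-value cache reduction matters at a one-million-token context, here is a back-of-envelope sizing sketch. The layer count, head count, and head dimension below are illustrative assumptions chosen for round numbers, not published DeepSeek-V4 dimensions:

```python
# Back-of-envelope KV-cache sizing for a long-context agent.
# All architecture numbers here are illustrative assumptions,
# not published DeepSeek-V4 dimensions.
layers = 60          # assumed transformer layer count
kv_heads = 8         # assumed key-value heads (e.g. grouped-query attention)
head_dim = 128       # assumed dimension per head
bytes_per_elem = 2   # FP16/BF16 storage
context = 1_000_000  # one-million-token context window

# Keys and values each store layers * kv_heads * head_dim elements per token.
kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * context
print(f"Baseline KV cache: {kv_bytes / 1e9:.1f} GB per sequence")

# The claimed 90% reduction from V4's hybrid attention:
print(f"After 90% reduction: {kv_bytes * 0.1 / 1e9:.1f} GB per sequence")
```

Under these assumptions, a full-attention cache would consume roughly 246 GB per sequence, which a 90% reduction brings down to about 25 GB, the difference between a single-node and a multi-node serving footprint.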
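For prototyping against the hosted endpoints, models on build.nvidia.com are served behind an OpenAI-compatible API, so a standard client should work. The model identifier below is hypothetical, and a self-hosted NIM microservice can typically be queried the same way by pointing base_url at the local service:

```python
from openai import OpenAI

# Hosted endpoint on build.nvidia.com; for a local NIM deployment,
# point base_url at the microservice instead (e.g. "http://localhost:8000/v1").
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="YOUR_NVIDIA_API_KEY",
)

# "deepseek-ai/deepseek-v4-pro" is a hypothetical model id for illustration.
response = client.chat.completions.create(
    model="deepseek-ai/deepseek-v4-pro",
    messages=[{"role": "user", "content": "Summarize the key risks in this contract: ..."}],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```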
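As a sketch of the SGLang path, the framework's offline Engine API can load a checkpoint directly for local experimentation; the repo id below is hypothetical, and the production recipes mentioned above configure a standalone server with Blackwell- or Hopper-specific flags instead:

```python
import sglang as sgl

# Hypothetical Hugging Face repo id; substitute the real checkpoint.
llm = sgl.Engine(model_path="deepseek-ai/DeepSeek-V4-Flash")

prompts = ["Route this support ticket to the right team: my GPU driver crashes on boot."]
sampling_params = {"temperature": 0.6, "max_new_tokens": 128}

outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])

llm.shutdown()
```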
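And a minimal vLLM sketch for single-node serving, again assuming a hypothetical Hugging Face repo id; the multi-node, 100-plus-GPU configurations additionally combine tensor and pipeline parallelism with a distributed runtime per the vLLM documentation:

```python
from vllm import LLM, SamplingParams

# Hypothetical repo id; tensor_parallel_size shards the model across 8 GPUs.
llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",
    tensor_parallel_size=8,
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Draft a one-paragraph summary of the attached filing: ..."], params)
print(outputs[0].outputs[0].text)
```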
