Details
- NVIDIA announced optimizations for DeepSeek-V4, a 1.6T-parameter model with a 1-million-token context window, designed for agentic workflows.
- Running on NVIDIA Blackwell Ultra, DeepSeek-V4-Pro achieves over 150 tokens per second (TPS) per user on interactive agentic tasks.
- DeepSeek-V4-Pro has 1.6T total parameters (49B active); DeepSeek-V4-Flash is a smaller 284B-total/13B-active variant for faster, more cost-effective deployment.
- Key innovations include token-wise compression attention and DeepSeek Sparse Attention (DSA), which enable efficient handling of the 1M-token context.
- NVIDIA's NVFP4 quantization and TensorRT-LLM on Blackwell enable up to 2.5x lower latency and 16x more users per GPU vs. H200.
- DeepSeek-V4 excels at agentic coding, math/STEM, and world knowledge, rivaling top closed-source models while remaining open-source.
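The details of DeepSeek Sparse Attention are not spelled out in this summary; the general idea behind this family of techniques is that each query attends only to a small, high-scoring subset of keys rather than all 1M context positions. A minimal, generic top-k sparse-attention sketch in plain Python (illustrative only, not DSA's actual selection mechanism):

```python
import math

def topk_sparse_attention(q, keys, values, topk):
    """Attend only to the top-k highest-scoring keys.

    A generic sparsity sketch: score every key, keep the top-k,
    and softmax only over that subset (all other positions get
    zero weight), so the weighted sum touches k values, not n.
    """
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
              for key in keys]
    # Indices of the top-k scores; everything else is masked out.
    idx = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:topk]
    # Numerically stable softmax restricted to the selected indices.
    m = max(scores[i] for i in idx)
    w = [math.exp(scores[i] - m) for i in idx]
    z = sum(w)
    # Weighted sum over only the selected values.
    out = [0.0] * len(values[0])
    for wi, i in zip(w, idx):
        for j, vj in enumerate(values[i]):
            out[j] += (wi / z) * vj
    return out

# With topk=1 the output is simply the value of the best-matching key.
out = topk_sparse_attention(
    q=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    values=[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    topk=1,
)  # -> [1.0, 0.0]
```

The practical payoff is that attention cost scales with k rather than with the full context length, which is what makes million-token windows tractable.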
Impact
NVIDIA's optimization of DeepSeek-V4 on Blackwell Ultra delivers over 150 TPS per user, leveraging NVFP4 for 2.5x latency reductions and 16x user capacity per GPU over the H100/H200 predecessors, pressuring rivals such as AMD's MI300X on AI inference efficiency. This widens Blackwell's lead in agentic workflows, lowering costs for hyperscalers deploying million-token MoE models and accelerating open-source adoption against closed competitors like OpenAI. Energy-efficiency gains of up to 50x on Blackwell Ultra support scalable, sustainable AI infrastructure amid rising compute demand.
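The "16x more users per GPU" figure translates directly into serving economics: if GPU-hour cost is fixed, cost per concurrent user falls in proportion to user density. A back-of-envelope sketch using the multipliers quoted above (the baseline user count and $/GPU-hour are hypothetical numbers chosen for illustration, not figures from the announcement):

```python
# Illustrative serving-cost arithmetic; all baselines are assumptions.
baseline_users_per_gpu = 4     # hypothetical H200 baseline
user_capacity_multiplier = 16  # "16x more users per GPU" vs. H200
gpu_hour_cost = 10.0           # hypothetical $/GPU-hour, same for both

blackwell_users_per_gpu = baseline_users_per_gpu * user_capacity_multiplier  # 64

# Cost per concurrent user = GPU-hour cost / concurrent users served.
cost_per_user_h200 = gpu_hour_cost / baseline_users_per_gpu       # 2.5
cost_per_user_blackwell = gpu_hour_cost / blackwell_users_per_gpu # 0.15625
```

Under these assumptions, per-user serving cost drops 16x regardless of the absolute dollar figures, which is why the density multiplier, more than raw TPS, drives the hyperscaler cost argument.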
