
NVIDIA Blackwell Boosts MoE Inference Performance with TensorRT-LLM, NVFP4, and MTP for Faster, More Efficient AI

NVIDIA has achieved significant performance advancements in inference for mixture-of-experts (MoE) models using its Blackwell architecture, demonstrating substantial gains in token throughput and efficiency. These improvements are driven by deep hardware-software co-design across GPUs, CPUs, networking, power delivery, cooling, and software stacks, enabling AI platforms to generate more tokens per watt and lowering the cost per million tokens.

A key highlight is the latest update to NVIDIA TensorRT-LLM, the open-source library for optimizing large language model inference. Running on the NVIDIA GB200 NVL72 rack-scale platform, which connects 72 Blackwell GPUs through fifth-generation NVLink and NVLink Switch chips with 1,800 GB/s of bidirectional bandwidth per GPU, these enhancements have boosted DeepSeek-R1 inference performance by up to 2.8x per GPU in just three months.

DeepSeek-R1, a 671-billion-parameter sparse MoE model that activates only 37 billion parameters per token, benefits greatly from this architecture. The NVL72 platform's high-bandwidth interconnects are well suited to MoE models, which require frequent communication between experts during token generation.

Additionally, the Blackwell GPU introduces hardware acceleration for NVFP4, a custom four-bit floating-point format designed by NVIDIA that maintains higher accuracy than other FP4 alternatives.

Further performance gains come from disaggregated serving, in which prefill runs on one set of GPUs and decode on another, leveraging the NVL72's scalable design and NVLink Switch technology. This enables superior throughput across various input/output sequence-length combinations, including 8K/1K and 1K/1K, with notable improvements in both throughput and interactivity.

On the HGX B200 platform, eight Blackwell GPUs linked via fifth-generation NVLink, two innovations drive the performance leap: multi-token prediction (MTP) and NVFP4 utilization.
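To make the NVFP4 idea concrete, here is a toy block-quantization sketch in NumPy. It shows the general shape of the technique (small blocks of 4-bit E2M1 values sharing a higher-precision scale), not NVIDIA's implementation: the block size, function names, and use of a plain float scale are illustrative assumptions, and real NVFP4 stores per-block FP8 scales handled in hardware and by TensorRT Model Optimizer.

```python
import numpy as np

# Magnitudes representable by an E2M1 (2 exponent bits, 1 mantissa bit)
# 4-bit float, the element format FP4 schemes are built on.
E2M1_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block, levels=E2M1_LEVELS):
    """Quantize one small block of values to signed E2M1 plus a shared scale.

    Toy sketch: hardware NVFP4 keeps the per-block scale itself in FP8 and
    adds a tensor-level scale, both omitted here for clarity.
    """
    amax = np.abs(block).max()
    scale = amax / levels[-1] if amax > 0 else 1.0
    scaled = block / scale
    # Snap each magnitude to the nearest representable E2M1 level.
    idx = np.abs(np.abs(scaled)[:, None] - levels[None, :]).argmin(axis=1)
    return np.sign(scaled) * levels[idx], scale

def dequantize_block(q, scale):
    return q * scale

weights = np.array([0.1, -0.4, 2.5, -6.0])
q, s = quantize_block(weights)
recovered = dequantize_block(q, s)  # close to weights, stored as 4-bit codes + one scale
```

Because the scale is chosen per block rather than per tensor, outliers in one block do not destroy the precision of every other block, which is the intuition behind FP4 formats retaining usable accuracy.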
MTP significantly increases throughput across all tested input/output configurations, while NVFP4 enhances compute efficiency without sacrificing accuracy. Together, these technologies allow the HGX B200 to achieve higher throughput at lower latency, enabling greater interactivity even in air-cooled deployments.

TensorRT-LLM's PyTorch-native architecture supports rapid experimentation and customization, empowering developers to optimize inference workflows. The full NVIDIA software stack, including TensorRT-LLM and TensorRT Model Optimizer, ensures NVFP4 is implemented efficiently and accurately.

These advancements underscore NVIDIA's commitment to continuous performance optimization across its data center platform. By combining next-generation hardware with iterative software improvements, NVIDIA is delivering higher value from existing GPU infrastructure, helping cloud service providers, model builders, and enterprises extend the life and utility of their AI systems. The result is a more efficient, scalable, and cost-effective AI ecosystem, capable of serving more users with faster, higher-quality responses.

For detailed performance metrics, including benchmarks across different model sizes and configurations, visit the NVIDIA Data Center Deep Learning Product Performance page.
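The reason MTP raises throughput is that the model proposes several tokens per forward pass and then has them verified, similar in spirit to speculative decoding: accepted proposals cost one step instead of many. The sketch below shows only that accept/reject loop; the helper names (`draft_step`, `verify`) and the toy "counting" model are illustrative assumptions, and TensorRT-LLM's actual MTP implementation differs in detail.

```python
def mtp_decode(draft_step, verify, prompt, num_tokens, k=4):
    """Toy multi-token-prediction decode loop.

    draft_step(seq) -> up to k cheaply proposed next tokens
    verify(seq, proposals) -> (number of proposals accepted,
                               corrected token for the first mismatch or None)
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < num_tokens:
        proposals = draft_step(seq)[:k]
        accepted, correction = verify(seq, proposals)
        seq.extend(proposals[:accepted])
        if accepted < len(proposals):
            seq.append(correction)  # the verifier's token guarantees progress
    return seq[:len(prompt) + num_tokens]

# Toy "true model": the correct continuation just counts upward.
def true_next(seq):
    return seq[-1] + 1

def draft_step(seq):
    # A deliberately imperfect drafter: its 3rd proposal is always wrong.
    out, cur = [], seq[-1]
    for i in range(4):
        cur += 1
        out.append(cur if i != 2 else cur + 100)
    return out

def verify(seq, proposals):
    cur = list(seq)
    for i, tok in enumerate(proposals):
        if tok != true_next(cur):
            return i, true_next(cur)  # accept i tokens, emit one correction
        cur.append(tok)
    return len(proposals), None

result = mtp_decode(draft_step, verify, [0], 6)
```

Even with a drafter that misses every third token, each loop iteration emits roughly three tokens instead of one, which is where the interactivity gains come from.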
